OCT Retinal and Choroidal Layer Instance Segmentation Using Mask R-CNN

Optical coherence tomography (OCT) of the posterior segment of the eye provides high-resolution cross-sectional images that allow visualization of the individual layers of the posterior eye tissue (the retina and choroid), facilitating the diagnosis and monitoring of ocular diseases and abnormalities. The manual analysis of retinal OCT images is a time-consuming task; therefore, the development of automatic image analysis methods is important for both research and clinical applications. In recent years, deep learning methods have emerged as an alternative approach to this segmentation task. A large number of the segmentation methods proposed in the literature focus on encoder–decoder architectures, such as U-Net, while other architectural modalities have not received as much attention. In this study, the application of an instance segmentation method based on a region proposal architecture, the Mask R-CNN, is explored in depth in the context of retinal OCT image segmentation. The importance of adequate hyper-parameter selection is examined, and the performance is compared with commonly used techniques. The Mask R-CNN provides a suitable method for the segmentation of OCT images, with low boundary errors and high Dice coefficients, and with segmentation performance comparable to the commonly used U-Net method. The Mask R-CNN has the advantage of simpler extraction of the boundary positions, avoiding in particular the need for a time-consuming graph-search method to extract boundaries, which reduces the inference time by 2.5 times compared to the U-Net when segmenting seven retinal layers.


Introduction
Optical coherence tomography (OCT) imaging has become the standard clinical tool to image the posterior segment of the eye (i.e., the retina and choroid), since it provides fundamental information regarding the health of the eye [1]. The detailed high-resolution images allow clinicians and researchers to visualize the individual tissue layers of the posterior eye. These images are used to diagnose and monitor ocular diseases and abnormalities. Analysis of structural changes (thickness or area metrics) is commonly used as a surrogate of health status or disease progression [2]. In order to extract these metrics, the tissue boundary positions first need to be segmented. Manual labelling of these boundaries requires experts to segment the areas of interest, which is a time-consuming and subjective process, potentially prone to bias and errors [3][4][5].
Thus, the development of automatic methods of OCT image analysis (segmentation, classification) is fundamental to extract quantitative data from these medical images in a rapid and precise manner. Deep learning (DL), a sub-field of machine learning (ML), represents a new method to analyse these images with tools such as convolutional neural networks (CNNs). Overall, CNN-based methods have achieved remarkable performance in medical and natural image segmentation and are becoming the state of the art [6]. In retinal OCT image analysis, they have been used for a large range of applications.
This study proposes the application and evaluation of a DL approach for the segmentation of OCT images of the posterior segment of the eye, incorporating a Mask R-CNN region proposal architecture, for instance segmentation of posterior segment retinal and choroidal layers using OCT images from healthy eyes, and comparing results with benchmark encoder–decoder methods. The contributions of this study are as follows:
• Evaluate the use of a region proposal architecture, based on the application of a Mask R-CNN method, as a new DL approach for retinal and choroidal layer instance segmentation using a large dataset of OCT images from healthy participants.
• Provide an end-to-end DL framework which, unlike most previously proposed methods, eliminates the need for additional post-processing steps to extract boundary positions, reducing the inference time for retinal and choroidal layer boundary segmentation.
• Cover some of the basic considerations on the use of the region proposal architecture for OCT image segmentation and explore some of the key network hyper-parameters to be considered to improve performance when using Mask R-CNN for this task.
This document is organized as follows: Section 2 presents more detail about the network and considered experiments. Section 3 presents the results and comparison with other methods together with a discussion of the main findings. Finally, Section 4 presents concluding remarks and suggests future work.

Data
A dataset consisting of spectral domain (SD) OCT images from healthy eyes was used for this study, with the data collection methods and population details described extensively elsewhere [46]; only a brief summary is provided here for completeness. The data were captured using the Spectralis OCT instrument (Heidelberg Engineering, Heidelberg, Germany) from healthy participants (single instrument dataset). The original dataset contains images from over 100 children aged between 10 and 15 years (mean age, 13.1 ± 1.4 years), and each subject had data captured at four different visits, six months apart over an 18-month period. For each visit, two sets of six radial B-scans were captured on the same eye. To explore different image analysis tasks, two aspects of this dataset were used for the experiments: one included the retinal and intra-retinal layer segmentation, while the second included the global segmentation of the choroid and retinal layer structures. For the entire dataset, all boundaries were segmented automatically following previously published work [47,48], after which an experienced trained observer reviewed and manually corrected any boundaries if needed. The observer's manual segmentation has shown excellent repeatability, even for the complex task of choroidal segmentation, as reported in [46,49,50].
In the first analysis of the dataset, which was used to examine the intra-retinal layer segmentation, each image has seven retinal boundaries. The boundaries of interest can be visualized in Figure 2a. From the top of the image to the bottom, the layers include the internal limiting membrane (ILM), retinal nerve fibre layer (NFL), inner plexiform layer (IPL), outer plexiform layer (OPL), external limiting membrane (ELM), photoreceptor inner segment/outer segment junction layer (ISOS), and retinal pigment epithelium (RPE). For this experiment, the dataset was adapted using a total of 2135 OCT images obtained from 98 subjects belonging to visits 1 and 4 of the dataset. These images were split into three groups: a training set consisting of 1293 images belonging to 59 subjects; a validation set consisting of 429 images belonging to 20 subjects; and an evaluation set consisting of 413 images belonging to 19 subjects.
The second analysis of the dataset, which was used to explore the total retina and choroid segmentation task, contained the boundary information for three different layers. The boundaries can be visualized in Figure 2c. From the top of the image to the bottom, these include the ILM and RPE boundary layers, as well as the choroid scleral interface (CSI). The full retina is encompassed between the ILM and RPE, and the choroid lies between the RPE and CSI boundaries. Similar to the first dataset, the images were separated into three groups: a training set with 1311 images belonging to 60 subjects, a validation set with 431 images belonging to 21 subjects, and an evaluation set with 406 images belonging to 18 subjects. All OCT images used in both datasets were taken from the first and last subject visits of the main dataset.
For both datasets, the data were separated to ensure different subjects were allocated to each set, thus avoiding mixing information from the same subject into two different sets. This ensures a fair assessment of performance when evaluating the model, since only OCT images from new 'unseen' subjects are assessed. Before analysis, the dataset was converted to the COCO dataset [51] format, a structure that allows it to be easily processed by the Mask R-CNN algorithm. Figure 2 shows examples of the dataset and annotated masks, showing the defined masks used for the first (Figure 2b) and second (Figure 2d) analyses of the dataset.
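As an illustration of the conversion step, a minimal COCO-style annotation for a single binary layer mask could be built as follows. This is a hedged sketch: the field names follow the COCO format, but the `coco_annotation` helper is hypothetical and the polygon/RLE "segmentation" field is omitted for brevity.

```python
import numpy as np

def coco_annotation(mask, image_id, ann_id, category_id):
    """Build a minimal COCO-style annotation dict for one binary layer mask.

    The bounding box uses the COCO [x, y, width, height] convention; the
    polygon/RLE "segmentation" field is omitted here for brevity.
    """
    rows = np.any(mask, axis=1).nonzero()[0]
    cols = np.any(mask, axis=0).nonzero()[0]
    x, y = int(cols[0]), int(rows[0])
    w = int(cols[-1] - cols[0] + 1)
    h = int(rows[-1] - rows[0] + 1)
    return {"id": ann_id, "image_id": image_id, "category_id": category_id,
            "bbox": [x, y, w, h], "area": int(mask.sum()), "iscrowd": 0}

# Example: a layer occupying rows 100-131 across the full 512-pixel crop width
mask = np.zeros((512, 512), dtype=bool)
mask[100:132, :] = True
ann = coco_annotation(mask, image_id=1, ann_id=1, category_id=3)
print(ann["bbox"])  # -> [0, 100, 512, 32]
```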

Training
For the purpose of this experiment, the hyper-parameters were set following the default values defined in the Mask R-CNN study [41], with some modifications. For training, a maximum of 1000 training epochs was defined with automatic early stopping, which stopped the training when the model loss did not decrease for 75 consecutive epochs. Images were resized using a custom crop mode called "specific_crop", which applies a 512 × 512 pixel crop to the original OCT image. Given the dimensions of the OCT image (1534 × 512 pixels), five fixed column-wise positions inside the image were defined (at pixels 0, 256, 512, 768, and 1022); these positions span the horizontal (transversal) dimension of the image to ensure the entire image is covered (fed to the network). The cropped sections have an overlap of 50% between adjacent sections. This option was preferred over random coordinates, which may not ensure full image coverage. This configuration gives the square image dimensions needed by the model and ensures that the model receives crops including all parts of the retina from the full OCT images, and thus a more diverse training dataset.
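The "specific_crop" scheme described above can be sketched as follows. This is a minimal illustration using the fixed offsets from the text; the `specific_crop` helper name and the NumPy slicing are assumptions, not the study's implementation.

```python
import numpy as np

CROP_SIZE = 512
# Fixed column offsets from the text: five 512-pixel-wide crops over the
# 1534-pixel transversal dimension, with ~50% overlap between adjacent crops.
CROP_OFFSETS = [0, 256, 512, 768, 1022]

def specific_crop(bscan: np.ndarray) -> list:
    """Split a 512 x 1534 OCT B-scan into five 512 x 512 crops."""
    return [bscan[:, x:x + CROP_SIZE] for x in CROP_OFFSETS]

bscan = np.arange(512 * 1534, dtype=np.float32).reshape(512, 1534)
crops = specific_crop(bscan)
print(len(crops), crops[0].shape)  # -> 5 (512, 512)
```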
The backbone CNN architecture used in this experiment corresponds to a ResNet50 with feature pyramid network (ResNet50-FPN) architecture, as proposed in [41]. This architecture combines the features extracted by the ResNet50 convolutional layers at different scales using the feature pyramid network method proposed in [52]. The total number of parameters for the Mask R-CNN model is 44,698,312. Unlike other datasets to which this segmentation method can be applied, the OCT regions to be segmented do not vary greatly in size and aspect ratio between the objects to be detected when generating the bounding boxes. In other words, the anatomical dimensions of the retina provide a 'natural' constraint on the segmentation problem, so modification of the size and ratio parameters was investigated to understand their impact on performance.
The RPN anchors were modified by defining only wide anchors that fit the specific retinal layer dimensions in the cropped OCT images. To understand the range of appropriate values, the ratios (width/height of the box dimensions that fit each layer) of every retinal and choroid layer to be segmented in the cropped OCT images were analysed using the available data. Figure 3 shows the graphical analysis, where each histogram shows the frequency of the ratios per layer. The peak of each histogram is at a different location, ranging between 4 and 16, with 8 representing the most common ratio between the individual retinal layers. Due to the dimension reduction of the images after each convolutional block, four anchor ratios were defined for the first version of the dataset (16, 32, 64, 128), to cover the feature maps obtained during the feature extraction process of the backbone. For the second version of the dataset, after the same ratio analysis was performed, narrower ratios were used due to the sizes of the full retina and choroid tissues present in the OCT images (2, 4, 6, 8, 10). With the anchor ratios defined, five anchor scales were set (8, 16, 32, 64, 128) to fill the 512-pixel-wide crops when using each ratio for the region proposals; these scales were kept the same for both datasets. The number of ROIs per image fed to the mask prediction branch was set to 200. For each training run of the model, data augmentation was used by applying a rotation of −10/+10 degrees to each cropped image. All the segmentation methods were implemented in Python using the Keras and TensorFlow (as the back end) modules, both of which are open-source software libraries for deep learning. The training process was performed on an Intel i7-9800X 3.80 GHz processor and an Nvidia TITAN Xp GPU.
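The ratio analysis described above amounts to measuring the width/height ratio of a tight bounding box around each layer mask. A minimal sketch (the `mask_bbox_ratio` helper is illustrative, not from the study):

```python
import numpy as np

def mask_bbox_ratio(mask: np.ndarray) -> float:
    """Width/height ratio of the tight bounding box around a binary layer mask."""
    rows = np.any(mask, axis=1).nonzero()[0]
    cols = np.any(mask, axis=0).nonzero()[0]
    height = rows[-1] - rows[0] + 1
    width = cols[-1] - cols[0] + 1
    return width / height

# e.g. a thin layer spanning the full 512-pixel crop width, 32 pixels tall
mask = np.zeros((512, 512), dtype=bool)
mask[100:132, :] = True
print(mask_bbox_ratio(mask))  # -> 16.0
```

Collecting this ratio for every layer in every cropped image yields the per-layer histograms shown in Figure 3, from which the anchor ratio hyper-parameters were chosen.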
Across the study, all reported values represent the mean of 4 independent network runs, to ensure the proper assessment of the network repeatability and stability.
Figure 3. Retinal and choroid layer ratio (width/height) histograms for the dataset. Ratio is calculated from a frame (i.e., bounding box) which covers the maximum horizontal (width) and vertical (height) distances of each layer belonging to the raw OCT images, without any pre-processing or flattening of the retinal area. The plots show the distribution of the ratios for each layer of the dataset and informed the selection of the anchor ratios hyper-parameter.

Defined Tests
One of the appealing properties of DL models is their capability of being initialized with pre-trained parameters. A "pre-trained" model has been trained with different types of images, from which it has learned features that can benefit the new image analysis task. In general, the use of pre-trained models tends to provide an improvement in performance compared to models initialized with random weights. Additionally, a pre-trained model may require less training data to obtain good performance. Thus, pre-training allows a reduction of computation time when training the model, in addition to providing a base on which it will begin its learning, improving its generalization ability [53].
To understand how pre-training affects retinal layer segmentation performance, three types of simulations were conducted:
• Scratch, training a Mask R-CNN model from scratch without any previously learned features, thus with randomly initialized network parameters.
• COCO, initializing the Mask R-CNN model with the features learned using the COCO dataset [51].
• ImageNet, initializing the Mask R-CNN model with the features learned using the ImageNet dataset [54].
The use of previously learned features to initialize the model (pre-trained model) also allows for different levels of fine-tuning, adjusting the parameters within the feature extraction layers and adapting them to better solve the segmentation problem for the dataset (OCT retinal images). Different convolutional blocks (i.e., levels of the network) can be fine-tuned for different levels of abstraction, so additional simulations were defined for the pre-trained models, with the backbone divided into four convolutional blocks. Three configurations were tested: training all convolutional blocks (called all), the last two convolutional blocks of the backbone (called 3+), and the last convolutional block of the backbone (called 4+). After this test, the best performing model provides knowledge regarding which learned features are best for initializing the model to learn the specific features for segmentation of retinal layers.
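The three fine-tuning settings can be summarized as a mapping from setting name to the backbone blocks that remain trainable. This is a schematic sketch: the real implementation selects layers inside the network, and the `fine_tune_flags` helper and block naming here are illustrative.

```python
BLOCKS = ["block1", "block2", "block3", "block4"]  # four backbone conv blocks

def fine_tune_flags(mode: str) -> dict:
    """Trainable flag per backbone block for the 'all', '3+' and '4+' settings."""
    first_trainable = {"all": 1, "3+": 3, "4+": 4}[mode]
    return {b: i + 1 >= first_trainable for i, b in enumerate(BLOCKS)}

print(fine_tune_flags("3+"))
# -> {'block1': False, 'block2': False, 'block3': True, 'block4': True}
```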
Finally, to put the findings of the Mask R-CNN into perspective, the performance was compared with results obtained using a state-of-the-art U-Net approach, described as the "standard U-Net" in [23], which has been used previously for OCT retinal segmentation [7,15,36] and contains a total of 489,160 parameters. Additionally, a pretrained fully convolutional network (FCN) [55] and a DeeplabV3 [56] based on a ResNet50 architecture, both adapted from [57], were used to compare results against commonly used models with pretrained weights. The FCN model contains 25,557,032 parameters and the DeeplabV3 39,635,014 parameters. The OCT images were not flattened or pre-processed prior to the segmentation tasks.

Performance Evaluation
To evaluate the method's performance, two types of metrics were used: segmentation mask overlap and boundary error. The segmentation mask overlap metric, specifically the Dice overlap coefficient, has been used to report performance in different Mask R-CNN studies, while the mean (signed and absolute) boundary error is a more clinically relevant metric for assessing OCT segmentation performance, since the boundary error is closely related to the thickness measurements commonly used for the quantification and clinical interpretation of OCT images. Similar to the training step, during evaluation each image was cropped into five crops using the "specific_crop" method and each crop was segmented. After this, to allow a fair comparison with other methods, which usually work on the entire OCT image, the five image crops with their corresponding segmentation information were fused to obtain the results for the whole OCT image. This is illustrated in Figure 4, where crops A, B, and C were used as main crops, and D and E as auxiliary crops. The auxiliary crops are used to replace the border pixels of the main crops where the segmentation was not complete. This was done by cropping the necessary pixels from the auxiliary crops and replacing them in the main crops, rebuilding the OCT image with the complete image segmentation results. For the analysis of the results with the different methods, each segmentation output was post-processed to remove the border information of the full OCT image: specifically, 100 pixels from the left side of the image were excluded due to the poor quality (low-contrast tissue information) sometimes evident in this region. Similarly, on the right side of the image, 300 pixels were removed to mitigate the presence of the optic nerve head, an anatomical landmark that does not contain the boundaries of interest.
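The fusion of the five crops back into a full-width segmentation can be sketched as follows. This is a simplified illustration: the seam half-width `PATCH` and the `fuse_crops` helper are assumptions, and the study's exact blending around the crop borders may differ.

```python
import numpy as np

CROP = 512
MAIN_OFFSETS = [0, 512, 1022]   # main crops A, B, C
AUX_OFFSETS = [256, 768]        # auxiliary crops D, E
PATCH = 32                      # seam half-width patched from auxiliary crops

def fuse_crops(seg: dict, width: int = 1534) -> np.ndarray:
    """Rebuild a full-width label map from per-crop segmentation outputs."""
    full = np.zeros((CROP, width), dtype=seg[0].dtype)
    for x in MAIN_OFFSETS:          # paste the main crops side by side
        full[:, x:x + CROP] = seg[x]
    for x in AUX_OFFSETS:           # patch the seams from the auxiliary crops
        centre = x + CROP // 2      # auxiliary crop centre in image coordinates
        full[:, centre - PATCH:centre + PATCH] = \
            seg[x][:, CROP // 2 - PATCH:CROP // 2 + PATCH]
    return full

# Toy example: each crop's output filled with a distinct label (1..5)
seg = {x: np.full((CROP, CROP), i + 1, dtype=np.uint8)
       for i, x in enumerate(MAIN_OFFSETS + AUX_OFFSETS)}
full = fuse_crops(seg)
print(full.shape)  # -> (512, 1534)
```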
For the first performance analysis, the segmentation output of each network was analysed by calculating the Dice coefficient corresponding to the overlap of each segmented retinal layer with its annotations. Then, for the boundary error analysis, the boundaries for the Mask R-CNN and FCN methods were directly obtained from the same segmentation outputs by taking the upper and lower pixels belonging to each retinal layer and, for internal layers with common boundaries, calculating the middle point between adjacent masks. For the U-Net and DeepLabV3 analyses, the boundaries were obtained using a graph-search method [23]; for these two methods, direct boundary extraction yielded high boundary errors, hence the difference in boundary post-processing techniques. These data were used to calculate the mean signed and absolute boundary errors, and the results between methods were compared.
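The evaluation steps above (Dice overlap, direct boundary extraction from the masks, and the signed/absolute boundary errors) can be sketched with the following helpers. The names are illustrative; the sketch assumes binary masks in which every column contains the layer.

```python
import numpy as np

def dice(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice overlap coefficient between two binary masks."""
    inter = np.logical_and(pred, truth).sum()
    return 2.0 * inter / (pred.sum() + truth.sum())

def top_bottom(mask: np.ndarray):
    """Per-column row indices of the first and last mask pixel."""
    top = np.argmax(mask, axis=0)
    bottom = mask.shape[0] - 1 - np.argmax(mask[::-1, :], axis=0)
    return top, bottom

def shared_boundary(upper: np.ndarray, lower: np.ndarray) -> np.ndarray:
    """Midpoint between the bottom of one layer mask and the top of the next."""
    _, upper_bottom = top_bottom(upper)
    lower_top, _ = top_bottom(lower)
    return (upper_bottom + lower_top) / 2.0

def boundary_errors(pred_b: np.ndarray, true_b: np.ndarray):
    """Mean signed and mean absolute boundary error, in pixels."""
    diff = pred_b.astype(float) - true_b.astype(float)
    return diff.mean(), np.abs(diff).mean()
```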

Retinal Layers Analysis
Initially, the performance comparison of different model initializations was carried out, as this provides the optimal model to be later used for comparison with the benchmark methods. Table 1 shows the obtained results (Dice coefficient for each retinal layer) for the Mask R-CNN model trained from scratch and the pre-trained models (COCO and ImageNet) used for initialization, as well as the comparison of the different fine-tuned blocks. A noticeable difference in performance for each initialization of the Mask R-CNN model can be observed. Using a model trained from scratch with randomly initialized parameters provides the worst segmentation performance when compared to using a model that has the same hyper-parameters but pre-loaded weight values. This means that using randomly initialized parameters may require a more detailed analysis of the hyper-parameters, a larger volume of data, and/or longer training to obtain comparable performance. However, the problem is solved when using the pre-trained models, which give better performance (i.e., higher Dice metrics for each individual retinal layer and overall). When comparing the different pre-trained models, the overall Dice values are close to each other, but the best performance is obtained when using the COCO dataset, with an overall Dice coefficient of 94.13% across all considered layers. This indicates that the COCO dataset presents features that may be more suitable for the segmentation of retinal OCT images. Moreover, the best results belong to the tuning of the last two convolutional blocks of the backbone (3+), indicating that the low-level features (e.g., ridges, lines, spots) of the pre-trained model contribute to improved performance and only the high-level features need to be updated.
After obtaining the best Mask R-CNN initialization for retinal layer segmentation, the effect of the anchor ratios on the network performance was investigated. The network's default configuration for anchor ratios and sizes was compared to the configuration defined after the analysis of the mask sizes (Figure 3). Table 2 presents the results of this analysis, using Dice coefficients to compare performance between the default anchor ratios (0.5, 1, 2) and anchor sizes (32, 64, 128, 256, 512) and the ratios and sizes proposed in Section 2. When comparing the results, a substantial difference in segmentation performance can be observed, demonstrating the importance of performing a detailed analysis of the sizes and ratios of the objects to be detected before performing the segmentation. The overall performance for the network default ratios is 86.71%, versus 94.13% for the custom-selected ratios. The biggest difference corresponds to the ISOS-RPE layer, with a Dice coefficient of 54.83% for the default anchors (a difference of close to 40%), showing the importance of adequate selection of the hyper-parameters of the Mask R-CNN method. A detailed analysis of the results showed that low performance was associated with retinal layers that were not detected and thus not segmented.
Once the best initialization of the model and the highest performing anchor parameters were determined, an optimal Mask R-CNN was selected. The obtained results using the Mask R-CNN method show good performance when segmenting each retinal layer, with an overall Dice coefficient of 94.13% and individual values ranging from 90.91% to 96.11%, demonstrating that the Mask R-CNN can be applied to the segmentation of retinal OCT images. To put these results into perspective, the U-Net, FCN, and DeeplabV3 segmentation methods were used to perform the same segmentation task, using the same training and testing data. Table 3 shows the obtained results in terms of the Dice coefficient for each retinal layer. This table compares the performance of the four methods for each retinal layer and the overall retina, before applying the post-processing step for each method to extract the boundaries. A boundary error analysis was also performed. It is worth noting that boundary metrics, and the subsequently derived thickness metrics, are more clinically relevant metrics used to evaluate tissue health status; thus, they may be considered potentially more relevant for assessing the performance of the methods. The corresponding retinal boundaries for the four methods were obtained from the segmented masks following the post-processing steps previously described (direct extraction for the Mask R-CNN, FCN, and DeeplabV3, and graph-search for the U-Net). For the boundary error analysis, mean signed and unsigned error results were obtained. Table 4 shows the performance of all four methods focused on the segmentation of the boundaries.
Boundaries obtained from the Mask R-CNN segmented masks show an average mean signed error of −0.087 pixels and a mean absolute error of 0.784 pixels, reinforcing the capability of this method to perform retinal layer segmentation with a good level of performance. It is worth noting that the boundaries for the Mask R-CNN, FCN, and DeeplabV3 were obtained without the use of any post-processing techniques, unlike the graph-search method used for the U-Net.
The boundary error results show close performance between methods. When assessing the average mean absolute error, the range of values between methods varies by only 0.2 pixels, demonstrating close overall agreement (similar segmentation performance).
A graphical comparison of the boundary, similar to the masks analysis, is shown in Figure 6.

Full Retina/Choroid Analysis
Similar to the previous section, the best initialization and hyper-parameters were set for training the Mask R-CNN method. Here, the second version of the dataset was used to analyse performance when segmenting the full retina and choroid regions of the OCT images. Table 5 compares the Dice coefficient results for the segmentation of the full retina and choroid mask layers between the Mask R-CNN, U-Net, FCN, and DeeplabV3 methods. When comparing the results obtained for the full retina and choroid regions, similar to the results obtained for the segmentation of the different retinal layers, the U-Net method presents slightly superior performance in terms of the Dice coefficient. However, the results show close agreement. With respect to the U-Net, differences of 2.29% and 1.28% in the retinal and choroidal regions were found compared to the Mask R-CNN, and less than 1% in both areas compared to the FCN and DeeplabV3 methods. When analysing the graphical results of each method (Figure 7), it is worth noting that the pixelation problem around the foveal region evident in the retinal layer analysis does not affect the results of the full retinal region segmentation. The same post-processing steps were performed for this segmentation approach to obtain the boundaries of interest: extracting the ILM, RPE, and CSI boundaries from the segmentation outputs of the four methods. Table 6 shows the boundary error analysis between the methods for the segmentation of boundaries after the post-processing steps. When analysing the boundary error results for the ILM, RPE, and CSI layer boundaries, higher errors are found when using the Mask R-CNN method. From the graphical analysis shown in Figure 7, minimal difference is observed between methods, but it is possible to note "softer" curves when using the U-Net method for the full retina region segmentation.
In addition to the Dice and boundary error analyses, a number of segmentation metrics were calculated and analysed to further compare the performance between methods. Table 7 shows the overall performance comparison obtained for the retinal layers and retina/choroid segmentation. From these results, the pixel accuracy, precision, recall, and specificity showed comparable values between methods, with only the recall showing a larger difference of 4.02% between the Mask R-CNN and U-Net methods. It is worth noting that the DeeplabV3 provides the most consistent performance across both datasets when compared to the other methods, yet the boundary error metrics for the DeeplabV3 method do not show the same level of superior performance. The detailed performance obtained for these metrics is shown in Appendix Tables A1–A8.

Cross-Dataset Analysis
A final analysis was performed to assess the method's performance on a different dataset, with a particular interest in assessing cross-dataset performance, thus providing a better understanding of the generalization of the model [58]. For this purpose, a publicly available age-related macular degeneration (AMD) dataset [59] was used.
The original AMD dataset contains images from 269 patients with AMD, and each patient had a set of volumetric scans composed of 100 OCT images (512 × 512 pixels). The scans were captured using the Bioptigen SD-OCT instrument, which is a different OCT instrument from the one used in the previous dataset (Spectralis SD-OCT). Each image in this AMD dataset corresponds to a single capture, without any image averaging techniques, thus presenting more speckle noise than the original dataset (O.D.). In order to compare performance with the O.D., two boundaries were used, corresponding to the ILM and RPE, which together provide the full retinal tissue thickness as shown in Figure 2. An initial test was performed with models trained on the O.D. (healthy dataset imaged with the Spectralis instrument) and tested on the AMD dataset (pathology dataset imaged with the Bioptigen instrument). However, none of the networks was able to properly identify the classes of interest in this setting. Given the obvious differences between the dataset characteristics (presence of noise, image appearance, and pathology), this is not an unexpected result.
To improve the segmentation performance, a small portion of images from the AMD dataset was added to the training of each model, adding a total of 756 images from 28 subjects (different from the images used for testing) to the original O.D. training dataset, thus providing the models an opportunity to learn from both datasets. Table 8 describes the Dice coefficient obtained when segmenting the retinal area, while Table 9 presents the boundary errors from the segmentation outputs. Some marginal improvements can be observed in some of the metrics for some networks; this could be associated with the noise in the AMD dataset, which could act as a regularizer during network training. When using the same training dataset (O.D. + AMD) but testing exclusively on the AMD dataset, all models present good performance, with Dice and boundary error metrics comparable to those on the O.D. However, the Mask R-CNN method seems to provide lower performance compared to the other models. This could be attributed to the dimensions of the AMD dataset images (512 × 512 pixels), which do not allow the full segmentation processing performed for the O.D., which blends the segmentation results from different crops (as shown in Figure 4).
Table 9. Cross-dataset analysis on the full retinal tissue segmentation results, presenting the mean (standard deviation) signed and absolute boundary errors (in pixels). Three different training/testing combinations were included for each model, including:

Discussion
When evaluating the use of the Mask R-CNN region proposal architecture, the pre-trained models improve the segmentation results by providing a basis on which the model can be trained using the most relevant features for OCT layer segmentation. Similar to the Mask R-CNN, the FCN and DeeplabV3 methods use pre-trained weights. However, in this study the proposed Mask R-CNN model provided lower performance than these other architectures; it is possible that the large number of network parameters, or the initial object detection step included in the Mask R-CNN, underlies this difference in performance. In contrast, the U-Net model is not normally used with pre-trained weights. Most previous OCT segmentation studies have used a U-Net trained with randomly initialized parameters, which is the same approach used in our current work. The U-Net architecture was proposed with the objective of performing well even when the model is randomly initialized and trained with small quantities of medical images [27]. Given the larger number of parameters of the Mask R-CNN network, the benefit seen from using a pre-trained model is not surprising.
One of the hyper-parameters found to be particularly important when using the Mask R-CNN method corresponds to the sizes and ratios of the anchors, with changes to these parameters resulting in a significant improvement in performance. The use of appropriate anchor sizes and ratios allows the trained model to detect objects that belong to those specific sizes and ratios in order to be later segmented.
When comparing with the U-Net, the segmentation results showed slightly better performance for the U-Net method, with differences in the overall Dice coefficient of 2.45% and 1.80% for the retinal layer and choroidal datasets, respectively. Interestingly, when evaluating the boundary error, which is a more clinically relevant metric, the difference between methods is minimal, with differences in the overall mean absolute error of 0.12 and 0.15 pixels for the retinal layer and choroidal datasets, respectively. It is worth noting that the retinal layers are small in area, so changes of a few pixels may have a large impact on the Dice metric, yet these differences are less significant when assessing the boundaries. Similarly, when performing the analysis of the full retina versus the choroid region, higher Dice coefficients were obtained for the retina for all methods, while the boundary error metrics showed little to no difference between methods for segmenting the retina and choroid.
Graphical analysis of the segmentation results also showed that the Mask R-CNN, FCN, and DeeplabV3 methods can produce pixelated output around the more curved zone of the foveal pit region. Outside the centre of the image, which has minimal curvature throughout the dataset, these methods provided smoother segmentation results. Interestingly, this effect was not observed in the second dataset, so the interaction between the retinal layers may play a role, and a post-processing step may be needed to smooth the boundary profile. Comparison with the U-Net results demonstrated one of the advantages of the Mask R-CNN instance segmentation method. The U-Net is not capable of differentiating two separately segmented regions of the same class, as shown in Figure 5, where a portion inside the ILM-NFL layer is segmented as the ISOS-RPE class by the U-Net. In contrast, instance segmentation allows the Mask R-CNN to distinguish multiple instances of the same segmented class (retinal layer) and to select the one in the correct position. This selection process may be further improved by enforcing anatomically correct segmentation, since the Mask R-CNN can incorporate information on how classes (retinal layers) are connected to each other; this relative positioning of the retinal layers can be used during training via the "keypoints" feature [51].
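A minimal sketch of such an instance selection step is given below. The selection rule shown (keep the instance with the highest detection score for each class) and the labels, scores, and masks are hypothetical illustrations, not the study's implementation, which may also use positional information.

```python
def select_instances(detections):
    """Keep, for each class label, the detection with the highest score.

    `detections` is a list of dicts with 'label', 'score', and 'mask'
    entries, resembling the output of a typical Mask R-CNN inference step.
    """
    best = {}
    for det in detections:
        label = det["label"]
        if label not in best or det["score"] > best[label]["score"]:
            best[label] = det
    return best

# Two competing instances of the same layer class (hypothetical values).
dets = [
    {"label": "ISOS-RPE", "score": 0.97, "mask": "mask_a"},
    {"label": "ISOS-RPE", "score": 0.31, "mask": "mask_b"},  # spurious
    {"label": "ILM-NFL",  "score": 0.95, "mask": "mask_c"},
]
kept = select_instances(dets)  # the spurious ISOS-RPE instance is dropped
```

A semantic segmentation network such as the U-Net has no equivalent mechanism, since all pixels of one class belong to a single undifferentiated map.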
The use of the Mask R-CNN for retinal layer segmentation provides an end-to-end DL framework, which simplifies the post-processing required to extract the boundaries of interest. For example, the U-Net method used in this study for comparison requires a graph-search post-processing step to obtain the boundaries, whereas the Mask R-CNN boundaries were obtained by taking the upper and lower pixel positions of each segmented mask and calculating the middle point between layers that share a common boundary. This substantially simplifies the boundary extraction process and reduces the processing time.
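This boundary extraction can be sketched as follows. The sketch is a simplified illustration assuming per-layer binary masks with foreground in every column, not the study's exact data structures.

```python
import numpy as np

def mask_boundaries(mask):
    """Per-column top and bottom row indices of a binary layer mask.

    Assumes every column contains at least one foreground pixel.
    """
    top = mask.argmax(axis=0)                                # first True row
    bottom = mask.shape[0] - 1 - mask[::-1].argmax(axis=0)   # last True row
    return top, bottom

def shared_boundary(upper_mask, lower_mask):
    """Midpoint, per column, between the bottom of the upper layer
    and the top of the adjacent lower layer."""
    _, bottom_upper = mask_boundaries(upper_mask)
    top_lower, _ = mask_boundaries(lower_mask)
    return (bottom_upper + top_lower) / 2.0

# Two adjacent hypothetical layer masks in a 10x5 image.
upper = np.zeros((10, 5), dtype=bool)
upper[2:5, :] = True   # occupies rows 2-4
lower = np.zeros((10, 5), dtype=bool)
lower[5:8, :] = True   # occupies rows 5-7

boundary = shared_boundary(upper, lower)  # 4.5 for every column
```

Compared with a graph-search step, this per-column reduction involves only simple array operations, which is consistent with the reported reduction in inference time.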

Conclusions
While a large number of published methods have assessed OCT segmentation based on encoder-decoder architectures (the U-Net and its variations), other DL architectures have received significantly less attention. In this paper, the Mask R-CNN method was applied to retinal and choroidal layer segmentation. The proposed method provides good performance for segmentation of the retina (full and intra-layer) and choroid. This is particularly evident when assessing the boundary error metrics, which are the most clinically relevant performance indicator.
The promising results obtained from this study demonstrate that the Mask R-CNN architecture can be used as an end-to-end method for OCT image segmentation. However, this experiment does not take full advantage of all the Mask R-CNN properties, so future work should consider the hyper-parameters of this method, anatomically correct segmentation using the instance information, and different loss functions. While the proposed Mask R-CNN model shows promising performance, the results, particularly those of the cross-dataset analysis, demonstrate that performance may be linked to specific features of the OCT dataset. The effect of cross-dataset variation (different acquisition protocols, instruments, and clinical features) on OCT image segmentation performance is an area that has received limited attention [34,60,61] and is thus worthy of further consideration.

Institutional Review Board Statement: The original study capturing the images in the main dataset was approved by the Queensland University of Technology human research ethics committee, and all study procedures followed the tenets of the Declaration of Helsinki.
Informed Consent Statement: All parents of participating children in the original study provided written informed consent for their child to participate, and all children provided written assent.

Data Availability Statement: Not available.
Acknowledgments: Computational resources and services used in this work were provided by the eResearch Office, Queensland University of Technology, Brisbane, Australia.

Conflicts of Interest: The authors have no conflicts of interest to declare that are relevant to the content of this article.

Appendix A.1
The Appendix tables describe in detail the comparison of the different metrics used to report the segmentation performance of each applied method when segmenting the retinal and choroidal regions.

Table A5. Mean (standard deviation) pixel accuracy (Acc.), precision (Pr.), recall (Rc.), specificity (Sp.), and intersection over union (IoU) of the Mask R-CNN (coco/3+ and custom ratios) for the full retina (ILM-RPE) and choroid (RPE-CSI) regions segmentation. Each value represents the mean of four independent runs.