Segmentation of Aorta 3D CT Images Based on 2D Convolutional Neural Networks

The automatic segmentation of the aorta can be extremely useful in clinical practice, speeding up the diagnosis of numerous pathologies, such as aneurysms and dissections, and enabling rapid reconstructive surgery, which is essential in saving patients' lives. In recent years, the success of Deep Learning (DL)-based decision support systems has increased their popularity in the medical field. However, their effective application is often limited by the scarcity of training data. In fact, collecting large annotated datasets is usually difficult and expensive, especially in the biomedical domain. In this paper, an automatic method for aortic segmentation, based on 2D convolutional neural networks (CNNs) and taking 3D CT (computed axial tomography) scans as input, is presented. For this purpose, a set of 154 CT scans was collected and a semi-automated approach was used to obtain their 3D annotations at the voxel level. Although less accurate, the use of a semi-automatic labeling technique instead of full supervision proved necessary to obtain enough data in a reasonable amount of time. The 3D volume was analyzed using three 2D segmentation networks, one for each of the three CT views (axial, coronal and sagittal). Two different network architectures, U-Net and LinkNet, were used and compared. The main advantages of the proposed method lie in its ability to work with a reduced number of data, even with noisy targets. In addition, analyzing 3D scans based on 2D slices allows them to be processed even with limited computing power. The results obtained are promising and show that the neural networks employed can provide accurate segmentation of the aorta.


Introduction
In recent years, the use of Deep Learning (DL) has improved the state of the art in many different fields, ranging from computer vision [1][2][3] and text analysis [4][5][6] to bioinformatics [7,8]. More specifically, DL-based decision support systems have become increasingly popular in the medical field [9], and in particular for applications in the Internet of Medical Things, where CNNs have been successfully employed for the classification of radiological, magnetic resonance or CT (Computerized Axial Tomography) images [10,11] and for natural images, for instance, in the classification of atypical nevi and melanomas [12][13][14], in the segmentation of bacterial colonies grown on Petri plates [15,16] and in the analysis of retinal images [17]. In this paper, a DL system for the semantic segmentation of aortic images is presented. The automatic segmentation tool was developed based on a new proprietary dataset, collected at the Department of Medicine, Surgery and Neuroscience of the University of Siena (Italy).
The aorta is the most important arterial vessel in the human body and is responsible for transporting blood from the heart to all other organs. It originates from the left ventricle and extends throughout the abdomen, where it divides into the iliac arteries. The aorta can be classified according to its anatomical location [18,19] into the thoracic aorta and the abdominal aorta. Depending on its morphology and on the direction of blood flow, the different parts of the aorta are also classified as ascending, descending and the aortic arch. Normally, the average aortic diameter does not exceed 2.5 cm, but over time it can dilate, stiffen or deform due to various pathologies, such as aneurysms [20] and dissections [21]. An aortic aneurysm is a permanent and non-physiological dilation of the aorta, where the vessel diameter exceeds the normal one by more than 1.5 cm. Aortic aneurysms are linked to a high mortality rate and are difficult to treat because they make the vascular wall of the dilated segment thinner and more prone to rupture. In addition, vessel dilation alters the blood flow, promoting abnormal blood clots (emboli) or thrombi. Aortic dissection, instead, is a serious condition in which there is a tear in the aortic wall. When the tear extends along the wall, blood can flow between the layers of the aortic wall, resulting in a false lumen. This can lead to rupture of the vessel or to a decrease in blood flow to the organs (ischemia).
The automated segmentation of the aorta from CT images helps clinicians speed up the diagnosis of these pathologies, enabling rapid reconstructive surgery, which is critical to saving patients' lives. The purpose of image segmentation is to divide a digital image into multiple segments based on certain visual features (which characterize the whole segment). In particular, image semantic segmentation can be viewed as a pixel-wise (voxel-wise in 3D images) classification, in which a label is assigned to each pixel/voxel. This is an important step in image understanding, because it provides a complete characterization of the objects present in the image. In recent years, several semantic segmentation models, based on deep neural networks, have been proposed [22][23][24][25][26]. DL architectures are usually trained through supervised learning, exploiting large sets of labeled data, which are commonly publicly available. Such datasets are used to train generic semantic segmentation networks, which can later be adapted to specific domains with less data. Unfortunately, no publicly available aortic datasets which can be used for this purpose exist. For this reason, in collaboration with the Department of Medicine, Surgery and Neuroscience of the University of Siena, a dataset of 154 3D CT images was collected. In each image, all pixels belonging to the aorta were labeled using a semi-automatic approach. Labeling images is, in fact, an extremely time-consuming task. Therefore, even if labels obtained in a semi-automatic way provide lower quality information, this was a necessary trade-off to obtain enough data in a reasonable amount of time. Subsequently, following an approach inspired by [10], the segmented images were employed to train three 2D CNN segmentation models, one for each view (coronal, sagittal and axial). Several architectures based on both LinkNet [27] and U-Net [28] were employed as segmentation networks and their results were compared. Two-dimensional models were preferred to three-dimensional ones for computational reasons and also to reduce overfitting: with a small number of images available, training a 3D model would in fact be difficult due to the large number of parameters required by 3D convolutions. The results obtained are very promising and show that, even with the use of low-quality labeled images, DL architectures can successfully segment the aorta from CT scans. In particular, the main contributions of this manuscript can be summarized as follows:
• A new approach for the segmentation of 3D CT scans of the aorta, based on 2D CNNs, is proposed;
• The model has low computational requirements and can also be employed with limited computational resources;
• The approach can be employed on small datasets, possibly with noisy labels;
• The method was tested on an original dataset collected at the University of Siena, not publicly available due to privacy issues.
The paper is organized as follows. In Section 2, the related literature is reviewed. Section 3 presents a description of the proposed approach, and Section 4 discusses the obtained experimental results. Finally, Section 5 draws some conclusions and outlines future perspectives. Table A1 in Appendix A summarizes the nomenclature used throughout the manuscript.

Natural Image Segmentation
In recent years, many advances have been made in the semantic segmentation of natural scenes using deep fully convolutional neural networks [22][23][24][25][26]. Usually, these architectures are based on an encoder-decoder structure. On the one hand, the encoder extracts a high-level representation of the input image by employing subsequent layers of convolutions and down-sampling. On the other hand, the decoder produces a representation at the image level by recovering the original spatial resolution. Supervised training of these architectures requires pixel-level labeling, which is often time consuming and difficult to obtain, especially in medical image processing, where the available datasets are often too small. For this reason, in biomedical imaging, it is critical to use networks that effectively recover the input resolution while maintaining a small number of parameters. In this paper, U-Net [28] and LinkNet [27], two popular networks often employed in medical image semantic segmentation, were compared. U-Net consists of a convolutional encoder followed by a decoder composed of up-convolutions combined with skip-connections. Skip-connections concatenate specific feature maps in the encoder with feature maps at the same resolution in the decoder. In comparison, LinkNet [27] is a network architecture devised to reduce the number of parameters by efficiently sharing the information learnt by the encoder with the decoder. After each down-sampling block, feature maps from the encoder layer are summed with the feature maps of the corresponding decoder layer at the same resolution.
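To make the difference between the two skip-connection styles concrete, the following simplified PyTorch sketch contrasts a U-Net-like decoder block (channel-wise concatenation) with a LinkNet-like one (element-wise sum). The block and layer names are illustrative only; the actual U-Net and LinkNet decoders contain additional convolutions and normalization layers.

```python
import torch
import torch.nn as nn

class UNetStyleBlock(nn.Module):
    """Decoder block that concatenates the encoder skip with the upsampled features (U-Net style)."""
    def __init__(self, dec_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(dec_ch, dec_ch, kernel_size=2, stride=2)
        self.conv = nn.Conv2d(dec_ch + skip_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x, skip):
        x = self.up(x)                   # recover spatial resolution
        x = torch.cat([x, skip], dim=1)  # concatenation along the channel axis
        return torch.relu(self.conv(x))

class LinkNetStyleBlock(nn.Module):
    """Decoder block that adds the encoder skip to the upsampled features (LinkNet style)."""
    def __init__(self, dec_ch, skip_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(dec_ch, skip_ch, kernel_size=2, stride=2)

    def forward(self, x, skip):
        x = self.up(x)
        return torch.relu(x + skip)      # element-wise sum keeps the channel count fixed
```

Because the sum does not increase the number of channels, the subsequent convolutions of a LinkNet-style decoder operate on smaller feature maps, which is where most of its parameter savings come from.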

Computerized Axial Tomography Segmentation
Some tools, such as ITK-SNAP [29] and 3D Slicer [30], allow one to segment CT scans manually or, alternatively, to obtain a semi-automatic segmentation. The ITK-SNAP semi-automatic procedure uses an active contour method based on snakes. A snake is a spline that adapts its shape to the contours of an object while also trying to avoid discontinuities in the approximated curve. The obtained segmentation usually needs post-processing to reduce the noise and to be refined. Manual segmentation is more time consuming but gives accurate segments, while semi-automatic segmentation is slightly faster but often less accurate. For this reason, the development of fully automated segmentation systems could provide an excellent alternative method for obtaining segmented images.
Recently, several methods based on DL were proposed for medical image segmentation. In [31], a fully automated system was proposed for the segmentation of abdominal organs, including the abdominal aorta. In particular, a feature-based approach was used to approximately localize the organ, while a 3D CNN was dedicated to segmentation. In [32], a fully automatic pipeline for thrombi detection was developed, where thrombus segmentation is performed on a single 2D slice-a method subsequently extended with the use of 3D convolutions in [33]. An automated method for segmenting the ascending aorta, the aortic arch and the descending aorta was proposed in [34]. A dilated convolutional neural network was applied separately to the axial, coronal and sagittal planes, with the final segmentation obtained by averaging the results on each view. In [35], instead, the aorta was located using a CNN-based classifier trained on image patches. After a first phase of detection, the edges of the aorta were extracted with the Circle Hough Transform algorithm and the lumen diameter was used to predict the risk of abdominal aneurysms. A multitask learning approach was used to segment the entire aorta, true lumen and false lumen using a 3D CNN in [36]. Finally, in [10], a fully automated pipeline was developed for the segmentation of the entire aorta, including the common iliac arteries, using 2D networks trained on axial, coronal and sagittal views. Similarly to [10], in this work, the aorta segmentation was obtained from three 2D networks trained on the axial, coronal and sagittal views.

Aorta Segmentation
In the following sections, the proposed aortic segmentation approach is described; its pipeline is illustrated in Figure 1. In particular, in Section 3.1, the pre-processing steps (resampling of the CT scans, available in DICOM format, normalization and extraction of the slices for each view) are presented. Subsequently, the segmentation models and their training are discussed in Section 3.2.

Pre-Processing
Preliminarily, each scan (and each corresponding label) was oriented in RAI (Right-to-left, Anterior-to-posterior, Inferior-to-superior) mode. The 3D volume was then resampled, normalizing the voxels to the size of 1 mm × 1 mm × 1 mm. The resampling process mapped the image from a given reference system, f, to a new coordinate system, m. It was defined by a lattice in the reference system, f, and by a transformation function, $T_f^m$, that mapped the points from f to m. Nearest-neighbor was used as the interpolation algorithm, so that the value assigned to any point in m equaled that of the nearest point in f. Adaptive Histogram Equalization (AHE) normalization was employed to reduce the variability of the scans, mainly caused by the different setups used during the acquisition process. AHE is a common technique used to enhance contrast and edges in images, and it was applied over the entire CT scan. The algorithm calculated a local histogram for each sub-part of the image and, based on these histograms, normalized the intensity values of the whole image (see Equation (1)):

g_{i,j} = \lfloor (L - 1) \sum_{n=0}^{I_{i,j}} p_n \rfloor    (1)

where g is the equalized sub-part of the image, $I_{i,j}$ is the original intensity of the pixel at position (i, j), $p_n$ is the normalized histogram and L is the maximum intensity of a pixel. Some slices normalized by AHE are reported in Figure 2.
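A minimal sketch of this pre-processing step is given below, assuming SimpleITK for loading and resampling and scikit-image for AHE; the authors' exact implementation and parameters are not specified, the directory layout and the choice of these libraries are assumptions, and orientation-code conventions may differ between toolkits.

```python
import SimpleITK as sitk
import numpy as np
from skimage import exposure

def preprocess(dicom_dir):
    # Read the DICOM series as a single 3D volume and reorient it to RAI.
    reader = sitk.ImageSeriesReader()
    reader.SetFileNames(reader.GetGDCMSeriesFileNames(dicom_dir))
    img = sitk.DICOMOrient(reader.Execute(), "RAI")

    # Resample to isotropic 1 mm x 1 mm x 1 mm voxels with nearest-neighbor interpolation.
    new_spacing = (1.0, 1.0, 1.0)
    new_size = [int(round(sz * sp / ns))
                for sz, sp, ns in zip(img.GetSize(), img.GetSpacing(), new_spacing)]
    img = sitk.Resample(img, new_size, sitk.Transform(), sitk.sitkNearestNeighbor,
                        img.GetOrigin(), new_spacing, img.GetDirection(),
                        0, img.GetPixelID())

    # Adaptive Histogram Equalization over the whole scan (expects values in [0, 1]).
    vol = sitk.GetArrayFromImage(img).astype(np.float32)
    vol = (vol - vol.min()) / (vol.max() - vol.min() + 1e-8)
    return exposure.equalize_adapthist(vol)  # numpy array indexed as (z, y, x)
```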
After pre-processing, CT scans were cropped to reduce their size. This also allowed one to remove parts of the scan where the aorta was not present (normally, the entire scan goes from the pelvis to the head). The dimensions of the 3D bounding box were multiples of 32 (to have a final image size that fit the input of the chosen segmentation network without using padding). Finally, the slices for each view (coronal, sagittal and axial) were extracted and used for network training. An example of an axial slice together with its label is given in Figure 3.
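The cropping and per-view slice extraction can be sketched with plain NumPy as follows. The centered crop is an assumption, since the paper does not specify how the bounding box is positioned, and a real implementation would also restrict the box to the region containing the aorta.

```python
import numpy as np

def crop_to_multiple_of_32(vol):
    """Crop each dimension down to the nearest multiple of 32 (centered crop)."""
    target = [d - d % 32 for d in vol.shape]
    start = [(d - t) // 2 for d, t in zip(vol.shape, target)]
    sl = tuple(slice(s, s + t) for s, t in zip(start, target))
    return vol[sl]

def extract_views(vol):
    """Return the lists of 2D slices for the axial, coronal and sagittal views.
    Assumes the volume is indexed as (z, y, x) after RAI reorientation."""
    axial    = [vol[k, :, :] for k in range(vol.shape[0])]
    coronal  = [vol[:, k, :] for k in range(vol.shape[1])]
    sagittal = [vol[:, :, k] for k in range(vol.shape[2])]
    return axial, coronal, sagittal
```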

Deep Segmentation Network Training
In this section, the training procedure employed to segment the aorta from the 2D slices extracted from the 3D CT scans is presented. In particular, two different networks were tested, U-Net [28] and LinkNet [27] (see Figure 4), which share an encoder-decoder architecture. The encoder transforms the original image into a set of feature maps, whereas the decoder up-samples the encoded representation (which typically has a lower resolution due to a series of down-sampling operations) to restore the size of the original input. The main difference between U-Net and LinkNet resides in the decoding structure. In particular, while U-Net concatenates the encoder and decoder feature maps, LinkNet simply adds the corresponding feature maps. In this work, two encoder models pretrained on ImageNet [37], ResNet34 [38] and Inception ResNet V2 [39], were used. Inception ResNet V2 is deeper than ResNet34 and usually provides better performance, while the reduced number of parameters of ResNet34 can guarantee a better generalization on small datasets (such as the one proposed in this paper). The pseudocode of the training procedure is reported in Algorithm 1. Three networks, one for each view of the CT scans, were trained using a linear combination of binary cross-entropy [40] and Jaccard error as the loss function (see Equations (2)-(4) for the definitions):

L_{bce}(gt, pr) = -[gt \log(pr) + (1 - gt) \log(1 - pr)]    (2)

L_{jac}(gt, pr) = 1 - \frac{|gt \cap pr|}{|gt \cup pr|}    (3)

L(gt, pr) = L_{bce}(gt, pr) + L_{jac}(gt, pr)    (4)

where pr is the network prediction and gt is the ground truth. Moreover, the Adam optimizer [41] was used and early stopping on the validation set was employed to avoid overfitting. To augment the number of training samples, data augmentation strategies (horizontal and vertical shifts, max 10 px, and rotations, max 5°) were employed during training. The results obtained with the different models were compared using the Mean Intersection over Union (MIoU) on the validation set. The test set was then used as a hold-out set to assess the quality of the best model.
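The combined loss and one training step can be sketched in PyTorch as follows. This is only an illustration of the loss in Equations (2)-(4): the use of segmentation_models_pytorch for the pretrained LinkNet/ResNet34 model, the learning rate and the tensor shapes are assumptions, and early stopping and data augmentation are omitted for brevity.

```python
import torch
import torch.nn.functional as F
import segmentation_models_pytorch as smp  # one possible source of pretrained U-Net/LinkNet models

def bce_jaccard_loss(pr, gt, eps=1e-7):
    """Combined loss L = L_bce + L_jac on sigmoid outputs pr and binary masks gt.
    The Jaccard term is the usual soft (differentiable) relaxation of Equation (3)."""
    bce = F.binary_cross_entropy(pr, gt)
    inter = (pr * gt).sum()
    union = pr.sum() + gt.sum() - inter
    return bce + 1.0 - (inter + eps) / (union + eps)

# One of the four tested configurations: LinkNet with a ResNet34 encoder pretrained on ImageNet.
model = smp.Linknet(encoder_name="resnet34", encoder_weights="imagenet",
                    in_channels=1, classes=1, activation="sigmoid")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning rate is an assumption

def train_step(images, masks):
    # images, masks: float tensors of shape (batch, 1, H, W), masks in {0, 1}
    optimizer.zero_grad()
    loss = bce_jaccard_loss(model(images), masks)
    loss.backward()
    optimizer.step()
    return loss.item()
```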

Experiments and Results
The dataset used in our experiments is described in Section 4.1, while the experimental setup and the obtained results are discussed in Section 4.2.

Dataset
The dataset, collected at the Department of Medicine, Surgery and Neuroscience of the University of Siena, is made up of 154 CT scans acquired with contrast medium. The scans were saved in DICOM (Digital Imaging and Communications in Medicine) format, which is the standard for CT images. The DICOM format requires the presence of a set of files, one for each slice, which collectively describe a 3D volume, together with a dictionary of metadata that contains information on the acquisition setup and on the patient (patient data were anonymized; only details about age and gender were kept). Table 1 reports the demographic description of the dataset as well as the number of CT scans available. Additional metadata describing the collected CT scans, namely the orientation and the size of the voxels, are available. The orientation is defined for each view of the CT scans: axial (X-Y plane), coronal (X-Z plane) and sagittal (Y-Z plane). The size of the voxel, instead, is defined by two parameters:
• The pixel spacing, which indicates the dimension in millimeters of a single pixel in each slice;
• The slice thickness, which corresponds to the distance in millimeters between two adjacent slices.
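As an illustration, the voxel-geometry metadata can be read from a DICOM series with pydicom as sketched below. The folder layout, the .dcm extension and the assumption that these tags are present in every file are hypothetical and depend on the acquisition system.

```python
import pydicom
from pathlib import Path

def scan_metadata(series_dir):
    """Read voxel-geometry metadata from one slice of a DICOM series.
    `series_dir` is a hypothetical path to the folder holding the per-slice files."""
    first_slice = sorted(Path(series_dir).glob("*.dcm"))[0]
    ds = pydicom.dcmread(first_slice)
    return {
        "pixel_spacing_mm": [float(v) for v in ds.PixelSpacing],  # in-plane size of a pixel
        "slice_thickness_mm": float(ds.SliceThickness),           # distance between adjacent slices
        "orientation": getattr(ds, "ImageOrientationPatient", None),
    }
```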
To train the deep segmentation model, the dataset was split into a training, validation and test set, as described in Table 2.

Table 2. Number of CT scans in each split.

Split        Number of CT Scans
Training     134
Validation   10
Test         10

All the scans were pre-processed as described in Section 3.1, resulting in a set of slices for each view. Table 3 displays the number of slices belonging to the training, validation and test sets, along with their sizes, for each view. Each slice of the dataset is grayscale, and each pixel has a binary label associated with it, indicating whether the pixel belongs to the aorta or not. The supervision was generated semi-automatically using 3D Slicer. The dataset shows high positional, labeling and contrast variability, mainly caused by the following reasons:
• Position of the aorta: not all the scans are centered in the same way;
• Presence of errors in the labels: the semi-automatic procedure used to create the labels is sometimes not accurate due to the presence of false positives (voxels which are labeled as aorta but actually do not belong to it);
• Different acquisition systems: the CT scans were collected in different periods and using different acquisition systems.

Results and Discussion
As described before, different segmentation network architectures were tested. In particular, the following four architectures were used in our experiments:
• U-Net with a ResNet34 encoder;
• U-Net with an Inception ResNet V2 encoder;
• LinkNet with a ResNet34 encoder;
• LinkNet with an Inception ResNet V2 encoder.
For each of the above architectures, following the training procedure described in Section 3.2, three models were trained, one for each CT scan view (axial, coronal and sagittal). The results on the validation set of each model for each view are reported in Table 4 for U-Net and in Table 5 for LinkNet, respectively. As can be easily observed, LinkNet obtains a greater MIoU and, in particular, when ResNet34 is used as the encoder, the difference between LinkNet and U-Net is larger. If, instead, Inception ResNet V2 is used as the encoder, the difference between U-Net and LinkNet is less significant. Furthermore, it can be noted that, even with a small dataset, Inception ResNet V2 outperforms ResNet34 in all the experiments; only when the model is trained on the coronal view do the two architectures behave quite similarly. Based on these results, the LinkNet architecture with the Inception ResNet V2 encoder was chosen and evaluated on the test set. The results are reported in Table 6. The results obtained are promising but limited by the quality of the annotations. Indeed, the ground truth of the dataset was obtained with a semi-automatic procedure and, in some cases, was not completely accurate. However, the dataset provides a reasonable supervision that partly compensates for the absence of publicly available datasets with aortic images labeled at the pixel level. In Figure 5, some images are shown, together with their labels and the segmentation generated by the network, for a qualitative evaluation. Figure 6 shows some slices not correctly segmented by the network. In these cases, the images are actually difficult to interpret: in the second and third rows the slices are very dark and, in the first row, the network probably misclassified the aorta due to its size.
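For reference, the MIoU used to compare the models can be computed as in the following sketch. Whether the score is averaged per slice or computed over whole volumes is not stated in the paper, so the per-slice averaging and the 0.5 threshold below are assumptions.

```python
import numpy as np

def mean_iou(preds, targets, threshold=0.5, eps=1e-7):
    """Mean Intersection over Union over a set of predicted/ground-truth binary masks."""
    ious = []
    for pr, gt in zip(preds, targets):
        pr = (pr >= threshold).astype(bool)  # binarize the network output
        gt = gt.astype(bool)
        inter = np.logical_and(pr, gt).sum()
        union = np.logical_or(pr, gt).sum()
        ious.append((inter + eps) / (union + eps))
    return float(np.mean(ious))
```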

Conclusions
In this paper, some deep convolutional neural networks for aorta segmentation were trained, using a dataset of 154 CT scans collected at the Department of Medicine, Surgery and Neuroscience of the University of Siena. Two types of architectures, U-Net and LinkNet, with two types of encoders, ResNet34 and Inception ResNet V2, were tested as segmentation networks. Despite the fact that network training was based on a small set of training images with low-quality supervision, obtained with a semi-automatic labeling approach, and on data with high variability in the acquisition conditions, we demonstrated that it was possible to successfully train three 2D segmentation networks, one for each view (axial, coronal and sagittal). Obtaining a set of high-quality supervised 3D images is costly and time consuming; however, if a larger set of semi-automatically supervised scans becomes available, it would in principle be possible to further improve the results. Therefore, as a matter of future research, we plan to employ a semi-supervised approach, based on a set of unlabeled scans, to enrich the current dataset and hopefully increase the network performance. Another future development currently under investigation entails post-processing of the network output to clean the predictions. In particular, consistency between predictions from adjacent slices and between predictions from different views could be used to improve the segmentation quality.

Institutional Review Board Statement: Ethical review and approval were waived for this study because the data used were anonymized immediately after their collection.

Conflicts of Interest:
The authors declare no conflict of interest.