Semantic Cardiac Segmentation in Chest CT Images Using K-Means Clustering and the Mathematical Morphology Method

Whole cardiac segmentation in chest CT images is important to identify functional abnormalities that occur in cardiovascular diseases, such as coronary artery disease (CAD) detection. However, manual efforts are time-consuming and labor intensive. Additionally, labeling the ground truth for cardiac segmentation requires the extensive manual annotation of images by the radiologist. Due to the difficulty in obtaining the annotated data and the required expertise as an annotator, an unsupervised approach is proposed. In this paper, we introduce a semantic whole-heart segmentation combining K-Means clustering as a threshold criterion of the mean-thresholding method and mathematical morphology method as a threshold shifting enhancer. The experiment was conducted on 500 subjects in two cases: (1) 56 slices per volume containing full heart scans, and (2) 30 slices per volume containing about half of the top of heart scans before the liver appears. In both cases, the results showed an average silhouette score of the K-Means method of 0.4130. Additionally, the experiment on 56 slices per volume achieved an overall accuracy (OA) and mean intersection over union (mIoU) of 34.90% and 41.26%, respectively, while the performance for the first 30 slices per volume achieved an OA and mIoU of 55.10% and 71.46%, respectively.


Introduction
Cardiovascular disease (CVD) has been reported as one of the leading causes of death globally and occurs due to functional abnormalities in the heart and blood vessels [1]. In 2016, according to the World Health Organization (WHO), about 17.9 million people died from CVDs, which is equivalent to 31% of all global deaths (mainly from stroke and heart attack) [1]. For instance, one of the CVDs, coronary artery disease (CAD) is a group of abnormalities in blood vessels supplying the heart muscle [1]. CAD is caused by a surplus of calcium in the coronary artery trees. Excessive calcium can narrow the arteries and increase the risk of heart attack [2,3]. Therefore, an early assessment and diagnosis can significantly reduce the life-threatening nature of this CVD and improve quality of life for the afflicted patients [2,3]. In modern medical imaging modalities, computed tomography (CT), magnetic resonance imaging (MRI), and ultrasound are used to assist in identifying abnormal findings in the human body for early assessment and diagnosis [4][5][6][7]. Recently, the non-gated and non-invasive chest CT has been used to provide potential support for investigative imaging tests to interpret cardiac function states [8][9][10]. More detailed characteristics of chest CT images are described in Appendix A.
Labeling the ground truth for cardiac segmentation requires the extensive manual annotation of images by the radiologist, which can be time-consuming and labor intensive [7][8][9][10][22][23][24][25][26]. Due to the difficulty in obtaining annotated data and the required expertise as an annotator, an unsupervised approach has been considered in this study. The goal of cardiac segmentation is to partition the whole chest CT image into cardiac anatomical ROIs, with respect to not only the dissimilarity of each pixel's value, but also the meaningful structure (i.e., geometrical position) of cardiac anatomical ROIs. For example, the Hounsfield units (the quantitative scale of chest CT images) of heart muscles and other muscles in the body are in the same range (almost identical). Detailed information of the Hounsfield unit (HU) of chest CT images is described in Appendix B. Additionally, the cardiac anatomical substructures (i.e., the four chambers, DA and coronary artery) are usually formed in one shape that is mostly arranged in the center of a chest CT image [3,30].
Inspired by the performance of the mean-thresholding method [13][14][15][16][17], we adapted it for this current study by utilizing the K-Means clustering method as a threshold criterion. The K-Means clustering method [31,32] is a simple unsupervised method, which exploits Euclidean distances to compute the mean of all given pixels and assigns pixels into k different clusters based on the nearest mean. However, the anatomical structure of cardiac tissues and the quantitative scale (i.e., HU) of chest CT images are very complicated for cardiac segmentation when using the K-Means clustering method directly. Therefore, we exploited a mathematical morphology method [33,34] to enhance the shifting of the mean threshold. In this paper, semantic whole-heart segmentation combining K-Means clustering as a threshold criterion of mean-thresholding and the mathematical morphology method as a threshold shifting enhancer is proposed. In the proposed approach, the K-Means method is utilized to automatically cluster pixels, while the mathematical morphology method is used to remove noise, fill holes and convex hull; also, prior knowledge of chest anatomical structures [3,30] is required to assist in the awareness of geometrical positions. The silhouette scoring method [32,35] is applied to evaluate the performance of K-Means clustering, while overall accuracy (OA) and mean intersection over union (mIoU) are calculated to evaluate the overall performance of cardiac segmentation.
The remaining parts of the paper are organized as follows. Section 2 describes related works of earlier approaches, such as graph-based, mean-threshold and fuzzy clustering, in addition to deep learning approaches such as supervised and unsupervised learning methods. Section 3 describes a step-by-step hierarchical flow of the proposed methodology. Experimental results and comparison discussions are analyzed in Sections 4 and 5, respectively. Section 6 concludes our study. Finally, supplemental literature reviews of CT, HU and algorithm tables are presented in Appendices A-C, respectively.

Related Works
Shi et al. [11] conducted natural image segmentation by proposing a normalized cut criterion. The method addressed images as a graph partition problem. The normalized cut was used as a measure of dissimilarity between subgroups of a graph. The eigenvalue was utilized to minimize the criteria of the normalized cut. Pedro et al. [12] conducted natural image segmentation through a pairwise region comparison method. The method also addressed images as a graph partition problem. The greedy decision was used as a measurement criterion.
Dorin et al. [13] conducted natural image segmentation by proposing a recursive mean shift procedure to generate M-estimators. The M-estimators were utilized in detecting the modes of the density. The method produced significant results with a low dimension of the space. Larrey-Ruiz et al. [14] conducted cardiac segmentation in chest CT images using multiple threshold values. The thresholding was calculated by the mean value of statistical local parameters (i.e., pixels are in one slice) and global parameters (i.e., pixels are across all slices of a full volume). Huo et al. [15] conducted weakly unsupervised cardiac segmentation as an initial step for coronary calcium detection. The method contained many unrelated anatomical ROIs, such as the spine, ribs, and muscles. The method was adapted from Liao et al. [16], utilizing a convex hull of lungs as the base parameters. Rim et al. [17] conducted whole-heart segmentation by adapting the cardiac segmentation method of Huo et al. [15] and Liao et al. [16]. The method used an alternative threshold of a convex hull of lungs and a convex hull of ribs as a base parameter. As the alternative threshold was manually defined by the mathematical geometry of the lungs, the method is an ad hoc set of each image.
Radha et al. [18] conducted brain image segmentation using an intelligent fuzzy level set. Quantum particle swarm optimization was proposed to improve the steadiness and meticulousness in order to reduce opening sensitivity. The results for this method showed that it could optimize up to 15% more than the original fuzzy level set method. However, the method experienced challenges with neoplastic or degenerative ailment images. Versaci et al. [19] conducted image edge detection by proposing a new approach of fuzzy divergence and fuzzy entropy. The proposed fuzzy entropy used fuzzy divergence as the distances between fuzzified images, which were computed by means of fuzzy divergence. Chanda et al. [20] conducted cardiac MRI image segmentation using a fuzzy-based approach. The method was based on morphological, threshold-based segmentation and fuzzy-based edge detection. The method achieved more than 90% accuracy. Kong et al. [21] conducted an image segmentation method using an intuitionistic fuzzy C-Means clustering algorithm. The method defined a modified non-membership function, an initial clustering center based on grayscale features, an improved nonlinear kernel function and a new measurement of fuzzy entropy. The method outperformed existing algorithms, but it took a lot of computational time.
Ronneberger et al. [22] conducted an electron microscopy (EM) image segmentation using supervised deep learning by defining U-net architecture. The network consisted of 23 convolutional layers. The experiment was trained with 30 images of size 512 × 512 and received an average IOU of 92%. The U-net model has become the gold standard for biomedical image segmentation. Payer et al. [23] conducted whole-heart segmentation using supervised deep learning by defining two-step CNN architecture. The experiment was trained on CT images with 20 volumes of size 300 × 300 × 188 and achieved an average dice similarity coefficient of 90.8%. Ahmed et al. [24] conducted whole-heart segmentation using supervised deep learning by defining a CNN network and using stacked denoising auto-encoders. The experiment was trained on CT images of eight subjects and achieved an average accuracy of 93.77%. Liao et al. [25] conducted 3D whole heart segmentation using supervised deep learning by defining a multi-modality (i.e., CT and MRI) transfer learning network with adversarial training. The network introduced the attention mechanism into the U-net network. The experiment was trained on 60 CT volumes with a dice value of 0.914. Max et al. [26] conducted multi-modality (i.e., CT and MRI) 3D whole-heart segmentation using supervised deep learning by defining a shape encoder-decoder network. The experiment was trained on 15 CT volumes with a dice value of 0.653.
Xia et al. [27] conducted natural image segmentation using unsupervised deep learning by defining a W-net architecture. The network was a concatenation of two U-net networks. The experiment was trained on 11,530 images and achieved a probabilistic rand index (PRI) of 0.86. Joyce et al. [28] conducted multi-class medical image segmentation on Myocardium (MYO), left ventricle (LV) and right ventricle (RV) regions using unsupervised deep learning by defining a generative adversarial network (GAN) model. The experiment was trained on 20 CT volumes with a dice value of 0.51. Perone et al. [29] conducted medical image segmentation using unsupervised deep learning by defining a self-ensembling architecture. The experiment was trained on MRI images from 100 subjects and achieved the best dice value of 0.847.

Overall Workflow of the Proposed Method
The segmentation process is intended to generate a binary mask of whole-heart anatomical ROIs, including the four chambers, coronary arteries, and DA, as shown in Figures 1-3 and listed in Appendix C. As the cardiac anatomical ROIs are formed in one shape, arranged mostly in the center of a chest CT image [3,30], a step-by-step hierarchical process to enhance the mean-threshold method from the corner towards the center region of the chest CT image is proposed. Firstly, the air substance at the corner of a chest CT image is filtered by a convex hull of foreground mask. Then, other human body substances such as fat, muscles and lungs are filtered by a convex hull of lung mask. Finally, the spine substance is filtered by a convex hull of spine mask. More details of each step are explained in the following sub-sections.

Grayscale Conversion
Firstly, the raw CT image is read in HU pixels, which is the standard radio-density scale. One CT image has a resolution of 512 × 512 × 1 (i.e., width × height × channels), with the range around [−1000, +1000]. In the current study, 56 images per volume are trimmed. Thus, one CT volume is 512 × 512 × 56 in size. The HU pixels are converted into grayscale pixels using the standard range normalization, as shown in Equation (1).
where Y is a set of HU pixels, Y ⊂ [−1000, +1000] 3D , and X is a set of grayscale pixels, X ⊂ [0, 255] 3D and 3D = 512 × 512 × 56. Y min and Y max are the minimum and maximum value of HU pixels, respectively. An example of a chest CT image in grayscale pixels is shown in Figure 2a.

Convex Hull of Foreground Mask
The K-Means method (known as the unsupervised technique in clustering literature [31,32]) is used to cluster the whole chest CT image into foreground (i.e., anatomical ROIs such as fat, muscle, lungs, heart and other) and background (i.e., air) clusters. Given k = 2 clusters, and the set of X grayscale pixels X ⊂ [0, 255] 2D , the k centroid clusters C are calculated by minimizing the function ∅, as shown in Equation (2).
where the foreground threshold τ is calculated by the mean of k centroid clusters C. Lines randomly situated at the bottom of the chest in the CT images were observed, as shown in Figure 2a, which are considered as noise. To remove this noise, the morphological binary opening operation [33,34] ϕ is conducted with the default structure element from [34].
where F is the foreground mask and B i is the default structure elements with i = 1, 2, 3, 4.
is iteratively applied by a hit-or-miss transform ∝ with B i ; and when it converges (i.e., A i r is equal to A i r−1 ), it is united with F, which is referred to as D i . Then, the convex hull ω is a union of D i . The examples and generation processes of the foreground mask and the convex hull of foreground mask are shown in Figure 2b,c and Table A2, respectively.

Convex Hull of Lung Mask
The process of generating a convex hull of lung mask is intended to remove fat, muscle and rib substances. Firstly, the lung mask is computed. The foreground threshold τ is used to compute the lung mask. The enhanced grayscale pixels E, E ⊂ [0, 255] 2D are computed to assist the thresholding. Then, the lung mask L, L ⊂ [0, 1] 2D is computed as shown in Equation (5).
There are blood vessels within the lungs, which result in many small holes in lung mask L. To fill those small holes, the morphological binary closing operation [33,34] θ is conducted with the default structure element from [34]. The operation is iterated twice: L = θ(θ(L)). Then, the convex hull of lung mask U, U ⊂ [0, 1] 2D is computed by the morphological convex hull operation [33,34] ω adapted from Equation (4), as shown in Equation (6).
The intermediate heart mask I, I ⊂ [0, 1] 2D , is a bitwise AND operation between the convex hull of lung mask U and an inversion of lung mask L, which is computed as shown in Equation (7).

Convex Hull of Spine Mask
The process of generating a convex hull of spine mask is intended to remove spine pixels and other substances under the spine and DA. Firstly, the spine mask is computed. The enhanced intermediate heart grayscale pixels E, E ⊂ [0, 255] 2D are computed to assist the thresholding, as shown in Figure 3a. Then, the k centroid clusters C of the K-Means method adapted from Equation (2) are computed to divide the enhanced grayscale pixels E into background, heart and spine clusters with a value of k = 3. The spine pixels are brighter compared to the background and heart pixels, as shown in Figure 3a. We can assume that spine pixels are in the last right cluster in Figure 3b. Therefore, the spine threshold τ is the maximum centroid of clusters C. Then, the spine binary mask S, S ⊂ [0, 1] 2D , is a result of a thresholding condition, which is defined as shown in Equation (8).
The morphological binary closing θ and opening operation ϕ are applied to fill holes and remove islands, respectively. Each operation is iterated twice: S = ϕ(ϕ(S)) and then, S = θ(θ(S)).
The spine mask S does not cover pixels of substances around the spine and under the DA, as shown in Figure 2g. To remove those pixels, the convex hull of spine mask is computed. (x0, y0) and (xc, yc), which are denoted as the top and center coordinates of the white convex polygon in the spine mask S, are computed by the region properties function ρ and ρ from [34]: (x0, y0) = ρ(S) and (xc, yc) = ρ (S), respectively. Then, the convex hull of spine mask P, P ⊂ [0, 1] 2D , is defined, as shown in Equation (9).
where x and y are a set of coordinates of spine mask S and y is a set of row axes of convex hull spine mask P. Then, the heart mask H, H ⊂ [0, 1] 2D is a bitwise AND operation between the intermediate heart mask I and an inversion of spine mask convex hull P, which is computed as shown in Equation (10).
where σ(P) is a bitwise NOT operator: σ(P) = NOT(P). The examples and generation processes of the spine mask, convex hull of spine mask and heart mask are shown in Figure 2g-i and Table A4, respectively.
The examples and generation processes of segmented heart images are shown in Figure 2j and Table A5, respectively.

Convex Hull of Lung Mask Refinement
The proposed segmentation method depends on the convex hull of lung mask, similar to Huo et al. [14] and Rim et al. [15]. Empirically, when the lungs are not well surrounding the heart region, the convex hull of lung mask fails to compute intermediate heart mask I using Equation (7). In this case, the convex hull of lung mask is refined, as shown in Figure 4. The lungs were observed to vanish little by little from the CT image when the liver appears. Thus, within 56 slices per CT volume, the last lungs-well-surrounded slice is recorded and used as the global parameters for the next slice. Then, the local and global parameters are combined to compute the convex hull of lung mask. Given N is the number of slices in one volume (i.e., N = 56) and y0 is the top row axis of the white convex polygon in the lung mask L n−i and L n , L ⊂ [0, 1] 3D , the convex hull of lung mask U, U ⊂ [0, 1] 3D is computed by a bitwise OR operation, as shown in Equation (12).
where n and (n − i) are the current and the last lungs-well-surrounded slice in N, respectively. The examples and generation processes of the refined convex hull of lung mask are shown in Figure 4 and Table A6, respectively.

Experimental Setup
The current experiment was conducted on a Windows 10 computer with an Intel Core™ i7-9700 CPU @ 3.00 GHZ, 32.0 GB RAM, and an NVIDIA GeForce RTX 2070 graphics card. The code was written in Python language with the Scikit image library [34] and the K-Mean method of the Scikit learn library [32]. The method was applied on a dataset from Soonchunhyang University Cheonan Hospital [36]. The dataset was acquired randomly from 500 subjects who were scanned using a CT scanner (Phillips iCT 256) during 2019. The chest CT slices were captured in diverse ranges containing between 56 to 84 slices. The first 56 slices were selected for the current study. The resolution of each slice was the same at 512 × 512 pixels. The FOV was 250 × 250 mm and the slice thickness was 2.5 mm. All data were stored in dicom 3.0 format [37].

Silhouette Score
The silhouette method [32,35] was applied to score how well the K-Mean method [31,32] separated the clusters. The formula of the silhouette method is shown in Equation (13).
where a is the average distance from the ith pixel to all pixels in the same cluster and b is the average distance from ith pixel to all pixels in the closest cluster. The score has a range of [−1, +1]. If score = −1, it means that the clusters are not well separated. If score = +1, it means that the clusters are well separated. If score = 0, it means that the clusters are overlapping.
The evaluation was analyzed on three cases where M denotes the number of images: (1) the first 30 slices from the 1st to 30th slice: M = 500 × 30; (2) the last 36 slices from the 31st to 56th slice: M = 500 × 36; and (3) a full volume per subject from the 1st to 56th slice: M = 500 × 56, as shown in Table 1. The K-Means method was applied twice for foreground filtering and spine filtering in Equation (2) of Section 3.3 and Equation (8) of Section 3.5, respectively. For foreground filtering, the K-Means method of the three cases achieved mean scores of 0.2891, 0.2904 and 0.2903, respectively. For spine filtering, the K-Means method of the three cases achieved mean scores of 0.5367, 0.5356 and 0.5356, respectively. For overall foreground and spine filtering, the K-Means method for the three cases achieved mean scores of 0.4129, 0.4130 and 0.4130, respectively. We noticed that the K-Means method performed the spine clustering with higher scores than the foreground clustering. Additionally, the K-Means method achieved almost the same mean scores for all three cases in overall foreground and spine filtering.

Human Visual Inspection Evaluation
An intersection over union IoU of each subject, an overall accuracy OA and a mean intersection over union mIoU were calculated for overall segmentation, as shown in Equation (14). IoU = TP n TP n +FP n +FN n mIoU = 1 N ∑ IoU n OA = ∑ TP n M (14) where N is the number of subjects (i.e., 500 subjects) and n is the nth subject in N. M is the number of images. TP, FP and FN represent number of images of true positives, false positives and false negatives of the segmentation result, respectively. OA and mIoU evaluate the overall quality of the segmentation, and the IoU of each subject evaluates the quality of all images per subject.
Since our segmentation method does not have a ground truth for validating, we conducted a visual inspection manually in which the human error rate was assumed to be around 5%. If the segmented image consists of four chambers, the coronary arteries and the DA, it is considered as a whole-heart segmentation, as shown in Figure 5a,b,e,f,i,j. If the segmented image consists of four chambers and the coronary arteries but misses DA, it is considered as a four-chamber segmentation, as shown in Figure 5c,d,g,h,k,l. For the first 30 slices, the segmentation achieved high performance on both wholeheart and four-chamber segmentations with OA and mIoU values of 55.10%, 71.46%, 82.62% and 82.62% respectively, as shown in Table 2. Among the 500 subjects, there were 316 and 341 subjects whose IoU was higher than the mIoU for whole-heart and four-chamber segmentations, respectively, as shown in Figure 6a,b. Additionally, the minimum and maximum IoU were 0% and 100%, respectively. Among the 500 subjects, there were 58 and 14 subjects whose IoU was 0%, while there were 234 and 247 subjects whose IoU was 100% for whole-heart and four-chamber segmentations, respectively, as shown in Figure 6c,d.
For the last 36 slices, the segmentation achieved low performance on both whole-heart and four-chamber segmentations with OA and mIoU values of 8.37%, 12.60%, 10.35% and 15.42%, respectively. Among the 500 subjects, there were 198 and 215 subjects whose IoU was higher than the mIoU for whole-heart and four-chamber segmentation, respectively. Additionally, the minimum and maximum IoU are 0% and 69.23%, respectively. Among the 500 subjects, there were 222 and 180 subjects whose IoU was 0%, while there were 1 and 1 subjects whose IoU was 69.23% for whole-heart and four-chamber segmentation, respectively. Table 2. Evaluation results of 500 subjects (%). The OA, mIoU, IoU (min) and IoU (max) of three cases are described in both segmentations (whole-heart and four-chamber segmentation). Abbreviations: OA, overall accuracy; mIoU, mean intersection over union; IoU (min), minimum IoU; IoU (max), maximum IoU.  6. Number of subjects: (a) whose IoU was lower than the mIoU (blue) and higher than the mIoU (orange) for whole-heart segmentation; (b) whose IoU was lower than the mIoU (blue) and higher than the mIoU (orange) of four-chamber segmentation; (c) whose IoU was minimum (blue) and maximum (orange) for whole-heart segmentation; and (d) whose IoU was minimum (blue) and maximum (orange) for four-chamber segmentation.

OA
For the full volume, the segmentation achieved good performance on both wholeheart and four-chamber segmentations with OA and mIoU values of 34.90%, 41.26%, 50.91% and 54.10%, respectively. Among the 500 subjects, there were 283 and 315 subjects whose IoU was higher than the mIoU for whole-heart and four-chamber segmentation, respectively. Additionally, the minimum and maximum IoU were 0%, 0%, 85.45% and 89.13%, respectively. Among the 500 subjects, there were 51 and 11 subjects whose IoU was 0%, while there were 1 and 1 subjects whose IoU was 85.45% and 89.13% for whole-heart and four-chamber segmentation, respectively.

Discussion
Among the mean-thresholding methods proposed by [13][14][15][16][17], there is one paper by Larrey-Ruiz et al. [14] in which cardiac segmentation on 32 chest CT images was conducted by defining multiple threshold values. The thresholding was calculated by the mean value of statistical local parameters (i.e., pixels were in one slice) and global parameters (i.e., pixels were across all slices of a whole volume). Table 3 shows comparison results for the top 32 subjects of our proposed method with Larrey-Ruiz et al. [14] for whole-heart segmentation. For the first 30 slices, our results outperform Larrey-Ruiz et al. [14] in both OA and A (max) (maximum accuracy) with values of 100% and 100%, and 94.42% and 99.81%, respectively. For the full volume, Larrey-Ruiz et al. [14] outperforms our method in OA with values of 87.64% and 73.66%, respectively. However, our method outperforms Larrey-Ruiz et al. [14] in A (min) with values of 67.85% and 46.95%, respectively. Table 3. Comparison results of 32 subjects (%).

Whole-Heart
Our Method Larrey-Ruiz et al. [14] 1st to 30th slice-OA Among the unsupervised deep learning approaches proposed by [27][28][29], there is one paper by Joyce et al. [28] in which multi-class medical image segmentation was conducted on MYO, LV and RV regions on 20 CT volumes with a dice value (mDice) of 0.51. For the top 20 subjects, our result of a full volume for whole-heart segmentation outperforms Joyce et al. [28] with an mIoU of 0.7833, as shown in Table 4. Table 4. Comparison results of 20 subjects.

Conclusions
This paper presented semantic whole-heart segmentation combining K-Means clustering as a threshold criterion of the mean-thresholding method and the mathematical morphology method as a threshold shifting enhancer. The experiment was conducted on 500 subjects in two cases: (1) 56 slices per volume containing full heart scans, and (2) 30 slices per volume containing about half of the top of heart scans before the liver appears. In both cases, the results showed an average silhouette score of the K-Means method, with a value of 0.4130. Additionally, the experiment on 56 slices per volume achieved an OA and mIoU of 34.90% and 41.26%, respectively; while the performance result on the first 30 slices per volume achieved an OA and mIoU of 55.10% and 71.46%, respectively.
High performance was achieved when the heart was well surrounded by lungs. Otherwise, low performance was achieved. The low performance was likely caused by the lack of filtering of the liver, as both the HU pixels and geometrics of the organ could not be used as a criterion for thresholding. Additionally, the goal of the proposed research was to segment the whole heart. However, the results showed that the four-chamber segmentation outperformed the whole-heart segmentation. The outperformance was due to a failure in generating the convex hull of spine mask.
There are limitations in this study, such as the failure of DA segmentation and liver removal, but the main contributions of our proposed method can be summed up as the following: (1) we proposed fully unsupervised semantic whole-heart segmentation from chest CT images; (2) we proposed the K-Means method as a thresholding criterion and the mathematical morphology method as a threshold-shifting enhancer; and (3) we demonstrated good performance for the first 30 slices, which will be able to be used as an initial step for other cardiac applications in other CADs. Finally, the future direction of our research is to conduct an unsupervised deep learning approach to overcome the abovementioned limitations. Informed Consent Statement: Patient consent was waived due to the retrospective design of this study.

Data Availability Statement:
No new data were created in this study. Data sharing is not applicable to this article.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A. Chest CT Image
A computed tomography (CT) image [3,30] is a medical image that is the result of using X-ray CT scan measurements taken from different angles of a body. CT images allow the radiologist to identify disease or injury within various regions of the body without cutting. CT scans of the chest are mainly performed to gain knowledge about lung and heart anatomy, as shown in Figure A1. Figure A1a illustrates a contrast CT image in which the right lung is at the left position of the image, while the left lung is at the right position of the image. For the current study, we address a contrast CT image as a CT image. For example, the coronary CT calcium scan is used to assess CADs. A large amount of calcium pixels appearing in the coronary arteries can narrow the arteries and increase the risk of heart attack, as illustrated by the red arrow in Figure A1b. Typically, the scanning starts from the thorax and travels to the base of lungs, known as an axial examination [38]. The number of slices per person depends on CT scanner specifications. In the current study, the CT images were scanned using a Phillips iCT 256 (described in detail in Section 4.1). Since the number of slices in which the heart appeared in the chest CT images varies from person to person, 50 subjects were randomly selected and investigated, as illustrated in Figure A2. The investigation was tracked from the first slice until the slice that contains the base of heart. Among the 50 subjects, there were 12 subjects whose heart images ended at the 45th slice. Figure A2. Number of subjects whose heart image ended from the 40th to 56th slice.

Appendix B. Hounsfield Unit (HU)
The Hounsfield unit (HU) [39] is a quantitative scale obtained by the linear transformation of the measurement of attenuation coefficients. Its value describes the radiodensity of a CT image and highlights different substances in the human body. The value is scaled from -1000 HU for air to +2000 HU for very dense bone and over +3000 HU for metals. The HU of substances remain the same from person to person. The HU-presenting substances of the human body in chest CT images are listed in Table A1.  Figure A3 shows a frequency of HU for one CT image. The frequency histogram describes the following information: there are more than 20,000 pixels with HU < −1000, known as air; there are more than 15,000 pixels with −1000 < HU < −500, known as lungs; there are more than 20,000 pixels with −500 < HU < +500, known as muscles, fat, heart and more; there are a few pixels with HU > +500, known as bone.

Appendix C. Algorithms
The following algorithms were used in this manuscript. Table A2. Generation process of a convex hull of foreground mask.