1. Introduction
Epicardial adipose tissue (EAT) is located between the pericardium and the myocardium [1]. It has a variable distribution and is commonly found on the heart surface, in the atrioventricular and interventricular grooves, along the lateral wall of the right ventricle, and near the coronary arteries [2]. In many studies, it has been considered a source of inflammatory mediators and cytokines, and EAT volume has been associated with coronary artery disease [3]. EAT volume has also been evaluated as an imaging biomarker for the diagnosis of pathological states such as metabolic syndrome and visceral obesity [4,5,6].
Manual or semi-automatic measurement of EAT volume by radiologists, while feasible, is time-consuming and impractical for large-scale clinical practice. To address this limitation, automated EAT volume quantification methods have been proposed. EAT quantification is typically carried out on cardiac computed tomography (CT) scans; in this study, however, our focus lies specifically on EAT segmentation and quantification in non-contrast low-dose CT (LDCT) scans. This task is challenging for several reasons. EAT has an irregular, noncontinuous shape and a non-uniform spatial distribution. In addition, other fat depots, such as mediastinal fat, lie just outside the pericardium; in CT images, these structures are visually similar to EAT and located nearby, so the thin pericardial layer serves as a crucial reference for distinguishing EAT from them. One of the primary motivations for focusing on LDCT is its reduced contrast for coronary structures and its different slice thickness compared with standard-dose CT, which makes the segmentation and quantification of EAT in LDCT even more challenging. By addressing these complexities and providing a robust and accurate EAT segmentation methodology for LDCT, we aim to enhance the clinical utility of automated EAT volume quantification. Given the widespread use of LDCT in clinical practice, such advancements hold the potential to streamline cardiac assessments and improve patient care.
As for automatic EAT segmentation methods, some works [7,8,9,10,11] utilized non-contrast standard-dose CT or LDCT, while others [12,13,14,15] utilized coronary CT angiography (CCTA), which requires a higher radiation dose with contrast agent to ensure adequate image quality and capture fine details of the coronary arteries. Recently, there has also been work utilizing magnetic resonance imaging (MRI) [16]. However, many works provided insufficient information on the protocol used for EAT label acquisition. Furthermore, the variation in automated quantification of EAT volume across different CT acquisition protocols remains an open question: the influence of protocol settings on EAT segmentation and the resulting EAT volume is unclear. Access to implementation code or datasets is crucial for reproducibility and for comparison with the state-of-the-art. Due to variations in data set-up among studies, a strict comparison of reported accuracy values can be misleading. Therefore, the main objective of our research is to thoroughly evaluate and compare state-of-the-art segmentation methods using our extensive dataset comprising 154 low-dose CT scans. To assess the segmentation performance, we employ well-established metrics such as the Dice similarity coefficient (DSC), mean Intersection over Union (mIoU), sensitivity, and specificity. Additionally, considering the clinical relevance of EAT volume, we utilize Pearson correlation and Bland–Altman analysis to measure the interscan reproducibility of EAT measurement in low-dose CT scans [17]. Through this rigorous evaluation, we aim to contribute valuable insights into the efficacy of existing segmentation techniques and to improve the understanding of EAT assessment in the context of low-dose CT imaging.
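For clarity, the overlap metrics and the Bland–Altman quantities used in our evaluation can be computed directly from binary masks and paired volume measurements. The following is a minimal NumPy sketch (the helper names are ours and not part of any released implementation):

```python
import numpy as np

def overlap_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """DSC, IoU, sensitivity, and specificity for a pair of binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    dsc = 2 * tp / (2 * tp + fp + fn + eps)
    iou = tp / (tp + fp + fn + eps)
    sensitivity = tp / (tp + fn + eps)
    specificity = tn / (tn + fp + eps)
    return dsc, iou, sensitivity, specificity

def bland_altman(vol_a: np.ndarray, vol_b: np.ndarray):
    """Bias and 95% limits of agreement for paired EAT volume measurements."""
    diff = vol_a - vol_b
    bias = diff.mean()
    loa = 1.96 * diff.std(ddof=1)
    return bias, bias - loa, bias + loa
```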
1.1. Related Work
The early automatic EAT segmentation methods were based on classic mathematical models such as intensity-based approaches [18], region growing approaches, active contour approaches [19], and atlas-based approaches [20,21]. Later, machine learning-based methods with handcrafted features [7,8,22,23,24] or with clustering algorithms [13,25] were proposed for EAT segmentation. However, few of these methods are fully automatic. Most recently, deep learning (DL)-based approaches have shown success in many medical image segmentation tasks [26].
The rise of deep learning has brought more automation to EAT segmentation, as DL models can learn intricate features directly from images in an end-to-end manner instead of relying on manually designed and selected features. With advanced computing hardware such as graphics processing units (GPUs), state-of-the-art DL-based segmentation approaches have outperformed many traditional methods. The basis of DL-based segmentation is the convolutional neural network (CNN). A standard CNN consists of an input layer, functional hidden layers, and an output layer. Commonly used functional layers include 2D or 3D convolutional operations, pooling layers (e.g., max-pooling), normalization layers (e.g., batch normalization), activation functions (e.g., the rectified linear unit (ReLU)), transposed convolutional operations, and upsampling layers. The input of the network can be CT images in 2D or 3D format with the corresponding labels. The output of the network is a matrix in which each element is either a probabilistic score or a category index. For medical image segmentation in general, the U-Net structure [27], the more advanced 3D U-Net [28], the V-Net [29], the attention U-Net [30] with attention gates, and U-Net++ [31] with dense connections and deep supervision [32] together form a U-Net family that has gained popularity and success. Many recent works on EAT segmentation were based on or inspired by the U-Net family.
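As an illustration of the building blocks listed above, and not of the exact architectures used in the cited works, a typical 3D convolution/batch-normalization/ReLU block and an encoder stage with a skip connection could be written in PyTorch as follows:

```python
import torch
import torch.nn as nn

class ConvBlock3D(nn.Module):
    """Two 3D convolutions, each followed by batch normalization and ReLU."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class EncoderStage(nn.Module):
    """Convolution block followed by max-pooling; the pre-pooled features feed the skip connection."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = ConvBlock3D(in_ch, out_ch)
        self.pool = nn.MaxPool3d(kernel_size=2)

    def forward(self, x):
        skip = self.conv(x)
        return self.pool(skip), skip
```

In the decoder path, transposed convolutions or upsampling layers restore the spatial resolution, and the stored skip features are concatenated before the next convolution block.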
A fully automated EAT segmentation model proposed by Commandeur et al. [33,34] utilized a multi-task fully convolutional neural network with a statistical shape model. Santini et al. [35] applied a standard U-Net to EAT segmentation with a dataset of 119 CTs, and Zhang et al. [9] utilized two U-Nets and morphological layers to achieve better EAT segmentation results. One of the most recent works, by He et al. [12], proposed a 3D U-Net architecture with attention gates (AG) and deep supervision for EAT segmentation. Attention gates play a crucial role in prioritizing essential regions within an image during the segmentation process, enhancing the model's accuracy. The attention mechanism is a widely utilized approach, especially in tasks such as cardiac image segmentation [36,37]. Many works showed quantitatively promising performance, but some were trained and evaluated on a very small dataset [7,9] or on private datasets [12,33,34]. Moreover, the labeling protocol varies across works; commonly used labels include the region inside the pericardium, the contour/surface of the pericardium, pixel-wise EAT labels, and the contour/surface of EAT regions.
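To make the role of attention gates more concrete, a simplified 3D version of the additive attention gate used in the attention U-Net [30] is sketched below; the exact projections and resampling choices vary between implementations, so this is only an assumed minimal form:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate3D(nn.Module):
    """Additive attention gate: skip features x are reweighted by a coarser gating signal g."""
    def __init__(self, x_ch: int, g_ch: int, inter_ch: int):
        super().__init__()
        self.theta_x = nn.Conv3d(x_ch, inter_ch, kernel_size=1)
        self.phi_g = nn.Conv3d(g_ch, inter_ch, kernel_size=1)
        self.psi = nn.Conv3d(inter_ch, 1, kernel_size=1)

    def forward(self, x, g):
        # Resample the projected gating signal to the spatial size of the skip features.
        g_up = F.interpolate(self.phi_g(g), size=x.shape[2:],
                             mode="trilinear", align_corners=False)
        # Additive attention: ReLU of the sum, then a sigmoid attention map in [0, 1].
        attn = torch.sigmoid(self.psi(F.relu(self.theta_x(x) + g_up)))
        return x * attn  # suppress irrelevant regions, keep salient ones
```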
1.2. Contributions
In this study, we comprehensively evaluated state-of-the-art segmentation methods for EAT using 154 non-contrast LDCT scans. The LDCT scans were part of the coronary artery calcium CT scans from the Risk Or Benefit IN Screening for CArdiovascular diseases (ROBINSCA) trial [38]. We provided two types of labels for the LDCT scans: (a) the region inside the pericardium and (b) pixel-wise EAT labels. The LDCT scans were ECG-triggered and acquired with low radiation doses, ensuring patient safety. We provide clear and sufficient information about the data and label acquisition protocol in Section 2.1. To achieve representativeness and diversity, we selected four methods from the U-Net family as the EAT segmentation models: 3D U-Net [28], 3D attention U-Net [30], U-Net++ [31], and the recent deep attention U-Net (DAU-Net) [12]. The model structures, along with the attention gates and deep supervision modules in these models, are explained in detail in Section 2.2. For rigorous evaluation, we utilized four-fold cross-validation and hold-out tests in Section 3. The segmentation results were analyzed using the Dice similarity coefficient, the mean Intersection over Union, sensitivity, specificity, and visualization. Furthermore, we performed a quantitative analysis of EAT volume using Pearson correlation and Bland–Altman analysis in Section 3.3. In Section 4, we delve into crucial aspects, including label types, domain knowledge, patch size, training time, deep supervision, and evaluation, to gain deeper insights into EAT segmentation using deep learning. Additionally, we propose potential future directions regarding domain knowledge, data unification, benchmarking, and deep learning techniques, aiming to enrich the contributions of this research. Our comprehensive evaluation and in-depth discussion contribute valuable insights to the field of EAT segmentation, and we believe this work will advance the application of deep learning techniques in cardiac image analysis and clinical practice.
4. Discussion
The performance comparison of the state-of-the-art methods for EAT segmentation revealed some interesting findings: (1) in general, the neural networks trained with the pericardium masks showed better segmentation and quantification results; and (2) U-Net++ trained with the pericardium masks outperformed the other models in both segmentation and quantification. Furthermore, the full pipeline, from data collection to evaluation, involves numerous intricate details. Here, we discuss some points that we believe are crucial to EAT segmentation and quantification.
Label types: To the best of our knowledge, two types of labels are used in works on EAT segmentation: (1) pixel-wise or voxel-wise label maps (e.g., in [7,33]), and (2) contours or outlines of the EAT (e.g., in [12,20]). There are various ways to obtain these labels, but, as most works are based on unpublished data, the labeling protocols are not known in detail. Based on our observations of published example figures with labels and our own extensive labeling experience, there is no ideal label type for EAT segmentation. Regarding pixel-wise or voxel-wise label maps, it is important to consider the physical characteristics of CT scans: noise points within the pericardium can be erroneously classified as EAT, and although most of this noise can be removed with denoising techniques, it is hard to remove all of it. Contours and outlines may focus better on the regions containing fat tissue. However, because contours themselves occupy pixels or voxels, small regions with very thin layers of fat may be neglected during labeling. In addition, the delineation of complex structures can be extremely time-consuming in 3D data such as CT.
From the evaluation perspective, the different label types lead to different errors. Pixel-wise and voxel-wise labels contain noise, which may lead to more false negatives in the prediction; contour and outline labels may miss some fat tissue, which may lead to more false positives in the prediction.
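To illustrate how a pericardium mask and a fat attenuation window can be combined into a voxel-wise EAT mask with simple noise removal, a minimal sketch is given below; the HU window and minimum component size are common choices from the literature and not necessarily those of our labeling protocol:

```python
import numpy as np
from scipy import ndimage

def eat_from_pericardium(ct_hu: np.ndarray, pericardium: np.ndarray,
                         hu_range=(-190, -30), min_voxels: int = 10) -> np.ndarray:
    """Voxels inside the pericardium whose attenuation falls in a fat window."""
    fat = (ct_hu >= hu_range[0]) & (ct_hu <= hu_range[1])
    eat = fat & pericardium.astype(bool)
    # Remove tiny isolated components that are likely noise rather than fat tissue.
    labels, n = ndimage.label(eat)
    sizes = ndimage.sum(eat, labels, range(1, n + 1))
    keep = np.isin(labels, np.where(sizes >= min_voxels)[0] + 1)
    return eat & keep
```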
Attention mechanism: Compared to the results obtained from other methods, the performance of 3D attention U-Net demonstrates relatively lower accuracy and exhibits larger variance. While attention mechanisms can be advantageous in focusing on informative regions and enhancing performance in certain scenarios, they may also introduce additional complexity and elevate the risk of overfitting. Moreover, the effectiveness of attention mechanisms heavily relies on the dataset characteristics and the specific target structures being segmented. In the case of EAT segmentation in LDCT, where the target object has low visibility, the attention mechanism may lead to more errors. In Figure 6, it is evident that most mistakes made by the attention-based model are false positives below the pericardium, which aligns with the results obtained from the DAU-Net. Hence, the attention mechanisms might pose challenges in excluding structures below the pericardium accurately.
Domain knowledge: As the goal of an EAT segmentation network is to find the exact voxel-wise locations of EAT, ideally, it should mimic the manual segmentation process performed by radiologists. Thus, anatomical knowledge about EAT is a crucial reference for segmentation. By using pericardium mask labels, the network is forced to learn knowledge about the pericardium, which is the foundation for distinguishing EAT from adjacent similar structures such as mediastinal fat. Apart from our dataset, some works on EAT segmentation used labels that included pericardium knowledge as well [7,20]. However, most works did not incorporate this knowledge deeply into deep neural networks. From a recent review of anatomy-aided deep learning for medical image segmentation [43], we noticed that there are many possible ways to incorporate anatomical information into deep learning. A well-integrated network could take advantage of both anatomical knowledge and data-driven deep learning methods.
Patch size: For models using 3D convolution operations, the patch size is a key hyperparameter for training. Due to the large size of medical images and the memory limitations of GPUs, it is usually not feasible to process a whole 3D image as one input to the input layer. The common solution is to set a patch size for the input layer: the large 3D images are divided into multiple patches that are processed separately during training. As the patch size directly influences the size of the feature maps and the amount of contextual information available during training, it influences the segmentation results significantly. From our experiments and related papers, we noticed that, in general, a larger patch size leads to better results. However, a larger patch size may also lead to a much higher computational cost, so there is a trade-off between training performance and efficiency when choosing the patch size. In this paper, we selected a relatively large patch size.
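A minimal sketch of patch-based input extraction is given below; the default patch size in the code is a placeholder, not the value used in our experiments, and the sketch assumes the volume is at least as large as the patch:

```python
import numpy as np

def random_patch(volume: np.ndarray, label: np.ndarray, patch_size=(96, 160, 160)):
    """Randomly crop a fixed-size training patch from a 3D volume and its label map."""
    starts = [np.random.randint(0, max(v - p, 0) + 1)
              for v, p in zip(volume.shape, patch_size)]
    slices = tuple(slice(s, s + p) for s, p in zip(starts, patch_size))
    return volume[slices], label[slices]
```

At inference time, the complementary step is a sliding-window pass over the whole volume, with the per-patch predictions stitched (and usually averaged in overlapping regions) back into a full-size segmentation.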
Training time and inference time: For all experiments, we set the maximum number of training epochs to 500, with early stopping. With the exception of U-Net++, which stopped earlier, the models were trained until the 500th epoch. The training time for the U-Net++ models (which usually stopped at around the 80th epoch) was between 12 and 17 h, while the training time for 3D U-Net, attention U-Net, and DAU-Net was between 5 and 8 days, due to the different amounts of training data and different label types. The inference times were relatively similar across models, ranging from 4.33 s to 6 s per sample. To verify whether training for a longer time would increase the performance, we trained 3D U-Net models in the hold-out test set-up for 1000 epochs (Table 4) with both label types. All models showed improved performance; in particular, the model trained with EAT labels showed a clear improvement. However, the computational cost was very high, as the training time doubled.
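The early-stopping behaviour described above can be implemented with a simple patience counter; the following sketch uses illustrative patience and tolerance values rather than the exact settings of our training runs:

```python
class EarlyStopping:
    """Stop training when the validation loss has not improved for `patience` epochs."""
    def __init__(self, patience: int = 30, min_delta: float = 1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.counter = 0

    def step(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience  # True means training should stop
```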
Deep supervision: During the experiments, we noticed that deep supervision strongly influenced the decrease in training loss and the models' performance. To verify the effects of deep supervision, we trained 3D attention U-Net models with and without deep supervision (Table 5).
To visualize the predictions of the 3D attention U-Net, we show its segmentation results for the region inside the pericardium with and without deep supervision in Figure 9. Without deep supervision, the segmentation focused too much on the inner region and missed some pixels around the pericardium. This sharply reduced the performance of EAT segmentation, as most EAT is located near the pericardium.
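For concreteness, deep supervision is commonly implemented as a weighted sum of losses over decoder outputs at several resolutions; the sketch below is one such formulation with illustrative weights and a binary cross-entropy term, not necessarily the exact scheme of the evaluated models:

```python
import torch
import torch.nn.functional as F

def deep_supervision_loss(outputs, target, weights=(1.0, 0.5, 0.25)):
    """Weighted sum of losses over decoder outputs at decreasing resolutions.

    `outputs[0]` is the full-resolution logit map; deeper outputs are coarser.
    `target` is assumed to be a float binary mask of shape (N, 1, D, H, W).
    """
    total = 0.0
    for out, w in zip(outputs, weights):
        # Downsample the target to the resolution of the auxiliary output.
        tgt = F.interpolate(target, size=out.shape[2:], mode="nearest")
        total = total + w * F.binary_cross_entropy_with_logits(out, tgt)
    return total
```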
Evaluation: As some models are trained with 3D inputs and some with 2D inputs, the evaluation can sometimes be tricky. From previous works on EAT segmentation, we noticed that most of the early works were based on 2D CT slices [7,13,19,20,21], while some recent works were based on 3D CT images [9,12]. At the evaluation stage, when computing the mean values of the evaluation metrics, early works treated one 2D slice as a sample [7], while some recent works treated one 3D image as a sample [12]. Therefore, there is a numerical difference due to the different computation settings. We tested both ways of computing mean values for our trained U-Net++ models: the mean values computed over 2D samples were slightly higher than those computed over 3D samples. However, considering that one 3D sample corresponds to one patient, all the evaluation metrics in our paper were calculated based on 3D samples.
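The numerical difference between the two conventions follows directly from how the mean is taken: per-slice averaging weights every 2D slice equally, whereas per-volume averaging weights every patient equally. A small sketch of both, assuming the first array axis is the slice dimension:

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum() + eps)

def mean_dice_per_slice(preds, gts):
    """Average over all 2D slices of all volumes (each slice counts once).
    Note: slices with empty ground truth are included here; some works exclude them."""
    return np.mean([dice(p[z], g[z])
                    for p, g in zip(preds, gts) for z in range(p.shape[0])])

def mean_dice_per_volume(preds, gts):
    """Average over 3D volumes (each patient counts once); the convention used in this paper."""
    return np.mean([dice(p, g) for p, g in zip(preds, gts)])
```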
Future Work
From the related literature and our experiments, we believe that there is considerable potential for automatic EAT segmentation. Here, we list some directions that we hope will be helpful for further research.
Domain knowledge: From the previous works, we noticed a lack of domain knowledge in the research on EAT segmentation and quantification. Many models for EAT segmentation simply applied an existing segmentation network directly to the EAT data. Thus, the domain knowledge from radiology and the anatomical uniqueness of EAT are ignored. We believe that there are more possibilities if deep learning techniques and domain knowledge could be integrated more deeply.
Data unification: Due to the large variety of data acquisition protocols, comparisons between research papers on this topic are difficult. A unified labeling and data acquisition protocol could reduce this barrier and increase the reproducibility of future works.
Benchmark: Compared to popular medical image segmentation tasks such as brain tumor or lung nodule segmentation, there are very limited publicly available data or benchmarks with EAT labels. Thus, comparison between methods is hard and reproducibility is low. A public benchmark could solve this problem and improve the visibility and popularity of EAT segmentation.
Deep learning techniques: Recently, the deep learning techniques for segmentation have improved rapidly. Many techniques such as generative adversarial networks, physics-informed deep learning, and graph neural networks have shown success in many other segmentation tasks [44], but have not yet been applied to EAT segmentation.