Artificial Intelligence in Coronary Artery Calcium Scoring

Cardiovascular disease (CVD), particularly coronary heart disease (CHD), is the leading cause of death in the US, with a high economic impact. Coronary artery calcium (CAC) is a known marker for CHD and a useful tool for estimating the risk of atherosclerotic cardiovascular disease (ASCVD). Although CAC scoring (CACS) is recommended for informing the decision to initiate statin therapy, the current standard requires a dedicated CT protocol, which is time-intensive and contributes to radiation exposure. Non-dedicated CT protocols can be leveraged to visualize calcium and reduce overall cost and radiation exposure; however, they mainly provide visual estimates of coronary calcium and have disadvantages such as motion artifacts. Artificial intelligence is a growing field involving software that independently performs human-level tasks and is well suited to improving CACS efficiency and repurposing non-dedicated CT for calcium scoring. We present a review of current studies on automated CACS across various CT protocols and discuss points of consideration in clinical application as well as barriers to implementation.


Introduction
Coronary heart disease (CHD) is the most common type of cardiovascular disease (CVD) in the United States, and it accounted for 696,937 deaths in 2020 [1]. In addition, approximately USD 407.3 billion in CVD-associated costs were incurred in the US between 2018 and 2019. Considering that around 50% of CVD-associated deaths occur in asymptomatic individuals or patients without prior CVD diagnoses, atherosclerotic CVD (ASCVD) risk stratification and modification are essential for CVD prevention and the subsequent reduction of disease costs, mortality, and major adverse cardiovascular events (MACE), as well as for early diagnosis [2]. Coronary artery calcium (CAC) is an established indicator of coronary artery disease (CAD) that is associated with disease burden and ASCVD risk and is quantified by ECG-gated non-contrast computed tomography (NCCT) [3,4]. CAC scoring (CACS) by ECG-gated NCCT is a total measure of calcium density and plaque attenuation, which may be used to inform the decision for therapy initiation [4]. The Agatston method for CACS estimates coronary artery calcium lesion scores, defined as the product of the total area of calcified plaque (in square millimeters) and the peak calcium density weighting score [5]. Absolute CAC scores are then assigned to each coronary vessel based on the maximal Hounsfield units (HU) and may be associated with a visual score such that 0 = none, 1-10 = minimal, 11-100 = mild, 101-400 = moderate, and >400 = severe. Visual scores are subjective estimations of CAC based on visual analysis of non-ECG-gated scans [6].
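As a concrete illustration of the Agatston scheme described above, the following sketch computes per-lesion scores and maps a total score to the severity categories listed. It is purely illustrative: real scoring software derives lesion area and peak HU from segmented voxels rather than taking them as inputs, and the function names are our own.

```python
def density_weight(peak_hu):
    """Standard Agatston density weighting based on a lesion's peak attenuation."""
    if peak_hu < 130:
        return 0  # below the conventional calcium threshold
    if peak_hu < 200:
        return 1
    if peak_hu < 300:
        return 2
    if peak_hu < 400:
        return 3
    return 4

def agatston_lesion_score(area_mm2, peak_hu):
    """Lesion score = calcified plaque area (mm^2) x peak-density weight."""
    return area_mm2 * density_weight(peak_hu)

def risk_category(total_score):
    """Map a total Agatston score to the severity categories in the text."""
    if total_score == 0:
        return "none"
    if total_score <= 10:
        return "minimal"
    if total_score <= 100:
        return "mild"
    if total_score <= 400:
        return "moderate"
    return "severe"

# Example: two lesions in one vessel (12 mm^2 at 250 HU, 4 mm^2 at 410 HU)
total = agatston_lesion_score(12.0, 250) + agatston_lesion_score(4.0, 410)
print(total, risk_category(total))  # 40.0 mild
```

The summation over lesions (and then over vessels) yields the total Agatston score that the guidelines' thresholds (e.g., CACS ≥ 100) refer to.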
Several society guidelines, including those of the American College of Cardiology/American Heart Association (ACC/AHA), European Society of Cardiology, and National Lipid Association, have recommended the use of CACS to guide preventive therapy initiation in specific populations [7]. The ACC/AHA recommends CACS in patients with intermediate risk, and in some with borderline risk, when there is uncertainty regarding risk-based preventive therapy; statin initiation is recommended when CACS ≥ 100 [8]. Other risk stratification tools, such as the Multi-Ethnic Study of Atherosclerosis (MESA) risk score, a race-, age-, and sex-adjusted score that incorporates CACS with other known CAD risk factors, have been shown to improve the estimation of 10-year CVD risk. However, the widespread use of CACS is still limited by several factors, including its lack of insurance coverage, the need for a dedicated ECG-gated NCCT protocol, and the need for specialized imaging lab personnel, particularly in high-volume imaging labs and institutions. Although the application of non-dedicated CTs (CTs acquired for purposes other than coronary artery calcium scoring) to manual CACS and calcium visual estimation has been described by several authors as correlating well with Agatston scoring, it is not the current standard of practice [9][10][11].
As artificial intelligence (AI) continues to play an increasing role in medicine and radiology, it offers unique advantages for CACS and CVD risk estimation [12][13][14]. Beyond improving the time-consuming, labor-intensive, and repetitive tasks of calcium segmentation and quantification, AI offers the possibility of acquiring quantitative CACS from non-dedicated routine chest CT protocols acquired for other purposes, including low-dose CT (LDCT), cardiac CT angiography (CCTA), and positron emission tomography (PET)/CT attenuation scans. Non-dedicated CT use also has the advantage of decreased cumulative radiation exposure. A few commercially available semi-automatic and automatic CACS software products exist, some of which were utilized in the studies discussed below (Table 1). Although semi-automatic tools are useful, they still require manual allocation of automatically segmented and detected calcium; they are therefore user-dependent and time-consuming.

Deep Learning and Artificial Neural Networks
The multi-layered neural networks of deep learning (DL) algorithms learn from inputs of complex datasets and automatically produce outputs by recognizing intricate data relationships that might otherwise be missed at a human level. DL techniques are widespread and commonly applied in speech recognition, computer vision, and language processing. In recent years, DL has been increasingly applied across various cardiovascular imaging modalities for tasks including image acquisition, reconstruction, segmentation, analysis, and prognosis [16]. In particular, convolutional neural networks (CNNs) are the most commonly used in cardiovascular imaging and automated CACS. CNNs comprise several specialized network layers that identify image-based features and can create feature maps that aid the prediction of the final model output [17].
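As a minimal illustration of the feature-map idea, the sketch below slides a hand-written edge kernel over a toy image (technically a cross-correlation, which is what CNN layers compute in practice). It is illustrative only: a real CNN learns its kernels from training data and stacks many such layers with nonlinearities.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation: slide the kernel over the image
    and produce a feature map, the basic operation of a CNN layer."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy "CT slice": a bright (high-attenuation) block on a dark background
img = np.zeros((6, 6))
img[2:4, 2:4] = 1.0

# A hand-written edge kernel; a trained CNN would learn such filters from data
edge = np.array([[1.0, -1.0]])
fmap = conv2d(img, edge)
print(fmap.shape)  # (6, 5) -- the feature map highlights the block's edges
```

Each layer's feature maps become the next layer's input, which is how a CNN builds up from simple edges to calcification-like structures.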

Studies of CACS Automation
A fully automated CACS software product should be able to perform coronary artery segmentation, coronary calcium analysis, and coronary artery labeling (Figure 1). Given the role of CACS as a risk stratification tool, the process of automatic CACS model development and testing should demonstrate excellent generalizability and the ability to assess ASCVD risk across a diverse population. Models that are limited in dataset size and diversity tend to exhibit overfitting and poor generalizability, a situation in which they perform well with training data but are unable to make accurate predictions with test data; these models underperform in the real world [17,18]. The training data typically form the majority of the entire dataset and are utilized for model development, while the test data are used to assess model performance after training [17]. Consequences of poor generalizability include reduced accuracy and greater variability in model sensitivity and specificity when tested against external datasets. For instance, one CNN algorithm developed for hepatic fibrosis categorization on ultrasound images demonstrated an accuracy of 83.5% when tested against the internal dataset; however, the accuracy dropped to 76.4% with external testing, reflecting the model's inability to generate a reliable output when small changes were made to its input data, such as a new or external testing dataset [19]. Therefore, the assessment of model generalizability through external validation is essential. The development of diverse multi-detector and multi-institutional data banks will be necessary for building clinically relevant AI risk assessment models.
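The train/test partitioning described above can be sketched as a simple holdout split. This is an illustrative fragment: the 80/20 fraction and the `holdout_split` helper are our own assumptions, not taken from any cited study.

```python
import numpy as np

def holdout_split(n_samples, test_frac=0.2, seed=0):
    """Shuffle sample indices and hold out a fraction for testing; the
    remainder (typically the majority, as in the text) is used for training."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_frac))
    return idx[n_test:], idx[:n_test]  # (train indices, test indices)

train_idx, test_idx = holdout_split(100)
print(len(train_idx), len(test_idx))  # 80 20
```

In practice, internal test sets drawn this way share the acquisition characteristics of the training data, which is exactly why a separate external dataset is needed to detect the generalizability gap described above.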

Beyond the major task performances, the reduction of false positives and appropriate reference standards are necessary for reliable model development. Proposed models should have the ability to focus on target coronary anatomy to avoid misclassification of aortic and valvular calcifications as coronary calcifications [20][21][22]. Various approaches have been described for region-of-interest localization and segmentation, including conventional manual segmentation methods, bounding boxes with manual lesion localization via coordinates, and automatic coronary extraction [20,23,24]. Additionally, though gated and non-gated NCCT CACS have been shown to have good agreement, they may differ in mean absolute scores, so manually derived gated CACS are preferred as a reference standard for comparing automated scores. However, in models trained on non-dedicated CT, reference manual scores from non-dedicated CT also allow the independent evaluation of model accuracy.
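The bounding-box idea can be sketched as follows: candidate calcium voxels (at or above the conventional 130 HU threshold) are kept only inside a cardiac region of interest, so that, for example, descending-aorta calcifications outside the box are discarded. The toy slice and box coordinates below are hypothetical; real pipelines obtain the box from a localization network.

```python
import numpy as np

# Synthetic axial slice in Hounsfield units: soft-tissue background (~40 HU)
slice_hu = np.full((10, 10), 40.0)
slice_hu[4:6, 4:6] = 300.0   # coronary calcification, inside the heart
slice_hu[0:2, 8:10] = 400.0  # descending-aorta calcification, outside the box

def calcium_mask(hu, bbox, threshold=130.0):
    """Keep voxels >= threshold HU, but only inside a cardiac bounding box
    (r0, r1, c0, c1); high-attenuation voxels outside the box are discarded."""
    r0, r1, c0, c1 = bbox
    mask = np.zeros_like(hu, dtype=bool)
    mask[r0:r1, c0:c1] = hu[r0:r1, c0:c1] >= threshold
    return mask

heart_box = (3, 8, 3, 8)  # hypothetical output of a cardiac localization step
mask = calcium_mask(slice_hu, heart_box)
print(mask.sum())  # 4 coronary voxels kept; the aortic voxels are excluded
```

The same principle generalizes to 3D volumes and to dedicated segmentation CNNs, which replace the rectangular box with an anatomical cardiac mask.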

ECG-Gated and Non-Gated NCCT
Over time, various approaches, model types, and datasets have been proposed for CACS automation with varying results (Table 2). One study, by Eng et al., proposed two CNN-based models for automatic scoring using gated coronary CT and non-gated chest CT, respectively (Figure 2) [25]. Automatic scores from both models demonstrated an almost-perfect correlation with manual scoring (Cohen's kappa 0.89, p < 0.0001). The non-gated model also had very good sensitivity (71-94%) and positive predictive value (PPV) (88-100%), and the gated model had a faster average scoring time (3.5 ± 2.1 s vs. 261 s).
Other researchers have proposed models trained on gated CT images and compared them to gated CT-derived reference manual scores. For instance, Ihdayhid et al. used a CNN-based model developed by Artrya Ltd. for automatic CACS in gated NCCT scans [22]. The model included separate CNNs for aortic and cardiac segmentation to distinguish aortic from coronary calcifications and decrease the false positive rate of the proposed model. Overall, the proposed model efficiently produced automatic scores that had excellent agreement with manual scoring (κ = 0.90 [95% CI, 0.88-0.9], p < 0.001). While most studies used one or two CNN-based models, a multicenter, multi-scanner, and multivendor study trained and tested an ensemble of five 3D U-Net models [37]. The ensemble model trained on all data subsets achieved the best Cohen's kappa (0.894), accuracy (85.7%), and intraclass correlation coefficient (ICC) (0.970) relative to single models and the ensemble model trained on partial data subsets. This study further highlighted the benefit of a large training dataset size on model performance. Two separate studies compared their proposed automatic tools with established semi-automatic software and had good outcomes. The first compared a proposed automatic software product with a reference semi-automatic software product, syngo.via, and showed excellent correlation by Pearson's coefficient (0.935) and ICC (0.996); furthermore, good risk category assignment was also achieved (κ = 0.919) [38]. Similarly, another study proposed a model based on an improved U-Net structure, known as U-Net++, and compared it to syngo.CT CaScoring, a semi-automatic software product [36]. The model was tested on two datasets, with and without clinically detected CAC. The model achieved excellent ICC (1.0) for Agatston, volume, and mass scores relative to semi-automated CACS, as well as excellent kappa for risk categorization.
A 2021 study by Xu et al. was unique in its comparison of automatic scoring from non-gated chest CTs of varying slice thicknesses, 1 mm and 3 mm [23]. Manual gated NCCT was used as a reference, and though both slice thicknesses had good ICC (1 mm = 0.9 [95% CI, 0.85-0.93] and 3 mm = 0.94 [95% CI, 0.92-0.96]), PPV (1 mm = 90% vs. 3 mm = 93%), accuracy (1 mm = 88% vs. 3 mm = 94%), and κ (1 mm = 0.72 and 3 mm = 0.82), automatic scores from the 3 mm scans demonstrated slightly better values overall. Although this study showed improved accuracy with thicker-slice scans, further analyses are required to determine the optimal thickness for model accuracy and noise reduction, considering that noise increases with thickness and contributes to calcium score misclassification.
A quality improvement project by Sandhu et al. derived automated CACS from previously acquired non-gated NCCT and showed 51% statin prescription for primary prevention at 6 months when primary care physicians were notified of CAC presence, versus 7% without notification [32]. The study also demonstrated a significant decrease in low-density lipoprotein and an increase in further non-invasive CAD testing in the notification arm. The study highlights the potential for improved preventative care with AI, non-dedicated CT, and results notification; further studies on its impact on MACE would be enlightening. A cardiovascular outcomes study utilized automated CAC derived from non-gated NCCT and determined that CAC ≥ 100 was associated with a higher risk of death, as well as a composite of death/MI/stroke and death/MI/stroke/revascularization, relative to CAC = 0 [33]. Furthermore, patients were being undertreated with statins.

PET/CT Attenuation Correction
Non-gated CT attenuation correction scans in PET imaging have also been used for automatic CACS. Pieszko et al. proposed a DL model for automated CACS from CT attenuation correction (CTAC) scans acquired with cardiac PET imaging [26]. There were no significant differences in net reclassification, and the model showed a stepwise increase in MACE across the CACS groups. Similarly, standard CACS and DL-based CTAC scores had similar negative predictive values (NPV) for MACE at 85% and 83%, respectively.
One study used a U-Net-based model, AVIEW CAC by Coreline Soft, for automatic CACS in ungated CT scans from 100 patients who underwent 18-FDG PET/CT [27]. Patients in this study also underwent ECG-gated CT scans within 6 months of PET/CT that were manually scored and compared to the AI-based CACS from ungated CT. Though the model had an almost-perfect risk categorization κ, it underestimated the CAC category in 42% of cases, and 44% had to be recategorized.
Another group presented a model for automatic CAC quantification in CCTA scans [24]. The volume of interest was first automatically determined by placing a "bounding box" around the heart to exclude extracardiac calcification. Four ConvPairs were generated based on dimensionality (2.5D and 3D) and input patch size (15 or 25 voxels) and used for voxel classification; overall, the best-performing ensemble had a 71% sensitivity and 0.48 false positive errors per scan. The model also had excellent ICC and an almost-perfect κ relative to the reference gated NCCT mass score.

Low-Dose Chest CT and Transthoracic Echocardiogram
LDCT has also been applied towards automatic CAC scoring with good outcomes. One study analyzed automatic CACS using AVIEW CAC by Coreline Soft in LDCT and gated CT across three institutions [28]. Compared to manual scores from both scan types, the LDCT scores showed substantial to almost-perfect κ, while gated CT scores showed almost-perfect κ across all institutions. LDCT also had a higher false positive rate, which may have been due to greater noise and artifacts or poorer model performance in the presence of lymph node calcifications. Another study assessed automated CACS, using AVIEW by Coreline, as a predictor of 12-year all-cause mortality in LDCT and showed an incremental association between CACS and all-cause mortality [29].
Yu et al. used a commercially available automatic CACS software product, CAC-Score Doc by ShuKun Technology, that was trained and tested on non-gated chest CT for CAC quantification [35]. The automatic software was compared to semi-automatic software, syngo.CT CaScoring by Siemens, and to manual scoring.
Interestingly, a study by Yuan et al. employed a video-based CNN model for binary CACS prediction from transthoracic echocardiograms (TTE) [30]. The model was able to discriminate zero CAC from high CAC (≥400 Agatston units) with very good AUC and F1 scores for both zero and high CAC. Though binary CACS is less informative in guiding therapy initiation due to the loss of information on patients with CACS between 0 and 400, the study shows promise for a potential role for TTE in automated CACS.

Multiple CT Protocols
Using data from four large cohort studies, Zeleznik et al. trained a model for cardiac segmentation and CAC quantification using manually segmented ECG-gated CTs and tested the model with both gated and low-dose chest CTs [31]. There was excellent correlation between the automatic and manual calcium score groups; however, the difference in AUC for event prediction between the two was not statistically significant. A CNN-based algorithm proposed by Van Velzen et al. was tested on a diverse set of CT protocols, including standard ECG-gated CT, LDCT, CTAC, diagnostic chest CT, and radiation therapy planning CTs, with excellent ICC for CAC volume and overall κ [34].

Workflow Optimization
Presently, several institutions are employing CACS automation algorithms in clinical settings for use in routine non-dedicated CTs and standard gated NCCTs. AI can improve the workflow in 3D imaging post-processing labs, particularly at high-volume centers. Automatic CACS models can improve efficiency and resource utilization by decreasing time spent on segmentation and coronary annotation. Though manual double-checks will be required to prevent oversight, the combined time for automatic scoring and double-checking has been shown to be shorter than manual scoring alone. Consequently, radiologists can spend less time overall by mainly verifying the automated results, and radiology technicians can be repurposed for other post-processing tasks. This will also aid in cost reduction in imaging labs. In addition, AI can reduce the high inter-observer variability seen with lower-quality non-dedicated scans.

Image Considerations
Overall, the problem of miscategorization and false positive calcifications persists in CACS automation and has been attributed to artifacts, noise, and non-coronary calcifications (aortic, valvular, and lymph node calcifications) [20,22]. Denoising, region-of-interest localization, and segmentation are potential areas of intervention to improve image quality and reduce false positive calcifications. Multiple studies have demonstrated the superiority of iterative reconstruction (IR) over filtered back projection (FBP) for noise reduction during image reconstruction [39,40]. One study showed decreased radiation dose, a higher signal-to-noise ratio (SNR), and fewer false positive calcifications with low-dose IR by establishing noise thresholds; low-dose IR exhibited better performance than FBP [41]. Furthermore, DL-based image reconstruction (DLIR) may be an even better alternative to IR and FBP. However, one study showed that while incremental DLIR was associated with a stepwise decrease in noise and an increase in SNR, there was a tendency for significant incremental decreases in CACS and CAC volume [42]. As such, further research should be conducted on applying noise reduction to DL reconstruction to identify the optimal noise reduction level through DLIR without compromising quantitative CACS.
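A quick simulation illustrates why noise drives false positive calcifications: as image noise grows, more calcium-free soft-tissue voxels cross the 130 HU calcium threshold. The Gaussian noise model and the specific noise levels below are illustrative assumptions, not measurements from any cited study.

```python
import numpy as np

rng = np.random.default_rng(1)
soft_tissue = np.full(100_000, 40.0)  # calcium-free voxels at ~40 HU

# Fraction of voxels crossing the 130 HU calcium threshold at each noise level
fp_rates = []
for sigma in (10.0, 30.0, 50.0):
    noisy = soft_tissue + rng.normal(0.0, sigma, soft_tissue.size)
    fp_rates.append(np.mean(noisy >= 130.0))
    print(f"noise sd {sigma:>4}: false-positive fraction {fp_rates[-1]:.4f}")
```

The false-positive fraction rises monotonically with the noise level, which is the mechanism behind both the threshold-based mitigation in [41] and the noise/score trade-off observed with DLIR in [42].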

External Validation and Data Diversity
Training dataset diversity and external validation are essential for AI algorithm generalizability. Studies have shown the importance of large databases in model training and demonstrated inferior ICC and Cohen's kappa when a model is trained with a smaller dataset [36]. Poorly generalizable models perform well with training datasets and poorly with test sets that differ from their training sets, a phenomenon known as overfitting. As such, AI models need diverse training datasets to excel in realistic clinical settings and prevent the propagation of real-world biases [43]. This goal can be attained through the creation of large multicenter, multivendor, multi-detector, and multi-phenotype databases, achievable through cross-institutional collaboration. For instance, BunkerHill Health has an existing research consortium including several academic institutions to facilitate faster multicenter cooperation by aiding AI algorithm training and validation, clinical deployment, and addressing the legal implications of multicenter partnerships [44]. Large dataset repositories exist for other imaging modalities, including CMR and SPECT. Current research endeavors in model development can contribute to data archives by publicizing their annotated datasets and permitting their collation by research consortiums.

Metrics Standardization
During study evaluation, metric selection is essential for appropriate criteria evaluation and optimization during model training. Cohen's kappa, the concordance index, and ICC are ideal for assessing risk category allocation, accuracy, and reliability; as such, they are commonly used metrics in automated CACS [45,46]. The Proposed Requirements for Cardiovascular Imaging-Related Machine Learning Evaluation (PRIME) checklist provides a guideline for reporting model evaluation measures, including Bland-Altman plots, misclassification risk, accuracies, and inter-/intra-observer variability; however, specific and standard acceptable parameter thresholds are undefined [47]. For example, Cohen's kappa ranges from 0 to 1.0 with six different subgroups, the best of which is 0.81-1.00, indicative of almost-perfect risk categorization [48]. However, there is no standard minimum acceptable value for automated models. A minimum requirement of almost-perfect categorization would be acceptable for proposed models; nevertheless, the requirement needs to be defined for Cohen's kappa and other metrics to give more meaning to each model's clinical adequacy.
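For reference, Cohen's kappa compares observed agreement against the agreement expected by chance. A minimal sketch, using hypothetical manual vs. automated risk-category assignments (the rating data below are invented for illustration):

```python
import numpy as np

def cohens_kappa(a, b):
    """Cohen's kappa: agreement between two raters beyond chance,
    kappa = (p_o - p_e) / (1 - p_e)."""
    a, b = np.asarray(a), np.asarray(b)
    cats = np.union1d(a, b)
    po = np.mean(a == b)  # observed agreement
    pe = sum(np.mean(a == c) * np.mean(b == c) for c in cats)  # chance agreement
    return (po - pe) / (1 - pe)

# Hypothetical risk categories (0 = none ... 4 = severe), manual vs. automated
manual    = [0, 1, 2, 3, 4, 2, 1, 0, 3, 4]
automated = [0, 1, 2, 3, 4, 2, 2, 0, 3, 4]
print(round(cohens_kappa(manual, automated), 3))  # 0.875 -> "almost perfect"
```

Because kappa discounts chance agreement, it is stricter than raw accuracy: here 9/10 raw agreement yields κ = 0.875, which falls in the 0.81-1.00 "almost perfect" band cited above.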

Conclusions
AI application in CACS has the potential to produce precise, efficient, and accurate scores from dedicated and non-dedicated CT. While most studies presented so far have produced fast and accurate automatic scores, the challenges of data homogeneity, inadequate dataset size, and calcium miscategorization due to noise artifacts and aortic/valvular calcifications remain. Although a certain degree of false positives is expected with imaging tests, it is essential to minimize them in the higher CAC risk groups (>100 AU), where results significantly impact medical decision-making, to avoid over-utilization of resources for primary prevention and further diagnostic testing. Furthermore, creating data consortiums will support large-scale studies, facilitate accurate model development, and aid clinical implementation; however, equal representation of samples from various populations will be essential.

Figure 1. Overview of Artificial Intelligence in Coronary Artery Calcium Scoring. Automatic cardiac segmentation and coronary labeling are hallmarks of proposed models. Improvement of data diversity, data repositories, external validation, image artifacts, and metric standardization will enhance the proposed models. CACS automation optimizes workflow, reduces cost, and allows redistribution of resources in image post-processing labs.


Figure 2. Comparison of Manual and Automatic CAC Identification. Image adapted from Eng et al. [25]. Manual and automatic CACS comparison in two patients. The automatic model excludes calcifications in the descending aorta and some in the aortic root. (a) False negative calcium analysis by the automatic model in the right coronary artery (red arrow). False positive aortic root calcification is present (blue circle). (b) Comparison of automatic and manual coronary calcifications in another patient.


Table 1 .
Commercially available systems for automatic CACS.