1. Introduction
Cardiovascular disease is one of the leading causes of death worldwide, claiming approximately 17.9 million lives each year [1]. Coronary artery disease (CAD) is among the most common and fatal cardiovascular diseases. Its main cause is the accumulation of atherosclerotic plaque in the epicardial arteries [2], which leads to angina pectoris or heart attack. Plaque accumulation can cause stenosis of the arterial lumen. Therefore, the detection of coronary artery stenosis is very important: early detection of stenotic arteries allows early intervention and reduces mortality.
In recent years, deep-learning technology has made significant advances in the medical field, especially in medical image analysis, which has pushed forward the application of computers in vascular stenosis detection. In coronary artery stenosis detection, physicians mainly rely on X-ray images to make their diagnosis. Deep learning was initially applied in this field to segment blood vessels from X-ray images, after which physicians could diagnose stenosis based on the vessel contours. For example, Yang et al. achieved a 91.7% F1-score for coronary vessel segmentation using a U-Net network [3]. Zhu et al. segmented vessel contours with a PSPNet network to assist physician diagnosis [4]. Such algorithms are still semi-automated because they rely on physicians' diagnoses. In practice, physicians would benefit from a system that detects stenosis automatically and supports fully automated screening. One approach to fully automated stenosis detection is classification, in which a model simply determines the presence of stenosis. For example, Jungiewicz et al. built a dataset from the X-ray contrast images of 16 patients, consisting of small stenosis-positive and stenosis-negative image patches for binary classification, and innovatively applied a vision-transformer classification network [5]. Ovalle-Magallanes et al. achieved an F1-score of 91.8% on a smaller dataset of just 250 contrast images by incorporating quantum computing into the neural network computation [6]. Another class of fully automated detection algorithms is object detection, which not only detects stenosis but also localizes the stenotic region. Recent work in this direction categorizes diseased coronary arteries into four types based on different imaging views, such as the cranial (CRA) and left anterior oblique (LAO) views: local stenosis (LS), diffuse stenosis (DS), bifurcation stenosis (BS) and chronic total occlusion (CTO), and uses the YOLOv5 detection model for diagnosis [7]. In addition, the recent CNN-based YOLOv8 detection model has been combined with transfer learning to perform binary detection of coronary artery stenosis, i.e., whether stenosis is present on the vessel [8]. Some of the latest techniques have also been incorporated into object detection models, such as diffusion models, a type of deep generative model widely used in computer vision. For instance, Li et al. proposed a quantum diffusion model with spatiotemporal feature sharing for real-time stenosis detection, achieving a strong F1-score of 92.39% [9]. Several other studies have likewise proposed models for stenosis detection [10,11,12,13,14]. For example, Danilov et al. collected 8835 contrast images from 100 patients, each frame clearly annotated with the stenosis location, and achieved a 96% F1-score and 94% mAP with the RFCN ResNet-101 V2 detection network [10]. Freitas et al. collected 132 frames with clearly visible arterial stenosis from 50 patients and used the DeepCADD detection model to achieve 83% precision and 89.13% recall [11].
Object detection technology is the most meaningful to physicians' diagnosis, since it can both determine whether stenosis is present on coronary angiography images and accurately localize the lesion area. However, applying object detection networks to X-ray coronary angiography images raises two problems: on the one hand, X-ray images contain noise and artifacts, particularly breathing artifacts that blur vessel boundaries and mask vessel details, thereby affecting diagnostic results; on the other hand, stenotic regions often appear as small lesions, making these small targets more challenging for deep-learning algorithms to detect. More effective stenosis detection therefore requires a network with the following characteristics. First, vessel structures spread throughout the contrast image and small vessel branches have blurred contours, making the vessel structure in the image complex; the network therefore needs a strong feature extraction ability to capture rich vessel structures. Second, X-ray contrast images contain considerable noise, with breathing artifacts and interference from other organs and tissues, so the network must be robust to interference. Third, stenosis is formed by the accumulation of atherosclerotic plaque and often manifests in a small area, so the network must also excel at detecting small targets.
With the progress of deep-learning techniques, an increasing number of efficient object detection networks have been proposed. Early object detection algorithms were mainly two-stage detection algorithms [15,16,17], such as the spatial pyramid pooling network (SPP-Net) [15], the region-based convolutional neural network (R-CNN) [16] and Faster R-CNN [17]. Two-stage algorithms use two different neural network branches: one generates the candidate box regions, and the other classifies and refines the candidate boxes, which makes the overall algorithm more complex. Single-stage algorithms [18,19,20] produce the prediction results along with the candidate boxes, which significantly improves the inference speed and detection efficiency of the network. For example, SSD [18], RetinaNet [19] and YOLO [20] provided a qualitative leap in detection speed. Early single-stage algorithms did not achieve the desired detection accuracy because of their simple model structure, and later single-stage algorithms were proposed to solve this problem. For example, YOLOv5 [21] used an anchor-based design with the C3 module as the main feature extraction framework and achieved 64.1% mAP50 on the COCO dataset. YOLOX [22] used the SimOTA label assignment strategy and a decoupled prediction head, achieving higher accuracy in a shorter training time. In addition, because of the excellent performance of the transformer [23] model in natural language processing, Carion et al. [24] introduced it into computer vision and proposed the DETR detection model. However, DETR has drawbacks, such as a large number of model parameters, a large number of required training samples and difficulty in training convergence. Lv et al. optimized the training strategy and designed the RT-DETRv2 detection model, building on DETR, by setting different numbers of sampling points for features at different scales in the deformable attention mechanism [25]. Duan et al. designed a bottom-up object detection method and developed the CenterNet++ detection model, which detects each object as a triplet of keypoints [26].
The YOLO family of network models uses diverse convolutional structures to enhance feature extraction. These designs enable the models to capture and learn complex image features more efficiently. YOLOv8 [27] enriches the gradient flow through the C2f module, which connects more branches across layers, and captures information from different feature layers with the spatial pyramid pooling fast (SPPF) module. Even so, for coronary angiography images, the YOLOv8 network remains deficient because of the complexity of the vessel structure and the interference of noise and artifacts. Some medical image preprocessing methods can effectively improve network accuracy. Zhu et al. used a 3D Canny edge detection algorithm on MRI images of brain tumors to enhance the edge information of the lesion tissue [28]. Saifullah et al. achieved better segmentation results using particle swarm optimization with histogram equalization preprocessing [29]. Because medical images suffer from background noise and breathing artifacts, attention mechanisms are often added to networks to enhance their immunity to interference. Ovalle-Magallanes et al. used visualization techniques to demonstrate that a model focuses more on the lesion area after adding the CBAM attention mechanism [30]. In stenosis detection, the stenotic lesion often occupies only a small segment of a vessel, as shown in Figure 1, so coronary artery stenosis is difficult to detect accurately as a small-target problem. For small-target detection, Wang et al. added two additional feature map layers to the detection head of their model to enhance the accuracy of small-target detection for road vehicles [31].
In the field of coronary artery detection, the key issues that must be addressed for more effective detection are as follows. First, the framework must effectively extract features from complex coronary angiography images, which requires a strong feature extraction capability and is essential for accurate medical image analysis. Second, the framework must possess anti-interference capabilities to mitigate the effects of noise and artifacts in coronary angiography images, thereby enhancing the reliability and accuracy of the diagnostic process. Finally, the framework must address the challenge of detecting small targets, enabling more precise localization of the lesion area and achieving superior diagnostic outcomes. Our proposed framework effectively extracts key vessel contour information from the image, combats noise interference and concentrates on features within the stenosis region. It also effectively addresses the challenge of detecting stenosis in small target regions, thereby significantly improving detection accuracy. The main contributions of this study are as follows:
- (1)
The DCA-YOLOv8 object detection framework was designed according to the characteristics of coronary angiography images. The framework maximizes the extraction of, and focus on, stenosis information, enabling fast and accurate stenosis detection.
- (2)
The HEC preprocessing module combines a histogram-equalized image with contours extracted by the Canny edge detection algorithm, enhancing the features of the vascular region.
- (3)
We employed a detection head incorporating the AICI loss function, which uses an auxiliary bounding box, to detect small stenosis targets. This method accelerates the convergence of the framework, improves the accuracy and achieves optimal detection results.
The remainder of this paper is organized as follows. Section 2 explains the proposed method in detail. Section 3 presents the experimental setup. Section 4 presents the experimental results and comparative evaluations. Finally, Section 5 provides the discussion and Section 6 the conclusions.
2. Methods
The proposed basic framework for coronary artery stenosis detection consists of three parts. The first part integrates the DCA attention mechanism into the network to extract rich vessel features. The second part is the HEC preprocessing enhancement module, which combines histogram equalization with Canny edge detection. Finally, the output module combines a detection head with the AICI loss function so that training converges faster and more accurately to complete the final detection. The flowchart is shown in Figure 2.
2.1. DCA Feature Extraction Module
In our study, we use the YOLOv8 backbone network to extract features from the input image, as shown in Figure 3. YOLOv8 comprises the Conv + Batch Normalization + SiLU (CBS) module, the C2f module, the SPPF feature extraction module and the neck feature fusion module, which together extract rich features. However, when applying YOLOv8 to stenosis detection on coronary angiography images, its capability is still insufficient. Therefore, we added our newly designed DCA module to the YOLOv8 network. According to our experiments, placing the DCA module before the C2f module yields better results.
Adding attention mechanisms to networks has proven effective, and common attention mechanisms [32,33,34,35] used in computer vision include squeeze and excitation (SE) [32], CBAM [33], efficient channel attention (ECA) [34] and coordinate attention (CA) [35]. In coronary angiography, the vessel contour is the most important information in the image, so vessel information must be extracted and redundant information filtered out. Traditional attention mechanisms for feature extraction, such as CBAM and SE, rely on global average pooling, which overlooks the positional information of vessel stenosis in the image. Thus, we propose adding the DCA attention mechanism to the framework, which directs the framework to focus more on the positional information of vessel stenosis. The DCA structure is shown in
Figure 4. Our proposed attention mechanism does not change the number of input channels. The DCA module consists of two serial sub-attention modules and processes an input feature map X of size C × H × W, which is computed by the following formula:
In Equation (1), x_c(i, j) are the pixel values of channel c of the input feature, where H and W denote the height and width of the image, respectively. Equation (1) pools the features along the width dimension, preserving the positional information in the h-direction as z^h. Equation (2) then processes the important information in the h-direction, activating the vascular information in the h-direction, represented as t^h. In Equation (2), F_1 represents dimensionality reduction using a 1 × 1 convolution kernel with a scaling factor of r, and δ represents the activation function, which is expressed in Equation (4) and exhibits a smoother gradient. The attention vector g^h in the h-direction is calculated by Equation (3), where F_h represents upsampling with a 1 × 1 convolution kernel to restore the original channel dimension, and σ represents the sigmoid activation function. Finally, in Equation (5), the attention vector is multiplied by the input to obtain the initial result. The second sub-module follows the same process as the first, pooling along the height direction to maximally retain the positional information in the w-direction and obtain the final output.
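The serial pooling-and-gating steps described above can be sketched in NumPy. This is a minimal illustration of one sub-module (the h-direction branch), not the authors' implementation: the hard-swish activation, the plain matrix multiplications standing in for the 1 × 1 convolutions, and the tensor shapes are all assumptions.

```python
import numpy as np

def hardswish(x):
    # smooth-gradient activation (assumed here as the delta of Equation (4))
    return x * np.clip(x + 3, 0, 6) / 6

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def dca_submodule_h(x, w_down, w_up):
    """One serial sub-attention step: pool along the width, reduce channels,
    activate, expand channels, then sigmoid-gate the input.
    x: (C, H, W) feature map; w_down: (C//r, C); w_up: (C, C//r)."""
    C, H, W = x.shape
    z_h = x.mean(axis=2)            # Eq. (1): average pool over W -> (C, H)
    t = hardswish(w_down @ z_h)     # Eq. (2): channel reduction + activation
    g_h = sigmoid(w_up @ t)         # Eq. (3): expand to C channels, gate in (0, 1)
    return x * g_h[:, :, None]      # Eq. (5): broadcast attention over the width

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
x = rng.standard_normal((C, H, W))
w_down = rng.standard_normal((C // r, C)) * 0.1
w_up = rng.standard_normal((C, C // r)) * 0.1
y = dca_submodule_h(x, w_down, w_up)
print(y.shape)   # attention does not change the input shape: (8, 4, 4)
```

The second sub-module would repeat the same steps with pooling along the height axis, gating each (c, w) position instead.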
2.2. HEC Preprocessing Enhancement Module
In coronary angiography images, X-ray contrast imaging can exhibit artifacts, blurred vessel contours and noise interference from other tissues, which makes the vessels difficult to observe and results in poor contrast with the image background. Inspired by two previous studies [28,29], we preprocess and enhance each image before inputting it into the framework, making the vascular tissue more prominent so that stenotic regions are more easily distinguished before being fed into the DCA-YOLOv8 architecture.
The HEC module adds the pixel values of the histogram-equalized image and the Canny edge-extracted image. This enhances the vascular feature area in the contrast image, making stenosis in the blood vessels easier to distinguish. Histogram equalization transforms the original image to obtain a new image with uniformly distributed gray levels. It widens the gray levels where there are many pixels and compresses those where there are few, balancing the distribution of pixel values across the range. The image becomes brighter, and the contrast is enhanced. Histogram equalization thus stretches the image non-linearly and redistributes its pixel values to achieve a clearer image. The per-pixel mapping is as follows:
In Equation (6), L denotes the total number of gray levels, k the gray level being converted, and N the number of pixels in the image; n_j represents the number of pixels at the j-th gray level, and s_k represents the converted pixel value, so that s_k = (L − 1)/N · Σ_{j=0}^{k} n_j. We used histogram equalization to enhance the vascular structure of our images, which made it easier to distinguish the vascular structure from the background. The transformed image is shown in Figure 5i.
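Equation (6) amounts to mapping each gray level through the scaled cumulative histogram. A minimal sketch for an 8-bit image (the function name and the toy image are illustrative only):

```python
import numpy as np

def histogram_equalize(img, levels=256):
    """Map each pixel through s_k = round((L - 1) / N * sum_{j<=k} n_j),
    the scaled cumulative histogram of Equation (6). img: 2-D uint8 array."""
    n_j = np.bincount(img.ravel(), minlength=levels)  # pixels per gray level
    cdf = np.cumsum(n_j)                              # cumulative pixel counts
    N = img.size
    s = np.round((levels - 1) * cdf / N).astype(np.uint8)
    return s[img]                                     # per-pixel lookup

# a low-contrast 2x4 example: values clustered in [100, 103]
img = np.array([[100, 100, 101, 101],
                [102, 102, 103, 103]], dtype=np.uint8)
out = histogram_equalize(img)
print(out)   # the four clustered levels spread across the full 0-255 range
```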
The Canny edge detection operator seeks an optimal edge detection result: it applies Gaussian filtering to smooth the image and remove noise, and it employs a dual-threshold approach to identify potential edges. Processing the coronary angiography image with the Canny operator yields a contour map of the vascular structure. Because the stenotic region is included in the extracted contour, this processing effectively enhances the features of the vessel contour and significantly strengthens the features of the stenotic lesion. The transformed image is shown in Figure 5ii.
Finally, we added the histogram-equalized image to the Canny edge extraction result to obtain the final enhanced image, as shown in Figure 5iii. The HEC-processed image, which better represents the blood vessel features, was input into the framework for training.
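The fusion step itself is a pixel-wise addition. A sketch assuming both inputs are 8-bit images and the sum saturates at 255 (the behavior of OpenCV's cv2.add); the Canny edge map is taken as a precomputed 0/255 array here:

```python
import numpy as np

def hec_fuse(equalized, edges):
    """Pixel-wise addition of the histogram-equalized image and the
    Canny edge map, saturating at 255 (uint8 range)."""
    s = equalized.astype(np.int32) + edges.astype(np.int32)
    return np.clip(s, 0, 255).astype(np.uint8)

eq = np.array([[10, 200], [120, 250]], dtype=np.uint8)
edge = np.array([[0, 255], [0, 255]], dtype=np.uint8)  # Canny output is 0 or 255
print(hec_fuse(eq, edge))   # edge pixels saturate to 255
```

Saturating rather than wrapping the sum keeps edge pixels at maximum brightness instead of overflowing back toward black.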
2.3. Output
After extraction by the DCA-YOLOv8 backbone, three feature layers of sizes 80 × 80, 40 × 40 and 20 × 20 were obtained. As depicted in Figure 2, the three output feature layers are fed simultaneously into the box subnet and the class subnet to perform box localization and classification, and the final prediction is produced by the output module. In YOLOv8, the CIoU loss function is employed for the box localization loss, and binary cross-entropy (BCE) is used for the classification loss. In our proposed DCA-YOLOv8 framework, we use AICI as the box localization loss and BCE as the classification loss. Traditional loss functions for detecting small targets suffer from inaccurately fitted localization boxes and slow convergence. We designed the AICI loss function according to the characteristics of coronary artery stenosis, making it better suited to the detection and bounding-box regression of small stenosis targets; the AICI loss function localizes small stenotic areas more accurately.
In the YOLOv8 network, the localization loss uses the CIoU loss function, which is composed of an IoU loss term and penalty terms. The calculation of the CIoU loss function is illustrated in Figure 6. The specific calculations are as follows:
In Equation (7), ρ² denotes the squared distance between the centers of the target box and the predicted box; c denotes the diagonal length of the smallest enclosing box that covers both boxes; and d represents the distance between the centers of the two rectangular boxes. In addition, α is a weight function computed from v, and v measures the similarity of the aspect ratios. In the DCA-YOLOv8 framework, we improved this CIoU loss function.
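For reference, the standard CIoU loss that Equation (7) describes can be sketched as follows: a plain NumPy version for corner-format boxes, not the authors' training code.

```python
import numpy as np

def ciou_loss(box_p, box_g):
    """CIoU loss for (x1, y1, x2, y2) boxes: 1 - IoU + rho^2/c^2 + alpha*v,
    where rho is the center distance, c the enclosing-box diagonal, and
    v measures aspect-ratio similarity."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_g
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    iou = inter / (area_p + area_g - inter)
    # rho^2: squared distance between the two box centers
    rho2 = ((px1 + px2 - gx1 - gx2) ** 2 + (py1 + py2 - gy1 - gy2) ** 2) / 4
    # c^2: squared diagonal of the smallest enclosing box
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2
    # v: aspect-ratio consistency; alpha: its trade-off weight
    v = 4 / np.pi ** 2 * (np.arctan((gx2 - gx1) / (gy2 - gy1))
                          - np.arctan((px2 - px1) / (py2 - py1))) ** 2
    alpha = v / (1 - iou + v) if v > 0 else 0.0
    return 1 - iou + rho2 / c2 + alpha * v

print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))   # identical boxes -> 0.0
```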
For smaller target samples, the loss function can converge faster by using a larger auxiliary bounding box, and setting a reasonable ratio value with the inner-IoU can accelerate convergence and improve accuracy [36]. The inner-IoU is defined as follows: the ground truth (GT) box and the anchor are denoted as b^gt and b, respectively, as shown in Figure 7, where the target box is the ground truth box and the anchor box is the predicted box. The centers of the GT box and the inner-GT box are denoted by (x_c^gt, y_c^gt), while (x_c, y_c) denotes the centers of the anchor and the inner anchor. The width and height of the GT box are denoted as w^gt and h^gt, respectively, while the width and height of the anchor are represented by w and h. The inner-IoU is calculated as follows:
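The auxiliary-box computation can be sketched as follows, assuming the standard inner-IoU definition from [36]: both boxes are rescaled about their own centers by the ratio, and the ordinary IoU is computed on the rescaled boxes.

```python
def inner_iou(box_p, box_g, ratio=1.2):
    """IoU of ratio-scaled auxiliary boxes sharing the original centers.
    Boxes are (x1, y1, x2, y2); ratio > 1 enlarges both auxiliary boxes."""
    def inner_corners(x1, y1, x2, y2):
        xc, yc = (x1 + x2) / 2, (y1 + y2) / 2
        w, h = (x2 - x1) * ratio, (y2 - y1) * ratio
        return xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2
    pl, pt, pr, pb = inner_corners(*box_p)
    gl, gt, gr, gb = inner_corners(*box_g)
    iw = max(0.0, min(pr, gr) - max(pl, gl))
    ih = max(0.0, min(pb, gb) - max(pt, gt))
    inter = iw * ih
    union = (pr - pl) * (pb - pt) + (gr - gl) * (gb - gt) - inter
    return inter / union

# enlarging the auxiliary boxes (ratio > 1) turns a zero-IoU pair
# into overlapping boxes, restoring a useful gradient signal
print(inner_iou((0, 0, 2, 2), (2.5, 0, 4.5, 2), ratio=1.5))
```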
In the inner-IoU, a scale factor, ratio, is set to control the size of the auxiliary bounding box. However, changing the ratio changes the resulting IoU value, and the ρ² and v penalty terms in CIoU change with it, so the penalty terms in CIoU become inaccurate under the transformed ratio. When the inner-IoU is used, the areas of the inner-GT box and inner anchor change accordingly; although the two penalty terms were designed for the original IoU, they become inaccurate once the inner-IoU is used. Thus, we compensate for the inaccuracy caused by the transformed IoU by adding two adaptively learned parameters. Hence, our AICI loss function is defined as:
The AICI loss function is the box loss function used in our proposed DCA-YOLOv8 framework. In Equation (17), ρ² denotes the squared distance between the centers of the target box and the prediction box, and c denotes the diagonal length of the smallest enclosing box covering the two boxes; α is the weight parameter, v measures the similarity of the aspect ratios, and the two added parameters are learned adaptively to rescale the penalty terms. By setting the ratio value of the AICI loss function to greater than one, the auxiliary box is enlarged, increasing the overlap between the predicted box and the gold-standard box and thus alleviating the vanishing-gradient problem when the IoU is zero. It also increases the gradient value when the IoU is small, which makes the predicted box fit the target box more closely. The AICI loss function therefore yields more accurate predicted boxes, and the auxiliary ratio value accelerates training convergence, leading to higher accuracy.
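Putting the pieces together, an AICI-style loss can be sketched as below. This is a hedged illustration, not the paper's exact formulation: the parameter names g1 and g2 are purely illustrative stand-ins for the two adaptively learned parameters, which here simply rescale the CIoU penalty terms while the inner-IoU replaces the plain IoU term.

```python
import numpy as np

def aici_loss(box_p, box_g, ratio=1.2, g1=1.0, g2=1.0):
    """AICI-style box loss sketch: 1 - inner_IoU + g1*rho^2/c^2 + g2*alpha*v.
    g1, g2 stand in for the two adaptively learned parameters (illustrative
    names); ratio scales the auxiliary inner boxes. Boxes: (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_g

    def scaled(x1, y1, x2, y2):
        xc, yc = (x1 + x2) / 2, (y1 + y2) / 2
        w, h = (x2 - x1) * ratio, (y2 - y1) * ratio
        return xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2

    # inner-IoU over the ratio-scaled auxiliary boxes
    pl, pt, pr, pb = scaled(*box_p)
    gl, gt, gr, gb = scaled(*box_g)
    iw = max(0.0, min(pr, gr) - max(pl, gl))
    ih = max(0.0, min(pb, gb) - max(pt, gt))
    inter = iw * ih
    union = (pr - pl) * (pb - pt) + (gr - gl) * (gb - gt) - inter
    iou_inner = inter / union

    # CIoU penalty terms computed on the original boxes
    rho2 = ((px1 + px2 - gx1 - gx2) ** 2 + (py1 + py2 - gy1 - gy2) ** 2) / 4
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2
    v = 4 / np.pi ** 2 * (np.arctan((gx2 - gx1) / (gy2 - gy1))
                          - np.arctan((px2 - px1) / (py2 - py1))) ** 2
    alpha = v / (1 - iou_inner + v) if v > 0 else 0.0
    return 1 - iou_inner + g1 * rho2 / c2 + g2 * alpha * v

print(aici_loss((0, 0, 2, 2), (0, 0, 2, 2)))   # perfect overlap -> 0.0
```

In training, g1 and g2 would be trainable scalars updated by backpropagation; here they are plain arguments so the scaling effect can be inspected directly.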
6. Conclusions
Here, we proposed a new framework for the detection of coronary stenosis, comprising preprocessing, a feature extraction network and a detection head. First, in the preprocessing stage, we designed the HEC enhancement module tailored to the characteristics of coronary angiography images, which increases the contrast between vessels and the background, enabling the framework to identify stenotic lesions more accurately. Second, in the feature extraction module, we incorporated the DCA attention mechanism, which directs the framework to focus more on the vessel region and the characteristics of stenotic lesions, thereby improving the framework's accuracy. Finally, for the detection of small stenosis targets, we designed the AICI loss function, which accelerates convergence during training and enhances accuracy. The experimental results show that our proposed framework achieved precision, recall, F1-score and mAP values of 96.62%, 95.06%, 95.83% and 97.6%, respectively; the mAP peaked at 97.6%, and all other metrics were also highly competitive.
The advantages of this study are as follows. (1) The framework proposed in this paper achieves more accurate detection of coronary artery stenosis, thereby effectively assisting physicians in diagnosing the condition. (2) The novel DCA-YOLOv8 framework incorporates innovative improvements for coronary artery stenosis detection, demonstrating higher accuracy than other networks. The proposed stenosis detection framework can thus better assist physicians in rapidly and accurately identifying stenotic regions of the coronary arteries.
Our method has certain limitations. Owing to the limitations of the dataset, the proposed framework can only detect the presence of stenosis on coronary angiography images; it cannot further classify the type of stenosis or identify arteries with severe blockage. The detection results can therefore only assist physicians in determining whether stenosis is present. Further treatment decisions, such as stent placement and other interventions, must be made by physicians based on the degree of stenosis and the patient's health condition. Future work will focus on using more richly annotated coronary angiography images to enable the model to assess the type of stenotic lesion and the degree of stenosis, providing physicians with a more valuable supplementary reference; this will further enhance the performance of the framework and broaden its applicability.