LRSE-Net: Lightweight Residual Squeeze-and-Excitation Network for Stenosis Detection in X-ray Coronary Angiography

Abstract: Coronary heart disease is the primary cause of death worldwide. Ischemic heart disease and stroke are the most common diseases induced by coronary stenosis. This study presents a Lightweight Residual Squeeze-and-Excitation Network (LRSE-Net) for stenosis classification in X-ray Coronary Angiography images. The proposed model employs redundant kernel deletion and tensor decomposition by Depthwise Separable Convolutions to reduce the number of model parameters by up to 48.6× relative to a Vanilla Residual Squeeze-and-Excitation Network. Furthermore, the reduction ratios of each Squeeze-and-Excitation module are optimized individually to improve the feature recalibration. Experimental results for Stenosis Detection on the publicly available Deep Stenosis Detection Dataset and Angiographic Dataset demonstrate that the proposed LRSE-Net achieves the best Accuracy (0.9549/0.9543), Sensitivity (0.6320/0.8792), Precision (0.5991/0.8944), and F1-score (0.6103/0.8944), as well as a competitive Specificity of 0.9620/0.9733.


Introduction
Coronary Heart Disease (CHD) is the most common cause of death worldwide [1], mainly characterized by a partial narrowing of the coronary artery due to an adipose plaque formation [2]. This condition, also called coronary stenosis, reduces the oxygen blood supply reaching the heart muscle, ultimately leading to a heart attack [3]. Generally, manual stenosis detection requires exhaustive visual inspection of coronary images, whose efficacy could be deteriorated by the clinical standards and differences of expertise among physicians. For this reason, Computer-Aided Diagnosis (CAD) supports and tends to reduce the workload of the medical expert diagnosis for stenosis detection.
Although various coronary imaging techniques exist, such as ultrasound, magnetic resonance, and computed tomography, X-ray coronary angiography (XCA) remains the gold standard for CHD diagnosis [4]. Furthermore, physicians prefer the XCA screening test as a simultaneous coronary artery bypass surgery renders a reliable solution [5].
Moreover, the XCA screening test obtains high-resolution images of the main coronary arteries and their branches [6]. However, automatic stenosis detection is not easy due to the specific characteristics of XCA images, mainly background noise, the presence of a coronary stent, non-coronary vascular structures (i.e., ribs), and multiple superposed branching points [7][8][9], as shown in Figure 1.
In the last decade, Convolutional Neural Networks (CNNs) have achieved outstanding performance gains in classification and segmentation tasks in the medical image domain compared with traditional machine learning (ML)-based methods [10,11]. The core of a CNN is its capability to extract, select, and classify features jointly during the optimization step, while in ML methods each of these steps is conducted independently. Different methods have been introduced to improve the capabilities of CNNs, such as attention mechanisms that adaptively recalibrate the intermediate feature maps by weighting their inter-channel and inter-spatial relationships; however, this increases the number of parameters of the network.
This paper proposes a Lightweight Residual Squeeze-and-Excitation Network (LRSE-Net) for stenosis detection. The proposed LRSE-Net model relies on Depthwise Separable Convolutions (DSC) [12], which have been shown to learn rich features efficiently with a reduced parameter set. Moreover, individual reduction ratios for each Squeeze-and-Excitation module further improve the baseline architecture.

Related Work
Machine learning techniques have been proposed to detect automatic stenosis in XCA images [13][14][15]. These studies first extract discriminative features based on texture and shape information. Then, a feature selection process is performed to choose the most suitable features to feed a classifier. Finally, different classifiers, such as Naive Bayes and Support Vector Machine, accomplish stenosis detection. However, features extracted in a hand-crafted manner limit the effectiveness of feature selection, and consequently, the classification performance.
Recently, deep learning methods have been able to tackle feature extraction, selection, and classification within the optimization procedure in an end-to-end manner, showing outstanding performance compared to methods based on hand-crafted features. Wu et al. [16] proposed a deep learning framework consisting of two stages. First, candidate frames are selected from the full raw XCA sequence based on the segmentation results produced by a U-Net [17]. Subsequently, an object-based detection network employing a VGG (Visual Geometry Group) [18] as a backbone network provides the classification of stenosis regions. Following the same idea, Pang et al. [19] detected stenotic regions by including prior coronary artery displacement information. They used a Residual Network (ResNet) [20] as the backbone model of the object detector network. Later, Danilov et al. [21] evaluated different object detection network configurations, including a Single Shot multi-box Detector (SSD) [22], Faster Region-Based Convolutional Neural Networks (Faster-RCNN) [23], and Region-based Fully Convolutional Networks (R-FCN) [24]. In their networks, distinct backbone networks were employed, such as MobileNet-v2 [25], ResNet (50, 101) [20], and Inception-v4 [26].
However, the previous methods require the whole angiographic test and assume that a single stenosis region is present in the image. Another approach to solving this task is using a patch-based classification network. In this way, the full-size XCA image generates n-patches to be classified as positive or negative stenosis cases. In this context, Antczak and Liberadzki [27] employed a VGG-based model of only five convolutional layers to classify XCA image patches into the stenosis and no stenosis categories. A pre-training strategy was performed by synthetic data, consisting of a Bezier-based generative model to improve the results. Further, Ovalle-Magallanes et al. [28] proposed a novel hierarchical Bezier-based generative model to generate more realistic synthetic XCA patches. The dataset was evaluated on different ResNet configurations (18,34,50), including the Convolutional Block Attention Module (CBAM) [29]. Later, Ovalle-Magallanes et al. [30] performed an exhaustive evaluation of the impact of three attention mechanisms (Squeeze-and-Excitation [31], Convolutional Block Attention Module [29], and Efficient Channel Attention [32]). They demonstrated that a Trimmed ResNet18 with a Squeeze-and-Excitation attention module achieved the best trade-off between classification performance and computational cost. The methods mentioned above only employed a subset of the negative samples of the dataset released by Antczak and Liberadzki [33] to create a balanced training and test dataset; thus, only 125 negative and 125 positive cases were selected. This can lead to a biased classification when a large dataset is tested.
As discussed in previous paragraphs, different deep learning approaches have been used to develop strategies to detect stenosis in XCA images, through either object-based or patch-based models. These methods have shown notable performance; nevertheless, object-based approaches are limited to detecting a single stenosis case in the whole image. Meanwhile, patch-based methodologies are restricted to detecting small stenotic regions (i.e., based on the size of the patch). Moreover, both approaches take as their backbone network architectures designed for the ImageNet dataset, changing only the top of the model. Hence, redundant kernels may exist, limiting the classification performance.
This study presents a Lightweight Residual Squeeze-and-Excitation Network (LRSE-Net) for patch-based stenosis classification, based on two compression methods to reduce the model size: (1) redundant kernel deletion and (2) tensor decomposition by Depthwise Separable Convolutions. Additionally, it includes independent reduction ratios for each attention module to improve the feature extraction and generalization. The proposed LRSE-Net is up to 48× smaller (in number of parameters) than previous models employed for this task. The network's performance is evaluated employing two public datasets: (1) the full dataset from Antczak and Liberadzki [33], consisting of 1519 images with 125 positive cases of stenosis and the remainder as negative; and (2) a patch-based version of the dataset released by Danilov et al. [34], which includes 6769 positive patches and 26,699 negative patches. The main contributions of this research are as follows:
• An LRSE-Net model is proposed by replacing vanilla convolutions with Depthwise Separable Convolutions, drastically reducing the number of parameters;
• Independent reduction ratios for each attention module are selected to enhance the network performance;
• Redundant kernels in the convolutional layers are removed to obtain a smaller model;
• A data augmentation policy is introduced to mitigate the imbalance of the dataset;
• A new patch-based dataset is released to validate the model performance.

Materials and Methods
The proposed LRSE-Net model consists of two main elements: a Squeeze-and-Excitation Attention Mechanism [31] and Depthwise Separable Convolution [12]. Altogether, these two modules produce robust stenosis detection by employing fewer parameters. In this section, a full description of these fundamental components is given.

Squeeze-and-Excitation Attention Mechanism
A Squeeze-and-Excitation (SE) block is a gating mechanism that models channel-wise feature relationships by integrating two operations: a squeeze operation and an excitation operation. In this manner, the network can enhance hierarchical features in a channel-wise manner. The structure of an SE block is illustrated in Figure 2.
Figure 2. Squeeze-and-Excitation block. The input features are recalibrated ($F_{scale}(\cdot, \cdot)$) by learnable weights ($F_{ex}(\cdot, W)$) that capture the channel dependencies ($F_{sq}(\cdot)$).

Squeeze Operation
In order to capture channel dependencies between the input feature maps $X \in \mathbb{R}^{h \times w \times c}$, where $h \times w$ is the spatial size of the features and $c$ is the number of channels, a Global Average Pooling (GAP) [35] aggregates the global spatial information (squeeze) into a statistic $z \in \mathbb{R}^{c}$. The $m$-th element of the statistic is given by:
$$z_m = F_{sq}(x_m) = \frac{1}{h \times w} \sum_{i=1}^{h} \sum_{j=1}^{w} x_m(i, j).$$
Notice that this operation is parameter-free and applies a dimensionality reduction; thus, it reduces each feature map $x_m \in \mathbb{R}^{h \times w}$ to a single scalar value $z_m$.

Excitation Operation
The excitation operation aims to reduce the channel-wise feature complexity and boost generalization. A simple gating mechanism $g(\cdot, W)$ is applied to accomplish this task, such that:
$$s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2 \, \delta(W_1 z)),$$
where $\sigma$ and $\delta$ refer to the sigmoid and Rectified Linear Unit (ReLU) activation functions, respectively, so that each gate value satisfies $0 < s_m < 1$. The gating mechanism acts as a bottleneck with two fully connected layers $W_1 \in \mathbb{R}^{\frac{c}{r} \times c}$ and $W_2 \in \mathbb{R}^{c \times \frac{c}{r}}$. Here, the parameter $r$ is a reduction ratio controlling the number of parameters of the SE block. In such a way, a Squeeze-Excitation operation $SE(\cdot, W) : \mathbb{R}^{h \times w \times c} \to \mathbb{R}^{1 \times 1 \times c}$ can be defined as:
$$SE(X, W) = F_{ex}(F_{sq}(X), W).$$
Finally, the input feature maps $X$ are weighted by the obtained values $s$ to obtain a learnable recalibration that emphasizes or ignores specific channels. The rescaling procedure is performed by:
$$\tilde{x}_m = F_{scale}(x_m, s_m) = s_m \cdot x_m,$$
where $F_{scale}(x_m, s_m)$ is a channel-wise multiplication between the feature map $x_m \in \mathbb{R}^{h \times w}$ and the scalar $s_m$.
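The squeeze, excitation, and rescaling steps described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the helper names (`squeeze`, `excite`, `scale`), the toy input, and the toy weights are assumptions, the first fully connected layer is assumed to be stored as a (c/r) × c matrix, and biases are omitted.

```python
import math

def squeeze(x):
    """Global average pooling: x is a list of c feature maps, each h x w."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0])) for fmap in x]

def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

def excite(z, w1, w2):
    """Bottleneck gate: sigmoid(W2 . relu(W1 . z)).
    Assumed shapes: w1 is (c/r) x c, w2 is c x (c/r); no biases."""
    hidden = [max(0.0, a) for a in matvec(w1, z)]
    return [1.0 / (1.0 + math.exp(-a)) for a in matvec(w2, hidden)]

def scale(x, s):
    """Channel-wise multiplication of each feature map by its gate value."""
    return [[[s_m * v for v in row] for row in fmap] for fmap, s_m in zip(x, s)]

# Hypothetical toy input: c = 2 channels of size 2 x 2, reduction ratio r = 2.
x = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [2.0, 2.0]]]
w1 = [[0.5, -0.5]]      # shape (1, 2)
w2 = [[1.0], [0.0]]     # shape (2, 1)

z = squeeze(x)          # one scalar per channel
s = excite(z, w1, w2)   # one gate value in (0, 1) per channel
y = scale(x, s)         # recalibrated feature maps
```

In this toy case the second channel receives a gate of exactly 0.5, so its feature map is halved, while the first channel is attenuated less.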

Depthwise Separable Convolution
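Let $f_{conv}(\cdot, W) : \mathbb{R}^{h_1 \times w_1 \times c_1} \to \mathbb{R}^{h_2 \times w_2 \times c_2}$ be a standard convolution operation that takes as input $X_{in}$ and produces $X_{out}$, parameterized by the kernel $W \in \mathbb{R}^{k \times k \times c_1 \times c_2}$ and computed as:
$$X_{out} = f_{conv}(X_{in}, W) = W * X_{in},$$
where $*$ represents the convolution operation and $k$ is the filter size. Depthwise Separable Convolutions (DSC) factorize a standard convolution into two independent convolutions: (1) a depthwise convolution and (2) a pointwise convolution (1 × 1 convolution), as shown in Figure 3. The depthwise convolution $f_{dw-conv}(\cdot, W) : \mathbb{R}^{h_1 \times w_1 \times c_1} \to \mathbb{R}^{h_1 \times w_1 \times c_1}$ decouples the input feature map by channels, applying a single filter to each input channel, as follows:
$$X_{dw}^{(m)} = W_{dw}^{(m)} * X_{in}^{(m)}, \quad m = 1, \dots, c_1,$$
where $W_{dw} \in \mathbb{R}^{k \times k \times c_1}$. Then, the pointwise convolution $f_{pw-conv}(\cdot, W) : \mathbb{R}^{h_1 \times w_1 \times c_1} \to \mathbb{R}^{h_2 \times w_2 \times c_2}$ combines the features of each channel through a 1 × 1 standard convolution, such that:
$$X_{out} = f_{pw-conv}(X_{dw}, W_{pw}) = W_{pw} * X_{dw},$$
where $W_{pw} \in \mathbb{R}^{1 \times 1 \times c_1 \times c_2}$. This factorization reduces the number of parameters and computation operations.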
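To make the parameter saving concrete, the following sketch counts the weights of a standard convolution and of its depthwise separable factorization. The layer sizes are hypothetical values chosen only for illustration, and biases are ignored.

```python
def conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def dsc_params(k, c_in, c_out):
    """Weights of a depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

# Hypothetical layer: 3 x 3 kernel, 64 input channels, 128 output channels.
k, c_in, c_out = 3, 64, 128
standard = conv_params(k, c_in, c_out)    # 73728
separable = dsc_params(k, c_in, c_out)    # 576 + 8192 = 8768
ratio = standard / separable              # roughly 8.4x fewer weights
```

For a fixed kernel size, the saving approaches k² as the number of output channels grows, since the ratio equals 1 / (1/c_out + 1/k²).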

Lightweight Residual Squeeze-and-Excitation Network
The proposed Lightweight Residual Squeeze-and-Excitation Network (LRSE-Net) consists of SE attention layers and DSC layers with residual connections. The network follows the structure of ResNet, where residual connections accelerate the training and resolve the gradient degradation problem. Formally, a residual block is defined as:
$$X_{out} = \delta\left(F_{res}(X_{in}, W_i) + F_{down}(X_{in}, W_s)\right),$$
where $X_{in}$ and $X_{out}$ stand for the input and output feature maps, respectively, $F_{res}(\cdot, W_i)$ represents the residual mapping to be learned, parameterized by the kernels $W_i$ (i.e., multiple convolutional layers), $F_{down}(\cdot, W_s)$ performs a linear projection with a learnable kernel $W_s$ to match the dimensions (e.g., when the number of input/output channels changes), and $\delta$ is the ReLU function. The residual mapping follows the order of execution Convolution → Batch Normalization → ReLU → Convolution → Batch Normalization. Note that the standard convolution is replaced with DSC. After the residual mapping, an SE attention module is placed to highlight key channel-wise information. Thus, the Residual Squeeze-and-Excitation block $RSE : \mathbb{R}^{h_1 \times w_1 \times c_1} \to \mathbb{R}^{h_2 \times w_2 \times c_2}$ is defined as:
$$X_{out} = RSE(X_{in}) = \delta\left(F_{scale}(X_{res}, SE(X_{res}, W)) + F_{down}(X_{in}, W_s)\right),$$
where $X_{res} = F_{res}(X_{in}, W_i)$ is the output of the residual mapping and $\delta$ is the ReLU activation function. Figure 4 depicts an illustration of the Residual Squeeze-and-Excitation block.
The proposed network takes ResNet18 as its backbone, which consists of one 7 × 7 convolutional layer with a stride of two pixels, followed by a max-pooling of size two and four residual blocks with 64, 128, 256, and 512 kernels, respectively. Then, redundant kernels were removed in the convolutional layers (half of them) to obtain a smaller model. Similarly, the top residual block and the first max-pooling are removed. A pipeline illustrating these model compression steps is shown in Figure 5.
Hence, the LRSE-Net structure contains 14 convolutional layers organized as one 3 × 3 convolution with 32 kernels and a stride of two pixels, three residual SE blocks, each with two residual mappings followed by an SE module with reduction ratios r = 16, 13, 9, respectively, forming 12 convolutions with 32, 64, 128 kernels of size 3 × 3, and one dense layer for final classification. Notice that a GAP layer reduces the feature maps' dimensionality to a 1D vector that feeds the dense layer. Table 1 summarizes the LRSE-Net architecture. The optimal hyperparameters of the SE blocks and the number of kernels per residual block were obtained using the Tree-structured Parzen Estimator (TPE) algorithm [36,37], minimizing the validation Cross-Entropy Loss.
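The effect of the reduction ratio on the size of an SE bottleneck can be illustrated with a short count. The sketch below is a rough estimate under stated assumptions rather than the exact layer sizes of Table 1: it assumes one SE module per stage, a hidden width of floor(c / r) as in common implementations, and no biases.

```python
def se_params(channels, ratio):
    """Weights of the two fully connected layers of one SE module.
    Assumption: hidden width is floor(channels / ratio), biases omitted."""
    hidden = max(1, channels // ratio)
    return 2 * channels * hidden

stage_channels = [32, 64, 128]    # kernels per residual SE block
individual_ratios = [16, 13, 9]   # ratios reported for LRSE-Net
default_ratios = [16, 16, 16]     # default SE ratio

per_stage_individual = [se_params(c, r) for c, r in zip(stage_channels, individual_ratios)]
per_stage_default = [se_params(c, r) for c, r in zip(stage_channels, default_ratios)]
```

Under these assumptions, the smaller ratio in the deepest stage widens its bottleneck from 8 to 14 units, so that stage is where most of the additional SE parameters appear.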

Datasets
Two public datasets were used to evaluate the proposed model: the Deep Stenosis Detection Dataset (DSSS) [33] and the Angiographic Dataset for Stenosis Detection (ADSD) [34].
DSSS [33] consists of small XCA image patches of size 32 × 32 taken from different image positions and sources. It contains a total of 1519 images, where only 125 are positive cases of stenosis and 1394 negative cases, which generate an unbalanced ratio of 1:11, i.e., one positive case for eleven negative ones. This database does not specify a partition for training and testing sets.
ADSD [34] comprises a total of 8325 grayscale XCA images (100 patients) of 512 × 512 to 1000 × 1000 pixels. XCA images were taken using Coroscop (Siemens) and Innova (GE Healthcare) image-guided surgery systems at the Research Institute for Complex Problems of Cardiovascular Diseases (Kemerovo, Russia). A bounding box around stenotic segments was set with different areas: small (< $32^2$ pixels), medium ($32^2$ ≤ area ≤ $96^2$ pixels), and large (> $96^2$ pixels). The training and test subsets are specified with 7493 and 832 images, respectively.
A patch-based dataset was generated from ADSD [34] to evaluate the proposed patch-based approach, taking square patches centered on the stenosis bounding box as positive cases and the 4-connected neighbors around the bounding box as negative cases. During the patch selection, patches smaller than 32 × 32 pixels were omitted. In this way, the new dataset (P-ADSD) consisted of 6769 positive patches and 26,699 negative patches (1:4 unbalanced ratio). The training subset contained 6080 positive and 23,986 negative cases, while the test subset had 689 positive and 2713 negative cases. Patches were re-sized to 64 × 64 to homogenize the image dimensions.
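One possible reading of this patch selection is sketched below. It is not the authors' released code: the function name, the (x, y, w, h) bounding-box convention, the choice of the longer box side as the patch side, and the rule of keeping only neighbors that lie fully inside the image are all assumptions made for illustration.

```python
def extract_patch_boxes(bbox, image_size, min_side=32):
    """Return (positive_box, negative_boxes) as (left, top, side, side) tuples.
    bbox is (x, y, w, h) with (x, y) the top-left corner; image_size is (width, height).
    Assumptions: the square side is max(w, h); negatives must lie fully inside the image."""
    x, y, w, h = bbox
    img_w, img_h = image_size
    side = max(w, h)
    if side < min_side:
        return None, []  # patches smaller than min_side x min_side are omitted
    cx, cy = x + w // 2, y + h // 2
    left, top = cx - side // 2, cy - side // 2
    positive = (left, top, side, side)
    negatives = []
    # 4-connected neighbors: left, right, above, below the positive patch.
    for dx, dy in [(-side, 0), (side, 0), (0, -side), (0, side)]:
        nx, ny = left + dx, top + dy
        if nx >= 0 and ny >= 0 and nx + side <= img_w and ny + side <= img_h:
            negatives.append((nx, ny, side, side))
    return positive, negatives

# Hypothetical bounding boxes on a 512 x 512 image.
pos_a, neg_a = extract_patch_boxes((100, 120, 40, 30), (512, 512))
pos_b, neg_b = extract_patch_boxes((0, 0, 40, 40), (512, 512))
pos_c, neg_c = extract_patch_boxes((10, 10, 20, 20), (512, 512))
```

With up to four negatives per positive patch, and some neighbors dropped near image borders, a ratio slightly below 1:4 is what one would expect, which is in line with the 6769 to 26,699 split reported above.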
On the other hand, to deal with the small size and the unbalanced ratio of the DSSS [33], a data augmentation policy was applied, generating four additional images per input image. The policy includes random rotation by 90, 180, and 270 degrees, random horizontal flip, random horizontal and vertical shift of −10% to 10%, random zoom-in of 0% to 10%, and random brightness change. Additionally, a partition of 80:20 was set to split the dataset into training and testing. The data augmentation policy was applied only to the positive cases of the training subset. In this manner, the augmented dataset (A-DSSS), including 430 positive and 1394 negative stenosis cases, was obtained, reducing the unbalanced ratio to 1:3.

Results
The proposed LRSE-Net model was evaluated through multiple comparisons with different architectures employed for stenosis detection. The performance analysis was conducted using the datasets P-ADSD and A-DSSS described above. First, the evaluation metrics are defined. Secondly, the implementation details for training the model are explained. Finally, numerical results are shown.

Evaluation Metrics
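For the evaluation of the proposed approach, five metrics are considered: Accuracy, Sensitivity, Specificity, Precision, and $F_1$-score, which are defined as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$
$$\text{Sensitivity} = \frac{TP}{TP + FN},$$
$$\text{Specificity} = \frac{TN}{TN + FP},$$
$$\text{Precision} = \frac{TP}{TP + FP},$$
$$F_1\text{-score} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}},$$
where TP refers to the number of true positives, TN is the number of true negatives, FP denotes the number of false positives, and FN represents the number of false negatives.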
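The five metrics can be computed directly from the confusion-matrix counts, as in the sketch below. The function name and the example counts are hypothetical and are not results from this study; returning 0.0 for an empty denominator is a convention chosen here.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, Sensitivity, Specificity, Precision, and F1-score from raw counts."""
    def safe_div(num, den):
        # Convention chosen for this sketch: return 0.0 when the denominator is zero.
        return num / den if den else 0.0

    sensitivity = safe_div(tp, tp + fn)
    precision = safe_div(tp, tp + fp)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": safe_div(tn, tn + fp),
        "precision": precision,
        "f1": safe_div(2 * precision * sensitivity, precision + sensitivity),
    }

# Hypothetical counts for an imbalanced test set (not taken from the paper).
metrics = classification_metrics(tp=80, tn=900, fp=20, fn=20)
```

On imbalanced data such as this, Accuracy and Specificity are dominated by the negative class, which is why Sensitivity, Precision, and F1-score are reported alongside them.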

Implementation Details
The training process employs the Stochastic Gradient Descent with Momentum (SGDM) optimizer [38] with a learning rate of 1 × 10⁻³ and a momentum of 0.9. The model was trained with a batch size of 32 for 100 epochs, minimizing the Cross-Entropy Loss. The model was implemented using the PyTorch framework, and the experiments ran on Google's cloud servers, including a Tesla P4 GPU with 2560 CUDA cores and 8 GB of RAM.
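As a reminder of what the momentum term does, the sketch below applies the SGDM update to a one-dimensional toy objective with the learning rate and momentum stated above. It assumes the update form used by PyTorch's SGD (velocity accumulates the gradient, no dampening or weight decay); the toy objective and function name are illustrative only.

```python
def sgdm_minimize(grad_fn, theta, lr=1e-3, momentum=0.9, steps=100):
    """Assumed update: v <- momentum * v + g, then theta <- theta - lr * v."""
    velocity = 0.0
    history = [theta]
    for _ in range(steps):
        g = grad_fn(theta)
        velocity = momentum * velocity + g
        theta = theta - lr * velocity
        history.append(theta)
    return history

# Toy objective f(theta) = theta^2, whose gradient is 2 * theta.
history = sgdm_minimize(lambda t: 2.0 * t, theta=1.0, steps=100)
```

With a momentum of 0.9, the accumulated velocity makes the long-run step roughly ten times larger than the raw learning rate would suggest.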
To fairly compare the proposed method with other models, all the experiments followed the same hyperparameters and were initialized using the same seed. Moreover, a 5-fold cross-validation was set, following an 80:20 ratio for the training and validation subsets. The validation step allows for saving the best weights during the training process. Table 2 summarizes the dataset partition distribution. Both datasets and their train-validation-test partitions are freely available at: https://github.com/eovallemagallanes/LRSE-Net (accessed: 30 October 2022).
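A 5-fold split with a fixed seed yields an 80:20 training-validation ratio in every fold, which the following sketch illustrates. It is a generic, non-stratified split written for illustration; the function name, the seed value, and the sample count are assumptions, and the released partitions in the repository above remain the reference.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Return k (train_indices, val_indices) pairs from a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # same seed gives the same partition
    folds = [indices[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    splits = []
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, val))
    return splits

# Hypothetical example with 100 samples: each fold uses 80 for training, 20 for validation.
splits = k_fold_indices(100, k=5, seed=0)
```

For a dataset as imbalanced as those used here, a stratified variant that preserves the class ratio within each fold would be a natural alternative.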

Ablation Study
An ablation study over the A-DSSS dataset, reported in Table 3, demonstrates the impact of the DSC and the SE module. All configurations were trained from scratch employing the hyperparameters presented in the previous subsection. The comparative analysis evaluates four main groups of configurations: (1) without DSC and SE, (2) without DSC but with SE, (3) with DSC but without SE, and (4) with DSC and SE. For configurations using the SE module, two variants were tested: (1) with default reduction ratios (r = 16) and (2) with independent ratios r = 16, 13, 9. As mentioned before, the TPE algorithm was employed to find the model configuration minimizing the validation loss of the first fold.
Numerical results indicate that incorporating SE attention modules with individual reduction ratios increased Specificity and Precision compared with the model without attention and the model with default SE ratios, with a small parameter overhead. The exclusive use of DSC showed very competitive results in Accuracy, Sensitivity, and Specificity with respect to the baseline model (with vanilla convolution operations), while drastically reducing the number of parameters by around 3.6×. The DSC with SE, including default reduction ratios, achieved the best Specificity and Precision. In particular, including DSC and SE with individual reduction ratios presented the highest Accuracy, Sensitivity, and F1-score and the second-lowest number of parameters, reducing the number of parameters by around 3.5× compared to the baseline model. Therefore, this last model configuration was selected as the default model for subsequent comparison.

Stenosis Classification Performance Comparison
The performance of the LRSE-Net was evaluated on two public datasets (see Table 2). All models were trained from scratch with the same hyperparameters to ensure a fair comparison.
For the A-DSSS dataset, the results are shown in Table 4. The proposed LRSE-Net achieved the best mean Accuracy (0.9349), Sensitivity (0.6320), Precision (0.5991), and F1-score (0.6103). On the other hand, Vanilla ResNet18 achieved the best Specificity (0.9850). Even though LRSE-Net was 2.3% lower in Specificity than Vanilla ResNet18, it attained gains of 2%, 50%, 13%, and 41% in Accuracy, Sensitivity, Precision, and F1-score, respectively. Among the attention models, Vanilla SE-ResNet18 obtained a Specificity around 2% higher than that of the LRSE-Net; however, LRSE-Net clearly surpassed it in Sensitivity, Precision, and F1-score. The training and validation curves are shown in Figures 6 and 7, where it can be seen that the proposed model attained the highest accuracy curves and the lowest loss. The second-best accuracy and validation curves are those of the CBAM-ResNet34. After 50 epochs, all models started to overfit, with validation losses fluctuating due to the class imbalance within the folds. Notice that the validation subset is not augmented. The Trim ResNet18 achieved the most stable validation accuracy curve over the epochs.
The performance on the P-ADSD dataset is shown in Table 5. In this case, the proposed model achieved the best mean Accuracy, Sensitivity, Precision, and F1-score with 0.9543, 0.8792, 0.8944, and 0.8863, respectively, and the second-best Specificity with 0.9620 (only 0.05% below). Among the models with an attention mechanism, the proposed model had a gain in four evaluation metrics; CBAM-ResNet34 obtained the best Specificity, while Trim SE-ResNet18 performed poorly in Sensitivity (0.7931) and F1-score (0.8134). Their corresponding training and validation curves are shown in Figures 8 and 9, confirming that the proposed model attained the lowest validation loss and higher validation accuracy than Trim ResNet18 and Vanilla SE-ResNet18.
The training curves exhibited a smoother behavior than the validation curves, where the LRSE-Net displayed lower accuracy and greater loss. Nevertheless, this leads to a better generalization performance.
Numerical results on both datasets demonstrate the efficacy of the proposed approach and indicate that SE modules with independent reduction ratios can enhance the feature representation, thus learning more discriminative features. Further, LRSE-Net performed better than the CBAM mechanism, which uses both channel and spatial attention.

Class Activation Maps Comparison
The Gradient-weighted Class Activation Map (GradCAM) [39] retrieves a visual explanation of the most important regions in the image for the model's decision. Figure 10 illustrates the GradCAM for the test set of the A-DSSS dataset. Highly discriminative regions for stenosis detection are colored in hot tones (red), and less informative regions (i.e., where the gradient contributes in a minor way) in cold tones (purple). In the model without attention (a) and the model including the CBAM module (d), the GradCAM focused on corner regions more than on blood vessel zones. For instance, the Vanilla ResNet18 showed two false negative cases in the last two test images; the CBAM-ResNet34 had one false positive (third row) and four false negative cases. When the model includes the SE block, as in (b), (c), and (e), the GradCAM paid greater attention to blood vessel regions. The Vanilla SE-ResNet18 (b) produced a false positive case (first test image), and the Trim SE-ResNet18 (c) an extra false negative (sixth column). In particular, the LRSE-Net presented greater attention over the blood vessel, with no false positive or false negative cases. As can be seen in Figure 11 for the P-ADSD dataset, the GradCAM featured more isolated high-attention regions in all the cases. These regions are located over blood vessel pixels for the Vanilla ResNet18 and the ResNets including the SE block. In addition, the CBAM-ResNet34 (d) showed high attention to the background zones of the image in the positive stenosis cases. The test images can include different blood vessel widths, background artifacts, and blood vessel bifurcations that affect the gradient activation regions. However, the GradCAM produced proper attention over the blood vessel for test cases with visible major blood vessels.

Discussion
The performance results validate the capability of the proposed method to classify stenosis cases in XCA image patches in datasets of different sizes with a majority of negative stenosis cases. Moreover, it was demonstrated that individual selection of reduction ratios for SE modules boosts the network performance. As the model goes deeper, the reduction ratios are smaller; this suggests that deeper features require an SE module with additional parameters to recalibrate the features. Similarly, the inclusion of DSC and the redundant kernel removal drastically reduced the network's complexity (in terms of the number of parameters): the model is up to 48.6× smaller than a vanilla ResNet18, 48.9× smaller than a vanilla SE-ResNet18, and 35.7× smaller than the CBAM-ResNet34.
By visualizing training and validation curves, it can be seen that the network performance is directly affected by the quality and quantity of the training data. For example, the first dataset (A-DSSS) showed poor performance and rapid overfitting, even when data augmentation was performed. This behavior is not observed with the P-ADSD dataset, where around 33K images are available.
The GradCAM recovered a reasonable visual explanation over blood vessel regions, highlighting discriminative regions in hot tones and those with lower contributions in cold tones. Moreover, it supported the importance of incorporating an attention mechanism to improve the model numerical and explainable capabilities.

Conclusions
This paper proposed an LRSE-Net to classify stenosis cases from XCA images. The model consists of two main elements, a DSC and an SE module, which yield high classification rates with lower computational requirements in terms of the required parameters. The proposed model is 48.9× smaller than Vanilla SE-ResNet18 and 35.7× smaller than CBAM-ResNet34. The experimental results demonstrate that LRSE-Net consistently outperformed Residual models with or without attention mechanisms. Additionally, the individual selection of reduction ratios for the SE blocks improved the classification performance, with smaller reduction ratios as the network goes deeper. In particular, greater boosts were achieved when the dataset was small, with a gain of 2%, 50%, 13%, and 41% in Accuracy, Sensitivity, Precision, and F1-score, respectively. Moreover, the LRSE-Net GradCAM maps retrieved a refined region proposal of the stenosis location, which could support the physician's decision-making process.
Although the recognition rates are high, there is still a need for further improvements, such as evaluating the proposed model as the backbone for an object-based recognition system and detecting stenosis cases from the full XCA test. A future direction of this work concerning model compression may be to analyze other approaches, such as quantization, different low-rank-tensor decomposition, and knowledge distillation. Another research direction to address the limited training data could be generating artificial data by deep generative models.

Conflicts of Interest:
The authors declare no conflict of interest.
