1. Introduction
Coronary stenosis is a common disease that leads to heart attacks and, eventually, the death of patients, but it is not always accurately diagnosed by cardiology specialists [1]. X-ray tomography still remains the most widely adopted method to acquire coronary angiographies from patients with heart diseases [2]. However, the acquired X-ray coronary digital images commonly have low resolution, narrow vessel regions, or low contrast between vessels and background, which makes it difficult for cardiologists to find affected arteries through an exhaustive visual exploration of the image.
Figure 1 presents a coronary angiography involving a region with a stenosis case labeled by a cardiology specialist and a corresponding schematic representation.
Since the identification of coronary stenosis cases relies on the expertise and visual accuracy of cardiology specialists to identify plaque or adipose tissue areas inside the arteries, the use of digital image processing techniques and methods is important to improve the detection of artery diseases. The automatic detection and classification of coronary stenosis cases from digital images of coronary angiographies have been reported in the literature using distinct approaches and methods. The use of feature extraction and machine learning classification techniques is one of the pioneering approaches [3]. For instance, image processing techniques to enhance or segment arteries, which allow working with morphological features, have been used to differentiate healthy arteries from those indicating disease [4,5,6,7]. The combination of feature extraction and automatic feature selection techniques is also used to extract optimal feature subsets able to train specific machine learning classification models [8].
The problem has also been addressed through deep learning techniques, which constitute a completely different approach when compared with manual feature extraction and machine learning techniques. The use of convolutional neural networks (CNNs) for coronary stenosis detection and classification is the most common approach reported in the deep learning literature [9,10]. One of the main drawbacks of traditional CNN architectures is the need for large image databases. This commonly leads to the use of highly complex CNN architectures, whose performance in terms of accuracy can decrease when small image databases are used for object detection or classification tasks [11]. To overcome the disadvantages of working with small image databases, data augmentation and transfer learning techniques have been applied successfully. Despite the gains in classification accuracy provided by data augmentation, the absence of novel, empirical datasets poses challenges for clinical adoption [12,13]. In addition, as CNN architectural complexity increases, the difficulty of identifying the specific components most critical to the classification task scales accordingly [14].
The search for optimal CNN architectures is described in the literature as the Neural Architecture Search (NAS) problem. It is not a trivial task due to the high number of combinations that can be formed with the distinct layer types. Furthermore, structural constraints must be addressed to prevent inconsistent layer connectivity or execution errors during the training phase. Beyond these constraints, representing CNN architectures across diverse structural formats, conforming to various computational techniques and algorithms, remains a significant challenge [15,16]. Consequently, numerous methodologies for the automated design of CNN models have emerged in the literature. For instance, Jiang et al. [17] utilize Multi-Objective Particle Swarm Optimization along with a problem decomposition technique to generate CNN architectures. Similarly, Elsken et al. [18] propose a hill-climbing approach, which is validated using the CIFAR-10 image database [19]. More recently, Franco-Gaona et al. [20] employed an Estimation of Distribution Algorithm (EDA) to evolve optimal architectures for diverse image-processing tasks, including table detection and coronary stenosis classification. Parallel to these developments, substantial research has focused on the automated design of lightweight CNNs. Pham et al. [21] introduced a methodology for the creation of lightweight CNN architectures by optimizing inter-layer connections. Alternatively, CNN architectures able to run on mobile and embedded devices with important hardware limitations have been proposed [22,23]. Model compression techniques have also been applied to reduce the resource consumption of CNN models [24,25].
While CNNs have revolutionized medical image recognition, their application is significantly hindered by the scarcity of large-scale, annotated datasets. Unlike general-purpose image databases such as ImageNet, which served as the foundation for modern CNN architectures, medical databases often lack the volume and diversity required for robust training. In this context, high-complexity architectures including ResNet50 [26], InceptionV3 [27] and VGG-16 [28] yield suboptimal classification results when trained on limited image databases. Similarly, lightweight models such as MobileNetV2 [29] and EfficientNetB0 [30], despite their efficiency on large-scale image databases, exhibit a considerable decline in performance when applied to constrained medical image databases. Even when transfer learning increases classification rates, the original images often need to be augmented because of the large difference between the number of instances available and the number on which transfer learning is supported [31]. However, while data augmentation can be beneficial, some strategies may introduce unrealistic variances or narrow relevant features that are critical to the analysis [32]. In addition, although data augmentation is a standard technique to improve model generalization, its application in medical imaging presents unique challenges that can compromise diagnostic integrity or violate critical data relationships [33,34].
This paper proposes a novel Hybrid Multi-Objective Evolutionary method to automatically design lightweight CNNs for classifying coronary stenosis cases. The use of multi-objective strategies such as the Multi-Objective Evolutionary Algorithm based on Decomposition (MOEA-D) [35] is relevant because it avoids the problem of assigning weights to the distinct goals involved. In addition, since the MOEA-D produces a set of trade-off solutions among the objectives involved, an additional search-refinement stage driven by the Simulated Annealing (SA) method is performed [36]. In this stage, a subset containing the 3 best solutions achieved by the MOEA-D is extracted in order to perform a second search stage driven by the SA algorithm, which decreases the architecture complexity and, at the same time, tries to maintain or increase the classification accuracy.
The main contribution of this paper is a methodology that is able to automatically design lightweight, specialized CNN architectures focused on the classification of positive and negative coronary stenosis cases using small image databases. It involves important design constraints to ensure the creation of consistent CNN models. The proposed method conceptualizes the NAS problem as a multi-objective optimization problem with two goals: minimizing the classification error rate and the architecture complexity. A high classification accuracy is essential to guarantee the quality of the model architecture to be used in clinical practice. On the other hand, a lightweight architecture is adequate to achieve optimal classification rates using small image datasets. In order to measure the classification rate, the Accuracy error metric was used during the training stage. Furthermore, to corroborate the obtained results, additional metrics such as Precision, Recall, F1-Score and the Jaccard Similarity Coefficient were also used in posterior independent tests. In addition, the number of learning parameters was used as a complexity measure. For the experiments, two distinct image databases of coronary stenosis cases were used to confirm the obtained results. The first image database is composed of negative and positive coronary stenosis patches. In subsequent experiments, a second independent database, involving natural and synthetic images of coronary stenosis cases, was used. In the conducted tests, the proposed method performs better than the other compared techniques from the literature in almost all metrics. Finally, based on the presented results, the proposed methodology was adequate to produce CNN models that can be used in clinical practice as part of information systems for computer-aided diagnosis.
3. Proposed Method
The proposed method involves two main stages related to the NAS problem. In the first stage, a set of solutions is produced by the MOEA-D strategy. Each solution is a trade-off between the classification error during the training and validation stages and the architecture complexity, which is measured using the number of learnable parameters of the CNN. In the second stage, a subset of optimal solutions found by the MOEA-D is selected and improved independently using the SA algorithm to finally select the optimal solution. The difficulty of designing an optimal CNN architecture lies in finding the correct number and combination of distinct layer types, each involving its own setup parameters. Since multi-objective evolutionary algorithms such as the MOEA-D work with the concepts of individual and population, each solution (or individual) is represented as a vector of values considering the distinct variables involved in the problem. Therefore, a vector of 7 elements was used for the representation of a basic CNN architecture by considering the most common CNN layer types.
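The decomposition idea behind the MOEA-D can be sketched briefly. The following Python fragment is illustrative only (the authors' implementation is in Matlab, and the objective values shown are hypothetical): it shows the Tchebycheff scalarization that MOEA-D commonly uses to turn the two objectives into a family of single-objective subproblems, one per weight vector.

```python
def tchebycheff(f, weights, z_star):
    """Tchebycheff scalarization: each weight vector defines one
    single-objective subproblem over the two goals (classification
    error, normalized parameter count), all to be minimized."""
    return max(w * abs(fi - zi) for w, fi, zi in zip(weights, f, z_star))

# Hypothetical normalized objective vectors (error, complexity):
f_a = (0.10, 0.30)
f_b = (0.05, 0.90)
z_star = (0.0, 0.0)          # ideal point
w = (0.5, 0.5)               # one evenly spread weight vector

# Under balanced weights, f_a wins this scalarized comparison
print(tchebycheff(f_a, w, z_star) < tchebycheff(f_b, w, z_star))  # True
```

Each weight vector thus defines its own notion of "best", which is what lets MOEA-D cover the whole trade-off front without manual weight tuning.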
Figure 6 illustrates the design approach used for the representation of distinct CNN layer types.
According to Figure 6, distinct CNN layers are represented as a vector of values requiring important considerations, which are described as follows. The first cell stores a discrete value to decide whether a convolution layer will be present (1) or not (0). The second cell stores the convolution kernel size, which is constrained to one of two predefined sizes depending on whether the stored value is 0 or not [28]. The third cell governs the number of convolution filters: it stores a discrete value in the range [1, 5] used as a power of 2, which yields 2, 4, 8, 16 or 32 convolution filters [43]. The fourth cell indicates whether a normalization layer will be added (1) or not (0). The fifth cell controls the activation and type of the rectified linear unit layer, where 0 means that a ReLU layer will not be added, 1 adds a ReLU layer, and 2 adds a LeakyReLU layer. The sixth cell activates the addition of a dropout layer if the value is 1; if the value is 0, the dropout layer will not be added. The seventh cell is a discrete value indicating whether a maxpooling layer will be added (1) or not (0). The size of a pooling layer can be established arbitrarily; however, CNN architectures such as VGG and ResNet have demonstrated that applying a half uniform downsampling strategy produces optimal classification results, decreasing the computational load, memory and CPU/GPU resources [44].
By considering six distinct pooling sizes, one per block, the previous structure is repeated 6 times to form the full representation of the CNN architecture as a vector of 42 elements, as illustrated in Figure 7.
In addition, several constraints were added to the architecture representation as follows. It was strategic to set the stride parameter of convolutional layers to a value of 2 in order to add stabilization and performance improvement [26,45]. Even when the CNN architecture representation is intended to downscale on each block of seven cells, the insertion of a pooling layer that downscales the inputs to subsequent layers depends on the activation of the corresponding pooling cells. If any of those cells is not active (does not hold a value of 1), the inputs remain at the last downscaled size from the previous layers. In addition, one of the encoded cells does not have an effect on the process, but it is present to keep the CNN representation structure. This design strategy allows adding diversity to the solutions because the CNN architectures are not constrained to a single form. After a solution is produced, it is important to finish the CNN architecture by adding the corresponding flatten and dense layers, which are responsible for the classification task. Since the input of these layer types depends on the output of the previous layer, they were not considered as part of the solution representation.
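As an illustration of the encoding, the following Python sketch decodes one 7-element block into a layer list. The cell ordering follows the description above; the concrete kernel sizes (3 and 5) and the layer labels are assumptions made for illustration, since the exact constrained sizes are not reproduced in this excerpt.

```python
def decode_block(cells, pool_size):
    """Decode one 7-element block of the encoding into a layer list.
    Cell layout (as described above): conv on/off, kernel-size selector,
    filter exponent in [1, 5], normalization flag, ReLU type (0/1/2),
    dropout flag, maxpool flag. Labels and kernel sizes are illustrative."""
    assert len(cells) == 7
    layers = []
    if cells[0] == 1:
        kernel = 3 if cells[1] == 0 else 5      # assumed two-size constraint
        layers.append(f"conv{kernel}x{kernel}-{2 ** cells[2]}f-stride2")
    if cells[3] == 1:
        layers.append("batchnorm")
    if cells[4] == 1:
        layers.append("relu")
    elif cells[4] == 2:
        layers.append("leakyrelu")
    if cells[5] == 1:
        layers.append("dropout")
    if cells[6] == 1:
        layers.append(f"maxpool{pool_size}")
    return layers

print(decode_block([1, 0, 4, 1, 2, 0, 1], pool_size=2))
# ['conv3x3-16f-stride2', 'batchnorm', 'leakyrelu', 'maxpool2']
```

Repeating such a decoding over the six blocks yields the full 42-element architecture described above.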
Based on the representation described before, the proposed method for the NAS stage can be implemented accordingly.
Figure 8 illustrates the hybrid multi-objective evolutionary method schema, which is focused on the automatic design of a lightweight CNN for the classification of positive and negative coronary stenosis cases.
According to Figure 8, the hybrid metaheuristic starts with an initial CNN architecture search driven by the MOEA-D method, which returns a set of trade-off solutions known as the Pareto front. In step 2, a subset of N solutions is selected from the Pareto front to be improved. For this paper, we selected the 3 solutions closest to the origin in terms of the Euclidean distance. In step 3, the SA algorithm is used to improve the previously selected solutions.
It is important to mention that since the SA method is single-objective, a solution is accepted as the best only if the two involved objectives (classification error and number of learning parameters) are minimized simultaneously, which is equivalent to assigning a balanced weight to each of them. Finally, in step 4, the best solution (closest to the origin of the Euclidean space) is selected in accordance with the results obtained by the SA algorithm.
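Steps 2 and 3 can be sketched as follows in Python (illustrative only; the objective values are hypothetical and assumed normalized so that the Euclidean distance to the origin is meaningful across the two objectives).

```python
import math

def closest_to_origin(pareto_front, n=3):
    """Step 2: select the n trade-off solutions nearest to the origin
    of the normalized objective space (Euclidean distance)."""
    return sorted(pareto_front, key=lambda f: math.hypot(*f))[:n]

def sa_accepts_as_best(best, candidate):
    """Step 3 best-solution rule: since SA is single-objective, a
    candidate replaces the incumbent best only if both objectives
    (error rate, learning parameters) are minimized simultaneously."""
    no_worse = all(c <= b for c, b in zip(candidate, best))
    strictly = any(c < b for c, b in zip(candidate, best))
    return no_worse and strictly

# Hypothetical normalized Pareto front: (error, complexity) pairs
front = [(0.05, 0.90), (0.10, 0.30), (0.20, 0.20), (0.40, 0.10)]
print(closest_to_origin(front))
# [(0.2, 0.2), (0.1, 0.3), (0.4, 0.1)]
print(sa_accepts_as_best((0.10, 0.50), (0.08, 0.45)))  # True
```

Note that the SA proposal mechanism itself may still temporarily accept worse neighbors early on; the rule above only governs which solution is recorded as the best found.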
4. Results and Discussion
For the experiments, two distinct publicly available image databases of coronary stenosis cases were used. The first image database contains 608 images corresponding to regions of coronary angiograms, which were properly validated by cardiology experts [46]. All images share the same size in pixels, and the number of positive and negative cases is 304 each.
Figure 9 illustrates a sample of 14 images containing positive and negative coronary stenosis cases.
The second image database involves real and synthetic images corresponding to coronary stenosis regions [47]. All of its images also share a uniform size in pixels.
Figure 10 illustrates a sample of 21 images containing natural and synthetic coronary regions of positive and negative stenosis cases.
All the experiments were carried out on a computer with an Intel Core i7-9700K CPU at 3.60 GHz and 16 GB of RAM. The system also includes an NVIDIA Titan RTX GPU with 24 GB of GDDR6 VRAM, 4608 CUDA cores and 576 Tensor cores. The algorithms and search strategies were implemented in Matlab R2024a.
For the NAS task, driven by the MOEA-D and the SA strategies, the Stenosis608 database was split using the holdout strategy with a proportion of 80–20% for training and testing, respectively. In addition, a k-fold cross-validation strategy was used during the training stage. It is important to mention that only the training partition was used for the NAS process. In addition, the MOEA-D implementation is derived from the Kalami [48] code base. The corresponding setup parameters required for the MOEA-D and the CNN training are described in Table 1.
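The data partitioning can be sketched as follows (a Python illustration; the fold count k is left as a free parameter here because the exact value used in the paper is not reproduced in this excerpt).

```python
import random

def holdout_then_kfold(n_images, n_train, k, seed=0):
    """Shuffle indices, hold out a test partition, then split only the
    training partition into k cross-validation folds, mirroring the
    NAS setup described above (the test partition never enters NAS)."""
    idx = list(range(n_images))
    random.Random(seed).shuffle(idx)
    train, test = idx[:n_train], idx[n_train:]
    folds = [train[i::k] for i in range(k)]   # k disjoint validation folds
    return train, test, folds

# Stenosis608: 488 training images (~80%) and 120 held out for testing
train, test, folds = holdout_then_kfold(608, n_train=488, k=5)
print(len(train), len(test), sum(len(f) for f in folds))  # 488 120 488
```

Keeping the holdout split outside the search loop is what prevents the NAS stage from overfitting the test partition.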
Consequently, for the second step, the value of N, which corresponds to the number of solutions taken from the Pareto front produced by the MOEA-D, was established as 3. The corresponding parameter values for the SA and the CNN training options are described in detail in Table 2.
As described in Table 2, the number of epochs for CNN model training was increased considerably due to the single-solution nature of the SA method. Consequently, it is important to mention that all of the previous parameter values were established as a trade-off between the time required to obtain a response and an adequate fitness value.
After the NAS stage concluded, the results show that the hybrid approach improved almost all of the solutions for all of the metaheuristics, as described in Table 3.
Based on the data described in Table 3, the hybrid approach improves the CNN architecture in almost all cases. Comparing the columns for initial learning parameters (ILPs) and final learning parameters (FLPs) shows that the number of learning parameters decreases in almost all of the CNN models. However, due to the nature of the SA algorithm, in which worse solutions are accepted in the first iterations and several goals are involved, the final result can be affected when an accepted solution improves only one of the goals. For instance, the solutions produced by the Hybrid MOPSO slightly improve the CNN complexity and the validation accuracy; however, its performance is below that of the other methods. Likewise, the first solution of the NSGA-II starts with a CNN architecture consisting of 1026 learnable parameters. When the SA method ends, the number of parameters improves, decreasing to 646, but the validation accuracy also decreases instead of increasing or keeping its initial value. In contrast, the second solution found by the MOEA-D algorithm was improved considerably: it started with 11,910 learnable parameters, and once the SA algorithm concluded, the number of learnable parameters decreased to 3438 while the validation accuracy increased, indicating that the final solution improved the two goals simultaneously. To complement these data, Figure 11 illustrates the Pareto front found by the MOEA-D.
Figure 12 illustrates the CNN architecture for the best solution found by the proposed method.
The CNN architecture found by the proposed hybrid evolutionary multi-objective method is formed by 15 processing layers plus the classification one. As illustrated in Figure 12, the resulting architecture makes use of all common layer types, starting with a convolution in the second layer. Because the stride value was fixed at 2, an initial downscale of the input is performed, halving its spatial size, which increases computational efficiency and shows that the selected strategy was adequate. Consequently, a normalization layer and a leaky ReLU layer were added, respectively. The fifth layer is a convolution layer with 16 filters, followed by a ReLU layer. The seventh layer is also of convolution type with 16 filters; at this point, and because of the fifth layer, the image data size has been halved again. Layers 8, 9 and 10 correspond to normalization, leaky ReLU and dropout types, respectively. The eleventh layer is a maxpooling layer with no padding, to assure that all input pixels of the input feature map (including edges) are covered. The cost of this consideration is that the image size is not reduced; however, this behavior is compensated by the stride value in the convolution layers. The twelfth layer corresponds to a convolution using 16 filters, which decreases the input size once more. The thirteenth layer flattens the output of the previous layer into a data vector of 256 values. Consequently, a fully connected layer consisting of 2 neurons (the number of classes) is added at the fourteenth position. Finally, a softmax activation plus the classification layer are added.
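The downscaling behavior described above follows the standard convolution output-size rule; the sketch below (Python, with illustrative pixel sizes, since the exact dimensions are not reproduced in this excerpt) shows how a stride of 2 halves each spatial dimension.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output side length of a convolution: floor((n - k + 2p) / s) + 1."""
    return (size - kernel + 2 * padding) // stride + 1

# With stride 2 and padding 1, a 3x3 kernel halves a 32-pixel side:
print(conv_out(32, kernel=3, stride=2, padding=1))  # 16
print(conv_out(16, kernel=3, stride=2, padding=1))  # 8
```

This is why the strided convolutions can take over the downsampling role of the non-reducing maxpooling layer mentioned above.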
To assess model stability, the CNN architecture produced by each strategy was trained using 488 random images (≈80%) from the Stenosis608 image database. The remaining 120 instances (≈20%) were used for testing in a later step. In this test, only training images were involved, using k-fold cross-validation over 30 independent trials. The number of epochs was set to 1000 to perform an exhaustive training of the models and increase their reliability and efficiency in terms of the Accuracy metric. The corresponding results are described in Table 4.
Based on the data described in Table 4, the proposed method achieved the highest classification rates in terms of the Accuracy metric. Moreover, there is evidence of the impact of the multi-objective search strategies, since the number of learning parameters in the models decreases significantly, to the order of thousands. All models were also stable, since the standard deviation of the classification accuracy was low in all cases. In order to measure reliability, the previous models were evaluated with additional classification metrics such as Precision, Recall, F1-Score and the Jaccard Similarity Coefficient, as described in Section 2.4, using the trained model with the highest classification rate from the previous test. The corresponding results are presented in Table 5.
According to the results presented in Table 5, there is evidence of the effect of the solution improvement stage performed by the SA algorithm, in which some of the best solutions produced by the multi-objective techniques are taken and improved individually. However, even when some of the compared methods achieved high rates in metrics such as Accuracy, Recall and F1-Score, in clinical practice they fail to accurately classify real positive disease cases, which is critical. As a consequence, the use of the Jaccard Similarity Coefficient was important, because it imposes a high penalty on positive cases classified as negative (False Negatives). Considering this, the proposed method achieved the best classification results compared to the other methods.
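For reference, the evaluation metrics above can be computed from the confusion-matrix counts as follows (a Python sketch with hypothetical counts). Note how the Jaccard coefficient excludes true negatives from its denominator, which is why missed stenosis cases weigh on it more heavily than on Accuracy.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard metrics from confusion counts. The Jaccard Similarity
    Coefficient tp / (tp + fp + fn) ignores true negatives, so missed
    positive cases (false negatives) are penalized heavily."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    jaccard = tp / (tp + fp + fn)
    return accuracy, precision, recall, f1, jaccard

# Hypothetical counts on a 120-image test split:
acc, prec, rec, f1, jac = classification_metrics(tp=55, fp=4, fn=5, tn=56)
print(round(acc, 3), round(jac, 3))  # 0.925 0.859
```

With these counts the Jaccard score sits visibly below the Accuracy, illustrating the stricter penalty on false negatives discussed above.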
To assess the reliability of the designed CNN architecture, a second test was performed using the Antczak image database, which contains natural and synthetic coronary artery stenosis images. For the first test, only natural images were considered, forming a dataset of 244 instances balanced between positive and negative coronary stenosis cases. The dataset was split into 194 and 50 instances for training and testing, respectively. In Table 6, a statistical analysis of the training accuracy, based on the classification performance, is presented. Each model was trained using k-fold cross-validation over 30 independent trials.
Consequently, each architecture was tested by taking its corresponding model that achieved the highest validation classification Accuracy in the previous test. The pretrained models were evaluated using the testing partition consisting of 50 images. The obtained results are presented in Table 7.
Based on the results presented in Table 7, the proposed method achieved the highest rates in almost all metrics. It is difficult for complex CNN architectures to achieve correct training and classification rates with image databases having a low number of instances. Consequently, the design of lightweight CNN architectures was adequate because they achieve superior classification results with datasets containing few instances. Finally, a dataset containing 2788 images was formed. Since the Antczak coronary stenosis database contains 1394 and 122 natural images corresponding to negative and positive cases, respectively, the database was balanced by adding 1272 synthetic images of positive cases generated using the authors' method. Similar to the previous test, the image dataset was split in a proportion of ≈80–20% for training and testing, which corresponds to 2240 and 548 images, respectively. Using the images from the training partition, each model was trained using k-fold cross-validation over 30 independent trials. Based on the results obtained in each trial, a corresponding statistical analysis is presented in Table 8.
Based on the data presented in Table 8, the increase in the number of instances improved the training validation accuracy considerably for all CNN architectures. However, the CNN model found by the proposed method outperformed or equaled the rest of them in all measurements.
According to the data described in Table 9, the CNN architectures produced by all the compared strategies increased their classification rates under all of the measured metrics. This is an important fact for traditional CNN architectures such as ResNet50 and VGG-16 because they tend to improve classification rates as the size of the training image dataset increases. Lightweight models such as MobileNetV2 [29] and EfficientNetB0 [30] also increased considerably on all classification metrics using the testing set. However, the proposed methodology, based on the MOEA-D and the SA methods, performs considerably better than the other compared techniques in all metrics, which provides evidence of the CNN model's reliability when analyzing distinct coronary stenosis image databases.
In order to explore which image features allow the proposed CNN model to classify positive and negative coronary stenosis cases, the Grad-CAM technique was used [50]. Figure 13 presents distinct images from the Stenosis608 database, consisting of 21 positive coronary stenosis cases with relevant features marked as a heatmap by the Grad-CAM method.
According to Figure 13, the image features leading the CNN model to produce a positive classification result are mostly related to the pixels corresponding to arteries. In contrast, for negative stenosis classification results, the model involves features from both artery and non-artery pixels. In Figure 14, the corresponding Grad-CAM results for the Antczak image database are also presented.
Based on the images presented in Figure 14, the Grad-CAM results show that artery pixels were also relevant to the identification of features related to positive coronary stenosis cases, while non-artery information from the image was relevant to the identification of negative coronary stenosis cases.
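The Grad-CAM heatmaps above are obtained by weighting each convolutional feature map by the average gradient of the class score with respect to it and keeping only positive contributions. A minimal pure-Python sketch of that weighting (toy data, no deep learning framework) is:

```python
def grad_cam(activations, gradients):
    """Minimal Grad-CAM: weight each feature map by the mean of its
    gradients, sum the weighted maps, then apply ReLU so only features
    with a positive influence on the class score remain (the 'hot'
    pixels in the heatmaps)."""
    h, w = len(activations[0]), len(activations[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for fmap, grad in zip(activations, gradients):
        alpha = sum(sum(row) for row in grad) / (h * w)   # channel weight
        for i in range(h):
            for j in range(w):
                cam[i][j] += alpha * fmap[i][j]
    return [[max(v, 0.0) for v in row] for row in cam]    # ReLU

# Toy example: one 2x2 feature map with a uniform positive gradient
cam = grad_cam([[[1.0, 0.0], [0.0, 2.0]]], [[[0.5, 0.5], [0.5, 0.5]]])
print(cam)  # [[0.5, 0.0], [0.0, 1.0]]
```

In practice the resulting map is upsampled to the input resolution and overlaid on the angiogram, which is what Figures 13 and 14 display.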
Finally, the time required for each stage of the experiments is worth noting. For instance, in the initial search step, training a single solution consumed an average of ≈85 s; when multiplied by the number of individuals (50) and by the number of generations (100), the total required time was ≈5 days. For the second step, after the MOEA-D method produced the solution set, the SA took ≈2.30 h to improve the 3 best solutions found in the previous step. The time required by this step was considerably low when compared with the initial search step. Moreover, once the model is trained, the classification of a single image takes a considerably short average time. In Appendix A, the corresponding Matlab code to build, train and perform a single test of the CNN architecture of the proposed method is presented.
5. Conclusions
In this paper, a novel method based on a Hybrid Evolutionary Multi-Objective strategy for the automatic design of lightweight CNN architectures was presented. The main contribution is a methodology that can be applied to the automatic design of CNN models able to learn from small datasets without the need for data augmentation or transfer learning mechanisms. This aspect is relevant in distinct contexts such as medicine, where image databases are scarce due to ethical and legal restrictions. The proposed method starts with an exhaustive search stage driven by the MOEA-D strategy over distinct combinations of layer types in order to produce an optimal CNN architecture. In the process, all produced models are validated to be coherent across the distinct layers that are added and connected. For instance, two consecutive ReLU layers do not improve the model, so the second one is removed from the solution. Another type of inconsistency can occur if all of the convolution activation values of an encoded solution have a value of 0, which indicates the absence of convolution operations in the CNN model; in this case, the solution is corrected by adding a convolution layer. After the first stage of the search process concluded, the best solutions are selected from the Pareto front and improved independently using the SA algorithm. The hybrid approach produced a CNN architecture consisting of 15 layers and 3438 learning parameters, which is considered lightweight when compared with others such as ResNet50 or VGG-16, whose numbers of learnable parameters are ≈25 million and ≈138 million, respectively. As for the effectiveness of the model architecture, a public database consisting of 608 images corresponding to regions of coronary angiographies was used.
In the test stage, the proposed method achieved the highest rates in terms of the Accuracy, Precision, Recall, F1-Score and Jaccard Similarity Coefficient metrics, with superior results to those of the other techniques from the literature. In a consecutive step, an independent image database consisting of natural and synthetic images was used to test the achieved model architecture. The test started using an image dataset of 244 instances containing only natural images, on which the proposed method again achieved the best classification rates under the same five metrics. In the last test, natural and synthetic images were used to form an image bank of 2788 instances, on which the CNN model maintained its advantage under the same metrics. Moreover, once the model is trained, the average time required to classify a single image was considerably low. The main disadvantage of the proposed method is the time required to reach a solution. The NAS stage is very time consuming, considering that in population-based search methods, each individual (solution) corresponds to a CNN architecture that must be trained independently. Regarding this point, it is important to mention that the proposed method is intended to work with small image databases, which are common in the medical context. Consequently, its use on large image databases could considerably increase the time required to produce an optimal result. Another limitation of the CNN architecture representation in the proposed method is that it only designs feed-forward architectures. As a consequence, future work could focus on allowing non-adjacent connections between the distinct layers of the model and on implementing a strategy to decrease the time required by the NAS stage.
However, according to the presented results, the proposed method could be applied in the automatic design of lightweight CNN architectures for other medical databases involving retina, chest, lung or other human organ images. Finally, based on the obtained results, it can be concluded that the proposed method can be applied to design lightweight CNN models to assist in clinical practice as a part of information systems for computer-aided diagnosis and decision-making support.