Breast Cancer Detection in Thermography Using Convolutional Neural Networks (CNNs) with Deep Attention Mechanisms

: Breast cancer is one of the most common types of cancer among women. Accurate diagnosis at an early stage can reduce the mortality associated with this disease. Governments and health organizations stress the importance of early detection of breast cancer as it is related to an increase in the number of available treatment options and increased survival. Early detection gives patients the best chance of receiving effective treatment. Different types of images and imaging modalities are used in the detection and diagnosis of breast cancer. One of the imaging types is “infrared thermal” breast imaging, where a screening instrument is used to measure the temperature distribution of breast tissue. Although it has not been used often, compared to mammograms, it showed promising results when used for early detection. It also has many advantages as it is non-invasive, safe, painless, and inexpensive. The literature has indicated that the use of thermal images with deep neural networks improves the accuracy of early diagnosis of breast malformation. Therefore, in this paper, we aim to investigate to what extent convolutional neural networks (CNNs) with attention mechanisms (AMs) can provide satisfactory detection results in thermal breast cancer images. We present a model for breast cancer detection based on deep neural networks with AMs using thermal images from the Database for Research Mastology with Infrared Image (DMR-IR). The model will be evaluated in terms of accuracy, sensitivity and speciﬁcity, and will be compared against state-of-the-art breast cancer detection methods. The AMs with the CNN model achieved encouraging test accuracy rates of 99.46%, 99.37%, and 99.30% on the breast thermal dataset. The test accuracy of CNNs without AMs was 92.32%, whereas CNNs with AMs achieved an improvement in accuracy of 7%. Moreover, the proposed models outperformed previous models that were reviewed in the literature.


Introduction
Despite diverse detection and prevention methods, human diseases have gradually increased for personal or environmental reasons.Cancer, the most common worldwide disease, is a condition in which cells grow irregularly and affect other body parts.Some of the cancers found are breast, prostate, lung, skin, and pancreas.Cancer causes many deaths globally [1].
Breast cancer is the leading cause of death in women if not diagnosed early.The development of early detection and treatment methods for breast cancer has enhanced the possibility of survival.One of the procedures widely used in medicine to detect breast cancer is screening the patient using various imaging modalities, such as mammography, ultrasound, computed tomography imaging (CT), magnetic resonance imaging (MRI), and thermal imaging screening.These modalities have distinct approaches and tools; results depend on different factors, and using more than one method is recommended to verify the results.Mammography is the most common procedure for breast cancer screening; although it is considered the gold standard, it carries many risks for patients, so in recent years thermal imaging has gained attention because of its low risks, unlike mammography [2].The use of digital infrared thermal imaging has not yet become the main tool for detecting early breast cancer since it relies on skin lesions and temperature [3].However, promising results indicate that radiologists can use it with mammography to evaluate the state of the breast for accurate assessment and diagnosis of breast cancer.
Recently, with the advances in machine learning (ML) and deep learning (DL) techniques, the detection and diagnosis of breast cancer using ML and DL are becoming more effective.Moreover, deep attention techniques have attracted more interest in this field [4] because of the promising results achieved when DL algorithms for breast cancer detection were integrated with deep AMs.This directs special attention to prominent areas of the image instead of treating all batches as equally important, which in turn increases the likelihood of improving the performance of the applied techniques.
Although breast cancer detection is advanced, more research is needed to explore different modalities and enhance detection accuracy when using new ML algorithms with deep AMs.
Therefore, this research aims to contribute to the field of breast cancer detection.The main objective of this study is to provide a robust and high-performance model based on DL and AMs to detect breast cancer using thermal imaging, helping physicians in making a more accurate and validated breast cancer tumor diagnosis.
The remaining sections of this paper are organized as follows.Section 2 presents a background overview of the DL methods.Section 3 gives a brief overview of deep AMs.The following section reviews the related literature in breast cancer detection.The proposed method is discussed in Section 5, which includes the experiment design and evaluation methods.Section 6 discusses the experimental results.Finally, Section 7 concludes this paper with the main findings and future directions.

Deep Learning and Convolutional Neural Network (CNN)
Machine learning (ML) is a subset of AI that enables systems to learn and improve automatically from the experiment without programming them manually.It is based on a dataset of examples, experiences, or instructions used for training to help analyze and extract meaning from data [5].The result helps in classification, prediction, and detection, especially in the medical field.However, we can categorize AI innovations into two categories: conventional ML and DL.Conventional ML algorithms are based on system learning through training data in order to develop a trained model.These conventional ML algorithms have enhanced the development of many applications to run automatically and provided many advantages.Yet, the performance of conventional ML algorithms was outperformed by an advanced type of ML algorithms known as DL [6].
DL trains artificial neural networks through learning multiple levels of representation and abstraction that include data such as images, sound, and text [7].DL algorithms effectively handle large rather than limited data.Different categories of DL techniques have been applied in the literature.One of the main DL algorithms is convolutional neural networks (CNNs).
A CNN is an effective and popular DL approach to image-based classification and recognition work for supervised learning.A CNN consists of several layers stacked on top of each other, including convolution, pooling, and fully connected layers, and each layer has a different task.The CNN is trained by providing the appropriate input, and the parameters are calculated layer by layer to produce a final output.The convolution layer determines the presence of certain features or patterns in the input data.It is possible to use several filters to extract different features.At the same time, the pooling layer reduces the dimensions of the feature maps and network parameters, and in the last stage, the fully connected layer connects neurons with all neurons in the preceding layer, which provides essential results such as the category scores in the case of classification [8].
In general, DL has revolutionized computer vision using superhuman accuracy algorithms in various tasks and areas.Several prominent approaches to image processing and classification have been presented, which have improved performance, especially in the medical field.

Deep Attention Mechanisms
Usually, when applied on images, ML algorithms will give equal importance-"attention"-to all patches of an image without giving special attention to prominent areas.By incorporating AMs with ML, the most prominent areas of the image are emphasized, increasing the possibility of improving the performance of the applied techniques.The AMs can highlight the most critical parts of the input information and thus improve its ability to extract the most relevant information for each part of the output and suppress or even ignore completely irrelevant information, resulting in improvements in the quality of the dynamically generated output.AMs were first used for machine translation and then utilized in neural networks.Their use rapidly developed in image processing, answering questions, and machine translation [9].There are three main types of AMs, which we will briefly discuss.

•
Soft Attention: A series of elements are used to calculate categorical distribution.The resulting possibilities reflect the importance of each element and are used as weights to generate context-aware encoding that represents the weighted total of all elements [10].It identifies how much attention should be spent on each element, considering the interdependency between the deep neural network's mechanism and the target, by assigning a weight of 0 to 1 to each input element.The attention layers calculate weights using softmax functions, making the overall attentional model deterministic and differentiable.Soft attention has the ability to act both spatially and temporally.The primary function of the spatial context is to extract the features or weight of the most essential features.It adjusts the weights of all samples in sliding time windows for the temporal context, as samples at various periods contribute differently.Soft mechanisms have a high processing cost while being deterministic and differentiable [11].

•
Hard Attention: From the input sequence, a subset of elements is chosen.The hard AMs force the model to focus solely on the important elements, ignoring all others, where the weight assigned to an input part is either 0 or 1.As a result, the objective is non-differentiable since the input elements are either observed or not.The procedure entails making a series of decisions on which parts to highlight.For example, in the temporal context, the model attends to a portion of the input to acquire information, deciding where to focus in the following step based on the known information.Based on this information, a neural network may decide.Hard-attention mechanisms are represented by stochastic processes since there is no ground truth to suggest the optimal selection policy.Reinforcement learning approaches are required to train models with hard attention since the model is not differentiable [11].In fact, various natural language processing tasks rely solely on very sparse tokens from long text input.Hard attention is well-suited for these tasks because it overcomes the weakness of soft attention in long sequences [10].When compared to soft mechanisms, inference time and computational costs are reduced whenever the complete input is not stored or processed [11].

•
Self-Attention: The interdependence between the input elements of the mechanism is measured because it allows the input to interact with the other "self" and decide what it should pay more attention to.The parallel computation capacity for a long input is one of the self-attention layer's primary benefits over soft and hard mechanisms.This mechanism layer uses simple and readily parallelizable matrix calculations to check the attention of all the identical input elements [11].

Related Work
This section presents a review of the current literature on breast cancer diagnosis and classification using thermal imaging.
In the past few years, a significant number of studies have been conducted using ML and DL to diagnose breast cancer in mammography and histopathology imaging.
A hybrid method using a CNN and LSTM was proposed in [12] to classify 7909 histological images extracted from the BreakHis dataset, after features are extracted using a proposed model fed into softmax and SVM for the decision stage.The best precision of 96% was achieved on a 40× dataset, and the best accuracy of 91% on 200×, which was comparable with the most recent findings.
In [13], the authors proposed a new CNN architecture inspired by the U-net structure of the two CBIS-DDSM mass and macrocalcification mammography datasets, in which whole data from both datasets of 692 mass images and 603 macrocalcification images were trained, while 202 and 152 were tested, respectively.The model achieved an accuracy of 94.31%.
A new model called IRRCNN was proposed in [14] based on a CNN to classify breast cancer (benign/malignant) based on 7909 histopathological images extracted from the BreakHis dataset and to verify the performance of the model.For the test, the model achieved 91.4%, compared to the methods applied to the open-source dataset.The BC-DROID model enabled one-step automatic detection and classification using a CNN; in [15] the model was trained on 10,480 complete mammograms extracted from DDSM data, achieving a detection accuracy of 90%, a classification accuracy of 93.5%, and an AUC of 92.315%.
In [16], a gray-level co-occurrence matrix and a gray-level run-length matrix extracted and fed two inputs into a hybrid deep learning model of a CNN and RNN called a CRNN for the detection of mammographic breast cancer; the result and AND operation outputs of the classifier achieved a diagnostic accuracy of 90.59%, which outperformed the traditional models.
The rapid growth in the development of infrared cameras has recently led to an expansion of research to diagnose and classify breast cancer using thermal images.Many studies have been conducted using ML and DL algorithms on thermal images.
The authors of [17] classified 1052 thermogram images from the University Hospital at The Federal University of Pernambuco.They used several models such as the Bayes network, naïve Bayes, SVM, J48 decision tree, multi-layer perceptron (MLP), random forest (RF), random tree (RT), and ELM.The results showed that MLP gave promising results compared to the other classifiers, with an accuracy of 73.38%, a kappa index of 0.6007, a sensitivity of 78%, and a specificity of 88%.The test was also performed using a 10-fold cross-validation method with a kappa index of 0.6402, and the accuracy was increased to 76.01%.The overall efficiency of the system was 83%.In another study [18], the features SSigFS, FStat, and STex were used to improve the performance of ANN, SVM, KNN, and DT classifiers in diagnosing thermal images from the DBT-TU-JU and DMR-IR datasets.The classifiers' performances in both databases were compared; in DBT-TU-JU, SVM-RBF and ANN achieved the highest accuracy of 84.29%, while in the DMR-IR dataset, ANN and SVM-linear achieved the highest accuracy of 87.50%.
On the other hand, DL algorithms, especially a CNN, have been applied on thermal images and have shown a competitive performance in diagnosing breasts.In one of the studies applying DL on thermal images [19], a CCN was applied for the first time to classify thermal images and was optimized by the Bayes optimization algorithm.It was trained on 1116 images extracted from the DMI dataset and achieved an accuracy of 98.05%.Compared with previous studies using the same dataset, with different features and classifiers, the model achieved a competitive performance on more images.
In another study [20], the authors diagnosed breast cancer from thermal images using a DCNN model.They converted 680 thermal images to grayscale extracted from a Visual lab-IR dataset and then pre-processed, segmented, and classified them.They achieved predictive accuracy of 95.8% with 680 thermograms.This model outperformed the study in [21], which achieved an accuracy of 93.30% with 50 thermograms.In addition, in comparison to a study in 2012 [22] that used the DT and fuzzy classifiers, it achieved the highest average accuracy of 93.30% and a significant improvement was achieved in this study.In [23] the authors investigated the combination of GVF breast segmentation and CNN classification to detect breast cancer.They used 63 breast images obtained from the DMR-IR dataset as normal and abnormal.The model was evaluated using a 2-fold crossvalidation approach; it obtained 100% accuracy, sensitivity, and specificity, and achieved a better performance than tree random forest (TRF), MLP, and Bayes Network.
The authors of [24] applied a multi-input classifier model based on CNNs for early detection of breast cancer that combines thermal images of different opinions with the personal and clinical data of 287 patients.They compared the performance of seven models; the results demonstrated that the M.4ncd model had the best accuracy of 97%, AUC ROC of 0.99, a specificity of 100%, and a sensitivity of 83%.Furthermore, the M.4ncd model results were compared with the results of other studies in the literature, and the model outperformed them in AUC ROC and specificity measures.
In [25], CNNs were trained to classify thermal images through static and dynamic protocols.The DMR-IR dataset that was classified through the static protocol contained 300 thermal images.Using the dynamic protocol, 2740 images were classified.The proposed method obtained competitive results in both protocols.The static protocol achieved 98% accuracy for color images and 95% for grayscale, while the dynamic protocol achieved 95% for color images and 92% for grayscale.It outperformed other methods applied to the same dataset.
The reviewed studies using thermal breast cancer images demonstrated that, compared to other image modalities, thermal images achieved good results when used for the detection of breast cancer.Additionally, they were used with both ML and DL methods.
As we have seen, the previous reviewed studies gave equal importance or "attention" to all patches of an image and did not give special attention to prominent areas.These studies were conducted using AMs integrated with DL algorithms.
In [26], the authors introduced a new model of DL for the diagnosis and classification of histopathology images.This consisted of CNN and RNN models and used a parallel structure of two different types of neural networks-a CNN (DenseNet) and an RNN (LSTM)-with AMs conducted on three datasets.They applied the models on three datasets, namely, BACH2018, Bioimaging 2015, and Extended Bioimaging 2015.The accuracy rates achieved on those three datasets were 92%, 96%, and 97.5%, respectively.Similarly, in [27], the BreastNet model was built based on CNNs and AMs to classify 7909 tissue images extracted from the BreakHis dataset.The best results were achieved using 200X images with an accuracy of 98.51%, sensitivity of 98.70%, and F1-score of 98.28%.
In another study [4], the authors introduced a new CNN-based approach that includes two separate attention units to learn a more effective feature for classifying 7909 histopathology images from the BreakHis dataset.The model achieved 98% accuracy when amplifying 100×, which outperformed the latest methods performed on the same dataset by a large margin.
The authors of [28] suggested a CNN-based, fine-grained image model to extract features and a novel AM for localizing objects performed on histology images from Peking Union Hospital and the BACH benchmark dataset from the 2018 Grand Challenge in Breast Cancer Histology images.The performance of the proposed method was compared with five recently reported methods on the histology dataset; results showed that the proposed model in this study achieved an accuracy of 96%, with an improvement of 2% compared to the others.
To identify potential lesion areas from the images and to avoid processing the original image, the authors of [29] proposed a model based on the deep attention network to classify histological images extracted from the BreakHis dataset.The model showed a 98% accuracy rate in four different amplification types, as well as a 50% reduction in training time compared to the POMDP-based approach.
In [30], a novel AM was developed to classify high-resolution histological images of breast tissue using a CNN and BLSTM.The model was trained on the ICIAR 2018 database.The experimental results revealed that the model achieved 85.50% accuracy for classification in four classes and 96.25% in two classes.The model was also compared with and without the AM.Moreover, it was compared with state-of-art methods, and it outperformed other methods when applied with and without the AM.
The authors of [31] improved CNNs by integrating an innovative SE-Attention mechanism to classify 18,157 collected mammograms and built a new benchmarking dataset.The model outperformed other studies with an accuracy of 92.17%.
To improve the performance on the BreakHis dataset, the authors of [32] suggested a tissue image classification framework for breast cancer based on RESNET with a convolutional block attention module (CBAM).The proposed model was used to classify 7909 images; it showed a significant improvement by achieving an accuracy of 92.6%, sensitivity of 94.7%, specificity of 88.9%, F1-score of 94.1%, and AUC of 91.8% for 200× magnification factor.In addition, results of ResNet50 and ResNet-50 with a CBAM at patient-level and image-level were compared to the proposed method and others, showing a significant improvement in accuracy.
In conclusion, Table 1 includes a brief review of related works, and we can infer that DL with deep AMs has not been applied on thermal images for breast cancer diagnosis.On the other hand, based on studies reviewed in the previous subsection, breast cancer detection in thermal images has shown promising accuracy results compared to other modalities.This encourages investigation of the effect of integrating DL methods with AMs in detecting breast cancer in thermal images.Thus, our contribution in this work is exploring and comparing the use of different types of deep attention mechanisms with CNNs in the detection of breast cancer in thermal images.Based on our review of the related works this approach has not been previously used.

Materials and Methods
This section describes how we achieved our goals and objectives practically.An experimental research approach was used to answer the following research question: "To what extent can deep learning techniques with deep AMs improve performance for the task of breast cancer detection in thermography images?" Our proposed method introduces the AMs and the approach of CNNs to improve breast cancer detection and classification using thermal images through five components: preprocessing, feature extraction, BLSTM, AMs, and image classification.Figure 1 illustrates the steps in the proposed methodology.Next, we will provide details on these stages.

Materials and Methods
This section describes how we achieved our goals and objectives practically.An experimental research approach was used to answer the following research question: "To what extent can deep learning techniques with deep AMs improve performance for the task of breast cancer detection in thermography images?" Our proposed method introduces the AMs and the approach of CNNs to improve breast cancer detection and classification using thermal images through five components: preprocessing, feature extraction, BLSTM, AMs, and image classification.Figure 1 illustrates the steps in the proposed methodology.Next, we will provide details on these stages.

Preprocessing
In this stage, we used segmented grayscale images from DMR-IR datasets that were preprocessed by employing cropping, normalization, and resizing.In post-preprocessing these thermal images were transformed from the original 640 × 480 dimensions into 56 × 56 dimensions, which is the resolution often utilized in CNN applications [25].In order to minimize overfitting, all images were enhanced in the following stage to provide a more significant size dataset for training After the dataset was pre-processed, we divided it into training and testing sets using a stratified 10-fold cross-validation technique.The data were divided into ten roughly equal segments, with one subset used as a roughly equivalent test set.

Preprocessing
In this stage, we used segmented grayscale images from DMR-IR datasets that were preprocessed by employing cropping, normalization, and resizing.In post-preprocessing these thermal images were transformed from the original 640 × 480 dimensions into 56 × 56 dimensions, which is the resolution often utilized in CNN applications [25].In order to minimize overfitting, all images were enhanced in the following stage to provide a more significant size dataset for training After the dataset was pre-processed, we divided it into training and testing sets using a stratified 10-fold cross-validation technique.The data were divided into ten roughly equal segments, with one subset used as a roughly equivalent test set.

Feature Extraction
Feature extraction is desirable in pattern recognition that focuses on distinguishing between the normal and the cancerous breast.We used a CNN as a powerful feature extraction ability to extract an efficient feature set of segmented grayscale images.We constructed a stacking of a series layer as follows: • Image input layer: The image input layer is the connection layer to the network, as it contains the channel size and the input image size.Based on the database containing the grayscale images, the channel size was 1, and the size of the post-processing images was 56 × 56, so the input layer was 56 × 56 × 1.

•
Convolution layer: A set of filters in the CNN processed the input image to generate activation maps.By stacking the activation maps along the depth dimension, the layer output was produced.Three parameters determined the convolution layer structure; the first parameter determined the filter size, and the second parameter determined the feature mappings.Then, via a Stride, we slid the filter over the following receptive area of the same input image and repeated the procedure.The same procedure was repeated until we completed the entire image.The input for the following layer will be the output.

•
Max pooling layer: After convolution, the spatial volume of the input image was reduced using a pooling layer.It was employed between two convolution layers.Max pooling is a reduced feature map that retains the majority of the dominant features from the previous feature map after redundant data have been eliminated.

•
Dropout layer: Overfitting may be avoided by using the dropout technique, which also offers a way of roughly combining many distinct neural network architectures efficiently and exponentially.Dropout refers to units in a neural network that are dropping, both hidden and visible.Dropping a unit out entails removing it temporarily from the network.

BLSTM Layer
It is commonly acknowledged that an image has spatial characteristics; we used a BLSTM network to learn long-term information from spatial features.To make full use of the past and future context information of a sequence in classification, a bidirectional LSTM was employed to extract temporal features from forward and backward order.

AMs Layer
Based on the BLSTM output, we produced various attention values that represented the significance of the feature vector.In this layer, we applied three AMs separately, using the same procedures in the previous and subsequent layers, which were as follows: self-attention (SL), soft attention (SF), and hard attention (HD).

•
SL layer: Processing sequential data with an attention technique that takes each timestamp's context into account.In this paper, it is implemented by package kerasself attention with multiplicative type, by using the following equation [33]: • SF layer: Discredits irrelevant areas by multiplying the corresponding features map with a low weight.Accordingly, a high attention area keeps the original value while low-attention areas get closer to 0 (become dark in the visualization).We use the hidden state C = h t −1 from the previous time step to compute a weight α i for covering each sub-section of an image.We compute a score s i to measure how much attention as the following equation [34]: We pass s i to a softmax for normalization to compute the weight α i .
α i =softmax(s 1 ,s 2 ,. . .s i ) With softmax, α i adds up to 1, and we use it to compute a weighted average for x i Z= ∑ i α i x i • HD layer: This forces the model to focus solely on the important elements, ignoring all others, where the weight assigned to an input part is either 0 or 1.As a result, the objective is non-differentiable since the input elements are either observed or not.To compute HD, using α i as a sample rate to pick one x i as the input to the next layer, instead of a weighted average as in SF [34].
Z ∼ x i , α i

Fully Connected Layer
We combined all the information acquired from the preceding layers across the network to classify the given image.The output was flattened and fed to the fully connected layer from the preceding layers.

Sigmoid Layer
The output of the fully connected layer was converted into binary (0 or 1) using a sigmoid function, which may then be used to interpret as classification probabilities.

Classification Layer
This was the last layer in the network.The layer assigned each input to one of the two mutually exclusive classes (normal or cancer) by using the classification probabilities generated by the sigmoid for each input.After training, the trained network performed the test set classification.
Three AM models, SL, HD, and SF with CNN, were developed, which we called CNN-SL, CNN-HD, and CNN-SF.We used a CNN without AMs for comparison and to define the impact of the AMs on the CNN performance.In the CNN model, all layers were applied except the BLSTM and AMs, while in the CNN models with AMs, all layers were applied, as shown in the Figure 1 dataset.
From our review of the research work on breast cancer detection using thermal images, the most-used dataset of thermal images was DMR-IR, consisting of static and dynamic protocols, which is freely available online for researchers in the field [35].The DMR static protocol recorded only 1 image per patient, consisting of 177 images of healthy patients and 42 images of cancer patients.The dynamic protocol recorded a set of 20 images per patient for 95 healthy patients, for a total of 1900 images, and for the non-healthy cases, there were 42 patients, totaling 840 images.Each patient image had a spatial resolution of 640 × 480 pixels that could be downloaded in JPG or BMP format.The dataset was imbalanced, and we addressed this problem by using popular techniques such as data augmentation and resampling.We augmented the dataset, which is a significant factor of most DL classification models since it improves classification accuracy by decreasing bias because the models have enough data to train on.The augmented techniques used: Consequently, we had two augmented versions of each original image and the dataset size increased to 4146 images.

Evaluation Metrics
To evaluate and compare the performance of the different developed models of breast cancer detection using AMs and deep neural networks in thermal breast images, these evaluation metrics were used: accuracy, sensitivity, specificity, precision, and F1-score.The equations for these metrics are given as follows [36]: Accuracy = (TP + TN)/(TP + TN + FP + FN) (1) Sensitivity (also known as Recall) = TP/(TP + FN) Precision = TP/(TP + FP) F1-score = 2 (Precision × Recall)/(Precision + Recall) (5) where TP, TN, FP, and FN indicate true positive, true negative, false positive, and false negative, respectively.The accuracy is the proportion of the total number of correctly classified cases.It is easy to understand but can be misleading, especially if you have an uneven class distribution.Specificity is the fraction of negative occasions.Precision is the ratio of correctly positively labeled (abnormal) cases to all positively identified cases.Sensitivity (recall) is the proportion of positive (abnormal) cases correctly identified as positive.On the other hand, the F1-score is considered a better indicator of the classifier's performance than accuracy.It sums up the predictive performance of a model by combining the two competing metrics, precision and recall [36].
The model's ability to distinguish between classes is measured by the AUC ROC, which typically has a value range between 0.5 and 1.In the medical context, it assesses the model's diagnostic performance; values closer to 1 indicate good test results, while values closer to 0.5 indicate poor test results [24].Moreover, the developed model was compared with the state-of-art studies discussed in Section 4, Related Work.

Results
We evaluated the developed models and compared them to the methods in the literature.We compared the performances of the developed models using SL, HD, and SF AMs with a CNN on the DMR-IR dataset of thermal images, which resulted in four different models (CNN, CNN-SL, CNN-HD, CNN-SF).To evaluate our proposed approach against other approaches, we compared the achieved results of the four models with the state-of-the-art methods reviewed in the literature review section.
The remaining parts of this section are organized as follows: Section 6.1 will briefly describe the experimental settings.Section 6.2 describes the classification results, and Section 6.3 discusses the comparisons of CNNs with different AMs.Finally, the fourth part will compare the developed models to other DL-based models in the literature.

Experimental Settings
The proposed models were trained for breast cancer diagnosis on thermal images based on the dynamic protocol from the DMR-IR dataset.The dataset contains 1542 thermal images divided into carcinogenic (762) and non-carcinogenic (780) breasts for 56 patients.The dataset was split into training and testing sets of 1302 and 240 thermal images, respectively.To overcome the problem of the small dataset size, we expanded the training dataset size using image augmentation techniques, namely, rotation and brightness change.We rotated the original images 90 degrees.We adjusted the brightness, a color augmentation technique, of the images so that the resulting image was lighter than the original images.Thus, for each original image we had two versions of augmented images.Table 2 details information about the dataset.The training and test datasets were united as inputs (images) and targets (labels), and then we used a 10-fold cross-validation technique for training and testing the models.Following stratified sampling, the source data were split into 10 roughly equal disjointed segments.One subset of the test sets equivalent to 415 thermal images and labels was used to verify the performance of our modes, and the others were training sets equivalent to 3731 thermal images and labels.
Pretraining of these networks used a binary class of the DMR-IR dataset to obtain the initialization parameters of the models.All AMs were trained using the Adam optimizer, and we used the ReLU and sigmoid activation functions.To further improve the accuracy of the image classification results, periodic learning was used.The learning rate of the four models was 0.001.The batch size was set to 64 for the fitting models, and the number of epochs was set to 50.

Classification Results
Experiments were conducted to classify the thermal images of breast cancer through the proposed models using the CNN model with/without the self, hard, and soft AMs (CNN, CNN-SL, CNN-HD, CNN-SF).We resized all images to 56 × 56 pixels, which is the resolution often utilized in CNN applications [25].Furthermore, we used stratified 10-fold cross-validation to ensure that each image was evaluated in both the test and training sets.Additionally, it minimized generalization issues introduced by overfitting.The performance of the proposed models was evaluated on the DMR-IR dataset, and the results were compared to determine the most accurate approach.Table 3 shows the accuracy, specificity, sensitivity, AUC ROC, and Cohen's kappa scores of each model.Table 3 shows the average results of the CNN with/without AMs to observe the effect on the performance of the CNN model.The CNN model without the use of AMs achieved the lowest performance among the models; nevertheless, it showed good results.We found that using AMs with the CNN model had a high impact on the improved performance of the CNN in all metrics.
The results also showed a convergence between three AM models with the CNN.The CNN-HD indicated that the use of hard attention with the CNN had a slightly higher effect on metrics such as accuracy, specificity, sensitivity, and F1-score.The CNN-SF performed higher than other models in the precision metric.As we know, recall is a critical measure in medical applications; accordingly, the proposed model results achieved encouraging results.Furthermore, we offered the results of the average AUC ROC of 10-fold cross-validation, where the AUC score represents the ability of the model to distinguish between classes, in our case, normal and cancer.The CNN models with the three AMs achieved convergence in results, which were about 0.999, while the CNN was 0.851.Figure 2 below plots the ROC curves of four models for each fold and shows the AUC results and average.Cohen's kappa is a robust statistic that can be utilized for intra-and inter-rater reliability evaluation.Similar to correlation coefficients, it can have values between -1 and +1, where 0 denotes the degree of agreement predicted by chance and 1 denotes perfect agreement between the raters.The kappa is a standardized value interpreted in the same way across various studies, such as in all correlation statistics.Cohen recommended using the following scale to interpret the kappa result: values ≤ 0 indicating no agreement, and 0.01-0.20 as none to slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1.00as almost perfect agreement [37].The CNN models with the three AMs presented a great performance in this metric, achieving 0.98 for all mechanisms, which rep- Cohen's kappa is a robust statistic that can be utilized for intra-and inter-rater reliability evaluation.Similar to correlation coefficients, it can have values between −1 and +1, where 0 denotes the degree of agreement predicted by chance and 1 denotes perfect agreement between the raters.The kappa is a standardized value interpreted in the same way across various studies, such as in all correlation statistics.Cohen recommended using the following scale to interpret the kappa result: values ≤ 0 indicating no agreement, and 0.01-0.20 as none to slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1.00as almost perfect agreement [37].The CNN models with the three AMs presented a great performance in this metric, achieving 0.98 for all mechanisms, which represented almost perfect agreement.The result of the CNN alone was 0.69, which represented substantial agreement.We provided aggregate Cohen's kappa scores of 10-fold cross-validation.
Figure 3 shows the aggregate confusion matrix of 10-fold cross-validation for the four models, CNN, CNN-SL, CNN-HD, and CNN-SF, on the DMR-IR dataset.In the aggregation confusion matrix, through 10-fold cross-validation of a true label and a predicted label, each fold had 204 normal and 210 cancer images.For the CNN, the 1626 normal breast class images were correctly classified, with 420 misclassifications, and in the cancer class, the 1895 images were correctly classified, with 205 images misclassified.In contrast, for the CNN-SL, 2036 normal breast class images were correctly classified, and 10 were misclassified, while in the cancer class, 2082 images were classified correctly, with 18 misclassified.For the CNN-HD, 2040 normal breast images were correctly classified with 6 misclassified, and 2085 cancer images were correctly classified, and 15 were misclassified.In the CNN-SF confusion matrix, 2029 images were correctly classified in the normal breast class, whereas 17 were misclassified, and of the cancer images, 2090 were correctly classified, while 10 were mistakenly classified.
The four tables of the classification report are shown in Figure 4, which also shows the classification performances of the CNN, CNN-SL, CNN-HD, and CNN-SF in terms of In the aggregation confusion matrix, through 10-fold cross-validation of a true label and a predicted label, each fold had 204 normal and 210 cancer images.For the CNN, the 1626 normal breast class images were correctly classified, with 420 misclassifications, and in the cancer class, the 1895 images were correctly classified, with 205 images misclassified.In contrast, for the CNN-SL, 2036 normal breast class images were correctly classified, and 10 were misclassified, while in the cancer class, 2082 images were classified correctly, with 18 misclassified.For the CNN-HD, 2040 normal breast images were correctly classified with 6 misclassified, and 2085 cancer images were correctly classified, and 15 were misclassified.
In the CNN-SF confusion matrix, 2029 images were correctly classified in the normal breast class, whereas 17 were misclassified, and of the cancer images, 2090 were correctly classified, while 10 were mistakenly classified.
The four tables of the classification report are shown in Figure 4, which also shows the classification performances of the CNN, CNN-SL, CNN-HD, and CNN-SF in terms of accuracy, precision, recall, F1-score, and support, where support represents the number of samples.To conclude and summarize the results, when looking at Table 3 and Figures 2 and  3, we observed that early diagnosis of breast cancer through thermal images using CNNs with the proposed AMs was shown to be effective.The classification rate of 84.92%, with a low false alarm rate of 15.07% for the CNN represented the highest false alarm among all models owing to the absence of AMs.At the same time, the CNN-SL achieved a classification rate of 99.32% with a low false alarm rate of 0.67%, while the 99.49% classification rate of the CNN-HD with a low false alarm rate of 0.50% represented the lowest false alarm rate among all models.Finally, the CNN-SF had a classification rate of 99.34% with a low false alarm rate of 0.65%.It was evident from the accuracy obtained utilizing the CNN models with/without the three AMs that high results were achieved in all the proposed models.According to the findings, hard attention was the most accurate, followed by self-attention, and soft attention was third, the last CNN model.

Comparison with Other Models
To evaluate the effectiveness of the proposed approach and the model developed with the latest approaches in the literature, we compared the achieved results with those of other models.Table 4 shows the performance results of the proposed method compared with similar approaches in the literature.Most studies reported the accuracy, and some added additional metrics such as specificity and sensitivity (recall).Few of them reported precision, F1-score, and AUC score.The comparison of our proposed approach with these methods was considered regarding three different themes: (a) comparison with methods using DL on different types of images, (b) comparison with methods using DL on thermal images, and finally (c) comparison with methods using DL and AM on different types of images.To conclude and summarize the results, when looking at Table 3 and Figures 2 and 3, we observed that early diagnosis of breast cancer through thermal images using CNNs with the proposed AMs was shown to be effective.The classification rate of 84.92%, with a low false alarm rate of 15.07% for the CNN represented the highest false alarm among all models owing to the absence of AMs.At the same time, the CNN-SL achieved a classification rate of 99.32% with a low false alarm rate of 0.67%, while the 99.49% classification rate of the CNN-HD with a low false alarm rate of 0.50% represented the lowest false alarm rate among all models.Finally, the CNN-SF had a classification rate of 99.34% with a low false alarm rate of 0.65%.It was evident from the accuracy obtained utilizing the CNN models with/without the three AMs that high results were achieved in all the proposed models.According to the findings, hard attention was the most accurate, followed by self-attention, and soft attention was third, the last CNN model.

Comparison with Other Models
To evaluate the effectiveness of the proposed approach and the model developed with the latest approaches in the literature, we compared the achieved results with those of other models.Table 4 shows the performance results of the proposed method compared with similar approaches in the literature.Most studies reported the accuracy, and some added additional metrics such as specificity and sensitivity (recall).Few of them reported precision, F1-score, and AUC score.The comparison of our proposed approach with these methods was considered regarding three different themes: (a) comparison with methods using DL on different types of images, (b) comparison with methods using DL on thermal images, and finally (c) comparison with methods using DL and AM on different types of images.Several methods were reviewed in the related work section and different DL methods were applied on different types of breast cancer images.Section (A) of Table 4 shows their performance results.As shown in the table, the achieved accuracy of the different methods was above 90%.They used different types of images: histopathology, mammography, and MRI.The best achieved accuracy of 94.31% in [13] was after using a U-Net CNN on mammography breast images.On the other hand, our CNN approaches to thermal images achieved accuracy rates of 99.34%, 99.49%, and 99.32% for SF, HD, and SL AMs, respectively (see Section (D) of Table 4).In addition, the AMs with the CNN achieved higher scores than [16] in sensitivity and specificity metrics, with 9% and 6%, respectively.In the measure of precision, all proposed models, including the CNN without AMs, achieved results higher than [12].The CNNs with the three AMs achieved a competitive result with an AUC ROC of 0.999, while the study of mammograms achieved 0.9931 [15], where our result was higher.This proved that the use of a CNN with an AM on thermal images greatly enhanced the detection accuracy.

Comparison with Methods Using DL on Thermal Images
The literature review section discussed a variety of approaches that used different ML and DL techniques for thermal imaging of breast cancer.Section (B) of Table 4 shows the performance results of the used approaches.
Using a CNN, the achieved accuracy in the reviewed studies ranged from 92% to 98.95%.In [19], the highest accuracy of 98.95% was achieved using a CNN improved by the Bayes optimization algorithm.In contrast, our proposed method using SF, HD, and SL AMs with a CNN achieved accuracy rates of 99.34%, 99.49%, and 99.32%, respectively, as shown in Section (D) of Table 4.
Moreover, a study [20] was conducted on the same DMR-IR dataset and achieved a sensitivity of 99.40%, while the CNN-HD and CNN-SL models achieved 99.71% and 99.52%, respectively, with an increase of about 0.31%.In the AUC ROC and specificity metrics, our proposed model showed results that were competitive with [24].
The comparison of the results of our proposed approach with those of CNNs from similar studies using DL showed promising enhancements in the classification and detection of breast cancer in the thermal images.This confirmed the effectiveness of incorporating attention techniques with DL for enhancing the accuracy of breast cancer detection via thermal image.

Comparison with Methods Using DL and AMs on Different Types of Images
Previously, we compared our method with other methods using DL or using different types of images.Since our aim was to investigate the use of AMs with DL on thermal images, we considered and compared the results achieved in other methods that used AMs with DL but on different types of images.We reviewed several methods using an AM with DL on the two types of image modalities, mammography and histopathology.
Section (C) of Table 4 illustrates that those studies achieved an accuracy above 90%.In [4], the CNN model with AMs achieved the highest accuracy of 99.3% based on histopathology images.As we discussed earlier, our proposed models based on CNNs with three AMs (SF, HD, and SL) achieved competitive performance results on thermal images.Our CNN-HD model outperformed the results achieved in [4] with an accuracy of 99.49%.
The AMs used with the CNNs outperformed in outcomes of sensitivity and F1-score metrics for SL (99.52%, 99.32), CNN-HD (99.71%, 99.49), and CNN-SF (99.21%, 99.36).On the other hand, in the study conducted on histopathology images in [27], sensitivity was 98.70% and the F1-score was 98.28%.In addition, the outcomes of the AMs used with the CNNs showed higher results than [32] for the AUC ROC score, with an increase above 0.08.
Finally, we can conclude that, based on the results discussed in this section, our proposed methods achieved a higher performance than previous DL-based studies, whether using different types of images or using AMs.This comparison also proved that incorporating AMs with DL, whether or not CNNs, on thermal images for breast cancer diagnosis, a methodology that had not been explored, offered promising results and is worth further investigation for additional enhancements.

Conclusions and Future Work
We proposed three AM models with a CNN for use on thermographic images for breast cancer diagnosis.Our proposal proved that AMs improved the performance of the CNN model.Nevertheless, the CNN provided satisfactory results.Moreover, AMs with a CNN outperformed the models in related studies.The proposed breast cancer diagnostic models achieved the following test accuracy rates: CNN 84.92%, self-attention 99.32%, hard attention 99.49%, and soft attention 99.34%.
A limitation of our study is the limited availability of datasets of thermal breast images.In addition, the dataset size is small, and these issues were addressed in our study using augmentation techniques.Therefore, a future extension of this work would be to experiment with datasets of large sizes by merging multiple datasets and exploring multiple augmentation techniques.
In future work, the proposed models will be tested on different datasets of biomedical images.Moreover, AMs will be applied with pre-trained deep transfer learning models, and their results will be compared to the performance of the CNNs with AMs.

20 Figure 3 Figure 3 .
Figure 3 shows the aggregate confusion matrix of 10-fold cross-validation for the four models, CNN, CNN-SL, CNN-HD, and CNN-SF, on the DMR-IR dataset.

Figure 3 .
Figure 3. Aggregate confusion matrix of all folds for each model as: (a) CNN model; (b) CNN-SL model; (c) CNN-HD model; (d) CNN-SF model.

Table 1 .
Breast cancer detection and classification studies.

Table 3 .
Performance of CNN with AMs.

Table 4 .
Comparison of the performance results of all models.