A Deep Learning-Based Approach for the Diagnosis of Acute Lymphoblastic Leukemia

Abstract: Leukemia is a deadly disease caused by the overproduction of immature white blood cells (WBCs) in the bone marrow. If leukemia is detected at an early stage, the chances of recovery are better. Typically, morphological analysis for the identification of acute lymphoblastic leukemia (ALL) is performed manually on blood cells by skilled medical personnel, which has several disadvantages, including a shortage of medical personnel, slow analysis, and predictions that depend on the examiner's expertise. Therefore, we propose Multi-Attention EfficientNetV2S and EfficientNetB3, two state-of-the-art deep learning architectures pretrained on the large-scale ImageNet database, and fine-tune them via transfer learning to distinguish normal and blast cells in microscopic blood smear images. We modified the last block of both models and added additional layers to both. Including this Multi-Attention Mechanism not only reduces the models' complexity but also helps them generalize well. With the proposed technique, accuracy improves and the overall loss decreases. Our Multi-Attention EfficientNetV2S and EfficientNetB3 models achieved 99.73% and 99.25% accuracy, respectively. We further compared the proposed models' performance with other individual and ensemble models. In this comparison, the proposed models outperformed the existing literature and other benchmark models.


Introduction
Leukemia, a malignancy of blood cells in the bone marrow, affects both children and adolescents [1]. Bone marrow is a soft fatty tissue found inside bone cavities. It contains hematopoietic cells, fat cells, blood vessels, fibrous tissue, and fluid. Blood cells are created by stem cells. The growth of blood stem cells leads to the formation of myeloid or lymphoid stem cells. Lymphocytes are a type of WBC produced by lymphoid stem cells. Myeloid stem cells, on the other hand, create platelets, granulocytes, red blood cells, and monocytes. Monocytes and granulocytes are also WBCs. Leukemia is caused by the production of immature WBCs by stem cells. A single immature or blast cell can generate billions of other blast cells [2].
Leukemia is classified as acute or chronic depending on how quickly it progresses. Based on the kind of blood cell involved, acute leukemia is split into acute lymphoblastic

• We propose the Multi-Attention EfficientNetB3 and EfficientNetV2S models to distinguish ALL (unhealthy) cells from hem (healthy) cells;
• We modified the last block of both models and added the Multi-Attention layers to both. Including this Multi-Attention Mechanism not only reduces the models' complexity but also helps them generalize well;
• We added a crop function to remove the unwanted part of each image;
• To address the issue of unbalanced data, we also applied augmentation to expand the dataset;
• Our Multi-Attention EfficientNetV2S and EfficientNetB3 models achieved 99.73% and 99.25% accuracy, respectively, on the test dataset for ALL and hem cells;
• We also compared our models with other CNN models previously used to detect normal and cancerous cells in blood smear images; our Multi-Attention EfficientNetV2S and EfficientNetB3 models provided higher classification accuracy.
Section 2 reviews the literature on previously used techniques for classifying ALL from microscopic blood smear images. Section 3 describes the dataset, data preprocessing, and augmentation techniques, and briefly discusses the EfficientNetB3 and EfficientNetV2S pre-trained CNN models. Section 4 presents the experimental results and discussion. Finally, Section 5 concludes this article and outlines future work.

Related Work
Deep learning has attracted worldwide attention through its application in different domains, including brain tumor detection [14], intrusion detection [15][16][17][18], and multi-object fuse detection problems [19]. Kasani et al. [9] presented an ensemble approach based on transfer learning to classify cancerous and normal cells. They normalized pixel values to the range 0 to 1 to avoid errors during training and used several data augmentation techniques to address the problem of imbalanced data. Their ensemble model, consisting of NASNetLarge and VGG16, achieved 96.58% overall accuracy. Zakir Ullah et al. [20] proposed using the VGG16 architecture to detect healthy and blast cells in blood smear images. They combined VGG16 with the Efficient Channel Attention (ECA) module to learn semantic features that concentrate on the image's informative region, and applied image preprocessing steps such as data augmentation, image resizing, and data normalization. The VGG16 + ECA model obtained an overall accuracy of 91%.
Computers can directly estimate FFR values from coronary images obtained by CT angiography thanks to a deep neural network approach, known as TreeVes-Net, proposed in [21]. The system captures coronary geometric information related to blood flow with the help of a tree-structured recurrent neural network (RNN). In tests on 13,000 artificial coronary trees and 180 actual coronary trees from clinical patient data, the authors obtained areas under the ROC curve (AUC) of 0.92 and 0.93.
To create LGE-equivalent images and segment diagnosis-related tissues, the authors of [22] presented Progressive Sequential Causal GANs. PSGAN has three distinctive characteristics: a progressive framework, a sequential causal learning network, and two self-learning loss terms (synthesis and segmentation). The researchers obtained an overall segmentation accuracy of 97.17% with a correlation coefficient of 0.96 for the scar ratio.
A new method named PMD was suggested in [23] as the first to enable VBDI in each of several intracoronary imaging modalities. PMD uses a MIMT to address a typical SIST learning problem, a strategy for handling the heterogeneity of vessel environments. It is built on a purpose-designed structure-deformable neural network, which broadens the information available for learning given the scarcity of clinical data, and on a new bidirectional pyramidal network, which perceives vessel regions of varying sizes. Extensive experiments demonstrate the efficacy of the PMD approach on intracoronary images.
Jing et al. [24] introduced the VIT-CNN ensemble approach, which combines EfficientNetB0 and a vision transformer model for B-lymphoblast detection. They resized and normalized the images to avoid overfitting, and used a different enhancement data sampling (DEDS) technique to increase the number of images in the dataset. The VIT-CNN ensemble network attained 99.03% accuracy on the test set. Sahlol et al. [2] presented a hybrid technique that combines a CNN-based VGGNet model with the Salp Swarm algorithm (SASSA). In this hybrid approach, a pre-trained VGGNet model was used to extract features, while SASSA was used to select significant features, eliminate noisy ones, and improve the model's accuracy. An SVM was used to classify normal and abnormal cells. The SVM classifier achieved 96.11% accuracy on the ALL-IDB2 dataset and 87.9% accuracy on the ISBI-2019 dataset. For segmentation of the WBC nucleus, UNET and UNET++-based techniques were introduced in recent years to obtain a better classification of normal and blast cells [25,26]. Using microscopic images from the ALL-IDB dataset, Genovese et al. [27] introduced a traditional machine learning strategy on the CNN VGG-16 model for ALL detection. The authors of [28,29] proposed the AlexNet model, based on transfer learning, for detecting ALL in microscopic blood smear images. They also presented data augmentation techniques to address the issue of insufficient data.
Mustafa et al. [30] proposed a majority-voting ensemble technique that combines four models (InceptionV3, ResNet-V2, Xception, DenseNet121) to classify normal and blast cells from the ISBI-2019 dataset. After preprocessing and augmentation, the ensemble model achieved 98.5% accuracy. Genovese et al. [31] introduced two models, HistoCNN and HistoNet, that are based on CNN (VGG-16) architectures. The HistoNet model adopted the features of the HistoCNN model through transfer learning and applied them to the ALL-IDB dataset to classify normal and blast cells. K-means clustering, C-means, Marker Controlled Watershed, and histogram-based thresholding techniques were used to segment the nucleus from WBCs [32,33]. Authors have proposed both individual and ensemble models for detecting ALL in microscopic blood smear images, with ensemble models attaining higher accuracy than individual models. The ResNet101-9 ensemble model [34] achieved 85.11% accuracy and the weighted ensemble of network model [35] achieved 88.3% accuracy.

A. DATASET PREPROCESSING AND AUGMENTATION
The images in the dataset are 450 × 450 pixels. We used a crop function to remove the unwanted part of each image; after cropping and resizing, the images have a resolution of 300 × 300. We did not normalize the images because the EfficientNet model expects pixel values in the range 0 to 255, so no scaling is required. The dataset is imbalanced: there are about twice as many cancer images as healthy images, which can cause problems during training, because the model learns fewer features for the class with fewer images than for the class with more images. Data augmentation is a popular technique that not only increases the number of images but also introduces variation into the dataset through operations such as rotation, contrast enhancement, mirroring with horizontal and vertical flips, and zooming. We used several augmentation techniques to address the imbalance: we rotated images counter-clockwise by 30 and 20 degrees, adjusted brightness randomly in the range [0.2, 1.2], and applied horizontal and vertical flips. Figure 1 shows the augmentation techniques applied to the dataset. After augmentation, the dataset was balanced, with 20,000 images in each class.
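A minimal sketch of the cropping, flipping, and brightness steps described above, using NumPy only. It is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, a plain center crop to 300 × 300 stands in for the paper's crop-and-resize step, and the 20 and 30 degree rotations are left out because they would need an image library.

```python
import numpy as np


def center_crop(image, size):
    """Crop a square of side `size` from the center of an H x W x C image."""
    height, width = image.shape[:2]
    top = (height - size) // 2
    left = (width - size) // 2
    return image[top:top + size, left:left + size]


def random_brightness(image, rng, low=0.2, high=1.2):
    """Scale pixel intensities by a random factor drawn from [low, high]."""
    factor = rng.uniform(low, high)
    scaled = image.astype(np.float32) * factor
    # Keep values in the 0-255 range that the text says EfficientNet expects.
    return np.clip(scaled, 0, 255).astype(np.uint8)


def augment(image, rng):
    """Return flipped and brightness-adjusted variants of one image."""
    return [
        np.fliplr(image),               # horizontal flip
        np.flipud(image),               # vertical flip
        random_brightness(image, rng),  # random brightness in [0.2, 1.2]
    ]


rng = np.random.default_rng(0)
# Synthetic stand-in for one 450 x 450 RGB cell image.
cell = rng.integers(0, 256, size=(450, 450, 3), dtype=np.uint8)
cropped = center_crop(cell, 300)
variants = augment(cropped, rng)
```

In this sketch, the minority class would be passed through `augment` repeatedly until both classes reach the same size.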

B. EFFICIENTNET CNN MODEL
Mingxing Tan et al. [36] introduced EfficientNet in their 2019 paper "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks". The purpose of that paper was to study how to scale neural network architectures to improve accuracy. The depth, width, and resolution of a convolutional neural network can all be adjusted to increase or decrease its size. Depth refers to the number of hidden layers, and it can be changed to suit the problem.
EfficientNet is based on a novel CNN model scaling method that uses a simple yet effective compound coefficient. Unlike traditional techniques that scale individual network dimensions such as width, depth, or resolution, EfficientNet scales all dimensions uniformly with a fixed set of scaling factors. In practice, scaling individual dimensions improves model performance, but balancing all network dimensions with respect to the available resources improves overall performance considerably. The authors devised the compound scaling equations to scale depth, width, and resolution uniformly through this coefficient.
The coefficient ϕ is a user-specified coefficient that controls how many additional resources are available for model scaling, while α, β, and γ are constants, found by a small grid search, that determine how these additional resources are assigned to network depth, width, and resolution, respectively.
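For reference, the compound scaling rule that this paragraph describes is reproduced here from the original EfficientNet paper [36]. The symbols d, w, and r for depth, width, and resolution follow that paper rather than the present text.

```latex
\begin{aligned}
\text{depth: }      & d = \alpha^{\phi} \\
\text{width: }      & w = \beta^{\phi} \\
\text{resolution: } & r = \gamma^{\phi} \\
\text{s.t. }        & \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,
                      \quad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
\end{aligned}
```

Under this constraint, raising ϕ by one roughly doubles the FLOPs of the network.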
The basic building block of the EfficientNet model family is the mobile inverted bottleneck convolution (MBConv), which builds on concepts from the MobileNet [37] model. EfficientNet attains comparable accuracy on the ImageNet database with a smaller model size than other models. In this research, we used the EfficientNet-B3 CNN model. This EfficientNet variant was chosen because it offers a good balance between computational cost and accuracy. Furthermore, instead of the ReLU activation function, this network employs the Swish activation function shown in Figure 2. Swish has a shape similar to the ReLU and Leaky ReLU functions, and so shares some of their performance benefits, but it is smoother than both.
The Swish function is given in Equation (6):

$$f_{swish}(x) = \frac{x}{1 + e^{-\beta x}} \qquad (6)$$

where β ≥ 0 is a parameter that can be learned while the CNN model is being trained. As can be seen in Figure 2, $f_{swish}$ becomes a (scaled) linear activation function if β is equal to 0, and as β goes to ∞, $f_{swish}$ resembles the ReLU function but is smoother. Figure 3 shows the complete procedure of our Multi-Attention Mechanism, and Figure 4 depicts the complete structure of the EfficientNetB3 model.
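The limiting behavior of Swish described above can be checked numerically with a small sketch. It assumes the standard definition x · sigmoid(βx); the function name and the numerically stable branching are illustrative choices, not taken from the paper.

```python
import math


def swish(x, beta=1.0):
    """Swish activation, x * sigmoid(beta * x), in an overflow-safe form."""
    z = beta * x
    if z >= 0:
        return x / (1.0 + math.exp(-z))
    # For negative z, rewrite sigmoid(z) as e^z / (1 + e^z) to avoid overflow.
    ez = math.exp(z)
    return x * ez / (1.0 + ez)


# beta = 0: sigmoid(0) = 0.5, so Swish is the scaled linear function x / 2.
half_linear = swish(2.0, beta=0.0)

# Large beta: Swish approaches ReLU (x for positive inputs, 0 for negative).
near_relu_pos = swish(3.0, beta=50.0)
near_relu_neg = swish(-3.0, beta=50.0)
```

With β = 0 the sigmoid factor is constant at 0.5, which is why the function is linear up to a factor of one half.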

C. EFFICIENTNET V2S
Mingxing Tan and Quoc V. Le [38] introduced EfficientNetV2, a model that trains significantly faster than EfficientNet and is modestly more accurate.
EfficientNetV2 employs progressive learning: image sizes are small when training begins and increase gradually. This approach is based on the observation that EfficientNets train more slowly as image size increases. Progressive learning is not a new idea; it has been used before. The issue is that, in prior usage, the same regularization was applied to images of all sizes. According to the authors of EfficientNetV2, this reduces network capacity and performance. To address this issue, they dynamically increase the regularization along with the image size.
EfficientNets use depth-wise convolution layers, which have fewer parameters and FLOPs but cannot fully exploit modern accelerators (GPU/CPU). To address this issue, the work "MobileDets: Searching for Object Detection Architectures for Mobile Accelerators" proposes a new layer called Fused-MBConv, which EfficientNetV2 adopts. However, because fused layers have more parameters, they cannot simply replace all of the original MBConv layers. To determine the best mix of fused and conventional MBConv layers, the authors use training-aware neural architecture search (NAS). The NAS results show that replacing some of the MBConv layers in the early stages with fused layers improves the performance of smaller models. They also show that a lower expansion ratio for the MBConv layers is advantageous across the network. Finally, smaller kernel sizes with more layers are preferable.

A complete structure of MBConv and Fused-MBConv is given in Figure 5. EfficientNet [36] scales up all stages uniformly using a simple compound scaling rule. According to the authors of EfficientNetV2, this is unnecessary because not all stages need to be scaled to improve performance. They therefore use a non-uniform scaling strategy that gradually adds more layers to the later stages. Since EfficientNets tend to scale up image sizes aggressively, they also add a scaling rule that caps the maximum image size. Despite being up to 6.8 times smaller, EfficientNetV2 trains up to 11 times faster than EfficientNetV1.
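The statement that fused layers carry more parameters than MBConv layers can be checked with simple arithmetic. The sketch below counts convolution weights only and rests on simplifying assumptions: equal input and output channels, and no biases, batch normalization, or squeeze-and-excitation parameters. The function names are hypothetical.

```python
def mbconv_weights(channels, expansion, kernel=3):
    """Weights in an MBConv block: 1x1 expand, depth-wise kxk, 1x1 project."""
    expanded = channels * expansion
    expand = channels * expanded            # 1x1 point-wise expansion
    depthwise = kernel * kernel * expanded  # one kxk filter per expanded channel
    project = expanded * channels           # 1x1 point-wise projection
    return expand + depthwise + project


def fused_mbconv_weights(channels, expansion, kernel=3):
    """Weights in a Fused-MBConv block: regular kxk conv, then 1x1 project."""
    expanded = channels * expansion
    fused = kernel * kernel * channels * expanded  # replaces expand + depth-wise
    project = expanded * channels
    return fused + project


plain = mbconv_weights(channels=64, expansion=4)
fused = fused_mbconv_weights(channels=64, expansion=4)
```

For 64 channels and an expansion ratio of 4, this gives 35,072 weights for MBConv and 163,840 for Fused-MBConv, which illustrates why replacing every MBConv layer with a fused one would inflate the model.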


Multi-Attention Mechanism
Attention mechanisms are used in machine learning to attend to different components of an input vector in order to capture long-term dependencies. We introduce a Multi-Attention Module that combines a module inspired by the Convolutional Block Attention Module (CBAM) with a Weighted Attention Average module. The two attention modules work in parallel and are merged at the end. We modified the last block of both models and added the attention layers to both. Including this Multi-Attention Mechanism not only reduces the models' complexity but also helps them generalize well. After merging the Multi-Attention layers, we pass the output to a fully connected layer (256 units) with ELU as the activation function. The final layer of our model has 2 outputs with softmax as the activation function.
In 2018, the authors of [39] introduced CBAM, which is based on a dual attention mechanism. By combining channel-wise attention with spatial attention, it learns the informative features. The modules are arranged sequentially, starting with the channel-wise module and moving on to the spatial module. Channel attention exploits the inter-channel relationships of features to produce meaningful information. We compute channel attention with a slight modification: $M_c(F)$ is the final output of our channel attention module, $\delta$ denotes the sigmoid function, $W_0$ is the weight of the multi-layer perceptron (MLP) with one hidden layer, and $F^c_{avg}$ and $F^c_{max}$ denote the average-pooled and max-pooled features, respectively.
The spatial attention module works differently from channel attention and concentrates on the image's informative region. We also compute spatial attention with a slight modification: $f^{3\times3}$ is a convolution with a 3 × 3 filter, $\delta$ denotes the sigmoid function, and $F^c_{avg}$ and $F^c_{max}$ denote the average-pooled and max-pooled features. We further modified the end of the spatial module by integrating the Global Weighted Average Pooling (GWAP) method, in which (x, y) denotes the weights at a spatial location in the spatial attention map and d represents the height, width, and number of channels. For feature aggregation, the average score of the weights (x, y) is calculated.
The second attention layer, the Weighted Attention Average, was presented by Felbo et al. [40]. It is computed as:

$$e_t = h_t w_a \qquad (10)$$

where $h_t$ denotes the image representation at time step t and $w_a$ is the weight matrix of the attention layer. The representations are multiplied by the weight matrix to produce the attention importance score for each time step, $a_t$, and the results are then normalized to form a probability distribution over the images. Finally, using the attention importance scores as weights, a weighted sum over all time steps yields the representation vector for the image.
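The Weighted Attention Average described above can be sketched in plain Python. This is an illustration under assumptions, not the authors' layer: `w_a` is treated as a single weight vector, the normalization is taken to be a softmax as in Felbo et al. [40], and the function name is hypothetical.

```python
import math


def weighted_attention_average(h, w_a):
    """Attention-weighted average of T feature vectors.

    h   : list of T feature vectors, each of length D
    w_a : attention weight vector of length D (assumed shape)
    Returns (a, v): attention weights a_t and the pooled vector v.
    """
    # Score per time step: e_t = h_t . w_a, as in Equation (10).
    scores = [sum(x * w for x, w in zip(h_t, w_a)) for h_t in h]

    # Softmax normalization (assumed), shifted by the max for stability.
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    a = [value / total for value in exps]

    # Weighted sum of the feature vectors using the attention weights.
    dim = len(w_a)
    v = [sum(a_t * h_t[i] for a_t, h_t in zip(a, h)) for i in range(dim)]
    return a, v


features = [[1.0, 0.0], [0.0, 1.0]]
uniform_a, uniform_v = weighted_attention_average(features, [0.0, 0.0])
peaked_a, peaked_v = weighted_attention_average(features, [10.0, 0.0])
```

With zero weights every step receives the same attention, so the result is the plain average; a large weight on the first feature concentrates nearly all attention on the first vector.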

D. DATASET DESCRIPTION
In this research, we used the C-NMC_2019 dataset prepared by ISBI and presented in the health imaging challenge [9,20,24,41]. This dataset consists of 10,661 cell images, of which 7272 cancer images were obtained from 47 acute lymphoblastic leukemia patients and 3389 normal images were obtained from 26 healthy persons. ALL and healthy cells have nucleus-to-cytoplasm ratios of approximately 1/5 and 2/5, respectively. As shown in the bottom row of Figure 6, healthy cells on a blood smear appear homogeneous and uniform, round to ovoid in shape, small in size, and with a normal nuclear shape. ALL cells, in contrast, differ in form and size: they are elongated and irregular in shape, with a considerable quantity of chromatin (a mass of genetic material). The size of ALL lymphoblasts varies, and the shape of their nuclei is quite irregular, as seen in the top row of Figure 6. These cells were segmented from microscopic images, and each cell image was collected as a real image. Staining noise and illumination errors introduced during acquisition have largely been corrected.


A. PERFORMANCE EVALUATION METRICS
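We evaluated our models' performance with several metrics: accuracy, precision, F1-Score, sensitivity, and specificity. These metrics are defined as [42]:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$

$$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}$$

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.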
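These metrics can be computed from the four confusion-matrix counts with a short sketch using their standard definitions. The example counts are an assumption for illustration: they take ALL as the positive class and a balanced test set of 2,000 images per class, with the 8 ALL and 3 normal misclassifications reported for Multi-Attention EfficientNetV2S in the Discussion.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Assumed counts: 2,000 ALL and 2,000 normal test images,
# 8 ALL images missed (FN) and 3 normal images flagged as ALL (FP).
metrics = classification_metrics(tp=1992, tn=1997, fp=3, fn=8)
```

Under these assumed counts the values round to the figures reported for that model: 99.73% accuracy, 99.85% precision, 99.60% sensitivity, 99.85% specificity, and 99.72% F1-Score.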

B. EXPERIMENTAL SETUP AND HYPERPARAMETERS
A machine learning engineer can adjust many parameters that govern how a network trains, or even its design, in order to reach the best accuracy and performance of a neural network model. These parameters are known as hyperparameters, and they play a critical role in the overall performance of any convolutional neural network. Although there are guidelines for choosing values for various hyperparameters, hyperparameter tuning is largely an exploratory procedure. Figure 7 depicts the complete structure of the EfficientNetV2S model.

• The learning rate hyperparameter determines how much the network's weights change after each backpropagation pass. We set a learning rate of 0.001 for both models. The learning rate is reduced by a factor of 0.5 if the monitored value does not improve;
• The number of epochs is set to 20 for both EfficientNetB3 and EfficientNetV2S;
• The batch size is set to 16 for both models;
• The patience parameter is set to 1 and the stop patience parameter is set to 3;
• Both models are saved at the point of highest accuracy on the validation set;
• The Adamax optimizer is used for training. It is an extension of Adam, which tries to combine the best parts of the RMSProp and momentum optimizers. In some scenarios, Adamax provides better optimization than Adam;
• Categorical cross-entropy, which is well suited to categorical problems, is used to calculate the loss during training;
• We added an additional batch normalization [43] layer before the fully connected layers;
• The TensorFlow [44] framework and Python 3.7 were used to implement the experiments.
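The interaction between the learning rate, the reduction factor, the patience, and the stop patience listed above can be illustrated with a small framework-free simulation. The semantics are an assumption, since the paper does not spell them out: the monitored value is taken to be a validation loss (lower is better), the rate is halved after `patience` epochs without improvement, and training stops after `stop_patience` consecutive reductions without improvement. The function name and the loss sequence are hypothetical.

```python
def simulate_schedule(val_losses, lr=0.001, factor=0.5,
                      patience=1, stop_patience=3):
    """Return (epoch, learning_rate) pairs until training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    reductions_without_improvement = 0
    history = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember it and reset both counters.
            best = loss
            epochs_without_improvement = 0
            reductions_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                lr *= factor  # reduce the learning rate
                epochs_without_improvement = 0
                reductions_without_improvement += 1
        history.append((epoch, lr))
        if reductions_without_improvement >= stop_patience:
            break  # early stop after repeated fruitless reductions
    return history


# Made-up validation losses: improvement stalls from epoch 5 onward.
losses = [0.5, 0.4, 0.45, 0.3, 0.31, 0.32, 0.33, 0.2]
history = simulate_schedule(losses)
```

In this run the rate is halved at epochs 3, 5, 6, and 7, and training stops at epoch 7 before the final loss value is ever seen.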

C. DISCUSSION
The ISBI-2019 dataset is divided in an 8:1:1 ratio into training, validation, and test sets, respectively. The validation set is used to estimate the model's performance during training and to tune the hyperparameters. The test set, which was not part of the training procedure, is used to report overall accuracy. The Multi-Attention EfficientNetB3 model attained 99.25% accuracy on the test set and the Multi-Attention EfficientNetV2S model achieved 99.73% accuracy. The EfficientNetV2S model thus achieved 0.48 percentage points higher accuracy than EfficientNetB3, as can be seen in Table 1. Training of the EfficientNetV2S model was terminated at epoch 16 after 3 learning-rate adjustments without improvement, as can be seen in Figure 8. Training of the EfficientNetB3 model was terminated after 15 epochs, as can be seen in Figure 9. According to Figures 8 and 9, the training and validation loss curves decrease gradually and converge toward an optimal point, and they show no clear sign of overfitting in our models.
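An 8:1:1 split of this kind can be sketched with the standard library alone. The total of 40,000 images is an assumption based on the 20,000 images per class reported after augmentation; the function name and seed are hypothetical, and the split is a plain random shuffle rather than a stratified one.

```python
import random


def split_indices(n, ratios=(8, 1, 1), seed=0):
    """Shuffle indices 0..n-1 and split them by the given ratio."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    total = sum(ratios)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Assumed dataset size: 2 classes x 20,000 augmented images.
train, val, test = split_indices(40_000)
```

With 40,000 images this yields 32,000 training, 4,000 validation, and 4,000 test images, which matches the 4,000-image test set referred to in the confusion-matrix discussion.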

A confusion matrix is also a good way to measure a model's performance. The diagonal elements indicate outcomes that were classified correctly, while the off-diagonal elements show misclassified outcomes. The confusion matrix of an ideal classifier therefore contains only diagonal elements, with zeros everywhere else. After classification, the confusion matrix compares actual and predicted values. According to Figure 10, our Multi-Attention EfficientNetV2S model misclassified only 11 of 4000 images: 8 ALL images and 3 normal images. The Multi-Attention EfficientNetB3 model misclassified 30 of 4000 images: 10 ALL and 20 normal images. With 19 fewer misclassified images than EfficientNetB3, the Multi-Attention EfficientNetV2S model shows a better ability to make correct predictions.
Figure 11 compares both models across the different metrics. Our Multi-Attention EfficientNetV2S model achieved a precision of 99.85%, which is 0.85 percentage points higher than that of the Multi-Attention EfficientNetB3 model. Similarly, the F1-Score, sensitivity, and specificity are 99.72%, 99.60%, and 99.85% for Multi-Attention EfficientNetV2S and 99.25%, 99.50%, and 99.00% for Multi-Attention EfficientNetB3, respectively.
We also compared the results of our models with those of previously published individual and ensemble models used for the detection of acute lymphoblastic leukemia from microscopic blood smear images. Compared with its family member EfficientNetB0, our Multi-Attention EfficientNetB3 achieves almost 4% higher accuracy on the same dataset. Among individual models, Multi-Attention EfficientNetB3 achieved 0.35% higher accuracy than the vision transfer model. Compared with the ensemble models, it achieved 2.67%, 0.75%, and 0.22% higher accuracy, as shown in Table 2. EfficientNetV2S belongs to the same EfficientNet family as EfficientNetB3 but is an upgraded version. Compared with its family members, Multi-Attention EfficientNetV2S achieved 0.48% and 4.55% higher accuracy than Multi-Attention EfficientNetB3 and EfficientNetB0, respectively, which demonstrates the model's ability to detect leukemia. Among individual models, Multi-Attention EfficientNetV2S achieved 0.83% higher accuracy than the vision transfer model, and compared with the ensemble models it achieved 3.15%, 1.23%, and 0.70% higher accuracy (Table 2).
We also compared EfficientNetV2S and EfficientNetB3 with the Multi-Attention module against another attention-based model [20], VGG16 + Efficient Channel Attention (ECA). Our models again performed better, achieving roughly 9 to 10% higher accuracy on the same dataset, as shown in Table 2.
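As a quick arithmetic check on the margins quoted above, the accuracy of each comparison model can be recovered by subtracting the margin from the accuracy of the corresponding proposed model (99.73% and 99.25%). The sketch below does this for both proposed models and confirms that the two sets of margins imply the same comparison accuracies. The labels vision_model and ensemble_a to ensemble_c are hypothetical placeholders, since the text does not name the models, and the sketch assumes that both sentences list the ensemble models in the same order. Table 2 remains the authoritative source for the actual values.

```python
# Accuracies of the proposed models, as reported in the text (percent).
V2S_ACCURACY = 99.73
B3_ACCURACY = 99.25

# Margins (percentage points) by which each proposed model exceeds a
# comparison model. Keys are hypothetical labels, not names from the paper.
V2S_MARGINS = {"vision_model": 0.83, "ensemble_a": 3.15, "ensemble_b": 1.23, "ensemble_c": 0.70}
B3_MARGINS = {"vision_model": 0.35, "ensemble_a": 2.67, "ensemble_b": 0.75, "ensemble_c": 0.22}


def implied_accuracy(own_accuracy, margin):
    """Accuracy of the comparison model implied by a reported margin."""
    return round(own_accuracy - margin, 2)


implied_from_v2s = {k: implied_accuracy(V2S_ACCURACY, m) for k, m in V2S_MARGINS.items()}
implied_from_b3 = {k: implied_accuracy(B3_ACCURACY, m) for k, m in B3_MARGINS.items()}

for label in V2S_MARGINS:
    print(f"{label}: {implied_from_v2s[label]:.2f}% (from V2S), "
          f"{implied_from_b3[label]:.2f}% (from B3)")
```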

Grad-CAM Analysis
We used images from the testing set for a qualitative Grad-CAM analysis. Grad-CAM is a well-known visualization technique that uses gradients to determine the significance of specific spatial positions within convolutional layers. Because the gradients are calculated with respect to a specific class, the Grad-CAM results for the Healthy and Blast classes clearly display the attended regions. By looking at the locations that the networks deem crucial for class prediction, we examine how well each network utilizes features. In this study, we compare the visualization outcomes of the Multi-Attention networks (EfficientNetV2S + Multi-Attention and EfficientNetB3 + Multi-Attention) with their respective baselines (EfficientNetV2S and EfficientNetB3). Figure 12 illustrates the visualization results.
In Figure 12, the Multi-Attention networks clearly identify the target object better than the baseline networks. This indicates that our Multi-Attention-integrated networks learned to identify the target object in the image dataset well. Comparing images C and F in Figure 12, EfficientNetV2S with Multi-Attention layers focuses more precisely on the target than EfficientNetB3 with Multi-Attention layers and shows a better ability to localize it.
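For readers unfamiliar with the technique, the core Grad-CAM computation can be sketched in a few lines. This is a minimal illustration of the standard formulation, not the code used to produce Figure 12. It assumes that the feature maps of the chosen convolutional layer and the gradients of the class score with respect to those feature maps have already been extracted from the network, and it uses plain nested lists of shape [channels][height][width] so that it has no dependencies. The function name grad_cam and the toy inputs are hypothetical.

```python
def grad_cam(feature_maps, gradients):
    """Return a Grad-CAM heatmap (height x width) scaled to [0, 1].

    feature_maps: activations of a convolutional layer, [K][H][W]
    gradients:    d(class score) / d(feature_maps), same shape
    """
    n_channels = len(feature_maps)
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])

    # Channel weights: global average of the gradients over all positions.
    weights = [
        sum(sum(row) for row in gradients[k]) / (height * width)
        for k in range(n_channels)
    ]

    # Weighted sum of the feature maps, followed by ReLU so that only
    # positions with a positive influence on the class score remain.
    cam = [
        [
            max(0.0, sum(weights[k] * feature_maps[k][i][j] for k in range(n_channels)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Scale to [0, 1] for display; an all-zero map is returned unchanged.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy input: two 2x2 feature maps with hand-picked gradients.
toy_features = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
toy_gradients = [
    [[0.5, 0.5], [0.5, 0.5]],      # channel 0 supports the class
    [[-1.0, -1.0], [-1.0, -1.0]],  # channel 1 opposes the class
]
heatmap = grad_cam(toy_features, toy_gradients)
print(heatmap)
```

In practice the heatmap is upsampled to the input resolution and overlaid on the image, which is how visualizations such as those in Figure 12 are typically rendered.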

Conclusions and Future Work
In this paper, we presented a study in which we used pre-trained models and a transfer learning-based fine-tuning strategy to detect acute lymphoblastic leukemia at an early stage, with the aim of reducing the death rate. For this, we used the ISBI-2019 dataset, which includes both healthy and unhealthy cells. We also applied augmentation techniques to overcome the problem of imbalanced data, which reduces the error rate during training and is necessary for improving model accuracy. Multi-Attention EfficientNetV2S and EfficientNetB3 achieved classification accuracies of 99.73% and 99.25%, respectively. We compared the accuracy of our models with that of other deep learning and ensemble models. Upon comparison, we conclude that our two proposed models provide better outcomes than the existing literature, thus demonstrating their efficiency.
