Customized Deep Learning Classifier for Detection of Acute Lymphoblastic Leukemia Using Blood Smear Images

Acute lymphoblastic leukemia (ALL) is a rare type of blood cancer caused by the overproduction of immature lymphocytes by the bone marrow. It is one of the most common cancers in children, where it has a fair chance of being cured. However, it can also occur in adults, in whom the chances of a cure are slim if it is diagnosed at a later stage. To aid early detection of this deadly disease, this study proposes an intelligent method to screen white blood cells. The proposed deep learning algorithm takes microscopic images of blood smears as input and uses a convolutional neural network (CNN) to distinguish leukemic cells from healthy blood cells. The custom ALL-NET model was trained and tested on microscopic images available as open-source data. Model training was carried out on Google Colaboratory using an NVIDIA Tesla P100 GPU. The classifier achieved a maximum accuracy of 95.54%, specificity of 95.81%, sensitivity of 95.91%, F1-score of 95.43%, and precision of 96%. The proposed technique may be used for pre-screening to detect leukemic cells during complete blood count (CBC) and peripheral blood tests.


Introduction
ALL (acute lymphoblastic leukemia) is a lymphoid blood cell malignancy characterized by the development of immature lymphocytes [1]. These impaired white blood cells harm the entire body and bone marrow, putting the immune system as a whole at risk. ALL also inhibits the bone marrow's capacity to generate red blood cells and platelets. Moreover, the cancerous cells can enter the bloodstream and cause serious harm to other regions of the human body, including the kidney, liver, brain, and heart, leading to the development of other deadly cancers. According to worldwide statistics from the World Health Organization (WHO)'s International Agency for Research on Cancer, 437,033 cases of leukemia and 303,006 deaths were reported as of 2022 [2]. The blood, bone marrow, and extramedullary sites all show signs of ALL. According to the WHO, this deadly disease is categorized into T-lymphoblastic leukemia (Pre-T), B-lymphoblastic leukemia (Pre-B), and mature B-lymphoblastic leukemia [3]. Mature B-lymphoblastic leukemia starts in the bone marrow and releases an abnormal quantity of white blood cells into the body. These dangerously formed cells are known as "leukemia cells" or "blasts" because they are severely immature. The remainder of the article is organized as follows. The following section describes the workflow and the image processing techniques used, and explains the network. Section 3 presents the results produced by our classifier along with a comparison against other established studies. Section 4 concludes the article.

Table 1. Summary of related leukemia classification studies.

Study  Dataset                                       Method           Accuracy  Sensitivity  Specificity
[25]   DNA sequence images                           Other ML models  75%       --           --
[26]   GRTD dataset                                  VCGNet           96%       93%          93%
[27]   BCCD, ALL-IDB2, JTSC, CellaVision databases   CNN              97%       80%          94%
[28]   LISC and Dhruv dataset                        CNN              97%       80%          94%
[29]   Amreek Clinical Laboratory                    CNN              97.75%    --           --

Materials and Methods
This section explains the workflow, image processing techniques, and network architecture developed for leukemia classification. Figure 1 describes the workflow followed for this work. After acquiring the dataset, augmentation was carried out to increase the size as well as the robustness. Afterward, the image data was fed to the CNN for automated feature extraction. The network then managed to classify the images as either ALL (blast cell) or HEM (healthy cell).

Input Data
The dataset used in this research belongs to the "ALL Challenge dataset of ISBI, 2019" [30][31][32][33]. The dataset contains cell images from both normal individuals and patients diagnosed with ALL. Figure 2 shows an original image acquired from a digital microscope, containing the various components of a blood smear. The CNN classification of leukemia is performed on segmented white blood cell (WBC) regions, which requires preprocessing and color segmentation. The subsequent step finds a better segmentation of the region of interest. Figure 3 shows the blood smear images in the HSI color space, in which the white blood cells show better contrast than the other components of the image. To further localize the white blood cells, we selected the saturation component, as it describes the intensity of the color, shown in Figure 4. This was then converted to a binary image by thresholding, as shown in Figure 5: all grayscale pixels in the range 180-255 were set to white, while all pixels below this threshold were set to black. The segmented image was obtained as the pixel-wise product of the original image and the binary mask, and was then used for further processing, as shown in Figure 6.
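The saturation-based segmentation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes RGB images normalized to [0, 1], uses the standard HSI saturation formula S = 1 − 3·min(R, G, B)/(R + G + B), and applies the 180/255 threshold from the text; near-black padding is explicitly mapped to zero saturation.

```python
import numpy as np

def saturation_channel(rgb):
    """HSI saturation component, S = 1 - 3*min(R,G,B)/(R+G+B).
    rgb: float array in [0, 1] with shape (H, W, 3)."""
    rgb = rgb.astype(np.float64)
    total = rgb.sum(axis=-1)
    s = 1.0 - 3.0 * rgb.min(axis=-1) / np.maximum(total, 1e-8)
    # treat (near-)black padding as zero saturation, not full saturation
    return np.where(total < 1e-3, 0.0, s)

def segment_wbc(rgb, threshold=180 / 255):
    """Binary WBC mask by thresholding the saturation channel, and the
    segmented image as the pixel-wise product of mask and original."""
    mask = (saturation_channel(rgb) >= threshold).astype(np.float64)
    return mask, rgb * mask[..., None]
```

In practice a library such as OpenCV would be used for the color-space conversion; the thresholded mask here plays the role of the binary image in Figure 5.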


A total of 10,661 images were collected from 73 participants in the C-NMC 2019 dataset: 7272 images of blast cells and 3389 images of healthy cells. The images in this dataset are uniform, with a size of 450 × 450 × 3, and had been pre-processed such that only the object of interest (the WBC) was retained, with everything else padded with black. Figure 2 gives a glimpse of the kind of images available in the dataset. Figure 7a shows a deadly blast cell, and Figure 7b shows a normal white blood cell. This dataset is reliable since expert oncologists performed the blast/healthy cell classification.

Data Augmentation
The number of images provided to the neural network plays a pivotal role in the feature extraction procedure. The dataset had an imbalance between the two classes, which would bias the classification. To remove this bias, the images were first auto-oriented and resized, and then augmented by: (1) vertical and horizontal flipping; (2) clockwise and anti-clockwise rotation; (3) random brightness adjustment; and (4) random Gaussian blur with the addition of salt-and-pepper noise. The final dataset consisted of 12,000 images, with 6000 images in each class. Figure 8 depicts images obtained after augmentation.
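A minimal sketch of one such augmentation pass, using only NumPy. The exact parameters (flip probability, rotation angles, brightness range, noise density) are illustrative assumptions, not the values used in the study, and the Gaussian blur step is omitted to keep the sketch dependency-free.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """One random augmentation pass mirroring the steps in the text:
    flips, 90-degree rotations, brightness jitter, salt-and-pepper noise.
    img: float array in [0, 1] with shape (H, W, C)."""
    out = img.copy()
    if rng.random() < 0.5:                          # (1) vertical/horizontal flip
        out = out[::-1] if rng.random() < 0.5 else out[:, ::-1]
    out = np.rot90(out, rng.integers(0, 4))         # (2) CW/CCW rotation
    out = np.clip(out * rng.uniform(0.8, 1.2), 0, 1)  # (3) brightness adjustment
    noise = rng.random(out.shape[:2])               # (4) salt-and-pepper noise
    out[noise < 0.01] = 0.0                         # pepper
    out[noise > 0.99] = 1.0                         # salt
    return out
```

Applying several such passes per source image is one way the dataset could be grown to the balanced 12,000-image set described above.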


CNN
A CNN is a sequence of layers that transforms an image volume into an output volume through a differentiable function. The architecture of the CNN was inspired by the visual cortex of the brain. The architecture fits the data well because of the reduced number of parameters involved and the reusability of weights. A convolutional network contains several types of layers, including convolution (CONV), pooling (POOL), and fully connected (FC) layers.

Convolution (CONV) Layers
The convolution layers are the main building blocks of a CNN. They comprise a set of independent filters, which are convolved with the input volume to compute activation maps of neurons. The useful features of the input images are extracted by the multilayered architecture. Each filter can be of a different type and extracts different features, such as vertical lines, horizontal lines, and edges. The CNN layers extract features through convolution, and the extracted deep features play a major role in the decision support system.

The 2-D convolution of an input x with a filter h is given by Equation (1):

y[m, n] = x[m, n] ∗ h[m, n]  (1)

which expands to the sum in Equation (2):

y[m, n] = Σ_i Σ_j x[i, j] · h[m − i, n − j]  (2)

where x[m, n] is the input, m and n are the numbers of rows and columns, respectively, and i and j are the row and column indices. The size of the image after convolution is given by Equation (3):

o = (m − n + 2p)/s + 1  (3)

where m is the number of input features, n is the convolution kernel size, p is the padding, and s is the stride.

• Pooling (POOL) Layer

Convolutional neural networks often use pooling layers to decrease the representation size and speed up computation. A pooling layer summarizes the activity of local patches of nodes in the convolutional layers. Pooling can be done in two ways: max pooling and average pooling. In max pooling, the maximum value in each patch of the feature map is kept while the others are discarded; the intuition is that the maximum value has the most impact within that patch, so the other values can be discarded. Average pooling follows the same procedure, except that the average value of the patch is taken instead of the maximum.
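Equation (3) and the max pooling operation can be illustrated in a few lines of NumPy; this is a generic sketch, not the ALL-NET implementation.

```python
import numpy as np

def conv_output_size(m, n, p=0, s=1):
    """Spatial output size of a convolution, Equation (3):
    o = (m - n + 2p)/s + 1, for input size m, kernel size n,
    padding p, and stride s."""
    return (m - n + 2 * p) // s + 1

def max_pool2d(x, k=2):
    """Non-overlapping k x k max pooling: keep only the maximum
    value of each patch of the feature map."""
    h, w = x.shape[0] // k * k, x.shape[1] // k * k
    patches = x[:h, :w].reshape(h // k, k, w // k, k)
    return patches.max(axis=(1, 3))
```

For example, a 450 × 450 input convolved with a 3 × 3 kernel at stride 1 and no padding yields a 448 × 448 map, which a 2 × 2 max pool then halves in each dimension.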

• Fully Connected (FC) Layer

The FC layer takes the output of the previous layers and flattens it into a single vector that is connected to the next stage. The network ends with a softmax layer, which predicts the correct label (0 or 1); the output layer gives the final probability for each class. The fully connected part of the CNN learns the most accurate weights through backpropagation, and the weights each node receives determine its label. Since this is a binary classification task, each instance is assigned either 1 or 0.

• Batch Normalization

Batch normalization decreases the covariate shift, i.e., the amount by which the hidden unit values shift. If the algorithm is trained to map some input x to some output y, and the distribution of x changes, the prediction will degrade and retraining might be required. Batch normalization allows each layer to learn more independently. An advantage of batch normalization is that higher learning rates can be used, as it ensures that no activation goes too high or too low. It also reduces overfitting, since it has a regularization effect. To improve the stability of the neural network, batch normalization normalizes the outputs of the previous activation layer. The process adds two learnable parameters to each layer: the normalized output is scaled by gamma and shifted by beta. Mathematically, the mini-batch mean is given by Equation (4):

μ_B = (1/m) Σ_{i=1}^{m} x_i  (4)

The mini-batch variance is shown in Equation (5):

σ_B² = (1/m) Σ_{i=1}^{m} (x_i − μ_B)²  (5)

Normalization is given by Equation (6):

x̂_i = (x_i − μ_B) / √(σ_B² + ε)  (6)

• Dropout

Deep neural networks are likely to overfit early on any given dataset. Dropout is a regularization method that approximates training a large number of neural networks with different architectures in parallel. During training, randomly selected neurons are ignored, so in effect each update is performed with a new outlook on the layer. This makes the network more robust, as additional noise is introduced.
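Equations (4)-(6), together with the learnable scale gamma and shift beta, can be written directly in NumPy. This is the per-feature training-time form for a 2-D mini-batch, for illustration only:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization over a mini-batch, Equations (4)-(6).
    x: array of shape (batch, features)."""
    mu = x.mean(axis=0)                      # (4) mini-batch mean
    var = ((x - mu) ** 2).mean(axis=0)       # (5) mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # (6) normalization
    return gamma * x_hat + beta              # learnable scale and shift
```

After this transform, each feature of the mini-batch has approximately zero mean and unit variance before gamma and beta are applied.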

• Loss Function

The model uses the categorical cross-entropy loss function, which measures the performance of this binary classification model, whose output lies between 0 and 1. Categorical cross-entropy compares the distribution of the predictions with the true distribution, where the probability of the true class is set to 1 and the probability of the other classes is set to 0. Categorical cross-entropy is shown in Equation (7):

L = − Σ_i y_i · log(ŷ_i)  (7)

where ŷ is the predicted value and y is the observed value.

• Optimizer

Optimizers are algorithms that adjust the attributes of the neural network, such as the learning rate and the weights, to reduce the loss. The model uses adaptive moment estimation (Adam), which combines root mean square propagation (RMSProp) and the adaptive gradient algorithm (AdaGrad).

The proposed convolutional neural network architecture makes use of convolutional layers, pooling layers, fully connected layers, dropout, and batch normalization. Features are extracted automatically from the input images by the convolutional and pooling layers. Four convolutional layers, four max-pooling layers, and three fully connected layers were utilized. Batch normalization and dropout were applied to counter overfitting and vanishing or exploding gradients. The model has 95,099,266 parameters in total. The architecture of the designed model is shown in Figure 9, and a more detailed description is given in Table 2.
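As an illustration of how these pieces fit together, the following sketch runs one CONV → ReLU → POOL → FC → softmax stage on a single-channel image with random weights. It is a toy stand-in for ALL-NET, which stacks four conv/pool stages and three FC layers with batch normalization and dropout; all sizes and weights here are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def conv2d(x, w):
    """Valid convolution of a single-channel image x with kernel w."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = (x[i:i + kh, j:j + kw] * w).sum()
    return out

def max_pool(x, k=2):
    h, w = x.shape[0] // k * k, x.shape[1] // k * k
    return x[:h, :w].reshape(h // k, k, w // k, k).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def tiny_cnn(img, w_conv, w_fc):
    """Minimal CONV -> ReLU -> POOL -> FC -> softmax pipeline."""
    feat = max_pool(relu(conv2d(img, w_conv)))
    return softmax(feat.ravel() @ w_fc)   # 2-class probabilities

img = rng.random((12, 12))               # toy single-channel "image"
w_conv = rng.standard_normal((3, 3))     # one 3 x 3 filter
w_fc = rng.standard_normal((25, 2))      # 5 x 5 pooled map -> 2 classes
probs = tiny_cnn(img, w_conv, w_fc)
```

The final softmax output is a two-element probability vector, corresponding here to the ALL/HEM decision made by the real network's last layer.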

Performance Metrics
The performance metrics estimated include accuracy, precision, recall, specificity, and F1 score.

Accuracy: the ratio of correct predictions to the total number of predictions, given in Equation (8):

Accuracy = (TP + TN) / (total samples) × 100  (8)

Precision: the ability of the model to return only relevant instances, given by Equation (9):

Precision = TP / (TP + FP) × 100  (9)

Recall: the ability of the model to identify all relevant instances, shown in Equation (10). It emphasizes false negative results and is also called the true positive rate or sensitivity:

Recall = TP / (TP + FN) × 100  (10)

Specificity: an important metric for identifying false-positive results, described in Equation (11):

Specificity = TN / (TN + FP) × 100  (11)

F1 score: the harmonic mean of precision and recall, used to indicate the balance between the two, given in Equation (12):

F1 = 2 × (Precision × Recall) / (Precision + Recall)  (12)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
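Equations (8)-(12) reduce to a few arithmetic operations on the confusion-matrix counts. A small helper, shown with illustrative counts rather than the study's actual results:

```python
def classification_metrics(tp, tn, fp, fn):
    """Equations (8)-(12) computed from confusion-matrix counts;
    all values except F1 are returned as percentages."""
    accuracy = (tp + tn) / (tp + tn + fp + fn) * 100
    precision = tp / (tp + fp) * 100
    recall = tp / (tp + fn) * 100            # sensitivity / true positive rate
    specificity = tn / (tn + fp) * 100
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1
```

For example, counts of TP = 90, TN = 85, FP = 10, FN = 15 give an accuracy of 87.5% and a precision of 90%.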

Model Evaluation
The ALL-NET architecture consists of four convolution layers alternated with max pooling layers, followed by three fully connected layers. A max pooling layer is kept after every convolution to keep the size of the processed image instance to a minimum. This can create a potential problem: excessive max pooling can lead to loss of information or of patterns that do not span a wide enough area. Further, as noted earlier, the data augmentation step already adds noise to the images, and such noise can also affect the max pooling operation. Batch normalization is carried out after every alternate max pooling layer to ensure that the flowing data is normalized and every neuron contributes some input. Finally, dropout is used to avoid overfitting; these layers ensure that multiple neurons with similar weight vectors do not unintentionally learn the same pattern in a given image instance. The learning rate was initially set at 1 × 10^-3, but 1 × 10^-5 was observed to give marginally better improvement during the learning phase. The batch size for the training images was set at 16. The number of epochs was initially selected in the range of 50 to 100; as pointed out in the later sections, the model was found to be prone to overfitting beyond 70 epochs. After further fine-tuning, 65 epochs were selected as the final parameter value.
After image augmentation, the numbers of blast cell and healthy cell images were nearly the same, as shown in Figure 10, so no class imbalance was present. The dataset was given an 80-20 train-test split, and the training split was further divided for 5-fold cross-validation. The best-performing fold was then used for testing on the 20% holdout test split. The cross-validation was carried out to overcome any potential overfitting due to model exposure to one class of images. The proposed model was trained on 12,000 images and tested on 2132 images, and each fold of the 5-fold cross-validation was trained for 65 epochs. We validated the model by training and conducting simultaneous test evaluations over five runs. The augmented data was mixed with the original data to provide a wide range of training samples, which in turn helped the model generalize better. The accuracy and loss curves are described in Figures 11 and 12, respectively. From Figure 12, it can be noticed that the categorical cross-entropy loss decreases heavily over the span of 10-30 epochs.
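The 80-20 split followed by 5-fold cross-validation can be sketched as follows; the shuffling seed and the plain (non-stratified) fold assignment are assumptions for illustration.

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold CV
    over n samples."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]

# 80-20 holdout first, then 5-fold CV on the training portion
n_train = int(0.8 * 12000)               # 9600 training images
splits = list(kfold_indices(n_train, k=5))
```

In practice a stratified split (equal class proportions per fold) would be the safer choice for this two-class setting.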
The model was already performing adequately at that point, but stopping the training there would have led to potential underfitting. Epochs after 35 do not provide much variation in the loss, and a steady decrease can be observed. This behavior of ALL-NET was observed in all five separate training runs. While increasing the number of epochs beyond 65 would have decreased the loss further and simultaneously increased the training accuracy, it would have led to overfitting, as seen when the model is evaluated against the holdout test set. A confusion matrix was used to evaluate the number of true predictions over the two possible classes: "ALL", i.e., blast cells, and "HEM", i.e., healthy cells. Of a total of 2132 images, 1454 were classified as blast cells and 667 as healthy cells. During prediction, the model predicted 1996 images (94.2%) correctly and 5.8% incorrectly. Predictions made per class can be seen more clearly in the confusion matrix in Figure 13, and the results are listed in Table 3. In the first run, the accuracy, specificity, recall, F1-score, and precision were 94.94%, 94.87%, 94%, 94.96%, and 95.95%, respectively. In the second run, they were 94.5%, 95.8%, 93.2%, 93.2%, and 96%; in the third run, 94.72%, 95.8%, 93.2%, 93.2%, and 96%; in the fourth run, 95%, 95.91%, 94%, 94.96%, and 95.95%; and in the last run, 95.45%, 95%, 95.91%, 95.43%, and 94.94%. In Figure 11, the accuracies are plotted against epochs.
Figure 13 shows the confusion matrices of the five separate training runs. Referring to Table 3, the F1 score obtained by the model in each of the five runs is on par with, if not better than, the performance of currently existing models, which we discuss later. A consistent F1 score above 90 in all five runs indicates that the balance between precision and recall is maintained. This balance is quite pertinent for a model such as ALL-NET, which can potentially serve as a preliminary screening tool. Our approach of using a simple CNN network on this dataset has given good results across multiple performance metrics.

Many approaches have been used for classifying leukemia; a few with good performance are discussed in Table 4. Abunadi et al. [34] diagnosed ALL using an ensemble deep learning approach; the combined model achieved an accuracy of 100%. For the C-NMC 2019 dataset, Yongsheng Pan et al. [35] used a neighborhood correction technique to diagnose this fatal condition at an early stage, obtaining an accuracy of 92%. Khandekar et al. [36] used the YOLOv4 algorithm, and their deep learning model obtained a maximum recall of 96%. Christian et al. [37] utilized an attention-based neural network to detect ALL, obtaining a maximum F1-score of 82%. Table 4 compares prior research with the suggested technique on a similar dataset. The need for preprocessing may be a slight drawback of this research, and the model could be made more robust if it were trained with more data. Intelligent algorithms with optimized features are promising for screening various problems in digital pathology [38][39][40], and an improved telehealth framework with an intelligent algorithm would enable remote diagnosis [41][42][43].

Conclusions
A method for early diagnosis of cancer from microscopic images of white blood cells using a CNN has been proposed in this research. Since deep learning approaches do not require manual feature engineering, the model performs exceptionally well compared to traditional image processing techniques. The good performance of blast cell detection is supported by the accurate classification results: a maximum accuracy of 95% was obtained by the custom deep learning ALL-NET classifier. A further benefit is that it operates on all of the available data rather than on a portion specified by a feature vector. This work can help during screening by reducing the rate of error as well as decreasing the computational time. As a result, this research can provide a theoretical framework for a diagnosis support tool for the detection of ALL. Future work includes expanding the dataset with noisy images and very little pre-processing, to address the problem of using actual medical images for prediction. Combining these models with explainability methods would provide useful inferences to practitioners. Algorithms such as YOLOv4, ResNet, and AlexNet can also be explored, since they may perform better on these tasks.