Classification of White Blood Cells: A Comprehensive Study Using Transfer Learning Based on Convolutional Neural Networks

White blood cells (WBCs) in the human immune system defend against infection and protect the body from external hazardous objects. They comprise neutrophils, eosinophils, basophils, monocytes, and lymphocytes, each of which accounts for a distinct percentage and performs specific functions. Traditionally, the clinical laboratory procedure for quantifying the specific types of white blood cells is an integral part of a complete blood count (CBC) test, which aids in monitoring a person's health. With the advancements in deep learning, blood film images can be classified in less time and with high accuracy using various algorithms. This paper exploits a number of state-of-the-art deep learning models and their variations based on the CNN architecture. A comparative study of model performance based on accuracy, F1-score, recall, precision, number of parameters, and time was conducted, and DenseNet-161 was found to demonstrate superior performance among its counterparts. In addition, advanced optimization techniques such as normalization, mixup augmentation, and label smoothing were also employed on DenseNet to further refine its performance.


Introduction
Blood is a highly specialized tissue that comprises several cell types and plasma components. This essential fluid transports and supplies oxygen and nutrition to the tissues and organs of the body and removes carbon dioxide, ammonia and other waste products. It also aids in various other biological functions, including cell regeneration, clotting, body temperature regulation and immunity. There are four essential components in blood: red blood cells (RBCs), white blood cells (WBCs), platelets and plasma. Among them, RBCs typically account for 40-50% of total blood volume and convey oxygen from the lungs to all other vital tissues [1]. WBCs, on the other hand, can be found not only in blood but also in lymphatic tissues. Even though they make up a small percentage of blood volume, typically around 1% in a healthy person, they constitute the immune system's first line of defense against foreign invaders [1]. WBCs seek out, detect, and bind to foreign proteins in bacteria, viruses, and fungi so as to eliminate them. There are several types of WBCs, each of which plays a different role in the immune response [2].
The complete blood count (CBC) is a widely used blood test for assessing a person's health status. The test entails the determination of RBC, WBC and platelet parameters. Most parameters can be measured by automatic blood analyzers provided by several manufacturers. The WBC count reports the total number of WBCs, and the differential WBC count displays the absolute number or percentage of neutrophils, eosinophils, basophils, lymphocytes and monocytes. These five categories of WBCs can be grouped into granulocytes and agranulocytes, as depicted in Figure 1. Each WBC type has its own specific function, and the alteration of each cell type's number reflects the pathological condition of the patient. The results of the WBC differential count are provided by blood analyzers using differential principles based on granularity, size, biochemical properties, etc. However, blood film examination by skilled and experienced medical technologists is still highly needed [3,4]. Traditional methods of blood film examination for counting white blood cells can be imprecise and time-consuming, and the technical skill of laboratory technicians has a substantial impact on the test's reliability. This paper presents a deep learning approach to image categorization to achieve more robust and accurate results. Similar deep learning approaches are being used in various sectors of medical applications [5][6][7][8].

Although deep learning has become popular for image classification in various medical domains, WBC classification remains a significant task for which many variations of different architectures are employed. Even though this problem has been addressed with many algorithms, a comparative analysis of the different models for evaluation and deployment is still lacking. This paper is focused on providing a comparative analysis of various CNN models used for the classification of blood smear images. Thus, it is worthwhile to investigate and implement such models for identifying and counting WBCs, providing results based on different validation criteria, the number of FLOPs, and time complexity, and suggesting the best model. This study is based on the BCCD dataset, and the models used in this study utilize transfer learning techniques and are fine-tuned using the latest advancements, which can be specifically customized for image classification tasks. Transfer learning is an important method that facilitates the training of accurate models quickly, with fewer data points and at a low cost. The methodology applied in this paper comprises several steps, i.e., data acquisition, data cleaning, and image processing, followed by the implementation of the models.
Several methods have been developed for categorizing the types of WBCs in blood smear images. Most of these methods are based on fuzzy logic, machine learning, deep learning, or a hybrid of the three [9,10]. Saraswat et al. [9], in 2014, and Kumar et al. [10], in 2020, surveyed research works, the majority of which employed locally accessible datasets and built tailored models for classifying the data. The distributions of classes in the datasets used in the literature are shown in Figure 1. Most of the studies used an open dataset, whereas other papers did not divulge their dataset; however, open datasets are preferred, as they allow comparison with past publications. Support-vector machines (SVMs) [11] and Bayesian classifiers [12] can be used to classify the data using machine learning. Compared to other models, these models can perform well, although they require large amounts of data and pre-processing in advance. Hegde et al. [13] compared a conventional image classification method with a convolutional neural network (CNN). Although the results are similar, conventional methods rely on image segmentation and feature extraction but are easy to implement; in comparison, the CNN is independent of these steps but requires a large amount of labelled data. Singh et al. [14] trained a CNN for 200 epochs and thereby proposed a classification model for WBCs. Combining ResNet and DCGAN, Ma et al. [15] devised a classification model that performed better than its predecessors. An approach for extracting the region of interest from smear images for the classification of WBCs was presented by Sengür et al. [16]. Further developing the concept, Sengür et al. [16] used ResNet50 for feature extraction and principal component analysis for feature selection, and finally performed the classification using long short-term memory; the model achieved 85.7% accuracy. Patil et al. [17] employed canonical correlation analysis [18], enlisting both convolutional and recurrent neural network architectures. The aforementioned researchers [16,17] performed their experiments on the same dataset. Wijesinghe et al. [19] used the K-means clustering method for WBC nuclei segmentation, after which a VGG-16 architecture [20] was employed to classify the designated classes for the nuclei images. A CNN-based model trained on local image data, proposed by Jung et al. [21], was used for classifying the data from another reference presented in [22].
Ucar et al. [23] suggested a ShuffleNet [24]-based deep learning model, producing an overall accuracy of 97%. Using Euler's Jenks optimisation, Karthikeyan et al. [25] devised a CNN for detecting and classifying WBCs; Jenks-optimized pooling [22] was used to remove peripheral RBCs from the blood samples. Besides learning-based approaches, research based on fuzzy logic can also be found in the literature. Ghosh et al. [26] presented a fuzzy technique for counting WBCs in smear images. Similarly, the Chan-Vese technique [27] was utilized by Rawat et al. [28] to separate WBC nuclei from sample images; they also proposed an adaptive neuro-fuzzy classifier. Recently, Ashish et al. proposed a state-of-the-art (SOTA) model based on the CNN architecture for the classification of WBCs in fewer epochs, i.e., in a more time-efficient manner, and achieved an accuracy of 98.55%. This paper utilizes a convolutional neural network (CNN) [29] and its variants, such as AlexNet [30], DenseNet-(121, 161) [31], ResNet-(18, 34, 50) [32], SqueezeNet-(1.0, 1.1) [33], and VGGNet-(11, 13) [20], for the classification of white blood cells from blood smear images gathered in the Blood Cell Count and Detection (BCCD) dataset. Different CNN architectures have been experimented with, and the results are portrayed and validated using the validation criteria provided in Section 2.6.
The core concepts and contributions of this paper are as follows:

1. We have applied advanced image processing and data augmentation techniques, i.e., random resizing and cropping, which randomly select different parts of an image, enabling the model to focus on and perceive various features. This improves the model's generalization capabilities and prevents overfitting.

2. We applied advanced fine-tuning techniques such as normalization, mixup augmentation, and label smoothing to train the CNN model and obtain preferable results in comparison with other similar research.

3. We investigated and compared the efficiency and complexity of multiple deep neural network (DNN) architectures initialized with pre-trained weights for WBC classification.

Dataset
The BCCD dataset [34] is an open dataset containing five classes of white blood cell images. The dataset comprises 12,436 images of blood cells organized into five categories, i.e., basophils, eosinophils, lymphocytes, monocytes, and neutrophils. Of the five categories, basophils were removed during data clean-up due to a lack of available images. Each of the remaining four classes contains approximately 300 images.

Data Pre-Processing
After data acquisition, the data were thoroughly scrutinized, and it was discovered that the basophil class has a meagre image count; hence, this class was removed entirely before feeding the data to the pre-trained CNN models listed in Figure 2. Certain pre-processing of the data was then carried out; we performed three operations within the datablock, as described below.

Convolutional Neural Network
Within deep learning, one of the multidisciplinary fields of AI, convolutional neural networks are considered one of the most advanced architectures for various computer vision tasks. In comparison with other networks, CNNs have demonstrated superior performance in computer vision [35].
CNNs have a specific trait called invariance that allows them to see images in a very broad fashion [36], so that even an image with scattered face attributes is treated as a person by a CNN. Convolution is a feature extraction procedure in a CNN that employs a kernel of a specific size. The kernel is traversed across the input in specific steps, i.e., the stride, which is specified during the implementation of the architecture. The outcome of this operation is known as a feature map. Following the extraction of the feature map, a pooling procedure is used to further reduce the size of the feature map [29]. A layer in the network is made up of the processes outlined above, and the network is made up of numerous such layers. The output is finally flattened, and a fully or partially connected layer is formed [29]. The image is then classified using a classification layer, which determines the likelihood of the image falling into one of several categories.
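The convolution-pooling-flatten-classify pipeline described above can be sketched in PyTorch (the library used in this work). The layer sizes and the four-class output below are illustrative choices for this sketch, not the architecture used in the paper:

```python
import torch
import torch.nn as nn

# A minimal CNN mirroring the pipeline described above: convolution
# (kernel + stride) -> feature map -> pooling -> flatten ->
# fully connected layer -> per-class likelihoods.
class TinyCNN(nn.Module):
    def __init__(self, n_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),  # kernel traversal -> feature map
            nn.ReLU(),
            nn.MaxPool2d(2),                                       # pooling shrinks the feature map
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, n_classes)       # fully connected layer

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)                                    # flatten before classification
        logits = self.classifier(x)
        return torch.softmax(logits, dim=1)                        # likelihood per category

model = TinyCNN()
probs = model(torch.randn(1, 3, 224, 224))                         # one 224x224 RGB image
```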
In this paper, the different architectures of CNN are used as listed in Figure 2, which has been further fine-tuned to create models that are specifically tailored for WBC image classification.

Transfer Learning
Transfer learning is a technique that enables researchers and practitioners to employ a previously learned model for a completely new task. In computer vision, transfer learning can be beneficial in leveraging knowledge from a previous assignment to improve the prediction of a new task. Due to these capabilities and the ability to train a deep network with few inputs, this technique is gaining more attention in the field.
Transfer learning only works if the features learned in the first task are generic. In addition, the input to the model must be of the same size as when the model was first trained; in our case, we must therefore perform a resizing operation before feeding images to the network.

•
DenseNet: Introduced by Huang et al. [31], this network contains direct connections between any two layers with matching feature-map sizes. DenseNet, as shown in Figure 3, reduces the vanishing gradient problem, reinforces feature propagation, encourages the reuse of features, and significantly decreases the number of parameters.

•

VGG: Described in [20], this network uses only 3 × 3 convolutional layers, stacked on top of one another at increasing depths, while max pooling is used to reduce the volume size. Two fully connected layers followed by a softmax classifier complete the network.
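The dense connectivity described above can be illustrated with a toy dense block in PyTorch, where each layer receives the concatenated feature maps of all preceding layers. The growth rate and layer count below are arbitrary choices for this sketch, not those of DenseNet-161:

```python
import torch
import torch.nn as nn

# Toy dense block: layer i takes the concatenation of the input and
# the outputs of all earlier layers, giving direct connections between
# any two layers with matching feature-map sizes.
class DenseBlock(nn.Module):
    def __init__(self, in_ch: int, growth: int = 12, n_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth),
                nn.ReLU(),
                nn.Conv2d(in_ch + i * growth, growth, kernel_size=3, padding=1),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # connect to every earlier layer
            features.append(out)
        return torch.cat(features, dim=1)

block = DenseBlock(in_ch=16)
y = block(torch.randn(1, 16, 32, 32))   # output has 16 + 3*12 = 52 channels
```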

Model Building
Models were run for a total of 9 epochs; during the first 3 epochs, the pre-trained layers of the network were not trained (i.e., frozen). In the next 6 epochs, we trained the full network and observed its performance (i.e., unfrozen). After training, we used the validation set to test each model's performance. This methodology was carried out for all the pre-trained models mentioned earlier, and the most efficient model among them, i.e., DenseNet-161, was selected as our main model. Since DenseNet-161 outperforms the other models, this paper reports most of the experimental results obtained from DenseNet-161.
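The 3-epoch freeze followed by the 6-epoch unfreeze can be sketched as below on a toy model with random data; with FastAI this schedule roughly corresponds to `learn.fine_tune(6, freeze_epochs=3)`. The model, data, and optimizer choice here are stand-ins, not the paper's configuration:

```python
import torch
import torch.nn as nn

# Toy "backbone + head" model standing in for a pre-trained network.
backbone = nn.Linear(8, 8)
head = nn.Linear(8, 4)
model = nn.Sequential(backbone, nn.ReLU(), head)
x, y = torch.randn(32, 8), torch.randint(0, 4, (32,))
loss_fn = nn.CrossEntropyLoss()

def run_epochs(n, train_backbone):
    # Freeze or unfreeze the backbone, then train for n epochs.
    for p in backbone.parameters():
        p.requires_grad = train_backbone
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad])
    loss = None
    for _ in range(n):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

frozen_loss = run_epochs(3, train_backbone=False)   # 3 frozen epochs (head only)
final_loss = run_epochs(6, train_backbone=True)     # 6 unfrozen epochs (full network)
```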

Performance Evaluation Metrics
Accuracy alone does not reflect the entire performance of a classifier. As a result, several performance indicators, such as precision, recall, F1-score, and ROC-AUC, were computed [38]. Precision is the ratio of true predictions for a given class among all the predictions generated by the classifier for that class. Recall, or sensitivity, refers to the proportion of accurate predictions for a class out of all the samples in that class. The F1-score combines precision and recall into one metric for assessing the performance of a classifier. The mathematical definitions of these performance indicators are presented below, where TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives, respectively:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 × (Precision × Recall) / (Precision + Recall)
The area under the curve (AUC) of an ROC curve is a performance statistic for classification problems across multiple threshold levels. The ROC is a probability curve, whereas the AUC is a measure of separability; it indicates how well a model is able to distinguish between classes.
The AUC shows how accurately the model predicts that the 0 class is 0 and that the 1 class is 1. The better the model separates these two classes, the higher the AUC value, indicating better classification.
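The metric definitions above can be computed directly from the raw counts; the TP/FP/FN values below are made-up numbers for illustration, not results from the paper:

```python
# Precision, recall, and F1-score from raw prediction counts,
# following the definitions in the text.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f1_score(tp: int, fp: int, fn: int) -> float:
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Example: 90 true positives, 10 false positives, 30 false negatives.
p = precision(90, 10)       # 90 / 100 = 0.9
r = recall(90, 30)          # 90 / 120 = 0.75
f1 = f1_score(90, 10, 30)   # harmonic mean of p and r
```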

Hardware
Experiments were performed in a Google Colaboratory environment using the provided Tesla T4 GPU, 12 GB of RAM and an Intel Xeon CPU at 2.20 GHz. The Python libraries used for this experimentation included FastAI, PyTorch and NumPy.

Results and Discussion
This research was carried out using the methodology shown in Figure 4. The proposed classification approach's effectiveness was validated on the publicly accessible BCCD dataset.

Experimental Details
The raw image dataset from BCCD underwent several pre-processing steps, as listed below. Data augmentation: the resized images underwent further data augmentation operations, i.e., rotation, zoom, perspective warping, and lighting changes (brightness and contrast). Data augmentation is the process of generating random variations of the input data so that they appear unique but do not alter the data's underlying significance. Model hyperparameters: the settings used for model building were as follows.

Model Benchmarking
In this paper, different variants of the CNN were implemented for the classification of the WBC images. Their performance was tested on the BCCD dataset, and the results obtained are compared in this section. The experiments show that the results improve as the network gets deeper; here, the deepest network of all achieves the best values. To be precise, the accuracy obtained during both training and testing is 100%. Table 1 depicts the parameters used to compare the models: the average time, trainable parameters and accuracy of every model. In addition to the table, this paper also shows the graphs of the training and testing process of DenseNet-161, the best-performing network among the 10 networks compared. The validation report of DenseNet-161 is also provided.

Normalization
Normalization, first introduced by Ioffe and Szegedy [40] in 2015, was used to normalize the output of the activations in each layer before the signal traversed to the next layer.
Let us consider x as a mini-batch of activations; then, the normalized x̂ can be obtained using the equations below, where µ_β and σ_β² are calculated over each mini-batch β during training:

µ_β = (1/m) Σ_i x_i
σ_β² = (1/m) Σ_i (x_i − µ_β)²
x̂_i = (x_i − µ_β) / √(σ_β² + ε)

Applying these equations means that the activations will be zero-centered, with a mean close to zero and a variance close to one. At test time, we substitute the mini-batch µ_β and σ_β with running averages of µ_β and σ_β computed during the training phase. This guarantees that we can process images through our network and still obtain correct predictions, free from bias caused by the µ_β and σ_β of the final mini-batch processed through the network during training.
The comparison of different performance metrics in DenseNet-161 after normalization is given in Table 2.
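As a numeric sanity check of the normalization step described above, the manual mini-batch computation below matches PyTorch's `BatchNorm1d` in training mode (with the learnable affine scaling disabled so that only the normalization itself is compared):

```python
import torch
import torch.nn as nn

eps = 1e-5
x = torch.randn(64, 10)                       # a mini-batch of activations

# Manual batch normalization: per-feature mean/variance over the batch.
mu = x.mean(dim=0)                            # mini-batch mean
var = x.var(dim=0, unbiased=False)            # mini-batch variance
x_hat = (x - mu) / torch.sqrt(var + eps)      # zero mean, unit variance

# PyTorch's layer performs the same computation in training mode.
bn = nn.BatchNorm1d(10, eps=eps, affine=False)
bn.train()
x_bn = bn(x)
```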

Mixup Augmentation
Mixup, introduced by Zhang et al. [41], is a potent data augmentation strategy that can offer significantly improved accuracy, especially when one does not have much data, or a pre-trained model that was trained on data similar to one's dataset. For each image, mixup performs the following steps:

• Select another image from your dataset at random;
• Randomly choose a weight;
• Take a weighted average of your image and the selected image; this will serve as your independent variable;
• Take a weighted average of the labels of the selected image and the labels of your image; the result will be your dependent variable.
Here, x_i and x_j are raw input vectors, y_i and y_j are one-hot label encodings, and λ is the weight for the weighted average. The idea of mixup augmentation is described in the equations below:

x̃ = λ x_i + (1 − λ) x_j
ỹ = λ y_i + (1 − λ) y_j

The comparison of different performance metrics in DenseNet-161 after mixup augmentation is given in Table 2.
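The steps above can be sketched on a single pair of examples; the Beta(0.4, 0.4) distribution for sampling the weight λ is a common choice, not necessarily the paper's setting, and the images and labels here are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

x_i = rng.random((3, 32, 32))                 # your image (stand-in)
x_j = rng.random((3, 32, 32))                 # randomly selected second image
y_i = np.array([1.0, 0.0, 0.0, 0.0])          # one-hot labels, 4 classes
y_j = np.array([0.0, 0.0, 1.0, 0.0])

lam = rng.beta(0.4, 0.4)                      # randomly chosen weight
x_mix = lam * x_i + (1 - lam) * x_j           # independent variable
y_mix = lam * y_i + (1 - lam) * y_j           # dependent variable (soft label)
```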

Label Smoothing
Label smoothing was introduced by Szegedy et al. [42]. In classification problems, we generally avoid explicit one-hot encoding in practice in order to save memory, but the loss we compute is the same as if we had used it. This means the model is trained to return 0 for all categories other than the target class, for which it returns 1. Even if a prediction of 0.999 is deemed "good enough", the model will still acquire gradients and learn to forecast activations with even greater confidence. This promotes overfitting and results in a model that, at inference time, will not provide helpful probabilities: it will always report 1 for the predicted category, even if it is not entirely sure, simply because this is how it was trained.
Instead, we could train by replacing all 1s with numbers just below 1 and all 0s with numbers a little above 0, respectively. Label smoothing is the term for this. Label smoothing will help make the training more robust even if there is mislabeled data by encouraging models to be less confident. A model that generalizes more effectively will be the end outcome.
In practice, we start with one-hot-encoded labels and then replace all 0s with ε/N, where N is the number of classes and ε is a parameter. Similarly, we replace the 1s with 1 − ε + ε/N. In this way, we prevent the model from making overly confident predictions.
The comparison of different performance metrics in DenseNet-161 after label smoothing is given in Table 2.
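The smoothing rule above can be sketched in a few lines; ε = 0.1 is a commonly used value (an assumption here, not a setting stated in the text):

```python
import numpy as np

# Label smoothing: 0s become eps/N and the 1 becomes 1 - eps + eps/N,
# matching the rule in the text.
def smooth_labels(one_hot: np.ndarray, eps: float = 0.1) -> np.ndarray:
    n_classes = one_hot.shape[-1]
    return one_hot * (1 - eps) + eps / n_classes

y = np.array([0.0, 0.0, 1.0, 0.0])   # one-hot label, N = 4 classes
y_smooth = smooth_labels(y)
# 0s -> 0.1 / 4 = 0.025; the 1 -> 1 - 0.1 + 0.025 = 0.925
```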

Conclusions
Manual classification of blood cells by visual inspection by experts is a time-consuming and tiresome endeavour. The method described herein has been successfully demonstrated to be capable of classifying the types of white blood cells by combining smear images with appropriate deep learning approaches. Based on the results of the proposed approach, the white blood cell categorization task can be completed automatically.
In this study, we examined how to classify smear images using several CNN architectures. The best results obtained after implementing the discussed models were compared with each other on the basis of trainable parameters, average time taken and accuracy. In comparison with the other architectures, DenseNet-161 performed substantially better in the leukocyte recognition challenge, with significantly higher accuracy. However, the superiority of DenseNet-161's accuracy is challenged by the other implemented models when additional variables, such as the average time taken and the number of trainable parameters, are taken into account. This paper provides insights into how multiple parameters need to be considered when selecting a suitable deep learning architecture. In addition, other architectures and networks can be utilized for benchmarking models, and larger datasets may give rise to different results.

Data Availability Statement:
The data presented in this study are available at the following link: https://www.kaggle.com/datasets/paultimothymooney/blood-cells (accessed on 5 August 2022).