1. Introduction
Malaria is a globally widespread disease caused by parasitic protozoa transmitted to humans by infected female mosquitoes of the genus Anopheles. In 2019, there were an estimated 229 million malaria cases worldwide, with an estimated 409,000 deaths due to malaria; 94% of these cases and deaths occurred in Africa [1]. Children under five years of age are the most vulnerable group, accounting for 67% (274,000) of all malaria deaths worldwide. Parasites of the genus Plasmodium (P.) cause malaria in humans by attacking red blood cells (RBCs). They spread to people through the bites of infected female Anopheles mosquitoes, called “malaria vectors”. Five species of parasites cause malaria in humans: P. falciparum, P. vivax, P. ovale, P. malariae and P. knowlesi. P. falciparum and P. vivax are the two posing the most significant threat [1,2]. The former is most prevalent in Africa, while P. vivax is predominant in the Americas. Malaria parasites within the human host have the following life stages: ring, trophozoite, schizont and gametocyte. The World Health Organization (WHO) defines human malaria as a preventable and treatable disease, provided it is diagnosed promptly: as the illness worsens, it can lead to disseminated intravascular thrombosis, tissue necrosis and spleen hypertrophy [1,3,4,5].
Blood cell analysis using peripheral blood slides under a light microscope is considered the gold standard for the detection of leukaemia [6,7,8,9], blood cell counting [10,11,12,13,14] or the diagnosis of malaria [15,16,17]. Manual microscopic examination of peripheral blood smears (PBS) for malaria diagnosis has advantages such as high sensitivity and specificity compared to other methods. However, it requires about 15 minutes for microscopic examination of a single blood sample [18], and the quality of the diagnosis depends solely on the experience and knowledge of the microscopist. It is common for the microscopist to work in isolation without a rigorous system to ensure the quality of the diagnosis. In addition, the images analysed may be subject to variations in illumination and staining that can affect the results. In general, the manual process is tedious and time-consuming, and decisions dictated by misdiagnosis lead to unnecessary use of drugs and exposure to their side-effects or severe disease progression [19,20].
This work investigates the classification of malaria parasites using transfer learning (TL) to distinguish healthy and parasite-affected cells and classify the four P. falciparum stages of life. Moreover, the robustness of the models has been evaluated with cross-dataset experiments on two very different public datasets.
In this paper, transfer learning will be introduced by explaining how it works and discussing the pretrained networks selected to perform the comparative tests. The experiments are divided into (i) binary, (ii) multiclass and (iii) cross-domain classification. In the latter, networks trained on datasets from different domains were used to see if this improves accuracy over results obtained in a single domain.
The rest of the manuscript is organised as follows. Section 2 presents the literature on computer-aided diagnostic (CAD) systems for malaria analysis. Section 3 illustrates the datasets, methods and experimental setup. The results are presented and discussed in Section 4 and, finally, Section 5 draws the conclusions and outlines directions for future work.
2. Related Work
Several solutions for the automatic detection of malaria parasites, mainly in the form of CAD systems, have been developed in recent years. They aim to reduce the problems of manual analysis described in Section 1 and provide a more robust and standardised interpretation of blood samples while reducing the costs of diagnosis [15,21,22]. They can be based on the combination of image processing and traditional machine learning techniques [23,24,25] or on deep learning approaches [16,26,27,28], especially after the proposal of the AlexNet convolutional neural network (CNN) [29].
Since malaria parasites always affect the RBCs, any automatic malaria detection needs to analyse the erythrocytes to discover if they are infected or not by the parasite and, further, to find the stage of life or the type.
Among the more recent and classical solutions not employing CNNs, Somasekar et al. [23] and Rode et al. [25] proposed two malaria parasite segmentation methods. The first one used fuzzy clustering and connected component labelling followed by minimum perimeter polygon to segment parasite-infected erythrocytes and detect malaria, while the second one is based on image filtering and saturation separation followed by triangle thresholding.
Regarding the CNN-based approaches, Liang et al. [26] proposed a novel model for the classification of single cells as infected or uninfected, while Rajaraman et al. [27] studied the accuracy of CNN models, starting from pretrained networks, and proposed a novel architecture trained on a dataset available from the National Institutes of Health (NIH). They found that some pre-existing networks, by means of TL, can be more efficient than networks designed ad hoc. In particular, ResNet-50 obtained the best performance. Subsequently, they further improved their results through an ensemble of CNNs [28]. Rahman et al. [30] also exploited TL strategies using both natural and medical images and performed an extensive test of some off-the-shelf CNNs to realise a binary classification.
Some other techniques not explored in this work are based on the combination of CNN-extracted features and handcrafted ones [31,32,33] or the direct use of object detectors [34]. For example, Kudisthalert et al. [33] proposed a malaria parasite detection system based on the combination of handcrafted features and deep features extracted from a pretrained AlexNet. Abdurahman et al. [34] realised a modified version of the YOLOv4 detector. Moreover, they generated new anchor box sizes with a K-means clustering algorithm to adapt the model to small objects.
Finally, recent attention has been devoted to mobile devices, which enable a cheaper and quicker diagnosis in the underdeveloped areas of the world, where more expensive laboratories do not exist. As an example, Bias et al. [24] realised an edge detection technique based on a novel histogram-based analysis, coupled with easily accessible hardware, focused on malaria-infected thin smear images.
The work in [30] is the most similar to the approach proposed here. In particular, the authors compared different off-the-shelf networks using two datasets: the Malaria Parasite Image Database for Image Processing and Analysis (MP-IDB) [35] and a second one composed of synthetic and medical images. The task faced, however, is a binary classification. On the entire MP-IDB, they reported 85.18% accuracy with a fine-tuned version of VGG-19.
In summary, the main difference between our work and the state of the art is that here an extended set of off-the-shelf CNNs has been evaluated on two very different public datasets with a dual purpose: detecting healthy and unhealthy RBCs and distinguishing the various stages of life. Finally, this work provides the first baseline for the classification of the stages of life on the MP-IDB.
4. Experimental Results
Three different experiments were conducted, according to the classification purpose:
Binary classification on the NIH dataset (healthy vs. sick);
Multiclass classification on the MP-IDB-FC dataset (four stages of life);
Multiclass cross-dataset classification on both datasets.
The results of each experiment were analysed using the confusion matrix. The confusion matrix metric used in this study is Accuracy, given in Equation (2):
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$
The variables used in the equation are True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN), the entries of the confusion matrix used to calculate the metrics [49,50].
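As an illustration of the accuracy metric named above, the following minimal sketch computes it from confusion matrix counts. The function names and the example counts are hypothetical and are not taken from the experiments; the multiclass variant assumes a square matrix with true classes on the rows and predicted classes on the columns.

```python
def binary_accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of correct predictions: (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


def multiclass_accuracy(confusion: list[list[int]]) -> float:
    """Accuracy for a square confusion matrix.

    Assumption: rows are true classes, columns are predicted classes,
    so correct predictions lie on the main diagonal.
    """
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


# Hypothetical counts, for illustration only.
print(binary_accuracy(tp=48, fp=2, tn=47, fn=3))  # 0.95
```

In the multiclass case the same ratio applies: the diagonal of the matrix plays the role of TP + TN.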
4.1. Binary Classification Performance on NIH
To determine the training options to use, test trials were carried out. From them, it was mainly verified that:
Extending the training phase beyond ten epochs did not improve accuracy, as the network memorised features of individual images rather than class features, and overfitting compromised the results;
The ideal learning rate was 1 × 10⁻⁴. The accuracy increased too slowly for smaller values and, for larger ones, it did not converge to a specific value;
Empirically, Adam proved to be the best solver.
Table 2 shows that almost all the networks have an accuracy value close to the average: the standard deviation of the collected data is only 0.16%. This may be because the dataset provides many useful images for training the networks. In particular, ResNet-18 recorded the highest accuracy value, confirming the high performance reported in [27]. MobileNetV2, SqueezeNet and ShuffleNet recorded average values, which is an important result since they are networks designed for mobile use.
4.2. Multiclass Classification Performance on MP-IDB-FC
The multiclass classification on MP-IDB-FC was designed to determine the life stage of the parasite: ring phase, adult trophozoite, schizont and gametocyte.
As in the binary classification, comparative tests with the same training, validation and test sets were carried out to determine which networks perform best and to allow comparison between them. We created three datasets with 100, 200 or 300 images per class in the training set, referred to as D1, D2 and D3, respectively. They were constructed by oversampling with augmentation of the remaining classes.
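The construction of a class-balanced training set by oversampling with augmentation can be sketched as follows. This is only an illustration under stated assumptions: the augmentation names, the file names, the class sizes and the choice to randomly subsample classes larger than the target are all hypothetical, since the exact procedure is not detailed in this section.

```python
import random

# Hypothetical augmentation names; the transforms actually used are not listed here.
AUGMENTATIONS = ("flip_horizontal", "flip_vertical", "rotate_90", "rotate_180")


def balance_class(images, target, rng):
    """Return exactly `target` (image, transform) pairs for one class.

    Original images are kept unchanged (transform "none"), randomly
    subsampled if the class is larger than `target`. If the class is
    smaller, extra samples are drawn with replacement and each is paired
    with a randomly chosen augmentation.
    """
    if not images:
        raise ValueError("cannot oversample an empty class")
    kept = [(name, "none") for name in rng.sample(images, min(target, len(images)))]
    extra = [
        (rng.choice(images), rng.choice(AUGMENTATIONS))
        for _ in range(target - len(kept))
    ]
    return kept + extra


def build_training_set(per_class, target, seed=0):
    """Balance every class to `target` samples (e.g. 100, 200 or 300)."""
    rng = random.Random(seed)
    return {label: balance_class(imgs, target, rng) for label, imgs in per_class.items()}


# Hypothetical class sizes, for illustration only.
classes = {
    "ring": [f"ring_{i}.png" for i in range(400)],
    "trophozoite": [f"troph_{i}.png" for i in range(42)],
    "schizont": [f"schiz_{i}.png" for i in range(17)],
    "gametocyte": [f"gam_{i}.png" for i in range(8)],
}
d1 = build_training_set(classes, target=100)
print({label: len(samples) for label, samples in d1.items()})
# {'ring': 100, 'trophozoite': 100, 'schizont': 100, 'gametocyte': 100}
```

Note that the smaller a class is, the more of its balanced samples are augmented copies of the same few originals, which is one plausible reason why a larger target size does not necessarily help.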
Table 3 shows the performance in this experiment. Each test was evaluated with five-fold cross-validation; we report the mean accuracy and standard deviation over the five folds. The most notable result is that the average performance worsens from D2 to D3; DenseNet-201 and GoogLeNet are the only networks to benefit from increasing the size of the training set.
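The way a mean accuracy and a standard deviation are obtained from five folds can be sketched as below. The fold partitioning shown is the standard k-fold scheme, which is an assumption about the protocol; the use of the sample standard deviation and the accuracy values in the example are likewise hypothetical.

```python
import random
import statistics


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds.

    Assumption: standard k-fold partitioning, where every sample is
    used for testing exactly once.
    """
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


def summarise(fold_accuracies):
    """Return (mean, sample standard deviation) of the per-fold accuracies."""
    return statistics.mean(fold_accuracies), statistics.stdev(fold_accuracies)


# Hypothetical per-fold accuracies, for illustration only.
mean_acc, std_acc = summarise([0.90, 0.92, 0.88, 0.91, 0.89])
print(f"{mean_acc:.2%} ± {std_acc:.2%}")
```

A low standard deviation across folds indicates that the accuracy does not depend strongly on which images end up in the test fold.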
4.3. Cross-Dataset Classification Evaluation
The cross-domain classification was carried out to evaluate the robustness of the CNNs.
4.3.1. MP-IDB-FC Classification with NIH Models
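Firstly, two different multiclass classifications were realised on MP-IDB-FC by:
Training on NIH and testing on MP-IDB-FC (Exp1);
Training on NIH, fine-tuning and testing on MP-IDB-FC (Exp2).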
When MP-IDB-FC was used for training or fine-tuning, it was split into 50% for training (with 10% for validation) and 50% for testing. Every test was evaluated with five-fold cross-validation, and we report the average accuracy and the standard deviation over the five folds. This cross-domain experiment tested whether the networks trained on the NIH dataset could be employed on the MP-IDB-FC dataset. The main difference between the two datasets is that MP-IDB-FC contains parasite crops but no healthy RBCs, whereas NIH contains only ring-stage parasites. For this reason, the objective of Exp1 was to discriminate between rings and the remaining stages of life, while Exp2 aimed to expand the knowledge of the NIH pretrained models with new information on the stages of life.
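The 50%/50% split described above, with a validation subset carved out of the training half, can be sketched as follows. The function name and parameters are hypothetical, and the sketch assumes that the 10% validation share is taken from the training portion; if it refers to the whole dataset, `val_fraction_of_train` would be 0.2 instead. The split shown is a plain random split, not a stratified one.

```python
import random


def split_dataset(items, train_fraction=0.5, val_fraction_of_train=0.1, seed=0):
    """Shuffle `items` and return (train, validation, test) lists.

    Assumption: the validation set is a fraction of the training portion,
    so the defaults give 45% train, 5% validation and 50% test overall.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train_total = round(len(shuffled) * train_fraction)
    n_val = round(n_train_total * val_fraction_of_train)
    validation = shuffled[:n_val]
    train = shuffled[n_val:n_train_total]
    test = shuffled[n_train_total:]
    return train, validation, test


# Hypothetical dataset of 200 crops, identified by index.
train, validation, test = split_dataset(range(200))
print(len(train), len(validation), len(test))  # 90 10 100
```

Fixing the seed keeps the three subsets identical across networks, which is what makes the comparison between architectures meaningful.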
The results depicted in Table 4 show that when the target domain differs excessively from the source domain, it is hard to directly apply the models trained with NIH to MP-IDB-FC (Exp1), even if the task seemed feasible, as ring-stage parasites were contained in both datasets. Conversely, Exp2 shows that using the CNNs first trained on NIH and then fine-tuned on MP-IDB-FC led to an improvement in average accuracy. The information about healthy RBCs provided with NIH training does not affect the overall result. In addition, the standard deviation is under 4% for all the networks, leading to satisfactory performance stability.
4.3.2. P. vivax Classification Using P. falciparum Data
The last experiment investigated the possibility of classifying the stages of life of P. vivax using the information on P. falciparum. To this end, a different dataset composed of crops of P. vivax parasites, referred to as MP-IDB-VC, was created. In this case, three different evaluations were conducted by:
Training on MP-IDB-VC and testing on MP-IDB-FC (Exp3);
Training on MP-IDB-VC and testing on MP-IDB-VC (Exp4);
Training on MP-IDB-FC, fine-tuning and testing on MP-IDB-VC (Exp5).
Trophozoites, schizonts and gametocytes vary greatly between the two species, while the ring stages are quite similar.
As can be seen from Table 5, the classification of P. falciparum stages of life employing models trained on P. vivax (Exp3) produced very low accuracy due to the differences between all the stages except rings. On the other hand, same-domain models (Exp4) achieved satisfactory results. Exp5 demonstrates that fine-tuning the models pretrained on P. falciparum improved the accuracy, as already observed for Exp2 in Section 4.3.1. In this task, DenseNet-201 provided the best performance, being the only CNN to exceed 85% and outperforming the average by 10% in both cases.
5. Conclusions
The results obtained in this work support the importance of deep learning in haematology. This work aimed to demonstrate that pretrained off-the-shelf networks can offer high accuracy for diagnosing malaria through transfer learning, while also showing several limitations of this approach. Several comparative tests were carried out using a selection of pretrained networks differentiated by size, depth and the number of parameters. In particular, using the NIH dataset, it is possible to distinguish a healthy from an infected erythrocyte with an accuracy of over 97%. Small networks such as SqueezeNet and ShuffleNet performed well, supporting the possible development of software for malaria diagnosis on small devices such as smartphones. On the other hand, MP-IDB has highlighted some critical issues: deep learning is not very effective when the dataset used for training is unbalanced, and some classes of parasites in the dataset have a small number of images. Nevertheless, the oversampling, augmentation and preprocessing methods still allowed us to exceed 90% accuracy on the test set for distinguishing the four life stages of the P. falciparum parasite. Finally, the cross-domain experiments have highlighted some critical points in classifying data from heterogeneous domains. It was counterproductive to apply the models trained with NIH to MP-IDB-FC, but the use of the CNNs first trained on NIH and then fine-tuned on MP-IDB-FC led to an improvement in average accuracy. This aspect also applies to using the P. vivax dataset as the target domain, as most of the classes deviate too much from the corresponding P. falciparum classes. However, using both types of parasites as source domains produced better results than training on P. vivax only. In general, the extensive experimentation has highlighted how DenseNet-201 offers the most stable and robust performance, making it a strong candidate for further developments and modifications.
Among the possible developments of this work, we aim to propose a framework able to detect malaria parasites from blood smear images and classify different species of parasites and different stages of life, mainly focusing on high-variation data. We also plan to use domain adaptation algorithms to improve cross-domain performance.