4.2. Ablation Study
We performed an ablation study to quantify the performance gain contributed by each proposed component. Starting from the baseline model, we constructed five variants by successively adding the individual architectural components (channel attention, spatial attention, feature refinement, and transformer blocks). The results are shown in Table 4.
As seen from the results, the baseline model with only EfficientNet-B3 as the backbone already delivers a high level of performance, with an accuracy of 96.49% and an AUC of 98.90%. The model with the channel attention module shows a slightly lower accuracy, probably because it over-focuses on dependencies along the channel dimension. In contrast, the model with the spatial attention module shows improved discrimination, achieving an accuracy of 96.81% and an F1-score of 94.57%, outperforming the baseline. The models with the feature refinement module and transformer blocks show a drop in performance compared to the baseline, which might be due to the repeated amplification of features and the transformer's unstable convergence on the small malaria dataset. The proposed NOVA, with the optimal attention modules and lightweight transformer-based feature fusion, achieves the highest scores on all reported metrics: 97.00% accuracy, 97.00% F1-score, and 99.14% AUC.
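To make the variant construction concrete, the sketch below shows how such ablation variants could be assembled in Keras. The attention modules are simplified squeeze-and-excitation-style stand-ins rather than the exact NOVA components, and the function and flag names are ours.

```python
# Minimal sketch of ablation-variant assembly (illustrative; module
# internals are simplified stand-ins, not the exact NOVA components).
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import EfficientNetB3

def build_variant(use_channel_att=False, use_spatial_att=False):
    backbone = EfficientNetB3(include_top=False, weights="imagenet",
                              input_shape=(224, 224, 3))
    x = backbone.output  # (None, 7, 7, 1536)
    if use_channel_att:
        # squeeze-and-excitation-style channel gate (simplified)
        s = layers.GlobalAveragePooling2D()(x)
        s = layers.Dense(x.shape[-1] // 16, activation="relu")(s)
        s = layers.Dense(x.shape[-1], activation="sigmoid")(s)
        x = layers.Multiply()([x, layers.Reshape((1, 1, x.shape[-1]))(s)])
    if use_spatial_att:
        # single-layer spatial gate producing a (7, 7, 1) attention mask
        m = layers.Conv2D(1, 7, padding="same", activation="sigmoid")(x)
        x = layers.Multiply()([x, m])
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Dense(1, activation="sigmoid")(x)
    return models.Model(backbone.input, out)

baseline = build_variant()                      # backbone only
spatial  = build_variant(use_spatial_att=True)  # + spatial attention
```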
Analysis of Isolated Module Performance
The ablation study suggests an important design intuition for this work: each architectural module, while performing poorly in isolation, becomes useful when properly conditioned and integrated. The feature refinement module, for example, achieves only 62.89% accuracy when inserted independently, and the standalone transformer module reaches 69.91%.
We attribute the performance gap of the feature refinement module largely to three factors:
(1) Amplification of non-informative noise: when multi-branch features are fused without an attentional selection mechanism, the module non-selectively amplifies both discriminative parasitic features and background noise, artifacts, and staining variation.
(2) Training instability: without high-level feature initialization, the gated residual block in the proposed AFR module further complicates optimization and struggles to converge on the small medical dataset.
(3) Feature redundancy: with multiple parallel refinement branches and no feature selection mechanism, AFR introduces redundant feature maps, leading to overfitting.
The activation maps of AFR support this hypothesis: with naive initialization, the model activates strongly on cell-boundary and background regions instead of the desired intracellular parasitic components. The performance degradation of the transformer module can likewise be broken down into the following factors:
(1) Loss of local contextual information: when the input image is tokenized into flattened, non-overlapping patches as transformer inputs, local morphological information is lost, which is critical for learning subtle intracellular parasitic features (chromatin dots, hemozoin pigments, etc.); see the tokenization sketch after this list.
(2) Data scale mismatch: Transformer-based self-attentional architectures are typically trained on very large-scale datasets (ImageNet with 1.2 M images) to prevent overfitting, while the medical image dataset used in this work contains only 13,152 training images.
(3) Attention on the wrong regions: in the absence of a hierarchically initialized CNN backbone, the model may attend to noisy background regions instead of the discriminative foreground cells.
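As a concrete illustration of point (1), the sketch below shows ViT-style non-overlapping patch tokenization: each patch is flattened into a single vector, so the spatial layout inside a patch is discarded. The patch size is illustrative and not taken from this work.

```python
# Sketch of non-overlapping patch tokenization (ViT-style). Flattening
# each P x P patch discards intra-patch spatial layout, which is the
# locality loss discussed in point (1) above.
import tensorflow as tf

def tokenize(images, patch=16):
    # images: (B, 224, 224, 3) -> (B, 196, 768) flat patch tokens
    p = tf.image.extract_patches(
        images,
        sizes=[1, patch, patch, 1],
        strides=[1, patch, patch, 1],
        rates=[1, 1, 1, 1],
        padding="VALID")
    b = tf.shape(images)[0]
    n = (224 // patch) ** 2                 # 14 * 14 = 196 tokens
    return tf.reshape(p, (b, n, patch * patch * 3))

tokens = tokenize(tf.zeros((2, 224, 224, 3)))
print(tokens.shape)  # (2, 196, 768)
```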
The final NOVA model addresses the limitations by means of a synergistic hierarchical combination:
Stage 1 (Foundation): The EfficientNetB3 CNN backbone extracts spatially aware and semantically meaningful features that shield downstream modules from learning spurious patterns.
Stage 2 (Selection): Dynamic Channel Attention and Learnable Temperature Spatial Pyramid Attention filter out noisy channels and localize regions of interest before refinement, preventing noise amplification.
Stage 3 (Enhancement): Feature refinement now operates on attention-filtered representations, allowing the gating mechanism to selectively enhance discriminative parasitic morphologies without amplifying artifacts.
Stage 4 (Global Reasoning): The transformer now receives well-structured, attention-guided features from the CNNs, allowing stable training and the effective modeling of long-range dependencies (e.g., spatial relationships between multiple infected cells).
The hierarchical nature of this design also means that each module shields its successor from its limitations: CNNs prevent the transformer from learning random patterns, attention prevents refinement from amplifying noise, and refinement provides well-enhanced features for global reasoning. The final NOVA configuration achieves an accuracy of 97.00%, highlighting that architectural synergy is key in medical image analysis, where both subtle features and limited data present unique challenges.
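The following sketch makes the four-stage flow concrete. The attention, refinement, and transformer blocks are deliberately simplified placeholders for the NOVA modules, intended only to show how each stage conditions the next.

```python
# Compact sketch of the Stage 1-4 flow (simplified placeholders,
# not the exact NOVA modules).
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications import EfficientNetB3

inp = layers.Input((224, 224, 3))
feat = EfficientNetB3(include_top=False, weights=None)(inp)   # Stage 1: CNN features (7, 7, 1536)

gate = layers.Conv2D(1, 7, padding="same", activation="sigmoid")(feat)
feat = layers.Multiply()([feat, gate])                        # Stage 2: attention selection

ref = layers.Conv2D(feat.shape[-1], 3, padding="same", activation="relu")(feat)
feat = layers.Add()([feat, ref])                              # Stage 3: residual refinement

tok = layers.Reshape((-1, feat.shape[-1]))(feat)              # 49 tokens of dim 1536
att = layers.MultiHeadAttention(num_heads=4, key_dim=64)(tok, tok)
tok = layers.LayerNormalization()(layers.Add()([tok, att]))   # Stage 4: global reasoning

out = layers.Dense(1, activation="sigmoid")(layers.GlobalAveragePooling1D()(tok))
model = tf.keras.Model(inp, out)
```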
4.4. Comparison with Other Conventional Techniques
To further validate diagnostic efficiency, we compared the computational cost of the NOVA with that of common deep learning architectures. The number of parameters, floating-point operations (FLOPs), and inference time per image are reported in Table 6. The proposed NOVA achieves the best accuracy with moderate model complexity: it needs only 11.8 M parameters, far fewer than DenseNet201 (20.0 M) and EfficientNet-B7 (37.0 M), and requires only 3.1 GFLOPs, lower than ResNet50 (4.1 GFLOPs) and EfficientNet-B7 (7.9 GFLOPs). The NOVA takes only 6.8 ms to process a 224 × 224 image, making it feasible for real-time screening and for deployment on mid-range GPUs and edge devices.
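Figures of this kind could be reproduced along the following lines; `profile` is a hypothetical helper, and FLOPs counting (e.g., via the TensorFlow profiler) is omitted for brevity.

```python
# Illustrative profiling helper: parameter count and per-image latency.
# `model` is any built Keras model, e.g. the sketch above.
import time
import numpy as np

def profile(model, runs=100):
    params_m = model.count_params() / 1e6
    x = np.random.rand(1, 224, 224, 3).astype("float32")
    model(x, training=False)                  # warm-up / graph build
    t0 = time.perf_counter()
    for _ in range(runs):
        model(x, training=False)
    ms = (time.perf_counter() - t0) / runs * 1e3
    print(f"{params_m:.1f} M params, {ms:.1f} ms / image")
```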
The performance evaluation metrics, including accuracy, precision, recall, F1-score, and AUC, are compared for various deep learning models employed in malaria cell detection. DenseNet121, DenseNet169, DenseNet201, VGG16, ResNet50, and CNN, representing conventional architectures, demonstrate promising performance: their accuracies fall within the range of approximately 89.76% to 94.12%, and their AUC values range from 95.90% to 97.65%. On the other hand, lightweight models, including MobileNet, MobileNetV2, and EfficientNet-B1, show relatively lower performance; notably, EfficientNet-B1 has the lowest accuracy at 60.35%. These results suggest that lightweight models, with their reduced parameter complexity, may face challenges in extracting discriminative features for malaria cell classification. InceptionV3, a widely used deep model, yields moderate results with an accuracy of 85.67% and an AUC of 93.50%, suggesting that not all deep architectures are equally effective in generalizing to the task of malaria cell detection. In contrast, our proposed NOVA outperforms the existing deep learning models on all metrics, achieving 97.00% accuracy, 96.00% precision, 97.00% recall, 97.00% F1-score, and 98.00% AUC. This result demonstrates the NOVA's ability to reliably and effectively detect infected cells. Overall, these results position the NOVA as a viable candidate for malaria diagnosis, as it outperforms traditional and lightweight deep learning models in discriminative capability and overall classification performance.
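For reference, all comparison metrics can be computed uniformly from held-out predictions as sketched below, assuming sigmoid probability outputs; `evaluate` is an illustrative helper, not the authors' evaluation code.

```python
# Uniform metric computation from held-out predictions.
# y_true, y_prob: 1-D NumPy arrays (labels in {0, 1}, sigmoid scores).
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

def evaluate(y_true, y_prob, thresh=0.5):
    y_pred = (y_prob >= thresh).astype(int)
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
        "auc":       roc_auc_score(y_true, y_prob),  # threshold-free
    }
```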
In Figure 10, the training history indicates good convergence, with both training and validation accuracy approaching 96% by epoch 35. The training loss decreases smoothly from 0.65 to around 0.25, while the validation loss is more erratic during the first epochs, eventually settling around 0.18 after about epoch 20. Validation accuracy fluctuates significantly between epochs 5 and 15, ranging from 0.50 to 0.98, before stabilizing at a high level. This initial instability is likely due to the model's sensitivity to specific batch compositions early in training; however, the eventual convergence of both loss and accuracy metrics indicates successful learning without overfitting.
In Figure 11, the confusion matrix illustrates the model's classification performance on the 1253 test samples. The model correctly classified 606 samples as parasitized and 608 as uninfected, with a small number of misclassifications: 23 false negatives (parasitized classified as uninfected) and 16 false positives (uninfected classified as parasitized). This corresponds to a sensitivity (true positive rate) of 96.3% and a specificity (true negative rate) of 97.4%, reflecting a high level of discriminative performance.
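The two rates follow directly from the reported counts, as the short check below shows.

```python
# Worked check of the reported rates from the Figure 11 confusion matrix.
tp, fn = 606, 23   # parasitized: correctly detected / missed
tn, fp = 608, 16   # uninfected:  correctly rejected / false alarms

sensitivity = tp / (tp + fn)   # 606 / 629 ~= 0.963
specificity = tn / (tn + fp)   # 608 / 624 ~= 0.974
print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
```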
In Figure 12, the receiver operating characteristic (ROC) curve shows excellent performance, with an area under the curve (AUC) of 0.98. The curve rises quickly towards the upper-left corner, indicating that the model achieves high true positive rates while maintaining low false positive rates across different classification thresholds, making it a reliable candidate for deployment in diagnostic settings.
Figure 13 shows Grad-CAM-based interpretability on a few examples from the malaria cell classification task. Each row of the figure consists of (i) the original microscopy image with its ground truth label, (ii) the corresponding Grad-CAM activation map for the parasitized class, (iii) the activation map for the uninfected class, and (iv) the final prediction with confidence. In the correctly classified parasitized samples, the Grad-CAM maps highlight activation on the area of the intracellular parasite. The model thus appears to capture morphological features that are discriminative for the positive class, such as chromatin-rich parasite structures and cytoplasmic anomalies. The uninfected heatmap is mostly inactive in these examples, showing that the model correctly downregulates the negative class when strong parasitic features are present. In the uninfected samples, the activation patterns are consistently different: the uninfected-class map highlights the central region of the erythrocyte, while the parasitized heatmap is mostly inactive. The model thus appears to base its decision on features such as homogeneous cell texture and the lack of parasitic inclusions. Prediction overlays on the original image further confirm high confidence in the classification. Intriguingly, for some uninfected examples, weak or diffuse activation can be observed on the parasitized heatmap, potentially caused by staining artifacts, mild structural anomalies, or noise in the sample, and the model's sensitivity to these. The final prediction is nonetheless correct, as the global activation remains consistent within classes.
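A minimal Grad-CAM sketch in the standard formulation is given below; the convolutional layer name is a placeholder, and for a single-logit sigmoid head the uninfected-class map would use the negated score rather than a second class index.

```python
# Minimal Grad-CAM sketch (standard formulation; `conv_layer_name` is a
# placeholder for the last convolutional layer of the deployed model).
import tensorflow as tf

def grad_cam(model, image, conv_layer_name, class_index=0):
    grad_model = tf.keras.Model(
        model.input,
        [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        score = preds[:, class_index]   # negate for the negative class
    grads = tape.gradient(score, conv_out)              # d(score)/d(feature map)
    weights = tf.reduce_mean(grads, axis=(1, 2))        # GAP over spatial dims
    cam = tf.einsum("bhwc,bc->bhw", conv_out, weights)  # weighted channel sum
    cam = tf.nn.relu(cam)[0]
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()  # normalized heatmap
```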
Figure 14 shows examples of the predicted results of the proposed model and the corresponding attention maps. The first, second, and third columns represent the true label of the parasitized or uninfected cell, the attention map of the model (the regions the model considers important for classification), and the predicted label with the confidence score, respectively. Green represents correctly predicted results and red represents incorrectly predicted results. The model correctly predicted most cells and, in most cases, focused its attention on the important regions. However, for two parasitized cells, the model did not attend sufficiently to the parasite, which led to misclassification.
4.5. Comparison with Dataset-2
We assessed the cross-dataset transferability of the NOVA by applying it to the NIH Malaria Cell Images Dataset [37]. The dataset consists of 27,558 images (13,779 parasitized and 13,779 uninfected) acquired from 150 infected patients and 50 healthy donors. The dataset was randomly split with stratification at a ratio of 80:10:10 for training, validation, and testing, with the random seed fixed at 42 for reproducibility and perfect class balance in each split. All images were bilinearly interpolated to 224 × 224 × 3 pixels and normalized to [0, 1]. In addition, training images were randomly flipped, rotated (±20°), zoomed (±15%), translated (±10%), and brightness/contrast-adjusted (±20%), whereas validation and test images only underwent the interpolation and normalization steps. We fine-tuned the NOVA pre-trained on Dataset-1 for 100 epochs on the NIH dataset, with early stopping triggered at epoch 47. The optimization setup used Adam with an initial learning rate of 1 × 10⁻⁴, the ReduceLROnPlateau learning rate scheduler, a batch size of 32, binary cross-entropy loss, and L2 regularization (λ = 1 × 10⁻⁴). The key limitation of this evaluation is the absence of patient-level metadata in Dataset-2, which precluded patient-stratified splitting and means the training and test sets may include cells from the same patients. All reported results therefore represent cell-level classification performance, not patient-level diagnostic generalizability.
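A sketch of this fine-tuning configuration in Keras follows. The scheduler's factor and patience and the early-stopping patience are assumptions (the text does not specify them), `nova` denotes the Dataset-1 pre-trained model, and the L2 penalty (λ = 1 × 10⁻⁴) is assumed to be set per-layer via `kernel_regularizer` when the model is built.

```python
# Sketch of the described fine-tuning setup on Dataset-2 (hyperparameter
# values from the text; factor/patience are illustrative assumptions).
import tensorflow as tf

nova.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="binary_crossentropy",
    metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])

callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                         factor=0.5, patience=5),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
]
# train_ds / val_ds are assumed to yield batches of 32 images each
history = nova.fit(train_ds, validation_data=val_ds,
                   epochs=100, callbacks=callbacks)
```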
Table 7 shows the results of the selected deep learning architectures on the malaria cell dataset. Among the conventional deep learning models, DenseNet169 and VGG16 achieved good accuracies of 94.19% and 94.92%, respectively, with corresponding AUC values of 0.9824 and 0.9713, demonstrating good discrimination between parasitized and uninfected cells in terms of ROC. CNN, MobileNet, and EfficientNet-B7 also performed well, each exceeding 95% accuracy. However, InceptionV3, MobileNetV2, and EfficientNet-B1 showed poor accuracy and did not perform well [38,39].
On the other hand, our proposed NOVA outperforms all the baseline models, achieving a very high accuracy of 99.42%, with precision, recall, and F1-score all above 99% and an area under the curve (AUC) of 99.08%. These results indicate that the proposed NOVA model has strong generalization power and can extract both local and global discriminative features for automatically classifying malaria-infected and uninfected cells.