1. Introduction
Lung diseases such as pneumonia, COVID-19, pulmonary fibrosis, and interstitial lung disease (ILD) remain among the leading causes of global morbidity and mortality [1,2]. These conditions place significant strain on healthcare systems, particularly in low- and middle-income regions where timely diagnosis is often hindered by limited access to trained radiologists and advanced diagnostic infrastructure [3,4,5].
In recent years, artificial intelligence (AI) has emerged as a transformative force in medical imaging, offering powerful tools for automated diagnosis, decision support, and workflow optimization. AI-driven techniques, particularly deep learning, have demonstrated substantial potential in enhancing diagnostic accuracy and speed across a range of modalities, including chest X-rays [6,7]. For instance, computer-assisted systems based on morphological analysis have shown promise in detecting asymmetries in mammographic images [8], illustrating how AI can support clinical decisions even in complex or subtle cases.
Chest X-rays (CXRs) are widely available and cost-effective for lung disease screening, yet their interpretation remains a skill-intensive task, leading to diagnostic bottlenecks in many clinical environments [9,10]. Building on this momentum, recent advances in deep learning have shown promise in automating medical image analysis [11,12]. Convolutional Neural Networks (CNNs) have been extensively adopted for classification tasks, achieving expert-level accuracy in identifying disease patterns in CXR images [13,14]. In parallel, object detection models such as the You Only Look Once (YOLO) series [15,16] have enabled the localized identification of multiple abnormalities, offering enhanced interpretability and spatial precision in diagnosis.
In this context, classification refers to assigning an entire chest X-ray image a single label (e.g., COVID-19 or Normal), while object detection goes further by identifying and localizing multiple abnormalities within the image using bounding boxes (e.g., detecting both Pleural Effusion and Cardiomegaly in different regions). These two approaches offer complementary insights: classification supports rapid triage, while detection enhances interpretability through spatial reasoning.
However, the deployment of such models in real-world, resource-constrained environments demands more than just accuracy. Models must also be lightweight, efficient, and capable of running on embedded systems with limited computational capacity—what is broadly referred to as edge AI [17,18]. In this context, we adopt the general term “edge computing” to encompass all forms of near-data inference, ranging from the so-called Extreme Edge (or End Computing)—which includes wearable and embedded systems like the Raspberry Pi—to more capable edge servers. While distinctions exist between Extreme Edge, Edge, and Cloud layers, this study focuses on the edge spectrum as a whole, without explicitly differentiating among them. These Embedded AI systems must deliver reliable classification and detection performance while adhering to the constraints of portable or point-of-care hardware platforms, such as the Raspberry Pi or wearable devices.
1.1. Motivation and Research Gap
While both classification and detection models have individually demonstrated success in CXR interpretation, few studies have explored their performance under deployment constraints on edge hardware. Most existing research emphasizes classification or detection in isolation, often evaluated using high-end GPU infrastructure without consideration for embedded deployment [19]. Additionally, few works compare both paradigms—classification and detection—in terms of their performance trade-offs (accuracy vs. efficiency), interpretability, and real-time feasibility on edge devices.
This paper addresses this gap by conducting a comprehensive performance comparison of Embedded AI solutions for lung disease diagnosis, focusing on both classification and detection. We evaluate six widely used CNN models for five-class disease classification and a YOLO-based architecture for multi-label abnormality detection. Beyond accuracy, we benchmark each model’s inference latency on a Raspberry Pi 4 device, including detailed measurements of image preprocessing time, inference time, and overall end-to-end latency using a high-resolution input. Notably, the Raspberry Pi 4 is one of the most affordable and widely available single-board computers (SBCs), making it highly attractive for embedded deployment in clinical settings. Compared to wearable platforms such as the RealWear Navigator 500 and other edge AI-capable wearable devices, which are significantly more expensive, the Raspberry Pi offers a cost-effective solution. This economic advantage, combined with its flexibility and community support, was a key factor in selecting it as the reference platform for benchmarking real-time decision support systems—particularly in surgical environments or resource-constrained healthcare settings. The broader vision is that just as modern wearable devices provide web browsing or messaging capabilities on the wrist, future surgical wearables will embed real-time AI-driven decision support systems, enabling clinicians to receive diagnostic cues directly in the operating room.
Building on the identified research gap, this study explores the feasibility of a modular Embedded AI pipeline for lung disease diagnosis tailored to real-world deployment. The proposed system consists of two complementary components: (i) a multi-class classification model to support rapid triage and disease categorization, and (ii) a multi-label object detection model to provide spatial interpretability of thoracic abnormalities. While both components are designed for embedded environments, our current work focuses on evaluating the classification branch. Specifically, we deploy post-training quantized models on a Raspberry Pi 4 and assess real-time inference performance using high-resolution chest X-ray images. In parallel, we train a lightweight YOLOv8n model optimized for embedded detection, establishing the foundation for future integration. This modular approach offers a practical and scalable pathway toward real-time, interpretable decision support systems for mobile clinics, rural health centers, and surgical environments.
1.2. Objective
The primary objective of this study is to assess the performance and deployability of deep learning-based classification and detection models for lung disease diagnosis in embedded settings. Specifically, we aim to:
Evaluate and compare six CNN-based classifiers (ResNet101, DenseNet201, MobileNetV3-Large, InceptionResNetV2, Xception, and EfficientNetV2-B0) on five lung disease categories using base and augmented datasets;
Implement and assess a lightweight object detection model (YOLOv8n) for localizing 14 distinct thoracic abnormalities in chest X-rays;
Analyze model trade-offs in terms of accuracy, interpretability, and edge-device performance, emphasizing their suitability for embedded deployment on devices such as Raspberry Pi.
1.3. Contributions
The primary contributions of this study are as follows:
A modular, edge-oriented AI pipeline is proposed for lung disease diagnosis, comprising classification and detection components. The classification component was experimentally validated on a Raspberry Pi 4 using post-training INT8-quantized models, while the detection module (YOLOv8n) was trained and optimized for embedded deployment, but not yet deployed.
A comparative performance evaluation of six CNN-based classifiers (ResNet101, DenseNet201, MobileNetV3-Large, InceptionResNetV2, Xception, and EfficientNetV2-B0) was performed using base and augmented versions of a curated five-class lung disease dataset.
A lightweight YOLOv8n detector was trained to identify 14 thoracic abnormalities in chest X-rays, and its detection performance was analyzed using confidence-based metrics and mean Average Precision (mAP).
A detailed runtime analysis was performed on the Raspberry Pi 4, capturing model loading, image preprocessing, inference time, and end-to-end latency on high-resolution inputs. These metrics were used to evaluate classification performance under real-time embedded constraints.
The trade-offs between classification and detection were discussed from an Embedded AI deployment perspective, with emphasis on diagnostic complementarity, inference efficiency, and interpretability.
The remainder of the paper is structured as follows: Section 2 reviews the literature on deep learning for lung disease analysis and embedded AI. Section 3 describes the datasets, model architectures, and experimental setup. Section 4 presents the classification and detection results along with edge deployment performance. Section 5 explores diagnostic implications and limitations. Section 6 concludes the paper with future directions.
2. Related Work
The application of deep learning to medical imaging has driven significant progress in the diagnosis of lung diseases from chest X-ray (CXR) images. While many studies address classification or detection separately, few have explored their integration within a resource-efficient, Embedded AI framework suitable for real-world deployment. This section reviews the related literature across four themes: classification models, detection models, edge deployment strategies, and multi-stage diagnostic pipelines.
2.1. Deep Learning for Lung Disease Classification
CNN-based models have achieved strong results in lung disease classification from CXR images. Architectures such as ResNet [20], DenseNet [21], Inception [22], and EfficientNet [23] have demonstrated high accuracy across multi-class tasks involving conditions like COVID-19, pneumonia, and lung opacity [24,25]. These models learn spatial features hierarchically and often rival radiologists in performance.
However, their real-time deployment is often constrained by computational demands. To address this, lightweight models like MobileNetV2/V3 [26,27], SqueezeNet [28], and ShuffleNet [29] have been adopted for edge AI scenarios, sacrificing some predictive power for faster inference. Despite their efficiency, classification models typically lack interpretability and cannot localize abnormalities—limiting their clinical transparency.
2.2. Object Detection for Thoracic Abnormalities
To provide localization and visual justification, object detection models such as Faster R-CNN [30], RetinaNet [31], and YOLO [32] have been used to identify disease-specific regions within CXRs. The YOLO family, particularly YOLOv5 and YOLOv8, offers an effective balance of speed and accuracy for real-time medical applications [33]. Studies show YOLO-based models can detect thoracic conditions such as pleural effusion, lung nodules, and COVID-19-associated opacities [34,35]. Lightweight variants (YOLOv5n, YOLOv8n) have been developed to facilitate edge deployment.
Despite their value, detection models require detailed bounding box annotations, which are costly to obtain. Moreover, they are seldom evaluated on low-power devices, making their real-world feasibility uncertain—especially in embedded systems such as mobile or wearable diagnostic platforms.
2.3. Edge-AI Deployment for Medical Diagnosis
As Embedded AI systems gain traction in healthcare, the focus has shifted to making AI deployable on low-resource devices. Solutions such as Raspberry Pi, Jetson Nano, and RealWear headsets enable diagnostic inference in rural and mobile clinics [36,37,38,39,40]. Techniques like model quantization, pruning, and conversion to TFLite [41] or ONNX are commonly used to reduce model size and inference latency.
Several works have reported successful deployment of convolutional neural networks on Raspberry Pi for COVID-19 and pneumonia screening [42]. For object detection, models such as YOLOv5n have been demonstrated on Jetson devices with reasonable accuracy and speed [43,44]. However, side-by-side evaluations of classification and detection in the same embedded system remain rare, leaving a gap in comparative performance understanding.
2.4. Toward Unified and Interpretable Pipelines
Integrating classification and detection tasks offers the potential to build hybrid diagnostic pipelines, where a classification model rapidly screens an image for abnormality and a detection model provides spatial localization for detailed interpretation. This dual-stage approach mirrors real-world clinical workflows, where an initial diagnosis is often followed by targeted examination of suspected regions.
Some prior studies have explored combining classification with interpretability tools, such as class activation maps (CAMs) or attention-based mechanisms, to highlight regions associated with disease patterns. These methods provide visual cues alongside classification predictions, enhancing clinical transparency. One recent approach demonstrated the integration of Grad-CAM (Gradient-weighted Class Activation Mapping) with lightweight convolutional networks for multi-class pulmonary disease classification, effectively balancing model efficiency with interpretability [45]. However, most existing approaches stop short of full object detection and instead rely on weak localization without bounding box precision.
Moreover, the full integration of classification and object detection models into a single, resource-efficient, edge-deployable system remains largely unexplored. While cloud-based or high-end deployments exist, systematic evaluations of such hybrid pipelines under embedded computing constraints are rare. This gap motivates the present work, which benchmarks classification and detection models independently and analyzes their synergistic potential for real-time, interpretable, and lightweight lung disease diagnosis in Embedded AI environments.
3. Methodology
This study adopts a dual deep learning methodology to assess the performance and deployment feasibility of AI-based lung disease diagnosis using chest X-ray (CXR) images, particularly in resource-constrained environments. The workflow comprises two complementary tasks: image-level classification and object-level detection.
For classification, six state-of-the-art convolutional neural network (CNN) models—ResNet101, DenseNet201, MobileNetV3-Large, InceptionResNetV2, Xception, and EfficientNetV2-B0—are trained to categorize CXR images into five diagnostic classes: COVID-19, Bacterial Pneumonia, Viral Pneumonia, Lung Opacity, and Normal. Training is performed on a GPU workstation, and post-training inference is evaluated on a Raspberry Pi 4 Model B to analyze suitability for edge deployment.
In parallel, a YOLOv8n object detection model is employed to localize 14 thoracic abnormalities, including Consolidation, Pleural Effusion, and Pulmonary Fibrosis, using a separate, bounding box-annotated dataset. The model’s performance is evaluated using standard detection metrics such as mean Average Precision (mAP) and Intersection over Union (IoU).
Both tasks are independently executed with dedicated datasets and training pipelines. A correlation analysis is later performed to examine alignment between classification outcomes and localized findings. This dual-track approach facilitates the design of a hybrid diagnostic framework that combines speed, interpretability, and deployment feasibility for real-world edge-AI clinical applications.
3.1. Datasets and Models
3.1.1. Classification Dataset
The classification dataset used in this study follows the five-class configuration proposed by Vantaggiato et al. [46], covering COVID-19, Normal, Bacterial Pneumonia, Viral Pneumonia, and Lung Opacity (non-pneumonia diseases). This composite dataset was built from multiple public medical imaging repositories, including the IEEE8023 COVID-19 Chest X-ray dataset, the RSNA Pneumonia Detection Challenge, CheXpert, and the Kaggle Pneumonia dataset. The labels in these sources were derived from structured hospital records, institutional diagnoses, or radiologist-generated reports, ensuring clinical credibility. For example, CheXpert includes labels extracted from radiology reports using a rule-based NLP system, while the Kaggle datasets were annotated based on physician input or radiology summaries. In addition, 207 COVID-19 chest X-rays and an equal number of test samples from other classes were obtained directly from the Hospital of Tolga, Algeria, with diagnoses verified by clinical experts. These clinically vetted sources help ensure that the classification labels reflect real-world diagnostic decisions, although inter-observer variability remains a known limitation in medical imaging.
The curated classification dataset contains 2020 chest X-ray images, evenly distributed across the five diagnostic categories. For training and evaluation, the dataset was split using a 70%/15%/15% ratio into training, validation, and test subsets, respectively. To improve model generalization, an augmented version of the dataset was created, expanding the total to 9875 images. This corresponds to approximately four augmentation variants per original image. Augmentation operations included random horizontal flipping, rotation, brightness adjustments, and zoom transformations. A small portion of the dataset was retained for testing to evaluate model performance on unseen data.
Each image was labeled with a single class, without bounding boxes, making the dataset strictly classification-specific. Preprocessing involved resizing the images to each architecture’s required input dimensions and normalizing pixel values to a common range. The dataset was used to evaluate six CNN architectures in terms of diagnostic accuracy, robustness to augmentation, and suitability for embedded deployment.
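A minimal sketch of this preprocessing and augmentation pipeline in TensorFlow/Keras is shown below; the target size, directory layout, and augmentation parameter values are illustrative assumptions rather than the exact study settings.

import tensorflow as tf

# Sketch of the resize/normalize preprocessing and the augmentation operations
# described above (flipping, rotation, brightness, zoom). All parameter values,
# the target size, and the directory layout are illustrative assumptions.
IMG_SIZE = (224, 224)  # placeholder; the actual size depends on the architecture

train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1.0 / 255,              # pixel normalization
    horizontal_flip=True,           # random horizontal flipping
    rotation_range=15,              # random rotation (degrees)
    brightness_range=(0.8, 1.2),    # brightness adjustment
    zoom_range=0.1,                 # zoom transformation
)
val_datagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1.0 / 255)

train_gen = train_datagen.flow_from_directory(
    "dataset/train", target_size=IMG_SIZE, batch_size=32, class_mode="categorical")
val_gen = val_datagen.flow_from_directory(
    "dataset/val", target_size=IMG_SIZE, batch_size=32, class_mode="categorical")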
3.1.2. Detection Dataset
For the object detection task, we used a curated subset of the VinBigData Chest X-ray Abnormalities Detection dataset [47], publicly available in COCO format (vinbigdata_coco_chest_xray__wbf_yolo_his) via Kaggle (https://www.kaggle.com/datasets/mmmmmmmmmma/vinbigdata-coco-chest-xray-wbf-yolo-his) (accessed on 13 August 2025). This dataset originates from a large-scale initiative involving over 100,000 chest X-ray scans retrospectively collected from two major hospitals in Vietnam. From this raw collection, a subset of 18,000 images was manually annotated by 17 experienced radiologists with 22 local (bounding box) and 6 global (image-level) abnormality labels. The training set of 15,000 scans was triple-labeled by independent radiologists, while the 3000-scan test set was labeled by consensus from five experts, ensuring high-quality clinical annotation.
In our study, we used a refined version of this dataset comprising 3296 images in YOLO-ready format, where bounding boxes for 14 thoracic abnormalities were consolidated using Weighted Box Fusion (WBF). This format ensures greater consistency in overlapping predictions and minimizes annotation noise. Such clinically vetted and spatially localized labels are critical for training reliable object detection models like YOLOv8n.
3.1.3. Classification Models
We evaluated six convolutional neural networks (CNNs), selected for their architectural diversity and suitability for both high-accuracy and edge-friendly deployment: ResNet101 (deep residual connections), DenseNet201 (dense layer connectivity), MobileNetV3-Large (optimized for low-power devices), InceptionResNetV2 (inception modules with residual links), Xception (depthwise separable convolutions), and EfficientNetV2-B0 (compound model scaling). This selection balances classical and state-of-the-art CNN architectures to enable a comprehensive assessment across different model families. In particular, MobileNetV3-Large and EfficientNetV2-B0 are chosen for their proven efficiency on embedded platforms, while deeper models like DenseNet201 and ResNet101 serve as performance references. Although Xception is moderately heavy compared to MobileNet or EfficientNet, it was selected for its efficient use of depthwise separable convolutions and its ability to balance accuracy with computational cost. InceptionResNetV2, while more complex, was included for completeness due to its exceptional feature extraction capabilities and to provide insights into the performance limits of high-capacity models on constrained devices.
All models were trained on the five-class classification dataset using categorical cross-entropy loss and evaluated on both base and augmented datasets. Early stopping was used to prevent overfitting. Performance was assessed using accuracy, precision, recall, F1-score, and confusion matrices across the COVID-19, Bacterial Pneumonia, Viral Pneumonia, Lung Opacity, and Normal categories.
3.1.4. Detection Model
We employed YOLOv8n, a lightweight, anchor-free variant of the YOLOv8 family, optimized for efficient inference in edge environments. It features a CSPDarknet-inspired backbone and decoupled head for classification and regression, with support for ONNX and TensorRT exports. YOLOv8n was trained for 50 epochs using stochastic gradient descent (SGD) with a cosine learning rate schedule on a multi-label chest X-ray dataset comprising 14 thoracic abnormalities, including Consolidation, Pulmonary Fibrosis, and Pleural Effusion. The Ultralytics PyTorch [48,49] pipeline (Ultralytics v8.3.30, PyTorch v2.5.1) was used with real-time augmentations. Performance was evaluated using mAP@0.5, mAP@0.5:0.95, and class-wise precision, recall, and IoU, highlighting its potential for spatially aware, edge-deployable diagnosis.
3.2. Training Setup
All experiments were conducted on a high-performance setup with an NVIDIA Tesla GPU, 13th Gen Intel Core i7-1365U CPU (12 cores), 32 GB RAM, and Ubuntu 22.04. Classification models were implemented using TensorFlow [50] and Keras [51], while YOLOv8n detection employed the Ultralytics PyTorch pipeline.
For classification, all CNNs were initialized with ImageNet pretrained weights and trained with the Adam optimizer (learning rate: 0.0001) using categorical cross-entropy loss. Models were trained for up to 100 epochs (batch size of 32, with an architecture-dependent input size) with early stopping and learning rate reduction on plateau. Early stopping was applied consistently across all CNN models using a patience value of 10 (i.e., training stopped if validation loss did not improve for 10 consecutive epochs), with a minimum delta threshold of 1 × 10⁻⁴. This configuration was selected after empirical testing showed improved convergence and final accuracy compared to a lower patience value. Data augmentation (flipping, rotation, zooming, brightness adjustment) was introduced in a secondary phase to improve generalization. Model checkpoints were based on best validation accuracy. For transparency, convergence plots (training and validation loss/accuracy) for all models trained on the augmented dataset are included in Appendix A. These reflect the final training phase used for model evaluation. Plots from base training are omitted to avoid redundancy, as augmentation was applied consistently across all models to improve generalization.
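A minimal tf.keras sketch of this training configuration is given below (ImageNet initialization, Adam at 1 × 10⁻⁴, early stopping with patience 10 and min_delta 1 × 10⁻⁴, plateau-based learning-rate reduction, and checkpointing on best validation accuracy); the ReduceLROnPlateau settings, file names, and the data generators train_gen/val_gen are illustrative assumptions.

import tensorflow as tf

# Sketch of the classification training loop described above. The base model shown
# is one of the six evaluated architectures; the ReduceLROnPlateau factor/patience
# and file names are assumptions, and train_gen/val_gen are assumed to be defined.
base = tf.keras.applications.MobileNetV3Large(
    weights="imagenet", include_top=False, pooling="avg", input_shape=(224, 224, 3))
outputs = tf.keras.layers.Dense(5, activation="softmax")(base.output)
model = tf.keras.Model(base.input, outputs)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="categorical_crossentropy",
    metrics=["accuracy"])

callbacks = [
    tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=10, min_delta=1e-4, restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.5, patience=5, min_lr=1e-6),
    tf.keras.callbacks.ModelCheckpoint(
        "best_model.keras", monitor="val_accuracy", save_best_only=True),
]

model.fit(train_gen, validation_data=val_gen, epochs=100, callbacks=callbacks)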
YOLOv8n was fine-tuned for 50 epochs with COCO-pretrained weights, using SGD (initial LR: 0.01 with cosine decay, momentum: 0.937, weight decay: 0.0005). Training used a fixed input resolution with a batch size of 16. Augmentations included mosaic, mixup, random scaling, and flipping. Model selection was based on peak validation mAP at IoU thresholds 0.5 and 0.5:0.95.
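A minimal sketch of this fine-tuning setup with the Ultralytics API is shown below; the dataset YAML path and the augmentation magnitudes are illustrative assumptions, and the input resolution is left at the library default since the exact value is not restated here.

from ultralytics import YOLO

# Sketch of the YOLOv8n fine-tuning configuration described above. The dataset
# YAML path and augmentation magnitudes are illustrative assumptions.
model = YOLO("yolov8n.pt")            # COCO-pretrained nano weights
results = model.train(
    data="vinbigdata_wbf.yaml",       # hypothetical 14-class dataset config
    epochs=50,
    batch=16,
    optimizer="SGD",
    lr0=0.01,                         # initial learning rate
    cos_lr=True,                      # cosine learning rate decay
    momentum=0.937,
    weight_decay=0.0005,
    mosaic=1.0,                       # mosaic augmentation
    mixup=0.1,                        # mixup augmentation (illustrative strength)
    scale=0.5,                        # random scaling (illustrative range)
    fliplr=0.5,                       # horizontal flip probability
)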
3.3. Edge Deployment
To assess feasibility in resource-constrained environments, all classification models were deployed on a Raspberry Pi 4 Model B (4 GB RAM, ARM Cortex-A72, Ubuntu 22.04, Python 3.9.2). Each model was converted from standard TensorFlow to TensorFlow Lite (TFLite), a lightweight framework optimized for mobile and embedded devices. TFLite models use a reduced runtime and smaller binaries, enabling fast and efficient inference on ARM-based systems. We applied post-training quantization, typically reducing weights and activations to 8-bit integer (int8) or 16-bit floating point (float16) precision, depending on model compatibility and target hardware. Full integer quantization was preferred for maximum speed and compression, while float16 was used where preserving numerical fidelity was important. This process significantly reduced memory footprint and improved inference speed while preserving classification performance and easing portability across devices. Inference was evaluated using 10 high-resolution test images per model, recording top-1 classification accuracy, model load time, image preprocessing time, inference time, and overall end-to-end latency. This setup was designed to closely simulate real-world deployment conditions on the Raspberry Pi 4 Model B using PNG inputs.
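A condensed sketch of this TFLite conversion and post-training quantization workflow is shown below; the representative calibration dataset, the Keras model variable, and the output file names are illustrative assumptions.

import tensorflow as tf

# Sketch of the post-training quantization described above. The calibration
# dataset (calib_dataset) and the trained Keras model are assumed to be defined;
# file names are illustrative assumptions.
def representative_data_gen():
    for image, _ in calib_dataset.take(100):   # small calibration subset
        yield [tf.cast(image, tf.float32)]

# Full integer (int8) quantization, preferred for maximum speed and compression
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
open("model_int8.tflite", "wb").write(converter.convert())

# Alternative: float16 quantization where numerical fidelity matters more
converter_fp16 = tf.lite.TFLiteConverter.from_keras_model(model)
converter_fp16.optimizations = [tf.lite.Optimize.DEFAULT]
converter_fp16.target_spec.supported_types = [tf.float16]
open("model_fp16.tflite", "wb").write(converter_fp16.convert())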
YOLOv8n was not yet deployed in this study due to its computational demands and the ongoing nature of our benchmarking on embedded accelerators. However, the model was exported to ONNX format to ensure compatibility with a wide range of hardware platforms, such as the NVIDIA Jetson Orin Nano or Raspberry Pi with TPU accelerator. This exportability supports a modular hybrid deployment strategy, wherein lightweight classification is performed locally on-device for rapid triage, while detection can be offloaded to nearby accelerators via Multi-access Edge Computing (MEC) for spatial interpretation. This forward-looking design aims to balance responsiveness, interpretability, and computational efficiency, paving the way for scalable deployment in real-world healthcare settings.
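For reference, the ONNX export step can be performed with the Ultralytics API as sketched below; the weights path is an illustrative assumption.

from ultralytics import YOLO

# Sketch of exporting the trained YOLOv8n detector to ONNX for accelerator runtimes.
# The weights path is an illustrative assumption.
model = YOLO("runs/detect/train/weights/best.pt")
model.export(format="onnx")   # writes an .onnx file alongside the weights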
3.4. Evaluation Metrics
To evaluate model performance and deployment feasibility, we employed metrics tailored to classification, detection, and edge inference tasks.
3.4.1. Classification Metrics
For classification, we computed accuracy, precision, recall, and F1-score, along with confusion matrices across the five categories: COVID-19, Bacterial Pneumonia, Viral Pneumonia, Lung Opacity, and Normal. These metrics were calculated on both the base and augmented datasets to evaluate generalization capability. To assess edge deployment feasibility, we also recorded the top-1 accuracy on Raspberry Pi using high-resolution test images under quantized inference settings.
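For reference, these metrics follow their standard definitions in terms of the per-class confusion-matrix counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN): Accuracy = (TP + TN) / (TP + TN + FP + FN), Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1-score = 2 × Precision × Recall / (Precision + Recall).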
3.4.2. Detection Metrics
Detection performance was assessed using mean Average Precision (mAP) at IoU thresholds of 0.5 (mAP@0.5) and 0.5:0.95 (mAP@0.5:0.95). Additional metrics included class-wise precision, recall, and IoU to evaluate the spatial accuracy of localized predictions across 14 thoracic abnormalities.
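For reference, IoU between a predicted box P and a ground-truth box G is defined as IoU(P, G) = area(P ∩ G) / area(P ∪ G). A prediction counts as a true positive when its IoU with a same-class ground-truth box exceeds the chosen threshold; Average Precision (AP) is the area under the resulting precision–recall curve for each class, mAP@0.5 averages AP over the 14 classes at an IoU threshold of 0.5, and mAP@0.5:0.95 further averages over IoU thresholds from 0.5 to 0.95 in steps of 0.05.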
3.4.3. Edge Inference Metrics
To assess deployment feasibility on Raspberry Pi, we evaluated the end-to-end latency of each quantized model using a high-resolution chest X-ray image (2566 × 2566, PNG format). The reported latency includes sample preprocessing and inference time, excluding model loading as it occurs only once during deployment. In addition, we recorded the TensorFlow Lite model size to quantify the trade-off between resource efficiency and diagnostic performance in embedded healthcare environments.
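A minimal sketch of this per-image measurement with the TensorFlow Lite interpreter is shown below; the file names and normalization are illustrative assumptions, and model loading is timed separately from the per-image latency.

import time
import numpy as np
import tensorflow as tf
from PIL import Image

# Sketch of the per-image latency measurement described above. File names and the
# normalization scheme are illustrative assumptions.
interpreter = tf.lite.Interpreter(model_path="model_int8.tflite")
interpreter.allocate_tensors()          # one-time model loading, excluded from latency
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

t0 = time.perf_counter()
img = Image.open("cxr_2566x2566.png").convert("RGB")   # hypothetical high-resolution input
h, w = inp["shape"][1], inp["shape"][2]
x = np.asarray(img.resize((w, h)), dtype=np.float32) / 255.0
x = np.expand_dims(x, 0)
if inp["dtype"] == np.int8:             # quantize the input if the model expects int8
    scale, zero = inp["quantization"]
    x = (x / scale + zero).astype(np.int8)
t1 = time.perf_counter()                # preprocessing time = t1 - t0

interpreter.set_tensor(inp["index"], x)
interpreter.invoke()
probs = interpreter.get_tensor(out["index"])
t2 = time.perf_counter()                # inference time = t2 - t1; end-to-end = t2 - t0
print(f"preprocess {1000*(t1-t0):.1f} ms, inference {1000*(t2-t1):.1f} ms")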
3.5. Correlation Analysis
To assess the diagnostic complementarity between classification and detection models, we conducted a correlation analysis linking the five classification categories—COVID-19, Bacterial Pneumonia, Viral Pneumonia, Lung Opacity, and Normal—with the 14 thoracic abnormalities identified by the YOLOv8n detection model. Although trained on separate datasets, both models reflect overlapping radiographic features observed in clinical practice.
Notable associations include Consolidation and Pleural Effusion with bacterial pneumonia; Infiltration, Pulmonary Fibrosis, and ILD with COVID-19 and lung opacity; and Atelectasis or Pleural Thickening with viral or post-infectious presentations. The Normal class generally aligned with the absence of detection outputs, reinforcing the consistency between tasks.
Table 1 summarizes these relationships, highlighting how spatially localized detections can contextualize and validate image-level classifications. This diagnostic synergy supports a hybrid framework wherein classification enables rapid triage and detection provides visual interpretability—especially valuable for real-time deployment on edge-AI systems in low-resource settings.
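As an illustration only, the associations noted above can be encoded as a simple lookup; this is a partial, hypothetical rendering of the relationships, not the full Table 1.

# Partial, illustrative mapping between classification categories and detected
# abnormalities, based on the associations noted above (not the full Table 1).
CLASS_TO_FINDINGS = {
    "Bacterial Pneumonia": ["Consolidation", "Pleural Effusion"],
    "COVID-19": ["Infiltration", "Pulmonary Fibrosis", "ILD"],
    "Lung Opacity": ["Infiltration", "Pulmonary Fibrosis", "ILD"],
    "Viral Pneumonia": ["Atelectasis", "Pleural Thickening"],
    "Normal": [],  # expected absence of detection outputs
}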
4. Results
This section presents the outcomes of our dual deep learning framework for lung disease diagnosis using chest X-ray images. We first evaluate the performance of six CNN-based classification models trained on both base and augmented datasets, followed by the detection performance of a YOLOv8n model on a separate multi-label dataset. We then assess edge deployment feasibility by benchmarking inference on a Raspberry Pi, and conclude with a combined analysis that explores the diagnostic complementarity of classification and detection.
4.1. Classification Results
We evaluated six CNN architectures—ResNet101, DenseNet201, MobileNetV3-Large, InceptionResNetV2, Xception, and EfficientNetV2-B0—for multi-class classification of chest X-rays across five disease categories.
Table 2 and Table 3 summarize their performance on the base and augmented datasets, respectively.
On the base dataset, MobileNetV3-Large achieved the highest validation accuracy (62.0%), followed closely by ResNet101 (61.4%) and EfficientNetV2-B0 (60.4%). DenseNet201 delivered moderate performance (57.8%), while Xception (51.6%) and InceptionResNetV2 (20.0%) underperformed significantly. Notably, InceptionResNetV2 failed to learn effectively, with its validation accuracy plateauing at random guessing levels.
Data augmentation led to slight but consistent improvements across most models. MobileNetV3-Large improved to 63.4%, and ResNet101 to 62.8%, while DenseNet201 also saw a minor gain to 58.4%. These improvements suggest better generalization when exposed to increased data variability. However, some models—such as EfficientNetV2-B0—exhibited signs of overfitting, achieving very high training accuracy (96.05%) without corresponding improvements in validation performance.
To further assess model behavior under real-world conditions, we performed inference on a representative test image labeled as Bacterial Pneumonia. Among the six quantized models deployed on the Raspberry Pi 4, only the Xception model correctly identified the class, albeit with moderate confidence (41.4%). All other models misclassified the sample, with DenseNet201, ResNet101, and InceptionResNetV2 predicting COVID-19, EfficientNetV2-B0 predicting Viral Pneumonia, and MobileNetV3-Large labeling it as Normal. Most incorrect predictions were made with high confidence, indicating overconfidence in misclassification and revealing overlapping feature representations among pneumonia-related classes. This real-world inference underscores the challenge of distinguishing bacterial and viral pathologies, highlighting the importance of improved model calibration and interpretability in clinical AI deployments.
Overall, these results emphasize the trade-offs between model capacity and generalization. Lightweight models such as MobileNetV3-Large achieved strong validation performance with relatively low overfitting, making them suitable candidates for deployment in resource-constrained environments, as discussed in Section 4.3.
4.2. Detection Results
We evaluated YOLOv8n for multi-label detection of 14 thoracic conditions from chest X-rays. The model achieved a mAP@0.5 of 27.6% and mAP@0.5:0.95 of 14.7%, reflecting moderate performance given the class imbalance, small lesion sizes, and visual overlap between conditions.
Figure 1 shows the normalized confusion matrix, highlighting high true positive rates for well-defined classes such as Pulmonary Fibrosis, Pleural Effusion, and Aortic Enlargement. Lower recall was observed for rare and visually ambiguous findings like Calcification and Pneumothorax.
Loss convergence (Figure 2) indicates stable training across all components. Qualitative examples in Figure 3 confirm YOLOv8n’s ability to accurately localize multiple conditions, including co-occurrences like Cardiomegaly and Pleural Effusion. However, smaller pathologies (e.g., Nodule/Mass) were frequently missed, revealing limitations in spatial sensitivity.
Detection behavior is summarized in Figure 4, which includes precision–recall, confidence-based, and F1 performance curves. Precision reached 1.00 at high confidence (0.872), while F1-score peaked at 0.31 (threshold = 0.091), suggesting the model favors high-precision predictions at the expense of recall. We selected the nano version of YOLOv8 due to its compact architecture and significantly reduced computational footprint, making it more suitable for resource-constrained environments and compatible with potential deployment on edge accelerators like Jetson Nano or Coral TPU.
Figure 4e,f highlight the dataset’s skewed label distribution and central clustering of bounding boxes, which may bias detection outcomes. Most annotations overlap in the thoracic region, complicating fine-grained localization in multi-disease contexts.
In summary, YOLOv8n demonstrates reliable detection for dominant classes but struggles with rare or subtle features. Future improvements may include class re-weighting, advanced augmentation, and multi-scale feature enhancement. Nonetheless, YOLOv8n’s efficiency, ONNX support, and edge compatibility make it a strong candidate for hybrid diagnostic pipelines alongside classification models.
4.3. Edge Inference Results
To evaluate real-world feasibility, we measured the end-to-end latency of each quantized classification model on a Raspberry Pi 4, using a high-resolution chest X-ray image (2566 × 2566 pixels, PNG format) to better reflect deployment scenarios in clinical settings. The reported latency includes image preprocessing and inference time, while model loading is excluded, as it occurs only once during deployment and is not part of repeated inference cycles. Among the tested models, MobileNetV3-Large achieved the fastest performance with an end-to-end latency of 429.6 ms, followed by EfficientNetV2-B0 at 623.0 ms, confirming their suitability for real-time embedded deployment. In contrast, InceptionResNetV2 and ResNet101 exhibited significantly higher latencies of 2300.7 ms and 1908.2 ms, respectively, due to their deeper architectures and larger parameter counts. Notably, even with a full-resolution image, all models completed end-to-end inference in under 2.4 s, demonstrating the feasibility of deploying AI-based decision support systems on embedded hardware, provided that quantization and architectural choices are carefully optimized for edge constraints. A detailed breakdown of model load time, preprocessing time, inference time, and total latency is presented in Table 4.
It is important to clarify that model loading time was excluded from our reported total latency, as it is a one-time operation performed during system initialization. In realistic deployment scenarios, the model remains loaded in memory for repeated use and therefore does not contribute to the per-image inference delay. The latency values presented reflect only the runtime operations (preprocessing and inference) that occur for each input image.
The YOLOv8n detection model was not deployed on the Raspberry Pi due to its computational demands. Instead, it was exported to ONNX format for execution on edge accelerators such as the NVIDIA Jetson Nano or Coral TPU. In realistic setups, a hybrid diagnostic pipeline is envisioned: lightweight CNNs perform local classification on-device, while detection is offloaded to a Multi-access Edge Computing (MEC) server over 5G for spatial interpretation. This design optimizes diagnostic responsiveness and interpretability in mobile and underserved clinical environments.
4.4. Impact of Quantization on Prediction Confidence
To evaluate the effect of post-training quantization on model prediction behavior, we compared the classification outcomes and softmax confidence values of all six models before and after 8-bit quantization, using the same high-resolution chest X-ray image (2566 × 2566 pixels, PNG format). As shown in Appendix B (Table A1), quantization preserved the predicted class in five out of six models, with only DenseNet201 exhibiting a shift in predicted label from Normal to COVID-19.
Despite minor shifts in probability magnitudes, the overall confidence distributions remained stable across models, with the dominant class maintaining its relative position. Quantized models displayed slightly sharpened or dampened softmax values, a typical artifact of reduced floating-point precision. Crucially, no unsafe class inversions—such as a misclassification between Normal and COVID-19—were observed, preserving the integrity of clinical safety margins.
These results suggest that 8-bit quantization introduces minimal risk to classification reliability when applied to well-optimized models, reinforcing its suitability for resource-constrained, real-time medical inference on embedded edge hardware.
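A minimal sketch of this pre-/post-quantization comparison is given below; the Keras model variable, the preprocessed input x (a float32 batch of shape (1, H, W, 3)), and the file name are assumptions.

import numpy as np
import tensorflow as tf

# Sketch of comparing the float32 Keras model with its quantized TFLite counterpart
# on the same preprocessed input x. Model, x, and file names are assumptions.
float_probs = model.predict(x)[0]

interpreter = tf.lite.Interpreter(model_path="model_int8.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

xq = x
if inp["dtype"] == np.int8:                      # quantize the input if required
    scale, zero = inp["quantization"]
    xq = (x / scale + zero).astype(np.int8)
interpreter.set_tensor(inp["index"], xq)
interpreter.invoke()

quant_probs = interpreter.get_tensor(out["index"])[0].astype(np.float32)
if out["dtype"] == np.int8:                      # dequantize the output if required
    scale, zero = out["quantization"]
    quant_probs = (quant_probs - zero) * scale

print("predicted class preserved:", np.argmax(float_probs) == np.argmax(quant_probs))
print("largest softmax shift:", float(np.abs(float_probs - quant_probs).max()))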
4.5. Combined Analysis and Insights
To assess the diagnostic synergy of classification and detection, we analyzed how their outputs aligned and complemented each other. Classification models, such as DenseNet201 and EfficientNetV2-B0, offered accurate image-level predictions but lacked spatial interpretability. In contrast, the YOLOv8n detection model localized 14 thoracic abnormalities, many of which semantically mapped to the classification categories.
For example, Consolidation and Infiltration often co-occurred in cases predicted as Bacterial or Viral Pneumonia, while findings such as Pulmonary Fibrosis, ILD, and Lung Opacity were common in COVID-19 diagnoses. Pleural Effusion frequently accompanied Bacterial Pneumonia, supporting cross-validation of model outputs.
In uncertain classification cases, YOLOv8n’s localized outputs (e.g., bounding boxes for Effusion) enhanced interpretability and offered decision support. While visual overlays were not included in this study, the observed semantic consistency between detection and classification underscores the diagnostic value of combining both approaches.
This supports a hybrid Embedded AI pipeline: classification provides rapid triage on edge devices like Raspberry Pi, and detection—offloaded to accelerators or MEC infrastructure—adds spatial context. Such an architecture enhances transparency, reliability, and responsiveness in low-resource deployments.
4.6. Summary of Key Findings
Classification Performance: DenseNet201 and EfficientNetV2-B0 achieved top accuracy; MobileNetV3-Large offered the best speed–accuracy trade-off for edge deployment.
Data Augmentation Benefits: Augmented datasets improved generalization, especially for compact models, and reduced class confusion.
Detection Effectiveness: YOLOv8n achieved mAP@0.5 of 27.6% and mAP@0.5:0.95 of 14.7%. High precision was observed for visually distinct classes, with lower recall for rare or subtle findings.
Edge Feasibility: MobileNetV3-Large and EfficientNetV2-B0 demonstrated efficient inference on Raspberry Pi. YOLOv8n is better suited for hardware accelerators or MEC offloading.
Hybrid Potential: Correlation between classification and detection outputs validates a combined framework, enhancing diagnostic confidence, transparency, and edge deployability.
These results underscore the value of combining classification and detection in Embedded AI solutions for lung disease diagnosis and support their integration into real-world clinical workflows.
5. Discussion
This study explored a dual deep learning strategy—image-level classification and object-level detection—for AI-assisted lung disease diagnosis from chest X-rays. The integration of both approaches offers a more comprehensive and interpretable diagnostic workflow, particularly valuable in low-resource settings.
5.1. Interpretability and Practical Deployment
One of the primary challenges in deploying classification models in clinical practice is their limited transparency. The integration of object detection via YOLOv8n improves interpretability by providing spatial localization of abnormalities (e.g., Consolidation, Pleural Effusion, Pulmonary Fibrosis). These localized outputs align with classification labels, enhancing clinical trust and supporting more informed decision-making.
From a deployment standpoint, our experiments on Raspberry Pi 4 confirmed that lightweight classification models such as MobileNetV3-Large and EfficientNetV2-B0 achieve a favorable trade-off between accuracy and latency, enabling real-time inference in embedded contexts. While the YOLOv8n detection model was not deployed as part of this work, it is technically feasible to run it on Raspberry Pi 4 when optimized (e.g., via ONNX or TFLite) and supported by lightweight runtime environments. For enhanced performance or throughput, integration with hardware accelerators like the Coral TPU or NVIDIA Jetson Orin Nano is also possible. In a practical deployment scenario, classification could be used for on-device triage, while detection would provide spatial explainability, either on the same device or via a lightweight co-processor, enabling a flexible and interpretable edge-AI diagnostic workflow.
5.2. Complementary Value of Classification and Detection
Although this study evaluates classification and detection models independently, their integration presents practical diagnostic advantages aligned with real-world clinical workflows. The classification component serves as a rapid, low-complexity screening tool suitable for embedded platforms like Raspberry Pi, while the detection module contributes interpretability through spatial localization of abnormalities.
While a quantitative fusion of outputs was not conducted, qualitative observations indicate that the detected regions of interest often align well with the classification categories, particularly for conditions such as bacterial pneumonia, COVID-19, and pleural effusion. This complementary behavior supports the envisioned modular pipeline, where classification facilitates triage and detection augments interpretability.
Notably, the full integration of both modules into a unified edge-deployable pipeline remains a future goal. However, the modular architecture and deployment-aware design already offer insights into how such hybrid AI systems can be structured. Future work will investigate adaptive strategies, where detection is selectively triggered based on classification uncertainty, thereby enhancing diagnostic precision while managing computational overhead in resource-constrained environments.
5.3. Impact of Data Augmentation and Future Dataset Expansion
The application of standard data augmentation techniques yielded only marginal improvements for certain lightweight models and did not substantially enhance overall classifier performance. This outcome suggests that, within our current experimental framework, augmentation alone is insufficient to overcome the limitations imposed by the relatively small dataset size. Moreover, conventional augmentation techniques may not always introduce meaningful diversity to chest X-ray images, as common transformations can result in redundant or noninformative alterations to lung-specific features. This highlights the need for more context-aware or advanced augmentation strategies—tailored specifically for medical imaging—that preserve anatomical realism while enriching training variability.
Future research should focus on expanding the dataset in a more systematic and scalable manner, ideally incorporating real-world clinical diversity. Federated learning frameworks can offer a viable path by enabling collaborative model training across distributed institutions while preserving patient privacy and adhering to regulatory constraints.
In parallel, emphasis should remain on training low-complexity, quantization-friendly models that retain high diagnostic value while being optimized for real-time inference on resource-constrained edge devices.
5.4. Limitations and Future Work
A key limitation of this study lies in the use of separate datasets for classification and detection tasks, preventing joint optimization or end-to-end learning of a unified model. Future research could explore semi-supervised learning or multi-task learning approaches to bridge this gap and exploit partially labeled data.
Additionally, although the feasibility of classification model inference was thoroughly validated on Raspberry Pi 4 using quantized models, the hybrid pipeline integrating both classification and detection was not experimentally implemented or profiled end-to-end. Nevertheless, the proposed modular architecture reflects a novel and practical strategy, where lightweight classification is performed on-device for rapid triage, and detection is selectively activated via remote accelerators such as Jetson Orin Nano or edge servers. This design anticipates real-time decision support systems deployable in surgery rooms or mobile clinical setups. Future work will focus on fully integrating this pipeline and evaluating its end-to-end latency, diagnostic synergy, and clinical viability.
Another important limitation concerns the relatively small dataset and class imbalance, which particularly affects rare abnormalities and their detection accuracy. Enlarging the dataset with clinically validated annotations would improve the model’s generalization capacity. More rigorous cross-dataset validation and clinical trials involving real-world deployment are necessary to ensure clinical robustness.
Furthermore, some models required different input resolutions, which might have affected performance consistency. Standardizing preprocessing pipelines or using resolution-invariant architectures should be explored in future work.
Moreover, while this study focused on six well-established CNN architectures for their diversity and relevance to embedded deployment, future work will expand the model pool to include additional lightweight families such as ShuffleNet and SqueezeNet. This will enable a more comprehensive exploration of the trade-offs between latency, accuracy, and deployability across a broader spectrum of edge-oriented models.
Finally, while the datasets used were labeled by radiologists or derived from structured hospital sources, no direct collaboration with clinical experts was involved in evaluating the model’s interpretability or usability. Future efforts should prioritize expert-in-the-loop validation to align AI outputs with clinical workflows.
Additionally, the datasets used in this study lack complete demographic annotations such as age, sex, or ethnicity, limiting our ability to assess or control for demographic bias. This represents an important constraint on the generalizability of the results, particularly for real-world deployment across diverse populations. Future work should consider demographic balancing and fairness-aware modeling to ensure equitable diagnostic performance across groups.
5.5. Novelty and Positioning
This work introduces a modular, edge-oriented AI pipeline for lung disease diagnosis, integrating both multi-class classification and multi-label object detection to reflect real-world diagnostic workflows. While most prior studies emphasize accuracy on high-end computing platforms, few consider the deployability of AI models on resource-constrained edge hardware. To the best of our knowledge, no existing work systematically evaluates the feasibility and complementary value of combining classification and detection models in embedded settings.
The novelty of this study lies in (i) benchmarking six quantized classification models on Raspberry Pi 4 for real-time inference, using high-resolution chest X-ray inputs to simulate deployment conditions, and (ii) training a YOLOv8n detection model that is compatible with embedded accelerators, though not deployed in this version. Together, these components demonstrate a theoretically viable hybrid diagnostic pipeline, with classification enabling rapid triage and detection offering spatial interpretability.
While a full end-to-end prototype was not implemented, this modular approach forms the foundation for future hybrid pipelines deployable in mobile clinics, wearable medical devices, or surgical assistance systems. The study prioritizes a balance between speed, interpretability, and resource efficiency—three essential pillars for embedded AI solutions in real-time, distributed healthcare environments. This forward-looking perspective distinguishes the work from the existing literature focused solely on algorithmic performance or isolated evaluation.
6. Conclusions
This study conducted a comprehensive performance comparison of Embedded AI solutions for lung disease diagnosis from chest X-ray images, evaluating both classification and detection models within the context of edge-deployable healthcare. By integrating image-level classification with object-level detection, we addressed key challenges related to diagnostic accuracy, interpretability, and real-world feasibility.
Six CNN-based classifiers were benchmarked across base and augmented datasets for five-class classification. MobileNetV3-Large and EfficientNetV2-B0 demonstrated the most favorable trade-off between classification accuracy and latency, achieving real-time performance on Raspberry Pi using quantized models. For detection, a lightweight YOLOv8n model was trained to localize 14 thoracic abnormalities with competitive mAP scores, particularly excelling in the detection of high-frequency and spatially distinct conditions such as Cardiomegaly, Pleural Effusion, and Pulmonary Fibrosis. Although YOLOv8n was not deployed on embedded hardware in this study, its compact architecture makes it a promising candidate for future edge deployment.
The observed alignment between classification predictions and detected abnormalities underscores the diagnostic synergy of combining both approaches. This supports the feasibility of a hybrid diagnostic pipeline, wherein classification provides efficient triage and detection enhances clinical interpretability through spatial verification—especially valuable in low-resource or mobile healthcare settings.
In summary, this work contributes a modular, deployable AI framework that leverages the complementary strengths of classification and detection. Through experimental validation of classification inference on Raspberry Pi and the training of an edge-compatible detection model, it addresses real-world constraints often overlooked in the existing literature. Future work will focus on fully integrating and evaluating this hybrid pipeline under real-time and system-level conditions, along with clinical validation and fairness-aware modeling to ensure equitable deployment.