Digitization of Medical Device Displays Using Deep Learning Models: A Comparative Study

Ferreira, Pedro; Lobo, Pedro; Reis, Filipa; Vilaça, João L.; Morais, Pedro

doi:10.3390/app15105436

Open AccessArticle

Digitization of Medical Device Displays Using Deep Learning Models: A Comparative Study

by

Pedro Ferreira

^1,2,*

,

Pedro Lobo

^1,2

,

Filipa Reis

³,

João L. Vilaça

^1,2

and

Pedro Morais

^1,2

¹

2Ai, School of Technology, IPCA, 4750-810 Barcelos, Portugal

²

LASI—Associate Laboratory of Intelligent Systems, 4800-058 Guimarães, Portugal

³

GLINTT Life, 2710-693 Sintra, Portugal

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(10), 5436; https://doi.org/10.3390/app15105436

Submission received: 15 April 2025 / Revised: 7 May 2025 / Accepted: 9 May 2025 / Published: 13 May 2025

(This article belongs to the Special Issue Innovations in Artificial Neural Network Applications)

Download

Browse Figures

Versions Notes

Abstract

With the growing number of patients living with chronic conditions, there is an increasing need for efficient systems that can automatically capture and convert medical device readings into digital data, particularly in home-based care settings. However, most home-based medical devices are closed systems that do not support straightforward automatic data export and often require complex connections to access or transmit patient information. Since most of these devices display clinical information on a screen, this research explores how a standard smartphone camera, combined with artificial intelligence, can be used to automatically extract the displayed data in a simple and non-intrusive way. In particular, this study provides a comparative analysis of several You Only Look Once (YOLO) and Single Shot MultiBox Detector (SSD) models to evaluate their effectiveness in detecting and recognizing the readings on medical device displays. In addition to these comparisons, we also explore a hybrid approach that combines the YOLOv8l model for object detection with a Convolutional Neural Network (CNN) for classification. Several iterations of the aforementioned models were tested, using image resolutions of 320 × 320 and 640 × 640. The performance was assessed using metrics such as precision, recall, mean average precision at 0.5 Intersection over Union (mAP@50), and frames per second (FPS). The results show that YOLOv8l (640) achieved the highest mAP@50 of 0.979, but at a lower inference speed (13.20 FPS), while YOLOv8n (320) offered the fastest inference (129.79 FPS) with a reduction in mean average precision (0.786). Combining YOLOv8l with a CNN classifier resulted in a slight reduction in overall accuracy (0.96) when compared to the standalone model (0.98). While the results are promising, the study acknowledges certain limitations, including dataset-specific biases, controlled acquisition settings, and challenges in adapting to real-world scenarios. Nevertheless, the comparative analysis offers valuable insights into the trade-off between inference time and accuracy, helping guide the selection of the most suitable model based on the specific demands of the intended scanning application.

Keywords:

artificial intelligence; display digitization; object detection; single shot multibox detector (SSD); you only look once (YOLO)

1. Introduction

In recent years, the importance of daily health monitoring has grown significantly, driven by the increase in cases of patients with chronic conditions such as diabetes, hypertension, and cardiovascular diseases. Patients with this type of condition are often required to track physiological data, such as blood pressure, glucose levels, or oxygen saturation, using medical devices at home [1]. However, despite the widespread use of such devices, a challenge remains in efficiently collecting and sharing the recorded data with healthcare providers. Currently, many medical devices require manual data entry into personal health records or mobile applications, which can be considered a tedious process and prone to many errors [2,3]. This monitoring process can be optimized using intelligent systems that digitize and centralize the data present in these devices. Additionally, many medical measurement devices do not have built-in communication interfaces, such as Bluetooth or Wi-Fi, which prevents them from automatically sending data to external systems or applications. Even in cases where connectivity is available, data transmission is often restricted to proprietary applications, limiting interoperability and integration with broader healthcare platforms.

Almost all recommended medical devices use a screen to show measurement values. Within those devices that use a screen, the majority use a seven-segment display [4]. Despite their simple structure, seven-segment digits can be difficult to detect and recognize. Additionally, acronyms (which can be units of measurement or the category the values represent) typically are not in seven-segment format and can have multiple fonts, adding another layer of complexity to the recognition process.

The development of systems for automating the recognition of medical or industrial device displays has already been explored in previous works [1,3,4,5,6,7,8,9,10,11,12,13,14,15]. A diverse range of methodologies were employed in these studies, from traditional image-processing techniques to deep learning models. Regarding studies that focused on traditional processing methods, various approaches have been employed to identify seven-segment digits. Adaptive thresholding and binarization techniques have been applied to enhance contrast and isolate digit segments from the background, as demonstrated by the works [3,5]. Segmentation techniques have been widely used to extract individual digits from display screens. For instance, Tsiktsiris et al. [3] proposed an algorithm that combines adaptive thresholding, segmentation, and a predefined lookup table for digit classification. Feature extraction methods were also used to refine digit recognition by analyzing segment shapes and spatial positioning. Popayorm et al. [6] explored this approach using Hue, Saturation, Value (HSV)-based color slicing to improve segmentation accuracy. Finnegan et al. [4] proposed a hybrid approach that combined binarization with Maximally Stable Extremal Regions (MSER) to enhance digit segmentation in seven-segment displays. Additionally, their method incorporated machine learning, using a Multi-Layer Perceptron (MLP) classifier to improve recognition accuracy beyond that of traditional image-processing techniques.

As for studies that follow a deep learning approach [1,7,8,9,10,11,12,13,14,15], we can observe a prevalence in the use of models such as Convolutional Neural Networks and object detection frameworks. Lobo et al. [1] used the You Only Look Once (YOLOv3) and Single Shot MultiBox Detector (SSD) models for the detection and classification of digits and acronyms in medical devices. The study focused on three types of devices: glucometers, blood pressure monitors, and oximeters. Among the models tested, YOLOv3 demonstrated the most consistent performance. In the context of industrial devices, Suttapakti et al. [9] evaluated several object detection models for seven-segment display recognition and found that the Cascade Region-Based Convolutional Neural Network (Cascade R-CNN) achieved the best overall performance. Although its classification accuracy was slightly lower than YOLOv3 and CornerNet, it outperformed all other models in precision, recall, and F1-score, demonstrating potential for reliable digit detection.

From the analysis of related work, we can observe that traditional processing methods, although effective for specific and controlled environments, may have limitations in generalizing to different display conditions, such as conditions with noise and lighting variations. However, these methods remain efficient and suitable for constrained devices or controlled settings. Suttapakti et al. [9] conducted a direct comparison between deep learning models and the handcrafted approaches analyzed by Popayorm et al. [6], showing that deep learning achieved superior performance in terms of detection accuracy, although the study did not focus on computational cost or real-time applicability. In scenarios involving small datasets or highly controlled acquisition conditions, simpler algorithms can still perform adequately and often offer advantages in terms of computational efficiency and ease of deployment. Additionally, the assumption that artificial intelligence (AI) systems will perform reliably under diverse real-world conditions may overlook practical issues such as device variability, user operation errors, and inconsistent lighting environments. These challenges must be considered when evaluating the generalizability of AI-driven digitization approaches in home care settings.

This work is part of an initial analysis to support the development of a practical application for automatically digitizing measurements from medical device displays intended for use in home care and clinical environments. To support this goal, the present study focuses on identifying the most suitable detection models based on their accuracy and efficiency under controlled conditions, establishing a technical foundation for the forthcoming implementation phase.

The current research provides a comparative analysis of several YOLO [16] and SSD [17] models for detecting and classifying digits and acronyms on medical device displays. This study aims to provide insights into the advantages and limitations of each model, offering guidance for selecting the most suitable option based on the specific requirements of the intended application. The analysis also evaluates the computational efficiency of each model to determine their suitability for real-time deployment in resource-constrained environments.

The remainder of this document is organized as follows: Section 2 outlines the methodology, including the object detection models, datasets, and implementation details. Section 3 describes the experimental setup, covering the training process and evaluation metrics used to assess model performance. Section 4 presents the experimental results, followed by Section 5, which discusses the findings, key insights, limitations, and potential improvements. Finally, Section 6 concludes the study, summarizing the main contributions.

2. Methodology

2.1. Models Description

Object detection architectures can be broadly classified into single-stage and two-stage detectors, each presenting specific trade-offs regarding accuracy, speed, and computational efficiency. The selection of the appropriate architecture is highly dependent on the application requirements. Given the potential future deployment of this system on mobile devices or devices with limited resources, the comparative analysis focused on single-stage detectors, due to their lower computational requirements and ability to achieve real-time performance without significantly compromising detection accuracy.

Single-stage detectors such as YOLO [16] and the SSD [17] employ an end-to-end approach that directly predicts bounding boxes and class probabilities in a single network pass. These models are optimized for speed and are particularly suitable for real-time applications. However, due to the lack of a separate region refinement stage, they may exhibit a slight decrease in accuracy when detecting small or overlapping objects.

For this study, the comparative analysis included the following models: YOLOv3u, YOLOv5mu, YOLOv8n, YOLOv8m, and YOLOv8l from the YOLO family (Ultralytics repository) and the SSD MobileNet V2 Feature Pyramid Network Lite (FPNLite) from the TensorFlow Model Garden API. The models were evaluated at two distinct input image resolutions: 320 × 320 and 640 × 640.

In addition to the aforementioned models, a Convolutional Neural Network (CNN) was developed for the classification component. The architecture of this network was based on the work carried out by Lobo et al. [1], and was designed to classify the objects extracted from the bounding boxes predicted by the detection models.

2.2. YOLO (You Only Look Once)

YOLO approaches object detection as a single regression task, directly predicting bounding boxes and class probabilities from the input image in one forward pass. A convolutional backbone, typically CSPDarknet-inspired modules, is used to extract rich spatial features. These features are then processed by a feature aggregation neck, such as a Feature Pyramid Network (FPN) or Path Aggregation Network (PAN), which enhances multiscale detection by combining spatial and semantic information across different resolutions. The resulting feature maps are fed into multiple detection heads that simultaneously predict object classes, bounding box coordinates, and objectness scores. Recent iterations, such as YOLOv8, introduce anchor-free detection and partially decoupled heads for classification and regression, improving performance on small or overlapping objects while maintaining real-time inference capabilities.

Figure 1 illustrates the general architecture of the YOLOv8 model, highlighting a convolutional backbone based on Concatenate-to-Fuse (C2F) modules for feature extraction, a PAN-style neck for multiscale feature fusion, and detection heads operating at three different resolutions. In the following subsections, we provide detailed descriptions of the YOLO variants used in this comparative study. The information presented is based on the official documentation from the Ultralytics repository [18].

2.2.1. YOLOv3u

YOLOv3u is an advanced variant of YOLOv3, optimized by Ultralytics for improved accuracy and inference speed. It incorporates the anchor-free, objectness-free split head used in YOLOv8 models. While maintaining the original Darknet-53 backbone and neck architecture from YOLOv3, the aforementioned modification eliminates the need for predefined anchor boxes and objectness scores. The updated detection head improves the model’s ability to detect objects of varying sizes and shapes.

2.2.2. YOLOv5mu

YOLOv5mu is an optimized variant of the YOLOv5 series that combines the model’s signature modular design with advanced detection capabilities. Building on YOLOv5’s Cross Stage Partial (CSP) backbone architecture, which reduces computational load while maintaining accuracy, the medium (mu) variant introduces the anchor-free, objectness-free split head from YOLOv8 models. The medium variant specifically targets the balance between computational efficiency and detection performance, making it particularly suitable for resource-constrained environments.

2.2.3. YOLOv8 (n, m, l)

YOLOv8 represents one of the latest versions of the YOLO family, introducing state-of-the-art improvements in architecture and detection capabilities. Built on an optimized CSPDarknet backbone with enhanced neck architectures, the model incorporates several key innovations including anchor-free detection, deeper networks, and advanced training techniques. The anchor-free split head design eliminates the need for predefined bounding box anchors, enhancing adaptability to detect objects of varying sizes while maintaining computational efficiency. In this study, the variants YOLOv8n (nano), YOLOv8m (medium), and YOLOv8l (large) were used to evaluate the performance in recognizing digits and acronyms on medical device displays.

2.3. SSD MobileNet V2 FPNLite

The SSD framework [17], when combined with the MobileNetV2 backbone and the FPNLite extension, forms a computationally efficient object detection system optimized for resource-constrained environments.

MobileNet V2 acts as the backbone and employs depthwise separable convolutions along with inverted residual blocks to enable lightweight and effective feature extraction with minimal computational overhead. To enhance detection performance across distinct object scales, particularly for small objects, the SSD integrates FPNLite, a streamlined variant of the traditional Feature Pyramid Network (FPN). FPNLite enables hierarchical feature fusion, allowing the model to leverage multiscale features more effectively without a significant increase in complexity. The SSD detection head complements this architecture by performing multiscale predictions across feature maps at different resolutions. Each grid cell within these feature maps is responsible for detecting objects in specific receptive fields, using predefined default anchor boxes with varying aspect ratios and scales.

The general architecture of the SSD with MobileNetV2 and FPNLite is shown in Figure 2.

2.4. CNN Architecture

A Convolutional Neural Network (CNN) was developed for multiclass classification. It was designed for grayscale images of size 32 × 32 pixels with a single channel (32, 32, 1). The architecture was optimized for computational efficiency while incorporating regularization techniques to mitigate overfitting and improve generalization on limited grayscale image data.

The model architecture begins with an input layer configured to process grayscale images of dimensions 32 × 32 pixels followed by batch normalization to stabilize and accelerate training. It includes three convolutional blocks, each comprising two convolutional layers with ReLU activations and the same padding, followed by max-pooling and dropout layers to reduce spatial dimensions and mitigate overfitting. The number of filters increases progressively across the blocks (32, 64, 128) to capture increasingly complex features.

After the convolutional layers, the output is flattened and passed through a fully connected dense layer with 128 units, followed by an additional dropout layer. The final layer is a dense layer with 15 units, activated by a softmax function to produce class probabilities corresponding to the 15 output categories.

The general CNN architecture is illustrated in Figure 3.

2.5. Datasets

The dataset was obtained from [1]. The data include photographs of local medical devices, images compiled from the internet, particularly from manufacturers catalogs, and additional data sourced from a public dataset [4]. The medical devices represented in the dataset include oximeters, glucometers, and blood pressure monitors. A detailed breakdown of the number of images per device type and source is presented in Table 1, illustrating the composition and diversity of the dataset. From the full set of 1614 annotated images, 75% were used for training, 12.5% for validation, and 12.5% for testing. This split was applied to ensure balanced representation across device types and label classes.

For each image, manual labeling was performed using 15 classes, comprising the ten digits (0 to 9) and five acronyms representing medical terms: “sys” for systolic blood pressure, “dia” for diastolic blood pressure, “m” for glycemic level, “p” for pulse, and “spO2” for oxygen saturation.

A second dataset was created from the original. This dataset consisted of cropped images of the bounding boxes annotated in the original images, along with their corresponding labels. The purpose of this dataset was to train the CNN to perform the classification task specifically, to compare the performance of standalone models with the combined approach of bounding box regression and CNN classification.

To illustrate some of the variability present in the dataset, Figure 4 presents a selection of sample images. These samples reflect a range of characteristics, including inconsistent lighting, glare, reflections, and varying fonts.

2.6. Implementation Details

Before our analysis, some preprocessing steps were carried out, including the resizing and cropping of all images to a dimension of 832 × 832 pixels and converting them to a common file format.

In our implementation, considering the applied methodology, a second resizing was applied to convert the image size to the defined input sizes, namely, 320 × 320 and 640 × 640 pixels. Additionally, data augmentation techniques, such as horizontal flips and random crops, were applied to increase the diversity of the training data.

Image normalization was handled internally within the model architectures during training, both in YOLO (Ultralytics) and the SSD (via the TensorFlow Model Garden configuration). This process scaled the pixel values from their original range of [0, 255] to a range of [0, 1] for YOLO and [−1, 1] for the SSD.

Regarding the CNN, additional preprocessing steps were applied to the cropped images. These steps included resizing the images to a dimension of 32 × 32 pixels, converting them to grayscale, and normalizing the pixel values to a range of [0, 1]. Data augmentation techniques were also applied to improve the generalization ability of the model.

3. Experiments

3.1. Training Setup

All experiments were conducted on a computational platform equipped with an NVIDIA GeForce GTX 1650 graphics card (4 GB) (Santa Clara, CA, USA), an Intel Core i5-9300HF (Santa Clara, CA, USA), and 8 GB of RAM, operating with the NVIDIA CUDA Toolkit version 12.1.

The implementation utilized Ultralytics YOLO, TensorFlow Model Garden, and PyTorch for model development and execution. Supporting tools included OpenCV and Pillow for image processing, NumPy and Pandas for data handling, and Matplotlib for data visualization.

The YOLO models were trained for approximately 30,000 steps with the AdamW optimizer using the Complete Intersection over Union (CIoU) as the loss function for bounding box regression and Binary Cross-Entropy (BCE) for both objectness (confidence) and classification. The SSD models were trained for 56,000 steps with the Momentum Stochastic Gradient Descent (SGD) optimizer using Smooth L1 Loss for bounding box regression and Sigmoid Focal Loss for classification.

Regarding the Convolutional Neural Network (CNN), the model was trained over 150 epochs using the Adam optimizer and categorical cross-entropy.

3.2. Evaluation Metrics

The object detection models were evaluated using metrics such as precision, recall, and mean average precision (mAP) at 0.5 Intersection over Union (IoU). For overall performance, we also measured frames per second (FPS) to assess inference speed.

In the class-wise comparison, precision, recall, and average precision (AP) at 0.5 IoU were reported for each class to highlight the model’s ability to detect specific categories.

Regarding the evaluation of the classification component, comparing the standalone models with the combined approach of bounding box regression and CNN classification, the metrics used were the standard classification report metrics.

4. Results

To assess how the model’s performance varies with resolution, two distinct input dimensions were evaluated, 320x320 and 640x640. As shown in Table 2, models using a higher resolution achieved better detection accuracy. The YOLOv8l (640) model achieved the highest mAP@50, reaching 0.979. Regarding inference speed, YOLOv8n (320) achieved the fastest performance, with a frame rate of 129.79 FPS.

In the class-wise performance analysis, the YOLOv8l (640) and SSD MobileNet V2 FPNLite (640) models, which achieved the highest mAP@50 scores within their respective architectures, were compared. It should be noted that this comparison selection was based solely on mAP@50 and does not imply these are the best models overall, as the optimal choice may vary depending on specific use-case requirements.

The comparison, shown in Table 3, reports precision, recall, and AP@50 metrics for each class. The classes include the digits 0-9, as well as medical acronyms such as “sys”, “dia”, “m”, “p”, and “spO2”. The YOLOv8l (640) demonstrated superior performance across all classes, with particularly high metrics for class 4, achieving a precision of 0.995 and a recall of 0.975. While the SSD MobileNet V2 FPNLite (640) model exhibited lower precision and recall values, it performed well in class 4 with 0.902 precision and 0.933 recall. Overall, the YOLOv8l (640) model demonstrated higher class-wise performance when compared to the SSD MobileNet V2 FPNLite (640).

Table 4 presents a comparative analysis of classification metrics between the YOLOv8l (640) standalone model and the YOLOv8l (640) combined with a CNN classifier. It is essential to emphasize that the classification metrics were derived exclusively from matched bounding boxes, where correspondence between predicted and ground truth boxes was established using an Intersection over Union threshold of 0.5. Only these matched boxes were included in the metrics calculation, while unmatched predictions and ground truths were excluded from the evaluation. This approach focused on the classification performance of the boxes that were successfully matched rather than the overall object detection performance. For the standalone YOLOv8l (640) model, an accuracy of 0.98 was achieved, with high precision, recall, and F1-scores across most classes.

In comparison, the YOLOv8l model combined with the CNN classifier achieved a slightly lower overall accuracy of 0.96. While most classes maintained high performance metrics, some classes, such as class 5 and class 2, showed lower values of precision, recall, and F1-scores.

Regarding the confusion matrices between the YOLOv8l standalone model and the YOLOv8l with the CNN, shown in Figure 5, we can once again observe that the performance of the standalone model was slightly superior to that of the model with the CNN incorporated.

Figure 6 presents the precision–recall (PR) curve for the YOLOv8l standalone model. The PR curve illustrates the relationship between precision and recall across different classification thresholds. We can observe that all classes have good performance across various threshold levels. The curves for each class are positioned in the upper-right corner, indicating that the model achieves a good balance of precision and recall for all the analyzed classes.

Finally, Figure 7 provides a qualitative analysis showcasing some of the results obtained with the YOLOv8l (640) standalone model.

5. Discussion

The evaluation of YOLO and SSD models in this study provided key insights into their strengths and weaknesses in detecting and recognizing digits and acronyms on medical device displays.

Overall, the results demonstrate that the YOLO models, particularly YOLOv8l (640), exhibited superior performance in terms of mAP@50, precision, and recall when compared to the SSD MobileNet V2 FPNLite models. YOLOv8l (640) achieved the highest mAP@50 of 0.979, and strong class-wise performance for both digits and medical acronyms. The SSD (640) model performed relatively well for certain classes (e.g., class 4 with 0.902 precision and 0.933 recall), but it showed poorer performance compared to the YOLO models for other classes.

Analysis of the experimental results demonstrated a fundamental trade-off between detection accuracy and computational efficiency. While higher resolution (640 × 640) achieved the highest detection accuracy, it also came at the expense of higher inference time. For instance, YOLOv8n (320) offered the fastest performance at 129.79 FPS, making it a suitable option for real-time applications, but this was achieved at the cost of a lower mAP (0.786). Conversely, YOLOv8l (640), which achieved the highest mAP (0.979), had a much lower inference speed of 13.20 FPS, making it less suitable for real-time use cases where inference time is critical. This highlights the need to balance inference speed and detection performance depending on the application requirements. For real-time patient-monitoring systems where inference speed is crucial, a lighter model such as YOLOv8n may be more appropriate. However, in situations where detection accuracy is the priority, YOLOv8l (640) might be more suitable. To the extent of our knowledge, similar works did not report inference speed, computational efficiency, or frames per second (FPS), focusing solely on standard evaluation metrics for object detection models.

The comparison between the standalone YOLOv8l (640) model and the YOLOv8l with the CNN classifier provided additional insight into the classification component performance. The standalone model achieved a higher performance with an accuracy of 0.98. However, upon integrating the CNN classifier, a slight reduction in accuracy to 0.96 was observed, with some classes, such as class 5 and class 2, showing a noticeable drop in all metrics evaluated. For the other classes, the performance remained high or even improved, suggesting that integrating the CNN for classification may be beneficial in certain cases. Nevertheless, further optimization is required to avoid performance loss in specific classes. The analysis of the confusion matrices for both configurations (Figure 5) reinforces these findings. The YOLOv8l standalone model demonstrated fewer misclassifications across almost all classes. In contrast, the combined YOLOv8l-CNN approach demonstrated increased confusion between classes, particularly for the visually similar digits ‘2’ and ‘5’. This may be due to the classifier’s sensitivity to subtle variations in seven-segment digit representations, as these digits can appear visually similar depending on display orientation.

The precision–recall (PR) curve for the YOLOv8l (640) model, shown in Figure 6, demonstrates a consistently strong performance across all classes. The curves cluster tightly in the upper-right quadrant, reflecting that the model maintains high precision and recall under different threshold levels. The global mAP@50 across all classes reaches 0.981, confirming the model’s performance consistency and generalization capacity. In addition to quantitative metrics, a qualitative assessment was conducted to visually inspect some of the predictions made by the YOLOv8l (640) model. As shown in Figure 7, the model accurately detects digits and acronyms across various medical devices and conditions. It handles differences in lighting, reflections, fonts, and orientations, demonstrating strong generalization.

Nevertheless, some performance bottlenecks were found. The study identified that visually similar digits, such as ’2’ and ’5’, present recognition challenges, particularly in the CNN-based classification, where minor variations in seven-segment structures led to increased misclassification. This suggests a limitation in the model’s ability to differentiate fine-grained features in small objects. Future improvements should focus on handling ambiguous characters more effectively and improving model robustness through a more realistic data augmentation process.

Overall, and looking at all results, this study suggests that YOLOv8l (640) achieved the highest detection accuracy under the given experimental conditions. However, it is important to note that these results were obtained using the specific dataset analyzed in this study, which reflects some controlled conditions and may not fully represent the variability encountered in real-world environments. Additionally, while YOLOv8l performs well in terms of precision and mAP@50, its inference speed is considerably slower than that of lighter models such as YOLOv8n. Therefore, its suitability may be limited in applications where real-time processing is a critical requirement. The comparative analysis highlights the trade-offs between accuracy and computational efficiency, indicating that model selection should be scenario dependent rather than based solely on peak detection performance. This study provides a foundation for the final smartphone application being researched in this project. Based on the findings, the most suitable model will be used in the next phase, which will include deployment and evaluation in real-world scenarios.

Building on these findings, future research will focus on developing models capable of capturing contextual relationships between the detected elements. Future models will not only detect and classify individual digits and acronyms but also integrate contextual awareness to directly associate these values with their corresponding medical parameters (e.g., systolic or diastolic blood pressure), reducing the need for extensive post-processing.

An additional consideration for future deployment is the integration of ethical safeguards, which was beyond the scope of this technical evaluation. As highlighted by Thurzo [19], the increasing autonomy of AI systems in healthcare, especially when it operates with partial autonomy in diagnostic or monitoring tasks, introduces governance challenges that require transparency, accountability, and human oversight, which are not addressed by performance metrics alone. In line with these concerns, the concept of a “Trustworthy Ethical Firewall”, introduced by Thurzo [20], proposes a layered architecture embedding mathematically provable ethical constraints into the AI’s decision-making core. This architecture ensures transparent, auditable decision-making, and incorporates escalation protocols to trigger human oversight in ambiguous or high-risk scenarios. Integrating such frameworks into medical AI systems could mitigate risks and promote trustworthiness, especially when AI agents are used for patient monitoring or diagnosis. Addressing these ethical aspects will ensure the safe and reliable deployment of the proposed future application.

6. Conclusions

This study conducted a comparative evaluation of several YOLO and SSD-based object detection models for the task of digit and acronym recognition on medical device displays. The results demonstrated that YOLOv8l (640) achieved the highest accuracy with an mAP@50 of 0.979, while lighter models such as YOLOv8n (320) obtained faster inference times (129.79 FPS) at the expense of accuracy (0.78). These findings underscore the importance of balancing performance metrics such as accuracy, speed, and model complexity. Rather than declaring a single optimal model, the results support that the model selection should be based on the operational constraints and priorities of specific deployment scenarios.

Furthermore, the comparison carried out in the classification component between the standalone YOLOv8l (640) and the hybrid approach that combines YOLOv8l (640) with a CNN-based classifier demonstrated that, despite a slight reduction in overall accuracy, certain classes benefited from the addition of the CNN.

Future work should extend this evaluation to include a more diverse and noisy dataset, incorporating real-world variability such as inconsistent lighting, device display differences, and user-related factors (e.g., angle, blur, or obstruction), and should also include a more comprehensive analysis of the trade-offs between multiple performance indicators. Additionally, research should explore lightweight and efficient models suitable for deployment on mobile or embedded hardware, where inference speed, memory usage, and power consumption are critical constraints.

Author Contributions

Conceptualization, P.F. and P.M.; methodology, P.F. and P.M.; software, P.F.; investigation, P.F. and P.M.; resources, J.L.V.; writing—original draft preparation, P.F.; writing—review and editing, J.L.V., P.L., P.M. and F.R.; visualization, J.L.V., P.L., P.M. and F.R.; supervision, J.L.V. and P.M.; project administration, J.L.V. and P.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Innovation Pact HfPT–Health from Portugal, co-funded by the “Mobilizing Agendas for Business Innovation” of the “Next Generation EU” program of Component 5 of the Recovery and Resilience Plan (RRP), concerning “Capitalization and Business Innovation”, under the Regulation of the Incentive System “Agendas for Business Innovation”. This work was also financed by national funds, through the FCT—Foundation for Science and Technology and FCT/MCTES under the projects UIDB/05549:2Ai, UIDP/05549:2Ai, LASI-LA/P/0104/2020, and CEECINST/00039/2021.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

Lobo, P.; Vilaça, J.L.; Torres, H.; Oliveira, B.; Simões, A. Smart scan of medical device displays to integrate with a mHealth application. Heliyon 2023, 9, e16297. [Google Scholar] [CrossRef] [PubMed]
Chen, K.Y.; Chen, F.G.; Hou, T.W. A low-cost reader for automatically collecting vital signs in hospitals. J. Med. Syst. 2012, 36, 2599–2607. [Google Scholar] [CrossRef] [PubMed]
Tsiktsiris, D.; Kechagias, K.; Dasygenis, M.; Angelidis, P. Accelerated seven segment optical character recognition algorithm. In Proceedings of the 2019 Panhellenic Conference on Electronics & Telecommunications (PACET), Volos, Greece, 8–9 November 2019; pp. 1–5. [Google Scholar]
Finnegan, E.; Villarroel, M.; Velardo, C.; Tarassenko, L. Automated method for detecting and reading seven-segment digits from images of blood glucose metres and blood pressure monitors. J. Med. Eng. Technol. 2019, 43, 341–355. [Google Scholar] [CrossRef] [PubMed]
Shenoy, V.N.; Aalami, O.O. Utilizing smartphone-based machine learning in medical monitor data collection: Seven segment digit recognition. arXiv 2018, arXiv:1807.04888. [Google Scholar]
Popayorm, S.; Titijaroonroj, T.; Phoka, T.; Massagram, W. Seven segment display detection and recognition using predefined HSV color slicing technique. In Proceedings of the 2019 16th International Joint Conference on Computer Science and Software Engineering (JCSSE), Chonburi, Thailand, 10–12 July 2019; pp. 224–229. [Google Scholar]
Moreira, L.P. Automated Medical Device Display Reading Using Deep Learning Object Detection. arXiv 2022, arXiv:2210.01325. [Google Scholar]
Rusyn, B.; Lutsyk, O.; Kosarevych, R.; Varetsky, Y. Automated Recognition of Numeric Display Based on Deep Learning. In Proceedings of the 2019 3rd International Conference on Advanced Information and Communications Technologies (AICT), Lviv, Ukraine, 2–6 July 2019; pp. 244–247. [Google Scholar]
Suttapakti, U.; Titijaroonroj, T.; Nunsong, W.; Kakanopas, D. Seven Segment Display Detection and Recognition via Deep Learning Technique. In Proceedings of the 2022 19th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), Prachuap Khiri Khan, Thailand, 24–27 May 2022; pp. 1–4. [Google Scholar]
Prakruthi, M.; Vismaya, K.; Suhas, S.; Manoj, K.; Pai, N.; Satwik, P.N. Application of convolutional neural networks in mobile devices for inferring readings from medical apparatus. Int. J. Sci. Res. 2017, 4, 2321–2705. [Google Scholar]
Low, L.M.; Salleh, F.H.M.; Law, Y.F.; Zakaria, N.Z. Detecting and recognizing seven segment digits using a deep learning approach. In Proceedings of the 1st International Conference on Advances in Machine Intelligence, and Cybersecurity Technologies (AMICT2023), Sabah, Malaysia, 11–12 December 2023; ITM Web of Conferences. EDP Sciences: Paris, France, 2024; Volume 63, p. 01007. [Google Scholar]
Bonilla-Rivas, F.; Cárdenas-Angelat, C.; Garrido-Díaz, A.; Aguayo-Torres, M.C. Adapting YOLO to recognition of real numbers in 7-segment digits. In Proceedings of the 2023 International Conference on Intelligent Computing, Communication, Networking and Services (ICCNS), Valencia, Spain, 19–22 June 2023; pp. 19–25. [Google Scholar]
Imran, M.; Anwar, H.; Tufail, M.; Khan, A.; Khan, M.; Ramli, D.A. Image-based automatic energy meter reading using deep learning. Comput. Mater. Contin. 2023, 74, 203–216. [Google Scholar] [CrossRef]
Haseeb, H.; Hassan, M.T.; Iftikhar, A.; Asmat, A. Automated Detection and Recognition of Seven-Segment Digits from Electric Meters Utilizing Digital Image Processing and Machine Learning. VFAST Trans. Softw. Eng. 2024, 12, 87–98. [Google Scholar] [CrossRef]
Gushima, K.; Kashima, T. Automatic generation of seven-segment display image for machine-learning-based digital meter reading. In Proceedings of the PHM Society Asia-Pacific Conference, Tokyo, Japan, 11–14 September 2023; Volume 4. [Google Scholar]
Redmon, J. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016. [Google Scholar]
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
Ultralytics. YOLOv8 Models—Ultralytics Documentation. 2025. Available online: https://docs.ultralytics.com/models/ (accessed on 27 March 2025).
Thurzo, A. How is AI Transforming Medical Research, Education and Practice? Bratisl. Med. J. 2025, 126, 243–248. [Google Scholar] [CrossRef]
Thurzo, A. Provable AI Ethics and Explainability in Medical and Educational AI Agents: Trustworthy Ethical Firewall. Electronics 2025, 14, 1294. [Google Scholar] [CrossRef]

Figure 1. General high-level architecture of YOLOv8.

Figure 2. High-level SSD MobileNet V2 FPNLite architecture.

Figure 3. CNN architecture for 15-class classification.

Figure 4. Samples from the dataset demonstrating variability in lighting, device type, font styles, and display conditions.

Figure 5. Confusion matrices for YOLOv8l (640): (a) standalone; (b) with CNN incorporated.

Figure 6. Precision–recall (PR) curve for YOLOv8l (640) standalone model.

Figure 7. Sample results from the YOLOv8l (640) standalone model.

Table 1. Dataset composition by device type, showing number of devices and images per source.

Device Type	Catalogs		Local Photos		Public Dataset		Total Images
Device Type	Devices	Images	Devices	Images	Devices	Images	Total Images
Oximeter	∼23	253	1	68	0	0	321
Glucometer	∼39	233	0	0	3	254	487
Blood Pressure Monitor	∼57	307	1	111	3	388	806
Total	169	793	2	179	6	642	1614

Table 2. Performance comparison of YOLO models and SSD MobileNet V2 FPNLite (320 and 640 input resolutions).

Model	Resolution	Precision	Recall	mAP@50	FPS
YOLOv3u	320	0.946	0.820	0.910	30.05
	640	0.971	0.915	0.971	9.00
YOLOv5mu	320	0.922	0.822	0.902	83.68
	640	0.961	0.926	0.971	27.38
YOLOv8n	320	0.879	0.675	0.786	129.79
	640	0.939	0.859	0.935	54.83
YOLOv8m	320	0.925	0.833	0.908	77.28
	640	0.957	0.931	0.972	24.34
YOLOv8l	320	0.950	0.842	0.923	43.25
	640	0.971	0.928	0.979	13.20
SSD	320	0.669	0.631	0.638	16.53
	640	0.812	0.835	0.838	7.84

Table 3. Class-wise performance comparison of YOLOv8l (640) and SSD MobileNet V2 FPNLite (640).

Class	YOLOv8l (640)			SSD (640)
Class	Precision	Recall	AP@50	Precision	Recall	AP@50
0	0.986	0.908	0.985	0.829	0.797	0.767
1	0.992	0.915	0.971	0.843	0.852	0.825
2	0.952	0.903	0.973	0.779	0.740	0.777
3	0.976	0.916	0.965	0.754	0.781	0.815
4	0.995	0.975	0.992	0.902	0.933	0.946
5	0.978	0.914	0.978	0.685	0.831	0.858
6	0.979	0.932	0.976	0.865	0.782	0.830
7	0.987	0.937	0.990	0.842	0.869	0.874
8	0.942	0.911	0.970	0.757	0.765	0.778
9	0.975	0.935	0.992	0.812	0.847	0.886
sys	0.991	0.950	0.992	0.893	0.877	0.858
dia	0.991	0.930	0.993	0.792	0.884	0.872
m	0.975	0.886	0.968	0.818	0.878	0.867
p	0.968	0.952	0.976	0.884	0.820	0.803
spO2	0.886	0.951	0.961	0.733	0.868	0.811

Table 4. Comparison of classification metrics: YOLOv8l (640) standalone vs. YOLOv8l + CNN.

Class	Standalone			With CNN
Class	Precision	Recall	F1	Precision	Recall	F1
0	0.98	0.96	0.97	0.97	0.99	0.98
1	0.99	1.00	1.00	0.99	0.98	0.99
2	0.95	0.96	0.96	0.89	0.81	0.85
3	0.96	0.96	0.96	0.99	0.97	0.98
4	1.00	1.00	1.00	0.98	1.00	0.99
5	0.98	0.94	0.96	0.76	0.81	0.78
6	1.00	0.98	0.99	0.93	0.99	0.96
7	1.00	0.98	0.99	0.96	0.97	0.97
8	0.93	0.99	0.96	0.99	0.98	0.98
9	0.98	0.97	0.97	0.99	0.98	0.98
sys	1.00	0.99	1.00	1.00	1.00	1.00
dia	1.00	1.00	1.00	1.00	1.00	1.00
m	0.99	1.00	1.00	0.98	0.97	0.97
p	1.00	1.00	1.00	0.95	0.92	0.94
spO2	1.00	1.00	1.00	0.99	0.99	0.99
Accuracy		0.98			0.96
Macro avg	0.98	0.98	0.98	0.96	0.96	0.96
Weighted avg	0.98	0.98	0.98	0.96	0.96	0.96

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Ferreira, P.; Lobo, P.; Reis, F.; Vilaça, J.L.; Morais, P. Digitization of Medical Device Displays Using Deep Learning Models: A Comparative Study. Appl. Sci. 2025, 15, 5436. https://doi.org/10.3390/app15105436

AMA Style

Ferreira P, Lobo P, Reis F, Vilaça JL, Morais P. Digitization of Medical Device Displays Using Deep Learning Models: A Comparative Study. Applied Sciences. 2025; 15(10):5436. https://doi.org/10.3390/app15105436

Chicago/Turabian Style

Ferreira, Pedro, Pedro Lobo, Filipa Reis, João L. Vilaça, and Pedro Morais. 2025. "Digitization of Medical Device Displays Using Deep Learning Models: A Comparative Study" Applied Sciences 15, no. 10: 5436. https://doi.org/10.3390/app15105436

APA Style

Ferreira, P., Lobo, P., Reis, F., Vilaça, J. L., & Morais, P. (2025). Digitization of Medical Device Displays Using Deep Learning Models: A Comparative Study. Applied Sciences, 15(10), 5436. https://doi.org/10.3390/app15105436

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Digitization of Medical Device Displays Using Deep Learning Models: A Comparative Study

Abstract

1. Introduction

2. Methodology

2.1. Models Description

2.2. YOLO (You Only Look Once)

2.2.1. YOLOv3u

2.2.2. YOLOv5mu

2.2.3. YOLOv8 (n, m, l)

2.3. SSD MobileNet V2 FPNLite

2.4. CNN Architecture

2.5. Datasets

2.6. Implementation Details

3. Experiments

3.1. Training Setup

3.2. Evaluation Metrics

4. Results

5. Discussion

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI