3.1. Dataset Preparation
The data preparation process consisted of data collection, data labeling, dataset splitting, and data augmentation stages. The data used in this study are images of blood pressure monitor measurement results that were collected through a combination of public datasets and image-scraping algorithms. Details about the sources of these datasets can be found in
Table 1. Data sorting was performed to eliminate irrelevant or poor-quality data. Once the data was filtered, the labeling process was conducted using Roboflow [
11] by applying bounding boxes to the input images, covering a total of 11 classes: 10 digit classes (0 to 9) and 1 class marking a value region, i.e., the area displaying systolic pressure, diastolic pressure, or pulse rate.
Figure 1 displays the class distribution of the whole dataset. Class 10 has the highest occurrence, as each image typically contains three objects of Class 10, representing the three blood pressure metrics: systolic, diastolic, and pulse rate. Class 1 has the second-highest occurrence because systolic pressure values are generally in the hundreds. The relatively high frequencies of Classes 7 and 8 can be attributed to diastolic pressure and pulse-rate values, which commonly fall within the 70s to 80s range.
The labeled data was then divided into three sets: training, validation, and testing, with proportions of 70%, 20%, and 10%, respectively. To ensure the robustness of the trained model, the validation and test sets were carefully curated to be representative of the overall dataset in terms of image quality and real-world variability. Images exhibiting variations in blur level, lighting conditions, and other artifacts commonly found in real-world settings were deliberately included in all sets. The final stage was data augmentation, which aimed to double the volume and variety of the training data and thereby improve the model’s accuracy and flexibility under real-world conditions. The augmentation process included rotation, shearing, brightness adjustment, blurring of up to 3.5 pixels, and the addition of noise to up to 5% of the total pixels. The dataset splits before and after augmentation can be seen in
Table 2.
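The augmentation tooling is not specified above; as a minimal illustrative sketch, a comparable pipeline could be assembled with the albumentations library, where the rotation, shear, and brightness ranges are placeholders rather than the study’s exact settings:
```python
import albumentations as A

# Illustrative augmentation pipeline mirroring the transformations listed
# above; the numeric ranges marked "placeholder" are assumptions.
augment = A.Compose(
    [
        A.Rotate(limit=15, p=0.5),                 # placeholder rotation range
        A.Affine(shear=(-10, 10), p=0.5),          # placeholder shear range
        A.RandomBrightnessContrast(
            brightness_limit=0.25,                 # placeholder brightness range
            contrast_limit=0.0,
            p=0.5,
        ),
        A.Blur(blur_limit=3, p=0.5),               # ~3 px kernel, approximating the up-to-3.5 px blur
        A.PixelDropout(dropout_prob=0.05, p=0.5),  # noise on up to 5% of pixels
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage: augmented = augment(image=img, bboxes=boxes, class_labels=labels)
```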
To further assess the model’s robustness in real-world scenarios, we also performed an additional validation using 10 images captured directly from a physical digital blood pressure monitor. These images were collected independently of the original dataset and were not part of the training, validation, or test sets. They reflect authentic variations in image quality, including lighting conditions, viewing angles, and device display differences that may occur during practical usage.
3.2. Model Development
We employed a deep learning model with the YOLOv8 architecture [
14]. The YOLO (You Only Look Once) architecture was selected due to its typically superior performance in AP (average precision) and inference time compared to other object detection models [
15]. YOLO has been in development since 2015 as an open-source algorithm, thereby providing accessibility to researchers and developers globally. Its popularity has led to a large community, resulting in thorough documentation, easier troubleshooting, and knowledge sharing among users [
16]. YOLOv8, the eighth version of the YOLO architecture, exhibits enhancements in mAP over its predecessor [
17].
YOLOv8 has five variants: nano, small, medium, large, and extra-large. A comparison of these five variants can be found in
Table 3. As model size increases, network complexity and accuracy increase, but inference time grows longer. In this study, the training process was conducted only for the small, medium, and large variants. The nano variant was not selected due to its lower accuracy relative to the other variants, while the extra-large variant was excluded because its complexity and size make it unsuitable for smartphones with limited computational capabilities.
The three variants were trained using the following parameters: epochs = 500, patience = 50, batch size = {8, 16, 32}, image size = 640, IoU = 0.5, optimizer = auto. The automatic optimizer option resulted in the selection of SGD with a learning rate of 0.01 and a momentum of 0.9. All parameters not mentioned here used the default configuration.
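As an illustrative sketch (not the authors’ exact script), a training run with these parameters could be launched through the Ultralytics Python API; the dataset configuration path data.yaml is a placeholder:
```python
from ultralytics import YOLO

# Load a pretrained YOLOv8-small checkpoint; repeat with "yolov8m.pt"
# and "yolov8l.pt" for the medium and large variants.
model = YOLO("yolov8s.pt")

# Train with the parameters reported above; "data.yaml" is a hypothetical
# path to the Roboflow-exported dataset definition.
model.train(
    data="data.yaml",
    epochs=500,
    patience=50,
    batch=16,          # also run with 8 and 32
    imgsz=640,
    iou=0.5,
    optimizer="auto",  # resolved to SGD (lr = 0.01, momentum = 0.9) in this study
)
```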
The model was developed using several Python (v3.10.12) libraries, such as OpenCV (v4.8.0.76), PyTorch (v2.3.0) [
19], TensorFlow (v2.13.0) [
20], TensorFlow Lite (v2.13.0) [
21], etc. The Google Colab [
22] environment was used to develop the model, while a laptop was used to develop the mHealth application. Smartphones, in turn, served as the platform to deploy and run the mHealth application. In this study, two smartphones were used to evaluate performance across different hardware capabilities: a Xiaomi Redmi 9A, representing a low-end device, and a Samsung Galaxy A30S, representing a mid-range device. For clarity in the subsequent sections, these are referred to as Smartphone 1 (Xiaomi Redmi 9A) and Smartphone 2 (Samsung Galaxy A30S). The specifications of the working environments used in this work are shown in
Table 4.
In addition to the YOLOv8 model, which is the focus of this study, two other object detection models were selected for comparative analysis: YOLOv11 and RT-DETR. YOLOv11 represents the latest advancement in the YOLO family, introducing architectural improvements that enhance both accuracy and inference speed through adaptive re-parameterization and improved feature aggregation [
23]. Although our study primarily focuses on YOLOv8, YOLOv11 was included to provide insight into the ongoing evolution and potential performance ceiling of the YOLO framework. Meanwhile, RT-DETR (Real-Time DEtection TRansformer) is a recent transformer-based detector designed for real-time performance, incorporating a one-stage pipeline and query-based object decoding that eliminates the need for non-maximum suppression [
24]. Its inclusion offers a perspective on how convolutional models like YOLO compare with transformer-based architectures in the specific task of digit recognition from medical device displays, especially those that use seven-segment digit representation.
3.3. Model Compression
Model compression employed quantization to increase detection speed and reduce the computational load of the model, given its deployment on smartphones, which typically have lower computational capabilities than computers. Quantization was applied to convert the model weights to 16-bit floating-point (FP16) and 8-bit integer (INT8) precision. The FP16 quantization was performed with the export feature of the Ultralytics YOLOv8 library by setting the format parameter to ‘tflite’ and enabling the ‘optimize’ and ‘half’ options. The ‘optimize’ parameter optimizes the model for mobile devices, while the ‘half’ parameter applies FP16 half-precision quantization. In addition to the FP16 model, the export also produced a 32-bit floating-point (FP32) TensorFlow model, which was subsequently quantized to INT8 precision using the TensorFlow library.
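A minimal sketch of this export step follows; the ‘optimize’ flag is omitted here because its support alongside the TFLite format depends on the Ultralytics version:
```python
from ultralytics import YOLO

# Load the trained weights; "best.pt" is a placeholder for the checkpoint path.
model = YOLO("best.pt")

# Export to TensorFlow Lite with FP16 half-precision quantization.
# Besides the FP16 .tflite file, this step also writes an FP32 TensorFlow
# SavedModel directory, which is reused below for INT8 conversion.
model.export(format="tflite", half=True)
```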
During INT8 quantization, a calibration process is necessary to estimate the range of all floating-point values in the model. Constant values such as weights and biases are easily calibrated, but variable values such as inputs, activations, and outputs require several inference passes to be calibrated. Consequently, a dataset of approximately 100 to 500 samples is needed for calibration. In this study, the validation dataset of 430 images served as the calibration dataset. These samples were carefully selected to be representative in terms of class diversity, image quality, and real-world variability, covering a broad range of conditions, including different lighting levels, degrees of focus, and background complexities typically encountered in practical mHealth environments. The representativeness of these calibration samples helps maintain the robustness and accuracy of the quantized YOLOv8-small INT8 model despite the precision reduction inherent to quantization.
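A sketch of the INT8 conversion with the TensorFlow Lite converter, assuming the FP32 SavedModel produced by the export step; the directory name, image paths, and preprocessing are illustrative:
```python
import glob

import cv2
import numpy as np
import tensorflow as tf

def representative_data_gen():
    # Yield calibration samples so the converter can estimate the ranges of
    # inputs, activations, and outputs; here, the 430 validation images.
    for path in glob.glob("valid/images/*.jpg"):  # hypothetical location
        img = cv2.imread(path)
        img = cv2.resize(img, (640, 640))
        img = img[..., ::-1].astype(np.float32) / 255.0  # BGR -> RGB, normalize
        yield [img[np.newaxis, ...]]

converter = tf.lite.TFLiteConverter.from_saved_model("best_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

with open("best_int8.tflite", "wb") as f:
    f.write(converter.convert())
```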
3.4. Development of the mHealth Application
The mHealth application features three main functionalities: capturing blood pressure readings using the camera, recording historical blood pressure data, and visualizing this data in graphical form. Additionally, there is an authentication feature that binds these data to a user account. The deep learning model developed in the previous stage is integrated into the mHealth application for the blood pressure reading feature.
This mHealth application encompasses a total of 9 use cases, as depicted in
Figure 2. The first three use cases (login, register, and logout) are part of the authentication feature. The blood pressure data visualization feature is represented by the display-measurement-history use case, through which users can view their blood pressure measurement history as a line chart and a list of measurement logs. Users record new blood pressure measurements through the add-new-measurement use case, either by directly capturing measurement results with the camera or by uploading photos of the results from the smartphone’s image gallery. Finally, user management features are accessed through the remaining four use cases: view profile, add new user, delete user, and edit user information.
The workflow of the blood pressure reading feature is illustrated in
Figure 3. Generally, this process is divided into two main parts, digit detection and digit grouping, to derive blood pressure values. In the digit detection process, a deep learning model is utilized to detect and classify the digits present in the photo. Subsequently, in the grouping process, these digits are grouped and categorized into systolic pressure, diastolic pressure, and pulse-rate values. Then, the resulting data is stored in a database for use by the visualization feature. Data storage is implemented using a non-relational database, Cloud Firestore. This choice ensures that the data is not just stored locally, allowing users to access it even if their local data is deleted. Additionally, Cloud Firestore supports caching, eliminating the need for constant internet access to perform data reading and writing operations. The data on the user’s phone is automatically synchronized with the data stored in the cloud once the phone is connected to the internet. This design allows it to operate both online and offline, making it suitable for clinical settings where network connectivity may be limited.
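On the device, this detection step runs through the TensorFlow Lite interpreter; the following Python sketch shows the equivalent inference call for the INT8 model from the compression stage (the file name and input shape are assumptions):
```python
import numpy as np
import tensorflow as tf

# Load the quantized model produced during model compression.
interpreter = tf.lite.Interpreter(model_path="best_int8.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def detect_digits(image: np.ndarray) -> np.ndarray:
    """Run one inference pass; image is a float32 array of shape (1, 640, 640, 3)."""
    if inp["dtype"] in (np.int8, np.uint8):
        # Quantize the input if the model expects integer tensors.
        scale, zero_point = inp["quantization"]
        image = (image / scale + zero_point).astype(inp["dtype"])
    interpreter.set_tensor(inp["index"], image)
    interpreter.invoke()
    return interpreter.get_tensor(out["index"])  # raw predictions to decode
```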
The grouping process of the detected digits into meaningful values is performed by considering the intersection percentage of a digit’s bounding box (represented by Classes ‘0’ to ‘9’) with a value’s bounding box (represented by Class ‘10’). An illustration of these classes and their corresponding bounding boxes is depicted in
Figure 4. Value classes are displayed as purple boxes, while digit classes are displayed as boxes of other colors. A digit class is grouped into a value class (i.e., systolic, diastolic, or pulse) if its bounding box overlaps the corresponding value’s bounding box by at least 75% along the vertical axis. This 75% threshold was determined empirically through controlled experiments under commonly observed real-world image conditions, including slight skew and display-angle variations; the objective was to select a threshold that achieves reliable grouping accuracy while maintaining computational efficiency. This heuristic also accommodates occasional overextensions of digit bounding boxes beyond the expected value region.
Once grouped, the digits within each group are sorted by position from left to right and concatenated to form a value, yielding three values for the three blood pressure metrics. These values are then assigned to systolic pressure, diastolic pressure, and pulse rate according to their vertical positions in the image, from top to bottom.
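A compact sketch of this grouping and ordering logic, assuming detections are available as (class_id, x1, y1, x2, y2) tuples in image coordinates (the helper names are illustrative):
```python
def vertical_overlap_ratio(digit_box, value_box):
    """Fraction of the value box's vertical extent covered by the digit box."""
    _, dy1, _, dy2 = digit_box
    _, vy1, _, vy2 = value_box
    overlap = max(0.0, min(dy2, vy2) - max(dy1, vy1))
    return overlap / max(vy2 - vy1, 1e-6)

def group_digits(detections, threshold=0.75):
    """detections: list of (class_id, x1, y1, x2, y2); class 10 marks a value region."""
    value_boxes = sorted((d for d in detections if d[0] == 10), key=lambda v: v[2])
    digit_boxes = [d for d in detections if d[0] != 10]
    readings = []
    for value in value_boxes:  # top to bottom: systolic, diastolic, pulse
        members = sorted(
            (d for d in digit_boxes
             if vertical_overlap_ratio(d[1:], value[1:]) >= threshold),
            key=lambda d: d[1],  # left to right within the group
        )
        if members:
            readings.append(int("".join(str(d[0]) for d in members)))
    return readings  # e.g., [120, 80, 72]
```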
Overall, this mHealth application consists of 5 pages: login page, register page, home page, camera page, and profile page. The user interface of the home, camera, and profile pages is displayed in
Figure 5. In the implementation, these pages are divided into 3 activities: main activity, login activity, and register activity, with the main activity handling 3 fragments: home, camera, and profile.
On the login page, users are required to enter their registered email and password, while on the registration page, users are asked to provide personal information, including their name, email, password, weight, height, and date of birth. The name, email, and password are used for authentication purposes, while the weight, height, and date of birth are used to determine the categorization of blood pressure values.
The home page serves the primary function of visualizing recorded blood pressure data within the application. It provides a dropdown menu for selecting the user whose blood pressure data will be displayed. Meanwhile, on the profile page, users can view and modify their account details. Additionally, the profile page displays a list of other user profiles registered under the same account, allowing user management operations such as adding, modifying, and deleting user data.
On the camera page, there is a camera view displaying live frames captured by the camera. This section also includes an overlay to draw bounding boxes surrounding the detected seven-segment digits. The detection results, such as inference time, systolic pressure, diastolic pressure, pulse rate, and blood pressure category, are displayed below the camera view. After capturing an image, users can confirm the detection results or retake the picture. In addition to using the camera, users have the option to use images from the phone’s image gallery by pressing the gallery button.