1. Introduction
Vision, our most dominant sense, is essential for many facets and phases of life. While it is sometimes taken for granted, its absence seriously hinders our capacity to learn, navigate, read, carry out daily duties, and work. According to the World Health Organization (WHO), about 40 million people worldwide are blind, and another 250 million have some form of vision impairment. Due to the aging population and the rising incidence of diseases such as diabetic retinopathy and glaucoma, which are increasingly common causes of visual impairment, the prevalence of low vision has grown.
Living with restrictions, as with any disability, presents constant challenges. Specifically, the lack of visual accuracy makes everyday situations significantly more difficult to manage. Although alternative approaches to handling these routines can be developed, one immediate consequence of this impairment is the insecurity felt when moving or traveling independently, especially in outdoor and unfamiliar environments [1].
In both indoor and outdoor settings, computer vision is an essential and helpful tool for those with visual impairments, such as those who are blind or have limited vision. The basic idea is that visually challenged people can be helped in their daily tasks by deploying cameras as extra eyes, whose images are automatically evaluated by software. The main goals of current research are to create specialized devices that visually impaired persons can carry with them and to develop unique methods for scene analysis and augmentation.
The field of computer vision has made notable progress in recent years, resulting in systems designed to tackle the difficulties encountered by people with visual impairments. One such tool, introduced by Gagnon [2], is a computer vision application that automatically generates descriptions of video content, helping blind viewers follow their favorite TV shows, especially during dialogue-light portions. A head-mounted device designed to scan important areas of the scene and extract information useful to blind people is detailed in the study of Pradeep [3]. Furthermore, Choudhury [4] describes an image contrast enhancement technique that greatly improves the usability of images, text, and other visual elements for those with limited vision. The system described in Chen’s [5] study uses a portable computer carried in a backpack and a camera fixed on the user’s shoulder to localize and recognize text in urban surroundings.
In addition to these earlier systems, more recent works have applied deep learning-based computer vision methods to assistive technologies, such as wearable scene description devices [6,7], navigation aids integrating object detection and auditory feedback [8], and mobile applications for reading textual information in complex environments [6]. In parallel, significant progress has been made in text spotting and OCR, with end-to-end deep learning pipelines such as EAST, CRAFT, and Transformer-based OCR models demonstrating improved robustness in unconstrained environments [9,10,11]. Lightweight implementations suitable for mobile devices, such as PaddleOCR [12], further support the feasibility of deploying recognition systems for real-world assistive use cases.
Numerous automatic bus number identification systems, many of which make use of active sensors like GPS and Radio Frequency Identification (RFID), have been described in the literature. Among the vision-based methods is the one by Guida [13], which uses an AdaBoost-based cascade of classifiers to identify bus line numbers and then translates them into audio announcements via OCR. Pan [14] created a system that uses cameras at bus stops to identify bus route numbers. HOG and SVM are used for bus recognition, while OCR, in conjunction with text-to-speech, is used for audio announcements. A comparable system with three subsystems, bus motion detection using Modified Adaptive Frame Differencing (MAFD), bus panel detection, and text detection, was developed by Tsai [15]. All three subsystems result in speech notifications.
This study proposes an algorithm that aims to assist visually impaired individuals in understanding the local bus routes of Genova, Italy. The algorithm not only detects the bus route number but also determines the direction in which the bus is heading. By providing both pieces of information, the system aims to enhance the mobility and independence of blind individuals by improving the accessibility and usability of public transit. Furthermore, we aim to integrate this algorithm into our navigational aid that can function offline, enabling visually impaired users to access bus route information without requiring an internet connection. This offline capability would further enhance their navigation and mobility in urban environments. The main contributions of this work are as follows: (i) we design a simple yet efficient algorithm that is lightweight enough to run in offline mode on portable devices, (ii) we integrate YOLOv8, ESRGAN, OCR, and lexicon-based correction into a unified pipeline optimized for practical use, and (iii) we demonstrate that despite its simplicity, the system achieves higher accuracy compared with existing methods.
2. System Overview
Public buses in Genova run in two directions, which makes it difficult for visually impaired people (VIP) to determine which bus to board. To address this challenge, our algorithm allows VIPs to capture an image of the approaching bus using a mobile device. Importantly, the picture does not need to be perfectly centered on the bus front, since the custom-trained YOLOv8 model automatically detects and isolates the relevant front panel region even when the photo contains background objects, partial bus views, or is slightly misaligned. In cases where the bus panel is not detected or the image is too blurry to process reliably, the system does not return misleading results but instead prompts the user to retake the photo. While in Genova the front panels are LED-based, the algorithm is adaptable to other bus display formats (e.g., printed signs or painted boards), provided that a representative training dataset is available.
Figure 1 shows that the proposed approach is divided into several key stages. The bus is first identified in the picture. To save computational resources, only the front portion of the detected bus, i.e., the region of interest (ROI), is considered for further processing. All other information is discarded. After the ROI has been identified, the picture is cropped so that only the LED panel, which shows the bus route number and destination, is extracted.
After the LED panel is extracted, image-enhancing methods are used to improve the text’s image quality. The bus route number and destination are then read from the improved image using text recognition techniques. The likelihood of errors resulting from misspellings or confusing text is decreased by comparing the recognized text to a database of Genova bus routes and selecting the closest match. This algorithm provides a reliable and efficient way to determine the exact bus route and direction, hence increasing the independence and mobility of VIPs and significantly improving their accessibility to transportation in Genova.
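To make the overall flow concrete, the following minimal Python sketch illustrates how the stages of Figure 1 fit together. It is glue code only: the four stage functions are passed in as callables and correspond to the sketches given in the following subsections, and the 0.5 confidence cut-off mirrors the OCR threshold described in Section 2.3.

```python
# Illustrative glue code for the pipeline in Figure 1 (not the authors' implementation).
# The four stage functions are supplied by the caller and sketched in later subsections.

def recognize_bus_route(image, detect_front, enhance, read_text, match_route,
                        min_confidence=0.5):
    """Return (route_number, destination), or None to prompt the user to retake the photo."""
    roi = detect_front(image)                   # YOLOv8: crop the bus front / LED panel
    if roi is None:
        return None                             # no panel detected
    text, confidence = read_text(enhance(roi))  # ESRGAN super-resolution, then EasyOCR
    if confidence < min_confidence:
        return None                             # image too blurry or unreliable
    return match_route(text)                    # Levenshtein match against the Genova lexicon
```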
2.1. Dataset
To train and evaluate the proposed system, we collected a dataset of bus images representing the 147 official bus routes operating in Genova, Italy. The images were obtained from two sources: (i) photographs captured with mobile devices at bus stops, and (ii) publicly available repositories. This ensured variability in both acquisition conditions and bus appearances.
For YOLOv8 training, we used 400 annotated bus images. These were manually labeled to mark only the front-facing region of each bus, ensuring that the model learned to isolate the LED panel area. Model hyperparameters were optimized using a train/validation split of 80/20 within this dataset.
For evaluation, an independent test set of 120 images was reserved, which was not seen during training or validation. This dataset was intentionally curated to capture a range of challenging real-world conditions. A summary of dataset diversity across lighting, occlusion, and motion blur is provided in Table 1. In addition, the dataset included variation in the following:
Viewing angles: ranging from frontal captures to oblique angles of up to ∼45°.
Distances: ranging from close-range captures (3–5 m) to ∼15 m from the bus.
Display characteristics: while all buses used LED panels, differences were present in brightness and size. Most panels employed standard fonts, but a subset featured non-standard spacing or scrolling layouts, included to test OCR robustness.
This dataset ensured that the algorithm was evaluated under diverse and challenging conditions rather than only on ideal inputs. Approximately one-fifth of the images contained partial occlusion or strong reflections, while others exhibited blur or low contrast. Although limited in size and restricted to a single city, the dataset provides an important first step in benchmarking the system under real-world conditions. Future work will expand the dataset across multiple cities and display types (e.g., printed signs, painted boards) to improve generalization. The dataset will be made available upon request for research purposes.
2.2. Detection of Region of Interest
The first stage of the proposed algorithm involves detecting the presence of a bus within the input image. For this purpose, the YOLOv8 object detection model [16] was employed. YOLOv8 is a state-of-the-art real-time object detection framework that enables efficient localization of objects within an image. To tailor the model for our application, we retrained it specifically to detect only the front portion of the bus, which typically displays the route number and destination information.
We selected YOLOv8 as the detection backbone because it provides an optimal balance between accuracy and computational efficiency, which is crucial for mobile-based assistive technologies. At the time of system development, newer releases such as YOLOv9, YOLOv10, and YOLOv11 were available; however, these versions required substantially higher computational resources and lacked the stable deployment pipelines and pretrained weights necessary for rapid adaptation to our dataset. YOLOv8, in contrast, offered robust accuracy with real-time performance on mobile devices, making it a practical and reliable choice for our application.
For model training, we employed the YOLOv8-s (small) variant, which provides a favorable trade-off between speed and detection accuracy. A total of 400 annotated images of buses were used for training, curated from publicly available sources including the Roboflow platform [17]. Images were manually annotated to include only the front-facing segments of buses as bounding box labels, thereby defining the region of interest (ROI) and eliminating irrelevant elements such as advertisements, side panels, or unrelated text. The model was trained for 100 epochs with a batch size of 16, using the AdamW optimizer and an initial learning rate of 0.001 scheduled via cosine decay. To improve generalization, standard data augmentation strategies were applied, including horizontal flipping, random scaling (±15%), brightness/contrast adjustment (±20%), and simulated motion blur. A 20% validation split from the training set was used for monitoring performance during training.
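As a rough illustration, the training configuration above can be expressed with the Ultralytics API as follows. This is a sketch under assumptions: the dataset configuration file name is a placeholder, and simulated motion blur is not a built-in Ultralytics augmentation, so it would be applied in a separate preprocessing step.

```python
# Training sketch with the Ultralytics API; "bus_front.yaml" is a placeholder dataset
# config pointing to the 400 annotated images with a single "bus front" class.
from ultralytics import YOLO

model = YOLO("yolov8s.pt")        # start from pretrained YOLOv8-s weights
model.train(
    data="bus_front.yaml",
    epochs=100,
    batch=16,
    optimizer="AdamW",
    lr0=0.001,
    cos_lr=True,                  # cosine learning-rate decay
    fliplr=0.5,                   # horizontal flipping
    scale=0.15,                   # random scaling (about ±15%)
    hsv_v=0.2,                    # brightness adjustment (about ±20%)
    # motion blur is not a built-in option; apply it offline to the training images
)
```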
Once trained, the YOLOv8 model was integrated into the system pipeline to automatically detect the front panel of the bus from a captured image. Following detection, the corresponding ROI was cropped from the image and passed on to subsequent modules for further processing, including image enhancement and text recognition. By isolating only the relevant portion of the image, this step reduces noise and computational overhead while improving the overall effectiveness of the OCR system used downstream.
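A minimal inference sketch, assuming custom-trained weights saved locally (the file names are placeholders), shows how the detected bus front is cropped and handed to the downstream stages:

```python
# ROI extraction sketch: run the custom detector and crop the highest-confidence
# bus-front box; "bus_front_best.pt" and "bus_photo.jpg" are placeholder paths.
import cv2
from ultralytics import YOLO

model = YOLO("bus_front_best.pt")
image = cv2.imread("bus_photo.jpg")
result = model(image, conf=0.5)[0]             # single-image inference

if len(result.boxes) == 0:
    print("No bus front detected - prompt the user to retake the photo.")
else:
    best = int(result.boxes.conf.argmax())     # keep the most confident detection
    x1, y1, x2, y2 = map(int, result.boxes.xyxy[best].tolist())
    roi = image[y1:y2, x1:x2]                  # cropped LED-panel region for ESRGAN/OCR
    cv2.imwrite("roi.jpg", roi)
```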
2.3. Image Enhancement and Text Reading
In the second stage, we applied the Enhanced Super-Resolution Generative Adversarial Network (ESRGAN) [18] to improve the resolution of the cropped ROI before text recognition. The ESRGAN architecture consists of a generator network with residual-in-residual dense blocks (RRDBs) that reconstruct high-resolution images from low-resolution inputs, and a discriminator trained with perceptual loss to enhance photo-realism. The motivation for this step was to address common real-world image degradations, such as low light, motion blur, or glare on LED panels, that reduce OCR accuracy. We used the official pretrained ESRGAN model without additional fine-tuning, as it provided sufficiently clear enhancements of bus LED displays while remaining computationally lightweight.
In terms of performance, ESRGAN processed each ROI in approximately 0.4 s on an NVIDIA RTX 3060 GPU and 1.0 s on a standard Intel i7-11800H CPU, demonstrating feasibility for near real-time applications. Beyond qualitative improvement, we observed a consistent gain in recognition confidence: across the test set, mean OCR confidence increased from 0.46 (without ESRGAN) to 0.71 (with ESRGAN), and this difference was statistically significant (p < 0.01, paired t-test). While ESRGAN occasionally introduced minor stroke artifacts (“hallucinations”), these did not result in misclassifications in our dataset, since the subsequent lexicon correction step filtered out spurious outputs.
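The enhancement step can be sketched as below, assuming the basicsr implementation of the RRDB generator and a locally available pretrained ESRGAN weight file whose state-dict keys match this architecture (the weight path is a placeholder); this is not the authors' exact code.

```python
# ESRGAN enhancement sketch (x4 upscaling of the cropped panel); paths are placeholders
# and the weight file is assumed to be compatible with basicsr's RRDBNet keys.
import cv2
import numpy as np
import torch
from basicsr.archs.rrdbnet_arch import RRDBNet

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64, num_block=23, num_grow_ch=32, scale=4)
model.load_state_dict(torch.load("esrgan_x4.pth", map_location=device))
model.eval().to(device)

roi = cv2.imread("roi.jpg")                                    # cropped LED panel (BGR)
x = torch.from_numpy(roi[:, :, ::-1].copy()).float() / 255.0   # BGR -> RGB, scale to [0, 1]
x = x.permute(2, 0, 1).unsqueeze(0).to(device)                 # HWC -> NCHW

with torch.no_grad():
    y = model(x).squeeze(0).clamp(0, 1).permute(1, 2, 0).cpu().numpy()

enhanced = (y[:, :, ::-1] * 255.0).round().astype(np.uint8)    # RGB -> BGR for OpenCV
cv2.imwrite("roi_sr.jpg", enhanced)
```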
Following enhancement, text recognition was carried out using the EasyOCR framework [19], a PyTorch-based OCR engine that supports 58 languages. For our application, EasyOCR was configured with both Italian and English language models to reflect the languages used on Genova bus displays. To improve robustness, a minimum confidence threshold of 0.5 was applied to filter out low-confidence predictions, and the recognized outputs were validated against a custom lexicon containing all 147 bus routes and destinations.
Further implementation details are summarized below:
Backbone: EasyOCR employs a Convolutional Recurrent Neural Network (CRNN) with a ResNet feature extractor.
Decoding method: Connectionist Temporal Classification (CTC) beam search decoding was used.
Non-Maximum Suppression (NMS): A default threshold of 0.2 was applied to merge overlapping text boxes.
Preprocessing: Input images were resized to a height of 64 pixels while preserving aspect ratio, and normalized before recognition.
Tokenization: Default EasyOCR tokenization rules (alphanumeric + diacritics) were applied.
Language handling: Italian was set as the primary recognition language, with automatic fallback to English when the Italian model returned low confidence scores.
This configuration provided a structured and reliable OCR pipeline, ensuring accurate recognition of bus numbers and destinations even under challenging conditions.
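A minimal sketch of this configuration with the EasyOCR API is shown below; the confidence filter reflects the 0.5 threshold above, while the image path is a placeholder carried over from the earlier stages.

```python
# OCR sketch: Italian and English models loaded together, low-confidence detections
# discarded before lexicon matching; "roi_sr.jpg" is the ESRGAN-enhanced panel crop.
import easyocr

reader = easyocr.Reader(["it", "en"], gpu=False)   # load the recognition models once
results = reader.readtext("roi_sr.jpg")            # [(bbox, text, confidence), ...]

accepted = [(text, conf) for _, text, conf in results if conf >= 0.5]
panel_text = " ".join(text for text, _ in accepted)
print(panel_text)                                  # e.g. "45 BRIGNOLE" before correction
```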
2.4. Comparison of Detected Text with Standard
The final step entails giving the algorithm a list of all Genova bus routes and their associated destinations. The algorithm processes this list, storing bus route numbers as dictionary keys and their two possible destinations as dictionary values.
The algorithm then matches the recognized text to the list of Genova buses. The dictionary key and value pair with the highest matching ratio to the detected text is output. If the algorithm detects both the bus route number and the destination, it first confirms that the bus number exists in the dictionary. If the bus route number is present, the algorithm compares the detected destination with the corresponding dictionary entries and selects the destination with the highest match ratio as the output. The match ratio between the detected text and the dictionary keys and values is calculated using the Levenshtein method [20].
When a bus number is incorrectly recognized, the algorithm checks whether the number is present in the dictionary keys. If not, it searches the dictionary values to see whether the detected destination exists. In cases where neither the detected bus number nor the destination can be matched with the dictionary of valid routes, the system does not provide the raw OCR output as a final result. Instead, it flags the case as “unrecognized bus route” and prompts the user to capture another image. This safeguard prevents the risk of announcing a route that does not exist, which could lead to confusion or undesired actions.
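A minimal sketch of this matching step is given below, using the python-Levenshtein package; the route table is a tiny illustrative excerpt rather than the full Genova lexicon, and the 0.6 acceptance threshold is an assumption for the example, not a value reported by the authors.

```python
# Lexicon-matching sketch: route numbers are dictionary keys, the two possible
# destinations are the values; Levenshtein ratios select the closest valid entry.
import Levenshtein

ROUTES = {
    "45": ["STAZIONE BRIGNOLE", "SAN MARTINO"],   # illustrative excerpt only
    "35": ["PRINCIPE", "FOCE"],
}

def match_route(detected_number, detected_destination, routes=ROUTES, threshold=0.6):
    # 1) match the detected number against the dictionary keys
    number, num_score = max(
        ((k, Levenshtein.ratio(detected_number, k)) for k in routes),
        key=lambda kv: kv[1],
    )
    if num_score < threshold:
        return None                               # flag as "unrecognized bus route"
    # 2) match the detected destination against that route's two destinations
    destination, dest_score = max(
        ((d, Levenshtein.ratio(detected_destination.upper(), d)) for d in routes[number]),
        key=lambda kv: kv[1],
    )
    if dest_score < threshold:
        return None
    return number, destination

print(match_route("45", "BRIGNOLE"))              # -> ('45', 'STAZIONE BRIGNOLE')
```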
3. Results and Discussion
The initial step of our algorithm is to find the bus in the input image. We trained the YOLOv8 algorithm to detect only the front portion of the bus. The performance metrics of YOLOv8 showed encouraging results.
Figure 2 presents the Precision-Recall curve, which shows a mean Average Precision (mAP) of 0.943 at an IoU threshold of 0.5. The curve demonstrates that the model maintains high precision at lower recall values, and that precision falls as recall rises, reflecting the trade-off between admitting more false positives and capturing more positive cases. The high mAP score confirms that the model achieves a good balance between precision and recall across different thresholds.
The relationship between the model’s F1 score and the confidence threshold for the “bus front” class and all classes is depicted by the F1-Confidence curve in Figure 3. The F1 score, which balances precision and recall, peaks at 0.93 at a confidence threshold of 0.736. The curve indicates that the F1 score improves quickly at lower confidence levels as the model begins to detect more true positives. However, beyond the ideal cutoff, the F1 score starts to decline, since higher thresholds preserve precision but reduce recall. This analysis helps identify the optimal confidence threshold for maximizing balanced predictions.
To rigorously evaluate the performance of the proposed bus route recognition system, we employed an independent test set of 120 images of Genova buses that was entirely excluded from YOLOv8 training and validation. This set included a diverse range of route numbers and destinations, with images captured both via mobile devices at bus stops and sourced from online repositories. The dataset intentionally encompassed challenging real-world conditions, such as variations in lighting (daytime and evening), low-light environments, glare from reflective surfaces, motion blur, and oblique viewing angles, thereby providing a robust basis for assessing the reliability and generalizability of the system.
Advertisements and other unnecessary messages or numbers might cause errors when using the entire bus as the ROI. We efficiently isolated the crucial ROI and removed irrelevant text by custom-training the YOLOv8 algorithm to recognize only the front portion of the bus. From Figure 4A, it can be clearly seen that considering the whole bus for text detection can cause errors. Figure 4B depicts the results of the custom-trained YOLOv8, which correctly focused only on the bus front.
To address the limitations of low-quality inputs such as motion blur, glare, or poor lighting, we applied the ESRGAN super-resolution model to the ROI before OCR. This enhancement step consistently increased recognition reliability across the dataset. On average, the mean OCR confidence improved from 0.46 (without ESRGAN) to 0.71 (with ESRGAN), and this gain was statistically significant (p < 0.01, paired t-test). While ESRGAN may occasionally introduce artificial stroke artifacts, these did not lead to misclassifications in our experiments, since the subsequent lexicon-based validation filtered out such spurious outputs.
Figure 5 illustrates a representative example, where the OCR result improved from “45 BRI NOLE” (confidence 0.13) to “45 BRIGNOLE” (confidence 0.73).
Following super-resolution and OCR, the recognized text was validated against a pre-defined list of Genova bus routes and destinations. This step was critical for correcting incomplete or erroneous OCR outputs. For example, the enhanced result “45 BRIGNOLE” was matched to the lexicon entry “45 STAZIONE BRIGNOLE,” which represents the correct bus route and destination. As shown in Figure 6, the lexicon-based correction minimized errors, reduced the impact of OCR uncertainties, and ensured robust identification of both bus numbers and destinations under challenging imaging conditions.
To provide a more detailed evaluation, we also report per-task performance metrics (Table 2). Bus front detection achieved a precision of 96.2%, recall of 94.5%, and F1 score of 95.3%. Route number recognition achieved an accuracy of 93.8% ± 2.1, while destination recognition achieved 92.5% ± 2.8, with confidence intervals estimated via bootstrapping (1000 resamples).
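For illustration, the bootstrap procedure used for these intervals can be sketched as follows; the per-image outcome vector here is synthetic and only demonstrates the method, not the actual results.

```python
# Bootstrap sketch: 95% confidence interval for accuracy over 120 test images,
# using 1000 resamples; the outcome flags below are synthetic for illustration.
import numpy as np

rng = np.random.default_rng(0)
outcomes = rng.random(120) < 0.938           # synthetic correct/incorrect flags per image

boot_acc = np.array([
    rng.choice(outcomes, size=outcomes.size, replace=True).mean()
    for _ in range(1000)
])
lo, hi = np.percentile(boot_acc, [2.5, 97.5])
print(f"accuracy = {outcomes.mean():.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```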
To further assess the contribution of each component, we performed an ablation study (Table 3). Using the full bus image as ROI reduced recognition accuracy to 78.4%, due to the inclusion of irrelevant text such as advertisements. Excluding ESRGAN resulted in lower OCR confidence and an accuracy of 85.7%. Replacing EasyOCR with Tesseract reduced recognition accuracy to 81.2%, confirming the advantage of deep learning-based OCR engines. Finally, removing the route lexicon-based correction step decreased accuracy from 95% to 88.9%, highlighting the importance of structured validation. This also illustrates that while the lexicon improves recognition reliability, it may reduce flexibility under open-set conditions, since routes or destinations not present in the predefined list could be incorrectly matched to the closest available entry, resulting in over-correction or misclassification. Future work will, therefore, extend the comparison to stronger OCR frameworks such as PP-OCRv3 and transformer-based recognizers, and will include systematic evaluation under open-set conditions to ensure robustness and avoid over-correction.
Our system yielded an overall accuracy of 95% with a 5% error rate. This performance outperforms the results of earlier research. For instance, Wongta [21] reported 73.47% accuracy in bus number detection. Guida [13], who tested their algorithm on a small dataset of only five bus routes, achieved nearly 100% accuracy but lacked scalability. Maina [22] reported 72% total accuracy. A key advantage of our approach is its ability to detect both the bus route number and its destination, making it more versatile. By simply updating the lexicon with city-specific bus routes, our method can be adapted to different urban contexts, ensuring applicability for visually impaired individuals across diverse public transport systems. For cities with frequently changing or highly variable route information, the lexicon can be updated manually or linked to open transit APIs for automatic synchronization. Since the lexicon operates independently of detection and OCR, this scalability ensures that our approach remains adaptable and future-proof.
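As one possible way to keep the lexicon synchronized with open transit data, the sketch below builds the route dictionary from a static GTFS feed; the file paths are placeholders, and since trip_headsign is optional in GTFS, feeds without it would need another source of destination names.

```python
# Lexicon-building sketch from a static GTFS feed (routes.txt + trips.txt);
# paths are placeholders and headsigns are assumed to be present in the feed.
import csv
from collections import defaultdict

def lexicon_from_gtfs(routes_path="routes.txt", trips_path="trips.txt"):
    with open(routes_path, encoding="utf-8") as f:
        short_names = {r["route_id"]: r["route_short_name"] for r in csv.DictReader(f)}

    lexicon = defaultdict(set)
    with open(trips_path, encoding="utf-8") as f:
        for trip in csv.DictReader(f):
            number = short_names.get(trip["route_id"])
            headsign = trip.get("trip_headsign", "").strip()
            if number and headsign:
                lexicon[number].add(headsign.upper())

    return {number: sorted(dests) for number, dests in lexicon.items()}
```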
From a privacy perspective, the proposed system was designed to operate entirely on-device, with all image processing performed locally. Only the bus front panel is analyzed, while surrounding regions are discarded, and no images are stored or transmitted, thereby reducing the risk of exposing unintended personal information in public spaces. In terms of efficiency, the algorithm follows an event-based design: it is triggered only when the user actively captures an image, rather than running continuously. This approach minimizes both computational load and energy consumption, making integration into mobile or wearable devices more feasible. On standard hardware, average processing time remains close to one second, and further optimization through lightweight model variants or quantization could reduce the battery impact even further.
To clarify the practical benefit for visually impaired people, the proposed system is designed to be integrated into portable navigation aids (e.g., smartphones or wearable devices) rather than deployed at bus stops. Users can point their device toward an approaching bus, and the algorithm processes the bus front locally to extract route information. The recognized route and destination are then conveyed to the user through audio feedback.
The analysis of failure cases highlights several limitations of the proposed algorithm. First, the dataset size and scope remain restricted, with 400 training and 120 test images collected from a single city (Genova). Although the dataset was curated to include variation in lighting, angles, distances, occlusion, and panel layouts, its limited scale raises concerns about generalization to other cities, bus designs, and display types. Second, detection errors occurred in ∼4% of cases, where YOLOv8 failed to correctly isolate the bus front due to extreme occlusion, glare, or unusual viewing angles. Third, OCR misreads accounted for ∼6% of errors, typically under conditions of motion blur or low illumination. Fourth, lexicon mismatches contributed ∼3% of errors, where the correction step over-adjusted the OCR output when multiple near matches existed in the dictionary. Non-standard fonts and panel layouts also occasionally confused the OCR system.
In addition to these issues, the dataset did not explicitly include real-time environmental conditions such as rain, fog, or heavy traffic, which may further reduce visibility or cause occlusion of bus panels. Another limitation is that our OCR comparison was restricted to EasyOCR, Tesseract, and PaddleOCR; while this provides a useful baseline, stronger frameworks such as PP-OCRv3 and transformer-based recognizers were not evaluated. Finally, the system has not yet been systematically tested under open-set conditions, where unseen routes or destinations may appear and challenge the robustness of lexicon-based correction.
Taken together, these limitations emphasize the need for future improvements, including expanding the dataset across multiple cities and display types, incorporating diverse weather and traffic scenarios, integrating more advanced OCR backbones, refining lexicon correction strategies, and conducting systematic evaluations under open-set conditions to ensure robustness in real-world deployments.
Building on the identified limitations, future work will focus on several directions. First, dataset expansion is essential. We plan to collect a larger and more diverse dataset across multiple cities, incorporating different display types (e.g., printed signs, painted boards), varied fonts, and challenging environmental conditions such as rain, fog, and heavy traffic. This will help ensure stronger generalization beyond the current Genova-based dataset. Second, user-centered evaluation will be carried out through trials with visually impaired participants to assess the system’s effectiveness in real-world navigation. These studies will measure usability, error rates, and user satisfaction. Specifically, we plan to integrate the bus route recognition algorithm into our navigation aid device for VIPs, which is currently under development, enabling real-world trials that combine route recognition with multisensory guidance. Third, OCR enhancement will be explored by extending the comparison to more advanced frameworks such as PP-OCRv3 and transformer-based recognizers, alongside lightweight variants optimized for mobile deployment. Finally, robustness testing will include systematic evaluation under open-set conditions, where unseen routes or destinations may appear, to better understand and mitigate brittleness introduced by lexicon-based correction.
By addressing these directions, we aim to bridge technical improvements with lived user experience, moving toward a practical assistive tool that empowers blind and visually impaired individuals to navigate public transit with greater confidence and independence.