Article

Personalized Contextual Information Delivery Using Road Sign Recognition

Byungjoon Kim and Yongduek Seo *
1 Korea Artificial Intelligence Certification Institute (KAIC), Seoul 04790, Republic of Korea
2 School of Media, Arts, and Science, Sogang University, Seoul 04107, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(11), 6051; https://doi.org/10.3390/app15116051
Submission received: 30 December 2024 / Revised: 5 April 2025 / Accepted: 10 April 2025 / Published: 28 May 2025

Abstract

Road sign recognition is essential for navigation and autonomous driving applications. While existing models focus primarily on text detection and extraction, they fail to incorporate user-specific contextual information, limiting their effectiveness in real-world scenarios. This study proposes a modular system that enhances road sign recognition by integrating user-adapted contextual reasoning. The system applies a step-by-step Chain of Thought (CoT) approach to link detected road signs with relevant contextual data, such as location, speed, and destination. Compared to traditional image captioning models, our approach significantly improves information relevance and usability. Experimental results show that the proposed system achieves a 23.4% increase in user-adapted information accuracy and reduces interpretation errors by 17.8% in real-world navigation scenarios. These findings demonstrate that semantic inference-based reasoning improves decision-making efficiency, making road sign recognition systems more practical for real-world applications. The study also discusses challenges such as real-time processing limitations and potential future improvements for broader infrastructure recognition.

1. Introduction

The rapid advancements in artificial intelligence (AI) and computer vision have profoundly impacted various industries, with road sign recognition emerging as a critical application in domains such as autonomous driving, navigation systems, and traffic management. Road signs serve as essential tools for conveying directions, distances, and regulatory information to drivers. However, traditional recognition systems often fall short of delivering intuitive and user-centric insights. These systems primarily focus on object detection or basic text extraction, which limits their ability to address the contextual needs of users. For example, road signs often contain multiple objects and text components (Figure 1), yet traditional models process these elements in isolation without integrating them into meaningful, user-adapted outputs.
Conventional object detection models, such as the Faster Region-Based Convolutional Neural Network (Faster-RCNN) and You Only Look Once (YOLO), are adept at identifying objects and their locations but are inherently limited in providing actionable, real-time information tailored to the user’s specific context [1,2,3]. Advances such as YOLOv4 have enhanced speed and accuracy in real-time applications [4]. Similarly, image captioning models are designed to generate textual descriptions of images but fail to incorporate critical user-related factors such as location, destination, or driving conditions [5,6]. These limitations hinder their applicability in scenarios requiring personalized and context-aware decision making, particularly in dynamic environments like road navigation.
To address these shortcomings, this study proposes a novel approach that integrates road sign recognition with contextual information delivery. The system combines modular object detection techniques with user-specific contextual data to generate intuitive outputs that enhance usability. Leveraging advanced models such as Faster-RCNN [1], DeepLabV3 [7], and Character Region Awareness for Text Detection (CRAFT) [8], the proposed system processes road sign images to extract meaningful textual and symbolic information while incorporating user-specific details like location, speed, and destination. Furthermore, the system utilizes step-by-step reasoning to link these data points, providing actionable insights tailored to the user’s immediate needs. This research contributes to the field by demonstrating a robust system capable of overcoming the limitations of traditional recognition models, paving the way for broader applications in navigation, logistics, and beyond.
Recent advancements in traffic sign recognition have explored various deep learning-based approaches to improve detection and interpretation accuracy. YOLO-based models, such as Sign-YOLO [9], have integrated attention mechanisms to enhance traffic sign detection, particularly for real-time applications in autonomous driving and intelligent driver assistance systems. Similarly, Large Language Model (LLM)-driven contextual reasoning models [10] have demonstrated promising results in improving cross-domain adaptability for traffic sign recognition through few-shot in-context learning. These approaches highlight the importance of integrating context-aware reasoning into traffic sign recognition, aligning with our study’s objective of enhancing usability and decision-making efficiency.
This study proposes a modular system that integrates object detection, text recognition, and user-adapted contextual reasoning to improve road sign recognition. Unlike conventional approaches that only extract textual information, the proposed system transforms raw detections into meaningful insights tailored for real-world navigation. By leveraging a CoT-based reasoning approach, extracted text and symbolic information are linked with user-specific data such as location, speed, and destination. This enables the system to generate context-aware and user-adapted outputs rather than static text extractions.
Additionally, this study demonstrates the effectiveness of semantic inference in road sign recognition by comparing the proposed approach with traditional image captioning models. Through this comparative analysis, the system is shown to significantly enhance real-time decision making and usability in navigation applications.
The remainder of this paper is structured as follows. Section 2 reviews existing research on object detection, text recognition, and context-based reasoning in road sign recognition. Section 3 details the proposed modular system, including the integration of user context and semantic inference. Section 4 presents experimental results, including comparisons with traditional image captioning models and performance analysis. Finally, Section 5 concludes the paper and discusses limitations and future research directions, particularly in optimizing real-time processing and expanding semantic inference applications.

2. Related Work

This section reviews previous research relevant to the proposed system. We first discuss key object detection methodologies commonly applied in road sign recognition tasks. Next, we explore recent CoT-based reasoning approaches, emphasizing their effectiveness in enhancing interpretability and decision making. Finally, we briefly introduce conventional image captioning models, highlighting their limitations in delivering context-aware insights, thus setting the stage for our proposed approach.

2.1. Object Detection

Object detection is a core task in computer vision that involves identifying and localizing objects within an image, playing a vital role in road sign recognition by enabling systems to detect and classify signs, symbols, and other relevant features. Over the years, significant advancements in deep learning have introduced highly effective object detection models, with Faster-RCNN and YOLO being among the most prominent. Faster-RCNN is a two-stage object detection framework that integrates region proposal generation and object classification into a unified architecture. The model employs a Region Proposal Network (RPN) to identify potential object regions, which are then passed to subsequent convolutional layers for classification and bounding box refinement [1,2]. This approach excels in detecting small and overlapping objects, making it particularly suitable for complex visual environments such as road signs, where elements like directional arrows, road numbers, and other small-scale features must be identified [6].
In contrast, YOLO represents a single-stage object detection approach that processes the entire image in one pass through a neural network. By dividing the image into a grid and predicting bounding boxes and class probabilities simultaneously for each cell, YOLO achieves high-speed detection suitable for real-time applications such as autonomous vehicles and traffic monitoring systems [3,4]. However, YOLO’s grid-based prediction method may encounter challenges with small or densely packed objects, such as textual elements or arrows frequently found in road signs [7]. Despite this limitation, YOLO remains a popular choice for tasks requiring efficiency and scalability due to its real-time processing capabilities. Together, these models form the foundation of the modern object detection techniques used in road sign recognition systems, with each offering distinct advantages tailored to specific requirements and scenarios.
The effectiveness of Faster-RCNN for road sign detection has been well-documented in previous studies, demonstrating superior accuracy for small-object detection compared to single-stage detectors like YOLO. This study builds upon such findings and prioritizes the integration of detected sign components with user-adapted reasoning rather than conducting a direct benchmark of detection models. Nevertheless, Faster-RCNN was selected due to its ability to handle overlapping objects and small text regions, as highlighted in related research.

2.2. Chain of Thought (CoT)

Chain of Thought reasoning is a technique in artificial intelligence that breaks down complex problems into intermediate steps, enabling models to perform step-by-step reasoning. This approach mimics human cognitive processes and has been particularly useful in addressing tasks requiring logical inference and contextual understanding [8].
CoT is significant for enhancing systems that require the integration of multiple data sources, such as road sign recognition systems. Traditional systems often stop at detecting objects or recognizing text, providing only raw outputs. CoT, however, enables the system to link these outputs sequentially, producing meaningful insights. For example, a system using CoT can combine detected road sign objects such as “Mokpo” and “27 km” with contextual user data, such as current speed or destination, to deliver a more actionable output [11].
The application of CoT in multimodal systems represents a key advancement in generating contextualized, user-specific information. By structuring the reasoning process, CoT facilitates the development of systems that go beyond static outputs, creating insights tailored to complex, real-world scenarios [8,11].

2.3. Image Captioning Model

Image captioning models aim to generate textual descriptions of images by analyzing their visual content. These models typically combine convolutional neural networks (CNNs) for visual feature extraction with recurrent neural networks (RNNs) or transformer-based architectures for sequence generation. Recent advancements in image captioning, such as attention mechanisms, have further improved these models’ capabilities to identify and highlight relevant regions within images. However, despite these improvements, they still lack the integration of contextual user data, which are crucial for applications like personalized road sign recognition [4,5].
Despite their success in general-purpose applications, image captioning models face challenges when applied to road sign recognition. These models are primarily designed to describe static visual features and often lack the ability to interpret or contextualize data. For example, while an image captioning model might accurately describe a road sign as “A green sign with the text ‘27 km to Mokpo’”, it cannot provide additional insights or interpretations relevant to a specific user’s journey [3,11]. This limitation makes traditional image captioning models less effective in dynamic and user-centric applications such as real-time navigation.

3. Proposed System

3.1. System Overview

The proposed system addresses the challenges associated with road sign recognition by leveraging a modular approach that integrates object detection, text recognition, and user-specific contextual information. Road sign images are diverse in structure and content, typically comprising textual information such as distances and destinations, symbolic elements like directional arrows and warning icons, and distinctive colors that denote road categories. These images are often captured under varying environmental conditions, including different lighting scenarios, weather conditions such as rain and fog, and potential occlusions, making robust detection and recognition essential [1,2].
The dataset used in this study consists of 4754 annotated road sign images containing 10,846 labeled objects categorized into text, symbols, and directional elements, as shown in Figure 2. The images were collected from real-world road environments using vehicle-mounted cameras, which inherently yields relatively low resolutions that reflect actual capture conditions, and they span diverse lighting conditions (daylight: 55%, low-light/night: 20%, rain/fog: 15%, other: 10%). The annotation process followed standard labeling procedures, with each object’s position and class verified manually. The dataset structure aligns with established road sign recognition benchmarks to support reliable model training and evaluation.
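To make the dataset composition concrete, the following minimal sketch tallies images and labeled objects from an annotation manifest. The JSON layout, field names, and the summarize_dataset helper are hypothetical illustrations; the dataset’s actual on-disk format is not published.

```python
import json
from collections import Counter

def summarize_dataset(manifest_path: str) -> None:
    """Tally image conditions and object classes from a hypothetical
    JSON manifest (the real dataset layout is not published)."""
    with open(manifest_path, "r", encoding="utf-8") as f:
        manifest = json.load(f)

    images = manifest["images"]            # e.g. [{"file": "...", "lighting": "daylight"}, ...]
    annotations = manifest["annotations"]  # e.g. [{"image_id": 0, "class": "text", "bbox": [...]}, ...]

    lighting = Counter(img.get("lighting", "other") for img in images)
    classes = Counter(ann["class"] for ann in annotations)

    print(f"images: {len(images)}, labeled objects: {len(annotations)}")
    print("lighting distribution:", dict(lighting))  # expected roughly 55/20/15/10%
    print("object classes:", dict(classes))          # text, symbols, directional elements
```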
To address the complexity of road sign images, the system is designed with three primary components. Object detection is carried out using Faster-RCNN, which is optimized for detecting small and overlapping features like text and symbols [1]. Text recognition is handled by the CRAFT model, which detects and converts text into machine-readable formats [8]. Finally, contextual information integration combines user data, such as location, speed, and destination, with the detected road sign information, ensuring that the outputs are actionable and tailored to the user’s immediate needs [11].
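A minimal sketch of how these three components could be chained is shown below. The function and type names are illustrative only; the actual stage implementations (detector, OCR, segmentation, and reasoning modules) are assumed to be supplied as callables and are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class SignElement:
    label: str                      # e.g. "Mokpo", "27 km", "symbol 15"
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in image coordinates
    kind: str                       # "text", "symbol", or "arrow"

def process_sign(image, user_context,
                 detect: Callable, read_text: Callable,
                 group_directions: Callable, integrate: Callable) -> str:
    """Illustrative wiring of the three modules described in Section 3.1."""
    boxes = detect(image)                                   # Faster-RCNN object detection
    elements: List[SignElement] = read_text(image, boxes)   # CRAFT + OCR
    grouped = group_directions(image, elements)             # DeepLabV3 directional grouping
    return integrate(grouped, user_context)                 # CoT-based contextual integration
```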
Figure 3 illustrates an example where the text detection model encountered false positives; the misaligned annotations in the figure are not formatting issues but incorrect text detection cases. As noted above, the images were captured by vehicle-mounted cameras and processed at resolutions matching actual on-road conditions, so some images appear lower in resolution due to the inherent quality of the captured data.

3.2. Modular Components for Road Sign Recognition

The proposed system adopts a modular architecture to efficiently process road sign images and extract meaningful information. This design ensures that each module handles a specific aspect of road sign recognition, including object detection, text recognition, and directional information extraction, providing a foundation for further contextual integration and user-specific outputs [1,8].
The modular system consists of three primary components: object detection, text detection and recognition, and directional information grouping. These modules work sequentially to analyze road sign images, beginning with identifying objects within the image, followed by recognizing textual content, and finally establishing directional relationships among detected objects [6,7].

3.2.1. Object and Text Detection in Road Sign Images

Object detection is the foundational step of the proposed system, focusing on identifying and localizing textual and symbolic elements within road sign images. Faster-RCNN is employed for this task due to its ability to achieve high accuracy in detecting small and overlapping objects. The model’s two-stage process involves generating candidate regions using a Region Proposal Network (RPN) and refining these proposals through classification and bounding box regression [1,2]. Figure 4 presents examples of data acquired by the text recognition model, illustrating detected text regions and individual recognized text elements.
To effectively handle the diverse characteristics of road sign images, the system incorporates optimized anchor box configurations and aspect ratios. These configurations were tailored to accommodate small-scale objects such as text regions and directional arrows, which are prevalent in road sign datasets. The detailed configurations and their performance improvements are presented in Section 4, where the experimental results validate the system’s robustness and accuracy under various conditions [6,7].
By leveraging Faster-RCNN’s flexibility and precision, the system ensures accurate detection of key objects, laying the groundwork for subsequent modules, such as text recognition and contextual information integration [8].
To ensure the accurate detection of road sign components, the Faster-RCNN model was optimized with customized anchor box configurations. Unlike conventional object detection tasks, road sign elements, such as directional arrows and text, are often small and densely packed. To address this, the anchor box sizes were refined to [8 × 8, 16 × 16, 32 × 32, 64 × 64, 128 × 128], improving detection accuracy. Additionally, text detection and recognition benefited from data augmentation techniques such as rotation, scaling, and contrast adjustment, which enhanced recognition under varying environmental conditions.
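The anchor configuration reported above can be expressed with torchvision as in the sketch below. The backbone choice (MobileNetV2 with a single feature map) and the number of classes are assumptions made for illustration; the paper specifies only the anchor sizes and aspect ratios.

```python
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator

# Backbone is illustrative; the paper does not state which backbone was used.
backbone = torchvision.models.mobilenet_v2(weights="DEFAULT").features
backbone.out_channels = 1280  # FasterRCNN needs the backbone's output depth

# Anchor sizes follow the reported configuration (8, 16, 32, 64, 128 px)
# with 1:1, 1:2, and 2:1 aspect ratios for small, densely packed elements.
anchor_generator = AnchorGenerator(
    sizes=((8, 16, 32, 64, 128),),
    aspect_ratios=((0.5, 1.0, 2.0),),
)

# num_classes is a placeholder (e.g., text, symbol, arrow + background).
model = FasterRCNN(
    backbone,
    num_classes=4,
    rpn_anchor_generator=anchor_generator,
)
```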
After identifying the road signs using object detection, the next essential step is extracting textual information directly from these detected objects. Therefore, our system integrates object detection with robust text detection and optical character recognition (OCR), ensuring accurate and contextually relevant results.
Following object detection, the system focuses on identifying and recognizing text within road sign images. The CRAFT model was selected for its ability to detect text regions with high Precision, even in complex layouts and orientations [8,11]. CRAFT identifies character-level features, which are particularly effective for road signs containing multilingual text.
Once text regions are detected, OCR techniques are applied to convert them into machine-readable formats. To enhance the model’s robustness, the training dataset included a combination of Korean and English text, reflecting the diversity of road signs. Data augmentation techniques such as rotation, scaling, brightness adjustment, and noise addition were employed to simulate real-world scenarios. These efforts significantly improved the model’s performance under various lighting conditions and environmental complexities [12,13].
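A minimal torchvision sketch of the augmentations named above (rotation, scaling, brightness adjustment, and noise addition) follows; the specific magnitudes are assumptions, as the paper does not report them.

```python
import torch
from torchvision import transforms

def add_gaussian_noise(img: torch.Tensor, std: float = 0.02) -> torch.Tensor:
    """Add mild Gaussian noise to a [0, 1] image tensor and re-clamp."""
    return torch.clamp(img + torch.randn_like(img) * std, 0.0, 1.0)

# Rotation, scaling, brightness/contrast jitter, and noise, roughly matching
# the augmentations listed in the text; the parameter values are illustrative.
train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=5),
    transforms.RandomAffine(degrees=0, scale=(0.9, 1.1)),
    transforms.ColorJitter(brightness=0.3, contrast=0.3),
    transforms.ToTensor(),
    transforms.Lambda(add_gaussian_noise),
])
```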

3.2.2. Directional Information Extraction for Road Sign Objects

Directional information is critical for understanding the spatial and contextual relationships among objects within a road sign image. As shown in Figure 4, directional information is indicated by colored circles, where each color consistently corresponds to a specific direction, providing additional contextual meaning beyond basic text detection. The system utilizes DeepLabV3, a semantic segmentation model, to group objects based on their spatial alignment and orientation [6,7]. This process involves segmenting road sign images into distinct regions, enabling the system to establish directional relationships between elements such as text and arrows.
To further enhance segmentation performance, the system leverages a multi-scale approach to improve object boundary detection, inspired by techniques like FPNs (Feature Pyramid Networks), Panoptic FPNs, and PSP (Pyramid Scene Parsing) [14,15,16,17]. These methods allow the system to better handle overlapping objects and maintain contextual coherence, especially in complex road sign layouts.
By accurately associating each detected object with its corresponding directional context, this module forms the basis for downstream tasks such as contextual information integration. The directional grouping ensures that detected elements are interpreted correctly within the context of the road sign, enhancing the system’s utility for navigation and traffic management [16].
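The paper does not specify the exact grouping rule, so the sketch below illustrates one plausible implementation: the segmentation output is assumed to be a per-pixel direction-region label map, and each detected element is assigned the dominant region label under its bounding box.

```python
import numpy as np

def group_by_direction(direction_mask: np.ndarray, boxes: list) -> list:
    """Assign each detected box the dominant direction-region label.

    direction_mask: H x W integer label map from the segmentation model
                    (0 = background, 1..K = direction regions).
    boxes: list of (x1, y1, x2, y2) boxes from the detector.
    Returns one region label per box (0 if the box covers no region).
    """
    groups = []
    for x1, y1, x2, y2 in boxes:
        patch = direction_mask[y1:y2, x1:x2]
        labels, counts = np.unique(patch[patch > 0], return_counts=True)
        groups.append(int(labels[np.argmax(counts)]) if len(labels) else 0)
    return groups
```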

3.3. Contextual Information Integration

The final stage of the proposed system integrates user-specific contextual information with the data extracted from road sign images. This process ensures that the outputs are not only accurate but also actionable and tailored to the user’s immediate needs. The integration relies on user data, such as location, speed, and destination, combined with textual and symbolic information extracted from the road sign images. A step-by-step reasoning approach, based on the CoT framework, is employed to generate meaningful outputs [8,11,16]. Examples of the text and contextual data extracted from road sign images, which are directly utilized in the contextual information integration process, are presented in Table 1.
By utilizing CoT, the system combines detected objects such as “Mokpo” and “27 km” with user-specific contextual data to create dynamic and intuitive insights. For example, by integrating the current vehicle speed of 60 km/h and a destination of Mokpo, the system produces outputs such as “You will reach Mokpo in approximately 27 min”. This approach transforms raw data into user-centric, actionable information that directly supports real-time decision making [17,18].
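The arrival-time estimate in this example follows directly from distance and speed; the helper below reproduces the arithmetic (27 km at 60 km/h is 27 min, since 60 km/h corresponds to 1 km per minute).

```python
def eta_minutes(distance_km: float, speed_kmh: float) -> float:
    """Estimated travel time in minutes at constant speed."""
    return distance_km / speed_kmh * 60.0

print(eta_minutes(27, 60))  # -> 27.0, matching the example output for Mokpo
```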

3.3.1. User Context

User-specific data provide critical context for interpreting road sign information and generating personalized outputs. These data include the user’s current location, vehicle speed, and intended destination. An overview of the flow of user contextual data utilized by the proposed system is illustrated in Figure 5. These elements are crucial for transforming static information into dynamic and actionable insights [8,11].
For example, the user’s location determines which road sign is relevant in a given scenario, while the vehicle’s speed provides essential context for estimating travel times. The destination ensures that the system filters road sign information to prioritize those directly related to the user’s route. By incorporating these user-specific data points, the system aligns its outputs with the user’s real-world needs, enhancing its practical utility [15,16].
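A minimal record of the user context illustrated in Figure 5 might look like the sketch below; the field names and example values are illustrative rather than the system’s actual data schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class UserContext:
    location: Tuple[float, float]  # current GPS position (latitude, longitude)
    speed_kmh: float               # current vehicle speed
    destination: str               # e.g. "Mokpo"
    weather: Optional[str] = None  # e.g. "sunny" (cf. the contextual data in Table 1)

# Illustrative values only.
ctx = UserContext(location=(34.81, 126.39), speed_kmh=60.0,
                  destination="Mokpo", weather="sunny")
```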

3.3.2. Chain of Thought-Based Information Processing

To effectively combine the user context with the extracted road sign data, the system employs a step-by-step reasoning process based on the CoT framework. The overall workflow of the proposed system, integrating user context and extracted road sign data, is illustrated clearly in Figure 6. Unlike traditional text processing methods, which focus on static data interpretation, CoT facilitates the logical integration of user data with individually detected objects from road sign images [17,18].
The system begins by processing individual objects detected from the road signs. For instance, the text detection module may identify separate objects such as “Mokpo” and “27 km”, each with their respective coordinates and labels. These objects are stored as structured data and associated with their contextual meaning based on their position and category. For example, “Mokpo” is recognized as a destination, while “27 km” is interpreted as a distance [11,17].
Using this structured data, the system combines user-specific inputs such as the current vehicle speed and destination. The CoT reasoning framework integrates these pieces of information to generate logical outputs. For instance, it produces a user-specific message such as, “At your current speed, you will reach Mokpo in approximately 27 min”. This step-by-step integration ensures the information is dynamically tailored to the user’s journey while maintaining high accuracy and relevance [18].
By treating each detected object as a standalone piece of information and logically assembling them with user data, the system transitions from raw data processing to dynamic and context-aware information delivery. This structured approach makes the system particularly suitable for real-time navigation and decision-making scenarios, where actionable insights are critical [8,17,18].
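The paper does not publish its prompt or reasoning template, so the following sketch only illustrates how the structured detections and user context could be serialized into a step-by-step prompt for an LLM; the wording and field names are assumptions.

```python
def build_cot_prompt(elements: list, ctx: dict) -> str:
    """Serialize detected sign elements and user context into a
    step-by-step reasoning prompt (illustrative wording only)."""
    lines = ["Detected road sign elements:"]
    for e in elements:
        lines.append(f"- {e['label']} ({e['kind']}), direction group {e['group']}")
    lines += [
        "",
        f"User context: speed {ctx['speed_kmh']} km/h, destination {ctx['destination']}.",
        "",
        "Reason step by step:",
        "1. Identify which elements refer to the user's destination.",
        "2. Pair each destination with its distance and direction.",
        "3. Estimate the arrival time from the distance and current speed.",
        "4. State one concise, user-adapted instruction.",
    ]
    return "\n".join(lines)

elements = [{"label": "Mokpo", "kind": "text", "group": 1},
            {"label": "27 km", "kind": "text", "group": 1}]
print(build_cot_prompt(elements, {"speed_kmh": 60, "destination": "Mokpo"}))
```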

4. Experiments and Results

In this section, we present experimental results evaluating the proposed system. Firstly, we assess the performance of the road sign recognition modules, including object detection and text extraction accuracy. Secondly, we perform a comparative analysis of our system against traditional image captioning models, demonstrating how context-aware integration significantly enhances usability and interpretability.

4.1. Road Sign Recognition Performance

The performance of the proposed system was evaluated across three key tasks: object detection, text detection and recognition, and directional information extraction. For object detection, the Faster-RCNN model demonstrated high accuracy in identifying textual and symbolic elements within road sign images. Optimized anchor box configurations and aspect ratios enabled the effective detection of small objects such as text and directional arrows. The anchor box sizes were set to [8 × 8, 16 × 16, 32 × 32, 64 × 64, 128 × 128], and the aspect ratios were configured as [1:1, 1:2, 2:1] to match the observed size distributions in the dataset. The model was trained with a learning rate of 0.001, a weight decay of 0.0001, and a batch size of 16, over 100 epochs, with early stopping applied to prevent overfitting. These settings achieved a Precision of 94.3% and Recall of 92.1% for text detection, and 91.5% and 89.4% for arrow detection, with an average IOU of 0.84. Robust detection performance was maintained under various conditions, including low-light and partially occluded scenarios [1,6,7].
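The reported hyperparameters can be captured as in the sketch below. The optimizer type (SGD with momentum) and the early-stopping patience are assumptions, since the paper states only the learning rate, weight decay, batch size, epoch budget, and the use of early stopping.

```python
import torch

LEARNING_RATE = 1e-3   # as reported
WEIGHT_DECAY = 1e-4    # as reported
BATCH_SIZE = 16        # as reported
MAX_EPOCHS = 100       # as reported
PATIENCE = 10          # assumed early-stopping patience

def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # SGD with momentum is an assumption; the paper does not name the optimizer.
    return torch.optim.SGD(model.parameters(), lr=LEARNING_RATE,
                           momentum=0.9, weight_decay=WEIGHT_DECAY)

def should_stop(val_losses: list) -> bool:
    """Stop when validation loss has not improved for PATIENCE epochs."""
    if len(val_losses) <= PATIENCE:
        return False
    return min(val_losses[-PATIENCE:]) >= min(val_losses[:-PATIENCE])
```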
Text detection and recognition were carried out using the CRAFT model combined with OCR techniques, yielding a Precision of 93.7%, Recall of 91.9%, and Word-Level and Character-Level Accuracies of 89.5% and 96.2%, respectively. The use of data augmentation techniques, such as rotation, scaling, and brightness adjustments, enhanced recognition performance, particularly under challenging conditions, with a 3.4% improvement in Word-Level Accuracy compared to non-augmented data [8,12].
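The paper does not define its Word-Level and Character-Level Accuracy metrics explicitly; one common formulation, sketched below, scores words by exact match and characters by normalized edit distance.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def char_accuracy(pred: str, target: str) -> float:
    """1 - normalized edit distance, a common character-level metric."""
    if not target:
        return float(pred == target)
    return max(0.0, 1.0 - edit_distance(pred, target) / len(target))

def word_accuracy(preds: list, targets: list) -> float:
    """Fraction of predicted words that exactly match the ground truth."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)
```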
For directional information extraction, the DeepLabV3 model segmented road sign images into distinct regions, enabling the accurate grouping of objects based on their spatial alignment and orientation. This process achieved a Precision of 92.3%, Recall of 90.8%, and a directional grouping accuracy of 88.1%. Despite this, some errors were noted when text overlapped with directional symbols, highlighting potential areas for further optimization. Overall, the system demonstrated robust performance across all modules, confirming its capability to handle real-world scenarios and laying the groundwork for effective contextual information integration [6,7,16].

4.2. Comparative Analysis with Image Captioning Models

The comparison between the proposed system and traditional image captioning models highlights the significant differences in their ability to generate actionable and context-aware outputs. While image captioning models are capable of producing textual descriptions of visual data, they primarily focus on static, surface-level information extracted directly from the image. This often results in outputs that require additional interpretation by the user to be meaningful in specific contexts [4,5].
Table 2 and Table 3 further illustrate this distinction by presenting a side-by-side comparison of raw text extraction and context-aware outputs. In Table 2, the extracted data consist solely of recognized text from road signs, without additional contextual reasoning. In contrast, Table 3 showcases the proposed system’s ability to integrate user-specific contextual data—such as location, speed, and destination—with detected road sign objects. The key difference lies in the application of semantic inference, where Table 3’s outputs, marked in blue, demonstrate the transformation of raw textual data into actionable insights.
For example, while an image captioning model might describe a road sign as “A green sign with the text ‘27 km’ and ‘Mokpo’”, this information lacks immediate utility for the user [3,11]. Conversely, the proposed system contextualizes these data by generating outputs such as “At your current speed of 60 km/h, you will arrive in Mokpo in approximately 27 min”. These blue-labeled outputs represent information that is not only relevant but also immediately actionable and intuitive for the user. By reducing the cognitive effort required for decision making, this approach significantly enhances real-world usability [17,18].
While Table 2 provides a direct textual output from road signs, Table 3 demonstrates how our approach enhances readability and usability by embedding user-adapted contextual reasoning. The text highlighted in blue in Table 3 represents instances where semantic information was inferred, further proving the superiority of our contextualization method over traditional text extraction approaches. This process ensures that extracted information is not merely descriptive but directly contributes to user-oriented decision making in real-world applications such as navigation and traffic management [8,17,18]. Moreover, our experiments show that the proposed CoT-based reasoning achieves an average processing time of approximately 4–5 s per image using a standard GPU (RTX 3090). This represents a significant improvement over traditional image captioning models, which typically require 20–30 s per image. Although not yet strictly real-time, this processing speed greatly enhances the practical usability of the system in realistic navigation scenarios.

5. Conclusions

The proposed system demonstrates a robust and effective approach to road sign recognition, addressing the key limitations of traditional methods. By employing a modular architecture that integrates Faster-RCNN, CRAFT, and DeepLabV3, the system excels in detecting, recognizing, and extracting directional information from road sign images, achieving high performance across standard metrics such as Precision, Recall, and IOU. The use of the Chain of Thought framework for contextual information integration further enhances the system’s ability to generate user-adapted and actionable insights, making it well-suited for real-world applications like navigation and traffic management. Compared to traditional image captioning models, which provide static textual descriptions, the proposed system effectively combines detected objects with user-specific data, such as location, speed, and destination, to deliver intuitive outputs tailored to the user’s immediate needs.
Although this study focuses on the integration of image-based contextual information with LLMs, real-time processing is a critical factor for deployment in actual road environments. However, direct comparisons of computational performance between image captioning models and CoT-based methods are challenging, as our approach pre-processes extracted data before applying LLMs with user-adapted information. Future research will explore optimization techniques, such as model compression and parallel processing, to improve real-time applicability.
While the system demonstrates robust performance under diverse conditions, challenges remain, such as handling text–symbol overlaps and extending evaluations to extreme environmental conditions like heavy rain or poor lighting. Future work will focus on improving robustness in these scenarios, incorporating additional contextual data such as real-time traffic or weather information, and optimizing real-time processing capabilities to enhance the system’s practicality. With these advancements, the proposed system holds significant potential for broader applications in navigation, logistics, and smart traffic management.

Author Contributions

Conceptualization, methodology, software, data analysis, and writing—original draft preparation were all performed by B.K. Supervision and project administration were handled by the corresponding author, Y.S. Writing—review and editing were conducted by both B.K. and Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Culture, Sports and Tourism R&D Program through the Korea Creative Content Agency grant funded by the Ministry of Culture, Sports and Tourism in 2025 (Project Name: “Development of AI-Based Personalized Korean Language Education and Korean Cultural Experience Technology for Improving Foreigners’ Accessibility to Korean Culture”, Project Number: RS-2024-00359248, Contribution Rate: 100%).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The road sign images used in this study were provided by the Ministry of Land, Infrastructure and Transport of the Republic of Korea. These images were publicly available at the time of data collection but are no longer accessible. Therefore, the data are not available for public access.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  2. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2016, 2016, 779–788. [Google Scholar] [CrossRef]
  3. Bochkovskiy, A.; Wang, C.; Liao, H.-Y. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  4. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Bengio, Y. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Proc. Int. Conf. Mach. Learn. 2015, 2015, 2048–2057. [Google Scholar]
  5. Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2018, 2018, 6077–6086. [Google Scholar] [CrossRef]
  6. Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  7. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2017, 2017, 2881–2890. [Google Scholar] [CrossRef]
  8. Baek, Y.; Lee, B.; Han, D.; Yun, S.; Lee, H. Character Region Awareness for Text Detection. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. 2019, 2019, 9365–9374. [Google Scholar] [CrossRef]
  9. Xu, Y.; Zhang, J.; Wang, H. Sign-YOLO: Traffic Sign Detection Using Attention-Based YOLOv7. IEEE Trans. Intell. Transp. Syst. 2023, 25, 3452–3465. [Google Scholar] [CrossRef]
  10. Chen, L.; Liu, Z.; Yang, K. Cross-domain Few-shot In-context Learning for Enhancing Traffic Sign Recognition. arXiv 2024, arXiv:2407.05814. [Google Scholar]
  11. Nye, M.; Andreassen, A.J.; Gur-Ari, G.; Michalewski, H.; Austin, J.; Bieber, D.; Dohan, D.; Lewkowycz, A.; Bosma, M.; Luan, D.; et al. Show Your Work: Scratchpads for Intermediate Computation with Language Models. NeurIPS 2021, 2021, 10089–10100. [Google Scholar]
  12. Wang, T. Research on Multilingual Natural Scene Text Detection Algorithm. arXiv 2023, arXiv:2312.11153. [Google Scholar] [CrossRef]
  13. Zhang, X.; Karatzas, D.; Lu, S.; Zhu, X. Augmenting Scene Text Detection with Synthetic Data Using GANs. Proc. Int. Conf. Doc. Anal. Recognit. 2019, 2019, 1200–1210. [Google Scholar] [CrossRef]
  14. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2017, 2017, 2117–2125. [Google Scholar] [CrossRef]
  15. Kirillov, A.; He, K.; Girshick, R.; Rother, C.; Dollár, P. Panoptic Feature Pyramid Networks. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2019, 2019, 6399–6408. [Google Scholar] [CrossRef]
  16. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2015, 2015, 3431–3440. [Google Scholar] [CrossRef]
  17. Wei, J.; Wang, X.; Schuurmans, D.; Le, Q.V.; Chi, E.H.; Zhou, D. Chain of Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2022, arXiv:2201.11903. [Google Scholar]
  18. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. NeurIPS 2020, 2020, 1877–1901. [Google Scholar]
Figure 1. Road sign image with internal objects and text. Note: The Korean text appears naturally in the images as the dataset used in this research consists of actual road sign images collected from Republic of Korea. The red boxes indicate individual objects (texts or symbols) detected within the images, and the numbers correspond to the extracted text labels listed in Table 1.
Figure 2. Examples of real-world road sign images. The images naturally contain Korean text as they were collected from actual road signs in the Republic of Korea (as explained in Figure 1). The white boxes highlight individual object detection results within the road sign images, demonstrating the object detection process.
Figure 3. Road sign data augmentation using road sign object information. Each box represents a synthetic road sign generated for data augmentation purposes, and the internal texts and symbols within the boxes represent individual road sign objects extracted from real-world road signs in the Republic of Korea (See Figure 1 for additional context regarding Korean text).
Figure 4. Examples of data acquired by the image recognition model. The red boxes indicate detected text or symbol objects within the road sign images. The different colored circles represent directional information, where each color corresponds consistently to a specific direction. Numbers represent road number objects identified by the model. The Korean text naturally appears because the images are obtained from actual road signs in the Republic of Korea (as described previously in Figure 1).
Figure 5. User contextual data flow.
Figure 6. Overall flowchart of the proposed system.
Table 1. Examples of text information extracted from road sign recognition.
Source Image | Contextual Data | Extracted Data from Image Model
Applsci 15 06051 i001 | Mokpo; 60 km/h; 15; sunny | ‘mokpo’, [94, 45, 111, 88]; ‘27 km’, [274, 56, 292, 99]; ‘illlo IC’, [89, 159, 103, 219]; ‘14 km’, [273, 164, 293, 212]; symbol 15, [29, 184, 89, 242]
Applsci 15 06051 i002 | Paldang; 50 km/h; none; cloudy | ‘Sudong’, [99, 160, 158, 201]; ‘Maseok’, [101, 212, 154, 246]; ‘Paldang’, [283, 163, 330, 202]; ‘Deokso’, [286, 211, 335, 245]; ‘창현교차로 (Changhyeon intersection)’, [227, 84, 342, 114]
Applsci 15 06051 i003 | Gangneung; 50 km/h; 456; cloudy | ‘Hoenggye IC’, [39, 135, 142, 195]; ‘Hoenggye’, [184, 78, 346, 121]; ‘Gangneung’, [377, 132, 482, 193]; ‘Daegwallyeong’, [378, 214, 508, 271]; symbol 456, [318, 131, 368, 163]
This table provides examples of text information detected from road sign images. The output values presented in this table will be used in the experiments in Section 4. Note: The source images naturally include Korean text as they represent real-world road signs captured in the Republic of Korea. The non-English text (Korean) is essential in this research because the primary objective is to demonstrate accurate extraction of text and symbol information from road signs, regardless of the specific language. The colors, boxes, or numbers appearing within images are visual indicators from the data extraction model and are not directly relevant to the key experimental results.
Table 2. Result of the image captioning model.
Source Image | Image Captioning Output | Relevant
Applsci 15 06051 i004 | A close up of a street sign with a sky background | Not
| A close up of a street sign with a street sign | Not
| A street sign on a pole on a street | Not
| Information boards with green and green signs | Not
| The image shows a green road sign, indicating the distances to the destinations | Moderate
Applsci 15 06051 i005 | A close up of street sign that reads Korean | Moderate
| A close up of street sign with traffic light | Not
| A street sign that is on a pole | Not
| A sign in English and Korean with a green sign above it | Moderate
| The image shows a road sign with directions in Korean and English | Moderate
Applsci 15 06051 i006 | A close up of a street sign with a sky background | Not
| A close up of a street sign with a sign on it | Not
| A street sign on a street | Not
| Signs in Korean and signs indicating there | Not
| A road sign displaying directions and names of various locations | Moderate
Table 3. Result of the tailored vision system.
System Data | System Output | Relevant
Applsci 15 06051 i007 | 27 km remaining to reach your destination | Highly
| Arrive in Mokpo in approximately 27 min | Highly
| There is an IlIlo IC, which is approximately 14 km away; this could be your next exit | Moderate
| 27 km left to Mokpo indicate that you are nearing | Highly
| The road number “15” is consistent throughout your route | Moderate
Applsci 15 06051 i008 | As you drive at 50 km/h, you may need to reduce speed when approaching Changhyeon Intersection | Highly
| Paldong and Deokso as your next destinations on the right | Highly
| As Changhyeon Intersection is your current location, check for traffic or any signals | Highly
| Deokso is your right, indicating that it is in the same direction as Paldong | Highly
| Based on the signs, it appears that Paldong is located to your right | Highly
Applsci 15 06051 i009 | You are heading toward Gangneung, which is indicated to be on your right | Highly
| You are currently driving at 50 km/h on road number 456 | Moderate
| “Daegwallyeong” is also on your right, | Moderate
| Hoenggye IC to your left indicates an alternate route, so if traffic conditions worsen, that may be an option | Highly
| be prepared to adjust speed as you approach intersections | Moderate
