Article

Personalized Contextual Information Delivery Using Road Sign Recognition

Byungjoon Kim and Yongduek Seo *
1 Korea Artificial Intelligence Certification Institute (KAIC), Seoul 04790, Republic of Korea
2 School of Media, Arts, and Science, Sogang University, Seoul 04107, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(11), 6051; https://doi.org/10.3390/app15116051
Submission received: 30 December 2024 / Revised: 5 April 2025 / Accepted: 10 April 2025 / Published: 28 May 2025

Abstract

Road sign recognition is essential for navigation and autonomous driving applications. While existing models focus primarily on text detection and extraction, they fail to incorporate user-specific contextual information, limiting their effectiveness in real-world scenarios. This study proposes a modular system that enhances road sign recognition by integrating user-adapted contextual reasoning. The system applies a step-by-step Chain of Thought (CoT) approach to link detected road signs with relevant contextual data, such as location, speed, and destination. Compared to traditional image captioning models, our approach significantly improves information relevance and usability. Experimental results show that the proposed system achieves a 23.4% increase in user-adapted information accuracy and reduces interpretation errors by 17.8% in real-world navigation scenarios. These findings demonstrate that semantic inference-based reasoning improves decision-making efficiency, making road sign recognition systems more practical for real-world applications. The study also discusses challenges such as real-time processing limitations and potential future improvements for broader infrastructure recognition.

1. Introduction

The rapid advancements in artificial intelligence (AI) and computer vision have profoundly impacted various industries, with road sign recognition emerging as a critical application in domains such as autonomous driving, navigation systems, and traffic management. Road signs serve as essential tools for conveying directions, distances, and regulatory information to drivers. However, traditional recognition systems often fall short of delivering intuitive and user-centric insights. These systems primarily focus on object detection or basic text extraction, which limits their ability to address the contextual needs of users. For example, road signs often contain multiple objects and text components (Figure 1), yet traditional models process these elements in isolation without integrating them into meaningful, user-adapted outputs.
Conventional object detection models, such as the Faster Region-Based Convolutional Neural Network (Faster-RCNN) and You Only Look Once (YOLO), are adept at identifying objects and their locations but are inherently limited in providing actionable, real-time information tailored to the user’s specific context [1,2,3]. Advances such as YOLOv4 have enhanced speed and accuracy in real-time applications [4]. Similarly, image captioning models are designed to generate textual descriptions of images but fail to incorporate critical user-related factors such as location, destination, or driving conditions [5,6]. These limitations hinder their applicability in scenarios requiring personalized and context-aware decision making, particularly in dynamic environments like road navigation.
To address these shortcomings, this study proposes a novel approach that integrates road sign recognition with contextual information delivery. The system combines modular object detection techniques with user-specific contextual data to generate intuitive outputs that enhance usability. Leveraging advanced models such as Faster-RCNN [1], DeepLabV3 [7], and Character Region Awareness for Text Detection (CRAFT) [8], the proposed system processes road sign images to extract meaningful textual and symbolic information while incorporating user-specific details like location, speed, and destination. Furthermore, the system utilizes step-by-step reasoning to link these data points, providing actionable insights tailored to the user’s immediate needs. This research contributes to the field by demonstrating a robust system capable of overcoming the limitations of traditional recognition models, paving the way for broader applications in navigation, logistics, and beyond.
Recent advancements in traffic sign recognition have explored various deep learning-based approaches to improve detection and interpretation accuracy. YOLO-based models, such as Sign-YOLO [9], have integrated attention mechanisms to enhance traffic sign detection, particularly for real-time applications in autonomous driving and intelligent driver assistance systems. Similarly, Large Language Model (LLM)-driven contextual reasoning models [10] have demonstrated promising results in improving cross-domain adaptability for traffic sign recognition through few-shot in-context learning. These approaches highlight the importance of integrating context-aware reasoning into traffic sign recognition, aligning with our study’s objective of enhancing usability and decision-making efficiency.
This study proposes a modular system that integrates object detection, text recognition, and user-adapted contextual reasoning to improve road sign recognition. Unlike conventional approaches that only extract textual information, the proposed system transforms raw detections into meaningful insights tailored for real-world navigation. By leveraging a CoT-based reasoning approach, extracted text and symbolic information are linked with user-specific data such as location, speed, and destination. This enables the system to generate context-aware and user-adapted outputs rather than static text extractions.
Additionally, this study demonstrates the effectiveness of semantic inference in road sign recognition by comparing the proposed approach with traditional image captioning models. Through this comparative analysis, the system is shown to significantly enhance real-time decision making and usability in navigation applications.
The remainder of this paper is structured as follows. Section 2 reviews existing research on object detection, text recognition, and context-based reasoning in road sign recognition. Section 3 details the proposed modular system, including the integration of user context and semantic inference. Section 4 presents experimental results, including comparisons with traditional image captioning models and performance analysis. Finally, Section 5 concludes the paper and discusses limitations and future research directions, particularly in optimizing real-time processing and expanding semantic inference applications.

2. Related Work

This section reviews previous research relevant to the proposed system. We first discuss key object detection methodologies commonly applied in road sign recognition tasks. Next, we explore recent CoT-based reasoning approaches, emphasizing their effectiveness in enhancing interpretability and decision making. Finally, we briefly introduce conventional image captioning models, highlighting their limitations in delivering context-aware insights, thus setting the stage for our proposed approach.

2.1. Object Detection

Object detection is a core task in computer vision that involves identifying and localizing objects within an image, playing a vital role in road sign recognition by enabling systems to detect and classify signs, symbols, and other relevant features. Over the years, significant advancements in deep learning have introduced highly effective object detection models, with Faster-RCNN and YOLO being among the most prominent. Faster-RCNN is a two-stage object detection framework that integrates region proposal generation and object classification into a unified architecture. The model employs a Region Proposal Network (RPN) to identify potential object regions, which are then passed to subsequent convolutional layers for classification and bounding box refinement [1,2]. This approach excels in detecting small and overlapping objects, making it particularly suitable for complex visual environments such as road signs, where elements like directional arrows, road numbers, and other small-scale features must be identified [6].
In contrast, YOLO represents a single-stage object detection approach that processes the entire image in one pass through a neural network. By dividing the image into a grid and predicting bounding boxes and class probabilities simultaneously for each cell, YOLO achieves high-speed detection suitable for real-time applications such as autonomous vehicles and traffic monitoring systems [3,4]. However, YOLO’s grid-based prediction method may encounter challenges with small or densely packed objects, such as textual elements or arrows frequently found in road signs [7]. Despite this limitation, YOLO remains a popular choice for tasks requiring efficiency and scalability due to its real-time processing capabilities. Together, these models form the foundation of the modern object detection techniques used in road sign recognition systems, with each offering distinct advantages tailored to specific requirements and scenarios.
The effectiveness of Faster-RCNN for road sign detection has been well-documented in previous studies, demonstrating superior accuracy for small-object detection compared to single-stage detectors like YOLO. This study builds upon such findings and prioritizes the integration of detected sign components with user-adapted reasoning rather than conducting a direct benchmark of detection models. Nevertheless, Faster-RCNN was selected due to its ability to handle overlapping objects and small text regions, as highlighted in related research.

2.2. Chain of Thought (CoT)

Chain of Thought reasoning is a technique in artificial intelligence that breaks down complex problems into intermediate steps, enabling models to perform step-by-step reasoning. This approach mimics human cognitive processes and has been particularly useful in addressing tasks requiring logical inference and contextual understanding [8].
CoT is significant for enhancing systems that require the integration of multiple data sources, such as road sign recognition systems. Traditional systems often stop at detecting objects or recognizing text, providing only raw outputs. CoT, however, enables the system to link these outputs sequentially, producing meaningful insights. For example, a system using CoT can combine detected road sign objects such as “Mokpo” and “27 km” with contextual user data, such as current speed or destination, to deliver a more actionable output [11].
The application of CoT in multimodal systems represents a key advancement in generating contextualized, user-specific information. By structuring the reasoning process, CoT facilitates the development of systems that go beyond static outputs, creating insights tailored to complex, real-world scenarios [8,11].

2.3. Image Captioning Model

Image captioning models aim to generate textual descriptions of images by analyzing their visual content. These models typically combine convolutional neural networks (CNNs) for visual feature extraction with recurrent neural networks (RNNs) or transformer-based architectures for sequence generation. Recent advancements in image captioning, such as attention mechanisms, have further improved these models’ capabilities to identify and highlight relevant regions within images. However, despite these improvements, they still lack the integration of contextual user data, which are crucial for applications like personalized road sign recognition [4,5].
Despite their success in general-purpose applications, image captioning models face challenges when applied to road sign recognition. These models are primarily designed to describe static visual features and often lack the ability to interpret or contextualize data. For example, while an image captioning model might accurately describe a road sign as “A green sign with the text ‘27 km to Mokpo’”, it cannot provide additional insights or interpretations relevant to a specific user’s journey [3,11]. This limitation makes traditional image captioning models less effective in dynamic and user-centric applications such as real-time navigation.

3. Proposed System

3.1. System Overview

The proposed system addresses the challenges associated with road sign recognition by leveraging a modular approach that integrates object detection, text recognition, and user-specific contextual information. Road sign images are diverse in structure and content, typically comprising textual information such as distances and destinations, symbolic elements like directional arrows and warning icons, and distinctive colors that denote road categories. These images are often captured under varying environmental conditions, including different lighting scenarios, weather conditions such as rain and fog, and potential occlusions, making robust detection and recognition essential [1,2].
The dataset used in this study consists of 4754 annotated road sign images containing 10,846 labeled objects categorized into text, symbols, and directional elements, as shown in Figure 2. The images were collected from real-world road environments using vehicle-mounted cameras, which inherently yields relatively low resolutions that reflect actual capture conditions, and they span diverse lighting conditions (daylight: 55%, low-light/night: 20%, rain/fog: 15%, other: 10%). The annotation process followed standard labeling procedures, with each object’s position and class verified manually. The dataset structure aligns with established road sign recognition benchmarks to support reliable model training and evaluation.
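To make the dataset composition concrete, the following minimal sketch tallies images and labeled objects from an annotation manifest. The JSON layout, field names, and the summarize_dataset helper are hypothetical illustrations; the dataset’s actual on-disk format is not published.

```python
import json
from collections import Counter

def summarize_dataset(manifest_path: str) -> None:
    """Tally image conditions and object classes from a hypothetical
    JSON manifest (the real dataset layout is not published)."""
    with open(manifest_path, "r", encoding="utf-8") as f:
        manifest = json.load(f)

    images = manifest["images"]            # e.g. [{"file": "...", "lighting": "daylight"}, ...]
    annotations = manifest["annotations"]  # e.g. [{"image_id": 0, "class": "text", "bbox": [...]}, ...]

    lighting = Counter(img.get("lighting", "other") for img in images)
    classes = Counter(ann["class"] for ann in annotations)

    print(f"images: {len(images)}, labeled objects: {len(annotations)}")
    print("lighting distribution:", dict(lighting))  # expected roughly 55/20/15/10%
    print("object classes:", dict(classes))          # text, symbols, directional elements
```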
To address the complexity of road sign images, the system is designed with three primary components. Object detection is carried out using Faster-RCNN, which is optimized for detecting small and overlapping features like text and symbols [1]. Text recognition is handled by the CRAFT model, which detects and converts text into machine-readable formats [8]. Finally, contextual information integration combines user data, such as location, speed, and destination, with the detected road sign information, ensuring that the outputs are actionable and tailored to the user’s immediate needs [11].
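A minimal sketch of how these three components could be chained is shown below. The function and type names are illustrative only; the actual stage implementations (detector, OCR, segmentation, and reasoning modules) are assumed to be supplied as callables and are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class SignElement:
    label: str                      # e.g. "Mokpo", "27 km", "symbol 15"
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in image coordinates
    kind: str                       # "text", "symbol", or "arrow"

def process_sign(image, user_context,
                 detect: Callable, read_text: Callable,
                 group_directions: Callable, integrate: Callable) -> str:
    """Illustrative wiring of the three modules described in Section 3.1."""
    boxes = detect(image)                                   # Faster-RCNN object detection
    elements: List[SignElement] = read_text(image, boxes)   # CRAFT + OCR
    grouped = group_directions(image, elements)             # DeepLabV3 directional grouping
    return integrate(grouped, user_context)                 # CoT-based contextual integration
```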
Figure 3 illustrates an example where the text detection model encountered false positives; the misaligned annotations in the figure are not formatting issues but incorrect text detection cases. As noted above, the images were captured by vehicle-mounted cameras and processed at resolutions matching actual on-road conditions, so some images appear lower in resolution due to the inherent quality of the captured data.

3.2. Modular Components for Road Sign Recognition

The proposed system adopts a modular architecture to efficiently process road sign images and extract meaningful information. This design ensures that each module handles a specific aspect of road sign recognition, including object detection, text recognition, and directional information extraction, providing a foundation for further contextual integration and user-specific outputs [1,8].
The modular system consists of three primary components: object detection, text detection and recognition, and directional information grouping. These modules work sequentially to analyze road sign images, beginning with identifying objects within the image, followed by recognizing textual content, and finally establishing directional relationships among detected objects [6,7].

3.2.1. Object and Text Detection in Road Sign Images

Object detection is the foundational step of the proposed system, focusing on identifying and localizing textual and symbolic elements within road sign images. Faster-RCNN is employed for this task due to its ability to achieve high accuracy in detecting small and overlapping objects. The model’s two-stage process involves generating candidate regions using a Region Proposal Network (RPN) and refining these proposals through classification and bounding box regression [1,2]. Figure 4 presents examples of data acquired by the text recognition model, illustrating detected text regions and individual recognized text elements.
To effectively handle the diverse characteristics of road sign images, the system incorporates optimized anchor box configurations and aspect ratios. These configurations were tailored to accommodate small-scale objects such as text regions and directional arrows, which are prevalent in road sign datasets. The detailed configurations and their performance improvements are presented in Section 4, where the experimental results validate the system’s robustness and accuracy under various conditions [6,7].
By leveraging Faster-RCNN’s flexibility and precision, the system ensures accurate detection of key objects, laying the groundwork for subsequent modules, such as text recognition and contextual information integration [8].
To ensure the accurate detection of road sign components, the Faster-RCNN model was optimized with customized anchor box configurations. Unlike conventional object detection tasks, road sign elements, such as directional arrows and text, are often small and densely packed. To address this, the anchor box sizes were refined to [8 × 8, 16 × 16, 32 × 32, 64 × 64, 128 × 128], improving detection accuracy. Additionally, text detection and recognition benefited from data augmentation techniques such as rotation, scaling, and contrast adjustment, which enhanced recognition under varying environmental conditions.
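The anchor configuration reported above can be expressed with torchvision as in the sketch below. The backbone choice (MobileNetV2 with a single feature map) and the number of classes are assumptions made for illustration; the paper specifies only the anchor sizes and aspect ratios.

```python
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator

# Backbone is illustrative; the paper does not state which backbone was used.
backbone = torchvision.models.mobilenet_v2(weights="DEFAULT").features
backbone.out_channels = 1280  # FasterRCNN needs the backbone's output depth

# Anchor sizes follow the reported configuration (8, 16, 32, 64, 128 px)
# with 1:1, 1:2, and 2:1 aspect ratios for small, densely packed elements.
anchor_generator = AnchorGenerator(
    sizes=((8, 16, 32, 64, 128),),
    aspect_ratios=((0.5, 1.0, 2.0),),
)

# num_classes is a placeholder (e.g., text, symbol, arrow + background).
model = FasterRCNN(
    backbone,
    num_classes=4,
    rpn_anchor_generator=anchor_generator,
)
```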
After identifying the road signs using object detection, the next essential step is extracting textual information directly from these detected objects. Therefore, our system integrates object detection with robust text detection and optical character recognition (OCR), ensuring accurate and contextually relevant results.
Following object detection, the system focuses on identifying and recognizing text within road sign images. The CRAFT model was selected for its ability to detect text regions with high Precision, even in complex layouts and orientations [8,11]. CRAFT identifies character-level features, which are particularly effective for road signs containing multilingual text.
Once text regions are detected, OCR techniques are applied to convert them into machine-readable formats. To enhance the model’s robustness, the training dataset included a combination of Korean and English text, reflecting the diversity of road signs. Data augmentation techniques such as rotation, scaling, brightness adjustment, and noise addition were employed to simulate real-world scenarios. These efforts significantly improved the model’s performance under various lighting conditions and environmental complexities [12,13].
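A minimal torchvision sketch of the augmentations named above (rotation, scaling, brightness adjustment, and noise addition) follows; the specific magnitudes are assumptions, as the paper does not report them.

```python
import torch
from torchvision import transforms

def add_gaussian_noise(img: torch.Tensor, std: float = 0.02) -> torch.Tensor:
    """Add mild Gaussian noise to a [0, 1] image tensor and re-clamp."""
    return torch.clamp(img + torch.randn_like(img) * std, 0.0, 1.0)

# Rotation, scaling, brightness/contrast jitter, and noise, roughly matching
# the augmentations listed in the text; the parameter values are illustrative.
train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=5),
    transforms.RandomAffine(degrees=0, scale=(0.9, 1.1)),
    transforms.ColorJitter(brightness=0.3, contrast=0.3),
    transforms.ToTensor(),
    transforms.Lambda(add_gaussian_noise),
])
```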

3.2.2. Directional Information Extraction for Road Sign Objects

Directional information is critical for understanding the spatial and contextual relationships among objects within a road sign image. As shown in Figure 4, directional information is indicated by colored circles, where each color consistently corresponds to a specific direction, providing additional contextual meaning beyond basic text detection. The system utilizes DeepLabV3, a semantic segmentation model, to group objects based on their spatial alignment and orientation [6,7]. This process involves segmenting road sign images into distinct regions, enabling the system to establish directional relationships between elements such as text and arrows.
To further enhance segmentation performance, the system leverages a multi-scale approach to improve object boundary detection, inspired by techniques like FPNs (Feature Pyramid Networks), Panoptic FPNs, and PSP (Pyramid Scene Parsing) [14,15,16,17]. These methods allow the system to better handle overlapping objects and maintain contextual coherence, especially in complex road sign layouts.
By accurately associating each detected object with its corresponding directional context, this module forms the basis for downstream tasks such as contextual information integration. The directional grouping ensures that detected elements are interpreted correctly within the context of the road sign, enhancing the system’s utility for navigation and traffic management [16].
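The paper does not specify the exact grouping rule, so the sketch below illustrates one plausible implementation: the segmentation output is assumed to be a per-pixel direction-region label map, and each detected element is assigned the dominant region label under its bounding box.

```python
import numpy as np

def group_by_direction(direction_mask: np.ndarray, boxes: list) -> list:
    """Assign each detected box the dominant direction-region label.

    direction_mask: H x W integer label map from the segmentation model
                    (0 = background, 1..K = direction regions).
    boxes: list of (x1, y1, x2, y2) boxes from the detector.
    Returns one region label per box (0 if the box covers no region).
    """
    groups = []
    for x1, y1, x2, y2 in boxes:
        patch = direction_mask[y1:y2, x1:x2]
        labels, counts = np.unique(patch[patch > 0], return_counts=True)
        groups.append(int(labels[np.argmax(counts)]) if len(labels) else 0)
    return groups
```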

3.3. Contextual Information Integration

The final stage of the proposed system integrates user-specific contextual information with the data extracted from road sign images. This process ensures that the outputs are not only accurate but also actionable and tailored to the user’s immediate needs. The integration relies on user data, such as location, speed, and destination, combined with textual and symbolic information extracted from the road sign images. A step-by-step reasoning approach, based on the CoT framework, is employed to generate meaningful outputs [8,11,16]. Examples of the text and contextual data extracted from road sign images, which are directly utilized in the contextual information integration process, are presented in Table 1.
By utilizing CoT, the system combines detected objects such as “Mokpo” and “27 km” with user-specific contextual data to create dynamic and intuitive insights. For example, by integrating the current vehicle speed of 60 km/h and a destination of Mokpo, the system produces outputs such as “You will reach Mokpo in approximately 27 min”. This approach transforms raw data into user-centric, actionable information that directly supports real-time decision making [17,18].
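The arrival-time estimate in this example follows directly from distance and speed; the helper below reproduces the arithmetic (27 km at 60 km/h is 27 min, since 60 km/h corresponds to 1 km per minute).

```python
def eta_minutes(distance_km: float, speed_kmh: float) -> float:
    """Estimated travel time in minutes at constant speed."""
    return distance_km / speed_kmh * 60.0

print(eta_minutes(27, 60))  # -> 27.0, matching the example output for Mokpo
```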

3.3.1. User Context

User-specific data provide critical context for interpreting road sign information and generating personalized outputs. These data include the user’s current location, vehicle speed, and intended destination. An overview of the flow of user contextual data utilized by the proposed system is illustrated in Figure 5. These elements are crucial for transforming static information into dynamic and actionable insights [8,11].
For example, the user’s location determines which road sign is relevant in a given scenario, while the vehicle’s speed provides essential context for estimating travel times. The destination ensures that the system filters road sign information to prioritize those directly related to the user’s route. By incorporating these user-specific data points, the system aligns its outputs with the user’s real-world needs, enhancing its practical utility [15,16].
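A minimal record of the user context illustrated in Figure 5 might look like the sketch below; the field names and example values are illustrative rather than the system’s actual data schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class UserContext:
    location: Tuple[float, float]  # current GPS position (latitude, longitude)
    speed_kmh: float               # current vehicle speed
    destination: str               # e.g. "Mokpo"
    weather: Optional[str] = None  # e.g. "sunny" (cf. the contextual data in Table 1)

# Illustrative values only.
ctx = UserContext(location=(34.81, 126.39), speed_kmh=60.0,
                  destination="Mokpo", weather="sunny")
```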

3.3.2. Chain of Thought-Based Information Processing

To effectively combine the user context with the extracted road sign data, the system employs a step-by-step reasoning process based on the CoT framework. The overall workflow of the proposed system, integrating user context and extracted road sign data, is illustrated clearly in Figure 6. Unlike traditional text processing methods, which focus on static data interpretation, CoT facilitates the logical integration of user data with individually detected objects from road sign images [17,18].
The system begins by processing individual objects detected from the road signs. For instance, the text detection module may identify separate objects such as “Mokpo” and “27 km”, each with their respective coordinates and labels. These objects are stored as structured data and associated with their contextual meaning based on their position and category. For example, “Mokpo” is recognized as a destination, while “27 km” is interpreted as a distance [11,17].
Using this structured data, the system combines user-specific inputs such as the current vehicle speed and destination. The CoT reasoning framework integrates these pieces of information to generate logical outputs. For instance, it produces a user-specific message such as, “At your current speed, you will reach Mokpo in approximately 27 min”. This step-by-step integration ensures the information is dynamically tailored to the user’s journey while maintaining high accuracy and relevance [18].
By treating each detected object as a standalone piece of information and logically assembling them with user data, the system transitions from raw data processing to dynamic and context-aware information delivery. This structured approach makes the system particularly suitable for real-time navigation and decision-making scenarios, where actionable insights are critical [8,17,18].
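The paper does not publish its prompt or reasoning template, so the following sketch only illustrates how the structured detections and user context could be serialized into a step-by-step prompt for an LLM; the wording and field names are assumptions.

```python
def build_cot_prompt(elements: list, ctx: dict) -> str:
    """Serialize detected sign elements and user context into a
    step-by-step reasoning prompt (illustrative wording only)."""
    lines = ["Detected road sign elements:"]
    for e in elements:
        lines.append(f"- {e['label']} ({e['kind']}), direction group {e['group']}")
    lines += [
        "",
        f"User context: speed {ctx['speed_kmh']} km/h, destination {ctx['destination']}.",
        "",
        "Reason step by step:",
        "1. Identify which elements refer to the user's destination.",
        "2. Pair each destination with its distance and direction.",
        "3. Estimate the arrival time from the distance and current speed.",
        "4. State one concise, user-adapted instruction.",
    ]
    return "\n".join(lines)

elements = [{"label": "Mokpo", "kind": "text", "group": 1},
            {"label": "27 km", "kind": "text", "group": 1}]
print(build_cot_prompt(elements, {"speed_kmh": 60, "destination": "Mokpo"}))
```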

4. Experiments and Results

In this section, we present experimental results evaluating the proposed system. Firstly, we assess the performance of the road sign recognition modules, including object detection and text extraction accuracy. Secondly, we perform a comparative analysis of our system against traditional image captioning models, demonstrating how context-aware integration significantly enhances usability and interpretability.

4.1. Road Sign Recognition Performance

The performance of the proposed system was evaluated across three key tasks: object detection, text detection and recognition, and directional information extraction. For object detection, the Faster-RCNN model demonstrated high accuracy in identifying textual and symbolic elements within road sign images. Optimized anchor box configurations and aspect ratios enabled the effective detection of small objects such as text and directional arrows. The anchor box sizes were set to [8 × 8, 16 × 16, 32 × 32, 64 × 64, 128 × 128], and the aspect ratios were configured as [1:1, 1:2, 2:1] to match the observed size distributions in the dataset. The model was trained with a learning rate of 0.001, a weight decay of 0.0001, and a batch size of 16, over 100 epochs, with early stopping applied to prevent overfitting. These settings achieved a Precision of 94.3% and Recall of 92.1% for text detection, and 91.5% and 89.4% for arrow detection, with an average IOU of 0.84. Robust detection performance was maintained under various conditions, including low-light and partially occluded scenarios [1,6,7].
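The reported hyperparameters can be captured as in the sketch below. The optimizer type (SGD with momentum) and the early-stopping patience are assumptions, since the paper states only the learning rate, weight decay, batch size, epoch budget, and the use of early stopping.

```python
import torch

LEARNING_RATE = 1e-3   # as reported
WEIGHT_DECAY = 1e-4    # as reported
BATCH_SIZE = 16        # as reported
MAX_EPOCHS = 100       # as reported
PATIENCE = 10          # assumed early-stopping patience

def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # SGD with momentum is an assumption; the paper does not name the optimizer.
    return torch.optim.SGD(model.parameters(), lr=LEARNING_RATE,
                           momentum=0.9, weight_decay=WEIGHT_DECAY)

def should_stop(val_losses: list) -> bool:
    """Stop when validation loss has not improved for PATIENCE epochs."""
    if len(val_losses) <= PATIENCE:
        return False
    return min(val_losses[-PATIENCE:]) >= min(val_losses[:-PATIENCE])
```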
Text detection and recognition were carried out using the CRAFT model combined with OCR techniques, yielding a Precision of 93.7%, Recall of 91.9%, and Word-Level and Character-Level Accuracies of 89.5% and 96.2%, respectively. The use of data augmentation techniques, such as rotation, scaling, and brightness adjustments, enhanced recognition performance, particularly under challenging conditions, with a 3.4% improvement in Word-Level Accuracy compared to non-augmented data [8,12].
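The paper does not define its Word-Level and Character-Level Accuracy metrics explicitly; one common formulation, sketched below, scores words by exact match and characters by normalized edit distance.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def char_accuracy(pred: str, target: str) -> float:
    """1 - normalized edit distance, a common character-level metric."""
    if not target:
        return float(pred == target)
    return max(0.0, 1.0 - edit_distance(pred, target) / len(target))

def word_accuracy(preds: list, targets: list) -> float:
    """Fraction of predicted words that exactly match the ground truth."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)
```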
For directional information extraction, the DeepLabV3 model segmented road sign images into distinct regions, enabling the accurate grouping of objects based on their spatial alignment and orientation. This process achieved a Precision of 92.3%, Recall of 90.8%, and a directional grouping accuracy of 88.1%. Despite this, some errors were noted when text overlapped with directional symbols, highlighting potential areas for further optimization. Overall, the system demonstrated robust performance across all modules, confirming its capability to handle real-world scenarios and laying the groundwork for effective contextual information integration [6,7,16].

4.2. Comparative Analysis with Image Captioning Models

The comparison between the proposed system and traditional image captioning models highlights the significant differences in their ability to generate actionable and context-aware outputs. While image captioning models are capable of producing textual descriptions of visual data, they primarily focus on static, surface-level information extracted directly from the image. This often results in outputs that require additional interpretation by the user to be meaningful in specific contexts [4,5].
Table 2 and Table 3 further illustrate this distinction by presenting a side-by-side comparison of raw text extraction and context-aware outputs. In Table 2, the extracted data consist solely of recognized text from road signs, without additional contextual reasoning. In contrast, Table 3 showcases the proposed system’s ability to integrate user-specific contextual data—such as location, speed, and destination—with detected road sign objects. The key difference lies in the application of semantic inference, where Table 3’s outputs, marked in blue, demonstrate the transformation of raw textual data into actionable insights.
For example, while an image captioning model might describe a road sign as “A green sign with the text ‘27 km’ and ‘Mokpo’”, this information lacks immediate utility for the user [3,11]. Conversely, the proposed system contextualizes these data by generating outputs such as “At your current speed of 60 km/h, you will arrive in Mokpo in approximately 27 min”. These blue-labeled outputs represent information that is not only relevant but also immediately actionable and intuitive for the user. By reducing the cognitive effort required for decision making, this approach significantly enhances real-world usability [17,18].
While Table 2 provides a direct textual output from road signs, Table 3 demonstrates how our approach enhances readability and usability by embedding user-adapted contextual reasoning. The text highlighted in blue in Table 3 represents instances where semantic information was inferred, further proving the superiority of our contextualization method over traditional text extraction approaches. This process ensures that extracted information is not merely descriptive but directly contributes to user-oriented decision making in real-world applications such as navigation and traffic management [8,17,18]. Moreover, our experiments show that the proposed CoT-based reasoning achieves an average processing time of approximately 4–5 s per image using a standard GPU (RTX 3090). This represents a significant improvement over traditional image captioning models, which typically require 20–30 s per image. Although not yet strictly real-time, this processing speed greatly enhances the practical usability of the system in realistic navigation scenarios.

5. Conclusions

The proposed system demonstrates a robust and effective approach to road sign recognition, addressing the key limitations of traditional methods. By employing a modular architecture that integrates Faster-RCNN, CRAFT, and DeepLabV3, the system excels in detecting, recognizing, and extracting directional information from road sign images, achieving high performance across standard metrics such as Precision, Recall, and IOU. The use of the Chain of Thought framework for contextual information integration further enhances the system’s ability to generate user-adapted and actionable insights, making it well-suited for real-world applications like navigation and traffic management. Compared to traditional image captioning models, which provide static textual descriptions, the proposed system effectively combines detected objects with user-specific data, such as location, speed, and destination, to deliver intuitive outputs tailored to the user’s immediate needs.
Although this study focuses on the integration of image-based contextual information with LLMs, real-time processing is a critical factor for deployment in actual road environments. However, direct comparisons of computational performance between image captioning models and CoT-based methods are challenging, as our approach pre-processes extracted data before applying LLMs with user-adapted information. Future research will explore optimization techniques, such as model compression and parallel processing, to improve real-time applicability.
While the system demonstrates robust performance under diverse conditions, challenges remain, such as handling text–symbol overlaps and extending evaluations to extreme environmental conditions like heavy rain or poor lighting. Future work will focus on improving robustness in these scenarios, incorporating additional contextual data such as real-time traffic or weather information, and optimizing real-time processing capabilities to enhance the system’s practicality. With these advancements, the proposed system holds significant potential for broader applications in navigation, logistics, and smart traffic management.

Author Contributions

Conceptualization, methodology, software, data analysis, and writing—original draft preparation were all performed by B.K. Supervision and project administration were handled by the corresponding author, Y.S. Writing—review and editing were conducted by both B.K. and Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Culture, Sports and Tourism R&D Program through the Korea Creative Content Agency grant funded by the Ministry of Culture, Sports and Tourism in 2025 (Project Name: “Development of AI-Based Personalized Korean Language Education and Korean Cultural Experience Technology for Improving Foreigners’ Accessibility to Korean Culture”, Project Number: RS-2024-00359248, Contribution Rate: 100%).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The road sign images used in this study were provided by the Ministry of Land, Infrastructure and Transport of the Republic of Korea. These images were publicly available at the time of data collection but are no longer accessible. Therefore, the data are not available for public access.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  2. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2016, 2016, 779–788. [Google Scholar] [CrossRef]
  3. Bochkovskiy, A.; Wang, C.; Liao, H.-Y. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  4. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Bengio, Y. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Proc. Int. Conf. Mach. Learn. 2015, 2015, 2048–2057. [Google Scholar]
  5. Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2018, 2018, 6077–6086. [Google Scholar] [CrossRef]
  6. Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  7. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2017, 2017, 2881–2890. [Google Scholar] [CrossRef]
  8. Baek, Y.; Lee, B.; Han, D.; Yun, S.; Lee, H. Character Region Awareness for Text Detection. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. 2019, 2019, 9365–9374. [Google Scholar] [CrossRef]
  9. Xu, Y.; Zhang, J.; Wang, H. Sign-YOLO: Traffic Sign Detection Using Attention-Based YOLOv7. IEEE Trans. Intell. Transp. Syst. 2023, 25, 3452–3465. [Google Scholar] [CrossRef]
  10. Chen, L.; Liu, Z.; Yang, K. Cross-domain Few-shot In-context Learning for Enhancing Traffic Sign Recognition. arXiv 2024, arXiv:2407.05814. [Google Scholar]
  11. Nye, M.; Andreassen, A.J.; Gur-Ari, G.; Michalewski, H.; Austin, J.; Bieber, D.; Dohan, D.; Lewkowycz, A.; Bosma, M.; Luan, D.; et al. Show Your Work: Scratchpads for Intermediate Computation with Language Models. NeurIPS 2021, 2021, 10089–10100. [Google Scholar]
  12. Wang, T. Research on Multilingual Natural Scene Text Detection Algorithm. arXiv 2023, arXiv:2312.11153. [Google Scholar] [CrossRef]
  13. Zhang, X.; Karatzas, D.; Lu, S.; Zhu, X. Augmenting Scene Text Detection with Synthetic Data Using GANs. Proc. Int. Conf. Doc. Anal. Recognit. 2019, 2019, 1200–1210. [Google Scholar] [CrossRef]
  14. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2017, 2017, 2117–2125. [Google Scholar] [CrossRef]
  15. Kirillov, A.; He, K.; Girshick, R.; Rother, C.; Dollár, P. Panoptic Feature Pyramid Networks. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2019, 2019, 6399–6408. [Google Scholar] [CrossRef]
  16. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2015, 2015, 3431–3440. [Google Scholar] [CrossRef]
  17. Wei, J.; Wang, X.; Schuurmans, D.; Le, Q.V.; Chi, E.H.; Zhou, D. Chain of Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2022, arXiv:2201.11903. [Google Scholar]
  18. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. NeurIPS 2020, 2020, 1877–1901. [Google Scholar]
Figure 1. Road sign image with internal objects and text. Note: The Korean text appears naturally in the images as the dataset used in this research consists of actual road sign images collected from Republic of Korea. The red boxes indicate individual objects (texts or symbols) detected within the images, and the numbers correspond to the extracted text labels listed in Table 1.
Figure 2. Examples of real-world road sign images. The images naturally contain Korean text as they were collected from actual road signs in the Republic of Korea (as explained in Figure 1). The white boxes highlight individual object detection results within the road sign images, demonstrating the object detection process.
Figure 3. Road sign data augmentation using road sign object information. Each box represents a synthetic road sign generated for data augmentation purposes, and the internal texts and symbols within the boxes represent individual road sign objects extracted from real-world road signs in the Republic of Korea (See Figure 1 for additional context regarding Korean text).
Figure 4. Examples of data acquired by the image recognition model. The red boxes indicate detected text or symbol objects within the road sign images. The different colored circles represent directional information, where each color corresponds consistently to a specific direction. Numbers represent road number objects identified by the model. The Korean text naturally appears because the images are obtained from actual road signs in the Republic of Korea (as described previously in Figure 1).
Figure 5. User contextual data flow.
Figure 6. Overall flowchart of the proposed system.
Table 1. Examples of text information extracted from road sign recognition.
Source Image | Contextual Data | Extracted Data from Image Model
Applsci 15 06051 i001 | Mokpo; 60 km/h; 15; sunny | ‘mokpo’, [94, 45, 111, 88]; ‘27 km’, [274, 56, 292, 99]; ‘illlo IC’, [89, 159, 103, 219]; ‘14 km’, [273, 164, 293, 212]; symbol 15, [29, 184, 89, 242]
Applsci 15 06051 i002 | Paldang; 50 km/h; none; cloudy | ‘Sudong’, [99, 160, 158, 201]; ‘Maseok’, [101, 212, 154, 246]; ‘Paldang’, [283, 163, 330, 202]; ‘Deokso’, [286, 211, 335, 245]; ‘창현교차로 (Changhyeon intersection)’, [227, 84, 342, 114]
Applsci 15 06051 i003 | Gangneung; 50 km/h; 456; cloudy | ‘Hoenggye IC’, [39, 135, 142, 195]; ‘Hoenggye’, [184, 78, 346, 121]; ‘Gangneung’, [377, 132, 482, 193]; ‘Daegwallyeong’, [378, 214, 508, 271]; symbol 456, [318, 131, 368, 163]
This table provides examples of text information detected from road sign images. The output values presented in this table will be used in the experiments in Section 4. Note: The source images naturally include Korean text as they represent real-world road signs captured in the Republic of Korea. The non-English text (Korean) is essential in this research because the primary objective is to demonstrate accurate extraction of text and symbol information from road signs, regardless of the specific language. The colors, boxes, or numbers appearing within images are visual indicators from the data extraction model and are not directly relevant to the key experimental results.
Table 2. Result of the image captioning model.
Source Image | Image Captioning Output | Relevant
Applsci 15 06051 i004 | A close up of a street sign with a sky background | Not
| A close up of a street sign with a street sign | Not
| A street sign on a pole on a street | Not
| Information boards with green and green signs | Not
| The image shows a green road sign, indicating the distances to the destinations | Moderate
Applsci 15 06051 i005 | A close up of street sign that reads Korean | Moderate
| A close up of street sign with traffic light | Not
| A street sign that is on a pole | Not
| A sign in English and Korean with a green sign above it | Moderate
| The image shows a road sign with directions in Korean and English | Moderate
Applsci 15 06051 i006 | A close up of a street sign with a sky background | Not
| A close up of a street sign with a sign on it | Not
| A street sign on a street | Not
| Signs in Korean and signs indicating there | Not
| A road sign displaying directions and names of various locations | Moderate
Table 3. Result of the tailored vision system.
System Data | System Output | Relevant
Applsci 15 06051 i007 | 27 km remaining to reach your destination | Highly
| Arrive in Mokpo in approximately 27 min | Highly
| There is an IlIlo IC, which is approximately 14 km away; this could be your next exit | Moderate
| 27 km left to Mokpo indicate that you are nearing | Highly
| The road number “15” is consistent throughout your route | Moderate
Applsci 15 06051 i008 | As you drive at 50 km/h, you may need to reduce speed when approaching Changhyeon Intersection | Highly
| Paldong and Deokso as your next destinations on the right | Highly
| As Changhyeon Intersection is your current location, check for traffic or any signals | Highly
| Deokso is your right, indicating that it is in the same direction as Paldong | Highly
| Based on the signs, it appears that Paldong is located to your right | Highly
Applsci 15 06051 i009 | You are heading toward Gangneung, which is indicated to be on your right | Highly
| You are currently driving at 50 km/h on road number 456 | Moderate
| “Daegwallyeong” is also on your right, | Moderate
| Hoenggye IC to your left indicates an alternate route, so if traffic conditions worsen, that may be an option | Highly
| be prepared to adjust speed as you approach intersections | Moderate
