Article

Multimodal Intelligent Perception at an Intersection: Pedestrian and Vehicle Flow Dynamics Using a Pipeline-Based Traffic Analysis System

1 Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung 81148, Taiwan
2 Department of Fragrance and Cosmetic Science, Kaohsiung Medical University, Kaohsiung 80708, Taiwan
* Author to whom correspondence should be addressed.
Electronics 2026, 15(2), 353; https://doi.org/10.3390/electronics15020353
Submission received: 29 December 2025 / Revised: 7 January 2026 / Accepted: 8 January 2026 / Published: 13 January 2026
(This article belongs to the Special Issue Interactive Design for Autonomous Driving Vehicles)

Abstract

Traditional automated monitoring systems adopted for intersection traffic control still face challenges, including high costs, maintenance difficulties, insufficient coverage, poor multimodal data integration, and limited traffic information analysis. To address these issues, this study proposes a sovereign AI-driven smart transportation governance approach, developing a mobile AI solution equipped with multimodal perception, task decomposition, memory, reasoning, and multi-agent collaboration capabilities. The proposed system integrates computer vision, multi-object tracking, natural language processing, Retrieval-Augmented Generation (RAG), and Large Language Models (LLMs) to construct a Pipeline-based Traffic Analysis System (PTAS). The PTAS produces real-time statistics on pedestrian and vehicle flows at intersections, incorporating potential risk factors such as traffic accidents, construction activities, and weather conditions into a multimodal data fusion analysis, thereby providing forward-looking traffic insights. Experimental results demonstrate that the enhanced DuCRG-YOLOv11n pre-trained model, equipped with our proposed activation function βsilu, can accurately identify various vehicle types in object detection, achieving a frame rate of 68.25 FPS and a precision of 91.4%. Combined with ByteTrack, it can track over 90% of vehicles in medium- to low-density traffic scenarios, achieving a MOTA of 0.719 and a MOTP of 0.08735. In traffic flow analysis, the RAG of Vertex AI, combined with the Claude Sonnet 4 LLM, provides a more comprehensive view, precisely interpreting the causes of peak-hour congestion and effectively compensating for missing data through contextual explanations. The proposed method enhances the efficiency of urban traffic regulation and optimizes decision support in intelligent transportation systems.

1. Introduction

With the rapid development of urbanization and motorization, traffic regulation has become a critical challenge in urban governance. However, traditional approaches often suffer from inefficiency, poor real-time responsiveness, limited accuracy, improper resource utilization, and various safety and management issues. Conventional automated monitoring systems remain constrained by high costs, maintenance difficulties, insufficient coverage, weak multimodal data integration, and inadequate traffic information analysis. To overcome these limitations, Intelligent Transportation Systems (ITSs) [1] have gradually emerged, leveraging sensors, real-time communication, artificial intelligence (AI) analytics, and cloud platforms to shift traffic regulation from passive dependency to proactive decision-making, thereby realizing a new paradigm of data-driven, real-time control and sustainable development. The transformation not only improves the flexibility of traffic management but also enhances decision support and governance capacity. In recent years, with the continuous advancement of AI computing power, Sovereign AI, which emphasizes the autonomous control of AI technologies and data, has garnered increasing attention, driving the development of localized intelligent transportation governance. Notably, Agentic AI—featuring multimodal perception, task decomposition, memory, reasoning, and multi-agent collaboration—can effectively support recognition, understanding, and natural human–AI interaction in traffic scenarios, thereby improving the precision and efficiency of traffic flow and roadway environment analysis.
Therefore, the study proposes a Pipeline-based Traffic Analysis System (PTAS) [2] that integrates Computer Vision (CV) [3], Multi-Object Tracking (MOT) [4], Natural Language Processing (NLP) [5], Retrieval-Augmented Generation (RAG) [6,7], and Large Language Models (LLMs) [8]. The proposed model not only performs real-time statistical analysis of pedestrian and vehicle flows at intersections, but also integrates textual data such as news reports and social media feedback to identify further potential risk factors, including traffic accidents, construction activities, and weather conditions, and extends its analysis to environmental information such as noise and particulate matter. Through multimodal data fusion analysis, the research aims to develop an intelligent traffic regulation system characterized by Sovereign AI and Agentic AI, providing more accurate and forward-looking traffic insights to support urban traffic optimization and decision-making.
In terms of methodological design, multiple LLMs were evaluated against various performance factors to validate our approach, including Data Description, Peak Hour Analysis, Congestion Cause Analysis, Traffic Accident Analysis & Suggestions, Construction Impact & Suggestions, Air Quality Analysis, Noise Analysis, Data Gaps Noted, Structure Logic & Readability, and Improvement Suggestions. By carefully comparing these performance dimensions, the study selects the most effective LLM for analyzing intersection pedestrian and vehicle flows within the proposed PTAS framework. The main contribution of the research lies in presenting a PTAS capable of real-time intersection flow statistics and precise analysis of traffic peak hours and congestion causes, while effectively describing data content and identifying missing information. The analytical results provide forward-looking traffic insights that can be immediately delivered to authorities and the public, enabling timely actions to address ongoing traffic challenges.
The remainder of the paper is organized as follows. Section 2 reviews related work, including various pre-trained YOLO-related models, several large language models, the retrieval-augmented generation process, and multi-object tracking algorithms. Section 3 presents the method for implementing the proposed system. Section 4 reports the experimental results and discussion. Finally, Section 5 draws a brief conclusion.

2. Related Work

2.1. Literature Review

Traditional traffic monitoring systems often rely on manual observation or sensor-based detection (such as loop and infrared sensors). However, these methods have inherent limitations in accuracy and scalability, rendering them inadequate for the growing complexity of modern traffic environments. With technological advancements, the Internet of Things (IoT), machine learning, and deep learning have gradually been introduced into intelligent transportation, demonstrating remarkable potential. Oladimeji et al. [9] conducted a comprehensive review on the topic, describing the applications and development trends of machine learning and IoT in intelligent transportation, such as using machine learning for route planning to alleviate congestion or combining smart cameras with Convolutional Neural Networks (CNNs) for parking space detection and tracking, highlighting the feasibility of these emerging technologies in enhancing traffic management efficiency and real-time scene perception.
Object detection is an indispensable technology in traffic monitoring. Tao et al. [10] proposed a YOLO-based object detection system for traffic scenes to address the limitations of traditional methods in terms of speed and accuracy. Hu et al. [11] developed a learning-based detection framework focusing on three key object categories: traffic signs, vehicles, and riders. The method extracts and shares dense features, enabling the detection process to perform only a single feature evaluation during testing, thereby significantly improving computational efficiency. Experimental results demonstrate that the framework achieves performance comparable to that of state-of-the-art methods across multiple benchmark datasets, thereby proving its effectiveness and efficiency in traffic scene perception.
However, relying solely on object detection is insufficient for comprehensive traffic monitoring. Multiple Object Tracking (MOT) techniques have been widely applied in intelligent transportation to enhance tracking and analytical capabilities. The core of MOT algorithms lies in establishing associations of the same object across consecutive frames in a video sequence, producing stable and continuous trajectories. Jiménez-Bravo et al. [12] provided a literature review outlining standard MOT techniques, including Intersection over Union (IoU) trackers, Kalman filters, and the SORT and DeepSORT algorithms. Among these, SORT combines the Kalman filter with the Hungarian algorithm, while DeepSORT further incorporates a deep neural network to extract appearance features, improving the accuracy of object identity consistency. Notably, MOT performance heavily depends on the quality of object detection and recognition algorithms, making the integration of MOT and detection models a significant challenge in traffic monitoring. Moreover, traffic management requires real-time flow statistics, event prediction, and semantic interpretation capabilities. Traditional rule-based or statistical models often fail to make precise, real-time decisions under dynamic and complex traffic conditions. With the advancement of Natural Language Processing (NLP) technologies, computer science has gradually introduced large language models (LLMs) and retrieval-augmented generation (RAG) into the field of intelligent transportation research. By semantically integrating heterogeneous data, they provide knowledge-driven decision support at a higher level of abstraction. Movahedi et al. [13] applied LLMs to adaptive traffic signal control, leveraging their vast knowledge base and reasoning ability to design dynamically adjustable traffic controllers. 
Experimental results indicate that the LLM-based approach significantly improves traffic control efficiency and flexibility, demonstrating its transformative potential in intelligent transportation management.

2.2. Object Detection Models

YOLOv8 [14], released by Ultralytics on 10 January 2023, delivers cutting-edge performance in both accuracy and speed, as shown in Figure 1. The version offers multiple pre-trained model variants, including nano, small, medium, and large, to accommodate various hardware configurations and application scenarios. Its architecture adopts an anchor-free detection head, enhanced with the C2f module and FPN + PANet structures to improve multi-scale feature fusion. Regarding the loss function, it integrates CIoU and Distribution Focal Loss along with optimized positive and negative sample selection, thereby improving classification and localization precision. YOLOv8 is available through command-line and Python package interfaces, supporting multiple export formats and lightweight deployment options suitable for real-time practical applications.
Figure 2 illustrates the latest model, YOLOv11 [15], which builds upon and enhances the advantages of YOLOv8 by refining the backbone and neck design. It introduces C3K2 blocks, C2PSA modules, and SPPF multi-scale and attention mechanisms to improve detection accuracy for small objects and complex scenes. The model offers variants ranging from nano to extra-large, adapting to resource-constrained devices and high-performance deployment needs. It also supports diverse tasks, including object detection, instance segmentation, image classification, pose estimation, and oriented bounding box detection. Experimental results on the COCO dataset demonstrate that YOLOv11 achieves a higher or comparable mean Average Precision (mAP) than YOLOv8m, while using fewer parameters, thereby showcasing an excellent balance between accuracy and computational efficiency.
In a vehicle detection study [16], researchers compared YOLOv11 with YOLOv8 and YOLOv10 [17]. The results indicated that YOLOv11 outperforms previous versions, particularly in detecting small or occluded vehicles, while maintaining competitive inference speed, making it especially suitable for real-time detection scenarios.

2.3. Reparameterization Ghost Bottleneck and Dual Convolution

In our previous work [18], as shown in Figure 3, the Reparameterization Ghost Bottleneck (RepGhost Bneck) is a module design within lightweight convolutional neural networks. Its core idea is to generate more feature maps with minimal computational cost, thereby enhancing network capacity while reducing computational complexity. Compared with the traditional GhostNet, RepGhost adopts structural reparameterization, transforming the multi-branch structure into a single pathway after training, and replacing feature concatenation with element-wise addition, effectively reducing system overhead. With similar or lower FLOPs, RepGhost achieves higher accuracy than GhostNet and MobileNet series models. According to the experiments reported in [18], RepGhostNet demonstrates higher Top-1 accuracy on ARM mobile devices, confirming that it maintains efficiency and precision under lightweight design constraints.
In our previous study [18], as illustrated in Figure 4, Dual Convolution (Dual Conv) [19] was motivated by the observation that conventional convolutions often rely on a 3 × 3 kernel to capture local spatial features, but at a relatively high computational cost, which is unsuitable for edge-device applications. To balance computational efficiency and representational capability, Dual Conv combines a 3 × 3 Depthwise Convolution (DW Conv) with a 1 × 1 Pointwise Convolution (PW Conv). The 3 × 3 DW Conv performs operations independently within each input channel, effectively reducing computation and improving inference speed. At the same time, the 1 × 1 PW Conv integrates cross-channel information and performs numerical scaling to preserve the model’s representational power. Dual Conv can simultaneously replace traditional convolutions by capturing features at multiple scales, while significantly reducing computational demand and accelerating computation, making it particularly suitable for lightweight network designs.
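As a concrete illustration, the depthwise-plus-pointwise factorization described above can be sketched in a few lines of NumPy. This is a minimal sketch for intuition only (the function names are ours, not from [19]), not the optimized implementation used in the model:

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """3x3 depthwise convolution: each channel is filtered independently.
    x: (C, H, W); kernels: (C, 3, 3). 'Same' padding, stride 1."""
    C, H, W = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(padded[c, i:i + 3, j:j + 3] * kernels[c])
    return out

def pointwise_conv1x1(x, weights):
    """1x1 pointwise convolution mixes information across channels.
    x: (C_in, H, W); weights: (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', weights, x)

def dual_conv(x, dw_kernels, pw_weights):
    """Dual Conv: per-channel spatial filtering, then cross-channel mixing."""
    return pointwise_conv1x1(depthwise_conv3x3(x, dw_kernels), pw_weights)
```

The factorization replaces one dense 3 × 3 convolution (C_in × C_out × 9 multiplies per pixel) with C_in × 9 spatial multiplies plus C_in × C_out channel-mixing multiplies, which is the source of the computational savings discussed above.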

2.4. Large Language Models

Large Language Models (LLMs), built on deep learning, learn linguistic structures and semantics through extensive training on text, enabling tasks such as text generation, translation, summarization, and question answering. They have made remarkable progress in Natural Language Processing (NLP) [20]. Their massive scale and complexity provide superior performance but entail high computational costs and potential bias issues. In applications, de Zarzà et al. [21] combined LLMs with deep probabilistic reasoning to enhance the responsiveness of autonomous driving. Liu et al. [22] leveraged large-scale pretraining and parameter scaling for traffic forecasting, noting that although traditional models often rely on complex architectures, their accuracy improvements are limited. In contrast, LLMs demonstrate more efficient predictive capability in time series analysis.
A well-known example is Llama [23], introduced by Meta, which belongs to a series of LLMs and multimodal models characterized by being open-source and commercially usable, thereby lowering the entry barrier for AI technology adoption. Llama is based on the Transformer architecture but includes multiple design improvements to enhance efficiency, stability, and long-sequence processing capability. For instance, Llama replaces the Transformer’s LayerNorm with RMSNorm, stabilizing the normalization process and accelerating computation. It employs the SwiGLU activation function, which combines gating and smooth computation to improve model performance. Additionally, it substitutes the original positional encoding with Rotary Embeddings (RoPEs), modeling relative positional relationships via rotation transformations. Overall, Llama retains the core structure of the Transformer while optimizing computational efficiency and memory usage. The Llama series has released Llama 1, Llama 2, and Llama 3, continuously improving model scale, performance, and multimodal capabilities. Its open-source nature has also facilitated extensive academic research and industrial applications, as shown in Figure 5.
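The RMSNorm substitution mentioned above is simple enough to state directly. A minimal NumPy sketch (our own naming, not Meta's code) of normalizing by the root mean square instead of LayerNorm's mean/variance pair:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: scale activations by their root mean square over the last
    axis. Unlike LayerNorm, no mean is subtracted, which removes one pass
    over the data and stabilizes/accelerates normalization."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * gain
```

After normalization the mean of the squared outputs is (up to eps) exactly 1, regardless of the input scale, which is what makes the subsequent layers' statistics stable.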
With the development of LLMs, related platforms have also emerged. For instance, Google Cloud’s Vertex AI integrates pretrained models, custom models, and AutoML, enabling researchers to complete the entire workflow from data processing to model deployment on a single platform. ChatGPT, developed by OpenAI [24] and released in 2022, has been widely applied in natural language dialogue, text generation, content summarization, and code writing. With iterative improvements through GPT-3.5, GPT-4, GPT-4o, and GPT-4.5, its performance and reasoning speed have continuously increased, with GPT-4o achieving faster computation and lower cost, making it currently the most advanced version. In the study by Villarreal et al. [25], ChatGPT was applied to complex mixed-traffic control problems and tested across three environments: loops, bottlenecks, and intersections. The results showed that ChatGPT did not consistently improve performance across all scenarios; however, at intersections and bottlenecks, the number of successful strategies increased by 150% and 136%, respectively, compared with baseline strategies relying solely on beginner capabilities, with some strategies even surpassing expert-level performance.
The Claude series, developed by Anthropic, is built on the “Constitutional AI” framework, emphasizing safety and controllability to reduce the risk of inappropriate outputs effectively. Its latest version, Claude Sonnet 4 [26], demonstrates enhanced language understanding, logical reasoning, and capabilities for handling complex tasks. Gemini, developed by Google DeepMind, is characterized by its multimodal processing capabilities, which enable it to simultaneously understand text, code, images, audio, and video. Gemini 2.5 Flash [27] achieves high performance and low latency, supporting long-context reasoning and real-time applications, making it especially suitable for chatbots, data analysis, and multimodal scenarios. In the study by Moraga et al. [28], UAVs, IoT sensors, and LLMs (Gemini) were integrated and applied within the SUMO simulator to dynamically adjust vehicle speeds, reducing urban traffic congestion and CO2 emissions. Experiments across three urban scenarios demonstrated that the Gemini-based solution significantly outperformed the baseline, highlighting its potential in intelligent traffic management.
LLMs demonstrate strong capabilities in language and time-series analysis, playing a critical role across research, industry, and real-time applications through these diverse platforms. The study leverages these tools for data processing, analysis, and intelligent applications, further demonstrating the research potential and practical value of LLMs.

2.5. Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with generative models, aiming to enhance the accuracy and reliability of Large Language Models (LLMs), as shown in Figure 6. The process involves using vector search [29] to retrieve relevant information from large databases and then providing it as auxiliary input to the generative model, making the generated outputs more accurate and timely. Traditional LLMs rely solely on pretrained knowledge, which often proves insufficient when handling new information or details not covered in the training data. RAG addresses this limitation by incorporating external data sources, allowing the model to improve response quality based on real-time retrieved information. Its advantages include referencing up-to-date and authoritative sources, increasing generation accuracy, and reducing dependency on large pretrained datasets, thereby enhancing efficiency and lowering computational resource requirements. By combining retrieval and generation, RAG enhances text quality and facilitates the handling of more complex language tasks.
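The retrieve-then-generate loop can be made concrete with a small sketch: embed the query, rank stored documents by cosine similarity, and prepend the winners to the prompt. This is a toy illustration with placeholder data structures, not the Vertex AI implementation used later in the paper:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=2):
    """Rank documents by cosine similarity to the query embedding
    and return the top-k texts as grounding context."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    top = np.argsort(-sims)[:k]
    return [docs[i] for i in top]

def build_prompt(question, retrieved):
    """Assemble the augmented prompt handed to the generative model."""
    context = "\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

In a production system, the embeddings come from a trained encoder and the document store is an approximate nearest-neighbor index rather than a dense matrix, but the data flow is the same.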
Xu et al. [30] explored an interpretable method for predicting traffic user behavior in autonomous driving based on RAG. The system integrates the reasoning capabilities of a Knowledge Graph (KG) with the language abilities of LLMs. Its core design first establishes an entirely inductive reasoning framework through Knowledge Graph Embedding (KGE) and Bayesian reasoning, then leverages RAG to combine real-time vehicle sensor data with KG information to generate explainable predictions of road user behavior. The approach strikes a balance between rigorous reasoning and linguistic interpretability, thereby enhancing system transparency and trustworthiness, and effectively predicting pedestrian crossing behavior and vehicle lane-changing actions.

2.6. Multiple Object Tracking (MOT)

Deep SORT [31], built upon the SORT framework, introduces Convolutional Neural Network (CNN)-based appearance features, reducing the reliance on motion information alone and effectively mitigating the identity switch (ID switch) issues common in traditional SORT, as illustrated in Figure 7. By combining positional and appearance information, Deep SORT significantly enhances stability and accuracy in long-term tracking, particularly in scenarios with occlusions or high densities. However, incorporating appearance features increases computational load, requiring more efficient feature extraction models for real-time applications. During tracking, Deep SORT utilizes a Kalman Filter to predict and update the state of each target, integrating observations with predictions to achieve smoother tracking. Equation (1) gives the Kalman Filter state update, where k represents the discrete time step (e.g., k = 1 denotes 1 s, k = 2 denotes 2 s), X̂_k is the estimated state at time k, Z_k is the current (possibly noisy) measurement, and K_k is the Kalman gain, which determines the relative weight of the measurement and the predicted state. A higher K_k places more emphasis on new measurements for rapid adaptation, while a lower K_k favors maintaining past states for stability.
X̂_k = K_k·Z_k + (1 − K_k)·X̂_{k−1}    (1)
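Equation (1) is a convex blend of measurement and prediction. The following minimal sketch fixes the gain K as a constant for illustration; in the full filter, K_k is recomputed at every step from the state and measurement covariances:

```python
def kalman_update(x_prev, z, K):
    """Equation (1): new estimate = K * measurement + (1 - K) * old estimate.
    K near 1 trusts the measurement; K near 0 trusts the prediction."""
    return K * z + (1.0 - K) * x_prev

# Smooth a noisy sequence of positions with a fixed gain.
estimate = 0.0
for z in [1.0, 1.2, 0.9, 1.1]:
    estimate = kalman_update(estimate, z, K=0.5)
```

With K = 0.5 each new measurement moves the estimate halfway toward the observation, which damps measurement noise at the cost of slower reaction to genuine motion.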
Additionally, the Hungarian Algorithm in Deep SORT optimally assigns multiple predicted trackers to new detection boxes. It first constructs a cost matrix based on positional differences between predicted and detected boxes, then applies row and column reductions to achieve minimal cost matching, ensuring each row and column retains only one minimum value (corresponding to 0). The system then assigns each new detection box to the most suitable tracker, achieving high-accuracy association and minimizing mismatches in multi-object tracking.
ByteTrack introduces a two-stage matching strategy: the first stage prioritizes high-confidence detection boxes, and the second stage attempts to match the remaining unmatched tracklets with low-confidence boxes [32], as shown in Figure 8. The core design manages temporarily lost or occluded targets through a buffer mechanism. Equation (2) defines the split of the detector output set D, where τ is the confidence threshold dividing detections into high- and low-confidence sets. Equation (3) calculates the matching cost between detection d_i and tracklet T_j, with tracklets predicted by the Kalman Filter. The Hungarian algorithm is then applied to the cost matrix for minimal-cost matching, using D_high in the first stage and D_low for unmatched tracklets in the second stage. Unmatched tracklets are placed in a lost buffer, resuming if matched within a specified number of frames; otherwise, the system discards them. The method achieves state-of-the-art performance on the MOT17 and MOT20 datasets (e.g., 80.3 MOTA on MOT17) [32], striking a balance between tracking completeness and computational efficiency, making it suitable for real-time applications such as autonomous driving and intelligent surveillance.
D_high = { d | score(d) ≥ τ },  D_low = { d | score(d) < τ }    (2)
c_ij = 1 − IoU(T_j, d_i)    (3)
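Equations (2) and (3) translate directly into code. The snippet below sketches the confidence split and the IoU-based cost; the actual association step additionally runs the Hungarian algorithm over the resulting cost matrix, which we omit here:

```python
def split_by_confidence(detections, tau):
    """Equation (2): partition detections into high/low confidence sets."""
    high = [d for d in detections if d["score"] >= tau]
    low = [d for d in detections if d["score"] < tau]
    return high, low

def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def cost(tracklet_box, det_box):
    """Equation (3): matching cost c_ij = 1 - IoU(T_j, d_i)."""
    return 1.0 - iou(tracklet_box, det_box)
```

A perfect overlap yields cost 0 and disjoint boxes yield cost 1, so minimizing total cost over the matrix favors assignments where predicted tracklet boxes and detections coincide.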
In summary, Deep SORT emphasizes the importance of appearance features in reducing mismatches, whereas ByteTrack enhances tracking completeness and robustness via flexible low-confidence box management. Together, they represent the evolution of multi-object tracking from appearance-based enhancement to low-confidence compensation.

3. Method

The chapter describes the study’s core concepts and implementation workflow, detailing how the object detection algorithm YOLOv11 is integrated with the Vertex AI platform to construct a vision-based vehicle and pedestrian flow analysis system at intersections, known as the Pipeline-based Traffic Analysis System (PTAS).
In Figure 9, the PTAS workflow consists of several key steps. First, real-time intersection images are captured from surveillance devices and fed into the system for subsequent processing. Then, the YOLOv11 model performs object detection and vehicle classification, identifying cars, motorcycles, and other relevant targets within the frame. Subsequently, the system employs ByteTrack for vehicle tracking and counts traffic volume to generate structured traffic data.
The resulting flow statistics are automatically uploaded to Google Cloud Storage and integrated with other heterogeneous data sources. The PTAS uses Vertex AI’s Retrieval-Augmented Generation (RAG) technology and LLMs, such as GPT, Claude, or Gemini, to perform traffic semantic analysis, extracting potential risks and anomalies from multi-source textual information. Finally, the analysis results are visualized on a web platform, allowing decision-makers to monitor intersection traffic conditions and potential hazards in real-time.
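At a high level, the workflow above is a linear composition of stages. A schematic sketch with stub stages follows; the names detect, track, and count are placeholders standing in for the YOLOv11, ByteTrack, and counting components, not the production code:

```python
def detect(frame):
    """Stub detector: returns bounding boxes found in the frame."""
    return {"frame": frame, "boxes": [(0, 0, 10, 10), (20, 20, 40, 40)]}

def track(state):
    """Stub tracker: assigns a track ID to each detection."""
    state["ids"] = list(range(len(state["boxes"])))
    return state

def count(state):
    """Stub counter: reports how many tracked objects were observed."""
    return len(state["ids"])

def run_pipeline(frame, stages):
    """Feed each stage's output into the next, PTAS-style."""
    data = frame
    for stage in stages:
        data = stage(data)
    return data
```

Keeping the stages loosely coupled in this way is what lets the later steps (cloud upload, RAG analysis, visualization) be swapped or extended without touching the vision front end.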

3.1. Data Collection, Processing, and Model Training & Testing

In the pre-training stage, the study initially utilized the Traffic Detection Project dataset from Kaggle (https://www.kaggle.com/datasets/yusufberksardoan/traffic-detection-project?select=valid (accessed on 2 December 2025)) as the experimental foundation for various pre-training models in object detection and tracking tasks. The dataset, primarily composed of traffic surveillance images collected from different countries, offers a global perspective on traffic monitoring and management. It contains 6633 high-quality annotated images, each carefully labeled with bounding boxes to identify various objects such as vehicles, pedestrians, and traffic signs, making it highly suitable for object detection research. The YOLO-related models were pre-trained using the dataset, with data divided into training, validation, and testing splits. The tracking algorithm, in contrast, employed default parameters and did not require a pre-training or validation phase, being used solely for tracking execution during testing.
DarkLabel performs image annotation, automatically labeling object classes, including vehicles and pedestrians, and correcting erroneous labels to produce accurate annotation datasets required for training and testing during the object tracking phase. The dataset’s diversity and high-quality annotations make it suitable for other object detection tasks. To enhance model performance in real-world scenarios, an additional 1000 traffic images from actual cases were collected and used for transfer learning, performing fine-tuning training and testing to improve model adaptability to specific environments.
In the transfer learning stage, the study obtained intersection data from the Kaohsiung City Real-Time Traffic Information Service, capturing live intersection images to fine-tune and retrain the previously pre-trained YOLO-related models derived from the Traffic Detection Project dataset. The real-time images of the peak period were collected from June 2025 to July 2025, spanning two months. From the dataset, we randomly selected suitable images and manually labeled them to ensure annotation accuracy and data reliability. The experiment focused on intersections with higher traffic volumes and more complex activity patterns to enhance representativeness and diversity. The first observation site was the intersection of Wufu 1st Rd. and Guanghua 1st Rd., as shown in Figure 10. In contrast, the study identified the second site at the southwest corner of Jiuru 2nd Road and Ziyou 1st Road, as shown in Figure 11.

3.2. Improved Object Detection Models

The study adopts the Reparameterization Ghost (RG) module [18] to reduce redundant feature computations through Ghost Convolution, lowering model parameters and FLOPs. Combined with reparameterization techniques, the module allows the network structure to differ between training and inference, maintaining high efficiency and low latency at inference time. The approach is particularly suitable for real-time traffic monitoring on edge devices. For object detection, in addition to using the baseline YOLOv8 and YOLOv11 models, we also introduced improved variants to boost their performance. The study implemented two variant models with RG for the baseline YOLOv8 and YOLOv11: RG-YOLOv8n and RG-YOLOv11n, as shown in Figure 12 and Figure 13, respectively.
To speed up convolution operations and enhance detection precision, the Dual Convolution (DuC) [18] introduces a dual convolution structure in the feature extraction stage, strengthening the model’s ability to represent multi-scale features. The design effectively captures fine-grained features, improving detection accuracy for small objects and dense, overlapping targets in traffic scenes.
To enhance detection precision while using DuC, we replaced the sigmoid linear unit (SiLU) activation function with a sigmoid linear unit with parameter β (called βsilu) in DuC, capturing more sensitive deviations of the activation function in its output and allowing an appropriate response to suitable nonzero outputs when x is less than zero. β_r(x) is a two-layer linear regression, computed by Equation (4), where x represents the input and relu stands for the rectified linear unit. The initial value of the first regression’s intercept b_1 is set to 0, and the initial value of the output regression’s intercept b_2 is also set to 0. The initial values of the first regression’s coefficient ω_1 and the second regression’s coefficient ω_2 are randomly generated. β_r(x) behaves similarly to a fixed-β learning mechanism but provides greater flexibility and adaptability, enabling it to handle input-related scaling. Equation (5) calculates βsilu, where x represents the input and the initial value of the bias b is −1. When b = −1, the function behaves more like relu near x = 0, causing σ(β_r(x)·x + b) ≈ 0.
β_r(x) = ω_2·relu(ω_1·x + b_1) + b_2    (4)
βsilu(x) = x·σ(β_r(x)·x + b)    (5)
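Equations (4) and (5) translate directly into code. A NumPy sketch with the stated initial values (b_1 = b_2 = 0, b = −1) and the randomly initialized coefficients replaced by explicit arguments for reproducibility:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def beta_r(x, w1, w2, b1=0.0, b2=0.0):
    """Equation (4): two-layer linear regression with a ReLU in between."""
    return w2 * np.maximum(0.0, w1 * x + b1) + b2

def beta_silu(x, w1, w2, b=-1.0):
    """Equation (5): SiLU whose gate is scaled by the learned beta_r(x)."""
    return x * sigmoid(beta_r(x, w1, w2) * x + b)
```

In training, ω_1 and ω_2 would be learned parameters updated by backpropagation; here they are plain arguments so the function's shape can be inspected directly.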
By combining RG and DuC in the DuCRG architecture, the model aims to maintain high computational efficiency while enhancing recognition capability for small objects and complex scenarios. Likewise, in addition to the improved RG-YOLOv8n and RG-YOLOv11n models, we have further upgraded variants of these models to enhance their speed performance. The study reconstructed two variant models with DuC for the RG-YOLOv8n and RG-YOLOv11n: DuCRG-YOLOv8n and DuCRG-YOLOv11n, as shown in Figure 14 and Figure 15, respectively.
Comparing the baseline and improved models enables a comprehensive evaluation of how different architectural designs affect performance metrics, including FPS, Precision, Recall, and F1-score, thereby further elucidating their advantages and limitations in intelligent traffic monitoring applications.

3.3. Traffic Flow Calculation and Edge-Based Real-Time Image Acquisition

The system utilizes YOLOv11n object detection combined with tracking algorithms to count vehicle traffic, automatically recording the entry and exit of different vehicle types. The YOLO model identifies vehicle categories in the video frames, while ByteTrack continuously monitors the movement trajectory of each vehicle. Virtual lines are set within the image frame, as shown in Figure 16. When a vehicle crosses a line, the counting mechanism is triggered, recording the vehicle’s entry and exit information at the intersection.
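The line-crossing trigger described above can be sketched as follows; the structure is a hypothetical minimal version, assuming the running system feeds it YOLOv11n detections with persistent ByteTrack track IDs:

```python
# Minimal virtual-line counting sketch (hypothetical structure; in the real
# system YOLOv11n supplies detections and ByteTrack the persistent track IDs).

VIRTUAL_LINE_Y = 400          # y-coordinate of the virtual line in the frame
counts = {}                   # (vehicle_type, direction) -> crossing count
last_y = {}                   # track_id -> y-centre in the previous frame

def update_count(track_id, vehicle_type, cy):
    """Trigger the counter when a tracked centre crosses the virtual line."""
    prev = last_y.get(track_id)
    if prev is not None:
        if prev < VIRTUAL_LINE_Y <= cy:        # moving downwards -> 'entry'
            key = (vehicle_type, "entry")
            counts[key] = counts.get(key, 0) + 1
        elif prev >= VIRTUAL_LINE_Y > cy:      # moving upwards -> 'exit'
            key = (vehicle_type, "exit")
            counts[key] = counts.get(key, 0) + 1
    last_y[track_id] = cy

# Example: track 7 (a car) moves from y=390 to y=410 and crosses the line,
# so counts becomes {('car', 'entry'): 1}.
update_count(7, "car", 390)
update_count(7, "car", 410)
```

Comparing the previous and current centres, rather than testing proximity to the line, ensures each vehicle is counted exactly once per crossing.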
The system also integrates real-time intersection images from the Kaohsiung City Traffic Information Service as the primary data source for traffic flow analysis. Images are captured and analyzed every ten minutes, automatically identifying vehicle movements at different intersections. The experiment stored the analysis results in CSV format for subsequent statistical analysis and visualization.
Each record includes the following fields:
(1)
Vehicle Type: e.g., car, truck, bus;
(2)
Entry: the direction from which the vehicle enters the intersection;
(3)
Exit: the direction through which the vehicle leaves the intersection;
(4)
Count: the number of cars passing through the specified entry and exit directions during the time interval.
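A hypothetical example of writing one interval's records in this schema with Python's standard `csv` module (the filename and values are illustrative only):

```python
import csv

# Illustrative per-interval records following the four listed fields.
records = [
    {"Vehicle Type": "car",   "Entry": "north", "Exit": "south", "Count": 42},
    {"Vehicle Type": "truck", "Entry": "east",  "Exit": "west",  "Count": 7},
]

# Hypothetical filename encoding location/date/interval.
with open("flow_20250820_1510.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Vehicle Type", "Entry", "Exit", "Count"])
    writer.writeheader()
    writer.writerows(records)
```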

3.4. Acquisition of Urban Environment and Traffic-Related Monitoring Data

The system incorporates traffic accident data, road construction information, air quality data, and noise monitoring analysis for integrated evaluation, enhancing the precision of identifying traffic congestion causes during peak hours.
Traffic accident data are automatically collected via web scraping from the Kaohsiung City Real-Time Traffic website, as shown in Figure 17. The website updates in real time and provides information on the occurrence time, location, and description of traffic events (e.g., accidents, road closures due to construction, signal failures), serving as an essential data source. By acquiring and analyzing these data, the system can respond immediately when it detects abnormal events on specific road segments and provide alerts or recommendations. For example, the information can notify drivers to take alternative routes in case of a traffic accident.
The study obtained road construction information through the public API provided by the Kaohsiung City Road Excavation Management Center, including fields such as construction start time, construction area, affected road segments, and project name. Python’s requests library via HTTP GET requests in JSON format retrieved the data and then structured it for subsequent filtering and analysis. Figure 18 illustrates the construction data provided by the center, serving as essential background information when analyzing traffic conditions at intersections.
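A hedged sketch of this retrieval step; the endpoint URL and the `road_segment` field name are placeholders rather than the real API schema, and the standard-library `urllib` stands in for the `requests` library used in the study:

```python
import json
from urllib.request import urlopen

API_URL = "https://api.example.gov.tw/road-excavation"  # placeholder URL

def filter_by_segment(projects, segment_keyword):
    """Keep projects whose affected road segment mentions the intersection."""
    return [p for p in projects if segment_keyword in p.get("road_segment", "")]

def fetch_construction(segment_keyword):
    # HTTP GET returning a JSON list of construction-project dicts.
    with urlopen(API_URL, timeout=10) as resp:
        projects = json.load(resp)
    return filter_by_segment(projects, segment_keyword)
```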
Air quality monitoring data are sourced from the Government Open Data Platform API and are updated hourly, covering air quality indices (AQI) and related monitoring measurements at various stations. The main fields include:
(1)
Site Name (sitename): Name of the monitoring station.
(2)
AQI (aqi): Air Quality Index.
(3)
Primary Pollutant (pollutant): The pollutant contributing most to the AQI.
(4)
PM2.5/PM10: Concentrations of fine particulate matter and suspended particles.
(5)
Status (status): Air quality status (e.g., Good, Moderate, Unhealthy for Sensitive Groups).
(6)
Publish Time (publishtime): Data update time.
By filtering for county = “Kaohsiung City”, real-time monitoring data for stations in Kaohsiung can be obtained. The system can use these data to monitor air quality changes and assess health risks, and they serve as essential references for policy making and research analysis.
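The county filter can be sketched as below; the sample records mirror the listed fields but are illustrative, not actual API responses:

```python
# Illustrative records following the listed fields (not real API responses).
def kaohsiung_records(records):
    """Keep only stations located in Kaohsiung City."""
    return [r for r in records if r.get("county") == "Kaohsiung City"]

sample = [
    {"sitename": "Nanzi",   "county": "Kaohsiung City",  "aqi": 43,
     "pollutant": "PM2.5",  "status": "Good",
     "publishtime": "2025/08/20 15:00"},
    {"sitename": "Banqiao", "county": "New Taipei City", "aqi": 55,
     "pollutant": "PM10",   "status": "Moderate",
     "publishtime": "2025/08/20 15:00"},
]
local = kaohsiung_records(sample)  # only the Kaohsiung station remains
```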
Noise monitoring information is provided by the Kaohsiung City Environmental Protection Bureau, including monitoring locations, control zones, monitoring IDs, and records. The data also contain measurement dates, noise levels for daytime, evening, and nighttime periods (unit: dB), and road width information. The system utilizes the dataset to analyze traffic and noise characteristics under various time periods and road conditions, providing valuable references for traffic planning, environmental monitoring, and improvement measures. Figure 19 illustrates the relevant data available on the website.

3.5. Data Management and Retrieval Integration

The system uses Cloud Storage as the core data storage and management platform. Structured data generated during the object detection and tracking stages—such as vehicle entry/exit records, traffic accident logs, construction records, noise measurements, and particulate matter (PM) data—are automatically saved in CSV format and uploaded to the cloud.
Additionally, the system integrates diverse historical data on the intersection, including real-time traffic flow, historical traffic statistics, traffic incident reports, construction and control records, as well as data on noise and particulate matter. These datasets are stored in various formats, including CSV, TXT, and PDF, to support subsequent analysis and decision-making applications. Figure 20 illustrates the overall architecture.
The system utilizes Vertex AI’s RAG (Retrieval-Augmented Generation) technology in the data analysis workflow. First, it retrieves historical traffic data (including traffic accident records) to ensure data completeness and usability, as shown in Figure 21. Then, it integrates multiple sources of information, including: (1) historical traffic records (including traffic accidents) to verify whether complete and usable data is available; (2) real-time traffic conditions; (3) historical traffic flow data; (4) weather conditions; (5) event notifications and construction announcements; (6) noise and particulate matter (PM) monitoring data. Through integrating and retrieving these multi-source datasets, the system can enhance the accuracy and comprehensiveness of traffic analysis, providing precise traffic insights and decision-support capabilities.

3.6. AI Analysis and Decision Support

The system stores vehicle entry/exit and historical traffic data in Google Cloud Storage to enhance data management and analytical efficiency, while integrating real-time pedestrian and vehicle flow data, incident reports, construction data, noise data, and particulate matter data. Figure 22 shows the overall architecture.
Figure 23 illustrates the system’s data processing and intelligent analysis workflow. The system utilizes Vertex AI’s RAG technology for data retrieval, combined with the top three most widely used LLMs, including GPT, Claude, and Gemini, to perform semantic analysis and intelligent Q&A on traffic and environmental data. The workflow can handle multimodal data simultaneously, encompassing structured traffic records, image analysis results, noise, and construction information, to provide precise traffic insights and decision support.
Figure 24 shows the system’s analysis output workflow. After the user inputs an analysis prompt, the system retrieves relevant data via RAG and uses the GPT-4, Claude Sonnet 4, and Gemini-2.5-flash LLMs to generate a comprehensive analysis. If data are insufficient, the PTAS reports the missing items and suggests the types of data to be supplemented, such as information for specific time periods or vehicle types. Simultaneously, the PTAS can provide dynamic traffic management recommendations, such as signal timing adjustments or alternative route guidance, to improve overall traffic efficiency and precision in traffic management.

3.7. User Interface Webpage in the PTAS

The Pipeline-based Traffic Analysis System (PTAS) utilizes Flask to build a user interface webpage; Figure 25 shows an example of such a webpage. It provides real-time image monitoring and analysis of traffic conditions. The interface is divided into real-time image display and traffic flow analysis. In the real-time image display, the system utilizes YOLO object detection technology to annotate vehicle types and movement directions in real time, while also employing tracking techniques to monitor traffic dynamics. This enables users to observe traffic hotspots and vehicle flow trends intuitively.
Based on the collected historical data, the traffic flow analysis presents vehicle counts across different time periods, peak-hour predictions, and potential congestion alerts. It also displays relevant information such as weather conditions, incident reports, construction announcements, and noise and particulate matter monitoring data, providing comprehensive multi-source traffic analysis. The system also supports automated data storage and retrieval, enabling users to query traffic information for specific time periods and road segments through the backend API, thereby enhancing the timeliness and accuracy of traffic management and decision-making.
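A minimal Flask sketch of such a backend query endpoint; the route name, query parameters, and in-memory data are hypothetical, not the actual PTAS API:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical in-memory store; the real PTAS reads from Cloud Storage.
FLOW_DB = [
    {"segment": "Wufu-Guanghua", "period": "18:00-18:10",
     "car": 118, "motorcycle": 96},
]

@app.route("/api/flow")
def query_flow():
    # Filter stored flow records by optional segment/period query parameters.
    segment = request.args.get("segment")
    period = request.args.get("period")
    rows = [r for r in FLOW_DB
            if (segment is None or r["segment"] == segment)
            and (period is None or r["period"] == period)]
    return jsonify(rows)
```

A front-end page can then poll this endpoint, e.g. `GET /api/flow?segment=Wufu-Guanghua`, to refresh its charts.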

4. Experimental Results and Discussion

4.1. Experiment Setting

Table 1 shows the hardware configuration used in the study, while Table 2 lists the software packages employed. The system environment is Windows 11, with Python as the primary programming language, and development conducted using the Anaconda platform. Since Python serves as the core development language for the entire system, the Flask framework is adopted to build a Python-based web application, providing a user-friendly interface and interactive functionality. The study executes Python programs within the Anaconda environment, primarily using Spyder and Jupyter Notebook as development and testing platforms to facilitate program development, debugging, and experiment recording. For object detection, the OpenCV library is employed for image processing and target recognition, enabling automated detection and analysis of objects in both video and image data. Additionally, the DarkLabel tool is used for annotating videos and images, creating labeled datasets required for training and testing during the object tracking phase. Through Flask, the system integrates the front-end interface with back-end model computation, achieving real-time data input, model inference, and visualized result presentation, thereby enhancing the system’s usability and practicality.

4.2. Object Detection and Tracking Data Collection and Model Training

The study divides the data collection into two levels, corresponding to the pre-training and transfer learning requirements. In the pre-training stage, the study utilizes the dataset provided by the Traffic Detection Project from Kaggle (https://www.kaggle.com/datasets/yusufberksardoan/traffic-detection-project?select=valid (accessed on 2 December 2025)). The dataset contains 6633 traffic surveillance images collected from various countries, providing a global perspective on traffic monitoring and management. Each image is meticulously annotated with bounding boxes to identify objects such as vehicles, pedestrians, and traffic signs, ensuring high-quality labeling suitable for object detection and tracking research. The dataset encompasses diverse environmental conditions, including daytime, nighttime, and rainy scenarios, effectively simulating dynamic real-world traffic environments.
Table 3 summarizes the dataset composition, including image counts, class distribution, and data proportions, where the Training set accounts for 87.5% (used for detection only), the Validation set accounts for 8.3% (used for detection only), and the Test set accounts for 4.2% (used for both detection and tracking).
Table 4 presents the data collection for the transfer learning stage. In this phase, the study collected 1000 traffic surveillance images from two major intersections in Kaohsiung City (https://cctv2.kctmc.nat.gov.tw/88b6eddf/&t=1744354849787 (accessed on 2 December 2025), https://cctv1.kctmc.nat.gov.tw/b045c1d4/&t=1749010635776 (accessed on 2 December 2025)): the intersection of Wufu 1st Road and Guanghua 1st Road, and the southwest corner of the intersection of Jiuru 2nd Road and Ziyou 1st Road. The real-time peak-period images were captured between June and July 2025, covering two months of continuous traffic observation. To ensure representativeness and enhance the model’s adaptability and generalization, the collected images encompass diverse conditions, including daytime, nighttime, and rainy weather, reflecting realistic urban traffic dynamics. From these recordings, we randomly selected suitable frames and manually annotated them with bounding boxes to identify vehicles, pedestrians, and relevant traffic elements. These annotated datasets were subsequently used for YOLO-based object detection and ByteTrack-based vehicle tracking experiments, forming the core dataset for model fine-tuning and evaluation in the transfer learning phase.

4.3. Object Detection and Tracking Model Parameter Settings

In the object detection stage, the study employed pre-trained YOLOv8 and YOLOv11, as well as their lightweight versions, for real-time detection of pedestrians and vehicles. The models were fine-tuned through transfer learning on real-world traffic images, enabling effective recognition of pedestrians and different vehicle types, thereby enhancing adaptability and detection accuracy in practical traffic scenarios.
The system integrated ByteTrack and DeepSORT algorithms for multi-object tracking in the tracking stage. The experiment set ByteTrack parameters as follows: high-confidence association threshold track_high_thresh = 0.6, low-confidence association threshold track_low_thresh = 0.2, new object initialization threshold new_track_thresh = 0.25, matching threshold match_thresh = 0.8, and tracking buffer track_buffer = 90. These settings ensure high accuracy while improving tracking stability under temporary occlusions. The experiment configured DeepSORT parameters as follows: n_init = 3 to confirm new objects only after three consecutive detections, thereby reducing false positives; max_iou_distance = 0.7 and max_cosine_distance = 0.2 for matching based on location and appearance features; and nn_budget = 100 to control the number of stored feature vectors, thereby balancing tracking stability and computational efficiency.
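Collected as configuration dictionaries, the tracker settings above read as follows (key names follow common ByteTrack/DeepSORT conventions):

```python
# Experiment tracker settings from the text, as configuration dictionaries.
BYTETRACK_CFG = {
    "track_high_thresh": 0.6,   # high-confidence association threshold
    "track_low_thresh": 0.2,    # low-confidence association threshold
    "new_track_thresh": 0.25,   # new object initialization threshold
    "match_thresh": 0.8,        # matching threshold
    "track_buffer": 90,         # frames a lost track is kept before removal
}

DEEPSORT_CFG = {
    "n_init": 3,                 # confirm a track after 3 consecutive detections
    "max_iou_distance": 0.7,     # location-based matching gate
    "max_cosine_distance": 0.2,  # appearance-feature matching gate
    "nn_budget": 100,            # stored feature vectors per track
}
```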
By combining YOLO for efficient object detection with ByteTrack and DeepSORT for multi-object tracking, the study achieves a balance of detection accuracy, tracking continuity, and system real-time performance, thereby enhancing the reliability of pedestrian and vehicle dynamics analysis in traffic scenarios.
Figure 26 shows the object detection results at the first location (intersection of Wufu 1st Rd. and Guanghua 1st Rd.) for (a) daytime and (b) nighttime images, as well as (c) daytime and (d) nighttime images including pedestrian flow. At the intersection, the system effectively detects vehicles during both day and night, with stable overall performance. However, occasional detection errors or misses occur in low-light nighttime conditions, particularly in areas with vehicle light reflections or insufficient illumination, affecting model accuracy.
Figure 27 shows the object detection results at the second location (intersection of Jiuru 2nd Rd. and Ziyou 1st Rd., southwest corner) for (a) daytime and (b) nighttime images, as well as (c) daytime and (d) nighttime images including pedestrian flow. At the location, the system also demonstrates all-weather object detection capability. Daytime detection performs well, accurately recognizing pedestrians and various types of vehicles. However, under nighttime conditions, interference from vehicle lights and occlusion due to high traffic density can still lead to false or missed detections, slightly affecting the overall model accuracy.

4.4. Knowledge Base Construction, RAG Retrieval, and LLMs Query Process

Figure 28 illustrates how the study utilizes Python code to organize and structure raw documents into a knowledge base (Corpus) and then converts them into high-dimensional semantic representations through text vectorization. These vectorized results are stored in a retrieval database, serving as the foundation for subsequent Retrieval-Augmented Generation (RAG) processes and LLMs’ query responses. The approach can ensure that models integrate external knowledge sources when generating answers, thereby improving the accuracy and reliability of responses. The study applies the same RAG and LLMs analysis workflow to traffic intersections. First, the data is uploaded to Google Cloud Storage (GCS) and vectorized using Google’s text embedding model text-embedding-004 to support semantic retrieval. Subsequently, the experiment used import_files_to_corpus() to import files and perform segmented embedding, establishing a corpus for later analysis.
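The corpus-building steps (chunking, vectorization, storage) can be illustrated locally; the toy bag-of-characters embedder below merely stands in for Google's text-embedding-004, which performs the real vectorization on Vertex AI:

```python
def chunk(text, size=200):
    # Split a document into fixed-size passages (segmented embedding).
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(passage):
    # Toy stand-in for text-embedding-004: a 26-dim bag-of-characters vector.
    vec = [0.0] * 26
    for ch in passage.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    return vec

corpus = []  # the retrieval database: (passage, vector) pairs

def import_file_to_corpus(text):
    # Local analogue of the import-and-embed step the study performs with
    # its import_files_to_corpus() helper on Vertex AI.
    for passage in chunk(text):
        corpus.append((passage, embed(passage)))
```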
Figure 29 and Figure 30 display the RAG semantic retrieval results for the intersection of Wufu 1st Road and Guanghua 1st Road, as well as the intersection of Jiuru 2nd Road and Ziyou 1st Road, specifically at the southwest corner. After constructing the corpus, the system executes semantic retrieval via the Vertex AI API, comparing user queries with the corpus content to fetch the most relevant passages. LLMs such as GPT-4o, Claude Sonnet 4, and Gemini-2.5-flash then analyze and process these passages to provide precise traffic insights and intelligent Q&A results.
The Gemini model can directly integrate retrieved content during generation using create_rag_tool() and generate_with_rag_tool(). For GPT and Claude, which do not natively support Vertex AI semantic retrieval, retrieval must first be performed manually via Vertex AI RAG, and the retrieved results are then incorporated into the prompt. Responses are generated using OpenAI’s ChatCompletion.create() and Anthropic’s client.messages.create(), respectively, ensuring correctness and completeness of the answers.
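The retrieve-then-prompt flow used for GPT and Claude can be sketched as follows, with cosine similarity over toy vectors standing in for Vertex AI semantic retrieval:

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors (0.0 for a zero vector).
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query_vec, corpus, top_k=3):
    # Rank (passage, vector) pairs by similarity to the query, keep top-k.
    ranked = sorted(corpus, key=lambda p: cosine(query_vec, p[1]), reverse=True)
    return [passage for passage, _ in ranked[:top_k]]

def build_prompt(question, passages):
    # Splice the retrieved passages into the prompt that would be sent to
    # ChatCompletion.create() or client.messages.create().
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}"
```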

4.5. PTAS User Interface Webpage and Analysis Reports

Figure 31 and Figure 32 illustrate the PTAS user interface webpages for the intersection of Wufu 1st Rd. and Guanghua 1st Rd. and for the southwest corner of the intersection of Jiuru 2nd Rd. and Ziyou 1st Rd. The system is built with a Flask web interface, providing real-time image monitoring and analysis of traffic conditions. The interface mainly consists of real-time image display and traffic flow analysis. The traffic flow analysis section includes pedestrian and vehicle flow, weather conditions, event notifications, construction announcements, and data on noise and particulate matter. These messages enable users to access multidimensional traffic information simultaneously, thereby enhancing the timeliness and accuracy of traffic management and decision-making.

4.6. PTAS Traffic Analysis Results Assessment

Table 5 assesses different large language models (LLMs) used in the PTAS for traffic flow analysis, including GPT-4o, Claude Sonnet 4, and Gemini-2.5-flash. By examining each model’s analysis factors and information completeness on the same traffic dataset, PTAS can assess the applicability and perspective differences of these models in intelligent traffic analysis scenarios.
For the traffic conditions on 20 August 2025, from the afternoon to evening (15:10–20:00), the study integrates the analysis results from GPT-4o, Claude Sonnet 4, and Gemini-2.5-flash. The analysis covers seven types of traffic participants (cars, motorcycles, buses, trucks, lorries, bicycles, and pedestrians) and records their movement directions and passage counts at 10 min intervals. The evening peak occurred mainly between 18:00 and 19:10, with motorcycle flows exceeding 100 multiple times and car flows approaching 120, resulting in high intersecting traffic and congestion along major roads. Some models also indicated localized congestion during the morning peak (07:00–09:00), primarily due to narrow lanes and dense intersections.
All three models reflect data limitations, such as the absence of complete full-day datasets, signal cycle information, vehicle speed and travel delay indicators, and background information like construction or illegal parking, which restrict precise analysis. Accident data indicate that recent incidents have occurred in the Qianzhen and Sanmin districts, but not at the studied intersections, thus limiting the local impact.
From an environmental and health perspective, the AQI on that day ranged from good to moderate (25–48), suitable for outdoor activities for the general population. At the same time, the system advises sensitive groups to minimize prolonged stays at intersections. Noise monitoring at Lingya station recorded levels of 55–57 dB during the daytime, 54–56 dB in the evening, and 51–54 dB at night, which is normal for urban roads but slightly higher than average. Fude station recorded approximately 65 dB during daytime/evening, potentially affecting nearby residents. Recommended measures include fine-tuning traffic signals, optimizing motorcycle waiting areas, channelization management, controlling illegal parking, reducing horn usage and modified exhausts, and improving pavement quality and sound insulation.
The PTAS traffic analysis results for the second location, the southwest corner of the Jiuru 2nd Rd. and Ziyou 1st Rd. intersection, cover traffic data for cars, motorcycles, trucks, and pedestrians, recording directional flows at 10 min intervals. Pedestrian data are relatively sparse, and PTAS does not differentiate between buses and bicycles. Peak traffic periods occurred from 7:00 to 9:00 a.m. and from 5:00 to 7:00 p.m., with high traffic volumes prone to congestion. In contrast, pedestrian peaks occurred between 12:00 and 14:00, indicating moderate midday activity that may have been influenced by nearby commercial areas and lunch-hour crossings, albeit with a limited impact on vehicle congestion. The data only cover specific intervals, lacking full-day continuous monitoring. Noise monitoring data from 2024 may not accurately reflect conditions on the analysis day. An accident occurred at Jiuru Overpass 1 at 23:24, which may have caused localized nighttime traffic disruption. Construction data are unavailable, preventing assessment of their impact. The air quality was good, with the Fuxing station AQI at 43, suitable for outdoor activities. Notably, the system cannot give noise protection recommendations due to the absence of real-time noise monitoring.

4.7. Performance Evaluation

The study conducted a performance evaluation of the object detection models within the PTAS, presenting results as averages from the intersection of Wufu 1st Road and Guanghua 1st Road and the southwest corner of the intersection of Jiuru 2nd Road and Ziyou 1st Road. Test metrics include FPS, Precision, Recall, and F1-score. The models evaluated comprise six versions: YOLOv8n, RG-YOLOv8n, DuCRG-YOLOv8n, YOLOv11n, RG-YOLOv11n, and DuCRG-YOLOv11n. By comparing the detection efficiency and accuracy of each model across the same test dataset, the study analyzed the suitability and performance differences of these models in real-time traffic monitoring scenarios, providing a basis for subsequent model selection and optimization, as shown in Table 6. The results show that DuCRG-YOLOv11n achieved the best overall performance across all indicators at both locations: the intersection of Wufu 1st Rd. and Guanghua 1st Rd. and the southwest corner of the intersection of Jiuru 2nd Rd. and Ziyou 1st Rd. in Kaohsiung City.
Speed (Frames Per Second, FPS) represents the number of frames processed per second and is used to evaluate the model’s processing efficiency and real-time performance. Equation (6) shows the $FPS$ calculation, where $f$ denotes the total number of processed frames and $t$ represents the total processing time in seconds.
$FPS = \frac{f}{t}$ (6)
Precision (%) indicates the proportion of correctly predicted positive samples among all samples predicted as positive, measuring the accuracy of the model’s positive predictions. Equation (7) calculates $Precision$, where $TP$ (True Positives) represents the number of samples correctly identified as positive, and $FP$ (False Positives) refers to the number of samples incorrectly identified as positive.
$Precision = \frac{TP}{TP + FP}$ (7)
Recall (%) measures the proportion of actual positive samples that are correctly predicted as positive, reflecting the model’s ability to identify all relevant instances. Equation (8) computes $Recall$, where $TP$ (True Positives) are positive samples the model correctly predicts and $FN$ (False Negatives) are positive samples the model failed to detect.
$Recall = \frac{TP}{TP + FN}$ (8)
The F1-Score represents the harmonic mean of Precision and Recall, offering a balanced evaluation of a model’s accuracy and completeness. Because both Precision and Recall share the same numerator ($TP$) but differ in their denominators, employing the harmonic mean is mathematically appropriate. Equation (9) defines the $F1\_Score$. A higher $F1\_Score$ reflects better overall model performance, particularly when dealing with imbalanced datasets. Its value ranges from 0 to 1, where values closer to 1 indicate superior predictive capability.
$F1\_Score = \frac{2 \times (Precision \times Recall)}{Precision + Recall}$ (9)
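Equations (6)-(9) combined into one helper; the counts below are illustrative and chosen so that FPS and Precision reproduce the values reported for DuCRG-YOLOv11n (68.25 FPS, 91.4%), while the resulting Recall and F1 are not the paper's figures:

```python
def detection_metrics(tp, fp, fn, frames, seconds):
    """Compute FPS, Precision, Recall, and F1-Score per Eqs. (6)-(9)."""
    fps = frames / seconds                               # Eq. (6)
    precision = tp / (tp + fp)                           # Eq. (7)
    recall = tp / (tp + fn)                              # Eq. (8)
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (9)
    return fps, precision, recall, f1

# Illustrative counts: 914 TP, 86 FP, 120 FN over 2730 frames in 40 s,
# giving FPS = 68.25 and Precision = 0.914.
fps, p, r, f1 = detection_metrics(914, 86, 120, 2730, 40)
```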
Next, Table 7 presents the tracking performance metrics of the ByteTrack and DeepSORT algorithms combined with each object detection model for the intersection of Wufu 1st Rd. and Guanghua 1st Rd., the southwest corner of the intersection of Jiuru 2nd Rd. and Ziyou 1st Rd., and the average across both locations. Evaluation metrics include Multiple Object Tracking Accuracy (MOTA), Identification F1 Score (IDF1), and Multiple Object Tracking Precision (MOTP). The results indicate that ByteTrack paired with DuCRG-YOLOv11n achieved the best tracking performance at both locations, outperforming other model-algorithm combinations.
MOTP measures the precision of the tracker’s output bounding boxes relative to the corresponding ground-truth boxes. Its range is 0 to 1, with values closer to 0 indicating smaller localization errors and a more accurate overlap between predicted and actual objects. In the py-motmetrics package, MOTP is calculated as the average of $1 - IoU$, representing the mean localization error across all matched object pairs.
The MOTP calculation, as shown in Equation (10), is based on the average localization error between correctly matched detection–ground-truth pairs over all frames. Specifically, $d_{t,i}$ denotes the distance error ($1 - IoU$) between the $i$-th correctly matched pair in frame $t$; $C_t$ represents the number of correct matches in frame $t$; $\sum_{t,i} d_{t,i}$ refers to the cumulative localization error across all successfully matched pairs in all frames; and $\sum_t C_t$ indicates the total number of correct matches across all frames, which serves as the denominator in the computation.
$MOTP = \frac{\sum_{t,i} d_{t,i}}{\sum_t C_t}$ (10)
$IDF1$ evaluates the accuracy of target identity (ID) assignment, reflecting the tracker’s ability to maintain consistent identity recognition throughout the tracking process. The $IDF1$ value ranges from 0 to 1, with higher values indicating more precise and stable identity tracking; a value of 1 represents perfect identity maintenance. It serves as a key comprehensive metric for assessing the overall accuracy and reliability of the tracker.
The $IDF1$ calculation, as shown in Equation (11), is defined by the harmonic mean of precision and recall based on identity matching. In the formulation, $IDTP$ represents the number of correctly identified targets (true positives), $IDFP$ denotes the number of incorrect identity assignments (false positives), and $IDFN$ refers to the number of targets that the system should have identified but missed (false negatives). The metric thus provides a balanced evaluation of the tracker’s identity preservation and detection completeness.
$IDF1 = \frac{2 \times IDTP}{2 \times IDTP + IDFP + IDFN}$ (11)
$MOTA$ provides a quantitative assessment of overall tracking performance by simultaneously considering missed targets, false detections, and identity switches. It serves as a comprehensive indicator for evaluating tracker accuracy and reliability, enabling a systematic comparison of different tracker–detector model combinations and offering a solid basis for subsequent system optimization and model selection.
The calculation of $MOTA$, as expressed in Equation (12), incorporates three primary error components accumulated across all frames: false negatives (missed detections), false positives (incorrect detections), and ID switches (identity mismatches). For each frame $t$, $FN_t$ represents the number of missed targets, $FP_t$ denotes the number of false detections, $IDS_t$ indicates the number of identity switches, and $GT_t$ refers to the total number of ground-truth objects. The $MOTA$ score is then computed as the normalized complement of the total error ratio, providing a direct measure of tracking consistency and precision over time.
$MOTA = 1 - \frac{\sum_t (FN_t + FP_t + IDS_t)}{\sum_t GT_t}$ (12)
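Equations (10)-(12) implemented over per-frame tallies, with toy values and the py-motmetrics convention of $1 - IoU$ distances for MOTP:

```python
def motp(errors_per_frame):
    # Eq. (10): mean 1-IoU error over all correctly matched pairs, all frames.
    # errors_per_frame: list of per-frame lists of d_{t,i} values.
    all_errors = [d for frame in errors_per_frame for d in frame]
    return sum(all_errors) / len(all_errors)

def idf1(idtp, idfp, idfn):
    # Eq. (11): harmonic mean of identity-matching precision and recall.
    return 2 * idtp / (2 * idtp + idfp + idfn)

def mota(fn, fp, ids, gt):
    # Eq. (12): fn, fp, ids, gt are per-frame lists of FN_t, FP_t, IDS_t, GT_t.
    return 1 - (sum(fn) + sum(fp) + sum(ids)) / sum(gt)
```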
The Kaohsiung City Government, in collaboration with Taiwan Shih-Hsi Engineering Consulting Co., Ltd., a local company based in Taipei, Taiwan, successfully implemented an intelligent traffic signal control system for three major traffic congestion hotspots in the city in 2025 (https://www.mirrormedia.mg/story/20250625cnt007 (accessed on 2 December 2025)). The existing system has reduced congestion at these hotspots by 11% to 25%, significantly improving traffic efficiency and bringing a smoother commuting experience to Kaohsiung citizens. Relative to the observed performance of the legacy system, the enhanced DuCRG-YOLOv11n with our proposed $\beta silu$ activation function is estimated to achieve a 48.11% increase in FPS and a 3.04% increase in object detection precision. Notably, the existing system lacks object tracking models. Moreover, the LLM-based assessment of our approach cannot be directly compared with the existing system because it contains no LLMs.

4.8. Discussion

The study utilized the intelligent PTAS to analyze multi-source data, including vehicle and pedestrian flow, weather conditions, event notifications, construction announcements, noise, and particulate matter, demonstrating its practical application value in smart city traffic management.
Regarding object detection, DuCRG-YOLOv11n outperformed other models, achieving a speed of 68.25 FPS. Additionally, it maintains the precision to identify multiple vehicle types effectively and exhibits stable and reliable performance in real-time video monitoring.
The study experimented with various YOLO models combined with tracking algorithms for object tracking. The results indicated that DuCRG-YOLOv11n, paired with ByteTrack, performed the best, achieving an MOTA of 0.7195, an IDF1 of 0.851, and a MOTP of 0.08735. Further experiments demonstrated that the approach maintained over 90% tracking accuracy in medium- to low-density traffic scenarios while effectively reducing ID switches. However, brief ID changes could still occur in high-density or heavily occluded environments, indicating room for further optimization of tracking performance.
Additionally, the study compared the analysis results of three LLMs: GPT-4o, Claude Sonnet 4, and Gemini-2.5-Flash. All three models provided sufficient data interpretation capabilities, with Claude Sonnet 4 providing a more comprehensive view in detailed analysis and actionable recommendations. According to Claude Sonnet 4, the intersection of Wufu 1st Rd. and Guanghua 1st Rd. has experienced significant evening peak traffic flow from 17:00 to 18:30, with concentrated flows of motorcycles and cars consistent with typical commuting patterns. Pedestrian flow was relatively sparse (approximately 4–9 people per 10 min), not forming significant walking hotspots. No accidents were reported during the study period, indicating a limited impact on traffic. However, the data only covered a short time window and lacked information on intersection geometry, signal cycles, and lane configurations, which limited the precision of the analysis. Image recognition results may also have errors due to changes in lighting or occlusions.
Related studies support these findings. Tao et al. [10] proposed a YOLO-based traffic scene object detection system to overcome the speed and accuracy limitations of traditional detectors. Hu et al. [11] designed a learning-based detection framework for traffic signs, vehicles, and riders. Similarly, this study used YOLO to detect vehicles and pedestrians at intersections, confirming the method's feasibility across traffic applications and aligning with previous studies in method selection and detection performance.
Regarding LLMs, Movahedi et al. [13] applied LLMs to adaptive traffic signal control, significantly improving control efficiency and flexibility. Villarreal et al. [25] used ChatGPT for complex mixed-traffic control, demonstrating the potential of LLMs in intelligent traffic management. Moraga et al. [28] integrated UAVs, IoT sensors, and an LLM (Gemini) to dynamically adjust vehicle speeds, notably reducing urban congestion and CO2 emissions. This study likewise incorporated LLMs into a traffic system, showing that they can provide actionable recommendations and aid decision-making in complex scenarios. These results are consistent with previous research and further validate the reliability and practical value of LLMs in intelligent traffic applications.
Although [11] and many prior studies have proposed integrated frameworks for traffic object detection, their contributions focus mainly on dense feature sharing, subcategory partitioning, and improving detector generalization. They do not address intersection-level traffic flow statistics, peak-hour analysis, or congestion cause identification, all of which require temporal modeling and an understanding of intersection dynamics. This study integrates real-time intersection image analysis with temporal traffic flow inference into a unified system and is, to the best of our knowledge, the first in this field to combine Retrieval-Augmented Generation (RAG) with real-time information indexing. Unlike traditional approaches [11,12,13,14] limited to static object detection or tracking, the proposed method incorporates external traffic knowledge and real-time incident reports during inference, enabling the identification of abnormal temporal patterns, the tracing of congestion causes (e.g., accidents, signal failures, special events), and improved robustness in dense traffic scenarios. The contribution of this work therefore extends beyond object detection to traffic understanding and temporal reasoning.
Compared with the feature-level improvements in [11], our system, which includes the 10-factor analysis described in Table 5, integrates flow information, external knowledge, and temporal characteristics, offering greater practical value for peak-hour traffic analysis and congestion cause inference. In terms of speed and precision of object detection and tracking, the proposed approach outperforms all benchmarks listed in Table 7. Whereas [11] reports a detection time of 0.65 s per video frame, our system reduces this to 0.0146 s per frame, approximately a 45-fold speedup.
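The speed comparison above is simple arithmetic on the reported figures (0.65 s per frame for [11] versus 68.25 FPS for DuCRG-YOLOv11n); a quick check:

```python
# Figures taken from the text: [11] reports 0.65 s per frame;
# DuCRG-YOLOv11n runs at 68.25 FPS (Table 6).
baseline_s = 0.65
ours_fps = 68.25
ours_s = 1.0 / ours_fps        # about 0.0146 to 0.0147 s per frame
speedup = baseline_s / ours_s  # about 44 to 45 times the baseline
```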
The system shows promise for intelligent traffic monitoring and analysis but has limitations. Insufficient data volume or missing key information can undermine the credibility of the analysis, potentially causing misjudgments in low-sample scenarios and affecting the reliability of traffic decision-making. Differences in camera image quality significantly affect the stability of object detection; low resolution or noise interference may leave vehicles or pedestrians only partially recognized. Common challenges in traffic environments, such as glare and occlusion, remain difficult to overcome, and high-density traffic or extreme lighting conditions can still cause recognition errors even with optimized models.
Future improvements could include expanding data collection to cover all-day and multi-day periods, integrating intersection geometry and surrounding facility information, and employing multi-angle cameras with low-light enhancement techniques to improve system accuracy and robustness.

5. Conclusions

The study proposes a sovereign AI-driven intelligent transportation governance approach, developing a mobile AI solution equipped with multimodal perception, task decomposition, memory, reasoning, and multi-agent collaboration capabilities. It integrates computer vision, multi-object tracking, natural language processing, Retrieval-Augmented Generation (RAG), and Large Language Models (LLMs) to construct a Pipeline-based Traffic Analysis System (PTAS). The PTAS can instantly quantify intersection traffic flow and incorporate potential risk factors such as traffic accidents, construction, and weather conditions into multimodal data fusion analysis, providing forward-looking traffic insights. The PTAS developed in the study can accurately analyze peak traffic periods and congestion causes, effectively describe data content, and highlight missing information. The analysis results can be delivered immediately to authorities and the public, enabling timely measures to alleviate current traffic issues. The study therefore validates the practical applicability of the approach in smart city traffic management.
In terms of data, coverage can be further expanded across different time periods and road sections, while real-time image recognition and multi-object tracking can be improved under low-light and occluded conditions. Regarding methodological improvements, integrating cloud computing with edge devices could enable deeper multimodal analysis and rapid response after information fusion, enhancing the system's accuracy and stability in real-time traffic perception and anomaly detection. Newer models, such as YOLOv13, could further strengthen performance through lightweight structural modifications and parameter-optimized training, achieving both high speed and accuracy. Finally, the research can be applied to smart city traffic decision support, autonomous vehicle environment perception, and dynamic traffic scheduling during disasters, leveraging its advantages in immediacy and intelligence.

Author Contributions

B.R.C. and C.-C.C. conceived and designed the experiments, H.-F.T. collected the dataset and proofread the manuscript, and B.R.C. wrote the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This work was fully supported by the National Science and Technology Council (NSTC), Taiwan, Republic of China, under grant numbers NSTC 114-2221-E-390-003 and NSTC 114-2622-E-390-002.

Data Availability Statement

The sample codes in Sample Code for VLM.zip at the following link were used to support the findings of the paper: https://drive.google.com/open?id=1QKMk7QXxHUG0M0Og5z4PwWRI0JVVq3kM&usp=drive_fs (accessed on 2 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Putri, T.D. Intelligent Transportation Systems (ITS): A Systematic Review Using a Natural Language Processing (NLP) Approach. Heliyon 2021, 7, e08615. [Google Scholar] [CrossRef] [PubMed]
  2. Zhang, J.; Huang, J.; Jin, S.; Lu, S. Vision-Language Models for Vision Tasks: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5625–5644. [Google Scholar] [CrossRef] [PubMed]
  3. Voulodimos, A.; Doulamis, N.; Doulamis, A.; Protopapadakis, E. Deep Learning for Computer Vision: A Brief Review. Comput. Intell. Neurosci. 2018, 2018, 7068349. [Google Scholar] [CrossRef] [PubMed]
  4. Luo, W.; Xing, J.; Milan, A.; Zhang, X.; Liu, W.; Kim, T.-K. Multiple Object Tracking: A Literature Review. Artif. Intell. 2021, 293, 103448. [Google Scholar] [CrossRef]
  5. Chowdhary, K.R. (Ed.) Natural Language Processing. In Fundamentals of Artificial Intelligence; Springer: Berlin/Heidelberg, Germany, 2020; pp. 603–649. [Google Scholar] [CrossRef]
  6. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-T.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar] [CrossRef]
  7. Wang, S.; Yang, H.; Liu, W. Research on the construction and application of retrieval enhanced generation (RAG) model based on knowledge graph. Sci. Rep. 2025, 15, 40425. [Google Scholar] [CrossRef]
  8. Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Mian, A. A Comprehensive Overview of Large Language Models. ACM Trans. Intell. Syst. Technol. 2025, 16, 106. [Google Scholar] [CrossRef]
  9. Oladimeji, D.; Gupta, K.; Kose, N.A.; Gundogan, K.; Ge, L.; Liang, F. Smart Transportation: An Overview of Technologies and Applications. Sensors 2023, 23, 3880. [Google Scholar] [CrossRef]
  10. Tao, J.; Wang, H.; Zhang, X.; Li, X.; Yang, H. An Object Detection System Based on YOLO in Traffic Scene. In Proceedings of the 2017 6th International Conference on Computer Science and Network Technology (ICCSNT), Dalian, China, 21–22 October 2017; pp. 315–319. [Google Scholar] [CrossRef]
  11. Hu, Q.; Paisitkriangkrai, S.; Shen, C.; van den Hengel, A.; Porikli, F. Fast Detection of Multiple Objects in Traffic Scenes with a Common Detection Framework. IEEE Trans. Intell. Transp. Syst. 2016, 17, 1002–1014. [Google Scholar] [CrossRef]
  12. Jiménez-Bravo, D.M.; Murciego, Á.L.; Mendes, A.S.; Sánchez San Blás, H.; Bajo, J. Multi-Object Tracking in Traffic Environments: A Systematic Literature Review. Neurocomputing 2022, 494, 43–55. [Google Scholar] [CrossRef]
  13. Movahedi, M.; Choi, J. The Crossroads of LLM and Traffic Control: A Study on Large Language Models in Adaptive Traffic Signal Control. IEEE Trans. Intell. Transp. Syst. 2024, in press. [Google Scholar] [CrossRef]
  14. Sohan, M.; Sai Ram, T.; Rami Reddy, C.V. A Review on YOLOv8 and Its Advancements. In Algorithms for Intelligent Systems: Data Intelligence and Cognitive Informatics; Springer Nature: Singapore, 2024; pp. 529–545. [Google Scholar] [CrossRef]
  15. He, L.; Zhou, Y.; Liu, L.; Cao, W.; Ma, J. Research on object detection and recognition in remote sensing images based on YOLOv11. Sci. Rep. 2025, 15, 14032. [Google Scholar] [CrossRef] [PubMed]
  16. Alif, M.A.R. YOLOv11 for Vehicle Detection: Advancements, Performance, and Applications in Intelligent Transportation Systems. arXiv 2024, arXiv:2410.22898. [Google Scholar] [CrossRef]
  17. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. YOLOv10: Real-Time End-to-End Object Detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar] [CrossRef]
  18. Chang, B.R.; Tsai, H.-F.; Syu, J.-S. Implementing High-Speed Object Detection and Steering Angle Prediction for Self-Driving Control. Electronics 2025, 14, 1874. [Google Scholar] [CrossRef]
  19. Zhong, J.; Chen, J.; Mian, A. DualConv: Dual Convolutional Kernels for Lightweight Deep Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 9528–9535. [Google Scholar] [CrossRef]
  20. Kumar, P. Large Language Models (LLMs): Survey, Technical Frameworks, and Future Challenges. Artif. Intell. Rev. 2024, 57, 260. [Google Scholar] [CrossRef]
  21. De Zarzà, I.; de Curtò, J.; Roig, G.; Calafate, C.T. LLM Multimodal Traffic Accident Forecasting. Sensors 2023, 23, 9225. [Google Scholar] [CrossRef]
  22. Liu, C.; Yang, S.; Xu, Q.; Li, Z.; Long, C.; Li, Z.; Zhao, R. Spatial-Temporal Large Language Model for Traffic Prediction. In Proceedings of the 2024 25th IEEE International Conference on Mobile Data Management (MDM), Singapore, 24–27 June 2024; pp. 31–40. [Google Scholar] [CrossRef]
  23. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  24. Nazir, A.; Wang, Z. A Comprehensive Survey of ChatGPT: Advancements, Applications, Prospects, and Challenges. Meta-Radiology 2023, 1, 100022. [Google Scholar] [CrossRef]
  25. Villarreal, M.; Poudel, B.; Li, W. Can ChatGPT Enable ITS? The Case of Mixed Traffic Control via Reinforcement Learning. In Proceedings of the 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), Rhodes, Greece, 17–20 September 2023; pp. 3749–3755. [Google Scholar] [CrossRef]
  26. Anthropic. System Card: Claude Opus 4 & Claude Sonnet 4; Anthropic: San Francisco, CA, USA, 2025; Available online: https://www.anthropic.com/claude-4-system-card (accessed on 10 December 2025).
  27. Google DeepMind. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities; Technical Report; Google DeepMind: Mountain View, CA, USA, 16 June 2025; Available online: https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf (accessed on 15 December 2025).
  28. Moraga, Á.; de Curtò, J.; de Zarzà, I.; Calafate, C.T. AI-Driven UAV and IoT Traffic Optimization: Large Language Models for Congestion and Emission Reduction in Smart Cities. Drones 2025, 9, 248. [Google Scholar] [CrossRef]
  29. Ma, L.; Zhang, R.; Han, Y.; Yu, S.; Wang, Z.; Ning, Z.; Zhang, J.; Xu, P.; Li, P.; Ju, W.; et al. A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge. arXiv 2023, arXiv:2310.11703. [Google Scholar] [CrossRef]
  30. Xu, H.; Yuan, J.; Zhou, A.; Xu, G.; Li, W.; Ban, X.; Ye, X. GenAI-Powered Multi-Agent Paradigm for Smart Urban Mobility: Opportunities and Challenges for Integrating Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) with Intelligent Transportation Systems. arXiv 2024, arXiv:2409.00494. [Google Scholar]
  31. Hou, X.; Wang, Y.; Chau, L.P. Vehicle Tracking Using Deep SORT with Low Confidence Track Filtering. In Proceedings of the 16th IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), Auckland, New Zealand, 4–7 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6. [Google Scholar] [CrossRef]
  32. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 1–21. [Google Scholar] [CrossRef]
Figure 1. Object detection and image recognition model YOLOv8.
Figure 2. Object detection and image recognition model YOLOv11.
Figure 3. Workflow of RepGhost Bottleneck.
Figure 4. Workflow of Dual Convolution.
Figure 5. Transformer block of the LLaMA model. Source: https://medium.com/@vi.ai_/exploring-and-building-the-llama-3-architecture-a-deep-dive-into-components-coding-and-43d4097cfbbb (accessed on 2 December 2025).
Figure 6. Operation Process of Retrieval-Augmented Generation (RAG).
Figure 7. Workflow of Deep SORT.
Figure 8. Operation Process of ByteTrack. Source: https://datature.io/blog/introduction-to-bytetrack-multi-object-tracking-by-associating-every-detection-box (accessed on 2 December 2025).
Figure 9. PTAS Workflow.
Figure 10. A screenshot from a live video feed from the intersection of Wufu 1st Rd and Guanghua 1st Rd in Kaohsiung City. Raw image source: https://cctv2.kctmc.nat.gov.tw/88b6eddf/&t=1744354849787 (accessed on 2 December 2025).
Figure 11. A screenshot of the real-time video feed from the southwest corner of the intersection of Jiuru 2nd Rd. and Ziyou 1st Rd. in Kaohsiung City. Raw image source: https://cctv1.kctmc.nat.gov.tw/b045c1d4/&t=1749010635776 (accessed on 2 December 2025).
Figure 12. Object detection and image recognition model RG-Yolov8n.
Figure 13. Object detection and image recognition model RG-Yolov11n.
Figure 14. Object detection and image recognition model DuCRG-Yolov8n.
Figure 15. Object detection and image recognition model DuCRG-Yolov11n.
Figure 16. Virtual road intersection surrounded by four red lines. Raw image source: https://cctv2.kctmc.nat.gov.tw/88b6eddf/&t=1744354849787 (accessed on 2 December 2025).
Figure 17. Collected Kaohsiung City Short Real-Time Traffic Information. Source: https://demos.kcg.gov.tw/TmcQuery_Web.aspx (accessed on 2 December 2025).
Figure 18. Collected Road Construction Information. Source: https://pipegis.kcg.gov.tw/Homepage/index.aspx (accessed on 2 December 2025).
Figure 19. Collected Noise Monitoring Information. Source: https://lab.ksepb.kcg.gov.tw/Noise/NoiseDataDisplay.asp (accessed on 2 December 2025).
Figure 20. Cloud Data Storage and Management Solution.
Figure 21. RAG Data Retrieval and Integration.
Figure 22. Vertex AI RAG and LLMs Analysis Application.
Figure 23. LLMs Deep Semantic Analysis and Reasoning.
Figure 24. LLMs Analysis Output and Decision Support.
Figure 25. An Example of a user interface webpage in PTAS. Raw image source: https://cctv2.kctmc.nat.gov.tw/88b6eddf/&t=1744354849787 (accessed on 2 December 2025).
Figure 26. PTAS Object detection at the Wufu 1st Rd. and Guanghua 1st Rd. in Kaohsiung City. Raw image source: https://cctv2.kctmc.nat.gov.tw/88b6eddf/&t=1744354849787 (accessed on 2 December 2025). (a) Live footage of a crossroads captured on 20 August at 11:41; (b) Live footage of the crossroads captured on 20 August at 20:08; (c) Live footage of a crossroads captured on 16 June at 16:19, including pedestrian flow; (d) Live footage of a crossroads captured on 5 June at 19:22, including pedestrian flow. A green box indicates that target detection has been completed.
Figure 27. PTAS Object detection at the southwest corner of Jiuru 2nd Rd. and Ziyou 1st Rd. in Kaohsiung City. Raw image source: https://cctv1.kctmc.nat.gov.tw/b045c1d4/&t=1749010635776 (accessed on 2 December 2025). (a) Live footage of the crossroads captured on 20 August at 13:52; (b) Live footage of the crossroads captured on 20 August at 19:51; (c) Live footage of a crossroads captured on 19 November at 16:42, including pedestrian flow; (d) Live footage of a crossroads captured on 18 November at 17:27, including pedestrian flow. A green box indicates that target detection has been completed.
Figure 28. Corpus Construction and Text Embedding.
Figure 29. Information was retrieved using RAG at the intersection of Wufu 1st Rd. and Guanghua 1st Rd.
Figure 30. Information was retrieved using RAG at the intersection of Jiuru 2nd Rd. and Ziyou 1st Rd., southwest corner.
Figure 31. User Interface webpage of the PTAS at Wufu 1st Rd. & Guanghua 1st Rd. Intersection. Raw image source: https://cctv1.kctmc.nat.gov.tw/b045c1d4/&t=1749010635776 (accessed on 2 December 2025).
Figure 32. User Interface webpage of the PTAS at the southwest corner intersection of Jiuru 2nd Rd. and Ziyou 1st Rd. Raw image source: https://cctv2.kctmc.nat.gov.tw/88b6eddf/&t=1744354849787 (accessed on 2 December 2025).
Table 1. Hardware specifications.

| Resource | Workstation |
|---|---|
| GPU | NVIDIA GeForce RTX 4070 Ti |
| CPU | Intel(R) Xeon(R) W-2223 CPU @ 3.60 GHz |
| Memory | DDR5-5600 RAM 16 GB × 2 |
| Storage | KXG60ZNV512G KIOXIA 1 TB |
Table 2. Recipe of packages.

| Software | Version |
|---|---|
| Python | 3.9.18 |
| Anaconda | 23.7.4 |
| Flask | 3.1.0 |
| Spyder | 5.5.1 |
| Jupyter Notebook | 1.0.0 |
| OpenCV | 4.11.0.86 |
| DarkLabel | 2.4 |
Table 3. Data Collection.

| Number of Images | Object Classification | Data Division Ratio |
|---|---|---|
| 6633 sheets | 5 types (pedestrians, vehicles, trucks, buses, motorcycles) | Training set 87.5% (for detection only); Validation set 8.3% (for detection only); Test set 4.2% (for both detection and tracking) |
Table 4. Data Collection for Transfer Learning Stage.

| Number of Images | Object Classification | Data Division Ratio |
|---|---|---|
| 1000 sheets | 5 types (pedestrians, vehicles, trucks, buses, motorcycles) | Training set 80% (for detection only); Test set 20% (for both detection and tracking) |
Table 5. Assessment of Various LLMs for Traffic Flow Analysis.

| Factor | GPT-4o | Claude Sonnet 4 | Gemini-2.5-Flash |
|---|---|---|---|
| Data Description | Simplified summary focusing on covered time, tool types, and directions | Very clear, listing data sources and analysis details | Provides actual numerical examples (e.g., 17:00–17:10, vehicle and pedestrian flow data) |
| Peak Hour Analysis | Simplified as mainly 18:10–19:00 | Pinpoints 18:10–19:00, highlighting peak for vehicles and motorcycles | Identifies dual peaks at 07:00–09:00 and 17:00–19:00 |
| Congestion Cause Analysis | Attributed to commuting concentration and arterial road pressure | Due to concentrated evening demand and arterial road intersection pressure | High traffic volume, narrow road width, and multiple intersections |
| Traffic Accident Analysis & Suggestions | Mentions accident in Qianzhen District (not at the intersection), but notes diversion effect | Mentions accident in Qianzhen District (not at the intersection), but notes diversion effect | Mentions accidents in Qianzhen and Sanmin, emphasizing “no accidents at the intersection.” |
| Construction Impact & Suggestions | Briefly states “no construction.” | Clearly states “no construction.” | States “no construction.” |
| Air Quality Analysis | Summarized as “Good, suitable for outdoor activities.” | Provides data from Qianjin and Fuxing stations (AQI 25–43), concluding “suitable for outdoor activities.” | Mentions the same stations, with a consistent conclusion |
| Noise Analysis | Summarized as “Lingya station meets standards, minor impact.” | Provides Lingya station data (day/evening/night), confirms standards met, minor impact. | Provides Lingya + Fude station data, notes higher values at Fude, suggests soundproofing |
| Data Gaps Noted | Mentions some limitations (e.g., lack of location details) | Lists comprehensively (time, speed, signals, capacity, weather, etc.) | Points out missing data such as “vehicle speed, flow rate.” |
| Structure, Logic & Readability | Highly condensed, concise, and easy to read | Clear and well-structured, complete point by point | Bullet-point style with many examples, slightly lengthy and cluttered |
| Improvement Suggestions | Divert peak-hour traffic, continue monitoring | Provides specific suggestions to supplement data | Suggests soundproofing measures |
| Mode of Expression | Provides a concise view | Provides a comprehensive view | Provides a comprehensive view |
Table 6. Performance Evaluation of Object Detection for Various Models.

| Metrics | Yolov8n | RG-Yolov8n | DuCRG-Yolov8n | Yolov11n | RG-Yolov11n | DuCRG-Yolov11n |
|---|---|---|---|---|---|---|
| Speed (fps) | 46.08 | 46.72 | 47.25 | 63.52 | 65.18 | 68.25 |
| Precision (%) | 88.7 | 88.8 | 89.0 | 89.4 | 89.6 | 91.4 |
| Recall (%) | 86.8 | 86.2 | 82.7 | 86.1 | 86.3 | 87.1 |
| F1_score | 0.862 | 0.875 | 0.857 | 0.877 | 0.878 | 0.4435 |
Table 7. Performance Evaluation of Various Tracking in PTAS for Two Instances.

ByteTrack:

| Metrics | Yolov8n | RG-Yolov8n | DuCRG-Yolov8n | Yolov11n | RG-Yolov11n | DuCRG-Yolov11n |
|---|---|---|---|---|---|---|
| MOTA | 0.5225 | 0.406 | 0.5445 | 0.5155 | 0.6205 | 0.7195 |
| IDF1 | 0.6765 | 0.701 | 0.7325 | 0.6775 | 0.741 | 0.851 |
| MOTP | 0.1153 | 0.1075 | 0.1012 | 0.1022 | 0.1076 | 0.08735 |

Deep SORT:

| Metrics | Yolov8n | RG-Yolov8n | DuCRG-Yolov8n | Yolov11n | RG-Yolov11n | DuCRG-Yolov11n |
|---|---|---|---|---|---|---|
| MOTA | 0.203 | 0.291 | 0.326 | 0.291 | 0.345 | 0.4165 |
| IDF1 | 0.623 | 0.6145 | 0.6345 | 0.599 | 0.656 | 0.679 |
| MOTP | 0.1158 | 0.1269 | 0.11005 | 0.10965 | 0.12815 | 0.1012 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
