1. Introduction
Unsignalized urban intersections are among the most complicated and dangerous road environments. The mix of vehicles and pedestrians makes these intersections even more hazardous. Motorcycles and bicycles often maneuver in blind spots, significantly increasing the likelihood of fatal collisions, especially given the lack of effective traffic control mechanisms [
1,
2,
3]. Traditional traffic control methods address issues only after they occur. As technology advances, systems that can detect and resolve potential problems in real time are strongly needed [
4,
5]. To address this problem, researchers have turned to congested urban settings in order to examine the effectiveness of innovative solutions for conflict detection and immediate resolution.
Simultaneously, the emergence of multimodal large language models (MLLMs), such as GPT-4o, has marked considerable advancement within AI. These models perform exceptionally well in logical reasoning, contextual comprehension, and decision-making [
6,
7]. MLLMs are capable of analyzing image and video data. Thus, coupled with context-aware and predictive traffic management, MLLMs can revolutionize the entire traffic industry [
8,
9]. One effective technique is the examination of drone-captured videos in which sequences of three overlapping frames are analyzed to identify conflicts. This method of tracking interactions over time helps understand the transitions between frames and helps classify traffic interactions as conflict or non-conflict conditions.
In this study, GPT-4o was applied to conduct MLLM-based traffic control at unsignalized intersections. The system architecture utilizes videos captured by drones, which help detect and classify traffic conflicts, provide descriptive details, and even suggest actions to the drivers at risk. In addition, conflict-oriented prompt optimization, fine-tuning, and time-series analysis were used to improve accuracy in forecasting and detecting conflicts. Building on the advanced visual reasoning of language models, this technique offers a flexible solution to the complex problem of intersection management by producing practical, realistic results that can be implemented directly [
10,
11].
For researchers, this work contributes to the growing field of AI-based traffic management by providing a data-driven, explainable method using cutting-edge multimodal LLMs. For practitioners, especially traffic engineers and urban planners, the system offers a scalable and cost-effective solution for enhancing safety at unsignalized intersections without requiring new infrastructure.
2. Literature Review
Traffic management and autonomous driving applications greatly benefit from MLLMs due to their flexible, responsive, and interpretable nature [
12,
13,
14]. One of the core strengths of these models is the ability to formulate specific and tractable recommendations for various stakeholders, including decision-makers, drivers, and engineers. In practice, MLLMs have led to the development of several innovations, such as internet-connected traffic lights and smart transportation corridors that improve the real-time management of traffic [
15,
16,
17]. At the same time, the incorporation of machine learning techniques in transportation has been systematically studied for its benefits and challenges [
18,
19].
Traffic conflict techniques (TCTs) have traditionally been employed to evaluate near-miss events as indicators of accident risks, relying on observed traffic interactions rather than actual crash data [
1]. These techniques define conflicts as situations where road users approach closely, necessitating evasive actions to avoid collisions [
2]. Common measures include time-to-collision (TTC), post-encroachment time (PET), and speed differential [
20]. Despite their effectiveness, traditional TCTs face limitations such as manual observations, observer bias, and challenges capturing comprehensive spatial-temporal data in complex intersections. Recently, drone-based video analysis has been adopted to overcome these limitations by offering an objective aerial perspective and continuous data recording capability [
3]. However, drone technologies introduce constraints such as altitude restrictions, limited fields-of-view, sensitivity to weather conditions, and data resolution challenges [
4].
More recent work features the use of Large Language Models (LLMs) in autonomous driving, which consists of four core capabilities: planning, perception, question-answering, and generation [
21,
22,
23]. The obstacles of clarity, scalability, and practicality are also noted by the LLM4DRIVE project [
19], alongside the need for sufficient datasets and explanations. A shift from sensor-centric approaches toward more deeply AI-driven self-driving systems is reported by [
24], who consider Vision Foundation Models (VFMs) an essential step toward better perception and decision-making. Also, [
25] presented a method that employs LLMs such as GPT-3.5 to answer questions more efficiently, identify scenarios, and comprehensively understand the situations presented.
LLM-based frameworks have also progressed well in forecasting traffic patterns and managing vehicles. For example, [
26] describes systems containing sequence and graph embedding layers that perform well in few-shot learning on historical datasets. DriveMLM [
17], an advancement that synchronizes multimodal LLMs with behavioral planning states, allows for combining language intentions and vehicle control gestures during the simulation. Additional attempts in [
24] investigate a more natural form of human-vehicle interaction using LLMs to process voice commands. At the same time, [
17] introduces AccidentGPT, which was developed to understand and reconstruct road traffic accidents and offer solutions for improving safety measures on the road.
A different strand of research concerns sensory data fusion with LLMs to provide better situational awareness of the system. In [
27], the LiDAR and Radar data are combined with LLM output to improve object detection and tracking. At the same time, in [
28], the prediction of human movements is based on contextual and visual information. In the same manner, [
29] researched driver-vehicle interaction in various physical activity and voice command combinations, and [
30] developed a method for monitoring real-time dashboard video, identifying dangerous driving behavior such as sudden driving maneuvers and other risks to safety in changing road situations.
Explainability has become an increasingly important factor in deploying LLM-based technologies in sensitive areas such as autonomous driving. Methods such as retrieval-augmented generation (RAG) and knowledge graphs increase users’ trust in system predictions with clear, justifiable outputs [
31]. At the same time, multimodal LLMs with comprehensive traffic foundation models are increasingly used in deeper transportation analytics [
32], and there remains a significant ongoing effort to use reinforcement learning to solve complex problems such as controlling unsignalized intersections [
33].
Despite these advancements, there remains a clear research gap in applying AI-driven methods to address real-time conflict detection specifically at unsignalized intersections. Existing techniques often fail to fully exploit real-time multimodal data for dynamic traffic scenarios, emphasizing the necessity for advanced, robust AI solutions capable of interpreting complex interactions and providing actionable insights instantaneously. This study addresses this gap by employing a fine-tuned MLLM, GPT-4o, specifically designed for analyzing bird’s eye view drone footage to dynamically detect and manage traffic conflicts at unsignalized intersections.
This research is unique because it uses a fine-tuned MLLM, GPT-4o, to apply to traffic management at unsignalized intersections. The method works by analyzing overlapping three-frame observations consisting of bird’s eye view images of 4-way intersections. The proposed technique identifies and classifies conflicts, provides descriptors explaining them, and implements heuristic strategies for the guidance of drivers. To improve intersection safety and efficiency, this method seeks to dynamically adapt to changing traffic conditions using a combination of MLLMs for both visual and temporal inference.
3. Materials and Methods
3.1. Overview of Framework
Our approach is novel in applying a fine-tuned multimodal LLM with structured temporal prompts and overlapping frame sequences to directly interpret real-time drone footage for traffic conflict resolution, which is a unique contribution compared to previous sensor- or rule-based systems. The multi-phase pipeline utilized in this study is shown in
Figure 1. In the Data Collection & Labeling stage, drone footage of intersections is obtained and split into triplets of frames separated by 0.5 s. Each triplet is marked as either conflict or no conflict. The Zero-Shot Conflict Detection phase comes next, in which GPT-4o and GPT-4o-mini are assessed without any training, using an iterative prompt design process intended to boost early classification accuracy. The next step is Fine-Tuning GPT-4o, where extensive training and validation sets are created to optimize the model for conflict detection. In the Explanation & Evaluation step, the fine-tuned model produces conflict alerts together with recommended actions, all of which are evaluated by traffic specialists for ease of understanding and usability. In the final Deployment step, the system is evaluated on a time-series dataset capturing a real-world scenario of continuous traffic, to determine GPT-4o’s ability to track the changing state of conflicts over time. The diagram includes the main processing stages, including the design and flow of prompt inputs for the GPT-4o models.
3.2. Data Collection and Labeling
The intersections in this study are unsignalized and operate under give-way rules, with the main road given priority. There were no traffic lights or stop signs involved, allowing the model to learn behavior based on natural vehicle interactions. The unsignalized four-legged intersections footage was sourced from the open-access ListDB RepTwo dataset [
34]. All videos were recorded with a DJI Phantom 4 drone flying at 50 m altitude with a 45° camera tilt at 30 fps, during four fixed weekday time slots (07:30–09:00, 10:00–11:30, 13:00–14:30, 15:30–17:00). These videos were segmented into triplets of frames spaced half a second apart, capturing specific regions of interest. Initially, 700 labeled observations, consisting of 350 conflict and 350 no-conflict scenarios, were created by analyzing vehicle movements, compliance with the priority rule, and near-miss activities. Of these, 504 observations were used for training, while the remaining 196 were split into 56 for validation and 140 for testing. A conflict label was assigned whenever time-to-collision (TTC) < 2 s or post-encroachment time (PET) < 1.5 s. All subsets had a balanced number of conflict and no-conflict situations.
A sliding-window approach was used to produce a more dynamic and realistic stream of intersection interactions. This method produced 1534 observations from four continuous videos, 291 of which had conflict and 1242 without. The first observation contained frames 1, 2, and 3, while the next had frames 2, 3, and 4. This more extensive test set evaluated GPT-4o’s ability to handle dynamic traffic conditions with multiple previous conflict states.
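The triplet construction and labeling rules described above can be sketched as follows. The helper names and the assumption that frames are sampled from a 30 fps stream follow Section 3.2; the dataset’s actual tooling is not specified in the paper.

```python
from typing import List, Tuple

FPS = 30
STEP = int(0.5 * FPS)  # frames are 0.5 s apart -> every 15th frame at 30 fps

def make_triplets(frame_indices: List[int]) -> List[Tuple[int, ...]]:
    """Overlapping three-frame observations: (f1, f2, f3), (f2, f3, f4), ..."""
    sampled = frame_indices[::STEP]
    return [tuple(sampled[i:i + 3]) for i in range(len(sampled) - 2)]

def label_conflict(ttc_s: float, pet_s: float) -> str:
    """Labeling rule from Section 3.2: conflict if TTC < 2 s or PET < 1.5 s."""
    return "conflict" if (ttc_s < 2.0 or pet_s < 1.5) else "no conflict"
```

Because consecutive triplets share two frames, each 0.5 s step yields one new observation, which is how four continuous videos produce the 1534-observation deployment set.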
3.3. Prompt Creation
Two prompts aimed at identifying conflicts were created for GPT-4o. Prompt 1 (P1) is contextual: it describes a four-legged intersection traffic situation and asks the classifier whether a conflict is present (yes/no answer). Prompt 2 (P2) contains additional elements, such as lane configurations and turning movements, which serve as richer contextual clues. In both cases, the answer was constrained to “yes” or “no.” In the time-series setting, P2 was further modified to include previous states so as to sustain the detection of both temporally persistent and newly emerging conflicts, as shown in
Table 1.
The structure of Prompt 2 directly incorporates intersection geometry, such as the number of lanes, priority direction, and turning options. These geometric configurations are essential in helping the model assess vehicle interactions. The model uses this spatial context to detect conflicts and generate explanations and recommendations.
3.4. Zero-Shot Evaluation
GPT-4o and GPT-4o-mini were evaluated with no previous training on the particular domain, using 140 observations (70 conflict and 70 no-conflict) for both P1 and P2. This served as a reference point for measuring the models’ baseline performance without any domain-specific training or fine-tuning. Precision, recall, accuracy, and F1-score were calculated to assess the models’ performance, showing how well they dealt with previously unseen traffic data. These metrics set a baseline for tracking the progress made through subsequent training and prompt adjustments.
3.5. Fine-Tuning MLLM
To enhance the identification of incidents, it was necessary to fine-tune GPT-4o with additional data points. A total of 504 training observations were collected (split evenly between cases of conflict and those with no conflict). Additionally, a set of 56 observations was created to validate the model’s performance while adjusting hyperparameters. The model was evaluated on a test set comprising the last 140 observations. The accuracy, precision, recall, F1-score, and other relevant metrics showcased the improvements obtained from targeted training and specially formulated prompt inputs.
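Preparing the 504 training observations for fine-tuning can be sketched in OpenAI’s chat-style JSONL format (one JSON object per line, with the desired “yes”/“no” answer as the assistant turn). The exact file layout and image handling used in the study are not published, so the structure below is an assumption.

```python
import json

def to_jsonl_line(prompt_content: list, label: str) -> str:
    """One fine-tuning example: user prompt (text + frames) plus gold answer."""
    record = {
        "messages": [
            {"role": "user", "content": prompt_content},
            {"role": "assistant",
             "content": "yes" if label == "conflict" else "no"},
        ]
    }
    return json.dumps(record)

def write_training_file(observations, path="train.jsonl"):
    """observations: iterable of (prompt_content, label) pairs."""
    with open(path, "w") as fh:
        for content, label in observations:
            fh.write(to_jsonl_line(content, label) + "\n")
```

The 56-observation validation file would be written the same way and supplied to the fine-tuning job for hyperparameter monitoring.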
3.6. Model Evaluation Metrics
All metrics of interest—accuracy, precision, recall, and F1 score—were used to measure model performance. These metrics are based on true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). The detailed formulas for these metrics are presented in
Appendix A for reference.
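Written out from the TP/FP/FN/TN counts, the Appendix A metrics are:

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For instance, `metrics(tp=48, fp=10, fn=22, tn=60)` (the Prompt 2 confusion-matrix counts reported in Section 4.2) gives an accuracy of 108/140 ≈ 0.7714, matching the reported 77.14%.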
3.7. Manual Evaluation of Explanations and Recommendations
Whenever a conflict was detected, the fine-tuned GPT-4o would also describe the scenario and recommend remedial measures (such as shifting signal phases or changing the direction of traffic flow). Traffic management professionals then evaluated these outputs on a 0–10 scale during a roundtable discussion, with 10 as the highest score for clarity, accuracy, and usability. Combining the experts’ feedback helped uncover gaps or oversights in the model and mitigate discrepancies between model predictions and real-world safety outcomes. At the same time, standard metrics (accuracy, precision, recall, and F1-score) continued to capture classification performance.
3.8. Deployment Testing
The best-performing model was then selected for deployment-style testing on a time-series dataset. Each subsequent observation was conditioned on the actual conflict label rather than the model’s predicted label. This design preserved the temporal and logical order of events and made it possible for the system to monitor ongoing conflicts. The model’s output was compared against the ground-truth labels to determine how well GPT-4o could transition between the presence and absence of a conflict state, especially under sudden shifts or multi-faceted changes in vehicular movements. This provided the final examination of the model’s ability to process traffic data streams at uncontrolled intersections with high traffic volumes.
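The deployment loop can be sketched as below. The key detail from Section 3.8 is that the previous conflict state fed to each prompt is the ground-truth label, not the model’s own prediction; `classify` stands in for the fine-tuned GPT-4o call and is a hypothetical name.

```python
def evaluate_stream(triplets, labels, classify):
    """triplets[i] and labels[i] are aligned; labels are "yes"/"no" strings.

    classify(triplet, prev_state) -> "yes"/"no" is the model under test.
    """
    correct = 0
    prev_state = "no"  # assume no conflict before the first observation
    for triplet, truth in zip(triplets, labels):
        pred = classify(triplet, prev_state)
        correct += int(pred == truth)
        prev_state = truth  # carry forward the ACTUAL label, not the prediction
    return correct / len(labels)
```

Using the actual label as the carried-forward state (teacher forcing) isolates per-step detection quality from compounding prediction errors across the stream.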
4. Results
4.1. Model Performance Metrics
Model accuracy, precision, recall, and F1-score on the challenging test set are summarized in
Table 2 and
Figure 2. The fine-tuned GPT-4o model performs best with Prompt 2 (P2), reaching an accuracy of 77.14%. This value dropped to 67.14% with Prompt 1 (P1), where the model performed relatively poorly. Performance improves under P2 because of the additional detail the prompt provides, which helps the model assess traffic movements and potential conflicts and their impacts. All models were tested on the same 140-sample dataset, held out from model training, to ensure fair comparisons between models.
In the zero-shot setting, performance on P1 and P2 dropped further, to 55.43% and 58.43%, respectively. The smaller GPT-4o-mini model achieved even worse results, at 53.71% (P1) and 50.29% (P2). This emphasizes the importance of model size and the need for fine-tuning when tackling complex urban traffic scenarios.
In addition to accuracy, the fine-tuned GPT-4o model with P2 also exhibited strong performance in terms of precision (78%), recall (77.5%), and F1-score (77%). In comparison, with P1 the model dropped to a precision of 74.5%, a recall of 67%, and an F1-score of 64.5%. Under zero-shot conditions, GPT-4o had even more difficulty, with a precision of 61.5%, a recall of 58.5%, and an F1-score of 55.5%. The problem was aggravated in the smaller GPT-4o-mini, which reached an F1-score of only 35% with P2 and could not process complex traffic patterns. This underperformance can be attributed to the reduced parameter count of GPT-4o-mini, which limits its ability to capture intricate spatial-temporal patterns. Additionally, without fine-tuning, the smaller model struggles to reason over overlapping vehicle trajectories and priority rules. This emphasizes the importance of both model size and domain adaptation.
These results substantiate the consequential roles of fine-tuning, prompt engineering, and model capacity. Across all datasets, Prompt 2 outperformed Prompt 1, and the fine-tuned models outperformed the zero-shot attempts on all metrics. This emphasizes the significance of prompt engineering and model tuning for these nontrivial traffic management scenarios.
4.2. Confusion Matrices of Fine-Tuned Models
Figure 3 shows the confusion matrices obtained from the fine-tuned GPT-4o model under the two prompts. With Prompt 1, the model recorded 66 true negatives (TN) and 28 true positives (TP), while producing 42 false negatives (FN) and 4 false positives (FP). While the model identified a good number of ‘no conflict’ instances, its performance suffered from the considerably large number of false negatives, which implies difficulty in identifying some conflict cases. This is likely due to the lack of detail in Prompt 1.
Using Prompt 2, however, markedly improved conflict detection. The model produced 60 TN and 48 TP, and decreased false negatives to 22, while false positives rose slightly to 10. This gain reiterates the impact of Prompt 2 in helping the model understand the details of vehicle interactions and traffic priority, improving its ability to differentiate conflict from non-conflict situations.
These results highlight the power of well-worded prompts. By including more context, such as the lane design, vehicle priority, and traffic flow, Prompt 2 equips the model with better information and allows it to make better intersection traffic predictions and management decisions.
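The confusion-matrix counts above can be cross-checked against the accuracies reported in Section 4.1; both sets of counts sum to the 140-sample test set and reproduce the reported figures exactly.

```python
def accuracy(tn: int, tp: int, fn: int, fp: int) -> float:
    """Fraction of correct classifications from confusion-matrix counts."""
    return (tp + tn) / (tp + tn + fp + fn)

p1 = accuracy(tn=66, tp=28, fn=42, fp=4)   # Prompt 1 counts from Figure 3
p2 = accuracy(tn=60, tp=48, fn=22, fp=10)  # Prompt 2 counts from Figure 3
print(round(100 * p1, 2), round(100 * p2, 2))  # 67.14 77.14
```

That is, 94/140 = 67.14% for Prompt 1 and 108/140 = 77.14% for Prompt 2, matching Table 2.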
4.3. Manual Evaluation
A manual evaluation was performed on the model’s explanatory outputs to assess their clarity and practical usefulness. On average, explanations received a score of 8.99, while the model’s recommendations received 9.23, which is notably higher. These auspicious results indicate that the model’s recommendations are helpful and improve the handling of conflict situations, traffic safety, and intersection control.
Figure 4 shows this feedback, while
Figure 5 contains the actual outputs of GPT-4o.
To give readers a more comprehensive view, multiple examples of model explanations and their corresponding recommendations are included in
Figure 5. These illustrate both conflict and no-conflict cases across different scenarios.
The rows in
Figure 5 illustrate the model’s reasoning for conflict detection, as well as its suggestions for improving traffic safety. For observations marked “Conflict detected”, the model pointed out vehicle movements such as unsafe turning or blocking the intersection and offered context-sensitive explanations of the vehicles’ spatial positions and priorities. It then provided recommendations, such as changing traffic flows or setting vehicle hierarchies. For observations marked “No conflict detected”, the model reported appropriate traffic movement and no anticipated danger, which supports proper traffic control, since a deep understanding of the situation is required.
The manual evaluations strengthen the premise that an adequately adapted GPT-4o system can identify traffic conflicts, explain them, and offer reasonable, helpful advice. Furthermore, this ability to propose means of detection and resolution makes the model particularly useful for managing intersections and supporting drivers’ decisions.
4.4. Deployment Results
In addition to conventional assessment, the model was also evaluated using time-series data. This involved using continuous video footage to assess the model’s continuous conflict detection mechanism. The video sequences were divided into overlapping three-frame segments to ensure a smooth transition and evaluate the model’s performance in different traffic situations.
The model obtained an accuracy of 88.48% when it was tested on the balanced time-series dataset, with an equal distribution of conflict and no-conflict scenarios, as highlighted in
Figure 6 and
Table 3. It achieved precision of 0.88 for “no conflict” and 0.89 for “conflict”, with recall scores of 0.89 and 0.88 and F1-scores of 0.89 and 0.88, respectively. According to the confusion matrix in
Figure 6 (left), misclassification rates were low, showing the consistency of the model within structured, balanced environments. The reported precision and recall values were verified for consistency across the time-series datasets.
Next, the model was tested on the complete time-series dataset constructed from all four videos, which has a naturally unbalanced distribution of conflict and no-conflict situations. Under these conditions, GPT-4o achieved an accuracy of 90.00%, as shown in
Figure 6 and
Table 3. For the “no conflict” class, the model achieved a precision of 0.96, recall of 0.92, and F1-score of 0.94, showing strong performance on the majority class. However, performance for conflict instances was modestly lower, with a precision of 0.71, recall of 0.82, and an F1-score of 0.76. The confusion matrix is shown in
Figure 6 (right), illustrating the model’s performance and the frequency of classification errors on these less frequent conflict cases.
Analysis of Temporal Consistency
A temporal consistency analysis was conducted to evaluate the model’s dependability and ability to cope with changing traffic patterns. The blue line in
Figure 7 represents the actual conflict labels, while the orange vertical bars represent the number of mismatches counted across the four videos. Red dashed lines separate each video from the next.
As the analysis shows, there were many mismatches across much of the time intervals of videos 1 and 3, which had more complex traffic patterns, for example, overlapping trajectories and shifts in conflict priority. On the other hand, the model produced fewer mispredictions for videos 2 and 4, where the scenarios were less complex, with simple intersections and relatively lower congestion.
Model errors were primarily linked to transitions between the conflict and no-conflict states, signaling the model’s susceptibility to abrupt or subtle changes in traffic. Performance was also affected by other variables, such as the traffic level and vehicle types in each video. These observations indicate the need for continuous model enhancement to achieve temporal consistency, particularly in intricate real-world scenarios.
5. Discussion
This study investigated the capability of a fine-tuned GPT-4o multimodal language model to reason about traffic conflicts at unsignalized urban intersections using visual data from drone videos. The results demonstrated that with structured prompt engineering and temporal reasoning, the model achieved strong predictive performance, particularly in real-world deployment using time-series data. The overall accuracy reached 90.00% in the unbalanced dataset and 88.48% in the balanced set, highlighting the model’s robustness in both synthetic and naturalistic conditions.
The findings reinforce that context-aware prompt design, especially Prompt 2 with geometric and directional cues, significantly enhances the model’s understanding of intersection dynamics. Compared to zero-shot predictions, which showed moderate performance, the fine-tuned model provided more accurate classifications along with clearer reasoning and recommendations that were rated highly by expert evaluators. These interpretability scores are important, as they validate the usefulness of the model’s outputs in real-world traffic management scenarios where actionable insights matter.
However, results also varied across traffic complexity levels. In particular, videos with overlapping vehicle trajectories or denser intersections showed higher mismatch rates in the temporal consistency analysis. This suggests that while the model can track evolving conflicts, it still struggles with chaotic or ambiguous interactions—an area where enhanced training data or multimodal sensor integration (e.g., combining video with GPS or LiDAR) may help. Additionally, the model is currently limited to unsignalized intersections with fixed right-of-way rules, and further adaptation is required to generalize across intersections governed by traffic lights or stop signs.
Overall, this framework presents a scalable prototype for AI-assisted traffic control, requiring minimal infrastructure, leveraging accessible drone imagery, and integrating visual reasoning with natural language understanding. Its societal value lies in improving road safety, especially in developing regions where signalized intersections are scarce and real-time decision support can reduce collisions.
6. Conclusions and Future Work
This study demonstrates the capability of a fine-tuned GPT-4o multimodal large language model to direct traffic at urban intersections without signal lights. With carefully designed prompt engineering and targeted fine-tuning, the model’s accuracy reached 77%, a marked improvement over its zero-shot performance and a very promising result. Even more remarkable are the scores of 8.99 on a 10-point scale for detailed conflict explanations and 9.23 for practical improvement recommendations. These results illustrate the model’s potential for solving real-world traffic problems.
To further verify the model’s efficacy in the field, the fine-tuned GPT-4o was evaluated on time-series data, where it performed best: 88.48% on the evenly distributed time-series dataset and 90.00% on the entire dataset. This suggests a strong ability to interpret videos broken into sequences of three overlapping frames. From the in-depth analysis of the mismatches, one can infer that the model performs much better under less complicated conditions but still has some problems with complex, highly congested areas.
The cross-evaluation with the lower-performing GPT-4o-mini model, together with GPT-4o without any fine-tuning, confirms that models lacking targeted training perform poorly. In contrast, Prompt 2 scored higher than Prompt 1 in all evaluations, which confirms that comprehensive contextual evidence, such as vehicle movement patterns and priority rules, is needed to enhance predictive performance.
These findings attest to the value of fine-tuned MLLMs for traffic management and demonstrate that future endeavors can extend these benefits through continued expansion and enrichment. Using time-series data from diverse traffic environments, enhancing sensor input, and adopting newer MLLMs such as Gemini or LLaVA can further improve conflict detection and resolution. In addition, future work will evaluate the model on a variety of intersection layouts—T-junctions, signalized crossroads, and roundabouts—to verify transferability across different geometric configurations and control schemes. Using live streams from traffic cameras for practical detection and developing algorithms that address these problems is also a notable future step toward improved urban intersections and a solid base for traffic safety and optimization systems. In terms of societal contribution, this work enables scalable traffic monitoring using affordable aerial imagery and AI, which can reduce accidents, improve pedestrian safety, and support smarter city infrastructure without requiring large physical investments.