1. Introduction
Unsignalized urban intersections are among the most complicated and dangerous road environments. The mix of vehicles and pedestrians makes these intersections even more hazardous. Motorcycles and bicycles often maneuver in blind spots, significantly increasing the likelihood of fatal collisions, especially given the lack of effective traffic control mechanisms [
1,
2,
3]. Traditional traffic control methods address issues only after they occur. As technology advances, systems that can detect and resolve potential problems in real time are strongly needed [
4,
5]. To address this problem, researchers have turned to congested urban settings in order to examine the effectiveness of innovative solutions for conflict detection and immediate resolution.
Simultaneously, the emergence of multimodal large language models (MLLMs), such as GPT-4o, has marked considerable advancement within AI. These models perform exceptionally well in logical reasoning, contextual comprehension, and decision-making [
6,
7]. MLLMs are capable of analyzing image and video data. Thus, coupled with context-aware and predictive traffic management, MLLMs can revolutionize the entire traffic industry [
8,
9]. One effective technique is the examination of drone-captured videos in which sequences of three overlapping frames are analyzed to identify conflicts. This method of tracking interactions over time helps understand the transitions between frames and helps classify traffic interactions as conflict or non-conflict conditions.
In this study, GPT-4o was applied to conduct MLLM-based traffic control at unsignalized intersections. The system architecture utilizes videos captured by drones, which help detect and classify traffic conflicts, provide descriptive details, and even suggest actions to the drivers at risk. In addition, conflict-oriented prompt optimization, fine-tuning, and time-series analysis were used to improve accuracy in forecasting and detecting conflicts. Building on the advanced visual reasoning of language models, this technique offers a flexible solution to the complex problem of intersection management by producing practical, realistic results that can be implemented directly [
10,
11].
For researchers, this work contributes to the growing field of AI-based traffic management by providing a data-driven, explainable method using cutting-edge multimodal LLMs. For practitioners, especially traffic engineers and urban planners, the system offers a scalable and cost-effective solution for enhancing safety at unsignalized intersections without requiring new infrastructure.
2. Literature Review
Traffic management and autonomous driving applications greatly benefit from MLLMs due to their flexible, responsive, and interpretable nature [
12,
13,
14]. One of the core strengths of these models is the ability to formulate specific and tractable recommendations for various stakeholders, including decision-makers, drivers, and engineers. In practice, MLLMs have led to the development of several innovations, such as internet-connected traffic lights and smart transportation corridors that improve the real-time management of traffic [
15,
16,
17]. At the same time, the incorporation of machine learning techniques in transportation has been systematically studied for its benefits and challenges [
18,
19].
Traffic conflict techniques (TCTs) have traditionally been employed to evaluate near-miss events as indicators of accident risks, relying on observed traffic interactions rather than actual crash data [
1]. These techniques define conflicts as situations where road users approach closely, necessitating evasive actions to avoid collisions [
2]. Common measures include time-to-collision (TTC), post-encroachment time (PET), and speed differential [
20]. Despite their effectiveness, traditional TCTs face limitations such as manual observations, observer bias, and challenges capturing comprehensive spatial-temporal data in complex intersections. Recently, drone-based video analysis has been adopted to overcome these limitations by offering an objective aerial perspective and continuous data recording capability [
3]. However, drone technologies introduce constraints such as altitude restrictions, limited fields-of-view, sensitivity to weather conditions, and data resolution challenges [
4].
More recent work features the use of Large Language Models (LLMs) in autonomous driving, which consists of four core capabilities: planning, perception, question-answering, and generation [
21,
22,
23]. The obstacles of clarity, scalability, and practicality are also noted by the LLM4DRIVE project [
19], alongside the need for sufficient datasets and explanations. A shift from sensor-centric approaches toward more deeply AI-driven self-driving systems is reported by [
24], who consider Vision Foundation Models (VFMs) an essential step toward better perception and decision-making. Also, [
25] presented a method that employs LLMs such as GPT-3.5 to answer questions more efficiently, identify scenarios, and comprehensively understand the situations presented.
LLM-based frameworks have also progressed well in forecasting traffic patterns and managing vehicles. For example, [
26] describes systems containing sequence and graph embedding layers that perform well in few-shot learning on historical datasets. DriveMLM [
17], an advancement that synchronizes multimodal LLMs with behavioral planning states, allows for combining language intentions and vehicle control gestures during the simulation. Additional attempts in [
24] investigate a more natural form of human-vehicle interaction using LLMs to process voice commands. At the same time, [
17] introduces AccidentGPT, which was developed to understand and reconstruct road traffic accidents and offer solutions for improving safety measures on the road.
A different strand of research concerns sensory data fusion with LLMs to provide better situational awareness of the system. In [
27], the LiDAR and Radar data are combined with LLM output to improve object detection and tracking. At the same time, in [
28], the prediction of human movements is based on contextual and visual information. In the same manner, [
29] researched driver-vehicle interaction in various physical activity and voice command combinations, and [
30] developed a method for monitoring real-time dashboard video, identifying dangerous driving behavior such as sudden driving maneuvers and other risks to safety in changing road situations.
Explainability has become an increasingly important factor in deploying LLM-based technologies in sensitive areas such as autonomous driving. Methods such as retrieval-augmented generation (RAG) and knowledge graphs increase users’ trust in system predictions with clear, justifiable outputs [
31]. At the same time, multimodal LLMs with comprehensive traffic foundation models are increasingly used in deeper transportation analytics [
32], and there remains a significant ongoing effort to use reinforcement learning to solve complex problems such as controlling unsignalized intersections [
33].
Despite these advancements, there remains a clear research gap in applying AI-driven methods to address real-time conflict detection specifically at unsignalized intersections. Existing techniques often fail to fully exploit real-time multimodal data for dynamic traffic scenarios, emphasizing the necessity for advanced, robust AI solutions capable of interpreting complex interactions and providing actionable insights instantaneously. This study addresses this gap by employing a fine-tuned MLLM, GPT-4o, specifically designed for analyzing bird’s eye view drone footage to dynamically detect and manage traffic conflicts at unsignalized intersections.
This research is unique because it uses a fine-tuned MLLM, GPT-4o, to apply to traffic management at unsignalized intersections. The method works by analyzing overlapping three-frame observations consisting of bird’s eye view images of 4-way intersections. The proposed technique identifies and classifies conflicts, provides descriptors explaining them, and implements heuristic strategies for the guidance of drivers. To improve intersection safety and efficiency, this method seeks to dynamically adapt to changing traffic conditions using a combination of MLLMs for both visual and temporal inference.
3. Materials and Methods
3.1. Overview of Framework
Our approach is novel in applying a fine-tuned multimodal LLM with structured temporal prompts and overlapping frame sequences to directly interpret real-time drone footage for traffic conflict resolution, which is a unique contribution compared to previous sensor- or rule-based systems. The multi-phase pipeline utilized in this study is shown in
Figure 1. In the Data Collection & Labeling stage, drone footage of intersections is obtained and split into triplets of frames separated by 0.5 s. Each triplet is marked as either conflict or no conflict. The Zero-Shot Conflict Detection phase comes next, in which GPT-4o and GPT-4o-mini are assessed without any training, using an iterative prompt design process intended to boost early classification accuracy. The next step is Fine-Tuning GPT-4o, where extensive training and validation sets are created to optimize the model for conflict detection. In the Explanation & Evaluation step, the fine-tuned model produces conflict alerts together with recommended actions, all of which are evaluated by traffic specialists for ease of understanding and usability. In the final Deployment step, the system is evaluated on a time-series dataset capturing a real-world scenario of continuous traffic, to determine GPT-4o’s ability to track the changing state of conflicts over time. The diagram includes the main processing stages, including the design and flow of prompt inputs for the GPT-4o models.
3.2. Data Collection and Labeling
The intersections in this study are unsignalized and operate under give-way rules, with the main road given priority. There were no traffic lights or stop signs involved, allowing the model to learn behavior based on natural vehicle interactions. The unsignalized four-legged intersections footage was sourced from the open-access ListDB RepTwo dataset [
34]. All videos were recorded with a DJI Phantom 4 drone flying at 50 m altitude with a 45° camera tilt at 30 fps, during four fixed weekday time slots (07:30–09:00, 10:00–11:30, 13:00–14:30, 15:30–17:00). These videos were segmented into triplets of frames spaced half a second apart, capturing specific regions of interest. Initially, 700 labeled observations, consisting of 350 conflict and 350 no-conflict scenarios, were created by analyzing vehicle movements, compliance with the priority rule, and near-miss activities. Of these, 504 observations were used for training, while the remaining 196 were split into 56 for validation and 140 for testing. A conflict label was assigned whenever time-to-collision (TTC) < 2 s or post-encroachment time (PET) < 1.5 s. All subsets had a balanced number of conflict and no-conflict situations.
A sliding-window approach was used to produce a more dynamic and realistic stream of intersection interactions. This method produced 1534 observations from four continuous videos, 291 of which had conflict and 1242 without. The first observation contained frames 1, 2, and 3, while the next had frames 2, 3, and 4. This more extensive test set evaluated GPT-4o’s ability to handle dynamic traffic conditions with multiple previous conflict states.
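The triplet construction and labeling rules described above can be sketched as follows. The helper names and the assumption that frames are sampled from a 30 fps stream follow Section 3.2; the dataset’s actual tooling is not specified in the paper.

```python
from typing import List, Tuple

FPS = 30
STEP = int(0.5 * FPS)  # frames are 0.5 s apart -> every 15th frame at 30 fps

def make_triplets(frame_indices: List[int]) -> List[Tuple[int, ...]]:
    """Overlapping three-frame observations: (f1, f2, f3), (f2, f3, f4), ..."""
    sampled = frame_indices[::STEP]
    return [tuple(sampled[i:i + 3]) for i in range(len(sampled) - 2)]

def label_conflict(ttc_s: float, pet_s: float) -> str:
    """Labeling rule from Section 3.2: conflict if TTC < 2 s or PET < 1.5 s."""
    return "conflict" if (ttc_s < 2.0 or pet_s < 1.5) else "no conflict"
```

Because consecutive triplets share two frames, each 0.5 s step yields one new observation, which is how four continuous videos produce the 1534-observation deployment set.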
3.3. Prompt Creation
Two prompts aimed at identifying conflicts were created for GPT-4o. Prompt 1 (P1) is contextual: it describes a four-legged intersection traffic situation and asks the classifier whether a conflict is present (yes/no answer). Prompt 2 (P2) contains additional elements, such as lane configurations and turning movements, which serve as richer contextual clues. In both cases, the answer was constrained to “yes” or “no.” In the time-series setting, P2 was further modified to include previous states so as to sustain the detection of both temporally persistent and newly emerging conflicts, as shown in
Table 1.
The structure of Prompt 2 directly incorporates intersection geometry, such as the number of lanes, priority direction, and turning options. These geometric configurations are essential in helping the model assess vehicle interactions. The model uses this spatial context to detect conflicts and generate explanations and recommendations.
3.4. Zero-Shot Evaluation
GPT-4o and GPT-4o-mini were evaluated with no previous training on the particular domain, using 140 observations (70 conflict and 70 no-conflict) for both P1 and P2. This served as a reference point for measuring the models’ baseline performance without any domain-specific training or fine-tuning. Precision, recall, accuracy, and F1-score were calculated to assess the models’ performance, showing how well they dealt with previously unseen traffic data. These metrics set a baseline for tracking the progress made through subsequent training and prompt adjustments.
3.5. Fine-Tuning MLLM
To enhance the identification of incidents, it was necessary to fine-tune GPT-4o with additional data points. A total of 504 training observations were collected (split evenly between cases of conflict and those with no conflict). Additionally, a set of 56 observations was created to validate the model’s performance while adjusting hyperparameters. The model was evaluated on a test set comprising the last 140 observations. The accuracy, precision, recall, F1-score, and other relevant metrics showcased the improvements obtained from targeted training and specially formulated prompt inputs.
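Preparing the 504 training observations for fine-tuning can be sketched in OpenAI’s chat-style JSONL format (one JSON object per line, with the desired “yes”/“no” answer as the assistant turn). The exact file layout and image handling used in the study are not published, so the structure below is an assumption.

```python
import json

def to_jsonl_line(prompt_content: list, label: str) -> str:
    """One fine-tuning example: user prompt (text + frames) plus gold answer."""
    record = {
        "messages": [
            {"role": "user", "content": prompt_content},
            {"role": "assistant",
             "content": "yes" if label == "conflict" else "no"},
        ]
    }
    return json.dumps(record)

def write_training_file(observations, path="train.jsonl"):
    """observations: iterable of (prompt_content, label) pairs."""
    with open(path, "w") as fh:
        for content, label in observations:
            fh.write(to_jsonl_line(content, label) + "\n")
```

The 56-observation validation file would be written the same way and supplied to the fine-tuning job for hyperparameter monitoring.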
3.6. Model Evaluation Metrics
All metrics of interest—accuracy, precision, recall, and F1 score—were used to measure model performance. These metrics are based on true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). The detailed formulas for these metrics are presented in
Appendix A for reference.
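Written out from the TP/FP/FN/TN counts, the Appendix A metrics are:

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For instance, `metrics(tp=48, fp=10, fn=22, tn=60)` (the Prompt 2 confusion-matrix counts reported in Section 4.2) gives an accuracy of 108/140 ≈ 0.7714, matching the reported 77.14%.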
3.7. Manual Evaluation of Explanations and Recommendations
Whenever a conflict was detected, the fine-tuned GPT-4o would also describe the scenario and recommend remedial measures (such as shifting signal phases or changing the direction of traffic flow). Traffic management professionals then evaluated these outputs on a 0–10 scale during a roundtable discussion, with 10 as the highest score for clarity, accuracy, and usability. Combining the experts’ feedback helped uncover gaps or oversights in the model and mitigate discrepancies between model predictions and real-world safety outcomes. At the same time, standard metrics (accuracy, precision, recall, and F1-score) continued to capture classification performance.
3.8. Deployment Testing
The best-performing model was then selected for deployment-style testing on a time-series dataset. Each subsequent observation was conditioned on the actual conflict label rather than the model’s predicted label. This design preserved the temporal and logical order of events and made it possible for the system to monitor ongoing conflicts. The model’s output was compared against the ground-truth labels to determine how well GPT-4o could transition between the presence and absence of a conflict state, especially under sudden shifts or multi-faceted changes in vehicular movements. This provided the final examination of the model’s ability to process traffic data streams at uncontrolled intersections with high traffic volumes.
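The deployment loop can be sketched as below. The key detail from Section 3.8 is that the previous conflict state fed to each prompt is the ground-truth label, not the model’s own prediction; `classify` stands in for the fine-tuned GPT-4o call and is a hypothetical name.

```python
def evaluate_stream(triplets, labels, classify):
    """triplets[i] and labels[i] are aligned; labels are "yes"/"no" strings.

    classify(triplet, prev_state) -> "yes"/"no" is the model under test.
    """
    correct = 0
    prev_state = "no"  # assume no conflict before the first observation
    for triplet, truth in zip(triplets, labels):
        pred = classify(triplet, prev_state)
        correct += int(pred == truth)
        prev_state = truth  # carry forward the ACTUAL label, not the prediction
    return correct / len(labels)
```

Using the actual label as the carried-forward state (teacher forcing) isolates per-step detection quality from compounding prediction errors across the stream.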
4. Results
4.1. Model Performance Metrics
Model accuracy, precision, recall, and F1-score on the challenging test set are summarized in
Table 2 and
Figure 2. The fine-tuned GPT-4o model performs best with Prompt 2 (P2), reaching an accuracy of 77.14%. This value dropped to 67.14% with Prompt 1 (P1), where the model performed relatively poorly. Performance improves under P2 because of the additional detail the prompt provides, which helps the model assess traffic movements and potential conflicts and their impacts. All models were tested on the same 140-sample dataset, held out from model training, to ensure fair comparisons between models.
In the zero-shot setting, performance on P1 and P2 dropped further, to 55.43% and 58.43%, respectively. The smaller GPT-4o-mini model achieved even worse results, at 53.71% (P1) and 50.29% (P2). This emphasizes the importance of model size and the need for fine-tuning when tackling complex urban traffic scenarios.
In addition to accuracy, the fine-tuned GPT-4o model with P2 also exhibited strong performance in terms of precision (78%), recall (77.5%), and F1-score (77%). In comparison, with P1 the model dropped to a precision of 74.5%, a recall of 67%, and an F1-score of 64.5%. Under zero-shot conditions, GPT-4o had even more difficulty, with a precision of 61.5%, a recall of 58.5%, and an F1-score of 55.5%. The problem was aggravated in the smaller GPT-4o-mini, which reached an F1-score of only 35% with P2 and could not process complex traffic patterns. This underperformance can be attributed to the reduced parameter count of GPT-4o-mini, which limits its ability to capture intricate spatial-temporal patterns. Additionally, without fine-tuning, the smaller model struggles to reason over overlapping vehicle trajectories and priority rules. This emphasizes the importance of both model size and domain adaptation.
These results substantiate the consequential roles of fine-tuning, prompt engineering, and model capacity. Across all datasets, Prompt 2 outperformed Prompt 1, and the fine-tuned models outperformed the zero-shot attempts on all metrics. This emphasizes the significance of prompt engineering and model tuning for these nontrivial traffic management scenarios.
4.2. Confusion Matrices of Fine-Tuned Models
Figure 3 shows the confusion matrices obtained from the fine-tuned GPT-4o model under the two prompts. With Prompt 1, the model recorded 66 true negatives (TN) and 28 true positives (TP), while producing 42 false negatives (FN) and 4 false positives (FP). While the model identified a good number of ‘no conflict’ instances, its performance suffered from the considerably large number of false negatives, which implies difficulty in identifying some conflict cases. This is likely due to the lack of detail in Prompt 1.
Using Prompt 2, however, markedly improved conflict detection. The model produced 60 TN and 48 TP, and decreased false negatives to 22, while false positives rose slightly to 10. This gain reiterates the impact of Prompt 2 in helping the model understand the details of vehicle interactions and traffic priority, improving its ability to differentiate conflict from non-conflict situations.
These results highlight the power of well-worded prompts. By including more context, such as the lane design, vehicle priority, and traffic flow, Prompt 2 equips the model with better information and allows it to make better intersection traffic predictions and management decisions.
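The confusion-matrix counts above can be cross-checked against the accuracies reported in Section 4.1; both sets of counts sum to the 140-sample test set and reproduce the reported figures exactly.

```python
def accuracy(tn: int, tp: int, fn: int, fp: int) -> float:
    """Fraction of correct classifications from confusion-matrix counts."""
    return (tp + tn) / (tp + tn + fp + fn)

p1 = accuracy(tn=66, tp=28, fn=42, fp=4)   # Prompt 1 counts from Figure 3
p2 = accuracy(tn=60, tp=48, fn=22, fp=10)  # Prompt 2 counts from Figure 3
print(round(100 * p1, 2), round(100 * p2, 2))  # 67.14 77.14
```

That is, 94/140 = 67.14% for Prompt 1 and 108/140 = 77.14% for Prompt 2, matching Table 2.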
4.3. Manual Evaluation
A manual evaluation was performed on the model’s explanatory outputs to assess their clarity and practical usefulness. On average, explanations received a score of 8.99, while the model’s recommendations received 9.23, which is notably higher. These auspicious results indicate that the model’s recommendations are helpful and improve the handling of conflict situations, traffic safety, and intersection control.
Figure 4 shows this feedback, while
Figure 5 contains the actual outputs of GPT-4o.
To give readers a more comprehensive view, multiple examples of model explanations and their corresponding recommendations are included in
Figure 5. These illustrate both conflict and no-conflict cases across different scenarios.
The rows in
Figure 5 illustrate the model’s reasoning for conflict detection, as well as its suggestions for improving traffic safety. For observations marked “Conflict detected”, the model pointed out vehicle movements such as unsafe turning or blocking the intersection and offered context-sensitive explanations of the vehicles’ spatial positions and priorities. It then provided recommendations, such as changing traffic flows or setting vehicle hierarchies. For observations marked “No conflict detected”, the model reported appropriate traffic movement and no anticipated danger, which supports proper traffic control, since a deep understanding of the situation is required.
The manual evaluations strengthen the premise that an adequately adapted GPT-4o system can identify traffic conflicts, explain them, and offer reasonable, helpful advice. Furthermore, this ability to propose means of detection and resolution makes the model particularly useful for managing intersections and supporting drivers’ decisions.
4.4. Deployment Results
In addition to conventional assessment, the model was also evaluated using time-series data. This involved using continuous video footage to assess the model’s continuous conflict detection mechanism. The video sequences were divided into overlapping three-frame segments to ensure a smooth transition and evaluate the model’s performance in different traffic situations.
The model obtained an accuracy of 88.48% when it was tested on the balanced time-series dataset, with an equal distribution of conflict and no-conflict scenarios, as highlighted in
Figure 6 and
Table 3. It achieved precision of 0.88 for “no conflict” and 0.89 for “conflict”, with recall scores of 0.89 and 0.88 and F1-scores of 0.89 and 0.88, respectively. According to the confusion matrix in
Figure 6 (left), misclassification rates were low, showing the consistency of the model within structured, balanced environments. The reported precision and recall values were verified for consistency across the time-series datasets.
Next, the model was tested on the complete time-series dataset constructed from all four videos, which has a naturally unbalanced distribution of conflict and no-conflict situations. Under these conditions, GPT-4o achieved an accuracy of 90.00%, as shown in
Figure 6 and
Table 3. For the “no conflict” class, the model achieved a precision of 0.96, recall of 0.92, and F1-score of 0.94, showing strong performance on the majority class. However, performance for conflict instances was modestly lower, with a precision of 0.71, recall of 0.82, and an F1-score of 0.76. The confusion matrix is shown in
Figure 6 (right), illustrating the model’s performance and the frequency of classification errors on these less frequent conflict cases.
Analysis of Temporal Consistency
A temporal consistency analysis was conducted to evaluate the model’s dependability and ability to cope with changing traffic patterns. The blue line in
Figure 7 represents the actual conflict labels, while the orange vertical bars represent the number of mismatches counted across the four videos. Red dashed lines separate each video from the next.
As the analysis shows, there were many mismatches across much of the time intervals of videos 1 and 3, which had more complex traffic patterns, for example, overlapping trajectories and shifts in conflict priority. On the other hand, the model produced fewer mispredictions for videos 2 and 4, where the scenarios were less complex, with simple intersections and relatively lower congestion.
Model errors were primarily linked to transitions between the conflict and no-conflict states, signaling the model’s susceptibility to abrupt or subtle changes in traffic. Performance was also affected by other variables, such as the traffic level and vehicle types in each video. These observations indicate the need for continuous model enhancement to achieve temporal consistency, particularly in intricate real-world scenarios.
5. Discussion
This study investigated the capability of a fine-tuned GPT-4o multimodal language model to reason about traffic conflicts at unsignalized urban intersections using visual data from drone videos. The results demonstrated that with structured prompt engineering and temporal reasoning, the model achieved strong predictive performance, particularly in real-world deployment using time-series data. The overall accuracy reached 90.00% in the unbalanced dataset and 88.48% in the balanced set, highlighting the model’s robustness in both synthetic and naturalistic conditions.
The findings reinforce that context-aware prompt design, especially Prompt 2 with geometric and directional cues, significantly enhances the model’s understanding of intersection dynamics. Compared to zero-shot predictions, which showed moderate performance, the fine-tuned model provided more accurate classifications along with clearer reasoning and recommendations that were rated highly by expert evaluators. These interpretability scores are important, as they validate the usefulness of the model’s outputs in real-world traffic management scenarios where actionable insights matter.
However, results also varied across traffic complexity levels. In particular, videos with overlapping vehicle trajectories or denser intersections showed higher mismatch rates in the temporal consistency analysis. This suggests that while the model can track evolving conflicts, it still struggles with chaotic or ambiguous interactions—an area where enhanced training data or multimodal sensor integration (e.g., combining video with GPS or LiDAR) may help. Additionally, the model is currently limited to unsignalized intersections with fixed right-of-way rules, and further adaptation is required to generalize across intersections governed by traffic lights or stop signs.
Overall, this framework presents a scalable prototype for AI-assisted traffic control, requiring minimal infrastructure, leveraging accessible drone imagery, and integrating visual reasoning with natural language understanding. Its societal value lies in improving road safety, especially in developing regions where signalized intersections are scarce and real-time decision support can reduce collisions.
6. Conclusions and Future Work
This study demonstrates the capability of a fine-tuned GPT-4o multimodal large language model to direct traffic at urban intersections without signal lights. With carefully designed prompt engineering and targeted fine-tuning, the model’s accuracy reached 77%, a marked improvement over its zero-shot performance and a very promising result. Even more remarkable are the scores of 8.99 on a 10-point scale for detailed conflict explanations and 9.23 for practical improvement recommendations. These results illustrate the model’s potential for solving real-world traffic problems.
To further verify the model’s efficacy in the field, the fine-tuned GPT-4o was evaluated on time-series data, where it performed best: 88.48% on the evenly distributed time-series dataset and 90.00% on the entire dataset. This suggests a strong ability to interpret videos broken into sequences of three overlapping frames. From the in-depth analysis of the mismatches, one can infer that the model performs much better under less complicated conditions but still has some problems with complex, highly congested areas.
The cross-evaluation with the lower-performing GPT-4o-mini model, together with GPT-4o without any fine-tuning, confirms that models lacking targeted training perform poorly. In contrast, Prompt 2 scored higher than Prompt 1 in all evaluations, which confirms that comprehensive contextual evidence, such as vehicle movement patterns and priority rules, is needed to enhance predictive performance.
These findings attest to the value of fine-tuned MLLMs for traffic management and demonstrate that future endeavors can extend these benefits through continued expansion and enrichment. Using time-series data from diverse traffic environments, enhancing sensor input, and adopting newer MLLMs such as Gemini or LLaVA can further improve conflict detection and resolution. In addition, future work will evaluate the model on a variety of intersection layouts—T-junctions, signalized crossroads, and roundabouts—to verify transferability across different geometric configurations and control schemes. Using live streams from traffic cameras for practical detection and developing algorithms that address these problems is also a notable future step toward improved urban intersections and a solid base for traffic safety and optimization systems. In terms of societal contribution, this work enables scalable traffic monitoring using affordable aerial imagery and AI, which can reduce accidents, improve pedestrian safety, and support smarter city infrastructure without requiring large physical investments.