From Concept to Representation: Modeling Driving Capability and Task Demand with a Multimodal Large Language Model
Abstract
1. Introduction
- We adapt the BLIP-2 architecture [10] with a custom design aligned with the TCI model, enabling formal modeling of task demand and driving capability. This tailored architecture provides a principled framework for capturing and analyzing the alignment between demand and capability in dynamic driving scenarios.
- We propose a new method to explicitly distill low-dimensional semantic vectors for driving capability and task demand. By imposing a dimensionality constraint on the MLLM’s output and integrating a multi-label contrastive learning objective, our approach yields concise yet expressive representations that characterize the driving capability and the task demand within a shared latent space.
- We design a comparator module that measures distances between capability vectors and between capability and demand vectors. This enables effective detection of mismatches between driving capability and task demand, facilitating early intervention and risk awareness. Validation and analysis across multiple datasets demonstrate the practical feasibility and interpretability of the proposed method. An illustrative sketch of these components follows this list.
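To make these contributions concrete, the minimal PyTorch sketch below shows low-dimensional demand and capability heads, a simplified (single-label) supervised-contrastive objective in place of the paper's multi-label variant, and a distance-based comparator. It is an illustration under stated assumptions, not the released implementation: all module names, feature dimensions, and the choice of cosine distance are placeholders, and the frozen BLIP-2 backbone that would produce the pooled features is omitted.

```python
# Minimal sketch (not the released implementation): low-dimensional demand and
# capability heads on pooled multimodal features, a simplified supervised-contrastive
# loss, and a distance-based comparator. All names and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskCapabilityHeads(nn.Module):
    """Projects pooled MLLM features into two low-dimensional semantic vectors."""
    def __init__(self, feat_dim: int = 768, emb_dim: int = 16):
        super().__init__()
        # Dimensionality constraint: emb_dim is deliberately small (e.g., 16).
        self.demand_head = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.GELU(), nn.Linear(feat_dim, emb_dim))
        self.capability_head = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.GELU(), nn.Linear(feat_dim, emb_dim))

    def forward(self, scene_feat, behavior_feat):
        # L2-normalization places demand and capability on a shared unit hypersphere.
        d = F.normalize(self.demand_head(scene_feat), dim=-1)
        c = F.normalize(self.capability_head(behavior_feat), dim=-1)
        return d, c

def supcon_loss(emb, labels, temperature=0.1):
    """Single-label SupCon (Khosla et al., 2020); the multi-label variant is not reproduced here."""
    sim = emb @ emb.T / temperature
    sim = sim - sim.max(dim=1, keepdim=True).values.detach()   # numerical stability
    self_mask = torch.eye(len(labels), device=emb.device)
    pos_mask = (labels[None, :] == labels[:, None]).float() * (1 - self_mask)
    exp_sim = torch.exp(sim) * (1 - self_mask)
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True) + 1e-12)
    mean_log_prob_pos = (pos_mask * log_prob).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return -mean_log_prob_pos.mean()

def embedding_distance(x, y):
    """Comparator: cosine distance in the shared latent space (0 = identical)."""
    return 1.0 - F.cosine_similarity(x, y, dim=-1)

# Usage: a larger demand-capability distance is read as a larger potential mismatch.
heads = TaskCapabilityHeads()
scene_feat, behavior_feat = torch.randn(8, 768), torch.randn(8, 768)
demand, capability = heads(scene_feat, behavior_feat)
scenario_labels = torch.randint(0, 4, (8,))          # placeholder scenario labels
loss = supcon_loss(demand, scenario_labels) + supcon_loss(capability, scenario_labels)
mismatch = embedding_distance(demand, capability)
```

In this sketch a growing demand-capability distance flags a potential mismatch; the actual backbone and comparator are described in Sections 3.3 and 3.4.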
2. Related Works
2.1. Task–Capability Interface Model
2.2. Impaired Driving
2.3. Multimodal Large Language Model in Driving
3. Methodology
3.1. Problem Statement
3.2. Objective and Framework Overview
- Learning and modeling the functions introduced in Section 3.1.
- Validating the success of the modeling by examining whether the properties described in Section 3.1 are satisfied.
3.3. Task–Capability Large Model
3.4. Embedding Comparator
4. Dataset
4.1. Overview
4.2. Scenario Categorization
4.3. Data Modality
5. Experimental Evaluation
5.1. Experiment Setup
5.2. Overall Evaluation Results
5.2.1. Task Consistency
5.2.2. Intra-Task Capability Ordering
5.2.3. Task–Capability Comparability
5.3. Ablation Study
5.3.1. Embedding Dimension
5.3.2. Task-Difficulty Reference
5.3.3. Task-Difficulty Function
5.4. Generalization to Real-World Data
- Image: Frames are sampled from the in-car camera footage.
- Description: A set of natural language descriptions is constructed to depict the car-following scenario; one sentence is randomly selected for each sample.
- Surroundings: Only the vehicle information directly ahead is used.
- Behavior: Because the steering-wheel angle and pedal positions are unavailable, we represent driver behavior using only vehicle speed, yaw rate, and lateral deviation from the lane center.
- Prompt: The prompt is kept identical to that in our original dataset (an illustrative sketch of the resulting sample format follows this list).
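For illustration, the sketch below assembles one adapted real-world car-following record into the five-field sample format listed above. The field names, description pool, and numeric values are hypothetical placeholders, not the exact dataset schema.

```python
# Illustrative sketch of mapping one real-world record into the model's sample format.
# All keys, the description pool, and the numbers are placeholder assumptions.
import random

DESCRIPTION_POOL = [
    "The ego vehicle follows a lead vehicle on a motorway.",
    "A car-following scene with a single vehicle directly ahead.",
]

def build_sample(frame_path, lead_vehicle, speed, yaw_rate, lateral_dev, prompt):
    return {
        "image": frame_path,                              # sampled in-car camera frame
        "description": random.choice(DESCRIPTION_POOL),   # one sentence per sample
        "surroundings": {"front_vehicle": lead_vehicle},  # only the vehicle directly ahead
        # Steering angle and pedal positions are unavailable, so behavior is reduced
        # to speed, yaw rate, and lateral deviation from the lane center.
        "behavior": {"speed": speed, "yaw_rate": yaw_rate, "lateral_deviation": lateral_dev},
        "prompt": prompt,                                 # identical to the original dataset
    }

sample = build_sample(
    "frames/000123.jpg",
    {"distance_m": 28.5, "relative_speed_mps": -1.2},
    speed=24.6, yaw_rate=0.01, lateral_dev=0.12,
    prompt="Describe the overall task demand presented by the driving environment in this scene.",
)
```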
6. Discussion
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
WHO | World Health Organization
ABS | Anti-lock Braking Systems
ESC | Electronic Stability Control
ADAS | Advanced Driver-Assistance Systems
ACC | Adaptive Cruise Control
AEB | Autonomous Emergency Braking
LKA | Lane Keeping Assist
TCI | Task–Capability Interface
MLLM | Multimodal Large Language Model
SupCon | Supervised Contrastive Learning
MLP | Multi-Layer Perceptron
FPS | Frames Per Second
CAN | Controller Area Network
LLM | Large Language Model
AnoP | Anomalous Proportion
MAE | Mean Absolute Error
Appendix A. Prompt of the Task–Capability Large Model
- Describe the overall task demand presented by the driving environment in this scene.
- Based on the input data, what are the primary environmental and situational challenges the driver faces during this period?
- Summarize the situational demands posed by this scene, including traffic, road structure, and potential hazards.
- Based on the provided information, estimate the level of attentional and control effort required to navigate this scene.
- How do the current traffic and road conditions impact the complexity of the driving task?
- Describe the driving capability demonstrated by the driver in this scene.
- Evaluate the driver’s performance based on the scene and driving behavior provided.
- Based on the input data, what can you infer about the driver’s driving skill and behavior during this period?
- Given the visual and sensor data, extract indicators of driver control quality and risk awareness.
- Summarize the driver-delivered capability using the combined perception from image and sensor data.
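As a hedged illustration of how these queries could be used, the sketch below pairs each training sample with one query from the demand pool and one from the capability pool listed above. Uniform random selection is an assumption rather than a documented training detail, and the pools are truncated here for brevity.

```python
# Hedged sketch: drawing one (demand, capability) query pair per training sample.
import random

DEMAND_PROMPTS = [
    "Describe the overall task demand presented by the driving environment in this scene.",
    "Summarize the situational demands posed by this scene, including traffic, road structure, and potential hazards.",
]
CAPABILITY_PROMPTS = [
    "Describe the driving capability demonstrated by the driver in this scene.",
    "Evaluate the driver's performance based on the scene and driving behavior provided.",
]

def sample_prompts(rng=random):
    """Returns one demand query and one capability query for a training sample."""
    return rng.choice(DEMAND_PROMPTS), rng.choice(CAPABILITY_PROMPTS)

demand_query, capability_query = sample_prompts()
```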
References
- World Health Organization. Global Status Report on Road Safety 2023; Technical Report; World Health Organization: Geneva, Switzerland, 2023. [Google Scholar]
- Favarò, F.M.; Nader, N.; Eurich, S.O.; Tripp, M.; Varadaraju, N. Examining accident reports involving autonomous vehicles in California. PLoS ONE 2017, 12, e0184952. [Google Scholar] [CrossRef] [PubMed]
- SAE International. Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles. SAE Standard J3016_2021. 2021. Available online: https://www.sae.org/standards/content/j3016_202104/ (accessed on 8 June 2025).
- Jatavallabha, A. Tesla’s Autopilot: Ethics and Tragedy. arXiv 2024, arXiv:2409.17380. [Google Scholar]
- Koopman, P. Lessons from the cruise robotaxi pedestrian dragging mishap. IEEE Reliab. Mag. 2024, 1, 54–61. [Google Scholar] [CrossRef]
- Fuller, R. The task-capability interface model of the driving process. Rech. Transp. Secur. 2000, 66, 47–57. [Google Scholar] [CrossRef]
- Fuller, R. Towards a general theory of driver behaviour. Accid. Anal. Prev. 2005, 37, 461–472. [Google Scholar] [CrossRef]
- Wong, J.T.; Huang, S.H. Modeling Driver Mental Workload for Accident Causation and Prevention. In Proceedings of the Eastern Asia Society for Transportation Studies, Surabaya, Indonesia, 16–19 November 2009; p. 365. [Google Scholar]
- Wu, J.; Gao, B.; Gao, J.; Yu, J.; Chu, H.; Yu, Q.; Gong, X.; Chang, Y.; Tseng, H.E.; Chen, H.; et al. Prospective role of foundation models in advancing autonomous vehicles. Research 2024, 7, 0399. [Google Scholar] [CrossRef]
- Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742. [Google Scholar]
- Brookhuis, K.A.; de Waard, D. Assessment of drivers’ workload: Performance and subjective and physiological indexes. In Stress, Workload, and Fatigue; CRC Press: Boca Raton, FL, USA, 2000; pp. 321–333. [Google Scholar]
- Vlakveld, W. Hazard Anticipation of Young Novice Drivers: Assessing and Enhancing the Capabilities of Young Novice Drivers to Anticipate Latent Hazards in Road and Traffic Situations. Ph.D. Thesis, University of Groningen, Groningen, The Netherlands, 2011. [Google Scholar]
- Cestac, J.; Paran, F.; Delhomme, P. Young drivers’ sensation seeking, subjective norms, and perceived behavioral control and their roles in predicting speeding intention: How risk-taking motivations evolve with gender and driving experience. Saf. Sci. 2011, 49, 424–432. [Google Scholar] [CrossRef]
- Körber, M.; Gold, C.; Lechner, D.; Bengler, K. The influence of age on the take-over of vehicle control in highly automated driving. Transp. Res. Part F Traffic Psychol. Behav. 2016, 39, 19–32. [Google Scholar] [CrossRef]
- Yan, Y.; Zhong, S.; Tian, J.; Song, L. Driving distraction at night: The impact of cell phone use on driving behaviors among young drivers. Transp. Res. Part F Traffic Psychol. Behav. 2022, 91, 401–413. [Google Scholar] [CrossRef]
- Teh, E.; Jamson, S.; Carsten, O.; Jamson, H. Temporal fluctuations in driving demand: The effect of traffic complexity on subjective measures of workload and driving performance. Transp. Res. Part F Traffic Psychol. Behav. 2014, 22, 207–217. [Google Scholar] [CrossRef]
- Engström, J.; Markkula, G.; Victor, T.; Merat, N. Effects of cognitive load on driving performance: The cognitive control hypothesis. Hum. Factors 2017, 59, 734–764. [Google Scholar] [CrossRef] [PubMed]
- Li, X.; Oviedo-Trespalacios, O.; Rakotonirainy, A.; Yan, X. Collision risk management of cognitively distracted drivers in a car-following situation. Transp. Res. Part F Traffic Psychol. Behav. 2019, 60, 288–298. [Google Scholar] [CrossRef]
- Yang, Y.; Chen, Y.; Wu, C.; Easa, S.M.; Lin, W.; Zheng, X. Effect of highway directional signs on driver mental workload and behavior using eye movement and brain wave. Accid. Anal. Prev. 2020, 146, 105705. [Google Scholar] [CrossRef] [PubMed]
- Summala, H. Towards understanding motivational and emotional factors in driver behaviour: Comfort through satisficing. In Modelling Driver Behaviour in Automotive Environments: Critical Issues in Driver Interactions with Intelligent Transport Systems; Springer: London, UK, 2007; pp. 189–207. [Google Scholar]
- Foy, H.J.; Chapman, P. Mental workload is reflected in driver behaviour, physiology, eye movements and prefrontal cortex activation. Appl. Ergon. 2018, 73, 90–99. [Google Scholar] [CrossRef]
- Delmas, M.; Camps, V.; Lemercier, C. Should my automated car drive as I do? Investigating speed preferences of drivengers in various driving conditions. PLoS ONE 2023, 18, e0281702. [Google Scholar] [CrossRef]
- Sun, Z.; Xu, J.; Gu, C.; Xin, T.; Zhang, W. Investigation of Car following and Lane Changing Behavior in Diverging Areas of Tunnel–Interchange Connecting Sections Based on Driving Simulation. Appl. Sci. 2024, 14, 3768. [Google Scholar] [CrossRef]
- Kolekar, S.; De Winter, J.; Abbink, D. Human-like driving behaviour emerges from a risk-based driver model. Nat. Commun. 2020, 11, 1–13. [Google Scholar] [CrossRef]
- Saifuzzaman, M.; Zheng, Z.; Haque, M.M.; Washington, S. Revisiting the Task–Capability Interface model for incorporating human factors into car-following models. Transp. Res. Part B Methodol. 2015, 82, 1–19. [Google Scholar] [CrossRef]
- Delhomme, P.; Meyer, T. Control motivation and young drivers’ decision making. Ergonomics 1998, 41, 373–393. [Google Scholar] [CrossRef]
- Yu, S.Y.; Malawade, A.V.; Muthirayan, D.; Khargonekar, P.P.; Al Faruque, M.A. Scene-graph augmented data-driven risk assessment of autonomous vehicle decisions. IEEE Trans. Intell. Transp. Syst. 2021, 23, 7941–7951. [Google Scholar] [CrossRef]
- de Winkel, K.N.; Christoph, M.; van Nes, N. Towards a framework of driver fitness: Operationalization and comparative risk assessment. Transp. Res. Interdiscip. Perspect. 2024, 23, 101030. [Google Scholar] [CrossRef]
- Rezapour, M.; Ksaibati, K. Identification of factors associated with various types of impaired driving. Humanit. Soc. Sci. Commun. 2022, 9, 1–11. [Google Scholar] [CrossRef]
- Nishitani, Y. Alcohol and traffic accidents in Japan. IATSS Res. 2019, 43, 79–83. [Google Scholar] [CrossRef]
- Shiferaw, B.A.; Crewther, D.P.; Downey, L.A. Gaze entropy measures detect alcohol-induced driver impairment. Drug Alcohol Depend. 2019, 204, 107519. [Google Scholar] [CrossRef] [PubMed]
- Wang, J.; Chai, W.; Venkatachalapathy, A.; Tan, K.L.; Haghighat, A.; Velipasalar, S.; Adu-Gyamfi, Y.; Sharma, A. A survey on driver behavior analysis from in-vehicle cameras. IEEE Trans. Intell. Transp. Syst. 2021, 23, 10186–10209. [Google Scholar] [CrossRef]
- Koch, K.; Maritsch, M.; Van Weenen, E.; Feuerriegel, S.; Pfäffli, M.; Fleisch, E.; Weinmann, W.; Wortmann, F. Leveraging driver vehicle and environment interaction: Machine learning using driver monitoring cameras to detect drunk driving. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, Hamburg, Germany, 23–28 April 2023; pp. 1–32. [Google Scholar]
- Chatterjee, I.; Sharma, A. Driving Fitness Detection: A Holistic Approach for Prevention of Drowsy and Drunk Driving using Computer Vision Techniques. In Proceedings of the 2018 South-Eastern European Design Automation, Computer Engineering, Computer Networks and Society Media Conference (SEEDA_CECNSM), Kastoria, Greece, 22–24 September 2018; pp. 1–6. [Google Scholar] [CrossRef]
- Ki, M.; Cho, B.; Jeon, T.; Choi, Y.; Byun, H. Face identification for an in-vehicle surveillance system using near infrared camera. In Proceedings of the 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, 27–30 November 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–6. [Google Scholar]
- Varghese, R.R.; Jacob, P.M.; Jacob, J.; Babu, M.N.; Ravikanth, R.; George, S.M. An integrated framework for driver drowsiness detection and alcohol intoxication using machine learning. In Proceedings of the 2021 International Conference on Data Analytics for Business and Industry (ICDABI), Sakheer, Bahrain, 25–26 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 531–536. [Google Scholar]
- Dai, J.; Teng, J.; Bai, X.; Shen, Z.; Xuan, D. Mobile phone based drunk driving detection. In Proceedings of the 2010 4th International Conference on Pervasive Computing Technologies for Healthcare, Munich, Germany, 22–25 March 2010; pp. 1–8. [Google Scholar] [CrossRef]
- Zhou, H.; Carballo, A.; Yamaoka, M.; Yamataka, M.; Fujii, K.; Takeda, K. DUIncoder: Learning to Detect Driving Under the Influence Behaviors from Various Normal Driving Data. Sensors 2025, 25, 1699. [Google Scholar] [CrossRef]
- Zhou, H.; Carballo, A.; Yamaoka, M.; Yamataka, M.; Takeda, K. A Self-Supervised Approach for Detection and Analysis of Driving Under Influence. In Proceedings of the 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC), Edmonton, AB, Canada, 24–27 September 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 4130–4137. [Google Scholar]
- Lowrie, J.; Brownlow, H. The impact of sleep deprivation and alcohol on driving: A comparative study. BMC Public Health 2020, 20, 1–9. [Google Scholar] [CrossRef]
- Saxby, D.J.; Matthews, G.; Warm, J.S.; Hitchcock, E.M.; Neubauer, C. Active and passive fatigue in simulated driving: Discriminating styles of workload regulation and their safety impacts. J. Exp. Psychol. Appl. 2013, 19, 287. [Google Scholar] [CrossRef]
- Jackson, M.L.; Croft, R.J.; Kennedy, G.; Owens, K.; Howard, M.E. Cognitive components of simulated driving performance: Sleep loss effects and predictors. Accid. Anal. Prev. 2013, 50, 438–444. [Google Scholar] [CrossRef]
- Zhang, X.; Zhao, X.; Du, H.; Rong, J. A study on the effects of fatigue driving and drunk driving on drivers’ physical characteristics. Traffic Inj. Prev. 2014, 15, 801–808. [Google Scholar] [CrossRef]
- Oviedo-Trespalacios, O.; Haque, M.M.; King, M.; Washington, S. Self-regulation of driving speed among distracted drivers: An application of driver behavioral adaptation theory. Traffic Inj. Prev. 2017, 18, 599–605. [Google Scholar] [CrossRef]
- Turnbull, P.R.; Khanal, S.; Dakin, S.C. The effect of cellphone position on driving and gaze behaviour. Sci. Rep. 2021, 11, 7692. [Google Scholar] [CrossRef]
- McEvoy, S.P.; Stevenson, M.R.; Woodward, M. The impact of driver distraction on road safety: Results from a representative survey in two Australian states. Inj. Prev. 2006, 12, 242–247. [Google Scholar] [CrossRef]
- Sheykhfard, A.; Haghighi, F. Driver distraction by digital billboards? Structural equation modeling based on naturalistic driving study data: A case study of Iran. J. Saf. Res. 2020, 72, 1–8. [Google Scholar] [CrossRef]
- Hughes, G.M.; Rudin-Brown, C.M.; Young, K.L. A simulator study of the effects of singing on driving performance. Accid. Anal. Prev. 2013, 50, 787–792. [Google Scholar] [CrossRef]
- Deffenbacher, J.L.; Deffenbacher, D.M.; Lynch, R.S.; Richards, T.L. Anger, aggression, and risky behavior: A comparison of high and low anger drivers. Behav. Res. Ther. 2003, 41, 701–718. [Google Scholar] [CrossRef]
- Hu, T.Y.; Xie, X.; Li, J. Negative or positive? The effect of emotion and mood on risky driving. Transp. Res. Part F Traffic Psychol. Behav. 2013, 16, 29–40. [Google Scholar] [CrossRef]
- Eboli, L.; Mazzulla, G.; Pungillo, G. The influence of physical and emotional factors on driving style of car drivers: A survey design. Travel Behav. Soc. 2017, 7, 43–51. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 5998–6008. [Google Scholar]
- Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. Palm: Scaling language modeling with pathways. J. Mach. Learn. Res. 2023, 24, 1–113. [Google Scholar]
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744. [Google Scholar]
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
- Ahn, M.; Brohan, A.; Brown, N.; Chebotar, Y.; Cortes, O.; David, B.; Finn, C.; Fu, C.; Gopalakrishnan, K.; Hausman, K.; et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv 2022, arXiv:2204.01691. [Google Scholar] [CrossRef]
- Wang, G.; Xie, Y.; Jiang, Y.; Mandlekar, A.; Xiao, C.; Zhu, Y.; Fan, L.; Anandkumar, A. Voyager: An open-ended embodied agent with large language models. arXiv 2023, arXiv:2305.16291. [Google Scholar] [CrossRef]
- Huang, W.; Wang, C.; Zhang, R.; Li, Y.; Wu, J.; Fei-Fei, L. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv 2023, arXiv:2307.05973. [Google Scholar] [CrossRef]
- Cui, C.; Ma, Y.; Cao, X.; Ye, W.; Zhou, Y.; Liang, K.; Chen, J.; Lu, J.; Yang, Z.; Liao, K.D.; et al. A survey on multimodal large language models for autonomous driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 958–979. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv. Neural Inf. Process. Syst. 2019, 13–23. [Google Scholar]
- Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736. [Google Scholar]
- Huang, P.Y.; Xu, H.; Li, J.; Baevski, A.; Auli, M.; Galuba, W.; Metze, F.; Feichtenhofer, C. Masked autoencoders that listen. Adv. Neural Inf. Process. Syst. 2022, 35, 28708–28720. [Google Scholar]
- Georgescu, M.I.; Fonseca, E.; Ionescu, R.T.; Lucic, M.; Schmid, C.; Arnab, A. Audiovisual masked autoencoders. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 16144–16154. [Google Scholar]
- Guo, Z.; Zhang, R.; Zhu, X.; Tang, Y.; Ma, X.; Han, J.; Chen, K.; Gao, P.; Li, X.; Li, H.; et al. Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following. arXiv 2023, arXiv:2309.00615. [Google Scholar]
- Tsimpoukelli, M.; Menick, J.L.; Cabi, S.; Eslami, S.; Vinyals, O.; Hill, F. Multimodal few-shot learning with frozen language models. Adv. Neural Inf. Process. Syst. 2021, 34, 200–212. [Google Scholar]
- Ding, X.; Han, J.; Xu, H.; Zhang, W.; Li, X. Hilm-d: Towards high-resolution understanding in multimodal large language models for autonomous driving. arXiv 2023, arXiv:2309.05186. [Google Scholar]
- Choudhary, T.; Dewangan, V.; Chandhok, S.; Priyadarshan, S.; Jain, A.; Singh, A.K.; Srivastava, S.; Jatavallabhula, K.M.; Krishna, K.M. Talk2bev: Language-enhanced bird’s-eye view maps for autonomous driving. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 16345–16352. [Google Scholar]
- Hu, A.; Russell, L.; Yeo, H.; Murez, Z.; Fedoseev, G.; Kendall, A.; Shotton, J.; Corrado, G. Gaia-1: A generative world model for autonomous driving. arXiv 2023, arXiv:2309.17080. [Google Scholar] [CrossRef]
- Yang, M.; Du, Y.; Ghasemipour, K.; Tompson, J.; Schuurmans, D.; Abbeel, P. Learning interactive real-world simulators. arXiv 2023, 1, 6. [Google Scholar]
- Chen, L.; Sinavski, O.; Hünermann, J.; Karnsund, A.; Willmott, A.J.; Birch, D.; Maund, D.; Shotton, J. Driving with llms: Fusing object-level vector modality for explainable autonomous driving. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 14093–14100. [Google Scholar]
- Fu, D.; Li, X.; Wen, L.; Dou, M.; Cai, P.; Shi, B.; Qiao, Y. Drive like a human: Rethinking autonomous driving with large language models. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 1–6 January 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 910–919. [Google Scholar]
- Xu, Z.; Zhang, Y.; Xie, E.; Zhao, Z.; Guo, Y.; Wong, K.Y.K.; Li, Z.; Zhao, H. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robot. Autom. Lett. 2024. [Google Scholar] [CrossRef]
- Shao, H.; Hu, Y.; Wang, L.; Song, G.; Waslander, S.L.; Liu, Y.; Li, H. Lmdrive: Closed-loop end-to-end driving with large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 15120–15130. [Google Scholar]
- Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised Contrastive Learning. In Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 18661–18673. [Google Scholar]
- Burges, C.; Shaked, T.; Renshaw, E.; Lazier, A.; Deeds, M.; Hamilton, N.; Hullender, G. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 7–11 August 2005; pp. 89–96. [Google Scholar]
- Sectional Committee of AD Safety Evaluation, Automated Driving Subcommittee; Japan Automobile Manufacturers Association, Inc. Automated Driving Safety Evaluation Framework Ver. 3.0: Guidelines for Safety Evaluation of Automated Driving Technology; Technical Report; Japan Automobile Manufacturers Association: Tokyo, Japan, 2022. [Google Scholar]
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 8026–8037. [Google Scholar]
- Li, D.; Li, J.; Le, H.; Wang, G.; Savarese, S.; Hoi, S.C. LAVIS: A One-stop Library for Language-Vision Intelligence. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Toronto, Canada, 9–14 July 2023; pp. 31–41. [Google Scholar]
- Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 38–45. [Google Scholar]
- Sima, C.; Renz, K.; Chitta, K.; Chen, L.; Zhang, H.; Xie, C.; Luo, P.; Geiger, A.; Li, H. DriveLM: Driving with Graph Visual Question Answering. arXiv 2023, arXiv:2312.14150. [Google Scholar] [CrossRef]
- Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
- Romera, E.; Bergasa, L.M.; Arroyo, R. Need data for driver behaviour analysis? Presenting the public UAH-DriveSet. In Proceedings of the 2016 IEEE 19th international conference on intelligent transportation systems (ITSC), Rio de Janeiro, Brazil, 1–4 November 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 387–392. [Google Scholar]
| I2I ↓ | N2N ↓ | I2N ↑
---|---|---|---
Demand | 0.207 | 0.188 | 0.260 |
Capability | 0.209 | 0.190 | 0.263 |
Scenario | Demand: Similar ↓ | Demand: Different ↑ | Demand: Average | Capability: Similar ↓ | Capability: Different ↑ | Capability: Average
---|---|---|---|---|---|---
IA1 | 0.238 | 0.233 | 0.234 | 0.241 | 0.236 | 0.238 |
IA2 | 0.180 | 0.265 | 0.236 | 0.181 | 0.267 | 0.237 |
IA3 | 0.180 | 0.272 | 0.240 | 0.182 | 0.276 | 0.243 |
IA4 | 0.197 | 0.293 | 0.240 | 0.201 | 0.301 | 0.266 |
IB1 | 0.233 | 0.228 | 0.230 | 0.235 | 0.231 | 0.233 |
IB2 | 0.207 | 0.203 | 0.205 | 0.209 | 0.204 | 0.206 |
IB3 | 0.213 | 0.323 | 0.284 | 0.215 | 0.329 | 0.289 |
IB4 | 0.223 | 0.270 | 0.254 | 0.225 | 0.274 | 0.256 |
NA1 | 0.261 | 0.398 | 0.309 | 0.263 | 0.402 | 0.312 |
NA2 | 0.200 | 0.293 | 0.233 | 0.204 | 0.302 | 0.238 |
NA3 | 0.215 | 0.278 | 0.237 | 0.217 | 0.280 | 0.239 |
NA4 | 0.196 | 0.251 | 0.215 | 0.197 | 0.255 | 0.217 |
NA5 | 0.146 | 0.243 | 0.180 | 0.148 | 0.247 | 0.183 |
NA6 | 0.149 | 0.245 | 0.183 | 0.151 | 0.249 | 0.185 |
NA7 | 0.194 | 0.219 | 0.203 | 0.195 | 0.222 | 0.205 |
NB1 | 0.174 | 0.236 | 0.196 | 0.177 | 0.241 | 0.199 |
NB2 | 0.167 | 0.278 | 0.206 | 0.168 | 0.282 | 0.208 |
NB3 | 0.145 | 0.232 | 0.175 | 0.146 | 0.234 | 0.177 |
NB4 | 0.161 | 0.216 | 0.180 | 0.162 | 0.219 | 0.182 |
NB5 | 0.164 | 0.265 | 0.199 | 0.165 | 0.267 | 0.201 |
Scenario | ↓ | ↓ | ↑ | ↑ | |
---|---|---|---|---|---|
IA1 | 16.7 | 0 | 16.7 | 16.7 | 50.0 |
IA2 | 5.6 | 0 | 33.3 | 33.3 | 27.8 |
IA3 | 5.6 | 11.1 | 27.8 | 5.6 | 50.0 |
IA4 | 16.7 | 11.1 | 61.1 | 5.6 | 5.6 |
IB1 | 9.1 | 3.0 | 21.2 | 12.1 | 54.5 |
IB2 | 11.8 | 17.6 | 23.5 | 17.6 | 29.4 |
IB3 | 11.8 | 2.9 | 23.5 | 2.9 | 58.8 |
IB4 | 10.5 | 5.2 | 31.6 | 10.5 | 42.1 |
NA1 | 5.6 | 11.1 | 38.9 | 16.7 | 27.8 |
NA2 | 10.0 | 10.0 | 60.0 | 15.0 | 5.0 |
NA3 | 5.3 | 5.3 | 21.1 | 26.3 | 42.1 |
NA4 | 6.6 | 9.8 | 21.3 | 13.1 | 49.2 |
NA5 | 9.7 | 8.3 | 23.6 | 20.8 | 37.5 |
NA6 | 8.3 | 2.8 | 16.7 | 30.6 | 41.7 |
NA7 | 15.1 | 12.1 | 27.2 | 9.1 | 36.3 |
NB1 | 0 | 5.9 | 47.1 | 5.9 | 41.2 |
NB2 | 8.1 | 12.9 | 37.1 | 12.9 | 29.0 |
NB3 | 15.7 | 2.0 | 13.7 | 23.5 | 45.1 |
NB4 | 5.7 | 14.3 | 31.4 | 17.1 | 31.4 |
NB5 | 5.9 | 5.9 | 47.1 | 11.8 | 29.4 |
Overall | 9.4 | 7.8 | 28.4 | 15.9 | 38.5 |
Scenario | Normal: M.Ref | Normal: M.Pred | Normal: MAE ↓ | Drunk: M.Ref | Drunk: M.Pred | Drunk: MAE ↓
---|---|---|---|---|---|---
IA1 | 0.068 | 0.002 | 0.065 | 0.130 | 0.080 | 0.100 |
IA2 | 0.072 | 0.034 | 0.041 | 0.113 | 0.081 | 0.054 |
IA3 | 0.038 | 0.011 | 0.034 | 0.031 | 0.074 | 0.063 |
IA4 | 0.068 | 0.016 | 0.054 | 0.101 | 0.022 | 0.090 |
IB1 | 0.070 | 0.032 | 0.050 | 0.161 | 0.150 | 0.104 |
IB2 | 0.042 | 0.006 | 0.039 | 0.098 | 0.007 | 0.092 |
IB3 | 0.051 | 0.014 | 0.044 | 0.096 | 0.031 | 0.076 |
IB4 | 0.042 | 0.013 | 0.031 | 0.052 | 0.025 | 0.040 |
NA1 | 0 | 0.001 | 0.001 | 0.003 | 0.001 | 0.002 |
NA2 | 0.002 | 0.006 | 0.006 | 0.005 | 0.027 | 0.025 |
NA3 | 0.041 | 0.022 | 0.036 | 0.090 | 0.126 | 0.124 |
NA4 | 0.015 | 0.002 | 0.014 | 0.042 | 0.017 | 0.026 |
NA5 | 0.013 | 0.001 | 0.013 | 0.033 | 0.005 | 0.032 |
NA6 | 0.025 | 0.002 | 0.023 | 0.046 | 0.018 | 0.041 |
NA7 | 0.006 | 0.003 | 0.006 | 0.068 | 0.032 | 0.046 |
NB1 | 0.007 | 0 | 0.007 | 0.006 | 0.002 | 0.006 |
NB2 | 0.008 | 0.001 | 0.008 | 0.023 | 0.007 | 0.018 |
NB3 | 0.008 | 0.002 | 0.009 | 0.019 | 0.005 | 0.017 |
NB4 | 0.021 | 0.006 | 0.017 | 0.042 | 0.032 | 0.045 |
NB5 | 0.043 | 0.039 | 0.052 | 0.054 | 0.059 | 0.061 |
Overall | 0.029 | 0.008 | 0.026 | 0.055 | 0.031 | 0.045 |
Dimension | Demand: Similar ↓ | Demand: Different ↑ | Capability: Similar ↓ | Capability: Different ↑
---|---|---|---|---
8 | 0.142 | 0.177 | 0.177 | 0.181 |
16 | 0.209 | 0.260 | 0.194 | 0.264 |
32 | 0.270 | 0.349 | 0.267 | 0.344 |
Dimension | ↓ | ↓ | ↑ | ↑ | |
---|---|---|---|---|---|
8 | 9.7 | 10.2 | 34.9 | 18.3 | 26.8 |
16 | 9.4 | 7.8 | 28.4 | 15.9 | 38.5 |
32 | 11.4 | 8.6 | 24.8 | 15.2 | 39.9 |
Dimension | Normal: ME | Normal: MAE ↓ | Drunk: ME | Drunk: MAE ↓
---|---|---|---|---
8 | −0.015 | 0.025 | −0.033 | 0.044 |
16 | −0.022 | 0.026 | −0.024 | 0.045 |
32 | −0.025 | 0.027 | −0.044 | 0.047 |
Tolerance: Relaxed | Tolerance: Stringent | Normal: ME | Normal: MAE ↓ | Drunk: ME | Drunk: MAE ↓
---|---|---|---|---|---
▲ | | −0.022 | 0.026 | −0.024 | 0.045
 | ▲ | 0.055 | 0.110 | −0.195 | 0.209
▲ | Δ | −0.012 | 0.026 | −0.037 | 0.047
Δ | ▲ | 0.063 | 0.114 | −0.178 | 0.196
Method | Tolerance: Relaxed | Tolerance: Stringent | Normal: ME | Normal: MAE ↓ | Drunk: ME | Drunk: MAE ↓
---|---|---|---|---|---|---
baseline | ▲ | | −0.022 | 0.026 | −0.024 | 0.045
w/ NN | ▲ | | −0.019 | 0.027 | −0.044 | 0.051
w/ NN & EMB | ▲ | | −0.014 | 0.021 | −0.016 | 0.037
baseline | | ▲ | 0.055 | 0.110 | −0.195 | 0.209
w/ NN | | ▲ | −0.011 | 0.083 | −0.276 | 0.283
w/ NN & EMB | | ▲ | −0.012 | 0.061 | −0.063 | 0.102
Status | ||||||
---|---|---|---|---|---|---|
Normal | 4.8 | 6.2 | 48.6 | 23.1 | 10.6 | 6.7 |
Drowsy | 6.9 | 10.8 | 46.9 | 26.8 | 5.3 | 3.3 |
Aggressive | 5.1 | 6.4 | 37.2 | 34.1 | 9.4 | 7.8 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).