This is an early access version; the complete PDF, HTML, and XML versions will be available soon.
Article

Robust Occupant Behavior Recognition via Multimodal Sequence Modeling: A Comparative Study for In-Vehicle Monitoring Systems

1
College of Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588, USA
2
University of Michigan Transportation Research Institute, Ann Arbor, MI 48109, USA
*
Author to whom correspondence should be addressed.
Sensors 2025, 25(20), 6323; https://doi.org/10.3390/s25206323 (registering DOI)
Submission received: 4 September 2025 / Revised: 4 October 2025 / Accepted: 10 October 2025 / Published: 13 October 2025

Abstract

Understanding occupant behavior is critical for enhancing safety and situational awareness in intelligent transportation systems. This study investigates multimodal occupant behavior recognition using sequential inputs extracted from 2D pose, 2D gaze, and facial movements. We conduct a comparative study of three architectural paradigms: a static Multi-Layer Perceptron (MLP), a recurrent Long Short-Term Memory (LSTM) network, and an attention-based Transformer encoder. All experiments are performed on the large-scale Occupant Behavior Classification (OBC) dataset, which contains approximately 2.1 million frames across 79 behavior classes collected in a controlled, simulated environment. Our results demonstrate that the temporal models significantly outperform the static baseline. The Transformer emerges as the strongest architecture, achieving a state-of-the-art Macro F1 score of 0.9570 with a 50-frame span and a step size of 10. Our analysis further shows that the Transformer offers a favorable balance between performance and computational efficiency. These findings demonstrate the advantage of attention-based temporal modeling with multimodal fusion and provide a practical framework for developing robust and efficient in-vehicle occupant monitoring systems. Implementation code and supplementary resources are available (see Data Availability Statement).
Keywords: occupant behavior recognition; multimodal learning; 2D pose; gaze estimation; facial movement; temporal modeling; LSTM; MLP; Transformer; sequence classification; occupant monitoring
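
As an illustration of the two quantities named in the abstract, the following minimal sketch shows how a frame sequence might be cut into windows with a 50-frame span and a step size of 10, and how a Macro F1 score is computed as the unweighted mean of per-class F1 scores. It is not the authors' implementation: the function names `sliding_windows`, `window_label`, and `macro_f1` are hypothetical, the step is assumed to be counted in frames, and the majority-vote labeling of a window is an assumption made only for this example.

```python
from collections import Counter


def sliding_windows(frames, span=50, step=10):
    """Cut a frame sequence into overlapping windows.

    Each window holds `span` consecutive frames; consecutive windows
    start `step` frames apart. Trailing frames that cannot fill a
    complete window are dropped (an assumption of this sketch).
    """
    last_start = len(frames) - span
    return [frames[s:s + span] for s in range(0, last_start + 1, step)]


def window_label(frame_labels):
    """Assign one label to a window by majority vote (assumed rule)."""
    return Counter(frame_labels).most_common(1)[0][0]


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes that appear."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    frames = list(range(100))          # stand-in for 100 per-frame feature vectors
    windows = sliding_windows(frames)  # span=50, step=10
    print(len(windows), len(windows[0]))
    print(macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

Because Macro F1 weights every class equally, a rarely occurring behavior class affects the score as much as a frequent one, which matters for a dataset with many classes.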

Share and Cite

MDPI and ACS Style

Kim, J.; Park, B.-K.D. Robust Occupant Behavior Recognition via Multimodal Sequence Modeling: A Comparative Study for In-Vehicle Monitoring Systems. Sensors 2025, 25, 6323. https://doi.org/10.3390/s25206323

AMA Style

Kim J, Park B-KD. Robust Occupant Behavior Recognition via Multimodal Sequence Modeling: A Comparative Study for In-Vehicle Monitoring Systems. Sensors. 2025; 25(20):6323. https://doi.org/10.3390/s25206323

Chicago/Turabian Style

Kim, Jisu, and Byoung-Keon D. Park. 2025. "Robust Occupant Behavior Recognition via Multimodal Sequence Modeling: A Comparative Study for In-Vehicle Monitoring Systems" Sensors 25, no. 20: 6323. https://doi.org/10.3390/s25206323

APA Style

Kim, J., & Park, B.-K. D. (2025). Robust Occupant Behavior Recognition via Multimodal Sequence Modeling: A Comparative Study for In-Vehicle Monitoring Systems. Sensors, 25(20), 6323. https://doi.org/10.3390/s25206323

