Article

Robust Occupant Behavior Recognition via Multimodal Sequence Modeling: A Comparative Study for In-Vehicle Monitoring Systems

1
College of Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588, USA
2
University of Michigan Transportation Research Institute, Ann Arbor, MI 48109, USA
*
Author to whom correspondence should be addressed.
Sensors 2025, 25(20), 6323; https://doi.org/10.3390/s25206323
Submission received: 4 September 2025 / Revised: 4 October 2025 / Accepted: 10 October 2025 / Published: 13 October 2025

Abstract

Understanding occupant behavior is critical for enhancing safety and situational awareness in intelligent transportation systems. This study investigates multimodal occupant behavior recognition using sequential inputs extracted from 2D pose, 2D gaze, and facial movements. We conduct a comprehensive comparative study of three distinct architectural paradigms: a static Multi-Layer Perceptron (MLP), a recurrent Long Short-Term Memory (LSTM) network, and an attention-based Transformer encoder. All experiments are performed on the large-scale Occupant Behavior Classification (OBC) dataset, which contains approximately 2.1 million frames across 79 behavior classes collected in a controlled, simulated environment. Our results demonstrate that temporal models significantly outperform the static baseline. The Transformer model, in particular, emerges as the superior architecture, achieving a state-of-the-art Macro F1 score of 0.9570 with a configuration of a 50-frame span and a step size of 10. Furthermore, our analysis reveals that the Transformer provides an excellent balance between high performance and computational efficiency. These findings demonstrate the superiority of attention-based temporal modeling with multimodal fusion and provide a practical framework for developing robust and efficient in-vehicle occupant monitoring systems. Implementation code and supplementary resources are available (see Data Availability Statement).

1. Introduction

Occupant behavior recognition has emerged as a crucial component of intelligent transportation systems, enabling real-time monitoring to enhance road safety and situational awareness. Traditional approaches often rely on single-modality visual cues and static frame-level classifiers, which struggle with the subtle, temporally dependent patterns found in complex, simulated driving environments. Moreover, a single feature type is often insufficient to capture the diverse range of behaviors, from gross body movements to fine-grained facial expressions.
Recent advancements in multimodal learning and temporal modeling have shown promise in addressing these limitations. By combining complementary cues such as body pose, gaze, and facial movements, a more holistic understanding of occupant behavior can be achieved. Temporal models, like LSTMs, can further exploit sequential dependencies to distinguish between visually similar yet temporally distinct actions. More recently, attention-based architectures such as the Transformer have demonstrated state-of-the-art performance in various sequence modeling tasks, offering an alternative approach to capturing long-range dependencies.
To address these challenges, this paper presents a lightweight and modular framework for occupant behavior recognition that leverages temporal modeling of multi-feature inputs. Our approach fuses three complementary modalities—2D pose, 2D gaze, and facial movement (FM)—into fixed-length sequences, which are then classified using three distinct architectures: a static Multi-Layer Perceptron (MLP), a recurrent Long Short-Term Memory (LSTM) network, and an attention-based Transformer encoder. We conduct a comprehensive evaluation on the large-scale Occupant Behavior Classification (OBC) dataset, and our main contributions are as follows:
  • A multimodal occupant behavior recognition pipeline that integrates 2D pose, 2D gaze, and facial movement (FM) features using an early fusion strategy.
  • A comparative analysis of static (MLP), recurrent (LSTM), and attention-based (Transformer) classification models, highlighting the benefits of temporal modeling for complex behavior recognition.
  • An extensive ablation study on the effects of feature combinations, sequence lengths, and frame sampling strategies, providing insights into optimal design choices for in-vehicle monitoring systems.
  • A lightweight and computationally efficient design suitable for practical deployment, supported by performance and inference cost evaluations.
Through these contributions, this work underscores the importance of multimodal fusion and temporal modeling for occupant behavior recognition, offering practical guidelines for the development of robust occupant monitoring systems for in-vehicle applications.

2. Related Work

2.1. Pose Estimation for Occupant Behavior

Accurate pose estimation is essential for capturing body dynamics during driving. Recent YOLO-based frameworks have demonstrated real-time, high-accuracy keypoint detection by integrating object detection and pose estimation into a unified pipeline. YOLO-Pose extends the YOLO architecture for multi-person 2D pose estimation, jointly predicting bounding boxes and keypoints in a single stage, achieving state-of-the-art performance on large-scale benchmarks [1]. Building on this, YOLOv8-PoseBoost incorporates channel attention modules, multi-scale detection heads, and cross-level feature fusion to improve small-target detection in complex environments [2]. These advances provide a robust foundation for extracting spatial cues in occupant monitoring systems.

2.2. Gaze Estimation

Gaze estimation is a key indicator of occupant attention and situational awareness. UniGaze [3] proposes a universal gaze estimation framework trained on large-scale, in-the-wild face datasets using masked autoencoder (MAE) [4] pre-training with a Vision Transformer backbone. This approach improves cross-domain generalization under both leave-one-dataset-out and joint-dataset evaluation protocols, making it suitable for deployment in diverse and unconstrained driving scenarios.

2.3. Facial Movement Modeling

Subtle facial movements can provide critical clues for identifying occupant states, such as inattention or drowsiness. The FMAE-IAT framework [5] leverages MAE pre-training on the large-scale Face9M dataset, combined with identity adversarial training to avoid identity-dependent biases. It achieves state-of-the-art performance on action unit detection benchmarks such as BP4D [6], BP4D+ [7], and DISFA [8], highlighting its capacity to capture fine-grained facial behavior.

2.4. AI-Based In-Vehicle Occupant Behavior Recognition

AI-based behavior recognition is a cornerstone of modern in-vehicle occupant monitoring systems. A significant body of research has focused on driver-centric applications, leveraging machine learning and deep learning to enhance safety. Convolutional Neural Networks (CNNs), in particular, have been widely adopted for detecting driver distraction. For instance, Xing et al. (2019) [9] utilized models like AlexNet and GoogLeNet to classify seven driver activities, achieving up to 91.4% accuracy in distinguishing distracted from normal driving. Similarly, Valeriano et al. (2018) [10] recognized 10 types of distracted behaviors with 97% accuracy using a ResNet-based model. Beyond deep learning, traditional methods like Support Vector Machines (SVMs) and Decision Trees have also proven effective. Costa et al. (2019) [11] reached 89–93% accuracy in detecting driver fatigue and distraction, while Kumar and Patra (2018) [12] achieved 95.58% sensitivity in drowsiness detection using SVMs with facial features.
More recent works have adopted multimodal approaches, integrating data from RGB, depth, and infrared sensors to capture a richer representation of behavior. Ortega et al. (2020) [13] demonstrated a system that monitors distraction, drowsiness, gaze, and hand–wheel interactions, reporting performance exceeding 90%. While these foundational studies primarily target the driver for safety-critical alerts, their methodologies are broadly applicable to understanding the behaviors of all vehicle occupants, paving the way for more holistic in-cabin monitoring systems. Alongside these, attention-based models like the Transformer [14], originally developed for natural language processing, are increasingly being adapted for time-series and sequence modeling tasks due to their proficiency in capturing long-range dependencies.

2.5. Summary and Positioning

Previous studies have successfully established methodologies for classifying specific, often safety-critical, occupant behaviors within a limited range of 7–10 categories using techniques like CNNs and SVMs [9,10,11]. However, this focus on the driver often overlooks the broader spectrum of general occupant behaviors, and many studies do not systematically compare different feature sets and temporal modeling configurations.
In contrast, our work addresses these gaps by proposing a lightweight pipeline designed for comprehensive occupant behavior recognition. We leverage state-of-the-art pre-trained models—YOLOv8-Pose, UniGaze, and FMAE-IAT—as efficient feature extractors for three complementary cues: 2D pose, 2D gaze, and facial movement. Crucially, our work is distinguished by its validation on the large-scale Occupant Behavior Classification (OBC) dataset, which encompasses 79 diverse occupant behavior classes, moving far beyond driver-specific tasks. We conduct an extensive ablation study to systematically compare three distinct architectural paradigms, a static model (MLP), a recurrent model (LSTM), and an attention-based model (Transformer), and analyze the impact of sequence length and frame sampling strategies. This positions our work at the intersection of multimodal fusion and temporal modeling, providing a robust framework and practical insights for developing next-generation in-vehicle occupant monitoring systems.

3. Methodology

To address the challenges of recognizing complex, temporally dependent occupant behaviors, we designed a lightweight and modular recognition pipeline. Our approach prioritizes both high accuracy through multimodal fusion and computational efficiency by freezing feature extractors. As illustrated in Figure 1, the pipeline is divided into three main stages:
  • Feature Extraction: For each input frame, we extract three types of features—2D pose, 2D gaze, and facial movement (FM). Pre-trained models are used to extract these features, and to improve computational efficiency, the feature extractors are frozen during training.
  • Fusion and Sequence Construction: The extracted features from each modality are concatenated to form a unified feature vector per frame. Then, consecutive frames are grouped into sequences based on a specified number of frames and step size.
  • Temporal Classification: The constructed sequences are fed into a lightweight classifier. We compare three distinct architectures: a static MLP, a recurrent LSTM, and an attention-based Transformer. Only the classifier is trainable, keeping the rest of the pipeline fixed.
This modular design allows easy experimentation with different combinations of input features, sequence lengths, and model architectures, facilitating both ablation and computational cost analysis.

3.1. Multi-Feature Fusion

To construct a comprehensive representation of occupant behavior, we fuse three complementary modalities: 2D pose, 2D gaze, and facial movement (FM). Each feature type captures a different aspect of occupant behavior: pose encodes gross body movement, gaze reflects visual attention, and FM captures subtle expressions related to the occupant’s state (e.g., drowsiness or inattention). Each modality is processed by a specialized, pre-trained feature extractor chosen for its state-of-the-art performance and efficiency, as discussed in Section 2.
  • Two-dimensional Pose: We employ YOLOv8-Pose [15], selected for its high accuracy and real-time keypoint detection capabilities crucial for in-vehicle monitoring.
  • Two-dimensional Gaze: We use UniGaze [3], which offers robust cross-domain generalization, making it suitable for diverse and unconstrained driving scenarios.
  • Facial Movement: We utilize FMAE-IAT [5] to extract a 12-dimensional vector of Facial Action Units (AUs). The process involves detecting and cropping the occupant’s face, resizing it, and feeding it into the frozen FMAE-IAT feature extractor, which directly outputs the 12-dimensional AU intensity vector.
Once extracted, the features from each modality are concatenated along the channel axis for each frame. This early fusion strategy allows the temporal model to learn from a unified representation that incorporates information across all modalities. By design, these feature extraction modules are frozen during training to maintain a lightweight pipeline and ensure computational efficiency.
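The early fusion step described above can be sketched as a per-frame concatenation. This is a minimal illustration, not the authors' implementation: only the 12-dimensional AU vector is stated in the text, while the pose and gaze sizes (`POSE_DIM`, `GAZE_DIM`) and the function name `fuse_frame_features` are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical per-frame feature sizes; only FM_DIM = 12 comes from the text.
POSE_DIM = 34  # assumed: 17 keypoints x (x, y)
GAZE_DIM = 2   # assumed: two gaze components per frame
FM_DIM = 12    # stated in the text: 12 action-unit intensities


def fuse_frame_features(pose, gaze, fm):
    """Concatenate per-frame modality features along the channel (last) axis."""
    pose, gaze, fm = (np.asarray(a, dtype=np.float32) for a in (pose, gaze, fm))
    n_frames = pose.shape[0]
    if gaze.shape[0] != n_frames or fm.shape[0] != n_frames:
        raise ValueError("all modalities must cover the same frames")
    return np.concatenate([pose, gaze, fm], axis=-1)


# Random stand-ins for the outputs of the frozen feature extractors.
rng = np.random.default_rng(0)
num_frames = 100
fused = fuse_frame_features(
    rng.normal(size=(num_frames, POSE_DIM)),
    rng.normal(size=(num_frames, GAZE_DIM)),
    rng.normal(size=(num_frames, FM_DIM)),
)
print(fused.shape)  # (100, 48)
```

Because fusion happens before the classifier, dropping a modality for an ablation only changes the input width of the downstream model.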

3.2. Temporal Sequence Modeling

Occupant behaviors are inherently temporal phenomena. To effectively model these dynamics while managing computational load, we transform the continuous video data into discrete sequences using a two-stage sampling process governed by three key hyperparameters, as illustrated in Figure 2.
First, we define a sequence span ($L_{\text{span}}$), which is the total duration of the temporal window from the raw video. Second, from within this span, we downsample a fixed number of frame samples ($L_{\text{samples}}$). These frames are selected at a uniform interval to form the final input sequence.
Finally, the step size (S) determines the offset by which this entire sequence span window is moved to create the next overlapping sequence. A single ground-truth label is assigned to each final sequence by taking the majority vote of the frame-level labels within its span.
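The two-stage sampling and majority-vote labeling can be sketched as follows. This is a hedged illustration under stated assumptions: the function name `build_sequences`, the use of evenly spaced rounded offsets for the uniform sampling, and the tie-breaking rule in the vote are not specified in the text and are chosen here for the example.

```python
import numpy as np


def build_sequences(features, labels, span, num_samples, step):
    """Slide a window of `span` frames by `step`, keep `num_samples` evenly
    spaced frames per window, and label each window by majority vote."""
    features = np.asarray(features)
    labels = np.asarray(labels)
    if not 1 <= num_samples <= span:
        raise ValueError("num_samples must be between 1 and span")
    # Evenly spaced frame offsets inside one window (assumed sampling rule).
    offsets = np.linspace(0, span - 1, num_samples).round().astype(int)
    seqs, seq_labels = [], []
    for start in range(0, len(features) - span + 1, step):
        # Majority vote over all frame labels in the span, not only the sampled ones.
        classes, counts = np.unique(labels[start:start + span], return_counts=True)
        seq_labels.append(classes[np.argmax(counts)])  # ties go to the smaller class id
        seqs.append(features[start + offsets])
    if not seqs:
        empty_x = np.empty((0, num_samples, features.shape[1]))
        return empty_x, np.empty((0,), dtype=labels.dtype)
    return np.stack(seqs), np.asarray(seq_labels)


# Toy recording: 120 frames of 48-dim fused features, class 0 then class 1.
frame_features = np.zeros((120, 48))
frame_labels = np.array([0] * 70 + [1] * 50)
X, y = build_sequences(frame_features, frame_labels, span=50, num_samples=25, step=10)
print(X.shape)     # (8, 25, 48)
print(y.tolist())  # [0, 0, 0, 0, 0, 1, 1, 1]
```

In the toy run, windows that straddle the class boundary take whichever label covers more of their 50 frames.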

3.3. Classifier Architectures

For classifying the fused feature sequences, we implemented and compared three architectures representing different modeling paradigms: a static model (MLP), a recurrent model (LSTM), and an attention-based model (Transformer). Our design focuses on keeping these classifiers lightweight while freezing the upstream feature extractors, which is critical for practical deployment. The detailed architectural parameters for each model are summarized in Table 1.

3.3.1. Multi-Layer Perceptron (MLP)

The MLP serves as our static baseline. It processes a sequence by flattening all temporal features into a single large vector, thus ignoring explicit temporal ordering. Our implementation consists of four fully connected layers with ReLU activations and batch normalization, which progressively reduce the feature dimension before a final classification layer.

3.3.2. Long Short-Term Memory (LSTM)

As a representative recurrent model, the LSTM is chosen for its ability to model temporal dependencies by processing sequences step by step and maintaining an internal memory state. We use a three-layer unidirectional LSTM, where the mean-pooled output of the final time step’s hidden state is passed through a layer normalization step before being used for classification.

3.3.3. Transformer

To represent attention-based models, we use a Transformer encoder architecture. The model first projects the input features into a higher-dimensional space and adds sinusoidal positional encodings to retain sequence order. The data is then processed by a stack of four multi-head self-attention layers, which allows the model to weigh the importance of all frames in the sequence simultaneously. The final classification is made from the mean-pooled and layer-normalized output of the encoder.
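The input preparation for the encoder (projection followed by sinusoidal positional encodings) can be sketched in NumPy. The encoding follows the standard sine/cosine formulation; the model width `d_model = 128`, the input width of 48, and the random projection matrix `w_proj` are illustrative assumptions, since the actual values are given in Table 1 and the projection is learned.

```python
import numpy as np


def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed encoding: sine on even channels, cosine on odd channels."""
    if d_model % 2 != 0:
        raise ValueError("d_model is assumed to be even in this sketch")
    positions = np.arange(seq_len)[:, None]  # (seq_len, 1)
    div_term = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions * div_term)
    pe[:, 1::2] = np.cos(positions * div_term)
    return pe


rng = np.random.default_rng(0)
seq_len, in_dim, d_model = 50, 48, 128          # hypothetical sizes
x = rng.normal(size=(seq_len, in_dim))          # one fused feature sequence
w_proj = rng.normal(size=(in_dim, d_model)) * 0.02  # stand-in for the learned projection
pe = sinusoidal_positional_encoding(seq_len, d_model)
encoder_input = x @ w_proj + pe                 # what the self-attention stack would receive
print(encoder_input.shape)  # (50, 128)
```

Since self-attention is order-agnostic on its own, the added encoding is what lets the model distinguish the first frame of a sequence from the last.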

4. Experiments

In this section, we describe the dataset used in our study, the evaluation metrics, and the implementation and training details. We then provide a comprehensive analysis of our results, including an ablation study that examines the contribution of each component.

4.1. Dataset

For this study, we utilized the Occupant Behavior Classification (OBC) dataset. This dataset was originally collected at the University of Michigan Transportation Research Institute (UMTRI) to investigate occupant behaviors across different levels of simulated vehicle automation (protocol approved by the UMTRI Institutional Review Board: HUM00162942). The dataset is not publicly available due to privacy protection considerations. The data collection included 42 licensed drivers (21 men and 21 women) with a broad range of anthropometric characteristics and ages from 18 to 59 years. All participants were recorded in a stationary 2018 Hyundai Genesis G90 sedan equipped with two Microsoft Azure Kinect sensors mounted near the A-pillars to capture both front seats.
The dataset contains approximately 2.1 million frames captured at 10 frames per second with a resolution of 1280 × 720. It covers 79 distinct occupant behavior classes, which were elicited by asking participants to perform a series of scripted tasks. To elicit naturalistic-style behavior, participants were instructed to perform these tasks as they normally would in a real moving vehicle and to find postures they would consider comfortable for a long ride. These tasks were performed under three simulated automation levels: Manual (MN), Fully Automated (FA), and Semi-Automated (SA). For the MN and SA sessions, the participant was seated in the driver’s seat, while for the FA session, they were moved to the passenger’s seat to reflect a non-driving role. The data includes synchronized video from two front-facing camera views, one positioned in front of the driver seat and the other in front of the passenger seat. The OBC dataset captures a variety of controlled driving conditions, including scenarios with a single driver as well as those with passengers seated in the back. Each frame is annotated with a single occupant behavior class.
For our experiments, the dataset was split into training (80%, 1.68 M frames), validation (10%, 210 K frames), and testing (10%, 210 K frames) subsets. The full list of behavior classes is provided in Appendix A. It is important to note the constraints of the data collection environment. The experiments were conducted in a stationary vehicle with a locked steering wheel, and some seat adjustment controls were deactivated to standardize conditions. Behaviors were elicited via scripted prompts from an investigator, which may differ from fully spontaneous actions in an on-road driving context.

4.2. Evaluation Metrics

To evaluate the performance of occupant behavior recognition models, we adopt five widely used metrics for multi-class classification: accuracy, Balanced Accuracy, Macro F1, Weighted F1, and the confusion matrix. Accuracy measures the overall proportion of correctly classified instances:
\[
\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = y_i)
\]
where $N$ is the total number of instances, $y_i$ is the ground-truth label, $\hat{y}_i$ is the predicted label, and $\mathbb{1}(\cdot)$ is the indicator function. Balanced Accuracy computes the average recall over all $C$ classes, mitigating the impact of class imbalance:
\[
\text{Balanced Accuracy} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FN_c}
\]
where $TP_c$ and $FN_c$ denote the true positives and false negatives for class $c$. Macro F1 is the unweighted average of per-class F1-scores:
\[
\text{Macro F1} = \frac{1}{C} \sum_{c=1}^{C} F1_c \quad \text{with} \quad F1_c = \frac{2 \cdot \text{Precision}_c \cdot \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}
\]
Weighted F1 computes the F1-score per class and weights each score by the number of instances in that class:
\[
\text{Weighted F1} = \sum_{c=1}^{C} \frac{n_c}{N} \cdot F1_c
\]
where $n_c$ is the number of true instances of class $c$. A confusion matrix is a $C \times C$ matrix $M$, where $M_{i,j}$ denotes the number of instances of class $i$ predicted as class $j$. It provides a detailed visualization of misclassifications:
\[
M_{i,j} = \#\{\text{samples where } y = i \text{ and } \hat{y} = j\}
\]
The full confusion matrix for all 79 classes is provided in Appendix B.
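As a worked example of the definitions above, the following sketch computes all five metrics from a confusion matrix. The function names and the convention of scoring undefined ratios (for example, a class with no predictions) as zero are assumptions made for this illustration and may differ from the evaluation scripts used in the study.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """M[i, j] counts samples of true class i predicted as class j."""
    m = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(m, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return m


def classification_metrics(y_true, y_pred, num_classes):
    m = confusion_matrix(y_true, y_pred, num_classes)
    tp = np.diag(m).astype(float)
    support = m.sum(axis=1).astype(float)    # TP + FN per class (n_c)
    predicted = m.sum(axis=0).astype(float)  # TP + FP per class
    # Assumed convention: undefined ratios are scored as 0.
    recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return {
        "accuracy": tp.sum() / m.sum(),
        "balanced_accuracy": recall.mean(),
        "macro_f1": f1.mean(),
        "weighted_f1": (support / support.sum() * f1).sum(),
        "confusion_matrix": m,
    }


# Toy three-class example with one misclassified sample.
metrics = classification_metrics([0, 0, 0, 1, 1, 2], [0, 0, 1, 1, 1, 2], num_classes=3)
for name in ("accuracy", "balanced_accuracy", "macro_f1", "weighted_f1"):
    print(name, round(float(metrics[name]), 4))
```

In this toy case the per-class F1 scores are 0.8, 0.8, and 1.0, so Macro F1 (0.8667) exceeds Weighted F1 (0.8333) because the smallest class is also the best-classified one.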

4.3. Experimental Setup

We trained and evaluated all three models (MLP, LSTM, and Transformer) under a consistent experimental framework to ensure a fair comparison. The architectural details of each model are described in Section 3.3. For the sequential models (LSTM and Transformer), we conducted an extensive ablation study on temporal configurations by varying the sequence span ($L_{\text{span}}$), step size ($S$), and the number of frame samples ($L_{\text{samples}}$). Each sequence was assigned a single ground-truth label based on the majority vote of its constituent frames.
All models were trained using the Adam optimizer for up to 200 epochs, employing an early stopping mechanism with a patience of 10 epochs based on the validation loss. The key training hyperparameters, such as learning rate, batch size, and dropout, are summarized for each model in Table 2. All experiments were implemented in PyTorch (version 2.7.1+cu126) and executed on a high-performance computing cluster equipped with an NVIDIA Tesla V100 GPU (Santa Clara, CA, USA).
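The stopping rule described above (at most 200 epochs, patience of 10 on the validation loss) can be sketched in plain Python. The class and function names are hypothetical, the `min_delta` threshold is an added illustrative parameter, and the validation losses are simulated rather than taken from the study.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # hypothetical: minimum change counted as improvement
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def run_training(val_losses, max_epochs=200, patience=10):
    """Replay a list of per-epoch validation losses through the stopping rule."""
    stopper = EarlyStopping(patience=patience)
    for epoch, val_loss in enumerate(val_losses[:max_epochs], start=1):
        if stopper.step(val_loss):
            return epoch, stopper.best_loss
    return min(len(val_losses), max_epochs), stopper.best_loss


# Simulated run: the loss improves for 5 epochs, then plateaus.
simulated_losses = [1.0, 0.8, 0.6, 0.5, 0.45] + [0.46] * 30
stopped_epoch, best_loss = run_training(simulated_losses)
print(stopped_epoch, best_loss)  # 15 0.45
```

In a real training loop, the model weights from the best epoch would typically be checkpointed and restored after stopping.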

5. Results

This section presents a comprehensive evaluation of our proposed framework, comparing the performance of the MLP, LSTM, and Transformer models. We analyze the results from four perspectives: the impact of input feature modalities, the effect of temporal configurations on the Transformer model, a direct comparison of model performance versus computational efficiency, and an in-depth analysis of per-class performance.

5.1. Input Modality Ablation Study

To understand the contribution of each visual cue, we first evaluated all three models with various combinations of 2D pose, 2D gaze, and facial movement (FM) features, using a fixed sequence length of 30 frames. As shown in Table 3, several key trends emerge. First, 2D pose is consistently the dominant modality, providing a strong performance baseline. Second, both the LSTM and the Transformer significantly outperform the static MLP across all feature combinations, underscoring the importance of temporal modeling. Third, the Transformer generally achieves the highest performance, particularly when modalities are fused. The best overall result is achieved when all three modalities (‘Pose + Gaze + FM’) are used with the Transformer, reaching a Macro F1 of 0.8970.

5.2. Temporal Configuration Analysis for the Transformer

Given the strong performance of the Transformer, we conducted an extensive ablation study to analyze its sensitivity to different temporal configurations, with detailed results presented in Table 4. The results indicate that a smaller, denser step size ($S$) consistently yields better performance. For instance, with a sequence span ($L_{\text{span}}$) of 50, a step size of 10 achieves a Macro F1 of 0.9570, whereas a step size of 50 results in a score of only 0.3012. The number of frame samples ($L_{\text{samples}}$) also plays a crucial role. The highest performance was achieved with a 50-frame span and a step size of 10. Specifically, the configuration with $L_{\text{samples}} = 50$ yielded the best Macro F1 score of 0.9570, while the configuration with $L_{\text{samples}} = 25$ achieved the highest Balanced Accuracy of 0.9567.
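To make the effect of the step size concrete, the sketch below counts how many sequences a sliding window produces and how much consecutive windows overlap. The 3000-frame recording (5 minutes at 10 frames per second) and the function names are hypothetical. The example shows only how $S$ changes the number and overlap of sequences; it does not by itself explain the performance gap reported in Table 4.

```python
def num_windows(num_frames, span, step):
    """Number of full windows of `span` frames obtained with stride `step`."""
    if num_frames < span:
        return 0
    return (num_frames - span) // step + 1


def overlap_fraction(span, step):
    """Fraction of frames shared by two consecutive windows."""
    return max(0.0, 1.0 - step / span)


recording_frames = 3000  # hypothetical: 5 minutes at 10 frames per second
for step in (10, 50):
    count = num_windows(recording_frames, span=50, step=step)
    print(step, count, overlap_fraction(50, step))
# step 10 -> 296 windows with 80% overlap; step 50 -> 60 windows with no overlap
```

With a span of 50, moving from a step of 50 to a step of 10 yields roughly five times as many sequences from the same recording.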

5.3. Performance vs. Efficiency Comparison

A critical aspect for practical deployment is the trade-off between predictive performance and computational cost. We summarize this comparison in Table 5. As expected, the MLP is the most lightweight model but provides the lowest performance. While the LSTM model shows the highest peak performance (Macro F1 of 0.9931), this result stems from our initial experimental design using a frame-level data split. As detailed in our Discussion (Section 6), this approach can lead to performance inflation. In contrast, the Transformer model offers a compelling balance. Its best-performing configuration ($L_{\text{span}} = 50$, $S = 10$, $L_{\text{samples}} = 50$) achieves a high and, crucially, more robust Macro F1 score of 0.9570. This positions the Transformer as the superior architecture, providing state-of-the-art performance within our revised framework. Furthermore, its most efficient configuration ($L_{\text{span}} = 10$, $S = 5$, $L_{\text{samples}} = 5$) delivers a strong Macro F1 of 0.9395 with only 0.02 GFLOPs, highlighting its suitability for resource-constrained environments.

5.4. Per-Class Performance and Error Analysis

To gain deeper insights into the Transformer model’s behavior, we analyzed its per-class performance using its best-performing configuration, as detailed in Table 6. A notable finding is the model’s exceptionally high performance even on classes that were expected to be challenging. The Top-5 performing classes are distinct actions such as ‘Tilting sun visor’ or ‘Using laptop on armrest’. More impressively, the Bottom-5 classes, which often involve subtle motions or have low sample counts (e.g., ‘Adjusting pelvis in seat’), still achieve F1 scores near or above 0.90. This demonstrates the Transformer’s strong ability to capture discriminative features even from limited data.
This high overall performance is also reflected in the confusion matrices shown in Figure 3. For the 20 most frequent classes, the matrix shows a strong diagonal, indicating few misclassifications. The confusion that does occur is concentrated between similar fine-grained tasks, such as different types of phone use or subtle posture changes. While the bottom-20 classes show slightly more confusion, the overall performance remains robust, consistent with the findings in our per-class analysis.

6. Discussion

Our experimental results provide several key insights into multimodal temporal modeling for occupant behavior recognition. This section discusses the implications of our findings, focusing on the comparison between static, recurrent, and attention-based models, the role of multimodal fusion, the trade-off between performance and efficiency, and the surprising robustness of our best model.
First, our comparative analysis confirms the critical importance of temporal modeling. As shown in the input modality ablation study (Table 3), both the LSTM and Transformer architectures substantially outperform the static MLP across all feature combinations. This demonstrates that capturing the sequential nature of actions is fundamental to achieving high accuracy. Between the two temporal models, the Transformer consistently shows a competitive edge, especially with fused modalities like ‘Pose + FM’, suggesting that its self-attention mechanism is highly effective for this task.
Second, the analysis of temporal configurations for the Transformer (Table 4) reveals a clear pattern: denser, more overlapping sequences created with smaller step sizes yield superior results. However, this increased performance comes at a higher computational cost. The trade-off between performance and efficiency, summarized in Table 5, is central to our findings. The MLP is the most efficient but least accurate model. In contrast, the Transformer presents a compelling balance; it achieves high performance (Macro F1 of 0.9561) while being significantly more resource-efficient than the LSTM in terms of parameters and FLOPs. This positions the Transformer as a strong candidate for practical, resource-constrained in-vehicle systems.
Third, the per-class performance analysis of our best Transformer model (Table 6) offers further insights into its robustness. A key finding is the model’s exceptionally high F1 scores even for its “Bottom-5” classes, which remain near or above 0.90. These classes, such as ‘Adjusting pelvis in seat’, are characterized by low support counts and subtle motions. This suggests that the Transformer’s self-attention mechanism is highly effective at learning discriminative patterns even from limited examples. This is visually corroborated by the confusion matrices in Figure 3, which display a strong diagonal dominance.
Finally, our study has several key limitations. The frame-level splitting of the dataset introduces potential data leakage from two perspectives. First, it does not guarantee that the training, validation, and test sets are subject-disjoint, which presents a risk of the model learning subject-specific mannerisms (identity leakage). Second, it preserves temporal continuity across the split boundaries, meaning a sequence at the beginning of the validation set can be a direct continuation of a sequence from the training set. Both factors can inflate performance and limit conclusions about generalization. Furthermore, the OBC dataset was gathered in a stationary vehicle with scripted tasks, not in an actual on-road driving context. Generalizing these findings to unconstrained scenarios requires further validation. The model was also not evaluated under challenging conditions common in on-road driving, such as poor illumination, partial occlusions, or unscripted, extreme postures. Additionally, while our analysis provides efficiency metrics on a high-performance GPU (Table 5), we did not benchmark the model on embedded hardware, such as the NVIDIA Jetson series, which is more typical for in-vehicle applications. Although our lightweight design with frozen feature extractors is a strong candidate for such resource-constrained environments, formal validation of its real-time performance on such hardware remains a critical task for future work. Lastly, this study did not include a fairness analysis to assess potential performance biases related to demographic factors such as gender or age. Future work should investigate the model’s performance across these groups to ensure the system is equitable and reliable for all occupants. We contend that these risks are partially mitigated by our feature-based approach. Nonetheless, future research must validate this framework using a strict subject-disjoint split on datasets captured in more naturalistic on-road conditions to confirm its real-world applicability.
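A subject-disjoint split of the kind called for above can be sketched as follows. This is an illustration only, not part of the reported experiments: the function name, the application of the 80/10/10 ratios to subjects rather than frames, and the toy subject IDs are all assumptions.

```python
import random


def subject_disjoint_split(subject_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign whole subjects to train/val/test so no subject spans two subsets.

    `subject_ids` holds one subject ID per frame (or per sequence); the result
    maps each subset name to the indices that belong to it.
    """
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    n_train = round(ratios[0] * len(subjects))
    n_val = round(ratios[1] * len(subjects))
    groups = {
        "train": set(subjects[:n_train]),
        "val": set(subjects[n_train:n_train + n_val]),
        "test": set(subjects[n_train + n_val:]),
    }
    return {
        name: [i for i, s in enumerate(subject_ids) if s in members]
        for name, members in groups.items()
    }


# Toy data: 42 subjects with 5 frames each (the real recordings are far longer).
frame_subjects = [subject for subject in range(42) for _ in range(5)]
split = subject_disjoint_split(frame_subjects)
print({name: len(indices) for name, indices in split.items()})
```

Splitting by subject also removes the temporal-continuity leak, because every recording of a given participant lands entirely in one subset.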

7. Conclusions

In this paper, we presented and evaluated a lightweight, modular framework for occupant behavior recognition using multimodal visual features. Our approach effectively fused 2D pose, 2D gaze, and facial movement features and utilized three distinct classifier architectures—a static MLP, a recurrent LSTM, and an attention-based Transformer—to model the temporal dynamics of 79 distinct behaviors from the OBC dataset.
Our comprehensive experiments demonstrated several key findings: (1) temporal models (LSTM and Transformer) significantly outperform static, frame-based MLP classification, confirming the importance of sequential context; (2) fusing all three modalities consistently yields the best performance for the temporal models, validating the benefits of a multimodal approach; and (3) the Transformer model achieved the best overall performance, reaching a Macro F1 score of 0.9570 with a configuration of a 50-frame span, a step size of 10, and 50 sampled frames. Furthermore, our analysis revealed that the Transformer offers a superior balance between high accuracy and computational efficiency, positioning it as a strong candidate for practical, resource-constrained systems.
Overall, this work underscores the critical importance of integrating temporal context and complementary multimodal features for robust occupant behavior recognition. The findings provide a strong foundation and practical guidelines for the development of next-generation, computationally efficient in-vehicle occupant monitoring systems, with the Transformer architecture emerging as a particularly promising solution.

Author Contributions

Conceptualization, J.K. and B.-K.D.P.; methodology, J.K.; software, J.K.; validation, J.K.; formal analysis, J.K.; investigation, J.K.; data curation, J.K.; writing—original draft preparation, J.K.; writing—review and editing, J.K. and B.-K.D.P.; visualization, J.K.; supervision, B.-K.D.P.; project administration, B.-K.D.P.; funding acquisition, B.-K.D.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of the University of Michigan (protocol code HUM00162942).

Informed Consent Statement

Informed consent was obtained from all subjects involved in this study.

Data Availability Statement

The raw, participant-level data that underpin the results of this study are not publicly available because they contain personally identifiable information (PII) and their release would risk participant privacy. Implementation code, trained model weights (where applicable), and supplementary, non-identifiable materials (example inputs, synthetic samples, and evaluation scripts) are publicly available at the authors’ GitHub repository: https://github.com/wltnkim/UMTRI_Occupant_Behavior, accessed on 9 October 2025. Requests for access to de-identified or restricted data may be considered on a case-by-case basis and will require approval from the Institutional Review Board (IRB) and a signed data use agreement.

Acknowledgments

The authors would like to express their sincere gratitude to the Biosciences Group at the University of Michigan Transportation Research Institute (UMTRI) for their support. Special thanks are extended to Byoung-Keon (Daniel) Park for his invaluable guidance and contributions to this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. List of Behavior Classes

The complete list of the 79 classes from the Occupant Behavior Classification (OBC) dataset is detailed in Table A1 and Table A2. Each class represents a unique occupant behavior performed under a specific, simulated driving condition. A more comprehensive description of the dataset’s composition and data collection protocol is provided in Section 4.1.
Table A1. A detailed list of the 79 behavior classes in the OBC dataset (Part 1 of 2).

| ID | Driving Mode | Behavior Category | Detailed Description |
| --- | --- | --- | --- |
| 0 | Fully Automated | Posture Change | Change head positions: Use the headrest or use a hand to support the head. |
| 1 | Manual | Head Range of Motion | Rotation (Left/Right): Turn head as far as possible to the left/right while still being able to drive. |
| 2 | Fully Automated | Reaching | Reaching to floor: Reach to the floor by the right and left foot to pick something up. |
| 3 | Manual | Non-Driving Task | Take a drink from the water bottle and then return it to the cupholder. |
| 4 | Manual | Non-Driving Task | Talk over right shoulder to a passenger in the rear seat. |
| 5 | Manual | Reaching | Reach to the glove box (or as far right as possible on the dash). |
| 6 | Fully Automated | Phone Use | Select other body postures for using a phone (seat adjustment allowed). |
| 7 | Manual | Head Range of Motion | Extension: Tip head backward, rotating face to the ceiling as far as possible while driving. |
| 8 | Fully Automated | Laptop Use | Reposition legs to hold the laptop differently. |
| 9 | Manual | Reaching | Reach to the center of the passenger seat cushion. |
| 10 | Fully Automated | Reaching | Reach to the floor directly in front of the driver seat. |
| 11 | Semi-Automated | Laptop Use | In auto-mode, select another laptop location and posture and then type and read for 10 seconds. |
| 12 | Manual | Reaching | Reach to the floor directly in front of the passenger seat. |
| 13 | Fully Automated | Laptop Use | Place the laptop on the center armrest to use it. |
| 14 | Fully Automated | Non-Driving Task | Talk over left shoulder to a passenger in the rear seat. |
| 15 | Fully Automated | Vehicle Interaction | Remove and don the seat belt. |
| 16 | Manual | Vehicle Interaction | Open and then close the sunglasses compartment above the center mirror. |
| 17 | Semi-Automated | Sleep/Resting | In auto-mode, use armrest, door, or seat contours to find other comfortable resting postures. |
| 18 | Manual | Vehicle Interaction | Change the vent settings. |
| 19 | Semi-Automated | Laptop Use | In auto-mode, use armrest or door to find other postures for using the laptop. |
| 20 | Fully Automated | Vehicle Interaction | Change the vent settings. |
| 21 | Fully Automated | Reaching | Reach to the center of the driver seat cushion. |
| 22 | Fully Automated | Non-Driving Task | Use phone to make a call (using right hand, left hand, and speaker). |
| 23 | Fully Automated | Sleep/Resting | Select other body postures for sleeping/resting (seat adjustment allowed). |
| 24 | Manual | Non-Driving Task | Use phone to make a call (using right hand, left hand, and speaker). |
| 25 | Semi-Automated | Driving Task | Transition from manual to auto-mode, check mirrors, and then perform a takeover request. |
| 26 | Manual | Reaching | Reach behind the passenger seat (as a specific reaching task). |
| 27 | Manual | Posture Change | Change head positions: Use the headrest or use a hand to support the head. |
| 28 | Manual | Vehicle Interaction | Pretend to press one of the seat position memory buttons on the door by the left knee. |
| 29 | Manual | Driving Task | Check right/left blind spot and pretend to change lanes. |
| 30 | Fully Automated | Posture Change | Use armrests (center or door) to adjust position. |
| 31 | Manual | Non-Driving Task | Use the mirror on the back of the visor to look at your face (as a primary task). |
| 32 | Manual | Reaching | Reach behind the passenger seat (during a general posture change). |
| 33 | Manual | Head Range of Motion | Flexion: Tip chin to chest as far as possible while driving. |
| 34 | Manual | Non-Driving Task | Use the mirror on the back of the visor to look at your face (during sun visor adjustment). |
| 35 | Semi-Automated | Riding Task | In auto-mode, select another comfortable body posture (seat adjustment allowed). |
| 36 | Manual | Torso Range of Motion | Slouch: Push body back and slide hips forward as far as possible while driving. |
| 37 | Fully Automated | Sleep/Resting | Recline seat more, lean on vehicle side, or rest head on hand. |
| 38 | Fully Automated | Reaching | Reach to the area of the steering wheel. |
| 39 | Fully Automated | Vehicle Interaction | Pretend to press one of the seat position memory buttons on the door near the right knee. |
Table A2. A detailed list of the 79 behavior classes in the OBC dataset (Part 2 of 2).

| ID | Driving Mode | Behavior Category | Detailed Description |
| --- | --- | --- | --- |
| 40 | Fully Automated | Posture Change | Adjust pelvis in the seat. |
| 41 | Manual | Torso Range of Motion | Rotation (Left/Right): Twist body as much as possible to the left/right while driving. |
| 42 | Manual | Driving Task | Try out other hand positions on the steering wheel (right hand only, left hand only). |
| 43 | Fully Automated | Phone Use | Move the phone to other positions (lower, higher, to the side) and continue to use it. |
| 44 | Manual | Driving Task | Check mirrors (center, left, right), moving head to see more of the field of view. |
| 45 | Fully Automated | Non-Driving Task | Take a drink from the water bottle and then return it to the cupholder. |
| 46 | Manual | Torso Range of Motion | Flexion: Tilt body forward toward the steering wheel as far as possible. |
| 47 | Fully Automated | Non-Driving Task | Use the mirror on the back of the visor to look at your face. |
| 48 | Semi-Automated | Sleep/Resting | In auto-mode, perform takeover, return to resting, and then select another resting posture. |
| 49 | Semi-Automated | Sleep/Resting | In auto-mode, receive a “takeover in 2 miles” warning, adjust seat to prepare, and then takeover. |
| 50 | Semi-Automated | Phone Use | While using phone in auto-mode, perform a takeover and then return to preferred position. |
| 51 | Fully Automated | Non-Driving Task | Use phone to text or to look at a map. |
| 52 | Manual | Posture Change | Select other body postures for driving (seat adjustment allowed). |
| 53 | Fully Automated | Vehicle Interaction | Change the fan speed or temperature using the controls on the dash. |
| 54 | Manual | Posture Change | Use armrests (center or door) to adjust position. |
| 55 | Semi-Automated | Phone Use | In auto-mode, find a new comfortable position while using phone (seat adjustment allowed). |
| 56 | Semi-Automated | Riding Task | In auto-mode, check mirrors and then perform a takeover request. |
| 57 | Manual | Head Range of Motion | Lateral Bend (Left/Right): Tilt head to the left/right, ear toward shoulder. |
| 58 | Manual | Reaching | Reach to the floor by the right and left foot to pick something up. |
| 59 | Manual | Vehicle Interaction | Change the fan speed or temperature using the controls on the dash. |
| 60 | Fully Automated | Laptop Use | Move the laptop to other resting positions and continue typing/reading. |
| 61 | Fully Automated | Phone Use | Try holding the phone at different locations such as lower, higher, or to the side. |
| 62 | Manual | Driving Task | Simulate stopping the car, shifting to park, and reversing for 3 s. |
| 63 | Fully Automated | Laptop Use | Use a laptop browser to search a topic, including typing and reading. |
| 64 | Manual | Torso Range of Motion | Lateral Bend (Left/Right): Tilt body as far as possible to the left/right while driving. |
| 65 | Fully Automated | Standard Posture | Standard posture: Seated full rear, feet forward on heels, hands on lap, looking forward. |
| 66 | Fully Automated | Laptop Use | Select other body postures for using a laptop (seat adjustment allowed). |
| 67 | Fully Automated | Vehicle Interaction | Tilt sun visor down, to the side, and back up. |
| 68 | Manual | Posture Change | Adjust pelvis in the seat to be more relaxed (slouching) or more alert. |
| 69 | Fully Automated | Reaching | Reach behind the driver seat. |
| 70 | Manual | Vehicle Interaction | Remove and don the seat belt. |
| 71 | Fully Automated | Vehicle Interaction | Open and then close the sunglasses compartment by the center mirror. |
| 72 | Fully Automated | Posture Change | Select other body postures for riding as a passenger (seat adjustment allowed). |
| 73 | Fully Automated | Phone Use | Use phone for various tasks: look up number, call, text, use browser/maps, view video. |
| 74 | Fully Automated | Sleep/Resting | Change head positions: Use the headrest or use a hand to hold the head up. |
| 75 | Fully Automated | Sleep/Resting | Adjust pelvis in the seat. |
| 76 | Fully Automated | Phone Use | Rest elbow on the armrest while holding and using the phone. |
| 77 | Manual | Non-Driving Task | Use phone to text or look at a map while considering how to hold it for navigation. |
| 78 | Manual | Vehicle Interaction | Tilt sun visor down, to the side, and back up. |

Appendix B. Full Confusion Matrix

Figure A1 presents a heatmap visualization of the 79 × 79 confusion matrix, illustrating the model’s classification performance on the test set. The vertical axis represents the true class labels, while the horizontal axis represents the labels predicted by the model. The color intensity of each cell corresponds to the number of instances, with brighter colors along the main diagonal indicating a high number of correct predictions. Off-diagonal bright spots highlight specific classes that the model tends to confuse. The complete list of class IDs for the axes is detailed in Appendix A.
Figure A1. Heatmap visualization of the 79 × 79 confusion matrix. The color intensity corresponds to the number of instances, showing where the model most often confuses classes.
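To make the reading of the confusion matrix concrete, the following minimal sketch shows how per-class F1, Macro F1, and the most frequently confused class pair can be derived from a matrix whose rows are true labels and whose columns are predicted labels, as in Figure A1. The 3 × 3 matrix `toy_cm` is synthetic and the function names are hypothetical; this is not the authors' evaluation script.

```python
def per_class_f1(cm):
    """F1 per class from a square confusion matrix (rows = true, columns = predicted)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                         # true class k, predicted as another class
        fp = sum(cm[i][k] for i in range(n)) - tp    # other classes predicted as k
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_f1(cm):
    """Unweighted mean of per-class F1, so rare classes count as much as frequent ones."""
    scores = per_class_f1(cm)
    return sum(scores) / len(scores)


def most_confused(cm):
    """Return (true_class, predicted_class, count) for the largest off-diagonal cell."""
    best = (None, None, -1)
    for i, row in enumerate(cm):
        for j, count in enumerate(row):
            if i != j and count > best[2]:
                best = (i, j, count)
    return best


# Synthetic 3-class example (not data from the paper).
toy_cm = [
    [5, 1, 0],
    [2, 3, 0],
    [0, 0, 4],
]

if __name__ == "__main__":
    print(per_class_f1(toy_cm))
    print(macro_f1(toy_cm))
    print(most_confused(toy_cm))
```

Because Macro F1 averages classes without weighting by support, a class with few test samples influences the score as strongly as a class with many.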

References

  1. Maji, D.; Nagori, S.; Mathew, M.; Poddar, D. YOLO-Pose: Enhancing YOLO for Multi-Person Pose Estimation. arXiv 2022, arXiv:2204.06806. [Google Scholar]
  2. Wang, F.; Wang, G.; Lu, B. YOLOv8-PoseBoost: Advancements in Multimodal Robot Pose Keypoint Detection. Electronics 2024, 13, 1046. [Google Scholar] [CrossRef]
  3. Qin, J.; Zhang, X.; Sugano, Y. UniGaze: Towards Universal Gaze Estimation via Large-scale Pre-Training. arXiv 2025, arXiv:2502.02307. [Google Scholar]
  4. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of theIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 16000–16009. [Google Scholar]
  5. Ning, M.; Salah, A.A.; Ertugrul, I.O. Representation Learning and Identity Adversarial Training for Facial Behavior Understanding. In Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition (FG), Clearwater, FL, USA, 26–30 May 2025. [Google Scholar]
  6. Zhang, X.; Yin, L.; Cohn, J.F.; Canavan, S.; Reale, M.; Horowitz, A.; Liu, P.; Girard, J.M. BP4D-Spontaneous: A High-Resolution Spontaneous 3D Dynamic Facial Expression Database. Image Vis. Comput. 2015, 32, 692–706. [Google Scholar] [CrossRef]
  7. Zhang, X.; Yin, L.; Cohn, J.F.; Canavan, S.; Reale, M.; Horowitz, A.; Liu, P.; Girard, J.M. BP4D+: A Spontaneous 3D Dynamic Facial Expression Database with Depth Data. Image Vis. Comput. 2016, 55, 169–179. [Google Scholar]
  8. Mavadati, S.M.; Mahoor, M.H.; Bartlett, K.; Trinh, P.; Cohn, J.F. DISFA: A Spontaneous Facial Action Intensity Database. IEEE Trans. Affect. Comput. 2013, 4, 151–160. [Google Scholar] [CrossRef]
  9. Xing, Y.; Lv, C.; Wang, H.; Cao, D.; Velenis, E.; Wang, F. Driver Activity Recognition for Intelligent Vehicles: A Deep Learning Approach. IEEE Trans. Veh. Technol. 2019, 68, 5409–5421. [Google Scholar] [CrossRef]
  10. Valeriano, L.C.; Napoletano, P.; Schettini, R. Recognition of Driver Distractions Using Deep Learning. In Proceedings of the 2018 IEEE 8th International Conference on Consumer Electronics—Berlin (ICCE-Berlin), Berlin, Germany, 2–5 September 2018; pp. 1–2. [Google Scholar]
  11. Costa, M.; Oliveira, D.; Pinto, S.; Tavares, A. Detecting Driver’s Fatigue, Distraction and Activity Using a Non-Intrusive Ai-Based Monitoring System. J. Artif. Intell. Soft Comput. Res. 2019, 9, 305–317. [Google Scholar] [CrossRef]
  12. Kumar, A.; Patra, R. Driver Drowsiness Monitoring System Using Visual Behaviour and Machine Learning. In Proceedings of the 2018 IEEE Symposium on Computer Applications & Industrial Electronics (ISCAIE), Penang, Malaysia, 28–29 April 2018; pp. 119–124. [Google Scholar]
  13. Ortega, J.; Kose, N.; Cañas, P.; Chao, M.A.; Unnervik, A.; Nieto, M.; Otaegui, O.; Salgado, L. DMD: A Large-Scale Multi-Modal Driver Monitoring Dataset for Attention and Alertness Analysis. In Proceedings of the ECCV Workshops, Glasgow, UK, 23–28 August 2020; pp. 385–403. [Google Scholar]
  14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  15. Jocher, G.; Chaurasia, A.; Qiu, J. YOLOv8 by Ultralytics. GitHub Repository. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 9 October 2025).
Figure 1. Overview of the proposed occupant behavior recognition pipeline, now including the Transformer model as a classifier.
Figure 2. Illustration of temporal sequence sampling with overlapping windows.
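The overlapping-window sampling illustrated in Figure 2 can be sketched as follows. The sketch assumes that each window covers `span` consecutive frames, that consecutive windows start `step` frames apart, and that `num_samples` frames are picked at evenly spaced positions inside each window. The even spacing and all function names are assumptions made for illustration, not details confirmed by the paper.

```python
def sliding_windows(num_frames, span, step):
    """Return (start, end) index pairs, end exclusive, for every full window."""
    return [(s, s + span) for s in range(0, num_frames - span + 1, step)]


def sample_indices(start, span, num_samples):
    """Pick num_samples evenly spaced frame indices inside one window (assumed scheme)."""
    if num_samples >= span:
        return list(range(start, start + span))
    stride = span / num_samples
    return [start + int(i * stride) for i in range(num_samples)]


if __name__ == "__main__":
    # A hypothetical 100-frame clip with a 50-frame span.
    overlapping = sliding_windows(100, span=50, step=10)
    disjoint = sliding_windows(100, span=50, step=50)
    print(len(overlapping), len(disjoint))
    print(sample_indices(0, span=50, num_samples=25)[:5])
```

With the same clip and span, a step of 10 produces three times as many windows as a step equal to the span (6 versus 2 here), which is one plausible reason why configurations with small steps have more training sequences available.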
Figure 3. Confusion matrices for the top-20 and bottom-20 classes, generated from the best-performing Transformer model.
Table 1. Detailed architectures of the implemented classifier models. The input for sequential models is a sequence of fused feature vectors, while the MLP uses a flattened version of this sequence.

| Parameter | MLP | LSTM | Transformer Encoder |
| --- | --- | --- | --- |
| Input Dimension | $(L_{samples} \times 48)$ | $L_{samples} \times 48$ | $L_{samples} \times 48$ |
| Layer Configuration | Input → 256 → 128 → 64 → 79 | - | - |
| Number of Layers | 4 Fully-Connected | 3 Layers | 4 Encoder Layers |
| Hidden Dimension | - | 256 | 256 |
| Number of Heads | - | - | 8 |
| Activation Function | ReLU | Tanh | ReLU |
| Normalization | BatchNorm1d | LayerNorm | LayerNorm |
| Output Dimension | 79 | 79 | 79 |
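As a quick sanity check on the MLP row of Table 1, the sketch below counts trainable parameters for the layer chain Input → 256 → 128 → 64 → 79. It assumes a frame-level input of 48 features (that is, $L_{samples} = 1$), bias terms in every fully connected layer, and affine BatchNorm (one scale and one shift per unit) on the three hidden layers only. These are assumptions for illustration, and the helper names are hypothetical.

```python
def linear_params(n_in, n_out, bias=True):
    """Weights plus optional biases of one fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


def mlp_params(layer_sizes, batchnorm=True):
    """Total trainable parameters of an MLP described by its layer widths."""
    total = sum(
        linear_params(n_in, n_out)
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )
    if batchnorm:
        # Assumed: affine BatchNorm on hidden layers only (scale + shift per unit).
        total += sum(2 * width for width in layer_sizes[1:-1])
    return total


# Assumed frame-level input (48 fused features) and the 79 output classes.
MLP_SIZES = [48, 256, 128, 64, 79]

if __name__ == "__main__":
    print(mlp_params(MLP_SIZES, batchnorm=False))  # 58831
    print(mlp_params(MLP_SIZES))                   # 59727
```

Under these assumptions the count is roughly 0.06 million parameters, which agrees with the frame-level MLP entry reported in Table 5.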
Table 2. Key hyperparameters used for training the MLP, LSTM, and Transformer models.

| Hyperparameter | MLP | LSTM | Transformer |
| --- | --- | --- | --- |
| Learning Rate | $1 \times 10^{-3}$ | $1 \times 10^{-3}$ | $1 \times 10^{-4}$ |
| Optimizer | Adam | Adam | Adam |
| Batch Size | 256 | 256 | 128 |
| Dropout | 0.0 | 0.0 | 0.2 |
| Epochs | 200 | 200 | 200 |
| Early Stopping Patience | 10 | 10 | 10 |
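The interaction between the 200-epoch cap and the early-stopping patience of 10 in Table 2 can be sketched as below. The sketch assumes that the monitored quantity is a validation loss that should decrease; the table does not state which metric was monitored, and the function name is hypothetical.

```python
import math


def run_with_early_stopping(val_losses, patience=10, max_epochs=200):
    """Return (last_epoch_run, best_epoch) for a sequence of per-epoch validation losses.

    Training stops once `patience` consecutive epochs fail to improve on the best loss,
    or when `max_epochs` is reached, whichever comes first. Epochs are 0-indexed.
    """
    best_loss = math.inf
    best_epoch = -1
    epochs_without_improvement = 0
    losses = val_losses[:max_epochs]
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(losses) - 1, best_epoch


if __name__ == "__main__":
    # Synthetic loss curve: improves for three epochs, then plateaus.
    plateau = [1.0, 0.8, 0.7] + [0.75] * 20
    print(run_with_early_stopping(plateau))
```

In the synthetic example the best epoch is 2 and training halts at epoch 12, after ten epochs without improvement, well before the 200-epoch cap.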
Table 3. Test results on the OBC dataset using different feature combinations and models. Input sequences use 30 frames with a step size of 10. The upward arrow (↑) next to each metric indicates that higher values are better. For each metric, the best result among models for a given feature set is highlighted in bold.

| Features | Model | Accuracy ↑ | Bal. Acc. ↑ | Weighted F1 ↑ | Macro F1 ↑ |
| --- | --- | --- | --- | --- | --- |
| Pose | MLP | 0.6880 | 0.6278 | 0.6836 | 0.6358 |
| | LSTM | 0.9027 | 0.8784 | 0.9025 | 0.8784 |
| | Transformer | 0.9084 | 0.8780 | 0.9084 | 0.8759 |
| Gaze | MLP | 0.0701 | 0.0320 | 0.0273 | 0.0150 |
| | LSTM | 0.1387 | 0.1106 | 0.1275 | 0.1145 |
| | Transformer | 0.1701 | 0.1331 | 0.1576 | 0.1338 |
| FM | MLP | 0.3285 | 0.2408 | 0.2984 | 0.2386 |
| | LSTM | 0.5928 | 0.5385 | 0.5889 | 0.5425 |
| | Transformer | 0.7635 | 0.7134 | 0.7623 | 0.7158 |
| Pose + Gaze | MLP | 0.6875 | 0.6308 | 0.6833 | 0.6375 |
| | LSTM | 0.9080 | 0.8853 | 0.9078 | 0.8875 |
| | Transformer | 0.9084 | 0.8796 | 0.9084 | 0.8785 |
| Pose + FM | MLP | 0.7202 | 0.6609 | 0.7167 | 0.6668 |
| | LSTM | 0.9072 | 0.8812 | 0.9069 | 0.8838 |
| | Transformer | 0.9349 | 0.9069 | 0.9348 | 0.9081 |
| Gaze + FM | MLP | 0.3442 | 0.2653 | 0.3205 | 0.2663 |
| | LSTM | 0.5708 | 0.5277 | 0.5678 | 0.5306 |
| | Transformer | 0.7094 | 0.6584 | 0.7083 | 0.6611 |
| Pose + Gaze + FM | MLP | 0.7235 | 0.6651 | 0.7201 | 0.6705 |
| | LSTM | 0.9185 | 0.8913 | 0.9183 | 0.8941 |
| | Transformer | 0.9248 | 0.8996 | 0.9249 | 0.8970 |
Table 4. Comprehensive performance analysis of the Transformer model across varying temporal configurations. The overall best-performing configuration is highlighted in bold.

| Configuration $(L_{span}, S, L_{samples})$ | Accuracy | Bal. Acc. | Weighted F1 | Macro F1 |
| --- | --- | --- | --- | --- |
| (10, 5, 5) | 0.9531 | 0.9390 | 0.9530 | 0.9395 |
| (10, 5, 10) | 0.9468 | 0.9317 | 0.9468 | 0.9297 |
| (10, 10, 5) | 0.8472 | 0.8061 | 0.8466 | 0.8090 |
| (10, 10, 10) | 0.8559 | 0.8130 | 0.8555 | 0.8144 |
| (20, 10, 10) | 0.8753 | 0.8433 | 0.8752 | 0.8439 |
| (20, 10, 20) | 0.9148 | 0.8863 | 0.9147 | 0.8854 |
| (20, 20, 10) | 0.6881 | 0.6245 | 0.6870 | 0.6252 |
| (20, 20, 20) | 0.6429 | 0.5702 | 0.6404 | 0.5708 |
| (30, 10, 15) | 0.9306 | 0.9090 | 0.9307 | 0.9080 |
| (30, 10, 30) | 0.9305 | 0.9075 | 0.9305 | 0.9066 |
| (30, 15, 15) | 0.8489 | 0.8071 | 0.8488 | 0.8078 |
| (30, 15, 30) | 0.8109 | 0.7626 | 0.8104 | 0.7653 |
| (30, 30, 15) | 0.5802 | 0.5140 | 0.5763 | 0.5139 |
| (30, 30, 30) | 0.5217 | 0.4587 | 0.5164 | 0.4609 |
| (40, 10, 20) | 0.9523 | 0.9360 | 0.9524 | 0.9360 |
| (40, 10, 40) | 0.9438 | 0.9270 | 0.9438 | 0.9271 |
| (40, 20, 20) | 0.7485 | 0.6982 | 0.7483 | 0.6996 |
| (40, 20, 40) | 0.7340 | 0.6842 | 0.7336 | 0.6858 |
| (40, 40, 20) | 0.4688 | 0.4275 | 0.4607 | 0.4242 |
| (40, 40, 40) | 0.4104 | 0.3524 | 0.3955 | 0.3511 |
| (50, 10, 25) | 0.9676 | 0.9567 | 0.9676 | 0.9561 |
| (50, 10, 50) | 0.9675 | 0.9561 | 0.9675 | 0.9570 |
| (50, 25, 25) | 0.6441 | 0.6069 | 0.6410 | 0.6010 |
| (50, 25, 50) | 0.6671 | 0.6204 | 0.6645 | 0.6227 |
| (50, 50, 25) | 0.4178 | 0.3645 | 0.4060 | 0.3632 |
| (50, 50, 50) | 0.3516 | 0.3000 | 0.3364 | 0.3012 |
Table 5. Computational efficiency and performance comparison of the models. The Transformer is evaluated on its best-performing and most resource-efficient configurations. The best overall configuration balancing performance and efficiency is highlighted in bold.

| Model | Configuration | Macro F1 | Params (M) | GFLOPs | Time (ms) | F1/GFLOPs |
| --- | --- | --- | --- | --- | --- | --- |
| MLP | Frame-level | 0.6705 | 0.06 | <0.001 | 0.08 | - |
| LSTM (Low-Cost) | (10, 10, 5) | 0.6601 | 1.39 | 0.01 | 0.17 | 66.01 |
| LSTM (High-Perf.) | (40, 10, 40) | 0.9931 | 1.39 | 0.05 | 0.44 | 19.86 |
| Transformer (Efficient) | (10, 5, 5) | 0.9395 | 4.24 | 0.02 | 0.33 | 46.97 |
| Transformer (Best-Perf.) | (50, 10, 50) | 0.9570 | 4.24 | 0.21 | 0.34 | 4.55 |
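The F1/GFLOPs column of Table 5 appears to be the Macro F1 score divided by the GFLOPs of one forward pass. The short sketch below recomputes that ratio from the rounded table entries. The interpretation of the column and the variable names are assumptions, and differences in the last digit are expected because the published GFLOPs are rounded to two decimals.

```python
# (model, Macro F1, GFLOPs, reported F1/GFLOPs), copied from Table 5.
TABLE5_ROWS = [
    ("LSTM (Low-Cost)", 0.6601, 0.01, 66.01),
    ("LSTM (High-Perf.)", 0.9931, 0.05, 19.86),
    ("Transformer (Efficient)", 0.9395, 0.02, 46.97),
    ("Transformer (Best-Perf.)", 0.9570, 0.21, 4.55),
]


def f1_per_gflop(macro_f1, gflops):
    """Efficiency ratio: Macro F1 obtained per GFLOP of compute (assumed definition)."""
    return macro_f1 / gflops


if __name__ == "__main__":
    for name, f1, gflops, reported in TABLE5_ROWS:
        ratio = f1_per_gflop(f1, gflops)
        print(f"{name}: recomputed {ratio:.3f}, reported {reported}")
```

The ratio rewards small models heavily: a configuration with ten times fewer GFLOPs needs only one tenth of the F1 score to reach the same value, so it is best read alongside the absolute Macro F1 rather than on its own.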
Table 6. In-depth analysis of the Transformer model’s per-class performance, showing the Top-5 and Bottom-5 classes based on their F1 scores. Support indicates the number of test samples for each class. Full behavior descriptions are available in Appendix A.

| Group | Class ID | Behavior Description (Summarized) | F1 Score | Support |
| --- | --- | --- | --- | --- |
| Top-5 | 78 | Tilting sun visor | 1.0000 | 9 |
| | 13 | Using laptop on armrest | 0.9916 | 239 |
| | 11 | Repositioning with laptop | 0.9902 | 608 |
| | 17 | Finding new resting posture | 0.9901 | 661 |
| | 55 | Repositioning with phone | 0.9901 | 1056 |
| Bottom-5 | 70 | Removing/donning seat belt | 0.9077 | 137 |
| | 12 | Reaching to passenger floor | 0.9074 | 159 |
| | 20 | Adjusting vent settings | 0.9057 | 103 |
| | 40 | Adjusting pelvis in seat | 0.8966 | 90 |
| | 34 | Using visor mirror | 0.8963 | 123 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Kim, J.; Park, B.-K.D. Robust Occupant Behavior Recognition via Multimodal Sequence Modeling: A Comparative Study for In-Vehicle Monitoring Systems. Sensors 2025, 25, 6323. https://doi.org/10.3390/s25206323
