Article

Driver Intention Recognition for Mine Transport Vehicle Based on Cross-Modal Knowledge Distillation

by Yizhe Zhang, Yinan Guo *, Xiusong You, Lunfeng Guo, Bing Miao and Hao Li
School of Mechanical and Electrical Engineering, China University of Mining and Technology (Beijing), Beijing 100083, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(12), 6814; https://doi.org/10.3390/app15126814
Submission received: 21 May 2025 / Revised: 12 June 2025 / Accepted: 16 June 2025 / Published: 17 June 2025

Abstract:
Driver intention recognition is essential for optimizing driving decisions by dynamically adjusting speed and trajectory to enhance system performance. However, in the underground coal mine environment, traditional vision-based methods face significant limitations in accuracy and adaptability. To improve the accuracy of vision-based driver intention recognition, this study introduces a novel approach that leverages cross-modal knowledge distillation (CMKD) to integrate electroencephalography (EEG) signals with video data for identifying driver intentions in coal mining operations. By combining these modalities, the method capitalizes on their complementary strengths to achieve a more comprehensive understanding of driver intent. Experiments across various models evaluate the performance of the proposed CMKD method. The results reveal a substantial improvement in recognition accuracy over traditional machine vision-based approaches, with a maximum accuracy of 84.38%. This advancement enhances the reliability of driver intention detection and offers more robust support for decision making in automated mine transport systems.

1. Introduction

Driver intention recognition is essential for safe vehicle operation, especially in complex work environments. Accurately identifying the driver’s intentions helps make timely decisions, ensuring the vehicle operates safely and efficiently. Mine transport vehicles are indispensable for underground auxiliary transportation, playing a critical role in ensuring safety and efficiency [1]. In autonomous mine vehicle operation, accurately interpreting driver intentions is crucial for informed decision making. Drivers form intentions based on their perception of the environment and execute corresponding actions. Vehicle driving videos are key to identifying these intentions, as they capture road features, warning signs, and the behavior of other vehicles [2,3,4,5]. Machine vision-based methods for driving intention recognition are widely employed in autonomous driving systems. These approaches leverage computer vision techniques, particularly convolutional neural networks (CNNs) and other deep learning models, to analyze visual data from drivers and their environments. For instance, Xu et al. [6] proposed an end-to-end deep learning framework that predicted future vehicle actions using monocular camera images, while Ma et al. [7] introduced a weakly supervised adversarial learning approach (cGAN-LSTM) that integrated GPS-based route planners and image data for goal-oriented driving intention recognition without requiring precise localization. Frossard et al. [8] utilized forward video sequences to detect turn signals and hazard lights, integrating both spatial and temporal information for action prediction. Bonyani et al. [9] presented DIPNet, a novel neural network that analyzed interior, exterior, and combined video data to forecast driver intentions seconds in advance. Despite these advancements, the unique challenges of underground mining environments hinder the effectiveness of intention recognition systems relying solely on machine vision. Poor lighting, often limited to vehicle-provided illumination [10,11], results in low-quality video images that complicate visual analysis [12,13]. Moreover, machine vision systems often fail to infer intentions such as turning, especially when images lack turning features or dynamic context. This deficiency increases the likelihood of misinterpreting driving intentions.
Enhancing intention recognition accuracy in autonomous mine transport vehicles can be achieved by integrating physiological signals with vision-based methods. Recent advancements in artificial intelligence have significantly improved the use of electroencephalographic (EEG) signals to decode driver intentions [14,15,16]. EEG signals, which reflect brain activity, provide valuable insights into cognitive states and intentions [17,18], enabling enhanced decision-making and control systems for improved safety and accuracy. Compared with electromyogram (EMG) [19] and eye-tracking (ET) [20], EEG [21] has the advantage of directly capturing the driver’s internal mental and cognitive states, such as attention, fatigue, and stress, without being affected by physical factors like muscle movements or eye fatigue. For example, Sun et al. [22] proposed an expanded CNN model with a gating mechanism (GDCNN) to decode four distinct driving intentions from EEG signals. Ju et al. [23] analyzed neural patterns associated with emergency and soft braking, developing an EEG-based method to distinguish these intentions. Liang et al. [24] proposed an EEG-driven approach to differentiate emergency from normal braking. Chang et al. [25] explored EEG signals to decode steering intentions at intersections. This makes EEG particularly suitable for improving intention recognition in complex and dynamic environments such as autonomous driving in mines. When integrated with exterior video data, it can significantly improve the accuracy and depth of driving intention recognition, addressing the limitations of methods that rely solely on video.
Knowledge distillation [26] is a model compression technique that enables a smaller student model to learn from a complex teacher model by transferring soft labels. This method achieves near-teacher performance with reduced computational requirements and has shown success in single-modal applications. With the rise of multimodal learning, cross-modal knowledge distillation (CMKD) has gained prominence. In CMKD, teacher and student models utilize different data modalities, such as text and images, facilitating the transfer and integration of knowledge across modalities to improve performance and generalization. Chen et al. [27] enhanced 3D object detection using 3D point clouds to guide 2D image feature learning. Thoker et al. [28] transferred knowledge from RGB videos to 3D human pose sequences for action recognition. Li et al. [29] developed a separable multimodal distillation approach for emotional recognition, combining language, sound, and vision. Bano et al. [30] employed physiological signals, including EMG, ECG, and GSR, to guide video data for driver stress recognition. Zhang et al. [31] improved continuous emotional recognition in EEG data by incorporating knowledge from visual modalities. CMKD transfers advanced features and knowledge from the teacher model to the student model, thereby enhancing the student model’s ability to handle complex tasks during training.
This study investigates the combined use of EEG signals and video data to decode driver intentions in mine transport vehicles. Specifically, it applies a CMKD framework, where an EEG-driven intention recognition model serves as the teacher to guide a video-based intention recognition model. This process optimizes the video data processing and improves the accuracy of video-based intention recognition. Compared with existing EEG–video fusion methods, the CMKD approach effectively exploits the complementary nature of EEG and video data, further enhancing the accuracy and robustness of driving intention recognition. At the same time, the CMKD mechanism significantly improves the training efficiency of the video data processing model and strengthens its adaptability in complex dynamic environments, ultimately providing more reliable and accurate support for driving intention recognition.
The key contributions of this paper are as follows.
(1)
An experimental framework was designed to collect driving intention signals from mine transport vehicles. Driving videos were recorded from two coal mine transport vehicles, processed into video data for intention analysis, and used as stimuli to gather EEG signals related to driving intentions. Additionally, a simulation environment for mine transport vehicle operation was created, allowing multiple subjects to view the videos and generate corresponding EEG data.
(2)
A driving intention recognition model for mine transport vehicles was developed using CMKD. An EEG-based intention recognition model was first created to decode driver intentions. This model served as a teacher, providing intention information to train a video-based student model. Guided by the teacher model, the student model achieved effective intention recognition even in the absence of direct EEG data.
In the context of driving intention recognition, combining EEG and video modalities provides deeper insights into the driver’s cognitive state and the external environment, significantly improving the accuracy and reliability of intention recognition. By using the EEG model as the teacher and the video model as the student in a CMKD framework, this study not only addresses the limitations of traditional single-modal methods but also achieves seamless integration between the two modalities. This approach enhances driving intention recognition accuracy while reducing computational resource consumption, offering significant practical value.
The structure of the paper is as follows. Section 2 describes the experimental setup and methods. Section 3 presents the experimental results and analysis. Section 4 concludes with the findings and discussion of future directions.

2. Materials and Methods

2.1. Experimental Personnel and Stimulus Materials

Thirteen healthy, right-handed male volunteers aged 25–35 participated in the study. Since only male workers are allowed in underground coal mining, all subjects were male. Subjects possessed prior underground mining experience, valid driver’s licenses, basic knowledge of coal mining, and normal or corrected-to-normal vision. All were free of mental illness and ensured adequate rest before the experiment to reduce external influences on neurological responses. Driving videos of mine transport vehicles from two coal mines were recorded as stimulus materials with the recording setup depicted in Figure 1.
The driving videos included various tunnel types, such as auxiliary transport drifts, inclines, panel area auxiliary transport drifts, return airways, transport airways, auxiliary measure drifts, return air measure drifts, and transport accessory drifts. For maneuvers such as left turns, right turns, and avoidance, the start times were identified in the footage, and 4 s video segments were extracted, starting 4 s before each maneuver’s initiation. A marker was set 0.5 s before the video to signal the forthcoming action. Similarly, 4 s segments for normal driving were extracted, also marked with a 0.5 s indicator. The video marking format is illustrated in Figure 2. In total, 156 video segments were extracted: 39 for upcoming left turns, 38 for upcoming right turns, 39 for normal driving, and 40 for upcoming avoidance.

2.2. Experimental Platform and Procedure

The experimental platform included a 65-inch display screen (60 Hz refresh rate, 2560 × 1440 resolution), an EEG cap, and a driving simulator with a steering wheel and pedals. The EEG cap, a BitBrain device integrated with the ErgoLAB EEG data acquisition system (Kingfar International Inc., Beijing, China), adhered to the International 10–20 system for electrode placement. Sixteen channels (O1, O2, P7, P3, Pz, P4, P8, C3, Cz, C4, F7, F3, Fz, F4, F8, and Fpz) were employed, with signal acquisition at 256 Hz. The electrodes were moistened with pure water to keep impedance below 5 kΩ, and the reference electrode was positioned on the right earlobe. The driving simulator utilized a Logitech G29 force-feedback steering wheel and pedals set to maximum resistance to prevent unintentional movements. The experimental platform is depicted in Figure 3.
During the experiment, subjects were seated 80 cm from the display screen in a darkened room to eliminate external interference. A 5 min baseline recording was conducted beforehand, with subjects instructed to remain relaxed with their eyes closed. During the driving intention EEG data collection, the screen randomly displayed video stimuli representing different driving scenarios: upcoming left turn, right turn, obstacle avoidance, and normal driving. Subjects were required to focus on the screen while assuming a driving posture, that is, gripping the steering wheel with both hands and placing their feet on the pedals. They were instructed to generate specific driving intentions corresponding to the action cues in the videos, such as turning left, turning right, slowing down, or maintaining normal driving. Each trial consisted of a 2 s gray screen followed by a 4 s video stimulus, as shown in Figure 4. After every 10 trials, a pause screen was shown, allowing subjects to relax. They could resume the session by pressing a button. Each session comprised 156 trials, and subjects completed 2 sessions, resulting in 312 trials per subject. The entire experiment lasted approximately one hour per subject, yielding 4056 trials across all subjects.

2.3. Data Preprocessing

EEG data preprocessing was conducted using the EEGLAB toolbox in MATLAB 2022a. First, electrode positions were imported from location files. The EEG signals were then notch-filtered at 50 Hz to remove power line noise. A bandpass filter with a 1–49 Hz range was applied to retain relevant signals while removing high-frequency noise. After filtering, the EEG signals were re-referenced using a common average reference to ensure comparability and independence across electrodes. Independent Component Analysis (ICA) was used to separate the EEG signals into neural activity and artifact-related components. Artifacts were excluded, and the remaining components were used to reconstruct the clean EEG signals.
The clean EEG data corresponding to the 4 s intervals for each driving intention label were extracted and baseline corrected to eliminate drift. The first second of each trial was discarded to avoid interference from the gray screen preceding the video stimulus and potential confounding effects from the labeled video segments. The video data and the corresponding EEG data were aligned to the same time intervals, ensuring that both modalities covered identical time ranges and that subsequent analyses operated within a common time frame. Ultimately, 3 s of video data and the corresponding 3 s of EEG data were extracted for further analysis, with the 3 s EEG data containing 768 data points across 16 channels.
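The preprocessing and epoching steps above were carried out in EEGLAB under MATLAB. As an illustration only, the sketch below shows an equivalent pipeline in Python with MNE, where the file name, the event extraction, and the ICA components to exclude are placeholders rather than the authors' actual settings.

```python
import mne

# Load raw 16-channel EEG recorded at 256 Hz (file name is a placeholder).
raw = mne.io.read_raw_fif("subject01_raw.fif", preload=True)

# Remove 50 Hz power-line noise, then keep the 1-49 Hz band.
raw.notch_filter(freqs=50.0)
raw.filter(l_freq=1.0, h_freq=49.0)

# Re-reference to the common average.
raw.set_eeg_reference("average")

# ICA-based artifact removal; the components to exclude are chosen by inspection.
ica = mne.preprocessing.ICA(n_components=15, random_state=0)
ica.fit(raw)
ica.exclude = [0]                      # illustrative: mark ocular/muscle components here
raw_clean = ica.apply(raw.copy())

# Epoch the cleaned data: skip the first second after video onset and keep the
# remaining 3 s (3 s x 256 Hz = 768 samples per channel).
events = mne.find_events(raw_clean)    # assumes stimulus trigger markers exist
epochs = mne.Epochs(raw_clean, events, tmin=1.0, tmax=4.0 - 1.0 / 256,
                    baseline=None, preload=True)
X = epochs.get_data()                  # shape: (n_trials, 16, 768)
```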

2.4. Driving Intention Characteristics Analysis

Based on the experimental platform, EEG data recorded under the same driving intention video stimuli for subjects 1, 2, and 3 were selected for analysis. For each of the four driving intention states, 3 s of EEG data in the 4–30 Hz frequency band were extracted. Changes in this band are believed to be closely related to the driver’s alertness, attention level, and cognitive decision making while performing tasks. The average value of each channel was calculated to quantify its EEG activity, and EEG topographic maps were drawn from these values, as shown in Table 1.
Table 1 presents the EEG topographic maps of the average 4–30 Hz activity over the 3 s windows for subjects 1, 2, and 3 under the same driving intention video stimuli. The maps show clear differences in the spatial distribution of EEG activity across the four driving intention states, reflecting changes in the driver’s cognitive load, alertness, and attention allocation for different driving tasks. Driving intention is essentially a decision-making process that unfolds over time: when facing different driving scenarios, drivers must continuously make judgments and adjustments based on road conditions, traffic flow, and their own operational intentions. In relatively simple and stable situations such as normal driving, the driver’s attention load is lighter and the driver is typically in a relaxed yet alert state; cognitive load is low, effort is focused on maintaining basic vehicle control, and the EEG activity shows a steady and consistent pattern. In situations such as obstacle avoidance, the driver must respond quickly, and the topographic maps typically show widespread changes, with markedly increased activity in brain areas related to perception and decision making, such as the prefrontal and parietal regions. The 4–30 Hz EEG data can therefore be used to assess the driver’s current driving intention, and these differences in EEG features provide an effective basis for EEG-based driving intention recognition.

2.5. EEG Feature Extraction

This section employs filter bank common spatial patterns (FBCSPs) for EEG feature extraction. FBCSP filters the signals across various frequency bands, transforming the high-dimensional time-domain EEG data into a low-dimensional space. This approach maximizes the class covariance, enhancing classification performance. The implementation process of CSP [32] is summarized as follows.
The normalized spatial covariance C of the EEG data D ∈ ℝ^{M×L} is defined by Equation (1):
C = \frac{D D^{T}}{\mathrm{trace}\left(D D^{T}\right)},
where M indicates the number of EEG channels, L represents the total number of samples, and T signifies the matrix transpose. For two types of EEG data, the spatial covariance matrices C1 and C2 are computed by averaging the spatial covariance C of each type of EEG data.
The combined covariance matrix Cz is then constructed as shown in Equation (2):
C_z = C_1 + C_2,
The covariance matrix Cz is then decomposed to yield the eigenvalue matrix A and the eigenvector matrix Fz, as shown in Equation (3):
C_z = F_z A F_z^{T},
The whitening matrix P is then derived through the whitening transformation, as shown in Equation (4):
P = A^{-\frac{1}{2}} F_z^{T}.
The matrices S1 and S2 are calculated as shown in Equations (5) and (6):
S_1 = P C_1 P^{T} = U E_1 U^{T},
S_2 = P C_2 P^{T} = U E_2 U^{T},
where the eigenvectors of matrices S1 and S2 are identical, E1 and E2 satisfy E1 + E2 = I (the identity matrix), and U is the common eigenvector matrix with eigenvalues arranged in descending order.
The projection matrix W is then obtained by arranging the eigenvalues in descending order, as shown in Equation (7):
W = U^{T} P \in \mathbb{R}^{M \times M}.
After the EEG data is projected through CSP, the resulting matrix V is given by Equation (8):
V = W D,
where D is the preprocessed EEG data. The normalized variance of each projected component is then computed and log-transformed to obtain the EEG feature vector, as shown in Equation (9):
f_t = \log\left(\frac{\mathrm{var}(V_t)}{\sum_{j=1}^{2m} \mathrm{var}(V_j)}\right), \quad t = 1, 2, \ldots, 2m.
In a four-class classification problem, feature extraction involves treating one class as a distinct category while combining the remaining three classes into a single group. This approach transforms the original problem into four binary classification tasks, each corresponding to a unique spatial filter. These filters are applied to EEG signals, producing four sets of features, which are then aggregated to form the final extracted features. The feature extraction process of FBCSP is shown in Figure 5.
In FBCSP feature extraction, a zero-phase FIR filter is designed to process EEG signals within the 4–30 Hz frequency range, focusing on the prominent θ, α, and β waves. The 4–30 Hz range is divided into multiple segments: 4–8, 6–10, 8–12, 10–14, 12–16, 14–18, 16–20, 18–22, 20–24, 22–26, 24–28, and 26–30 Hz. CSP feature extraction is then applied to each filtered segment, generating 24 features per segment, resulting in a total of 288 features per sample.
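To make the FBCSP procedure concrete, the following sketch implements the band filtering and one-vs-rest CSP steps in Python using SciPy and MNE's CSP class. The helper names and the FIR filter length are illustrative assumptions, and in practice the CSP filters would be fitted on the training folds only.

```python
import numpy as np
from scipy.signal import firwin, filtfilt
from mne.decoding import CSP

FS = 256
BANDS = [(4, 8), (6, 10), (8, 12), (10, 14), (12, 16), (14, 18),
         (16, 20), (18, 22), (20, 24), (22, 26), (24, 28), (26, 30)]

def bandpass(x, low, high, fs=FS, numtaps=129):
    """Zero-phase FIR band-pass filter applied along the time axis."""
    taps = firwin(numtaps, [low, high], pass_zero=False, fs=fs)
    return filtfilt(taps, 1.0, x, axis=-1)

def fbcsp_features(X, y, n_csp=6):
    """X: (n_trials, n_channels, n_samples); y: 4-class labels.
    Returns (n_trials, 12 bands x 4 one-vs-rest tasks x n_csp) = 288 features."""
    feats = []
    for low, high in BANDS:
        Xb = bandpass(X, low, high)
        for cls in np.unique(y):
            yb = (y == cls).astype(int)               # one-vs-rest binary labels
            csp = CSP(n_components=n_csp, log=True)   # log-variance CSP features
            feats.append(csp.fit_transform(Xb, yb))
    return np.concatenate(feats, axis=1)
```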

2.6. Construction of a Driving Intention Model Based on Cross-Modal Knowledge Distillation

A driving intention model leveraging machine vision is constructed using the CMKD framework [33], as illustrated in Figure 6.
In predicting yi given xi, x′i provides supplementary information for the pair (xi, yi). Specifically, xi represents a video segment of driving stimuli, while x′i corresponds to the EEG signals recorded as subjects viewed the segment, and yi denotes the associated driving intention (e.g., left turn, right turn, avoidance, or normal driving). During training, the teacher model ft is first trained exclusively on the EEG feature data x′i. The trained ft then generates soft predictions scaled by a temperature parameter T, which smooths the output distribution so that the similarities and differences between categories are better preserved. Equation (10) gives the mathematical representation:
s_i = \sigma\left(f_t(x'_i)/T\right),
where σ represents the softmax function. Subsequently, the student model fs is trained exclusively on the video data xi.
The loss function during training is defined through Equation (11):
\mathrm{Loss} = \frac{1}{n}\sum_{i=1}^{n}\left[(1-\lambda)\,\ell_{CE}\left(y_i, \sigma(f_s(x_i))\right) + \lambda\,\ell_{CE}\left(s_i, \sigma(f_s(x_i))\right)\right],
where the imitation factor λ regulates the balance between the predicted soft labels si and the true labels yi, while lCE denotes the cross-entropy loss function.
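A minimal PyTorch sketch of the loss in Equation (11) is given below; it assumes the teacher's temperature-scaled soft labels have already been computed, and the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, hard_labels, soft_labels, lam):
    """Equation (11): (1 - lambda) * CE(hard labels) + lambda * CE(soft labels).

    student_logits: (batch, 4) raw outputs of the student model f_s
    hard_labels:    (batch,)   integer driving-intention labels y_i
    soft_labels:    (batch, 4) teacher soft labels s_i (already softmax-normalized)
    """
    hard_term = F.cross_entropy(student_logits, hard_labels)
    # Cross-entropy against a soft target distribution.
    log_p = F.log_softmax(student_logits, dim=1)
    soft_term = -(soft_labels * log_p).sum(dim=1).mean()
    return (1.0 - lam) * hard_term + lam * soft_term
```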
During training, the EEG signal model incorporates the subject-consensus processing method from reference [34]. Specifically, the teacher model produces predictions from the EEG data of all subjects for each stimulus video segment. For example, the teacher model’s output for one subject might be [0.23, 0.66, 0.11, 0.10], with the outputs of the other subjects following the same format. The predictions of all subjects are then averaged category-wise to obtain the teacher model’s mean output for that video stimulus, e.g., [0.33, 0.65, 0.21, 0.36]. This mean output is scaled by a temperature parameter T, which adjusts the smoothness of the soft label distribution, and the temperature-scaled result is normalized through a softmax layer to obtain the final soft labels. These soft labels serve as the target for the student model, guiding it to progressively approximate the teacher model’s prediction distribution during training.
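The subject-consensus soft labels described above can be sketched as follows, under the assumption that the averaged values are the teacher model's raw (pre-softmax) outputs; the numbers reuse the illustrative example from the text.

```python
import torch

def consensus_soft_labels(teacher_outputs, T):
    """teacher_outputs: (n_subjects, 4) raw teacher outputs for one video segment.
    Returns the temperature-scaled, softmax-normalized consensus soft label."""
    mean_out = teacher_outputs.mean(dim=0)       # category-wise mean over subjects
    return torch.softmax(mean_out / T, dim=0)

# Example with the illustrative numbers from the text (one subject shown, T = 20):
outputs = torch.tensor([[0.23, 0.66, 0.11, 0.10]])
print(consensus_soft_labels(outputs, T=20.0))
```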

3. Results and Discussion

This section presents experiments and analyses of driving intention recognition using video data, EEG features, and the CMKD framework. All models are implemented in Python (PyCharm 2022) using PyTorch 2.4.1 [35] and CUDA 12.1 and trained on an NVIDIA GeForce RTX 3080 Ti GPU.

3.1. Driving Intention Recognition Analysis Based on Video Data

For video data classification, six methods are evaluated. A stratified sampling approach randomly selects 8 video segments per category, yielding 32 video segments for the test set. The remaining 124 segments are divided into training and validation sets using 5-fold cross-validation, maintaining an approximate 100:24:32 ratio for training, validation, and testing. Each video segment is processed by extracting 16 frames for both training and testing. Training frames are resized to 112 × 112 pixels, augmented through affine transformations consisting of random vertical and horizontal translations of up to 0.1 times the image size, and then standardized. The models are trained for 150 epochs using L2-regularized cross-entropy loss with a regularization parameter of 0.01. Optimization is performed using the AdamW algorithm [36] with a learning rate of 3 × 10−5, reduced by a factor of 0.1 after 100 epochs. The batch size is set to 8.
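The training configuration described above corresponds roughly to the following PyTorch sketch; the stand-in model, the dummy data, and the normalization statistics are placeholders, not the authors' implementation.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from torchvision import transforms

# Per-frame preprocessing: resize to 112 x 112, random affine translation of up to
# 10% of the image size, then normalization (statistics assumed). In the real
# pipeline this transform is applied per frame inside the dataset; omitted below.
train_tf = transforms.Compose([
    transforms.Resize((112, 112)),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

# Stand-in model and data: clips of 16 RGB frames at 112 x 112, 4 intention classes.
model = nn.Sequential(nn.Flatten(), nn.Linear(16 * 3 * 112 * 112, 4))
clips = torch.randn(16, 16, 3, 112, 112)
labels = torch.randint(0, 4, (16,))
train_loader = DataLoader(TensorDataset(clips, labels), batch_size=8, shuffle=True)

criterion = nn.CrossEntropyLoss()
# AdamW's weight_decay plays the role of the L2 regularization term (0.01).
optimizer = optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100], gamma=0.1)

for epoch in range(150):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()                    # learning rate x 0.1 after epoch 100
```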
The selected methods for driving intention recognition from video data include several advanced models. The R3D Network [37] enhances video sequence analysis by integrating 3D convolutional layers into ResNet, leveraging cross-layer residual connections to effectively capture spatio-temporal features. The R(2+1)D Network [38] improves efficiency by decomposing 3D convolutional networks into separate 2D and 1D temporal convolutions, achieving a balance between computational cost and performance. The LRCN Network [39] combines the image feature extraction capabilities of CNNs with the temporal modeling strengths of recurrent neural networks (RNNs), allowing for dynamic feature extraction and understanding of semantic content in video sequences. In this experiment, two variants of LRCN are utilized: LRCN_18, which employs ResNet18, and LRCN_34, which uses ResNet34 as the feature extraction structure. Then, a BiLSTM network is used to further extract the temporal information from the images at each time step, and the hidden state of the last time step is selected to determine the final state. The SlowFast Network [40] employs two parallel pathways, one for low frame rates to capture slow changes and the other for high frame rates to detect fast actions, enabling detailed analysis of both slow and fast movements in video data. Two variants of SlowFast are used in this experiment: SlowFast_32, which utilizes ResNet32, and SlowFast_50, which employs ResNet50 as the backbone.
To evaluate the performance of different algorithms, the average evaluation metrics of each optimal model were computed on the test set during five-fold cross-validation. These metrics include the average precision (Avg. Precision), average recall (Avg. Recall), average F1 score (Avg. F1) for each category, overall average accuracy (Avg. Accuracy), and the model file size (MB). The results are presented in Table 2.
Table 2 indicates that LRCN_18 achieves an average accuracy of 59.38% with a standard deviation of 3.83%. It attains a high average recall in the “normal” category and a high average precision in the “avoidance” category, highlighting its reliability and stability in identifying driving intentions in complex coal mine scenarios. Meanwhile, the LRCN_18 model file is only 57.097 MB, making it convenient for system deployment. Therefore, LRCN_18 is selected as the student model for the subsequent CMKD training. The structure of the LRCN_18 model is shown in Figure 7.
In the LRCN_18 model, the CNN component uses the ResNet18 architecture to extract spatial features from each frame of the image. These spatial features are then input as a temporal sequence into a bidirectional long short-term memory network (BiLSTM) to capture the temporal dependencies between the features. The BiLSTM network learns information from both directions of the sequence, updating the hidden state at each time step by utilizing both preceding and succeeding contextual information. Ultimately, based on the hidden state of the last time step, a fully connected layer is applied for transformation and classification, enabling the recognition of driving intentions. Driving intentions are typically decisions made based on the final objective of the entire sequence, making the hidden state of the last time step more representative. BiLSTM gradually integrates historical temporal information into the final hidden state, thus capturing the complete contextual information, which is particularly crucial in decision-making processes involving dependencies.
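A compact sketch of an LRCN_18-style network consistent with this description is given below; unspecified details such as the LSTM hidden size are assumptions.

```python
import torch
from torch import nn
from torchvision import models

class LRCN18(nn.Module):
    """ResNet18 per-frame features -> BiLSTM -> classifier on the last time step."""

    def __init__(self, num_classes=4, hidden_size=256):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()                  # keep 512-d frame features
        self.cnn = backbone
        self.rnn = nn.LSTM(input_size=512, hidden_size=hidden_size,
                           batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, clips):
        # clips: (batch, frames, 3, 112, 112)
        b, t, c, h, w = clips.shape
        feats = self.cnn(clips.reshape(b * t, c, h, w)).reshape(b, t, -1)
        out, _ = self.rnn(feats)                     # (batch, frames, 2 * hidden)
        return self.fc(out[:, -1, :])                # hidden state of the last time step

logits = LRCN18()(torch.randn(2, 16, 3, 112, 112))   # -> (2, 4)
```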

3.2. Driving Intention Recognition Analysis Based on EEG Features

Twelve classification methods were evaluated for EEG feature classification, including CNN [41], long short-term memory networks (LSTMs) [42], transformers [43], and temporal convolutional networks (TCNs) [44]. CNNs extract features automatically from input data using stacked layers, while LSTM captures long-term dependencies through memory cells and gating mechanisms. Transformers, leveraging an attention-based architecture, excel in sequence-to-sequence tasks. TCNs address the limitations of traditional recurrent neural networks by utilizing convolutional layers and residual connections to effectively model long-term dependencies and patterns in time-series data.
For EEG feature classification, 20% of the dataset was randomly selected as the test set. From the remaining data, 10% was designated as the validation set, with the remainder used for training, yielding a ratio of 2919:325:812. The data underwent normalization. Using this approach, five data splits were randomly generated for training and testing. The model was trained for 500 epochs using the cross-entropy loss function, and parameters were optimized with the Adam algorithm [45] at a learning rate of 3 × 10−4, which was reduced by a factor of 0.1 after 450 epochs. The batch size was set to 8. Metrics for evaluation included average precision (Avg. Precision), average recall (Avg. Recall), average F1 score (Avg. F1) per category, and average accuracy (Avg. Accuracy). The average evaluation results across five runs on the test set are shown in Table 3.
The TCN_CNN model demonstrates superior performance in classifying driving intention EEG features, achieving an average accuracy of 71.55%. This high performance is attributed to the model’s ability to integrate the TCN and CNN architectures, enabling the detection of subtle feature variations. For further analysis, EEG features from all subjects were partitioned into training, validation, and test sets in a 2600:624:832 ratio, following the data partitioning methodology applied in video-based driving intention recognition. The data was normalized. The average evaluation metrics for each optimal model on the test set after training are detailed in Table 4.
For the four-class classification of driving intentions based on EEG features stimulated by video data, the TCN_CNN model achieves the highest average classification accuracy at 67.31%. Given these findings, the TCN_CNN model, demonstrating the best performance, is selected as the teacher model for training the subsequent CMKD model. The TCN_CNN architecture diagram is shown in Figure 8.
The TCN_CNN adopts a parallel processing approach in which EEG features are extracted and processed separately by two branches and the results are then fused, allowing significant features for driving intention classification to be identified more effectively. The input EEG feature size is [8, 1, 288], and it is processed by both the TCN and CNN branches. In the TCN module, EEG features are extracted using 1D convolution, and the feature sequence length is adjusted by a cropping module. The features then pass through a ReLU activation function and a Dropout layer, with residual connections enhancing the network’s expressive power. The specific structure is shown in Figure 9a. After three stacked TCN layers, the output EEG feature size is [8, 128, 288]. In the CNN module, EEG features are first processed by 1D convolutional layers for local feature extraction, followed by ReLU activation for nonlinear transformation. Max pooling is then applied for downsampling, further reducing the feature dimensions and retaining the more prominent features. The specific structure is shown in Figure 9b. After three stacked CNN layers, the output EEG feature size is [8, 128, 36]. Finally, the EEG features extracted by the CNN module are fused with the output of the TCN module to integrate information from the two branches. The fused features are flattened before being passed to the fully connected layer, where they undergo a nonlinear transformation via the ReLU activation function and are regularized by a Dropout layer. Through this pipeline, the model outputs the corresponding driving intention category.
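A sketch of a TCN_CNN-style network that reproduces the tensor shapes quoted above is shown below; the kernel sizes, dropout rates, the hidden width of the classifier head, and fusion by concatenation along the feature-length axis are assumptions where the text does not specify them.

```python
import torch
from torch import nn

class TemporalBlock(nn.Module):
    """Causal 1D convolution -> crop -> ReLU -> Dropout, with a residual connection."""

    def __init__(self, c_in, c_out, k=3, dilation=1, p=0.2):
        super().__init__()
        pad = (k - 1) * dilation
        self.conv = nn.Conv1d(c_in, c_out, k, padding=pad, dilation=dilation)
        self.crop = pad                              # drop the extra right-hand samples
        self.act = nn.Sequential(nn.ReLU(), nn.Dropout(p))
        self.down = nn.Conv1d(c_in, c_out, 1) if c_in != c_out else nn.Identity()

    def forward(self, x):
        out = self.act(self.conv(x)[:, :, :-self.crop])
        return out + self.down(x)

def conv_block(c_in, c_out):
    """Conv1d -> ReLU -> MaxPool(2): halves the feature length at each stage."""
    return nn.Sequential(nn.Conv1d(c_in, c_out, 3, padding=1), nn.ReLU(), nn.MaxPool1d(2))

class TCNCNN(nn.Module):
    def __init__(self, num_classes=4, ch=128, feat_len=288, p=0.5):
        super().__init__()
        self.tcn = nn.Sequential(TemporalBlock(1, ch), TemporalBlock(ch, ch),
                                 TemporalBlock(ch, ch))               # -> (B, 128, 288)
        self.cnn = nn.Sequential(conv_block(1, ch), conv_block(ch, ch),
                                 conv_block(ch, ch))                  # -> (B, 128, 36)
        fused_dim = ch * feat_len + ch * (feat_len // 8)              # 128*288 + 128*36
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(fused_dim, 256), nn.ReLU(),
                                  nn.Dropout(p), nn.Linear(256, num_classes))

    def forward(self, x):                       # x: (batch, 1, 288) FBCSP feature vectors
        fused = torch.cat([self.tcn(x), self.cnn(x)], dim=2)          # (B, 128, 324)
        return self.head(fused)

logits = TCNCNN()(torch.randn(8, 1, 288))       # -> (8, 4)
```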

3.3. Driving Intention Recognition Analysis Based on Cross-Modal Knowledge Distillation

Based on previous experiments and analysis, we selected the TCN_CNN model trained on EEG features as the teacher model and the LRCN_18 model, trained on video data, as the student model. The TCN_CNN model combines the strengths of TCN and CNN, effectively extracting multi-level information features from the extracted EEG features and better capturing the complex patterns within the EEG features. Specifically, TCN captures long-range dependencies during feature extraction, while CNN helps to extract local features. The combination of both makes TCN_CNN highly effective in processing EEG features. On the other hand, the LRCN_18 model extracts efficient spatial features from video frames and fully utilizes the temporal information between video frames to capture dynamic changes in the video data. LRCN_18 emphasizes the integration of spatial and temporal information, enabling it to better process video data and play an important role in driving intention recognition tasks. Through the CMKD framework, the knowledge extracted by the teacher model (TCN_CNN) from EEG features can be effectively transferred to the student model (LRCN_18). This allows the student model to leverage the information features embedded in EEG features, improving its recognition accuracy on video data. As both LRCN_18 and TCN_CNN excel in their respective domains, their synergistic effect within the CMKD framework helps the student model more accurately extract and recognize driving intentions from video data.
The model parameters are selected by setting the temperature T within the range [1, 2, 5, 10, 20, 50, 100] and the imitation factor λ within [0.25, 0.5, 0.75, 1]. The student model’s parameters are consistent with those outlined in Section 3.1. After training, the performance of each optimal model from the five-fold cross-validation is evaluated on the test set, with metrics including average accuracy (Avg.) and maximum accuracy (Max), as shown in Table 5. The sensitivity analysis curves for parameters λ and T are shown in Figure 10 and Figure 11.
From Figure 10, it can be seen that for λ = 0.5 and λ = 0.75 the model performs best at T = 100, whereas for λ = 1 the optimal accuracy is reached at T = 20. Similarly, Figure 11 shows that the best overall performance occurs at T = 20 with λ = 1. With λ = 1 and T = 20, the algorithm achieves an average accuracy of 75.63% and a maximum accuracy of 84.38%. This performance gain demonstrates the effectiveness of the proposed method in recognizing driving intentions. To validate the effectiveness of the proposed method more comprehensively, we performed the Wilcoxon signed-rank test [46] on the five-fold cross-validation results of both methods to assess the significance of the difference between our method and LRCN_18. The test shows that the proposed algorithm significantly outperforms the baseline (p = 0.0312 < 0.05).
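The significance check can be reproduced with SciPy as sketched below; the fold-wise accuracies are placeholders, not the actual experimental numbers (with five folds and all differences in the same direction, a one-sided exact Wilcoxon test yields p = 1/32 ≈ 0.0312).

```python
from scipy.stats import wilcoxon

# Fold-wise test accuracies (%) for the two methods; values are placeholders.
cmkd_acc   = [75.0, 78.1, 71.9, 84.4, 68.8]
lrcn18_acc = [62.5, 59.4, 56.3, 65.6, 53.1]

# Paired Wilcoxon signed-rank test over the five folds (one-sided shown here).
stat, p = wilcoxon(cmkd_acc, lrcn18_acc, alternative="greater")
print(f"W = {stat}, p = {p:.4f}")
```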
To further assess its effectiveness, we selected the best-performing parameter combinations and averaged the evaluation metrics (average precision, recall, F1 score, and accuracy) over the five-fold cross-validation results on the test set, as shown in Table 6.
The 1–20 parameter combination delivers the best performance, excelling in both average precision and F1 score, particularly achieving an impressive 80.44% average F1 for the “normal” class. Compared with other parameter combinations and the LRCN_18 model trained solely on video data, the 1–20 configuration reduces false positives while ensuring comprehensive class identification, maintaining both high accuracy and class coverage. This demonstrates the superior balance and stability of the method across multiple metrics, making it the most effective and reliable choice for multi-class classification tasks. The row-normalization confusion matrix with the highest recognition accuracy of LRCN_18 and parameter combinations (1–20) is shown in Figure 12.
As shown in Figure 12, our method outperforms the LRCN_18 in terms of prediction accuracy for the four driving intentions. Among them, the prediction accuracies for right and normal driving intentions are the highest, with no misclassifications. It can be observed that the CMKD method significantly improves the recognition of right and normal driving intentions. However, recognizing left and avoidance driving intentions is more challenging, possibly due to confusion between these two categories, which caused misclassifications. Further optimization for these two intentions will be needed in the future. To visually assess the impact of different parameter configurations, we present bar charts comparing average and maximum accuracy across various settings in Figure 13.
When λ ≥ 0.5 and T > 10, the algorithm consistently surpasses the LRCN_18 method, which relies exclusively on video data for training. This improvement stems from the loss function placing greater emphasis on knowledge transfer from the teacher model when λ ≥ 0.5. As λ increases, the student model more effectively leverages the teacher model’s output, gaining critical insights into driving intentions. This transfer mechanism enables the student model to assimilate richer and more targeted knowledge during training, significantly enhancing its performance, reliability, and accuracy in driving intention recognition tasks.
Accurately interpreting driver intentions in complex coal mine environments is crucial for enhancing transportation efficiency, safety, and reliability while reducing accident risks. The method proposed in this paper implicitly incorporates EEG-derived knowledge into video-based driving intention recognition, so that in practical applications high-precision intention recognition can be achieved using only video cameras. This gives the method significant engineering value for real-time decision making in autonomous driving in underground mines. In addition, the model has low computational complexity and low latency, effectively addressing the recognition of driving intentions for mine transport vehicles and laying a theoretical foundation for breakthroughs in autonomous driving in underground mines.
Regarding the age and number of drivers, this study recruited 13 drivers aged 25–35. Because all underground coal mine workers are male and must have coal mine driving experience and specific backgrounds, the dataset is reasonably representative of the target population, but it still has limitations. Only drivers from a specific age group were considered, which may not reflect the diverse needs of other drivers, and the study currently focuses on intention recognition in relatively simple underground coal mine scenarios as a preliminary exploration of the method. In future research, we will expand the sample size, include more driving intention scenarios such as reversing, and study drivers from different age groups, including the impact of varying numbers of participants on model accuracy, to further enhance the model’s generalizability and robustness.

4. Conclusions

In this paper, we proposed a driver intention recognition model based on CMKD, with the EEG-based TCN_CNN model as the teacher and the video-based LRCN_18 model as the student. The CMKD framework enables the student model to effectively recognize driver intentions without direct EEG input. Experimental results showed a significant improvement in accuracy over the standalone LRCN_18 model, with a maximum accuracy of 84.38%. These findings demonstrate CMKD’s potential to enhance driving intention recognition and advance multimodal integration research.
In our future work, we plan to introduce a cross-modal attention mechanism to more effectively extract and fuse information between different modalities while combining multimodal transformers to further improve the integration of data from different modalities and consider various modality fusion methods to improve the model’s performance.

Author Contributions

Conceptualization and methodology, Y.Z.; software and validation, Y.Z.; formal analysis and investigation, Y.Z., X.Y. and B.M.; resources and data curation, Y.Z., L.G. and H.L.; writing—original draft preparation and writing—review and editing, Y.Z.; visualization and supervision, Y.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China under Grant 2022YFB4703700 and the National Natural Science Foundation of China under Grant 52121003, and in part by the 111 Project under Grant B21014.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board of the School of Mechanical and Electrical Engineering, China University of Mining and Technology (Beijing) (CUMTB20240407).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Guo, Y.; Huang, Y.; Ge, S.; Zhang, Y.; Jiang, E.; Cheng, B.; Yang, S. Low-Carbon Routing Based on Improved Artificial Bee Colony Algorithm for Electric Trackless Rubber-Tyred Vehicles. Complex Syst. Model. Simul. 2023, 3, 169–190. [Google Scholar] [CrossRef]
  2. Trivedi, M.; Gandhi, T.; McCall, J. Looking-in and looking-out of a vehicle: Selected investigations in computer vision based enhanced vehicle safety. In Proceedings of the IEEE International Conference on Vehicular Electronics and Safety, Xi’an, China, 14–16 October 2005; pp. 29–64. [Google Scholar]
  3. Liang, Y.; Zheng, P.; Xia, L. A visual reasoning-based approach for driving experience improvement in the AR-assisted head-up displays. Adv. Eng. Inform. 2023, 55, 101888. [Google Scholar] [CrossRef]
  4. Hanel, A.; Stilla, U. Structure-from-Motion for Calibration of a Vehicle Camera System with Non-Overlapping Fields-of-View in an Urban Environment. Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci. 2017, 42, 181–188. [Google Scholar] [CrossRef]
  5. Jamonnak, S.; Zhao, Y.; Huang, X.; Amiruzzaman, M. Geo-Context Aware Study of Vision-Based Autonomous Driving Models and Spatial Video Data. IEEE Trans. Vis. Comput. Graph. 2022, 28, 1019–1029. [Google Scholar] [CrossRef]
  6. Xu, H.; Gao, Y.; Yu, F.; Darrell, T. End-to-End Learning of Driving Models from Large-Scale Video Datasets. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3530–3538. [Google Scholar]
  7. Ma, H.; Wang, Y.; Xiong, R.; Kodagoda, S.; Tang, L. DeepGoal: Learning to Drive with driving intention from Human Control Demonstration. arXiv 2019, arXiv:1911.12610. [Google Scholar] [CrossRef]
  8. Frossard, D.; Kee, E.; Urtasun, R. DeepSignals: Predicting Intent of Drivers Through Visual Signals. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 9697–9703. [Google Scholar]
  9. Bonyani, M.; Rahmanian, M.; Jahangard, S.; Rezaei, M. DIPNet: Driver intention prediction for a safe takeover transition in autonomous vehicles. IET Intell. Transp. Syst. 2023, 17, 1769–1783. [Google Scholar] [CrossRef]
  10. Guo, X.; Zhang, B. The Research of Coal Mine Underground Rubber Tyred Vehicle Wireless Video Aided Scheduling System. In Information Technology and Intelligent Transportation Systems: Volume 1, Proceedings of the 2015 International Conference on Information Technology and Intelligent Transportation Systems ITITS 2015, Xi’an, China, 12–13 December 2015; Springer International Publishing: Cham, Switzerland, 2017; Volume 454, pp. 365–371. [Google Scholar]
  11. Xue, G.; Li, R.; Liu, S.; Wei, J. Research on Underground Coal Mine Map Construction Method Based on LeGO-LOAM Improved Algorithm. Energies 2022, 15, 6256. [Google Scholar] [CrossRef]
  12. Hanif, M.; Yu, Z.; Bashir, R.; Li, Z.; Farooq, S.; Sana, M. A new network model for multiple object detection for autonomous vehicle detection in mining environment. IET Image Process. 2024, 18, 3277–3287. [Google Scholar] [CrossRef]
  13. Yang, W.; Wang, S.; Wu, J.; Chen, W.; Tian, Z. A low-light image enhancement method for personnel safety monitoring in underground coal mines. Complex Intell. Syst. 2024, 10, 4019–4032. [Google Scholar] [CrossRef]
  14. Affanni, A.; Najafi, T. Drivers’ Attention Assessment by Blink Rate Measurement from EEG Signals. In 2022 IEEE International Workshop on Metrology for Automotive (MetroAutomotive); IEEE: Piscataway, NJ, USA, 2022; pp. 128–132. [Google Scholar]
  15. Zhang, X.; Xiao, X.; Yang, Y.; Hao, Z.; Li, J.; Huang, H. EEG Signal Analysis for Early Detection of Critical Road Events and Emergency Response in Autonomous Driving. In Proceedings of the 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), Bilbao, Spain, 24–28 September 2023; pp. 1706–1712. [Google Scholar]
  16. Jang, M.; Oh, K. Development of an Integrated Longitudinal Control Algorithm for Autonomous Mobility with EEG-Based Driver Status Classification and Safety Index. Electronics 2024, 13, 1374. [Google Scholar] [CrossRef]
  17. Li, M.; Wang, W.; Liu, Z.; Qiu, M.; Qu, D. Driver Behavior and Intention Recognition Based on Wavelet Denoising and Bayesian Theory. Sustainability 2022, 14, 6901. [Google Scholar] [CrossRef]
  18. Martínez, E.; Hernández, L.; Antelis, J. Discrimination Between Normal Driving and Braking Intention from Driver’s Brain Signals. Bioinform. Biomed. Eng. 2018, 10813, 129–138. [Google Scholar]
  19. Kadrolkar, A.; Sup, F.C. Intent recognition of torso motion using wavelet transform feature extraction and linear discriminant analysis ensemble classification. Biomed. Signal Process. Control 2017, 38, 250–264. [Google Scholar] [CrossRef]
  20. Zhao, M.; Gao, H.; Wang, W.; Qu, J. Research on Human-Computer Interaction Intention Recognition Based on EEG and Eye Movement. IEEE Access 2020, 8, 145824–145832. [Google Scholar] [CrossRef]
  21. Fu, R.; Wang, Z.; Wang, S.; Xu, X.; Chen, J.; Wen, G. EEGNet-MSD: A Sparse Convolutional Neural Network for Efficient EEG-Based Intent Decoding. IEEE Sens. J. 2023, 23, 19684–19691. [Google Scholar] [CrossRef]
  22. Sun, J.; Liu, Y.; Ye, Z.; Hu, D. A Novel Multiscale Dilated Convolution Neural Network with Gating Mechanism for Decoding Driving Intentions Based on EEG. IEEE Trans. Cogn. Dev. Syst. 2023, 15, 1712–1721. [Google Scholar] [CrossRef]
  23. Ju, J.; Bi, L.; Feleke, A.G. Noninvasive neural signal-based detection of soft and emergency braking intentions of drivers. Biomed. Signal Process. Control 2022, 72, 103330. [Google Scholar] [CrossRef]
  24. Liang, X.; Yang, Y.; Liu, Y.; Liu, K.; Liu, Y.; Zhou, Z. EEG-based emergency braking intention detection during simulated driving. BioMed. Eng. OnLine 2023, 22, 65. [Google Scholar] [CrossRef]
  25. Chang, W.; Meng, W.; Yan, G.; Zhang, B.; Luo, H.; Gao, R.; Yang, Z. Driving EEG based multilayer dynamic brain network analysis for steering process. Expert Syst. Appl. 2022, 207, 118121. [Google Scholar] [CrossRef]
  26. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. Comput. Sci. 2015, 14, 38–39. [Google Scholar]
  27. Chen, Z.; Li, Z.; Zhang, S.; Fang, J.; Jiang, Q.; Zhao, F. BEVDistill: Cross-Modal BEV Distillation for Multi-View 3D Object Detection. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  28. Thoker, F.M.; Gall, J. Cross-Modal Knowledge Distillation for Action Recognition. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 6–10. [Google Scholar]
  29. Li, Y.; Wang, Y.; Cui, Z. Decoupled Multimodal Distilling for Emotion Recognition. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 6631–6640. [Google Scholar]
  30. Bano, S.; Tonellotto, N.; Gotta, A. Drivers Stress Identification in Real-World Driving Tasks. In Proceedings of the 2022 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops), Pisa, Italy, 21–25 March 2022; pp. 140–141. [Google Scholar]
  31. Zhang, S.; Tang, C.; Guan, C. Visual-to-EEG cross-modal knowledge distillation for continuous emotion recognition. Pattern Recognit. 2022, 130, 108833. [Google Scholar] [CrossRef]
  32. Liang, J. EEG Analysis and BCI Research Based on Motor Imagery Under Driving Behavior. Ph.D. Thesis, Hebei University of Technology, Tianjin, China, 2012. [Google Scholar]
  33. Lopez-Paz, D.; Bottou, L.; Bernhard, S.; Vapnik, V. Unifying distillation and privileged information. arXiv 2015, arXiv:1511.03643. [Google Scholar]
  34. Cavazza, J.; Ahmed, W.; Volpi, R.; Morerio, P.; Bossi, F.; Willemse, C.; Wykowska, A.; Murino, V. Understanding action concepts from videos and brain activity through subjects’ consensus. Sci. Rep. 2022, 12, 19073. [Google Scholar] [CrossRef] [PubMed]
  35. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Chintala, S. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv 2019, arXiv:1912.01703. [Google Scholar]
  36. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  37. Hara, K.; Kataoka, H.; Satoh, Y. Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition. In Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 22–29 October 2017; pp. 3154–3160. [Google Scholar]
  38. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A Closer Look at Spatiotemporal Convolutions for Action Recognition. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6450–6459. [Google Scholar]
  39. Donahue, J.; Hendricks, L.A.; Guadarrama, S.; Rohrbach, M. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 2625–2634. [Google Scholar]
  40. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast Networks for Video Recognition. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6201–6210. [Google Scholar]
  41. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  42. Bkassiny, M. A Deep Learning-based Signal Classification Approach for Spectrum Sensing using Long Short-Term Memory (LSTM) Networks. In Proceedings of the 2022 6th International Conference on Information Technology, Information Systems and Electrical Engineering (ICITISEE), Yogyakarta, Indonesia, 13 December 2022; pp. 667–672. [Google Scholar]
  43. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  44. Bai, S.; Kolter, J.Z.; Koltun, V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar]
  45. Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  46. Thakkar, B. Continuous variable analyses: Student’s t-test, Mann–Whitney U test, Wilcoxon signed-rank test. In Translational Cardiology; Academic Press: Cambridge, MA, USA, 2025; pp. 165–167. [Google Scholar]
Figure 1. Tunnel scene diagram: (a) mine No. 1; (b) mine No. 2.
Figure 2. Video annotation diagram.
Figure 3. Experiment platform setup: 1—display screen; 2—steering wheel; 3—pedals; 4—button; 5—EEG cap.
Figure 4. Temporal structure of a block.
Figure 5. FBCSP feature extraction flowchart.
Figure 6. CMKD model for driving intention recognition.
Figure 7. LRCN_18 model architecture diagram.
Figure 8. TCN_CNN model architecture diagram.
Figure 9. Temporal Block model and CONV model architecture diagram: (a) Temporal Block model; (b) CONV model.
Figure 10. The sensitivity analysis curve for parameter λ.
Figure 11. The sensitivity analysis curve for parameter T.
Figure 12. The row-normalization confusion matrix of the model with the highest recognition accuracy: (a) LRCN_18; (b) parameter combinations (1–20).
Figure 13. Performance of CMKD for driving intention recognition: (a) average accuracy; (b) maximum accuracy.
Table 1. The average EEG topographies of 4–30 Hz under different driving intention scenarios for some subjects.
[EEG topographic map images for Subjects 1–3, one column per driving intention (Left, Right, Normal, Avoidance), plotted on a common scale (μV).]
Table 2. Classification results for video data.

| Algorithm | Classes | Avg. Precision (%) | Avg. Recall (%) | Avg. F1 (%) | Avg. Accuracy (%) | Model File Size (MB) |
|---|---|---|---|---|---|---|
| R3D | left | 74.29 | 25.00 | 33.13 | 45.63 ± 5.23 | 129.753 |
| | right | 76.91 | 35.00 | 40.22 | | |
| | normal | 39.65 | 95.00 | 54.40 | | |
| | avoidance | 78.57 | 27.50 | 38.67 | | |
| R(2+1)D | left | 79.33 | 27.50 | 38.93 | 53.13 ± 4.94 | 129.862 |
| | right | 63.79 | 52.50 | 56.18 | | |
| | normal | 70.70 | 57.50 | 57.71 | | |
| | avoidance | 45.92 | 75.00 | 51.71 | | |
| LRCN_18 | left | 61.24 | 40.00 | 45.70 | 59.38 ± 3.83 | 57.097 |
| | right | 59.09 | 45.00 | 47.34 | | |
| | normal | 56.25 | 95.00 | 68.26 | | |
| | avoidance | 94.00 | 57.50 | 66.99 | | |
| LRCN_34 | left | 23.33 | 20.00 | 21.43 | 47.50 ± 5.13 | 96.646 |
| | right | 45.67 | 47.50 | 44.17 | | |
| | normal | 55.12 | 75.00 | 59.83 | | |
| | avoidance | 74.17 | 47.50 | 46.52 | | |
| SlowFast_32 | left | 45.33 | 25.00 | 31.74 | 57.50 ± 5.23 | 84.450 |
| | right | 50.16 | 47.50 | 47.14 | | |
| | normal | 66.62 | 77.50 | 68.97 | | |
| | avoidance | 73.33 | 80.00 | 70.95 | | |
| SlowFast_50 | left | 13.33 | 7.50 | 9.52 | 46.25 ± 8.67 | 131.569 |
| | right | 43.89 | 35.00 | 37.75 | | |
| | normal | 54.71 | 80.00 | 61.80 | | |
| | avoidance | 57.06 | 62.50 | 54.85 | | |
Table 3. Classification results of EEG features for all video stimuli.

| Algorithm | Classes | Avg. Precision (%) | Avg. Recall (%) | Avg. F1 (%) | Avg. Accuracy (%) |
|---|---|---|---|---|---|
| CNN | left | 66.97 | 69.84 | 68.34 | 69.28 ± 0.62 |
| | right | 72.22 | 64.57 | 68.17 | |
| | normal | 74.24 | 69.18 | 71.60 | |
| | avoidance | 64.91 | 73.73 | 69.02 | |
| LSTM | left | 63.17 | 68.80 | 65.81 | 67.86 ± 0.53 |
| | right | 71.23 | 64.10 | 67.42 | |
| | normal | 70.78 | 71.79 | 71.27 | |
| | avoidance | 66.73 | 66.87 | 66.79 | |
| CNN_LSTM | left | 56.87 | 60.73 | 58.73 | 59.38 ± 1.55 |
| | right | 60.12 | 52.86 | 56.48 | |
| | normal | 63.21 | 61.74 | 62.43 | |
| | avoidance | 57.21 | 62.45 | 59.75 | |
| CNN_BiGRU | left | 58.34 | 58.95 | 58.62 | 60.08 ± 1.68 |
| | right | 59.65 | 59.14 | 59.37 | |
| | normal | 63.10 | 60.58 | 61.81 | |
| | avoidance | 59.40 | 61.57 | 60.45 | |
| Transformer | left | 63.51 | 67.44 | 65.39 | 67.83 ± 1.01 |
| | right | 67.34 | 66.10 | 66.66 | |
| | normal | 72.43 | 69.56 | 70.87 | |
| | avoidance | 68.55 | 68.24 | 68.33 | |
| CNN_Attention | left | 53.01 | 55.39 | 54.16 | 56.75 ± 1.36 |
| | right | 58.56 | 54.86 | 56.63 | |
| | normal | 59.62 | 59.23 | 59.40 | |
| | avoidance | 55.83 | 57.45 | 56.60 | |
| CNN_Transformer | left | 56.20 | 58.32 | 57.11 | 59.09 ± 2.28 |
| | right | 60.66 | 55.53 | 57.95 | |
| | normal | 63.46 | 58.36 | 60.63 | |
| | avoidance | 57.04 | 64.22 | 60.39 | |
| LSTM_Attention | left | 61.95 | 66.39 | 64.09 | 67.49 ± 0.68 |
| | right | 72.46 | 64.29 | 68.12 | |
| | normal | 70.37 | 70.12 | 70.54 | |
| | avoidance | 65.66 | 68.53 | 67.06 | |
| CNN_TCN | left | 63.05 | 63.35 | 63.09 | 65.71 ± 0.67 |
| | right | 67.69 | 65.53 | 66.42 | |
| | normal | 67.92 | 68.79 | 68.23 | |
| | avoidance | 64.81 | 65.00 | 64.86 | |
| TCN_CNN | left | 69.86 | 70.37 | 70.11 | 71.55 ± 0.51 |
| | right | 72.55 | 68.95 | 70.68 | |
| | normal | 73.18 | 76.91 | 74.94 | |
| | avoidance | 70.75 | 69.90 | 70.29 | |
| TCN_Attention | left | 61.51 | 63.87 | 62.66 | 64.78 ± 1.02 |
| | right | 67.03 | 63.43 | 65.05 | |
| | normal | 68.40 | 68.50 | 68.44 | |
| | avoidance | 62.71 | 63.24 | 62.81 | |
| TCN_LSTM | left | 57.38 | 59.68 | 58.46 | 59.61 ± 0.57 |
| | right | 61.23 | 57.52 | 59.25 | |
| | normal | 62.03 | 61.74 | 61.84 | |
| | avoidance | 58.05 | 59.51 | 58.76 | |
Table 4. Classification results of EEG features based on the video data partitioning method.

| Algorithm | Classes | Avg. Precision (%) | Avg. Recall (%) | Avg. F1 (%) | Avg. Accuracy (%) |
|---|---|---|---|---|---|
| CNN | left | 72.37 | 64.04 | 67.84 | 64.76 ± 0.66 |
| | right | 65.01 | 60.58 | 62.68 | |
| | normal | 63.40 | 71.15 | 66.97 | |
| | avoidance | 60.11 | 63.27 | 61.60 | |
| LSTM | left | 69.94 | 66.73 | 68.17 | 66.08 ± 0.60 |
| | right | 69.65 | 63.08 | 65.66 | |
| | normal | 67.49 | 69.04 | 68.07 | |
| | avoidance | 59.95 | 65.48 | 62.54 | |
| CNN_LSTM | left | 46.72 | 49.90 | 48.25 | 49.06 ± 2.24 |
| | right | 44.59 | 43.17 | 43.87 | |
| | normal | 56.08 | 52.60 | 54.24 | |
| | avoidance | 49.41 | 50.30 | 49.95 | |
| CNN_BiGRU | left | 52.84 | 54.14 | 53.39 | 51.75 ± 1.72 |
| | right | 48.14 | 49.13 | 48.59 | |
| | normal | 55.58 | 55.96 | 55.73 | |
| | avoidance | 50.63 | 47.79 | 49.01 | |
| Transformer | left | 68.55 | 68.46 | 68.48 | 66.64 ± 1.12 |
| | right | 71.57 | 59.71 | 64.79 | |
| | normal | 66.27 | 70.96 | 68.51 | |
| | avoidance | 62.39 | 67.40 | 64.56 | |
| CNN_Attention | left | 46.76 | 52.02 | 49.22 | 49.11 ± 2.27 |
| | right | 44.49 | 46.06 | 45.24 | |
| | normal | 56.33 | 52.79 | 54.47 | |
| | avoidance | 50.27 | 45.58 | 47.68 | |
| CNN_Transformer | left | 50.20 | 55.48 | 52.56 | 51.28 ± 3.06 |
| | right | 48.60 | 47.40 | 47.92 | |
| | normal | 54.36 | 54.23 | 54.25 | |
| | avoidance | 52.41 | 47.98 | 50.01 | |
| LSTM_Attention | left | 61.66 | 63.37 | 62.47 | 61.59 ± 0.97 |
| | right | 57.78 | 58.56 | 58.16 | |
| | normal | 64.93 | 67.60 | 66.22 | |
| | avoidance | 61.96 | 56.83 | 59.28 | |
| CNN_TCN | left | 59.15 | 61.44 | 60.19 | 58.70 ± 0.72 |
| | right | 54.96 | 54.04 | 54.47 | |
| | normal | 63.88 | 63.17 | 63.43 | |
| | avoidance | 56.98 | 56.16 | 56.51 | |
| TCN_CNN | left | 72.62 | 69.71 | 71.10 | 67.31 ± 0.93 |
| | right | 73.53 | 60.00 | 66.02 | |
| | normal | 65.03 | 69.52 | 67.04 | |
| | avoidance | 60.97 | 70.00 | 65.08 | |
| TCN_Attention | left | 56.96 | 59.52 | 58.12 | 55.87 ± 1.53 |
| | right | 51.59 | 52.69 | 52.04 | |
| | normal | 58.30 | 59.62 | 58.94 | |
| | avoidance | 56.89 | 51.63 | 54.08 | |
| TCN_LSTM | left | 49.52 | 54.42 | 51.85 | 52.86 ± 1.06 |
| | right | 48.66 | 50.67 | 49.62 | |
| | normal | 61.91 | 54.71 | 58.07 | |
| | avoidance | 53.05 | 51.63 | 52.32 | |
Table 5. Classification results for different parameter combinations.

| T | λ = 0.25 Avg. (%) | λ = 0.25 Max (%) | λ = 0.5 Avg. (%) | λ = 0.5 Max (%) | λ = 0.75 Avg. (%) | λ = 0.75 Max (%) | λ = 1 Avg. (%) | λ = 1 Max (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 59.38 ± 5.41 | 68.75 | 51.25 ± 10.73 | 68.75 | 56.87 ± 8.67 | 71.88 | 67.50 ± 5.68 | 75.00 |
| 2 | 58.13 ± 4.19 | 62.50 | 64.37 ± 1.71 | 65.62 | 58.13 ± 3.57 | 62.50 | 49.38 ± 6.78 | 59.38 |
| 5 | 58.13 ± 4.74 | 65.62 | 55.00 ± 10.03 | 71.88 | 59.38 ± 6.25 | 68.75 | 60.00 ± 8.95 | 71.88 |
| 10 | 55.62 ± 6.01 | 65.62 | 67.50 ± 2.80 | 71.88 | 67.50 ± 4.74 | 71.88 | 57.50 ± 8.73 | 71.88 |
| 20 | 53.75 ± 10.46 | 71.88 | 65.62 ± 5.41 | 75.00 | 60.63 ± 10.74 | 75.00 | 75.63 ± 5.59 | 84.38 |
| 50 | 49.38 ± 11.69 | 68.75 | 65.62 ± 5.85 | 75.00 | 61.87 ± 9.48 | 78.12 | 73.12 ± 1.71 | 75.00 |
| 100 | 56.25 ± 7.97 | 68.75 | 67.50 ± 6.09 | 78.12 | 71.25 ± 6.78 | 81.25 | 67.50 ± 9.27 | 78.12 |
Table 6. Classification results for the optimal parameter combinations.

| Parameter Combinations (λ–T) | Classes | Avg. Precision (%) | Avg. Recall (%) | Avg. F1 (%) | Avg. Accuracy (%) |
|---|---|---|---|---|---|
| 0.75–100 | left | 76.29 | 62.50 | 67.95 | 71.25 ± 6.78 |
| | right | 73.31 | 65.00 | 64.57 | |
| | normal | 67.28 | 100.00 | 79.49 | |
| | avoidance | 82.79 | 57.50 | 66.48 | |
| 1–20 | left | 91.56 | 62.50 | 72.21 | 75.63 ± 5.59 |
| | right | 76.64 | 77.50 | 72.14 | |
| | normal | 68.27 | 100.00 | 80.44 | |
| | avoidance | 95.00 | 62.50 | 73.82 | |
| 1–50 | left | 87.79 | 62.50 | 71.19 | 73.12 ± 1.71 |
| | right | 79.01 | 62.50 | 68.07 | |
| | normal | 61.64 | 100.00 | 75.68 | |
| | avoidance | 88.81 | 67.50 | 75.76 | |
