Electric Bus Pedal Misapplication Detection Based on Phase Space Reconstruction Method

Due to the environmental protection of electric buses, they are gradually replacing traditional fuel buses. Several previous studies have found that accidents related to electric vehicles are linked to Unintended Acceleration (UA), which is mostly caused by the driver pressing the wrong pedal. Therefore, this study proposed a Model for Detecting Pedal Misapplication in Electric Buses (MDPMEB). In this work, natural driving experiments for urban electric buses and pedal misapplication simulation experiments were carried out in a closed field; furthermore, a phase space reconstruction method was introduced, based on chaos theory, to map sequence data to a high-dimensional space in order to produce normal braking and pedal misapplication image datasets. Based on these findings, a modified Swin Transformer network was built. To prevent the model from overfitting when considering small sample data and to improve the generalization ability of the model, it was pre-trained using a publicly available dataset; moreover, the weights of the prior knowledge model were loaded into the model for training. The proposed model was also compared to machine learning and Convolutional Neural Networks (CNN) algorithms. This study showed that this model was able to detect normal braking and pedal misapplication behavior accurately and quickly, and the accuracy rate on the test dataset is 97.58%, which is 9.17% and 4.5% higher than the machine learning algorithm and CNN algorithm, respectively.


Introduction
Reducing carbon emissions has gradually become a global agenda item in recent decades due to the increasing problems of energy shortages and environmental degradation caused by global climate change and its irreversible effects [1].Moreover, emissions from motor vehicles are one of the major sources of environmental pollution, making electric vehicles even more encouraging as an alternative solution to conventional fuel vehicles [2,3].As 70% of the world's population is expected to live in urban areas by 2050, the development of sustainable urban public transport has become necessary to manage population growth and the significant changes in urbanization [4].Moreover, countries worldwide are providing strong policy support to decarbonize transport and are considering all-electric buses as an opportunity to improve environmental conditions [5].
According to the National Highway Traffic Safety Administration (NHTSA), 16,000 vehicle accidents in the U.S. each year involve pedal errors while driving, and in Japan, that number is about 6000 [6].Major accidents involving rear-end collisions, pedestrian injuries, etc. are commonplace, with approximately 12.6% of accidents caused by driver misapplication of the accelerator [7].Although human hands and feet have similar mechanoreceptor mechanisms, the thresholds of perception for vibrotactile stimuli vary for different body Sensors 2023, 23, 7883 3 of 20 creasingly used in driver behavior recognition tasks.These tasks include braking intention recognition [24], pedal misapplication detection [23,25], driver distraction detection [26][27][28], fatigue detection [29], and so on.However, there is a limited number of studies that have used deep learning to detect pedal misapplication by electric bus drivers, moreover, using the foot camera-based scene data images may encounter problems as they are easily affected by lighting and angle.
Generally, Machine Learning (ML) models are employed to classify problems involving input data presented in the form of time series.However, since entering the 21st century, many more sensors have been deployed in vehicles with the introduction of the information explosion era.Although ML has achieved good results in several engineering domains, the classification performance of traditional ML algorithms decreases with increasing data dimensionality.This behavior results in a "curse of dimensionality" [30].Compared with traditional algorithms such as random forest and Support Vector Machine (SVM), CNNs can extract deeper and more abstract features.CNNs greatly overcome the limitations of traditional machine learning, such as their limited ability to represent complex functions, weak learning ability, difficulty in parallel computation, and insufficient generalization ability in complex classification problems [31].CNNs have become a trend to solve various classification and identification problems.
Therefore, the main objective of this work is to develop a CNN-based pedal misapplication detection model, aiming to detect Pedal Misapplication in Electric Buses (PMEB).A natural driving experiment and a closed-field pedal misapplication experiment for urban electric buses were carried out first, and the collected vehicle driving data were processed to manually select the temporal data of the driver's normal braking and pedal misapplication segments.Based on chaos theory, the phase space reconstruction method was deployed to map the obtained temporal data into high-dimensional color images, producing a driver normal braking dataset and a pedal misapplication dataset.
To sum up, we developed, in this paper, a Model for Detecting Pedal Misapplication in Electric Buses (MDPMEB).Moreover, the main works and contributions of this paper are: (1) A pedal misapplication detection model for electric buses, based on chaos theory and Swin Transformer structure, was developed.Moreover, the phase space reconstruction method principle, based on chaos theory, was introduced.The network structure of the Swin Transformer and the arithmetic process of the Shifted Windows Multi-Head Self-Attention (SW-MSA) method were presented.
(2) The data collection experiment for electric bus driving was carried out using the CANoe software for data parsing.The time-series data of the bus segments entering a station, passing through intersections and pedestrian crossings, and encountering pedestrians were intercepted.
(3) The experiment of the bus driver's pedal misapplication in a closed field was conducted, and the dataset of pedal misapplication was established using the phase space reconstruction method.
(4) Pre-training and formal training of the model proposed in this paper were achieved, and finally, the training results were analyzed and compared to the SVM, iForest, ResNet-50, DenseNet, and EfficientNet V2 models.
Finally, this paper is organized as follows: Section 2 presents previous research related to pedal misapplication, whereas Section 3 gives a method to identify pedal misapplication by electric bus drivers.Section 4 presents the results and discussion of this study.The conclusions and deficiencies of this work are delivered in Section 5.

Related Work
Currently, there are few studies regarding pedal misapplication recognition, which mainly focus on the analysis of pedal misapplication and the design of the system to prevent pedal misapplication.Therefore, in this section, we mainly summarize and sort out the above two aspects.

Pedal Misapplication Analysis and Detection
Kawai and Nakata recruited 44 subjects (both young and old) from Nagoya University to perform a two-handed/bipedal response position selection compatibility task.Participants used their left or right foot (or hand) to press the left or right button according to different computer screen prompts to simulate the process of pressing the accelerator or brake pedal of the vehicle.The experiment found that the driver's error rate was higher in the bipedal task, the reaction time of the elderly driver was significantly longer when the brake pedal was simulated, and the left dorsolateral prefrontal cortex showed a greater degree of activation [32].As for Hasegawa et al., they examined the effect of interruptions on the frequency of pedal misapplication by introducing the concept of the "interruption effect".Moreover, they analyzed the mediating effect of age by recruiting young and older participants to perform a pedal response task accompanied by an interruption task (touching the digits).The findings demonstrated that the longer the interruption time of the pedal task is, the more severe the pedal misapplication will be, particularly among older drivers [16].
Moreover, Baharom et al. studied the driver's response and the pedal's misuse during emergency braking in Malaysia.The researchers prepared a vehicle equipped with a camera, a force pressure sensor, a light cueing device, an audio cueing device, and Arduino hardware instrumentation to explore the emergency braking response characteristics under normal driving and emergency braking conditions.The results show that the farther the driver's foot is from the target pedal, the greater the error occurrence will be [18].Furthermore, based on a video camera, a physiograph, a pedal pressure sensor, and an angular tilt sensor, Wu et al. explored the driver's motion patterns and the electrophysiological characteristics during emergency brake pedal switching [13].In addition, three categories of pedal response types (direct hit, corrected trajectory, and pedal error) were analyzed using Functional Principal Component Analysis (FPCA) using the captured video data from the driving simulation experiments to determine the relationship between the foot trajectory and the pedal misapplication.Added to that, Kajiwara & Murata carried out a road driving simulator experiment where the accuracy of the braking response was evaluated using random forest and plain Bayesian predictors.During this experiment, the left foot stance was one of the input parameters, and the findings showed that the driver's cognitive errors were the main factor contributing to pedal operation errors [20].
Moreover, Bareiss et al. analyzed the crash data from NWVCSS and North Carolina, manually classified the data into normal braking and pedal misuse, and built a classification model using the Bidirectional Encoder Representations from Transformers (BERT) natural language understanding model.Referring to the model of the sequence library, the finetuned model had a classification accuracy of 95.7% [23].
As for Wu et al., they collected data from naturalistic driving experiments while using video, foot placement, driver characteristics, and driver cognitive function levels, and estimated the probability of foot placement through the use of a repeated polynomial logit model.Then, they classified the types of pedal applications using a random forest model and got an accuracy of about 91.4% [19].
Finally, Hernández et al. analyzed the driver's Electroencephalogram (EEG) before activating the pedal brake and classified the EEG signal using the Support Vector Machine (SVM) and CNNs.The findings showed that the recognition rate of emergency braking intention was 71.1% and 71.8% for SVM and CNN, respectively [24].

Anti-Pedal Misapplication System Design
In order to mitigate driver pedal misapplication, Prasanna et al. developed a safety system called the "Throttle Override Safety System (TOSS system)" designed for electric vehicles.This system addresses the issue of 'Design Induced Pilot Error' by separating the drive train from the drive shaft when the driver of an electric vehicle becomes overwhelmed and accelerates the pedal and brakes simultaneously.By interrupting the vehicle's power and allowing it to slide freely, the TOSS system reduces the number of crashes caused by the wrong pedal being pressed in panic [33].Similarly, Zheng et al. concluded a study through investigation and concluded that the driver depresses the accelerator pedal slowly and progressively during vehicle start-up; however, during emergency braking, the force applied by the driver on the brake pedal is sharp and rapid [34].Therefore, Zheng et al. designed a pedal misapplication remediation system for emergency braking that considers the pedal pressure per unit area of the accelerator pedal and the time consumed by depressing the accelerator pedal as the basic points for detection.
Similarly, Le et al. installed a master cylinder behind the accelerator pedal of the vehicle and collected the oil pressure in the cylinder using a pressure sensor.The researchers also concluded that, when the driver misuses the accelerator pedal as a brake pedal, the operation is fast and forceful.Consequently, when the accelerator pedal master cylinder pressure exceeds a certain threshold, the controller closes the valve of the accelerator pedal master, preventing any further movement of the pedal [6].Considered together, pedal misapplication has been relatively poorly studied in driver behavior, and many scholars have conducted driving simulator experiments to analyze the factors that lead to pedal misapplication.Moreover, some scholars have also identified the pedal misapplication using machine learning models or deep learning algorithms to predict the driver's pedal operation patterns as well as the pedal misapplication by capturing the driver's foot camera.Other scholars have designed remediation systems after pedal misapplication in software and hardware system design.
However, there is a lack of studies that focus on detecting pedal misapplication by electric vehicle drivers, particularly electric bus drivers.Furthermore, medical physiologic instruments are usually not equipped in vehicles, and the images obtained by scene cameras can be more sensitive to light intensity and installation location.Therefore, in this study, based on the natural driving data of pure electric buses, the aim was to produce a pedal misapplication image dataset and construct a model to detect pedal misapplication in pure electric bus drivers by relying on the excellent feature representation capability of the CNNs.

Our Proposed MDPMEB Algorithm
This section focuses on a brief introduction to the principles of the overall structure of the MDPMEB algorithm, which includes the theory of phase space reconstruction and the main structure of the Swin Transformer.
As described in Section 1, the architecture of the model in this paper is illustrated in Figure 1. Figure 1 depicts the approach of mapping time series data to a high-dimensional space, the feature extraction and downsampling modules based on the residual structures, the use of the Swin Transformer structure for self-attentive mechanisms, and provides additional insights into the Swin Transformer Block module and the Multi-Layer Perceptron (MLP) structure.Moreover, the sequence data are mapped to a higher dimensional space, and the RGB images are fed into the residual structure to generate a feature matrix of a specified size.The input consists of a deep multi-layer Swin Transformer network structure, which produces a 7 × 7 × 768 feature map.The results of the driver's pedal behavior are predicted using adaptive average pooling and fully connected layers.

Phase Space Reconstruction Method Based on Chaos Theory
Natural driving data collected through the vehicle's OBD and CAN bus has greater convenience and robustness; however, the driving data are recorded in time series format.Although deep learning has excellent feature extraction capability, the tabulated data cannot be directly fed into the CNNs.To weigh the above contradictions, the sequence data were mapped to a higher-dimensional space, and RGB images were generated.Research generally uses the "Takens delay" method for data phase space reconstruction; this method consists of calculating two parameters of optimal embedding dimension and time delay in the time series data to map it to higher dimensional space.After obtaining the above two parameters, the sequence data can be reconstructed using the Recurrence Plot (RP) method, so the problem here is actually how to solve the above two parameters.

Phase Space Reconstruction Method Based on Chaos Theory
Natural driving data collected through the vehicle's OBD and CAN bus has greater convenience and robustness; however, the driving data are recorded in time series format.Although deep learning has excellent feature extraction capability, the tabulated data cannot be directly fed into the CNNs.To weigh the above contradictions, the sequence data were mapped to a higher-dimensional space, and RGB images were generated.Research generally uses the "Takens delay" method for data phase space reconstruction; this method consists of calculating two parameters of optimal embedding dimension and time delay in the time series data to map it to higher dimensional space.After obtaining the above two parameters, the sequence data can be reconstructed using the Recurrence Plot (RP) method, so the problem here is actually how to solve the above two parameters.
In a previous study [35], the authors provided the computational procedure for the "Takens delay" method to obtain the computation.Iterate over all possible values of the embedding dimension and the time delay; first construct the statistic, then solve for the optimal delay time and the optimal embedding dimension.In a previous study [35], the authors provided the computational procedure for the "Takens delay" method to obtain the computation.Iterate over all possible values of the embedding dimension and the time delay; first construct the statistic, then solve for the optimal delay time and the optimal embedding dimension.
Assuming a certain sequence {x i } i = 1, 2, . . . ,N, the reconstructed R m elements can be expressed as follows: where m represents the embedding dimension (usually 2 ≤ m ≤ 5), τ is the time delay, N indicates the length of the sequence data, and p is the number of vectors in the phase space of the data embedding.
Moreover, assuming that y 1), the correlation integral C of the embedded time series is obtained using the below equation.
where r denotes the search radius, it can be taken as any real number in the range of [σ/2, 2σ], σ represents the variance of sequence data.For a certain sequence {x i }i = 1, 2, . . ., N, when splitting {x i } into τ non-overlapping time series, the statistic S of the sequence is determined as follows: Therefore, the extreme deviation ∆S of the statistic S is defined as follows: Finally, for a particular series, the mean test statistic S(τ), and the mean statistical extreme deviation ∆S(τ) are represented in the below equations: In addition, combining Equations ( 7) and ( 8), S index is determined as follows: Referring to the previous studies, the solutions corresponding to the first zero of S(τ) and the point of the minimal value of ∆S(τ) are solved, and the smaller solution obtained from the two equations is considered the optimal delay time τ d .After solving the solution corresponding to the global minimum of S index , the optimal embedding dimension m d is calculated based on the τ d .Referring to the above method, we solve for m d and τ d by writing a piece of Python code.After obtaining the m d and the τ d , the RP method is used to convert the input one-dimensional time series data into a matrix of the corresponding dimension.

Swin Transformer Architecture
The Swin Transformer algorithm was proposed by Microsoft and won the International Conference on Computer Vision (ICCV) 2021 Best Paper [36].In the paper, the concept of SW-MSA was proposed.In the convolution operation, the feature graph was divided into multiple disjoint regions (windows), and the W-MSA operation was carried out to realize information sharing among different windows, which greatly enhanced the expression ability of the network.Moreover, the self-attention mechanism was proposed in 2017, and the information importance of the input matrix was generated by calculating the weights between each element through a formula [37].This mechanism is better at capturing the internal correlation of information, has fewer parameters, and is faster to compute.Thus, the algorithm maps the input matrix to a new matrix through Input Embedding and gets the parameter q, k, v through three transformation matrices.Through mutual matching calculation and Softmax processing, the weight of the corresponding input matrix position is obtained.In more detail, the calculation process is described as follows: where W Q , W K , W V represent the parameters trainable transformation matrices and d indicate the length of the input vector.Moreover, the framework of the Swin Transformer is shown in Figure 1d, which consists of several Swin transformer Blocks with different depths and Patch Merging.This latter is used to downsample the input feature matrix, split the feature matrix, and then stitch it according to the position elements.Next, the matrix passes through a LayerNorm block, and finally, it linearly changes through a fully connected layer (which doubles the depth of the feature matrix while reducing the height and width to half of their original size).The structure of the Swin transformer Block is shown in Figure 1e, which mainly consists of W-MSA and SW-MSA.In the MSA method, the input feature matrix is partitioned into several sub-matrices through a Self-Attention operation window, whereas SW-MSA divides the feature map into several regions of different sizes and disjoints, then the different subwindows are moved and spliced for Self-Attention operation.Finally, the window positions are restored, and the operation process is represented in Figure 2. Note that, in this figure, the input matrix is taken 8 × 8 size as an example after assuming that the window size is 4 × 4; thus, the feature matrix is divided into four windows, and after moving regions A, B, and C, respectively, the masked MSA calculation is performed.Finally, the new 8 × 8 matrix is obtained by performing another split, and the positions of regions A, B, and C are restored.As shown in Figure 1c, the previous Patch Embedding layer structure is replaced, and the downsampling network is built using the ResNet structure.Moreover, Figure 1f shows the MLP module, which consists of a series of LayerNorm, GELU activation functions, and Dropout deactivation methods.

Training Based on a Priori Knowledge
The number of self-made datasets in this study did not reach the level of tens of thousands; moreover, manual production of datasets is extremely time-consuming.
To prevent MDPMEB from overfitting during training, which leads to poor model generalization, inspired by the following two studies, we adopt the idea of migration learning (for example, learning to recognize fruits can help identify animals faster).In a study on detecting corn tassels, Liu et al. used the YOLO V5 to detect RGB images cap- As shown in Figure 1c, the previous Patch Embedding layer structure is replaced, and the downsampling network is built using the ResNet structure.Moreover, Figure 1f shows the MLP module, which consists of a series of LayerNorm, GELU activation functions, and Dropout deactivation methods.

Training Based on a Priori Knowledge
The number of self-made datasets in this study did not reach the level of tens of thousands; moreover, manual production of datasets is extremely time-consuming.
To prevent MDPMEB from overfitting during training, which leads to poor model generalization, inspired by the following two studies, we adopt the idea of migration learning (for example, learning to recognize fruits can help identify animals faster).In a study on detecting corn tassels, Liu et al. used the YOLO V5 to detect RGB images captured from a drone [38].Similarly, in Liu et al.'s study, the number of annotated corn tassel datasets was small (742), and to avoid deep learning limitations, Liu et al. used the VisDrone dataset for pre-training.That is, transfer learning uses the source task as the basis for the target task.In the pre-training process, the target task and the source task are required to be interrelated; that is, the source task belongs to the classification task, and the sample size of the dataset used by the source task must be large enough and the quality good enough [39].
Since the author did not have any self-made dataset that met the conditions, this study adopted the cat and dog dataset from Kaggle as the pre-training dataset, which was agreed to be a good classification dataset.Therefore, the model was pre-trained using a public dataset of dogs and cats, leveraging prior knowledge-based information as the basis for the target task.Alternatively, the pre-training involves using the learned rules as part of the training weights for the target task.

Experimental Results and Discussion
This section describes the process of conducting the experiment, data preprocessing details, dataset production, and training results of the MDPMEB.It also examines the difference in classification accuracy between the MDPMEB and other classification models.
Finally, we conducted the experiments using Pytorch-GPU 1.17.0 and Python 3.7 environment implementations on the Ubuntu 18.04 platform.The experimental setup included an i9-12900K CPU and a NVIDIA RTX3090 GPU.The proposed approach was compared to other classical methods and performed both ablation experiments and validation experiments.

Dataset Sources and Collection
This naturalistic driving data collection experiment lasted for about one week in total and was conducted in Guangzhou and Zhengzhou cities in China.Concerning the bus driving data collection experiment realized in Guangzhou City, the study chose route 107, delimited by stations Zhongshan Eighth Road and Flower City Square.The duration of a single run was approximately 80 min, and the operating hours were between 6 a.m. and 9 p.m. Buses on this route run in two shifts (morning and evening), with a total of two drivers involved.Concerning the bus driving data collection experiment realized in Zhengzhou City, the study chose route B1, delimited by stations Power Plant Road and Agricultural Road.The duration of a single run was approximately 60 min, and the operating hours were between 6 a.m. and 9.30 p.m. Buses on this route run in two shifts (morning and evening), with a total of two drivers involved.
Eventually, in this study, the bus travel data were selected based on the experiments in Guangzhou City and Zhengzhou City, with a total of two days and a running time of about 10 h for the first city and a total of four days and a running time of about 40 h for the second city.
The data acquisition device (Kvaser MemoratorProfessional 5xHS, Produced by Kvaser Company, R&D headquarters is located in Sweden; Source from authorized distributors located in Shanghai, China) was connected to the OBD excuse and the CAN high-voltage/lowvoltage interface to record several operating data points of the bus and the vehicle sensor signal data, as shown in Figure 3.The acquisition items include the vehicle ignition system, the gear system, the pedals, the vehicle speed, the braking system, the radar system, the battery pack, relevant sensor signals of the motor group, etc.Moreover, the acquisition frequency of the different modules varies from 10 to 100 Hz.
Kvaser Company, R&D headquarters is located in Sweden; Source from authorized distributors located in Shanghai, China) was connected to the OBD excuse and the CAN highvoltage/low-voltage interface to record several operating data points of the bus and the vehicle sensor signal data, as shown in Figure 3.The acquisition items include the vehicle ignition system, the gear system, the pedals, the vehicle speed, the braking system, the radar system, the battery pack, relevant sensor signals of the motor group, etc.Moreover, the acquisition frequency of the different modules varies from 10 to 100 Hz.In the one-week bus-natural driving data collection experiment, stops caused by drivers entering a station, passing through an intersection, passing through a pavement, or other unexpected situations were intercepted into clips.Moreover, they were combined with pedal operation data as well as traveling video data.However, no pedal misapplication by drivers was recorded.Therefore, a simulated driver pedal misapplication experiment in a closed field must be conducted next for further data collection.To obtain more data on pedal misapplication and to consider the safety of the experiment, a real-world experiment was conducted in a closed field to simulate common types of pedal misapplication; moreover, the experimental scheme is shown in Table 1.In the one-week bus-natural driving data collection experiment, stops caused by drivers entering a station, passing through an intersection, passing through a pavement, or other unexpected situations were intercepted into clips.Moreover, they were combined with pedal operation data as well as traveling video data.However, no pedal misapplication by drivers was recorded.Therefore, a simulated driver pedal misapplication experiment in a closed field must be conducted next for further data collection.To obtain more data on pedal misapplication and to consider the safety of the experiment, a real-world experiment was conducted in a closed field to simulate common types of pedal misapplication; moreover, the experimental scheme is shown in Table 1.The experiment was conducted at a closed test site.A professional driver who has been driving buses for more than 10 years was recruited to drive a bus without passengers and complete the experiment under 4 error types and 3 traveling speeds (as shown in Table 1).In order to reduce the subjective influence of drivers, each experiment was repeated twice.For each experiment, the driver could terminate the state of pressing the wrong pedal after completing the driving behavior of the experimental protocol (Table 1).Then, brake to stop according to your usual driving habits at the end of this experiment and wait for the next experiment.In this study, due to the inherent risks of the experiment, each experiment was only repeated twice at different vehicle speeds, and a total of 24 datasets for different brake pedal misapplications were collected at the end of the experiment.
After obtaining the natural driving data of the electric bus and the experimental data of the closed site, the braking segments were manually selected under operating conditions such as entering a station, passing through an intersection, and turning.The braking process was defined as the process from braking intention to braking execution to braking hold [10].However, in our study, to enable earlier identification of pedal misapplication, the training dataset consisted of segments captured 2 s before braking and 1 s after braking.This "braking" is defined as the moment when the driver needs to slow down and starts to press the brake pedal, and when the brake pedal opening is significantly higher (judged by plotting the pedal opening change curve).Of course, in the case of driver error, this "brake" should be the moment when the driver has the braking intention and steps on the accelerator pedal.Therefore, the definition of "brake" may need to be determined on a case-by-case basis.Finally, a piece of raw data were extracted, which means that we obtained a driver braking dataset with a duration of 3 s.
In this study [10], the authors conducted an analysis and found that the most important parameters for the identification of pedal misapplication behavior were the vehicle speed, the motor torque, the accelerator pedal opening, and the brake pedal opening.As part of this study, a total of four index datasets were selected as input.Furthermore, based on previous accidents caused by drivers mistakenly stepping on the pedal, it was obvious that some drivers would steer the wheel randomly due to panic.Therefore, steering wheel data were also considered in this study.During data pre-processing, the bus company may not complete the calibration of the radar system, resulting in low reliability of the data; thus, the data collected by the radar system is also excluded from this research.Therefore, somewhat different from the input selected by Yuan et al., accelerator pedal opening change rate, brake pedal opening change rate, and steering wheel angle data were also selected as model inputs in this study.
Moreover, in this study, 1250 groups of normal braking and 24 groups of pedal misapplication fragments were obtained.The obtained sequence data were mapped to a higher dimensional space to obtain 1250 × 9 gray-scale images.After superimposing each group of brake fragment images, 1253 color images from three channels were obtained.
However, one may notice that there is an imbalanced number of positive and negative samples in the dataset; however, considering the challenges associated with the safety and cost of repeated closed-site experiments, Generative Adversarial Networks (GANs) may be an effective tool to remedy this problem [40].GAN includes a G (Generator) and a D (Discriminator).The former is responsible for continuously generating samples similar to the input data, whereas the latter is charged with judging the authenticity of the samples output by the generator.Both play games with each other, and eventually generate new data that is very close to the original.
Hence, in this study, we built a GAN network, took 24 sets of negative sample images as input, set the training rounds to 500, the learning rate to 0.001, and finally generated 136 images having a size of 224 × 224 × 3. The experimental data were divided into training sets, validation sets, and test sets, as shown in Table 2.

Parameter Settings
This section describes the details of the hyperparameter settings and network structure of the MDPMEB model.In more detail, we used a variable learning rate, LambdaLR optimizer, to optimize the loss function; moreover, the learning rate, as a function of the epoch, is shown below, with an initial learning rate of lr 0 = 0.001.The curve of learning rate with training batches is shown in Figure 4.
sets, validation sets, and test sets, as shown in Table 2.

Parameter Settings
This section describes the details of the hyperparameter settings and network structure of the MDPMEB model.In more detail, we used a variable learning rate, LambdaLR optimizer, to optimize the loss function; moreover, the learning rate, as a function of the epoch, is shown below, with an initial learning rate of 0 0.001 lr = . The curve of learning rate with training batches is shown in Figure 4.  L2 regularization adds a penalty term to the loss function to prevent overfitting.Both batches and epochs were used to reduce the computational cost; 64 batches and 50 epochs were deployed.The window size of the W-MSA was set to 7 × 7, and the Embedding L2 regularization adds a penalty term to the loss function to prevent overfitting.Both batches and epochs were used to reduce the computational cost; 64 batches and 50 epochs were deployed.The window size of the W-MSA was set to 7 × 7, and the Embedding depth was set to 96.Therefore, the input matrix size of each layer of the MDPMEB model is shown in Table 3.

Evaluation Parameters
In addition to accuracy, we introduced precision, recall, and fps as the evaluation indexes of the model's effectiveness, where precision is the proportion of true positive samples predicted by the model with respect to the total predicted positive samples, and recall presents the proportion of true positive samples correctly predicted by the model to the total true positive samples.Finally, the fps indicates the number of frames per second when the model is processed, which reflects the real-time performance of the model.Their expression formulas are shown below: where TP (True Positive) represents the correct positive example, FN (False Negative) represents the incorrect negative example, FP (False Positive) represents the incorrect negative example, and TN (True Negative) represents the correct negative example.

Comparison of Results
We compared this model with other models from two perspectives: first, with traditional machine learning models using sequence data as input, we chose isolation Forest (iForest) and One-Class SVM.Secondly, with the CNN models using images as input, to explore whether the deep learning model can extract more features and improve the performance of the model after converting sequences to images, as well as to verify whether the MDPMEB model with improved Swin Transformer has better performance with other models.

Comparison with the Machine Learning Models
Among the machine learning classification models, iForest requires small data samples and has a linear time complexity, high accuracy, and ability to handle high-dimensional data; moreover, the One Class SVM model is a single classification method that contains only positive samples in the training set.iForest and One-Class SVM models were built, trained, tested, and compared to the MDPMEB.The accuracy, precision, and recall results of the different models were recorded as shown in Table 4.It should be noted that the input to the machine learning model is canonical sequence data, while the input of the MDPMEB has an image data type.The proposed model outperforms the machine learning model in terms of accuracy, precision, and recall.The accuracy is about 8% higher than iForest and SVM, whereas the recall is about 6% higher than iForest and SVM, and the precision is about 3% higher than both models.The results of this study show that the use of CNNs can indeed improve the recognition accuracy of the model after converting the sequence data to images.

Comparison with Models of Deep Learning
A side-by-side comparison of the mainstream deep learning frameworks is then conducted, where the classical Long Short-Term Memory (LSTM), the Residual Network (ResNet-50), and the EfficientNet V2 were selected to be compared with the proposed model.The accuracy variations of the model in the training process and the validation process are shown in Figure 5.After the training was successfully completed, the models were evaluated u dataset.Therefore, Table 5 depicts the performance of the models, and Figure 6 sh confusion matrix of the different models over this dataset.After the training was successfully completed, the models were evaluated using the dataset.Therefore, Table 5 depicts the performance of the models, and Figure 6 shows the confusion matrix of the different models over this dataset.The results show that: (1) compared to the classical LSTM and ResNet frameworks, the proposed model has excellent performance at the level of accuracy and recall, where the first one is higher than 7.61% and 4.85%, and the second one is higher than 2.88% and 2.01%, respectively; (2) although the difference between the performance of EfficientNet V2, DenseNet, and the proposed model in the test set is not large, our model still performs better than them in the overall performance of check-all rate and accuracy; (3) due to the shallow network depth of LSTM and ResNet and the deeper network layers of the proposed model, we sacrifice a certain operation speed; thus, our model has a lower frame rate of 83 fps when computing the performance of fps; (4) The shallow networks, such as LSTM and ResNet-50, perform poorly in terms of checking the full rate, and they are less effective in identifying the negative samples.Both EfficientNet V2, DenseNet, and the proposed model perform better when recognizing positive samples (normal pedal operation).However, our model also shows a better level of recognition of the negative samples (pedal misapplication).The results show that: (1) compared to the classical LSTM and ResNet frameworks, the proposed model has excellent performance at the level of accuracy and recall, where the first one is higher than 7.61% and 4.85%, and the second one is higher than 2.88% and 2.01%, respectively; (2) although the difference between the performance of EfficientNet  We tested the contribution of the different components by disabling one of the above components of MDPMEB, trained without prior knowledge (our model no p-t), down sampled without ResNet structure (our model no R-s), and without SW-MSA structure (our model no SW-MSA), respectively, and their performance on the training and validation sets is shown in Figure 7. Finally, Table 6 provides the evaluation results of the model when applied to the testing dataset.
implementation of real vehicle pedal misapplication, the dataset in this work has un anced positive and negative samples.To avoid overfitting the model on less data an improve its generalization ability to better learn the classification task, a public dataset pre-training the network layer weights was deployed.SW-MSA is an important idea p posed by Swin transformer architecture, and how much this component contributes to model is also a question worth discussing.
We tested the contribution of the different components by disabling one of the ab components of MDPMEB, trained without prior knowledge (our model no p-t), do sampled without ResNet structure (our model no R-s), and without SW-MSA struc (our model no SW-MSA), respectively, and their performance on the training and val tion sets is shown in Figure 7. Finally, Table 6 provides the evaluation results of the mo when applied to the testing dataset.Referring to Figure 7, all models can converge after a certain number of iterations, and the accuracy is above 95%, with excellent performance in the verification set.However, after disabling the prior knowledge module, the convergence speed of the model decreases obviously, and the generalization ability decreases in a small range.Moreover, in the training stage, after disabling the prior knowledge weight, no significant gap is encountered in the performance of the optimal accuracy compared to our proposed model, and the performance decreases by about 0.5%.When the residual module is disabled in the early stages of model training, the feature extraction ability of the model slightly decreases; however, this is not very obvious.However, in the verification dataset, the accuracy rate of the model decreases by 1.0% after disabling the residual module.When the SW-MSA method is disabled, the average accuracy in model training decreases by about 2% compared to models with no components disabled.In the validation dataset, as the number of iterations increases, the accuracy of models with SW-MSA disabled fluctuates greatly, and the average accuracy decreases by about 4%.
In the testing phase, the a priori knowledge pretraining, residual modules, and SW-MSA contribute 0.69%, 0.35%, and 2.23%, respectively, without having a significant difference at the level of the accuracy metrics.However, after disabling the prior knowledge preload module, residual structure, and SW-MSA components, the frame rate of the model improved by 2 to 11 frames; however, we still believe that it is meaningful to use these components to improve the work of the model.These findings indicate that, thanks to the powerful data representation capability of Swin Transformer, the use of the prior knowledge-based preloading module and the residual structure down sampling module is beneficial for improving the generalization ability of the model and can prevent it from overfitting on a small dataset.
Unlike previous studies [19,23], this work did not use foot position data and driver physiological data as model inputs but instead used driving data [34] collected by acquisition modules such as OBD or CAN as inputs.Therefore, the optimal accuracy of our model was 1.87% and 6.18% higher than the detection models established by Bareiss et al. and Wu et al., respectively.Meanwhile, our study avoids using cameras or physiologic instruments in the vehicle.Of course, the authors of this article are not denying that the use of video sensors and physiological instruments can help better identify driver behavior; however, in this research, it may not be worth the cost.Compared to a previous study [10], the braking segment changed from 2 s before and after braking to 2 s before and 1 s after braking.However, the accuracy of our model only decreased by 1.92%, which means that the recognition time of pedal misapplication behavior by this model was fully ahead by 1 s.

Conclusions and Future Work
In this study, we mainly performed PMEB detection, natural driving data collection experiments, and pedal misapplication field experiments for pure electric buses.The driver's normal braking and pedal misapplication segments of 1253 and 24 groups were manually screened, and a CNN model, based on phase space reconstruction theory, was built to map the time series data to a higher dimensional space in order to produce a pure electric bus driver pedal operation behavior image dataset.This study builds MDPMEB based on the Swin Transformer network and ResNet principles to realize the detection of driver pedal operation misapplication.
Moreover, to prevent the model from overfitting on a small dataset, a pre-training is conducted using the cat-dog public dataset, and the results show that the prior knowledgebased weight preloading module can improve the generalization ability of the model and help to enhance the performance of the PMEB detection task.Moreover, the findings show that combining serialized vehicle driving data with deep convolutional neural networks can quickly and accurately detect pedal misapplication by electric bus drivers.Therefore, based on previous research, this study further explores the analysis of pedal misapplication detection by pure electric bus drivers, which is in line with the current era that promotes bus electrification around the world, and the model can detect normal/abnormal pedal operation of electric bus drivers; thus, a theoretical solution for on-board driver monitoring systems is provided.
There are certain shortcomings in this study.Due to the experimental conditions, the sample size of the dataset in this work is small, and there is a certain problem with an unbalanced dataset.To solve this problem, more drivers could be recruited to carry out pedal misapplication experiments in closed fields.In the closed-field experiment, we were unable to recruit a larger number of drivers within the limited time frame.Therefore, only one professional driver participated and completed the experiment.Although each experiment was conducted twice, it was still challenging to eliminate the influence of subjective factors associated with the driver.Although we use the GAN network to optimize the problem of sample imbalance, it does not fundamentally solve this problem.According to relevant studies, oversampling techniques may be an additional effective method [41].
Another aspect that cannot be ignored is that some sensors may accumulate errors, such as accelerometers, gyroscopes, and radars [42,43].For instance, Liu et al. discovered that while the Inertial Measurement Unit (IMU) provides independent three-axis acceleration and angular velocity information without relying on external data, it can still lead to an increase in errors within accelerometers due to prolonged integration operations.
Additionally, Xia et al. designed and developed a data acquisition platform for autonomous vehicles to process multi-source sensor data [43].They observed that due to

Sensors 2023 , 20 Figure 1 .
Figure 1.Architecture of MDPMEB.(a): driving data (brake segments, each sample has 9 columns of tabular type data.);(b): phase space reconstruction module (a mapping method based on time delay method, the input data are changed into gray level map column by column, and then every three columns are superimposed with the output dimension of 112 × 112 × 3 images, and each sample outputs three RGB images, and finally splice them in order.);(c): residual network-based downsampling module; (d): Swin Transformer skeleton (main structure of Swin Transformer, the numbers below represent the number of this module.);(e): Swin Transformer Block module (where W-MSA denotes Windows Multi-Head Self-Attention, SW-MSA denotes Shifted Windows Multi-Head Self-Attention, and MLP denotes Multi-Layer Perceptron); (f): MLP structure (containing a series of inactivation and Layer Normalization).

Figure 1 .
Figure 1.Architecture of MDPMEB.(a): driving data (brake segments, each sample has 9 columns of tabular type data.);(b): phase space reconstruction module (a mapping method based on time delay method, the input data are changed into gray level map column by column, and then every three columns are superimposed with the output dimension of 112 × 112 × 3 images, and each sample outputs three RGB images, and finally splice them in order.);(c): residual network-based downsampling module; (d): Swin Transformer skeleton (main structure of Swin Transformer, the numbers below represent the number of this module.);(e): Swin Transformer Block module (where W-MSA denotes Windows Multi-Head Self-Attention, SW-MSA denotes Shifted Windows Multi-Head Self-Attention, and MLP denotes Multi-Layer Perceptron); (f): MLP structure (containing a series of inactivation and Layer Normalization).

Figure 2 .
Figure 2. Calculation process of the SW-MSA method.(a): Suppose a window with 8 × 8 pixels is divided into nine regions; (b): Adjust the position of the shown regions A, B, C respectively, and then re-divide them to 4 × 4 pixel size; (c): The MSA calculation is performed on the four regions, respectively, and a new sub-window is obtained, still 4 × 4 in size.(d): Splice the sub-windows to obtain a window of size 8 × 8; (e): Move the position of regions A, B, C to obtain a new window.

Figure 2 .
Figure 2. Calculation process of the SW-MSA method.(a): Suppose a window with 8 × 8 pixels is divided into nine regions; (b): Adjust the position of the shown regions A, B, C respectively, and then re-divide them to 4 × 4 pixel size; (c): The MSA calculation is performed on the four regions, respectively, and a new sub-window is obtained, still 4 × 4 in size.(d): Splice the sub-windows to obtain a window of size 8 × 8; (e): Move the position of regions A, B, C to obtain a new window.

Figure 3 .
Figure 3. Data acquisition interface of the pure electric bus used in the experiment.(a) Electric bus CAN interface; (b) OBD interface.

Figure 3 .
Figure 3. Data acquisition interface of the pure electric bus used in the experiment.(a) Electric bus CAN interface; (b) OBD interface.

Figure 5 .
Figure 5.Comparison of accuracy rates during training.(a) training dataset and (b) validation dataset.The results show that: (1) during training, the accuracy of all models except LSTM is around 95% and tends to converge after the 20th iteration; (2) in training and validation, the accuracy of our model improves more slowly compared with LSTM and EfficientNet V2; (3) in training validation, the stability of the proposed model is the best and its accuracy is the highest (training set: 97.95% and validation set: 95.55%).After the training was successfully completed, the models were evaluated using the dataset.Therefore, Table5depicts the performance of the models, and Figure6shows the confusion matrix of the different models over this dataset.

4. 5 .
Ablation Experiment of MDPMEB The model in this study is based on the ResNet structure and the idea of Swin Transformer, and, based on the original Swin transformer network, part of the layer structure of Swin transformer Block is streamlined, whereas the Patch Embedding layer of the original network is replaced by the ResNet structure.Due to the difficulty of the experimental implementation of real vehicle pedal misapplication, the dataset in this work has unbalanced positive and negative samples.To avoid overfitting the model on less data and to improve its generalization ability to better learn the classification task, a public dataset for pre-training the network layer weights was deployed.SW-MSA is an important idea proposed by Swin transformer architecture, and how much this component contributes to the model is also a question worth discussing.

Figure 7 .
Figure 7.Comparison of accuracy in the training process.(a) Training dataset; (b) Validatio dataset.

Figure 7 .
Figure 7.Comparison of accuracy in the training process.(a) Training dataset; (b) Validation of dataset.

Table 2 .
Number of samples in each class.

Table 2 .
Number of samples in each class.

Table 4 .
Comparison of models on the testing dataset.

Table 5 .
Performance of different models on the testing dataset.

Table 5 .
Performance of different models on the testing dataset.

Table 6 .
Comparison of the evaluation indicators on the testing dataset.

Table 6 .
Comparison of the evaluation indicators on the testing dataset.