Bimodal Extended Kalman Filter-Based Pedestrian Trajectory Prediction

We propose a pedestrian trajectory prediction algorithm based on the bimodal extended Kalman filter. With this filter, we are able to make full use of the dual-state nature of the pedestrian movement, i.e., the pedestrian is either moving or remains stationary. We apply the dual-mode probability model to describe the state of the pedestrian. Based on this model, we construct the proposed bimodal extended Kalman filter to estimate pedestrian state distribution. The filter obtains the state distribution for each pedestrian in the scene, respectively, and use that state distribution to predict the future trajectories of all the people in the scene. This prediction method estimates the prior probability of each parameter of the model through the dataset and updates the individual posterior probability of the pedestrian state through the bimodal extended Kalman filter. Our model can predict the trajectory of every individual, by taking the social interaction of pedestrians as well as the surrounding physical obstacles into account, with less than fifty model parameters being used, while with the limited parameter, our model could be nearly accurate as other deep learning models and still be comprehensible for model users.


Introduction
Pedestrian trajectory prediction and pedestrian state filtering are often applied in various fields such as self-driving systems [1,2] and pedestrian tracking in surveillance systems [3,4]. In the self-driving system, the state estimation and the tracking algorithm play an important role in the system.
A good prediction algorithm can help the autonomous driving system compute a better plan according to the encountered situation [2]. By performing tracking for the moving pedestrian around its surroundings, and taking the prediction of the future trajectories of pedestrians into account, the path planner would able to avoid the potential collision with the moving pedestrians or planning in a less oppressive route to bypass its surrounding pedestrians.
On the other hand, the prediction algorithm also plays a key component in the Multi-Object Tracking (MOT) algorithm in surveillance systems. A good prediction model would lower the trajectory prediction error for trackers [5] and therefore help the data association and tracker to track the moving object more accurately in the Multi-Object Tracking algorithm in the surveillance system.
Recently, deep learning-based models such as social GAN [6] showed their ability to predict socially acceptable trajectories via the deep neural network. These models often suffer from lacking the interpretability of the prediction models. It is challenging for the model users to comprehend the models, as the meaning of the weights and biases in the deep neural networks are not obvious. Such a situation motivates us to pursue a model with interpretability while keeping the multi-modality and the social behavior of pedestrians in mind.
Deep learning-based pedestrian prediction models [6][7][8][9][10] are built on top of various neural network structures. This kind of algorithm works by reading the pedestrian trajectories and the context of the surrounding environment, generating the features via deep neural networks, and finally predicting the future pedestrian trajectories based on these features. On the other hand, the traditional algorithms [11][12][13][14] predict the trajectories by first manually analyzing the context in which the pedestrians are moving. After that, the prediction model is built to extract the features based on the analysis and predict the future trajectories on top of the extracted features.
For the deep learning-based pedestrian trajectory prediction models, take the social LSTM [7] as an example. The social LSTM [7] is a deep learning-based pedestrian trajectory prediction model and is proposed by Alahi Alexandre et al. They use the LSTM [15] to consume the observed trajectories and construct the trajectory features for every individual in the scene, and propose a social pooling mechanism to calculate the features of the interactions between pedestrians. This algorithm is a classic deep learning-based method that takes social interaction into consideration.
Continuing from the social LSTM [7], Agrim Gupta et al. propose the social GAN [6]. They employ the GAN [16] architecture in solving the pedestrian trajectory prediction problem. They also adjust the social pooling mechanism from that of social LSTM [7]. A new maximum pooling-based social pooling mechanism is proposed to construct social features. Furthermore, they improve the pooling mechanism and reduce the time of social feature calculation by computing interaction features per scene instead of computing the interaction for every frame.
For the traditional pedestrian algorithms, Dirk Helbing and Peter Molnar propose the concept of social force and use this concept to build the social force model. They treat pedestrians as moving particles, which are affected by the social force from the surrounding. For example, a pedestrian is attracted by the goal and therefore he/she is affected by the attraction force from the goal and trying to move toward the goal. Furthermore, a pedestrian might want to avoid collision with obstacles and other pedestrians, which is modeled as the repulsive force to the pedestrians. The repulsive force discourages the pedestrian from moving in that direction. The social force model is built with the consideration of these forces in its model dynamics.
Yamaguchi Kota et al. propose an energy-based pedestrian behavior model [14]. They construct a series of energy functions to describe how pedestrians see the environment and use the energy function to model the motion of the pedestrians. For example, pedestrians prefer moving at a certain speed and maintaining their current velocity, pedestrians want to move toward the destination, and pedestrians in the same group tend to maintain the same speed and same moving direction. They assume that pedestrians have the nature to move in a way such that the energy is minimized. Therefore they propose to take the negative gradient of the mixture energy function as the pedestrian's velocity and use the velocity to predict future trajectories.
As the traditional model-based pedestrian trajectories prediction models have the advantage of comprehensibility for the model users. The model users can see how the model work. Furthermore, it is much easy for the users to tweak the internal parameters compared to that of the deep learning-based models. Therefore, this study chooses the traditional model as our research direction.
In this research, we employ the concept of the social force model [11] and propose the use of a dual-mode Kalman filter. We consider the static and moving mode of pedestrian motion and predict the trajectory according to the mode of the pedestrians.

Dataset Preparation
In this research, we aim to create a motion model of pedestrians and predict their future trajectories based on observed data in the scene. In order to construct the pedestrian trajectory prediction algorithm, optimize the model parameters, or evaluate the performance of the algorithm, relevant data need to be prepared for follow-up research work. There are two parts of the data to be prepared in this research: one is the pedestrian trajectories in the scene, and the other one is the map of static obstacles in the environment.
To build the dataset for the experiment, we first collect the data at the front square of the National Taipei University of Technology (NTUT), with the help of the Velodyne VLP-16 3D LiDAR sensor. We installed the LiDAR on top of a table, which is near the center of the square, so that the sensor would have a better field of view of the surrounding pedestrians. The environment and setup is shown in Figure 1. The point-cloud data is then collected at the frame rate of 10 Hz for 90 min starting at 16:00 on 1 November 2020. After the point-cloud data is collected, the data have to be processed into the format that is suitable for our research needs. The data would be processed and transformed into an environment obstacle map and pedestrian trajectories dataset. To extract the map and trajectories, a list of steps is executed sequentially.
To extract the trajectories data, firstly, the background point-cloud extraction is performed on all the collected frames of point-clouds by building the histogram of ray-hitting distance for every direction of the laser ray and extracting the most frequently hitting point for every ray direction. In such a way, we are able to extract the background point cloud.
Secondly, after the background point-cloud is retrieved, we take this background and the collected point-clouds as input. We apply the background subtraction method to extract the foreground pedestrian point-clouds. Then we perform the clustering and multi-object tracking algorithm to extract the pedestrian trajectories as shown in the algorithm [17]. The tracked cluster and trajectories are shown in Figure 2. Finally, the extracted trajectories are manually adjusted and annotated, so that the trajectories would not suffer from the errors during the tracking algorithm in the previous step. The frame rate for the trajectories data is also down-sampled to 2.5 frames per second, which is aligned with that in trajectories dataset [18,19].
To obtain the environment obstacle map, we perform the SLAM algorithm [1] to build the point-cloud map of the experiment and align the point-cloud map to the background map with the scan-matching algorithms [20,21]. Then, we keep the points in our range of interests and flatten the 3D point cloud into a 2D one. Finally, we down-sample the point cloud to reduce the points and obtain the obstacle map eventually. This map represents the static obstacles in the experiment environment as shown in Figure 3.

Proposed Model
We proposed a bimodal extended Kalman filter-based pedestrian trajectory prediction model. This model adopts the bimodal extended Kalman filter, which is a special implementation of the Bayes filter. This filter is applied to encode the observed trajectories of the pedestrians and their surrounding obstacles into the state representation. After the state representation is encoded via the filter, the representation is then passed to the decoder to predict the future trajectories of the pedestrians.

Problem Formulation
In this study, we proposed a pedestrian trajectories prediction model G. The model consumes the observed pedestrian trajectories X and obstacles O of their walking environment, and predict the future trajectoriesŶ of the pedestrians in the scenes. We formulate the problem as follows: 1.
Observed trajectories X ζ obs,i i=1:I : The observed trajectories X are the collection of the observed pedestrian trajectory ζ obs,i . The pedestrian index number i is used in the symbol subscript to denote a specific pedestrian trajectory in the scene. The pedestrian trajectory ζ obs,i p t i t=1:t obs is defined as a sequence of 2D positions p t i x t i , y t i ∈ R 2 . It describes how a specific pedestrian is being located throughout a specific period of frames, which starts from frame t = 1 to t = t obs . In this paper, we set the observed sequence length t obs to 8.

2.
Predicted trajectoriesŶ ζ pred,i i=1:I : The predicted trajectoriesŶ are the collection of the predicted pedestrian trajectoryζ pred,i , which are generated via the prediction model G. The pedestrian index number i is used in the symbol subscript to denote a specific pedestrian trajectory in the scene. The pedestrian trajectoryζ pred,i p t i t=t obs +1:t obs +t pred is defined as a sequence of 2D positionsp t i x t i , y t i ∈ R 2 . It describe how a specific pedestrian is located throughout a specific period of time frames, which starts from frame t = t obs to t = t obs+pred . In this paper, we set the predicted sequence length t pred to 8.

3.
Ground-truth predicted trajectories Y ζ pred,i i=1:I : The ground-truth predicted trajectories represent the pedestrian trajectories in the future like the predicted trajectories, but they are not generated by a prediction model G. Instead, they describe the pedestrians' real trajectories in the dataset throughout the same time frames as the predicted trajectories.

4.
Environment Obstacles O o k k=1:K : The environment obstacles are a set of obstacles in the pedestrian walking environment. An obstacle is a 2D point representing where the obstacle is located.

State Representation
In our work, we choose the setting for the state so that it is able to represent the state information of that pedestrian in the specific frame. The state s m, x is a tuple that consists of two variables.
The first variable m ∈ {0, 1}, named mode, is the discrete random variable describing the pedestrian's moving mode. We consider the pedestrians has two moving modes, hence this variable is a binary variable. We define m = 0 being the static mode, any pedestrian in this mode prefers to stay in the same location. On the other hand, we defined the mode m = 1 as the moving mode, any pedestrian in such mode would prefer moving but not staying in the same location.
The second variable x [p T v T ] T ∈ R 4 , named base state, is a 4-dimensional vector which contains the 2-dimensional position p and 2-dimensional velocity v of the pedestrian.
With such a setting, we are able to estimate the state and predict the trajectories in a mode-aware manner.

State Distribution
To represent the belief distribution of the state variable, we first decompose the belief state distribution bel(s) into the product of the mode distribution p(m) and the base state distribution p(x | m) conditioning on the mode. For the mode distribution p(m), it is a Bernoulli distribution; therefore, the distribution is parameterized with the weight parameter w m to represent the probability under the specific mode. For the mode-conditioned base state distribution p(x | m), we parameterize the distribution as a multivariate normal distribution, with µ m as its mean and Σ m as its covariance matrix. The subscript m shows the mode that the parameter is conditioning on. We denote the belief state distribution as bel(s) = N (x; µ m , Σ m )w m .

The Proposed Model
For the proposed prediction model, we first encode the trajectories into the belief state distribution, which is the state distribution built on top of the observed data. To calculate the state distributions for the pedestrians in the scene. We introduce the bimodal extended Kalman filter to help encode the observed trajectories into the state distributions. The algorithm is performed as follows.
Just like the extended Kalman filter [22] is an implementation of the Bayes filter, the proposed bimodal extended Kalman filter is an implementation of the bimodal Bayes filter. We will first show the algorithm of the bimodal Bayes filter, as it provides the overview of the workflow for the bimodal extended Kalman filter. Then we will illustrate the algorithm of the bimodal extended Kalman filter.
The bimodal Bayes filter is a specialization of the Bayes filter in which the state consists of a binary mode variable and a continuous multi-dimensional variable. The estimated state is processed in two phases. For the first phase, the current state distribution is predicted based on the previous state distribution with the help of the state prediction model. In the case of the bimodal extended Kalman filter, there are three minor steps in the prediction phase. Firstly, the mode is being predicted with the mode transition model p(m | m) t m |m . Secondly, the base state for each mode is predicted via the mode-specific motion model x = g m |m (x, ). Finally, the predicted base states would be merged into the predicted mode-conditioned base state distribution. After the first phase is completed, the second phase is followed. The second phase is the correction phase, the predicted distribution would be corrected based on the observed data in this. In the correction phase, there are also three minor steps. Firstly, the likelihood function of the state is being computed based on the observed pedestrian position z and the observation model z = h m (x, δ). Then, the mode is corrected with the likelihood function via the Bayes rule. Finally, the predicted base state is corrected.
To estimate the state with the bimodal Bayes filter, one should provide the previous belief state distribution and the state observation, with a properly configured state transition model and observation model. The state-transition model takes the previous state as input and computes the current state. The state transition model contains two parts: one is the mode transition model and the other one is the base state transition model. The mode transition model p(m | m) represent the mode transition probability and is parameterized as t m |m . The base state transition model g m |m is a function that takes the previous base state x and a standard normal noise vector as input and produces the current base state x , and shows the base state transition from the previous mode m to the current m . The transition process could be denoted as The observation model models the observation process of the state. It takes the state s and the observation noise vector δ as input, and produces the observation vector z as output. The observation process could be denoted as The pseudo-code for this algorithm is shown in Algorithm 1.
As the bimodal extended Kalman filter is the implementation of the bimodal Bayes filter, the bimodal extended Kalman filter further assumes the base state is normally distributed and linearizes the base state transition model as well as the observation model to approximate posterior distribution, which is similar to how the extended Kalman filter [22] extends the Kalman filter [23]. The prediction phase and the correction phase are performed sequentially in the filtering algorithm. The pseudo-code for this algorithm is shown in Algorithm 2.
To estimate the current state distribution of the algorithm, the belief distribution of the previous state bel(s) = N (x; µ m , Σ m )w m is passed into the filter as the input, and it is followed by the prediction phase (Algorithm 2 line 1) and the correction phase (Algorithm 2 line 10), and finally the belief distribution of the current state bel(s ) = N (x ; µ m , Σ m )w m is produced as the output.
In the prediction phase, the mode is predicted using the Bayes rule with the mode transition model p(m | m) t m |m (Algorithm 2 lines 2 and 3). Then, the base state distribution p(x | m, m ) N (x ;μ m ,m ,Σ m ,m ) is predicted with the linearized base state transition model (Algorithm 2 lines 4-6). The Jacobian matrices G and E of the base state transition model g m ,m with respect to the base state x and the standard multivariate normal noise are computed. These Jacobian matrices are used for computing the first-order approximation of the predicted state distribution. The predicted modal meanμ m ,m is approximated by the application of the base state transition model g m ,m at the mean of previous base state µ m and the zero mean of the standard multivariate-normal noise. The predicted modal covarianceΣ m ,m is computed as the sum of the transformed previous base state covariance GΣ m G T and the transformed noise covariance EE T . Finally, the predicted modal base state distributions are merged into the predicted base state distribution. As the mixture of two normal distributions may not distribute normally, the mixture normal distribution is merged approximately by preserving the mean and covariance of the mixture distribution. After the prediction phase is completed, the correction phase is followed (Algorithm 2 lines 8 and 9).
In the correction phase, the predicted distribution would be corrected based on the observed data z . There are also three steps in this phase. First, the observation distribution In our setting, the base state x represents the pedestrian's position p and velocity v, the mode m to represent the static and the moving mode pedestrian, and the observation z represents the position of the pedestrian. The required models for the bimodal extended Kalman filter are specified as follows:

1.
For the observation model, it is parameterized with the standard deviation σ p of the observation noise. 2.
For the mode transition model, it is defined as: 3.
For the base state transition model, we apply different motion-model according to the mode. When current mode m is in static mode, we model the motion as constant position model. On the other hand, when current mode m is in moving mode, we apply the social force model [11] as the motion model. For both scenario, an extra noise is added to the velocity term to model the uncertainty of the pedestrian's plan. The model is defined as follows.
In the encoding step, we adopt the bimodal extended Kalman filter to estimate the current state distribution of every pedestrian. We take the observed pedestrian position sequence p 1:t obs as the input data, and estimate the state distribution frame by frame recursively. For each frame, the filter takes the observed position and the previous state distribution, and estimates the current state distribution accordingly. Through the steps, the current state distribution is computed, which contains the parameterized variables representing the mode distribution and the base state distribution.
After the current state distribution is estimated, the decoding phase is then applied to the distribution to generate the predicted pedestrian future trajectories. To decode and predict the future trajectory of the pedestrians. First, states are sampled from the computed state distributions. The sampled state is taken as the initial input. The input then goes into the state transition models and predicts the state. Then, the state is generated recursively via the state transition model by taking the previous state as the input, until the required prediction sequence length is reached. To reduce the prediction variance, the velocity noise is applied to the only states for the output predicted trajectories.

Model Optimization
To obtain the optimal model for predicting the pedestrian trajectories. We have to optimize the proposed model and obtain the optimal parameters. We will illustrate the optimization process for the proposed model.

Observation Model Optimization
For the position observation model z = h m (x, δ) p + σ p δ, it is parameterized via the parameter σ p to represent the standard deviation of the observation noise.
To compute such parameter, we first compute the smoothed pedestrian trajectories via the BSpline [24] algorithm, and we denote the smooth trajectory asp 1:t pred . Now, for every pedestrian position p there is a smoothed version of positionp. We want to file the parameter σ p such that it maximizes the sum of the log likelihood for all positions in the dataset in following equation.
The optimal parameter could be computed by following equation.

Mode Transition Model Optimization
The mode transition model is a part of the state transition model, and it is used to describe the mode transition probability between the transition of states. The mode transition model is defined as follows.
We parameterize the state transition as follows: If we let the previous mode distribution be p(m) = w m , the current mode distribution p(m ) = w m , and the predict current mode distribution be bel(m ) =w m . The process of the prediction could be formulated as follows.
We want to compute the optimal parameter T * such that the sum of square errors between the predicted probabilityw m ,i and the ground-truth w m ,i is minimized.
Then, we reformulate the equation in the form of matrix multiplication. where Solve the equation we have the optimal transition as follows.

Mode Probability Model Optimization
In order to obtain the mode probability w m,i for optimizing the mode transition model, we construct the mode probability model and compute the mode probability base on this model. First, we assume the speed of the pedestrians is distributed as the mixture normal distribution, the distribution is defined as follows: The model is parameterized with six parameters: two for the mode weight w v,0:1 , two for the speed mean µ v,0:1 , and two for the standard deviation σ v,0:1 , the subscript denotes the mode.
Θ w v,0:1 , µ v,0:1 , σ v,0:1 For all the speed data in the pedestrian trajectory dataset, we would like to search for the optimal parameters so that the sum of model log-likelihood could be maximized as follows: arg max As we assume the pedestrian prefers to move in either static or moving mode, the velocity noise is different mode is thereby distributed, respectively. We model the velocity noise as normal distribution and distributed, respectively, in each mode. We use the notation Θ v,m to denote the velocity noise parameter under the specific mode m , it is defined as follows:

Velocity Noise Parameter Optimization
As we assume the pedestrian prefer to move in either static of moving mode. The velocity noise is different mode is therefore distributed, respectively. We model the velocity noise as normal distribution and distributed, respectively, in each mode. We use the notation Θ v,m to denote the velocity noise parameter under the specific mode m , it is defined as follows: From the motion models, we assumes the velocity difference ∆v m between groundtruth velocity v and the predicted velocityv m is distributed under a scaled and rotated standard normal distribution: We want to maximize the sum of weighted log likelihood as follows: where The parameter could be calculated by the following formula, note that the square and square root operators are applied in an element-wise manner.

Social Force Model Optimization
For the pedestrians' in the moving mode, we model the moving behavior with a motion model: We define the model loss to be the average of expected weighted displacement error for pedestrians in the moving mode: We optimize the loss by minimizing the loss via the gradient decent algorithm, the parameters are then eventually retrieved.

Results
In our experiment, we divide the collected data into two parts. One is the training set for training the models. The other one is the testing set, which is not involved in the training process. The constant velocity model, social force model [11], and social GAN [6] model are selected to be compared with the proposed model (with and without noise). The proposed model is trained and tested in a leave-one-out manner.
As for the configuration details of the testing models. Our model is configured to be tested with both the deterministic and the stochastic prediction settings. The deterministic model uses the same parameters as those of the stochastic model. However, the sampling process is degenerated: the normal distribution is replaced with Dirac delta distribution and the mode is always transit to the most likely one. In the deterministic model the most likely sample is always picked, and the randomness is canceled. The social force model is configured to adopt the repulsive force terms from other pedestrians and the obstacles but discard the attractive force terms for reaching the destination and some other objects of interest, as those terms are not obvious during the prediction process. Two parameter sets of the social force model [11] are chosen for this experiment. One follows the original parameters settings as shown in [11], and the other follows the parameters that are refined through the training process.
For the social GAN model, the dimension of the hidden state is set to 32 for the encoder and decoder of the generator, and is configured to 64 for the encoder of the discriminator. The input coordinates are embedded as 16-dimensional vectors. The generator and discriminator are iteratively trained with a batch size of 16 using Adam [25] with an initial learning rate of 5 · 10 −4 for the generator and 10 −3 for the discriminator. The k-variety loss with k = 10 and the pooling mechanism are applied for training this model.

Hardware and Software Environment
In thisstudy, we used Dell XPS 15 laptop as our main computing platform. All calculations and experiments described in the paper were computed with this device, including the collection of the pedestrian point-cloud data, pedestrian trajectory extraction, data curation, and evaluation for all models. The detailed hardware specifications of this computing platform are shown in Table 1. For the part of the software environment, we used Ubuntu 20.04 as the operating system of our developing environment. We used the C++ programming language and the robotic operating system (ROS) for point-cloud data processing, mapping, and trajectory data curation. We used the Python programming language and the Pytorch machine learning framework for building and testing models. The detailed software specifications are shown in Table 2.

Metrics and Results
We adopt several metrics to help evaluate the complexity and the performance of the proposed model. The metrics include the model parameters count, model prediction run-time, average displacement error, final displacement error, minimum social distance, and minimum physical distance. We will illustrate the definition of these metrics and show the results in the following paragraphs. The metric with the best performance among the compared models is marked with boldface, and the metric with the second-best performance is underlined.

Model Complexity
To evaluate the complexity of the models, we count the number of parameters of the prediction model to show the space complexity of the model. We also measure the average trajectory prediction run-time per scene in the dataset to show the time complexity of the model. The run-time of the model is tested and measured on both the CPU device and GPU device. Table 3 shows the metrics for all the models to be compared. Note that we only compared the parameters count for the prediction model, but we do not include the auxiliary models such as the discriminator in the social GAN [6] model.

Average Displacement Error
The average displacement error (ADE) is the metric that shows the average displacement between the predicted pedestrian trajectoriesŶ and the ground-truth future trajectories Y. To compute this metric, the magnitude of the difference between ground-truth position and the prediction position is computed first, then the average is taken for all the frames in a scene. The formula to compute ADE is defined as follows, in which the variable n denotes the number of pedestrians in the specified scene.
The average displacement error (ADE) is the metric for single-scene evaluation. To evaluate the performance of the scenes in the dataset. We derive two metrics based on this metric, which are the mean ADE metric and the minimum ADE metric.
The mean ADE metric (meanADE) computes the average ADE for all sampled scenes in the specified dataset. Furthermore, in order to capture the performance of the stochastic models, we sample N samples = 10 trajectory predictions for a scene, and compute the average over all the samples as well.
The minimum ADE metric (minADE) is a metric to capture the average minimum error among the samples. To compute this metric, first, the minimum ADE is evaluated for all prediction samples in each scene, and then the average is taken over the scenes. The computation formula for this metric is defined as follows: The experiment result for the ADE metrics is shown in Table 4.

Final Displacement Error
The final displacement error (FDE) is the metric that shows the average displacement in the last predicted frame between the predicted pedestrian trajectoriesŶ and the groundtruth future trajectories Y. To compute this metric, the magnitude of the difference between the ground-truth position and the prediction position is computed first, then the average is taken for the last frame in the scene. The formula to compute FDE is defined as follows, in which the variable n denotes the number of pedestrians in the specified scene.
The final displacement error (FDE) is the metric for single-scene evaluation. To evaluate the performance of the scenes in the dataset. We derive two metrics based on this metric, which are the mean FDE metric and the minimum FDE metric.
The mean FDE metric computes the average FDE for all sampled scenes in the specified dataset. Furthermore, in order to capture the performance of the stochastic models, we sample N samples = 10 trajectory predictions for a scene, and compute the average over all the samples as well.
The min ADE metric is a metric to capture the average minimum error among the samples. To compute this metric, first the minimum ADE is evaluated for all prediction samples in each scene, and then the average is taken over the scenes. The computation formula for this metric is defined as follows: The experiment result for the FDE metrics is shown in Table 5.

Minimum Social Distance
The minimum social distance (MSD) is the metric that computes the minimum distance between the pedestrians among all predicted frames. The metric helps us grasp the idea of how the model handles and avoids social collisions.

MSD(Ŷ) min
To further apply the metric for the scenes in the dataset. We derive three metrics based on this metric, these three metrics are the min MSD metric, the 5th% MSD metric, and the Social collision ratio (SCR) metric. To compute these metrics, the MSD metric for each scene in the dataset is computed first. Then, the minimum MSD metrics among all the MSD metrics are found and defined as the minimum MSD metric. The 5th% MSD metric is computed by taking the 5th% value of the percentile of the MSD metrics in the dataset. As for the Social collision ratio, it is defined as the ratio of collision between pedestrians in the dataset, we consider a scene has collision if the MSD metric for the scene is less than 20 cm.
The experiment result for the MSD metrics is shown in Table 6.

Minimum Physical Distance
The minimum physical distance (MPD) is the metric that computes the minimum distance between the pedestrians among all predicted frames. This metrics help us the grasp the idea how the model handle and avoid the physical collisions.
To further apply this metric to the scenes in the dataset. We derive three metrics based on this metric, these three metrics are the min MPD metric, the 5th% MPD metric, and the physical collision ratio (PCR) metric. To compute these metrics, the MPD metric for each scene in the dataset is computed first. Then, the minimum MPD metrics among all the MPD metrics are found and defined as the minimum MPD metric. The 5th% MPD metric is computed by taking the 5th% value of the percentile of the MPD metrics in the dataset. As for the physical collision ratio, it is defined as the ratio of collision between the pedestrian and the physical obstacles in the dataset, we consider a scene has a collision if the MPD metric for the scene is less than 20 cm. The experiment result for the MPD metrics is shown in Table 7.  Figure 4 shows the predicted trajectories and comparison results with the groundtruth trajectories for deterministic models, including our proposed model, constant velocity model and the social force model [11] in two configurations. The predicted trajectories of the pedestrians in this scene are lines colored either red, green, or blue for every individual. The ground-truth trajectories are represented with black lines. The black dots mark the locations of the physical obstacles in the experiment environment. Each figure shows the 60 by 60 square meters area and is represented in the resolution of 600 × 600 pixels.   Figure 5 shows the predicted trajectories distribution compared with the ground-truth trajectories for two stochastic models. The trajectories of the pedestrians in this scene are colored either red, green, or blue for every individual. The predicted trajectories distribution is generated by stacking 1000 predictions into the 2D histogram and is represented using the heatmap. The ground-truth trajectories are represented with solid dots. The black dots mark the locations of the physical obstacles in the experiment environment. Each figure shows the 60 by 60 square meters area and is represented in the resolution of 600 × 600 pixels.

Graphical Results
These figures demonstrate that the proposed model can predict in a bimodal manner, i.e., the pedestrian prefers to maintain his/her position in static mode while moving with the consideration of the environment in the moving mode. When compared with other deterministic models such as constant velocity and the social force model, we see those models often suffers drifts for the pedestrian which is supposed to standing still at the same spot, as shown by those 4 pedestrians in the left and the pedestrian at the right-bottom by an obstacle in Figure 4. Furthermore, the ground-truth trajectories fall in the predicted distribution for both our proposed stochastic model and the social GAN model [6], which showcase the prediction coverage of the models.

Discussion
From the experimental result in the previous section, we could see that our traditional model-based model uses slightly more parameters than the social force model [11], as we introduce the bimodal extended Kalman filtering mechanism in addition to the social force model. Nevertheless, our model uses much fewer parameters than that of the deep learning-based model such as the social GAN [6] model.
The run-time of the proposed models is longer than that of all the other compared models, which is mainly due to the following three reasons. First, our model has an encoding step, which processes the observed trajectories and estimates the states of the pedestrians. This cause our model appears slower than that of the constant velocity model and social force model [11]. Second, our model considers and computes the social and physical interaction on a frame-by-frame basis. When compared with the social GAN [6], which only computes the social interaction feature once per trajectory. Thus, our frame-by-frame approach requires extras computations more than those that do not. Finally, our model requires the computation of Jacobian matrices for approximation during the encoding phase, which is also a time-consuming operation compared with normal feedforward operations.
For the performance of the ADE and FDE metrics. Our proposed deterministic model outperforms all compared models on the mean ADE metric and the mean FDE metric. Our proposed models also outperform all compared models except for the social GAN [6] model, when comparing the minimum ADE metric and the minimum FDE metric. By considering the bimodal moving nature of the pedestrian with the proposed bimodal extended Kalman filter, our proposed model shows the improvement in performance when compared with the constant velocity model and the social force model as they do not utilize such property. Regarding the metrics compared with the social GAN [6] model, such a result is reasonable, as social GAN [6] is optimized on its variety loss. Optimizing the variety loss will encourage the variety of the predicted trajectories while keeping the minimum displacement error as small as possible. However, such loss might also cause the mean ADE to be larger due to the diversity of the prediction.
As for the performance of the MSD and MPD metrics, our proposed models and the social force model [11] perform better than that by the constant velocity model and the social GAN model for all the MSD and MPD metrics while remaining competitive with that by the social force model. The proposed models have larger distances in terms of MSD and MPD metrics and lower collision ratio values. These metrics show that our model does gain the social avoidance ability thanks to the social force model [11] being applied in the moving mode, while the original social force model performs the best in these metrics, it is reasonable as the model is configured to have a larger repulsive force for both the social and physical obstacles.

Conclusions
We build a trajectory dataset from the point-cloud data collected by the 3D LiDAR sensor in a semi-automatic labeling manner and proposed a bimodal extended Kalman filter to model the bimodal moving nature of the pedestrian. We then adopt the social force model [11] as the motion model for the pedestrian in moving mode, and the constant position model for the pedestrian in the static mode. With this approach, the proposed model shows increased performance in ADE and FDE metrics than of those models which are not aware of the bimodal motion properties. Furthermore, our model inherits the ability from the social force model to avoid collisions with pedestrians and physical obstacles. In addition, our proposed approach is fully explainable as comparing with those deep learning-based model, which justifies the usefulness of the proposed approach.

Data Availability Statement:
The data presented in this study are openly available at https://github. com/cylin-at-research/ntut_library (last accessed on 23 October 2022).