Social-STGMLP: A Social Spatio-Temporal Graph Multi-Layer Perceptron for Pedestrian Trajectory Prediction

Abstract: As autonomous driving technology advances, the imperative of ensuring pedestrian traffic safety becomes increasingly prominent within the design framework of autonomous driving systems. Pedestrian trajectory prediction stands out as a pivotal technology aiming to address this challenge by striving to precisely forecast pedestrians' future trajectories, thereby enabling autonomous driving systems to execute timely and accurate decisions. However, the prevailing state-of-the-art models often rely on intricate structures and a substantial number of parameters, posing challenges in meeting the imperative demand for lightweight models within autonomous driving systems. To address these challenges, we introduce Social Spatio-Temporal Graph Multi-Layer Perceptron (Social-STGMLP), a novel approach that utilizes solely fully connected layers and layer normalization. Social-STGMLP operates by abstracting pedestrian trajectories into a spatio-temporal graph, facilitating the modeling of both the spatial social interaction among pedestrians and the temporal motion tendency inherent to pedestrians themselves. Our evaluation of Social-STGMLP reveals its superiority over the reference method, as evidenced by experimental results indicating reductions of 5% in average displacement error (ADE) and 17% in final displacement error (FDE).


Introduction
Within the realm of computer vision, the prediction of pedestrian trajectories has emerged as a pivotal research direction. Its objective is to precisely anticipate the forthcoming movement trend of pedestrians, leveraging past trajectories [1]. An example of pedestrian trajectory prediction in a real-world scenario is shown in Figure 1. With the continuous development of intelligent devices and autonomous systems, there is a growing need for pedestrian trajectory prediction [2]. In contemporary society, the prediction of pedestrian trajectories holds significant importance across a multitude of applications. Specifically, in the domain of autonomous vehicles, precise prediction of pedestrian trajectories enables vehicles to strategically plan routes and mitigate collisions with pedestrians [3][4][5][6][7]. In intelligent transportation systems, pedestrian trajectory prediction can be used to improve traffic flow management and enhance traffic safety [8].
Pedestrian trajectory prediction is commonly perceived as a sequential decision-making challenge, where future trajectory coordinates of pedestrians are inferred from their historical trajectories and motion information [9]. The challenge of predicting pedestrian trajectories lies in understanding and accurately predicting pedestrian movements in the future, including speed, direction, and possible paths. To solve this problem, multiple factors need to be considered, such as the individual characteristics of pedestrians, environmental conditions, traffic rules, and social interaction. Meanwhile, pedestrian trajectory prediction also needs to consider specific conditions in different scenarios and applications. For example, in urban environments, pedestrian behavior may be influenced by traffic signals, road structure, and crowd density; in indoor environments, building layout and indoor equipment may also have an impact on pedestrian movement [10]. Therefore, pedestrian trajectory prediction methods need to adapt to different scenarios to cope with diverse situations. Social interaction is a pivotal factor that demands attention in pedestrian trajectory prediction tasks. As pedestrians walk, they inherently influence each other. Behaviors such as queuing, group dynamics, and adherence to social norms significantly impact pedestrians' decision-making processes. Hence, fully incorporating social interaction into pedestrian trajectory prediction is essential to enhance prediction accuracy [11].
Through the advancements in deep learning [12] technology, particularly the utilization of Recurrent Neural Networks (RNNs) [13], Graph Convolutional Networks (GCNs) [14], and Transformers [15], pedestrian trajectory prediction methods have significantly improved. Neural network models can autonomously capture intricate spatio-temporal correlations, extracting pedestrian social dynamics and motion trends from past trajectories and consequently generating anticipated trajectories. Social-LSTM [16] is an important work in pedestrian trajectory prediction, modeling each pedestrian trajectory through an LSTM [17] and sharing information through a social-pooling layer. Sparse Graph Convolution Network (SGCN) [18] tackles the issue of redundant interaction in trajectory prediction by incorporating a sparse directed spatio-temporal graph. To address the difficulty of modeling complex temporal dependencies in recurrent neural networks, Spatio-Temporal Graph Transformer Networks (STAR) [19] introduced a Transformer to model pedestrian trajectories. This model proposes the TGConv graph convolution mechanism to model pedestrian interaction relationships and employs an attention mechanism for trajectory prediction. While these methods have demonstrated good results in various scenarios, the architectures of the recently proposed methods are not simple, and some models still require prior knowledge. As a result, most model architectures are not conducive to analysis and modification. On the other hand, practical scenarios place high requirements on the efficiency and accuracy of pedestrian trajectory prediction algorithms. It is crucial to design a lightweight network that can be deployed on embedded devices and, on this basis, to ensure high-precision prediction results.
To tackle these challenges, we propose a pedestrian trajectory prediction method based on a multi-layer perceptron. The model includes only two key components: fully connected layers and layer normalization [20]. Social-STGMLP has undergone comprehensive validation on the ETH [21], UCY [22], and SDD [23] datasets, demonstrating outstanding performance in the experimental results. In addition, Social-STGMLP reduces the model parameter count and inference time, which demonstrates that it is lightweight and efficient.
The main contributions of this article are as follows:
1. We demonstrate that pedestrian trajectory prediction can be modeled more simply and introduce the first pedestrian trajectory prediction approach based on multi-layer perceptrons, termed Social-STGMLP.
2. We design an efficient structure consisting solely of fully connected layers and layer normalization. Social-STGMLP shows favorable model parameter count and inference time.
3. Through extensive experimentation and analysis, Social-STGMLP demonstrates superior accuracy compared to alternative approaches, which validates its efficacy.

Related Works
In the last few years, numerous researchers have conducted extensive and in-depth research on pedestrian trajectory prediction. Traditional trajectory prediction methods typically rely on hand-crafted functions to simulate the social interaction relationships among pedestrians. However, these methods have difficulty capturing complex pedestrian interaction relationships and achieve low prediction accuracy. The rapid advancements in deep learning have precipitated significant breakthroughs across diverse domains, including image classification and medical image segmentation [24]. This solid foundation provides strong support for applying deep learning technology to pedestrian trajectory prediction.
Because of the temporal structure inherent in pedestrian trajectories, early research used Recurrent Neural Networks [13] to model pedestrian trajectories. Social-LSTM [16] is an important work based on Long Short-Term Memory (LSTM) [17], a variant of recurrent neural networks. It encodes the trajectory of each pedestrian using an independent LSTM model and shares information through social-pooling layers. In addition, Social-LSTM proposes the assumption that pedestrian trajectories follow a bi-variate Gaussian distribution, which is pioneering work in pedestrian trajectory prediction. The State Refinement module for LSTM network (SR-LSTM) [25] proposes a state refinement module to refine the current state of all pedestrians in the scene. Although RNNs have good sequence modeling capabilities, they have no obvious advantage in extracting pedestrian relationships.
In pedestrian trajectory prediction, it is crucial to extract the relationships among pedestrians. Graphs, as a data structure that can represent relationships between entities, can be naturally introduced in pedestrian trajectory prediction. The pioneering application of a Graph Convolutional Neural Network [26] in pedestrian trajectory prediction is Social-STGCNN, which directly represents pedestrian trajectories as graphs and uses the distance between pedestrians to simulate their interactions. The distance weighting method uses relative distance to simulate undirected interaction. Some researchers argue that this method treats the interaction between two pedestrians as identical in both directions and therefore cannot accurately characterize pedestrian social interactions, and that dense undirected graphs introduce redundant interaction features, causing the model to generate excessive collision-avoidance trajectories. SGCN [18] proposes a sparse directed graph to extract the interaction relationships among pedestrians and the motion trends of pedestrians. To avoid excessive loss of social interaction information, Reasonably Dense Graph Convolution Network (RDGCN) [27] sets reasonable micro-interaction weights, integrates spatio-temporal interaction features through a 3D graph convolution module, and uses an improved temporal convolutional network for trajectory prediction.
Since its proposal, the Transformer [15] has achieved significant results in fields such as natural language processing [15], image classification [28], and speech recognition [29]. STAR [19] first introduced a Transformer to model pedestrian trajectories and proposed a spatio-temporal graph trajectory prediction framework leveraging the attention mechanism. This approach proposes an improved graph convolution, TGConv, based on an attention mechanism to capture complex interactions and uses a Transformer to model pedestrian interaction for trajectory prediction. Some researchers believe that using only graphs as structures cannot fully capture spatio-temporal information; Social Graph Transformer [30] models pedestrian trajectories as interactive spatio-temporal graphs, captures interaction features through graph convolutional networks, and then uses a Transformer for trajectory prediction.
Although these methods provide competitive results, model structures are becoming increasingly complex and difficult to train. Recently, MLP-Mixer [31] proposed an image classification network consisting solely of multi-layer perceptrons, which reduces computational complexity by replacing the attention module with two multi-layer perceptrons. In human motion prediction, MotionMixer [32] and siMLPe [33] use multi-layer perceptrons to learn the spatio-temporal dependencies of the human body. We propose a simpler model structure for modeling pedestrian trajectories based on the multi-layer perceptron. Compared with recent works, Social-STGMLP is more parameter-efficient and achieves better performance.

Problem Definition
In machine learning and computer vision, a pedestrian trajectory video is first segmented into scene maps frame by frame, and pedestrian recognition and localization are then performed on each frame through image recognition technology, thereby obtaining the corresponding two-dimensional pedestrian trajectory coordinates. The trajectory coordinates of pedestrians in the scene map can be organized into time series. In essence, pedestrian trajectory prediction can be defined as a sequence prediction task. Its objective is to forecast the future positions of a designated pedestrian by leveraging their past movement patterns. The problem studied in this article is stated as follows: in a scene map, the observed historical trajectory of target pedestrian $i$ consists of the coordinates $(x_i^t, y_i^t)$ for $t = T_{obs}^1, \dots, T_{obs}^a$, where $n$ is the number of target pedestrians in the scene map ($i = 1, \dots, n$), $(x_i^t, y_i^t)$ is the trajectory coordinate of target pedestrian $i$ at timestamp $t$, and $a$ is the length of the observed historical trajectory. The ground-truth trajectory is represented as in (1) and (2), where $b$ is the length of the predicted trajectory and $T_{pred}^1 = T_{obs}^a + 1$. The predicted trajectory is expressed as in (3) and (4).
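One plausible way to write this setting formally is sketched below. The set names $X_i$, $Y_i$, and $\hat{Y}_i$ and the hatted predicted coordinates are assumptions made here for illustration; they may differ from the notation used in Equations (1) to (4).

```latex
\begin{align*}
X_i &= \left\{ (x_i^t, y_i^t) \;\middle|\; t = T_{obs}^1, \dots, T_{obs}^a \right\}, \quad i = 1, \dots, n \\
Y_i &= \left\{ (x_i^t, y_i^t) \;\middle|\; t = T_{pred}^1, \dots, T_{pred}^b \right\} \\
\hat{Y}_i &= \left\{ (\hat{x}_i^t, \hat{y}_i^t) \;\middle|\; t = T_{pred}^1, \dots, T_{pred}^b \right\}
\end{align*}
```

Under this reading, the task is to produce $\hat{Y}_i$ from $X_1, \dots, X_n$ so that it is as close as possible to $Y_i$.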

Main Architecture
The proposed method adopts a novel model structure, and its detailed architecture is shown in Figure 2. Firstly, the model performs a graph embedding operation on pedestrian trajectories to obtain the spatial graph embedding and the temporal graph embedding. The aim is to extract the trajectory information in the spatial and temporal dimensions to provide a basis for the subsequent feature learning. Then, the spatial and temporal graph embeddings are input into the multi-layer perceptron to learn the spatial social interaction feature among pedestrians and the temporal motion tendency feature of pedestrians, extracting the intricate correlations among pedestrians and enhancing the model's comprehension of dynamic environments. Subsequently, through the feature fusion module, the learned spatio-temporal correlation information is fused to capture richer dynamic features and a more comprehensive pedestrian trajectory feature. Finally, the learned trajectory feature is input into the fully connected layer, which is used to predict the parameters of the bi-variate Gaussian distribution. The end-to-end framework integrates spatial and temporal information, making the model more adaptable and accurate.

Feature Extraction
Consider pedestrian trajectories $X_{in} \in \mathbb{R}^{T_{obs} \times N \times D}$, where $D$ is the dimension of the trajectory coordinates. Based on the pedestrian trajectories, we construct both spatial and temporal graphs. The spatio-temporal graph captures the spatial relationships among pedestrians and the temporal dynamics of pedestrian motion over time. The spatial graph $G_{spa} = (V^t, U^t)$ represents the locations of pedestrians at time $t$. The temporal graph $G_{tem} = (V_n, U_n)$ of pedestrian $n$ represents its trajectory. The node sets of the spatial graph $G_{spa}$ and the temporal graph $G_{tem}$ are $V^t = \{v_n^t \mid n = 1, \dots, N\}$ and $V_n = \{v_n^t \mid t = 1, \dots, T_{obs}\}$, where $v_n^t$ is the attribute of pedestrian $n$ and represents the coordinate $(x_n^t, y_n^t)$ of the pedestrian at time $t$. The edge sets of the spatial graph $G_{spa}$ and the temporal graph $G_{tem}$ are $U^t = \{u_{i,j}^t \mid i, j = 1, \dots, N\}$ and $U_n = \{u_n^{k,q} \mid k, q = 1, \dots, T_{obs}\}$, where $u_{i,j}^t, u_n^{k,q} \in \{0, 1\}$ indicate whether nodes $v_i^t, v_j^t$ or nodes $v_n^k, v_n^q$ are connected: the value is 1 if they are connected and 0 otherwise.
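The graph construction described above can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `build_graphs` is hypothetical, and the edge tensors are filled with ones under the assumption that every pair of nodes is connected.

```python
import numpy as np

def build_graphs(x_in):
    """Build node and edge tensors for the spatial and temporal graphs.

    x_in has shape (T_obs, N, D): coordinates per frame and per pedestrian.
    """
    t_obs, n_ped, _ = x_in.shape
    # Spatial graph: one graph per frame, nodes are pedestrians -> (T_obs, N, D).
    v_spa = x_in
    # Temporal graph: one graph per pedestrian, nodes are time steps -> (N, T_obs, D).
    v_tem = np.transpose(x_in, (1, 0, 2))
    # Assumption: all node pairs are connected, so every edge indicator is 1.
    u_spa = np.ones((t_obs, n_ped, n_ped), dtype=np.int64)
    u_tem = np.ones((n_ped, t_obs, t_obs), dtype=np.int64)
    return (v_spa, u_spa), (v_tem, u_tem)

# Toy input: 8 observed frames, 3 pedestrians, 2-D coordinates.
x_in = np.arange(8 * 3 * 2, dtype=np.float64).reshape(8, 3, 2)
(v_spa, u_spa), (v_tem, u_tem) = build_graphs(x_in)
print(v_spa.shape, u_spa.shape, v_tem.shape, u_tem.shape)
```

The two node tensors hold the same coordinates; only the axis that plays the role of "nodes" differs between the spatial and temporal views.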
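The spatial and temporal graphs are embedded into vectors by graph embedding, as shown in Equations (5) and (6):
$$E_{spa} = \varphi(G_{spa}, W_E^{spa}) \tag{5}$$
$$E_{tem} = \varphi(G_{tem}, W_E^{tem}) \tag{6}$$
where $\varphi(\cdot, \cdot)$ represents a linear transformation, $E_{spa}$ and $E_{tem}$ denote the spatial graph embedding and the temporal graph embedding, and $W_E^{spa} \in \mathbb{R}^{D \times D_E^{spa}}$ and $W_E^{tem} \in \mathbb{R}^{D \times D_E^{tem}}$ are the weights of the linear transformation.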
A group of $m$ MLP blocks is introduced to model the pedestrian spatial interaction feature of the spatial graph embedding and the temporal motion tendency feature of the temporal graph embedding, respectively; the detailed architecture is shown in Figure 3. Each Multi-Layer Perceptron (MLP) block is composed solely of fully connected layers and layer normalization, as depicted in Formulas (7) and (8), where $H_{spa}^l$ and $H_{tem}^l$ denote the outputs of the $l$-th MLP block, $F$ represents the embedding vector dimension, LN represents the layer normalization operation, and $W_{spa}^l$ and $W_{tem}^l$ are the weights of the fully connected layer in the $l$-th MLP block. $H_{spa}^0$ and $H_{tem}^0$ are initialized to $E_{spa}$ and $E_{tem}$, respectively.
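A stack of such blocks can be sketched in NumPy as below. This is an illustrative reading, not the exact form of Formulas (7) and (8): it assumes that each block applies one fully connected layer over the feature axis followed by layer normalization without learnable scale or shift, and that there is no activation or residual connection. The helper names are hypothetical; the width 64 and depth 16 follow the experimental settings.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Normalize each feature vector (last axis) to zero mean and unit variance."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

def mlp_block(h, weight, bias):
    """One block: fully connected layer followed by layer normalization (assumed form)."""
    return layer_norm(h @ weight + bias)

def mlp_stack(embedding, weights, biases):
    """Apply the blocks in sequence, starting from H^0 = embedding."""
    h = embedding
    for weight, bias in zip(weights, biases):
        h = mlp_block(h, weight, bias)
    return h

rng = np.random.default_rng(0)
F, m = 64, 16  # embedding dimension and number of MLP blocks
e_spa = rng.normal(size=(8, 3, F))  # toy spatial embedding: (T_obs, N, F)
weights = [rng.normal(scale=F ** -0.5, size=(F, F)) for _ in range(m)]
biases = [np.zeros(F) for _ in range(m)]
h_spa = mlp_stack(e_spa, weights, biases)
print(h_spa.shape)
```

The same `mlp_stack` would be applied to the temporal embedding with its own set of weights.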

Feature Fusion and Trajectory Prediction
The spatial social interaction and temporal motion tendency features learned by the multi-layer perceptrons are fused through fully connected layers to obtain the pedestrian trajectory feature, as shown in Formula (9), where $H_{spa}$ and $H_{tem}$ are the spatial social interaction feature and the temporal motion tendency feature learned through the multi-layer perceptrons, $\mathrm{Concat}(\cdot, \cdot)$ represents the concatenation operation, and FC represents the fully connected layer.
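The fusion step can be illustrated as a concatenation followed by one fully connected layer. The sketch below assumes that the temporal feature is stored as (N, T, F) and is transposed to match the spatial feature's (T, N, F) layout before concatenation along the feature axis; the function name `fuse` and the shapes are assumptions, not details taken from Formula (9).

```python
import numpy as np

def fuse(h_spa, h_tem, weight, bias):
    """Concatenate spatial and temporal features, then apply a fully connected layer.

    h_spa: (T, N, F), h_tem: (N, T, F), weight: (2F, F_out), bias: (F_out,).
    """
    h_tem_aligned = np.transpose(h_tem, (1, 0, 2))            # (T, N, F)
    h_cat = np.concatenate([h_spa, h_tem_aligned], axis=-1)   # (T, N, 2F)
    return h_cat @ weight + bias                              # (T, N, F_out)

rng = np.random.default_rng(1)
T, N, F = 8, 3, 64
h_spa = rng.normal(size=(T, N, F))
h_tem = rng.normal(size=(N, T, F))
# Illustrative weights: stacking two identity matrices makes the layer add both inputs.
weight = np.vstack([np.eye(F), np.eye(F)])
bias = np.zeros(F)
fused = fuse(h_spa, h_tem, weight, bias)
print(fused.shape)
```

In a trained model the weight matrix is learned; the identity stacking here only makes the output easy to check by hand.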
In this paper, we assume that the trajectory coordinate $(x_n^t, y_n^t)$ of pedestrian $n$ at timestamp $t$ follows the bi-variate Gaussian distribution $\mathcal{N}(\mu_n^t, \sigma_n^t, \rho_n^t)$, where $\mu_n^t$ represents the mean, $\sigma_n^t$ represents the standard deviation, and $\rho_n^t$ represents the correlation coefficient. Given the pedestrian trajectory feature, the fully connected layer is used to predict the parameters of the bi-variate Gaussian distribution along the time dimension. The model is trained by minimizing the negative log-likelihood loss function, as depicted in Formula (10), where $W$ denotes the trainable parameters of Social-STGMLP.
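The negative log-likelihood of a point under a bi-variate Gaussian has a closed form, which the following sketch evaluates with the standard library only. The function names and the argument order are illustrative assumptions; the density itself is the textbook bi-variate normal with means, standard deviations, and a correlation coefficient.

```python
import math

def bivariate_gaussian_nll(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Negative log-density of (x, y) under a bi-variate Gaussian."""
    dx = (x - mu_x) / sigma_x
    dy = (y - mu_y) / sigma_y
    one_minus_rho2 = 1.0 - rho * rho
    # Quadratic form of the standardized residuals.
    z = dx * dx + dy * dy - 2.0 * rho * dx * dy
    # Log of the normalizing constant 2*pi*sigma_x*sigma_y*sqrt(1 - rho^2).
    log_norm = math.log(2.0 * math.pi * sigma_x * sigma_y * math.sqrt(one_minus_rho2))
    return log_norm + z / (2.0 * one_minus_rho2)

def trajectory_nll(points, params):
    """Sum the per-step negative log-likelihood over a predicted trajectory."""
    return sum(bivariate_gaussian_nll(x, y, *p) for (x, y), p in zip(points, params))

# Toy example: two future steps, each predicted as a standard Gaussian at the origin.
points = [(0.0, 0.0), (1.0, 1.0)]
params = [(0.0, 0.0, 1.0, 1.0, 0.0)] * 2
loss = trajectory_nll(points, params)
print(loss)
```

In practice the standard deviations are usually kept positive and the correlation kept inside (-1, 1) by passing the raw network outputs through functions such as `exp` and `tanh`; that detail is an assumption about common practice, not something stated in this paper.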

Datasets and Metrics
Social-STGMLP is trained on the ETH [21] and UCY [22] datasets. The ETH dataset comprises two scenarios, ETH and HOTEL, whereas the UCY dataset comprises three scenarios, UNIV, ZARA1, and ZARA2. All scenes are bird's-eye views taken outdoors and together include 2206 pedestrian trajectories. The datasets include various behavioral patterns such as pedestrian obstacle avoidance and crowd interaction. This paper employs the leave-one-out method [34] for conducting experiments: the model is trained on four scenarios, and the remaining scenario is used as the test set. During training and evaluation, we observe the historical trajectory in the first 8 frames (3.2 s) and then forecast the trajectory for the following 12 frames (4.8 s).
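The evaluation protocol above can be sketched in a few lines. The helper names are hypothetical; the frame interval of 0.4 s is inferred from the stated durations (8 frames in 3.2 s and 12 frames in 4.8 s).

```python
SCENES = ["ETH", "HOTEL", "UNIV", "ZARA1", "ZARA2"]
OBS_LEN, PRED_LEN = 8, 12
FRAME_INTERVAL_S = 0.4  # inferred: 3.2 s / 8 frames = 4.8 s / 12 frames

def leave_one_out_splits(scenes):
    """Return (training scenes, held-out test scene) pairs, one per scene."""
    return [([s for s in scenes if s != held_out], held_out) for held_out in scenes]

def split_window(track):
    """Split one 20-frame track into observed and future parts."""
    if len(track) < OBS_LEN + PRED_LEN:
        raise ValueError("track is shorter than the observation plus prediction window")
    return track[:OBS_LEN], track[OBS_LEN:OBS_LEN + PRED_LEN]

splits = leave_one_out_splits(SCENES)
observed, future = split_window(list(range(20)))
for train_scenes, test_scene in splits:
    print(test_scene, "<-", train_scenes)
```

Each of the five runs therefore trains on four scenes and reports metrics on the fifth.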
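The Stanford Drone Dataset (SDD) is a large dataset consisting of 60 bird's-eye view videos containing more than 10,000 pedestrians and 185,000 interactions. Following previous work, we divided the dataset into a training set, a validation set, and a test set to verify the validity and generalization of the model.
This paper adopts average displacement error (ADE) [35] and final displacement error (FDE) [16] as the evaluation metrics. Below is a concise overview of these evaluation metrics:
Average displacement error (ADE) [35]: The average Euclidean distance between the ground truth and the predicted trajectory across all predicted time steps. Mathematically, it is expressed by Formula (11):
$$\mathrm{ADE} = \frac{1}{n \, b} \sum_{i=1}^{n} \sum_{t=T_{pred}^1}^{T_{pred}^b} \left\| (\hat{x}_i^t, \hat{y}_i^t) - (x_i^t, y_i^t) \right\|_2 \tag{11}$$
Final displacement error (FDE) [16]: The Euclidean distance between the endpoints of the ground truth and the predicted trajectory, which measures the deviation at the final time step of the prediction. This is formalized by Formula (12):
$$\mathrm{FDE} = \frac{1}{n} \sum_{i=1}^{n} \left\| (\hat{x}_i^{T_{pred}^b}, \hat{y}_i^{T_{pred}^b}) - (x_i^{T_{pred}^b}, y_i^{T_{pred}^b}) \right\|_2 \tag{12}$$
Here $(\hat{x}_i^t, \hat{y}_i^t)$ denotes the predicted coordinate of pedestrian $i$ at timestamp $t$, $(x_i^t, y_i^t)$ the corresponding ground truth, $n$ the number of pedestrians, and $b$ the length of the predicted trajectory.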
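Both metrics reduce to a few array operations. The sketch below assumes predictions and ground truth are stored as arrays of shape (N, T_pred, 2) and averages over all pedestrians; the function name `ade_fde` and the toy values are illustrative only.

```python
import numpy as np

def ade_fde(pred, gt):
    """Return (ADE, FDE) for arrays of shape (N, T_pred, 2)."""
    dist = np.linalg.norm(pred - gt, axis=-1)  # Euclidean error per pedestrian and step
    ade = float(dist.mean())                   # average over pedestrians and steps
    fde = float(dist[:, -1].mean())            # average over pedestrians at the last step
    return ade, fde

# Toy data: 2 pedestrians, 12 predicted steps.
gt = np.zeros((2, 12, 2))
pred = np.zeros((2, 12, 2))
pred[..., 0] = 3.0          # x error of 3 at every step
pred[..., 1] = 4.0          # y error of 4 at every step -> distance 5
pred[:, -1] = (6.0, 8.0)    # larger error at the final step -> distance 10
ade, fde = ade_fde(pred, gt)
print(ade, fde)
```

Because FDE looks only at the endpoint, a model can have a similar ADE to a competitor while differing substantially in FDE, which is why the two are reported together.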

Experimental Settings
In our experiments, the processor is an Intel(R) Xeon(R) Gold 6226R CPU @ 2.90 GHz, the model is trained on a Tesla V100 GPU, and Social-STGMLP is implemented in PyTorch 1.2.0. The dimension of the FC layers in our model is set to 64. We set the number of MLP blocks to 16 and the number of MLP blocks in the feature fusion module to 1. Training uses the Adam optimizer for 300 epochs with a batch size of 128 and a learning rate of 0.01.

Brief Introduction to Comparison Methods
In this paper, we compare Social-STGMLP with 16 state-of-the-art methods. Here is a concise introduction to the comparison approaches:
Social-LSTM [16]: Independently models each pedestrian trajectory with separate LSTM units and aggregates information with a social-pooling layer.
Social-GAN [36]: This model utilizes adversarial training and pooling mechanisms to extract interaction patterns between pedestrians.
SoPhie [37]: Proposes a generative adversarial model that leverages attention mechanisms.
Social-BiGAT [38]: This model utilizes graph attention mechanisms and generative adversarial mechanisms to extract interaction among pedestrians.
PIF [39]: Introduces a multi-task model integrated with information such as human skeletons and surrounding scenes.
SR-LSTM [25]: Introduces a state refinement module to refine the current state of all pedestrians in the scene.
RSBG [34]: This model proposes a neural network that recursively extracts social relationships and models them as social behavior graphs.
Social-STGCNN [26]: This model predicts pedestrian trajectories using graph convolution mechanisms and an improved Time-Extrapolator Convolution Neural Network.
SGCN [18]: This model effectively represents pedestrian interaction information by removing redundant pedestrian interaction through a sparse graph convolution module.
GCHGAT [40]: This model proposes a hierarchical graph attention network with group constraints to capture interactions within, outside, and between groups separately.
PTP-STGCN [9]: The model proposes a spatio-temporal graph convolution network that extracts spatial interactions and temporal dependencies.
Social TAG [41]: This model uses STGAT and STGCN to automatically extract view area and grouping features together.
IST-PTEPN [42]: A pedestrian trajectory prediction method that combines pedestrian trajectory and surrounding scene features to predict endpoints.
Tri-HGNN [43]: A novel approach to pedestrian trajectory prediction that integrates spatio-temporal interaction and personal intention.
SKGACN [44]: A novel model to extract spatio-temporal relationships among pedestrians with low computational requirements.
RDGCN [27]: This model integrates spatial-temporal information through three-dimensional graph convolution modules and utilizes improved temporal convolution networks for prediction.

Quantitative Analysis
Social-STGMLP was compared with 16 advanced pedestrian trajectory prediction models across five scenarios, with the evaluation results presented in Table 1. ADE is reported on the left side and FDE on the right side. A lower displacement error indicates better prediction performance, and bold data highlight the best prediction result. The evaluation results show that Social-STGMLP performs strongly on these datasets, particularly on the FDE metric, where it attains the best performance in all five scenarios. Social-STGMLP abstracts pedestrian trajectories into a spatio-temporal graph and individually models spatial social interaction among pedestrians and their temporal motion tendency. Based on the experimental results, we speculate that Social-STGMLP removes excessive redundant interactions and reduces cumulative error in the prediction process. On the ADE metric, our approach did not attain the best results. Compared to the pioneering Social-LSTM, Social-STGMLP improves ADE and FDE by 51% and 65%, respectively. Compared to the classical method SGCN, Social-STGMLP improves ADE and FDE by 5% and 17%, respectively. Furthermore, compared to the recent RDGCN, Social-STGMLP achieves an 8% improvement in FDE.
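Percentage improvements of this kind are conventionally computed as the relative reduction in error with respect to the baseline. The short sketch below shows that calculation; the function name is hypothetical, and the numeric values are made-up placeholders chosen only to illustrate the arithmetic, not values from Table 1.

```python
def relative_improvement(baseline_error, new_error):
    """Relative error reduction with respect to the baseline, in percent."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline_error - new_error) / baseline_error * 100.0

# Hypothetical errors in meters (placeholders, not results from this paper).
example = relative_improvement(baseline_error=0.40, new_error=0.38)
print(f"{example:.1f}% lower error than the baseline")
```

A positive value means the new method has lower error than the baseline; a negative value means it is worse.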
The experimental results of Social-STGMLP and the comparison methods on the SDD dataset are shown in Table 2. Social-STGMLP achieves good results while, unlike the other comparison methods, using only fully connected layers and layer normalization. These results suggest that Social-STGMLP can effectively extract pedestrian interactions and pedestrian movement trends.
Figure 4 illustrates a comparison chart for the average displacement error.Across the HOTEL, ZARA1, and ZARA2 scenarios in the ETH and UCY datasets, Social-STGMLP outperforms classical methods such as SR-LSTM, Social-STGCNN, and SGCN.However, it slightly trails behind the most recent method, RDGCN, in the ETH and UNIV scenarios.
A comparison chart for the final displacement error is shown in Figure 5. Social-STGMLP was compared with four classical methods: SR-LSTM, Social-STGCNN, SGCN, and RDGCN.The experimental results demonstrate that across all five scenarios in the ETH and UCY datasets, Social-STGMLP consistently outperforms all comparison methods in final displacement error.This indicates that Social-STGMLP effectively reduces the cumulative error resulting from redundant interactions in pedestrian trajectory prediction.

Ablation Study
Table 3 presents the ablation study conducted on the spatio-temporal branches of Social-STGMLP. The evaluation results affirm the validity of the structure employed in these branches. Social-STGMLP w/o Spa employs only temporal motion tendency modeling, and Social-STGMLP w/o Tem employs only spatial interaction feature modeling. In pedestrian trajectory prediction, spatial pedestrian interaction information and temporal pedestrian intention information complement each other, and both are indispensable. Considering only temporal graph modeling or only spatial graph modeling leads to performance degradation.
As shown in Table 4, we also conducted ablation experiments varying the number of MLP blocks. The best experimental results were attained when the number of MLP blocks was set to 16. When the model is shallow, it can capture pedestrian motion trends for straighter trajectories with fewer directional changes: good results were obtained on scenes such as ZARA1 and ZARA2, but the prediction performance was poor on ETH. As the number of blocks increases, the complexity of the model steadily rises, leading to enhanced fitting capability for trajectories with more curved paths and frequent changes in direction, as observed in scenarios like ETH. However, an excessive number of blocks may lead to insufficient training, because the data are insufficient to support a high-complexity model, resulting in underfitting.
We conducted a comparative analysis between Social-STGMLP and two existing models (Social-STGCNN [26] and SGCN [18]) across five distinct scenarios, as depicted in Figure 6. These scenarios, arranged from top to bottom, are ETH, HOTEL, UNIV, ZARA1, and ZARA2. The observed trajectories are depicted by solid blue lines, while the predicted trajectories are depicted by solid orange lines. Ground truth trajectories are depicted with solid green lines. The closer the predicted trajectory is to the ground truth, the better the experimental performance. Compared to the other two methods, Social-STGMLP removes redundant interactions, avoids potential collisions, predicts more natural trajectories, stays consistent with the ground truth, and achieves better prediction performance. The first row of the figure illustrates two pedestrians moving in the same direction. Social-STGCNN [26] predicts that the two pedestrians gradually move away from each other, possibly due to excessive collision avoidance and redundant interaction. SGCN [18] predicts trajectories for the two pedestrians that are too close, indicating a potential collision.
Compared with these methods, the prediction results of our method are better. The second row in the figure shows a situation where a single person is walking, but there are other pedestrians in the walking direction, as well as obstacles such as trees and streetlights, which make trajectory prediction difficult. The trajectories predicted by Social-STGCNN [26] and SGCN [18] are quite tortuous and do not look natural. In the third row of the figure, the future trajectories of two groups of pedestrians may overlap, with the right group shifting direction to avoid stationary pedestrians. Social-STGCNN [26] veers away from the ground truth, with the predicted trajectory closely approaching a stationary crowd, potentially increasing the risk of collision. SGCN [18] predicts a trajectory that deviates significantly from the ground truth towards the end of the trajectory. In contrast, our proposed method keeps the predicted trajectory direction aligned with the ground truth, effectively capturing pedestrian motion trends and resulting in a better final displacement error. The fourth and fifth rows of the figure depict two scenarios: in the first, two groups of pedestrians are walking toward each other, and in the second, two groups are walking in the same direction toward a stationary pedestrian. Social-STGCNN [26] does not effectively predict pedestrian trajectories, resulting in overlaps between trajectories of different groups. In contrast, Social-STGMLP accurately predicts the movement trends of pedestrians, consistently following the actual trajectory trends. For stationary pedestrians, the predicted trajectories indicate that the model can capture the fact that the stationary pedestrian is unaffected by the surrounding movement.

Comparison Of Experimental Processes, Model Parameters and Inference Time
In the same experimental environment, the training process comparison between the proposed Social-STGMLP and SGCN [18] is shown in Figure 7. It can be observed that Social-STGMLP becomes more stable as the training progresses and fits the data faster than SGCN [18]. A comparison of model parameters and inference time is shown in Table 5. Social-STGMLP is compared with Social-LSTM [16], SR-LSTM [25], Social-GAN-P [36], and PIF [39]. Social-LSTM models each pedestrian trajectory using an LSTM and shares information through social-pooling layers, resulting in a larger number of parameters and a longer inference time.
Social-STGMLP uses only fully connected layers and layer normalization operations, without attention mechanisms, which saves computational cost. The model has 147 k parameters, and the inference time is 0.0017 s. Compared to the other comparison methods, Social-STGMLP has a simpler structure, significantly reduced inference time, and higher computational efficiency.
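To see why a model built only from fully connected layers and layer normalization stays small, the parameter count of one block can be worked out by hand. The sketch below assumes a square fully connected layer with bias and a layer normalization with learnable scale and shift; the function names are hypothetical, and the totals are illustrative and do not reproduce the reported 147 k figure, which depends on details not given here.

```python
def fc_params(in_dim, out_dim, bias=True):
    """Parameters of a fully connected layer: weight matrix plus optional bias."""
    return in_dim * out_dim + (out_dim if bias else 0)

def layer_norm_params(dim, affine=True):
    """Parameters of layer normalization: per-feature scale and shift if affine."""
    return 2 * dim if affine else 0

def mlp_block_params(dim):
    """One block under the stated assumptions: FC(dim -> dim) followed by LN(dim)."""
    return fc_params(dim, dim) + layer_norm_params(dim)

per_block = mlp_block_params(64)   # width 64, as in the experimental settings
per_branch = 16 * per_block        # 16 blocks in one branch
print(per_block, per_branch)
```

With width 64, a block costs a few thousand parameters, so even a 16-block stack remains in the tens of thousands per branch.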

Conclusions
This paper introduces Social-STGMLP, a pedestrian trajectory prediction method based on Multi-Layer Perceptrons (MLP) that exclusively employs fully connected layers and layer normalization, thereby simplifying the modeling of pedestrian trajectories. Social-STGMLP abstracts pedestrian trajectories into a spatio-temporal graph and individually models spatial social interaction among pedestrians and their temporal motion tendency. We showcase the efficacy of Social-STGMLP over existing methods through evaluation across five different scenarios from the ETH and UCY datasets. Moreover, Social-STGMLP reduces model parameters and improves inference speed, aligning with the growing need for lightweight models in predicting pedestrian trajectories. These enhancements underscore the practical applicability and scalability of Social-STGMLP in autonomous driving systems. We believe that the proposed model contributes to real-time applications such as autonomous driving and intelligent transportation systems, thereby reducing accidents and enhancing pedestrian safety.
In the future, we will further explore expanding models to handle more complex scenarios and integrating additional contextual information to improve the accuracy of the model.

Figure 1. Pedestrian trajectory prediction in a real-world scenario.

Figure 2. The network structure of Social-STGMLP. Social-STGMLP comprises three core modules: feature extraction, feature fusion, and trajectory prediction.

Figure 3. The network architecture of an MLP block. The smallest unit of an MLP block is composed solely of a fully connected layer and a layer normalization.

Figure 4. A comparison of average displacement error.

Figure 5. A comparison of final displacement error.

Figure 6. Visualization of the compared methods in different scenarios. The observed trajectories are illustrated by solid blue lines, while the predicted trajectories are denoted by solid orange lines. Ground truth trajectories are depicted with solid green lines. The closer the predicted trajectory is to the ground truth, the better the experimental performance.

Figure 7. Comparison of training processes. The solid blue line and solid green line correspond to the ADE and FDE metrics of Social-STGMLP, respectively. The dashed black line and dashed red line indicate the ADE and FDE metrics of SGCN, respectively.

Table 1. Comparison with state-of-the-art approaches on the ETH/UCY datasets for ADE/FDE. We report the ADE and FDE in meters for each approach, where lower values indicate better performance. Bold data highlight the best prediction result.

Table 2. Comparison with state-of-the-art approaches on the SDD dataset for ADE/FDE. We report the ADE and FDE in meters for each approach, where lower values indicate better performance. Bold data highlight the best prediction result.

Table 3. Ablation study of the spatio-temporal branches, where lower values indicate better performance. Bold data highlight the best prediction result.

Table 4. Ablation study of the number of MLP blocks, where lower values indicate better performance. Bold data highlight the best prediction result.

Table 5. Comparison of model parameters and inference time.