Deep Neural Networks for Spatial-Temporal Cyber-Physical Systems: A Survey

Abstract: Cyber-physical systems (CPS) integrate communication, control, and computational elements into physical processes to facilitate the effective monitoring and control of physical systems. These systems are designed to interact with the physical world, monitor and control physical processes while in operation, and generate data. Deep neural networks (DNNs) comprise multiple layers of interconnected neurons that process input data to produce predictions. Spatial-temporal data represents the physical world and its evolution over time and space. The generated spatial-temporal data is used to make decisions and control the behavior of CPS. This paper systematically reviews the applications of DNNs, namely convolutional, recurrent, and graph neural networks, in handling spatial-temporal data in CPS. An extensive literature survey is conducted to determine the areas in which DNNs have successfully captured spatial-temporal data in CPS and the emerging areas that require attention. The paper proposes a three-dimensional framework that considers CPS (transportation, manufacturing, and others), Target (spatial-temporal data processing, anomaly detection, predictive maintenance, resource allocation, real-time decisions, and multi-modal data fusion), and DNN schemes (CNNs, RNNs, and GNNs). Finally, areas that need further investigation are identified, including data quality, strict performance assurance, reliability, safety, and security resilience.


Introduction
Cyber-physical systems (CPS) integrate communication, control, and computational elements with physical processes, aiming to improve the monitoring and control of physical components. Because these systems are designed to interact with the physical world, monitoring and controlling physical processes generates a variety of data. The data is used to make decisions and control the behavior of CPS [1][2][3]. As shown in Figure 1, examples of CPS include autonomous vehicles in smart transportation, industrial control systems (ICS) in smart manufacturing, and wearable sensors in medical CPS [4][5][6][7][8]. While operating, CPS generate vast amounts of data that are indexed by space and time. Such data is generally referred to as CPS spatial-temporal data [9][10][11].
In CPS, spatial-temporal data represents the physical world and its evolution over time and space. CPS generate large amounts of data as they monitor and control physical processes in real-world objects. This data is used to make decisions and control behavior, for example by predicting future events, detecting anomalies, and optimizing resource allocation. In a smart transportation system, for instance, spatial-temporal data collected by sensors is used to detect obstacles and make vehicle trajectory decisions. ICS, a crucial component of smart manufacturing systems, use spatial-temporal data to monitor and manage physical processes. The effective management and analysis of spatial-temporal data in CPS require sophisticated techniques and algorithms, including spatial data mining, temporal data analysis, and geographic information systems. The effective use of spatial-temporal data can significantly improve the performance, reliability, safety, and security of CPS [12]. Several techniques have evolved over the last decades to learn the complex patterns and changing dynamics of spatial-temporal data. These include time-series analysis (ARIMA, SARIMA, etc.) [13], spatial data analysis (spatial regression and autocorrelation, kriging, etc.) [14], signal processing approaches (Fourier and wavelet analysis, Kalman filtering, etc.) [15], and machine learning approaches (regression analysis, support vector machines, etc.) [16,17]. Nonetheless, most of these methods are not well suited to large, dynamic, non-stationary data that varies across both space and time [18][19][20][21]. Deep neural networks (DNNs), on the other hand, are applicable because they can handle immense amounts of data and model complex relationships, both spatially and temporally [22]. For example, DNNs predict future events based on spatial-temporal data in CPS, including traffic patterns in smart cities, resource demand, and faults or intrusions in smart manufacturing systems, among others [23][24][25][26].
Existing research efforts have reported numerous successes in applying DNNs, including anomaly detection [27,28], resource management [24], predictive maintenance [29][30][31][32], multi-modal data fusion [33], real-time decision making [34,35], and spatial-temporal data processing [36,37], among others. For example, Luo et al. [28] reviewed the applications of deep learning for anomaly detection in CPS, outlining areas where deep learning has achieved promising results and areas that need improvement. Likewise, Zhang et al. [38] reviewed the historical and state-of-the-art applications of deep learning in energy CPS (i.e., frequency analysis and control in power systems). Carvalho et al. [39] systematically reviewed machine learning applications for predictive maintenance of industrial CPS to determine the best-performing models and the open challenges. However, a review of the applications of DNNs in CPS with spatial-temporal datasets is still needed.
In line with Rowe et al. [40], the survey strategy adopted for this paper explored and selected relevant journals and articles from highly reputable venues that are accessible online. Research databases such as Google Scholar, IEEE Xplore, ACM Digital Library, Science Direct, and Springer were used. We focused on titles, abstracts, keywords, and articles that include 'deep neural networks', 'spatial-temporal data', and 'cyber-physical systems'. Combining the terms with the 'AND' operator, i.e., 'deep neural networks' AND 'spatial-temporal data' AND 'cyber-physical systems', yielded more precise results. Furthermore, each paper was examined carefully to ensure that the selected DNNs were applied to spatial-temporal datasets in CPS in the experiments. Articles that failed to satisfy these requirements were excluded.
This survey paper systematically reviews the applications of DNNs in handling spatial-temporal data in CPS. It outlines the research areas where DNNs have successfully handled spatial-temporal data in CPS, as well as the emerging areas that require improvement. The major contributions of this paper are as follows.

• The applications of deep neural networks (DNNs), namely convolutional, recurrent, and graph neural networks, in handling spatial-temporal data in CPS are systematically reviewed.
• A three-dimensional problem space is proposed that considers CPS (transportation, manufacturing, and others), Target (spatial-temporal data processing, anomaly detection, predictive maintenance, resource allocation, real-time decisions, and multi-modal data fusion), and DNN scheme (CNNs, RNNs, and GNNs).
• Future research directions concerning data quality, strict performance assurance and reliability, safety, and security resilience are outlined.
The remainder of this paper is organized as follows. The background of DNNs, spatial-temporal data, and CPS is reviewed in Section 2. In Section 3, the state-of-the-art DNNs used to handle spatial-temporal data in CPS are explored. In Section 4, the existing research efforts on applying DNNs to different CPS application domains are reviewed. In Section 5, several challenges and future research directions are outlined. Finally, Section 6 concludes the paper.

Preliminaries
This section briefly discusses deep neural networks (DNNs), spatial-temporal data, and CPS.

Deep Neural Networks (DNN)
Generally speaking, a DNN comprises multiple layers of interconnected neurons that process input data to make a decision (prediction, classification, etc.). The term "deep" refers to the many layers, which enable the network to learn increasingly complex features from the input data. DNNs are popular for their ability to learn and extract vital features from data without the help of domain experts. They have been used in different domains for various tasks and purposes (e.g., security, industrial component recognition, integrated design of components to optimize the overall system performance) [41][42][43]. Training a DNN involves adjusting the weights of the connections between neurons with an optimization algorithm to minimize the prediction error. Common DNN architectures include convolutional neural networks (CNNs) for computer vision, graph neural networks (GNNs) for graph-structured data, and recurrent neural networks (RNNs) for sequential data problems such as natural language processing (NLP), among many others.
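As a minimal sketch of what adjusting weights to minimize the prediction error can look like, the following example trains a single linear neuron with plain gradient descent on the mean squared error. The function name, learning rate, epoch count, and toy data are assumptions chosen for illustration; practical DNNs stack many such units with non-linear activations and rely on backpropagation and more elaborate optimizers.

```python
def train_linear_neuron(samples, lr=0.1, epochs=500):
    """Fit y ~ w * x + b by gradient descent on the mean squared error.

    samples is a list of (x, y) pairs; lr and epochs are illustrative
    hyperparameters, not values taken from the surveyed literature.
    """
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        # Gradients of mean((w * x + b - y) ** 2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in samples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in samples) / n
        # Move the weights a small step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (hypothetical).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
weight, bias = train_linear_neuron(data)
print(round(weight, 4), round(bias, 4))
```

On this toy data the learned weight and bias approach the slope and intercept that generated the samples.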

Spatial-Temporal Data
Spatial-temporal data describes the location and time of an observation. The data changes as the location and time change [44], as shown in Figure 2. It is common in different domains, including climate science [45], transportation [46], manufacturing [47], ecology [48], and social sciences [49], among others [50]. Examples of spatial-temporal data include climate data that tracks weather patterns across regions and time [45], traffic data that records vehicle movement and traffic patterns across different times and locations [51], and social media data that captures the space and time of user activity [52]. The characteristics of spatial-temporal data typically result in more complex data correlations than conventional methods can handle. Additionally, the data is frequently strongly autocorrelated and, unlike traditional data, its samples are often not produced independently. Spatial-temporal data is therefore more challenging to process than traditional stationary data. For instance, interpreting spatial-temporal data is more complicated than interpreting pictures, where researchers can rely on visual inspection [44]. The data can be represented as raster images, trajectories, and other forms. Understanding and analyzing spatial-temporal data is difficult due to the complexity and interdependence of the spatial and temporal dimensions. Traditional statistical methods have been applied, but they are not a panacea, which calls for more advanced machine learning techniques such as DNNs.
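To make the notion of self-correlated, non-independent samples concrete, the sketch below stores readings as a small time-by-location grid and computes the lag-1 autocorrelation of the series at each location. The sensor values and helper names are hypothetical and only serve as an illustration.

```python
from statistics import mean

# Hypothetical readings: each row is a time step, each column a sensor location.
readings = [
    [20.0, 21.0, 19.5],
    [20.5, 21.4, 19.9],
    [21.1, 22.0, 20.6],
    [21.8, 22.5, 21.0],
    [22.4, 23.1, 21.7],
]


def series_at(grid, location):
    """Extract the time series observed at one location (one column)."""
    return [row[location] for row in grid]


def lag1_autocorrelation(series):
    """Correlation between a series and itself shifted by one time step."""
    mu = mean(series)
    num = sum((a - mu) * (b - mu) for a, b in zip(series[:-1], series[1:]))
    den = sum((a - mu) ** 2 for a in series)
    return num / den


for loc in range(len(readings[0])):
    print(loc, round(lag1_autocorrelation(series_at(readings, loc)), 3))
```

A positive value indicates that consecutive samples move together, which is why such samples cannot be treated as independent draws.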

Cyber Physical Systems (CPS)
A CPS merges the physical world with the virtual world, driven by information and communication technology, to create an intelligent system that interacts effectively with its environment. As a vertical architecture, CPS has various application domains, including transportation, manufacturing, and healthcare, among others [3,53]. It collects information about the physical world via sensors and acts on the physical system via actuators and other components. To monitor and control physical systems, the collected information needs to be transmitted to a processing unit (cloud, edge servers, etc.). The processing unit analyzes the data, makes decisions, and sends instructions to the actuators to control the physical system. CPS aims to revolutionize a wide range of industries by improving their performance. For example, Industry 4.0 is the CPS vision for revolutionizing manufacturing processes in the industrial domain.

DNNs in CPS-Based Spatial-Temporal Data
In this section, the state-of-the-art DNNs used to handle spatial-temporal data in CPS are explored. As shown in Figure 3, a three-dimensional framework is proposed, in which the X-axis indicates the CPS domains (transportation, manufacturing, and others), the Y-axis displays the targets (spatial-temporal data processing, anomaly detection, predictive maintenance, resource allocation, real-time decisions, and multi-modal data fusion, among others), and the Z-axis reveals the different types of DNNs (i.e., CNNs, RNNs, and GNNs). Note that the purpose of the mapping (say $(X_i, Y_j, Z_k)$) is to categorize the existing efforts in the designed 3D framework. Specifically, on the X-axis, transportation CPS is denoted as $X_1$, manufacturing CPS as $X_2$, etc.; on the Y-axis, $Y_1$ denotes data processing, $Y_2$ anomaly detection, $Y_3$ predictive maintenance, $Y_4$ resource allocation, $Y_5$ real-time decisions, and $Y_6$ multi-modal data fusion; on the Z-axis, CNNs are denoted as $Z_1$, RNNs as $Z_2$, and GNNs as $Z_3$. For example, if a specific DNN (say an RNN) is applied to anomaly detection in transportation CPS, the corresponding effort is categorized as $(X_1, Y_2, Z_2)$ in the defined framework.
The proposed 3D framework can be used to summarize and categorize the existing research efforts concerning DNNs in handling spatial-temporal data in CPS, and to help readers better understand the intersections among different DNNs, application targets, and CPS domains. Furthermore, the framework is generic and can be extended to include more CPS domains, targets, and DNN techniques.
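The categorization described above can be sketched as a small data structure. The class and member names below are illustrative assumptions rather than part of the proposed framework; only the axis indices follow the numbering given in the text, and further domains, targets, or schemes could be appended to the enumerations.

```python
from enum import Enum
from typing import NamedTuple


class Domain(Enum):  # X-axis: CPS domains
    TRANSPORTATION = 1
    MANUFACTURING = 2


class Target(Enum):  # Y-axis: research targets
    DATA_PROCESSING = 1
    ANOMALY_DETECTION = 2
    PREDICTIVE_MAINTENANCE = 3
    RESOURCE_ALLOCATION = 4
    REAL_TIME_DECISION = 5
    MULTI_MODAL_FUSION = 6


class Scheme(Enum):  # Z-axis: DNN schemes
    CNN = 1
    RNN = 2
    GNN = 3


class Cell(NamedTuple):
    """One cell of the three-dimensional problem space."""
    domain: Domain
    target: Target
    scheme: Scheme

    def label(self) -> str:
        return f"(X{self.domain.value}, Y{self.target.value}, Z{self.scheme.value})"


# An RNN applied to anomaly detection in transportation CPS.
effort = Cell(Domain.TRANSPORTATION, Target.ANOMALY_DETECTION, Scheme.RNN)
print(effort.label())  # (X1, Y2, Z2)
```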

The Problem Space
As shown in Figure 3, six targets are defined to represent the fundamental research objectives for which DNNs have been used to handle spatial-temporal data in CPS. The targets emerged from the research strategy described above and from careful consideration of the goals achieved by the surveyed studies. For example, traffic speed, flow, and congestion prediction are achieved by processing road traffic data in transportation CPS. Research with these kinds of goals is categorized under spatial-temporal data processing, while research that aims to detect or prevent attacks is categorized under anomaly detection. Real-time decisions cover autonomous vehicle and industrial control scenarios. In manufacturing CPS, for example, predicting equipment failures is classified as predictive maintenance, while the allocation of production logistics falls under resource allocation. Studies that use data of different types from various sources are categorized under multi-modal data fusion.
The defined targets are elaborated further below:

CPS Application Domains
CPS can be classified into several domains based on the application area and the type of interactions between the physical and cyber components. In this context, CPS is categorized into application domains such as transportation, industrial manufacturing, and others.

Transportation CPS
Transportation CPS is popularly known as smart transportation. Physical systems are integrated with computational and communication techniques to realize various goals, such as improving traffic flow, toll collection, reducing congestion, smart parking, enhancing safety, safe pedestrian crossing, reducing carbon emissions, and autonomous driving, among others. Transportation CPS is a sophisticated, heterogeneous system that aims to offer effective services across various modes of transportation and traffic management. It manages a variety of new data sources, including geospatial transportation area, connected vehicle, roadside unit (RSU), and traffic network data. It also empowers users to be better informed and to use transportation systems in a more innovative, safer, and more organized fashion.
Smart transportation technology can offer services such as using cameras to enforce traffic regulations, dialing 911 in the event of a car accident, and monitoring vehicle speeds against speed limits, among others. Transportation CPS faces several forms of security and privacy issues that target its essential elements, such as IoT devices (sensors, actuators, microcontrollers, etc.), cloud services, and location-based services. Examples of security issues include data tampering, man-in-the-middle (MITM) attacks, eavesdropping, impersonation, distributed denial of service (DDoS) attacks, and artificial intelligence (AI)-based attacks. In addition, model inversion, model poisoning, model evasion, and model extraction are typical ways of attacking AI models and can have severe impacts on driverless cars. Privacy concerns include location privacy and commuter privacy. Both industry experts and academic researchers are actively working to address the security and privacy issues in transportation CPS. Despite these challenges, transportation CPS has the potential to revolutionize transportation systems, making them safer, more efficient, and more sustainable.
Transportation CPS can be categorized into vehicle transportation CPS (which involves the integration of computing, communication, and physical systems within a vehicle); infrastructure transportation CPS (which involves the integration of computing, communication, and physical processes in transportation infrastructure); and system-level transportation CPS (which involves the integration of computing, communication, and physical systems at the system level).
Vehicle transportation CPS includes sensors, actuators, embedded systems, autonomous driving functions, etc. Its main goal is to improve safety, energy efficiency, and user experience. Infrastructure transportation CPS includes traffic lights, sensors, cameras, communication networks, and other components that enable traffic monitoring, control, and management. Its main objectives are to increase safety, improve traffic flow, and lessen congestion. System-level transportation CPS includes data analytics, simulation, optimization, and control algorithms that enable efficient and effective transportation planning, operation, and management. Its main goal is to improve the overall performance and sustainability of transportation systems. The challenges faced by transportation CPS today include real-time requirements for processing large amounts of data, secure communication media, interoperability among different systems, and effective collaboration between different stakeholders. As shown in Figure 4, using the autonomous car as an example in the transportation CPS domain, the sensors and actuators collect and send spatial-temporal data of the moving vehicle. With the help of the DNN model, the autonomous vehicle obtains updated instructions about its environment.

Manufacturing CPS
Manufacturing CPS, also known as smart manufacturing or Industry 4.0, integrates physical processes with advanced computing and communication technologies to optimize manufacturing production and operation processes. These systems monitor and regulate physical processes using sensors and actuators, while using data analytics and machine learning algorithms to streamline operations, boost productivity, and reduce costs. Manufacturing CPS applies to various manufacturing processes, including assembly lines, material handling, quality control, maintenance, and supply chain management. The Industrial Internet of Things (IIoT) and machine learning are its critical enabling blocks. The former connects machines, sensors, and other devices to a network to collect and share data, which allows equipment to be monitored in real time and machines to be controlled and adjusted remotely. The latter makes use of the data generated by the system. This includes predictive analytics, which can forecast equipment failures and maintenance needs, and prescriptive analytics, which can optimize manufacturing processes and improve product quality. With real-time data and analytics, industrial manufacturing CPS helps manufacturers optimize their production processes, reduce energy consumption, minimize waste, and enhance product quality.

DNN Techniques
There are different variants of DNNs. The most prominent techniques for handling spatial-temporal data in CPS are recurrent neural networks (RNNs), convolutional neural networks (CNNs), and graph neural networks (GNNs). Note that while RNNs, CNNs, and GNNs are discussed here, the designed framework can be extended to other DNN techniques.

CNN
A CNN is a class of DNN that is mainly used for image recognition and computer vision tasks. Its essential attribute is the ability to learn a spatial hierarchy of features from the input data, making it highly effective for tasks that require an understanding of the visual context of images. CNNs comprise several types of layers, including convolution, pooling, and fully connected layers. Each layer performs a specific function to extract features from the input data. The convolution layer applies filters (kernels) that slide over the entire input image to capture specific features such as edges or corners. Each filter produces a feature map that represents the presence of a particular feature in the input image. The pooling layer downsamples the feature maps obtained from the convolution layer to reduce their dimensionality and make subsequent computations more efficient. A max pooling layer, for example, selects the maximum value from each sub-region of the feature map. After several convolution and pooling operations, the output is passed to the fully connected layers, which transform the extracted features into a set of probabilities, one for each possible class. The final output is compared with the actual labels, and the network weights are updated using backpropagation to minimize the gap between the predicted and actual outputs. CNNs have been applied to various tasks, including object identification, facial recognition, and image classification. They have also been extended to other types of data (speech, text, video, etc.), making them a potent tool for numerous machine learning problems.
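The convolution and max pooling steps described above can be illustrated with a small sketch in plain Python. The toy image and the hand-picked edge filter are assumptions made for the example; in a trained CNN the filter values are learned, and a non-linear activation would normally sit between the two steps.

```python
def conv2d(image, kernel):
    """Slide a 2-D kernel over a 2-D image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + a][j * size + b]
                for a in range(size) for b in range(size))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


# Toy 5x5 image whose intensity jumps from 0 to 1 between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
vertical_edge = [[-1, 1], [-1, 1]]  # hypothetical hand-picked filter

feature_map = conv2d(image, vertical_edge)  # 4x4 feature map
pooled = max_pool(feature_map)              # 2x2 map after pooling
print(feature_map[0])  # [0, 2, 0, 0]
print(pooled)          # [[2, 0], [2, 0]]
```

The feature map responds only where the vertical edge lies, and pooling keeps that response while halving each spatial dimension.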

RNN
An RNN is a DNN designed to process sequential data, such as time series or text, by maintaining an internal memory or state. The basic idea behind RNNs is that the output at a given time step is affected not only by the current input but also by previous inputs, through the state of the network. The network comprises a series of interconnected recurrent cells, which process the input at each time step and update the internal state of the network. Each cell takes the current input and the previous state, produces an output, and passes a new state to the next cell in the sequence. RNNs are trained with backpropagation through time, a variant of the standard backpropagation algorithm that updates the network weights based on the error signal at each time step. RNNs are suitable for application areas such as language modeling and speech recognition. They are particularly effective for tasks that require an understanding of the temporal dependencies in sequential data. Nonetheless, RNNs are challenging to train and prone to vanishing and exploding gradients. Various mechanisms have been developed to mitigate these drawbacks, such as gradient clipping and regularization techniques (e.g., dropout).
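The state update described above can be sketched with a scalar recurrent cell. The tanh update rule and the fixed weights are assumptions for illustration; a real RNN uses weight matrices learned by backpropagation through time.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, bias):
    """One step of a scalar recurrent cell: h_t = tanh(w_x * x_t + w_h * h_prev + bias)."""
    return math.tanh(w_x * x + w_h * h_prev + bias)


def run_rnn(inputs, w_x=0.5, w_h=0.8, bias=0.0, h0=0.0):
    """Feed a sequence through the cell and collect the state after each step."""
    states = []
    h = h0
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, bias)
        states.append(h)
    return states


# Two sequences that differ only in their first input.
with_pulse = run_rnn([1.0, 0.0, 0.0])
without_pulse = run_rnn([0.0, 0.0, 0.0])
print(with_pulse)
print(without_pulse)
```

Although both sequences end with the same inputs, the final states differ, because the first input is still carried in the hidden state. The influence shrinks at each step, which gives a small-scale view of why long-range dependencies are hard for plain RNNs to retain.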
RNN has two representative variants: one is LSTM and the other is GRU, which are briefly explained below.
• LSTM: It is an RNN variant designed to address the difficulties of conventional RNNs in handling sequential data. Conventional RNNs suffer from the "vanishing gradient problem", which limits their capability of capturing long-term dependencies between input and output sequences. LSTMs improve on this by remembering and selectively forgetting information over longer time horizons, making them effective for modeling sequences of variable length, such as those in natural language processing (text or speech). An LSTM unit consists of a memory cell, three gating units (an input, a forget, and an output gate), and activation functions. The memory cell stores information over long periods and passes it on to the next time step. The input gate controls the flow of data into the memory cell according to the current input and the previous output. Based on the current input and the previous output, the forget gate determines the data to be erased from the memory cell. The output gate determines the output based on the current input and the current state of the memory cell. The activation functions, typically hyperbolic tangent and sigmoid, are used to compute the gate values and the cell's current state. At each time step, the LSTM unit receives an input vector and a hidden state vector from the previous time step and produces an output vector and a new hidden state vector. The input vector is passed through the input gate, and the output gate determines the output vector.
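The gating mechanism described above can be sketched with a scalar LSTM step. The equations follow the commonly used LSTM formulation, which is an assumption here because the text does not spell them out; the parameter dictionary and its values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    params maps each of "i", "f", "o", "g" to a (w_x, w_h, bias) triple.
    """
    def pre(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre("i"))    # input gate: how much new content to write
    f = sigmoid(pre("f"))    # forget gate: how much old memory to keep
    o = sigmoid(pre("o"))    # output gate: how much memory to expose
    g = math.tanh(pre("g"))  # candidate content for the memory cell
    c = f * c_prev + i * g   # updated memory cell
    h = o * math.tanh(c)     # new hidden state (the cell's output)
    return h, c


# Hypothetical parameters: forget gate nearly open, input gate nearly closed.
remember = {"i": (0.0, 0.0, -10.0), "f": (0.0, 0.0, 10.0),
            "o": (0.0, 0.0, 10.0), "g": (1.0, 0.0, 0.0)}

h, c = 0.0, 1.0
for x in [0.3, -0.7, 0.5]:
    h, c = lstm_step(x, h, c, remember)
print(round(c, 3))
```

With these gate settings the memory cell stays close to its initial value over several steps, illustrating how an LSTM can retain information across time.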

GNN
A GNN is an architecture designed to operate on graph-structured data, such as traffic networks, social networks, and molecular structures. It incorporates information about the graph's structure and the relationships between nodes into its computations. The fundamental idea behind GNNs is to represent each node in the graph as a vector or tensor and to use message passing between nodes to update these representations based on the graph's structure. At each iteration of the message-passing process, each node aggregates information from its neighbors and updates its representation using a learned update function. There are different variations on the basic GNN architecture, depending on the specific problem being addressed [54]. For example, some GNNs use graph convolutional layers to learn local features of the graph, while others use attention mechanisms to learn global relationships between nodes. Some GNNs are designed to handle dynamic graphs that change over time, while others are designed for heterogeneous graphs with nodes and edges of different types. GNNs have been applied to address various problems, including traffic prediction, social network analysis, recommendation systems, and drug discovery, among others.
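One round of the message-passing process described above can be sketched as follows. As a simplifying assumption, the learned update function is replaced by a fixed mixing weight and neighbors are aggregated by their mean; the three-node road graph and its feature values are hypothetical.

```python
def message_passing_round(features, neighbors, self_weight=0.5):
    """Mix each node's feature vector with the mean of its neighbors' vectors."""
    updated = {}
    for node, vec in features.items():
        nbrs = neighbors.get(node, [])
        if nbrs:
            # Aggregate: element-wise mean over the neighbors' current features.
            agg = [sum(features[n][k] for n in nbrs) / len(nbrs)
                   for k in range(len(vec))]
        else:
            agg = list(vec)  # an isolated node keeps its own features
        # Update: fixed convex combination standing in for a learned function.
        updated[node] = [self_weight * v + (1 - self_weight) * a
                         for v, a in zip(vec, agg)]
    return updated


# Hypothetical road graph A - B - C with one feature per road segment.
features = {"A": [4.0], "B": [0.0], "C": [8.0]}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}

print(message_passing_round(features, neighbors))
# {'A': [2.0], 'B': [3.0], 'C': [4.0]}
```

Repeating the round lets information travel further across the graph, since each additional round reaches neighbors one hop further away.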

CPS Application Domains
As shown in Figure 3, several representative CPS application domains (transportation, manufacturing, and others) are considered. Note that transportation CPS and manufacturing CPS are used as two key examples to illustrate the framework outlined in Section 3 and to show the existing efforts on applying DNNs in representative CPS.

Transportation-Based CPS
We now review the recent literature on DNNs that capture the latent spatial-temporal features of data in transportation-based CPS with respect to traffic forecasting, threat detection, data inconsistency identification, and autonomous vehicle collision prediction.

Traffic Forecasting
Traffic forecasting predicts future traffic patterns in an urban transportation system, which is necessary for traffic control, navigation systems management, and transportation planning. Accurate traffic forecasting aids in reducing congestion, enhancing safety, and maximizing the use of available transportation resources. DNNs learn the latent relationships and patterns within traffic data and generate predictions based on those patterns. This motivated the effort of Zhou et al. [37], who proposed a "wide-attention and deep-composite (WADC) model". To investigate its performance, they used CNN-LSTM to train the model with spatial-temporal traffic flow datasets. The results showed that it outperformed other models. Similarly, Guo et al. [55] proposed a "graph attention-temporal convolutional network (GATCN)" to forecast traffic speed in the short term. Graph attention and temporal convolution networks are combined to form each layer of the GATCN so that the hidden spatial-temporal relationships are captured concurrently. Likewise, Ma et al. [56] proposed combining a capsule network (CapsNet) and a nested LSTM (NLSTM) for network speed prediction, in which CapsNet extracts extensive spatial features from roadway networks, while NLSTM captures the hierarchical temporal dependencies of traffic states.
Furthermore, Yan et al. [57] aimed to achieve an accurate and adaptable scheme for traffic flow prediction by proposing a graph-based network model. The model employed a fully connected layer to create a matrix from traffic data. LSTM was applied to the data to capture the temporal dependency, while ChebNet captured the spatial dependency. The spatial and temporal attributes were then combined for accurate traffic flow forecasting. Han et al. [58] stated that graph-based neural networks could be applied to enhance traffic speed forecasting. To this end, they proposed a scheme that can learn time-specific spatial dependencies, together with a dynamic graph convolution module that aggregates the hidden states of neighboring nodes to focal nodes using dynamic adjacency matrices and message passing. According to their study, the proposed scheme could offer clear and interpretable spatial relationships between road segments.
Furthermore, Tian et al. [59] proposed a multi-step prediction model that integrates CNN with an attention mechanism. In this way, the spatial-temporal dependencies of road networks could be captured and traffic conditions forecast. With self-adaptive node embedding, the model is capable of extracting the latent spatial relationships in the data even without prior knowledge of the graph topology. Li et al. [60] observed that spatial-temporal correlations among road networks are changeable and complex, and proposed a dynamic traffic flow prediction model. Their model comprises an adaptive mechanism block that preprocesses the data, improves its quality, and passes it to a multi-sensor data correlation convolution block, which learns the dynamic temporal and spatial correlations among roads.
There are other related efforts. For example, Bai et al. [23] aimed for an effective traffic jam forecasting strategy in smart cities by proposing a "Relative Position Congestion Tensor (RPCT)" and a predictor for the "Position Congestion Tensor". The proposed schemes leveraged the concept of relative locations to build congestion matrices on regional traffic networks and convert them into spatial-temporal tensors. ConvLSTM was used to forecast future traffic congestion across the entire road network. Likewise, Lin et al. [61] discussed the significance of accurate traffic condition predictions in intelligent transportation systems. To this end, they proposed a "graph convolution gated RNN (GCGRNN)" to analyze multistep traffic volume by automatically determining the spatial-temporal dependencies in historical traffic data; GCGRNN is based on an encoder-decoder RNN and a data-driven graph filter. One benefit of their approach is that the graph convolution does not depend on a predefined adjacency matrix.
There are also related studies on flight networks. For example, Cai et al. [62] proposed an approach to flight delay forecasting. Their approach leveraged graph convolutional neural networks (GCNs), which capture insightful information about the airport network. In their study, an adaptive graph convolutional block was embedded in the proposed scheme so that the hidden spatial interactions in an airport network could be exposed. As another example, Peng et al. [63] observed that CNNs, GCNs, and RNNs were the most frequently used models for extracting spatial-temporal features from traffic networks. They added that dynamic graphs could be more effective at reflecting the spatial-temporal features of the traffic network, but that generating graph structures from data can be difficult. Thus, they proposed a long-term traffic flow prediction scheme that relies on GCN-LSTM to extract the spatial-temporal features used for prediction. Furthermore, they developed a graph convolutional policy network based on reinforcement learning to create dynamic graphs when static ones are lacking because of data sparsity. These efforts can be mapped to the cube $\langle X_1, Y_1, Z_1/Z_2/Z_3 \rangle$ in Figure 3 and Table 1. That is, in these efforts, all the representative DNNs (CNNs, RNNs, and GNNs; $Z_1$, $Z_2$, and $Z_3$) are utilized for traffic prediction ($Y_1$) in transportation CPS ($X_1$).

Threat Detection
Threat detection involves ensuring the safety and reliability of transportation CPS by training models on data describing standard system behavior and flagging deviations from that standard as anomalies. Some of these anomalies may be associated with cyberattacks, equipment failures, or environmental disturbances. For example, Kong et al. [64] proposed a framework combining trajectory data with environmental perception to detect outliers in driving behavior. The framework comprises trajectory processing, classification, and a mix of spatial-temporal-cost environments. Karim et al. [65] aimed to improve traffic safety by predicting accidents early from video data recorded by dashboard cameras, for which they studied a dynamic spatial-temporal attention (DSTA) network model. The presented model combines dynamic temporal and spatial attention modules to focus on the most informative segments of a video and the most informative spatial regions of frames. A gated recurrent unit module predicts the probability of a future accident.
Likewise, Diao et al. [66] aimed to prevent traffic accidents by proposing CRFAST-GCN, a multi-branch spatial-temporal attention graph convolution network that extracts long- and short-term dependencies, semantic similarity, and periodicity. Furthermore, Chen and Lv [67] considered improving the safety, performance, and development of intelligent transportation systems for autonomous vehicles in smart cities using digital twins and AI-based technologies. An architecture was proposed that uses the 5G network to provide resource load-balancing scheduling and thereby secure the transmission of autonomous vehicle data. A spatial-temporal graph convolution network technique was designed to forecast traffic flow in road networks and, using the concept of the digital twin, to analyze the compound traffic conditions in the road network area in real space. These efforts can be mapped to the cube <X1, Y2, Z3> in Figure 3 and Table 1. That is, in these efforts, GNNs (Z3) are used to detect threats (Y2) in transportation CPS (X1).
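The principle stated at the start of this subsection, learning standard behavior and flagging deviations from it, can be sketched with a deliberately simple statistical stand-in. The example below uses a z-score threshold rather than a DNN, so it illustrates only the detection logic and none of the reviewed models. The speed readings, the threshold of 3, and the function names are hypothetical.

```python
import statistics
from typing import List, Tuple


def fit_baseline(normal_values: List[float]) -> Tuple[float, float]:
    """Summarize standard behavior by its mean and population standard deviation."""
    return statistics.mean(normal_values), statistics.pstdev(normal_values)


def is_anomalous(value: float, mean: float, std: float, threshold: float = 3.0) -> bool:
    """Flag a reading whose distance from the baseline exceeds `threshold` deviations."""
    if std == 0:
        return value != mean
    return abs(value - mean) / std > threshold


# Hypothetical readings collected under normal operation.
normal_speeds = [10.0, 12.0, 11.0, 9.0, 13.0]
mean, std = fit_baseline(normal_speeds)

print(is_anomalous(11.5, mean, std))  # close to the baseline
print(is_anomalous(30.0, mean, std))  # far from the baseline
```

The reviewed approaches replace the mean and standard deviation with a learned spatial-temporal model, but the final decision step is often a comparable threshold on a deviation score.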

Data Inconsistency Identification
Data inconsistency identification entails identifying and resolving inconsistencies in spatial-temporal datasets to ensure their accuracy and reliability. Related to this, Liang et al. [68] proposed a spatial-temporal aware data recovery network (STAR) to address the real-time spatial-temporal data imputation problem in a cooperative intelligent transportation system. The model is designed to handle three types of data recovery tasks in real time and with inductive inference. Likewise, to infer missing values in spatiotemporal input data, Kong et al. [69] proposed a novel paradigm for imputing traffic data. The model substantially decreased the imputation error compared with the state of the art. Additionally, the correlated information extracted from historical observations is used to deal with missing values. These efforts can be mapped to the cube <X1, Y5, Z3> in Figure 3 and Table 1.
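As a point of reference for what imputation from historical observations means in its simplest form, the sketch below fills each missing reading with the mean of the same time slot on previous days. It is a naive baseline under stated assumptions, not the method of [68] or [69]. The data layout (one list of readings per day, `None` for a missing value) and the function name are hypothetical.

```python
from typing import List, Optional

Series = List[Optional[float]]


def impute_with_history(today: Series, history: List[Series]) -> Series:
    """Fill missing slots in `today` with the mean of the same slot on past days.

    A slot stays None when no past day has an observation for it.
    """
    filled: Series = []
    for slot, value in enumerate(today):
        if value is not None:
            filled.append(value)
            continue
        past = [day[slot] for day in history if day[slot] is not None]
        filled.append(sum(past) / len(past) if past else None)
    return filled


# Hypothetical traffic counts for three time slots on two previous days.
history = [[10.0, 20.0, 30.0], [14.0, None, 34.0]]
today = [12.0, None, None]
print(impute_with_history(today, history))
```

Learned imputation models improve on this baseline by also exploiting correlations between neighboring sensors rather than treating each location independently.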

Autonomous Vehicle Collision Prediction
Autonomous vehicle collision prediction entails forecasting the likelihood of a collision between an autonomous vehicle and another object, such as a pedestrian or vehicle. Related to such an effort, Malawade et al. [70] proposed a spatial-temporal scene-graph embedding technique (SG2VEC), which adopts GNNs and LSTM layers to predict future autonomous vehicle accidents with the assistance of visual scene perception. Likewise, Sun et al. [71] adopted GNN and RNN to propose a global scheme called GST-GAT for traffic prediction. The framework leveraged "global interaction + node query" as a coherent way of passing information between nodes, which captures the spatial-temporal interaction between traffic road networks. These efforts can be mapped to the cube <X1, Y5, Z3> in Figure 3 and Table 1.
Table 2 summarizes the identified research gaps and the contributions made by the reviewed efforts in the transportation domain.

Manufacturing CPS
In this section, we discuss research efforts that apply DNNs to capture the latent spatial-temporal attributes of manufacturing CPS data (real-time monitoring of factory logistics, production resource allocation, threat detection, etc.).

Real-time Monitoring of Factory Logistics
Wu et al. [34] proposed a scheme that integrates industrial IoT with digital twin technology to enable timely spatial-temporal traceability and visibility of manufacturing resources for efficient factory logistics. In their study, an LSTM network-based genetic indoor-tracking model was created and used to locate product trolleys with Bluetooth low energy and ultra-wide band technology. The extracted spatial-temporal features were used to activate location-based services for operational efficiency. This effort can be mapped to the cube <X2, Y5, Z2> as shown in Figure 3 and Table 3.

Production Resources Allocation
Zhao et al. [24] proposed a model that improves production logistics efficiency through effective resource allocation. The model adopts dynamic knowledge graph modeling and the digital twin spatial-temporal mapping method to learn and represent the spatial-temporal values and relationships among the resources. A graph algorithm is employed to allocate the resources. This effort can be mapped to the cube <X2, Y3, Z3> as shown in Figure 3 and Table 3.

Threat Detection
Anomaly detection mechanisms in manufacturing CPS are only effective if the nonlinear spatial-temporal features of the industrial processing data are considered [25]. The authors of [25] proposed a method based on spatial-temporal modeling (AD-RoSM) for detecting false data injection attacks (FDIA) in ICS. Their scheme employs a neural-based state estimation model that uses a CNN for time-related modeling and a mechanism for space-related modeling. In this way, the spatial-temporal correlations within the process data can be described explicitly. Yang et al. [72] proposed a graph representation-based scheme for the detection of multivariate time series anomalies in highly complex industrial processes. Their model, referred to as HiSTAR, improves on existing techniques by offering spatial-temporal feature extraction and decision criteria based on spatial-temporal graph modeling with no predefined topological priors and a discriminative decision boundary. HiSTAR was shown to provide the expected anomaly detection performance and anomaly localization outcomes. Likewise, Liu et al. [73] applied a CNN to manufacturing spatial-temporal data to identify abnormal production processes. Their study used a pasting process in lead-acid battery production as a case study. The CNN-based approach was designed to recognize abnormal processes by analyzing spatial-temporal data from sensors. These efforts can be mapped to the cube <X2, Y2, Z1/Z3> as shown in Figure 3 and Table 3.

Predictive Maintenance
Li et al. [74] proposed a convolutional network model that mines deterioration information in order to anticipate the remaining useful life of a machine. Their scheme models the sensor network by taking into account both the spatial and temporal dependencies of the sensors. It adopts a hierarchical graph representation layer to model spatial dependencies, a bi-directional LSTM to model temporal dependencies, and a regularized self-attention graph pooling for effective information fusion. Yang et al. [75] proposed SuperGraph, a feature extraction technique for diagnosing rotating machinery faults. The technique adopts graph theory-based spectrum analysis so that a spatial-temporal graph can be constructed and a Laplacian matrix-based feature vector can be derived. A GCN was further used to learn the latent features. Shcherbakov et al. [77] proposed a hybrid multi-task learning framework that integrates CNN and LSTM to reflect the relatedness of functional life prediction and health status detection for complex multi-object systems in the CPS environment. The CNN extracts significant spatial-temporal features from raw multi-sensory input data and compresses the condition monitoring data, while the LSTM captures the temporal dependencies. As another example, Zhang et al. [76] proposed an equipment fault prediction technique using spatial-temporal graph information. Their scheme has the potential to prevent fatal damage and reduce equipment maintenance costs. Their experimental results showed that the approach offers precise short-term and long-term fault prediction. These efforts can be mapped to the cube <X2, Y3, Z2/Z3> as shown in Figure 3 and Table 3.
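The Laplacian matrix mentioned in connection with SuperGraph [75] can be illustrated with a short sketch. It computes the combinatorial graph Laplacian L = D - A from an adjacency matrix, where D is the diagonal degree matrix. How [75] builds its graph and turns the Laplacian into a feature vector is not reproduced here. The three-node path graph and the function name are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def laplacian(adj: Matrix) -> Matrix:
    """Return L = D - A, where D holds each node's degree on the diagonal."""
    n = len(adj)
    return [
        [(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical sensor graph: a path of three nodes, 0 - 1 - 2.
adj = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
lap = laplacian(adj)
for row in lap:
    print(row)
```

Every row of L sums to zero, and its eigenvalues describe how smoothly a signal varies across the graph, which is why spectral features derived from it are useful for fault diagnosis.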
There are other related efforts concerning predictive maintenance. For instance, Xiong et al. [78] discussed the importance of human-robot collaboration (HRC) in smart manufacturing processes and the role of human action recognition in enabling HRC. In their study, a method based on optical flow and CNN transfer learning was proposed. Their scheme leverages optical flow to extract time-related information from video images and simultaneously parses spatial-temporal information with a two-stream CNN structure. Transfer learning was also leveraged to establish feature extraction capability by pre-training the model on a non-manufacturing-specific dataset and transferring the gained knowledge to the target domain of assembly tasks, which has limited training samples. Zheng et al. [79] addressed the problems of scene recognition in underground coal mining using CNN, LSTM, and an attention mechanism. Jia et al. [80] proposed a data-driven method using a graph convolution network to model the compound and time-varying characteristics of the process industry. The technique captures the relationships among variables. The model was trained with regularization terms so that distinctive localized spatial-temporal correlations can be learned and time-series properties can be derived using temporal convolution. Furthermore, Li et al. [81] proposed CLSTMA, a hybrid model that sequentially fuses a CNN, an LSTM, and an attention mechanism to monitor and predict water quality in a wastewater treatment system and to assist in the reduction of energy and emissions. Their scheme captures the fused spatial features using the CNN, the temporal information using the LSTM, and variable-weighted calculations using the attention mechanism. Likewise, Guo et al. [82] employed historical energy consumption time series and prior knowledge of material flow to propose a spatial-temporal deep learning network (STDLN) framework, which merges a GCN and a GRU and forecasts the energy consumption of nodes.
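The attention-based weighting used in hybrid models such as CLSTMA [81] can be sketched in a generic form. The example below turns one relevance score per time step into softmax weights and uses them to form a weighted sum of hidden states. It illustrates the general mechanism only; the way [79] or [81] compute their scores is not specified here, and the hidden states, scores, and function names are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    hidden_states: List[List[float]], scores: List[float]
) -> Tuple[List[float], List[float]]:
    """Return the attention-weighted sum of hidden states and the weights used."""
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return context, weights


# Hypothetical LSTM outputs for two time steps, with equal relevance scores.
hidden_states = [[1.0, 0.0], [0.0, 1.0]]
scores = [0.0, 0.0]
context, weights = attention_pool(hidden_states, scores)
print(weights, context)
```

With equal scores the result is a plain average; a trained model would learn scores that emphasize the time steps or variables most relevant to the prediction.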
In order to enhance maintenance practices in production CPS, Bampoula et al. [83] adopted autoencoders to conduct predictive maintenance so that maintenance planning can be based on real-time machine operation. Table 3 and Table 4 summarize the identified research gaps and the contributions made by the reviewed efforts in the manufacturing domain.

Other CPS
Apart from transportation CPS and manufacturing CPS, there are other types of CPS in different application domains, such as smart cities, medical CPS, aviation CPS, etc.

Flood Prediction
Related to smart cities as an important application domain of CPS, Chen et al. [84] proposed a flood process prediction model based on CNN using a decade's worth of historical data collected by smart sensors in city infrastructure. To predict the peak of the flood and its arrival time, the model takes rainfall spatial-temporal, geographical, and trend features into account. The model was presented to predict stream flow by integrating the rainfall spatial-temporal feature obtained through analyzing the historical stream flow and the digital elevation model data. This effort can be mapped to the cube <X3, Y1, Z1/Z3> as shown in Figure 3 and Table 5. Related to smart cities and aviation CPS, Jiang et al. [85] proposed a GNN-based approach for predicting air mobility to enable the control and decision-making process in the airport of things. In their study, a spatial-temporal GCNN was employed to capture the latent characteristics of the graph-structured data. Their approach was validated using airline on-time performance data and found to be effective in predicting spatial-temporal air mobility. This effort can also be mapped to the cube <X3, Y1, Z1/Z3> as shown in Figure 3 and Table 5.

Physical Attack Detection
Related to smart cities, Pan et al. [86] adopted ConvLSTM to propose a method for detecting threats (from cyber or physical spaces) against cyber-physical surveillance cameras. The technique uses a new video frame interpolation to detect video anomalies in spatial-temporal feeds. This effort can be mapped to the cube <X3, Y2, Z1/Z2> as shown in Figure 3 and Table 5.

Real-time Fire Identification Systems
Also related to smart cities, Zhang et al. [87] developed a real-time fire identification system that uses an IoT sensor network, cloud server, AI engine, and user interface to collect, store, process, and display complex building fire information. Their system also leveraged a ConvLSTM neural network. The neural network was trained on numerical data and validated in a fire test room with successful results. This effort can be mapped to the cube <X3, Y5, Z1/Z2> as shown in Figure 3 and Table 5.

Medical CPS
Wang et al. [8] developed a framework (PhysiQ) that uses passive sensory detection to track and objectively assess people's off-site physical therapy exercises in real time using a smartwatch. The system used a multi-task spatial-temporal Siamese neural network to evaluate the effectiveness of exercises based on absolute and relative quality. PhysiQ assessed exercises using three metrics: range of motion, stability, and repetition. Ge et al. [89] adopted RNN-LSTM with an attention mechanism to determine the specific variable patterns in a medical application. Likewise, Pan et al. [88] proposed a temporal-based Swin Transformer network (TSTNet) for the surgical video workflow recognition problem. These efforts can be mapped to the cube <X3, Y5, Z1/Z2> as shown in Figure 3 and Table 5.
Table 6 outlines the identified research gaps and the contributions made by the reviewed efforts in the other CPS domain.

Challenges and Future Research Directions
Although DNNs have achieved remarkable success in handling spatial-temporal data in CPS, there are some limitations and challenging issues that require attention in future research. As for the limitations, DNNs are not a panacea for all spatial-temporal data problems in CPS, which calls for integrating other sophisticated machine learning and data analytics techniques (continuous learning and transfer learning, among others). Similarly, in the areas where DNNs have been successfully applied, they raise additional challenging issues for CPS, ranging from longer training times to insufficient training data, which in turn conflict with the strict performance requirements of CPS.
Note that the challenges and future research directions listed in this section are based not only on the thorough literature review of this topic but also on our research experience and vision. As future research directions, we outline three fundamental challenges: Data Quality Assurance, Strict Performance Assurance, and Reliability, Safety, and Security Resilience, which consider both the data quality that affects the effectiveness of DNNs and the performance requirements of CPS. Therefore, the purpose of this section is to present the limitations and challenges examined, followed by future research directions from our vision; we believe that those challenges should be addressed by the research community. Other technical challenges that can affect the application of DNNs to spatial-temporal data in CPS are high computational power requirements, problem complexity, and the choice of learning hyperparameters.

• Performance: Real-time communication could be impacted by the latency introduced by several protocols, especially when event-driven communication and detection are involved. Some protocols influence the performance of DNNs when handling CPS spatial-temporal data by affecting data transmission, size, latency, reliability, and synchronization. Examples include network protocols (UDP and TCP, which determine the reliability and latency of data transmission), data serialization protocols (JSON and Protocol Buffers, which affect the data size and encoding/decoding overhead), compression protocols (which shrink the data size during transmission to improve network performance), real-time communication protocols (MQTT and DDS, which provide low-latency publish-subscribe messaging for timely data delivery), and synchronization protocols (PTP and NTP, which ensure time synchronization in distributed systems and thereby aid coordinated processing). Cross-platform sensor-actuator communication remains a challenging issue, and it is important to design a comprehensive quality of service framework that satisfies the performance requirements of CPS. In addition, sensor failure is another remaining challenge because most CPS rely heavily on sensing data for control and monitoring purposes. The entire CPS will not function well if some sensors within the ecosystem fail. Thus, the deployment model shall be thoroughly studied to guarantee the robustness of sensor deployment (e.g., coverage, connectivity). Therefore, it is critical to design a holistic solution that ensures the overall performance of CPS by considering all components and their integration as one complex system. The performance of CPS depends on the performance of computing, control, and communication. Thus, it is critical to design modeling and optimization techniques that integrate all components (sensing, networking, computing, and data). Some existing research efforts have addressed the integration of some components (sensing, control, networking, and data). Nonetheless, how all components interact and interplay jointly, leading to a unified design and optimization strategy that enhances the performance of CPS, is worth investigating.

• Security: CPS has unique system requirements and security challenges. Specifically, the confidentiality, integrity, and availability (CIA) security paradigm has been widely used to design security standards for information technology-driven systems. For example, availability is a crucial security property and an essential requirement of a CPS. Different threats (DoS attacks, malware propagation, etc.) could affect the availability of CPS. In this situation, computing and networking components in the CPS shall employ effective mitigation measures so that malicious computing requests and traffic can be detected in time and the impact of such attacks can be effectively mitigated. For CPS integrity, an ML model that depends on real-time data inputs is critical for the realization of highly dependable and trustworthy CPS (transportation infrastructure, manufacturing infrastructure, etc.). Data fidelity is crucial for the CPS, as it is this information that allows the physical system to be simulated and directed accurately and quickly in response to environmental changes.
In CPS, an adversary could compromise the integrity of sensing data by intercepting either the communication channel, using a man-in-the-middle (MITM) attack, or the commands transmitted by the programmable logic controllers. Thus, security measures (device authentication, etc.) shall be in place to prevent unauthorized users from changing data. Although solutions based on cryptography have been promoted in the context of CPS, such as those that use TLS, HMACs, or other authentication and integrity guarantees, such countermeasures have historically not been widely used due to hardware restrictions and the relative computational cost of deploying protocols and mechanisms. Table 7 summarizes the challenges of utilizing DNNs on CPS spatial-temporal data.
Based on the requirements of performance and security in CPS, we consider the following fundamental challenges that require further research concerning the performance of DNNs and the performance requirements of CPS.
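To illustrate the HMAC-based integrity protection mentioned above, the following minimal sketch signs a sensor message with a shared key and rejects a tampered copy. It uses Python's standard `hmac` and `hashlib` modules. The key, the message layout, and the function names are hypothetical, and key distribution, replay protection, and encryption are outside the scope of the sketch.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over the message."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(key, message), tag)


# Hypothetical shared key and sensor reading.
key = b"shared-secret"
message = b"sensor=7;t=1700000000;speed=42.5"
tag = sign(key, message)

print(verify(key, message, tag))                               # untouched message
print(verify(key, b"sensor=7;t=1700000000;speed=99.9", tag))   # tampered message
```

The cost of such a check is one hash computation per message, which is the kind of overhead that the hardware restrictions noted above make difficult on very constrained devices.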

• Data Quality Assurance for Effective DNNs: Spatial-temporal data in CPS is characterized as complex (generated from multiple sources such as sensors and microcontrollers), incomplete (measurement errors, missing values, outliers, etc.), noisy (real-time streaming data), challenging to interpret, and in some cases unavailable. There is a need to address these challenges by developing better data collection techniques, missing data imputation and normalization methods, and new feature extraction procedures that can effectively capture the relevant latent spatial-temporal information in large and complex datasets. Similarly, explainable AI can be leveraged to develop more precise, interpretable, and explainable DNNs that expose the underlying features and relationships driving results or decisions. In addition, transfer learning can be leveraged with existing knowledge and pre-trained models to improve the accuracy and efficiency of DNNs. For example, transferring knowledge from related spatial-temporal datasets within or across different CPS domains would be beneficial for improving ML model efficiency and supporting the CPS co-design initiative.

• Strict Performance Assurance for CPS: Most models that handle spatial-temporal data in CPS are highly complex, combining two or more DNNs for a given task (CNN-LSTM, GCN-GTN, etc.). This calls for the use of multiple layers and many parameters, leading to a longer training time and hindering real-time performance in practice. In a nutshell, the computational complexity, which translates to communication delays, and the dynamic nature of CPS data are among the factors hindering the achievement of real-time performance in various CPS domains. This calls for the design of new efficient model architectures that require few parameters to handle CPS spatial-temporal data.
The targeted model architectures can significantly reduce the computational requirements and memory footprint of DNN models, making them more suitable for real-time tasks. Such architectures can include lightweight models, such as MobileNet and ShuffleNet, with fewer parameters that can be executed on resource-constrained CPS devices. Alternatively, using specialized hardware, such as field-programmable gate arrays, can further optimize the execution of DNN models. Furthermore, the targeted model architectures can enable the deployment of DNN models on edge devices (sensors, actuators, smart cameras, etc.), which can process data locally at the network edge to reduce communication latency with the cloud. This can be critical for CPS applications that require real-time decision-making and control. Similarly, developing and using continuous learning techniques that can adapt to changing CPS data in real time can improve the performance, accuracy, and reliability of DNN models as well.

• Reliability, Safety, and Security Resilience Assurance for DNNs and CPS: Reliability entails predicting, detecting, and mitigating failures, while safety means guarding the system against unexpected failures. Security resilience entails preventing security threats posed by adversaries. As for reliability and safety, CPS could fail due to hardware or software faults, resulting in inconsistent spatial-temporal data, which might lead the DNN models to make incorrect predictions or even shut down the system. There is a need for models that use spatial-temporal data to predict, detect, and mitigate hardware and software failures in CPS. Similarly, methods for evaluating the reliability and availability of DNN models and systems, such as reliability metrics and failure analysis, are critical too. Developing strategies for implementing fault-tolerant techniques and self-healing CPS cost-effectively and efficiently is critical for the predictive maintenance of CPS.
On the other hand, data breaches and theft can compromise the confidentiality and integrity of spatial-temporal data in CPS. Furthermore, threats can lead the models to make incorrect predictions. In addition, malware and other cyberattacks can infect CPS and disrupt its regular operations. There is a need for strategies to predict, detect, and mitigate such attacks (defensive distillation, adversarial training, etc.). Additionally, introducing methods to ensure the privacy and confidentiality of spatial-temporal data in CPS (such as access control, encryption, and privacy-preserving machine learning techniques) is necessary. Finally, procedures for detecting and mitigating malware and other cyberattacks on CPS (e.g., intrusion detection and network segmentation) shall be considered as well.
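The parameter savings behind lightweight architectures such as MobileNet, mentioned under Strict Performance Assurance, can be illustrated with simple arithmetic. MobileNet replaces a standard convolution with a depthwise separable one, that is, a per-channel spatial filter followed by a 1x1 pointwise convolution. The sketch below compares the weight counts of the two layer types. Biases are ignored, and the layer sizes (3x3 kernel, 64 input channels, 128 output channels) and function names are hypothetical.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard convolution: one kernel x kernel x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise (kernel x kernel per input channel) plus pointwise (1x1) pair."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 64 input channels, 128 output channels.
standard = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
print(standard)   # 73728
print(separable)  # 8768
print(round(standard / separable, 1))  # reduction factor
```

For this layer the separable variant needs roughly one eighth of the weights, which is the kind of reduction that makes on-device inference on constrained CPS hardware more feasible.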

Final Remarks
CPS combines computational, control, and communication components with physical processes. It is designed to interact with the physical world, monitor and manage operational physical processes, and produce data. Within the operation of CPS, "spatial-temporal data" refers to the data used to describe the physical world and how it changes over time. Decisions are made with spatial-temporal data to regulate the behavior and operation of CPS. This paper systematically reviewed the applications of DNNs, i.e., convolutional, recurrent, and graph neural networks, in handling spatial-temporal data in CPS. Additionally, an extensive literature survey was conducted to determine the areas in which DNNs have successfully handled spatial-temporal data in representative CPS and the emerging areas that require attention. A generic three-dimensional framework was proposed by considering the type of CPS, the target (spatial-temporal data processing, anomaly detection, predictive maintenance, resource allocation, real-time decisions, and multimodal data fusion), and the DNN schemes (CNNs, RNNs, and GNNs). Finally, research areas that need further investigation, such as performance and security, were identified. Additionally, data quality assurance, strict performance assurance, and reliability, safety, and security resilience were outlined as future research challenges and opportunities.
In the future, this line of research could be extended by conducting several case studies to address the areas that have not received sufficient attention (i.e., "N/A"), as depicted in Tables 1-5. Similarly, attention-based architectures such as Transformers could be explored further and employed in this domain to compare their performance with the DNNs considered in this survey.

Figure 1. Example of the CPS Architecture.

Figure 3. Problem Space for DNNs in CPS Spatial-Temporal Data.

Figure 4. An Illustrative Example of Spatial-Temporal Data in Transportation CPS.
LSTMs have been effective in various applications, including speech recognition, machine translation, image captioning, and music composition. They are often combined with other DNNs, such as CNNs or attention mechanisms, to improve performance on given tasks.
• GRU: It is a gating mechanism for RNNs. It was introduced as a simplified version of the more complex LSTM. Like other RNNs, a GRU processes sequential input data, such as text or time series, by maintaining a hidden state that captures information about past sequence elements. It uses gates to selectively update and reset the hidden state, enabling it to capture longer-term dependencies in the sequence. A GRU has two gates: the reset gate and the update gate. The reset gate decides how much of the past hidden state to forget, while the update gate decides how much of the new candidate state, computed from the current input, to incorporate into the new hidden state. GRU is a powerful and flexible neural network architecture that can capture long-term dependencies in sequential data and mitigate the "vanishing gradient problem" that bedevils standard RNNs. It has been used in various applications, including NLP, speech/voice recognition, time series prediction, etc.
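The interplay of the reset and update gates can be made concrete with a scalar sketch of a single GRU step. It follows one common formulation in which the new hidden state is (1 - z) * h_prev + z * h_candidate; some references swap the roles of z and (1 - z). Real GRUs use weight matrices and vector states, so the scalar weights, the parameter names, and the function name here are simplifying assumptions.

```python
import math
from typing import Dict


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x: float, h_prev: float, p: Dict[str, float]) -> float:
    """One GRU update for a scalar input and a scalar hidden state."""
    # Reset gate: how much of the previous state feeds the candidate state.
    r = sigmoid(p["w_r"] * x + p["u_r"] * h_prev + p["b_r"])
    # Update gate: how much of the candidate replaces the previous state.
    z = sigmoid(p["w_z"] * x + p["u_z"] * h_prev + p["b_z"])
    # Candidate state computed from the input and the reset-scaled previous state.
    h_cand = math.tanh(p["w_h"] * x + p["u_h"] * (r * h_prev) + p["b_h"])
    return (1.0 - z) * h_prev + z * h_cand


# Hypothetical parameters: all zeros, so both gates sit at 0.5.
zero_params = {k: 0.0 for k in
               ("w_r", "u_r", "b_r", "w_z", "u_z", "b_z", "w_h", "u_h", "b_h")}
print(gru_step(x=1.0, h_prev=2.0, p=zero_params))  # halfway between h_prev and candidate 0

# A strongly negative update-gate bias keeps the previous state almost unchanged.
keep_params = dict(zero_params, b_z=-50.0)
print(gru_step(x=1.0, h_prev=0.7, p=keep_params))
```

The second call shows why gating helps with long-term dependencies: when the update gate is near zero, the hidden state is carried forward nearly unchanged across time steps.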
The forget gate determines what information to keep from the previous hidden state, and the memory cell updates its internal state based on the input and the last hidden state.The updated memory cell is then passed on to the next time step.

Table 2. Summary of the Reviewed Contributions in Transportation (X1) CPS.

Table 4. Summary of the Reviewed Contributions in Manufacturing (X2) CPS.

Table 6. Summary of the Reviewed Contributions in Other (X3) CPS.

Table 7. Challenges of Using DNN in CPS Spatial-Temporal Data.