Anomalous Vehicle Recognition in Smart Urban Traffic Monitoring as an Edge Service



Introduction
The past decades have witnessed global urbanization at an unprecedented speed. According to the World Urbanization Prospects by the United Nations Department of Economic and Social Affairs (UN DESA) Population Division, by 2014, 54% of the world's population (3.9 billion) lived in urban areas, and this share is projected to reach 66% by 2050 [1]. While a larger population, more attractive services, and advanced infrastructures bring prosperity, urbanization also poses many new challenges to city administrators and urban planners. It is non-trivial to provide a satisfactory quality of life and maintain the sustainability of smart cities. Real-time information collection and decision-making are essential for a good understanding of dynamic city elements and timely reactions to emergencies [2,3]. Pervasively deployed smart sensors and devices have led us into the era of the Internet of Things (IoT), which provides a solid foundation for instant decision-making [4]. Situational awareness (SAW) is among the most crucial issues in smart cities [5,6]. In terms of safety, security, and sustainability of smart cities, urban surveillance services, such as object-of-interest identification [7] and anomalous behavior recognition [8], require not only high accuracy but also real-time data-driven decision-making to avoid unacceptable consequences resulting from insufficient data assessment or incorrect event prediction [3]. However, it is very challenging to process, in a timely manner, the overwhelmingly large amount of urban surveillance data, which is dynamic, heterogeneous, and very diverse.
Among the topics of greatest concern in smart cities, urban traffic surveillance is an indispensable component. Road safety is an important target in the global Sustainable Development Goals (SDGs). By 2030, the goal is to provide a safe, affordable, accessible and sustainable transport system for all residents [9]. Abnormal driving behaviors have been a challenge for decades since they often seriously jeopardize the efficiency of traffic, endanger road safety, and even result in public emergencies. Therefore, it is critical to detect and interpret abnormal driving behaviors as early as possible. With the proliferation and ubiquitous deployment of IoT devices, traditional solutions like on-site human monitoring are being replaced by more automatic and intelligent approaches. On the one hand, traffic data can be collected in real time and transferred to cloud centers where machine learning (ML) algorithms are applied to recognize abnormal traffic patterns. On the other hand, with smart devices now ubiquitous, edge computing [10] has emerged as a new distributed computing paradigm at the network edge that brings on-site artificial intelligence much closer to the source data for real-time decision-making [11][12][13]. Considered an extension of cloud computing, edge computing has shown many advantages and is promising for meeting the requirements of smart city surveillance applications [14]. Using ubiquitously deployed computing, communication and storage resources at the edge, edge computing provides on-site/near-site storage and real-time task-processing capabilities, particularly for latency-sensitive applications [15].
A concept closely related to edge computing is fog computing. Fog computing places more emphasis on fog servers (like cloudlets or powerful laptops), while edge computing is closer to the network edge, where computing nodes could be sensors or actuators. For the sake of clarity, these two concepts are treated as the same in this paper, and the term edge computing is used in most of what follows.
Because of their superior performance, Deep Neural Networks (DNNs) are widely adopted in pattern recognition applications [16]. To accommodate edge computing, many research efforts attempt to partition DNNs and allocate the partitions to different layers of an edge-fog-cloud computational architecture [17]. To accelerate performance on the resource-constrained edge, there are vertical and horizontal collaborations between edge and cloud. With horizontal collaboration, a single DNN layer is divided into slices that are assigned to a number of edge devices [18]. In this way, the inference parallelism of edge layers is increased to deal with a large amount of urban data. For solutions with a vertical strategy, edge/fog layers only perform simple inference, and early exit points are used when a DNN is confident enough to predict from lower-level features [19]; otherwise, inference tasks are offloaded to the cloud layer for more accurate processing. While these research efforts have achieved remarkable performance, they have limitations. First, although a DNN has outstanding prediction power, it typically requires a large amount of training data, and its max pooling operations can discard substantial information. In addition, DNNs do not consider the spatial and orientation information of features since the outputs of activation functions are scalars, which could result in failing to predict vehicle behaviors since they are essentially spatio-temporal data. Second, all urban data need to be transferred to edge/fog layers or the cloud layer for inference. This could overwhelm the network, and, in extreme situations, there may be no connection at all.
The Capsules Network [20] is a recently proposed class of DNNs. It utilizes capsules instead of artificial neurons for prediction. As a result, instead of the scalars passed between layers after activation in conventional DNNs, the Capsules Network outputs vectors: the vector direction represents the orientation or pose of a feature, and the length represents the probability of its existence. By leveraging dynamic routing by agreement, lower-level features only flow into higher-level features when they agree with each other. Therefore, the spatial relationship between lower- and higher-level features is captured. In this regard, the Capsules Network requires less training data, and it is intuitively beneficial for applications with spatial-temporal data [21]. The Capsules Network has been shown to achieve better prediction accuracy on the MNIST data set (Modified National Institute of Standards and Technology database). When facing overlapping digits in images, it outperforms conventional DNNs. There are also studies applying a Capsules Network to spatial-temporal data like vehicle trajectories and urban traffic data; however, the structure is relatively simple and usually there are only two capsule layers.
In this paper, a Smart Urban traffic Monitoring (SurMon) scheme is proposed following an edge computing paradigm. SurMon performs online abnormal vehicle detection using multidimensional singular spectrum analysis (mSSA) at the edge. We formulate this issue as a change point detection problem in which the differences among various characteristics of vehicle behavior are identified [22]. Following the detection of misbehaving vehicles, a cascaded Capsules Network is adopted to interpret those behaviors and to decide whether a detected behavior is actually abnormal. There are two major components in our cascaded Capsules Network. The first takes the data of vehicles with suspicious behaviors and recognizes whether the behaviors fall into a pre-trained normal pattern. All vehicle data go into the first Capsules Net. Then, its outputs become the input of the second Capsules Net, which uses a new routing agreement to decide whether this behavior is agreed upon by other vehicles. This strategy is based on a common observation: most vehicles on the road behave normally and non-aggressively, and aggressive behavior patterns are not agreed upon by other vehicles. We implemented and tested the SurMon scheme with a public traffic data set from Next Generation Simulation (NGSIM) [23].
This paper is an extension of one of our earlier conference publications [22], which was inspired by the excellent performance of SSA algorithms in change point detection in time series [24]. The mSSA algorithm is adopted to capture differences across the characteristics of vehicles on the road by reformulating the abnormal vehicle detection problem as a change point detection problem. Instead of depending on pre-defined abnormal behavior patterns, we catch different/suspicious behaviors first and then recognize them. The mSSA approach reduces the amount of data to be transmitted to either a fog or a cloud layer. On top of [22], the major contributions of this paper are:
• A novel framework is proposed to detect and recognize abnormal vehicle behaviors by leveraging the mSSA algorithm and Capsules Networks at the edge;
• A new cascaded Capsules Network structure is introduced with a new routing agreement for abnormal vehicle behavior recognition; and
• Extensive experimental studies have been conducted with real-world traffic data that validated the effectiveness of the SurMon scheme.
The rest of this paper is organized as follows: Section 2 briefly presents the related work. An overall introduction of the SurMon architecture is given in Section 3. The anomalous vehicle behavior detection procedure using mSSA is discussed in Section 4 and the behavior interpretation scheme using the cascaded Capsules Network architecture is presented in Section 5. Experimental results are reported in Section 6, and Section 7 concludes this paper with some discussions and the future work.

Anomaly Vehicle Behavior Detection
A plethora of research efforts are reported in this area [25][26][27], and recently analysis and prediction of behaviors of autonomous driving have attracted more attention [28][29][30]. In this subsection, we provide a concise discussion on works that are closely related to our proposal. A novel deep-learning-based model is proposed for abnormal driving detection [31], in which a stacked sparse autoencoder model is adopted to learn generic driving behavior features from an unlabelled dataset in an unsupervised manner. Then, an error propagation algorithm is used with labeled data to fine-tune the model for a better performance. Corresponding to the behaviors under consideration, there are four patterns defined in this work: normal, drunk/fatigue, reckless, and phone use [31].
A vehicle-edge-cloud framework is suggested to evaluate driving behaviors with deep learning [32], where driving behaviors are ranked in three grades with scores in descending order: A, B, and C. In this framework, real-time vehicle driving data are transmitted to edge layers for inference, and the cloud is used to train or upgrade the predictive model. Similarly, in another work, researchers leverage edge layers to pre-process traffic video data to reduce network traffic [33]. The processed data are transmitted to the cloud for vehicle detection and anomalous behavior detection. In contrast, Jiang et al. aim to detect drivers' inattentive driving behaviors with large-scale vehicle trajectory data [34]. In addition, based on the output of the inattention detection combined with point-of-interest and climate data, a Long Short Term Memory (LSTM) based model is deployed to predict a driver's upcoming abnormal operations on the road [34].
Although these reported efforts have achieved remarkable performance in abnormal driving behavior detection, their application scenarios are different from ours. First, in some research works, the edge layer serves more like a transponder that receives data from end devices, pre-processes it for preliminary learning or simple feature extraction, and transmits it to a cloud layer for more complicated inference. Second, to recognize abnormal behaviors, prior knowledge is required to pre-define what an abnormal driving behavior is. Therefore, on the one hand, these approaches necessitate large-scale, good-quality data sets that cover a sufficient number of scenarios; on the other hand, in real-world traffic scenarios, their reliability could be weak since, in some cases, the same behavior can be normal in one context and abnormal in another.
In this paper, instead of trying to define a complete set of abnormal patterns, our approach trains models with normal patterns, which are the most common cases in the real world. mSSA is first utilized to identify vehicles that behave differently from others, and then the first Capsules Network of the cascade predicts whether the different behavior is normal. If it is not, the second Capsules Network is activated to decide whether this behavior is agreed upon by other vehicles. In this way, SurMon is more robust to the various abnormal behaviors seen in practice, and its training data are easier to collect.

Deep Learning with Edge Computing
Deep learning has been recognized as a very powerful technique in many applications. However, it is non-trivial to fit deep learning algorithms in the edge computing paradigm, mainly because of the limited computational resources at the network edge [35,36]. In order to enable deep learning at the edge, researchers explore various approaches [37]: (1) reducing the latency of executing deep learning on resource constrained devices by customized model design [38][39][40], model compression [41][42][43], and hardware-based accelerators [44][45][46]; (2) off-loading computing intensive tasks to edge servers or cloud [47,48]; (3) optimizing edge resource management [49,50]; or (4) partitioning and distributing computing tasks among peer edge devices [51,52].
For instance, a framework is proposed that ties front end devices with back end "helpers" like a home server or cloud [53]. Based on various metrics like bandwidth, this framework will accordingly determine task size to be offloaded to back end "helpers". Besides offloading strategies among the edge-cloud structure, distributed deep neural networks are proposed over distributed computing hierarchies, consisting of the cloud, the edge/fog and end devices [19]. Models are trained with early exits to reduce computational costs at the edge and only shallow portions of networks are deployed on the edge layer. In addition to vertical collaboration between edge and cloud, researchers also attempt to adapt DNNs on edge-cloud with horizontal collaboration [18], where a scalable fused tile partitioning approach is proposed to partition convolutional networks into slices for parallel processing at the edge layer and effectively minimize the memory footprint for edge nodes.
In this paper, the SurMon architecture explores the application of deep learning in the area of smart transportation leveraging the edge-fog-cloud computing paradigm. It shows that edge AI is promising to handle the large-scale transportation in modern smart cities.

Capsules Network
Although convolutional neural networks (CNNs) have achieved substantial successes in many important applications, they have some constraints, e.g., efficiency [20]. By treating the information at the neuron level as vectors, Capsules Networks have shown excellent performance and are applied in a number of areas, including image classification [54,55], target recognition [56,57], object segmentation [58,59], natural language understanding [60,61], industrial control [62], and more [63]. For instance, researchers propose an associated spatiotemporal Capsules Network for gait recognition [64], in which a relationship layer is built to measure the relationship between lower-level features and their corresponding higher-level features, and it achieves great performance on benchmark datasets. Besides applications of Capsules Networks in various challenging areas, researchers also explore different routing algorithms to accelerate the training process [65]. Dynamic routing by agreement significantly slows down the training process in the Capsules Network. Therefore, various replacement algorithms are investigated, for instance, grouping equivalent routing, K-Means routing, etc.
There are reported applications of Capsules Networks in the area of urban traffic prediction [66]. A Capsules Network is bundled with a Nested Long Short Term Memory (NLSTM) Network for transportation network forecasting [67], where the Capsules Network is trained to extract spatial features of traffic networks, and the NLSTM is leveraged to evaluate temporal dependencies in traffic sequence data. The experimental results show that the combination of CapsNet and NLSTM outperforms the combination of CNN and NLSTM. The Capsules Network was also adopted to predict traffic speed in complex transportation networks [68]. Spatial-temporal traffic sequence data are converted into matrices and then fed into the Capsules Network. However, to the best of our knowledge, there are no reported efforts that apply Capsules Networks to detect anomalous behaviors on the road. We hope this work will attract more attention and inspire more discussion on the application of Capsules Networks in the smart transportation area, specifically safety surveillance.

SurMon Architecture Overview
Figure 1 presents the overall architecture of the proposed Smart Urban traffic Monitoring (SurMon) system, which includes three layers: end layer, edge layer and cloud layer. The end layer consists of smart sensors and devices that collect urban traffic data. In the prototype of the SurMon system, drones are adopted to collect videos over a certain road area in real time, and the H.264-coded video streams are transmitted to the ground controller that functions as the edge layer, where the compressed video streams are decoded and displayed to the human operator. While the end layer is in charge of data collection and light-weight data pre-processing tasks, the edge layer operates near-site and performs data analysis for real-time decision-making. Edge computing leverages heterogeneous smart devices as edge nodes. They collaborate with each other to facilitate surveillance tasks with a lower latency. The smart devices include smartphones, tablets, personal laptops, etc.
In the SurMon system, major computations take place at the edge layer, including vehicle detection, vehicle tracking, anomalous behavior detection, and interpretation. Vehicle detection and tracking tasks extract vehicle trajectories in real time. Due to the resource constraints on smart devices and the complexity of tracking tasks, tracking entails collaboration among a number of edge nodes, which are divided into master and slave nodes [69]. The master node orchestrates the collaboration among slave nodes. In the SurMon prototype system, for instance, the in-car user laptop plays the role of the master edge node. It receives video streams from drones and then dispatches decoded video frames to each slave node. Each edge node tracks one single vehicle and transmits the vehicle trajectory data back to the master node for other tasks. Therefore, the computing-intensive multi-target tracking task is reformulated as multiple single-target tracking tasks [69]. A task assignment scheme is designed to ensure synchronization among edge nodes. The video streams from drones are transferred to the ground controller. Once suspicious vehicles are identified, each individual vehicle and its region-of-interest are assigned to a target tracker for consistent tracking and velocity estimation. When more suspicious vehicles are detected, the controller checks with the surrounding edge nodes and dispatches a single-target tracking task to one that has sufficient computing power to process the job. Readers who are interested in the details of the rationale and implementation of multiple-target tracking at the edge based on a single-target tracking algorithm are referred to [69].
Besides assigning tasks to each slave node, the master node consolidates all trajectory data from the edge nodes, attempts to detect abnormal driving behaviors, and interprets them to decide whether they are appropriate. Instead of trying to pre-define driving patterns of abnormal behaviors, SurMon focuses on behaviors that are different from others. Multi-dimensional Singular Spectrum Analysis (mSSA) is adopted to reformulate this problem as a change point detection problem [22]. On the one hand, defining abnormal patterns in advance results in detection inaccuracy since, in practice, it is very difficult to enumerate every abnormal pattern; on the other hand, identifying behaviors that deviate from average patterns first effectively reduces computations on edge nodes and the amount of data that is transmitted to a remote cloud center if necessary. However, one trade-off of this approach is that, in certain situations, a behavior detected as different cannot simply be categorized as aggressive or abnormal driving. For example, in the United States, where traffic keeps to the right, if a vehicle stops to wait for a left turn while all others are moving forward, this behavior is likely to be detected and marked as a difference, but it is actually not a violation of traffic rules.
Therefore, to deal with this challenge, a smart traffic surveillance system needs to interpret the detected driving behaviors in a specific context. A Capsules Network is well suited to this task because it requires less training data and considers the spatial relationships among features. In the SurMon system, two Capsules Networks, namely CapNet-1 and CapNet-2, are trained with a dataset of good driving behaviors, and they are cascaded together. CapNet-1 is trained as a multi-class classifier with individual driving behavior data like trajectories, speed, etc. CapNet-1 predicts whether an individual behavior is classified as any of the good ones from the training dataset. CapNet-2 is trained with a set of good driving behaviors in the same time window, and it is a binary classifier that only predicts whether these behaviors agree with each other to form a harmonious driving scenario. Specifically, the set of behavior data goes into CapNet-1 first, and then the outputs are taken as the training set for CapNet-2. Within a time window, all vehicle driving behavior data flow into the mSSA for the detection of outliers, and then CapNet-1 takes the detected behavior data and makes the prediction. If the detected behavior does not fall into any good behavior pattern, the data of all vehicles are fed into CapNet-1 instead, and its output is leveraged by CapNet-2 to predict whether the presence of this abnormal behavior actually turns the scene into a negative instance of a good driving scenario. The basic rationale here is that most vehicles obey the traffic rules and a good driving behavior does not put others at risk.
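The decision flow described above (mSSA outlier detection, then CapNet-1, then CapNet-2) can be sketched as a small control routine. This is only an illustrative sketch: the function name `interpret_window` and the three callables are hypothetical placeholders standing in for the trained components, not part of the SurMon implementation.

```python
from typing import Callable, List, Sequence


def interpret_window(
    behaviors: Sequence,
    detect_outliers: Callable[[Sequence], List[int]],
    capnet1_is_normal: Callable[[object], bool],
    capnet2_agrees: Callable[[Sequence], bool],
) -> List[int]:
    """Return indices of vehicles judged abnormal within one time window.

    All three callables are hypothetical stand-ins:
    - detect_outliers: the mSSA stage, returns indices of differing vehicles
    - capnet1_is_normal: CapNet-1, True if a behavior matches a good pattern
    - capnet2_agrees: CapNet-2, True if the whole scene is still harmonious
    """
    # Stage 1: mSSA flags vehicles that behave differently from the others.
    suspicious = detect_outliers(behaviors)
    # Stage 2: CapNet-1 filters out flagged behaviors that are known good ones.
    unexplained = [i for i in suspicious if not capnet1_is_normal(behaviors[i])]
    if not unexplained:
        return []
    # Stage 3: CapNet-2 looks at all vehicles in the window together.
    return [] if capnet2_agrees(behaviors) else unexplained
```

In this sketch, a vehicle waiting for a left turn would be flagged by the first stage but dropped by the second, which mirrors the example given for the trade-off of difference-based detection.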
The last layer of the SurMon system is the cloud layer. Meaningful urban data, after being processed at the edge layer, can be transmitted to the cloud for large-scale, city-wide traffic pattern recognition. In addition, because of the plentiful computational resources, models can be upgraded in the cloud and then deployed to edge nodes. This paper focuses on real-time detection and interpretation at the network edge, i.e., the end layer and the edge layer; the design and experimental studies for the cloud layer are beyond its scope.
The architecture discussed in this section presents the concept and major function blocks. The implementation of an actual SurMon system in the real world depends on many factors, including the budget, the range of the space under surveillance, local traffic regulations, etc. In Section 6, a system configuration of the proof-of-concept prototype for our experimental study is presented.

Anomalous Vehicle Behavior Detection Using mSSA
The fundamental idea of the proposed approach lies in an intuitive observation. Most vehicles on roads actually follow traffic rules and behave legitimately. Therefore, vehicles that act significantly differently are more likely to be violating the rules. Consequently, it is reasonable to identify vehicles that present different characteristics from others as objects-of-interest, and more analysis should be conducted on them. The movement of a vehicle can be described using its trajectory information, which consists of an ordered sequence depicting positions, speeds and directions of the vehicle over time. Following this rationale, the anomalous vehicle detection problem is reformulated as a problem of finding the object of changes $v_c$ from the set of objects $V_K$.
For convenience, Table 1 summarizes the notations used in this paper.

A Basic Introduction to SSA
The basic SSA algorithm consists of four steps: embedding, singular value decomposition (SVD), grouping, and diagonal averaging. Let us consider a real-valued time series $X_N = (x_1, x_2, \ldots, x_N)$ of length $N$.

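1. Embedding: map $X_N$ to an $M \times L$ trajectory matrix $X$, where $M$ is the window length and $L = N - M + 1$:
$$X = \begin{pmatrix} x_1 & x_2 & \cdots & x_L \\ x_2 & x_3 & \cdots & x_{L+1} \\ \vdots & \vdots & \ddots & \vdots \\ x_M & x_{M+1} & \cdots & x_N \end{pmatrix} \tag{1}$$
where $M$ is the dimension of characteristics and $L$ is the number of observations. The trajectory matrix $X$ is a Hankel matrix whose elements along the anti-diagonals ($i + j = \text{const.}$) are the same: $x_{ij} = x_{i+j-1}$. The embedding procedure is a one-to-one mapping.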

2. SVD: the second step is to perform the SVD procedure on the trajectory matrix $X$. Set the covariance matrix $T = XX^{\mathsf{T}}$; its eigenvalues are $\lambda_1, \lambda_2, \ldots, \lambda_d$ and the corresponding orthonormal eigenvectors are $U_1, U_2, \ldots, U_d$, where $d$ is the rank of $X$. Note that the eigenvalues are arranged in decreasing order and are larger than 0. Let $V_g = X^{\mathsf{T}} U_g / \sqrt{\lambda_g}$ ($g = 1, 2, \ldots, d$); then, the trajectory matrix $X$ can be written as follows:
$$X = X_1 + X_2 + \cdots + X_d, \qquad X_g = \sqrt{\lambda_g}\, U_g V_g^{\mathsf{T}},$$
where $X_g$ is the rank-1 elementary matrix of the trajectory matrix $X$, and $V_g$ is one of the eigenvectors of the matrix $S = X^{\mathsf{T}} X$.

3. Grouping: in the third step, the elementary matrices are partitioned into disjoint subsets $I_1, I_2, \ldots, I_m$; then, the trajectory matrix $X$ can be rewritten as
$$X = X_{I_1} + X_{I_2} + \cdots + X_{I_m}.$$
Each subset represents one component of the time series, such as trend, oscillation, or noise.

4. Diagonal averaging: in the last step, the reconstruction process maps the matrix with only principal components back to a time series by Hankelizing the matrix $X^{l}$ with $l$ principal components ($X^{l} = X_{I_1} + X_{I_2} + \cdots + X_{I_l}$).
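The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration under the assumption that grouping simply keeps the first `l` elementary matrices; the function names `embed`, `diagonal_average`, and `ssa_reconstruct` are chosen here for illustration only.

```python
import numpy as np


def embed(series, M):
    """Step 1: map a series of length N to the M x L Hankel trajectory matrix."""
    x = np.asarray(series, dtype=float)
    L = x.size - M + 1
    # Column j holds the lagged window (x_j, ..., x_{j+M-1}).
    return np.column_stack([x[j:j + M] for j in range(L)])


def diagonal_average(X):
    """Step 4: Hankelize X by averaging each anti-diagonal (i + j = const)."""
    M, L = X.shape
    flipped = X[::-1, :]  # anti-diagonals of X are diagonals of the flipped matrix
    return np.array([flipped.diagonal(k).mean() for k in range(-(M - 1), L)])


def ssa_reconstruct(series, M, l):
    """Steps 1-4 with the simplest grouping: keep the first l components."""
    X = embed(series, M)
    # Step 2: SVD; the singular values are the square roots of the eigenvalues of X X^T.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Step 3: sum of the first l rank-1 elementary matrices.
    X_l = (U[:, :l] * s[:l]) @ Vt[:l, :]
    return diagonal_average(X_l)
```

For example, `ssa_reconstruct(noisy_series, M=10, l=2)` would return a smoothed series of the same length as the input, with the components beyond the first two discarded.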

Multi-Dimensional SSA
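The mSSA algorithm analyzes multi-channel time series in the presence of noise.
Denote by $X^{(i)}_N = (x^{(i)}_1, x^{(i)}_2, \ldots, x^{(i)}_N)$, $i = 1, \ldots, k$, the $k$ time series of length $N$ of a system. The major difference from basic SSA lies in the trajectory matrix construction step. For channel $X^{(i)}_N$, the trajectory matrix is $X^{(i)} = [X^{(i)}_1, \ldots, X^{(i)}_L]$. Each column in $X^{(i)}$ is as follows:
$$X^{(i)}_j = \left(x^{(i)}_j, x^{(i)}_{j+1}, \ldots, x^{(i)}_{j+M-1}\right)^{\mathsf{T}}, \qquad j = 1, \ldots, L.$$
Putting all the trajectory matrices of the time series in each channel together, then
$$X = \left[X^{(1)}, X^{(2)}, \ldots, X^{(k)}\right].$$
Depending on the application scenario, as an alternative, the trajectory matrices of the multiple channels can be concatenated vertically. The remaining steps of the mSSA algorithm are the same as the basic SSA procedures.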
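The construction of the multi-channel trajectory matrix can be illustrated with a short NumPy sketch. The helper names `trajectory_matrix` and `mssa_trajectory` and the `vertical` flag are assumptions made for this example; the flag switches between the horizontal and the vertical concatenation mentioned above.

```python
import numpy as np


def trajectory_matrix(series, M):
    """M x L Hankel trajectory matrix of one channel, L = N - M + 1."""
    x = np.asarray(series, dtype=float)
    L = x.size - M + 1
    return np.column_stack([x[j:j + M] for j in range(L)])


def mssa_trajectory(channels, M, vertical=False):
    """Stack the per-channel trajectory matrices into one mSSA matrix.

    channels: iterable of k equally long series.
    vertical=False gives an M x (k*L) matrix (side by side);
    vertical=True gives a (k*M) x L matrix (stacked on top of each other).
    """
    blocks = [trajectory_matrix(c, M) for c in channels]
    return np.vstack(blocks) if vertical else np.hstack(blocks)
```

With two channels of length 6 and `M = 3`, the horizontal variant yields a 3 × 8 matrix and the vertical variant a 6 × 4 matrix; SVD, grouping, and diagonal averaging then proceed as in basic SSA.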

SSA-Based Change Point Detection
The original time series can be decomposed into multiple, different components. With a reasonable grouping strategy, components corresponding to noise or secondary factors can be removed, and only the principal components are retained. Basically, an SSA-based change point detection algorithm moves a sliding window and calculates the distance between the test matrix obtained from the target segment of the time series and the base matrix reconstructed from the $l$-dimensional subspace. A clear increase in the distance between the test matrix and the base matrix is observed when a change occurs.
Consider a time series and assume that $N$, $M$, $l$, $r$, $s$ are fixed integers and $n$ is the iteration step number. Here, $N$ is the segment length and $M$ is the window length, with $0 < M < N/2$ and $0 < r < s$. An $l$-dimensional base matrix $X^{(n)}_{\text{base}}$ is constructed by performing steps one to three: embedding, performing SVD, and grouping. Denote by $U_l = [U_1, \ldots, U_l]$ the eigenvector subspace composed of the eigenvectors corresponding to the first $l$ eigenvalues. Similarly, the test matrix is constructed for the time series segment in $[n + r + 1, n + s]$:
$$X^{(n)}_{\text{test}} = \left[X^{(n)}_{r+1}, \ldots, X^{(n)}_{s}\right],$$
where $X^{(n)}_j$, $j = r + 1, \ldots, s$, are the lagged column vectors. The size of the test matrix is $M \times (s - r)$. Next, the Euclidean distance $D$ between the base matrix and the test matrix is calculated:
$$D = \sum_{j=r+1}^{s} \left( \left(X^{(n)}_j\right)^{\mathsf{T}} X^{(n)}_j - \left(X^{(n)}_j\right)^{\mathsf{T}} U_l U_l^{\mathsf{T}} X^{(n)}_j \right),$$
where the vector $X^{(n)}_j$ is a column of the test matrix $X^{(n)}_{\text{test}}$.
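The distance computation can be sketched as follows. This is a minimal illustration, assuming that the base segment is the $N$ samples starting at step $n$ and that test column $j$ is the lagged window starting at sample $n + j$ (zero-based index `n + j - 1`), as in the usual SSA change point formulation; the names `hankel_columns` and `change_score` are hypothetical.

```python
import numpy as np


def hankel_columns(x, start, count, M):
    """Return `count` lagged windows of length M, beginning at index `start`."""
    return np.column_stack([x[start + j:start + j + M] for j in range(count)])


def change_score(x, n, N, M, l, r, s):
    """Sum of squared distances from the test columns to the l-dim base subspace."""
    x = np.asarray(x, dtype=float)
    # Base matrix: embedding of the segment x[n : n + N] (assumed layout).
    base = hankel_columns(x, n, N - M + 1, M)
    U, _, _ = np.linalg.svd(base, full_matrices=False)
    U_l = U[:, :l]  # eigenvectors of the first l eigenvalues
    # Test matrix: columns j = r+1, ..., s, i.e. s - r lagged windows.
    test = hankel_columns(x, n + r, s - r, M)
    # X_j^T X_j - X_j^T U_l U_l^T X_j equals the squared residual norm.
    residual = test - U_l @ (U_l.T @ test)
    return float(np.sum(residual ** 2))
```

For a pure sinusoid with `l = 2`, the score stays near zero for test windows that follow the same pattern and rises sharply once the test windows cover a level shift.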

Detection of Anomalously Behaved Vehicles
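The objective of this detection procedure can be described as follows: given a set of $K$ vehicles $V_K$, identify the vehicle with different behaviors, $v_c$. For $V_K$, there are time sequences of behavior-describing information, one for each feature of each vehicle. As in the steps described earlier, the first step is to embed the time series of each feature $j$ for a specific vehicle $v_n$ into a matrix:
$$X^{(j)}_{n,N} = \begin{pmatrix} x^{(j)}_{n,1} & x^{(j)}_{n,2} & \cdots & x^{(j)}_{n,L} \\ x^{(j)}_{n,2} & x^{(j)}_{n,3} & \cdots & x^{(j)}_{n,L+1} \\ \vdots & \vdots & \ddots & \vdots \\ x^{(j)}_{n,M} & x^{(j)}_{n,M+1} & \cdots & x^{(j)}_{n,N} \end{pmatrix},$$
where $j = 1, \ldots, S$, $n = 1, \ldots, K$, $M$ is the window length and $L = N - M + 1$.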
Once the matrix $X^{(j)}_{n,N}$ is obtained, the second step is to perform SVD and select the first $l_j$ eigenvalues to construct the subspace. The subspace closely represents the original time series but ignores the noise and non-significant factors. Denote by $U_{l_j}$ the matrix whose columns are the eigenvectors corresponding to the selected eigenvalues. Similarly, $V_{l_j}$ can be calculated.
Next, by reconstructing the time series from the $l$-dimensional subspace, $X^{(j,l)}_{n,N}$ is obtained. Then, for each feature, the average of the reconstructed time series of all vehicles is calculated to obtain the base matrix:
$$X^{(j)}_{\text{base},N} = \frac{1}{K} \sum_{n=1}^{K} X^{(j,l)}_{n,N}.$$
The SVD operation is conducted again on this base matrix, and an $l$-dimensional subspace of $X^{(j)}_{\text{base},N}$ is obtained. Then, each matrix $X^{(j)}_{n,N}$ serves as the test matrix of each feature of each individual vehicle, and its distance $D^{(j)}_n$ to this base matrix is calculated. The distances are normalized within each channel, giving $\tilde{D}^{(j)}_n$, and then the anomaly score is calculated for each vehicle:
$$P_n = \sum_{j=1}^{S} P_j \tilde{D}^{(j)}_n,$$
where $P_j$ is the weight of each characteristic of the motion of vehicles and $\sum_{j=1}^{S} P_j = 1$.
A threshold $h$ is pre-set by the system operator based either on past experience or on a certain mathematical analysis. If $P_n > h$, then the $n$-th vehicle is detected as an anomalous one. It is a complex and challenging process to determine an optimal or suboptimal threshold for SSA algorithms. It is essentially a problem of finding an optimal solution in a high-dimensional space, as multiple factors play significant roles, including the window length, the number of eigenvalues selected, the distance between the base and test matrices, and user expectations. It is worth noting that the trade-offs between the detection delay and the detection accuracy are highly application dependent. A comprehensive discussion is beyond the scope of this paper, and interested readers are referred to [70] for more details.
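The per-vehicle scoring procedure can be sketched end to end in NumPy. Several details are assumptions made only for this illustration: the same rank `l` is used for every feature, distances are normalized by dividing by their sum within each channel, and the names `embed`, `low_rank`, and `anomaly_scores` are hypothetical.

```python
import numpy as np


def embed(x, M):
    """M x L Hankel trajectory matrix of one feature series of one vehicle."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x[j:j + M] for j in range(x.size - M + 1)])


def low_rank(X, l):
    """Reconstruction of X from its first l singular components."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :l] * s[:l]) @ Vt[:l, :]


def anomaly_scores(data, M, l, weights):
    """data has shape (K vehicles, S features, N samples); returns K scores."""
    K, S, _ = data.shape
    scores = np.zeros(K)
    for j in range(S):
        mats = [embed(data[n, j], M) for n in range(K)]
        # Base matrix: average of the rank-l reconstructions over all vehicles.
        base = np.mean([low_rank(X, l) for X in mats], axis=0)
        U_l = np.linalg.svd(base, full_matrices=False)[0][:, :l]
        # Distance of each vehicle's test matrix to the base subspace.
        dist = np.array([np.sum((X - U_l @ (U_l.T @ X)) ** 2) for X in mats])
        total = dist.sum()
        if total > 0:
            dist = dist / total  # assumed normalization: divide by channel sum
        scores += weights[j] * dist
    return scores
```

A vehicle would then be flagged when its score exceeds the operator-chosen threshold, for example `scores > h` with a hypothetical `h = 0.5`.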

Vehicle Behavior Data
Again, consider a real-valued time series $X_N = (x_1, x_2, \ldots, x_N)$ of length $N$. When the characteristics of vehicle behaviors are extracted in different feature channels within a time window, the spatial-temporal data series of vehicle behaviors with real values in different channels is
$$X_{n,N} = \left(x^{(1)}_{n,1}, \ldots, x^{(1)}_{n,N}, \; \ldots, \; x^{(k)}_{n,1}, \ldots, x^{(k)}_{n,N}\right),$$
where $N$ denotes the length of a time window, $n$ represents the $n$-th vehicle, and $k$ is the number of channels. The overall length of $X_{n,N}$ is $N \times k$. Then, this spatial-temporal series can be converted into a $q \times p$ matrix, where $q \times p = k \times N$. This matrix serves as the input for the cascaded Capsules Network.
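The conversion of a multi-channel series into the input matrix can be illustrated with a short sketch. The row-by-row filling order and the function name `behavior_matrix` are assumptions for this example, since the text only requires that $q \times p = k \times N$.

```python
import numpy as np


def behavior_matrix(channels, q, p):
    """Concatenate k channels of length N and reshape them into a q x p matrix.

    channels: array-like of shape (k, N) for one vehicle in one time window.
    The matrix is filled row by row (an assumed convention).
    """
    flat = np.concatenate([np.asarray(c, dtype=float) for c in channels])
    if flat.size != q * p:
        raise ValueError("q * p must equal k * N")
    return flat.reshape(q, p)
```

For instance, three channels of length 4 give 12 values, which could be arranged as a 2 × 6 or a 3 × 4 input matrix.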

Cascaded Capsules Network
Capsules are groups of artificial neurons that take vectors as input and output vectors. The length of an output vector represents the probability that a particular feature is detected by the capsule. Figure 2 illustrates the overall architecture of the cascaded Capsules Network, which consists of CapNet 1 and CapNet 2. CapNet 1 takes vehicle behavior data as input and predicts whether this behavior is one of the normal driving patterns. If it is not, the outputs of its decoder layer are fed into CapNet 2. Unlike CapNet 1, CapNet 2 takes the driving behavior data of multiple vehicles and predicts whether the detected driving behavior is agreed upon by others.

CapNet 1
There are three layers in the first Capsules Network of the cascaded approach: primary capsules layer, behavior capsules layer and reconstruction decoder layer.
The vehicle behavior matrix is first fed into a convolutional layer to extract preliminary features. The kernel size is 6 × 6 and the stride is one. There is one input channel and there are 256 output channels. The primary capsules layer consists of 32 channels of capsules, each of which has eight convolutional units with kernels of size 9 × 9 and a stride of two. Then, the output vectors are multiplied by the weight matrix $W_{ij}$ and summed with different weights for each channel in the behavior capsules layer:
$$s_j = \sum_i c_{ij}\,\hat{u}_{j|i}, \qquad \hat{u}_{j|i} = W_{ij} u_i,$$
where $u_i$ is the output vector of capsule $i$ in the lower layer, $c_{ij}$ are the coupling coefficients between the lower capsules layer and the upper capsules layer, which are determined by the dynamic routing agreement process, and $s_j$ is the input vector of the upper-level capsule, which is a weighted sum over all output vectors $\hat{u}_{j|i}$ from the lower capsules layer. A squash activation function is applied to $s_j$ to introduce nonlinearity between capsule layers:
$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2}\,\frac{s_j}{\|s_j\|}.$$
This activation function maintains the direction of the input vector while normalizing its length: short vectors are shrunk to a length close to zero, while long vectors are shrunk to a length close to one. The coupling coefficients $c_{ij}$ are calculated by a routing softmax:
$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})},$$
where $b_{ij}$ are the logits of the coupling coefficients, which are calculated by the dynamic routing process. The dynamic routing process is illustrated as Algorithm 1 below. With the dynamic routing algorithm, the scalar product $\hat{u}_{j|i} \cdot v_j$ is used to evaluate the agreement between the lower-level features in the primary capsules and the high-level features in the behavior capsules. The agreement is then added to $b_{ij}$ to update the coupling coefficients between the capsule layers.
Algorithm 1. Dynamic routing process.
procedure ROUTING($\hat{u}_{j|i}$, $r$, $l$)
    for all capsule $i$ in layer $l$ and capsule $j$ in layer $(l+1)$: $b_{ij} \leftarrow 0$
    for $r$ iterations do
        for all capsule $i$ in layer $l$: $c_i \leftarrow \mathrm{softmax}(b_i)$
        for all capsule $j$ in layer $(l+1)$: $s_j \leftarrow \sum_i c_{ij}\,\hat{u}_{j|i}$
        for all capsule $j$ in layer $(l+1)$: $v_j \leftarrow \mathrm{squash}(s_j)$
        for all capsule $i$ in layer $l$ and capsule $j$ in layer $(l+1)$: $b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j$
    return $v_j$
The $\ell_2$ norm of an output vector is used to represent the probability that a capsule entity exists. The margin loss for class $k$ is:
$$L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda\,(1 - T_k)\,\max(0, \|v_k\| - m^-)^2,$$
where $T_k = 1$ if an instance of class $k$ exists, $m^+ = 0.9$, and $m^- = 0.1$. $\lambda$ is set to 0.5, and the total loss is the sum of the losses of all capsules. Following the behavior capsules layer, we check whether the detected behavior falls into any class of normal behaviors.
If not, the vehicle driving behavior data are reconstructed by the reconstruction decoder layer, which consists of three fully connected layers. The size of the output matrix is the same as the size of the input matrix.
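The routing procedure and the margin loss described for CapNet 1 can be sketched in NumPy as follows. This is a forward-pass illustration only, under the assumption that the network follows the standard dynamic-routing formulation. The default of three routing iterations is an assumption, as the text does not state r, and the learned weight matrices $W_{ij}$ are taken as already applied, so the function receives the prediction vectors $\hat{u}_{j|i}$ directly.

```python
import numpy as np


def squash(s, eps=1e-9):
    """Keep the direction of s and map its length into [0, 1)."""
    sq = np.sum(s * s, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)


def routing(u_hat, r=3):
    """Dynamic routing between two capsule layers.

    u_hat has shape (n_lower, n_upper, dim) and holds the prediction
    vectors u_hat[i, j] made by lower capsule i for upper capsule j.
    """
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))                 # routing logits
    for _ in range(r):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)         # softmax over upper capsules
        s = np.einsum("ij,ijd->jd", c, u_hat)        # weighted sum per upper capsule
        v = squash(s)
        b = b + np.einsum("ijd,jd->ij", u_hat, v)    # add agreement u_hat . v
    return v


def margin_loss(v, targets, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Sum of per-class margin losses; targets is a 0/1 vector T_k."""
    targets = np.asarray(targets, dtype=float)
    lengths = np.linalg.norm(v, axis=-1)
    present = targets * np.maximum(0.0, m_plus - lengths) ** 2
    absent = lam * (1.0 - targets) * np.maximum(0.0, lengths - m_minus) ** 2
    return float(np.sum(present + absent))
```

The length of each returned vector can be read as the probability that the corresponding behavior class is present, which is what the margin loss penalizes during training.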

CapNet 2
There are three layers in the second Capsules Network, CapNet 2, of the cascaded approach: a max pooling layer, a primary capsules layer, and a behavior capsules layer. The architecture of CapNet 2 is the same as that of CapNet 1, except that a max pooling layer at the beginning reduces the size of the input matrix. Suppose there are $M$ vehicles detected from road images; then, the input of CapNet 2 is a matrix $X$ of size $M \times N$. Let $X_{\text{input}} = X^{T} X$, which is the input to the max pooling layer. The kernel size of the max pooling layer is 25, and the stride is 25. After the max pooling layer, the structure of CapNet 2 is the same as that of CapNet 1. The softmax function is applied to check whether the output falls into the class in which the detected behavior is agreed upon by the other vehicles.
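The input preparation for CapNet 2 can be sketched as below. The product $X^{T}X$ has size N × N regardless of the number of vehicles M, so the network input has a fixed size. The sketch assumes a two-dimensional 25 × 25 pooling window with stride 25 and that N is a multiple of the window size; `capnet2_input` is a hypothetical name.

```python
import numpy as np


def capnet2_input(X, pool=25):
    """Form X^T X from an (M, N) matrix and apply non-overlapping max pooling."""
    X = np.asarray(X, dtype=float)
    gram = X.T @ X                      # N x N, independent of M
    n = gram.shape[0]
    if n % pool != 0:
        raise ValueError("N must be a multiple of the pooling size")
    m = n // pool
    # Split into (m, pool, m, pool) blocks and take the maximum of each block.
    return gram.reshape(m, pool, m, pool).max(axis=(1, 3))
```

With trajectories of length N = 625, as used in the experiments, the pooled matrix is 25 × 25, which matches the input size of CapNet 1.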

Experimental Setup
A prototype of the proposed SurMon system has been implemented. Two DJI Phantom 3 Professional drones are used to monitor the moving vehicles on the road, and two Nexus 9 tablets are connected to the drone controllers to display the real-time surveillance video. In this prototype, one laptop acts as an edge computing node, configured with a 2.3 GHz Intel Core i7 processor, 16 GB of RAM, and the OS X El Capitan operating system. The resolution of the video frames is 1280 × 720. OpenCV 3.1 and Eigen 3.2.1 are used for the tracking algorithm. The nominal FOV (field of view) of the camera mounted on the drones is 94°, and the actual FOV after calibration is 89.39°, because manufacturers make the image plane not perfectly circumscribed with the CCD plate but slightly larger.

mSSA-Based Anomalous Behavior Detection
The proposed mSSA-based anomaly detection algorithm is validated using two data sets. One consists of the video streams collected locally using drones as the end-layer devices; the other is the public Next Generation Simulation (NGSIM) Lankershim Boulevard data set. In the NGSIM data set, the vehicle trajectory data were collected on Lankershim Boulevard in the Universal City neighborhood of Los Angeles on 16 June 2005. A total of 30 min of data are available, from 8:30 a.m. to 9:00 a.m. The channels of the vehicle trajectory data include the spatial information of the trajectories in terms of local_x and local_y, vehicle length, vehicle velocity, acceleration, etc. The time interval between data points is 0.1 s.
Figure 3 shows an example scenario from the videos collected by the drone overseeing the local traffic. The video streams record the traffic on a road near our campus, which is a local highway with a speed limit of 72 km/h. For testing purposes, multiple types of anomalous behaviors are created. Three parameters are selected to describe the trajectory of each vehicle: the x position and y position of the vehicle in a video frame, and the driving speed. The sampling frequency of the video is ten samples per second. Five trajectories are selected, mainly to validate the feasibility and effectiveness of our method. The five vehicles are labelled ID1, ID2, ID3, ID4, and ID5, and the anomalous vehicle is ID4.
Grouping is the most important step in the mSSA procedure. A careful selection allows the components of a time series to be decomposed efficiently and with higher precision. Figure 4 illustrates the eigenvalues corresponding to the driving speed of vehicle ID1. The larger the eigenvalue, the more significant the information contained in the corresponding dimension. In the example shown in Figure 4, the largest eigenvalue is almost $10^3$, and all the others stay below ten.
This substantial difference leads to the decision that only the largest eigenvalue is used for the reconstruction operation. The others are discarded because of their limited influence. Figure 5 presents the reconstruction results for different components of the testing trajectories. The time series length N is 30, which corresponds to three seconds, and the window length L is 15. By default, in this paper, the window length L for the construction of the trajectory matrix is set to half of N unless explicitly noted. Figure 5a shows the original speed of the five vehicles. The anomalous vehicle, marked as ID4, stops on the road. The curves in Figure 5a show that the original speed time series contain noise. The noise mainly results from the vehicle detection algorithm, which cannot obtain the position of a vehicle from the same part of the vehicle body in every frame. Meanwhile, it is very encouraging that the SSA algorithm performs robustly and the noise is efficiently attenuated at the grouping step. In contrast, Figure 5b is the reconstruction result from the subspace that contains only the largest eigenvalue. The interference from noise is removed, and the trend of each vehicle is very clear. After the grouping step, the time series of the normal vehicles become closer to each other, which helps to increase the detection accuracy. Figure 5c
In this experimental study, we have also compared the computation time on two different edge devices. One is a laptop computer with an i7-6820HQ processor, 32 GB of memory, and a maximum frequency of 2.7 GHz. The experimental results show that, when the number of vehicles increases from five to 35, the computing time grows from around 10 ms to almost 100 ms. Since the sampling frequency is ten samples per second, i.e., one new sample every 100 ms, this performance is sufficient to meet the requirement for anomaly detection in real time.
In our data set, regardless of the total number of vehicles, around 20% of the vehicles behave anomalously. The time series length N is 30 s. Each data point in the figures is the average of 30 test runs. The other device adopted in our experiments is a Google Pixel C tablet with an NVIDIA X1 quad-core processor, 3 GB of RAM, and a maximum frequency of 1.9 GHz. Overall, the computation time ranges from 100 ms to 900 ms, which is longer than that achieved on the laptop.

Tests on the Public NGSIM Data Set
A subset is extracted from the NGSIM US-101 data for our experimental study. This subset contains four kinds of vehicle motions; there are 35 vehicles in each video frame, and 50 frames are used in this study. Two of the four kinds are anomalous vehicle trajectories. The first is a vehicle moving at a much lower speed, and the second is a vehicle whose x position is significantly different from that of the other vehicles. This study compares the mSSA approach with a widely used approach, the K-means clustering method. Parameters are set as follows: N = 20, L = 10, K = 2, h = 0.5. K is set to two for clustering into normal and abnormal vehicles. The weights of x, y, and speed are the same.
The experimental results reported in Figure 6 verify that our mSSA approach outperforms K-means clustering in terms of detection accuracy. When there are two anomalous vehicles, mSSA achieves a high detection accuracy of 97%. As the number of anomalous vehicles increases, mSSA maintains a higher detection accuracy than the K-means clustering method. In this experimental study, a very large percentage of anomalous vehicles was considered, where ten out of 35 vehicles misbehave, which is very unlikely to happen in practice. Additionally, the K-means algorithm can only detect one out of the two types of anomalies when the number of anomalous vehicles grows to eight and ten. This comparison study verifies that similarity-based anomaly detection methods always require thorough prior knowledge about the monitoring scenario and the rules. This requirement, however, leads to poor adaptation to different testing scenarios.
The experimental study reveals that the proposed mSSA-based anomalous vehicle detection approach is sensitive enough, and it achieves a good performance when the length of the time series is small. The trade-off is, however, that a relatively longer time series provides more information, which is potentially more reliable in some application scenarios. Hence, N and L should be adjusted accordingly for different scenarios. Meanwhile, larger values of N and L also imply longer detection delays, which must be taken into consideration, especially for delay-sensitive, mission-critical tasks.
Figure 8 plots the ROC curve, which illustrates the trade-offs between the false alarm rate and the detection rate. It is very encouraging that, when the detection rate of the proposed approach reaches 100%, the false alarm rate is still low. For example, as shown in Figure 8, when the number of anomalous vehicles is below eight, the operator can set the threshold to obtain a detection rate of 98% while the false alarm rate stays below 15%.
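The trade-off between the detection rate and the false alarm rate can be explored by sweeping the threshold h over a set of anomaly scores. The sketch below is a generic illustration of how such ROC points are computed from scores and ground-truth labels. It is not the evaluation code of the paper, and `roc_points` is a hypothetical helper.

```python
import numpy as np


def roc_points(scores, is_anomalous, thresholds):
    """Return (h, false_alarm_rate, detection_rate) for each threshold h.

    A vehicle is flagged when its anomaly score exceeds h.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(is_anomalous, dtype=bool)
    n_pos = max(int(labels.sum()), 1)       # avoid division by zero
    n_neg = max(int((~labels).sum()), 1)
    points = []
    for h in thresholds:
        flagged = scores > h
        detection = np.sum(flagged & labels) / n_pos
        false_alarm = np.sum(flagged & ~labels) / n_neg
        points.append((float(h), float(false_alarm), float(detection)))
    return points
```

Lowering h raises the detection rate at the cost of more false alarms, so the operator can pick the point on the curve that suits the application.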

CapNet-Based Anomalous Behavior Interpretation
In the experiment, a subset of data from NGSIM is extracted, and vehicle trajectory time series data are from five channels: local_x, local_y, vehicle velocity, vehicle acceleration, and vehicle movement direction.
There are nine sample trajectories in Figure 9, and the legend on the right shows the vehicle IDs. The x-axis and y-axis are the coordinates of the vehicle locations within the image. The moving direction is from the top to the bottom. In the recorded video, vehicle 343 made a right turn, and vehicles 332, 338, 340, 342, 347, 354, and 355 are going forward in different lanes. Vehicle 345 is the anomalous one. In the video stream, it is observed that vehicle 345 turned left to merge into the lane from a left-turn only exit, which is illegal. Therefore, the entire subset of trajectories is labeled with five behaviors: turning right; moving forward in lane one, lane two, or lane three; and the abnormal behavior, which is turning left from a wrong exit. Two capsules networks are trained. CapNet-1 takes individual trajectories as input, and its training data set covers the four legal driving behaviors but not the abnormal one. The goal is to check whether the behavior detected using mSSA can be recognized as any of the normal driving behaviors. CapNet-2 takes multiple trajectories as input and predicts whether the detected behavior, given its presence, is agreed upon by the other vehicles.
The time window length is 12.5 s, and there are five channels for each vehicle with different characteristics. With a data point every 0.1 s, each channel holds 125 samples, so the overall length of a trajectory is 5 × 125 = 625, which is converted into a 25 × 25 matrix as the model input. There are 8000 trajectories in the training data set, while the test data set contains 2000 trajectories. The number of epochs is six, and the kernel size of the convolutional layer is six with a stride of one. In the primary capsules layer, there are 32 capsules, and each one contains eight convolutional units. At the behavior capsules layer, there are eight capsules, and four classes are used for prediction. The model is trained on a virtual machine with 4 GB of memory. Figure 10 shows the training and testing time consumed per trajectory. It takes around 0.225 s to train the model on one trajectory, while testing takes under 0.050 s. Figure 11 shows the training and testing accuracy. After three epochs, the training accuracy reaches 100%. The testing accuracy after epoch three also reaches 90% to 100%. The experimental results show that the proposed approach can be deployed on edge nodes with limited computational resources and achieve a good performance.
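The layer sizes implied by these settings can be checked with a few lines of arithmetic. The sketch assumes unpadded convolutions, which the text does not state, and uses the 9 × 9 kernel with stride two given for the primary capsules layer in the CapNet 1 description. The function names are hypothetical.

```python
def conv_output_size(size, kernel, stride):
    """Output width of an unpadded convolution along one dimension."""
    return (size - kernel) // stride + 1


def capnet1_shapes(input_size=25, conv_kernel=6, conv_stride=1,
                   caps_kernel=9, caps_stride=2,
                   caps_channels=32, caps_dim=8):
    """Feature-map sizes for the convolutional and primary capsules layers."""
    conv_map = conv_output_size(input_size, conv_kernel, conv_stride)
    caps_grid = conv_output_size(conv_map, caps_kernel, caps_stride)
    return {
        "conv_map": conv_map,                  # side of the conv feature map
        "caps_grid": caps_grid,                # side of the capsule grid
        "num_primary_capsules": caps_grid * caps_grid * caps_channels,
        "capsule_dim": caps_dim,
    }
```

Under these assumptions, the 25 × 25 input yields a 20 × 20 feature map after the first convolution and a 6 × 6 grid of 32 capsule channels, i.e., 1152 eight-dimensional primary capsules.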

Conclusions
In this work, a novel edge-computing-enabled Smart Urban Traffic Monitoring system named SurMon is investigated. Compared with cloud computing, edge computing facilitates data processing, anomaly detection, pattern recognition, and intelligent decision-making at the network edge with low latency. However, due to the limited computing capabilities of edge nodes, lightweight algorithms are required. A novel mSSA anomaly detection approach is proposed to catch vehicles with abnormal behaviors. Different from traditional ways that identify change points in the time domain, the SurMon system detects anomalies in the space domain. Because anomalous vehicles are identified by detecting changes or differences in behaviors, pre-defined normal patterns are not required. Consequently, the approach relieves the burden of collecting and labelling a huge amount of traffic data, which is labor intensive and impractical. In addition, a cascaded Capsules Network is adopted to interpret these behaviors and decide whether they are legal moves or violations.
A fundamental assumption of our mSSA-based anomaly detection algorithm is that most vehicles on roads follow traffic rules and behave legitimately, and outliers are very likely violators. Consequently, it is reasonable to identify vehicles that present different characteristics from others as objects of interest, and more analysis should be conducted on them. In situations where a significant number or even the majority of the vehicles on the road do not follow traffic rules, our mSSA algorithm may not be an ideal candidate. In such cases, a rule-based target-tracking or classification approach using ML is more appropriate.
The results reported here are still at an early stage toward a complete and mature smart urban traffic surveillance application. Several typical driving behaviors are selected to validate the feasibility of the SurMon system. While the experimental results are very encouraging, many open problems are yet to be tackled. In this paper, we focus on the system design, and the experimental study is mainly used to validate the feasibility of the system. Therefore, we tested the effectiveness of mSSA and the multi-target tracking algorithms at the edge for the purpose of detecting suspicious or anomalous vehicles, and then switched to the CapNet for behavior interpretation. Currently, we are extending the implementation of the CapNet to test its performance on the suspicious behavior detection part, the success of which will provide an alternative solution for devices that are not able to handle the mSSA algorithm.
Our other ongoing efforts follow two main directions. The first is to collect more real-world data and gain a comprehensive understanding of driving behavior on the road. The second concerns security and privacy: since vehicle information such as trajectories is very critical, the security and privacy of drivers should not be compromised. However, edge nodes that facilitate the traffic surveillance tasks will not always be on a white list, and their mobility also prevents the creation of such a white list. Blockchain is a promising candidate to solve this problem, as it allows multiple nodes to reach a consensus within a trustless environment.

Data Availability Statement: Not applicable, the study does not report any data.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: