Traffic Anomaly Prediction System Using Predictive Network

Abstract: Anomaly anticipation in traffic scenarios is one of the primary challenges in action recognition. It is believed that greater accuracy can be obtained by using semantic details and motion information along with the input frames. Most state-of-the-art models extract semantic details and pre-defined optical flow from RGB frames and combine them using deep neural networks. Many previous models failed to extract motion information from pre-processed optical flow. Our study shows that optical flow provides better detection of objects in video streaming, which is an essential feature in further accident prediction. In addition to this issue, we propose a model that utilizes a recurrent neural network which instantaneously propagates predictive coding errors across layers and time steps. By assessing over time the representations from the pre-trained action recognition model for a given video, the use of pre-processed optical flows as input becomes redundant. Based on the final predictive score, we show the effectiveness of our proposed model on three different types of anomaly classes, namely Speeding Vehicle, Vehicle Accident, and Close Merging Vehicle, from the state-of-the-art KITTI, D2City and HTA datasets.


Introduction
Anything that is radically different from normal behavior may be considered as anomalous, such as appearance of cars on footpaths, an abrupt dispersal of people in a crowd, a person unexpectedly slipping when walking, careless driving, or bypassing signals at a traffic junction. The availability of public video datasets significantly improved the research outcomes for video processing and anomaly detection [1]. Anomaly detection systems are usually trained by learning the expected behavior of the traffic environments. Anomalies are typically categorized as point anomalies [2], contextual anomalies [3], and collective anomalies [4].
Development towards driverless vehicles has drawn increasing attention and made significant progress in the past decade [5,6]. While this advancement provides convenience to people and addresses the emerging needs of industry, it also raises concerns about traffic accidents. As a result, there is a need for further advances towards accident prediction using the time and frame components of video clips. Given this objective, our work seeks to demonstrate the power of PredNet (Predictive Network) [7] for accident anticipation on the HTA (Highway Traffic Anomaly), KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) and D2city (Didi Dashcam City) [8][9][10] datasets. Specifically, these datasets consist of dashcam videos captured from vehicles driving in several traffic scenarios. The videos in these datasets show that not only is the camera moving, but other vehicles and background features are also varying. The datasets cover three classes of anomalies: speeding vehicle, vehicle accident and close merging vehicle. However, labelled data is expensive and time-consuming to obtain. Furthermore, recent crowdsourced datasets are of questionable quality. Unsupervised learning is therefore a promising direction.
Motion and the temporal component of the videos play a significant role in traffic anomaly anticipation as compared to systems working only on still images. Video prediction is one of the possible ways of learning from unlabeled data [11][12][13]. For this reason, most current advanced models [14][15][16] extract the optical flow from contiguous video frames, and then use LSTM (Long Short-Term Memory networks), RNN (Recurrent Neural Networks) or feedforward networks to ingest sequences [17][18][19][20]. Regarding pre-processed optical flow, it is observed that it provides no motion information for models. Instead, it produces more semantic information, since optical flows can be perceived as object masks. Two observations from the experiment described in [21] demonstrate this: (1) optical flow is invariant to appearance and permits models to recognize the action without assessing object color; (2) small object motions and the boundary accuracy of optical flow are closely correlated with the performance of action recognition. In this paper, we examine and extend the Predictive Coding Network (a deep neural network architecture), designed on the principles of PredNet [7]. It appears that in order to accurately predict how a visual world will change over time, the model needs to learn about the object structure and the possible transformations that an object might undergo [7].
In this work, the aim is to analyze neural network architectures that collect motion data directly from video frames. The model can also learn to focus on regions that change between consecutive frames, which is more sample efficient, as it enables the model to learn from far fewer data samples. PredNet trains the model at the pixel level to predict the next frame of a video. We therefore design a new combined model involving a CNN (Convolutional Neural Network) and PredNet. The CNN processes the RGB (Red Green Blue) frames, while PredNet takes the features from the CNN and predicts anomalies in later features.
Anomaly and accident prediction using optical flow often requires a high processing time. Our results show that even without optical flow, it is possible to achieve competitive results in comparison to cutting-edge models that pre-compute the optical flow fields from video frames. According to our hypothesis, extracting motion information directly from the video frames for action recognition without optical flow is effective: for PredNet, the bottom-up and top-down deep recurrent connections resemble the way optical flow is generated, yet they are more informative since the pixel movements are captured in the (PredNet) recurrent process. Our proposed model (FWPredNet) is a combination of the CNN model GoogleNet with Swish [22] and PredNet, and performs action recognition tasks without directly using optical flow. PredNet takes as input the features extracted by the CNN and predicts the features at the next time step. The concatenated features from the CNN and PredNet are then fed to the action prediction classifier, and results are drawn based on the prediction score.
With the reported method, the model achieves competitive results compared to other state-of-the-art models without using pre-computed optical flow from video frames as an input. The technical contributions of this work can be summarized as follows:

• We combine unsupervised video prediction, i.e., PredNet, and supervised action classification. Our novel idea is to predict future frames based on features extracted using CNN with their labels.
• The model just uses video frames as input and does not require pre-processed optical flows. This approach predicts and propagates error at the feature level, rather than at the pixel level.
The remainder of the paper is structured as follows. Section 2 reviews related work to establish the current advanced methods. Section 3 introduces the proposed framework for accident anticipation. The implementation details and experimental evaluation of the proposed framework are presented in Section 4. The conclusion and future work section summarizes the findings and outlines future directions.

Related Work
Here, we briefly discuss some of the approaches, distinguishing between those proposed before and after deep neural network models were introduced in the domain of anomaly and accident prediction. A vehicle tracking algorithm based on spatiotemporal Markov random fields has been proposed for the detection of traffic accidents at intersections [23]. The model can track individual vehicles robustly, without being affected by occlusion and clutter, which are the main characteristics of most busy intersections. Similarly, spot sensors were used as the basis for incident detection systems [24]; however, their scope tends to be rather limited for anomaly detection systems.
Vision-based systems are widely used in a variety of applications, primarily because of their superior event recognition capabilities. Such systems make it simple to extract information about traffic jams, traffic violations, accidents, and other relevant events. Static CCTV cameras are used to detect vehicles on a highway in [25]. To detect traffic violations at intersections, Lai et al. [26] proposed a method, deployed on the roads of Hong Kong, that detects vehicles moving while the traffic light is red. Features aggregation was proposed by Fatima et al. [27] along with DSA, in contrast with the interaction between agents. A distributed algorithm based on anomaly detection was proposed for predicting and classifying traffic abnormalities in a variety of traffic scenes [28]. Using image processing technology, Ikeda et al. [29] were able to detect abnormal traffic incidents automatically. Their method is capable of detecting several types of traffic anomalies, including slow-moving vehicles, stopped vehicles, fallen objects, and vehicles attempting to change lanes abruptly. Michalopoulos et al. [30] built an Autoscope video-based prediction system to track incidents, with a method proposed to track accidents up to 2 miles away. An accident detection algorithm was also proposed that utilizes the features of moving vehicles to automatically detect and record images of an accident scene before and after the accident occurs [31].

Video Prediction Learning
Learning can roughly be divided into three groups: supervised, semi-supervised, and unsupervised [5,21-23]. Learning normal behavior is crucial not only for the identification of abnormality but also for numerous other applications. Wei et al. [32] used Faster R-CNN to detect vehicles and introduced an unsupervised anomaly detection system based on background modelling. Liu et al. [33] proposed a region-aware deep model which extracts discriminative local features from a series of local regions of a vehicle. A generative adversarial network-based approach was introduced that uses normal data to detect anomalies in images [34]. It is reported that certain algorithms used multiple camera systems employing a system of video sensors to identify stationary vehicles. A method combining vision-based tracking with Intersection over Union would allow for a multi-object tracking technique that could be scaled to detect traffic anomalies [35]. In [36], a machine learning technique was presented to analyze traffic behaviors and detect the location of collision-prone vehicles with high precision. Similarly, Yun et al. [37] applied a motion interaction interface to analyze abnormal behavior and detect traffic accidents. Finn et al. [12] produced a model of pixel-level motion by predicting a distribution over pixel motion from previous frames, conditioned on a robot's future actions. Srivastava et al. [38] showed that unsupervised prediction of future sequences of high-level frame representations improved video classification results. Villegas et al. [39] analyzed both raw frames and their high-level representations, namely the corresponding human poses, to predict future frames. A method for assisting people in everyday work was implemented on a real robotic framework in [40], and the trajectories of the vehicle were used to predict the intent for lane shift or rotation.
Another approach used multi-camera sensory fusion to explore future frame prediction using GPS and vehicle dynamics [41]. However, several strategies presume that the valuable cues always arrive at a fixed time before the action. Furthermore, there are two exceptions that applied HMM and input-output RNN to model the temporal order of cues [38]. Some illustrations are shown in Figure 1.
Figure 1. Topic-based model anomaly detection [43]: a car that has crossed the stop line appears in the top row, the middle row is a hybrid, and a vehicle taking an odd turn is in the bottom row. (c) Multi-instance learning (MIL) anomaly detection in a real-time example [44]; an anomaly ranking determines the detection of anomalies. (d) A vehicle on a walkway identified with the STAN method [45]: the top row is the generator's anomaly visualization, and the lower row is the discriminator's anomaly visualization.

Predictive Coding in Anomaly and Accident Prediction
Lotter et al. [7] described PredNet, a network that learns to predict future frames by making local predictions in each layer and forwarding deviations from those predictions to the following network layers. The authors affirm that PredNet learns internal representations of the objects and is capable of capturing important features. Their work served as a starting point for a number of other researchers. For example, AFA-PredNet was developed, incorporating motor action as an additional signal that modulates the top-down generative process through an attention mechanism [46]. Another method proposed a hierarchical artificial predictor that uses different timescales of prediction for different levels of hierarchical coding, defined by the neurons' temporal parameters [47]. Han et al. [48] extended this model by creating a bidirectional, dynamic neural network with local recurrent processing, referred to as a predictive coding network. To retrieve cues efficiently, the presented work adopts advanced techniques for extracting moving items [49] and represents them by learned deep features as observations in our FWPredNet model. Finally, we evaluate anomaly and accident anticipation on three separate large dashcam video datasets.

Related Datasets
Dashcam videos, or videos from vehicle-mounted cameras, have been gathered to study a variety of localization and recognition challenges, such as the semantic understanding of urban scenes, e.g., Daimler Urban Segmentation [50], Leuven [51], and CamVid [52]. In addition, three recently compiled large-scale datasets are used in our current work: HTA [8], KITTI [9], and D2city [10]. These are highly reputed datasets used for video tasks such as object detection, multi-object tracking, monitoring, semantic segmentation, visual odometry, and the detection of road/lane objects. The videos are often taken by vehicles with the same equipment under normal driving conditions (no collisions or accidents). Most of the datasets used for anomalous action recognition consist of videos taken in rural areas and on roads around middle-sized towns. A broad dashcam dataset [53] was published to test semantic segmentation. Its images were captured in 50 different cities; 5k frames are labeled with detailed semantic annotations and 30k frames with coarse ones. While the data collection is diverse, most frames are still captured in a typical driving environment.

Motivation
Anomaly anticipation in surveillance video is defined as the detection of rare events that do not conform to events happening in normal situations. We base our study on the following critical requirements for a successful video anomaly detector:
• Extract important features from the video sequence.
• Decode the feature maps and calculate the final prediction score with the IOU function for anomaly and accident prediction.
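As a brief illustration of the IOU (Intersection over Union) function mentioned in the second requirement, the sketch below computes it for two axis-aligned boxes. The text does not specify which regions are compared, so the box representation `(x_min, y_min, x_max, y_max)` and the function name `iou` are assumptions made only for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max).

    The box format is an assumption for illustration; the paper does not
    state which regions its IOU score compares.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
score = iou((0, 0, 2, 2), (1, 1, 3, 3))
print(round(score, 4))
```

A score of 1.0 means the two regions coincide and 0.0 means they do not overlap at all.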

Materials and Methods
In this section, we briefly explain our FWPredNet architecture for anomaly anticipation, the classification module, and future frame prediction. Figure 2 depicts the PredNet architecture [7]. The network structure is built from weighted hierarchical layers. Each component of the network tries to make local predictions about its input. The difference between this prediction and the actual input is then transferred up the hierarchy to the subsequent layer. Information flows through the network in three ways: (1) the error signal flows from bottom to top, as indicated by the red arrows on the right side of Figure 2; (2) the prediction signal flows from top to bottom, as indicated by the green arrows on the left; (3) there is a constant flow of local error signals and prediction estimation signals within individual layers. Each layer contains four units: an input convolution unit ($A_i$), a recurrent representation unit ($R_i$), a prediction unit ($\hat{A}_i$), and an error computation unit ($E_i$), as indicated in Figure 2.

PredNet Architecture
The representation unit ($R_i$) is built from ConvLSTM [54]; it estimates the input for the next time step, and its output is fed into the prediction unit ($\hat{A}_i$), which consists of a convolution layer that produces the prediction. The error unit ($E_i$) calculates the difference between the prediction and the input. To add further nonlinearity, the error is divided into positive and negative error populations. This error is subsequently passed on to the next layer as its input. The representation unit receives a copy of the error signal (red arrow) and the upsampled input (green arrow) from the representation unit of the higher level, and uses them in conjunction with its recurrent memory to make predictions.
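The split of the error into positive and negative populations can be sketched in a few lines. The example assumes the rectified form used in the original PredNet [7], where one population is ReLU(A − Â) and the other is ReLU(Â − A). Plain Python lists stand in for feature maps, and the function name `error_unit` is illustrative.

```python
def relu(x):
    """Rectified linear unit for a single value."""
    return x if x > 0.0 else 0.0


def error_unit(actual, predicted):
    """Split the prediction error into positive and negative populations.

    Returns (positive, negative), where positive[i] = ReLU(actual - predicted)
    and negative[i] = ReLU(predicted - actual). Flat lists are used here as a
    stand-in for the feature maps of a real network.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    positive = [relu(a - p) for a, p in zip(actual, predicted)]
    negative = [relu(p - a) for a, p in zip(actual, predicted)]
    return positive, negative


actual = [0.2, 0.8, 0.5]
predicted = [0.5, 0.3, 0.5]
pos, neg = error_unit(actual, predicted)
print(pos, neg)
```

Both populations are non-negative, and at most one of them is non-zero at any position; in a network they would be stacked along the channel dimension before being passed to the next layer.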

Proposed Method
We use a pre-trained CNN model to extract the features at frame level, and then a proposed version of PredNet [7], i.e., FWPredNet is used to predict the feature representation of the next video frame. Our idea is to condition the future frame predictions on predicted action class labels. The following section describes the functionality of the proposed FWPredNet architecture.

Semantic Feature Extraction
Our proposed work considers the same network as GoogleNet [22]. However, we rearranged the layers according to our needs, setting the convolution stride to two and adding seven pooling layers. The inception module was reduced to six layers. The datasets (HTA, KITTI, D2city) consist of video images with different resolutions. To unify the training procedure, all images were resized to [224,224] and fed into the classifier as input. In the resulting maps, of size [2048 × 3 × 3], the image is represented at one eighth of its original size. This helps us to detect even small objects with fine-grained detection while maintaining a low computational load. We use these feature maps as the basis for detection and recognition.
To address the aforementioned situations, we utilize the "Swish" activation function proposed in [22]. Suggested by studies from Google Brain, it is a reasonably simple function in which the input $\kappa$ is multiplied by its sigmoid (the Swish activation function is shown in Figure 3). The mathematical definition [22] of Swish is given in (1):
$$f(\kappa) = \kappa \cdot \mathrm{sigmoid}(\kappa) = \frac{\kappa}{1 + e^{-\kappa}} \qquad (1)$$
The Swish function is graphically smooth; unlike ReLU, it does not abruptly change course close to $\kappa = 0$. Instead, it falls smoothly from 0 to below 0 and soon thereafter rises. This behavior also indicates the non-monotonic nature of the function: unlike ReLU, it does not remain flat or move in a single direction. According to the experimental study performed in [22], Swish tends to work better than ReLU on deeper models and more complex datasets, at comparable computational cost.
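To make the smooth, non-monotonic shape concrete, the following minimal sketch evaluates Swish, taken here as the input multiplied by its sigmoid, next to ReLU at a few points. The sample points are arbitrary choices for illustration.

```python
import math


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def swish(x):
    """Swish activation: the input multiplied by its sigmoid."""
    return x * sigmoid(x)


def relu(x):
    """ReLU activation, shown for comparison."""
    return max(0.0, x)


# Swish dips below zero for negative inputs and then rises again,
# whereas ReLU is exactly zero for every negative input.
for x in (-5.0, -1.278, -0.5, 0.0, 1.0):
    print(f"x={x:+.3f}  swish={swish(x):+.4f}  relu={relu(x):+.4f}")
```

The dip is deepest a little below x = −1, which is why the value at −1.278 is lower than the values at both −5 and −0.5.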

FW-PredNet Architecture
The basis of the theory of predictive coding is that the brain endlessly generates top-down predictions of bottom-up input. The representation at a higher level predicts the representation at the level below it. A difference between the predicted and the actual representation produces a prediction error. This error propagates to the higher levels to update their representations in order to attain an improved prediction. The process is repeated throughout the hierarchy until the prediction error is reduced, or until bottom-up processes no longer transmit any "new" (unpredicted) information for updating the hidden representations. Predictive coding is therefore a computational mechanism by which the model recursively updates its internal representation of the visual input towards convergence.
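The recursive update described above can be illustrated with a toy example. It assumes a single linear layer in which a higher-level representation `r` predicts the input `x` through a fixed weight matrix `W`, and `r` is repeatedly corrected by the back-projected prediction error. All names, sizes and the update rate are hypothetical and are not taken from the FWPredNet implementation.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def squared_norm(vector):
    """Sum of squared entries."""
    return sum(v * v for v in vector)


def refine_representation(x, W, r, rate=0.1, steps=50):
    """Repeatedly update r so that the prediction W r approaches x.

    Each step computes the prediction error x - W r and moves r along
    W^T error, which is gradient descent on the squared prediction error.
    Returns the final r and the squared error recorded before each update.
    """
    W_t = transpose(W)
    history = []
    for _ in range(steps):
        prediction = mat_vec(W, r)
        error = [xi - pi for xi, pi in zip(x, prediction)]
        history.append(squared_norm(error))
        correction = mat_vec(W_t, error)
        r = [ri + rate * ci for ri, ci in zip(r, correction)]
    return r, history


W = [[1.0, 0.0], [0.5, 1.0], [0.0, 2.0]]  # hypothetical top-down weights
x = [1.0, 1.5, 2.0]                       # input that W maps from r = [1, 1]
r, errors = refine_representation(x, W, [0.0, 0.0])
print([round(v, 3) for v in r], errors[0], errors[-1])
```

The squared error shrinks at every step and `r` approaches `[1, 1]`, mirroring the statement that the representation is updated until the bottom-up signal carries no further unpredicted information.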
The input to PredNet [7] is pixel-level image data, and its authors utilized a four-layer architecture. Instead, we use the features extracted from the top layer of the CNN, which contain the semantic information necessary for anomaly detection. In contrast to recognition in still images, the motion and temporal aspects of videos play a significant role in action recognition. Accordingly, the majority of presently available state-of-the-art models use video pre-processing to acquire optical flow fields between contiguous video frames, together with models that can consume video sequences. In our model, rather than propagating errors at the pixel level, we predict the error at the feature level. Moreover, to make the prediction more robust, we attach a classification unit (C) to the top of the representation layer. This unit consists of an encoder section and a decoder section. FWPredNet comprises the core components shown in Figure 4, in which the R, $\hat{A}$, A, E and C units denote the representation, prediction, input, error and classification units, respectively.
Following Wen et al. [53], in FWPredNet the higher-level representation $R_l^t$, where $l$ denotes the layer and $t$ the time step, predicts its lower-level representation $A_{l-1}^t$ via linear weighting with $W_{l,l-1}$, the weights between layer $l$ and layer $l-1$. The prediction error $E_{l-1}$ is the difference between the prediction $\hat{A}_{l-1}^t$ and the actual representation at layer $l-1$.
During the feedforward process, the prediction error at layer $l-1$, $E_{l-1}^t$, is propagated to the upper layer $l$ in order to update its representation $R_l^t$, so that the prediction error is reduced with the updated representation. To minimize $E_{l-1}^t$, the loss is defined as the sum of squared errors normalized by the variance of the representation, $\sigma_{l-1}^2$.
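The feedforward loss described here, and the gradient and update steps that the text refers to as (5) to (10), can be written out explicitly. The block below is a hedged reconstruction rather than the authors' typeset equations: it assumes the linear predictive-coding formulation attributed to [53], column-vector representations, the actual representation at each layer denoted by $R$, and a loss symbol $\mathcal{L}$ and rates $a_l$, $b_l$ introduced only for illustration. The numbering follows the references in the text; the notation and the placement of transposes may differ from the original.

```latex
\begin{align}
% Feedforward: layer l predicts layer l-1 and is corrected by the error
\hat{A}_{l-1}^{t} &= W_{l,l-1}\, R_{l}^{t}, \qquad
E_{l-1}^{t} = R_{l-1}^{t} - \hat{A}_{l-1}^{t} \notag \\
\mathcal{L}_{l-1}^{t} &= \frac{1}{\sigma_{l-1}^{2}}
  \left\lVert E_{l-1}^{t} \right\rVert_2^2 \tag{4} \\
\frac{\partial \mathcal{L}_{l-1}^{t}}{\partial R_{l}^{t}} &=
  -\frac{2}{\sigma_{l-1}^{2}} \left(W_{l,l-1}\right)^{T} E_{l-1}^{t} \tag{5} \\
R_{l}^{t+1} &= R_{l}^{t}
  + \frac{2\alpha_{l}}{\sigma_{l-1}^{2}} \left(W_{l,l-1}\right)^{T} E_{l-1}^{t} \tag{6} \\
R_{l}^{t+1} &= R_{l}^{t} + a_{l}\, W_{l-1,l}\, E_{l-1}^{t},
  \qquad a_{l} = \frac{2\alpha_{l}}{\sigma_{l-1}^{2}} \tag{7} \\
% Feedback: the top-down prediction pulls the representation at layer l
\hat{A}_{l}^{t} &= W_{l+1,l}\, R_{l+1}^{t}, \qquad
E_{l}^{t} = R_{l}^{t} - \hat{A}_{l}^{t} \notag \\
\frac{\partial \mathcal{L}_{l}^{t}}{\partial R_{l}^{t}} &=
  \frac{2}{\sigma_{l}^{2}} \left(R_{l}^{t} - \hat{A}_{l}^{t}\right) \tag{8} \\
R_{l}^{t+1} &= R_{l}^{t}
  - \frac{2\beta_{l}}{\sigma_{l}^{2}} \left(R_{l}^{t} - \hat{A}_{l}^{t}\right) \tag{9} \\
R_{l}^{t+1} &= \left(1 - b_{l}\right) R_{l}^{t} + b_{l}\, \hat{A}_{l}^{t},
  \qquad b_{l} = \frac{2\beta_{l}}{\sigma_{l}^{2}} \tag{10}
\end{align}
```

Step (7) follows from (6) by substituting $W_{l,l-1} = (W_{l-1,l})^T$, and (10) is (9) with the terms in $R_l^t$ collected, so the feedback update is a weighted average of the current representation and the top-down prediction.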
First of all, and are all initialized as 0 in step one. Then, from the bottom layer zero, after convolution yields as well as , and . Later on, the convolution is produced by as input for the next layer of a same time step. The same principle applies to layer one, where generates by convolution and is derived from subtractions between as well as , and . After convolution is produced by as input for the next layer of a same time step. As at layer two, produces using convolution, whereas is derived via subtractions between as well as , and . This represents the conclusion of the bottom-up process. For the time step one at layer one, based on convolution LSTM, and initialize at layer one for time step two. with and , output representing the bottom-up process. This bottom-up process occurs continuously throughout a time domain.
The gradient of $E_{l-1}^t$ with respect to $R_l^t$ is given in Equation (5). To decrease $E_{l-1}^t$ by gradient descent, $R_l^t$ is updated with an updating rate $\alpha_l$, as given in (6). If the weights of the feedback connections are the transpose of those of the feedforward connections, $W_{l,l-1} = (W_{l-1,l})^T$, (6) can be rewritten as a feedforward operation, as in (7). This operation forwards the prediction error from layer $l-1$ to layer $l$ to update the representation with an update rate of $a_l = 2\alpha_l/\sigma_{l-1}^2$. During the feedback process, the top-down prediction is used to update the representation at layer $l$, $R_l^t$, in order to reduce the prediction error $E_l^t$. As in the feedforward process, gradient descent is used to minimize the error: the gradient of $E_l^t$ with respect to $R_l^t$ is given in (8), and $R_l^t$ is updated with an updating rate $\beta_l$, as in (9). Equation (9) can then be rewritten as Equation (10), which describes a feedback process in which the representation at the higher layer, $R_{l+1}^t$, generates the top-down prediction $\hat{A}_{l+1}^t$ and thereby influences the lower-layer representation $R_l^t$. Relative to the PredNet architecture in Figure 2, we include an additional classification unit (shown in Figure 4), whose encoder is composed of two ConvLSTM layers that convert the input $R_2^1$ into class probabilities.
The decoder is made up of transposed convolution layers that upsample and transform the classes back to image features, which are then fed back into the top-down pathway. The classification unit makes a prediction at each incoming frame. A weighted sum of these prediction scores is calculated and passed through the softmax function to obtain the predicted class probabilities. In the first few frames of a video, the model does not have enough context to make meaningful predictions; the weighting over time is therefore performed with an exponential function, as shown in Figure 5. Predictions in the first few frames are weighted low, while the weights for later predictions stabilize at 1.0.
The number of classes recognized by the classification unit (C) depends on the dataset: 2 classes for KITTI and D2City, and 3 for HTA. These classes are interpreted as labels for future frame prediction. The encoder first transforms the output of the representation unit $R_2^1$ into class probabilities, and the decoder then transforms them back to image modalities.
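The temporal weighting of per-frame predictions can be sketched as follows. The exact exponential used for Figure 5 is not given in the text, so the form `1 - exp(-t / tau)` and the time constant `tau` are assumptions chosen only to reproduce the described behavior: low weights for the first frames and saturation at 1.0. The per-frame scores are made-up values.

```python
import math


def temporal_weight(frame_index, tau=5.0):
    """Weight that starts near 0 and saturates at 1.0.

    The functional form and tau are assumptions for illustration.
    """
    return 1.0 - math.exp(-frame_index / tau)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_predictions(frame_scores, tau=5.0):
    """Weighted sum of per-frame class scores, followed by softmax."""
    num_classes = len(frame_scores[0])
    summed = [0.0] * num_classes
    for t, scores in enumerate(frame_scores, start=1):
        w = temporal_weight(t, tau)
        for c in range(num_classes):
            summed[c] += w * scores[c]
    return softmax(summed)


# Two early frames favour class 0, ten later frames favour class 1.
frame_scores = [[2.0, 0.0, 0.0]] * 2 + [[0.0, 1.0, 0.0]] * 10
probs = aggregate_predictions(frame_scores)
print([round(p, 4) for p in probs])
```

Because the early frames are down-weighted, their confident but poorly informed scores contribute little, and the aggregated probability is dominated by the later frames.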

Network Training Parameters
We use standard stochastic gradient descent (SGD) to train deep networks, with momentum set at 0.75 and weight decay at 0.001 in each case. Initial learning rate is 0.0057 andbatch size is 8. As soon as the validation loss reaches a saturation, 10x reduction of the learning rate is applied. Our FWPredNet model is trained using the action classification layer model for 40 epochs on the KITTI dataset. Our training or finetuning of moments uses only 10 plus epochs due to memory and time constraints. Models are all implemented in Pytorch using a 2080Ti.

Datasets
Our proposed method of analysis is conducted on three different large-scale HTA [8], KITTI [9], and D2city [10] dashcam videos datasets. Furthermore, Figure 6 shows the real video examples of these state-of-the-art datasets. The encoder is composed of two ConvLSTM layers that convert the input R 1 2 into class probabilities. The decoder is made up of transposed convolution layers that upsample and transform the classes back to the image features, which later feed back into the topdown. The classification unit makes a prediction at each in-coming frame. A weighted sum of these prediction scores is calculated and passed through the softmax function to get predicted class probability. In the first few frames of the video, the model does not have enough context to make any meaningful predictions and therefore the weighting over time is performed by using an exponential function, as shown as Figure 5. Notice how predictions in the first few frames are weighted low while weights for later predictions stabilize at 1.0.

HTA
The collection consists of 1280 × 720 dashcam videos recorded from cars in New York and the Bay Area [8]. Videos that are blurred or show visually degraded conditions have been excluded from this subset. In brief, the dataset contains traffic videos recorded in clear, cloudy, or semi-cloudy weather, including cars that are barely visible or blurred (due to large vehicles) as well as traffic moving on the roads. The dataset is not noise-free because of imperfections in data collection; for example, bumps and cracks in the road cause transient shakes in the videos. These features make the dataset more realistic and, at the same time, make anomalies more challenging to handle. The videos show standard driving conditions, characterized in this dataset as conditions in which other vehicles do not disturb the movement of the dashcam. The training set contains 286 regular traffic videos with an average length of 40 s and a total of 321,102 video frames.

KITTI
This selected subset comprises 600 video frames (with a resolution of 375 × 1242 pixels) [9], extracted at a minimum spatial distance of 20 m from the large-scale KITTI dataset. The videos cover 5 different scenarios with comparatively low traffic, i.e., the road is often completely visible. The data chosen for our work was collected from the KITTI website and includes color stereo images, Velodyne laser scans, and GPS information.

D2city
The D2city dataset [10] contains videos of several incident scenarios, such as a motorcycle hitting a vehicle or a car colliding with another car, as shown in Figure 6. Most videos are from different cities in Taiwan. They typically show busy streets with several moving objects and complex road signs, which makes for difficult driving scenes. For each video, we manually annotated the bounding boxes of vehicles, motorcycles, bikes, and persons, as well as the time of the accident. Of the 678 videos, 58 are used for object detector training. The remaining 620 files are randomly picked from 620 positive and 1130 negative clips, each consisting of 60 to 70 frames.

Experiments and Results
We evaluate the proposed approach on several datasets. To better understand what each module at each layer of our model does, we conducted extensive visualization experiments. In this section, we dedicate one paragraph to each of the evaluations on the three datasets (HTA, KITTI, D2city), reporting the average precision (AP), with a figure and a table to aid the discussion of the results.
Because of the lack of datasets dedicated to anomaly anticipation, we took publicly available datasets and repurposed them for this task. For HTA, we labeled the data with three classes, i.e., Speeding Vehicle, Close Merge, and Accident, whereas for KITTI and D2City we used two classes, i.e., Close Merge and Accident. Several baseline methods are given for comparison with our proposed method. We report top-1 accuracy for abnormal event classification. A sequence of images is passed through a CNN, which encodes it into feature maps. For abnormal event classification, the context feature is passed through a spatial pooling layer and a fully connected layer, followed by a multi-way softmax layer at the end. Since we predict future frames (N + 10), the abnormal event is triggered as soon as a frame is classified as one of the labeled classes. In this paper, we have introduced the FWPredNet framework for accident and anomaly anticipation, which outperforms the previous state of the art in classification accuracy on the KITTI, D2city, and HTA datasets.
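The triggering rule and the top-1 metric mentioned above can be sketched in a few lines. This is only an illustration under assumptions: the `Normal` label for frames without an anomaly, the function names, and the toy label sequences are hypothetical and do not come from the paper.

```python
NORMAL = "Normal"  # hypothetical label for frames without an anomaly
ANOMALY_CLASSES = ("Accident", "Close Merge", "Speeding Vehicle")


def first_trigger(frame_labels):
    """Return (frame_index, label) for the first anomalous frame, or None."""
    for index, label in enumerate(frame_labels):
        if label in ANOMALY_CLASSES:
            return index, label
    return None


def top1_accuracy(predicted, expected):
    """Fraction of clips whose top-scoring class equals the ground-truth class."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


# Toy per-frame labels for one clip: the event fires at the first anomalous frame.
labels = [NORMAL, NORMAL, "Close Merge", "Accident"]
trigger = first_trigger(labels)

# Toy clip-level predictions against ground truth.
accuracy = top1_accuracy(
    ["Accident", "Close Merge", "Accident", "Speeding Vehicle"],
    ["Accident", "Close Merge", "Close Merge", "Speeding Vehicle"],
)
```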

Baseline
The baselines used for comparison with our proposed method are discussed in the following subsections.

CGAN
Generative models can learn to predict dense optical flow from normal motion patterns, so that an anomaly can be described as irregular motion [42]. For training, the ground-truth optical flow is computed with the OpenCV optical flow implementation. A GAN can be generalized to a conditional model (CGAN) by conditioning the generator G and the discriminator D on additional information y, such as class labels or data from other modalities [56]. The conditioning is carried out by feeding y to both D and G as an additional input layer. In the generator G, the noise prior p_z(z) and y are combined in a joint hidden representation. The adversarial training framework offers exceptional stability in the format of this hidden representation. In the discriminator D, x and y are presented as inputs to a discriminator function. The conditional objective function is shown in (11):

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))] \qquad (11)$$

The CGAN is trained to anticipate the optical flow between two sequential frames [57]. The inputs to the generator are two RGB images concatenated depth-wise, which are used to predict the optical flow. Abnormal motion is detected from the difference between the optical flow predicted by the generator and the ground-truth optical flow. This difference is averaged over a sliding window, and a frame is considered anomalous if the average exceeds the threshold in either the x or the y component.
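The sliding-window decision rule for this baseline can be sketched as below. It is a simplified illustration under stated assumptions: each frame is summarised by a single mean (x, y) flow vector, and the window length, threshold, and function names are hypothetical values chosen for the toy example rather than settings from the paper.

```python
def moving_average(values, window):
    """Trailing mean over up to `window` most recent values."""
    averaged = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


def anomalous_frames(pred_flow, true_flow, window=3, threshold=1.0):
    """Flag frames whose windowed flow error exceeds the threshold in x or y.

    pred_flow, true_flow: one (x, y) mean flow vector per frame (an assumed
    simplification of a dense flow field).
    """
    dx = [abs(p[0] - t[0]) for p, t in zip(pred_flow, true_flow)]
    dy = [abs(p[1] - t[1]) for p, t in zip(pred_flow, true_flow)]
    avg_dx = moving_average(dx, window)
    avg_dy = moving_average(dy, window)
    return [ax > threshold or ay > threshold for ax, ay in zip(avg_dx, avg_dy)]


# Toy data: the generator predicts steady motion, but the ground-truth flow
# shows a sudden vertical jump in frames 3 and 4.
pred = [(1.0, 0.0)] * 6
true = [(1.0, 0.0), (1.1, 0.0), (1.0, 0.1), (1.0, 4.0), (1.0, 4.0), (1.0, 0.0)]
flags = anomalous_frames(pred, true, window=3, threshold=1.0)
```

Because the error is averaged over a trailing window, the last frame remains flagged even though its own error is zero, which is the smoothing effect the averaging is meant to provide.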

FlowNet
FlowNet [58] compares two architectures, FlowNetSimple and FlowNetCorr; both are end-to-end learning approaches. FlowNetCorr first processes the two input images separately, then merges them in a correlation layer and learns a higher-level representation. The correlation layer performs multiplicative patch comparisons between two feature maps. More precisely, let m_1 and m_2 be two multi-channel feature maps with c channels, height h, and width w. The correlation of two patches, one centered at a_1 in the first map and one centered at a_2 in the second map, is then defined as (12):

$$c(a_1, a_2) = \sum_{o \in [-k, k] \times [-k, k]} \langle m_1(a_1 + o),\, m_2(a_2 + o) \rangle \qquad (12)$$

for square patches of size K = 2k + 1.

Two trends in neural network design for event cameras are described in [59], through recurrent (spiking) variants of EV-FlowNet and FireNet. In that study, the authors propose a novel approach to event-based optical flow estimation using self-supervised learning (SSL), which stresses the ability of networks to integrate temporal information from successive slices of events. In the training pipeline, the self-supervised loss function is reformulated to improve its convexity.
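The patch correlation used by FlowNetCorr can be illustrated with a short plain-Python sketch. It assumes the FlowNet definition, in which the correlation sums the channel-wise dot products over a square patch of half-width k around each centre; the nested-list layout `m[row][col][channel]` and the function name are choices made for this example only.

```python
def patch_correlation(m1, m2, a1, a2, k=1):
    """Sum over offsets o in [-k, k]^2 of the dot product <m1(a1 + o), m2(a2 + o)>.

    m1, m2: feature maps indexed as m[row][col][channel], same channel count.
    a1, a2: (row, col) patch centres; both patches must lie fully inside the
    maps, because negative Python indices would silently wrap around.
    """
    (r1, c1), (r2, c2) = a1, a2
    total = 0.0
    for dr in range(-k, k + 1):
        for dc in range(-k, k + 1):
            v1 = m1[r1 + dr][c1 + dc]
            v2 = m2[r2 + dr][c2 + dc]
            total += sum(x * y for x, y in zip(v1, v2))
    return total


# Toy 3 x 3 feature maps with 2 channels and constant values.
m1 = [[[1.0, 2.0] for _ in range(3)] for _ in range(3)]
m2 = [[[0.5, 1.0] for _ in range(3)] for _ in range(3)]
value = patch_correlation(m1, m2, (1, 1), (1, 1), k=1)
```

Each of the 9 offsets contributes 1.0 × 0.5 + 2.0 × 1.0 = 2.5, so the toy correlation is 22.5. Note that the operation has no trainable weights; it only compares the two feature maps.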

Model Analysis
We design our experiments to analyze each part of our model as follows:

• The last layer of CNN pre-trained on ImageNet is fine-tuned on HTA.
• The last layer of CNN pre-trained on KITTI is fine-tuned on HTA.
• Fix the weights of CNN pre-trained on ImageNet dataset and train the PredNet on HTA.
• Fix the weights of CNN pre-trained on KITTI dataset and train the PredNet on HTA.
Experiments 1 and 2 examine the performance of the CNN with different pre-trained weights and without temporal features. Experiments 3 and 4 show how effective PredNet is with each of the two pre-trained CNN models. Here, we present the results of our analysis for the HTA dataset. After fine-tuning the classification layer on HTA, the CNN pre-trained on the KITTI dataset yields higher accuracy than the model pre-trained on the ImageNet dataset. The accuracy of the first result is 5.71% while the other is 56.2%. This indicates that the features of a model pre-trained for object classification are not a good representation for traffic anomaly classification. If we add PredNet in both scenarios and train it from scratch on HTA, accuracy improves significantly for both pre-trained CNNs, with a 13% increase for the ImageNet pre-trained CNN and an increase of more than 4% for the KITTI pre-trained one. This indicates that our proposed variation in FWPredNet is able to capture additional information from generated video sequences. Table 1 provides the classification accuracy for this analysis of our model.
Figure 7 compares our model's results with the other baselines on the HTA dataset. Taking only RGB frames as input, our model outperforms most of the cutting-edge approaches, namely CGAN [56], FlowNet [58], and PredNet [7], by a large margin. A two-stream model based on a convolutional neural network explicitly uses optical flow as a system input in addition to the RGB frames. That model reaches significantly higher accuracy than our model, which has only RGB frame inputs, as shown in Table 2.

Evaluation on KITTI
We also evaluate our model on KITTI; the anomaly anticipation results can be seen in Figure 8. Table 3 compares our system with state-of-the-art schemes, with all results obtained on the same KITTI dataset. Our model uses only RGB frames as input and achieves 0.578 and 0.724 for the Accident and Close Merge classes, respectively.

Evaluation on D2city
Compared to other models, our model performs better in both accident prediction and close merge detection on D2City, as shown in Figure 9. Our architecture uses optical flow as an input to the two-stream model. The model achieves significantly higher accuracy than the other models, as shown in Table 4.
Figure 9. Played from left to right, the frames show successful anomaly and accident anticipation on the D2city dashcam video dataset.

Discussion
We implemented a model, informally named FWPredNet, that not only outputs classification results but also conditions future predictions on its previously predicted class labels. For all the experiments above, we report the top-1 accuracy for supervised action classification on KITTI, HTA, and D2City in the rightmost column. The training and testing splits of KITTI, HTA, and D2City are presented. Tables 2-4 present the top-1 accuracy for action classification on the HTA, KITTI, and D2City datasets, respectively.
Furthermore, we evaluated vanilla PredNet and compared it to our proposed model. Compared with traditional models, our model performs better in both the Accident and Close Merge classes but struggles with speeding vehicles, because the learning ability of the model is sensitive to the continuity of motion. Our model outperforms FlowNet by 32.96%, with 7.4% in accident detection, and marginally outperforms FlowNet by 3.03% on HTA, owing to the type of dataset and to the model's sensitivity to the continuity of motion. Several of the presented techniques for extracting moving objects are adopted in order to retrieve cues efficiently [53]; these features are then represented as observations in our FWPredNet model by learning deep features.
In comparison with previous studies, Villegas et al. [39] predicted the future using both raw frames and a high-level representation of each frame, namely the corresponding human pose. However, this only works with static backgrounds and requires labelled pose information. Vondrick et al. [13] extended this work and continued the tradition of encoding images at a higher level than pixels; to anticipate objects and actions, they applied recognition algorithms to the predicted representation. In contrast, the PredNet [7] model that we extended, informally named FWPredNet, learns directly from pixel space and works with videos that have dynamic backgrounds and real-world settings. Moreover, this model is a neuroscience-inspired framework that can learn features at different hierarchical levels without being specifically tuned to do so.
A parameter comparison of the baselines (CGAN, FlowNet, and PredNet) with our proposed FWPredNet is shown in Table 5. The vanilla PredNet model has approximately 3.9 M fewer parameters than the FWPredNet model, but FWPredNet performs significantly better in every class and dataset. This is a compromise we have to strike to obtain better classification performance from PredNet. Hence, we accepted a trade-off of accuracy over computational cost, which will be interesting to explore further in future work.

Conclusions and Future Work
• In this paper, we have introduced the FWPredNet framework for accident and anomaly anticipation, which outperforms the previous state of the art in classification accuracy on the KITTI, D2city, and HTA datasets.
• It can be deduced that our proposed variation in FWPredNet is able to capture additional information from generated video sequences when we train the PredNet from scratch on the given dataset.
• We evaluated vanilla PredNet [7] and compared it to our FWPredNet model. Compared with traditional models (CGAN, FlowNet, PredNet), FWPredNet performs better in both the Accident and Close Merge classes. One limitation is that it struggles with speeding vehicles, because the learning ability of the model is sensitive to the continuity of motion, but it still achieves better results than the other three methods.
• Finally, on the engineering front, the current implementation of FWPredNet takes a very long time to train, and work can be done towards more efficient usage of GPUs. FWPredNet has 3.9 M more parameters than the vanilla PredNet, but we accepted this as a trade-off of accuracy over computational cost. A successor to FWPredNet can be designed that does not have the aforementioned limitations and is faster to train.