An Efficient Anomaly Recognition Framework Using an Attention Residual LSTM in Surveillance Videos

Video anomaly recognition in smart cities is an important computer vision task that plays a vital role in smart surveillance and public safety but is challenging due to its diverse, complex, and infrequent occurrence in real-time surveillance environments. Various deep learning models use significant amounts of training data without generalization abilities and with huge time complexity. To overcome these problems, in the current work, we present an efficient light-weight convolutional neural network (CNN)-based anomaly recognition framework that is functional in a surveillance environment with reduced time complexity. We extract spatial CNN features from a series of video frames and feed them to the proposed residual attention-based long short-term memory (LSTM) network, which can precisely recognize anomalous activity in surveillance videos. The representative CNN features with the residual blocks concept in LSTM for sequence learning prove to be effective for anomaly detection and recognition, validating our model’s effective usage in smart cities video surveillance. Extensive experiments on the real-world benchmark UCF-Crime dataset validate the effectiveness of the proposed model within complex surveillance environments and demonstrate that our proposed model outperforms state-of-the-art models with a 1.77%, 0.76%, and 8.62% increase in accuracy on the UCF-Crime, UMN and Avenue datasets, respectively.


Introduction
In the 21st century, one of the leading causes of lost lives and property is the surge in the crime rate, as compared to other issues [1]. An intelligent video surveillance system is a preferred solution for the quick and early detection of such unusual events. Anomalous event recognition in surveillance videos demands much attention due to its vast applications in many domains, including crime prevention, automated intelligent visual monitoring, and traffic security [2]. To avoid mishaps and ensure public safety, a vast number of surveillance cameras have been deployed in private and public places over the past few decades for effective real-time monitoring. However, most of these cameras provide only passive recording services and lack monitoring capabilities. The volume of these videos increases each minute, making it laborious for human experts to understand and analyze them. Similarly, surveillance analysts must wait for hours to capture or witness anomalous events for instant reporting. Due to the rareness of real-world anomalous events, video anomaly recognition has previously been investigated as a one-class classification problem [3][4][5], i.e., the model is trained on normal videos, and in the test set, a video is classified as anomalous when abnormal patterns are encountered. It is not feasible to accumulate all the usual events of real-world surveillance in a single dataset. Hence, various normal behaviors might stray from the normal events in the training set and eventually generate false alarms.
Recognizing anomalies in surveillance videos is an extremely hard and challenging task, for reasons including a subjective definition of anomaly, an inadequate amount of

•
We propose a light-weight model for anomaly detection that is functional in a real-world surveillance network. We adopted a pretrained model and extracted frame-wise features, followed by a sequential learning mechanism for the precise recognition of anomalous activity.

•
We employed the residual attention-based long short-term memory (LSTM) concept, which can effectively learn temporal context information and precisely recognize anomalous activity. Moreover, using a residual attention-based LSTM reduces the number of learnable parameters by more than 10% compared with a standard LSTM network.

•
Our proposed model is tested using the challenging University of Central Florida UCF-Crime dataset, outperforming the baseline methods in terms of accuracy with a reduced number of model parameters and a smaller size compared to existing anomaly activity recognition models.
The rest of this manuscript is structured as follows: Section 2 gives a brief overview of the existing techniques of anomaly detection and recognition in the literature. Section 3 is a detailed explanation about the materials and methods that are used for abnormal activity recognition. The model implementation and experimental results, along with the evaluation of the proposed model are discussed in Section 4, followed by the conclusion of the current work in Section 5.

Related Work
Anomaly detection and recognition in surveillance environments have been extensively studied in the existing literature. In this section, the existing anomaly detection techniques are summarized in two broad categories, traditional feature-based techniques and deep learning-based techniques, which are discussed in detail in the following subsections.

Traditional Feature-Based Techniques
Previously, low-level feature-based methods were extensively applied for anomaly detection. These methods are mainly based on three phases: (1) feature extraction, where the low-level patterns are extracted from the training set; (2) learning from the features to characterize the distribution that encodes regular patterns or normal events; and (3) identifying the outliers or isolated clusters as anomalous events. For the feature extraction phase, earlier approaches mostly employ low-level trajectories, image coordinates, and regular patterns [21,22]. However, these techniques do not perform well in crowded or complex scenes with multiple shadows and occlusions, as trajectory-based features mostly fail in such cases. To handle the problems of trajectory features, researchers introduced alternative low-level spatiotemporal features, including the histogram of oriented gradients and the histogram of oriented flow, which are broadly utilized for anomaly detection [23,24]. Taking advantage of spatiotemporal features, Zhang et al. [25] used a Markov random field for modeling the usual events. Kim and Grauman [26] proposed a system that used a Markov random field model to detect unusual events in videos. To learn the normal patterns of each event at a local node, they captured the distribution of the continual optical flow observation and atomic motion patterns using a mixture of probabilistic principal component analyzers. Another study proposed detecting the frequently occurring local histograms by an exponential distribution of the optical flow [27]. The authors in [28] proposed a Gaussian mixture model-based technique to integrate dynamic textures. Dictionary learning and sparse coding is a well-known technique used to encode normal patterns and detect abnormal events [6,29,30].
The core idea of these techniques is that the usual patterns are characterized on the basis of a dictionary, which is used to encode the normal patterns in the training set. Consequently, the patterns are considered usual/normal when their reconstruction error is low, while the pattern is seen as abnormal/unusual when its reconstruction error is high. The main drawback of these methods is that optimizing the sparse coefficients is generally time consuming.

Deep Feature-Based Techniques
In the current era, deep feature-based models have achieved great success in numerous domains of nonlinear high-dimensional data, such as activity recognition [31] and video summarization [32], among many others [33,34]. Most of the previous literature is based on semi-supervised anomaly detection techniques in which the model is trained on normal data. Liu et al. [18] proposed a framework that used a convolutional neural network (CNN) as an encoder to encode the video frames, and ConvLSTM was applied to detect anomalous events. Their encoder efficiently encodes the changes in motion for detecting anomalies in a surveillance environment. Similarly, Parab et al. [35] introduced a system based on a CNN and LSTM to detect unusual situations at an automated teller machine. In this model, the frame-level features are extracted from the videos and then fed to a bidirectional LSTM to classify abnormal events at an automated teller machine. In our previous work, we used deep CNN features from a series of frames and passed them through a multilayer bidirectional LSTM to learn the spatiotemporal information of the input video and detect abnormal events [36]. Luo et al. [18] suggested a convolutional LSTM with an autoencoder-based model for anomaly detection in videos. Additionally, they extended their work using a stacked recurrent neural network (RNN) with an autoencoder to detect anomalies. Hasan et al. [37] recommended a system for anomaly detection based on a convolutional autoencoder, followed by an RNN. Liu et al. [14] introduced a system in which the fusion of a temporal and a spatial detector is used to detect anomalies in videos. In this model, the discriminant saliency detector and a set of dynamic texture features are modeled as normal events from the training data. Liu et al. [14] introduced a model for future frame prediction for anomaly detection that prevents identity mapping and thereby improves anomaly detection performance.
Additionally, generative models are one of the popular techniques that are utilized to detect abnormalities in videos. Sabokroul et al. [38] suggested generative adversarial networks (GANs) to detect anomalies in the surveillance environment. In this model, they use GANs with discriminator and generator methods to learn the normal distribution. Deng et al. [39] introduced a model called the "Spatio-Temporal Autoencoder", where they applied a deep neural network to extract both temporal and spatial features from videos. Furthermore, they introduced weight-reducing projection loss to predict future frames effectively and learn motion features in videos. Cheng et al. [40] introduced a clustering-based deep autoencoder to produce efficient information within usual events. Spatiotemporal feature regularity is learned using two modules. In the first module, the spatial autoencoder manages the last individual video frame, and the second module is a temporal autoencoder, which operates and constructs the RGB difference from the rest of the frames. To detect anomalies in videos, supervised learning-based techniques have been well studied over the past few years. Recently, weakly supervised-based state-of-the-art techniques for video labeling have been recommended in studies [12,41], where the detection of anomalous events is performed using C3D [42] and multi-instance learning (MIL) [43,44]. Sultani et al. [12] proposed a framework based on weak video labels using deep features and the MIL approach to detect anomalous events. This paradigm is trained on both normal and abnormal videos by generating two different bags for usual and unusual events, and the MIL method was applied to detect anomalous event scores in the videos. Tan et al. [45] introduced an anomaly detection technique which efficiently used sparse components and hyperspectral image pixel decomposition into lower ranks. 
Furthermore, they used a spatial constraint for lower ranks, which uses a single or multiple local window technique to represent and smooth the coefficients for effective anomaly detection. Following the rank-based technique [46], an abnormal event detection framework was introduced that uses MIL with a graph-based technique to represent the normal and abnormal events. Zhu et al. [13] proposed a temporal augmented network with MIL by incorporating an attention block and achieved state-of-the-art performance for normal and abnormal event detection. Kuldeep et al. [47] suggested an anomaly recognition system known as "DEARESt". This system is based on two feature streams: one uses CNN-based features while the other uses motion features separately.

Materials and Methods
In this section, we discuss the overall structure of our proposed model and its key elements, which are presented in Figure 1. The proposed system is divided into three key phases: the surveillance video frames are passed through the pretrained light-weight CNN model to extract features; we generate a feature vector from a series of 30 frames of the video; and this feature vector is fed to the residual LSTM to recognize anomalous activities in a real-world environment. Each phase of the model is discussed in detail in subsequent sections.
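
The grouping of video frames into 30-frame sequences can be illustrated with a small sketch. The function name `frame_windows` and its `stride` parameter are hypothetical, and the use of non-overlapping windows (a stride equal to the window length) is an assumption, since the text specifies only the sequence length of 30 frames.

```python
def frame_windows(num_frames, window=30, stride=30):
    """Return (start, end) index pairs for every full window of frames.

    `end` is exclusive. Trailing frames that do not fill a whole window
    are dropped. A stride equal to the window length (non-overlapping
    windows) is an assumption made for this sketch.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = num_frames - window
    return [(s, s + window) for s in range(0, last_start + 1, stride)]


if __name__ == "__main__":
    # A hypothetical 95-frame clip yields three full 30-frame sequences.
    for start, end in frame_windows(95):
        print(f"frames {start}..{end - 1}")
```

Each resulting window would then be passed frame by frame through the CNN, and the per-frame features stacked into one sequence for the recurrent network.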

Feature Extraction Using Light-Weight CNN
The core concept behind the MobileNet family is to replace large and costly convolutional layers with depth-wise separable convolutional blocks. These convolutional blocks consist of two key elements: (1) a depth-wise convolutional layer that applies 3 × 3 filters to a given input, and (2) a point-wise convolutional layer that uses 1 × 1 filters to merge these filtered values and extract the learned features. The MobileNet model is light-weight and considerably faster than a conventional convolution-based model while achieving approximately the same results. MobileNetV1 has a 3 × 3 convolution followed by 13 depth-wise separable convolutional blocks [48]. MobileNetV2 contains one extra expansion layer in each block, with a filter size of 1 × 1, in addition to the depth-wise and point-wise convolutional layers [49]. The core objective of this layer is to increase the number of channels in the data before it moves to the depth-wise convolution. As a result, this layer generates more output channels than the given input channels. Unlike V1, the point-wise layer of MobileNetV2 acts as a projection layer: it is responsible for projecting data with a large number of channels into a tensor with a small number of channels (Figure 2). The bottleneck residual block provides the final output of these layers; the residual block is modified, using convolutions to build a bottleneck. This block is valuable for reducing the number of parameters and matrix multiplications. As usual, every layer of MobileNetV2 is followed by batch normalization and ReLU6, the latter being used as the activation function. The output of the projection layer, however, is generated without an activation function. Overall, MobileNetV2 is composed of 17 bottleneck residual blocks and a regular 1 × 1 convolution, followed by a global average pooling layer and a classification layer.
The top of MobileNetV2 is the global average pooling layer, which helps reduce overfitting. The MobileNetV2 model is pretrained on the challenging ImageNet dataset, which consists of 1000 classes and approximately 1.4 M images, and we use these weights in our model. We exclude the topmost layer of MobileNetV2, which makes the network well suited to feature extraction. We extract features from a 30-frame sequence; these features are then passed through our proposed residual attention-based LSTM to recognize anomalous activities.
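
To make the saving from depth-wise separable convolutions concrete, the following sketch compares the weight count of a standard convolution with that of a depth-wise plus point-wise pair. The function names are hypothetical, biases and batch normalization parameters are ignored, and the channel sizes in the example are illustrative rather than taken from the model described here.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depth-wise k x k layer plus a 1 x 1 point-wise layer."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 filters that merge the channels
    return depthwise + pointwise


def reduction_ratio(k, c_in, c_out):
    """Separable weights divided by standard weights.

    Algebraically this equals 1 / c_out + 1 / k**2.
    """
    return depthwise_separable_params(k, c_in, c_out) / standard_conv_params(k, c_in, c_out)


if __name__ == "__main__":
    k, c_in, c_out = 3, 32, 64  # illustrative sizes only
    print(standard_conv_params(k, c_in, c_out))
    print(depthwise_separable_params(k, c_in, c_out))
    print(round(reduction_ratio(k, c_in, c_out), 4))
```

For 3 × 3 filters the ratio approaches 1/9 as the number of output channels grows, which is the source of most of the parameter reduction.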

Sequential Learning Techniques
LSTM was introduced to resolve the vanishing or exploding gradients issue in recurrent neural networks, and involves internal memory cells that are controlled by an input and forget gate network. The cell state is altered by the forget gate ranked under the cell state and also modified by the input gate. Additionally, the main purpose of the forget gate in the LSTM is to decide how much information from the previous memory should be passed into the next time step. Similarly, the input gate first regulates how much new information should be entered into the memory cell and then a vector is formed applying the tan h function, which provides output. Depending on these gates, LSTM can handle the short-and long-term dependency of the sequential information [50]. The LSTM formulation is expressed as follows: Sensors 2021, 21, x FOR PEER REVIEW 7 of 18

Sequential Learning Techniques
LSTM was introduced to resolve the vanishing or exploding gradients issue in recurrent neural networks, and involves internal memory cells that are controlled by an input and forget gate network. The cell state is altered by the forget gate ranked under the cell state and also modified by the input gate. Additionally, the main purpose of the forget gate in the LSTM is to decide how much information from the previous memory should be passed into the next time step. Similarly, the input gate first regulates how much new information should be entered into the memory cell and then a vector is formed applying the ℎ function, which provides output. Depending on these gates, LSTM can handle the short-and long-term dependency of the sequential information [50]. The LSTM formulation is expressed as follows: In the above equations, the weights ŵ ɧ * throughout the parallel LSTM architecture, including the ŵɧ , ŵɧ ʄ , ŵɧ į , and ŵɧ Ō parameters are employed to control and monitor the prior time step's information concerning the hidden states along with ŵ , ŵ ʄ , ŵ į , and ŵ Ōį , which are applied for the current input time steps' weight matrices. The superscript indicates the input sequence information, the subscript displays the time step information, is used for the sigmoid function, and ʘ represents elementwise multiplication.

Residual Attention-Based LSTM
A residual learning technique was proposed for image recognition to train ultra-deep CNNs [51,52]. The residual concept is employed to characterize the top-level layers' sequential information and the reformulation of layers by discovering residual functions, given as the input layer [53]. Generally, the residual function learning can be formulated as follows: In Equation (7), the given input and resultant sequential information vectors of the layers are considered Ẍ and . The ʄ (Ẍ , Ẅ) demonstrates the residual learned from the related layers. The results of these layers in the residual learning, which develops a sequence of the given input and nonlinear residual, are presented in Figure 3. The main advantage of this technique is that it creates a shortcut function among the several layers for more effective training of the model, and it is also useful for preventing the main issue of vanishing gradients owing to the composition with the adapting residual ʄ (Ẍ , Ẅ).

Residual Attention-Based LSTM
A residual learning technique was proposed for image recognition to train ultradeep CNNs [51,52]. The residual concept is employed to characterize the top-level layers' sequential information and the reformulation of layers by discovering residual functions, given as the input layer [53]. Generally, the residual function learning can be formulated as follows:

Sequential Learning Techniques
LSTM was introduced to resolve the vanishing or exploding gradients issue in recurrent neural networks, and involves internal memory cells that are controlled by an input and forget gate network. The cell state is altered by the forget gate ranked under the cell state and also modified by the input gate. Additionally, the main purpose of the forget gate in the LSTM is to decide how much information from the previous memory should be passed into the next time step. Similarly, the input gate first regulates how much new information should be entered into the memory cell and then a vector is formed applying the ℎ function, which provides output. Depending on these gates, LSTM can handle the short-and long-term dependency of the sequential information [50]. The LSTM formulation is expressed as follows: In the above equations, the weights ŵ ɧ * throughout the parallel LSTM architecture, including the ŵɧ , ŵɧ ʄ , ŵɧ į , and ŵɧ Ō parameters are employed to control and monitor the prior time step's information concerning the hidden states along with ŵ , ŵ ʄ , ŵ į , and ŵ Ōį , which are applied for the current input time steps' weight matrices. The superscript indicates the input sequence information, the subscript displays the time step information, is used for the sigmoid function, and ʘ represents elementwise multiplication.

Residual Attention-Based LSTM
A residual learning technique was proposed for image recognition to train ultra-deep CNNs [51,52]. The residual concept is employed to characterize the top-level layers' sequential information and the reformulation of layers by discovering residual functions, given as the input layer [53]. Generally, the residual function learning can be formulated as follows: In Equation (7), the given input and resultant sequential information vectors of the layers are considered Ẍ and . The ʄ (Ẍ , Ẅ) demonstrates the residual learned from the related layers. The results of these layers in the residual learning, which develops a sequence of the given input and nonlinear residual, are presented in Figure 3. The main advantage of this technique is that it creates a shortcut function among the several layers for more effective training of the model, and it is also useful for preventing the main issue of vanishing gradients owing to the composition with the adapting residual ʄ (Ẍ , Ẅ).
In Equation (7), the given input and resultant sequential information vectors of the layers are considered

Sequential Learning Techniques
LSTM was introduced to resolve the vanishing or exploding gradients issue in recurrent neural networks, and involves internal memory cells that are controlled by an input and forget gate network. The cell state is altered by the forget gate ranked under the cell state and also modified by the input gate. Additionally, the main purpose of the forget gate in the LSTM is to decide how much information from the previous memory should be passed into the next time step. Similarly, the input gate first regulates how much new information should be entered into the memory cell and then a vector is formed applying the ℎ function, which provides output. Depending on these gates, LSTM can handle the short-and long-term dependency of the sequential information [50]. The LSTM formulation is expressed as follows: In the above equations, the weights ŵ ɧ * throughout the parallel LSTM architecture, including the ŵɧ , ŵɧ ʄ , ŵɧ į , and ŵɧ Ō parameters are employed to control and monitor the prior time step's information concerning the hidden states along with ŵ , ŵ ʄ , ŵ į , and ŵ Ōį , which are applied for the current input time steps' weight matrices. The superscript indicates the input sequence information, the subscript displays the time step information, is used for the sigmoid function, and ʘ represents elementwise multiplication.

Residual Attention-Based LSTM
A residual learning technique was proposed for image recognition to train ultra-deep CNNs [51,52]. The residual concept is employed to characterize the top-level layers' sequential information and the reformulation of layers by discovering residual functions, given as the input layer [53]. Generally, the residual function learning can be formulated as follows: In Equation (7), the given input and resultant sequential information vectors of the layers are considered Ẍ and . The ʄ (Ẍ , Ẅ) demonstrates the residual learned from the related layers. The results of these layers in the residual learning, which develops a sequence of the given input and nonlinear residual, are presented in Figure 3. The main advantage of this technique is that it creates a shortcut function among the several layers for more effective training of the model, and it is also useful for preventing the main issue of vanishing gradients owing to the composition with the adapting residual ʄ (Ẍ , Ẅ).

. The
Sensors 2021, 21, x FOR PEER REVIEW 7 of 18

Sequential Learning Techniques
LSTM was introduced to resolve the vanishing or exploding gradients issue in recurrent neural networks, and involves internal memory cells that are controlled by an input and forget gate network. The cell state is altered by the forget gate ranked under the cell state and also modified by the input gate. Additionally, the main purpose of the forget gate in the LSTM is to decide how much information from the previous memory should be passed into the next time step. Similarly, the input gate first regulates how much new information should be entered into the memory cell and then a vector is formed applying the ℎ function, which provides output. Depending on these gates, LSTM can handle the short-and long-term dependency of the sequential information [50]. The LSTM formulation is expressed as follows: In the above equations, the weights ŵ ɧ * throughout the parallel LSTM architecture, including the ŵɧ , ŵɧ ʄ , ŵɧ į , and ŵɧ Ō parameters are employed to control and monitor the prior time step's information concerning the hidden states along with ŵ , ŵ ʄ , ŵ į , and ŵ Ōį , which are applied for the current input time steps' weight matrices. The superscript indicates the input sequence information, the subscript displays the time step information, is used for the sigmoid function, and ʘ represents elementwise multiplication.

Residual Attention-Based LSTM
A residual learning technique was proposed for image recognition to train ultra-deep CNNs [51,52]. The residual concept is employed to characterize the top-level layers' sequential information and the reformulation of layers by discovering residual functions, given as the input layer [53]. Generally, the residual function learning can be formulated as follows: In Equation (7), the given input and resultant sequential information vectors of the layers are considered Ẍ and . The ʄ (Ẍ , Ẅ) demonstrates the residual learned from the related layers. The results of these layers in the residual learning, which develops a sequence of the given input and nonlinear residual, are presented in Figure 3. The main advantage of this technique is that it creates a shortcut function among the several layers for more effective training of the model, and it is also useful for preventing the main issue of vanishing gradients owing to the composition with the adapting residual ʄ (Ẍ , Ẅ).
demonstrates the residual learned from the related layers. The results of these layers in the residual learning, which develops a sequence of the given input and nonlinear residual, are presented in Figure 3. The main advantage of this technique is that it creates a shortcut function among the several layers for more effective training of the model, and it is also useful for preventing the main issue of vanishing gradients owing to the composition with the adapting residual Sensors 2021, 21, x FOR PEER REVIEW

Sequential Learning Techniques
LSTM was introduced to resolve the vanishing or exp rent neural networks, and involves internal memory cells and forget gate network. The cell state is altered by the fo state and also modified by the input gate. Additionally, t gate in the LSTM is to decide how much information from be passed into the next time step. Similarly, the input gate information should be entered into the memory cell and th the ℎ function, which provides output. Depending on the short-and long-term dependency of the sequential in mulation is expressed as follows: = tanℎ (ŵɧ * ɧ + ŵ * ʄ = (ŵɧ ʄ * ɧ + ŵ ʄ * į = (ŵɧ į * ɧ + ŵ į * Ō = ŵɧ Ō * ɧ + ŵ Ō * = ʄ ʘ + į ʘ ɧ = Ō ʘ tanℎ ( ) In the above equations, the weights ŵ ɧ * throughout including the ŵɧ , ŵɧ ʄ , ŵɧ į , and ŵɧ Ō parameters are em the prior time step's information concerning the hidden stat ŵ Ōį , which are applied for the current input time steps' w indicates the input sequence information, the subscript mation, is used for the sigmoid function, and ʘ represen

Residual Attention-Based LSTM
A residual learning technique was proposed for image recognition with CNNs [51,52]. The residual concept is employed to characterize sequential information and to reformulate layers with reference to the sequence given as the input layer [53]. Generally, the residual function is expressed as follows:

$$Y = f(X, W) + X \quad (7)$$

In Equation (7), the given input and resultant sequence of the layers are denoted $X$ and $Y$, respectively, and $f(X, W)$ denotes the residual mapping learned by the related layers. The output of the residual block is therefore the sum of the given input sequence and the nonlinear residual. The advantage of this technique is that it creates a shortcut connection for more effective training of the model, and it is also useful for mitigating the problem of vanishing gradients owing to the composition with the shortcut path.

In this study, we normalize the information by employing the normalization layer [54] in a residual LSTM to stabilize the dynamic hidden state, normalize the information of the neurons for the LSTM, and also reduce the training time of a deep RNN as follows:

$$\mu_t = \frac{1}{H} \sum_{i=1}^{H} (h_t)_i \quad (8)$$

$$\sigma_t = \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left((h_t)_i - \mu_t\right)^2} \quad (9)$$

$$y_t = f\left(\frac{g}{\sigma_t} \odot \left(h_t - \mu_t\right) + b\right) \quad (10)$$

where $(h_t)_i$ is the hidden state of the $i$th neuron in each layer of the LSTM, $H$ is the number of neurons in the layer, $g$ and $b$ are the trainable weights that are used to rescale the input of the activation function $f$, and the time step is represented using the subscript $t$. We applied a dropout rate of 0.5 in each layer of the residual LSTM before the forward connections to reduce overfitting.

A baseline study [55] used an encoder and decoder with attention mechanisms to enhance the performance of a video captioning model. The decoder generates each word from the video features corresponding to the next word, conditioned on the words previously produced by the model. This technique effectively generates video captions using two types of input: a 1D natural language feature vector and 2D video frame data. Inspired by [55], we also used a self-attention layer with the residual LSTM, which handles both short- and long-term dependencies by utilizing the latent correlation among features at various positions. This self-attention layer produces a context-aware vector and a temporal order representation for sequential features. In contrast to the video captioning model, in our case, we have only one input, the feature vector of the video frame sequence, which is fed to the residual attention-based LSTM as a single block of features for sequence learning.

$$v_t = \sum_{i=1}^{n} w_t^{i} f_i \quad (11)$$

$$S_t^{i} = w^{\top} \tanh\left(W_h h_t + U_h f_i + b_h\right) \quad (12)$$

In the above equations, $v_t$ is the context-aware vector, and $w$, $W_h$, $U_h$, and $b_h$ are the parameters learned for the frame features $f_i$ according to the attention weight $w_t^{i}$ to return the score $S_t^{i}$. Finally, $A$ denotes the output probabilities attained from the Softmax classification layer. The deep features extracted from each 30-frame sequence are passed through the residual attention-based LSTM, and the final prediction of whether the sequence contains abnormal activities or normal events is made by the Softmax layer. Several experiments were performed to select the best hyperparameter settings, and we finally chose Adam as the optimizer, with a learning rate of 0.01, and categorical cross-entropy as the loss function. The batch size was 32 and the number of epochs was 200 for training the model. We stopped the training process when the loss no longer decreased.
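The three building blocks of this subsection, the residual shortcut of Equation (7), the normalization of Equation (10), and the attention-weighted sum of Equation (11), can be sketched in a few lines of plain Python. This is a hedged illustration, not the trained model: the function names are hypothetical, the activation in `layer_norm` is taken as the identity, the small constant `eps` is an added numerical-stability assumption, and the scores are assumed to be turned into attention weights with a softmax.

```python
import math

def residual(x, fx):
    """Shortcut connection: add the block input x to the block output f(x)."""
    return [a + b for a, b in zip(fx, x)]

def layer_norm(h, gain, bias, eps=1e-5):
    """Normalize a hidden-state vector across its neurons, then rescale.

    Identity activation and eps are assumptions made for this sketch.
    """
    mu = sum(h) / len(h)
    var = sum((v - mu) ** 2 for v in h) / len(h)
    sigma = math.sqrt(var + eps)
    return [g * (v - mu) / sigma + b for v, g, b in zip(h, gain, bias)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(features, scores):
    """Weighted sum of per-frame feature vectors.

    features: list of n feature vectors (one per frame).
    scores:   list of n unnormalized attention scores.
    Returns the pooled context vector and the attention weights.
    """
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features))
              for d in range(dim)]
    return pooled, weights
```

Frames with equal scores receive equal weights, so the pooled vector reduces to the plain average of the frame features; larger scores shift the context vector toward the corresponding frames.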

Results
We experimentally assessed our proposed model using the benchmark anomaly detection UCF-Crime dataset [56]. To test the performance of the proposed model, we evaluated it across numerous metrics, including the confusion matrix, F1 score, recall, precision, class-wise accuracy, area under the curve (AUC), and receiver operating characteristic (ROC) curve. The performance of our model is compared with recent abnormal activity recognition techniques. The proposed model is implemented using Keras with a TensorFlow backend and Python 3.6 on a Windows 10 platform with a Core i5-6600 CPU and 16 GB of RAM, equipped with a 12 GB GeForce Titan X graphics processing unit (GPU).

Datasets
In this work, the performance of the proposed model is extensively evaluated on various benchmark datasets, i.e., the University of Minnesota (UMN) dataset [57], the Avenue dataset [58], and the UCF-Crime dataset [56]. The UCF-Crime dataset consists of 1900 long untrimmed videos covering 13 real-world anomalous events: fighting, stealing, shooting, shoplifting, robbery, road accident, arson, abuse, arrest, assault, burglary, vandalism, and explosion. The UCF-Crime dataset is an almost balanced dataset that contains 800 normal and 810 anomalous event videos in the training set. The remaining videos, 150 normal and 140 anomalous, are temporally annotated and used to test the performance of the model. The challenging part of the UCF-Crime dataset is that it contains temporal annotations only for the testing set. We follow the strategy of prior work [47] to determine the training, testing, and validation ratio. The UMN dataset is an extensively utilized dataset that consists of 11 video sequences of various scenes of abnormal activities. It has in total 4144, 2144, and 1453 frames for its three scenes, plaza, indoor, and lawn, respectively. The Avenue dataset consists of 16 training and 21 testing videos and contains in total 30,652 frames. This dataset has 47 abnormal events, and the resolution of each frame is 360 × 640 pixels.

Evaluation Methods
In this section, to measure the performance and effectiveness of the proposed model, we used evaluation metrics commonly used for abnormal activity detection [6,12,59], such as the AUC and the ROC curve. We also evaluate our proposed model using the recall, F1 score, and precision. We applied these metrics to the test videos and counted the total number of true positive (TP), false positive (FP), and false negative (FN) results.
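For reference, precision, recall, and the F1 score follow directly from the TP, FP, and FN counts. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not results from our experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts.

    Returns 0.0 for any metric whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts for one class, for illustration only.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```

Precision penalizes false alarms (FP), recall penalizes missed anomalies (FN), and the F1 score is their harmonic mean, so it is high only when both are high.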

Results
In this section, broad experiments are carried out to test the performance of our proposed model utilizing the UCF-Crime dataset. We perform experiments using the LSTM, bidirectional LSTM (BD-LSTM), and residual LSTM. We tried several variants of the LSTM in our experimental analysis before reaching the final choice of the proposed residual attention-based LSTM model, which has shown strong performance for the investigated problem of anomaly recognition. The proposed residual attention-based LSTM model achieved promising results as compared with baseline techniques, which are presented in Table 1. Some of the visual results of our proposed model for anomalous activity recognition are shown in Figure 4. Figure 4a-c show the accurate prediction results of the proposed model, and Figure 4d shows the incorrect prediction results. The per-class prediction results of the residual LSTM on the UCF-Crime dataset are represented as a confusion matrix, provided in Figure 5. The training graph and loss and the class-wise accuracy of the proposed model are shown in Figures 6-8, and the precision, F1 score, and recall are demonstrated in Table 1. The ROC and AUC curves of the proposed model are displayed in Figure 9. We calculated the time complexities of MobileNetV2, an attention-based LSTM, and our overall proposed model (Figure 10), and the floating point operations per second (FLOPS) of these models, which are converted from Giga FLOPs to Mega FLOPs, are 3.1, 615, and 618.1, respectively (Table 2).

Comparison with the State-of-the-Art Techniques
In this section, the performance of our proposed anomaly recognition model is compared with state-of-the-art techniques by using the UCF-Crime dataset. The authors of [47] evaluated various deep learning models, i.e., VGG-16, VGG-19, FlowNet, and DEARESt. The DEARESt model provides the best performance among these methods. Deep learning models are becoming deeper and deeper and require huge amounts of storage; they also have increased computational complexity and stringent installation requirements on edge nodes. In anomaly recognition, a delay in response can cost human lives and property; therefore, efficient model selection is a very important aspect of any anomaly recognition system. Our decision to use the light-weight CNN model MobileNetV2 is due to its small storage size, smaller number of learned parameters, and fast processing time, with a performance equivalent to heavy-weight CNN models [12,47,60]. The efficiency of the proposed model is compared with these existing techniques in terms of model size, time complexity, and the number of parameters, as shown in Table 2. We achieved overall accuracies of 78.43%, 98.20%, and 98.80%, which are increases of 1.77%, 0.76%, and 8.62% over existing state-of-the-art techniques on the UCF-Crime, UMN, and Avenue datasets, respectively, with fewer parameters and a reduced model size, as shown in Table 3. The proposed model can process a sequence of 30 frames in 0.263 s, which is faster than the recent existing techniques [47,62,63]. The existing models are much larger, and their recognition performance is lower than that of our proposed model, as shown in Table 2.
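As a rough worked figure derived only from the processing time reported above, and assuming for illustration that sequences are processed back to back with no additional overhead, 30 frames in 0.263 s corresponds to the following throughput:

$$\frac{30\ \text{frames}}{0.263\ \text{s}} \approx 114\ \text{frames per second}$$

This is an arithmetic conversion of the reported figure rather than a separately measured result.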

Discussion
The main objective of this paper is to utilize a light-weight CNN model to detect and recognize anomalies efficiently. For many computer vision applications, CNN models are becoming deeper and deeper, making their applicability on edge devices questionable. To overcome these problems and achieve light-weight functionality, we utilize MobileNetV2 for feature extraction, followed by a residual attention-based LSTM for anomalous sequence recognition. MobileNetV2 is used to extract efficient features from input videos, improving the performance of the proposed attention-based LSTM. We use the standard training and testing sets provided in the existing literature to test and compare the performance of the proposed model against competing methods. Additionally, we tried several variants of LSTM in our experimental analysis, such as LSTM, BD-LSTM, and residual LSTM, before reaching our final decision to use the proposed residual attention-based LSTM model, which has shown strong performance for the investigated problem of anomaly recognition, as reported in Table 1. Moreover, the efficiency comparison of the proposed model in terms of time complexity, model size, and number of parameters showed that it outperforms the existing models, as shown in Table 2. Table 3 shows the performance comparison of the proposed model with recent state-of-the-art techniques [47,[60][61][62][63][64] using various benchmark datasets in terms of accuracy. The proposed model outperformed the existing techniques, increasing the accuracy by margins of 1.77%, 0.76%, and 8.62% for the UCF-Crime, UMN, and Avenue datasets, respectively.

Conclusions
Smart surveillance systems are gaining attention among computer vision experts; they are mainly deployed for monitoring purposes. However, deep models are data-hungry and demand heavy processing systems for effective analysis. In contrast, surveillance systems require quick countermeasures and responses against any abnormal events, detected automatically using computer vision systems. In the current study, we introduced a light-weight, efficient model to recognize anomalies in smart cities with state-of-the-art accuracy on various challenging benchmark datasets. Our proposed model extracts deep CNN spatial features from a sequence of frames; then, it uses a residual attention-based LSTM to recognize anomalous events in a surveillance system. The usage of light-weight CNN features with a residual attention-based LSTM provides a high level of adaptability to smart surveillance environments. We validated our proposed model using various evaluation metrics. The proposed model achieved a higher accuracy than existing anomaly recognition methods. The experimental results show accuracy increases of 1.77%, 0.76%, and 8.62% for the UCF-Crime, UMN, and Avenue datasets, respectively, and considerable reductions in false alarm rates compared to the abnormal activity recognition literature.
In future work, we aim to investigate other deep learning models, 3D models, graph neural networks, and multi-instance learning formulations to enhance the system performance. Additionally, we intend to develop a generative technique appropriate for recognizing more classes of anomalies.