A Self-Attention Augmented Graph Convolutional Clustering Networks for Skeleton-Based Video Anomaly Behavior Detection

In this paper, we propose a new method for detecting abnormal human behavior from skeleton features using self-attention augmented graph convolution. Skeleton data have been shown to be robust to complex backgrounds, illumination changes, and dynamic camera scenes, and they are naturally represented as a graph in non-Euclidean space. In particular, spatial temporal graph convolutional networks (ST-GCN) can effectively learn the spatio-temporal relationships of non-Euclidean structured data. However, ST-GCN operates only on local neighborhood nodes and therefore lacks global information. We propose a novel spatial temporal self-attention augmented graph convolutional network (SAA-Graph) that combines an improved spatial graph convolution operator with a modified transformer self-attention operator to capture both local and global information of the joints. The spatial self-attention augmented module is used to understand the intra-frame relationships between human body parts. As far as we know, we are the first group to utilize self-attention for video anomaly detection tasks by enhancing spatial temporal graph convolution. Moreover, to validate the proposed model, we performed extensive experiments on two large-scale public benchmark datasets (i.e., the ShanghaiTech Campus and CUHK Avenue datasets), which show state-of-the-art performance for our approach compared with existing skeleton-based methods and graph convolution methods.


Introduction
Video anomaly detection is a highly challenging task in unsupervised video analysis. In recent years, surveillance video anomaly detection has gained widespread attention owing to its applications in public security and social security management and to the rapid progress of deep learning and computer vision. Abnormal events are inherently complex for various reasons, such as cluttered backgrounds and objects or motion in the scene, and this complexity is a bottleneck for detecting such events in real-world video data. Additionally, handling and modeling video data is itself difficult because of its high dimensionality, its noise, and the diversity of events and interactions involved. So far, many studies in the literature have examined video anomaly detection in depth, mainly focusing on appearance features, depth features, optical flow modeling, etc., but very little attention has been paid to skeleton-based video anomaly detection models. Most of these models operate primarily at the image level and, instead of modeling the normal patterns of humans, place more emphasis on the background, which increases the burden on the background model. In contrast, we explicitly exploit a common structure of surveillance video, i.e., people and objects moving on a static background, where most abnormal phenomena are caused by humans. To mitigate the issues stated above, we employ skeleton features and take advantage of their compactness, strong structure, semantic richness, and strong description of human behavior and movement. In this way, the analysis is free from interference caused by factors such as illumination and busy backgrounds.
Graph convolutional networks (GCN) are now among the most popular methods for analyzing non-Euclidean structured data. Applied to skeleton sequences, they can effectively capture spatial (intra-frame) and temporal (inter-frame) information. For skeleton-based action recognition, Yan et al. [1] proposed the spatial temporal graph convolutional network (ST-GCN), which was the first to apply GCN to model skeleton data. The ST-GCN model has been shown to perform well on skeleton data [2][3][4], but because the spatiotemporal graph convolution operates only on the local neighborhood of a node and is restricted by the size of the convolution kernel, it lacks global information. Moreover, the correlation between body joints that are not directly connected in the human skeleton, e.g., the left hand and the right foot, is underestimated. Transformer self-attention [5] was originally applied in natural language processing tasks to encode the short-distance and long-distance correlations between the words in a sentence. Considering the sequential nature and hierarchical structure of human skeleton sequences, this mechanism can be extended to skeleton data. Because of its flexibility in dealing with long-range dependencies, self-attention can resolve the major shortcoming of ST-GCN (i.e., that it can only capture local features in the spatial dimension). Self-attention has recently been used to address the locality of the convolution operator by capturing the global context of pixels in an image [6]. The proposed spatial temporal self-attention augmented graph convolutional network (SAA-Graph) contains a new graph convolution operator that combines an improved spatial graph convolution operator with a modified transformer self-attention operator to capture both local and global information of the joints.
The improved spatial graph convolution operator uses a data-driven approach to make graph construction more flexible and to adapt better to varied data samples. Our work applies the self-attention mechanism to skeleton data to enhance the graph convolution: we capture the information of local and global joints by combining the improved spatial graph convolution operator with the modified transformer self-attention operator. Because the spatiotemporal graph convolution operates only on the local neighborhood of a node and is restricted by the size of the convolution kernel, an autoencoder built from spatiotemporal graph convolutions also lacks global information. We use self-attention to overcome this locality of the graph convolution operator by capturing the global information in the skeleton data.
Specifically, the extracted spatiotemporal graph of skeleton features is encoded into a latent vector by the encoder part of a spatial temporal self-attention augmented graph convolutional autoencoder (SAA-STGCAE). A deep embedded clustering layer then softly assigns the latent vector to clusters, and a Dirichlet process mixture model is used to measure the distribution of these assignments. From this model, we obtain a normality score for each sample and determine whether the action should be classified as normal or not. An overview of the proposed method is shown in Figure 1.
The key contributions of this work are summarized as follows: (1) We propose a novel spatial temporal self-attention augmented graph convolutional clustering network for skeleton-based video anomaly detection, which employs the spatial temporal self-attention augmented graph convolutional autoencoder to extract the relevant features and an embedded clustering layer to cluster them; (2) We design a new spatial self-attention augmented graph convolution operator to understand the intra-frame interaction between different body parts and capture the local and global features of a skeleton in the frame; (3) Our model achieves a state-of-the-art AUC of 0.789 on the ShanghaiTech Campus anomaly detection dataset and also exhibits excellent performance on the CUHK Avenue dataset. The deep embedded clustering layer is used to softly assign the latent vector to the clusters, and $p_{nk}$ represents the probability of sample $n$ being assigned to cluster $k$.

Video Anomaly Detection
Video anomaly detection is defined as the task of finding abnormal patterns or actions in video data, where abnormalities are infrequent or rare events. Traditional methods for abnormal event detection extract and analyze hand-crafted low-level visual features and are unable to characterize more complex behaviors. Additionally, the features extracted by such methods each describe only a single aspect of the scene, so the generalization ability of hand-crafted features is usually weak and they are not robust in crowded scenes. For instance, trajectory features [7,8] describe the paths of moving objects. Similarly, the Histogram of Oriented Gradient (HOG) [9] and the Histogram of Flow (HOF) [10] can characterize the shape and contour information of the human body in a static image. Optical flow [11] describes the changes in the gray value of pixels between adjacent frames and is often used to characterize motion information. Zhang et al. [12] linked optical flows to capture short-term trajectories across multiple frames and described these trajectories with a histogram-based shape descriptor. However, these methods achieve only suboptimal performance on complex surveillance scenarios and large-scale video anomaly detection datasets.
Recently, various works have used deep learning-based models to address the problem of video anomaly detection. Such models can be roughly categorized into reconstructive models, predictive models, and generative models. Reconstructive models use the difference between the reconstructed image and the original image as the basis for scoring and localizing anomalies, and often rely on autoencoders [13,14]. Predictive models, on the other hand, use recurrent neural networks [15][16][17] or 3D convolutions [14] and extend reconstruction by predicting and generating future frames, computing the loss against the frames actually observed. Finally, generative models primarily use variational autoencoders (VAEs) or GANs to reconstruct, predict, or model the distribution of the data. The early anomaly detection work of Leo et al. [18] was used for human activity recognition in wide-area automatic visual surveillance. Liu et al. [19] proposed a future frame prediction model that combines U-Net and Beyond-MSE. Wu et al. [20] proposed a Fast Sparse Coding Network based on High-level Features that discriminates spatio-temporal fusion features to achieve higher accuracy in video anomaly detection. In another work, Morais et al. [21] joined two RNN branches to form global and local features, using a message-passing encoder-decoder RNN architecture. The work by Luo et al. [22] was the first to apply graph convolutional networks to skeleton-based video anomaly detection to analyze the graph connections of human joints. Subsequently, Markovitz et al. [23] proposed an approach that uses embedded pose graphs and a Dirichlet process mixture for video anomaly detection, together with a new coarse-grained setting for exploring broader aspects of the task.

Skeleton-Based Action Recognition
Most conventional techniques for skeleton-based action recognition rely on hand-crafted features to model the human body [24][25][26]. However, the literature shows that hand-crafted features perform well only on certain types of datasets [27], which illustrates that features hand-crafted for one dataset cannot always be transferred to another. In contrast, deep learning has revolutionized activity recognition with data-driven techniques that directly improve robustness and achieve unprecedented performance, where the most widely used models are RNNs and CNNs.
RNN-based methods are well suited to time series data because of their recurrent structure, and skeleton sequences are natural time series of joint coordinate positions; however, the spatial modeling ability of RNNs is weak. Alternatively, many CNN-based studies encode the skeleton joints as multiple 2D pseudo-images to learn useful features [28,29]. However, existing CNN-based models largely fail to capture the various aspects of a skeleton sequence. Banerjee et al. [30] extracted four feature representations from the angle information and kinematics information of human movements, which then captured the complementary features of key joint sequences. Even so, neither RNNs nor CNNs can fully represent the structure of skeletal data, because skeletal data are naturally embedded as graphs rather than as vector sequences or two-dimensional grids.
Lately, GNN-based methods have been proposed that achieve better performance by exploiting the fact that human skeleton data form a natural topological graph (joints and bones can be treated as vertices and edges, respectively) rather than an image or a vector sequence. To retain the spatial information of the skeleton and improve feature generalization, Yan et al. [1] proposed ST-GCN, which directly models human skeleton data as a spatiotemporal graph and automatically extracts robust spatiotemporal features from it. ST-GCN has strong expressive and generalization capabilities and thus achieves better performance than previously reported methods. Inspired by this work, we use an improved ST-GCN block to construct a spatial temporal self-attention augmented graph convolutional autoencoder, named SAA-STGCAE. The encoder part of SAA-STGCAE is used to generate a latent vector.

Transformer
The Transformer was originally proposed for natural language processing. It uses the attention mechanism to capture sequence dependencies in parallel, processing the tokens at all positions of the sequence simultaneously. The Transformer follows an encoder-decoder structure and relies only on multi-head self-attention [5]. Recently, the self-attention mechanism has also been applied to visual tasks [6] to enhance standard convolution. Likewise, our work applies the self-attention mechanism to skeleton data to enhance graph convolution.

Graph Convolutional Neural Networks
Implementations of graph convolutional neural networks (GCN) fall into two main categories: (1) spectral-based methods and (2) spatial-based methods. Spectral-based methods use the graph Fourier transform to convert the graph data into the frequency domain and then perform the computation there, exploiting the fundamental property that convolution in the original domain is equivalent to multiplication in the frequency domain. Spatial-based methods, on the other hand, construct a convolution kernel directly in the spatial domain for feature extraction. In this work, we adopt the spatial-based GCN approach to extract features from structured graph data composed of human skeleton sequences.

Proposed Method
We propose a framework called SAA-STGCN for skeleton-based anomaly detection. The overall framework of the proposed method is illustrated in Figure 1. The method focuses on human behavior when searching for anomalies. First, we run a pose estimation algorithm to extract the human skeletons in each frame of the input video and generate spatiotemporal graphs (Section 3.1). This step makes the algorithm robust to complex backgrounds, lighting changes, human scales, and dynamic camera views. Next, we use the encoder part of SAA-STGCAE as a feature extractor to embed the data and generate latent vectors (Sections 3.2 and 3.3). The deep embedded clustering layer (Section 3.4) softly assigns the latent vectors to clusters, so that each sample is represented by the probabilities of its assignment to each of the $k$ clusters. Finally, we use the Dirichlet process mixture model (Section 3.5) to estimate a set of distribution parameters in the estimation stage and use the fitted model to provide a score for each embedded sample. The normality score provided by the model is used to determine whether the action is normal or not.

Spatiotemporal Graph Connection Configuration for Skeleton
The original skeleton data, which can be obtained from pretrained video pose estimation algorithms or motion capture devices, are provided as a sequence of vectors. We define $N$ as the number of joints in the skeleton and $T$ as the total number of frames. For each person, a spatiotemporal graph is established as $G = (V, E)$, where $V = \{v_{tn} \mid t = 1, 2, \dots, T;\ n = 1, 2, \dots, N\}$ is the set of all joint nodes, which form the vertices of the graph, and $E$ is the set of all edges, which describe the natural connections in the human body structure and the connections across time. Furthermore, $E$ consists of two subsets $E_s$ and $E_t$, where $E_s = \{(v_{ti}, v_{tj}) \mid t = 1, 2, \dots, T;\ i, j = 1, 2, \dots, N\}$ represents the connection of a pair of joints $(i, j)$ within each frame $t$, and $E_t = \{(v_{tn}, v_{(t+1)n}) \mid t = 1, 2, \dots, T-1;\ n = 1, 2, \dots, N\}$ represents the connection of the same joint between consecutive frames. Figure 2a shows an example of the constructed spatiotemporal graph, where the joints are represented as vertices, their natural connections in the human body are represented as spatial edges (the blue lines in Figure 2a), and corresponding joints in two adjacent frames are connected by temporal edges (the green lines in Figure 2a).
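The graph construction described above can be illustrated with a minimal sketch. The function name `build_spatiotemporal_edges`, the five-joint toy skeleton `TOY_BONES`, and the zero-based indexing are assumptions made for illustration only; they are not taken from the paper's implementation. Temporal edges link each joint in frame t to the same joint in frame t+1, so the last frame has no outgoing temporal edge.

```python
def build_spatiotemporal_edges(num_frames, num_joints, bones):
    """Return (spatial_edges, temporal_edges) over vertices (t, n).

    bones is a list of (i, j) joint pairs that are naturally connected
    in the body; indices are zero-based in this sketch.
    """
    # E_s: one copy of every bone inside each frame
    spatial = [((t, i), (t, j)) for t in range(num_frames) for (i, j) in bones]
    # E_t: the same joint in two consecutive frames
    temporal = [((t, n), (t + 1, n))
                for t in range(num_frames - 1)
                for n in range(num_joints)]
    return spatial, temporal


# Hypothetical toy skeleton: 0 head, 1 neck, 2 hip, 3 hand, 4 foot
TOY_BONES = [(0, 1), (1, 2), (1, 3), (2, 4)]
spatial_edges, temporal_edges = build_spatiotemporal_edges(3, 5, TOY_BONES)
print(len(spatial_edges), len(temporal_edges))
```

With 3 frames, 5 joints, and 4 bones, the sketch yields 4 spatial edges per frame and 5 temporal edges per pair of consecutive frames.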
We adopt the spatial configuration partition [1] to divide the neighborhood of a node into three subsets. First, the center of gravity is determined as the average coordinate of all joints of the skeleton in the frame. The first subset is then the root node itself (red node in Figure 2b), the second subset contains the neighbor nodes that are closer to the center of gravity than the root node (green node in Figure 2b), and the third subset contains the neighbor nodes that are farther from the center of gravity than the root node (blue node in Figure 2b).
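As a rough illustration of this partition, the sketch below splits the one-hop neighborhood of a root joint into the three subsets. The helper name `partition_neighbors`, the toy 2D coordinates, and the decision to place ties (a neighbor exactly as far from the center as the root) in the third subset are assumptions of this sketch, not details given in the text.

```python
import math


def partition_neighbors(coords, bones, root):
    """Split the 1-hop neighborhood of `root` into (root, closer, farther)."""
    dims = len(coords[0])
    # Center of gravity: average coordinate of all joints in the frame
    center = tuple(sum(c[d] for c in coords) / len(coords) for d in range(dims))
    dist = [math.dist(c, center) for c in coords]
    neighbors = [j for i, j in bones if i == root] + \
                [i for i, j in bones if j == root]
    closer = sorted(j for j in neighbors if dist[j] < dist[root])
    # Ties are sent to the "farther" subset; this is an arbitrary choice here
    farther = sorted(j for j in neighbors if dist[j] >= dist[root])
    return [root], closer, farther


# Hypothetical toy skeleton: 0 head, 1 neck, 2 hip, 3 hand, 4 foot
BONES = [(0, 1), (1, 2), (1, 3), (2, 4)]
COORDS = [(0.0, 4.0), (0.0, 3.0), (0.0, 1.0), (2.0, 3.0), (0.0, 0.0)]
print(partition_neighbors(COORDS, BONES, root=2))
```

For the hip joint in this toy pose, the neck lies closer to the center of gravity and the foot lies farther away, so they fall into the second and third subsets respectively.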

Feature Extraction
The proposed SAA-STGCAE uses the spatial self-attention augmented graph convolution module (SAA-Graph), presented next, and a temporal convolution module (TCN) to embed the spatiotemporal graphs, as shown in Figure 3. We employ the same temporal convolution module as ST-GCN and apply a $1 \times K_t$ convolution to the feature map obtained from the spatial dimension, where $K_t$ is the kernel size in the time dimension. Then, we use the encoder part of SAA-STGCAE to embed the spatiotemporal graph of the extracted skeleton poses and generate latent vectors for the clustering branch.
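To make the temporal convolution concrete, the following sketch applies a kernel of length K_t along the time axis, independently for each joint. It is a simplification under stated assumptions: a single feature channel, zero padding so that the output keeps T frames, and an odd kernel length. The name `temporal_conv` is hypothetical.

```python
def temporal_conv(features, kernel):
    """Slide a length-K_t kernel along time for every joint.

    features: T x N nested list (one scalar feature per joint per frame).
    kernel:   K_t weights, odd length; zero padding keeps T frames.
    """
    num_frames, num_joints = len(features), len(features[0])
    pad = len(kernel) // 2
    out = [[0.0] * num_joints for _ in range(num_frames)]
    for t in range(num_frames):
        for n in range(num_joints):
            for k, w in enumerate(kernel):
                s = t + k - pad          # source frame for this kernel tap
                if 0 <= s < num_frames:  # frames outside the clip count as zero
                    out[t][n] += w * features[s][n]
    return out


toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # T = 3 frames, N = 2 joints
print(temporal_conv(toy, [1.0, 1.0, 1.0]))   # K_t = 3
```

Each joint only mixes information with its own past and future positions here; mixing across joints is left to the spatial module.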

Spatial Self-Attention Augmented Graph Convolution
We propose a new graph convolution operator called Spatial Self-Attention Augmented Graph Convolution (SAA-Graph), which is based on the improved ST-GCN block and uses the self-attention module to enhance spatial graph convolution.

Spatial Graph Convolution
For the spatial dimension, we use adjacency matrices of three types: static adjacency matrices ($A_1$), globally learned adjacency matrices ($A_2$), and adaptive adjacency matrices ($A_3$). $A_1$ is an $N \times N$ hard-coded adjacency matrix representing the physical structure of the intra-body connections. $A_2$ is also an $N \times N$ adjacency matrix; it is initialized as a fully connected graph and learned entirely from the training data, being optimized together with the other model parameters during training. Its elements can take any value, so they indicate not only whether there is a connection between two joints but also the strength of that connection. $A_3$ is an adaptive graph learned for each sample to represent the strength of the connection between two vertices. We embed the input twice using two sets of learned weights, transpose one of the embedded matrices, take the dot product between the two, and normalize the result to obtain the adaptive adjacency matrix, similar to [4].
Each adjacency type is applied with its own graph convolution operation (GCN) and its own weights, rather than replacing the original $A_1$ with $A_2$ or $A_3$. The stacked GCN outputs are then passed through a $1 \times 1$ convolution, which acts as a learnable reduction that weights the stacked outputs and provides the required number of output channels. In this way, the model gains flexibility without reducing the original performance.
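The adaptive adjacency matrix described above can be sketched as follows. The text only says that the dot product is normalized; using a row-wise softmax for that step is an assumption of this sketch, as are the names `adaptive_adjacency`, `w_theta`, and `w_phi` and the toy weights.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Numerically stable softmax applied to every row."""
    out = []
    for row in m:
        mx = max(row)
        e = [math.exp(v - mx) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out


def adaptive_adjacency(x, w_theta, w_phi):
    """x: N x C joint features; w_theta, w_phi: C x d learned embeddings."""
    theta = matmul(x, w_theta)                  # first embedding,  N x d
    phi = matmul(x, w_phi)                      # second embedding, N x d
    phi_t = [list(col) for col in zip(*phi)]    # transpose to d x N
    return softmax_rows(matmul(theta, phi_t))   # N x N, rows sum to 1


x_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 3 joints, 2 features
eye2 = [[1.0, 0.0], [0.0, 1.0]]                 # stand-in for learned weights
a3 = adaptive_adjacency(x_toy, eye2, eye2)
print(a3)
```

Because the matrix is computed from the input features, it differs from sample to sample, unlike the static and globally learned matrices.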
For the spatial dimension, the graph convolution operation is formulated in terms of the following quantities: $A_l$ are the adjacency matrices, $D_k$ is a degree matrix, $I$ is an identity matrix describing the self-connection of joints, $f_{in}$ denotes the input joint features, and $W_i$ are the trainable parameters of the neighbor subset.
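For orientation, the sketch below shows one commonly used form of spatial graph convolution, in which the adjacency matrix with self-connections is symmetrically normalized by the degree matrix before being applied to the joint features. This is an assumption for illustration: it uses a single adjacency matrix and a single weight matrix rather than the per-subset form used in the paper, and the names `normalized_adjacency` and `graph_conv` are hypothetical.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """Compute D^(-1/2) (A + I) D^(-1/2) for a square adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]            # diagonal of the degree matrix
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def graph_conv(adj, x, w):
    """Aggregate neighbor features, then apply the trainable weights."""
    return matmul(matmul(normalized_adjacency(adj), x), w)


adj_toy = [[0.0, 1.0], [1.0, 0.0]]   # two joints connected by one bone
x_in = [[2.0], [4.0]]                # one input feature per joint
w_toy = [[1.0]]                      # stand-in for learned weights
print(graph_conv(adj_toy, x_in, w_toy))
```

Each output row mixes a joint's own feature with those of its direct neighbors only, which is the locality that the self-attention module is meant to complement.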

Spatial Self-Attention (SAA) Module
The transformer model employing self-attention was originally designed to operate on words in NLP tasks. The self-attention mechanism reduces the dependence on external information and is better at capturing the internal correlation of data or features. The SAA applies a modified self-attention operator, as depicted in Figure 5, to capture the spatial features of different joints in the same frame and dynamically build spatial relationships within and between joints, strengthening the correlation of body joints that are not directly connected in the human skeleton. Because the relations between joints are dynamically generated in SAA, the relevant structure of the skeleton is adaptively generated rather than fixed for all actions. The SAA is computed by independently calculating the correlation between each pair of joints in each frame, as shown in Figure 6. When the weighted result for a source node is calculated, all the other nodes participate in the calculation, which reflects the ability to capture global characteristics. For each joint $v_{tn}$ of the skeleton at time $t$, we first calculate the query vector $q_n^t \in \mathbb{R}^{d_q}$, the key vector $k_n^t \in \mathbb{R}^{d_k}$, and the value vector $v_n^t \in \mathbb{R}^{d_v}$ by applying trainable linear transformations to the joint features $j_n^t \in \mathbb{R}^{C_{in}}$ with parameters $W_q \in \mathbb{R}^{C_{in} \times d_q}$, $W_k \in \mathbb{R}^{C_{in} \times d_k}$, and $W_v \in \mathbb{R}^{C_{in} \times d_v}$, shared by all nodes, where $C_{in}$ is the number of input features and $d_q$, $d_k$, $d_v$ are the channel dimensions of the query, key, and value vectors, respectively. Then, for each pair of body joints $(v_{tn}, v_{tm})$, the score $\alpha_{nm}^t$, which represents the strength of the correlation between the two joints, is determined by $\alpha_{nm}^t = q_n^t \cdot (k_m^t)^T \in \mathbb{R}$, $\forall t \in T$. This score is then used to weight each joint value $v_m^t$, and a weighted sum is calculated to obtain a new embedding $z_n^t \in \mathbb{R}^{C_{out}}$ for joint $v_{tn}$, as shown in Equation (3).
Multi-head attention [5] applies several independent self-attention calculations by repeating the process multiple times, each time with a different set of learnable parameters; the heads act together as an integrated function, which helps prevent overfitting.
Here, $X_N$ is the reshaped input. The outputs of all heads are then concatenated and combined by $W_o$, a learnable linear transformation.
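The attention computation over the joints of a single frame can be sketched as below. Scaling the scores by the square root of d_k and normalizing them with a softmax follow the standard Transformer formulation [5] and are assumptions here, since the text only states that the scores weight the values. The function names and the identity weights in the toy example are hypothetical.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Numerically stable softmax applied to every row."""
    out = []
    for row in m:
        mx = max(row)
        e = [math.exp(v - mx) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out


def attention_head(x, w_q, w_k, w_v):
    """x: N x C_in joint features of one frame; returns N x d_v embeddings."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)                      # alpha_nm for every joint pair
    weights = softmax_rows([[s / math.sqrt(d_k) for s in row] for row in scores])
    return matmul(weights, v)                    # weighted sum over all joints


def multi_head_attention(x, heads, w_o):
    """heads: list of (w_q, w_k, w_v) tuples; w_o mixes the concatenation."""
    outs = [attention_head(x, *h) for h in heads]
    concat = [sum((o[n] for o in outs), []) for n in range(len(x))]
    return matmul(concat, w_o)


x_frame = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]   # 3 joints with equal features
eye = [[1.0, 0.0], [0.0, 1.0]]
w_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
z = multi_head_attention(x_frame, [(eye, eye, eye), (eye, eye, eye)], w_out)
print(z)
```

Every joint attends to every other joint in the frame, so joints that are not linked by a bone can still influence each other's embedding.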

Deep Embedded Clustering
The input to the clustering layer is the embedding produced by SAA-STGCAE. We adapt deep embedded clustering [31] and use our proposed SAA-STGCAE architecture for soft clustering of spatiotemporal graphs. Starting from the embedding obtained by the initial reconstruction training, the embedding is fine-tuned to obtain a cluster-optimized embedding, and each sample is then represented by its probability $p_{nk}$ of being assigned to each cluster, where $z_n$ is the latent embedding generated by the encoder part of SAA-STGCAE, $y_n$ is the soft cluster assignment, and $\Theta$ denotes the parameters of the clustering layer with $k$ clusters.
Following the clustering objective of [31], we optimize the model to minimize the Kullback-Leibler (KL) divergence between the current cluster prediction $P$ of the model and the target distribution $Q$. In the expectation step, we fix the model and update the target distribution $Q$; in the maximization step, the model is optimized to minimize the clustering loss $L_{cluster}$.
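The clustering objective can be illustrated with a short sketch. Two assumptions are made: the soft assignment is taken to be a softmax over linear scores of the embedding against per-cluster parameters, and the target distribution uses the squared-and-renormalized form from deep embedded clustering [31], with the loss computed as KL(Q || P). All function names are hypothetical.

```python
import math


def soft_assign(z, theta):
    """p_nk: softmax over the scores theta_k . z_n (assumed clustering layer)."""
    p = []
    for z_n in z:
        scores = [sum(a * b for a, b in zip(t_k, z_n)) for t_k in theta]
        mx = max(scores)
        e = [math.exp(s - mx) for s in scores]
        total = sum(e)
        p.append([v / total for v in e])
    return p


def target_distribution(p):
    """Sharpen P: square each p_nk, divide by the cluster frequency, renormalize."""
    num_clusters = len(p[0])
    freq = [sum(row[k] for row in p) for k in range(num_clusters)]
    q = []
    for row in p:
        w = [row[k] ** 2 / freq[k] for k in range(num_clusters)]
        total = sum(w)
        q.append([v / total for v in w])
    return q


def cluster_loss(q, p):
    """KL(Q || P), summed over samples and clusters."""
    return sum(q_nk * math.log(q_nk / p_nk)
               for q_row, p_row in zip(q, p)
               for q_nk, p_nk in zip(q_row, p_row) if q_nk > 0.0)


p_toy = [[0.6, 0.4], [0.3, 0.7]]      # two samples, two clusters
q_toy = target_distribution(p_toy)    # expectation step: model fixed, Q updated
print(q_toy, cluster_loss(q_toy, p_toy))
```

The target distribution pushes each sample toward the cluster it already favors, so minimizing the loss in the maximization step makes the assignments progressively more confident.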

Normality Score
The Dirichlet process mixture model [32] is a useful tool for assessing the distribution of proportional data and is theoretically well suited to processing large, unlabeled datasets. It estimates a set of distribution parameters in the estimation stage and uses the fitted model to provide a score for each embedded sample in the inference stage. In the testing phase, the fitted model scores each sample by its log-probability. The normality score provided by the model is used to determine whether the action is normal or not.

Experiment
We evaluate the performance of our method for video anomaly detection on two public datasets, ShanghaiTech Campus [17] and CUHK Avenue [33], in which pedestrians can be easily identified and human skeleton data can be extracted. Figure 7 shows some normal and abnormal events in the datasets used in our experiments. We compare our proposed network with appearance-based [13,17,19] and skeleton-based [21,23] methods. All experiments are evaluated using the frame-level AUC.

Dataset
The ShanghaiTech Campus dataset [17] is a complex, large-scale anomaly detection dataset. Its videos were collected in 13 campus scenes with complex lighting conditions and different camera angles. Most of the anomalous events in the dataset are caused by humans, which are the target of our method, so we conduct more detailed experiments on this dataset. Previous work [21] extracted a subset of ShanghaiTech Campus that contains only human-related anomalous events, denoted HR-ShanghaiTech. We also evaluate our method on this subset.
CUHK Avenue [33] contains 16 training and 21 testing video clips, including 47 abnormal events such as the movement of pedestrians, movement in the wrong direction, and the appearance of abnormal objects. The clips were captured on the CUHK campus avenue from a single viewpoint.

Implementation Details
We use the Alpha-Pose algorithm [34] to extract skeletons from each frame of each clip in the dataset. For video streams of unknown length, we divide the input pose sequence into fixed-length clips with the sliding window method. When more than one person appears in a clip, each person is scored individually and we take the highest score among the persons in the frame. As in [35], the number of heads of the multi-head attention is set to 8, and the embedding dimensions $d_q$, $d_k$, and $d_v$ in each layer are $0.25 \times C_{out}$ in all experiments.
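The sliding window step can be sketched as follows. The window length and stride used in the example are placeholder values chosen for illustration; the text does not state the values used in the experiments, and the function name `sliding_windows` is hypothetical.

```python
def sliding_windows(sequence, window, stride):
    """Cut a pose sequence into fixed-length, possibly overlapping clips.

    Trailing frames that do not fill a complete window are dropped.
    """
    last_start = len(sequence) - window
    return [sequence[s:s + window] for s in range(0, last_start + 1, stride)]


frames = list(range(10))                       # stand-in for 10 per-frame poses
clips = sliding_windows(frames, window=4, stride=2)
print(clips)
```

A stride smaller than the window length produces overlapping clips, so each frame is typically covered by more than one clip.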
The training of the model includes two stages: the pre-training stage of the autoencoder and the optimization stage, which refines the embedding and adjusts the clustering. In the pre-training stage, the autoencoder learns to encode and reconstruct sequences by minimizing the reconstruction loss $L_{reconstruction}$, which is the $L_2$ loss between the original spatiotemporal graph and its reconstruction by SAA-STGCAE. The optimization stage combines the reconstruction loss and the clustering loss, and the combined loss function is $L_{combined} = L_{reconstruction} + \lambda L_{cluster}$, where $\lambda$ weights the clustering loss. The default value of $\lambda$ is 0.6.

Comparison with State-of-the-Art Methods
The most popular evaluation metric for video anomaly detection in previous work [14,15,17,19,33,36] is the area under the ROC curve (AUC). Following previous work, we report frame-level AUC results in Table 1. A higher value indicates better anomaly detection performance. Using this metric, we compare our method with appearance-based methods and skeleton-based ones. In general, the skeleton-based methods perform better than the appearance-based methods, especially on the HR-ShanghaiTech Campus subset, where the anomalies are related only to humans. The reason is that these algorithms focus only on human posture rather than on irrelevant features such as complex backgrounds, illumination changes, and dynamic camera views. Among the skeleton-based methods, the GCN-based methods [22,23] perform better than the RNN-based method [21], because a skeleton can be naturally defined as a graph structure and graph convolutional networks have advantages over RNNs in processing non-Euclidean structured data. In addition, our method performs better than GEPC [23], which builds its autoencoder with ST-GCN and can therefore capture only local features in the spatial dimension when modeling the relationships within the skeleton, whereas our method uses self-attention to capture global features of the skeleton and enhance graph convolution. Therefore, the SAA-Graph can understand the intra-frame interaction of different body parts and can dynamically establish the relationships between bones and joints to represent the various parts of the human body.
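For reference, the frame-level AUC can be computed directly from per-frame labels and scores as the probability that a randomly chosen abnormal frame receives a higher anomaly score than a randomly chosen normal frame. The sketch below assumes that labels are 1 for abnormal frames and that higher scores mean "more anomalous" (for example, the negated normality score); the function name and the toy values are hypothetical.

```python
def frame_level_auc(labels, anomaly_scores):
    """Area under the ROC curve via pairwise comparison of frames.

    labels: 1 for abnormal frames, 0 for normal frames.
    anomaly_scores: higher means more anomalous. Ties count as half a win.
    """
    pos = [s for l, s in zip(labels, anomaly_scores) if l == 1]
    neg = [s for l, s in zip(labels, anomaly_scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one frame of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(frame_level_auc(toy_labels, toy_scores))
```

This pairwise form is quadratic in the number of frames, which is acceptable for a small illustration; rank-based implementations are preferable for full test sets.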

Ablation Study
We conduct experiments to evaluate the effectiveness of the SAA-Graph module by comparing it with a graph convolution baseline (Graph) and a self-attention-only module (SA), where all three variants adopt the same temporal convolution (TCN). We follow the GEPC [23] settings to implement the graph convolution baseline, and the results for these simplified variants of our method are listed in Table 2. The SA module depends only on the observed movement and is independent of the natural human body structure. Table 2 shows that the self-attention module achieves performance similar to that of the graph convolution baseline, which suggests that the self-attention module can replace the graph convolution baseline. The experimental results confirm that the self-attention module is effective and that SAA-Graph obtains the best result among the compared variants.

The Visualization of SAA-Graph
To further analyze success and failure cases, we score each sample using the log-probability of the fitted model and visualize video clips from the ShanghaiTech dataset. As shown in Figure 8, our SAA-Graph model can effectively detect human-related abnormal events in most cases. SAA-Graph produces high regularity scores for normal activities and low regularity scores for abnormal activities, and abnormal events produce a sharp drop in the score.

Fail Cases Analysis
SAA-Graph performs better than the related methods, but some failure cases remain. Figure 9a shows a vehicle appearing in the video, which is a non-human-related event that cannot be processed by our method because no skeleton is extracted. Figure 9b shows that the tracked skeleton of an abnormal motion may be lost when the person is occluded by obstacles. The main cause of this error is inaccurate skeleton detection and tracking. We tested current advanced skeleton detection methods, and all of them produce inaccurate skeletons in some situations, such as when the target person occupies a low-resolution area or is occluded by obstacles. Figure 9c shows a slow cyclist misjudged as walking, because the speed and posture are similar to walking and all appearance features have been filtered out. Although individual movements and postures reflect anomalies in most cases, they do not capture the interactions between multiple people or between people and objects in an event. As future work, we will consider using visual features to enhance the skeleton structure to solve this problem.

Conclusions
In this work, we propose a novel spatial temporal self-attention augmented graph convolutional clustering network for skeleton-based video anomaly detection, which employs the SAA-STGCAE to extract features and an embedded clustering layer to cluster them. We showed that the SAA-Graph achieves a more flexible and dynamic representation of the relationships within the skeleton while overcoming the locality of graph convolution. This data-driven approach increases the flexibility of the graph convolutional network and adapts better to varied data samples. To the best of our knowledge, we are the first to use self-attention for video anomaly detection as an enhancement of spatial temporal graph convolution to capture global features. Our proposed model achieves excellent performance on both anomaly detection datasets, ShanghaiTech Campus and CUHK Avenue. Future work includes detecting abnormal interactions between humans and between humans and objects, enhancing skeleton features with appearance features, and looking for a fully self-attentional solution that improves network performance and reduces the number of parameters.