GAT-Based Bi-CARU with Adaptive Feature-Based Transformation for Video Summarisation

Video is now one of the most common forms of social media in daily life. Video summarisation has become an interesting task for information extraction, where the high redundancy of key scenes makes it difficult to retrieve important messages. To address this challenge, this work presents a novel approach called the Graph Attention (GAT)-based bi-directional content-adaptive recurrent unit (Bi-CARU) model for video summarisation. The model uses the graph attention approach to transform the visual features of interesting scene(s) from a video. This transformation is achieved by a mechanism called Adaptive Feature-based Transformation (AFT), which extracts the visual features and elevates them to a higher-level representation. We also introduce a new GAT-based attention model that extracts major features from weight features for information extraction, taking into account the tendency of humans to pay attention to transformations and moving objects. Additionally, we integrate the higher-level visual features obtained from the attention layer with the semantic features processed by Bi-CARU. By combining visual and semantic information, the proposed work enhances the accuracy of key-scene determination. By addressing the high redundancy among major information, our method provides a competitive and efficient way to summarise videos. Experimental results show that our approach is competitive with existing state-of-the-art methods in video summarisation, obtaining the best F-score on SumMe and the second-best on TVSum.


Introduction
Video summarisation is a challenging task for rapid review and content comprehension, with the aim of generating concise representations from original videos. Compared with surveillance video content, people tend to watch frequently changing views or events rather than static content. As a result, summarisation techniques often produce high key-frame redundancy when dealing with user-interest content. This study presents a Bi-CARU model for summarising user-interest videos, which has been adapted using an attention mechanism for content encoding and decoding [1].
The encoder-decoder architecture is built on neural networks; it enables the system to extract information from an input sequence and produce new sequence features that represent the output. The work of [2] first adapted a classical attention-based encoder-decoder structure for supervised video summarisation [3][4][5]. This structure aimed to highlight the major information in images, which could then be used for key-frame discovery. However, the limited output length of the encoder in such a framework caused salient features to be discarded, reducing the accuracy of key-frame selection and prediction. To alleviate this limitation, an attention mechanism was incorporated into the encoder-decoder framework [6]. This mechanism efficiently alleviated the issues arising from the limited encoder output length, but the lack of high-quality human-annotated labels remains, restricting the performance of video summarisation. Consequently, unsupervised models gained popularity, as they do not depend on annotated labels for training. One such unsupervised method was proposed by [7], who utilised dictionary selection for video summarisation [8]. Following this, Ref. [9] introduced a variational auto-encoder with Generative Adversarial Networks (GANs) to summarise videos [10]. Another noteworthy approach was presented by [11], who applied reinforcement learning techniques to obtain video representations and proposed an end-to-end LSTM-based unsupervised learning method for video summarisation [12]. These methods incorporated an advanced feedback reward mechanism that combined diversity and representativeness. To further enhance unsupervised video summarisation, Ref. [13] proposed an unsupervised model (Cycle-LSTM), which integrated a frame selector and a cycle-consistent learning-based evaluator to achieve better performance. Overall, these unsupervised methods addressed the challenges posed by the lack of human-annotated labels, resulting in more efficient video summarisation [14,15].
In practice, the use of Graph Attention Networks (GATs) in video representation is a promising approach within the end-to-end framework. The GAT approach transforms the features of a node in a graph into higher-level features, taking into account the influence of neighbouring nodes that are related to the target node [16][17][18]. Unlike other models, such as the Graph Convolutional Network (GCN), which uses predetermined non-parametric weights [19], or GraphSage, which employs identical weights [20], GAT allocates variable (trainable) weights to the neighbouring nodes based on their contribution to the target node, which is achieved through an attention mechanism [21]. By considering variable weights, GAT can capture more fine-grained information about the relationships between nodes in the graph, thus improving the overall capability for feature extraction [22]. Moreover, by combining the effects of related neighbouring nodes through attention mechanisms, GAT allows for more flexibility and adaptability in dealing with different types of graphs and tasks [23].

Related Works
Video summarisation has become an interesting task in computer vision and multimedia content analysis, aiming to condense a full-length video into a brief and comprehensive summary. In practice, the Content-Aware Video Summarisation (CAVS) approach performs well for video summarisation [24]. CAVS focuses on diversity, representativeness, and interestingness to create a summary of the received content. It extracts and encodes the main features and decodes the high-level textual information to select the most semantically representative summaries [25]. In recent years, attention-based video summarisation methods have become popular due to the CAVS approach and have produced highly competitive results [14,26,27].
Recently, there has been an increasing amount of research on predicate summarisation by integrating semantic information from videos. A query-centred video summary extraction technique was proposed in [28], which treats each frame as equally important and searches a given video to select key frames related to predefined events or interesting scenes [14]. Also, Refs. [29,30] proposed a text-based evaluation method to assess the extent to which the summary preserves the semantic information of the original video. This text-based approach uses Natural Language Processing (NLP) metrics to measure the semantic distance between human-related summaries and the behavioural performance of the generated summaries. In practice, these approaches are efficient and easy to implement but do not take into account the visual quality of the generated summaries. In a different approach, Ref. [31] developed a Semantic Attended Video Summarisation Network (SASUM) that focuses on extracting features from a keyframe that are semantically related to the given textual description. The SASUM consists of a frame selector and a video descriptor, where the frame selector is an LSTM-based model and the video descriptor is an LSTM-based encoder-decoder model. Further, Ref. [32] extended SASUM with two separate modules for video summarisation and moment localisation. Each module estimates a frame-by-frame importance map to indicate keyframes or moments. To further enhance the semantic content of the output summary, Ref. [33] replaced the RNN with a CARU layer combined with an attention mechanism, allowing different input frames to be assigned different importance weights. These methods effectively establish a connection between textual information and a summary determined by multi-behaviour within the video scene.
Recent developments have shown that attention mechanisms are highly effective in video summarisation. The attention mechanism enables the assignment of different weights to input sequences and the capture of global temporal dependencies. Ref. [34] proposed an efficient visual attention model that incorporates both static and dynamic visual detection; these two visual attention models were non-linearly combined to summarise videos [36]. Also, Ref. [35] presented a minimalist transformer-based model for temporal action localisation that combines multi-scale feature representations with local self-attention and uses a lightweight decoder to classify and predict each moment in time. Similarly, Ref. [37] designed a multimodal feature fusion module that fuses frame features with attention-based query features to achieve retrieval-driven video summarisation. These approaches mainly rely on the underlying features and do not consider the temporal dependency between video frames. Furthermore, Ref. [38] enhanced attention by performing weighted score filtering on the input frame sequence to focus on a subset of video frames from various scenes [39,40].
Inspired by the context of video representation, the GAT enables the model to capture the relationships between different sequences or scenes of a video, which can be crucial for accurate video classification. The contributions of this work are as follows:

• This work presents two CNN models for processing the visual and audio components of a video. Both models utilise a spatial CNN layer to account for the time series input by employing the VGGreNet architecture. A Spatial Attention Module (SAM) is also considered to attend to and weight the extracted features, and a subnet is introduced to combine the received features.

• In the Graph Attention Network (GAT) approach, a complete graph is constructed from features derived from the Spatial Attention Module (SAM). The attention mechanism is then applied to the edges of the GAT to extract interesting content from the features.

• An Adaptive Feature-based Transformation (AFT) is then introduced. This transformation is used to improve the calculation of weights, thereby increasing the overall effectiveness of the process. The AFT is integrated into a multi-headed approach, which facilitates the identification of events and enables a more comprehensive understanding and interpretation of the video content.

• Bi-CARU is employed for decoding and extracting contextual information using its content-adaptive gate. It also improves the accuracy of the similarity module by combining the forward and backward features of CARU, taking into account the hidden features and attention to generate detailed sentences for fine-grained prediction.
This approach highlights the potential of combining GAT with AFT within a multi-headed framework to enhance the analysis and interpretation of video content, and it demonstrates the ability of attention mechanisms to extract meaningful information from complex data structures. Overall, this work investigates the combination of GAT and Bi-CARU to enhance accuracy in key-scene selection for video summarisation. These contributions aim to improve the overall performance of video summarisation algorithms, resulting in better outcomes and increased efficiency in video analysis.

Proposed Model
To improve the discrimination between images by efficiently using both visual and audio features, a novel GAT-based Bi-CARU model is developed that integrates the coded features. The proposed model is able to discover the connection between the objects of interest within the sequence of frames, addressing the relationship determination from the hidden feature, as illustrated in Figure 1. The model consists of several major components:
Spatial Attention Module (SAM). This module takes the visual features as input to predict the spatial information of the frame(s) and determines the coordinates of the current frame to compute their relative spatial information and perform consistent coordinate projections. By applying the 3D transformation algorithm, the model can focus on a consistent projective space when acquiring the relevant regional information of the image.
Graph Attention Transformer (GAT). The visual features of the nodes are transformed using the Adaptive Feature-based Transformation (AFT) mechanism. This process enhances the representation of visual features in the model. These visual features are then processed by a Bi-CARU network, which also employs the multi-headed approach for key-frame determination.
Bi-CARU Layer. An advanced bi-directional RNN architecture with CARU is combined with the attention scores generated by the GAT. This Bi-CARU architecture enables simultaneous forward and backward feature searching from the input feature.
Moreover, we make use of a reinforcement network to optimise the loss by backpropagation, which trains the model by iteratively adjusting the attention mechanism. The proposed model is designed to discover and refine the relationships between objects of interest between the encoding and decoding sides. This design provides a clear connection and discards some noise when more objects are extracted within the same scene, which allows the decoding module to perform better. By incorporating these components, our proposed model demonstrates better discrimination between images, as it can effectively use both audio and visual features from the original video.

Spatial Attention Module (SAM)
The proposed SAM performs spatial attention in a convolutional layer. In practice, since this work receives video sequences for analysis, scalable feature inputs and outputs are required, which traditional CNN networks do not handle well. VGGreNet is therefore adopted, as it can receive inputs of non-fixed length while still providing the performance of VGGnet. The SAM first receives and concatenates the (audio and video) features from VGGreNet [41], adjusting inter-spatial features related to behaviours and their relationships. Different from the colour in the 2D domain, the spatial attention is mainly focused on motion feature extraction and also discards noisy information on the time axis from the frame sequence. The proposed SAM for spatial feature extraction consists of Chebyshev-pooling [42] and max-pooling layers to produce weighted features ($w$) as follows:
$$w = \big[\mathrm{ChebyPool}(f);\ \mathrm{MaxPool}(f)\big] \qquad (1)$$
Here, an advanced pooling $\mathrm{ChebyPool}(f) \in \mathbb{R}^{1 \times H \times W}$ is recommended to project the feature $f$ into a probability domain that can be used as a weighting feature. It makes use of the Chebyshev theory, whereby the output feature can be projected to a stable range without the need for a sigmoid function, which provides a readable probabilistic result for subsequent processing. Next, an extractor concatenates these weighted features and projects them into a spatial attention $v(f) \in \mathbb{R}^{H \times W}$ that can be expressed as:
$$v(f) = \sigma\big(\mathrm{CNN}(w)\big) \qquad (2)$$
where $\sigma$ is the sigmoid activation function that ensures the output lies in the range $(0, 1)$. The concatenated features are convolved by a convolutional layer CNN with filter and kernel sizes of 7 × 7 and 3 × 3, respectively. In practice, each $v_t$ can be seen as an interesting scene within a video that needs to be extracted. This enables the extraction of objects or behaviours from the target, effectively eliminating most meaningless content and allowing subsequent procedures to be analysed in more detail.
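The pooling, concatenation, convolution, and sigmoid steps described for the SAM can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`max_pool`, `cheby_pool`, `conv2d_same`, `spatial_attention`) are hypothetical, pooling is taken across the channel axis, a single 3 × 3 kernel per pooled map is used, and `cheby_pool` substitutes the one-sided Chebyshev (Cantelli) bound because the exact ChebyPool definition of [42] is not given in this text.

```python
import math
import random

def max_pool(f):
    """Channel-wise max of a C x H x W feature (nested lists) -> H x W map."""
    return [[max(ch[y][x] for ch in f) for x in range(len(f[0][0]))]
            for y in range(len(f[0]))]

def cheby_pool(f, eps=1e-8):
    """ASSUMED stand-in for ChebyPool: Cantelli bound var / (var + gap^2),
    where gap is the distance of the channel maximum from the channel mean.
    The result lies in (0, 1] without any sigmoid."""
    C, H, W = len(f), len(f[0]), len(f[0][0])
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            vals = [f[c][y][x] for c in range(C)]
            mu = sum(vals) / C
            var = sum((v - mu) ** 2 for v in vals) / C
            gap = max(vals) - mu
            out[y][x] = (var + eps) / (var + gap * gap + eps)
    return out

def conv2d_same(maps, kernels):
    """Sum of zero-padded 'same' 2D convolutions, one k x k kernel per map."""
    H, W = len(maps[0]), len(maps[0][0])
    k = len(kernels[0])
    r = k // 2
    out = [[0.0] * W for _ in range(H)]
    for m, ker in zip(maps, kernels):
        for y in range(H):
            for x in range(W):
                for dy in range(k):
                    for dx in range(k):
                        yy, xx = y + dy - r, x + dx - r
                        if 0 <= yy < H and 0 <= xx < W:
                            out[y][x] += ker[dy][dx] * m[yy][xx]
    return out

def spatial_attention(f, kernels):
    """Concatenate the two pooled maps, convolve them, and squash to (0, 1)."""
    w = [cheby_pool(f), max_pool(f)]          # weighted features, 2 x H x W
    z = conv2d_same(w, kernels)               # H x W response
    return [[1.0 / (1.0 + math.exp(-a)) for a in row] for row in z]

# Hypothetical toy input: 3 channels on a 4 x 5 grid, random 3 x 3 kernels.
random.seed(0)
C, H, W = 3, 4, 5
f = [[[random.gauss(0, 1) for _ in range(W)] for _ in range(H)] for _ in range(C)]
kernels = [[[random.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(3)]
           for _ in range(2)]
v = spatial_attention(f, kernels)
print(len(v), len(v[0]))  # 4 5
```

The sketch keeps the pooled map in a bounded range before the convolution, which mirrors the stated motivation for a probability-like pooling; a trained model would learn the kernels rather than sample them.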

Graph Attention Transformer (GAT)
After receiving the scenes extracted by SAM, GAT discovers potential relationships between these features $v$. As illustrated in the graph attention networks in Figure 1, there is a complete graph consisting of nodes, each connected by edge weights $a_{ij}$. In our study, $a_{ij}: \mathbb{R}^{N_i} \times \mathbb{R}^{N_j} \to \mathbb{R}$ represents the attention scores that reflect the relationship between nodes $v_i$ and $v_j$. The GAT computes attention coefficients $e_{ij}$ across pairs of $v_i$ and $v_j$ for every received scene in the set $V = \{v_1, v_2, \ldots, v_N\}$. The injected graph structure, which only allows a node $v_i$ to attend over nodes $v_j$ in its neighbourhood, can be expressed as follows:
$$e_{ij} = \frac{(W_Q v_i)^{\top} (W_K v_j)}{\sqrt{q}} \qquad (3)$$
$$a_{ij} = \mathrm{Softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})} \qquad (4)$$
Here, it involves learning the parameters of the two linear projections $W_Q \in \mathbb{R}^{D \times D}$ and $W_K \in \mathbb{R}^{D \times D}$ corresponding to the paired video scenes $v_i$ and $v_j$, respectively. To ensure gradient contribution, the factor $1/\sqrt{q}$ ($q = D$) is used to scale the attention value and normalise the output, which can contribute more gradients to enhance convergence than applying an activation function $\sigma$ directly [43]. The pairwise attention matrix $e_{ij} \in \mathbb{R}^{N \times N}$ shows the temporal relation between frame pairs in $V$. The resulting pairwise attention weights $e_{ij}$ are then converted to the corresponding normalised weights $a_{ij}$ using the Softmax function. This approach also provides gradients without additional scaling and allows training parameters to be shared with the rest of the network during backpropagation.
To discover events from these $v_{ij}$, it is necessary to summarise all potential features in these scenes. This work introduces Adaptive Feature-based Transformation (AFT), which focuses on identifying scenes in a video that are significantly different from others, effectively capturing and representing the global diversity between different video scenes. These weights can be adaptively considered by the multi-headed approach [44] to identify and transform interesting events for further decoding. In practice, a LINEAR layer is used to concatenate the dissimilarity between each $a_{ij}$ with $W_k$, $k = 1, 2, \ldots, K$, from the multi-headed approach. This operation is repeated $K$ times, with the parameters updated during each backpropagation. In this update, $a^{(0)}_{rj}$ is initialised to $a_{rj}$ from (4), $a^{(k)}_{rj}$ is the value derived at the $k$-th repetition, and $W_k \in \mathbb{R}^{N \times K}$ serves as a weight matrix specifying a linear transformation containing $N$ event features.
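The scaled pairwise attention described above (two linear projections, scaling by $1/\sqrt{D}$, then a row-wise Softmax) can be illustrated with a short Python sketch. It assumes a complete graph, so every scene attends over all scenes, and uses random matrices in place of trained parameters; the function names `matvec` and `pairwise_attention` are hypothetical and not taken from the source.

```python
import math
import random

def matvec(W, v):
    """Multiply a D x D matrix (list of rows) by a length-D vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def pairwise_attention(V, W_Q, W_K):
    """Return (E, A): scaled scores e_ij and row-normalised weights a_ij."""
    D = len(V[0])
    Q = [matvec(W_Q, v) for v in V]           # projections of v_i
    K = [matvec(W_K, v) for v in V]           # projections of v_j
    scale = 1.0 / math.sqrt(D)                # 1 / sqrt(q) with q = D
    E = [[scale * sum(a * b for a, b in zip(q, k)) for k in K] for q in Q]
    A = []
    for row in E:
        m = max(row)                          # subtract max for stability
        exps = [math.exp(e - m) for e in row]
        total = sum(exps)
        A.append([x / total for x in exps])
    return E, A

# Hypothetical toy data: N = 4 scenes with D = 8 features each.
random.seed(0)
N, D = 4, 8
V = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
W_Q = [[random.gauss(0, 0.3) for _ in range(D)] for _ in range(D)]
W_K = [[random.gauss(0, 0.3) for _ in range(D)] for _ in range(D)]
E, A = pairwise_attention(V, W_Q, W_K)
for row in A:
    print([round(a, 3) for a in row])
```

Each row of `A` sums to one, so row $i$ can be read as the distribution of attention that scene $v_i$ places on the other scenes. Restricting attention to a smaller neighbourhood would amount to masking entries of `E` before the Softmax.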

Bi-CARU Decoder
Linear RNNs are effective in NLP decoding tasks, but they struggle with long-term dependency and convergence issues. Variants such as LSTM and GRU aim to tackle these problems. Our work uses the advanced RNN unit, CARU, which introduces two gates (the content-adaptive gate and the update gate) with fewer parameters than other units. The content-adaptive gate in CARU weighs the input based only on the current feature, allowing for a better combination of weights, unlike the reset gate in GRU, which depends on the entire sequence. This procedure is similar to a tagging task, connecting weights and parts of speech, filtering noise, and highlighting major features in the current input. In the complete procedure, CARU assigns weights to words and gates and multiplies them by content weights, taking into account both words and content. This allows for precise contextual sentence prediction by accurately predicting individual words and analysing their content based on their current parts of speech. Corresponding to Bi-CARU in Figure 1, bidirectional structures enhance context awareness by considering the previous and next hidden features. Let $\overrightarrow{d}$ and $\overleftarrow{d}$ denote the features produced by the forward and backward CARU, which are concatenated. The word probabilities are obtained through the decoding process, in which a Feed-Forward Network (FFN) is used to decode the word feature. Each update contains sub-updates corresponding to individual FFN parameter vectors. These vectors promote comprehensible concepts, which are often easy for humans to understand. The model is trained using word-level cross-entropy loss and uses the Adam optimiser [45].
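To make the two-gate update and the bidirectional concatenation concrete, the following Python sketch runs a CARU cell forward and backward over a short sequence. The gate equations are an assumption: they follow the CARU formulation as it is commonly stated (a projected input $x_t$, a candidate state $n_t$, an update gate $z_t$, and a gate $l_t = \sigma(x_t) \odot z_t$ that depends on the current input), since the equations of this section are not reproduced in this text. All function and parameter names are hypothetical, and the weights are random rather than trained.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def affine(W, b, v):
    """Compute W v + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def caru_step(p, v, h):
    """One ASSUMED CARU update: h_next = (1 - l) * h + l * n."""
    x = affine(p["W_vn"], p["b_vn"], v)                       # projected input
    hn = affine(p["W_hn"], p["b_hn"], h)
    n = [math.tanh(a + b) for a, b in zip(hn, x)]             # candidate state
    hz = affine(p["W_hz"], p["b_hz"], h)
    vz = affine(p["W_vz"], p["b_vz"], v)
    z = [sigmoid(a + b) for a, b in zip(hz, vz)]              # update gate
    l = [sigmoid(a) * b for a, b in zip(x, z)]                # content-adaptive gate
    return [(1.0 - li) * hi + li * ni for li, hi, ni in zip(l, h, n)]

def run_caru(p, seq, hidden):
    """Run the cell over a sequence from a zero state; return all states."""
    h = [0.0] * hidden
    states = []
    for v in seq:
        h = caru_step(p, v, h)
        states.append(h)
    return states

def bi_caru(p_fwd, p_bwd, seq, hidden):
    """Concatenate forward and backward states at every time step."""
    fwd = run_caru(p_fwd, seq, hidden)
    bwd = run_caru(p_bwd, seq[::-1], hidden)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]

def init_params(rng, d_in, hidden):
    """Random illustrative parameters; a real model would learn these."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {"W_vn": mat(hidden, d_in), "b_vn": [0.0] * hidden,
            "W_hn": mat(hidden, hidden), "b_hn": [0.0] * hidden,
            "W_hz": mat(hidden, hidden), "b_hz": [0.0] * hidden,
            "W_vz": mat(hidden, d_in), "b_vz": [0.0] * hidden}

# Hypothetical toy sequence: T = 5 steps of 6-dimensional features.
rng = random.Random(0)
T, d_in, hidden = 5, 6, 4
seq = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(T)]
states = bi_caru(init_params(rng, d_in, hidden), init_params(rng, d_in, hidden),
                 seq, hidden)
print(len(states), len(states[0]))  # 5 8
```

Because each update is a convex combination of the previous state and a tanh-bounded candidate, every hidden value stays inside $(-1, 1)$. In a full decoder, the concatenated state at each step would be passed to the FFN to produce word probabilities.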

Experimental Analysis
The model's performance was evaluated on two datasets: TVSum [46] and SumMe [47]. TVSum is a collection of 50 videos that have been carefully selected from the vast amount of content available on YouTube. These videos are divided into N = 10 categories, making it more complex for our model to navigate. SumMe consists of 25 user-created videos covering a diverse range of topics, mostly from sports to vacation experiences. The duration of the SumMe videos ranges from 1 to 6 min. To understand user behaviour and preferences, SumMe involves over 15 users in creating human-generated summaries. The users also provide attention scores at the frame level, which offers valuable insight and data for this work.
In addition, we evaluate the similarity of the generated summaries to human-annotated summaries following the configuration in [6]. The evaluation metric used is the F-score, which is the harmonic mean of precision ($P$) and recall ($R$). The correct parts are identified by overlapping word features between human-generated and model-generated summaries. The F-score is then calculated as:
$$F = \frac{2 \times P \times R}{P + R}$$
where $P$ is defined as $P = \frac{\text{overlap}}{\text{prediction}}$, and $R$ is defined as $R = \frac{\text{overlap}}{\text{user-annotated}}$.
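As a small worked example of this metric, the sketch below computes precision, recall, and the F-score from two sets of selected units. It assumes that the summaries are represented as sets of discrete units (frame indices here, though word features would work the same way); the function name `f_score` and the example ranges are hypothetical.

```python
def f_score(predicted, annotated):
    """Harmonic mean of precision and recall over two sets of units."""
    predicted, annotated = set(predicted), set(annotated)
    overlap = len(predicted & annotated)
    if overlap == 0:
        return 0.0                            # also covers empty inputs
    precision = overlap / len(predicted)      # overlap / prediction
    recall = overlap / len(annotated)         # overlap / user-annotated
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: the model selects 20 frames, the annotator 40,
# and 10 frames are shared, so P = 0.5 and R = 0.25.
model_frames = range(10, 30)
user_frames = range(20, 60)
print(round(f_score(model_frames, user_frames), 4))  # 0.3333
```

With several annotators per video, the per-annotator scores would still need to be aggregated (for example by averaging or taking the maximum); that choice follows the evaluation configuration being used and is not fixed by this sketch.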
Table 1 provides a comprehensive overview of the comparison. Our proposed method obtained the second-best and best F-scores on TVSum and SumMe, respectively. In practice, our method outperforms the baseline studies that use LSTM structures, such as Bi-LSTM [48] and DPP-LSTM [6]. These methods require the temporal feature from the forget gate in LSTM, which can often cause the long-term dependency problem and perturb the attention weights produced by GAT. Our proposed method uses the features received from SAM and applies the attention mechanism to the edges of GAT to discover interesting content. Bi-CARU can adaptively select the long-term feature field through its content-adaptive gate. The AFT enhances the weight calculation and is integrated into the multi-headed approach to determine the event for a more comprehensive understanding of the video. As shown in Table 2, this analysis highlights the interesting scenes identified by humans, by two advanced methods, and by this work. This work has selected interesting scenes that cover the human annotation (ground truth) and accurately represent the time period of the video. It is worth noting that most of the comparison algorithms have relatively low F-scores on the SumMe dataset compared to the TVSum dataset. This observation suggests that these comparison algorithms perform poorly when dealing with raw or simply edited videos. In addition, our method's effectiveness and potential for practical applications are further supported by the high score of 58.4% on the TVSum dataset.
Table 2. Visualisation of interesting scenes from the "Paper Wasp Removal|From the Ground Up" video in the TVSum dataset. From top to bottom, the highlighted scenes are determined by human annotation, SUM-FCN [49], SUM-DeepLab [49], and this work.

Approach | Scene 1 | Scene 2 | Scene 3 | Scene 4
Human annotation | There were two wasps on the branch. | Beehives carved from beeswax. | Man standing in front of wooden greenhouse. | Hive on a table with a couple of bees.
SUM-FCN [49] | A bee hanging from a branch. | Beehive made with natural beeswax. | Man standing outside wooden greenhouse. | Close-up of bees in a hive on a table.

Feature Visualisation by Video Descriptions
To show the weighting of the extracted features, this section displays the relationship between the attention scores and the relevant objects. This allows for post-processing of the significant messages generated by the video description of the study [55]. This technique aims to produce a heatmap covering the interesting scene(s) based on the information extracted from the selected frames. The objective is to indicate the various intervals and their dynamics that affect the viewer's attention and understanding of the video content. As highlighted in Figure 2, the heatmap focuses primarily on the human face or the moving object within a sequence. When considering the whole video, the video description generated with the proposed AFT can better describe the interesting objects and their connections. Therefore, the saliency of the video description generated by Bi-CARU accurately represents the main content of the textual description. The video summaries generated by our model are consistent with the video descriptions, which further supports the effectiveness of our approach.

Conclusions
To improve the accuracy of video summarisation, this work introduces a GAT-based Bi-CARU model that enhances the feature extraction between interesting frames by capturing visual and audio features. The SAM module first takes the visual features as input to predict the spatial information of the frame(s) and then determines the coordinates of the current frame to compute their relative spatial information and perform consistent coordinate projections. Also, redundant frames, which receive low attention calculated from the weight feature(s), can be discarded by the proposed AFT selection. Moreover, the Bi-CARU is adapted to decode the summary for the interesting scene(s) predicted by the GAT module. The experiments show that the proposed GAT-based Bi-CARU model achieves competitive results compared with the advanced methods. In future work, we will integrate spatial analysis into our model and explore the feasibility of reconstructing 3D space to make better use of the video structure. We will also improve the encoding to represent fine-grained features from a global sequence perspective.

Figure 1. Framework of the proposed GAT-based Bi-CARU for video summarisation.

Table 1. Comparison with state-of-the-art methods in F-score (%); the best scores are in bold.

Table 2. Cont.
on a branch in the tree. | Detailed beehive shaped with beeswax. | Man standing next to rustic wooden greenhouse structure. | A bee in a hive on a wooden table.