On the Use of Deep Learning for Video Classification

Abstract: The video classification task has seen significant success in recent years. In particular, the topic has attracted more attention since deep learning models emerged as a successful tool for automatically classifying videos. In recognition of the importance of the video classification task, and to summarize the success of deep learning models for this task, this paper presents a comprehensive and concise review of the topic. Several reviews and survey papers related to video classification already exist in the scientific literature. However, the existing review papers do not include the most recent state-of-the-art works, and they also have some limitations. To provide an updated and concise review, this paper highlights the key findings of existing deep learning models. The key findings are also discussed in a way that suggests future research directions. This review mainly focuses on the type of network architecture used, the evaluation criteria used to measure success, and the datasets used. To make the review self-contained, the emergence of deep learning methods for automatic video classification and the state-of-the-art deep learning methods are explained and summarized. Moreover, a clear comparison of the newly developed deep learning architectures and the traditional approaches is provided. The critical challenges based on the benchmarks are highlighted for evaluating the technical progress of these methods. The paper also summarizes the benchmark datasets and the performance evaluation metrics for video classification. Based on this compact, complete, and concise review, the paper proposes new research directions to solve the challenging video classification problem.


Introduction
The task of automatically classifying videos has become very successful recently. In particular, the subject has drawn increased interest since deep learning models became an effective method for automatically classifying videos. The importance of accurate video classification is evident from the large amount of video data available online. People around the world generate and consume a huge amount of video content. Currently, on YouTube alone, over 1 billion hours of video are watched every single day. In recognition of the importance of the video classification task, researchers are making a combined effort to propose accurate video classification frameworks. Companies such as Google AI are investing in different competitions to solve this challenging problem under constrained conditions. To further advance the progress of the automatic video classification task, Google AI has released a public dataset called YouTube-8M with millions of video features and more than 3700 labels. All these efforts demonstrate the need for a powerful video classification model.
An artificial neural network (ANN) is an algorithm based on interconnected nodes that recognizes relationships in a set of data. Algorithms based on ANNs have shown great promise for this task. Several reviews and survey papers related to video classification already exist in the scientific literature, as summarized below:
1. A recent review was done by A. Anusya [5]; this review covers only a few methods for video classification, clustering, and tagging. However, the review is not comprehensive and lacks concise information, coverage of the topic, datasets, analysis of state-of-the-art approaches, and research limitations;
2. Rani et al. [6] also conducted a recent review of video classification methods; their review covers some recent video classification approaches and summary-based descriptions of some recent works. This review also has some limitations, including the missing analysis of recent state-of-the-art approaches and a very limited description of the topics covered;
3. Y. Li et al. [7] recently conducted a systematic and thorough review of live sport video classification. This review covers most of the recent works in live sport video classification, including the tools, video interaction features, and feature extraction methods. It is a comprehensive review, but the findings are not summarized in tables of research gaps and advantages and disadvantages of existing methods for a quick overview. Moreover, this review is specific to live sport video classification;
4. A recent review was also done by Md Islam et al. [8]; in this review, they included all methods for video classification, including deep learning. However, as the focus of the review is not on deep learning approaches, these methods are not completely covered;
5. Ullah, H. et al. [9] also conducted a recent systematic review; however, the focus of their review remained on human activity recognition;
6. Z. Wu et al. [10] presented a concise review of video classification specific to deep learning methods. This review provides a good description of deep learning models, feature extraction tools, benchmark datasets, and a comparison of existing methods for video classification. However, this review was conducted in 2016, and it does not cover the recent state-of-the-art deep learning methods;
7. Q. Ren [11] conducted a simple review of video classification methods; however, the techniques covered in this review are not well described, and the review also lacks a description of research gaps, benchmark datasets, limitations of existing methods, and performance metrics.
In contrast to the existing reviews on the classification of videos, this paper provides a more comprehensive, concise, and up-to-date review of deep learning approaches for video classification. In this review, most of the recent state-of-the-art contributions related to the topic are analyzed and critically summarized. Deep learning is an emerging and vibrant field for the analysis of videos; therefore, we hope this review will help stimulate future research along this line. The key contributions of this review paper are as follows:
1. A summary of state-of-the-art, CNN-based deep learning models for image analysis;
2. An in-depth review of deep learning approaches for video classification, highlighting the notable findings;
3. A summary of breakthroughs in the automatic video classification task;
4. An analysis of research trends from the past towards the future;
5. A description of benchmark datasets and evaluation metrics, and a comparison of recent state-of-the-art deep learning approaches in terms of performance.
The rest of the paper is organized as follows: Section 2 reviews some existing CNNs for images; Section 3 provides an in-depth review of deep learning models for video classification; Section 4 summarizes benchmark datasets and evaluation metrics and compares existing state-of-the-art methods for the video classification task; and Section 5 provides conclusions and future research directions.

Convolutional Neural Networks (CNN) for Image Analysis
Deep learning models, specifically convolutional neural networks (CNNs), are well known for understanding images. A number of CNN architectures have been proposed and developed in the scientific literature for image analysis. Among these, the most popular architectures are LeNet-5 [12], AlexNet [13], VGGNet [14], GoogLeNet [15], ResNet [16], and DenseNet [17]. The trend from the earlier architectures towards the more recently proposed architectures is to deepen the network. A summary of these popular CNN architectures, along with the trend of deepening the network, is shown in Figure 1, where the depth of the network increases from the left-most (LeNet-5) to the right-most (DenseNet). Deep networks are believed to better approximate the target function and to generate feature representations with more discriminatory power [18]. Although deeper networks have more discriminatory power, they require more training data and have more parameters to tune [19]. Finding a large, professionally labeled dataset is still a big challenge for the research community, and this limits the development of deeper neural networks.
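As a toy illustration of this deepening trend, the following sketch stacks a configurable number of convolution-pooling blocks into a small image classifier; the layer sizes and depth are assumptions for illustration and do not reproduce any of the published architectures listed above.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # One "stage": convolution, normalization, non-linearity, spatial downsampling.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class SmallImageCNN(nn.Module):
    def __init__(self, num_classes=10, depth=4):
        super().__init__()
        chans = [3] + [32 * 2 ** i for i in range(depth)]      # 3 -> 32 -> 64 -> ...
        self.features = nn.Sequential(*[conv_block(chans[i], chans[i + 1])
                                        for i in range(depth)])
        self.classifier = nn.Linear(chans[-1], num_classes)

    def forward(self, x):                                      # x: (N, 3, H, W)
        f = self.features(x).mean(dim=(2, 3))                  # global average pooling
        return self.classifier(f)

logits = SmallImageCNN(depth=4)(torch.randn(2, 3, 224, 224))   # shape (2, 10)
```

Increasing the `depth` argument adds more convolutional stages, which mirrors the progression from LeNet-5-sized models towards deeper networks.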

Video Classification
In this section, a comprehensive and concise review of deep learning models employed in the video classification task is provided. This section covers video data modalities, traditional handcrafted approaches, breakthroughs in video classification, and recent state-of-the-art deep learning models for video classification.

Video Data Modalities
Compared to images, videos are more challenging to understand and classify due to the complex nature of their temporal content. However, three different modalities, i.e., visual, audio, and text information, may be available to classify videos, in contrast to image classification, where only a single visual modality can be utilized. Based on the availability of different modalities in videos, the classification task can be categorized as uni-modal or multi-modal video classification, as summarized in Figure 2. The existing literature has utilized both of these models for the video classification task, and it is generally believed that models utilizing multi-modal data perform better than models based on uni-modal data [20,21]. Moreover, the visual description [22] of a video works better than the text [23] and audio [24,25] descriptions for the purpose of classifying a video.
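The following is a minimal sketch of the distinction between uni-modal and multi-modal classification, assuming precomputed feature vectors for each modality; the feature dimensions and the simple late-fusion averaging are illustrative assumptions, not a specific published model.

```python
import torch
import torch.nn as nn

class MultiModalClassifier(nn.Module):
    def __init__(self, dims=None, num_classes=400):
        super().__init__()
        # Assumed feature sizes for visual, audio, and text embeddings.
        dims = dims or {"visual": 2048, "audio": 128, "text": 300}
        self.heads = nn.ModuleDict({m: nn.Linear(d, num_classes) for m, d in dims.items()})

    def forward(self, feats):                       # feats: dict of modality -> (N, dim)
        # Late fusion: average the class scores of whichever modalities are present.
        logits = [head(feats[m]) for m, head in self.heads.items() if m in feats]
        return torch.stack(logits).mean(dim=0)

model = MultiModalClassifier()
uni = model({"visual": torch.randn(4, 2048)})                   # uni-modal (visual only)
multi = model({"visual": torch.randn(4, 2048),
               "audio": torch.randn(4, 128),
               "text": torch.randn(4, 300)})                    # multi-modal fusion
```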

Traditional Handcrafted Features
During the earlier developments of the video classification task, traditional handcrafted features were combined with state-of-the-art machine learning algorithms to classify videos. Some of the most popular handcrafted feature representation techniques used in the literature are spatiotemporal interest points (STIPs) [26], improved dense trajectories (iDT) [27], SIFT-3D [28], HOG3D [29], motion boundary histograms [30], action bank [31], cuboids [32], 3D SURF [33], and dynamic-poselets [34]. These hand-designed representations use different feature encoding schemes, such as those based on pyramids and histograms. Among these handcrafted representations, iDT is widely considered the state-of-the-art. Many recent competitive studies have demonstrated that handcrafted features [35-38] and high-level [39,40] and mid-level [41,42] video representations have contributed to the task of video classification with deep neural networks.
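As a rough illustration of this classical pipeline, the sketch below computes a handcrafted per-frame descriptor (HOG is used here only as a simple stand-in for STIP/iDT-style features), averages it over time, and trains an SVM; the frame sizes, HOG settings, and toy labels are assumptions.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def video_descriptor(frames):
    """frames: list of grayscale images (H, W); returns one fixed-size vector."""
    per_frame = [hog(f, orientations=9, pixels_per_cell=(16, 16),
                     cells_per_block=(2, 2)) for f in frames]
    return np.mean(per_frame, axis=0)              # temporal average pooling

# Toy data: 8 "videos" of 10 random 64x64 frames each, with two classes.
rng = np.random.default_rng(0)
videos = [[rng.random((64, 64)) for _ in range(10)] for _ in range(8)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

X = np.stack([video_descriptor(v) for v in videos])
clf = LinearSVC().fit(X, labels)                   # classical classifier on handcrafted features
print(clf.predict(X[:2]))
```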

Deep Learning Frameworks
Along with the development of more powerful deep learning architectures in recent years, the trend for the video classification task has shifted from traditional handcrafted approaches to fully automated deep learning approaches. Among the most common deep learning architectures used for video classification is the 3D-CNN model. An example of a 3D-CNN architecture used for video classification is given in Figure 3 [43]. In this architecture, 3D blocks are utilized to capture the video information necessary to classify the video content. Another very common architecture is the multi-stream architecture, where the spatial and temporal information is processed separately, and the features extracted from the different streams are then fused to make a decision. To process the temporal information, different methods are used; the two most common are based on (i) RNNs (mainly LSTM) and (ii) optical flow. An example of a multi-stream network model [44], where the temporal stream is processed using optical flow, is shown in Figure 4. A high-level overview of the video classification process is shown in Figure 5, where the stages of feature extraction and prediction are shown with the most common types of strategies used in the literature. In the upcoming sections, the breakthroughs in video classification and studies related to the classification of videos, specifically using deep learning frameworks, are summarized, describing the success of deep learning architectures and their associated limitations.
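A minimal 3D-CNN sketch in the spirit of Figure 3 is given below; the layer sizes are assumptions and do not reproduce the architecture of [43]. The key point is that 3D convolutions slide over the temporal dimension as well as the spatial ones.

```python
import torch
import torch.nn as nn

class Tiny3DCNN(nn.Module):
    def __init__(self, num_classes=101):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),                       # pool only space at first
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                               # then pool time and space
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clip):                               # clip: (N, 3, T, H, W)
        f = self.features(clip).mean(dim=(2, 3, 4))        # global spatiotemporal pooling
        return self.classifier(f)

logits = Tiny3DCNN()(torch.randn(2, 3, 16, 112, 112))      # shape (2, 101)
```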

Breakthroughs
The breakthroughs in recognition of still images originated with the introduction of a deep learning model called AlexNet [13]. The same concept of still-image recognition using deep learning was also extended to videos, where individual video frames are collectively processed as images by a deep learning model to predict the contents of a video. Features are extracted from individual video frames and then temporally integrated into a fixed-size descriptor using pooling. This is done either through high-dimensional feature encoding [45,46] or through RNN architectures [47-50]. For unsupervised spatiotemporal feature learning with 3D convolutions, restricted Boltzmann machines [51] and stacked ISA [52] were also studied in parallel. 3D-CNNs using temporal convolutions to extract temporal features automatically were first proposed by Baccouche et al. [53] and by Ji et al. [54].
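The frame-level pipeline described above can be sketched as follows: a 2D image CNN (here a torchvision ResNet-18 backbone, chosen only as an example feature extractor) encodes each frame, and temporal average pooling yields a fixed-size video descriptor; the class count and clip length are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

backbone = resnet18(weights=None)                 # pretrained weights could be used instead
backbone.fc = nn.Identity()                       # keep the 512-d frame embedding
classifier = nn.Linear(512, 101)

def classify_clip(frames):                        # frames: (T, 3, H, W)
    with torch.no_grad():
        per_frame = backbone(frames)              # (T, 512) frame-level features
    video_descriptor = per_frame.mean(dim=0)      # temporal pooling into one fixed-size vector
    return classifier(video_descriptor)

logits = classify_clip(torch.randn(16, 3, 224, 224))   # shape (101,)
```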

Basic Deep Learning Architectures for Video Classification
The two most widely used deep learning architectures for video classification are the convolutional neural network (CNN) and the recurrent neural network (RNN). CNNs are mostly used to learn the spatial information from videos, whereas RNNs are used to learn the temporal information, as the main difference between these two architectures is the ability to process temporal information or data that come in sequences. Therefore, in general, these two network architectures are used for completely different purposes. However, the nature of video data, with the presence of both spatial and temporal information, demands the use of both architectures to accurately process the two streams of information. A CNN applies different filters in its convolutional layers to transform the data. RNNs, on the other hand, feed previous activations back to generate the next output in the sequence from the other data points. The use of 2D-CNNs alone therefore limits the understanding of video to the spatial domain only, whereas RNNs can understand the temporal content of a sequence. Both these basic architectures and their enhanced versions have been applied in several studies for the task of video classification.
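A hedged sketch of this CNN + RNN combination is shown below: a small 2D CNN encodes each frame (spatial information) and an LSTM integrates the frame sequence (temporal information); the feature sizes and the single-layer LSTM are assumptions, not a specific published model.

```python
import torch
import torch.nn as nn

class CNNLSTMClassifier(nn.Module):
    def __init__(self, feat_dim=256, hidden=128, num_classes=101):
        super().__init__()
        self.cnn = nn.Sequential(                       # per-frame spatial encoder
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(32 * 4 * 4, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)   # temporal model
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clip):                            # clip: (N, T, 3, H, W)
        n, t = clip.shape[:2]
        frames = clip.flatten(0, 1)                     # (N*T, 3, H, W)
        feats = self.cnn(frames).view(n, t, -1)         # (N, T, feat_dim)
        _, (h_n, _) = self.lstm(feats)
        return self.head(h_n[-1])                       # classify from the last hidden state

logits = CNNLSTMClassifier()(torch.randn(2, 16, 3, 112, 112))     # shape (2, 101)
```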

Developments in Video Classification over Time
The existing approaches for video classification are categorized based on their working principle in Table 2. The trend observed in the existing literature is that the recently developed state-of-the-art deep learning models outperform the earlier handcrafted classical approaches. This is mainly due to the availability of large-scale video data for learning deep neural network architectures. Besides an improvement in classification performance, the recently developed models are mostly self-learned and do not require any manual feature engineering. This added advantage makes them more feasible for use in real applications. However, the better-performing, recently developed architectures are deeper than the previously developed architectures, which brings a compromise in the computational complexity of the deep architectures.

Categories | Working Principle | References
Hand-crafted approaches | These representations are handcrafted and employ various feature encoding techniques, such as histograms and pyramids. |
2D-CNNs | Image-based models where frame-level feature extraction is performed using a CNN architecture and classification is performed using state-of-the-art classification models, for example, SVM. | [55]
3D-CNNs | Extension of 2D image classification to 3D for video (for example, the Inception 3D (I3D) architecture). | [56]
Spatiotemporal convolutional networks | To aggregate the temporal and the spatial information, these methods primarily depend on convolution and pooling. | [54,57,58]
Recurrent spatial networks | To represent temporal information in videos, recurrent neural networks such as LSTM or GRU are used. | [47,53,59,60]
Two/multi-stream networks | In addition to the context frame visuals, these methods use stacked optical flow to identify movements. | [50,61-63]
Mixed convolutional models | Models built with the ResNet architecture in mind that utilize 3D convolutions in the bottom or top layers but 2D convolutions in the remainder (hence "mixed convolutional" models), or methods based on mixed temporal convolutions with different kernel sizes. | [64,65]
Hybrid approaches | Models based on the integration of CNN and RNN architectures. | [66-68]

Among the initially developed hand-crafted representations, improved dense trajectories (iDT) [27] is widely considered the state-of-the-art, whereas many recent competitive studies have demonstrated that hand-crafted features [35-38] and high-level [39,40] and mid-level [41,42] video representations have contributed to the task of video classification with deep neural networks. The hand-crafted models were among the very early developments for the video classification problem. Later, 2D-CNNs were proposed for video classification, where image-based CNN models are used to extract frame-level features, and based on these frame-level CNN features, state-of-the-art classification models (for example, SVM) are learned to classify videos. These 2D-CNN models do not require any manual feature extraction, and they performed better than the competing hand-crafted approaches. After the successful development of 2D-CNN models, where features are extracted at the frame level, the same concept was extended to propose 3D-CNNs that extract features from video volumes. The proposed 3D-CNNs are computationally more expensive than the 2D-CNN models. However, because they consider temporal variations in feature extraction, 3D-CNN models are believed to perform better than 2D-CNN models for video classification [54,58,69].
The development of 3D-CNN models paved the way for fully automatic video classification models using different deep learning architectures. Among these developments, spatiotemporal convolutional networks are approaches that integrate temporal and spatial information using convolutional networks to perform video classification. To collect temporal and spatial information, these methods primarily rely on convolution and pooling layers. Stacked optical flow is used in two/multi-stream network methods to identify movements in addition to the context frame visuals. Recurrent spatial networks use recurrent neural networks (RNNs), such as LSTM or GRU, to model temporal information in videos. The ResNet architecture is used to build mixed convolutional models; these models utilize 3D convolutions in the bottom or top layers but 2D convolutions in the remainder and are therefore referred to as "mixed convolutional" models. They also include methods based on mixed temporal convolutions with different kernel sizes. Advanced architectures based on DenseNet have also shown promising results for the video classification task; some notable examples include region-based CNN (R-CNN) [70,71], faster R-CNN [72,73], and YOLO [74]. Besides these architectures, there are also hybrid approaches based on the integration of CNN and RNN architectures. A summary of these architectures is provided in Figure 6.
The different deep learning architectures described above employ different fusion strategies. These fusion strategies are used either to fuse different features extracted from the video or to fuse the different models used in the architecture. The fusion strategies mainly used for the extracted features are (i) concatenation, (ii) product, (iii) summation, (iv) maximum, and (v) weighted fusion. The concatenation approach simply combines all the features together, and all the features are used for classification. The product/summation approach computes the product/summation of the features extracted using different strategies and uses the result to perform classification. The maximum approach takes the maximum value of the features extracted using different strategies and uses that for classification. The weighted approach gives different weights to different features and performs the classification using the weighted features. The different fusion methods are summarized in Figure 7.
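The listed feature-fusion strategies can be sketched as simple tensor operations, as below; the two 512-dimensional streams and the learned classifier are illustrative assumptions.

```python
import torch
import torch.nn as nn

def fuse(f1, f2, mode="concatenation", w=(0.5, 0.5)):
    if mode == "concatenation":
        return torch.cat([f1, f2], dim=-1)        # all features kept side by side
    if mode == "summation":
        return f1 + f2
    if mode == "product":
        return f1 * f2                            # element-wise product
    if mode == "maximum":
        return torch.maximum(f1, f2)              # element-wise maximum
    if mode == "weighted":
        return w[0] * f1 + w[1] * f2              # fixed or learned weights
    raise ValueError(mode)

spatial = torch.randn(4, 512)                     # e.g., appearance-stream features
temporal = torch.randn(4, 512)                    # e.g., motion-stream features
fused = fuse(spatial, temporal, mode="maximum")   # (4, 512); concatenation gives (4, 1024)
classifier = nn.Linear(fused.shape[-1], 101)
logits = classifier(fused)
```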

Summary of Some Notable Deep Learning Framework Developments
A summary of some deep learning architectures for video classification is provided in Table 3. These studies are summarized based on the architecture, the datasets, the evaluation metrics, the fusion strategy, and the notable findings. The most common architectures for video classification are fundamentally based on RNN and CNN architectures; classification accuracy is one of the most common evaluation metrics; the UCF-101 and Sports-1M datasets are the choice for validation in most cases; a multi-class classification problem is considered in almost all cases; SMART blocks outperform 3D convolutions in terms of spatiotemporal feature learning; and average fusion, kernel average fusion, weighted fusion, logistic regression fusion, and MKL fusion all prove to be inferior to the multi-stream multi-class fusion technique. Moreover, a more applied form of classification in videos is to identify/recommend tags or thumbnails in videos, and this specific task is successfully carried out in [75-79].

Few-Shot Video Classification
Few-shot learning (FSL) has received a great deal of interest in recent years. FSL tries to identify new classes with only one or a few labeled samples [80-83]. However, because most recent work on few-shot learning has centered on image classification, FSL in the video domain is still hardly explored [84,85]. Some of the notable works done in this domain are discussed below.
A multi-saliency embedding technique was developed by Zhu et al. [85] to encode a variable-length video stream into a fixed-size matrix. Graph neural networks (GNNs) were developed by Hu et al. [86] to enhance the video classification model's capacity for discrimination; nevertheless, the local-global link in a distributed representation space was still disregarded. To categorize a previously unseen video, Cao et al. [87] introduced a temporal alignment module (TAM) that explicitly took advantage of the temporal ordering information in video data through temporal alignment. To combine the two-stream aspects of videos more effectively, Fu et al. [88] developed a depth-guided adaptive instance normalization module (DGAdaIN). A C3D encoder was created by Zhang et al. [89] to capture close-range action patterns in spatiotemporal video blocks. Few-shot video categorization was addressed by Qi et al. [90] by learning a collection of SlowFast networks enhanced with memory units. To understand realistic videos of the target classes, Fu et al. [91] presented embodied agent-based one-shot learning, which made use of synthetic videos created in a virtual environment. For the issues of few-shot and zero-shot action recognition, Bishay et al. [92] presented the temporal attentive relation network (TARN), which was trained to compare representations of varying temporal length. By examining local-global linkages and preserving the specifics of properties, Y. Feng et al. [93] recently presented a dual-routing capsule graph neural network (DR-CapsGNN) to address the issue of severely constrained samples in few-shot learning.
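As an illustration of how such few-shot episodes are typically evaluated, the following is a prototypical-network-style sketch (one common FSL recipe, not one of the specific cited methods): class prototypes are the means of the support embeddings, and each query is assigned to the nearest prototype. The embedding dimension and episode layout are assumptions.

```python
import torch

def few_shot_classify(support, support_labels, query):
    """support: (K*S, D) video embeddings, support_labels: (K*S,), query: (Q, D)."""
    classes = support_labels.unique()
    prototypes = torch.stack([support[support_labels == c].mean(dim=0) for c in classes])
    dists = torch.cdist(query, prototypes)        # (Q, K) Euclidean distances
    return classes[dists.argmin(dim=1)]           # nearest-prototype prediction

# 5-way 1-shot toy episode with 64-d embeddings (assumed to come from a video encoder).
support = torch.randn(5, 64)
support_labels = torch.arange(5)
query = torch.randn(3, 64)
print(few_shot_classify(support, support_labels, query))
```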
Apart from this, contrastive learning has also proved successful in recognizing human actions. Some interesting works in this regard are multi-granularity anchor-contrastive representation learning [94] and X-invariant contrastive augmentation and representation learning [95].

Geometric Deep Learning
Shape descriptors play a significant role in the description of manifolds for 3D shapes. In general, a global feature descriptor is created by aggregating local descriptors to describe the geometric properties of the entire shape, for example, using the bag-of-features paradigm. A local feature descriptor assigns a vector to each point on the shape in a multi-dimensional descriptor space, representing the local structure of the shape around that point. Most deep learning techniques that deal with 3D shapes essentially use the CNN paradigm. Volumetric or 2D multi-view shape representations are processed directly using standard (Euclidean) CNN architectures in methods such as [96,97]. These techniques are unsuited for dealing with deformable shapes because the shape descriptors they use depend on extrinsic structures that are invariant under Euclidean transformations, as demonstrated in Figure 8a [98], while some other approaches [99-103] create a new framework by adapting the CNN feature extraction pattern to intrinsic CNN variants that can handle shape deformations by using an intrinsic filter structure, as shown in Figure 8b [98]. Geometric deep learning deals with non-Euclidean graph and manifold data. This type of data (irregularly arranged/randomly distributed) is usually used to describe geometric shapes. The purpose of geometric deep learning is to find the underlying patterns in geometric data where traditional Euclidean distance-based deep learning approaches are not suitable. There are basically two classes of methods in the literature for applying deep learning to geometric data: (i) extrinsic methods and (ii) intrinsic methods. The filters in extrinsic methods are applied to the 3D surface in a way that is affected by structural deformations, because of the extrinsic filter structure. The key weakness of extrinsic approaches [96,97] is that they continue to treat geometric data as Euclidean information. When an object's position or shape changes, the extrinsic data representation fails. Additionally, for these methods to support the task, challenging in practice, of attaining invariance to shape deformation, complicated models and extensive training are required. The filters in intrinsic approaches are applied on the 3D surface without being affected by structural deformations. Rather than a Euclidean realization, intrinsic methods work on the manifold and are isometry-invariant by construction. Some of the works based on intrinsic deep learning include (i) geodesic CNN [99], (ii) anisotropic CNN [100], (iii) mixture model network [101], (iv) structured prediction model [102], (v) localized spectral CNN [103], (vi) PointNet [104], (vii) PointNet++ [105], and (viii) RGA-MLP [106]. The application of geometric deep learning (mostly intrinsic methods) to analyzing videos can help in better understanding from the machine perspective, but it is still an open research problem and needs further investigation. For further details on geometric deep learning, readers are referred to [98,107].
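As a rough illustration of learning directly on geometric point data, the following is a simplified PointNet-like sketch (a loose reading of [104] with assumed layer sizes): a shared per-point MLP followed by a symmetric max-pooling produces a descriptor that is invariant to the ordering of the points in the cloud.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.point_mlp = nn.Sequential(            # applied to every point independently
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
        )
        self.head = nn.Linear(256, num_classes)

    def forward(self, points):                     # points: (N, P, 3) xyz coordinates
        per_point = self.point_mlp(points)         # (N, P, 256) per-point features
        global_feat = per_point.max(dim=1).values  # order-invariant aggregation
        return self.head(global_feat)

logits = TinyPointNet()(torch.randn(2, 1024, 3))   # shape (2, 10)
```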

Benchmark Datasets, Evaluation Metrics, and Comparison of Existing State-of-the-Art for Video Classification

Benchmark Datasets for Video Classification
There are several benchmark datasets utilized for the classification of videos, and some of the notable ones are summarized in Table 4. Details related to these datasets, such as the total number of videos contained in the dataset, the number of classes, the year of publication, and the background of the videos, are included in the summary.

Performance Evaluation Metrics for Video Classification
The evaluation of video classification models is performed using different performance measures. The most common measures utilized to evaluate the models are accuracy, precision, recall, F1 score, micro F1, and K-fold cross-validation [8]. Some of the recent studies using these measures are listed in Table 5.
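As an illustration, the sketch below computes these common metrics with scikit-learn; the labels and predictions are toy values, and the macro/micro averaging choices are assumptions.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, f1_score

y_true = [0, 2, 1, 1, 0, 2, 2, 1]   # ground-truth class labels
y_pred = [0, 2, 1, 0, 0, 2, 1, 1]   # model predictions

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
micro_f1 = f1_score(y_true, y_pred, average="micro")

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} F1={f1:.2f} micro-F1={micro_f1:.2f}")
```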

The UCF-101 dataset is widely used by researchers working on the video classification problem; therefore, it is easy to compare most of the existing literature based on this dataset. The existing works employing UCF-101 are compared in Table 6, where the methods are arranged in ascending order of performance. The results reported in Table 6 are taken from the existing studies in the literature.

Comparison of Different Deep Learning Architectures
In Table 7, some important deep learning architectures are compared in terms of performance and computational requirements. These architectures are the basis for the development of different deep learning models for video classification, and from this comparison, an estimate of the computational cost of each architecture can be drawn. Notable findings reported for these architectures include the following: [...] not always helpful to optical flow, especially for videos taken in the wild, e.g., Sports-1M. (vi) It is important to use a sophisticated sequence-processing architecture such as LSTM to take advantage of optical flow. (vii) LSTMs, when applied to both the optical flow and the image frames, yield the highest performance for the Sports-1M benchmark dataset. (viii) Augmenting optical flow with RGB input helps improve performance. (ix) The optical flow modality provides complementary information. (x) The high computational requirement of optical flow limits its use in real-time systems. (xi) Multi-stream multi-class fusion can perform better than average fusion, weighted fusion, kernel average fusion, MKL fusion, and logistic regression fusion on datasets such as UCF-101 and CCV. (xii) In 3D group convolutional networks, the volume of channel interactions plays a vital role in achieving high accuracy. (xiii) The factorization of 3D convolutions by separating spatiotemporal interactions and channel interactions can lead to an improvement in accuracy and a decrease in computational cost. (xiv) Furthermore, 3D channel-separated convolutions result in a kind of regularization and prevent overfitting. (xv) Popular frameworks of conventional semi-supervised algorithms (originally developed for 2D images) are unable to obtain good results for 3D video categorization. (xvi) For semi-supervised learning, a calibrated employment of object appearance cues markedly improves the accuracy of 3D-CNN models.

Conclusions
This article reviews deep learning approaches for the task of video classification. Some of the notable studies are summarized in detail, and the key findings in these studies are highlighted. The key findings are reported as an effort to help the research community in developing new deep learning models for video classification.
The latest developments in deep learning models have demonstrated the potential of these approaches for the video classification task. However, most of the existing deep learning architectures for video classification are adopted from the favored deep learning architectures in the image/speech domains. Therefore, most of the existing architectures remain insufficient to deal with the more complicated nature of video data, which contain rich information in the form of spatial, temporal, and acoustic clues. This calls attention to the need for a tailored network capable of effectively modeling the spatial, temporal, and acoustic information. Moreover, training CNN/RNN models requires labeled datasets, and acquiring those datasets is usually time-consuming and expensive; hence, a promising research direction is to utilize the considerable amount of unlabeled video data to derive better video representations.
Furthermore, deep learning approaches are outperforming other state-of-the-art approaches for video classification. The Google search trend for deep learning is still growing, and it remains above the trend for some other very well-known machine learning algorithms, as shown in Figure 9a. However, the recent developments in deep learning approaches are still under-evaluated and require further investigation for the video classification task. One such example is geometric deep learning, and the worldwide research interest in this specific topic is shown in Figure 9b, which indicates that this topic is still confined to some states of the U.S., Europe, and India. Therefore, it has yet to be developed and investigated further. The use of geometric deep learning to extract rich spatial information from videos could also be a new research direction, as future work, for better accuracy in the video classification task.

Figure 1. State-of-the-art image recognition CNN networks. The trend is that the depth and discriminatory power of network architectures increase from the formerly proposed architectures towards the recently proposed architectures.

Figure 2. Different modalities used for classification of videos.

Figure 3. An example of a 3D-CNN architecture to classify videos.

Figure 4. An example of a two-stream architecture with optical flow.

Figure 5. An overview of the video classification process.

Figure 6. Summary of video classification approaches.

Figure 8. Illustration of deep learning approaches on geometric data: (a) extrinsic method and (b) intrinsic method.

Figure 9. (a) Google trend for deep learning vs. some other state-of-the-art methods. (b) Worldwide research interest in geometric deep learning.

Table 2. Different categories of approaches for video classification.

Table 3. Summary and findings of studies based on deep learning models.

Table 5. Commonly used evaluation metrics for video classification.

Table 6. Comparison of video classification methods on UCF-101.