Deep-Learning-Based Sequence Causal Long-Term Recurrent Convolutional Network for Data Fusion Using Video Data

Abstract: The purpose of AI-based schemes in intelligent systems is to advance and optimize system performance. Most intelligent systems operate on sequential data that the systems themselves produce. Real-time video data, for example, are continuously updated as a sequence and used to make the predictions needed for efficient system performance. Deep-learning-based network architectures for sequence data fusion, such as long short-term memory (LSTM), data fusion, two-stream, and temporal convolutional network (TCN) models, are generally used to make such systems more robust and efficient. In this paper, we propose a deep-learning-based neural network architecture for non-fixed data that uses both a causal convolutional neural network (CNN) and a long-term recurrent convolutional network (LRCN). Causal CNNs and LRCNs both incorporate convolutional layers for feature extraction, so both architectures can process sequential data such as time series or video data in a variety of applications. Both architectures also extract features from the input sequence to reduce the dimensionality of the data, capture the important information, and learn hierarchical representations for effective sequence processing. We also adopt the concept of a series compact convolutional recurrent neural network (SCCRNN), a neural network architecture for sequential data that combines convolutional and recurrent layers compactly, reducing the number of parameters and the memory usage while maintaining high accuracy. The architecture is suitable for continuously incoming sequential video data, which allows it to combine the advantages of LSTM-based and CNN-based networks.
To verify this method, we evaluated it as a sequence learning model, with the network parameters and memory required in real environments, on the UCF-101 dataset, an action recognition dataset of realistic action videos collected from YouTube with 101 action categories. The results show that the proposed model, a sequence causal long-term recurrent convolutional network (SCLRCN), improves performance by approximately 12% or more compared with the existing models (LRCN and TCN).


Introduction
Deep-learning-based research using sequential input data is important for effectively extracting features from video streaming data. Most deep neural networks that depend on sequence data previously showed improvements only in areas such as voice and text. Sequence data are now also used in image and vision research, where deep learning techniques are applied to tasks such as action classification and object detection. In prior works, schemes using the recurrent neural network (RNN) family [1][2][3][4] were generally used to solve vision problems on sequential data. Nowadays, a more diverse set of schemes is used for solving such problems.
CNNs have demonstrated remarkable performance on a variety of computer vision tasks in state-of-the-art architectures [5][6][7]. They are designed to learn hierarchical representations of the input data by stacking multiple layers of convolutional filters (CFs) that detect increasingly complex patterns in the data, such as edges, textures, and shapes. CNNs have been applied successfully and efficiently to a variety of deep learning tasks, such as image classification (ImageNet) [8], object detection (Faster R-CNN) [9], semantic segmentation (U-Net) [10], and image generation (generative adversarial network, GAN) [11]. The ability to learn hierarchical representations of the input data and to capture complex patterns makes such architectures a powerful tool for a variety of visual pattern recognition tasks.
Among the existing models for sequence prediction, learning methods using a CNN [12][13][14][15] have been successfully combined with the RNN family [16]. Another related work [17] in progress aims to increase the accuracy of two-stream networks using single and multiple optical flows. In this paper, several action classification models are presented and compared for efficient prediction in terms of the data used, memory, parameters, and time complexity. These models rely on multiple video data fusion methods that share the same backbone network but use diverse neural network architectures on action classification data. The SCCRNN is suited to continuously incoming data, and it brings together the advantages of LSTM-based and CNN-based networks, achieving a remarkable model compression rate at a limited cost in model performance.
CNN-based sequence processing methods such as causal CNNs are already used in natural language processing (NLP) fields such as text to speech (TTS). In this paper, however, we compare several action classification methods, ranging from a single-frame CNN applied to the sequential 2D visual field to the SCLRCN, which combines a causal CNN with an LRCN on sequential input data. The SCLRCN inherits from the LRCN the capacity to accommodate a large amount of data, and from the causal CNN the ability to effectively extract time-dependent features and reuse network values from the previous learning process. Therefore, the architecture can produce effective experimental results for the large amount of continuous data generated in a neural network environment. In addition, the SCLRCN is potentially a good starting point for many AI development areas, since the models it produces can be applied to diverse neural network architectures that use sequential input data, and it is not confined to a predictor made for a specific output.

Related Works
Many CNN-based related works on sequential input data have been successfully conducted in combination with the RNN family. In particular, an action classification model can be configured in different ways depending on the constraints, such as memory, parameters, and time complexity, under which the neural network must operate in the actual learning process, and on the video data fusion method applied to the backbone network.

Dilated Convolutional Neural Network (Dilated CNN)
A dilated CNN is a CNN-based method that increases the size of the receptive field (RF) of the basic convolutional network so that a convolution filter of the same size covers a larger area [18]. It takes a dilation rate (DR) as a parameter, and the RF size is determined by the DR and the CF size. As shown in Figure 1, a standard convolutional neural network needs 121 parameters to obtain an RF of 11 × 11, whereas a dilated convolution can cover the same RF with a smaller filter. As such, the dilated CNN has advantages in terms of both computational cost and wide RFs [19]. High accuracy was obtained in [16] by applying this advantage to TTS, which is a sequence data processing problem. Figure 2 shows a CNN-based learning process with its representative components: input, mask filter, and output. The input is a multi-dimensional array of values that represents the data being processed. A mask filter in the CNN is a matrix of the same shape as the output of a convolutional layer and is used to selectively filter that output. Its values typically depend on the properties of the specific task being performed on the input data. The output of a CNN can be further processed by additional layers to extract higher-level features and classify the input data.
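The parameter saving can be illustrated with the standard relation for a single dilated convolution, RF = DR × (CF − 1) + 1. The short sketch below is an illustration only: the function name and the example values (a 3 × 3 filter with a dilation rate of 5) are assumptions chosen to reproduce an 11 × 11 RF, and they are not necessarily the configuration drawn in Figure 1.

```python
def dilated_receptive_field(filter_size: int, dilation_rate: int) -> int:
    """Side length of the RF of one dilated convolution layer."""
    return dilation_rate * (filter_size - 1) + 1


# A standard 11 x 11 filter (dilation rate 1) needs 11 * 11 weights.
standard_rf = dilated_receptive_field(filter_size=11, dilation_rate=1)
standard_params = 11 * 11

# A 3 x 3 filter with dilation rate 5 (assumed values) spans the same
# area with only 3 * 3 weights.
dilated_rf = dilated_receptive_field(filter_size=3, dilation_rate=5)
dilated_params = 3 * 3

print(standard_rf, standard_params)  # 11 121
print(dilated_rf, dilated_params)    # 11 9
```

Under these assumed values, the dilated filter reaches the same RF with 9 weights instead of 121 (bias terms are ignored).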

Causal Convolutional Neural Network (Causal CNN)
Figure 3 shows the structural difference between a standard CNN and a causal CNN. The standard CNN is a deep learning scheme that determines the relationship between adjacent data features without regard to sequence order. The causal CNN is also a deep learning approach, but it causally relates each element of the sequence only to the data that precede it. Such a causal network is generally used on sequential data, and it can effectively extract feature relationships from multiple datasets. In the case of, for example, an RNN or an LSTM, the next operation generally cannot be performed in training mode until the prior operation is completed. The causal CNN, however, has the advantage that it can be trained in parallel, as fast as the network can operate [16,20,21,22].
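As a minimal sketch of the causal property, the NumPy example below left-pads a 1D sequence so that each output depends only on the current and earlier samples. The function name, the kernel values, and the toy input are assumptions made for illustration; they are not taken from the paper or from any library API.

```python
import numpy as np


def causal_conv1d(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Causal 1D convolution: y[t] depends only on x[t], x[t-1], ..."""
    k = len(kernel)
    # Left-pad with k - 1 zeros so that no future sample is ever read.
    padded = np.concatenate([np.zeros(k - 1), x])
    # kernel[0] weights the current sample, kernel[1] the previous one, etc.
    return np.array([
        sum(kernel[j] * padded[t + k - 1 - j] for j in range(k))
        for t in range(len(x))
    ])


x = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([0.5, 0.5])  # average of current and previous sample
y = causal_conv1d(x, kernel)

# Changing a future sample must leave the earlier outputs untouched.
x_future = x.copy()
x_future[3] = 100.0
y_future = causal_conv1d(x_future, kernel)

print(y)                                 # [0.5 1.5 2.5 3.5]
print(np.allclose(y[:3], y_future[:3]))  # True
```

Each output position is computed independently of the other outputs, which is why all time steps can be evaluated in parallel during training, unlike the step-by-step recurrence of an RNN.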

Temporal Convolution Network (TCN)
A TCN is also a CNN-based neural network structure designed to effectively extract features from sequence data. It combines a causal CNN with a dilated CNN, as shown in Figure 4. A TCN therefore has two advantages: (a) effective feature extraction from the causal CNN, and (b) a wide RF from the dilated CNN. As a result, the network achieves high accuracy in TTS models such as WaveNet [16], and in other sequence models such as [20,23], where the features need to be extracted from a specific RF for good performance.
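The way a TCN widens its RF can be sketched with the standard formula for stacked dilated causal convolutions, RF = 1 + (k − 1) × Σ dᵢ, assuming one convolution of kernel size k per layer. The function name and the dilation schedule below are illustrative assumptions, not the configuration used in the experiments.

```python
def tcn_receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """RF (in time steps) of stacked dilated causal convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Doubling the dilation per layer makes the RF grow exponentially with depth.
dilations = [1, 2, 4, 8, 16]
print(tcn_receptive_field(kernel_size=2, dilations=dilations))  # 32

# Without dilation, the same five layers only see 6 time steps.
print(tcn_receptive_field(kernel_size=2, dilations=[1] * 5))    # 6
```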

Other Video Fusion Networks
Diverse feature fusion neural networks for video-based action recognition are being studied to improve network accuracy.

Late Fusion
In late fusion, multiple sources, such as different modalities or sequence features, are combined at a later stage of the processing pipeline. Each source is processed independently, and the resulting representations are merged by a fusion method such as averaging or concatenation. By combining the features from multiple sources, the resulting representation can be more robust and capture more complete features. Figure 5 shows a late fusion network in which the current data and the prior 15 frames are each passed through a CNN and then merged in a fully convolutional network (FCL) for semantic segmentation [2,24]. The features of object motion, which cannot be detected from a single frame, are computed in the FCL by comparing two frames separated in time.
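The two fusion methods named above, concatenation and averaging, can be sketched in a few lines of NumPy. The per-frame "CNN" here is a hypothetical stand-in (a flatten, a random projection, and a ReLU), and the frame size and feature width are arbitrary assumptions; only the fusion step at the end is the point of the example.

```python
import numpy as np

rng = np.random.default_rng(0)


def toy_cnn_features(frame: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Stand-in for a per-frame CNN: flatten, project, apply ReLU."""
    return np.maximum(frame.reshape(-1) @ weights, 0.0)


weights = rng.standard_normal((8 * 8 * 3, 16))  # shared by both streams
current_frame = rng.random((8, 8, 3))
earlier_frame = rng.random((8, 8, 3))           # a frame from earlier in the clip

# Each frame is processed independently ...
f_current = toy_cnn_features(current_frame, weights)
f_earlier = toy_cnn_features(earlier_frame, weights)

# ... and the two representations are merged only at the end.
fused_concat = np.concatenate([f_current, f_earlier])  # keeps both: shape (32,)
fused_mean = (f_current + f_earlier) / 2.0             # averages: shape (16,)

print(fused_concat.shape, fused_mean.shape)
```

Concatenation preserves which stream each feature came from, so a following layer can compare the two frames, whereas averaging keeps the feature size fixed but discards that distinction.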

Early Fusion
Early fusion contrasts with late fusion, where the sources are combined at a later stage of processing. In early fusion, the different sources of information are combined at an early stage into a single multi-modal feature representation. In Figure 6, the video data are fused at the input, across the number of input frames, as in [2]. A CNN is generally applied after the causal CNN in the first layer. Because the pixel data are connected from the start, local relationships between pixels can be detected early.
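One common way to realize early fusion is to stack the frames of a clip along the channel axis before the first convolution. The sketch below assumes this channel-stacking variant and arbitrary toy dimensions; it is not necessarily the exact arrangement shown in Figure 6.

```python
import numpy as np

rng = np.random.default_rng(0)
num_frames, height, width, channels = 4, 8, 8, 3
clip = rng.random((num_frames, height, width, channels))

# Early fusion: move the time axis next to the channel axis and merge them,
# so the first convolution sees every frame at each pixel location.
early_input = clip.transpose(1, 2, 0, 3).reshape(
    height, width, num_frames * channels
)

print(early_input.shape)  # (8, 8, 12)

# Channel index t * channels + c holds channel c of frame t.
t, c = 2, 1
print(np.array_equal(early_input[:, :, t * channels + c], clip[t, :, :, c]))  # True
```

The channel count of the first layer grows linearly with the number of fused frames, which is the cost of this approach for long sequences.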

Slow Fusion
Slow fusion is similar to late fusion, except that the fusion is repeated over multiple stages, with each level incorporating an additional representation of the data. It implies the use of an RNN in multi-modal processing, where the features are combined at each time step of the network. As shown in Figure 7, the network structure of slow fusion combines early fusion with a TCN. By fusing data through multiple layers, it gains access to additional global information.
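The stage-by-stage merging can be sketched as follows, assuming a simple scheme in which each stage merges neighbouring time steps pairwise. The helper name, the random projection weights, and the sizes are hypothetical and stand in for trained convolutional layers.

```python
import numpy as np

rng = np.random.default_rng(0)


def fuse_adjacent(feats: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Merge neighbouring time steps pairwise: (T, D) -> (T // 2, D)."""
    # Row i of `pairs` is the concatenation of feats[2 * i] and feats[2 * i + 1].
    pairs = feats.reshape(feats.shape[0] // 2, 2 * feats.shape[1])
    return np.maximum(pairs @ weights, 0.0)


num_steps, dim = 8, 16  # num_steps is assumed to be a power of two
feats = rng.standard_normal((num_steps, dim))

stage = 0
while feats.shape[0] > 1:
    weights = rng.standard_normal((2 * dim, dim)) * 0.1
    feats = fuse_adjacent(feats, weights)
    stage += 1
    print(stage, feats.shape)  # temporal length halves at every stage
```

After three stages the single remaining feature vector has seen all eight time steps, while the intermediate stages only ever combined local neighbourhoods.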

Recurrent Neural Network (RNN)
An RNN is designed to handle sequential data through recurrence, in contrast to a feedforward architecture where information flows only in one direction from input to output. It has a feedback loop that passes information from one step of the network to the next, which acts as a kind of memory. An RNN accepts sequence input data of unspecified length. It is illustrated in Figure 8 and expressed in Equation (2).
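The feedback loop can be illustrated with the widely used Elman-style recurrence, h_t = tanh(x_t W_xh + h_{t-1} W_hh + b_h). This form is an assumption for illustration and is not necessarily identical to Equation (2); the weight names and sizes are likewise hypothetical.

```python
import numpy as np


def rnn_forward(inputs, w_xh, w_hh, b_h):
    """Run a simple recurrent layer over a sequence of any length."""
    h = np.zeros(w_hh.shape[0])
    states = []
    for x_t in inputs:
        # The previous hidden state h feeds back into the current step.
        h = np.tanh(x_t @ w_xh + h @ w_hh + b_h)
        states.append(h)
    return np.stack(states)


rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
w_xh = rng.standard_normal((input_dim, hidden_dim)) * 0.5
w_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.5
b_h = np.zeros(hidden_dim)

# The same weights handle sequences of different lengths.
short_states = rnn_forward(rng.standard_normal((5, input_dim)), w_xh, w_hh, b_h)
long_states = rnn_forward(rng.standard_normal((9, input_dim)), w_xh, w_hh, b_h)
print(short_states.shape, long_states.shape)  # (5, 4) (9, 4)
```

Because each step needs the hidden state of the previous one, the loop cannot be parallelized over time, which is the limitation that causal CNNs avoid.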

Long Short-Term Memory (LSTM)
In an RNN, the vanishing gradient problem occurs, in which the influence of earlier hidden states disappears over time. To remedy this defect, the LSTM [2], which adds a cell state to the hidden state, was devised. An LSTM has three gates: forget, input, and output, as shown in Figure 9. The forget gate determines how much of the previous cell state is kept, and the input gate updates the cell state to remember the current input. The output gate then produces the output from the cell state and the input data.
An LRCN is a neural network architecture that combines the advantages of the CNN and the RNN to process sequential data with spatial and temporal dependencies. Its first few layers are CNN-based and extract useful spatial features from the input data; those features are then fed into RNN-based layers that model the temporal dependencies in the data and capture long-term relationships. By combining CNN and LSTM components, an LRCN learns complex spatial and temporal data representations, which makes the architecture powerful for processing sequential data. The LRCN [2] was developed to investigate whether RNN-family networks work effectively on sequence input data such as video and streaming data. Each frame of the input is sequentially passed through the CNN to extract features, which are then passed to the LSTM, as seen in Figure 10.
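The gate updates of this section can be sketched with the standard LSTM equations. The example below is a minimal NumPy version under stated assumptions: the weight layout (one matrix holding all four gate blocks), the sizes, and the random per-frame feature vectors that stand in for CNN outputs are all hypothetical.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, w, b):
    """One LSTM step.

    w has shape (input_dim + hidden_dim, 4 * hidden_dim) and b has shape
    (4 * hidden_dim,), ordered as forget, input, output, candidate.
    """
    z = np.concatenate([x_t, h_prev]) @ w + b
    f, i, o, g = np.split(z, 4)
    # Forget gate scales the old cell state; input gate admits new content.
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    # Output gate decides how much of the cell state is exposed.
    h_t = sigmoid(o) * np.tanh(c_t)
    return h_t, c_t


rng = np.random.default_rng(0)
input_dim, hidden_dim, num_frames = 8, 4, 6
w = rng.standard_normal((input_dim + hidden_dim, 4 * hidden_dim)) * 0.1
b = np.zeros(4 * hidden_dim)

# LRCN-style usage: one feature vector per frame, fed to the LSTM in order.
frame_features = rng.standard_normal((num_frames, input_dim))
h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for x_t in frame_features:
    h, c = lstm_step(x_t, h, c, w, b)

print(h.shape, c.shape)  # (4,) (4,)
```

The additive update of the cell state is what lets information survive over many steps when the forget gate stays close to one.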

Materials and Methods
The proposed method is based on the SCLRCN, which combines the strengths of a causal CNN, which effectively extracts the causal relationship among adjacent data, and an LSTM, which effectively retains older data. The method extracts features across time and space with the causal CNN and passes them to the LSTM to collect the features needed over the long term, which keeps the RF small and thereby favors low memory consumption and parallel operation.

Sequence Causal Long-Term Recurrent Convolutional Network (SCLRCN)
An SCLRCN is designed to combine the advantages of a causal CNN, which effectively extracts the causal relationship among adjacent data, and an LSTM, which effectively retains older data. In general, RNN family models can produce effective results that draw on the whole network, independently of the input sequence. However, the hidden state can lose information, since the prior data are affected by the current data. As a result, data may be lost even when adjacent data are closely correlated. CNN models for sequence data, by contrast, are generally able to keep important features without a specific learning process. However, they can only capture correlations within the RF accepted by the CNN model.
An SCLRCN extracts short-term features across time and space using a causal CNN, as shown in Figure 11, and adopts an LSTM to collect the features needed over the long term. For CNN-based sequence models, memory consumption, computational time, and the degree of feature extraction must be considered when configuring the causal CNN. Using a CNN on a single frame has the same effect as an LRCN, although it cannot use temporal features in the CNN. As shown in Figure 12, late fusion leaves some features that cannot be effectively extracted because of the gap between the two layers, and it also cannot perform temporal feature extraction in the CNN layers. In addition, late fusion must either store data that are not currently in use or re-calculate parts that have already been calculated, as seen in Figure 12. Early fusion can rapidly collect temporal and spatial features at the beginning of the learning process. However, it requires many CNN channels for a large amount of sequence data. A TCN can acquire a wide RF depending on the number of layers, and it can have low memory usage and support parallel operation since the same CNN mask is used within a layer. Figure 13 shows the memory space and parallel operation required when a TCN is combined with an LSTM. As the layers deepen, a TCN requires more memory to keep the results of prior operations for the next operation. On the other hand, because parallel operation is possible, the size of each layer must be selected appropriately.
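The data flow described for the SCLRCN (a causal convolution over neighbouring frames followed by an LSTM) can be sketched as below. This is a hypothetical illustration and not the authors' implementation: the per-frame features are random stand-ins for CNN outputs, the weights are untrained, and the kernel width, layer sizes, and function names are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def causal_temporal_conv(feats, kernels):
    """feats: (T, D); kernels: (K, D, F). Output (T, F) uses frames <= t only."""
    k = kernels.shape[0]
    padded = np.concatenate([np.zeros((k - 1, feats.shape[1])), feats])
    out = np.stack([
        sum(padded[t + k - 1 - j] @ kernels[j] for j in range(k))
        for t in range(feats.shape[0])
    ])
    return np.maximum(out, 0.0)  # ReLU


def lstm_last_state(seq, w, b, hidden_dim):
    """Run an LSTM over seq and return the final hidden state."""
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    for x_t in seq:
        f, i, o, g = np.split(np.concatenate([x_t, h]) @ w + b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h


# Assumed toy sizes: 32 frames, 64 features per frame, 101 classes.
num_frames, feat_dim, conv_dim, hidden_dim, num_classes = 32, 64, 32, 16, 101
frame_feats = rng.standard_normal((num_frames, feat_dim))
kernels = rng.standard_normal((3, feat_dim, conv_dim)) * 0.1
w = rng.standard_normal((conv_dim + hidden_dim, 4 * hidden_dim)) * 0.1
b = np.zeros(4 * hidden_dim)
w_out = rng.standard_normal((hidden_dim, num_classes)) * 0.1

local = causal_temporal_conv(frame_feats, kernels)   # short-term features
summary = lstm_last_state(local, w, b, hidden_dim)   # long-term aggregation
logits = summary @ w_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                 # softmax over classes

print(local.shape, summary.shape, probs.shape)  # (32, 32) (16,) (101,)
```

In this arrangement the convolution only has to cover a short window of adjacent frames, and the recurrent state carries whatever must be remembered beyond that window.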

Optimization of Learning Performance by SCLRCN
Slow fusion shares features with both early fusion and the TCN: (1) slow fusion using a causal CNN requires little memory and supports parallel operation, although its RF is smaller than that of the TCN; (2) there is no need to save data separately when using both a causal CNN and an LRCN, as seen in Figure 14. As shown in Table 1, sequence-based CNN models require a certain amount of computation and memory, while an LRCN stores the previous data in memory. Slow fusion, TCN, and causal CNN [16,25] achieve high accuracy compared with other sequence-based models. These methods also have the advantage of being able to run in parallel. As network layers become deeper and longer, however, slow fusion and the TCN need to calculate a large amount of data for the next level.
The RF in the causal CNN, TCN, and slow fusion is kept small because the sequence data area covered by the RF is sufficiently wide. In addition, memory consumption is likely to increase, owing to the nature of CNNs, when a TCN or slow fusion is used together with an LRCN. From the results, we can observe that, ordered by accuracy, the network models rank as single frame, early fusion, late fusion, and slow fusion. This shows that the causal CNN, TCN, and slow fusion, which apply a CNN over the sequence, are more effective than the other types of network models. Therefore, it can be confirmed that a causal CNN pairs well with an LRCN in terms of accuracy, memory usage, and amount of computation. As shown in Figure 14, image-based neural networks [26,27] stack non-parallel CNN layers vertically, and accuracy increases with the total number of such layers. As the number of CNN layers grows in, for example, AlexNet [13], VGGNet [15], GoogLeNet [14], and ResNet [28], the RFs can be increased in the same way as when viewing sequence data with a causal CNN, TCN, or slow fusion. The number of CNN layers for object detection in the YOLO family, such as Yolo-v1 [29], Yolo-9000 [30], Yolo-v3 [31], and Yolo-v4 [32], can also be increased.
Table 1. Simple calculation formula for the amount of computation and memory consumption required by CNN-based sequence models.

Column headers: CNN-Based Computational Usage | Prior Data Computational Usage | Memory without Current Usage. First row label: Single.

Experimental Results
In this section, we analyze the proposed method by comparing deep learning models, namely LRCN, TCN, and SCLRCN, on the benchmark dataset UCF-101 [33,34]. The neural network structure of each model is based on convolutional layers via which the target load is connected. The performance of the models was evaluated through the averaged test accuracy.

Experimental Environment
The performance improvement of the SCLRCN is tested against two other DL-based networks, LRCN and TCN, all of which use lightweight networks based on VGGNet [15], as shown in Table 1. As shown in Figure 15, the RF of the TCN is set to 32 frames, and that of the SCLRCN is set to 13 frames. The output of the TCN and SCLRCN can be modified according to the number of input data. Since all outputs of a network share the same label, the output was made the same size as the prediction through the LSTM layer in Figure 14.
All of the datasets were based on UCF-101 [33]. For the training dataset, 7272 images were re-shaped into 32 frames, matching the RF of the TCN, so that all networks observed inputs of the same length. Table 2 describes the experimental neural network structures (network summary) of the LRCN, TCN, and SCLRCN. They all use Conv3, which denotes the output of the layer block3-pool (MaxPooling2D). However, the structure that follows this layer output is configured slightly differently in each network.
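The text does not state how clips were brought to 32 frames. One plausible approach, shown here purely as an assumption, is to sample frame indices uniformly across the clip, which shortens long clips and repeats frames in short ones. The function name and the toy clip sizes are hypothetical.

```python
import numpy as np


def resample_clip(clip: np.ndarray, target_len: int = 32) -> np.ndarray:
    """Pick target_len frames spread evenly over the clip (assumed scheme)."""
    idx = np.linspace(0, len(clip) - 1, num=target_len).round().astype(int)
    return clip[idx]


rng = np.random.default_rng(0)
long_clip = rng.random((100, 8, 8, 3))   # more frames than needed
short_clip = rng.random((10, 8, 8, 3))   # fewer frames than needed

long_fixed = resample_clip(long_clip)
short_fixed = resample_clip(short_clip)

print(long_fixed.shape, short_fixed.shape)  # (32, 8, 8, 3) (32, 8, 8, 3)
```

Uniform sampling keeps the first and last frames of every clip, so the temporal extent of the action is preserved even though the frame rate changes.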
In this experiment, we evaluated and compared the prediction accuracy according to the type of neural network. The prediction uses the information of all frames, not the prediction of the entire process. Table 3 shows the test accuracy of each network, trained with early stopping in the same learning environment. The SCLRCN achieves approximately 12% higher accuracy than the LRCN and 36% higher accuracy than the TCN. The experimental results show that while DL-based networks such as the TCN and causal CNN are effective for sequence feature extraction, image networks that process sequence features with an LSTM are slightly more effective. In addition, the frame-based method using the LSTM achieves higher accuracy than the TCN even though the RF of the CNN in the SCLRCN is smaller than that of the TCN. The result also shows that a sequence-based CNN applied to video images can predict the movement of pixels between adjacent frames. Therefore, it can be confirmed that an RNN-based network such as an LSTM, operating on features extracted by a CNN with an appropriate RF, is more effective.

Discussion and Conclusions
In this study, we proposed an SCLRCN that combines the advantages of a causal CNN, which effectively extracts the causal relationship among adjacent data, and an LSTM, which effectively retains older data. The method extracts short-term features across time and space using a causal CNN and passes them to an LSTM to collect the features needed over the long term. In particular, slow fusion using a causal CNN has a small RF, which favors low memory consumption and parallel operation. It also does not need to save data separately for the causal CNN and the LRCN. In addition, the RF in a causal CNN, TCN, and slow fusion is advantageous in that future operations do not consume large amounts of computational resources and the network size for sequential data stays small. In this paper, we confirmed that efficient memory usage and higher accuracy can be obtained for sequence visual data such as video images through an SCLRCN, which merges a causal CNN and an LRCN. For that reason, it is important to analyze both the relationship between pixels in a general area and the difference between the video pixels. The research on the SCLRCN is also useful for hyper-parameter improvement and for methods that improve the performance of current CNNs. It can additionally be used in action classification and other related fields, where similar experiments can be conducted, since it can be applied to any network that uses sequence visual data. An SCLRCN generally achieves high accuracy in architectures that process the sequence features of image networks with LSTM networks; however, in structures where the network itself extracts sequence features, current neural networks such as the TCN and the causal CNN still demonstrate slightly better learning accuracy.
In future work, we plan to use more complicated CNN-based architectures to further analyze the effect of hyper-parameter improvement and of methods for improving the performance of both CNN-based and LSTM-based neural network architectures that extract sequence features for action classification.