Article

End to End Alignment Learning of Instructional Videos with Spatiotemporal Hybrid Encoding and Decoding Space Reduction

School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2021, 11(11), 4954; https://doi.org/10.3390/app11114954
Submission received: 29 March 2021 / Revised: 10 May 2021 / Accepted: 18 May 2021 / Published: 27 May 2021
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

We solve the problem of densely aligning actions in videos at the frame level when only the order of occurring actions is available, in order to avoid the time-consuming effort of accurately annotating the temporal boundaries of each action. We propose three task-specific innovations under this scenario: (1) To encode fine-grained spatiotemporal local features and long-range temporal patterns simultaneously, we test three popular backbones and compare their accuracy and training times: (i) a recurrent LSTM; (ii) a fully convolutional model; and (iii) the recently proposed Transformer model. (2) To address the absence of ground-truth frame-by-frame labels during training, we apply connectionist temporal classification (CTC) on top of the temporal encoder to recursively collect all theoretically valid alignments, and further weight these alignments with frame-wise visual similarities, in order to avoid a significant number of degenerated paths and improve both recognition accuracy and computational efficiency. (3) To quantitatively assess the quality of the learned alignment, we apply a comprehensive set of frame-level, segment-level, and video-level evaluation measurements. Extensive evaluations verify the effectiveness of our proposal, with performance comparable to that of fully supervised approaches across four benchmarks of different difficulty and data scale.

1. Introduction

Fine-grained temporal action segmentation/alignment [1,2,3] is important in many applications, such as daily activity understanding, human motion analysis, and surgical robotics, to name a few. Given a video of length $T$, $x = (x_1, \ldots, x_T)$, the goal of temporal segmentation/alignment is to localize each occurrence of a given action $a_n \in \mathcal{A}$ in the time domain. A frame-to-action segmentation/alignment can be mathematically defined densely, as a sequence of occurring action labels at every frame of the video, $\pi = (\pi_1, \ldots, \pi_T)$, $\pi_t \in \mathcal{A}$; or sparsely, as a set of temporally segmented clips $\ell = (\ell_1, \ldots, \ell_N)$, $\ell_n \in \mathcal{A}$, with each segment associated with a start time, finish time, and label. $N = |\ell|$ is the length of the transcript sequence; note that usually $T \gg N$, since the sampling frequency of the machine at the encoding end is orders of magnitude higher than the granularity of the manual labeling at the decoding end.
The difference between the tasks of segmentation/alignment is that during training, only the ordered sequence of video-level actions (defined as transcript [4,5,6]) is available for alignment, while the classes of each frame are given for segmentation. Figure 1 shows the training/testing settings in the task of temporal action alignment learning, as compared to temporal action segmentation, where accurate dense labels of each frame are also available during training.
We also restrict our focus to modeling instructional videos with a relatively stable background, usually composed of dozens of actions lasting minutes and recorded in a kitchen or surveillance setup. Under this setting, research attention is freed from variance in extrinsic shooting conditions and can concentrate on the major challenges of the task.

1.1. The Motivation of Long-Term Temporal Encoding

The first challenge lies in the relatively large temporal search space, resulting from long-range temporal dependencies and flexible temporal length, compared to action recognition (also known as action classification). In general, action alignment is more challenging than action recognition, for the following reasons:
  • In action recognition, temporal localization is not taken into account: input samples are truncated beforehand to contain exactly the temporal span of a certain target action, leading to relatively short inputs (e.g., 2∼20 s in the UCF101 dataset). In frame-wise action alignment, inputs generally last minutes or hours. Likewise, in action recognition there is no background class to consider, whereas in alignment input samples may contain frames belonging to none of the target actions.
  • In action alignment, dependencies can also last temporally across seconds or even minutes. The types of temporal dependencies include individual action durations, pairwise compositions between consecutive actions, and long-term compositions lasting across multiple sub-actions. As an example of a cooking instance, when cutting a potato, it is difficult to recognize what is being cut because it tends to be occluded by hands holding it. The recognition of frames where the potato is being cut shares dependencies with previous frames where the potato is being taken out before being cut.
  • In action recognition, only one label needs to be assigned to the whole video, whereas the action alignment task needs to densely assign a label to each frame. Consider a video instance consisting of 20 frames sharing the same class: in action recognition, even if a convolutional network correctly predicts only 10 of the frames, it is still very likely to predict the whole video correctly. A powerful temporal encoder, illustrated in Figure 2, on top of the convolutional network would not bring any improvement in this case, because the per-video label does not change whether or not the correctly predicted frames are neighboring. In action alignment, 10 accurate but non-neighboring predictions would lead to over-segmentation errors with 20 segments. Bidirectional temporal encoders are motivated to predict neighboring segments, because they encode late samples together with early ones for the final judgement, leading to fewer false positives compared to a purely convolutional network (see Figure 3).

1.2. The Motivation of Hybrid Spatiotemporal Encoding

The second challenge lies in the simultaneous encoding of low-level fine-grained spatiotemporal features together with high-level long-range temporal patterns. Appearance information, which can be regarded as the visual features presented by static composing frames without taking temporal order into account, is also vital for indicating the start and finish boundaries of actions (e.g., 'cutting a tomato' is often only subtly different from 'peeling a cucumber' spatially, such as in the food types 'tomatoes' or 'cucumbers' and the food states 'sliced', 'diced', 'peeled', and so on).
We experiment with three deep neural network models for encoding. In each case the encoder consists of two modules (or sub-networks): a spatiotemporal visual module (Table 2) encoding each frame into one feature vector, and a temporal module (Table 3) encoding the sequence of per-frame feature vectors into a sequence of sub-action labels. The visual module is common across the three models; they differ only in the temporal module (Figure 2).

1.3. The Motivation of Decoding Space Reduction

The third challenge lies in the fact that manual annotation of per-frame action labels for accurate training is too expensive to be feasible in practice. Automatic extraction from instructional transcriptions [4,5,6] can provide label sequences of occurring actions (hereafter referred to as 'transcripts'), without the accurate start and end frame of each action, which further imposes new challenges in two ways:
  • The first challenge is that densely aligning thousands of frames to a few sub-actions results in very large search spaces of possible alignments.
  • The second challenge is that there exist degenerated alignments that are visually inconsistent.
We introduce CTC [7] (Figure 2) to evaluate all alignments efficiently in a one-pass recursive traversal, and further incorporate frame-level visual similarities to down-weight trivial paths that deteriorate performance. Figure 4 shows the difference in training strategies under different granularities of supervision.

2. Related Work

We mainly focus on related deep end-to-end approaches, which can be grouped into two research directions according to different annotating granularity—that is, at frame-level (segmentation) and video-level (alignment) supervision:

2.1. Action Segmentation

Before deep neural solutions, the most prevalent approaches were statistical models [8,9,10] based on a conditional independence assumption between segments, which ignore long-range dependencies and have been generally outperformed by the state of the art (see Table 1 for a quantitative performance comparison). MSB-RNN [11] uses a two-stream network and bi-directional LSTMs to learn representations and capture dependencies between video chunks, respectively. ED-TCN [12] uses temporal convolutions and pooling layers within an encoder–decoder architecture to learn long-range dependencies between frames. TDRN (temporal deformable residual network) [13] proposes two parallel temporal streams, facilitating temporal segmentation at a local fine scale and at multiple long-range scales, respectively, to improve the accuracy of frame classification. TCED [14] introduces a learnable bilinear pooling in the intermediate layers of a temporal convolutional encoder–decoder net, in order to capture more complex local statistics than conventional pooling. ASRF [15] alleviates over-segmentation errors by detecting and refining action boundaries with a dedicated boundary regression module and a wider temporal receptive field.
One problem in this context arises from the fact that annotating action boundaries for training is very time- and cost-intensive, which has led to recent efforts to train classifiers without the exact start and end frames of the related action classes (which is our case); the goal of this work is to infer frame boundaries given only an ordered list of the occurring actions.

2.2. Action Alignment

Compared with full supervision, there are only a few approaches that rely solely on video-level class labels to localize actions in the temporal domain. GRU + HMM [1] proposes a combination of a recurrent neural network and a probabilistic model to perform inference over long sequences for temporal alignment. TCFPN + ISBA [2] proposes a pyramid temporal convolutional network, iteratively updated by a training strategy named ISBA (iterative soft boundary assignment), to align action sequences with frame-wise action labels in a more efficient and scalable fashion. CDFL [3] proposes a constrained discriminative forward loss built on the GRU + HMM framework. The duration network [16] treats the remaining duration of a given action as a predictable distribution conditioned on the type of action, and obtains the best alignment by maximizing its posterior probability.
In general, these methods are still more or less hand-engineered, involving statistical components designed on prior knowledge, and rely on sophisticated techniques to improve performance. To the best of our knowledge, the inherent integration of visual similarity into the CTC output layer with encoder–decoder spatiotemporal transducers sets our method apart from previous works, as the first purely end-to-end automatic approach without human intervention. We select ASRF [15] and the duration network [16], respectively, as representatives of the state-of-the-art baselines of the two above-mentioned branches in Section 5.

2.3. Automatic Feature Extraction

Automatic feature extraction is fundamental in all video recognition tasks. Early methods normally devise hand-crafted features [27,28]. The development of deep learning enables end-to-end automatic feature extraction, including two-stream network [29], 3D ConvNet (C3D) [30], and inflated 3D (I3D) [31,32]. We adopt I3D for feature extraction.

2.4. Automatic Label Extraction

Automatic label extraction is also related, as our action order comes from transcripts of video instructions [24,33]. Unlike these approaches focusing on the text processing part of the task, we assume that the discrete target label sequences are available in training stages.

3. The Proposed Encoder–Decoder Network

3.1. Spatial Encoder

Generally, common 2D/3D backbones such as VGG [34], Residual [35], Inception [36], and I3D [31] variants can be applied orthogonally; in our empirical practice, deep 3D backbones such as Res3D (Table 2 and Figure 5) generally yield better performance and efficiency. The convolutional feature maps of the $T$ frame representations $x_{1:T} = (x_1, x_2, \ldots, x_T)$, obtained from the last pooling layer (i.e., $pool_{49}$) with size $L \times \frac{W}{32} \times \frac{H}{32}$, are fed into the temporal encoding architectures.
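For concreteness, the following is a minimal, illustrative sketch (in TensorFlow/Keras, which we use for implementation) of how a 3D-convolutional spatial encoder maps a clip of T frames to a sequence of per-frame feature vectors; the layer sizes, input resolution, and feature dimension below are placeholders rather than the exact Res3D/I3D configuration of Table 2.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative stand-in for the spatial encoder: a shallow 3D-conv stack mapping a
# clip of T frames (T x H x W x 3) to T per-frame feature vectors. The real backbone
# is a deep Res3D/I3D network (Table 2); all sizes here are placeholders.
T, H, W, C = 16, 112, 112, 3      # assumed clip length and resolution
FEATURE_DIM = 1024                # assumed per-frame feature size

spatial_encoder = tf.keras.Sequential([
    layers.Input(shape=(T, H, W, C)),
    layers.Conv3D(64, (3, 7, 7), strides=(1, 2, 2), padding="same", activation="relu"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),
    layers.Conv3D(256, (3, 3, 3), strides=(1, 2, 2), padding="same", activation="relu"),
    # Collapse the remaining spatial grid into a single vector per frame,
    # keeping the temporal axis intact (cf. the W/32 x H/32 maps of the full backbone).
    layers.TimeDistributed(layers.GlobalAveragePooling2D()),
    layers.Dense(FEATURE_DIM),
])

clip = tf.random.normal((2, T, H, W, C))   # batch of 2 clips
features = spatial_encoder(clip)           # shape: (2, T, FEATURE_DIM)
print(features.shape)
```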

3.2. Temporal Encoder

In this section, we formally describe three common temporal architectures (Table 3) and empirically compare them for this task in Section 5, in terms of accuracy, training time, generalization at test time, and ease of use.

3.2.1. LSTM

Following [39], we adopt a 1-layer bidirectional LSTM (BLSTM), taking the visual feature vectors $x_t$ as input and outputting a class probability $y_t(\pi_t)$, $\pi_t \in \mathcal{A}$, for every frame $t$. The BLSTM has 1024 cells in each direction. The overall network is trained together with CTC (see Section 3.4.2 for training details). The output alphabet $\mathcal{A}$ is therefore augmented with the CTC blank class label, and decoding is performed with a beam search.
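As a rough sketch of this configuration, the snippet below builds a 1-layer BLSTM with 1024 cells per direction over pre-computed per-frame features and trains it with the standard CTC loss via tf.nn.ctc_loss; the number of action classes, the feature dimension, and the toy batch are assumptions for illustration, not the exact training pipeline.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_ACTIONS = 48                 # assumed |A|; the CTC blank is appended as class index NUM_ACTIONS
FEATURE_DIM = 1024               # assumed per-frame feature size from the spatial encoder

# 1-layer bidirectional LSTM with 1024 cells per direction, emitting per-frame
# class logits over A plus the CTC blank.
temporal_encoder = tf.keras.Sequential([
    layers.Input(shape=(None, FEATURE_DIM)),
    layers.Bidirectional(layers.LSTM(1024, return_sequences=True)),
    layers.Dense(NUM_ACTIONS + 1),           # logits; the softmax is applied inside ctc_loss
])

def ctc_loss(transcripts, logits, transcript_len, frame_len):
    """Standard CTC negative log-likelihood (Equation (5)); blank = last index."""
    return tf.reduce_mean(tf.nn.ctc_loss(
        labels=transcripts, logits=logits,
        label_length=transcript_len, logit_length=frame_len,
        logits_time_major=False, blank_index=NUM_ACTIONS))

# Toy batch: 2 videos of 100 frames with transcripts of length 5 (assumed shapes).
feats = tf.random.normal((2, 100, FEATURE_DIM))
transcripts = tf.random.uniform((2, 5), maxval=NUM_ACTIONS, dtype=tf.int32)
logits = temporal_encoder(feats)
loss = ctc_loss(transcripts, logits,
                transcript_len=tf.fill([2], 5), frame_len=tf.fill([2], 100))
```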

3.2.2. Full Convolution

Following [40], we adapt depth-wise separable convolution layers, which consist of a separate convolution along the time dimension for every channel, followed by a projection along the channel dimensions (a position-wise convolution with filter width 1). Each spatial convolution is followed by a shortcut connection, batch normalization, and ReLU. The overall network consists of 15 convolutional layers, also trained with a CTC loss, with sequences decoded by using a beam search (above).
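A hedged sketch of such a depthwise-separable temporal stack is shown below: each block applies a per-channel convolution along time followed by a pointwise projection, with a shortcut connection, batch normalization, and ReLU. The 15-layer depth and 1536-channel width follow the text; the kernel size and the input projection are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_ACTIONS = 48
FEATURE_DIM = 1024
CHANNELS = 1536         # channel width reported for the fully convolutional model
NUM_LAYERS = 15

def separable_temporal_block(x, kernel_size=5):
    """Depthwise conv along time per channel + pointwise projection, with a
    shortcut connection, batch norm and ReLU (kernel size is an assumption)."""
    y = layers.SeparableConv1D(CHANNELS, kernel_size, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([x, y])              # shortcut connection
    return layers.ReLU()(y)

inputs = layers.Input(shape=(None, FEATURE_DIM))
x = layers.Dense(CHANNELS)(inputs)        # project features to the working width
for _ in range(NUM_LAYERS):
    x = separable_temporal_block(x)
logits = layers.Dense(NUM_ACTIONS + 1)(x) # per-frame logits incl. the CTC blank
fc_temporal_encoder = tf.keras.Model(inputs, logits)
```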

3.2.3. Transformer

Following [41], we adopt 6 encoder and 6 decoder layers with $\log_2 |\mathcal{A}|$ attention heads; each attention block has 512 channels and is followed by two position-wise fully connected layers of width 2048 and 512, compared to the 1536 channels of the fully convolutional network. Every encoder layer is a self-attention layer, where the input tensor serves as the attention queries, keys, and values at the same time.
Every decoder layer attends on the embedding produced by the encoder using common soft-attention: the encoder outputs are the attention keys and values, and the previous decoding layer outputs are the queries. The decoder produces target class probabilities which are matched to the ground-truth labels by CTC decoding and trained as a whole with a cross-entropy loss.
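The sketch below illustrates the encoder side of this design with tf.keras.layers.MultiHeadAttention: six self-attention layers with 512-dimensional embeddings, a 2048-unit position-wise feed-forward sub-layer, and roughly $\log_2 |\mathcal{A}|$ heads. The decoder with teacher forcing and the positional embeddings are omitted for brevity, so this is an illustrative fragment rather than the full model.

```python
import math
import tensorflow as tf
from tensorflow.keras import layers

NUM_ACTIONS = 48
D_MODEL, D_FF = 512, 2048
NUM_HEADS = max(1, int(math.log2(NUM_ACTIONS)))   # ~log2|A| heads, as in the text

def transformer_encoder_layer(x):
    """One self-attention encoder layer: the input acts as queries, keys, and values,
    followed by a two-layer position-wise feed-forward net (sketch only)."""
    attn = layers.MultiHeadAttention(num_heads=NUM_HEADS,
                                     key_dim=D_MODEL // NUM_HEADS)(x, x, x)
    x = layers.LayerNormalization()(x + attn)
    ff = layers.Dense(D_FF, activation="relu")(x)
    ff = layers.Dense(D_MODEL)(ff)
    return layers.LayerNormalization()(x + ff)

inputs = layers.Input(shape=(None, D_MODEL))      # per-frame features projected to D_MODEL
x = inputs
for _ in range(6):                                # 6 encoder layers
    x = transformer_encoder_layer(x)
logits = layers.Dense(NUM_ACTIONS + 1)(x)         # decoder/teacher forcing omitted here
transformer_encoder = tf.keras.Model(inputs, logits)
```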
This section demonstrates how to achieve more efficient decoding by the early elimination of paths that obviously violate the visual consistency based on existing labels. To understand how to reduce the search space of eligible candidate decoding paths, it is necessary that we first briefly look back on how the original CTC decoding process performs general end-to-end alignment learning in sequential signal transduction tasks.

3.3. The Original CTC Decoding

CTC sums over all possible alignments under a conditional independence assumption.
Firstly, a training sample of $T$ frames is given:
$$x = (x_1, \ldots, x_T) \qquad (1)$$
where $t$ is the index over the $T$ frames and $x_t$ is a vector of frame features.
According to the conditional independence assumption (CIA) of the original CTC formulation [7], the probability of a path $\pi = (\pi_1, \ldots, \pi_T)$, $\pi_t \in \mathcal{A}$, is the stepwise product of the network output softmax activation $y_t(\pi_t)$ of $\pi_t$ at each frame $t$:
$$P(\pi \mid x) \overset{\mathrm{CIA}}{=} \prod_{t=1}^{T} y_t(\pi_t) \qquad (2)$$
where $y_t(k)$ is the probability of the network outputting action $k$ at time $t$, given input $x$, $k \in \mathcal{A}$ ($\mathcal{A}$ is the collective set of all possible actions).
We refer to sequences $\pi$ over $\mathcal{A}$ as paths, to be distinguished from the action order $\ell$, which is naturally produced from a path $\pi$ by applying the operator $\mathcal{B}(\pi)$ that removes repetitions:
$$\mathcal{B}(\pi) = \ell \qquad (3)$$
The probability of $\ell$ sums over all paths consistent with $\ell$:
$$P(\ell \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(\ell)} P(\pi \mid x) \overset{\text{Eq. (2)}}{=} \sum_{\mathcal{B}(\pi) = \ell} \prod_{t=1}^{T} y_t(\pi_t) \qquad (4)$$
Finally, the loss is the negative log likelihood of observing $\ell$:
$$J = -\ln P(\ell \mid x) \qquad (5)$$
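To make Equations (2)–(5) concrete, here is a tiny brute-force sketch in Python/NumPy that enumerates every path consistent with a transcript and sums their probabilities for a toy input; it follows the blank-free formulation above (the operator B only removes repetitions), uses made-up random outputs, and is for illustration only, since practical CTC uses the dynamic program of Section 3.4.2.

```python
import itertools
import numpy as np

def collapse(path):
    """The operator B: remove consecutive repetitions to recover the action order."""
    return tuple(k for i, k in enumerate(path) if i == 0 or k != path[i - 1])

def ctc_likelihood_bruteforce(y, transcript):
    """P(l|x) = sum over all paths pi with B(pi) = l of prod_t y_t(pi_t).
    y has shape (T, |A|); each row is a per-frame softmax output."""
    T, num_actions = y.shape
    total = 0.0
    for path in itertools.product(range(num_actions), repeat=T):
        if collapse(path) == tuple(transcript):
            total += np.prod([y[t, k] for t, k in enumerate(path)])
    return total

# Toy example: T = 6 frames, |A| = 3 actions, transcript green->yellow->orange (0,1,2).
rng = np.random.default_rng(0)
y = rng.random((6, 3))
y /= y.sum(axis=1, keepdims=True)             # per-frame softmax-like normalisation
p = ctc_likelihood_bruteforce(y, [0, 1, 2])
print(p, -np.log(p))                          # likelihood and loss J = -ln P(l|x)
```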

3.4. Decoding Search Space Reduction

One drawback of the original CTC is that Equation (2) weights all paths equally, causing the sum in Equation (4) to include visually inconsistent paths $\pi$ that deteriorate the performance.
We incorporate visual similarity into Equation (2) by rewarding paths:
$$P(\pi \mid x) \propto \prod_{t=1}^{T} \phi_t \cdot \psi_{t,t+1}, \quad \text{where } \phi_t = y_t(\pi_t), \quad \psi_{t,t+1} = \begin{cases} \max(\theta, s_{t,t+1}), & \pi_t = \pi_{t+1} \\ \theta, & \pi_t \neq \pi_{t+1} \end{cases}, \quad s_{t,t+1} = f_{sim}(x_t, x_{t+1}) \qquad (6)$$
where $\phi_t = y_t(\pi_t)$ represents the original Equation (2) formulation, and $\theta$ is a minimum threshold on the frame similarity function $s_{t,t+1} = f_{sim}(x_t, x_{t+1})$. Two cases arise (a small numerical sketch follows the list below):
  • When $\pi_t = \pi_{t+1}$ and $s_{t,t+1} > \theta$ (high similarity), $\psi_{t,t+1} = s_{t,t+1}$ rewards the path for staying at the same prediction.
  • When $\pi_t = \pi_{t+1}$ and $s_{t,t+1} < \theta$ (low similarity), $\psi_{t,t+1} = \theta$, which means no intervention after normalization.
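The following minimal NumPy sketch computes the un-normalised score of Equation (6) for a single path, given per-frame network outputs and frame-to-frame similarities; the threshold value, the uniform outputs, and the similarity numbers are all assumptions chosen to mirror the Figure 6 example, and the global re-normalisation over paths is omitted.

```python
import numpy as np

THETA = 0.5   # minimum similarity threshold (hyper-parameter; value assumed here)

def reweighted_path_score(y, path, sim):
    """Un-normalised score of Equation (6): phi_t * psi_{t,t+1} accumulated over t.
    y: (T, |A|) per-frame probabilities; path: length-T label sequence;
    sim[t] = f_sim(x_t, x_{t+1}) for t = 0..T-2."""
    T = len(path)
    score = 1.0
    for t in range(T):
        phi = y[t, path[t]]
        if t < T - 1 and path[t] == path[t + 1]:
            psi = max(THETA, sim[t])       # reward repetitions across similar frames
        else:
            psi = THETA                    # label transition (or last frame): no extra reward
        score *= phi * psi
    return score

# Toy example in the spirit of Figure 6: actions {0: green, 1: yellow, 2: orange}.
y = np.full((6, 3), 1.0 / 3.0)                       # uniform network outputs
sim = np.array([0.9, 0.9, 0.1, 0.9, 0.1])            # assumed frame-to-frame similarities
pi1 = [0, 0, 0, 1, 1, 2]                             # ground-truth-like path
pi2 = [0, 1, 2, 2, 2, 2]                             # degenerated path with the same order
print(reweighted_path_score(y, pi1, sim) > reweighted_path_score(y, pi2, sim))  # True
```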

3.4.1. Example

Figure 6 illustrates how Equation (6) re-weights a ground-truth path and a degenerated path. In Figure 6, there is a ground-truth path $\pi^1$ and another path $\pi^2$ that produces the same action order as $\pi^1$, i.e., $\mathcal{B}(\pi^1) = \mathcal{B}(\pi^2) = \ell$, but yields a different frame-wise label sequence:
  • $\mathcal{A} = \{Action_1, Action_2, Action_3\}$, represented by green, yellow, and orange nodes, respectively.
  • $\ell = green \rightarrow yellow \rightarrow orange$, which is assumed to be already known during training.
  • $\pi^1 = green \rightarrow green \rightarrow green \rightarrow yellow \rightarrow yellow \rightarrow orange$.
  • $\pi^2 = green \rightarrow yellow \rightarrow orange \rightarrow orange \rightarrow orange \rightarrow orange$.
In fact, for $T = 6$ and $|\ell| = 3$, there are altogether $\binom{T-2}{|\ell|-1} = \binom{4}{2} = \frac{4!}{2!\,2!} = 6$ distinct paths $\pi^i$ satisfying the supervised constraint $\mathcal{B}(\pi^i) = \ell$, because:
  • $\pi_1$ has to be $\ell(1) = green$;
  • $\pi_T$ has to be $\ell(end) = orange$; and
  • The only difference among the $\pi^i$s is the choice of the frames $\pi_t$, $t = 2, \ldots, T-1$, at which to transit the node label from $\ell(j)$ to $\ell(j+1)$, with $j = 1, 2, \ldots, |\ell| - 1$.
To summarize, the necessary and sufficient condition for a path $\pi$ to be consistent with the supervised action order $\ell$ is that:
  • $\pi_1 = \ell(1)$.
  • $\pi_T = \ell(end)$.
  • For each middle node $\pi_t$, $t = 2, \ldots, T-1$, there are only two possible options: (1) stay the same as the previous node, i.e., if $\pi_{t-1} = \ell(j)$ then $\pi_t = \ell(j)$ as well (whenever this case holds, a 'repetition' happens in $\pi$); or (2) transit from $\pi_{t-1} = \ell(j)$ to the next label in $\ell$, i.e., $\pi_t = \ell(j+1)$. Any other label assignment would cause $\mathcal{B}(\pi) = \ell$ to no longer hold.
So far, we can draw the conclusion that the only difference between a valid path $\pi$ and the supervised action order $\ell$ is that $\pi$ contains 'repetitions'; by inserting and removing 'repetitions', $\pi$ and $\ell$ can be converted into each other. A short enumeration sketch is given below.
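As a quick check of this count, the following few lines of Python enumerate all label sequences of length T = 6 over three actions and keep those whose collapsed form equals the transcript; the printout confirms the six valid paths (purely illustrative).

```python
import itertools

def collapse(path):
    """B: drop consecutive repetitions (same operator as in the earlier sketch)."""
    return tuple(k for i, k in enumerate(path) if i == 0 or k != path[i - 1])

T, transcript = 6, (0, 1, 2)          # green -> yellow -> orange
valid = [p for p in itertools.product(range(3), repeat=T) if collapse(p) == transcript]
print(len(valid))                      # 6, i.e. C(T-2, |l|-1) = C(4, 2)
for p in valid:
    print(p)
```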
In this example, $\pi_t = \pi_{t+1}$ holds true at $t = 1, 2, 4$ in path $\pi^1$, and at $t = 3, 4, 5$ in path $\pi^2$. $\pi_t = \pi_{t+1}$ means that a 'repetition' happens in path $\pi$ at frame $t+1$, but there are two cases to judge whether such a 'repetition' should be encouraged or not, taking into account its consistency with the ground-truth alignment.
  • Case 1: $s_{t,t+1} > \theta$,
which means that, apart from the supervised order information $\ell$, it is also observed without supervision that frame $t+1$ is visually very similar to frame $t$, which suggests that such a 'repetition' should be additionally encouraged.
At $t = 1$, $\pi^1_t = \pi^1_{t+1} = green$ holds true for $\pi^1$ and false for $\pi^2$, where $\pi^2_t = green \neq \pi^2_{t+1} = yellow$. $\pi^1$ is consistent with the ground truth while $\pi^2$ is not, since $s_{t,t+1} = 10 > \theta$. Therefore:
  • $\psi_{t,t+1} = s_{t,t+1}$ for path $\pi^1$ when $s_{t,t+1} > \theta$, which encourages $\pi^1$, with its 'repetition' at $t+1 = 2$, to yield a higher probability than $\pi^2$ after the re-normalization of $P(\pi \mid x)$ at time step $t+1$.
  • $\psi_{t,t+1} = \theta$ for path $\pi^2$, since $\pi^2_t \neq \pi^2_{t+1}$ at $t = 1$, which means no encouragement after the re-normalization of $P(\pi \mid x)$ at time step $t+1$; the calculation remains the same as in the original formulation of Equation (2).
  • Case 2: $s_{t,t+1} \leq \theta$,
which means that, apart from the supervised order information $\ell$, it is also observed without supervision that frame $t+1$ is visually not similar to frame $t$, which suggests that such a 'repetition' should not be encouraged.
At $t = 3$, $\pi^2_t = \pi^2_{t+1} = orange$ holds true for $\pi^2$ and false for $\pi^1$, where $\pi^1_t = green \neq \pi^1_{t+1} = yellow$. $\pi^1$ is consistent with the ground truth while $\pi^2$ is not, since $s_{t,t+1} = 1 \leq \theta$. Therefore:
  • $\psi_{t,t+1} = \theta$ for path $\pi^2$ when $s_{t,t+1} \leq \theta$, which means that such a 'repetition' at $t+1 = 4$ is not encouraged, and $P(\pi \mid x)$ remains the same as in the original Equation (2) after re-normalization.
  • $\psi_{t,t+1} = \theta$ for path $\pi^1$, since $\pi^1_t \neq \pi^1_{t+1}$ at $t = 3$, which also means no intervention.

3.4.2. How to Train the Proposed Decoder

The back-propagation through the CTC layer to obtain the closed form of $\partial J / \partial a_t(k)$ is quite cumbersome and would span several pages, so we do not present the mathematical derivation in detail; readers interested in the complete derivation of Equation (7), i.e., the gradient of $P(\ell \mid x)$, can find it in the relevant literature and tutorials, such as [7,42], based on dynamic programming and the chain rule of differentiation.
Here we briefly give the closed form of the forward loss $J = -\ln P(\ell \mid x)$, together with its backward gradient w.r.t. the neural network output $y_t(k)$ (the response for label $k$ at time $t$):
$$\begin{aligned} P(\ell \mid x) &= \sum_{j=1}^{J=|\ell|} \alpha_t(j)\,\beta_t(j) \\ \frac{\partial J}{\partial y_t(k)} &= -\frac{\partial \ln P(\ell \mid x)}{\partial y_t(k)} = -\frac{1}{P(\ell \mid x)}\,\frac{\partial P(\ell \mid x)}{\partial y_t(k)} = -\frac{1}{P(\ell \mid x)\cdot y_t(k)} \sum_{j \in label(k)} \alpha_t(j)\,\beta_t(j) \\ \frac{\partial J}{\partial a_t(k)} &= \sum_{k'} \frac{\partial J}{\partial y_t(k')}\cdot\frac{\partial y_t(k')}{\partial a_t(k)} = y_t(k) - \frac{1}{P(\ell \mid x)} \sum_{j \in label(k)} \alpha_t(j)\,\beta_t(j) \end{aligned} \qquad (7)$$
where:
$a_t(k)$ are the un-normalised outputs before the softmax activation is applied, $y_t(k) = \frac{\exp(a_t(k))}{\sum_{k'} \exp(a_t(k'))}$, with $k'$ ranging over all outputs; and
$\alpha_t(j)$, $\beta_t(j)$ are the forward and backward variables, respectively. $\alpha_t(j)$ is defined as the summed probability of all partial paths satisfying $\mathcal{B}(\pi_{1:t}) = \ell_{1:j}$, and $\beta_t(j)$ appends to $\alpha_t(j)$ the continuation from $t+1$ that completes $\ell$, where $\ell_{1:j}$ is the first $j$ actions of $\ell$; both $\alpha_t(j)$ and $\beta_t(j)$ can be calculated by recursive induction:
$$\begin{aligned} \alpha_t(j) &= \alpha_{t-1}(j)\,y_{t-1}(b) + \alpha_{t-1}(j-1)\,y_{t-1}(\ell_j) \\ \beta_t(j) &= \beta_{t+1}(j)\,y_t(b) + \beta_{t+1}(j+1)\,y_t(\ell_{j+1}) \\ \alpha_{t=1}(j=1) &= 1, \quad \alpha_{t=1}(j \neq 1) = 0 \\ \beta_{t=T}(j=|\ell|) &= 1, \quad \beta_{t=T}(j \neq |\ell|) = 0, \qquad j = 1, \ldots, |\ell| \end{aligned} \qquad (8)$$
where $b$ denotes the CTC blank label. The gradient $\partial J / \partial a_t(k)$ in Equation (7) is the final 'error signal' back-propagated through the network during training.
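As an illustrative companion to Equations (7) and (8), the NumPy sketch below computes forward and backward variables for the simplified blank-free formulation of Section 3.3 (each step either stays on the current transcript label or advances to the next one); it omits the blank label of Equation (8), log-space arithmetic, and the gradient step, and the toy inputs are assumptions.

```python
import numpy as np

def forward_backward(y, transcript):
    """Forward/backward variables in the blank-free formulation of Section 3.3:
    a path may either stay on the current transcript label or advance to the next.
    y: (T, |A|) per-frame probabilities; transcript: list of |l| action indices.
    Illustrative only: the paper's Equation (8) also handles the CTC blank label,
    and practical implementations work in log space."""
    T, J = y.shape[0], len(transcript)
    alpha = np.zeros((T, J))
    beta = np.zeros((T, J))
    alpha[0, 0] = y[0, transcript[0]]                      # alpha includes y at frame t
    for t in range(1, T):
        for j in range(J):
            stay = alpha[t - 1, j]
            advance = alpha[t - 1, j - 1] if j > 0 else 0.0
            alpha[t, j] = (stay + advance) * y[t, transcript[j]]
    beta[T - 1, J - 1] = 1.0                               # beta covers frames t+1..T
    for t in range(T - 2, -1, -1):
        for j in range(J - 1, -1, -1):
            stay = beta[t + 1, j] * y[t + 1, transcript[j]]
            advance = (beta[t + 1, j + 1] * y[t + 1, transcript[j + 1]]
                       if j < J - 1 else 0.0)
            beta[t, j] = stay + advance
    return alpha, beta

rng = np.random.default_rng(0)
y = rng.random((6, 3)); y /= y.sum(axis=1, keepdims=True)
alpha, beta = forward_backward(y, [0, 1, 2])
# P(l|x) can be read off at any t as sum_j alpha[t, j] * beta[t, j]:
print([float(np.sum(alpha[t] * beta[t])) for t in range(6)])
```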

4. Experimental Setup

4.1. Visual Similarity Measurement

Cut the video into $T/M$ clusters, each of length $M$. Set $M$ to be conservatively shorter than common action lengths (e.g., ∼400 frames on average in the YouCook2 dataset) so that frames belonging to different actions do not blend into the same cluster.
Thus, $s_{t,t+1}$ can be set under the resulting constraints that:
  • $\pi_t = \pi_{t+1}$ if and only if $x_t$ and $x_{t+1}$ fall within the same cluster;
  • $\pi_t \neq \pi_{t+1}$ if and only if $x_t$ or $x_{t+1}$ is at the boundary between clusters.
$$s_{t,t+1} = f_{sim}(x_t, x_{t+1}) = \begin{cases} \cos(x_t, x_{t+1}) = \dfrac{x_t \cdot x_{t+1}}{|x_t|\,|x_{t+1}|}, & \pi_t = \pi_{t+1} \\ \theta, & \pi_t \neq \pi_{t+1} \end{cases} \qquad (9)$$
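A minimal NumPy sketch of this similarity measurement follows: it splits the frame sequence into fixed-length clusters, uses cosine similarity within a cluster, and clamps the similarity to the threshold at cluster boundaries; the boundary value and the threshold are assumptions consistent with Equation (6), not values taken from the paper.

```python
import numpy as np

THETA = 0.5   # assumed similarity threshold

def frame_similarities(features, m):
    """Pairwise similarities s_{t,t+1} used by the re-weighted CTC (sketch).
    features: (T, D) per-frame feature vectors; m: cluster length, chosen shorter
    than typical action durations. Cosine similarity is used within a cluster;
    across a cluster boundary the value is clamped to THETA (assumption)."""
    T = features.shape[0]
    sims = np.empty(T - 1)
    for t in range(T - 1):
        if t // m == (t + 1) // m:                       # same cluster
            a, b = features[t], features[t + 1]
            sims[t] = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        else:                                            # cluster boundary
            sims[t] = THETA
    return sims

feats = np.random.default_rng(1).random((10, 4))
print(frame_similarities(feats, m=4))
```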

4.2. Datasets

We evaluate the proposed approach on four publicly available datasets: YouTube Instructional (https://www.di.ens.fr/willow/research/instructionvideos, accessed on 12 January 2021), YouCook2 (http://youcook2.eecs.umich.edu, accessed on 12 January 2021), Breakfast (https://serre-lab.clps.brown.edu/resource/breakfast-actions-dataset, accessed on 12 January 2021), and 50 Salads (http://cvip.computing.dundee.ac.uk/datasets/foodpreparation/50salads, accessed on 12 January 2021).
  • YouTube Instructions [24] contains 150 samples from YouTube on five tasks: making coffee, changing a car tire, CPR, jumping a car, and potting a plant, with approximately two minutes per sample.
  • YouCook2 [25] contains about 2 k samples from YouTube on 90 cooking recipes, with approximately 3∼15 steps per recipe class, where each step is a temporally aligned narration collected from paid human workers.
  • Breakfast [38] contains about 2 k samples on ten common kitchen tasks with approximately eight steps per task. The average length of each task varies from 30 s to 5 min.
  • 50 Salads [18] contains 4.5 h of 25 people preparing 2 mixed salads each, with approximately 10 k frames per sample. Each sub-action corresponds to two levels of granularity, and each low-level granularity is further divided into pre-, core-, and post-phase.

4.3. Metrics

  • Frame-level accuracy is calculated as the percentage of correct predictions. Intuitively, frame-wise metrics ignore temporal patterns and occurrence orders in the sequential inputs. It is possible to achieve high frame measures but at the same time generate considerable over-segmentation errors, as visualized later, raising the need to introduce segmental metrics to penalize predictions that are out of order or over-segmented.
  • Segment-level edit distance is also known as the Levenshtein distance, and only measures the temporal order of occurrence, without considering durations. It is therefore useful for procedural tasks in this work, where the order is the most essential.
    It is calculated from the number of segment insertions, deletions, and substitutions between the predicted order and the ground-truth sequence, and is then normalized to the range [0∼100] in Table 4 such that higher is better:
    $$Edit = \frac{|insertions| + |deletions| + |substitutions|}{|Ground\ truth|} \qquad (10)$$
  • Segment-level mean average precision (mAP@k). With an intersection over union (IoU) threshold $k$, the IoU is calculated by dividing the intersection between each predicted segment $I$ and the ground-truth segment $I'$ of the same action category by their union:
    $$IoU(I) = \frac{|I \cap I'|}{|I \cup I'|} \qquad (11)$$
    $I$ is considered a 'true positive' ($TP$) if $IoU(I) \geq k$; otherwise it is a 'false positive' ($FP$). Average precision is accumulated across all categories. mAP@k is more invariant to small temporal shifts compared to the metrics above.
  • Segment-level F1-score (F1@k). With an intersection over union (IoU) threshold $k$, true positives are those predictions with $IoU(I) \geq k$ and the same label as the ground truth:
    $$Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \cdot Recall \cdot Precision}{Recall + Precision} \qquad (12)$$
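For reference, here are simplified Python re-implementations of the segmental edit score, the segment IoU, and F1@k; they are illustrative sketches rather than the official evaluation code, and the edit-score normalisation used below (by the longer of the two sequences) is a common convention that may differ in detail from Equation (10).

```python
import numpy as np

def edit_score(pred_order, gt_order):
    """Normalised segmental edit score, higher is better (0-100)."""
    n, m = len(pred_order), len(gt_order)
    d = np.zeros((n + 1, m + 1))
    d[:, 0] = np.arange(n + 1)
    d[0, :] = np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if pred_order[i - 1] == gt_order[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    return 100.0 * (1.0 - d[n, m] / max(n, m, 1))

def segment_iou(seg_a, seg_b):
    """IoU between two (start, end) segments (Equation (11))."""
    inter = max(0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = max(seg_a[1], seg_b[1]) - min(seg_a[0], seg_b[0])
    return inter / union if union > 0 else 0.0

def f1_at_k(pred_segments, gt_segments, k=0.5):
    """F1@k (Equation (12)): a prediction is a TP if it overlaps an unmatched
    ground-truth segment of the same class with IoU >= k."""
    matched = [False] * len(gt_segments)
    tp = fp = 0
    for p_start, p_end, p_label in pred_segments:
        hit = False
        for i, (g_start, g_end, g_label) in enumerate(gt_segments):
            if not matched[i] and p_label == g_label and \
               segment_iou((p_start, p_end), (g_start, g_end)) >= k:
                matched[i], hit = True, True
                break
        tp, fp = tp + hit, fp + (not hit)
    fn = matched.count(False)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(edit_score(["green", "yellow", "orange"], ["green", "orange"]))
print(f1_at_k([(0, 10, "green"), (10, 20, "yellow")],
              [(0, 12, "green"), (12, 20, "orange")], k=0.5))
```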

4.4. Hyper-Parameters

  • CNN pre-training: I3D Res-50 backbone [37] pre-trained on Kinetics [43] (https://github.com/deepmind/kinetics-i3d, accessed on 20 August 2020).
  • Optimization: The BLSTM is trained with vanilla SGD, a fixed momentum of 0.9, and an initial learning rate of $10^{-2}$, reduced down to $10^{-4}$ every time the error plateaus. Following [44], the fully convolutional network and the Transformer are trained with the ADAM optimizer and an initial learning rate of $10^{-3}$, reduced down to $10^{-4}$ on plateaus.
  • Transformer embedding: Information about the sequence order of the encoder and decoder inputs is fed to the model via fixed positional embeddings in the form of sinusoidal functions. The Transformer is trained using teacher forcing: the ground truth of the previous decoding step is fed as the input to the decoder, while during inference the decoder's own prediction is fed back.
  • Dropout: Following [45], the BLSTM is trained with dropout probability $p = 0.3$ on the units of the input and recurrent layers. The fully convolutional network is trained with dropout probability $p = 0.8$ on the units of the batch normalization layers. The Transformer is trained with dropout probability $p = 0.1$.
  • Termination: Training stops once the loss function no longer drops on the validation set between two consecutive epochs.
  • Software and Hardware: All the models are implemented in TensorFlow and trained on a single GeForce GTX 1080 Ti GPU with 11 GB memory.
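A brief sketch of this setup in TensorFlow/Keras is given below; the plateau patience, the reduction factor, and the commented-out training call are assumptions for illustration rather than the exact schedule used.

```python
import tensorflow as tf

# Optimiser settings from Section 4.4 (sketch): SGD with momentum 0.9 for the BLSTM,
# Adam for the fully convolutional and Transformer models. The learning rate is
# reduced on validation-loss plateaus and training stops once the validation loss
# no longer drops between two consecutive epochs. Patience/factor values are assumptions.
blstm_optimizer = tf.keras.optimizers.SGD(learning_rate=1e-2, momentum=0.9)
adam_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.1,
                                         patience=1, min_lr=1e-4),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=1),
]
# model.compile(optimizer=adam_optimizer, loss=ctc_loss_fn)
# model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=callbacks)
```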

5. Results and Discussion

5.1. Ablation Analysis

Rows 5∼9 of Table 4 show the ablation studies combining the different modules elaborated in Section 3. The final performance improvement can be estimated from the average differences across the four tested benchmarks between the baseline (Row 5) and the best-performing Row 9 (I3D + Transformer + R-CTC), denoted as $\Delta_{Base}$ hereafter for simplicity:
$$\Delta_{Base}(9) = (\overline{\Delta F.Acc} = 7.74 \pm 1.70,\ \overline{\Delta mAP} = 6.76 \pm 1.81,\ \overline{\Delta Edit} = 12.3 \pm 3.68)$$

5.1.1. Ablation Analysis of Temporal Encoding Backbone

Rows 6∼8 investigate the impact of different temporal backbones (Section 3) separately. The average differences between BLSTM, the fully convolutional network, and Transformer under original CTC across the four tested benchmarks were:
$$\begin{aligned} \Delta_{Base}(6) &= (\overline{\Delta F.Acc} = 11.54 \pm 2.53,\ \overline{\Delta mAP} = 10.05 \pm 2.69,\ \overline{\Delta Edit} = 9.07 \pm 2.69) \\ \Delta_{Base}(7) &= (\overline{\Delta F.Acc} = 6.46 \pm 1.42,\ \overline{\Delta mAP} = 5.41 \pm 1.45,\ \overline{\Delta Edit} = 0.26 \pm 0.07) \\ \Delta_{Base}(8) &= (\overline{\Delta F.Acc} = 3.58 \pm 0.78,\ \overline{\Delta mAP} = 3.20 \pm 0.85,\ \overline{\Delta Edit} = 8.83 \pm 2.62) \end{aligned}$$
LSTM performed worse than the fully convolutional network, even though the recurrent model has the full context of every decoding time step, compared to the convolutional network, which only looks at a limited time window of the input. This is in accordance with current trends shifting from recurrent networks towards purely convolutional/self-attentional models in other related domains, such as translation [46,47] and speech [48,49]. Dedicated explorations [50,51] attribute this inferiority to the inherent limitations of recurrent structures: although the gating mechanism alleviates the difficulty of gradient propagation, the maximum memory is still restricted to a limited distance, usually not exceeding $\Theta(10^2)$ time steps. The fully convolutional model has fewer parameters and trains faster than the BLSTM and the bidirectional Transformer; the best-performing backbone was the Transformer.

Training Time

Transformer and the fully convolutional network took approximately the same amount of time to compute a batch of 32 samples (Table 3). This is in accordance with the theoretical per-layer complexity $\Theta(Td^2)$ of both models, where $d$ is the number of channels. Transformer has fewer channels (512) for every self-attention block, but it is in effect a deeper model, with $3 \times (6 + 6) = 24$ layers in total. However, the fully convolutional network took fewer iterations to converge, completing the full curriculum in 3.5 days, compared to 14 days for the Transformer. This may be due to the more complex contraction among the self-attention queries, keys, and values during gradient propagation and weight updating; in contrast, the fully convolutional model has no reverse connections among learnable modules given a fixed context. The BLSTM naturally took more time to run one batch, since the computations within its layers have to be executed sequentially; consequently, it converged in almost double the clock time of the fully convolutional network.

5.1.2. Ablation Analysis of CTC Search Space Reduction

Rows 8 and 9 investigate the impact of the decoding block (Section 3.4) separately. The average differences between the proposed R-CTC and the standard original CTC across the four tested benchmarks were:
$$\Delta_{Decoding}(9-8) = (\overline{\Delta F.Acc} = 9.8 \pm 1.25,\ \overline{\Delta mAP} = 6.53 \pm 2.29,\ \overline{\Delta Edit} = 4.23 \pm 1.58)$$
Generally, original CTC performed quite competitively at the order level, despite its rather poor performance at frame-level (the finer the evaluation metric applied, the larger the gap grew in the rightmost column of Rows 8 and 9, highlighted as brown background cell color, growing darker as the cell value grows greater), which is consistent with its underlying principle to learn order distributions rather than alignments. The performance improvement was statistically significant in all 12 dataset/measurement combinations tested, indicating that the proposed decoding procedure is effective.

Training Time

Figure 7 compares learning efficiency. The non-CTC model converged the fastest, but its final loss on the validation set was the highest. CTC converged more slowly than the non-CTC baseline because of the increased depth of the gradient propagation process. Although the gradient depth is further increased in reduced-CTC compared to the original CTC, its convergence did not deteriorate but even accelerated, thanks to the significantly reduced search space of potential alignments.

5.2. Comparison with State-of-the-Art Frame-Wise Methods

In contrast to video-level supervision, fully supervised segmentation is a much more researched topic, and there are more diverse benchmarks available for comparison. As expected, there was an obvious gap between the best-performing video-level methods and the state-of-the-art frame-wise method (4. ASRF):
$$\Delta_{Frame}(4-9) = (\overline{\Delta F.Acc} = 8.05 \pm 1.90,\ \overline{\Delta mAP} = 9.53 \pm 4.44,\ \overline{\Delta Edit} = 8.23 \pm 2.32)$$
Although 11 of the 12 current best results obtained in the different dataset/measurement combinations were provided by fully supervised approaches (highlighted in bold font in Table 4), they rely on expensive frame-by-frame manual annotations and task-specific hand-engineered pre-/post-processing techniques, while our proposal is purely automatic and end-to-end, without human intervention both during the training phase and after deployment. Note that on the second dataset (YouTube Instructions), our proposed Row 9 (I3D + Transformer + R-CTC) even outperformed the best frame-wise result by a noticeable margin in terms of order-level evaluation (+12.7% by the re-normalized Edit measurement). The performance improvement was statistically significant, indicating the possibility that explicit alignment learning in a purely data-driven fashion may be more effective than fully supervised methods if end-to-end ordering information is well learned.

5.3. Qualitative Analysis

Some representative examples are shown in Figure 8. The baseline without CTC produced noisier outputs and over-segmented actions. CTC output a degenerated path, although the order was correct. Reduced-CTC did not suffer from over-segmentation and had better localization and ordering.
Specifically, we found that Transformer + RCTC captured longer-range temporal dependencies than the state of the art, especially in cases when distinct actions were visually very similar. For example, Baseline wrongly predicted the ground-truth class ‘add coffee’ as ‘pour coffee’, while neglecting the temporal dependencies between certain action pairs. Our hybrid encoder also made more reliable predictions of extremely short action instances that fell in between two long actions (‘butter pan’, ‘withdraw stove’, etc.). This suggests that using modified temporal classification is critical for improving the accuracy of prediction of action boundaries.

6. Conclusions

We present Hybrid Reduced-CTC to align actions in instructional videos. The main contribution is a hybrid convolutional/Transformer/LSTM encoder to capture long-term temporal dynamics within and between actions, together with a modified dynamic programming procedure that recursively sums every possible alignment and reduces the decoding search space by weighting the priority of qualified paths by their visual consistency with the existing labels. Our results were found to be competitive in terms of accuracy on four publicly available datasets of different size and difficulty. Based on the confirmed effectiveness of CTC in frame- and transcript-level localization, we plan to further explore its impact on the stability of convergence, motivated by the observation that CTC empirically tends to become more difficult to converge than non-CTC models as the length of the input video increases, which suggests that more effective under-sampling techniques may be needed to adapt CTC to longer time spans.

Author Contributions

Conceptualization, L.W. and X.W.; methodology, L.W.; software, L.W.; validation, L.W., A.H., and X.W.; formal analysis, L.W.; investigation, L.W.; resources, L.W.; data curation, L.W.; writing—original draft preparation, L.W.; writing—review and editing, A.H.; visualization, L.W.; supervision, A.H.; project administration, A.H.; funding acquisition, X.W., Y.X., and A.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Fundamental Research Funds for the Central Universities grant numbers WK2150110007 and WK2150110012, and the National Natural Science Foundation of China grant numbers 61772490, 61472382, 61472381, and 61572454.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

YouTube Instructional is publicly available at https://www.di.ens.fr/willow/research/instructionvideos, YouCook2 is publicly available at http://youcook2.eecs.umich.edu, Breakfast is publicly available at https://serre-lab.clps.brown.edu/resource/breakfast-actions-dataset, and 50 Salads is publicly available at http://cvip.computing.dundee.ac.uk/datasets/foodpreparation/50salads.

Acknowledgments

We would like to thank the unknown reviewers for their effort and time spent on this work. Hardware platform (a workstation with GeForce GTX Titan Z GPU and 6 GB RAM) and software distribution license (Matlab Deep Learning Toolbox) were provided by the Super Computing Center and Network Information Center of University of Science and Technology of China, respectively. The 50 Salads dataset is distributed by the Computer Vision Image Processing (CVIP) group at the University of Dundee, Scotland, UK; YouTube Instructions dataset is distributed by the Willow project team at the Computer Science Department of the Ecole Normale Superieure (DI ENS), Paris, France. The Breakfast Actions Dataset is distributed by the Serre Lab at Brown University. YouCook2 is distributed by University of Michigan.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analysis, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

CTC: Connectionist Temporal Classification
LSTM: Long Short-Term Memory network
R-CTC: Reduced Connectionist Temporal Classification
FC: Fully Convolutional
TM: Transformer
ED-TCN: Encoder–Decoder Temporal Convolutional Network

References

  1. Richard, A.; Kuehne, H.; Gall, J. Weakly supervised action learning with rnn based fine-to-coarse modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 754–763. [Google Scholar]
  2. Ding, L.; Xu, C. Weakly-supervised action segmentation with iterative soft boundary assignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6508–6516. [Google Scholar]
  3. Li, J.; Lei, P.; Todorovic, S. Weakly supervised energy-based learning for action segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6243–6251. [Google Scholar]
  4. Alayrac, J.B.; Bojanowski, P.; Agrawal, N.; Sivic, J.; Laptev, I.; Lacoste-Julien, S. Learning from narrated instruction videos. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2194–2208. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  5. Tang, Y.; Ding, D.; Rao, Y.; Zheng, Y.; Zhang, D.; Zhao, L.; Lu, J.; Zhou, J. COIN: A large-scale dataset for comprehensive instructional video analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1207–1216. [Google Scholar]
  6. Fried, D.; Alayrac, J.B.; Blunsom, P.; Dyer, C.; Clark, S.; Nematzadeh, A. Learning to Segment Actions from Observation and Narration. arXiv 2020, arXiv:2005.03684. [Google Scholar]
  7. Graves, A. Connectionist Temporal Classification. In Supervised Sequence Labelling with Recurrent Neural Networks; Springer: Berlin/Heidelberg, Germany, 2012; pp. 61–93. [Google Scholar] [CrossRef]
  8. Pirsiavash, H.; Ramanan, D. Parsing videos of actions with segmental grammars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 612–619. [Google Scholar]
  9. Richard, A.; Gall, J. Temporal action detection using a statistical language model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 3131–3140. [Google Scholar]
  10. Lea, C.; Reiter, A.; Vidal, R.; Hager, G.D. Segmental Spatiotemporal CNNs for Fine-Grained Action Segmentation. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Lecture Notes in Computer Science. Springer International Publishing: Cham, Switzerland, 2016; pp. 36–52. [Google Scholar]
  11. Singh, B.; Marks, T.K.; Jones, M.; Tuzel, O.; Shao, M. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1961–1970. [Google Scholar]
  12. Lea, C.; Flynn, M.D.; Vidal, R.; Reiter, A.; Hager, G.D. Temporal Convolutional Networks for Action Segmentation and Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  13. Lei, P.; Todorovic, S. Temporal deformable residual networks for action segmentation in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6742–6751. [Google Scholar]
  14. Zhang, Y.; Tang, S.; Muandet, K.; Jarvers, C.; Neumann, H. Local temporal bilinear pooling for fine-grained action parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 12005–12015. [Google Scholar]
  15. Ishikawa, Y.; Kasai, S.; Aoki, Y.; Kataoka, H. Alleviating Over-Segmentation Errors by Detecting Action Boundaries. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikola, HI, USA, 5–9 January 2021; pp. 2322–2331. [Google Scholar]
  16. Ghoddoosian, R.; Sayed, S.; Athitsos, V. Action Duration Prediction for Segment-Level Alignment of Weakly-Labeled Videos. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikola, HI, USA, 5–9 January 2021; pp. 2053–2062. [Google Scholar]
  17. Rohrbach, M.; Rohrbach, A.; Regneri, M.; Amin, S.; Andriluka, M.; Pinkal, M.; Schiele, B. Recognizing Fine-Grained and Composite Activities Using Hand-Centric Features and Script Data. Int. J. Comput. Vis. 2015, 119, 346–373. [Google Scholar] [CrossRef] [Green Version]
  18. Stein, S.; McKenna, S.J. Combining Embedded Accelerometers with Computer Vision for Recognizing Food Preparation Activities. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp 13), Zurich, Switzerland, 8–12 September 2013; Association for Computing Machinery: New York, NY, USA, 2013; pp. 729–738. [Google Scholar] [CrossRef]
  19. Fathi, A.; Ren, X.; Rehg, J.M. Learning to recognize objects in egocentric activities. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 3281–3288. [Google Scholar]
  20. Gao, Y.; Vedula, S.S.; Reiley, C.E.; Ahmidi, N.; Varadarajan, B.; Lin, H.C.; Tao, L.; Zappella, L.; Béjar, B.; Yuh, D.D.; et al. Jhu-isi gesture and skill assessment working set (jigsaws): A surgical activity dataset for human motion modeling. In Proceedings of the Modeling and Monitoring of Computer Assisted Interventions (M2CAI), MICCAI Workshop, Boston, MA, USA, 14–18 September 2014; Volume 3, p. 3. [Google Scholar]
  21. Ahmidi, N.; Tao, L.; Sefati, S.; Gao, Y.; Lea, C.; Haro, B.B.; Zappella, L.; Khudanpur, S.; Vidal, R.; Hager, G.D. A dataset and benchmarks for segmentation and recognition of gestures in robotic surgery. IEEE Trans. Biomed. Eng. 2017, 64, 2025–2041. [Google Scholar] [CrossRef] [PubMed]
  22. Kuehne, H.; Arslan, A.B.; Serre, T. The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  23. Bojanowski, P.; Lajugie, R.; Grave, E.; Bach, F.; Laptev, I.; Ponce, J.; Schmid, C. Weakly-supervised alignment of video with text. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 4462–4470. [Google Scholar]
  24. Alayrac, J.B.; Bojanowski, P.; Agrawal, N.; Sivic, J.; Laptev, I.; Lacoste-Julien, S. Unsupervised learning from narrated instruction videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 4575–4583. [Google Scholar]
  25. Zhou, L.; Xu, C.; Corso, J.J. Towards automatic learning of procedures from web instructional videos. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  26. Kukleva, A.; Kuehne, H.; Sener, F.; Gall, J. Unsupervised learning of action classes with continuous temporal embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 12066–12074. [Google Scholar]
  27. Oneata, D.; Verbeek, J.; Schmid, C. Action and event recognition with fisher vectors on a compact feature set. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 2–8 December 2013; pp. 1817–1824. [Google Scholar]
  28. Wang, H.; Schmid, C. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 2–8 December 2013; pp. 3551–3558. [Google Scholar]
  29. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 568–576. [Google Scholar]
  30. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–17 December 2015; pp. 4489–4497. [Google Scholar]
  31. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
  32. Piergiovanni, A.; Ryoo, M.S. Representation flow for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 9945–9953. [Google Scholar]
  33. Malmaud, J.; Huang, J.; Rathod, V.; Johnston, N.; Rabinovich, A.; Murphy, K. What’s cookin’? interpreting cooking videos using text, speech and vision. arXiv 2015, arXiv:1503.01558. [Google Scholar]
  34. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  35. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  36. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
  37. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. Slowfast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6202–6211. [Google Scholar]
  38. Kuehne, H.; Gall, J.; Serre, T. An end-to-end generative framework for video segmentation and recognition. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016; pp. 1–8. [Google Scholar]
  39. Graves, A.; Fernández, S.; Schmidhuber, J. Bidirectional LSTM networks for improved phoneme classification and recognition. In Proceedings of the International Conference on Artificial Neural Networks, Warsaw, Poland, 11–15 September 2005; Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2005; pp. 799–804. [Google Scholar]
  40. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. [Google Scholar] [CrossRef] [Green Version]
  41. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.U.; Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  42. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376. [Google Scholar]
  43. Li, A.; Thotakuri, M.; Ross, D.A.; Carreira, J.; Vostrikov, A.; Zisserman, A. The AVA-Kinetics Localized Human Actions Video Dataset. arXiv 2020, arXiv:2005.00214. [Google Scholar]
  44. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015; Bengio, Y., LeCun, Y., Eds.; Conference Track Proceedings. [Google Scholar]
  45. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  46. Gehring, J.; Auli, M.; Grangier, D.; Dauphin, Y. A Convolutional Encoder Model for Neural Machine Translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Vancouver, BC, Canada, 2017; pp. 123–135. [Google Scholar] [CrossRef]
  47. Kaiser, L.; Gomez, A.N.; Chollet, F. Depthwise Separable Convolutions for Neural Machine Translation. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  48. Zhang, Y.; Pezeshki, M.; Brakel, P.; Zhang, S.; Laurent, C.; Bengio, Y.; Courville, A. Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks. In Proceedings of the Interspeech 2016, San Francisco, CA, USA, 8–12 September 2016; pp. 410–414. [Google Scholar] [CrossRef] [Green Version]
  49. Wang, Y.; Deng, X.; Pu, S.; Huang, Z. Residual Convolutional CTC Networks for Automatic Speech Recognition. arXiv 2017, arXiv:1702.07793. [Google Scholar]
  50. Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; Dauphin, Y.N. Convolutional Sequence to Sequence Learning. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; Precup, D., Teh, Y.W., Eds.; Volume 70, pp. 1243–1252. [Google Scholar]
  51. Bai, S.; Kolter, J.Z.; Koltun, V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar]
Figure 1. The training/testing settings in the problem of action labeling through temporal alignment learning. (Left) During training, only the order of the occurring actions is given; the model is learned by maximizing the probability of all possible frame-to-label alignments. (Right) During testing, no annotation is given; as the learned model already encodes the temporal structure of the data, it predicts the actions frame by frame without any further information.
Figure 2. The proposed Hybrid-CTC structure consists of a hybrid encoder with a CTC output layer. See Table 3 for architectural details of different temporal encoders. See Figure 4 and Section 3.4 for how the CTC layer aggregates all possible input–output alignments. See Section 4.1 for how to sub-sample frames from raw video as pre-processing before submitting to CNN encoders.
Figure 3. Similar frame-wise accuracy may have large qualitative differences. The top row generates neighbouring segments. The bottom row generates the same proportion of segments but they are non-contiguous, leading to numerous over-segmentation errors.
Figure 4. Blank nodes indicate unlabeled frames. CTC addresses the problem of unknown labels by aggregating associate paths (Equation (4)).
Figure 5. Res-50 backbone; the 2nd, 3rd, and 4th residual blocks are internally similar and are therefore omitted.
Figure 6. For illustration, an input with T = 6 frames and | A | = 3 annotated actions. High similarity is indicated by thick lines between frames. Note that π 1 stays at the same action prediction across similar frames. Consequently, Equation (6) weights π 1 higher over π 2 . In contrast, π 1 and π 2 are equally treated in original CTC.
Figure 7. Loss curves in log scale w.r.t. the number of epochs and the elapsed clock time. Vertical dotted lines of different colors indicate where the loss function on the corresponding validation set no longer dropped between two consecutive epochs, which is the point where the learning process actually terminated. (Note: due to time and hardware limitations, only a subset of ∼1 k samples was selected for cross-validation training on the YouCook2 dataset).
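The stopping rule marked by the vertical dotted lines in Figure 7 can be written down directly; the sketch below follows the description in the caption, and `min_delta` is an added tolerance not mentioned in the text.

```python
def should_stop(val_losses, min_delta=0.0):
    """Stop once the validation loss no longer drops between two consecutive epochs,
    as marked by the vertical dotted lines in Figure 7. `min_delta` is an added tolerance."""
    return len(val_losses) >= 2 and val_losses[-1] >= val_losses[-2] - min_delta
```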
Figure 8. Frame and alignment accuracy on randomly selected test videos from Breakfast (Top) and YouTube Instructions (Bottom), respectively. (Note: YouTube Instructions contains a background class (▬), usually present between neighboring steps, whose fraction varies from 46% to 83% across tasks, whereas Breakfast is tightly trimmed).
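The frame accuracy reported in Figure 8 is plain mean-over-frames. Since the background class of YouTube Instructions can dominate a video, a small helper like the following (our own utility; the `ignore_label` convention is an assumption) makes it easy to report accuracy with or without background frames.

```python
import numpy as np

def frame_accuracy(pred, gt, ignore_label=None):
    """Mean-over-frames (MoF). Optionally ignore a background class, which in
    YouTube Instructions can dominate a video (see the note of Figure 8)."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    mask = np.ones_like(gt, dtype=bool) if ignore_label is None else (gt != ignore_label)
    return float((pred[mask] == gt[mask]).mean()) if mask.any() else 0.0
```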
Table 1. State-of-the-art performance under different training conditions.
Trained with Frame-Wise Annotations: MPII Cooking2 [17] | MERL Shopping [11] | 50 Salads [18] | GTEA [19] | JIGSAWS [20,21]
MSB-RNN [11] (2016): 41.22% (mAP)  80.31% (mAP)
ED-TCN [12] (2017): 74.1% (F.Acc)  64.2% (mAP)  64.7% (F.Acc) 🟉  59.8% (Edit) 🟉  64.0% (F.Acc)  80.8% (F.Acc)  84.7% (Edit)
TCED [14] (2019): 66.3% (F.Acc) 🟉  62.5% (Edit) 🟉  75.9% (F.Acc) 🟉🟉  71.3% (Edit) 🟉🟉  63.4% (F.Acc)  70.9% (Edit)  82.2% (F.Acc)  87.7% (Edit)
ASRF [15] (2021): 84.5% (F.Acc) 🟉  79.3% (Edit) 🟉  84.9% (F1@0.10) 🟉  83.5% (F1@0.25) 🟉  77.3% (F1@0.50) 🟉  77.3% (F.Acc)  83.7% (Edit)  89.4% (F1@0.10)  87.8% (F1@0.25)  79.8% (F1@0.50)
Trained with Only Order of Actions: Breakfast [22] | Hollywood Ext. [23] | 50 Salads | YouTube Instructions [24] | YouCook2 [25]
GRU + HMM [1] (2017): 33.3% (MoF)  47.3% (IoD)  11.9% (IoU)  51.1% (IoD)
ProcNet [25] (2018): 50.6% (Jac.)  37.0% (IoU)  37.1% (Rec.)  30.4% (Prec.)  33.4% (F1)
Unsupervised [26] (2019): 41.8% (MoF)  30.2% (MoF 🟉)  35.5% (MoF 🟉🟉)  39.0% (MoF)
Duration [16] (2021): 55.7% (F.Acc)  36.3% (IoU)  50.1% (F.Acc)  31.4% (IoU)
All results are reported without any form of supervision on test data, with only the temporal ordering of actions available for training, unless marked otherwise: one marker denotes that the temporal ordering of actions is additionally provided for the test video; the other denotes even weaker supervision, where even the order of actions is unavailable and only the unordered set of occurring actions is given. All results are reported at the low-level granularity with 48 sub-action classes, in which each mid-level action is further divided into pre-, core-, and post-phases, unless marked with 🟉. 🟉 denotes that evaluations were performed at mid-level action granularity with 18 sub-action classes; the mid-level labels differentiate actions like 'cut tomato' from 'cut cucumber', whereas the higher-level labels combine these into a single class, 'cut'. 🟉🟉 denotes that evaluations were performed at eval-level action granularity with 9 sub-action classes such as 'cut', 'peel', and 'add dressing'. Metric definitions: (1) MoF (mean-over-frames) is equivalent to F.Acc and refers to the average percentage of correctly labeled frames. (2) IoU, IoD, and Jac. (intersection over union/detection/prediction) refer to |I ∩ I*| / |I ∪ I*|, |I ∩ I*| / |I*|, and |I ∩ I*| / |I|, respectively, where I* and I are the ground-truth and predicted intervals. IoU penalizes any misalignment of the proposal, whereas Jaccard only penalizes the portion of a predicted segment that extends beyond the ground truth. (3) Edit, F1, Prec., and Rec.: see Equations (10)–(12) in Section 4.3 for definitions.
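As a reading aid for the metric definitions above, the helper below computes the three interval ratios for a single predicted/ground-truth segment pair. The assignment of denominators follows our reconstruction of the footnote, and the function name and example intervals are ours.

```python
def interval_metrics(pred, gt):
    """IoU / IoD / Jaccard for one predicted interval I and one ground-truth interval I*,
    with denominators assigned as in the footnote above. Intervals are (start, end)."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    iou = inter / union if union > 0 else 0.0
    iod = inter / (ge - gs) if ge > gs else 0.0  # normalized by the ground-truth interval
    jac = inter / (pe - ps) if pe > ps else 0.0  # normalized by the predicted interval
    return iou, iod, jac

print(interval_metrics((2.0, 8.0), (4.0, 10.0)))  # (0.5, 0.666..., 0.666...)
```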
Table 2. Network architecture of the I3D Res-50 backbone [37]. Residual blocks are shown in brackets, next to which is the number of repeated blocks in the stack. Each convolutional layer is followed by batch normalization and ReLU. Down-sampling is performed after each stack with a stride of 2 along the width and height dimensions. res5 culminates with a global spatiotemporal pooling layer outputting a 512-dimensional feature vector, which is subsequently fed to a fully connected layer outputting final class probabilities through softmax activations. The dimension C of the last fully connected layer is equal to the number of classes in the target dataset.
Layer | I3D Res-50 | Output Size
res1 | 1 × 7 × 7, 64 | L × W/2 × H/2
res2 | [1 × 1 × 1, 64; 1 × 3 × 3, 64; 1 × 1 × 1, 256] × 3 | L × W/4 × H/4
res3 | [1 × 1 × 1, 128; 1 × 3 × 3, 128; 1 × 1 × 1, 512] × 4 | L × W/8 × H/8
res4 | [1 × 1 × 1, 256; 1 × 3 × 3, 256; 1 × 1 × 1, 1024] × 6 | L × W/16 × H/16
res5 | [1 × 1 × 1, 512; 1 × 3 × 3, 512; 1 × 1 × 1, 2048] × 3 | L × W/32 × H/32
fc | global pooling → softmax | 1 × 1 × C
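For readers who want a code-level picture of the rows above, here is a minimal PyTorch sketch of one (1 × 1 × 1, 1 × 3 × 3, 1 × 1 × 1) bottleneck of the kind stacked in res2–res5. Channel counts default to those of res2; the stride handling and temporal kernel sizes are simplified assumptions rather than the exact I3D configuration.

```python
import torch.nn as nn

class Bottleneck3D(nn.Module):
    """One (1x1x1, 1x3x3, 1x1x1) bottleneck of the kind stacked in res2-res5 (Table 2).
    Channel counts are those of res2; strides and temporal kernels are simplified."""
    def __init__(self, in_ch=64, mid_ch=64, out_ch=256):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 1, 1), bias=False),
            nn.BatchNorm3d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv3d(mid_ch, mid_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False),
            nn.BatchNorm3d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv3d(mid_ch, out_ch, kernel_size=(1, 1, 1), bias=False),
            nn.BatchNorm3d(out_ch),
        )
        self.shortcut = nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False) if in_ch != out_ch else nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                       # x: (B, C, L, H, W)
        return self.relu(self.block(x) + self.shortcut(x))
```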
Table 3. Structural and training details of the temporal backbones used in experiments.
Backbone | BLSTM | Full Convolution | Transformer
Dimension | 2 × 1024 | 15 × (Xception_Block + 1536) | (6 + 6) × (512 + 2048 + 512)
# Parameters | 22 M | 35 M | 40 M
Optimization | SGD, batch size = 1 | Adam, batch size = 32 | Adam, batch size = 32
Momentum | 0.9 | – | –
Learning Rate | 10^−2 → 10^−4 | 10^−3 → 10^−4 | 10^−3 → 10^−4
Dropout | 0.3 | 0.8 | 0.1
Training Time/Batch | 0.76 s/32 | 0.34 s | 0.41 s
🟉 Convergence (Iterations) | 1.0 × 10^6 × 32 | 0.8 × 10^6 | 3.0 × 10^6
🟉 Convergence (Clock Time) | 9 days | 3.4 days | 14 days
🟉 Statistics are collected on the Breakfast dataset [38] on a single GPU; however, convergence curves are generally consistent across the other datasets (Section 4.2) and are therefore omitted for presentation. The time to train the I3D Res-50 CNN (2 weeks) is also excluded from the statistics.
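Transcribed into a configuration sketch, the three training set-ups of Table 3 could be expressed as follows; the dict layout, key names, and the interpretation of the learning-rate column as a start-to-end decay are our assumptions.

```python
import torch

# Hyper-parameters transcribed from Table 3; scheduler details are not specified there.
CONFIGS = {
    "blstm":       dict(optim="sgd",  batch_size=1,  lr=1e-2, lr_end=1e-4, momentum=0.9, dropout=0.3),
    "fullconv":    dict(optim="adam", batch_size=32, lr=1e-3, lr_end=1e-4, dropout=0.8),
    "transformer": dict(optim="adam", batch_size=32, lr=1e-3, lr_end=1e-4, dropout=0.1),
}

def build_optimizer(params, cfg):
    """Build the optimizer named in Table 3 for a given backbone configuration."""
    if cfg["optim"] == "sgd":
        return torch.optim.SGD(params, lr=cfg["lr"], momentum=cfg["momentum"])
    return torch.optim.Adam(params, lr=cfg["lr"])
```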
Table 4. Accuracy across different datasets (%). All referenced baselines under both training conditions (Row 1∼4) are re-implemented under the same empirical control (i.e., identical datasets and evaluation metrics) to facilitate consistent comparisons against our proposed variants 🟉 . Results are means over 4 runs (variances are omitted for readability). All differences are significant  ( p < 0.01 ) . The bold-faced font highlights the best results obtained on the given data set. The values in brackets calculate the absolute increments of the current row (denoted by ◯) relative to the row referenced in the rightmost column. The second to the rightmost column calculates the statistical means and variances of the corresponding increments across different datasets within the same row.
Datasets (1) | 50 Salads [18] (2013) | YouTube Instructions [24] (2016) | Breakfast [38] (2016) | YouCook2 [25] (2018)
Size (hours) | 4.5 h | 5 h | 77 h | 176 h
#Samples | 50 | 150 | 2 k | 2 k / 14 k
#Classes / #Sub-actions | 2 / 17 | 5 / – | 10 / 62 | 90 / open
Sample Length | 4∼5 min | 2 min | 0.5∼5 min | 5 min
Label Length | 37∼72 steps | 7∼10 steps | 3∼10 steps | 3∼15 steps
Recording (width × height × fps) | 640 × 480 × 30 | YouTube | 320 × 240 × 15 | YouTube
Recording (Camera / Background) | Fixed / Stable | Dynamic / Open | Fixed / Stable | Dynamic / Open
Frame-wise Annotation | 1. CNN + BLSTM [11] (2016) | Frame Acc. | 76.9 | 60.8 | 60.6 | 45.8
 | | mAP @ 0.25 | 72.5 | 47.0 | 64.9 | 33.7
 | | Edit | 71.4 | 41.9 | 61.8 | 28.4
 | 2. ED-TCN [12] (2017) | Frame Acc. | 82.1 | 64.9 | 64.7 | 48.9
 | | mAP @ 0.25 | 73.4 | 50.2 | 65.7 | 36.5
 | | Edit | 68.9 | 44.7 | 59.6 | 30.4
 | 3. TCED [14] (2019) | Frame Acc. | 68.1 | 66.0 | 53.7 | 47.9
 | | mAP @ 0.25 | 68.5 | 54.0 | 61.3 | 35.9
 | | Edit | 66.0 | 48.3 | 57.1 | 30.7
 | 4. ASRF [15] (2021) | Frame Acc. | 84.5 (+4.5) | 81.9 (+5.5) | 67.6 (+4.5) | 59.5 (+11.8) | (8.05 ± 1.90)
 | | mAP @ 0.25 | 83.5 (+11.9) | 65.8 (+4.4) | 72.4 (+8.3) | 43.8 (+8.1) | (9.53 ± 4.44)
 | | Edit | 79.3 (+4.1) | 58.0 (2.5) | 68.9 (+3.8) | 36.9 (+2.7) | (8.23 ± 2.32)
Video-level Annotation | 5. Duration [16] (2021) (Baseline) | Frame Acc. | 70.7 | 67.5 | 55.7 | 42.1 | ΔBase = (◯ − 5. Baseline)
 | | mAP @ 0.25 | 63.2 | 54.2 | 56.6 | 31.5
 | | Edit | 59.3 | 47.8 | 51.3 | 26.9
 | 6. I3D + BLSTM + CTC (Ours) | Frame Acc. | 56.8 (13.8) | 54.3 (13.2) | 44.8 (10.9) | 33.8 (8.2) | (11.54 ± 2.53)
 | | mAP @ 0.25 | 50.8 (12.3) | 43.6 (10.6) | 45.5 (11.1) | 25.3 (6.1) | (10.05 ± 2.69)
 | | Edit | 47.7 (11.6) | 38.4 (9.3) | 41.3 (10.0) | 21.7 (5.3) | (9.07 ± 2.69)
 | 7. I3D + FC + CTC (Ours) | Frame Acc. | 62.9 (7.7) | 60.1 (7.4) | 49.6 (6.1) | 37.5 (4.6) | (6.46 ± 1.42)
 | | mAP @ 0.25 | 56.5 (6.6) | 48.5 (5.7) | 50.6 (5.9) | 28.2 (3.3) | (5.41 ± 1.45)
 | | Edit | 59.6 (+0.33) | 48.1 (+0.26) | 51.6 (+0.29) | 27.1 (+0.15) | (0.26 ± 0.07)
 | 8. I3D + Transformer + CTC (Ours) | Frame Acc. | 74.9 (+4.3) | 71.6 (+4.1) | 59.1 (+3.4) | 44.6 (+2.6) | (3.58 ± 0.78)
 | | mAP @ 0.25 | 67.1 (+3.9) | 57.6 (+3.4) | 60.1 (+3.5) | 33.5 (+1.9) | (3.20 ± 0.85)
 | | Edit | 70.6 (+11.3) | 56.9 (+9.1) | 61.1 (+9.8) | 32.1 (+5.1) | (8.83 ± 2.62)
 | 9. I3D + Transformer + R-CTC (Ours) | Frame Acc. | 79.9 (+9.3) | 76.3 (+8.8) | 63.0 (+7.3) | 47.6 (+5.5) | (7.74 ± 1.70)
 | | mAP @ 0.25 | 71.5 (+8.3) | 61.3 (+7.1) | 64.0 (+7.4) | 35.6 (+4.1) | (6.76 ± 1.81)
 | | Edit | 75.1 (+15.8) | 60.5 (+12.7) | 65.0 (+13.7) | 34.1 (+7.2) | (12.3 ± 3.68)
(1) Evaluations on 50 Salads were performed at mid-level action granularity with 18 sub-action classes. 🟉 Some values reported in this table may differ from those in the original literature, even under the same dataset/training/evaluation combinations, due to re-implementation; readers interested in the originally reported performance should consult the cited papers of the referenced methods.
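The Edit rows in Table 4 refer to the segmental edit score of Equations (10)–(12) (defined in Section 4.3, not reproduced here). A common reference formulation, which may differ in detail from the paper's exact definition, is sketched below: a normalized Levenshtein distance over the ordered segment labels.

```python
def edit_score(pred_segments, gt_segments):
    """Segmental edit score (common formulation): 100 * (1 - normalized Levenshtein
    distance) between the predicted and ground-truth ordered segment-label sequences."""
    m, n = len(pred_segments), len(gt_segments)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred_segments[i - 1] == gt_segments[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + cost)
    return 100.0 * (1.0 - D[m][n] / max(m, n, 1))
```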
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
