End to End Alignment Learning of Instructional Videos with Spatiotemporal Hybrid Encoding and Decoding Space Reduction

Abstract: We address the problem of densely aligning actions in videos at the frame level when only the order of occurring actions is available, in order to avoid the time-consuming effort of accurately annotating the temporal boundaries of each action. We propose three task-specific innovations under this scenario: (1) To encode fine-grained spatiotemporal local features and long-range temporal patterns simultaneously, we test three popular backbones and compare their accuracy and training times: (i) a recurrent LSTM; (ii) a fully convolutional model; and (iii) the recently proposed Transformer. (2) To address the absence of ground-truth frame-by-frame labels during training, we apply connectionist temporal classification (CTC) on top of the temporal encoder to recursively collect all theoretically valid alignments, and further weight these alignments with frame-wise visual similarities, avoiding a significant number of degenerate paths and improving both recognition accuracy and computational efficiency. (3) To quantitatively assess the quality of the learned alignment, we apply a comprehensive set of frame-level, segment-level, and video-level evaluation measures. Extensive evaluations verify the effectiveness of our proposal, with performance comparable to that of fully supervised approaches across four benchmarks of different difficulty and data scale.


Introduction
Fine-grained temporal action segmentation/alignment [1][2][3] is important in many applications, such as daily activity understanding, human motion analysis, and surgical robotics, to name a few. Given a video of length T, x = (x_1, · · · , x_T), the goal of temporal segmentation/alignment is to localize each occurrence of a given action a_n ∈ A in the time domain. A frame-to-action segmentation/alignment can be defined mathematically either densely, as a sequence of action labels, one per frame, π = (π_1, · · · , π_T), π_t ∈ A; or sparsely, as a set of temporally segmented clips l = (l_1, · · · , l_N), l_n ∈ A, with each segment associated with a start time, finish time, and label. N = |l| is the length of the transcript sequence; note that usually T ≫ N, since the sampling frequency at the encoding end is orders of magnitude higher than the granularity of the manual labeling at the decoding end.
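As a minimal illustration of the two representations (a sketch, not the paper's code, with hypothetical action names), the dense per-frame labeling π and the sparse segment/transcript view can be converted into each other as follows:

```python
# Converting between the dense frame-level labeling pi = (pi_1, ..., pi_T)
# and the sparse segment view: (start, finish, label) triples plus the
# ordered transcript of occurring actions.

def frames_to_segments(pi):
    """Collapse per-frame labels into (start, finish, label) segments."""
    segments = []
    start = 0
    for t in range(1, len(pi) + 1):
        if t == len(pi) or pi[t] != pi[start]:
            segments.append((start, t - 1, pi[start]))
            start = t
    return segments

def transcript(pi):
    """The ordered action sequence: segment labels with repetitions removed."""
    return [label for _, _, label in frames_to_segments(pi)]

# T = 9 frames collapse into N = 3 segments, illustrating T >> N.
pi = ["crack_egg"] * 3 + ["stir"] * 4 + ["pour"] * 2
print(frames_to_segments(pi))  # [(0, 2, 'crack_egg'), (3, 6, 'stir'), (7, 8, 'pour')]
print(transcript(pi))          # ['crack_egg', 'stir', 'pour']
```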
The difference between the segmentation and alignment tasks is that during training, only the ordered sequence of video-level actions (defined as the transcript [4][5][6]) is available for alignment, whereas the class of each frame is given for segmentation. Figure 1 shows the training/testing settings of temporal action alignment learning, as compared to temporal action segmentation, where accurate dense labels for each frame are also available during training. We also restrict our focus to modeling instructional videos with relatively stable backgrounds, usually composed of dozens of actions lasting minutes, recorded in kitchen or surveillance setups. Under this setting, research effort is spared from variance in extrinsic shooting conditions and can concentrate on the major challenges of the task.

The Motivation of Long-Term Temporal Encoding
The first challenge lies in the relatively large temporal search space, resulting from long-range temporal dependencies and flexible temporal lengths, compared to action recognition (also known as action classification). In general, action alignment is more challenging than action recognition, for the following reasons:
• In action recognition, temporal localization is not taken into account: input samples are truncated beforehand to contain exactly the temporal span of a certain target action, leading to relatively short inputs (e.g., 2∼20 s in the UCF101 dataset). In frame-wise action alignment, inputs generally last minutes or hours.
• In action recognition, the background class is not taken into account, whereas in alignment, input samples may contain frames belonging to none of the target actions.
• In action alignment, dependencies can last across seconds or even minutes. The types of temporal dependencies include individual action durations, pairwise compositions between consecutive actions, and long-term compositions spanning multiple sub-actions. As an example from a cooking instance, when cutting a potato, it is difficult to recognize what is being cut, because it tends to be occluded by the hands holding it. The recognition of frames where the potato is being cut shares dependencies with previous frames where the potato is taken out before being cut.
• In action recognition, only one label needs to be assigned to the whole video, whereas the action alignment task needs to densely assign a label to each frame. Consider a video instance consisting of 20 frames sharing the same class. In action recognition, even if a convolutional network correctly predicts only 10 frames, it is still very likely to correctly predict the whole video; a powerful temporal encoder (illustrated in Figure 2) on top of the convolutional network would bring no improvement in this case, because per-video labels do not change whether or not per-frame predictions are neighboring. In action alignment, 10 accurate but non-neighboring predictions would lead to over-segmentation errors with up to 20 segments. Bidirectional temporal encoders are motivated to predict neighboring segments, because they encode late samples together with early ones for the final judgement, leading to fewer false positives compared to a purely convolutional network (see Figure 3).
Figure 2. Overview of the proposed architecture. See Table 3 for architectural details of the different temporal encoders; see Figure 4 and Section 3.4 for how the CTC layer aggregates all possible input-output alignments; see Section 4.1 for how frames are sub-sampled from the raw video as pre-processing before being fed to the CNN encoder.

Figure 3. Similar frame-wise accuracy may hide large qualitative differences. The top row generates neighbouring segments. The bottom row has the same proportion of correctly labeled frames, but the segments are non-contiguous, leading to numerous over-segmentation errors.
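The effect described above, identical frame accuracy but very different segment structure, can be quantified with a short sketch (hypothetical toy labels, not from the paper's experiments):

```python
# Two predictions with identical frame accuracy can differ wildly in
# segment count; the scattered one exhibits over-segmentation.

def accuracy(pred, gt):
    """Fraction of correctly labeled frames."""
    return sum(p == g for p, g in zip(pred, gt)) / len(gt)

def num_segments(pred):
    """Count maximal runs of identical consecutive labels."""
    return 1 + sum(pred[t] != pred[t - 1] for t in range(1, len(pred)))

gt         = list("AAAAAAAAAABBBBBBBBBB")
contiguous = list("AAAAAAAABBBBBBBBBBBB")  # 2 errors, grouped at the boundary
scattered  = list("AAAABAAAAABBBBABBBBB")  # 2 errors, scattered in the middle

assert accuracy(contiguous, gt) == accuracy(scattered, gt)  # same frame accuracy
print(num_segments(contiguous))  # 2
print(num_segments(scattered))   # 6 -> over-segmentation
```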

The Motivation of Hybrid Spatiotemporal Encoding
The second challenge lies in simultaneously encoding low-level fine-grained spatiotemporal features together with high-level long-range temporal patterns. Appearance information, i.e., the visual features presented by the static composing frames without taking temporal order into account, also serves as a vital indicator of the start and finish boundaries of actions (e.g., 'cutting a tomato' is often only subtly different from 'peeling a cucumber' spatially, such as in the food types 'tomatoes' or 'cucumbers' and food states 'sliced', 'diced', 'peeled', and so on).
We experiment with three deep neural network models for encoding. In each case, the encoder consists of two modules (or sub-networks): a spatiotemporal visual module (Table 2) encoding each frame into one feature vector, and a temporal module (Table 3) encoding the sequence of per-frame feature vectors into a sequence of sub-action labels. The visual module is common across the three models; they differ only in the temporal module (Figure 2).

The Motivation of Decoding Space Reduction
The third challenge lies in the fact that manual annotation of per-frame action labels for accurate training is too expensive to be feasible in practice. Automatic extraction from instructional transcriptions [4][5][6] can provide label sequences of occurring actions (hereafter referred to as 'transcripts'), without the accurate start and end frame of each action, which imposes new challenges in two ways:
• Densely aligning thousands of frames to a few sub-actions results in very large search spaces of possible alignments.
• There exist degenerate alignments that are visually inconsistent.
We introduce CTC [7] (Figure 2) to evaluate all alignments efficiently in a one-pass recursive traverse, and further incorporate frame-level visual similarities to down-weight trivial paths that deteriorate performance. Figure 4 shows the difference in training strategies under different granularities of supervision.

Related Work
We mainly focus on related deep end-to-end approaches, which can be grouped into two research directions according to the granularity of annotation: frame-level (segmentation) and video-level (alignment) supervision.

Action Segmentation
Before deep neural solutions, the most prevalent approaches were statistical models [8][9][10] based on a conditional independence assumption between segments, which ignore long-range dependencies and have generally been outperformed by the state of the art (see Table 1 for quantitative performance comparisons). MSB-RNN [11] uses a two-stream network and bi-directional LSTMs to learn representations and capture dependencies between video chunks, respectively. ED-TCN [12] uses temporal convolutions and pooling layers within an encoder-decoder architecture to learn long-range dependencies between frames. TDRN (temporal deformable residual network) [13] proposes two parallel temporal streams, facilitating temporal segmentation at a local, fine scale and at multiple long-range scales, respectively, to improve the accuracy of frame classification. TCED [14] introduces a learnable bilinear pooling in the intermediate layers of a temporal convolutional encoder-decoder net, in order to capture more complex local statistics than conventional pooling. ASRF [15] proposes to alleviate over-segmentation errors by detecting and refining action boundaries with a dedicated boundary regression module and a wider temporal receptive field.
One problem in this context arises from the fact that annotating action boundaries for training is very time- and cost-intensive, leading to recent efforts to train classifiers without the exact start and end frames of the related action classes (which is our case); the goal of this work is to infer frame boundaries given only an ordered list of the occurring actions.

Action Alignment
Compared with full supervision, only a few approaches rely solely on video-level class labels to localize actions in the temporal domain. GRU + HMM [1] proposes a combination of a recurrent neural network and a probabilistic model to perform inference over long sequences for a temporal alignment. TCFPN + ISBA [2] proposes a pyramid temporal convolutional network, iteratively updated by a training strategy named ISBA (iterative soft boundary assignment), to align action sequences with frame-wise action labels in a more efficient and scalable fashion. CDFL [3] proposes a constrained discriminative forward loss on top of the GRU + HMM framework. The duration network [16] treats the remaining duration of a given action as a predictable distribution conditioned on the type of action, and obtains the best alignment by maximizing its posterior probability.
In general, these methods are still more or less hand-engineered, involving statistical components designed on prior knowledge, and rely on sophisticated techniques to improve performance. To our best knowledge, the inherent integration of visual similarity into the CTC output layer with encoder-decoder spatiotemporal transducers sets our method apart from previous works, as the first purely end-to-end automatic approach without human intervention. We select ASRF [15] and duration network [16], respectively, as representatives of the state-of-the-art baselines under the two above-mentioned branches in Section 5.
All results are reported without any form of supervision on test data, with only the temporal ordering of actions available for training, unless marked with •. • denotes that additional information on the temporal ordering of actions is also provided for the test video. •• denotes even weaker supervision, where even the order of actions is not available, and only the un-ordered set of contained actions is given. All evaluations are performed at low-level granularity with 48 sub-action classes, where each mid-level action is further divided into pre-, core-, and post-phases, unless marked with † or ‡. † denotes that evaluations were performed at mid-level action granularity with 18 sub-action classes; the mid-level labels differentiate actions like 'cut tomato' from 'cut cucumber', whereas the higher-level labels combine these into a single class, 'cut'. ‡ denotes that evaluations were performed at eval-level action granularity with 9 sub-action classes such as 'cut', 'peel', and 'add dressing'. Metric definitions: (1) MoF (mean-over-frames) is equivalent to F.Acc and refers to the average percentage of correctly labeled frames. (2) IoU, IoD, and Jac. (intersection over union/detection/prediction) refer to |I∩I*|/|I∪I*|, |I∩I*|/|I|, and |I∩I*|/|I*|, respectively, where I and I* are the ground-truth and prediction intervals, respectively; IoU penalizes all misalignment of a proposal, while Jaccard only penalizes the portion of segments beyond the ground truth. (3) Edit, F1, Prec., and Rec.: see Equations (10)-(12) in Section 4.3 for definitions.

Automatic Label Extraction
Automatic label extraction is also related, as our action orders come from transcripts of video instructions [24,33]. Unlike these approaches, which focus on the text-processing part of the task, we assume that the discrete target label sequences are already available at the training stage.

Spatial Encoder
Generally, common 2D/3D backbones such as VGG [34], ResNet [35], Inception [36], and I3D [31] variants can be applied orthogonally; in our empirical practice, deep 3D backbones such as Res3D (Table 2 and Figure 5) were found to generally yield better performance and efficiency. The convolutional feature maps of the T frame representations x_1:T = (x_1, x_2, · · · , x_T), obtained from the last pooling layer with size L × W/32 × H/32, are fed into the temporal encoding architectures.
Table 2. Network architecture of the I3D Res-50 backbone [37]. Residual blocks are shown in brackets, next to which is the number of repeated blocks in the stack. Each convolutional layer is followed by batch normalization and ReLU. Down-sampling is performed after each stack with a stride of 2 along the width and height dimensions. res5 culminates in a global spatiotemporal pooling layer outputting a 512-dimensional feature vector, which is subsequently fed to a fully connected layer outputting the final class probabilities through softmax activation. The dimension C of the last fully connected layer equals the number of classes in the target dataset.

Temporal Encoder
In this section, we formally describe three common temporal architectures (Table 3) and empirically compare them for this task in Section 5, in terms of accuracy, training time, generalization at test time, and ease of use. Statistics are collected on the Breakfast dataset [38] on a single GPU; however, convergence curves are generally consistent across the other datasets (Section 4.2) and are therefore omitted for presentation. The time to train the I3D Res-50 CNN (2 weeks) is also excluded from the statistics.

LSTM
Following [39], we adopt a 1-layer bidirectional LSTM (BLSTM), taking the visual feature vectors x_t as input and outputting a class probability y_t(π_t), π_t ∈ A, for every frame t. The BLSTM has 1024 cells in each direction. The overall network is trained together with CTC (see Section 3.4.2 for training details). The output alphabet A is therefore augmented with the CTC blank class label, and decoding is performed with a beam search.

Full Convolution
Following [40], we adopt depth-wise separable convolution layers, which consist of a separate convolution along the time dimension for every channel, followed by a projection along the channel dimension (a position-wise convolution with filter width 1). Each convolution is followed by a shortcut connection, batch normalization, and ReLU. The overall network consists of 15 convolutional layers, also trained with a CTC loss, with sequences decoded by beam search (as above).

Transformer
Following [41], we adopt 6 encoder and 6 decoder layers with log_2 |A| attention heads; each attention block has 512 channels and is followed by two position-wise fully connected layers with 2048 and 512 channels, compared to 1536 channels in the fully convolutional network. Every encoder layer is a self-attention layer, where the input tensor serves as the attention queries, keys, and values at the same time.
Every decoder layer attends to the embedding produced by the encoder using common soft-attention: the encoder outputs are the attention keys and values, and the outputs of the previous decoding layer are the queries. The decoder produces target class probabilities, which are matched to the ground-truth labels by CTC decoding, and the whole network is trained with a cross-entropy loss.
This section demonstrates how to achieve more efficient decoding by early elimination of paths that obviously violate visual consistency with the existing labels. To understand how the search space of eligible candidate decoding paths can be reduced, we first briefly review how the original CTC decoding process performs general end-to-end alignment learning in sequential signal transduction tasks.

The Original CTC Decoding
CTC sums over all possible alignments under a conditional independence assumption. Firstly, a training sample of T frames is given as

x = (x_1, · · · , x_T),   (1)

where t indexes the T frames and x_t is a vector of frame features. According to the conditional independence assumption (CIA) of the original CTC formulation [7], the probability of a path π = (π_1, · · · , π_T), π_t ∈ A, is the step-wise product of the network output softmax activations y_t(π_t) at each frame t:

P(π|x) = ∏_{t=1}^{T} y_t(π_t),   (2)

where y_t(k) is the probability of the network outputting action k at time t given input x, with k ∈ A (A is the collective set of all possible actions). We refer to π over A as paths, to be distinguished from the action order l, which is naturally produced from a path π by applying the operator B that removes repetitions:

B(π) = l.   (3)

The probability of l sums over all paths consistent with l:

P(l|x) = Σ_{π: B(π)=l} P(π|x).   (4)

Finally, the negative log likelihood of observing l is

J = −ln P(l|x).   (5)
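For intuition, the path sum of Equation (4) can be checked by brute force on a toy example (a blank-free sketch with hypothetical softmax values; feasible only for tiny T and |A|):

```python
# Brute-force CTC transcript probability: enumerate every path pi of
# length T, keep those whose repetition-removed form B(pi) equals the
# transcript l, and sum the step-wise products of softmax outputs.
from itertools import product

def B(pi):
    """Remove consecutive repetitions: B((0,0,1,1)) -> [0, 1]."""
    return [a for i, a in enumerate(pi) if i == 0 or a != pi[i - 1]]

def p_transcript(y, l):
    """P(l | x) = sum over all paths pi with B(pi) = l of prod_t y[t][pi_t]."""
    T, A = len(y), len(y[0])
    total = 0.0
    for pi in product(range(A), repeat=T):
        if B(pi) == l:
            p = 1.0
            for t, k in enumerate(pi):
                p *= y[t][k]
            total += p
    return total

# y[t][k]: toy per-frame softmax outputs over |A| = 2 actions, T = 3 frames.
y = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
print(p_transcript(y, [0, 1]))  # paths (0,0,1) and (0,1,1), sum ≈ 0.72
```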

Decoding Search Space Reduction
One drawback of the original CTC is that Equation (2) weights all paths equally, causing the sum in Equation (4) to include visually inconsistent paths π that deteriorate performance.
We incorporate visual similarity into Equation (2) by rewarding visually consistent paths:

P̃(π|x) = ∏_{t=1}^{T} φ_t ψ_t,   (6)

where φ_t = y_t(π_t) represents the original Equation (2) formulation, ψ_t is a similarity-based reward that encourages a repetition π_t = π_{t+1} whenever s_t^{t+1} ≥ θ and discourages it otherwise, and θ is a minimum threshold of the frame similarity function s_t^{t+1} = f_sim(x_t, x_{t+1}).
l = green → yellow → orange, which is already known during training.
π_1 = green → green → green → yellow → yellow → orange.
π_2 = green → yellow → orange → orange → orange → orange.
Figure 6. For illustration, an input with T = 6 frames and |A| = 3 annotated actions. High similarity is indicated by thick lines between frames. Note that π_1 stays at the same action prediction across similar frames. Consequently, Equation (6) weights π_1 higher than π_2. In contrast, π_1 and π_2 are treated equally by the original CTC.
To summarize, a path π is consistent with the supervised action order l, i.e., B(π) = l, if and only if π_1 = l(1), π_T = l(N), and each middle node π_t, t = 2, · · · , T − 1, takes one of only two options: (1) Stay the same as the previous node, meaning that if π_{t−1} = l(j), then π_t = l(j) as well; whenever this case holds, a 'repetition' happens in π. (2) Transit from π_{t−1} = l(j) to the next label in l, meaning π_t = l(j + 1). Any other label assignment would cause B(π) = l to fail. We can thus conclude that the only difference between a valid path π and the supervised action order l is that π contains 'repetitions'; by inserting and removing 'repetitions', π and l can be converted into each other.
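The stay-or-transit condition above can be written as a direct validity check (an illustrative sketch, not the paper's code):

```python
# A path is consistent with the order l iff it starts at l[0], ends having
# emitted all of l, and every step either repeats the current action or
# advances to the next label in l.

def is_valid_path(pi, l):
    j = 0                      # index into the transcript l
    if pi[0] != l[0]:
        return False
    for t in range(1, len(pi)):
        if pi[t] == l[j]:      # option (1): stay -> a 'repetition'
            continue
        if j + 1 < len(l) and pi[t] == l[j + 1]:
            j += 1             # option (2): transit to the next label
        else:
            return False       # any other assignment breaks B(pi) = l
    return j == len(l) - 1     # all actions of l must have been visited

l = ["green", "yellow", "orange"]
print(is_valid_path(["green", "green", "yellow", "orange"], l))   # True
print(is_valid_path(["green", "orange", "yellow", "orange"], l))  # False
```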
In this example, π_t = π_{t+1} holds at t = 1, 2, 4 in path π_1, and at t = 3, 4, 5 in path π_2. π_t = π_{t+1} means that a 'repetition' happens in path π at frame t + 1, and there are two cases for judging whether such a 'repetition' should be encouraged, taking into account its consistency with the ground-truth alignment:
• s_t^{t+1} ≥ θ, which means that, apart from the supervised order information l, it is also observed (without supervision) that frame t + 1 is visually very similar to frame t, suggesting that such a 'repetition' should be additionally encouraged.
• s_t^{t+1} < θ, which means that, apart from the supervised order information l, it is also observed (without supervision) that frame t + 1 is visually not similar to frame t, suggesting that such a 'repetition' should not be encouraged.

How to Train the Proposed Decoder
The back-propagation through the CTC layer to obtain the closed form of ∂y_t(k*)/∂a_t(k) is quite cumbersome and may last several pages, so we choose not to present the mathematical derivation in detail; readers interested in the complete derivation of Equation (7), i.e., the gradient of P(l|x), can find it in the relevant literature and tutorials, such as [7,42], based on dynamic programming under the chain rule of derivation.
Here we briefly give the closed form of the forward loss J = −ln P(l|x), together with its backward gradient w.r.t. the neural network output y_t(k) (the response of label k at time t):

∂J/∂a_t(k) = y_t(k) − (1 / P(l|x)) Σ_{j: l(j)=k} α_t(j) β_t(j),   (7)

where a_t(k) are the un-normalised outputs before the softmax activation is applied,

y_t(k) = exp(a_t(k)) / Σ_{k*} exp(a_t(k*)),

with k* ranging over all outputs; and α_t(j), β_t(j) are forward and backward variables, respectively. α_t(j) is defined as the summed probability of all partial paths satisfying B(π_1:t) = l_1:j, and β_t(j) appends to α_t(j) the probability of completing l from t + 1 onward, where l_1:j denotes the first j actions of l. Both can be calculated by recursive induction:

α_t(j) = (α_{t−1}(j) + α_{t−1}(j − 1)) · y_t(l(j)),   (8)
β_t(j) = β_{t+1}(j) · y_{t+1}(l(j)) + β_{t+1}(j + 1) · y_{t+1}(l(j + 1)).   (9)

Equation (7) is the final 'error signal' back-propagated through the network during training.

Visual Similarity Measurement
We cut the video into T/M clusters, each of length M. M is set to be conservatively shorter than common action lengths (e.g., ∼400 frames on average in the YouCook2 dataset), so that frames belonging to different actions do not blend into the same cluster.
Thus, s_t^{t+1} can be set under the resulting constraints that: • s_t^{t+1} ≥ θ (encouraging π_t = π_{t+1}) if and only if x_t and x_{t+1} fall within the same cluster; • s_t^{t+1} < θ (discouraging π_t = π_{t+1}) if and only if x_t or x_{t+1} is at the boundary between clusters.
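The clustering heuristic above can be sketched as a binary similarity indicator (only the thresholded behavior is modeled here; the actual f_sim is not specified by this sketch):

```python
# Frames in the same fixed-length cluster of size M are treated as similar
# (s >= theta, repetition encouraged); frames straddling a cluster boundary
# are treated as dissimilar (repetition discouraged).

def similarity_indicator(T, M):
    """Return s[t] for t = 0..T-2: 1 iff frames t and t+1 share a cluster."""
    return [1 if t // M == (t + 1) // M else 0 for t in range(T - 1)]

# 10 frames, clusters of length 4: boundaries after frames 3 and 7.
print(similarity_indicator(10, 4))  # [1, 1, 1, 0, 1, 1, 1, 0, 1]
```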

YouTube Instructions [24]
contains 150 samples from YouTube on five tasks: making coffee, changing a car tire, CPR, jumping a car, and potting a plant, with approximately two minutes per sample.

YouCook2 [25]
contains about 2k samples from YouTube on 90 cooking recipes, with approximately 3∼15 steps per recipe class, where each step is a temporally aligned narration collected from paid human workers.
Breakfast [38] contains about 2k samples on ten common kitchen tasks with approximately eight steps per task. The average length of each task varies from 30 s to 5 min.
50 Salads [18] contains 4.5 h of video of 25 people preparing 2 mixed salads each, with approximately 10k frames per sample. Each sub-action is annotated at two levels of granularity, and each mid-level action is further divided into pre-, core-, and post-phases.

Metrics
Frame-level accuracy is calculated as the percentage of correct predictions. Intuitively, frame-wise metrics ignore temporal patterns and occurrence orders in the sequential inputs. It is possible to achieve high frame measures but at the same time generate considerable over-segmentation errors, as visualized later, raising the need to introduce segmental metrics to penalize predictions that are out of order or over-segmented. Segment-level edit distance is also known as the Levenshtein distance, and only measures the temporal order of occurrence, without considering durations. It is therefore useful for procedural tasks in this work, where the order is the most essential.
It is calculated from segment insertions, deletions, and substitutions between the predicted order and the ground-truth sequence, then normalized to the range [0∼100] such that higher is better:

Edit = (1 − D(p, g) / max(|p|, |g|)) × 100,   (10)

where D is the Levenshtein distance between the predicted segment sequence p and the ground-truth sequence g.
Table 4. Baselines are re-implemented under the same empirical control (i.e., identical datasets and evaluation metrics) to facilitate consistent comparisons against our proposed variants. Results are means over 4 runs (variances are omitted for readability). All differences are significant (p < 0.01). Bold face highlights the best results obtained on the given dataset. The values in brackets are the absolute increments (denoted by ∆) of the current row relative to the row referenced in the rightmost column. The second-to-rightmost column gives the statistical means and variances of the corresponding increments across the different datasets within the same row.
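The segment-level edit score can be sketched via the standard Levenshtein recurrence (the normalization to [0, 100] as 1 minus distance over maximum length is an assumption of this sketch):

```python
# Levenshtein distance over segment label sequences, then normalized so
# that a perfect ordering scores 100 and higher is better.

def levenshtein(pred, gt):
    m, n = len(pred), len(gt)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gt[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def edit_score(pred, gt):
    return (1 - levenshtein(pred, gt) / max(len(pred), len(gt))) * 100

print(edit_score(["pour", "stir", "cut"], ["pour", "cut"]))  # one deletion away
```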

Segment-level mean average precision (mAP@k)
With an intersection-over-union (IoU) threshold k, IoU is calculated by dividing the intersection between each predicted segment I and the ground-truth segment of the same action category I* by their union:

IoU(I) = |I ∩ I*| / |I ∪ I*|.   (11)

I is considered a 'true positive' (TP) if IoU(I) ≥ k; otherwise it is a 'false positive' (FP). Average precision is accumulated across all categories. mAP@k is more invariant to small temporal shifts than the frame-level metric above.
Segment-level F1-score (F1@k). With an intersection-over-union (IoU) threshold k, where true positives are judged by IoU(I) ≥ k with the same label as the ground truth:

F1@k = 2 · Prec · Rec / (Prec + Rec).   (12)
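F1@k can be sketched as follows (toy segments; the greedy one-to-one matching used here is an assumption for illustration, not necessarily the exact evaluation protocol):

```python
# Segment-level F1@k: a predicted segment is a true positive if its IoU
# with an unmatched same-class ground-truth segment reaches threshold k.

def iou(a, b):
    """a, b: (start, finish) intervals with inclusive frame indices."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

def f1_at_k(pred, gt, k):
    """pred, gt: lists of (start, finish, label). Greedy one-to-one matching."""
    matched, tp = set(), 0
    for ps, pf, pl in pred:
        best, best_iou = None, 0.0
        for i, (gs, gf, gl) in enumerate(gt):
            if gl == pl and i not in matched:
                o = iou((ps, pf), (gs, gf))
                if o > best_iou:
                    best, best_iou = i, o
        if best is not None and best_iou >= k:
            matched.add(best)
            tp += 1
    fp = len(pred) - tp
    fn = len(gt) - tp
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gt   = [(0, 9, "cut"), (10, 19, "stir")]
pred = [(0, 7, "cut"), (8, 19, "stir")]
print(f1_at_k(pred, gt, 0.5))  # both segments overlap their match by > 50% -> 1.0
```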

Optimization:
The BLSTM is trained with vanilla SGD, a fixed momentum of 0.9, and an initial learning rate of 10^−2, reduced to 10^−4 whenever the error plateaus. Following [44], the fully convolutional network and the Transformer are trained with the ADAM optimizer, with an initial learning rate of 10^−3, reduced to 10^−4 on plateaus.
Transformer embedding: Information about the sequence order of the encoder and decoder inputs is fed to the model via fixed positional embeddings in the form of sinusoid functions. The Transformer is trained using teacher forcing: the ground truth of the previous decoding step is fed as input to the decoder, while during inference the decoder's own prediction is fed back.

Dropout:
Following [45], the BLSTM is trained with dropout probability p = 0.3 on the units of the input and recurrent layers. The fully convolutional network is trained with dropout probability p = 0.8 on the units of the batch normalization layers. The Transformer is trained with dropout probability p = 0.1.

Termination:
Training terminates when the loss on the validation set no longer drops over 2 consecutive epochs.
Software and Hardware: All the models are implemented in TensorFlow and trained on a single GeForce GTX 1080 Ti GPU with 11 GB memory.

Ablation Analysis
Rows 5∼9 of Table 4 show ablation studies combining the different modules elaborated in Section 3. The final performance improvement can be estimated by the average differences across the four tested benchmarks between the Baseline (Row 5) and the best-performing Row 9 (I3D + Transformer + R-CTC), denoted ∆_Base hereafter for simplicity:
∆_Base(9) = (∆F.Acc = 7.74 ± 1.70, ∆mAP = 6.76 ± 1.81, ∆Edit = 12.3 ± 3.68)

Ablation Analysis of Temporal Encoding Backbone
Rows 6∼8 investigate the impact of the different temporal backbones (Section 3) separately. The average differences of the BLSTM, the fully convolutional network, and the Transformer under original CTC across the four tested benchmarks were:
∆_Base(6) = (∆F.Acc = −11.54 ± 2.53, ∆mAP = −10.05 ± 2.69, ∆Edit = −9.07 ± 2.69)
∆_Base(7) = (∆F.Acc = −6.46 ± 1.42, ∆mAP = −5.41 ± 1.45, ∆Edit = 0.26 ± 0.07)
∆_Base(8) = (∆F.Acc = 3.58 ± 0.78, ∆mAP = 3.20 ± 0.85, ∆Edit = 8.83 ± 2.62)
The LSTM performed worse than the fully convolutional network, even though the recurrent model has the full context of every decoding time step, whereas the convolutional network only looks at a limited time window of the input. This is in accordance with current trends shifting from recurrent networks towards purely convolutional/self-attentional models in other related domains, such as translation [46,47] and speech [48,49]. Dedicated explorations [50,51] attribute this inferiority to the inherent limitations of recurrent structures: although the gating mechanism alleviates the difficulty of gradient propagation, the maximum memory is still restricted to a limited distance, usually not exceeding Θ(10^2) time steps. The fully convolutional model has fewer parameters and trains faster than the BLSTM and the bidirectional Transformer; the best-performing backbone was the Transformer.

Training Time
The Transformer and the fully convolutional network took approximately the same amount of time to compute a batch of 32 samples (Table 3). This is in accordance with the theoretical per-layer complexity Θ(T·d^2) of both models, where d is the number of channels. The Transformer has fewer channels (512) per self-attention block, but it is in effect a deeper model, with 3 × (6 + 6) = 24 layers in total. However, the fully convolutional network took fewer iterations to converge, completing the full curriculum in 3.5 days, compared to 14 days for the Transformer. This may be due to the more complex contraction among the self-attention queries, keys, and values during gradient propagation and weight updating; in contrast, the fully convolutional model has no reverse connections among learnable modules given a fixed context. The BLSTM naturally took more time to run one batch, since the computations within its layers have to be executed sequentially; consequently, it converged in almost double the wall-clock time of the fully convolutional network.

Ablation Analysis of CTC Search Space Reduction
Rows 8 and 9 investigate the impact of the decoding block (Section 3.4) separately. The average differences between the proposed R-CTC and the standard original CTC across the four tested benchmarks were:
∆_Decoding(9 − 8) = (∆F.Acc = 9.8 ± 1.25, ∆mAP = 6.53 ± 2.29, ∆Edit = 4.23 ± 1.58)
Generally, original CTC performed quite competitively at the order level despite its rather poor performance at the frame level (the finer the evaluation metric, the larger the gap grew in the rightmost column of Rows 8 and 9, highlighted with a brown background cell color that grows darker as the cell value grows), which is consistent with its underlying principle of learning order distributions rather than alignments. The performance improvement was statistically significant in all 12 dataset/measurement combinations tested, indicating that the proposed decoding procedure is effective.
Training Time
Figure 7 compares learning efficiency. The non-CTC model converged fastest, but its final validation loss was the highest. CTC converged more slowly than the non-CTC baseline because of the increased depth of the gradient propagation process. Although the gradient depth was further increased in reduced-CTC compared to original CTC, its convergence did not deteriorate but even accelerated, thanks to the significantly reduced search space of potential alignments.

Comparison with State-of-the-Art Frame-Wise Methods
In contrast to video-level supervision, fully supervised segmentation is a much more researched topic, and more diverse benchmarks are available for comparison. As expected, there was an obvious gap between the best-performing video-level methods and the state-of-the-art frame-wise method (Row 4, ASRF):
∆_Frame(4 − 9) = (∆F.Acc = 8.05 ± 1.90, ∆mAP = 9.53 ± 4.44, ∆Edit = 8.23 ± 2.32)
Although 11 of the 12 current best results across the dataset/measurement combinations were obtained by fully supervised approaches (highlighted in bold in Table 4), they rely on expensive frame-by-frame manual annotations and task-specific hand-engineered pre-/post-processing techniques, while our proposal is purely automatic and end-to-end, without human intervention both during the training phase and after deployment. Note that on the second dataset (YouTube Instructions), our proposed Row 9 (I3D + Transformer + R-CTC) even outperformed the best frame-wise result by a noticeable margin in terms of order-level evaluation (+12.7% by the re-normalized Edit measurement). The performance improvement was statistically significant, indicating that explicit alignment learning in a purely data-driven fashion may be more effective than fully supervised methods when end-to-end ordering information is well learned.

Qualitative Analysis
Some representative examples are shown in Figure 8. Baseline without CTC outputs were more noisy and over-segmented actions. CTC outputted a degenerated path, however the order was correct. Reduced-CTC did not suffer from over-segmentation, and had better localization and ordering.
Specifically, we found that Transformer + R-CTC captured longer-range temporal dependencies than the state of the art, especially in cases where distinct actions were visually very similar. For example, the Baseline wrongly predicted the ground-truth class 'add coffee' as 'pour coffee', neglecting the temporal dependencies between certain action pairs. Our hybrid encoder also made more reliable predictions of extremely short action instances falling between two long actions ('butter pan', 'withdraw stove', etc.). This suggests that the modified temporal classification is critical for improving the accuracy of predicted action boundaries. (Note that YouTube Instructions contains background frames, usually present between neighboring steps, whose fraction varied from 46% to 83% across different tasks, whereas Breakfast is tightly trimmed.)

Conclusions
We present Hybrid Reduced-CTC to align actions in instructional videos. The main contributions are a hybrid convolutional/Transformer/LSTM encoder to capture long-term temporal dynamics within and between actions, and a modified dynamic programming procedure that recursively sums every possible alignment while reducing the decoding search space by weighting qualified paths according to their visual consistency with the existing labels. Our results were competitive in accuracy on four publicly available datasets of different size and difficulty. Given the confirmed effectiveness of CTC in frame- and transcript-level localization, we plan to further explore its impact on the stability of convergence: we observed that CTC empirically tends to become more difficult to converge than non-CTC models as the length of the input video increases, which suggests that more effective under-sampling techniques may be needed to adapt CTC to longer time spans.

Conflicts of Interest:
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analysis, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: