HalluciNet-ing Spatiotemporal Representations Using 2D-CNN

Spatiotemporal representations learnt using 3D convolutional neural networks (CNNs) are currently the state of the art for action-related tasks. However, 3D-CNNs are notoriously memory- and compute-intensive. 2D-CNNs, on the other hand, are much lighter on computing resources and are faster, but their performance on action-related tasks is generally inferior to that of 3D-CNNs. Also, whereas 3D-CNNs simultaneously attend to appearance and salient motion patterns, 2D-CNNs are known to take shortcuts and recognize actions just by attending to the background, which is not very meaningful. Taking inspiration from the fact that we humans can intuit how actors will act and how objects will be manipulated, through years of experience and a general understanding of "how the world works," we suggest a way to combine the best attributes of 2D- and 3D-CNNs: we propose to hallucinate spatiotemporal representations, as computed by 3D-CNNs, using a 2D-CNN. We believe that requiring the 2D-CNN to "see" into the future encourages it to gain a deeper understanding of actions and of how scenes evolve, by providing a stronger supervisory signal. The hallucination task is treated as an auxiliary task, while the main task is any other action-related task, such as action recognition. Thorough experimental evaluation shows that the hallucination task indeed helps improve performance on action recognition, action quality assessment, and dynamic scene recognition. From a practical standpoint, being able to hallucinate spatiotemporal representations without an actual 3D-CNN would enable deployment in resource-constrained scenarios, such as on lower-end phones and edge devices, and/or with lower bandwidth. This translates to pervasion of Video Analytics Software as a Service (VA SaaS), e.g., automated physiotherapy options for financially challenged demographics.


Introduction
Spatiotemporal representations are densely packed with information regarding the appearance and salient motion patterns occurring in video clips, as illustrated in Fig. 2. Due to this representational power, they are currently the best performing models on action-related tasks such as action recognition [42,21,14,6], action quality assessment [31,30], skills assessment [5], and action detection [12]. This representational power comes at the cost of increased computational complexity [60,50,58,13]. Hadidi et al. [13] recently conducted an exhaustive comparison of various CNNs from the perspective of computational cost. We cite some of their findings in Table 1 to show how much costlier 3D-CNNs are than 2D-CNNs. For further analysis regarding the deployment of CNNs on edge devices, we refer readers to the extensive study reported in [13]. Very high compute requirements make 3D-CNNs unsuitable for deployment in resource-constrained scenarios.
Figure 1. Concept. Instead of computing spatiotemporal representations using computationally expensive 3D-CNNs, we propose to approximate them using a 2D-CNN from a single still image. We hypothesize that this hallucination task offers stronger supervision, which helps with action-related tasks while bringing down the computational and communication costs.
2D-CNNs are generally used for learning and extracting spatial features pertaining to a single frame/image. As such, typical 2D-CNNs, by design, do not take any motion information into account. Some works [48,49,33,10] have addressed this by using optical flow. Optical flow fires at all pixels that have moved/changed (refer to Fig. 2). This means that optical flow will also pick up cues from irrelevant activity happening in the background. A 3D-CNN, on the other hand, attends to the salient motion patterns characteristic of an action class. As a matter of fact, 2D-CNNs can also find shortcuts to recognize actions: instead of recognizing an action meaningfully from the foreground, a 2D-CNN may pick up enough cues from the background, as reported in [20,15]. These kinds of shortcuts might get the job done, but they are not very meaningful. However, 2D-CNNs are computationally lightweight, which makes them suitable for deployment on edge devices. In a nutshell, 2D-CNNs have the advantage of being computationally less expensive, while 3D-CNNs extract spatiotemporal features that have more representational power. In our work, we propose a way to combine the best of both worlds. Our inspiration comes from the observation that, given an image of a scene, humans can predict how the scene around them will evolve. They are able to do so because they have a good general understanding of how other people are expected to behave and how objects would move or be manipulated.
Figure 2. Visualization of the C3D model; illustration taken from [42] with permission. Notice that the model captures appearance in the first few frames and salient motion in the rest (every second row), unlike optical flow (every third row), which responds to all the moving pixels. Not all moving pixels are useful, and attending to useless pixels might adversely affect performance on the target task.
Building machines/computer vision systems with such capabilities has been a long-standing goal. To this end, we propose to hallucinate spatiotemporal representations, as computed by a 3D-CNN, using a 2D-CNN and from a single still frame (see Fig. 1).
Conceptually, encouraging a 2D-CNN to predict, from a single frame, the spatiotemporal representations pertaining to 16 frames provides a strong supervisory signal, which would help the 2D-CNN gain a deeper understanding of actions and of how a given scene evolves with time.
Practically, predicting spatiotemporal representations, instead of actually computing them, comes in handy in the following situations:
• Resource-constrained scenarios: many computer vision efforts in areas like automated physiotherapy, which are targeted at low-income groups, make use of 3D-CNNs. Low-income demographics are more likely to have devices with low computational resources, which are not suitable for running 3D-CNNs; in these cases, we can simply hallucinate the spatiotemporal representations.
• Limited and/or expensive bandwidth: VA SaaS is increasingly being employed. The bandwidth available for communication between clients and the cloud is usually limited and expensive. Using our method, we can hallucinate information pertaining to multiple frames (e.g., 16 frames) from just one frame, so that 15 out of every 16 frames need not be transmitted.
We propose to use the hallucination task as an auxiliary task alongside an action-related main task, such as action recognition or action quality assessment. Experimentally, we show that incorporating the hallucination loss during training helps in the following four cases: action recognition, fine-grained action recognition, action quality assessment, and dynamic scene recognition.

Related Work
Our work is related to predicting features, present or future, from the same or different modalities; to efficient/lightweight approaches; and to knowledge distillation. We briefly review the works closest to ours, and compare and contrast our approach against them.
Predicting features: Wang et al. [51] treat actions as transformations from a precondition to the effect. Essentially, they propose to learn the CNN parameters and transformation matrices, such that the product of features of precondition (initial) frames and transformation matrix will produce the features corresponding to effect (future) frames.
Hoffman et al. [17] proposed to hallucinate the depth modality using the RGB modality, and showed improvements over single modalities and over simple fusion of modalities.
Vondrick et al. [44] propose to learn to predict the future in feature space. Their approach also allows for multi-modal predictions. Given the current video frame, they propose to predict the representation of a future frame. However, the future frame representation is computed using a 2D-CNN pretrained on a dataset such as ImageNet or Places, which belongs to a non-human-action domain.
While Vondrick et al. [45] propose a framework to generate future frames by disentangling foreground and background, Vondrick and Torralba [46] propose to disentangle low-level details and high-level semantics with the use of a transformer. Learning to generate future frames helps the network learn useful representations that transfer well to other tasks like action recognition. However, our goal is not to predict a pixel-perfect future, but rather to make predictions at the semantic level. Instead of generating future frames, a few works [48,49,33,10] focus on learning to predict optical flow (very short-term motion information) from static images. Gao et al. [10] propose a better optical flow encoding method than previous works [48,49,33]. Their approach, by design, requires both an encoder and a decoder. Our approach, on the other hand, does not require a decoder, which helps reduce the computational load on resource-constrained edge devices. Moreover, our approach learns to hallucinate spatiotemporal representations corresponding to a stack of 16 frames, as compared to the motion information in two consecutive frames as in [48,49,33,10]. As can be seen in Fig. 2, optical flow attends to all kinds of motion, even irrelevant background motion, while spatiotemporal representations attend only to action-relevant salient motion patterns. Through experiments, we confirm the benefits of hallucinating and using spatiotemporal representations over optical flow prediction. Bilen et al. [2] introduce a novel, compact representation of videos called the 'dynamic image'. Dynamic images can be thought of as a summary of a video in a single image. Computing a dynamic image requires processing all the corresponding frames, whereas in our case hallucination requires processing just a single image.
TSD [59] distills a long video sequence into a very short one, and is aimed at VA SaaS scenarios where bandwidth is limited or expensive. Bhardwaj et al. [1] propose to learn a student recurrent neural network that can classify a video using fewer frames. Our goal is to hallucinate 3D-CNN representations using a 2D-CNN from a single frame. We also discuss our intuition that stronger supervision can be provided artificially, without any manual annotation effort, using our hallucination loss. [3,39] focus on predicting optical flow stream features from the raw RGB image stream. While their methods can get rid of the optical flow stream, they still process all the frames through the RGB stream. Our approach enjoys the benefits of processing fewer frames, a reduced computational load, and stronger supervision.
3D convolutions can be factorized into 2D convolutions (spatial convolutions) followed by 1D convolutions (along the temporal dimension). This concept has been studied in numerous works [40,43,35,60,52], and better designs have been developed that take advantage of this factorization. 3D-CNNs inherently have a larger number of trainable parameters than their 2D counterparts, because of which they might be prone to overfitting [60]. To address this, [60] proposed using 2D convolutions along with 3D convolutions. Tran et al. [43] explore many 3D-CNN variants and observe that replacing 3D convolutions with (2+1)D convolutions makes more non-linearities available in the CNN, which may allow it to learn more complex functions. Xie et al. [52] found that 3D convolutions in the bottom layers might be redundant and may be replaced with 2D convolutions, with 3D convolutions kept in the top layers for better temporal reasoning. Following this design, they obtained better results with lower complexity.
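To make the factorization concrete, the following is a minimal sketch that counts the weights of a full 3D convolution and of a spatial-then-temporal factorization. The channel sizes, kernel sizes, and the intermediate width `c_mid` are illustrative assumptions, not values taken from the cited works, and biases are ignored.

```python
def conv3d_params(c_in, c_out, t, k):
    """Number of weights in a full t x k x k 3D convolution (bias ignored)."""
    return c_in * c_out * t * k * k


def conv2plus1d_params(c_in, c_out, t, k, c_mid):
    """Number of weights in a factorized convolution (bias ignored).

    A 1 x k x k spatial convolution maps c_in -> c_mid channels, and a
    t x 1 x 1 temporal convolution maps c_mid -> c_out channels.
    c_mid is a free design choice in this sketch.
    """
    spatial = c_in * c_mid * k * k
    temporal = c_mid * c_out * t
    return spatial + temporal


if __name__ == "__main__":
    # Illustrative sizes only: 64 -> 64 channels, 3 x 3 x 3 kernel.
    full = conv3d_params(64, 64, t=3, k=3)
    factorized = conv2plus1d_params(64, 64, t=3, k=3, c_mid=64)
    print("full 3D conv weights:   ", full)
    print("factorized conv weights:", factorized)
```

With equal channel widths, the factorized form uses fewer weights; a wider `c_mid` can be chosen to trade parameters for capacity.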
Lee et al. [27] introduce MFNet, in which spatiotemporal information is extracted from the feature maps of consecutive appearance blocks and used along with appearance information. This reduces the computational cost in comparison to two-stream approaches like [36]. In a concurrent work, Lin et al. [28] introduce a novel Temporal Shift Module (TSM) that allows information exchange among consecutive frames just by shifting channels, which gives strong temporal modeling ability with no additional computational cost.
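The channel-shifting idea can be illustrated with a small sketch. This is not the implementation of [28]; the fraction of channels shifted (`fold_div`), the shift directions assigned to each channel group, and the zero-filling at clip boundaries are assumptions made here for illustration. Features are represented as a plain list of per-frame channel lists.

```python
def temporal_shift(frames, fold_div=4):
    """Shift a fraction of channels along the time axis.

    frames: list of length T, each entry a list of C channel values.
    The first C // fold_div channels are taken from the next frame, the
    following C // fold_div channels from the previous frame, and the
    remaining channels are left in place. Positions with no source frame
    are filled with zeros. No multiplications are involved.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div
    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1      # borrow from the future frame
            elif c < 2 * fold:
                src = t - 1      # borrow from the past frame
            else:
                src = t          # unshifted channel
            if 0 <= src < num_frames:
                shifted[t][c] = frames[src][c]
    return shifted


if __name__ == "__main__":
    clip = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
    for row in temporal_shift(clip):
        print(row)
```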
While these works aim either to use less visual evidence or to be more computationally efficient, our solution of hallucinating spatiotemporal representations using a 2D-CNN from a single image aims to address both problems, and additionally provides stronger supervision.

Best of Both Worlds
Let us consider the visualization [57] of the C3D model [42] shown in Fig. 2, particularly the instance of a gymnast on a balance beam, to understand what a 3D-CNN actually learns to capture. We notice that C3D fires at pixels belonging to the body of the gymnast, and captures the cartwheel done by the athlete over the span of 16 frames.
What would happen if a 2D-CNN were asked to hallucinate C3D features pertaining to 16 frames just from looking at a single frame, i.e., the starting frame? To complete the hallucination task, the 2D-CNN would, for example, have to:
• learn to identify that there is an actor in the scene and localize them
• spatially segment the actors and objects
• identify that the event going on is a balance beam gymnastics event and that the actor is a gymnast
• identify that the gymnast is about to attempt a cartwheel
• predict how she would move while attempting the cartwheel
Figure 3. The teacher network is a 3D-CNN (e.g., C3D [42]), which computes a spatiotemporal representation from 16-frame clips, while the student network is a 2D-CNN (e.g., AlexNet [25]). The student network is a multitask network, jointly optimized for action recognition and to hallucinate the spatiotemporal representation (computed by the teacher network) from a single frame.
Here we have discussed only the case of the balance-beam action class, but readers can imagine the same for other classes. That is a lot of semantic detail to predict from a single frame. In a typical action recognition task, the network would be provided with just the action class label, which may be considered a weak supervision signal. Incorporating the hallucination task during training is equivalent to artificially providing dense labels, a much stronger supervisory signal. Joint actor-action segmentation datasets [54] aim to provide such detailed annotations; actor-action segmentation is an actively pursued research direction [18,11,55,19,53]. Following our proposition, however, we can get detailed supervision of a similar flavor (not exactly the same) for free, which saves tremendous annotation effort. The hallucination loss encourages the network to focus on actors and objects, and to develop a better general understanding of actions and of how objects are manipulated. 2D-CNNs will now be less likely to take shortcuts, i.e., to recognize actions from the background while ignoring the actual actor and the action being performed [20,15], because spatiotemporal features cannot be hallucinated from the background. Moreover, the ability to simply hallucinate spatiotemporal representations would allow us to replace 3D-CNNs with their 2D counterparts in resource-constrained scenarios.
To gain the benefits described above, we propose to use the hallucination task. Note that another way to do this would be to predict the future frames in pixel space. However, we are interested in predicting at the semantic level; pixel-perfect reconstruction is not our goal. So, rather than predicting in pixel space, we propose to predict in feature space.
The hallucination task can also be seen as distilling knowledge from a teacher network (3D-CNN), $f_t$, to a student network (2D-CNN), $f_s$, where $f_t$ is pretrained and then kept frozen, while the parameters of $f_s$ are learnt. Let $\phi_t$ and $\phi_s$ represent mid-level representations from $f_t$ and $f_s$, respectively, and let $F_T$ be the $T$-th video frame:
$\phi_t = f_t(F_0, F_1, \ldots, F_{T-1})$   (1)
$\phi_s = f_s(F_0)$   (2)
The hallucination loss, $L_{hallu}$ (Eq. 3), encourages $f_s$ to regress $\phi_s$ to $\phi_t$ by minimizing the Euclidean distance between $\phi_s$ and $\phi_t$:
$L_{hallu} = \lVert \phi_s - \phi_t \rVert_2$   (3)
The hallucination task is not the only goal. In addition to bringing down the computational cost, we would also like to improve performance on action-related tasks. To this end, we propose to incorporate the hallucination task as an auxiliary task alongside the actual action-related main task, such as action recognition. The main task loss (e.g., a classification loss), $L_{mt}$, is therefore used in conjunction with the hallucination loss, the idea being that the hallucination loss will help with the main task. The overall loss can be expressed as
$L = L_{mt} + \lambda L_{hallu}$   (4)
where $\lambda$ is a loss balancing factor. Our approach is presented in Fig. 3. Realization of our approach is very straightforward.
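As an illustration of how the two losses combine, the following minimal sketch computes the Euclidean distance between a student and a teacher representation and adds it, weighted by the balancing factor, to a main task loss. The function names are hypothetical, the representations are plain lists of floats, and the use of an unsquared, unaveraged distance is an assumption; the default balancing factor of 50 is the value reported in the implementation details.

```python
import math


def hallucination_loss(phi_s, phi_t):
    """Euclidean distance between student and teacher representations.

    phi_s: representation hallucinated by the 2D student from one frame.
    phi_t: representation computed by the frozen 3D teacher from a clip.
    """
    if len(phi_s) != len(phi_t):
        raise ValueError("representations must have the same dimensionality")
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(phi_s, phi_t)))


def total_loss(main_task_loss, phi_s, phi_t, lam=50.0):
    """Main task loss plus the weighted hallucination loss."""
    return main_task_loss + lam * hallucination_loss(phi_s, phi_t)


if __name__ == "__main__":
    student = [0.2, 0.7, 0.1]   # illustrative values only
    teacher = [0.3, 0.6, 0.1]
    print(total_loss(main_task_loss=1.25, phi_s=student, phi_t=teacher))
```

Since the teacher is frozen, only the student's parameters would receive gradients from either term.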

Experiments
We hypothesized that incorporating the hallucination task would help by providing a deeper understanding of actions. We evaluate the effect of incorporating the hallucination task on the following action-related tasks: action recognition, detailed action recognition, action quality assessment, and dynamic scene recognition.
Choice of networks: In principle, any 2D- and 3D-CNNs can be used as student and teacher networks, respectively. We choose ResNeXt-101 [14] as our teacher network and VGG11-bn as our student network. Unless mentioned otherwise, the teacher network is pretrained on the UCF-101 dataset and kept frozen. The student network is pretrained on the ImageNet dataset [4]. We refer to the network trained with the hallucination loss as HalluciNet, and to the one trained without it as simply the 2D-CNN or vanilla 2D-CNN.
Which layer to hallucinate? We choose to hallucinate the activations of the last bottleneck group of ResNeXt-101, which are 2048-dimensional. Representations from shallower layers have higher dimensionality and are less semantic.
Implementation details: We use PyTorch [32] to implement all the networks. Network parameters are optimized using the Adam optimizer [22] with a starting learning rate of 0.0001. λ in Eq. 4 is set to 50 unless specified otherwise. Further experiment-specific details are given along with each experiment. We will make our code publicly available.

Performance baselines:
Our performance baseline is a 2D-CNN with the same architecture that was trained without the hallucination loss. In addition, we compare against other methods, which we specify in each experiment.

Action recognition
In the first experiment, we evaluate whether the hallucination task helps with general action recognition. We compare the performance with the dense optical flow prediction from a static image approach [49], and the motion prediction from a static image approach [10].
Datasets: The UCF-101 [38] and HMDB-51 [26] action recognition datasets are considered. To be consistent with the literature, we adopt their experimental protocol. The central frames from the train and test samples are used for reporting performance; these are named UCF-static and HMDB-static, as in the literature [10].
We considered two cases as shown in Fig. 4. We found that fusing the hallucinated representations yielded better results. So we will consider that case in the remainder of the work.
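The two cases of Fig. 4 differ only in what the classifier sees at test time: either the task-specific representation alone, with the hallucinated representation dropped, or the concatenation of the two. The following minimal sketch illustrates the difference; the function names, the plain-list representations, and the single linear classifier are hypothetical simplifications, not the networks used in our experiments.

```python
def linear(features, weights, bias):
    """Apply a linear layer: one output per (weight row, bias) pair."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def predict_dropped(task_feat, hallu_feat, weights, bias):
    """Case 1: the hallucinated representation is only used for the
    training loss and is ignored when making the prediction."""
    return linear(task_feat, weights, bias)


def predict_fused(task_feat, hallu_feat, weights, bias):
    """Case 2: the hallucinated representation is concatenated with the
    task-specific representation before the classifier."""
    return linear(task_feat + hallu_feat, weights, bias)


if __name__ == "__main__":
    task_feat = [1.0, 2.0]    # illustrative values only
    hallu_feat = [3.0]
    print(predict_dropped(task_feat, hallu_feat, [[1, 1], [0, 1]], [0, 0]))
    print(predict_fused(task_feat, hallu_feat, [[1, 1, 1], [0, 0, 1]], [0, 0]))
```

Note that the fused classifier needs an input width equal to the sum of the two representation sizes.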
First, we show the evolution of the hallucination loss in Fig. 5. The gradual decrease in its value clearly shows that the 2D-CNN is learning to hallucinate the spatiotemporal representations. The starting value of the loss is low because the loss is computed after passing the activations through a sigmoid layer.
We summarize the performance on the action recognition task in Table 2. We find that incorporating the hallucination task helps on both datasets. Our HalluciNet outperforms prior approaches [49,10] on UCF101. On HMDB51, our HalluciNet yields better results than [49], but [10] works better than ours. However, our method has the advantage of being computationally lighter than [10], as it does not use a flow image generator network.
Figure 4. In the first case, the hallucinated representation is not concatenated with the task-specific representation and is simply dropped during testing. In the second case, the hallucinated representation is concatenated with the task-specific representation, and the concatenated representation is used to make the final prediction.

Detailed action recognition
We need suitable tasks for evaluating the utility of hallucinating the future. Evaluating performance on the ubiquitous task of recognizing actions in commonly used datasets, like the UCF-101 action recognition dataset, might not be sufficient. We need to evaluate on a task where the student network is required to hallucinate the future in order to "fill the holes" in the input visual data stream. Fine-grained, or detailed, action recognition makes for a good candidate task.

Method                      UCF-static   HMDB-static
App stream [10]             63.60        35.10
App stream ensemble [10]    64.00        35.50
Motion stream [10]          24.10        13.90
Motion stream [49]          14.30        4.96
App + Motion [10]           65.50        37.10
App + Motion [49]           64.50
Task description: In Olympic diving, athletes attempt many different types of dives. In a general action recognition dataset like UCF101, all these dives would be grouped under a single action class, Diving. However, these dives differ from each other in subtle ways. Each dive has the following five components: a) position (legs straight or bent); b) whether it starts from an armstand; c) rotation direction (backwards, forwards, etc.); d) the number of somersaults; e) the number of twists. Different combinations of these components produce unique types of dives. The task is to predict all five components of a dive using very few frames.
Why is this task more suitable? Unlike in general action recognition datasets like UCF-101 [38] or Kinetics [21], the actions in the diving samples of this dataset vary very subtly. Furthermore, the cues needed to differentiate or recognize a dive are distributed across the entire action sequence. So, to make the dive classification task more suitable for our case, we ask the network to classify a dive correctly using only a few frames. In particular, only every 16th frame is shown to the student network. We truncate diving samples to 96 frames. So, out of 96 frames, the student network is shown only 6 frames, based on which it needs to classify the dive.
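The frame budget above follows directly from the stride: taking every 16th frame of a 96-frame sample leaves 6 frames. A minimal sketch of this subsampling is given below; starting at frame index 0 is an assumption, and the function name is hypothetical.

```python
def sample_frame_indices(num_frames=96, stride=16, start=0):
    """Indices of the frames shown to the student network.

    Keeps one frame out of every `stride` frames of a sample truncated
    to `num_frames` frames. The starting offset is an assumption.
    """
    return list(range(start, num_frames, stride))


if __name__ == "__main__":
    indices = sample_frame_indices()
    print(indices, len(indices))
```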
Dataset: For this task, we use a recently released Diving dataset, MTL-AQA [30], which has 1059 training and 353 test samples.
Training procedure:
1. We take a teacher network pretrained on the UCF-101 dataset and a student network pretrained on the ImageNet dataset.
2. We first further pretrain the ImageNet-pretrained student network on UCF-101 action recognition, along with spatiotemporal representation hallucination, using the loss function in Eq. 4 with λ set to 50. For the vanilla 2D-CNN, we do not use the hallucination loss.
3. Finally, the student network is trained to classify dives. Since we gather evidence over six frames, we use an LSTM [16] to aggregate this evidence. The LSTM is single-layered, with a 256-dimensional hidden state. The LSTM's hidden state from the last time step is passed through separate linear layers, one for each of the properties of a dive. The student network is trained end-to-end for 20 epochs using the Adam solver with a constant learning rate of 0.0001.
Table 3. Performance comparison on the detailed action recognition task. Frames denotes the number of frames the corresponding method sees. P, AS, RT, SS, and TW stand for position, armstand, rotation type, number of somersaults, and number of twists, respectively.
The results of our models are summarized in Table 3, where we also compare them with other state-of-the-art 3D-CNN based approaches [29,30]. We observe that our HalluciNet outperforms the vanilla 2D-CNN on four out of five fields. The difference in performance is larger for RT, SS, and TW than for P, because position (legs straight or bent) may be equally identifiable from a single image or a clip, whereas RT, SS, and TW are more difficult for a plain 2D-CNN to predict without the hallucination task. In comparison, our HalluciNet has been trained to forecast the short-term future, and hence excels in situations that involve longer-term dynamics. Our HalluciNet even outperforms 3D-CNN based approaches that use more frames (MSCADC [30] and Nibali et al. [29]). C3D-AVG outperforms HalluciNet, but it is computationally very expensive and uses 16x more frames.

Assessing the quality of actions
Action quality assessment (AQA) is another task that can help bring out the utility of hallucinating spatiotemporal representations from still images using a 2D-CNN. In AQA, the task is to measure or quantify how well an action was performed. A good example of AQA is judging Olympic events like diving, gymnastics, and figure skating.
Dataset: MTL-AQA [30], the same as in Sec. 4.2.
Metric: Consistent with the literature, we report Spearman's rank correlation (in %).
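For reference, Spearman's rank correlation is the Pearson correlation computed on the ranks of the predicted and ground-truth scores. The following is a minimal standard-library sketch, not the evaluation code used in our experiments; the function names are hypothetical, tied values receive their average rank, and both inputs are assumed to be non-constant.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_percent(predicted, ground_truth):
    """Spearman's rank correlation, reported in percent."""
    return 100.0 * pearson(average_ranks(predicted), average_ranks(ground_truth))


if __name__ == "__main__":
    # Illustrative scores only, not results from any experiment.
    print(spearman_percent([60.5, 72.0, 48.3, 90.1], [58.0, 80.0, 50.0, 85.5]))
```

Because only ranks matter, the metric rewards ordering the performances correctly rather than matching the judges' scores exactly.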
We follow the same training procedure as in Sec. 4.2, except that for AQA task we use L2 loss to train, as it is a regression task. We train for 20 epochs with Adam as solver, and anneal the learning rate by a factor of 10 every 5 epochs.
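The annealing schedule can be written as a simple step function of the epoch index. In the sketch below, the starting learning rate of 0.0001 is assumed from the implementation details, epochs are counted from zero, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-4, factor=10.0, step=5):
    """Learning rate at a given zero-based epoch.

    The rate is divided by `factor` once every `step` epochs.
    """
    return base_lr / (factor ** (epoch // step))


if __name__ == "__main__":
    for epoch in range(20):
        print(epoch, learning_rate(epoch))
```

Over 20 epochs, this yields four plateaus of five epochs each.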
The results are presented in Table 4. Incorporating the hallucination task helps improve performance on the AQA task. Our HalluciNet also outperforms C3D-SVR and is close to MSCADC. Although C3D-AVG performs best on the AQA task, this experiment still supports the advantage of using the hallucination task.

Dynamic scene recognition
Dataset: Feichtenhofer et al. introduced the YUP++ dataset for the task of dynamic scene recognition in [8]. It has a total of 20 scene classes. The use of this dataset to evaluate the utility of inferred motion was suggested in [10]. In the work by Feichtenhofer et al., 10% of the samples are used for training, while the remaining 90% are used for testing. Gao et al. [10] form their own split, called 'static-YUP++'.
Protocol: For training and testing purposes, we consider the central frame of each sample. We conduct the following two experiments, and set λ in Eq. 4 to 1 in both.
1. We evaluate the utility of the hallucination task for dynamic scene recognition and compare our method with [41,7,37]. For a fair comparison, we use the split used by [41,7,37]. The results are summarized in Table 5. HalluciNet improves on the performance of our vanilla 2D-CNN, and also outperforms the spatiotemporal energy based approach (BoSE), the slow feature analysis (SFA) approach, and the temporal CNN (T-CNN). T-CNN might be the closest for comparison because it uses a stack of 10 optical flow frames. Yet, our HalluciNet outperforms it by a large margin.
2. We compare our approach with [10]. For a fair comparison, we use the split used by [10]. The results are summarized in Table 6. On static-YUP++, HalluciNet outperforms the other approaches that use motion information. HalluciNet outperforms them by a large margin even when groundtruth motion information is used.

Conclusion
3D-CNNs extract richer spatiotemporal features than 2D-CNNs, but this comes at a considerably higher computational cost. 2D-CNNs have the benefit of being computationally much lighter. Since neural networks are universal function approximators, we propose a simple solution: approximate (hallucinate) the spatiotemporal representations computed by a 3D-CNN using a 2D-CNN. Hallucinating spatiotemporal representations, instead of actually computing them, brings down the computational cost and makes deployment on edge devices feasible, in addition to lowering the communication bandwidth requirement. Besides these practical benefits, the hallucination loss also provides a stronger supervisory signal.