Dual-Stream Spatiotemporal Networks with Feature Sharing for Monitoring Animals in the Home Cage

This paper presents a spatiotemporal deep learning approach for mouse behavioural classification in the home-cage. Using a series of dual-stream architectures with assorted modifications for optimal performance, we introduce a novel feature sharing approach that jointly processes the streams at regular intervals throughout the network. The dataset in focus is an annotated, publicly available dataset of a singly-housed mouse. We achieved even better classification accuracy by ensembling the two best-performing models: an Inception-based network and an attention-based network, both of which utilize this feature sharing attribute. Furthermore, we demonstrate through ablation studies that, for all models, the feature sharing architectures consistently outperform conventional dual-stream networks with standalone streams. In particular, the Inception-based architectures showed the largest gains from feature sharing, with accuracy increases of between 6.59% and 15.19%. The best-performing models were also further evaluated on other mouse behavioural datasets.


Introduction
Over many decades, the ethical implications of using animals in research have undergone considerable discussion and scrutiny (Akhtar, 2015). A major landmark geared at regulating and improving the use of animals in research is the 3Rs (Replacement, Refinement and Reduction), together with the National Centre for the 3Rs, which spearheads the movement in the UK. As of 2020, statistics show that around 2.88 million living animals were used for various research procedures in the UK, with 92% of these made up of rodents and fish (NC3Rs, 2022). Due to their genetic and anatomical similarities with humans, rodents (such as mice and rats) are some of the most utilized animals in biomedical research.
In support of the 3Rs mission, technology has been increasingly used to quantify different aspects of research involving animals. Behavioural phenotyping is particularly important as it is also the primary means of detecting welfare concerns that may arise during an experiment. However, the manual detection of these behaviours is expensive, laborious, time-consuming and not easily reproducible (Karl et al, 2003; Jhuang et al, 2010). The development of home-cage monitoring (HCM) systems is a major technological solution which has helped to solve many of these issues (Voikar and Gaburro, 2020). HCM systems facilitate non-intrusive observation of mice and may provide a range of functions such as behavioural annotation and ethogramming, depth sensing and tracking, activity summaries of circadian rhythms, and pose estimation. These HCM systems include the Techniplast Digital Ventilated Cage (DVC) (Iannello, 2019), the System for Continuous Observation of Rodents in Home-cage Environment (SCORHE) (Salem et al, 2015) and IntelliCage (Kiryk et al, 2020), to name a few. However, there are few commercially available solutions to the problem of detecting behaviours from video footage alone. Moreover, many of the solutions that do exist are strongly coupled to commercial hardware, rather than video footage in general.
In this paper, dual-stream deep learning architectures are proposed for behavioural classification of mice in the home-cage. The models in question were developed for entirely supervised learning, whereby spatiotemporal (ST) blocks of video data are mapped to one of several behaviour categories. The dataset utilised is publicly available and contains videos of a singly-housed mouse (Jhuang et al, 2010). Their results, however, are not directly comparable to ours, as the models here were trained and evaluated differently; more on this is explained in section 3.1. One of the novel aspects of these models is the shared layers between the streams of the networks. Here, instead of fusing the individual streams at the end (Simonyan and Zisserman, 2014), we propose to combine features at regular intervals throughout the architecture. We hypothesize that accurate representations are better enforced when both streams are privy to information from each other (figure 1). Some instances of shared features have been seen in U-Nets (Han, 2017) and their many derivative networks, and in some other specialized multi-stream architectures (Zhang et al, 2020; Hou et al, 2021), however not in the same manner as proposed here for multi-stream networks. In the work by the original creators of the dataset used in this paper (Jhuang et al, 2010), the authors presented a support vector machine (SVM) classifier coupled with a Hidden Markov Model (HMM) for sequence tagging. The main features extracted from the videos were the mouse position and motion statistics, which were consequently mapped to the mouse behaviour(s). The model training was repeated n=12 times using a leave-one-out methodology, achieving a classification accuracy of 77.3% across all 8 classes, as opposed to the 71.6% accuracy of human annotators.
Since then, deep learning has emerged as the state-of-the-art for classification of video data. Though 2D models thrived in most applications, the need for better contextual understanding has become increasingly apparent, and the application of 3D convolutions has enabled this. Better yet, the use of multiple streams allows for better encoding of video or clip sequences. It is often the case that one of the model streams operates on an image or image sequence (within time frames t0 to tn) while the other stream operates on the optical flow (computed for t1 to tn+1) (Carreira and Zisserman, 2017; Simonyan and Zisserman, 2014). Some other multi-stream variations operate on two image streams of different points of view, resolutions (Wei et al, 2020) or zooms, depending on the goal of classification.
The work by (Carreira and Zisserman, 2017) presented a new network called the inflated 3D (I3D) Inception model. The I3D modules differed from the classic Inception module due to the addition of 3D convolutions and 'inflated' filters that allowed the wider receptive field necessary to better learn spatiotemporal data. This dual-stream I3D architecture (pretrained on ImageNet) was utilized by (Nguyen et al, 2019) to classify home-cage mouse behaviours. Its evaluation was carried out on the same MIT dataset (Jhuang et al, 2010) and used a leave-one-out method, therefore averaging test results across the twelve videos in the main dataset. They achieved an average accuracy of over 90% on testing with various stream weights.
In another paper by (Hou et al, 2021), the effect of shared features at higher levels of multi-stream networks was demonstrated. The authors termed this operation feature fusion. The architecture comprised a framewise spatial transformer-based stream and a clipwise temporal stream. The stream features were combined at two successive final pooling layers. They achieved excellent results, with accuracies of 95.3% on the UCF101 (Soomro et al, 2012) and 72.9% on the HMDB51 (Kuehne et al, 2011) datasets. Note, however, that the feature sharing proposed in our work is implemented at multiple points throughout the dual-stream architectures and leverages the clipwise nature of both streams. More on feature sharing is explained in section 3.2.

Spatiotemporal Learning
Though utilized in a different machine learning algorithm, (Dollár et al, 2005) proved that spatiotemporal cuboids of data formed better descriptors in both human activity recognition and mouse behaviour classification tasks. These spatiotemporal features were implied to be better in video classification due to the presence of more information, which better captures event contexts, especially those that can be easily confused in instance classification. In rodent phenotyping, a lot of emphasis is placed on themes such as behavioural sequences, periodicity, repetitiveness, or patterns of certain exhibited behaviours (Kyzar et al, 2011). Depending on the researchers' goal, these become increasingly important, else subtle details are missed. A good example of this is self-grooming behaviour, which can be observed as mice transition from their idle periods to high activity (Kalueff and Tuohimaa, 2005; Kyzar et al, 2011). However, when in excess, this behaviour is also commonly associated with mouse models of both autism spectrum disorders and compulsive disorders (Liu et al, 2021). This further attests to the importance of capturing temporal representations in machine learning models.
One of the key architectures for spatiotemporal learning is the I3D (Carreira and Zisserman, 2017) mentioned earlier. In its basic form, I3D was built by changing the 2D convolutional layers in the Inception v1 (Szegedy et al, 2015) model to 3D convolutions while still leveraging the efficient structure of the Inception blocks. Unlike other 3D convolutional architectures, I3D is deep yet lightweight. In addition, Inception networks have proven themselves well in image classification, so extending them to learn temporal content was a natural expansion of their capabilities. Owing to these advantages, the same concept has been applied in some of the architectures proposed in this paper.
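As a rough illustration of the inflation idea (a sketch, not the authors' implementation), a 2D convolutional filter can be bootstrapped into a 3D one by repeating its weights along a new temporal axis and rescaling, so that the inflated filter initially responds to a static clip exactly as the 2D filter responds to a single frame:

```python
import numpy as np

def inflate_filter(w2d, t=3):
    """'Inflate' a 2D conv filter of shape (kh, kw, c_in, c_out) into a 3D
    filter of shape (t, kh, kw, c_in, c_out) by repeating it t times along
    a new temporal axis and rescaling by 1/t. Summed over time, the 3D
    filter reproduces the original 2D response on a static (repeated) clip."""
    return np.repeat(w2d[np.newaxis], t, axis=0) / t
```

The 1/t rescaling is the key detail: without it, a static input would produce t times the activation of the pretrained 2D network, breaking the transfer of ImageNet weights.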

Attention mechanisms
An attention module is characterized by the following elements: query Q, key K and value V. It attempts to map these to the output and scales the result using the dimension of the keys d_k (seen in the scaled dot-product version). Multi-head attention (MHA) combines multiple attention instances with trainable parameter matrices W and is often utilized to ensure efficient learning of vector sequences (Vaswani et al, 2017). The general expressions for the attention function and multi-head attention are given below:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V (1)

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) (2)

Transformers are a derivative architecture of MHA initially applied to natural language understanding (Vaswani et al, 2017) but have also been found to be effective in computer vision. The Vision Transformer (ViT) is one of these, which repurposed transformers for image tasks (Dosovitskiy et al, 2020). Further variations of ViTs designed for spatiotemporal learning of videos have achieved state-of-the-art (SOTA) results in activity recognition (Bertasius et al, 2021; Arnab et al, 2021). The work by (Bertasius et al, 2021) also proved that multi-head attention captures vital temporal dependencies by focusing on displaced or moving objects within a sequence. Furthermore, its application was proven to be effective in capturing global features in a multi-stream architecture for video classification (Li et al, 2020).
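Equations (1) and (2) can be sketched in a few lines of NumPy (for illustration only; the head count and weight shapes here are assumptions, and MHA is shown in its self-attention form where Q, K and V are all projected from the same sequence X):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, equation (1): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Equation (2), self-attention form: project X to Q, K, V, split them
    into heads, attend per head, then concatenate and project with W_o."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        heads.append(attention(Q[:, s], K[:, s], V[:, s]))
    return np.concatenate(heads, axis=-1) @ W_o
```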

Long-Short Term Memory (LSTM)
LSTMs are architectures which learn to store information using memory cells and gates. The memory cell was designed to achieve constant error flow and used multiplicative input and output gates that protect data from perturbation (Hochreiter and Schmidhuber, 1997). Further improvements after this saw better-defined gate operations which improved the memory retention of the architecture.
Building upon this, Bidirectional LSTMs (BiLSTMs) allow for computation of memory both ways and have been proven to achieve good results in vision tasks. BiLSTMs are composed of two LSTMs that store relevant dependencies from both the forward (i.e. past to present) and backward (i.e. future to present) state directions (Gharagozloo et al, 2021). In conjunction with other ML architectures, bidirectional LSTMs have been found to outperform the unidirectional LSTM in several natural language understanding (Graves et al, 2005; Suzuki et al, 2018) and image classification (Hua et al, 2019) tasks. In the paper by (Gharagozloo et al, 2021), a BiLSTM was used with 1-dimensional convolutions to classify the circadian rhythm of wild-type mice into day or night states. This was trained after the dimensionality reduction of five-minute clips which were further subdivided into three-second frame windows. It was found to outperform the other ML algorithms explored, with an area-under-the-curve (AUC) of 0.97. In short, BiLSTMs are capable of efficiently detecting and learning the patterns that define the behaviours mapped.

Data
The MIT mice dataset is subdivided into a main dataset and a clipped database (Jhuang et al, 2010). In this work, we utilize all twelve videos from the main dataset for training and validation, while the clipped database, composed of unambiguous behaviours, is used to test the models. Unlike the leave-one-out methodology of the original authors, we surmise that this approach helps to better examine the generalisation performance of our models. The optical flows were computed using the dense optical flow method (Farnebäck, 2003). Both training and test frames were resized to 128 × 128, and further reduced to 128 × 96 by uniformly cropping redundant parts of each frame that lie along the vertical axis. The data was also temporally downsampled using five-frame intervals (that is, every second was represented by one-sixth of a second). The temporal length used for each clip was T=8 frames. Thus, each spatiotemporal cuboid represents approximately 1.33 seconds of the original video. Towards the end of the videos/clips, any frames which could not fit these specifications were discarded. The final input data is of the form N × T × W × H × C, which represents the number of clips, temporal length, spatial width, spatial height and number of channels, respectively. The N values for the final training, validation and testing sets are 23,444, 4,195 and 5,171 respectively.
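As an illustrative sketch (the exact frame bookkeeping is ours, not the authors'), the temporal downsampling and cuboid construction described above might look like:

```python
import numpy as np

def make_cuboids(frames, step=5, T=8):
    """Temporally downsample by `step`, then group into clips of length T.

    frames: array of shape (F, H, W, C). Any trailing frames that cannot
    fill a complete cuboid at the end of the video are discarded, as in
    the preprocessing described above.
    """
    sampled = frames[::step]            # keep one frame per `step` frames
    n_clips = len(sampled) // T
    sampled = sampled[: n_clips * T]    # drop the incomplete remainder
    # -> N x T x H x W x C spatiotemporal cuboids
    return sampled.reshape(n_clips, T, *frames.shape[1:])
```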
Fig. 2: Sample frame before and after 'nightification'

In the task of classification, there are often prediction discrepancies associated with various inconsistencies. One of these is class imbalance (see appendix C for the datasets' distribution of frames to classes). Here, the severe imbalance in class distribution was alleviated using weights (King and Zeng, 2001), which forced the model to perceive the number of samples in each class as having the same value. Hence, the classes which suffered from low sample sizes, such as drinking, were assigned higher weights, and the reverse for labels with large sample sizes like micromovement.
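One common way to realise such weights is inverse-frequency weighting (a sketch; the paper follows King and Zeng, 2001, whose exact scheme may differ):

```python
import numpy as np

def class_weights(labels, n_classes):
    """Inverse-frequency weights so each class contributes equally to the
    loss: rare classes (e.g. drinking) get weights above 1, frequent ones
    (e.g. micromovement) below 1."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return counts.sum() / (n_classes * counts)
```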
Another such inconsistency was the varying lengths of the day and night videos. In this particular dataset, there are only two videos recorded at night-time (using infrared cameras), while all the rest (including the clipped dataset) were recorded in the day. In some deep learning applications, conversion to grayscale has proven effective, but this method was found to degrade the performance of the models. As such, all day videos contained within the dataset were 'nightified' (i.e. changed into night-time). This was achieved by first calculating the averaged R, G, and B channel values from the two night videos. These were then used to weight the [0-1] normalised data from the day videos, which were finally expanded back to the [0-255] range. The results gave a close approximation of what the videos would look like if recorded at night, and thus lessened bias in the models caused by the day-night imbalance (figure 2). No further augmentations were performed on the dataset. More data samples, used in both RGB and flow streams, are available in appendix D.
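One plausible reading of the nightification procedure, sketched here with assumed per-channel night means (the paper does not give the exact arithmetic), is:

```python
import numpy as np

def nightify(day_frames, night_mean_rgb):
    """Tint day frames using averaged per-channel values from night footage.

    day_frames: uint8 array (..., 3); night_mean_rgb: length-3 sequence of
    the mean R, G, B values computed over the two night videos. The day
    data is normalised to [0, 1], weighted by the (normalised) night
    channel means, and expanded back to [0, 255].
    """
    norm = day_frames.astype(float) / 255.0
    weighted = norm * (np.asarray(night_mean_rgb) / 255.0)
    return (weighted * 255.0).round().clip(0, 255).astype(np.uint8)
```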

Architectures
All of the models presented are multi-stream and, in this application, use raw video and optical flow streams. The building blocks utilized in the networks are depicted in figure 3. As stated in the introduction, one of the vital aspects of the models presented here is the feature sharing between the dual streams of the network. Feature sharing entails the combination and/or joint processing of the stream outputs after operation by the primary modules. This combination is achieved either via addition or concatenation, followed by further processing of the joint streams, which are then projected back to the individual streams. These operations take place at regular intervals throughout the architectures. We hypothesize that this procedure reinforces learnt features better than operating on the streams individually. The various implementations of this are further discussed under each architecture, and in table 1. The overview of each architecture is also depicted in appendix E. The blocks in subfigures 3(a) and 3(c) represent the primary processing modules used in both the RGB image and optical flow streams, while the blocks in 3(b) and 3(d) are the joint processing modules. The blocks in 3(c) and 3(b) depict 3D formats of modules originally found in the Inception v1 and Inception v3 architectures respectively (Szegedy et al, 2015, 2016). In particular, 3(b) was adapted here to boost the performance of the architectures utilizing module 3(a) via further processing at the junctions where the streams meet. Block 3(d) is a custom joint processing module utilized only in the baseline network.
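Schematically, a single feature sharing step might look as follows (a minimal sketch: `joint_fn` stands in for the joint processing module, such as block 3(b) or 3(d), and the residual-style projection back to each stream is our simplification of the scheme described above):

```python
import numpy as np

def feature_share(rgb_feat, flow_feat, joint_fn, mode="add"):
    """Combine the two stream outputs, jointly process them, and project
    the shared representation back to the individual streams.

    In "concat" mode, joint_fn is assumed to project the 2C concatenated
    channels back down to the C channels of each stream.
    """
    if mode == "add":
        joint = rgb_feat + flow_feat
    else:  # "concat" along the channel axis
        joint = np.concatenate([rgb_feat, flow_feat], axis=-1)
    shared = joint_fn(joint)
    # each stream continues with its own features enriched by the shared ones
    return rgb_feat + shared, flow_feat + shared
```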

Baseline network (CNN)
This simple architecture consists of blocks with 3D convolutional layers, dropout (with uniform rates of 20%), and batch normalization (see figure 3(a)). The kernel sizes here were made uniform for each block (i.e. kernels of size m rather than m − 2 as depicted in figure 3(a)). After operation by similar blocks, the results from both streams are summed and further operated on by dense and dropout layers (figure 3(d)) before splitting again into the individual streams.

CNN + Inception v3 D + Attention (CIv3D MHA)
This builds on the baseline architecture, adding self-attention mechanisms to both streams after the last primary blocks. The kernel sizes for the 3D convolutions were made to undulate (as shown in figure 3(a)) and were inversely mirrored between parallel stream blocks. In addition, the simple processing block is replaced by the InceptionD module (Szegedy et al, 2016) (figure 3(b)) throughout the architecture. The self-attention block used here is similar to that of vision transformers (Dosovitskiy et al, 2020); however, it uses batch normalization, and the patch tokens are replaced by the end features of the streams before summation and processing by the last InceptionD block.

CNN + Inception v3 D + BiLSTM (CIv3D BiLSTM)
This uses the same modifications made in CIv3D MHA but removes the primary modules' dropout layers. The bidirectional LSTMs are used in place of the traditional flattening that precedes fully-connected layers. The input to this is the summed output of both streams' final subsection, reshaped from four to two dimensions to allow loading into the LSTMs.

Purely Inception-based networks
There are two architectures built entirely from the 3D Inception v1 block (see subfigure 3(c)). This block was revised for spatiotemporal operation from the dimensionality reduction module in the classic Inception v1 architecture (Szegedy et al, 2015) but is without the singular 1 × 1 convolution branch of the original (figure 3(c)). The first architecture can be best described as multi-stream. At the bottleneck between successive subregions of the network, feature learning is reinforced by repeatedly combining strided computes of the original optical flow sequence with the previous features extracted from the RGB stream. Hence, the network was termed the Singly Reinforced Stream (SRS) network. It also adds the intricate detail of removing the first and last two frames of the optical flow stream (along with some surrounding dimensions) after the first block operation on both streams. This cropping operation is carried out only once and is done under the assumption that the temporal sequence is better represented by the centre portions of the mid-four frames. This train of thought is quite similar to the fovea stream in (Karpathy et al, 2014) but takes it further by removing frames at the extremities.
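The cropping step can be sketched as follows (the spatial border width `s_crop` is an assumption for illustration; the paper only states that some surrounding dimensions are removed alongside the first and last two frames):

```python
import numpy as np

def centre_crop_flow(flow, t_crop=2, s_crop=8):
    """Keep the mid-four frames of a T=8 optical-flow clip and trim a
    surrounding spatial border, on the assumption that the centre of the
    sequence carries the most informative motion.

    flow: array of shape (T, H, W, C).
    """
    return flow[t_crop:-t_crop, s_crop:-s_crop, s_crop:-s_crop]
```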
Unlike the SRS network, the second architecture was developed to encourage cross-pollination between the streams; this implies that just as the optical flow stream enforces representation learning in the image stream, the image stream is also used to enforce learning in the optical flow stream, and they alternate in this manner. This is done by independently concatenating the past features from each stream's block with the jointly-processed input fed into subsequent blocks. This operation, however, led to a considerable increase in computation (see parameter count in Table 2). This network was named the Cross Reinforced Streams (CRS) network.

Other networks
To investigate the effectiveness of the shared layers between streams, experiments were conducted on versions of the above models without the unique feature sharing modules. The hallmark algorithms used in each architecture were left in situ, while the areas of joint processing were replicated in both streams, all before the common fully-connected layers.

Model training
All models were trained using the categorical cross-entropy loss and optimized using stochastic gradient descent (SGD). The number of epochs and batch size were set to 85 and 8, respectively. Training was set to reduce the learning rate by a factor of 0.5 if the validation loss plateaus or peaks, and finally stop if no notable learning is achieved. This prevents overfitting and allows for early restoration of the best checkpoints. Each model was trained and evaluated n = 4 times, corresponding to different random seeds, and the results averaged. By using the averages, we present an accurate representation of each model's predictive capability.
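The schedule described above amounts to a reduce-on-plateau rule; a minimal sketch of the logic (the `patience` and `min_delta` values are assumptions, not given in the paper):

```python
def reduce_on_plateau(val_losses, lr=0.01, factor=0.5, patience=3, min_delta=1e-4):
    """Halve the learning rate each time the validation loss fails to
    improve by at least min_delta for `patience` consecutive epochs,
    returning the final learning rate."""
    best = float("inf")
    wait = 0
    for loss in val_losses:
        if loss < best - min_delta:
            best, wait = loss, 0      # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:      # plateau detected: decay the rate
                lr *= factor
                wait = 0
    return lr
```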

Metrics
The most popular metric used in supervised classification is accuracy. However, we evaluate all the models presented here on several metrics, including accuracy, average precision (AP), F1 score, and area-under-the-ROC-curve (AUC), where ROC is the receiver operating characteristic. The plots depicting these metrics are the confusion matrices, precision-recall plots and ROC plots. Altogether, these metrics give a holistic view of each model's performance.

Model comparison
The best results were obtained with the singly-reinforced stream model, with an average accuracy of 81.96±2.71%. The averaged performances of all the feature sharing dual-stream models are tabulated in Table 3. The full performances for all seeds can be found in appendix A.

Ensembles
The ensembles were created by averaging the results of the models at inference. Due to the gap in performance, most ensembles between models did not show any improvements over the SRS model. However, by evaluating both the validation and test results, the best and second-best performing seeds of the SRS model were ensembled.

Feature sharing ablation

Here, the results of the models and their non feature sharing variants are presented. The variants were trained and tested on the same dataset, and under the same conditions, as those with joint processing. The averaged results across all metrics are tabulated (table 5). It can be clearly observed that, for each architecture pair, the feature sharing model performs better than its standalone form.

The case for nightification
To justify the choice of nightified spatiotemporal (ST) clips in the image stream, further experiments were conducted with both raw RGB input and grayscale input. These were carried out on the baseline model and the previously ascertained best-performing models from section 4.2. These models were trained and tested in the same rigorous manner as the core paper models. The results show that the nightified ST input has higher accuracy than both the grayscale and raw video ST inputs for most models, the only exception being the baseline model. The models using grayscale cuboids seemed to initially perform well judging by the AUCs and average precision; however, their accuracies were all inferior to those of the nightified cuboids. Observations show that this was due to greater misclassification between visually similar behaviours (such as micromovement and rest), indicating that the grayscale modality did not possess enough information for these deep models to sufficiently distinguish between them. A similar narrative was observed for the raw video inputs, though we argue that in this case the drop in performance (albeit small) was due to the lack of standardization. The results are presented in table 6, where GS = grayscale, R = raw RGB, and N = nightified data.

Varying of temporal length
The temporal length refers to the number of frames that make up each clip. As previously stated, all architectures were designed for a temporal length T=8, corresponding to 1.33 seconds. Further experiments are performed here by varying the preset T value. The new temporal lengths chosen were half (i.e. T=4) and double (i.e. T=16) the preset value. These experiments were only carried out on the baseline and SRS models, and conducted in the same rigorous manner as the initial runs.
Besides changing the input shape, the temporal cropping (refer to section 3.2.4) in the SRS architecture was also slightly modified. As with the new T values, this feature was halved and doubled for T=4 and T=16 respectively. Hence, there was no change to the architectural complexity. For the baseline model, its complexity only increased, slightly, for T=16. The results, averaged across the various seeds, are shown in table 7. They show that the preset T=8 was optimal, as the accuracies obtained in the new experiments were not up to par.

SCORHE
Further experiments were conducted by applying the pretrained versions of the top three seeds (from all models) to a new home-cage mouse dataset. As previously shown in section 4.2, the top-performing seeds occur in the CIv3D BiLSTM, CIv3D MHA and SRS models. The dataset of choice is another singly-housed mouse dataset by SCORHE (Salem et al, 2015). Although 13 unique annotations were originally present (see graph in appendix C), these were refined to 8 classes by removing samples with ambiguous classes (such as behav ignore, behav other), removing samples having extremely low class occurrence (such as discrepancy, rotating), and merging the supported and unsupported rearing classes.
The recordings in the SCORHE home cage were made from multiple points, as no single viewpoint provides a clear view free of occlusions. To address this, the viewpoints from opposite ends of SCORHE were shaped as 128 × 64 frames and stacked into a single 128 × 128 frame. The same was also done for the optical flow data. No frame skips were used here, to ensure ample training and testing data was available. Data samples for the SCORHE dataset are available in appendix D.
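The viewpoint stacking reduces to a simple concatenation along the vertical axis (a sketch assuming frames stored as height × width × channels):

```python
import numpy as np

def stack_viewpoints(front, rear):
    """Stack two opposing 128 (wide) x 64 (tall) camera views into a
    single 128 x 128 frame; the same stacking is applied to the flow data.

    front, rear: arrays of shape (64, 128, C), i.e. height x width x channels.
    """
    return np.concatenate([front, rear], axis=0)  # -> (128, 128, C)
```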
For the training, the previous FC layers were replaced with new ones. All other training parameters remained the same, save the learning rate, which was halved to 0.0005. The resulting receiver operating characteristic (ROC) and precision-recall (PR) curves are shown in figure 4. The accuracies achieved on the SCORHE dataset by the feature sharing CIv3D BiLSTM, CIv3D MHA and SRS were 80.51%, 79.88% and 79.13% respectively. Their non feature sharing variants achieved 72.18%, 77.95% and 70.83% respectively. A few observations were made on the feature sharing models. The CIv3D BiLSTM and CIv3D MHA were good at transferring previously learnt spatiotemporal representations to this more complex home cage for similar behaviours. However, despite having a lower accuracy, SRS performed better at both learning the old classes and balancing predictions to learn a totally new class, climbing. This is evidenced by its class accuracy across the different confusion matrices; while CIv3D BiLSTM and CIv3D MHA achieved 22.34% and 33.68% respectively, the SRS model achieved 53.61%.

UCF101
Finally, the feature sharing and standalone forms of the best-performing model (i.e. SRS) were applied to transfer learning on a popular, more challenging activity recognition dataset: the UCF101 (Soomro et al, 2012). This dataset contains 13,320 clips of 101 activity classes, totalling over 27 hours. In a similar manner as before, the models (pretrained on the MIT mouse dataset) are utilized, with new fully-connected layers and the learning rate reduced to 0.0005. After shaping the data into 8-frame cuboids of 128 × 128, a train/validation/test split of 0.64/0.16/0.20 was applied. This experiment was done purely as a cross-domain investigation into the effectiveness of feature sharing, so no further preprocessing was carried out on either the RGB or optical flow data. As shown in table 8, the accuracy of the pretrained SRS yet again bests its non feature sharing counterpart across the board. In addition, the top-5 accuracy of the SRS, even without pretraining, reaches very high values. Comparison with SOTA results, however, is not feasible since the data split was done differently.
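The 0.64/0.16/0.20 split can be realised by shuffling clip indices (a sketch; the seed and rounding behaviour are assumptions):

```python
import numpy as np

def split_indices(n, seed=0):
    """Shuffle n clip indices into a 0.64/0.16/0.20 train/val/test split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train, n_val = int(0.64 * n), int(0.16 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```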

Conclusion
Generally, it was observed that the more dynamic behaviours were better captured by all the models. The performance lag in all the models was mainly due to misclassifications amongst the resting, grooming and micromovement behaviours. These behaviours are quite closely related; during grooming, the mouse is mostly stationary apart from the motion of its forelimbs, and when resting, the mouse is immobile. Micromovements are very small-scale motions. Hence, it is most likely that the 1.33-second windows of the T=8 cuboids cannot capture the full range of motions that would allow the models to better distinguish between these classes. Nonetheless, these 'misclassifications' are also indicative of a similitude in the temporal patterns needed to perform certain tasks and may be subject to further interpretation by subject experts. Further experiments in the ablation study also showed that for time windows shorter or longer than the 1.33-second window, the performance of the models degrades. Thus, these ST clips (especially for T=16) may require more specialized model designs to work with the feature sharing framework.
The step up in performance between the feature sharing and standalone baseline models lends credence to the effectiveness of combined streams; by simply summing parallel outputs from both streams and processing with a dense-dropout pair, we observe over 4% improvement in averaged accuracy. This observation was further proven in subsequent networks utilizing algorithms such as bidirectional LSTMs and self-attention mechanisms. Though the CIv3D BiLSTM model was only marginally better in terms of accuracy, it bested its non feature sharing variant in all other metrics. Similarly, we observe a notable boost across all the metrics of the other models, especially so in the purely 3D Inception-based networks (SRS and CRS), both having over 10% improvement in accuracy alone. Future research will also consider unsupervised detection of behaviours and welfare concerns in the home cage, and whether the unique feature sharing approach will impact multi-stream models in this learning domain.

Appendix A: Full performance table

Fig. 1: Conventional standalone vs feature-sharing dual networks

Fig. 3: Network modules used in dual-stream architectures

Table 1 :
Full summary of feature sharing models

Table 2 :
Model hyperparameters and other details

Table 3 :
All models' performances across metrics

For intra-model ensembles of the SRS model, the test results yielded 82.37% based on model picks via evaluation of the validation data. This further increases to an accuracy of 85.9% based on model picks using the test data itself. By evaluating the test results (see appendix A), the best inter-model ensemble was observed between the SRS and CIv3D MHA models and achieved 86.47%. Further ensembles between models are shown in Table 4. The confusion matrices and ROC plots for the ensembles can be found in appendix B.

Table 4 :
Result of ensembles (based on test groundtruth)

Table 5 :
Summary of the feature-sharing and standalone stream networks

Table 6 :
Results on grayscale, raw RGB and nightified data

Table 8 :
Results on UCF101 dataset

Table A1 :
Full Performance table for feature sharing models