1. Introduction
Over many decades, the ethical implications of using animals in research have undergone considerable discussion and scrutiny [1]. A major landmark in the regulation of research involving animals was the designation of the three ‘R’s (Replacement, Refinement, and Reduction), a principle now championed by the National Centre for the 3Rs (NC3Rs) in the United Kingdom. In 2022, around 2.76 million living animals were used in research procedures in the UK, with rodents (rats and mice), birds, and fish making up 96% of this total [2]. Owing to their genetic, physiological, and anatomical similarities with humans [3], as well as their short lifecycles [4,5], mice are among the most utilized species in biomedical research.
In support of the 3Rs mission, technology has been increasingly used to better understand the different aspects of research involving animals. Behavioral phenotyping is particularly important as it may highlight welfare concerns that arise over the course of an experiment. However, manual observation of these behaviors is expensive, laborious, and time-consuming, and behavioral studies relying solely on expert observation are not easily reproducible [6,7]. The development of home-cage monitoring (HCM) systems was a major technological breakthrough that has helped address many of these issues [8]. HCM systems facilitate non-intrusive, longitudinal observation of mice and may provide a range of outputs, such as behavioral annotation, ethogramming, depth sensing and tracking, circadian activity summaries, and pose estimation. Such HCM systems include the Tecniplast Digital Ventilated Cage (DVC) [9], the System for Continuous Observation of Rodents in Home-cage Environment (SCORHE) [10], and IntelliCage [11], to name a few. Cameras are extensively utilized across diverse industries for a number of tasks [12], including autonomous driving, pose estimation [13], and security and surveillance. As such, these home-cage setups may be equipped with either single-view [14] or multi-view cameras [10], depending on design considerations. Nevertheless, there are few commercially available solutions for detecting behaviors from video footage alone, and many of the solutions that do exist are strongly coupled to commercial hardware rather than operating on video footage in general. Owing to their recent successes in human action recognition and many other domains, deep learning approaches offer a potential solution to the problem of behavioral phenotyping in the home-cage.
In this paper, dual-stream deep learning architectures are proposed for the behavioral classification of mice in the home-cage. The models were developed for fully supervised learning, whereby spatiotemporal (ST) blocks of video data are mapped to one of several behavior categories. The dataset utilized is publicly available and contains videos of a singly housed mouse [7]. Our models are initially trained on the entire main dataset and then tested on the less ambiguous clipped database, an approach that differs from that of the original paper. Nevertheless, comparisons were also made between our proposed methodology and the original results [7] using the same cross-validation technique adopted in that publication. Furthermore, a select few of our models were also evaluated on a more complex, multi-view home-cage dataset [10]. One of the novel aspects of these models is the sharing of layers between the streams of the networks. Here, instead of fusing the individual streams at the end (termed “late fusion” in [15]), we propose to combine features at regular intervals throughout the architecture. We hypothesize that accurate representations are better enforced when both streams are privy to information from each other (Figure 1). Some instances of shared features have been seen in U-Net [16] and its many derivative networks, as well as in some other specialized multi-stream architectures [17,18], albeit in a different manner to that proposed herein for multi-stream networks.
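To make this concrete, the sketch below outlines a two-stream 3D-convolutional network that exchanges features after every block. It is a minimal illustration of the feature sharing idea, assuming a Keras-style implementation; the layer choices, spatial input size, and number of blocks are placeholders rather than the exact architectures evaluated in this paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def stream_block(x, filters):
    # One spatiotemporal processing step of a single stream.
    x = layers.Conv3D(filters, kernel_size=3, padding="same", activation="relu")(x)
    return layers.MaxPooling3D(pool_size=(1, 2, 2))(x)

def shared_dual_stream(input_shape=(8, 112, 112, 3), n_classes=8):
    # Two parallel input streams, e.g., raw frames and a second representation.
    inp_a = layers.Input(shape=input_shape)
    inp_b = layers.Input(shape=input_shape)
    a, b = inp_a, inp_b
    for filters in (32, 64, 128):
        a = stream_block(a, filters)
        b = stream_block(b, filters)
        # "Horizontal" feature sharing: each stream receives a combination of
        # both streams' features at regular intervals, instead of a single
        # late-fusion step at the end of the network.
        shared = layers.Add()([a, b])
        a = layers.Concatenate()([a, shared])
        b = layers.Concatenate()([b, shared])
    # Standard classification head after the final exchange.
    pooled = layers.Concatenate()([
        layers.GlobalAveragePooling3D()(a),
        layers.GlobalAveragePooling3D()(b),
    ])
    x = layers.Dense(256, activation="relu")(pooled)
    x = layers.Dropout(0.5)(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return Model(inputs=[inp_a, inp_b], outputs=out)
```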
To the best of our knowledge, our work is the first to propose this “horizontal” form of connection in multi-stream deep learning (DL) architectures. The kind of connection present in U-Nets has been referred to as the long skip connection [19] and is integral to the model’s ability to prevent the dilution of features while transferring useful representations to its decoding stage [20]. The other well-known kind is the short skip connection [19], first introduced in ResNet [21] to solve the problem of vanishing gradients as architectures scaled to greater depths [22,23]. Some studies have even combined both kinds of connection in their DL designs [24]. What the long and short connections have in common is the use of simple operations such as addition or concatenation. By contrast, the forms of feature sharing proposed here range from concatenation to new, joint-processing blocks that can be optimized together with the entire architecture. Our study is thus novel in its presentation of feature sharing between the streams of dual-stream architectures. This investigation forms the highlight of our paper and is evaluated against the conventional, standalone forms of all the architectures developed.
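For contrast, the two established connection types can be expressed in a few lines. These snippets are generic illustrations of the cited mechanisms, with assumed shapes and layer sizes, rather than reproductions of the original architectures.

```python
from tensorflow.keras import layers

def short_skip(x, filters):
    # ResNet-style short skip connection [21]: the block's input is added to
    # its output, easing gradient flow in very deep networks. Assumes x
    # already has `filters` channels so the addition is shape-compatible.
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return layers.Activation("relu")(layers.Add()([x, y]))

def long_skip(encoder_features, decoder_features):
    # U-Net-style long skip connection [16]: encoder features are carried
    # across the network and concatenated into the matching decoder stage.
    return layers.Concatenate()([encoder_features, decoder_features])
```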
5. Discussion
Generally, the more dynamic behaviors were better captured by all the models than the less dynamic ones. Areas of weak performance across all the models were mainly due to misclassification of the resting, grooming, and micromovement behaviors. These behaviors are closely related: during grooming, the mouse is mostly stationary apart from the motion of its forelimbs, and when resting, the mouse is completely immobile. Micromovement describes very small-scale motions, so it is likely that the 1.33-second windows of T = 8 cuboids cannot capture the full range of motion needed to distinguish between these classes. Nonetheless, these ‘misclassifications’ are also indicative of similarity in the temporal patterns needed to perform certain tasks and may be subject to further interpretation by subject experts.
Further experiments in the ablation study showed that the models’ performance degrades for time windows shorter or longer than 1.33 seconds. Other clip sizes would therefore require more intensive hyperparameter tuning and data preprocessing to work with the feature sharing paradigm. In particular, the T = 16 temporal input may also require a deeper architecture (i.e., one with more rungs or blocks) at the cost of increasing the computational complexity of the learning objective.
The step up in performance from the standalone baseline models to the feature sharing models lends credence to the effectiveness of combined streams; by simply summing parallel outputs from both streams and processing them with a dense-dropout pair (depicted in Figure 3b), we observe improvements in accuracy of between 0.33% and 8.97%. This observation was further borne out in subsequent networks employing bidirectional LSTMs and self-attention mechanisms. Though the CIv3D_BiLSTM model was only marginally better in terms of accuracy, it outperformed its non-feature-sharing variant on all other metrics. Similarly, we observe notable improvements across all metrics for the other models, especially the purely 3D Inception-based networks (SRS and CRS), both of which gained over 10% in averaged accuracy alone. The ensemble of SRS and CIv3D_MHA also achieved better accuracy than human annotators in cross-validation on the training data. Although this accuracy did not match that of the methodology proposed in the original paper, it sufficiently demonstrates the workability of the feature sharing paradigm.
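A minimal sketch of this fusion step, assuming Keras-style layers, is given below; the unit count and dropout rate are placeholders, and the exact configuration follows Figure 3b.

```python
from tensorflow.keras import layers

def sum_fusion(stream_a, stream_b, units=128, rate=0.5):
    # Element-wise sum of parallel outputs from both streams, followed by a
    # dense-dropout pair; `units` and `rate` are illustrative values.
    fused = layers.Add()([stream_a, stream_b])
    fused = layers.Dense(units, activation="relu")(fused)
    return layers.Dropout(rate)(fused)
```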
Based on both the parameter count and the number of floating-point operations (FLOPs), the implementation of feature sharing was also found to mostly reduce the complexity of the architectures, the exception being the CRS model (see Table 2). Conversely, utilizing feature sharing requires establishing which feature sharing method is best suited to a given architecture, i.e., either simple concatenation or a new processing block (such as in Figure 3d or Figure 3b). These investigations generally increase the number of experiments needed, thereby increasing the time required to establish the paradigm’s utility.
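For reference, complexity figures of the kind reported in Table 2 can be obtained with a generic recipe such as the one below. The parameter count is read directly from Keras, while the FLOPs estimate uses a common TensorFlow profiler idiom; this is an assumed recipe, not the measurement code used for this paper.

```python
import tensorflow as tf
from tensorflow.python.framework.convert_to_constants import (
    convert_variables_to_constants_v2,
)

def complexity(model, input_spec):
    # Total parameter count comes directly from Keras.
    n_params = model.count_params()
    # FLOPs are estimated by freezing the model and profiling its graph.
    concrete = tf.function(model).get_concrete_function(input_spec)
    frozen = convert_variables_to_constants_v2(concrete)
    opts = tf.compat.v1.profiler.ProfileOptionBuilder.float_operation()
    info = tf.compat.v1.profiler.profile(
        graph=frozen.graph,
        run_meta=tf.compat.v1.RunMetadata(),
        cmd="op",
        options=opts,
    )
    return n_params, info.total_float_ops

# Example for a single-input model taking T = 8 clips (shape is illustrative):
# params, flops = complexity(model, tf.TensorSpec((1, 8, 112, 112, 3), tf.float32))
```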
6. Conclusions
In summary, this paper proposed an approach to mouse behavior classification based on multi-stream convolutional neural networks with feature sharing. With this architectural consideration, we observed gains ranging from 0.33% to 15.19% across the custom architectures presented. Only one model type (the CIv3D_BiLSTM) reported a lower accuracy for the feature sharing architecture than for its standalone variant, and even for this architecture, upper-limit gains of 3.92% were possible. We validated this approach using two publicly available datasets, on which it performs favorably compared to the state-of-the-art.
Further work will investigate improving overall cross-validation performance through data augmentations not employed in this paper. In addition, feature sharing can be adapted to well-established, state-of-the-art supervised models (both convolutional and transformer-based) to further investigate its pros and cons. Finally, future research will also consider the unsupervised detection of behaviors and welfare concerns in the home cage, and whether the feature sharing approach benefits multi-stream models in that learning domain.