Murine Motion Behavior Recognition Based on DeepLabCut and Convolutional Long Short-Term Memory Network

Abstract: Murine behavior recognition is widely used in biology, neuroscience, pharmacology, and other areas of research, and provides a basis for judging the psychological and physiological state of mice. To solve the problem whereby traditional behavior recognition methods only model behavioral changes in mice over time or space, we propose a symmetrical algorithm that can capture spatiotemporal information based on behavioral changes. The algorithm first uses the improved DeepLabCut keypoint detection algorithm to locate the nose, left ear, right ear, and tail root of the mouse, and then uses the ConvLSTM network to extract spatiotemporal information from the keypoint feature map sequence to classify five behaviors of mice: walking straight, resting, grooming, standing upright, and turning. We developed a murine keypoint detection and behavior recognition dataset, and experiments showed that the method achieved a percentage of correct keypoints (PCK) of 87 ± 1% at three scales and against four backgrounds, while the classification accuracy for the five kinds of behaviors reached 93 ± 1%. The proposed method is thus accurate for keypoint detection and behavior recognition, and is a useful tool for murine motion behavior recognition.


Introduction
Behavior is the body language of animals expressing psychology and physiology, which can provide a theoretical basis for judging their psychological and physiological state. Animal behavior research should seek to observe animals not only in their natural state, but also in laboratory conditions. Animal behavior analysis has been widely used in biological sciences, neuroscience, pathology, and genetics [1][2][3][4] to study neural functions, psychological processes, and drug effects. Traditional behavioral analysis methods mostly use manual observation and sensor methods [5][6][7], such as piezoelectric sensors, infrared sensors, and micro-photoelectric systems, which are time-consuming, laborious, and inflexible. More importantly, the influence of sound, light, electricity, and odor caused by sensors and manual observation can interfere with the natural behavior of experimental animals, resulting in deviations in the experimental data. Therefore, it is important to develop an algorithm that can automatically identify animal behaviors and calculate related indicators. This can reduce the workload of researchers, provide them with quantitative behavioral analysis, and improve the objectivity of experiments. Refined behavioral analysis can help researchers to capture some difficult-to-detect behavioral patterns. Since the 1990s, computer image processing technology has been widely used in the field of animal behavior analysis, providing an objective, quantitative, and precise analysis method.

Figure 1. Framework of the murine motion behavior recognition model: The DeepLabCut algorithm detects the keypoints of the nose, left ear, right ear, and tail root of the mouse. The ConvLSTM network extracts spatiotemporal information and classifies behaviors. The model horizontally extracts spatial information, and vertically extracts temporal information.

Improved DeepLabCut Network
DeepLabCut is an open-source system for single-target pose estimation. Based on transfer learning, it can accurately detect keypoints without large quantities of training data, and has been tested and calibrated across several species, including mice, flies, and humans. The structure of the algorithm is shown in Figure 2. Its backbone network is Resnet50, which has been pre-trained on ImageNet [25], a large image database for object recognition, so that a large number of training samples is not needed. Unlike the standard Resnet50 network, DeepLabCut limits the downsampling factor of the Resnet50 network to 16 in order to obtain a larger feature map. Specifically, the positions of Block2, Block3, and Block4 in Figure 2 are not downsampled, where Block denotes a residual block in Resnet50.

DeepLabCut uses a heatmap plus coordinate offset to predict keypoint locations. The Resnet50 network outputs a feature map of 2048 channels and, by transpose convolution, obtains the class probability feature map, whose number of channels equals the number of keypoints, as well as the coordinate offset feature map, whose number of channels is twice the number of keypoints. The coordinate offset provides more accurate location information than the traditional method of using only a heatmap for prediction. Therefore, DeepLabCut needs two parts of the loss function during training: for the category probability part, the binary cross-entropy loss is used, and for the coordinate offset part, the Huber loss is used. The loss function formula is as follows:

\[ \mathrm{loss}_{\mathrm{all}} = \mathrm{loss}_{\mathrm{sigmoid\_cross\_entropy}} + \mathrm{loss}_{\mathrm{huber}} \]

where, within the two component losses, $y_c$ is the category label, $y_r$ is the real coordinate offset, $x_c$ is the predicted category probability, $x_r$ is the predicted coordinate offset, and $\delta$ is a hyperparameter used to judge whether a point is a singular data point. When the predicted deviation is less than $\delta$, the mean square error (MSE) is adopted; when the predicted error is greater than $\delta$, the linear error is adopted to prevent oversensitivity to outliers.
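The two-part loss described above can be illustrated with a minimal sketch. It uses the textbook forms of the binary cross-entropy and Huber losses; the function names, the default value of `delta`, the averaging over elements, and the absence of any weighting or masking between the two terms are assumptions made for illustration and are not taken from the DeepLabCut code.

```python
import math


def sigmoid_cross_entropy(y_c, x_c, eps=1e-12):
    """Binary cross-entropy for one label y_c and one predicted probability x_c."""
    x = min(max(x_c, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y_c * math.log(x) + (1.0 - y_c) * math.log(1.0 - x))


def huber(y_r, x_r, delta=1.0):
    """Huber loss: quadratic for small errors, linear beyond delta."""
    err = abs(y_r - x_r)
    if err <= delta:
        return 0.5 * err * err
    return delta * (err - 0.5 * delta)


def total_loss(labels, probs, true_offsets, pred_offsets, delta=1.0):
    """Sum of the mean cross-entropy and the mean Huber loss (assumed unweighted)."""
    ce = sum(sigmoid_cross_entropy(y, x) for y, x in zip(labels, probs)) / len(labels)
    hb = sum(huber(y, x, delta) for y, x in zip(true_offsets, pred_offsets)) / len(true_offsets)
    return ce + hb
```

With `delta = 1.0`, an offset error of 0.5 falls in the quadratic regime and an error of 3.0 in the linear regime, which is the behavior the text attributes to the hyperparameter δ.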
In keypoint detection tasks, large receptive fields and multiscale information are crucial. First, the representation of different keypoints may require information at different scales. Second, large receptive fields and multiscale information can implicitly establish spatial connections between keypoints. The convolutional pose machine (CPM) [26] and stacked hourglass network (Hourglass) [27] algorithms use cascade networks to obtain large receptive fields and multiscale information. To make the keypoint detection model suitable for most murine behavioral scenarios, this study used atrous spatial pyramid pooling (ASPP) in DeepLabV3 [28] to fuse context information at multiple scales. Specifically, the intermediate convolution kernel of the residual unit in the residual block was replaced by the ASPP module, as shown in Figure 3. The ASPP module consisted of one 1 × 1 convolution, three 3 × 3 dilated convolutions with different expansion rates, and a global average pooling, and finally used the concatenation operation to fuse multiscale features. In this paper, the effect of the ASPP module was verified by the ablation experiment, and the effects of different sites and different expansion rates on the model were compared. In the following sections, DLC_b1_r1 means that the ASPP module has been inserted into Block1, and the expansion rates are 1, 2, and 4; DLC_b12_r2 means that the ASPP module has been inserted into Block1 and Block2, and the expansion rates are 2, 4, and 6.

Figure 3. Application of the ASPP module in the residual network: The arrows represent 3 × 3 convolution kernels, with a convolution step of one in the block replaced by the ASPP module. The ASPP module consists of one 1 × 1 convolution, three 3 × 3 dilated convolutions with different expansion rates, and global average pooling. All feature maps were fused by using concatenation.
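To make the role of the expansion rate concrete, the sketch below computes the spatial extent covered by a dilated kernel, k_eff = k + (k − 1)(r − 1), for the two rate sets named in the text. This is the standard relation for dilated convolutions; the function names and the `RATE_SETS` dictionary are illustrative and not part of the authors' code.

```python
def effective_kernel_size(kernel_size: int, rate: int) -> int:
    """Spatial extent (in pixels) covered by a dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


# Expansion-rate sets as named in the text: r1 = (1, 2, 4), r2 = (2, 4, 6).
RATE_SETS = {"r1": (1, 2, 4), "r2": (2, 4, 6)}


def aspp_extents(rates, kernel_size=3):
    """Effective extents of the 3 x 3 dilated branches of an ASPP module."""
    return [effective_kernel_size(kernel_size, r) for r in rates]


if __name__ == "__main__":
    for name, rates in RATE_SETS.items():
        print(name, aspp_extents(rates))
```

Under this relation the r1 branches cover 3, 5, and 9 pixels, and the r2 branches cover 5, 9, and 13 pixels, so the higher rate set sees a wider context at the same parameter count.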
The keypoint network plays two roles in this study: The first is to output precise coordinates of the keypoints of the mouse's nose, left ear, right ear, and tail root. The second is to output a feature map containing the spatial posture information of the mice, which is combined with context information to recognize murine behavior.

Convolutional Long Short-Term Memory Network
Behavior can be described as posture change over time. Traditional behavior recognition algorithms extract distance and angle features between keypoints and then use a long short-term memory (LSTM) [29] model to represent the temporal relationships between behaviors, as proposed by Fu et al. [20]. However, do the obtained features comprehensively summarize the spatial connections between keypoints? Do they properly represent the temporal change in posture? How should false and missing results in the keypoint detection model be dealt with? Artificial features must address these considerations.
This paper introduces the ConvLSTM [30] to solve the above problem. ConvLSTM replaces the fully connected layer in FC-LSTM with a convolution, which can capture spatial features in multidimensional data. The information transmission mode is shown in Equation (4). The state of the previous moment can be transmitted to the next moment in a spatial form, so that spatiotemporal features can be extracted. In Equation (4), $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates, respectively, at time step $t$, $X_t$ represents the input data, $C_t$ is the storage cell state, $h_t$ is the output of the network at time $t$, "*" represents the convolution operation, and "•" is the Hadamard product.
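For reference, the ConvLSTM update as it is commonly written in the ConvLSTM literature [30] is given below using the symbols defined above. The weight and bias names ($W$, $b$) and the sigmoid $\sigma$ are notation assumed here, and the peephole terms ($W_{c\cdot} \bullet C$) are omitted in many implementations, so this should be read as a sketch of the usual formulation rather than a verbatim copy of Equation (4).

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * h_{t-1} + W_{ci} \bullet C_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * h_{t-1} + W_{cf} \bullet C_{t-1} + b_f\right) \\
C_t &= f_t \bullet C_{t-1} + i_t \bullet \tanh\left(W_{xc} * X_t + W_{hc} * h_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * h_{t-1} + W_{co} \bullet C_t + b_o\right) \\
h_t &= o_t \bullet \tanh\left(C_t\right)
\end{aligned}
```

Because every product with $X_t$ and $h_{t-1}$ is a convolution, the gates, the cell state, and the output all keep the spatial layout of the input feature map.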
In the murine behavior recognition task, the sequence of adjacent frames' feature maps, which is output by the keypoint detection algorithm, was used as input to the ConvLSTM network to model dynamic changes in behavior. The structure of the ConvLSTM network is shown in Figure 4, and its main functions are as follows: First, the keypoint feature maps reduce redundant information in the images. Second, the keypoint feature maps contain not only the spatial connections between keypoints, but also location information for the mice. For example, the position and spatial relationship of keypoints did not change during static behavior, but the location and spatial relationship of keypoints changed simultaneously as the mouse moved. Third, behavior is jointly determined by previous and current actions; therefore, the keypoint feature maps of previous and current frames are used as inputs to jointly determine current behavior, which is consistent with the definition of behavior. Fourth, no additional manually designed features are required: the network can implicitly model the temporal relations of keypoints when behavior occurs, and even if there are false or missing detections, the feature maps can still be used as a representation of behavior.
Specifically, four ConvLSTM layers were used to extract spatiotemporal information; the numbers of convolution kernels were 256, 128, 64, and 5, respectively, and the size of the convolution kernels was 3 × 3. Global average pooling was then used to achieve behavior classification. The resolution of the feature maps output by each layer remained unchanged, which is more conducive to learning spatial information.
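The final classification step can be sketched as follows: the last ConvLSTM layer has five kernels, so its output has one channel per behavior, and global average pooling reduces each channel to a single score. The class ordering in `BEHAVIORS`, the nested-list data layout, and the use of a plain argmax are assumptions for illustration; the source does not state them.

```python
# Assumed class order; the text lists the behaviors in this order but does not
# say which output channel corresponds to which behavior.
BEHAVIORS = ["walking straight", "resting", "grooming", "standing upright", "turning"]


def global_average_pool(feature_maps):
    """Average each channel of a C x H x W nested list down to one score."""
    scores = []
    for channel in feature_maps:
        total = sum(sum(row) for row in channel)
        count = sum(len(row) for row in channel)
        scores.append(total / count)
    return scores


def classify(feature_maps, labels=BEHAVIORS):
    """Return the label with the highest pooled score, plus all scores."""
    scores = global_average_pool(feature_maps)
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best], scores
```

Because pooling is applied only at the end, the spatial resolution is preserved through all four layers, which matches the design choice described above.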

Dataset
In this study, a dataset containing four keypoints representing the mouse's nose, left ear, right ear, and tail root was produced. The images were taken from above in an open-field experimental box, with an image resolution of 640 × 480. The video was shot in a dark room using the same device, using fill lights to fix the light intensity at 200 lux. In total, 24 videos were captured, each of which was 5 min long and featured two mice. The videos were shot at different heights and with different background colors. The dataset comprehensively considered the influence of behavior, color, and scale on model performance. It included five behaviors (walking straight, resting, grooming, standing upright, and turning), four background colors (white, light gray, dark gray, and black), and three shooting heights (60, 70, and 80 cm). Figure 5 shows images from the dataset. There were 2700 images in the training set, which included only the scale of 70 cm with a uniform distribution of colors, and 1200 images in the test set, with a uniform distribution of colors and scales. Each image was extracted from the videos above and randomly assigned to the training set or test set.
For the behavior recognition task, to ensure consistency of spatial scale, the videos were collected at a height of 70 cm. The behavior recognition dataset contained five types of behaviors: walking straight, resting, grooming, standing upright, and turning. Each type of behavior was represented by 600 labeled images, of which 75% were randomly selected as the training set and 25% as the test set. All stages of a given behavior were covered as much as possible in the labeling process. During training and testing, an image sequence consisting of the annotated frame and several frames preceding the annotated frame was used as an input to the model. Figure 6 shows an example of an image sequence of the five types of behaviors, in which the sequence length is five and the interval between frames is two.
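The construction of an input sequence from an annotated frame can be sketched as below. The sketch assumes that the sequence interval counts the frames skipped between two sampled frames, so that an interval of zero means consecutive frames; this reading is inferred from the parameter values used in the experiments and is not stated explicitly. The function name is hypothetical.

```python
def sequence_indices(annotated_frame, length, interval):
    """Frame indices of a sequence that ends at the annotated frame.

    Assumes `interval` is the number of frames skipped between neighbours,
    so the sampling step is interval + 1.
    """
    step = interval + 1
    first = annotated_frame - (length - 1) * step
    if first < 0:
        raise ValueError("not enough preceding frames for this sequence")
    return [first + i * step for i in range(length)]


if __name__ == "__main__":
    # Sequence length five, interval two, as in the Figure 6 example.
    print(sequence_indices(100, 5, 2))
```

With a length of five and an interval of two, the sequence for annotated frame 100 consists of frames 88, 91, 94, 97, and 100 under this assumption.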

Evaluation Metrics
In this study, the percentage of correct keypoints (PCK) was used as the evaluation metric for keypoint detection. The PCK refers to the proportion of detected keypoints whose normalized distance to the corresponding label falls within a set threshold. In the FLIC dataset [31], the torso size was used as the scale factor to calculate the normalized distance, while in the MPII dataset [32], the length of the head was used as the scale factor. The calculation formula of the PCK is as follows:

\[ \mathrm{PCK}_i^k = \frac{\sum_p \mathbb{1}\left( d_{pi} / d_p^{\mathrm{def}} \le T_k \right)}{\sum_p 1} \]

\[ \mathrm{PCK}_{\mathrm{mean}}^k = \frac{\sum_p \sum_i \mathbb{1}\left( d_{pi} / d_p^{\mathrm{def}} \le T_k \right)}{\sum_p \sum_i 1} \]

where $i$ represents the particular keypoint, $k$ is the threshold, $p$ is the image, $d_{pi}$ is the Euclidean distance between the predicted and true values of keypoint $i$ in the $p$th image, $d_p^{\mathrm{def}}$ is the scale factor of the $p$th image, $\mathbb{1}(\cdot)$ equals 1 when its condition holds and 0 otherwise, $\mathrm{PCK}_i^k$ represents the PCK for the keypoint of category $i$ under threshold $T_k$, and $\mathrm{PCK}_{\mathrm{mean}}^k$ represents the average PCK for all keypoints under threshold $T_k$. In this study, the distance between the ears in each frame was used as $d_p^{\mathrm{def}}$. If the distance between the ears could not be calculated, the median value for the corresponding shooting height was used as the scale factor: 15.29 at 60 cm, 12.85 at 70 cm, and 10.81 at 80 cm.
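The PCK computation described above can be sketched in a few lines. The data layout (one list of (x, y) tuples per image), the function names, and the use of "less than or equal to" at the threshold are assumptions for illustration; the fallback medians are the values quoted in the text.

```python
import math

# Median ear-distance scale factors quoted in the text, keyed by height in cm.
MEDIAN_SCALE = {60: 15.29, 70: 12.85, 80: 10.81}


def ear_scale(left_ear, right_ear, height_cm):
    """Scale factor: ear distance, or the median for that height if unavailable."""
    if left_ear is None or right_ear is None:
        return MEDIAN_SCALE[height_cm]
    return math.dist(left_ear, right_ear)


def pck(predictions, labels, scales, threshold):
    """Return (per-keypoint PCK list, mean PCK over all keypoints).

    predictions, labels: one list of (x, y) tuples per image.
    scales: one scale factor per image.
    """
    n_images = len(labels)
    n_keypoints = len(labels[0])
    correct = [0] * n_keypoints
    for pred, true, scale in zip(predictions, labels, scales):
        for i, (p, t) in enumerate(zip(pred, true)):
            if math.dist(p, t) / scale <= threshold:
                correct[i] += 1
    per_keypoint = [c / n_images for c in correct]
    mean = sum(correct) / (n_keypoints * n_images)
    return per_keypoint, mean
```

A keypoint is thus counted as correct when its error is at most the chosen fraction of the ear distance, for example 40% at the threshold of 0.4 used for the main comparison.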
For the behavior recognition task, the accuracy and the confusion matrix were used as evaluation metrics. Accuracy was defined as the ratio of samples that were predicted correctly to the total number of samples considered. Each column of the confusion matrix represented the category of prediction, and the total number of items in each column represented the number of data items predicted for the relevant category. Each row represented the true category, and the total number of items in each row represented the number of data instances in that category.
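The layout of the confusion matrix described above, with rows for the true category and columns for the predicted category, can be sketched as follows. The function names and the two-class usage example are illustrative only.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows index the true class, columns index the predicted class."""
    index = {c: k for k, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all samples."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / total


if __name__ == "__main__":
    classes = ["resting", "grooming"]
    truth = ["resting", "resting", "grooming", "grooming"]
    preds = ["resting", "grooming", "grooming", "grooming"]
    m = confusion_matrix(truth, preds, classes)
    print(m, accuracy(m))
```

Each row sums to the number of samples of that true category and each column to the number of samples predicted as that category, as stated in the text.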

Experimental Details
In the keypoint detection task, CPM [26], Hourglass [27], DeepLabCut, and the improved DeepLabCut were used for comparison. CPM used six stage units and trained for 300 epochs. Stacked hourglass used four hourglass modules and trained for 1000 epochs, with a learning rate of 0.0001. CPM and stacked hourglass needed to crop out the mouse in the original image for detection and then map the coordinates back to the original image. DeepLabCut and the improved DeepLabCut required a total of 1,030,000 iterations. The learning rate was 0.005 for the first 10,000 iterations, 0.02 for 10,000 to 430,000, 0.002 for 430,000 to 730,000, and 0.001 for the final stage, using Resnet50 weights pre-trained on ImageNet.
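The stepwise learning rate used for DeepLabCut and the improved DeepLabCut can be written as a small lookup. The breakpoints and rates are those given above; treating each breakpoint as an exclusive upper bound with zero-based iteration counting is an assumption, since the text does not say which side of a boundary each rate applies to.

```python
# (exclusive upper iteration bound, learning rate), taken from the text.
SCHEDULE = [
    (10_000, 0.005),
    (430_000, 0.02),
    (730_000, 0.002),
    (1_030_000, 0.001),
]


def learning_rate(iteration):
    """Learning rate for a zero-based iteration index."""
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    for end, rate in SCHEDULE:
        if iteration < end:
            return rate
    raise ValueError("iteration is beyond the 1,030,000 training iterations")
```

For example, `learning_rate(0)` returns 0.005 and `learning_rate(500_000)` returns 0.002 under this boundary convention.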
In the murine behavior recognition task, 3DCNN [33,34], LSTM, and Bi-LSTM [35,36] were compared with ConvLSTM. The 3DCNN network also used the keypoint feature map sequence as an input. This study used a four-layer 3D convolution; the numbers of convolution kernels were 64, 64, 128, and 256, respectively, and the size of the convolution kernel was 3 × 3 × 3. The 3DCNN network used a fully connected layer and the softmax function for classification, and was trained for a total of 10 epochs. Eleven features were extracted from the murine keypoints for training LSTM and Bi-LSTM, as shown in Figure 7. The LSTM network consisted of four layers with 64, 64, 128, and 256 neurons, respectively, used a fully connected layer for classification, and was trained for a total of 50 epochs. The Bi-LSTM network's training parameters were the same as those for LSTM, except that Bi-LSTM also utilized the information in the later frames.

Results of Keypoint Detection
Table 1 shows the results of a performance comparison between different keypoint detection algorithms, with a threshold of 0.4. As can be seen from Table 1, among CPM, Hourglass, and DeepLabCut, the accuracy of DeepLabCut for each keypoint was the highest, and the detection speed was 18.4 frames per second. After the ASPP module was added, DLC_b1_r2, DLC_b2_r2, and DLC_b12_r2 achieved high accuracy on each keypoint, and the PCK increased by 2-3%, indicating that the ASPP module can improve keypoint detection accuracy under a high expansion rate. At the same time, the results at different scales show that this improvement in accuracy was related to the improved accuracy of small-scale target detection, with the PCK at 80 cm increasing from 68% to 77%. The location at which the ASPP module was inserted had little effect, although accuracy tended to be higher when the module was located in the shallow layers of the network. The reason for this may be that the shallow layers of the network contain detailed features, the deep layers contain semantic features, and keypoints at different scales differ in their detailed features rather than in their semantic information. The addition of the ASPP module incurred extra calculation, and the FPS decreased, but this decline was acceptable. Considering the amount of computation, DLC_b1_r2 was selected as the optimal model in this paper.
For Table 1, the parameters were consistent during training, but the order of images was shuffled. Each algorithm was trained 10 times; the results are reported as mean differences at a 95% confidence interval. For detailed results, see the Data Availability Statement. "Keypoints" represents the PCK of each algorithm at different keypoints; "Heights" represents the PCK of each algorithm at different scales; "Avg" represents the average PCK of different keypoints.
Figure 8 compares the performance of each algorithm under different thresholds. A smaller threshold tolerates a smaller prediction error. Although the DeepLabCut series algorithms achieved high PCK when the threshold was 0.4, their performance was worse than that of the CPM and stacked hourglass algorithms at a lower threshold. The reasons for this may be as follows: First, the CPM and stacked hourglass algorithms both use intermediate supervision, making location information more accurate in continuous multistage learning. Second, the CPM and stacked hourglass algorithms both crop out the mouse region before detection, whereas the DeepLabCut series detects directly on the original image. Analysis of the series of improvements in DeepLabCut shows that the ASPP module with a high expansion rate not only improves prediction accuracy, but also reduces prediction error.
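The "mean ± interval" figures reported for the ten training runs could be obtained along the following lines. The source does not specify how the 95% interval was computed; this sketch assumes a two-sided Student-t interval on the mean, using the critical value 2.262 for nine degrees of freedom. The function name and the sample values in the usage example are hypothetical and are not data from the paper.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 9 degrees of freedom (10 runs).
T_CRIT_95_DF9 = 2.262


def mean_with_ci(values, t_crit=T_CRIT_95_DF9):
    """Return (mean, half-width of the confidence interval) for repeated runs."""
    n = len(values)
    mean = statistics.mean(values)
    half_width = t_crit * statistics.stdev(values) / math.sqrt(n)
    return mean, half_width


if __name__ == "__main__":
    # Hypothetical PCK values from ten runs, for illustration only.
    runs = [0.86, 0.88, 0.86, 0.88, 0.86, 0.88, 0.86, 0.88, 0.86, 0.88]
    mean, half = mean_with_ci(runs)
    print(f"{mean:.2f} ± {half:.2f}")
```

For a different number of runs, the critical value would need to be replaced by the one matching the new degrees of freedom.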
Figure 8. Comparison of algorithms under different thresholds: The thresholds were set to 0%, 10%, 20%, 30%, and 40% of the scale factor, and the distance between the ears was used as the scale factor.
Table 2 shows the accuracy of each behavior under different input parameters. Clearly, the accuracy of different behaviors varied greatly. The accuracy of resting behavior was the lowest, at less than 80%; the accuracies for grooming and walking straight were the highest, at greater than 95%. Second, the recognition rate varied greatly with different input parameters. For example, for resting behaviors, when the sequence interval was 1, the accuracy was the lowest, but for most behaviors, the influence of the input parameters did not adhere to obvious rules. Analyzing the input data leads to the conclusion that the time scales of different behaviors, or even of the same behavior, are not always the same, and that changes in input parameters lead to different information being received by the network. When information is redundant or lacking, it is difficult to form a consistent understanding of behaviors. Because the fact that behaviors occur at different time scales cannot be changed, the key is to enable the network to adapt to different time scales, which will be the focus of upcoming research. The results presented here demonstrate the feasibility of ConvLSTM for behavior recognition. This network had the highest accuracy of 0.93 ± 0.01 when the length of the sequence was seven and the sequence interval was zero.
For Table 2, each algorithm was trained 10 times; the results are reported as mean differences at a 95% confidence interval. "Sequence length" represents the length of the continuous image sequence; "Sequence interval" represents the interval between adjacent frames in a continuous image sequence; "Avg" represents the average accuracy of different behaviors.
Table 3 compares the recognition accuracy of different algorithms. It is apparent that the algorithms of the long short-term memory network series, such as LSTM, Bi-LSTM, and ConvLSTM, had higher accuracy than 3DCNN. Although 3DCNN can also process time series, its output is related only to the input, not to the order of the input. Thus, it could not handle the time series. Compared with the one-dimensional LSTM network, ConvLSTM achieved the highest accuracy of 0.93 ± 0.01, indicating that the implicit establishment of spatiotemporal information is more conducive to the network's understanding of behavior. Finally, compared with LSTM, the accuracy of Bi-LSTM was increased by 2.2%, indicating that the context information in the video sequence helped in recognizing behavior. This provides inspiration for the next step to improve the ConvLSTM network. A comparison of behaviors showed that the accuracy of the method in identifying resting was significantly lower than that in identifying the other behaviors. We generated a confusion matrix to analyze identifications by each algorithm to explore the reasons for the low accuracy in identifying resting behavior, as shown in Figure 9.
For Table 3, each algorithm was trained 10 times, and the results are reported as mean differences at a 95% confidence interval. "Avg" indicates the average accuracy of different behaviors.
It is apparent that the misrecognition of behaviors by different algorithms was similar. For resting and grooming behaviors, the misrecognition rate was high because, in both behaviors, the mouse is generally immobile; although the keypoints fluctuate over a small range in the grooming behavior, this jitter produced little difference in the keypoint feature map, which resulted in similar spatial distributions of the keypoint feature maps. Therefore, it is difficult for the network to distinguish between resting and grooming behaviors. On the other hand, the misrecognition rates of upright and turning behavior were also high, because upright behavior often contains turning behavior. In summary, this study reveals the complexity of behavior recognition: behaviors cannot simply be recognized directly through the network. One possible method is to decompose behavior into behavioral elements that are easily recognized by the algorithm, and then to infer behavior according to the combination of behavioral elements.

Discussion
We integrated the DeepLabCut algorithm with the ASPP module to detect the nose, left ear, right ear, and tail root of mice, and achieved a PCK of 0.87 ± 0.01. Compared with the original DeepLabCut, the overall PCK improved by 3% (from 84% to 87%), and the performance for small targets (shooting height of 80 cm) improved by 9%. This shows that ASPP can fuse multiscale information and enable the network to adapt to changes in the scale of the object. The closer the ASPP module was to the shallow layers, and the higher the expansion rate, the better the performance of the proposed method. The network delivered optimal performance when the ASPP module was at Block1 and the expansion rates were 2, 4, and 6. A possible explanation is that the shallow layers of the network contain more detailed information on objects and have a smaller receptive field. Shallow features thus represent objects at different scales differently, whereas the semantic information in the deep layers tends to be consistent across scales. As a result, the ASPP module worked well at the shallow level. A higher expansion rate may suit the dataset used in this paper, but the optimal expansion rates may differ between detection tasks, and this requires more detailed research. In addition, we found that when highly precise identification was required, the DeepLabCut series algorithms, because of their low input resolution, were not as accurate as CPM and Hourglass. Finding ways to make use of finer local information will be the focus of our next improvement to this algorithm.
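To make the link between expansion rate and receptive field concrete, the sketch below computes the spatial extent covered by each convolutional ASPP branch. It relies on the standard relation for dilated convolutions, in which a k × k kernel with rate r spans k + (k − 1)(r − 1) pixels per side; the function name `effective_kernel_size` is hypothetical, and the code illustrates the arithmetic only, not the network itself.

```python
def effective_kernel_size(kernel_size: int, rate: int) -> int:
    """Spatial extent covered by a dilated convolution kernel.

    A k x k kernel with expansion (dilation) rate r inserts r - 1 gaps
    between adjacent taps, so it spans k + (k - 1) * (r - 1) pixels per side.
    """
    if kernel_size < 1 or rate < 1:
        raise ValueError("kernel_size and rate must be positive integers")
    return kernel_size + (kernel_size - 1) * (rate - 1)


# Expansion rates reported as optimal in the text.
RATES = (2, 4, 6)

# Convolutional ASPP branches: one 1 x 1 convolution plus three 3 x 3
# dilated convolutions. The global-average-pooling branch is omitted here
# because it covers the whole feature map by construction.
branch_extents = {"1x1": effective_kernel_size(1, 1)}
for r in RATES:
    branch_extents[f"3x3, rate {r}"] = effective_kernel_size(3, r)

for name, extent in branch_extents.items():
    print(f"{name}: covers {extent} x {extent} input pixels")
```

With rates 2, 4, and 6, the three dilated branches cover 5 × 5, 9 × 9, and 13 × 13 neighborhoods with the same nine weights each, which is how concatenating the branches mixes context at several scales without adding parameters per branch.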
The traditional analysis of the behavior of mice has focused only on the parameters of their movement or posture [37][38][39][40][41], and has paid little attention to the behaviors themselves. We propose an algorithm that integrates a keypoint detection model with the ConvLSTM network to detect the motion behavior of mice. This algorithm achieved an average accuracy of 93.8% in identifying the five behaviors of walking straight, resting, grooming, standing upright, and turning, which was higher than the accuracies of the LSTM, Bi-LSTM, and 3DCNN. Behavior can be expressed as a pattern of spatiotemporal changes in posture. The map of keypoint features containing posture-related information was used as the input to the ConvLSTM network to implicitly establish the relationships between keypoints. The ConvLSTM network passed the state information of the previous moment to the next moment, which is consistent with this definition of behavior. The output of the ConvLSTM network thus contained temporal and spatial information on behavioral changes. We experimentally demonstrated the feasibility of this method, and explored how the temporal scale of behavior affects network performance. Different behaviors are sustained for different durations, but the length of the input sequence of the network was fixed, which led to varying network performance at different time scales. Although we experimentally determined the optimal time scale parameters, the network did not have the ability to adapt to different time scales of behavior. In addition, ConvLSTM could not reliably distinguish between similar behaviors, such as resting and grooming. One possible solution is to decompose behaviors into behavioral elements that are easily identifiable by the algorithm, and then to combine different behavioral elements into behaviors [42,43].
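The fixed time scale discussed above comes down to how input sequences are cut from a video. The following sketch enumerates the frame indices of each sequence for a given sequence length and sequence interval. It assumes that an interval of g means g frames are skipped between adjacent frames of a sequence (so an interval of zero gives consecutive frames); the function name `sequence_indices` and the `hop` parameter are hypothetical and not taken from the paper.

```python
def sequence_indices(num_frames, seq_len=7, interval=0, hop=1):
    """List the frame indices of every input sequence that fits in a video.

    seq_len  : number of frames per sequence (7 was optimal in the paper).
    interval : frames skipped between adjacent frames of one sequence
               (0 was optimal; this reading of "interval" is an assumption).
    hop      : offset between the starts of consecutive sequences
               (hypothetical parameter, not taken from the paper).
    """
    if seq_len < 1 or interval < 0 or hop < 1:
        raise ValueError("seq_len and hop must be >= 1 and interval >= 0")
    step = interval + 1
    span = (seq_len - 1) * step + 1  # frames covered from first to last index
    return [
        list(range(start, start + span, step))
        for start in range(0, num_frames - span + 1, hop)
    ]


# Sequence length 7, interval 0: each sequence is 7 consecutive frames.
windows = sequence_indices(num_frames=10, seq_len=7, interval=0)

# Sequence length 3, interval 2: the same 7-frame span, sampled sparsely.
sparse = sequence_indices(num_frames=10, seq_len=3, interval=2)

print(windows[0], sparse[0])
```

Both settings above cover the same seven-frame span, but the second one sees only three frames of it. Whatever values are chosen, the span is the same for every behavior, which is why a single setting cannot match behaviors of different durations.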
We can use DeepLabCut and the ConvLSTM network to detect both the parameters of motion and the behavior of mice. For example, we can use keypoints to calculate the speed of movement of a mouse, draw a trajectory map, and determine the difference between its central and peripheral movements. The results of behavior recognition for each frame can then be used to calculate the frequency and duration of each behavior. We thus provide an accurate and quantitative tool for behavioral analysis that can reduce the workload of researchers and support the objective analysis of experimental data. However, the proposed algorithm still has limitations. For example, the FPS of the improved DeepLabCut was reduced from 18.4 to 16.7 due to the addition of the ASPP module. Although this frame rate still satisfies the requirements of use, it renders the algorithm unsuitable for some scenarios requiring real-time detection. In addition, the proposed algorithm requires training two models, which is cumbersome. Implementing an end-to-end model will also be a focus of our future research in this area.
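As an illustration of the derived measures mentioned above, the sketch below computes the mean speed of one keypoint track and the frequency and duration of each behavior from per-frame labels. It is a minimal example under stated assumptions: the function names, the `cm_per_pixel` scale factor, the frame rate, and the toy data are all hypothetical, and a "bout" is taken to be a maximal run of consecutive frames with the same label.

```python
import math
from itertools import groupby


def mean_speed(points, fps, cm_per_pixel=1.0):
    """Average speed of one keypoint track, in scaled units per second.

    points is a list of (x, y) positions, one per frame.
    """
    if len(points) < 2:
        return 0.0
    path = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    duration = (len(points) - 1) / fps
    return path * cm_per_pixel / duration


def behavior_bouts(frame_labels, fps):
    """Summarize per-frame labels as bout counts and total durations.

    Returns {label: {"bouts": int, "seconds": float}}; a bout is a maximal
    run of consecutive frames with the same label.
    """
    summary = {}
    for label, run in groupby(frame_labels):
        entry = summary.setdefault(label, {"bouts": 0, "seconds": 0.0})
        entry["bouts"] += 1
        entry["seconds"] += sum(1 for _ in run) / fps
    return summary


# Toy data, made up for illustration; these are not measurements from the paper.
FPS = 2  # hypothetical video frame rate
track = [(0, 0), (3, 4), (6, 8)]  # e.g. tail-root positions in pixels
labels = ["walking"] * 4 + ["resting"] * 2 + ["walking"] * 2

speed = mean_speed(track, fps=FPS)
bouts = behavior_bouts(labels, fps=FPS)
print(speed, bouts)
```

In practice, single-frame misclassifications would split one bout into several, so a smoothing step over the per-frame labels (for example, a majority vote in a short window) may be worth adding before counting bouts.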

Conclusions
In this study, we proposed a method to identify the motion behavior of mice based on the DeepLabCut model and the ConvLSTM network. The results of our experiments showed that the ASPP module can improve the multiscale representation capability of the network. The average PCK of DeepLabCut increased from 84% to 87% after the ASPP module was added, and the accuracy of the method in detecting small targets increased by 9%. The performance of the network was better when the ASPP module was located in the shallow layers of the network. We also demonstrated ConvLSTM's ability to extract temporal and spatial information from keypoint feature maps, and used it for behavior classification. Moreover, we verified the effect of the temporal scale of behavior on the performance of the model. When the length of the sequence was seven and the sequence interval was zero, the proposed method delivered the best performance, with an average accuracy of 93.8%, which was higher than those of the LSTM, Bi-LSTM, and 3DCNN.