5.1. Model Feature Map Analysis
In the detection process, the output of each convolutional layer superimposes information from multiple feature maps. Feature maps are the most representative part of the image features extracted by a convolutional layer, and the quality of the feature information they carry directly determines the quality of the model. In this study, the model was used to extract feature maps from images in the cow feeding behaviour dataset captured from the front and from above, and the model was further analysed on the basis of this feature information.
Because the feature maps differ in size, some of them cannot be interpreted manually, so several representative feature maps were selected. The DRN-YOLO feature maps of different layers are shown in
Figure 11. As the network deepens, it places increasing emphasis on the semantic representation of the object, and the feature maps become more abstract. This plays to the computer's strength in processing high-dimensional, deep semantic feature information to achieve accurate recognition.
Based on the DRN-YOLO model, the receptive field of the feature map was enlarged to enrich the feature information. In
Figure 11, the brighter an area, the more concentrated the features in that area during detection. After feature extraction by the model, the feature map immediately before the detection module highlights only the feature information of the cow's head during feeding, which verifies the effectiveness of this study's algorithm for extracting features of cow feeding behaviour.
5.2. Comparative Performance Analysis
5.2.1. Performance Comparison of Characteristic Scales
Detecting cow feeding behaviour requires adequate extraction of features of the behaviour the cow is currently performing so that an accurate judgment can be made. The YOLOv4 model, however, has insufficient feature extraction ability, which is most directly reflected in its low mAP. To enhance the extraction of cow feeding behaviour features, we augmented the feature pyramid network [27] by adding a feature scale that allows a closer connection between the deeper layers of the backbone network and the neck network. As can be seen from
Table 3 and
Table 4, with the new feature scale added, the mAP on the training data set of cow feeding behaviour photographed from the front increased from 95.13% to 95.86%, the mAP on the training data set photographed from above increased from 95.01% to 95.53%, and precision and recall improved to varying degrees. By adding a feature scale to extract features of the cow's current feeding behaviour, the detection failure rate of cow feeding behaviour was reduced and the detection performance of the model was improved.
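The relationship between detection scales and feature strides can be sketched with simple arithmetic. The text does not state the stride of the added scale, so the stride-4 value below is an assumption for illustration only; the three baseline strides (8, 16, 32) are YOLOv4's defaults for the 416 × 416 input used here.

```python
def grid_sizes(input_size, strides):
    """Spatial grid size of each detection scale for a square input."""
    return [input_size // s for s in strides]

# YOLOv4's three default detection scales use strides 8, 16 and 32,
# giving 52x52, 26x26 and 13x13 grids for a 416x416 input.
three_scale = grid_sizes(416, [8, 16, 32])    # [52, 26, 13]

# A fourth, finer scale (stride 4 is an assumption for illustration)
# would add a 104x104 grid that favours smaller, finer features.
four_scale = grid_sizes(416, [4, 8, 16, 32])  # [104, 52, 26, 13]
```

The extra grid gives the neck one more resolution at which to fuse backbone features, which is the "closer connection" the ablation measures.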
5.2.2. Performance Comparison of SPP Pooling Structures
Increasing to a four-feature scale raised the mAP, strengthened the connectivity between the networks, and enhanced the ability to extract features of cow feeding behaviour, but the receptive field remained insufficient, so feature extraction was not comprehensive enough. The SPP pooling structure was therefore tested to address this problem. The SPP pooling structure enriches the feature information of the shallow network and enlarges the receptive field, thereby improving the model's detection performance. With the SPP pooling structure added, the mAP on the training dataset of cow feeding behaviour taken from the front was 96.27%, 1.14% higher than that of YOLOv4, and the mAP on the training dataset taken from above was 95.97%, 0.96% higher than that of YOLOv4.
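The principle behind the SPP block (max pooling the same features at several window sizes and concatenating the results with the input, so that several receptive fields are fused) can be sketched in one dimension. This is an illustrative toy, not the network code; the window sizes 5, 9, and 13 are those commonly used in YOLOv4's SPP and are an assumption here.

```python
def max_pool_same(xs, k):
    """Max pooling with window k, stride 1, and 'same' padding."""
    r = k // 2
    n = len(xs)
    return [max(xs[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]

def spp_1d(xs, kernels=(5, 9, 13)):
    """Toy 1-D SPP: concatenate the input with max-pooled copies at
    several window sizes, mimicking the multi-receptive-field fusion
    of the SPP block (channel-wise concatenation in the real network)."""
    pooled = [max_pool_same(xs, k) for k in kernels]
    return [xs] + pooled
```

Because the input itself is kept alongside the pooled copies, local detail and wide-context responses reach the next layer together, which is what enlarges the effective receptive field.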
5.2.3. Performance Comparison of DRN Modules
The four-feature scale and the SPP pooling structure improved the model's detection of cow feeding behaviour, but the added structures make the model bulkier. To reduce the complexity of the model while increasing its depth, the CSPDarknet module in YOLOv4 was replaced with a DRN module. When feature information passes through the DRN module, it is, on the one hand, passed directly from the shortcut channel to the output, preserving the integrity of the information; on the other hand, the model only needs to learn the residual between input and output, which simplifies the learning objective and difficulty while also mitigating the vanishing-gradient problem and reducing the model's memory consumption. Using the DRN module alone, the mAP on the training dataset of cow feeding behaviour taken from the front was 96.16%, 1.03% higher than that of YOLOv4, and the mAP on the training dataset taken from above was 95.69%, 0.68% higher than that of YOLOv4.
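The shortcut mechanism described above is the generic residual principle, sketched below. This is not the authors' DRN implementation; `small_correction` is a hypothetical stand-in for the learned layers, included only to show that the block has to model just the input-output difference.

```python
def residual_block(x, f):
    """Identity-shortcut residual unit: the input x is carried to the
    output unchanged, and the block only has to learn the residual f(x)."""
    return [xi + fi for xi, fi in zip(x, f(x))]

# Hypothetical learned transform: if the optimal mapping is close to
# the identity, f only needs to produce small corrections.
def small_correction(x):
    return [0.1 * xi for xi in x]

out = residual_block([1.0, 2.0], small_correction)  # close to [1.1, 2.2]
```

Because the identity path is additive, gradients flow through it unchanged during backpropagation, which is why the shortcut eases the vanishing-gradient problem mentioned above.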
5.3. DRN-YOLO vs. YOLOv4
The above analysis shows that DRN-YOLO, which combines the DRN module and the SPP structure with the added feature scale, outperforms the YOLOv4 model. The two models are compared below in terms of their loss curves.
Figure 12a compares the loss curves on the cow feeding behaviour training data set taken from the front, and
Figure 12b compares the loss curves on the training data set taken from the top. As can be seen, after 100,000 training iterations, the DRN-YOLO model fits both the front and top training data sets well, with no peak fluctuations and stable behaviour throughout training. The YOLOv4 model, in contrast, fluctuates strongly on both data sets, with a break in the loss value at around 80,000 iterations and, on the top-view data set, a peak fluctuation at around 55,000 iterations with a loss value of 4.8 or more. During DRN-YOLO training the loss varied within a small range, generally within 0.1, whereas during YOLOv4 training it varied within a much larger range, roughly 0.5–0.6. The loss-curve analysis thus shows that DRN-YOLO is a substantial improvement over YOLOv4.
The F1-score is an important indicator of model quality: it integrates the precision and recall of model training, so analysing the model's F1-score matters.
Figure 13a shows the F1-score on the cow feeding behaviour dataset taken from the front and
Figure 13b the F1-score on the dataset taken from the top; the red curve is the F1-score of the DRN-YOLO model, and the orange curve is that of the YOLOv4 model. Over the 113 iterations, the F1-score of the DRN-YOLO model is 0.2–0.3 higher than that of the YOLOv4 model, which again indicates that DRN-YOLO performs better than YOLOv4.
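Since the F1-score is the harmonic mean of precision and recall, the reported values can be cross-checked directly; the example below uses the front-view DRN-YOLO precision and recall reported in Section 5.4.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Cross-check with the front-view DRN-YOLO results in Section 5.4
# (precision 97.16%, recall 96.51%):
f1 = round(f1_score(97.16, 96.51), 2)  # 96.83, matching the reported F1-score
```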
The mAP is an important indicator of how accurately the model identifies object classes.
Figure 14a shows the mAP on the cow feeding behaviour dataset taken from the front, and
Figure 14b shows the mAP on the dataset taken from the top; the red curve is the mAP of the DRN-YOLO model, and the orange curve is that of the YOLOv4 model. The mAP of the DRN-YOLO model basically remained above 0.98, while that of the YOLOv4 model stayed between 0.95 and 0.96, demonstrating that DRN-YOLO identifies object classes more accurately than YOLOv4.
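For reference, mAP is typically computed as the mean, over classes, of the area under each class's precision-recall curve. The sketch below uses all-point interpolation; the exact evaluation protocol of the toolchain used in this study is not specified here, so this is a generic illustration.

```python
def average_precision(recalls, precisions):
    """All-point-interpolated AP: make precision monotonically
    non-increasing from right to left, then integrate over recall."""
    prec = precisions[:]
    for i in range(len(prec) - 2, -1, -1):
        prec[i] = max(prec[i], prec[i + 1])
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, prec):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

def mean_average_precision(ap_per_class):
    """mAP averages the AP over all behaviour classes."""
    return sum(ap_per_class) / len(ap_per_class)
```

For a toy curve with recalls [0.5, 1.0] and precisions [1.0, 0.5], the interpolated AP is 0.75.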
5.4. Comparison of DRN-YOLO with Classical Object Detection Algorithms
The performance of a deep learning model needs to be assessed by comparison with other models [28]. To validate the effectiveness of deep learning algorithms for cow feeding behaviour detection and to further analyse the performance of the DRN-YOLO model, we compared it comprehensively with three classical models, YOLOv4, SSD, and Faster RCNN, using precision, recall, mAP, and F1-score as evaluation metrics. The same dataset and test set were used for training and testing, with a uniform image input size of 416 × 416 and consistent experimental parameters. The comparison results for each model on the datasets of cow feeding behaviour taken from the front and from above are shown in
Table 5 and
Table 6.
The experimental results show that, on the cow feeding behaviour datasets taken both from the front and from above, the overall performance of the DRN-YOLO model was about 2% better than that of the YOLOv4 and SSD models and similar to that of the Faster RCNN model, exceeding it by only about 0.05%. However, Faster RCNN is a two-stage detection model, while DRN-YOLO is a single-stage model; a two-stage model must pass through both the RPN branch and the classification branch during detection, so Faster RCNN takes longer to detect than DRN-YOLO. In training and testing, the DRN-YOLO model detected the feeding behaviour of cows photographed from the front with a precision, recall, mAP, and F1-score of 97.16%, 96.51%, 96.91%, and 96.83%, respectively, improvements of 1.70%, 1.82%, 0.97%, and 1.76% over YOLOv4. For cows photographed from above, the corresponding values were 96.84%, 96.25%, 96.49%, and 96.55%, increases of 1.67%, 1.27%, 1.48%, and 1.48% over YOLOv4. DRN-YOLO thus improves on YOLOv4 for cow feeding behaviour detection, performs slightly better than the two-stage Faster RCNN model, and detects in less time. The DRN-YOLO model was shown to be more comprehensive, faster, and more accurate in detecting cow feeding behaviour, meeting the requirement of accurate and fast detection.
Recent studies on dairy cow feeding behaviour include Refs. [5,11]. Ref. [5] used CNN 2 to detect dairy cow feeding behaviour with a precision of 94.18% at 5800 iterations; Ref. [11] detected dairy cow feeding behaviour using sound and a deep learning algorithm, with a final precision, recall, and F1-score of 79.3%, 79.7%, and 79.5%, respectively. Compared with these two algorithms, DRN-YOLO improves precision by 2.98% over the method of [5] and improves precision, recall, and F1-score by 17.86%, 16.81%, and 17.33%, respectively, over the method of [11]. DRN-YOLO thus shows a large improvement over both dairy cow feeding behaviour detection algorithms, which validates its feasibility for detecting dairy cow feeding behaviour.
5.5. Limitations Analysis
A limitation of the algorithm proposed in this paper is that only an ablation test was carried out for each detection module; the model was not compared with other object detection and object tracking models, so comparison data are lacking. In addition, when identifying the feeding behaviour of dairy cows, the model is prone to misjudging grass-arching behaviour as feeding behaviour. Preliminary analysis suggests that this is because the features of grass-arching and feeding behaviour are highly similar and grass arching is generally completed in a very short time, so the model tends to judge the two as the same behaviour. The model therefore does not yet effectively exclude the confounding grass-arching behaviour.
In future research, cow feeding behaviour could be further subdivided so that hard-to-distinguish behaviours such as chewing, swallowing, regurgitation, and pushing can be identified and detected with deep learning techniques. The feed depth information collected by the ZED binocular camera could be used to analyse the feed consumption of cows over a given period, enabling real-time calculation of cow feed intake. Combined with the results of this paper, this would allow the design and development of an automatic monitoring system for dairy cows' feeding behaviour, meeting the requirement of long-term monitoring.