1. Introduction
Pork is the most widely consumed meat product worldwide, and as living standards rise, consumers pay increasing attention to its quality.
Object detection technology has high value in improving animal welfare. Early in pig production, it can be utilized to monitor pig health and thereby improve pork quality [1]. The dietary behavior of animals is closely related to their health status, and subtle dietary changes are important indicators for animal health observation [2]. When pigs are sick, they usually show reduced feeding, reduced exercise, depression, and lethargy [3]. Leveraging scientific methods to monitor live pigs, with human intervention where necessary, helps protect animal welfare and promote pork quality and profitability. Initially, the vast majority of pigs were monitored manually, which entailed high labor intensity. Moreover, subjective judgment errors often occurred during monitoring, which was not conducive to high-quality pig production.
To handle the above problems, early researchers used RFID systems to monitor pigs' diets. However, the sensitivity of RFID monitoring is often affected by the surrounding environment as well as the tag's height, orientation, and distance [4]. To obtain better accuracy, the antenna position must be adjusted constantly [5], and RFID monitoring requires a large number of pig ear tags, which are time-consuming and expensive to maintain. Furthermore, approaches that depend on wearable RFID equipment suffer from additional problems: the tags are easily damaged, invasive, and prone to causing infection [6,7]. Subsequently, breeders deployed cameras to record pig behavior and manually analyzed the recorded video to assess the pigs' health status. Because these methods are all based on manual analysis, they significantly increase the breeders' workload.
Despite some desirable results, the above-mentioned methods suffer from compromised animal welfare and high physical labor intensity. This makes it urgent to leverage more efficient methods for pig detection. The following are some effective attempts by researchers.
In [8], Kashiha et al. achieved 89.8% accuracy in pig movement detection by leveraging elliptical displacement calculation methods. Concerning individual pig classification, Kashiha et al. [9] obtained 88.7% accuracy in individual pig identification via an ellipse fitting technique. Based on traditional computer vision technology [10], Nasirahmadi et al. [11,12] employed ellipse fitting and the Otsu algorithm to realize individual pig detection and pig lying-position detection. Furthermore, Nasirahmadi [13] utilized a support vector machine (SVM) algorithm to classify pig poses, achieving 94% classification accuracy. Leveraging the linear discriminant analysis algorithm, Viazzi et al. [14] achieved 89.0% accuracy in recognizing aggressive behavior of pigs. A more promising route for individual identification and behavioral recognition of pigs is based on 2D or 3D cameras. For example, Matthews et al. [15] utilized a depth camera to track the movement of pigs, enabling effective detection of standing, eating, and drinking behaviors. Using a depth sensor, Kim et al. [16] realized pig standing-behavior recognition in a complex environment and verified the effectiveness of the proposed method in terms of both cost and accuracy. Based on images captured by a CCD camera, Nasirahmadi et al. [11] utilized an ellipse fitting approach to locate each pig in the image. While cameras can easily record pig behavior, factors such as farm environment and lighting conditions can make pig classification challenging.
Currently, deep-learning-based approaches [17,18,19] have achieved promising detection performance, especially in the field of animal phenotype detection. For example, Wu et al. [20] proposed an effective lameness detection method based on YOLOv3 and a relative step size characteristic vector. Specifically, the YOLOv3 algorithm was utilized to detect the positions of the cows' legs, the relative step size characteristic vector was computed, and an LSTM model was then employed to distinguish normal walking from lame behavior, obtaining an accuracy of 98.57%. Shen et al. [21] first applied the YOLO model to detect cows, and then an improved AlexNet model was employed to classify each detected individual cow, obtaining 96.65% accuracy for individual cow classification. Tassinari [22] proposed a deep-learning-based system for individual cow classification and location analysis. Zhang [23] proposed a lightweight YOLO detection model, using MobileNetV3 to replace the backbone network of YOLOv3, and obtained 96.8% accuracy in detecting key positions of cattle. Hu et al. [24] employed the YOLO algorithm to extract cow objects, and then a segmentation algorithm was utilized to extract the head, torso, and leg parts of each cow object; deep feature fusion was performed on these extracted parts, and an SVM classifier was employed for the final classification, obtaining an accuracy of 98.36%. Jiang [25] proposed a filter-based YOLOv3 algorithm and achieved 99.18% accuracy in detecting key parts of cows. Based on an RGB camera and a convolutional neural network, Bezen [26] built a computer vision system for measuring cow feeding, obtaining an accuracy of 93.65%. Achour [27] built a CNN-based image analysis system for the classification of individual cows, their foraging behavior, and their food; their model obtained an accuracy of 97% for individual cow classification and 92% for cow foraging behavior. Wu [28] proposed a CNN-LSTM (fusion of Convolutional Neural Network and Long Short-Term Memory network) model for cow action recognition; the action categories included drinking, ruminating, walking, standing, and lying down, and the average classification accuracy reached 97.6%.
Above all, while the monitoring method using RFID ear tags is simple, it often causes harm to the pigs and compromises animal welfare. Although traditional computer vision technology [29] can improve animal welfare and recognition accuracy, its slow detection speed makes it unsuitable for industrial production requirements. Furthermore, when the pigs occlude each other, or the sizes of the target pigs in the image vary greatly, the detection performance of such models drops significantly.
The attention mechanism is an important means of improving feature robustness [30], and among attention designs, the Convolutional Block Attention Module (CBAM) [31] has shown promising success in a broad range of fields. Given an intermediate feature map, CBAM infers attention maps along two independent dimensions, channel and spatial. The input feature map is then multiplied with these attention maps for adaptive feature refinement. Since CBAM is a general and lightweight component, it can be seamlessly incorporated into any CNN architecture for end-to-end training with negligible overhead.
Furthermore, the Spatial Pyramid Pooling (SPP) [32] module realizes feature-map-level fusion of local and global features, enriching the expressiveness of the final feature map.
Consequently, leveraging the advantages of CBAM and SPP, this paper proposes YOLOv3-SC, a novel improved pig detection model based on YOLOv3. The YOLOv3-SC model enables efficient detection of pigs, remains effective under occlusion, and achieves effective detection of multi-scale pig targets. The main contributions of this paper are summarized as follows:
We first propose a novel pig detection method, YOLOv3-SC, based on the CBAM and SPP modules. The channel attention and spatial attention units in the CBAM module enable YOLOv3-SC to focus on the regions of the image that are important for detection, thereby extracting richer, more robust, and more discriminative features. The SPP module endows YOLOv3-SC with the capacity to extract multi-scale features, which enables the model to detect objects of different sizes, thereby improving pig detection performance. Specifically, our model achieves the best performance on the pig detection task, with a 2.3% improvement over the existing models.
Numerous ablation experiments have been designed and performed to verify the performance of our model. Specifically, these studies include comparisons of different models, evaluation of the effectiveness of the SPP module, evaluation of the effectiveness of the CBAM module, and evaluation of the superiority of YOLOv3-SC.
2. Materials and Methods
2.1. Datasets
The individual pig detection dataset utilized in this paper was collected from a pig farm in Jinzhong City, Shanxi Province, China. The farm uses captive breeding: the pens are enclosed by iron fences to form a closed area, and the ground is cement concrete. The data collection cameras were installed at a height of 3 m above the ground, at 45° angles and directly above the pens. Through this collection strategy, the whole view of the pigs and their range of activity can be well captured. The data collection period lasted for two months, from August to October 2020. Videos with poor picture quality caused by factors such as lighting were deleted, leaving a total of about 2 TB of video data. The videos were sliced into RGB image frames at a sampling rate of 25 frames/s, and frames with no target objects, blurring, or poor quality were deleted. The labelImg tool was then employed to annotate the image frames in the PASCAL VOC format, and the labeled data were saved as XML files. Finally, we obtained a dataset with a total of 4019 images and 13,996 annotations; some sample images are shown in
Figure 1.
Figure 1a,b show images captured by the cameras mounted at 45° angles above the farm, and
Figure 1c shows images captured by the camera directly above the farm. To evaluate the performance of the proposed model, the dataset is divided as follows: 3255 samples are employed as training data, 362 samples as validation data, and the remaining 402 samples as test data; the test samples are unseen during training. Further, to increase the diversity of the data and allow the model to obtain richer features, this paper adopts data augmentation techniques such as random scaling, random flipping, and random cropping.
2.2. Technical Route
The technical route of the individual pig detection model proposed in this paper is shown in
Figure 2. To reduce the noise in the data and enable the model to obtain better detection ability, the samples are first preprocessed and augmented. Specifically, the preprocessing used for noise reduction consists of deleting poor-quality samples from the dataset and resizing the input images to a fixed size of 416 × 416. To increase the diversity of the data, we adopted the following augmentation methods: random_distort, random_expand, random_interp, random_flip, and shuffle_gtbox. Subsequently, the processed data are fed to YOLOv3-SC for model training and evaluation. Finally, an effective individual pig detection model is obtained, which realizes fast and accurate individual pig detection.
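As an illustration, the resize and flip steps of such a preprocessing pipeline can be sketched as follows (a minimal numpy sketch with hypothetical helper names; it does not reproduce the actual random_distort, random_expand, etc. operations):

```python
import numpy as np

def resize_nearest(img, size=416):
    """Resize an HxWxC image to size x size with nearest-neighbor sampling."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

def random_flip(img, boxes, p=0.5, rng=None):
    """Horizontally flip an image and its [x1, y1, x2, y2] boxes with probability p."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < p:
        img = img[:, ::-1]
        w = img.shape[1]
        boxes = boxes.copy()
        boxes[:, [0, 2]] = w - boxes[:, [2, 0]]  # mirror and swap x-coordinates
    return img, boxes
```

Box coordinates must be transformed together with the pixels, which is why the flip helper updates both.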
2.2.1. Feature Extraction
The backbone network DarkNet-53, augmented with CBAM modules, is utilized to extract features with richer spatial and channel-wise dependencies, which facilitates pig detection significantly.
2.2.2. Feature Fusion
The Feature Fusion in Figure 2 refers to two kinds of fusion: the multiple YOLO-head fusion and the SPP (Spatial Pyramid Pooling) fusion. The detailed feature fusion operations are described in Section 2.5.
2.3. YOLOv3
Based on YOLOv1 [33] and YOLOv2 [34], the YOLOv3 model was proposed, leveraging the anchor mechanism, the BN operation, and multi-scale fusion strategies. The basic principle of YOLOv3 is to divide the input image into S × S grids, where S ∈ {13, 26, 52}, and each grid predicts 3 anchors. Each anchor has 5 parameters (x, y, w, h, c), where x and y are the coordinate positions of the anchor, w and h represent the width and height of the anchor, and c is the confidence level of the predicted object. In addition to the parameters of the anchor, the YOLOv3 algorithm predicts the probability of each category, and the confidence level is computed by the following Equation (1):

c = Pr(Object) × IoU(pred, truth),        (1)

where Pr(Object) is set to 0 or 1 and IoU(pred, truth) denotes the intersection-over-union of the predicted and ground-truth boxes. The confidence reflects whether the grid contains an object and, when it does, the accuracy of the predicted box. Finally, redundant anchors are eliminated by non-maximum suppression (NMS), and the positions and sizes of the remaining anchors are adjusted to produce the final result. YOLOv3 improves detection performance by combining the anchor mechanism inherited from YOLOv2 with a K-means clustering algorithm, which is employed to obtain suitable prior box sizes, shown in Table 1 [35].
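The prior-box clustering can be sketched as follows (a simplified numpy sketch using the common 1 − IoU distance over box widths and heights; the function names and the mean-based update rule are illustrative):

```python
import numpy as np

def wh_iou(boxes, centers):
    """IoU between (N, 2) width/height boxes and (K, 2) cluster centers,
    assuming all boxes share a common top-left corner."""
    inter = np.minimum(boxes[:, None, 0], centers[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centers[None, :, 1])
    area_b = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_c = (centers[:, 0] * centers[:, 1])[None, :]
    return inter / (area_b + area_c - inter)

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster ground-truth box sizes into k prior (anchor) sizes."""
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(wh_iou(boxes, centers), axis=1)  # nearest = highest IoU
        for j in range(k):
            if np.any(assign == j):
                centers[j] = boxes[assign == j].mean(axis=0)
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]  # sorted by area
```

Using 1 − IoU rather than Euclidean distance keeps large and small boxes from dominating each other during clustering.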
By introducing suitable prior boxes, the network no longer needs to randomly generate anchor boxes of different sizes to predict objects, which makes training converge faster. YOLOv3 leverages a multi-scale strategy for object detection, so it can detect more objects, and smaller objects, than YOLOv1 and YOLOv2. Concretely, the YOLOv3 network consists of four parts: the input unit, the backbone network unit, the neck unit, and the output unit. The backbone network of YOLOv3 is Darknet-53, whose basic unit is the residual structure [36]; this alleviates the gradient vanishing and explosion problems caused by deepening the network. The network structure of YOLOv3 is shown in Figure 3.
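The NMS post-processing step described above can be sketched as follows (a minimal greedy NMS over [x1, y1, x2, y2] boxes; the threshold value is illustrative):

```python
import numpy as np

def box_iou(a, b):
    """IoU between one box a and an (N, 4) array of boxes b, format [x1, y1, x2, y2]."""
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]  # highest-scoring box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        ious = box_iou(boxes[i], boxes[order[1:]])
        order = order[1:][ious < iou_thresh]  # drop boxes overlapping the winner
    return keep
```

Each round keeps the highest-scoring remaining box and discards every box that overlaps it above the IoU threshold.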
2.4. Attention Module
Pigs exhibit different actions at different times of the day and often occlude each other. These situations can lead to a lack of distinct behavioral identification features in pig datasets. To handle these problems and improve the accuracy of the network by capturing more effective features, the network must learn action features adaptively. Consequently, this paper proposes an attention-enhanced YOLOv3 network, which utilizes an attention module to make the neural network pay more attention to the image regions that play a key role in action discrimination.
The Convolutional Block Attention Module (CBAM) [31] is a lightweight unit that consists of two sequential parts: the Channel Attention Module (CAM) and the Spatial Attention Module (SAM). By combining channel and spatial attention, CBAM can obtain rich semantic information in pig images, capture dependencies within the channel and spatial dimensions of the feature map, down-weight unimportant information, and improve individual pig detection performance. The structure of the CBAM module is shown in Figure 4.
Specifically, CAM first performs a global maximum pooling operation and a global average pooling operation on the input feature map, and both pooled descriptors are sent through a shared Multi-Layer Perceptron (MLP). The two MLP outputs are summed and passed through a sigmoid function, which constrains the weights to the range (0, 1). The final result is obtained by multiplying these weights with the original input feature map.
SAM first performs a maximum pooling operation and an average pooling operation along the channel dimension of the input feature map and concatenates the two resulting maps. The concatenated map is then passed through a convolution and the sigmoid function to constrain the weights to (0, 1), and finally multiplied with the original input feature map to obtain the result.
The CBAM module can be represented by the following equations:

F′ = Mc(F) × F,
F″ = Ms(F′) × F′,
Mc(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))),
Ms(F′) = σ(f7×7(Concat(AvgPool(F′), MaxPool(F′)))),

where Mc represents the channel attention module (CAM), Ms denotes the spatial attention module (SAM), F indicates the feature map input to the network, and F′/F″ denote the feature maps after the Channel Attention Module (CAM)/Spatial Attention Module (SAM), respectively. × represents pointwise multiplication, MLP is the shared fully connected operation, AvgPool denotes the average pooling operation, MaxPool indicates the maximum pooling operation, Concat represents the tensor-splicing operation, + represents summation, f7×7 denotes a 7 × 7 convolution, and σ denotes the sigmoid activation function.
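These equations can be sketched in a framework-agnostic way as follows (a minimal numpy sketch; the weights W1, W2 and the 7 × 7 kernel are illustrative placeholders for learned parameters, which a real implementation would manage through its deep learning framework):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    """CAM: shared two-layer MLP over global average- and max-pooled descriptors.
    F: (C, H, W); W1: (C//r, C); W2: (C, C//r)."""
    avg = F.mean(axis=(1, 2))                       # (C,) global average pooling
    mx = F.max(axis=(1, 2))                         # (C,) global maximum pooling
    mlp = lambda v: W2 @ np.maximum(W1 @ v, 0.0)    # shared MLP with ReLU
    Mc = sigmoid(mlp(avg) + mlp(mx))                # channel weights in (0, 1)
    return F * Mc[:, None, None]

def spatial_attention(F, kernel):
    """SAM: 7x7 conv over channel-wise avg/max maps. kernel: (2, 7, 7)."""
    stacked = np.stack([F.mean(axis=0), F.max(axis=0)])  # (2, H, W)
    k = kernel.shape[-1]; p = k // 2
    padded = np.pad(stacked, ((0, 0), (p, p), (p, p)))
    H, W = F.shape[1:]
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return F * sigmoid(out)[None]                   # spatial weights in (0, 1)

def cbam(F, W1, W2, kernel):
    """Apply channel attention, then spatial attention (F -> F' -> F'')."""
    return spatial_attention(channel_attention(F, W1, W2), kernel)
```

Because both gates pass through a sigmoid, every element of the output is a rescaled (never amplified-beyond-one) version of the input, which is what "adaptive feature refinement" amounts to here.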
2.5. The Proposed Novel YOLOv3-SC Model
The proposed YOLOv3-SC model is built by adding the attention mechanism and the Spatial Pyramid Pooling (SPP) module to the YOLOv3 backbone network Darknet-53. Specifically, the CBAM module endows YOLOv3 with more powerful feature extraction capabilities, while the SPP structure extracts features of different scales in the final stage of the backbone network and fuses them. This design can alleviate network overfitting, increase the robustness of the model, and allow the network to learn richer features. The architecture of the proposed YOLOv3-SC is presented in Figure 5.
Specifically, as Figure 5 shows, YOLOv3-SC is enhanced by integrating a CBAM module into each residual block and adding an SPP unit in the final stage of the backbone network.
The backbone network with CBAM consists of Res1-CBAM, Res2-CBAM, Res4-CBAM, and Res8-CBAM, where Resi-CBAM (i = 1, 2, 4, 8) denotes a CBL unit followed by a stack of i residual structures with CBAM modules, and CBL is composed of a convolution (Conv), Batch Normalization (BN), and a LeakyReLU activation. The CBAM components allow the model to extract richer spatial and channel-wise features, which facilitates pig detection significantly.
Furthermore, the SPP structure is introduced for feature fusion. Specifically, SPP performs maximum pooling operations at different scales on the input feature map, and all the pooled feature maps are then tensor-spliced with the original feature map. In this way, the network fuses features at different scales, which helps prevent overfitting.
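The SPP fusion can be sketched as follows (a numpy sketch using the kernel sizes 5, 9, and 13 common in YOLOv3-SPP variants, an assumption here; stride 1 with "same" padding keeps the spatial size fixed so the outputs can be concatenated along the channel axis):

```python
import numpy as np

def maxpool_same(F, k):
    """Stride-1 max pooling with 'same' padding on a (C, H, W) feature map."""
    p = k // 2
    padded = np.pad(F, ((0, 0), (p, p), (p, p)), constant_values=-np.inf)
    C, H, W = F.shape
    out = np.empty_like(F)
    for i in range(H):
        for j in range(W):
            out[:, i, j] = padded[:, i:i + k, j:j + k].max(axis=(1, 2))
    return out

def spp(F, kernels=(5, 9, 13)):
    """Concatenate the input with its pooled versions along the channel axis."""
    return np.concatenate([F] + [maxpool_same(F, k) for k in kernels], axis=0)
```

For a C-channel input, the fused output has 4C channels: the original map plus one max-pooled map per kernel size, each capturing context at a different receptive-field scale.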
The input image is processed by the improved backbone network and the SPP structure to obtain feature layers at three scales: 13 × 13, 26 × 26, and 52 × 52. Further feature fusion is then performed: the 13 × 13 feature map is upsampled and concatenated with the 26 × 26 feature map from the backbone; the result is upsampled again and concatenated with the 52 × 52 backbone feature map, fusing feature information at different scales. In the feature fusion stage, a CBAM operation is first applied to each input feature layer, and another CBAM operation is applied after the tensor-splicing Concat, yielding the improved feature fusion network.
The training process of YOLOv3-SC is presented in Algorithm 1.
Algorithm 1: YOLOv3-SC Model Training
2.6. The Loss Function
In order to achieve an optimized pig detection model, a reasonable loss function needs to be designed for network training. Specifically, the loss function consists of three terms: the category loss (L_cls), the confidence loss (L_conf), and the locality loss (L_loc). Among them, IoU is utilized to calculate the locality loss, and the cross-entropy loss is employed to calculate the confidence loss and the class loss separately. The loss function of the proposed model is defined as follows:

Loss = L_loc + L_conf + L_cls,

where L_loc indicates the error between the coordinates and size (width and height) of the ground-truth box and those of the predicted box, L_conf illustrates the prediction region confidence error, and L_cls denotes the object classification error, respectively.
The detection performance is evaluated with precision (P), recall (R), AP, and mAP:

P = TP / (TP + FP),
R = TP / (TP + FN),
mAP = (1/n) Σ AP,

where TP represents the number of positive samples predicted as positive, FP indicates the number of negative samples predicted as positive, and FN illustrates the number of positive samples predicted as negative. n represents the number of detected pig categories, and its value is set as 1 here. AP is the average precision, i.e., the area under the precision-recall (PR) curve, and mAP indicates the average AP over all categories.
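Under these definitions, the metric computation can be sketched as follows (a minimal Python sketch; the AP here uses a simple rectangular approximation of the area under the PR curve):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def average_precision(recalls, precisions):
    """Area under the PR curve via rectangular summation.
    recalls must be sorted in increasing order."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```

With a single pig category (n = 1), mAP simply equals the AP of that category.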
2.7. Experiment Setup
In this paper, for fair comparison, all experiments are developed and run with the PyTorch framework. The stochastic gradient descent (SGD) optimization algorithm is employed for the model parameter updates. The batch size is 16, the initial learning rate is set to 0.001 and updated with a cosine decay schedule, the momentum is 0.937, and the total number of iterations is 400 epochs.
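The cosine decay rule can be sketched as follows (a minimal sketch of the standard cosine schedule, assuming the rate decays from the initial 0.001 to zero over the 400 epochs; frameworks such as PyTorch provide this as a built-in scheduler):

```python
import math

def cosine_lr(epoch, total_epochs=400, lr0=0.001):
    """Cosine-decayed learning rate for a given (0-indexed) epoch."""
    return lr0 * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
```

The rate starts at lr0, reaches half of lr0 at the midpoint of training, and smoothly approaches zero at the end.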
Additionally, the hardware configuration is as follows: Ubuntu 20.04 operating system, an Intel(R) Xeon(R) E5-2670 v3 CPU, an Nvidia GeForce RTX 3060 GPU with 12 GB of memory, and 16 GB of DDR4 RAM.