Article

Detection and Localization of Albas Velvet Goats Based on YOLOv4

1 School of Information Engineering, Inner Mongolia University of Science and Technology, Baotou 014010, China
2 College of Mechanical and Electrical Engineering, Inner Mongolia Agricultural University, Hohhot 010018, China
3 College of Electronic Information Engineering, Inner Mongolia University, Hohhot 010021, China
4 Inner Mongolia State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Hohhot 010020, China
* Author to whom correspondence should be addressed.
Animals 2023, 13(20), 3242; https://doi.org/10.3390/ani13203242
Submission received: 18 September 2023 / Revised: 14 October 2023 / Accepted: 15 October 2023 / Published: 18 October 2023
(This article belongs to the Section Animal System and Management)

Simple Summary

We proposed a target detection algorithm based on ShallowSE, a shallow network built on the channel attention mechanism SENet, the GeLU activation function and layer normalization. We refined and simplified the PANet and YOLO Head parts of YOLOv4 to obtain the Custom_YOLO target detection module. We designed a 3D coordinate regression algorithm comprising three fully connected network models to predict the goats’ coordinates. We combined the improved YOLOv4 target detection algorithm and the coordinate regression algorithm to achieve goat localization.

Abstract

In order to achieve goat localization to help prevent goats from wandering, we proposed an efficient target localization method based on machine vision. Albas velvet goats from a farm in Ertok Banner, Ordos City, Inner Mongolia Autonomous Region, China, were the main objects of study. First, we proposed detecting the goats using a shallow convolutional neural network, ShallowSE, with the channel attention mechanism SENet, the GeLU activation function and layer normalization. Second, we designed three fully connected coordinate regression network models to predict the spatial coordinates of the goats. Finally, the target detection algorithm and the coordinate regression algorithm were combined to localize the flock. We experimentally confirmed the proposed method using our dataset. The proposed algorithm obtained a good detection accuracy and successful localization rate compared to other popular algorithms. The overall number of parameters in the target detection algorithm model was only 4.5 M. The average detection accuracy reached 95.89% and the detection time was only 8.5 ms. The average localization error of the group localization algorithm was only 0.94 m and the localization time was 0.21 s. In conclusion, the method achieved fast and accurate localization, which helped to rationalize the use of grassland resources and to promote the sustainable development of rangelands.

1. Introduction

In order to strengthen the scientific management of grazing and reduce human labor, we should use information technology to manage pastures. The detection and localization of goats in natural grazing conditions are important in range management. On the one hand, detecting and locating goats prevents them from being lost and reduces losses for ranchers. On the other hand, detecting and locating goats can be used to track the foraging paths of goats as a way to study the correlation between goat movement and fleece, milk and meat production. Lastly, we can develop precise feeding programs for individual goats in order to improve the economic efficiency of farms [1]. Therefore, detection and localization studies for goats are of great value.
The current research focuses on the individual detection and individual localization of animals. Animal positioning is divided into indoor positioning and outdoor positioning. Relatively little research has been carried out on indoor positioning. Asikainen et al. proposed a metric-vector iterative positioning algorithm that took into account the limitations of WSN nodes in terms of computational power and energy usage, and achieved a positioning accuracy of 3 m with a low communication overhead [2]. Hierold et al. implemented the encounter detection and simultaneous localization of bats in their natural habitat with a low-weight wireless sensor node [3]. Duda et al. proposed an enhanced version of their mobile node for small animals such as bats in the BATS project, with an operating time of up to 420 h [4]. Catarinucci et al. used a novel RFID-based approach that enabled the effective localization and tracking of small laboratory animals; tests conducted on rats proved the validity of the method [5]. Millner et al. proposed a system for the 3D indoor and outdoor localization of animals using a sequential Monte Carlo method that incorporates the dynamics of the target object [6]. The core of outdoor positioning is the use of satellite positioning systems to determine the latitude and longitude of livestock, which is then combined with other sensors and wireless communication technologies to achieve a wider range of applications [7]. In recent years, LoRa has been used as a means of localization, measuring the location of target nodes with algorithms such as time difference of arrival, time of arrival and received signal strength indication [8,9]. Singh et al. proposed a system based on Robust Principal Component Analysis (Robust PCA) that spatially localizes animals in images and outperformed a pre-trained R3Net [10].
As far as animal detection is concerned, with the booming development of artificial intelligence, more detection algorithms are being applied to livestock [11,12,13,14,15]. Target detection is mainly divided into two categories. The first consists of anchor-based algorithms, which use a priori boxes to assist the model’s predictions; the current two-stage R-CNN family [16,17] and the single-stage SSD algorithms [18] belong to this class. The second class, anchor-free algorithms that predict from the key points or the centroid of the target object, is represented by YOLOv1 [19], CornerNet [20] and CenterNet [21]. Dong et al. achieved 92.49% accuracy on their constructed dairy goat dataset using Faster R-CNN [22]. Ran et al. implemented target detection for pigs using MobileNetv2-YOLOv3, with an accuracy rate above 97% [23]. Lei et al. proposed an improved detection method, termed YOLOv7-waterbird, by adding an extra prediction head, a SimAM attention module and sequential frames to YOLOv7, and achieved a mean average precision of 67.3%, enabling real-time video surveillance devices to identify attention regions and perform waterbird monitoring tasks [24]. Song et al. used a pruning-based YOLOv3 deep learning algorithm to detect sheep faces, obtaining a mAP of 96.84% [25]. Lee et al. identified individual cattle using a YOLOv8-based model, with an accuracy of 97% [26]. Zhang et al. proposed an integrated goat head detection and automatic counting method based on YOLOv5, with a detection accuracy of 92.19% [27].
Given that none of the above studies incorporated target detection and localization, the approach proposed in this paper is more suitable for the smart management of rangelands. This paper proposes a goat detection and localization method based on machine vision. The innovations of this paper are as follows:
(1)
We construct a goat image dataset. The dataset contains a total of 8387 images of 11 goats in their natural grazing state.
(2)
We propose a goat target detection algorithm consisting of ShallowSE and an improved Custom_YOLO based on YOLOv4. We first added the attention mechanism to enhance the feature extraction capability of the module. We then changed the activation function to enhance the generalization of the model. Finally, we lightened the model to improve detection speed while ensuring accuracy.
(3)
We propose a 3D coordinate regression algorithm based on fully connected networks. We constructed a fully connected network to fit the transformation relationship between the 2D and 3D coordinates of goats. The network can calculate the spatial coordinates of a goat after detecting the goat.

2. Materials and Methods

2.1. Dataset Creation

The experimental site was located in a pasture of the Otokqi sub-field of the Inner Mongolia Yi-Wei White Velvet Goat Limited Liability Company and the experimental subjects were Albas velvet goats. The ranch covers an area of 32 square kilometers and we laid out 14 PTZ cameras, each with a field of vision of two kilometers. Image acquisition was performed using camera No. 12, as shown in Figure 1a. The acquisition period was from 10 to 12 June 2021. The camera was an IPC3526-X22R4-SI 2-megapixel starlight infrared dome network camera. The camera was connected to a computer that directly saved the video of the goats foraging outside. After frame separation and the removal of unclear images from the video, a total of 8387 images were obtained, of which 80% were used as a training set and 20% as a test set.
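As an illustration of the frame-separation step, a sketch using OpenCV might look as follows; the sampling step and file layout are assumptions, as the text does not specify them. The resulting images would then be screened for clarity and split 80/20 into training and test sets.

```python
import cv2  # OpenCV, assumed for frame extraction

def extract_frames(video_path: str, out_dir: str, step: int = 25) -> int:
    """Save every `step`-th frame of a grazing video as a JPEG image."""
    cap = cv2.VideoCapture(video_path)
    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{idx:06d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```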
Positioning data were collected as GPS information via a Tibbo UG908 locator (Beijing, China) with a positioning accuracy of 1.5 m, as shown in Figure 1b. The GPS data were collected on 12 June 2021. The Tibbo UG908 locator was strapped to the head goat. The data were sent once per second and the computer received the head goat’s GPS data in real time.

2.2. Experimental Platform

The target detection experiment used Python with the PyTorch deep learning framework, run on an Intel(R) Core(TM) i9-11900K CPU with an NVIDIA RTX 3090 24 GB graphics card and 64 GB of RAM. This experiment did not use a pre-trained model. Each batch comprised 16 images, there were 100 epochs of unfrozen training and the initial learning rate was set to 10⁻³. The 3D coordinate regression experiment used Python with the PyTorch deep learning framework, an NVIDIA GeForce GTX 1660 Ti graphics card and 16 GB of memory. Each model was trained separately for 20,000 epochs with an initial learning rate of 10⁻⁴, using GPU-accelerated training and Adam for network optimization.
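For clarity, the following is a minimal PyTorch sketch of how these reported settings map onto code; the stand-in models, the dummy data and the SGD choice for the detector (whose optimizer is not stated above) are assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in modules; the real architectures are described in Sections 2.3 and 2.4.
detector = nn.Conv2d(3, 18, kernel_size=3, padding=1)
regressor = nn.Linear(7, 2)

# Detection: batches of 16 images, 100 epochs, initial learning rate 1e-3.
det_loader = DataLoader(TensorDataset(torch.randn(16, 3, 416, 416)),
                        batch_size=16, shuffle=True)
det_opt = torch.optim.SGD(detector.parameters(), lr=1e-3)   # optimizer type assumed

# 3D coordinate regression: 20,000 epochs per model, initial learning
# rate 1e-4, Adam optimizer (as stated above).
reg_opt = torch.optim.Adam(regressor.parameters(), lr=1e-4)
```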

2.3. Construction of the Goat Target Detection Algorithm

The goat target detection model proposed in this paper was based on YOLOv4: the shallow convolutional neural network was first used as the backbone of the target detection model, and the YOLO Head and PANet (Path Aggregation Network) parts were then streamlined.

2.3.1. SE Module

In convolutional neural networks, each convolutional operation is performed for all channels within the convolutional kernel perceptual field. Although the spatial features of local regions can be fused with channel features, the feature differences between channels are ignored. Hu et al. proposed the channel attention mechanism SENet [28] to address these problems. The SE module can enhance the network’s ability to learn important features and improve learning efficiency. The basic structure of the SE module is shown in Figure 2 and contains four main stages: preprocessing, squeeze, excitation and reweight.
Before entering the SE module, a simple transformation of the input feature map, i.e., the mapping of X to U in the figure, was required. The procedure is shown in the following equation:
$$F_{tr}: X \to U, \quad X \in \mathbb{R}^{H \times W \times C},\ U \in \mathbb{R}^{H \times W \times C}$$
where H is the height, W is the width and C is the number of channels.
After performing the normalized transformation, the feature map was globally averaged and pooled to compress the feature map into the global features, which was calculated as follows:
$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$$
where $u_c \in \mathbb{R}^{H \times W}$ and $z \in \mathbb{R}^{C}$.
After that, SENet applied the excitation operation to the global features to calculate the correlation between different channels. First, the global features were reduced to C/r dimensions by a fully connected layer and the ReLU activation function; the dimensions were then recovered by another fully connected layer and Sigmoid activation to obtain a weight matrix composed of the weights of the different channels, which was calculated as follows:
$$s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2\, \delta(W_1 z))$$
where $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$, $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$, $r$ is the dimensionality-reduction hyperparameter and $\delta$ is the ReLU activation function.
Finally, the reweight operation multiplied each channel in the feature map with its corresponding weight to obtain a feature map with the weight information. The calculation formula is as follows:
$$\tilde{x}_c = F_{scale}(u_c, s_c) = u_c \times s_c$$
where $\tilde{X} = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_C]$ and $u_c \in \mathbb{R}^{H \times W}$.
After passing through the SE module, the model could allocate more resources to channels with larger weights according to their importance, reinforcing the learning of important feature channels while de-emphasizing unimportant ones, which improved learning efficiency.
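As an illustration, the four stages above can be sketched as a compact PyTorch module; the reduction ratio r = 16 follows the original SENet paper and is an assumption here, as the text does not specify the value used.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: squeeze, excitation and reweight."""

    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)       # z_c: global average pool
        self.excitation = nn.Sequential(
            nn.Linear(channels, channels // r),      # reduce to C/r dimensions
            nn.ReLU(inplace=True),                   # delta
            nn.Linear(channels // r, channels),      # recover to C dimensions
            nn.Sigmoid(),                            # per-channel weights s
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = u.shape
        z = self.squeeze(u).view(n, c)               # squeeze: (N, C)
        s = self.excitation(z).view(n, c, 1, 1)      # excitation: (N, C, 1, 1)
        return u * s                                 # reweight each channel

y = SEBlock(64)(torch.randn(2, 64, 32, 32))          # same shape in, same shape out
```

Dropping such a block after a convolutional stage lets the network emphasize informative channels at a negligible parameter cost.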

2.3.2. GeLU Activation Function

In image processing tasks, the role of the activation function is to attach nonlinear properties to the neural network [29]. One of the most commonly used is the ReLU function, shown in Figure 3. When the input is x > 0, the derivative of ReLU is constantly equal to 1, which effectively avoids the vanishing-gradient problem. However, when the input is x ≤ 0, the derivative of the function is 0, which makes it extremely sensitive to abnormal inputs: the network cannot back-propagate through these units and the neurons become deactivated. We therefore adopted the GeLU activation function, which reduces the sensitivity of the activation to outliers and enhances the generalization ability of the model.
The GeLU function is represented as:
$$\mathrm{GeLU}(x) = x\, P(X \le x) = x \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(X - \mu)^2}{2\sigma^2}}\, dX$$
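In practice, the integral is evaluated with μ = 0 and σ = 1, and deep learning frameworks ship it directly; the snippet below is a small illustrative sketch comparing the common tanh approximation of GeLU with PyTorch’s built-in version.

```python
import math
import torch
import torch.nn.functional as F

def gelu_tanh(x: torch.Tensor) -> torch.Tensor:
    # Tanh approximation of GeLU(x) = x * P(X <= x) with mu = 0, sigma = 1
    return 0.5 * x * (1.0 + torch.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

x = torch.linspace(-3.0, 3.0, steps=7)
print(gelu_tanh(x))   # smooth, small nonzero response for x <= 0
print(F.gelu(x))      # exact built-in version; unlike ReLU, the gradient
                      # does not vanish entirely for negative inputs
```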

2.3.3. Layer Normalization

Normalization plays a crucial role in computer vision: by calculating the mean and variance, and then normalizing the input data to keep the scale of the data within a set range, the network can effectively avoid the problem of gradient disappearance and gradient explosion during training [30].
In previous image processing tasks, scholars generally used batch normalization, which normalizes all training samples in a mini-batch, as shown in Figure 4a, where C denotes the number of channels, N denotes the batch size and H, W denote the height and width of the feature map. Batch normalization depends heavily on the batch size: each batch produces different means and variances, and when batch sizes are inconsistent, the resulting statistics may not represent the overall data distribution.
To address this drawback of batch normalization, which has a strong correlation with batch size, Jimmy Lei Ba proposed layer normalization [31], which is processed as shown in Figure 4b. Layer normalization does not consider the size of each batch, and the mean and variance are computed for all neurons within the same layer, which is followed by normalization.
ConvNeXt mimics Transformer and Swin-Transformer by replacing batch normalization with an improved layer normalization, achieving an accuracy improvement of about 1% over the batch-normalized version [32]. As shown in Figure 4c, the layer normalization in ConvNeXt is a refinement of traditional layer normalization, in which each pixel of the feature map is normalized across all of its channels. ShallowSE borrowed this design from ConvNeXt, replacing all instances of traditional batch normalization in the network structure with the improved layer normalization method.
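The sketch below illustrates this per-pixel, across-channel normalization in PyTorch, following the ConvNeXt recipe of permuting to channels-last before applying LayerNorm; the epsilon value is the usual default and an assumption here.

```python
import torch
import torch.nn as nn

class ChannelsLastLayerNorm(nn.Module):
    """ConvNeXt-style layer norm: each pixel is normalized over its C channels."""

    def __init__(self, channels: int, eps: float = 1e-6):
        super().__init__()
        self.norm = nn.LayerNorm(channels, eps=eps)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.permute(0, 2, 3, 1)     # (N, C, H, W) -> (N, H, W, C)
        x = self.norm(x)              # normalize over the channel dimension
        return x.permute(0, 3, 1, 2)  # back to (N, C, H, W)

y = ChannelsLastLayerNorm(32)(torch.randn(2, 32, 13, 13))  # batch-size independent
```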

2.3.4. Model Streamlining

Considering that the dataset used in this paper is dominated by medium-sized targets, the number of a priori boxes used for the medium-sized feature layer is much higher than that for the small and large sizes. To ensure that the accuracy rate does not decrease, we streamlined the YOLO Head, as shown in Figure 5: we deleted the dashed part and kept the medium-sized feature layer. The part between the head and the backbone is called the neck, and its main function is to fuse the semantic information contained in the effective feature maps with the texture information, enhancing the expressive ability of the feature maps. Common necks include, among others, the FPN (Feature Pyramid Network) and PANet [33]. In this paper, we used PANet as the neck. Because the dataset has a single target class and relatively small individual targets, we further streamlined PANet, reducing the original 5 convolutions to 3 convolutions, as sketched below. The red part in Figure 5 is the streamlined part and the streamlined target detection module is called Custom_YOLO.
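To make the streamlining concrete, the following sketch contrasts a standard five-convolution PANet stack with the three-convolution version; the kernel ordering and the BN/LeakyReLU pairing are assumptions, as the text specifies only the reduction from five convolutions to three.

```python
import torch.nn as nn

def conv(c_in: int, c_out: int, k: int) -> nn.Sequential:
    # Conv -> BN -> LeakyReLU; the normalization/activation pairing is assumed.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1),
    )

def five_conv_stack(c: int) -> nn.Sequential:
    # Original PANet block: alternating 1x1 / 3x3 convolutions.
    return nn.Sequential(conv(2 * c, c, 1), conv(c, 2 * c, 3), conv(2 * c, c, 1),
                         conv(c, 2 * c, 3), conv(2 * c, c, 1))

def three_conv_stack(c: int) -> nn.Sequential:
    # Streamlined block: two fewer convolutions per stack.
    return nn.Sequential(conv(2 * c, c, 1), conv(c, 2 * c, 3), conv(2 * c, c, 1))
```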

2.4. Coordinate Regression Algorithm

Three coordinate regression models consisting of fully connected layers with ReLU activation functions were designed in this study. SmallerFC used 2 fully connected layers, SmallFC used 3 and BigFC used 4. The specific structures of the three networks are shown in Figure 6.
The dataset was organized according to the picture serial number, i.e., the sequence number of each picture when the video was split into frames. The coordinate conversion text dataset contained a total of 5029 data points, part of which is shown in Table 1. Among them, xmin, ymin, xmax, ymax, pan, tilt and zoom are input variables, and Xw and Zw are output variables.
Figure 7 shows a schematic diagram of the seven input variables. xmin, ymin, xmax and ymax are the pixel coordinates of the target goat’s bounding box in the image. Pan, tilt and zoom are the PTZ head parameters, which are the values shown in the lower left corner of the image. The output variables Xw and Zw are the goat’s coordinates along the Xw and Zw axes of the constructed world coordinate system. Figure 8 illustrates the flowchart of this study.
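As a concrete illustration, a BigFC-style regressor with the seven inputs and two outputs described above could be sketched as follows; the hidden width of 128 is an assumption, since the text specifies only the number of fully connected layers in each variant (SmallerFC and SmallFC would simply drop hidden layers).

```python
import torch
import torch.nn as nn

class BigFC(nn.Module):
    """Sketch of a four-layer coordinate regressor: 7 inputs -> (Xw, Zw)."""

    def __init__(self, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(7, hidden), nn.ReLU(),        # xmin, ymin, xmax, ymax,
            nn.Linear(hidden, hidden), nn.ReLU(),   # pan, tilt, zoom
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),                   # world coordinates Xw, Zw
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# One row from Table 1 as input (untrained network, illustrative only):
x = torch.tensor([[370.81399, 322.28571, 411.34543, 400.20779, 145.6, 2.7, 18.4]])
print(BigFC()(x).shape)  # torch.Size([1, 2])
```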

2.5. Model Training and Test Precision Evaluation

In order to evaluate the performance of the model, we evaluated it in terms of mean average precision (mAP), loss, frames per second (FPS), parameters, multiply-accumulate operations (MACs) and detection time, as described below (a rough measurement sketch follows the list):
(1) mAP: The average accuracy of detecting all categories of targets, which is one of the important indicators for evaluating the performance of target detection models.
(2) Loss: Expresses the deviation between the model’s predicted box for the target and the target’s ground-truth box. In the loss curve, the smoothness of the curve, the speed of convergence of the loss value and the size of the loss value after convergence are all references for judging how well the model fits the dataset.
(3) MACs: The number of multiply-accumulate operations, which can be interpreted as the computational cost and is used to measure the time complexity of the algorithm/model.
(4) Parameters: Refers to the number of parameters in the network/model and measures the amount of computer storage space occupied by the model, which can be used as one of the indicators for determining the suitability of the network/model for mobile deployment.
(5) FPS: The number of images that can be processed in a second.
(6) Time: Time taken to detect an image.
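The sketch below shows one straightforward way to measure the parameter count, FPS and per-image time; the 416 × 416 input resolution is an assumption (a common YOLOv4 setting), and counting MACs would additionally require a profiling tool such as thop or fvcore.

```python
import time
import torch
import torch.nn as nn

def benchmark(model: nn.Module, size=(1, 3, 416, 416), runs: int = 100):
    """Return (parameter count, FPS, per-image time in ms) for `model`."""
    params = sum(p.numel() for p in model.parameters())
    x = torch.randn(*size)
    model.eval()
    with torch.no_grad():
        for _ in range(10):                     # warm-up iterations
            model(x)
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        elapsed = time.perf_counter() - start
    return params, runs / elapsed, 1000.0 * elapsed / runs

print(benchmark(nn.Conv2d(3, 16, 3, padding=1)))  # toy model as a placeholder
```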

3. Results

The YOLOv4 target detection module was used as a platform to test the performance of six backbone feature extraction networks with performance parameters including the number of parameters, computation, detection accuracy and detection speed. The test results are shown in Table 2.
The lightweight networks, MobileNetv3 and EfficientNet, were significantly smaller than the classical convolutional neural networks, VGG-16 and ResNet-50, in terms of the number of parameters and computation, but they were not necessarily faster in inference. As can be seen from Table 2, the classical networks were slightly better than the lightweight networks in detection speed when the difference in mAP was not significant. The proposed shallow convolutional neural network, ShallowSE, raised the FPS above 20 frames per second while maintaining detection accuracy, a large improvement over both the classical and the lightweight networks. The mAP of ShallowSE (93.98%) was surpassed only by the 96.34% of CSPDarkNet-53 and the 95.53% of MobileNetv3.

3.1. Model Streamlining before and after Comparison

The models with CSPDarkNet-53, MobileNetv3 and ShallowSE as feature extraction networks were streamlined and noted as CSPDarkNet-53-Custom_YOLO, MobileNetv3-Custom_YOLO and ShallowSE-Custom_YOLO, respectively. We compared the original PANet and the simplified PANet results, as shown in Table 3.
As can be seen from Table 3, the total number of parameters in the streamlined models was further reduced. The CSPDarkNet-53-Custom_YOLO model had 51.60% fewer parameters than the original CSPDarkNet-53-YOLOv4 model, the MobileNetv3-Custom_YOLO model had 77.21% fewer parameters than MobileNetv3-YOLOv4 and the ShallowSE-Custom_YOLO model had 87.75% fewer parameters than ShallowSE-YOLOv4. This significant reduction in the number of parameters led to a further improvement in detection speed. Compared with their pre-streamlining counterparts, the FPS of CSPDarkNet-53-Custom_YOLO improved by 3.27 with a 3.3 ms shorter detection time, the FPS of MobileNetv3-Custom_YOLO improved by 0.39 with a 1.6 ms shorter detection time and the FPS of ShallowSE-Custom_YOLO improved by 3.38 with a 3 ms shorter detection time. In conclusion, compared with the model using CSPDarkNet-53 as the main feature extraction network, the model designed in this paper showed a slight decrease in mAP but a large reduction in the number of parameters and detection time, and an improvement in FPS. Overall, our model outperformed the other models.
Figure 9 shows the comparison of the loss value curves before and after the streamlining of the three target detection models, where the blue curve is the loss curve before streamlining and the orange curve is the loss curve after streamlining. It can be seen that the optimized loss values converged slightly faster than the pre-optimized ones and the converged loss values were smaller than those of the pre-optimized models, which indicates that the models fit the dataset faster and better.

3.2. Analysis of 3D Coordinate Regression Algorithm Results

The BigFC regression model was adopted in this paper and the experimental results were compared with the SmallerFC regression model and SmallFC regression model. The prediction results of the three regression models were compared using the training sets and test sets, and the results are shown in Figure 10 and Figure 11, respectively.
Figure 10 shows the prediction curves of the three networks using the training set compared with the true curves, where the blue curves are the true values of Xw and Zw, and the orange curves are the predicted values. In the comparison plot of Xw, SmallerFC and SmallFC fitted the training samples less well than BigFC: in samples 1–1000, their prediction curves did not cover the true curve well, and from about the 3500th sample onward, when the goats moved rapidly over a short period and the true curve rose sharply, the gap between the prediction and the true curve widened. In contrast, the prediction curve of BigFC basically covered the true curve and fitted Xw best. In the comparison plot of Zw, SmallerFC and SmallFC were slightly inferior to BigFC over the first 1000 samples; after that, the three fitted similarly, but the BigFC prediction curve fluctuated noticeably less, so its fit was better. Overall, the BigFC regression model fitted the training set best.
Figure 11 shows the prediction curves of the three networks using the test set compared with the true curves. In the Xw prediction curves, BigFC clearly outperformed SmallerFC and SmallFC, and its Zw prediction curve fluctuated significantly less than those of SmallerFC and SmallFC. BigFC had the best prediction performance of the three.
Table 4 shows the prediction errors of the three regression models for the training and test sets. The average prediction error decreased as the number of fully connected layers in the network increased, indicating that more fully connected layers gave a better fit and a smaller prediction error. SmallFC, with one more fully connected layer than SmallerFC, showed a markedly lower average error. BigFC performed best on both the training and test sets.

3.3. Goat Positioning Algorithm Results

Table 5 shows the statistical results of the localization errors of the three localization algorithms over 30 random samples. In the actual localization test, BigFC and SmallerFC gave unsatisfactory results and performed worse than SmallFC in both maximum error and average error. The main reason was that SmallerFC, with only two fully connected layers, underfitted the dataset, resulting in a higher maximum error and larger error fluctuations in the positioning accuracy test. BigFC, with four fully connected layers, produced prediction curves with excellent fit and low fluctuation, which caused it to overfit the dataset: it predicted samples within the dataset well but generalized poorly to samples outside it.

4. Discussion and Conclusions

This study proposed a vision-based goat localization method built on the existing PTZ cameras of a sub-farm of Inner Mongolia Yiwei White Velvet Goat Co. We proposed the shallow convolutional neural network ShallowSE, obtained the Custom_YOLO target detection module after streamlining and optimizing YOLOv4, and constructed the goat target detection algorithm ShallowSE-Custom_YOLO. After experimental validation, the proposed goat target detection algorithm achieved a 95.89% mAP, 25.32 FPS and an 8.5 ms detection time with only 4.5 M parameters, which met the requirements of lightweight, real-time and accurate target detection in practical application scenarios. We also proposed a 3D coordinate regression algorithm based on fully connected networks. Experiments showed that the fully connected networks fit the dataset much better than the traditional machine-learning algorithm SVR and that more fully connected layers gave a better fit. Finally, experiments were conducted on the proposed vision-based flock localization algorithm. The results show that the localization algorithm using SmallFC, with three fully connected layers, as the coordinate regression model achieved the best localization accuracy, and that localization speed decreased gradually as the number of fully connected layers increased.
On the one hand, the goat positioning algorithm proposed in this paper expanded the functional use of PTZ surveillance cameras, replacing wearable positioning devices. On the other hand, the video-image-based localization method is more expandable, and the localization method studied in this paper can incorporate more functions to better realize the intelligent management of pastures.

Author Contributions

Conceptualization, Z.G. and L.F.; methodology, Y.G. and X.W.; software, Y.G.; validation, J.X. and D.F.; formal analysis, L.W. and D.H.; resources, Y.H. and J.X.; data curation, X.W. and M.H.; writing—original draft preparation, M.H.; writing—review and editing, Y.H. and M.H.; funding acquisition, D.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (Funder: Ding Han, Grant No. 2022YFF1300604-04), the Major Science and Technology Projects of the Inner Mongolia Autonomous Region (Funder: Ding Han, Grant No. 2021ZD0019-4) and the Scientific Research Projects of Universities Directly of the Inner Mongolia Autonomous Region (Funder: Ying Guo, Grant No. 2023QNJS205).

Institutional Review Board Statement

Ethical review and approval were waived for this study because the image data were captured by a camera that had no contact with the goats, and the GPS positioning device strapped to the head goat during the acquisition of localization data had no effect on the animal’s health. Therefore, no ethical issues were involved in this study.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors express their gratitude to the reviewers for their suggestions and comments, which significantly improved the quality of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Xiao, D.Q.; Feng, A.J.; Yang, Q.M. Rapid detection method of pig movement based on video tracking. J. Agric. Mach. 2016, 47, 351–357+331. [Google Scholar]
  2. Asikainen, M.; Haataja, K.; Toivanen, P. Wireless Indoor Tracking of Livestock for Behavioral Analysis. In Proceedings of the 2013 9th International Wireless Communications and Mobile Computing Conference (IWCMC), Sardinia, Italy, 1–5 July 2013; pp. 1833–1838. [Google Scholar]
  3. Hierold, M.; Hartmann, M.; Ripperger, S.; Mayer, F.; Heuberger, A.; Weigel, R.; Koelpin, A. Low-weight wireless sensor node for animal encounter detection and dual-band localization. In Proceedings of the 2016 IEEE Topical Conference on Wireless Sensors and Sensor Networks (WiSNet), Austin, TX, USA, 24–27 January 2016; pp. 21–23. [Google Scholar] [CrossRef]
  4. Duda, N.; Weigel, R.; Koelpin, A. Enhanced mobile node design for small size animal borne wireless sensor nodes with encounter detection and localization. In Proceedings of the 2018 11th German Microwave Conference (GeMiC), Freiburg, Germany, 12–14 March 2018; pp. 123–126. [Google Scholar] [CrossRef]
  5. Catarinucci, L.; Colella, R.; Mainetti, L.; Patrono, L.; Pieretti, S.; Sergi, I.; Tarricone, L. Smart RFID Antenna System for Indoor Tracking and Behavior Analysis of Small Animals in Colony Cages. IEEE Sens. J. 2014, 14, 1198–1206. [Google Scholar] [CrossRef]
  6. Millner, H.; Ebelt, R.; Hoffmann, G.; Vossiek, M. Wireless 3D localization of animals for trait and behavior analysis in indoor and outdoor areas. In Proceedings of the 2009 IEEE MTT-S International Microwave Workshop on Wireless Sensing, Local Positioning, and RFID, Cavtat, Croatia, 24–25 September 2009; pp. 1–4. [Google Scholar] [CrossRef]
  7. Xiong, B.H.; Qian, P.; Luo, Q.Y. Design and implementation of a fine feeding program based on individual body condition of dairy cows. J. Agric. Eng. 2005, 21, 118–120. [Google Scholar]
  8. Gor, M.; Vora, J.; Tanwar, S. GATA: GPS-Arduino Based Tracking and Alarm System for Protection of Wildlife Animals. In Proceedings of the 2017 International Conference on Computer, Information and Telecommunication Systems (CITS), Dalian, China, 21–23 July 2017; pp. 166–170. [Google Scholar]
  9. Nico, P.; David, P.; Jens, T. TDoA-Based Outdoor Positioning with Tracking Algorithm in a Public LoRa Network. WireLess Commun. Mob. Comput. 2018, 2018, 1864209. [Google Scholar]
  10. Singh, P.; Lindshield, S.M.; Zhu, F.; Reibman, A.R. Animal Localization in Camera-Trap Images with Complex Backgrounds. In Proceedings of the 2020 IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI), Albuquerque, NM, USA, 29–31 March 2020; pp. 66–69. [Google Scholar] [CrossRef]
  11. Oliva, A.; Torralba, A. Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope. Int. J. Comput. Vis. 2001, 42, 145–175. [Google Scholar] [CrossRef]
  12. Freund, Y.; Schapire, R.E. A Desicion-theoretic Generalization of On-line Learning and an Application to Boosting. J. Comput. Syst. Sci. 1997, 55, 119–139. [Google Scholar] [CrossRef]
  13. Osuna, E.; Freund, R.; Girosi, F. Training Support Vector Machines: An Application to Face Detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA, 17–19 June 1997; pp. 130–136. [Google Scholar]
  14. Viola, P.; Jones, M. Rapid Object Detection Using a Boosted Cascade of Simple Features. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, HI, USA, 8–14 December 2001; pp. 511–518. [Google Scholar]
  15. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision & Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; pp. 886–893. [Google Scholar]
  16. He, K.; Zhang, X.; Ren, S. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  17. Zhang, X.; Xuan, C.; Xue, J.; Chen, B.; Ma, Y. LSR-YOLO: A High-Precision, Lightweight Model for Sheep Face Recognition on the Mobile End. Animals 2023, 13, 1824. [Google Scholar] [CrossRef]
  18. Wei, L.; Dragomir, A.; Dumitru, E. SSD: Single Shot MultiBox Detector. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  19. Redmon, J.; Divvala, S.; Girshick, R. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  20. Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. Int. J. Comput. Vis. 2020, 128, 642–656. [Google Scholar] [CrossRef]
  21. Duan, K.; Bai, S.; Xie, L. CenterNet: Keypoint Triplets for Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27–29 October 2019; pp. 6569–6578. [Google Scholar]
  22. Dong, W.; Tang, J.L.; Zhu, W. Dairy Goat Detection Based on Faster R-CNN from Surveillance Video. Comput. Electron. Agric. 2018, 154, 443–449. [Google Scholar]
  23. Ran, Y.; Deng, M.H. Pig Target Recognition in Piggery Environment Based on Deep Convolutional Neural Network. Test Eng. Manag. 2020, 83, 21087–21099. [Google Scholar]
  24. Lei, J.; Gao, S.; Rasool, M.A.; Fan, R.; Jia, Y.; Lei, G. Optimized Small Waterbird Detection Method Using Surveillance Videos Based on YOLOv7. Animals 2023, 13, 1929. [Google Scholar] [CrossRef]
  25. Song, S.; Liu, T.; Wang, H.; Hasi, B.; Yuan, C.; Gao, F.; Shi, H. Using Pruning-Based YOLOv3 Deep Learning Algorithm for Accurate Detection of Sheep Face. Animals 2022, 12, 1465. [Google Scholar] [CrossRef]
  26. Lee, T.; Na, Y.; Kim, B.G.; Lee, S.; Choi, Y. Identification of Individual Hanwoo Cattle by Muzzle Pattern Images through Deep Learning. Animals 2023, 13, 2856. [Google Scholar] [CrossRef] [PubMed]
  27. Zhang, Y.; Yu, C.; Liu, H.; Chen, X.; Lei, Y.; Pang, T.; Zhang, J. An Integrated Goat Head Detection and Automatic Counting Method Based on Deep Learning. Animals 2022, 12, 1810. [Google Scholar] [CrossRef] [PubMed]
  28. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  29. Geffen, O.; Yitzhaky, Y.; Barchilon, N. A machine vision system to detect and count laying hens in battery cages-ScienceDirect. Animal 2020, 14, 2628–2634. [Google Scholar] [CrossRef] [PubMed]
  30. Felzenszwalb, P.F.; Mcallester, D.A.; Ramanan, D. A Discriminatively Trained, Multiscale, Deformable Part Model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar]
  31. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
  32. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015; pp. 1–14. [Google Scholar]
  33. Wang, Y.; Han, D.; Wang, L.; Guo, Y.; Du, H. Contextualized Small Target Detection Network for Small Target Goat Face Detection. Animals 2023, 13, 2365. [Google Scholar] [CrossRef] [PubMed]
Figure 1. The picture (a) shows camera 12 at a height of about 15.6 m. The picture (b) shows the Tibbo GPS locator that was used to collect GPS data.
Figure 2. The structure of the SE module. C stands for the number of channels, H stands for height and W stands for width. $X \to U$ is a mapping operation; $X \to \tilde{X}$ is the squeeze, excitation and reweight operation.
Figure 3. Comparison of GeLU and ReLU.
Figure 4. Schematics of (a) the batch normalization method, (b) the layer normalization method and (c) the improved layer normalization method used in ConvNeXt, where C denotes the number of channels, N denotes the batch size and H, W denote the height and width of the feature map.
Figure 5. The structure of ShallowSE-Custom_YOLO.
Figure 6. The structure of the coordinate regression model.
Figure 7. Diagram of input variables.
Figure 8. The overall structure of the goat positioning algorithm.
Figure 9. Comparison of loss function curves before and after model simplification.
Figure 10. Prediction curve of the three fully connected networks using the training set.
Figure 11. Prediction curve of the three fully connected networks using the test set.
Table 1. Coordinate conversion text dataset.

| Num | xmin | ymin | xmax | ymax | Pan | Tilt | Zoom | Xw | Zw |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 370.81399 | 322.28571 | 411.34543 | 400.20779 | 145.6 | 2.7 | 18.4 | −302.75021 | −136.18202 |
| 2 | 371.12577 | 321.66233 | 414.15146 | 403.32467 | 145.6 | 2.7 | 18.4 | -- | −136.18202 |
| 3 | 370.50221 | 325.40259 | 416.64570 | 403.94805 | 145.6 | 2.7 | 18.4 | −302.75021 | −136.18202 |
| … | … | … | … | … | … | … | … | … | … |
| 5028 | 341.64705 | 256.94117 | 365.17647 | 303.0588235 | 144.7 | 7.5 | 3.8 | −371.25324 | −97.453996 |
Table 2. Test results of backbone feature networks.

| Network | Body | Params (M) | MACs (G) | mAP | FPS |
|---|---|---|---|---|---|
| VGG-16 | YOLOv4 | 14.714688 | 52.97694106 | 93.73% | 19.19 |
| ResNet-50 | YOLOv4 | 23.508032 | 14.12160512 | 93.34% | 18.30 |
| CSPDarkNet-53 | YOLOv4 | 26.617184 | 17.33899981 | 96.34% | 13.75 |
| MobileNetv3 | YOLOv4 | 5.483032 | 0.76975596 | 95.53% | 17.42 |
| EfficientNet-B0 | YOLOv4 | 3.595388 | 0.279041152 | 91.07% | 16.64 |
| ShallowSE | YOLOv4 | 0.400344 | 2.960644704 | 93.98% | 21.94 |
Table 3. Comparison before and after simplification of PANet.

| Model | Params | mAP | FPS | Time (ms) |
|---|---|---|---|---|
| CSPDarkNet-53-YOLOv4 | 63,937,686 | 96.34% | 13.75 | 29.1 |
| CSPDarkNet-53-Custom_YOLO | 30,943,462 | 96.18% | 17.02 | 25.8 |
| MobileNetv3-YOLOv4 | 42,231,118 | 95.53% | 17.42 | 22.9 |
| MobileNetv3-Custom_YOLO | 9,623,966 | 96.09% | 17.81 | 21.3 |
| ShallowSE-YOLOv4 | 37,205,070 | 93.98% | 21.94 | 11.5 |
| ShallowSE-Custom_YOLO | 4,556,458 | 95.89% | 25.32 | 8.5 |
Table 4. Prediction error of the three coordinate regression models.

| Model | Training Set Average Error (m) | Training Set Maximum Error (m) | Test Set Average Error (m) | Test Set Maximum Error (m) |
|---|---|---|---|---|
| SmallerFC | 1.3951541 | 8.1814829 | 1.37147070 | 8.7922006 |
| SmallFC | 1.0992516 | 8.3155968 | 1.05533216 | 8.3128948 |
| BigFC | 0.9719390 | 6.7282131 | 0.88486233 | 6.5844086 |
Table 5. Statistics of positioning errors of the three algorithms.

| Model | Longitude Maximum Error Distance (m) | Longitude Average Error Distance (m) | Latitude Maximum Error Distance (m) | Latitude Average Error Distance (m) |
|---|---|---|---|---|
| ShallowSE-Custom_YOLO-SmallerFC | 4.3448 | 1.4355 | 5.5023 | 1.3414 |
| ShallowSE-Custom_YOLO-SmallFC | 2.0996 | 1.0433 | 3.3299 | 0.8292 |
| ShallowSE-Custom_YOLO-BigFC | 4.3176 | 1.9062 | 2.0578 | 0.8796 |

Citation: Guo, Y.; Wang, X.; Han, M.; Xin, J.; Hou, Y.; Gong, Z.; Wang, L.; Fan, D.; Feng, L.; Han, D. Detection and Localization of Albas Velvet Goats Based on YOLOv4. Animals 2023, 13, 3242. https://doi.org/10.3390/ani13203242
