Wildlife Object Detection Method Applying Segmentation Gradient Flow and Feature Dimensionality Reduction

Abstract: This work proposes an enhanced animal detection algorithm for natural environments, based on YOLOv5s, to address the low detection accuracy and slow detection speed encountered when automatically detecting and classifying large animals in the wild. First, to increase the detection speed of the model, the algorithm improves the SPP module by replacing the parallel connection of the original max-pooling layers with a series connection. It then expands the model's receptive field to suit the dataset used in this paper by stacking the feature pyramid network structure as a whole, which strengthens the feature fusion network. Second, it introduces the GSConv module, which combines standard convolution, depth-wise separable convolution, and channel shuffling to reduce network parameters and computation, making the model lightweight and easier to deploy to endpoints. At the same time, the GS bottleneck replaces the Bottleneck module in C3; it divides the input feature map into two channels and assigns different weights to them. The two channels are then concatenated along the channel dimension, which enhances the model's ability to express non-linear functions and addresses the vanishing gradient problem. Wildlife images are obtained from the OpenImages public dataset and from real-life shots. The experimental results show that the improved YOLOv5s algorithm proposed in this paper reduces the computational effort of the model compared to the original algorithm while also improving both detection accuracy and speed, and that it can be applied to the real-time detection of animals in natural environments.


Introduction
Target identification and recognition of animals have grown in importance as computer vision technology has progressed. Conventional approaches to these problems do not yet produce satisfactory outcomes, and deep learning has emerged as a breakthrough technology in this area. In recent centuries, the expansion of human society into the natural environment has resulted in the loss of wildlife habitats, and the environment has been severely damaged by the advent of the industrial age and rapid population growth. Some species have already become extinct as a result. The application of deep learning target detection algorithms to detect and identify animals therefore provides a novel method for wildlife conservation and ecological study [1].
Convolutional neural networks (CNN) are a class of feedforward neural networks (FNN) with convolutional computation and a deep structure, and they are one of the representative algorithms of deep learning [2]. With the development of artificial intelligence and deep learning, the application of convolutional neural networks to wildlife detection and identification is of great significance for wildlife conservation, as it extracts surrounding target features in real time. Among the algorithms for target feature extraction, the faster region-based convolutional neural network (Faster R-CNN) algorithm [3], the single shot multibox detector (SSD) algorithm [4,5], and the you only look once (YOLO) algorithm [6][7][8] have successfully applied deep learning to target extraction and target detection. The YOLO algorithm performs training and detection in a single, standalone network, and regression and classification are performed directly on the whole image in the CNN, so its performance is improved compared to the Faster R-CNN and SSD algorithms, but its recognition accuracy for small or distant targets is poor [9,10].
In 2020, YOLOv5 was proposed, and its basic structure was divided into the Backbone, Neck, and Head. YOLOv5s is a subset of YOLOv5, and numerous researchers have made great strides in fusing the target detection algorithm with real-world uses. Jiale Yao et al. solved the recognition of vehicle targets in bad weather by increasing the number of model parameters, merging Transformer and CBAM into the YOLOv5 algorithm, and optimizing the parameters of the Backbone of YOLOv5 algorithm, using the loss function of EIOU instead of the original loss function of CIOU, which is beneficial for the recognition of vehicles [11]. Hao Wang and Shixin Sun et al. created a reinforcement-learning-based system for improving underwater images, which is comparable to target recognition in bad weather. YOLOv5 is a lightweight, quick, and accurate object detection method for underwater environments according to preliminary testing findings. They used a Markov decision process (MDP) to describe the improvement of underwater images. The MDP can represent a variety of improved outcomes for underwater photographs after being trained with reinforcement learning. Their reinforcement learning architecture provided a series of actual actions that are transparent from an implementation standpoint, in contrast to the black-box processing approach of deep learning methods. The outcomes of the experiments supported the framework for reinforcement learning's efficacy in improving underwater image quality [12,13]. This has similarities to the identification of animals in different environments which follows in this paper.
Weimin Liu et al. used coordinate attention to improve YOLOv5 to reduce the loss of feature information and reduced its size by the lightweight method ShuffleNetV2 [14]. Fenghua Wang et al. used Ghostconv to replace the convolutional layer in YOLOv5s CSP to improve the detection speed by lightweight network structure and then, as in the above paper, introduced the BiFPN module to improve the PANet structure of the Neck to improve the detection accuracy of Xiaomila green pepper in surroundings similar to the target [15]. In the meantime, other researchers have made progress by fusing target detection algorithms with animal recognition applications. Ramakant Chandrakar et al. presented a system for automatic detection and recognition of animals using deep CNN with genetic segmentation for animal detection [16]. For image fusion, S. Divya Meena et al. proposed a dual-scale image decomposition-based fusion technique (DDF) that fuses visible and thermal images and introduced a seed-labels-focused object detector (SLOD) [17].
The proposed networks were applied to edge devices; in addition to YOLOv5s, Jiadong Chen et al. proposed convolution kernel first (CKF), an efficient scheme for designing memristor-based fully convolutional neural networks (FCNs). The parameters and circuit power consumption of the edge device are both reduced by CKF. The test set maintains high accuracy while lowering power loss, as shown by the simulation results of real medical image segmentation tasks [18]. Bo Lyu et al. proposed the deployment of spectral graph convolutional networks (GCNs) on memristive crossbars. They also provided an accelerated technique that combines diagonal block matrix multiplication with sparse Laplace matrix reordering. The results showed that the method was effective when used in the supervised learning graph dataset (QM7) and unsupervised learning dataset (karate club). The outcomes showed that the model maintained a high level of accuracy and achieved a memristor number reduction, which is crucial for future network deployment on edge devices [19]. Subeen Lee et al. introduced the task discrepancy maximization (TDM) module. The support attention module (SAM) and query attention module (QAM) are two novel components that TDM uses to learn task-specific channel weights [20]. Heng Li et al. introduced a gated recurrent unit to improve the result by tracing the temporal information of the cost graph [21]. To address the issue of gradient disappearance, we also present the attention mechanism module in this article while adjusting the weights on each channel and using a normalization unit to combine the number of channels in the model. However, the use of target detection algorithms to find animals in natural settings is not widely accepted in all respects, and recognition accuracy and speed need to be increased [22]. 
To improve the accuracy and speed of YOLOv5s in wildlife recognition, this paper replaces the SPP module in the Backbone with the SPPF module to improve the detection speed of the model and adjusts the feature pyramid network structure to enhance target feature extraction for large animals by expanding the receptive field of the model. Secondly, the Conv module in the Head is replaced with the GSConv module to reduce the number of parameters in the model and to enhance the network's feature extraction capability. Finally, the VoVGSCSP module is introduced to divide the input feature map into two channels, which enhances the non-linear representation of the model and addresses the vanishing gradient problem. The test results demonstrate that the model recognizes wildlife in natural settings more effectively, has a compact footprint, is simple to deploy on mobile terminals, and has a high detection accuracy.

YOLOv5 Network Architecture
YOLOv5, proposed by Jocher in 2020, comes in four models, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, in order of increasing network size [23]. The four models differ only in the depth and width of the network and are otherwise identical.
The network structure of YOLOv5 is shown in Figure 1. YOLOv5 mainly consists of four parts: the Input, Backbone, Neck, and Prediction [24,25]. Input is responsible for preprocessing the input images to meet the training requirements. The Backbone, which includes Focus, CBL, CSP, and SPP [26], is the backbone network responsible for providing image feature information. The Neck is the structural layer containing the fused features of the images and passes the feature information to Prediction. Prediction is responsible for providing prediction frames based on the feature information and filtering the detection frames by non-maximum suppression (NMS).
At the input side, YOLOv5 uses Mosaic data augmentation to enrich the dataset: four images are randomly selected and combined using random scaling and arrangement. It then performs adaptive anchor frame computation to preprocess the images and adaptive scaling to address the black edge problem, which enhances the model's training efficiency and network robustness [27,28]. The Backbone uses a Focus structure for slicing downsampling, which reduces the information loss introduced by convolution. Meanwhile, it improves the CSP structure to C3 by applying C3_1_X and C3_2_X to the Backbone and Neck, respectively, which enhances the learning ability of the network. It extracts the initial features of the images through spatial pyramid pooling (SPP), which can partially address the issue of multi-scale target fusion. The PANet structure used in the Neck consists of a feature pyramid structure of FPN + PAN, with FPN passing top-down information from the higher levels to the lower levels to form the feature map, and PAN passing bottom-up location information to downsample and fuse the feature map. Using both together strengthens the network's feature fusion capability, enhances the model's detection of targets of different sizes, and addresses the multi-scale problem.
Prediction includes DIoU_NMS and the loss function. The CIoU function is used to calculate the position loss, which solves the problem wherein GIoU degrades to IoU when one target frame completely contains the other. The detection frames are filtered by DIoU_NMS, which effectively reduces missed detections and improves the accuracy of network prediction [29][30][31][32][33].

CBS
CBS is a composite convolutional module consisting of a convolutional layer, a BN layer, and an activation function layer, and it is an important part of many modules. The BN layer mainly normalizes the data and facilitates fast convergence to accelerate the network. The activation function layer uses SiLU as the new activation function, which is essentially the input weighted by its own sigmoid. The SiLU function is continuous and smooth. On deeper models, where it can increase the non-linearity of the model and boost detection precision, it performs better than the original activation function LeakyReLU. The expressions are as follows:

$$\mathrm{SiLU}(x) = x \cdot \sigma(x)$$

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
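As a quick numerical illustration of the two activation functions discussed above, the following minimal sketch evaluates SiLU and LeakyReLU on a few inputs. The function names and the LeakyReLU negative slope of 0.1 are assumptions made for this example, not values taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def silu(x: float) -> float:
    """SiLU: the input multiplied by its own sigmoid."""
    return x * sigmoid(x)


def leaky_relu(x: float, slope: float = 0.1) -> float:
    """LeakyReLU; the slope value is an assumption for illustration."""
    return x if x >= 0 else slope * x


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  SiLU={silu(x):+.4f}  LeakyReLU={leaky_relu(x):+.4f}")
```

Unlike LeakyReLU, which has a kink at zero, SiLU is smooth everywhere and approaches the identity for large positive inputs.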

C3
C3 is composed of two nested residual modules. The model is simplified, and the number of convolution modules is decreased with this structure without impacting the feature information. Depending on the application location, C3 can be divided into C3_1_X and C3_2_X. C3_1_X is used in the Backbone convolutional neural network part, which contains X residual components (Resunit); the larger the X the deeper the network structure. C3_2_X, on the other hand, is applied to the Neck and contains 2X residual components, the structure of which differs from C3_1_X only in terms of the number of residual components. Increasing the number of Resunit can increase the gradient value of backpropagation between different layers and prevent the gradient degradation problem caused by the deeper structure of the network. Moreover, the model's ability to extract and fuse network features is enhanced. Therefore, along with increasing the network's capacity for learning and lowering its computational and parameter requirements, C3 also increases the network's precision in target detection.

SPP
SPP transforms an input feature map of arbitrary dimension into a fixed-dimensional feature vector so that the feature dimension matches that expected by the fully connected layer. SPP first halves the number of channels with the composite convolution module (CBS) and then downsamples the result using three max-pooling operations of 5 × 5, 9 × 9, and 13 × 13, respectively, so that the output of the convolutional layer retains local features at different scales. SPP has a skip connection before downsampling, and the three pooling results are concatenated with the image's initial features through the Concat feature connection. This allows the local features to be fused with the global features of the original convolutional layer output, providing good feature extraction capability. Furthermore, the number of channels after feature stacking becomes twice the original, which increases the number of channels substantially at a small cost and enlarges the receptive field.

Prediction
DIoU-NMS
NMS is the post-processing of the detection results. It obtains the final detection results by removing the redundant and useless frames for each object. However, NMS tends to filter frames solely based on IoU, which might cause issues when the intersection of frames for various objects is empty.
In YOLOv5, the frames are filtered using DIoU_NMS. With standard NMS, only the frames with the highest scores remain after filtering, because the IoU values are often high when objects are very close to one another. DIoU_NMS adds the distance between the center points of the frames as an additional indicator, which effectively reduces missed detections. The calculations are as follows:

$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}$$

$$R_{\mathrm{DIoU}} = \frac{\rho^2(b, b^{gt})}{c^2}$$

$$s_i = \begin{cases} s_i, & \mathrm{IoU} - R_{\mathrm{DIoU}} < \varepsilon \\ 0, & \mathrm{IoU} - R_{\mathrm{DIoU}} \geq \varepsilon \end{cases}$$

where $A$ is the prediction frame, $B$ is the true frame, and $C$ is the minimum convex set of $A$ and $B$; $s_i$ is the score for the different categories; $\varepsilon$ is the threshold set in the NMS; $\rho^2(b, b^{gt})$ is the squared Euclidean distance between the center points of $A$ and $B$; $c$ is the diagonal length of $C$; and $i$ indexes the anchor frames in each grid [34][35][36].
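The filtering rule described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the YOLOv5 implementation: the box format (x1, y1, x2, y2), the function names, and the threshold of 0.5 are all chosen for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def diou(a, b):
    """IoU minus the normalized squared distance between box centers."""
    rho2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
         + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both a and b.
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
       + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return iou(a, b) - (rho2 / c2 if c2 > 0 else 0.0)


def diou_nms(boxes, scores, eps=0.5):
    """Greedy NMS that suppresses a box only if its DIoU with a kept box is >= eps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        m = order.pop(0)          # highest-scoring remaining box
        keep.append(m)
        order = [i for i in order if diou(boxes[m], boxes[i]) < eps]
    return keep


boxes = [(0, 0, 20, 10), (6, 0, 26, 10), (1, 0, 21, 10), (40, 40, 60, 50)]
scores = [0.9, 0.8, 0.7, 0.6]
print(diou_nms(boxes, scores))  # -> [0, 1, 3]
```

In this toy example, the second box has an IoU of about 0.54 with the first, so a plain IoU criterion at the same threshold would suppress it. The center-distance penalty lowers its DIoU to about 0.49, so it is kept as a separate detection.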

SPP Improvements
SPPF is an improvement on SPP: experiments show that SPPF produces the same computational results as SPP but is almost twice as fast [37]. As shown in Figure 2, SPPF first halves the number of channels in the feature map using the composite convolution module (CBS) and then downsamples it through the max-pooling layers, using three 5 × 5 max-pooling operations in series instead of the three parallel max-pooling operations in SPP to further fuse the image features. It also concatenates the three pooling results with the initial features of the image to fuse the local features with the global features, doubling the number of channels at a small cost, which enlarges the receptive field and can address the problem of multi-scale target fusion to a certain extent. SPPF can convert feature maps of arbitrary dimensions into feature vectors of fixed dimensions and increase the receptive field, and it is more efficient than SPP while producing the same adaptive scaling output [38].
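The equivalence between serial and parallel pooling can be checked numerically. The sketch below is an illustration in plain Python rather than the network code: it assumes stride-1 max pooling with "same" padding (out-of-bounds positions are ignored, which is equivalent to padding with negative infinity), a single-channel 20 × 20 feature map, and helper names invented for this example.

```python
import random


def max_pool(fmap, k):
    """Stride-1 k x k max pooling with 'same' padding on a 2D list."""
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            # Take the maximum over the window, skipping out-of-bounds cells.
            row.append(max(
                fmap[y][x]
                for y in range(max(0, i - r), min(h, i + r + 1))
                for x in range(max(0, j - r), min(w, j + r + 1))
            ))
        out.append(row)
    return out


def spp_branches(fmap):
    """SPP: three pooling branches applied in parallel to the same input."""
    return [max_pool(fmap, 5), max_pool(fmap, 9), max_pool(fmap, 13)]


def sppf_branches(fmap):
    """SPPF: three 5 x 5 poolings applied in series."""
    y1 = max_pool(fmap, 5)
    y2 = max_pool(y1, 5)
    y3 = max_pool(y2, 5)
    return [y1, y2, y3]


random.seed(0)
fmap = [[random.random() for _ in range(20)] for _ in range(20)]
print(spp_branches(fmap) == sppf_branches(fmap))  # -> True
```

Two stacked 5 × 5 windows cover the same 9 × 9 neighborhood and three cover 13 × 13, so the outputs match exactly. The serial form visits 3 × 25 = 75 window positions per pixel instead of 25 + 81 + 169 = 275, which is consistent with the speed advantage reported for SPPF.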

Upgrading of the Feature Pyramid Structure
To enhance the effectiveness of target detection, the receptive field must be expanded due to the large size of the targets in the dataset used in this study. Most studies tend to increase the receptive field by increasing the convolutional layer and increasing the downsampling ratio [39]. However, in convolutional neural networks, the feature maps obtained by deep convolution are more semantic, but the location information is lost and the computational effort is increased. Therefore, in this paper, we increase the downsampling rate based on the original algorithm [40,41] and stack the feature pyramid network structure one level up, i.e., the original P3-5 structure is improved to a P4-6 structure. The newly added P6 detection layer is more suitable for detecting larger targets and can achieve higher accuracy under higher-resolution training conditions [42].
Due to the small number of downsampling layers of YOLOv5s, the detection effect on large-sized objects is not ideal. Therefore, we add a 64× downsampling feature fusion layer P6 in the Backbone, which is output by the backbone network with 64× downsampling and 1024 output channels, generating a feature map of size 10 × 10. The smaller the feature map, the sparser the newly generated feature map's segmented grid, the more advanced the semantic information contained in every grid, and the larger the receptive field obtained, which is conducive to the recognition of large-sized targets [43]. At the same time, the original 8× downsampling feature fusion layer is removed, i.e., only P4, P5, and P6 are used to downsample the image. In this way, the original image is sent to the feature fusion network after 16×, 32×, and 64× downsampling to obtain 40 × 40, 20 × 20, and 10 × 10 feature maps in the detection layer. Three sizes of feature maps are used to detect targets of different sizes, and the original feature extraction model is shown in Figure 3.
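The feature map sizes quoted above follow directly from the downsampling factors. The short sketch below reproduces the arithmetic; the 640 × 640 input resolution is an assumption inferred from the stated 40 × 40, 20 × 20, and 10 × 10 outputs, and the function name is hypothetical.

```python
def feature_map_sizes(input_size, strides):
    """Map each downsampling factor to the side length of its feature map."""
    return {s: input_size // s for s in strides}


original = feature_map_sizes(640, (8, 16, 32))    # P3-P5 heads
improved = feature_map_sizes(640, (16, 32, 64))   # P4-P6 heads

print("P3-P5:", original)
print("P4-P6:", improved)
```

Each cell of the 10 × 10 map produced by P6 corresponds to a 64 × 64 pixel region of the input, which is why this level suits large targets.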

As shown in Figure 3, P4, P5, and P6 are three different layers of feature maps, corresponding to 16×, 32×, and 64× downsampling, respectively. Feature maps P4, P5, and P6 carry out feature fusion through feature pyramids, i.e., fusing the high-level and low-level feature maps by passing high-level information from top to bottom and location information from bottom to top, combining the location information of the low-level network with the semantic information of the high-level network. This enhances the detection of targets of different sizes and strengthens the multi-scale prediction capability of the network. P6 has a higher downsampling factor and a larger receptive field per pixel, which provides more information on large-sized targets during the fusion and transfer of feature information, thus enhancing the learning capability of the network.
The feature map then enters the detection layer for prediction, which consists of three detection heads and is responsible for identifying feature points on the feature map and determining whether there is a target corresponding to it.

We carry out ablation experiments because additional target detection layers add parameters. The experimental findings demonstrate that the number of parameters increases only slightly with the P4-P6 structure, because only feature layers are added rather than a significant number of extra convolutional layers, while the detection accuracy is improved. In conclusion, by increasing the downsampling factor to obtain a smaller feature map, the receptive field of the feature map becomes larger, which is helpful for fully refining the image feature information, reducing the information loss, and strengthening the network's learning capability, thus improving the accuracy of target detection and recognition while decreasing the computational effort.

GSConv
Although the P4-6 structure leads to significant accuracy gains, it still introduces a certain number of parameters, which is not optimal for building lightweight networks. The design of lightweight networks often favors the use of depth-wise separable convolution (DSC). The greatest advantage of DSC is its computational efficiency, with approximately one-third of the number of parameters and computational effort of conventional convolution, but the channel information of the input image is separated during the calculation. This deficiency gives DSC a much lower feature extraction and fusion capability than standard convolution (SC). To make up for this deficiency, MobileNets first compute channel information independently and then fuse it with a large number of dense convolutions; ShuffleNets use shuffle to achieve channel information interaction; GhostNet inputs only half of the channels to ordinary convolution to retain the interaction information. Many lightweight networks are limited to similar thinking, but all three approaches use only DSC or SC independently, ignoring the joint role of DSC and SC, and thus cannot fundamentally solve the problems of DSC [44].
To make effective use of the computational efficiency of DSC and, at the same time, bring the detection accuracy of DSC up to the standard of SC, this paper proposes a new hybrid convolutional approach, GSConv, based on research on lightweight networks. The GSConv module is a combination of SC, DSC, and shuffle, and its structure is shown in Figure 4. First, a feature map with $C_1$ input channels is fed in; half of the channels are processed by depth-wise separable convolution and the remaining channels by standard convolution, after which the two outputs are concatenated. Then, the information generated by SC is infiltrated into the various parts of the information generated by DSC using shuffle, and the output feature map has $C_2$ channels. Shuffle is a channel-mixing technique that was first used in ShuffleNets [45]. It enables channel information interaction by allowing information from the SC to be fully blended into the DSC output by transferring its feature information across channels.
During the convolution process, the spatial information of feature maps is gradually transferred to the channels, i.e., the number of channels increases while the width and height of the feature map decrease, making the semantic information stronger and stronger. However, each spatial compression and channel expansion of the feature map results in a partial loss of semantic information, which affects the accuracy of target detection. SC retains the hidden connections between channels to a greater extent, which can reduce the loss of information to a certain extent, but its time complexity is greater; DSC, on the contrary, completely cuts off these connections, so that the channel information of the input image is completely separated during the calculation, that is, the feature map is separable with minimal time complexity. GSConv retains as many of these connections as possible while keeping the time complexity small, which reduces information loss and enables faster operation, achieving a degree of unity between SC and DSC.
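The channel-mixing step can be illustrated with labeled channels instead of real feature maps. The sketch below is a toy example under stated assumptions: the function name, the group count of two, and the four-channel branch outputs are chosen for illustration and do not come from the paper's code.

```python
def channel_shuffle(channels, groups=2):
    """Interleave channels across groups (reshape, transpose, flatten)."""
    n = len(channels)
    if n % groups != 0:
        raise ValueError("channel count must be divisible by the group count")
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


# Hypothetical outputs of the two branches, four channels each.
sc_out = [f"SC{i}" for i in range(4)]
dsc_out = [f"DSC{i}" for i in range(4)]

concat = sc_out + dsc_out          # concatenation along the channel axis
print(channel_shuffle(concat))
# -> ['SC0', 'DSC0', 'SC1', 'DSC1', 'SC2', 'DSC2', 'SC3', 'DSC3']
```

After the shuffle, every DSC channel sits next to an SC channel, so a following convolution sees both kinds of information in each local group of channels.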
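The time complexity of the convolution calculation is usually measured in FLOPs, and the time complexity of SC, DSC, and GSConv is calculated as follows:

$$\mathrm{Time}_{SC} \sim O(W \cdot H \cdot K_1 \cdot K_2 \cdot C_1 \cdot C_2)$$

$$\mathrm{Time}_{DSC} \sim O(W \cdot H \cdot K_1 \cdot K_2 \cdot 1 \cdot C_2)$$

$$\mathrm{Time}_{GSConv} \sim O\left(W \cdot H \cdot K_1 \cdot K_2 \cdot \frac{C_2}{2} \cdot (C_1 + 1)\right)$$

where $W$ and $H$ are the width and height of the output feature map, respectively; $K_1$ and $K_2$ are the sizes of the convolution kernel; $C_1$ is the number of channels of the input feature map; and $C_2$ is the number of channels of the output feature map.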
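To give a sense of scale, the sketch below evaluates the three cost expressions for one layer. It assumes the commonly cited forms W·H·K1·K2·C1·C2 for SC, W·H·K1·K2·1·C2 for DSC, and W·H·K1·K2·(C2/2)·(C1+1) for GSConv; the layer dimensions are hypothetical values picked for the example.

```python
def flops_sc(w, h, k1, k2, c1, c2):
    """Standard convolution cost (assumed form)."""
    return w * h * k1 * k2 * c1 * c2


def flops_dsc(w, h, k1, k2, c1, c2):
    """Depth-wise separable convolution cost (assumed form)."""
    return w * h * k1 * k2 * 1 * c2


def flops_gsconv(w, h, k1, k2, c1, c2):
    """GSConv cost (assumed form): half SC output channels plus a DSC pass."""
    return w * h * k1 * k2 * c2 * (c1 + 1) // 2


# Hypothetical layer: 40 x 40 output, 3 x 3 kernel, 128 -> 256 channels.
layer = dict(w=40, h=40, k1=3, k2=3, c1=128, c2=256)
for name, fn in (("SC", flops_sc), ("DSC", flops_dsc), ("GSConv", flops_gsconv)):
    print(f"{name:7s}{fn(**layer):>12,d}")
```

Under these assumed forms, the ratio of GSConv to SC is (C1 + 1) / (2·C1), so GSConv costs roughly half as much as SC once the input has more than a few channels.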
Applying each of the three convolution patterns to the same image of the dataset in this paper, the visualization results for SC, DSC, and GSConv are as shown in Figure 5. Compared to DSC, the feature maps output by GSConv are more similar to those output by SC, and, in some cases, the detection of the target is even better than that of SC, which has the highest detection accuracy; some of the outputs of DSC are darker, and detection accuracy is lacking. Further, the convolutional kernel size of the DSC used in the original GSConv is 5 × 5, which is replaced with a 7 × 7 convolutional kernel to adapt it to the detection of large targets so that we can obtain larger-scale features and a larger receptive field. This study reduces the network parameters and computation, minimizing the drawbacks of DSC, reducing its detrimental effects on the model, and making full use of the effective computational capacity of DSC to make the model easier to deploy to the endpoints.

VoVGSCSP
Based on a new hybrid convolutional approach, GSConv, we introduce a GS bott neck based on Bottleneck and replace Bottleneck in C3 with GS bottleneck to improve C Bottleneck originally comes from Resnet and is proposed for high-level Resnet networ It consists of three SCs with convolutional kernels of sizes 1 × 1, 3 × 3, and 1 × 1, resp tively, where the 1 × 1 convolutional kernel serves to reduce and recover dimensionali and the 3 × 3 is the bottleneck layer with smaller input and output dimensions. The spec structure of Bottleneck means that it is easy to change dimensionality and achieve featu dimensionality reduction, thus, reducing the computational effort [46].
A comparison of the structure of Bottleneck and GS bottleneck is shown in Figure  Compared to Bottleneck, GS bottleneck replaces the two 1 × 1 SCs with GSConv and ad a new skip connection. The two branches of GS bottleneck, thus, perform separate conv lutions without sharing weights and by splitting the number of channels so that the nu ber of channels is propagated via different network paths. The propagated channel inf mation thus gains greater correlation and discrepancy, which not only ensures the ac racy of the information but also reduces the computational effort [47]. Further, the convolutional kernel size of the DSC used in the original GSConv is 5 × 5, which is replaced with a 7 × 7 sized convolutional kernel to adapt it to the detection of large targets so we can obtain a larger scale of features and receptive field. This study reduces the network parameters and computation, minimizing the drawbacks of DSC, reducing its detrimental effects on the model, and making full use of the effective computational capacity of DSC to make the model easier to deploy to the endpoints.
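The cost of enlarging the depthwise kernel from 5 × 5 to 7 × 7 can be estimated by counting weights. The sketch below assumes a GSConv layout in which a standard convolution produces C2/2 channels and a depthwise convolution then filters those C2/2 channels before the two halves are concatenated and shuffled; biases and normalization layers are ignored. The function names and channel counts are hypothetical.

```python
def standard_conv_weights(c1, c2, k):
    """Weights of a dense k x k convolution from c1 to c2 channels."""
    return k * k * c1 * c2


def gsconv_weights(c1, c2, k_dense, k_depthwise):
    """Weights of the assumed GSConv layout (c2 must be even)."""
    half = c2 // 2
    dense = k_dense * k_dense * c1 * half          # SC branch: c1 -> c2/2
    depthwise = k_depthwise * k_depthwise * half   # one filter per channel
    return dense + depthwise


c1, c2 = 128, 256  # illustrative channel counts, not from the paper
sc = standard_conv_weights(c1, c2, k=3)
gs5 = gsconv_weights(c1, c2, k_dense=3, k_depthwise=5)
gs7 = gsconv_weights(c1, c2, k_dense=3, k_depthwise=7)

print(f"SC 3x3:               {sc:,}")
print(f"GSConv, 5x5 depthwise: {gs5:,}")
print(f"GSConv, 7x7 depthwise: {gs7:,}")
print(f"extra weights for 7x7: {gs7 - gs5:,}")
```

Because the depthwise branch holds only one filter per channel, the larger kernel adds a small number of weights relative to the dense branch, which is consistent with the text's point that the receptive field can be enlarged cheaply.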

VoVGSCSP
Based on the new hybrid convolution GSConv, we introduce a GS bottleneck derived from Bottleneck and replace the Bottleneck in C3 with the GS bottleneck to improve C3. Bottleneck originally comes from ResNet, where it was proposed for the deeper ResNet networks. It consists of three SCs with convolutional kernels of sizes 1 × 1, 3 × 3, and 1 × 1, respectively, where the 1 × 1 convolutional kernels serve to reduce and then recover dimensionality, and the 3 × 3 convolution is the bottleneck layer with smaller input and output dimensions. The special structure of Bottleneck makes it easy to change dimensionality and achieve feature dimensionality reduction, thus reducing the computational effort [46].
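The saving from the 1 × 1, 3 × 3, 1 × 1 arrangement can be checked with a weight count. This is a minimal sketch; the reduction factor of 4 and the channel count of 256 are assumptions chosen for illustration, and biases are ignored.

```python
def bottleneck_weights(channels, reduction=4):
    """Weights of a 1x1 -> 3x3 -> 1x1 bottleneck (reduction is assumed)."""
    mid = channels // reduction
    reduce_1x1 = channels * mid        # 1x1 conv lowers dimensionality
    conv_3x3 = 3 * 3 * mid * mid       # 3x3 conv works on the narrow tensor
    recover_1x1 = mid * channels       # 1x1 conv restores dimensionality
    return reduce_1x1 + conv_3x3 + recover_1x1


def plain_3x3_weights(channels):
    """Weights of a single 3x3 conv applied at full width."""
    return 3 * 3 * channels * channels


narrow = bottleneck_weights(256)
wide = plain_3x3_weights(256)
print(narrow, wide, round(narrow / wide, 3))  # 69632 589824 0.118
```

With these illustrative numbers the bottleneck needs about 12% of the weights of a full-width 3 × 3 convolution.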
A comparison of the structure of Bottleneck and GS bottleneck is shown in Figure 6. Compared to Bottleneck, GS bottleneck replaces the two 1 × 1 SCs with GSConv and adds a new skip connection. The two branches of GS bottleneck thus perform separate convolutions without sharing weights, and the channels are split so that they are propagated via different network paths. The propagated channel information thus gains greater correlation and discrepancy, which not only ensures the accuracy of the information but also reduces the computational effort [47].
In this paper, we use an aggregation method to embed the GS bottleneck in C3 in place of Bottleneck and design a newly structured VoVGSCSP module. A comparison of the structure of C3 and VoVGSCSP is shown in Figure 7.
In VoVGSCSP, the channels of the input feature map are split into two parts. The first part passes through a Conv, after which features are extracted by the stacked GS bottleneck modules. The other part acts as a residual connection and passes through only one Conv. The two parts are then concatenated along the channel dimension and output through a final Conv. VoVGSCSP retains all the advantages of GSConv as well as those that GS bottleneck brings. Thanks to the new skip-connected branch, VoVGSCSP has a stronger non-linear representation than C3, which addresses the problem of gradient disappearance. Meanwhile, similar to the segmentation gradient flow strategy of the cross-stage partial network (CSPNet), VoVGSCSP's split-channel approach enables rich gradient combinations, avoiding the repetition of gradient information and improving learning ability.
Ablation experimental results showed that VoVGSCSP reduces the computational effort and improves the accuracy of the model [48,49].
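The data flow through VoVGSCSP described above can be followed by tracking channel counts only. The sketch below is bookkeeping rather than a trainable module, and it assumes that each branch's Conv maps the input to half of the output channels; the function name, that ratio, and the example sizes are hypothetical.

```python
def trace_vovgscsp(c_in, c_out, n_bottlenecks=1):
    """List (stage, channels_in, channels_out) for the two-branch layout."""
    hidden = c_out // 2  # assumed width of each branch
    steps = [("branch A: Conv", c_in, hidden)]
    for i in range(n_bottlenecks):
        # GS bottleneck keeps the channel count unchanged
        steps.append((f"branch A: GS bottleneck {i + 1}", hidden, hidden))
    steps.append(("branch B: Conv (residual path)", c_in, hidden))
    steps.append(("concat A + B along channels", hidden + hidden, 2 * hidden))
    steps.append(("output Conv", 2 * hidden, c_out))
    return steps


steps = trace_vovgscsp(c_in=128, c_out=256, n_bottlenecks=2)
for stage, cin, cout in steps:
    print(f"{stage:32s} {cin:4d} -> {cout:4d}")
```

Only branch A carries the stacked GS bottlenecks, so the gradient reaching the input arrives through two different paths, which is the split gradient flow the text compares to CSPNet.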
Combining the above improvements, in the Backbone of YOLOv5s we replace the SPP module with the SPPF module to improve the pooling efficiency and add a Conv module to achieve a 64× downsampling output. In the Head of YOLOv5s, we replace all the Conv modules with GSConv modules to offset the additional parameters and computation brought by the upgrade of the feature pyramid structure. The C3 module is replaced with the VoVGSCSP module, in which features are extracted by the stacked GS bottleneck for better compatibility with the GSConv module. At the same time, the original 8× downsampling feature fusion layer is deleted and a 64× downsampling feature fusion layer is added to strengthen the learning capability of the network and give full play to the efficient computational capability of GSConv. The rest of the original modules of YOLOv5s remain unchanged [25]. Since all the improvements in this paper are compatible with different numbers of residual components and convolutional kernels and are not affected by the deepening of the network structure, they are also applicable to the YOLOv5m, YOLOv5l, and YOLOv5x models, which differ from YOLOv5s only in network width and depth. In this paper, we take only YOLOv5s as an example, and the improved network structure is shown in Figure 8.
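The effect of dropping the 8× layer and adding a 64× layer can be seen from the feature-map sizes at a 640 × 640 input. This is a small sketch; the helper name is hypothetical, and the stride sets follow the text (8, 16, 32 for the original model and 16, 32, 64 for the improved one).

```python
def grid_sizes(image_size, strides):
    """Side length of the feature map at each downsampling stride."""
    return {stride: image_size // stride for stride in strides}


original = grid_sizes(640, (8, 16, 32))    # 8x, 16x, 32x levels
improved = grid_sizes(640, (16, 32, 64))   # 16x, 32x, 64x levels

print("original:", original)  # {8: 80, 16: 40, 32: 20}
print("improved:", improved)  # {16: 40, 32: 20, 64: 10}

# Number of grid cells over all levels, a rough proxy for head workload.
cells_original = sum(side * side for side in original.values())
cells_improved = sum(side * side for side in improved.values())
print(cells_original, cells_improved)  # 8400 2100
```

Each cell of the 10 × 10 map covers a 64 × 64 pixel region of the input, which suits large animals, while the removal of the 80 × 80 map is what reduces the capacity for small targets.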

Target Detection Experiments Based on Wildlife Datasets

Experimental Environment
The software environment for the experiments is Linux Ubuntu 20.04, with Pytorch 12.0 as the deep learning framework, CUDA 11.6, and Python 3.8.

Experimental Dataset
In the training phase, the image size is set to 640 × 640 in this paper to reduce the computational effort for a single image. The images are randomly cropped, randomly scaled, and randomly arranged, and the dataset is enhanced and enriched with Mosaic data augmentation. The weight decay is set to 0.0005, the learning rate to 0.015, and the number of iterations to 600. The wildlife dataset used for the experiments is sourced from the OpenImages public dataset, which covers real wildlife images in several scenarios. The dataset is annotated in XML format. There are a total of 2800 sample images in the dataset, which are divided in a ratio of 6:2:2 into 1680 images for the training set, 560 images for the test set, and 560 images for the validation set. The distribution of the constructed dataset is shown in Table 1.
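The 6:2:2 division can be reproduced with a seeded shuffle. This is a minimal sketch: the image identifiers, the seed, and the function name are placeholders, and the paper does not state how its split was drawn.

```python
import random


def split_dataset(items, ratios=(6, 2, 2), seed=0):
    """Shuffle items and split them into train, test and validation lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is repeatable
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_test = len(items) * ratios[1] // total
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    val = items[n_train + n_test:]      # remainder goes to validation
    return train, test, val


image_ids = [f"img_{i:04d}" for i in range(2800)]  # placeholder file names
train, test, val = split_dataset(image_ids)
print(len(train), len(test), len(val))  # 1680 560 560
```

For a dataset with five equally sized classes, a per-class (stratified) split would keep the class balance identical across the three subsets; the version above does not enforce that.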

Table 1. Distribution of the constructed dataset.

Experimental Datasets    Number
Training set             1680
Test set                 560
Validation set           560
Total                    2800

Figure 9 shows an example image of the wildlife dataset in this paper.
In this paper, precision, recall, AP (average precision), mAP (mean average precision), model parameters (Parameters), model computation (GFLOPs), and frames per second (FPS) are used to evaluate the model [50].
TP refers to the number of detected frames whose IoU is greater than the set threshold (denoted as I, 0.5 in this paper; the same true frame is only recorded the first time it is detected) [51]. Frames with IoU ≤ I, as well as extra detections of a true frame that has already been detected, are counted as FP. FN refers to the number of true frames that are not identified, and TN refers to the number of samples that are themselves negative and are also identified as negative. The confusion matrix is shown in Table 2.
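The counting rule for TP, FP, and FN can be sketched as follows. Boxes are assumed to be (x1, y1, x2, y2) tuples and detections are assumed to be sorted by descending confidence; the function names and the sample boxes are illustrative only.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_tp_fp_fn(detections, truths, threshold=0.5):
    """Match each detection to at most one unmatched true frame."""
    matched = set()
    tp = fp = 0
    for det in detections:
        best_iou, best_idx = 0.0, None
        for idx, truth in enumerate(truths):
            if idx in matched:
                continue  # a true frame is only credited once
            value = iou(det, truth)
            if value > best_iou:
                best_iou, best_idx = value, idx
        if best_idx is not None and best_iou > threshold:
            matched.add(best_idx)
            tp += 1
        else:
            fp += 1  # low IoU or a repeat of an already-matched frame
    fn = len(truths) - len(matched)
    return tp, fp, fn


truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
print(count_tp_fp_fn(detections, truths))  # (1, 2, 1)
```

In the example, the second detection overlaps the first true frame well but that frame is already matched, so it is counted as FP; the second true frame is never detected and is counted as FN.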
Because precision and recall each have limitations, neither metric alone is sufficient to evaluate model performance, so the two need to be evaluated in combination, and AP and mAP are computed from precision P and recall R [52]. The formulae for these metrics are as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
Since the extent to which the model is affected by precision and recall respectively is unknown, we introduce the P-R curve to explore and measure both, where P refers to precision and R to recall. The P-R curve represents the relationship between precision and recall. In general, recall is set as the abscissa and precision as the ordinate. AP is defined as the mean value of precision over different recall rates, a measure that directly reflects the degree of model misidentification. It is calculated as the area under the P-R curve with the following formula:
$$AP = \int_0^1 P(R)\,dR$$
The mean average precision mAP represents the average accuracy over all species. The formula is as follows:
$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$
where N is the number of categories and AP_i is the AP of the i-th category. mAP includes mAP@0.5 and mAP@0.5:0.95. mAP@0.5 is calculated with the IoU threshold set to 0.5. For one of the categories with n positive example samples, its mAP@0.5 is the average resulting value of the AP for these n samples.
mAP@0.5:0.95 is obtained by increasing the IoU threshold from 0.5 to 0.95 in steps of 0.05 and taking the mean value of the resulting AP. mAP@0.5 shows the trends in P and R: the higher the mAP@0.5, the easier it is to maintain a high level of both. mAP@0.5:0.95 reflects the overall performance under different IoU thresholds. A higher mAP@0.5:0.95 means that the model is more capable of high-precision boundary regression, i.e., the prediction frame fits the true frame more accurately. FPS represents the number of images detected per second. Supposing it takes t seconds to process each picture, the calculation formula is as follows:
$$FPS = \frac{1}{t}$$
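The metrics above translate directly into a few helper functions. This is a simplified sketch: AP is approximated with the trapezoidal rule over a handful of P-R points, whereas detection toolkits usually interpolate the curve first, and all numeric values below are made up for illustration and are not results from the paper.

```python
def precision(tp, fp):
    """P = TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """R = TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(recalls, precisions):
    """Area under the P-R curve (trapezoidal rule, recalls ascending)."""
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2
    return area


def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


def frames_per_second(seconds_per_image):
    """FPS = 1 / t."""
    return 1.0 / seconds_per_image


# Illustrative numbers only.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=8)
ap = average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.6])
m_ap = mean_average_precision({"class_a": 0.8, "class_b": 0.6})
fps = frames_per_second(0.02)
print(p, r, round(ap, 3), round(m_ap, 3), round(fps, 1))
```

To obtain mAP@0.5:0.95 with these helpers, the matching step would be repeated for each IoU threshold from 0.5 to 0.95 and the resulting mAP values averaged.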

Experimental Results and Analysis
The selection of the initial anchor frame in YOLOv5s is extremely important. After calculating the distance between the prediction frame and the real frame based on the initial anchor frame, the network back-propagates to update its parameters. Adaptive anchor frame calculation finds the optimal anchor frame coordinates for the training set through an adaptive, iterative update at each training run. However, the outcome of such calculations is occasionally not ideal, so this paper uses the genetic clustering method to conduct dimensional clustering analysis on the width and height of the target frame [53] and calculate new anchor values, which can speed up the model's convergence and boost recognition accuracy. The anchor boxes obtained by clustering are matched according to the feature map scale, and the results are shown in Table 3.
Based on the object classification of the dataset, the sample images fall into five categories: antelope, elephant, leopard, eagle, and giraffe. A total of 560 images of each animal are selected, and the experiments are performed with the original YOLOv5s model and the improved YOLOv5s model. The improved YOLOv5s target detection method proposed in this paper contains improvements to the feature pyramid network structure, anchors, and GSConv. To demonstrate the effectiveness of these improved components, ablation experiments are performed on the animal dataset in this paper. To ensure the fairness of the experiments, the input image size is always kept at 640 × 640 and the hyperparameters are held constant. The experimental results are shown in Table 4.
(1) The effectiveness of SPPF. In the first set of experiments, SPP is improved into SPPF by halving the number of channels of the feature map through the CBS, applying the maximum pooling layers, and replacing the original parallel maximum pooling layers with serial ones, which improves the receptive field. Compared to SPP, SPPF has a higher detection speed with the same output results. As can be seen from Table 4, after improving SPP, mAP@0.5 increases by 0.2%, mAP@0.5:0.95 increases by 1.6%, and latency decreases by 28.6%. This is because SPPF solves the problem of multi-scale target fusion and improves the receptive field, which is conducive to improving target detection accuracy and speed.
(2) The effectiveness of P4-6. In the second set of experiments, the feature pyramid network structure is stacked up one level overall to increase the receptive field and improve target detection performance. The new P6 target detection layer provides more sufficient large-size target information during the fusion of feature information, improving the feature fusion and feature extraction capability of the network. As can be seen from Table 4, after improving P4-6, mAP@0.5 increases by 1.1% and mAP@0.5:0.95 increases by 2.9%. This is due to the higher downsampling magnification of P6 and the larger receptive field per pixel, since a larger receptive field can improve target detection accuracy.
(3) The effectiveness of GSConv. In the third set of experiments, the GSConv hybrid convolution module replaces the standard convolution in the Neck in order to decrease the number of parameters and the computation of the model while retaining more channel information and enhancing the feature extraction and fusion capability of the network. GSConv stacks SC and DSC in a lightweight structure so that the features of SC and DSC are connected. Additionally, it uses a shuffle operation in the manner of ShuffleNet to mix the channel information from SC into the output of DSC for channel information interaction. This lessens the negative effects of DSC on the model while keeping the number of model parameters low, reducing computational effort, and minimizing the loss of channel information. As can be seen from Table 4, the introduction of GSConv makes the model lighter at no cost in accuracy: mAP@0.5 and mAP@0.5:0.95 remain largely unchanged, while Parameters are reduced by 9.4%, GFLOPs by 3.0%, and latency by 11.8%.
(4) The effectiveness of VoVGSCSP. In the fourth set of experiments, VoVGSCSP replaces C3. The GS bottleneck module is designed on the basis of GSConv and splits the channels so that information is passed through different paths, reducing the computational effort of the original Bottleneck module. VoVGSCSP replaces Bottleneck with the GS bottleneck and embeds it in C3. The GS bottleneck splits the channels and adds a new branch through a Conv, and the features of the two parts are then connected. This method of splitting the channels enables a rich combination of gradients, avoiding repetition of gradient information and improving learning ability. It can also enhance the non-linear representation of the model and improve its accuracy. As can be seen from Table 4, the detection accuracy of the model is improved by using VoVGSCSP: mAP@0.5 increases by 2% and mAP@0.5:0.95 increases by 2.1%, while the computational effort of the model is also reduced, with a 6.15% reduction in GFLOPs.
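The statement in item (1) that SPPF gives the same output as SPP can be checked numerically. The sketch below uses plain Python lists in place of tensors and assumes the 5, 9, and 13 kernel sizes of the YOLOv5 SPP module with stride 1 and same-size padding; the helper name and the random test map are illustrative.

```python
import random


def max_pool_same(grid, k):
    """Stride-1 max pooling that keeps the input size.

    Windows are clipped at the border, which is equivalent to padding
    with negative infinity.
    """
    r = k // 2
    rows, cols = len(grid), len(grid[0])
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            window = [grid[y][x]
                      for y in range(max(0, i - r), min(rows, i + r + 1))
                      for x in range(max(0, j - r), min(cols, j + r + 1))]
            row.append(max(window))
        out.append(row)
    return out


rng = random.Random(0)
feature = [[rng.random() for _ in range(16)] for _ in range(16)]

# SPP: three pooling layers applied to the same input in parallel.
spp_outputs = [max_pool_same(feature, k) for k in (5, 9, 13)]

# SPPF: one 5 x 5 pooling layer applied three times in series.
p1 = max_pool_same(feature, 5)
p2 = max_pool_same(p1, 5)
p3 = max_pool_same(p2, 5)
sppf_outputs = [p1, p2, p3]

print(spp_outputs == sppf_outputs)  # True
```

Two stacked 5 × 5 poolings cover a 9 × 9 neighbourhood and three cover 13 × 13, so the serial form reuses earlier results instead of scanning the larger windows again, which is where the speed gain comes from.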
In addition, to demonstrate the superiority of the improved method, we bring together multiple sets of experimental results and compare them. Figure 10 shows the visualization of the ablation experimental data of the improved methods.
The AP values of each model in the ablation experiments for individual categories are shown in Table 5. The experimental results show that the detection accuracy is higher when the wild animals themselves are clearly distinguished from their environment. The improved YOLOv5s detection algorithm in this paper gives higher performance for animals that are large targets, such as elephants and giraffes: the AP value of elephants is improved by 5.6% and that of giraffes by 5.2%. The images of eagles in the dataset of this paper are mostly small targets, while, in the model of this paper, the pyramidal network structure is stacked one level up overall, which improves the receptive field and enhances the detection ability for large targets. The detection ability for small targets may therefore be weaker, but it still maintains high accuracy.
Compared with other classical algorithms in the YOLO series, such as YOLOv3-tiny and YOLOv4-tiny, the improved YOLOv5s detection algorithm in this paper has higher detection accuracy for all five types of animals and, overall, surpasses the other algorithms in the same series on the target detection task. The improved YOLOv5s algorithm has better robustness and environmental adaptability, and the detection accuracy is further improved. Since the five animals analyzed in this paper are similar to most animals in nature, it is feasible to extend the model to the classification of other animals, and the model in this paper can be better applied to the detection of wild animals. This also verifies that the improvement of the network structure of YOLOv5s in this paper raises the mAP over all the sample images.

Comparison with Mainstream Target Detection Algorithms
To objectively evaluate the performance of the model, we conduct a side-by-side comparison of YOLOv5s before and after the improvement with the mainstream target detection models YOLOv3-tiny, YOLOv4-tiny, Faster R-CNN, and SSD. We use mAP@0.5 and FPS for comparison, which further verifies that the improved YOLOv5s algorithm in this paper outperforms other target detection algorithms in detecting wild animals in natural environments.
The dataset in this paper is applied to the YOLOv3-tiny, YOLOv4-tiny, Faster R-CNN + VGG16, SSD + VGG16, YOLOv6s, and YOLOv7-tiny target detection algorithms [54,55], with all experiments performed independently. We use mAP@0.5, FPS, and GFLOPs [56,57] for comparison; the results are shown in Table 6. The comparison shows that, on the animal dataset used in this paper, the mAP@0.5 and FPS of the improved YOLOv5s are superior to those of the other mainstream target detection algorithms, in terms of both detection accuracy and detection speed. Furthermore, the GFLOPs of the improved model are the lowest of all the mainstream target detection models compared.

Comparison of the Detection Effect of the Model before and after the Improvement
Six sample images are taken from the test set to compare the detection results of the YOLOv5s model before and after the improvement, and the results are shown in Figure 11. From the comparison in Figure 11a, it can be seen that the original YOLOv5s model does not detect the eagle on the left, while the improved model does not miss it, so the detection accuracy is improved over the original model. In Figure 11b, the elephant is a large target. The model before the improvement recognizes one elephant as two, whereas the model after the improvement has no false detection, because the newly added P6 target detection layer provides more sufficient large-size target information during the fusion of feature information and increases the receptive field; this reflects the further improvement in large target detection ability. From the comparison in Figure 11c, it can be seen that, in the case of three giraffes with overlapping bodies, the detection accuracy of the improved model for giraffes is higher than that of the original model. According to the analysis of these sample detection results, the improved YOLOv5s model outperforms the original model in detection and recognition and lowers the rates of missed detection and false detection of target animals.

Comparison of the Detection Effect of the Model before and after Improvement on the VOC2007 + 2012 Dataset
To test the generalization ability of the model, we further evaluate the improved model on another public dataset so as to judge its ability to recognize targets of various sizes. The Pascal VOC2007 + 2012 public dataset [58] is selected for the testing experiments; it contains a higher proportion of small targets and therefore better tests the model's ability to detect them. A comparison of the performance of the original YOLOv5s and the improved model on the Pascal VOC2007 + 2012 public dataset is shown in Table 7. As can be seen from Table 7, the improved YOLOv5s model in this paper improves the detection accuracy and comprehensive performance on the Pascal VOC2007 + 2012 public dataset, and the improved method is still applicable to the detection of small target objects. Other indicators of this experiment and the simulation results are shown in Figures 12 and 13.

Conclusions
In this paper, YOLOv5s is better adapted to datasets of large animals in natural environments by improving the feature pyramid network structure and the convolution module, which effectively improves the detection accuracy and speed of the YOLOv5s model on this animal dataset. The final experimental findings demonstrate that the detection accuracy and model performance of the improved YOLOv5s algorithm are enhanced to varying degrees compared to other mainstream networks in the same experimental setting. The final improved YOLOv5s algorithm has a mAP@0.5 of 85.4% on the animal dataset in this paper and a mAP@0.5:0.95 of 59.7%. Compared with the original YOLOv5s, the mAP@0.5 of the improved model increases by 3.2%, mAP@0.5:0.95 increases by 6.8%, and GFLOPs decreases by 22.78%. Although the number of model parameters increases, FPS increases by 9.4. It can be seen that the model in this paper is improved in both detection accuracy and model lightness for large animal recognition. The accuracy and real-time performance of the detection meet the demand, and the model achieves high-accuracy real-time detection with a small amount of computation. In actual natural environment tests, our network structure needs to further improve its detection ability for some wild animals because of their mimetic ability, and, considering the simultaneous occurrence of multiple wild animals, further spatial features need to be extracted to obtain the correlation between different targets in space. The next phase of study will, therefore, concentrate on further network structure optimization and the addition of an attention mechanism to enhance performance in this area.

Data Availability Statement: Publicly available datasets were analyzed in this study. This data can be found here: [https://gitee.com/that-wipe-of-light/wildlife-object-detection-method-applyingsegmentation-gradient-flow-and-feature-dimensionality-reduction (accessed on 25 November 2022)].

Conflicts of Interest:
The authors declare no conflict of interest.