2.5.1. Enhanced Channel Attention
In aerial wildlife detection tasks, the high similarity between targets and backgrounds, coupled with the weak semantic characteristics of small-scale targets, poses severe challenges for feature selection mechanisms. Traditional convolutional neural networks treat all channel information equally during feature extraction, so key features are easily overwhelmed by background noise. To address this, we developed the Improved Efficient Channel Attention (IECA) module, designed to overcome the limitations of existing attention mechanisms (e.g., SE, CBAM, and ECA) in low-contrast grassland environments. Unlike SE, which relies solely on global average pooling, IECA incorporates both GAP and GMP to capture contextual as well as salient features. This dual-branch design is particularly suited to wildlife detection, where animals often blend into the background and require enhanced local contrast.
The development of channel attention mechanisms has evolved from global statistics to local interactions. Early SE-Net [27] established channel dependencies via global average pooling (GAP), but its single pooling strategy struggled with the sparsity of target features in aerial images. Subsequent research explored richer feature aggregation: CBAM [28] fused channel and spatial attention to enhance discriminative power, and GSoP introduced second-order statistics to strengthen feature representation, but these methods often incurred significant computational overhead. The ECA module balanced efficiency and performance by avoiding dimensionality reduction and optimizing cross-channel interaction; however, its reliance on a single GAP strategy still suffered from insufficient granularity in feature selection [29].
Theoretical studies indicate that global max pooling (GMP) possesses stronger feature selection capabilities in scenarios with sparse feature activations. After feature maps pass through ReLU activation, negative values are suppressed to zero. In this state, GMP effectively captures salient information like target edge contours. Based on this, the IECA module innovatively constructs a dual-branch feature aggregation structure: the GAP branch extracts the global statistical features of channels, while the GMP branch focuses on local salient regions. These two form complementary feature representations, further improving the model’s accuracy as well as robustness.
The structure of the IECA module is displayed in Figure 4. The input feature map $X \in \mathbb{R}^{C \times H \times W}$ undergoes GAP and GMP, respectively, generating $F_{\mathrm{avg}} \in \mathbb{R}^{C \times 1 \times 1}$ and $F_{\mathrm{max}} \in \mathbb{R}^{C \times 1 \times 1}$, where $C$, $H$, and $W$ are the number of channels, the height, and the width. The two pooled vectors are concatenated and sent to a one-dimensional convolutional layer, and the result is passed through a sigmoid activation unit to obtain the attention weights $W(X)$. The input is multiplied by the weights to produce the output. The specific calculation formulas are as follows:

$$W(X) = \delta\left(\mathrm{Conv1D}_{k}\left(\left[\mathrm{GAP}(X);\ \mathrm{GMP}(X)\right]\right)\right) \quad (1)$$

$$Y = W(X) \otimes X \quad (2)$$
Here, $F_{\mathrm{avg}}$ and $F_{\mathrm{max}}$ represent the averaged and maximized features per channel, respectively. These two vectors capture complementary information: $F_{\mathrm{avg}}$ reflects the global contextual background of each channel, while $F_{\mathrm{max}}$ highlights the most salient features within each channel.
These two feature vectors are then concatenated and processed through a one-dimensional convolution with a kernel size k (adaptively determined by Equation (3)) and a sigmoid activation function δ to generate the final attention weights W(X). The output Y is obtained by scaling the original input X with these learned weights.
Here, $X$ denotes the input feature vector, $\mathrm{Conv1D}_{k}$ is the one-dimensional convolutional layer, $\delta$ denotes the sigmoid activation function, and $Y$ is the output after attention weighting. Additionally, local interactions between channels are captured by the IECA module using one-dimensional convolution. The size of the convolution kernel is a crucial element, as it dictates the coverage of these interactions. Group convolution [34,35] reveals a relationship between the channel dimension $C$ and the kernel size $k$, suggesting that there is a mapping $\psi$ between $k$ and $C$. Thus, the kernel size is determined using an adaptive function:

$$k = \psi(C) = \left|\frac{\log_{2} C}{\gamma} + \frac{b}{\gamma}\right|_{\mathrm{odd}} \quad (3)$$
Equation (3) defines how the kernel size k for the 1D convolution is adaptively determined based on the number of channels C in the layer. This ensures that the cross-channel interaction captured by the convolution is appropriate for the complexity of the feature map. Hyperparameters are set as γ = 2 and b = 1, respectively, meaning k is calculated as the nearest odd integer to $\log_{2}(C)/2 + 1/2$.
Here, $\gamma$ and $b$ are hyperparameters controlling the kernel size, $C$ is the number of channels, and $|\cdot|_{\mathrm{odd}}$ signifies taking the nearest odd number.
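To make the dual-branch computation in Equations (1)–(3) concrete, the following is a minimal PyTorch sketch of an IECA block. It is an illustrative reconstruction rather than the released implementation: the class name, the stacking of the two pooled vectors as a two-channel 1D signal, and the bias-free Conv1d are our assumptions.

```python
import math
import torch
import torch.nn as nn


class IECA(nn.Module):
    """Improved Efficient Channel Attention: GAP and GMP branches, a 1D
    convolution with an adaptively sized kernel, and a sigmoid gate."""

    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # Equation (3): adaptive kernel size, rounded to the nearest odd integer.
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 == 1 else t + 1
        self.avg_pool = nn.AdaptiveAvgPool2d(1)  # GAP branch
        self.max_pool = nn.AdaptiveMaxPool2d(1)  # GMP branch
        # Two input channels (the stacked GAP and GMP vectors), one output channel.
        self.conv = nn.Conv1d(2, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        f_avg = self.avg_pool(x).view(n, 1, c)  # (N, 1, C)
        f_max = self.max_pool(x).view(n, 1, c)  # (N, 1, C)
        # Equation (1): concatenate, run the 1D convolution, apply the sigmoid.
        w = self.sigmoid(self.conv(torch.cat([f_avg, f_max], dim=1)))  # (N, 1, C)
        # Equation (2): rescale the input channels by the learned weights.
        return x * w.view(n, c, 1, 1)


# Example: for C = 256, log2(256)/2 + 1/2 = 4.5, so Equation (3) gives k = 5.
attn = IECA(channels=256)
out = attn(torch.randn(1, 256, 40, 40))  # same shape as the input
```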
To combine rich features from several depth levels, the ELAN module in YOLOv7-tiny extracts features from the input through multiple separate branches. Aggregating the information from these branches makes it easier to learn more complex feature representations and effectively improves feature utilization. However, features from different branches may differ in importance. The IECA module is therefore embedded in the ELAN module: after features are obtained from the multiple paths, the attention mechanism learns which features are important and assigns them higher weights. Figure 5 depicts the construction of this module, which is referred to as IECA-ELAN.
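As an illustration of where the attention sits, the sketch below applies the IECA block defined above to the concatenated branch outputs of a simplified ELAN-style block. The four-branch layout and channel widths are an approximation for illustration only; they are not the exact YOLOv7-tiny topology.

```python
class IECAELAN(nn.Module):
    """Simplified ELAN-style block with IECA re-weighting the fused branches."""

    def __init__(self, c_in: int, c_hidden: int, c_out: int):
        super().__init__()

        def cbs(ci: int, co: int, k: int) -> nn.Sequential:
            # Conv + BatchNorm + SiLU, the basic building block assumed here.
            return nn.Sequential(
                nn.Conv2d(ci, co, k, padding=k // 2, bias=False),
                nn.BatchNorm2d(co),
                nn.SiLU(),
            )

        self.branch1 = cbs(c_in, c_hidden, 1)      # shortcut branch
        self.branch2 = cbs(c_in, c_hidden, 1)      # main branch, stage 0
        self.branch3 = cbs(c_hidden, c_hidden, 3)  # main branch, stage 1
        self.branch4 = cbs(c_hidden, c_hidden, 3)  # main branch, stage 2
        self.attn = IECA(4 * c_hidden)             # weight the multi-path features
        self.fuse = cbs(4 * c_hidden, c_out, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b1, b2 = self.branch1(x), self.branch2(x)
        b3 = self.branch3(b2)
        b4 = self.branch4(b3)
        fused = torch.cat([b1, b2, b3, b4], dim=1)  # features from multiple paths
        return self.fuse(self.attn(fused))          # attention before final fusion
```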
2.5.2. Upsampling
In the YOLOv7-tiny model, upsampling is used to increase the resolution of deep feature maps, enhancing detection capability at different scales. The model employs nearest-neighbor interpolation for upsampling, which offers low computational cost and algorithmic simplicity. However, this method considers only neighboring pixel values, has a small perceptual field, and struggles to provide sufficient effective information for the upsampled image; in particular, low-resolution features cause severe texture loss for small targets [36]. To solve this, we introduce the CARAFE operator. It was selected over traditional interpolation methods (e.g., nearest-neighbor) and learnable alternatives (e.g., transposed convolution) because of its content-aware nature and computational efficiency: CARAFE adaptively performs upsampling based on the content of the current feature map, offering a wider receptive field and more semantic information, which makes it well suited to recovering the details of small targets that are often lost during downsampling, while remaining lightweight compared with other learnable upsampling techniques such as transposed convolution. The overall operator, shown in Figure 6, consists of two parts: the Kernel Prediction Module and the Content-Aware Reassembly Module (CARM).
For a given input feature map $x \in \mathbb{R}^{C \times H \times W}$ and an upsampling factor $\sigma$, CARAFE generates the output feature map $x' \in \mathbb{R}^{C \times \sigma H \times \sigma W}$. For each position $l' = (i', j')$ in $x'$, the corresponding position in $x$ is $l = (i, j)$, where $i = \lfloor i'/\sigma \rfloor$ and $j = \lfloor j'/\sigma \rfloor$. Define $N(x_{l}, k)$ as the $k \times k$ neighborhood centered at $l$ in the original feature map $x$.
The key innovation of CARAFE is predicting a unique reassembly kernel $W_{l'}$ for each target location $l'$ in the high-resolution output, based on the content of the input feature map, as shown in Equation (4):

$$W_{l'} = \psi\left(N\left(x_{l},\ k_{\mathrm{encoder}}\right)\right) \quad (4)$$
Here, ψ is the Kernel Prediction Module. Instead of using a fixed kernel (like bilinear interpolation), CARAFE generates a custom kernel tailored to the specific features surrounding the source location l. This allows for more precise and context-aware upsampling.
The weight matrix generation module $\psi$ consists of three steps: first, a $1 \times 1$ convolution compresses the original feature map's channel dimension $C$ to $C_{m}$. Subsequently, a convolution with kernel size $k_{\mathrm{encoder}}$ and $\sigma^{2} k_{\mathrm{up}}^{2}$ output channels generates the weight matrix. Finally, a softmax function normalizes the reassembly kernel values to sum to 1, preventing shifts in the mean of the original feature map. The CARM is displayed in Equation (5):

$$x'_{l'} = \sum_{n=-r}^{r}\sum_{m=-r}^{r} W_{l'}(n, m)\, x_{(i+n,\, j+m)}, \qquad r = \lfloor k_{\mathrm{up}}/2 \rfloor \quad (5)$$
Equation (5) describes how the value at the upsampled location $l'$ is computed. It is a weighted sum of the pixels in the $k_{\mathrm{up}} \times k_{\mathrm{up}}$ neighborhood around the corresponding source location $l$ in the input feature map. The weights for this sum come from the content-aware kernel predicted by Equation (4). In essence, for each new pixel to be created, CARAFE looks at a region around the original pixel and intelligently blends the values in that region based on the feature content, rather than just duplicating the nearest pixel value.
For feature reassembly, the value at position $l'$ in the upsampled feature map is computed by applying the reassembly kernel $W_{l'}$ to the corresponding $k_{\mathrm{up}} \times k_{\mathrm{up}}$ region centered at $l$ in the original feature map, where $l = (\lfloor i'/\sigma \rfloor, \lfloor j'/\sigma \rfloor)$. This completes the upsampling. Compared with other upsampling methods, feature maps produced by CARAFE possess a larger perceptual field as well as richer semantic information, further improving the model's detection capability.
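The following PyTorch sketch shows one way to implement the two CARAFE stages described above, the Kernel Prediction Module and the content-aware reassembly of Equation (5). The compressed channel width C_m and the kernel sizes k_encoder and k_up follow common defaults from the CARAFE literature and are assumptions here, not values fixed by this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CARAFE(nn.Module):
    """Content-aware upsampling: predict a k_up x k_up reassembly kernel for
    every output position, then blend the source neighborhood with it."""

    def __init__(self, c: int, sigma: int = 2, c_m: int = 64,
                 k_encoder: int = 3, k_up: int = 5):
        super().__init__()
        self.sigma, self.k_up = sigma, k_up
        # Kernel Prediction Module: channel compression then content encoding.
        self.compress = nn.Conv2d(c, c_m, kernel_size=1)
        self.encode = nn.Conv2d(c_m, sigma ** 2 * k_up ** 2,
                                kernel_size=k_encoder, padding=k_encoder // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        s, k = self.sigma, self.k_up

        # ---- Kernel Prediction Module --------------------------------------
        kernels = self.encode(self.compress(x))    # (N, s^2*k^2, H, W)
        kernels = F.pixel_shuffle(kernels, s)      # (N, k^2, sH, sW)
        kernels = F.softmax(kernels, dim=1)        # each kernel sums to 1

        # ---- Content-Aware Reassembly Module, Equation (5) -----------------
        # Gather the k x k neighborhood of every source position ...
        patches = F.unfold(x, kernel_size=k, padding=k // 2)   # (N, C*k^2, H*W)
        patches = patches.view(n, c * k * k, h, w)
        # ... and replicate it to the s x s output positions it serves.
        patches = F.interpolate(patches, scale_factor=s, mode="nearest")
        patches = patches.view(n, c, k * k, s * h, s * w)
        # Weighted sum over the neighborhood with the predicted kernels.
        return (patches * kernels.unsqueeze(1)).sum(dim=2)     # (N, C, sH, sW)


# Example: upsample a 128-channel feature map by a factor of 2.
up = CARAFE(c=128)
y = up(torch.randn(1, 128, 20, 20))  # -> (1, 128, 40, 40)
```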
2.5.3. Loss Function
In aerial images, wildlife is often small and densely packed, may appear far away in the frame, or may be obscured by complex backgrounds, posing significant challenges for traditional detection methods. The localization loss in the YOLO loss function usually uses the Intersection over Union (IoU) to quantify how well the predicted and ground-truth bounding boxes match. IoU, however, is especially sensitive to even small bounding box deviations, which can make the loss function unstable for tiny objects. To mitigate these issues, we introduce the Normalized Wasserstein Distance (NWD) loss [26]. It was chosen over IoU-based losses because of its robustness to small localization errors and its ability to handle overlapping bounding boxes, both common challenges in aerial wildlife imagery. By modeling bounding boxes as Gaussian distributions (GDs) and using the Wasserstein Distance (WD) to assess the difference between predicted and ground-truth boxes, NWD provides a smoother and more representative similarity measure for tiny objects, which is critical for accurate detection and counting. Compared to IoU, NWD is less sensitive to minor deviations in small-target bounding boxes, which helps improve detection accuracy, and it handles overlapping targets better. Furthermore, the NWD loss makes the training process more stable because it provides smoother error gradients, avoiding the training instability that IoU can cause in small-target detection. In this way, the NWD loss improves the model's ability to detect small wildlife targets, improves spatial accuracy, handles overlapping targets more effectively, and provides a more stable and balanced optimization throughout training.
In aerial images, small-sized wildlife often exhibits irregular shapes and, in most cases, occupies only a small number of pixels within the central region of the bounding box. Because the target tends to be concentrated in the core area of the box while background and irrelevant elements are mostly distributed at the edges, this spatial distribution makes it difficult for a traditional rectangular bounding box model to accurately reflect the actual importance of each pixel to the target.
Therefore, this model proposes an optimization method based on spatial weight distribution: pixels at the center of the bounding box are assigned the highest weight, and their importance gradually decreases with increasing distance from the center. Specifically, for a horizontal bounding box $R = (c_{x}, c_{y}, w, h)$, $(c_{x}, c_{y})$ represents the center coordinates, and $w$ and $h$ represent the width and height. To model this importance distribution, we represent the bounding box not as a uniform rectangle but as a 2D Gaussian distribution whose mean is the center of the bounding box. The spatial weight distribution within the bounding box $R$ is defined by an elliptical equation derived from the Gaussian probability density function:

$$\frac{(x - \mu_{x})^{2}}{\sigma_{x}^{2}} + \frac{(y - \mu_{y})^{2}}{\sigma_{y}^{2}} = 1 \quad (6)$$
Equation (6) defines an ellipse that fits within the bounding box; a point $(x, y)$ satisfying the equation lies on the ellipse boundary. Points inside this ellipse (where the left-hand side is ≤ 1) are considered part of the target region, with the importance maximized at the center $(\mu_{x}, \mu_{y})$ and decreasing towards the edges. This better captures the typical spatial distribution of a small target's pixels compared with a hard rectangular boundary.
Here, $\mu_{x}$ and $\mu_{y}$ denote the coordinates of the ellipse center, with $\mu_{x} = c_{x}$ and $\mu_{y} = c_{y}$; $\sigma_{x}$ is the semi-axis length along the x-axis, $\sigma_{x} = w/2$, and $\sigma_{y}$ is the semi-axis length along the y-axis, $\sigma_{y} = h/2$.
Formally, we model a bounding box $R$ as a 2D Gaussian distribution $\mathcal{N}(\mu, \Sigma)$. The mean $\mu$ is set to the center coordinates $(c_{x}, c_{y})$, and the covariance matrix $\Sigma$ is a diagonal matrix whose variances are $(w/2)^{2}$ and $(h/2)^{2}$, respectively, ensuring that the ellipse defined by one standard deviation aligns with the box edges. The probability density function of a two-dimensional Gaussian distribution is given by:

$$f(\mathbf{x} \mid \mu, \Sigma) = \frac{1}{2\pi |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \mu)^{\top} \Sigma^{-1} (\mathbf{x} - \mu)\right) \quad (7)$$
Equation (7) is the standard formula for a 2D Gaussian distribution. For any point $\mathbf{x} = (x, y)^{\top}$, it gives a value representing how likely that point is to belong to the target under our Gaussian model of the bounding box, creating a soft, probabilistic representation of the bounding box instead of a hard binary one. Here, $\mathbf{x}$ denotes the spatial coordinates, $\mu$ is the mean vector, and $\Sigma$ is the covariance matrix.
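As a small worked sketch of this modeling step, the helper below converts a box in the (c_x, c_y, w, h) format used above into the mean and the diagonal of the covariance matrix of its Gaussian representation; the function name and tensor layout are illustrative assumptions.

```python
import torch


def box_to_gaussian(box: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Convert (..., 4) boxes in (cx, cy, w, h) format to 2D Gaussian parameters.

    Returns mu = (cx, cy) and the diagonal of Sigma = diag((w/2)^2, (h/2)^2),
    matching the modeling used in Equation (7)."""
    cx, cy, w, h = box.unbind(dim=-1)
    mean = torch.stack([cx, cy], dim=-1)                          # (..., 2)
    cov_diag = torch.stack([(w / 2) ** 2, (h / 2) ** 2], dim=-1)  # (..., 2)
    return mean, cov_diag


# Example: a 40 x 20 box centered at (100, 60).
mu, sigma_diag = box_to_gaussian(torch.tensor([100.0, 60.0, 40.0, 20.0]))
# mu = (100, 60); sigma_diag = (400, 100), i.e. standard deviations of 20 and 10.
```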
For two two-dimensional Gaussian distributions $\mathcal{N}_{1} = \mathcal{N}(m_{1}, \Sigma_{1})$ and $\mathcal{N}_{2} = \mathcal{N}(m_{2}, \Sigma_{2})$, the difference between the distributions can be quantified by the second-order WD:

$$W_{2}^{2}(\mathcal{N}_{1}, \mathcal{N}_{2}) = \lVert m_{1} - m_{2} \rVert_{2}^{2} + \mathrm{Tr}\left(\Sigma_{1} + \Sigma_{2} - 2\left(\Sigma_{2}^{1/2}\Sigma_{1}\Sigma_{2}^{1/2}\right)^{1/2}\right) \quad (8)$$
The Wasserstein Distance (WD), shown in its general form in Equation (8), is a metric for measuring the distance between two probability distributions. Intuitively, it represents the minimum “cost” of moving the mass of one distribution so that it takes the shape of the other. Because the covariance matrices used here are diagonal and therefore commute, this expression can be further simplified to

$$W_{2}^{2}(\mathcal{N}_{1}, \mathcal{N}_{2}) = \lVert m_{1} - m_{2} \rVert_{2}^{2} + \left\lVert \Sigma_{1}^{1/2} - \Sigma_{2}^{1/2} \right\rVert_{F}^{2} \quad (9)$$
For two Gaussian distributions $\mathcal{N}(m_{1}, \Sigma_{1})$ and $\mathcal{N}(m_{2}, \Sigma_{2})$, the squared Wasserstein Distance therefore has a closed-form solution, shown in Equation (9), which makes it computationally efficient. Thus, for two bounding boxes modeled by the Gaussian distributions $\mathcal{N}_{a}$ and $\mathcal{N}_{b}$, the distance metric can be expressed as

$$W_{2}^{2}(\mathcal{N}_{a}, \mathcal{N}_{b}) = \left\lVert \left[c_{x_{a}},\ c_{y_{a}},\ \tfrac{w_{a}}{2},\ \tfrac{h_{a}}{2}\right]^{\top} - \left[c_{x_{b}},\ c_{y_{b}},\ \tfrac{w_{b}}{2},\ \tfrac{h_{b}}{2}\right]^{\top} \right\rVert_{2}^{2} \quad (10)$$
Equation (10) is the specific application of Equation (9) to our Gaussian bounding box models. Here, $\mathcal{N}_{a}$ and $\mathcal{N}_{b}$ are the Gaussian distributions modeling the predicted box and the ground-truth box, respectively.
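Continuing the sketch above, Equation (10) reduces the distance between two Gaussian-modeled boxes to a squared Euclidean distance between 4-vectors, which can be computed directly from the (c_x, c_y, w, h) boxes without forming the covariance matrices explicitly:

```python
import torch


def wasserstein_sq(box_a: torch.Tensor, box_b: torch.Tensor) -> torch.Tensor:
    """Squared second-order Wasserstein distance of Equation (10) for matched
    (..., 4) boxes in (cx, cy, w, h) format."""
    pa = torch.stack([box_a[..., 0], box_a[..., 1],
                      box_a[..., 2] / 2, box_a[..., 3] / 2], dim=-1)
    pb = torch.stack([box_b[..., 0], box_b[..., 1],
                      box_b[..., 2] / 2, box_b[..., 3] / 2], dim=-1)
    return ((pa - pb) ** 2).sum(dim=-1)
```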
However, directly applying this distance metric faces the problem of numerical scale incomparability. Therefore, this paper designs a nonlinear transformation strategy: an exponential normalization controlled by a scaling coefficient $T$. This transformation maps any non-negative distance into a normalized similarity score, making it especially suitable for extracting effective similarity representations from distance metrics, and adjusting $T$ adapts the distribution characteristics of the transformed values so that similarity relationships between samples are captured more effectively. Taking the negative exponent of the ratio of the square root of the WD to the constant $T$ maps the distance into a probability space, creating a new metric called NWD:

$$\mathrm{NWD}(\mathcal{N}_{a}, \mathcal{N}_{b}) = \exp\left(-\frac{\sqrt{W_{2}^{2}(\mathcal{N}_{a}, \mathcal{N}_{b})}}{T}\right) \quad (11)$$
Equation (11) defines the Normalized Wasserstein Distance (NWD). It applies this nonlinear transformation to the distance from Equation (10), mapping the Wasserstein Distance, which can be any non-negative number, into the range (0, 1] and creating a value that behaves like a similarity score (1 for identical boxes, approaching 0 for very different boxes). This normalization makes the metric more stable and suitable for use in a loss function.
$T$ represents an adjustment coefficient related to the specific dataset. To leverage the complementary advantages of the NWD metric and the IoU metric, this study employs a balanced weighting strategy, setting the contribution of each to 0.5 so that both similarity measures have equal influence on the overall loss. The optimized composite loss function is

$$L = \frac{1}{N}\sum_{i=1}^{N}\left(1 - \frac{\mathrm{NWD}_{i} + \mathrm{IoU}_{i}}{2}\right) \quad (12)$$
Equation (12) defines the final composite loss function used for training, where N is the number of objects. For each predicted box and its corresponding ground-truth box, the loss term is $1 - (\mathrm{NWD}_{i} + \mathrm{IoU}_{i})/2$, so the loss decreases as both the NWD similarity (from Equation (11)) and the traditional IoU similarity increase. By averaging $(1 - \mathrm{NWD})$ and $(1 - \mathrm{IoU})$, we create a loss function that leverages the strengths of both metrics: IoU's effectiveness for larger objects and NWD's robustness for small and overlapping objects. The final loss is the average over all object detections.
Here, N is the number of predicted detection boxes matched to ground-truth targets. Taking the average aggregates the loss from multiple targets into a single value, which is then used in the subsequent calculation and optimization of the loss function.
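Finally, a minimal sketch of the composite loss in Equations (11) and (12), reusing the wasserstein_sq helper above. The IoU helper is a plain element-wise IoU for matched box pairs, and the default value of T shown here is only a placeholder, since T is dataset-dependent.

```python
import torch


def nwd(box_pred: torch.Tensor, box_gt: torch.Tensor, t: float = 12.8) -> torch.Tensor:
    """Normalized Wasserstein Distance, Equation (11); t is dataset-dependent."""
    return torch.exp(-torch.sqrt(wasserstein_sq(box_pred, box_gt)) / t)


def iou_cxcywh(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Element-wise IoU for matched (..., 4) boxes in (cx, cy, w, h) format."""
    ax1, ay1 = a[..., 0] - a[..., 2] / 2, a[..., 1] - a[..., 3] / 2
    ax2, ay2 = a[..., 0] + a[..., 2] / 2, a[..., 1] + a[..., 3] / 2
    bx1, by1 = b[..., 0] - b[..., 2] / 2, b[..., 1] - b[..., 3] / 2
    bx2, by2 = b[..., 0] + b[..., 2] / 2, b[..., 1] + b[..., 3] / 2
    inter = ((torch.min(ax2, bx2) - torch.max(ax1, bx1)).clamp(min=0)
             * (torch.min(ay2, by2) - torch.max(ay1, by1)).clamp(min=0))
    union = a[..., 2] * a[..., 3] + b[..., 2] * b[..., 3] - inter
    return inter / union.clamp(min=1e-7)


def box_loss(box_pred: torch.Tensor, box_gt: torch.Tensor) -> torch.Tensor:
    """Composite localization loss, Equation (12): the mean of 1 - (NWD + IoU) / 2
    over the N matched predicted/ground-truth box pairs."""
    sim = 0.5 * nwd(box_pred, box_gt) + 0.5 * iou_cxcywh(box_pred, box_gt)
    return (1.0 - sim).mean()
```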