3.2. Multi-Scale-Edge Information Select Architecture
The original YOLO12 framework consists of three basic components: Backbone, Neck, and Head, each implementing a different function. The Backbone of YOLO12 is the core component responsible for feature extraction. Through multi-layer convolution operations and a modular structure, it converts the input image layer by layer into multi-scale feature maps for subsequent detection tasks. Its sub-module stack includes the C3k2 module and the A2C2f (Area Attention) module, which extract multi-scale features containing rich semantic information.
Compared with previous generations, YOLO12 introduces the C3k2 and A2C2f modules in place of the earlier C2f module. The C3k2 module uses two small-kernel convolutions instead of one large kernel to preserve both global and local feature information. The A2C2f module maintains a large receptive field while reducing the computational complexity of attention in a simple way, thereby increasing speed. However, these modules still contain a large number of bottleneck structures, which can lead to vanishing gradients and feature redundancy, weakening the global receptive field coverage.
To address these challenges, the C3k2 and A2C2f modules in the Backbone architecture were redesigned. Specifically, a new EE (Edge Enhancer) module was developed, and the DSM (Dual-domain Selection Mechanism) proposed by Cui et al. [33] was incorporated to propose a new MS-EIS (Multi-Scale-Edge Information Select) architecture to replace the Bottleneck in the C3k2 and A2C2f modules. The structure of the MS-EIS architecture is shown in Figure 2.
MS-EIS replaces the simple bottleneck used in YOLO12’s backbone with an integrated module that goes beyond the original DSM. It combines principled multi-scale pooling with learnable scale weights, an explicit edge-enhancement path based on Laplacian filtering balanced by Gaussian denoising, and an efficient depthwise separable convolution preprocessing stage, all fused by a targeted attention scheme. Compared with simply adopting the original DSM, MS-EIS is more prescriptive and reproducible: it specifies concrete operators and hyperparameters, introduces an explicit signal-processing-style edge path that DSM lacks, and trades a small amount of added structure for substantial improvements in boundary and small-object sensitivity while maintaining a low computational profile.
Edge information is a consistent and discriminative cue for separating weeds from crops, especially in early growth stages when color and texture signals are weak and plant instances are small; in these conditions, leaf contours, margins, and vein-defined edges become the most reliable features for distinguishing species. Edges encode shape and boundary geometry that are less sensitive to illumination shifts and color similarity, provide strong localization anchors for tiny and partially occluded plants, and help disambiguate overlapping instances by revealing separate object contours. MS-EIS treats edge cues as a first-class signal rather than an incidental feature: it explicitly extracts high-frequency content with a Laplacian path and stabilizes that signal with Gaussian denoising, while multi-scale pooling and learnable scale weights allow the module to emphasize the most relevant edge scales for a given scene or growth stage. This differs from common attention modules such as CBAM, which reweight channels and spatial locations of existing feature maps but do not create a dedicated, signal-processing-style edge channel, and from simple Sobel fusion, which injects a fixed, single-scale gradient that amplifies noise and cannot adapt to scale, blur, or imaging conditions. By combining principled filtering, adaptive scale weighting, and efficient depthwise separable preprocessing, MS-EIS amplifies true boundary cues while suppressing noise and keeping computation low, yielding clearer boundaries, better small-object localization, and greater robustness in cluttered agricultural scenes than either generic attention or naive fixed edge fusion.
MS-EIS (Multi-Scale-Edge Information Select) is a deep learning feature enhancement module optimized for object detection, segmentation, and feature extraction tasks. Its primary goal is to improve the model’s object detection capabilities in complex scenes through multi-scale feature fusion and edge information enhancement. This architecture combines multi-scale feature extraction, edge information enhancement, deep convolution operations, and a dual region-focused attention mechanism to optimize the effectiveness and robustness of feature extraction. The key concept in the MS-EIS architecture is to utilize multi-scale average pooling and adaptive convolution to extract feature information at different levels, combined with an edge enhancement module for feature comparison and supplementation, and ultimately achieve efficient feature fusion through a dual region selection mechanism. This process enables MS-EIS to simultaneously capture local details and global contextual information, improving detection accuracy while reducing computational costs.
Specifically, four parallel AAP (Adaptive Average Pooling) modules are first used to perform multi-scale adaptive average pooling to extract feature information at different scales. This is then combined with convolution operations to obtain rich semantic information. The core idea is to utilize pooling operations at different scales to capture global and local information with different receptive fields. Assuming an input feature X, the input feature map can be represented as follows:
X ∈ ℝ^(C×H×W)
where the real number space ℝ is used to describe the range of the feature map, and i is mainly used to index the feature extraction process at different scales, specifically indicating the feature processing step at the i-th scale. H and W are the height and width of the feature map, respectively, and C is the number of channels of the input and output features. Multi-scale adaptive pooling is combined with the corresponding convolution operation on the input feature map to construct a feature pyramid:
Ƒ_i = Upsample(σ(Conv_i(Pooling_i(X)))), i = 1, 2, 3, 4
where Pooling_i represents an adaptive pooling operation of scale i. Through pooling operations of different scales, different-resolution versions of the feature map are extracted, so that the network can focus on large, medium, and small targets at the same time. Conv_i represents the convolution operation of the i-th level: the first convolution with a spatial dimension of 1 is used to reduce the dimension and enhance the information interaction between channels, providing more effective input for subsequent convolution operations, and the second convolution with a spatial dimension of 3 further extracts local features and improves the detection of edges and small targets. σ represents normalization and nonlinear activation operations, which transform the features through the corresponding activation functions to make them more expressive. Upsample represents the upsampling operation, which is used to restore the size of the input features.
Pooling at different scales preserves multi-level features, improving adaptability to small objects and complex backgrounds. Multi-scale pooling also extracts local information of varying sizes, helping to capture the multi-level features of an image. Softmax is then used to weight features at different scales, resulting in smoother fusion without relying on manually configured hyperparameters. Finally, the multi-scale fused feature, Ƒ_MS, is calculated as follows:
Ƒ_MS = Σ_i α_i · Ƒ_i,  α_i = exp(β_i) / Σ_j exp(β_j)
where Ƒ_MS represents the final fused multi-scale features, and α is the normalized feature weighting coefficient used to weight feature maps at different scales. β is a learnable parameter representing the original weight of the feature weighting coefficient; each β_i is a scalar value obtained through neural network training and determines the size of α_i. j is an index variable used to indicate the summation range of feature weights calculated at all scales. The softmax mechanism adaptively adjusts the weights of features at different scales, allowing the model to automatically select the most important scale features. Furthermore, the introduction of learnable parameters allows features of different scales to contribute differently to the final fusion, improving the flexibility of feature extraction. The final fused multi-scale features take into account the feature representation of both large and small objects.
We initialize each β to zero so the softmax produces equal weights across scales at the start, providing a neutral baseline. Each β is then learned as a scalar parameter by the usual backpropagation process but with conservative optimization settings, for example a reduced learning rate or a dedicated learning-rate multiplier and minimal or no weight decay, to avoid abrupt or biased shifts in the fusion distribution. If training is unstable, simple stabilizers can be used such as freezing β for a short warmup period, applying gradient clipping, or adding a softmax temperature greater than one to smooth the resulting weights. During training we monitor the resulting α values and include an ablation that compares fixed versus learnable β to verify the benefit of learned scale weighting.
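For concreteness, the following PyTorch sketch illustrates how the multi-scale pooling branch and the learnable softmax weights described above could be implemented. The pooling sizes, activation choices, and the optimizer settings for β are illustrative assumptions rather than the exact configuration used in this work.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Sketch of the multi-scale pooling branch of MS-EIS (pooling sizes assumed)."""
    def __init__(self, channels, pool_sizes=(1, 2, 4, 8)):
        super().__init__()
        self.pool_sizes = pool_sizes
        # Per scale: 1x1 conv for channel interaction, then 3x3 conv for local detail.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 1, bias=False),
                nn.BatchNorm2d(channels),
                nn.SiLU(),
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.SiLU(),
            )
            for _ in pool_sizes
        ])
        # beta initialised to zero -> softmax yields equal scale weights at the start.
        self.beta = nn.Parameter(torch.zeros(len(pool_sizes)))

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = []
        for size, branch in zip(self.pool_sizes, self.branches):
            y = F.adaptive_avg_pool2d(x, size)                                        # Pooling_i
            y = branch(y)                                                             # Conv_i with sigma
            y = F.interpolate(y, size=(h, w), mode="bilinear", align_corners=False)   # Upsample
            feats.append(y)
        alpha = torch.softmax(self.beta, dim=0)          # alpha_i = exp(beta_i) / sum_j exp(beta_j)
        return sum(a * f for a, f in zip(alpha, feats))  # F_MS

# Illustration of conservative optimisation settings for beta (example values only).
if __name__ == "__main__":
    m = MultiScaleFusion(64)
    base, beta = [], []
    for name, p in m.named_parameters():
        (beta if name.endswith("beta") else base).append(p)
    opt = torch.optim.SGD(
        [{"params": base, "lr": 1e-2, "weight_decay": 5e-4},
         {"params": beta, "lr": 1e-3, "weight_decay": 0.0}],  # reduced LR, no weight decay
        momentum=0.9,
    )
    print(m(torch.randn(1, 64, 40, 40)).shape)
```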
The EE (Edge Enhancer) module is a key component in the MS-EIS architecture. Its primary function is to address the traditional Bottleneck architecture’s inability to extract edge information and to improve the ability to identify target edge contours. The core concept of the EE module is to extract edge information using Gaussian filtering and the Laplacian operator, and then fuse feature maps through convolution. In this step, the edge information is first calculated:
where Edge represents the extracted edge information, and the second-order partial derivative term represents the Laplacian operation. G_σ represents the standard Gaussian kernel, which produces the Gaussian filtering result and is mainly used for denoising. λ is a balance parameter that controls the weighting of Gaussian filtering and gradient enhancement. Next, a standard convolutional layer applies a nonlinear transformation to the edge features, which are then added to the original features to obtain the enhanced edge features:
Ƒ_EE = X + σ(Conv(Edge))
where Ƒ_EE represents the final enhanced edge feature, and σ represents the nonlinear transformation. The EE module improves the model’s perception of boundary information through edge enhancement, facilitating accurate target localization. Furthermore, it combines Gaussian filtering and the Laplacian operator to enhance edges while minimizing noise interference.
Overall, the EE module extracts edge information, enhancing the network’s sensitivity to edges. This is crucial for visual tasks such as object detection and semantic segmentation. The Concat module then interpolates features extracted at different scales to align them to the same scale and concatenates them. Finally, a standard convolutional layer fuses these features into a unified feature representation, improving the model’s perception of multi-scale features.
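A minimal sketch of the EE idea is given below, assuming fixed 3×3 Gaussian and Laplacian kernels applied depthwise and a learnable scalar λ that blends the smoothed and edge responses before the convolutional fusion; the exact kernel sizes and the precise form of the λ balance in the original module may differ.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeEnhancer(nn.Module):
    """Sketch of the EE module: Gaussian denoising + Laplacian edges, fused back by a conv.
    Kernel sizes and the handling of lambda are assumptions for illustration."""
    def __init__(self, channels, lam=0.5):
        super().__init__()
        # Fixed 3x3 Gaussian and Laplacian kernels, applied per channel (depthwise).
        g = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0
        l = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
        self.register_buffer("gauss", g.expand(channels, 1, 3, 3).clone())
        self.register_buffer("lap", l.expand(channels, 1, 3, 3).clone())
        self.lam = nn.Parameter(torch.tensor(lam))     # balance between smoothing and gradients
        self.fuse = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, x):
        c = x.shape[1]
        smooth = F.conv2d(x, self.gauss, padding=1, groups=c)   # Gaussian filtering (denoising)
        edge = F.conv2d(smooth, self.lap, padding=1, groups=c)  # Laplacian on the smoothed map
        edge = self.lam * edge + (1.0 - self.lam) * smooth      # assumed form of the lambda balance
        return x + self.fuse(edge)                              # F_EE = X + sigma(Conv(Edge))

if __name__ == "__main__":
    ee = EdgeEnhancer(64)
    print(ee(torch.randn(1, 64, 80, 80)).shape)
```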
Finally, all features are fed into the DSM (Dual-domain Selection Mechanism). This attention mechanism aims to efficiently select key features highly relevant to the target task from multi-scale-edge information. By focusing on more important areas in the image, such as complex edges and high-frequency signals, this mechanism adaptively selects features with higher task relevance from the multi-scale features, significantly improving feature selection accuracy and overall model performance.
Figure 3 shows the detailed structure of each submodule in the DSM.
The DSM module addresses feature redundancy by selecting the most important features in both the channel and spatial domains to improve detection accuracy. First, two depthwise separable convolutions are used to reduce computational complexity while enhancing feature representation. Depthwise separable convolution decomposes the standard convolution into DC (Depthwise Convolution) and PC (Pointwise Convolution), resulting in the following output feature maps:
Y_d = W_d ⊛ X,  Y_p = W_p · Y_d,  Ƒ_DW = ReLU(Y_p)
O(k² · C² · H · W)  vs.  O(k² · C · H · W + C² · H · W)
where C is the number of input and output feature channels, and k is the convolution kernel size. Y_d represents the intermediate feature map after depthwise convolution, and W_d is the convolution kernel for depthwise convolution. Y_p represents the output feature map after pointwise convolution, and W_p is the weight matrix for pointwise convolution. ReLU represents the activation function used for nonlinear transformations, Ƒ_DW is the final activated depthwise separable convolution output, and O represents computational complexity, where the left side corresponds to standard convolution and the right side to depthwise separable convolution. Compared to standard convolution, depthwise separable convolution reduces computational complexity by approximately k² times; when k is 3, the computational complexity is reduced by approximately 9 times.
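The factorisation and its cost saving can be sketched as follows; channel counts and feature-map sizes are arbitrary example values.
```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Sketch of the DW + PW factorisation used in the DSM preprocessing stage."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, k, padding=k // 2,
                            groups=channels, bias=False)        # Y_d = W_d * X (per-channel)
        self.pw = nn.Conv2d(channels, channels, 1, bias=False)  # Y_p = W_p . Y_d (channel mixing)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.pw(self.dw(x)))                    # F_DW = ReLU(Y_p)

def mult_adds(c, h, w, k):
    """Rough multiply-add counts for standard vs. depthwise separable convolution."""
    standard = k * k * c * c * h * w
    separable = k * k * c * h * w + c * c * h * w
    return standard, separable

if __name__ == "__main__":
    std, sep = mult_adds(c=256, h=40, w=40, k=3)
    print(f"reduction factor = {std / sep:.1f}x")   # close to k^2 = 9 for large C
    m = DepthwiseSeparableConv(256)
    print(m(torch.randn(1, 256, 40, 40)).shape)
```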
The dual region information selection mechanism enhances the information expression capabilities of the channel domain and the spatial domain. That is, the DSM module combines the CA (Channel Attention) and SA (Spatial Attention) mechanisms, allowing the model to focus on the salient features of the target while ignoring unimportant information. First, the feature importance weight is calculated in the channel dimension:
M_c = σ(W_c · Pooling(Ƒ_DW)),  Pooling(Ƒ_DW) = (1 / (H·W)) Σ_i Σ_j Ƒ_DW(i, j)
where M_c represents the channel attention weight matrix, which evaluates the importance of different channels. W_c represents the learnable parameter matrix used for channel weighting, which adjusts the contribution of different channels. Pooling represents the GAP (Global Average Pooling) operation, which is used to obtain global information for each channel, and i and j index the spatial positions of the feature map.
Compared to the traditional attention mechanism, channel attention only involves the channel dimension and does not consider spatial information, so the computational cost is lower. Afterwards, the feature response is calculated in the spatial dimension:
M_s = σ(W_s ∗ Ƒ_DW)
where M_s represents the spatial attention weight matrix, which indicates the importance of different spatial locations, and W_s is the learnable parameter matrix used for spatial weighting, which adjusts the contribution of different spatial regions. A large-kernel convolution is used to extract spatial features over a wide range, enhancing the integrity and shape perception of the object.
Compared to channel attention, spatial attention focuses directly on the spatial distribution of feature maps, highlighting important target areas while suppressing irrelevant areas. By learning weights for different spatial locations, the model can more accurately locate the target and reduce background interference. Ultimately, the output of the DSM module is as follows:
Ƒ_DSM = M_c ⊗ M_s ⊗ Ƒ_DW
where Ƒ_DSM represents the final output feature of the dual region selection mechanism, and ⊗ denotes the element-wise multiplication used to weight the different feature maps. The doubly weighted features are influenced by both channel and spatial attention, ensuring that the final fused features are optimally selected in both the channel and spatial dimensions. Because the attention mechanism suppresses irrelevant features, the feature maps output by the DSM module contain more compact information, thereby improving computational efficiency. For detection tasks, the edge and semantic features of key objects are significantly enhanced, which helps improve small object detection capabilities.
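A compact sketch of this dual-domain gating is shown below, assuming a 1×1 convolution after global average pooling for the channel gate and a single 7×7 convolution for the spatial gate; the actual kernel sizes and intermediate dimensions are assumptions.
```python
import torch
import torch.nn as nn

class DualDomainSelection(nn.Module):
    """Sketch of the DSM gating: channel attention (GAP + 1x1 conv) and spatial attention
    (large-kernel conv) re-weight the depthwise-separable features."""
    def __init__(self, channels, spatial_kernel=7):
        super().__init__()
        self.channel_gate = nn.Sequential(                 # M_c: which channels matter
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1, bias=True),
            nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(                 # M_s: which locations matter
            nn.Conv2d(channels, 1, spatial_kernel, padding=spatial_kernel // 2, bias=True),
            nn.Sigmoid(),
        )

    def forward(self, f_dw):
        m_c = self.channel_gate(f_dw)          # (B, C, 1, 1)
        m_s = self.spatial_gate(f_dw)          # (B, 1, H, W)
        return f_dw * m_c * m_s                # F_DSM: element-wise weighting in both domains

if __name__ == "__main__":
    dsm = DualDomainSelection(128)
    print(dsm(torch.randn(1, 128, 40, 40)).shape)
```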
Finally, the features output by the DSM module are transformed and fused through a standard convolutional layer to obtain the final output of the MS-EIS architecture:
Output = Conv(Ƒ_DSM)
where Output represents the feature output of the MS-EIS architecture. The standard convolutional layer serves as the final feature mapping unit, which optimizes cross-channel information fusion, enables multi-scale feature extraction and dual region information selection to work together, and improves the accuracy and robustness of target detection.
Overall, MS-EIS has efficient multi-scale information integration capabilities. Its core components include adaptive average pooling, edge enhancement module, and dual region information selection mechanism, which synergistically improve target detection performance. First, MS-EIS uses AAP combined with convolution operations to enhance the perception of targets of different scales and improve the scale adaptability of detection. Secondly, the EE module optimizes the target contour features through edge information enhancement, making the target more prominent in complex backgrounds, thereby improving the robustness of target detection. Meanwhile, DSM combines channel attention and spatial attention to filter key features in the channel domain and spatial domain, effectively reducing redundant information and further improving detection accuracy.
Furthermore, MS-EIS employs a lightweight pooling and attention-weighted mechanism for efficient information screening and fusion, ensuring high accuracy while also balancing computational efficiency, making it suitable for real-time tasks. Specifically for small target detection, spatial attention enhances target edge information, the EE module strengthens target contours, and multi-scale feature extraction further enhances the saliency and detection of small targets. Overall, the MS-EIS architecture achieves efficient and accurate target detection through the coordinated optimization of multi-scale feature extraction, edge information enhancement, and dual region information selection.
3.3. Additive-Convolutional Gated Linear Unit Pyramid Network
The Neck portion of the original YOLO12 is located between the Backbone and the Head. Its main function is to fuse and enhance the multi-scale features extracted by the Backbone and pass them to the Head for prediction. The feature pyramid structure combines upsampling and cross-scale connections to achieve adaptive detection of multi-scale targets, while taking into account both global context and local details. Although the Neck reduces computational complexity to a certain extent, the upsampling process may lead to loss of detail or reduced localization accuracy, especially when detecting small targets or targets with blurred boundaries. At the same time, multi-scale feature concatenation can easily introduce information redundancy, interfere with the expression of effective features, and increase the possibility of false or missed detections. In addition, deep features may overwhelm the local details of shallow features, weakening the ability to capture small targets.
To ensure both real-time target detection and improved accuracy, a new Add-CGLU (Additive-Convolutional Gated Linear Unit) pyramid network was designed to optimize the Neck portion. This architecture draws inspiration from the CAS-ViT structure proposed by Zhang et al. [34], replacing the stacked Bottleneck with Additive Blocks, and from the Convolutional GLU (Gated Linear Unit) framework proposed by Shi [35]. The structure of the Add-CGLU pyramid network is shown in Figure 4.
The Add-CGLU pyramid replaces stacked bottlenecks with Additive Blocks that combine three coordinated components: a Local Perception stage that preserves shallow spatial detail through standard convolutions, batch normalization, and GELU; a Convolutional Additive Token Mixer that captures global context using an additive attention scheme to avoid the quadratic cost of conventional self-attention; and a Convolutional Gated Linear Unit that fuses depthwise convolution with a gating path for efficient feature filtering. Alternating Additive Blocks and patch segmentation give the neck a scalable multi-scale fusion pattern, while the Stem and pooled head keep integration simple. Compared with CAS-ViT and a plain GLU, this design is more convolution-centric, more computationally efficient, and more prescriptive about operators and hyperparameters; it better preserves local detail while still capturing global context, reduces attention overhead, and is easier to reproduce and deploy for real-time and small-object detection.
Although gating mechanisms, such as the Gated Linear Unit, have been widely adopted in both Transformer and CNN architectures to modulate information flow, the combination of additive fusion and gating proposed in this manuscript represents a distinct structural innovation, rather than a simple application of existing components. In this design, the additive fusion integrates multiple feature streams in a complementary manner before the gating operation selectively filters and emphasizes informative patterns. This sequential coordination enables more precise feature modulation compared with conventional convolutional layers or standard residual blocks, which typically rely on linear summation or identity shortcuts without adaptive feature weighting. As a result, the network achieves higher computational efficiency, since additive fusion reduces redundant computations and mitigates the quadratic complexity associated with traditional self-attention, while the gating mechanism ensures that only salient features are propagated forward. Empirically, this synergy improves the representation of fine-grained local details and strengthens global context modeling, leading to better detection performance, particularly for small objects and real-time scenarios, without increasing the model’s parameter or inference overhead.
In the Add-CGLU pyramid network, the Stem module is located at the initial input stage and is responsible for converting the raw input into appropriate feature representations. Alternating Additive Block and Patch Segmentation modules divide the feature map into multiple small patches, while the classification detection head performs global pooling on the final features and outputs the classification results.
The Additive Block is the core module in the Add-CGLU architecture, comprising the LP (Local Perception) mechanism, CATM (Convolutional Additive Token Mixer), and CGLU (Convolutional Gated Linear Unit). The LP mechanism utilizes local integration to extract local spatial information from input features, overcoming the receptive field limitations of standard convolutional layers. Three standard convolutional layers, a Batch Normalization (BN) layer, and a nonlinear activation function, GELU (Gaussian Error Linear Unit), are used to initially process the input features. Assuming x_1 is the input feature, the process of extracting local spatial features using the LP mechanism can be represented as follows:
Lp = σ(BN(W(x_1)))
where Lp represents the local perception output, W is the weight function of the convolutional layers, BN represents batch normalization, and σ is the activation function.
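A minimal sketch of the LP stage, assuming one standard, one depthwise, and one pointwise convolution as the three convolutional layers, is given below.
```python
import torch
import torch.nn as nn

class LocalPerception(nn.Module):
    """Sketch of the LP stage of the Additive Block: stacked convolutions + BN + GELU.
    The exact kernel sizes of the three convolutions are assumptions."""
    def __init__(self, channels):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.Conv2d(channels, channels, 1, bias=False),
        )
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.GELU()

    def forward(self, x1):
        return self.act(self.bn(self.convs(x1)))   # Lp = sigma(BN(W(x1)))

if __name__ == "__main__":
    lp = LocalPerception(96)
    print(lp(torch.randn(1, 96, 20, 20)).shape)
```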
After the input features are sensed and extracted by the LP mechanism, they are passed to the CATM component. First, a batch normalization layer and independent linear transformations are used to split the input information into three parts: Q, K, and V, corresponding to the query, key, and value matrices, respectively. This process is demonstrated using the input feature x_2:
where Φ is the context mapping function, Ch represents the channel attention mechanism, Sp represents the spatial attention mechanism, and Ϻ is a mirror function.
Overall, CATM aims to capture global contextual information through its core additive attention mechanism. Combined with the results of the Token Mixer, the architecture achieves optimized attention at both the channel and spatial levels. This additive approach reduces computational overhead while maintaining the model’s ability to capture global dependencies, avoiding the quadratic complexity of matrix multiplication common in traditional self-attention mechanisms. The final output of CATM is as follows:
where Г is a linear transformation that integrates contextual information and includes the necessary information interaction.
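Because the exact forms of Φ, Ch, Sp, and Ϻ are defined by the original equations, the snippet below is only one plausible reading of the additive token mixer, with the channel and spatial gates implemented as small convolutional blocks; all operator choices here are assumptions.
```python
import torch
import torch.nn as nn

class ConvAdditiveTokenMixer(nn.Module):
    """Hedged sketch of CATM: Q, K, V come from 1x1 projections; the context combines a
    channel gate (Ch) and a spatial gate (Sp) additively instead of via QK^T, so the cost
    stays linear in the number of tokens."""
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels)
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.ch = nn.Sequential(nn.AdaptiveAvgPool2d(1),               # Ch: channel attention
                                nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.sp = nn.Sequential(nn.Conv2d(channels, 1, 3, padding=1),  # Sp: spatial attention
                                nn.Sigmoid())
        self.gamma = nn.Conv2d(channels, channels, 1)                  # Gamma: output interaction

    def forward(self, x2):
        x = self.norm(x2)
        q, k, v = self.q(x), self.k(x), self.v(x)
        context = self.ch(q) * q + self.sp(k) * k   # additive context instead of QK^T softmax
        return x2 + self.gamma(context * v)         # residual mixing of context with values

if __name__ == "__main__":
    catm = ConvAdditiveTokenMixer(96)
    print(catm(torch.randn(1, 96, 20, 20)).shape)
```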
After the input features are enhanced by CATM, they are processed by CGLU. CGLU combines depthwise convolution with a gating mechanism, enabling efficient feature filtering and spatial perception. A dual-branch structure consisting of a batch normalization layer, a standard convolutional layer, and a GELU activation function processes the spatial information, enhancing the model’s ability to capture fine-grained features. The final output of the CGLU module is the element-wise multiplication of the results of the two branches:
where W_b1 and W_b2 correspond to the weights on each branch, respectively. Building on a channel attention mechanism over neighboring image features, CGLU realizes the channel-mixer attention with fewer parameters and computations, thereby enhancing the robustness of the model.
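The gating idea can be sketched as follows, assuming a pointwise value branch and a gate branch built from a pointwise projection, a depthwise 3×3 convolution, and GELU; the expansion ratio and normalization placement are assumptions.
```python
import torch
import torch.nn as nn

class ConvGLU(nn.Module):
    """Sketch of the CGLU: one branch carries the features, the other produces a gate from a
    depthwise 3x3 convolution and GELU; the two are multiplied element-wise."""
    def __init__(self, channels, expansion=2):
        super().__init__()
        hidden = channels * expansion
        self.norm = nn.BatchNorm2d(channels)
        self.branch_value = nn.Conv2d(channels, hidden, 1)        # W_b1: value branch
        self.branch_gate = nn.Sequential(                         # W_b2: gating branch
            nn.Conv2d(channels, hidden, 1),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),
            nn.GELU(),
        )
        self.project = nn.Conv2d(hidden, channels, 1)

    def forward(self, x):
        y = self.norm(x)
        y = self.branch_value(y) * self.branch_gate(y)   # element-wise product of the two branches
        return x + self.project(y)

if __name__ == "__main__":
    cglu = ConvGLU(96)
    print(cglu(torch.randn(1, 96, 20, 20)).shape)
```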
3.4. Detail-Enhanced Convolution Detection Head
The head of the original YOLO12 is the final stage of the overall network structure, responsible for generating the final predictions for object detection and classification. Its core task is to process the feature maps passed from the neck, ultimately outputting bounding boxes and class labels for objects within the image. The head of YOLO12 refines features through further convolution operations and optimization modules, employing the C3k2 module to improve computational efficiency and feature representation. It also incorporates CBS (Convolution-Batch Norm-SiLU) to enhance the model’s stability and nonlinear representation capabilities.
However, the original head also has some shortcomings. For example, it suffers from low localization accuracy for small objects, possibly due to loss of detail during upsampling or feature fusion, leading to inaccurate bounding box predictions. It is also susceptible to background interference in dense scenes, increasing the risk of false or missed detections. Furthermore, the computational overhead for high-resolution inputs is high, limiting its suitability for extreme real-time scenarios. To address these challenges, the original YOLO12 head framework has been re-optimized, and a new DEC (Detail-Enhanced Convolution) detection head has been proposed. The structure of the DEC detection head is shown in Figure 5.
The design goal of the DEC detection head is to improve fine-grained feature extraction capabilities through a shared feature enhancement module, aiming to optimize the detection efficiency of small objects in complex scenes. DE Conv (Detail-Enhanced Convolution) is the core component of the DEC detection head and was first proposed by Chen et al. [36] in DEA-Net (Detail-Enhanced Attention Network). By enhancing the input features, detail information is better preserved. During the feature fusion stage, DE Conv shares weight information at different feature scales, strengthening feature consistency and reducing computational cost.
Specifically, DE Conv combines the feature extraction capabilities of standard convolution with a detail-enhancing normalization strategy. It uses a GN (Group Normalization) layer to group and normalize features, reducing the impact of statistical noise during small-batch training and avoiding the instability that BN (Batch Normalization) exhibits in that setting. DE Conv first reduces the dimensionality of an input feature x_3 and extracts spatial features:
where σ is the Sigmoid activation function, ensuring that the output range is between 0 and 1. The content-guided attention mechanism is then used to generate spatial weights for each position:
The weight distribution of the convolution kernel is then adjusted to better capture the details of the spatial features, and the weight distribution of the features is dynamically adjusted to adapt to local changes in the target. While improving the deep modeling capability, the grouping mechanism reduces computational redundancy and ensures the global representation capability of the features. The dynamic kernel generation process of DE Conv is as follows:
The end of the DEC detection head contains two submodules, Cls Conv (Classification Convolution) and Reg Conv (Regression Convolution), which are used for target category prediction and bounding box regression, respectively. Cls Conv maps the feature map to the category space and uses the Sigmoid activation function to generate the category probability of each anchor, while Reg Conv outputs the regression parameters of each anchor bounding box. The parameters in Cls Conv are also shared: whether the target is small, medium, or large, it is treated as the same classification learning task. The DEC detection head uses a separable loss function, including classification loss and regression loss, which can be expressed as follows:
L_Det = L_cls + λ · L_reg
where L_cls and L_reg represent the classification and regression losses, and λ is the corresponding weight hyperparameter. Furthermore, the shared parameters in Reg Conv need to account for the impact of inconsistent object scales, so the features are scaled through a Scale layer. The Scale operation is also shared, allowing the regression output to be adjusted to accommodate objects of varying scales.
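The shared classification and regression convolutions with per-level Scale layers can be sketched as follows; the channel count, number of classes, and number of feature levels are placeholder values.
```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """Learnable scalar that rescales the shared regression output for one feature level."""
    def __init__(self, init=1.0):
        super().__init__()
        self.s = nn.Parameter(torch.tensor(init))

    def forward(self, x):
        return x * self.s

class SharedDECHead(nn.Module):
    """Sketch of the shared Cls/Reg convolutions: the same weights serve every feature level,
    while a per-level Scale layer compensates for the different object scales."""
    def __init__(self, channels=128, num_classes=2, num_levels=3):
        super().__init__()
        self.cls_conv = nn.Conv2d(channels, num_classes, 1)   # shared across levels
        self.reg_conv = nn.Conv2d(channels, 4, 1)             # shared across levels
        self.scales = nn.ModuleList([Scale() for _ in range(num_levels)])

    def forward(self, feats):
        cls_outs = [self.cls_conv(f).sigmoid() for f in feats]
        reg_outs = [scale(self.reg_conv(f)) for f, scale in zip(feats, self.scales)]
        return cls_outs, reg_outs

if __name__ == "__main__":
    head = SharedDECHead()
    feats = [torch.randn(1, 128, s, s) for s in (80, 40, 20)]
    cls_outs, reg_outs = head(feats)
    print([c.shape for c in cls_outs], [r.shape for r in reg_outs])
```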
In the DEC detection head, the Detail-Enhanced Convolution module is further refined through the use of dilated convolutions and content-adaptive filter selection. Specifically, dilated convolutions expand the receptive field without increasing the number of parameters, enabling the network to capture broader contextual information around small objects while preserving fine-grained local features. Meanwhile, DE Conv dynamically adjusts its convolutional kernels based on the content-guided attention maps, allowing the network to focus on regions with high information density, such as edges, corners, or blurred object boundaries. This combination ensures that even small or partially occluded objects receive enhanced feature representation. In addition, DE Conv employs a set of multi-scale filters within each additive block, which are adaptively weighted through the spatial and channel attention mechanisms, allowing the head to selectively amplify relevant features while suppressing background noise. By integrating these mechanisms, the DEC detection head refines the localization of small objects and blurred boundaries, as the dynamically generated kernels emphasize critical spatial cues and context-dependent feature variations, reducing misalignment in bounding box predictions and improving both precision and recall in dense or cluttered scenes.
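As a hedged illustration of the content-guided weighting described above, the sketch below group-normalizes the input, derives a sigmoid spatial weight map from a reduced representation, and refines the re-weighted features with a 3×3 convolution; the reduction ratio, group count, and kernel sizes are assumptions, and the sketch omits the dilated and multi-scale filters of the full design.
```python
import torch
import torch.nn as nn

class DetailEnhancedConv(nn.Module):
    """Hedged sketch of the DE Conv idea: GN-normalised features are reduced by a 1x1
    convolution, a content-guided sigmoid map weights every spatial position, and the
    weighted features are refined by a 3x3 convolution."""
    def __init__(self, channels, reduction=4, groups=16):
        super().__init__()
        reduced = channels // reduction
        self.gn = nn.GroupNorm(groups, channels)
        self.reduce = nn.Conv2d(channels, reduced, 1)              # dimensionality reduction
        self.guide = nn.Sequential(                                # content-guided spatial weights
            nn.Conv2d(reduced, 1, 3, padding=1),
            nn.Sigmoid(),                                          # weights in (0, 1)
        )
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)  # detail-aware refinement

    def forward(self, x3):
        y = self.gn(x3)
        weights = self.guide(self.reduce(y))    # one weight per spatial position
        return self.refine(y * weights) + x3    # emphasise detailed regions, keep a residual path

if __name__ == "__main__":
    dec = DetailEnhancedConv(128)
    print(dec(torch.randn(1, 128, 40, 40)).shape)
```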
Overall, the DEC detection head achieves fine-grained enhancement of feature maps through dynamic receptive field generation and content-guided attention. Its core is to combine dynamic channel weights and spatial weights to generate highly adaptable convolution kernels to capture local details and global contextual information of the target. Through multi-scale feature fusion and refined processing, it effectively improves small object detection capabilities while ensuring accurate prediction of object categories and boundaries in complex scenes.
3.5. Double Self-Knowledge Distillation Strategy
To offset any accuracy loss caused by the lightweight improvements, this paper performs knowledge distillation on the model. This method was first proposed by Hinton et al. [37] and applied to classification tasks. Its principle is based on the transfer of knowledge between models: knowledge distillation usually trains a lightweight student model through additional supervision transferred from a large teacher model. The teacher model is a complex and accurate model that usually has a large number of parameters and a complex structure; it fits the target task by learning from a large amount of data during training and provides high-precision prediction results. The student model is a simplified model that usually has fewer parameters and a simpler structure. The goal is to distill knowledge from the teacher model and learn its predictive behavior, reducing model complexity while maintaining high performance.
Knowledge distillation-based methods can be broadly categorized into feature distillation and logit distillation, each focusing on a different level of information transfer. Logit distillation primarily focuses on the final output and category probability distribution of the teacher model, optimizing the student model’s decision-making by imitating these probabilities. However, because logit distillation only utilizes the teacher model’s final predictions and ignores its internal feature representations, this approach may not fully leverage the teacher model’s strengths in some cases, particularly when the student model’s capacity is small, making it difficult to fully replicate the teacher model’s learning capabilities.
In contrast, feature distillation focuses on the consistency of intermediate-level features between the teacher and student models, enabling the student model to directly learn the teacher model’s deep feature representations rather than just the final category output. Specifically, feature distillation brings the student model’s features closer to the teacher model’s through methods such as aligning feature maps or guiding the student model to learn the teacher model’s attention distribution. This allows the student model to better capture key features of the data, thereby improving representational capability and generalization performance. Because feature distillation provides richer information than logit distillation, it is particularly important in tasks requiring deep feature representation, such as object detection and image segmentation.
Compared to logit distillation, feature distillation has the advantage of not only improving the student model’s final predictions but also enhancing its feature learning at different levels, enabling it to maintain strong feature extraction capabilities with fewer parameters. This method is particularly suitable for training lightweight models, effectively compensating for the performance losses caused by insufficient model capacity and allowing small models to approach the performance of large models at a lower computational cost. Therefore, in tasks with high feature-learning requirements, such as computer vision, feature distillation has become an important means of improving model performance and is widely used in model optimization.
Based on the principle of feature distillation, Shu et al. [38] proposed a channel-wise knowledge distillation method. Building on this principle, this paper proposes a new DS (Double Self-Knowledge Distillation) strategy. Specifically, a model trained with YOLO12-N is defined as the student model, and a model trained with YOLO12-S is defined as the teacher model for the first knowledge distillation. Subsequently, the resulting model is defined as the student model again, and a widened version of this student model with double the number of channels is defined as the teacher model for the second knowledge distillation.
In order to better utilize the knowledge information in each channel of the model, we soft-align the activations of the corresponding channels between the teacher and student networks. We define the teacher model and student model as T and S, respectively, and the activation maps of T and S as y_T and y_S, respectively. The loss of this channel can be expressed as:
where n represents the channel index, i represents the spatial position within the channel, and T is a temperature hyper-parameter. This method normalizes the activation map of each channel in complex-scene object detection tasks. The activation of each channel tends to encode the salience of the scene category. For each channel, the DS distillation strategy guides the student network to focus on simulating regions with significant activation effects, thereby achieving more accurate localization in the object detection task.
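A sketch of the channel-wise alignment loss in the spirit of Shu et al. [38] is given below: each channel of the teacher and student activation maps is converted into a spatial probability distribution with a temperature-softened softmax, and the student is pulled toward the teacher with a KL divergence. The temperature value and the squared-temperature scaling follow common practice and are assumptions here.
```python
import torch
import torch.nn.functional as F

def channel_wise_distillation(y_student, y_teacher, tau=4.0):
    """Channel-wise distillation loss: per-channel spatial softmax with temperature,
    then KL divergence from student to teacher distributions."""
    b, c, h, w = y_student.shape
    s = y_student.view(b, c, h * w)                        # flatten spatial positions per channel
    t = y_teacher.view(b, c, h * w)
    log_p_s = F.log_softmax(s / tau, dim=-1)               # softened student distribution (log space)
    p_t = F.softmax(t / tau, dim=-1)                       # softened teacher distribution
    kl = F.kl_div(log_p_s, p_t, reduction="none").sum(-1)  # KL per channel and image
    return (tau ** 2) * kl.mean()                          # average over channels and batch

if __name__ == "__main__":
    ys = torch.randn(2, 64, 40, 40, requires_grad=True)
    yt = torch.randn(2, 64, 40, 40)
    loss = channel_wise_distillation(ys, yt)
    loss.backward()
    print(float(loss))
```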
The Double Self-Knowledge Distillation strategy uses two sequential rounds of distillation together with dual-path supervision in each round. In the first round a larger-capacity YOLO12-S teacher guides a compact YOLO12-N student to shape robust channel responses and coarse attention patterns. After that round the resulting student is distilled again from an amplified sibling model that has twice the number of channels, exposing richer, higher-order feature combinations without changing the student’s inference graph. Each round jointly optimizes the original detection objective together with logit-level distillation, intermediate feature alignment, and channel-wise alignment losses. In practice the channel-wise alignment is emphasized in the first round to establish stable salience, while the feature alignment is weighted more heavily in the second round to leverage the amplified teacher’s finer compositions. This progressive, dual-path design narrows the teacher–student capacity gap, stabilizes training, and lets the student first learn coarse salience and then absorb finer patterns, improving small-object localization and boundary sharpness without increasing inference cost.
Compared with single-stage distillation, the proposed two-stage approach reduces the capacity gap between teacher and student and produces more effective and stable supervision. In the first stage a strong but realistic teacher shapes robust channel responses and attention patterns in the lightweight student, which is an easier learning target than trying to mimic a very large model in one step. In the second stage the student benefits from a larger-capacity teacher that exposes richer, fine-grained feature combinations and higher-order channel relationships, effectively refining and expanding the student representations without changing its inference cost. This progressive learning strategy yields smoother optimization, faster convergence, and better generalization, while also acting as a regularizer that reduces overfitting and improves weed detection stability under varied field conditions.
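The two-round schedule can be summarised by the following hedged sketch, in which detection_loss, logit_kd, feature_kd, and channel_kd are placeholder callables standing in for the detection objective and the three distillation terms described above, and the per-round weights are illustrative assumptions rather than the exact values used in training.
```python
import torch

def distillation_round(student, teacher, batches, round_idx, optimizer,
                       detection_loss, logit_kd, feature_kd, channel_kd):
    """One round of the double self-knowledge distillation schedule (sketch).
    Round 1 emphasises channel-wise alignment; round 2 emphasises feature alignment."""
    w_channel, w_feature = (1.0, 0.5) if round_idx == 1 else (0.5, 1.0)
    teacher.eval()
    for images, targets in batches:
        with torch.no_grad():
            t_out, t_feats = teacher(images)          # frozen teacher predictions and features
        s_out, s_feats = student(images)
        loss = (detection_loss(s_out, targets)        # original detection objective
                + logit_kd(s_out, t_out)              # logit-level distillation
                + w_feature * feature_kd(s_feats, t_feats)
                + w_channel * channel_kd(s_feats, t_feats))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return student
```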