2.1. The DBCA-Net Algorithm Framework
Current cattle face segmentation and recognition algorithms often demonstrate high segmentation precision and recognition accuracy under controlled conditions. However, their performance degrades significantly in complex environments. For segmentation tasks, key limitations include insufficient extraction of fine-grained features and the influence of complex backgrounds. Interference from dense herds and environmental clutter frequently confounds features of cattle faces, hindering the network’s ability to precisely localize target regions. Moreover, the intricate textures and subtle differences along cattle face boundaries pose challenges for conventional segmentation methods, leading to a loss of detail in segmentation results.
In recognition tasks, performance is constrained by inadequate modeling of salient features and high inter-class feature similarity. Minimal differences between individual features make it difficult for existing algorithms to effectively distinguish key regions, reducing classification accuracy and robustness in complex scenarios. These challenges constitute significant bottlenecks in the practical application of current methods.
To overcome these issues, we propose a unified framework, DBCA-Net, specifically designed to optimize cattle face segmentation and recognition in complex environments. By integrating task-coordinated optimization and innovative modules, the framework establishes a dynamic feedback mechanism that significantly enhances segmentation precision and recognition accuracy in challenging conditions. The overall architecture of DBCA-Net is shown in Figure 1.
As depicted in the structural diagram of DBCA-Net, the framework achieves accurate cattle face segmentation and identity recognition through the integration of innovative modules and task-level collaborative optimization. The segmentation network employs a hybrid encoder–decoder architecture with multi-scale decoding capabilities, enhanced by the Fusion-Augmented Channel Attention (FACA) mechanism. This design strengthens fine-grained feature extraction, effectively addressing challenges such as blurry segmentation boundaries and the loss of intricate details.
In the decoding phase, the Adaptive Multi-Scale Attention Gate (AMAG) module is introduced to adaptively fuse local and global features. Additionally, a closed-loop feedback mechanism dynamically optimizes feature representations and parameter updates, significantly enhancing segmentation robustness in complex environments.
In the Multi-Scale Feature Decoder, different color coding is used to represent various feature maps. The light purple cube signifies the local feature map from the Hybrid Encoder, while the light blue cube represents the high-level semantic feature map from the Transformer module. The purple cube illustrates the output from the Adaptive Multi-Scale Attention Gate (AMAG) module. This output from AMAG is concatenated with the high-level semantic feature map from the Transformer module, forming a fused feature map. The concatenated feature map is then passed to the next stage of the Multi-Scale Feature Decoder, where further processing takes place.
For recognition, the network incorporates the GeLU-enhanced Partial Class Activation Attention (G-PCAA) module, which emphasizes salient region feature modeling to improve classification accuracy and stability. By leveraging the collaborative optimization of segmentation and recognition tasks, the DBCA-Net framework exhibits exceptional robustness and efficiency in addressing challenges posed by complex scenarios.
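To make the dual-branch collaboration concrete, the following PyTorch-style sketch outlines how the segmentation branch, the recognition branch, and the feedback signal could interact in a single forward pass. All class and variable names (SegmentationBranch, RecognitionBranch, feedback) are illustrative assumptions rather than the exact implementation, and the feedback definition is one plausible reading of the probability-difference signal described in Section 2.4.

import torch
import torch.nn as nn

class DBCANetSketch(nn.Module):
    """Illustrative dual-branch flow: segmentation -> recognition -> feedback."""
    def __init__(self, seg_branch: nn.Module, rec_branch: nn.Module):
        super().__init__()
        self.seg_branch = seg_branch   # hybrid encoder + multi-scale decoder (FACA, AMAG)
        self.rec_branch = rec_branch   # recognition backbone with G-PCAA

    def forward(self, image: torch.Tensor):
        # 1. Segment the cattle face region from the complex background.
        mask = self.seg_branch(image)                  # (B, 1, H, W) face probability map
        # 2. Recognize the individual on the masked face region.
        logits = self.rec_branch(image * mask)         # (B, num_ids)
        probs = logits.softmax(dim=-1)
        # 3. Feedback signal (assumed form): complement of the top-1/top-2
        #    probability margin, so a small value corresponds to a confident ID.
        top2 = probs.topk(2, dim=-1).values
        feedback = 1.0 - (top2[:, 0] - top2[:, 1]).mean()
        return mask, logits, feedback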
2.2. Cattle Face Segmentation Models in Complex Environments
2.2.1. Hybrid Encoder
The hybrid encoder in the original segmentation network integrates convolutional neural networks (CNNs) and Transformer modules, enabling the combination of local and global features for multi-scale feature modeling [36]. This design supports semantic information extraction and detail restoration in segmentation tasks. However, cattle face segmentation faces challenges due to the complexity of ranch environments, where fine-grained cattle face features often overlap with background elements, complicating segmentation.
Existing hybrid encoders, while effective in global feature modeling, often dilute fine-grained local details. Furthermore, traditional convolutional methods have limitations in modeling inter-channel correlations, reducing their ability to emphasize critical channels. These factors hinder the performance of hybrid encoders in noisy scenarios, necessitating improvements in fine-grained feature modeling and the integration of global and local features.
To address these issues, the Fusion-Augmented Channel Attention (FACA) mechanism is introduced at the final stage of convolutional feature extraction. FACA consists of a channel attention pathway and a local feature pathway. Global features are extracted using Global Average Pooling (GAP), and adaptive channel weights are generated via dynamic 1D convolution to enhance critical channels. Simultaneously, local fine-grained features are captured through convolution, incorporating detailed information from complex backgrounds.
FACA employs a dynamic weighting strategy to adaptively adjust the fusion of local and global features, allowing the model to suppress irrelevant information and emphasize key regions. A lightweight convolution further refines the fused features, improving computational efficiency. This integration enhances segmentation accuracy and robustness for cattle face regions. The structure of the FACA mechanism is shown in Figure 2.
The overall design of the FACA module consists of four main components: input preprocessing, channel attention pathway, local feature pathway, and feature fusion. The detailed implementation steps are as follows:
Output feature maps generated by the original hybrid encoder undergo input preprocessing, during which standard convolutional operations are applied to produce the initial feature representations.
The channel attention pathway is first computed for the feature representations by applying Global Average Pooling (GAP) to extract the global features, as described in Equation (1).
The pooled result represents the compression of the feature space along the channel dimension, allowing the model to concentrate on the global characteristics of each individual channel.
Subsequently, an adaptive function is used to select the size of the convolutional kernel. A convolution operation is then applied to the pooled global features, as described in Equation (2), to capture the inter-channel correlations within the feature map. The resulting weights are further processed using the sigmoid activation function, yielding the channel attention weights, whose specific calculation is provided in Equation (3).
These channel attention weights, computed through the channel attention pathway, are dynamically adjusted to emphasize the importance of each feature map channel, prioritizing channels relevant to the cattle face segmentation task while suppressing irrelevant or redundant ones. The obtained channel attention weights are then applied to the feature representation, recalibrating the original features and producing the output of the channel attention pathway.
The feature representation is further refined through the local feature pathway. By applying convolution with same padding, the spatial dimensions are preserved, producing the output of the local feature pathway. This pathway is specifically designed to capture fine-grained textures and boundary details within the cattle face region, effectively addressing the global channel attention pathway’s limitations in modeling local information. By complementing each other, the local and global features enhance the overall feature representation, achieving a balance between expressive power and computational efficiency.
Subsequently, the outputs from the channel attention pathway and the local feature pathway are fused using the operation defined in Equation (5). This fusion effectively integrates global contextual information with preserved local spatial features, resulting in a refined feature map. To restore the original channel dimension of the feature representation, a convolution is applied, producing the final output. This final representation corresponds to a linear combination or redistribution of the original features across channels, selectively emphasizing critical channels while suppressing redundant or irrelevant information.
Here, the two fusion coefficients are learnable weights that are dynamically updated during the model’s training process.
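A minimal PyTorch-style sketch of FACA under these definitions is given below. The adaptive 1D kernel-size rule, the 3 × 3 local convolution, and the parameter names alpha and beta are illustrative assumptions; only the GAP, dynamic 1D convolution, sigmoid recalibration, weighted fusion, and channel-restoring convolution follow the description above.

import math
import torch
import torch.nn as nn

class FACA(nn.Module):
    """Sketch of Fusion-Augmented Channel Attention: GAP + adaptive 1D conv
    channel weights, a convolutional local pathway, and learnable fusion."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # Adaptive kernel size for the 1D convolution (assumed heuristic).
        k = int(abs((math.log2(channels) + b) / gamma))
        k = k if k % 2 else k + 1
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.local = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # same padding
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)              # restore channels
        # Learnable fusion weights, dynamically updated during training (Section 2.4.1).
        self.alpha = nn.Parameter(torch.tensor(0.5))
        self.beta = nn.Parameter(torch.tensor(0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention pathway: GAP -> 1D conv -> sigmoid -> recalibrate.
        g = x.mean(dim=(2, 3))                                     # Eq. (1)-style GAP, (B, C)
        w = torch.sigmoid(self.conv1d(g.unsqueeze(1))).squeeze(1)  # channel attention weights
        x_chan = x * w.unsqueeze(-1).unsqueeze(-1)
        # Local feature pathway: fine-grained textures and boundaries.
        x_local = self.local(x)
        # Dynamic weighted fusion followed by a lightweight channel-restoring convolution.
        return self.fuse(self.alpha * x_chan + self.beta * x_local)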
Employing an adaptive fusion strategy, the FACA mechanism balances the contributions of global channel attention and local detail features across scenarios of varying complexity. By effectively coordinating the importance of global and local information, the network dynamically adjusts its focus based on scene-specific characteristics. Global contextual information is extracted through the channel attention pathway using Global Average Pooling (GAP), enhancing the ability to capture overall features of the cattle face region. In contrast, the local feature pathway utilizes convolutional operations to capture fine-grained local features, reinforcing the representation of boundary and texture details. To optimize feature representation, a dynamic weighting strategy adjusts the fusion ratio between global channel attention and the local feature pathway according to the salience of global and local features in a given scene, ensuring a balanced and adaptive feature expression.
When cattle face images are situated in complex-background scenarios, such as environments with group occlusions or significant texture interference, the module assigns greater weight to the local feature pathway. This adjustment allows the network to prioritize the detailed boundaries of the cattle face region, thereby improving its ability to capture fine-grained local features. In contrast, for scenarios that emphasize global semantic information—such as under conditions of pronounced lighting variations or substantial background interference—the module increases the weight of the channel attention pathway. This adjustment enhances the network’s understanding of global features, facilitating the separation of cattle face regions from similar background features. By leveraging these dynamic adjustments, the FACA mechanism significantly enhances the network’s adaptability to complex backgrounds, addressing the limitations of traditional feature modeling methods in responding to diverse scenarios. As a result, it improves both the accuracy and robustness of cattle face segmentation tasks.
2.2.2. Multi-Scale Feature Decoder
Demonstrating exceptional performance in common medical segmentation tasks, the multi-scale decoder of the original segmentation network achieves multi-level fusion of high- and low-level features through progressive upsampling combined with skip connections. The spatial resolution of feature maps is restored using transposed convolutions or bilinear interpolation, ensuring that the output matches the input size. Transformer modules are incorporated to capture global contextual information, enhancing segmentation capabilities for complex structures, while the fusion of high- and low-resolution features improves boundary prediction accuracy.
For fine-grained segmentation tasks involving cattle faces in complex backgrounds, several limitations hinder the performance of the original decoder. High-resolution features often become dominated by low-resolution ones during repeated upsampling and feature fusion, making it difficult to preserve fine details such as hair textures and boundaries. While Transformer modules effectively model global contextual semantics, their ability to capture local feature details remains weak, leading to insufficient modeling of boundaries and texture details and causing an imbalance between global and local information. Additionally, edges and textures of cattle faces are easily confused with complex backgrounds, reducing the decoder’s focus on target regions and making it challenging to suppress background interference. These issues are further exacerbated by the decoder’s skip connections, which rely on simple concatenation of high- and low-level features without fully leveraging their complementary relationships, resulting in inefficient feature fusion. Collectively, these challenges limit the fine-grained segmentation performance of the original TransUNet in complex scenarios.
Addressing the limitations of the original segmentation network decoder in fine-grained cattle face segmentation, this paper introduces an improved module—AMAG (Adaptive Multi-Scale Attention Gate). Embedded into the decoder’s feature fusion process, AMAG strengthens multi-scale feature fusion capabilities to tackle challenges such as insufficient fine-grained feature representation, imbalance between local and global information, background interference, and inefficient feature integration.
The AMAG module integrates spatial attention and multi-scale feature aggregation to effectively filter out background noise while enhancing relevant features. To achieve matching between feature maps of different dimensions, the module uses pointwise convolutions to align the feature maps. Within the decoder, AMAG operates in two steps: first, it fuses downsampled high-level semantic features with local semantic features from the hybrid encoder. Then, it concatenates the fused feature maps with the originally input downsampled high-level features to produce the final output, which is passed to the next stage of the multi-scale decoder. The AMAG module operates in parallel, processing multiple feature maps simultaneously, allowing for efficient interaction with the segmentation head and refinement of the segmentation output.
In fine-grained cattle face segmentation tasks, AMAG enhances the capture of details such as hair textures and boundaries through a multi-scale feature fusion mechanism. A dynamic attention mechanism ensures a balanced contribution of local and global information, achieving a harmonious integration of boundary details and overall structural information. Furthermore, a spatial selection mechanism emphasizes key regions of cattle faces, effectively suppressing interference from complex backgrounds. By incorporating AMAG, the segmentation accuracy and robustness of the original decoder are significantly improved in challenging background conditions, providing reliable support for high-precision fine-grained cattle face segmentation tasks. The structural design of AMAG (Adaptive Multi-Scale Attention Gate) is illustrated in Figure 3.
The AMAG (Adaptive Multi-Scale Attention Gate) module is designed to enhance the original decoder’s ability to extract fine-grained features and to improve the multi-scale fusion of local and global information. The module is composed of four key components: Multi-Scale Feature Fusion, Spatial Selection, Dynamic Spatial Fusion, and Recalibration.
Multi-Scale Feature Fusion leverages Depthwise Convolution (DW) to extract fine-grained features from the local patterns identified by the convolutional neural network. To broaden the receptive field and capture richer contextual information, Dilated Convolution (DW-D) is incorporated. Pointwise Convolution (PW) subsequently integrates multi-channel local features, enhancing the capacity to represent intricate fine-grained details, as shown in Equation (6).
Here, the input to Equation (6) is the local feature map fed into the AMAG module.
For the global features output by the Transformer module, channel-wise max pooling and average pooling are employed to extract critical semantic information from the global semantic features. The extracted semantic information is further refined using Pointwise Convolution (PW) to ensure dimensional consistency between the global semantic features and the local features, as illustrated in Equation (7).
Here, the input to Equation (7) is the global semantic feature map fed into the AMAG module.
Subsequently, the transformed local feature is added to the transformed global feature, resulting in the final fused feature for this module, as illustrated in Equation (8).
Spatial Selection applies dynamic weight assignment to the channels of the feature maps fused in the first part, directing attention to the most critical feature channels while minimizing the impact of redundant or irrelevant ones. By emphasizing fine-grained features of cattle faces, this strategy not only enhances feature representation but also effectively suppresses interference from background noise, as demonstrated in Equations (9)–(12).
The fused feature undergoes a Softmax operation, after which the Spatial Selection module dynamically adjusts the channel-wise weights of the feature map. These weights are then applied to the fused feature through element-wise multiplication, followed by the addition of the original local and global features input into the AMAG module. This process highlights the critical fine-grained feature channels of cattle faces while effectively suppressing irrelevant and redundant information from other channels.
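The first two AMAG stages can be sketched as follows in PyTorch. The kernel sizes, dilation rate, intermediate channel width, and the residual re-injection of the aligned local and global features are assumptions; the depthwise, dilated, and pointwise convolutions, channel-wise max/average pooling, additive fusion, and Softmax-based spatial selection mirror the description of Equations (6)–(12) above.

import torch
import torch.nn as nn

class AMAGFusionSelection(nn.Module):
    """Sketch of AMAG's Multi-Scale Feature Fusion and Spatial Selection stages."""
    def __init__(self, c_local: int, c_mid: int):
        super().__init__()
        # Local branch: depthwise, dilated depthwise, then pointwise (Eq. (6)-style).
        self.dw = nn.Conv2d(c_local, c_local, 3, padding=1, groups=c_local)
        self.dwd = nn.Conv2d(c_local, c_local, 3, padding=2, dilation=2, groups=c_local)
        self.pw_local = nn.Conv2d(c_local, c_mid, 1)
        # Global branch: channel-wise max/avg pooling refined by a PW conv (Eq. (7)-style).
        self.pw_global = nn.Conv2d(2, c_mid, 1)

    def forward(self, f_local: torch.Tensor, f_global: torch.Tensor):
        # Assumes f_local and f_global are already spatially aligned.
        local = self.pw_local(self.dwd(self.dw(f_local)))
        stats = torch.cat([f_global.amax(dim=1, keepdim=True),
                           f_global.mean(dim=1, keepdim=True)], dim=1)
        glob = self.pw_global(stats)
        fused = local + glob                       # Eq. (8)-style additive fusion
        weights = fused.softmax(dim=1)             # channel-wise Softmax weights
        local_sel = fused * weights + local        # re-inject aligned local features (assumed)
        glob_sel = fused * weights + glob          # re-inject aligned global features (assumed)
        return local_sel, glob_sel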
Dynamic Spatial Fusion further enhances the interaction between the optimized local feature and the global contextual feature, both derived from the Spatial Selection module. This module employs a dynamic weight allocation mechanism to adjust the weights of local and global features. Local features are enriched with higher-level contextual information, while global features are refined to emphasize fine-grained local details. The bidirectional interaction and mutual enhancement between local and global features result in a cohesive representation, effectively integrating fine-grained cattle face details with global contextual information. This integrated feature representation significantly improves the ability to capture fine-grained details within the global context. The evolution of the weight matrix is detailed in Equations (12) and (13).
Global average pooling is applied to the optimized local and global features obtained from the second stage, followed by a linear transformation. A Sigmoid activation function is subsequently employed to calculate the dynamic weight matrices for the local and global features. The learnable projection matrices in this transformation dynamically evolve with iterative updates during training. Using these weight matrices, an element-wise weighted fusion of the optimized local and global features is performed, producing an enhanced and interactively fused feature representation, as detailed in Equation (14).
Further extraction of contextual spatial information is performed on the cross-enhanced features, followed by restoring the original number of channels. This process produces the interaction-enhanced feature map, as described in Equation (15).
Leveraging the dynamic spatial interaction mechanism, the AMAG module enhances the contribution of fine-grained facial features to global contextual understanding while simultaneously strengthening the global context’s interpretation of local facial details. This reciprocal enhancement improves segmentation consistency, effectively unifying local and global feature interactions.
Recalibration starts by applying Pointwise Convolution (PW) to the fused feature from the third stage, further refining spatial information across channels. A Sigmoid activation function is then employed to generate attention weights, which emphasize critical features while suppressing redundant ones. These attention weights are subsequently applied to recalibrate the original local feature. After recalibration, another Pointwise Convolution (PW) is performed on the recalibrated feature to adjust its dimensionality and further enhance its representation. This process ensures that the recalibrated feature is better optimized for downstream tasks and enables the multi-scale decoder to fully leverage the fine-grained multi-scale feature information provided by the AMAG module. The methodology is detailed in Equation (16).
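Under the same assumptions, the Dynamic Spatial Fusion and Recalibration stages can be sketched as follows; the projection and convolution shapes are illustrative, while the GAP, linear transformation, Sigmoid weight generation, element-wise cross-weighted fusion, and Sigmoid-gated recalibration of the original local feature mirror the description of Equations (12)–(16).

import torch
import torch.nn as nn

class AMAGInteractionRecalibration(nn.Module):
    """Sketch of AMAG's Dynamic Spatial Fusion and Recalibration stages."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj_l = nn.Linear(channels, channels)   # learnable projection for local weights
        self.proj_g = nn.Linear(channels, channels)   # learnable projection for global weights
        self.context = nn.Conv2d(channels, channels, 3, padding=1)  # contextual spatial refinement
        self.pw_attn = nn.Conv2d(channels, channels, 1)
        self.pw_out = nn.Conv2d(channels, channels, 1)

    def forward(self, local_sel, glob_sel, local_orig):
        # Dynamic weights: GAP -> linear -> Sigmoid (Eq. (12)-(13)-style).
        w_l = torch.sigmoid(self.proj_l(local_sel.mean(dim=(2, 3))))[..., None, None]
        w_g = torch.sigmoid(self.proj_g(glob_sel.mean(dim=(2, 3))))[..., None, None]
        # Element-wise weighted cross-fusion of local and global features (Eq. (14)-style).
        cross = w_l * local_sel + w_g * glob_sel
        fused = self.context(cross)                    # Eq. (15)-style contextual extraction
        # Recalibration: PW conv -> Sigmoid attention -> reweight original local feature (Eq. (16)).
        attn = torch.sigmoid(self.pw_attn(fused))
        return self.pw_out(local_orig * attn)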
To enhance the decoder’s feature extraction capabilities in complex scenarios, the Adaptive Multi-Scale Attention Gate (AMAG) module is incorporated into the decoder of the TransUNet segmentation network. By integrating mechanisms such as multi-scale feature fusion, spatial selection, dynamic spatial interaction, and recalibration, AMAG effectively addresses the challenge of insufficient fusion between fine-grained local features and global contextual information, significantly boosting the model’s performance in fine-grained cattle face segmentation tasks. Multi-Scale Feature Fusion and Spatial Selection focus on capturing detailed local textures and global semantic information while dynamically highlighting critical regions and suppressing background interference, thereby improving the model’s attention to facial textures and boundary details. Dynamic Spatial Fusion and Recalibration further enable bidirectional interaction and optimization between local and global features, ensuring a cohesive and efficient feature integration. These mechanisms collectively provide the decoder with precise and robust multi-scale feature representations, enhancing its ability to handle complex segmentation tasks.
An improved multi-scale decoder utilizes the dynamic adjustment mechanism of the AMAG module to effectively balance the fusion of fine-grained local features and global semantic information in complex scenarios. When detailed features are critical, the module prioritizes attention to local boundaries and textures; when semantic consistency takes precedence, it enhances the modeling of global contextual information. This adaptive adjustment mechanism significantly enhances the network’s ability to adapt to diverse and challenging environments, addressing the shortcomings of traditional segmentation methods in handling complex backgrounds. As a result, it substantially improves both the performance and accuracy of cattle face segmentation tasks.
2.3. Inter-Class Feature Enhancement Model for Cattle Face Recognition
Recognition networks currently employed for cattle face recognition tasks adopt a hierarchical feature extraction strategy to progressively capture multi-scale features while integrating local feature modeling with global semantic information. These networks exhibit a certain degree of robustness in handling complex backgrounds. By leveraging their multi-layer architecture, they effectively extract fine-grained features, and the incorporation of global semantic enhancement modules enables the modeling of long-range dependencies, providing strong support for cattle face recognition. However, limitations remain in specific stages of the recognition process, particularly during the Patch Partition stage. At this stage, feature extraction relies solely on basic patch-based operations, lacking targeted modeling for category-salient regions. This shortcoming leads to inadequate feature extraction from cattle face regions. Furthermore, the networks struggle with interactive fusion of local details and global semantic information, as they overly depend on local window features. This reliance hinders the effective integration of local details with global context in complex backgrounds, ultimately restricting further improvements in recognition performance.
To address the limitations of existing recognition networks in category saliency modeling and the interaction between local and global features, the G-PCAA (GeLU-enhanced Partial Class Activation Attention) mechanism is introduced into the feature extraction process following the Patch Partition stage. This attention mechanism substantially improves the network’s ability to focus on critical regions of cattle faces, while optimizing the feature representation of salient regions. Additionally, it ensures the integrity and semantic consistency of inter-class regional features in complex scenarios.
Figure 4 presents the architecture of the inter-class feature enhancement model designed for cattle face recognition.
G-PCAA module design comprises four components: Partial CAM, local class center generation, global class representation generation, and feature enhancement and aggregation [37]. To further optimize inter-class feature representation, GeLU (Gaussian Error Linear Unit) activation is introduced between the output of Partial CAM and the local class center generation stage. By smoothing the activation of generated saliency maps, GeLU enhances the flexibility of nonlinear representation during the class center generation phase, enabling features to adapt more effectively to distribution variations in different scenarios. Additionally, the smoothing of gradient flow improves the optimization stability of the entire module. These improvements significantly enhance the flexibility and expressiveness of feature modeling, providing robust support for fine-grained feature extraction and inter-class feature enhancement in cattle face recognition tasks. The structure of G-PCAA is shown in Figure 5.
Partial CAM processes the input features, derived from the output of the Patch Partition stage, to enhance the network’s attention to task-relevant category regions. This module focuses on generating class-specific activation feature maps that highlight critical areas for the recognition task. Specifically, by applying Pointwise Convolution (PW) to the input features, the network produces the class-specific activation feature maps, as shown in Equation (17), where the number of output channels corresponds to the number of task-specific categories.
Adaptive average pooling is subsequently applied to the class-specific activation feature maps to perform block-level processing. The GeLU activation function is then introduced to further enhance the feature representation by generating sub-block class-specific activation feature maps, improving the flexibility and effectiveness of feature modeling. This process is formulated in Equation (18).
In this context, a scaling factor determines the sub-block size relative to the feature dimensions: adaptive average pooling partitions the input features into a grid of sub-blocks, each covering an equal portion of the spatial extent.
The GeLU activation function enhances the nonlinear representational capacity of the activation process, mitigating gradient saturation issues and improving the precision of class-specific activation feature maps within each sub-block. This enables the network to capture fine-grained features of cattle faces more effectively. To assign class labels to each sub-block feature map, max pooling is applied across sub-blocks, as expressed in Equation (19):
where the pooled per-sub-block result shares its class dimensionality with the original one-hot encoded task-specific class labels. By combining the sub-block class-specific activation feature maps with their corresponding class labels, this method establishes an effective association between features and their respective class labels. This improves the model’s ability to focus on fine-grained cattle face features, thereby enhancing the overall performance of the recognition network.
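A compact PyTorch-style sketch of the Partial CAM stage is given below. The sub-block grid size s and the interpretation of the max-pooling step (taking the maximum over the class dimension within each sub-block) are assumptions; the pointwise convolution, adaptive average pooling, and GeLU smoothing follow Equations (17)–(19) as described.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialCAM(nn.Module):
    """Sketch of G-PCAA's Partial CAM: class activation maps, sub-block pooling,
    GeLU smoothing, and per-sub-block class assignment."""
    def __init__(self, in_channels: int, num_classes: int, s: int = 4):
        super().__init__()
        self.cam = nn.Conv2d(in_channels, num_classes, kernel_size=1)  # Eq. (17)-style PW conv
        self.s = s                                                     # sub-block grid size (assumed)

    def forward(self, x: torch.Tensor):
        cam = self.cam(x)                               # (B, K, H, W) class activation maps
        blocks = F.adaptive_avg_pool2d(cam, self.s)     # (B, K, s, s) sub-block activations
        blocks = F.gelu(blocks)                         # Eq. (18)-style GeLU smoothing
        # Per-sub-block class scores and labels via max over classes (Eq. (19)-style, assumed).
        scores, labels = blocks.max(dim=1)              # (B, s, s) each
        return cam, blocks, scores, labels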
Patch Partition in the original recognition model performs only basic image segmentation, lacking focus on category-salient regions and limiting the effectiveness of extracting key features from cattle faces. Partial CAM addresses this limitation by significantly enhancing attention to category-relevant regions through localized saliency modeling. Furthermore, GeLU improves the dynamic nonlinear modeling of features, ensuring precise and high-quality representation of critical cattle face regions.
Local Class Center generation processes the previously obtained sub-block class-specific activation maps and the input feature map by applying split and flatten operations to each sub-block. The flattened outputs contain one entry per pixel, so their length equals the total number of pixels in each sub-block.
The GeLU-activated sub-block class-specific attention weights are also processed through split and flatten operations, with each weight indicating the attention strength corresponding to each class. During the calculation of the Local Class Center, element-wise multiplication and dimensional adjustment are applied to align matrix dimensions, and the Local Class Center of each sub-block is computed as a Softmax-normalized weighted aggregation, representing the feature representation of the sub-block for each class.
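The Local Class Center computation for a single sub-block can be sketched as a Softmax-normalized, attention-weighted aggregation of the flattened pixel features; the tensor shapes below are assumptions consistent with the description above.

import torch

def local_class_centers(block_feats: torch.Tensor,
                        block_attn: torch.Tensor) -> torch.Tensor:
    """Sketch of Local Class Center generation for one sub-block.

    block_feats: (N, D)  flattened pixel features of the sub-block
    block_attn:  (N, K)  GeLU-activated class-specific attention per pixel
    Returns:     (K, D)  one class center per class (assumed reading of the equation).
    """
    weights = block_attn.softmax(dim=0)            # normalize attention over pixels per class
    return weights.transpose(0, 1) @ block_feats   # (K, N) @ (N, D) -> (K, D)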
Addressing the limitations of the original network in constructing fine-grained local features, the Local Class Center generation module utilizes GeLU to adapt the class centers to dynamic variations in feature distributions, improving the quality of class center representations and their scalability across diverse scenarios. The module effectively strengthens feature modeling for cattle faces in complex backgrounds, providing precise class-aware semantic representations for subsequent global feature integration.
Global Class Representations, the third component, use each Local Class Center as the fundamental unit for subsequent convolution and linear operations. This process generates an enhanced local feature for each sub-block by integrating relational information across local class centers. By propagating this information, the operation refines the representation of class-specific channels within each sub-block while simultaneously enriching the local class center with hierarchical and contextual information from neighboring sub-blocks. Consequently, each sub-block not only reflects semantic details from its own region but also incorporates broader contextual dependencies, enhancing the granularity and expressiveness of class-specific feature representations.
After obtaining the enhanced local class centers, convolution and linear transformations are applied to them. These transformed features are then weighted by learnable coefficients, and the global class center, which represents the global class center features for each class, is computed from the weighted features.
The original recognition model faced challenges in effectively integrating local and global features. Global Class Representations resolve this issue by consolidating local class centers, ensuring consistency and integrity in global feature representations. The introduction of GeLU further enhances the adaptability and flexibility of feature expressions, allowing the model to capture richer and more precise semantic representations of cattle face features. This improvement provides a robust foundation for higher-level recognition tasks. The combined structure of the second and third components is depicted in Figure 6.
Feature Enhancement and Aggregation, the fourth component, processes the global class center features and the enhanced local class center representations. The input features of each sub-block are first partitioned, and the relationship between each image pixel and its assigned class label is then calculated, producing an attention map that represents the attention score for each class within the sub-block, as shown in Equation (23):
Here, the attention map for each sub-block is obtained by applying a Softmax operation along the class dimension, and two learnable linear projection matrices map the sub-block features and the class center representations into the query and key spaces, respectively.
The resulting attention map is then used to query the global class center, which is projected into the value space by a learnable linear projection matrix, and the final output for each sub-block is computed through matrix multiplication, as shown in Equation (24). Finally, the outputs of all sub-blocks are combined and reshaped to match the original input dimensions, ensuring dimensional consistency for the G-PCAA module’s output.
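The feature enhancement and aggregation step can be sketched for one sub-block as class-level attention, with pixel features as queries, class center representations as keys, and global class centers as values; the exact choice of query and key inputs is an assumption based on the description of Equations (23) and (24).

import torch
import torch.nn as nn

class ClassAttentionAggregation(nn.Module):
    """Sketch of G-PCAA's feature enhancement and aggregation for one sub-block."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)   # projects pixel features to queries
        self.k = nn.Linear(dim, dim, bias=False)   # projects class centers to keys
        self.v = nn.Linear(dim, dim, bias=False)   # projects global class centers to values

    def forward(self, pixels, local_centers, global_centers):
        # pixels: (N, D), local_centers: (K, D), global_centers: (K, D)
        attn = (self.q(pixels) @ self.k(local_centers).T).softmax(dim=-1)  # (N, K) class scores
        return attn @ self.v(global_centers)                               # (N, D) enhanced pixels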
The G-PCAA attention mechanism effectively addresses the limitations of the Patch Partition stage in the original recognition network, which lacked focus on salient regions. By modeling local class-specific features and leveraging GeLU activation, the module smooths gradient flows, stabilizes training, and enhances the nonlinear representation capacity. This allows the model to capture critical semantic information in cattle face regions, particularly in areas of high relevance, improving both class-specific attention and the integration of contextual dependencies.
2.4. Iterative Optimization of Dynamic Parameters and Weight Matrices
In fine-grained segmentation tasks for cattle faces, the feature representation capability of the segmentation network is critical to achieving robust performance in complex-background scenarios. To enhance the network’s ability to model local details and global semantics in a unified manner, a feedback-based iterative method for optimizing dynamic parameters and weight matrices is proposed, extending the existing dual-branch network. By leveraging predicted class probabilities of cattle IDs from the recognition network, this method dynamically adjusts the parameters of the FACA (Fusion-Augmented Channel Attention) mechanism and allocates weight matrices for the AMAG (Adaptive Multi-Scale Attention Gate) module in real time. These adjustments significantly strengthen the network’s ability to capture fine-grained features in cattle face regions and improve the integration of global contextual information.
2.4.1. Dynamic Parameter Tuning in the FACA Module
The FACA mechanism is designed to capture fine-grained and boundary features in cattle face regions, addressing limitations in conventional global–local interaction models that inadequately focus on local details. Dynamic adjustment of the two fusion parameters lies at the core of FACA, enabling effective weight distribution along the local feature pathway. This adjustment enhances the capture of fine-grained and boundary features while maintaining robust interactions between global and local features. The dynamic parameter updates are calculated iteratively through Equations (25) and (26):
Here, the feedback signal is derived from the difference in predicted probabilities of cattle ID classes output by the recognition network and balances classification confidence with segmentation performance. Two feedback scaling factors control the influence of the feedback signal on the dynamic parameters, ensuring adaptive optimization based on contextual demands.
During the optimization process, the scaling factors are designed to decay progressively with training iterations, preventing parameter oscillations caused by excessive reliance on the feedback signal. The decay process is governed by Equations (27) and (28):
Here, the initial scaling factors are both set to 0.5, and the decay rate is set to 0.05. A smaller feedback signal indicates higher confidence from the recognition network in its current classification output, prompting FACA to prioritize capturing fine-grained local features. Conversely, a larger feedback signal shifts FACA’s focus toward enhancing attention on critical global features, ensuring a balanced emphasis on both local details and global context.
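A sketch of this feedback-driven update is shown below. The initial scaling factors (0.5) and the decay rate (0.05) follow the text, while the reciprocal decay schedule and the multiplicative update form are assumptions standing in for Equations (25)–(28).

def update_faca_params(alpha: float, beta: float, delta: float, step: int,
                       lam0: float = 0.5, mu0: float = 0.5, decay: float = 0.05):
    """Sketch of the feedback update for FACA's two fusion parameters."""
    lam = lam0 / (1.0 + decay * step)   # decaying scaling factor (assumed schedule)
    mu = mu0 / (1.0 + decay * step)
    # Larger feedback (lower recognition confidence) shifts emphasis toward the
    # global channel-attention pathway; smaller feedback favors the local pathway.
    alpha_new = alpha * (1.0 + lam * delta)   # weight of the channel-attention pathway
    beta_new = beta * (1.0 - mu * delta)      # weight of the local feature pathway
    return alpha_new, beta_new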
2.4.2. Iterative Optimization of Weight Matrices in the AMAG Module
The AMAG module employs a dynamic spatial interaction mechanism to enhance the integration and interaction between local features and global contextual information. This mechanism facilitates bidirectional interaction and complementation, significantly improving the decoder’s ability to fuse multi-scale features. By dynamically adjusting the weight matrices of the local and global branches, the module optimizes the multi-scale relationships between local and global features. The weight matrix adjustments are governed by Equations (29) and (30):
Here, the feedback scaling factors decay progressively over training iterations, as described by Equations (31) and (32):
In these equations, the initial feedback scaling factors are both set to 0.5, and the decay rate is set to 0.05. When the feedback signal is relatively small, the network emphasizes refining local features. Conversely, when the feedback signal is large, the module shifts its focus to enhancing attention on critical global features, achieving a balanced integration of multi-scale information for improved feature representation.
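The corresponding update of the AMAG weight matrices can be sketched analogously; as above, the decay schedule and the multiplicative form are assumptions standing in for Equations (29)–(32).

import torch

@torch.no_grad()
def update_amag_weights(W_l: torch.Tensor, W_g: torch.Tensor,
                        delta: float, step: int,
                        eta0: float = 0.5, decay: float = 0.05):
    """Sketch of the feedback update for the AMAG local/global weight matrices."""
    eta = eta0 / (1.0 + decay * step)     # decaying feedback scaling factor (assumed schedule)
    W_l.mul_(1.0 - eta * delta)           # smaller feedback keeps emphasis on local refinement
    W_g.mul_(1.0 + eta * delta)           # larger feedback boosts attention to global features
    return W_l, W_g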
Establishing a feedback mechanism enables the AMAG module to dynamically adjust based on the confidence level of the feedback signal from the recognition network. When the feedback signal is high, the module reduces the influence of multi-scale feature fusion, preventing excessive reliance on local features and mitigating the risk of overfitting. Conversely, when the feedback signal is low, the module strengthens the interaction between local and global features, enhancing the representation of fine-grained details.
Leveraging the predicted probabilities from the recognition network as feedback signals allows for dynamic adjustments to the parameters of the FACA module and the weight matrices of the AMAG module. This approach enables the segmentation network to continuously refine its feature extraction strategies based on the recognition network’s outputs, improving its ability to focus on and represent fine-grained information. Furthermore, this mechanism enhances DBCA-Net’s capacity to integrate global contextual information with local details, resulting in improved performance in fine-grained segmentation tasks.