2.1. Dataset
2.1.1. Data Acquisition
This study was conducted at a large-scale breeding pig farm located near Hohhot City in the Inner Mongolia Autonomous Region of China. The core facilities of the farm include eight replacement gilt houses (64 pens in total), four gestation houses (with more than 4600 gestation stalls), and 17 farrowing houses (952 farrowing crates), collectively housing 5500 breeding pigs. Replacement gilts are housed in spacious pens, each containing about 35 pigs. Because these animals interact frequently and are highly active, ear tag loss was particularly prevalent among them. For data collection, a cloud–edge dual-view system was installed in three replacement gilt pens. The system consisted of video acquisition terminals, edge computing devices in the machine room, and a cloud storage server. The overall data acquisition workflow is illustrated in
Figure 1.
(1) Video acquisition terminal: Each pilot gilt pen measures 5.3 × 6.0 m and houses 35 breeding pigs, including 5 without ear tags and 30 with ear tags. Two DS-2PT7D20IW-DE dome cameras (Hikvision, Hangzhou, China) were installed in each pen. The cameras provide a resolution of 1920 × 1080 pixels at 25 frames per second. One camera was ceiling-mounted 3.4 m above the feeding area to capture high-definition top-down data for ear tag loss detection, effectively highlighting the ear region. The second camera was suspended at the midpoint above one side of the pen to capture panoramic data for continuous tracking. Both cameras automatically switched between day and night modes by detecting illumination intensity, enabling the collection of color and grayscale video data under varying activity states during both daytime and nighttime.
To ensure experimental rigor and reproducibility, all selected pigs were 2–3-month-old female Landrace gilts in good physical condition, without apparent health abnormalities or clinical diseases. All pigs were reared under identical housing conditions and stocking densities, with consistent feeding, watering, and environmental parameters. Management practices were standardized, and no special treatments were applied. Data were collected from October to December 2022. Raw video data under varying light conditions and activity states were obtained through 24-h continuous dual-view monitoring of pigs within the gilt pens.
(2) Edge computing devices in the machine room: The edge devices in the machine room were responsible for local data storage, compression, and forwarding. FFMPEG (v5.1) was used for real-time transcoding, converting the RTSP streams into MP4 files, with a new file generated every ten minutes. The incremental data were then synchronized to the cloud storage server via RSYNC (v3.2.4).
(3) Cloud storage server: The cloud server received and stored incremental video data transmitted from the edge devices through RSYNC, from which the data could be retrieved for experimental use.
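The edge-side workflow in steps (2) and (3) can be summarized in a short sketch. The snippet below is illustrative only: the RTSP address, directory paths, and segment naming are hypothetical placeholders; it simply chains FFMPEG's segment muxer (10-minute files) with an incremental RSYNC push.

```python
# Illustrative edge-side transcode-and-sync loop; URLs and paths are placeholders.
import subprocess

RTSP_URL = "rtsp://camera-ip/stream"         # hypothetical camera address
LOCAL_DIR = "/data/segments"                 # local storage on the edge device
CLOUD_DEST = "user@cloud-server:/data/farm"  # cloud storage server

def transcode_stream():
    """Split the RTSP stream into 10-minute MP4 files (600 s segments)."""
    subprocess.run([
        "ffmpeg", "-rtsp_transport", "tcp", "-i", RTSP_URL,
        "-c", "copy",                        # remux without re-encoding
        "-f", "segment", "-segment_time", "600",
        "-reset_timestamps", "1", "-strftime", "1",
        f"{LOCAL_DIR}/%Y%m%d_%H%M%S.mp4",
    ], check=True)

def sync_to_cloud():
    """Incrementally push new segments to the cloud server."""
    subprocess.run(["rsync", "-avz", "--partial", f"{LOCAL_DIR}/", CLOUD_DEST],
                   check=True)
```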
2.1.2. Data Preprocessing
This study adopted a timestamp alignment method to align dual-view data. A total of 782 paired video files containing pigs with ear tag loss were extracted simultaneously from top-down and panoramic views. Each video lasted 1–3 min and captured various activity states of pigs during both daytime and nighttime. Frames were sampled at 1 frame per second. To reduce overfitting caused by redundant images, the Structural Similarity Index (SSIM) algorithm [
24] was used for image filtering. The SSIM threshold was set to 0.78, based on multiple experiments with this dataset and supported by relevant studies [
2,
19]. This threshold effectively removed redundant images while preserving data diversity. Ultimately, 6752 images were retained, including 1403 daytime active, 2112 daytime mixed-state, 1893 daytime stationary, 313 nighttime active, 547 nighttime mixed-state, and 484 nighttime stationary images. Representative examples are shown in
Figure 2.
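As a rough illustration of the SSIM-based filtering step, the sketch below keeps a sampled frame only when its similarity to the most recently kept frame falls below the 0.78 threshold. The resize resolution and grayscale conversion are assumptions made for speed, not details reported in the study.

```python
# Minimal sketch of SSIM-based redundancy filtering with a 0.78 threshold.
import cv2
from skimage.metrics import structural_similarity as ssim

def filter_redundant(frame_paths, threshold=0.78):
    """Keep a frame only if its SSIM with the last kept frame is below the threshold."""
    kept, last_gray = [], None
    for path in frame_paths:
        img = cv2.imread(path)
        gray = cv2.cvtColor(cv2.resize(img, (480, 270)), cv2.COLOR_BGR2GRAY)
        if last_gray is None or ssim(last_gray, gray) < threshold:
            kept.append(path)
            last_gray = gray
    return kept
```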
2.1.3. Dataset Construction
(1) The detection dataset was annotated at the instance level using EISeg (v1.1) and divided into training and test sets with an 8:2 ratio. The dataset contained images of breeding pigs under different lighting and motion conditions, ensuring that the distribution of images across scenes accurately reflected real-world scenarios. This approach provided sufficient sample size and a reasonable data split. The training set included 5404 images, comprising 5185 pig instances with visible ear tags and 7261 with non-visible ear tags, while the test set contained 1348 images, including 1396 pig instances with visible ear tags and 1848 with non-visible ear tags. Here, “non-visible ear tags” covers both pigs that have lost their ear tags and pigs whose ear tags lie outside the camera frame.
(2) The individual tracking dataset was annotated using DarkLabel (v2.4) in the MOT-17 format. Each annotation recorded the target ID, the coordinates of the upper-left corner of the bounding box, and its width and height. The annotations were then exported as tracking trajectory data and converted into the LaSOT format for single-target tracking. In total, 2968 complete pig tracking trajectories were obtained and split into training and test sets with an 8:2 ratio. The data distribution is summarized in
Table 1.
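A minimal sketch of the trajectory export is given below. It assumes the public MOT-17 ground-truth column order (frame, ID, x, y, w, h, ...) and writes one LaSOT-style groundtruth.txt per track; the sequence naming is hypothetical.

```python
# Hedged sketch of converting MOT-style annotations into per-track LaSOT-style files.
import csv
from collections import defaultdict
from pathlib import Path

def mot_to_lasot(mot_gt_file, out_root):
    """Write one groundtruth.txt (x,y,w,h per frame) for every track ID."""
    tracks = defaultdict(dict)              # track_id -> {frame: (x, y, w, h)}
    with open(mot_gt_file) as f:
        for row in csv.reader(f):
            frame, tid = int(row[0]), int(row[1])
            x, y, w, h = (float(v) for v in row[2:6])
            tracks[tid][frame] = (x, y, w, h)
    for tid, frames in tracks.items():
        seq_dir = Path(out_root) / f"pig-{tid}"   # illustrative sequence name
        seq_dir.mkdir(parents=True, exist_ok=True)
        with open(seq_dir / "groundtruth.txt", "w") as f:
            for frame in sorted(frames):
                f.write(",".join(f"{v:.2f}" for v in frames[frame]) + "\n")
```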
2.2. Methods
In this study, we propose a dual-view synergistic lightweight method for detecting and tracking breeding pigs with ear tag loss. The framework consists of three components: a detector for pigs with ear tag loss, a dual-view synergistic system, and a tracker for pigs with ear tag loss, as illustrated in
Figure 3. First, a lightweight detector named Cascade-TLDP was developed by integrating the Cascade-TagLossDetector with a channel pruning algorithm. This detector classifies pigs in localized top-down view images of the feeding area as either with visible ear tags or with non-visible ear tags, depending on tag visibility and whether the pig is fully within the frame. During training, Cascade-TLDP ranked channel importance using channel information entropy, selected candidate channels for pruning, and removed redundant channels at a pruning rate of 50%. The pruned lightweight backbone replaced the original network, enabling efficient and accurate detection of pigs with ear tag loss. Second, a dual-view synergistic system was constructed, in which the detection bounding boxes from Cascade-TLDP served as inputs. Through camera calibration, coordinate transformation, and target matching between the localized top-down view and the panoramic oblique view, the positions of pigs with ear tag loss were determined within the panoramic oblique view. Finally, an enhanced STARK-MOT tracker with Motion Attention was designed. Built upon the STARK framework [
25], it models global spatiotemporal dependencies in video sequences to capture state changes of pigs with ear tag loss. The Motion Attention mechanism reinforces the model’s focus on motion regions, thereby enabling continuous tracking of localized pigs with ear tag loss in the panoramic oblique view.
2.2.1. Cascade-TLDP Detector
Although the Cascade-TagLossDetector developed by our team previously achieved high detection accuracy for pigs with ear tag loss, its network structure contains redundant feature channels, resulting in an excessive number of parameters. This not only increases computational overhead during training and inference but also limits deployment and operational efficiency in resource-constrained environments, such as edge devices. To meet the real-time requirements for detecting and tracking pigs with ear tag loss in production scenarios, this paper proposes Cascade-TLDP, a detector for pigs with ear tag loss that integrates channel pruning methods into the Cascade-TagLossDetector, as illustrated in
Figure 4. The proposed detector consists of a backbone network, a region proposal network, and a cascade detection network.
(1) Backbone Network: ResNeXt101 was used as the backbone for feature extraction. A feature extraction module, IRDSC, was constructed by integrating depthwise separable convolutions with inverted residual structures, replacing the standard convolutions in ResNeXt101. This design enhanced feature extraction capability while reducing computational complexity. In each grouped convolution layer of ResNeXt101, the SENet channel attention mechanism was incorporated to capture correlations between channels. This mechanism improved the representation of salient features and enabled adaptive adjustment based on channel importance. As a result, the model achieved a bounding box mAP of 94.15% and a Mask mAP of 90.32%, with a detection speed of 25.33 FPS, which was still insufficient for real-time detection.
To mitigate redundancy in certain channels that contributed little information to the final output, a structured pruning method was introduced. By selectively pruning channels according to their importance, this method reduced the parameter count and computational complexity while maintaining accuracy, thereby improving detection speed.
Channel importance was measured using the BatchNorm scaling factor $\gamma$. During backbone training, an $L_1$ regularization term was added to encourage the $\gamma$ values to converge towards a sparse distribution, as shown in Equation (1). This design effectively compressed the weights of unimportant channels towards zero, while retaining larger values for informative channels, thus improving discrimination in channel selection.

$$L = L_{\mathrm{det}} + \lambda \sum_{l} \left| \gamma^{(l)} \right| \tag{1}$$

where $L_{\mathrm{det}}$ denotes the training loss of the Cascade-TagLossDetector; $\lambda$ is the balance coefficient for sparse regularization; and $\gamma^{(l)}$ represents the scaling factor of the BatchNorm layer at the l-th layer.
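In a PyTorch-style implementation, Equation (1) amounts to adding an L1 penalty over all BatchNorm scaling factors to the detector loss. The sketch below assumes a generic model object and an example value for the balance coefficient, which the text does not specify.

```python
# Sketch of the L1 sparsity term on BatchNorm scaling factors (Equation (1)).
import torch
import torch.nn as nn

def sparsity_loss(model, detector_loss, lam=1e-4):
    """Total loss = detector loss + lambda * sum over BN layers of |gamma|."""
    l1_gamma = sum(m.weight.abs().sum()
                   for m in model.modules() if isinstance(m, nn.BatchNorm2d))
    return detector_loss + lam * l1_gamma
```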
Next, the channels of each layer were sorted in ascending order of their $\gamma$ values. Channels with lower importance were pruned under different pruning rates, simultaneously removing the associated convolutional kernels and BatchNorm parameters. After pruning, the number of input and output channels in the convolutional layers was updated, and the BatchNorm configurations were reconstructed. Following Yu et al. (2022) [
26], the impact of different sparsity rates on model performance was analyzed. Under a pruning rate of 50%, an optimal balance between model size and detection accuracy was achieved.
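The channel selection step can be sketched as follows: all BatchNorm scaling factors are pooled, a global cut-off corresponding to the 50% pruning rate is computed, and a keep-mask is derived per layer. The actual removal of convolution kernels and reconstruction of layer configurations is omitted here.

```python
# Sketch of selecting channels to prune by ranking BatchNorm gammas globally.
import torch
import torch.nn as nn

def channel_keep_masks(model, prune_rate=0.5):
    """Return a boolean keep-mask per BN layer; the smallest gammas are pruned."""
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    threshold = torch.quantile(gammas, prune_rate)   # global cut-off for 50% pruning
    return {name: m.weight.detach().abs() > threshold
            for name, m in model.named_modules() if isinstance(m, nn.BatchNorm2d)}
```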
(2) Region Proposal Network: The region proposal network (RPN) generated anchor boxes for candidate target regions using the four 256-channel multi-scale feature layers of the feature pyramid network. By mapping these anchor boxes back to the original image space, candidate regions for subsequent detection and segmentation were obtained.
(3) Cascade Detection Network: The cascade detection network employed cascade detection and regression branches to identify target objects and refine their coordinates. This process optimized classification and regression loss calculations by leveraging the advantages of Focal Loss for minority categories. As a result, the model placed greater emphasis on minority samples and hard-to-detect cases. This approach effectively improved detection accuracy, particularly under nighttime conditions and in scenarios characterized by imbalanced positive and negative sample distributions in the training dataset.
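For reference, a generic focal loss of the kind described above can be written as follows; the alpha and gamma values shown are the common defaults rather than the settings used in this study.

```python
# Generic focal loss sketch: down-weights easy examples so hard/minority samples
# dominate the classification loss.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits: (N, C) class scores; targets: (N,) class indices."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                     # probability of the true class
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()
```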
In this study, the dataset for detecting pigs with ear tag loss was classified based on ear tag visibility, distinguishing between pigs with visible ear tags and those with non-visible ear tags. Non-visible ear tags arose in two scenarios: (i) pigs with ear tag loss, and (ii) pigs whose ear tags were outside the image field of view. The latter case could not be conclusively identified as ear tag loss. Therefore, this study defined the identification criteria for pigs with ear tag loss as follows: if the ear tag was not visible and the predicted mask was at least one pixel away from the image boundary, the pig was classified as having ear tag loss. In other words, if a breeding pig was fully contained within the field of view and detected as having a non-visible ear tag, it was labeled as a pig with ear tag loss, and the coordinates of its detection box were recorded.
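The decision rule can be expressed compactly as below, assuming a binary instance mask and the detector's class label as inputs; the class name string is a placeholder.

```python
# Sketch of the ear-tag-loss rule: a "non-visible ear tag" pig is flagged only if
# its mask keeps at least one pixel of clearance from every image border.
import numpy as np

def is_ear_tag_loss(mask: np.ndarray, label: str) -> bool:
    """mask: binary (H, W) instance mask; label: detector class name (placeholder)."""
    if label != "non_visible_ear_tag":
        return False
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return False
    h, w = mask.shape
    touches_border = (ys.min() == 0 or xs.min() == 0 or
                      ys.max() == h - 1 or xs.max() == w - 1)
    return not touches_border
```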
2.2.2. Dual-View Synergistic Method
In this study, we achieve position mapping between the localized top-down view camera and the panoramic oblique view camera through the processes of camera calibration [
27], coordinate transformation, and target matching.
(1) Camera Calibration: This study utilizes the slatted floor as the reference plane for the world coordinate system. By measuring the three-dimensional positions and rotation angles of both the localized top-down view and global oblique view cameras relative to this coordinate system, we can determine the initial values of the external parameters. Subsequently, several representative feature points are selected on the slatted floor. The three-dimensional world coordinates are measured manually, and their corresponding two-dimensional pixel coordinates are annotated in the synchronously captured localized top-down and global oblique view images to complete the camera calibration.
During calibration, the extrinsic parameters, R and T, are initially established using physical measurement results. Subsequently, known 3D points are projected onto the image plane to compute the reprojection error between predicted pixel locations and their manually annotated counterparts. By minimizing the reprojection error for all feature points across both viewpoints, the intrinsic matrix K and the extrinsic parameters $[R \mid T]$ are jointly optimized to yield more accurate camera parameters. This projection relationship is articulated in Equation (2), while the optimization objective function is presented in Equation (3).

$$s \, p = K \, [R \mid T] \, P_w \tag{2}$$

where $p = (u, v, 1)^{\top}$ denotes pixel coordinates, $P_w = (X_w, Y_w, Z_w, 1)^{\top}$ represents the world coordinates of the feature point on the slatted floor, $s$ is a scale factor, K is the internal parameter matrix, and $[R \mid T]$ is the external parameter matrix.

$$\min_{K_j, R_j, T_j} \; \sum_{j} \sum_{i} \left\| p_{ij} - \pi\!\left(K_j, R_j, T_j, P_i\right) \right\|^{2} \tag{3}$$

In this context, $p_{ij}$ denotes the pixel coordinates of the i-th point in the j-th camera, and $P_i$ represents the known 3D world coordinates. The projection operator is denoted as $\pi(\cdot)$; $K_j$, $R_j$, and $T_j$ refer to the internal and external parameters that require optimization. These parameters are determined using the Levenberg–Marquardt nonlinear least-squares method, which ultimately facilitates an accurate spatial mapping between the world coordinate system and the image plane.
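An illustrative refinement routine in the spirit of Equations (2) and (3) is sketched below using OpenCV's projection utilities and SciPy's Levenberg–Marquardt solver. The parameter packing (focal lengths, principal point, rotation vector, translation) and the omission of lens distortion are simplifying assumptions, not the study's exact implementation.

```python
# Illustrative calibration refinement: minimize reprojection error of annotated
# floor points starting from physically measured extrinsics.
import cv2
import numpy as np
from scipy.optimize import least_squares

def reprojection_residuals(params, world_pts, pixel_pts):
    fx, fy, cx, cy = params[:4]
    rvec, tvec = params[4:7], params[7:10]
    K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]], dtype=np.float64)
    proj, _ = cv2.projectPoints(world_pts, rvec, tvec, K, None)  # no distortion
    return (proj.reshape(-1, 2) - pixel_pts).ravel()

def refine_camera(world_pts, pixel_pts, init_params):
    """world_pts: (N, 3) float64; pixel_pts: (N, 2);
    init_params = [fx, fy, cx, cy, rvec(3), tvec(3)] from physical measurement."""
    result = least_squares(reprojection_residuals, init_params,
                           args=(world_pts, pixel_pts), method="lm")
    return result.x
```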
(2) Coordinate transformation: The target boxes of breeding pigs with ear tag loss, as detected by Cascade-TLDP, were transformed into the panoramic oblique view coordinate system using the rotation $R$ and translation $T$, as shown in Equation (4).

$$P_{\mathrm{pan}} = R \, P_{\mathrm{top}} + T \tag{4}$$

In this context, $P_{\mathrm{top}}$ refers to the coordinates of the target bounding box detected from the top-down perspective, and $P_{\mathrm{pan}}$ denotes the corresponding point coordinates in the panoramic view following coordinate transformation.
(3) Individual pig detection: In the panoramic oblique view, Cascade-TLDP was used to detect individual pigs, generating bounding boxes for each breeding pig.
(4) Target matching: The normalized Euclidean distance metric was employed to compute the distance between the mapped point $P_{\mathrm{pan}}$ and the center points of all detected bounding boxes in the frame. Candidate boxes were then selected according to the nearest-neighbor principle. To improve robustness, a normalized distance threshold was applied to filter detection results. The system output the identification result for a pig with ear tag loss only if the minimum distance was smaller than the threshold; otherwise, no valid match was considered. Ultimately, the identified pig with ear tag loss was marked in the panoramic oblique view.
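Steps (2)-(4) can be combined into a short matching sketch, assuming the calibrated rotation R and translation T from Equation (4), panoramic detections in (x, y, w, h) form, and an illustrative normalized distance threshold.

```python
# Sketch of mapping a top-view box centre into the panoramic view and matching it
# to the nearest panoramic detection under a normalized distance threshold.
import numpy as np

def match_in_panorama(top_center, R, T, pano_boxes, img_w, img_h, thresh=0.05):
    """top_center: (x, y) in the top view; pano_boxes: list of (x, y, w, h)."""
    if not pano_boxes:
        return None
    p = R @ np.array([top_center[0], top_center[1], 1.0]) + T   # mapped point
    mapped = p[:2] / p[2] if p[2] != 0 else p[:2]
    diag = np.hypot(img_w, img_h)                               # normalization scale
    centers = np.array([[x + w / 2, y + h / 2] for x, y, w, h in pano_boxes])
    dists = np.linalg.norm(centers - mapped, axis=1) / diag
    best = int(np.argmin(dists))
    return best if dists[best] < thresh else None               # None = rejected
```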
2.2.3. STARK-MOT Tracker
In this study, when Cascade-TLDP detected pigs with ear tag loss, the system immediately activated the tracker and triggered an alert to notify farm staff for timely re-tagging. Previous studies have reported that the average rate of ear tag loss in pigs over their life cycle is 2.8% [
3]. By enabling timely re-tagging, the likelihood of multiple pigs with ear tag loss occurring simultaneously within a pigsty was significantly reduced. Multi-target tracking schemes typically rely on real-time multi-target detection models, which impose high computational demands. To address this issue, we employed a single-target tracking strategy to ensure continuous tracking of pigs with ear tag loss. For potential multi-target requirements in production, multiple single-target trackers were executed in parallel to meet management needs.
In actual production environments, the high visual similarity among breeding pigs complicates target differentiation during tracking. To address this challenge, we evaluated mainstream single-target tracking methods, such as SiamRPN++ [
28] and MixFormer [
29], and found that their robustness in complex scenarios—such as pose deformation and partial occlusion—remained insufficient. In contrast, STARK, based on the Transformer architecture [
30], overcame these limitations by capturing global target features through the self-attention mechanism. Furthermore, to mitigate the effect of rapid pig movements on tracking accuracy in production, we proposed a Motion Attention-enhanced tracker, termed STARK-MOT, for pigs with ear tag loss, whose structure is shown in
Figure 5. The model consists of six main modules: backbone network, feature pre-encoding, encoder, decoder, prediction head, and training and inference. By embedding Motion Attention to perform motion weighting in both spatial and temporal domains, the model’s ability to track fast-moving targets was significantly improved, leading to enhanced accuracy.
(1) Backbone Network: STARK-MOT employed ResNet-50 [
31] as its backbone. It took three inputs: (a) the initial target template of a calibrated breeding pig with ear tag loss obtained through the dual-view synergistic system, (b) the search region of the current frame representing the global visual image, and (c) a dynamically updated template sampled from intermediate frames. The backbone extracted hierarchical visual features from low to high levels, generating three sets of deep feature maps. The dynamic template was initialized with the calibrated pig with ear tag loss and was continuously refined during training and inference to improve tracking robustness.
(2) Feature Pre-encoding: The three sets of feature maps produced by the backbone were first compressed using a 1 × 1 convolution in the bottleneck layer. Subsequently, these feature maps were motion-weighted in the spatial domain through the Motion Attention module. The resulting feature maps were flattened and concatenated into three feature sequences along the spatial dimension. Then, the sequences underwent another round of motion weighting via the Motion Attention module in the sequence domain, enabling the encoder to focus more effectively on the target’s motion features. The structure of the Motion Attention module is illustrated in
Figure 6.
To highlight significant motion regions within a frame, Motion Attention computed the element-wise residuals of the salient features from adjacent frames, as shown in Equation (5).

$$D_t = F_t - F_{t-1} \tag{5}$$

In the equation, t represents the time step index; $F_t$ and $F_{t-1}$ are the feature maps output by the backbone in the current and previous frames, respectively.
Subsequently, channel compression and nonlinear transformation were applied to $D_t$ using a 1 × 1 convolution and ReLU, yielding $M_t$. $M_t$ was then flattened into $m_t$. The spatial-domain motion guidance matrix was constructed through Equation (6).

$$B = m_t \, m_t^{\top} \tag{6}$$

where $m_t \in \mathbb{R}^{HW \times C'}$, and $B$ denotes the pairwise position similarity matrix obtained through low-dimensional motion embedding.
In the attention computation, the spatial-domain motion guidance matrix was incorporated as a score bias term to assign higher weights to regions of significant motion, as shown in Equation (7).

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + \alpha B\right) V \tag{7}$$

Here, $Q$, $K$, and $V$ are the query, key, and value obtained by linear mapping of the flattened feature sequence $X$; $\sqrt{d_k}$ is the scale factor of the key and query; $B$ is the spatial-domain motion guidance matrix with weight coefficient $\alpha$; $\mathrm{Attn}(Q, K, V)$ is the weighted sum of $V$; after normalization and reshaping, the motion-weighted feature map was obtained.
Finally, the features from the template and search branches were flattened and concatenated with positional encoding. Motion Attention was then applied again to complete motion weighting in the sequence domain.
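A compact PyTorch sketch of the Motion Attention computation in Equations (5)-(7) is given below. Channel dimensions, the embedding width, and the weight coefficient alpha are illustrative assumptions, and the module is shown in isolation rather than wired into the full STARK-MOT pre-encoding path.

```python
# Minimal sketch of Motion Attention: frame-difference features are embedded,
# turned into a pairwise guidance matrix B, and added as a bias inside attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionAttention(nn.Module):
    def __init__(self, in_ch, embed_ch=64, dim=256, alpha=0.1):
        super().__init__()
        self.embed = nn.Sequential(nn.Conv2d(in_ch, embed_ch, 1), nn.ReLU())
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.alpha = alpha

    def forward(self, feat_t, feat_prev, x_seq):
        # Equation (5): element-wise residual of adjacent-frame features
        d_t = feat_t - feat_prev                            # (B, C, H, W)
        m_t = self.embed(d_t).flatten(2).transpose(1, 2)    # (B, HW, C')
        # Equation (6): pairwise position similarity from the motion embedding
        bias = torch.bmm(m_t, m_t.transpose(1, 2))          # (B, HW, HW)
        # Equation (7): attention with the motion guidance matrix as a score bias
        q, k, v = self.q(x_seq), self.k(x_seq), self.v(x_seq)   # x_seq: (B, HW, dim)
        scores = q @ k.transpose(1, 2) / (q.shape[-1] ** 0.5)
        attn = F.softmax(scores + self.alpha * bias, dim=-1)
        return attn @ v
```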
(3) Encoder: The encoder was composed of multiple Transformer encoder layers. Each layer first captured dependencies among positions in the feature sequence using a multi-head self-attention module. The resulting features were then transformed through a feedforward network with nonlinear mapping. To stabilize gradient propagation and prevent degradation of feature representation, residual connections and layer normalization were applied between submodules.
(4) Decoder: The decoder integrated multiple target queries with the encoder output feature sequences as inputs. Through self-attention and cross-attention mechanisms, the query information was fused with global spatiotemporal features, producing representation vectors for target boxes.
(5) Prediction Head: Based on the representation vectors generated by the decoder, a fully convolutional network composed of Conv-BN-ReLU modules was employed to generate corner probability maps, predict bounding box coordinates, and output binary confidence scores. These results were further used to determine whether the dynamic template should be updated.
(6) Training and Inference: During inference, if the predicted confidence exceeded a preset threshold, the corresponding bounding box region was cropped to generate a new dynamic template, which replaced the previous one. This process enabled the tracker to capture appearance changes of the target over time.
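The confidence-gated template update can be summarized as follows; the threshold value and the cropping helper are assumptions, since only a preset threshold is stated in the text.

```python
# Sketch of the confidence-gated dynamic template update used during inference.
def maybe_update_template(frame, pred_box, confidence, current_template, crop_fn,
                          conf_thresh=0.5):
    """Replace the dynamic template when the tracker is confident in its prediction."""
    if confidence > conf_thresh:
        return crop_fn(frame, pred_box)   # new template from the predicted box
    return current_template               # keep the old template otherwise
```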
When multiple breeding pigs with ear tag loss were detected in the same frame, a separate STARK-MOT tracking process was initiated for each pig. Each process was assigned to a dedicated GPU, allowing independent operation. This approach preserved the high accuracy and frame-rate advantages of single-target tracking while enabling simultaneous tracking of multiple pigs with ear tag loss.
2.2.4. Experimental Settings
The experiments were conducted on a high-performance server running Ubuntu 20.04, equipped with two Intel(R) Xeon(R) Gold 6137 processors and six NVIDIA GeForce RTX 3090 GPUs. The software environment was based on a deep learning framework that included Miniconda3, Python 3.10.11, CUDA 11.7, PyTorch 2.0.0, and MMTracking 0.14. For the detector, a stochastic gradient descent (SGD) optimizer was used with a momentum factor of 0.9 and an initial learning rate of 0.02. The intersection over union (IoU) thresholds for the three stages of the cascaded RCNN were set sequentially to 0.5, 0.6, and 0.7. The batch size was 36, and the training lasted for 100 epochs. For the tracker, the AdamW optimizer was adopted, with the learning rate, learning rate multiplication factor, and maximum gradient clipping initialized at 0.0001, 0.1, and 0.1, respectively. A stepwise decay strategy was used for learning rate adjustment. The batch size was 114, and training was performed for 150 epochs.
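For clarity, the optimizer settings above map onto PyTorch as in the sketch below; the models are stand-in placeholders, and the step interval of the decay schedule is an assumption, as only the multiplication factor of 0.1 is reported.

```python
# Sketch of the training settings: SGD for the detector, AdamW for the tracker.
import torch
import torch.nn as nn

detector = nn.Linear(1, 1)   # placeholder for Cascade-TLDP
tracker = nn.Linear(1, 1)    # placeholder for STARK-MOT

detector_optim = torch.optim.SGD(detector.parameters(), lr=0.02, momentum=0.9)
tracker_optim = torch.optim.AdamW(tracker.parameters(), lr=1e-4)
# Stepwise decay with multiplication factor 0.1; the step interval is an assumption.
tracker_sched = torch.optim.lr_scheduler.StepLR(tracker_optim, step_size=40, gamma=0.1)
# Gradient clipping with a maximum norm of 0.1, applied each iteration after backward().
torch.nn.utils.clip_grad_norm_(tracker.parameters(), max_norm=0.1)
```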
2.2.5. Evaluation Indicators
The Cascade-TLDP detector was evaluated using bounding box mean average precision (Bbox mAP), instance segmentation mean average precision (Mask mAP), number of parameters (Params), computational complexity (FLOPs), and detection speed.
The performance of the dual-view synergistic method was assessed using four indicators: target matching accuracy, coverage, dual-view mapping accuracy, and rejection rate. Specifically, target matching accuracy was defined as the proportion of correctly mapped samples among all pigs with ear tag loss that initiated dual-view mapping. Coverage referred to the proportion of samples in which the normalized distance between the predicted detection box centers from the two views did not exceed the preset threshold, relative to all pigs with ear tag loss requiring mapping. Dual-view mapping accuracy measured the proportion of correctly mapped samples among all pigs with ear tag loss intended for mapping. Rejection rate denoted the proportion of samples excluded from mapping results because the center-point distance exceeded the normalized threshold during candidate box selection under the nearest-neighbor principle.
The STARK-MOT tracker was evaluated using success rate (Success), normalized precision (Norm precision), precision, and model size. Success rate measured the proportion of frames where the intersection over union (IoU) between predicted tracking boxes and ground-truth annotations exceeded a predefined threshold, reflecting the overall tracking capability. Normalized precision was defined as the average Euclidean distance between the centers of predicted and ground-truth bounding boxes, normalized by target size, thereby evaluating spatial alignment under scale variation. Precision calculated the average pixel-level distance between the centers of predicted and ground-truth boxes, reflecting fine-grained localization accuracy. The computational formulas for these three metrics are provided in Equations (
8)–(
10).
$$\mathrm{Success} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left( \mathrm{IoU}\!\left(B_i^{\mathrm{pred}}, B_i^{\mathrm{gt}}\right) > \tau \right) \tag{8}$$

In Equation (8), N denotes the total number of frames used for evaluation. $B_i^{\mathrm{pred}}$ represents the predicted bounding box in the i-th frame, and $B_i^{\mathrm{gt}}$ is the corresponding ground-truth bounding box. $\mathrm{IoU}(B_i^{\mathrm{pred}}, B_i^{\mathrm{gt}})$ denotes the intersection over union between $B_i^{\mathrm{pred}}$ and $B_i^{\mathrm{gt}}$, while $\tau$ is the IoU threshold (set to 0.5 in this study). $\mathbb{1}(\cdot)$ is an indicator function that returns 1 if the condition inside the parentheses is satisfied, and 0 otherwise.
$$\mathrm{Norm\ precision} = \frac{1}{N} \sum_{i=1}^{N} \frac{\left\| c_i^{\mathrm{pred}} - c_i^{\mathrm{gt}} \right\|_2}{\sqrt{A}} \tag{9}$$

In Equation (9), $c_i^{\mathrm{pred}}$ and $c_i^{\mathrm{gt}}$ denote the center coordinates of the predicted and ground-truth bounding boxes in the i-th frame, respectively. A denotes the area of the image frame, serving as a normalization factor to ensure scale invariance.
$$\mathrm{Precision} = \frac{1}{N} \sum_{i=1}^{N} \left\| c_i^{\mathrm{pred}} - c_i^{\mathrm{gt}} \right\|_2 \tag{10}$$

In Equation (10), the Euclidean distance between the predicted and ground-truth bounding box centers is computed for each frame, thereby quantifying the tracking error in pixel units.