Article

A Multi-Source Fusion-Based Material Tracking Method for Discrete–Continuous Hybrid Scenarios

Institute of Engineering Technology, University of Science and Technology Beijing, Beijing 100083, China
*
Author to whom correspondence should be addressed.
Processes 2025, 13(11), 3727; https://doi.org/10.3390/pr13113727
Submission received: 18 September 2025 / Revised: 10 October 2025 / Accepted: 16 October 2025 / Published: 19 November 2025
(This article belongs to the Section Manufacturing Processes and Systems)

Abstract

Special steel manufacturing involves both discrete processing events and continuous physical flows, forming a representative discrete–continuous hybrid production system. However, due to the visually homogeneous surfaces of steel products, the highly dynamic production environment, and frequent disturbances or anomalies, traditional single-source tracking approaches struggle to maintain accurate and consistent material identification. To address these challenges, this paper proposes a multi-source fusion-based material tracking method tailored for discrete–continuous hybrid scenarios. First, a state–event system (SES) is constructed based on process rules, enabling interpretable reasoning of material states through event streams and logical constraints. Second, on the visual perception side, a YOLOv8-SE detection network embedded with the squeeze-and-excitation (SE) channel attention mechanism is designed, while the DeepSORT tracking framework is improved to enhance weak feature extraction and dynamic matching for visually similar targets. Finally, to handle information conflicts and cooperation in multi-source fusion, an improved Dempster–Shafer (D-S) evidence fusion strategy is developed, integrating customized anomaly handling and fault-tolerance mechanisms to boost decision reliability in conflict-prone regions. Experiments conducted on real special steel production lines demonstrate that the proposed method significantly improves detection accuracy, ID consistency, and trajectory integrity under complex operating conditions, while enhancing robustness against modal conflicts and abnormal scenarios. This work provides an interpretable and engineering-feasible solution for end-to-end material tracking in hybrid manufacturing systems, offering theoretical and methodological insights for the practical deployment of multi-source collaborative perception in industrial environments.

1. Introduction

Industry 5.0 emphasizes “intelligent collaboration” and “high-efficiency flexibility” as core objectives, and special steel production, as a key foundation for high-end manufacturing, directly determines product quality and production efficiency through its material tracking accuracy [1]. Special steel production combines “discrete–continuous hybrid” characteristics: continuous processes require high-frequency positioning, while discrete processes demand precise association of batch IDs and semantic information, making it difficult for single solutions to cover the entire workflow [2,3]. RFID/barcode reading failure rates exceed 30% in high-temperature occlusion environments [4]; pure visual solutions, although advanced in defect detection, are prone to missed detections and ID switches in complex tracking tasks, leading to trajectory fragmentation [5,6]; and encoder-based displacement estimation fails to meet precision requirements due to error accumulation [7]. Even though industrial internet and digital twins provide integration frameworks, they remain constrained by semantic inconsistencies and architectural conflicts [8,9,10].
Material modeling is the foundation of target tracking. The state–event system (SES) is a general dynamic system modeling framework whose core function is to “abstract and describe the state change rules of complex systems” [11]. For manufacturing, by abstracting equipment signals and processing actions into event streams and combining them with data calibration, interpretable inference of material states can be achieved [12]. Engelmann et al. [11] proposed an event detection method based on machine tool Numerical Control (NC) data and machine learning, which uses industrial control/sensor time-series features to automatically identify production events. However, it relies on a single signal source with limited event granularity and semantics, making it difficult to support refined material tracking in complex special steel production lines. Hennebold et al. [13] constructed a state–event graph through process mining and causal discovery, but their method is only suitable for offline process analysis, cannot meet the needs of online real-time inference, and has limited fault tolerance for heterogeneous modal conflicts. Gaugel et al. [14] proposed a deep segmentation architecture for continuous sensor streams, but failed to provide interpretable event inference logic. In engineering applications, rule engines are the key to connecting process specifications with real-time data; to address trajectory accumulation errors, researchers have proposed multi-anchor compensation strategies based on sliding-window smoothing and fixed delays [7,11]. Process mining technologies (e.g., PM-SPC [15], GPT-Sim [16]) have also improved the modeling adaptability of dynamic processes. However, existing methods generally assume “fixed material paths,” making them unable to cope with the “small-batch, multi-specification” and process intersection scenarios of special steel [17].
Visual detection and tracking are critical to breaking tracking bottlenecks, exhibiting a trend of “detection optimization—tracking enhancement—multi-modal fusion” [18]. YOLO-ADS reduces small-target information loss by optimizing the CSPNet structure and SPD layer [19]; GCP-YOLO introduces ghost convolution and dedicated small-target detection layers, lowering false detection rates by 12% [20]; and YOLOv8-STE maintains mAP@0.5 above 85% by enhancing the anti-interference capability of the neck module [21]. In terms of tracking algorithms, the standard DeepSORT [22] predicts the motion state of materials based on Kalman filtering, extracts appearance features using convolutional neural networks, and realizes cross-frame matching via the Hungarian algorithm. However, in the scenario of parallel multi-material transmission in special steel production, its fixed weighting of motion/appearance similarity cannot handle occlusion, easily leading to frequent ID switches, and its ability to distinguish similar materials is weak. StrongSORT [23] increases the ID retention rate to 82% by introducing more robust re-ID features and dynamic matching strategies; Preformer MOT [24] optimizes occlusion recovery using the global context of Transformers, but its computational overhead increases threefold, making it difficult to adapt to industrial-embedded deployment. Multi-modal visual fusion has also become a research hotspot [25,26]; the self-attention fusion block of SFusion improves the distinguishability of similar materials by learning inter-modal associated features [27], but problems such as unclear segmentation and ID confusion still exist under strong interference [28,29,30].
Multi-source fusion and decision-making are the core of breaking through tracking bottlenecks. Among them, Dempster–Shafer (D-S) evidence theory is a multi-source fusion framework for processing uncertain information. It quantifies the support degree of each piece of evidence for propositions through basic probability assignment (BPA) and fuses multi-modal information using combination rules. Due to its ability to effectively handle the uncertainty of heterogeneous data, it has become a mainstream fusion method [31]. At the visual–signal fusion level, researchers have proposed multi-level collaborative strategies [32]; geometry-guided score fusion reduces association errors by 20% through spatial consistency verification [33]. To address modal conflicts, the combination of deep learning and evidence theory has become a new direction: El-Din et al. proposed an adaptive late-fusion framework based on D-S, improving decision reliability by 18% [34]; Shao et al.’s dual-layer evidence fusion framework achieved an accuracy of 99.2% in equipment anomaly recognition [35]. However, in special steel scenarios, temporal deviations and the semantic heterogeneity of multi-source data still reduce the fault tolerance rate of fusion strategies by more than 25% [36].
In summary, material tracking in special steel production faces three major challenges:
(1)
Difficulties in material state modeling under complex processes: Existing SES models have three key limitations—(a) relying on fixed path assumptions and being unable to adapt to “small-batch, multi-specification” production; (b) lacking hierarchical inference logic, with most adopting single-layer event-driven inference, making it difficult to resolve process intersection conflicts; (c) separating inference from error correction, only being able to output static states, and being unable to suppress accumulated errors of dynamic trajectories, resulting in traditional modeling being unable to cover the discrete–continuous hybrid process of special steel.
(2)
Insufficient accuracy of visual detection and tracking under similar materials and complex working conditions: existing detection algorithms have a weak ability to extract weak features under high-temperature glare; tracking algorithms are prone to ID switches and trajectory breakage due to fixed weights and weak distinguishability, making it difficult to meet the high-tempo and high-reliability tracking requirements of special steel production.
(3)
Unstable decision-making under multi-source information heterogeneity and conflicts: traditional D-S fusion lacks spatiotemporal alignment mechanisms and insufficiently quantifies evidence conflicts caused by temporal deviations and semantic heterogeneity, leading to reduced reliability of state decisions in conflict areas and failure to ensure the continuity of full-process tracking.
To this end, this paper proposes a multi-source fusion material tracking method for discrete–continuous hybrid scenarios. The main contributions are as follows:
(1)
Proposes an integrated semantic modeling method of “process-rule-driven hierarchical inference + trajectory inference with anchor correction,” breaking the scenario adaptability bottleneck of traditional state modeling: first, a general SES framework, decoupled from specific processes, is constructed. By abstracting “material states” and “trigger events,” a unified state description for multi-specification steel materials is realized. Second, a three-layer hierarchical inference system based on Drools is designed, improving the robustness of state inference. Finally, a signal-driven trajectory inference and multi-anchor dynamic correction strategy is proposed. Through “prediction-matching sliding-window smoothing,” errors are suppressed, ensuring trajectory continuity in non-visual areas and filling the gap of traditional modeling that “only has static states without in-process position localization”.
(2)
Proposes a visual perception scheme of “YOLOv8-SE + improved DeepSORT,” solving the accuracy and stability problems of visual tracking in complex industrial scenarios: first, the squeeze-and-excitation (SE) channel attention mechanism is embedded after the C2f module of YOLOv8 to enhance the discriminability of weak features, such as steel surface details and boundary notches. Second, a hybrid loss function of the “focal loss (solving class imbalance) + CIOU loss (optimizing bounding box regression)” is adopted to improve detection accuracy in high-temperature glare and blurred scenarios. Finally, the DeepSORT framework is improved—optimizing re-ID feature extraction through cross-entropy loss and triplet loss (solving similar material confusion), dynamically adjusting the weight of motion/appearance similarity (reducing ID switches), and retaining weak matches and cross-camera spatial alignment (avoiding trajectory breakage)—thus, improving the ID retention rate and trajectory integrity.
(3)
Proposes an improved D-S evidence fusion strategy with “dynamic confidence weighting, spatiotemporal alignment, and multi-frame consistency voting,” realizing the reliable collaborative decision-making of multi-source data: first, a dynamically confidence-weighted BPA is constructed, integrating historical state consistency, temporal stability, and upstream module confidence, and adjusting the discount factor according to the modal noise level (reducing abnormal interference). Second, a spatiotemporal alignment mechanism (temporal offset estimation + spatial coordinate calibration) is introduced, to ensure multi-source data are fused under the same benchmark. Finally, a multi-frame consistency voting module is designed to select the optimal result through branch trajectory scoring in case of high conflicts, improving fusion reliability in complex scenarios and ensuring the accuracy of full-process tracking.
The structure of this paper is as follows: Section 2 elaborates on method design, Section 3 conducts experimental verification, and Section 4 summarizes conclusions and future prospects.

2. Multi-Source Fusion Material Tracking Method for Discrete–Continuous Hybrid Scenarios

The proposed method adopts a three-layer structure of “dynamic modeling—robust tracking—collaborative fusion”:
(1)
Semantic modeling layer: utilizes SES and the Drools rule engine to abstract equipment signals and processing actions into event streams, driving state transitions and achieving trajectory inference, providing process priors for downstream modules;
(2)
Visual perception layer: based on YOLOv8-SE and improved DeepSORT, performs real-time detection and cross-frame association under complex conditions, improving detection accuracy and ID stability;
(3)
Multi-source fusion layer: through improved D-S evidence theory, incorporates dynamic confidence, spatiotemporal alignment, and consistency voting mechanisms to integrate visual, signal, and anchor data from multiple sources, enhancing system robustness.
This progressive design realizes a closed-loop tracking from semantic abstraction to precise spatial positioning, ensuring full-process reliability under complex conditions. The system architecture is shown in Figure 1.

2.1. Semantic Modeling Layer

Material state modeling is key to capturing the essence of processing. This module first constructs a state model to drive transitions, then introduces trajectory inference for continuous position estimation. The two collaborate to ensure tracking continuity in visual blind spots and provide prior constraints for subsequent visual modules.

2.1.1. State Modeling and Rule Inference

In special steel manufacturing, the discrete state transitions of materials are influenced by physical paths, process flows, and multi-source sensor feedback. This requires models that balance event-driven approaches with embedded engineering knowledge to enhance interpretability. We define the set of materials to be tracked as $M = \{m_1, m_2, \ldots, m_k\}$, where the processing of each material $m_i$ is abstracted as a sequence of state transition events $T_i = \{e_{i1}, e_{i2}, \ldots, e_{in}\}$. Here, $e_{ij} = (t_j, p_j, s(t_j))$ indicates that material $m_i$ undergoes a state change with attribute $s_j$ at time $t_j$ and position $p_j$ (e.g., furnace entry, heating, rolling). The evolution of the global trajectory structure $T = \{T_1, T_2, \ldots, T_k\}$ is described by the state transition function:
$$s_j(t+1) = f\big(s_j(t), e(t)\big)$$
where $e(t)$ originates from multi-source channels. To handle noise and uncertainty, a Bayesian update mechanism is incorporated:
$$P(s_t \mid E_{1:t}) \propto P(e_t \mid s_t) \sum_{s_{t-1} \in S} P(s_t \mid s_{t-1}, e_t)\, P(s_{t-1} \mid E_{1:t-1})$$
This probabilistic framework supports flexible inference with incomplete events, but pure mathematical descriptions struggle to embed complex process specifications. Thus, a three-layer inference system based on the Drools rule engine is designed, mapping state updates to interpretable rule sets. Each rule takes the form $C(e, s, t) \Rightarrow (s', \mu)$, where the composite condition $C(e, s, t) = C_{\text{event}}(e) \wedge C_{\text{spatial}}(p) \wedge C_{\text{temporal}}(t)$ covers event ontology, spatial consistency, and temporal constraints; $s'$ is the subsequent state, and $\mu \in [0, 1]$ is the confidence (used for subsequent fusion weight adjustment). The three-layer inference logic is as follows:
(1)
Atomic event rule layer L1: encodes equipment signals with Boolean logic to generate standardized events as foundational drivers;
(2)
State inference layer L2: builds a finite state machine that receives event streams to drive state transitions, embedding a time window for delayed updates to accumulate confidence and avoid ambiguous paths;
(3)
Advanced constraint layer L3: resolves rule conflicts through interlocking mutual exclusion, priority scheduling, and causal chain inference, with backtracking to repair state drifts.
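To make the layered reasoning concrete, a minimal Python sketch of the three layers is given below. The paper implements these layers as Drools rule sets; the signal names, thresholds, and transition table here are purely illustrative assumptions.

```python
from dataclasses import dataclass

# L1: atomic event rules — encode raw equipment signals as standardized events
# (signal names and the rolling-force threshold are hypothetical).
def atomic_events(signals: dict) -> list[str]:
    events = []
    if signals.get("furnace_door_open") and signals.get("charger_forward"):
        events.append("FURNACE_ENTRY")
    if signals.get("rolling_force", 0.0) > 50.0:
        events.append("ROLLING_START")
    return events

# L2: state inference — a finite state machine driven by the event stream, with a
# confidence accumulator standing in for the delayed-update time window.
TRANSITIONS = {("waiting", "FURNACE_ENTRY"): "heating",
               ("heating", "ROLLING_START"): "rolling"}

@dataclass
class MaterialState:
    state: str = "waiting"
    confidence: float = 0.0

def infer_state(m: MaterialState, events: list[str],
                mu_rule: float = 0.8, commit: float = 0.6) -> MaterialState:
    for e in events:
        nxt = TRANSITIONS.get((m.state, e))
        if nxt is None:
            continue
        m.confidence = 0.5 * m.confidence + 0.5 * mu_rule   # accumulate evidence first
        if m.confidence >= commit:
            m.state, m.confidence = nxt, 0.0                # delayed, confident transition
    return m

# L3: advanced constraints — e.g., an interlocking mutual exclusion between states
# that must never coexist on the same line segment (illustrative).
def check_interlock(states: list[str]) -> bool:
    return not ("rolling" in states and "sawing" in states)
```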

2.1.2. Signal-Driven Trajectory Inference Model

The state modeling provides a semantic framework, but in continuous processes like hot rolling and sawing with insufficient visual coverage, position estimation accuracy directly impacts tracking continuity; reliance on pure incremental observations easily leads to error accumulation and trajectory deviations. To this end, a signal-driven trajectory inference model is constructed, enabling continuous position prediction in non-visual areas through equipment behavior events and motion parameters. When upstream equipment issues an action completion signal, the initial position $p_0$ is recorded and combined with the transmission speed ratio $k$ (equipment parameters + empirical calibration) and the real-time speed $v(t)$, and the position is predicted via integration:
$$p(t) = p_0 + \int_{t_0}^{t} k\, v(\tau)\, d\tau$$
To suppress the impact of speed disturbances and speed-ratio deviations on long-term inference, a multi-anchor correction strategy is introduced: an anchor set $A = \{a_1, a_2, \ldots, a_m\}$ is preset, and a position reset is triggered when the predicted trajectory reaches an anchor and satisfies the matching criterion:
$$\|p(t) - a_i\| < \epsilon_p \;\wedge\; |t - t_{\text{event}}| < \epsilon_t \;\wedge\; s(t) \in S_{\text{valid}}$$
Within the sliding window $W = [t_0, t_0 + T]$, the offset $b$ is estimated via least squares to minimize deviations from observations and anchor constraints:
$$\min_{b} \sum_{t \in W} \left\| \tilde{x}_t + b_t - \hat{x}_t^{\text{anchor}} \right\|^2 + \lambda \sum_{t \in W} \left\| b_t - b_{t-1} \right\|^2$$
where $\tilde{x}_t$ is the inferred position vector, $\hat{x}_t^{\text{anchor}}$ is the anchor-corrected position estimate vector, $b_t$ is the offset component at time $t$, and $\lambda$ is the smoothing coefficient (balancing deviation and offset smoothness). The closed-form solution is:
$$b = \left( A^{\top} A + \lambda L \right)^{-1} A^{\top} c$$
where $b$ is the optimal offset vector, $A$ is the observation matrix, $L$ is the smoothing matrix, and $c$ is the constraint vector. For multi-anchor inconsistencies, weighted fusion is adopted:
$$\hat{x}_t = \frac{\sum_k \omega_k(t)\, \tilde{x}_t^{(k)}}{\sum_k \omega_k(t)}, \qquad \omega_k(t) \propto \frac{1}{\sigma_k^2 + \alpha\, |t - t_k|}$$
where $\tilde{x}_t^{(k)}$ is the inferred position for the $k$-th anchor, $\omega_k(t)$ is the weight coefficient, $\sigma_k^2$ is the variance of the $k$-th anchor, and $\alpha$ is the temporal decay coefficient. To enhance tolerance to signal jitter and false triggers, the matching score between a candidate trajectory $T_i$ from the set $\{T_1, T_2, \ldots, T_n\}$ and an event $e_j$ is computed as:
$$\mathrm{Score}(T_i, e_j) = w_t \exp\!\left( -\frac{|t_i - t_j|}{\sigma_t} \right) + w_p \exp\!\left( -\frac{\|p_i - p_j\|}{\sigma_p} \right) + w_d\, (d_i \cdot d_j)$$
where $w_t$, $w_p$, and $w_d$ are weights for time, position, and flow direction; $\sigma_t$ and $\sigma_p$ are the standard deviations for time and position similarity; and $d_i \cdot d_j$ is the semantic consistency coefficient (0–1). Multi-trajectory competition prioritizes the minimum-distance trajectory, updating states via confidence arbitration $\mu_i(t+1) = \alpha\, \mu_i(t) + (1 - \alpha)\, m(e_j)$; signal-free smoothing intervals use uniform-speed prediction $p(t) = p(t_0) + v_0 (t - t_0)$ as a buffer ($v_0$ is the average speed). This model constrains the inference scope via state boundaries and suppresses error accumulation with the correction strategies, providing initialization constraints for the visual module.
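As an illustration of the integration, anchor matching, and sliding-window correction described above, the following NumPy sketch uses the simplified case where the observation matrix $A$ is the identity (one offset per time step); the speed ratio, thresholds, and synthetic anchor data are placeholders rather than values calibrated on the production line.

```python
import numpy as np

def predict_position(p0: float, speeds: np.ndarray, dt: float, k: float = 0.85) -> np.ndarray:
    """Integrate p(t) = p0 + ∫ k·v(τ) dτ with a cumulative sum over sampled speeds."""
    return p0 + np.cumsum(k * speeds) * dt

def anchor_match(p_pred: float, t: float, anchor_p: float, t_event: float,
                 eps_p: float = 0.05, eps_t: float = 0.5) -> bool:
    """Trigger a position reset when the prediction is near an anchor in space and time."""
    return abs(p_pred - anchor_p) < eps_p and abs(t - t_event) < eps_t

def window_offset(x_tilde: np.ndarray, x_anchor: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Closed-form offset b = (AᵀA + λL)⁻¹Aᵀc for A = I, c = x_anchor − x_tilde."""
    n = len(x_tilde)
    D = np.diff(np.eye(n), axis=0)            # first-difference operator
    L = D.T @ D                                # smoothing matrix (path-graph Laplacian)
    c = x_anchor - x_tilde
    return np.linalg.solve(np.eye(n) + lam * L, c)

# Usage: correct a drifting inferred trajectory inside one sliding window.
speeds = np.full(10, 1.2)                                    # m/s, illustrative
x_tilde = predict_position(0.0, speeds, dt=1.0)
x_anchor = x_tilde + 0.3 + 0.02 * np.random.randn(10)        # synthetic anchor-corrected estimates
x_hat = x_tilde + window_offset(x_tilde, x_anchor, lam=2.0)  # smoothed, corrected positions
```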

2.2. Visual Perception Layer

The signal-driven model excels in continuous processes, but signal perception limitations in process intersections and multi-material convergence zones easily lead to ID confusions and trajectory breaks. To this end, a visual perception module is introduced, combining the YOLOv8-SE [37] detector with an improved DeepSORT [22,23] framework for real-time detection and association in dense scenarios. This module inherits semantic priors from upstream state modeling to initialize detection regions, with output trajectory candidates directly fed into the fusion layer to provide precise spatial evidence, forming a “signal–visual” complementary mechanism. The network architecture is shown in Figure 2.

2.2.1. YOLOv8-SE Detection Network Optimization

Visual detection in special steel scenarios faces challenges like high-temperature glare and homogeneous appearances, with traditional YOLOv8 exhibiting an insufficient response to weak features (e.g., material surface numbering, boundary notches), leading to small-target missed detections. To this end, squeeze-and-excitation (SE) channel attention is embedded after YOLOv8’s C2f module to construct the YOLOv8-SE network, dynamically recalibrating feature weights to enhance key region discriminability. The SE module structure is shown in Figure 3.
For an input feature map $X \in \mathbb{R}^{H \times W \times C}$ ($H$: height, $W$: width, $C$: channels), the SE mechanism proceeds as follows:
(1)
Channel squeeze: computes channel descriptors via global average pooling to compress spatial dimensions:
$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x(i, j, c)$$
where $x(i, j, c)$ is the pixel value of feature map $X$ at position $(i, j)$ in channel $c$, and $z_c$ is the global descriptor for channel $c$.
(2)
Channel excitation: learns channel weights via fully connected layers to highlight key channels:
$$w_c = \left[ \sigma\!\left( W_2\, \mathrm{ReLU}(W_1 z) \right) \right]_c$$
where $z$ is the vector of channel descriptors, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ are fully connected layer weight matrices ($r$ is the compression ratio, set to 4), $\mathrm{ReLU}(\cdot)$ is the activation function, and $\sigma$ is the sigmoid function.
(3)
Feature enhancement (Scale): multiplies learned weights with the original feature map channel-wise to generate enhanced features:
$$X'_c = w_c \cdot X_c$$
where $X_c$ is the $c$-th channel of the original feature map, and $X'_c$ is the enhanced feature channel.
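A generic PyTorch sketch of the squeeze, excitation, and scale steps is shown below. It is the standard SE block; the reduction ratio, tensor sizes, and the exact way it would be attached after YOLOv8's C2f outputs are assumptions for illustration rather than the paper's integration code.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention: squeeze -> excitation -> scale."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                # global average pooling (squeeze)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),     # W1: compress channels
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),     # W2: restore channels
            nn.Sigmoid(),                                   # per-channel weights in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        z = self.pool(x).view(b, c)                         # z_c: channel descriptors
        w = self.fc(z).view(b, c, 1, 1)                     # w_c: learned channel weights
        return x * w                                        # channel-wise recalibration (scale)

# Usage: recalibrate a feature map of shape (batch, C, H, W), e.g., a C2f output.
feat = torch.randn(2, 256, 40, 40)
out = SEBlock(256, reduction=4)(feat)
```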
Training employs a hybrid loss function:
$$L_{\det} = \lambda_{\text{obj}} L_{\text{obj}} + \lambda_{\text{cls}} L_{\text{cls}} + \lambda_{\text{box}} L_{\text{box}}$$
where $\lambda_{\text{obj}}$, $\lambda_{\text{cls}}$, and $\lambda_{\text{box}}$ are loss weights; $L_{\text{obj}}$ is the object confidence loss, $L_{\text{cls}}$ is the classification loss, and $L_{\text{box}}$ is the bounding box regression loss. The loss terms are defined as follows:
(1)
$L_{\text{cls}}$ and $L_{\text{obj}}$ use focal loss to address class imbalance:
$$L_{\text{Focal}}(\hat{y}, y) = -\alpha (1 - \hat{y})^{\gamma} \log(\hat{y}), \quad \text{if } y = 1$$
where $\hat{y}$ is the predicted probability, $y$ is the true label, $\alpha$ is the balance coefficient (0.25), and $\gamma$ is the difficulty weight (2).
(2)
$L_{\text{box}}$ uses the CIoU loss, integrating bounding box center distance, overlap area, and aspect ratio:
$$L_{\text{CIoU}} = 1 - \mathrm{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v$$
where $b$ is the predicted bounding box center vector, $b^{gt}$ is the ground truth bounding box center vector, $\rho$ is the Euclidean distance, $c$ is the diagonal length of the smallest enclosing rectangle of the two boxes, $v$ is the aspect ratio consistency coefficient, and $\alpha = v / (1 - \mathrm{IoU} + v)$.
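For reference, a hedged PyTorch sketch of the hybrid loss is given below: a binary focal loss matching the equation above (the y = 0 branch is the standard complement), with the CIoU term taken from torchvision.ops.complete_box_iou_loss, which recent torchvision releases provide; the loss weights are placeholders, not the tuned values used in training.

```python
import torch
import torchvision.ops as ops

def focal_loss(pred: torch.Tensor, target: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss: -α(1-ŷ)^γ·log(ŷ) for y = 1 and -(1-α)·ŷ^γ·log(1-ŷ) for y = 0."""
    pred = pred.clamp(1e-6, 1.0 - 1e-6)
    pos = -alpha * (1.0 - pred) ** gamma * torch.log(pred)
    neg = -(1.0 - alpha) * pred ** gamma * torch.log(1.0 - pred)
    return torch.where(target == 1, pos, neg).mean()

def detection_loss(cls_pred, cls_target, obj_pred, obj_target, boxes_pred, boxes_gt,
                   w_obj: float = 1.0, w_cls: float = 0.5, w_box: float = 5.0) -> torch.Tensor:
    """Hybrid loss L_det = λ_obj·L_obj + λ_cls·L_cls + λ_box·L_box (weights are illustrative)."""
    l_obj = focal_loss(obj_pred, obj_target)
    l_cls = focal_loss(cls_pred, cls_target)
    # CIoU regression loss on (x1, y1, x2, y2) boxes; assumed available in the installed torchvision.
    l_box = ops.complete_box_iou_loss(boxes_pred, boxes_gt, reduction="mean")
    return w_obj * l_obj + w_cls * l_cls + w_box * l_box
```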

2.2.2. Improved DeepSORT Tracking Algorithm

After detection, cross-frame association is core to maintaining trajectory continuity. Standard DeepSORT is susceptible to appearance similarity in dense occlusion scenarios (e.g., multi-material parallel transmission), leading to ID losses. To this end, DeepSORT is improved as follows:
(1)
Re-ID feature extraction optimization: extracts highly discriminative re-ID feature vectors via lightweight CNN:
$$L_{\text{reid}} = L_{\text{CE}} + \lambda_{\text{tri}} L_{\text{tri}}$$
where $L_{\text{CE}}$ is the cross-entropy loss, and $L_{\text{tri}}$ is the triplet loss.
(2)
Multi-dimensional similarity fusion: combines Kalman filter predictions to compute motion similarity, introducing dynamic weights to balance motion and appearance information.
Motion similarity: uses Mahalanobis distance to measure deviation between predicted and detected positions:
$$d_{\text{mot}}(x, \hat{x}) = (x - \hat{x})^{\top} S^{-1} (x - \hat{x})$$
where $x$ is the detected position vector, $\hat{x}$ is the Kalman-predicted position vector, and $S$ is the prediction error covariance matrix.
Appearance similarity: Uses cosine distance to measure re-ID feature similarity:
$$d_{\text{app}}(f_i, f_j) = 1 - \frac{f_i^{\top} f_j}{\|f_i\|\, \|f_j\|}$$
Dynamic weight fusion: introduces the weight $\alpha_t = \sigma(\eta_1 c + \eta_2\, \mathrm{consist}_m)$, where $c$ is the detection confidence and $\mathrm{consist}_m$ is the trajectory consistency; the combined matching cost is:
$$C(i, j) = \alpha_t\, d_{\text{app}}(i, j) + (1 - \alpha_t)\, d_{\text{mot}}(i, j)$$
(3)
Association optimization strategies:
Uses Hungarian algorithm to solve minimum matching cost for ID assignment.
Retains weak matches (e.g., low similarity due to occlusion) for 5 frames to reduce trajectory losses.
Trajectory confidence update: $c_{\text{ID}}(t+1) = \beta\, c_{\text{ID}}(t) + (1 - \beta)\, m_{\det}(t)$, where $\beta$ is the historical weight, $c_{\text{ID}}$ is the ID confidence, and $m_{\det}(t)$ is the detection confidence at time $t$;
Cross-camera ID mapping: achieves multi-view trajectory association via spatiotemporal constraints (e.g., position projection in camera overlap regions);
Modal alignment: employs timestamp synchronization (unified system clock) and spatial projection $p_v = K^{-1} [u, v, 1]^{\top} T_{\text{cam}}$ ($K$ is the camera intrinsic matrix, $[u, v]$ is the pixel coordinate, and $T_{\text{cam}}$ is the camera extrinsic matrix), supplemented by 3-frame consistency voting to filter false detections.
The improved algorithm dynamically adapts to scenario complexity, prioritizing appearance weights when they are reliable and relying on motion prediction otherwise, reducing ID switch rates by over 30%. It outputs the material ID, position vector $x_t$, quantity, and feature confidence $m_v(A)$, providing visual evidence for the fusion module.
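The dynamic-weight matching described in this subsection can be sketched in Python as follows, using SciPy's Hungarian solver for the assignment step; the distance implementations, η coefficients, and gating value are illustrative assumptions rather than the production configuration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def appearance_distance(feats_trk: np.ndarray, feats_det: np.ndarray) -> np.ndarray:
    """Cosine distance d_app between L2-normalized re-ID features (tracks × detections)."""
    a = feats_trk / np.linalg.norm(feats_trk, axis=1, keepdims=True)
    b = feats_det / np.linalg.norm(feats_det, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def motion_distance(pred_pos: np.ndarray, det_pos: np.ndarray, S_inv: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance d_mot between Kalman predictions and detections."""
    diff = pred_pos[:, None, :] - det_pos[None, :, :]          # (tracks, detections, dim)
    return np.einsum("tdi,ij,tdj->td", diff, S_inv, diff)

def dynamic_weight(det_conf: float, trk_consist: float, eta1: float = 2.0, eta2: float = 2.0) -> float:
    """α_t = σ(η1·c + η2·consist): trust appearance more when detections and tracks are reliable."""
    return 1.0 / (1.0 + np.exp(-(eta1 * det_conf + eta2 * trk_consist)))

def associate(d_app: np.ndarray, d_mot: np.ndarray, alpha_t: float, gate: float = 50.0):
    """Combined cost C = α_t·d_app + (1-α_t)·d_mot, solved with the Hungarian algorithm."""
    cost = alpha_t * d_app + (1.0 - alpha_t) * d_mot
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < gate]   # drop gated-out pairs
```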

2.3. Multi-Source Fusion Decision Model

Upstream modules provide rich information, but heterogeneous modalities exhibit temporal deviations and semantic conflicts, easily leading to decision inconsistencies. To this end, a multi-source fusion decision model based on D-S evidence theory is proposed, constructing globally consistent material state sequences through consistency verification, evidence combination, and anomaly tolerance. As the top-level framework, it integrates upstream observations to generate final estimates, ensuring system robustness under complex conditions. The process is shown in Figure 4.

2.3.1. Consistency Verification and BPA Construction

Before fusion, the semantic compatibility of the inputs must be verified to avoid invalid computations: verification is performed on the shared triple $(p(t), d(t), s(t))$, where $p(t)$ is the position vector, $d(t)$ the semantic description, and $s(t)$ the state:
$$\|p_{\text{signal}}(t) - p_{\text{vision}}(t)\| < \epsilon_p \;\wedge\; |d_{\text{signal}}(t) - d_{\text{vision}}(t)| < \epsilon_d \;\wedge\; s(t) \in S_{\text{valid}}$$
where $p_{\text{signal}}(t)$ and $p_{\text{vision}}(t)$ are the signal-inferred and visually detected position vectors, $d_{\text{signal}}(t)$ and $d_{\text{vision}}(t)$ are their semantic descriptions, and $\epsilon_d$ is the semantic deviation threshold. Trajectories that fail verification are marked as “pending verification”; short-term jitter (e.g., visual deviations from transient glare) is suppressed via a time window $\Delta t = 2\ \mathrm{s}$, and updates are deferred when the confidence fluctuates by $|\mu(t) - \mu(t-1)| > 0.2$, ensuring input quality to reduce subsequent conflicts.
Based on the verified data, the basic probability assignment (BPA) $m(s_i)$ ($s_i \in S$, with $S$ the state space) is constructed, mapping the state space to the frame of discernment $\Theta = \{s_1, s_2, \ldots, s_k\}$ and integrating three types of information (historical consistency, temporal stability, and path reliability):
$$m(s_i) = \alpha\, c_{\text{hist}}(s_i) + \beta\, c_{\text{stab}}(s_i) + (1 - \alpha - \beta)\, \mu$$
where $\alpha = 0.4$ and $\beta = 0.3$ are weights (their sum is less than 1, reserving mass for uncertainty allocation), with the components defined as follows:
(1)
Historical consistency $c_{\text{hist}}(s_i)$: represents the consistency between historical and current states:
$$c_{\text{hist}}(s_i) = \frac{\sum_{j=1}^{k} \Gamma(s_i^{j} = s_i)\, w_j}{\sum_{j=1}^{k} w_j}$$
where $\Gamma(\cdot)$ is the indicator function, $s_i^{j}$ is the historical state of material $i$ at moment $j$, and $w_j = \exp(-\lambda (t - t_j))$ is the temporal decay weight.
(2)
Temporal stability: measures state temporal stability via entropy:
$$c_{\text{stab}}(s_i) = 1 - \frac{H(s)}{H_{\max}}, \qquad H(s) = -\sum_{s \in \Theta} P(s) \log_2 P(s)$$
where $H(s)$ is the state entropy, $H_{\max} = \log_2 |\Theta|$ is the maximum entropy ($|\Theta|$ is the number of states), and $P(s)$ is the state probability distribution.
(3)
Path reliability: output from upstream modules (rule confidence from state modeling or classification confidence from visual detection).
To reflect modal uncertainty, a dynamic discount factor $\alpha_m$ is introduced to revise the BPA:
$$\tilde{m}^{(m)}(A) = (1 - \alpha_m)\, m^{(m)}(A), \qquad \tilde{m}^{(m)}(\Theta) = \alpha_m + (1 - \alpha_m)\, m^{(m)}(\Theta)$$
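A minimal Python sketch of the BPA construction and dynamic discounting follows. Because a literal sum of the three weighted terms over all states can exceed 1, the sketch adds a normalization step (an assumption of ours, not stated above) so that the result stays a valid BPA with the residual mass assigned to Θ.

```python
def build_bpa(states: list, c_hist: dict, c_stab: dict, mu: float,
              a: float = 0.4, b: float = 0.3) -> dict:
    """m(s_i) = α·c_hist(s_i) + β·c_stab(s_i) + (1-α-β)·μ, with residual mass on Θ ("Theta")."""
    raw = {s: a * c_hist[s] + b * c_stab[s] + (1 - a - b) * mu for s in states}
    total = sum(raw.values())
    if total > 1.0:                      # keep Σ m(A) ≤ 1 so the assignment is a valid BPA
        raw = {s: v / total for s, v in raw.items()}
        total = 1.0
    raw["Theta"] = 1.0 - total           # unassigned mass expresses residual uncertainty
    return raw

def discount(m: dict, alpha_m: float) -> dict:
    """Dynamic discount: m̃(A) = (1-α_m)·m(A); m̃(Θ) = α_m + (1-α_m)·m(Θ)."""
    out = {A: (1 - alpha_m) * v for A, v in m.items() if A != "Theta"}
    out["Theta"] = alpha_m + (1 - alpha_m) * m.get("Theta", 0.0)
    return out

# Usage: a noisier modality (larger α_m) keeps less committed mass on its singletons.
m_vision = build_bpa(["heating", "rolling"], {"heating": 0.9, "rolling": 0.2},
                     {"heating": 0.8, "rolling": 0.3}, mu=0.7)
m_vision = discount(m_vision, alpha_m=0.15)
```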

2.3.2. D-S Evidence Combination and Anomaly Tolerance

Based on discounted BPA, the Dempster combination rule is applied to synthesize multi-modal evidence:
$$m(s) = \frac{\sum_{A \cap B = \{s\}} m_1(A)\, m_2(B)}{1 - K}, \qquad K = \sum_{A \cap B = \emptyset} m_1(A)\, m_2(B)$$
where $m_1(A)$ and $m_2(B)$ are the BPAs of modalities 1 and 2, and $K$ is the conflict factor. If $K > \theta_K = 0.8$ (high conflict, with $\theta_K$ the conflict threshold), the method switches to soft fusion to avoid failure of the combination rule:
$$s_{\text{fused}} = \arg\max_{s \in \Theta} \big( \mu_1\, m_1(s) + \mu_2\, m_2(s) \big)$$
where $s_{\text{fused}}$ is the fused state, $\mu_1$ and $\mu_2$ are modal weights, and $\arg\max(\cdot)$ returns the argument of the maximum.
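The combination rule and its high-conflict fallback can be sketched as follows, restricting the frame to singleton states plus Θ for simplicity; the modal weights and the example BPAs at the end are synthetic.

```python
def dempster_combine(m1: dict, m2: dict):
    """Dempster's rule over singleton hypotheses plus the full frame Θ ("Theta")."""
    frame = (set(m1) | set(m2)) - {"Theta"}
    fused = {s: 0.0 for s in frame}
    fused["Theta"] = 0.0
    K = 0.0                                      # conflict factor: mass on empty intersections
    for A, va in m1.items():
        for B, vb in m2.items():
            if A == "Theta":
                inter = B
            elif B == "Theta":
                inter = A
            elif A == B:
                inter = A
            else:
                inter = None                     # distinct singletons: empty intersection
            if inter is None:
                K += va * vb
            else:
                fused[inter] += va * vb
    if K < 1.0:                                  # Dempster normalization by 1 - K
        fused = {s: v / (1.0 - K) for s, v in fused.items()}
    return fused, K

def fuse_states(m1: dict, m2: dict, mu1: float = 1.0, mu2: float = 1.0, theta_K: float = 0.8):
    """Combine two modalities; fall back to confidence-weighted soft fusion when K > θ_K."""
    fused, K = dempster_combine(m1, m2)
    singles = [s for s in fused if s != "Theta"]
    if K > theta_K:                              # high conflict: s* = argmax(μ1·m1 + μ2·m2)
        return max(singles, key=lambda s: mu1 * m1.get(s, 0.0) + mu2 * m2.get(s, 0.0)), K
    return max(singles, key=lambda s: fused[s]), K

# Usage with two hypothetical modal BPAs over the states {"heating", "rolling"}.
m_signal = {"heating": 0.7, "rolling": 0.1, "Theta": 0.2}
m_vision = {"heating": 0.6, "rolling": 0.2, "Theta": 0.2}
state, K = fuse_states(m_signal, m_vision)       # -> ("heating", small K)
```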
Simultaneously, introduce spatiotemporal alignment mechanisms to eliminate modal deviations:
(1)
Temporal alignment: estimates the inter-modal time offset $\tau_m = \arg\min_{\tau} \sum_t \| \phi_{\text{ref}}(t) - \phi_m(t + \tau) \|^2$, where $\phi_{\text{ref}}$ is the reference modal trajectory, $\phi_m$ is the $m$-th modal trajectory, $\tau$ is the time offset, and $\tau_m$ is the optimal offset;
(2)
Spatial alignment: corrects spatial differences via equipment coordinate system calibration, ensuring multi-modal position vectors in the same coordinate system.
After alignment, a fused state $s_{\text{fused}}(t)$ and path identifiers are generated, concatenating visual features with process attributes.
To handle anomalies from occlusion or temporal misalignment, design a hierarchical tolerance mechanism:
(1)
Conflict detection: identifies anomalies via the conflict factor $K > \theta_K$ or confidence fluctuation $|\mu(t) - \mu(t-1)| > \epsilon_\mu = 0.2$;
(2)
Branch path generation: generates branch paths for anomalous trajectories, recording the timestamp, position vector $x_t$, and confidence $\mu$ of each branch;
(3)
Multi-frame voting selection: computes branch scores within the time window $W = [t, t + 3\Delta t]$ and selects the highest-scoring branch as the primary trajectory, $\mathrm{Score}(T_i) = \sum_{j \in W} \mu(e_{ij}) \exp(-\lambda (t - t_j))$, as sketched after this list;
(4)
Backtracking repair and manual review: repairs inconsistent segments (e.g., position jumps) in primary trajectory via backtracking; triggers manual review if conflicts persist over 3 frames.
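A small sketch of the branch scoring and multi-frame voting step is given below; the decay rate λ and the branch observations are hypothetical.

```python
import math

def branch_score(branch: list, t_now: float, lam: float = 0.5) -> float:
    """Score(T_i) = Σ_j μ(e_ij)·exp(-λ(t_now - t_j)) over the observations in the voting window."""
    return sum(obs["mu"] * math.exp(-lam * (t_now - obs["t"])) for obs in branch)

def select_primary(branches: dict, t_now: float) -> str:
    """Pick the branch trajectory with the highest consistency score as the primary trajectory."""
    return max(branches, key=lambda bid: branch_score(branches[bid], t_now))

# Usage with two hypothetical branch trajectories generated after a conflict at t ≈ 10 s.
branches = {
    "branch_A": [{"t": 10.2, "mu": 0.90}, {"t": 10.4, "mu": 0.85}, {"t": 10.6, "mu": 0.90}],
    "branch_B": [{"t": 10.2, "mu": 0.60}, {"t": 10.4, "mu": 0.40}],
}
primary = select_primary(branches, t_now=10.6)    # -> "branch_A"
```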

3. Experiments

To verify the effectiveness of the multi-target tracking method based on state modeling and multi-source fusion in the special steel scenario, experiments were conducted on the hot processing production line of a special steel enterprise.

3.1. Experimental Setup

3.1.1. Experimental Scenario and Problem Definition

Experiments were conducted on a hot processing line for seamless steel pipes in a special steel enterprise. The line includes 10 core processes (billet sawing, annular heating furnace, piercing, pipe rolling, etc.) and five process intersection scenarios (before billet sawing, before/after cooling bed, etc.). Core challenges include scenario complexity—continuous processes require high-frequency position tracking, while discrete processes need batch ID association—and data heterogeneity—equipment signals, sensor feedback, and video streams differ in format/frequency and are easily disturbed by high-temperature glare or signal loss. The experiment aims to verify whether the multi-source fusion method achieves four-dimensional (“position–time–state–ID”) accurate tracking of steel pipes, with fault tolerance for anomalies like equipment jamming. Inputs include three heterogeneous data types: main equipment signals (action commands and parameters like temperature for charging machines/rolling mills), auxiliary mechanism signals (position/detection signals for transfer chains/HMD/CMD), and industrial video streams (1080p images from five high-temperature cameras). Outputs are real-time tracking results (position error ≤ 5 cm, ID retention ≥ 95%) and anomaly warnings (time, location, and handling suggestions).

3.1.2. Experimental Data

All data were collected from the above line via an edge computing platform with multi-protocol compatibility, verified by the MES system. Main equipment data cover core production devices, with switch signals (e.g., furnace door control) and analog signals (e.g., rolling force), sampled at 1 Hz for 1000 h, yielding 3.6 × 10⁶ valid entries (covering Φ159–Φ460 mm pipes). Preprocessing used five-point moving average filtering, with invalid data <8%. Auxiliary mechanism data cover transfer chains/HMD, with position/detection/state signals, sampled at 2 Hz for 1000 h, yielding 7.2 × 10⁶ valid entries (92% correlated with main equipment data). Preprocessing synchronized the data to 1 Hz via NTP and removed maintenance-period data. Industrial video streams were collected by five cameras (1920 × 1080, 30 fps) for 500 h, covering 10,000 pipes (8000 with “signal–video” records). Preprocessing removed frames invalidated by oil mist, and bounding boxes/IDs were manually annotated. The dataset includes normal scenarios (stable production) and abnormal scenarios (equipment jamming, network disconnection, glare).

3.1.3. Experimental Environment

Hardware included an Intel Xeon E5-2698 CPU (2.3 GHz, 32 cores) and NVIDIA RTX 4060 GPU (8 GB VRAM), adapting to industrial-embedded deployment. Software used Ubuntu 22.04 LTS, PyTorch 2.0 (deep learning), Drools 7.74 (rule engine), and Python 3.9, with libraries like OpenCV 4.8.0 and NumPy 1.25.2.

3.1.4. Verification Scheme

Verification focused on three aspects: logical reasoning verification used 1000 h of main/auxiliary equipment time-series data (with MES info) to test the accuracy/robustness of the basic rule layer, the state consistency layer, and the full system layer. Visual perception verification split 500 h videos into 7:1:2 (train/val/test), with manual annotations as the ground truth, comparing YOLOv8-SE + improved DeepSORT with traditional schemes. Multi-source fusion verification tested 8000 pipes with joint records, comparing single-logic, single-vision, proposed fusion, and traditional fusion in process intersection/abnormal scenarios, using manual inspection trajectories as the final ground truth.

3.1.5. Core Parameter Selection

Parameters were determined based on production line equipment configuration and process characteristics, taking the sawing area as an example. Process-constrained parameters were confirmed via equipment hard-constraint derivation and full-factor experiments, including roller speed ratio k (optimal 0.85), anchor position threshold ε p (optimal 0.05 m), error time threshold ε t (optimal 0.5 s), and state window Δ t (optimal 1.5 s), ensuring no safety boundary violations and optimal indicators. Algorithm-structured parameters were adjusted based on algorithm characteristics and resources: detection-end confidence threshold τ conf (optimal 0.35, recall > 94%, false detection < 6%), NMS threshold τ nms (0.5), SE compression ratio r (optimal 16), focal loss parameter γ (optimal 2); DeepSORT-end Mahalanobis distance threshold τ max_dist (optimal 0.2), appearance similarity threshold τ app (optimal 0.45), trajectory smoothing weight α traj (optimal 0.6), and conflict handling parameter θ K (optimal 0.8). Sensitivity analysis used single-factor perturbation, testing within ±20% of optimal values: SAR fluctuation of ε p < 9.0%, mAP@0.5 fluctuation of r < 5.1%, and SAR fluctuation of θ K < 1.6%, verifying the parameter adjustability.

3.2. Evaluation Metrics

On the basis of common target-tracking evaluation metrics [38], new metrics are defined in Table 1 for the material tracking of special steel products.

3.3. Performance Quantification of Three-Layer Rule-Based Inference

To quantify the inference performance of different rule layers (atomic rule L1, state consistency check L1 + L2, and full constraint L1 + L2 + L3) in the discrete–continuous hybrid scenario, six types of experimental scenarios were designed, and four core indicators were selected for comparative analysis, as shown in Figure 5. Light-colored rectangles in the figure mark the performance degradation relative to the normal scenario, while Δ values indicate the difference in performance attenuation between 10% and 20% interference.
The experimental results show the following:
(1)
In the normal scenario without data interference, the three-layer rule model (L1 + L2 + L3) achieves the best performance: state inference accuracy rate (SAR) reaches 97.8%, anomaly detection rate (ADR) 98.5%, anomaly blocking success rate (BSR) 99.0%, and anomaly localization precision (ALP) 98.5%. All indicators are over 10% higher than those of the L1 model. The performance of the L1 + L2 model falls between L1 and L1 + L2 + L3. This indicates that the layered rule system can effectively integrate process specifications, reduce semantic ambiguity, and improve inference accuracy.
(2)
In scenarios with data missing or redundancy, the performance of all models decreases, but the three-layer model exhibits significantly stronger robustness. In the data redundancy scenario, the three-layer model achieves a BSR of 88.0% (10% higher than L1) and an ALP of 88.0% (10% higher than L1), with a smaller performance attenuation difference (Δ). This demonstrates that the interlock/anti-redundancy logic and causal chain reasoning of the high-level constraint layer can effectively suppress interference and reduce the rate of performance attenuation.

3.4. Visual Perception Performance

To intuitively compare the comprehensive performance differences between the “YOLOv8-SE + improved DeepSORT” combination and the “original YOLOv8 + standard DeepSORT” combination across multi-dimensional indicators, a multi-indicator radar chart (Figure 6) was used for visualization. The experiments covered five typical industrial visual scenarios: normal illumination, high-temperature glare, partial occlusion, blur, and an integrated scenario. To address the issues of inconsistent dimensions and evaluation trends among indicators, the original data were processed as follows:
(1)
Min–max normalization was used to map all indicators to the [0, 100] range.
(2)
The number of ID switches (IDS) was reversed using the “maximum value reversal method” (reversed IDS = maximum IDS value − original IDS) before normalization, ensuring all indicators follow the “higher value = better performance” evaluation standard.
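The preprocessing in steps (1) and (2) amounts to a few lines of NumPy, sketched below; the IDS values are invented solely to illustrate the reversal-then-normalization order.

```python
import numpy as np

def normalize_for_radar(values: np.ndarray, reverse: bool = False) -> np.ndarray:
    """Min–max normalize an indicator to [0, 100]; reverse IDS first so higher is always better."""
    v = values.max() - values if reverse else values          # maximum-value reversal for IDS
    lo, hi = v.min(), v.max()
    return 100.0 * (v - lo) / (hi - lo) if hi > lo else np.full(v.shape, 50.0)

ids = np.array([12.0, 8.0, 20.0, 5.0])           # illustrative IDS counts for four scenarios
radar_scores = normalize_for_radar(ids, reverse=True)
```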
The experimental results show the following:
(1)
The “YOLOv8-SE + improved DeepSORT” combination exhibits significant advantages in the normal illumination scenario: mean average precision (mAP@0.5) reaches 90.5% (5.5% higher than the “original YOLOv8 + standard DeepSORT” combination), multiple object tracking accuracy (MOTA) 82.0% (7.0% higher than YOLOv8), and IDS is reduced by 33.3%. This reflects the enhancement effect of the squeeze-and-excitation (SE) module on key features.
(2)
YOLOv8-SE shows stronger robustness in interference scenarios: as scenario complexity increases (e.g., high-temperature glare → partial occlusion → blur), the performance of both models decreases, but the degradation amplitude of YOLOv8-SE is smaller.
In summary, by enhancing weak features through the SE channel attention mechanism and optimizing the feature weight allocation strategy, YOLOv8-SE outperforms YOLOv8 in all visual scenarios and multi-dimensional indicators, providing a more reliable visual perception foundation for subsequent multi-source fusion.

3.5. Multi-Source Fusion Performance

Experiments were divided into two parts: comparison between fusion and single-mode baselines, and comparison among different fusion strategies.

3.5.1. Comparison Between Fusion and Single-Mode Baselines

To verify the performance improvement of the multi-source fusion method (multi-source) compared to single-mode baselines (rule-only, vision-only), comparative analyses were conducted across six scenarios from three dimensions: rule-based inference, visual perception, and comprehensive recovery. The results are presented in the scenario-specific multi-indicator bar chart (Figure 7).
The experimental results show the following:
(1)
In the normal scenario without data interference, the multi-source fusion method outperforms single-mode baselines in all indicators. Rule-based inference dimension: SAR reaches 99.8% (2.0% higher than rule-only), and ADR 99.9% (1.4% higher than rule-only), benefiting from the correction of rule ambiguity by visual data. Visual perception dimension: mAP@0.5 reaches 99.7% (9.2% higher than vision-only), and MOTA 99.8% (17.8% higher than vision-only), due to the filtering of visual false detections by rule constraints. Comprehensive recovery dimension: association recovery rate (ARR) reaches 99.0% (3–4% higher than both baselines), reflecting the advantage of “rule-visual” bidirectional verification.
(2)
The multi-source fusion method exhibits outstanding anti-interference capability in scenarios with data missing or redundancy. With 20% of the data missing, the SAR of rule-only drops to 72.0% (26.4% lower than the normal scenario), while the SAR of multi-source remains at 96.4% (only 3.4% lower), as visual data compensates for the missing rule inputs. With 20% data redundancy, the mAP@0.5 of vision-only drops to 80.0% (10.5% lower than the normal scenario), while the mAP@0.5 of multi-source reaches 96.7% (only 3.0% lower), as rule constraints eliminate redundant visual noise.
(3)
In the integrated scenario (mixed interference), the multi-source fusion method maintains core indicators above 98%: SAR = 98.2% (10.2% higher than rule-only), mAP@0.5 = 97.7% (11.7% higher than vision-only), and recall = 98.2% (9.2% higher than baselines). This verifies its core value of “complementary fault tolerance” in complex industrial environments—rule-based inference resolves visual ambiguity, while visual perception supplements missing rule information, forming a closed-loop verification.

3.5.2. Comparison of Different Fusion Strategies

To quantify the performance advantage of the proposed fusion strategy (ours) compared to traditional strategies (weighted average, max confidence, rule priority), a 3D heatmap (Figure 8) of “indicator–strategy–scenario” was used for comparative analysis. Each heatmap corresponds to one core indicator; rows represent fusion strategies, columns represent experimental scenarios, and color depth reflects performance (darker blue indicates a better performance for positive indicators, while lighter red indicates a better performance for IDS). The original values are labeled in each cell to quantify differences.
The experimental results show the following:
(1)
Our strategy achieves the best performance across all indicators: In the normal scenario, our strategy reaches SAR = 99.8% (4.8% higher than the second-best rule priority) and ADR = 99.9% (8.9% higher than the second-best weighted average), benefiting from the “dynamic confidence weighting + spatiotemporal alignment” mechanism to correct rule ambiguity. With 20% data redundancy, our strategy achieves mAP@0.5 = 96.7% (11.7% higher than max confidence) and MOTA = 97.2% (18.2% higher than weighted average), as the optimized dynamic matching strategy of DeepSORT enhances ID persistence for similar materials.
(2)
The performance degradation amplitude of our strategy is significantly smaller than that of traditional strategies: For SAR, the degradation of our strategy from the normal to 20% missing scenario is 3.4%, while that of the weighted average strategy is 13.7%. This difference stems from the “anomaly fault tolerance module” of our strategy—which suppresses the weight of interfering data through dynamic confidence allocation, avoiding the sharp performance drop of traditional strategies caused by “over-reliance on a single data source”.

4. Conclusions

To address the core challenges in material tracking for the discrete–continuous hybrid scenario of special steel production—difficult material state modeling under complex processes, insufficient visual tracking accuracy for similar materials, and unstable multi-source decision-making due to conflicts—this study proposes a three-layer material tracking method integrating “semantic modeling–visual perception–multi-source fusion.” The key outcomes and problem-solving value are as follows:
(1)
For the challenge of difficult state modeling (fixed path assumptions, process intersection conflicts, and separated inference-correction in existing SES), the semantic modeling layer combines a process-decoupled state–event system (SES) with a Drools hierarchical rule engine and multi-anchor dynamic calibration. Experimental results show a state inference accuracy (SAR) of 97.8% in the normal scenario, verifying that the three-layer reasoning system effectively resolves process intersection conflicts. Additionally, the multi-anchor correction suppresses trajectory accumulation errors by over 25% compared to traditional static modeling, enabling reliable state inference for small-batch and multi-specification production.
(2)
For the challenge of insufficient visual tracking accuracy (weak feature extraction under high-temperature glare, frequent ID switches, and trajectory breakage in occlusion), the visual perception layer adopts YOLOv8-SE (embedded with SE channel attention) and an improved DeepSORT. Under complex working conditions (e.g., high-temperature glare and partial occlusion), this layer achieves a 5.5–8.0% improvement in mAP@0.5 (reaching 90.5% in glare scenarios) and a reduction of over 30% in ID switches (IDS). These results confirm that the SE module enhances weak feature discriminability for similar materials, while the dynamic matching strategy of DeepSORT ensures trajectory continuity in multi-steel parallel transmission.
(3)
For the challenge of unstable multi-source decision-making (temporal deviations, semantic heterogeneity, and low conflict tolerance in traditional D-S fusion), the multi-source fusion layer introduces dynamic confidence weighting, spatiotemporal alignment, and multi-frame consistency voting based on improved D-S evidence theory. In scenarios with 20% data missing, the state inference accuracy (SAR) remains 96.4%—9.0% higher than that of single-mode methods (rule-only: 72.0%; vision-only: 85.0%). This proves that the spatiotemporal alignment mechanism eliminates modal deviations and the consistency voting module enhances conflict tolerance, ensuring stable decision-making under heterogeneous data interference.
Experiments on an actual special steel production line demonstrate that the proposed method achieves an end-to-end traceability compliance rate of 99.98%. Its inference accuracy under complex working conditions is over 9.0% higher than that of single-mode methods and outperforms traditional fusion strategies by 12% in conflict tolerance. This method provides an engineering-feasible path for full-process material traceability in hybrid manufacturing scenarios (e.g., special steel production). Future work will extend its application to chemical and metallurgical fields with similar discrete–continuous characteristics and adapt it to edge deployment through model lightweighting (e.g., pruning and quantization).

Author Contributions

K.Y. proposed the methodology of this study, participated in the conceptualization of the research framework, conducted relevant investigation work, and completed the writing—original draft; X.X. (corresponding author) participated in the conceptualization of the research, acquired funding for the project, provided necessary resources to support the research, and was responsible for writing—review and editing; Y.Z. took charge of the project administration and provided supervision for the entire research process; G.L. participated in the writing—review and editing of the manuscript to improve its quality; X.L. provided supervision for the research and participated in relevant investigation work; F.Z. provided comments on the manuscript review, assisted in analyzing and addressing reviewer suggestions, and contributed to optimizing the manuscript’s logic and technical accuracy. All authors have read and agreed to the published version of the manuscript.

Funding

The study is financially supported by the National Natural Science Foundation of China (NSFC) under Grant No. U21A20483.

Data Availability Statement

The data cannot be publicly deposited in open repositories due to industrial privacy restrictions of the enterprise. To ensure compliance with academic ethics and the enterprise’s confidentiality agreement, any data access request will require confirmation from all authors (to align with the manuscript’s joint authorship rights) and approval from the collaborating special steel enterprise.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zafar, M.H.; Langas, E.F.; Sanfilippo, F. Exploring the Synergies between Collaborative Robotics, Digital Twins, Augmentation, and Industry 5.0 for Smart Manufacturing: A State-of-the-Art Review. Robot. Comput.-Integr. Manuf. 2024, 89, 102769. [Google Scholar] [CrossRef]
  2. Li, J.; Wang, X.; Yang, Q.; Sun, Y.; Zhao, J.; Mao, X.; Qie, H. IoT-Based Framework for Digital Twins in Steel Production: A Case Study of Key Parameter Prediction and Optimization for CSR. Expert Syst. Appl. 2024, 250, 123909. [Google Scholar] [CrossRef]
  3. Kasilingam, S.; Yang, R.; Singh, S.K.; Farahani, M.A.; Rai, R.; Wuest, T. Physics-Based and Data-Driven Hybrid Modeling in Manufacturing: A Review. Prod. Manuf. Res. 2024, 12, 2305358. [Google Scholar] [CrossRef]
  4. Vasan, V.; Sridharan, N.V.; Vaithiyanathan, S.; Aghaei, M. Detection and Classification of Surface Defects on Hot-Rolled Steel Using Vision Transformers. Heliyon 2024, 10, e38498. [Google Scholar] [CrossRef]
  5. Tang, B.; Chen, L.; Sun, W.; Lin, Z. Review of Surface Defect Detection of Steel Products Based on Machine Vision. IET Image Process. 2023, 17, 303–322. [Google Scholar] [CrossRef]
  6. Ogunrinde, I.; Bernadin, S. Improved DeepSORT-Based Object Tracking in Foggy Weather for AVs Using Sematic Labels and Fused Appearance Feature Network. Sensors 2024, 24, 4692. [Google Scholar] [CrossRef]
  7. Chang, X.; Yan, X.; Qiu, B.; Wei, M.; Liu, J.; Zhu, H. Anomaly Detection and Confidence Interval-based Replacement in Decay State Coefficient of Ship Power System. IET Intell. Transp. Syst. 2024, 18, 2409–2439. [Google Scholar] [CrossRef]
  8. Chen, L.; Bi, G.; Yao, X.; Tan, C.; Su, J.; Ng, N.P.H.; Chew, Y.; Liu, K.; Moon, S.K. Multisensor Fusion-Based Digital Twin for Localized Quality Prediction in Robotic Laser-Directed Energy Deposition. Robot. Comput.-Integr. Manuf. 2023, 84, 102581. [Google Scholar] [CrossRef]
  9. Shi, L.; Ding, Y.; Cheng, B. Development and Application of Digital Twin Technique in Steel Structures. Appl. Sci. 2024, 14, 11685. [Google Scholar] [CrossRef]
  10. Huang, G.; Huang, H.; Zhai, Y.; Tang, G.; Zhang, L.; Gao, X.; Huang, Y.; Ge, G. Multi-Sensor Fusion for Wheel-Inertial-Visual Systems Using a Fuzzification-Assisted Iterated Error State Kalman Filter. Sensors 2024, 24, 7619. [Google Scholar] [CrossRef]
  11. Engelmann, B.; Schmitt, A.-M.; Heusinger, M.; Borysenko, V.; Niedner, N.; Schmitt, J. Detecting Changeover Events on Manufacturing Machines with Machine Learning and NC Data. Appl. Artif. Intell. 2024, 38, 2381317. [Google Scholar] [CrossRef]
  12. Chen, H.; Wang, J.; Shao, K.; Liu, F.; Hao, J.; Guan, C.; Chen, G.; Heng, P.-A. Traj-Mae: Masked Autoencoders for Trajectory Prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 8351–8362. [Google Scholar]
  13. Hennebold, C.; Islam, M.M.; Krauß, J.; Huber, M.F. Combination of Process Mining and Causal Discovery Generated Graph Models for Comprehensive Process Modeling. Procedia CIRP 2024, 130, 1296–1302. [Google Scholar] [CrossRef]
  14. Gaugel, S.; Reichert, M. Industrial Transfer Learning for Multivariate Time Series Segmentation: A Case Study on Hydraulic Pump Testing Cycles. Sensors 2023, 23, 3636. [Google Scholar] [CrossRef]
  15. Dogan, O.; Areta Hiziroglu, O. Empowering Manufacturing Environments with Process Mining-Based Statistical Process Control. Machines 2024, 12, 411. [Google Scholar] [CrossRef]
  16. Jackson, I.; Jesus Saenz, M.; Ivanov, D. From Natural Language to Simulations: Applying AI to Automate Simulation Modelling of Logistics Systems. Int. J. Prod. Res. 2024, 62, 1434–1457. [Google Scholar] [CrossRef]
  17. Akhramovich, K.; Serral, E.; Cetina, C. A Systematic Literature Review on the Application of Process Mining to Industry 4.0. Knowl. Inf. Syst. 2024, 66, 2699–2746. [Google Scholar] [CrossRef]
  18. Zhou, L.; Zhang, L.; Konz, N. Computer Vision Techniques in Manufacturing. IEEE Trans. Syst. Man Cybern. Syst. 2022, 53, 105–117. [Google Scholar] [CrossRef]
  19. Gui, Z.; Geng, J. YOLO-ADS: An Improved YOLOv8 Algorithm for Metal Surface Defect Detection. Electronics 2024, 13, 3129. [Google Scholar] [CrossRef]
  20. Li, M.; Yu, Z.; Fang, L.; Meng, Y.; Zhang, T. GCP-YOLO Detection Algorithm for PCB Defects. In Proceedings of the 2024 4th International Symposium on Computer Technology and Information Science (ISCTIS), Xi’an, China, 12–14 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 492–495. [Google Scholar]
  21. Jing, Z.; Li, S.; Zhang, Q. YOLOv8-STE: Enhancing Object Detection Performance Under Adverse Weather Conditions with Deep Learning. Electronics 2024, 13, 5049. [Google Scholar] [CrossRef]
  22. Wojke, N.; Bewley, A.; Paulus, D. Simple Online and Realtime Tracking with a Deep Association Metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 3645–3649. [Google Scholar]
  23. Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, T.; Meng, H. Strongsort: Make Deepsort Great Again. IEEE Trans. Multimed. 2023, 25, 8725–8737. [Google Scholar] [CrossRef]
  24. Wang, Y.; Qing, Y.; Huang, K.; Dang, C.; Wu, Z. Preformer MOT: A Transformer-Based Approach for Multi-Object Tracking with Global Trajectory Prediction. Fundam. Res. 2025, in press. [Google Scholar]
  25. Liang, J.; Cheng, J. Mirror Target YOLO: An Improved YOLOv8 Method with Indirect Vision for Heritage Buildings Fire Detection. IEEE Access 2025, 13, 11195–11203. [Google Scholar] [CrossRef]
  26. Tang, J.; Ye, C.; Zhou, X.; Xu, L. YOLO-Fusion and Internet of Things: Advancing Object Detection in Smart Transportation. Alex. Eng. J. 2024, 107, 1–12. [Google Scholar] [CrossRef]
  27. Liu, Z.; Wei, J.; Li, R.; Zhou, J. SFusion: Self-Attention Based N-to-One Multimodal Fusion Block. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2023; Greenspan, H., Madabhushi, A., Mousavi, P., Salcudean, S., Duncan, J., Syeda-Mahmood, T., Taylor, R., Eds.; Lecture Notes in Computer Science; Springer Nature Switzerland: Cham, Switzerland, 2023; Volume 14221, pp. 159–169. ISBN 978-3-031-43894-3. [Google Scholar]
  28. Sheng, W.; Shen, J.; Huang, Q.; Liu, Z.; Ding, Z. Multi-Objective Pedestrian Tracking Method Based on YOLOv8 and Improved DeepSORT. Math. Biosci. Eng. 2024, 21, 1791–1805. [Google Scholar] [CrossRef]
  29. Cao, Z.; Huang, Z.; Pan, L.; Zhang, S.; Liu, Z.; Fu, C. Towards Real-World Visual Tracking with Temporal Contexts. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15834–15849. [Google Scholar] [CrossRef] [PubMed]
  30. Song, X.; Zhang, T.; Yi, W. An Improved YOLOv8 Safety Helmet Wearing Detection Network. Sci. Rep. 2024, 14, 17550. [Google Scholar] [CrossRef] [PubMed]
  31. Hua, Z.; Jing, X. An Improved Belief Hellinger Divergence for Dempster-Shafer Theory and Its Application in Multi-Source Information Fusion. Appl. Intell. 2023, 53, 17965–17984. [Google Scholar] [CrossRef]
  32. Wang, Y.; Peng, J.; Zhang, J.; Yi, R.; Wang, Y.; Wang, C. Multimodal Industrial Anomaly Detection via Hybrid Fusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 8032–8041. [Google Scholar]
  33. Tao, C.; Cao, X.; Du, J. G2SF-MIAD: Geometry-Guided Score Fusion for Multimodal Industrial Anomaly Detection. arXiv 2025, arXiv:2503.10091. [Google Scholar]
  34. El-Din, D.M.; Hassanein, A.E.; Hassanien, E.E. An Adaptive and Late Multifusion Framework in Contextual Representation Based on Evidential Deep Learning and Dempster–Shafer Theory. Knowl. Inf. Syst. 2024, 66, 6881–6932. [Google Scholar] [CrossRef]
  35. Shao, Z.; Dou, W.; Pan, Y. Dual-Level Deep Evidential Fusion: Integrating Multimodal Information for Enhanced Reliable Decision-Making in Deep Learning. Inf. Fusion 2024, 103, 102113. [Google Scholar] [CrossRef]
  36. Qu, X.; Liu, Z.; Wu, C.Q.; Hou, A.; Yin, X.; Chen, Z. MFGAN: Multimodal Fusion for Industrial Anomaly Detection Using Attention-Based Autoencoder and Generative Adversarial Network. Sensors 2024, 24, 637. [Google Scholar] [CrossRef]
  37. Wang, H.; Ji, Z.; Lin, Z.; Pang, Y.; Li, X. Stacked Squeeze-and-Excitation Recurrent Residual Network for Visual-Semantic Matching. Pattern Recognit. 2020, 105, 107359. [Google Scholar] [CrossRef]
  38. Shen, J.; Yang, H. Multi-Object Tracking Model Based on Detection Tracking Paradigm in Panoramic Scenes. Appl. Sci. 2024, 14, 4146. [Google Scholar] [CrossRef]
Figure 1. System architecture diagram.
Figure 2. YOLOv8-SE network architecture diagram.
Figure 3. SE module schematic diagram.
Figure 4. Multi-source fusion and decision-making flowchart.
Figure 5. Performance comparison of three-layer rule-based inference.
Figure 6. Detection performance comparison between YOLOv8-SE and YOLOv8.
Figure 7. Performance comparison of multi-source fusion and single-source approaches.
Figure 8. Performance comparison of different fusion strategies across multiple scenarios.
Table 1. Definition of evaluation indicators.

Indicator Name | Abbreviation | Definition and Formula
State Reasoning Accuracy Rate | SAR | (Number of Correct States / Total Number of States) × 100%
Anomaly Detection Rate | ADR | (Number of Correctly Detected Anomalies / Total Number of Anomalies) × 100%
Anomaly Blocking Success Rate | BSR | (Number of Successfully Blocked Incorrect Updates / Number of Detected Anomalies) × 100%
Anomaly Localization Precision | ALP | (Number of Correctly Localized Anomalies / Number of Detected Anomalies) × 100%
Mean Average Precision | mAP@0.5 | Average precision at an Intersection over Union (IoU) threshold of 0.5
Multiple Object Tracking Accuracy | MOTA | Tracking accuracy integrating missed detections, false detections, and ID switches
Number of ID Switches | IDS | Total number of ID switches during the tracking process
Association Recovery Rate | ARR | (Number of Successfully Recovered Associations / Total Number of Interrupted Associations) × 100%
Precision | Precision | (True Positives / (True Positives + False Positives)) × 100%
Recall | Recall | (True Positives / (True Positives + False Negatives)) × 100%

Share and Cite

MDPI and ACS Style

Yang, K.; Xiao, X.; Zhang, Y.; Liu, G.; Li, X.; Zhang, F. A Multi-Source Fusion-Based Material Tracking Method for Discrete–Continuous Hybrid Scenarios. Processes 2025, 13, 3727. https://doi.org/10.3390/pr13113727
