1. Introduction
Iron foundry is one of the most mature and widely adopted manufacturing technologies, and it enables the large-scale production of metallic components with complex geometries and demanding mechanical requirements. Global cast iron production accounts for hundreds of millions of tonnes per year, supplying key industrial sectors such as automotive, energy, railway transportation, and heavy machinery [1]. Within this domain, ductile (also known as nodular) iron has become a preferred material for critical components due to its improved strength, ductility, fatigue resistance, and damage tolerance when compared to grey cast iron [2].
From a process point of view, the iron casting manufacturing process involves a sequence of tightly coupled stages, including (i) metal melting, (ii) pouring into moulds, (iii) solidification under controlled thermal conditions, (iv) extraction of the solidified part, and (v) subsequent forming or finishing operations. Among these stages, the pouring phase plays a decisive role in determining the final quality of the casting. During pouring, molten metal is transferred into the mould cavity, where its flow behaviour, temperature evolution, and interaction with the mould directly affect filling completeness, microstructural development, and defect formation. Continuous and semi-continuous pouring processes offer significant advantages, such as high production efficiency, improved dimensional repeatability, and reduced energy consumption due to controlled solidification and elimination of auxiliary cooling steps [
3,
4]. Nevertheless, these benefits come at the cost of increased process sensitivity. Hence, minor disturbances during pouring can propagate downstream, resulting in defects that are difficult to detect and costly to correct.
Despite the extensive industrial experience achieved by foundries, the pouring phase remains particularly prone to quality deviations. Common defects include incomplete filling, cold shuts, poor filling quality, and surface or internal imperfections such as inclusions, porosity, segregation, and thermal cracking [
5]. Many of these defects are directly linked to diverse phenomena occurring during pouring, such as stream instability, turbulence, interruptions, or overflow episodes. Importantly, these events often develop over very short time windows and may not be captured by conventional monitoring systems. Nevertheless, identifying these defective cases in industrial scenarios poses a great challenge. This is mainly due to the scarcity of such examples under normal production conditions, which hinders data collection and annotation, requiring hundreds of expert hours. Although anomaly detection algorithms are often used, the techniques employed generally rely on low-dimensional statistical outlier detection or require at least a partially annotated dataset. Furthermore, relying solely on a single data modality (such as sensor signals or visual information) is often insufficient for comprehensive analysis [
6]. Many casting-related events and anomalies are better characterised by the fusion of multiple data sources. For instance, sensor readings may indicate changes in casting level or flow rates, while visual data may reveal surface defects or slag layer behaviour that are not captured by numerical sensors alone. Therefore, integrating both sensor and image data in a multimodal approach is essential to capture the full complexity of the process and improve detection performance [
7,
8].
In this sense, effective control of pouring parameters (e.g., metal temperature, pouring duration, inter-pour time, mould alignment, and flow stability) is critical to achieve consistent production quality. In industrial foundries, this control task presents a serious challenge due to the large number of sensors deployed along the production line, the high acquisition frequency of process signals, and the resulting data volume, which complicates real-time interpretation and decision making [9]. Traditional manual inspection or rule-based supervision strategies struggle to cope with this complexity and often fail to provide early warnings of emerging defects [10].
To address these limitations, automated monitoring and intelligent analysis systems have gained increasing attention. Specifically, artificial intelligence (AI) methods, particularly those based on machine learning (ML) and deep learning (DL), offer powerful tools for processing high-dimensional and heterogeneous data streams, enabling improved event detection, anomaly identification, and predictive capabilities [11]. In the context of casting processes, recent works have explored computer vision (CV) techniques for detecting surface defects and pouring anomalies [12, 13], as well as sensor-based approaches for the real-time supervision of thermal and flow-related variables [14, 15].
Nevertheless, many existing solutions remain limited by their unimodal nature. These solutions, which rely exclusively on either sensor data or visual information, often provide an incomplete view of the process [
6]. On the one hand, while sensors can capture global trends such as temperature or flow rate variations, they may overlook localised visual phenomena at the mould cup or pouring stream. On the other hand, vision-based systems may struggle under occlusions, glare, or variable lighting conditions without complementary process context. For this reason, multimodal approaches that combine image and sensor data have been increasingly advocated as a means to capture the full complexity of industrial processes and enhance robustness [
7,
8].
Beyond foundry characterisation, industrial digitisation efforts increasingly emphasise the need for higher-level integration and interpretability. In this sense, digital twins, the virtual representations of physical systems that remain synchronised with their real counterparts, have emerged as a key enabler for monitoring, optimisation, and decision support in manufacturing. Specifically, in casting processes, digital twin principles have been applied to optimise cooling strategies, control solidification behaviour, and reduce defect rates [
16,
17]. However, effective digital twin deployment requires structured, time-aligned, and semantically meaningful data describing not only sensor readings but also events, conditions, and operational context. This highlights the importance of frameworks capable of transforming raw observations into machine-readable representations suitable for integration with such systems.
Against this background, this work proposes a multimodal framework for the detection and assessment of pouring-related anomalies in industrial iron foundry processes. Rather than focusing solely on visual defect detection or isolated sensor thresholds, the proposed approach aims to unify heterogeneous evidence into a coherent and interpretable description of each pouring cycle. The system is designed to operate under real foundry conditions, accounting for occlusions, non-regular operational contexts, and transient disturbances that frequently occur in production environments.
The proposed framework combines visual analysis of pouring scenes with domain-specific process signals within an explainable MoE-style architecture for modular expert fusion. In particular, a visual detection backbone based on YOLO (You Only Look Once) identifies relevant Areas of Interest (AoI), such as pouring streams, mould cups, and protective elements, while temporal validation through object tracking ensures spatial consistency across frames. Regarding the detection of anomalous pouring events, we further encode visual dynamics through self-supervised VideoMAE (Video Masked Autoencoder) features [
18], enabling a compact representation of pouring behaviour over time, which can be used for anomalous event detection through an outlier-aware Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clustering algorithm [
19]. Crucially, the limited inductive biases present in transformer architectures allow for seamless integration of multimodal data. Hence, selected sensor signals are integrated to contextualise visual observations and reinforce interpretation through expert-informed reasoning. The current approach addresses several common issues related to anomaly detection in industrial settings. On the one hand, it allows us to pinpoint potential anomalies, greatly easing the burden on expert annotators to find and tag such events. On the other hand, it presents a valid approach to categorise various normal and anomalous industrial events, successfully integrating and leveraging multimodal information.
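As a rough illustration of the outlier-aware clustering step, the sketch below runs a minimal pure-Python DBSCAN over toy 2-D points that stand in for the VideoMAE embeddings. The toy data and the `eps` and `min_pts` values are assumptions chosen for illustration, not the settings used in this work. Points labelled -1 (noise) are the candidates that would be handed to an expert for review.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: one label per point, NOISE (-1) for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE            # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not None:
                continue                 # already assigned to a cluster
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)   # j is a core point: keep expanding
    return labels

# Toy stand-ins for per-event embeddings (hypothetical values).
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),   # one pouring regime
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),   # a second regime
    (2.5, 9.0),                                        # isolated event
]
labels = dbscan(embeddings, eps=0.3, min_pts=3)
anomalous_events = [i for i, label in enumerate(labels) if label == NOISE]
print(labels, anomalous_events)
```

In practice a library implementation operating on the full embedding dimensionality would be preferable; the point here is only that DBSCAN needs no labelled anomalies and leaves sparse events unassigned.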
The outputs of the different expert modules are aggregated into a structured and machine-readable JavaScript Object Notation (JSON) representation that captures the state, quality indicators, and contextual conditions of each pouring event. This representation is intended for interoperability with higher-level software layers, including supervisory platforms, Manufacturing Execution Systems (MESs), and digital twin environments. While the present work does not develop or validate a full digital twin itself, it provides the semantically enriched event-level information required to support such systems more effectively.
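To make the idea of an event-level JSON representation more concrete, the following sketch serialises one pouring event with Python's standard library. The schema is hypothetical: every field name and value below is an illustrative assumption, since the actual structure is not specified in this section.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class PouringEventRecord:
    """Hypothetical event-level record; all field names are illustrative."""
    event_id: int
    t_start: str                  # acquisition timestamps as ISO 8601 strings
    t_end: str
    context_state: str            # e.g. "normal_operation"
    quality_indicators: dict = field(default_factory=dict)
    sensor_context: dict = field(default_factory=dict)
    is_anomalous: bool = False

# Placeholder values, not measurements from the plant.
record = PouringEventRecord(
    event_id=1,
    t_start="2024-01-01T08:00:00.000Z",
    t_end="2024-01-01T08:00:04.500Z",
    context_state="normal_operation",
    quality_indicators={"stream_stable": True, "cup_filled": True},
    sensor_context={"metal_temperature_c": 1400.0},
)

payload = json.dumps(asdict(record), sort_keys=True)   # machine-readable output
decoded = json.loads(payload)                          # what a consumer would see
print(payload)
```

Keeping the record flat and typed makes it straightforward for a supervisory platform or MES to validate and index events without knowledge of the expert modules that produced them.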
In summary, the main operational objectives of this work are as follows:
1. To reliably identify and segment individual pouring events under real industrial conditions.
2. To detect and track key visual elements required for interpreting pouring dynamics and safety conditions.
3. To extract interpretable indicators describing stream stability, filling quality, and process compliance.
4. To find anomalous pouring events through unsupervised multimodal feature learning and clustering.
5. To unify heterogeneous evidence through an explainable Mixture-of-Experts (MoE)-style decision framework based on modular expert fusion.
6. To generate structured outputs suitable for integration with digital twin and supervisory platforms.
The remainder of this paper is organised as follows. Section 2 describes the industrial setup, data acquisition strategy, and methodological components of the proposed framework. Then, Section 3 presents the experimental evaluation of event detection, AoI analysis, unsupervised anomaly detection, and MoE-based decision unification. Finally, Section 4 discusses industrial deployment aspects and limitations, summarises the main contributions, and outlines future research directions.
2. Materials and Methods
To tackle the large number of challenges faced by researchers (see Section 1), we adopt the well-known divide-and-conquer methodology. The main objective, in line with the MoE approach, is to simplify the resolution of the original problem by decomposing it into smaller and more manageable subproblems. This strategy allows for a reduction of computational complexity and promotes a structured and efficient problem-solving approach. Historically, this methodology has been widely applied to manage diverse challenges, for instance, in legal reasoning [20], mathematical computations [21], and computational problems related to parallel processing [22]. Formally, the general problem is defined as $P$, which can be expressed as a set of subproblems $P = \{p_1, p_2, \dots, p_n\}$. Once each subproblem is individually handled and its specific solution is obtained, denoted as $s_i$, the partial results are systematically combined to construct the complete solution to the original problem as follows:

$$S = C(s_1, s_2, \dots, s_n), \tag{1}$$

where $C$ represents the function that combines partial solutions into the final result.
In this work, we decompose the industrial monitoring problem into specialised modules within an explainable multimodal MoE-style framework. Each expert represents an independent subproblem, and their results are unified through an explicit event-level fusion layer that provides a coherent global assessment and supports downstream supervisory or digital twin-compatible systems. All stages, challenges, and integration phases defined in this work are summarised next and visually represented in
Figure 1.
1. Problem and Challenge Identification. The aim of this initial step is to establish the background and context of the industrial problem to be solved. This involves analysing both the multimodal nature of the data (i.e., images, video, and sensor streams) and other aspects inherent to the problem at hand, such as pouring streams, leaks, overpouring situations, or irregular filling patterns. In a nutshell, the goal is to understand what must be monitored, how, and why, providing a clear foundation for each expert specialisation.
2. Knowledge Acquisition. This second phase focuses on obtaining the domain-specific and technological knowledge needed to guide the research and development of each expert subsystem. Here, high-level knowledge is acquired first in order to define the overall workflow of the research. This stage also involves defining data sources, formats, and the synchronisation requirements necessary for multimodal integration.
3. Challenge Division. Following the divide-and-conquer paradigm, the global problem $P$ is divided into smaller subproblems $p_i$, each corresponding to a functional expert module that will be included in the MoE framework. More precisely, the definition of each expert includes the following sub-phases: (i) acquisition of specific knowledge, (ii) definition of techniques and experiments, and (iii) results evaluation and analysis.
   - Local Image Interpretation and Event Detection. This subproblem focuses on identifying the most probable sources of anomalies. It requires the acquisition of specific knowledge related to visual feature extraction, keypoint detection, and motion-based event discovery. The resulting solution enhances the reliability of visual monitoring in complex production settings by detecting the critical areas where irregularities could appear and ultimately cause a faulty casting.
   - Global Environment Evaluation. The second challenge seeks to characterise the broader industrial environment and provide an overall process overview. This involves an analysis of plant topology, global process dependencies, and large-scale events such as line breaks, maintenance operations, or overpouring occurrences, among others. This task evaluates, at a macroscopic level, the events occurring during the manufacturing process, ensuring that contextual events are identified and interpreted.
   - Manufacturing Evaluator. This challenge tackles quantitative process evaluation and compliance verification. It is grounded in knowledge of industrial process parameters, tolerance limits, and production quality regulations. This information depends on the reference currently being manufactured and must be extracted and updated in real time in order to characterise the full process accurately.
   - Casting Pouring Stream Interpretation. This last challenge involves understanding the dynamic behaviour of molten metal flow during casting operations for the unsupervised identification of anomalous events. This work requires the acquisition of expert knowledge, time-synchronised video–sensor fusion, self-supervised flow dynamics characterisation, and the categorisation of defect topology associated with pouring streams.
4. Functional Integration through the MoE Unification Metamodel. Once all partial solutions $s_i$ have been computed by the experts, they are aggregated through the MoE Unification Metamodel. This metamodel acts as the combination function $C$ (shown in Equation (1)), which merges all outputs into a unified and interpretable representation of the industrial environment.
5. Structured Output Distribution to Supervisory and Digital Twin Platforms. The unified output $S$ is serialised as a structured JSON stream and made available to higher-level software layers through a real-time communication interface. This output can be consumed by supervisory tools, MES platforms, or digital twin environments, depending on the plant architecture. In this work, the relevance of this stage lies in preserving the semantic richness of the extracted evidence and enabling its downstream exploitation for traceability, monitoring, and future causal or predictive analyses.
6. Evaluation of Global Results and Dissemination. After full system integration, a final evaluation phase is carried out to assess the overall performance of the Multimodal Industrial Monitoring System with MoE. This final phase includes (i) the quantitative technical validation of each partial solution, (ii) an integrated evaluation of the complete system on industrial data, and (iii) dissemination activities in the form of scientific publications.
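The decomposition and recombination described in the stages above can be sketched as follows. This is a minimal illustration under stated assumptions: the expert functions, the observation keys, the 2.0 threshold, and the review rule inside `combine` are all hypothetical and serve only to show how independent partial solutions are merged by a combination function.

```python
from typing import Callable, Dict

Expert = Callable[[dict], dict]

def context_expert(observation: dict) -> dict:
    """Partial solution of a (hypothetical) global-context expert."""
    return {"context_state": observation["context_state"]}

def stream_expert(observation: dict) -> dict:
    """Partial solution of a (hypothetical) pouring-stream expert."""
    # The 2.0 threshold is an arbitrary illustrative value.
    return {"stream_stable": observation["stream_width_std"] < 2.0}

def combine(partials: Dict[str, dict]) -> dict:
    """Combination function: keep every partial result and add a global flag."""
    unified = {"experts": partials}   # evidence stays traceable per expert
    unified["requires_review"] = (
        partials["context"]["context_state"] != "normal_operation"
        or not partials["stream"]["stream_stable"]
    )
    return unified

def solve(observation: dict, experts: Dict[str, Expert]) -> dict:
    """Solve each subproblem independently, then merge the partial solutions."""
    partials = {name: expert(observation) for name, expert in experts.items()}
    return combine(partials)

experts = {"context": context_expert, "stream": stream_expert}
normal = solve({"context_state": "normal_operation", "stream_width_std": 0.8}, experts)
unstable = solve({"context_state": "normal_operation", "stream_width_std": 3.1}, experts)
print(normal["requires_review"], unstable["requires_review"])
```

Because each expert only sees the observation and returns a plain dictionary, a module can be replaced or retrained without touching the others, which is the practical benefit of the divide-and-conquer design.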
2.1. Data Acquisition and Capture Environment
The data used in this work were acquired on a commercial sand-casting production line. To preserve industrial confidentiality, site-specific identifiers (for instance, plant name, exact location, and proprietary process parameters) have been omitted and replaced by aggregated descriptors and representative statistics. Despite this, enough methodological detail is provided to enable the replication of the proposed approach.
Casting is one of the oldest and most fundamental manufacturing processes, in which molten metal is poured into a mould cavity that reproduces the desired shape of the final component [
23]. Among different casting techniques,
green sand casting remains one of the most widely used due to its cost-effectiveness, reusability of materials, and suitability for mass production [
24,
25]. In this process, the mould is made of a mixture of sand, clay, and water (also known as
green sand). This mixture provides enough strength and plasticity to retain the shape during metal pouring and cooling. A modern variant of this method is the
vertical moulding process, where the sand moulds are created and aligned vertically in a continuous production line. This configuration, usually associated with DISA moulding machines, enables a high-speed production (close to an average of 550 moulds per hour) with excellent dimensional accuracy and automation capabilities [
26]. The vertical arrangement allows moulds to cool down as they move continuously over a conveyor.
Figure 2 illustrates the schematic of the vertical sand-casting line used for this study.
There are several types of metals used to create castings. The material used in our case is
nodular cast iron, also known as ductile iron. This metal is a ferrous alloy characterised by the spheroidal morphology of its graphite inclusions. This microstructural feature provides a superior combination of strength, toughness, and ductility compared to grey cast iron, making it ideal for automotive, hydraulic, and structural applications [
27,
28].
The moulding line operation starts with the moulding machine, which compacts green sand around a pattern and ejects the resulting mould onto the conveyor. The formed moulds advance along the mould train until they reach the pouring station, where a press-pour mechanism deposits the molten metal into the pouring cup. Gravity then pulls the metal down, filling the cavity inside the mould. During pouring, a protective cover, known as a tile, is placed over the next pouring cup to prevent the unintended spillage of metal into downstream moulds. This reduces the risk of cross contamination between them. The quality of the pouring operation is critical, as poor pouring procedures frequently cause defects. These defects have different consequences for component performance [5, 23, 25, 29] and can be categorised as follows:
Severe (structural) defects: Phenomena that may cause catastrophic failure or render a part unusable. This group includes carbide formation or hard brittle phases, which compromise toughness and increase the risk of premature fracture under load; cold shuts, which cause weak discontinuities due to incomplete coalescence of metal streams; shrinkage porosity, that is, voids produced by volume contraction during solidification, which can cause leakage; and gross lack of material due to underfilling.
Functional defects: Defects that affect the final functionality of a part or subsequent finishing operations, such as machining. Some examples are (i) hard inclusions (for instance, slag, sand inclusions, or oxides) and (ii) local carbide pockets that damage cutting tools or reduce the service life of a part. These defects are usually caused by turbulent pouring, inadequate filtration, or poor gating/riser design.
Aesthetic defects: Surface flaws or minor inclusions, such as light sand inclusions or small blow-holes, among others. These do not compromise mechanical performance but may require rework or lead to the rejection of a component.
An automated monitoring system that combines video evidence of the pouring process with synchronised sensor signals provides a practical route for the early detection of these faults, supporting root-cause analysis and corrective actions on the line [13, 15]. For this purpose, the video camera is positioned to capture the pouring event, the filling level of each mould, and any special events such as visible overflow or spillage. Simultaneously, process sensors (e.g., temperature and flow/weight, among others) and the material flow behaviour are synchronised with the video in order to identify any problem in the pouring area.
Figure 3 illustrates how this pouring monitoring system is deployed in a real-life green sand foundry plant. The resulting multimodal dataset is used throughout this paper, and is the basis for model training, anomaly detection and the real-time updates sent to the digital twin.
The data used in this study originated from two plant-level information sources with different roles. First, the foundry operates an independent production monitoring system that registers mould-by-mould process information, including mould generation, pouring, and shakeout times, together with mould related production descriptors. Second, a centralised industrial DataLogger acquisition system was configured to communicate with the plant Programmable Logic Controllers (PLCs), map their memory zones, and record the relevant signal tags (i.e., digital I/O, analog channels, counters, and event markers). In parallel, network cameras were integrated into this acquisition topology via the Real-Time Streaming Protocol (RTSP), and the incoming video and PLC signals were stored together with acquisition timestamps in a centralised repository.
In this context, “timestamp” refers to the acquisition time assigned by the DataLogger environment (with the software ibaAnalyzer v8.2.3) to each recorded PLC sample and video frame (or frame-aligned video sample) within that subsystem. Therefore, video and PLC data are temporally aligned inside the acquisition platform (as shown in
Figure 4). Nevertheless, this acquisition subsystem is not natively synchronised with the independent mould-level production system. As a result, the identity of the mould observed in the video is not directly available
a priori, and visual pouring events cannot be linked directly to a unique mould record from the production database. This correspondence must instead be reconstructed afterwards through an event-driven synchronisation procedure based on process timing and operational markers.
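The event-driven synchronisation procedure is not detailed at this point, so the following is only a sketch of one plausible approach: each visually detected pouring event is matched to the nearest pouring time in the production database, after applying a clock offset between the two systems. The function name, the offset, the tolerance, and the mould identifiers are all assumptions made for illustration.

```python
from bisect import bisect_left

def match_events_to_moulds(event_times, mould_times, mould_ids,
                           offset=0.0, tolerance=2.0):
    """Match each event time (acquisition clock, seconds) to a mould record.

    mould_times must be sorted (production-system clock, seconds).
    Returns one mould id per event, or None when no record lies within
    `tolerance` seconds of the offset-corrected event time.
    """
    matches = []
    for t in event_times:
        target = t + offset
        pos = bisect_left(mould_times, target)
        # Only the records just before and just after the target can be nearest.
        candidates = [k for k in (pos - 1, pos) if 0 <= k < len(mould_times)]
        best = min(candidates, key=lambda k: abs(mould_times[k] - target),
                   default=None)
        if best is not None and abs(mould_times[best] - target) <= tolerance:
            matches.append(mould_ids[best])
        else:
            matches.append(None)
    return matches

# Hypothetical data: moulds poured every 5 s, clocks offset by 12.5 s.
mould_times = [100.0, 105.0, 110.0, 115.0]
mould_ids = ["M001", "M002", "M003", "M004"]
event_times = [87.4, 92.6, 97.5, 120.0]
matches = match_events_to_moulds(event_times, mould_times, mould_ids,
                                 offset=12.5, tolerance=2.0)
print(matches)
```

The tolerance is kept below half the pouring period so that an event cannot be assigned to a neighbouring mould; operational markers such as reference changes could be used to estimate the offset itself.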
Next, we provide a detailed explanation of the contents of the dataset used in this study, including data types, acquisition settings, and synchronisation strategy to ensure full methodological reproducibility. Nevertheless, to preserve industrial confidentiality, exact counts and specific parameters are omitted and replaced by aggregated descriptors and representative statistics.
2.1.1. Video Recordings
The video dataset comprises approximately 12 h recorded across 2 sessions under normal operating conditions (with 8 and 4 h, respectively). All sequences were acquired at a resolution of
pixels and 20 fps, encoded with the H.264/AVC standard at bitrates ranging between 3.0 and 3.2 Mbps. No audio tracks were included in the recordings. Both videos show a stable keyframe period of 1.5 s, ensuring temporal alignment with sensor data. In both recordings, the camera placement corresponds to the field of view illustrated in
Figure 3, capturing the pouring stream, the tile/cup area, and the first produced moulds in the beginning area of the conveyor (all detailed in
Figure 2).
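As a quick sanity check of the recording parameters quoted above, the short calculation below derives the total frame count, the number of frames between keyframes, and the approximate storage footprint. It uses only the stated figures (12 h, 20 fps, 1.5 s keyframe period, 3.0 to 3.2 Mbps) and assumes a constant bitrate and decimal gigabytes.

```python
HOURS = 12
FPS = 20
KEYFRAME_PERIOD_S = 1.5
BITRATE_MBPS = (3.0, 3.2)      # lower and upper bound of the stated range

duration_s = HOURS * 3600
total_frames = duration_s * FPS                     # 864000 frames
frames_per_keyframe = int(KEYFRAME_PERIOD_S * FPS)  # 30 frames

# Megabits -> megabytes (divide by 8) -> decimal gigabytes (divide by 1000).
size_gb = tuple(rate * duration_s / 8 / 1000 for rate in BITRATE_MBPS)

print(total_frames, frames_per_keyframe, [round(s, 2) for s in size_gb])
```

Under these assumptions the two sessions together occupy roughly 16 to 17 GB, and every keyframe interval covers 30 frames.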
The captured footage represents the continuous operation of the vertical green-sand moulding line producing nodular iron castings of different references and cup geometries. Recordings display the typical working conditions of the industrial casting line, including changes in lighting due to natural illumination and the radiant emission of the molten metal, turbulent jets in the pouring stream, sporadic metal splashing, overflows, and short production stops during reference changes or mould rejection. The lighting control of the camera is automatic and maintains overall visual consistency despite diurnal variations. Besides pouring events occurring approximately every 5 s, the videos also contain short pauses caused by maintenance interventions or reference changes, among others. These special events account for ∼2% of the total time. In addition, the dataset includes both regular and irregular pouring sequences, covering the normal day-to-day variability of foundry work. Although no event tags are included with the data, we were able to identify the special events using CV and the synchronised records from the process sensors. Furthermore, reference changes are visually recognisable due to a series of empty moulds and a manual mark on the conveyor. With this, we have identified 14 such events, confirming the visual stability of the setup and the absence of camera shake or abrupt illumination changes. All videos are stored in raw format without stabilisation or image corrections to preserve authentic process dynamics and facilitate reproducible multimodal analysis.
Crucially, the precise synchronisation of video and sensor data enables robust multimodal analysis. By correlating visual features, like pouring stream shape, with process variables, such as flow rate, this alignment provides the foundation for effective digital twins and data-driven anomaly detection, supporting predictive quality control in real production environments.
2.1.2. Sensor Recordings
Parallel to video acquisition, a complete set of process and control signals was recorded from the plant. These signals were sampled at an approximate rate of 2.5 Hz and were temporally aligned with the video data within the centralised DataLogger subsystem using a common acquisition time base. Nevertheless, this internal alignment does not by itself provide direct mould-level identification across the independent production-monitoring system. The gathered dataset contains 155 distinct process variables, reflecting both continuous and discrete values across the foundry line. Although the precise sensor identifiers are hidden for confidentiality reasons, the signals can be grouped into several functional subsystems that are representative of a vertical green-sand casting process.
Table 1 describes the approximate distribution of signals among these subsystems and provides representative examples of each group. These can be summarised into five groups as follows:
Moulding area, which includes critical variables for maintaining a consistent mould density and ensuring dimensional integrity before pouring.
Pouring process, which includes variables related to the press-pour unit and to the supervision of molten metal conditions. These parameters directly determine casting quality and correlate strongly with the aforementioned casting defects.
Conveyor and cooling line, which monitors mould evolution after pouring.
Auxiliary and safety systems signals, which provide contextual information about line stops and maintenance events.
Quality and traceability signals, which identify reference changes, rejected moulds, or operator interventions, facilitating synchronisation between the physical process and the third-party systems.
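Since the video runs at 20 fps and the process signals at roughly 2.5 Hz, each sensor sample covers about eight frames. The sketch below shows one simple way to attach a sensor sample to every frame, using a sample-and-hold rule on the shared acquisition time base. The rule itself and the function name are assumptions; timestamps are expressed in integer milliseconds to avoid floating-point comparisons.

```python
from bisect import bisect_right

def align_frames_to_samples(frame_times_ms, sample_times_ms):
    """For each frame, return the index of the latest sensor sample taken at
    or before the frame time (sample-and-hold), or None if there is none.

    sample_times_ms must be sorted in ascending order.
    """
    aligned = []
    for t in frame_times_ms:
        k = bisect_right(sample_times_ms, t) - 1
        aligned.append(k if k >= 0 else None)
    return aligned

# One second of video at 20 fps (50 ms per frame) and nominal 2.5 Hz samples.
frame_times_ms = [i * 50 for i in range(20)]
sample_times_ms = [0, 400, 800]
sample_index = align_frames_to_samples(frame_times_ms, sample_times_ms)
print(sample_index)
```

Working from timestamps rather than from a fixed ratio of eight frames per sample keeps the alignment valid when the sensor rate is only approximately constant.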
2.2. Global Context Evaluation
Next, we focus on understanding the global production context in which these operations occur. This stage handles the classification of the overall operational state of the foundry line from visual information. Its specific goal is to distinguish whether the process is running under normal conditions or is affected by contextual disturbances such as maintenance tasks (which might create operator occlusions), overpouring, or technical stops. This enables both real-time alerts and the exclusion of anomalous sequences from subsequent analytical modules. In particular, this classification allows non-productive intervals and contextual anomalies to be filtered out, which could otherwise distort quality assessment or multimodal fusion with process signals. We have defined four contextual states, which represent the production workflow itself (see Figure 5), as follows:
1. Normal operation, corresponding to regular casting cycles where pouring, mould transport and cooling proceed without interference.
2. Maintenance, when operators enter the scene or perform manual adjustments, often producing occlusions that compromise visual analysis.
3. Overflow, describing situations where molten metal splashes outside the mould cup, typically caused by misalignment or excessive pouring rate.
4. Stop line, representing planned or unplanned halts such as pattern changes and safety stops, among others. Usually, moulds are manually marked as rejected with chalk.
To automate this contextual classification, several image classification architectures were considered. Traditional convolutional networks such as ResNet-50 [30] and EfficientNet [31] have been the foundation of manufacturing vision systems due to their accuracy and fine-tuning capabilities. More recent architectures, such as the transformer-based ViT [32], which relies on global attention, and the transformer-inspired convolutional ConvNeXt [33], achieve outstanding accuracy on large-scale datasets but with significant computational demands. Hence, they are less suitable for real-time inspection problems. In contrast, the YOLOv11-class [34] model offers a favourable balance between precision and latency and is optimised for real-time industrial deployment. It also ensures architectural compatibility with the area detector and classifier used in this work (see Section 2.3.2 and Section 2.4.1), simplifying deployment and reducing hardware requirements.
The dataset employed for training comprised approximately 262,600 images extracted from the 4 h video recording. Each frame was manually labelled by a foundry expert into one of four categories: normal operation (212,000 samples), overflow (45,000), maintenance (2500), and stop line (2100). The data were split into training, validation, and test sets with a 60/20/20 ratio. We fine-tuned the pretrained YOLOv11 backbone, employing the standard Ultralytics augmentation strategies (including random scaling and cropping, horizontal flipping, HSV adjustments, Gaussian blur, and mosaic), which have been shown to significantly improve robustness under varying illumination (e.g., molten metal glare) and occlusions commonly found in casting environments [34].
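To illustrate what the 60/20/20 split implies for each class, the following sketch computes per-class subset sizes from the class counts quoted above. It assumes the split is stratified by class, which is not stated in the text, and uses integer arithmetic so that the three subsets always add up to the class total.

```python
CLASS_COUNTS = {
    "normal_operation": 212_000,
    "overflow": 45_000,
    "maintenance": 2_500,
    "stop_line": 2_100,
}
TRAIN_PCT, VAL_PCT = 60, 20      # the remaining 20% goes to the test set

def stratified_sizes(counts, train_pct, val_pct):
    """Per-class subset sizes for a stratified split (assumed, not confirmed)."""
    sizes = {}
    for label, n in counts.items():
        n_train = n * train_pct // 100
        n_val = n * val_pct // 100
        sizes[label] = {"train": n_train, "val": n_val,
                        "test": n - n_train - n_val}
    return sizes

sizes = stratified_sizes(CLASS_COUNTS, TRAIN_PCT, VAL_PCT)
total = sum(CLASS_COUNTS.values())
minority_share = CLASS_COUNTS["stop_line"] / total
print(sizes["stop_line"], total, round(100 * minority_share, 2))
```

With these counts the stop line class contributes only 420 test images, under 1% of the data, so per-class metrics are more informative than overall accuracy when evaluating this classifier.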
In summary, this module completes the visual perception pipeline by attaching a contextual state to every detected pouring event. As a result, only segments corresponding to normal operation are forwarded to quality evaluation and MoE fusion, while frames affected by maintenance tasks, line stoppages, overflows, or camera occlusions are automatically flagged and excluded.
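The gating behaviour summarised above can be sketched as follows. The aggregation rule is an assumption introduced for illustration: an event is treated as normal only if every one of its frames is classified as normal, and otherwise it takes the most frequent non-normal state. The enum values and function names are likewise hypothetical.

```python
from collections import Counter
from enum import Enum

class ContextState(Enum):
    NORMAL = "normal_operation"
    MAINTENANCE = "maintenance"
    OVERFLOW = "overflow"
    STOP_LINE = "stop_line"

def event_state(frame_states):
    """Aggregate per-frame states into one state per pouring event
    (conservative rule: any abnormal frame makes the event abnormal)."""
    abnormal = [s for s in frame_states if s is not ContextState.NORMAL]
    if not abnormal:
        return ContextState.NORMAL
    return Counter(abnormal).most_common(1)[0][0]

def gate(events):
    """Split events into those forwarded to quality evaluation and those
    flagged with their contextual state."""
    forwarded, flagged = [], {}
    for event_id, frame_states in events.items():
        state = event_state(frame_states)
        if state is ContextState.NORMAL:
            forwarded.append(event_id)
        else:
            flagged[event_id] = state
    return forwarded, flagged

N, M, O = ContextState.NORMAL, ContextState.MAINTENANCE, ContextState.OVERFLOW
events = {1: [N, N, N, N], 2: [N, O, O, N], 3: [M, M, N, O]}
forwarded, flagged = gate(events)
print(forwarded, flagged)
```

A less conservative rule, such as a majority vote over all frames, would forward more events at the cost of letting partially occluded cycles reach the quality evaluation.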
2.3. Local Image Interpretation and Event Detection
This section addresses the local analysis of the video stream, where visual information is structured into meaningful process entities. It comprises two complementary stages: first, Pouring Event Identification, which temporally segments the continuous sequence into discrete casting cycles; and second, the Detection of AoIs, which spatially isolates the relevant regions within each event for subsequent interpretation. Together, these stages transform raw video data into semantically organised information that serves as the basis for higher-level modelling and process understanding.
2.3.1. Pouring Event Identification
This stage delimits events within the pouring process so that the manufacturing visual content can be interpreted automatically. To this end, the continuous video sequence is temporally segmented into well-defined pouring cycles that correspond to real process operations. We denote these segments as pouring events. From this stage onwards, every frame is processed not as an isolated observation but as part of a dynamic process where recurrent operations occur under industrial constraints, since each production cycle can be represented as a repetitive task of mould generation, pouring, transport, and cooling. Formally, we define the set of pouring events as $\mathcal{E} = \{e_1, \ldots, e_M\}$, where each event $e_i$ represents a complete pouring operation within the production line. Each event $e_i$ is delimited in time by the lowering and subsequent rise of the pouring unit, specifically, $e_i = [t_i^{\mathrm{start}}, t_i^{\mathrm{end}}]$, with $t_i^{\mathrm{start}}$ being the instant when the pouring unit begins its descent and $t_i^{\mathrm{end}}$ when the filled mould leaves the pouring zone or the next empty mould arrives. During this interval, we observe a multimodal sequence of variables as follows:
$$X_i = \{V_i, S_i, C_i, P_i, Q_i\},$$
where $V_i$ denotes the video stream; $S_i$ the stream shape descriptors; $C_i$ the environment characterisation indicators; $P_i$ the process variables such as pressure and temperature, among others; and $Q_i$ the quality measurements related to the current production reference.
In theory, the acquired signals can provide direct temporal markers. Nevertheless, in practice, industrial data acquisition often suffers from synchronisation drift, missing packets, or timing mismatches between video and process signals [
35]. Therefore, a vision-based approach was adopted as a cross-check, autonomously identifying event boundaries by analysing image motion and frame similarity.
For this purpose, several algorithms were considered to determine an optimal trade-off between accuracy, robustness, and computational efficiency. One common approach is dense Optical Flow (OF), such as Recurrent All-Pairs Field Transforms (RAFT) [
36], which computes pixel-wise motion vectors across consecutive frames. OF has become a benchmark for high-precision motion estimation in industrial and robotic inspection tasks, as it offers fine-grained motion tracking. However, this approach exhibits a high computational cost and a large memory footprint, limiting real-time deployment on full HD sequences. Although this limitation is discussed in [
37] and the authors propose further optimisations, we decided to discard it.
Alternatively, simple frame or histogram difference methods were also tested as baseline strategies for motion quantification, given their minimal computational load. However, as noted by Karbalaie et al. [
38], such techniques are highly sensitive to the glow of the molten metal and to the automatic exposure corrections applied by cameras, both of which may trigger false positives.
Finally, we considered the Structural Similarity Index Measure (SSIM), which has been validated as a perceptual metric for detecting state transitions in industrial video monitoring [
39,
40]. SSIM quantifies luminance, contrast, and structural coherence between successive frames, enabling the detection of mould and metal movements. Recent studies have confirmed its robustness against illumination fluctuations and flicker noise in real production environments [
41]. Its low computational complexity makes it particularly well-suited for embedded industrial monitoring systems.
To further optimise performance, the analysis was restricted to a dynamically defined AoI enclosing the pouring stream and adjacent mould cup, as illustrated in
Figure 6. This adaptive AoI strategy allows the system to automatically adjust to camera viewpoint changes, ensuring that the pouring zone and the first poured mould exiting the scene remain consistently localised even under shifts in perspective or framing. Restricting the analysis in this way also avoids redundant pixel processing and keeps the focus on semantically relevant areas [
35,
41]. In our implementation, the AoI position is automatically updated based on the previously detected stream centroid, yielding a self-adaptive mechanism that maintains consistent spatial tracking.
Algorithm 1 summarises the final adopted implementation using the SSIM motion detector. The algorithm processes consecutive frames within a small AoI enclosing the pouring zone and computes their structural similarity to quantify inter-frame changes. Motion is declared whenever the similarity drops below a predefined threshold, which effectively distinguishes between static and dynamic phases. Temporal smoothing, carried out as a 5-frame moving average, reduces spurious transitions caused by exposure fluctuations or molten metal glow. Finally, to ensure lightweight processing, we downscale the AoI region to 320 × 180 px. This configuration provides a robust and computationally efficient solution that runs in real time while preserving accurate event delimitation. The resulting binary motion signal marks the start and end of each pouring cycle and serves as the temporal reference for subsequent analysis and synchronisation with sensor data.
Algorithm 1: SSIM-based motion detection within AoI (casting event delimitation). [Algorithm listing provided as an image in the source.]
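As a rough illustration of the SSIM-based motion detection described for Algorithm 1, the following Python sketch may help. It is not the authors' implementation. It assumes frames have already been cropped to the AoI, converted to greyscale, downscaled, and flattened into lists of 0-255 values, and it uses a simplified single-window (global) SSIM instead of the usual sliding-window variant. The function names and the threshold value of 0.9 are hypothetical; only the 5-frame moving average comes from the text.

```python
from collections import deque
from statistics import fmean

# Standard SSIM stabilisation constants for 8-bit images.
C1 = (0.01 * 255) ** 2
C2 = (0.03 * 255) ** 2


def global_ssim(a, b):
    """Single-window SSIM between two equally sized greyscale frames."""
    n = len(a)
    mu_a, mu_b = fmean(a), fmean(b)
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    numerator = (2 * mu_a * mu_b + C1) * (2 * cov + C2)
    denominator = (mu_a ** 2 + mu_b ** 2 + C1) * (var_a + var_b + C2)
    return numerator / denominator


def motion_signal(frames, threshold=0.9, window=5):
    """Binary motion flag per consecutive frame pair.

    The SSIM values are smoothed with a moving average over `window`
    pairs; motion is declared when the smoothed value falls below
    `threshold` (hypothetical value, to be tuned on real data).
    """
    recent = deque(maxlen=window)
    signal = []
    for previous, current in zip(frames, frames[1:]):
        recent.append(global_ssim(previous, current))
        signal.append(fmean(recent) < threshold)
    return signal


def event_boundaries(signal):
    """Return (start, end) index pairs of contiguous motion runs."""
    events, start = [], None
    for i, moving in enumerate(signal):
        if moving and start is None:
            start = i
        elif not moving and start is not None:
            events.append((start, i))
            start = None
    if start is not None:
        events.append((start, len(signal)))
    return events
```

In a real deployment the global SSIM would likely be replaced by a windowed implementation from an image-processing library, and the boundaries returned by `event_boundaries` would be mapped back to timestamps for synchronisation with the sensor data.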
2.3.2. Detection of Areas of Interest
Once the pouring events are detected, the next step is concerned with the interpretation of the visual content within each event. Despite the high spatial richness of each frame, large regions of the image correspond to static background or machine structures of little relevance for measuring the quality of the pouring process. Therefore, the analysis must focus on specific
Areas of Interest, which correspond to the relevant regions where molten metal interacts with the mould and auxiliary devices during the pouring stage (see
Figure 7), such as the following:
Mould cup(s): Visible holes that are positioned in front of the camera on arrival at the pouring area. They may appear empty, partially filled, or fully filled. Monitoring their fill level provides contextual information on pouring accuracy and metal delivery consistency.
Pouring stream: The continuous jet of molten metal descending from the pouring unit. Its shape, thickness, and turbulence reflect process stability and are strongly correlated with casting defects.
Tile (splash guard): A protective device that prevents splashes or secondary jets from contaminating the next mould.
Incoming mould cup: The next cup that becomes visible when the conveyor advances. Occasionally, early contact with residual metal or splashes may occur, and its early detection allows triggering predictive alarms.
To automatically identify these areas, several object detection architectures were considered, spanning both two-stage and one-stage paradigms. Two-stage models, such as Faster R-CNN [
42], first generate region proposals and then classify them, achieving excellent accuracy but at high computational cost. In contrast, one-stage models like SSD [
43] or the YOLO family [
44] predict bounding boxes directly over feature maps, delivering competitive precision at significantly lower latency, making them ideal for industrial applications with real-time constraints. In particular, the YOLO family consistently provides the best balance between inference speed and detection accuracy [
43,
44], outperforming SSD and achieving significantly lower latency than Faster R-CNN. Given the real-time operational requirements of our foundry line, YOLOv11 was selected as the detection backbone. For the initial proof of concept, we adopted the lightweight YOLOv11-nano variant to evaluate feasibility and runtime, leaving open the possibility of upscaling to larger versions for accuracy improvements.
The model was trained using a dataset of approximately 258,000 frames extracted from a 4 h production video. As the manual annotation of such a volume is prohibitive, we employed a two-stage semi-supervised strategy similar to the human-in-the-loop approaches described by Lee et al. [
45] and Liu et al. [
46]. The first phase involved 17,000 manually labelled frames covering diverse conditions (illumination, references, and turbulence levels). The second phase used the preliminary detector trained on those frames to automatically label the remaining ones, which were then manually reviewed and corrected to create the final training corpus. This iterative refinement increased annotation efficiency and ensured accurate bounding boxes.
In summary, the combination of the pouring event and area-of-interest detection yields a highly structured video representation aligned with the actual foundry workflow. This pre-processing stage eliminates irrelevant visual noise, focuses computation on semantically meaningful areas, and supplies clean and time-aligned visual features to be fed to the subsequent models responsible for pouring quality assessment and for further integration with the MoE framework. An example of detected AoIs can be seen in
Figure 8.
2.4. Area-of-Interest Visual Classification
Next we aim to perform local classification within each of the detected AoIs to characterise their visual state in the manufacturing process. This step focuses on the static regions extracted from the YOLOv11 detections (see
Section 2.3.2), specifically, the mould cups and the next incoming cup. We excluded the pouring stream, which requires a temporal analysis and is described separately in
Section 3.4.
2.4.1. Mould Cup Filling State Classification
Each detected mould cup is analysed to determine its filling level condition. The classes considered are
empty,
medium,
full, and
overpoured, illustrated in
Figure 9. These states represent, respectively, the following: (i) a mould cup that has not yet received metal, (ii) one that has been partially filled, (iii) a correctly filled cup, and (iv) excessive pouring resulting in overflow or splashing.
This classification step is essential because the mould path is exposed to multiple sources of variability that may compromise casting quality, for instance, mechanical vibrations of the conveyor, mould misalignment, sand defects, or microleakages along the pouring channel. These issues can cause deviations from the expected filling level. Therefore, tracking the state of every mould cup across the detected pouring event allows the early detection of anomalous behaviour, through the continuous identification of potential metal losses along the line and the verification that the pouring stream reaches the mould under the required process conditions.
To this end, we again employ the YOLOv11 classification branch, with the same backbone and hyperparameter configuration described in
Section 2.2. The data consisted of 184,041 annotated mould cup instances, distributed as follows: 59,414 empty, 50,522 medium, 49,877 full, and 24,228 overpoured. The data were randomly partitioned into training (60%), validation (20%), and testing (20%) subsets. As in the previous classification problems, augmentation strategies such as brightness jittering, small-angle rotation, and contrast scaling were applied to improve generalisation under variable illumination and exposure conditions typical of the foundry environment.
Overall, the mould cup classifier extends the AoI-level perception layer by offering a fine-grained interpretation of each mould’s filling state. Together with the incoming cup detector and the pouring stream analyser, this component contributes to a unified and temporally aligned understanding of the pouring operation, later integrated by the MoE framework.
2.4.2. Incoming Mould Cup Classification
In addition to the active pouring zone, the next incoming mould cup is also analysed to anticipate potential quality issues before pouring begins. Detecting metal residues inside an incoming mould cup is critical because even small droplets of prematurely solidified metal can generate severe casting defects. Residual metal may obstruct proper flow during the next pouring, promote cold shut defects when fresh metal meets already solid fragments, or induce local porosity by disrupting thermal gradients inside the mould. Moreover, splashes accumulated during previous cycles may indicate instabilities in the pouring stream or mould alignment issues. For these reasons, the incoming mould cup must be monitored to ensure that its condition does not compromise the quality of the next cast piece in the production sequence.
The following two states are considered, as illustrated in
Figure 10: (i)
OK, when the mould cup is clean and ready for filling, and (ii)
Warning, when metal residues or splashes are visually detected inside it. The early identification of such conditions enables the use of predictive alarms and improves the downstream traceability of potentially defective moulds. This classifier allows us to reliably distinguish clean cups from those containing metal residues even under strong brightness variations or partial occlusions.
The dataset used for this task included 280,363 cropped AoI samples, of which 93% are
OK samples and 7% are
Warning ones. Each class was divided into training, validation, and test subsets with a 60/20/20 ratio. Training was performed by fine-tuning ImageNet-pretrained YOLOv11 weights, applying augmentation strategies similar to those used for the mould cups in
Section 2.4.1 to ensure robustness.
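Because the two classes are strongly imbalanced, splitting each class separately keeps the class ratio in all three subsets. The following Python sketch illustrates such a per-class (stratified) 60/20/20 split. It is a minimal illustration under stated assumptions rather than the authors' procedure: the function name, the seed, and the use of sample indices are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split sample indices into train/val/test, class by class.

    `labels` is a list with one class label per sample. Each class is
    shuffled and divided independently, so every subset keeps roughly
    the same class proportions as the full dataset.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    splits = {"train": [], "val": [], "test": []}
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        splits["train"] += indices[:n_train]
        splits["val"] += indices[n_train:n_train + n_val]
        # The remainder goes to the test subset.
        splits["test"] += indices[n_train + n_val:]
    return splits
```

With a 93%/7% class distribution, a purely random split could leave very few Warning samples in the smaller subsets, which is why the per-class division matters here.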
Altogether, the incoming cup classifier acts as an anticipatory safeguard within the pouring pipeline. By detecting hazardous pre-filling conditions several seconds before the molten metal reaches the mould, it enables the system to flag potential risks earlier than any downstream inspection stage could. This early warning capability not only reduces scrap generation but also provides the MoE with critical contextual evidence that enriches the final event evaluation and strengthens decision reliability across the entire production chain.
2.5. Pouring Stream Interpretation
In this section we focus on characterising the pouring stream. Beyond common issues like overflow, maintenance, or sudden drops, defective streams in industrial settings can take on many subtle forms, for example, forked, unaligned, crooked, turbulent, or interrupted flows. These variations are visually distinct but often hard to categorise using traditional methods. Complicating matters further, industrial environments typically produce very few defective samples, making it difficult to even confirm their presence in the data. This, in turn, requires an extensive annotation process, going through hundreds or thousands of normal samples until the anomalous ones are found. As a result, we approach this problem from a different perspective, more akin to anomaly or outlier detection, hypothesising that this perspective may help uncover not only known defects but also unexpected or rare anomalies.
Our approach can be summarised into two steps. First, we employ an autoencoder-like architecture for the characterisation of jet stream dynamics. In particular, by solving a simple self-supervised reconstruction task, the network learns to model the inherent behaviour and dynamics of molten metal, producing multimodal representations of the video and sensory information. Then, we drop the decoder and extract features for each sequence at the output of the encoder. These are fed into a clustering algorithm to properly categorise them into different semantically coherent classes, which we can use to analyse the different types of normal and anomalous samples.
2.5.1. Background
Research on
casting process analysis for pouring interpretation spans multiple approaches, from real-time process monitoring to post-casting quality assessment. Current research demonstrates emerging capabilities in molten metal and pouring stream characterisation using deep learning, with notable advances including optical stream analysis techniques [
47], video vision transformers for molten metal dynamics [
48], CNN-based high-pressure die-casting quality control [
49], novel flow visualization approaches [
50], and automated metallographic analysis using deep neural networks [
51]. While some works demonstrate neural network applications for temperature monitoring and general casting process optimization (e.g., [
52,
53]), none of these works specifically target anomaly detection for problematic casting scenarios such as splashes, turbulent streams, overflows, or forked streams in foundry environments. Despite these advances in general process monitoring, quality assessment, and basic flow characterization, there remains a significant research gap in developing real-time computer vision systems capable of identifying and classifying the specific anomalous pouring behaviours that lead to casting defects in industrial foundry settings.
On a general note, most
anomaly detection techniques rely on having a clean set of normal samples [
54,
55]. For example, reconstruction-based methods [
56,
57] use autoencoders trained on normal data to flag anomalies via reconstruction error, while one-class classification [
58,
59] and normalizing flow [
60,
61] approaches also depend on clear separation between normal and anomalous data to learn discernible distributions. However, when the dataset is noisy or contaminated, these methods often struggle [
62,
63]. Some recent works address this by explicitly filtering out anomalies (as in [
64,
65,
66]) or mitigating their impact during training (as described in [
64,
67,
68]). However, these methods either disregard information provided by the anomalous samples, or require
a priori knowledge of the contamination degree. Furthermore, many of these works have only been tested on MNIST [
69], CIFAR10 [
70], or Common Objects in Context (COCO) [
71], where one class is used as normal samples and the other classes are used as anomalies. While some have been evaluated on MVTec [
72], the application of these techniques may not directly translate into our industrial video scenario.
Traditional
outlier detection [
73] often relies on unsupervised techniques like feature extraction followed by clustering, which are especially useful when labelled data are scarce. This technique has also been used for anomaly detection [
74,
75,
76], as both terms are often used interchangeably. In visual data, feature extraction can be done using supervised models pretrained on large datasets, but these can be biased towards supervised tasks, hence they may miss the fine-grained, low-level details needed for industrial contexts. Unsupervised methods trained on natural generic images also tend to struggle when applied to specialised domains [
77]. We believe that a more promising approach is to learn features directly from the target data, most commonly done by leveraging convolutional autoencoders (e.g., [
78,
79]), though in our case, we explore transformer-based architectures, as they have been shown to better handle non-local interactions in the data [
80].
Regarding
feature extraction, supervised approaches are often impractical in our setting. Annotating casting videos requires expert knowledge and is prohibitively expensive, while models trained on labelled data tend to learn task-specific features that do not generalise as well [
81]. Similarly, relying on pretrained models, even those trained with self-supervised objectives, faces a similar limitation: most existing models are trained on natural images or videos [
82], which differ significantly from our industrial setting. This domain gap makes direct transfer ineffective for capturing the fine-grained patterns relevant to molten metal jet streams. Alternative self-supervised strategies, such as contrastive or joint-embedding methods, excel in categorical tasks, especially in object-centric datasets [
83], but they struggle in other situations [
84,
85]. In contrast, generative approaches have shown superior performance in tasks requiring detailed spatial understanding, such as object detection or segmentation [
86]. Given that our goal involves characterizing fluid dynamics and texture-level variations, we require a method that preserves high-frequency details rather than focusing solely on global semantics [
87].
To address these challenges, we adopt VideoMAE [
18], a self-supervised framework designed for video representation learning. VideoMAE reconstructs masked spatio-temporal patches, encouraging the model to capture fine-grained texture and motion cues essential for understanding jet stream behaviour. Its transformer-based architecture effectively models long-range temporal dependencies [
87] and supports seamless multimodal integration, which is advantageous for incorporating additional sensory data [
88,
89]. Furthermore, its self-supervised nature allows us to exploit large volumes of unlabelled casting videos, learning domain-specific representations without costly annotation. These properties make VideoMAE particularly well-suited for our application, where meaningful spatio-temporal features are critical for downstream clustering and anomaly detection.
Once features are extracted,
clustering is typically used to separate normal from anomalous samples. K-means is widely used but presents several limitations. It forces all data into clusters, assumes clusters of similar size, and requires the number of clusters to be specified in advance (see also
Section 3.4). Alternatives like CBLOF [
90] and LDCOF [
91] offer more flexibility but are harder to tune due to excessive hyperparameters. Density-based methods such as DBSCAN [
19], Density Peak Clustering [
92], Hierarchical DBSCAN (HDBSCAN) [
93], and OPTICS [
94] are better suited for identifying outliers directly. Among these, DBSCAN stands out for its robustness and simplicity, and although it has been used extensively in prior work (e.g., [
95,
96,
97]), we have not seen it applied to high-dimensional video features. We believe one reason for this gap in the literature may be related to the curse of dimensionality (see [
98] for an intuitive definition) affecting common distance metrics used in clustering algorithms [
99]. Although this has been challenged in other works [
100], we hypothesise that employing a neural network to embed the visual data may help alleviate this problem [
101,
102].
To the best of our knowledge, there are very few previous works with similarities to our proposal. Current approaches to clustering-based video anomaly detection are generally limited to dual-stream autoencoders paired with parametric clustering methods, such as K-means or spectral clustering, which require a fixed number of clusters [
103,
104]. While isolated attempts have been made to incorporate spatio-temporal features via 3D convolutions [
105] or distance-based outlier removal [
104], these methods lack density-based adaptability and do not fully exploit modern transformer architectures. Furthermore, while DBSCAN has been used for trajectory-based video analysis [
97], its application to complex, high-dimensional descriptors for industrial processes like metal casting represents a distinct and more challenging use case. We propose a novel framework that leverages multimodal video transformers and utilises DBSCAN to enable density-based outlier detection.
2.5.2. Data Preparation
The position of the pouring stream varies along the moulding sessions. For this reason, for each pouring event defined in
Section 2.3.1, we leveraged the YOLOv11 detector trained in
Section 2.3.2 to extract AoIs of the pouring stream from each frame. To ensure temporal consistency and avoid false detections, we employ a simple algorithm to combine the AoIs detected on the individual frames of the pouring event. Each new detection is kept only if both of the following hold: (a) the detection confidence is above a threshold (in practice 0.5), and (b) the intersection over union (IoU) with the accumulated box (the union of all previously detected boxes within the current pouring event) is above a given threshold (in practice 0.6). When the detector fails to detect a pouring stream for longer than a margin of 10 frames, we consider the pouring event to be over. As a final clean-up step, we compute the average detected box and remove any outlier boxes (those whose IoU with the average is below a threshold). The final AoI is computed as the union of the remaining boxes.
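The box aggregation procedure described above can be sketched in Python as follows. This is an illustrative reading of the text, not the authors' code. Several details are assumptions: boxes are `(x1, y1, x2, y2)` tuples, the first confident detection initialises the accumulated box, only frames without any detection count towards the 10-frame margin, and the outlier threshold `outlier_iou_thr` (whose value is not given in the text) is set to a hypothetical 0.5.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def union_box(a, b):
    """Smallest box enclosing both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def aggregate_stream_aoi(detections, conf_thr=0.5, iou_thr=0.6,
                         max_gap=10, outlier_iou_thr=0.5):
    """Combine per-frame stream detections into one AoI for an event.

    `detections` holds one entry per frame: a (box, confidence) pair,
    or None when the detector found no pouring stream.
    """
    kept, accumulated, missed = [], None, 0
    for detection in detections:
        if detection is None:
            missed += 1
            # Assumption: the event ends once the stream has been seen
            # and then lost for more than `max_gap` frames.
            if missed > max_gap and kept:
                break
            continue
        missed = 0
        box, confidence = detection
        if confidence <= conf_thr:
            continue
        # Assumption: the first confident box seeds the accumulated box.
        if accumulated is None or iou(box, accumulated) > iou_thr:
            kept.append(box)
            accumulated = box if accumulated is None else union_box(accumulated, box)

    if not kept:
        return None
    # Clean-up: drop boxes that disagree with the average box.
    mean_box = tuple(sum(b[i] for b in kept) / len(kept) for i in range(4))
    inliers = [b for b in kept if iou(b, mean_box) >= outlier_iou_thr]
    if not inliers:
        return None
    final = inliers[0]
    for box in inliers[1:]:
        final = union_box(final, box)
    return final
```

Taking the union of the remaining boxes yields a crop that contains the stream in every retained frame, which keeps the spatial layout fixed across the clip.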
2.5.3. Methodology
In our implementation, VideoMAE [
18] processes sequences of 16 frames extracted from the YOLO-detected stream regions, producing rich spatio-temporal embeddings that encode both the spatial characteristics of the stream shape and its temporal evolution. The encoder architecture transforms these video clips into high-dimensional feature vectors that capture subtle variations in stream behaviour, such as flow stability, turbulence patterns, and directional changes. These embeddings serve as the foundation for our clustering analysis, where similar behaviours are grouped together, enabling the identification of distinct operational modes and the detection of anomalous patterns that deviate from normal casting conditions.
The temporal modelling capability of VideoMAE is particularly valuable for stream analysis, as casting anomalies often manifest as temporal irregularities rather than instantaneous defects. By encoding sequences rather than individual frames, the model can distinguish between normal flow variations and genuine anomalies, such as bifurcations or unstable flow patterns that develop over time. This temporal awareness, combined with the model’s ability to learn from unlabelled data, makes VideoMAE an ideal choice for developing robust, interpretable characterisations of industrial streams that can support both real-time monitoring and predictive maintenance applications.
In a nutshell, VideoMAE masks part of the input video and reconstructs it using an encoder–decoder architecture. The encoder learns a compact representation of the input, while the decoder reconstructs the masked tokens from that representation. The video input $I$ is initially partitioned into non-overlapping spatio-temporal patches. Each patch is passed through a lightweight convolutional network (CNN) to produce an embedding, yielding $Z \in \mathbb{R}^{N \times D}$, where $N$ denotes the number of resulting patches (hereafter referred to as tokens) and $D$ is their feature dimension. A portion of these tokens is then randomly masked, and only the unmasked tokens are provided to the transformer encoder $\Phi_{\mathrm{enc}}$. Importantly, the masking scheme suppresses an entire spatial location across all time steps to prevent the model from exploiting information carried over from nearby frames (i.e., shortcut learning [106]). The decoder $\Phi_{\mathrm{dec}}$ subsequently consumes both the visible tokens from the encoder and a set of learnable mask tokens to reconstruct the original video. The mean squared error (MSE) loss is computed between the original and reconstructed masked tokens as follows:
$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{|\Omega|} \sum_{p \in \Omega} \lVert I(p) - \hat{I}(p) \rVert^{2},$$
where $p$ is the token index, $\Omega$ is the set of masked tokens, $I$ is the input image, and $\hat{I}$ is the reconstructed one. The weights are initialised with the ViT-B pretrained model on Kinetics-400 [107] (1600 epochs) provided by the authors on the model zoo [108]. We then remove the decoder and keep the final features at the output of the encoder.
We employ the implementation provided by the authors in [
108]. To include sensory data, we sample readings from the same time range as the frames as follows:
$$R \in \mathbb{R}^{N_r \times N_s},$$
where $N_s$ is the number of sensors used and $N_r$ is the number of readings sampled during a given video sequence. We use a fully connected network $f_s$ to map the readings into the same dimensionality as the visual tokens, resulting in $Z^{s} = f_s(R) \in \mathbb{R}^{N_r \times D}$, with $D$ the feature dimension of the visual tokens. These are then enhanced with sinusoidal positional encodings [109] and concatenated with the visual tokens. As the number of sensory tokens is notably smaller than that of visual tokens, we also weight the contribution of each modality to the loss. In this manner, the final loss results in $\mathcal{L} = \lambda_v \mathcal{L}_v + \lambda_s \mathcal{L}_s$, where $\mathcal{L}_v$ and $\mathcal{L}_s$ are the reconstruction losses over the visual and sensory tokens, and $\lambda_v$ and $\lambda_s$ are the weights for the visual and sensory loss, respectively, set in practice to 1 and 5.
For the sensory data we combine two masking strategies. First, we mask tokens in the same way as for the visual ones, resulting in a reduced set of sensory tokens. However, the model could simply learn to interpolate between readings of the same sensor. For this reason, we also mask entire sensors across all temporal tokens. In order to preserve the temporal dimension across tokens, we mask by setting the values of those sensors to 0. We hypothesise that this combined approach forces the model to better understand vision–sensory relationships in order to infer the values of the missing sensors and the visual information as a whole. We use a masking ratio of 82% for visual tokens and 45% for sensory data. The remaining parameters were left as in the original paper.
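The masking and loss weighting described in this subsection can be sketched in plain Python as follows. This is a toy illustration and not the VideoMAE implementation: tokens are lists of floats, all function names are hypothetical, and because the text does not state how the 45% sensory ratio is shared between the two sensory masking strategies, the ratio is simply left as a parameter. The loss weights of 1 and 5 are taken from the text.

```python
import random


def tube_mask(num_frames, num_positions, ratio, rng):
    """Mask the same spatial positions in every temporal slice.

    Returns mask[t][pos], True where the token is hidden. Reusing the
    positions across time prevents copying from neighbouring frames.
    """
    n_masked = round(ratio * num_positions)
    hidden = set(rng.sample(range(num_positions), n_masked))
    row = [pos in hidden for pos in range(num_positions)]
    return [list(row) for _ in range(num_frames)]


def mask_sensors(readings, ratio, rng):
    """Zero out whole sensor channels across all time steps.

    `readings[t][s]` is the value of sensor s at reading t. Zeroing
    (instead of dropping) keeps the temporal dimension intact.
    """
    num_sensors = len(readings[0])
    n_masked = round(ratio * num_sensors)
    hidden = set(rng.sample(range(num_sensors), n_masked))
    zeroed = [[0.0 if s in hidden else value for s, value in enumerate(row)]
              for row in readings]
    return zeroed, hidden


def masked_mse(target, prediction, mask):
    """Mean squared error computed over masked tokens only."""
    errors = [(t - p) ** 2
              for tok_t, tok_p, hidden in zip(target, prediction, mask) if hidden
              for t, p in zip(tok_t, tok_p)]
    return sum(errors) / len(errors) if errors else 0.0


def total_loss(loss_visual, loss_sensory, w_visual=1.0, w_sensory=5.0):
    """Weighted sum of the visual and sensory reconstruction losses."""
    return w_visual * loss_visual + w_sensory * loss_sensory
```

The larger sensory weight compensates for the much smaller number of sensory tokens, so that their reconstruction error is not drowned out by the visual term.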
Once training converges, the decoder is discarded and the encoder processes full, unmasked sequences. Spatially pooled visual tokens and sensory features are extracted, normalised, and further reduced, while temporal features are preserved. The resulting representation is fed into a clustering stage (see
Section 3.4). DBSCAN is adopted as the primary method with empirically tuned parameters, and K-means is evaluated as a baseline.
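To make the role of DBSCAN in this stage concrete, the following is a minimal, self-contained Python re-implementation of the algorithm applied to toy feature vectors. It is only a sketch: a practical pipeline would more likely rely on an optimised library implementation, the quadratic neighbour search is only suitable for small examples, and the parameter names `eps` and `min_samples` as well as any values passed to them are illustrative, not the empirically tuned settings mentioned above.

```python
import math

NOISE = -1  # label given to outliers, i.e. candidate anomalies


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN returning one cluster label per point.

    A point is a core point when at least `min_samples` points
    (itself included) lie within distance `eps`. Points reachable
    from no core point keep the NOISE label.
    """
    n = len(points)
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])
    return labels
```

Unlike K-means, this procedure does not need the number of clusters in advance and leaves isolated samples unassigned, which is the property exploited here for outlier detection.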
2.6. Mixture of Experts
Once the pouring events have been identified thanks to the movement detection (see
Section 2.3.1) and the relevant AoIs are detected and classified (
Section 2.3.2), the next step focuses on evaluating the manufacturing outcome of each cycle. Specifically, this stage analyses both (i) the visual state of each detected element involved in the pouring operation and (ii) the process performance according to the foundry control plan. The resulting measurements form a structured set of variables that will later be employed by the MoE system and its integrated rule-based expert system for data aggregation, correlation, and knowledge generation. The overall interaction between the different experts and the supervisory MoE layer is summarised in
Figure 11, which illustrates how heterogeneous assessments are combined into a single event-level decision.
For every detected pouring event
, the visual elements are interpreted to produce a set of semantic labels. In this case, the pouring stream is characterised by the anomaly score computed in
Section 3.4 as follows:
.
Moreover, the number of interruptions in the stream is also quantified as
, while the total pouring duration is defined as follows:
Furthermore, each mould cup in the conveyor is tracked before, during, and after reaching the pouring zone, giving the identification of its filling state like
The following cup (i.e., the incoming mould cup) is simultaneously monitored to detect undesired metal droplets or premature splashes, producing a binary safety flag , where denotes a warning condition. Similarly, the presence or absence of the tile (also known as the splash guard) is inferred to contextualise if the protection mechanisms are present.
These indicators allow each mould to be tracked individually along the span of the conveyor belt visible to the camera, detecting undesired metal losses or incomplete filling throughout its trajectory. This information is preserved and later aligned with the production references for final traceability.
Beyond the visual part, each pouring event is checked against the process constraints defined in the foundry control plan. For every event
, the following temporal variables are computed:
These variables are compared against the acceptable ranges established by the specific control specifications of the produced reference as follows:
When the process exceeds these tolerances, process deviations are identified, which may correspond to atypical or risky situations. Additionally, stream interruptions are recorded through and their temporal position within the event. These cuts are often correlated with turbulence, clogged nozzles, or thermomechanical instabilities. In addition, a prolonged idle time may indicate inoculation delays, metal cooling trends, or transient line imbalance that propagates downstream.
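The tolerance check described above can be illustrated with a minimal sketch. It is not the deployed implementation: the variable names (`pouring_time`, `inter_pour_interval`, `idle_time`), the `Range` type, and all numeric limits are hypothetical placeholders, since the real limits depend on the control specifications of the produced reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Closed interval [low, high] taken from a (hypothetical) control plan."""
    low: float
    high: float


def check_compliance(measured, plan):
    """Label each temporal variable as 'ok', 'below' or 'above' its tolerance."""
    result = {}
    for name, limits in plan.items():
        value = measured[name]
        if value < limits.low:
            result[name] = "below"
        elif value > limits.high:
            result[name] = "above"
        else:
            result[name] = "ok"
    return result


# Placeholder limits in seconds; real values come from the control plan.
plan = {
    "pouring_time": Range(8.0, 12.0),
    "inter_pour_interval": Range(0.0, 20.0),
    "idle_time": Range(0.0, 6.0),
}
measured = {"pouring_time": 7.2, "inter_pour_interval": 15.0, "idle_time": 9.5}
deviations = check_compliance(measured, plan)
```

In this toy event, the pouring time falls below its range and the idle time above it, so both would be reported as process deviations.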
Finally, each event is annotated with its corresponding operational context, obtained from the global context classifier (explained in
Section 2.2). Rather than a single binary label, all context changes that occurred during the event are logged as
, where
denotes the
k-th context state (e.g.,
normal,
maintenance,
overflow, and
stoppage), together with its start and end timestamps. This enables the measurement of the total duration of each context situation (
) and the identification of whether any portion of the event was affected by maintenance operations, operator occlusions, or technical stops. In that way, segments labelled as non-regular production are still stored for traceability but are excluded from strict quality assessment and statistical evaluation.
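A minimal sketch of this context log is given next. The `ContextSegment` type, the field names, and the choice of which states count as non-regular production are assumptions made for illustration only; the state labels are taken from the examples in the text.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumption: these states mark non-regular production for this example.
NON_REGULAR = {"maintenance", "stoppage"}


@dataclass(frozen=True)
class ContextSegment:
    state: str    # e.g. "normal", "maintenance", "overflow", "stoppage"
    start: float  # seconds from the start of the event
    end: float


def context_durations(segments):
    """Total time spent in each context state during one event."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.state] += seg.end - seg.start
    return dict(totals)


def valid_for_quality(segments):
    """An event is kept for strict quality assessment only if no segment is non-regular."""
    return not any(seg.state in NON_REGULAR for seg in segments)


segments = [
    ContextSegment("normal", 0.0, 6.5),
    ContextSegment("maintenance", 6.5, 8.0),
    ContextSegment("normal", 8.0, 11.0),
]
durations = context_durations(segments)
usable = valid_for_quality(segments)
```

Here the event is still stored with its per-state durations, but `usable` is `False` because part of it overlaps a maintenance segment.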
In summary, the manufacturing evaluation stage transforms raw visual observations and temporal measurements into a structured and fully interpretable description of each casting cycle. This representation captures not only the physical behaviour of the pouring stream and the mould cups but also the operational context and the compliance of the process with its reference control plan. Nevertheless, these heterogeneous outputs still constitute independent pieces of evidence that must be jointly analysed to infer whether the event is globally acceptable, risky, or defective. For this reason, the next stage introduces a MoE architecture that unifies all previously extracted indicators, applies a rule-based expert system grounded in foundry knowledge, and produces a final, coherent decision to be consumed by higher-level systems such as digital twin platforms or supervisory decision tools.
Unification and Digital Twin
At this point, all visual and temporal layers have been processed independently: temporally delimited pouring events have been extracted (Section 2.3.1), the areas of interest have been detected and segmented (Section 2.3.2), the manufacturing context has been classified (Section 2.2), the pouring stream has been interpreted in terms of continuity, turbulence, and flow interruptions (Section 3.4), and the process metrics have been computed. It then becomes necessary to unify these heterogeneous outputs into a consolidated and decision-ready representation of the pouring process.
To achieve this integration in a robust and interpretable manner, we adopt an explainable MoE-style paradigm [
110]. In the present work, the term MoE should not be understood as a fully trainable neural architecture with learned sparse routing but as a modular expert-fusion framework in which specialised analysis branches provide complementary evidence that is explicitly combined at event level. The emphasis is therefore placed on decomposition, specialisation, and transparent aggregation, rather than on end-to-end learned expert gating.
The suitability of such modular architectures for heterogeneous industrial data is further reinforced by recent advances in multi-source fusion and diagnostic modelling. Previous work has consistently shown that modular expert-based architectures are especially advantageous in scenarios such as foundry pouring control, where heterogeneous sensing modalities, strong temporal dependencies, and strict operational reliability constraints must be jointly handled within a single decision framework. For example, Ma et al. [
111] demonstrate that complex manufacturing processes benefit from architectures capable of unifying symbolic, temporal, and visual information within a common representation space, showing that heterogeneous data fusion significantly improves fault discrimination and process interpretability. Similarly, the survey by Han et al. [
112] highlights that modern fault diagnosis pipelines increasingly rely on expert-decomposed reasoning, where different submodels specialise in operating conditions, regimes, or sensor subsets. Finally, recent multimodal approaches such as the dual-attentive fusion model of Chu et al. [
113] demonstrate that selectively combining specialised feature extractors leads to superior robustness against noise, variable regimes, and transient disturbances.
In our system, the MoE-style fusion layer performs the following key functions: (i) aggregating all expert outputs into a single structured representation, (ii) evaluating consistency with process specifications through explicit rule-guided industrial logic, and (iii) generating a final judgement for each casting event in an interpretable manner. Moreover, for each detected event , the MoE maintains a temporal memory so that late visual updates (for example, an overflow detected after the mould cup has left the pouring zone) can be retroactively assigned to the correct event.
Accordingly, the raw feature set ingested by the MoE can be reformulated as follows:
where the following hold:
, , and denote the measured pouring duration, inter-pour interval, and idle time between moulds for event .
, , and are the control plan thresholds that define acceptable operating limits. These values are not binary decisions; specifically, they are numerical references used by the MoE to assess compliance with production standards.
represents the regressed anomalous status of the pouring stream.
denotes the ordered set of mould cups observed during , each tagged with its filling state.
is a binary or categorical indicator describing the state of the protective tile during event .
encodes detected overflow or metal splashes affecting mould position j.
stores the time-resolved evolution of the manufacturing context throughout event .
The rule-based layer operates as a deterministic supervisor that interprets according to domain knowledge. A simplified subset of the logic is as follows:
Pouring time rule: If , then (underpouring); If , then (overpouring).
Inter pouring rule: If , then .
Idle rule: if , then .
Stream integrity rule: The MoE evaluates the stability of the pouring stream by jointly analysing the following: (i) the number of flow interruptions , (ii) their durations , (iii) their quartile locations within the event , and (iv) the deviation of the actual pouring time from the control plan target . The resulting stream integrity rating is determined as follows:
- High risk: Long interruptions ( above the control plan tolerance) or cuts occurring in the initial or early quartiles. These situations often produce partial filling, cold shuts, or early loss of stream coherence, making them critical for defect formation.
- Medium risk: Interruptions of short duration located in middle or late quartiles, or a moderate number of cuts () that do not exceed duration thresholds. Deviations of from the reference target also increase the risk to a medium level.
- Low risk: Short and isolated cuts occurring in the last quartile, where their impact on filling quality is typically minimal, and the pouring duration remains close to the control plan target.
To formalise this rule into a quantitative and interpretable metric, the MoE computes a
stream integrity rating for each event
, defined as follows:
where the following hold:
- is the total number of detected interruptions.
- is the duration of the k-th interruption.
- is the quartile index of the interruption.
- is a monotonically decreasing quartile weight (), reflecting higher sensitivity to early cuts.
- is the nominal pouring duration from the control plan.
- are expert-defined coefficients calibrated to the sensitivity of the process.
The scalar is then used by the MoE as an aggregated quality indicator whose value increases with the number of cuts , their severity, their temporal position within the event, and the deviation of from the expected reference value.
Cup fill consistency rule: If any mould cup transitions from full to medium after leaving the pouring zone, the system flags metal loss after pouring.
Context rule: If any context segment within corresponds to maintenance or occlusion, then the event is marked as non-valid for quality.
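To give an intuition of how the stream integrity rating behaves, the sketch below assumes a simple weighted sum of three terms: the number of cuts, their quartile-weighted durations, and the relative deviation of the pouring time from its nominal value. The functional form, the quartile weights, and the coefficients `alpha`, `beta`, and `gamma` are illustrative assumptions chosen only to reproduce the qualitative properties stated above; they are not the calibrated values of the deployed system.

```python
from dataclasses import dataclass

# Hypothetical, monotonically decreasing quartile weights (early cuts weigh more).
QUARTILE_WEIGHT = {1: 1.0, 2: 0.75, 3: 0.5, 4: 0.25}


@dataclass(frozen=True)
class Interruption:
    duration: float  # seconds
    quartile: int    # 1..4, position of the cut within the event


def stream_integrity_rating(cuts, pour_time, nominal_time,
                            alpha=0.5, beta=1.0, gamma=1.0):
    """Assumed weighted sum; higher values mean a less stable stream."""
    count_term = alpha * len(cuts)
    severity_term = beta * sum(
        QUARTILE_WEIGHT[c.quartile] * c.duration for c in cuts
    ) / nominal_time
    deviation_term = gamma * abs(pour_time - nominal_time) / nominal_time
    return count_term + severity_term + deviation_term


clean = stream_integrity_rating([], pour_time=10.0, nominal_time=10.0)
early_cut = stream_integrity_rating([Interruption(1.0, 1)], 10.0, 10.0)
late_cut = stream_integrity_rating([Interruption(1.0, 4)], 10.0, 10.0)
```

With these placeholder values, an uninterrupted pour at nominal duration scores zero, and a one-second cut in the first quartile scores higher than the same cut in the last quartile, which mirrors the ordering of the risk levels.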
Once the MoE finalises the event assessment, the system serialises all results into a structured JSON document. This format enables real-time communication through a WebSocket-based publisher/subscriber interface and ensures interoperability with external software layers, including supervisory dashboards, MES platforms, and digital twin environments.
The purpose of this interface is not to claim that process quality is evaluated by the digital twin itself, but to expose the event-level knowledge generated by the proposed framework in a machine-readable and operationally meaningful form. In particular, the JSON package includes raw measurements, intermediate expert outputs, contextual indicators, and the final MoE decision, so that downstream systems can ingest not only a final label, but also the underlying evidence supporting it.
Accordingly, the contribution of this layer is to enrich higher-level plant software with semantically structured pouring-event information that was not previously available in this explicit form. This facilitates downstream monitoring, traceability, replay, and future causal or predictive analyses, while keeping the proposed framework fully compatible with digital twin-based deployments.
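As an illustration of what such an event-level package could look like, the following sketch serialises a hypothetical report with the standard `json` module. The `PouringEventReport` type, every field name, and all values are invented for this example and do not reflect the actual schema; the transport layer (the WebSocket publisher) is deliberately left out.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PouringEventReport:
    """Hypothetical event-level record; field names are illustrative only."""
    event_id: str
    measurements: dict
    expert_outputs: dict
    context: list
    decision: str
    reasons: list = field(default_factory=list)


report = PouringEventReport(
    event_id="evt-000123",
    measurements={"pouring_time_s": 7.2, "interruptions": 1},
    expert_outputs={"cup_state": "full", "next_cup": "OK", "tile_present": True},
    context=[{"state": "normal", "start_s": 0.0, "end_s": 7.2}],
    decision="risky",
    reasons=["pouring_time_below_tolerance"],
)

# Serialise to a JSON string and read it back, as a subscriber would.
payload = json.dumps(asdict(report), sort_keys=True)
restored = json.loads(payload)
```

Keeping the raw measurements and the list of triggered rules next to the final label is what lets a downstream consumer trace a decision back to its supporting evidence.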
3. Results
In this section we report the experimental performance of each functional block introduced in
Section 2. In particular, we first evaluate the local image interpretation and event detection, then the manufacturing context classifier, the AoI evaluation modules, and anomalous pouring stream results, and finally the global MoE aggregation.
3.1. Global Context Evaluation
The environmental context classifier exhibited a fast, stable, and monotonic convergence throughout the training process.
Figure 12 summarises the evolution of the main learning metrics across the training epochs. As shown, the
train/loss decreases steadily from 0.198 to 0.057, while the validation loss mirrors this behaviour, reaching 0.025 at the final epoch. This sustained and simultaneous reduction of both curves indicates an absence of overfitting and a highly consistent optimisation process.
In parallel, the classification accuracy increases rapidly. The top-1 accuracy improves from 97.4% in the very first epoch to 99.1% by epoch 16, reflecting the strong visual separability of the context categories. The validation loss decreases smoothly without oscillations, suggesting that the model generalises well even in the presence of the significant class imbalance described in
Section 2.2. Interestingly, only a small number of epochs are required to reach high performance, which is consistent with the characteristic behaviour of YOLOv11 classification models when trained on visually distinct categories. According to the industrial visual inspection literature, top-1 accuracy values above 0.95 and small train–validation loss gaps are considered indicators of reliable deployment in production environments.
The results confirm that the model reliably extracts and discriminates the contextual states associated with the four production conditions (i.e., normal, maintenance, stopline, and overpouring). This robustness is essential for downstream integration into the MoE, where the environmental context acts as a gating signal for validating the usability of each pouring event.
3.2. Local Image Interpretation and Event Detection
We first quantify the ability of the system to detect pouring events over long sequences, and then evaluate the performance of the AoI detector.
3.2.1. Pouring Event Detection
To evaluate the robustness of the event detector under realistic operating conditions, we use one hour of continuous production video with its corresponding sensor readings. These signals record, among other variables, the number of produced moulds. We then compare the automatically detected mould counts with the ground-truth number of pours.
Table 2 summarises the statistics of this experiment.
Our approach underestimates the number of moulds by only . The mean production rate measured by the detector ( moulds/min) is therefore very close to the actual rate ( moulds/min), and the ratio between detected and real counts remains within . The missed events are mostly associated with non-standard situations, such as contrast variations caused by molten metal that reduce or maintenance periods during which personnel partially occlude the camera view. Nonetheless, thanks to additional detectors downstream, we are able to identify these atypical situations and label them accordingly.
From a qualitative perspective, the annotated video confirms that the detector behaves consistently under regular operating conditions. It successfully tracks consecutive pours, short interruptions, and occasional variations in the number of active streams or the flow geometry. These patterns are clearly reflected in the temporal series of detected events, which will later be exploited for the characterisation of different pouring conditions.
3.2.2. AoI Detection
The second component of the foundry image interpreter focuses on the frame-wise detection of the AoIs involved in the pouring process. As discussed in
Section 2.3.2, the detector was trained to recognise four semantic regions as follows: (i)
pouring_stream, (ii)
pouring_cup, (iii)
tile, and (iv)
next_pouring_cup.
The resulting model achieves an average precision of
To further illustrate the stability of detection across confidence thresholds, the precision–recall curves in
Figure 13 summarise the behaviour for each class as well as the aggregated performance. Indeed, all curves remain close to the upper-right corner, showing that the detector maintains precision above 0.98 for almost the complete recall range. Furthermore, the aggregated curve exhibits a typical “flat–steep” profile associated with models that rarely trade recall for false positives, which is particularly beneficial for downstream temporal reasoning.
The confusion matrix in
Table 3 provides an additional perspective on the residual detection errors. For the four target AoI classes, the matrix is strongly diagonal and remains fully consistent with the high precision and mAP values reported above. Only a very small number of cross-class confusions is observed.
The background-related entries must be interpreted with care. In this confusion matrix, the background column does not represent all background pixels or image regions; instead, it collects unmatched detections, i.e., false positives. In absolute terms, these correspond to 48 residual false-positive detections over the whole evaluation set, of which thirty-five were assigned to pouring_cup, eleven to pouring_stream, and one case each to tile and next_pouring_cup. These residual errors are mainly associated with borderline visual situations, such as faint stream appearance at the beginning or end of a pour, or local structures around the cup region that resemble valid AoIs.
Taken together, the detection results show that the AoI model provides a highly trustworthy spatial description of each frame. This is particularly relevant because the output of the AoI detector serves as the spatial foundation for temporal reasoning, environmental assessment, and expert fusion.
3.3. Area-of-Interest Visual Classification
The specific AoI classifiers operate on the regions detected by the perception module, assigning semantic labels to two key visual elements of the pouring line: firstly, the
mould cups that have already been poured and, secondly, the
incoming mould cup that will be poured in the conveyor sequence. Both classifiers employ the same training protocol described in
Section 2.2, including an identical data management strategy, augmentation policies, loss functions, and evaluation metrics. This ensures methodological consistency across all recognition and classification tasks.
3.3.1. Mould Cup Filling State Classification
The first AoI classifier estimates the filling state of each mould cup after the pouring operation. Specifically, it classifies each pouring cup into four semantic classes as follows: empty, medium, full, or overpouring. This information is essential to track metal losses along the conveyor and detect underfilling or overflow episodes.
The accuracy curve in
Figure 14 shows the evolution of the accuracy of the model across training. Starting from an initial top-1 accuracy of approximately
, the classifier quickly surpasses
and eventually stabilises close to
after 300 epochs. Once this high accuracy value is reached, no significant degradation or oscillations are observed, suggesting that the model maintains its discriminative power across the entire training schedule. From an operational point of view, this behaviour implies that the classifier can reliably separate the four filling classes. As a result, it becomes a dependable source of information for monitoring metal usage, automatically flagging abnormal filling patterns and feeding accurate cup measurements into the MoE decision layer.
3.3.2. Incoming Mould Cup Classification
The incoming mould cup binary classifier determines whether the next mould in the conveyor is ready for pouring without any problem (OK) or presents residual metal or internal contamination (Warning).
The accuracy curve shown in
Figure 15 indicates a rapid improvement during the initial epochs, followed by a steady plateau, reaching a top-1 accuracy slightly above
on the validation set. These results are consistent with the qualitative performance observed on the validation samples: the classifier produces stable predictions across different operating conditions, reliably distinguishing clean from contaminated incoming mould cups. These results imply a low rate of missed
Warning events and a limited number of false alarms, which is critical for continuous deployment in a production line.
The achieved performance for incoming mould cup classification provides a robust signal that is subsequently integrated into the MoE system (see
Section 3.5). Its output contributes to ensuring quality at an early stage of the pouring cycle, enabling predictive alerts and improving the traceability of potentially compromised moulds.
3.4. Pouring Stream Interpretation
Next, we discuss the results obtained when categorising a specific pouring event as anomalous or normal. We first evaluate the influence of DBSCAN by comparing it with K-means, and then explore how the DBSCAN parameters affect the results. Finally, we assess the contribution of the sensory information, added either through early fusion (see
Section 3.4) or late fusion, where we concatenate sensory tokens to visual features right before clustering. Importantly, due to the rarity and subtle nature of anomalies in this industrial context, manual per-event annotation is both logistically unfeasible and economically prohibitive, as it requires intensive expert visual inspection that does not scale. Consequently, we validate our unsupervised approach through a dual strategy: first, by confirming that our detected anomaly ratios align with historical plant failure frequencies, which provides a high-level quantitative proxy for the model's physical accuracy, and second, through qualitative expert verification to help ascertain that the expected types of anomalies were indeed detected. Rather than replacing human judgment, this method serves as a critical pre-filtering stage; by isolating and removing high-confidence 'normal' sequences, it reduces the manual annotation burden, allowing domain experts to focus their limited resources on the most ambiguous and high-value edge cases.
Given a video of length N frames, we extract sequences of length L with temporal stride s. Anchor positions are uniformly distributed across valid starting frames (1 to ), with anchors ensuring complete temporal coverage. To maximise data utilisation, we generate sequences from starting offsets for each anchor position o, where consecutive sequences differ by a single frame shift. This sampling strategy ensures that each temporal segment is represented by s overlapping sequences, capturing complementary temporal information while maintaining computational efficiency. Notably, this differs from dense sampling, where all possible sequences would be sampled (e.g., for a pouring event of 82 frames, dense sampling would yield fifty-one sequences, whereas ours with and produces only six). In practice, we sample 16-frame sequences at a stride of 2 (i.e., covering a span of 1.6 s or 32 frames) and a resolution of (by reshaping the frames). Interestingly, we observe that two consecutive overlapping sequences (generally showing the same part of the video but shifted one frame into the future) tend to be assigned to the same cluster, which supports the idea that the learned representation is semantically meaningful.
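The anchor-based sampling can be sketched as follows. The number of anchors (here the ceiling of the video length divided by the sequence span) and their exact placement are assumptions, chosen only because they reproduce the counts quoted above for an 82-frame event (fifty-one dense sequences versus six sampled ones); the function names are hypothetical and indices are zero-based.

```python
import math


def sample_sequence_starts(n_frames, seq_len=16, stride=2):
    """Return the starting frames of the sampled sequences (zero-based)."""
    span = seq_len * stride            # frames covered by one sequence
    n_valid = n_frames - span + 1      # number of valid starting frames
    if n_valid <= 0:
        return []
    n_anchors = math.ceil(n_frames / span)  # assumption, not taken from the source
    last_anchor = max(n_valid - stride, 0)  # keeps every offset inside the valid range
    if n_anchors == 1:
        anchors = [0]
    else:
        anchors = [i * last_anchor // (n_anchors - 1) for i in range(n_anchors)]
    # Each anchor o contributes the offsets o, o + 1, ..., o + stride - 1.
    starts = {a + k for a in anchors for k in range(stride) if a + k < n_valid}
    return sorted(starts)


def frame_indices(start, seq_len=16, stride=2):
    """Frames actually read for the sequence beginning at `start`."""
    return [start + i * stride for i in range(seq_len)]


dense_count = 82 - 16 * 2 + 1          # every valid start, as in dense sampling
starts = sample_sequence_starts(82)
```

For the 82-frame example this yields three anchors with two one-frame-shifted offsets each, so consecutive sequences from the same anchor share almost all of their temporal support.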
We train the vision-only method for 800 epochs and the multimodal ones for 1600, as they take longer to converge. We observed that sensor prediction quickly converged to local minima, so we added a weight to the loss (×5) to account for this difference. In total, we sample 18,496 video sequences, both for training and evaluation. Each sequence results in ∼300 K features, which we then run through a 2 × 2 average spatial pooling twice and normalise them, resulting in ∼55 K features for vision-only data and ∼62 K for multimodal data.
3.4.1. K-means: Elbow and Silhouette
K-means requires the number of clusters to be specified in advance. Several methods have been proposed to guide this choice. One common approach is the elbow method [
114], which plots the explained variance against the number of clusters and looks for a point where the marginal gain in explained variance starts plateauing. However, this method is often criticised for its subjectivity, as the “elbow” is not always clearly defined. More robust quantitative techniques have therefore been developed, such as the average silhouette width [
115]. The silhouette value measures how similar an object is to others within its own cluster (cohesion) relative to objects in other clusters (separation). It ranges from −1 to +1, with higher values indicating that points are well matched to their assigned cluster and clearly distinct from neighbouring clusters. We perform this analysis on our simplest model, using only the visual features from the VideoMAE encoder.
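For reference, the average silhouette width can be computed directly from its definition. The sketch below is a plain-Python illustration on toy two-dimensional points, not the pipeline used on the VideoMAE features; in practice a library implementation would normally be preferred. It assumes at least two clusters and assigns a score of zero to singleton clusters, which is a common convention.

```python
import math


def mean_silhouette(points, labels):
    """Average silhouette width for labelled points (requires >= 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster (cohesion).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the closest other cluster (separation).
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = mean_silhouette(points, [0, 0, 0, 1, 1, 1])   # matches the two groups
bad = mean_silhouette(points, [0, 1, 0, 1, 0, 1])    # mixes the two groups
```

A labelling that follows the two compact groups scores close to +1, whereas a labelling that mixes them scores near zero, which is the behaviour exploited when choosing the number of clusters.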
The elbow method suggests (see
Figure 16) that an appropriate number of clusters is seven. However, qualitative analysis shows that these clusters have too much variation towards the edges. We will also see this with DBSCAN, although to a lesser degree. In contrast, the appropriate number of clusters according to the silhouette analysis is five (shown in
Figure 17). Indeed, qualitatively speaking, using five clusters with K-means seems to make more sense. We can also observe this in the t-SNE projection shown in the figure, where using five clusters gives a clear separation, while using more results in more convoluted and overlapping clusters. The main categories revealed by this clustering are the most visually distinct ones, and they reappear in all subsequent clustering attempts. These are as follows:
Mostly normal pouring sequences.
Mostly normal pouring sequences, but a tube occludes part of the stream; the mould is big.
Mostly normal pouring sequences, but a tube occludes part of the stream; the mould is small.
Light overflow which is promptly reabsorbed.
As we will also see later, sequences near the centre of each cluster tend to be normal, whereas those approaching the edge become increasingly atypical for that cluster (i.e., anomalous). Exploiting this would require setting a threshold on the distance to the centroid beyond which sequences are considered anomalous. Such a threshold is not clear-cut and would require further analysis, as the anomalous events near the edge are not straightforward to separate.
With K-means, as we increase the number of clusters, we observe that these four categories are split into multiple clusters which are difficult to differentiate qualitatively. This can be clearly seen in
Figure 18, where results for
and
are shown. In the 2D visualisation (
Figure 18a,b), many clusters appear that seem to be splitting unnecessarily a single one into more than two when using
, as the algorithm attempts an even distribution of samples across clusters. In the 3D visualisation (
Figure 18c,d), it is easy to see how many clusters seem to overlap, having samples within neighbouring clusters. This explains the lower silhouette score.
3.4.2. DBSCAN: ε and min_samples
Unlike K-means, DBSCAN uses density to assess whether samples should be included in a cluster, allowing for clusters of different sizes and shapes. This is controlled by two key parameters as follows: ε and min_samples. The former controls the maximum distance up to which neighbouring samples are considered. The latter sets how many samples must lie within that distance for a point to act as a core point of a cluster. Intuitively, denser regions will form the centre of a cluster, as neighbouring samples reinforce each other's membership, whereas edge regions will not have enough samples, thus leaving outlier samples outside the given cluster. We find this feature of DBSCAN very useful for our use case, where finding anomalous instances is needed.
To find an appropriate ε, we first compute a pairwise distance matrix between all encoded video sequences (with the plain visual features). Through empirical tests, we find that the mean distance (356) minus the standard deviation of the distances (85), i.e., approximately 270, is a reasonable upper bound for ε. Finding the proper value for ε is not straightforward, as there is a fine interplay with min_samples.
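The heuristic of taking the mean pairwise distance minus its standard deviation can be written compactly. The sketch below uses toy two-dimensional points rather than the encoded sequences; the function name `eps_ceiling` and the use of the population standard deviation are assumptions, as the text does not specify which estimator was used.

```python
import math
import statistics
from itertools import combinations


def eps_ceiling(features):
    """Upper bound for the neighbourhood radius: mean minus std of pairwise distances."""
    distances = [math.dist(a, b) for a, b in combinations(features, 2)]
    # Assumption: population standard deviation over all unordered pairs.
    return statistics.mean(distances) - statistics.pstdev(distances)


# Toy features; the pairwise distances are 5, 10 and 5.
features = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
ceiling = eps_ceiling(features)

# The same arithmetic with the values reported in the text.
reported_ceiling = 356 - 85
```

With the reported statistics this gives 271, in line with the rounded value of 270 quoted above.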
Using too large a distance (i.e., ) allows too many samples to be clustered together. In all our experiments with this setting, around 60% of samples are grouped into one cluster and around 30% into a secondary one. Both contain what, upon inspection, we consider to be normal samples. Very few samples fall into one or two additional very small clusters or into the outlier group. However, as we detail later, we find these to be of little semantic significance. The main problem we observe with this larger ε is that few samples fall into the outlier group, which complicates the identification of anomalous samples.
When
is small (i.e.,
), it causes the algorithm to be more sensitive to the choice of
min_samples. Using
min_samples results in many very small clusters (with
samples), as they cannot be clustered with enough nearby samples (see
Figure 19b). This phenomenon is driven by the interaction between these two parameters, but also by the high similarity between overlapping video sequences, which provides sufficient local density to meet the clustering criteria despite the lack of broader statistical significance. These clusters are often hyper-specific to particular video segments and do not represent generalised anomalous behaviour. Nonetheless, in some cases we did find examples where very canonical anomalies appeared, which helped us identify the kind of outliers we had included in our dataset. In fact, as we increase
min_samples for a given
, we also observe an increase in the samples that have no cluster assigned (i.e., they fall in the outlier cluster, see
Figure 19a). This becomes increasingly evident with low
values. We believe that these small clusters are not allowed to form, and instead, those samples are categorised as either anomalous or normal in the bigger clusters. These clusters, despite being too small and overly specific, seem to indicate that VideoMAE is capable of producing features that group these sequences together, allowing for such fine-grained categorisation. Nevertheless, we believe that properly finding these clusters would require a more fine-grained hyperparameter search for the clustering algorithm. A potential avenue for further research could also entail exploring the ratio of sequence overlap to explicitly deal with these small clusters.
A small
and small
min_samples require denser areas to form clusters. On the one hand, with a small
, it is harder to find enough neighbouring samples such that the
min_samples requirement is met, resulting in more outlier samples. On the other hand, as a small
min_samples allows for inclusion in a cluster with just a few samples, this results in many overly small clusters. As
is increased, more distant samples are considered (those assigned to bigger clusters), resulting in those samples being clustered together and reducing the number of small clusters. And as
min_samples is increased, more neighbouring samples are needed for a sample to be assigned to a cluster, causing most samples in less dense areas to be classified as outliers. Overall, as can be seen in
Figure 19, for a given
, increasing
min_samples will result in fewer small clusters and more outlier samples. Conversely, for a given
min_samples, increasing
will result in fewer outlier samples and fewer small clusters. The distribution of samples across the four dominant clusters for the tested parameter combinations is shown in
Figure 20.
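The interplay between the two parameters can be reproduced on a toy example. The sketch below is a minimal, quadratic-time DBSCAN written in plain Python for illustration; it is not the implementation used in the experiments, and the points, radii, and `min_samples` values are invented. As in common implementations, a point counts as its own neighbour.

```python
import math
from collections import Counter

NOISE = -1


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN; returns one label per point, with -1 for outliers."""
    n = len(points)
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE          # may later become a border point
            continue
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # expand only through core points
        cluster += 1
    return labels


def summarise(labels):
    sizes = Counter(lab for lab in labels if lab != NOISE)
    return {"clusters": len(sizes), "outliers": labels.count(NOISE)}


dense = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.0, 0.1), (0.1, 0.1), (0.2, 0.1)]
small = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
isolated = [(9.0, 9.0)]
points = dense + small + isolated

loose = summarise(dbscan(points, eps=0.5, min_samples=2))
strict = summarise(dbscan(points, eps=0.5, min_samples=4))
wide = summarise(dbscan(points, eps=6.0, min_samples=4))
```

At a fixed radius, raising `min_samples` dissolves the three-point group into outliers, while enlarging the radius at a fixed `min_samples` absorbs those outliers into a cluster again, which is the qualitative trend discussed above.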
In general, we observe the formation of four more or less stable clusters, which are qualitatively equivalent to the ones described when discussing the results with K-means, as follows: (1) normal pouring events, (2 and 3) normal pouring events with a tube in front of the frame (with both big and small mould), and (4) slight overflow promptly reabsorbed. These are consistent over almost all combinations of parameters tested, failing to form only in cases where either ε is too big or min_samples is large when ε is small. The first is generally the largest and most stable cluster regardless of the values of the two parameters, containing almost half the sequences. The next two are the same size, adding up to a third of the sequences. The fourth and final cluster is the most variable, ranging from ∼800 samples up to ∼2000. In this regard, we observe that the smaller the value of ε, the more sensitive the results are to the choice of min_samples. For example, for , we can barely observe any difference in the results when varying min_samples.
Qualitatively speaking, we find two main types of anomalies. On the one hand, there are expected anomalies, which are known to cause potential casting defects; most of them are shown in
Figure 21. These events are closely related to common defects observed in continuous casting, such as incomplete mould filling, improper nucleation resulting in coarse microstructures, segregation phenomena, thermal cracking, and surface or internal defects like inclusions and porosity. In addition, we observe examples where the stopper was moving while pouring was active. On the other hand, the algorithm also finds anomalous samples which can be explained by sampling errors or by a failure to properly detect the pouring event or position. These are shown in
Figure 22. Because samples are short 16-frame sequences, this group also includes the less common sequences at the beginning and end of pouring events, which contain some frames where no pouring occurs and are sometimes wrongly categorised as outliers. We also find samples where the mould is moving, which is normal behaviour outside of pouring but is categorised as an outlier because few sequences include it.
The experimental results demonstrate that the combination of VideoMAE features and DBSCAN effectively partitions the latent space, successfully isolating evident anomalies from the baseline. As illustrated in
Figure 21, the model clearly separates both expected process anomalies and sampling-related irregularities. However, a more nuanced analysis reveals that while the core clusters are well-defined, challenges remain in characterising “edge cases” situated at the transition between normal and anomalous behaviour. Specifically, the model occasionally flags the start and end of pouring events as anomalies (
Figure 22). Conversely, certain subtle defects—such as misaligned or skewed streams—are sometimes absorbed into normal clusters. These instances suggest the presence of environmental confounders, such as auxiliary piping or casting splashes, which may bias the feature representation.
In this respect, we observe that normal clusters often exhibit a higher density of anomalous samples near their boundaries. This indicates that a distance-based anomaly score, measuring the distance of a sample to its respective cluster centroid, could reinforce the classification of these ambiguous cases. Furthermore, while the current DBSCAN approach provides a robust baseline, the use of hierarchical alternatives like HDBSCAN could better account for the varying densities of industrial data, potentially offering a more granular separation of the border regions.
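The distance-based score suggested above can be illustrated with a minimal sketch. It assumes that each sequence is already described by a feature vector and a DBSCAN label (with -1 marking noise, as in common DBSCAN implementations); the function name and the normalisation by the largest in-cluster distance are illustrative choices, not part of the reported pipeline.

```python
import math
from collections import defaultdict

NOISE = -1  # common DBSCAN convention: -1 marks samples assigned to no cluster


def centroid_distance_scores(features, labels):
    """Score each sample by its distance to the centroid of its own cluster.

    Distances are divided by the largest distance observed in the same
    cluster, so clustered samples score in [0, 1]. Noise samples keep the
    maximum score of 1.0 (an assumption made for this sketch).
    """
    # Group sample indices by cluster label, skipping noise.
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != NOISE:
            members[label].append(idx)

    scores = [1.0] * len(labels)
    for indices in members.values():
        dim = len(features[indices[0]])
        # Centroid = per-dimension mean of the cluster members.
        centroid = [
            sum(features[i][d] for i in indices) / len(indices)
            for d in range(dim)
        ]
        dists = {i: math.dist(features[i], centroid) for i in indices}
        radius = max(dists.values()) or 1.0  # guard against a zero radius
        for i, dist in dists.items():
            scores[i] = dist / radius
    return scores
```

Under these assumptions, samples with a score close to 1 lie at the edge of their cluster and would be the first candidates for expert review, while samples near 0 sit at the core of a normal cluster.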
Despite these boundary challenges, the current approach represents a significant step forward in unsupervised industrial monitoring. By successfully isolating the most distinct anomalies, this method enables a human-in-the-loop workflow. Specifically, it allows domain experts to focus their labelling efforts only on the subset of ambiguous sequences identified here, effectively reducing the manual annotation burden by filtering out sequences that can be confidently categorised as normal.
3.4.3. Multimodality
When multimodality is assessed in terms of global cluster structure, we do not observe substantial differences between early and late fusion, and the four dominant “normal” clusters remain broadly stable. In this sense, the addition of sensory data does not dramatically reorganise the main visual topology of the embedding space. Specifically,
Figure 23 shows the number of samples assigned to the four most stable large clusters across different DBSCAN parameter combinations for the multimodal setting. However, information from sensors does affect the downstream anomaly allocation. In particular, the multimodal variants tend to produce a larger number of outlier samples and a higher number of small residual clusters, indicating that the additional signals influence local neighbourhood structure even when the dominant clusters remain visually similar. Part of this effect may be related to the fact that eps was initialised from distances computed on vision-only features, which can favour a stronger fragmentation of the multimodal embedding for small eps values.
Therefore, the practical benefit of multimodality should not be interpreted as a major change in the global cluster layout, but rather as an improvement in event-level anomaly prioritisation (the multimodal variants tend to produce a larger number of outlier samples across the explored combinations; see
Figure 24). Although the excess of very small clusters may complicate the regression of a continuous anomaly score, the multimodal variants are more effective at surfacing potentially anomalous pouring events for subsequent inspection.
Another difference appears for eps = 270. With sensory information, the whole dataset (except for very few outliers, ∼300) falls into a single cluster. In contrast, with visual information only, there are two main clusters: one with most of the sequences (∼11 K) and another in which the tube occludes part of the stream (∼6 K).
Finally, we believe that the additional samples falling outside the defined clusters may be explained by anomalies present only in the sensory data, which cannot be observed as easily as visual ones. In this regard, some of the anomalies found could be either genuine sensory anomalies or, alternatively, noisy sensor values. Nonetheless, this shows that the sensory data are successfully integrated, as the clustering algorithm clearly uses them to find anomalous samples.
3.4.4. Agreement
To compute the final score for a given pouring event, we use the fraction of sequences from that pouring event that are assigned to no cluster, resulting in a value from 0 to 1. We empirically set a threshold of 0.6 to consider a pouring event as anomalous, but, for the final MoE, we use the continuous value instead. The choice of this threshold is guided by two key factors: (1) the ratio of expected defective moulds caused by pouring anomalies, and (2) a qualitative analysis of the produced results, supported by expert knowledge of anomalous and normal events.
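The event-level score described above can be sketched as follows. The sketch assumes that every sequence carries the identifier of its pouring event and a DBSCAN label (with -1 for noise); the function names are hypothetical, and treating a score equal to the threshold as anomalous is an assumption, since the text does not state whether the comparison is strict.

```python
from collections import defaultdict

NOISE = -1               # common DBSCAN convention for unclustered samples
ANOMALY_THRESHOLD = 0.6  # empirical threshold reported in the text


def event_anomaly_scores(event_ids, labels):
    """Return, per pouring event, the fraction of its sequences labelled as noise."""
    totals = defaultdict(int)
    noise = defaultdict(int)
    for event_id, label in zip(event_ids, labels):
        totals[event_id] += 1
        if label == NOISE:
            noise[event_id] += 1
    return {event_id: noise[event_id] / totals[event_id] for event_id in totals}


def is_anomalous(score, threshold=ANOMALY_THRESHOLD):
    """Binary view of the score; '>=' is an assumption of this sketch."""
    return score >= threshold
```

For instance, an event with five sequences of which four are noise obtains a score of 0.8 and is flagged, whereas the continuous score itself is what would be passed on to the MoE stage.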
From ground-truth information provided by the factory where our data were collected, we know that ∼3% of the moulds are rejected because of defects originating from a pouring-related fault. In this sense, from the 539 detected moulds detailed in
Table 2, we identify 549 pouring events, of which we should expect ∼16 to be anomalous. However, as we saw above, our system currently also finds anomalies that were caused by a sampling problem rather than by actual pouring problems. Consequently, we consider that 16 should be taken as a lower bound.
In
Figure 25, we show the distribution of the number of detected anomalous pouring events using a threshold of 0.6 on the anomaly score. From the 549 detected pouring events, plant-level statistics suggest that approximately 16 events should be considered anomalous as a lower bound (about 3%). In this respect, only the multimodal variants retrieve numbers of anomalous pouring events that are consistent with this expected range, whereas the vision-only variants do not show the same level of agreement. Qualitative analysis of the multimodal configurations with
and
min_samples further supports this interpretation as follows: approximately 60% of the pouring events marked as anomalous indeed show expected anomalies (see
Figure 21), around 20% can be attributed to sampling errors (see
Figure 22), and the remaining cases appear visually normal but may reflect irregularities captured mainly by the sensory channels. Overall, these results indicate that the main benefit of multimodality lies in improving event-level anomaly prioritisation, rather than in dramatically altering the global cluster structure.
3.5. Mixture of Expert Style Fusion
The manufacturing evaluation stage operates on the set of pouring events successfully detected by the vision pipeline. For this evaluation, we considered the 539 pouring events correctly detected and reported in
Table 2. For each detected pouring event
, we computed a set of temporal indicators, namely, pouring duration
, the inter-pour time
, and the mould idle time
. After excluding problematic cases with missing or unreliable measurements, these variables exhibit a stable and relatively narrow distribution across regular production cycles. As reported in
Table 4, the mean pouring duration is
, with moderate dispersion, while idle times remain consistently within the expected operational range. Only a small fraction of events shows deviations in
, typically correlated with context states such as short stoppages or maintenance actions identified by the global context classifier. Regarding the visual evaluation of the pouring stream, most events present a single continuous segment, corresponding to uninterrupted pouring. In fact, stream interruptions, quantified by the variable
, are observed in a limited subset of events and usually occur near the beginning or end of the pouring interval. Such cases are often associated with transient instabilities or borderline visual conditions rather than systematic failures.
In summary, the manufacturing evaluation stage focuses on extracting, validating, and structuring heterogeneous indicators describing each pouring cycle, including visual states, temporal measurements, and process-level constraints. At this point, all variables are preserved as independent evidence sources, without enforcing a final quality decision. The joint interpretation of these indicators, together with the domain-specific rules and contextual information, is managed in the next stage through an explainable MoE-style fusion architecture. This type of unification transforms the measured evidence into a coherent and high-level event assessment suitable for downstream integration with supervisory and digital twin-compatible systems.
MoE Unification and Digital Twin
The previous stages provide complementary, but still independent, observations of each pouring cycle as follows: (i) the temporal delimitation of pouring events, (ii) the localisation of key AoIs, and (iii) the frame-level interpretation of stream and mould states. The role of the present layer is to convert this heterogeneous set of evidence into a single event-centred representation that can be queried, stored, and acted upon at line speed. Rather than performing additional vision processing, the focus is on evidence fusion. More specifically, this layer combines temporal variables, geometric cues, semantic labels, and contextual indicators into a coherent event-level assessment, enabling internal evidence traceability, early quality screening, and real-time interoperability with supervisory and digital twin-compatible platforms.
Each mould cycle is represented as a
POURING_EVENT record indexed by the event window
(see
Section 2.3.1). Specifically, the record stores several items of information such as (i) the event timing (
,
,
), (ii) the local interpretation of the scene (i.e., anomalous status
, number of cuts
, mould filling state
, tile presence and incoming warning flag
), and (iii) the operational context segments
coming from the global environment evaluator.
Since movement-based event identification is affected by occlusions and maintenance operations, the final population of events must be interpreted together with the detection statistics already reported in
Table 2.
Before merging the information, the MoE layer applies a lightweight validation step to ensure that the recorded event is temporally coherent, as follows:
Events that fail these constraints are still stored for traceability. However, they are flagged as non-assessable and excluded from the strict quality scoring. This is particularly relevant in borderline sequences where operator occlusions, technical stops, or the partial visibility of the conveyor can disturb the motion signal or the AoI evaluation.
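The idea of a temporal-coherence check that flags, rather than discards, inconsistent events can be pictured with a minimal sketch. The field names, the tolerance, and the three checks below are illustrative assumptions only; they are not the constraints used in this work.

```python
from dataclasses import dataclass


@dataclass
class PouringEvent:
    # Hypothetical field names, chosen for illustration only.
    t_start: float        # event window start, in seconds
    t_end: float          # event window end, in seconds
    pour_duration: float  # measured pouring duration, in seconds
    assessable: bool = True


def validate_event(event: PouringEvent, tolerance: float = 0.5) -> PouringEvent:
    """Mark an event as non-assessable when its timing is incoherent.

    Assumed checks: the window has positive length, the pouring duration
    is positive, and the pouring fits inside the window (up to a tolerance).
    The event is always returned, so it can still be stored for traceability.
    """
    window = event.t_end - event.t_start
    coherent = (
        window > 0
        and event.pour_duration > 0
        and event.pour_duration <= window + tolerance
    )
    event.assessable = coherent
    return event
```

Returning the event with a flag instead of dropping it mirrors the behaviour described above: incoherent events remain in the log but can be skipped by the scoring stage.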
Once all information has been extracted and validated, the fusion layer combines the outputs of the specialised experts to produce an interpretable classification of the pouring event. In this work, the MoE should be understood as an explainable modular expert-fusion mechanism rather than as a black-box trainable mixture model.
Temporal expert: Evaluates the correctness of the event by comparing the temporal information against the reference control-plan limits using , , and .
Stream expert: Interprets and (cuts), penalising unstable or interrupted flows.
Cup expert: Aggregates the tracked filling states (and the incoming warning flag ) to detect spill, underfill, and premature splashes.
Context expert: Reduces the influence of evidence during non-regular operating periods (e.g., maintenance or stoppages), retaining it for traceability while avoiding spurious quality alarms.
All these expert outputs are fused through an explicit event-level aggregation strategy conditioned by operational context, rather than through a learned neural gating network. Operationally, each expert contributes structured evidence that is interpreted as compliant, warning-level, or critical according to the corresponding industrial criteria. Events are labelled as OK when no critical evidence is present under regular operating context, as Review when isolated warning-level deviations or borderline disturbances are observed, and as Not_OK when severe or repeated anomalies are detected, especially when multiple experts concur on abnormality. The final MoE output also includes an explanation vector with the dominant contributing factors (e.g., cut detected, overflow warning, out-of-limit pour duration, maintenance segment present).
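The rule-based aggregation described above can be sketched in a few lines. The sketch assumes that each expert has already reduced its evidence to one of three levels; the function name, the rule that two or more warnings count as "repeated" anomalies, and the capping of decisions at Review during non-regular context are assumptions made for illustration, not the exact criteria of the deployed system.

```python
COMPLIANT, WARNING, CRITICAL = "compliant", "warning", "critical"


def fuse_expert_outputs(expert_levels, regular_context=True):
    """Fuse per-expert evidence levels into an event decision.

    expert_levels maps an expert name (e.g. "temporal", "stream", "cup")
    to one of the three levels above. Returns the decision label and the
    sorted list of experts that contributed non-compliant evidence.
    """
    flagged = {
        name: level for name, level in expert_levels.items() if level != COMPLIANT
    }
    n_critical = sum(1 for level in flagged.values() if level == CRITICAL)
    n_warning = len(flagged) - n_critical

    if not flagged:
        decision = "OK"
    elif regular_context and (n_critical >= 1 or n_warning >= 2):
        # Severe evidence, or several experts agreeing on a deviation.
        decision = "Not_OK"
    else:
        # Isolated warnings, or any evidence gathered during maintenance
        # or stoppages, where its influence is deliberately reduced.
        decision = "Review"
    return decision, sorted(flagged)
```

Because the second return value lists the contributing experts, the same function also yields a simple form of the explanation vector mentioned above.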
For this evaluation, we used the same one-hour run as in the previous analysis. Accordingly, the parsed
POURING_EVENT logs show that most cycles are temporally stable (explained below in
Table 4), with a tight inter-pour rhythm and occasional outliers caused by operational interruptions. The average number of stream interruptions is low. In fact, only 5.38% of events contain at least one cut (i.e.,
), indicating that discontinuous flows are sporadic rather than systematic.
Furthermore, the status flags calculated indicate that the majority of events are considered OK by the upstream checks, with a smaller fraction requiring review due to minor inconsistencies, such as short partial segments or atypical timing under occlusions. These flags are used by the MoE as priors rather than final decisions, since the ultimate quality label is produced after the multi-source fusion process.
After aggregating all validated evidence at the event level, the MoE module assigns a final operational decision to each detected pouring cycle. This decision reflects the joint evaluation of temporal compliance, stream stability, mould filling conditions, and operational context. Events flagged as non-assessable due to partial visibility or external disturbances are excluded from strict quality scoring but preserved for traceability.
Table 5 summarises the distribution of MoE decisions obtained from the selected one-hour evaluation video.
From a validation perspective, a full mould-by-mould correlation between the MoE decisions and the final casting quality is not feasible at this stage. This limitation is mainly due to the lack of part-level traceability between individual pouring events and downstream defect inspection, as well as the presence of external disturbances affecting both detection and quality assessment. In fact, in practical foundry operation, this limitation is not merely methodological but also operational. Once the pouring stage has finished, the downstream production flow does not preserve a direct one-to-one linkage between the visually observed pouring event and the final inspected casting within the available data infrastructure. As a result, extracting the exact subset of castings associated with a specific recorded pouring interval is not straightforward, and final quality information is typically available at an aggregated production level rather than as an event-level label tied to each detected mould. This prevents the computation of controlled deployment metrics on a mould-by-mould basis, even though the evaluation itself is performed under real industrial conditions. Consequently, the evaluation strategy adopted in this work relies on a comparison at the production statistics level, which is a realistic and commonly adopted practice in industrial foundry environments.
Specifically, the overall defect rates provided by the foundry, together with the proportion of defects explicitly associated with pouring-related phenomena, are contrasted with the distribution of MoE decision outcomes obtained over the same production interval. According to the plant quality records, the total rejection rate during the analysed period is
. Among these rejected parts, approximately
correspond to defect types directly linked to the pouring stage. Accordingly, the effective defect rate associated with pouring-related phenomena is approximately expressed as follows:
During the same one-hour production video and based on the pouring-related defect rate, approximately 17 moulds would be expected to present quality issues related to pouring anomalies within this time interval. Consequently, and taking into account the values shown in
Table 5, the overall anomaly detection rate of the system, defined as the union of
Review and
Not_OK events, is as follows:
In comparison with the estimated rate of pouring-related defects reported by the foundry, the system exhibits a close correspondence in terms of order of magnitude. The slight overestimation observed in the MoE anomaly rate can be reasonably attributed to two factors: (i) the conservative nature of the supervisory strategy, which prioritises early risk detection over missed defects, and (ii) the residual detection error inherent to vision-based mould identification under real industrial conditions. Importantly, at the aggregate level, the system does not under-detect potentially defective events, which is a critical requirement for quality assurance and process supervision.
Beyond the final decision counts, the MoE-style architecture enables an analysis of which groups of indicators predominantly contributed to each decision category. Decisions do not rely on a single expert, but rather emerge from the explicit combination of multiple evidence sources, including temporal compliance, pouring stream behaviour, mould filling states, and operational context.
Table 6 summarises the dominant contributing factors associated with each MoE output class.
One of the main advantages of the proposed MoE-style architecture is that each decision can be traced back to a limited set of contributing experts and variables (see
Table 7). Instead of producing a single opaque score, the system preserves the individual expert outputs and their documented contribution to the final decision.
Once the MoE decision
has been produced for each pouring event, the resulting assessment is made available to higher-level software systems through a structured and machine-readable interface. Each evaluated event
is encoded as a JSON object that aggregates raw measurements, intermediate expert outputs, and the final MoE decision, thereby preserving the evidence underlying the assessment process. More specifically, the JSON structure serialises the same event-level variables described in the MoE evaluation stage, including temporal boundaries, expert-level indicators, contextual annotations, and the final decision label, so that downstream systems receive both the outcome and the supporting evidence in a directly consumable format. A representative example of this output is provided in
Appendix A. This internal traceability refers to the provenance of the event-level assessment within the proposed pipeline, and it should not be interpreted as a native end-to-end linkage to mould-level MES records, which is not available in the current deployment.
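As a rough illustration of how such an event could be serialised, the following sketch builds a JSON object from a handful of event-level values. All key names and the function name are hypothetical and chosen for readability; the actual structure is the one given in Appendix A.

```python
import json


def encode_event(event_id, t_start, t_end, n_cuts, expert_levels, decision, explanation):
    """Serialise one evaluated pouring event as a JSON string.

    The payload groups raw measurements, per-expert outputs, and the final
    decision, so that a consumer receives the outcome with its evidence.
    """
    payload = {
        "event_id": event_id,
        "timing": {
            "t_start": t_start,
            "t_end": t_end,
            "duration": round(t_end - t_start, 3),
        },
        "stream": {"n_cuts": n_cuts},
        "experts": expert_levels,
        "decision": decision,
        "explanation": explanation,
    }
    # sort_keys gives a stable key order, which simplifies log comparison.
    return json.dumps(payload, sort_keys=True)
```

A consumer can recover the full structure with `json.loads`, which keeps the decision label and its supporting evidence together in a single message.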
This information is streamed in real time via a WebSocket communication layer to downstream platforms, which may include supervisory tools or digital twin environments within the foundry. Rather than receiving only a final quality label, these systems can ingest a semantically enriched representation including temporal metrics, pouring stream descriptors, mould filling states, safety indicators, and contextual annotations. This facilitates event-level monitoring, replay, internal assessment traceability, and subsequent causal or operational analysis.
Overall, the MoE unification stage acts as a semantic bridge between low-level visual perception and higher-level manufacturing intelligence. By combining heterogeneous evidence into an explainable and structured event-level assessment, the system supports consistent process supervision while maintaining interpretability and traceability.
4. Discussion and Conclusions
This work addresses a critical industrial challenge in iron foundries, namely, the reliable monitoring and interpretation of pouring operations. This stage is particularly critical because a significant proportion of casting defects originate during pouring and are subsequently propagated downstream throughout the production process. These pouring-related anomalies, such as incomplete filling, cold shuts, and metal losses, directly compromise the final casting quality, operational safety, and overall production efficiency. Despite its importance, the pouring process remains difficult to supervise consistently under real industrial conditions due to its harsh environment and dynamic nature. To handle this challenge, the proposed framework integrates multimodal perception, temporal reasoning, and expert-based decision logic into a unified and explainable system, enabling robust monitoring and interpretation of pouring operations in industrial foundries. The main contribution of this research lies in demonstrating that heterogeneous sources of evidence (i.e., visual, temporal, and contextual) can be effectively unified into a coherent operational assessment of each pouring cycle. Rather than focusing on isolated detections or individual indicators, the proposed system transforms raw observations into structured and interpretable representations that capture both process behaviour and compliance with the foundry control plan. Accordingly, the experimental results demonstrate that individual pouring events can be reliably segmented. Moreover, key AoIs can be detected and tracked, and meaningful indicators describing stream stability, filling quality, and safety conditions can be consistently extracted. When the objectives outlined in
Section 1 are considered in a unified way, the obtained results confirm that the proposed approach fulfils its intended scope. Thus, pouring events are robustly identified under real production conditions, and the critical visual elements required to interpret pouring dynamics are detected with high reliability. On this basis, interpretable indicators are consistently derived to characterise stream behaviour, mould filling states, and temporal compliance. In addition, heterogeneous sources of evidence are unified through an explainable MoE architecture, enabling a coherent and event-centred operational assessment. Building upon this capability, the statistical evaluation of the system behaviour provides consistent evidence that the proposed MoE framework captures the underlying defect generation mechanisms of the pouring process. Events flagged as
Review or
Not_OK are predominantly associated with conditions known to increase defect probability, such as stream interruptions, turbulence, abnormal filling states, or temporal deviations from the control plan. Conversely, events classified as
OK correspond mainly to stable and compliant pouring cycles. Consequently, any residual discrepancies between the system outputs and production statistics define a clear and actionable pathway for future refinements, including threshold tuning, expert weighting adjustments, and the incorporation of additional sensing modalities.
One of our core contributions concerns the identification of anomalous pouring streams. We have presented a novel framework for anomaly detection in complex industrial scenarios, specifically targeting molten metal pouring processes. By leveraging a multimodal sensor fusion strategy, we successfully integrated video data with disparate sensory readings through a self-supervised learning objective. To the best of our knowledge, this represents one of the first applications of VideoMAE transformers for industrial outlier detection. Our approach demonstrates that learning jet stream dynamics via reconstruction, followed by density-based clustering, offers a robust alternative to traditional supervised methods, which are often hampered by the scarcity of defective samples in manufacturing environments.
Our experimental analysis highlights the suitability of density-based clustering, specifically DBSCAN, over partition-based methods such as K-Means for this domain. While K-Means forces data into rigid partitions, DBSCAN naturally isolates outliers, which is more consistent with the definition of industrial anomalies. We observed the consistent formation of four primary “normal” clusters—representing standard pouring, occlusion variations, and minor reabsorbed overflows—regardless of the modality used. This stability suggests that the visual features, which constitute approximately 90% of the feature space, dominate the global structure of the representation. In this context, the benefit of sensory integration is not a major reorganisation of the dominant cluster topology but a more useful event-level anomaly retrieval behaviour. In particular, only the multimodal variants produced anomalous-pouring-event counts that were consistent with the expected lower bound derived from plant statistics (approximately 16 anomalous events out of 549 detected pouring events, i.e., around 3%). Therefore, the practical contribution of multimodality lies mainly in improving anomaly prioritisation and qualitative relevance at event level, rather than in dramatically changing the global clustering pattern.
Despite these successes, the definition of an anomaly remains a nuanced challenge. We observed that while distinct outliers are easily flagged, many samples exist on the “edges” of normal clusters, exhibiting increasing deviation from the centroid. Currently, the boundary between a “noisy normal” sample and a “subtle anomaly” is dictated by the hyperparameters eps and min_samples, which can be sensitive to tune. Therefore, we posit that this method is not yet a replacement for human expertise but rather a powerful augmentation tool. By successfully filtering the vast majority of normal operation data, our system significantly alleviates the annotation burden, allowing human experts to focus their attention on verifying a targeted subset of potential defects.
Finally, the generation of structured and machine-readable outputs provides interoperability with supervisory and digital twin-compatible platforms. This capability helps bridge the gap between low-level perception and higher-level industrial decision-support systems, allowing the extracted knowledge to be consumed, contextualised, and exploited beyond isolated detection tasks.
Beyond the foundry domain, the proposed framework is also compatible with other real-world inspection scenarios in which heterogeneous visual data, operational variability, and deployable decision-support requirements must be handled jointly. For example, similar pipeline principles could be transferred to recent dataset-oriented applications such as UAV-based rebar counting inspection on construction sites and building-facade analysis from street-view imagery, where robust preprocessing, structured event or object characterisation, and downstream screening are also required [
116,
117]. In this sense, the contribution of the present work is not limited to pouring-process supervision, but it also illustrates a broader strategy for converting heterogeneous industrial or infrastructure data into machine-readable and operationally meaningful outputs.
A key strength of the proposed MoE architecture is its explainability. Instead of collapsing all observations into a single opaque score, the system preserves individual expert outputs. This property enables causal analysis of OK, Review, and Not_OK outcomes and facilitates alignment with expert judgement. In addition, qualitative assessment performed together with foundry specialists confirms that the decisions are consistent with domain knowledge, particularly in distinguishing stable production from borderline or potentially risky conditions. Importantly, situations such as mild overflow are correctly interpreted as warning-level events rather than obvious errors, reflecting realistic industrial practice.
Despite the promising results achieved in our research work, the proposed framework exhibits several limitations that must be critically analysed to properly contextualise its applicability and guide future developments. These limitations are not specific to the proposed architecture but are representative of current challenges in industrial vision-based monitoring systems.
The first limitation concerns the dependency on visual acquisition conditions. Although the detection and tracking modules demonstrated robust behaviour under normal production scenarios, performance degradation may occur under non-ideal acquisition conditions (e.g., intense glare from molten metal, smoke, or partial occlusions caused by operators or maintenance activities). Similar constraints have been reported in other vision-based foundry monitoring systems, where illumination variability and environmental noise remain major sources of uncertainty [
12,
13]. Recent studies suggest that integrating complementary modalities such as thermal imaging or multispectral sensing can mitigate these effects by providing illumination-invariant cues [
118,
119]. Consequently, incorporating these modalities into the proposed multimodal framework represents a natural and well-founded extension.
The second limitation relates to the representation of rare but critical pouring anomalies. Events such as severe nozzle clogging, abrupt ladle misalignment, or extreme overflows occur infrequently in normal production, although they have a disproportionate impact on quality and safety. As a result, they are underrepresented in the available training data, limiting the learning capacity of data-driven components. This issue has been widely recognised in the industrial anomaly detection literature [
112]. Several studies have addressed this challenge using synthetic data generation, physics-based simulation, or generative models to augment this type of sample [
8,
120]. Therefore, applying similar strategies to pouring dynamics could allow the MoE system to generalise better to such rare events.
Another important limitation concerns the long-term stability of sensor-derived indicators and temporal measurements. The current implementation assumes properly calibrated sensors and stable acquisition conditions. Nevertheless, sensor drift, mechanical wear, and changes in production references over time can progressively degrade the reliability of temporal indicators such as pouring duration or inter-pour intervals. This challenge has been extensively discussed in the fault diagnosis and condition monitoring literature [
10,
14]. Accordingly, adaptive thresholding strategies, online recalibration mechanisms, and self-supervised drift detection techniques have been proposed to maintain robustness [
121]. These approaches could be integrated into the temporal expert of the proposed MoE architecture without altering its overall structure.
Regarding the detection of anomalous pouring events, we believe the process could benefit from finer detection granularity. A primary focus should be a deeper analysis of intra-cluster distances. We hypothesise that the Euclidean distance of a sample to its cluster centroid could serve as a continuous “anomaly score”, providing a soft filtering mechanism for edge cases that DBSCAN currently classifies strictly as binary noise or inliers. Furthermore, to better address the sensitivity of global hyperparameters, we intend to explore hierarchical clustering algorithms, such as HDBSCAN, or iterative sub-clustering strategies. These approaches could dynamically adapt to the varying densities of different operational modes, potentially separating the “dge” regions more effectively than a global density threshold. In this regard, we also believe that the artificial local density produced by the sampling strategy could be alleviated. Additionally, the role of sensory data warrants further investigation. Our results showed that multimodal models flagged samples that appeared visually normal as anomalous, suggesting the sensors captured latent process deviations. Future work should pursue a rigorous correlation analysis of these specific samples against downstream quality control logs to validate the physical nature of these “invisible” defects. Furthermore, we envision the immediate next step as the deployment of this system in an iterative, human-in-the-loop framework. By presenting the detected outliers to factory experts, we can not only validate the system’s current precision but also use expert feedback to progressively refine the feature space, moving from unsupervised exploration toward a robust, semi-supervised anomaly detection pipeline. This will further enable quantitative analysis and more in-depth comparison between different approaches or hyperparameter choices. 
In this regard, a more representative analysis of the contribution of sensory data could also be carried out through this cheaper annotation step.
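As an illustration of the continuous anomaly score hypothesised above, the following minimal sketch computes the Euclidean distance of each sample to the centroid of its cluster, given DBSCAN-style labels in which -1 marks noise. It is not the implementation used in this work: the function names, the toy feature vectors, and the choice to score noise samples against their nearest centroid are assumptions made only for illustration.

```python
import math
from collections import defaultdict

NOISE = -1  # DBSCAN convention for samples not assigned to any cluster


def centroids(points, labels):
    """Mean vector of each cluster, ignoring noise samples."""
    groups = defaultdict(list)
    for p, lab in zip(points, labels):
        if lab != NOISE:
            groups[lab].append(p)
    return {
        lab: tuple(sum(c) / len(members) for c in zip(*members))
        for lab, members in groups.items()
    }


def anomaly_scores(points, labels):
    """Euclidean distance of each sample to its cluster centroid.

    Noise samples have no cluster of their own, so (an assumption of this
    sketch) they are scored against the nearest centroid instead.
    """
    cents = centroids(points, labels)
    if not cents:
        raise ValueError("at least one non-noise cluster is required")
    scores = []
    for p, lab in zip(points, labels):
        if lab == NOISE:
            scores.append(min(math.dist(p, c) for c in cents.values()))
        else:
            scores.append(math.dist(p, cents[lab]))
    return scores


# Toy 2-D feature vectors (hypothetical values, not data from this study)
points = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0), (1.0, 5.0)]
labels = [0, 0, 0, 0, NOISE]
scores = anomaly_scores(points, labels)
```

Ranking samples by such a score, rather than relying only on the noise label, would allow a soft threshold to be tuned per cluster. In practice, the distances would likely need to be normalised by cluster spread before scores from different operational modes are compared.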
A further limitation concerns the absence of native end-to-end mould-level traceability across all plant information systems. Although the proposed framework preserves the internal provenance of each detected pouring event and its associated expert outputs, it cannot yet assign every event to a unique downstream MES or final quality record in a fully automatic manner. Therefore, the current contribution should be interpreted as an event-level supervisory and decision-support layer, rather than as a complete part-level digital twin implementation. Addressing this cross-system identification gap constitutes an important direction for future industrial deployment.
Finally, the current research work does not yet close the loop between perception and control. Although the structured JSON outputs are designed to support interoperability with supervisory and digital twin-compatible platforms, the system currently operates in an observational and supervisory mode. Several recent industrial digital twin implementations suggest that closing this loop may improve process stability and reduce defect rates [111,122]. Hence, integrating the proposed framework into an online digital twin ecosystem could enable not only real-time monitoring but also predictive scenario evaluation and corrective action recommendation.
In conclusion, this work demonstrates that an explainable, multimodal MoE-based framework can provide a robust and industrially meaningful solution for supervising pouring operations in iron foundries. By unifying visual perception, temporal analysis, and domain-driven expert knowledge into a coherent decision layer, the proposed approach contributes to the advancement of pouring monitoring while delivering interpretable outputs aligned with real production constraints. At the same time, the identified limitations should be understood as part of the current deployment scope rather than as contradictions of the proposed framework, and they define a clear and realistic roadmap for its evolution. Future extensions may incorporate additional sensing modalities, rare-event augmentation strategies, adaptive temporal modelling, and tighter online interoperability with digital twin-compatible infrastructures, allowing the framework to evolve towards more autonomous, resilient, and scalable intelligent pouring supervision.