RI2AP: Robust and Interpretable 2D Anomaly Prediction in Assembly Pipelines

Predicting anomalies in manufacturing assembly lines is crucial for reducing time and labor costs and improving processes. For instance, in rocket assembly, premature part failures can lead to significant financial losses and labor inefficiencies. With the abundance of sensor data in the Industry 4.0 era, machine learning (ML) offers potential for early anomaly detection. However, current ML methods for anomaly prediction have limitations, with F1 measure scores of only 50% and 66% for prediction and detection, respectively. This is due to challenges like the rarity of anomalous events, scarcity of high-fidelity simulation data (actual data are expensive), and the complex relationships between anomalies not easily captured using traditional ML approaches. Specifically, these challenges relate to two dimensions of anomaly prediction: predicting when anomalies will occur and understanding the dependencies between them. This paper introduces a new method called Robust and Interpretable 2D Anomaly Prediction (RI2AP) designed to address both dimensions effectively. RI2AP is demonstrated on a rocket assembly simulation, showing up to a 30-point improvement in F1 measure compared to current ML methods. This highlights its potential to enhance automated anomaly prediction in manufacturing. Additionally, RI2AP includes a novel interpretation mechanism inspired by a causal-influence framework, providing domain experts with valuable insights into sensor readings and their impact on predictions. Finally, the RI2AP model was deployed in a real manufacturing setting for assembling rocket parts. Results and insights from this deployment demonstrate the promise of RI2AP for anomaly prediction in manufacturing assembly pipelines.


Introduction
The manufacturing industry has witnessed multiple evolutionary iterations throughout its history: the mechanization of Industry 1.0, the mass production of Industry 2.0, the automation of Industry 3.0, and, finally, today's era of smart manufacturing, Industry 4.0 [1]. Each of these revolutions is characterized by specific capabilities introduced to evolve manufacturing systems. The era of Industry 4.0 has transformed the manufacturing landscape with the advent of data-driven smart manufacturing, a paradigm that aims to utilize generated data to inform decision-making processes and improve productivity and efficiency [2].
Time series data have become ever-present within manufacturing systems with the proliferation of affordable and robust sensors available on the market. Hence, time series analytics have experienced significant progress in Industry 4.0. An estimated one trillion sensors are projected to be utilized in manufacturing facilities by 2025 [3]. The time series sensor data involved in manufacturing processes can play a pivotal role in analytics-driven insights into events of interest, such as anomalies.
Specifically, we are interested in utilizing the time series data to predict future anomalies based on historical data and the current status of the manufacturing system [4,5]. However, being able to accurately predict anomalous events in production lines can be challenging. Real manufacturing datasets can be very imbalanced, as it is rare for anomalies to occur in mature manufacturing processes [6]. Translating the data into meaningful insights about anomalies (e.g., remedial actions) can be challenging due to the considerable number of sensors that must be considered. Lastly, the interdependence between the sensor data and anomaly categories further complicates the prediction problem.
To tackle these challenges, researchers have experimented with data-driven statistical learning- and ML-based solutions for anomaly prediction. The spectrum of methods explored includes traditional statistical approaches like ARIMA, exponential smoothing, and structural models, as well as ML and neural network methods such as gradient boosting, convolutional neural networks, recurrent neural networks, and their variations [7][8][9][10][11][12][13][14][15]. More details on these early works are available in Section 2 and Appendix A. In recent times, researchers have drawn inspiration from the success of generative artificial intelligence (GenAI). This has led to exploring pre-trained foundational time series models such as TimeGPT and PromptCast. These models are fine-tuned for specific downstream tasks, such as anomaly prediction [16,17].
Although the methods explored so far have shown promise, they have not achieved adequate predictive performance (the SOTA F1 measure is 50% in prediction and 66% in detection; see Appendix C) due to several key challenges that remain: (i) the lack of a robust solution for modeling the rarity of anomalous occurrences; e.g., rocket parts being fitted poorly does not frequently occur in mature assembly pipelines, often resulting in poor predictive accuracy; (ii) the lack of a framework for modeling the two-dimensional nature of the problem, namely, the prediction of the anomaly (or anomalies) at future time steps, along with the dependencies among the anomalies when more than one occurs; and (iii) the lack of high-fidelity simulation data corresponding to real-world rocket assembly pipelines (the data generated often lack the stochasticity of real-world pipelines). Beyond prediction-related challenges, there are also hurdles related to interpreting the results in a domain-expert-friendly manner to inform insights for improving pipelines [18].
We propose a novel framework for handling the abovementioned challenges, which we refer to as Robust and Interpretable 2D Anomaly Prediction (RI2AP).Our main contributions are as follows:

•
For challenges (i) and (ii) above, we implemented the following strategies. We model an anomaly using a compositional real-valued number. First, we encode each anomaly class using a monotonically increasing token assignment strategy (e.g., 0 for none, 1 for the first part falling off, 2 for the second part falling off, and so on). This captures the monotonically increasing severity of the anomaly categories in rocket assembly. Next, we represent compositional anomalies using the expected value of their token assignments. We propose a novel model architecture that predicts both the sensor values at the next time step and the value assigned to the compositional anomaly (hence the name 2D prediction). Robustness to rarity is achieved by modeling the problem with a regression objective, thus avoiding the need for an adequate number of positive vs. negative class instances or other ad hoc sampling strategies to handle rare occurrences.

•
For challenge (iii), we use the Future Factories dataset. The dataset originates from a manufacturing assembly line specifically designed for rocket assembly, adhering to industrial standards in deploying actuators, control mechanisms, and transducers [19].

•
To enable domain-expert-friendly interpretability, we introduce combining rules first proposed in the independence of causal influence framework [20], which were inspired by real-world use cases, such as those in healthcare, and allow for greater expressivity than traditional explainable AI (XAI) methods (e.g., saliency and heat maps). We note that although XAI methods are useful to system developers for debugging and verification, they are not end-user friendly and do not give end-users the information they want [18]. We demonstrate how combining rules allow natural and user-friendly ways for the domain expert to interpret the influence of individual measurements on the prediction outcome.

•
This full investigation aimed to tackle the above challenges to create an adequate model and fully deploy this model in a real manufacturing system. The results and insights from this deployment showcase the promising potential of RI2AP for anomaly prediction in manufacturing assembly pipelines.
Figure 1 shows a summary of the proposed method. The rest of this paper is organized as follows. Section 2 covers past work on anomaly detection and prediction within manufacturing processes using univariate and multivariate sensor data; through this literature survey, we identify the key research gaps. Section 3 describes the dataset and summary statistics. Section 4 introduces a precise formulation of the problem aimed at addressing the gaps identified in Section 2. Section 5 details the proposed solution approach (the RI2AP method), design motivations, and other architectural choices (e.g., function approximator choices). Section 6 provides our experimental setup and records the improvements of our proposed approach over state-of-the-art baselines for a robust proof-of-concept (POC) model. Section 7 covers the deployment of the POC model on the Future Factories manufacturing cell, including the deployment plan, technical details, deployment results, and issues faced in deployment. We conclude the paper in Section 8 by summarizing this study's significant findings and limitations and avenues for future work.
Wang et al. [21] proposed a method based on recurrent neural networks to detect anomalies in a diesel engine assembly process, utilizing routine operation data, reconstructing input data to identify anomaly patterns, and providing insights into the time step of anomaly occurrences to aid in pinpointing system issues. Ref. [22] addressed the problem of unexpected assembly line cessation with a unique approach that integrates Industrial Internet of Things (IIoT) devices, neural networks, and sound analysis to predict anomalies, leading to a smart system deployment that significantly reduces production halts. Ref. [23] investigated and developed automatic anomaly detection methods using support vector machines for in-production manufacturing machines. They considered operational variability and wear conditions, achieving a high recall rate without continuous recalibration, specifically in the rotating bearing of a semiconductor manufacturing machine. Ref. [24] conducted fine-grained monitoring of manufacturing machines, addressing challenges in data feeding and meaningful analysis, analyzing real-world datasets to detect sensor data anomalies in pharma packaging, and predicting unfavorable temperature values in a 3D printing machine environment. They developed a parameterless anomaly detection algorithm based on the random forest algorithm and emphasized the efficiency of anomaly detection in supporting industrial management. The research conducted by Abdallah et al. [25] analyzed sensor data from manufacturing testbeds using deep learning techniques, evaluated forecasting models, demonstrated the benefit of careful training data selection, utilized transfer learning for defect-type classification, released a manufacturing database corpus and codes, and showed the feasibility of predictive failure classification in smart manufacturing systems. Park et al. [26] proposed a fast adaptive anomaly detection model based on an RNN Encoder-Decoder and machine sounds from Surface-Mounted Device (SMD) assembly machines. They utilized the Euclidean distance for abnormality decisions, and the proposed approach has structural advantages over Autoencoders (AEs), enabling faster adaptation with fewer parameters.
Chen et al. [27] developed a novel Spectral and Time Autoencoder Learning for Anomaly Detection (STALAD) framework for in-line anomaly detection in semiconductor equipment, utilizing cycle series and spectral transformations of equipment sensory data (ESD). They implemented an unsupervised learning approach with Stacked Autoencoders for anomaly detection, designed dynamic procedure control, and demonstrated its effectiveness in learning without prior engineer knowledge. Saci et al. [28] developed a low-complexity anomaly detection algorithm for industrial steelmaking furnaces using vibration sensor measurements, optimizing parameters with multiobjective genetic algorithms, demonstrating superior performance over SVM and RF algorithms, and highlighting its suitability for delay-sensitive applications and devices with limited computational resources, with generic applicability to industrial anomaly detection problems. Ref. [29] investigated anomaly detection and failure classification in IoT-based digital agriculture and smart manufacturing, addressing technical challenges such as sparse data and varying sensor capabilities. The study evaluated ARIMA and LSTM models, designed temporal anomaly detection and defect-type classification techniques, explored transfer learning and data augmentation methods, and demonstrated improved accuracies in failure detection and prediction. However, to the best of the authors' knowledge, none of these studies have examined how to model the interdependencies of anomalies in a manufacturing setting.

Future Factories Dataset
We used the Future Factories (FF) dataset [30] generated by the Future Factories team operating at the McNair Aerospace Research Center at the University of South Carolina, which has been made publicly available. A visual representation of the FF setup is included in Appendix E. The dataset consists of measurements from a simulation of a rocket assembly pipeline, which adheres to industrial standards in deploying actuators, control mechanisms, and transducers. The data consist of several assembly cycles with several kinds of measurements, such as the conveyor variable frequency drive temperatures, conveyor workstation statistics, etc., for a total of 41 measurements. In this work, we first utilized XGBoost 2.0.1 and its coverage measure to narrow the 41 measurements down to the 20 that contain high information content. XGBoost has achieved a SOTA performance on anomaly detection and prediction (prediction refers to identification before the anomalous event, and detection refers to identification after the event), and therefore, we used it for our feature selection (please refer to Appendix B for coverage plots and an example of a learned tree from the XGBoost model). Each assembly cycle is associated with one of eight different anomaly types. Upon domain expert consultation, we further grouped the anomaly types into five distinct categories: a None type, Type 1 (one rocket part is missing), Type 2 (two rocket parts are missing), Type 3 (three rocket parts are missing), and Type 4 (miscellaneous anomalies). Tables 1 and 2 describe the dataset and anomaly statistics, respectively.
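The coverage-based feature selection described above can be sketched as follows. The measurement names and scores here are invented for illustration; in practice, the scores would come from the fitted XGBoost booster (e.g., via `get_score(importance_type="cover")`), and the real dataset narrows 41 measurements down to 20.

```python
# Hypothetical per-measurement coverage scores, standing in for the values
# an XGBoost booster would report via get_score(importance_type="cover").
coverage = {
    "conveyor_vfd_temp": 812.4,
    "gripper_load": 655.0,
    "station_1_status": 540.7,
    "conveyor_speed": 23.1,
    "ambient_humidity": 4.2,
}

def select_top_k(scores, k):
    """Keep the k measurements with the highest coverage scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

# With the real dataset this would be select_top_k(coverage, 20) over 41 entries.
print(select_top_k(coverage, 3))
```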

Problem Formulation
In this section, we formally characterize the problem. We begin by clarifying the notations denoting the dataset components, followed by the encoding method for the target variable, i.e., the anomalous events. Equipped with the appropriate notations, we then describe the task that we aimed to solve in this work.

Notations
Consider an assembly cycle that assembles a rocket from the set of parts P = {p_1, p_2, p_3, . . .}. Parts p_i with lower values of i represent parts at the rocket's lower end; higher values of i represent parts at the rocket's upper (or nose) end. Each cycle occurs over a sequence of t = 1, 2, . . ., T discrete time steps. We refer to [30] for details on the definition of a time step (e.g., sampling rate). At each time step t, a group of 20 sensor measurements is collected (see Section 3); we denote them as the set M^t = {m^t_1, m^t_2, . . ., m^t_20}. Anomalies during a cycle are recorded by a separate mechanism and categorized as None or Types 1-4 as in Section 3. We denote anomaly Type 1 as the singleton tuple a_1 = (p_i), p_i ∈ P; Type 2 as the two-tuple a_2 = (p_i, p_j), p_i, p_j ∈ P, i < j; and Type 3 as the three-tuple a_3 = (p_i, p_j, p_k), p_i, p_j, p_k ∈ P, i < j < k. In a single cycle, parts falling off follow a compositional pattern, where the bottom parts of the rocket detach before the top parts. However, the time gap between these occurrences is nearly instantaneous and cannot be captured within discrete time steps. Consequently, only one type of anomaly from the set A = {a_0, a_1, a_2, a_3, a_4} is recorded at each time step. It is important to note that, in reality, a combination of failures can occur. This is why we define each anomaly using indexed parts p_i, p_j, p_k, i < j < k, where the ordering of the indices reflects the spatial structure of the rocket (bottom to top). The miscellaneous anomaly type, Type 4, is denoted as a_4 = (p_i, p_j, p_k), p_i, p_j, p_k ∈ P; here, the ordering of the indices is not important since these anomalies correspond to crashes (see Table 2) and are, therefore, unrelated to the spatial structure of the rocket. Finally, the None type is denoted as a_0 = (None). In the next subsection, we describe how the anomalies were encoded in our work given the above notations.

Anomaly Encodings
Recall that A is the set {a_0, a_1, a_2, a_3, a_4}. To capture the compositional nature of the anomalies, we perform token assignments to each anomaly type as follows: token(a_0) = 0, token(a_1) = 1, token(a_2) = 2, and token(a_3) = 3. It is clear that this token assignment is monotonically increasing, which is representative of the spatial structure of the rocket, and also captures an increasing degree of severity (more parts falling off vs. fewer parts falling off, as mentioned in the main contributions from Section 1). For a_4, we perform the token assignment as token(a_4) = max({token(a_3 = (p_i, p_j, p_k)) | p_i, p_j, p_k ∈ P}) + 1; i.e., miscellaneous anomalies are assigned the maximum possible value since they correspond to crashes, which are considered the most severe. Note that anomaly Type 4 is not related to the spatial structure of the rocket.
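A minimal sketch of this token assignment, assuming anomaly types a_0 through a_3 are represented as Python tuples of part indices (so the token is simply the number of detached parts) and Type 4 receives the maximum Type 3 token plus one:

```python
def token(anomaly):
    """Token assignment for anomaly tuples.

    `anomaly` is a tuple of part indices: () for None, (i,) for Type 1,
    (i, j) for Type 2, and (i, j, k) for Type 3. Tokens increase
    monotonically with severity: 0, 1, 2, 3.
    """
    return len(anomaly)

# Miscellaneous (Type 4) anomalies correspond to crashes and are assigned
# the maximum Type 3 token plus one, i.e., the most severe value.
TOKEN_TYPE_4 = token((1, 2, 3)) + 1

print(token(()), token((1,)), token((1, 2)), token((1, 2, 3)), TOKEN_TYPE_4)
```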

Why Not Simple "One-Hot" Encoding for Anomaly Types?
Extensive prior work on anomaly detection for the specific case of rocket assembly studied in this paper has shown that "one-hot" encoding and other similar data reformatting techniques lead to poor performances for ML classifiers. Appendix C shows the SOTA results achieved using "one-hot"-encoded labels. Our problem formulation more naturally captures the dataset characteristics for the anomaly prediction problem with high fidelity. Additionally, the SOTA results clearly demonstrate that "one-hot" encoding does not achieve a satisfactory performance.

Task Description
At each time step t, an anomaly a_t ∈ A either occurs or does not. The goal is to predict the measurements M^t = {m^t_1, m^t_2, . . ., m^t_20} and the token assignment of the anomaly type, token(a_t), at time step t (two-dimensional prediction). This prediction is performed multiple times, and the evaluation metrics are recorded.

The RI2AP Method
In this section, we first describe the RI2AP method (illustrated in Figures 2 and 3), subsequently explain the motivations for its design, and finally elaborate on the detailed model architecture used in the RI2AP method. Consider a series of measurements up to time step t − 1, denoted by the data list X = [M^1, M^2, . . ., M^{t−1}]. Here, M^t represents the set of all 20 measurements at time step t, and each m^t_l represents one of these measurements; X_l = [m^1_l, . . ., m^{t−1}_l] denotes the restriction of the data list to measurement l. We first construct a set of 20 different function approximators from these measurements:

(m^t_l, token(a_t)) = F_l(X_l; θ_l), l = 1, . . ., 20. (1)

F_l represents a function approximator associated with a specific measurement m^t_l at time step t; there are 20 such function approximators, indexed by l from 1 to 20. θ_l represents the parameters associated with the lth function approximator F_l. These parameters are learned during the training process and are used to transform the input measurements into the predicted measurement m^t_l and an associated token token(a_t). Then, we combine the set of all 20 outputs from the F_l using a combining rule, denoted aggr, to yield a final value a^t_final [31]:

a^t_final = aggr(F_1(X_1; θ_1), . . ., F_20(X_20; θ_20)). (2)

Design Motivations

Why Separate Function Approximators and Combining Rules?
When domain experts analyze sensor measurements to understand their influence on the presence or absence of detected anomalies (typically conducted post anomaly occurrence), they initially examine the impacts of individual measurements separately. This approach stems from the fact that each measurement can strongly correlate independently with anomaly occurrences. An anomaly typically occurs when multiple measurements independently combine, with well-defined aggregation effects, to cause the anomaly. For this reason, we employ combining rules introduced in the independence of causal influence framework [32], specifically designed for such use cases. These rules provide a natural and domain-expert-friendly way to express realistic aggregation effects, offering options like a simple OR, Noisy-OR, Noisy-MAX, tree-structured context-specific influences, etc., leading to enhanced interpretability. Additionally, as combining rules inherently represent compactly structured Bayesian networks, methods from the do-calculus can be applied to isolate and study various combinations of anomaly-causation models, making them uniquely suitable for our use case [33,34].

Why Not Standard XAI Methods?
As briefly alluded to in Section 1, a qualitative issue with XAI methods is that they are primarily useful to ML researchers for gaining insights into model behaviors and require some postprocessing or organization before end-users or domain experts can understand the model outcomes. They are developer friendly, not domain expert friendly. Additionally, there are mathematical instability issues with XAI methods that raise questions about the robustness and reliability of the explanations provided. Specifically, XAI techniques are based on approximating the underlying manifold using a simpler surrogate, e.g., approximating a globally complex and non-linear function with a linear model (LIME) or fixed-width kernel method (SHAP) for a particular test instance of interest [35,36]. This surrogate model needs to be trained on a representative set, a challenging proposition to ensure in cases with class rarity such as anomaly prediction, resulting in surrogate model variability (producing different explanations for the same prediction when different instances of the surrogate model are applied) [37].
The combining-rules approach used in our work is, first, readily interpretable by the domain expert due to its natural functional forms. Second, it comes with the calibration advantages of probabilistic models: predicted probabilities can be well calibrated to align with experimental observations due to factors that facilitate robustness, e.g., Bayesian estimation, the do-calculus, uncertainty modeling, and model interpretability.

Function Approximation Methods
Section 5, Equation (1) introduced the general form of the function approximation used in the RI2AP method. For ease of explanation of the architecture, we consider the function approximation corresponding to measurement l, given by

(m^t_l, token(a_t)) = F_l(X_l; θ_l).

This model, parameterized by θ_l, takes as input the data list X_l = [m^1_l, . . ., m^{t−1}_l], i.e., the measurements corresponding to l up to time step t − 1, and produces the output (m^t_l, token(a_t)), i.e., the measurement value and the anomaly type token token(a_t) at time step t.

Long Short-Term Memory Networks (LSTMs)
A natural choice for such a time step-dependent prediction scenario is any recurrent neural network (RNN)-based method modified to emit two-dimensional outputs [38]. The equation below describes an abstraction of the LSTM modified for our setting:

(m^t_l, token(a_t)) = LSTM(X_l; θ_l, H).

Here, H denotes the hyperparameters, such as the choice of optimizer, learning rate scheduler, number of epochs, batch size, number of hidden layers, and dropout rate.
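A hedged PyTorch sketch of such a two-dimensional-output LSTM is shown below; the layer sizes and dropout rate are illustrative assumptions, not the paper's tuned hyperparameters.

```python
import torch
import torch.nn as nn

class TwoDimLSTM(nn.Module):
    """LSTM emitting a two-dimensional output per input sequence: the next
    measurement value m_l^t and the anomaly token token(a_t). A sketch of
    the abstraction in the text; layer sizes here are assumptions."""

    def __init__(self, hidden_size=64, num_layers=2, dropout=0.1):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size,
                            num_layers=num_layers, dropout=dropout,
                            batch_first=True)
        self.head = nn.Linear(hidden_size, 2)  # (measurement, token)

    def forward(self, x):
        # x: (batch, look_back, 1) -- one measurement channel l
        out, _ = self.lstm(x)
        # Use the final hidden state to predict the pair at time step t.
        return self.head(out[:, -1, :])

model = TwoDimLSTM()
batch = torch.randn(8, 120, 1)  # look-back length 120, as in Section 6
pred = model(batch)
print(pred.shape)  # torch.Size([8, 2])
```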

Transformer Architecture-Decoder Only
The Transformer architecture represents the current SOTA in sequence modeling and has been employed successfully in a wide variety of application domains [39]. We used two types of Transformer architectures in our experimentation: (i) our own decoder-only implementation, modified to produce two-dimensional outputs at each autoregressive step [40], and (ii) TimeGPT [16], a foundational time series Transformer model.
The equation below describes an abstraction of the decoder-only Transformer architecture modified for our setting:

(m^t_l, token(a_t)) = TransformerDecoder(X_l; θ_l, H, B, Attn_Mask).

Here, Attn_Mask denotes the attention mask required by the autoregressive decoder-only architecture (to prevent it from looking at future parts of the input when generating each part of the output), B represents the number of Transformer blocks, and H denotes the hyperparameters, such as the choice of optimizer, learning rate scheduler, number of epochs, batch size, number of feedforward layers (with the default hidden layer size), number of blocks, number of attention heads, and dropout rate.
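The Attn_Mask can be constructed as a standard causal (upper-triangular) mask; a sketch in PyTorch, where `True` marks positions the decoder may not attend to (the boolean-mask convention accepted by `torch.nn.Transformer`-style modules):

```python
import torch

def causal_attn_mask(seq_len):
    """Causal Attn_Mask for a decoder-only architecture: position i may
    attend only to positions <= i, so the model cannot look at future
    parts of the input while generating each part of the output."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool),
                      diagonal=1)

mask = causal_attn_mask(4)
print(mask)
```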

Method of Moments
In "A Kernel Two-Sample Test for Functional Data", Wynne et al. [41] demonstrated that when comparing data samples of imbalanced sizes, first-order moments, specifically sample means, are more suitable features for identifying discriminatory patterns. Intuitively, employing sample means or averages helps alleviate the impact of significant differences in sample sizes. Narayanan et al. [42] leveraged ideas from Shohat and Tamarkin's book and generalized this idea to nth-order moments, providing theoretical proofs and experimental observations that validate the method's robustness to sample imbalance [43,44]. Let moments([m^1_l, . . ., m^{t−1}_l]) denote the moments of the input list. The equation below describes an abstraction of the method of moments for our setting:

(m^t_l, token(a_t)) = NN(moments_n([m^1_l, . . ., m^{t−1}_l]); θ_NN, H),

where NN denotes a feedforward neural network that encodes the measurements at different time steps into a dense matrix of size D × 2 (D is the output dimension of the penultimate layer of the neural network), θ_NN denotes the parameters of the network, H denotes the hyperparameters, such as the number of hidden layers and their sizes, and n denotes the order of the moments. The purpose of the neural network in this setup is to learn a mapping from the inputs to a transformed basis, over which the moments are calculated. For a normally distributed sample, the first- and second-order moments (mean and variance) of the measurements (before transformation to any other basis) are sufficient to characterize the distribution. In our case, however, the underlying data distribution is unknown. Therefore, we equip the function approximator with a neural network that can be trained to map inputs to a transformed basis, ensuring that the calculated moments sufficiently characterize the distribution. We chose the LSTM and Transformer function approximators because they represent the SOTA in sequence modeling, and we chose the method of moments for its ideal theoretical properties (robustness to noise and class imbalance) with respect to our problem setting.
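One possible wiring of this idea is sketched below in PyTorch; the exact composition of the basis network, moment orders, and output head is an assumption, since the text specifies only the abstraction. A small feedforward network maps each time step into a transformed basis, moments up to order n are taken across time, and a linear head emits the (m_l^t, token(a_t)) pair.

```python
import torch
import torch.nn as nn

class MomentPredictor(nn.Module):
    """Sketch of the method-of-moments function approximator. A
    feedforward network maps each time step's measurement into a
    transformed basis; moments up to order n are computed across time;
    a linear head emits (m_l^t, token(a_t)). Wiring is an assumption."""

    def __init__(self, basis_dim=16, n=2):
        super().__init__()
        self.n = n
        self.basis = nn.Sequential(nn.Linear(1, basis_dim), nn.ReLU(),
                                   nn.Linear(basis_dim, basis_dim))
        self.head = nn.Linear(basis_dim * (n + 1), 2)

    def forward(self, x):
        # x: (batch, t-1, 1) -- measurements for channel l up to t-1
        z = self.basis(x)  # transformed basis, (batch, t-1, basis_dim)
        # Moments of orders 0..n over the time axis.
        moments = [torch.mean(z ** k, dim=1) for k in range(self.n + 1)]
        return self.head(torch.cat(moments, dim=-1))  # (batch, 2)

model = MomentPredictor()
pred = model(torch.randn(4, 120, 1))
print(pred.shape)  # torch.Size([4, 2])
```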

Function Approximator Setup Details 6.1.1. LSTM
The preprocessed dataset was divided into training and testing sets, with the training set encompassing the initial 80% of the temporal data and the remaining 20% allocated to the test set. Sequences were constructed from the normalized data using a look-back length (context window) of 120. We used PyTorch Lightning's Trainer to train and validate the model. The training process was set up with the Mean Squared Error (MSE) loss function and the AdamW optimizer with its learning rate scheduler. The hyperparameters tuned included the number of epochs, batch size, hidden layers, and dropout rate. Early stopping was implemented, and the best checkpoint, determined by the reduction in the validation MSE, was saved during training.
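The sequence construction with a look-back of 120 can be sketched as follows (the toy series here is illustrative; in the paper, it is the normalized training portion of one measurement channel):

```python
import numpy as np

def make_sequences(series, look_back=120):
    """Build (input window, next value) pairs from a normalized series,
    using the look-back (context window) length described in the text."""
    X, y = [], []
    for i in range(len(series) - look_back):
        X.append(series[i:i + look_back])   # window of 120 past values
        y.append(series[i + look_back])     # value at the next time step
    return np.asarray(X), np.asarray(y)

series = np.arange(200, dtype=float)  # toy stand-in for one channel
X, y = make_sequences(series)
print(X.shape, y.shape)  # (80, 120) (80,)
```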

Transformer (Ours)
As mentioned in Section 5, we implemented our own decoder-only Transformer setup. The preprocessed data were split into training and testing sets using the same splitting method used for the LSTM in the previous section. Subsequently, the data were normalized and transformed into sequences with a look-back length of 120 for training the Transformer model. Once again, the training process was set up with an MSE loss function, the AdamW optimizer, and a learning rate scheduler. The model was trained and validated using PyTorch Lightning's Trainer module, with early stopping implemented to prevent overfitting. The training progress was monitored and logged, and the best model checkpoint was saved based on the validation loss. The model's hyperparameters included the number of epochs, batch size, number of feedforward layers (with a default hidden layer size of 2048), number of blocks (6), number of attention heads, and dropout rate.

TimeGPT
The dataset was preprocessed before being divided into two subsets, training and testing, with 97,510 and 2000 rows, respectively. Both sets were then standardized using a standard scaler. The training of the model was performed using the timegpt.forecast method, and hyperparameter tuning was performed using finetune_steps, which performs a certain number of training iterations on the input data to minimize the forecasting error. However, given Nixtla's current constraints, hyperparameter tuning beyond finetune_steps, such as modifying the learning rate, batch size, or dropout layers, was not possible due to a lack of precise insight into the model's architecture. It is worth noting that the TimeGPT SDK and API have no restrictions on dataset size if a distributed backend is used. Other essential parameters used in the model included the frequency, level, horizon, target column, and time column. More information is provided in Appendix D.

Method of Moments
The preprocessing steps are similar to those in the LSTM and Transformer cases. The order of moments n was taken as 2 (starting from 0), and the number of hidden layers in the neural network NN was 2. The loss function was MSE, and the optimizer used was AdamW. Root Mean Squared Error (RMSE) scores were calculated for the predictions, and the best-performing checkpoint was stored (best performing in terms of training loss). The training progress was monitored and logged, and the best model checkpoint was saved based on the validation loss.
We will now report the evaluation results.Table 3 provides a list of abbreviations, which we use in the result tables.

Evaluation Results Using Individual Measurements
We present Mean Squared Error (MSE) values and additionally categorize regression values based on token assignment, aligning them with the closest ground truth values. This categorization is crucial for computing traditional classification-based metrics, enhancing the interpretability of the results for domain experts. The precision, recall, F1 score, and accuracy results for the LSTM and Transformer are detailed in Table 4. RMSE and MSE comparison results are provided in Table 5 in the same section. Table 6 summarizes aggregated measurements for all anomaly types. Notably, the TimeGPT model performed poorly; however, it is important to highlight that we lacked access to the model for fine-tuning on our dataset. The LSTM model outperformed the Transformer, possibly due to Transformers losing temporal information and facing overfitting issues related to the quadratic complexity of attention computation [45][46][47]. The method of moments demonstrated a significantly better performance among the function approximators, supporting our expectation that it is particularly well suited for robust anomaly prediction within the experimental context of this study.
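The categorization of regression outputs by token assignment can be sketched as snapping each predicted value to the nearest token, after which standard classification metrics apply:

```python
def bin_to_token(value, tokens=(0, 1, 2, 3, 4)):
    """Snap a regression output to the nearest token assignment so that
    classification metrics (precision, recall, F1, accuracy) can be
    computed against the ground truth tokens."""
    return min(tokens, key=lambda t: abs(t - value))

print(bin_to_token(0.2), bin_to_token(2.6), bin_to_token(3.9))
```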

Evaluation Results with Combining Rules
We used two separate combining rules, Noisy-OR and Noisy-MAX, as introduced in the independence of causal influence framework [20]. Combining rules combine probability values, not regression values. Therefore, we used the sigmoid of the binned regression values (binned to the closest token assignment) to convert the closeness value into a number between 0 and 1. This number denotes the probability of the influence of the corresponding measurement on the prediction outcome.
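A minimal sketch of the two combining rules, with the sigmoid conversion described above (the input closeness values are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def noisy_or(probs):
    """Noisy-OR: the anomaly is signaled unless every measurement
    independently fails to signal it."""
    out = 1.0
    for p in probs:
        out *= (1.0 - p)
    return 1.0 - out

def noisy_max(probs):
    """Noisy-MAX: the single strongest influence dominates."""
    return max(probs)

# Per-measurement influence probabilities obtained as the sigmoid of the
# binned regression closeness values (numbers here are illustrative).
probs = [sigmoid(v) for v in (0.4, -1.2, 2.0)]
print(round(noisy_or(probs), 3), round(noisy_max(probs), 3))
```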
The precision, recall, and F1 measures of the LSTM and Transformer models with the Noisy-OR combining rule are reported in Table 7, and those for the Noisy-MAX combining rule in Table 8. Here, we notice that the Noisy-OR rule results in better predictions than the Noisy-MAX rule. This shows that the severity of the anomalous occurrence compounds with multiple failing parts and does not depend on any single critical part failure (recall the illustration from Figure 1). The precision, recall, and F1 measures for the method of moments with Noisy-OR and Noisy-MAX are shown in Table 9. As expected, the method of moments again achieved superior results in predicting anomalies. We also examined how anomaly types with varying rarities affect the performances of the different models. The findings demonstrate that regardless of the rarity of an anomaly, the method of moments outperformed the other function approximator choices, which, in contrast, exhibited results with significant variance, as shown in Figures 4 and 5. A comparison of the RMSEs among the different function approximator choices using the combining rules is presented in Table 10 and Figure 6. Consistently, the method of moments exhibited a superior performance, showing lower RMSE values than the other function approximators. This reaffirms its predictive effectiveness, particularly in addressing the infrequent occurrence of anomalies.

Deployment of RI2AP
The proposed RI2AP method was deployed in the Future Factories (FF) cell, which is shown in Figure A4. The deployment plan, technical details, results, and issues faced during deployment are described below.

1. Input: The first step involves gathering and organizing saved models for important sensor variables, ensuring that they are ready for deployment. These saved models constitute the baselines and the proposed linear model based on the method of moments. An important task in this step is to verify the availability and compatibility of these models for deployment in the FF setup.

2. Data Preparation: This step involves integrating real-time data with server and Programmable Logic Controller (PLC) devices, enabling the collection of real-time data for analysis. Anomaly simulation mechanisms were developed to simulate various anomalies in the FF cell, tailored to each modeling approach, while normal-event simulation was also conducted for training and testing purposes.

3. Experimentation: This step involves feeding the prepared real-time data into the baseline models to analyze and predict outcomes.

4. Output: The output includes generating predictions of future normal and anomalous events based on the deployed models.

5. Validation: The results were validated by domain experts in the FF lab, who cross-checked the predictions against findings from previous research and empirical observations to ensure their accuracy and reliability.

6. Refinement: The models were refined based on the validation results and feedback from domain experts, ensuring that the deployed models were effective and accurate. An iterative improvement process was implemented, involving refinement, testing, and validation cycles to continually enhance the effectiveness and accuracy of the deployed models.
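The steps above can be sketched as one monitoring cycle; all names (`deployment_cycle`, `read_sensors`, `validate`) and the 0.5 decision threshold are illustrative assumptions, not the deployed code:

```python
def deployment_cycle(models, read_sensors, validate, threshold=0.5):
    """One pass of the Input -> Data Preparation -> Experimentation ->
    Output -> Validation loop described above (names are illustrative)."""
    readings = read_sensors()                        # real-time data collection
    predictions = {name: model(readings[name])       # per-sensor model inference
                   for name, model in models.items()}
    flagged = {n: p for n, p in predictions.items()  # anomalous-event output
               if p >= threshold}
    return validate(flagged)                         # expert validation step
```

The refinement step then closes the loop: validated results feed back into retraining and another cycle.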

Technical Details of Deployment
With the abundance of industrial communication protocols available within manufacturing systems, a successful deployment strategy hinges on utilizing the correct technologies to enable the proper functioning of the trained model. The Future Factories cell uses two main communication protocols throughout its equipment. The first uses MQTT as the main pathway to send and receive data: the data are collected on an edge device within the cell and published to a public MQTT broker for different assets to access. However, since this method relies on a public broker, the lag between the time the data are generated and the time they are received increases.
To ensure that the model operates as intended, it must receive data as close to real time as possible; the MQTT pathway might therefore introduce timing errors into the forecasting. The other available data pathway uses OPC UA. In this option, the PLC in the system hosts a local OPC UA server that receives data from the PLC every 10 ms and broadcasts them to any client connected to the server. As such, this path presents a more adequate solution. The full deployment architecture is shown in Figure 7. In this architecture, the trained model is deployed on a separate machine connected to the same network as the OPC UA server. The machine hosts an application that searches for the required data tags in the OPC UA information model and feeds them into the model. Once the next time step is predicted, it can be relayed back to the system through the server, where corrective actions can be taken if needed.
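A minimal sketch of the polling loop in this architecture, with `read_tag` and `write_tag` standing in for a real OPC UA client's read and write calls (the tag names, 10 ms period, and relay mechanism are assumptions, not the deployed application):

```python
import time

def poll_and_predict(read_tag, model, tags, write_tag, period_s=0.01, cycles=1):
    """Poll the required OPC UA tags at roughly the 10 ms server update rate,
    feed the readings to the trained model, and relay the predicted next
    time step back through the server for possible corrective action."""
    last = None
    for _ in range(cycles):
        values = [read_tag(t) for t in tags]  # fetch current sensor readings
        last = model(values)                  # predict the next time step
        write_tag("prediction", last)         # relay prediction to the server
        time.sleep(period_s)
    return last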

Results of Deployment and Discussion
During the deployment phase, the various types of anomalies outlined in Table 2 were systematically simulated to evaluate the efficacy of the deployed models. Data generated by the robots and sensors capable of capturing these anomalies within the assembly pipeline were fed into the models developed through the RI2AP training process. Representative illustrations of this process are presented in Figures 8 and 9. This methodology facilitated comprehensive testing and validation of the deployed system's capability to predict and respond to diverse anomaly scenarios within the manufacturing environment, thereby ensuring its robustness and reliability in practical applications.
Figures 8 and 9 show the two outputs of our predictive models, denoted Sensor prediction and Label, respectively. The former is the projected next sensor reading, while the latter distinguishes between anomalous and normal states. Upon reviewing the snapshots, one observation arises: there are instances where the model flags a state as anomalous despite the absence of any actual anomaly.
These false predictions can be analyzed via the parameters of the combining rule. Specifically, intricate contextual interactions between the predictor variables (the multiple sensors) are omitted because the sensors are modeled separately, resulting in an insufficient understanding of the status of the system. However, the causal-influence framework admits natural extensions to other dependency structures that relate the influences among the multiple sensors, such as general Directed Acyclic Graph (DAG) forms. Nevertheless, our proposed methodology RI2AP, employing combining rules, remains well suited to this context, primarily due to its enhanced interpretability and alignment with domain experts' requirements.
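A small numeric illustration of this limitation: because the rule sees only per-sensor marginals, two individually mild readings can push the combined score over a decision threshold even when their co-occurrence is harmless (the probabilities and the 0.5 threshold are illustrative):

```python
def noisy_or(probs):
    """P(anomaly) = 1 - prod_i(1 - p_i), treating sensors independently."""
    out = 1.0
    for p in probs:
        out *= 1.0 - p
    return 1.0 - out

# Each sensor alone reads as only mildly suspicious...
p_a, p_b = 0.3, 0.3
score = noisy_or([p_a, p_b])  # 1 - 0.7 * 0.7 = 0.51, crossing a 0.5 threshold
# ...so the independence-based rule raises a false alarm. A dependency-aware
# model (e.g., a DAG over the sensors) could learn that these two readings
# co-occur harmlessly and keep the combined score below the threshold.
```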
We plan to pursue this avenue in proposing a solution to this issue. The advantage of our framework is that it helps inform remedial measures during iterative development, with the end goal of obtaining a robust deployment. The initial findings in deployment represent the nascent phase of a more extensive deployment regimen. Looking ahead, our further deployment work entails implementing the combined model that we proposed, tailored explicitly to the requirements of manufacturing environments.

Engineering Challenges Faced in Deployment
During the deployment of RI2AP within the real manufacturing environment at the FF laboratory, several challenges arose that required careful attention and resolution. Primarily, significant effort was dedicated to adapting the code to seamlessly align with the format of the input data stream, ensuring smooth integration and functionality. Additionally, sensor-related issues emerged, with some sensors failing to generate acceptable values, necessitating intervention from domain experts to troubleshoot and rectify the discrepancies. Another hurdle involved the simulation of anomalies, which posed difficulties in accurately replicating real-world scenarios. Moreover, selecting suitable robots and sensor values for testing alongside simulated anomalies proved intricate, requiring close collaboration with the FF lab's domain specialists. Through concerted effort and the expertise of the involved stakeholders, these challenges were addressed and managed, contributing to the advancement and refinement of RI2AP's deployment in the manufacturing environment.

Conclusion, Future Work, and Broader Impact
This paper introduced RI2AP, a novel methodology designed to address the unique challenges of anomaly prediction in rocket assembly pipelines. We employed combining rules for an enhanced, domain-expert-friendly interpretation of the results. Empirical evaluations demonstrated the effectiveness of the proposed methodology.

Future Work
Equipped with a proof-of-concept implementation of our proposed method, we will explore several enhancements in future work. Firstly, we will learn a multisensor function approximator that considers all 20 measurements simultaneously, utilizing a neural network, and track the performance gap between our current implementation and the multisensor model's accuracy. This approach aims for a precise quantification of the tradeoff between accuracy and interpretability when integrating multiple individual function approximators. Secondly, we intend to investigate the impact of alternative combining rules, such as tree-structured conditional probability effects, and leverage do-calculus to manage potential backdoors and confounding factors. This step expands the exploration of combining rules beyond our current approach. Lastly, to enhance the interpretability of our methodology for domain experts, we propose developing higher-level representations of the causal phenomena related to anomalies. This involves exploring connections between sensor measurements and high-level constructs (such as structural integrity or gripper failures), offering insights beyond ground-level sensor readings in understanding anomalous occurrences.

Broader Impact
While the focus of this paper has been the application of the RI2AP method to rocket assembly, the techniques proposed here are fundamental and broadly applicable to other domains with similar problem characteristics, namely, rare event categories, dependencies between events, and causal structures among the factors affecting the rare events. Importantly, the proposed model was designed to be robust to the inherent stochasticity (noise and anomalies) of processes that produce time series data from physical sensors, and it contains expressive mechanisms for deriving explanations (that support causality), facilitating insights that are readily interpretable by the end user. Example applications include rare-event prediction in other manufacturing pipelines and corner-case prediction in healthcare applications (cases that deviate from the standard treatment protocol). Finally, owing to the unified handling of causal-influence frameworks, which adeptly deal with symbolic variables, and powerful function-approximation architectures, which handle real-valued variables, natural extensions toward neuro-symbolic or, more generally, statistical/symbolic/probabilistic approaches (with uncertainty estimation) are promising avenues to explore.

Funding: This work was supported in part by NSF grants #2119654, "RII Track 2 FEC: Enabling Factory to Factory (F2F) Networking for Future Manufacturing" and "Enabling Factory to Factory (F2F) Networking for Future Manufacturing across South Carolina". Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

Figure 1.
Figure 1. An abstract illustration of the RI2AP method proposed in this work. Sensor measurements correspond to the health of different rocket parts. Several function approximators are then used to predict anomalous occurrences from the sensor measurements, and their outputs are combined using combining rules. The combining rules allow natural aggregation mechanisms, e.g., Noisy-OR and Noisy-MAX, as shown in the illustration.

Figure 6.
Figure 6. Loss/error comparison of different function-approximator choices and combining-rule predictions.

Figure 7.
Figure 7. Deployment architecture of the forecasting model.

Table 1.
FF Dataset and its statistics.

Table 2.
Anomaly types in the FF Dataset.

Table 3.
List of abbreviations.

Table 4.
Evaluation results of baselines in univariate predictions: precision, recall, F1 score, and accuracy *.

Table 5.
Evaluation results of baselines in univariate predictions: RMSE and MSE *.

Table 7.
Evaluation results: Noisy-OR results of LSTM and Transformer *.

Figure A1. Feature importance scores using the XGBoost Cover measure for all features.
Figure A2. Feature importance scores using the XGBoost Cover measure for the top 20 features.