3.1. GraphNets for Arbitrary Inductive Biases
GraphNets (GNs) are a class of machine learning algorithms that operate on (typically predefined) attributed graph data and generalize several graph neural network architectures. An attributed graph is, in essence, a set of nodes (vertices)
$V:\{{\mathbf{v}}_{1},\cdots {\mathbf{v}}_{k}\}$ and edges
$E:\{({\mathbf{e}}_{1},{r}_{1},{s}_{1})\cdots ({\mathbf{e}}_{k},{r}_{k},{s}_{k})\}$, with
${\mathbf{e}}_{k}\in {\mathbb{R}}^{{N}^{e}}$ and
${\mathbf{v}}_{i}\in {\mathbb{R}}^{{N}^{v}}$. Each edge is a triplet
$({\mathbf{e}}_{j},{r}_{j},{s}_{j})$ (or equivalently
$({\mathbf{e}}_{j},{\mathbf{v}}_{{r}_{j}},{\mathbf{v}}_{{s}_{j}})$) and contains a reference to a receiver node
${\mathbf{v}}_{{r}_{j}}$, a sender node
${\mathbf{v}}_{{s}_{j}}$ as well as a (vector) attribute
${\mathbf{e}}_{j}$. Self-edges, i.e., edges with
${r}_{j}={s}_{j}$, are allowed. See
Figure 3 for an example of an attributed graph.
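The definition above can be mirrored by a small data structure. The following is a minimal sketch, not the implementation used in this work; the names `Edge`, `Graph` and `incoming` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class Edge:
    attr: Vector   # edge attribute e_j
    receiver: int  # index r_j of the receiver node
    sender: int    # index s_j of the sender node


@dataclass
class Graph:
    nodes: List[Vector]  # node attributes v_i
    edges: List[Edge]

    def incoming(self, i: int) -> List[Edge]:
        """Return the edges whose receiver is node i."""
        return [e for e in self.edges if e.receiver == i]


# Three nodes with 2-d attributes; the last edge is a self-edge on node 2.
g = Graph(
    nodes=[[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]],
    edges=[
        Edge([1.0], receiver=1, sender=0),
        Edge([2.0], receiver=2, sender=0),
        Edge([3.0], receiver=2, sender=1),
        Edge([0.0], receiver=2, sender=2),
    ],
)
print(len(g.incoming(2)))  # 3
```

Storing receiver and sender as indices rather than node copies keeps each edge a triplet, as in the definition above.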
In [
6], a more general class of GraphNets is presented, in which global variables that affect all nodes and edges are allowed. A GN with no global variables consists of a node function
${\varphi}^{v}$, an edge function
${\varphi}^{e}$, and an edge aggregation function
${\rho}^{e\to v}$. The function
${\rho}^{e\to v}$ should be (1) invariant to the permutation of its inputs and (2) able to accept a variable number of inputs. Simple valid aggregation functions are
$Min(\cdot)$,
$Max(\cdot)$,
$Sum(\cdot)$ and
$Mean(\cdot)$. Designing more general aggregation functions (for instance, by combining them) and investigating how these affect the approximation properties of GNs is currently an active research topic [
47].
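The two requirements on the aggregation function can be checked directly on the four simple aggregators listed above. This is an illustrative sketch; the helper `aggregate` and the dictionary `AGGREGATORS` are hypothetical names.

```python
import random
from typing import Callable, Dict, List

Vector = List[float]


def aggregate(op: Callable[[List[float]], float], msgs: List[Vector]) -> Vector:
    """Apply `op` element-wise across a variable-length list of messages."""
    return [op(list(column)) for column in zip(*msgs)]


AGGREGATORS: Dict[str, Callable[[List[float]], float]] = {
    "min": min,
    "max": max,
    "sum": sum,
    "mean": lambda xs: sum(xs) / len(xs),
}

msgs = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
shuffled = msgs[:]
random.Random(0).shuffle(shuffled)

for name, op in AGGREGATORS.items():
    # (1) Permutation invariance: any ordering of the inputs gives the same result.
    assert aggregate(op, msgs) == aggregate(op, shuffled)
    # (2) Variable number of inputs: two or three messages are both accepted.
    assert len(aggregate(op, msgs[:2])) == len(aggregate(op, msgs)) == 2
```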
Ignoring global graph attributes, the GraphNet computation procedure is as detailed in Algorithm 1. First, the new edge states are evaluated using the sender and receiver vertex attributes (
${\mathbf{v}}_{{s}_{i}}$ and
${\mathbf{v}}_{{r}_{i}}$, respectively) and the previous edge state
${\mathbf{e}}_{i}$ as arguments to the edge function
${\varphi}^{e}$. The arguments of the edge function may contain any combination of the source and target node attributes and the edge attribute. Afterwards, the nodes of the graph are iterated and the incoming edges for each node are used to compute an aggregated incoming edge message
${\overline{\mathbf{e}}}_{i}^{\prime}$. The aggregated edge message, together with the node attributes, is used to compute an updated node state. Typically, small multilayer perceptrons (MLPs) are used for the edge and node GraphNet functions
${\varphi}^{e}$ and
${\varphi}^{v}$. It is possible to compose GN blocks by using the output of one GN block as the input to another. Since a single GN block allows only first-order neighbors to exchange messages, GN blocks are composed as
$$G{N}_{K}\circ G{N}_{K-1}\circ \cdots \circ G{N}_{1}\circ G{N}_{0},$$
where “∘” denotes composition and $G{N}_{0}$ is applied first. The first GN block may cast the input graph data to a lower dimension so as to allow for more efficient computation. It may comprise edge functions that depend only on edge states
${\varphi}^{{e}_{0}}\left(\mathbf{e}\right)$ and correspondingly node functions that depend only on node states
${\varphi}^{{u}_{0}}\left(\mathbf{v}\right)$. This is referred to as a graph-independent GN block, and it is the type of layer used for the first and the last GN block. The inner GN steps (i.e.,
$G{N}_{1}$ to
$G{N}_{K-1}$) are full GN blocks, where message passing takes place. This general computational pattern is referred to as encode-process-decode [
6]. The inner GN blocks can either share weights, which yields a lower memory footprint for the whole model, or use different weights, which amounts to separate GN functions that need to be trained for each level. Sharing weights and repeatedly applying the same GN block helps propagate and combine information from more connected nodes in the graph. A message-passing GN block without global variables, such as the ones used in this work, is shown in
Figure 4.
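The need for several message-passing steps can be illustrated with a toy example. The sketch below is a deliberate simplification: scalar node states and a fixed max-based update stand in for the learned GN functions, and the names `core_step` and `process` are hypothetical. Information injected at node 0 of a chain reaches one additional node per application of the same (shared) step.

```python
from typing import List, Tuple


def core_step(states: List[float], edges: List[Tuple[int, int]]) -> List[float]:
    """One message-passing step: each receiver takes the maximum of its own
    state and the previous states of its senders."""
    new_states = list(states)
    for receiver, sender in edges:
        new_states[receiver] = max(new_states[receiver], states[sender])
    return new_states


def process(states: List[float], edges: List[Tuple[int, int]], n_core: int) -> List[float]:
    """Apply the same step n_core times (the shared-weights setting)."""
    for _ in range(n_core):
        states = core_step(states, edges)
    return states


edges = [(1, 0), (2, 1), (3, 2)]  # (receiver, sender) pairs of a chain 0 -> 1 -> 2 -> 3
states = [1.0, 0.0, 0.0, 0.0]     # only node 0 carries information initially

for n in range(4):
    print(n, process(states, edges, n))
# 0 [1.0, 0.0, 0.0, 0.0]
# 1 [1.0, 1.0, 0.0, 0.0]
# 2 [1.0, 1.0, 1.0, 0.0]
# 3 [1.0, 1.0, 1.0, 1.0]
```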
Algorithm 1: GN block without global variables [6].
function GraphNetwork($E$, $V$)
    for $k\in \{1\dots {N}^{e}\}$ do
        ${\mathbf{e}}_{k}^{\prime}\leftarrow {\varphi}^{e}\left({\mathbf{e}}_{k},{\mathbf{v}}_{{r}_{k}},{\mathbf{v}}_{{s}_{k}}\right)$ ▹ 1. Compute updated edges
    end for
    for $i\in \{1\dots {N}^{n}\}$ do
        let ${E}_{i}^{\prime}={\left\{\left({\mathbf{e}}_{k}^{\prime},{r}_{k},{s}_{k}\right)\right\}}_{{r}_{k}=i,\,k=1:{N}^{e}}$
        ${\overline{\mathbf{e}}}_{i}^{\prime}\leftarrow {\rho}^{e\to v}\left({E}_{i}^{\prime}\right)$ ▹ 2. Aggregate edges per node
        ${\mathbf{v}}_{i}^{\prime}\leftarrow {\varphi}^{v}\left({\overline{\mathbf{e}}}_{i}^{\prime},{\mathbf{v}}_{i}\right)$ ▹ 3. Compute updated nodes
    end for
    let ${V}^{\prime}={\left\{{\mathbf{v}}_{i}^{\prime}\right\}}_{i=1:{N}^{v}}$
    let ${E}^{\prime}={\left\{\left({\mathbf{e}}_{k}^{\prime},{r}_{k},{s}_{k}\right)\right\}}_{k=1:{N}^{e}}$
    return $({E}^{\prime},{V}^{\prime})$
end function
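The three steps of Algorithm 1 translate almost line by line into code. The following is a minimal sketch under stated assumptions: attributes are plain Python lists, the toy functions `phi_e` and `phi_v` are hypothetical stand-ins for the MLPs, and the aggregator returns a zero message for nodes without incoming edges (a choice made here only to keep the example runnable).

```python
from typing import Callable, List, Tuple

Vector = List[float]
Edge = Tuple[Vector, int, int]  # (attribute e_k, receiver r_k, sender s_k)


def mean_aggregate(messages: List[Vector]) -> Vector:
    """Element-wise mean; returns [0.0] for a node without incoming edges
    (an assumption of this sketch, valid for 1-d messages only)."""
    if not messages:
        return [0.0]
    return [sum(column) / len(messages) for column in zip(*messages)]


def gn_block(
    edges: List[Edge],
    nodes: List[Vector],
    phi_e: Callable[[Vector, Vector, Vector], Vector],
    phi_v: Callable[[Vector, Vector], Vector],
    rho: Callable[[List[Vector]], Vector],
) -> Tuple[List[Edge], List[Vector]]:
    # 1. Compute updated edges from (edge, receiver node, sender node).
    new_edges = [(phi_e(e, nodes[r], nodes[s]), r, s) for e, r, s in edges]
    new_nodes = []
    for i, v in enumerate(nodes):
        # 2. Aggregate the updated edges whose receiver is node i.
        incoming = [e for e, r, _ in new_edges if r == i]
        e_bar = rho(incoming)
        # 3. Compute the updated node from the aggregate and the old node.
        new_nodes.append(phi_v(e_bar, v))
    return new_edges, new_nodes


# Toy stand-ins for the MLPs (hypothetical, 1-d attributes).
phi_e = lambda e, v_r, v_s: [e[0] + v_s[0]]
phi_v = lambda e_bar, v: [v[0] + e_bar[0]]

nodes = [[1.0], [2.0], [3.0]]
edges = [([0.5], 1, 0), ([1.0], 2, 0), ([0.0], 2, 1)]
new_edges, new_nodes = gn_block(edges, nodes, phi_e, phi_v, mean_aggregate)
print(new_nodes)  # [[1.0], [3.5], [5.0]]
```

Node 0 has no incoming edges and keeps its state, whereas node 2 receives the mean of two updated edge messages.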

In the present work, as is the case with RNNs [
48] and causal CNNs [
49], the causal structure of time series is exploited, which serves as a good inductive bias for the problem at hand, but without requiring that the data be processed as a chain graph or that the observations be equidistant. Instead, an arbitrary causal graph for the underlying state is built, together with functions that infer the quantity of interest, namely the remaining useful life of a component, given a set of non-consecutive short-term observations.
3.2. Incorporation of Temporal Causal Structure with GNs and Temporal CNNs (GNN-tCNN)
The variable dependencies of the proposed model are depicted in
Figure 5 for three observations. The computational architecture is depicted in detail in
Figure 6. The variable
${Z}_{K}$ represents the current estimate for the latent state of the system. This corresponds to the node states
V. The variable
${T}_{K\to L}$, which represents the propagated latent state from past observations, depends on the latent state
${Z}_{K}$, an exogenous input
${F}_{K\to L}$ that controls the propagation of state
${Z}_{K}$ to
${Z}_{L}$ and potentially other propagated latent state estimates from instants before
${t}_{L}$. The variable
${T}_{K\to L}$ corresponds to an updated edge state
${E}^{\prime}$, and the exogenous input
${F}_{K\to L}$ can be taken as the edge state before the edge update step,
E. The exogenous input
${F}_{K\to L}$ to the state propagation function can be as simple as the elapsed time between two time instants, i.e.,
${F}_{K\to L}={t}_{K\to L}={t}_{L}-{t}_{K}$, or encode more complex inductive biases, such as values representing different operating conditions during the interval between observations. An arbitrary number of past states can be propagated from past observations and aggregated in order to yield better estimates for a latent state
${Z}_{L}$. In addition to propagated latent states, instantaneous observations of raw data
${X}_{K}$ inform the latent state
${Z}_{K}$. For instance, in
Figure 5,
${Z}_{C}$ depends on
${T}_{B\to C}$ and on
${T}_{A\to C}$, potentially on further propagated states from past observations (other yellow nodes in the graph), and on an instantaneous observation
${X}_{C}$. Each inferred latent state
${Z}_{i}$ can be transformed to a distribution for the quantity of interest
${Y}_{i}$. The value of the propagated state variable from state
s to state
d,
${T}_{s\to d}$, depends jointly on the edge attributes and on the latent state of the source node. In a conventional RNN model,
${F}_{K\to L}$ corresponds to an exogenous input for the RNN cell. In contrast to an RNN model, in this work the estimate of each state depends on multiple past states through propagated states that are modulated by the exogenous input. In this manner, an arbitrary and variable number of past states can be used directly for refining the estimate of the current latent state, instead of relying only on the estimate summarized in the latent cell state of the RNN. In the proposed model, the parameters of the functions relating the variables of the model are learned directly from the data and essentially define the inductive biases that follow naturally from the temporal ordering of the observations. This approach treats all past observations uniformly and allows an arbitrary number of them to be considered when estimating the current latent state.
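A causal graph with elapsed time as the exogenous edge input can be assembled from the observation times alone. The sketch below assumes that every earlier observation is connected to every later one, which is one possible choice of causal graph; the function name `build_causal_graph` is hypothetical.

```python
from typing import List, Tuple

# (elapsed time t_L - t_K, receiver index L, sender index K)
TimedEdge = Tuple[float, int, int]


def build_causal_graph(times: List[float]) -> List[TimedEdge]:
    """Connect each observation to all later ones; the edge feature is the
    elapsed time between the two observations."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    edges = []
    for pos, receiver in enumerate(order):
        for sender in order[:pos]:
            edges.append((times[receiver] - times[sender], receiver, sender))
    return edges


# Three non-equidistant observations A, B, C (indices 0, 1, 2).
times = [0.0, 1.5, 4.0]
edges = build_causal_graph(times)
print(edges)  # [(1.5, 1, 0), (4.0, 2, 0), (2.5, 2, 1)]
```

No edge points backwards in time, and the last observation receives a propagated state from both earlier ones, as in the three-observation example above.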
The connections between all observed past states and the last one, where prediction (readout) is performed, are implemented as a node-to-edge transformation followed by an aggregation. The aggregation corresponds to the edge-aggregation function
${\rho}^{e\to u}(\cdot)$ of the GraphNet. In this manner, it is possible to propagate information from all distant past states in a single computation step. As also mentioned in the introduction, this is one of the computational advantages of the transformer architecture [
18], which is related to GNs. In contrast to a causal transformer architecture, the causal GNN approach proposed herein allows for parametrizing the edges between different states. This key difference is what allows the proposed model to work on arbitrarily spaced data. The different steps of the causal GN computation and how they relate to the general GN are further detailed in
Figure 7.
As is the case when using transformer layers, the computational burden increases quadratically with the context window. Therefore, using all available past states would be inefficient. To remedy this, past observations can be randomly sampled in order to perform predictions for the current step. Similarly, during training, randomly sampling the past states yields unbiased estimates of the gradients for the propagation and feature extraction models. This was found to be an effective training strategy for the presented use cases.
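The sampling strategy can be sketched as follows. Uniform sampling without replacement is an assumption made for this illustration, as are the function name `sample_context` and the sizes used; the edge count assumes that every sampled observation is connected to every later one.

```python
import random
from typing import List


def sample_context(n_past: int, n_samples: int, rng: random.Random) -> List[int]:
    """Pick up to n_samples indices of past observations, uniformly and
    without replacement, returned in temporal order (indices ordered by time)."""
    k = min(n_samples, n_past)
    return sorted(rng.sample(range(n_past), k))


rng = random.Random(0)
context = sample_context(n_past=100, n_samples=5, rng=rng)
nodes = context + [100]  # sampled past observations plus the current one

# Number of edges when every earlier node is connected to every later node.
n_edges_sampled = len(nodes) * (len(nodes) - 1) // 2
n_edges_full = 101 * 100 // 2
print(n_edges_sampled, n_edges_full)  # 15 5050
```

Drawing a new random context at every training step exposes the model to different subsets of the past while keeping the cost per step fixed.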
In GN terms, the “encode” GraphNet block (
$G{N}_{enc}:\{{\varphi}^{{u}_{0}},{\varphi}^{{e}_{0}}\}$) is a graph-independent block consisting of the node function
${\varphi}^{{u}_{0}}$ and the edge function
${\varphi}^{{e}_{0}}$. The node function is a temporal convolutional neural network (temporal CNN), with the architecture detailed in
Table 3.
The edge update function is a feedforward neural network. The input of the edge function is the temporal difference between observations. Both networks cast their inputs to vectors of the same size. The
$G{N}_{core}:\{{\varphi}^{{u}_{c}},{\varphi}^{{e}_{c}},{\rho}^{e\to u}\}$ network consists of small feedforward neural networks for the node MLP
${\varphi}^{{u}_{c}}$ and the edge MLP
${\varphi}^{{e}_{c}}$. The input of the edge MLP is the sender and receiver state and the previous edge state. The MLP is implemented with a residual connection to allow for better propagation of gradients through multiple steps [
50].
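A residual edge update of the kind described above can be sketched as follows. Several details are assumptions made for illustration and are not taken from this work: a single hidden layer with ReLU activation, concatenation of the three inputs, and the residual being added to the previous edge state. The names `linear`, `relu` and `residual_edge_update` are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, w: Matrix, b: Vector) -> Vector:
    """Affine map w @ x + b with w given as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(x: Vector) -> Vector:
    return [max(0.0, xi) for xi in x]


def residual_edge_update(e: Vector, v_r: Vector, v_s: Vector,
                         w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """Return e + MLP([e, v_r, v_s]); the skip connection passes e through."""
    inputs = e + v_r + v_s  # list concatenation of the three inputs
    delta = linear(relu(linear(inputs, w1, b1)), w2, b2)
    return [ei + di for ei, di in zip(e, delta)]


e, v_r, v_s = [1.0, 2.0], [0.5, 0.5], [1.0, -1.0]
w1 = [[1, 0, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0], [0, 0, 0, 0, 1, 1]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1, 1, 1], [0, 0, 0]]
b2 = [0.0, 0.0]
print(residual_edge_update(e, v_r, v_s, w1, b1, w2, b2))  # [3.0, 2.0]

# With a zero output layer the block reduces to the identity on e, which is
# what lets gradients pass unchanged through repeated applications.
zeros = [[0, 0, 0], [0, 0, 0]]
print(residual_edge_update(e, v_r, v_s, w1, b1, zeros, b2))  # [1.0, 2.0]
```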
In this work, the
$Mean(\cdot)$ aggregation function was chosen for step 2 in Algorithm 1, since it does not depend strongly on the in-degree of the state nodes
${Z}_{i}$ (i.e., the number of incoming messages). The node MLP of the core network is also implemented as a residual MLP.
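The insensitivity of the mean to the in-degree can be seen in a toy comparison with the sum. In the sketch below, identical messages are used so that the effect of the number of incoming edges is isolated; the function names are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_agg(msgs: List[Vector]) -> Vector:
    return [sum(column) / len(msgs) for column in zip(*msgs)]


def sum_agg(msgs: List[Vector]) -> Vector:
    return [sum(column) for column in zip(*msgs)]


message = [0.5, -1.0]
for in_degree in (1, 10, 100):
    msgs = [message] * in_degree
    print(in_degree, mean_agg(msgs), sum_agg(msgs))
# 1 [0.5, -1.0] [0.5, -1.0]
# 10 [0.5, -1.0] [5.0, -10.0]
# 100 [0.5, -1.0] [50.0, -100.0]
```

The magnitude of the summed message grows with the number of sampled past states, whereas the mean stays on the same scale, which matters when the number of incoming edges varies between graphs.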
The
$G{N}_{core}$ network is applied multiple times to the output of
$G{N}_{enc}$. This amounts to the shared-weights variant of GNs, which allows for propagation of information over multiple steps while retaining a small memory footprint. After the last
$G{N}_{core}$ step is applied, a final graph-independent layer is employed. At this point, only the final state of the last node is needed for further computation, i.e., the state corresponding to the last observation. The state of the last node is passed through two MLPs that terminate with
$Softplus$ activation functions.
The
$Softplus$ activation is needed for forcing the outputs to be positive, since they are used as parameters for a
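$Gamma$ distribution which in turn is used to represent the RUL estimates. The GraphNet computation procedure detailed above is denoted as
$${g}_{out}=G{N}_{tot}\left({g}_{in}\right)=\left(G{N}_{dec}\circ G{N}_{core}^{\left({N}_{c}\right)}\circ G{N}_{enc}\right)\left({g}_{in}\right),$$
where $G{N}_{dec}$ denotes the final graph-independent layer,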
$G{N}_{core}^{\left({N}_{c}\right)}$ denotes
${N}_{c}$ compositions of the
$G{N}_{core}$ GraphNet and
${g}_{in}$ and ${g}_{out}$ are the input and output graphs, respectively. The vertex attribute of the final node, as mentioned before, is in turn used as the rate (
$\alpha \left(G{N}_{tot}\left({g}_{in}\right)\right)$) and concentration (
$\beta \left(G{N}_{tot}\left({g}_{in}\right)\right)$) parameters of a
$Gamma(\alpha ,\beta )$ distribution. For ease of notation, the parameters (weights) of all the functions involved are denoted by “
$\mathbf{\theta}$” and the functions that return the rate and concentration are denoted as
${f}_{\alpha ;\mathbf{\theta}}$ and
${f}_{\beta ;\mathbf{\theta}}$, respectively, to denote explicitly their dependence on “
$\mathbf{\theta}$”. The
$Gamma$ distribution was chosen for the output values since they correspond to remaining time and are therefore necessarily positive. The GN described above is trained so as to directly maximize the expected likelihood of the remaining useful life estimates. For numerical reasons, the negative log-likelihood (
NLL) is equivalently minimized. The optimization problem reads,
$$\underset{\mathbf{\theta}}{min}\;{\mathbb{E}}_{p\in \mathcal{P},\,(\mathbf{g},\mathbf{y})\sim \mathcal{S}}\left[-log\,Gamma\left(\mathbf{y};\,{f}_{\alpha ;\mathbf{\theta}}\left(\mathbf{g}\right),{f}_{\beta ;\mathbf{\theta}}\left(\mathbf{g}\right)\right)\right]\approx \underset{\mathbf{\theta}}{min}\sum _{p\in \mathcal{P}}\frac{1}{{N}^{s,p}}\sum _{i=1}^{{N}^{s,p}}-log\,Gamma\left({y}_{i};\,{f}_{\alpha ;\mathbf{\theta}}\left({g}_{i}\right),{f}_{\beta ;\mathbf{\theta}}\left({g}_{i}\right)\right)\phantom{\rule{2em}{0ex}}(5)$$
where
$\mathbf{g}$ corresponds to the sets of input causal graphs, and
$\mathbf{y}$ corresponds to the estimate of RUL for the last observation of each graph. The input graphs in our case consist of nodes, which correspond to observations, and edges, which carry the time difference as their feature. Correspondingly,
${g}_{i}$ and
${y}_{i}$ are single samples from the aforementioned set of causal graphs and remaining useful life estimates and
${N}^{s,p}$ denotes the number of sampled causal graphs from experiment
p that are used for computing the loss (i.e., the batch size). The expectation is approximated by an average over the set of available training experiments, denoted as
$\mathcal{P}$ and the random causal graphs created for training
$\mathcal{S}$. The gradients of Equation (5) are computable through implicit reparametrization gradients [
51]. This technique allows for low-variance estimates for the gradient of the NLL loss with respect to the parameters of the distribution, which in turn allows for a complete end-to-end differentiable training procedure for the proposed architecture.
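The positivity constraint and the loss can be illustrated with a few lines of code. The sketch below is written under stated assumptions: it uses the shape (concentration) and rate parametrization of the Gamma density, it names the two parameters explicitly rather than mapping them to $\alpha$ and $\beta$, and the raw network outputs are made-up numbers. The functions `softplus`, `gamma_nll` and `batch_nll` are hypothetical names; gradients are not computed here.

```python
import math
from typing import List, Tuple


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); maps real values to positive ones."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def gamma_nll(y: float, concentration: float, rate: float) -> float:
    """Negative log-density of Gamma(concentration, rate) evaluated at y > 0."""
    log_density = (
        concentration * math.log(rate)
        - math.lgamma(concentration)
        + (concentration - 1.0) * math.log(y)
        - rate * y
    )
    return -log_density


def batch_nll(ys: List[float], raw_outputs: List[Tuple[float, float]]) -> float:
    """Mean NLL over a batch; raw_outputs holds the unconstrained outputs of
    the two output MLPs before the Softplus activation."""
    total = 0.0
    for y, (raw_concentration, raw_rate) in zip(ys, raw_outputs):
        total += gamma_nll(y, softplus(raw_concentration), softplus(raw_rate))
    return total / len(ys)


# Sanity check: Gamma(1, 1) is the unit exponential, so the NLL at y equals y.
print(gamma_nll(2.0, 1.0, 1.0))  # 2.0

# Made-up RUL targets and raw outputs for a batch of two graphs.
print(batch_nll([10.0, 3.0], [(1.2, -0.5), (0.3, 0.1)]))
```

For extremely negative raw outputs the Softplus value underflows to zero in floating point, so in practice a small positive offset is commonly added before the value is used as a distribution parameter.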