#### 4.2. Training Machine Learning Models

Numerical training experiments started from the pressure probe footprints generated by the mould filling simulations. A set of 3000 pressure footprint images was used to train the CNN model to predict five variables: the position, ${\alpha}_{1},{\alpha}_{2}$, the size, ${\alpha}_{3},{\alpha}_{4}$, and the relative permeability, $\beta$. Both the pressure probe footprints and the five variables were already normalized to the $[0,1]$ interval, so no extra preprocessing was required.

The first step was to split the generated dataset randomly into two subsets, usually known as the training and test sets, in an 80/20 ratio. The test set acted as a never-before-seen dataset with which to evaluate the model on data not used during training.
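As a minimal sketch, the 80/20 random split described above can be reproduced in plain Python (the function name, seed and placeholder dataset are illustrative assumptions, not taken from the original code):

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle the dataset and split it into training and test subsets."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_test = int(len(samples) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [samples[i] for i in train_idx], [samples[i] for i in test_idx]

# 3000 footprint images split 80/20 -> 2400 training, 600 test
dataset = list(range(3000))
train_set, test_set = train_test_split(dataset)
```

In practice the same split is available off the shelf, e.g. via `sklearn.model_selection.train_test_split`.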

The subsequent step deals with the training of the CNN model using Keras. The network described in the previous section was coded in Keras as a sequential stack of two convolutional layers and three dense layers (see Figure 6 for more details). Each layer applies tensor operations to its input data, and these operations make use of weight and bias factors. The weights and biases are the intrinsic attributes of the layers and constitute the parameters in which the learning capacity of the network resides. A total of 860,517 parameters was used in the CNN model, Table 1. The network parameters are determined by minimizing a norm defined as the mean of the squared differences between the true values of the variables (${X}_{i}={\alpha}_{1},{\alpha}_{2},{\alpha}_{3},{\alpha}_{4},\beta$) and those predicted by the CNN. This Mean Squared Error, $MSE=\frac{1}{N}{\sum}_{i=1}^{N}{({X}_{i}^{pred}-{X}_{i}^{true})}^{2}$, with $N$ the size of the dataset, is used in this work as the loss function to minimize. The loss function was minimized iteratively with a gradient descent variant; the exact update rules are those of the `Adadelta` [25] Keras optimizer. Training was carried out for not more than 5000 epochs with a batch size of 64 and lasted around 16 h on a 10-core Intel Xeon W-2155 CPU at 3.30 GHz. The evolution of the training and test losses against the number of training epochs is presented in Figure 7. The best model configuration produced a minimum MSE of $\approx {10}^{-2}$ after training, which was judged accurate enough for modelling purposes.
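The `Adadelta` update rule (Zeiler, 2012) that underlies the Keras optimizer can be sketched in plain Python; this is an illustrative toy example on a one-dimensional quadratic loss, not the actual training loop, and the variable names are ours:

```python
import math

def adadelta_step(x, grad, Eg2, Edx2, rho=0.95, eps=1e-6):
    """One Adadelta update: adapts the step size from running averages,
    so no global learning rate is required."""
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2          # running avg of squared gradients
    dx = -math.sqrt(Edx2 + eps) / math.sqrt(Eg2 + eps) * grad
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2          # running avg of squared updates
    return x + dx, Eg2, Edx2

# Minimize the toy loss (x - 3)^2 starting from x = 0
x, Eg2, Edx2 = 0.0, 0.0, 0.0
for _ in range(2000):
    grad = 2.0 * (x - 3.0)                           # d/dx of (x - 3)^2
    x, Eg2, Edx2 = adadelta_step(x, grad, Eg2, Edx2)
```

The accumulated-update term in the numerator is what distinguishes Adadelta from plain Adagrad and keeps its effective step size from decaying to zero.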

It is worth highlighting the similarity of the MSE loss curves obtained for the training and test datasets, which is an indicator of reasonable model performance on unseen data. Highly dissimilar behaviour of these two curves usually indicates overfitting, a common problem in machine learning. Overfitting arises when the complexity of the network and the number of network parameters are too high relative to the dataset size. In that case, the accuracy obtained on the training set can be excellent while the error on the test dataset remains unacceptable, indicating deficient model generalization to new unseen data. Several strategies were implemented in this work to alleviate possible overfitting, following recommendations found in the literature, namely data augmentation, ${L}_{2}$ regularization and dropout in the fully-connected layers.

An augmented dataset was generated from the pressure sensor signals by adding white noise to each image of the training set. The white noise follows a normal distribution $N(0,0.001)$ with zero mean and 0.001 standard deviation. The augmented dataset then contains a total of 14,400 images: 2400 from the original set computed with OpenFoam and the remaining 12,000 from augmentation.
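The augmentation step amounts to generating several noisy copies of each training image. A minimal stdlib-only sketch (the helper name, seed and placeholder image shape are our assumptions):

```python
import random

def augment(images, copies=5, sigma=0.001, seed=0):
    """Create `copies` noisy versions of each image by adding
    N(0, sigma) white noise, keeping the originals as well."""
    rng = random.Random(seed)
    augmented = list(images)                       # keep the originals
    for img in images:
        for _ in range(copies):
            augmented.append([p + rng.gauss(0.0, sigma) for p in img])
    return augmented

# 2400 original footprints, 5 noisy copies each -> 14,400 images in total
originals = [[0.0] * 16 for _ in range(2400)]      # placeholder "images"
full_set = augment(originals)
```

With five noisy copies per original, 2400 training images yield the 14,400-image augmented set quoted above.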

${L}_{2}$ regularization adds a constraining term to the MSE loss function which is proportional, with a regularization factor of $\lambda =5\times {10}^{-4}$, to the total sum of the squared values of the network parameters ($L=\frac{1}{N}{\sum}_{i=1}^{N}{({X}_{i}^{pred}-{X}_{i}^{true})}^{2}+\lambda {\sum}_{j}{\epsilon}_{j}^{2}$, with ${\epsilon}_{j}$ the network parameters). Thus, large parameter values are penalized, preventing possible overfitting. Lastly, dropout was applied in the fully-connected layers, which randomly drops out (sets to zero) a number of the output features of a layer during training, producing a less regular structure. The loss curves corresponding to the case without any of these overfitting-mitigation strategies are also presented in Figure 7. Although the training loss in this case was excellent, close to ${10}^{-4}$, the gap with respect to the test loss was unacceptable. Thus, the model in this case was unable to generalize to new unseen data with the same precision.
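The regularized loss can be written out directly; the sketch below is illustrative (in Keras the same penalty is normally attached per layer, e.g. via `keras.regularizers.l2`, rather than computed by hand):

```python
def regularized_loss(preds, trues, params, lam=5e-4):
    """MSE over the data plus an L2 penalty on the network parameters."""
    n = len(preds)
    mse = sum((p - t) ** 2 for p, t in zip(preds, trues)) / n
    penalty = lam * sum(w ** 2 for w in params)    # lambda * sum of squared weights
    return mse + penalty

# Perfect prediction but one large weight: only the penalty remains
loss = regularized_loss([0.5], [0.5], [2.0])       # 5e-4 * 2^2 = 0.002
```

The penalty grows with the magnitude of the weights, discouraging the over-complex fits associated with overfitting.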

The comparisons between the ground truth values of the variables, ${\alpha}_{1},{\alpha}_{2}$, ${\alpha}_{3},{\alpha}_{4}$ and $\beta$, and those predicted by the CNN are gathered in Figure 8. The figure includes both the training and test datasets. As a first approximation, the correlation between predicted and ground truth values was fairly good. This was especially true for the position of the dissimilar material region, ${\alpha}_{1},{\alpha}_{2}$, Figure 8a,b. In this case the network was able to learn very efficiently from the given footprint using only the features associated with the rise of the pressure signals. However, the accuracy attained for the remaining variables was, in general, more modest, although the overall trends were well captured, Figure 8c,d,f. A plausible explanation for this reduction in accuracy is the similarity of the pressure fields generated by the presence of the dissimilar material region. Two regions defined with similar values of size and/or relative permeability produce almost indistinguishable fluid pressure fields, so the mapping from pressure footprint to variables is nearly non-unique. This reduction in accuracy was more evident for the relative permeability parameter $\beta$, which is essentially controlled by the pressure gradients. Figure 8e illustrates this point: the pressure fields for two small values of $\beta$ may differ only slightly once the macroscopic flow reaches the outlet gates, again producing almost identical pressure footprints. Nonetheless, the accuracy was judged reasonable for the automatic detection of the position and severity of the dissimilar material region.

The histograms of the individual absolute errors, computed as ${X}^{pred}-{X}^{true}$, for the five variables are also presented in Figure 8f. As mentioned previously, the prediction of the two position variables (${\alpha}_{1},{\alpha}_{2}$) was excellent, and the error in this case exhibits a Dirac-delta-like distribution with $90\%$ of the data lying within an absolute error band of less than $3\%$. It should be pointed out that the model variables were expressed in non-dimensional, normalized form, and thus the absolute errors are expressed as percentages. The error distributions for the remaining variables were, of course, flatter, for the plausible reasons discussed previously. The fractions of the total data corresponding to predictions with absolute errors lying within the $\pm 10\%$ and $\pm 20\%$ error bands are presented in Table 2 for the sake of completeness.
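The error-band fractions reported in Table 2 reduce to a simple counting operation; a sketch with made-up example values (the function name and the sample numbers are illustrative, not data from the study):

```python
def fraction_within(preds, trues, band):
    """Fraction of predictions whose absolute error lies within +/- band."""
    errors = [abs(p - t) for p, t in zip(preds, trues)]
    return sum(e <= band for e in errors) / len(errors)

# Hypothetical normalized predictions vs. ground truth for one case
preds = [0.26, 0.24, 0.12, 0.12, 0.08]
trues = [0.25, 0.25, 0.125, 0.125, 0.10]
frac10 = fraction_within(preds, trues, 0.10)       # fraction inside the +/-10% band
```

Applied over the whole test set, and with bands of 0.10 and 0.20, this yields the entries of Table 2.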

Figure 9a illustrates the overall performance of the model. The plots contain some randomly selected dissimilar material regions together with the corresponding predictions obtained with the test dataset and the $3\times 3$ sensor network size. As discussed previously, the accuracy of the predictions was fairly good, showing the ability of the proposed model to capture the presence of dry regions during liquid moulding.

The accuracy of the model was also addressed for additional cases with different pressure network sizes of $2\times 2$, $3\times 3$ (baseline) and $4\times 4$, corresponding to 4, 9 and 16 equally spaced pressure sensors. It should be noted that, as the OpenFoam simulations were run a single time, saving the pressure probe evolution at the locations corresponding to each specified network, no recalculations were needed. The three models were trained with the same procedure explained previously, and the corresponding MSE losses obtained for the $2\times 2$, $3\times 3$ and $4\times 4$ networks were 0.016, 0.011 and 0.012, respectively. The MSE losses obtained for the $3\times 3$ and $4\times 4$ networks were very similar. These results seem to indicate that the dissimilar material region size used in this study, which follows the uniform distribution ${U}_{{\alpha}_{3},{\alpha}_{4}}(0.075,0.150)$, is already well captured with the $3\times 3$ network; increasing the number of sensors to $4\times 4$ does not improve the accuracy of the model for such a region size. Accordingly, the sensor network size should be determined in advance according to the minimum dissimilar material region size to be detected. The predictions for the ground truth cases presented in Figure 9a using the $2\times 2$, $3\times 3$ and $4\times 4$ networks are summarized in Figure 9b for the sake of completeness.

Lastly, the flow progress predictions for the case presented in Figure 3 are shown in Figure 10. This case corresponds to a $2b=2h=0.25L$ square region centred at ${x}_{0}=0.25L$, ${y}_{0}=0.25L$ with a relative permeability of $\beta =0.1$. The pressure footprint presented in Figure 4a was used as input to predict the position, size and relative permeability, yielding the 5-tuple $(0.258,0.241,0.118,0.117,0.083)$ against the ground truth values of $(0.250,0.250,0.125,0.125,0.1)$. OpenFoam simulations were subsequently run and the corresponding flow patterns gathered in Figure 10. The agreement between the ground truth flow patterns shown in Figure 3 and the predicted ones was excellent, with a $4.9\times {10}^{-2}$ MSE, considering that the only information used comes from a discrete network of pressure sensors.
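The per-variable agreement for this case can be checked directly from the two 5-tuples quoted above (the variable names below are ours; the numbers are those of the example):

```python
# Predicted and ground truth 5-tuples (alpha1, alpha2, alpha3, alpha4, beta)
pred  = (0.258, 0.241, 0.118, 0.117, 0.083)
truth = (0.250, 0.250, 0.125, 0.125, 0.100)
abs_errors = [abs(p - t) for p, t in zip(pred, truth)]
```

All absolute errors stay below 0.02 in normalized units, with the largest one on the relative permeability $\beta$, consistent with the accuracy discussion above.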