Reinforcement Learning for the Face Support Pressure of Tunnel Boring Machines

Abstract: In tunnel excavation with boring machines, the tunnel face is supported to avoid collapse and minimise settlement. This article proposes the use of reinforcement learning, specifically the deep Q-network algorithm, to predict the face support pressure. The algorithm uses a neural network to make decisions based on the expected rewards of each action. The approach is tested both analytically and numerically. By using the soil properties ahead of the tunnel face and the overburden depth as the input, the algorithm is capable of predicting the optimal tunnel face support pressure whilst minimising settlement, and adapting to changes in geological and geometrical conditions. The algorithm reaches maximum performance after 400 training episodes and can be used for random geological settings without retraining.


Introduction
Face stability is critical in shallow tunnels to avoid collapse. In mechanised tunnels, the face support is provided by the tunnel boring machine (TBM), e.g., slurry pressure balance (SPB) or earth pressure balance (EPB) shields. An estimate of the support pressure is required for safe and efficient construction.
The problem of face stability can be solved with analytical, numerical, and experimental approaches. The analytical methods are mainly based on the limit state analysis. Either the lower and upper bound theorems of plasticity in undrained [1] and drained [2] conditions or the limit equilibrium method for SPB [3] and EPB [4] based on the original failure mechanism developed in [5] are used. These formulations have been recently considered in the design guidelines [6].
Face stability can also be investigated by numerical analysis. The finite element [7-9] and finite element limit analysis [10] methods have been used to obtain design equations. Alternatively, the problem can be studied experimentally at 1 g [8,11-15] or with centrifuge model tests [16-18]. Compared to 1 g tests, centrifuge model testing offers the advantages of stress and strain scaling due to the increased gravity field, thus increasing the accuracy in simulating complex soil-structure interactions.
Recently, machine learning has emerged as a promising technique for predictive assessment in geotechnical engineering, in general [19-21], and in tunnelling, in particular [22-25]. Some promising research domains for machine learning in tunnelling are the geological prognosis ahead of the face, the interpretation of monitoring results, automation, and maintenance [22]. At present, however, research appears to be focussed on the following topics: prediction of TBM operational parameters, penetration rate, pore-water pressure [26], ground settlement [27-29], disc cutter replacement [30-32], jamming risk [33,34], and geological classification [35-38]. Mainly due to the significant amount of data that is automatically collected by modern tunnel boring machines, the prediction of their operational parameters is a popular research topic. Researchers have sought to reduce the misestimation of the parameters [39], as well as to automatically determine the operational parameters based on geology [40] and to avoid trajectory deviations [41]. The prediction of the penetration rate is generally performed with respect to the geological and geotechnical site conditions [42], by using, e.g., the cutter rotation speed, torque, and thrust force as input parameters [43], as well as the rock mass parameters and hydrogeological survey data [44,45]. Data filtering proved to improve the accuracy of the predictions [46]. Few authors estimated the face support pressure of TBMs with machine learning [24,47].
The previous studies developed models that are trained after tunnel construction and thus pertain to the domain of supervised learning, the machine learning paradigm where the predictions are based on labelled datasets [48]. This study proposes a method to determine the face support pressure of TBMs with reinforcement learning. In the field of artificial intelligence, reinforcement learning is a paradigm for addressing control problems where the actions taken by the algorithm are based on prior decisions. Reinforcement learning algorithms are taught by providing incentives to reach a goal. Although the study of reinforcement learning is still in its early stages, there have been some remarkable advancements, particularly those demonstrated by Google's DeepMind research team [49-53], including the ability to excel at playing Atari video games and board games, such as chess and Go. The adoption of reinforcement learning from academia to industry is increasing. Notable examples include scheduling dynamic job shops [54], optimising memory control [55,56], personalising web services [57,58], autonomous vehicles [59], algorithmic trading [60], natural language processing [61] and healthcare applications such as dynamic treatment plans and automated medical diagnoses [62]. Industrial applications of reinforcement learning include Google Translate, Apple's Siri, and Bing's Voice Search [63].
Unfortunately, only a few such studies are found in geotechnical engineering, especially in tunnelling. In particular, Erharter and Marcher presented a framework for the application of reinforcement learning to NATM tunnelling in rock [64] and to the TBM disc cutter replacement [30]. Zhang et al. [65] employed reinforcement learning to predict tunnelling-induced ground response in real time.
In this study, the capability of the reinforcement learning algorithm to choose the best sequence of face support pressures is investigated by adapting the algorithm used by the DeepMind research group [51] and testing it on random geologies. The novelty of our method resides in the reinforcement learning approach, where the machine has no previous knowledge of the environment which it explores and is educated through the rewards defined by the user, as well as in the simulation of the environment with the finite difference method (FDM). This study shows that our model is capable of optimising the face support pressure, provided that a sufficient number of episodes are played.
The proposed method is outlined in the next section. The method is tested in environments of growing complexity, from analytical calculations to numerical analysis (Section 3). Its performance, limitations, and possible improvements are discussed in Section 4. Finally, Section 5 concludes the paper.

Methods
In this section, the reinforcement learning algorithm is described. It is implemented in Python [66]. The algorithm is tested against analytical calculations (Section 2.1) and numerical analysis (Section 3.4).
One of three basic machine learning paradigms alongside supervised and unsupervised learning, reinforcement learning involves learning optimal actions that maximise a numerical reward [67]. In other words, reinforcement learning deals with the way (the policy) an intelligent agent (an algorithm) takes the actions that maximise a user-defined reward in a particular setting (the state of the environment).
The policy defines how the algorithm behaves in a certain situation. More precisely, it connects the states of the environment to the actions to be taken. As such, the policy is the core of a reinforcement learning agent, given that it determines its behaviour [67]. The well-established "epsilon greedy strategy", one of the oldest policies [67], is selected in this study. An action is considered "greedy" if it is expected to return the maximum reward for a given state. However, since the environment is unknown a priori, an initial exploration of the environment is necessary to determine these actions. This exploration begins with the first TBM excavation where the face support pressure is randomly chosen at every round. The randomness decreases after every episode as the environmental knowledge is exploited.
At each excavation round i, the agent chooses a random action n_a with probability ε and the action associated with the highest expected reward Q_max(S_i, a) with probability 1 − ε. ε is initialised at 1, which corresponds to a completely random selection, and is decremented by 1/N after each episode until it reaches 0, where N is the total number of excavations (the episodes). In mathematical terms, let r ∈ (0, 1) be a random real number and ε = 1 − j/N at episode j; the agent takes the action A_i at the state S_i according to Equation (1), where n_a ∈ {0, ..., 4} is a random integer among the action indices.
$$A_i = \begin{cases} n_a & \text{if } r < \varepsilon \\ \arg\max_a Q(S_i, a) & \text{otherwise} \end{cases} \qquad (1)$$
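The epsilon-greedy rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `epsilon_at` and `select_action` are hypothetical, and the Q-values are passed in as a plain list of five numbers.

```python
import random


def epsilon_at(episode: int, n_episodes: int) -> float:
    """Linear decay: 1 at episode 0, reduced by 1/N per episode, floored at 0."""
    return max(0.0, 1.0 - episode / n_episodes)


def select_action(q_values, epsilon, rng=random):
    """Return a random action index with probability epsilon, else the greedy one.

    q_values: expected rewards, one per action (hypothetical plain list).
    """
    if rng.random() < epsilon:
        # Exploration: any action index is equally likely.
        return rng.randrange(len(q_values))
    # Exploitation: index of the action with the highest expected reward.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon = 0` the choice is purely greedy, which is the setting later used when the trained agent is applied to new geologies without further exploration.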
Hence, the random choice of the face support pressure is initially the dominant pattern that is slowly, but steadily, abandoned over time as the agent gains some experience of the environment in which it operates. In other words, the face support pressure p_f becomes more of a "conscious" choice based on the rewards collected and represented by the value function Q(s, a), which returns the reward expected for action a in the state s.
The rewards are user-defined and determined by the support pressure, the excavation rounds, and the surface settlement. They reflect the definition of efficient construction: a safe process (consisting of the minimisation of surface settlement) executed with the least possible effort (determined by the lowest possible support pressure). The actual reward values are irrelevant, as long as the algorithm can maximise the expected reward, given the input data. The relative weight of the rewards, however, has an impact on the results [68].
A +1 reward is collected at each excavation round. The lower the face support pressure applied by the TBM, the lower the building effort. Hence, a reward corresponding to −p_f/(200 kPa) is assigned to every action. The surface settlement at every round causes a −1 reward for each mm of additional settlement. A −100 reward is assigned if the surface settlement is larger than 10 cm (in the analytical environment) or if the calculations diverge (in the finite difference environment). These outcomes terminate the episode. The premature episode termination is called "game over" in reinforcement learning parlance. A +100 reward is assigned at excavation completion. Each completed or terminated tunnel excavation defines an episode. The rewards are listed in Table 1.
The "playground" of the agent is named environment [69]. The actions are chosen and the rewards are received by the agent in the environment. In the present case, two classes of environments are created. First, an environment consisting of simple analytical calculations of the required face support pressure and expected settlements is considered (Section 2.1). The second environment is simulated numerically via the FDM as described in Section 3.4.
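The reward scheme can be written as a single function. The sketch below assumes that the individual rewards are simply added within one excavation round; the text does not state how they are combined, and the name `step_reward` and its arguments are hypothetical.

```python
def step_reward(p_f_kpa, settlement_increase_mm, failed=False, completed=False):
    """Reward for one excavation round (additive combination is an assumption)."""
    reward = 1.0                        # +1 for every excavation round
    reward -= p_f_kpa / 200.0           # effort penalty: -p_f / (200 kPa)
    reward -= settlement_increase_mm    # -1 per mm of additional settlement
    if failed:
        reward -= 100.0                 # "game over": excessive settlement or divergence
    if completed:
        reward += 100.0                 # tunnel excavation completed
    return reward
```

For example, a round excavated at 100 kPa without additional settlement yields 1 − 0.5 = 0.5, whereas the same round at 200 kPa with 2 mm of settlement yields 1 − 1 − 2 = −2.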

Analytical Training Environment
In the following, the initial training environment of the algorithm is described. For a tunnel with diameter D = 10 m, a random 2000 m long geological profile is generated (Figure 1). The pseudorandom number generator is set with a fixed seed to enable reproducibility of results. The 2000 m length corresponds to the break-even point for the choice of mechanised over NATM tunnelling [70]. At first, this geology is kept constant and is thus the same for all 100 training episodes. In the second instance, different random geological profiles are created at each episode (Section 3.2). The soil cover C is randomly initialised in the interval (0.5, 3)D. The soil's cover-to-diameter ratio is capped at C/D = 3. At every 2 m (the excavation step or round length), a random slope is selected from the interval (−1, +1), corresponding to the interval of slope angles (−45°, +45°). The soil unit weight γ, friction angle ϕ, cohesion c, and Young's modulus E are randomly initialised in the intervals shown in Table 2. The soil property values slightly change from their initial values at every excavation step. This means that, e.g., the unit weight at step i + 1 is calculated from the unit weight at step i as γ_{i+1} = γ_i (1 + r_v), where r_v is a random number in the interval (−1.25%, +1.25%). The soil properties are re-initialised with a 1% probability at every 2 m to simulate soil stratification.
As the agent navigates through the states of these environments, it gathers rewards and records them in the value function Q(s, a). The state representation is described in the next section.
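The random walk of a soil property along the alignment can be sketched as follows for the unit weight. This is an illustration under stated assumptions: the interval `gamma_range` is a placeholder (the actual bounds are those of Table 2), the function name `generate_profile` is hypothetical, and the values are not clipped to the interval after drifting, which the text does not specify.

```python
import random


def generate_profile(length_m=2000.0, step_m=2.0, seed=0,
                     gamma_range=(16.0, 22.0),  # placeholder bounds, kN/m^3
                     drift=0.0125, p_reset=0.01):
    """Unit weight at every excavation step of a random geological profile."""
    rng = random.Random(seed)           # fixed seed for reproducible geology
    n_steps = int(length_m / step_m)
    gamma = rng.uniform(*gamma_range)   # random initial value
    profile = []
    for _ in range(n_steps):
        profile.append(gamma)
        if rng.random() < p_reset:
            # New stratum: re-initialise the property (1% probability per step).
            gamma = rng.uniform(*gamma_range)
        else:
            # Gradual variation: gamma_{i+1} = gamma_i * (1 + r_v).
            gamma *= 1.0 + rng.uniform(-drift, drift)
    return profile
```

The same pattern would apply to the friction angle, cohesion, and Young's modulus, each with its own interval.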

State Representation
The state represents the crucial and pertinent information needed to take an action. It is not the actual physical state of the environment, but rather a representation of the information used by the algorithm to make a decision [71].
Since there are no rules to determine the state variables, domain knowledge, i.e., "the knowledge about the environment in which the data is processed" [72], must be used. According to the literature [2,6,9,16,73], face stability primarily depends on the soil unit weight, cohesion, friction angle, and the depth of the overburden. Furthermore, soil settlement depends on the soil Young's modulus E and the stress release [74]. Hence, the soil properties γ, c, ϕ, E directly ahead of the tunnel face and the overburden C are normalised by dividing them by their maxima γ_max, c_max, ϕ_max, E_max, C_max (Table 2) and selected as the state variables. The stress release is determined by the choice of the support pressure.
Generally, TBMs cannot estimate the soil properties ahead of the tunnel face [75]. However, borehole samples are retrieved prior to soil excavation and analysed in the laboratory to obtain the material properties for the engineering design. Unfortunately, due to the soil heterogeneity, the property values cannot perfectly match reality but are rather mean values that can be nonetheless used as a first approximation.
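The normalisation of the state variables can be illustrated with a short sketch. The maxima below are placeholders except for C_max = 30 m, which follows from C/D ≤ 3 with D = 10 m; the actual values are those of Table 2. The names `Soil`, `MAXIMA`, and `state_vector` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Soil:
    gamma: float  # unit weight, kN/m^3
    c: float      # cohesion, kPa
    phi: float    # friction angle, degrees
    E: float      # Young's modulus, MPa
    C: float      # overburden, m


# Placeholder maxima (assumed); C = 30 m corresponds to C/D = 3 with D = 10 m.
MAXIMA = Soil(gamma=22.0, c=20.0, phi=35.0, E=100.0, C=30.0)


def state_vector(soil, maxima=MAXIMA):
    """Five state variables, each scaled by its maximum."""
    return [soil.gamma / maxima.gamma, soil.c / maxima.c,
            soil.phi / maxima.phi, soil.E / maxima.E, soil.C / maxima.C]
```

Scaling every input to a comparable range of order one is a common preprocessing step for neural networks.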

Face Support Pressure and Settlement
TBMs provide face support pressures up to approximately 200 to 300 kPa in soft soils [76]. At every excavation step, the proposed model searches for the optimal support pressure within the interval (50, 250) kPa. According to the guidelines [6], based on a limit equilibrium approach for drained conditions, the required support pressure p_f,req is calculated with Equation (2), where A = πD²/4 is the cross-sectional area of the tunnel, G is the self-weight of the sliding wedge, P_V is the vertical load from the soil prism, and T is the shear force on the vertical slip surface (Figure 2). The critical value of the sliding angle ϑ that maximises p_f,req is searched iteratively with the Python package scipy [77]. Note that the guidelines differentiate between c_1 and ϕ_1 above the tunnel and c_2 and ϕ_2 at the level of the tunnel. In this simulation, however, no such distinction is made.
Figure 2. Tunnel face failure mechanism and forces acting on the sliding wedge according to [6].
The proposed model estimates the support pressure p_f at every step to maximise the expected outcome according to the following criteria and compares it with the required pressure p_f,req. In this simplified environment, surface settlement occurs if p_f < p_f,req, and the stress release λ is calculated from the ratio of the provided to the required tunnel face support pressure as λ = 1 − p_f/p_f,req. The corresponding soil settlement u is then calculated with Equation (3) [74].
In reality, the settlement occurs not only due to the stress release at the tunnel face but also due to other factors, such as the overcutting and ring gap [74]. Furthermore, an experience factor K < 1, depending on ground stress and conditions and on tunnel geometry, is generally considered in Equation (3).
Since the model has no prior information about the required support pressure p_f,req, the support pressure is chosen randomly during the first episode. Then, the model learns the optimal support pressure based on the rewards collected, as explained in the next section.
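The stress-release rule of the simplified environment can be sketched as below. Setting λ to zero whenever the provided pressure reaches the required one is an interpretation of "settlement occurs if p_f < p_f,req"; the function name `stress_release` is hypothetical, and the settlement formula itself is not reproduced here.

```python
def stress_release(p_f, p_f_req):
    """Stress release lambda = 1 - p_f / p_f_req, zero if the face is fully supported."""
    if p_f >= p_f_req:
        return 0.0      # sufficient support: no stress release, no settlement
    return 1.0 - p_f / p_f_req
```

For instance, providing 150 kPa where 200 kPa is required gives λ = 0.25.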

Deep Q-Network
In this section, the algorithm to maximise the rewards collected during the TBM excavation is elucidated. Since the material properties have continuous values, the number of possible states as described in Section 2.2 is infinite. The model lacks complete knowledge of the expected rewards of actions at each state. Therefore, the knowledge of the value function is incomplete. To address this issue, the deep Q-network (DQN) is used.
DQN is a deep neural network that approximates the value function Q(s, a) [78]. This algorithm, developed by the Google research group DeepMind, was able to play six Atari games at a record level [49,51-53,79]. DQN is a specific approach to Q-learning, a method of learning optimal actions by predicting the expected reward associated with the state-action pairs, comparing the prediction to observed rewards, and updating the algorithm's parameters to improve future predictions. Formally, Q-learning algorithms are described by the following equation:
$$Q(S_i, A_i) \leftarrow Q(S_i, A_i) + \lambda_r \left[ R_{i+1} + \Gamma \max_a Q(S_{i+1}, a) - Q(S_i, A_i) \right] \qquad (4)$$
where Q(S_i, A_i) is the reward associated with the state S_i and action A_i, λ_r is the learning rate, R_{i+1} is the reward collected in the state S_{i+1}, Γ is the discount factor, and max_a Q(S_{i+1}, a) is the maximum reward in state S_{i+1}.
The DQN is trained to choose the best tunnel face support pressure based on state variables. It takes the state vector S_i as input and outputs the expected rewards for each action. After the agent takes an action A_i, the observed reward R_{i+1} is used to update the algorithm and improve future predictions. The algorithm is re-run using S_{i+1} as its input and returns the action with the highest value Q_max(S_{i+1}, a) of all the actions a. Afterwards, the learning algorithm is adjusted based on the actual reward. This is achieved by minimising the mean squared error between the predicted value Q(S_i, A_i) and the target value R_{i+1} + Γ max_a Q(S_{i+1}, a). In Equation (4), λ_r and Γ are the hyperparameters that influence the algorithm learning process, namely the learning rate and the discount factor. Small updates are made by the algorithm at each step with a low value of λ_r and vice versa. The discount factor determines the extent to which the agent considers future rewards when making a decision. The higher Γ is, the more future rewards are taken into consideration.
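The Q-learning update can be made concrete with a scalar sketch. It shows the standard update for a single state-action pair, with the next-state Q-values given as a plain list; in the DQN the same target is used as the regression target of the neural network rather than being applied to a table. The function name `q_update` is hypothetical.

```python
def q_update(q_sa, reward, next_q_values, lr, gamma, done=False):
    """One Q-learning step for a single state-action pair.

    q_sa: current estimate Q(S_i, A_i)
    reward: R_{i+1}
    next_q_values: Q(S_{i+1}, a) for all actions a
    lr, gamma: learning rate and discount factor
    done: True if S_{i+1} is terminal (no future reward to bootstrap from)
    """
    target = reward if done else reward + gamma * max(next_q_values)
    return q_sa + lr * (target - q_sa)
```

With `q_sa = 0`, `reward = 1`, `next_q_values = [0, 2, 1]`, `lr = 0.5`, and `gamma = 0.5`, the target is 1 + 0.5 · 2 = 2 and the updated value is 0 + 0.5 · (2 − 0) = 1.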
The deep neural network architecture shown in Figure 3 is implemented with the PyTorch library [80] of the Facebook AI Research lab using the higher-level interface nn, based on [81]. The input layer is the state vector whose elements are γ_i/γ_max, c_i/c_max, ϕ_i/ϕ_max, E_i/E_max and C_i/C_max. The support pressure p_f is expressed in increments of 50 kPa in the range from 50 to 250 kPa. This results in an output layer with five elements, representing the five possible support pressures (50, 100, 150, 200 and 250 kPa) corresponding to the pressures typically provided by TBMs. The first and second hidden layers contain 50 neurons each. Although there are no strict rules on the optimal architecture of neural networks, a high number of neurons in each layer does not typically impair the performance, albeit requiring more computation. Moreover, the performance is generally better if the same number of neurons in all layers is used and if this number is higher than the size of the input layer [82]. The design of the neural network in this study was chosen based on these findings and on the previous applications to slope stability and tunnelling [25,83].
However, DQNs are known to suffer from training instability [49]. To address this, two common techniques, experience replay [84] and target memory [49], are used to stabilise the network as described in the following sections.
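The link between the five network outputs and the discrete support pressures can be sketched as a simple lookup. The network itself is not reproduced here; `q_values` stands for its five outputs, and the names `PRESSURES_KPA` and `pressure_from_q` are hypothetical.

```python
# Five discrete support pressures in kPa, one per output neuron.
PRESSURES_KPA = tuple(range(50, 251, 50))


def pressure_from_q(q_values):
    """Support pressure whose output neuron reports the highest expected reward."""
    best = max(range(len(PRESSURES_KPA)), key=lambda a: q_values[a])
    return PRESSURES_KPA[best]
```

Treating the pressure as a choice among five classes, rather than as a continuous output, is what allows the Q-learning formulation with one expected reward per action.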

Experience Replay
Due to the varying soil properties, the algorithm cannot rely on memorising a set sequence of support pressures p_f. It must determine the maximum support pressure that minimises the surface settlement, regardless of the geological conditions. However, reinforcement learning is a process that involves experimentation and can result in unstable solutions. This instability is particularly evident when rewards are "sparse", causing slow learning and leading to what is known as "catastrophic forgetting" [85]. This problem occurs in online training when backpropagation is performed after each state and is a common issue with gradient descent-based training methods. The root cause of catastrophic forgetting is the conflict between very similar state-action pairs, which hinders the ability to train the algorithm.
To improve the stability of the learning process in reinforcement learning, experience replay can be used. Experience replay accelerates the learning process by adding batch updating to online learning. The algorithm stores each experience, including the state, action taken, the resulting new state, and the reward in the list (s, a, s_{i+1}, r_{i+1}), until it reaches the specified "memory size" (s_m). Then, a random subset of the stored experiences, with a defined batch size (s_b), is selected for training. The algorithm calculates the value updates for each subset and stores them in target Y and state arrays X, which are then used as a mini-batch for training. Finally, the experience replay memory is overwritten with new values when it is full. Target memory is a further improvement of this technique.
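The experience replay memory can be sketched with a bounded queue. This is a minimal illustration, assuming that the oldest experience is the one overwritten when the memory is full (the text only states that the memory is overwritten); the class name `ReplayMemory` and its methods are hypothetical.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of (state, action, next_state, reward) experiences."""

    def __init__(self, memory_size, batch_size, seed=None):
        # A deque with maxlen silently drops the oldest entry when full.
        self.buffer = deque(maxlen=memory_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def store(self, state, action, next_state, reward):
        self.buffer.append((state, action, next_state, reward))

    def is_full(self):
        """Training starts once the memory size s_m has been reached."""
        return len(self.buffer) == self.buffer.maxlen

    def sample(self):
        """Random mini-batch of s_b experiences, drawn without replacement."""
        return self.rng.sample(list(self.buffer), self.batch_size)
```

Sampling at random breaks the ordering of consecutive excavation rounds, so that each mini-batch mixes experiences from different ground conditions.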

Target Memory
The DQN instability is also caused by updating the parameters after each action, leading to a correlation between subsequent observations. To overcome this issue, the value function Q(s, a) is updated only after a set number of episodes, rather than after each action, as suggested by [49].
In this scenario, the rewards are "sparse" and mostly higher at the end of each episode than after individual actions. To handle this, a duplicate of the Q-network, referred to as the target Q-network, is created with its parameters lagging behind the original Q-network as described in [49]. The target memory is implemented by initialising the parameters θ_Q for the Q-network. The target Q-network, a copy of the Q-network, is created with distinct parameters θ_T that are at first equal to θ_Q. Action a is selected with the Q-values of the Q-network by employing the epsilon-greedy strategy. The reward r_{i+1} and new state s_{i+1} are observed. The target Q-value is set to r_{i+1} at the end of the episode or to r_{i+1} + Γ • Q_max(S_{i+1}) otherwise, where Q_max(S_{i+1}) is the maximum Q-value returned by the target Q-network. The error is not back-propagated through the target Q-network, but through the Q-network. The parameters θ_T are updated to θ_Q after a certain number of iterations, called the synchronisation frequency (f). Up to this point, all the features of the algorithm have been presented. The complete workflow of the algorithm is depicted in Figure 4.
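The two ingredients of the target memory, the target value and the periodic synchronisation, can be sketched without a neural network library. The sketch assumes that the bootstrap term is weighted by the discount factor Γ, as in the Q-learning update; parameters are represented by a plain dictionary, and the names `td_target` and `maybe_sync` are hypothetical.

```python
import copy


def td_target(reward, next_q_target, gamma, done):
    """Training target: reward only at episode end, else reward + gamma * max Q_target."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def maybe_sync(step, sync_frequency, online_params, target_params):
    """Return the parameters the target network should hold after this step."""
    if step % sync_frequency == 0:
        # Copy, so later updates of the online network do not leak into the target.
        return copy.deepcopy(online_params)
    return target_params
```

Between two synchronisations the targets are computed with frozen parameters, which keeps them from shifting with every gradient step of the online network.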

Results of the Analytical Environment
In this section, the results of the analytical environment are presented. The cumulative reward collected at every episode is shown in Figure 5a. The orange line represents the moving average of 10 episodes and the orange band is the range of ±1 standard deviation. Since the algorithm is trained on the geology of Figure 1 only, the reward increases steadily, barely oscillating around the moving average. In Figure 5b, the cumulative rewards obtained from the 1st, 25th, 50th and 100th episodes are highlighted. As expected, the reward increases more rapidly for the last episodes. Moreover, approximately at the "chainages" x = 200 and 700 m, slight drops in the cumulative reward occur due to the abrupt changes in geology (Figure 1). The support pressures and the resulting settlement in episodes 1, 50, and 100 are shown on the left and right sides of Figure 6, respectively. As the support pressure is randomly chosen at the first episode, the support pressure in Figure 6a fluctuates strongly between excavation rounds. As the agent learns the optimal support pressures, this behaviour is mitigated and the support pressure becomes more stable (Figure 6c). Figure 6c also shows that the algorithm is able to choose the support pressure in such a way that it almost envelops the required pressure. Furthermore, even when p_f < p_f,req, the resulting settlement is negligible (Figure 6f). The model sensitivity to changes in the values of the hyperparameters is studied in the next section.
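The moving average and the ±1 standard deviation band used to summarise the training curves can be computed as follows. The sketch assumes a trailing window and the population standard deviation; neither choice is specified in the text, and the name `moving_stats` is hypothetical.

```python
from statistics import mean, pstdev


def moving_stats(rewards, window=10):
    """Trailing mean and +/- one standard deviation band for each episode."""
    stats = []
    for i in range(len(rewards)):
        # Up to `window` most recent episodes, fewer at the start of training.
        chunk = rewards[max(0, i - window + 1): i + 1]
        m = mean(chunk)
        s = pstdev(chunk)
        stats.append((m, m - s, m + s))
    return stats
```

Each tuple holds the moving average followed by the lower and upper edges of the band.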

Sensitivity Analysis
Contrary to the parameters of the deep neural networks depicted in Figure 3 that are learned by the algorithm, hyperparameters are set by the user. In Section 2, a number of hyperparameters have been introduced, such as the discount factor Γ, the learning rate λ_r, the synchronisation frequency f, the memory size s_m, and the batch size s_b.
The model sensitivity to these hyperparameters is summarised in Table 3. The analysis is performed starting from Γ = 0.15, λ_r = 10⁻³, f = 5, s_m = 10, s_b = 2, and changing one hyperparameter at a time to lower and higher values. In doing so, the original combination of hyperparameters is shown to be the optimal one. The results are not very sensitive to the discount factor, and Γ = 0.15 returns the highest cumulative reward. A low value of Γ implies that the action taken at one state has little impact on the actions taken in the following states. This seems plausible for tunnelling, where the settlement caused by an excessively low support pressure cannot be offset by a later support pressure increase. The optimal learning rate is a matter of numerical stability. In this study, λ_r = 0.001 appears to be the optimal value and, if the learning rate is set at λ_r = 0.01, the cumulative reward cannot be properly maximised. The cumulative reward is not very sensitive to the synchronisation frequency, where the optimal value is 5 episodes. The memory size has a more pronounced effect and the optimal value is 10. Finally, the batch size of 2 episodes provides the best results.
In summary, the proposed model is not very sensitive to the discount factor and the synchronisation frequency, but it is strongly affected by the learning rate and the memory size in this particular problem. After the hyperparameters are optimised, the algorithm is tested against random geologies in the next section.

Random Geologies
Prompted by the initial success in optimising the cumulative reward with constant geology, the algorithm is further tested against random geologies. Hence, the depth of the overburden and the material properties are no longer fixed as depicted in Figure 1, but change at every episode. Figure 7 shows, by way of example, the profiles of the cohesion values at episodes 2, 3, 4, and 5. Clearly, since the environment changes at every episode, the cumulative reward shown in Figure 8a does not increase steadily as in Figure 5a, but swings markedly. The model is able to forecast the support pressure increases needed to accommodate larger p_f,req in different settings (Figure 9). Furthermore, even when the support pressure provided falls below the required one, the resulting settlement is lower than about 4 mm.
The previous results are obtained with no additional model training. This means that the initial value of ε is set to zero. By starting from different values of ε, it is shown in the following that no additional exploration is needed. Note that, strictly speaking, even if ε_0 = 0 the Q-value is slightly updated as rewards are collected along the way. The agent, however, would perform the same action, given a certain state, until its expected rewards fall below those of other actions. The results of this analysis are shown in Table 4, where the mean cumulative reward and standard deviation are listed for different ε_0. The highest mean cumulative reward and the lowest standard deviation are achieved for ε_0 = 0. Therefore, ε_0 = 0 is selected as the optimum value, meaning that no additional exploration is required to optimise the algorithm performance in the random analytical environment. The numerical environment is described in the next section.

Effect of the Number of Episodes
When more episodes are played, the environment is explored more thoroughly. Hence, it is expected that the maximum cumulative reward increases with the number of episodes. Figure 10 shows that the maximum cumulative reward grows asymptotically up to approximately 683 after 400 episodes. It also shows that the maximum cumulative reward obtained after 50 episodes already represents approximately 90% of the maximum reward achieved after 400 episodes.
Figure 10. Effect of the number of episodes on the maximum cumulative reward (axes: number of episodes played; maximum cumulative reward and relative performance).

Finite Difference Environment
Due to the simplified analytical formulations, the previous environment is not very realistic. It is, however, useful to demonstrate the capability of the model and to optimise its hyperparameters. Tunnels are often simulated with numerical analysis [86-89]. Hence, a more realistic finite difference environment is outlined in the following based on the finite-difference program FLAC 3D [90].
The mesh grid is 100 m wide, between 27 and 44 m high, and 100 m long. The tunnel diameter is 10 m and the distance between the tunnel axis and the bottom is 15 m. The soil surface slope changes at every 10 m of projected distance. The grid consists of 16,932 grid points and 15,300 elements with target dimensions of 2 × 2 × 1.5 m (Figure 11). The linear elastic material law with Mohr-Coulomb failure criterion is considered. The soil properties are assigned within the intervals of Table 2 to randomly inclined layers. The property values are listed in Table 5. The displacements are fixed in the horizontal direction at the vertical boundaries and in the vertical direction at the bottom. The excavation is performed by removing the elements corresponding to the excavated soil. The support provided by the tunnel lining is simulated by fixing the displacement at the excavation boundaries. The support pressure is provided by applying a linearly increasing external pressure onto the tunnel face (Figure 12). This pressure is equal to 50, 100, 150, 200, or 250 kPa at the tunnel axis and increases linearly with depth according to the unit weight of the support medium γ_sm = 12 kN/m³. The support medium consists of excavated soil and additives for EPB machines or slurry suspension for SPB and mix-shield machines [91]. The surface settlement is measured at the grid points situated on the surface every 10 m.
Table 5. Soil property values assigned to the soil layers 1 (yellow), 2 (green), and 3 (orange) of Figure 11.
In the analytical environment, the surface settlement is calculated at every excavation round. In reality, the cumulative surface settlement depends on the previous excavation steps. Hence, the settlement increase is considered in the finite difference environment to account for the effect of the tunnel face support pressure at each step. The settlement reward for state t is, thus, expressed as the maximum settlement difference between the excavation steps max(∆u_t,j) of all of the j-th measuring points along the soil surface.
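The settlement reward of the finite difference environment, based on the largest settlement increase among the surface measuring points, can be sketched as follows. The sketch assumes settlements expressed in mm and counted positive downwards, a penalty of −1 per mm as in the analytical environment, and no reward for heave; the name `settlement_reward` is hypothetical.

```python
def settlement_reward(prev_settlement_mm, curr_settlement_mm):
    """Reward from the largest settlement increase over all measuring points.

    Both arguments list the settlement at the same surface points,
    before and after the current excavation step.
    """
    increase = max(c - p for p, c in zip(prev_settlement_mm, curr_settlement_mm))
    # Only additional settlement is penalised; heave is not rewarded (assumption).
    return -max(increase, 0.0)
```

Using the increase rather than the total settlement attributes each penalty to the support pressure chosen in the current step only.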
Finally, the game-over condition is not triggered by excessively large settlements, such as in the analytical environment, but by the divergence of the numerical solution.
Within the framework of transfer learning, the application of a machine learning model to a different but related problem [92], it is worth studying if and to what extent the algorithm used for the analytical environment can be deployed on the finite difference environment. A cumulative reward of 64.5 is obtained in this environment if the support pressure is predicted with the algorithm trained in the analytical environment. For comparison, the model is retrained with ε_0 = 1 in the finite difference environment. As it takes approximately 30 min to complete one episode, this training is computationally costly. Therefore, the model is trained only for 50 episodes, which corresponds to about 90% of the expected peak cumulative reward according to Figure 10.
The cumulative reward obtained at each episode is shown in Figure 13. For the first 20 episodes, the cumulative reward oscillates around approximately 65, corresponding to the reward obtained by using the model trained in the analytical environment. The cumulative reward oscillates around 75 between episodes 20 and 37 and stabilises at about 90 from episode 40.
(SLVRGHV 7RWDOUHZDUG 7RWDOUHZDUG 0RYLQJDYHUDJHDQGVWGHY Figure 13.Rewards of the finite difference environment.Cumulative reward vs. no. of episodes; moving average and interval of range ±one standard deviation for the last ten episodes (orange band).
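The moving average and the ±one standard deviation band of Figure 13 can be computed as in the following sketch. It assumes a trailing window of ten episodes and the population standard deviation; the function name `trailing_stats` and the reward values are hypothetical.

```python
import numpy as np


def trailing_stats(rewards, window=10):
    """Mean and standard deviation over the last `window` episodes.

    For the first episodes, fewer than `window` values are available and
    all the values seen so far are used.
    """
    rewards = np.asarray(rewards, dtype=float)
    means = np.empty(len(rewards))
    stds = np.empty(len(rewards))
    for i in range(len(rewards)):
        chunk = rewards[max(0, i - window + 1): i + 1]
        means[i] = chunk.mean()
        stds[i] = chunk.std()  # population standard deviation (assumption)
    return means, stds


# Hypothetical cumulative rewards of twelve episodes.
rewards = [60, 70, 65, 62, 68, 75, 80, 72, 78, 90, 91, 89]
means, stds = trailing_stats(rewards)
lower, upper = means - stds, means + stds  # limits of the shaded band
```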
Figure 14 shows the support pressures chosen along the chainage at each episode. This visualisation shows how the initial randomness begins to vanish at episode 20, where the first patterns start to emerge. In particular, the agent learns that the optimal support pressure between chainage 65 and 100 m is 250 kPa. It is also interesting that, between episodes 20 and 37, the agent preferably chooses a support pressure of 150 kPa between chainage 0 and 65 m. Starting from episode 37, it is evident that the optimal support pressure for the first 65 m is 100 kPa. This is consistent with Figure 13, where the cumulative reward starts increasing after episode 20 and increases again after episode 37.
Figure 15 shows the settlement after the last episode is completed. The surface settlement is up to 1 cm in the first 65 m of excavation and up to 2 cm in the last 35 m. Hence, the actions chosen by the agent keep the settlement within reasonable limits.
From the previous analysis, it seems that the model developed in the analytical environment can be transferred to the finite difference environment provided that the model is retrained. These results are discussed in the next section.

Discussion
The results show that our model can optimise the support pressure while keeping the surface settlement within a reasonable threshold in both the analytical and finite difference environments. Implementing model training in an analytical environment is relatively simple, and a large number of episodes can be completed quickly. Moreover, this class of environments allows for hyperparameter tuning. On the other hand, reinforcement learning training in the finite difference environment (or, more generically, in numerical environments) is rather costly; see also [25]. Therefore, transfer learning is employed for hyperparameter tuning. As shown in the previous section, the model architecture and hyperparameters can generally be transferred to the finite difference environment, on the condition that retraining is performed starting from ε_0 = 1.0.
Some limitations of this study can be highlighted, and a strategy to amend them is outlined in the following. Firstly, despite its use in engineering design, the finite difference environment cannot completely match reality. This is especially true in light of the simplifications considered in this study, such as the linear-elastic material law with Mohr-Coulomb failure criterion, the simulation of the tunnel lining as a zero displacement boundary condition, the absence of the ring gap and mortar, and the deterministic soil property values. These limitations can be overcome as follows:
1. The adoption of more advanced constitutive models, the simulation of the lining with shell elements, and the simulation of the ring gap and mortar [93,94]. It is perhaps worth noting that different types of segments (in terms of concrete class and reinforcement) and ring gap mortar pressures are chosen in practice. Hence, two additional agents could be implemented to predict the segment types and mortar pressures.
2. The consideration of the spatial variability of soil properties with random fields, by varying the soil properties according to certain statistical distributions and correlation lengths [95]. Since random fields further complicate the environment, more advanced reinforcement learning algorithms might be adopted, such as the 51-atom agent (C51) [96]. Moreover, the definition of the state variables can be improved, e.g., by considering the soil properties at more than one point at each epoch.
The results match the expectation that the agent can be trained to predict the tunnel face support pressure. However, it is striking that the agent does not appear to need any additional training when deployed on random geologies (Section 3.2). In principle, this feature could also be tested with the finite difference environment. However, hyperparameter tuning in this environment is still computationally costly.
The results show that the DQN algorithm can be successfully used to control the tunnel support pressure and adapt to changes in the soil properties, such as variations in unit weight, cohesion, friction angle, and Young's modulus. One added value of the DQN in this context is that it can be used to develop more efficient and effective control strategies for maintaining tunnel face stability compared to traditional methods. The DQN has the capability to generalise to new situations, which can be useful in the case of changes in soil properties or overburden height. Furthermore, the DQN algorithm allows for efficient use of the available data as it is not heavily dependent on its quality, which is a common problem with traditional methods.

Conclusions
In this study, the deep Q-network reinforcement learning algorithm was applied to control the tunnel face support pressure during excavation. The algorithm was tested against analytical as well as numerical environments. The analytical environment was used for hyperparameter tuning. The optimised model was used in the numerical environment.
It was found that:
1. The algorithm is capable of predicting the tunnel face support pressure that ensures stability and minimises settlements among a prescribed range of pressures. The algorithm can adapt to geological (soil properties) or geometrical (overburden) changes.
2. An analytical environment is used to optimise the algorithm. The optimal hyperparameters are found as Γ = 0.15 (discount factor), λ_r = 10⁻³ (learning rate), f = 5 (synchronisation frequency), s_m = 10 (memory size), and s_b = 2 (batch size). These hyperparameter values are also effective in the numerical environment.
3. Although the algorithm is trained in a static environment with constant geology, it is also effective with random geological settings. In particular, the algorithm trained with constant geology can be used for random geologies without retraining.
4. The maximum cumulative reward plateaus after 400 training episodes and about 90% of the peak performance is reached after 50 episodes.
5. The algorithm proves effective both in the analytical and in the more realistic numerical environment. Training is more computationally costly in the numerical environment. However, the hyperparameter values optimised in the analytical environment can be efficiently adopted.
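The hyperparameter values reported in item 2 can be collected in a small configuration object, as in the following sketch. The class name `DQNConfig`, its field names, and the dummy transitions are hypothetical; the unit of the synchronisation frequency (training steps) and the uniform sampling of the mini-batch are assumptions.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class DQNConfig:
    """Hyperparameter values reported in the conclusions."""
    discount: float = 0.15       # Γ, discount factor
    learning_rate: float = 1e-3  # λ_r, learning rate
    sync_every: int = 5          # f, synchronisation frequency (unit assumed: steps)
    memory_size: int = 10        # s_m, replay memory size
    batch_size: int = 2          # s_b, batch size


cfg = DQNConfig()
random.seed(0)

# Replay memory: the oldest transition is dropped once s_m entries are stored.
memory = deque(maxlen=cfg.memory_size)
sync_steps = []

for step in range(1, 26):
    # Hypothetical transition: (state, action, reward, next_state, done).
    memory.append((step - 1, step % 5, 1.0, step, False))
    if len(memory) >= cfg.batch_size:
        # Uniformly sampled mini-batch used for the network update.
        batch = random.sample(list(memory), cfg.batch_size)
    if step % cfg.sync_every == 0:
        # Here the target network would be synchronised with the online network.
        sync_steps.append(step)
```

With s_m = 10 and s_b = 2, each update draws on only the ten most recent transitions, which keeps the training focused on recent experience.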
Future research studies can consider more refined environments (in terms of constitutive models, simulation of the lining, ring gap, mortar, and random fields), provide more advanced state definitions (by considering the soil property values of various points), and use more refined reinforcement learning algorithms.
In spite of some limitations of this method, this study shows that the tunnel face support pressure can be estimated by an intelligent agent for design and possibly even during construction operations. In this regard, a roadmap for field validation can be envisioned. Firstly, the neural network can be pre-trained with monitoring and operational data. Secondly, a scaled model can be constructed and tests conducted either at 1 g or in a geotechnical centrifuge. Thirdly, a pilot project with a small TBM, such as those used in microtunnelling, can be carried out.

Figure 1. Random soil property values of the analytical environment with constant geology: (a) Unit soil weight. (b) Cohesion. (c) Friction angle. (d) Young's modulus.

Figure 3. Architecture of the Deep Neural Network used for the choice of the support pressure based on the expected rewards and given the state of the TBM in the environment.

In the next section, the results of the analytical environment are presented. The values of the hyperparameters λ_r, Γ, f, s_m, and s_b are chosen based on the sensitivity analysis of Section 3.1.

Figure 4. Workflow of the reinforcement learning algorithm. Flowchart steps recoverable from the figure: advance TBM (x = x + dx); move to next state (t = t + 1); game-over penalty (R_t = −100, done = True); update replay list; select a random subset of the list; recompute Q for the subset; backpropagate.
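The step "recompute Q for the subset" in the workflow of Figure 4 can be sketched as follows, assuming the standard deep Q-network target, in which the reward is augmented by the discounted maximum Q-value of the next state unless the episode has ended. The function name `td_targets` and the sample values are hypothetical; the discount factor of 0.15 and the game-over penalty of −100 are the values reported in this article.

```python
import numpy as np

GAMMA = 0.15                 # discount factor Γ
GAME_OVER_PENALTY = -100.0   # reward assigned when the game-over condition is met


def td_targets(rewards, next_q, done, gamma=GAMMA):
    """Q-learning targets for a sampled mini-batch.

    rewards: shape (batch,); next_q: shape (batch, n_actions), the Q-values
    of the next states; done: shape (batch,), True where the episode ended.
    """
    rewards = np.asarray(rewards, dtype=float)
    next_q = np.asarray(next_q, dtype=float)
    done = np.asarray(done, dtype=bool)
    bootstrap = gamma * next_q.max(axis=1)
    # Terminal transitions receive the reward only, without bootstrapping.
    return rewards + np.where(done, 0.0, bootstrap)


# Hypothetical mini-batch of two transitions (batch size s_b = 2).
rewards = [1.0, GAME_OVER_PENALTY]
next_q = [[0.0, 2.0, 4.0, 1.0, 3.0],
          [5.0, 5.0, 5.0, 5.0, 5.0]]
done = [False, True]
targets = td_targets(rewards, next_q, done)
```

The network weights would then be updated by backpropagating the error between these targets and the Q-values predicted for the actions actually taken.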

Figure 5. Rewards of the analytical environment with constant geology: (a) Cumulative reward vs. no. of episodes; moving average and interval of range ±one standard deviation of ten episodes (orange band). (b) Cumulative reward vs. excavation step for all episodes.

Figure 6. Required and provided support pressure and settlement in the analytical environment with constant geology at episodes (a,b) 1, (c,d) 50, and (e,f) 100.

Figure 8. Rewards of the random environment: (a) Cumulative reward vs. no. of episodes; moving average and interval of range ±one standard deviation of ten episodes (orange band). (b) Cumulative reward vs. excavation step for all episodes.

Figure 12. Detail of the linearly increasing support pressure at the tunnel face and the resulting horizontal displacement at a randomly selected excavation step.
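The linearly increasing face pressure of Figure 12 can be reproduced with a short calculation, assuming that the pressure at a given point of the face equals the pressure at the tunnel axis plus the unit weight of the support medium times the vertical distance below the axis. The function name `face_pressure` is hypothetical; the tunnel diameter (10 m) and γ_sm = 12 kN/m³ are taken from the description of the finite difference environment.

```python
GAMMA_SM = 12.0   # unit weight of the support medium (kN/m³)
DIAMETER = 10.0   # tunnel diameter (m)


def face_pressure(p_axis_kpa, offset_below_axis_m, gamma_sm=GAMMA_SM):
    """Support pressure (kPa) at a vertical offset below the tunnel axis.

    Negative offsets refer to points above the axis. The product
    kN/m³ × m gives kN/m², i.e., kPa.
    """
    return p_axis_kpa + gamma_sm * offset_below_axis_m


p_axis = 150.0                                  # one of the five admissible values
crown = face_pressure(p_axis, -DIAMETER / 2)    # top of the tunnel face
invert = face_pressure(p_axis, DIAMETER / 2)    # bottom of the tunnel face
```

For a pressure of 150 kPa at the axis, this gives 90 kPa at the crown and 210 kPa at the invert, i.e., a difference of 120 kPa over the 10 m face height.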

Figure 14. Support pressures chosen along the chainage at each episode.

Figure 15. Computed settlement at the last episode (in metres).

Table 1. Summary of the rewards associated with the outcomes of the actions.

Table 2. Range of the intervals of soil properties and their coefficient of variation for every 2 m excavation step.

Table 3. Results of the analysis of sensitivity to the hyperparameters.

Table 4. Mean rewards and standard deviation of the random environment for different values of ε_0.