Deep Reinforcement Learning Control of Cylinder Flow Using Rotary Oscillations at Low Reynolds Number

We apply deep reinforcement learning to active closed-loop control of a two-dimensional flow over a cylinder oscillating around its axis with a time-dependent angular velocity representing the only control parameter. Experimenting with the angular velocity, the neural network is able to devise a control strategy based on low frequency harmonic oscillations with some additional modulations to stabilize the Kármán vortex street at a low Reynolds number Re = 100. We examine the convergence issue for two reward functions showing that later epoch number does not always guarantee a better result. The performance of the controller provide the drag reduction of 14% or 16% depending on the employed reward function. The additional efforts are very low as the maximum amplitude of the angular velocity is equal to 8% of the incoming flow in the first case while the latter reward function returns an impressive 0.8% rotation amplitude which is comparable with the state-of-the-art adjoint optimization results. A detailed comparison with a flow controlled by harmonic oscillations with fixed amplitude and frequency is presented, highlighting the benefits of a feedback loop.


Introduction
A well-known Kármán vortex street is typically formed in the wake of the flow over a bluff body exerting an oscillating value of the force [1]. This unsteadiness may cause structural damages due to the coupling of the body vibrations and pressure fluctuations of the fluid. Over time, many flow control strategies have been proposed to influence and suppress these unwanted dynamical features [2]. The overall classification includes passive and active methods due to the possible energetic input to the flow [3].
Active methods represent open-and closed-loop control, depending on the presence of a feedback from sensors to actuators with a further update of a control signal. An appealing way to design new closed-loop control strategies is to rely on the so-called data-driven and learning-based methods, which lately receive well-deserved attention [4].
In fluid dynamics, machine learning techniques have been fruitfully applied to the issue of the turbulence closure modelling within Large-eddy simulations and Reynolds-averaged Navier-Stokes equations [5], estimating and reconstructing flow fields [6], recovering dynamical features [7] and control [8]. In particular, closed-loop flow control extensively benefits from the application of genetic algorithms to canonical turbulent flows, such as mixing layers, jets and wakes [9][10][11][12][13][14][15][16][17][18][19]. Contrary to the typical gradient-based optimization techniques, these methods introduce a population of the control laws which are selected step-by-step according to the target objective function. Even more promising is a combination of multilayer (deep) neural networks combined with the reinforcement learning (RL) strategy [20] resulting in a deep reinforcement learning (DRL) paradigm succeeding in a large number of multidisciplinary problems [21] as well as fluid dynamics in particular [22]. RL represents a self-learning strategy introducing an agent who interacts with the environment through particular actions in order to get a maximum reward. Recent examples mainly consider the manipulation of the flow over a cylinder using multiple synthetic jets and numerical simulations at low Reynolds numbers delivering robust DRL-based control strategies [23][24][25][26][27][28] as well as other applications [29][30][31][32].
In this work, we study a closed-loop control strategy for the flow over a cylinder rotating around its axis with the time-dependent angular velocity being the only control parameter. The direct numerical simulations (DNS) of the Navier-Stokes equations are employed coupled with the DRL method, which relies on the proximal policy optimization algorithm [33] maximizing the expected reward. The reward value is based on the lift and drag forces acting on the cylinder with the neural network employing the information on the pressure field in the wake region. The particular focus of the work is on the deviation of the control signal from the intuitive one typically considered as a harmonic oscillation.
The control method based on sinusoidal wall oscillations around the axis of the cylinder is known to dramatically suppress the drag coefficient up to 85% for a certain parameters of the amplitude and frequency for Re = 1.5 × 10 4 [34]. This experimental result has been qualitatively confirmed by a series of numerical simulations extending the study to even higher Re = 1.4 × 10 5 demonstrating that high-frequency and rather high-amplitude rotary oscillations lead to even larger decrease of the drag [35][36][37][38][39]. These experiments also showed that the method is energetically efficient only for high Reynolds numbers. However, at low Reynolds numbers, the drag reduction is not that impressive with a decrease between 30% and 60% [40][41][42][43][44][45]. One way to improve this performance and neutralize the effect of the fluid viscosity is to reduce the amplitude of oscillations also avoiding high-frequency rotary motion. This fact indicates that the harmonic control signal at low Re is far from optimal. To stabilize the vortex shedding, a closed-loop strategy has to be employed rather than a straightforward destruction of the wake region by a high-frequency harmonic motion. A feedback optimization has already been used for low Reynolds numbers, although constraining the control to sinusoidal-based forcing and its basic extensions [41,45,46]. The optimal control approach relying on the adjoint optimization and control law of the free waveform provided a reduction of up to 15% for Re = 150, obtaining the required low amplitude rotary motion [43]. The results depend on the time horizon of the optimization procedure. Recently, it has been shown that a significant increase of the time horizon leads to the drag reduction of 19% for Re = 100 together with a low-amplitude control law. Thus, the results of optimal control theory may serve as a verification point for the fully data-driven DRL method with a reduced complexity of implementation.

Problem Formulation and Computational Details
We consider a cylinder of the diameter D in a fluid cross-flow with a uniform incoming velocity U ∞ (see Figure 1). The considered Reynolds number Re = U ∞ D/ν = 100 representing a laminar flow regime with a Kármán vortex shedding, where ν is the kinematic viscosity. The applied control strategy is based on the rotation of the cylinder around its axis with the wall velocity U w (t) = U ∞ Ω(t). The primary goal is to find the optimal signal Ω(t) to influence drag and lift coefficients: where F x and F y are the drag and lift forces acting on the cylinder per unit length and ρ is the fluid density.

Flow Computations
To describe the flow field, we solve the non-dimensional incompressible Navier-Stokes equations: where u i and p are the velocity components and pressure field and all quantities are non-dimensionalized using U ∞ and D. We employ direct numerical simulations in a two-dimensional setup with the coordinate system located in the center of the cylinder and x, y representing the streamwise and vertical coordinates, respectively. The computations are performed using an open source unstructured finite-volume code T-Flows below referred to as the CFD solver [47,48]. The code features second-order accurate discretization in space and time. The SIMPLE algorithm is used to couple velocity and pressure fields. The dimensions of the computational domain representing a rectangle are L x × L y = 30D × 20D in the streamwise and vertical directions with the center of the cylinder placed 10D from the inlet boundary. The slip condition is set at the top and bottom boundaries and convective outflow condition is prescribed at the outlet, while control law representing the tangential velocity of the cylinder wall is described below. The mesh contains 15, 140 hexahedral cells and the computational timestep is ∆t CFD = 10 −2 . We validate the simulations of the stationary cylinder case. The typical quantities of interest are the time-averaged drag coefficient C D = 1.33 and vortex shedding frequency f vs = 0.17, which are in excellent agreement with available results [1].

Machine-Learning Architecture, Feedback Loop and Parallelization
Below, we describe a set of algorithms employed to obtain the angular velocity signal Ω(t) with the closed-loop controller synthesized by training a fully connected neural network (FNN). The optimal control strategy evaluation relies on the reinforcement learning (RL) approach [20] and a policy gradient (PG) algorithm [33] with a maximization of a defined reward function. A schematic view of optimization of the feedback control system is shown in Figure 2. In RL, the environment representing the CFD solver interacts with the agent which is the FNN controller in our case. The agent takes a new action based on the current state of the environment representing the data from 4 × 3 array of pressure sensors placed in the near wake beside the cylinder (see 'inputs' and the flow schematics in Figure 2). The FNN has two hidden layers with 64 neurons each and a single output for the mean of the policy PDF of the cylinder angular velocity during evaluation of the controller. The training employed two networks of the similar structure for policy and value prediction including a trainable dispersion coefficient of the policy PDF for a smooth transition from exploration to exploitation of the learned policy. The number of neurons was determined experimentally observing the learning speed and the mean reward after convergence with a fixed number of inputs and the reward function. The neurons in the hidden layers used the sigmoidal activation function while the activation of the output neuron was linear. The action defined the next value of Ω which was put forward to the CFD solver with a specific relaxation procedure in time. The action time step T ac was set small enough compared to the characteristic time scale T vs = 1/ f vs of the vortex shedding according to recommendations [24] (see Figure 3 for the hierarchy of timescales). We employed the value T ac ≈ 0.05T vs with T ac = 30∆t CFD = 0.3. After receiving a new target value of Ω, the angular velocity of the cylinder was updated from the old value to a new target one linearly in time within the whole interval T ac . We performed additional tests with the exponential relaxation scheme [24] as well as step-like change of Ω; however, a better behavior of the linear one for a test harmonic control law was observed. The parallelization strategy is twofold and represents a cornerstone of this research. The CFD code employs a standard MPI-based decomposition of the computational domain with each subdomain treated by a separate computational core. However, with the present low number of cells the speed-up is limited by the number around 10 cores due to the necessary frequent data exchange between neighboring domains sharing common cell faces. For RL algorithms, one needs a sufficient number of actions or control steps for the optimization scheme to succeed. Recently, a successful multi-environment approach [25] has been proposed where multiple independent CFD runs feed the data to one optimization algorithm (see Figure 3). Following this approach, we used three independent CFD runs in parallel with a typical wall time for training a controller using a blade server with two CPUs Intel(R) Xeon(R) CPU E5-2695 v2 @ 2.40GHz with 24 cores in total taking around three days with full CPU utilization. This parallelization was implemented through a vectorized environment feature provided in the OpenAI Baselines code [49] used in this study. As a parameter for the control optimization algorithm, we chose the learning rate constant of the value 3 × 10 −4 , clip parameter ε = 0.2, GAE parameters with discount rate γ = 0.99 and λ = 0.95, value coefficient c 1 = 0.5 and entropy coefficient c 2 = 0.01 [33]. The instantaneous reward value was parameterized in the following form: where the drag and lift coefficients were averaged over the action time step. For convenience, the constant R 1 was chosen so that the reward values per action were close to zero. The non-zero constant C 2 is expected to constrain the non-zero time-averaged lift, leading to the asymmetric flow regimes. With R 2 = 0, in the long run, the control law tended to return a rotating regime. We chose the final value as R 1 = 3 and tried two cases with R 2 = 0.1 (Case 1) and R 2 = 0.2 (Case 2) [24]. The code with the DRL agent adapted for the CFD solver and the corresponding configuration can be found in a public Github repository [50].

Results and Discussion
We performed the training process for 80 epochs to test the convergence issue where the epoch corresponds to the interval between control actions resulting in the overall time interval of nearly 40,000 time units in terms of D/U ∞ . Figure 4 shows a typical evolution of the reward value averaged over the action time step r ac reaching the saturation after around 50 epochs for Case 2 while still growing for Case 1. The entropy function is expected to characterize the convergence of the training process and decreases monotonically for both cases, indicating that the controller behaves more deterministic and less exploratory with increasing the number of epoch [33]. However, as for the on-policy algorithm a new training set is generated before each policy update using a recently obtained result, stabilization of the algorithm remains an issue [20]. Below, we demonstrate this inherent instability due to a small change of a reward function altering the convergence and leading to a situation with a better policy obtained in the middle of the training process. After accomplishing the training process, we evaluated the performance of the neural controllers. Table 1 and Figure 5 summarize the characteristics of the flow regimes with the applied DRL-based control scheme corresponding to the number of epochs during the training process depicted in Figure 4 by square points below referred to as 'cXeY', where 'X' stands for the case number (1 or 2) while 'Y' corresponds to the epoch number (37, 50 or 80). We address the issue of the drag reduction as well as the change of the root-mean-square of the lift coefficient in comparison with the stationary cylinder flow.
Note an opposite behavior of C D signal and other characteristics with the increase of the epoch number for Cases 1 and 2. While c1e80 features a decrease of C D with ∆C D = 13.9% compared to stationary cylinder regime and outperforms earlier epochs, c2e37 appears to be more optimal compared to later epochs with ∆C D = 16.1%. We introduce ∆Ω = Ω max − Ω min to evaluate tangential velocity amplitude with the value ∆Ω being averaged over a number of local maximum Ω max and minimum Ω min values, respectively. The same cases, c1e80 and c2e37, return the lowest amplitude within each run with ∆Ω/2 = 8.2% and impressive 0.8% of U ∞ , respectively. Coming back to the reward value parameterization as in Equation (3), we mention the possible asymmetric flow behavior due to DRL control. Indeed, Cases 1 and 2 (see Table 1) for different epoch numbers recover strategies displaying constant rotation of a fixed point on the surface of the cylinder θ(t) as αt combined with oscillatory perturbation θ (t), i.e., θ = αt + θ with the angle expressed in degrees, even with C L being present in the reward expression. Although α is nonzero for most of the regimes, it is relatively low for two out of three cases for each run with the maximum value α = −17.1 • and −20.6 • for c1e37 and c2e80, respectively. Below, we proceed with the analysis of c1e80 as the case where the outcome delivers an appealing slightly modulated harmonic control strategy while better performing c2e37 represents a mixture of a few Fourier modes with close frequency values and may be not very intuitive as an example.   Figure 6 shows the instantaneous streamwise velocity field around the cylinder for the case without control and corresponding to the DRL-based scheme for c1e80 case after a new steady state is reached with its evolution shown in Figure 7. As a result, the flow stabilizes close to the cylinder with a small rotational input trying to balance this inherent instability. The length of the recirculation bubble becomes larger with the local pressure minimum moving further downstream. This leads to a smaller pressure difference between the front and rear of the cylinder with a final decrease of C D . Figure 7 shows the evolution of several characteristics such as the angle θ, angular velocity Ω and drag and lift coefficients C D and C L for different flow regimes. The case of a stationary cylinder is demonstrated as a reference one with θ = Ω = 0. The drag coefficient exhibits sinusoidal behavior with a frequency 2 f vs around a time-averaged value C D = 1.33 while C L fluctuates with a natural frequency f vs = 0.17, reflecting the vortex shedding process, as mentioned above. The application of the DRL-scheme corresponding to c1e80 for control leads to a transient period of about 20 time units, resulting in a modulated harmonic signal of the angular velocity Ω. However, the angle θ also evolves harmonically in time with a linear trend of a relatively small slope α = 1.39 • (see Table 1), corresponding to around 14 • rotation within the period T 1 = 1/ f 1 ≈ 10 where f 1 ≈ 0.6 f vs represents the main peak of the Fourier spectrum of the C L signal. Note also secondary peaks at higher and lower frequencies giving room for modulations of Ω (see the peaks at ≈ 0.2, 0.4 and 0.8 f vs ). These modulations turn out to be essential for a drag reduction which is demonstrated in a straightforward manner. We apply a modulated harmonic forcing to the cylinder with where Ω 0 = 0.09, f 1 = 0.61 f vs , f 2 = 0.22 f vs well approximates the signal from a DRL-scheme. The outcome of this control strategy is a slight decrease of C D = 1.28 compared to the stationary case and is far from a DRL-scheme. Thus, the feedback loop is indeed gives a benefit for active flow control correctly responding to the instantaneous phase of the vortex shedding process by tuning the angular velocity of the cylinder to stabilize the wake and decrease the drag. See also the Supplementary Material Video S1 (also available at: https://youtu.be/9X8XtHk0R84). The array of 4 × 3 white points corresponds to pressure sensors serving as the input for the neural network (see Figure 2). As mentioned above, the DRL-scheme stabilizes the recirculation bubble suppressing the formation of the Kármán vortex street and leading to the elongated bubble behind the bluff body with the reduced value of C D . The mechanism behind the curtain represents the phasor control (see [51] for applications to wake flows), when the appropriate value of almost harmonic variation of angular velocity of the cylinder acts on the near-wall fluid. To get some additional insight into the control routine, Figure 8 shows the instantaneous streamwise velocity field corresponding to four time instants highlighted with triangular symbols in Figure 7 spanning half of the oscillation period of Ω. At t = 65.1, the rotation amplitude reaches its local maximum while rotating counterclockwise (see Figure 7). Previous time history shows that C D and C L are monotonically decreasing with time and reward function increases. DRL-scheme continues the counterclockwise rotation while also stimulating the growth of the attached eddy in the lower side of the cylinder. Reaching a certain size the eddy and its asymmetric location lead to the increase of |C L | as for t = 67.8, which negatively affects the reward function r. To counteract this trend, DRL-scheme switches the rotation direction producing new recirculation zone on the upper side of the cylinder at t = 69.9, leading to a more symmetric bubble reversing the growth of C L .

Conclusions
We applied deep reinforcement learning to active closed-loop control of a two-dimensional flow over a cylinder oscillating around its axis with a time-dependent angular velocity representing the only control parameter. Probing different values of the angular velocity, the neural network was able to create a control strategy based on low frequency harmonic oscillations with some additional modulations to stabilize the Kármán vortex street at a low Reynolds number Re = 100. We examined the convergence issue for two reward functions showing that later epoch number does not always guarantee a better result. The performance of the controller provide the drag reduction of 14% or 16% depending on the employed reward function comparable with a state-of-the-art control theory optimization routines based on adjoint methods [52]. The additional input of energy to rotate the cylinder was very low as the maximum amplitude of the angular velocity was equal to 8% of the incoming flow in the first case while the latter reward function returned an impressive 0.8% rotation amplitude. A detailed comparison with a flow controlled by harmonic oscillations with one frequency and a fixed amplitude is presented, highlighting the necessity of a feedback loop. Further work will be focused on extending the DLR-schemes to higher Reynolds numbers keeping a computational setup two-dimensional as well as going with a three-dimensional configurations for moderate Re.