Deep Reinforcement Learning for Stability Enhancement of a Variable Wind Speed DFIG System

Abstract: Low-frequency oscillations are a primary issue for integrating a renewable source into the grid. The objective of this study was to find sensitive parameters that cause low-frequency oscillations and to design a Twin Delayed Deep Deterministic Policy Gradient (TD3) agent controller to damp the oscillations without requiring an accurate system model. In this work, a Q-learning (QL)-based model-free wind speed DFIG controller was designed on the rotor-side converter (RSC), and a QL-based model-free DC-link voltage regulator was designed on the grid-side converter (GSC) to enhance the stability of the system. In the next step, the TD3 agent was trained to learn the system dynamics by replacing the inner current controllers of the RSC, which in turn replaced the QL-based model. In the first stage, conventional PSS and Proportional-Integral (PI) controllers were introduced on both the RSC and GSC. Then, the system was made model-free by replacing the PSS and the PI controller with a QL algorithm under very small wind speed variations. In the second stage, the QL algorithm was replaced with the TD3 agent by introducing large variations in wind speed. The results reveal that the TD3 agent can sustain the stability of the DFIG system under large variations in wind speed without assuming a detailed control structure beforehand, while QL-based controllers can stabilize the doubly fed induction generator (DFIG)-equipped wind energy conversion system (WECS) under small variations in wind speed.


Introduction
Based on the type of generator and the grid interface, wind energy systems can be categorized into four types [1]: (a) Squirrel-cage induction generator (SCIG) or fixed-speed system. SCIGs are mainly used for smaller wind turbines, as they are simple and economical compared to other generators. In a squirrel-cage induction generator, the rotor bars are permanently short-circuited; therefore, the rotor voltage is zero. The stator is connected to a soft starter and is connected to the grid through a transformer. A capacitor bank is employed to compensate for the reactive power, and a soft starter is employed to mitigate high starting currents and to produce a smooth grid connection [2]. (b) Wound-rotor induction generator (WRIG). In a WRIG, the stator is directly connected to the grid, and the wound rotor winding is connected to a variable resistor via slip rings. In a WRIG, it is possible for the rotor to have configurations such as slip power recovery, the use of cyclo-converters, and rotor resistance chopper control [3]. In both WRIGs and SCIGs, by controlling the rotor resistance, the slip of the machine can be varied by 2-10% [1], through which the generator output can be controlled. (c) The third type of generator used is the DFIG. The constructional features of DFIGs are like those of WRIGs, except that in a DFIG, the rotor winding of the WRIG is connected to the grid through an AC-DC-AC converter. (d) In a fully converted synchronous generator or fully converted squirrel-cage induction generator, the total power is interchanged between the wind system and the grid through a power electronic converter system, whereas in some systems, it can be transmitted to the grid directly [1]. Figure 1a,b show the configuration of a wind system with a synchronous generator and with a permanent-magnet synchronous generator (PMSG) [1].
It can be recognized that in Figure 1a, the synchronous generator is excited by using the power electronic converter externally, while in Figure 1b, it is excited by a permanent magnet, as the generator is a permanent-magnet synchronous generator. For large wind farms, both DFIGs and PMSGs are preferred due to increased power control.
There are many advantages and superior characteristics of using a PMSG machine over a DFIG. A PMSG machine has better performance, higher reliability, and wider speed control [4]. Due to the PMSG's better performance, much current research has shifted to topologies utilizing synchronous generators with permanent-magnet excitation. However, DFIGs are still dominant in modern wind power generation systems, as DFIGs can operate under variable speeds, regulate active and reactive power independently, and have a low converter cost [5].
The reason for the lower converter cost in DFIG wind turbines is that the power electronic devices are fed with power generated only by the rotor (25-30% of rated power); this lower converter rating offers significant cost savings over a PMSG [6].
The doubly fed induction generator (DFIG) model is considered one of the best solutions for wind energy conversion systems (WECSs). The two main reasons for using a DFIG in a WECS are its asynchronous characteristics and the flexibility of utilizing power electronic converters, which results in cost savings due to a lower converter rating. In DFIG-based grid-connected wind systems, the stator is directly connected to the grid, whereas the rotor is connected to the grid through a back-to-back AC-DC-AC converter.

Every power network (grid) experiences many disturbances, as there may be a great number of variations, such as voltage, frequency, active power, and reactive power, at the load end or at power generating stations (renewable or non-renewable). For any system to operate without disturbances, power system stability and voltage regulation are critical control issues that need to be considered [8]. The importance of power system stability is illustrated clearly in [9], which presents a clear description of power system instability as the primary cause of any major blackout. The main objective of any power system stabilizer is to provide stable power to the grid and to improve the damping of oscillations. Especially when connecting any renewable energy conversion system to the grid, supplying stable power is always challenging due to fluctuating frequencies and voltages.

Literature Review
In a DFIG-equipped WECS, decoupled controllers are used to control the active and reactive power on both the rotor and grid sides, the rotor speed, and the DC voltage and to track the maximum power. Low-frequency oscillations were observed for a WECS with weak grid interconnections; this oscillation mode caused torsional interactions with a remote synchronous generator and led to the shutdown of the power plant [10]. To overcome the problem of low-frequency oscillations, a power system stabilizer can be applied at the output or the input of the controller. As per [11], a PSS can be employed for any DFIG variable that is influenced by network oscillations, such as rotor speed, stator electrical power, and voltage or network frequency. In [11-18], the importance of employing a PSS and improving the damping of oscillations in a grid network fed by a DFIG-WES using the slip signal and rotor speed deviations is discussed. Oscillation problems addressed using a PSS include electromechanical oscillations [19], inter-area power system oscillations [20,21], improvement of network damping capability [22], and oscillations caused by a DFIG when integrated into a network that is already fed by a synchronous generator [17,18]. These papers clearly illustrate the need for a PSS to damp the oscillations and show that a PSS can be used to improve the stability of the grid-integrated wind system.
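The classical PSS signal path used in many of the works cited above (stabilizer gain, washout high-pass, and lead-lag compensation) can be sketched as a discrete-time filter chain; the gain and time constants below are illustrative assumptions, not values from these references.

```python
# Classical PSS path: gain -> washout s*Tw/(1+s*Tw) -> lead-lag
# (1+s*T1)/(1+s*T2), discretized with the bilinear (Tustin) transform.
# Gain and time constants are illustrative assumptions.
K_PSS, T_W = 10.0, 5.0    # stabilizer gain and washout time constant (s)
T1, T2 = 0.25, 0.05       # lead-lag time constants (s)
DT = 1e-3                 # sample time (s)

def make_filter(num, den, dt=DT):
    """First-order section (num0*s + num1)/(den0*s + den1), Tustin form."""
    c = 2.0 / dt
    b = (num[0] * c + num[1], -num[0] * c + num[1])
    a = (den[0] * c + den[1], -den[0] * c + den[1])
    state = {"x": 0.0, "y": 0.0}
    def step(x):
        y = (b[0] * x + b[1] * state["x"] - a[1] * state["y"]) / a[0]
        state["x"], state["y"] = x, y
        return y
    return step

washout = make_filter((T_W, 0.0), (T_W, 1.0))   # s*Tw / (1 + s*Tw)
leadlag = make_filter((T1, 1.0), (T2, 1.0))     # (1 + s*T1) / (1 + s*T2)

# A constant speed-deviation input is washed out: the PSS output decays to 0,
# so the stabilizer acts only on oscillatory components.
y = 0.0
for _ in range(50000):                           # 50 s of simulation
    y = leadlag(washout(K_PSS * 1.0))
```

The washout block is what makes the stabilizer ignore steady operating-point offsets and respond only to oscillations.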
Most of the conventional PSSs (CPSSs) employed for wind energy systems are classical designs in which the system is linearized around an operating point. However, PSSs developed using Artificial Intelligence (AI) techniques have been used for design and stability studies of non-renewable-sourced power systems. Some of the frequently used techniques for PSS design are artificial neural networks (ANNs) [23,24] and Fuzzy Logic [10,25,26], and support vector regression was used to design an adaptive PSS in [27]. The use of AI to solve stability issues has been categorized into three separate methodologies based on the techniques used: supervised learning, unsupervised learning, and reinforcement learning (RL) [28]. RL is completely different from supervised learning and unsupervised learning. Supervised learning alone is not adequate for learning from interactions, and in interactive problems, it is often impractical for supervised learning to obtain the desired behavior for all situations in which the agent must act [28]. In contrast, in RL, an agent must be able to learn from its own experience. Unsupervised learning is typically focused on finding structures hidden in collections of unlabeled data, whereas the RL paradigm aims to maximize a reward signal instead of trying to find hidden structures [28]. Thus, RL is a third machine learning paradigm, alongside supervised and unsupervised learning.
The current research used reinforcement learning to control the entire system, as there is a need for interaction between the mechanical system (wind turbine) and the electrical system (DFIG and controllers) to supply stable power to the grid under wind speed variations. One of the main reasons for using the RL control method is its capability of adapting itself to evolving generation levels, load levels, and operating uncertainties and responding to arbitrary disturbances [29]. In [29], the RL method was used in online mode and applied to control a thyristor-controlled series capacitor (TCSC), aiming to damp power system oscillations. RL controllers were designed to stabilize the closed-loop system after severe disturbances in [30]. A specific RL algorithm called Q-learning was utilized to control and adjust the gain of a conventional PSS in [31]. RL algorithms have also been used in generation control and in voltage and reactive power control [32-34]. In [35], a control strategy was developed for a PSS using the Q-learning method to suppress low-frequency oscillations. In [36], a proportional resonance PSS (PR-PSS) was proposed using an actor-critic agent, one of the RL techniques, for adaptive adjustment of parameters to suppress ultra-low-frequency oscillations. The RL techniques discussed in [35,36] were used for comparing the PSS results obtained with the Q-learning algorithm discussed in this paper. The TD3 method was implemented in [37,38] for continuous power disturbances to overcome low-frequency oscillations. In [38], the TD3 method was used to perform parameter estimation and the fine-tuning of PID controllers and to overcome the problem of low-frequency oscillations caused by load generation variations. Deep reinforcement learning methods have also been implemented for load frequency control of multi-area power systems and for battery energy management.
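The core of the Q-learning approach used in [31,35]-style stabilizer tuning is a single tabular value update; the sketch below shows that update with an illustrative state/action discretization and hyperparameters (assumptions, not the settings used in those references).

```python
import numpy as np

# Minimal tabular Q-learning update. The discretization sizes, learning
# rate, and discount factor below are illustrative assumptions.
N_STATES, N_ACTIONS = 5, 3
ALPHA, GAMMA = 0.1, 0.9       # learning rate and discount factor

def q_update(Q, s, a, reward, s_next):
    """Bellman update: Q(s,a) += alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    td_target = reward + GAMMA * np.max(Q[s_next])
    Q[s, a] += ALPHA * (td_target - Q[s, a])
    return Q

Q = np.zeros((N_STATES, N_ACTIONS))
Q = q_update(Q, s=2, a=1, reward=-0.5, s_next=3)   # one observed transition
```

With an all-zero table, the single transition above moves Q(2,1) to alpha * reward = -0.05; repeated interaction propagates value estimates across the table.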
In [39], a multi-agent deep reinforcement learning method was proposed, which utilized the DDPG method to optimize load frequency control performance. In this study, in addition to the QL method, the TD3 method was explored and implemented to solve the low-frequency oscillations caused by large variations in wind speed. In this research, the TD3 method and the QL method were implemented by replacing the existing PI controller and PSS. This paper is an extension of [40], where a Q-learning algorithm was implemented on the rotor-side converter for a small range of changes in wind speed. The major contributions of the current work are discussed in Section 3.
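The TD3 critic target used in [37,38]-style controllers combines three ingredients: clipped Gaussian noise on the target action (target policy smoothing), the minimum over twin target critics, and delayed actor updates. A minimal sketch of the first two, using the commonly quoted default constants as illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Common TD3 defaults, shown here as illustrative assumptions.
GAMMA, NOISE_STD, NOISE_CLIP = 0.99, 0.2, 0.5

def smoothed_target_action(mu_next, act_limit=1.0):
    """Target policy smoothing: clipped Gaussian noise on the target action."""
    noise = np.clip(rng.normal(0.0, NOISE_STD), -NOISE_CLIP, NOISE_CLIP)
    return float(np.clip(mu_next + noise, -act_limit, act_limit))

def td3_target(reward, q1_next, q2_next, done):
    """Critic target y = r + gamma * (1 - done) * min(Q1', Q2')."""
    return reward + GAMMA * (1.0 - done) * min(q1_next, q2_next)

a_next = smoothed_target_action(0.9)   # lands in [0.4, 1.0] after clipping
y = td3_target(reward=-0.1, q1_next=2.0, q2_next=1.5, done=0.0)
```

Taking the minimum of the twin critics counteracts the value overestimation that plagues single-critic deterministic policy gradient methods.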

Contributions
In this work, a CPSS was first designed, and control strategies were applied at a fixed wind speed. The optimal PI gain values for both the rotor-side converter and the grid-side converter were obtained by using eigenvectors for the PSS, and the stability of the system was proved mathematically. In the next step, a Q-learning algorithm was implemented on the designed PSS on the RSC and on the PI controllers on the GSC for variable wind speeds. The objective of the Q-learning algorithm implemented on the RSC and GSC is to suppress the low-frequency oscillations of the DFIG-based WECS when variable speeds are applied to the system. Since the control objective is to stabilize the system with active power (P) and reactive power (Q) without low-frequency oscillations, this paper uses the active power change as the state of the agent and the control output of the RSC as the action of the agent to train the model. On the grid-side converter, the grid-side active power, which is a function of the DC-link voltage (V_dc), is controlled; V_dc acts as the state of the agent, and the reward function generates the reference signal for the grid-side current (i*_dg). Simulation results verify that the designed Q-learning-based model-free controllers can quickly stabilize the DFIG wind system over a small range of wind speeds. The terms action, state, and agent used in this section are explained in detail in Section 3 of this paper. With the Q-learning agent, the system becomes unstable under large variations in wind speed. One of the solutions is to implement the system with an actor-critic method. In this research, the TD3 agent was trained to learn the system dynamics under large variations in wind speed and to control the active and reactive power. Real-time wind speed variations in the Ottawa region were used as the system input. The PI controller in the inner current control loop of the RSC was replaced with the TD3 agent.
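The Q-learning interface described above (state: discretized active-power deviation; action: a candidate RSC control-signal level) can be sketched as follows; the bin edges, action levels, and exploration rate are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative discretization of the agent's state and action spaces.
DP_BINS = np.array([-0.10, -0.02, 0.02, 0.10])   # p.u. bin edges -> 5 states
ACTIONS = np.array([-0.05, 0.0, 0.05])           # candidate control levels
EPSILON = 0.1                                    # exploration probability

def state_of(dP):
    """Map a continuous power deviation (p.u.) to a discrete state index."""
    return int(np.digitize(dP, DP_BINS))

def choose_action(Q, s):
    """Epsilon-greedy selection over the discrete action set."""
    if rng.random() < EPSILON:
        return int(rng.integers(len(ACTIONS)))
    return int(np.argmax(Q[s]))

Q = np.zeros((len(DP_BINS) + 1, len(ACTIONS)))
s = state_of(0.05)        # a +0.05 p.u. deviation falls in state index 3
a = choose_action(Q, s)
```

The coarse binning is what limits tabular Q-learning to small wind-speed ranges: large excursions leave the well-visited part of the table, which motivates the continuous-action TD3 agent used in the second stage.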
From the results discussed in Section 5, it is observed that the TD3 agent can mitigate the frequency changes under large variations in wind speed and provide stable power to the grid. In both the QL and TD3 implementations, the PSS and PI controllers were removed, and the agent was trained to learn the system dynamics and suppress low-frequency oscillations. Figure 2 shows the physical model of the grid-connected DFIG-based wind turbine system. This benchmark power system model was first introduced in [7] to study the effect of shaft systems and low-frequency oscillations by comparing the switching-level (SL) and Fundamental Frequency (FF) models. As illustrated in Figure 2, the system consists of mechanical and electrical models. The mechanical model has three components: (1) the wind turbine, (2) the gearbox, and (3) the pitch controller; together, these components provide a complete drivetrain model, which supplies the required mechanical energy to the generator. The electrical model consists of a DFIG, of which the stator is directly connected to the grid, and the rotor is connected through back-to-back IGBT-based pulse-width modulation converters.

Design of Drivetrain Model
The mathematical representation of the drivetrain is formed by the turbine rotating mass, low-speed shaft, gearbox, high-speed shaft, and generator rotating mass. The model of the drivetrain is developed by neglecting mechanical twisting and stresses, as these are more related to mechanical design studies. Moreover, for power system stability studies, it is suggested in [41] to consider the turbine, gearbox, and generator as rigid disks and the shafts as mass-less torsional springs. A two-mass model is recommended for power system stability studies because it allows a more direct representation of shaft stiffness and inertia constants [42]. Figure 3 shows the physical representation of the two-mass model [7]. The dynamics of the two-mass model can be obtained by applying Newton's equation of motion for each mass. The equations obtained are [7,43]:

2 H_t p ω_t = T_m − T_sh − D_t ω_t
2 H_g p ω_r = T_sh − T_e − D_g ω_r

where p = d/dt, and T_sh is the shaft torque, which is given as:

T_sh = K_tg θ_tg + D_tg (ω_t − ω_r), with p θ_tg = ω_b (ω_t − ω_r)

In Figure 3, ω_t and ω_r are the turbine and generator rotor speeds, respectively. T_m and T_e are the mechanical torque applied to the turbine and the electrical torque, respectively; H_t and H_g are the turbine and generator inertia constants, respectively; D_t and D_g are the damping coefficients of the turbine and generator, respectively; T_tg is the internal torque of the model; D_tg is the damping coefficient of the shaft between the two masses; K_tg is the spring constant or shaft stiffness. Finally, N_t/N_g is the gear ratio of the gearbox.
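As a quick numerical check, the two-mass equations can be integrated with a simple Euler scheme; the per-unit parameter values below are illustrative assumptions, not the benchmark system's data.

```python
import numpy as np

# Euler integration of the two-mass drivetrain model. All parameters are
# illustrative per-unit assumptions.
H_T, H_G = 4.0, 0.9        # turbine / generator inertia constants (s)
D_T, D_G, D_TG = 0.0, 0.0, 1.5
K_TG = 80.0                # shaft stiffness
W_B = 2.0 * np.pi * 60.0   # base electrical speed (rad/s)

def step(x, Tm, Te, dt=1e-3):
    """One Euler step of the state (w_t, w_r, theta_tg) in per unit."""
    w_t, w_r, th = x
    Tsh = K_TG * th + D_TG * (w_t - w_r)           # shaft torque
    dw_t = (Tm - Tsh - D_T * w_t) / (2.0 * H_T)    # turbine mass
    dw_r = (Tsh - Te - D_G * w_r) / (2.0 * H_G)    # generator mass
    dth = W_B * (w_t - w_r)                        # shaft twist angle
    return (w_t + dt * dw_t, w_r + dt * dw_r, th + dt * dth)

# With Tm = Te = 1 p.u., the twist angle th = Tm/K_tg is an equilibrium:
x = (1.0, 1.0, 1.0 / K_TG)
for _ in range(1000):
    x = step(x, Tm=1.0, Te=1.0)
```

Perturbing T_m away from T_e in this sketch excites the lightly damped torsional mode between the two masses, which is exactly the low-frequency oscillation the controllers in this paper target.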


DFIG Model
For power system stability studies, the DFIG machine is modeled by neglecting stator transients, as they do not affect the electromechanical oscillations [44], and by neglecting stator transients, it is easier to solve the stator and grid equations [45]. Similarly, the rotor electrical transients are also neglected, as the rotor winding is controlled by fast-acting converters [46]. In addition, other assumptions made for modeling the DFIG are that the skin effect, the saturation effect, and iron losses (hysteresis and eddy currents) are neglected, as these phenomena contribute mainly to conduction loss performance in transient fault analysis. For designing the DFIG, a synchronous reference frame is used. The DFIG model is represented in a synchronously rotating reference frame because the qd-axis model is convenient for conducting steady-state analysis and deriving a small-signal model [47]. The equivalent circuit of a DFIG in a synchronously rotating reference frame is given in Figure 4. In Figure 5, the q- and d-axes are orthogonal to each other, rotating at an angular velocity of ω, whereas a_s, b_s, and c_s represent stator variables displaced by 120°. To analyze the induction machine variables associated with the rotor, the reference frame also needs to be transformed to the qd-axis reference frame. In Figure 5, a_r, b_r, and c_r represent rotor variables displaced by 120°, rotating at an angular velocity of ω_r. The stator model is represented in the qd-axis reference frame.
The rotor model is represented in the qd-axis reference frame.
The mathematical model including the saturation effect is provided below [48].
The voltage equations of the DFIG with flux saturation are:

v_ds = R_s i_ds − (ω_s/ω_b)(L_s i_qs + L_ms i_mq) + L_s p i_ds + L_md p i_md (13)

v_qs = R_s i_qs + (ω_s/ω_b)(L_s i_ds + L_ms i_md) + L_s p i_qs + L_md p i_mq (14)

v_dr = R_r i_dr − ((ω_s − ω_r)/ω_b)(L_r i_qr + L_ms i_mq) + L_r p i_dr + L_md p i_md (15)

v_qr = R_r i_qr + ((ω_s − ω_r)/ω_b)(L_r i_dr + L_ms i_md) + L_r p i_qr + L_md p i_mq (16)

where p = d/dt.
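As a rough numerical illustration, the steady-state (p → 0) form of Equations (13)-(16) can be evaluated directly: with all derivative terms zero, the L_md/L_mq dynamic terms vanish. The parameter values below are illustrative assumptions, not the studied machine's data.

```python
# Steady-state evaluation of Eqs. (13)-(16) with derivative terms zero.
# RS, RR, LS, LR, LMS are illustrative per-unit assumptions.
RS, RR = 0.01, 0.01          # stator / rotor resistances (p.u.)
LS, LR, LMS = 0.1, 0.1, 3.0  # inductances in the cross-coupling terms (p.u.)

def dfig_voltages(ids, iqs, idr, iqr, ws=1.0, wr=0.8, wb=1.0):
    """Return (v_ds, v_qs, v_dr, v_qr) in per unit at steady state."""
    imd, imq = ids + idr, iqs + iqr     # magnetizing current components
    s = (ws - wr) / wb                  # slip frequency in p.u.
    v_ds = RS * ids - (ws / wb) * (LS * iqs + LMS * imq)
    v_qs = RS * iqs + (ws / wb) * (LS * ids + LMS * imd)
    v_dr = RR * idr - s * (LR * iqr + LMS * imq)
    v_qr = RR * iqr + s * (LR * idr + LMS * imd)
    return v_ds, v_qs, v_dr, v_qr

# Example: magnetizing current fully supplied from the rotor side.
v = dfig_voltages(ids=1.0, iqs=0.0, idr=-1.0, iqr=0.0)
```

Note how the rotor-side terms scale with the slip frequency (ω_s − ω_r)/ω_b, which is why the rotor converter only processes a fraction of the rated power.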

Control Strategies
To develop control strategies for the converters, one of the axes of the two-axis (qd) reference frame, rotating at synchronous speed, is aligned with either the flux or the voltage of the stator. The two commonly used control strategies are stator flux-oriented control and stator voltage-oriented control. In stator flux-oriented control, the d-axis is aligned with the stator flux linkage vector [7]. This results in ψ_ds = ψ_s and ψ_qs = 0. In contrast, in stator voltage-oriented control, the d-axis is aligned with the stator voltage vector, resulting in V_ds = V_s and V_qs = 0. Conventionally, stator flux-oriented control is used for rotor-side converter control, and stator voltage-oriented control is adopted for grid-side converter control [7]. For this work, stator voltage-oriented control is implemented for both the rotor-side converter controller and the grid-side converter controller, as discussed in [49]. Compared to flux-oriented control, voltage-oriented control has the advantage that the model can be derived and aligned with the stator voltage space vector using the measured phase voltages. Another advantage is that the grid-side converter is typically controlled using the stator voltage orientation, so by choosing voltage-oriented control for the rotor-side controller as well, the model is simpler to implement. The use of voltage-oriented control on both the RSC and GSC can be observed in [49]. The objective of these control strategies is to decouple the control of active and reactive power [49].

Rotor-Side Controllers
The main objective of the rotor-side converter controller is to control both the active and reactive power of the stator. It consists of two control loops: the inner control loop regulates the d- and q-axis rotor currents, whereas the outer loop controls the active power P_s and reactive power Q_s of the generator stator. By neglecting the transients in the stator flux linkages in the stator voltage orientation and under the assumption that resistive drops are negligible, it is observed that the active power P_s is independent of the q-axis rotor current. Under the same assumptions for the rotor model, the stator reactive power Q_s is totally dependent on the q-axis rotor current and independent of the d-axis rotor current. It can be concluded that, in the RSC, the stator active and reactive power can be controlled independently by the rotor d-axis and q-axis currents, respectively; the outer loops generate the stator active and reactive power reference current outputs i*_dr and i*_qr, respectively. The generated reference currents are fed as inputs to the inner current loops of the rotor current controller. The final outputs of the rotor-side converter controller are v_dr and v_qr. The controller is designed based on the rotor voltage models, which can be expanded from Equations (17) and (18) [49]. When designing the rotor-side controller with decoupling elements, if the transients of the machine are neglected, the derivative terms become zero. With this condition, the d-axis rotor voltage will be in terms of i_dr; similarly, the q-axis rotor voltage will be in terms of i_qr. With these elements, a transfer function can be developed for the inner current control loops, and from this, the PI gains are obtained. From the mathematical models developed, it is observed that the design is the same for both control loops, so the PI control design is also the same for both inner current control loops.
In these control loops, v_dr1 and v_qr1 are the outputs of the rotor current controller loops, and they are fed to the pulse-width modulator of the converter along with the decoupling elements. Finally, the signals obtained from the modulator are fed to the converter circuit, which is connected to the DC link of the model.
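The inner-loop PI design described above can be sketched as follows: after the decoupling terms are added, each loop sees the same first-order R-L plant G(s) = 1/(R + sL), so one design serves both axes. The resistance, inductance, and bandwidth below are illustrative assumptions, not the paper's tuned values.

```python
import math

# PI design by pole/zero cancellation on the decoupled R-L current loop.
# Parameter values are illustrative assumptions.
R_R = 0.01            # rotor resistance seen by the loop (p.u.)
L_SIGMA = 0.2         # transient inductance seen by the loop (p.u.)
W_C = 2.0 * math.pi * 100.0   # target closed-loop bandwidth (rad/s)

def pi_gains(R, L, wc):
    """Cancel the plant pole at -R/L: Kp = wc*L, Ki = wc*R.

    The closed loop then reduces to the first-order response wc/(s + wc).
    """
    return wc * L, wc * R

KP, KI = pi_gains(R_R, L_SIGMA, W_C)

def pi_step(err, integ, dt=1e-4):
    """One step of the discrete PI: returns (output, updated integrator)."""
    integ += KI * err * dt
    return KP * err + integ, integ
```

Because both axes share the same decoupled plant, the same (KP, KI) pair is used in the d- and q-axis current loops, matching the observation above that one PI design serves both.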
In Figure 6, i_dr and i_qr are the rotor currents, transformed from the rotor three-phase currents i_rabc to i_rdq by applying a transformation angle of (θ_s − θ_r), where θ_s is the angle obtained from v_sabc at the grid frequency, and θ_r is the rotor angle. P_s is the stator active power, and Q_s is the stator reactive power; these are obtained from v_sabc and i_sabc. The reactive power reference Q_sref is set to 0, whereas the stator active power reference P_sref is generated using the maximum power tracking method. In this work, a simple lookup table is used, which obtains the reference stator power from a stored power-versus-speed curve. The other reference in the v_dr control loop is the generator speed reference value.
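The lookup-table maximum power tracking described above can be sketched as a simple interpolation of the stored power-versus-speed curve; the table values below are illustrative, not the turbine's measured characteristic.

```python
import numpy as np

# Illustrative power-vs-speed table for generating P_sref.
SPEED_PU = np.array([0.7, 0.8, 0.9, 1.0, 1.1, 1.2])        # rotor speed (p.u.)
POWER_PU = np.array([0.15, 0.25, 0.40, 0.60, 0.73, 0.80])  # P_sref (p.u.)

def p_s_ref(rotor_speed_pu):
    """Interpolate the stator active-power reference from the table."""
    return float(np.interp(rotor_speed_pu, SPEED_PU, POWER_PU))
```

Between the stored points the reference varies linearly, which is usually an adequate approximation of the smooth optimal-power curve for control purposes.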


Grid-Side Controllers
To design the grid-side converter controller, first, the grid model in the abc reference frame is transformed into the two-axis (dq) reference frame by using a phase-locked loop (PLL), which provides the required transformation angle and the frequency for synchronizing the model with the grid. The design and operation of the PLL are derived from [50], and it is shown in Figure 7 [50]. The grid model in the abc reference frame is transformed to the dq reference frame based on the transformation matrix of Equation (19) [50].
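One common amplitude-invariant form of such an abc-to-dq transformation can be sketched as follows; the paper's exact Equation (19) convention may differ in scaling or sign:

```python
import math

def abc_to_dq(a, b, c, theta):
    """Amplitude-invariant abc -> dq transformation (one common form of
    the Equation (19) matrix; the paper's exact convention may differ)."""
    two_thirds = 2.0 / 3.0
    d = two_thirds * (a * math.cos(theta)
                      + b * math.cos(theta - 2 * math.pi / 3)
                      + c * math.cos(theta + 2 * math.pi / 3))
    q = -two_thirds * (a * math.sin(theta)
                       + b * math.sin(theta - 2 * math.pi / 3)
                       + c * math.sin(theta + 2 * math.pi / 3))
    return d, q
```

For a balanced three-phase set transformed with its own phase angle, this form yields d equal to the phase amplitude and q equal to zero, which is the property the PLL exploits.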

The transformation angle θ is the grid-side converter terminal voltage angle, expressed as θ = ω_0 t + δ, where δ is given by pδ = ω − ω_0, with ω the grid-side converter terminal frequency. The phase angle measured by the PLL is θ_pll, and the measured frequency is ω_pll; these provide the transformation angle and the frequency to the grid-side converter controller design. The obtained angle and frequency information is used for both the abc − dq and dq − abc transformations at the grid end.
Initially, for operating the PLL, the q-axis voltage of the stator v_qs is obtained by the abc − dq transformation. Once the grid-side converter terminal voltage is transformed with transformation angle θ_pll, the obtained equation is v_qs = V sin(θ − θ_pll) = V sin(δ − δ_pll), where δ_pll is the measured phase angle, expressed as δ_pll = θ_pll − ω_0 t. The grid-side voltage frequency ω_pll is obtained by adding ω_0 to the error signal processed by the PI controller.

By using this three-phase PLL, the grid equations can be transformed into the dq reference frame. In the dq-transformed equations of the grid model, Equations (22) and (23), v_ds and v_qs are the stator voltages; r and L are the grid filter resistance and inductance, respectively; and i_d and i_q are the total currents supplied to the grid in the dq reference frame. Finally, e_d and e_q are the transformed voltages of the grid terminal. The total currents supplied to the grid, i_d and i_q, are the sums of the stator currents i_dqs and the grid-side converter currents i_dqg. The grid-side converter voltages in the dq reference frame can be expressed as:

v_dg = R_g i_dg + L_g p i_dg − ω_pll L_g i_qg + v_ds (24)
v_qg = R_g i_qg + L_g p i_qg + ω_pll L_g i_dg + v_qs (25)

From the grid-side model derived above, the grid-side converter controller can be designed. The main objective of the grid-side converter controller is to regulate the DC-link voltage and exchange power between the rotor-side converter and the grid. The other objective of this controller is to control the reactive power delivered to the grid at the grid-side converter. Like the rotor-side converter controller, the grid-side converter controller also consists of two cascaded control loops.
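The PLL loop described above, in which v_qs drives a PI whose output, added to ω_0, is integrated into θ_pll, can be sketched as a discrete iteration; the gains kp and ki here are illustrative, not the values used in the paper:

```python
import math

# Discrete PLL sketch: near lock, v_qs = V*sin(theta_grid - theta_pll);
# a PI on v_qs plus w0 gives the measured frequency w_pll, which is
# integrated into the transformation angle theta_pll.

def pll_step(theta_grid, theta_pll, integ, kp, ki, w0, dt, V=1.0):
    v_qs = V * math.sin(theta_grid - theta_pll)   # phase detector
    integ += ki * v_qs * dt                       # PI integral state
    w_pll = w0 + kp * v_qs + integ                # measured frequency
    theta_pll = (theta_pll + w_pll * dt) % (2 * math.pi)
    return theta_pll, integ, w_pll
```

Iterating this against a constant-frequency grid drives the phase error to zero, after which ω_pll settles at the grid frequency.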
The DC-link voltage and the reactive power are controlled by the outer control loop, whereas the inner current control loop regulates the current components in the grid-side converter. From the grid model discussed above, the active and reactive power at the grid can be expressed in terms of the dq voltages and currents. For the grid-side active power, applying the synchronously rotating reference frame and aligning the d-axis with the grid voltage vector gives v_ds = v_s and v_qs = 0. Applying this to Equations (24) and (25) yields the simplified expressions. From the grid-side active power P_gc and reactive power Q_g, the outer control loops can be designed; the result is a function of the DC-link voltage v_dc and the grid-side converter current i_dg [7,49]. Using this, an independent control loop is developed for the DC-link voltage v_dc, with i*_dg as its output. Similarly, the grid-side reactive power can be controlled individually; from this outer loop, i*_qg is obtained. For the inner control loop design, the same reference frame is applied. The inner control loops are designed as current control loops, which use the grid-side currents as inputs. The GSC control block is given in Figure 8 [7].
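Under this d-axis alignment, the power decoupling that the outer loops exploit reduces to two one-line expressions; the 3/2 amplitude-invariant scaling and the sign of Q are assumed conventions, not taken from the paper:

```python
# With the d-axis aligned to the grid voltage vector (v_ds = v_s,
# v_qs = 0), grid-side powers decouple onto i_d and i_q.  The 3/2
# factor and the sign of Q are assumed (amplitude-invariant) conventions.

def grid_powers(v_s, i_d, i_q):
    p_g = 1.5 * v_s * i_d   # active power: d-axis current only
    q_g = -1.5 * v_s * i_q  # reactive power: q-axis current only
    return p_g, q_g
```

This is why the DC-link voltage loop can output i*_dg and the reactive power loop can output i*_qg independently.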

Grid-Side Inner Current Control Loop
With Equations (30) and (31), the inner current control loop can be designed. In Equation (30), ω_pll L_g i_qg and v_ds are fed as inputs at the outer end of the controller design, and in Equation (31), ω_pll L_g i_dg is fed at the outer end. ω_pll L_g i_qg and ω_pll L_g i_dg act as decoupling elements for the grid-side converter controller. From the resulting Equations (32) and (33), the PI controllers can be designed, and it can be observed that both controllers have similar structures. Because of the very short sampling period and the fast response rate required by the system, the controller is limited to a PI controller. Moreover, with the use of power electronic devices and large variations in wind speed, which generate considerable noise in the system, adding a derivative term would produce undesirable simulation results.

Q-Learning (QL) and Twin Delayed Deep Deterministic Policy Gradient (TD3)
Reinforcement learning (RL) algorithms focus on goal-directed learning from interaction and are mainly used to solve closed-loop problems, in which the actions of the learning system influence its later inputs [28]. An RL problem consists of a discrete set of environment states S, a discrete set of agent actions A, and a set of scalar reinforcement signals R. The agent interacts with the environment through actions: it receives the current state as input and then chooses an action to generate an output. The final goal of any RL algorithm is to increase the long-run sum of the reinforcement signals, learning over time by trial and error [51].
The environment in reinforcement learning is fully observable and can be described as a Markov decision process (MDP), and most RL problems are formalized as MDPs. In an MDP, the action taken in the current state also affects the next state and not just the current state itself, so action plays a dominant role. Due to the action in the current state, a return reward will be assigned to the corresponding state-action pair [28].
Of the various available RL algorithms, the Q-learning algorithm is considered simple and easy to implement due to the simple way in which agents can learn and act optimally in controlled Markovian domains. The other main advantage of the Q-learning algorithm is that it is exploration-insensitive: Q values will converge to the optimal values, independently of how the agent behaves while the data are being collected [51,52]. With these advantages, this paper uses a Q-learning algorithm to train the agent to suppress oscillations and provide stable power to the grid under variable wind speeds. Assuming that the best action is taken initially, the Q-learning optimal value function is taken from [51].
In the above equation, Q(s, a) is the expected discounted reinforcement of choosing action a in state s. Once the action is taken, the agent is given a reward R for the effectiveness of the action by observing the resulting state s′ of the environment. Here, T is the probability that action a applied in state s changes the state from s to s′. For each action executed, the Q values converge with probability 1 to Q*, and when the Q values are nearly converged to their optimal values, the agent acts greedily by taking the action with the highest Q value; from this greedy policy, the optimal action is determined. At any time step, there is at least one action whose estimated value is greatest; choosing it is called the greedy action [28]. γ (0 ≤ γ ≤ 1) is the discount factor, which discounts future rewards exponentially [51]. Typically, the agent looks up the Q-memory lookup table, indexed by state s and action a, which is updated as per [34,51,52]. The parameter α (0 ≤ α ≤ 1) in Equation (35) updates the Q-memory and affects the number of iterations [34].
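The tabular update described here can be written compactly; α = 0.2 matches the value quoted later in the paper, while γ is illustrative:

```python
# Tabular Q-learning update (Equation (35)-style); Q is a list of
# per-state action-value lists.

def q_update(Q, s, a, r, s_next, alpha=0.2, gamma=0.9):
    td_target = r + gamma * max(Q[s_next])      # bootstrap on best next action
    Q[s][a] += alpha * (td_target - Q[s][a])    # move Q(s,a) toward the target
    return Q
```

Because the target uses max over the next state's actions, the update is off-policy: the Q values converge toward Q* regardless of how actions are chosen during data collection, which is the exploration insensitivity noted above.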
Even though the Q-learning method is simple to implement, the agent cannot learn under conditions with huge variations, which is one of the observations from the current research work and a limitation of the QL agent. The other problem with the QL method is that the Q-table must be limited to certain states and actions, and the Q-table cannot be updated for a large state-action space. Therefore, the current research work explored and implemented the TD3 method so that the Q-table limitation can be overcome, and the agent can adapt to large variations in conditions.
The TD3 method is one of the model-free, policy-based deep reinforcement learning algorithms built on the DDPG method [53]. The objective of the TD3 method is to increase stability and performance by accounting for function approximation error [53]. In an actor-critic setting, the learning target is y = r + γ Q_θ(s′, π_∅(s′)), where y is the learning target, r is the reward received for each action, s′ is the new state of the environment, γ is the discount factor, π_∅ is the policy, and Q_θ(s, a) is the function approximator with parameter θ.
The stability and performance are increased by applying three modifications [53]: a. Clipped double Q-learning: with this update, the value target cannot introduce additional overestimation over the standard Q-learning target. With a pair of actors (π_∅1, π_∅2) and critics (Q_θ1, Q_θ2), the target update is y = r + γ min_{i=1,2} Q_θi(s′, π_∅1(s′)). This update also reduces the computational cost, as a single actor is optimized with respect to Q_θ1, and the same target is used to update Q_θ2.
b. Target networks and delayed policy updates: In this update, the target network is used to reduce the error over multiple updates. The policy network is updated at a lower frequency than the value network to minimize the error. The modification is made to update the policy and target networks after a fixed number of updates to the critic [53]. Thus, very few policy updates are made with this modification, and policy updates are not repeated for an unchanged critic.
c. Target policy smoothing regularization: In this approach, the relationship between similar actions is forced explicitly by modifying the training procedure, which is carried out by fitting the value of a small area around the target action. This will have the benefit of smoothing the value estimate by bootstrapping off a similar state-action value estimate [53]. The expectation over actions is approximated by adding a small amount of random noise to the target policy, where the noise is kept close to the original action.
The modified target update is y = r + γ min_{i=1,2} Q_θi(s′, π_∅(s′) + ε), with ε ∼ clip(N(0, σ̃), −c, c).
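A minimal sketch of this target computation, combining the clipped double-Q minimum with clipped Gaussian smoothing noise; the callables q1, q2, and actor_target stand in for the target networks, and all names are hypothetical:

```python
import random

# TD3 learning target: clipped Gaussian noise smooths the target
# policy, and the minimum of the two critics guards against
# overestimation.  q1, q2, actor_target stand in for target networks.

def td3_target(r, s_next, q1, q2, actor_target, gamma=0.99,
               sigma=0.2, clip_c=0.5, done=False):
    if done:                      # terminal state: target is the reward
        return r
    noise = max(-clip_c, min(clip_c, random.gauss(0.0, sigma)))
    a_next = actor_target(s_next) + noise        # smoothed target action
    return r + gamma * min(q1(s_next, a_next), q2(s_next, a_next))
```

Taking the minimum of the two critic estimates is what prevents the value target from introducing additional overestimation, while the clipped noise keeps the smoothed target action close to the original one.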

Design of PSS
In this paper, the PSS is designed based on the transformation technique. The block diagram for the PSS with the transformation technique is shown in Figure 9.

The PSS in Figure 9 consists of a washout filter, a compensator, and a gain. The signal θ_f is obtained from the PLL frequency (ω_pll), and the transformation is defined as a rotation through the angle θ_f. The PSS developed with the transformation technique is implemented on the inner current controller of the RSC on both the d- and q-axis control loops. The block diagram with the transformation technique is given in Figure 10.



Q-Learning Algorithm on RSC
The main motivation for using the Q-learning algorithm is to suppress the low-frequency oscillations generated by variable wind speeds by searching for the best action in each state. To ensure continuous exploration and avoid settling in a local optimum, a pursuit algorithm based on the learning automata algorithm is utilized for the action policy. Initially, actions are determined with a uniform probability distribution. When the Q-table is updated, the probabilities of actions are updated as follows [54].
where T_k(s, a_g) denotes the probability that the action-state pair (a_g, s) is selected at iteration k, and β represents the action exploration rate. Under this algorithm, Q_k tends to Q* for sufficiently large k, and an optimal policy is obtained. The specific Q-learning-based adaptive parameter algorithm proposed in this paper is summarized in Algorithm 1. An episode in Algorithm 1 is defined by the final time step at which the agent-environment interaction breaks naturally into subsequences [28].

Algorithm 1 Q-learning-Based Adaptive Parameter in Rotor-Side Algorithm
For each episode do
    Initialize Q_0(s, a) = 0, ∀a ∈ A, ∀s ∈ S
    Initialize T_0,s(a) = 1/|A|, ∀a ∈ A, ∀s ∈ S
    For each step of the episode do
        Choose a from s based on the current distribution T_s(a)
        Take action a; observe r, s′
        Update Q(s, a) according to Equation (40)
        Update T_s(a) according to Equation (43)
        s ← s′
    End for
End for

The damping of frequency oscillations discussed in this paper can be formulated as an MDP problem because the future state of the closed-loop controller always depends on the current state. Since the problems considered by reinforcement learning can be modeled as MDPs, we transform the adaptive parameter problem into an MDP model by designing a five-tuple (S, A, P, γ, r) as follows.

Design of state space S: S is a set of states that represent configurations of the system. It is assumed that all possible states are finite. To damp the low-frequency oscillations of the power system and obtain a stable power output, we use the active power deviation ∆P as the state information.
The states of the MDP are described as follows. The state distribution is unbalanced around zero, because the deviation of the active power from the rated value is also unbalanced under wind fluctuation. The QL agent is trained with 10 states and 10 actions. The state at iteration zero is chosen at random, and at each incremental step, the state falls into one of the intervals. The states are divided into sections based on simulation results using a PSS with a PI controller, where the controlled variable is monitored; in this case, it is the active power, a feedback parameter that depends on various other system parameters. The state space is divided into ten interval states over (−∞, ∞). Each state consists of an interval; as the system is dynamic and continuous, the states are chosen between the intervals. The observation from the small-signal model and the simulation model with the PSS is that the change in power is within the range of −0.5 to 0.01. The intervals for each state are chosen randomly; however, the number of states is chosen based on the actions.
Design of action space A: A is the set of actions executed by the agent to influence the environment. As mentioned before, the output of the agent is the controller parameter at the rotor side. In this paper, we use the discrete action set A = [−0.025, −0.02, −0.015, −0.01, −0.005, 0.005, 0.01, 0.015, 0.02, 0.025]. The action space is chosen based on the output signals of the PSS controller, which is replaced by the QL agent. This is the list of actions from which the agent determines which action to choose in each state. In state 1, the chosen action can be any action and is not necessarily action 1. The Q-table, updated as per Algorithm 1, selects either the action with the maximum Q value or a random action, updated as per Equation (40). Although any action can be chosen, the trained agent uses the Q values from the trained model; therefore, under varying wind speeds, the agent knows which action to choose, based on the chosen state, from the updated Q values.
P : S × A → θ(S) is the state transition function that shows the distribution of the next state s k+1 after executing an action a k in the environment with the current state s k . Since the parameter variation model is unknown, we can use the temporal-difference (TD) method to train the adaptive parameter policy. TD learning is a combination of Monte Carlo and dynamic programming ideas. Q-learning is an off-policy TD control algorithm [28]. In this paper, we apply a Q-learning algorithm to optimize the parameter of the rotor-side controller.
Design of reward function r(s, a): r(s, a) maps the state-action pair (s, a) to a scalar, which represents the immediate reward after applying action a to the environment in state s. In this paper, r_k = −k × |P − P_ref|. This means that the more the active power deviates from the reference value, the smaller the immediate reward, which prompts the adjustment of the controller parameters so that the active power reaches the reference value. The parameter β used in Equation (43) updates the probability distribution; its value is 0.1, and the value of α is 0.2.
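The five-tuple design above can be sketched end to end; the action list and reward follow the text, while the pursuit update is one standard reading of Equation (43), with β moving probability mass toward the greedy action, and k is illustrative:

```python
import random

# Sketch of the RSC five-tuple design: ten discrete actions, the
# reward r_k = -k*|P - P_ref|, and a pursuit-style action policy.

ACTIONS = [-0.025, -0.02, -0.015, -0.01, -0.005,
           0.005, 0.01, 0.015, 0.02, 0.025]

def reward(p, p_ref, k=1.0):
    """r_k = -k * |P - P_ref|: larger deviation, smaller reward."""
    return -k * abs(p - p_ref)

def pursuit_update(T_s, Q_s, beta=0.1):
    """Move probability mass toward the greedy action at rate beta;
    the probabilities of the other actions decay accordingly."""
    greedy = max(range(len(Q_s)), key=lambda a: Q_s[a])
    return [T_s[a] + beta * ((1.0 if a == greedy else 0.0) - T_s[a])
            for a in range(len(T_s))]

def choose_action(T_s):
    """Sample an action index from the current distribution T_s(a)."""
    return random.choices(range(len(T_s)), weights=T_s)[0]
```

Starting from the uniform distribution, repeated pursuit updates concentrate probability on the greedy action while never driving the other probabilities to exactly zero, which preserves exploration.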
The control application of QL on the RSC is employed as per Figure 11.

In Figure 11, it can be observed that the PI controller, which is developed for the PSS with the transformation technique, is completely replaced with the Q-learning algorithm.

Q-Learning Algorithm for DC-Link Voltage Control on GSC
As discussed in the above section, the primary objective of the Q-learning algorithm used is to suppress the low-frequency oscillations generated by variable wind speeds by searching for the best action in each state. On the GSC, the sensitive parameters are the DC-link voltage and grid-side current, which are observed from the small-signal model. For the action policy, a pursuit algorithm based on the learning automata algorithm is utilized. Initially, actions are determined with a uniform probability distribution. When the Q-table is updated, the probabilities of actions are updated using Equation (43). The Q-learning algorithm discussed for the RSC (Algorithm 1) is used to update the episodes on the GSC as well; the main difference is in the action and reward that are chosen for each controller. For the GSC, the action space is designed for control parameters K p5 and K i5 , and the reward function is defined from the DC-link voltage. The new controller design with QL on the grid-side converter is shown in Figure 12.
It can be observed that the PI controllers on the GSC are replaced with the QL algorithm. The adaptive parameter problem can be transformed into an MDP model by designing a five-tuple (S, A, P, γ, r) as follows: Design of state space S: To damp low-frequency oscillations, the sensitive parameters observed from the GSC are the DC-link voltage and input reference current to the current control loop, so DC-link voltage ∆V dc is used as state information.
The states of the MDP are described as follows. On the GSC, the same state spaces defined for the RSC can be used, as the objective is to provide stable power to the grid, which is a function of the active power from the RSC and GSC. In addition, the deviation of the active power at the grid-side converter is a function of V_dc, which is also unbalanced under wind fluctuation. However, the action space is completely different from the RSC action space, as the control parameters differ.
Design of action space A: The outputs of the agent are the controller parameters K_p5 and K_i5 at the grid side. The discrete action space on the GSC is chosen based on the PI controllers that are replaced with the QL agent. However, the state space remains the same, as the final output is the active power, which is monitored and controlled by the agent.
P : S × A → θ(S) is the state transition function that gives the distribution of the next state s_k+1 after executing an action a_k in the environment with the current state s_k.
Design of reward function r(s, a): In this paper, r_k = −k × |V_dc − V_dc,ref|. This means that the more the DC-link voltage deviates from its reference value, the smaller the immediate reward, which prompts the adjustment of the controller parameters so that the voltage reaches the reference value.
A complete block diagram with the QL-controller units discussed above can be observed in Figure 13.

In Figure 13, V_w is the wind speed, ω_r is the generator rotor speed, and θ_r is the rotor angle, which is used for the abc − dq transformation of the rotor currents. V_sabc and I_sabc are the stator voltage and currents, respectively, whereas V_abc and I_abc are the grid voltage and currents, respectively. The grid voltage V_abc is provided as input to the phase-locked loop, which extracts the angle θ_pll. θ_pll is used for the abc − dq transformation of the stator voltage, stator current, grid voltage, and grid currents. It can also be observed in Figure 13 that the voltages fed to the rotor-side converter (V_r_abc) and grid-side converter (V_gc_abc) are provided by the QL controllers discussed in this section.

TD3 Method
As discussed in Section 5, the TD3 method is derived from the DDPG method with a few modifications. In this section, the TD3 algorithm and its implementation are discussed. One objective of this research work is to control the frequency under large variations in wind speed by replacing the PI controllers, as PI controllers cannot produce satisfactory tracking performance under such variations. Thus, the TD3 method, which provides nonlinear control, is implemented. In this paper, the inner current PI controllers are completely replaced with the TD3 algorithm. The TD3 agent operates in an environment where the observation space is continuous or discrete and the action space is continuous; in this work, both the observation and action spaces are continuous. As for the actor and critic, the actor is a deterministic policy actor, and the critic can be one or more Q-value function critics Q(S, A). The training process consists of three steps. In step 1, the actor and critic properties are updated at each iteration by the agent. In step 2, the experience buffer is used to store past experiences; from this buffer, the actor and critic randomly draw a mini-batch of experiences. In step 3, noise is applied to the action chosen by the policy using the stochastic noise model at each episode.
TD3 implementation has four stages. In stage 1, the actor and critic functions are chosen. The actor is a deterministic actor that, given state S and parameters θ, returns the action that maximizes the long-term reward. In addition to the actor, a target actor is developed to improve stability; the target actor parameters θ_t are updated using the latest actor parameters. There are two critics: the value critic and the target critic. The value critic takes state S and action A as inputs and provides the expectation of the long-term reward. The target critic is responsible for improving the stability of the optimization; its parameters are updated using the latest critic parameter values. In stage 2, the agent is created with the state and action specifications of the environment. As the agent has both actor and critic networks in the environment, in stage 3, the agent is trained to learn and update the actor and critic models at each episode. Training is implemented as per Algorithm 2 [53]. In stage 4, the target actor and target critic parameters are updated using one of the target update methods (periodic, smoothing, or periodic smoothing); for this research work, the smoothing update is chosen.
In the current work, the inner current controller on the rotor-side converter is replaced with the TD3 agent. The inputs for the TD3 agent are therefore i_dr, i_qr, i_dr_ref, i_qr_ref, ω_r, and ω_r_ref. Here, i_dr and i_qr are the rotor-side currents, and i_dr_ref and i_qr_ref are the outputs of the outer current control loop on the RSC. ω_r and ω_r_ref are the rotor speed and reference speed, respectively. The currents and the speed, together with their references, form the observation vector. The reward R is computed from the error between the reference and actual current values, as shown in Equation (44).
Therefore, there are six observations, which are the inputs to the TD3 agent, and two actions, v_dr and v_qr. With the observation and action spaces defined, the next step is to create the agent block, as shown in Figure 14.
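The observation vector and reward can be illustrated as follows. Since Equation (44) is not reproduced in this excerpt, a generic negative squared current-tracking error is used here as a stand-in for the paper's exact reward; the numeric values are placeholders.

```python
# Illustrative observation vector and reward for the RSC TD3 agent. The six
# observations match the text; the reward below is a generic stand-in for
# Equation (44), penalizing rotor-current tracking error.
def observation(i_dr, i_qr, i_dr_ref, i_qr_ref, w_r, w_r_ref):
    return [i_dr, i_qr, i_dr_ref, i_qr_ref, w_r, w_r_ref]

def reward(obs):
    i_dr, i_qr, i_dr_ref, i_qr_ref, _, _ = obs
    # penalize deviation of the rotor currents from their references
    return -((i_dr_ref - i_dr) ** 2 + (i_qr_ref - i_qr) ** 2)

obs = observation(0.9, 0.4, 1.0, 0.5, 1.02, 1.0)   # per-unit placeholders
r = reward(obs)
```

The reward is zero when both currents track their references exactly and grows more negative as the tracking error increases, which is the behavior the agent is trained to minimize.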

Algorithm 2 TD3 [53]
Initialize critic networks Q_φ1 and Q_φ2 with random parameters φ1, φ2
Initialize actor network π_θ with random parameters θ
Initialize target networks with the same parameters: φ′1 ← φ1, φ′2 ← φ2, θ′ ← θ
Initialize replay buffer β
For t = 1 to T do
    For the current state s, select an action with exploration noise a ∼ π_θ(s) + ε, ε ∼ N(0, σ). Here, ε is the stochastic noise from the noise model
    Execute the action and observe reward r and new state s′
    Store the experience (s, a, r, s′) in β (experience buffer)
    Sample a random mini-batch of N transitions (s, a, r, s′) from β
    If s′ is a terminal state, set the value function target y = r; else
        ã ← π_θ′(s′) + ε, ε ∼ clip(N(0, σ̃), −c, c)
        y ← r + γ min_{i=1,2} Q_φ′i(s′, ã)
    Update the critics by minimizing N⁻¹ Σ (y − Q_φi(s, a))²
    If t mod d then
        Update θ by the deterministic policy gradient
        Update the target networks (smoothing): φ′i ← τφi + (1 − τ)φ′i, θ′ ← τθ + (1 − τ)θ′
    End If
End For

For the TD3 agent, two critic networks are created as deep neural networks with the state and action as inputs and one output. The networks use fully connected layers for both the state and action paths, and the actor networks are likewise built from fully connected layers. The TD3 agent determines the action to be taken for a given state. To train the agent, the TD3 agent is configured with its discount factor, experience buffer length, mini-batch size, target smoothing factor, and target update frequency. Each training run is allowed up to 1000 episodes, and training stops early if the agent receives a cumulative average reward ≥ −200 over 100 consecutive episodes.

Training progress with the TD3 agent is shown in Figure 15: the average reward per episode increases, and the policy becomes stable after about 40 episodes. The training process was executed for 32 h, as it reached the specified maximum number of episodes. In Figure 15, the thick blue line represents the average reward, the light blue line indicates the episode reward, and the yellow line shows the episode Q0.
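The two defining updates of TD3, the clipped double-Q target and the smoothing (soft) target update of stage 4, can be sketched numerically. The critics and actor below are toy stand-in functions, not the fully connected networks described above; τ and γ are the usual smoothing factor and discount factor.

```python
# Minimal numeric sketch of the TD3 target computation and the smoothing
# (soft) target update. The critics and actor are illustrative stand-ins.
def td3_target(r, s_next, q1_t, q2_t, actor_t, gamma=0.99, noise=0.0):
    # clipped double-Q: take the minimum of the two target critics
    a_next = actor_t(s_next) + noise
    return r + gamma * min(q1_t(s_next, a_next), q2_t(s_next, a_next))

def smooth_update(target_params, params, tau=0.005):
    # theta_target <- tau*theta + (1 - tau)*theta_target
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, params)]

# toy stand-ins for the target actor and the two target critics
actor_t = lambda s: 0.5 * s
q1_t = lambda s, a: s + a
q2_t = lambda s, a: s + 2.0 * a

y = td3_target(r=1.0, s_next=2.0, q1_t=q1_t, q2_t=q2_t, actor_t=actor_t)
new_t = smooth_update([0.0, 0.0], [1.0, 1.0], tau=0.005)
```

Taking the minimum of the two target critics counteracts the overestimation bias of a single critic, and the small τ makes the target networks trail the learned networks slowly, which is what stabilizes the optimization.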

Simulation Results with the Newly Designed PSS
The model discussed in this paper was built in MATLAB/Simulink; the topology is shown in Figure 13. Extensive works based on simulations have been published [4,10,35,55]. A small-signal model was developed to identify the state variables that affect the stability of the overall system. From the small-signal model, the sensitive variables observed are i_dr, i_qr, i_qs, x_7, v_ds, and θ_pll. The input given to the power system stabilizer is the frequency ω_pll. One of the main observations from the small-signal stability analysis is that the state variables associated with the inner current control loop tend to move faster towards the real axis, making the system unstable. This is also observed with the large-signal model when small faults are applied. The performance of the system with and without a PSS, as well as the low-frequency mode of the system, can be evaluated from the same small-signal model. The use of the eigenvalue distribution of systems with and without a PSS to identify the ultra-low-frequency oscillation mode can be observed in [36]. The shift of the eigenvalues of (E, A) with the new PSS implemented is noted in Table 1 below. x_1 and θ_f are the newly added signal inputs from the PSS. The time constants for the washout filter and compensator are determined from the small-signal model by observing the stability of the system. V_dc and x_5 are the signal inputs of the outer current control loop of the grid-side converter. x_2 and x_4 are the input signals to the PI controllers (PI_2 and PI_4) in the inner current control loop of the RSC, and x_7 is the input signal to the PI controller (PI_7) in the inner current control loop of the GSC. The controller data, DFIG, and complete system parameters are presented in Appendix A. The results for the power system stabilizer with the transformation technique are shown below in Figure 16.
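The stability criterion underlying the eigenvalue analysis above can be illustrated with a small numeric sketch: a mode is stable when its eigenvalue has a negative real part, and eigenvalues drifting towards the imaginary axis (real part approaching zero) indicate a poorly damped low-frequency oscillation. The 2×2 matrix here is a generic lightly damped oscillator, not the paper's DFIG state matrix.

```python
# Hedged sketch of the small-signal stability check: every eigenvalue of the
# state matrix must have a negative real part. A 2x2 characteristic-polynomial
# solve is used; the actual study uses the full DFIG state matrix.
import cmath

def eig_2x2(a, b, c, d):
    # eigenvalues of [[a, b], [c, d]] from lambda^2 - tr*lambda + det = 0
    tr = a + d
    det = a * d - b * c
    disc = cmath.sqrt(tr * tr - 4.0 * det)
    return (tr + disc) / 2.0, (tr - disc) / 2.0

def is_stable(eigs):
    return all(lam.real < 0.0 for lam in eigs)

# a lightly damped oscillatory mode: eigenvalues -0.5 +/- j*omega
eigs = eig_2x2(-1.0, -10.0, 1.0, 0.0)
```

The complex-conjugate pair with a small negative real part corresponds to a damped oscillation; if a control change pushed that real part to zero or beyond, the mode would become sustained or growing, which is the behavior the PSS is designed to prevent.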

Fault Analysis with Transformation PSS
The plot of the transformation PSS during the fault condition is shown below. It can be observed that the oscillations are damped even during the fault condition. This can be more clearly observed from the power plot, in which the oscillations are damped, and the settling time is also much less. From this discussion, it can be concluded that the PSS is working effectively during the fault condition.
For the fault analysis, a three-phase fault was applied to the system for 2 s, between t = 30 s and t = 32 s. The results obtained for the active and reactive power are shown in Figure 18a,b, respectively. It can be observed that the PSS with voltage as input is able to damp the oscillations during the fault, and the settling time after the fault is also much shorter. Figure 19 provides the simulation results for the system during a single-phase short-circuit fault, applied for 5 s between t = 60 s and t = 65 s. It can be observed that the PSS is able to damp the oscillations.

Results with Q-Learning Algorithm
Next, a Q-learning algorithm was implemented on the developed PSS at the RSC and on the outer current control loop of the GSC with a variable wind speed. The algorithm was applied at P_s_ref and V_dc_ref. Since the control objective is to stabilize the active and reactive power output of the system without low-frequency oscillations, this paper uses parameters directly related to the change in active power as the state of the agent and the control output of the controller as its action to train an appropriate control strategy. Testing showed that when the wind speed fluctuates over a small range, it is easier to stabilize the system through the reinforcement learning controller; therefore, the wind speed in this experiment was designed to be between 14 m/s and 15 m/s.
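The tabular Q-learning update behind this controller can be sketched as follows. The two discretized states and actions are illustrative indices, not the paper's actual quantization of the power-related state variables.

```python
# Minimal tabular Q-learning update of the kind used for the model-free PSS
# and DC-link regulator. States/actions are illustrative discretized indices;
# alpha is the learning rate and gamma the discount factor.
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    # standard Q-learning rule:
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[s_next])
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q[s][a]

Q = [[0.0, 0.0], [0.0, 0.0]]   # toy table: 2 states x 2 actions
v = q_update(Q, s=0, a=1, r=1.0, s_next=1)
```

Because the update uses only observed transitions and rewards, no model of the DFIG dynamics is required, which is what makes the resulting controller model-free.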
The output waveform of the active power of the power system after several iterations of the reinforcement learning controller is shown in Figure 20, observed at iteration 18.
In Figure 20, it can be observed that the generated power tends to be smooth without any oscillations, which indicates that the QL algorithms implemented were able to learn from the system and provide the desired output.
However, the final active power is still not smooth. The system was therefore iterated until the desired active power was generated without any oscillations, which was observed at the 21st iteration. Figure 21 shows the response of the active power at the 25th iteration. After this point, the system response does not change, as the Q-learning agent has learned the system in its various states and provides the required action in each of them. From Figures 20-24, it is observed that under the action of the trained reinforcement learning controller, the system can output stable active power without low-frequency oscillations. The total time taken for the 25 episodes, or iterations, was 90 min.

Figure 22 shows the control of the reactive power at the 25th iteration; it can be observed that the QL agent is able to learn from the environment and control the reactive power as well. The output speed and power as the wind speed changes are plotted in Figure 23. As can be seen in Figure 23a, when the wind speed changes, the speed and active power of the power system remain stabilized.

Comparing Q-Learning Algorithm with PI Controllers
The Q-learning algorithm developed completely replaces the PI controllers both on the RSC and GSC and the power system stabilizer at the RSC. From the results, it is observed that the model-free algorithm can learn from the system environment and generate similar results to those of the PI controller.
In Figure 24, the dotted red line indicates the curve for active power with the QL algorithm, and the continuous curve shows the active power with PI controllers. It can be observed that the Q-learning algorithm controls the active power similarly to the PI controller. The advantage of the developed Q-learning model over the PI controller design is that the Q-learning agent provides the controller parameters dynamically, whereas the gains in the PI controller are fixed; moreover, the Q-learning algorithm is implemented with varying wind speeds, whereas the PI controller is implemented with a constant wind speed. Figure 25 shows a comparison of the reactive power for the developed PSS with the PI controller and Q-learning algorithm.

Comparing Q-Learning Algorithm with PI Controllers under Fault Conditions
To evaluate the control of the Q-learning algorithm under fault conditions, a three-phase fault was applied at the grid end, before the transformer. The clearing time was 2 s, with the fault starting at 30 s and clearing at 32 s. Figures 26 and 27 show the active power and reactive power plots, respectively, under fault conditions. In Figure 26, the red dotted line is the active power with the Q-learning algorithm, and the continuous line shows the curve for the PI controller. It can be observed that the damping of the oscillations is more effective with the Q-learning algorithm than with the PSS with PI controllers. In Figure 27, the red dotted curve indicates the reactive power with the Q-learning algorithm, and the continuous curve is that with the PI controller; again, the damping of the oscillations is more effective with the Q-learning algorithm.

The proposed Q-learning algorithm can be compared with the Q-learning-based PSS from [35], where the case study is a four-machine two-area system and a Q-learning-based PSS is employed to damp the frequency oscillations. Similar behavior is observed in the current paper, although the systems are completely different; moreover, in the current paper, the performance of Q-learning is also evaluated under fault conditions. Figures 24-27 provide a clear comparison between the PSS with the PI controller and Q-learning; here, the PSS with the PI controller can be treated as a test case for evaluating the performance of the Q-learning-based model. In Figures 24-27, the proposed Q-learning-based model reaches the steady state in a shorter time than the classical PI-controller-based PSS.

Comparing TD3 Agent with Q-Learning Algorithm
Simulations were carried out with large variations in wind speed using PI, Q-learning, and the TD3 agent. The wind speeds used are from the Ottawa region for the whole month of June 2019.
The wind speed variation shown in Figure 28 was used as an input to the DFIG WECS. With such varying wind speeds, it is hard for a regular PI controller to maintain the frequency and, at the same time, provide a stable output to the grid. The average wind speed variations can be observed in Figure 29. The active and reactive power for the PSS with QL versus the PSS with the TD3 agent can be seen in Figures 30 and 31, respectively. It can be observed that neither the active power nor the reactive power is stable under such large wind speed variations with either the PSS with the PI controller or the Q-learning agent. However, once the TD3 agent implemented on the inner current control loop of the RSC learns the system dynamics, the agent can control the generator speed and inner currents with respect to their reference values. Therefore, with large variations in wind speed and without an inner PI controller, the TD3 agent enables the system to produce stable active and reactive power by mitigating the low-frequency oscillations. The system with the PI controller does not even respond to large variations in wind speed, as it can work only for a constant wind speed.

Limitation and Future Work
The wind DFIG system with the PSS and PI controllers works well for a constant wind speed, as observed under both normal and fault conditions. If a varying wind speed is provided, the PSS cannot suppress the oscillations, and the system is very unstable; the PSS with the PI controller therefore cannot be adopted for large variations in wind speed. To overcome this limitation, reinforcement learning is used to learn the system dynamics and mitigate the oscillations. First, the PSS and the PI controllers on both the RSC and GSC are replaced with the QL agent. The Q-learning algorithm developed in this paper is implemented for a very small range of wind speed variations and does not respond accurately to large variations; the observation is that the QL agent is not successful in learning the system dynamics under large variations in wind speed. To overcome this limitation, advanced RL techniques such as actor-critic methods (A2C or A3C) or policy gradient methods such as Deep Deterministic Policy Gradient (DDPG) are helpful, as these methods are robust in learning about the environment; to handle large variations in wind speed, the TD3 method is introduced in this research. The use of an A3C-based strategy can be observed in [36], where the PSS parameters were tuned to suppress low-frequency oscillations for a 10-machine 39-bus transmission network. Thus, A2C or A3C could also be adopted to suppress low-frequency oscillations under large variations in wind speed in DFIG-based WECS. As discussed at the beginning of Section 7, the complete model was built in the MATLAB/Simulink environment; no experimental hardware was involved in this research. The complete results and comparisons are discussed in Section 7.

Conclusions
In this work, a DFIG-equipped WECS was developed, and closed-loop controllers were designed to damp the oscillations by controlling both the active and reactive power. First, a small-signal model was developed to identify the sensitive parameters that affect the stability of the system, and from this model, the proportional and integral gains were derived. The derived PI gains were used in the inner current control loops of the large-signal model. Next, the PI controllers were completely replaced with the Q-learning-based RL technique: a Q-learning-based model-free power system stabilizer and DC-link voltage regulator were developed on the rotor-side controller and grid-side controller, respectively. From the results, it is observed that the designed model-free PSS can damp the oscillations under small wind speed variations and fault conditions. The conclusion is that the QL agent can learn from the environment and control the active power, helping the system operate under normal conditions with variable wind speeds while making the system model-free. However, the limitation of the QL method is that it cannot provide control under large variations in wind speed. To overcome this limitation, an actor-critic method, the TD3 method, is introduced. From the results, it can be observed that the TD3 agent can learn the system dynamics under large variations in wind speed and deliver the desired active and reactive power to the grid.