A New Fuzzy Reinforcement Learning Method for Effective Chemotherapy

Abstract: A key challenge for drug dosing schedules is the ability to learn an optimal control policy even when there is a paucity of accurate information about the system. Artificial intelligence has great potential for shaping a smart control policy for the dosage of drugs for any treatment. Motivated by this issue, in the present research paper a Caputo–Fabrizio fractional-order model of cancer chemotherapy treatment was elaborated and analyzed. A fixed-point theorem and an iterative method were implemented to prove the existence and uniqueness of the solutions of the proposed model. Afterward, in order to control cancer through chemotherapy treatment, a fuzzy-reinforcement learning-based control method that uses the State-Action-Reward-State-Action (SARSA) algorithm was proposed. Finally, to assess the performance of the proposed control method, simulations were conducted for young and elderly patients and for ten simulated patients with different parameters. Then, the results of the proposed control method were compared with Watkins's Q-learning control method for cancer chemotherapy drug dosing. The results of the simulations demonstrate the superiority of the proposed control method in terms of mean squared error, mean variance of the error, and the mean square of the control action; in other words, in terms of the eradication of tumor cells, the preservation of normal cells, and the amount of drug used during chemotherapy treatment.


Introduction
Cancer is one of the most hazardous and fatal diseases throughout the world [1]. This disease is caused by the abnormal division and spreading of cells that destroy the patient's body and may lead to death. Though many research efforts are devoted to precisely understanding the interaction between the immune system and tumor cells, the treatment of cancer is still one of the most challenging issues in modern medicine [2,3].
Based on the patient's condition and the type of cancer, there are several treatments for tackling cancer, including radiotherapy, surgery, chemotherapy, immunotherapy, and so forth [4]. Cancer treatment can be challenging, as it can have a range of side effects, including fatigue, nausea, and hair loss. Managing these side effects is an important part of cancer care. Among the various treatments, chemotherapy is one of the most effective for annihilating cancerous cells. For this reason, in this research the chemotherapy method was chosen to treat cancer. Nonetheless, in all chemotherapy treatments, the applied drugs not only destroy the cancerous cells but also affect the healthy cells and kill some of them. Hence, it is crucial to make sure that the patient can tolerate the side effects of the treatment [5]. These side effects limit the dosage of the drugs [6]. Hence, when prescribing drugs, the aim is to decrease the number of cancerous cells as much as possible with minimum side effects [7].
There are many factors which define the treatment schedule and drug dosing. Some of these factors are the weight of the patient, the level of white blood cells, the patient's age, and the stage of the tumor; for this reason, clinicians follow certain established standards to define the therapy type and drug dosage for each patient. However, this approach has some limitations which have been acknowledged by the scientific and clinical communities [8,9]. Moreover, evaluating the effectiveness of the chemotherapy plan and its feasibility is of significant importance [9]. Clinical trials are a reliable way to evaluate the effectiveness of chemotherapy plans, but they have limitations such as high costs and long trial times, and they are difficult to conduct. Keeping the above-mentioned limitations in mind, devising an efficient chemotherapy treatment plan is desirable.
Mathematical models have been used as valuable tools for understanding the dynamics of cancers [10]. Indeed, mathematical modeling can play a pivotal role in understanding the dynamics of a disease [11–15]. One common approach to modeling cancer is to use ordinary differential equations (ODEs), which describe the time evolution of a system. These models can represent the growth and spread of cancer cells and the response of the immune system to the cancer. Researchers can then use these models to study the long-term and short-term dynamics of cancer and to identify potential therapeutic strategies. Finding an accurate model of cancer dynamics could pave the way for long- and short-term prediction of the disease as well as the design of therapies [16]. This field of study aims not only to anticipate the spread of cancerous cells but also to control the disease as effectively as possible [17,18]. Much research has used mathematical modeling to understand cancer, and various approaches have been taken in this field [19–21].
Fractional calculus is an excellent tool for describing the hereditary properties and memory of systems, and it has been widely utilized in various fields of study including economics, biology, ecology, and engineering [22–28]. In this regard, fractional-order models of cancer have started to attract researchers' attention [29,30]. Although early research on fractional-order calculus was based on the Caputo or Riemann-Liouville fractional-order derivative, it has recently been shown that these approaches have some disadvantages, such as a singularity at the endpoint of the interval of definition [31–33]. To tackle this issue, other definitions of fractional derivatives have recently been introduced [34–39]. Caputo and Fabrizio offered a new derivative with a nonsingular kernel [32], namely the Caputo-Fabrizio derivative. The Caputo and Caputo-Fabrizio derivatives both generalize the classical derivative to non-integer order. The principal difference between them is that the Caputo derivative is based on a power-law kernel, while the Caputo-Fabrizio derivative is based on an exponential-decay kernel [32,40]. As a result, the two derivatives may behave differently when applied to different types of functions, and they may be used to model different types of physical phenomena.
There are several research studies in the literature that demonstrate applications of the Caputo-Fabrizio derivative to different systems such as biological processes [41–43]. In Ref. [44], the underlying physical meaning of the nonsingular kernel as well as applications of fractional differentiation operators with a nonsingular kernel were presented.
Over the past several years, many control strategies have been applied to cancer chemotherapy drug dosing. In [45], targeted chemotherapy was offered for tumor-immune interaction. As has been studied in [46,47], in addition to the mathematical interpretation of tumor growth, scheduling appropriate treatment strategies has been an important issue, and control-design strategies can be utilized to optimize drug dosing mathematically. To this end, many research studies have worked on these subjects. De Pillis and Radunskaya proposed a dynamical model for immunotherapy and chemotherapy schedules in 2001 [48]. In 2003, bang-bang type control was employed with chemotherapy to control tumor growth [47]. In 2007, De Pillis et al. used linear and quadratic control to cure a tumor with the use of chemotherapy [49]. Also, through Pontryagin's maximum principle, Ghaffari and Naserifar designed optimal therapeutic protocols to schedule immunotherapy [50]. On the other hand, the cost-effectiveness of the treatment strategies has been investigated in [51]. Drug resistance, which is a vital phenomenon in cancerous tumors, has been examined in [52]. In addition, the conditions under which an siRNA treatment eradicates the tumor burden were investigated in [53], while a combination of chemotherapy and anti-angiogenic agents was utilized to cure the disease in [54]. In [55], robust adaptive control with an extended Kalman filter observer was presented to adjust the drug dosage. An optimal control strategy based on a linear time-varying approximation technique was proposed in [56]. A model-free, closed-loop method for chemotherapy based on reinforcement learning was proposed in [57]; to be precise, the authors developed an optimal controller for cancer chemotherapy treatment using the Q-learning algorithm. In [47], the authors modeled the chemotherapy treatment based on optimal control theory, and the objective was to minimize the tumor cells while keeping the healthy cells above a fixed level.
The use of state-of-the-art automatic control methods can help to improve the effectiveness and efficiency of various processes and systems, leading to better outcomes and more sustainable solutions [58–67]. Among the stated control strategies, intelligent controllers have significant advantages [68–73]. Artificial intelligence combines a wide variety of new technologies to give systems the ability to make decisions in new and unfamiliar conditions [74,75]. In some applications, due to the high value of their tasks and their remarkable risks, implementing a reliable controller is the main concern. Where the degree of uncertainty is high, classical methods of control may fail [76,77]. Hence, artificial intelligence-based controllers are rational choices for such systems. Moreover, optimal controls based on artificial intelligence are helpful for drug dosing because of their ability to consider various aspects of the biological system in the optimization function. For example, through intelligent approaches, we are able to design treatments which take into account the number of healthy and infected cells as well as the cost and side effects of the drugs without having any mathematical model. Reinforcement learning is one of the most popular learning techniques; it explores the system's response to possible actions and learns the optimal action by evaluating how the last action pushed the system towards the desired situation [78–80].
As mentioned, fractional calculus is very advantageous for the modeling of real-world processes [81,82]. Motivated by this background, in the present study we investigated a Caputo-Fabrizio model of cancer, which is a new model; studies of such systems are quite rare. Also, few studies are dedicated to the evaluation of reinforcement learning methods in chemotherapy drug dosing; to the best of our knowledge, no studies have been done on cancer chemotherapy drug dosing using fuzzy-reinforcement learning-based optimal control. These issues motivated the current study. In this paper, a Caputo-Fabrizio fractional-order model of cancer chemotherapy treatment was analyzed. The existence and uniqueness of the solutions of the presented model were proven. Then, a fuzzy-reinforcement learning-based control method was proposed in order to control the cancer chemotherapy treatment, and the effective performance of the proposed method was illustrated.
The main contributions of this research paper are as follows: First, a Caputo-Fabrizio fractional-order model of cancer chemotherapy treatment was elaborated and analyzed, which is a novel model for this system. Second, a fixed-point theorem and an iterative method were implemented to prove the existence and uniqueness of the solutions of the proposed model. Finally, a fuzzy-reinforcement learning-based control method that uses the SARSA algorithm was proposed to control cancer through chemotherapy treatment.
The rest of the paper is organized as follows: In Section 2, the preliminaries for this work are given. In Section 3, the Caputo-Fabrizio fractional model of cancer chemotherapy is elaborated, and a sensitivity analysis is carried out for the parameters of the system. The fuzzy-reinforcement-learning-based controller is described in Section 4. Section 5 is devoted to the numerical simulations; finally, in Section 6, conclusions are drawn and discussed.

Preliminaries
As has been shown, the kernel of the Caputo fractional derivative is singular at the endpoint of the interval of integration [83]; therefore, the Caputo derivative is not well suited to describing the memory effect accurately in real systems. A new fractional derivative has been proposed that has no singularity in its kernel [32]. This section summarizes the definitions and properties of the Caputo-Fabrizio fractional derivative that are used in this paper. Let $L^2(a, b)$ denote the space of square-integrable functions on the interval $(a, b)$.
Definition 1. If the function $x$ is in $H^1(t_0, t)$ and the derivative order is $\alpha \in (0, 1)$, then the Caputo-Fabrizio derivative is defined as [32]: where $A(\alpha)$ is a normalization function that satisfies $A(0) = A(1) = 1$. Furthermore, if $x \notin H^1(a, b)$, then the Caputo-Fabrizio derivative is defined as:
Remark 1 [32]. By considering $\beta = (1 - \alpha)/\alpha$ with $\beta \in (0, \infty)$, we have $\alpha = 1/(1 + \beta)$ with $\alpha \in (0, 1)$. Now, Equation (2) can be written in the following form: where $B(\beta)$ is the normalization function corresponding to $A(\alpha)$, with $B(0) = B(\infty) = 1$.
A modified definition of the Caputo-Fabrizio operator has been proposed by Nieto and Losada [31].
Definition 2. The Caputo-Fabrizio fractional integral of order $\alpha$ of a function $x(t)$ is defined as:
Definition 3 [31]. The Caputo-Fabrizio fractional derivative of order $\alpha$ of a function $x(t)$ is defined as: Moreover, the Caputo-Fabrizio fractional integral of order $\alpha$ is given as: In this definition, the normalization function is taken to be $A(\alpha) = 2/(2 - \alpha)$.
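For orientation, the forms below are the ones in which these operators are commonly written in the literature. They are a sketch: the lower limit $t_0$, the symbol $A(\alpha)$, and the absence of equation numbers are assumptions made to match the surrounding text and may differ from the paper's own typesetting.

```latex
% Caputo-Fabrizio derivative of order 0 < \alpha < 1 for x \in H^1(t_0, t)
{}^{CF}D_t^{\alpha} x(t) = \frac{A(\alpha)}{1-\alpha}
  \int_{t_0}^{t} x'(\tau)\,
  \exp\!\left[-\frac{\alpha\,(t-\tau)}{1-\alpha}\right] d\tau

% Associated fractional integral with a general normalization A(\alpha)
{}^{CF}I_t^{\alpha} x(t) = \frac{2(1-\alpha)}{(2-\alpha)A(\alpha)}\, x(t)
  + \frac{2\alpha}{(2-\alpha)A(\alpha)} \int_{0}^{t} x(s)\, ds

% With A(\alpha) = 2/(2-\alpha) the integral reduces to
{}^{CF}I_t^{\alpha} x(t) = (1-\alpha)\, x(t) + \alpha \int_{0}^{t} x(s)\, ds
```

The last line follows by direct substitution, since $(2-\alpha)A(\alpha) = 2$ under that choice of normalization.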

Caputo-Fabrizio Fractional Model of Cancer Chemotherapy
In this section, in order to derive the Caputo-Fabrizio fractional model of cancer chemotherapy, the nonlinear four-state model given in [47,84–86] is considered. In this model, the first state is $N(t)$, which indicates the number of normal cells; the second state is $T(t)$, which represents the number of tumor cells. The third state is $I(t)$, which represents the number of immune cells. Finally, the last state is the drug concentration $C(t)$. The original integer-order model [47,87] is given in Equation (9): The initial conditions for this model are assumed to be as follows: Replacing the first-order derivatives on the left-hand side of Equation (9) with the Caputo-Fabrizio fractional derivative defined in Equation (7) leads to our new Caputo-Fabrizio fractional model of cancer chemotherapy. The new model is written in the following equations: Moreover, the initial condition for this system is: In our new model of cancer chemotherapy, it has been assumed that the fractional order of each state variable may be different, with $0 < \alpha_i < 1$, $i = 1, 2, \ldots, 4$.
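To make the four-state structure concrete, the following sketch integrates an integer-order model of this type with a forward-Euler step. It assumes the widely cited de Pillis-Radunskaya form of the equations; the parameter values, the names `PARAMS`, `rhs`, and `simulate`, and the step size are illustrative assumptions, not the values of Table 1 or the authors' MATLAB code. Note that the model parameter `alpha` here is unrelated to the fractional order.

```python
import math

# Illustrative parameter values (assumed; NOT taken from Table 1 of the paper).
PARAMS = dict(a1=0.2, a2=0.3, a3=0.1, b1=1.0, b2=1.0,
              c1=1.0, c2=0.5, c3=1.0, c4=1.0,
              d1=0.2, d2=1.0, r1=1.5, r2=1.0,
              s=0.33, alpha=0.3, rho=0.01)


def rhs(state, u, p=PARAMS):
    """Right-hand side of the assumed integer-order four-state model."""
    N, T, I, C = state
    kill = 1.0 - math.exp(-C)  # saturating drug-kill fraction
    dN = p["r2"] * N * (1 - p["b2"] * N) - p["c4"] * T * N - p["a3"] * kill * N
    dT = (p["r1"] * T * (1 - p["b1"] * T) - p["c2"] * I * T
          - p["c3"] * T * N - p["a2"] * kill * T)
    dI = (p["s"] + p["rho"] * I * T / (p["alpha"] + T)
          - p["c1"] * I * T - p["d1"] * I - p["a1"] * kill * I)
    dC = -p["d2"] * C + u  # drug decays at rate d2 and is infused at rate u
    return (dN, dT, dI, dC)


def simulate(state0, dose, t_end=10.0, dt=0.001):
    """Forward-Euler integration with a constant infusion rate `dose`."""
    state = tuple(state0)
    for _ in range(int(round(t_end / dt))):
        deriv = rhs(state, dose)
        state = tuple(x + dt * dx for x, dx in zip(state, deriv))
    return state
```

A call such as `simulate((1.0, 0.25, 0.1, 0.0), dose=0.5)` returns the state `(N, T, I, C)` at `t_end`. A fractional-order simulation would need a dedicated scheme for the Caputo-Fabrizio operator and is not covered by this sketch.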

Existence and Uniqueness of Solutions of the Cancer Chemotherapy Model
This section is devoted to the investigation of the existence and uniqueness of the solutions of our Caputo-Fabrizio fractional model of cancer chemotherapy in Equation (11) with the initial conditions given in Equation (12). To this end, fixed-point theory has been used [88,89].
Applying the Caputo-Fabrizio fractional integral operator defined in Equation (8) to Equation (11) and taking the initial conditions in Equation (12) into consideration, the following equation is obtained: Then, the following kernels are defined: By taking Equation (14) into consideration and calculating the right-hand side of Equation (13) using the definition of the Caputo-Fabrizio fractional-order integration in Equation (6), the following equation is obtained: where $\Sigma(\alpha)$ and $\sigma(\alpha)$ are defined as follows:
Remark 3. The above-mentioned kernels $K_1, K_2, \ldots, K_4$ satisfy the Lipschitz condition and are contraction mappings if the following inequality is satisfied: In the proof of Remark 3, the following assumption has been made:
Proof. Let $N$ and $N_1$ be two different functions; then, by taking the kernel $K_1$ into consideration, we have: For the second kernel we have: Additionally, for the third and fourth kernels, $K_3$ and $K_4$, the following inequalities can be obtained: Consequently, the Lipschitz condition is satisfied for all of the kernels defined in Equation (14). Furthermore, because $0 \le L = \max\{\delta_1, \delta_2, \delta_3, \delta_4\} < 1$, the kernels are contractions.
By considering Equation (16), the following recursive formula can be obtained: Moreover, the initial components for the above-mentioned recursive formulas are taken to be as follows: Equation (23) can be written in the following form: where $\mu_i(t)$, $\xi_i(t)$, $\omega_i(t)$ and $\kappa_i(t)$ are defined as follows: Now, the following inequalities can be derived for the functions defined in Equation (26): Equation (19) showed that the kernel $K_1$ satisfies the Lipschitz condition. Therefore, for Equation (27) we have: The same is true for the other functions defined in Equation (26); for this reason, we can readily obtain the following inequalities:
Remark 4. The Caputo-Fabrizio fractional model of cancer chemotherapy (Equation (11)) has a system of solutions if the following inequalities hold at a time $t_1 > 0$:
Proof. Having established the existence and smoothness of the functions defined in Equation (25), we now show the existence of a system of solutions for the model in Equations (11) and (12). It has been assumed that $N(t)$, $T(t)$, $I(t)$ and $C(t)$ are bounded: $N(t) \le \theta_1$, $T(t) \le \theta_2$, $I(t) \le \theta_3$ and $C(t) \le \theta_4$. Moreover, we have proven that each of the defined kernels satisfies the Lipschitz condition. Consequently, we can obtain the following results using Equations (28) and (29): Next, we demonstrate that the functions $N_n(t)$, $T_n(t)$, $I_n(t)$ and $C_n(t)$ defined in Equation (25) converge to solutions of the model (Equation (11)) with the initial condition (Equation (12)), by defining the remainder terms after $n$ iterations as follows: To prove that the functions $N_n(t)$, $T_n(t)$, $I_n(t)$ and $C_n(t)$ converge to the solutions of the model (Equation (11)), we must show that the remainder terms converge to zero as $n \to \infty$. Using the Lipschitz condition for the kernels defined in Equation (14) and the triangle inequality, the following results are obtained: Continuing the above process leads to the following inequality: Now, by taking Equation (30) into account and taking the limit as $n \to \infty$ on both sides of Equation (34) at time $t_1$, the following result is obtained: It can be seen that the right-hand side of Equation (35) converges to zero; for this reason, it can be concluded that $X_n(t) \to 0$ as $n \to \infty$. Proceeding in the same manner for the other remainder terms defined in Equation (32), the following results are obtained: Equations (35) and (36) demonstrate the existence of a system of solutions for the model in Equations (11) and (12).
Remark 5. The system of solutions of the model in Equations (11) and (12) is unique if the following inequality is satisfied:
Proof. To prove that the model described in Equation (11) with the initial conditions in Equation (12) has a unique system of solutions, we assume that one system of solutions of the model is $\{N(t), T(t), I(t), C(t)\}$ and consider another system of solutions, denoted by $\{N_1(t), T_1(t), I_1(t), C_1(t)\}$. Then, using Equation (16), we have: Then, we have: As we proved, the kernels satisfy the Lipschitz condition; as a result, By subtracting the right-hand side of Equation (40) from both sides of this inequality, the following inequality is obtained: Considering Equation (37) and the fact that a norm is nonnegative, the following result is obtained: In the same way, for $T(t)$, $I(t)$ and $C(t)$, the following equations are obtained: Equations (42) and (43) imply Equation (44), which shows that the system of solutions for the model in Equation (11) with the initial conditions in Equation (12) is unique. This completes the proof.
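The uniqueness argument can be summarized in a few lines for the first state. The sketch below is an illustration under assumptions: it takes the integral form of the first equation to be a local term weighted by $\Sigma(\alpha)$ plus an integral term weighted by $\sigma(\alpha)$, and it writes $\delta_1$ for the Lipschitz constant of $K_1$. The exact inequalities and their numbering in the paper may differ.

```latex
% Assumed integral form of the first state equation
N(t) - N(0) = \Sigma(\alpha)\, K_1(t, N) + \sigma(\alpha) \int_0^t K_1(y, N)\, dy

% Difference of two solutions N and N_1 with the same initial value,
% using \| K_1(t, N) - K_1(t, N_1) \| \le \delta_1 \| N - N_1 \|
\| N - N_1 \| \le \Sigma(\alpha)\, \delta_1 \| N - N_1 \|
  + \sigma(\alpha)\, \delta_1\, t\, \| N - N_1 \|

% Collecting terms on the left-hand side
\| N - N_1 \| \left( 1 - \Sigma(\alpha)\, \delta_1 - \sigma(\alpha)\, \delta_1\, t \right) \le 0

% If 1 - \Sigma(\alpha)\delta_1 - \sigma(\alpha)\delta_1 t > 0, then, since a norm
% is nonnegative, \| N - N_1 \| = 0, i.e. N = N_1.
```

The same three steps apply to $T(t)$, $I(t)$ and $C(t)$ with their respective kernels and Lipschitz constants.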

Sensitivity Analysis
In this section, a sensitivity analysis of the chemotherapy drug-dosing system given in Equation (11) is conducted. The system parameters used for the analysis are given in Table 1, and the results of the sensitivity analysis are given in Figure 1, which shows the effect of variations in the patient's parameters. Notably, the per-unit growth rate of tumor cells ($r_1$), the immune cell influx rate ($s$), and the tumor cell competition term, i.e., the competition between normal and tumor cells ($c_3$), are the three parameters whose variations affect the number of normal cells the most. Four parameters have the greatest impact on the number of tumor cells: the same three as for normal cells, plus the reciprocal carrying capacity of normal cells ($b_2$). Moreover, the number of immune cells is highly affected by variations in the per-unit growth rate of tumor cells ($r_1$), the reciprocal carrying capacity of normal cells ($b_2$), and the tumor cell competition term ($c_3$).

Methodology
In this section, we describe the proposed fuzzy-reinforcement-learning-based controller (FRLC), whose aim is to control the number of tumor cells; in other words, the aim of the proposed controller is to reach the desired number of tumor cells, $T(t) = T_{desire}$, from a non-zero initial number of tumor cells $T(0) > 0$.
Expected-SARSA is a variation of SARSA whose update rule has lower variance than that of the SARSA algorithm; it can be regarded as an on-policy version of Q-learning, and it performs better than SARSA in online applications. On-policy methods have some advantages over off-policy learning algorithms, such as stronger convergence guarantees when combined with function approximation, whereas off-policy methods can diverge in those cases [90–92]. On-policy methods perform very well in online applications because the policy being estimated is improved iteratively, and the agent behaves according to this same policy.
The advantage of data-driven controllers is that an accurate model of the system is not required, whereas selecting the best action for each rule in a fuzzy inference system demands accurate knowledge about the system. The proposed control method, however, learns the best action for each rule; therefore, it has the advantages of data-driven controllers and fuzzy controllers simultaneously.

Fuzzy Controller Based on Expected SARSA Learning (FESL)
In this paper, in order to control the number of tumor cells, a fuzzy controller is proposed whose learning process is based on the Expected-SARSA algorithm. The fuzzy controller uses the Expected-SARSA algorithm to find the best map between states and actions. In other words, using the Expected-SARSA algorithm, the best action for each rule of the fuzzy inference system is obtained. Let the rules of the fuzzy inference system be as follows:
$R_i$: if $x_1$ is $L_{i1}$ and … and $x_n$ is $L_{in}$ then ($a_i$ is $a_{i1}$) or ($a_i$ is $a_{i2}$) or … or ($a_i$ is $a_{ik}$),
in which $R_i$ is the $i$-th rule of the fuzzy inference system; $s$ is the $n$-dimensional input state vector, defined as $s = x_1 \times x_2 \times \ldots \times x_n$; $L_i$ is the $n$-dimensional strictly convex and normal fuzzy set of the $i$-th rule with a unique center, defined as $L_i = L_{i1} \times L_{i2} \times \ldots \times L_{in}$; $a_i$ is the consequent action of the $i$-th rule; $a_{i1}$ is the first candidate action for the $i$-th rule; $a_{i2}$ is the second candidate action; and finally, $a_{ik}$ is the $k$-th candidate action for the $i$-th rule. The aim of the learning is to find the optimal action for each rule of the fuzzy inference system. To this end, an action-value matrix, denoted by $Q$, is defined, whose elements are the action values of each candidate action for each rule. The optimal action for a rule is the candidate action of that rule with the largest action value.
As mentioned above, the best consequent action for each rule is obtained using reinforcement learning. Figure 2 depicts the structure of reinforcement learning. In this paper, the reinforcement learning algorithm used to update the action values is the Expected-SARSA method, which is outlined in Algorithm 1. The function used for calculating the reward of the agent for a transition from state $s_k$ to state $s_{k+1}$ is as follows: where $e(t)$, $t \ge 0$, is defined as: The goal of the control agent in reinforcement learning (RL) is to find an optimal policy $\pi^*$ under which the expected discounted reward is maximized. The discounted reward is defined as follows: where $\delta$ is the discount factor, a non-negative constant that satisfies $\delta \le 1$.
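As a small illustration of how the discount factor weights future rewards, the sketch below computes a discounted sum for a finite reward sequence. It assumes the standard geometric weighting, in which the reward received $k$ steps ahead is multiplied by $\delta^k$; the function name `discounted_return` is hypothetical, and the paper's reward function itself is not reproduced here.

```python
def discounted_return(rewards, delta=0.9):
    """Sum of delta**k * rewards[k] over a finite reward sequence."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (delta ** k) * r
    return total
```

With $\delta$ close to 1 the agent values long-term outcomes almost as much as immediate ones, whereas a small $\delta$ makes it short-sighted.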
The output action of the fuzzy inference system is calculated as where $n$ is the number of fuzzy rules in the fuzzy inference system, $\eta_i$ is the firing strength of the $i$-th rule, and $a_{ip}$ is the action selected by the ε-greedy method for the $i$-th rule among all of its candidate actions. The update rule for FESL is where $\alpha$ denotes the learning rate of the algorithm. In Equation (49), $\gamma_t$ is defined as: in which $\delta$ is the discount factor, $\pi_s[k, j]$ is the probability of selecting the $j$-th action for the $k$-th rule under the ε-greedy method, and $Q_t[k, j]$ is an approximation of the value of the $j$-th action for the $k$-th rule.
One of the key features of FRL-based controllers is their ability to handle uncertainty and imprecision in the system. Fuzzy logic allows the controller to make decisions based on approximate rather than precise values, which can be helpful in situations where the system is not well understood or where there is a high degree of variability. In addition, an FRL-based controller can learn from its past experience and adapt its behavior to better achieve the desired outcomes. This learning process is known as reinforcement learning, and it allows the controller to optimize its performance over time by adjusting its control inputs based on their consequences. Overall, the combination of fuzzy logic and reinforcement learning makes an FRL-based controller well suited to handling unexpected situations and adapting to a range of conditions.
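The pieces described above (ε-greedy selection per rule, a firing-strength-weighted output action, and an expected-value target) can be sketched as follows. This is a minimal illustration under assumptions, not the authors' Algorithm 1: the normalization of firing strengths, the way the temporal-difference error is distributed over rules, and all function names are hypothetical choices borrowed from common fuzzy Q-learning formulations.

```python
import random


def epsilon_greedy_probs(q_row, epsilon):
    """Selection probabilities of an epsilon-greedy policy for one rule."""
    k = len(q_row)
    best = max(range(k), key=lambda j: q_row[j])
    probs = [epsilon / k] * k
    probs[best] += 1.0 - epsilon
    return probs


def select_actions(Q, epsilon, rng=random):
    """Pick one candidate-action index per rule with epsilon-greedy exploration."""
    chosen = []
    for q_row in Q:
        probs = epsilon_greedy_probs(q_row, epsilon)
        chosen.append(rng.choices(range(len(q_row)), weights=probs)[0])
    return chosen


def normalise(eta):
    """Scale firing strengths so that they sum to one."""
    total = sum(eta)
    return [e / total for e in eta]


def fuzzy_action(eta, chosen, candidates):
    """Firing-strength-weighted average of the selected candidate actions."""
    w = normalise(eta)
    return sum(w_i * candidates[i][p]
               for i, (w_i, p) in enumerate(zip(w, chosen)))


def expected_sarsa_update(Q, eta, chosen, reward, eta_next, lr, delta, epsilon):
    """One assumed FESL-style update; modifies Q in place, returns the TD error."""
    w, w_next = normalise(eta), normalise(eta_next)
    # Value of the action actually taken, aggregated over the fired rules.
    q_sa = sum(w[i] * Q[i][chosen[i]] for i in range(len(Q)))
    # Expected value of the next state under the epsilon-greedy policy.
    expected_next = 0.0
    for k, q_row in enumerate(Q):
        probs = epsilon_greedy_probs(q_row, epsilon)
        expected_next += w_next[k] * sum(p * q for p, q in zip(probs, q_row))
    td_error = reward + delta * expected_next - q_sa
    for i in range(len(Q)):
        Q[i][chosen[i]] += lr * w[i] * td_error
    return td_error
```

Here `Q` is a list of rows, one per rule, and `candidates[i][j]` is the $j$-th candidate action of rule $i$. Taking the expectation over the policy, rather than sampling the next action as plain SARSA does, is what reduces the variance of the update.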

Numerical Simulations
In this section, the performance of the FESL method for controlling the closed-loop cancer chemotherapy drug-dosing system, whose control input is a chemotherapy drug such as carboplatin, is assessed through simulations carried out in MATLAB. Oncologists consider different factors when they determine the drug dosage for a cancer patient; age and gender are two examples. As has been shown in [87], the growth rates of normal cells and immune cells depend on age, and they are larger for younger patients than for elderly patients. Therefore, for young patients, oncologists prefer to eradicate cancerous cells as soon as possible, without taking the damage to normal cells into consideration, in order to avoid cancer metastasis. In the current study, the fractional derivative orders were set to the values given in Table 1. However, estimating the fractional derivative order in the same way as the other parameters would help to ensure that the model is as accurate and reliable as possible. This can be done using a variety of techniques, such as fitting the model to experimental data or using statistical methods to optimize the model parameters. By accurately estimating all of the model parameters, researchers can be more confident in the results of their study and the conclusions they draw from the data.
For the cases in which the patient is elderly, the patient suffers from other diseases, or the cancer is in a vital organ such as the brain, the degeneration of the normal cells is not desirable, and the normal cells should be kept undamaged. These conditions have been taken into consideration by the proposed control method through the error defined in Equation (46).
For this reason, two scenarios have been considered to evaluate the performance of the proposed control method in the above-mentioned cases. In scenario "A", the patient is young, whereas scenario "B" has been simulated for an elderly patient. Finally, the results of the simulations for the proposed control method have been compared with those of Watkins's Q-learning method for both scenarios.
The simulated patient's parameters are the same as the ones given in Table 1 [87]; in [93], it was shown how these parameters can be obtained. The simulations were run for 500 episodes, where each episode is defined as a set of transitions from the initial state to the terminal state. For the simulations, 20 fuzzy sets have been considered for the fuzzy system, and the membership functions used for the fuzzy sets are trapezoidal-shape and z-shape membership functions, which are shown in Figure 3. Table 2 lists the type of membership function used for each fuzzy set and its features. Furthermore, in the simulations, we have set δ = 0.9, and the learning rate is initially α = 0.2 and should decrease as the number of iterations and episodes increases so as to guarantee the convergence of the learning algorithm.
In Table 2, zmfl denotes the z-shape membership function for the lower bound; trapmf and zmfh stand for the trapezoidal-shape membership function and the z-shape membership function for the upper bound, respectively. The variables "a", "b", "c", and "d" are illustrated in Figure 3.
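As an illustration of the two membership-function families named above, the following Python sketch implements a trapezoidal function with break points a, b, c, d and a Z-shaped function. The Z-shape here follows the common quadratic-spline definition, which is an assumption; the exact shapes and parameters used in the simulations are those of Figure 3 and Table 2 and may differ. The function names are hypothetical.

```python
import numpy as np

def trapezoid(x, a, b, c, d):
    """Trapezoid: rises on [a, b], equals 1 on [b, c], falls on [c, d].

    Requires a < b <= c < d.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    rise = (x - a) / (b - a)
    fall = (d - x) / (d - c)
    return np.clip(np.minimum(rise, fall), 0.0, 1.0)

def z_shape(x, a, b):
    """Z-shape: 1 for x <= a, smooth descent to 0 for x >= b (requires a < b)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mid = (a + b) / 2.0
    y = np.zeros_like(x)
    y[x <= a] = 1.0
    left = (x > a) & (x <= mid)   # upper half of the descent
    right = (x > mid) & (x < b)   # lower half of the descent
    y[left] = 1.0 - 2.0 * ((x[left] - a) / (b - a)) ** 2
    y[right] = 2.0 * ((x[right] - b) / (b - a)) ** 2
    return y
```

Evaluating every membership function at the current state gives the degree to which each fuzzy set, and hence each rule, is active.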
During simulations, the learning algorithm should explore the available actions in order to find the best consequent action for each rule. Keeping this in mind, the ε value is initially set to a small number; then, as the number of time-steps and the number of episodes increase, the ε value in the ε-greedy method increases in order to reach a greedy policy. The ε value has been calculated using the following formula: where I is the iteration number or the time-step number, MNI is the maximum number of iterations, E is the episode number, and MNE is the maximum number of episodes; in the simulations, the maximum number of iterations is chosen as 500. The candidate actions for each of the rules are $a_i \in \{0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1, 1.5, 2, 2.5, 3, 3.5, 4, 5, 6, 7, 8, 9, 10\}$. Moreover, the initial values of the state variables of the system model in Equation (11) are considered to be

Scenario A
In this scenario, the simulations were first conducted for the patient system without uncertainty. The aim is to reach T = 0; in other words, the controller must take the system from the initial state to the first fuzzy set. Therefore, for the simulation, we have set β = 1, which means that in this scenario the error is $e(t) = T(t) - T_d(t)$.
The result of the simulation is given in Figure 4, which shows the number of normal cells (a), the number of tumor cells (b), the number of immune cells (c), the concentration of the chemotherapeutic drug in the blood (d), and the amount of chemotherapeutic drug administered in scenario "A" (e). As is evident, the number of tumor cells decreased steadily until it reached zero, which means that the proposed controller was able to eliminate all of the tumor cells. However, because a chemotherapeutic drug was injected into the patient's body, the numbers of normal cells and immune cells decreased at the beginning of the chemotherapy. Later, the numbers of normal cells and immune cells increased again.
The simulation confirms that the chemotherapy treatment was successful in eliminating all of the tumor cells, but it caused a decrease in the number of normal cells and immune cells at the beginning of treatment. This is a common occurrence with chemotherapy, as the drugs used to kill cancer cells can also damage healthy cells. However, it is also possible for the body to recover and for the number of normal cells and immune cells to increase again after treatment.
It is important to carefully consider the potential benefits and risks of chemotherapy treatment, as it can be an effective way to treat cancer but can also cause significant side effects. It is generally recommended to undergo chemotherapy under the supervision of a medical professional who can help to monitor the patient's condition and adjust the treatment plan as needed.
To assess the robustness of the proposed control method, after the control agent was trained using the reinforcement learning algorithm, the parameters of the patient were varied during simulation. The parameters of the modified patient system were given a 10% variation from their original values, which are listed in Table 1. The result of the simulation is depicted in Figure 5.

Figure 5 has the same sub-figures as Figure 4. As can be seen, the system shows the same behavior as the system without uncertainty. The controller is therefore able to eradicate the tumor cells in the presence of uncertainty in the parameters.
The obtained results are notable for the following main findings:

• The effectiveness of the proposed controller in eliminating the tumor cells: The results show that the controller was able to reduce the number of tumor cells to zero, which indicates that the treatment was effective in destroying the cancer cells.

• The temporary decrease in normal cells and immune cells: The chemotherapy used in the treatment caused a temporary decrease in the number of normal cells and immune cells in the body. This is a common side effect of chemotherapy, as the drugs used can also harm healthy cells.

• The recovery of normal cells and immune cells: Despite the initial decrease in normal cells and immune cells, the numbers of these cells eventually increased over time. This suggests that the body was able to recover and rebuild healthy cells after the chemotherapy treatment.

• The importance of monitoring the effects of chemotherapy: The results of the simulation highlight the importance of carefully monitoring the effects of chemotherapy to ensure that it is being administered effectively and safely. This includes monitoring the number of cancer cells, normal cells, and immune cells in the body, as well as the concentration of chemotherapy drugs in the blood.

Scenario B
This scenario has been considered to assess the performance of the proposed control method in controlling the tumor-cell state in elderly patients. In this scenario, the aim is to control both tumor cells and normal cells. Consequently, we have set β = 0.95, which means that the error in this scenario is $e(t) = 0.95\,(T(t) - T_d(t)) + 0.05\,(N(t) - N_d(t))$; as can be seen, the number of normal cells has been taken into consideration in this scenario. The result of the simulations for this scenario is illustrated in Figure 6.
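The two error definitions used in scenarios "A" and "B" are consistent with a single β-weighted form, which can be sketched in Python as follows. The general expression is inferred from the two stated special cases (β = 1 and β = 0.95) rather than copied from Equation (46), and the function name and numerical inputs are hypothetical.

```python
def tracking_error(T, T_d, N, N_d, beta):
    """Weighted error: beta on the tumor-cell term, (1 - beta) on the normal-cell term."""
    return beta * (T - T_d) + (1.0 - beta) * (N - N_d)

# Scenario "A": beta = 1, so only the tumor-cell error contributes.
error_young = tracking_error(T=0.5, T_d=0.0, N=0.8, N_d=1.0, beta=1.0)

# Scenario "B": beta = 0.95, so 5% of the weight goes to the normal-cell error.
error_elderly = tracking_error(T=0.5, T_d=0.0, N=0.8, N_d=1.0, beta=0.95)
```

With β = 1 the normal-cell term vanishes entirely, which matches the description of scenario "A" as targeting tumor eradication alone.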

A comparison of Figures 4 and 6 confirms that, in the case in which the patient is elderly, the number of normal cells reached its maximum faster. Moreover, the 2-norm of the amount of drug administered to the patient decreased by 5.5 percent in the elderly case.
It is notable that, for the elderly patient, the proposed control method reached the maximum number of normal cells more quickly and used a smaller amount of chemotherapy drug than for the young patient. This suggests that the control method was able to reduce the negative effects of the chemotherapy on healthy cells while still effectively treating the cancer. This is important because chemotherapy can have significant side effects on the body, and minimizing these effects is especially important for elderly patients, who may be more sensitive to them.
In addition, in this scenario, in order to confirm the robustness of the proposed control method for elderly patients, the simulation was performed for the patient system with 10% uncertainty for each of the system parameters, and the results are given in Figure 7.
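The kind of parameter variation used in this robustness test can be sketched in Python as follows. The sketch assumes that each parameter is scaled independently by a random factor within ±10% of its nominal value; whether the variation in the simulations was random or a fixed 10% offset is not stated, and the function name and nominal values below are hypothetical.

```python
import numpy as np

def perturb_parameters(params, fraction, rng):
    """Scale each value by an independent random factor in [1 - fraction, 1 + fraction]."""
    return {name: value * (1.0 + rng.uniform(-fraction, fraction))
            for name, value in params.items()}

rng = np.random.default_rng(seed=0)
nominal = {"r1": 1.5, "d1": 0.2}  # hypothetical nominal values, not taken from Table 1
perturbed = perturb_parameters(nominal, 0.10, rng)
```

The trained agent would then be evaluated on the patient model built from `perturbed` without any retraining.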
It is notable that the control method was able to maintain its effectiveness even when the system parameters were varied by 10%. This indicates that the control method is robust and can be applied effectively to elderly patients with a range of different characteristics. This is important because elderly patients may have a wide range of health conditions and characteristics that can affect their response to chemotherapy.
Figure 7 demonstrates the robustness of the proposed control method against variation in the system's parameters for an elderly patient. The results of the simulations for the elderly patient in both cases show the effectiveness of the proposed control method. Overall, these findings suggest that the proposed control method could be a valuable tool for improving the chemotherapy treatment of elderly patients. It could help to minimize the negative effects of chemotherapy on healthy cells and provide a more effective treatment for cancer.

Scenario C
In this scenario, the proposed control method and Watkins's Q-learning method were applied to 10 different patients. To evaluate the performance of the proposed control method, the simulation results for both methods are reported in Table 3. The patients' parameters are considered to be as follows: $a_i \in (0.1, 0.5]$ for $i = 1, 2, 3$; $a_i \in [0.3, 1]$ for $i = 1, 2, 3, 4$; $d_1 \in [0.15, 0.3]$; $r_1 \in [1.2, 1.6]$; $r_1 \in [0.3, 0.5]$; $\gamma \in [0.3, 0.5]$; and $\lambda \in [0.01, 0.05]$. The other parameters are the same as the ones given in Table 1. Furthermore, the parameters must satisfy the following conditions [47]:

As can be seen in Table 3, the 2-norm of the input signal for the proposed control method is less than that of Watkins's Q-learning method, which means that the proposed control method used less drug than the conventional controller. In every case, the proposed control method needed fewer than 500 episodes, far fewer than the 50,000 episodes required for the convergence of Watkins's Q-learning algorithm [57]; in terms of episodes to convergence, the proposed control method is therefore about 100 times faster.
Table 3 shows that, in comparison with Watkins's Q-learning algorithm, the proposed control method decreased the 2-norm of the error by 35 percent for scenario "A" and by 24 percent for scenario "B". In addition, the 2-norm of the variance of the error and the amount of drug usage decreased by 86 percent and 1 percent, respectively, for scenario "A" and by 83 percent and 10 percent for scenario "B".
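The comparison metrics quoted above can be computed as in the following Python sketch: the 2-norm of a sampled signal and the percentage decrease of one value relative to a baseline. The function names are hypothetical, and the sample numbers are purely illustrative arithmetic, not values from Table 3.

```python
import numpy as np

def two_norm(signal):
    """Euclidean (2-)norm of a sampled signal such as the error or the drug input."""
    return float(np.linalg.norm(np.asarray(signal, dtype=float)))

def percent_decrease(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline

# Illustrative only: a value of 1.3 against a baseline of 2.0 is a 35 percent decrease.
example_decrease = percent_decrease(2.0, 1.3)
```

Applied to the error, error-variance, and control-input signals of both controllers, these two functions would yield percentages of the kind reported for scenarios "A" and "B".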
It is worth mentioning that the proposed control method can be used by clinicians to find the optimal amount of drug, based on the patient's state, to eradicate tumor cells. To this end, clinicians must first obtain the patient's parameters by measuring the patient's state and fitting the logged data to the mathematical model of the patient given in Equation (11). Then, the control agent should be pre-trained using the mathematical model. Afterwards, the control agent would be able to suggest the optimal dosage of the drug for each day. The training of the control agent continues during the chemotherapy process. For this reason, the performance of the control framework would not be affected by variation of the patient's parameters during chemotherapy.

Conclusions
This study investigated a Caputo-Fabrizio fractional-order model of cancer chemotherapy treatment. First, the existence and uniqueness of the solutions of the proposed model were proven through a fixed-point theorem and an iterative method. After that, since applying optimal policies for drug dosage is crucial, we proposed to formulate the control problem of the chemotherapy treatment as an optimization problem and to find optimal actions using a fuzzy reinforcement learning algorithm. The significant features of the designed fuzzy reinforcement learning-based method are its model-free approach and its optimal performance. Finally, three scenarios were considered to evaluate the performance of the proposed control technique. The results demonstrate the effectiveness of the controller in eliminating tumor cells for young patients and elderly patients. The control method is able to bring the level of normal cells back to its normal range. In addition, the results indicate the robustness of the proposed control method against uncertainties in the patients' parameters. Moreover, because the method operates in real time, its performance would not be affected even if the patients' parameters changed during chemotherapy treatment. The comparison with Watkins's Q-learning method also demonstrated the superiority of the proposed method in terms of the elimination of tumor cells in simulated patients and the drug usage during chemotherapy treatment.

While the current study showed that the fuzzy reinforcement learning algorithm is able to achieve a good level of optimization in this task, deep reinforcement learning has the potential to achieve even better results by taking into account a greater amount of data and making more sophisticated decisions. However, this is an area of active research, and it is not yet clear how well deep reinforcement learning will perform in this specific application. Further research and development will be needed to determine whether this approach is practical and effective for scheduling cancer chemotherapy drug dosing. Furthermore, it would be helpful to run a variance-based statistical test to determine whether there are statistically significant differences between two or more categorical groups.

Figure 2. Structure of reinforcement learning.


Figure 3. Membership functions used for fuzzy-sets. (a) Trapezoidal-shape membership function. (b) Z-shape membership function for upper bound. (c) Z-shape membership function for lower bound.



Figure 4. The result of simulation using the proposed control method for a young patient without uncertainty. (a) Normal cells. (b) Tumor cells. (c) Immune cells. (d) Drug concentration. (e) Control action.


Figure 5. The result of simulation using the proposed control method for a young patient with uncertainty. (a) Normal cells. (b) Tumor cells. (c) Immune cells. (d) Drug concentration. (e) Control action.


Figure 6. The result of simulation using the proposed control method for an elderly patient without uncertainty. (a) Normal cells. (b) Tumor cells. (c) Immune cells. (d) Drug concentration. (e) Control action.


Figure 7. The result of simulation using the proposed control method for an elderly patient with uncertainty. (a) Normal cells. (b) Tumor cells. (c) Immune cells. (d) Drug concentration. (e) Control action.

Table 2. Fuzzy-sets and their membership functions.

Table 3. Result of simulations.