Data-Driven Suboptimal Scheduling of Switched Systems

This paper investigates a data-driven optimal scheduling approach for continuous-time switched systems with unknown subsystems and infinite-horizon cost functions. First, a policy iteration (PI) based algorithm is proposed that approximates the optimal switching policy online and quickly for switched systems with known dynamics. Second, a data-driven PI-based algorithm is proposed for switched systems with unknown subsystems; it runs online and relies solely on system state data. Approximation functions are introduced, and their weight vectors are obtained step by step from different subsets of the data. The weight vectors are then used to approximate the switching policy and the cost function. Convergence and performance are analyzed. Finally, simulation results for two examples validate the effectiveness of the proposed approaches.


Introduction
Switched systems, which consist of several subsystems and a switching policy that governs the switching among them [1,2], arise in a range of applications [3,4]. One example is a system that must collect data sequentially from a number of sensory sources and switch its attention among them [5,6]. Switching among subsystems complicates the control problems, and many of them, such as optimal control problems, remain open. Optimal control problems [7,8] of switched systems have attracted considerable attention over the past few years. Among them, this paper investigates the optimal scheduling problem of switched systems.
Multiple approaches have been introduced to solve optimal control problems for switched systems. Gradient-based approaches are investigated to solve optimal switching time problems [9] and optimal scheduling problems [10,11] directly, usually over a finite time horizon. By using control inputs to represent the switching policy, embedding approaches transform the optimal control problems of switched systems into traditional optimal control problems [12][13][14]. Adaptive dynamic programming (ADP) [15] approaches are introduced to solve the optimal control problems for switched systems with different initial conditions directly [16][17][18][19][20][21][22]. For optimal scheduling of switched systems with infinite-horizon cost functions, ADP approaches perform well and provide approximate globally optimal solutions directly.
Approximate globally optimal solutions are derived in feedback form through ADP for optimal scheduling of switched systems with discrete-time dynamics and finite-horizon cost functions [17] or infinite-horizon cost functions [18]. Further research addresses problems with switching cost [19] or state jumps [20]. For optimal scheduling of switched systems with continuous-time dynamics, an approximate feedback solution is proposed based on the policy iteration (PI) algorithm, with offline, online, and concurrent implementations [21]. A PI algorithm with recursive least squares is then proposed and modified into a single-loop PI algorithm to reduce the computational burden [22].
The aforementioned optimal control approaches rely on a priori knowledge of the system models. However, not all system models can be completely acquired, so approaches that are independent of system models [23][24][25] need to be investigated. For this purpose, some model-free optimal control approaches have been studied for switched systems. Adaptive dynamic programming approaches are presented for a continuous-time switched system with an infinite-horizon cost function [26] and for a discrete-time switched system with a finite-horizon cost function [27], under the assumption that the dynamic equations can be evaluated at some sets. Gradient-descent approaches that employ only state data are proposed to solve optimal switching problems for continuous-time switched systems with finite-horizon cost functions [28,29]. Data-driven research uses real-life data measured by sensory sources to extract the intrinsic information of systems [30] and can address problems of switched systems with unknown subsystems; one example is the data-driven framework for discovering cyber-physical systems directly from data [31]. Among these, data-driven ADP approaches provide possible solutions for optimal control problems of switched systems [32][33][34]. Building on them, the approach in this paper is designed to cope with the complexity that switching brings to switched systems.
This paper therefore investigates data-driven optimal scheduling for continuous-time switched systems with unknown subsystems. First, an online PI-based algorithm inspired by the off-policy learning method [35,36] is proposed to approximate the optimal solution quickly for optimal scheduling problems with known system models. Based on it, a data-driven PI-based algorithm is formulated for optimal scheduling of continuous-time switched systems with infinite-horizon cost functions; it does not require that the dynamic equations be known or evaluable at some sets. Moreover, common online algorithms usually keep collecting data while applying the updated switching policy to the system at each iteration, whereas the online algorithms in this paper take advantage of the data produced by the initial switching policy only.
The contribution of this paper is stated as follows: (1) with the data produced by the initial switching policy, an online PI-based algorithm is proposed to approximate the optimal solution quickly for optimal scheduling of known switched systems. (2) A data-driven PI-based algorithm is designed to solve optimal scheduling problems for switched systems with infinite-horizon cost functions and unknown subsystems, solely from the data produced by the initial switching policy, which has not been achieved well in existing literature as far as we know. (3) The convergence is proved and the optimality is analyzed.
The remainder of the paper is organized as follows. In Section 2, the problem is stated and the classic PI approach is introduced for the optimal scheduling problem. In Section 3, online PI-based algorithms are proposed for switched systems with known subsystems and unknown subsystems. In Section 4, simulation results are shown to indicate the effectiveness of the algorithms. Finally, the conclusion is drawn in Section 5.

Preliminaries
Consider the switched system

$$\dot{x}(t) = f_{v}(x(t)), \quad x(0) = x_0, \tag{1}$$

where $x(t) \in \mathbb{R}^n$ is the system state, $v \in V$ is the index of the active subsystem, $V = \{1, 2, \ldots, N\}$ is the index set of all subsystems, $N$ is the number of subsystems, and $f_v : \mathbb{R}^n \to \mathbb{R}^n$ denotes the unknown dynamics of subsystem $v$. Each $f_v$ ($v \in V$) is Lipschitz continuous in $\Omega$, where $\Omega \subset \mathbb{R}^n$ is the region of interest and contains the origin, and there exists $v \in V$ such that $f_v(0) = 0$.
The problem addressed in this paper is to find the optimal switching policy $v^*$ that minimizes the cost function

$$V(x_0) = \int_0^{\infty} Q(x(\tau))\, d\tau, \tag{2}$$

where $Q : \mathbb{R}^n \to \mathbb{R}$ is a positive definite function. The cost-to-go from time $t$ to infinity, starting from the state $x(t)$ at time $t$, can be described as [37]

$$V(x(t)) = \int_t^{\infty} Q(x(\tau))\, d\tau, \tag{3}$$

and then, over a time interval of length $\delta t$,

$$V(x(t)) = \int_t^{t+\delta t} Q(x(\tau))\, d\tau + V(x(t+\delta t)). \tag{4}$$

For the optimal scheduling problem, admissible switching policies are introduced and the relevant assumption is made as in [22].

Definition 1. For system (1), a switching policy is called admissible with respect to the cost (2) if it stabilizes the system in $\Omega$ and, for all $x_0 \in \Omega$, the cost $V(x_0)$ is finite.

Assumption 1.
There exists at least one admissible policy for the system.
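According to the Bellman principle of optimality, the optimal cost-to-go can be represented by

$$V^*(x(t)) = \min_{v \in V} \left( \int_t^{t+\delta t} Q(x(\tau))\, d\tau + V^*(x(t+\delta t)) \right), \tag{5}$$

and, when $\delta t \to 0$, the corresponding optimal switching policy $v^*$ can be represented in state-feedback form:

$$v^*(x(t)) = \arg\min_{v \in V} \left( \int_t^{t+\delta t} Q(x(\tau))\, d\tau + V^*(x(t+\delta t)) \right). \tag{6}$$

When $\delta t \to 0$, applying the first-order Taylor expansion of $V^*(x(t+\delta t))$ in (5) and (6) gives the HJB equation [21]

$$\min_{v \in V} \left( Q(x) + \left( \frac{\partial V^*(x)}{\partial x} \right)^T f_v(x) \right) = 0,$$

where $\frac{\partial (\cdot)}{\partial x} = \left[ \frac{\partial (\cdot)}{\partial x_1}, \ldots, \frac{\partial (\cdot)}{\partial x_n} \right]^T$ when $(\cdot)$ is a scalar, and the corresponding optimal switching policy $v^*$ is given as [21]

$$v^*(x) = \arg\min_{v \in V} \left( \frac{\partial V^*(x)}{\partial x} \right)^T f_v(x).$$

A PI approach can then be applied to solve the optimal scheduling problem. Given an admissible switching policy $v^0$, the PI approach [21,22] is stated as follows:

$$Q(x) + \left( \frac{\partial V^k(x)}{\partial x} \right)^T f_{v^k(x)}(x) = 0, \tag{7}$$

$$v^{k+1}(x) = \arg\min_{v \in V} \left( \frac{\partial V^k(x)}{\partial x} \right)^T f_v(x), \tag{8}$$

where $k$ is the iteration number. The cost-to-go is solved in the policy evaluation step (7) and the switching policy is updated in the policy improvement step (8), iteratively. The stability and convergence are stated in the following theorem, which has been proved in [21,38].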
Theorem 1. For the system (1) with the cost (2), if the value function sequence $\{V^k\}_{k=0}^{\infty}$ and the switching policy sequence $\{v^k\}_{k=0}^{\infty}$ are generated through (7) and (8) starting from a stabilizing initial policy $v^0$, then the value functions $\{V^k\}_{k=0}^{\infty}$ converge to the optimal value function $V^*$, and the switching policies $\{v^k\}_{k=0}^{\infty}$ stabilize the system in $\Omega$.
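The alternation between policy evaluation and policy improvement can be illustrated on a scalar toy problem. The sketch below is not one of the paper's examples: the two subsystems, the cost $Q(x) = x^2$ and the grid are hypothetical choices, and the evaluation step uses the fact that, with a single state, an evaluation equation of the form $Q(x) + V'(x) f_v(x) = 0$ can be solved for $V'(x)$ directly.

```python
import numpy as np

# Hypothetical scalar subsystems (not from the paper); both are stable at the origin.
SUBSYSTEMS = {
    1: lambda x: -x,      # linear decay
    2: lambda x: -x**3,   # cubic decay, faster than subsystem 1 when |x| > 1
}

def running_cost(x):
    """Assumed positive definite running cost Q(x) = x^2."""
    return x**2

def evaluate_policy(policy, xs):
    """Policy evaluation: Q(x) + V'(x) f_v(x) = 0  =>  V'(x) = -Q(x) / f_v(x)."""
    f = np.array([SUBSYSTEMS[int(v)](x) for v, x in zip(policy, xs)])
    return -running_cost(xs) / f

def improve_policy(dV, xs):
    """Policy improvement: choose the subsystem that minimises V'(x) f_i(x)."""
    modes = np.array(sorted(SUBSYSTEMS))
    scores = np.stack([dV * SUBSYSTEMS[int(i)](xs) for i in modes])
    return modes[np.argmin(scores, axis=0)]

def policy_iteration(policy, xs, max_iter=20):
    """Repeat evaluation and improvement until the policy stops changing."""
    for k in range(max_iter):
        new_policy = improve_policy(evaluate_policy(policy, xs), xs)
        if np.array_equal(new_policy, policy):
            return policy, k
        policy = new_policy
    return policy, max_iter

# Grid that avoids x = 0, where f_v(0) = 0 makes the division undefined.
xs = np.array([-2.0, -1.5, -0.5, -0.25, 0.25, 0.5, 1.5, 2.0])
v0 = np.ones_like(xs, dtype=int)          # admissible start: always subsystem 1
v_star, iterations = policy_iteration(v0, xs)
```

In this toy case the iteration settles on subsystem 2 for $|x| > 1$ and subsystem 1 for $|x| < 1$, which is the faster-decaying choice in each region.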

PI-Based Algorithm for Known Switched Systems
The offline, online and concurrent implementation of the aforementioned PI approach for switched systems with known dynamics has been investigated in references [21,22,38]. In this subsection, a novel online PI-based algorithm is proposed to approximate the optimal switching policy quickly.
Common online algorithms keep collecting data from the system to evaluate and improve the switching policy. Specifically, once a new switching policy is produced at an iteration, it is applied to the system, and the newly produced data are collected to calculate a new cost and a new switching policy. Applying the new switching policy, collecting new data, and calculating a new cost and a new switching policy sequentially at each iteration takes considerable time. To address this, and inspired by the off-policy ADP methods for ordinary systems [36,39,40], an online PI-based algorithm is proposed that approximates the optimal switching policy quickly: only an initial switching policy is applied, and only the data produced by it are required. The algorithm starts from the initial admissible switching policy.
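With only the initial admissible switching policy $v^0$ applied, a large amount of data is produced, and the state trajectory $x(t)$ corresponding to $v^0$ is acquired for use in the subsequent derivation. Along the acquired state trajectory $x(t)$, $V^k(x(t+\delta t)) - V^k(x(t))$ can be represented as

$$V^k(x(t+\delta t)) - V^k(x(t)) = \int_t^{t+\delta t} \left( \frac{\partial V^k(x)}{\partial x} \right)^T f_{v^0(x)}(x)\, d\tau, \tag{9}$$

where $\delta t > 0$ is very small. To combine the policy iteration with the cost at the acquired states, integrating (7) along the acquired state trajectory $x(t)$ and adding the result to (9) yields

$$V^k(x(t+\delta t)) - V^k(x(t)) = \int_t^{t+\delta t} \left( \frac{\partial V^k(x)}{\partial x} \right)^T \left( f_{v^0(x)}(x) - f_{v^k(x)}(x) \right) d\tau - \int_t^{t+\delta t} Q(x)\, d\tau. \tag{10}$$

The cost $V^k(x)$ is unknown. However, for all $x \in \Omega$, it can be expressed as

$$V^k(x) = W^k \Phi(x) + e^k_{\Phi}(x), \tag{11}$$

where $\Phi(x) = [\Phi_1(x), \Phi_2(x), \ldots, \Phi_{N_w}(x)]^T$ is a vector of linearly independent basis functions $\Phi_j(x) : \mathbb{R}^n \to \mathbb{R}$ ($j = 1, 2, \ldots, N_w$), $W^k \in \mathbb{R}^{1 \times N_w}$ is the weight vector, $e^k_{\Phi}(x)$ is the approximation error, and $N_w$ is the number of basis functions. A set of basis functions can constitute a basis of a function space and can approximate almost any function in that space. With the approximation (11) applied, Equation (10) can be transformed into

$$W^k \left( \Phi(x(t+\delta t)) - \Phi(x(t)) \right) = \int_t^{t+\delta t} W^k \frac{\partial \Phi(x)}{\partial x} \left( f_{v^0(x)}(x) - f_{v^k(x)}(x) \right) d\tau - \int_t^{t+\delta t} Q(x)\, d\tau + e^k, \tag{12}$$

where $e^k$ collects the terms arising from the approximation error $e^k_{\Phi}$. Then the estimation can be made as

$$\hat{W}^k \left( \Phi(x(t+\delta t)) - \Phi(x(t)) \right) = \int_t^{t+\delta t} \hat{W}^k \frac{\partial \Phi(x)}{\partial x} \left( f_{v^0(x)}(x) - f_{\hat{v}^k(x)}(x) \right) d\tau - \int_t^{t+\delta t} Q(x)\, d\tau, \tag{13}$$

where the estimate $\hat{v}^k(x)$ is obtained by substituting the estimate $\hat{W}^k$ and the approximation (11) into (8):

$$\hat{v}^{k+1}(x) = \arg\min_{v \in V} \hat{W}^k \frac{\partial \Phi(x)}{\partial x} f_v(x), \tag{14}$$

with the initial switching policy estimate $\hat{v}^0(x) = v^0(x)$.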
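The off-policy idea behind this derivation is that the value of one policy can be related to data generated by another. A minimal numerical check is sketched below under stated assumptions: the two scalar subsystems, the cost $Q(x) = x^2$, the behaviour policy (always subsystem 2) and the evaluated policy (always subsystem 1, whose cost is $x^2/2$) are all hypothetical, and the identity tested is the standard one obtained by combining the chain rule along the trajectory with a policy evaluation equation of the form $Q(x) + V'(x) f_{v^k}(x) = 0$.

```python
import numpy as np

# Hypothetical subsystems and running cost (not from the paper).
f = {1: lambda x: -x, 2: lambda x: -x**3}
Q = lambda x: x**2

def trajectory_v0(x0, t):
    """Exact solution of x' = -x^3, i.e. the behaviour policy always picks subsystem 2."""
    return x0 / np.sqrt(1.0 + 2.0 * x0**2 * t)

def trapezoid(y, t):
    """Composite trapezoidal rule on samples y(t)."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

# Cost of the evaluated policy "always subsystem 1": Q + V' f_1 = x^2 + x * (-x) = 0.
V_k = lambda x: 0.5 * x**2
dV_k = lambda x: x

t0, dt = 0.3, 0.05
tau = np.linspace(t0, t0 + dt, 2001)
x = trajectory_v0(1.5, tau)            # data come from the behaviour policy only

# Left side: change of V^k along the behaviour trajectory.
lhs = V_k(x[-1]) - V_k(x[0])
# Right side: correction for the policy mismatch minus the accumulated running cost.
rhs = trapezoid(dV_k(x) * (f[2](x) - f[1](x)), tau) - trapezoid(Q(x), tau)
```

Both sides agree up to quadrature error, even though the trajectory was never generated by the evaluated policy.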
To employ the acquired state data at selected instants $t_r$ ($r = 1, 2, \ldots, l$), the data matrices $\bar{\Phi}(t_r)$, $d(t_r)$ and $b(t_r)$ are defined from the state data. The following assumption concerning the data matrices is made as in [40,41].

Assumption 2. There exist a positive integer $\bar{L}$ and a positive number $\alpha$ such that, for all $L \ge \bar{L}$, the data matrices satisfy a condition with lower bound $\alpha$.

In optimal control of ordinary systems, this kind of assumption can be satisfied by adding an exploration noise to the input [39,40]. For switched systems, according to [21,22], it can be satisfied through random switching.
Then, based on Assumption 2, $\hat{W}^k$ can be obtained from the data matrices through (15). According to this analysis, the online PI-based algorithm for switched systems with known dynamics is formulated in Algorithm 1.
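Conditions of this kind are usually checked numerically before the weights are solved for. The sketch below assumes, without confirmation from the text above, that the condition takes the common form of a lower bound $\alpha I$ on the averaged Gram matrix of the data vectors; the function name and the sample data are hypothetical.

```python
import numpy as np

def excitation_margin(phi_bar):
    """Smallest eigenvalue of (1/L) * sum_r phi_bar[r] phi_bar[r]^T.

    phi_bar has shape (L, N_w): one data vector per sampled instant.
    A value bounded away from zero plays the role of alpha.
    """
    phi_bar = np.asarray(phi_bar, dtype=float)
    gram = phi_bar.T @ phi_bar / phi_bar.shape[0]
    return float(np.linalg.eigvalsh(gram)[0])   # eigvalsh returns ascending eigenvalues

rich = np.array([[1.0, 0.0], [0.0, 1.0]])   # two independent directions
poor = np.array([[1.0, 2.0], [2.0, 4.0]])   # collinear rows, no excitation in one direction
margin_rich = excitation_margin(rich)
margin_poor = excitation_margin(poor)
```

A margin close to zero indicates that more samples, or more varied switching, are needed before the least-squares step is well posed.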

Algorithm 1 Online policy iteration (PI)-based Algorithm.
Step 1. Start with an initial admissible switching policy $v^0(x)$ and set the iteration index $k = 0$.
Step 2. Apply $v^0(x)$ to the switched system and acquire the state data. Set $\hat{v}^0(x) = v^0(x)$. Calculate $\bar{\Phi}(t_r)$ and $d(t_r)$ for $r = 1, 2, \ldots, l$ from the state data according to their definitions.
Step 3. Calculate $b(t_r)$ for $r = 1, 2, \ldots, l$ from the state data according to its definition, and then calculate $\hat{W}^k$ from (15).
In the algorithm, when the switching policy $v^0(x)$ is applied, the corresponding system state data are obtained. With multiple samples, multiple data matrices $\bar{\Phi}(t_r)$ and $b(t_r)$ are calculated. Sufficient samples should be employed so that the condition in Assumption 2 is satisfied with a proper $\alpha$; the sampling stage must therefore be long enough. Moreover, since the weight vector $\hat{W}^k$ to be solved in (15) has $N_w$ components, at least $N_w$ instants $t_r$ should be employed, so the number of instants $l$ is no less than $N_w$. Through the algorithm, the approximate optimal cost function is obtained as $\hat{V}^*(x) = \hat{W}^* \Phi(x)$ with the corresponding approximate optimal switching policy $\hat{v}^*(x) = \arg\min_{v \in V} \left( \hat{W}^* \frac{\partial \Phi(x)}{\partial x} f_v(x) \right)$. Algorithm 1 applies only the initial switching policy and can therefore approximate the optimal switching policy quickly with the data produced by that policy.
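The weight computation in Step 3 is a linear regression over the sampled instants. The following sketch shows the idea for the first iteration only, where the estimated policy equals the initial policy and the policy-mismatch term vanishes. Equation (15) is not reproduced here: the code uses ordinary least squares as a stand-in, and the subsystem $\dot{x} = -x$, the cost $Q(x) = x^2$ and the basis $[x^2, x^4]$ are hypothetical.

```python
import numpy as np

def basis(x):
    """Hypothetical basis Phi(x) = [x^2, x^4]."""
    return np.array([x**2, x**4])

x0, dt = 2.0, 0.01
starts = 0.05 * np.arange(40)             # sampled instants t_r
x_start = x0 * np.exp(-starts)            # exact solution of x' = -x at t_r
x_end = x0 * np.exp(-(starts + dt))       # and at t_r + dt

# One row per instant: Phi(x(t_r + dt)) - Phi(x(t_r)).
phi_bar = (basis(x_end) - basis(x_start)).T          # shape (40, 2)
# Closed-form integral of Q(x) = x^2 over [t_r, t_r + dt] along the trajectory.
d = 0.5 * (x_start**2 - x_end**2)

# Solve phi_bar @ W = -d in the least-squares sense.
W_hat, *_ = np.linalg.lstsq(phi_bar, -d, rcond=None)
```

For this toy system the exact cost is $x^2/2$, so the regression should return weights close to $[0.5, 0]$; with at least as many instants as basis functions and sufficiently varied data, the system is well posed.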

Data-Driven PI-Based Algorithm for Unknown Switched Systems
In this subsection, based on the proposed PI approach dependent on system models, a data-driven PI-based algorithm is proposed for switched systems with unknown subsystems. The algorithm takes full advantage of data produced by an initial switching policy to approximate the optimal switching policy quickly.
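From Section 3.1, with the initial admissible switching policy $v^0$ applied and along the acquired state trajectory $x(t)$, (10) is obtained and the cost function can be represented by (11). There, $\frac{\partial V^k(x)}{\partial x}$ is represented by $\left( W^k \frac{\partial \Phi(x)}{\partial x} \right)^T$ and the term $\left( \frac{\partial V^k(x)}{\partial x} \right)^T f_i(x)$ is evaluated with the known system models. When the subsystem models are unknown, this term cannot be evaluated as in Section 3.1, so another approximation function is introduced. For all $x \in \Omega$,

$$\left( \frac{\partial V^k(x)}{\partial x} \right)^T f_i(x) = C^k_i \Psi(x) + e^k_{\Psi,i}(x), \quad i \in V, \tag{16}$$

where $\Psi(x) = [\Psi_1(x), \Psi_2(x), \ldots, \Psi_{N_c}(x)]^T$ is a vector of linearly independent basis functions $\Psi_j(x) : \mathbb{R}^n \to \mathbb{R}$ ($j = 1, 2, \ldots, N_c$), $C^k_i$, $i \in V$, are the weight vectors, $e^k_{\Psi,i}(x)$ is the approximation error, and $N_c$ is the number of basis functions.
With the approximation functions (11) and (16) applied, Equation (10) can be transformed into

$$W^k \left( \Phi(x(t+\delta t)) - \Phi(x(t)) \right) = \int_t^{t+\delta t} \left( C^k_{v^0(x)} - C^k_{v^k(x)} \right) \Psi(x)\, d\tau - \int_t^{t+\delta t} Q(x)\, d\tau + e^k, \tag{17}$$

where $e^k$ is the approximation error. Since $\delta t$ is very small, the values of $v^k(x)$ and $v^0(x)$ can be regarded as constant during the time interval $[t, t+\delta t]$. Therefore, the estimation can be made as

$$\hat{W}^k \left( \Phi(x(t+\delta t)) - \Phi(x(t)) \right) = \left( \hat{C}^k_{v^0(x)} - \hat{C}^k_{\hat{v}^k(x)} \right) \int_t^{t+\delta t} \Psi(x)\, d\tau - \int_t^{t+\delta t} Q(x)\, d\tau. \tag{18}$$

For the subsequent algorithm, in addition to the data matrices defined in Section 3.1, further data matrices are required, including $g(t_r) \in \mathbb{R}^{N_c}$, where $t_1 < t_2 < \cdots < t_l$, $l$ is a positive integer and $r = 1, 2, \ldots, l$.
First, the following assumption concerning the data matrices is made, analogous to Assumption 2.

Assumption 3. There exist a positive integer $\bar{L}$ and a positive number $\alpha$ such that, for all $L \ge \bar{L}$, the corresponding conditions on the data matrices hold for each of the following selections of time instants:
(i) the time instants $t_r$ satisfy $\hat{v}^k(x(t_r)) = v^0(x(t_r))$;
(ii) the time instants $t_r$ satisfy $v^0(x(t_r)) = i$, for all $i \in V$;
(iii) the time instants $t_r$ satisfy $\hat{v}^k(x(t_r)) = i$ and $v^0(x(t_r)) = j$, for all $i \ne j$ ($i, j \in V$).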

Remark 1.
Since the vector $\hat{C}^k_{\hat{v}^k(x)} - \hat{C}^k_{v^0(x)}$ changes as $x$ changes, the weight vectors $\hat{W}^k$ and $\hat{C}^k_{\hat{v}^k(x)}$ cannot be solved directly from (18) with the data matrices as in Section 3.1. The main difficulty is to calculate the weight vectors $\hat{W}^k$ and $\hat{C}^k_{\hat{v}^k(x)}$, or to obtain enough useful knowledge about them, from the state data.
Based on the above analysis, the following approach is designed to acquire useful knowledge about the weight vectors $\hat{W}^k$ and $\hat{C}^k_{\hat{v}^k(x)}$ step by step from different subsets of the state data. The weight vector $W^k$ is discussed first. The state data satisfying $\hat{v}^k(x) = v^0(x)$ are selected from all the state data and used in (18). When $\hat{v}^k(x) = v^0(x)$, (18) simplifies to

$$\hat{W}^k \left( \Phi(x(t+\delta t)) - \Phi(x(t)) \right) = -\int_t^{t+\delta t} Q(x)\, d\tau. \tag{19}$$

Under Assumption 3, the estimate of the weight vector $W^k$ is obtained from (19) with the data matrices defined in Section 3.1 through (20), where the time instants $t_r$ satisfy $\hat{v}^k(x(t_r)) = v^0(x(t_r))$.

Next, the weight vectors $C^k_i$ are estimated. Along the acquired state trajectory $x(t)$ produced by the switching policy $v^0$, a relation for $k = 0$ follows from (16) and (7), and the corresponding estimation is made. Under Assumption 3, the estimate of the weight vector $C^0_i$ is then easily obtained, where the time instants $t_r$ satisfy $v^0(x(t_r)) = i$ ($i \in V$). Although the weight vector $C^k_i$ itself is not easy to estimate, the estimate of $C^k_i - C^k_j$ ($i \ne j$) can be obtained from (18) with the data matrices concerning certain states under Assumption 3 through (23), where the time instants $t_r$ satisfy $\hat{v}^k(x(t_r)) = i$ and $v^0(x(t_r)) = j$ ($i, j \in V$ and $i \ne j$).
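The selection of time instants by active subsystem can be sketched as follows. This is an illustration under explicit assumptions rather than the paper's formulas: it assumes that $g(t_r)$ is the integral of $\Psi$ and $d(t_r)$ the integral of $Q$ over $[t_r, t_r + \delta t]$, that for $k = 0$ the relation to fit is $C^0_i\, g(t_r) = -d(t_r)$ on the instants where $v^0 = i$, and it uses a hypothetical scalar policy, basis and sample set. Because $Q$ lies in the span of the chosen basis, the fit is exact here.

```python
import numpy as np

def psi(x):
    """Hypothetical basis Psi(x) = [x^2, x^4]."""
    return np.array([x**2, x**4])

def q(x):
    """Assumed running cost Q(x) = x^2."""
    return x**2

def v0(x):
    """Hypothetical initial switching policy on a scalar state."""
    return 1 if abs(x) <= 1.0 else 2

def estimate_c0(x_start, x_end, dt, modes):
    """Fit one weight vector per subsystem from the instants where v0 selects it."""
    active = np.array([v0(x) for x in x_start])
    # Trapezoidal approximations of the integrals over [t_r, t_r + dt].
    g = 0.5 * dt * (psi(x_start) + psi(x_end)).T     # shape (l, N_c)
    d = 0.5 * dt * (q(x_start) + q(x_end))           # shape (l,)
    estimates = {}
    for i in modes:
        mask = active == i                           # instants with v0(x(t_r)) = i
        if mask.sum() < g.shape[1]:
            raise ValueError(f"not enough samples for subsystem {i}")
        estimates[i] = np.linalg.lstsq(g[mask], -d[mask], rcond=None)[0]
    return estimates

x_start = np.linspace(0.2, 2.0, 19)   # hypothetical sampled states x(t_r)
x_end = 0.99 * x_start                # hypothetical states at t_r + dt
C0_hat = estimate_c0(x_start, x_end, dt=0.01, modes=(1, 2))
```

The check on the number of samples per subsystem mirrors the requirement that every subsystem must be visited often enough by the initial policy.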
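For a certain state $x$, subtracting the term $\hat{C}^k_{v^0(x)} \Psi(x)$, which does not depend on $i$, leaves the minimizer over $i \in V$ unchanged. Therefore, on the basis of (8) and employing the estimates in (22) and (23), the switching policy is updated as

$$\hat{v}^{k+1}(x) = \arg\min_{i \in V} \left( \left( \hat{C}^k_i - \hat{C}^k_{v^0(x)} \right) \Psi(x) \right), \tag{24}$$

with the initial switching policy estimate $\hat{v}^0(x) = v^0(x)$. The approach uses different parts of the acquired state data to calculate the weight vectors $\hat{W}^k$, $\hat{C}^0_i$ and $\hat{C}^k_i - \hat{C}^k_j$ ($i \ne j$), respectively. The weight vectors are then employed to calculate the switching policy and the cost function.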
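The policy update only needs weight differences relative to the subsystem chosen by the initial policy, since the subtracted term is the same for every candidate subsystem. A minimal sketch, with hypothetical weights, basis and initial policy (none taken from the paper):

```python
import numpy as np

def psi(x):
    """Hypothetical basis Psi(x) = [x, x^3]."""
    return np.array([x, x**3])

def improved_mode(x, c_hat, v0):
    """Return argmin over i of (C_i - C_{v0(x)}) Psi(x)."""
    base = c_hat[v0(x)]                      # weights of the subsystem v0 picks at x
    modes = sorted(c_hat)
    scores = [float((c_hat[i] - base) @ psi(x)) for i in modes]
    return modes[int(np.argmin(scores))]

# Hypothetical weight estimates for two subsystems.
c_hat = {1: np.array([-1.0, 0.0]), 2: np.array([0.0, -1.0])}
v0 = lambda x: 1                             # hypothetical initial policy

mode_far = improved_mode(2.0, c_hat, v0)     # score of mode 2 is 2 - 8 = -6 < 0
mode_near = improved_mode(0.5, c_hat, v0)    # score of mode 2 is 0.5 - 0.125 > 0
```

The score of the subsystem selected by the initial policy is always zero, so any other subsystem is chosen only when its score is negative.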
According to the aforementioned analysis, the data-driven PI-based algorithm for switched systems can be formulated as Algorithm 2.

Algorithm 2 Data-driven PI-based Algorithm.
Step 1. Start with an initial admissible switching policy $v^0(x)$ and set the iteration index $k = 0$.
Step 2. Apply $v^0(x)$ to the switched system and acquire the state data. Set $\hat{v}^0(x) = v^0(x)$. Calculate $\bar{\Phi}(t_r)$, $d(t_r)$ and $g(t_r)$ for $r = 1, 2, \ldots, l$ from the state data according to their definitions.

Then, the approximate optimal cost function is obtained as $\hat{V}^*(x) = \hat{W}^* \Phi(x)$ with the corresponding approximate optimal switching policy $\hat{v}^*(x) = \arg\min_{i \in V} \left( \left( \hat{C}^k_i - \hat{C}^k_{v^0(x)} \right) \Psi(x) \right)$.

Remark 2.
In Algorithm 2, only the initial switching policy $v^0(x)$ needs to be applied to the system, and the produced state data are collected at the beginning. The data matrices $\bar{\Phi}(t_r)$, $d(t_r)$ and $g(t_r)$ are calculated once at the beginning and do not need to be recalculated at each iteration. Algorithm 2 is therefore convenient to operate online, its calculation time is short, and the optimal cost can be approximated rapidly. Moreover, Algorithm 2 is based only on data and needs no knowledge of the subsystems.
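Next, the convergence of Algorithm 2 is analyzed. Based on Theorem 1, Theorem 2 is stated for the convergence analysis of Algorithm 2.

Theorem 2. For the system (1) with the cost (2), if the value function sequence $\{V^k\}_{k=0}^{\infty}$ and the switching policy sequence $\{v^k\}_{k=0}^{\infty}$ are generated through (7) and (8) starting from an initial admissible policy $v^0$, then for any $\varepsilon > 0$ and all $x \in \Omega$ there exists $\bar{N} > 0$ such that, for all $N > \bar{N}$ and as $\delta t$ approaches zero, the estimates approximate $V^k$ and $v^k$ to within $\varepsilon$, where $\hat{W}^k$ and $\hat{v}^k$ are generated through (20) and (24) in Algorithm 2, and $k = 0, 1, \ldots$.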

Proof of Theorem 2.
Mathematical induction is used to prove the convergence.

First, consider the case $k = 0$. A relation for $\hat{W}^0$ follows from (17) and (18), and Assumption 3 gives a bound on it. The approximation theory [42] yields a vanishing limit of the approximation error for all $x \in \Omega$. Therefore, for any $\varepsilon > 0$ and any $x \in \Omega$, there exists $\bar{N}_w > 0$ such that the required bound holds. A corresponding relation then follows from (21) and (22). Similarly, the approximation theory [42] yields the limit for all $x \in \Omega$, and under Assumption 3 the limit as $N_c \to \infty$ and $\delta t \to 0$ can be deduced for all $x \in \Omega$.

Second, consider the case $k = 1$. When $\hat{W}^1$ is calculated, a relation follows from (17) to (19) for all $x \in \Omega$, and under Assumption 3 the corresponding limit can be deduced for all $x \in \Omega$. The relation for the weight differences then follows from (17), (18) and (23).

Third, suppose the theorem holds for some $k$. When $\hat{W}^{k+1}$ is calculated, a relation follows from (17) to (19), and under Assumption 3 the corresponding limit can be deduced. It then follows from (17), (18) and (23) that $\hat{v}^{k+2}(x) = v^{k+2}(x)$ for all $x \in \Omega$. In brief, the theorem holds for $k + 1$ under the supposition that it holds for $k$. The proof is completed through mathematical induction.
It follows from Theorem 2 that the value function $\hat{V}^k(x)$ generated through Algorithm 2 is an approximation of $V^k(x)$ and that the corresponding switching policy satisfies $\hat{v}^k(x) = v^k(x)$ if the preconditions are satisfied. Theorem 2 combined with Theorem 1 indicates that the value function $\hat{V}^*(x) = \hat{W}^* \Phi(x)$ is the approximate optimal cost function with the corresponding approximate optimal switching policy $\hat{v}^*(x) = \arg\min_{i \in V} \left( \left( \hat{C}^k_i - \hat{C}^k_{v^0(x)} \right) \Psi(x) \right)$.

Remark 3.
In practice, approximation errors exist and the small parameter $\delta t$ cannot be made arbitrarily close to zero, so the solution calculated by Algorithm 2 is suboptimal.

Example
In this section, two examples are presented to validate the effectiveness of the suboptimal scheduling approach of Algorithm 1 and the data-driven suboptimal scheduling approach of Algorithm 2.

Example 1.
Consider the switched system studied in [21,38]. The optimal switching policy is known from [17]. Choose the initial switching policy $v^0(x) = 1$ when $x \le 1.5$ and $v^0(x) = 2$ when $x > 1.5$. The basis functions are $\Phi(x) = [x^2, x^4, x^6, x^8, x^{10}]^T$, selected as in [21]. Set the sample period $\delta t = 0.002$ s.
Apply Algorithm 1 using the online state data from $t = 0$ to $0.2$ s. After a calculation time of 0.02 s, the approximate optimal cost and the corresponding approximate optimal switching policy are obtained in 3 iterations. The initial cost $\hat{V}^0$, the approximate optimal cost $\hat{V}^k$ and the optimal cost $V^*$ are shown in Figure 1. The largest error between $\hat{V}^k$ and $V^*$ is 0.1621, so the approximate optimal cost $\hat{V}^k$ is very close to the optimal cost $V^*$. The state trajectories with the initial switching policy $v^0$, the approximate optimal switching policy $\hat{v}^k$ and the optimal switching policy $v^*$ applied to the system after $t = 0.25$ s are shown in Figure 2. The state trajectory corresponding to $\hat{v}^k$ coincides with the one corresponding to $v^*$ (their largest error is zero), while the largest error between the trajectories corresponding to $v^0$ and $v^*$ is 0.0563. Because the online trajectories are too close to show the superiority of the proposed algorithm, the switching policies $v^0$, $\hat{v}^k$ and $v^*$ for $x \in [-2, 2]$ are illustrated in Figure 3, where $\hat{v}^k$ and $v^*$ are identical. The similarity rate of the schedules $\hat{v}^k$ and $v^*$ is 100%, while that of the schedules $v^0$ and $v^*$ is 75.31%.
Apply Algorithm 2 using the online state data from $t = 0$ to $0.2$ s. After a calculation time of 0.05 s, the approximate optimal cost and the corresponding approximate optimal switching policy are obtained in 6 iterations. The costs $\hat{V}^0$, $\hat{V}^k$ and $V^*$ are shown in Figure 4. Compared with the initial cost $\hat{V}^0$, the approximate optimal cost $\hat{V}^k$ is close to the optimal cost $V^*$. The state trajectories with $v^0$, $\hat{v}^k$ and $v^*$ applied to the system after $t = 0.25$ s are shown in Figure 5. The state trajectory corresponding to $\hat{v}^k$ is close to the one corresponding to $v^*$, with a largest error of 0.0155, while the largest error between the trajectories corresponding to $v^0$ and $v^*$ is 0.0563. The online trajectories are again too close to show the superiority of the proposed algorithm, so the switching policies $v^0$, $\hat{v}^k$ and $v^*$ for $x \in [-2, 2]$ are illustrated in Figure 6, which shows that $\hat{v}^k$ approximates $v^*$. The similarity rate of the schedules $\hat{v}^k$ and $v^*$ is 95.06%, while that of the schedules $v^0$ and $v^*$ is 75.31%.
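The similarity rate of two schedules can be read as the percentage of evaluation points at which the two policies select the same subsystem. The sketch below assumes this definition and an evenly spaced 81-point grid on $[-2, 2]$; the source does not state the grid, although the reported values are consistent with it, since $61/81 \approx 75.31\%$ and $77/81 \approx 95.06\%$. The two threshold policies in the code are hypothetical and are not the policies of Example 1.

```python
import numpy as np

def similarity_rate(policy_a, policy_b, grid):
    """Percentage of grid points at which two switching policies agree."""
    a = np.array([policy_a(x) for x in grid])
    b = np.array([policy_b(x) for x in grid])
    return 100.0 * float(np.mean(a == b))

grid = np.linspace(-2.0, 2.0, 81)            # assumed grid, spacing 0.05

# Hypothetical threshold policies; thresholds sit between grid points.
policy_a = lambda x: 1 if x < 0.525 else 2
policy_b = lambda x: 1 if x < 1.025 else 2

# The policies differ on the 10 grid points 0.55, 0.60, ..., 1.00.
rate = similarity_rate(policy_a, policy_b, grid)
```

Placing the thresholds between grid points avoids ambiguity from floating-point rounding at the switching boundary.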

Example 2.
Consider the mass-spring-damper system from [21,22], with $v \in \{1, 2, 3\}$, $F_1 = 1$, $F_2 = -1$ and $F_3 = 0$. Here, the state $x_1(t)$ is the displacement of the mass measured from the relaxed length of the spring, and $F_v$ is the external force acting on the mass. The initial state is $x(0) = [2, 2]^T$. Choose the initial switching policy $v^0(x) = 3$ when $|x_1| \le 0.5$, $v^0(x) = 2$ when $x_1 > 0.5$ and $v^0(x) = 1$ when $x_1 < -0.5$. The basis functions are polynomials with all possible combinations of the state variables up to the 4th degree without repetitions, selected as in [21,22]. Set the sample period $\delta t = 0.02$ s.
Apply Algorithm 1 using the online state data from $t = 0$ to $23$ s. After a calculation time of 7 s, the approximate optimal cost and the corresponding approximate optimal switching policy are obtained in 15 iterations. The initial cost $\hat{V}^0$ and the approximate optimal cost $\hat{V}^k$ are shown in Figure 7; the approximate optimal cost $\hat{V}^k$ is less than the initial cost $\hat{V}^0$. The state trajectories with the initial switching policy $v^0$ and the approximate optimal switching policy $\hat{v}^k$ applied to the system after $t = 30$ s are shown in Figure 8. The state trajectory corresponding to $\hat{v}^k$ converges to the origin quickly after $t = 30$ s, while the trajectory corresponding to $v^0$ converges slowly with decreasing oscillation amplitude. The corresponding switching policies $v^0$ and $\hat{v}^k$ are illustrated in Figure 9. The policy $v^0$ switches among the three subsystems and finally stays at subsystem 3, which makes its trajectory converge slowly with decreasing oscillation amplitude, whereas the approximate optimal policy $\hat{v}^k$ switches between subsystems 1 and 2, which makes its trajectory converge to the origin quickly after $t = 30$ s.
Apply Algorithm 2 using the online state data from $t = 0$ to $23$ s. After a calculation time of 7 s, the approximate optimal cost and the corresponding approximate optimal switching policy are obtained in 21 iterations. The costs $\hat{V}^0$ and $\hat{V}^k$ are shown in Figure 10; the approximate optimal cost $\hat{V}^k$ is less than the initial cost $\hat{V}^0$. The state trajectories with $v^0$ and $\hat{v}^k$ applied to the system after $t = 30$ s are shown in Figure 11. The state trajectory corresponding to $\hat{v}^k$ converges to the origin relatively quickly after $t = 30$ s, while the trajectory corresponding to $v^0$ converges slowly with decreasing oscillation amplitude. The corresponding switching policies $v^0$ and $\hat{v}^k$ are illustrated in Figure 12. Again, $v^0$ switches among the three subsystems and finally stays at subsystem 3, so its trajectory converges slowly with decreasing oscillation amplitude, whereas $\hat{v}^k$ switches between subsystems 1 and 2, so its trajectory converges to the origin relatively quickly after $t = 30$ s.
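The initial switching policy and a polynomial basis of this kind can be written down compactly. In the sketch below the policy follows the thresholds stated above, while the basis is an assumption: it takes all monomials $x_1^a x_2^b$ with total degree from 1 to 4, and the lowest degree actually used in [21,22] is not stated here. The function names are hypothetical.

```python
from itertools import product

import numpy as np

def v0(x):
    """Initial switching policy of Example 2; x is the two-dimensional state."""
    if abs(x[0]) <= 0.5:
        return 3
    return 2 if x[0] > 0.5 else 1

def monomial_exponents(n_vars=2, max_degree=4, min_degree=1):
    """Exponent tuples whose total degree lies in [min_degree, max_degree]."""
    return [e for e in product(range(max_degree + 1), repeat=n_vars)
            if min_degree <= sum(e) <= max_degree]

EXPONENTS = monomial_exponents()      # assumed degree range 1..4

def basis(x):
    """Evaluate every monomial x1^a * x2^b at the state x."""
    x = np.asarray(x, dtype=float)
    return np.array([np.prod(x ** np.array(e)) for e in EXPONENTS])

features = basis([2.0, 2.0])          # basis vector at the initial state
```

With two state variables and degrees 1 to 4 this gives 14 monomials; changing `min_degree` adapts the sketch to other conventions.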

Remark 4.
In Algorithm 2, the initial switching policy $v^0(x)$ should take every value in $V$, and there should be enough data to calculate $\hat{C}^0_i$ for every $i \in V$; the sampling stage must therefore be long enough.

Remark 5.
In the calculation, the fourth-order Runge-Kutta algorithm is adopted to numerically evaluate the integrals required in terms such as $b(t_r)$ and $d(t_r)$.
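One common way to evaluate such integrals with a Runge-Kutta scheme is to augment the state with the running integral and integrate both together. The sketch below shows this for a hypothetical scalar subsystem $\dot{x} = -x$ with $Q(x) = x^2$; the augmentation is an illustrative design choice and is not confirmed as the authors' implementation.

```python
import numpy as np

def rk4_step(f, z, dt):
    """One classical fourth-order Runge-Kutta step for z' = f(z)."""
    k1 = f(z)
    k2 = f(z + 0.5 * dt * k1)
    k3 = f(z + 0.5 * dt * k2)
    k4 = f(z + dt * k3)
    return z + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def augmented_dynamics(z):
    """z = [x, c]: x' = -x (hypothetical subsystem), c' = Q(x) = x^2."""
    x = z[0]
    return np.array([-x, x**2])

def integrate_cost(x0, dt, steps):
    """Propagate the state and the accumulated cost together."""
    z = np.array([x0, 0.0])
    for _ in range(steps):
        z = rk4_step(augmented_dynamics, z, dt)
    return z

z_final = integrate_cost(x0=2.0, dt=0.02, steps=50)      # horizon of 1 s
exact_state = 2.0 * np.exp(-1.0)
exact_cost = 0.5 * 2.0**2 * (1.0 - np.exp(-2.0))         # closed-form integral of x^2
```

The second component of `z_final` approximates the integral of $Q$ along the trajectory, and it can be compared with the closed-form value to gauge the step size.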

Remark 6.
In Example 1, the proposed algorithms converge in 0.25 s, while the online algorithm investigated in [38] requires more time. In Example 2, the proposed algorithms converge in 30 s, while the online algorithms investigated in [21,22] require more time.

The two examples validate the effectiveness of Algorithm 2. Its advantage is that it works for switched systems with unknown subsystems, whereas Algorithm 1 cannot be used if the subsystems are unknown. Practical examples of such switched systems appear in a wide range of applications, such as cyber-physical systems, which embed software into the physical world and have proved resistant to modeling because of the intrinsic complexity arising from the combination of, and interaction between, physical and cyber components [31]. Among these, data-driven research has been conducted for specific examples such as complex electronics switching among low-voltage, middle-voltage and high-voltage models, and smart grids switching between a base configuration and a changed configuration. These examples require data-driven approaches such as Algorithm 2, since Algorithm 1 is inapplicable.
Simulation results show that Algorithms 1 and 2 can approximate the optimal solution effectively and efficiently. Algorithm 2 achieves the approximate optimal solution for unknown switched systems with infinite-horizon cost function, which has not been achieved well in existing literature as far as we know.

Conclusions
In this paper, an online PI-based algorithm inspired by the off-policy learning method is proposed and, based on it, a data-driven PI-based algorithm is developed to approximate the optimal solution quickly for optimal scheduling problems. Only data produced by an initial switching policy are necessary, and the approximation time is relatively short. The data-driven algorithm acquires useful knowledge of the weight vectors step by step from different subsets of the data and solves the optimal scheduling problem for switched systems with unknown subsystems using data only. However, dwell-time constraints, which are important in practical applications, are not considered in this paper. Optimal scheduling problems with dwell-time constraints will be investigated in future work.

Conflicts of Interest:
The authors declare no conflict of interest.