Optimizing Age Penalty in Time-Varying Networks with Markovian and Error-Prone Channel State

In this paper, we consider a scenario where a base station (BS) collects time-sensitive data from multiple sensors through time-varying and error-prone channels. We characterize the data freshness at the terminal end through a class of monotone increasing functions of the Age of Information (AoI). Our goal is to design an optimal policy that minimizes the average age penalty of all sensors over an infinite horizon under bandwidth and power constraints. By formulating the scheduling problem as a constrained Markov decision process (CMDP), we reveal the threshold structure of the optimal policy and approximate the optimal decision by solving a truncated linear program (LP). Finally, a bandwidth-truncated policy is proposed to satisfy both the power and bandwidth constraints. Through theoretical analysis and numerical simulations, we prove that the proposed policy is asymptotically optimal in the large-sensor regime.


Introduction
The requirements for data freshness in numerous emerging applications are becoming stricter [1,2]. However, the limited resources and bandwidth, together with the fading and error-prone channel characteristics, prevent the control terminal from obtaining the newest information. Moreover, the traditional optimization goals like low delay and high throughput cannot fully characterize the requirement of data freshness. Therefore, it is necessary to introduce new metrics to capture data freshness in such systems and design strategies to optimize the system performance in the presence of resource and environment restrictions.
Recently, a popular metric, the Age of Information (AoI), was proposed in [3] to measure data freshness, and the optimization of age performance in different systems has since become a research hotspot. The simple point-to-point system model has been studied in [3][4][5][6][7][8][9][10][11]. When update packets are generated by external sources and queued in a buffer before transmission, queuing theory can be used to analyze the performance of such systems; see, e.g., [3][4][5][6][7][8]. In [3], it is shown that the optimal packet generation rate of a first-come-first-served (FCFS) system should achieve a trade-off between throughput and delay. In [8], dynamic pricing is used as an incentive to balance the AoI evolution and the monetary payments to the users. Other studies [9][10][11] consider the generate-at-will system without a queue. Energy constraints are studied in [10,11] to find the trade-off between age performance and energy consumption. In [11], both offline and online heuristic policies are proposed to optimize the average AoI, and both outperform the greedy approach.
Apart from point-to-point systems, scheduling strategies in multi-user networks are studied in [12][13][14][15][16][17][18][19]. Different scheduling policies are studied in [12] to minimize the average AoI through unreliable channels. The main contributions of this paper are summarized as follows.
• We study the scheduling strategy for age penalty minimization in multi-sensor, bandwidth-constrained networks with power-limited sensors communicating over time-varying and error-prone channel links. To study a practical network, we model each channel as a finite-state ergodic Markov chain, where the packet loss probability and the power consumption depend on the current channel state. Unlike previous work, we model the effect of data staleness in different scenarios via a class of monotone increasing functions of the AoI.
• By relaxing the hard bandwidth constraint and introducing Lagrange multipliers, we decouple the multi-sensor optimization problem into several single-sensor constrained Markov decision process (CMDP) problems. To deal with the potentially infinite age penalty, we deduce the threshold structure of the optimal policy and then obtain the approximate optimal single-sensor scheduling decision by solving a truncated linear program (LP). We prove that the solution to the LP is asymptotically optimal when the truncation threshold is sufficiently large.
• The sub-gradient ascent method is applied to find the optimal Lagrange multiplier satisfying the relaxed bandwidth constraint. Finally, we propose the truncated stationary policy to meet the hard bandwidth constraint. The average performance of the strategy is verified through theoretical analysis and numerical simulations.
The remainder of this paper is organized as follows. The network model, the AoI metric, and the age penalty function are introduced in Section 2. In Section 3, we formulate the primal scheduling problem, and then decouple it into independent single-sensor problems through bandwidth relaxation and Lagrange multipliers. The approximate optimal policy for single-sensor problem is obtained in Section 4 by solving an LP. In Section 5, the asymptotic optimal truncated policy is proposed. Finally, Section 6 provides simulation results to verify the performance of the proposed truncated policy, and Section 7 draws the conclusion.
Notations: All the vectors and matrices are denoted in boldface. The probability of A given B is denoted as Pr(A|B). Let E π [X] be the expectation of variable X given π. The cardinality of a set Ω is written as |Ω|.

Network Model
In this work, we consider a BS collecting update packets from N time-sensitive sensors through time-varying channels. Time is slotted, and t ∈ {1, 2, ..., T} denotes the current slot index. Define u_n(t) to be the scheduling decision for sensor n in slot t, where u_n(t) = 1 means sensor n sends its newest packet, while u_n(t) = 0 means idling the channel link. Assume that all scheduling decisions take place at the beginning of each slot and that the packet transmission delay through every channel link is one slot. Due to the limited bandwidth capacity of the BS, the total number of sensors scheduled in one slot cannot exceed M, i.e.,

∑_{n=1}^{N} u_n(t) ≤ M, ∀t. (1)

Here, we assume M < N so that the problem is nontrivial.

To model the time-varying effect, we assume that the channel link connecting the BS and each sensor n evolves as an ergodic Q-state Markov chain. Denote q_n(t) ∈ {1, 2, ..., Q} to be the channel state of link n in slot t. Without loss of generality, we assume that the channel becomes noisier as the state index grows. Denote P_n = {p^n_{ij}} to be the Markov state transition matrix of link n, where the entry p^n_{ij} is the probability of moving to state j in the next slot given the current state i, i.e.,

p^n_{ij} = Pr(q_n(t+1) = j | q_n(t) = i). (2)

Because the channel state varies, the sensors should adapt their transmit energy, both to save energy and to ensure successful decoding of the packet at the receiver. We denote w(q) to be the energy consumption for scheduling when the channel state is q. Notice that the energy consumption grows as the channel becomes noisier, in order to combat the channel fading, i.e., w(1) < w(2) < ... < w(Q). Besides, due to the power limit of each sensor n, the long-term average power consumption cannot exceed an upper bound, denoted by E_n, i.e.,

lim sup_{T→∞} (1/T) ∑_{t=1}^{T} E_π[u_n(t) w(q_n(t))] ≤ E_n, ∀n, (3)

where π is a scheduling policy. Given channel state q, we assume that a packet sent through link n is lost with probability ε_{n,q} due to decoding error or inaccurate estimation.
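To make the ergodic Markov channel model concrete, the following sketch simulates one channel link. The two-state matrix P and its entries are illustrative assumptions, not values from the paper; the empirical occupancy converges to the chain's stationary distribution.

```python
import random

# Hypothetical 2-state channel (Q = 2): state 0 = "good", state 1 = "bad".
# P[i][j] = Pr(next state j | current state i); each row sums to 1.
P = [[0.8, 0.2],
     [0.3, 0.7]]

def step_channel(q, rng):
    """Sample the next channel state from row q of P (0-indexed states)."""
    return 0 if rng.random() < P[q][0] else 1

rng = random.Random(0)
q = 0
counts = [0, 0]
for _ in range(100_000):
    q = step_channel(q, rng)
    counts[q] += 1

# The empirical occupancy approaches the stationary distribution solving
# eta = eta P, which for this matrix is eta = (0.6, 0.4).
empirical = [c / sum(counts) for c in counts]
```

Because the chain is ergodic, the long-run fraction of slots spent in each state is policy-independent, which is what makes the average power accounting in Equation (3) well defined.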

Age of Information and Age Penalty
In the network described above, the BS wishes to obtain the freshest information for further processing or accurate prediction. We model the data staleness at the terminal end as a monotone increasing age penalty function f(·) of the Age of Information (AoI). By definition, the AoI is the difference between the current time slot and the slot in which the freshest data received by the BS was generated by the sensor [3]. Let x_n(t) be the AoI of sensor n in slot t. According to the definition, if the sensor is scheduled in slot t and there is no packet loss, then x_n(t+1) = 1; otherwise, x_n(t+1) = x_n(t) + 1. In conclusion, the AoI evolves as follows,

x_n(t+1) = 1 if u_n(t) = 1 and the transmission succeeds, and x_n(t+1) = x_n(t) + 1 otherwise. (4)
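The one-slot AoI update above can be sketched directly; the toy schedule and loss probability below are illustrative, not from the paper.

```python
import random

def aoi_step(x, scheduled, loss_prob, rng):
    """One-slot AoI update: reset to 1 on a successful scheduled delivery,
    otherwise age by one. loss_prob stands in for eps_{n,q} of the current
    channel state (illustrative value)."""
    if scheduled and rng.random() >= loss_prob:
        return 1
    return x + 1

rng = random.Random(1)
x = 1
trace = []
for t in range(6):
    # Toy policy: schedule every other slot over a lossless channel.
    x = aoi_step(x, scheduled=(t % 2 == 0), loss_prob=0.0, rng=rng)
    trace.append(x)
# With loss_prob = 0, the AoI resets on even slots: trace == [1, 2, 1, 2, 1, 2]
```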

Problem Formulation
For a given network, we measure the data freshness at the terminal side by computing the average age penalty under scheduling policy π, denoted by J(π), i.e.,

J(π) = lim sup_{T→∞} (1/(NT)) ∑_{t=1}^{T} ∑_{n=1}^{N} E_π[f(x_n(t)) | x(0)], (5)

where x(0) = [x_1(0), x_2(0), ..., x_N(0)] is the initial AoI of the system. In this work, we assume that the system is synchronized with all the sensors at the beginning, i.e., x_n(0) = 1, ∀n, and thus omit x(0) in the subsequent analysis.
We denote Π_CP to be the set of all causal policies whose decisions are based only on current and historical information while satisfying both the bandwidth and power constraints. Our goal is then to optimize Equation (5) by choosing a scheduling policy π ∈ Π_CP. Therefore, the primal optimization problem can be written as

Problem 1 (Primal Scheduling Problem).

min_{π∈Π_CP} J(π) (6a)
s.t. ∑_{n=1}^{N} u_n(t) ≤ M, ∀t, (6b)
lim sup_{T→∞} (1/T) ∑_{t=1}^{T} E_π[u_n(t) w(q_n(t))] ≤ E_n, ∀n. (6c)

Problem Decomposition
Notice that Problem 1, with the integer constraint in Equation (6b), is an integer program whose state and action spaces grow exponentially with the number of sensors, which makes it intractable. Therefore, we formulate a relaxed version of Problem 1, in which the primal per-slot bandwidth constraint is replaced by a time-average bandwidth constraint. We will then show that the relaxed problem can be solved by sensor-level decoupling, which greatly reduces the cardinality of the state and action spaces.

Problem 2 (Relaxed Primal Scheduling Problem).

min_{π} J(π) (7a)
s.t. lim sup_{T→∞} (1/T) ∑_{t=1}^{T} E_π[∑_{n=1}^{N} u_n(t)] ≤ M, (7b)
lim sup_{T→∞} (1/T) ∑_{t=1}^{T} E_π[u_n(t) w(q_n(t))] ≤ E_n, ∀n. (7c)
Denote π*_R to be the optimal policy of Problem 2. The following theorem ensures that the optimal policy of the relaxed problem is composed of several local optimal policies π*_n, each of which depends only on its own channel state and AoI evolution, regardless of the others. Theorem 1. The optimal policy of Problem 2 can be decoupled into local optimal policies, i.e., π*_R = π*_1 ⊗ π*_2 ⊗ ··· ⊗ π*_N. Each of the local policies π*_n has the following properties.
The proof of Theorem 1 is provided in Appendix A.
To find the local optimal policies, we introduce a Lagrange multiplier λ ≥ 0 to eliminate the relaxed bandwidth constraint, and the Lagrange function is

L(π, λ) = J(π) + λ ( lim sup_{T→∞} (1/T) ∑_{t=1}^{T} E_π[∑_{n=1}^{N} u_n(t)] − M ),

where the Lagrange multiplier λ can be seen as a scheduling penalty that increases the function value if more than M sensors are scheduled per slot on average. For fixed λ, we can now further decouple the relaxed scheduling problem into N single-sensor cross-layer designs, each of which keeps the corresponding power constraint from Equation (7c):

min_{π_n} lim sup_{T→∞} (1/T) ∑_{t=1}^{T} E_{π_n}[f(x_n(t)) + λ u_n(t)] (8a)
s.t. lim sup_{T→∞} (1/T) ∑_{t=1}^{T} E_{π_n}[u_n(t) w(q_n(t))] ≤ E_n. (8b)

As the resolution of each decoupled problem is independent of n, we omit the subscript n in the subsequent analysis.

Constrained Markov Decision Process Formulation
First, we notice that the decoupled problem is a constrained Markov decision process of which the elements (S, A, Pr(·|·), c(·)) and constraint are explained as follows.
• State Space: The state of each sensor consists of two parts: the current AoI x(t) and the channel state q(t). Thus, S = {(x, q)} is infinite but countable.
• Action Space: There are two possible actions in the action space A, denoted by u(t). Action u(t) = 1 means the sensor chooses to schedule, while u(t) = 0 means idling. Notice that here u(t) does not need to satisfy the bandwidth constraint.
• Probability Transition Function: According to Equations (2) and (4), the probability transition function can be written as

Pr((x+1, q') | (x, q), u = 0) = p_{qq'},
Pr((1, q') | (x, q), u = 1) = p_{qq'} (1 − ε_q),
Pr((x+1, q') | (x, q), u = 1) = p_{qq'} ε_q.
• One-Step Cost: The one-step cost consists of two parts, the age penalty growth and the scheduling penalty, which can be computed by

c(x(t), q(t), u(t)) = f(x(t)) + λ u(t). (9)

The one-step power consumption is c_E(x(t), q(t), u(t)) = u(t) w(q(t)). (10)
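The two one-step costs can be written out directly; the particular choices f(x) = ln x, w(q) = q, and λ = 0.5 below are illustrative (the first two match the simulation section, λ is arbitrary).

```python
import math

def one_step_cost(x, q, u, f, lam):
    """Age-plus-scheduling cost c(x, q, u) = f(x) + lam * u (Equation (9))."""
    return f(x) + lam * u

def one_step_power(x, q, u, w):
    """Power consumption c_E(x, q, u) = u * w(q) (Equation (10))."""
    return u * w(q)

# Illustrative values: f(x) = ln x, w(q) = q, lam = 0.5.
f = math.log
w = lambda q: q
lam = 0.5

c = one_step_cost(3, 2, 1, f, lam)     # ln 3 + 0.5 when scheduling
idle = one_step_cost(3, 2, 0, f, lam)  # just f(3) when idling
e = one_step_power(3, 2, 1, w)         # w(2) = 2
```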
Our goal is now to optimize the average one-step cost,

min_π lim sup_{T→∞} (1/T) ∑_{t=1}^{T} E_π[c(x(t), q(t), u(t))], (11)

under the average power constraint,

lim sup_{T→∞} (1/T) ∑_{t=1}^{T} E_π[c_E(x(t), q(t), u(t))] ≤ E. (12)

Characterization of the Optimal Policy
To search for the stationary optimal policy, we can introduce another Lagrange multiplier ν ≥ 0 to eliminate the power constraint, i.e.,

L(π, ν) = lim sup_{T→∞} (1/T) ∑_{t=1}^{T} E_π[ c(x(t), q(t), u(t)) + ν (c_E(x(t), q(t), u(t)) − E) ].

The multiplier ν can be viewed as a power penalty, which increases the Lagrange function once the average power consumption exceeds the constraint. Minimizing the above Lagrange function for fixed ν then becomes an MDP problem without any constraint.
The following lemma ensures that the optimal stationary policy for the MDP problem has a threshold structure. Lemma 1. The optimal stationary policy of the unconstrained MDP problem has the threshold structure, i.e., given state (x, q), there exists a threshold τ q such that if x ≥ τ q , then it is optimal to schedule the sensor; otherwise, idling is the optimal action.
Proof sketch: The complete proof, which is similar to that of Theorem 2 in [23], is provided in Appendices B and C. Although the proof is involved, the intuition is simple: if it is optimal to schedule the sensor in state (x, q), then it is also optimal to schedule in any state (x', q) with x' > x, because the AoI, and hence the age penalty, is even larger.
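The threshold structure of Lemma 1 can be checked numerically on a small truncated instance. The sketch below runs discounted value iteration on an assumed two-state channel (all parameters are hypothetical, and the discounted, truncated setting only approximates the average-cost problem) and verifies that the resulting policy is of threshold type in the AoI.

```python
import math

# Small illustrative MDP: states (x, q), x in 1..X (truncated), q in {0, 1}.
X = 30
P = [[0.8, 0.2], [0.3, 0.7]]      # assumed channel transition matrix
eps = [0.1, 0.4]                   # assumed packet-loss probability per state
w = [1.0, 2.0]                     # assumed energy cost per state
lam, nu, beta = 0.3, 0.2, 0.95     # scheduling penalty, power multiplier, discount

f = lambda x: math.log(x)          # age penalty f(x) = ln x
V = [[0.0] * 2 for _ in range(X + 1)]   # V[x][q]; index x = 0 unused

def q_value(V, x, q, u):
    """Expected one-step cost plus discounted continuation for action u."""
    cost = f(x) + (lam + nu * w[q]) * u
    nxt = 0.0
    for q2 in range(2):
        if u == 1:
            # success -> AoI resets to 1; failure -> AoI grows (capped at X)
            nxt += P[q][q2] * ((1 - eps[q]) * V[1][q2]
                               + eps[q] * V[min(x + 1, X)][q2])
        else:
            nxt += P[q][q2] * V[min(x + 1, X)][q2]
    return cost + beta * nxt

for _ in range(500):   # discounted value iteration
    V = [[0.0] * 2] + [[min(q_value(V, x, q, 0), q_value(V, x, q, 1))
                        for q in range(2)] for x in range(1, X + 1)]

policy = [[0] * 2] + [[int(q_value(V, x, q, 1) <= q_value(V, x, q, 0))
                       for q in range(2)] for x in range(1, X + 1)]
# Threshold check (Lemma 1): once scheduling is optimal at some AoI, it stays
# optimal for every larger AoI in the same channel state.
is_threshold = all(policy[x][q] <= policy[x + 1][q]
                   for q in range(2) for x in range(1, X))
```

Since V is nondecreasing in x, the advantage of scheduling grows with the AoI, which is exactly the monotonicity argument in the proof sketch.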

Linear Programming Approximation
Now, we focus on finding the optimal stationary policy. Denote ξ_{x,q} to be the scheduling probability given state (x, q); our goal is to find {ξ*_{x,q}} that minimizes the objective function. In this part, we approximate {ξ*_{x,q}} by solving a truncated LP. According to Lemma 1, for the optimal stationary policy, we can set a threshold X > max_q τ_q such that ξ*_{x,q} = 1, ∀x ≥ X. Next, we focus on searching among policies that possess this threshold property, as other policies are far from optimal.
Let μ_{x,q} be the steady-state distribution of state (x, q). Then, define a new variable y_{x,q} ≜ μ_{x,q} ξ_{x,q}. The following theorem converts the CMDP problem into an infinite LP problem.

Theorem 2.
The single-sensor decoupled problem is equivalent to the following infinite LP problem.
Proof. Let us consider the average cost in Equation (8a) expressed through μ_{x,q} and y_{x,q}. Invoking Equation (9), the one-step cost of state (x, q) is either f(x) + λ when scheduling or f(x) when idling, which gives the time-average cost. Similarly, according to Equation (10), the time-average power consumption is exactly the LHS of Equation (13e). Considering the properties of a steady-state probability distribution, Equations (13b) and (13f) are verified. Finally, notice that the evolution of the state (x, q) forms a Markov chain, as depicted in Figure 1 (top) for Q = 2 as an example. We use α^x_{q,q'} = Pr((x+1, q') | (x, q)) and β^x_{q,q'} = Pr((1, q') | (x, q)) to denote the transition probabilities between the states, which can be computed as

α^x_{q,q'} = p_{qq'} (1 − ξ_{x,q}(1 − ε_q)),  β^x_{q,q'} = p_{qq'} ξ_{x,q}(1 − ε_q). (14)

By the property of a steady-state distribution, μ_{x,q} equals the sum, over all states that can move into (x, q) in one slot, of their steady-state probabilities times the corresponding transition probabilities. As depicted in Figure 1, μ_{2,2} = μ_{1,1} α^1_{1,2} + μ_{1,2} α^1_{2,2} (see the dashed lines). Therefore, we can compute μ_{x,q} as

μ_{x+1,q'} = ∑_{q=1}^{Q} μ_{x,q} α^x_{q,q'},  μ_{1,q'} = ∑_{x=1}^{∞} ∑_{q=1}^{Q} μ_{x,q} β^x_{q,q'}, (15)

which is equivalent to Equations (13c) and (13d).

Figure 1. Illustration of the state transition graph for Q = 2 channel states without (top) and with (bottom) AoI truncation, with AoI threshold X = 3. The numbers in circles are the channel state index q, and the numbers in rectangles are the AoI index x.
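The balance equations underlying Equations (13c) and (13d) can be checked numerically. The sketch below builds the finite truncated chain for an assumed threshold policy (all parameters hypothetical), power-iterates to the steady-state distribution, and verifies the same balance relation as the μ_{2,2} example in the proof.

```python
# Check the balance equation mu_{2,q'} = sum_q mu_{1,q} * alpha^1_{q,q'} on a
# small truncated chain (Q = 2, X = 4) under an assumed threshold policy.
X = 4
P = [[0.8, 0.2], [0.3, 0.7]]      # assumed channel transition matrix
eps = [0.1, 0.4]                   # assumed packet-loss probabilities
tau = [2, 3]                       # assumed thresholds: schedule when x >= tau[q]

def next_dist(mu):
    """One chain step: redistribute mass along alpha/beta transitions."""
    new = {(x, q): 0.0 for x in range(1, X + 1) for q in range(2)}
    for (x, q), p in mu.items():
        u = 1 if x >= tau[q] else 0
        succ = (1 - eps[q]) if u else 0.0
        for q2 in range(2):
            new[(1, q2)] += p * P[q][q2] * succ                    # beta^x_{q,q2}
            new[(min(x + 1, X), q2)] += p * P[q][q2] * (1 - succ)  # alpha^x_{q,q2}
    return new

mu = {(x, q): 1.0 / (2 * X) for x in range(1, X + 1) for q in range(2)}
for _ in range(2000):              # power iteration to the steady distribution
    mu = next_dist(mu)

# At x = 1 the policy idles (1 < tau[q]), so alpha^1_{q,q2} = P[q][q2]:
lhs = mu[(2, 1)]
rhs = mu[(1, 0)] * P[0][1] + mu[(1, 1)] * P[1][1]
```

At the fixed point, lhs and rhs coincide, mirroring the dashed-line example in Figure 1.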
As the steady-state distribution has infinite support, it is difficult to solve the problem exactly. Therefore, we approximate the optimization problem in Theorem 2 by a finite LP problem through truncation. After truncation, the optimal value of Equation (16a) is a lower bound on the objective of the decoupled problem in Equation (8a). The detailed proof is provided in Appendix D. The key idea is to set a threshold X and convert the Markov chain into a finite-state one (see Figure 1).
Moreover, the following theorem guarantees that the lower bound obtained by the above LP problem is tight when X is sufficiently large. Thus, the approximate optimal solution {μ̃*_{x,q}, ỹ*_{x,q}} performs close to the exact optimal solution {μ*_{x,q}, y*_{x,q}}. Before stating the theorem, denote π*_X and π* to be the scheduling policies corresponding to the approximate optimal solution {μ̃*_{x,q}, ỹ*_{x,q}} under threshold X and to the optimal solution {μ*_{x,q}, y*_{x,q}}, respectively. Define J_∞(π) to be the age and scheduling penalty of the primal problem under policy π, and J_X(π) to be the approximate penalty obtained by setting the age penalty to f(x) = f(X), ∀x ≥ X. According to Equations (13a) and (16a), the two objectives can be compared directly, and we have the following property (Theorem 3), where τ_max = max_q τ_q. As the stated inequality shows, the difference between the optimal solutions of Theorem 2 and Problem 4 converges to 0 as the threshold X becomes sufficiently large.
The complete proof is provided in Appendix E. After solving the above LP problem, we can obtain the approximate optimal scheduling probabilities {ξ̃*_{x,q}} by setting a sufficiently large X and computing {μ̃*_{x,q}} and {ỹ*_{x,q}}. Moreover, analogous to the threshold structure described in Lemma 1, {ξ̃*_{x,q}} also has the following property.

Lemma 2.
For any state (x, q) of each sensor, the optimal scheduling probability ξ̃*_{x,q} is nondecreasing in x, i.e., ξ̃*_{x,q} ≤ ξ̃*_{x',q}, ∀x ≤ x'. The proof technique is similar to that of Lemma 1, so it is omitted here.

Multi-Sensor Problem Resolution
By now, through relaxing, decoupling and truncation, we have obtained the approximate solution to the single-sensor decoupled problem for fixed scheduling penalty λ. In this section, we will go back to solve the multi-sensor problem, and propose a truncated policy to meet the hard bandwidth constraint in Equation (6b).

The Relaxed Problem Resolution
First, we should choose the optimal λ such that the relaxed bandwidth constraint is fully leveraged. Denote g(λ) = min_π L(π, λ) to be the Lagrange dual function, where we take the approximate optimal policy π*(λ) obtained by solving the LP. The dual function then decomposes as

g(λ) = ∑_{n=1}^{N} g_n(λ) − λM,

where g_n(λ) = min_{π_n} L_n(π_n, λ) is the decoupled dual function for sensor n. According to the LP approximation, g_n(λ) can be further written as

g_n(λ) = X_n(λ) + λ U_n(λ),

where X_n(λ) is the average age penalty, bounded by ∑_{x=1}^{X} ∑_{q=1}^{Q} f(x) μ̃^{n,*}_{x,q}, and U_n(λ) is the average scheduling probability, which equals ∑_{x=1}^{X} ∑_{q=1}^{Q} ỹ^{n,*}_{x,q}. According to the work in [24], the optimal Lagrange multiplier λ* satisfies the complementary slackness condition λ*(U(λ*) − M) = 0. If U(λ*) = M, then the optimal policy is simply π(λ*). Otherwise, the optimal policy is a mixture of two policies, denoted by π_l and π_u, constructed below.

We apply the sub-gradient ascent method to find the optimal multiplier, where the sub-gradient can be computed as

∂g(λ)/∂λ = U(λ) − M,

with U(λ) = ∑_{n=1}^{N} U_n(λ) being the total scheduling probability. Choose λ_0 = 0 as the starting point and compute the average scheduling probability U(λ_0). If U(λ_0) < M, the bandwidth constraint need not be considered, and the optimal solution has already been found. Moreover, this optimal solution can also be viewed as a lower bound of the primal optimization problem. Otherwise, we need to increase the scheduling penalty through iterations. The update operation in iteration k is

λ_{k+1} = λ_k + t_{k+1} (U(λ_k) − M),

where t_{k+1} is the step size in iteration k.
Moreover, the step size is determined adaptively, where γ ∈ (0, 1) is a constant; this choice of step size guarantees that the algorithm converges from both sides. Therefore, after running the whole algorithm, we obtain two total scheduling probabilities M_l ≤ M ≤ M_u, with corresponding optimal policies {μ̃^{n,l}_{x,q}, ỹ^{n,l}_{x,q}} and {μ̃^{n,u}_{x,q}, ỹ^{n,u}_{x,q}}. Then, the optimal stationary policy can be obtained by mixing these two policies:

{μ̃^{n,*}_{x,q}, ỹ^{n,*}_{x,q}} = θ {μ̃^{n,u}_{x,q}, ỹ^{n,u}_{x,q}} + (1 − θ) {μ̃^{n,l}_{x,q}, ỹ^{n,l}_{x,q}},

where the mixing coefficient is chosen so that the relaxed bandwidth constraint holds with equality, i.e., θ = (M − M_l)/(M_u − M_l). Now we have obtained the optimal stationary policy of the relaxed scheduling problem; the full procedure is listed in Algorithm 1. Once {μ̃^{n,*}_{x,q}, ỹ^{n,*}_{x,q}} is obtained, the optimal scheduling probabilities {ξ̃^{n,*}_{x,q}} can be computed as

ξ̃^{n,*}_{x,q} = 1, if x > X or μ̃^{n,*}_{x,q} = 0 or ξ̃^{n,*}_{x−1,q} = 1;  ξ̃^{n,*}_{x,q} = ỹ^{n,*}_{x,q} / μ̃^{n,*}_{x,q}, otherwise.
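The search for λ* exploits that the total scheduling probability U(λ) decreases as the penalty λ grows. The sketch below illustrates this with a bracketing-and-bisection variant of the sub-gradient update (a simplification of the paper's two-sided step-size rule); the surrogate U(λ) is a hypothetical stand-in for the value returned by the per-sensor LPs.

```python
M = 5.0

def U(lam):
    """Stand-in for the total scheduling probability sum_n U_n(lam); in the
    paper it comes from solving the per-sensor LPs. Monotone decreasing."""
    return 20.0 / (1.0 + lam)

# If the constraint is already slack at lambda = 0, no penalty is needed.
if U(0.0) <= M:
    lam_star = 0.0
else:
    lo, hi = 0.0, 1.0
    while U(hi) > M:              # grow an upper bracket for lambda*
        hi *= 2.0
    for _ in range(60):           # bisection on the monotone map U(lam)
        mid = (lo + hi) / 2.0
        if U(mid) > M:
            lo = mid
        else:
            hi = mid
    lam_star = (lo + hi) / 2.0
# At the optimum U(lam_star) ~= M; for this surrogate, lam_star ~= 3.
```

Complementary slackness then holds at λ*: either λ* = 0 with a slack constraint, or U(λ*) = M.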

Truncation for the Hard Bandwidth Constraint
Finally, a bandwidth-truncated policy π̂_X is derived from the optimal stationary policy π*_X to satisfy the hard bandwidth constraint in Equation (6b). Before introducing the truncated policy, denote S(t) to be the set of sensors selected for scheduling in slot t, and |S(t)| its cardinality. The construction of π̂_X is then as follows.
• In slot t, compute the scheduling set S(t) according to the optimal stationary policy π*_X.
• If |S(t)| ≤ M, then π̂_X schedules all these sensors, as π*_X does.
• If |S(t)| > M, the hard bandwidth constraint would be violated. Therefore, π̂_X randomly chooses M out of the |S(t)| sensors to schedule in the current slot.
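The truncation rule above can be sketched in a few lines; the candidate set and M below are arbitrary illustrative values.

```python
import random

def truncate_schedule(candidates, M, rng):
    """Bandwidth truncation: if the stationary policy selects more than M
    sensors, keep a uniformly random subset of size M (the pi-hat_X rule)."""
    if len(candidates) <= M:
        return set(candidates)
    return set(rng.sample(sorted(candidates), M))

rng = random.Random(7)
chosen = truncate_schedule({1, 4, 5, 8, 9}, 3, rng)   # |chosen| == 3
small = truncate_schedule({2}, 3, rng)                # kept unchanged
```

Because the subset is chosen uniformly at random, every selected sensor is dropped with the same probability, which is what the symmetry argument in Theorem 4 relies on.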
The following theorem guarantees the asymptotic performance of the truncated policy π̂_X compared with π*_X under certain conditions. Theorem 4. Suppose the age penalty function is concave, and let κ = M/N be a constant. If all the sensors and their channels are identical, i.e., the power constraints and the channel transition matrices are the same, then the truncated policy π̂_X and the optimal randomized policy π*_X satisfy lim_{N→∞} [J(π̂_X) − J(π*_X)] = 0.
The whole proof is provided in Appendix F.

Simulation Results
In this section, we provide simulation results to verify the performance of the proposed policy. First, we study the average age penalty for different types of sensors under different bandwidth constraints and AoI truncation thresholds X. Next, we study the detailed scheduling decision of each sensor. The average performance is obtained by simulating 10^5 consecutive slots.

Average Age Penalty Performance
In this part, we demonstrate the average performance of our proposed policy. We consider a 4-state channel, i.e., Q = 4. The age penalty function is chosen as f(x) = ln(x) unless otherwise specified. The transition matrix P_n is the same for every sensor and is given in Equation (21). Denote {η_q} to be the steady-state distribution of the channel states. For each channel state q, the energy consumption is w(q) = q. According to [12], the optimal policy minimizing the average AoI when all sensors are identical is a greedy policy, which schedules the M sensors with the largest AoI and consumes an average power of E_G = (M/N) ∑_{q=1}^{Q} η_q w(q) per sensor. Therefore, define ρ_n = E_n / E_G to be the power constraint factor, which describes the effect of the power consumption constraint E_n.

Figure 2 demonstrates the average age penalty of the proposed policy π̂_X as a function of the number of sensors N, with bandwidth constraints M = {5, 15}, compared with the lower bound, the relaxed optimal policy π*_X, and the greedy policy. We set the threshold X = ⌈20N/M⌉, where ⌈·⌉ is the ceiling function. We assume that the probability of packet loss is the same for each sensor, denoted by ε. The power constraint factor of sensor n is ρ_n = 0.2 + 1.4(n−1)/(N−1). As seen in Figure 2, the proposed truncated policy performs close to the relaxed optimal policy and the lower bound, and outperforms the greedy policy, especially when N is large. According to Figure 2, with N = 60 sensors, the proposed policy reduces the age penalty relative to the greedy policy by 18% and 23% for M = 5 and M = 15, respectively. Moreover, as the threshold X becomes large, the difference between the average performance of policy π*_X and the lower bound becomes indistinguishable, which verifies the asymptotic performance described in Theorem 3. From Figure 3, we can see that the AoI-minimum policy cannot guarantee a good age penalty performance.
Thus, it is necessary to account for the different demands for data freshness to achieve better performance.
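The greedy policy's per-sensor power E_G used in the setup above follows directly from the channel's steady distribution. The sketch below computes it by power iteration for an assumed 2-state chain (the paper's own matrix is the Q = 4 one in Equation (21); the matrix and N, M here are stand-ins).

```python
# Compute the channel steady distribution {eta_q} and the greedy policy's
# per-sensor power E_G = (M/N) * sum_q eta_q * w(q) for an assumed 2-state chain.
P = [[0.8, 0.2], [0.3, 0.7]]   # hypothetical transition matrix
w = [1, 2]                      # w(q) = q, 1-indexed channel states
N, M = 60, 15

eta = [0.5, 0.5]
for _ in range(1000):           # power iteration: eta <- eta P
    eta = [sum(eta[i] * P[i][j] for i in range(2)) for j in range(2)]

E_G = (M / N) * sum(eta[q] * w[q] for q in range(2))
# eta -> (0.6, 0.4), so E_G = 0.25 * (0.6*1 + 0.4*2) = 0.35
```

The power constraint factor ρ_n = E_n / E_G then expresses each sensor's budget relative to what the greedy policy would spend.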

Sensor Level Analysis and Threshold Structure
Next, we analyze the scheduling decision of each sensor and the corresponding age penalty to provide some insights into optimal scheduling policies. We consider a system with N = 8 and M = 2. The transition matrix of the channel states is the same as in Equation (21), and the power consumption is w(q) = q. We set the threshold X = 80 to compute the proposed policy.
First, we consider the system with Q = 4 and age penalty function f(x) = ln x. Figure 6 analyzes how the power constraint influences the age penalty of each sensor, where the power constraint factor of sensor n is ρ_n = 0.2n. From Figure 6, we can see that the proposed policy outperforms the greedy policy when the power budget is scarce, and performs similarly or slightly worse when ρ_n > 1. This implies that, by stimulating sensors with scarce power budgets to be scheduled in better channel states, our proposed policy chooses a more suitable power allocation based on the current channel state and AoI than the greedy policy does.

As packet loss influences the age penalty as well, Figure 7 considers sensors with different packet loss probabilities, written as the matrix {ε_{n,q}} in Equation (22). We fix the power constraint factor ρ_n = 0.6 for all sensors. Figure 7 shows that the average age penalty increases with the packet loss probability. Moreover, the proposed policy combats packet loss better than the greedy policy, as the proposed policy accounts for ε_{n,q} when solving the LP problem, whereas the greedy policy does not.

Next, we verify the threshold structure of the optimal scheduling policy. Figure 8 demonstrates the effect of bandwidth and packet loss on the scheduling threshold. The power constraint factor is ρ_n = 0.6, ∀n, and the packet loss probabilities are the same as in Equation (22). Subfigures (a-c) show three of these sensors, whose packet loss probabilities are given in the subfigure titles. For each of the three sensors, subfigures (d-f) consider the single-sensor system without the bandwidth constraint and display the corresponding scheduling probabilities. Moreover, Figure 8 lists some of the thresholds for given channel state q; e.g., in subfigure (a), the threshold of channel state q = 3 is x = 7, and the corresponding optimal scheduling probability is ξ_{7,3} = 0.9963.
From Figure 8, first, all six subfigures verify the non-decreasing property of the scheduling probability in the AoI x(t) described in Lemma 2. Second, subfigures (a-c) show that a sensor with a higher packet loss probability also has a higher scheduling threshold. This implies that sensors with more reliable channels should be given higher scheduling priority than unreliable ones to minimize the average age penalty, since scheduling a more reliable channel at the same AoI is more likely to reduce the current age penalty. Third, by comparing subfigures (a) and (d), (b) and (e), and (c) and (f), the scheduling threshold varies more significantly across channel states when there is no bandwidth constraint. The sensors then tend to update more often when the channel state is good and idle when it is bad, because without a bandwidth constraint they can update more frequently in good channel states, which both saves energy and increases the transmission success probability. Finally, we study the effect of the age penalty function on the threshold structure. Here, we consider a system with Q = 2, ρ_n = 0.2n, and three different penalty functions, i.e., f(x) = ln x, f(x) = x, and f(x) = x^2, in Figure 9, where we plot the scheduling decision of the sixth sensor. As depicted in Figure 9, when the system imposes a stricter restriction on data freshness, such as a linear or quadratic penalty, the differences between the thresholds of different channel states become smaller. In such situations, the channel states play a weaker role, because waiting another slot to schedule tends to incur an unbearable age penalty.

Conclusions
In this paper, we consider the multi-sensor scheduling problem over error-prone channels with Markovian states. Through relaxation and decoupling, we propose a truncated policy that satisfies both the bandwidth and power constraints while minimizing the average age penalty of all sensors over an infinite horizon. We prove the asymptotic performance of the truncated policy in symmetric networks when the age penalty function is concave, provided a sufficiently large threshold X is chosen. Through theoretical analysis and numerical simulations, we find that the age penalty function, packet loss probability, bandwidth constraint, and power constraint jointly influence the optimal scheduling decisions: sensors with more reliable channels and sufficient power budgets tend to have higher scheduling priority.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Proof of Theorem 1
First, we notice that Problem 2 is equivalent to the following optimization problem, where we further introduce variables ν n to denote the local bandwidth constraint of sensor n.

Problem A1 (Equivalent Relaxed Primal Scheduling Problem).
For each feasible fixed local bandwidth constraint vector {ν_n}, we can transform the above problem into the following one by removing constraints (A1c) and (A1e).

Problem A2 (Relaxed Problem for Fixed {ν n }).
Age*_R(ν_1, ⋯, ν_N) = min_π J(π), subject to the per-sensor constraints under the fixed {ν_n}. Then, according to the work in [25], the optimal policy of Problem A2 for given {ν_n} can be decoupled into several local policies. This is intuitive, as the constraints and the objective function of Problem A2 are decoupled across sensors. Since the optimal scheduling policy can be decoupled for each feasible {ν_n}, and recalling that Age*_R = min_{ν_1,⋯,ν_N} Age*_R(ν_1, ⋯, ν_N), the property also holds when {ν_n} takes the optimal value {ν*_n}.

Appendix B. Proof of Lemma 1
Before we proceed to the proof, we make two definitions. For any two states s = (x, q) and s' = (x', q) ∈ S, define a partial order ≤_s: s ≤_s s' if and only if x ≤ x'. Moreover, define a partial order ≤_a on the action space A: u ≤_a u' if and only if u ≤ u'.
The monotonicity of the optimal action in the state holds if the following four conditions are satisfied, where s, s' ∈ S, u, u' ∈ A, s⁺ ∈ S denotes the next state, and c(s, u) denotes the one-step cost given state s and action u:
1. If s ≤_s s', then c(s, u) ≤ c(s', u) for any u ∈ A;
2. If s ≤_s s' and u ≤_a u', then c(s, u) + c(s', u') ≤ c(s', u) + c(s, u');
3. The tail transition probability Pr(s⁺ ≥ s̄ | s, u) is nondecreasing in s for any u ∈ A and s̄ ∈ S;
4. The tail transition probability Pr(s⁺ ≥ s̄ | s, u) is submodular in (s, u) for any s̄ ∈ S.
Next, we consider a discounted cost MDP over a finite horizon with discount factor β and value function V_{k,β}, together with the corresponding Bellman equation. If the above four conditions are satisfied and the corresponding Bellman function is monotone increasing, then the one-step cost c(s, u) = c_x(x, q, u) + ν c_E(x, q, u) is monotone and submodular in s and u, which shows that there exists an optimal monotone policy for any finite-horizon MDP. Using the vanishing discount approach of Theorem 5.5.4 in [26], the monotonicity property carries over to the time-average MDP.
Before verifying the above four conditions, first we introduce the following lemma to ensure V k,β (x, q) is monotone increasing with x, whose proof is provided in Appendix C.
Lemma A1. For fixed channel state q and discounted factor β, V k,β (x, q) is monotone increasing with x.
Therefore, we only need to show the decoupled unconstrained problem satisfies the above four conditions.
Notice that the one-step cost can be computed as

c(s, u) = c_x(x, q, u) + ν c_E(x, q, u) = f(x) + λu + ν u w(q). (A6)

According to the definitions of the partial orders ≤_s and ≤_a, together with Equation (A6), we can easily verify conditions 1 and 3.
We also have Equation (A7). According to Equation (A7) and the fact that V(x, q) < V(x', q), ∀x < x', it is also straightforward to verify that conditions 2 and 4 hold.

Appendix C. Proof of Lemma A1
In this part, we will prove that V k,β (x, q) in the finite-time horizon MDP is an increasing function with x. The key method of the proof is through induction.
Invoking Equations (A8) and (A9), the above equality can be rewritten in a form exactly equal to Equation (16e). Notice that the optimal action is to schedule when x ≥ X, i.e., ξ̃*_{x,q} = 1; thus, ỹ*_{x,q} = μ̃*_{x,q}, ∀x ≥ X, which yields Equation (16h). Through this variable substitution, we have verified that the constraints of the finite LP problem are equivalent to those of the original optimization problem, so the optimal solution of the finite LP problem is also feasible for the original problem. Turning to the objective function, we conclude that the finite LP problem approximates the decoupled single-sensor problem: it has the same constraints, and its optimal value is a lower bound on that of the original problem.
(A17)

Therefore, ∑_{q=1}^{Q} ρ^X_{x,q} can be bounded as follows, where 1 is the all-one vector: step (a) holds because of Equation (A16), step (b) holds due to Equation (A17), and step (c) holds because 1 is an eigenvector of P.

Appendix F. Proof of Theorem 4
According to Lemma 2, the optimal policy for every decoupled single-sensor problem has a threshold structure. Let τ^n_q be the threshold of sensor n given channel state q. Denote Γ_n = max_q τ^n_q − min_q τ^n_q to be the largest difference between the thresholds of sensor n, and Γ = max_n Γ_n. As all the sensors are identical, Γ does not change with N. Moreover, let S̄(t) = E_{π*_X}[|S(t)|]. Suppose that sensor n is not scheduled under π̂_X although u_n(t) = 1, and consider the probability that it is still not scheduled in the next slot, which can happen for two reasons: the channel jumps into a state with a higher scheduling threshold, or there are still more than M sensors to be scheduled. Let p be the probability that the channel state jumps into a state with a higher scheduling threshold; then the probability of idling in the next slot, denoted by p_idle, can be computed accordingly. Notice that p can be upper bounded as p ≤ max_{n,q} ∑_{q'≠q} p^n_{qq'} = max_{n,q} (1 − p^n_{qq}).
Therefore, p_idle can be upper bounded by a constant z. It follows that the probability of remaining unscheduled for k consecutive slots is upper bounded by z^{(k−Γ_n)^+}, where (·)^+ = max(·, 0). Now, we bound the performance gap between π̂_X and π*_X by introducing another policy π̃_X. Under π̃_X, when |S(t)| > M, all of these sensors are scheduled as under π*_X, but an extra penalty a_n(t) is added for those sensors, where (a) holds because f(x) is concave. For simplicity, denote A = Γ + 1/(1−z), which is a constant once the channel transition matrix and power constraint are fixed.
As the age penalty function f(x) is concave, the average age penalty under π̃_X is no smaller than that under π̂_X. The difference between J(π̂_X) and J(π*_X) can then be bounded accordingly. Notice that when x > X, the optimal action is to schedule; hence, the probability that x_n(t) > x for x > X is upper bounded by (max_{n,q} ε_{n,q})^{x−X}. For simplicity, let ρ = max_{n,q} ε_{n,q}. Then, for any δ > 0, there exists k = ⌈ln δ / ln ρ⌉ such that the steady-state probability μ^n_{x,q} is bounded by δ for all x > X + k and all n. We thus obtain the bound in Equation (A19), whose first term involves E[||S(t)| − S̄(t)| + |S̄(t) − M|] and whose second term involves E[𝟙{x_n(t) > X + k} f(x_n(t))]. As f(x) is concave, it can be upper bounded by a linear function, i.e., f(x) ≤ mx, ∀x > X + k, so the second term can be further bounded. By choosing δ = N^{−1} and k = O(ln N), the second term in Equation (A19) converges to 0 as N becomes large.
For the first term in Equation (A19), according to the work in [27], the expectation E[||S(t)| − S̄(t)|] has the following property. In addition, as policy π*_X satisfies the relaxed bandwidth constraint, we have S̄(t) ≤ M. Therefore, when T → ∞, J(π̂_X) − J(π*_X) can be upper bounded accordingly. As the threshold X does not increase with N, J(π̂_X) − J(π*_X) converges to 0 as N becomes infinite, which establishes the asymptotic performance of the truncated policy.