Scheduling to Minimize Age of Incorrect Information with Imperfect Channel State Information

In this paper, we study a slotted-time system in which a base station needs to update multiple users simultaneously. Due to limited resources, only a subset of the users can be updated in each time slot. We consider the problem of minimizing the Age of Incorrect Information (AoII) when only imperfect Channel State Information (CSI) is available. Leveraging the notion of the Markov Decision Process (MDP), we obtain structural properties of the optimal policy. By introducing a relaxed version of the original problem, we develop Whittle's index policy under a simple condition. However, indexability is required to ensure the existence of Whittle's index. To circumvent the indexability requirement, we develop the Indexed priority policy based on the optimal policy for the relaxed problem. Finally, numerical results are laid out to showcase the application of the derived structural properties and to highlight the performance of the developed scheduling policies.


Introduction
The Age of Incorrect Information (AoII) is introduced in [1] as a combination of age-based metrics (e.g., the Age of Information (AoI)) and error-based metrics (e.g., the Minimum Mean Square Error). In communication systems, AoII captures not only the information mismatch between the source and the destination but also the aging process of the inconsistent information. Hence, AoII is governed by two functions. The first is the time penalty function, which reflects how the inconsistency of information affects the system over time. In real-life applications, inconsistent information affects different communication systems in different ways. For example, machine temperature monitoring is time-sensitive because the damage caused by overheating accumulates quickly, whereas reservoir water level monitoring is less sensitive to time. Therefore, by adopting different time penalty functions, AoII can capture the different aging processes of the mismatch in different systems. The second is the information penalty function, which captures the information mismatch between the source and the destination. It allows us to measure mismatches in different ways, depending on how sensitive a system is to information inconsistencies. For example, a navigation system requires precise information to give correct instructions, while a real-time delivery tracking system does not need very accurate location information. Since we can choose different penalty functions for different systems, AoII is adaptable to various communication goals, which is why it is regarded as a semantic metric [2].
Since the introduction of AoII, several studies have been performed to reveal its fundamental nature. The authors of [3] consider a system with random packet delivery times and compare AoII with AoI and real-time error via extensive numerical results. The authors of [4] study the problem of minimizing an AoII that takes a general time penalty function. Three real-life applications are considered to showcase the performance advantages of AoII over AoI and real-time error. In [5], the authors investigate an AoII that quantifies the mismatch between the source and the destination. The optimization problem is studied when the system is resource-constrained. The authors of [6]

System Model

For user i, the source process is modeled by a two-state Markov chain where transitions happen between the two states with probability $p_i > 0$ and self-transitions happen with probability $1 - p_i$. At any time slot t, the state of the source process $X_{i,t} \in \{0, 1\}$ is reported to the base station as an update, and the base station decides whether to transmit this update through the corresponding channel. The channel is unreliable, but an estimate of the Channel State Information (CSI) is available at the beginning of each time slot. Let $r_{i,t} \in \{0, 1\}$ be the CSI at time t. We assume that $r_{i,t}$ is independent across time and user indices; $r_{i,t} = 1$ if and only if a transmission attempt at time t would succeed, and $r_{i,t} = 0$ otherwise. Then, we denote by $\hat{r}_{i,t} \in \{0, 1\}$ the estimate of $r_{i,t}$. We assume that $\hat{r}_{i,t}$ is an independent Bernoulli random variable with parameter $\gamma_i$, i.e., $\hat{r}_{i,t} = 1$ with probability $\gamma_i \in [0, 1]$ and $\hat{r}_{i,t} = 0$ with probability $1 - \gamma_i$. However, the estimate is imperfect. We assume that the error depends only on the user and its estimate. More precisely, we define the probability of error as $p^{\hat{r}_i}_{e,i} \triangleq \Pr[r_i \neq \hat{r}_i \mid \hat{r}_i]$. We assume $p^{\hat{r}_i}_{e,i} < 0.5$ because we can flip the estimate if $p^{\hat{r}_i}_{e,i} > 0.5$, and we are not interested in the case of $p^{\hat{r}_i}_{e,i} = 0.5$ since $\hat{r}_{i,t}$ is useless in that case. Although the channel is unreliable, each transmission attempt takes exactly one time slot regardless of the result, and a successfully transmitted update is not corrupted. Every time an update is received, the receiver uses it as the new estimate $\hat{X}_{i,t}$. The receiver sends an ACK/NACK packet to inform the base station of its reception of the new update. Since an ACK/NACK packet is generally very small and simple, we assume that it is transmitted reliably and received instantaneously. Then, if an ACK is received, the base station knows that the receiver's estimate changed to the transmitted update; if a NACK is received, the base station knows that the receiver's estimate did not change. Therefore, the base station always knows the estimate at the receiver side.
At the beginning of each time slot, the base station receives updates from each source and the estimates of CSI from each channel. The old updates and estimates are discarded upon the arrival of new ones. Then, the base station decides which updates to transmit, and the decision is independent of the transmission history. Due to the limited resources, at most M < N updates are allowed per transmission attempt. We consider a base station that always transmits M updates.

Age of Incorrect Information
All the users adopt AoII as the performance metric, but the choices of penalty functions vary. Let $X_t$ and $\hat{X}_t$ be the true state of the source process and its estimate at the receiver, respectively. Then, in a slotted-time system, AoII can be expressed as
$$\Delta_{AoII}(t) = \sum_{k=U_t+1}^{t} F(k - U_t)\, g(X_k, \hat{X}_k), \qquad (1)$$
where $U_t$ is the last time instance before time t (including t) at which the receiver's estimate is correct, $g(X_t, \hat{X}_t)$ can be any information penalty function that captures the difference between $X_t$ and $\hat{X}_t$, and $F(t) \triangleq f(t) - f(t-1)$, where $f(t)$ can be any time penalty function that is non-decreasing in t. We consider the case where the users adopt the same information penalty function $g(X_t, \hat{X}_t) = |X_t - \hat{X}_t|$ but possibly different time penalty functions. To ease the analysis, we require $f(t)$ to be unbounded. Put together, we require $f(t_1) \le f(t_2)$ if $t_1 < t_2$ and $\lim_{t \to +\infty} f(t) = +\infty$. Without loss of generality, we assume $f(0) = 0$. As the source is modeled by a two-state Markov chain, $g(X_t, \hat{X}_t) \in \{0, 1\}$. Hence, Equation (1) simplifies to
$$\Delta_{AoII}(t) = f(s_t),$$
where $s_t \triangleq t - U_t$. Therefore, the evolution of $s_t$ is sufficient to characterize the evolution of AoII. To this end, we distinguish between the following cases.

• When the receiver's estimate is correct at time t + 1, we have $U_{t+1} = t + 1$. Then, by definition, $s_{t+1} = 0$.
• When the receiver's estimate is incorrect at time t + 1, we have $U_{t+1} = U_t$. Then, by definition, $s_{t+1} = t + 1 - U_t = s_t + 1$.
To sum up, we get
$$s_{t+1} = \mathbb{1}\{U_{t+1} \neq t+1\} \times (s_t + 1). \qquad (2)$$
A sample path of $s_t$ is shown in Figure 2. In the remainder of this paper, we use $f_i(\cdot)$ to denote the time penalty function adopted by user i. A short simulation sketch of recursion (2) is given below.
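To make recursion (2) concrete, the following minimal Python sketch simulates $s_t$ for a single user whose estimate is never updated (no transmissions), so that only the source dynamics drive the mismatch. The parameter p = 0.3, the horizon, and the linear penalty f(s) = s are illustrative assumptions, not values prescribed by the paper.

```python
import random

def simulate_s(p=0.3, T=20, f=lambda s: s, seed=0):
    """Simulate recursion (2) for a two-state Markov source.

    The receiver's estimate is held fixed (no transmissions), which
    isolates the evolution of s_t; p is the source transition probability.
    """
    rng = random.Random(seed)
    X, X_hat = 0, 0            # source state and receiver's estimate
    s, path = 0, []
    for t in range(1, T + 1):
        if rng.random() < p:   # source transitions w.p. p
            X = 1 - X
        # Recursion (2): s resets to 0 whenever the estimate is correct.
        s = 0 if X == X_hat else s + 1
        path.append((t, s, f(s)))
    return path

for t, s, penalty in simulate_s():
    print(f"t={t:2d}  s={s:2d}  AoII=f(s)={penalty}")
```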

Remark 1.
Under this particular choice of penalty functions, $s_t$ can be interpreted as the time elapsed since the last time the receiver's estimate was correct. Please note that $s_t$ is different from the Age of Information (AoI) [17], which is defined as the time elapsed since the generation time of the last received update: AoI considers the aging process of the update, while AoII considers the aging process of the estimation error. At the same time, $s_t$ is also fundamentally different from the holding time, which, according to [18,19], is defined as the time elapsed since the last successful transmission. We notice that the receiver's estimate can become correct even when no new update is successfully transmitted; moreover, the information carried by an update may have become incorrect by the time it is received. We also notice that [18,19] consider the problem of minimizing the estimation error, whereas, by adopting AoII as the performance metric, we study the impact of the estimation error on the system over time.

System Dynamic
In this section, we characterize the system dynamics. We notice that the status of user i can be captured by the pair $x_{i,t} \triangleq (s_{i,t}, \hat{r}_{i,t})$; in the following, we use $x_{i,t}$ and $(s_{i,t}, \hat{r}_{i,t})$ interchangeably. Then, the system dynamics can be fully characterized by the dynamics of $x_t \triangleq (x_{1,t}, \ldots, x_{N,t})$. Hence, it suffices to characterize the value of $x_{t+1}$ given $x_t$ and the base station's action. To this end, we denote by $a_t = (a_{1,t}, \ldots, a_{N,t})$ the base station's action at time t, where $a_{i,t} = 1$ if the base station transmits the update from user i at time t and $a_{i,t} = 0$ otherwise. We notice that, given action $a_t$, the users are independent and the action taken on user i only affects user i itself. Consequently,
$$\Pr(x_{t+1} \mid x_t, a_t) = \prod_{i=1}^{N} \Pr(x_{i,t+1} \mid x_{i,t}, a_{i,t}). \qquad (3)$$
Combined with the fact that all the users share the same structure, it is sufficient to study the dynamics of a single user. In the following discussion, we drop the user-dependent subscript i. We recall that $\hat{r}_{t+1}$ is an independent Bernoulli random variable. Then, we have $\Pr(x_{t+1} \mid x_t, a_t) = \Pr(\hat{r}_{t+1}) \Pr(s_{t+1} \mid x_t, a_t)$, where, by definition, $\Pr(\hat{r}_{t+1} = 1) = \gamma$ and $\Pr(\hat{r}_{t+1} = 0) = 1 - \gamma$. Then, we only need to tackle the value of $\Pr(s_{t+1} \mid x_t, a_t)$. To this end, we distinguish between the following cases.
• When $x_t = (0, \hat{r}_t)$, the estimate at time t is correct (i.e., $\hat{X}_t = X_t$). Hence, for the receiver, $X_t$ carries no new information about the source process; in other words, $\hat{X}_{t+1} = \hat{X}_t$ regardless of whether an update is transmitted at time t. We recall that $U_{t+1} = U_t$ if $\hat{X}_{t+1} \neq X_{t+1}$ and $U_{t+1} = t + 1$ otherwise. Since the source is binary, we obtain $U_{t+1} = U_t$ if $X_{t+1} \neq X_t$, which happens with probability p, and $U_{t+1} = t + 1$ otherwise. According to (2), we obtain
$$\Pr(s_{t+1} = 1 \mid (0, \hat{r}_t), a_t) = p, \qquad \Pr(s_{t+1} = 0 \mid (0, \hat{r}_t), a_t) = 1 - p.$$
• When $a_t = 0$ and $x_t = (s_t, \hat{r}_t)$, where $s_t > 0$, the channel is not used and no new update is received, so $\hat{X}_{t+1} = \hat{X}_t$. Since $X_t \neq \hat{X}_t$ and the source is binary, we have $U_{t+1} = U_t$ if $X_{t+1} = X_t$, which happens with probability $1 - p$, and $U_{t+1} = t + 1$ otherwise. According to (2), we obtain
$$\Pr(s_{t+1} = s_t + 1 \mid (s_t, \hat{r}_t), 0) = 1 - p, \qquad \Pr(s_{t+1} = 0 \mid (s_t, \hat{r}_t), 0) = p.$$
• When $a_t = 1$ and $x_t = (s_t, 1)$, where $s_t > 0$, the transmission attempt succeeds with probability $1 - p^1_e$ and fails with probability $p^1_e$. When the transmission attempt succeeds (i.e., $\hat{X}_{t+1} = X_t$), $U_{t+1} = t + 1$ if $X_{t+1} = X_t$ and $U_{t+1} = U_t$ otherwise. When the transmission attempt fails (i.e., $\hat{X}_{t+1} = \hat{X}_t \neq X_t$), we have $U_{t+1} = U_t$ if $X_{t+1} = X_t$ and $U_{t+1} = t + 1$ otherwise. Combining (2) with the dynamics of the source process, we obtain
$$\Pr(s_{t+1} = s_t + 1 \mid (s_t, 1), 1) = (1 - p^1_e)\,p + p^1_e\,(1 - p) \triangleq \alpha, \qquad \Pr(s_{t+1} = 0 \mid (s_t, 1), 1) = 1 - \alpha.$$
• When $a_t = 1$ and $x_t = (s_t, 0)$, where $s_t > 0$, following the same line, we obtain
$$\Pr(s_{t+1} = s_t + 1 \mid (s_t, 0), 1) = (1 - p^0_e)(1 - p) + p^0_e\,p \triangleq \beta, \qquad \Pr(s_{t+1} = 0 \mid (s_t, 0), 1) = 1 - \beta.$$
Combined together, we obtain the value of $\Pr(s_{t+1} \mid x_t, a_t)$ in all cases; an illustrative computation is sketched below. As only M out of N updates are allowed per transmission attempt, it is natural to require that a transmission attempt always helps minimize AoII. This is equivalent to imposing $\Pr(s_{t+1} > s_t \mid (s_t, \hat{r}_t), a_t = 0) > \Pr(s_{t+1} > s_t \mid (s_t, \hat{r}_t), a_t = 1)$ for any $(s_t, \hat{r}_t)$. Leveraging the results above, it is sufficient to require $p < 0.5$. As all the users share the same structure, we assume, for the rest of this paper, that $0 < p_i < 0.5$ for $1 \le i \le N$.
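As a quick numerical check of the cases above, the sketch below evaluates $\Pr(s_{t+1} = s_t + 1)$ for $s_t > 0$ under each action and CSI estimate; the parameter values are hypothetical.

```python
def transition_probs(p, pe0, pe1):
    """Per-slot probability that s increases, for s_t > 0 (Section 2.3).

    p   : source transition probability (0 < p < 0.5)
    pe0 : CSI estimation error probability given r_hat = 0
    pe1 : CSI estimation error probability given r_hat = 1
    """
    alpha = (1 - pe1) * p + pe1 * (1 - p)    # a=1, r_hat=1
    beta = (1 - pe0) * (1 - p) + pe0 * p     # a=1, r_hat=0
    idle = 1 - p                             # a=0, either r_hat
    return {"idle": idle, "transmit, r_hat=1": alpha, "transmit, r_hat=0": beta}

for case, q in transition_probs(p=0.3, pe0=0.1, pe1=0.1).items():
    print(f"{case:20s} Pr(s increases) = {q:.3f}")
# With p < 0.5, transmitting never raises the probability that s grows,
# which is exactly the requirement imposed above.
```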

Problem Formulation
The communication goal is to minimize the expected AoII. Therefore, the problem can be formulated as
$$\min_{\phi \in \Phi} \; \limsup_{T \to +\infty} \frac{1}{T}\, \mathbb{E}_\phi\!\left[\sum_{t=1}^{T} \sum_{i=1}^{N} f_i(s_{i,t})\right] \quad \text{subject to} \quad \sum_{i=1}^{N} a_{i,t} = M, \; \forall t, \qquad (4)$$
where $\Phi$ is the set of all causal policies. We refer to the constrained minimization problem reported in (4) as the Primal Problem (PP). We notice that PP is a Restless Multi-Armed Bandit (RMAB) problem. The optimal policy for this type of problem is generally out of reach, since the problem is PSPACE-hard in general [20]. However, we can still derive structural properties of the optimal policy. These structural properties can guide the development of scheduling policies and can indicate whether a developed scheduling policy performs well.

Structural Properties of the Optimal Policy
In this section, we investigate the structural properties of the optimal policy for PP. We first define an infinite-horizon average-cost Markov Decision Process (MDP), denoted by $M_N(w, M)$, as follows.
• The state is $x = (x_1, \ldots, x_N)$ and the feasible actions are $a = (a_1, \ldots, a_N)$ with $\sum_{i=1}^{N} a_i = M$. Note that the feasible actions are independent of the state and the time.
• $P_N$ denotes the state transition probabilities. We define $P_{x,x'}(a)$ as the probability that action a at state x leads to state $x'$. It is calculated by
$$P_{x,x'}(a) = \prod_{i=1}^{N} \Pr(\hat{r}'_i)\, P_{s_i, s'_i}(a_i, \hat{r}_i),$$
where $P_{s_i, s'_i}(a_i, \hat{r}_i)$ is the transition probability from $s_i$ to $s'_i$ when the estimate of CSI is $\hat{r}_i$ and action $a_i$ is taken. The values of $P_{s_i, s'_i}(a_i, \hat{r}_i)$ follow easily from the results in Section 2.3.
• $C_N(w)$ denotes the instant cost. When the system is at state x and action a is taken,
$$C_N(w)(x, a) = \sum_{i=1}^{N} f_i(s_i) + w \sum_{i=1}^{N} a_i.$$
We notice that PP can be cast into $M_N(0, M)$. Since $w = 0$, the instant cost is independent of action a; therefore, we abbreviate $C(x, a)$ as $C(x)$. To simplify the analysis, we consider the case of $M = 1$. Equivalently, we investigate the structural properties of the optimal policy for $M_N(0, 1)$.

Remark 2.
For the case of M > 1, we can apply the same methodology. However, as M increases, the action space will grow quickly, resulting in the need to consider more feasible actions in each step of the proof. Hence, to better demonstrate the methodology, we only consider the case of M = 1 in this paper.
It is well known that the optimal policy for $M_N(0, 1)$ can be characterized by the value function. We denote the value function of state x by $V(x)$. A canonical procedure to calculate $V(x)$ is to apply the Value Iteration Algorithm (VIA). To this end, we define $V_\nu(\cdot)$ as the estimated value function at iteration $\nu$ of VIA and initialize $V_0(\cdot) = 0$. Then, VIA updates the estimated value functions in the following way:
$$V_{\nu+1}(x) = \min_{a} \Big\{ C(x) - \theta + \sum_{x'} P_{x,x'}(a)\, V_\nu(x') \Big\}, \qquad (5)$$
where $\theta$ is the optimal value of $M_N(0, 1)$. VIA is guaranteed to converge to the value function [21]; more precisely, $V_\nu(\cdot) \to V(\cdot)$ as $\nu \to +\infty$. However, the exact value function is out of reach, since we would need infinitely many iterations and the state space is infinite. A finite-state sketch of the update (5) is given below.
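The following Python sketch mirrors update (5) under two simplifying assumptions: the state space has been truncated to a finite set, and the optimal value θ is known (in practice, one would use the Relative Value Iteration of Section 5.1, which removes the need to know θ). The matrices P and the cost vector C are placeholders to be built from the transition probabilities of Section 2.3.

```python
import numpy as np

def via_sketch(P, C, theta, iters=500):
    """Finite-state sketch of update (5):
    V_{v+1}(x) = min_a { C(x) - theta + sum_x' P[a][x, x'] V_v(x') }.

    P is a list of |S| x |S| transition matrices (one per action),
    C is the state-cost vector, and theta the assumed-known optimal value.
    """
    V = np.zeros(C.shape[0])                  # V_0(.) = 0
    for _ in range(iters):
        Q = np.stack([C - theta + P_a @ V for P_a in P])
        V = Q.min(axis=0)                     # minimize over actions
    return V
```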
Instead, we provide two structural properties of the value function.

Lemma 1 (Monotonicity). For $M_N(0, 1)$, the value function $V(x)$ is non-decreasing in $s_j$ for any $1 \le j \le N$.

Proof. Leveraging the iterative nature of VIA, we use mathematical induction to prove the desired results. The complete proof can be found in Appendix A.
Before introducing the next structural property, we make the following definition.
Definition 1 (Statistically identical). Two users are said to be statistically identical if the user-dependent parameters and the adopted time penalty functions are the same.
For users that are statistically identical, we can prove the following.

Lemma 2 (Equivalence). For $M_N(0, 1)$, if users j and k are statistically identical, then $V(x) = V(P(x))$, where $P(x)$ is state x with $x_j$ and $x_k$ exchanged.
Proof. Leveraging the iterative nature of VIA, we use mathematical induction to prove the desired results. At each iteration, we show that for each feasible action at state x, we can find an equivalent action at state P(x). Two actions are equivalent if they lead to the same value function. The complete proof can be found in Appendix B.
Equipped with the above lemmas, we proceed with characterizing the structural properties of the optimal policy. We recall that the optimal action at each state can be characterized by the value function. Hence, we denote by $V_j(x)$ the value function resulting from choosing user j to update at state x. Then, $V_j(x)$ can be calculated by
$$V_j(x) = C(x) - \theta + \sum_{x'} P_{x,x'}(a^{(j)})\, V(x'),$$
where $a^{(j)}$ denotes the action with $a_j = 1$ and $a_i = 0$ for $i \neq j$. If $V_j(x) < V_k(x)$ for all $k \neq j$, it is optimal to transmit the update from user j. When $V_j(x) = V_k(x)$, the two choices are equally desirable. In the following, we characterize the properties of $\delta_{j,k}(x) \triangleq V_j(x) - V_k(x)$ for any j and k.

Theorem 1 (Properties of $\delta_{j,k}(x)$). For $M_N(0, 1)$, $\delta_{j,k}(x)$ possesses the following properties.

1. $\delta_{j,k}(x) \le 0$ if $\hat{r}_k = p^0_{e,k} = 0$. The equality holds when $s_j = 0$ or $\hat{r}_j = p^0_{e,j} = 0$.

2. $\delta_{j,k}(x)$ is non-increasing in $\hat{r}_j$ and non-decreasing in $\hat{r}_k$ when $s_j, s_k > 0$. At the same time, $\delta_{j,k}(x)$ is independent of $\hat{r}_i$ for any $i \neq j, k$.

3. $\delta_{j,k}(x) \le 0$ if $s_k = 0$. The equality holds when $s_j = 0$ or $\hat{r}_j = p^0_{e,j} = 0$.

4. $\delta_{j,k}(x)$ is non-increasing in $s_j$ if $\Gamma^{\hat{r}_j}_j \le \Gamma^{\hat{r}_k}_k$ and non-decreasing in $s_k$ if $\Gamma^{\hat{r}_j}_j \ge \Gamma^{\hat{r}_k}_k$ when $s_j, s_k > 0$, where $\Gamma^1_i \triangleq \alpha_i / (1 - p_i)$ and $\Gamma^0_i \triangleq \beta_i / (1 - p_i)$.

5. $\delta_{j,k}(x) \le 0$ if $s_j \ge s_k$, $\hat{r}_j \ge \hat{r}_k$, and users j and k are statistically identical.
Proof. The proof can be found in Appendix C.
We notice that $\Gamma^{\hat{r}_i}_i$ can be written as
$$\Gamma^{\hat{r}_i}_i = \frac{P_{s_i, s_i+1}(1, \hat{r}_i)}{P_{s_i, s_i+1}(0, \hat{r}_i)},$$
where $s_i$ can be any positive integer. Consequently, $\Gamma^{\hat{r}_i}_i$ is independent of any $s_i > 0$ and indicates the decrease in the probability of increasing $s_i$ caused by action $a_i = 1$: when $\Gamma^{\hat{r}_i}_i$ is large (close to one), action $a_i = 1$ achieves only a small decrease in the probability of increasing $s_i$. In the following, we provide an intuitive interpretation of why the monotonicity in Property 4 of Theorem 1 depends on $\Gamma^{\hat{r}_i}_i$. We take the case of $\Gamma^{\hat{r}_j}_j \le \Gamma^{\hat{r}_k}_k$ as an example and assume that there are only users j and k in the system. Then, according to Section 2.3, the dynamics of $s_j$ and $s_k$ can be divided into the following three cases.
• Neither $s_j$ nor $s_k$ increases. In this case, both $s_j$ and $s_k$ become zero.
• Either $s_j$ or $s_k$ increases and the other becomes zero. We denote by $P^j_k$ the probability that only $s_k$ increases when $a_j = 1$; the notation for the other cases is defined analogously. The probabilities can be obtained easily using the results in Section 2.3.
• Both $s_j$ and $s_k$ increase. We denote by $P_j$ the probability that both $s_j$ and $s_k$ increase when $a_j = 1$; $P_k$ is defined analogously. The probabilities can be obtained easily using the results in Section 2.3.
We notice that $\delta_{j,k}(x)$ reflects the tendency of the base station when choosing between the two users: the larger $\delta_{j,k}(x)$ is, the more the base station tends to choose user k. Thus, we investigate the base station's propensity to choose user k when $s_k$ increases but $s_j$ stays the same. We ignore the case where the resulting $s_k$ is zero, since it is independent of the increase in $s_k$. With this in mind, we first notice that $P^k_k \le P^j_k$. Meanwhile, we can easily verify that
$$\frac{P_j}{P_k} = \frac{\Gamma^{\hat{r}_j}_j}{\Gamma^{\hat{r}_k}_k}.$$
Hence, when $\Gamma^{\hat{r}_j}_j \le \Gamma^{\hat{r}_k}_k$, we have $P_j \le P_k$. Then, there exists a subtle trade-off. More precisely, choosing user k results in $P^k_k \le P^j_k$, but at the cost of $P_k \ge P_j$. Hence, in this case, the propensity of the base station is hard to determine. Following the same line, we can show that choosing user j leads to both $P^j_j \le P^k_j$ and $P_j \le P_k$; thus, no such trade-off exists when we investigate the base station's propensity to choose user j as $s_j$ increases but $s_k$ stays the same. A numerical illustration of these quantities is given below.
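The trade-off can be checked numerically. The sketch below computes $P_j$, $P_k$, and the ratio identity above for two hypothetical users; all parameter values are illustrative.

```python
def grow_prob(p, pe, r_hat, a):
    """Pr(s_{t+1} = s_t + 1 | s_t > 0) for one user (Section 2.3)."""
    if a == 0:
        return 1 - p
    if r_hat == 1:
        return (1 - pe) * p + pe * (1 - p)
    return (1 - pe) * (1 - p) + pe * p

# Two hypothetical users: serve j (a_j = 1, a_k = 0) vs. serve k.
pj, pej, rj = 0.2, 0.1, 1
pk, pek, rk = 0.4, 0.1, 0
P_j = grow_prob(pj, pej, rj, 1) * grow_prob(pk, pek, rk, 0)  # both grow, j served
P_k = grow_prob(pj, pej, rj, 0) * grow_prob(pk, pek, rk, 1)  # both grow, k served
G_j = grow_prob(pj, pej, rj, 1) / grow_prob(pj, pej, rj, 0)  # Gamma for user j
G_k = grow_prob(pk, pek, rk, 1) / grow_prob(pk, pek, rk, 0)  # Gamma for user k
print(f"P_j/P_k = {P_j / P_k:.4f}, Gamma_j/Gamma_k = {G_j / G_k:.4f}")  # equal
```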
Leveraging Theorem 1, we can provide some specific structural properties of the optimal policy.

Corollary 1 (Application of Theorem 1). When M = 1, the optimal policy for PP must satisfy the following.

1. The user i with $\hat{r}_i = p^0_{e,i} = 0$ or $s_i = 0$ will not be chosen unless it is to break a tie.

2. When user j is chosen at state $x_1$, then for any state $x_2$ such that $\hat{r}_{1,j} \le \hat{r}_{2,j}$ and $s_{1,i} = s_{2,i}$ for $1 \le i \le N$, the optimal choice must lie in the set $G = \{j\} \cup \{k : \hat{r}_{1,k} < \hat{r}_{2,k}\}$.

3. When N = 2, we consider two states, $x_1$ and $x_2$, which differ only in the value of $s_j$; specifically, $s_{1,j} \le s_{2,j}$. If user j is chosen at state $x_1$ and $\Gamma^{\hat{r}_{1,j}}_j \le \Gamma^{\hat{r}_{1,k}}_k$, the optimal choice at state $x_2$ is also user j.

4. When N = 2, we consider two states, $x_1$ and $x_2$, which differ only in the value of $s_k$; specifically, $s_{1,k} \ge s_{2,k}$. If user j is chosen at state $x_1$ and $\Gamma^{\hat{r}_{1,j}}_j \ge \Gamma^{\hat{r}_{1,k}}_k$, the optimal choice at state $x_2$ is also user j.

5. When all users are statistically identical, the optimal choice at any time slot must be either the user with $x = (s_{\max,1}, 1)$, where $s_{\max,1} \triangleq \max\{s_i : \hat{r}_i = 1\}$, or the user with $x = (s_{\max,0}, 0)$, where $s_{\max,0} \triangleq \max\{s_i : \hat{r}_i = 0\}$. Moreover,
• if $s_{\max,1} \ge s_{\max,0}$, it is optimal to choose the user with $x = (s_{\max,1}, 1)$;
• if $s_{\max,1} < s_{\max,0}$, the optimal choice switches from the user with $x = (s_{\max,0}, 0)$ to the user with $x = (s_{\max,1}, 1)$ as $s_{\max,1}$ increases from 0 to $s_{\max,0}$ while everything else stays fixed.
Proof. The first property follows directly from Property 1 and Property 3 of Theorem 1.
For the second property, leveraging Property 2 of Theorem 1, we have $\delta_{j,k}(x_2) \le \delta_{j,k}(x_1) \le 0$ if $\hat{r}_{1,j} \le \hat{r}_{2,j}$, $\hat{r}_{1,k} \ge \hat{r}_{2,k}$, and $s_{1,i} = s_{2,i}$ for $1 \le i \le N$. Thus, the optimal choice will not be user k in this case. Then, we can conclude that the optimal choice must lie in the set $G = \{j\} \cup \{k : \hat{r}_{1,k} < \hat{r}_{2,k}\}$.
For the third property, we proved in Property 4 of Theorem 1 that $\delta_{j,k}(x)$ is non-increasing in $s_j$ if $\Gamma^{\hat{r}_j}_j \le \Gamma^{\hat{r}_k}_k$. Hence, $\delta_{j,k}(x_2) \le \delta_{j,k}(x_1) \le 0$. As we consider the case of N = 2, the optimal choice at state $x_2$ is also user j. The fourth property can be shown in a similar way by noticing that $\delta_{j,k}(x)$ is non-decreasing in $s_k$ when $\Gamma^{\hat{r}_j}_j \ge \Gamma^{\hat{r}_k}_k$. For the last property, we recall from Property 5 of Theorem 1 that it is always better to choose the user with the larger s among statistically identical users with the same $\hat{r}$. Thus, the optimal choice must be either the user with $x = (s_{\max,1}, 1)$ or the user with $x = (s_{\max,0}, 0)$. Without loss of generality, we assume $x_j = (s_{\max,1}, 1)$ and $x_k = (s_{\max,0}, 0)$. Now, we distinguish between the following cases.
• According to Property 5 of Theorem 1, it is optimal to choose user j when $s_{\max,1} \ge s_{\max,0}$.
• To determine the optimal choice in the case of $s_{\max,1} < s_{\max,0}$, we recall that the optimal choice is user k (i.e., $\delta_{j,k}(x) \ge 0$) if $s_j = 0$ and user j (i.e., $\delta_{j,k}(x) \le 0$) if $s_j = s_k$. At the same time, Property 4 of Theorem 1 tells us that $\delta_{j,k}(x)$ is non-increasing in $s_j$ when users j and k are statistically identical. Therefore, the optimal choice switches from user k to user j as $s_j$ increases from 0 to $s_k$ while everything else stays fixed.

Whittle's Index Policy
Whittle's index policy is a well-known low-complexity heuristic that shows strong performance in many problems that belong to the RMAB class [22][23][24]. In this section, we develop Whittle's index policy for PP. We first present the general procedure we adopt to obtain Whittle's index.

• We first formulate a relaxed version of PP and apply the Lagrangian approach.
• Then, we decouple the problem of minimizing the Lagrangian function into N decoupled problems, each of which considers only a single user. By casting the decoupled problem into an MDP, we investigate the structural properties and the performance of the optimal policy.
• Leveraging the results above, and under a simple condition, we establish the indexability of the decoupled problem.
• Finally, we obtain the expression of Whittle's index by solving the Bellman equation.

Relaxed Problem
The first step in obtaining Whittle's index is to formulate the Relaxed Problem (RP). More precisely, instead of requiring the limit on the number of updates allowed per transmission attempt to be met in each time slot, we relax the constraint so that the limit is respected only in an average sense. Then, RP can be formulated as
$$\min_{\phi \in \Phi} \; \limsup_{T \to +\infty} \frac{1}{T}\, \mathbb{E}_\phi\!\left[\sum_{t=1}^{T} \sum_{i=1}^{N} f_i(s_{i,t})\right] \quad \text{subject to} \quad \limsup_{T \to +\infty} \frac{1}{T}\, \mathbb{E}_\phi\!\left[\sum_{t=1}^{T} \sum_{i=1}^{N} a_{i,t}\right] \le M. \qquad (6)$$
With RP specified, we apply the Lagrangian approach. First of all, we write RP in its Lagrangian form:
$$L(\phi, \lambda) = \limsup_{T \to +\infty} \frac{1}{T}\, \mathbb{E}_\phi\!\left[\sum_{t=1}^{T} \sum_{i=1}^{N} \big(f_i(s_{i,t}) + \lambda\, a_{i,t}\big)\right] - \lambda M,$$
where $\lambda \ge 0$ is the Lagrange multiplier. Then, we investigate the problem of minimizing the Lagrangian function. Since $\lambda M$ is independent of the policy, we can ignore it. More precisely, we consider the following minimization problem:
$$\min_{\phi \in \Phi} \; \limsup_{T \to +\infty} \frac{1}{T}\, \mathbb{E}_\phi\!\left[\sum_{t=1}^{T} \sum_{i=1}^{N} \big(f_i(s_{i,t}) + \lambda\, a_{i,t}\big)\right]. \qquad (7)$$

Decoupled Model
In this section, we formulate the decoupled problem and investigate its optimal policy. The decoupled model associated with each user follows the system model with N = 1.
Since all the users share the same structure, we drop the user-dependent subscript i for simplicity. Then, the decoupled problem can be formulated as
$$\min_{\phi \in \Phi} \; \limsup_{T \to +\infty} \frac{1}{T}\, \mathbb{E}_\phi\!\left[\sum_{t=1}^{T} \big(f(s_t) + \lambda\, a_t\big)\right], \qquad (8)$$
where $\Phi$ is the set of all causal policies when N = 1. We notice that problem (8) can be cast into the MDP $M_1(\lambda, -1)$, where we define M = -1 to mean that there is no restriction on the number of updates allowed per transmission attempt. We first investigate the structural properties of the optimal policy for $M_1(\lambda, -1)$ when $\lambda$ is a given non-negative constant. We start with characterizing the corresponding value function $V(x)$.
Corollary 2 (Monotonicity). For $M_1(\lambda, -1)$, the value function $V(s, \hat{r})$ is non-decreasing in s.

Proof. The proof follows the same steps as in the proof of Lemma 1. The complete proof can be found in Appendix D.
Equipped with the above corollary, we can characterize the structural properties of the optimal policy for (8).
Proposition 1 (Optimal policy for decoupled problem). The optimal policy for the decoupled problem is a threshold policy with the following properties.

• The optimal policy can be fully captured by the pair $n = (n_0, n_1)$. More precisely, when the system is at state $(s, \hat{r})$, it is optimal to make a transmission attempt only when $s \ge n_{\hat{r}}$.

Proof. We define $\Delta V(x) \triangleq V_1(x) - V_0(x)$, where $V_a(x)$ is the value function resulting from taking action a at state x. Then, the optimal action at state x is a = 1 if $\Delta V(x) < 0$, and a = 0 is optimal otherwise. We use Corollary 2 to characterize the sign of $\Delta V(x)$. The complete proof can be found in Appendix E.
In the following, we evaluate the performance of the threshold policy detailed in Proposition 1. More precisely, we calculate the expected AoII $\bar{\Delta}_n$ and the expected transmission rate $\bar{\rho}_n$ resulting from the adoption of threshold policy n. We will see in the following that $\bar{\Delta}_n$ and $\bar{\rho}_n$ are essential for establishing indexability and obtaining the expression of Whittle's index.

Proposition 2 (Performance).
Under threshold policy $n = (n_0, n_1)$, the expected AoII $\bar{\Delta}_n$ and the expected transmission rate $\bar{\rho}_n$ admit closed-form expressions.

Proof. We notice that the dynamics of AoII under the threshold policy can be fully captured by a Discrete-Time Markov Chain (DTMC). Then, combined with the fact that $\hat{r}$ is an independent Bernoulli random variable, we can obtain the desired results from the stationary distribution of the induced DTMC. The complete proof can be found in Appendix F.
As $f(\cdot)$ can be any non-decreasing function, $\bar{\Delta}_n$ can grow indefinitely. Thus, it is necessary to require that there exists at least one threshold policy under which $\bar{\Delta}_n$ is finite. By noting that $1 - p \ge c_1 \ge c_2$, we can bound $\bar{\Delta}_n$ by a series in $c_2$, with equality achieved when $n_0 = n_1 = 1$. Then, we can conclude that it is sufficient to require $\sum_{k=1}^{+\infty} f(k)\, c_2^{\,k-1} < +\infty$. This is the underlying assumption throughout the rest of this paper.
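For instance, with the linear time penalty f(k) = k, the required series is the derivative of a geometric series, so the assumption holds whenever $c_2 < 1$:

$$\sum_{k=1}^{+\infty} k\, c_2^{\,k-1} = \frac{\mathrm{d}}{\mathrm{d}c_2}\sum_{k=0}^{+\infty} c_2^{\,k} = \frac{1}{(1-c_2)^2} < +\infty, \qquad 0 \le c_2 < 1.$$

More generally, any polynomially bounded f satisfies the assumption, since the series then converges by the ratio test.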

Indexability
In this section, we establish the indexability of the decoupled problem, which ensures the existence of Whittle's index. We start with the definition of indexability.

Definition 2 (Indexability).
The decoupled problem is indexable if the set of states in which a = 0 is the optimal action increases with $\lambda$; that is,
$$\lambda' \le \lambda \implies D(\lambda') \subseteq D(\lambda),$$
where $D(\lambda)$ is the set of states in which a = 0 is optimal when the Lagrange multiplier $\lambda$ is adopted.
The Lagrange multiplier $\lambda$ can be viewed as a cost associated with each transmission attempt. Intuitively, as $\lambda$ increases, the base station should stay idle (i.e., a = 0) longer, until s becomes large enough to offset the cost. Although it is intuitively plausible that the decoupled problem is indexable, indexability is hard to establish because the optimal policy is characterized by two thresholds. Thus, Whittle's index does not necessarily exist. However, indexability can be established when the following condition is satisfied:
$$p^0_{e,i} = 0, \qquad 1 \le i \le N. \qquad (9)$$

Remark 3. Condition (9) only requires the estimate $\hat{r}_i$ to be perfect when $\hat{r}_i = 0$. In the case of $\hat{r}_i = 1$, we still allow the estimate to be inaccurate.
When (9) is satisfied, Propositions 1 and 2 reduce to the following.

Corollary 3 (Consequences of (9)). When (9) is satisfied, the optimal policy for the decoupled problem (8) is the threshold policy $n = (+\infty, n)$. The corresponding $\bar{\Delta}_n$ and $\bar{\rho}_n$ follow from Proposition 2 as a special case.

Proof. We continue with the same notation as in the proofs of Propositions 1 and 2. It is sufficient to show that $n_0 = +\infty$. To this end, we consider the state $x = (s, 0)$. By following the same steps as in the proof of Proposition 1, we obtain $\Delta V(x) \ge 0$. Therefore, it is optimal to stay idle (i.e., a = 0) at state $x = (s, 0)$ for any $s \ge 0$; equivalently, $n_0 = +\infty$. Then, the corresponding $\bar{\Delta}_n$ and $\bar{\rho}_n$ can be calculated as the special case of Proposition 2 where $n_0 = +\infty$, $n_1 = n$, and $p^0_e = 0$.
Leveraging Corollary 3, we can establish the indexability of the decoupled problem.

Proposition 3 (Indexability). When (9) is satisfied, the decoupled problem is indexable.
Proof. According to Proposition 2.2 of [25], we only need to verify that the expected transmission rate $\bar{\rho}_n$ is strictly decreasing in n. From Corollary 3, we have the closed-form expression of $\bar{\rho}_n$. As $\frac{1}{2} < 1 - p < 1$, we can easily verify that $\bar{\rho}_n$ is strictly decreasing in n. Thus, the decoupled problem is indexable when (9) is satisfied.

Whittle's Index Policy
In this section, we proceed with finding the expression of Whittle's index and defining Whittle's index policy. First of all, we give the definition of Whittle's index.

Definition 3 (Whittle's index).
When the decoupled problem is indexable, Whittle's index at state x is defined as the infimum $\lambda$ such that both actions are equally desirable at x. Equivalently, Whittle's index at state x is the infimum $\lambda$ such that $V_0(x) = V_1(x)$.
Let us denote by $W_x$ Whittle's index at state x. Then, the expression of Whittle's index is given by the following proposition.

Proposition 4 (Whittle's index). When (9) is satisfied, $W_x = 0$ for $x = (0, \hat{r})$ and $x = (s, 0)$. For $x = (s, 1)$ with $s > 0$, $W_x$ admits a closed-form expression in terms of $\bar{\Delta}_s$, $\bar{\rho}_s$, and the constant $c_1 = (1 - \gamma)(1 - p) + \gamma\alpha$, where $\bar{\Delta}_s$ and $\bar{\rho}_s$ are the expected AoII and the expected transmission rate when threshold policy $n = (+\infty, s)$ is adopted, respectively. At the same time, $W_x$ is non-negative and non-decreasing in s.

Proof.
Whittle's indexes at states $x = (0, \hat{r})$ and $x = (s, 0)$ follow easily from the proof of Proposition 1. For state $x = (s, 1)$, we first use backward induction to calculate the expressions of some value functions. Then, the expression of Whittle's index follows from its definition. The complete proof can be found in Appendix G.
Definition 4 (Whittle's index policy). At any state $x = (x_1, x_2, \ldots, x_N)$, the base station transmits the updates from the M users with the largest $W_{x_i}$, where ties are broken arbitrarily and $W_{x_i}$ is calculated using Proposition 4 with the parameters of user i. A sketch of this selection rule is given below.
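A minimal sketch of Definition 4, assuming a callable that returns Whittle's index for a given user and state. The toy index used in the example only mimics the qualitative properties of Proposition 4 (zero at s = 0 or r̂ = 0, non-decreasing in s) and is not the closed-form index.

```python
import heapq

def whittle_schedule(states, M, whittle_index):
    """Transmit to the M users with the largest Whittle's index (Definition 4).

    states[i] = (s_i, r_hat_i); whittle_index(i, s, r_hat) should implement
    Proposition 4 with the parameters of user i. Ties are broken arbitrarily
    (here, by user index).
    """
    return heapq.nlargest(
        M, range(len(states)),
        key=lambda i: whittle_index(i, states[i][0], states[i][1]))

# Toy index, purely illustrative.
def toy_index(i, s, r_hat):
    return 0.0 if (s == 0 or r_hat == 0) else float(s)

print(whittle_schedule([(3, 1), (5, 0), (2, 1), (0, 1)], M=1,
                       whittle_index=toy_index))   # -> [0]
```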

Remark 4.
Whittle's index policy possesses the structural properties detailed in Corollary 1.
• The first two properties can be verified by noting that $W_{x_i} \ge 0$, with equality when $\hat{r}_i = 0$ or $s_i = 0$, and that $W_{x_i}$ is non-decreasing in $\hat{r}_i$.
• The third and fourth properties can be verified by noting that $W_{x_i}$ is non-decreasing in $s_i$.
• For the last property, we first notice that $W_{x_j} = W_{x_k}$ when users j and k are statistically identical and $x_j = x_k$. Then, the property can be verified by noting that $W_{x_i}$ is non-decreasing in both $s_i$ and $\hat{r}_i$.

Optimal Policy for Relaxed Problem
In this section, we provide an efficient algorithm to obtain the optimal policy for RP, based on which we will develop, in the next section, another scheduling policy for PP that is free from the indexability requirement. At the same time, the performance of the optimal policy for RP forms a universal lower bound, because the following ordering holds:
$$\bar{\Delta}^{RP}_{AoII} \le \bar{\Delta}^{PP}_{AoII},$$
where $\bar{\Delta}^{RP}_{AoII}$ and $\bar{\Delta}^{PP}_{AoII}$ are the minimal expected AoII of RP and PP, respectively.

Remark 5.
Note that the optimal policy for RP is not necessarily a valid policy for PP, as the base station may transmit more than M updates in one time slot under the RP-optimal policy.
To solve RP, we follow the discussion in Section 4.1. More precisely, we take the Lagrangian approach and consider the problem reported in (7). We will see in the following discussion that the optimal policy for RP can be characterized by the optimal policies for problem (7). Therefore, we first cast problem (7) into the MDP $M_N(\lambda, -1)$. However, the optimal policy for $M_N(\lambda, -1)$ is difficult to obtain because the state space is infinite. Even though we can make the state space finite by imposing an upper limit on the value of s, the state space and the action space grow exponentially with the number of users in the system. To overcome this difficulty, we investigate the optimal policy for $M^i_1(\lambda, -1)$, where $1 \le i \le N$ and the superscript i means that the only user in the system is user i. We will show later that the optimal policy for $M_N(\lambda, -1)$ can be fully characterized by the optimal policies for $M^i_1(\lambda, -1)$, $1 \le i \le N$.

Optimal Policy for Single User
In this section, we tackle the problem of finding the optimal policy for $M^i_1(\lambda, -1)$. Since the users share the same structure, we omit the superscript i for simplicity. To find the optimal policy, we first use the Approximating Sequence Method (ASM) introduced in [26] to make the state space finite. More precisely, we impose $s \le m$, where m is a predetermined upper limit, and the state transition probabilities $P_{s,s'}(a, \hat{r})$ are modified so that the probability mass that would leave the truncated space is redirected back into it; we refer to the modified probabilities as (10). The action space and the instant cost remain unchanged. Then, we can apply Relative Value Iteration (RVI) with a convergence criterion to obtain the optimal policy. We notice that $M_1(\lambda, -1)$ coincides with the decoupled model studied in Section 4.2. Hence, we can utilize the threshold structure of the optimal policy to improve RVI. To this end, we call a state active if the optimal action at this state is a = 1. Then, the threshold structure detailed in Proposition 1 tells us the following: for any state x, if there exists an active state $x_1$ with $s_1 \le s$ and $\hat{r}_1 \le \hat{r}$, then x must also be active. Hence, we can determine the optimal action at state x immediately instead of comparing all feasible actions, which reduces the running time of RVI. The pseudocode for the improved RVI can be found in Algorithm A1 of Appendix M; a simplified sketch is also given below. A similar technique is presented in [5].
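The following Python sketch condenses the RVI just described. It assumes, for illustration, that the ASM truncation (10) keeps the excess probability mass at the boundary state s = m (one common choice), and it omits the monotone-policy speedup for brevity; the function and parameter names are ours, not the paper's.

```python
import numpy as np

def rvi_single_user(p, pe0, pe1, gamma_hat, lam, f, m=200, iters=2000):
    """RVI sketch for M_1(lambda, -1) with ASM truncation s <= m.

    States are (s, r_hat); the instant cost is f(s) + lam * a.
    Returns the threshold pair (n0, n1) of the converged policy,
    with None encoding "+infinity".
    """
    grow = {  # Pr(s increases) for s > 0, indexed by (action, r_hat)
        (0, 0): 1 - p, (0, 1): 1 - p,
        (1, 0): (1 - pe0) * (1 - p) + pe0 * p,
        (1, 1): (1 - pe1) * p + pe1 * (1 - p),
    }
    V = np.zeros((m + 1, 2))
    for _ in range(iters):
        Q = np.empty((m + 1, 2, 2))
        for s in range(m + 1):
            for r in (0, 1):
                for a in (0, 1):
                    q = grow[(a, r)] if s > 0 else p  # at s = 0, a is irrelevant
                    nxt = min(s + 1, m)               # ASM truncation (10)
                    # Expectation over the next r_hat ~ Bernoulli(gamma_hat).
                    EV_up = gamma_hat * V[nxt, 1] + (1 - gamma_hat) * V[nxt, 0]
                    EV_0 = gamma_hat * V[0, 1] + (1 - gamma_hat) * V[0, 0]
                    Q[s, r, a] = f(s) + lam * a + q * EV_up + (1 - q) * EV_0
        V = Q.min(axis=2) - Q[0, 0].min()             # relative value iteration
    policy = Q.argmin(axis=2)
    return [next((s for s in range(1, m + 1) if policy[s, r] == 1), None)
            for r in (0, 1)]
```

For example, rvi_single_user(p=0.3, pe0=0.1, pe1=0.1, gamma_hat=0.6, lam=5.0, f=lambda s: s) returns an estimate of (n_0, n_1) for the corresponding decoupled problem.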
For $M_1(\lambda, -1)$, when condition (9) is satisfied, Whittle's index exists and can be calculated efficiently using Proposition 4. Therefore, we can obtain the optimal policy using Whittle's index and further reduce the computational complexity. To this end, we denote by $n_\lambda$ the optimal policy for $M_1(\lambda, -1)$ and present the following proposition.

Proposition 5 (Optimal deterministic policy). When (9) is satisfied, the optimal policy for $M_1(\lambda, -1)$ is $n_\lambda = (+\infty, n)$, where n is the smallest s for which Whittle's index at state (s, 1) exceeds $\lambda$.

Proof. We first notice that $M_1(\lambda, -1)$ coincides with the decoupled model studied in Section 4.2. Then, we show the optimal action for each state with $\hat{r} = 1$ using the definition of Whittle's index and the fact that the decoupled problem is indexable when (9) is satisfied. The complete proof can be found in Appendix H.
In the following, we provide a randomized policy that is also optimal for $M_1(\lambda, -1)$. We will see later that the randomized policy is the key to obtaining the optimal policy for RP.
Theorem 2 (Optimal randomized policy). There exist two deterministic policies, $n_{\lambda^+}$ and $n_{\lambda^-}$, which are both optimal for $M_1(\lambda, -1)$. We consider the following randomized policy $n_\lambda$: every time the system reaches state (0, 0), the base station chooses between $n_{\lambda^-}$ with probability $\mu$ and $n_{\lambda^+}$ with probability $1 - \mu$, and the chosen policy is followed until the next choice. Then, the randomized policy $n_\lambda$ is optimal for $M_1(\lambda, -1)$ under any $\mu \in [0, 1]$.
Proof. We show that our system verifies the assumptions given in [27]. Then, leveraging the characteristics of our system, we can obtain the optimal randomized policy. The complete proof can be found in Appendix I.
In practice, we approximate $\lambda^+ \approx \lambda + \xi$ and $\lambda^- \approx \lambda - \xi$, where $\xi$ is a small perturbation. Then, the deterministic policies $n_{\lambda^+}$ and $n_{\lambda^-}$ can be obtained by following the discussion at the beginning of this subsection. Note that, in most cases, $n_{\lambda^+}$ and $n_{\lambda^-}$ are the same.

Optimal Policy for RP
In this section, we characterize the optimal policy for RP. Let us denote by $V(x)$ and $V_i(x_i)$ the value functions of $M_N(\lambda, -1)$ and $M^i_1(\lambda, -1)$, respectively. Then, we can prove the following.

Proposition 6 (Decomposition). For any state x,
$$V(x) = \sum_{i=1}^{N} V_i(x_i).$$
In other words, the policy under which each user adopts its own optimal policy is optimal for $M_N(\lambda, -1)$.

Proof. The proposition can be verified by comparing the Bellman equations the two value functions must satisfy. The complete proof can be found in Appendix J.
We denote the optimal policy for $M_N(\lambda, -1)$ by $\phi_\lambda = [n_{\lambda,1}, \ldots, n_{\lambda,N}]$, where $n_{\lambda,i}$ is the optimal policy for $M^i_1(\lambda, -1)$. For simplicity, we define $\bar{\Delta}(\lambda)$ and $\bar{\rho}(\lambda)$ as the expected AoII and the expected transmission rate associated with $\phi_\lambda$, respectively; $\bar{\Delta}_i(\lambda)$ and $\bar{\rho}_i(\lambda)$ are defined analogously for user i under policy $n_{\lambda,i}$. We also define $\lambda^* \triangleq \inf\{\lambda > 0 : \bar{\rho}(\lambda) \le M\}$. With Proposition 6 and the above definitions in mind, we proceed with constructing the optimal policy for RP.

Theorem 3 (Optimal policy for RP). The optimal policy for RP can be characterized by two deterministic policies, $\phi_{\lambda^{*+}} = [n_{\lambda^{*+},1}, \ldots, n_{\lambda^{*+},N}]$ and $\phi_{\lambda^{*-}} = [n_{\lambda^{*-},1}, \ldots, n_{\lambda^{*-},N}]$, where $n_{\lambda^{*+},i}$ and $n_{\lambda^{*-},i}$ are both optimal deterministic policies for $M^i_1(\lambda^*, -1)$. Then, we mix $\phi_{\lambda^{*+}}$ and $\phi_{\lambda^{*-}}$ in the following way: for each user i, every time the user reaches state (0, 0), the base station chooses $n_{\lambda^{*-},i}$ with probability $\mu_i$ and $n_{\lambda^{*+},i}$ with probability $1 - \mu_i$, and the chosen policy is followed by user i until the next choice. For $1 \le i \le N$, the $\mu_i$ are chosen so as to satisfy
$$\sum_{i=1}^{N} \left[\mu_i\, \bar{\rho}_i(\lambda^{*-}) + (1 - \mu_i)\, \bar{\rho}_i(\lambda^{*+})\right] = M. \qquad (11)$$
Then, the mixed policy, denoted by $\phi_{\lambda^*}$, is optimal for RP.
Proof. According to Lemma 3.10 of [27], a policy is optimal for RP if it is optimal for $M_N(\lambda^*, -1)$ and its resulting expected transmission rate is equal to M.
Then, we construct such a policy using Theorem 2 and Proposition 6. The complete proof can be found in Appendix K.
Since we approximate $\lambda^*$ by $\lambda^{*-}$ and $\lambda^{*+}$, and $\bar{\rho}_i(\lambda)$ is non-increasing in $\lambda$ for all i according to the monotonicity given by Lemma 3.4 of [27], combining with the definition of $\lambda^*$, we must have $\bar{\rho}(\lambda^{*+}) \le M < \bar{\rho}(\lambda^{*-})$. Therefore, we can always find $\mu_i$'s that realize (11). In this paper, we choose
$$\mu_i = \frac{M - \bar{\rho}(\lambda^{*+})}{\bar{\rho}(\lambda^{*-}) - \bar{\rho}(\lambda^{*+})}, \qquad 1 \le i \le N. \qquad (12)$$
Then, we describe the algorithm used to obtain the optimal policy for RP. As detailed in Theorem 3, it is essential to find $\lambda^*$. To this end, we recall that, for any user i under a given $\lambda$, the optimal deterministic policy $n_{\lambda,i}$ can be obtained using the results in Section 5.1, and the resulting expected transmission rate $\bar{\rho}_i(\lambda)$ is given by Proposition 2. Since $\bar{\rho}_i(\lambda)$ is non-increasing in $\lambda$ for all i, $\bar{\rho}(\lambda) = \sum_{i=1}^{N} \bar{\rho}_i(\lambda)$ is also non-increasing in $\lambda$. Hence, we can regard $\bar{\rho}(\lambda)$ as a non-increasing function of $\lambda$ and, according to the definition of $\lambda^*$, use a Bisection search to obtain $\lambda^*$ efficiently. The main steps can be summarized as follows.

1. Initialize the left endpoint $\lambda^- = 0$.
2. Find a right endpoint $\lambda^+$ with $\bar{\rho}(\lambda^+) \le M$, e.g., by repeatedly doubling an initial guess.
3. Run the Bisection search on the interval $[\lambda^-, \lambda^+]$ until the tolerance $2\xi$ is met. Then, $\lambda^{*-}$ and $\lambda^{*+}$ can simply be taken as the boundaries of the final interval.
The pseudocode for the Bisection search can be found in Algorithm A2 of Appendix M; a compact sketch follows below. After obtaining $\lambda^{*-}$ and $\lambda^{*+}$, the optimal policy $\phi_{\lambda^*}$ is detailed in Theorem 3, and the mixing probabilities $\mu_i$ are given by (12).
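A compact sketch of the search, where rho is assumed to be a callable returning the non-increasing rate $\bar{\rho}(\lambda)$ (assembled, e.g., from Proposition 2); the doubling initialization is one simple way to bracket $\lambda^*$.

```python
def find_lambda_star(rho, M, xi=0.005, lam_hi=1.0):
    """Bisection sketch for lambda* = inf{lambda > 0 : rho(lambda) <= M}."""
    # Grow the right endpoint until the rate drops below M.
    while rho(lam_hi) > M:
        lam_hi *= 2
    lam_lo = 0.0
    # Bisect until the interval is shorter than the tolerance 2*xi.
    while lam_hi - lam_lo > 2 * xi:
        mid = (lam_lo + lam_hi) / 2
        if rho(mid) > M:
            lam_lo = mid
        else:
            lam_hi = mid
    return lam_lo, lam_hi   # (lambda*-, lambda*+)
```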

Remark 6.
We recall that the optimal deterministic policy for each user is characterized by two positive thresholds (i.e., $n_0, n_1 > 0$). Consequently, under the RP-optimal policy, the base station never chooses a user at state $(0, \hat{r})$. Then, as M increases, the expected transmission rate achieved by the RP-optimal policy saturates before M reaches N. When the expected transmission rate saturates, the RP-optimal policy is $\phi^* = [n_1, \ldots, n_N]$ with $n_i = (1, 1)$ for $1 \le i \le N$; the saturation happens when M is larger than or equal to the expected transmission rate achieved by $\phi^*$.

Indexed Priority Policy
Although the performance of Whittle's index policy is known to be good, it requires indexability, which is usually difficult to establish. In this section, based on the primal-dual heuristic introduced in [28], we develop a policy that does not require indexability and has performance comparable to Whittle's index policy. We start by presenting the primal-dual heuristic.

Primal-Dual Heuristic
The heuristic is based on an optimal primal and dual solution pair of the linear program associated with RP. To introduce the linear program, we define $\pi^{a_i}_{x_i}(\phi) \ge 0$ as the expected fraction of time that user i spends at state $x_i$ with action $a_i$ taken under policy $\phi$. Then, for any $\phi$, the $\pi^{a_i}_{x_i}(\phi)$ must satisfy the balance and normalization constraints of the induced Markov chain, namely
$$\sum_{x_i, a_i} \pi^{a_i}_{x_i} = 1, \qquad \sum_{a_i} \pi^{a_i}_{x'_i} = \sum_{x_i, a_i} \pi^{a_i}_{x_i}\, P_{x_i, x'_i}(a_i), \quad \forall x'_i, \; 1 \le i \le N.$$
The objective function of RP can be rewritten as
$$\sum_{i=1}^{N} \sum_{x_i} \sum_{a_i} C(x_i)\, \pi^{a_i}_{x_i},$$
and the constraint on the expected transmission rate can be rewritten as
$$\sum_{i=1}^{N} \sum_{x_i} \pi^{1}_{x_i} \le M.$$
Thus, the linear program associated with RP minimizes the rewritten objective over the $\pi^{a_i}_{x_i}$ subject to the above constraints; we refer to this linear program as (13) and to the corresponding dual problem as (14). Let $\{\bar{\pi}^{a_i}_{x_i}\}$ and $\{\bar{\sigma}, \bar{\sigma}_i, \bar{\sigma}_{x_i}\}$ be an optimal primal and dual solution pair of the problems reported in (13) and (14), and let $\bar{\psi}^0_{x_i}$ and $\bar{\psi}^1_{x_i}$ denote the associated dual slacks (reduced costs) of actions 0 and 1 at state $x_i$. For any state $x = (x_1, \ldots, x_N)$, let $h(x)$ be the number of users i with $\bar{\pi}^1_{x_i} > 0$. Then, the heuristic operates in the following way.

• If $h(x) \ge M$, the base station chooses the M users with the largest $\bar{\psi}^0_{x_i}$ among the h(x) users.
• If $h(x) < M$, these h(x) users are chosen by the base station, which then chooses $M - h(x)$ additional users with the smallest $\bar{\psi}^1_{x_i}$.
A sketch of this selection rule is given after this paragraph. However, Linear Programming (LP) is a very general technique and does not appear to take advantage of the special structure of the problem. Although there are algorithms for solving rational LPs that take time polynomial in the number of variables and constraints, they can run slowly in practice [29]. For our problem, we notice that the users have separate activity areas that are linked through a common resource constraint; therefore, the primal problem can be solved using Dantzig-Wolfe decomposition. Even so, the problem is still computationally demanding when the system scales up. We recall that we solved the exact problem efficiently using MDP-specific algorithms in Section 5. That approach is more efficient for the following reasons.
• According to Proposition 6, we can decompose the problem into N subproblems.
• For each subproblem, the threshold structure of the optimal policy is utilized to reduce the running time of RVI.
• As we will see later, the developed policy can be obtained directly from the result of RVI in practice.
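A sketch of the selection rule, assuming the relaxed-solution quantities are available as per-user dictionaries keyed by state; the names pi1, psi0, and psi1 stand for $\bar{\pi}^1$, $\bar{\psi}^0$, and $\bar{\psi}^1$ and are our placeholders.

```python
def primal_dual_select(x, M, pi1, psi0, psi1):
    """Selection rule of the primal-dual heuristic (Section 6.1 sketch).

    pi1[i][x_i] > 0 marks states the relaxed solution activates; psi0 and
    psi1 hold the dual slacks for staying idle and for transmitting.
    """
    active = [i for i, x_i in enumerate(x) if pi1[i].get(x_i, 0.0) > 0.0]
    if len(active) >= M:
        # Among the h(x) active users, favor the largest idling penalty.
        return sorted(active, key=lambda i: -psi0[i][x[i]])[:M]
    rest = [i for i in range(len(x)) if i not in active]
    # Top up with the users that are cheapest to activate.
    rest.sort(key=lambda i: psi1[i][x[i]])
    return active + rest[:M - len(active)]
```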
In the following, we translate the results in Section 5 into an optimal primal and dual solution pair and propose the Indexed priority policy.

Indexed Priority Policy
We first define the Lagrangian function associated with (13); the corresponding Lagrangian dual function follows by minimizing it over the primal variables. Let $\pi_{x_i}$ be the expected fraction of time that user i spends at state $x_i$ under $\phi_{\lambda^*}$, the optimal policy detailed in Theorem 3. Then, we define $\{\pi^{a_i}_{x_i}\}$ as follows.
• At the state $x_i$ where randomization happens (randomization happens when the actions suggested by the two optimal deterministic policies differ), $\pi_{x_i}$ is split between the two actions according to the mixing probability: the action suggested by $n_{\lambda^{*-},i}$ receives the weight $\mu_i$ and the action suggested by $n_{\lambda^{*+},i}$ receives the weight $1 - \mu_i$, where $\mu_i$ is given by (12) and $a_{n_{\lambda,i}}(x_i)$ denotes the action suggested by $n_{\lambda,i}$ at state $x_i$.
• For other values of $x_i$, we have $\pi^0_{x_i} = (1 - a_{n_{\lambda^*,i}}(x_i))\, \pi_{x_i}$ and $\pi^1_{x_i} = a_{n_{\lambda^*,i}}(x_i)\, \pi_{x_i}$.
Then, we can prove the following proposition.
Proposition 7 (Optimal solution pair). $\{\pi^{a_i}_{x_i}\}$ and $\{\sigma, \sigma_i, \sigma_{x_i}, \psi^{a_i}_{x_i}\}$ are optimal primal and dual solutions to (13) and (14), respectively.
Proof. Since (13) is linear and strictly feasible, it is sufficient to show that $\{\pi^{a_i}_{x_i}\}$ and $\{\sigma, \sigma_i, \sigma_{x_i}, \psi^{a_i}_{x_i}\}$ verify the KKT conditions, which can be expressed as the following four conditions.

1. Primal feasibility: $\{\pi^{a_i}_{x_i}\}$ satisfies the constraints of (13).

2. Dual feasibility: $\sigma \ge 0$ and $\psi^{a_i}_{x_i} \ge 0$ for all $x_i$, $a_i$, and i.

3. Complementary slackness: the products of the constraints and their multipliers vanish.

4. Stationarity: the gradient of the Lagrangian with respect to the primal variables is zero.

For the first condition, $\{\pi^{a_i}_{x_i}\}$ is feasible by construction, since it is induced by the policy $\phi_{\lambda^*}$. For the second condition, we notice that $\psi^{a_i}_{x_i} = V^i_{a_i}(x_i) - V_i(x_i)$, where $V^i_{a_i}(x_i)$ is the value function resulting from taking action $a_i$ at state $x_i$; the non-negativity is then guaranteed by the Bellman equation. For the third condition, the first term is zero because we choose the $\mu_i$'s given by (12). For the second term, we recall that $\pi^{a_i}_{x_i} > 0$ only at states where action $a_i$ is suggested by $\phi_{\lambda^*}$. Combined together, we can conclude that $\psi^{a_i}_{x_i} = 0$ whenever $\pi^{a_i}_{x_i} > 0$; thus, the third condition is satisfied. For the last condition, setting the gradient equal to zero yields a system of linear equations, one for each $x_i$ and $1 \le i \le N$, and $\{\sigma, \sigma_i, \sigma_{x_i}, \psi^{a_i}_{x_i}\}$ verifies this system by definition. Since all four conditions are satisfied, we can conclude the proof.
According to Proposition 7, $\{\pi^{a_i}_{x_i}\}$ and $\{\sigma, \sigma_i, \sigma_{x_i}\}$ defined above are optimal solutions to problems (13) and (14), respectively. With the optimal solutions at hand, we can adopt the heuristic detailed in Section 6.1.
The heuristic can be expressed equivalently as an index policy. To this end, we define the index $I_{x_i}$ for state $x_i$ as $I_{x_i} \triangleq \bar{\psi}^0_{x_i} - \bar{\psi}^1_{x_i}$. According to complementary slackness, $I_{x_i}$ can be reduced to the following.

• For a state $x_i$ such that $\bar{\pi}^1_{x_i} > 0$, complementary slackness gives $\bar{\psi}^1_{x_i} = 0$, and hence $I_{x_i} = \bar{\psi}^0_{x_i} \ge 0$.
• For a state $x_i$ such that $\bar{\pi}^0_{x_i} > 0$, complementary slackness gives $\bar{\psi}^0_{x_i} = 0$, and hence $I_{x_i} = -\bar{\psi}^1_{x_i} \le 0$.
We can show that $I_{x_i}$ possesses the following properties.

Proposition 8 (Properties of the index). $I_{x_i} \ge -\lambda^*$, with equality when $\hat{r}_i = p^0_{e,i} = 0$ or $s_i = 0$. Moreover, $I_{x_i}$ is non-decreasing in both $s_i$ and $\hat{r}_i$.

Proof. We notice that $I_{x_i}$ can be expressed as a function of $V_i(x_i)$ and $\lambda^*$. Meanwhile, $M^i_1(\lambda^*, -1)$ coincides with the decoupled model studied in Section 4.2. Then, we can verify the properties of $I_{x_i}$ using the results in Section 4.2. The complete proof can be found in Appendix L.
Comparing with the heuristic detailed in Section 6.1, we can define the Indexed priority policy.

Definition 5 (Indexed priority policy). At any state $x = (x_1, x_2, \ldots, x_N)$, the base station transmits the updates from the M users with the largest $I_{x_i}$. Ties are broken arbitrarily.

Remark 7.
Indexed priority policy belongs to the class of priority policies introduced in [30]. These priority policies are asymptotically optimal when certain conditions are satisfied.

Remark 8.
Indexed priority policy possesses the structural properties detailed in Corollary 1.
• The first two properties can be verified by noting that $I_{x_i} \ge -\lambda^*$, with equality when $\hat{r}_i = p^0_{e,i} = 0$ or $s_i = 0$, and that $I_{x_i}$ is non-decreasing in $\hat{r}_i$.
• The third and fourth properties can be verified by noting that $I_{x_i}$ is non-decreasing in $s_i$.
• For the last property, we first notice that $I_{x_j} = I_{x_k}$ when users j and k are statistically identical and $x_j = x_k$. Then, the property can be verified by noting that $I_{x_i}$ is non-decreasing in both $s_i$ and $\hat{r}_i$.
We notice that the $\theta_i$'s and $C(x_i)$'s cancel out in the definition of $I_{x_i}$. Therefore, $I_{x_i}$ can be calculated using $\lambda^*$ and the value function of $M^i_1(\lambda^*, -1)$. In practice, we can use either $\lambda^{*-}$ or $\lambda^{*+}$ to approximate $\lambda^*$, and the value function can be approximated by the result of the RVI detailed in Section 5.1. Since the state space is infinite, we only calculate a finite number of the $V_i(x_i)$, where the number depends on the truncation parameter m of ASM. Meanwhile, the probabilities $P_{x_i, x'_i}(a_i)$ in $I_{x_i}$ are modified according to (10).

Numerical Results
In this section, we provide numerical results to showcase the performance of the developed scheduling policies. To eliminate the effect of N, we plot the expected average AoII. In particular, we provide the expected average AoII achieved by the Indexed priority policy and Whittle's index policy when M = 1. The policies are calculated using the results detailed in Sections 4-6. When obtaining the Indexed priority policy, we set the tolerance in the Bisection search to $\xi = 0.005$; meanwhile, we choose the truncation parameter in ASM as m = 800 and the convergence criterion in RVI as 0.01. We notice that the calculation of Whittle's index involves an infinite sum; in practice, we approximate the result by replacing $+\infty$ with a sufficiently large number $k_{max} = 800$. For both scheduling policies, the resulting expected average AoII is obtained via simulations. Each data point is the average of 15 runs, with 15,000 time slots considered in each run.
We also compare the developed policies with the optimal policy for RP, which can be calculated by following the discussion in Section 5.2. We adopt the same choices of parameters as we used to obtain the developed policies. The corresponding performance is calculated using Proposition 2; as before, the infinite sum is approximated by replacing $+\infty$ with $k_{max} = 800$. We also provide the expected average AoII achieved by the Greedy policy to show the performance advantages of the developed policies. When the Greedy policy is adopted, the base station always chooses the user with the largest AoII. The resulting expected average AoII is obtained via the same simulations as applied to the developed policies. Figures 3 and 4 illustrate the performance when the source processes have different dynamics and when each user's communication goal is different, respectively. Figure 3a provides the performance when $p_i = 0.05 + \frac{0.4(i-1)}{N-1}$ for $1 \le i \le N$; for the other parameters, the users make the same choices, namely $f_i(s) = s$, $\gamma_i = 0.6$, and $p^0_{e,i} = p^1_{e,i} = 0.1$ for $1 \le i \le N$. Figure 4a provides the performance when $f_i(s) = s^{0.5 + \frac{i-1}{N-1}}$ for $1 \le i \le N$; as before, the users make the same choices for the other parameters, namely $p_i = 0.3$, $\gamma_i = 0.6$, and $p^0_{e,i} = p^1_{e,i} = 0.1$ for $1 \le i \le N$. In Figures 3b and 4b, we force $p^0_{e,i} = 0$ for all users to ensure the existence of Whittle's index; the other choices remain the same as in Figures 3a and 4a. According to Corollary 1, the optimal policy never chooses a user with $\hat{r} = p^0_e = 0$ unless it is to break a tie. Therefore, in Figures 3b and 4b, we also consider the Greedy+ policy, where the base station always chooses the user with the largest AoII among the users with $\hat{r} = 1$. The resulting expected average AoII is obtained via the same simulations as applied to the Greedy policy. Figure 5 shows the performance in systems where the parameters of each user are generated uniformly at random within their ranges. In Figure 5a, we consider N = 5, $\gamma \in [0, 1]$, $p \in [0.05, 0.45]$, $p^{\hat{r}}_e \in [0, 0.45]$, and $f(s) = s^\tau$ with $\tau \in [0.5, 1.5]$. There are a total of 300 different choices, and the results are sorted by the performance of the RP-optimal policy in ascending order. Figure 5b adopts the same system settings, except that we impose $p^0_{e,i} = 0$ for $1 \le i \le N$ to ensure the feasibility of Whittle's index policy. Meanwhile, we omit the Greedy policy, since the Greedy+ policy achieves better performance, as indicated by Figures 3b and 4b. We can make the following observations from the figures.

• The Greedy+ policy yields a smaller expected average AoII than the Greedy policy. Recall that we obtained the Greedy+ policy by applying the structural properties detailed in Corollary 1. Therefore, simple applications of the structural properties of the optimal policy can improve the performance of scheduling policies.
• The Indexed priority policy has performance comparable to Whittle's index policy in all the system settings considered. The two policies have their own advantages: the Indexed priority policy has a broader scope of application, while Whittle's index policy has a lower computational complexity.
• The performance of the Indexed priority policy and Whittle's index policy is better than that of the Greedy/Greedy+ policies and is not far from the performance of the RP-optimal policy. Recall that the performance of the RP-optimal policy forms a universal lower bound on the performance of all admissible policies for PP. Hence, we can conclude that both the Indexed priority policy and Whittle's index policy achieve good performance.

Conclusions
In this paper, we studied the problem of minimizing the Age of Incorrect Information in a slotted-time system where a base station schedules M among N users and has access to imperfect channel state information in each time slot. The problem is a restless multi-armed bandit problem, which is PSPACE-hard in general. However, by casting the problem into a Markov decision process, we obtained structural properties of the optimal policy. Then, we introduced a relaxed version of the original problem and investigated the decoupled model. Under a simple condition, we established the indexability of the decoupled problem and obtained the expression of Whittle's index; on this basis, we developed Whittle's index policy. To remove the requirement of indexability, we developed the Indexed priority policy based on the optimal policy for the relaxed problem, where the characteristics of the relaxed problem are exploited to make the calculation of its optimal policy more efficient. Finally, through numerical results, we showed that simple applications of the structural properties can improve the performance of scheduling policies, and that Whittle's index policy and the Indexed priority policy achieve good and comparable performance.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Proof of Lemma 1
We consider two states, $x_1$ and $x_2$, that differ only in the value of $s_j$. Without loss of generality, we assume $s_{1,j} < s_{2,j}$. Then, it is sufficient to show that, for any $1 \le j \le N$, $V(x_1) \le V(x_2)$. Leveraging the iterative nature of VIA, we use mathematical induction to prove the monotonicity. First of all, the base case (i.e., $\nu = 0$) is true by initialization. We assume the lemma holds at iteration $\nu$ and examine whether it holds at iteration $\nu + 1$. The update step reported in (5) can be rewritten as in (A1).
To prove the desired results, we distinguish between the following cases.
• We first consider the case of $s_{1,j} = 0 < s_{2,j}$ and $\hat{r}_{1,j} = \hat{r}_{2,j} = 0$. When $a_j = 1$ and for any $x' \setminus \{s_j\}$, we compare the corresponding per-user terms of the update, where $V_\nu(x; s_j = 0)$ is the estimated value function of the state x with $s_j = 0$ at iteration $\nu$ (at the risk of abusing the notation, we use $V(x; s_j = s_1)$ and $V(x; s_j = s_2)$ to represent the value functions of two states that differ only in the value of $s_j$). The resulting inequalities hold since $\beta_j > p_j$ and Lemma 1 is true at iteration $\nu$ by assumption.
For the case of $a_i = 1$ where $i \neq j$, we notice that $a_j = 0$; then, for any $x' \setminus \{s_j\}$, the corresponding inequalities hold since $2p_j - 1 < 0$ and Lemma 1 is true at iteration $\nu$ by assumption. Combining with the case of $a_j = 1$, $U^j_\nu(x_1, x') \le U^j_\nu(x_2, x')$ holds for any $x' \setminus \{s_j\}$ under any feasible action. Since $x_1$ and $x_2$ differ only in the value of $s_j$ and $C(x)$ is non-decreasing in $s_i$ for $1 \le i \le N$, we can see that $V^a_{\nu+1}(x_1) \le V^a_{\nu+1}(x_2)$ for any feasible a. Then, by (A1), we can conclude that the lemma holds at iteration $\nu + 1$ when $s_{1,j} = 0 < s_{2,j}$ and $\hat{r}_{1,j} = \hat{r}_{2,j} = 0$.
• When $s_{1,j} = 0 < s_{2,j}$ and $\hat{r}_{1,j} = \hat{r}_{2,j} = 1$, by replacing the $\beta_j$'s in the above case with $\alpha_j$'s, we reach the same result.
• When $0 < s_{1,j} < s_{2,j}$ and $\hat{r}_{1,j} = \hat{r}_{2,j}$, we notice that
$$P_{s_{1,j}, s_{1,j}+1}(a_j, \hat{r}_{1,j}) = P_{s_{2,j}, s_{2,j}+1}(a_j, \hat{r}_{2,j}), \qquad P_{s_{1,j}, 0}(a_j, \hat{r}_{1,j}) = P_{s_{2,j}, 0}(a_j, \hat{r}_{2,j}).$$
Then, leveraging the monotonicity of $V_\nu(x)$ and $C(x)$, we reach the same conclusion.
Combining the three cases, we prove that the lemma also holds at iteration ν + 1 of VIA. Therefore, the lemma holds at any iteration ν by mathematical induction. Since the results hold for any 1 ≤ j ≤ N and VIA is guaranteed to converge to the value function when ν → +∞, we can conclude our proof.

Appendix B. Proof of Lemma 2
We inherit the notation from the proof of Lemma 1 and again use mathematical induction. The base case $\nu = 0$ is true by initialization. We assume the lemma holds at iteration $\nu$ and examine whether it still holds at iteration $\nu + 1$. In the case of M = 1, we rewrite (5) in a per-user form, denoted by (A3), where $P^i_{x,x'}(a_i)$ is the probability that action $a_i$ leads to state $x'$ when user i is at state x. To get the desired results, we distinguish between the following cases.
• We first show that $V^j_{\nu+1}(x) = V^k_{\nu+1}(P(x))$. According to (A3), it is obvious that for any $P(x)'$, there always exists $P(x') = P(x)'$. Then, the claimed equality follows: the second equality in the corresponding chain follows from the definition of $P(\cdot)$, the properties of summation, and the induction hypothesis at iteration $\nu$, while the last equality follows from variable renaming. Then, by the definition of statistical identity, we have $P^j_{x,x'}(a) = P^k_{P(x),P(x')}(a)$ and $C(x) = C(P(x))$.
Along the same lines, we can easily show that $V^k_{\nu+1}(x) = V^j_{\nu+1}(P(x))$ and $V^i_{\nu+1}(x) = V^i_{\nu+1}(P(x))$ for $i \neq j, k$. Combining the above cases with (A2), we prove that $V_{\nu+1}(x) = V_{\nu+1}(P(x))$. Then, by induction, $V_\nu(x) = V_\nu(P(x))$ at any iteration $\nu$. Since VIA is guaranteed to converge to the value function when $\nu \to +\infty$, we can conclude our proof.

Appendix C. Proof of Theorem 1
For arbitrary j and k, $\delta_{j,k}(x)$ can be written as a linear combination, with non-negative coefficients, of terms $R_{j,k}(x, x')$ that compare the transitions induced by serving user j with those induced by serving user k. With this in mind, we prove the properties one by one.
Property 1: $\delta_{j,k}(x) \le 0$ if $\hat{r}_k = p^0_{e,k} = 0$, with equality when $s_j = 0$ or $\hat{r}_j = p^0_{e,j} = 0$. When $\hat{r}_k = p^0_{e,k} = 0$, transmitting the update from user k necessarily fails. Therefore, $P_{s_k, s'_k}(0, 0) = P_{s_k, s'_k}(1, 0)$ for any $s_k$ and $s'_k$. To identify the sign of $R_{j,k}(x, x')$, we distinguish between the following cases.
• When $s_j = 0$, we can easily show that $R_{j,k}(x, x') = 0$ for any $x' \setminus \{s_j, s_k\}$ by noticing that the two possible actions with respect to user j (i.e., $a_j = 1$ and $a_j = 0$) are equivalent when $s_j = 0$. Since $\delta_{j,k}(x)$ is a linear combination of the $R_{j,k}(x, x')$'s with non-negative coefficients, we conclude that $\delta_{j,k}(x) = 0$ in this case.
• When $s_j > 0$ and $\hat{r}_j = 1$, for any $x' \setminus \{s_j, s_k\}$, the corresponding inequality holds because of Lemma 1 and the fact that $\alpha_j + p_j < 1$. We recall that $\delta_{j,k}(x)$ is a linear combination of the $R_{j,k}(x, x')$'s with non-negative coefficients; thus, $\delta_{j,k}(x) \le 0$ in this case.
• When $s_j > 0$ and $\hat{r}_j = 0$, by replacing the $\alpha_j$ in (A6) with $\beta_j$, we reach the same result. In this case, the equality holds when $\beta_j + p_j = 1$ or, equivalently, $p^0_{e,j} = 0$.
Combining the cases, we prove the first property.

Property 2: $\delta_{j,k}(x)$ is non-increasing in $\hat{r}_j$ and non-decreasing in $\hat{r}_k$ when $s_j, s_k > 0$; at the same time, $\delta_{j,k}(x)$ is independent of $\hat{r}_i$ for any $i \neq j, k$.
We first prove the monotonicity of $\delta_{j,k}(x)$ with respect to $\hat{r}_j$. To this end, we define $x_1$ and $x_2$ as two states that differ only in the value of $\hat{r}_j$. Without loss of generality, we assume $\hat{r}_{1,j} = 1$ and $\hat{r}_{2,j} = 0$. Then, we investigate the sign of $\delta_{j,k}(x_1) - \delta_{j,k}(x_2)$. We define $x_i \triangleq x_{1,i} = x_{2,i}$ for $i \neq j$. Then, according to (A4), $\delta_{j,k}(x_1) - \delta_{j,k}(x_2)$ can be written out explicitly. Since $x_{1,k} = x_{2,k}$, we have $P_{s_{1,k}, s'_k}(a, \hat{r}_{1,k}) = P_{s_{2,k}, s'_k}(a, \hat{r}_{2,k})$ for any $s'_k$. We recall that the transition probability is independent of $\hat{r}$ when $a = 0$. Combining with the fact that $s_{1,j} = s_{2,j}$, we also have $P_{s_{1,j}, s'_j}(0, \hat{r}_{1,j}) = P_{s_{2,j}, s'_j}(0, \hat{r}_{2,j})$ for any $s'_j$. Putting these together, we obtain
$$P_{s_{1,k}, s'_k}(1, \hat{r}_{1,k})\, P_{s_{1,j}, s'_j}(0, \hat{r}_{1,j}) = P_{s_{2,k}, s'_k}(1, \hat{r}_{2,k})\, P_{s_{2,j}, s'_j}(0, \hat{r}_{2,j}), \qquad P_{s_{1,k}, s'_k}(0, \hat{r}_{1,k}) = P_{s_{2,k}, s'_k}(0, \hat{r}_{2,k}).$$
Leveraging the above two equalities, the difference $\delta_{j,k}(x_1) - \delta_{j,k}(x_2)$ reduces to a linear combination, with non-negative coefficients, of differences that we denote by $R_1$. In the following, we characterize the sign of $R_1$. As $s_{1,j} = s_{2,j} > 0$, for any $x - \{s_j\}$, we have $R_1 \leq 0$; the inequality follows from Lemma 1 and the fact that $\beta_j > \alpha_j$. Since $\delta_{j,k}(x_1) - \delta_{j,k}(x_2)$ is a linear combination of $R_1$'s with non-negative coefficients, we can conclude that $\delta_{j,k}(x_1) \leq \delta_{j,k}(x_2)$. Since $\hat{r}_{1,j} > \hat{r}_{2,j}$, we can see that $\delta_{j,k}(x)$ is non-increasing in $\hat{r}_j$. In a very similar way, we can show that $\delta_{j,k}(x)$ is non-decreasing in $\hat{r}_k$. We recall that $\hat{r}_i$ does not affect the system dynamics if $a_i = 0$. Consequently, we can conclude that $\delta_{j,k}(x)$ is independent of $\hat{r}_i$ for any $i \neq j, k$.
Combining the cases, we prove the second property.
Property 3: $\delta_{j,k}(x) \leq 0$ if $s_k = 0$. The equality holds when $s_j = 0$ or $\hat{r}_j = p^0_{e,j} = 0$.
Since the coefficients are non-negative probabilities, it is sufficient to show that $R_{j,k}(x, x')$ satisfies Property 3 for any $x - \{s_j, s_k\}$. More precisely, it is sufficient to show that $R_{j,k}(x, x') \leq 0$ for any $x - \{s_j, s_k\}$ when $s_k = 0$, with equality when $s_j = 0$ or $\hat{r}_j = p^0_{e,j} = 0$. We recall that $P_{s_k, s'_k}(1, \hat{r}_k) = P_{s_k, s'_k}(0, \hat{r}_k)$ for any $s'_k$ when $s_k = 0$. Hence, $R_{j,k}(x, x')$ simplifies, and we investigate the resulting quantity for any $x - \{s_j\}$. To this end, we distinguish between the following cases.
• When $s_j = 0$, we have $P_{s_j, s'_j}(1, \hat{r}_j) = P_{s_j, s'_j}(0, \hat{r}_j)$ for any $s'_j$. Thus, we conclude that $R_{j,k}(x, x') = 0$ in this case.
• When $s_j > 0$ and $\hat{r}_j = 1$, for any $x - \{s_j\}$, the inequality follows from Lemma 1 and the fact that $\alpha_j + p_j < 1$. Thus, $R_{j,k}(x, x') \leq 0$ for any $x - \{s_j, s_k\}$.
• When $s_j > 0$ and $\hat{r}_j = 0$, by replacing the $\alpha_j$ in (A7) with $\beta_j$, we can get the same result. In this case, the equality holds when $\beta_j + p_j = 1$ or, equivalently, $p^0_{e,j} = 0$.
Combining the cases, we can conclude that Property 3 is true.
Property 4: $\delta_{j,k}(x)$ is non-increasing in $s_j$ if $\Gamma^{\hat{r}_j}_j \leq \Gamma^{\hat{r}_k}_k$ and is non-decreasing in $s_k$ if $\Gamma^{\hat{r}_j}_j \geq \Gamma^{\hat{r}_k}_k$ when $s_j, s_k > 0$.
As we did in the proof of Property 3, it is sufficient to show that $R_{j,k}(x, x')$ satisfies Property 4 for any $x - \{s_j, s_k\}$. We recall that $R_{j,k}(x, x')$ depends on the values of $\hat{r}_j$ and $\hat{r}_k$. Therefore, we distinguish between the following cases.
• In the case of $\hat{r}_j = \hat{r}_k = 1$ and $s_j, s_k > 0$, (A5) can be rewritten accordingly for any $x - \{s_j, s_k\}$. As we can verify, the resulting coefficients are captured by $\Gamma^1_j$ and $\Gamma^1_k$. Combining with Lemma 1, we can conclude that, for any $x - \{s_j, s_k\}$, $R_{j,k}(x, x')$ has the claimed monotonicity. In the case of $\hat{r}_j = \hat{r}_k = 0$ and $s_j, s_k > 0$, by replacing the $\alpha$'s in the above case with $\beta$'s, we can conclude with the same result.
• In the case of $\hat{r}_j = 1$, $\hat{r}_k = 0$, and $s_j, s_k > 0$, (A5) can again be rewritten accordingly for any $x - \{s_j, s_k\}$. As we can verify, combining with Lemma 1, we can conclude that, for any $x - \{s_j, s_k\}$, $R_{j,k}(x, x')$ has the claimed monotonicity. In the case of $\hat{r}_j = 0$, $\hat{r}_k = 1$, and $s_j, s_k > 0$, by swapping the $\alpha$'s and $\beta$'s in the above case, we can conclude with the same result.
Combining the cases, we conclude that $R_{j,k}(x, x')$ satisfies Property 4 for any $x - \{s_j, s_k\}$. Consequently, $\delta_{j,k}(x)$ is non-increasing in $s_j$ if $\Gamma^{\hat{r}_j}_j \leq \Gamma^{\hat{r}_k}_k$ and is non-decreasing in $s_k$ if $\Gamma^{\hat{r}_j}_j \geq \Gamma^{\hat{r}_k}_k$ when $s_j, s_k > 0$.
Property 5: $\delta_{j,k}(x) \leq 0$ if $s_j \geq s_k$, $\hat{r}_j \geq \hat{r}_k$, and users $j$ and $k$ are statistically identical.
According to Property 3, it is sufficient to consider the case where $s_j, s_k > 0$. We notice that the sign of $\delta_{j,k}(x)$ can be captured by the sign of the quantity $Q_{j,k}(x, x') \triangleq \sum_{\hat{r}_j, \hat{r}_k} \Pr(\hat{r}_j) \Pr(\hat{r}_k) R_{j,k}(x, x')$. Thus, we divide our discussion into the following cases.
• We first consider the case of $s_j \geq s_k > 0$ and $\hat{r}_j = \hat{r}_k = 0$. Leveraging the definition of statistically identical users, for any $x - \{x_j, x_k\}$, $Q_{j,k}(x, x')$ can be expanded with the coefficient $\kappa_1 = 1 - p_j - \beta_j \geq 0$. Then, by substituting the values of $\Pr(\hat{r})$ and using Lemma 2, we can simplify the expression. Since users $j$ and $k$ are statistically identical, we have $\gamma_j = \gamma_k$. Then, by Lemma 1, $Q_{j,k}(x, x') \leq 0$. For the case of $s_j \geq s_k > 0$ and $\hat{r}_j = \hat{r}_k = 1$, by replacing the $\beta_j$ in $\kappa_1$ with $\alpha_j$, we can conclude with the same result.
• Then, we consider the case of $s_j \geq s_k > 0$, $\hat{r}_j = 1$, and $\hat{r}_k = 0$. As users $j$ and $k$ are statistically identical, we have $p_j = p_k$ and $\alpha_j < \beta_k$. Then, as in the previous cases, we can leverage Lemmas 1 and 2 to conclude that $Q_{j,k}(x, x') \leq 0$ for any $x - \{x_j, x_k\}$. Consequently, $\delta_{j,k}(x) \leq 0$ in this case. The details are omitted for the sake of space.
Combining the cases, we conclude the proof of Property 5.

Appendix D. Proof of Corollary 2
We follow the same steps as in the proof of Lemma 1. To prove the corollary, it is sufficient to show that $V(x_1) \leq V(x_2)$ when $s_1 < s_2$ and $\hat{r}_1 = \hat{r}_2$. We use mathematical induction to prove the monotonicity. First of all, the base case (i.e., $\nu = 0$) is true by initialization. We assume the corollary holds at iteration $\nu$ and examine whether it holds at iteration $\nu + 1$. For the system with a single user, the update step reported in (5) can be simplified and rewritten as
$$V^a_{\nu+1}(x) = C(x, a) - \theta + \sum_{x'} P_{x, x'}(a)\, V_\nu(x'), \qquad V_{\nu+1}(x) = \min_{a \in \{0,1\}} V^a_{\nu+1}(x), \tag{A8}$$
where $\theta$ is the optimal value for $M_1(\lambda, -1)$. To prove the desired results, we distinguish between the following cases.
• We first consider the case of $s_1 = 0 < s_2$ and $\hat{r}_1 = \hat{r}_2 = 0$. When $a = 1$, subtracting the two expressions of $V^1_{\nu+1}$ yields $V^1_{\nu+1}(x_1) \leq V^1_{\nu+1}(x_2)$; the inequalities hold since $\beta > p$, $C(x, a)$ is non-decreasing in $s$, and Corollary 2 is true at iteration $\nu$ by assumption. For the case of $a = 0$, we similarly obtain $V^0_{\nu+1}(x_1) \leq V^0_{\nu+1}(x_2)$; the inequalities hold since $2p - 1 < 0$, $C(x, a)$ is non-decreasing in $s$, and Corollary 2 is true at iteration $\nu$ by assumption. Combining the two actions, we can see that $V^a_{\nu+1}(x_1) \leq V^a_{\nu+1}(x_2)$ for any feasible $a$. Then, by (A8), we can conclude that the corollary holds at iteration $\nu + 1$ when $s_1 = 0 < s_2$ and $\hat{r}_1 = \hat{r}_2 = 0$.
• When $s_1 = 0 < s_2$ and $\hat{r}_1 = \hat{r}_2 = 1$, by replacing the $\beta$'s in the above case with $\alpha$'s, we can achieve the same result.
• When $0 < s_1 < s_2$ and $\hat{r}_1 = \hat{r}_2$, we notice that $P_{s_1, s_1+1}(a, \hat{r}_1) = P_{s_2, s_2+1}(a, \hat{r}_2)$ and $P_{s_1, 0}(a, \hat{r}_1) = P_{s_2, 0}(a, \hat{r}_2)$. Then, leveraging the monotonicity of $V_\nu(x)$ and $C(x, a)$, we can conclude with the same result.
Combining the three cases, we prove that the corollary holds at iteration $\nu + 1$ of the VIA. Therefore, it holds at any iteration $\nu$ by mathematical induction. Since the VIA is guaranteed to converge to the value function as $\nu \to +\infty$, we can conclude our proof.

Appendix E. Proof of Proposition 1
We define $\Delta V(x) \triangleq V^1(x) - V^0(x)$, where $V^a(x)$ is the value function resulting from taking action $a$ at state $x$; $V^a(x)$ can be calculated as in (A9), where $\theta$ is the optimal value for $M_1(\lambda, -1)$. Hence, the optimal action at state $x$ can be fully characterized by the sign of $\Delta V(x)$. More precisely, the optimal action at state $x$ is $a = 1$ if $\Delta V(x) < 0$, and $a = 0$ is optimal otherwise. To determine the sign of $\Delta V(x)$ for each state, we distinguish between the following cases.
• We first consider the state $x = (0, \hat{r})$. Applying the results in Section 2.3 to (A9), we see that the two actions induce the same transition probabilities when $s = 0$, so transmitting only adds the charge $\lambda$. Therefore, $\Delta V(0, \hat{r}) = \lambda \geq 0$, and the optimal action at state $(0, \hat{r})$ is $a = 0$.
• Then, we consider the state $x = (s, 0)$ where $s > 0$. Applying the results in Section 2.3 to (A9), we obtain the corresponding expression of $\Delta V(s, 0)$.
• Finally, we consider the state $x = (s, 1)$ where $s > 0$. Following the same trajectory, we obtain the expression of $\Delta V(s, 1)$.
According to Corollary 2 and the fact that $p < 0.5$, we can see that $\Delta V(s, 0)$ and $\Delta V(s, 1)$ are both a constant $\lambda$ plus a term that is non-increasing in $s$. As the time penalty function is unbounded, the value function must also be unbounded. Then, combining the three cases, we can conclude the following: for fixed $\hat{r}$, there always exists a threshold $n_{\hat{r}} > 0$ such that the optimal action at state $(s, \hat{r})$ with $s \geq n_{\hat{r}}$ is $a = 1$; otherwise, $a = 0$ is optimal. Since $\hat{r} \in \{0, 1\}$, the optimal policy can be fully captured by the pair $(n_0, n_1)$.
In the following, we determine the relationship between $n_0$ and $n_1$. We have $\Delta V(s, 1) \leq \Delta V(s, 0)$ for any $s$. At the same time, for the threshold $n_0$, we know $\Delta V(n_0, 0) < 0$. Then, we have $\Delta V(n_0, 1) \leq \Delta V(n_0, 0) < 0$. Combined with the fact that $\Delta V(s, \hat{r})$ is non-increasing in $s$, we can conclude that the ordering $n_0 \geq n_1$ is true.
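The structure just derived is easy to check numerically. Below is a minimal sketch of relative value iteration for the decoupled problem $M_1(\lambda, -1)$: it verifies the monotonicity of the value function in $s$ (Corollary 2) and extracts the threshold pair $(n_0, n_1)$ of Proposition 1. All parameter values, the linear penalty $f(s) = s$, and the truncation of the AoII axis are illustrative assumptions, and the growth probabilities $\alpha$ and $\beta$ are our transcription of the dynamics in Section 2.3, not the paper's code.

```python
# Minimal numerical sketch of Appendices D-E: relative value iteration for
# M_1(lambda, -1) with the illustrative penalty f(s) = s. State: (s, r_hat).
import numpy as np

p, gamma, pe1, pe0 = 0.2, 0.6, 0.1, 0.1   # illustrative parameters
lam, S_MAX, EPS = 3.0, 200, 1e-9

# Pr[s -> s + 1] when transmitting (our transcription of Section 2.3):
alpha = 1 - ((1 - pe1) * (1 - p) + pe1 * p)   # a = 1, r_hat = 1
beta  = 1 - (pe0 * (1 - p) + (1 - pe0) * p)   # a = 1, r_hat = 0

def grow_prob(s, a, r_hat):
    """Pr[s' = s + 1]; the complementary mass goes to s' = 0."""
    if s == 0:
        return p                    # both actions are equivalent at s = 0
    if a == 0:
        return 1 - p
    return alpha if r_hat == 1 else beta

V = np.zeros((S_MAX + 1, 2))        # V[s, r_hat]
for _ in range(200000):
    EV = gamma * V[:, 1] + (1 - gamma) * V[:, 0]   # r_hat' ~ Bernoulli(gamma)
    V_new = np.empty_like(V)
    for s in range(S_MAX + 1):
        up = min(s + 1, S_MAX)
        for r_hat in (0, 1):
            V_new[s, r_hat] = min(
                s + a * lam + grow_prob(s, a, r_hat) * EV[up]
                + (1 - grow_prob(s, a, r_hat)) * EV[0]
                for a in (0, 1))
    V_new -= V_new[0, 0]            # relative normalization
    done = np.max(np.abs(V_new - V)) < EPS
    V = V_new
    if done:
        break

# Corollary 2: V(s, r_hat) is non-decreasing in s for fixed r_hat.
assert (np.diff(V[:, 0]) >= -1e-7).all() and (np.diff(V[:, 1]) >= -1e-7).all()

# Proposition 1: Delta V(s, r_hat) = lam + (growth gap) * (EV[s+1] - EV[0]);
# the first s at which it turns negative is the threshold n_{r_hat}.
EV = gamma * V[:, 1] + (1 - gamma) * V[:, 0]
def threshold(r_hat):
    for s in range(1, S_MAX):
        gap = grow_prob(s, 1, r_hat) - grow_prob(s, 0, r_hat)
        if lam + gap * (EV[s + 1] - EV[0]) < 0:
            return s
    return S_MAX

n0, n1 = threshold(0), threshold(1)
print(f"(n0, n1) = ({n0}, {n1})")
assert n0 >= n1                     # the ordering proved above
```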

Appendix F. Proof of Proposition 2
We notice that the dynamics of the AoII under a threshold policy can be fully captured by a Discrete-Time Markov Chain (DTMC). Then, the expected AoII $\bar{\Delta}_n$ and the expected transmission rate $\bar{\rho}_n$ under threshold policy $n = (n_0, n_1)$ can be obtained from the stationary distribution of the induced DTMC. Let the states of the induced DTMC be the values of $s$. We recall that $\hat{r}$ is an independent Bernoulli random variable with parameter $\gamma$. Combined with the results in Section 2.3, we can easily obtain the state transition probabilities of the induced DTMC, which are shown in Figure A1 (DTMC induced by the threshold policy $n = (n_0, n_1)$). The balance equations of the induced DTMC include
$$(1 - p)\,\pi_{k-1} = \pi_k \quad \text{for } 2 \leq k \leq n_1.$$
Then, we can easily solve the above system of linear equations. After some algebraic manipulation, we obtain the expressions of the $\pi_k$'s. Equipped with the above results, we proceed with calculating $\bar{\Delta}_n$ and $\bar{\rho}_n$. According to (6a), the expected AoII is
$$\bar{\Delta}_n = \sum_{k=0}^{\infty} f(k)\, \pi_k.$$
Substituting the expressions of the $\pi_k$'s, we can get the expression of $\bar{\Delta}_n$. Proposition 1 tells us the following.

• For state $(s, \hat{r})$ with $s < n_1$, it is optimal to stay idle (i.e., $a = 0$).
• For state $(s, \hat{r})$ with $n_1 \leq s < n_0$, it is optimal to make a transmission attempt only when $\hat{r} = 1$. We recall that $\hat{r}$ is an independent Bernoulli random variable with parameter $\gamma$. Therefore, the expected proportion of time that the system is at state $(s, 1)$ is $\gamma \pi_s$.
• For state $(s, \hat{r})$ with $s \geq n_0$, it is optimal to make a transmission attempt regardless of $\hat{r}$.
Combined with (6b), we have
$$\bar{\rho}_n = \gamma \sum_{k=n_1}^{n_0 - 1} \pi_k + \sum_{k=n_0}^{\infty} \pi_k.$$
Substituting the expressions of the $\pi_k$'s, we can obtain the closed-form expression of $\bar{\rho}_n$.
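The computation above is straightforward to reproduce numerically. The sketch below builds the induced DTMC for a given $(n_0, n_1)$, solves for the stationary distribution via the one-step recursion $\pi_{k+1} = \pi_k \Pr[s \to s+1]$, and evaluates $\bar{\Delta}_n$ and $\bar{\rho}_n$. The parameter values, the truncation, and the penalty $f(s) = s$ are illustrative assumptions; $\alpha$ and $\beta$ are transcribed as in the previous sketch.

```python
# Sketch of Appendix F (illustrative, not the paper's closed form): stationary
# distribution of the DTMC induced by the threshold policy n = (n0, n1), then
# the expected AoII (with f(s) = s) and the expected transmission rate.
import numpy as np

p, gamma, pe1, pe0 = 0.2, 0.6, 0.1, 0.1
alpha = 1 - ((1 - pe1) * (1 - p) + pe1 * p)   # growth prob., a = 1, r_hat = 1
beta  = 1 - (pe0 * (1 - p) + (1 - pe0) * p)   # growth prob., a = 1, r_hat = 0

def policy_performance(n0, n1, s_max=2000):
    """Return (expected AoII, expected transmission rate) under (n0, n1)."""
    grow = np.empty(s_max + 1)      # Pr[s -> s + 1] in the induced DTMC
    grow[0] = p
    for s in range(1, s_max + 1):
        if s < n1:                  # idle
            grow[s] = 1 - p
        elif s < n0:                # transmit only when r_hat = 1 (prob. gamma)
            grow[s] = gamma * alpha + (1 - gamma) * (1 - p)
        else:                       # transmit regardless of r_hat
            grow[s] = gamma * alpha + (1 - gamma) * beta
    # Only s is a predecessor of s + 1, so pi_{s+1} = pi_s * grow[s]; this
    # reproduces, e.g., (1 - p) pi_{k-1} = pi_k for 2 <= k <= n1.
    w = np.ones(s_max + 1)
    for s in range(s_max):
        w[s + 1] = w[s] * grow[s]
    pi = w / w.sum()
    aoii = float(np.arange(s_max + 1) @ pi)                   # f(s) = s
    rate = float(gamma * pi[n1:n0].sum() + pi[n0:].sum())
    return aoii, rate

print(policy_performance(n0=6, n1=3))
```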

Appendix G. Proof of Proposition 4
We first tackle the Whittle's indices at states $(0, \hat{r})$ and $(s, 0)$ with $s > 0$. To this end, we distinguish between the following cases.
• We first consider the state $x = (0, \hat{r})$. By definition, Whittle's index is the infimum $\lambda$ such that $V^0(x) = V^1(x)$. According to (A10), we can conclude that $W_x = 0$ when $x = (0, \hat{r})$.
• Then, we consider the state $x = (s, 0)$ where $s > 0$. We recall that $p^0_e = 0$. Then, we can conclude, from (A11), that $W_x = 0$ for all $x = (s, 0)$ with $s > 0$.
Now, we tackle the Whittle's index at state $x = (s, 1)$ where $s > 0$. For convenience, we denote by $W_n$ the Whittle's index at state $x = (n, 1)$. According to the monotonicity of $\Delta V(x)$ shown in the proof of Proposition 1, we can conclude that threshold policy $n = (+\infty, n + 1)$ is optimal when $V^0(n, 1) = V^1(n, 1)$. Then, we can prove the following lemma.
Lemma A1. When (9) is satisfied and $V^0(n, 1) = V^1(n, 1)$, we have $V(s, 0) = V(s, 1)$ for all $s \leq n$, so we can write $V(s)$ for both.
Proof. Since the value function satisfies the Bellman equation, it is sufficient to show that $V(s, 1)$ and $V(s, 0)$ satisfy the same Bellman equation. We recall the Bellman equation for $V(x)$, in which $\theta$ is the optimal value of the decoupled problem. We recall, from Corollary 3, that the optimal action at state $(s, 0)$ is staying idle (i.e., $a = 0$) for any $s$. We also know that threshold policy $n = (+\infty, n + 1)$ is optimal when $V^0(n, 1) = V^1(n, 1)$. Therefore, the optimal actions at states $(s, 0)$ and $(s, 1)$ with $s \leq n$ are the same (i.e., $a = 0$). Let $x_1 = (s, 1)$ and $x_2 = (s, 0)$. According to the system dynamics reported in Section 2.3, the state transition probabilities are independent of $\hat{r}$ when $a = 0$; meanwhile, $\hat{r}$ does not affect the instant cost. Then, for any $x'$, we have $P_{x_1, x'}(0) = P_{x_2, x'}(0)$, which concludes the proof.
Letting $s = 1$ yields one expression of $V(1)$. From (A17), $V(1)$ also satisfies a second expression. Equating the two expressions of $V(1)$, we obtain (A18). We recall that, when $V^0(n, 1) = V^1(n, 1)$, threshold policy $n = (+\infty, n + 1)$ is optimal and both actions at state $x = (n, 1)$ are equally desirable; thus, threshold policy $n = (+\infty, n)$ is also optimal. Then, we know $\theta = \bar{\Delta}_n + W_n \bar{\rho}_n$, where $\bar{\Delta}_n$ and $\bar{\rho}_n$ are the expected AoII and the expected transmission rate under threshold policy $n = (+\infty, n)$, respectively. Finally, combining (A16), (A18), and (A19) and after some algebraic manipulation, we obtain the closed-form expression of $W_n$ in (A15).
In the following, we investigate some properties of Whittle's index. First of all, $W_n$ is non-negative since $1 - p - \alpha$ and $V(n + 1, \hat{r})$ in (A15) are both non-negative. Meanwhile, combining (A15) with the fact that $V(n, \hat{r})$ is non-decreasing in $n$, we can verify that $W_n$ is non-decreasing in $n$. Combined with the Whittle's indices in the two other cases (i.e., $x = (0, \hat{r})$ and $x = (s, 0)$ with $s > 0$), we can easily obtain the properties of $W_x$ detailed in Proposition 4.
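As a numerical cross-check of (A15), $W_n$ can also be computed from the indifference relation used above: at $\lambda = W_n$, the threshold policies $(+\infty, n)$ and $(+\infty, n + 1)$ are both optimal, so $\bar{\Delta}_n + W_n \bar{\rho}_n = \bar{\Delta}_{n+1} + W_n \bar{\rho}_{n+1}$. The sketch below solves this equation using the `policy_performance` helper from the Appendix F sketch; since $n_0 = +\infty$, the states $(s, 0)$ are never scheduled, matching Corollary 3, and the value of $\beta$ never enters.

```python
# Whittle index at state (n, 1) from the indifference between the threshold
# policies (+inf, n) and (+inf, n + 1):
#   W_n = (aoii_{n+1} - aoii_n) / (rate_n - rate_{n+1}).
# Uses policy_performance(...) from the Appendix F sketch above.

INF = 10**6   # stands in for n0 = +infinity (beyond the truncated state space)

def whittle_index(n):
    aoii_lo, rate_lo = policy_performance(n0=INF, n1=n)
    aoii_hi, rate_hi = policy_performance(n0=INF, n1=n + 1)
    return (aoii_hi - aoii_lo) / (rate_lo - rate_hi)

W = [whittle_index(n) for n in range(1, 8)]
print(W)   # expected: non-negative and non-decreasing in n (Proposition 4)
```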

Appendix H. Proof of Proposition 5
We notice that $M_1(\lambda, -1)$ coincides with the decoupled model studied in Section 4.2. When (9) is satisfied, the decoupled problem is indexable and, according to Corollary 3, we only need to show that $n$ is the optimal threshold for the states with $\hat{r} = 1$. We first tackle the case of $\lambda > 0$. To this end, we divide our discussion into the following cases.
• For state $(s, 1)$ with $s < n$, $W_s \leq \lambda$ by definition. As the problem is indexable, we have $D(W_s) \subseteq D(\lambda)$. We recall that $W_s \triangleq \min\{\lambda' \geq 0 : V^0(s, 1) = V^1(s, 1)\}$ or, equivalently, $W_s = \min\{\lambda' \geq 0 : (s, 1) \in D(\lambda')\}$. Then, we know $(s, 1) \in D(W_s)$. Combining the above, we conclude that $(s, 1) \in D(\lambda)$. In other words, the optimal action at state $(s, 1)$ with $s < n$ is to stay idle (i.e., $a = 0$).
• For state $(s, 1)$ with $s \geq n$, we first recall that $W_s = \min\{\lambda' \geq 0 : (s, 1) \in D(\lambda')\}$. Consequently, for any $\lambda' < W_s$, we know $(s, 1) \notin D(\lambda')$. Meanwhile, we have $W_s \geq W_n > \lambda$ by the monotonicity of Whittle's index and the definition of $n$. Hence, we can conclude that $(s, 1) \notin D(\lambda)$. In other words, the optimal action at state $(s, 1)$ with $s \geq n$ is to make the transmission attempt.
Then, we conclude that $n$ is the optimal threshold for the states with $\hat{r} = 1$ when $\lambda > 0$. In the case of $\lambda = 0$, according to the proof of Proposition 1, we can easily verify that the optimal threshold is 1.

Appendix I. Proof of Theorem 2
We first make the following definitions. When $M_1(\lambda, -1)$ is at state $x$ and action $a$ is taken, the costs $C_1(x, a) \triangleq f(s)$ and $C_2(x, a) \triangleq \lambda a$ are incurred. We denote the expected $C_1$-cost and the expected $C_2$-cost under policy $\phi$ by $\bar{C}_1(\phi)$ and $\bar{C}_2(\phi)$, respectively. Let $G$ be a non-empty set of states. For a given state $i$, we define $R^*(i, G)$ as the class of policies $\phi$ for which the following hold:
• The probability $P_\phi(x_n \in G \text{ for some } n \geq 1 \mid x_0 = i) = 1$, where $x_n$ is the state of $M_1(\lambda, -1)$ at time $n$.
• The expected time $m_{iG}(\phi)$ of a first passage from $i$ to $G$ under $\phi$ is finite.
• The expected $C_1$-cost $\bar{C}^{i,G}_1(\phi)$ and the expected $C_2$-cost $\bar{C}^{i,G}_2(\phi)$ of a first passage from $i$ to $G$ under $\phi$ are finite.
With these definitions in mind, we proceed with verifying the assumptions given in [27].

1. For all $d > 0$, the set $A(d) = \{x \mid \text{there exists an action } a \text{ such that } C_1(x, a) + C_2(x, a) \leq d\}$ is finite: For any state $x$, the cost satisfies $C_1(x, a) + C_2(x, a) = f(s) + \lambda a \geq f(s)$, with equality when $a = 0$. Then, the states in $A(d)$ must satisfy $f(s) \leq d$. Combined with the fact that $f(s)$ is a non-decreasing and unbounded function on $s \in \mathbb{N}_0$, we can conclude that $A(d)$ is finite.
2. There exists a stationary policy $e$ such that the induced Markov chain has the following properties: the state space $S$ consists of a single (non-empty) positive recurrent class $R$ and a set $U$ of transient states such that $e \in R^*(i, R)$ for $i \in U$; moreover, both $\bar{C}_1(e)$ and $\bar{C}_2(e)$ on $R$ are finite: We consider the policy under which the base station makes a transmission attempt in every time slot. According to the system dynamics detailed in Section 2.3, all states communicate with state $(0, 0)$, and $(0, 0)$ communicates with all other states. Thus, the state space $S$ consists of a single (non-empty) positive recurrent class, and the set of transient states can simply be the empty set. $\bar{C}_1(e)$ and $\bar{C}_2(e)$ are trivially finite, as we can verify using Proposition 2.
3. Given any two states $x \neq y$, there exists a policy $\phi$ such that $\phi \in R^*(x, y)$: We notice that, under any policy, the maximum increase of $s$ between two consecutive time slots is 1; meanwhile, when $s$ decreases, it decreases to zero. Combined with the fact that $\hat{r}$ is an independent Bernoulli random variable, we can conclude that there always exists a path between any $x$ and $y$ with positive probability. $m_{xy}(\phi)$, $\bar{C}^{x,y}_1(\phi)$, and $\bar{C}^{x,y}_2(\phi)$ are trivially finite.
4. If a stationary policy $\phi$ has at least one positive recurrent state, then it has a single positive recurrent class $R$; moreover, if $x \notin R$, then $\phi \in R^*(x, R)$: Given that $\hat{r}$ is an independent Bernoulli random variable, we can easily conclude from the system dynamics that all states communicate with state $(0, 0)$ and that $(0, 0)$ communicates with all other states under any stationary policy. Therefore, any positive recurrent class must contain state $(0, 0)$. Thus, there can be only one positive recurrent class, namely $R = S$.
As the assumptions are verified, we proceed with introducing the optimal randomized policy for a given $\lambda$. We say a policy is $\lambda$-optimal if it is optimal for $M_1(\lambda, -1)$. We consider two monotone sequences $\lambda_n^+ \downarrow \lambda$ and $\lambda_n^- \uparrow \lambda$. Then, there exist subsequences of $\lambda_n^+$ and $\lambda_n^-$ such that the corresponding sequences of optimal policies converge. According to Lemma 3.7 of [27], the limit points, denoted by $n_{\lambda^+}$ and $n_{\lambda^-}$, are both $\lambda$-optimal. By Proposition 3.2 of [27], the Markov chains induced by $n_{\lambda^+}$ and $n_{\lambda^-}$ both contain a single non-empty positive recurrent class, and state $(0, 0)$ is positive recurrent in both induced Markov chains. Hence, the base station can choose which policy to follow each time the system reaches state $(0, 0)$ while keeping the resulting randomized policy $\lambda$-optimal, as suggested by Lemma 3.9 of [27]. More precisely, we consider the following randomized policy: each time the system reaches state $(0, 0)$, the base station chooses $n_{\lambda^-}$ with probability $\mu$ and $n_{\lambda^+}$ with probability $1 - \mu$. The chosen policy is followed until the next choice. We denote this policy by $n_\lambda$ and conclude that $n_\lambda$ is $\lambda$-optimal for any $\mu \in [0, 1]$.
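The mixing construction is easy to simulate. The sketch below runs the single-user chain under the randomized policy: whenever state $(0, 0)$ is visited, a threshold pair is redrawn, $n_{\lambda^-}$ with probability $\mu$ and $n_{\lambda^+}$ with probability $1 - \mu$. The two threshold pairs, all parameters, and the penalty $f(s) = s$ are illustrative assumptions; sweeping $\mu$ shows the long-run transmission rate interpolating between the rates of the two policies, which is exactly the degree of freedom the proof exploits.

```python
# Simulation sketch of the randomized policy in the proof of Theorem 2
# (illustrative thresholds and parameters, not the paper's code).
import random

p, gamma, pe1, pe0 = 0.2, 0.6, 0.1, 0.1
alpha = 1 - ((1 - pe1) * (1 - p) + pe1 * p)   # growth prob., a = 1, r_hat = 1
beta  = 1 - (pe0 * (1 - p) + (1 - pe0) * p)   # growth prob., a = 1, r_hat = 0

def simulate(n_minus, n_plus, mu, horizon=500_000, seed=1):
    rng = random.Random(seed)
    s, active = 0, n_minus
    total_aoii = total_tx = 0
    for _ in range(horizon):
        r_hat = 1 if rng.random() < gamma else 0
        if s == 0 and r_hat == 0:                 # state (0, 0): redraw policy
            active = n_minus if rng.random() < mu else n_plus
        n0, n1 = active
        a = 1 if (s >= n0 or (s >= n1 and r_hat == 1)) else 0
        total_aoii += s                           # f(s) = s
        total_tx += a
        if s == 0:
            grow = p
        elif a == 0:
            grow = 1 - p
        else:
            grow = alpha if r_hat == 1 else beta
        s = s + 1 if rng.random() < grow else 0
    return total_aoii / horizon, total_tx / horizon

# Sweeping mu moves the expected transmission rate between the two
# lambda-optimal threshold policies, the degree of freedom used in the proof.
for mu in (0.0, 0.5, 1.0):
    print(mu, simulate(n_minus=(6, 2), n_plus=(8, 3), mu=mu))
```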

Appendix J. Proof of Proposition 6
The value functions $V(x)$ and $V_i(x_i)$ must satisfy their own Bellman equations; we write them as in (A20), where $\theta$ and $\theta_i$ are the optimal values of $M_N(\lambda, -1)$ and $M^i_1(\lambda, -1)$, respectively. We recall from Section 2.3 that the users are independent when action $a$ and current state $x$ are given. Thus,
$$\Pr(x' \mid x, a) = \prod_{j=1}^{N} \Pr(x'_j \mid x, a), \qquad \sum_{x'_j} \Pr(x'_j \mid x, a) = 1.$$
We also recall from Section 2.3 that the state of user $i$ depends only on its previous state and the action with respect to user $i$. Thus, $\Pr(x'_i \mid x, a) = \Pr(x'_i \mid x_i, a_i)$.
Combining the above, we sum (A20) over all users, which yields (A21). We recall that $C(x, a) = \sum_{i=1}^{N} C(x_i, a_i)$ by definition. Then, leveraging (A21), we see that $\sum_{i=1}^{N} \theta_i$ and $\sum_{i=1}^{N} V_i(x_i)$ satisfy the Bellman equation of $M_N(\lambda, -1)$. Since the solution to the Bellman equation is unique [21], we must have $\sum_{i=1}^{N} V_i(x_i) = V(x)$ and $\sum_{i=1}^{N} \theta_i = \theta$. Then, we can conclude that it is optimal for $M_N(\lambda, -1)$ if each user adopts its own optimal policy.
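To make the summation step explicit, the following display is our transcription of the argument (the numbered equations (A20) and (A21) are not reproduced verbatim). Summing the per-user Bellman equations and using $\Pr(x'_i \mid x, a) = \Pr(x'_i \mid x_i, a_i)$ together with the product form of $\Pr(x' \mid x, a)$ gives
$$\sum_{i=1}^{N} \theta_i + \sum_{i=1}^{N} V_i(x_i) = \min_{a} \Big\{ C(x, a) + \sum_{x'} \Pr(x' \mid x, a) \sum_{i=1}^{N} V_i(x'_i) \Big\},$$
where the minimum over the action vector $a$ decomposes into per-user minima because the relaxed problem $M_N(\lambda, -1)$ imposes no per-slot scheduling constraint. This is precisely the Bellman equation satisfied by $(\theta, V(x))$, so uniqueness forces $V(x) = \sum_{i} V_i(x_i)$ and $\theta = \sum_{i} \theta_i$.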

Appendix K. Proof of Theorem 3
In this proof, we call a policy $\lambda^*$-optimal if it is optimal for $M_N(\lambda^*, -1)$. In Section 4.2, we ensured that, for each user, there exists at least one threshold policy that yields a finite expected AoII. Therefore, we can conclude that, for RP, there exists at least one policy under which the expected AoII and the expected transmission rate are both finite. Then, according to Lemma 3.10 of [27], a policy is optimal for RP if
1. The resulting expected transmission rate is equal to M.