An Optimal Flow Admission and Routing Control Policy for Resource Constrained Networks

Overloaded network devices are an increasing problem, especially in resource-limited networks, given the continuous and rapid growth in the number of wireless devices and the huge volume of data they generate. An admission and routing control policy at a network device can be used to balance the goals of maximizing throughput and ensuring sufficient resources for high priority flows. In this paper we formulate the admission and routing control problem for two types of flows, where one has a higher priority than the other, as a Markov decision problem. We characterize the optimal admission and routing policy, and show that it is a state-dependent threshold type policy. Furthermore, we conduct extensive numerical experiments to gain more insight into the behavior of the optimal policy under different system parameters. While dynamic programming can be used to solve such problems, the large size of the state space makes it intractable and too resource intensive to run on wireless devices. Therefore, we propose a fast heuristic that exploits the structure of the optimal policy. We empirically show that the heuristic performs very well, with an average reward deviation of 1.4% from the optimal, while being orders of magnitude faster to compute than the optimal policy. We further generalize the heuristic to a system with n (n > 2) types of flows.


Introduction
Efficient resource utilization is a primary problem in resource constrained networks. In wireless sensor networks (WSNs), for instance, energy efficiency is crucial to ensure network connectivity and quality of service. In WSNs, sensor nodes are generally deployed to transmit sensitive information in a timely manner. They rely on neighboring nodes to relay traffic to a given destination while operating on limited battery capacity. Energy is used when a node is listening, receiving, or transmitting. If a node's battery is depleted, neighboring nodes become incapable of relaying and transmitting urgent traffic through the node. More importantly, if said node belongs to the optimal path, a less efficient path will need to be computed, which reduces network throughput and consumes more of the scarce resources. It is expected that some sensor networks will be deployed over large and inhospitable areas [1][2][3][4][5][6]. Since these networks may not be accessible following deployment, it is crucial that implemented admission and routing policies are resource efficient (i.e., consume minimal energy).
Consider a sensor node A with two available paths to the same destination as shown in Figure 1. We define a task as the transmission of a single flow from a relaying node to the final destination. We say that a task is successful if its corresponding flow is treated to the full extent, that is, all packets that belong to the same flow reach their final destination. Consider the scenario where sensor node A is tasked with relaying two flows. To maximize task success, it may be more efficient to treat one task to the full extent and reject or partially treat the second task than to partially treat both tasks. For instance, suppose a node is simultaneously sending information about two events to a control center (CC): an attack on a battlefield and a fire in another nearby region. The CC would prefer to receive full information about one flow and act on it, rather than receive partial information about both flows that would be discarded. Hence, rejecting a flow may sometimes be more beneficial than accepting it and transmitting it only partially due to lack of resources. This saves energy while still transmitting one flow to the full extent.
Figure 1. The queueing system at node A, where type-i, i ∈ {1, 2}, arrivals join queue 1. Instead of being aborted from the system, packets are routed to queue 2; µ 2 ≤ µ 1 . Served at queue 1 or at queue 2, packets reach the same destination D over alternate paths.
In this paper, we model node A (Figure 1) as a queueing system where accepted packets belong to two different flows. We assume that one of the flows has higher priority than the other. Once a packet of a given flow is accepted to the system, it joins queue 1 and is guaranteed service to the full extent independent of its type. Instead of being rejected or preempted from service, packets have the option to be served at a slower server behind a second queue (queue 2). In that case, the packet is transmitted over a less rewarding path. Transmitted type-i packets are rewarded r i , i ∈ {1, 2}, r 1 > r 2 > 0 if served at queue 1, and r 3 ≤ r 2 if served at queue 2. Using the less rewarding path not only minimizes the number of packets rejected from the system, but also maximizes the chance that both flows will be treated to the full extent. Most importantly, allowing packets to be served at queue 2 extends the life of the path behind queue 1 (the efficient path).
We are interested in finding ways by which node A can, through its local decision policy, accept or reject and decide which path to choose to transmit packets. The objective is to minimize energy consumption and maximize the number of flows served, hence maximizing network throughput.
Such a decision control mechanism is fundamental to a variety of other interesting applications. For example, consider the case where a node experiences a flood of traffic as a sign of it being compromised. A node can be subject to a SYN flood, where an attacker attempts to fill the backlog queue of a victim machine's Transmission Control Protocol (TCP) server [7,8]. This results in resource depletion that renders the node unresponsive to legitimate traffic. By recognizing such a flood of traffic, a node may either classify it as high priority traffic, in order to identify the attacker and take the appropriate measures, or as low priority traffic and route it to another queue with a slower server.
This problem finds applications not only in computer and communication networks but in various other fields as well. Blockchain-based applications, for instance, suffer from high computational and storage expenses, negatively impacting overall performance and scalability [9]. Therefore, work has been done to move computation and data off the chain (off-chain). Off-chain transactions (i.e., high priority traffic) can be executed instantly and usually have a low or no transaction fee. However, on-chain transactions (i.e., low priority traffic) can have a lengthy lag time depending on the network load and on the number of transactions waiting in the queue to be confirmed. Similarly, control of multi-class queueing systems has received significant attention in supply chain management and manufacturing systems ([10][11][12] and references therein). One of the main tools for such control problems is to characterize a performance measure of interest and use optimization methods to find the optimal control policy [13][14][15]. An algorithm for optimal pricing and admission control is proposed in [14,16].
In this paper, we develop a dynamic programming formulation of the Admission and Routing Control (ARC) problem that maximizes the network throughput by extending the life (the resources) of the efficient path and thus the number of flows serviced to the full extent. We formulate the ARC problem as a Markov decision process (MDP) [17] and characterize the optimal policy under the Poisson traffic model. In particular, we show that the ARC policy that maximizes the expected reward is stationary and is a state-dependent threshold type policy. While dynamic programming can be used to solve such problems, the large size of the state space makes it intractable and too resource intensive to run on network devices, and especially on wireless devices. Therefore, we propose a fast heuristic that exploits the structure of the optimal policy. Much of the computation required for our method can be done off-line, and the real-time computation requires no more than a table lookup. Furthermore, computing the parameters of the heuristic control policy is orders of magnitude faster than computing the optimal policy. We empirically show that the heuristic performs very well, with an average reward deviation of 1.4% from the optimal. We further generalize the heuristic to a system with n (n > 2) types of flows. We believe our heuristic is general enough to be widely applicable and can be implemented in real time. Although the method we propose applies to general network resource constraints, we consider energy limitations as our motivating application.
The rest of the paper is organized as follows. In Section 2, we provide a literature review. In Section 3, we describe the model and provide its mathematical formulation. In Section 4, we develop properties of the expected reward value function and characterize the optimal ARC policy. In Section 5, we analyze the behavior of the optimal policy through extensive numerical experiments. In Section 6, we propose a heuristic control policy, for the general case of n types of packets, and conduct extensive numerical experiments, for the case of 2 types of flows, in order to assess its performance compared to the optimal policy. In Section 7, we summarize our findings and propose future directions of this work.

Literature
In this section, we review the existing literature pertaining to resource constrained environments in WSNs. This is by no means exhaustive; it is only indicative of the interest and the applications.
Extensive work has been done to address the problem of energy saving with respect to WSNs. In [18][19][20][21], controlled mobility is used to extend the network lifetime. Algorithms for self-organization of sensor networks have been proposed [22,23] to minimize the risk of data loss during transmission and to maximize the battery life of individual sensors. Various coverage optimization protocols have been studied [24,25] where a number of sensor nodes are deployed to ensure adequate coverage of a region. Using a coverage optimization protocol, nodes with overlapping sensing areas are turned off to reduce energy consumption. We refer the readers to the following survey [26] of the various other energy efficient coverage techniques. Research in [19] focused on shortest path algorithms to optimize energy consumption. However, using the shortest path may lead to an increase in the ratio of lost packets [27]. In [28], research focused on scheduling sensor nodes to switch on and off, depending on the queue size, to reduce energy consumption. However, switching nodes from an idle to a busy state and vice versa has been a major portion of the power consumption [29].
Numerous works have also treated admission control policies relevant to WSNs. Several algorithms have been used by Internet routers to decide on packet admission and rejection in order to manage queues and minimize congestion in TCP [30]. In Tail Drop [31], when the queue reaches its maximum capacity, newly arriving packets are rejected independent of their types. Though traffic may belong to different flows, it is not differentiated, and each packet is treated identically. When segments are lost, the TCP sender enters slow-start, which reduces throughput in that TCP session. A more severe problem occurs when segments from multiple TCP flows are dropped, causing global synchronization, that is, all of the involved TCP senders enter slow-start. This problem can be mitigated by routing segments from one flow (the lower priority flow, for instance) to a different queue, rather than discarding one segment from each flow. In Weighted Random Early Detection (WRED) [32], flows are differentiated and treated depending on their type. Packets with a higher IP precedence are less likely to be dropped than packets with a lower precedence. Thus, higher priority packets are delivered with a higher probability than lower priority packets. As in TCP, there is no guarantee that discarded packets belong to the same flow.
Queueing theory has also been widely used in network optimization [33][34][35]. In [36], a queueing network model was used to analyze and study the performance of a mobile WSN. In [37,38], models implementing admission control mechanisms to manage scarce radio resources in WSNs are described in the form of a queueing system with unreliable devices. Work in [39] proposed an energy saving mechanism that controls the ON/OFF state of a sensor node. A sensor node enters an OFF state (multiple fixed-duration vacation periods) as soon as its queue is empty and is turned on (changes to an ON state) only when its queue size reaches a threshold number of packets. An M/M/1-type queueing model with a control mechanism is proposed in [40] as a tool to reduce power consumption in WSNs. The tool switches a sensor node from an OFF mode to an ON mode only when its queue size reaches or exceeds a given size.
In order to achieve quality of service in a multi-class two parallel queue system, work in [41][42][43] dedicates a server to the high priority flow. In a resource constrained environment, such a model is not efficient. The dedicated server may be idle for a long time waiting, and consuming resources, for the high priority traffic to arrive while other servers may be congested. A different approach is considered in research that focuses on a single server and two classes of jobs. To minimize the sum of the holding/processing and switching costs, work in [44] switches serving between two classes of traffic. In [45] the authors consider a two-class single server preemptive priority queue where jobs can be denied admission to the system and can be aborted from service. Aborted jobs can rejoin the queue and resume service at a later time. Service does not have to restart but it continues from the step it was aborted from. In [46], an expulsion/scheduling control mechanism is proposed for a single class M/M/2 queueing system with non-identical servers. It was further extended in [47] to a multiple class system where job preemption is allowed.
The concept of termination control, studying a single server one-class workload model is introduced in [48]. In this work, the service of a job may be aborted before the job has received full service, and may be removed at any point in time from the queue. The authors further show that optimal threshold (acceptance and termination) policies exist.
A common characteristic of these methods is that they may find applications in areas such as workflow and assembly lines. However, they are not applicable to communication networks. One key element of communication networks is that once a packet transmission starts, it cannot be aborted and resume its service at a later time. A packet can be either fully transmitted or put back in the queue before its service starts. Moreover, in resource constrained networks, once a packet is queued or starts service, processing and computation resources are consumed. Preempting its service (transmission) and returning it to the queue for its service to restart at a later time, only consumes more of the scarce resources.
A recurrent assumption of existing work related to admission control is that low priority packets are denied admission to the queue when higher priority packets are already present. In the event they are accepted, low priority packets can be at any time rejected from the queue, or aborted from service in favor of a higher priority packet [45]. Another option in multiserver systems is to restrict sending the high priority flow over one of the available paths (generally the optimal one). These approaches may work well in certain applications such as delay tolerant networks and networks with unlimited resources. They are also applicable in general applications such as in workflow and in assembly lines. However, in resource constrained computer networks, once a packet is accepted to the system, resource (energy, computation, memory) consumption starts.

Model Description
We consider the system given in Figure 1. We model node A as a two-class queueing system and assume that type-i, i ∈ {1, 2}, packets arrive at node A according to independent Poisson processes with arrival rates λ i ≥ 0. We assume that node A has two paths that lead to the destination D: an energy efficient path through queue 1, and a less energy efficient path through queue 2. We assume that the service times at queue 1 and at queue 2 are exponentially distributed with rates µ 1 and µ 2 respectively, where µ 1 ≥ µ 2 . We assume that type-1 packets have higher priority than type-2 packets; hence, they are always accepted to queue 1 upon arrival. However, an admission control mechanism at node A decides whether to accept or reject type-2 packets. Arrivals (type-1 and accepted type-2 packets) join queue 1. At any event, arrival or service completion, the decision maker can decide to route a packet to queue 2 or to serve it at queue 1. This decision is made to give the higher priority packets an advantage in being served at queue 1 and transmitted over the more energy efficient path. In summary, the system is controlled in three ways: (1) all type-1 packets are always accepted to the system, (2) an admission policy at node A decides to accept or reject a newly arriving type-2 packet, and (3) a routing policy at node A decides to route a type-i, i ∈ {1, 2}, packet to queue 2 or to serve it at queue 1. However, type-1 packets are always given service priority at queue 1.
We assume that the decision maker has complete state information, i.e., it knows the instantaneous number of packets of type-i, i ∈ {1, 2}, in queue 1 and the total number of packets in queue 2. Thus, the structure of the system is that of a Markov decision process [49]. As such, we formulate the admission and routing problem as an MDP and use the value iteration technique to characterize the form of the optimal policy. We formulate the problem by defining the states, the transition structure and the feasible actions.
State. The state of the system is described by the vector x = (x 1 , x 2 , x 3 ), where x i , i ∈ {1, 2}, is the number of type-i packets in queue 1 and x 3 is the number of packets (type-1 and/or type-2) in queue 2.
Events. We distinguish three possible events: (1) the arrival of a new packet, (2) the service completion at queue 1, and (3) the service completion at queue 2.
Decisions. If the event is an arrival, then if the packet is a type-1, it is automatically accepted to queue 1 and this changes the state (x 1 , x 2 , x 3 ) into state (x 1 + 1, x 2 , x 3 ). If alternatively, it is a type-2 packet, then a decision has to be made to accept or reject the newly arrived packet. If the packet is accepted, then this changes the state (x 1 , x 2 , x 3 ) to (x 1 , x 2 + 1, x 3 ). If it is rejected, the state does not change. Next, the decision is either to serve a packet in queue 1 or to route it to queue 2. The idea here is, if there are type-1 packets in queue 1, they are given higher service priority at queue 1. Hence type-2 packets will be served at queue 1 only if there are no type-1 packets in queue 1. One may argue, since there are two queues why not dedicate queue 1 to type-1 packets and queue 2 to type-2 packets? The main reason for not using this approach is so that type-2 packets will not be deprived from using the more efficient path when there are no type-1 packets in queue 1. This also allows type-1 packets to be served at queue 2 in the event queue 1 is heavily congested or when the server at queue 2 is idle.
Costs and rewards. Packets earn a reward upon service completion. Packets served at queue 1 receive a type-dependent reward r i , i ∈ {1, 2}. Since type-1 packets have a higher priority than type-2 packets they receive a higher reward r 1 ≥ r 2 > 0. Packets served at queue 2 receive a reward r 3 (0 < r 3 < r 2 ) independent of their type.
Packets admitted to the system are also subject to a holding and processing cost h, incurred while waiting in the queue or while being served. We assume these costs are linear in the number of packets in the system and are type independent, namely, yh ≥ 0 per unit of time when there are y = x 1 + x 2 + x 3 packets present in the queues. In addition, each time a packet is admitted to the system, a type-independent admission cost c ≥ 0 is incurred. Rejecting packets is free of charge. Moreover, routing type-2 packets to queue 2 is free of charge, while routing type-1 packets incurs a positive switching cost c 2 . Imposing a positive switching cost on type-1 packets is intended to discourage these packets from being routed to queue 2 and using the less efficient path, especially when there are type-2 packets in queue 1.
For a reward to be collected and for the model to make sense, the cost incurred by a type-i packet at the queue where it is served must be smaller than the reward it collects there. When served at queue 1, a packet's reward satisfies r i > c + h/µ 1 . When served at queue 2, a packet's reward satisfies r 3 > c + h/µ 2 + c 2 I {i=1} , where the indicator I {i=1} = 1 if the packet is type-1 and zero otherwise. As our application is related to energy optimization in sensor networks, we assume that all the costs and rewards are in units of energy. The cost h can be interpreted as the energy consumed to process and maintain a packet in the queue. The cost c is the energy consumed to receive a packet and c 2 is the energy consumed to switch or move a packet from queue 1 to queue 2. Rewards can be interpreted as the energy saved by successfully transmitting a packet compared to rejecting it.
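As a quick sanity check of these conditions, the following minimal sketch evaluates them for the base-case parameter values reported in the sensitivity analysis section. The function name is hypothetical, and the queue-2 condition is taken here, as an assumption by analogy with the queue-1 condition, to be r 3 > c + h/µ 2 + c 2 I {i=1} .

```python
def packet_is_profitable(reward, c, h, mu, switching_cost=0.0):
    """Return True when the reward exceeds the admission cost, the expected
    holding cost h/mu, and any switching cost (hypothetical helper)."""
    expected_cost = c + h / mu + switching_cost
    return reward > expected_cost


# Base-case values from the sensitivity analysis section.
mu1, mu2 = 0.35, 0.45
r1, r2, r3 = 80.0, 60.0, 30.0
h, c, c2 = 0.7, 5.0, 0.0

checks = {
    "type-1 at queue 1": packet_is_profitable(r1, c, h, mu1),
    "type-2 at queue 1": packet_is_profitable(r2, c, h, mu1),
    # Assumed form of the queue-2 condition: type-1 also pays c2.
    "type-1 at queue 2": packet_is_profitable(r3, c, h, mu2, switching_cost=c2),
    "type-2 at queue 2": packet_is_profitable(r3, c, h, mu2),
}

for label, ok in checks.items():
    print(f"{label}: {'ok' if ok else 'violated'}")
```

With these values the expected cost of a packet served at queue 1 is c + h/µ 1 = 5 + 2 = 7 energy units, well below both r 1 and r 2 .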
Criterion. The objective is to maximize the expected discounted reward resulting from accepting, routing and servicing flows to completion over an infinite horizon.
Uniformization. In order to convert the continuous-time problem into a discrete-time one, we follow the uniformization technique of [50]. We adjust the transition rates of the embedded Markov chain of the system so that the times between decision epochs form a sequence of independent exponentially distributed random variables with mean 1/β, where β = α + λ 1 + λ 2 + µ 1 + µ 2 . Then, with probability λ i /β > 0, i = 1, 2, a transition concerns the arrival of a type-i packet; with probability µ j /β > 0, j = 1, 2, it concerns a service completion at queue j; and with probability α/β > 0, the process terminates. Without loss of generality, we scale the time line so that the rate β = 1.
Discounting. We discount future rewards at a rate α ≥ 0 (i.e., rewards at time t are multiplied by e^(−αt)). This is equivalent to a process that lasts an exponentially distributed time with mean 1/α, after which there will be no more arrivals or service completions.
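The uniformization and discounting steps can be illustrated with a short numerical sketch. The function name and the discount rate α = 0.05 are illustrative assumptions; the arrival and service rates are the base-case values used later in the numerical experiments.

```python
def uniformize(alpha, lam1, lam2, mu1, mu2):
    """Return the uniformization rate beta and the per-step event
    probabilities of the discrete-time chain (hypothetical helper)."""
    beta = alpha + lam1 + lam2 + mu1 + mu2
    probs = {
        "arrival_type1": lam1 / beta,
        "arrival_type2": lam2 / beta,
        "service_queue1": mu1 / beta,
        "service_queue2": mu2 / beta,
        "termination": alpha / beta,   # discounting seen as random stopping
    }
    return beta, probs


# alpha = 0.05 is an assumed value chosen only for illustration.
beta, probs = uniformize(alpha=0.05, lam1=0.25, lam2=0.25, mu1=0.35, mu2=0.45)
for event, p in probs.items():
    print(f"{event}: {p:.4f}")
```

Dividing every rate by β is what the text refers to as scaling the time line so that β = 1; the five probabilities then sum to one.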
Note that node A in Figure 1 can also be modeled as a single-shared-queue system. In this case, a routing decision can be made only when a packet reaches the head of the queue, leading to the head-of-line (HOL) blocking problem [51]. As such, single-shared-queue devices are perceived to have low performance due to HOL blocking [52,53]. This is the main reason why network devices generally use separate queues per output port. In this work, modeling node A as a two-queue system mitigates the HOL problem, especially since the routing decision is made not only right before service but also at the arrival of a packet.

Model Formulation
In the following, we summarize and complete the model in terms of a mathematical formulation. Let w n (x) be the expected discounted reward of responding to an event given that the system has reached state x following n state transitions starting from a randomly chosen initial state (i.e., w 0 (x) = 0 for all x ≥ 0, where 0 is the zero vector of dimension 3 and the inequality x ≥ 0 is taken component-wise).
Admission: Let T a i w n−1 (x) denote the expected discounted reward when an arrival of a type-i packet occurs and the system is in state x. Let e k be the k-th unit vector of dimension 3. A type-2 arrival is accepted to the system only if the difference in reward between accepting the packet and rejecting it is non-negative, i.e., w n (x + e 2 ) − c ≥ w n (x). Recall that type-1 packets are never rejected. Thus, for x ≥ 0, a type-1 arrival always moves the system to state x + e 1 at cost c, while a type-2 arrival takes the more rewarding of accepting (state x + e 2 at cost c) and rejecting (state x).
Service: When the system is in state x, a service decision for a packet at queue 1 is made as follows. If it is a type-1 packet, it proceeds with no delay to service at server 1. If it is a type-2 packet, it is served at server 1 only if there are no type-1 packets in queue 1. In queue 2, packets are served on a first-come-first-served basis independent of their type. The service operators T i at queue i, i ∈ {1, 2}, are defined according to these rules.
Let T s w n−1 (x) denote the expected discounted reward when the current state is x and an arrival or a service completion event occurs. Note that T s w n−1 (x), given by Equation (1), represents the expected reward assuming no routing to queue 2 occurred.
Routing: A type-1 packet may be routed to queue 2 only if no type-2 packets are in queue 1 (x 2 = 0) and it is more rewarding to route the packet to queue 2 than to keep it in queue 1. However, type-2 packets can be routed to queue 2 when x 2 > 0 and when it is more rewarding to do so. Let T r w n (x) denote the expected discounted reward when the current state is x and a routing decision to queue 2 is to occur.
The optimal expected discounted reward at state x is given by Equation (3). It is implied that at any event, arrival or service completion, the decision maker can decide to route a packet to queue 2 or serve it at queue 1.
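To make the recursion concrete, the following is a minimal value iteration sketch under stated assumptions; it is not the paper's exact Equations (1)-(3). The state space is truncated at a hypothetical per-coordinate limit `cap` (arrivals beyond it are blocked), the holding cost is charged once per uniformized step, and at most one packet is routed per decision epoch. The discount rate α = 0.1 is also an illustrative choice.

```python
from itertools import product


def value_iteration(lam1, lam2, mu1, mu2, r1, r2, r3, h, c, c2,
                    alpha, cap, tol=1e-5, max_iter=10_000):
    """Discounted value iteration on a truncated state space (sketch).

    cap is an assumed buffer limit per coordinate, not part of the model.
    """
    beta = alpha + lam1 + lam2 + mu1 + mu2            # uniformization rate
    p_l1, p_l2 = lam1 / beta, lam2 / beta
    p_m1, p_m2 = mu1 / beta, mu2 / beta
    states = list(product(range(cap + 1), repeat=3))
    w = {x: 0.0 for x in states}                      # w_0(x) = 0

    def event_value(x):
        """Expected reward of the next event when no routing occurs."""
        x1, x2, x3 = x
        # Type-1 arrival: always admitted (blocked only by the truncation).
        a1 = w[(x1 + 1, x2, x3)] - c if x1 < cap else w[x]
        # Type-2 arrival: admitted only if that beats rejecting it.
        a2 = max(w[(x1, x2 + 1, x3)] - c, w[x]) if x2 < cap else w[x]
        # Queue 1 service: type-1 has priority over type-2.
        if x1 > 0:
            s1 = r1 + w[(x1 - 1, x2, x3)]
        elif x2 > 0:
            s1 = r2 + w[(x1, x2 - 1, x3)]
        else:
            s1 = w[x]
        # Queue 2 service: reward r3 regardless of packet type.
        s2 = r3 + w[(x1, x2, x3 - 1)] if x3 > 0 else w[x]
        holding = h * (x1 + x2 + x3) / beta           # expected cost per step
        return -holding + p_l1 * a1 + p_l2 * a2 + p_m1 * s1 + p_m2 * s2

    for _ in range(max_iter):
        v = {x: event_value(x) for x in states}
        new_w = {}
        for x in states:
            x1, x2, x3 = x
            best = v[x]                               # no routing
            if x3 < cap:
                if x2 > 0:                            # route a type-2 packet, free
                    best = max(best, v[(x1, x2 - 1, x3 + 1)])
                elif x1 > 0:                          # route a type-1 packet, cost c2
                    best = max(best, v[(x1 - 1, x2, x3 + 1)] - c2)
            new_w[x] = best
        gap = max(abs(new_w[x] - w[x]) for x in states)
        w = new_w
        if gap < tol:
            break
    return w


# Base-case rates and rewards; alpha and cap are assumed for illustration.
values = value_iteration(lam1=0.25, lam2=0.25, mu1=0.35, mu2=0.45,
                         r1=80, r2=60, r3=30, h=0.7, c=5, c2=0,
                         alpha=0.1, cap=4)
print(f"value of the empty system: {values[(0, 0, 0)]:.3f}")
```

The admission and routing decisions can be read off by recording which branch attains each maximum in the final sweep; the truncation level should be increased until the decisions of interest no longer change.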

Characterization of the Optimal ARC Policy
To characterize the optimal policy, we use the value iteration technique introduced in [54,55], recursively evaluating w n using Equation (3) for n ≥ 0. We prove by induction that if certain structural properties of the discounted reward function hold for w n , then they also hold for w n+1 and therefore for all n ≥ 0. As n tends to infinity, the n-stage optimal policy converges to the optimal infinite-horizon policy. This convergence is ensured by Theorem 8.10.1 in [17]. It is an important result in the MDP literature and rests on showing that the iteration from w n to w n+1 is a contraction mapping, as stated in Theorem 6.2.3 in [17]. The latter theorem also shows that the optimal infinite-horizon policy is independent of the choice of w 0 , which is why one can simply choose w 0 (x) = 0.
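For intuition, the contraction argument can be written out as follows. This is a sketch under two assumptions: the value functions are bounded (for example, on a truncated state space), and time is scaled so that β = 1, which leaves a probability α of termination at each step. The modulus 1 − α is our reading of the uniformization step rather than a formula quoted from [17]; T denotes the one-step operator and w^* its fixed point.

```latex
\[
\| T w - T w' \|_\infty \;\le\; (1-\alpha)\, \| w - w' \|_\infty ,
\qquad
\| w_n - w^* \|_\infty \;\le\; (1-\alpha)^n \, \| w_0 - w^* \|_\infty .
\]
```

The first inequality holds because each event operator is a maximum of shifted copies of w and is therefore non-expansive, while the event probabilities sum to 1 − α; the second follows by applying the first n times.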

Reward Function Properties
Solving the optimality Equation (3) analytically is intractable. Hence, in order to characterize the structure of the optimal policy, we show that the optimal reward function satisfies a set of properties which allow us to infer the structure of the optimal policy. The properties are listed and interpreted below.
Property 1 implies that w n (x) is concave in each of the state variables x i . In other words, it implies that the marginal reward (i.e., w n (x + e i ) − w n (x)) of an additional packet of type-i, i ∈ {1, 2}, in queue 1 is non-increasing in the number of packets x i for a fixed x j , j ≠ i, and a fixed number of packets x 3 in queue 2. It also implies that the marginal reward of an additional packet in queue 2 is non-increasing in the number of packets x 3 for a fixed number of packets of type-i, i ∈ {1, 2}, in queue 1.
Property 2, for i = 2 and j = 3, states that the marginal reward of an additional type-2 packet in queue 1 is non-increasing in x 2 . Therefore, routing a type-2 packet to queue 2 is less rewarding than servicing it at queue 1. Similarly, Property 2, when i = 1 and j = 3, states that the marginal reward of an additional type-1 packet in queue 1 is non-increasing in x 1 . Consequently, routing a type-1 packet to queue 2 is less rewarding than servicing it at queue 1. The other cases have similar interpretations.
Property 3 states that the marginal reward of an additional type-i packet is non-increasing in x j , i, j ∈ {1, 2} and i ≠ j, for fixed x i . Similarly, the marginal value of an additional packet in queue 1 is non-increasing in x j , j ∈ {1, 2}, for fixed queue 2 size. Mathematically, Property 3 indicates that the reward value function is sub-modular.
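For reference, the two notions named above have the following standard definitions on the integer lattice. These are the textbook forms of component-wise concavity and sub-modularity, given here only as a reading aid; they are not a transcription of the paper's Properties 1 and 3, and Property 2 is not reproduced.

```latex
\begin{align*}
\text{Concavity in } x_i:\quad
  & w_n(x + 2e_i) - w_n(x + e_i) \;\le\; w_n(x + e_i) - w_n(x), \\
\text{Sub-modularity } (i \neq j):\quad
  & w_n(x + e_i + e_j) - w_n(x + e_j) \;\le\; w_n(x + e_i) - w_n(x).
\end{align*}
```

Both inequalities say that a marginal reward w n (x + e i ) − w n (x) shrinks as the system fills up, which is what produces threshold-type decisions.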

Reward Function Bounds
Since all accepted packets are guaranteed to be served at one of the two queues, packets will collect a reward upon service completion as long as r i > c + h/µ 1 and r 3 > c + h/µ 2 + c 2 I {i=1} . However, since the reward depends on the packet type and on the queue where the packet resides, in this subsection we bound the reward collected by packet type. We make use of a sample path approach [56] to prove the following propositions.

Proposition 1. For all n ≥ 0 and x ≥ 0, the difference in reward of serving a type-2 packet at queue 1 does not exceed r 2 , that is, w n (x + e 2 ) − w n (x) ≤ r 2 .
Proof. Using a sample path analysis, consider two instances Π 1 and Π 2 of the policy, where Π 1 starts at state x + e 2 and Π 2 starts at state x. Π 1 follows the actions of the optimal policy and Π 2 copies the actions of Π 1 . An arrival to both instances changes the rewards equally (every arrival is charged a cost of c). Consider the departure of the extra packet from state x + e 2 . If it is due to a service completion at queue 1, we must have x 1 = 0 (otherwise a type-1 packet takes priority in service); a reward of r 2 is generated and immediately afterwards Π 1 and Π 2 become identical. The departure can also be a route to queue 2. In this case, since there is no switching cost for type-2 packets, the reward does not change. Hence, the difference in reward is at most r 2 .
Proposition 2. For all n ≥ 0 and x ≥ 0, the difference in reward of serving a packet at queue 2 does not exceed r 3 , that is, w n (x + e 3 ) − w n (x) ≤ r 3 .
Proof. Using a sample path analysis, consider two instances Π 1 and Π 2 of the policy, where Π 1 starts at state x + e 3 and Π 2 starts at state x. Π 1 follows the actions of the optimal policy and Π 2 copies the actions of Π 1 . A departure from both instances changes the rewards equally (every departure is rewarded r 3 independent of the packet type). Hence, upon the departure of the extra packet from queue 2 at state x + e 3 , Π 1 and Π 2 become identical, so the difference in reward is at most r 3 .

Proposition 3. For all n ≥ 0 and x ≥ 0, the difference in reward of serving a type-1 packet at queue 1 does not exceed r 1 , that is, w n (x + e 1 ) − w n (x) ≤ r 1 .
Proof. Using a sample path analysis, consider two instances Π 1 and Π 2 of the policy, where Π 1 starts at state x + e 1 and Π 2 starts at state x. Π 1 follows the actions of the optimal policy and Π 2 copies the actions of Π 1 . An arrival to both instances changes the rewards equally (every arrival is charged a cost of c). In the event of a departure from state x + e 1 due to a service completion at queue 1, a reward of r 1 is generated (since type-1 takes priority over type-2, the departure will be of type-1 unless x 1 = 0), and immediately afterwards Π 1 and Π 2 become identical. So the difference in reward is at most r 1 . The departure can also be a route to queue 2. Note that a routing in both instances changes the reward equally by the switching cost c 2 < r 1 .

Proposition 4. For all n ≥ 0 and x ≥ 0, the difference in reward between serving a type-i packet, i ∈ {1, 2}, at queue 1 and serving it at queue 2 is at least r i − r 3 , that is, w n (x + e i ) − w n (x + e 3 ) ≥ r i − r 3 .
Proof. Using a sample path analysis, we first consider the case where i = 1 and prove w n (x + e 1 ) − w n (x + e 3 ) ≥ r 1 − r 3 . Consider two instances Π 1 and Π 2 of the policy, where Π 1 starts at state x + e 3 and Π 2 starts at state x + e 1 . Instance Π 1 follows the optimal policy and instance Π 2 copies the actions of Π 1 . That is, if Π 1 routes its packet, then Π 2 routes its packet, and if Π 1 takes its packet into service, then Π 2 takes its packet into service. For both instances, while the packets are still in the system, their costs and rewards are the same. Hence, the difference in reward between the two instances is zero. However, once the packet is served, a reward of r 3 is collected in Π 1 and a reward of r 1 is collected in Π 2 . Hence, the difference in reward between the two instances is r 1 − r 3 > 0.
The proof of the case where i = 2 is very similar to the above proof. It suffices to replace e 1 by e 2 and r 1 by r 2 .
Note that in this work, the admission cost is not as relevant as the reward as it is packet type-independent. However, the problem can be easily generalized to assigning type-dependent costs c a i > 0, i ∈ {1, 2}, (c a 1 = c a 2 ). On the other hand, the switching cost c 2 is important for type-1 packets as they are charged only in the event they are routed to queue 2.
We conclude this section with the main result of the paper, stated in the following theorem.

Theorem 1.
There exists a stationary optimal policy for any initial state $x = (x_1, x_2, x_3)$ such that:
• Admission policy: The optimal admission control policy is of state-dependent threshold type, with threshold curve $A(x_1, x_3)$, such that a type-2 packet is admitted to queue 1 if and only if the state lies below this curve.
• Routing policy: The optimal routing control policy is of state-dependent threshold type, with threshold curves $R_1(x_3)$ and $R_2(x_1, x_3)$, such that:
1. A type-1 packet is routed to queue 2 if and only if $x_2 = 0$ and the threshold condition given by $R_1(x_3)$ is met.
2. A type-2 packet is routed to queue 2 if and only if $x_2 \geq R_2(x_1, x_3)$.
The results of Theorem 1 also apply to the average reward criterion (see [57]). Hence, we use the average reward criterion for all our numerical experiments, as it has the advantage of not depending on the initial state. To prove the theorem, we prove Properties 1-3. However, for ease of flow, we defer all mathematical proofs to Appendix A.

Sensitivity Analysis of the Optimal Policy
In this section, we study the optimal control policy described in Theorem 1 and its sensitivity to the system parameters through extensive numerical experiments. As an illustration, we consider a base case where $\mu_1 = 0.35$, $\mu_2 = 0.45$, $\lambda_1 = 0.25$, $\lambda_2 = 0.25$, $r_1 = 80$, $r_2 = 60$, $r_3 = 30$, $h = 0.7$, $c = 5$, $c_2 = 0$. The optimal policy is computed using the value iteration algorithm of dynamic programming [58]. Convergence is declared when the expected rewards of successive iterations differ by less than $10^{-5}$.
The optimal admission control policy for this system is presented in Figure 2a. The optimal action is to reject type-2 packets in all states above the switching curve (above the red line) and to accept them in all states below the curve. Similarly, the optimal routing policy for the system is presented in Figure 2b. The optimal action is to route type-2 packets to queue 2 only in states above the switching curve (above the blue and red lines). In states below the switching curve (below the blue line), no routing is allowed. The only states where type-1 packets are routed to queue 2 lie below the red line, that is, where $x_2 = 0$. We experimented with several sets of system parameters and obtained the same results in terms of the shape of the switching curves of the admission control policy and the routing control policy.
In Figure 3, we superpose both control policy curves and note that the system gives type-1 packets priority for service at queue 1 by routing excess type-2 packets to queue 2. All numerical results gave a straight-line switching curve with a slope of $-1$ in the $(x_1, x_2)$ plane for a given number $x_3$ of packets in queue 2. However, proving this result analytically is intractable as it amounts to solving the optimality Equation (3) in closed form.
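The value iteration step mentioned above can be illustrated with a minimal sketch. It is not the model of this paper: it runs relative value iteration on a generic finite MDP supplied as plain lists, and the names `relative_value_iteration`, `P`, `R` and the two-state toy problem are assumptions made for illustration only. The stopping rule (span of successive differences below $10^{-5}$) is one common way to implement the accuracy criterion and is likewise an assumption.

```python
def relative_value_iteration(P, R, tol=1e-5, max_iter=100_000):
    """Average-reward value iteration on a finite MDP (illustrative sketch).

    P[a][s][t] is the probability of moving from state s to state t under
    action a; R[a][s] is the one-step reward. Returns (gain, bias, policy).
    """
    n_actions, n_states = len(R), len(R[0])
    w = [0.0] * n_states
    for _ in range(max_iter):
        # q[s][a]: one-step reward plus expected continuation value.
        q = [[R[a][s] + sum(P[a][s][t] * w[t] for t in range(n_states))
              for a in range(n_actions)] for s in range(n_states)]
        new_w = [max(row) for row in q]
        diff = [new_w[s] - w[s] for s in range(n_states)]
        # Stop when the span of successive differences is below tol.
        if max(diff) - min(diff) < tol:
            gain = 0.5 * (max(diff) + min(diff))
            policy = [max(range(n_actions), key=lambda a, s=s: q[s][a])
                      for s in range(n_states)]
            return gain, [v - new_w[0] for v in new_w], policy
        # Re-centre on state 0 so the values stay bounded.
        w = [v - new_w[0] for v in new_w]
    raise RuntimeError("value iteration did not converge")


# Hypothetical toy problem: action 0 = stay, action 1 = switch state.
P = [[[1.0, 0.0], [0.0, 1.0]],   # stay
     [[0.0, 1.0], [1.0, 0.0]]]   # switch
R = [[1.0, 2.0],                 # staying pays 1 in state 0 and 2 in state 1
     [0.0, 0.0]]                 # switching pays nothing
gain, bias, policy = relative_value_iteration(P, R)
print(gain, policy)
```

In the toy problem the best long-run behaviour is to move to state 1 and stay there, so the sketch returns an average reward of 2 per transition.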
We further study the effect of the system parameters on the optimal average reward for various network load values, ρ ∈ {50%, 75%, 95%}. We isolate the effect of a particular system parameter by varying its value while holding the values of other system parameters constant.
We study the effect of increasing reward $r_2$ while keeping the sum of the rewards $r_1$ and $r_2$ constant. Figure 4a shows that the optimal average reward decreases nonlinearly as the ratio $r_2/r_1$ increases. This behavior can be explained as follows. As $r_2$ increases, the incentive for packets to be routed to queue 2 decreases. Hence, queue 1 becomes overloaded and the overall holding cost eventually becomes high, reducing the optimal average reward. Note, however, that under heavy network load ($\rho = 95\%$), routing to queue 2 becomes inevitable at a certain point, causing a lower reward than in a system with lower network load ($\rho = 75\%$). This also explains the crossover in the figure between the optimal average reward curves for $\rho = 75\%$ and $\rho = 95\%$.
We further study the effect of increasing reward $r_3$ on the optimal average reward while keeping the sum of the rewards $r_2$ and $r_3$ constant. Figure 4b shows that the optimal average reward increases nonlinearly as the ratio $r_3/r_2$ increases when the network load is high ($\rho = 75\%$ and $\rho = 95\%$). This increase is due to the fact that, as $r_3$ increases, packets in queue 1 have more incentive to be routed to queue 2, especially if queue 1 holds type-1 packets. As type-2 packets are routed to queue 2, more space opens up in queue 1 for type-1 packets and more type-2 packets are admitted; hence, the optimal average reward increases. Moreover, as the network load increases, routing admitted packets to queue 2 sometimes becomes necessary. Indeed, for the higher network load ($\rho = 95\%$), the optimal average reward increases at a faster rate than for the lower load of $\rho = 75\%$. For the lowest network load ($\rho = 50\%$), however, the optimal average reward decreases nonlinearly as $r_3$ increases. This can be explained as follows: as $r_3$ increases, and if type-1 packets are in queue 1, type-2 packets have more incentive to be routed to queue 2, where they collect a lower reward; hence, the optimal average reward decreases. The graphs in Figure 4a,b show results for $\mu_1 = 0.45$, $\mu_2 = 0.35$, $h = 0.7$, $c = 5$, $c_2 = 0$.
Figure 4c shows the sensitivity of the optimal average reward to the arrival rates while keeping the sum of $\lambda_1$ and $\lambda_2$ constant. The figure shows that the optimal average reward initially increases at a high rate as $\lambda_1$ increases, then the rate of increase slows down. The increase continues until queue 1 becomes congested enough that type-1 packets are routed to queue 2, where they collect the lower reward $r_3$, which decreases the optimal average reward. For high network load ($\rho = 95\%$), however, the optimal average reward eventually starts decreasing as the high load makes it necessary to route more packets to queue 2. Eventually, both queues saturate, leading to an increase in the holding cost and a decrease in the optimal average reward. The graph in Figure 4c shows results for the following system parameters: $r_1 = 80$, $r_2 = 60$, $r_3 = 30$, $\mu_1 = 0.45$, $\mu_2 = 0.3$, $h = 0.7$, $c = 5$ and $c_2 = 0$.
In Figure 4d, we study the effect of the service rates $\mu_1$ and $\mu_2$ on the optimal average reward while keeping the sum of $\mu_1$ and $\mu_2$ constant. As the ratio $\mu_1/\mu_2$ increases, the optimal average reward increases and eventually levels off for all network loads considered ($\rho \in \{50\%, 75\%, 95\%\}$). This can be explained as follows. As the service rate $\mu_1$ of queue 1 increases, queue 1 becomes shorter, discouraging packets from being routed to queue 2. Hence, the optimal average reward increases. As $\mu_1$ continues to increase relative to $\mu_2$, less and less routing occurs, eliminating the need for queue 2, which explains the leveling off of the optimal average reward (since the arrival rates are held constant). In practice, however, achieving a service rate high enough to eliminate queue 2 is rather costly. The graph in Figure 4d shows results for the following system parameters: $r_1 = 80$, $r_2 = 60$, $r_3 = 30$, $\lambda_1 = 0.35$, $\lambda_2 = 0.3$, $h = 0.7$, $c = 5$ and $c_2 = 0$.
Figure 4. Optimal reward sensitivity to rewards $r_1$, $r_2$, $r_3$, service rates $\mu_1$, $\mu_2$, arrival rates $\lambda_1$, $\lambda_2$, holding cost $h$ and admission cost $c$.
Figure 4e shows that the optimal average reward is nonlinearly decreasing in the holding cost $h$. Figure 4f shows that the optimal average reward is linearly decreasing in the admission cost $c$. These results are interesting and worth exploiting in future work in an attempt to obtain an analytical expression of the reward. The graphs in Figure 4e,f show results for the following system parameters: $r_1 = 80$, $r_2 = 60$, $r_3 = 30$, $\lambda_1 = 0.35$, $\lambda_2 = 0.3$, $c_2 = 0$ and various values of $h$ and $c$, respectively.
Finally, we note that even though our analysis focused on the case where the switching cost is zero ($c_2 = 0$), similar results are obtained for $c_2 > 0$. As an illustration, Figure 5 shows that the optimal reward decreases linearly in $c_2$. This result is expected since, as the switching cost increases, type-1 packets have no incentive to be routed to queue 2. Moreover, as the network load increases, the optimal reward increases up to a point where both queues become congested. This explains why the optimal reward for $\rho = 75\%$ is higher than the optimal reward for $\rho = 95\%$ as $c_2$ increases. This also leads to the conclusion that there is an optimal load at which the reward is maximized. The results in Figure 5 are obtained for system parameters: $r_1 = 80$, $r_2 = 60$, $r_3 = 30$, $\lambda_1 = 0.35$, $\lambda_2 = 0.3$, $h = 0.7$ and $c = 5$.

Heuristic Control Policy
It is well established that dynamic programming suffers from the curse of dimensionality. For our model in particular, the optimal policy is computationally intractable for systems with more than 2 types of packets (i.e., a state space with dimension greater than 3). Hence, it is too resource intensive to run on resource limited sensor devices. This motivated us to propose an efficient heuristic control policy that imitates the behavior of the optimal policy and is computationally much faster to obtain for the general case of $n$ types of packets (see Algorithm 1). As such, we define the state of the system by the $(n + 1)$-dimensional vector $x = (x_1, x_2, \ldots, x_n, x_{n+1})$, where type-$i$ packets take priority over type-$j$ packets for $i < j$. The number of packets in queue 1 is $x_1 + x_2 + \cdots + x_n$, where $x_i$ is the number of type-$i$ packets, while $x_{n+1}$ is the number of packets in queue 2. The heuristic control policy is characterized by $2n - 1$ parameters: parameters $A_i$, $i = 2, \ldots, n$, control the admission to queue 1, while parameters $R_i$, $i = 1, \ldots, n$, control the routing to queue 2. Note that type-1 packets are always admitted to queue 1 (i.e., $A_1 = \infty$). Furthermore, we have $A_2 \geq A_3 \geq \ldots \geq A_n$ and $R_1 \geq R_2 \geq \ldots \geq R_n \geq A_2$. We extend the cost and reward parameters as follows: $c$ is the admission cost; $h$ is the holding cost; $c_i$ is the switching cost for type-$i$ packets ($c_1 \geq c_2 \geq \ldots \geq c_n$); $r_i$, $i \in \{1, 2, \ldots, n\}$, is the reward of a type-$i$ packet served at queue 1 ($r_1 \geq r_2 \geq \ldots \geq r_n$); and $r_{n+1}$ ($< r_n$) is the reward of packets served at queue 2.
Upon the arrival of a type-$i$ packet, we use the following control policy, where $I(x)$ denotes the largest packet type $k \in \{1, \ldots, n\}$ such that $x_k > 0$:

Algorithm 1. Proposed heuristic control policy.
if $\sum_{k=1}^{n} x_k < A_i$ then
    admit type-$i$ packet to queue 1
else if $I(x) \geq i$ and $\sum_{k=1}^{n} x_k \geq R_{I(x)}$ then
    admit type-$i$ packet to queue 1 and route type-$I(x)$ packet to queue 2
else
    do not admit type-$i$ packet
end if
To test the performance of the proposed heuristic, we compare the reward generated by the heuristic to that of the optimal policy. We use the average reward criterion for this purpose. The average reward under the optimal policy is obtained from the average-reward optimality equation, Equation (4), where $g^*$ is the optimal average reward per transition (see [58]), $w^*(x)$ is the optimal differential reward, and $w(x)$, $T_s$ and $T_r$ are as defined in Section 3.2. The average reward under the heuristic control policy (H) is defined by the corresponding dynamic programming equation, Equation (5), where $g^H$ is the average reward per transition under the heuristic control policy, $w^H(x)$ is the differential reward under the heuristic policy, and $T^H_{a_i}$, $T^H_1$ and $T^H_2$ are the operators induced by the heuristic policy.
In the following, we compare the performance of the proposed heuristic control policy to the optimal control policy for the case of two types of packets. We examine the impact of a given system variable by varying its value while keeping all other variables constant. Similar to [59], we use as performance metric the reward Relative Deviation (RD) of the heuristic from the optimal. The RD, expressed as a percentage, is defined as $\mathrm{RD} = 100 \times (\psi^* - \psi^H)/\psi^*$, where $\psi^*$ denotes the average reward rate of the optimal control policy obtained by solving Equation (4), and $\psi^H$ denotes the average reward rate associated with the heuristic control policy obtained by solving Equation (5).
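As an illustration of Algorithm 1, the following minimal sketch evaluates the heuristic decision for one arriving packet. It reads the two conditions of the algorithm as an if / else-if / else chain, which is an interpretation. The function name `heuristic_action`, the dictionary encoding of the thresholds and the numerical threshold values are hypothetical and chosen only for this example.

```python
import math


def heuristic_action(x, i, A, R):
    """Decide what to do with an arriving type-i packet (illustrative sketch).

    x: state (x_1, ..., x_n, x_{n+1}); the last entry is the queue 2 length.
    i: type of the arriving packet, 1-based (1 is the highest priority).
    A, R: dicts mapping a packet type to its admission / routing threshold.
    Returns ("admit", None), ("admit_and_route", k) or ("reject", None).
    """
    n = len(x) - 1
    in_queue1 = sum(x[:n])              # packets currently in queue 1
    if in_queue1 < A[i]:
        return ("admit", None)
    present = [k for k in range(1, n + 1) if x[k - 1] > 0]
    if present:
        lowest = max(present)           # I(x): largest type present in queue 1
        if lowest >= i and in_queue1 >= R[lowest]:
            return ("admit_and_route", lowest)
    return ("reject", None)


# Hypothetical thresholds for n = 2, satisfying R_1 >= R_2 >= A_2.
A = {1: math.inf, 2: 3}
R = {1: 6, 2: 4}
for state in [(1, 1, 0), (2, 1, 0), (2, 2, 0), (5, 0, 0)]:
    print(state, heuristic_action(state, 2, A, R))
```

Because the decision only compares the queue 1 occupancy with a handful of stored thresholds, the on-device work per arrival is a few comparisons, which is the property the heuristic is designed around.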
Here, as for the optimal policy, the expected reward is obtained using the value iteration algorithm with the same accuracy of $10^{-5}$. Table 1 shows a sample of 100 randomly generated system parameter values used to compute the performance of the heuristic control policy. Based on these results, it is clear that the heuristic performs very well compared to the optimal policy. At a 95% confidence level, the average RD is 1.40% ± 0.02%, with a range of [0%, 4.29%].
Table 1. Heuristic performance compared to the optimal policy. RD is used as a measure of the heuristic performance compared to the optimal policy.
For a system with more than two types of packets, computing the optimal policy becomes intractable due to the exponential explosion of the number of states. However, computing the thresholds of the heuristic is orders of magnitude faster than computing the optimal policy. In fact, much of the computation required for the heuristic (i.e., the computation of its parameters) can be done off-line, and the real-time computation requires no more than a table lookup. The cost of computing the heuristic's parameters is, at worst, approximately the number of parameters times the square of the cardinality of the state space. Numerical results confirm this speed advantage: at a 95% confidence level, the average computation time of the heuristic relative to that of the optimal policy is 0.045% ± 0.005%, with a range of [$10^{-4}$, $9 \times 10^{-4}$] over a sample of 100 cases.
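The RD metric and the reported confidence intervals can be computed along the following lines. This is a minimal sketch: the function names and the sample values are hypothetical, and the half-width uses the normal approximation (factor 1.96), which is an assumption since the text does not state how the interval was obtained.

```python
import math
import statistics


def relative_deviation(psi_opt, psi_heur):
    """RD in percent: 100 * (psi* - psi^H) / psi*."""
    return 100.0 * (psi_opt - psi_heur) / psi_opt


def mean_with_ci95(values):
    """Sample mean and half-width of a normal-approximation 95% interval."""
    mean = statistics.fmean(values)
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width


# Hypothetical (optimal, heuristic) average rewards for three test cases.
cases = [(50.0, 49.3), (42.0, 41.5), (61.0, 60.8)]
rds = [relative_deviation(opt, heur) for opt, heur in cases]
mean_rd, half_width = mean_with_ci95(rds)
print(f"average RD = {mean_rd:.2f}% +/- {half_width:.2f}%")
```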
Given that the computation time of the optimal policy scales exponentially in the dimension of the state space, computing the optimal policy beyond two priority classes is intractable. For instance, consider a system with three types of packets. Even if we succeed in computing the optimal policy and the associated parameters, it will require a huge state-dependent look-up table (five hyper-surfaces representing the state-dependent thresholds). For the heuristic, however, we need to store only five static control parameters (i.e., two admission and three routing threshold parameters).
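The contrast between the two storage requirements can be made concrete with a short sketch. It assumes, purely for illustration, that every coordinate of the state is truncated at a buffer size `cap`; the truncation level and the function names are hypothetical. The parameter count follows the three-type example above (two admission and three routing thresholds).

```python
def state_space_size(n_types, cap):
    """Number of states when each of the n + 1 coordinates ranges over 0..cap."""
    return (cap + 1) ** (n_types + 1)


def heuristic_parameter_count(n_types):
    """(n - 1) admission thresholds plus n routing thresholds."""
    return (n_types - 1) + n_types


for n in (2, 3, 4):
    print(n, state_space_size(n, 50), heuristic_parameter_count(n))
```

With a hypothetical buffer size of 50, each additional packet type multiplies the number of states by 51, while the heuristic gains only two more thresholds.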
Finally, in practice, traffic flow (i.e., the arrival rate $\lambda$) changes over time. Our model, however, assumes a constant traffic flow ($\lambda_i$ for flow type $i$). This is by no means a limitation of our model. In fact, this issue can be approached in one of two ways: either using a transient analysis, which is well documented as being intractable, especially in the context of an MDP framework; or computing different policy parameters off-line for each traffic flow. These parameters would then be used for the particular traffic flow in effect during deployment.

Conclusions
In this paper, we considered an admission and routing control problem to address the issue of resource limitation in resource constrained networks (such as WSNs). We formulated the admission and routing control problem of two types of flows, where one has a higher priority than the other, as a Markov decision problem. We characterized the optimal policy and showed that it is a state-dependent threshold type policy. Furthermore, we conducted extensive numerical experiments to gain more insight into the behavior of the optimal policy under different system parameters. Because the computational challenges of the optimal policy (curse of dimensionality) make it intractable and too resource intensive to run on wireless devices, we proposed a heuristic that mimics the optimal control policy. Through extensive numerical results, we showed that the heuristic performs very well. It is also orders of magnitude faster than the optimal policy. Much of the required computation can be done off-line, and the real-time computation requires no more than a table lookup. We further generalized the heuristic for the case of a system with $n$ types of flows ($n \geq 2$).
The results presented in this work provide a first step towards a better understanding of the structure of the optimal policy. There are several avenues for future research. In particular, it would be of interest to generalize the system to multi-server queues with more than two paths leading to the same destination. We expect the problem to become considerably more difficult with each additional feature and it is not clear if the optimal policy would be tractable. A clear extension of this work is to implement and test the proposed admission and routing control policy in real resource constrained network devices.
Funding: This research received no external funding.

Conflicts of Interest:
The author declares no conflict of interest.

Appendix A
In order to prove Theorem 1 we first state the following Lemma.
Lemma A1. Let V be the set of functions, v, defined on R 3 such that v satisfies Properties 1-3. Then w n ∈ V, ∀n ∈ Z + , where Z + is the set of non-negative integers.
Proof. First, note that by adding Properties 2 and 3 we obtain Property 1. Therefore, it suffices to prove Properties 2 and 3. To prove these properties, we use induction on the remaining number of periods. The following is a sketch of the proof.

• Step 1: we observe that Properties 2 and 3 hold for n = 0.
• Step 2: we assume that Properties 2 and 3 hold for n.
• Step 3: we prove that Properties 2 and 3 hold for n + 1.
We introduce the following notation: $\Delta_i w_n(x) = w_n(x + e_i) - w_n(x)$. Hence, we can write
$$\max\{w_n(x + e_i), w_n(x)\} = w_n(x) + \max\{w_n(x + e_i) - w_n(x), 0\} = w_n(x) + \max\{\Delta_i w_n(x), 0\}.$$
We show that $T_s$ and $T_r$ satisfy Properties 2 and 3 and thus satisfy Property 1.

Operator T s
Note that operators T a 1 , T 1 and T 2 satisfy Properties 2 and 3 by induction as they do not involve any decision. Hence, we only need to show that T a 2 satisfies Properties 2 and 3. Using the difference operator, we have: To show that T a 2 satisfies Property 2, let Using submodularity property, we infer that : ∆ j w n−1 (x + e i ) ≥ ∆ j w n−1 (x + e j ) ≥ ∆ j w n−1 (x + e i + e j ) and ∆ j w n−1 (x + e i ) ≥ ∆ j w n−1 (x + 2e i ) ≥ ∆ j w n−1 (x + e i + e j ). We prove the property for each case.
2. case 2: assume ∆ j w n−1 (x + e j ) − c ≥ 0 ≥ ∆ j w n−1 (x + e i + e j ) − c and ∆ j w n−1 (x + 2e i ) − c ≥ 0 ≥ ∆ j w n−1 (x + e i + e j ) − c Q(x) = ∆ i w n−1 (x + e i + e j ) − ∆ i w n−1 (x + 2e j ) ≤ 0 by the inductive hypothesis ≤ 0 by the inductive hypothesis ≤ 0 by the inductive hypothesis + ∆ j w n−1 (x + e i + e j ) − c − ∆ j w n−1 (x + 2e i ) + c ≤ 0 by the assumption ≤ 0 5. case 5: assume ∆ j w n−1 (x + e i ) − c ≥ 0 ≥ ∆ j w n−1 (x + e j ) − c≥ ∆ j w n−1 (x + e i + e j ) − c and ≤ 0 by the inductive hypothesis Since the set V is closed under addition and multiplication by a scalar, it follows that T s satisfies Property 2.
To show that T a 2 satisfies Property 3, let In the following we prove the property for each case.
≤0 by the assumption

by the inductive hypothesis
Since the set V is closed under addition and multiplication by a scalar, it follows that T s satisfies Property 3.
To show that T r satisfies Property 2, we need to prove that For simplicity we prove the property for i = 1 and j = 2. The rest of the properties follow the same procedure. 1.
We prove the property for each case.
x 2 > 0 T r w n−1 (x) = max{w n−1 (x − e j + e 3 ), w n−1 (x)} = w n−1 (x) + max{∆ −23 w n−1 (x), 0} Note that T r (x) given in Equation (A2) is a special case of T r (x) expression given in Equation (A1) where c 2 = 0 and e i is substituted by e j , therefore the proof of Property 2 follows.
In order to prove that T r satisfies Property 3, we need to prove that T r w n−1 (x + e i + e j ) − T r w n−1 (x + e j ) − T r w n−1 (x + e i ) + T r w n−1 (x) ≤ 0.
For simplicity we prove the property for i = 1 and j = 2. The rest of the properties follow the same procedure. 1.