Estimation of the Optimal Threshold Policy in a Queue with Heterogeneous Servers Using a Heuristic Solution and Artificial Neural Networks

This paper deals with heterogeneous queues where servers differ not only in service rates but also in operating costs. The classical optimisation problem in queueing systems with heterogeneous servers consists in the optimal allocation of customers between the servers with the aim to minimise the long-run average costs of the system per unit of time. As it is known, under some assumptions the optimal allocation policy for this system is of threshold type, i.e., the policy depends on the queue length and the state of faster servers. The optimal thresholds can be calculated using a Markov decision process by implementing the policy-iteration algorithm. This algorithm may have certain limitations on obtaining a result for the entire range of system parameter values. However, the available data sets for evaluated optimal threshold levels and values of system parameters can be used to provide estimations for optimal thresholds through artificial neural networks. The obtained results are accompanied by a simple heuristic solution. Numerical examples illustrate the quality of estimations.


Introduction
Many queueing systems are analysed for their dynamic and optimal control related to system access, resource allocation, changing service area characteristics and so on. Sets of computerised tools and procedures provide large data sets which can be useful to expand potential of classical optimisation methods. The paper deals with a known model of a multiserver queueing system with controllable allocation of customers between heterogeneous servers which are differentiated by their service and cost attributes. For the queueing system with two heterogeneous servers it has been shown in [1] by using a dynamic programming approach that to minimise the mean sojourn time of customers in the system, the faster server must be always used and the customer has to be assigned to the slower server if and only if the number of customers in the queue exceeds the certain threshold level. Furthermore, this result was obtained independently in more simple form in [2,3]. In [4], the author has analysed a multi-server version of such a system and confirmed a threshold nature of the optimal policy as well.
The problem of an optimal allocation of customers between heterogeneous servers in queueing systems with additional costs with the aim to minimise the long-run average cost per unit of time is notoriously more difficult. Some progress has been made after the appearance of a review paper [5]. In [6,7], the authors studied a model with set-up costs using a hysteretic control rule, thereby stressing the algorithmic aspects of the optimal control structure. The same system has been discussed in [8], where a direct method that provides a closed-form expression for the stationary occupancy distribution was proposed. In [9,10], the authors have used theoretical study and exhaustive numerical analysis to show that for some specified servers, ordering the optimal allocation policy which minimises the long-run average cost belongs to a set of structural policies. In other words, for the servers' enumeration (1), the allocation control policy denoted by f can be defined through a sequence of threshold levels 1 = q 1 ≤ q 2 ≤ · · · ≤ q K < ∞. With respect to the defined policy, the server operating at highest rate should remain busy by non-empty queueing system. The kth server (k ≥ 2) is used only if the first k − 1 servers are busy and the queue length reaches a threshold level q k > 0. In the general case, the optimal threshold levels can depend on states of slower server and formally the optimal policy f is not of a pure threshold type. However, since the kth threshold value may vary by at most one when the state of slower server changes and it has a weak effect on the average cost, such influence can be neglected. Hence the optimal allocation policy for multi-server heterogeneous queueing system can be treated as a threshold one.
Searching for the optimal values of q 2 , . . . , q K by direct minimising the average cost function can be expensive, especially when K is large. To calculate the optimal threshold levels we can use a policy-iteration algorithm [11][12][13] which constructs a sequence of improved policies that converges to optimal one. This algorithm is a fairly versatile tool for solving various optimisation problems. Unfortunately, as is usually the case in practice, this algorithm is not without some limitations, such as the difficulties associated with convergence when the traffic is close to loaded, limitation on the process dimension and, consequently, on the number of states. Thus, we would like to compensate for some of the weaknesses of this algorithm with other methods for calculating the optimal control policy. The contribution of this paper can be briefly described in two conceptual parts. In the first part, we propose a heuristic solution (HS) to obtain functional relationships for optimal thresholds based on a simple discrete approximation of the system's behaviour. The second part is devoted to the alternative machine learning technique such as artificial neural networks (NN) [14][15][16] which is used again for the estimation of the optimal threshold levels. The policy-iteration algorithm is used in the paper to generate the data sets needed both to verify the quality of the proposed optimal threshold estimation methods and to train the neural networks. We strongly believe that the trained neural network can be successfully used to calculate the optimal thresholds for those system parameters for which alternative numerical methods are difficult or impossible to use, for example, in heavy traffic case, or, in general, to reconstruct the areas of optimality without usage of timeexpensive algorithms and procedures. There are some number of papers on prediction of the stochastic behaviour of queueing systems and networks using machine learning algorithms, see e.g., [17,18] and references therein. However, we unsuccessfully tried to find published works where heuristics and machine learning methods would be used to solve a similar optimisation problem for heterogeneous queueing systems and therefore we consider this paper relevant. This paper is organised as follows. In Section 2, we briefly discuss a mathematical model. Section 3 introduces some heuristic choices for threshold levels that turn out to be nearly optimal. Section 4 presents results when the trained neural network was ran on verification data of the policy-iteration algorithm.

Mathematical Model
We summarise briefly the model under study. The queueing system is of the type M/M/K with infinite-capacity buffer and K heterogeneous servers. This system is shown schematically in Figure 1. The Poisson arrival stream has a rate λ and the exponential distribution of the service time at server j has a rate µ j . We assume that the service in the system is without preemption, when customer in service cannot change the server. The random variables of the inter-arrival times and the service times of the servers are assumed to be independent. An additional cost structure is introduced, consisting of the operating cost c j > 0 per unit of time of service on server j and the holding cost c 0 > 0 of waiting in the queue. Assume that the servers are enumerated in a way where c j µ −1 j stands for the mean operating cost per customer for the jth server. The controller has full information about the system's state and, based on this information, can make control actions on the system at the decision epochs when certain state transitions occur, following the prescription of the policy f . In our case, the controller selects the control action at the time when a new customer enters the system and at the service completion times, if the queue is not empty. When a new customer arrives, it joins the queue and at the same time, the controller sends another customer from the head of the queue to one of the idle servers or leaves it in the queue. At the service completion, the customer leaves the corresponding server, and at the same time the controller takes the next customer from the head of the queue, if it is not empty, and dispatches it to one of idle servers or can leave it in the queue as well. The service completion in the system without waiting customers does not require the controller to perform any control action.
The fact that the optimal policy for the problem of minimising the long-run average cost per unit of time belongs to a set of threshold-based policies for the multi-server heterogeneous queueing systems with costs were proved first in [10] and further conformed for systems with heterogeneous groups of servers in [19]. The corresponding optimal thresholds can in the general case depend on the states of slower servers. However, according to obtained numerical results in [9], we can neglect the weak influence of the slower servers' states on the optimal allocation policy for the faster servers. This phenomena was discussed additionally in Example 2. Therefore, we may assume that the optimal policy belongs to the class of a pure threshold policy when the use of a certain server depends solely on the number of waiting customers in the queue. Specifically, for the system under study, such a policy is defined by the following sequence of threshold levels: The policy prescribes the use of the k fastest servers whenever the number of customers waiting in the queue satisfies the condition q k ≤ q ≤ q k+1 − 1.
To calculate optimal thresholds we need to formulate the introduced optimisation problem in terms of a Markov decision process. This process is based on a K + 1-dimensional continuous-time Markov chain with an infinitesimal matrix Λ f which depends on the policy f . Here the component Q(t) ∈ N 0 stands for the number of waiting customers at time t and D j (t) = 0 if jth server is idle 1 if jth server is busy .
The state space of the process {X(t)} t≥0 operating under some policy f is where the notations q(x) and d j (x) are used respectively for the queue length and for the state of jth server in state x ∈ E f . The possible server states are partitioned as follows: The sets J 0 (x) and J 1 (x) denote the sets of idle and busy servers in state x ∈ E f , respectively. The set of control actions a is A = {0, 1, . . . , K}. If a = 0, the controller allocates a customer to the queue. Otherwise, if a = 0, the controller instructs a customer to occupy the server with a number a. In addition, we can define the subsets A(x) = J 0 (x) ∪ {0} ⊆ A of admissible actions in state x The policy f specifies the choice of a control action at any decision epoch and the infinitesimal matrix Λ f = [λ xy (a)] of the Markov-chain (3) has then the following elements, where e j is defined as K + 1-dimensional unit vector with each element equal to zero except the jth position (j = 0, 1, . . . , K).
We will search for the optimal control policy among the set of stationary Markov policies f that guarantee ergodicity of the Markov chain {X(t)} t≥0 . The corresponding stability condition is obviously defined as λ < ∑ K j=1 µ j . It follows from the fact, that if number of customers exceeds a threshold q K , then the queueing systems behaves like a M/M/1 queue with an arrival rate λ and total service rate µ 1 + · · · + µ K . As it is known, see e.g., [13], the ergodic Markov chain with costs implies the equality of the long-run average cost per unit of time for the policy f and the corresponding assemble average, that can be written in the form where c(y) = c 0 q(y) This function describes the total average cost up to time t given the initial state is x and π f y = P f [X(t) = y] is a stationary state distribution for the policy f . The policy f * is said to be optimal when for g f defined in (4) we evaluate To evaluate optimal threshold levels and optimised value for the mean average cost per unit of time the policy-iteration Algorithm 1 is used. This algorithm constructs a sequence of improved policies until the average cost optimal is reached. It consists of three main steps: value evaluation, policy improvement and threshold evaluation. The Value evaluation is based on solving, for a given policy f , a system of linear equations Algorithm 1 Policy-iteration algorithm Initial policy 3: Value evaluation 5: end for 8: 10: else n ← n + 1, go to step 4 11: end if 12: Threshold evaluation For the dynamic-programming value function v f : E f → R, which indicates a transition effect of an initial state x to the total average cost and satisfies the following asymptotic relation, In order to make the system (6) solvable, one of the values v(x) must be set to zero, e.g., for x 0 = (0, . . . , 0) we set v(x 0 ) = 0. Since in our case c(x 0 ) = 0, the first equation of the system (6) is of the form g f = ∑ y =x 0 λ x 0 y (a)v f (y). In the policy improvement step a new policy f is calculated by minimising the value function v(x + e a ) for any state x ∈ E f and any admissible control action a ∈ A(x). The algorithm converges if the policies f and f on neighbouring iterations are equal. In the threshold evaluation we calculate the optimal thresholds q k , k = 2, . . . , K, based on optimal policy f . As an initial policy we select the policy which prescribes in any state the usage of a server j with the minimal value of the mean operating cost c j µ j per customer. More detailed information on deriving the dynamic programming equations for the heterogeneous queueing system and calculating the corresponding optimal allocation control policy can be found in [9]. For existence of an optimal stationary policy and convergence of the policy-iteration algorithm we refer to [12,[20][21][22].
To realise the policy-iteration algorithm we convert the K + 1-dimensional state space E f of the Markov decision process to a one-dimensional equivalent state space. Let ∆ : A new state after transition involving the addition or removal of customer in some state x ∈ E f , in a one-dimensional state space is calculated by Further in the algorithm, an infinite buffer system must be approximated by an equivalent system where the number of waiting places is finite but at the same time is sufficiently large. As a truncation criterion, we use the loss probability which should not exceed some small value ε > 0.

Remark 1. If the buffer size is W, the number of states is
In case the number of waiting customers is getting larger as the level q K , all servers must be occupied and the system dynamics is the same as in a classical queue M/M/1 with arrival rate λ and service rate ∑ K j=1 µ j . The stationary state probabilities for the states x where the component q(x) ≥ q K satisfy the following difference equation which has a solution in a geometric form, π (q,1,...,1) = π (q K ,1,...,1) ρ q−q K , q ≥ q K . For details and theoretical substantiation see, e.g., [23]. Note that the value of q K included in this formula can be estimated by a heuristic solution (9). Then the truncation parameter W of the buffer size can be evaluated from the following constraint for the loss probability 2734. Here q 5 = 12 was calculated by (9). In a control table, we summarise the functions f (x) which specify the control actions at time of arrivals to a certain state x: ) for q = 0, . . . , W − 1. In this example the optimal policy f * is defined here through a sequence of threshold levels (q 2 , q 3 , q 4 , q 5 ) = (3,4,5,12) and g * = 4.92897. The bold and underline format in a control table is used to label the change of the control action in a certain system state.
In the next example we give some arguments that allow us to work further only with the threshold-based control policies. Example 2. Consider the system M/M/3 with K = 3 servers. The aim of this example consists in the following: With respect to the system states x = (q, 1, 0, 0) and y = (q, 1, 0, 1) the assignment to the second server can in general depend not only on the number of customers in the queue but also on the state of the third server. In this example it is optimal to make an assignment in state x but not in state y. We solve optimisation problem for the following parameters: • λ = 0.238 , µ 1 = 0.621 , µ 2 = 0.071 and µ 3 = 0.070, • λ = 0.477 , µ 1 = 0.356 , µ 2 = 0.096 and µ 3 = 0.070.
The of optimal solution for the first and second group of system parameters are represented in Tables 1 and 2, respectively.  3  3  3  3  3  3  3  3  3  3  3 3 (0,0,1) We notice that for most parameter values the optimal decision can be made independently of the states of the slower servers. However, it is interesting to consider the reasons for such possible dependence. It is evident that in our optimisation problem, the optimal policy assigns a customer to the fastest free server in states for which this would not be optimal if there were no arrivals. This is because the system should be ready for possible arrivals, which, if they occur, will wish to see a less congested system.  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3 3 (0,0,1) Consider now the system with three servers in the states x + e 0 + e 1 and x + e 1 + e 2 , where x = (0, 0, 0, 0). Let us consider the case of potential service completion at the second server, taking into account a large number q of accompanied arrivals. Because of large q, it is optimal to occupy all accessible idle servers. The states mentioned above become x + (q − 1)e 0 + e 1 + e 2 + e 3 and x + (q − 2)e 0 + e 1 + e 2 + e 3 . Thus, the difference v(x + (q − 1)e 0 + e 1 + e 2 + e 3 ) − v(x + (q − 2)e 0 + e 1 + e 2 + e 3 ) of value functions measures the advantage that will be obtained in the case of the assignment to the second processor x + e 0 + e 1 → x + e 1 + e 2 . The events of service completion on the second server provide the incentive to make an assignment to the second server. However, if the two initial states are x + e 0 + e 1 + e 3 and x + e 1 + e 2 + e 3 , the measure of advantage if service completion takes place is v(x + qe 0 + e 1 + e 2 + e 3 ) − v((q − 1)e 0 + e 1 + e 2 + e 3 ). Since we expect that the value function v(qe 0 + e 1 + e 2 + e 3 ) is convex in q, it is plausible that the incentive to make an assignment to the second server is greater in state x + e 0 + e 1 + e 3 than in x + e 0 + e 1 . Numerical examples proposed in Table 3 confirm our expectations. The further numerical examples show that the threshold levels have a very weak dependence of slower servers' states. According to our observations, the optimal threshold may vary by at most 1 when the state of a slower server changes.
The data needed either to verify the heuristic solution or for training and verification of the neural network was generated by a policy-iteration algorithm in form of the list S = (λ, µ 1 , . . . , µ K , c 0 , c 1 , . . . , c K ) → (q 2 , . . . , q K ) :

Heuristic Solution
In this section, we want to obtain a heuristic solution (HS) to calculate the optimal thresholds q k , k = 2, . . . , K for the arbitrary K in explicit form. For this purpose, we will use a simple deterministic approximation for the dynamic behaviour of the number of customers in the queue as illustrated in Figure 2. The further numerical examples show that the threshold levels have a very weak dependence of slower servers' states. According to our observations, the optimal threshold may vary by at most 1 when the state of a slower server changes.
The data needed either to verify the heuristic solution or for training and verification of the neural network was generated by a policy-iteration algorithm in form of the list S = (λ, µ 1 , . . . , µ K , c 0 , c 1 , . . . , c K ) → (q 2 , . . . , q K ) :

Heuristic Solution
In this section, we want to obtain a heuristic solution (HS) to calculate the optimal thresholds q k , k = 2, . . . , K for the arbitrary K in explicit form. For this purpose, we will use a simple deterministic approximation for the dynamic behaviour of the number of customers in the queue as illustrated in Figure 2. Let q k is an optimal threshold used to dispatch the customer to server k in state (q k − 1, 1, . . . , 1 , where the kth server is occupied by a waiting customer. It is assumed that the stability condition holds. The initial queue lengths are labelled in Figure 2 by A = q k and B = q k − 1. The proposed deterministic approximation is based on an assumption that the queue length of the system with the first k − 1 busy servers decreases with the rate ∑ k−1 j=1 µ j − λ. When this rate is keeping until the queue is empty, it occurs at time points D = q k ∑ k−1 j=1 µ j −λ and C = q k −1 ∑ k−1 j=1 µ j −λ respectively for the given initial queue length A and B. The total (accumulated) holding Let q k is an optimal threshold used to dispatch the customer to server k in state (q k − 1, 1, . . . , 1 , where the kth server is occupied by a waiting customer. It is assumed that the stability condition holds. The initial queue lengths are labelled in Figure 2 by A = q k and B = q k − 1. The proposed deterministic approximation is based on an assumption that the queue length of the system with the first k − 1 busy servers decreases with the rate ∑ k−1 j=1 µ j − λ. When this rate is keeping until the queue is empty, it occurs at time points D = q k ∑ k−1 j=1 µ j −λ and C = q k −1 ∑ k−1 j=1 µ j −λ respectively for the given initial queue length A and B. The total (accumulated) holding times of all customers in the queue with lengths q k and q k − 1 are equal respectively to the number of square blocks of dimension 1 × 1 ∑ k−1 j=1 µ j −λ within the areas AOD and BOC multiplied by the mean service time of the approximated model: The mean operating cost of the first k − 1 servers during the time period until the queue becomes empty given the initial state is x 0 can be calculated by The expression means the probability of the service completion at the ith server, and the mean operating cost given the initial state is y 0 , which can be defined as Now using the deterministic approximation we can formulate the following proposition.

Proposition 1.
The optimal thresholds q k , k = 2, . . . , K, are defined by Proof. Let V(x) be the overall average system cost until the system becomes empty given the initial state is x ∈ E f . This value can be represented as a sum of the total holding cost of customers waiting in the queue and mean operating cost of all servers which remain busy in state x. Assume that the controller performs a decision to allocate the customer to the kth server in state (q k − 1, 1, . . . , 1 ). As a result, it leads to a reduction of the overall system costs according to the proposed deterministic approximation, i.e., where ).
After substitution of (11) into (10) we get Now, expressing q k after some simple manipulations we obtain the heuristic solution for the optimal value of q k in form (9). Example 4. Consider a queueing system from the previous example for K = 5. We select randomly from the data set S (8) a list of system parameters α = (λ, µ 1 , . . . , µ K , c 0 , c 1 , . . . , c K ) and calculate by means of the HS (9) threshold levels q k , k = 1, . . . , K. Figure 3 illustrates the efficiency of the proposed heuristic solution respectively for threshold levels (q 2 , q 3 , q 4 , q 5 ) by confusion matrices. The matrix row represents the elements including a predicted value while each column represents the elements for an actual value. As a metric for the closeness of the measurements to a specific value and to the interval with possible deviation of threshold by ±1 from the real value, the overall accuracy and accuracy ±1 are used. The results are summarised in Table 4.

Artificial Neural Networks
Artificial Neural Networks (NN) belong to a set of supervised machine learning methods. It is most popular in different applied problems including data classification, pattern recognition, regression, clustering and time series forecasting. Here we show that the NN can give even more positive results compared to the HS that indicates the possibility to use it for predicting the structural control policies.
The data set S (8) is used to explore predictions for the optimal threshold levels through the NN. The multilayer neural network is used for the data classification. It can be formally defined as a function f : α → y, which maps an input vector α of dimension 2m + 1 to an estimate output y ∈ R N c of the class number N = 1, . . . , N c . The network is decomposed into 6 layers as illustrated in Figure 4, each of which represents a different function mapping vectors to vectors. The successive layers are: a linear layer with an output vector of size k, a nonlinear elementwise activation layer, other three linear layers with output vectors of size k and a nonlinear normalisation layer. The first layer is an affine transformation where q 1 = R 2m+1 is the output vector, W ∈ R 2m+1×k=30 is the weight matrix, b 1 ∈ R 2m+1 is the bias vector. The rows in W 1 are interpreted as features that are relevant for differentiating between corresponding classes. Consequently, W 1 α is a projection of the input α onto these features. The second layer is an elementwise activation layer which is defined by the nonlinear function q 2 = max(0, q 1 ) setting negative entries of q 1 to zero and uses only positive entries. The next three layers are other affine transformations, where q i ∈ R k , W i ∈ R k×k , and b i ∈ R k , i = 3, 4, 5. The last layer is the normalisation layer y = softmax( q 5 ), whose componentwise is of the form The last layer normalises the output vector y with the aim to get the values between 0 and 1. The output y can be treated as a probability distribution vector, where the Nth element y N represents the likelihood that α belongs to class N.
We use 70% of the same data S which was not used to verify the quality of the HS in a training phase of the NN and the rest of S-as validation data. We train a multilayer (6-layer) NN using an adaptive moment estimation method [24] and the neural network toolbox in Mathematica© of the Wolfram Research. Then we verify the approximated functionq k :=q k (λ, µ 1 , . . . , µ K , c 0 , c 1 , . . . , c K ), which should be accurate enough to be used to predict new output from verification data. The algorithm was ran many times on samples and networks with different sizes. In all cases the results were quite positive and indicate the potential of machine learning methodology for optimisation problems in the queueing theory.

Example 5.
The results of estimations of the optimal threshold values using the trained NN are summarised again in form of confusion matrices, as is shown in Figure 5. The overall accuracy of classification and accuracies for the values with deviations are given in Table 5. We can see that the NN methodology exhibits even more accurate estimations for the optimal thresholds if the results are compared with the corresponding HS.

Conclusions
We combine classic methodology of analysing controllable queues with a heuristic solution and machine learning to study the possibility to estimate the values of optimal thresholds. Due to the fact that the results were quite positive, we can make the following general conclusion. With this study we confirm that the analysis of controlled queueing systems and the solution of optimisation problems using classical Markov decision theory can be successfully combined with machine learning techniques. These approaches do not contradict each other; on the contrary, combining them provides new results.