Learning-based Online QoE Optimization in Multi-Agent Video Streaming

Abstract: Video streaming has become a major usage scenario for the Internet. The growing popularity of new applications, such as 4K and 360-degree videos, mandates that network resources be carefully apportioned among different users in order to achieve optimal Quality of Experience (QoE) and fairness objectives. This results in a challenging online optimization problem, as networks grow increasingly complex and the relevant QoE objectives are often nonlinear functions. Recently, data-driven approaches, deep Reinforcement Learning (RL) in particular, have been successfully applied to network optimization problems by modeling them as Markov Decision Processes. However, existing RL algorithms involving multiple agents fail to address nonlinear objective functions over the different agents' rewards. To this end, we leverage MAPG-finite, a Policy Gradient algorithm designed for multi-agent learning problems with nonlinear objectives. It allows us to optimize the bandwidth distribution among multiple agents and to maximize QoE and fairness objectives on video streaming rewards. We implement the proposed algorithm and compare the MAPG-finite strategy with a number of baselines, including static, adaptive, and single-agent learning policies. The numerical results show that MAPG-finite significantly outperforms the baseline strategies with respect to different objective functions and in various settings, including both constant and adaptive bitrate videos. Specifically, our MAPG-finite algorithm improves QoE by 15.27% and fairness by 22.47% compared to the standard SARSA algorithm for a 2000 KB/s link.


Video streaming has become a major usage scenario for Internet users, accounting for over 60% of downstream traffic on the Internet [1]. The growing popularity of new applications and video formats, such as 4K and 360-degree videos, mandates that network resources be apportioned among different users in an optimal and fair manner, in order to deliver satisfactory Quality of Experience (QoE). Many factors impact the QoE of a video streaming service, for example, the peak signal-to-noise ratio (PSNR) of the received video [2] or the structural similarity of the image (SSIM) [3]. In particular, the stall time during streaming is a critical performance objective, especially for services that require low response time and rely heavily on customer experience, e.g., online video streaming and autonomous vehicle networks [4]. Further, the streaming device impacts the bitrate and in turn affects the QoE (see [5] and the citations within).

Online optimization of stall time and QoE in a dynamic network environment is a very challenging problem, which can be analyzed as an optimization problem [6] or a learning problem [7]. Traditional optimization-based approaches often rely on precise models to characterize the network system and the underlying optimization problems. For instance, the authors in [8] construct QoE-aware utility functions using a two-term power series model, while the authors in [9] leverage both the "bSoft probe" and the "Demokritos probe" to model the QoE measurement, by analyzing the weight factors and exponents of all video-streaming service parameters, as well as quantifying the "Decodable Frame Rate" of three different types of frames. However, these model-based approaches cannot solve online QoE optimization with incomplete or little knowledge about future system dynamics.

Recently, Reinforcement Learning (RL) has proven to be an effective strategy for solving many online network optimization problems that may not yield a straightforward analytical structure, such as wireless sensor network (WSN) routing [10], vehicle network spectrum sharing [11], data caching [7], and network service placement [12]. In particular, deep RL employs neural networks to estimate a decision-making policy, which self-improves based on collected experiential data to maximize the rewards. Compared with traditional model-based decision-making strategies, deep RL has a number of benefits: (i) it does not require a complete mathematical model or analytical formulation, which may not be available in many complex practical problems; (ii) the use of deep neural networks as function approximators makes RL algorithms easily extensible to problems with large state spaces; and (iii) it is capable of achieving fast convergence in online decision making and in dynamic environments that evolve over time.

The goal of this paper is to develop a new family of multi-agent reinforcement learning algorithms to apportion download bandwidth on the fly among different users and to optimize QoE and fairness objectives in video streaming. We note that existing RL algorithms often focus on maximizing the sum of future (discounted) rewards across all agents and fail to address inter-agent utility optimization, which aims at balancing the discounted reward received by each individual agent.
Such inter-agent utility optimizations are widely considered in video streaming problems, e.g., to optimize the fairness of network resource allocation and to maximize a nonlinear QoE function of individual agents' performance metrics. More precisely, in a dynamic setting, the problem being solved must be modeled as an MDP, where agents take actions based on some policy π and the observed system state, causing the system to transition to a new state. A reward r_k is obtained for each agent k. The transition probability to the new state depends only on the previous state and the action taken in that state. RL algorithms aim to find an optimal policy π that maximizes the sum of (discounted) rewards $\sum_{t=1}^{\infty} \gamma^t \sum_k r_k(t)$ over all users. However, when QoE and fairness objectives are concerned, a nonlinear function f, such as the fairness utility [13] and the sigmoid QoE function reported in [14], must be applied to the rewards received by different agents, resulting in the optimization of a new objective $f(\bar{r}_1, \bar{r}_2, \ldots)$, where $\bar{r}_k = \frac{1}{T}\sum_{t=1}^{T} \gamma^t r_k(t)$ is the average discounted reward of agent k over a finite time horizon T. It is easy to see that such nonlinear functions potentially violate the memoryless property of the MDP that RL requires, since the optimization objective now depends on all past rewards/states. In this paper, we develop a new family of multi-agent reinforcement learning algorithms to optimize such nonlinear objectives for QoE and fairness maximization in video streaming.

We propose Multi-agent Policy Gradient for Finite Time Horizon (MAPG-finite) to optimize nonlinear objective functions of the cumulative rewards of multiple agents over a finite time horizon. We employ MAPG-finite in online video streaming with the goal of maximizing QoE and fairness objectives by adjusting the download bandwidth distribution. To this end, we quantify the stall time in online video streaming with multiple agents, under a shared network link and dynamic video switching by the agents. At the end of the time horizon, a nonlinear function f(·) of the agents' individual cumulative rewards is calculated. The choice of f(·) can capture different notions of fairness, e.g., the well-known α-fairness utility [13], which incorporates proportional fairness and max-min fairness as special cases, and the sigmoid-like QoE function reported in [14], which indicates that users with mediocre waiting time tend to be more sensitive than the rest, and thus balances the performance received by different agents in online video streaming. We leverage the RL algorithm proposed in [15] to develop a model-free multi-agent reinforcement learning algorithm that copes with the inter-agent fairness reward for multiple users. In particular, this RL algorithm modifies the traditional Policy Gradient to find a proper ascent direction for the nonlinear objective function using random sampling. We prove that the proposed algorithm converges to at least a local optimum of the target optimization problem. The proposed multi-agent algorithm is model-free and shown to efficiently solve the QoE and fairness optimization in online video streaming.

To demonstrate the challenge associated with optimizing nonlinear objective functions, consider the example shown in Figure 1. Two users, A and B, share a download link for video streaming. Due to the bandwidth constraint, the link is only able to stream one high-definition (HD) video and one low-definition (LD) video at a time. In each time slot t, the two users' QoE, denoted by r_A(t) and r_B(t), is measured by a simple policy that multiplies the quality of the content (from 1 to 5 stars) by the resolution of the video (1 for LD and 2 for HD). The service provider is interested in optimizing a logarithmic utility of the users' aggregate QoE, i.e., u = ln(∑_t r_A(t)) + ln(∑_t r_B(t)) (which corresponds to the notion of maximum proportional fairness [13,16]), over the two time slots t ∈ {1, 2}. In Case 1, user A receives r_A(1) = 6 in time slot 1 and user B receives r_B(1) = 4. If we stream HD to user A and LD to user B in the next time slot, the total utility becomes ln(6 + 6) + ln(4 + 4) ≈ 4.56, while the opposite assignment achieves a higher utility of ln(6 + 3) + ln(4 + 8) ≈ 4.68. However, in Case 2, assigning HD to user A and LD to user B in time slot 2 gives the highest utility of ln(2 + 6) + ln(5 + 4) ≈ 4.28. Thus, the optimal decision in time slot 2 depends on the rewards received in all past time slots (for simplicity, this example only shows the rewards of time slot 1). In general, the dependence of the utility objective on all past rewards violates the Markovian property, as actions in an MDP should only depend on the currently observed system state. This mandates a new family of RL algorithms that can cope with nonlinear objective functions, which motivates this paper.
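For concreteness, the following short Python sketch reproduces the utility arithmetic of this example; the per-slot reward values are those stated above.

```python
import math

def proportional_fairness(rewards_a, rewards_b):
    """Logarithmic utility of the two users' aggregate QoE: ln(sum r_A) + ln(sum r_B)."""
    return math.log(sum(rewards_a)) + math.log(sum(rewards_b))

# Case 1: r_A(1) = 6, r_B(1) = 4 in time slot 1.
print(proportional_fairness([6, 6], [4, 4]))  # HD to A, LD to B in slot 2: ~4.56
print(proportional_fairness([6, 3], [4, 8]))  # LD to A, HD to B in slot 2: ~4.68 (better)

# Case 2: r_A(1) = 2, r_B(1) = 5 in time slot 1; now HD to A is the better choice.
print(proportional_fairness([2, 6], [5, 4]))  # ~4.28
```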
Another challenge is that the decision space for bandwidth assignment is continuous (or at least fine-grained if we sample the continuous decision space), while the action space suitable for RL algorithms should be small, as the output layer sizes of neural networks are limited. The action space for bandwidth assignment grows undesirably with both the number of users and the amount of network resources in the video streaming system. To overcome this challenge, we propose an adaptive bandwidth adjustment process. It leverages two separate reinforcement learning modules running in parallel, each tasked with selecting a target user whose current bandwidth is increased or decreased by one unit. This effectively reduces the action space of each RL module to exactly the number of users in the system, while both learning modules can be trained in parallel with the same set of data through backpropagation. This technique significantly reduces the action space of MAPG-finite and makes the optimization problem tractable.

To evaluate the proposed algorithm, we develop a modularized testbed for event-driven simulation of video streaming with multi-agent bandwidth assignment. In particular, a Bandwidth Assigner is developed to observe the agent states, obtain an optimized distribution from the activated Action Executor, and then adjust the bandwidth of each agent on the fly. We implement this distribution-generating solution along with the model-free multi-agent deep Policy Gradient algorithm, and compare its performance with static and dynamic baseline strategies, including "Even" (which guarantees balanced bandwidths for all users), "Adaptive" (which assigns more bandwidth to users consuming higher bitrates), and SARSA (a standard single-agent RL-driven policy that fails to consider inter-agent utility optimization). By simulating various network environments as well as both constant and adaptive bitrate policies, we validate that the proposed MAPG-finite outperforms all other tested algorithms. With Constant Bitrate (CBR) streaming, MAPG-finite improves the achieved QoE by up to 169.66% and the fairness by up to 8.28% compared with the baseline strategies. Further, with Adaptive Bitrate (ABR) streaming, up to 41.25% QoE improvement can be obtained.

The key contributions of this work are:

• We model the bandwidth assignment problem for optimizing QoE and fairness objectives in multi-user online video streaming. The stall time is quantified for general cases under system dynamics.

• Due to the nature of the inter-agent fairness problem, we propose a multi-agent learning algorithm that is proven to converge and leverages two reinforcement learning modules running in parallel to effectively reduce the action space size.

• The proposed algorithm is implemented and evaluated on our testbed, which is able to simulate various configurations, including different reward functions, network conditions, and user behavior settings.

• The numerical results show that MAPG-finite outperforms a number of baselines, including "Even", "Adaptive", and single-agent learning policies.

Prior work has also proposed a Q-learning-based cache replacement policy to jointly optimize the hit ratio and the stall time. Within an edge network environment, the placement of computations also affects streaming performance; the work in [12] therefore breaks the hierarchical service placement problem into sub-trees and solves it using Q-learning.

For video streaming services still using Constant Bitrate (CBR) systems, [43] proposed QUVE, which estimates the future network quality and controls video encoding accordingly. [44] considered maximizing QoE by optimizing the cache content in edge servers. This is different from our setup, where we consider caching chunks at client devices. Similar to us, [45] also provides a bandwidth allocation strategy to maximize QoE. However, they use model-predictive control, whereas we pose it as a learning problem and use reinforcement learning.
[46] consider a multi-user encoding strategy where the encoding scheme for each user varies depending on its network condition. However, they use a Markovian model and do not take possible future network conditions into account.
[47] consider a future-dependent adaptive strategy where they estimate the TCP throughput and the success probability of chunk downloads. Similar to us, [48] consider a reinforcement learning protocol to maximize QoE for multiple clients. However, they use the average client QoE at time t as the reward for time t and use deterministic policies learnt from Q-learning [23]. We show that our formulation outperforms standard Q-learning algorithms by considering stochastic policies and a reward that is a function of the clients' QoE.

System Model
We consider a streaming session to be divided into nonidentical logical time slots. In each time slot, all users keep requesting/playing chunks from the same videos. Once any user k ∈ [K] requests a new video in the current time slot l, the slot counter increments; thus the new slot l + 1 starts for all users in [K], even if the video does not change for the other users k′ ∈ [K] \ {k}.

Using the logical time slot setting described above, at time slot l ∈ [L] = {1, · · · , L}, user k ∈ [K] consumes a downloading rate d_k(l) ≥ 0 to fetch video v_k, which is coded with bitrate r_k(l). The downloading speeds are limited by ∑_{k∈[K]} d_k(l) = B, ∀ l ∈ [L], and may be updated for all users when the time slot increments. The video server continuously sends video chunks to the user at downloading speed d_k(l), and the user plays the video at bitrate r_k(l), which is defined by the property of the video.

Table 1. List of the variables used in the paper.

Variable    Description
K           number of clients in the system
k           index for agents, runs from 1 to K
B           total bandwidth of the system

In each slot, we reset the clock to zero. We use t′_k(l, m) to denote the time when the server starts to send the m-th chunk of video v_k in time slot l, t_k(l, m) to denote the time when the user starts to play video chunk m, and t̄_k(l, m) to denote the moment that chunk m finishes playing. For the analysis, we consider the size of each chunk to be normalized to 1 unit.

With our formulation, there are two classes of users in a time slot l. The first class consists of the user who requested a new video and triggered the increment of the time slot to l. Since this user has requested a new video, it can purge the already downloaded chunks of the previous video. Users in this class may observe a new downloading rate d_k(l) and video bitrate r_k(l). The second class consists of users who do not request a new video, but to whom a new streaming rate d_k(l) is assigned because some other user k′ ∈ [K] has requested a new video and triggered a slot change. For these users, the video bitrate either remains the same as in the previous slot l − 1, or is adjusted solely by the ABR streaming policy, depending on whether the CBR or ABR policy is activated. The downloading rate d_k(l), in contrast, is updated by the bandwidth distribution policy. Note that a resource allocation scheme may still allocate bandwidth to the user such that d_k(l) = d_k(l − 1); however, this is not always the case. Next, we calculate the stall time in a slot l for both classes of users.

For the first class of users, the previously downloaded chunks are purged, while leftover chunks are studied in Case 2. With the given downloading speed and bitrate, we can observe the relationships between t′(l, m), t(l, m), and t̄(l, m). If the next chunk has not yet been downloaded when the current chunk finishes playing, the stall time incurs an additional wait after the m-th chunk is played. Hence, the stall time until the end of slot l, T_s(l, T, k), can be defined as a recursive conditional function (Equation (4)). Under the condition d_k(l) < r_k(l), the stall time before the m-th chunk is downloaded is the key to obtaining the stall time over T; the stall time at t(m) fits the second condition of Equation (4). According to Equations (1), (3), and (2), the difference between t(m) and t̄(m − 1) can be calculated. Thus, Equation (5) can be written in a recursive form and further solved in closed form. Finally, substituting T_s(t_k(l, m)) into Equation (4) yields the stall time for a slot of length T. Note that if some other user k′ ≠ k requests a new video, triggering an increment of the time slot from l to l + 1, the stall duration analysis falls to the second class of users, which we discuss next.

For the second class of users, assume that in the previous time slot l − 1 the total slot duration is T′. At the moment T′, a chunk, denoted by 0, is being downloaded. Because the chunks were downloaded continuously, by evaluating T′ and the download speed d_k(l − 1) we can calculate the length (or ratio) l_0 of chunk 0 that has not yet been downloaded. Note that since the chunk lengths are normalized, at the beginning of slot l we have t′(1) = l_0/d_k(l). Then, similar to Equation (1), the remaining t′(m) can be obtained recursively.
We denote the last chunk being played in time slot l − 1 as chunk −n, i.e., the video chunk that is n positions ahead of chunk 0, and we denote its finish time calculated in the previous time slot by t̄′(−n). If n = 1 and t̄′(−1) ≤ T′, we know that all chunks before chunk 0 finished playing in slot l − 1. Otherwise, chunk −n is being played halfway at the moment of the time slot transition. In the latter case, user k continues playing video chunk −n at the beginning of time slot l. Then, in the new time slot l, because the video bitrate is unchanged, chunk −n finishes at t̄(−n) = t̄′(−n) − T′. Since chunk 0 is being downloaded at the beginning of slot l, the chunks in the interval {−n, · · · , −1} are all ready to be played, so we can derive their play finish times in slot l accordingly. With the download finish time t′(1) and the play finish time t̄(−1) defined, the leftover video chunk 0 is played at time t(0) = max(t′(1), t̄(−1)) and finishes at t̄(0) = t(0) + 1/r_k(l).
With the leftover chunk issues addressed, we finally obtain the chunk time equations for time slot l. With all the time equations in place, the stall time can be calculated by a procedure similar to that of the previous subsection. Since all chunks with m < 0 are played stall-free, if the slot ends at a time T < t̄(−1) the stall time is zero. For chunks m > 0, a stall may appear in the gap between the time a chunk finishes playing and the time the next chunk finishes downloading. If T falls in this gap, the stall time T_s(T) is the accumulated waiting time of chunks [0, m − 1] (denoted by T_s(t(m − 1))) plus the additional time between T and t(m − 1). Otherwise, if T falls while a chunk m is being played, the stall time T_s(T) is the accumulated stall time of chunks {0, · · · , m}.
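Since the display equations of this derivation are referenced by number but not reproduced above, the following Python sketch illustrates the underlying recursion for a first-class user (a fresh video with no leftover chunks): each unit-size chunk takes 1/d_k(l) seconds to download and 1/r_k(l) seconds to play, playback of a chunk starts at the later of its download-finish time and the previous chunk's play-finish time, and the stall time accumulates the waiting gaps until the end of the slot. The function name and loop structure are ours, not the paper's.

```python
def stall_time_new_video(d, r, T):
    """
    Accumulated stall time within one slot of length T for a user that starts
    a new video at the slot boundary (Case 1: no buffered chunks).
    d: download rate in chunks/second (one chunk takes 1/d seconds to fetch)
    r: playback rate in chunks/second (one chunk takes 1/r seconds to play)
    """
    stall = 0.0
    dl_finish = 0.0    # download-finish time of the current chunk
    play_finish = 0.0  # play-finish time of the previous chunk
    while True:
        dl_finish += 1.0 / d                  # next chunk fully downloaded
        if dl_finish >= T:                    # slot ends while still downloading
            return stall + max(0.0, T - play_finish)
        play_start = max(dl_finish, play_finish)
        stall += play_start - play_finish     # waiting gap before this chunk plays
        play_finish = play_start + 1.0 / r
        if play_finish >= T:                  # slot ends during playback: no extra stall
            return stall

# Example: downloading at 0.5 chunks/s while playing at 1 chunk/s stalls roughly
# half of a 60-second slot, whereas d >= r only incurs the initial startup wait.
print(stall_time_new_video(d=0.5, r=1.0, T=60.0))
print(stall_time_new_video(d=2.0, r=1.0, T=60.0))
```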

The goal of this work is to maximize the inter-agent QoE utility over all users. In this paper, we consider the fairness utility functions in [13] and optimize the inter-agent fairness with two existing evaluations: the sigmoid-like QoE function and the logarithmic fairness function. By analyzing real-world user rating statistics, a sigmoid-like relationship between the web page loading time and the user QoE was reported in [14]. Inspired by that, we draw a similar nonlinear, sigmoid-like QoE curve over the streaming stall time ratio, which satisfies the following properties: (i) reducing the stall time of users who already have very low stall time does not impact their QoE values; (ii) increasing the stall time of users who already suffer from high stall time does not impact their QoE values either; and (iii) users with mediocre QoE are the most sensitive to stall time changes (Equation (16)). We also consider a logarithmic utility function that achieves the well-known proportional fairness [13] among the users (Equation (17)). It is easy to see that, for a unit decrease in stall time, this utility function provides (i) a larger QoE increment for users experiencing higher stall time, and (ii) a smaller increment for users already enjoying good performance with low stall time.

In both Equations (16) and (17), x represents the stall time ratio of playing a video. It is defined in Equation (18), where L_v denotes the set of time slots during which video v has been played, and T_s(l, T_l, k) denotes the stall time of user k in time slot l with slot length T_l.

We note that Equation (16) is only one representative QoE function; other functions may be used. Let V_k(L) be the set of videos played by user k during the L time slots. The total QoE obtained by user k over the L time slots is then given by Equation (19), and the total QoE over all users by Equation (20). Substituting Equations (19) and (18) into (20) yields Equation (21). Note that the QoE function defined in Equation (19) assigns a higher Quality of Experience to lower stall time. The QoE metric remains constant for small stall times: if the stall times are below a certain value and are not noticeable, the QoE does not vary, as captured by the sigmoid-like function of Equation (19). The QoE also decreases rapidly with increasing stall times and remains zero once the stall times exceed a certain value, ruining the viewing experience.
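The exact parameters of Equations (16)-(18) are fitted to the curve reported in [14] and are not reproduced above, so the sketch below only illustrates the shape of the two reward functions and the stall ratio; the sigmoid slope/midpoint and the logarithmic form are placeholder choices with the qualitative properties (i)-(iii) stated above, not the fitted values used in the paper.

```python
import math

def stall_ratio(stall_times, slot_lengths):
    """Stall ratio x of one video: total stall time over the total slot duration (assumed form of Eq. (18))."""
    return sum(stall_times) / sum(slot_lengths)

def qoe_sigmoid(x, slope=12.0, midpoint=0.3):
    """
    Sigmoid-like QoE of the stall ratio x in [0, 1]: nearly flat for very small x,
    steepest around a mediocre stall ratio, and approaching zero for large x.
    slope and midpoint are illustrative placeholders, not the fitted parameters of Eq. (16).
    """
    return 1.0 / (1.0 + math.exp(slope * (x - midpoint)))

def log_fairness_utility(x):
    """
    Logarithmic utility of the stall ratio (placeholder form of Eq. (17)): a unit
    decrease in stall yields a larger gain for users with high stall time than
    for users that already enjoy low stall time.
    """
    return math.log(1.0 - min(x, 0.999))

# Total objective over all users: sum the per-user (per-video) rewards, as in Eqs. (19)-(21).
users_stall_ratios = [0.05, 0.35, 0.60]
print(sum(qoe_sigmoid(x) for x in users_stall_ratios))
print(sum(log_fairness_utility(x) for x in users_stall_ratios))
```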

In this section, we propose a slice assignment system to distribute the download link bandwidth to the users. Let π̄(l) = (π_1(l), · · · , π_K(l)) be a vector in [0, 1]^K such that ∑_{k∈[K]} π_k(l) = 1. Each element π_k(l) denotes the portion of the total bandwidth assigned to user k. By this definition, user k's downloading bandwidth under policy π̄(l) is d_k(l) = π_k(l)B.

The Multi-Agent Video Streaming (MA-Stream) optimization problem is defined as follows:

max_π̄   ∑_{k∈[K]} ∑_{v∈V_k(L)} QoE_k(v)        (22)
s.t.    ∑_{k∈[K]} d_k(l) ≤ B,  ∀ l ∈ [L],       (23)
var.    π̄(l), l ∈ [L],                          (24)

where QoE_k(v) denotes the Quality of Experience obtained by user k for video v, computed from the stall time ratio as in Equation (16).
We now discuss the MA-Stream optimization problem described in Equations (22)-(24). Equation (22) denotes the sum of the Quality of Experience of each user k ∈ [K] across the videos played in the L time slots. The control variable is the policy π̄ (Equation (24)), which directly controls the bandwidth allocation. This gives the constraint in Equation (23): the sum of the bandwidths d_k(l) allocated to the users k ∈ [K] can be at most the total bandwidth of the system in every slot l ∈ [L]. Moreover, the QoE of any video is a nonlinear function of the cumulative stall durations over the chunks of the video played.

We utilize deep Reinforcement Learning to optimize the bandwidth distribution π̄(l). In the following subsections, we define the state, action, and objective for the decision making.

At time slot l, the observed state is defined by the 4K-dimensional vector s(l) = (v_1(l), · · · , v_K(l), d_1(l), · · · , d_K(l), z_1(l), · · · , z_K(l), c_1(l), · · · , c_K(l)), where v_k(l) denotes the video bitrate, d_k(l) the currently assigned download speed, z_k(l) the accumulated stall time of the currently playing video up to slot l, and c_k(l) the number of chunks that have been downloaded but not yet played, for user k ∈ [K]. For brevity, we use the notation s(l) = (v̄(l), d̄(l), z̄(l), c̄(l)), where v̄(l) = (v_1(l), · · · , v_K(l)), d̄(l) = (d_1(l), · · · , d_K(l)), z̄(l) = (z_1(l), · · · , z_K(l)), and c̄(l) = (c_1(l), · · · , c_K(l)); we expand the corresponding vectors when necessary. From the variables v_k(l) and d_k(l), the learning model can estimate the downloaded and played video chunk information in the current time slot l, while z_k(l) and c_k(l) provide the objective-related history information.
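As an illustration, the state vector can be assembled as follows (a sketch; the function and field names are ours).

```python
import numpy as np

def build_state(bitrates, rates, stalls, cached):
    """
    Concatenate the four per-user observation vectors into the 4K-dimensional
    state s(l) = (v(l), d(l), z(l), c(l)).
      bitrates: current video bitrates v_k(l)
      rates:    currently assigned download speeds d_k(l)
      stalls:   accumulated stall times z_k(l) of the currently playing videos
      cached:   chunks downloaded but not yet played, c_k(l)
    """
    return np.concatenate([bitrates, rates, stalls, cached]).astype(np.float32)

# Example with K = 3 users.
s = build_state([8.0, 5.0, 1.0], [500.0, 600.0, 400.0], [0.0, 2.5, 0.4], [3, 1, 2])
print(s.shape)  # (12,) = 4K
```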

At the beginning of time slot l, multiple decisions are needed to adjust the observed speed distribution toward the optimal download speed distribution. We utilize two decision processes to obtain the optimal distribution π̄(l) while maintaining the constraint in Equation (23). One process is a decreasing process that decides which user's download speed will be decreased by one unit of rate, and the other is an increasing process that decides which user will obtain the released unit of download speed.

The download speed distribution is iteratively adjusted toward a final distribution by recursively running the decreasing and increasing decision processes. No distribution is assigned to the system until the final decision is reached, and the system does not transition into the next time slot before then. Assume that at time slot l, with the observed state s(l) = (v̄(l), d̄(l), z̄(l), c̄(l)), the decreasing and increasing processes take actions a− and a+ with a− ≠ a+; the intermediate state s(τ) is then s(τ) = (v̄(l), d_1(l), · · · , d_{a−}(l) − 1, · · · , d_{a+}(l) + 1, · · · , d_K(l), z̄(l), c̄(l)).
This intermediate state s(τ) is then used for the decision making of both processes, and new actions are taken to push the distribution toward its final state. Finally, at a state s(τ) where the increasing and decreasing processes give the same action a+ = a−, the distribution is obtained as π̄(l) = (d_1(τ), d_2(τ), ..., d_K(τ))/B.
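A minimal sketch of this inner adjustment loop is shown below; `sample_decrease` and `sample_increase` stand in for the two policy networks (actions sampled from their output distributions) and are assumptions of this sketch, as is the step cap used to keep the loop bounded.

```python
import numpy as np

def adjust_distribution(state, rates, total_bw, sample_decrease, sample_increase,
                        unit=1.0, max_steps=1000):
    """
    Iteratively move one unit of bandwidth per step: the decreasing policy picks
    the user a- that gives up one unit, the increasing policy picks the user a+
    that receives it, and the loop stops once both pick the same user (a- == a+).
    Returns the normalized distribution pi(l) = d(tau) / B.
    """
    rates = np.asarray(rates, dtype=float).copy()
    for _ in range(max_steps):
        a_minus = sample_decrease(state, rates)   # user losing one unit of rate
        a_plus = sample_increase(state, rates)    # user gaining one unit of rate
        if a_minus == a_plus:                     # both agree: final distribution
            break
        if rates[a_minus] >= unit:                # keep download rates non-negative
            rates[a_minus] -= unit
            rates[a_plus] += unit
    return rates / total_bw
```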

According to π̄(l), the system distributes the bandwidth to each user for time slot l. The next time slot l + 1 is triggered when a user switches its playing video. We assume that the new content requests of all users follow Poisson arrival processes with arrival rate λ_k for user k, so the mean slot duration T_l is 1/∑_{k∈[K]} λ_k, and the probability that user k triggers the state transition is λ_k/∑_{κ∈[K]} λ_κ. For time slot l + 1, the initial system state s(l + 1) = (v̄(l + 1), d̄(l + 1), z̄(l + 1), c̄(l + 1)) keeps the video bitrates v_κ(l + 1) = v_κ(l) for all κ ∈ [K], κ ≠ k, if CBR is the active bitrate policy, and uses the downloading speeds d̄(l + 1) = π̄(l)B calculated in the previous time slot.

The accumulated stalls z̄(l) can be calculated using Equations (4) and (15). Let v_k(l) be the video played by user k in time slot l, let l_{v_k(l),0} be the time slot in which user k starts playing video v_k(l), and let T_{l′} denote the length of time slot l′; z_k(l) then accumulates the per-slot stall times T_s(l′, T_{l′}, k) of the slots since l_{v_k(l),0}. The number of remaining chunks c̄(l) can easily be tracked during the downloading/playing procedures and observed whenever the information is needed. Both the downloading and playing processes can be monitored. For the downloading process, let c_d(l) = h_d(l) + ρ_d(l), where h_d(l) ∈ N denotes the index of the chunk being downloaded at the beginning of time slot l, and ρ_d(l) ∈ [0, 1) denotes the fraction of chunk h_d(l) that has been completed. A similar mechanism holds for the playing process, c_p(l) = h_p(l) + ρ_p(l). Using both processes, the remaining chunks can be calculated as c(l) = c_d(l) − c_p(l) in any time slot l.

As stated in Equation (22), the goal of the controller is to maximize the average QoE. For our RL algorithm to learn a policy that maximizes this objective, every slot provides feedback in the form of the objective value calculated from the average stall times of all users.
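Under the Poisson switching assumption, the slot transition can be simulated as below (a sketch; names are ours): the slot duration is exponential with rate ∑_k λ_k, and user k triggers the switch with probability λ_k/∑_κ λ_κ.

```python
import numpy as np

def next_slot_event(arrival_rates, rng=None):
    """
    Sample the next slot transition: returns (slot duration, index of the user
    that switches its video), following the superposition of the K Poisson
    request processes with rates lambda_k.
    """
    rng = rng or np.random.default_rng()
    rates = np.asarray(arrival_rates, dtype=float)
    total = rates.sum()
    duration = rng.exponential(1.0 / total)            # mean 1 / sum_k lambda_k
    trigger = rng.choice(len(rates), p=rates / total)  # P(user k) = lambda_k / total
    return duration, int(trigger)

# Example: five users, each switching videos every 120 s on average.
print(next_slot_event([1 / 120.0] * 5))  # mean slot length is 24 s
```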

In Section 4.2, we mentioned that when the decreasing and increasing processes take decisions a−(τ) ≠ a+(τ), the state transition only happens in a logical domain rather than in the real time domain. During such intermediate state transitions, no real stall time calculations exist, and we assign zero reward to the actions with a−(τ) ≠ a+(τ) taken before converging to π̄(l)B. When the final distribution is reached (a+ = a−), the slot duration T_l can be observed, and hence the stall times T_s(l, T_l, k) can be calculated for all users. The rewards are then obtained from the calculated stall times using Equation (21).

The complete schema is presented in Algorithm 1.

In the previous section, we defined the network streaming problem MA-Stream for multiple users. Note that the objective defined in Section 4.3 is a nonlinear function (Equation (19)) of the total stall duration up to the current time instant. At any time slot l, the reward depends not only on the stall times observed by the users in slot l, but also on the stall times observed in previous time slots. Hence, the decision-making module needs to track not only the current state but also the history of decisions and rewards in order to select the current action, and we cannot directly utilize standard RL algorithms that require modeling the problem as an MDP. To this end, we leverage Multi-agent Policy Gradient for Finite Time Horizon (MAPG-finite) [15], a multi-agent policy gradient RL algorithm that solves such optimization problems without requiring an MDP model. Section 5.2 gives a short description of this algorithm.

Standard RL problems consider an agent that interacts with a Markov Decision Process M. At time t, the agent observes the state s_t of the environment and plays action a_t to obtain a reward r_t, causing the environment to transition to state s_{t+1} [52]. Let the state transition probability be P^a_{ss′} = P(s_{t+1} = s′ | s_t = s, a_t = a) and the expected reward of playing action a in state s be R^a_s = E[r_t | s_t = s, a_t = a]. The goal of the agent is to find a policy π(s, a; θ) = P(a_t = a | s_t = s, θ), parameterized by θ, that maximizes the discounted cumulative reward

J_π = E[ ∑_{t=0}^{∞} γ^t r_t | s_0 ],   (27)

where s_0 is the initial state and γ < 1 is a discount factor. Using the linearity of the cumulative reward in Equation (27), the state-action value function Q_π(s, a) of policy π is defined as

Q_π(s, a) = E[ ∑_{t=0}^{∞} γ^t r_t | s_0 = s, a_0 = a ].   (28)

Based on Equation (28), many algorithms have been proposed, e.g., SARSA [53-55], Q-learning [23], and Policy Gradient [24], as well as many deep learning based implementations of these fundamental algorithms [52].

In many network optimization problems, the reward metric is nonlinear when multiple subjects are jointly optimized. One typical example is resource fairness among the agents/users of a network [56,57]. With a nonlinear reward function, the decision-making engine must be aware of the historical decisions and states. To demonstrate the need for policies that use history, consider the following example. Suppose that K = 2 users share the network resource, and we want to allocate this resource fairly between them. Under proportional fairness, the fairness of the two users is the sum of the logarithms of their QoE indicators. Call the users 1 and 2, and assume that both start with the same video. In slot 1, the bandwidth allocated to user 1 is higher than that of user 2 (d_1(1) > d_2(1)). This results in higher stall times for user 2 compared to user 1.

Now, in the next time slot 2, user 2 switches to a video with the same bitrate as the previous one, while user 1 continues with the old video. Since the video bitrates remain the same, allocating a higher download rate to user 2 (d_2(2) > d_1(2)) results in lower stall times for user 2, and hence maximizes fairness. This requires the controller to make the decisions in time slot 2 while keeping account of the decision made in time slot 1. However, the state defined in Section 4.1 does not store the cumulative stall duration of the previous video for the user triggering the current time slot. The algorithm used in this paper to deal with such nonlinear problems is explained in the next subsection.

We further note that one possible way to tackle the non-Markovian nature is to introduce a high-order Markov model by including the objective value up to time slot l − 1 in the state. This approach, however, potentially increases the state space dramatically, to (SA)^L, where S is the number of states and A is the number of actions the controller can take. Hence, we adopt a multi-agent learning algorithm in the next section that does not require high-order Markov models and allows each individual agent to improve its policy.

As a policy gradient RL algorithm, MAPG-finite uses the observed system state s(t) as the input of a neural network parameterized by θ, and takes the output of the neural network as the decision policy π_θ, which gives the action probabilities. According to the policy π_θ, an action is then sampled randomly to interact with the environment.

With a proper training process, the neural network is improved by the reward feedback. The neural network parameter θ is expected to evolve to a point θ* that maximizes the objective function f:

θ* = arg max_θ f(J^{π_θ}_1, · · · , J^{π_θ}_K),   (29)

where J^{π_θ}_k denotes the long-term reward obtained by agent k running policy π_θ:

J^π_k = E_{s_0, a_0, s_1, a_1, ...}[ ∑_{l=0}^{L} r_k(s_l, a_l) ],  with s_0 ∼ ρ_0(s_0), a_l ∼ π(a_l | s_l), s_{l+1} ∼ P(s_{l+1} | s_l, a_l).
Since the objective function f is differentiable (see Equation (16)), a gradient estimate for Equation (29) can be obtained by the chain rule:

∇_θ f(J̄_π) = ∇_J f(J̄_π) · ∇_θ J̄_π,

where J̄_π = (J^π_1, ..., J^π_K)^T and J^π_k denotes the expected cumulative reward of agent k, which is estimated from sampled episodes as Ĵ^π_k; in each episode n, the time slot l runs from 0 to L. Finally, with learning rate β, one parameter update step is

θ ← θ + β ∇_θ f(Ĵ_π),

which is used for gradient ascent on the neural network parameters.

We now state a formal result on the convergence of the gradient ascent steps to a stationary point.

Proof. Since we use softmax activations, the gradient of the policy π, ∇_θ π, is also continuous in θ. Obtaining the continuity of ∇_π J^π_k from [24, Theorem 2] with respect to the parameter θ, we have the continuity of ∇_θ J^π_k. Using the continuity of the gradient, the convergence of the gradient ascent steps to a stationary point follows.
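To make the chain-rule structure of the update concrete, the following is a minimal REINFORCE-style sketch of one MAPG-finite gradient-ascent step; `grad_f` and `grad_log_pi` are assumed callables (the gradient of the objective f at the estimated reward vector, and the score function of the parameterized policy), and the estimator is a simplification of the one in [15] rather than the exact algorithm.

```python
import numpy as np

def mapg_finite_step(theta, episodes, grad_f, grad_log_pi, beta):
    """
    One gradient-ascent step on the nonlinear objective f(J_1, ..., J_K).
      episodes: list of trajectories, each a list of (state, action, reward_vector)
                where reward_vector holds the K per-agent rewards of that slot.
      grad_f:   callable J_hat -> gradient of f at J_hat (length-K vector).
      grad_log_pi: callable (theta, state, action) -> d log pi_theta(a|s) / d theta.
    """
    K = len(episodes[0][0][2])
    J_hat = np.zeros(K)                       # sampled estimate of (J_1, ..., J_K)
    grad_J = np.zeros((K,) + theta.shape)     # estimate of dJ_k / dtheta
    for traj in episodes:
        returns = np.sum([r for (_, _, r) in traj], axis=0)      # per-agent return
        score = np.sum([grad_log_pi(theta, s, a) for (s, a, _) in traj], axis=0)
        J_hat += returns / len(episodes)
        for k in range(K):
            grad_J[k] += returns[k] * score / len(episodes)
    # Chain rule: grad_theta f = sum_k (df/dJ_k)(J_hat) * grad_theta J_k
    grad_theta = np.tensordot(grad_f(J_hat), grad_J, axes=1)
    return theta + beta * grad_theta          # gradient-ascent update with rate beta
```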

We conduct a hybrid simulation of a network containing five users sharing a download link, and evaluate the performance of the proposed learning algorithm. In particular, three users prefer to watch HD videos (with desired bitrates of 8 Mbps and 5 Mbps), while the other two users watch videos at lower resolutions (with desired bitrates of 2.5 Mbps and 1 Mbps). The video durations of all users follow an exponential distribution with an identical 120-second average. We run the simulation on channels with different bandwidths, i.e., 1500 KB/s and 2000 KB/s. In both settings, our proposed algorithm substantially outperforms the baseline policies (relying on heuristics and single-agent learning) in terms of QoE reward and fairness.

We evaluate our model-free MAPG-finite algorithm along with three baselines, denoted by "Even", "Adaptive", and SARSA, as follows.

Policy MAPG-finite: Our proposed algorithm leverages model-free, multi-agent policy gradient to optimize the download bandwidth distribution among agents. Recall that in the algorithm, multiple decisions/actions are made to either increase or decrease the download speed of specific users by one unit. During training, the two decision-making processes (increasing and decreasing download speeds) need to perform random exploration in a non-cognitive fashion, which often leads to long exploration times and thus slow convergence to the optimal policy. Suppose that the bandwidth distribution at time t is π_t. To mitigate this problem during training, we suspend the exploration process if the same bandwidth distribution is observed again in the future, i.e., π_{t+x} = π_t for some x ∈ N+.

Policy "Even": The downloading bandwidth is evenly distributed to all users in the 487 system. For instance, when the total bandwidth is 1500KB/s, each of the five users will 488 receive 300KB/s for its downloading speed, regardless of its demand and preference. This 489 one-size-fits-all policy equally distributes bandwidth among the users. Policy "Adaptive": The download bandwidth is split between the users in proportion 491 to their desired video bitrates. This policy guarantees that users with high data rate demand 492 (i.e., those watching the high-resolution video) receive a higher downloading speed, while 493 users with low data rate demand receive a lower speed. Specifically, user k will be assigned 494 a bandwidth of d k = v k B/∑ κ∈K v κ , where v κ is the desired video bit rate of user κ.

Policy SARSA: This policy leverages single-agent learning, SARSA [53-55], to distribute the download bandwidth to the users. It uses a standard Policy Gradient strategy with the same state/action definition as our proposed MAPG-finite. Without accounting for the nonlinear reward function, this policy simply uses the sum of step rewards as its immediate reward for learning. Note that we make the same state variables, including the reward-related history information z̄(l), known to the SARSA policy to boost the performance of this baseline.

Our proposed MAPG-finite algorithm allows the maximization of any nonlinear reward over a finite time period. We consider two reward functions in the evaluation, namely QoE and fairness, to compare the performance achieved by the different policies.

For the QoE reward, we use a sigmoid-like function to measure the reward with respect to the stall time. In particular, we choose the parameters in Equation (16) to match a stall-to-QoE curve reported in [14]. We plot our fitted reward function in Figure 4a, which closely matches the reported stall-to-QoE curve in [14]. For the fairness objective, we choose the logarithmic utility function of the users' received stall time shown in Equation (17) and Figure 4b. By maximizing the logarithmic function, the proportionally fair QoE assignment between the users [13] is obtained.

Naturally, our proposed multi-agent learning algorithm is able to learn and optimize any reward function, linear or nonlinear. We note that even when the exact function is unknown, the model-free algorithm can still be trained and evaluated using real-world user traces.

We implement a network with five users and both high- and low-resolution videos. In particular, three users prefer to watch high-resolution videos with bitrates of 8 Mbps (1080p) and 5 Mbps (720p) (similar to YouTube videos [59]), while the other two users consume 2.5 Mbps (480p) and 1 Mbps (360p) videos randomly.

The video durations of all users follow independent, identical exponential distributions with an average of 2 minutes. Thus, the combined video switching rate of all five users is once every 24 seconds. When a user elects to switch videos, a random video is selected and starts streaming. Note that the new video may have the same or a different bitrate from the previous video. For example, when User 1 finishes watching a video, the new video has a bitrate of 8 Mbps or 5 Mbps, each with 50% probability. The user preferences and their corresponding probabilities are shown in Table 2.

We implement a testbed using Python 3.5. The workflow of the testbed is depicted in Figure 5. First, at the beginning of a new cycle, according to the video switching rates (which result from the known video duration distributions), the Video Switch module randomly schedules a user who will be the next candidate to change its video. The users then start downloading video chunks continuously, and their Download Timers record the timestamp at which each chunk is successfully downloaded (i.e., t_{m+1} in Figure 3). Next, based on the download timestamps and the bitrate of the current video, the Playback Timer schedules the playback and further obtains t_m and t̄_m. With all timestamps confirmed, the Stall Calculator calculates the stall time for the user/videos. The stall time is transferred to the distribution module for training and evaluation. Finally, the chosen user randomly picks a new video from its Video Library at the end of the current cycle.
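The video-switching behavior described above (and handled by the Video Switch and Video Library modules) can be sketched as follows; the per-user libraries are an assumption that only captures the 50/50 bitrate choice mentioned for the HD users and an analogous choice for the LD users, since Table 2 is not reproduced here.

```python
import random

# Illustrative per-user video libraries (bitrates in Mbps).
VIDEO_LIBRARIES = {
    1: [8.0, 5.0], 2: [8.0, 5.0], 3: [8.0, 5.0],   # HD users: 1080p or 720p
    4: [2.5, 1.0], 5: [2.5, 1.0],                  # LD users: 480p or 360p
}

def switch_video(user, rng=random):
    """Pick the next video's bitrate uniformly from the user's library,
    and draw a new exponential video duration with a 120-second mean."""
    bitrate = rng.choice(VIDEO_LIBRARIES[user])
    duration = rng.expovariate(1.0 / 120.0)
    return bitrate, duration
```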

Recall from Section 4 that the state variables used for MAPG-finite decision making include the video bitrates, downloading speeds, accumulated stall times, and residual video chunks, which are reported through output paths 2, 4, 3, and 1, respectively, in Figure 5. The state variables are then collected by the State Listener through input path 6. Further, using the Neural Network, a bandwidth distribution, as the action, is decided and sent by the Action Sender to all users via path 7. For training and evaluation purposes, a copy of the stall ratio is also sent from path 3 to path 8. It is processed by the Reward Function f(·) to calculate either the QoE (Equation (16)) or the fairness (Equation (17)) reward. Finally, the reward is delivered to the Neural Network for policy backpropagation, and also logged for experiment evaluation. Through input path 5, each user adjusts its Download Rate.

Using the modularized testbed implemented in this project, we are able to evaluate different policies under various environment configurations, including different reward functions, network conditions, and user behavior settings. We note that with some minor logic adjustments, we can even test the streaming performance in a discrete time domain, although this paper focuses on continuous-time evaluations.

The numerical results for the QoE reward function (Equation (16)) are depicted in Figure 6. Our proposed MAPG-finite algorithm outperforms the static "Even" and dynamic "Adaptive" strategies by 23.34% and 169.66%, respectively, in terms of achieved QoE with a shared download link of 1500 KB/s. With the 2000 KB/s download link, MAPG-finite still obtains 15.30% and 32.58% higher QoE reward than the "Even" and "Adaptive" policies; the improvement becomes smaller because the marginal QoE improvement is smaller when the stall time is already low under the higher bandwidth. As for the SARSA policy, it is unable to cope with the nonlinear utility function and fails to achieve much improvement over its initial decision policy, "Even". Since the QoE reward function is nonlinear in the assigned bandwidth, we also observe that the "Adaptive" policy (allocating bandwidth proportionally to the desired video bitrate) performs worse than "Even" in both cases.

This can be further seen from Table 3, which shows the stall time and reward breakdown of the different policies. The "Adaptive" policy achieves similar stall times for both HD and LD users, while the "Even" policy sacrifices the performance of HD users and, in return, significantly reduces the stall time of LD users, leading to higher overall QoE. More precisely, according to the QoE reward curve shown in Figure 4a, the reward boost for the LD users is much greater than the loss suffered by the HD users, which results in the overall QoE improvement of the "Even" policy. As a learning-based algorithm, MAPG-finite achieves substantially better performance since it is aware of the current network conditions and system states when optimizing the bandwidth distribution between the users. For example, when a user has enough cached chunks for future playback, its bandwidth can be temporarily turned over to other users who have recently started playing a new video or are suffering from a stall. Thus, all users obtain increased QoE rewards under the MAPG-finite strategy compared with the baselines.

Results for the fairness reward function are shown in Figure 7. We note that due to the use of the logarithmic fairness function, the improvements appear smaller when measured by the fairness reward than by the QoE reward, while the gains should be interpreted in a "multiplicative" sense. Our proposed MAPG-finite still outperforms the "Even" and "Adaptive" strategies (in terms of the logarithmic fairness reward) by 4.83% and 6.75% for the 2000 KB/s download link. Table 4 shows that the dynamic "Adaptive" strategy achieves similar performance for all users, leading to a better fairness reward than the "Even" policy, which creates a larger gap between the HD and LD users whose bitrate requirements differ significantly. On the other hand, our MAPG-finite strategy reduces the average stall ratio of HD users by 38.46%, at the cost of a 37.50% higher stall ratio for LD users, compared to the "Even" policy. In this way, it reduces the stall ratio deviation across all users from 0.1698 to 0.0712 and improves the fairness reward. Compared with the "Adaptive" policy, MAPG-finite has about a 30% higher stall ratio deviation. However, the optimization objective, known as the proportional fairness utility [13], is not solely about "equalizing" different users' performance. (To illustrate this, we construct a "Low Dev" policy in Table 4 that has close-to-zero stall ratio deviation but a low fairness reward.) The proportional fairness reward function indeed balances two important objectives: efficiency (i.e., assigning more bandwidth to users that achieve higher reward per unit of bandwidth) and fairness (i.e., balancing different users' performance). MAPG-finite attains the highest reward under the proportional fairness utility, demonstrating its ability to achieve complex optimization objectives.

The evaluation results show that the "Even", "Adaptive", and SARSA policies fail to perform consistently under different application scenarios and network conditions, while our learning-based MAPG-finite policy achieves the highest reward. Figure 8 depicts the average download rate distribution decided by the MAPG-finite policy. The gray bars represent the average video bitrates requested by the users, and the white bars the average download rates achieved by our MAPG-finite policy (for presentation purposes, the unit of download rates is converted from KB/s to Mbps). It can be seen that, to maximize the fairness reward, MAPG-finite ensures that (i) users with the same HD preference (e.g., users 1, 2, and 3) are assigned the same average download bandwidth so that they obtain similar stall times, and (ii) users watching videos with lower desired bitrates are assigned less bandwidth, which balances the stall time since their video chunks are consumed at a slower pace. According to Figure 7, the "Adaptive" policy obtains a lower fairness reward than "Even" under the total download link of 2000 KB/s, which indicates that proportionally adjusting the download bandwidth does not always achieve a better result when fairness is concerned. Through RL exploration and training, MAPG-finite is able to self-teach, improve, and finally converge to an optimal policy, making the model-free approach suitable for bandwidth allocation in complex networks with objectives that often do not have a straightforward mathematical formulation.

Figure 9. QoE reward comparisons with the ABR feature activated/deactivated. The total download bandwidth is 1500 KB/s. MAPG-finite achieves 21.94%, 41.25%, and 37.11% higher QoE than the "Even", "Adaptive", and SARSA policies, respectively.
To further illustrate the agility of our proposed MAPG-finite algorithm, we perform another evaluation on a 1500 KB/s downlink with an ABR streaming algorithm implemented. Figure 9 depicts the QoE performance of this evaluation. We utilize the basic buffer-based ABR algorithm proposed in [37]. Each time a video chunk is requested, if the last downloaded chunk is already being played, the bitrate is adjusted to 80% of the video's maximum bitrate to avoid high stall time. When more than three cached chunks remain, the agent starts requesting the maximum bitrate, which yields better display quality. Comparing Figure 9 with Figure 6, all policies receive higher rewards under ABR due to the benefits of bitrate adaptation. We note that MAPG-finite still outperforms the "Even" policy by 21.94%, the "Adaptive" policy by 41.25%, and the SARSA policy by 37.11%. In this evaluation, we choose the buffer-based strategy for ABR because it is efficient to implement. According to the numerical results, our proposed MAPG-finite adapts well to a dynamic bitrate environment. We are aware that newer ABR policies, some driven by RL algorithms themselves, have been proposed and evaluated [36,38-41] to improve streaming quality. The key aim of this evaluation is to show that the proposed framework works with ABR streaming strategies, not to compare different streaming strategies. Thus, any streaming algorithm can be used in our evaluations, and our results show that efficient bandwidth distribution among multiple agents can be achieved with the proposed algorithm, regardless of which ABR/CBR streaming algorithm each agent uses.
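The buffer-based bitrate rule described above can be written compactly as follows (a sketch; what happens for an intermediate buffer of one to three chunks is not specified in the text, so the conservative branch is our assumption).

```python
def buffer_based_bitrate(max_bitrate, cached_chunks):
    """
    Buffer-based ABR rule used in the evaluation: fall back to 80% of the video's
    maximum bitrate when the last downloaded chunk is already being played
    (empty buffer), and request the maximum bitrate once more than three chunks
    are cached.
    """
    if cached_chunks <= 0:        # buffer drained: avoid stalls
        return 0.8 * max_bitrate
    if cached_chunks > 3:         # comfortable buffer: request full quality
        return max_bitrate
    return 0.8 * max_bitrate      # intermediate buffer: stay conservative (assumption)
```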

In this paper, we model the MA-Stream problem, which apportions bandwidth to multiple users in a video streaming network to maximize nonlinear, non-convex objectives such as QoE and fairness. We propose a novel multi-agent reinforcement learning algorithm, MAPG-finite, that is able to work with nonlinear objective functions to solve this optimization problem. Using our testbed implemented in Python, we verify that our proposed solution outperforms existing baseline policies (including "Even", "Adaptive", and single-agent SARSA) measured by both QoE and fairness. Our algorithm improves QoE by 15.27% and fairness by 22.47% for a 2000 KB/s data link. Further, it adapts well in combination with an existing Adaptive Bitrate (ABR) streaming algorithm, improving QoE by more than 30% over the "Adaptive" policy. The interaction between bandwidth distribution and ABR policies could be considered in future work to further improve the performance of video streaming. Another interesting direction for future work is to perform large-scale experiments with network-scale numbers of users.