User Preference-Based Dynamic Optimization of Quality of Experience for Adaptive Video Streaming
Abstract
1. Introduction
1.1. Research Background
1.2. Research Motivation
- (1) Aligning policy optimization with user perceptions;
- (2) Enabling dynamic, individualized reward signals;
- (3) Reducing manual reward tuning overhead;
- (4) Enhancing interpretability and adaptability of RL-based ABR strategies.
1.3. Research Contributions
- (1) Trajectory Sampling Module: Multiple ABR agents interact with a simulated environment to collect playback trajectories (states, actions, video metrics).
- (2) Preference Modeling Module: An LSTM-based preference network is trained to score trajectories via pairwise comparisons.
- (3) Policy Training Module: The preference model’s outputs replace traditional rewards, enabling PPO-based ABR policy training.
- (1) Designing a multi-dimensional state encoding and trajectory data structure for preference modeling;
- (2) Developing methods for generating preference pairs (automated and human-annotated);
- (3) Adapting the PPO framework for preference-driven policy updates;
- (4) Experimental validation of the strategy’s performance in quality, stalling control, and subjective consistency.
1.4. Thesis Structure
2. Literature Review
2.1. Evolution of Adaptive Bitrate Algorithms
2.2. User Experience Modeling and Preference Learning
2.2.1. QoE Metrics for DASH
- (1) Instantaneous Visual Quality: The visual quality of the current video segment plays a crucial role in QoE. Since video sequences are encoded at the server into different representations, the resulting impairments are typically quantified through Video Quality Assessment (VQA). Depending on the availability of the original video, VQA methods fall into three categories: full-reference, reduced-reference, and no-reference approaches. Because full-reference VQA generally provides the most accurate assessment, this paper adopts the Structural Similarity Index (SSIM) as the visual quality metric for QoE evaluation in DASH. SSIM yields values in [0, 1], where values approaching 1 indicate higher similarity to the reference and superior visual quality; Equations (1) and (2) are expressed in terms of these per-segment SSIM values. Although SSIM is computationally intensive, the values for all bitrate-encoded segments can be precomputed on the video server and embedded in the Media Presentation Description (MPD) file, so that clients simply retrieve them during playback as inputs for QoE assessment.
- (2) Quality Oscillation: Quality oscillation is a further QoE metric unique to DASH. The DASH adaptation mechanism dynamically adjusts bitrates based on real-time bandwidth conditions and client buffer status, which may cause significant visual quality variations across consecutive video segments. Such fluctuations substantially degrade QoE in DASH systems, making quality oscillation an essential metric for comprehensive QoE evaluation.
- (3) Rebuffering Events: As another fundamental QoE metric, rebuffering events have been widely adopted in existing DASH adaptation schemes. Minimizing rebuffering duration or frequency enhances playback smoothness, thereby reducing visual discomfort for viewers. Notably, users typically perceive rebuffering events as more disruptive than initial playback delays. A minimal sketch combining these three components follows this list.
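To make the combination of these three components concrete, the following minimal Python sketch scores a single segment from its SSIM value, the quality jump relative to the previous segment, and its stalling time. The linear form and the weights w_q, w_o, and w_r are illustrative assumptions, not the exact QoE model of Equations (1) and (2).

```python
# Minimal sketch of combining the three QoE components discussed above.
# The weights and the linear form are illustrative assumptions.

def segment_qoe(ssim_curr: float, ssim_prev: float, rebuffer_s: float,
                w_q: float = 1.0, w_o: float = 1.0, w_r: float = 4.0) -> float:
    """Score one segment from visual quality, oscillation, and stalling."""
    quality = ssim_curr                       # instantaneous visual quality (SSIM in [0, 1])
    oscillation = abs(ssim_curr - ssim_prev)  # quality jump w.r.t. the previous segment
    return w_q * quality - w_o * oscillation - w_r * rebuffer_s

# Example: a high-quality segment that stalled for 0.5 s
print(segment_qoe(ssim_curr=0.96, ssim_prev=0.90, rebuffer_s=0.5))
```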
2.2.2. Preference Learning
- (1) Better alignment with human judgment paradigms (e.g., “Which do you prefer?”);
- (2) Superior suitability for modeling highly subjective problems;
- (3) Effective utilization of limited labeled data to generate high-quality training signals (a minimal pairwise-comparison sketch follows this list).
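As a minimal illustration of the pairwise-comparison paradigm, the snippet below converts two trajectory scores into a Bradley–Terry style preference probability. The logistic form is the standard preference-learning formulation [3] and is assumed here for illustration, not taken verbatim from the paper.

```python
import math

def preference_probability(score_i: float, score_j: float) -> float:
    """Bradley-Terry style probability that trajectory i is preferred over j."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))

# A trajectory scored 2.1 is strongly preferred over one scored 0.4
print(preference_probability(2.1, 0.4))  # ~0.85
```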
2.3. Mathematical Model of DASH
2.4. Summary and Analysis
- (1) Misaligned Policy Objectives: Uniform reward functions fail to address personalized user requirements.
- (2) Ineffective Preference Modeling: Existing methods predominantly depend on objective metric regression, overlooking preference judgment signals.
- (3) Limited Scalability and Adaptability: Fixed parameters cannot be effectively transferred to multi-user, multi-scenario environments.
3. System Design and Architecture
3.1. System Design Objectives
- (1) Personalization: Support subjective trade-offs between video quality and smoothness through preference modeling.
- (2) Adaptability: Enable reinforcement learning policies to continuously adjust to dynamic network conditions.
- (3) Modularity: Maintain a clearly layered system design for independent optimization and testing of components.
- (4) Training Efficiency: Support parallel trajectory sampling and periodic policy updates to reduce training time. Based on these objectives, the system is divided into three functional modules: the Trajectory Sampling Module, the Preference Modeling Module, and the Policy Training Module. These modules work together to form a closed-loop policy optimization process.
3.2. System Architecture
- (1) Trajectory Sampling Module (Agent + Simulated Env): Interacts with a simulated video playback environment to collect trajectory data comprising states, actions, and initial rewards.
- (2) Preference Modeling Module (Preference Net): Employs an LSTM network to score trajectories, capturing users’ subjective preferences for video streaming experiences.
- (3) Policy Training Module (PPO + Reward Reformulation): Performs PPO policy training, using preference scores as surrogate rewards to drive policy updates.
3.3. Trajectory Sampling Module
Trajectory Sampling
- (1) Policy synchronization: Updates the local network with the latest PPO policy parameters from the main process.
- (2) Environment interaction: Executes TRAIN_SEQ_LEN = 1000 steps per episode, recording the state (state ∈ ℝ^{6 × 8}), the action (action ∈ {0, 1, …, 5}, the bitrate index), and the environment’s immediate reward.
- (3) Experience reporting: Sends trajectories (states, actions, and policy probabilities) to the main process via an inter-process queue.
- (4) Trajectory archiving: Stores trajectories as JSON files every 10 episodes. Each per-step JSON record contains: state (the 6-dimensional signal history over an 8-step window), action (the selected bitrate index), reward (the immediate environment reward), video_quality (the quality metric), rebuffer_time (the stalling duration), and download_speed (the measured bandwidth). This design ensures sufficient policy–environment interaction and high-quality raw trajectory data for preference modeling (a minimal archiving sketch follows this list).
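The sketch below illustrates the per-step trajectory record and the every-10-episodes JSON archiving described above. The file-naming scheme, output directory, and placeholder values are assumptions for illustration only.

```python
import json
import os
import numpy as np

TRAIN_SEQ_LEN = 1000  # steps per episode, as in the sampling module

def archive_trajectory(trajectory, episode_idx, agent_id, out_dir="./trajs"):
    """Write one episode's trajectory to a JSON file every 10 episodes.

    `trajectory` is a list of per-step dicts following the structure described
    above; the file-naming scheme is an illustrative assumption.
    """
    if episode_idx % 10 != 0:
        return None
    os.makedirs(out_dir, exist_ok=True)
    path = f"{out_dir}/agent{agent_id}_ep{episode_idx}.json"
    with open(path, "w") as f:
        json.dump(trajectory, f)
    return path

# One per-step record (values are placeholders)
step_record = {
    "state": np.zeros((6, 8)).tolist(),  # 6 signals x 8-step history
    "action": 3,                         # chosen bitrate index in {0, ..., 5}
    "reward": 0.82,                      # immediate environment reward
    "video_quality": 0.95,               # e.g., SSIM of the downloaded segment
    "rebuffer_time": 0.0,                # stalling duration in seconds
    "download_speed": 4.2,               # measured bandwidth (Mbps)
}
print(archive_trajectory([step_record], episode_idx=10, agent_id=0))
```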
3.4. Preference Modeling Module
3.4.1. Preference Data Construction
3.4.2. Preference Network Architecture
- (1) Input layer: Each step of a trajectory is a 6 × 8 state matrix, which is first flattened into a 48-dimensional vector, yielding a sequence of per-step feature vectors.
- (2) Temporal modeling layer: Stacked LSTM (Long Short-Term Memory) modules extract temporal dependencies in trajectories.
- (3) Score output layer: An MLP network projects the final hidden state of the LSTM into a scalar score (a minimal PyTorch sketch follows this list).
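A minimal PyTorch sketch of this flatten → stacked-LSTM → MLP scorer is given below. The hidden size, number of LSTM layers, and MLP width are illustrative assumptions; only the overall structure follows the description above.

```python
import torch
import torch.nn as nn

class PreferenceNet(nn.Module):
    """Sketch of the flatten -> stacked LSTM -> MLP trajectory scorer."""

    def __init__(self, state_dim: int = 6 * 8, hidden_dim: int = 128, num_layers: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=state_dim, hidden_size=hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, traj: torch.Tensor) -> torch.Tensor:
        # traj: (batch, T, 6, 8) -> flatten each step's state matrix to 48 features
        batch, T = traj.shape[0], traj.shape[1]
        x = traj.reshape(batch, T, -1)
        _, (h_n, _) = self.lstm(x)              # h_n: (num_layers, batch, hidden_dim)
        return self.head(h_n[-1]).squeeze(-1)   # one scalar score per trajectory

# Example: score a batch of 4 trajectories of 1000 steps each
scores = PreferenceNet()(torch.randn(4, 1000, 6, 8))
print(scores.shape)  # torch.Size([4])
```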
3.4.3. Training Method
- (1) Input: A set of trajectory pairs (τ_i, τ_j).
- (2) Model output: Corresponding scores (score_i, score_j) for each trajectory.
- (3) Objective function: The binary cross-entropy (BCE) loss between the sigmoid-transformed score difference and the preference label, where a label of 1 indicates preference for τ_i, 0 indicates preference for τ_j, and 0.5 indicates an uncertain preference (a minimal training-step sketch follows this list).
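The following sketch implements the described objective: a BCE loss on the sigmoid of the score difference, with labels 1, 0, or 0.5. Network construction and the optimizer are assumed and omitted for brevity.

```python
import torch
import torch.nn.functional as F

def preference_loss(score_i: torch.Tensor, score_j: torch.Tensor,
                    label: torch.Tensor) -> torch.Tensor:
    """BCE between the sigmoid of the score difference and the preference label.

    label = 1 prefers trajectory i, 0 prefers trajectory j, and 0.5 marks an
    uncertain pair (soft target).
    """
    prob_i_preferred = torch.sigmoid(score_i - score_j)
    return F.binary_cross_entropy(prob_i_preferred, label)

# One optimization step (preference net and optimizer assumed):
# score_i, score_j = net(traj_i), net(traj_j)
# loss = preference_loss(score_i, score_j, label)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```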
3.5. Main Training Module
- (1) Parameter Broadcasting: The main process distributes current policy network parameters to all sampling agents via queues;
- (2) Experience Collection: Waits for all agents to return sampled trajectories;
- (3) Preference Scoring: Uses the preference model to score each trajectory as the alternative reward for the current iteration;
- (4) Data Organization: Constructs batches of states (s_batch), actions (a_batch), probabilities (p_batch), and preference scores (v_batch);
- (5) Policy Optimization: Executes one policy update using the PPO network;
- (6) Periodic Testing: Conducts policy performance evaluation every few iterations, recording metrics such as test reward and entropy;
- (7) Model Saving: Periodically saves model parameters to local storage. A minimal orchestration sketch of this loop follows the list.
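A minimal orchestration sketch of steps (1)–(7) is shown below, using Python multiprocessing queues. The interfaces ppo, pref_net, and agent_fn, the trajectory dictionary keys, and the way the trajectory-level preference score is broadcast to every step are all assumptions for illustration.

```python
import multiprocessing as mp

def central_trainer(num_agents, num_iterations, ppo, pref_net, agent_fn):
    """Sketch of the main-process loop; ppo, pref_net, and agent_fn are assumed interfaces."""
    param_qs = [mp.Queue(1) for _ in range(num_agents)]   # trainer -> agents
    exp_q = mp.Queue()                                     # agents -> trainer
    workers = [mp.Process(target=agent_fn, args=(i, param_qs[i], exp_q))
               for i in range(num_agents)]
    for w in workers:
        w.start()

    for it in range(num_iterations):
        for q in param_qs:                                 # (1) parameter broadcasting
            q.put(ppo.get_params())
        trajs = [exp_q.get() for _ in range(num_agents)]   # (2) experience collection
        scores = [pref_net.score(t["states"]) for t in trajs]  # (3) preference scoring
        # (4) data organization: the trajectory-level score is reused as the
        # surrogate reward for every step of that trajectory (an assumption)
        s_batch = [s for t in trajs for s in t["states"]]
        a_batch = [a for t in trajs for a in t["actions"]]
        p_batch = [p for t in trajs for p in t["probs"]]
        v_batch = [sc for t, sc in zip(trajs, scores) for _ in t["states"]]
        ppo.update(s_batch, a_batch, p_batch, v_batch)     # (5) policy optimization
        if it % 100 == 0:                                  # (6)/(7) periodic test + checkpoint
            ppo.evaluate_and_save(it)

    for w in workers:
        w.terminate()
```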
- (1) State: The agent’s observation comprises multiple dynamic metrics. The historical QoE metric reflects the user experience quality of the previous video segments; the buffer state quantifies the current buffer occupancy; the bandwidth estimate records real-time network throughput; and the expected bandwidth requirement gives the bandwidth needed to satisfy the maximum switching condition. An additional term represents the client-side QoE, reflecting the end-user’s perceived quality level in real time.
- (2) Action: The agent’s action space is the discrete set {0, 1, …, 5}, where each value corresponds to a specific video bitrate option.
- (3) Reward: The system incorporates a preference modeling mechanism, in which the trained Preference Net scores complete trajectories as shown in Equation (9). This reward substitution mechanism aligns the policy learning objective more closely with users’ subjective experience preferences, thereby constructing a user-oriented ABR strategy.
- (4) Policy: The policy network π_θ(a | s) maps the observed state to a probability distribution over the discrete bitrate actions; a minimal sketch of the clipped PPO update it undergoes follows this list.
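The policy update follows the standard PPO clipped surrogate objective. The sketch below computes that loss from the probabilities of the chosen actions under the current and sampling-time policies and from advantages derived from the preference-based rewards; the clipping coefficient is an illustrative default.

```python
import torch

def ppo_policy_loss(new_probs, old_probs, advantages, clip_eps: float = 0.2):
    """Clipped PPO surrogate used to update the policy network pi_theta(a | s).

    `new_probs`/`old_probs` are probabilities of the taken actions under the
    current and sampling-time policies; `advantages` come from the
    preference-based surrogate rewards. Hyperparameters are illustrative.
    """
    ratio = new_probs / (old_probs + 1e-8)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.mean(torch.min(ratio * advantages, clipped * advantages))

# Example with dummy tensors
loss = ppo_policy_loss(torch.tensor([0.30, 0.55]),
                       torch.tensor([0.25, 0.60]),
                       torch.tensor([1.2, -0.4]))
print(loss)
```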
3.6. Model Training
Module Coordination Workflow
- (1) Sampling Phase (Agent Sampling): Multiple agents concurrently interact with the ABREnv simulation environment. Each agent collects state–action–policy-probability trajectories over multiple steps and transmits them to the central trainer. A subset of trajectories is saved as JSON files for preference model training.
- (2) Preference Modeling Phase (Preference Net Scoring): The central trainer invokes the preference network to assign scalar subjective scores to trajectories, which replace traditional rewards. This step directly incorporates user experience feedback into policy optimization.
- (3) Policy Optimization Phase (PPO Training): The trainer aggregates state sequences (s_batch), actions (a_batch), policy probabilities (p_batch), and preference scores (v_batch) into training data, executes one PPO policy update, and distributes the results to agents for the next sampling round, closing the loop.
Algorithm 1: A reinforcement learning training framework that integrates user preference modeling.
1: Initialize the policy network parameters, the preference model, the sampling agent set, and the value network parameters
2: Complete the environment configuration for all agents
3: for each training iteration do
4:   for each agent do
5:     Interact with the simulated environment to collect trajectory traj
6:     Upload the trajectory data to the central training process
7:     Save the trajectory as a JSON file for preference model training
8:   end for
9:   The main process receives all trajectories traj
10:  for each agent do
11:    Score the trajectory with the preference model to obtain the alternative reward reward
12:  end for
13:  Construct training batches batch
14:  Compute advantage estimates based on the PPO algorithm
15:  Update the policy network parameters
16:  Conduct strategy testing, record performance metrics, and save model parameters
17: end for
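Step 14 of Algorithm 1 computes advantage estimates for the PPO update. One common choice in PPO implementations is Generalized Advantage Estimation (GAE), sketched below; whether the paper uses GAE specifically, and the values of gamma and lam, are assumptions.

```python
import numpy as np

def gae_advantages(rewards, values, gamma: float = 0.99, lam: float = 0.95):
    """Generalized Advantage Estimation for the advantage step of Algorithm 1.

    `rewards` are the preference-based surrogate rewards; `values` are the
    value-network estimates with a terminal value appended.
    """
    rewards, values = np.asarray(rewards, dtype=np.float64), np.asarray(values, dtype=np.float64)
    advantages = np.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        gae = delta + gamma * lam * gae                         # exponentially weighted sum
        advantages[t] = gae
    return advantages

# Example: 3-step toy trajectory with a terminal value appended to `values`
print(gae_advantages([1.0, 0.5, 0.8], [0.9, 0.7, 0.6, 0.0]))
```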
4. System Implementation and Experimental Design
4.1. Experimental Setup and Evaluation Metrics
4.2. Evaluation Metrics
- (1) Average Bitrate Quality Level: This metric calculates the mean bitrate quality of dynamically requested media segments to determine the actual visual quality level obtained by users. Higher average bitrate levels correspond to higher-resolution media content, directly reflecting QoE improvement. In adaptive streaming systems, this metric serves as a core optimization objective function parameter, enabling clients to dynamically adjust request strategies based on real-time network throughput to achieve balanced optimization of transmission quality and user experience.
- (2) Video Quality Switch Count: While the DASH protocol adapts video bitrates dynamically to network changes, frequent large bitrate fluctuations may cause noticeable quality jumps that degrade viewing experiences. Optimization strategies must balance bandwidth utilization with visual consistency by reducing the frequency of significant bitrate switches to maintain stable perceived quality.
- (3) Rebuffering Time: This is the total pause duration during playback caused by insufficient buffering, representing one of the most significant factors affecting user experience. Each noticeable rebuffering event typically causes sharp declines in user satisfaction and is often incorporated as a negative value in reward calculations to penalize such behavior.
- (4) QoE: A comprehensive metric for evaluating video viewing experience that is primarily composed of three key elements: video quality, bitrate switch frequency/magnitude, and buffering events. It is calculated by combining these three components (a minimal aggregation sketch follows this list).
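The following sketch aggregates these four metrics over one playback session. The linear QoE combination and its weights follow the common quality / smoothness / rebuffering form and are illustrative assumptions rather than the paper’s exact coefficients.

```python
def session_metrics(bitrates, rebuffer_times, qualities,
                    w_quality=1.0, w_switch=1.0, w_rebuffer=4.3):
    """Aggregate the four evaluation metrics over one playback session.

    `bitrates` are requested bitrate indices, `rebuffer_times` are per-segment
    stall durations (s), and `qualities` are per-segment quality values
    (e.g., SSIM). The QoE weights are illustrative assumptions.
    """
    avg_bitrate_level = sum(bitrates) / len(bitrates)
    switch_count = sum(1 for prev, curr in zip(bitrates, bitrates[1:]) if curr != prev)
    total_rebuffer = sum(rebuffer_times)
    switch_magnitude = sum(abs(curr - prev) for prev, curr in zip(qualities, qualities[1:]))
    qoe = (w_quality * sum(qualities)
           - w_switch * switch_magnitude
           - w_rebuffer * total_rebuffer)
    return avg_bitrate_level, switch_count, total_rebuffer, qoe

# Example: a short 5-segment session
print(session_metrics(bitrates=[2, 3, 3, 4, 4],
                      rebuffer_times=[0.0, 0.3, 0.0, 0.0, 0.0],
                      qualities=[0.88, 0.92, 0.92, 0.95, 0.95]))
```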
4.3. Comparative Experiments
- (1) MPC [13]: As a conventional bitrate adaptation algorithm, it employs multi-step prediction within a sliding time window to optimize bitrate decisions. Its core methodology involves the following: (i) forecasting future network conditions based on historical throughput measurements, (ii) constructing an optimization problem by integrating a dynamic buffer state model, and (iii) performing time-domain rolling optimization to balance video quality, rebuffering risk, and playback smoothness.
- (2) BOLA [9]: As a representative static heuristic algorithm, BOLA reformulates bitrate selection as a queue stability control problem within the Lyapunov optimization framework. Its key characteristics include the following: (i) employing a virtual buffer queue to enable smooth bitrate transitions, (ii) leveraging control-theoretic principles to guarantee system stability, and (iii) demonstrating exceptional buffer regulation performance in single-user scenarios. Owing to its mathematical rigor and implementation simplicity, BOLA is widely adopted as a benchmark algorithm in related research.
- (3) Pensieve [2]: A state-of-the-art ABR solution that leverages deep reinforcement learning for end-to-end bitrate decision optimization. Combining offline training and online inference, it improves bitrate selection accuracy in complex network environments and is considered a representative benchmark.
- (4) Lumos [26]: A decision tree-based throughput predictor designed to enhance bitrate selection accuracy in adaptive bitrate (ABR) algorithms, consequently optimizing Quality of Experience (QoE) by providing reliable throughput estimates for bandwidth-sensitive adaptation decisions.
- (5) Our Method (PPO + Preference Modeling): This incorporates a user preference network as the reward function to enhance training efficiency and policy rationality.
4.4. Ablation Study Analysis
- (1) Baseline Pensieve: Uses conventional environmental rewards, single-threaded trajectory sampling, and excludes preference modeling (original PPO-based Pensieve architecture).
- (2) Preference-Modeling Version (PPO + Preference Net): Replaces environmental rewards with Preference Net scores, optimizing for user preference. This version shows marked improvement in video quality and user satisfaction metrics, better approximating subjective user experience.
- (3) Complete Version (PPO + Multi-process + Preference Reward): Integrates all proposed modules as our final optimized strategy. It achieves optimal performance across QoE, average bitrate, and rebuffering time, demonstrating superior generalization capability and robustness.
4.5. Extended Experiments on Public Datasets
5. Conclusions and Future Work
- (1) Limited Automation in Preference Data Generation: Current trajectory pair construction depends on sampling quantity; future work could explore active learning and preference augmentation techniques.
- (2) Static Preference Model: The current Preference Net operates as a fixed scorer without joint optimization during policy training; future research could investigate online updates and co-optimization mechanisms.
- (3) Network Architecture Enhancement: Beyond LSTM, future work may explore Transformer-based encoders for better long-term dependency modeling and scoring accuracy.
- (4) More Realistic User Evaluation: Future studies could incorporate subjective ratings and real-world user feedback (e.g., click-through data) to further refine the preference model.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Cisco, U. Cisco Annual Internet Report (2018–2023) White Paper; Cisco: San Jose, CA, USA, 2020; Volume 10, pp. 1–35.
- Mao, H.; Netravali, R.; Alizadeh, M. Neural Adaptive Video Streaming with Pensieve; ACM Special Interest Group on Data Communication; ACM: Singapore, 2017.
- Wirth, C.; Akrour, R.; Neumann, G.; Fürnkranz, J. A survey of preference-based reinforcement learning methods. J. Mach. Learn. Res. 2017, 18, 1–46.
- Jiang, J.; Sekar, V.; Zhang, H. Improving fairness, efficiency, and stability in HTTP-based adaptive video streaming with FESTIVE. In Proceedings of the 8th International Conference on Emerging Networking Experiments and Technologies, Florham Park, NJ, USA, 15 September 2012; pp. 97–108.
- Sun, Y.; Yin, X.; Jiang, J.; Sekar, V.; Lin, F.; Wang, N.; Liu, T.; Sinopoli, B. CS2P: Improving video bitrate selection and adaptation with data-driven throughput prediction. In Proceedings of the 2016 ACM SIGCOMM Conference, Florianópolis, Brazil, 22–26 August 2016; pp. 272–285.
- Akhtar, Z.; Nam, Y.S.; Govindan, R.; Rao, S.; Chen, J.; Katz-Bassett, E.; Ribeiro, B.; Zhan, J.; Zhang, H. Oboe: Auto-tuning video ABR algorithms to network conditions. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, Budapest, Hungary, 20–25 August 2018; pp. 44–58.
- Gao, X.; Song, A.; Hao, L.; Zou, J.; Chen, G.; Tang, S. Towards efficient multi-channel data broadcast for multimedia streams. IEEE Trans. Parallel Distrib. Syst. 2019, 30, 2370–2383.
- Huang, T.Y.; Johari, R.; McKeown, N.; Trunnell, M.; Watson, M. A buffer-based approach to rate adaptation: Evidence from a large video streaming service. ACM 2014, 12, 187–198.
- Spiteri, K.; Urgaonkar, R.; Sitaraman, R.K. BOLA: Near-optimal bitrate adaptation for online videos. IEEE/ACM Trans. Netw. 2020, 28, 1698–1711.
- Yadav, P.K.; Shafiei, A.; Ooi, W.T. QUETRA: A queuing theory approach to DASH rate adaptation. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 1130–1138.
- Zhou, C.; Lin, C.W.; Guo, Z. mDASH: A Markov decision-based rate adaptation approach for dynamic HTTP streaming. IEEE Trans. Multimed. 2016, 18, 738–751.
- Xu, B.; Chen, H.; Ma, Z. Karma: Adaptive video streaming via causal sequence modeling. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 1527–1535.
- Yin, X.; Jindal, A.; Sekar, V.; Sinopoli, B. A control-theoretic approach for dynamic adaptive video streaming over HTTP. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, London, UK, 17–21 August 2015; pp. 325–338.
- Li, Y.; Zheng, Q.; Zhang, Z.; Chen, H.; Ma, Z. Improving ABR performance for short video streaming using multi-agent reinforcement learning with expert guidance. In Proceedings of the 33rd Workshop on Network and Operating System Support for Digital Audio and Video, Vancouver, BC, Canada, 7–10 June 2023; pp. 58–64.
- Yan, F.Y.; Ayers, H.; Zhu, C.; Fouladi, S.; Hong, J.; Zhang, K.; Levis, P.; Winstein, K. Learning in situ: A randomized experiment in video streaming. In Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20), Santa Clara, CA, USA, 25–27 February 2020.
- O’Hanlon, P.; Aslam, A. Latency Target based Analysis of the DASH.js Player. In Proceedings of the 14th Conference on ACM Multimedia Systems, Vancouver, BC, Canada, 7–10 June 2023; pp. 153–160.
- Han, B.; Qian, F.; Ji, L.; Gopalakrishnan, V. MP-DASH: Adaptive video streaming over preference-aware multipath. In Proceedings of the 12th International Conference on Emerging Networking EXperiments and Technologies, Irvine, CA, USA, 12–15 December 2016; pp. 129–143.
- Meng, Z.; Wang, M.; Bai, J.; Xu, M.; Mao, H.; Hu, H. Interpreting deep learning-based networking systems. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication, Virtual Event, USA, 10–14 August 2020; pp. 154–171.
- Graves, A. Long short-term memory. Supervised Seq. Label. Recurr. Neural Netw. 2012, 385, 37–45.
- Greff, K.; Srivastava, R.K.; Koutník, J.; Steunebrink, B.R.; Schmidhuber, J. LSTM: A search space odyssey. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 2222–2232.
- Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. Int. Conf. Mach. Learn. PMLR 2016, 48, 1928–1937.
- Siami-Namini, S.; Tavakoli, N.; Namin, A.S. The performance of LSTM and BiLSTM in forecasting time series. In Proceedings of the 2019 IEEE International Conference on Big Data, Los Angeles, CA, USA, 9–12 December 2019; pp. 3285–3292.
- Nosouhian, S.; Nosouhian, F.; Khoshouei, A.K. A review of recurrent neural network architecture for sequence learning: Comparison between LSTM and GRU. Preprint 2021.
- Kossi, K.; Coulombe, S.; Desrosiers, C. No-reference video quality assessment using transformers and attention recurrent networks. IEEE Access 2024, 12, 140671–140680.
- Xing, F.; Wang, Y.G.; Tang, W.; Zhu, G.; Kwong, S. StarVQA+: Co-training space-time attention for video quality assessment. arXiv 2023, arXiv:2306.12298.
- Lv, G.; Wu, Q.; Wang, W.; Li, Z.; Xie, G. Lumos: Towards better video streaming QoE through accurate throughput prediction. In Proceedings of the IEEE INFOCOM 2022—IEEE Conference on Computer Communications, Virtual, 2–5 May 2022; pp. 650–659.
No. | Name | Description | Version |
---|---|---|---|
1 | Ubuntu | Linux OS | 18.04 |
2 | Anaconda | Package/environment manager | 5.0.0 |
3 | Python | Programming language | 3.7.10 |
4 | PyTorch | Deep learning framework | 1.4.0 |
5 | PySyft | Federated learning library | 0.2.4 |
6 | TFlearn | Deep learning framework | 0.5.0 |
7 | TensorFlow | Deep learning framework | 2.4.0 |
8 | Nginx | Video server | 1.16.1 |
9 | Quiche | QUIC protocol implementation | 0.10.0 |
10 | Firefox | Video client | 94.0 |
11 | CPU | Central Processing Unit | i7-12700H |
12 | GPU | NVIDIA GeForce RTX graphics card | 4060 |
Method | Average QoE | Buffering | Smoothness | Bitrate |
---|---|---|---|---|
MPC | 42.7 | 3.21 | 2.8 | 5.3 |
BOLA | 39.8 | 2.67 | 3.1 | 4.9 |
Pensieve | 41.8 | 2.38 | 2.9 | 5.4 |
Cada | 43.7 | 2.21 | 2.6 | 5.7 |
Version | Preference Net | Multi-Process | Average QoE |
---|---|---|---|
Pensieve-PPO | No | No | 40.2 |
Cada-1 | Yes | No | 44.6 |
Cada-2 | Yes | Yes | 45.9 |