1. Introduction
Recent advancements in machine learning have facilitated the interaction between robots and humans, enabling robots to offer adept services in diverse applications, such as autonomous driving systems [1] and collaborative assembly in smart factories [2]. Given these developments, it is crucial that robotic systems dynamically adapt to the preferences of the users involved in these interactions. Recent approaches in robotics have employed preference-based learning (PBL) to learn user preferences [3,4,5,6,7,8,9]. In PBL, human preferences are modeled as a reward function, which is learned by presenting users with diverse robot behaviors and having them select the preferred ones. The robot's behavior can then be tailored to align with the user's preference by maximizing the learned reward model.
While most existing methods [3,4,5,6,7,8,9] have successfully captured user preferences in an online manner, a clear limitation arises from the presumption of a stationary reward function for modeling user preference. Notably, user preferences may evolve while interacting with robotic systems [10,11]. Hence, robots must adeptly adjust to evolving preferences to maintain appropriate behavior. For instance, initial encounters with service robots may prompt users to favor cautious behavior due to unfamiliarity, but as users grow accustomed to the robot's presence, preferences often shift towards more task-specific behaviors, as depicted in Figure 1. Beyond this scenario, user preferences can evolve due to diverse factors, including trends, emotions, and age. To address these dynamic preferences, we present a method capable of adapting to evolving user inclinations.
The challenge of adapting to dynamic user preferences can be addressed with a non-stationary bandit framework [12]. The bandit framework [13] is extensively utilized to optimize decision-making under uncertain and unknown rewards, where an agent selects from a set of options, each providing stochastic rewards. The primary goal of this framework is to maximize the cumulative reward over time, thereby finding an optimal option even without explicit knowledge of the rewards. This framework strategically balances the selection of high-potential queries against those already known to align with user preferences, enabling efficient and adaptive query selection for acquiring time-varying reward functions.
In this paper, we propose a novel preference-based learning method, called discounted preference bandits (DPBs), to address time-varying preferences. First, our algorithm is inherently adaptive to time-varying environments, as it updates parameters based on a penalized likelihood. Second, we theoretically demonstrate no-regret convergence for the proposed method. In simulation, the proposed method outperforms existing methods [3,5,6,7] in terms of cosine similarity, simple regret, and cumulative regret in time-varying scenarios. Finally, simulation and real-world user studies confirm that the proposed method successfully adapts to time-varying scenarios, especially with respect to robot behavior adaptation and environmental changes.
  4. Methods
We present a novel discounted preference bandit (DPB) to estimate time-varying preferences with a minimum number of queries. First, we newly define a context vector as the difference of the feature vectors of two trajectories $\xi_A$ and $\xi_B$, i.e., $x = \phi(\xi_A) - \phi(\xi_B)$, where $\phi(\cdot)$ denotes the feature vector of a trajectory; this context vector represents the information of the comparison. Let $\mathcal{X}$ be the set of possible context vectors converted from all trajectory pairs. Then, the query selection problem is converted into choosing a proper query vector in $\mathcal{X}$. Furthermore, based on $x$, the probabilistic model in (2) can be converted into the following logistic distribution,

$$ P(y_t = 1 \mid x_t, \theta_t^*) = \mu(x_t^\top \theta_t^*) = \frac{1}{1 + \exp(-x_t^\top \theta_t^*)}, \quad (3) $$

where $x_t$ indicates the context vector of the two trajectories compared at round $t$ and $y_t$ is the corresponding preference label. By introducing the context vector $x$, PBL is reduced to an online learning problem. Algorithm 1 shows the online learning process of DPB for acquiring time-varying reward functions. In each round $t$, DPB selects a batch of queries, as outlined in line 4. The human then observes these queries and provides a preference label for each query, as described in line 5. Finally, the parameter is updated with the collected preference data, as shown in line 6.
      
Algorithm 1 Discounted Preference Bandits (DPBs)
Require: discount factor $\gamma$, regularization coefficient $\lambda$, batch size $b$
Ensure: estimated parameter $\hat{\theta}_t$
1: Initialize $V_0 \leftarrow \lambda I$, $\hat{\theta}_0$, and the preference dataset
2: while $t \le T$ do
3:   Set $\alpha_t$ as in Theorem 1
4:   Select the top-$b$ queries from (4)
5:   Demonstrate the queries and collect the preference labels
6:   Estimate $\hat{\theta}_t$ by solving (5) and update $V_t$
7: end while
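To make the control flow of Algorithm 1 concrete, the following is a minimal Python sketch of the online loop under several simplifying assumptions: the user is simulated with a logistic response model, the scale parameter is held fixed instead of being set by Theorem 1, and the parameter update is a plain gradient step rather than the full discounted estimator of Section 4.2. All names (candidates, true_theta_at, answer) are illustrative rather than part of the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d, K, T, b = 4, 200, 50, 5          # feature dim, candidate pairs, rounds, batch size
gamma, lam, alpha = 0.95, 1.0, 1.0  # discount, regularization, UCB scale (fixed here)

# Candidate context vectors: differences of feature vectors of random trajectory pairs.
candidates = rng.normal(size=(K, d))

def true_theta_at(t):
    """Illustrative time-varying preference: rotates slowly over rounds."""
    w = np.array([np.cos(0.02 * t), np.sin(0.02 * t), 0.5, -0.5])
    return w / np.linalg.norm(w)

def answer(x, theta):
    """Synthetic user: label 1 with probability sigmoid(x . theta)."""
    p = 1.0 / (1.0 + np.exp(-x @ theta))
    return int(rng.random() < p)

V = lam * np.eye(d)                 # discounted design matrix
theta_hat = np.zeros(d)
data = []                           # (context, label, round) history

for t in range(T):
    # Line 4: UCB score = |estimated reward gap| + scaled uncertainty, pick top-b.
    V_inv = np.linalg.inv(V)
    scores = np.abs(candidates @ theta_hat) + alpha * np.sqrt(
        np.einsum("ij,jk,ik->i", candidates, V_inv, candidates))
    batch = candidates[np.argsort(scores)[-b:]]

    # Line 5: show the queries and collect preference labels.
    theta_star = true_theta_at(t)
    labels = [answer(x, theta_star) for x in batch]
    data.extend((x, y, t) for x, y in zip(batch, labels))

    # Line 6: discount old statistics and refit (simple gradient step on the new
    # batch here; a full penalized-likelihood solver is sketched in Section 4.2).
    V = gamma * V + sum(np.outer(x, x) for x in batch) + (1 - gamma) * lam * np.eye(d)
    for x, y in zip(batch, labels):
        p = 1.0 / (1.0 + np.exp(-x @ theta_hat))
        theta_hat += 0.1 * ((y - p) * x - 0.01 * lam * theta_hat)

print("final estimate:", np.round(theta_hat, 3))
```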
  4.1. Absolute Upper Confidence Bound
Suppose that the parameter $\hat{\theta}_t$ has been estimated, as will be explained later. At round $t$, DPB chooses an action based on the following action selection rule,

$$ x_t = \arg\max_{x \in \mathcal{X}} \left[ \left| x^\top \hat{\theta}_t \right| + \alpha_t \| x \|_{V_t^{-1}} \right], \quad (4) $$

where $V_t = \sum_{s=1}^{t-1} \gamma^{t-1-s} x_s x_s^\top + \lambda I$, $x_s$ is a past query vector from round $1$ to $t-1$, $\lambda$ is a regularization coefficient, $I$ is an identity matrix, and $\alpha_t$ is a scale parameter that controls the importance between the first and the second term.
The first term indicates the absolute difference of the estimated rewards between the two trajectories. Since we construct a context vector from two trajectories, $x$ and $-x$ contain the same information, i.e., $x = \phi(\xi_A) - \phi(\xi_B)$ and $-x = \phi(\xi_B) - \phi(\xi_A)$. Hence, computing the reward via the absolute value ensures the equivalence between selecting $x$ and $-x$. Based on this trick, the query selection method (4) employs the upper confidence bound (UCB) [20]. The second term in (4) represents the confidence bound, which quantifies the uncertainty of the first term $|x^\top \hat{\theta}_t|$. In particular, $V_t$, called the design matrix, embodies the empirical covariance of the observed context vectors, and the parameter $\gamma$ in $V_t$ is a discount factor in $(0,1)$ that penalizes the effect of past data. Intuitively, as $t$ grows, additional query vectors are added into $V_t$, thereby augmenting the minimum eigenvalue of $V_t$ and, thus, diminishing the term $\|x\|_{V_t^{-1}}$. Consequently, the confidence bound eventually decreases.
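The following short sketch illustrates this behavior of the confidence term, using the discounted design matrix as reconstructed above: as more query vectors accumulate, the width $\|x\|_{V_t^{-1}}$ of a fixed candidate shrinks (and, for a fixed discount factor, eventually saturates). The function names are illustrative placeholders.

```python
import numpy as np

def discounted_design_matrix(past_queries, gamma, lam):
    """V_t = sum_s gamma^(t-1-s) x_s x_s^T + lam * I, with older queries down-weighted."""
    t, d = past_queries.shape
    V = lam * np.eye(d)
    for s, x in enumerate(past_queries):
        V += gamma ** (t - 1 - s) * np.outer(x, x)
    return V

def confidence_width(x, V):
    """||x||_{V^{-1}}: the uncertainty term that alpha_t scales in the selection rule."""
    return float(np.sqrt(x @ np.linalg.solve(V, x)))

rng = np.random.default_rng(1)
x = rng.normal(size=4)

# As more query vectors accumulate, the minimum eigenvalue of V_t grows and the
# confidence width of the fixed candidate x shrinks.
for t in (10, 100, 1000):
    past = rng.normal(size=(t, 4))
    V = discounted_design_matrix(past, gamma=0.99, lam=1.0)
    print(t, round(confidence_width(x, V), 4))
```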
Our proposed query selection method simultaneously considers two key factors by leveraging UCB. The first factor evaluates how much the chosen query contributes to learning the relevant parameters. The second factor considers how much the user will like the query when it is presented as a demonstration. While conventional approaches focus on the first factor [4,5,6,7], the proposed approach incorporates the second factor by applying the UCB method. Considering the quality of the queries experienced by the user [5] is vital because it may help build user familiarity and trust with the robot; if users consistently encounter undesirable queries, it might lead to mistrust in the robot's behavior. Therefore, our approach carefully balances these factors, adjusting the trade-off between them via the scale parameter $\alpha_t$, which will be further examined in Section 5.
While the query selection rule selects a single query, choosing a batch of queries is more efficient in practice. In PBL, extended durations for query generation and parameter updates can challenge users, particularly those who are less patient. Thus, we adopt a simple batched version by selecting the top $b$ queries based on the UCB score (4), where $b$ is the number of queries in a single batch. To approximate the solution of (4), we prepare a finite set of trajectory pairs by randomly generating trajectories and pairing them. The feature vectors derived from this set are used to compute (4) and to select the top-$b$ queries from the finite set.
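A possible rendering of this batched approximation is sketched below: a finite set of context vectors is formed by randomly pairing trajectory feature vectors, each pair is scored with the rule in (4), and the $b$ highest-scoring pairs are returned. The names (select_batch, trajectory_features) and the random-pairing scheme are assumptions for illustration.

```python
import numpy as np

def select_batch(trajectory_features, theta_hat, V, alpha, b, n_pairs, rng):
    """Score random trajectory pairs with the UCB rule and return the top-b pairs."""
    K = trajectory_features.shape[0]
    idx_a = rng.integers(0, K, size=n_pairs)
    idx_b = rng.integers(0, K, size=n_pairs)
    contexts = trajectory_features[idx_a] - trajectory_features[idx_b]  # x = phi(A) - phi(B)

    V_inv = np.linalg.inv(V)
    scores = (np.abs(contexts @ theta_hat)
              + alpha * np.sqrt(np.einsum("ij,jk,ik->i", contexts, V_inv, contexts)))
    # Pairs with identical indices have a zero context and score, so they are never chosen.
    top = np.argsort(scores)[-b:][::-1]                                 # best first
    return list(zip(idx_a[top], idx_b[top])), contexts[top]

rng = np.random.default_rng(2)
features = rng.normal(size=(500, 4))     # features of 500 randomly generated trajectories
V = np.eye(4)
pairs, contexts = select_batch(features, np.array([1.0, 0.0, -0.5, 0.2]), V,
                               alpha=1.0, b=5, n_pairs=2000, rng=rng)
print(pairs)
```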
  4.2. Discounted Parameter Estimation
After selecting a query and receiving its label, the parameter of the user preference is estimated while accounting for changes over time. Suppose that $t$ data points $D_t = \{(x_s, y_s)\}_{s=1}^{t}$ are given. We can estimate $\hat{\theta}_t$ by using the discounted maximum log-likelihood scheme [12] as follows,

$$ \hat{\theta}_t = \arg\min_{\theta} \; -\sum_{s=1}^{t} \gamma^{t-s} \log P(y_s \mid x_s, \theta) + \frac{\lambda}{2} \|\theta\|_2^2, \quad (5) $$

where $\gamma$ is a discount factor in $(0,1)$. This discounted negative log-likelihood (5) intuitively shows that the parameter $\hat{\theta}_t$ is learned to track the most recent optimal parameter $\theta_t^*$, which changes over time. Note that the minimizer of (5) satisfies $\sum_{s=1}^{t} \gamma^{t-s} \big( \mu(x_s^\top \hat{\theta}_t) - y_s \big) x_s + \lambda \hat{\theta}_t = 0$, which makes the gradient of (5) equal to zero.
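As one way to realize this estimator, the sketch below minimizes the discounted, L2-regularized logistic negative log-likelihood, as reconstructed in (5), with SciPy's BFGS solver. It assumes preference labels $y_s \in \{0, 1\}$; the function name fit_discounted_logistic and the toy data are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def fit_discounted_logistic(X, y, gamma, lam, theta0=None):
    """Minimize -sum_s gamma^(t-s) log P(y_s | x_s, theta) + (lam/2) ||theta||^2."""
    t, d = X.shape
    w = gamma ** np.arange(t - 1, -1, -1)       # weight gamma^(t-s); newest data weighted 1
    if theta0 is None:
        theta0 = np.zeros(d)

    def objective(theta):
        z = X @ theta
        log_p = -np.logaddexp(0.0, -z)          # log sigmoid(z), numerically stable
        log_1mp = -np.logaddexp(0.0, z)         # log (1 - sigmoid(z))
        ll = w * (y * log_p + (1 - y) * log_1mp)
        return -ll.sum() + 0.5 * lam * theta @ theta

    def grad(theta):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        return X.T @ (w * (p - y)) + lam * theta

    res = minimize(objective, theta0, jac=grad, method="BFGS")
    return res.x

# Toy check: data generated from a drifting parameter; recent rounds dominate the fit.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
theta_old, theta_new = np.array([1, 0, 0, 0.0]), np.array([0, 1, 0, 0.0])
true = np.vstack([np.tile(theta_old, (150, 1)), np.tile(theta_new, (150, 1))])
y = (rng.random(300) < 1 / (1 + np.exp(-np.einsum("ij,ij->i", X, true)))).astype(float)
print(np.round(fit_discounted_logistic(X, y, gamma=0.97, lam=1.0), 2))
```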
  5. Theoretical Analysis
In this section, we analyze the cumulative regret of the proposed method. The cumulative regret is defined as

$$ R_T = \sum_{t=1}^{T} \left[ \mu\!\left(x_t^{*\top} \theta_t^*\right) - \mu\!\left(x_t^{\top} \theta_t^*\right) \right], $$

where $\mu$ is a logistic function, i.e., $\mu(z) = 1/(1+\exp(-z))$, $x_t$ is the query selected at round $t$, and $T$ is the number of iterations. $x_t^*$ indicates the optimal query that contains the optimal trajectory, i.e., $x_t^* = \arg\max_{x \in \mathcal{X}} x^\top \theta_t^*$. The cumulative regret is widely employed in bandit settings as a measure to assess the efficiency of exploration methods [13]. We then prove that our method has sub-linear regret under a mild assumption on the number of preference changes. In other words, our theoretical results tell us that the proposed method efficiently adapts to the time-varying parameters. First, we introduce the assumptions.
Assumption 1. For all $t$, $x \in \mathcal{X}$, and $\theta_t^*$, there exist constants $D$ and $S$ such that $\|x\|_2 \le D$ holds and $\|\theta_t^*\|_2 \le S$ holds.
 Assumption 2. Let $\Gamma_T$ be the number of changing points. Assume that $\theta_t^*$ changes up to $\Gamma_T$ times during the $T$ rounds.
 Initially, we make Assumption 1 that both the feature vectors and the parameters are bounded. Assumption 2 states that the user parameter changes discretely up to $\Gamma_T$ times. Note that $\Gamma_T = T$ indicates the most volatile user, and $\Gamma_T = 0$ indicates a stationary user. Furthermore, we define a lower bound on the derivative of the logistic function.
Definition 1. For the logistic function μ, there exists a positive constant $\kappa$ such that $\dot{\mu}(x^\top \theta) \ge \kappa$ for all $x \in \mathcal{X}$ and all $\theta$ with $\|\theta\|_2 \le S$. Note that $\kappa$ always exists for bounded θ and x.
 Now, the set of time indices used for analysis is defined.
Definition 2. For fixed γ, let us define  and define an index set as .
For $t$ in this index set, we can find an interval in which $\theta_t^*$ does not change; in other words, $\theta_t^*$ is fixed for a sufficient number of rounds. We then prove that the proposed method can adapt to $\theta_t^*$ within such intervals and, hence, the proposed method is no-regret. We first derive the confidence bound of the estimated parameter $\hat{\theta}_t$ as follows.
Theorem 1. Suppose that Assumptions 1 and 2 hold, and consider the gap between the estimate $\hat{\theta}_t$ and the true parameter $\theta_t^*$. For all $t$ in the index set of Definition 2, the following inequality on this gap holds with probability at least $1 - \delta$.

Theorem 1 is used to compute the confidence bound of the estimated parameter, i.e., the scale parameter $\alpha_t$ in (4). By using Theorem 1, we can derive the regret bound of the proposed DPB. The detailed proof of Theorem 1 can be found in Appendix A.
Theorem 2. Suppose that Assumptions 1 and 2 hold; then, for a fixed $\gamma$, with probability at least $1 - \delta$, the regret of DPB is bounded as follows.

If this bound is sub-linear with respect to $T$, then the proposed DPB is called no-regret. However, the sub-linearity of the bound depends on $\Gamma_T$. In particular, if $\Gamma_T$ grows sufficiently slowly with respect to $T$, then DPB eventually converges to the time-varying user preferences. This result also reveals a theoretical limitation of the proposed method, since it cannot overcome the time-varying tendency of $\theta_t^*$ if $\Gamma_T$ grows too quickly with $T$. This regret bound is the first such result for preference-based learning in time-varying settings.
  6. Experimental Settings
Simulation Setup. We validate our work in three simulation environments: Driver [1], Tosser [21], and Avoiding. The Driver environment aims to drive while staying aware of the other vehicle, and the Tosser environment learns to toss a ball into a certain basket with diverse trajectories. The features utilized are identical to those in [5]: for Driver, the distance to the closest lane, the speed, the heading angle, and the distance to the other vehicle; for Tosser, the maximum horizontal range, the maximum altitude, the sum of angular displacements at each timestep, and the final distance to the closest basket. We newly created the Avoiding environment, in which the robot moves an object over a laptop to a final target pose, similarly to [22]. Four hand-coded features are utilized in Avoiding: the height of the end-effector from the table, the distance between the end-effector and the laptop, the moving distance, and the distance between the end-effector and the user. Optimal parameters were randomly generated for each seed.
Dataset. To discretize the trajectory space, a query set is predefined in Driver and Tosser by sampling K trajectories with uniformly random controls. In Avoiding, RRT* is used to create trajectories after randomly sampling a midpoint to pass through between the fixed start and target points. We set K to 20,000 for Driver and Tosser, and to 5000 for Avoiding.
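As a rough illustration of how such a discretized query set could be constructed, the sketch below samples random control sequences, rolls them out through a placeholder dynamics model, and stores the resulting feature vectors. The rollout and features functions are stand-ins for the environment-specific dynamics and hand-coded features described above, not the actual Driver, Tosser, or Avoiding implementations.

```python
import numpy as np

def rollout(controls, x0=(0.0, 0.0)):
    """Placeholder double-integrator-style dynamics; returns the visited states."""
    states = [np.asarray(x0, dtype=float)]
    v = np.zeros(2)
    for u in controls:
        v = 0.9 * v + 0.1 * u
        states.append(states[-1] + v)
    return np.array(states)

def features(states):
    """Placeholder 4-dimensional feature vector of a trajectory."""
    return np.array([
        states[:, 1].max(),                       # e.g., maximum altitude
        np.abs(np.diff(states[:, 0])).sum(),      # e.g., horizontal distance travelled
        np.linalg.norm(states[-1] - states[0]),   # e.g., displacement from the start
        len(states) * 0.01,                       # e.g., (scaled) duration
    ])

rng = np.random.default_rng(5)
K, horizon = 1000, 30
query_set = np.array([
    features(rollout(rng.uniform(-1.0, 1.0, size=(horizon, 2)))) for _ in range(K)
])
print(query_set.shape)    # (K, 4): one feature vector per sampled trajectory
```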
Evaluation Metrics. In our experiments, we use the following three metrics: the cosine similarity, the simple regret, and the cumulative regret $R_T$. First, the cosine similarity $\cos(\hat{\theta}_t, \theta_t^*) = \hat{\theta}_t^\top \theta_t^* / (\|\hat{\theta}_t\| \|\theta_t^*\|)$ is measured as the alignment metric used in most existing research [4,5,6]. The simple regret is defined as $\phi(\xi_t^*)^\top \theta_t^* - \phi(\hat{\xi}_t)^\top \theta_t^*$, where $\xi_t^*$ is the optimal trajectory of the true parameter and $\hat{\xi}_t$ is the optimal trajectory of the learned parameter, i.e., $\hat{\xi}_t = \arg\max_{\xi} \phi(\xi)^\top \hat{\theta}_t$. The quality of the optimized trajectory can be measured by the simple regret, with smaller values indicating better performance. The cumulative regret $R_T$, defined in Section 5, illustrates how much reward is lost through exploration. Minimizing the cumulative regret is often the objective of the bandit framework.
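For concreteness, the three metrics can be computed as in the short sketch below, following the definitions given above (including the reconstructed forms of the simple and cumulative regret). The feature matrices, parameter vectors, and chosen queries are synthetic placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cosine_similarity(theta_hat, theta_star):
    return float(theta_hat @ theta_star
                 / (np.linalg.norm(theta_hat) * np.linalg.norm(theta_star)))

def simple_regret(features, theta_hat, theta_star):
    """Reward gap, under the true parameter, between the truly optimal trajectory
    and the trajectory that is optimal under the learned parameter."""
    best_true = features[np.argmax(features @ theta_star)]
    best_learned = features[np.argmax(features @ theta_hat)]
    return float((best_true - best_learned) @ theta_star)

def cumulative_regret(chosen_contexts, candidate_contexts, true_thetas):
    """Sum over rounds of mu(x_t*^T theta_t*) - mu(x_t^T theta_t*), where x_t* is
    the best candidate under the (time-varying) true parameter of that round."""
    total = 0.0
    for x_t, cands, theta_t in zip(chosen_contexts, candidate_contexts, true_thetas):
        total += sigmoid(np.max(cands @ theta_t)) - sigmoid(x_t @ theta_t)
    return total

rng = np.random.default_rng(6)
T, d = 20, 4
features = rng.normal(size=(1000, d))                  # candidate trajectory features
theta_star = np.array([0.6, -0.3, 0.7, 0.2])
theta_hat = theta_star + 0.3 * rng.normal(size=d)
cands = [rng.normal(size=(50, d)) for _ in range(T)]   # per-round candidate contexts
chosen = [c[rng.integers(50)] for c in cands]          # queries picked by some policy
print(cosine_similarity(theta_hat, theta_star),
      simple_regret(features, theta_hat, theta_star),
      cumulative_regret(chosen, cands, [theta_star] * T))
```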
Baselines. We compare the performance of DPB with methods using different criteria for query selection, namely batch active learning [5,7], information gain [6], and maximum regret [3]. All baseline algorithms were adapted into batch-selection versions that select the top-$b$ queries to ensure a fair comparison.
      
   7. Simulation Results
We validated the superiority of DPB in three different preference-change scenarios: smooth preference changes in Section 7.1, abrupt preference changes in Section 7.2, and static preferences in Section 7.3. The experiments in this section were conducted using synthetic data; the results from real-world scenarios involving users and physical robots will be presented in Section 8.
  7.1. Performance on Smooth Preference Changes
To simulate smooth preference changes, we randomly select two parameters within a proper range and linearly interpolate between them by dividing the interval into 10 points. The parameter is changed every 30 rounds, making a total of nine changes. After reaching the final parameter, 120 additional rounds are executed; hence, 390 rounds are conducted in total.
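The smoothly changing parameter schedule described above can be generated as in the following sketch: two endpoint parameters are drawn and linearly interpolated into 10 points; the parameter changes every 30 rounds (nine changes), and the final value is then held for 120 additional rounds, giving 390 rounds in total. The sampling range and dimensionality are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
d = 4
theta_start = rng.uniform(-1.0, 1.0, size=d)
theta_end = rng.uniform(-1.0, 1.0, size=d)

# 10 interpolation points; the parameter changes every 30 rounds (9 changes),
# and the final value is held for 120 additional rounds (rounds 270-389).
points = np.linspace(0.0, 1.0, 10)
schedule = [(1 - a) * theta_start + a * theta_end for a in points]
per_round = [schedule[min(t // 30, 9)] for t in range(390)]

assert len(per_round) == 390
changes = sum(not np.allclose(per_round[t], per_round[t - 1]) for t in range(1, 390))
print("total parameter changes:", changes)   # 9
```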
Each row in Figure 2 shows the performance of each algorithm in terms of the cosine similarity, the simple regret, and the cumulative regret, respectively. In Figure 2c, DPB clearly outperforms the baselines on the cumulative regret, since the other baselines do not take the cumulative regret into account. Lower values of the cumulative regret indicate that the proposed query selection rule effectively balances the trade-off between the user's preference for the presented trajectories and their associated uncertainties, as discussed in Section 4.1. This balance enables the generation of high-quality queries, which are well suited for eliciting meaningful user feedback in real-world scenarios with smooth preference changes. Regarding the cosine similarity in Figure 2a, it can be observed that DPB adapts to smooth parameter changes faster than the other algorithms in terms of parameter estimation. We presume that the superior performance of DPB derives from the effect of discounting past data. Finally, the DPB algorithm also outperforms the baselines with respect to the simple regret, as demonstrated in Figure 2b. Thus, in scenarios characterized by smoothly varying preferences, the DPB method generates well-suited queries, facilitating superior estimation of reward parameters and optimal trajectories compared to the baseline approaches.
Interestingly, the baseline algorithms adapt poorly in Avoiding. We believe that this effect is correlated with how abruptly the parameters change. Since the optimal parameters are linearly interpolated, the deviation between consecutive optimal parameters is the same at every parameter change. The average deviation over the seeds was computed for each environment; in Avoiding, the optimal parameter changes comparatively drastically, resulting in poor performance of the baseline algorithms. Furthermore, the limited adaptability of the maximum regret algorithm in time-varying scenarios might stem from its query selection rule over the solution space. This algorithm tends to converge towards local optima due to its greedy selection, which restricts the trajectories to be compared: it selects the two trajectories exhibiting the highest estimated regret based on parameters obtained from Markov chain Monte Carlo. Nevertheless, the results of all simulations support that DPB adapts to smooth preference changes faster than the baselines.
  7.2. Performance on Abrupt Preference Changes
To analyze scenarios involving more realistic changes in human preferences, which are unlikely to be smooth, the simulations are also designed to accommodate abrupt changes in preference. For abrupt preference changes, we conducted experiments in Driver involving two significant alterations of the optimal parameter, at rounds 100 and 200. Figure 3c similarly demonstrates that DPB achieves sub-linear growth of the cumulative regret. While the other algorithms do not adapt well to sudden changes in preference, Figure 3a,b show that DPB adapts to abrupt changes in preference. A comparative analysis of the results presented in Figure 2 and Figure 3 reveals that the DPB method consistently outperforms the baseline approaches, particularly in scenarios involving abrupt and realistic changes in human preferences. These findings underscore the effectiveness of DPB in addressing dynamic preference scenarios.
In the case of the max regret algorithm, the cosine similarity and the simple regret struggle to adapt to the changing parameters; they exhibit almost unchanged behavior even when the user preference shifts. This issue strongly aligns with the findings from the Avoiding task illustrated in Figure 2. We attribute this limitation to the inherent characteristics of the max regret algorithm, which has difficulty adjusting to abrupt changes in preferences and tends to converge to local optima.
  7.3. Sanity Check on Static Preferences
To verify the performance of the baseline algorithms, Figure 4 presents the results in Avoiding under the conventional static human preference scenario; i.e., the parameter $\theta^*$ does not change over time. We opted for the Avoiding environment due to its relatively inferior performance compared to the environments discussed in Section 7.1 and Section 7.2. The optimal parameter $\theta^*$ is identical to that used in the simulation experiments in Section 7.1. In Figure 4a,b, we can observe that DPB converges faster than the other baselines, except for the information gain method, which shows a similar convergence speed. The results indicate that DPB also performs strongly in scenarios with conventional static preferences: the algorithm effectively estimates the preference parameters while identifying optimal trajectories that align with user preferences. Its adaptability to both time-varying and static preferences highlights the practical capability of DPB to accurately estimate preference parameters and generate user-preferred trajectories, reinforcing its applicability in diverse real-world scenarios. Moreover, Figure 4c also supports that DPB effectively minimizes the cumulative regret, and the generated queries demonstrate superior quality in scenarios with static preferences as well. It is noteworthy that the maximum regret algorithm in Figure 4 shows reasonable performance in the static setting, unlike in the time-varying settings shown in Figure 2 and Figure 3.
  9. Conclusions
Our proposed algorithm introduces a novel approach to addressing time-varying preferences using a discounted likelihood. Our theoretical analysis establishes that DPB attains sub-linear cumulative regret when the number of preference changes grows sufficiently slowly with the number of rounds. Experimental outcomes highlight the adaptiveness of our framework compared to previous methods in handling time-varying user preferences. In particular, our DPB method effectively minimizes the cumulative regret, while other approaches struggle in this regard. User studies further validate the competitiveness of DPB in time-varying environments with environmental changes and robot behavior adaptation over repeated interactions.
In robotics, addressing the time-varying preferences of users is crucial, as their expectations can shift depending on the context or environment. Robotic systems are required to perform a wide range of actions across diverse scenarios, adapting their behavior to evolving human needs. For instance, in autonomous driving systems, the DPB method enables vehicles to modify their driving styles—such as acceleration, braking, or lane-changing. Passengers preferring a smoother, more conservative ride may find the vehicle adjusting its behavior accordingly, while those prioritizing efficiency could benefit from optimized travel times. This continuous learning and adaptation enhance the personalization and comfort of autonomous systems. It would be interesting to consider additional factors such as emotions or social contexts that influence preference changes. Exploring these aspects and developing a unified model capable of perceiving user states and learning preferences across diverse users represent promising directions for future research.