Article

Enhancing Agent-Based Negotiation Strategies via Transfer Learning

1 School of Information Science and Engineering, Chongqing Jiaotong University, Chongqing 400074, China
2 College of Intelligence and Computing, Tianjin University, Tianjin 300072, China
3 Department of Advanced Computing Sciences, Maastricht University, 6200 MD Maastricht, The Netherlands
* Author to whom correspondence should be addressed.
Electronics 2025, 14(17), 3391; https://doi.org/10.3390/electronics14173391
Submission received: 11 May 2025 / Revised: 15 August 2025 / Accepted: 17 August 2025 / Published: 26 August 2025
(This article belongs to the Special Issue Advancements in Autonomous Agents and Multi-Agent Systems)

Abstract

While negotiating agents have achieved remarkable success, one critical challenge that remains unresolved is the inherent inefficiency of learning negotiation strategies from scratch when facing previously unseen opponents. To address this limitation, Transfer Learning (TL) emerges as a promising solution, leveraging knowledge acquired from prior tasks to accelerate learning and enhance adaptability in new negotiation contexts. This study introduces the Transfer Learning-based Negotiating Agent (TLNAgent), a novel framework enabling autonomous negotiating agents to systematically leverage knowledge from pretrained source policies. The proposed transfer mechanism not only enhances negotiation performance but also substantially accelerates policy adaptation in unfamiliar negotiation environments. TLNAgent integrates three core components: (1) a negotiation module that interacts with opponents; (2) a critic module that determines whether to activate the transfer process and selects which source policies to transfer; and (3) a transfer module that facilitates knowledge integration between source and target policies. Specifically, the negotiation module interacts with opponents during the negotiation to execute core decision-making processes; in addition, it trains new policies using reinforcement learning. The critic module serves two critical functions: (1) it dynamically triggers the transfer module according to interaction analysis; and (2) it selects the source policies via its adaptation model. The transfer module establishes lateral parameter-level connections between source and target policy networks, facilitating systematic knowledge transfer while ensuring training stability. Empirical findings from our extensive experiments indicate that transfer learning considerably enhances both the efficiency and utility of outcomes in cross-domain negotiation tasks. The proposed framework attains superior performance when compared to the state-of-the-art negotiating agents from the Automated Negotiating Agents Competition (ANAC).

1. Introduction

Negotiation is a strategic interaction aimed at reaching a resolution that is mutually acceptable to all participating parties. Standing as a cornerstone activity in human society, it manifests across diverse contexts, including commerce (buyer–seller transactions), labor markets (employer–employee compensation discussions), and international relations (intergovernmental diplomacy). In recent years, with advancements in Multi-agent Systems (MAS) and Machine Learning (ML) [1,2,3,4], the field of automated negotiation has attracted substantial attention. Automated negotiation entails the deployment of autonomous agents that negotiate on behalf of human negotiators in complex scenarios. These agents are engineered to engage in the negotiation process autonomously, with the objective of attaining a joint agreement that meets the interests of the human counterparts they represent. By harnessing automation and AI technologies, automated negotiation systems have assisted in enhancing the capabilities of human negotiators across diverse domains such as industry and commerce (e.g., [5,6,7,8]).
The automated negotiation research community has developed diverse strategies to guide negotiator behavior. These strategies are broadly classified into two categories: heuristic-based approaches and machine learning–based techniques. Heuristic-based strategies rely on predefined rules that prescribe offer-making and acceptance criteria, providing clear guidelines for negotiators but exhibiting limited adaptability in complex environments. On the other hand, machine learning-based approaches allow negotiators to learn from historical interactions and adapt their strategies by leveraging patterns identified in negotiation datasets.
Recently, drawing inspiration from Deep Reinforcement Learning (DRL) [9,10,11], the utilization of DRL in the context of negotiation has yielded noteworthy achievements [12,13,14,15,16]. Nonetheless, a common limitation of these methods is the requirement to initiate learning from scratch when confronted with unfamiliar opponents, a process that proves to be inefficient and impractical. Current research primarily focuses on training agents to interact with known opponents by leveraging acquired experiential knowledge. However, in real-world scenarios, agents frequently encounter adversaries employing previously unencountered strategies, which can render pretrained strategies suboptimal or obsolete. Consequently, agents face the resource-intensive challenge of developing entirely new policies de novo. Furthermore, in dynamic negotiation environments, agents are often required to interact with various types of opponents, some of which may remain unknown until encountered. This in turn leads to the fundamental question of how to expedite the learning process for novel opponent strategies while concurrently enhancing the performance of previously learned policies.
In this paper, we aim to address this question by leveraging Transfer Learning (TL). TL has emerged as a promising technique capable of accelerating the learning process and enhancing performance on target tasks by utilizing pre-existing knowledge from related scenarios. Negotiating agents can transfer knowledge gained from previously encountered opponents, helping them to adapt rapidly to new negotiation opponents. We introduce a novel Transfer Learning-based Negotiating Agent (TLNAgent), which represents the first integration of both Transfer Learning (TL) and Reinforcement Learning (RL) in an automated negotiation framework. TLNAgent integrates three core components: (1) a base strategy module that governs the negotiation process; (2) a critic module that determines whether to activate the transfer process; and (3) a transfer module that facilitates knowledge integration between source and target policies.
The comprehensive empirical experiments reported in this study demonstrate the effectiveness of TLNAgent. Specifically, we conducted a systematic evaluation of its performance along four dimensions: its ability to learn and adapt when encountering new opponents, its performance against state-of-the-art negotiating agents, the impact of individual components via ablation studies, and its robustness as revealed by empirical game theory analysis of tournament results.
To sum up, this work makes the following key contributions to the field of automated negotiation:
  • TLNAgent is among the first to apply transfer learning in reinforcement learning to negotiation scenarios, systematically identifying critical challenges in cross-task transfer within negotiation frameworks.
  • To address these challenges, this study introduces a detection method to determine whether to activate the transfer module. Leveraging its adaptation model, TLNAgent preliminarily assesses the helpfulness of source policies and selects which source policies to transfer, ensuring that task-relevant knowledge is prioritized.
  • A Gaussian Mixture Model-Universal Background Model (GMM-UBM) is incorporated to dynamically determine the weighting factors for source policies during knowledge transfer. Additionally, the transfer module employs a lateral connection architecture, which demonstrates efficiency and effectiveness in scenarios involving diverse opponents and domains.
  • Empirical validation of the proposed strategy includes comparative evaluations against multiple baselines under corresponding experimental settings. Further, ablation studies are conducted to isolate the contributions of individual components, while game-theoretic analyses are employed to ground the approach in theory.
The remainder of this paper is structured as follows: Section 2 provides a concise overview of key related studies; Section 3 delineates the negotiation settings; Section 4 elaborates on the core components of the proposed approach; Section 5 presents the experimental results; finally, Section 6 summarizes the paper and outlines significant research directions emerging from the presented work.

2. Related Work

2.1. Deep Reinforcement Learning

Over the past few decades, Reinforcement Learning (RL) algorithms [17,18,19] have been extensively applied in automated negotiation. The theoretical foundation of RL traces back to Richard Bellman’s seminal work in the 1950s, which introduced the Bellman equation as a recursive solution for optimal decision-making in Markov Decision Processes (MDPs) [20,21]. This equation formalizes the principle of dynamic programming, enabling agents to balance immediate rewards against long-term cumulative gains through value iteration. Deep Reinforcement Learning (DRL) extends classical reinforcement learning by integrating deep neural networks for function approximation. Modern DRL algorithms (e.g., DQN, PPO) inherit this framework, scaling it to high-dimensional state spaces via deep learning.
Recently, research attention has turned to DRL in the field of automated negotiation. An example is the Deep Q-Network (DQN) algorithm [22,23], which introduces critical advancements over traditional Q-learning. DQN employs an experience replay buffer [24] to decorrelate sequential training samples, mitigating bias arising from the inherently sequential nature of negotiation experiences. Additionally, it stabilizes learning through a target network architecture, using separate neural networks to approximate the action–value function. These features have enabled DQN to learn effective bidding and acceptance strategies in automated negotiation scenarios [17,25,26,27]. However, a notable limitation of DQN lies in information loss caused by the discretization of the action space. This limitation necessitates increased reliance on opponent modeling to generate mutually beneficial offers, highlighting a key challenge for its application in negotiation environments.
To address this issue, policy-based reinforcement learning algorithms are employed in automated negotiation settings. These algorithms present notable advantages when dealing with continuous action spaces. This is of utmost importance in dynamic negotiation environments where a diverse array of actions can be considered. However, the policy gradient method encounters significant difficulties in guaranteeing stable policy updates, which hinders convergence during the training of negotiation strategies. To surmount this limitation, Trust Region Policy Optimization (TRPO) [28] has been proposed, based on the core principles of conservative policy iteration. This foundation enables TRPO to better handle the intricacies of complex data and promotes more stable policy updates within the realm of automated negotiation. Nonetheless, implementing TRPO can be computationally intensive due to the requirement for a hard KL divergence constraint on the update size in each iteration.
To simplify the implementation process while still attaining results similar to those of TRPO, an alternative approach named Proximal Policy Optimization (PPO) [29] has been put forward. PPO integrates a penalty term to regulate the extent of policy updates, thereby circumventing the need for a KL divergence constraint. This method effectively achieves a balance between exploration and exploitation during the learning process, which is essential for adapting to the dynamic nature of negotiation scenarios. Building on this approach, Ref. [30] proposed an end-to-end negotiation strategy that utilizes a multi-issue policy network. This approach aims to simplify the negotiation process by directly mapping inputs to actions within a comprehensive policy framework. While useful, both TRPO and PPO have a significant shortcoming in the form of sample inefficiency. Both of these algorithms require a large volume of interaction with the environment to achieve effective learning. In the early stages of negotiation, this inefficiency can greatly limit their performance, as agents may struggle to rapidly adapt to the negotiation context and make optimal decisions.
Other DRL approaches used in the field of automated negotiation include the Deep Deterministic Policy Gradient (DDPG) algorithm, which is an actor–critic approach employed for automated negotiations [12]. This algorithm has the capacity to learn competitive policies for tasks based on low-dimensional observations. However, DDPG hinges on deterministic policies, meaning that it is not suitable for complex environments marred by noise interference. In such settings, policies must possess a degree of randomness in order to adapt effectively. Another drawback of widely used RL algorithms, including TRPO and PPO, is the need for a substantial number of new samples at each gradient step. This requirement presents a considerable hurdle to learning effective policies, especially in complex negotiation tasks. The Soft Actor–Critic (SAC) algorithm [31,32] has demonstrated high efficacy in surmounting these challenges. SAC integrates off-policy updates with a stable stochastic actor–critic formulation. Moreover, it adopts a maximum entropy framework to augment the standard maximum reward RL objective. Consequently, it has been extensively applied in automated negotiation and has achieved remarkable success [33,34,35]. Given its advantageous properties outlined above, we opt to use SAC as the learning algorithm for TLNAgent.

2.2. Transfer Learning in RL

In recent years, transfer learning has emerged as a prominent research paradigm seeking to address the challenge of sample inefficiency in reinforcement learning. By leveraging prior knowledge from experts or related tasks, TL approaches aim to mitigate this limitation through strategic knowledge transfer, facilitating more efficient learning in target domains [36,37,38]. One prominent transfer technique in RL is policy transfer. This approach centers around capitalizing on the knowledge embedded in policies that have been pretrained on source tasks. To this end, policy distillation is employed to extract knowledge from source policies by minimizing the cross-entropy or KL divergence between the state-conditional action distributions of the source and target policy networks [39,40,41]. Nonetheless, the efficacy of these methods hinges on the extent of similarity between the source and target tasks. In real-world situations, this similarity requirement may not always be satisfied.
Other methods enhance exploration in target environments by reusing source policies [42,43]. For instance, the study in [44] introduced lateral connections. These connections integrate representations from the source network to refine those in the target network. This approach has proven to be effective in transferring knowledge, thereby speeding up the learning process of the target task. The research in [45] proposed to model source policy evaluation as option learning and then extract knowledge from the selected source policy. This effectively prevents negative transfer. More recently, the work in [46] suggested combining different transfer methods, such as policy distillation and instance transfer, to achieve more efficient knowledge transfer. Furthermore, the authors of [47] adaptively transferred knowledge from multiple source policies with different state–action spaces. This results in a more realistic form of transfer compared to traditional transfer learning methods.

3. Negotiation Settings

The negotiation framework comprises three core elements: the negotiation protocol, the negotiation scenario, and the negotiating agents. The negotiation protocol outlines the rules and procedures to which agents must conform throughout the negotiation process. The negotiation scenario encompasses the issues under negotiation and their structural configuration, thereby defining the outcome space, i.e., the set of all possible agreements. Each negotiating agent is endowed with a preference profile encoding its individual priorities across the domain’s possible outcomes along with a strategy that determines its actions in each negotiation round. Collectively, the negotiation protocol, scenario, and agents’ preference profiles constitute the foundational structure of the negotiation system. This study focuses on bilateral multi-issue negotiations involving two agents. To illustrate these concepts, consider a buyer–seller negotiation over a laptop. The negotiable issues (e.g., brand, hard drive capacity, and monitor size) form the domain, the alternating-offer rules define the protocol, and each party’s hidden preferences (e.g., buyer prioritizing low price, seller favoring high-end brands) constitute their profiles. The process is depicted in Figure 1.

3.1. Negotiation Domain

A negotiation domain $D$ is formally defined as a structured collection of negotiable issues $J = \{j_1, j_2, \ldots, j_n\}$. Within commercial transactions, for example, such issues may include product attributes like price, brand, or hard drive capacity in laptop purchases. Let $v_j^k$ denote the value of issue $j$ and let $w_j^i$ be the weighting preference which agent $i$ assigns to issue $j$. The weights of agent $i$ over the issues are normalized to sum to one (i.e., $\sum_{j=1}^{n} w_j^i = 1$). These preferences are determined by the interests of the parties which the agents are acting on behalf of. An offer $O$ is a vector of values $v_j^k$, one for each issue $j$, where $v_j^k$ denotes the $k$-th choice for issue $j$. The utility of an offer for agent $i$ is defined as
$$U_i(O) = \sum_{j=1}^{n} w_j^i \cdot V_j^i(v_j^k),$$
where $V_j^i$ is the evaluation function of agent $i$, mapping the value of an issue $j$ to a real number. A utility function assigns a numerical utility value to each possible outcome. This value quantifies how an agent evaluates that particular outcome.
To achieve consensus, negotiating agents must mutually agree on a concrete alternative $v^k$ for every issue. A valid agreement is formally represented as an outcome vector
$$\omega = \{v_1, v_2, \ldots, v_n\},$$
where $v_i$ denotes the assigned value for the $i$-th issue. The complete outcome space $\Omega$ of a domain constitutes all feasible solutions, expressed as follows:
$$\Omega = \{\omega_1, \ldots, \omega_m\},$$
where $\omega_k$ describes a possible negotiation solution, i.e., a vector of values for all issues. In our laptop example, we have the following: issues: $J$ = {monitor, brand, hard drive}; outcome: $\omega$ = {monitor = 14, brand = HP, hard drive = 512 GB}; outcome space: $\Omega$ includes all combinations (e.g., $\omega_1$: {13.3, Apple, 256 GB}, $\omega_2$: {15.6, Dell, 1 TB}).

3.2. Negotiation Protocol

A negotiation protocol governs the rules and permissible actions during negotiation. It specifies the allowable negotiation moves that can be made at any given point during the negotiation. In this study, we employ the well-established stacked alternating offers protocol, as detailed in [48]. Under this protocol, negotiating agents take turns presenting proposals (hereinafter termed bids or offers). During each negotiation round, an agent may either submit a new offer or accept the proposal put forward by its counterpart. This process continues until one of two conditions is met: (1) a mutually agreeable outcome is reached; or (2) a negotiation session exceeds the given time limit. When applied to the laptop example, the buyer can start with offer $\omega_1$, then the seller can counter with $\omega_2$ or accept $\omega_1$ according to its own decision-making mechanism. This process repeats until agreement (e.g., $\omega_{agreed}$ = {14, HP, 512 GB}) or timeout.

3.3. Negotiation Profiles

Negotiating agents are equipped with private preference profiles, which are characterized by the weighting preferences $w_j^i$ in Equation (1). These profiles establish a preference order for ranking outcomes within the outcome space. Alongside the negotiation domain, the preference profile constitutes the negotiation scenario. Within this framework, an incomplete-information scenario arises, in which agents participate in strategic interactions without knowledge of their opponents' preferences and strategies. Although a preference profile could in theory be expressed as a list of ordering relations, the literature typically uses utility functions to formalize preferences in order to simplify the process. Taking our laptop example, the utility functions for the buyer and seller can be described as shown below.
$$U_b(\omega) = 0.5 \cdot monitor_{score} + 0.3 \cdot brand_{score} + 0.2 \cdot harddisk_{score}$$
$$U_s(\omega) = 0.3 \cdot monitor_{score} + 0.4 \cdot brand_{score} + 0.3 \cdot harddisk_{score}$$
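To make the additive utility of Equation (1) concrete, the following minimal Python sketch evaluates the buyer profile above on a sample offer; the per-value evaluation scores are hypothetical placeholders chosen for illustration, not values taken from the paper.

```python
# A minimal sketch of Equation (1): additive linear utility over issues.
# The issue weights follow the buyer profile above; the per-value evaluation
# scores V_j are hypothetical placeholders chosen for illustration.

def utility(offer, weights, evaluations):
    """U(O) = sum_j w_j * V_j(v_j) for an additive preference profile."""
    return sum(weights[j] * evaluations[j][offer[j]] for j in weights)

buyer_weights = {"monitor": 0.5, "brand": 0.3, "hard drive": 0.2}
buyer_eval = {  # hypothetical normalized evaluation functions V_j in [0, 1]
    "monitor":    {"13.3": 0.6, "14": 1.0, "15.6": 0.4},
    "brand":      {"Apple": 0.3, "HP": 1.0, "Dell": 0.7},
    "hard drive": {"256 GB": 0.2, "512 GB": 1.0, "1 TB": 0.8},
}

offer = {"monitor": "14", "brand": "HP", "hard drive": "512 GB"}
print(utility(offer, buyer_weights, buyer_eval))  # 0.5 + 0.3 + 0.2 = 1.0
```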

4. Transfer Learning-Based Negotiating Agent

In this section, we first provide a comprehensive overview of the TLNAgent framework. Following this, we detail its core components: (1) the negotiation module, which interacts with opponents and trains source policies using an RL algorithm; (2) the critic module, which dynamically determines whether to activate the transfer module and which source policies to use; and (3) the transfer module, which quantifies the domain-specific helpfulness of each source policy and strategically integrates their knowledge to enhance task performance. The transfer module enables the transfer of knowledge through probabilistic alignment guided by a Gaussian Mixture Model–Universal Background Model (GMM-UBM). This mechanism ensures statistically robust adaptation to the behavioral patterns of opponents.

4.1. Framework Overview

The framework of the transfer learning-based negotiating agent is depicted in Figure 2. First, the Base Strategy Module (left) interacts directly with negotiation opponents using SAC policy networks. This module processes negotiation states (e.g., historical offers) and generates actions (counter-offers) while storing interaction trajectories in a negotiation database. Next, the Critic Module (center) analyzes new trajectories against historical data through two sub-components: the Distance Discriminator computes Wasserstein distances to detect opponent strategy shifts, while the Adaptation Model evaluates source policy usefulness via soft advantage scores. Finally, when novel opponents are detected, the Transfer Module (right) activates to integrate knowledge; it weights source policies using GMM-UBM similarity metrics and transfers their representations to the current policy via lateral neural network connections. Throughout the figure, arrows denote directional data flows, from trajectory collection through critic analysis to policy updates.
This is the first negotiating agent grounded in transfer reinforcement learning. It endows the agent with the capacity to effectively apply previously acquired knowledge when interacting with new opponents. Within this framework, three main components collaborate. Their combined efforts efficiently expedite the learning process when the agent encounters unknown opponents.
The process of TLNAgent is presented in Algorithm 1. Initially, TLNAgent sets up all the necessary parameters for negotiation (Line 1). During each episode, the negotiation module begins by initializing the utility function. Subsequently, it engages with the opponent by leveraging the current strategy (Lines 3–13). Following that, the critic module makes a determination. Based on two specific metrics, it decides whether to activate the transfer module and conducts a preliminary assessment of the transferability of the source policies (Lines 14–15). Finally, the transfer module assesses the utility of the policies forwarded by the critic module. It extracts relevant knowledge from these policies to formulate a new strategy tailored to the current task (Lines 16–21).
Algorithm 1: TLNAgent. [Pseudocode figure from the original article.]
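As a reading aid, the following Python sketch mirrors the control flow just described; the module objects and method names are illustrative assumptions, not the authors' implementation.

```python
# A control-flow sketch of one TLNAgent episode as described above; module
# objects and method names are illustrative assumptions, not the authors' API.

def tlnagent_episode(negotiation_module, critic_module, transfer_module,
                     opponent, source_policies, history):
    negotiation_module.init_utility_function()               # episode setup

    # Negotiate with the opponent using the current policy and train it
    # with reinforcement learning (Lines 3-13 of Algorithm 1).
    trajectory = negotiation_module.negotiate(opponent)
    negotiation_module.update_policy(trajectory)

    # Decide whether to transfer and pre-select candidate source policies
    # (Lines 14-15 of Algorithm 1).
    transfer_needed, teachers = critic_module.evaluate(trajectory, history,
                                                       source_policies)

    # Weight the teachers and distill their knowledge into a new policy
    # for the current opponent (Lines 16-21 of Algorithm 1).
    if transfer_needed:
        weights = transfer_module.assess_helpfulness(teachers, trajectory)
        transfer_module.transfer(teachers, weights)

    history.append(trajectory)
```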

4.2. Negotiation Module

This module serves as the basic component of our framework, interacting with negotiation opponents and training negotiation strategies (also called the base strategy module). We ground our approach in standard RL theory; negotiations are modeled as a Markov Decision Process (MDP) formalized by the tuple $\langle S, A, P, R \rangle$, where an agent learns optimal actions through reward maximization. This extends Bellman's foundational work on sequential decision-making to negotiation domains. Below, we map core RL components to bilateral negotiation.
States: In standard RL, states encapsulate environment information. Here, negotiating agents incur penalties upon reaching a negotiation timeout, thereby establishing the time parameter T (representing the maximum allowable number of negotiation rounds) as a critical determinant of strategic decision-making. Additionally, exchanged offers serve as key information, significantly influencing agents’ decisions on whether to accept incoming offers or generate new proposals. As such, we define the state at time t as follows:
$$S_t = \{ t_r,\; U_o(\omega_o^{t-2}), U_s(\omega_s^{t-2}),\; U_o(\omega_o^{t-1}), U_s(\omega_s^{t-1}),\; U_o(\omega_o^{t}), U_s(\omega_s^{t}) \},$$
where the relative time, denoted as $t_r = t/T$, represents the progress of the negotiation, and where $\omega_o^t$ and $\omega_s^t$ denote the respective offers by the opponent and our agent at time $t$.
Actions: Actions in RL represent environment interactions. Within our framework, the action set comprises all possible target utility values within $[u_r, r_{max}]$, where $u_r$ is the reservation value (i.e., any utility lower than $u_r$ is unacceptable) and $r_{max}$ is the maximal value. An action at step $t$ is consequently defined as $a_t = u_s^t$. To construct an offer corresponding to the target utility value $u_s^t$, we employ an inverse utility function $F: U \rightarrow \Omega$ that maps a real number $u$ to an outcome $\omega$ in the outcome space $\Omega$. To ensure the effectiveness of the chosen offers, this function generates offers with utilities that fall within the range $[u_s - \Delta u, u_s + \Delta u]$, with $\Delta u$ being a small positive value to accommodate slight variances in tolerance. When multiple offers fit this utility range, we select the one most likely to be preferred by the opponent, based on an opponent model $U_o$ that ranks a set of offers by estimating the opponent's preference profile. Thus, the inverse utility function is defined as follows:
$$F(u_s) = \arg\max_{\omega} U_o(\omega), \quad \text{where } u_s - \Delta u \le U_s(\omega) \le u_s + \Delta u,$$
with $U_o$ and $U_s$ denoting the opponent's and agent's utility functions, respectively.
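A minimal Python sketch of this inverse mapping is given below; the toy outcome list, the two utility callables, and the tolerance value are illustrative assumptions rather than the authors' implementation.

```python
# A sketch of the inverse utility mapping F defined above: among outcomes
# whose own utility lies within [u_s - du, u_s + du], return the one ranked
# highest by the (estimated) opponent model. The toy outcome space and the
# two utility functions below are illustrative assumptions.

def inverse_utility(u_s, outcomes, own_utility, opponent_utility, du=0.05):
    candidates = [w for w in outcomes if u_s - du <= own_utility(w) <= u_s + du]
    if not candidates:  # fall back to the outcome closest to the target utility
        return min(outcomes, key=lambda w: abs(own_utility(w) - u_s))
    return max(candidates, key=opponent_utility)

outcomes = [round(0.1 * k, 1) for k in range(11)]   # toy one-dimensional outcomes
own_u = lambda w: w                                  # our agent prefers large values
opp_u = lambda w: 1.0 - w                            # the opponent prefers small values
print(inverse_utility(0.8, outcomes, own_u, opp_u, du=0.15))  # -> 0.7, opponent's favorite in the window
```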
Rewards: Following RL principles, rewards incentivize goal achievement. The reward function $R$ for the agent during negotiation is defined based on the outcome of the negotiation process. When the agent reaches an agreement $\omega$, it receives a reward equal to the utility $U_s(\omega)$, reflecting the value of the negotiated outcome. If no agreement is reached (e.g., timeout or breakdown), the agent incurs a penalty with a reward of −1, discouraging negotiation failure. The function $R$ is formally expressed as shown below.
$$R(s_t, a_t, s_{t+1}) = \begin{cases} U_s(\omega), & \text{if there is an agreement } \omega \\ -1, & \text{if no agreement in the end} \\ 0, & \text{otherwise} \end{cases}$$
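This reward translates directly into code; the short sketch below is a straightforward rendering under the assumption that the caller tracks the agreement and terminal status of the session.

```python
# A direct rendering of the reward function above; the agreement object and
# the terminal flag are assumed to be tracked by the negotiation environment.

def reward(agreement, own_utility, terminal):
    if agreement is not None:
        return own_utility(agreement)    # U_s(agreed outcome) on success
    return -1.0 if terminal else 0.0     # -1 on failure, 0 for intermediate steps
```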
Learning algorithm: In this work, we use Soft Actor–Critic (SAC) [31,32] to solve the Markov decision problem. SAC is an off-policy RL algorithm using a stochastic policy based on the maximum entropy idea. In contrast to other RL algorithms, SAC optimizes the policy to achieve higher expected rewards while also maximizing the policy’s entropy to ensure the algorithm’s randomness. Maximizing the entropy of the policy allows the algorithm to better explore the state space and obtain greater stability.
The objective of entropy-regularized RL is to find an optimal policy as follows:
$$\pi^{*} = \arg\max_{\pi} \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \big[ R(s_t, a_t, s_{t+1}) + \alpha H(\pi(\cdot \mid s_t)) \big],$$
where $\alpha$ is a temperature parameter that controls the importance of entropy, which is automatically adjusted by training [31]; $R$ is the reward function; $H$ denotes the entropy of policy $\pi$; and $s_t$ and $a_t$ respectively denote the state and action at round $t$. SAC learns two value networks together, allowing it to avoid overestimation of the value function. The soft Q value containing the entropy is defined as shown below.
$$Q(s_t, a_t) = R(s_t, a_t, s_{t+1}) + \gamma\, \mathbb{E}_{(s_{t+1}, a_{t+1}) \sim \rho_\pi} \big[ Q(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1} \mid s_{t+1}) \big]$$
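For illustration, the sketch below computes the soft Q target implied by the formula above, using the twin-critic minimum just mentioned; the tensor inputs (rewards, next-state Q estimates, next-action log-probabilities) are assumed to come from a replay buffer and the current policy.

```python
import torch

# A sketch of the entropy-regularized (soft) Q target used by SAC, with the
# twin-critic minimum noted above; input tensors are assumed to come from a
# replay buffer and the current policy.

def soft_q_target(rewards, dones, q1_next, q2_next, logp_next,
                  gamma=0.99, alpha=0.2):
    q_next = torch.min(q1_next, q2_next)          # twin critics curb overestimation
    soft_value = q_next - alpha * logp_next       # soft value of the next state
    return rewards + gamma * (1.0 - dones) * soft_value
```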

4.3. Critic Module

The critic module addresses two central questions prior to initiating knowledge transfer: (1) determining when to activate the transfer module, and (2) selecting appropriate source policies for transfer when activation is warranted. For the first question, activation of the transfer module hinges on detecting significant shifts in the opponent’s behavioral patterns. When such changes occur, the agent must activate the transfer module to acquire a new response strategy, as adaptability to evolving opponent strategies constitutes a core capability of the agent. Regarding the second question, when the agent seeks to learn a new strategy via the transfer module, it is critical to identify which source policies are most relevant to the target negotiation task. A non-discriminatory transfer of knowledge from all available source policies could yield detrimental outcomes, including negative transfer effects and unnecessary computational burdens. To tackle these challenges, we integrate two essential components: the distance discriminator and the adaptation model (see Figure 3 and Algorithm 2).
Algorithm 2: Critic module. [Pseudocode figure from the original article.]

4.3.1. Distance Discriminator

The distance discriminator functions as a binary classifier tasked with determining whether the opponent changes its negotiation strategy. Upon detecting a discernible shift, this component triggers activation of the transfer module, enabling the agent to learn a new negotiation strategy tailored to the negotiation dynamics. Conversely, when the opponent’s strategy remains stable and the agent’s current strategy continues to produce favorable outcomes, the agent persists with its existing approach. During negotiation, we make the realistic assumption that an opponent’s strategy remains unchanged over a sequence of consecutive negotiation sessions.
As illustrated in Figure 3, the Wasserstein distance [49] is employed by the discriminator to quantify the similarity between trajectories derived from historical negotiations and the most recent negotiation trajectory. The use of the Wasserstein distance to measure similarity between historical and latest negotiation trajectories is theoretically grounded in its unique alignment with the characteristics of offer distributions in negotiation contexts. Compared to other probabilistic metrics such as the KL divergence and the Jensen–Shannon distance, the Wasserstein distance offers three key advantages. First, it is robust to non-overlapping offer distributions resulting from abrupt strategic shifts where the set of offers with non-zero probability under an old strategy is entirely disjoint from that under a new strategy (e.g., an opponent switching from exclusively proposing Dell offers to only HP offers in a laptop-purchase negotiation). Unlike the KL divergence, which becomes undefined for such cases due to infinite log ratios, the Wasserstein distance quantifies the minimum cost of transforming the old distribution into the new one by transporting probability mass across the gap, yielding a finite and meaningful value to detect even discontinuous shifts. Second, it captures geometric and strategic similarity by accounting for the structural closeness of offers in the multi-issue outcome space (e.g., in our example, offers with similar hard drive sizes are treated as closer than those with drastically different sizes), unlike the KL divergence, which overemphasizes high-probability offers. This ensures that subtle but strategic shifts, such as gradual moves toward smaller hard drive sizes in our example, are still detected. Finally, it is stable under noise. Negotiation trajectories often include random outliers (e.g., a one-time high-utility offer). The Wasserstein distance smooths these by minimizing cumulative transport costs, thereby avoiding the overreactions to low-probability noise that plague the KL divergence. Together, these properties make the Wasserstein distance uniquely suited to identifying meaningful strategy shifts by opponents, which is a core requirement allowing the distance discriminator to trigger knowledge transfer when necessary.
For the previous negotiation sessions, we store the negotiation trajectories $\tau$. Let $l_\tau^o = \{F_\tau(\omega_1^o), \ldots, F_\tau(\omega_n^o)\}$ denote the probability distribution of offers made by the opponent in trajectory $\tau$, where $F(\cdot)$ provides the probability of occurrence of an offer $\omega_n^o$, and let $l_H^o = \{\frac{1}{k}\sum_{i=1}^{k} F_{\tau_i}(\omega_1^o), \ldots, \frac{1}{k}\sum_{i=1}^{k} F_{\tau_i}(\omega_n^o)\}$ denote the average probability distribution of opponent offers in the negotiation history $H$ with cardinality $k$. With the Wasserstein distance $W(l_H, l_{\tau_{new}})$, where $\tau_{new}$ is a set containing the last few traces of negotiation with the current opponent, we can calculate the similarity between the historical trajectories and the latest trajectories as follows:
$$W(l_1, l_2) = \inf_{\gamma \in \Gamma(l_1, l_2)} \mathbb{E}_{(x, y) \sim \gamma} \big[ \lVert x - y \rVert \big],$$
where $\Gamma(l_1, l_2)$ is the set of all joint distributions $\gamma(x, y)$ whose marginals are respectively $l_1$ and $l_2$.
For each joint distribution $\gamma$, a sample $(x, y)$ is drawn to compute the distance $\lVert x - y \rVert$. The expectation of this distance under $\gamma$ is then $\mathbb{E}_{(x, y) \sim \gamma}[\lVert x - y \rVert]$. The Wasserstein distance $W$ is the infimum (greatest lower bound) of this expectation over all possible joint distributions. This formulation is grounded in the principle that offer distributions differ across negotiation strategies and that their dissimilarity can be quantified using a metric capable of measuring the distance between probability distributions. Here, a higher value of $W(l_H, l_{\tau_{new}})$ means greater dissimilarity between the historical trajectory set $l_H$ and the new trajectory $l_{\tau_{new}}$. If this dissimilarity exceeds a predefined threshold $s$, then the distance discriminator outputs 1, activating the transfer module to signify that the agent is encountering a novel opponent strategy; otherwise, it outputs 0 when the dissimilarity remains below the threshold.
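As a concrete illustration of this decision rule, the sketch below compares two opponent offer distributions with SciPy's one-dimensional Wasserstein distance; representing offers by their utility values and using synthetic samples are simplifying assumptions made for this illustration.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# A sketch of the discriminator's check: compare the opponent's historical
# offer distribution with the most recent one and trigger transfer when the
# 1-D Wasserstein distance exceeds the threshold s. Offers are summarized by
# their utility values here, and the sample data are synthetic placeholders.

def strategy_shifted(historical_utils, recent_utils, s=0.3):
    return wasserstein_distance(historical_utils, recent_utils) > s

rng = np.random.default_rng(0)
hist = rng.normal(0.50, 0.05, size=500)   # e.g., a moderately conceding opponent
new = rng.normal(0.85, 0.05, size=100)    # e.g., a much tougher opponent
print(strategy_shifted(hist, new))         # True -> activate the transfer module
```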
The threshold s is set to 0.3, as determined via empirical validation across fifty negotiation domains and eight representative opponents. This setting can already result in satisfying performance, as we show in Section 5. However, we also propose a complementary mechanism to set s dynamically, called the Context-Aware Threshold (CTR) predictor. The CTR predictor leverages deep learning to dynamically predict the optimal Wasserstein distance threshold s by mapping the high-dimensional negotiation context to a scenario-specific threshold. The model takes three key feature sets as input: (1) opponent behavior embeddings derived from a 1D Convolutional Neural Network (CNN) that encodes temporal patterns in the opponent's recent offer trajectories; (2) domain metadata, including the number of negotiable issues, opposition of the domain, size of the outcome space, and average utility range, which are used to capture domain complexity; and (3) a window of Wasserstein distances, reflecting shift trends. These features are fused in a feed-forward neural network, which outputs a bounded threshold $\hat{s} \in [0, 1]$ via a regression head. The CTR predictor is trained on historical negotiation sessions with labels indicating the optimal threshold (i.e., the smallest s that correctly identifies meaningful strategy shifts while avoiding false positives), using mean squared error (MSE) loss to minimize prediction error. By learning to associate patterns such as volatile opponent trajectories with higher thresholds and stable trajectories with lower thresholds, the CTR predictor generalizes to unseen scenarios, leveraging representation learning to capture nuanced relationships between context and strategy shift significance.
Moreover, when sufficient data and training time are available, the distance discriminator can be trained using the triplet loss. In this configuration, the discriminator employs a deep convolutional network to learn strategy-specific embeddings via triplet loss. The network is optimized such that squared L2 distances in the embedding space directly reflect strategy similarity; trajectories belonging to the same strategy exhibit minimal distances, whereas those from distinct strategies demonstrate substantial distances. Our objective is to guarantee that for a given trajectory $l_i^a$ (referred to as the anchor) associated with a particular strategy, its distance to any trajectory $l_i^p$ of the same strategy (termed the positive) is shorter than its distance to any trajectory $l_i^n$ (termed the negative) of any different strategy. The loss $L$, computed after collecting enough trajectories of the opponent, is defined as
$$L = \sum_{i}^{N} \mathcal{L}(l_i^a, l_i^p, l_i^n),$$
where $\mathcal{L}$ is provided by
$$\mathcal{L}(l_i^a, l_i^p, l_i^n) = \max\big(0,\ \lVert l_i^a - l_i^p \rVert_2^2 - \lVert l_i^a - l_i^n \rVert_2^2 + c\big).$$
Here, $c$ is a margin that is enforced between positive and negative pairs. In this way, the discriminator can detect the change in the opponent's strategy.
The triplet loss variant of the discriminator employs a CNN tailored to process sequential negotiation trajectories. Its architecture includes an input layer accepting trajectory tensors of shape (T, D), where T = 100 timesteps and D = 8 features, including relative time, agent/opponent utilities, and historical averages. This is followed by two convolutional blocks, each consisting of a 1D convolutional layer (32 filters, kernel size 3), batch normalization, ReLU activation, and max-pooling. These blocks feed into a fully connected layer with 128 units (ReLU activation) and a 64-dimensional embedding layer that outputs compact trajectory embeddings. For training, 50,000 trajectories are used, with 2500 trajectories per opponent across twenty distinct strategies (thirteen ANAC winners and seven basic strategies, e.g., time-dependent, behavior-dependent) to form triplets (anchor, positive, negative). The margin parameter, which enforces separation between positive and negative pairs, is set to c = 0.2 via cross-validation on 100,000 held-out trajectories, balancing avoidance of trivial solutions and over-separation of slightly different strategies.
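A compact PyTorch sketch of this training signal is shown below; the 64-dimensional embedding tensors are assumed to come from the 1-D CNN encoder described above, and the batch shapes are illustrative.

```python
import torch
import torch.nn.functional as F

# A PyTorch sketch of the triplet objective above: same-strategy trajectory
# embeddings are pulled together and different strategies pushed apart by at
# least the margin c. Embeddings are assumed to come from the 1-D CNN encoder.

def triplet_loss(anchor, positive, negative, c=0.2):
    d_pos = (anchor - positive).pow(2).sum(dim=1)   # squared L2 to the positive
    d_neg = (anchor - negative).pow(2).sum(dim=1)   # squared L2 to the negative
    return F.relu(d_pos - d_neg + c).mean()

a, p, n = (torch.randn(8, 64) for _ in range(3))    # toy batch of embeddings
print(triplet_loss(a, p, n))
```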
While we propose this triplet loss discriminator as a potential alternative for strategy-change detection, it was not used in our core experiments for two key reasons, namely, data efficiency (the triplet loss variant requires a large number of trajectories, which is impractical for dynamic negotiation scenarios with limited pretraining data) and real-time performance (its inference latency of 20 ms per trajectory exceeds the Wasserstein distance calculation’s 5 ms, making it less suitable for time-sensitive negotiations). Thus, the triplet loss variant is presented as a potential extension for offline pre-deployment scenarios with abundant data, but is not part of the primary evaluation.

4.3.2. Adaptation Model

Once it is detected that the agent is facing a new opponent strategy, the first step is to determine which source policies should be transferred to enhance learning efficiency. To select the most beneficial source policies, we employ an adaptation model for the initial selection of source policies (see the left-hand side of Figure 3). During each policy iteration, we calculate the helpfulness of each source policy by leveraging the critic, a commonly used component in actor–critic RL methods. More specifically, we define the soft expected advantage score of the action probability distribution $\pi_p(\cdot \mid s)$ over policy $\pi_q$ at state $s$ as follows:
$$Sc_{\pi_q}(s, \pi_p) = \mathbb{E}_{a \sim \pi_p(\cdot \mid s)} \big[ Q_{\pi_q}(s, a) - \alpha \log \pi_p(a \mid s) \big] - V_{\pi_q}(s),$$
where $Sc_{\pi_q}(s, \pi_p)$ quantifies the performance improvement attained by executing policy $\pi_p$ instead of $\pi_q$ at state $s$. Each pretrained source policy in the set $\Pi = \{\pi_1, \ldots, \pi_n\}$ is associated with a score $Sc_{\pi_{now}}(s, \pi_i)$.
Subsequently, the policy selector is employed to identify the beneficial source policies as teachers (i.e., the teacher library $\Pi_{teacher}$) by referring to the soft expected advantage improvement score $Sc_{\pi_{now}}(s, \pi_i)$ relative to the current policy $\pi_{now}$:
$$\Pi_{teacher} = \{\pi \mid Sc_{\pi_{now}}(s, \pi) > 0\} = \big\{\pi \mid \mathbb{E}_{a \sim \pi(\cdot \mid s)} \big[ Q_{\pi_{now}}(s, a) - \alpha \log \pi(a \mid s) \big] - V_{\pi_{now}}(s) > 0 \big\}.$$
This adaptation model offers two significant advantages in practical applications. First, it is rational to transfer knowledge from multiple source policies simultaneously, as each may possess experience that is valuable for training a new policy. Second, the agent can learn without the teachers' guidance, allowing it to circumvent negative transfer when none of the source policies exhibits a positive soft expected advantage.
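The selection rule above can be approximated by Monte Carlo sampling from each source policy, as in the sketch below; the `sample` interface on source policies and the critic callables `q_fn` / `v_fn` are assumed names, not the authors' API.

```python
# A Monte Carlo sketch of the soft expected advantage score and the teacher
# selection rule above; the `sample` interface on source policies and the
# critic callables `q_fn` / `v_fn` are assumed names, not the authors' API.

def soft_advantage(state, source_policy, q_fn, v_fn, alpha=0.2, n_samples=32):
    actions, logp = source_policy.sample(state, n_samples)  # a ~ pi(. | s)
    q_vals = q_fn(state, actions)                            # Q_{pi_now}(s, a)
    return (q_vals - alpha * logp).mean() - v_fn(state)      # Sc_{pi_now}(s, pi)

def select_teachers(state, source_policies, q_fn, v_fn, alpha=0.2):
    # keep only source policies with a positive soft expected advantage
    return [pi for pi in source_policies
            if soft_advantage(state, pi, q_fn, v_fn, alpha) > 0]
```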

4.4. Transfer Module

4.4.1. Weight Assignment

Drawing on the teacher policies $\Pi_{teacher}$ recommended by the critic module, as detailed in Section 4.3, the transfer module extracts pertinent knowledge from these policies (as shown on the right-hand side of Figure 3). However, applying a uniform knowledge transfer approach across all source policies would render the transfer module both ineffective and inefficient. It is worth noting that when two opponent strategies are similar, their corresponding response strategies should also demonstrate a degree of similarity. Therefore, it is imperative to evaluate the helpfulness of each source policy by assessing the similarity between the current opponent and the opponents used in training these policies, then assigning appropriate weights to those teacher policies (see lines 2–4 of Algorithm 3).
Thus, we propose a novel metric for comparing the similarity of opponents based on the states they have visited, denoted as $B_i = (s_1^i, \ldots, s_n^i)$, where $B_i$ denotes the set of states visited when negotiating against opponent $i$. Compared with other indices such as returns, the state is more suitable for automated negotiation because it remains stable in stochastic environments and offers high accuracy when dealing with similar opponents.
To capture the states visited by different opponents, we employ a basic strategy instantiated from any known strategy, such as one of the ANAC agents, to interact with all opponents, ensuring that all operate within a uniform environment. Gaussian Mixture Models (GMMs) [50] can model nearly any continuous distribution with arbitrary accuracy. After the state set $B_i$ of opponent $i$ is collected, a GMM is used to fit the mixture density $p(x) = \sum_{k=1}^{N} \sigma_k \mathcal{N}(x \mid \mu_k, \Sigma_k)$, where $N$ is the number of Gaussian components in the mixture model, $\sigma_k$ is the probability of the observed data belonging to the $k$-th component, and $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the Gaussian density function of the $k$-th component. The GMM parameters are estimated by maximizing the likelihood function:
$$L(s_1^i, \ldots, s_n^i; \mu, \Sigma) = \prod_{i=1}^{n} p(s_i) = \prod_{i=1}^{n} \sum_{k=1}^{N} \sigma_k \mathcal{N}(s_i \mid \mu_k, \Sigma_k).$$
With the GMM at hand, we can compare the similarity of the current opponent to previously encountered opponents.
However, the frequent use of GMMs to fit collected state trajectories during negotiation incurs significant computational and sampling costs, which restricts the performance of TLNAgent in negotiations. To address this limitation, we draw inspiration from prior research in speaker verification, where Gaussian Mixture Models (GMMs) were employed to model speaker characteristics (e.g., [51,52]). These studies introduced universal background models (GMM-UBMs), hereinafter referred to as UBMs. UBMs capture the distribution of cross-opponent features by fitting a large GMM to a vast corpus of state data collected from all opponents. After the universal opponent model is established, an opponent-specific model can be derived through Maximum A Posteriori (MAP) adaptation of GMM components, leveraging the state data of individual opponents. The adapted parameters are concatenated into a supervector, forming a fixed-length representation to characterize the opponent. Thus, when encountering an unknown opponent, its visited states are collected using the basic strategy and a GMM is adapted via MAP using pretrained UBMs to quantify similarity. This approach drastically reduces the computational and sampling costs associated with similarity assessment, enhancing its feasibility in negotiation scenarios (see Figure 4).
In our implementation, we conduct $M$ negotiation sessions with each opponent in the opponent pool, which comprises all opponents used to train the source policies, and collect state sets $B_i = (s_1^i, s_2^i, \ldots)$ for every opponent $i$. The state sets of the opponents are then aggregated into the UBM training set $B_{ubm} = (B_1, B_2, \ldots, B_n)$. The UBM is fitted with $K$ components and parameters $\mu_{ubm} \in \mathbb{R}^{K \times d}$, $\Sigma_{ubm} \in \mathbb{R}^{K \times d \times d}$, and $w_{ubm} \in \mathbb{R}^{K}$, with $w_k \ge 0$ and $\sum_k w_k = 1$, using the EM algorithm until convergence. Next, we determine the probabilistic alignment of the training vectors $B_i$ with the mixture components of the UBM by computing
$$p(k \mid s_t) = \frac{\sigma_k \mathcal{N}(s_t \mid \mu_k, \Sigma_k)}{\sum_{l=1}^{K} \sigma_l \mathcal{N}(s_t \mid \mu_l, \Sigma_l)}.$$
Next, $p(k \mid s_t)$ is used to compute the sufficient statistics for the weight and mean parameters:
$$n_k = \sum_{t} p(k \mid s_t),$$
$$E_k(B) = \frac{1}{n_k} \sum_{t} p(k \mid s_t)\, s_t.$$
To derive the adapted parameters for Gaussian component $k$, the sufficient statistics of the UBM are updated with those computed from the training data using
$$\hat{\mu}_k = \alpha_k E_k(B) + (1 - \alpha_k)\, \mu_k,$$
where $\alpha_k$ is the adaptation coefficient controlling the balance between the old and new estimates, defined as
$$\alpha_k = \frac{n_k}{n_k + r},$$
with $r$ being a fixed relevance factor.
To measure the distance between two policy supervectors, we use the upper bound on the KL divergence between the means of two adapted GMMs $\mu_i$ and $\mu_j$ [53]:
$$U_{i,j} = \frac{1}{2} \sum_{k=1}^{K} w_{ubm}^{k} \big(\mu_i^k - \mu_j^k\big)^{\top} \Sigma_{ubm}^{-1} \big(\mu_i^k - \mu_j^k\big).$$
In contrast to the KL divergence itself, the upper bound presented here is symmetric and does not require costly sampling. In this work, we only adapt the means of the UBM so as to enable the use of the aforementioned metric. The upper bound $U_{i,j}$ is employed as the metric for assessing opponent similarity. After computing the opponent similarity, the transfer module derives the weight vector $W_{teachers} = \{w_1, \ldots, w_n\}$ over the critic module's teacher policy library, where $w_i = \frac{\exp(U_{i,cur}^{-1})}{\sum_{j=1}^{n} \exp(U_{j,cur}^{-1})}$ and $U_{i,cur}^{-1}$ (i.e., the inverse of the distance $U_{i,cur}$) represents the similarity between opponent $i$ and the current opponent. Specifically, a higher $w_j$ amplifies the influence of the corresponding policy $\pi_j$ on our agent's behavior in the current environment, indicating that $\pi_j$ contributes more valuable knowledge and is more instrumental in shaping the new policy.
The key parameters of GMM-UBM include the number of Gaussian mixture components ($K$), the relevance factor ($r$) in MAP adaptation, and the convergence criteria for the Expectation Maximization (EM) algorithm. For the number of Gaussian components ($K$), we use a two-step approach combining likelihood-based cross-validation across candidate values ($K \in \{16, 32, 64, 128\}$) on a validation set of opponents from ANAC to maximize the average log-likelihood of held-out state data with the elbow method to identify the point where marginal gains in likelihood diminish; empirically, this elbow occurs at $K = 64$ across 50 domains, balancing expressiveness and efficiency. The relevance factor ($r$), which controls the tradeoff between the universal background model (UBM) and opponent-specific data during adaptation, is fixed at $r = 16$, aligning with the speaker verification conventions where GMM-UBM originated [51,52]. For EM convergence, the algorithm terminates when the relative change in log-likelihood between iterations falls below $10^{-4}$, typically requiring 50–100 iterations for $K = 64$ to ensure convergence without unnecessary computational overhead.
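To make the UBM fitting, MAP mean adaptation, and similarity scoring above concrete, the following sketch uses scikit-learn's GaussianMixture with diagonal covariances; the array shapes, the softmax weighting helper, and the use of `predict_proba` for the component posteriors are our assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# A sketch of the GMM-UBM pipeline above: fit a UBM on states pooled over all
# opponents, MAP-adapt its means to one opponent's states, score similarity
# with the symmetric KL upper bound, and turn distances into teacher weights.

def fit_ubm(pooled_states, k=64, seed=0):
    return GaussianMixture(n_components=k, covariance_type="diag",
                           random_state=seed).fit(pooled_states)

def map_adapt_means(ubm, states, r=16.0):
    post = ubm.predict_proba(states)                 # p(k | s_t), shape (T, K)
    n_k = post.sum(axis=0) + 1e-10                   # soft counts per component
    e_k = post.T @ states / n_k[:, None]             # component-wise state means
    alpha = (n_k / (n_k + r))[:, None]               # adaptation coefficients
    return alpha * e_k + (1.0 - alpha) * ubm.means_  # adapted means (supervector)

def kl_upper_bound(mu_i, mu_j, ubm):
    diff = mu_i - mu_j                               # per-component mean offsets
    inv_var = 1.0 / ubm.covariances_                 # inverse diagonal covariances
    return 0.5 * np.sum(ubm.weights_ * np.sum(diff * inv_var * diff, axis=1))

def teacher_weights(distances):
    # softmax over inverted distances: closer opponents get larger weights
    inv = 1.0 / (np.asarray(distances) + 1e-10)
    e = np.exp(inv - inv.max())
    return e / e.sum()
```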
It should be noted that when applied to large-scale real-world negotiations, the negotiation state space may become high-dimensional owing to the presence of numerous continuous state variables. In this case, we introduce a preprocessing step using lightweight dimensionality reduction such as Principal Component Analysis (PCA) to reduce the state space before feeding it into GMM-UBM. PCA projects high-dimensional states onto a low-dimensional subspace while preserving 95% of variance. This reduces the dimensionality of state vectors, significantly speeding up Gaussian mixture fitting. Moreover, we can further optimize by restricting the number of active Gaussian components in the UBM to those most relevant to the current opponent. For example, during MAP adaptation (Section 4.4.1), we retain only the top 20% of components with the highest posterior probability for the opponent’s states, reducing the number of parameters updated.
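A minimal sketch of this optional preprocessing step is shown below; the low-rank synthetic state matrix is only a placeholder used to illustrate the variance-retention interface.

```python
import numpy as np
from sklearn.decomposition import PCA

# A sketch of the optional dimensionality-reduction step: project high-dimensional
# negotiation states onto a subspace retaining 95% of the variance before UBM
# fitting. The synthetic low-rank state matrix below is only a placeholder.

rng = np.random.default_rng(0)
latent = rng.normal(size=(5000, 10))            # hidden low-dimensional structure
states = latent @ rng.normal(size=(10, 120))    # observed 120-dimensional states

pca = PCA(n_components=0.95).fit(states)        # keep 95% of the variance
reduced_states = pca.transform(states)
print(reduced_states.shape)                     # roughly (5000, 9) or (5000, 10)
```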
In fact, transferring knowledge from multiple teachers via weighting factors yields notable advantages over selecting the single best teacher. This approach recognizes that even suboptimal teachers can offer valuable insights or complementary strategies to the learning agent. By dynamically adjusting the weighting coefficients, TLNAgent autonomously balances the influence of each teacher, prioritizing contributions based on their relevance and performance. Consequently, the agent can leverage diverse source policies to develop a robust and adaptive strategy tailored to the current opponent's behavior.
Algorithm 3: Transfer Module
1: Initialize: source policy buffer $\{\pi_i\}$, the agent policy network parameters $\phi$, and state–value network parameters $\psi$
2: // Measure the helpfulness of source policies
3: Pre-train the UBMs with the states of all opponents in the supervector library
4: Adapt the UBMs to opponent-specific GMMs with the states of a single opponent and compute the weight factors
5: // Draw knowledge from the teachers
6: Combine the student policy network and the source policies' networks using $W_{teachers}$
7: Update the agent network parameters $\phi$ and $\psi$ based on the SAC algorithm
8: Learn a new policy against the current opponent based on the source policies $\{\pi_i\}$

4.4.2. Knowledge Transfer

Next, we shift our focus to the method of transferring knowledge between the teachers and the student (see lines 5–8 of Algorithm 3). Building upon transfer learning methodologies from prior work [47,54], we leverage lateral connections to integrate knowledge directly from the teachers' policy and state–value networks, thereby enriching the representations of the student network (i.e., the learning agent). Specifically, the numbers of hidden layers in the policy and state–value networks of both the teachers and the student are denoted as $N_\pi$ and $N_V$, respectively.
To streamline the framework and ensure architectural compatibility, we make both the teachers and the student maintain identical network depths (i.e., an equal number of hidden layers) in their respective policy and value networks. The policy and state–value networks of teacher $j$ are denoted as $\pi_{\phi_j}$ and $V_{\psi_j}$, respectively, with parameters $(\phi_j, \psi_j)$ held fixed throughout training. The student's networks are equipped with trainable parameters $(\phi, \psi)$. During negotiation, the student agent passes the current state $s_t$ through the teachers' networks to extract the pre-activation outputs of the $i$-th hidden layer from teacher $j$'s networks:
$$\{ h_{\phi_j}^{i},\ 1 \le i \le N_\pi,\ 1 \le j \le N \}, \qquad \{ h_{\psi_j}^{i},\ 1 \le i \le N_V,\ 1 \le j \le N \}.$$
To derive the $i$-th hidden layer outputs within the student networks, denoted as $\{h_{\pi_\phi}^{i}, h_{V_\psi}^{i}\}$, we execute two weighted linear combinations by amalgamating the pre-activations of the student's networks with those of the teachers' networks [47,54]:
$$h_{\pi_\phi}^{i} = p_\phi^{i}\, h_\phi^{i} + (1 - p_\phi^{i}) \sum_{j=1}^{N} w_j h_{\phi_j}^{i}, \qquad h_{V_\psi}^{i} = p_\psi^{i}\, h_\psi^{i} + (1 - p_\psi^{i}) \sum_{j=1}^{N} w_j h_{\psi_j}^{i},$$
where $w_j$ represents the weight of source policy $\pi_j$ obtained above. Furthermore, $p_\phi^{i}$ and $p_\psi^{i}$ are two adaptable weighting coefficients that modulate the influence of the teachers on the student's output and are bounded within $[0, 1]$. During the initial phases of training, setting these coefficients to low values furnishes the essential information to initiate the learning process.
This architecture enables the student to attain higher average utility during the early training epochs compared to alternative transfer paradigms such as policy distillation. Crucially, however, complete decoupling of the student from the teachers becomes imperative as training converges. This necessity arises because teacher strategies may become obsolete relative to the student's continuously evolving policy over time. To incentivize this decoupling, we introduce additional coupling loss terms that drive $p_\phi^{i}$ and $p_\psi^{i}$ toward 1 as training progresses:
$$L_{coupling} = -\frac{1}{N_\pi} \sum_{i=1}^{N_\pi} \log(p_\phi^{i}) - \frac{1}{N_V} \sum_{i=1}^{N_V} \log(p_\psi^{i}).$$
As the values of $p_\phi^{i}$ and $p_\psi^{i}$ increase, the influence of the teacher policies on the student policy in the current task diminishes. Upon completing training, we incorporate the newly learned strategy into the source policy library and the corresponding opponent supervector into the supervector library.
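For one hidden layer, the lateral combination and the coupling loss above amount to a few lines of tensor code; the PyTorch sketch below uses per-layer scalar gates and assumed tensor shapes, and is not the authors' implementation.

```python
import torch

# A PyTorch sketch of the per-layer lateral combination and the coupling loss
# above: teacher pre-activations are frozen, while the student pre-activation
# and the scalar gate p (bounded in [0, 1]) are trainable. Tensor shapes and
# variable names are illustrative assumptions.

def lateral_mix(h_student, h_teachers, w_teachers, p):
    # h_student: (batch, d); h_teachers: list of (batch, d); p: scalar gate
    teacher_sum = sum(w * h for w, h in zip(w_teachers, h_teachers))
    return p * h_student + (1.0 - p) * teacher_sum

def coupling_loss(p_policy_gates, p_value_gates):
    # drives all gates toward 1 so the student gradually decouples from teachers
    lp = torch.stack(list(p_policy_gates))
    lv = torch.stack(list(p_value_gates))
    return -torch.log(lp).mean() - torch.log(lv).mean()
```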

5. Experiments

In this section, we present a series of systematic studies to evaluate the effectiveness of TLNAgent in comparison with state-of-the-art approaches. The code is available at https://github.com/NegoAgentLab2025/TransferNegoAgent (accessed on 16 August 2025). First, we aim to demonstrate the strategy’s capacity to accelerate the learning process and enhance performance when engaging with novel opponents in negotiations. To this end, we execute comparative analyses across a spectrum of diverse negotiation domains spanning various task complexities and opponent dynamics. Subsequently, a tournament-style assessment is undertaken to thoroughly evaluate our agent’s performance against a curated set of ANAC-winning agents. This evaluation scheme enables a comprehensive characterization of TLNAgent ’s competitive efficacy across standardized benchmark settings. To further investigate the contributions of key components within our framework, we conduct a series of ablation studies. These experiments systematically dissect the impacts of the adaptation model and the GMM-UBM mechanism embedded within the transfer module. By testing these components in isolation and across combinatorial configurations, we quantify their individual and synergistic influences on the overall efficacy of TLNAgent, ensuring a granular understanding of their roles in driving the system’s success. Lastly, we conduct a robustness assessment via empirical game-theoretic analysis to investigate whether TLNAgent maintains stability and performance in both two-player and multi-player scenarios. This evaluation addresses the method’s robustness to strategic interactions under varying agent populations, providing critical insights into its generalizable applicability.
In our experimental design, we incorporated thirteen baseline agents selected from past winners of the Automated Negotiating Agents Competition (ANAC) [55]. This set includes Atlas3, ParsAgent, Caduceus, YXAgent, Ponpoko, CaduceusDC16, AgreeableAgent2018, Agent36, AlphaBIU, MatrixAlienAgent, Agent007, ChargingBoul, and MiCRO, representing a diverse array of competitive strategies from prior ANAC editions. For negotiation environments, we utilized all fifty domains from the 2022 ANAC dataset, which encompass scenarios with highly varied structural characteristics and preferences. This comprehensive domain selection ensures a rigorous assessment of our agent’s adaptability and performance across distinct negotiation contexts, capturing the complexity of real-world bargaining scenarios. To maintain experimental rigor and align with practical negotiation constraints, each session was capped at a maximum of 300 negotiation rounds. This procedural boundary serves a dual purpose: it prevents indefinite interaction durations that could compromise empirical tractability, while also mirroring real-world negotiation settings where time constraints often necessitate timely resolution.

5.1. Performance of New Opponent Learning

In this experimental setup, the objective is to empirically demonstrate the effective learning capacity of TLNAgent when interacting with novel opponents, i.e., those not encountered before. To achieve this, evaluations are conducted across a diverse set of opponents and negotiation domains designed to stress-test the framework's learning performance. Specifically, we select eight opponent agents: the annual winners of the Automated Negotiating Agents Competition (ANAC) 2015–2018, complemented by the top two agents from each of the 2021 and 2022 editions. This ensemble captures a wide spectrum of negotiation strategies, ensuring rigorous assessment under varied adversarial dynamics. To accentuate its capacity for learning against new opponents, TLNAgent initiates each experimental run with a minimal set of four source policies. These policies are derived from the second-place agents in the 2015–2018 ANAC competitions, providing a structured yet parsimonious starting point for transfer learning.
Both TLNAgent and the baselines (see below) underwent training across 300,000 interaction rounds per opponent throughout the experimental pipeline. This training protocol was designed to facilitate convergence to stable performance, enabling fair and nuanced comparisons of learning efficiency and asymptotic performance between TLNAgent and the benchmark methods.
The following three benchmarks are used in our experiments; a computational sketch of these metrics is given after the list:
  • Initial Utility Benchmark: The average reward during the initial stages of negotiation.
  • Accumulated Utility Benchmark: The accumulated utility obtained by the agent in the first 100 sessions.
  • Average Utility Benchmark: The mean utility obtained by the agent negotiating with an opponent over all fifty domains.
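The sketch below shows, under simple assumptions about how per-session utilities are logged (a list of session utilities per domain), how these three benchmarks can be computed; the ten-session warm-up window used for the initial utility is an illustrative choice rather than the exact setting of our experiments.

```python
from statistics import mean

def initial_utility(session_utilities, warmup=10):
    """Average reward over the opening sessions against an opponent
    (the warm-up window size is an illustrative assumption)."""
    return mean(session_utilities[:warmup])

def accumulated_utility(session_utilities, horizon=100):
    """Total utility collected in the first `horizon` sessions."""
    return sum(session_utilities[:horizon])

def average_utility(per_domain_utilities):
    """Mean utility against one opponent, averaged over all fifty domains.
    `per_domain_utilities` maps each domain to its list of session utilities."""
    return mean(mean(utilities) for utilities in per_domain_utilities.values())
```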

5.1.1. Initial Utility Performance

In this experiment, we compare the initial utility of TLNAgent with a learning-from-scratch baseline implemented using the SAC deep RL algorithm, which learns without leveraging prior knowledge in novel negotiation environments, across fifty domains and eight opponents (Figure 5). The results demonstrate that TLNAgent achieves significantly higher initial utility than the baseline, with an overall improvement of 40.6%. Notably, it exhibits a striking 53% improvement against the Caduceus opponent. Even for ChargingBoul, the opponent associated with the smallest improvement, TLNAgent still achieves a 21% increase in utility. These findings underscore the efficacy of the transfer learning scheme, which empowers TLNAgent to gain an early competitive edge in negotiations. By leveraging the knowledge of teachers, TLNAgent circumvents a common pitfall of RL-based agents, which often experience low utility during the initial exploration phase as they strive to identify optimal strategies. These results validate TLNAgent’s ability to reach a significant jump-start in performance, providing a crucial advantage from the onset of negotiations. This early advantage serves as a foundational cornerstone for TLNAgent’s subsequent interactions, enabling more efficient and effective negotiation outcomes.

5.1.2. Accumulated Utility Performance

Next, Figure 6 illustrates the comparison of accumulated utility between TLNAgent and the learning-from-scratch baseline during the first 100 sessions. The results across the eight ANAC-winning agents reveal that TLNAgent consistently outperforms the baseline in nearly all tested scenarios, underscoring its superior performance. On average, TLNAgent achieves a significant 28% improvement in accumulated utility compared to the baseline. Across all eight competing agents, including notable winners such as Atlas3, MatrixAlienAgent, and Ponpoko, TLNAgent demonstrates higher utility scores, with improvements ranging from 10% to 65% compared to the baseline. For example, TLNAgent achieves a striking accumulated utility of 63.3 against Atlas3, significantly surpassing the baseline's 46.9, highlighting its ability to leverage prior knowledge and optimized strategies to enhance outcomes. Even in cases where the baseline performs relatively well (e.g., reaching 32.6 vs. TLNAgent's 43.2 against AlphaBIU), TLNAgent maintains a substantial lead, indicating robustness across different agent architectures. The smallest performance gap is observed in the case of AgreeableAgent2018 (37.6 vs. 34.2), yet TLNAgent still outperforms the baseline by a margin of 10%, showcasing its reliability. These findings collectively demonstrate that TLNAgent greatly enhances the convergence speed of the agent in comparison to the baseline approach. Specifically, the transfer module plays a crucial role in accelerating the training process by leveraging the knowledge acquired from the teacher agents. This utilization of prior knowledge contributes to the agent's faster acquisition of effective negotiation strategies, resulting in improved accumulated utility.

5.1.3. Average Utility Performance

The average utility benchmark serves as an effective indicator of an agent’s overall performance and effectively demonstrates the improvement resulting from knowledge transfer. As depicted in Figure 7, TLNAgent outperforms the baseline approach for all opponents, showcasing a significant 61.6% improvement in terms of average utility. For instance, against Ponpoko, TLNAgent achieves an average utility of approximately 0.584, significantly exceeding the baseline’s 0.383, with peak values reaching 0.696 compared to the baseline’s 0.46. Notably, TLNAgent exhibits the most substantial gain when playing against Atlas3, achieving an average utility of 0.783 against 0.558 for the baseline.
This significant enhancement in performance can be primarily attributed to TLNAgent’s ability to effectively transfer valuable knowledge from multiple source policies into the learning process of the target task through its dedicated transfer module. By enabling TLNAgent to dynamically determine when and which source policy offers the highest utility for adaptive knowledge transfer, this mechanism systematically enhances its overall performance in target tasks. Through the strategic leverage of inter-policy knowledge transfer and the selective integration of source policy insights, TLNAgent demonstrates advanced decision-making competence, leading to substantial improvements in average utility across a diverse spectrum of negotiation scenarios.
To further demonstrate the capabilities of TLNAgent, our experiments incorporated two more sophisticated baselines, RLBOA [17] and Deep BPR+ Agent [34]. These baselines are able to learn from teachers, and were trained on the same opponents used to train the source policies. Experimental results are presented in Figure 8, which displays three vertically aligned heatmaps (from top to bottom) corresponding to the three aforementioned benchmarks of initial utility, accumulated utility, and average utility.
For all benchmark metrics, TLNAgent achieves superior performance by effectively leveraging knowledge transferred from teacher agents through its dedicated transfer module. Deep BPR+ Agent, utilizing the same teacher agents as the learning-from-teachers baseline, exhibits suboptimal performance, which is primarily attributed to its inefficient policy reuse mechanism. RLBOA achieves the poorest performance across the three benchmarks, as the DQN learning algorithm tends to produce less effective results compared to SAC within the negotiation task environment. Notably, the learning-from-teachers baseline exhibits no substantial performance enhancement when compared with the learning-from-scratch baseline. In specific scenarios, this baseline demonstrates inferior performance to the learning-from-scratch baseline when paired with certain opponents, suggesting that the agent is unable to effectively reuse knowledge through this approach.
These findings underscore the superiority of TLNAgent's transfer module, which allows it to efficiently utilize knowledge from teacher agents, resulting in notable performance improvements. In contrast, direct learning from teachers without the transfer module does not yield comparable advantages, highlighting the critical role of the transfer mechanism in facilitating effective knowledge transfer and enhancing agent performance.

5.2. Performance Against ANAC-Winning Agents

To evaluate the performance of TLNAgent in a traditional tournament setting as used in the ANAC, we report experimental results from a tournament featuring our agent and thirteen ANAC-winning agents. Specifically, the tournament included the top two agents from each ANAC competition held between 2015 and 2022, excluding the 2019 and 2020 editions due to their distinct negotiation setups designed to elicit user preference information. Additionally, the MiCRO agent, an effective competitor from the 2022 competition that does not employ opponent modeling or machine learning techniques yet still demonstrates strong performance, is included in the lineup. During the tournament, each agent pair engaged in 1000 episodes of bilateral negotiations. To ensure that agents using preference-learning mechanisms could gather adequate data, a maximum of 300 negotiation rounds per episode was permitted. The results are presented in Table 1, with the experiments employing the following benchmark metrics:
  • Average Utility Benchmark: The mean utility obtained by an agent p ∈ A when negotiating with every other agent q ∈ A over all domains in D, where A and D denote the set of agents and the set of domains used in the tournament, respectively.
  • Agreement Rate Benchmark: The rate of agreement achieved between the agent and all others throughout the tournament.
Table 1 showcases the performance of TLNAgent as indicated by the average utility, its 95% confidence interval, and the average agreement achievement rate. Remarkably, TLNAgent outperforms all ANAC-winning agents participating in the tournament, as evidenced by its superior average utility and agreement achievement rate. Excluding the top-tier ANAC-winning agents from 2021 and 2022, which utilize historical negotiation data, our agent achieves an average utility 70% higher than the mean performance of all other ANAC-winning agents. Even when including the 2021 and 2022 competitors in the analysis, TLNAgent outperforms the field by a 43% margin in terms of the average utility metric. This demonstrates TLNAgent's capacity to rapidly improve performance when encountering new opponents in the tournament, enabled by strategically leveraging knowledge from relevant source policies through its adaptation model and transfer module. Notably, TLNAgent achieves the highest agreement rate among all agents in the tournament, indicating that alongside seeking high-utility offers, our agent prioritizes reaching negotiated agreements. This balanced strategy of maximizing both utility and agreement rates demonstrates TLNAgent's ability to optimize across critical negotiation objectives, underpinning its superior overall tournament performance.
To provide a clearer visualization of the agents' performance in the tournament, Figure 9 displays the average utility achieved by each agent. Due to space limitations, it is impractical to present the results for all fifty domains; we therefore selected twelve representative domains, chosen according to their degree of opposition and the size of their outcome spaces. In this way, the figure provides a concise overview of how TLNAgent outperforms the other agents in terms of average utility, highlighting its superiority and consistent performance across a diverse range of negotiation scenarios.
Figure 9 reveals several interesting observations. First, TLNAgent demonstrates impressive performance against these strong ANAC-winning agents. It consistently ranks within the top three positions and achieves an average performance that surpasses the mean performance of all participants by 67.2%. This signifies the effectiveness of our knowledge transfer approach, which biases the agent's proposals towards a more favorable bidding process. Furthermore, TLNAgent achieves a notable 17.5% improvement over Atlas3, the second-best agent in the experiments. Importantly, it also exhibits greater stability, with less variance in its performance. This further highlights the advantage conferred by the knowledge transfer mechanism within TLNAgent. Interestingly, the performance of MatrixAlienAgent, the second-place agent in ANAC 2021, falls below the average in our tournament. This unexpected outcome can be attributed to the significant impact of strong agents such as TLNAgent, Agent007, and MiCRO in the most complex and highly competitive negotiation domains. Their presence exerts substantial pressure on MatrixAlienAgent, leading to a notable suppression of its average utility. These observations underscore the robust performance of TLNAgent against strong opponents, the influence of knowledge transfer on bidding strategies, and the interplay between different agents in a competitive negotiation environment.

5.3. Ablation Studies

In the next part of the study, we conducted several ablation experiments to analyze the contribution of each component to knowledge transfer. Specifically, we performed experiments in which we removed the adaptation model and GMM-UBM individually in order to assess their necessity and impact on overall performance. The remaining environmental settings were the same as in Section 5.1. The ablation studies were designed as follows:
  • TLNAgent w/o A: Updates the negotiation strategy without the adaptation model.
  • TLNAgent w/o U: Updates the negotiation strategy without the UBMs.
  • TLNAgent w/o AU: Updates the negotiation strategy without either the adaptation model or the UBMs.
Figure 10 shows the influence of these components on the performance of TLNAgent. Each column in the figure represents the proportion of utility obtained by a specific strategy under the corresponding benchmark. The x-axis denotes the opponents, numbered 1 to 8 (Agent007 to Atlas3 from Section 5.1). On the y-axis, the bar length of a given strategy for a given metric and opponent i corresponds to the share of that strategy's utility in the total utility of all strategies across all fifty domains. For example, if the first column of IU is 50% red, the initial utility obtained by TLNAgent against Agent007 across the fifty domains equals the combined initial utility of the other four strategies.
In the case of Initial Utility (IU), we can observe that TLNAgent significantly outperforms both TLNAgent w/o A and TLNAgent w/o U when negotiating with eight opponents across fifty domains. TLNAgent achieves a 35% improvement over TLNAgent w/o A and an impressive 95% improvement over TLNAgent w/o U. This can be attributed to TLNAgent's ability to effectively leverage the knowledge of the teachers through the adaptation model and GMM-UBM. Comparing TLNAgent w/o A and TLNAgent w/o U, we find that TLNAgent w/o A demonstrates better performance than TLNAgent w/o U. This is because TLNAgent w/o A can accurately evaluate the helpfulness of different teachers at the beginning of a negotiation, allowing it to make more informed decisions. In contrast, TLNAgent w/o U lacks the capability to assign appropriate weights to the source policies using the GMM-UBM, which diminishes its performance. However, even without the GMM-UBM, TLNAgent w/o U still achieves a significant 48% performance improvement over the baseline. This highlights the value of the adaptation model in avoiding negative knowledge transfer. On the other hand, TLNAgent w/o AU performs similarly to the baseline. This is likely due to the absence of both the adaptation model and the GMM-UBM, which results in a higher possibility of negative knowledge transfer.
In terms of the Accumulated Utility (CU) and Average Utility (AU) benchmarks, TLNAgent consistently achieves the highest utility among all strategies, as depicted in Figure 10. Moving on to the comparison of the other strategies, it is noteworthy that both TLNAgent w/o A and TLNAgent w/o U outperform the baseline, indicating that the presence of either the adaptation model or the GMM-UBM still provides some performance improvement. This demonstrates that both the adaptation model and the GMM-UBM contribute to transferring knowledge from the teachers to varying degrees. Without these components, TLNAgent w/o AU exhibits poor performance, even worse than the learning-from-scratch baseline. This can be attributed to the absence of mechanisms that help to effectively filter and utilize knowledge, resulting in a higher possibility of negative knowledge transfer. It is important to highlight that TLNAgent w/o A performs better than TLNAgent w/o U in terms of the CU and AU benchmarks, which is consistent with the observations made for the Initial Utility (IU) metric. However, it is worth noting that removing the adaptation model increases computational complexity, as there is no initial filtering of the teachers based on their relevance.
Lastly, we plot the agreement rate obtained by every strategy negotiating with eight opponents across fifty domains. The excellent performance of TLNAgent indicates that its strategy can not only boost the average utility but also increase the agreement rate.
Overall, these findings emphasize the critical role of both the adaptation model and the UBMs in facilitating effective knowledge transfer. While removing either single component still leaves performance above the baseline, removing both, as seen for TLNAgent w/o AU, leads to poor performance due to negative knowledge transfer.
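To make the role of the GMM-UBM weighting more concrete, the sketch below shows one plausible realization using scikit-learn: a universal background model (UBM) is fitted on bid features pooled across known opponents, each source policy's opponent is represented by an adapted GMM, and a new opponent is scored by the average log-likelihood ratio between the two models. The feature construction, component counts, and softmax normalization are assumptions for illustration, not the exact procedure inside TLNAgent.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_ubm(pooled_features: np.ndarray, n_components: int = 8) -> GaussianMixture:
    """Universal background model over bid features pooled from many opponents."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag", random_state=0)
    ubm.fit(pooled_features)
    return ubm

def source_policy_weights(ubm: GaussianMixture,
                          source_gmms: list,
                          new_opponent_features: np.ndarray) -> np.ndarray:
    """Weight each source policy by how well its adapted GMM explains the new
    opponent relative to the UBM (average log-likelihood ratio), then normalize
    the ratios with a softmax so the weights sum to one."""
    llrs = np.array([
        gmm.score(new_opponent_features) - ubm.score(new_opponent_features)
        for gmm in source_gmms
    ])
    exp = np.exp(llrs - llrs.max())  # numerically stable softmax
    return exp / exp.sum()
```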

5.4. Sensitivity Analysis

To evaluate the robustness of TLNAgent to different choices of the Wasserstein distance threshold s (see Section 4.3.1) and validate the potential of adaptive thresholding, we conducted a sensitivity study comparing fixed thresholds s ∈ {0.2, 0.3, 0.4} with a learned dynamic threshold. These experiments used the same eight ANAC-winning opponents and fifty domains as in Section 5.1, focusing on three key metrics: transfer activation rate (frequency of transfer module activation), average utility, and agreement rate.

5.4.1. Performance with Fixed Thresholds

Table 2 summarizes the results for fixed thresholds. The original threshold s = 0.3 balances transfer activation and performance, while both s = 0.2 and s = 0.4 yield worse results. We found that s = 0.2 triggers transfer in 78% of sessions, including frequent false activations (e.g., reacting to random opponent concessions); this wastes computational resources on irrelevant knowledge integration and reduces average utility by 8.6% compared to s = 0.3. In contrast, s = 0.4 activates transfer in only 22% of sessions and misses genuine strategy shifts (e.g., often failing to detect Atlas3's late-stage concession adjustments), leading to a 13.4% drop in average utility and a 15% lower agreement rate. The choice of s = 0.3 avoids both extremes, and its 45% activation rate aligns with genuine strategy shifts, thereby maximizing both utility and agreement.
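To illustrate what the threshold s controls, the following sketch assumes one-dimensional samples of observed opponent concessions collected over two consecutive windows of sessions and triggers the transfer module when the Wasserstein distance between the windows exceeds s; the windowing scheme is an assumption for illustration, not the exact detector of Section 4.3.1.

```python
from scipy.stats import wasserstein_distance

def should_activate_transfer(prev_window, curr_window, s: float = 0.3) -> bool:
    """Activate the transfer module when the shift in opponent behaviour,
    measured by the 1-D Wasserstein distance between two windows of observed
    concession values, exceeds the threshold s."""
    return wasserstein_distance(prev_window, curr_window) > s

# Lowering s (e.g., 0.2) makes the trigger fire on random fluctuations,
# while raising it (e.g., 0.4) misses genuine strategy shifts.
```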

5.4.2. Learned Dynamic Threshold

To enhance adaptability, we implemented a context-aware threshold predictor (refer to Section 4.3.1), a deep learning model that predicts s based on opponent behavior embeddings and domain metadata. Trained on historical sessions, it learns to adjust s for volatile opponents (e.g., increasing s for ChargingBoul) and stable opponents (e.g., decreasing s for Agent007). With the learned threshold, the agent achieves a transfer activation rate of 42% (adaptive to opponent volatility), an average utility of 0.652 (1.2% higher than with s = 0.3), and an agreement rate of 89% (1% higher). It outperforms fixed thresholds by balancing sensitivity: for volatile opponents, it reduces false activations (activation rate of 35% vs. 45% for s = 0.3), while for stable opponents it is able to capture subtle shifts (activation rate of 50% vs. 45% for s = 0.3).
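A minimal sketch of such a predictor is given below, assuming a fixed-size opponent-behavior embedding concatenated with domain metadata and an output bounded to an illustrative range of [0.1, 0.5]; the network sizes, feature dimensions, and training objective are assumptions and not the exact model used in our experiments.

```python
import torch
import torch.nn as nn

class ThresholdPredictor(nn.Module):
    """Context-aware predictor of the Wasserstein threshold s.

    Maps an opponent-behavior embedding plus domain metadata to a threshold
    in [s_min, s_max]; all sizes and the output range are illustrative.
    """
    def __init__(self, embed_dim: int = 32, meta_dim: int = 8,
                 s_min: float = 0.1, s_max: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + meta_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )
        self.s_min, self.s_max = s_min, s_max

    def forward(self, opponent_embedding: torch.Tensor,
                domain_metadata: torch.Tensor) -> torch.Tensor:
        x = torch.cat([opponent_embedding, domain_metadata], dim=-1)
        return self.s_min + (self.s_max - self.s_min) * self.net(x)
```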

5.5. Computational Overhead

In this subsection, we report the training time and peak memory across source policy library sizes. Overall, the training time and peak memory usage of TLNAgent scale approximately linearly with the size of the source policy library ($N_l$), as measured on a workstation with an Intel Core i9-12900K CPU, 128 GB RAM, and an NVIDIA RTX 3090 GPU across fifty ANAC domains and eight opponents. For training time, the baseline learning-from-scratch SAC agent takes 2.1 h; TLNAgent with $N_l$ = 4 takes 2.8 h (75% from the negotiation module, 14% from the critic module, 11% from the transfer module); $N_l$ = 8 takes 3.5 h (60% negotiation, 20% critic, 20% transfer); $N_l$ = 16 takes 4.9 h (43% negotiation, 27% critic, 31% transfer); and $N_l$ = 32 takes 7.2 h (29% negotiation, 35% critic, 36% transfer). The increase is driven by the critic module's need to evaluate more source policies and the transfer module's additional similarity calculations for larger $N_l$.
For peak memory (RAM + VRAM), the baseline uses 18.2 GB; TLNAgent with $N_l$ = 4 uses 24.5 GB (2.3 GB for source policies, 1.2 GB for GMM-UBM parameters, 8.5 GB for trajectory data, 12.5 GB for intermediate computations); $N_l$ = 8 uses 31.8 GB (4.7 GB source policies, 2.1 GB GMM-UBM, 8.5 GB trajectories, 16.5 GB intermediates); $N_l$ = 16 uses 45.3 GB (9.2 GB source policies, 3.8 GB GMM-UBM, 8.5 GB trajectories, 23.8 GB intermediates); and $N_l$ = 32 uses 72.1 GB (18.5 GB source policies, 7.2 GB GMM-UBM, 8.5 GB trajectories, 37.9 GB intermediates). Memory growth is dominated by source policy storage (each policy requiring 580 MB) and intermediate computations for pairwise similarity checks. Even at $N_l$ = 32, both training time and memory remain manageable, while optimizations such as sparse policy selection and incremental UBM updates can further reduce overhead.

5.6. Empirical Game-Theoretic Analysis

In the previous subsection, we investigated strategy performance from the traditional mean-score perspective. However, this does not reveal information about the robustness of these strategies. To appropriately address robustness, Empirical Game Theory (EGT) analysis [56] is applied to the competition results. Here, we consider best single-agent deviations as in [57], i.e., cases in which one agent has an incentive to unilaterally change its strategy in order to achieve a statistically significant improvement in its own utility. The aim of using EGT is to search for pure Nash equilibria, in which no agent has an incentive to deviate from its strategy by choosing another one.
We first applied the EGT technique to scenarios where exactly two players are involved, which corresponds to the common format of bilateral negotiation; in these scenarios, each agent was allowed to choose one strategy from the top eight strategies considered in the tournament experiments. For brevity, each strategy is identified by a single letter (E denotes AgreeableAgent2018, S Atlas3, P AlphaBIU, X MatrixAlienAgent, A Agent007, C ChargingBoul, M MiCRO, and T TLNAgent), and the strategy set is L = {E, S, P, X, A, C, M, T}. A profile is defined as the pair of strategies used by the two players in a game. Furthermore, the score of a specific strategy in a specific profile is calculated as its average utility achieved when playing against the other strategy in the profile across all domains.
The results are depicted in Figure 11. Under this EGT analysis, there exists only one pure Nash equilibrium: the strategy profile (T vs. M), i.e., TLNAgent versus MiCRO. This observation is of great interest, as it indicates that this strategy profile is the most stable among all possible profiles. For any non-equilibrium strategy profile, there exists a path of statistically significant deviations (strategy changes) that leads to this equilibrium. TLNAgent is always the preferred deviation target unless the current profile already contains it, and deviations towards TLNAgent account for 85% of the state transitions. Moreover, the equilibrium profile constitutes a negotiation solution with the highest social welfare (i.e., the largest sum of scores achieved by the two strategies). This is desirable because, as a measure of the negotiation benefit for all participants rather than for an individual agent, higher social welfare indicates a better overall value of the negotiation.
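For completeness, a small sketch of the equilibrium check underlying this kind of two-player EGT analysis is given below; it assumes an empirical payoff table score[(p, q)] holding the average utilities of the row and column strategies, uses strict improvement as the deviation criterion, and omits the statistical significance test applied to deviations in our actual analysis.

```python
from itertools import product

def pure_nash_profiles(strategies, score):
    """Return ordered profiles (p, q) with no profitable unilateral deviation.

    `score[(p, q)]` gives the empirical average utilities (u_p, u_q) of the
    row and column strategies; significance testing is omitted in this sketch.
    """
    equilibria = []
    for p, q in product(strategies, repeat=2):
        u_p, u_q = score[(p, q)]
        row_can_improve = any(score[(p2, q)][0] > u_p for p2 in strategies if p2 != p)
        col_can_improve = any(score[(p, q2)][1] > u_q for q2 in strategies if q2 != q)
        if not (row_can_improve or col_can_improve):
            equilibria.append((p, q))
    return equilibria

# Usage: pure_nash_profiles(list("ESPXACMT"), score) scans all 64 ordered profiles.
```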
To further verify the robustness of TLNAgent, we additionally applied the EGT technique to multi-player scenarios. However, the analysis of the fourteen-agent tournaments with the full combination of strategies is far too large to visualize. Due to space constraints, an analysis of a seven-agent tournament in which each agent can choose one of the top three strategies is displayed in Figure 12 and an analysis of a seven-agent tournament in which each agent can choose one of the top seven strategies is displayed in Figure 13. For reasons of brevity, the graph is pruned to highlight some interesting features. As such, only those nodes on the path leading to pure Nash equilibria are shown, starting from an initial profile where all agents employ the same strategy or each agent uses a different strategy. The profile (node) in the resulting graph (Figure 12 and Figure 13) is defined by the mixture of strategies used by the players in a tournament. As before, the score of a strategy in a specific profile is calculated as its average utility achieved in all domains.
Under this EGT analysis, there exists only one pure Nash equilibrium reachable from all initial profiles in these two scenarios, represented by a thick border in the figures: five agents use TLNAgent and two agents use Agent007. This equilibrium attracts all profiles; in other words, for any non-equilibrium strategy profile, there exists a path of statistically significant deviations (i.e., strategy changes) that leads to this equilibrium. In this equilibrium, there is no incentive for any agent to deviate from TLNAgent to Agent007, as doing so would decrease its payoff. This analysis further demonstrates that TLNAgent exhibits robust performance across diverse negotiation scenarios, maintaining consistent effectiveness even when faced with variations in opponent strategies.

6. Conclusions and Future Work

This paper introduces a novel automated negotiation framework termed TLNAgent that leverages transfer learning to enhance the efficiency and performance of automated negotiation. The proposed framework comprises three core components: the negotiation module, the critic module, and the transfer module. It integrates a distance discriminator and an adaptation model to systematically assess the transferability of source policies, enabling principled knowledge integration across different negotiation scenarios. Our experimental results clearly demonstrate that TLNAgent achieves a significant performance advantage over state-of-the-art agents selected from prior ANAC competitions.
TLNAgent paves the way for several promising research avenues, some of which we find particularly noteworthy. First and foremost, exploring the integration of opponent modeling techniques with our framework holds significant potential for further enhancing the efficiency of negotiation processes. Additionally, an interesting avenue for future research involves evaluating the performance of TLNAgent against human negotiators, which could shed light on its capabilities in real-world scenarios. Furthermore, extending the scope of the proposed framework to encompass concurrent negotiations would represent a substantial and impactful area of future investigation.

Author Contributions

Conceptualization, S.C.; methodology, S.C.; visualization, S.C.; writing—original draft preparation, S.C.; formal analysis, G.W.; investigation, G.W.; supervision, G.W.; writing—review and editing, G.W.; resources, S.C.; validation, G.W.; funding acquisition, S.C. All authors have read and agreed to the published version of the manuscript.

Funding

The work was funded by the National Natural Science Foundation of China (Grant Nos. 61602391, 62222311).

Acknowledgments

We want to thank Zhaoyuan Xiong at Chongqing Jiaotong University for his valuable contribution to the analysis of the experimental results. We would like to express our sincere gratitude to the anonymous reviewers for their insightful comments and constructive suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Varotto, L.; Fabris, M.; Michieletto, G.; Cenedese, A. Visual sensor network stimulation model identification via Gaussian mixture model and deep embedded features. Eng. Appl. Artif. Intell. 2022, 114, 105096. [Google Scholar] [CrossRef]
  2. Luzolo, P.H.; Elrawashdeh, Z.; Tchappi, I.; Galland, S.; Outay, F. Combining Multi-Agent Systems and Artificial Intelligence of Things: Technical challenges and gains. Internet Things 2024, 28, 101364. [Google Scholar] [CrossRef]
  3. Chen, S.; Semenov, I.; Zhang, F.; Yang, Y.; Geng, J.; Feng, X.; Meng, Q.; Lei, K. An effective framework for predicting drug-drug interactions based on molecular substructures and knowledge graph neural network. Comput. Biol. Med. 2024, 169, 107900. [Google Scholar] [CrossRef]
  4. Li, F.; Wang, D.; Li, Y.; Shen, Y.; Pedrycz, W.; Wang, P.; Wang, Y.; Zhang, W. Stacked fuzzy envelope consistency imbalanced ensemble classification method. Expert Syst. Appl. 2025, 265, 126033. [Google Scholar] [CrossRef]
  5. He, H.; Chen, D.; Balakrishnan, A.; Liang, P. Decoupling Strategy and Generation in Negotiation Dialogues. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 2333–2343. [Google Scholar] [CrossRef]
  6. Yang, R.; Chen, J.; Narasimhan, K. Improving Dialog Systems for Negotiation with Personality Modeling. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, Virtual Event, 1–6 August 2021; Volume 1. [Google Scholar]
  7. Luo, X.; Li, Y.; Huang, Q.; Zhan, J. A survey of automated negotiation: Human factor, learning, and application. Comput. Sci. Rev. 2024, 54, 100683. [Google Scholar] [CrossRef]
  8. Priya, P.; Chigrupaatii, R.; Firdaus, M.; Ekbal, A. GENTEEL-NEGOTIATOR: LLM-Enhanced Mixture-of-Expert-Based Reinforcement Learning Approach for Polite Negotiation Dialogue. In Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Walsh, T., Shah, J., Kolter, Z., Eds.; AAAI Press: Washington, DC, USA, 2025; pp. 25010–25018. [Google Scholar] [CrossRef]
  9. Vinyals, O.; Babuschkin, I.; Czarnecki, W.M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019, 575, 350–354. [Google Scholar] [CrossRef] [PubMed]
  10. Ye, D.; Chen, G.; Zhang, W.; Chen, S.; Yuan, B.; Liu, B.; Chen, J.; Liu, Z.; Qiu, F.; Yu, H.; et al. Towards Playing Full MOBA Games with Deep Reinforcement Learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020. [Google Scholar]
  11. Yang, T.; Hao, J.; Meng, Z.; Zhang, C.; Zheng, Y.; Zheng, Z. Towards Efficient Detection and Optimal Response against Sophisticated Opponents. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, ijcai.org, Macao, 10–16 August 2019; pp. 623–629. [Google Scholar]
  12. Bagga, P.; Paoletti, N.; Alrayes, B.; Stathis, K. A Deep Reinforcement Learning Approach to Concurrent Bilateral Negotiation. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, Yokohama, Japan, 11–17 July 2020; pp. 297–303. [Google Scholar]
  13. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, 2–4 May 2016; Conference Track Proceedings. 2016. [Google Scholar]
  14. Chang, H.C.H. Multi-issue negotiation with deep reinforcement learning. Knowl.-Based Syst. 2021, 211, 106544. [Google Scholar] [CrossRef]
  15. Chen, S.; Sun, Q.; You, H.; Yang, T.; Hao, J. Transfer Learning based Agent for Automated Negotiation. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2023, London, UK, 29 May–2 June 2023; ACM: New York, NY, USA, 2023; pp. 2895–2898. [Google Scholar]
  16. Chen, S.; Zhao, J.; Zhao, K.; Weiss, G.; Zhang, F.; Su, R.; Dong, Y.; Li, D.; Lei, K. ANOTO: Improving Automated Negotiation via Offline-to-Online Reinforcement Learning. In Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2024, Auckland, New Zealand, 6–10 May 2024; Dastani, M., Sichman, J.S., Alechina, N., Dignum, V., Eds.; International Foundation for Autonomous Agents and Multiagent Systems/ACM: New York, NY, USA, 2024; pp. 2195–2197. [Google Scholar]
  17. Bakker, J.; Hammond, A.; Bloembergen, D.; Baarslag, T. RLBOA: A modular reinforcement learning framework for autonomous negotiating agents. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, Montreal, QC, Canada, 13–17 May 2019; pp. 260–268. [Google Scholar]
  18. Chen, L.; Dong, H.; Han, Q.; Cui, G. Bilateral Multi-issue Parallel Negotiation Model Based on Reinforcement Learning. In Proceedings of the Intelligent Data Engineering and Automated Learning—IDEAL 2013—14th International Conference, IDEAL 2013, Hefei, China, 20–23 October 2013; Proceedings. Yin, H., Tang, K., Gao, Y., Klawonn, F., Lee, M., Weise, T., Li, B., Yao, X., Eds.; Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2013; Volume 8206, pp. 40–48. [Google Scholar]
  19. Tesauro, G.; Kephart, J.O. Pricing in Agent Economies Using Multi-Agent Q-Learning. Auton. Agents Multi-Agent Syst. 2002, 5, 289–304. [Google Scholar] [CrossRef]
  20. Bellman, R.E. Dynamic Programming; Princeton University Press: Princeton, NJ, USA, 1957. [Google Scholar]
  21. Sutton, R.; Barto, A. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  22. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M.A. Playing Atari with Deep Reinforcement Learning. arXiv 2013, arXiv:1312.5602. [Google Scholar] [CrossRef]
  23. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.A.; Fidjeland, A.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  24. Lin, L.J. Self-Improving Reactive Agents Based On Reinforcement Learning, Planning and Teaching. Mach. Learn. 1992, 8, 293–321. [Google Scholar] [CrossRef]
  25. Bagga, P.; Paoletti, N.; Stathis, K. Learnable Strategies for Bilateral Agent Negotiation over Multiple Issues. arXiv 2020, arXiv:2009.08302. [Google Scholar]
  26. Chang, H.C.H. Multi-Issue Bargaining With Deep Reinforcement Learning. arXiv 2020, arXiv:2002.07788. [Google Scholar] [CrossRef]
  27. Razeghi, Y.; Yavuz, C.O.B.; Aydogan, R. Deep reinforcement learning for acceptance strategy in bilateral negotiations. Turk. J. Electr. Eng. Comput. Sci. 2020, 28, 1824–1840. [Google Scholar] [CrossRef]
  28. Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; PMLR: Cambridge, MA, USA, 2015; pp. 1889–1897. [Google Scholar] [CrossRef]
  29. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  30. Higa, R.; Fujita, K.; Takahashi, T.; Shimizu, T.; Nakadai, S. Reward-based negotiating agent strategies. Proc. AAAI Conf. Artif. Intell. 2023, 37, 11569–11577. [Google Scholar] [CrossRef]
  31. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft Actor-Critic Algorithms and Applications. arXiv 2018, arXiv:1812.05905. [Google Scholar]
  32. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2018; pp. 1861–1870. [Google Scholar]
  33. Sengupta, A.; Mohammad, Y.; Nakadai, S. An Autonomous Negotiating Agent Framework with Reinforcement Learning Based Strategies and Adaptive Strategy Switching Mechanism. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’21, Virtual Event, 3–7 May 2021; pp. 1163–1172. [Google Scholar]
  34. Wu, L.; Chen, S.; Gao, X.; Zheng, Y.; Hao, J. Detecting and Learning Against Unknown Opponents for Automated Negotiations. In Proceedings of the PRICAI 2021: Trends in Artificial Intelligence, Hanoi, Vietnam, 8–12 November 2021; Pham, D.N., Theeramunkong, T., Governatori, G., Liu, F., Eds.; Springer: Cham, Switzerland, 2021; pp. 17–31. [Google Scholar]
  35. Arslan, F.; Aydogan, R. Actor-critic reinforcement learning for bidding in bilateral negotiation. Turk. J. Electr. Eng. Comput. Sci. 2022, 30, 1695–1714. [Google Scholar] [CrossRef]
  36. Taylor, M.E.; Stone, P. Transfer Learning for Reinforcement Learning Domains: A Survey. J. Mach. Learn. Res. 2009, 10, 1633–1685. [Google Scholar]
  37. Zhu, Z.; Lin, K.; Jain, A.K.; Zhou, J. Transfer Learning in Deep Reinforcement Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13344–13362. [Google Scholar] [CrossRef] [PubMed]
  38. Yang, T.; Wang, W.; Tang, H.; Hao, J.; Meng, Z.; Mao, H.; Li, D.; Liu, W.; Chen, Y.; Hu, Y.; et al. An Efficient Transfer Learning Framework for Multiagent Reinforcement Learning. Adv. Neural Inf. Process. Syst. 2021, 34, 17037–17048. [Google Scholar]
  39. Rusu, A.A.; Colmenarejo, S.G.; Gülçehre, Ç.; Desjardins, G.; Kirkpatrick, J.; Pascanu, R.; Mnih, V.; Kavukcuoglu, K.; Hadsell, R. Policy Distillation. In Proceedings of the 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, 2–4 May 2016; Conference Track Proceedings. Bengio, Y., LeCun, Y., Eds.; 2016. [Google Scholar]
  40. Parisotto, E.; Ba, L.J.; Salakhutdinov, R. Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning. In Proceedings of the 4th International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  41. Schmitt, S.; Hudson, J.J.; Zídek, A.; Osindero, S.; Doersch, C.; Czarnecki, W.M.; Leibo, J.Z.; Küttler, H.; Zisserman, A.; Simonyan, K.; et al. Kickstarting Deep Reinforcement Learning. arXiv 2018, arXiv:1803.03835. [Google Scholar] [CrossRef]
  42. Fernández, F.; Veloso, M.M. Probabilistic policy reuse in a reinforcement learning agent. In Proceedings of the 5th International Joint Conference on Autonomous Agents and Multiagent Systems, Hakodate, Japan, 8–12 May 2006. [Google Scholar]
  43. Li, S.; Zhang, C. An Optimal Online Method of Selecting Source Policies for Reinforcement Learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  44. Liu, I.; Peng, J.; Schwing, A.G. Knowledge Flow: Improve Upon Your Teachers. In Proceedings of the 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  45. Yang, T.; Hao, J.; Meng, Z.; Zhang, Z.; Hu, Y.; Chen, Y.; Fan, C.; Wang, W.; Liu, W.; Wang, Z.; et al. Efficient Deep Reinforcement Learning via Adaptive Policy Transfer. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, Yokohama, Japan, 11–17 July 2020. [Google Scholar]
  46. Tao, Y.; Genc, S.; Chung, J.; Sun, T.; Mallya, S. REPAINT: Knowledge Transfer in Deep Reinforcement Learning. In Proceedings of the 38th International Conference on Machine Learning, Virtual Event, 18–24 July 2021. [Google Scholar]
  47. You, H.; Yang, T.; Zheng, Y.; Hao, J.; Taylor, M.E. Cross-domain adaptive transfer reinforcement learning based on state-action correspondence. In Proceedings of the Uncertainty in Artificial Intelligence, Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, Eindhoven, The Netherlands, 1–5 August 2022. [Google Scholar]
  48. Aydoğan, R.; Festen, D.; Hindriks, K.V.; Jonker, C.M. Alternating Offers Protocols for Multilateral Negotiation. In Modern Approaches to Agent-based Complex Automated Negotiation; Fujita, K., Bai, Q., Ito, T., Zhang, M., Hadfi, R., Ren, F., Aydoğan, R., Eds.; Springer: Berlin/Heidelberg, Germany, 2017; pp. 153–167. [Google Scholar]
  49. Ramdas, A.; Trillos, N.G.; Cuturi, M. On Wasserstein Two-Sample Testing and Related Families of Nonparametric Tests. Entropy 2017, 19, 47. [Google Scholar] [CrossRef]
  50. Geary, D.N.; McLachlan, G.J.; Basford, K.E. Mixture Models: Inference and Applications to Clustering. J. R. Stat. Soc. Ser. A (Stat. Soc.) 1989, 152, 126. [Google Scholar] [CrossRef]
  51. Kinnunen, T.; Li, H. An overview of text-independent speaker recognition: From features to supervectors. Speech Commun. 2010, 52, 12–40. [Google Scholar] [CrossRef]
  52. Reynolds, D.A.; Quatieri, T.F.; Dunn, R.B. Speaker Verification Using Adapted Gaussian Mixture Models. Digit. Signal Process. 2000, 10, 19–41. [Google Scholar] [CrossRef]
  53. Campbell, W.; Sturim, D.; Reynolds, D. Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process. Lett. 2006, 13, 308–311. [Google Scholar] [CrossRef]
  54. Wan, M.; Gangwani, T.; Peng, J. Mutual Information Based Knowledge Transfer Under State-Action Dimension Mismatch. In Proceedings of the Thirty-Sixth Conference on Uncertainty in Artificial Intelligence, Virtual Online, 3–6 August 2020. [Google Scholar]
  55. Jonker, C.; Aydogan, R.; Baarslag, T.; Fujita, K.; Ito, T.; Hindriks, K. Automated negotiating agents competition (ANAC). In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
  56. Jordan, P.R.; Kiekintveld, C.; Wellman, M.P. Empirical game-theoretic analysis of the TAC supply chain game. In Proceedings of the Sixth International Joint Conference on Autonomous Agents and Multi-Agent Systems, Honolulu, HI, USA, 14–18 May 2007; pp. 1188–1195. [Google Scholar]
  57. Williams, C.; Robu, V.; Gerding, E.; Jennings, N. Using Gaussian Processes to Optimise Concession in Complex Negotiations against Unknown Opponents. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Spain, 16–22 July 2011; pp. 432–438. [Google Scholar]
Figure 1. The process of automated negotiation between Agent 1 (left) and Agent 2 (right) in a laptop purchasing scenario. The agents alternate offers under the stacked alternating offers protocol: (1) the initial proposal by Agent 1 (e.g., Dell/60 GB/23″ monitor); (2) the counterproposal by Agent 2 (e.g., HP/30 GB/21″ monitor); (3) subsequent counter-offers; and (4) mutual acceptance when the offers align with both agents’ private preference profiles (A/B). Arrow direction indicates proposal sequence.
Figure 2. An overview of TLNAgent framework and components: (A) Base Strategy: SAC policy network for opponent interaction. (B) Critic Module: Detects novel opponents (Distance Discriminator) and evaluates source policies (Adaptation Model). (C) Transfer Module: Integrates knowledge via GMM-UBM and lateral connections. Arrows indicate data flow (negotiation trajectories → critic analysis → policy transfer).
Figure 3. An illustration of the critic module and transfer module. The critic module decides whether to activate the transfer module, while the transfer module measures the helpfulness of each source policy and adaptively transfers knowledge under the guidance of the critic module. For convenience, in this paper $W_i$ represents $W_{teachers}$. Note that only the policy network is shown here, as the action network is the same.
Figure 4. GMM-UBM.
Figure 5. Performance of TLNAgent and learning-from-scratch baseline during the initial negotiation stages against eight opponents across fifty domains.
Figure 6. Comparison of TLNAgent and learning-from-scratch baseline with accumulated utility benchmark against eight ANAC-winning agents.
Figure 7. Comparison of TLNAgent and the learning-from-scratch baseline in terms of the average utility benchmark against eight ANAC-winning agents.
Figure 8. Heatmaps of TLNAgent and baselines negotiating with eight opponents. From top to bottom, the heatmaps correspond to the initial utility, accumulated utility, and average utility benchmarks, respectively.
Figure 9. Performance of all agents in terms of average utility in each domain.
Figure 10. Utility distribution in terms of Initial Utility (IU), Accumulated Utility (CU), Average Utility (AU), and Agreement Rate (AR) when negotiating with eight opponents across fifty domains.
Figure 11. Deviation analysis for two-player negotiation. Each node shows a strategy profile and the scores of the two involved strategies, with the higher-scoring one marked by a star. The arrows indicate the statistically significant deviations between strategy profiles.
Figure 12. The deviation analysis for the seven-player tournament setting in Energy using three strategies. Each node shows a strategy profile, and the highest-scoring strategy in the profile is marked by a background color. The arrows indicate statistically significant deviations between strategy profiles. The equilibria are the nodes with a thicker border and no outgoing arrow.
Figure 13. The deviation analysis for the seven-player tournament setting in Energy using seven strategies. Each node shows a strategy profile, and the highest-scoring strategy in the profile is marked by a background color. The arrows indicate statistically significant deviations between strategy profiles. The equilibria are the nodes with a thicker border and no outgoing arrow.
Table 1. Comparison of our proposed TLNAgent with thirteen ANAC-winning agents, using the average utility and average agreement achievement rate as benchmarks. The bold number in each column represents the highest performance.

| Agent | Avg. Utility | 95% CI (Lower) | 95% CI (Upper) | Avg. Agreement Rate | 95% CI (Lower) | 95% CI (Upper) |
| --- | --- | --- | --- | --- | --- | --- |
| AgreeableAgent2018 | 0.467 | 0.444 | 0.490 | 0.79 | 0.78 | 0.80 |
| Agent36 | 0.331 | 0.314 | 0.347 | 0.51 | 0.49 | 0.53 |
| PonPoko | 0.335 | 0.318 | 0.351 | 0.55 | 0.53 | 0.57 |
| CaduceusDC16 | 0.315 | 0.299 | 0.331 | 0.77 | 0.75 | 0.79 |
| Caduceus | 0.383 | 0.364 | 0.402 | 0.44 | 0.43 | 0.45 |
| YXAgent | 0.321 | 0.305 | 0.337 | 0.53 | 0.51 | 0.55 |
| Atlas3 | 0.571 | 0.542 | 0.600 | 0.82 | 0.81 | 0.83 |
| ParsAgent | 0.375 | 0.356 | 0.394 | 0.47 | 0.45 | 0.49 |
| AlphaBIU | 0.550 | 0.523 | 0.578 | 0.64 | 0.62 | 0.66 |
| MatrixAlienAgent | 0.452 | 0.426 | 0.475 | 0.59 | 0.57 | 0.61 |
| Agent007 | 0.629 | 0.598 | 0.660 | 0.57 | 0.55 | 0.59 |
| ChargingBoul | 0.566 | 0.538 | 0.594 | 0.57 | 0.56 | 0.58 |
| MiCRO | 0.588 | 0.559 | 0.617 | 0.42 | 0.39 | 0.45 |
| TLNAgent | **0.648** | 0.616 | 0.680 | **0.88** | 0.86 | 0.90 |
Table 2. Performance comparison under different Wasserstein distance thresholds.

| Threshold s | Transfer Activation Rate (%) | Average Utility | Agreement Rate (%) |
| --- | --- | --- | --- |
| 0.2 | 78 ± 0.02 | 0.592 ± 0.02 | 82 ± 0.02 |
| 0.3 (Original) | 45 ± 0.02 | 0.648 ± 0.02 | 88 ± 0.02 |
| 0.4 | 22 ± 0.02 | 0.561 ± 0.02 | 75 ± 0.02 |