Social Learning-Enhanced Deep Reinforcement Learning Through Behavioral Observation

Mehmet Dincer Erbas; Ceren Gulen

doi:10.3390/electronics15132940

Abstract

In this study, we present a novel adaptive algorithm, social learning-enhanced deep reinforcement learning (SLDRL), which integrates social learning mechanisms into deep reinforcement learning (DRL) to improve agent performance in both discrete and continuous state-space environments. The proposed hybrid control architecture enables agents to autonomously decide when and how to exploit socially acquired behaviors, balancing social learning with individual exploration through an entropy-based intrinsic motivation mechanism. The framework incorporates online imitation and enactment mechanisms that allow agents to observe and selectively reuse behavioral sequences acquired from other agents during training. We evaluate SLDRL in a sparse-reward discrete grid-based foraging task and in the dense-reward continuous-state/discrete-action CartPole problem. In both domains, SLDRL agents outperform baseline DRL agents, achieving faster learning and higher cumulative rewards. The results show that socially acquired behaviors are utilized adaptively throughout training, with the balance between imitation and individual learning emerging dynamically according to the structure of the environment and the agent’s experience. Comparisons with a behavioral cloning baseline further indicate that selectively integrating observed behaviors can yield more robust long-term learning than direct imitation of demonstration trajectories. Overall, the results demonstrate that SLDRL can effectively leverage online social learning in diverse environments.

Keywords:

deep reinforcement learning; social learning; observational learning

1. Introduction

Reinforcement learning [1] is an adaptive machine learning technique in which a learning agent tries to maximize its cumulative reward by interacting with its environment through trial and error. It depends on balancing the exploitation of current knowledge against the exploration of untested or rarely tested state-action pairs. The three main components of a reinforcement learning problem are the interactions with the environment, the set of actions that the agent can perform, and the reward, which may be delayed [1]. The environment is generally modeled as a Markov Decision Process in which the learning agent tries to maximize its cumulative reward by improving its policy. Reinforcement learning has been applied to many research areas, including real-time control [2], industrial automation systems [3], economics and finance [4] and natural language processing [5].

Classical reinforcement learning methods are known to perform efficiently in discrete environments with a finite state-action space. However, applying similar techniques to problems with continuous and large (or infinite) time and space is still an open area of research [6,7]. In recent years, a number of studies have attempted to use Artificial Neural Networks (ANN) [8] to overcome this limitation of classical reinforcement learning techniques. This has led to the emergence of a new reinforcement learning method, namely deep reinforcement learning (DRL) [9], which utilizes ANNs to solve problems in continuous time and space. With this method, learning agents can act in an unstructured state space without explicit design while still benefiting from reinforcement learning through policy improvement. As a result, DRL has been applied to many research areas that involve continuous time and space, including robotic control [10], natural language processing [11], games [12], healthcare [13], cyber security [14], and finance [15].

An important aspect of the problems that need to be solved by DRL is that they may involve sparse rewards [16], e.g., the reward is +1 when the task is achieved, and 0 otherwise. In environments with sparse rewards, many decisions must be made by the learning agent before any outcome can be received. Without receiving a proper reward for the performed actions, the learning agent may have a slow start to its learning activity. Furthermore, making a strong start and building early momentum in learning can also be beneficial in environments with dense rewards or continuous state spaces, where rewards are received at every time step, but fine-grained behavioral adjustments are required to maximize long-term returns. In both cases, an effective mechanism is needed to accelerate early-stage learning and provide guidance before the agent develops its own reliable policy; this is where social learning can play a valuable role. Social learning, or observational learning, takes place when a learning agent copies or imitates the performed actions of another agent or an expert sharing the same environment [17]. A prominent form of social learning in nature, namely imitation learning, includes a variety of mechanisms such as goal inference, conceptual interpretation and selective attention to salient cues. It enables individuals to understand the observed behaviors and adapt them to new situations [18,19]. Furthermore, imitation learning has been shown to play a significant role in cultural transmission and cumulative knowledge acquisition in humans, animals and artificial agents [20,21,22,23,24]. Social learning involves learning from others. In this sense, it differs from individual adaptive learning techniques, such as reinforcement learning, supervised learning [25] or genetic algorithms [26]. As it involves interactions among individuals, it is seen as an enhancing factor for individual adaptive learning algorithms.
In the field of robotics, developing social learning-based methods, such as Learning by Demonstration [27] or Imitation Learning [28], holds the promise of removing the need to program every behavior that an artificial agent must perform, as the agent can socially learn new behaviors by observing demonstrations of those behaviors [29]. As a result, many studies have attempted to utilize a human or robot demonstrator to train artificial agents, for instance [27,30,31,32,33]. In these studies, the learning agent had access to either the internal state or the experiences of the demonstrator agent, such that the perceived internal calculations or experiences were utilized to enhance the learning performance of the agent. In another study, Erbas et al. [34] used imitation of executed actions only as an enhancing factor to increase the learning performance of classical Q-learning agents. The learning agents updated the probability of enacting observed behaviors based on the feedback provided by the environment. It was shown that the imitation of purely observed behaviors can enhance learning.

In this study, we design a novel, biologically inspired, social learning-enhanced deep reinforcement learning (SLDRL) algorithm. The controller based on this algorithm has a hybrid structure in which socially learned behavioral demonstrations are used to enhance the learning performance of the classic DRL algorithm. As in [34], we only transfer externally observable behaviors (i.e., action sequences taken in specific environmental states) between agents. The learner agent has no access to the internal parameters, reward signals, policy networks, or value functions of the demonstrator agent. This ensures that learning is driven purely by imitation of what is seen, not by shared internal knowledge or gradients. In contrast to approaches where the internal state of the expert (e.g., Q-values, neural weights, replay buffers, or reward shaping information) is made available to the learner, our method maintains a strict separation between agents’ internal mechanisms. This separation aligns more closely with the concept of social learning observed in nature, where individuals imitate only the observed behavior of others without insight into their internal decision-making processes. Uniquely, our method also allows the learner agent to autonomously decide when and how to make use of the observed behaviors based on its own learning dynamics and performance signals, thereby preserving the balance between individual exploration and imitation. In addition to the primitive task actions that the learning agent needs to perform its task, the neural controller of the learning agent can trigger imitate and enact actions, during which the agent observes the behaviors of others and replicates the observed behaviors, respectively. In our framework, imitation and replication are implemented as a form of perfect imitation within the constraints of a simulation environment. 
That is, the learner agent observes and stores exact action sequences executed by a teacher agent under specific environmental conditions, and these sequences are later transferred without artificial noise. The replication process is deterministic in the sense that, once triggered, the learner enacts the same sequence of actions as observed, assuming a sufficiently similar initial state. However, while the technical transfer of behaviors is exact, the effectiveness of imitation can vary due to factors such as partial state similarity, or differences in the learner’s internal representation and policy evolution. Replication in SLDRL manifests through the temporary adoption of these copied action sequences, which are selectively applied based on a similarity threshold and a probabilistic decision mechanism. This allows the learning agent to decide when and how to utilize imitation, rather than blindly copying, thus preserving a balance between social learning and individual exploration. Therefore, the learning agent has full control over when and how to use social learning. Taken together, these design choices distinguish SLDRL from conventional demonstration-based and imitation-based reinforcement learning approaches by preserving independent policy learning while enabling selective reuse of externally observed behaviors. As the observed behaviors are replicated, the learning agent utilizes an entropy-based intrinsic evaluation mechanism that probabilistically favors socially acquired behaviors which reduce uncertainty in action preference, rather than directly optimizing external reward. We devise several experimental settings to examine the effectiveness of the proposed method while observing both expert and non-expert agents.

The rest of the article is organized as follows: Section 2 reviews the related work on social learning and imitation learning methods used to enhance reinforcement learning algorithms. Section 3 presents the simulation setup used to evaluate our algorithm. Section 4 describes the implementation details of the social learning-enhanced deep reinforcement learning (SLDRL) framework. Section 5 reports experiments in discrete-state environments and compares the performance of SLDRL with that of pure DRL-based methods. Section 6 presents the experimental setup and results for continuous-state environments. Finally, Section 7 concludes the article and outlines potential directions for future research.

2. Related Work

Several studies have tried to use social learning to enhance the performance of DRL algorithms. The first social learning method used for this purpose is behavioral cloning [35]. The objective of behavioral cloning is to directly learn the behavioral representations of an expert by transferring state-action pairs between the expert and the learning agent. In this sense, behavioral cloning can be seen as a straightforward type of supervised learning. On the other hand, it is reported that behavioral cloning does not generalize easily to different problems, as it is only effective on problems in which large amounts of expert behavior can be observed [36].

Another method that has been used to enhance DRL is Inverse Reinforcement Learning (IRL) [37], which focuses on learning a reward function that explains the demonstrated behaviors of an expert. IRL has achieved success in solving several reinforcement learning problems [38,39]. However, many IRL approaches require the estimation of reward functions through iterative optimization procedures, which may introduce additional computational complexity and training overhead in comparison with directly learning a control policy. This can be particularly challenging in real-time control scenarios where computational efficiency is important. Furthermore, rather than directly providing behavioral action sequences that can be reused by the learning agent, IRL methods primarily aim to infer a reward function that encourages expert-like behavior.

In recent years, there have been some attempts to design low-cost, model-free, effectively generalizable, social learning-based techniques that can directly recommend actions to learning agents. A method designed for this purpose is known as Generative Adversarial Imitation Learning (GAIL) [40], which uses Generative Adversarial Networks [41] to train an ANN with expert behavior as input. As the training of the ANN progresses, the actions chosen by the trained network gradually resemble the actions of the expert. Although GAIL has been successfully applied to some DRL problems, it has been reported that it is highly dependent on the supervision of the expert’s behaviors. When the transferred behaviors are not optimal, it tends to converge to local optima [42]. Based on this observation, Yi et al. developed the Deep Imitation Reinforcement Learning (DIRL) algorithm, which attempts to categorize observed behaviors as expert and non-expert. In this way, they attempted to reduce the dependence of GAIL algorithms on the quality of observed behaviors. Furthermore, they assigned extra rewards for behaviors that are categorized as expert, and they gradually decreased the amount of these extra rewards during the training of the learning agent. This method was proposed as a mechanism to exploit social learning, especially at the beginning of training, and individual learning during later periods of training. However, that study did not clearly explain how expert and non-expert behaviors would be categorized at the early stages of training. Moreover, the applicability of the method, which involves predefined extra rewards and their predefined decay over time, to different problem types is uncertain. In another study, the GAIL algorithm was utilized to control an autonomous underwater vehicle [43]. Similar to the previous study, a separate ANN was trained to categorize expert and non-expert behaviors. 
In addition, multiple autonomous underwater vehicles were trained simultaneously while storing state-action pairs in a shared memory space. In this way, the authors attempted to solve the problem of convergence to non-optimal solutions due to the imitation of non-expert behaviors. Furthermore, some studies on learning from observation have attempted to utilize state transitions instead of executed actions to enhance the learning performance of DRL methods through social learning [44,45].

As stated above, several studies have attempted to use social learning as an enhancing factor for DRL. However, existing approaches do not fully reflect how imitation is utilized in natural social systems. In natural systems, individuals, through imitation, have access to demonstrations of different types of behaviors in a social environment [29]. The imitating agent should first convert the observed behaviors to behaviors of its own, hence solving the correspondence problem [46], and then determine in which states or circumstances the observed behaviors are meaningful. This is only possible through a trial-and-error mechanism in which a reinforcement signal is calculated from the outcomes of performing observed behaviors, such that useful behaviors are supported while others are suppressed. In this way, the imitated behaviors that enhance the learning speed of the agent can become a part of the individual learning process of the agent. As the only source of feedback in an environment with sparse rewards is the delayed reward received when the task is achieved, we need to devise an intrinsic motivation mechanism that can favor useful observed behaviors over others. Furthermore, instead of imitating at predefined, specific time intervals, the learning agent itself should decide when and how to perform imitation.

Existing approaches differ not only in the source of socially acquired information but also in how such information is incorporated into learning. Behavioral cloning (BC) directly reproduces demonstrated state–action mappings through supervised learning, while learning from demonstrations (LfD) typically relies on expert-generated experiences to initialize or guide policy learning. Inverse reinforcement learning (IRL) attempts to infer latent reward structures from demonstrations, whereas adversarial approaches such as Generative Adversarial Imitation Learning (GAIL) seek to align generated behaviors with expert demonstrations through distribution matching. Learning-from-observation approaches further relax this requirement by relying on observed state transitions without explicit action labels.

In contrast, the perspective adopted in this study treats socially observed behaviors as reusable behavioral candidates whose usefulness is evaluated through subsequent interaction with the environment. Rather than relying on inferred rewards, direct enforcement of demonstrated behaviors, or predefined demonstration utilization strategies, the learner autonomously determines when socially acquired behaviors should be observed and selectively enacted while maintaining an adaptive balance between social and individual learning. Unlike behavioral cloning (BC), SLDRL does not directly reproduce demonstrated state–action mappings. Unlike Generative Adversarial Imitation Learning (GAIL), it does not attempt to learn an expert policy through adversarial optimization. Unlike dataset aggregation approaches such as DAGGER, SLDRL does not require iterative expert relabeling or continuous expert supervision. Instead, socially acquired behaviors remain optional behavioral candidates whose future reuse emerges through interaction-driven evaluation and intrinsic behavioral assessment. Accordingly, behavioral cloning was selected as the primary imitation-learning baseline in this study because it provides the most direct comparison against demonstration-driven behavior transfer without introducing additional reward inference or adversarial optimization mechanisms.

3. Simulation Setup

The social learning-enhanced deep reinforcement learning (SLDRL) algorithm is developed and tested on a simple foraging task in the simulated environment shown in Figure 1. The environment in which the learning agent is trained is a grid world with 10 × 10 cells. The task of a learning agent is to find the food item placed in one of the target cells on the right side (marked with T1, T2 and T3), starting from the cell at the top-left corner (marked with S). The only state that provides a reward is the target location, so the setup is an example of a sparse-reward environment. At each time unit, the agent can move to one of the eight neighboring cells of its current location. In addition to the borders that define the dimensions of the environment, obstacles are placed in the environment to limit the agent’s movement. The agent receives its current spatial coordinates from the environment, which are used as the input to the neural controller. These coordinates define the agent’s location in the environment and serve as the state representation for learning. If the selected action would move the agent into a border or an obstacle, the agent stays in its current location. Once the agent arrives at the target location, it returns to its starting position and starts moving again. An experiment run ends after 10,000 time units, which gives the path to the target location enough time to converge to the shortest one.
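The environment dynamics described above can be illustrated with a minimal sketch. It is not the authors' simulator: the class name `ForagingGrid`, the (x, y) coordinate convention, the default target and obstacle cells, and the reward magnitude of 1.0 are all assumptions made for illustration, since the actual layout is only given in Figure 1.

```python
GRID_SIZE = 10
START = (0, 0)  # cell S in the top-left corner; (x, y) with y growing downward (assumed)

# The eight moves to neighbouring cells, as (dx, dy) offsets.
MOVES = [(-1, -1), (0, -1), (1, -1), (-1, 0), (1, 0), (-1, 1), (0, 1), (1, 1)]


class ForagingGrid:
    """Sparse-reward grid world; default target and obstacle cells are placeholders."""

    def __init__(self, target=(9, 5), obstacles=frozenset({(4, 3), (4, 4), (4, 5)})):
        self.target = target
        self.obstacles = obstacles
        self.pos = START

    def step(self, action):
        """Apply one of the eight moves and return (next_state, reward)."""
        dx, dy = MOVES[action]
        nx, ny = self.pos[0] + dx, self.pos[1] + dy
        inside = 0 <= nx < GRID_SIZE and 0 <= ny < GRID_SIZE
        if inside and (nx, ny) not in self.obstacles:
            self.pos = (nx, ny)  # otherwise the agent stays where it is
        reached = self.pos
        if reached == self.target:
            self.pos = START  # the agent returns to S after finding the food
            return reached, 1.0  # reward magnitude is an assumption
        return reached, 0.0


# Usage: a tiny obstacle-free layout with the food two cells to the right of S.
env = ForagingGrid(target=(2, 0), obstacles=frozenset())
bump = env.step(0)    # up-left from the corner: blocked, agent stays at S
first = env.step(4)   # move right to (1, 0)
second = env.step(4)  # reach the target at (2, 0), then reset to S
```

In this sketch the transition that reaches the food reports the target cell as the next state and resets the agent afterwards; whether the original simulator reports the target or the start cell at that step is not stated in the text.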

Figure 1. Simulation setup for the foraging task that the agents are trained to achieve.

As stated above, the performance of the SLDRL algorithm is tested against the pure DRL algorithm. For this purpose, pure DRL agents are controlled by a deep Q-network (DQN) [47], as shown in Figure 2. DQN is a deep reinforcement learning framework that is used for online control of smart systems [48]. It combines Q-learning iterations into a series of batch reinforcement learning procedures. The DQN framework includes two separate ANNs, namely the prediction and target networks. The prediction network is responsible for estimating the current Q values, hence controlling the behaviors of the agent, while the target network keeps the old parameters that are used to compute the update targets for the current Q values. As the agent is trained, the experiences of the agent are saved in a replay memory, in the form of

$$(s, a, r, s')$$

where s, a, and r are the current state, the selected action, and the received reward, respectively, and s′ is the next state. The prediction network is updated regularly using a random set of experiences sampled from the replay memory. Once several batch updates are completed on the prediction network, its parameters are copied into the target network. At every update, the prediction network is updated toward targets computed by the target network so as to minimize the mean-squared Bellman error between the two networks, using the loss function given below:

$$L_i(\theta_i) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_i^{-}) - Q(s, a; \theta_i)\right)^{2}\right] \tag{1}$$

Figure 2. General architecture of the DQN algorithm, adapted from [47].

In this way, at iteration i, the current parameters $\theta_i$ are updated based on the old parameters $\theta_i^{-}$ by a stochastic gradient descent algorithm. Based on the updated Q values of the prediction network, the next action is selected with an ε-greedy approach.
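As a rough illustration of the update described above, the following sketch computes the Bellman targets and the mean-squared Bellman error for a sampled minibatch, and selects actions with an exploration rate ε. It assumes that the two networks are available as plain callables mapping a state to a list of Q values; the function names, the discount factor of 0.99, and the ε-greedy form of action selection are assumptions of this sketch rather than details given in the text.

```python
import random


def td_targets(batch, q_target, gamma=0.99):
    """r + gamma * max_a' Q(s', a'; old parameters) for every (s, a, r, s') in the batch."""
    return [r + gamma * max(q_target(s_next)) for (_, _, r, s_next) in batch]


def bellman_loss(batch, q_pred, q_target, gamma=0.99):
    """Mean-squared Bellman error of Equation (1) over a sampled minibatch."""
    targets = td_targets(batch, q_target, gamma)
    errors = [(y - q_pred(s)[a]) ** 2 for y, (s, a, _, _) in zip(targets, batch)]
    return sum(errors) / len(errors)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


# Usage with hypothetical constant Q functions standing in for the two networks.
def q_pred(state):
    return [0.0, 1.0]


def q_target(state):
    return [0.5, 2.0]


batch = [((0, 0), 1, 1.0, (1, 0))]
loss = bellman_loss(batch, q_pred, q_target, gamma=0.5)  # target 2.0, squared error 1.0
greedy = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)    # no exploration: index 1
```

The sketch has no terminal-state flag because the foraging task runs continuously for a fixed number of time units; an episodic task would need one.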

4. Social Learning-Enhanced Deep Reinforcement Learning

As stated above, the SLDRL algorithm is designed to be low-cost, model-free, and modular, allowing future adaptation to different reinforcement learning settings. Our approach implements a form of social learning in which an agent selectively observes and replicates structured sequences of actions performed by other agents in similar environmental contexts. This mechanism aligns with imitation-based strategies in the taxonomy of social learning, as described by Hoppitt [19], Laland [20] and Rendell et al. [21], where the learner reproduces behavior after observing demonstrators, without direct access to their internal states or goals. Unlike simpler mechanisms such as stimulus enhancement or local enhancement—which merely direct attention toward locations or objects—our method allows the learner to encode, evaluate, and enact multi-step behavioral sequences. The imitation process is context-dependent and filtered by the learner’s internal assessment of utility, thereby enabling a dynamic balance between individual exploration and socially guided exploitation. For this purpose, it is necessary to define imitation as an action that can be performed by the learning agent and to design an internal feedback mechanism that allows the evaluation of imitated behaviors. Based on this approach, we designed the hybrid control structure that can be seen in Figure 3. The SLDRL controller consists of two decision-making components:

Figure 3. Control architecture of the proposed SLDRL agent. The controller consists of an upper deep reinforcement learning (DRL) layer and a lower Social Learning (SL) layer. The DRL layer contains the DQN controller responsible for action selection and policy updates during task learning. The SL layer contains the components required for social learning, including behavior memory, similarity matching, entropy-based behavior evaluation, and behavior selection mechanisms for storing and retrieving observed behaviors. The figure illustrates the information flow and interaction between the DRL and SL layers, while the internal DQN optimization process is separately detailed in Figure 2.

(1): DRL module, implemented using deep Q-networks (DQNs), responsible for selecting actions to solve the task.
(2): Social learning (SL) module, responsible for storing and retrieving observed action sequences for imitation.

During training, the DRL module can choose from a set of available actions, which include not only primitive task actions, but also two meta-actions: imitate and enact. The imitate and enact operations are treated as ordinary actions within the DQN action space. Consequently, their Q-values are learned through the same Bellman update rule used for all primitive actions.

When the DRL module selects the imitate action:

The agent remains stationary for 5 time steps and observes another agent’s behavior in the shared environment.
The 5-step sequence of observed actions, along with the observer’s position at the start of observation, is stored by the SL module.

Although the imitate action spans multiple time steps, it is implemented as a sequence of standard single-step transitions. At each step, the agent remains in the same state ($s_t = s_{t+1}$) and receives zero reward ($r_t = 0$). Each step generates a transition $(s_t, a_{\mathrm{imitate}}, r_t, s_{t+1})$, which is stored in the replay buffer and used to update the Q-network via the standard one-step Bellman update. Thus, the temporally extended imitate action is decomposed into standard DQN updates without any aggregation across steps.

When the DRL module selects the enact action:

The SL module searches its memory for previously observed sequences that began near the agent’s current location.
If such a sequence is found, the agent executes it step-by-step.
As each action is executed, a transition $(s, a, r, s^{'})$ is generated, stored in the replay buffer, and the Q-network is updated using the standard one-step Bellman update.
After execution, the SL module updates the selection probability of the enacted sequence based on a reinforcement signal derived from the reduction in policy entropy.

Similar to the imitate action, the enact operation is also decomposed into standard single-step transitions. Each action in the executed sequence produces its own transition and corresponding update. No aggregated or multi-step update is applied over the entire sequence. Thus, both imitate and enact are temporally extended at the decision level but are implemented as sequences of standard one-step transitions at the learning level.

If, during execution, any part of the stored sequence becomes invalid—for example, due to an obstacle or boundary—the enact operation is aborted, and the DRL module selects a new action as usual. This constraint further reinforces behavioral fidelity and aligns with our goal of modeling non-adaptive imitation based on direct observation. Importantly, stored behaviors are only considered valid if they were originally observed in the same region of the state space (defined as the agent’s current location and its eight adjacent grid cells). This ensures that the SLDRL agent imitates only purely observed behaviors without generalization or adaptation.
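The imitate and enact operations described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `BehaviorMemory`, `imitate`, `enact` and the `step_fn` callable are hypothetical names, a blocked move is detected by the state not changing, the transition of a blocked move is not stored, and enacted steps are recorded with their primitive action label. The text does not specify any of these three details.

```python
SEQ_LEN = 5  # sequence length used in the main experiments


class BehaviorMemory:
    """Stores observed behaviors as (start_position, action_sequence) pairs."""

    def __init__(self):
        self.behaviors = []

    def store(self, start_pos, actions):
        self.behaviors.append((start_pos, tuple(actions)))

    def candidates(self, pos):
        """Behaviors first observed at pos or at one of its eight neighbours."""
        return [
            (start, actions) for start, actions in self.behaviors
            if max(abs(start[0] - pos[0]), abs(start[1] - pos[1])) <= 1
        ]


def imitate(state, observed_actions, imitate_id, memory):
    """Store the observed sequence; emit one zero-reward self-transition per step."""
    memory.store(state, observed_actions)
    return [(state, imitate_id, 0.0, state) for _ in observed_actions]


def enact(state, behavior, step_fn):
    """Replay a stored sequence step by step, aborting at the first blocked move."""
    _, actions = behavior
    transitions = []
    for action in actions:
        next_state, reward = step_fn(state, action)
        if next_state == state:  # assumption: a blocked move leaves the state unchanged
            break
        transitions.append((state, action, reward, next_state))
        state = next_state
    return transitions


# Usage with a hypothetical corridor in which moves succeed only while x < 4.
def corridor_step(state, action):
    x, y = state
    return ((x + 1, y) if x < 4 else state), 0.0


memory = BehaviorMemory()
observed = imitate((2, 2), [4] * SEQ_LEN, imitate_id=8, memory=memory)
nearby = memory.candidates((3, 3))  # start cell (2, 2) is adjacent to (3, 3)
far = memory.candidates((4, 2))     # two cells away: no candidate
replayed = enact((2, 2), nearby[0], corridor_step)  # aborts once x reaches 4
```

Each returned tuple is an ordinary one-step transition, so it can be pushed into the replay buffer and used in the standard update without any multi-step aggregation.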

As stated above, during every imitate operation, a demonstrated behavior that consists of a series of actions will be observed and saved in the memory space of the SL layer. In this context, a behavior b refers to a stored observed action sequence together with the state in which the observation started. To provide the reinforcement signal for a specific enact operation, the execution of the set of actions is taken as an episode and the change in relevant Q values during the episode is examined. For this purpose, the action-based entropy in a specific state s is calculated by:

$$\mathrm{Entropy}_{s} = -\sum_{a \in A} p_{s,a} \log\left(p_{s,a} + \delta\right) \tag{2}$$

In the formula, $p_{s,a}$ is the probability of choosing action a in state s. Since the normalization in (3) and the probability computation in (4) can assign a value of zero to some actions, a small constant δ > 0 is added inside the logarithm to avoid an undefined evaluation when $p_{s,a} = 0$. In our implementation, δ is set to 10⁻⁸. The selection probability of each action at a specific state s is calculated based on the normalized Q values of each action in state s:

$$Q^{\mathrm{norm}}_{s,a} = \frac{Q(s, a) - Q(s, a_{\min})}{\left(Q(s, a_{\max}) - Q(s, a_{\min})\right) + \delta} \tag{3}$$

In the formula, $Q(s, a_{\min})$ and $Q(s, a_{\max})$ are the lowest and highest Q values among the actions in state s, respectively. To prevent division by zero in Equation (3), the same constant δ used for numerical stability in the entropy calculation (10⁻⁸ in our implementation) is added to the denominator. Once the normalized Q values at a specific state are determined, the selection probability of each action at that state is calculated as follows:

$$p_{s,a} = \frac{Q^{\mathrm{norm}}_{s,a}}{\sum_{a' \in A} Q^{\mathrm{norm}}_{s,a'}} \tag{4}$$
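Equations (2) to (4) can be combined into a short sketch that maps the Q values of one state to an action-based entropy. The function names are illustrative, and two points are assumptions of this sketch: the logarithm is taken as the natural logarithm, and a uniform distribution is used when all Q values are equal, a case in which Equation (4) would otherwise evaluate to 0/0.

```python
import math

DELTA = 1e-8  # numerical-stability constant, as stated in the text


def action_probabilities(q_values, delta=DELTA):
    """Min-max normalise the Q values (Eq. 3) and rescale them to sum to one (Eq. 4)."""
    q_min, q_max = min(q_values), max(q_values)
    q_norm = [(q - q_min) / ((q_max - q_min) + delta) for q in q_values]
    total = sum(q_norm)
    if total == 0.0:
        # All Q values are equal, so Eq. (4) would be 0/0. The uniform fallback
        # is an assumption of this sketch, not something stated in the text.
        return [1.0 / len(q_values)] * len(q_values)
    return [q / total for q in q_norm]


def action_entropy(q_values, delta=DELTA):
    """Action-based entropy of one state (Eq. 2), in nats."""
    probs = action_probabilities(q_values, delta)
    return -sum(p * math.log(p + delta) for p in probs)


# Worked example: Q = (0, 1, 3) gives p = (0, 0.25, 0.75), entropy of about 0.562.
example_probs = action_probabilities([0.0, 1.0, 3.0])
example_entropy = action_entropy([0.0, 1.0, 3.0])
flat_entropy = action_entropy([1.0, 1.0, 1.0, 1.0])  # about ln 4, the maximum for 4 actions
peaked_entropy = action_entropy([0.0, 0.0, 5.0])     # about 0
```

Because of the min-max normalization, the lowest-valued action always receives probability zero, so the entropy reflects how the remaining actions are ranked rather than the absolute scale of the Q values.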

We utilize action-based entropy to measure the randomness in the normalized Q values of the state-action pairs. If every state-action pair in a specific state has a similar normalized Q value, then we measure a high entropy for that state. However, if one or more state-action pairs have relatively higher normalized Q values than the other state-action pairs in the same state, we measure a relatively lower entropy. When the learning agent triggers the enact action, it visits a number of states determined by the previously observed behavior that is enacted. While the agent goes through these states, we calculate the action-based entropy of each state before and after the state is visited. In this way, we determine the change in the sum of entropy over all visited states. A decrease in this sum means that action selection has become less random in the visited states. We take such a decrease as a positive reinforcement signal that makes the enacted behavior more likely to be selected in the future. For a specific behavior b, the entropy decrease count is updated according to:

$$b_{\mathrm{decrease}} = b_{\mathrm{decrease}} + 1, \quad \text{if } \sum_{s \in S_{b}} \mathrm{Entropy}_{\mathrm{after}, s} < \sum_{s \in S_{b}} \mathrm{Entropy}_{\mathrm{before}, s} \tag{5}$$

where $S_{b}$ denotes the set of states visited during the enactment of behavior b, and $\mathrm{Entropy}_{\mathrm{before}, s}$ and $\mathrm{Entropy}_{\mathrm{after}, s}$ denote the action-based entropy of state s measured before and after visiting the state, respectively. Otherwise, the value of $b_{\mathrm{decrease}}$ remains unchanged.

Based on this approach, the entropy decrease rate for a behavior b is calculated as:

$$R_{b} = \frac{b_{\mathrm{decrease}} + 1}{b_{\mathrm{enacted}} + 1} \tag{6}$$

In Equation (6), $b_{\mathrm{enacted}}$ denotes the total number of episodes during which behavior b is enacted, and $b_{\mathrm{decrease}}$ denotes the number of enactment episodes in which a decrease in the cumulative action-based entropy of the visited states is observed. Finally, based on the entropy decrease rates, the probability that a specific behavior b is selected for enactment is calculated by:

$$\mathrm{ProbSelect}_{b} = \frac{R_{b}}{\sum_{b' \in B} R_{b'}} \tag{7}$$

where B is the set of observed behaviors currently stored in the memory space of the SL layer. In this way, the observed behaviors that cause a decrease in the cumulative action-based entropy of the visited states during an enactment episode become more likely to be selected.
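The bookkeeping in Equations (5) to (7) can be sketched as below. The class and function names are hypothetical, and the sketch assumes that every enactment episode, including an aborted one, increments the enactment count; the text does not say how aborted episodes are counted.

```python
import random


class BehaviorStats:
    """Per-behavior counters used in Equations (5) and (6)."""

    def __init__(self):
        self.decrease = 0
        self.enacted = 0

    def update(self, entropy_before, entropy_after):
        """Eq. (5): count the episode if the summed entropy of visited states fell."""
        self.enacted += 1
        if sum(entropy_after) < sum(entropy_before):
            self.decrease += 1

    @property
    def rate(self):
        """Eq. (6): smoothed entropy decrease rate."""
        return (self.decrease + 1) / (self.enacted + 1)


def selection_probabilities(stats):
    """Eq. (7): normalise the decrease rates over all stored behaviors."""
    total = sum(s.rate for s in stats)
    return [s.rate / total for s in stats]


def select_behavior(stats, rng=random):
    """Draw the index of a behavior in proportion to its decrease rate."""
    return rng.choices(range(len(stats)), weights=[s.rate for s in stats], k=1)[0]


# Worked example with three stored behaviors.
useful, useless, fresh = BehaviorStats(), BehaviorStats(), BehaviorStats()
for fell in (True, True, True, False):  # entropy fell in 3 of 4 enactments
    useful.update([1.0, 1.0], [0.5, 1.0] if fell else [1.0, 1.2])
for _ in range(4):                      # entropy never fell in 4 enactments
    useless.update([1.0], [1.0])
rates = [b.rate for b in (useful, useless, fresh)]            # 0.8, 0.2, 1.0
probs = selection_probabilities([useful, useless, fresh])     # 0.4, 0.1, 0.5
chosen = select_behavior([useful, useless, fresh])
```

The "+1" terms in Equation (6) give a behavior that has never been enacted a rate of 1, the highest possible value, so newly observed behaviors are tried at least as readily as any previously evaluated one.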

In our SLDRL implementation, we use sequences of five consecutive actions to represent socially observed behaviors. The action-sequence length was determined experimentally through a sensitivity analysis using sequence lengths of 2, 5, and 10 actions. The results, presented in Appendix B, show that all three values lead to similar qualitative learning behavior, indicating that the framework is reasonably robust to this parameter within the tested range. However, a sequence length of 5 provides the best overall performance and achieves statistically significant advantages at several stages of training. Accordingly, the action-sequence length was fixed at 5 in the main experiments as a compromise between behavioral richness and contextual robustness. The learner observes only the actions executed by the demonstrator and does not have access to the demonstrator’s internal states or learning parameters. Imitation in SLDRL is therefore context-sensitive: stored behaviors are enacted only when the learner encounters states sufficiently similar to the conditions under which the behaviors were originally observed.

To further examine the contribution of the individual mechanisms introduced in the proposed SLDRL framework, an additional ablation study was conducted. Specifically, we evaluated variants in which either the state similarity matching mechanism or the entropy-based reinforcement mechanism was removed while preserving the underlying DQN learning process. Detailed results and analysis of the ablation experiments, together with additional robustness experiments under imperfect behavioral demonstrations, are provided in Appendix B.

While Figure 2 and Figure 3 describe the internal architectures of the DQN and Social Learning components separately, Figure 4 presents a high-level block diagram illustrating their overall interaction throughout the learning cycle. The diagram provides an end-to-end overview of the proposed SLDRL framework, from state observation and action selection to environment feedback and policy update. The corresponding step-by-step operational procedure of the proposed SLDRL framework is presented in Algorithm 1.

Algorithm 1. SLDRL Behavioral Cycle.

Initialize DQN parameters θ
Initialize social learning memory B
for each timestep t do
    Observe state s_{t}
    Select action a_{t} using ε-greedy Q(s_{t}, a)
    if a_{t} = imitate then
        Observe a 5-step demonstrator sequence b
        Store b with start state s_{b} in B
        for each observation step do
            Store transition (s_{t}, a_{imitate}, 0, s_{t})
            Update DQN using the standard one-step Bellman update
        end for
    else if a_{t} = enact then
        Select behavior b from memory B using probabilities {ProbSelect}_{b}
        for each action a_{k} in b do
            Execute a_{k}
            Observe reward r_{t} and next state s_{t+1}
            Store transition (s_{t}, a_{k}, r_{t}, s_{t+1})
            Update DQN using the standard one-step Bellman update
            s_{t} \leftarrow s_{t+1}
        end for
        Compute entropy change ΔH for the visited states
        Update b_{decrease}, b_{enacted}, and {ProbSelect}_{b}
    else
        Execute primitive action a_{t}
        Observe reward r_{t} and next state s_{t+1}
        Store transition (s_{t}, a_{t}, r_{t}, s_{t+1})
        Update DQN using the standard one-step Bellman update
        s_{t} \leftarrow s_{t+1}
    end if
end for
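To make the "standard one-step Bellman update" in Algorithm 1 concrete, the sketch below computes the regression target for a stored transition. It is a simplified illustration rather than the implementation used in the experiments: the function name is hypothetical, the Q values of the next state are passed in directly instead of being produced by a target network, and the discount factor of 0.9 is the value reported for the discrete-state experiments.

```python
def bellman_target(reward, next_q_values, gamma=0.9, terminal=False):
    """One-step target: r + gamma * max_a' Q(s', a'), or r at a terminal state."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)

# Transition stored during observation: (s_t, a_imitate, 0, s_t).
# The reward is 0 and the next state equals the current state, so the
# target is the discounted best Q value of the state the agent stays in.
q_of_current_state = [0.5, 1.0, 0.2]  # illustrative values only
imitate_target = bellman_target(0.0, q_of_current_state)  # 0.9 * 1.0
```

Read this way, the transition stored for each observation step carries no immediate reward and leaves the agent in the same state, so the value learned for the imitate action reflects only what the agent can subsequently achieve from that state.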

Figure 4. High-level block diagram of the proposed SLDRL framework. The diagram provides an end-to-end overview of the interaction between the DQN controller, the Social Learning module, and the environment throughout the learning cycle.

5. Experiments in Discrete State Space Environments

To test the performance of the SLDRL algorithm, we compare its task learning performance with two baseline approaches: a pure deep reinforcement learning (DRL) agent consisting only of the DRL layer with the DQN controller, and an imitation learning baseline based on behavioral cloning (BC). The prediction and target networks of the DQN used in both the SLDRL and pure DRL agents consist of deep artificial neural networks with two hidden layers. This architecture was selected based on common practices in the literature for deep Q-learning in discrete environments such as grid-based foraging, providing sufficient representational capacity to approximate the Q-function while maintaining computational efficiency and training stability.

To provide a stronger comparison with imitation learning approaches, we additionally implement a behavioral cloning baseline. Behavioral cloning is a commonly used imitation learning technique in which a policy is trained through supervised learning to reproduce demonstrated actions by learning a mapping from observed states to actions. In our implementation, the BC agent receives exactly the same behavioral demonstrations that are available to the SLDRL agent in each experimental scenario. Specifically, the action sequences that are provided to the SLDRL agent during imitation events (such as predefined trajectories, expert demonstrations, or trajectories generated by other agents) are also used as training samples for behavioral cloning. Each training sample consists of a state–action pair (s, a), where s represents the observed state and a denotes the demonstrated action. A policy network with the same architecture as the DQN prediction network is trained using supervised learning with a cross-entropy loss function to imitate the demonstrated actions. After this imitation phase, the trained policy is used to initialize a DQN agent, which then continues learning through reinforcement learning using the same training setup as the pure DRL baseline.

This experimental design allows us to directly evaluate whether simple imitation of demonstrated behaviors is sufficient to improve learning performance, or whether the adaptive and selective social learning mechanism implemented in SLDRL provides additional benefits beyond direct behavioral replication. To improve the reproducibility of the experiments, all neural network architectures, hyperparameters, and training settings used in the experiments—including those of the SLDRL, pure DRL, and behavioral cloning models—are summarized in detail in Appendix A. All reported results were obtained from independent experimental repetitions (n = 120 for discrete-state experiments and n = 100 for continuous-state experiments). Performance comparisons between methods were performed independently at each evaluation point (every 100 environment steps in discrete experiments and every episode in continuous experiments). Paired t-tests were applied to paired observations collected across independent runs. Prior to analysis, the distribution of paired differences was inspected to verify the suitability of the parametric test. To improve robustness against potential deviations from normality assumptions, all statistical conclusions were additionally validated using the Wilcoxon signed-rank test. Since repeated hypothesis testing was performed across multiple time points, the Benjamini–Hochberg (BH) procedure was applied separately for each comparison family to control the false discovery rate (FDR = 0.05). Reported statistically significant intervals correspond to time ranges that remained significant after BH correction. Below, we present different experimental scenarios based on the simulation setup introduced in Section 3.
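The Benjamini–Hochberg step described above can be sketched in a few lines of Python. The sketch assumes that one p-value per evaluation point has already been obtained from the paired tests; the function name and the example p-values are hypothetical and are not taken from the experiments.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return one boolean per p-value: True if it stays significant at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= (k / m) * fdr.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            largest_k = rank
    # Every hypothesis ranked at or below k is rejected.
    significant = [False] * m
    for idx in order[:largest_k]:
        significant[idx] = True
    return significant

# Illustrative p-values for five evaluation points of one comparison family.
flags = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6])
```

In this example only the first two evaluation points remain significant, although four of the five raw p-values are below 0.05. Applying the function separately to each comparison family corresponds to the per-family correction described above.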

5.1. Observing Pre-Defined Trajectories

In the first set of experiments, we examine the performance of an SLDRL agent that can imitate pre-defined trajectories. For this purpose, when the SLDRL agent triggers an imitate action, it is given 5 consecutive actions in one of the eight compass directions, as shown in Figure 5. Since these pre-defined trajectories are not demonstrated by another agent, they are defined so that they can be tested in every region of the state–action space. As can be seen, some of the given trajectories are not helpful in specific regions of the state–action space because enacting them cannot take the agent close to the target location. Thus, the SLDRL agent needs to discover in which parts of the state–action space the given trajectories may enhance its learning. We compare the performance of the SLDRL agent with the two baseline approaches introduced above: a pure DRL agent that learns only through individual reinforcement learning and a behavioral cloning (BC) baseline. An experiment run is completed when the learning agent has executed 10,000 actions, by which time the Q values of the state–action pairs have converged to their final values. Every 100 time units, greedy action selection is applied to the current Q values to determine the shortest path (that is, the policy improvement) achieved by the learning agent so far. We perform 120 independent experiment runs, 40 with each of the three target locations shown in Figure 1.

Figure 5. Pre-defined trajectories given to the SLDRL agent. Each trajectory consists of 5 consecutive actions towards one of the eight directions of the compass.

Figure 6 presents the results for this experiment setting. In addition to the pure DRL baseline, the figure also includes the behavioral cloning (BC) baseline that is trained using the same predefined trajectories provided to the SLDRL agent. As can be observed, the SLDRL agent achieves a faster improvement in best path length compared to both the pure DRL agent and the behavioral cloning baseline. While the BC agent benefits from the demonstrated trajectories and therefore shows slightly faster initial improvement than the pure DRL agent, its learning speed remains slower than that of the SLDRL agent. This indicates that directly imitating demonstrated behaviors can provide some early learning benefits, but the adaptive mechanism of SLDRL enables more effective integration of these behaviors into the reinforcement learning process.

Figure 6. Results for observing pre-defined trajectories. The results show the averages of 120 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals. The agent that uses the pure DRL algorithm is solely controlled by the DQN while the SLDRL agent is directed by the hybrid control architecture shown in Figure 3. The parameters that are used in the DQN in all experiment setups are set as follows: α = 0.01, γ = 0.9 and ε = 0.3. These parameter values follow common practice in the literature for training DQN agents on the grid-based foraging problem, ensuring comparability with prior studies and stable learning performance.

To statistically evaluate these differences, paired t-tests were conducted between the compared methods. The results show that the SLDRL agent learns significantly faster than both the pure DRL agent and the behavioral cloning agent from the 1100th to the 4800th time unit (paired t-test, p < 0.05 after Benjamini–Hochberg correction). In contrast, the comparison between the behavioral cloning agent and the pure DRL agent does not reveal a statistically significant difference throughout the entire training period. To further validate these findings, the non-parametric Wilcoxon signed-rank test was also performed for each pairwise comparison, confirming the same statistical conclusions. Furthermore, to account for multiple comparisons across time points, the Benjamini–Hochberg correction was applied. These results demonstrate that the improved learning speed of SLDRL is statistically significant over a broad temporal interval rather than being limited to a narrow portion of the training process. In the discrete-state experiments, performance differences are therefore interpreted primarily through learning efficiency and training dynamics rather than final converged outcomes.

Figure 7 shows the number of imitate and enact actions in every 100 actions executed by the learning agent. As can be seen, the learning agent triggers these actions, and hence utilizes social learning, more often during the early stages of training. As the learning agent becomes more experienced, social learning-related actions are triggered less often, so that the individual learning mechanism of the agent becomes more effective. Thus, in contrast to studies in which social learning is activated at fixed rates and intervals, a balance between the individual and social learning mechanisms emerges without being preprogrammed.

Figure 7. Average number of triggered imitate and enact actions in every 100 actions that are executed by the agent observing pre-defined trajectories. The results show the averages of 120 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals.

5.2. Observing an Expert

In the second set of experiments, we train an SLDRL agent that can observe the performed actions of an expert. For this purpose, based on the different configurations of the target location, we present an agent that follows the trajectories shown in Figure 8. As can be seen, the expert starts from the initial position and follows the shortest path to the corresponding target locations. Whenever the SLDRL agent triggers an imitate action, five consecutive actions from the expert’s trajectory are provided to the learning agent. As in the previous experiments, the learning agent stores the starting position of each observed behavior and later enacts these behaviors only within the corresponding region of interest as described in Section 4. In this experimental setup, we compare the performance of the SLDRL agent with two baseline methods: a pure DRL agent that learns entirely through individual reinforcement learning and a behavioral cloning (BC) baseline that directly imitates the expert demonstrations.

Figure 8. The movement trajectories of the expert agent. The expert starts from the initial position and follows the shortest path for each of the three target locations.

Figure 9 presents the results for this experiment setting. As expected, the behavioral cloning agent achieves the fastest initial learning since it directly imitates the optimal trajectories demonstrated by the expert. Because the demonstrations correspond to near-optimal solutions, direct supervised imitation enables the BC agent to rapidly acquire an effective policy. The SLDRL agent also benefits substantially from observing the expert and achieves significantly faster learning than the pure DRL agent. However, its learning speed is slightly lower than that of the BC agent, since SLDRL integrates demonstrated behaviors selectively through its entropy-based reinforcement mechanism rather than directly copying them.

Figure 9. Results for observing an expert. The results show the averages of 120 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals.

Statistical analysis further supports these observations. A paired t-test and a Wilcoxon signed-rank test reveal a statistically significant difference between the SLDRL and pure DRL agents (paired t-test, p < 0.05 after Benjamini–Hochberg correction) from the 1000th to the 4800th time unit. In addition, the difference between the SLDRL and BC agents is statistically significant up to approximately the 4000th time unit, indicating that BC benefits more strongly from the availability of optimal expert demonstrations during the early stages of learning. The non-parametric Wilcoxon signed-rank test confirms this finding. These results highlight an important distinction between the methods: while behavioral cloning performs extremely well when optimal demonstrations are available, SLDRL selectively incorporates observed behaviors into the reinforcement learning process, allowing it to remain effective even when demonstrations may not be perfect or universally applicable. Furthermore, Figure 10 shows how often, on average, the imitate and enact actions are triggered by the SLDRL agent. Once again, the learning agent triggers these actions, and hence utilizes social learning, more often during the early stages of training. As the agent becomes more experienced, social learning-related actions are triggered less often, so that the individual learning mechanism of the agent becomes more effective.

Figure 10. Average number of triggered imitate and enact actions for observing an expert. The results show the averages of 120 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals.

In Figure 11, we compare the results of the experiment setups for observing pre-defined trajectories and observing an expert. As can be seen, the SLDRL agent that observes an expert achieves slightly better performance than the SLDRL agent that observes pre-defined trajectories. The difference between the two settings is statistically significant at time unit 1300, between time units 2600 and 3000, and at time unit 4000. In accordance with this result, the number of enacted behaviors is slightly higher for the learning agent observing an expert. During the early stages of learning, the agent that observes an expert triggers social learning-related actions more often and therefore benefits more from social learning.

Figure 11. Comparison of results from the experiment setups with observing an expert and observing pre-defined trajectories. The results show the averages of 120 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals: (a) Performance of the SLDRL agent in different environment setups; (b) Number of enact behaviors that are executed by the SLDRL agent in different environment setups.

5.3. Observing an Experienced Agent

In previous sets of experiments, we examined the learning performance of the SLDRL algorithm when observing behaviors that were either designed to be part of the optimal solution or predefined directional trajectories. These results demonstrated that the SLDRL agent can effectively exploit social learning to accelerate its learning process. In the next set of experiments, we investigate a more realistic scenario in which the SLDRL agent observes the behavior of another learning agent that has gained experience in the environment but does not necessarily follow an optimal policy.

For this purpose, a pure DRL agent is first trained for 9000 time units in the simulation environment so that it becomes an experienced agent capable of solving the task with a reasonably good policy. At the 9000th time unit, an SLDRL agent is introduced into the environment and begins observing the actions performed by the experienced DRL agent. When the SLDRL agent triggers an imitate action, it remains stationary for five consecutive time units and observes the executed actions of the experienced agent. Since the learning agent temporarily pauses its own exploration during this observation period, social learning introduces an additional cost to the SLDRL agent. The observed action sequence, together with its starting position, is then stored in the memory space of the SL layer of the SLDRL agent.

The memory space of the SL module stores a limited number of observed behavioral sequences. The memory capacity was determined experimentally through a sensitivity analysis using capacities of 5, 10, and 20 stored behaviors. The results, presented in Appendix B, show that increasing the memory capacity from 5 to 10 leads to improved learning performance at several stages of training, with statistically significant differences observed in parts of the learning process. However, further increasing the capacity to 20 does not produce statistically significant improvements. These results indicate that enlarging the memory beyond a moderate size provides limited additional benefit while increasing computational overhead. Accordingly, the SL memory capacity was set to 10 behaviors in the main experiments. If a new behavior is observed after all slots are filled, the newly observed behavior replaces the stored behavior with the lowest entropy decrease rate R_{b}. When the SLDRL agent triggers the enact action, it probabilistically selects one of the stored behaviors according to the current {ProbSelect}_{b} values and executes it in the corresponding region of interest. It should be noted that although the experienced DRL agent has already learned an effective policy, it may still perform occasional exploratory actions due to its ε-greedy strategy.
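The memory management described above can be sketched as a small Python class. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the class and field names are hypothetical, ties for the lowest rate are broken in favor of the oldest entry, and the restriction of enactment to the region of interest is omitted.

```python
import random

class SocialMemory:
    """Fixed-capacity store of observed behaviors (hypothetical sketch)."""

    def __init__(self, capacity=10):
        self.capacity = capacity
        self.behaviors = []

    @staticmethod
    def rate(entry):
        # Entropy decrease rate R_b from Equation (6).
        return (entry["decrease"] + 1) / (entry["enacted"] + 1)

    def store(self, actions, start_state):
        entry = {"actions": tuple(actions), "start_state": start_state,
                 "decrease": 0, "enacted": 0}
        if len(self.behaviors) < self.capacity:
            self.behaviors.append(entry)
            return
        # Memory is full: replace the behavior with the lowest rate R_b.
        worst = min(range(len(self.behaviors)),
                    key=lambda i: self.rate(self.behaviors[i]))
        self.behaviors[worst] = entry

    def select(self, rng=random):
        # Sampling proportionally to R_b is equivalent to Equation (7).
        weights = [self.rate(e) for e in self.behaviors]
        return rng.choices(self.behaviors, weights=weights, k=1)[0]

memory = SocialMemory(capacity=2)
memory.store(["N"] * 5, (0, 0))
memory.store(["E"] * 5, (3, 1))
memory.behaviors[0]["enacted"] = 4      # enacted four times without an entropy decrease
memory.store(["S"] * 5, (2, 2))         # replaces the first behavior (R_b = 0.2)
```

Because `random.choices` normalizes the weights internally, passing the rates R_{b} directly yields the selection probabilities of Equation (7) without computing them explicitly.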

For comparison, the behavioral cloning (BC) baseline is trained using the same action sequences that are provided to the SLDRL agent. In other words, the demonstrations observed by the SLDRL agent are also used as training data for the BC agent in the form of state–action pairs.

Figure 12 presents the results for this experiment setting. As can be seen, the SLDRL agent achieves better learning performance than both the pure DRL agent and the behavioral cloning baseline. While the experienced agent provides useful demonstrations, these behaviors are not always optimal because they are generated by a learning agent that still performs exploratory actions. As a result, the BC agent, which directly imitates these demonstrations, also reproduces suboptimal behaviors and therefore exhibits slower learning compared to SLDRL.

Figure 12. Results for observing an experienced agent. The results show the averages of 120 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals.

Statistical analysis supports these observations. A paired t-test and a Wilcoxon signed-rank test reveal that the difference between the SLDRL and pure DRL agents is statistically significant between the 2400th and 5500th time units. In contrast, no statistically significant difference is observed between the behavioral cloning agent and the pure DRL agent throughout the experiment. However, the difference between the SLDRL and BC agents remains statistically significant until approximately the 4700th time unit, indicating that the selective imitation mechanism of SLDRL allows it to benefit more effectively from the experienced agent’s demonstrations. The non-parametric Wilcoxon signed-rank test confirms these findings.

Furthermore, Figure 13 illustrates the number of triggered imitate and enact actions during training. As in previous experiments, social learning actions occur more frequently during the early stages of training, after which their frequency gradually decreases. This behavior indicates the emergence of a dynamic balance between social learning and individual reinforcement learning.

Figure 13. Average number of triggered imitate and enact actions for observing an experienced agent. The results show the averages of 120 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals.

5.4. Observing an Inexperienced Agent

In the previous set of experiments, we showed that the SLDRL agent can effectively exploit social learning when an experienced agent is present in the environment. However, in many realistic scenarios such an expert or experienced agent may not exist. To examine the performance of the SLDRL algorithm under these conditions, we design an experiment setup in which two agents start learning at the same time and can observe each other’s behavior during the training process. In this configuration, neither agent initially possesses useful experience that could reliably guide the learning process of the other.

In this experiment setting, when the SLDRL agent triggers an imitate action, it remains stationary for five consecutive time units while observing the actions executed by the other learning agent. Because the observed agent is also exploring the environment and learning simultaneously, the demonstrated behaviors may frequently contain suboptimal or exploratory actions. Furthermore, the observation process temporarily halts the learning agent’s own exploration, which introduces an additional cost to the learning process. This observation cost is intentionally modeled as part of the learning process, allowing the agent to learn when imitation is beneficial by balancing its advantages against the opportunity cost of pausing.

For comparison, the behavioral cloning (BC) baseline is again trained using the same action sequences that are provided to the SLDRL agent. In other words, the demonstrations observed by the SLDRL agent are also used as training data for the BC agent in the form of state–action pairs. Figure 14 presents the results for this experiment setting. As can be seen, the SLDRL agent achieves better learning performance than both the pure DRL agent and the behavioral cloning baseline, even though the observed behaviors originate from an inexperienced agent. Statistical analysis using a paired t-test shows that the difference between the SLDRL and pure DRL agents is statistically significant between the 3300th and 3900th time units. In addition, the difference between the SLDRL and BC agents remains statistically significant from approximately the 2000th time unit until the end of the experiment, indicating that the selective imitation mechanism of SLDRL enables more effective use of the observed behaviors.

Figure 14. Results for observing an inexperienced agent. The results show the averages of 120 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals.

In contrast, the behavioral cloning agent exhibits the weakest performance in this scenario. Since the BC agent directly imitates the observed behaviors without evaluating their usefulness, it also reproduces many exploratory and suboptimal actions demonstrated by the inexperienced agent. When comparing BC and pure DRL agents, the difference between the two methods becomes statistically significant at several intervals during training, particularly after the 6000th time unit where significant differences appear more frequently. These results suggest that direct imitation of unreliable demonstrations can even degrade learning performance compared to purely individual reinforcement learning. The non-parametric Wilcoxon signed-rank test confirms these findings. Figure 15 further explains this behavior by showing the number of triggered imitate and enact actions during training. As can be observed, the SLDRL agents trigger a relatively smaller number of imitate actions, indicating that they gradually learn that many of the observed behaviors are not particularly useful. At the same time, the number of enact actions remains relatively higher, suggesting that the agents are still able to extract and exploit useful behavioral fragments from the observed demonstrations. This demonstrates not only the adaptive balance between social learning and individual reinforcement learning in the SLDRL framework but also serves as an implicit edge-case analysis highlighting the robustness of the proposed method to imperfect, noisy, and potentially misleading demonstrations.

Figure 15. Average number of triggered imitate and enact actions for observing an inexperienced agent. The results show the averages of 120 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals.

Figure 16 compares the results for the experiment setups with observing experienced and inexperienced agents. As can be seen, the presence of an experienced agent provides an enhanced learning performance for the SLDRL agent. A paired t-test and a Wilcoxon signed-rank test between the two setups reveal that the difference between the two results is statistically significant from the 2400th to the 5500th time unit.

Figure 16. Comparison of results from the experiment setups with observing an experienced and an inexperienced agent. The results show the averages of 120 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals: (a) Performance of the SLDRL agent in different environment setups; (b) Number of enact behaviors that are executed by the SLDRL agent in different environment setups.

During the early stages of training, the SLDRL agent that observes an experienced agent triggers, on average, more imitate actions but fewer enact actions than the SLDRL agent that observes an inexperienced agent. Based on these observations, we can deduce that the SLDRL agent that observes an experienced agent benefits more from social learning, since fewer enactments are sufficient to enhance its learning. In contrast, the SLDRL agent that observes an inexperienced agent attempts to identify useful behaviors by trying a smaller number of copied behaviors a larger number of times. This explains why the SLDRL agent that observes an inexperienced agent is less able to benefit from social learning.

Figure 17 compares the learning performance of the SLDRL agent in all four experiment setups. As can be seen, a statistically significant improvement in learning performance over the pure DRL agent can be observed for all setups. The best performance is achieved by the SLDRL agents that can observe an expert or pre-defined trajectories. We can conclude that when the learning agent can observe behaviors that can solve its task, the SLDRL algorithm is able to exploit social learning and achieve highly enhanced learning. Furthermore, we can observe a highly enhanced learning performance when the agent can observe the behaviors of an experienced agent. In this setup, the behaviors that are observed may not be optimal. However, through social learning, the SLDRL agent can exploit the experience of another agent and enhance its learning performance. Finally, we can see that the SLDRL agent is able to exploit social learning and enhance its learning performance even when the observed behaviors are demonstrated by an inexperienced agent.

Figure 17. Comparison of the performance of SLDRL algorithm in different experiment setups. The results show the averages of 120 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals.

6. SLDRL in Continuous State-Space Environments

To provide an initial evaluation of the proposed SLDRL framework beyond sparse-reward discrete environments, we extend our experiments from the discrete grid-based foraging environment with sparse rewards to a continuous-state/discrete-action problem, specifically the CartPole domain [49]. While the discrete setting provides a clear and interpretable testbed for studying the fundamental interactions between social and individual learning, its discretized state and action spaces allow relatively straightforward matching of observed behavioral sequences to encountered states. In such environments, socially acquired behaviors can often be reused within clearly defined regions of the state space.

By contrast, continuous state-space environments with discrete actions such as CartPole introduce a more challenging setting in which behavioral reuse must rely on approximate rather than exact state similarity. Since continuous environments do not naturally provide discrete neighborhood relationships between states, the learning agent must determine whether previously observed behaviors remain relevant under continuously varying environmental conditions. This makes continuous domains particularly suitable for evaluating whether SLDRL can maintain an adaptive balance between social learning and individual exploration when behavioral reuse depends on state similarity rather than exact state correspondence. Moreover, unlike the sparse-reward structure commonly encountered in discrete grid-based problems, CartPole provides dense rewards at every timestep, enabling the investigation of how SLDRL operates when environmental feedback is frequent but long-term success still depends on subtle behavioral refinements. Evaluating SLDRL in such a setting allows us to examine the robustness of the proposed framework across fundamentally different state-space structures and reward dynamics. Although CartPole operates in a continuous state space, the control actions remain discrete. Therefore, the present experiment should be interpreted as validation under continuous-state/discrete-action conditions rather than as evaluation in continuous action-space control problems.

The continuous state-space experiments are conducted using MATLAB 2025a’s predefined CartPole environment [50]. The task involves balancing a pole hinged to a cart that moves along a frictionless track. The state space consists of four continuous variables: cart position ($x$), cart velocity ($\dot{x}$), pole angle ($θ$), and pole angular velocity ($\dot{θ}$). The action space is discrete, comprising two possible actions: applying a force of $+10$ N (push right) or $-10$ N (push left) to the cart. An episode terminates if the cart moves more than 2.4 m from the center, the pole angle exceeds $12^{\circ} \approx 0.209$ rad, or if the maximum episode length of 500 time steps is reached. The reward function provides a reward of $+1$ at each time step that the pole remains upright and the cart stays within bounds. This setup provides a continuous-state/discrete-action setting with dense rewards, making it suitable for evaluating SLDRL under conditions substantially different from those used in the discrete grid-based foraging experiments while not constituting validation in continuous action-space environments.
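The termination and reward rules described above can be sketched in a few lines of Python. This is an illustrative restatement of the limits quoted in the text, not MATLAB's environment code: the constant and function names are hypothetical, the mapping from action index to force is assumed, and the reward on a failing step (set to 0 here) is not specified in the text and may differ in the actual environment.

```python
import math

# Limits quoted in the text; names are illustrative, not MATLAB's API.
X_LIMIT_M = 2.4                      # cart displacement limit from center (m)
THETA_LIMIT_RAD = math.radians(12)   # pole angle limit (about 0.209 rad)
MAX_STEPS = 500                      # maximum episode length
FORCE_N = {0: -10.0, 1: +10.0}       # assumed action index -> force mapping


def episode_done(x, theta, step):
    """Return True if the episode ends once this state is reached."""
    out_of_bounds = abs(x) > X_LIMIT_M
    pole_fell = abs(theta) > THETA_LIMIT_RAD
    timed_out = step >= MAX_STEPS
    return out_of_bounds or pole_fell or timed_out


def step_reward(x, theta):
    """+1 while the pole is upright and the cart is within bounds.

    The value returned on failure (0.0) is a placeholder assumption.
    """
    ok = abs(x) <= X_LIMIT_M and abs(theta) <= THETA_LIMIT_RAD
    return 1.0 if ok else 0.0
```

Because every surviving step yields $+1$, the undiscounted return of an episode under these rules equals the number of steps survived, capped at 500.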

Importantly, the overall SLDRL framework used in the continuous-state experiments remains identical to that used in the discrete-state experiments. In both environments, agents employ the same online imitate and enact meta-actions as explained in Figure 4, the same entropy-based reinforcement mechanism for evaluating socially acquired behaviors, and the same SL memory management process. The only difference lies in the state-matching mechanism used during behavioral enactment. While the discrete-state experiments use neighboring grid cells to determine contextual similarity, the continuous-state experiments employ cosine similarity due to the absence of a natural neighborhood structure in continuous state spaces.

To evaluate the effectiveness of the proposed method, we compare the performance of the SLDRL agent with two baseline approaches: a pure DRL agent controlled by a DQN algorithm and a behavioral cloning (BC) baseline. In the BC approach, the same observed action sequences available to the SLDRL agent are used as demonstration data in the form of state–action pairs for supervised training before reinforcement learning continues. This comparison allows us to assess whether direct imitation of demonstrated behaviors is sufficient for improving performance in continuous environments or whether the selective social learning mechanism of SLDRL provides additional advantages.

The DQN architecture used for both the pure DQN agent and the DRL layer of the SLDRL agent consists of a fully connected feedforward neural network with two hidden layers. This network architecture is selected based on established practices in CartPole experiments, where relatively small networks are sufficient to approximate the Q-function effectively due to the low-dimensional and well-structured nature of the state space. For reproducibility, all neural network architectures, hyperparameters, and training configurations used in both the continuous and discrete experiments are summarized in detail in Appendix A.

In the continuous-state experiments, behavioral enactment is performed based on cosine similarity between the current state of the learning agent and the starting state of previously observed behavioral sequences stored in the SL memory. Unlike the discrete-state experiments, where behavioral sequences are enacted only within neighboring cells of the original observation location, continuous state spaces do not provide a natural neighborhood structure. Therefore, approximate state similarity must be evaluated explicitly.

Let $s$ denote the current state vector of the SLDRL agent and $s_{m}$ denote the starting state of a stored behavioral sequence $m$ in SL memory. Each state is represented as a four-dimensional vector consisting of cart position, cart velocity, pole angle, and pole angular velocity. Before computing cosine similarity, all state variables are normalized using z-score normalization in order to eliminate scale differences across dimensions. For each state dimension $i$, normalization is performed as:

{\hat{s}}_{i} = \frac{s_{i} - μ_{i}}{σ_{i} + δ}

(8)

where $μ_{i}$ and $σ_{i}$ denote the mean and standard deviation of the i-th state variable, and δ is the same small numerical constant used throughout the paper (set to $10^{-8}$ in our implementation) added for numerical stability. This normalization ensures that all state dimensions contribute equally to the cosine similarity computation, preventing variables with larger numerical ranges from dominating the similarity measure. Both the current state and the stored memory state are normalized using the same normalization parameters prior to similarity computation. The normalization parameters are computed once from the teacher-generated dataset and kept fixed throughout training. Additional implementation details regarding the normalization procedure are provided in Appendix A. The cosine similarity between the normalized state vectors is then computed as:

\mathrm{cos\_sim}(s, s_{m}) = \frac{\hat{s} \cdot {\hat{s}}_{m}}{\| \hat{s} \| \, \| {\hat{s}}_{m} \|}

(9)

where $\hat{s}$ and ${\hat{s}}_{m}$ denote the normalized versions of $s$ and $s_{m}$, respectively.

A stored behavioral sequence is considered eligible for enactment if its cosine similarity score exceeds a predefined threshold τ. The threshold parameter τ determines how closely the current state must match the starting state of a stored behavioral sequence before enactment is permitted. In this sense, the cosine-similarity threshold in the continuous-state experiments serves as a continuous analog of the neighborhood-based matching mechanism used in the discrete-state environment.
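The matching procedure of Equations (8) and (9), together with the threshold test, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' MATLAB implementation: the memory layout (a list of `(start_state, actions)` pairs), the function names, and the ordering of matches by similarity are hypothetical. The text says the score must exceed τ; the sketch uses `>=` so that τ = 1.00 can still admit identical states, which is also an assumption.

```python
import math

DELTA = 1e-8   # numerical-stability constant from Eq. (8)
TAU = 0.99     # similarity threshold used in the main experiments


def zscore(state, mu, sigma):
    """Apply Eq. (8) to each state dimension."""
    return [(s - m) / (sd + DELTA) for s, m, sd in zip(state, mu, sigma)]


def cos_sim(a, b):
    """Eq. (9): cosine similarity of two already-normalized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def eligible_sequences(state, memory, mu, sigma, tau=TAU):
    """Return (similarity, actions) pairs whose start state is close enough."""
    s_hat = zscore(state, mu, sigma)
    matches = []
    for start_state, actions in memory:
        sim = cos_sim(s_hat, zscore(start_state, mu, sigma))
        if sim >= tau:  # assumption: inclusive comparison (see lead-in)
            matches.append((sim, actions))
    # Most similar first; the ordering is an illustrative choice.
    return sorted(matches, key=lambda m: m[0], reverse=True)


# Usage with made-up numbers (not taken from the experiments).
mu, sigma = [0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]
memory = [
    ([0.1, 0.0, 0.02, 0.0], [1, 0, 1, 1, 0]),
    ([-1.0, 0.5, -0.1, 0.3], [0, 0, 1, 0, 1]),
]
matches = eligible_sequences([0.1, 0.0, 0.02, 0.0], memory, mu, sigma)
```

Note that cosine similarity compares the direction of the normalized vectors and ignores their magnitude, so two states that differ only by a common scale factor after normalization receive a score of 1.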

Performance is evaluated by comparing the learning curves of the SLDRL agent with those of the two baselines introduced above: a pure DRL agent trained under identical conditions but without access to social learning mechanisms, and the BC baseline, whose policy is pre-trained by supervised learning on the same observed state–action trajectories available to the SLDRL agent during imitation events before reinforcement learning continues. All agents interact with the same CartPole environment in MATLAB’s Reinforcement Learning Toolbox using identical network architectures, hyperparameters, and training episode limits. Episode rewards are recorded throughout training, and statistical analyses are applied to assess whether the integration of socially acquired behaviors provides measurable advantages over purely individual learning and direct imitation.

The results of the CartPole experiments are illustrated in Figure 18, which compares the mean episode rewards of the SLDRL agent, the pure DRL agent, and the behavioral cloning baseline, averaged over 100 independent runs with 95% confidence intervals. During the initial episodes, the BC agent achieves relatively high rewards due to its direct imitation of observed demonstration trajectories. This allows the BC agent to quickly reproduce useful action patterns without requiring extensive exploration. However, as training progresses, the performance of the BC agent gradually plateaus and shows limited further improvement.

Figure 18. Results for CartPole experiments. The results show the averages of 100 experiment runs, each lasting 100 episodes, along with 95% confidence intervals. The DQN parameters are set as follows: α = 0.01, γ = 0.99, and ε decays from 1.0 to 0.01 over training. These values follow common practice in the literature for training DQN agents on the CartPole problem, supporting comparability with prior studies and stable learning performance. The similarity threshold τ, which the cosine similarity between two states must meet for them to be considered a match, is set to 0.99.

In contrast, both reinforcement learning agents continue improving their performance through ongoing exploration and policy refinement. Notably, the SLDRL agent begins to outperform the BC agent after the early stages of training and consistently achieves higher rewards than both baseline methods throughout the remainder of the learning process. Statistical analysis using paired t-tests and Wilcoxon signed-rank tests confirms that the performance difference between the SLDRL and pure DRL agents becomes statistically significant from approximately episode 15 onward (p < 0.05). Furthermore, the SLDRL agent also demonstrates statistically significant improvements over the BC agent during the middle and later stages of training, indicating that selective utilization of socially acquired behaviors provides advantages beyond direct imitation.
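As an illustration of the per-episode significance check, the paired t statistic can be computed from the rewards that two agents obtain in the same episode across runs. The sketch below is a hedged Python example using only the standard library. Pairing by run index is an assumption, the function names are hypothetical, and the decision uses the tabulated two-sided 5% critical value for 99 degrees of freedom (100 runs) rather than an exact p-value, which would require the Student t distribution from a statistics package. The Wilcoxon signed-rank test is not covered here.

```python
import math
import statistics

T_CRIT_DF99 = 1.984  # two-sided 5% critical value of Student's t, 99 d.o.f.


def paired_t_statistic(a, b):
    """t statistic for the mean of the paired differences a[i] - b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with n >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0.0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


def significant_at_5pct(a, b, t_crit=T_CRIT_DF99):
    """Two-sided decision; t_crit must match len(a) - 1 degrees of freedom."""
    return abs(paired_t_statistic(a, b)) > t_crit
```

Applying this test independently at every episode involves many comparisons, so a reader reproducing the analysis may want to consider a multiple-comparison correction; the text does not state whether one was used.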

The improved performance of the SLDRL agent can be attributed to its ability to selectively enact previously observed behavioral sequences while continuing reinforcement learning-based exploration. By combining socially acquired behaviors with ongoing individual learning, the SLDRL agent can exploit useful behavioral patterns while still refining its policy through environmental interaction. This hybrid learning process enables the agent to benefit from social learning without becoming overly dependent on direct imitation.

Although the performance gap between SLDRL and pure DRL narrows slightly toward the end of training as the baseline agent gradually converges to an effective policy, the SLDRL agent maintains a consistent performance advantage. These results demonstrate that integrating socially acquired behaviors into the reinforcement learning process not only accelerates early learning but also improves long-term learning performance compared to both purely individual reinforcement learning and direct imitation approaches.

In the continuous state-space CartPole experiments (Figure 19), the SLDRL agent initially triggers imitate actions more frequently during the early stages of learning, while the ratio of enact actions gradually increases over time before approaching a stable level. This behavior indicates the emergence of a dynamic transition from behavioral acquisition to behavioral exploitation during training. At the beginning of learning, the agent has limited knowledge about which socially observed behaviors are useful under different environmental conditions. Consequently, imitation is utilized more frequently to acquire behavioral information from other agents. As training progresses, the agent increasingly identifies socially acquired behavioral sequences that contribute positively to task performance, leading to a growing reliance on enact actions.

Figure 19. Ratio of triggered imitate and enact actions during the CartPole experiments. The results show the averages of 100 independent experiment runs, each lasting 100 episodes, along with 95% confidence intervals. The figure illustrates the dynamic balance between behavioral acquisition through imitation and behavioral exploitation through enactment during the learning process.

Unlike the discrete-state experiments, where both imitate and enact actions rapidly decline after early learning, the enact ratio in the continuous CartPole environment remains comparatively high throughout training and gradually stabilizes around 0.4. One important factor underlying this difference is the reward structure of the environment. In the discrete grid-based foraging problem, rewards are sparse and are typically obtained only when the target location is reached. Consequently, socially acquired behaviors mainly provide early guidance before the agent develops its own reliable policy. In contrast, CartPole provides dense step-wise rewards at every timestep that the pole remains balanced. As a result, enacted behavioral sequences can continuously generate positive reinforcement signals throughout training, allowing socially acquired behaviors to remain beneficial even during later learning stages.

Another contributing factor is the structure of the continuous state space itself. Compared to the discrete grid-based environment, the CartPole domain contains a substantially larger and more finely grained state space. In such environments, the agent encounters a broader range of continuously varying states, making exact repetition of previously experienced situations less common. Consequently, the reuse of socially acquired behaviors depends on approximate state similarity rather than exact state correspondence. During the early stages of training, this limits the number of situations in which stored behaviors can be successfully enacted. However, as the agent explores the environment and its policy gradually stabilizes, the likelihood of encountering states sufficiently similar to previously observed situations increases, leading to a higher enact ratio.

The gradual saturation of the enact ratio suggests that the SLDRL agent eventually reaches a balance between individual exploration and social exploitation. Although socially acquired behaviors remain useful throughout training, the number of newly encountered states that strongly benefit from previously observed behavioral sequences decreases as the policy converges. Consequently, the enact ratio continues to increase with diminishing increments before stabilizing at a relatively high level. Overall, these results demonstrate that, even in continuous state-space environments, the proposed SLDRL framework can dynamically regulate the balance between imitation and individual learning according to both the reward structure and the characteristics of the state space.

7. Conclusions

In this study, we proposed a novel social learning-enhanced deep reinforcement learning (SLDRL) framework that enables agents to selectively incorporate socially observed behaviors into the reinforcement learning process. The proposed framework integrates a hybrid control architecture in which a DRL layer governs the agent’s primary decision-making process, while a social learning (SL) layer manages the storage, retrieval, and selective enactment of observed behavioral sequences. Through an entropy-based intrinsic motivation mechanism, the agent autonomously determines whether previously observed behaviors are beneficial under current environmental conditions. This design enables the emergence of a non-preprogrammed balance between social learning and individual exploration, inspired by observational learning mechanisms observed in biological systems.

The experimental results support the hypothesis that integrating online social learning into reinforcement learning can accelerate policy development, particularly during the early stages of training. In both sparse-reward discrete environments and dense-reward continuous-state/discrete-action environments, SLDRL agents achieved faster learning and higher cumulative rewards than pure DRL baselines. In the discrete-state experiments, imitate actions were used more frequently during the early phases of training, after which the reliance on social learning gradually decreased as the agents developed effective individual policies. In contrast, the continuous-state CartPole experiments exhibited a sustained yet stabilizing enactment ratio throughout training. This difference appears to be related both to the larger and more finely grained structure of the continuous state space and to the dense reward structure of the CartPole environment, where socially acquired behaviors can continuously generate reinforcement signals during enactment.

Comparisons with a behavioral cloning (BC) baseline further revealed an important distinction between direct imitation and selective social learning. While behavioral cloning can provide rapid initial improvements when demonstration trajectories are informative, its performance often plateaus once the demonstrated behaviors are reproduced. In contrast, SLDRL combines socially acquired behaviors with ongoing reinforcement learning-based exploration, enabling agents to refine and improve upon observed behaviors rather than simply copying them. As a result, SLDRL consistently demonstrated stronger robustness in scenarios where demonstrations were partially useful, noisy, or generated by agents that were still learning.

Another important finding is that the proposed framework does not require expert demonstrators. Even when multiple agents begin learning simultaneously, socially acquired behaviors can still provide measurable performance improvements. Furthermore, the proposed architecture is flexible with respect to the source of demonstrations, since observed behavioral sequences are selected according to state similarity rather than demonstrator identity. The computational overhead introduced by the SL layer also remains relatively low, as the additional operations primarily consist of lightweight state-similarity comparisons and entropy calculations performed on already available Q-values.

Importantly, SLDRL should not be interpreted as a standalone replacement for existing reinforcement learning algorithms. Instead, the proposed framework acts as a social learning enhancement layer that can be integrated with conventional reinforcement learning methods to accelerate and guide learning. In the present study, DQN was used as the underlying DRL component due to its suitability for both the discrete and continuous experimental setups. However, the proposed social learning mechanism was designed as a modular enhancement layer and its empirical evaluation in the present study was limited to DQN. Since the proposed framework operates on top of the underlying DQN optimization process and does not introduce an alternative policy update objective, the observed performance improvements should be interpreted as empirical evidence under the evaluated experimental conditions rather than formal guarantees of convergence or stability. Socially acquired behaviors influence learning indirectly by modifying the distribution of experiences encountered during training through selective observation and enactment, while preserving the underlying Bellman update mechanism of DQN. Formal analysis of convergence and stability properties under the interaction between socially acquired behaviors and replay-based deep reinforcement learning remains an open direction for future theoretical investigation. Consequently, future studies may investigate integration of the proposed framework with alternative reinforcement learning methods such as PPO, DDPG, SAC, Rainbow DQN, or other actor–critic architectures, which could potentially replace the DQN component while preserving the overall SLDRL framework. In this sense, the primary contribution of this work is not the introduction of an entirely new reinforcement learning paradigm, but rather the demonstration that adaptive online social learning mechanisms can effectively enhance existing reinforcement learning approaches.

Although the proposed framework demonstrated improved learning performance under both experimental settings, the present validation was conducted using DQN-based agents to isolate and evaluate the contribution of the proposed social learning mechanism under controlled conditions. The objective of this study was not to establish superiority over alternative reinforcement learning architectures but to investigate whether socially acquired behaviors can be adaptively integrated into the learning process through the proposed imitation–enactment framework. Since different reinforcement learning algorithms may exhibit different exploration dynamics and learning characteristics, additional evaluation with approaches such as PPO, Actor–Critic, Double DQN, Dueling DQN, and Rainbow DQN may provide further insight into the generality and contribution of the proposed social learning layer.

To further clarify the practical cost of the proposed social learning layer, we conducted a representative computational overhead analysis using the expert trajectory scenario. Runtime measurements were obtained from 10 independent repetitions under identical hardware and software conditions. As summarized in Table 1, the average training time increased from 205 ± 10 s for the pure DRL baseline to 210 ± 12 s for SLDRL, corresponding to an approximate overhead of 2.4%.

Table 1. Representative computational overhead comparison between pure DRL and SLDRL.

This limited increase is consistent with the structure of the proposed framework. The social learning layer does not alter the underlying DRL optimization procedure and does not introduce additional trainable neural networks. Instead, the additional computation originates from bounded behavioral-sequence storage, state similarity evaluation, and selective enactment operations. Since the reported wall-clock time includes both simulation execution and learning updates, these measurements should be interpreted as implementation-level overhead rather than isolated algorithmic complexity estimates. No noticeable increase in memory consumption was observed during representative runs, which is consistent with the bounded size of the social memory structure.
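The overhead figure quoted above follows directly from the two mean training times. Writing $T_{\mathrm{DRL}}$ and $T_{\mathrm{SLDRL}}$ for the mean wall-clock times (symbols introduced here for illustration only):

```latex
\text{overhead}
  = \frac{T_{\mathrm{SLDRL}} - T_{\mathrm{DRL}}}{T_{\mathrm{DRL}}}
  = \frac{210\ \text{s} - 205\ \text{s}}{205\ \text{s}}
  \approx 0.024 = 2.4\%
```

The 5 s difference in means is smaller than either reported spread (±10 s and ±12 s), so the percentage is best read as an approximate, implementation-level figure.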

Future research can extend the SLDRL framework in several directions. First, alternative state-similarity mechanisms, learned embeddings, or context-aware representations could improve the generalization capability of socially acquired behaviors across wider regions of the state space. Second, evaluating the framework in more complex and high-dimensional environments, including visual-input-based tasks and real-world robotic systems, is necessary before making broader conclusions regarding scalability and practical applicability. Third, broader validation of the proposed framework across alternative reinforcement learning architectures and additional imitation learning approaches may further clarify the strengths, limitations, and generalizability of the social learning mechanism. Finally, although the current study evaluates SLDRL in observational social learning scenarios, future work may investigate how the proposed mechanism can be extended to larger multi-agent settings, partially observable environments, and decentralized learning scenarios to better understand the broader applicability of adaptive social learning mechanisms in artificial intelligence systems. Future studies may also investigate comparisons with more advanced imitation learning approaches, including adversarial and reward-inference-based methods such as GAIL and DIRL. Since these methods rely on substantially different learning assumptions and optimization objectives, such comparisons would provide an opportunity to evaluate the proposed framework under a broader imitation learning perspective.

Although the proposed framework was evaluated in both sparse-reward discrete and continuous-state benchmark environments, the present study remains limited to controlled and relatively low-dimensional settings. The selected environments were intentionally chosen to isolate the contribution of the proposed social learning mechanism while minimizing confounding effects introduced by large-scale environmental complexity. Therefore, the reported results should be interpreted as an initial validation of the proposed mechanism rather than evidence of scalability to partially observable, stochastic, high-dimensional, or large-scale environments. Evaluating SLDRL in visual-input, higher-dimensional, and more complex control scenarios remains an important direction for future work. Similarly, although the proposed framework was designed to be conceptually compatible with alternative reinforcement learning architectures, empirical validation in the present study was intentionally restricted to DQN to isolate the contribution of the social learning mechanism.

The present study evaluates social learning under controlled observational settings involving a single learning agent and limited demonstrator interactions. Accordingly, the proposed framework should not be interpreted as a validated large-scale multi-agent reinforcement learning approach. Extension to larger populations with heterogeneous expertise distributions, cooperative–competitive interactions, and decentralized learning remains outside the scope of the present study. Nevertheless, because SLDRL enables agents to selectively acquire and reuse observed behaviors without sharing internal representations, future studies may investigate how populations of imitation-capable agents collectively influence learning dynamics and whether socially acquired behaviors can propagate and accumulate at the population level.

Overall, the findings demonstrate that SLDRL provides a biologically inspired and computationally effective mechanism for integrating social learning into reinforcement learning agents. By enabling agents to selectively exploit observed behaviors while maintaining individual exploration, the proposed framework achieves measurable learning benefits in the evaluated sparse-reward discrete environment and the continuous-state/discrete-action CartPole benchmark. Although these results support the effectiveness of the proposed approach in the evaluated discrete and continuous benchmark settings, broader validation across more complex environments, alternative reinforcement learning algorithms, and real-world applications remains a direction for future research.

Author Contributions

Conceptualization, M.D.E. and C.G.; methodology, M.D.E.; software, M.D.E. and C.G.; validation, M.D.E. and C.G.; formal analysis, M.D.E.; investigation, M.D.E. and C.G.; resources, M.D.E.; data curation, M.D.E.; writing—original draft preparation, M.D.E.; writing—review and editing, M.D.E. and C.G.; visualization, M.D.E.; supervision, M.D.E.; project administration, M.D.E.; funding acquisition, M.D.E. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by The Scientific and Technological Research Council of Turkey (TUBITAK), project number: 122E642.

Data Availability Statement

The datasets generated and analyzed in this article are available at the following URL: https://github.com/gulenceren/SLDRL-Datasets (accessed on 21 May 2026).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A.1. State Normalization (Continuous State-Space Experiments)

In the SLDRL implementation, all state variables are normalized prior to cosine similarity computation using z-score normalization. For each state dimension i, normalization is performed as:

{\hat{s}}_{i} = \frac{s_{i} - μ_{i}}{σ_{i} + δ}

(A1)

where $μ_{i}$ and $σ_{i}$ denote the mean and standard deviation of the corresponding state variable. These statistics are computed from the state distribution generated by the teacher agent used to populate the SL memory. The normalization parameters ($μ_{i}$ and $σ_{i}$) are computed once before training and remain fixed throughout both training and evaluation phases. This ensures consistency between the stored behavioral sequences and the states encountered during learning. A small numerical constant $δ = 10^{-8}$ is added to the denominator to ensure numerical stability and prevent division by zero.
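The "compute once, then keep fixed" procedure can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: the function names are hypothetical, the teacher states are assumed to be available as a list of equal-length vectors, and the use of the population standard deviation (rather than the sample estimate) is an assumption, since the text does not specify which is used.

```python
import statistics

DELTA = 1e-8  # stability constant from Eq. (A1)


def fit_normalizer(teacher_states):
    """Compute per-dimension mean and standard deviation once."""
    columns = list(zip(*teacher_states))  # one tuple per state dimension
    mu = [statistics.fmean(col) for col in columns]
    # Assumption: population standard deviation; the paper does not say.
    sigma = [statistics.pstdev(col) for col in columns]
    return mu, sigma


def normalize(state, mu, sigma):
    """Apply Eq. (A1) with the fixed parameters returned by fit_normalizer."""
    return [(s - m) / (sd + DELTA) for s, m, sd in zip(state, mu, sigma)]


# Usage with made-up two-dimensional states; the second dimension is
# constant, so its standard deviation is 0 and DELTA keeps Eq. (A1) finite.
teacher_states = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
mu, sigma = fit_normalizer(teacher_states)
normalized = normalize([4.0, 1.0], mu, sigma)
```

The same `mu` and `sigma` would then be reused for every state encountered during training and for every stored start state, which is what keeps the two sides of the similarity comparison on a common scale.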

Appendix A.2. Hyperparameters

This part summarizes the neural network architectures, reinforcement learning hyperparameters, and SLDRL-specific parameters used in the experiments. The complete set of parameters for both the discrete-space and continuous-space environments is provided in Table A1 to facilitate reproducibility.

Table A1. Hyperparameters used in experiments.

Appendix B

Appendix B.1. Action-Sequence Length

To examine the sensitivity of the SLDRL framework to the length of stored behavioral sequences, we conducted additional experiments using action-sequence lengths of 2, 5, and 10 actions in the discrete-state foraging environment using the observing inexperienced agent scenario. All other training parameters and experimental conditions were kept identical to those used in the main experiments.

Figure A1 compares the resulting learning performance measured by the best path length during training. As shown in the figure, all three configurations exhibit similar qualitative learning dynamics and converge to comparable final performance levels. This indicates that the proposed SLDRL framework is not highly sensitive to the precise value of the sequence length within the tested range.

Nevertheless, although the performance differences are relatively modest, the configuration with sequence length = 5 provides the most favorable overall performance among the tested values and shows slightly faster learning at several stages of training compared to the shorter and longer alternatives. Based on these observations, the sequence length was fixed at 5 actions in the main experiments as a practical compromise between behavioral informativeness and contextual robustness.

Figure A1. Sensitivity analysis of action-sequence length in the discrete-state SLDRL experiments. The results show the averages of 100 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals.

Appendix B.2. SL Memory Capacity

To investigate the sensitivity of the SLDRL framework to the size of the social learning memory, additional experiments were conducted using memory capacities of 5, 10, and 20 stored behavioral sequences in the discrete-state foraging environment using the observing inexperienced agent scenario. All other training parameters and experimental conditions were kept identical to those used in the main experiments.

Figure A2 compares the resulting learning performance measured by the best path length during training. As shown in the figure, increasing the memory capacity from 5 to 10 behaviors improves learning performance at several stages of training, with statistically significant differences observed in parts of the learning process.

However, further increasing the memory capacity from 10 to 20 behaviors does not produce statistically significant improvements in performance. This indicates that increasing the memory capacity beyond a moderate size yields diminishing returns. Based on these observations, the SL memory capacity was fixed at 10 behaviors in the main experiments as a balance between behavioral diversity and computational efficiency.

Figure A2. Sensitivity analysis of memory capacity in the discrete-state SLDRL experiments. The results show the averages of 100 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals.

Appendix B.3. Similarity Threshold

To evaluate the sensitivity of the proposed framework to the similarity threshold used during behavioral enactment, additional experiments were conducted using threshold values of τ = 0.97, τ = 0.99, and τ = 1.00. The similarity threshold determines whether a stored behavioral sequence can be executed in the agent’s current state based on cosine similarity between the current state and the starting state of the sequence.

Figure A3 presents the resulting learning performance measured by the best path length during training. The experiments were conducted in the CartPole continuous-state environment using the same training configuration as in the main experiments. When τ = 1.00, the similarity requirement becomes very strict, allowing sequence enactment only in nearly identical states. As a result, the opportunities for social learning become limited, reducing the benefits of using socially acquired behaviors. In contrast, when τ = 0.97 the similarity requirement becomes relatively permissive, allowing stored behavioral sequences to be executed in states that may differ significantly from the original demonstration context. This can result in the execution of less relevant behaviors and may slow down learning. The intermediate value τ = 0.99 provides the most favorable balance between these two effects, enabling socially acquired behaviors to be reused while maintaining sufficient contextual relevance. Based on these observations, τ = 0.99 was selected for the main experiments.
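To give a geometric sense of how strict each tested threshold is, a cosine-similarity threshold can be converted into the largest angle allowed between the two normalized state vectors. The short Python sketch below performs this conversion; the function name is hypothetical, and the angle refers to the z-scored four-dimensional state vectors, not to the physical pole angle.

```python
import math


def max_angle_deg(tau):
    """Largest angle between two vectors whose cosine similarity is >= tau."""
    if not -1.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [-1, 1]")
    return math.degrees(math.acos(tau))


for tau in (0.97, 0.99, 1.00):
    print(f"tau = {tau:.2f} -> angle <= {max_angle_deg(tau):.1f} degrees")
```

Under this reading, τ = 0.97 admits vectors up to roughly 14.1° apart, τ = 0.99 up to roughly 8.1°, and τ = 1.00 only vectors pointing in exactly the same direction, which is consistent with the qualitative description of the three settings above.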

Figure A3. Sensitivity analysis of similarity threshold in the continuous-state SLDRL experiments. The results show the averages of 100 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals.

Appendix B.4. Component Ablation Analysis

To further evaluate the contribution of the individual mechanisms introduced in SLDRL, we performed an additional component ablation analysis in the discrete-state foraging environment. Since SLDRL is designed as a social learning enhancement layer operating on top of DQN rather than a standalone learning algorithm, removing individual components is not expected to prevent learning entirely. Instead, the objective of this analysis is to quantify the contribution of each mechanism to learning efficiency. For this purpose, two ablated variants of SLDRL were evaluated while maintaining the same DQN architecture, training procedure, and experimental setup used in the main experiments. In the first variant, the state similarity matching mechanism was removed, allowing socially acquired behaviors to be enacted without contextual filtering based on state similarity. In the second variant, the entropy-based reinforcement mechanism was removed, and observed behaviors were selected without updating their reuse probability according to entropy reduction. All remaining components of the framework were preserved.

Figure A4 presents the results of the component ablation analysis. The complete SLDRL agent achieved the highest learning performance throughout training. Removing either the state similarity matching mechanism or the entropy-based reinforcement mechanism resulted in a statistically significant reduction in learning performance relative to the complete SLDRL configuration. The performance reduction was more pronounced when entropy-based reinforcement was removed, indicating a stronger contribution of this mechanism to the adaptive selection and reuse of socially acquired behaviors. Nevertheless, removing state similarity matching also produced a statistically significant degradation, demonstrating that contextual alignment contributes meaningfully to the effective utilization of observed behaviors. Although the two ablated variants converged toward similar performance levels after approximately 4000 environment steps, the complete SLDRL configuration consistently maintained superior learning efficiency. The results indicate that both mechanisms contribute to the effectiveness of SLDRL and that the observed performance gains emerge from the interaction between selective behavior reuse and context-sensitive enactment rather than from a single dominant component.

Figure A4. Component ablation analysis of the SLDRL agent in the discrete-state foraging environment. Performance of the complete SLDRL architecture is compared with variants in which either the state similarity matching mechanism or the entropy-based reinforcement mechanism is removed. The results show the averages of 100 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals.

Appendix B.5. Robustness to Noisy Behavioral Observation

To evaluate the robustness of the proposed SLDRL framework against imperfect social information, additional experiments were conducted by introducing artificial noise into the behavioral demonstrations observed by the learner agent. The objective of these experiments was to examine whether the performance benefits provided by social learning remain observable when the copied behaviors are partially corrupted. For this purpose, during each imitation event, random perturbations were applied to the demonstrated action sequences before they were stored in the SL memory. Two noise levels were evaluated: 10% and 20%. At each noise level, each demonstrated action had the corresponding probability of being replaced with a randomly selected valid action from the environment action space. All other training conditions and hyperparameters were kept identical to those used in the main experiments.
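The perturbation procedure described above can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the function name `perturb_actions`, the integer action encoding, and the seeded generator are assumptions.

```python
import random
from typing import List, Optional, Sequence


def perturb_actions(actions: Sequence[int], action_space: Sequence[int],
                    noise_rate: float,
                    rng: Optional[random.Random] = None) -> List[int]:
    """Replace each demonstrated action, with probability noise_rate,
    by an action drawn uniformly from the environment action space."""
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must lie in [0, 1]")
    rng = rng or random.Random()
    return [rng.choice(action_space) if rng.random() < noise_rate else a
            for a in actions]


# Hypothetical example: a 5-step demonstration in an 8-action grid world.
action_space = list(range(8))
demonstration = [2, 2, 3, 3, 4]
noisy = perturb_actions(demonstration, action_space, noise_rate=0.2,
                        rng=random.Random(0))
print(noisy)
```

Because the replacement is drawn from the full action space, it can coincide with the original action, so in this sketch the fraction of actions that actually change is somewhat lower than the nominal noise rate.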

Figure A5 presents the learning performance obtained under different noise conditions. As expected, introducing noise into the demonstrated behaviors reduces the effectiveness of social learning and increases variability during training. The degradation is more visible under the 20% noise condition, particularly during the earlier stages of learning, where social information contributes more strongly to behavioral acquisition. Despite this degradation, the SLDRL agent learns faster than the pure DRL baseline for most of the training process under both noise levels. This observation suggests that the proposed selective social learning mechanism does not rely exclusively on perfectly transferred demonstrations and may exhibit a degree of tolerance to imperfect behavioral information.

However, these findings should not be interpreted as evidence of robustness under real-world social learning conditions. The introduced perturbations are an artificial, simplified approximation of noisy imitation and do not capture the full complexity of social learning in physical systems. In real robotic environments, imitation errors may arise from perception uncertainty, embodiment differences, actuator noise, temporal misalignment, and environmental variability. Therefore, although the present results provide preliminary evidence that SLDRL can preserve learning benefits under moderate levels of artificially corrupted demonstrations, validation on real robotic platforms remains necessary before drawing broader conclusions regarding robustness to noisy social learning.

Figure A5. Robustness analysis under noisy behavioral observation. Behavioral demonstrations transferred to the SLDRL agent were perturbed with random action noise at rates of 10% and 20%. The results show the averages of 100 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals.
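The figures in this appendix report averages over independent runs together with 95% confidence intervals. The text does not state how the intervals were computed; the sketch below shows one common approach, a normal-approximation interval on the mean, which is an assumption rather than the authors' procedure. The function name `mean_ci95` and the sample values are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_ci95(samples: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a normal-approximation 95% interval."""
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are required")
    mean = statistics.fmean(samples)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(samples) / math.sqrt(n)
    half_width = 1.96 * sem
    return mean, mean - half_width, mean + half_width


# Hypothetical per-run scores at a single time step.
scores = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, lower, upper = mean_ci95(scores)
print(mean, lower, upper)
```

With 100 runs per condition, the normal approximation is usually close to a Student-t interval; for much smaller numbers of runs, a t-based interval would be the more conservative choice.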

References

1. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
2. Adam, S.; Busoniu, L.; Babuska, R. Experience replay for real-time reinforcement learning control. IEEE Trans. Syst. Man Cybern. C Appl. Rev. 2012, 42, 201–212.
3. Chen, Q.; Heydari, B.; Moghaddam, M. Leveraging task modularity in reinforcement learning for adaptable Industry 4.0 automation. J. Mech. Des. 2021, 143, 071701.
4. Charpentier, A.; Elie, R.; Remlinger, C. Reinforcement learning in economics and finance. Comput. Econ. 2023, 62, 425–462.
5. Luketina, J.; Nardelli, N.; Farquhar, G.; Foerster, J.; Andreas, J.; Grefenstette, E.; Whiteson, S.; Rocktäschel, T. A Survey of Reinforcement Learning Informed by Natural Language. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI 2019), Macau, China, 10–16 August 2019; pp. 6309–6317.
6. Doya, K. Reinforcement learning in continuous time and space. Neural Comput. 2000, 12, 219–245.
7. Wang, H.; Zariphopoulou, T.; Zhou, X.Y. Reinforcement learning in continuous time and space: A stochastic control approach. J. Mach. Learn. Res. 2020, 21, 8145–8178.
8. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
9. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep reinforcement learning: A brief survey. IEEE Signal Process. Mag. 2017, 34, 26–38.
10. Ibarz, J.; Tan, J.; Finn, C.; Kalakrishnan, M.; Pastor, P.; Levine, S. How to train your robot with deep reinforcement learning: Lessons we have learned. Int. J. Robot. Res. 2021, 40, 698–721.
11. Wang, W.Y.; Li, J.; He, X. Deep reinforcement learning for NLP. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL): Tutorial Abstracts, Melbourne, VIC, Australia, 15–20 July 2018; pp. 19–21.
12. Brown, N.; Bakhtin, A.; Lerer, A.; Gong, Q. Combining deep reinforcement learning and search for imperfect-information games. In Advances in Neural Information Processing Systems (NeurIPS); Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 17057–17069.
13. Yu, C.; Liu, J.; Nemati, S.; Yin, G. Reinforcement learning in healthcare: A survey. ACM Comput. Surv. 2023, 55, 1–36.
14. Nguyen, T.T.; Reddi, V.J. Deep reinforcement learning for cyber security. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 3779–3795.
15. Wang, X.; Liu, L. Risk-Sensitive Deep Reinforcement Learning for Portfolio Optimization. J. Risk Financ. Manag. 2025, 18, 347.
16. Andrychowicz, M.; Wolski, F.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; McGrew, B.; Tobin, J.; Abbeel, P.; Zaremba, W. Hindsight Experience Replay. In Advances in Neural Information Processing Systems (NeurIPS 2017); Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
17. Reed, M.S.; Evely, A.C.; Cundill, G.; Fazey, I.; Glass, J.; Laing, A.; Newig, J.; Parrish, B.; Prell, C.; Raymond, C.; et al. What is social learning? Ecol. Soc. 2010, 15, 1.
18. Reader, S.M.; Biro, D. Experimental identification of social learning in wild animals. Learn. Behav. 2010, 38, 265–283.
19. Hoppitt, W.J.E.; Brown, G.R.; Kendal, R.; Rendell, L.; Thornton, A.; Webster, M.M.; Laland, K.N. Lessons from animal teaching. Trends Ecol. Evol. 2008, 23, 486–493.
20. Laland, K.N. Social learning strategies. Learn. Behav. 2004, 32, 4–14.
21. Rendell, L.; Fogarty, L.; Hoppitt, W.J.E.; Morgan, T.J.H.; Webster, M.M.; Laland, K.N. Cognitive culture: Theoretical and empirical insights into social learning strategies. Trends Cogn. Sci. 2011, 15, 68–76.
22. Borg, J.M.; Channon, A. The effect of social information use without learning on the evolution of social behavior. Artif. Life 2020, 26, 431–454.
23. Acerbi, A.; Nolfi, S. Social learning and cultural evolution in embodied and situated agents. In Proceedings of the IEEE Symposium on Artificial Life, Honolulu, HI, USA, 1–5 April 2007; pp. 333–340.
24. Floreano, D.; Mitri, S.; Magnenat, S.; Keller, L. Evolutionary conditions for the emergence of communication in robots. Curr. Biol. 2007, 17, 514–519.
25. Hastie, T.; Tibshirani, R.; Friedman, J. Overview of supervised learning. In The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, USA, 2009; pp. 9–41.
26. Mitchell, M. An Introduction to Genetic Algorithms; MIT Press: Cambridge, MA, USA, 1998.
27. Billard, A.; Calinon, S.; Dillmann, R.; Schaal, S. Robot programming by demonstration. In Springer Handbook of Robotics; Siciliano, B., Khatib, O., Eds.; Springer: Berlin, Germany, 2008; pp. 1371–1394.
28. Hussein, A.; Gaber, M.M.; Elyan, E.; Jayne, C. Imitation learning: A survey of learning methods. ACM Comput. Surv. 2017, 50, 1–35.
29. Nehaniv, C.L.; Dautenhahn, K. Imitation in Animals and Artifacts; MIT Press: Cambridge, MA, USA, 2002.
30. Guenter, F.; Billard, A.G. Using reinforcement learning to adapt an imitation task. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), San Diego, CA, USA, 29 October–2 November 2007; pp. 1022–1027.
31. Breazeal, C.; Buchsbaum, D.; Gray, J.; Gatenby, D.; Blumberg, B. Learning from and about others: Towards using imitation to bootstrap the social understanding of others by robots. Artif. Life 2005, 11, 31–62.
32. Nicolescu, M.; Mataric, M.J. Task learning through imitation and human–robot interaction. In Models and Mechanisms of Imitation and Social Learning in Robots, Humans and Animals: Behavioural, Social and Communicative Dimensions; Dautenhahn, K., Nehaniv, C.L., Eds.; Cambridge University Press: Cambridge, UK, 2005; pp. 407–424.
33. Wang, X.; Chen, Y.; Zhu, W. A survey on curriculum learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 4555–4576.
34. Erbas, M.D.; Winfield, A.F.T.; Bull, L. Embodied imitation-enhanced reinforcement learning in multi-agent systems. Adapt. Behav. 2014, 22, 31–50.
35. Pomerleau, D.A. Efficient training of artificial neural networks for autonomous navigation. Neural Comput. 1991, 3, 88–97.
36. Ross, S.; Gordon, G.; Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 627–635.
37. Ng, A.Y.; Russell, S. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning (ICML), Stanford, CA, USA, 29 June–2 July 2000; pp. 663–670.
38. Ratliff, N.D.; Silver, D.; Bagnell, J.A. Learning to search: Functional gradient techniques for imitation learning. Auton. Robots 2009, 27, 25–53.
39. Ziebart, B.D.; Maas, A.L.; Bagnell, J.A.; Dey, A.K. Maximum entropy inverse reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Chicago, IL, USA, 13–17 July 2008; pp. 1433–1438.
40. Ho, J.; Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems (NeurIPS); Curran Associates, Inc.: Red Hook, NY, USA, 2016; Volume 29, pp. 4565–4573.
41. Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; Bharath, A.A. Generative adversarial networks: An overview. IEEE Signal Process. Mag. 2018, 35, 53–65.
42. Yi, M.; Xu, X.; Zeng, Y.; Jung, S. Deep imitation reinforcement learning with expert demonstration data. J. Eng. 2018, 2018, 1567–1573.
43. Mao, Y.; Gao, F.; Zhang, Q.; Yang, Z. An AUV target-tracking method combining imitation learning and deep reinforcement learning. J. Mar. Sci. Eng. 2022, 10, 383.
44. Yang, C.; Ma, X.; Huang, W.; Sun, F.; Liu, H.; Huang, J.; Gan, C. Imitation learning from observations by minimizing inverse dynamics disagreement. In Advances in Neural Information Processing Systems (NeurIPS); 2019; Volume 32.
45. Li, A.; Boots, B.; Cheng, C.-A. Mahalo: Unifying offline reinforcement learning and imitation learning from observations. In Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA, 23–29 July 2023; PMLR: Honolulu, HI, USA, 2023; Volume 202, pp. 19360–19384.
46. Nehaniv, C.L.; Dautenhahn, K. (Eds.) The correspondence problem. In Imitation in Animals and Artifacts; Chapter 3; MIT Press: Cambridge, MA, USA, 2002; pp. 41–61.
47. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-Level Control through Deep Reinforcement Learning. Nature 2015, 518, 529–533.
48. Arwa, E.O.; Folly, K.A. Reinforcement learning techniques for optimal power control in grid-connected microgrids: A comprehensive review. IEEE Access 2020, 8, 208992–209007.
49. Barto, A.G.; Sutton, R.S.; Anderson, C.W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cybern. 1983, SMC-13, 834–846.
50. MathWorks. Reinforcement Learning Toolbox: CartPole Environment. Available online: https://www.mathworks.com/help/reinforcementlearning/benchmark-examples.html (accessed on 21 May 2026).

Figure 1. Simulation setup for the foraging task that the agents are trained to perform.

Figure 2. General architecture of the DQN algorithm, adapted from [47].

Figure 3. Control architecture of the proposed SLDRL agent. The controller consists of an upper deep reinforcement learning (DRL) layer and a lower Social Learning (SL) layer. The DRL layer contains the DQN controller responsible for action selection and policy updates during task learning. The SL layer contains the components required for social learning, including behavior memory, similarity matching, entropy-based behavior evaluation, and behavior selection mechanisms for storing and retrieving observed behaviors. The figure illustrates the information flow and interaction between the DRL and SL layers, while the internal DQN optimization process is separately detailed in Figure 2.

Figure 4. High-level block diagram of the proposed SLDRL framework. The diagram provides an end-to-end overview of the interaction between the DQN controller, the Social Learning module, and the environment throughout the learning cycle.

Figure 5. Pre-defined trajectories given to the SLDRL agent. Each trajectory consists of 5 consecutive actions towards one of the eight directions of the compass.

Figure 6. Results for observing pre-defined trajectories. The results show the averages of 120 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals. The pure DRL agent is controlled solely by the DQN, while the SLDRL agent is directed by the hybrid control architecture shown in Figure 3. The DQN parameters in all experiment setups are α = 0.01, γ = 0.9 and ε = 0.3. These values follow common practice in the literature for training DQN agents on the grid-based foraging problem, ensuring comparability with prior studies and stable learning performance.

Figure 7. Average number of triggered imitate and enact actions in every 100 actions that are executed by the agent observing pre-defined trajectories. The results show the averages of 120 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals.

Figure 8. The movement trajectories of the expert agent. The expert starts from the initial position and follows the shortest path for each of the three target locations.

Figure 9. Results for observing an expert. The results show the averages of 120 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals.

Figure 10. Average number of triggered imitate and enact actions for observing an expert. The results show the averages of 120 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals.

Figure 11. Comparison of results from experiment setups with observing an expert and observing pre-defined trajectories. The results show the averages of 120 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals: (a) Performance of the SLDRL agent in different environment setups; (b) Number of enact behaviors executed by the SLDRL agent in different environment setups.

Figure 12. Results for observing an experienced agent. The results show the averages of 120 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals.

Figure 13. Average number of triggered imitate and enact actions for observing an experienced agent. The results show the averages of 120 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals.

Figure 14. Results for observing an inexperienced agent. The results show the averages of 120 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals.

Figure 15. Average number of triggered imitate and enact actions for observing an inexperienced agent. The results show the averages of 120 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals.

Figure 16. Comparison of results from experiment setups with observing an experienced and an inexperienced agent. The results show the averages of 120 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals: (a) Performance of the SLDRL agent in different environment setups; (b) Number of enact behaviors executed by the SLDRL agent in different environment setups.

Figure 17. Comparison of the performance of the SLDRL algorithm in different experiment setups. The results show the averages of 120 experiment runs, each lasting 10,000 time units, along with 95% confidence intervals.

Figure 18. Results for CartPole experiments. The results show the averages of 100 experiment runs, each lasting 100 episodes, along with 95% confidence intervals. The DQN parameters in this experiment setup are α = 0.01 and γ = 0.99, with ε decaying from 1.0 to 0.01 over training. These values follow common practice in the literature for training DQN agents on the CartPole problem, ensuring comparability with prior studies and stable learning performance. The similarity threshold τ, which two states must meet to be considered a match, is set to 0.99.

Figure 19. Ratio of triggered imitate and enact actions during the CartPole experiments. The results show the averages of 100 independent experiment runs, each lasting 100 episodes, along with 95% confidence intervals. The figure illustrates the dynamic balance between behavioral acquisition through imitation and behavioral exploitation through enactment during the learning process.

Table 1. Representative computational overhead comparison between pure DRL and SLDRL.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Method	Training Time (s)	Relative Increase
Pure DRL	205 ± 10	-
SLDRL	210 ± 12	+2.4%

Hyperparameter	Discrete-State	Continuous-State
Optimizer	Adam	Adam
Learning rate (α)	0.01	0.01
Discount factor (γ)	0.9	0.9
Exploration rate (ε)	1.0 → 0.01 (decay)	1.0 → 0.01 (decay)
Replay memory size	10,000	10,000
Minibatch size	64	64
Target network update	100	100
Hidden layers	2	2
Hidden units	24	24
Activation	ReLU	ReLU
SL memory capacity	10	10
Action sequence length	5	5
Similarity threshold	Neighboring cells	τ = 0.99
Numerical stability constant (δ)	10⁻⁸	10⁻⁸