Transfer Learning for Operator Selection: A Reinforcement Learning Approach

Abstract: In the past two decades, metaheuristic optimisation algorithms (MOAs) have become increasingly popular, particularly in logistics, science, and engineering problems. A fundamental characteristic of such algorithms is that they depend on a parameter or a strategy. Online and offline strategies are employed to obtain optimal configurations of the algorithms. Adaptive operator selection is one of them, and it determines whether or not to update a strategy from the strategy pool during the search process. In the field of machine learning, Reinforcement Learning (RL) refers to goal-oriented algorithms, which learn from the environment how to achieve a goal. In MOAs, reinforcement learning has been utilised to control the operator selection process. However, existing research fails to show that learned information may be transferred from one problem-solving procedure to another. The primary goal of the proposed research is to determine the impact of transfer learning on RL and MOAs. As a test problem, a set union knapsack problem with 30 separate benchmark problem instances is used. The results are statistically compared in depth. According to the findings, the learning process improved the convergence speed while significantly reducing the CPU time.


Introduction
Adaptive operator selection has been playing a crucial role in heuristic optimisation, especially in population-based metaheuristic approaches, including swarm intelligence algorithms. Since the early 1990s, the concept of Adaptive Operator Selection (AOS) and the methods developed for it have been widely known [1,2]. Most recently, AOS has been used with artificial bee colony (ABC) algorithms for the first time [3]. The study has been extended further with a dynamically built selection scheme with reinforcement learning to solve binary and combinatorial optimization problems [4]. The problem of operator selection becomes a sequencing problem in the sense that additional operators are added one after the other to make it easier to move solutions to more fruitful regions of the search space. Due to the randomness effect and the unknown nature of the search space, previously devised schemes may not provide the best or even a better option to respond to the current state of the problem. However, stochastic and dynamic programming-based approaches may work better. The success of an optimisation algorithm using a sequence of operators handled with stochastic processes can be seen as a Markovian Decision Process due to its nature. As a typical stochastic process and using gained experience, Q learning can help in selecting the best operator among several in a given search space under specific conditions. Many complex and difficult real-world problems, especially combinatorial ones, are thought to be easier to solve once the circumstances are effectively mapped to the best operators using experiences. The machine learning literature is filled with good examples and state of the art techniques for mapping problem states to the expected outcomes. However, since this has been done within the boundaries of a single problem domain, significant changes in data and domain will necessitate the duplication of the same process. 
Recent machine learning studies suggest that learning how to handle a case can be transferred across domains, and a certain level of success can be achieved using deep learning [5]. This article proposes a reinforcement learning-based transfer learning approach to aid search algorithms with adaptive operator selection schemes while transferring gained experience from one case/run/benchmark to another. Although it is widely acknowledged that deep learning approaches facilitate the use of pre-trained tools in solving new problems, shallow learning processes, particularly the use of building adaptive operator selection schemes within a dynamic and extremely unknown environment, such as metaheuristic search processes for problem solving, are less well understood. This is the first attempt, to the best of the authors' knowledge, to apply transfer learning in building adaptive operator selection processes designed with reinforcement learning and implemented in swarm intelligence algorithms, such as the artificial bee colony algorithm.
The rest of the paper is organised as follows: Section 2 introduces the approaches on which the proposed method is developed, while Section 3 details the proposed transfer learning approach. Extensive experimental results are provided in Section 4, and the article is concluded in Section 5.

Background and Related Work
This study brings a number of techniques together for devising an adaptive search process embedded in a swarm intelligence algorithm and enhanced with reinforcement learning. To keep the article self-contained, this section discusses briefly the necessary background followed by a review relevant to the proposed work.

Artificial Bee Colony Algorithm
The outermost optimisation framework used in this study is a swarm intelligence algorithm, namely the artificial bee colony (ABC) algorithm. It is a population-based metaheuristic and evolutionary technique inspired by the foraging behaviour of honey bees when seeking a quality food source [6]. There is a population of food positions, which represents the solution set in the ABC algorithm, and the artificial bees modify their positions over time to reach high-quality food. In order to find the optimal solution, the algorithm employs a group of agents known as honeybees. It is one of the efficient nature-inspired optimisation algorithms for solving continuous problems. Other swarm intelligence algorithms include ant colony optimisation (ACO) [7], which has been successfully used to solve discrete problems, and particle swarm optimisation (PSO) [8], which is a population-based stochastic optimisation algorithm that has been successfully used to solve continuous problems. There are three types of bees in the ABC algorithm, namely employed, onlooker, and scout bees. Employed bees are assigned to each food source in the first phase, and they use Equation (1) to try to increase the quality of the food source. The second stage involves onlooker bees attempting to enhance the most promising solutions, which are assigned probability values based on fitness function values using Equations (2) and (3). In the final step, bees whose food sources have not improved turn into scout bees, which replace the non-improved food source with a new random viable solution.
where candidate, current, and neighbour solutions are represented by $v_i$, $x_i$, and $x_n$, respectively.
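As an illustration of the three phases described above, the following minimal Python sketch uses the update, fitness, and probability rules commonly given for the canonical continuous ABC. These standard forms are assumed here to correspond to Equations (1)-(3); the function names are hypothetical, and the study itself uses binary operators rather than this continuous move.

```python
import random


def candidate(x_i, x_n, rng):
    """Employed-bee move: perturb one random dimension of x_i relative to neighbour x_n.

    Assumed standard form: v_ij = x_ij + phi * (x_ij - x_nj), with phi in [-1, 1].
    """
    v_i = list(x_i)
    j = rng.randrange(len(x_i))
    phi = rng.uniform(-1.0, 1.0)
    v_i[j] = x_i[j] + phi * (x_i[j] - x_n[j])
    return v_i


def fitness(f):
    """Map an objective value (to be minimised) to a positive fitness value."""
    return 1.0 / (1.0 + f) if f >= 0 else 1.0 + abs(f)


def onlooker_probabilities(objective_values):
    """Selection probability of each food source, proportional to its fitness."""
    fits = [fitness(f) for f in objective_values]
    total = sum(fits)
    return [fit / total for fit in fits]


rng = random.Random(0)
v = candidate([1.0, 2.0], [0.0, 0.0], rng)
probs = onlooker_probabilities([0.0, 1.0, 3.0])
```

In this sketch a lower objective value yields a higher fitness, so the first food source receives the largest share of onlooker bees.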
In this study, we will focus on using the ABC algorithm, which is widely used in various industries to solve a variety of problems, including combinatorial and binary problems. For the sake of brevity, further literature details have not been considered because the major goal of the proposed research is to emphasise building adaptive operator selection using one of the state-of-the-art machine learning techniques and investigate if transfer learning can be achieved in this respect.

Reinforcement Learning
Reinforcement Learning (RL) is a type of machine learning technique for solving sequential decision-making problems. In this technique, a learning agent interacts with the environment to improve its performance through trial and error [9]. As with other learning techniques, it is about mapping situations to behaviours in order to optimise some reward. However, unlike other machine learning techniques, the main challenge in RL is that the learning agent has to discover by itself the best action to take in a given situation. That is, in RL, the agent learns by itself without the intervention of a human. Dynamic programming is often used in this technique to find the optimum strategy to maximise reward in a given situation. The following key terms describe the fundamental parts of an RL problem: Environment ($E$), the physical world in which the agent acts; States ($S$), the situations in which the agent can find itself; Actions ($A$), the set of actions available to the agent; Reward ($R : S \times A \to \mathbb{R}$), the feedback from the environment (good or bad); Policy ($\Pi$), a strategy that maps the agent's states to actions in pursuit of its goals; and Value ($V$), the future reward that an agent will receive by taking an action in a particular state. RL techniques can be implemented using various approaches, including Q-learning [9]. In this approach, the agent learns an optimal policy based on previous experience in the form of sample sequences of states, actions, and rewards. Therefore, each learning step consists of a state-transition tuple $(s_i, a_i, r_{i+1}, s_{i+1})$, where $s_i \in S$ is the current state of the agent, $a_i \in A$ denotes the chosen action in the current state, $r_{i+1} \in \mathbb{R}$ specifies the immediate reward received after transitioning from the current state to the next state, and $s_{i+1} \in S$ represents the next state.
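To make the role of the state-transition tuple concrete, the sketch below applies the textbook tabular Q-learning update to one tuple at a time. It is a generic illustration, not the update rule of this study; the step size `lr`, the discount `gamma`, and the state and action labels are illustrative assumptions.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, lr=0.5, gamma=0.3):
    """Apply one textbook Q-learning step for the tuple (s, a, r, s_next).

    Q[(s, a)] <- Q[(s, a)] + lr * (r + gamma * max_b Q[(s_next, b)] - Q[(s, a)])
    """
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += lr * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]


Q = defaultdict(float)          # unseen state-action pairs start at 0.0
actions = ["op1", "op2"]        # hypothetical action labels
first = q_update(Q, "s0", "op1", 1.0, "s1", actions)
second = q_update(Q, "s1", "op2", 0.0, "s0", actions)
```

The second call shows how reward propagates backwards: no immediate reward is received, yet the value of ("s1", "op2") rises because the successor state "s0" already has a positive estimate.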
There are different ways to formulate a problem mathematically in RL; one of them is the Markov Decision Process (MDP). In many applications, it is assumed that the agent is unaware of anything in the environment. However, in some other applications, it can be assumed that not everything in the environment is unknown to the agent; for example, reward calculation is considered to be part of the environment even though the agent has some knowledge of how its reward is calculated as a function of its actions and states. An MDP can be represented as a tuple $(S, A, T, \gamma, R)$, where $S$, $A$, and $R$ are defined above, $\gamma \in [0, 1]$ is called the discount factor, and $T : S \times A \times S \to [0, 1]$ is called the probabilistic transition relation such that, for a given state $s$ and an action $a$, $\sum_{s' \in S} T(s, a, s') = 1$. The system being modelled is Markovian if the result of an action does not depend on the previous actions and visited states (history) but only depends on the current state, i.e., $P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t) = T(s_t, a_t, s_{t+1})$. This implies that the current state $s$ gives enough information to the agent to make an optimal decision. That is, if the agent selects an action $a$, the probability distribution over the next states is the same as the last time the agent tried this action in the same state. Once an MDP is defined, we can define policies, optimality criteria, and value functions to compute optimal policies. Solving a given MDP means computing an optimal policy. A more detailed discussion can be found elsewhere [10]. RL techniques have been successfully used to train robotic and/or software agents for a variety of purposes, including games, in situations ranging from simple to complex problems [11]. In particular, Deep RL has recently been developed and made available for dealing with and solving complex, dynamic, online, and real-time problems.
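The following sketch illustrates what "solving an MDP" can look like for a tiny, made-up example: it checks that each row of the transition relation sums to one and then computes a policy by value iteration. The two-state MDP, its rewards, and the discount factor of 0.9 are purely illustrative assumptions, and value iteration is only one of several standard solution methods.

```python
def is_valid_transition(T, states, actions, tol=1e-9):
    """Check that T(s, a, .) is a probability distribution for every (s, a)."""
    return all(
        abs(sum(T[s][a].get(s2, 0.0) for s2 in states) - 1.0) <= tol
        for s in states
        for a in actions
    )


def q_value(T, R, V, s, a, gamma):
    """Immediate reward plus discounted expected value of the next state."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())


def value_iteration(T, R, states, actions, gamma=0.9, sweeps=200):
    """Repeated Bellman backups, followed by greedy policy extraction."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: max(q_value(T, R, V, s, a, gamma) for a in actions) for s in states}
    policy = {
        s: max(actions, key=lambda a: q_value(T, R, V, s, a, gamma)) for s in states
    }
    return V, policy


states, actions = ["A", "B"], ["stay", "go"]
T = {
    "A": {"stay": {"A": 1.0}, "go": {"B": 0.8, "A": 0.2}},
    "B": {"stay": {"B": 1.0}, "go": {"A": 1.0}},
}
R = {"A": {"stay": 0.0, "go": 0.0}, "B": {"stay": 1.0, "go": 0.0}}
V, policy = value_iteration(T, R, states, actions)
```

Because the next-state distribution depends only on the current state and action, the backup never needs the history of visited states, which is exactly the Markov property stated above.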
As part of the heuristic optimisation outlined below, RL approaches can also be employed in operator selection. This makes it easier to develop better-informed adaptive selection methods that take inputs into account when selecting operators and rewarding the outcome of each operation.

Adaptive Operator Selection
Many NP-hard problems can be solved using evolutionary search techniques [12]. These are mostly stochastic optimisation algorithms that have already demonstrated their effectiveness in a variety of application domains. This is largely due to the parameters that the user can define based on the problem at hand. However, such algorithms are very sensitive to the definition of these parameters. There are no standard principles for an effective setting, so researchers from other domains rarely use those algorithms. One of the features that search algorithms with multiple alternative operators require is operator selection. In this paper, we focus on Adaptive Operator Selection (AOS) [1]. Since its introduction in the 1990s, many AOS approaches have been proposed in the literature, varying widely in various aspects such as the amount of knowledge to use from the algorithm's previous performance and whether or not it is a good idea to use previous quality in the learning process. In practice, Credit Assignment (CA) and Operator Selection (OS) are the two components that are used during the operator selection process [13,14]. A definition based on fitness achievement over a solution is used in the CA component. OS, on the other hand, uses CA's captured knowledge to determine the quality of each operator before estimating its likelihood. Finally, based on the probability assigned to each operator, a selection strategy is used to choose an operator for evolving a parent. All the parents in an episode are evolved using the same selection strategy. As the algorithm learns more about the landscape, it moves the solutions in a specific search direction after a number of episodes.
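The two components can be sketched as follows. Credit values are assumed to be already accumulated by the CA component; the OS component turns them into probabilities and draws an operator with a roulette wheel. Probability matching with a floor `p_min` is used here as one common choice from the AOS literature, not necessarily the scheme of the cited works, and the credit values are invented for illustration.

```python
import random


def selection_probabilities(credits, p_min=0.05):
    """Probability matching: share proportional to credit, with a minimum p_min."""
    k = len(credits)
    total = sum(credits.values())
    if total <= 0:
        # No operator has earned credit yet: select uniformly.
        return {op: 1.0 / k for op in credits}
    return {op: p_min + (1.0 - k * p_min) * c / total for op, c in credits.items()}


def roulette_wheel(probs, rng):
    """Draw one operator with probability proportional to its slice of the wheel."""
    r = rng.random()
    cumulative = 0.0
    for op, p in probs.items():
        cumulative += p
        if r < cumulative:
            return op
    return op  # guard against floating-point round-off in the last slice


credits = {"binABC": 1.0, "ibinABC": 3.0, "disABC": 0.0}  # illustrative credits
probs = selection_probabilities(credits)
rng = random.Random(42)
picks = [roulette_wheel(probs, rng) for _ in range(1000)]
```

The floor `p_min` keeps every operator selectable, so an operator that performs poorly early in the search can still be retried later.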

Transfer Learning
Transfer in reinforcement learning is a new field of study that focuses on developing strategies for transferring knowledge from a set of source tasks to a target task. When the tasks are similar, a learning algorithm can use the transferred knowledge to solve the target task and enhance performance significantly [15]. So far, traditional machine learning and deep learning algorithms have been intended to work in isolation. These algorithms have been designed to solve specific problems. Once the feature-space distribution changes, the models must be rebuilt from scratch. Transfer learning techniques have been proposed to overcome the isolated learning paradigm, allowing knowledge learnt for one problem to be used to address other related problems. The following three critical questions must be addressed during the transfer learning process: what needs to be transferred, when should it be transferred, and how should it be transferred? Depending on the domain, problem at hand, and data availability, various transfer learning techniques could be used [16]. This is crucial because one of the most difficult aspects of transfer learning for an RL agent running in a target problem is figuring out which elements of the target and source problems are the same and which are different. The majority of transfer learning research has focused on general classical RL problems; the focus of the proposed study, however, is on how to acquire transferable experience in operator selection through reinforcement learning.

Related Work
The performance of evolutionary algorithms, similar to that of the other meta-heuristics, is frequently linked to proper design decisions, such as crossover operator selection and other factors. The selection of variation operators that are efficient to solve the problem at hand is one of the parameters that has a significant impact on the performance of such algorithms. The control of these operators can be handled at both the structural and behavioural levels when solving the problem. At the behavioural level, adaptive operator selection refers to the process of deciding which of the available operators should be used at any given time. The adaptive operator selection technique is widely used to enhance the search power in many evolutionary algorithms, including in Multi-objective Evolutionary Algorithm [14]. In [14], the authors have proposed a bandit-based AOS method for selecting appropriate operators in an online manner. Their work proposes fitness-rate-rank, which is a credit assignment that updates the attributes using ranking rather than raw fitness progress.
The decomposition is well-known in traditional multi-objective optimisation, and the technique is used by [17]. The authors of [18] proposed the so-called multi-objective evolutionary algorithm with decomposition in 2007, which was the first time the decomposition technique was used in multi-objective optimisation. Despite the fact that these research studies have shown significant results, no state-of-the-art studies have taken into account situational information such as problem state. For example, while selecting an operator to develop new solutions, none of the above discussed approaches took into account the problem state and/or the history of past acts. In a slightly different direction, a few research studies in single-objective continuous optimisation have addressed the algorithm selection problem in an automated method. In [19], the authors proposed an initial approach to combine exploratory landscape analysis (ELA) and algorithm selection, concentrating on the BBOB test suite [20].
The work presented by [21] selects operators using fitness landscape and performance indicators without a structured learning process. The majority of AOS research in the literature is based on traditional dynamic programming approaches. There has never been a detailed investigation of a technique that uses reinforcement learning (RL) to consider the problem state, i.e. input data, when selecting operators. In [22], the authors present a Markov Decision Process model for selecting crossover operators during the evolutionary search. A Q-learning method is used to solve the given model. On the benchmark instances of Quadratic Assignment Problems, they have experimentally validated the efficacy of the proposed strategy. However, the work lacks a detailed presentation, as well as analysis and discussion. The work presented in [23] emphasises how AOS is developed with RL for a variable neighbourhood search algorithm to solve vehicle-routing problems with time window and open routes. However, not much detail is provided on how reinforcement learning is implemented throughout the article.
In a more recent work [4], the authors proposed an adaptive operator selection approach based on reinforcement learning. In their proposed method, the problem states are mapped to operators based on the success level per operation. Although these techniques advance the state of the art on AOS based on RL, they are generally confined within the boundaries of one problem domain, and if major changes in data and domain occur, the same process must be repeated. As a result, new approaches for transferring learnt information from one problem-solving procedure to another are required. The proposed research addresses this problem by presenting a technique to determine the effect of transfer learning using RL and metaheuristic optimisation algorithms, with the Adaptive Operator Selection method being used to choose between the available operators.

Proposed Approach for Transfer Learning with RL
It is well known that the transferability of knowledge and gained experience on how to solve problems optimally is quite limited. This is due to the uniqueness of the search spaces and the characteristics of the problem domain and data. Transfer learning in the deep learning context has nevertheless facilitated better performance, which motivates investigating whether a comparable level of transfer can be achieved here. The problem data and set of parameters make each problem unique and distinct, making it difficult to apply experience gained on one problem to another. The aim of this research is to investigate how to achieve some degree of transferability. In this context, it is envisaged that the knowledge and experience gained through prior searches can be transferred on three levels: (i) across runs of the same problem under different circumstances but with the same configuration and settings; (ii) across different problems with the same size and context; and (iii) across different problems of various sizes and contexts.
The proposed research investigates if experience could be transferred between runs of the same problem.
In this case, the idea of transferring learning and experience is implemented using reinforcement learning. This is achieved using a dynamic and online learning strategy to facilitate utilisation of the gained experience in different circumstances. More specifically, this turns out to be a problem of dynamic operator sequencing, since an optimally selected operator to produce new solutions is added to the list of operators selected so far. At the end of a problem-solving process, a sequence of operators will have been produced using a set of criteria such as operator selection schemes. The framework used to solve the problems while operator selection is learned through reinforcement learning is presented in Figure 1, where the ABC framework is depicted on the left-hand side and the details of AOS are shown on the right-hand side. The ABC side shows the interaction between a population of solutions and the new solution generator, which is detailed with the selection scheme built with RL. As depicted in Figure 1, a swarm intelligence algorithm (ABC here) takes the role of the search framework, while operators are selected from a pool subject to a preferred AOS scheme. A reinforcement learning algorithm (here, Q learning embedded with a distance-based clustering algorithm) is placed in the search framework to work alongside the search and learn how the operators work best under given circumstances. The RL algorithm continuously monitors the operator selection and search processes to gain experience and processes it to support the online operator selection scheme. The search algorithm selects an operator from the pool by applying the rule of the selection scheme (here, the best calculated Q value). Once an operator is selected, it produces a new solution, which is evaluated to decide whether to carry it into the next generation.
Depending on the success attained by the selected operator, a Q learning algorithm updates the measure for the corresponding selected operator. This is repeated until a new generation is completely built. Note that the ABC algorithm works generation-by-generation as a population-based algorithm. The complete algorithm is outlined in Algorithm 1.

Algorithm 1 General overview of RL-AOS
Initial Phase
    if learning is not activated or first run then
        Initialise credit values and cluster centres C
    end if
Operator Selection
    Assign probabilities
    Choose operator using Roulette-Wheel selection
Operator Evaluation
    Execute operator and get reward
    if positive reward and learning is activated then
        Update cluster of operator
        Update operator total reward
    end if
At the end of iteration
    if learning is activated then
        Update credit values using Equation (5)
    end if

We present the concepts discussed above more formally below. Let $X$ be a population of solutions that makes up a bee colony handled by the ABC implemented in this study, where $X = \{x_i \mid i = 1, \ldots, |X|\}$. Each solution $x_i$ is defined as a $D$-dimensional binary set. Meanwhile, a set of clusters, $C = \{c_a \mid a = 1, \ldots, |A|\}$, is defined to represent the set of actions, $A$, where each cluster centre is the centroid measure of the $D$-dimensional dataset, $c_a = \{c_{a,j} \mid j = 1, \ldots, |D|\}$, and is calculated using an equation in which $t$ is the number of iterations done so far and $b_{a,i} \in \{0, 1\}$ is a binary value indicating whether the action is successful (i.e., whether operator $a$ helped produce a better fitness); it takes the value 1 if successful and 0 otherwise. The centroids are optimised online, with the Q learning algorithm collecting the rewards, $r_{a,i}$, based on the fitness values, $F(x_i)$, as detailed in [4]. All $c_a$ values are initialised to 0, while random values are allocated to $Q(x, a)$. Earlier iterations impose a random selection of operators, $a \in A$, whereas subsequent stages enforce the selections through fine-tuned Q values throughout the experience-gaining process. The $Q(x_i, a)$ values are updated immediately after an action is taken (i.e., operator $a$ is chosen and applied to $x_i$) using a rule in which $\beta$ is the learning coefficient, $\gamma$ is the discounting factor, and $E(y_i)$ is the expected Q value for the new problem state, i.e., a solution.
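The quantities defined above can be illustrated with a short sketch. The exact centroid and Q-value equations are given in [4]; the forms below are assumptions chosen for illustration only: a success-gated running mean for the centroid and a temporal-difference style update for the Q value. All function names and the default values of `beta` and `gamma` are hypothetical.

```python
import math


def update_centroid(c_a, x_i, n_success):
    """Running mean of the solutions on which operator a succeeded (assumed form)."""
    n = n_success + 1
    new_c = [c + (x - c) / n for c, x in zip(c_a, x_i)]
    return new_c, n


def distance(x_i, c_a):
    """Euclidean distance between a solution and an operator's centroid."""
    return math.sqrt(sum((x - c) ** 2 for x, c in zip(x_i, c_a)))


def update_q(q, reward, expected, beta=0.5, gamma=0.3):
    """Assumed temporal-difference form: q + beta * (reward + gamma * expected - q)."""
    return q + beta * (reward + gamma * expected - q)


# Two successful applications of the same operator on binary solutions.
c, n = update_centroid([0.0, 0.0], [1, 0], 0)
c, n = update_centroid(c, [0, 1], n)
d = distance([1, 0], c)
q = update_q(0.0, reward=1.0, expected=0.0)
```

Under these assumptions, a centroid summarises the region of the binary search space in which an operator has been successful, so the distance from a solution to each centroid indicates which operator has experience near that solution.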
The expected value is calculated with $d = \lVert x_i - c_a \rVert$, the Euclidean distance between the current solution $x_i$ and the centroid $c_a$ for operator $a$. The algorithm repeats this procedure until a stopping criterion is satisfied. Building a wide range of experience across the search space requires more opportunities to gain experience, which necessitates an exploration policy alongside the exploitation of learned cases. This study employs an $\epsilon$-greedy policy to accomplish this goal. It draws a random number and performs a random selection if that number is less than a threshold, and a Q-value-based selection otherwise. More details can be found in [4,24]. Transfer learning can be adopted into a model in which $\alpha$ and $\beta$ represent the experience gained previously and the experience to be gained over upcoming attempts, respectively. In the implemented model, $\delta$ is the learning coefficient that manages the contribution of previous and next experiences. For example, it switches training ON if $\delta > 0$ and OFF otherwise. To adopt transfer learning into problem solving, the data model is taken to be the cluster, $M(D) \leftarrow C(D)$, which is trained by solving an instance of the problem, with components $m_i(d) \leftarrow c_a(d)$; $\alpha$ represents the learned components, $\beta$ is the change to be imposed by upcoming activities, and $\delta$ decides whether the past experience is used.
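The exploration policy described above can be sketched in a few lines. This is the generic epsilon-greedy rule; the Q values shown are invented for illustration, and the function name is hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random operator with probability epsilon, else the best-Q operator."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))       # explore
    return max(q_values, key=q_values.get)      # exploit


q_values = {"binABC": 0.1, "ibinABC": 0.9, "disABC": 0.3}  # illustrative values
rng = random.Random(1)
greedy_pick = epsilon_greedy(q_values, 0.0, rng)   # never explores
random_pick = epsilon_greedy(q_values, 1.0, rng)   # always explores
```

A small threshold keeps most selections driven by learned Q values while still sampling every operator occasionally.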
The algorithm is set up to run once to solve a specific problem instance, with online learning switched ON to train the cluster centres and then switched OFF to repeat the experiments with the same problem instance but with new random number sequences. Since the exploration activities are pruned, this is expected to solve the problem with better or slightly better solution qualities in a far shorter period. This is the stage at which the proposed algorithm transfers previously gained experience, which is still the most common method of transfer learning.
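The ON/OFF behaviour of the learning coefficient can be sketched as below. The additive form, learned components plus $\delta$ times the upcoming change, is an assumption made for illustration because the model equation is not reproduced here; the function names and numbers are hypothetical.

```python
def apply_experience(alpha, beta, delta):
    """Assumed additive form: learned components plus delta times upcoming change."""
    return [a + delta * b for a, b in zip(alpha, beta)]


def run_sequence(trained_centroid, changes, delta):
    """Carry a trained centroid through later runs, updating it only if delta > 0."""
    centroid = list(trained_centroid)
    history = [centroid]
    for change in changes:
        centroid = apply_experience(centroid, change, delta)
        history.append(centroid)
    return history


trained = [0.5, 0.5]                      # centroid after the first (training) run
changes = [[0.1, -0.1], [0.2, 0.0]]       # changes suggested by two later runs
frozen = run_sequence(trained, changes, delta=0.0)    # learning switched OFF
ongoing = run_sequence(trained, changes, delta=0.5)   # learning kept ON
```

With `delta=0.0` every later run reuses the trained centroid unchanged, which corresponds to switching learning OFF after the first run; a positive `delta` lets later runs keep adjusting it.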

Experimental Results
This section presents the experimental results of the proposed approach to transfer learning across different runs of the same problem instances. We demonstrate how reinforcement learning-based experience transfer helps solve the problem with high efficiency with respect to computational time. The experiments have been carried out on a high-performance computing cluster machine with an 8-core CPU, 27.2 GB of RAM, and the CentOS 7.9 operating system.

The Problem and Datasets
This study has been conducted to demonstrate the benefit of transfer learning using an adaptive operator selection scheme built with an implementation of the Q learning algorithm. Both the selection scheme and the RL (i.e., Q learning) are embedded within a standard ABC algorithm equipped with three recent state-of-the-art binary operators, binABC [25], ibinABC [26], and disABC [27], to solve the set union knapsack problem (SUKP). Q learning is implemented and integrated into the ABC algorithm to allow the agent (i.e., the ABC algorithm) to learn how to adaptively select one of these three operators.
The family of knapsack problems includes renowned combinatorial optimisation problem sets used to test the efficiency and performance of problem-solving algorithms. They are known to be NP-hard with respect to complexity and are very instrumental in modelling and solving real-world industrial problems. SUKP is a special form of knapsack problem, which is also NP-hard [28]. This problem is chosen as the testbed in this study to demonstrate the success of the proposed approach. It requires a set of items to be optimally composed in subsets so as to gain the maximum benefit. Given a set of $n$ elements, $U = \{u_i \mid i = 1, \ldots, n\}$, with a non-negative weight set, $W = \{w_i \mid i = 1, \ldots, n\}$, and a set of $m$ items, $S = \{U_j \mid j = 1, \ldots, m\}$, with a profit set, $P = \{p_j > 0 \mid j = 1, \ldots, m\}$, a subset $A \subseteq S$ is sought that maximises the profit, subject to the constraint that the total weight of the elements covered by the selected items does not exceed the capacity, $C$. The formal structure of the problem is as follows:

$$\max\; P(A) = \sum_{U_j \in A} p_j \quad \text{s.t.} \quad W(A) = \sum_{u_i \in \bigcup_{U_j \in A} U_j} w_i \le C, \quad A \subseteq S.$$

The problem is represented in real numbers and needs to be represented in binary form to enable binary operators in search algorithms such as binary ABC [3]. Following the details of the problem and the approach introduced by [29], a binary vector, $B = \{b_j \mid j = 1, \ldots, m\}$ with $b_j \in \{0, 1\}$, is defined to be used as the set of decision variables, where $b_j = 1$ if an item is selected and $b_j = 0$ otherwise. The model of the problem can be reformulated as follows:

$$\max\; f(B) = \sum_{j=1}^{m} b_j p_j \quad \text{s.t.} \quad W(A_B) = \sum_{u_i \in \bigcup_{U_j \in A_B} U_j} w_i \le C, \quad A_B = \{U_j \mid b_j = 1\}.$$

The main goal is to find the best binary vector, $B$, which provides the subset of items with the maximum profit.
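A candidate solution of the binary model can be evaluated as in the following sketch, which assumes the usual SUKP convention that an element shared by several selected items contributes its weight only once. The tiny instance and the function name are invented for illustration.

```python
def evaluate_sukp(b, items, profits, weights, capacity):
    """Return (profit, weight, feasible) for a binary selection vector b.

    items[j] is the set of elements of item j, profits[j] its profit, and
    weights[u] the weight of element u. The weight of a selection is the
    weight of the union of the elements of the selected items.
    """
    covered = set()
    profit = 0
    for selected, elements, p in zip(b, items, profits):
        if selected:
            covered |= elements
            profit += p
    weight = sum(weights[u] for u in covered)
    return profit, weight, weight <= capacity


# Illustrative instance: 3 elements, 3 items, capacity 6.
weights = {1: 2, 2: 3, 3: 4}
items = [{1, 2}, {2, 3}, {3}]
profits = [5, 6, 2]
single = evaluate_sukp([1, 0, 0], items, profits, weights, capacity=6)
overlap = evaluate_sukp([1, 1, 0], items, profits, weights, capacity=6)
```

In the second call, element 2 belongs to both selected items but is counted once, so the weight is 9 rather than 12; the selection is still infeasible for a capacity of 6.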
The problem instances of SUKP chosen in this study are collected from recently published literature. He et al. [30] introduced 30 benchmark problem instances of SUKP, tabulated in Table 1 with all configuration details, where three different configurations are presented according to the relationship between $m$ and $n$: (i) $m > n$, (ii) $m < n$, and (iii) $m = n$, while $w \in \{0.10, 0.15\}$ and $y \in \{0.75, 0.85\}$ represent the density of elements and the ratio between the capacity and the sum of the weights of the elements, respectively. As seen, each set of problem instances includes 10 instances varying in $m$, $n$, $w$, and $y$ values. More details can be found in [30,31].

Experimental Settings
The experimental study reported in this article is conducted to demonstrate that transfer learning helps improve the efficiency of swarm intelligence algorithms in solving combinatorial optimisation problems. For this purpose, three algorithms have been set up: (i) RLABC, taken from [4], is the baseline algorithm that solves the problems with ABC embedded with a Q learning-based adaptive operator selection scheme; (ii) RLABC-T extends RLABC with static transfer learning that switches off experience gaining in upcoming runs; and (iii) RLABC-TC keeps online learning active while solving new problems and executing new runs. This means $\delta = 0$ for RLABC-T, while $\delta > 0$ for RLABC-TC.
The parametric settings for all algorithms have been taken from previous works [3,4], carrying over the fine-tuned set of parameters accordingly. The configuration applied to all three algorithms includes the following settings: $\gamma$ is 0.3, the window size $W$ is 25 iterations, the reward type is extreme ($r_{i,t}$), $\epsilon$ is 0.1, and $\alpha$ is 0.5. The termination criterion is the maximum number of iterations, which is set to the problem size. For the algorithm parameters, the population size is 20 and the maximum trial number is 100.

Results and Discussions
Tables 2-4 show comparisons among the three variants in terms of solution quality. The Best column is the maximum of the best solutions over thirty different runs. Mean and Std are the average and standard deviation of those best solutions, respectively. R is the rank and S is the sign of the Wilcoxon signed-rank test. The algorithms are ranked in terms of mean values.
As can be seen from Table 2, RLABC provides the best place, (i.e., the first place) only for two instances 1_1 and 1_4. The average rank of RLABC over the set of problem is 2.2, and it has the worst ranking among three approaches. RLABC-TC is the second-best algorithm because the average of rank values is 2; it has left behind RLABC-T, which shows better performance than the others, reaching the mean rank value of 1.8. When the statistical results are examined, RLABC-T has produced a statistically meaningful result for less than half of the instances, whereas the results of RLABC-TC are statistically meaningful only for one instance.  Table 3 presents comparative statistical performance of the three variants in terms of solution quality on Set 2 benchmark instances. Clearly, RLABC-T remains in the first position among the variants similar to the case of Set 1 , where it achieves first place on the half of benchmark instances. It is observed that both RLABC-T and RLABC-TC perform better than RLABC with respect to Best values, while RLABC takes first position in only two instances, 2_6 and 2_10.  Table 4 shows the comparative statistical performance of three variants on Set 3 benchmark instances. The algorithms look more competitive on this set in comparison to the previous two sets. In fact, the average rank calculated for each is 2, 1.9, and 2.1 for RLABC, RLABC-T, and RLABC-TC, respectively. The comparative results suggest that the algorithms produce slightly different qualities of solutions.  Figure 2 shows comparative results of the algorithms with respect to CPU time, while Table 5 Figure 5 presents the convergence graphs of methods through the search process; Figure 5a shows the algorithms' behaviour on a 400-dimensional problem, 2_7, and Figure 5b plots the performances on a 500-dimensional problem, 2_9. As can be seen in both figures, RLABC-T converges quicker than the other two but remains in local optima at around iteration 200. 
Meanwhile, RLABC-TC escapes such local optima at around iterations 200, 300, and 350 in Figure 5a and at around iterations 80 and 250 in Figure 5b. RLABC, on the other hand, converges gradually but stalls after iteration 300 in both Figure 5a and Figure 5b. This suggests that it is able to escape local optima at those points but cannot converge as RLABC-TC does. Both figures make clear that RLABC is outperformed by the other algorithms, while RLABC-TC converges best of all.
Figure 5. Convergence level attained by each operator while solving (a) the 2_7 problem instance and (b) the 2_9 problem instance.
Figure 6 shows the credit values gained by the three operators over the iterations. As shown in the figure, all methods have broadly similar characteristics: ibinABC obtains the most credit from the start up to the 200th iteration, and the methods begin to differ from that point on. In RLABC, ibinABC is always the most credited operator, while RLABC-T and RLABC-TC switch to DisABC and binABC.
In overall comparison, RLABC is a powerful algorithm that performs better than most of the state-of-the-art methods applied to the same problem, as shown in [4]. However, it does not allow transfer learning in problem solving. RLABC-T and RLABC-TC improve on it not only in solution quality but also in CPU time: the solution quality is slightly better, while both RLABC-T and RLABC-TC solve the problems significantly faster than RLABC. This experimentally confirms the contribution of transfer learning to dynamically building an adaptive operator selection scheme. It is important to note that RLABC-T stops learning from new problem-solving runs, while RLABC-TC keeps updating the centroids of the corresponding clusters with newly learned cases. Past experience, combined with ongoing learning, remains beneficial in solving the problems faster without compromising solution quality.
The scope of this study has been limited to solving SUKP as an abstract combinatorial optimisation problem, which can be mapped onto many real-world problems such as different variants of scheduling and timetabling problems. Once a real-world problem has been solved with this approach as demonstrated, a profit-loss analysis can be conducted to reveal the economic benefits.
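The distinction drawn above between RLABC-T (learning frozen) and RLABC-TC (centroids still updated with new cases) can be sketched in Python as follows. This is an illustrative sketch only, not the authors' implementation: it assumes one centroid per operator, Euclidean distance, a running-mean update, and a hypothetical `learning` flag and class name.

```python
import math


def nearest(centroids, x):
    """Index of the centroid closest to x (hard assignment, Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(centroids[k], x))


class OperatorClusters:
    """Transferred cluster centroids, assumed here to be one per operator.

    learning=False mimics the frozen (RLABC-T-like) setting;
    learning=True mimics the setting that keeps updating (RLABC-TC-like).
    """

    def __init__(self, centroids, counts=None, learning=True):
        # Copy the transferred centroids so the source run is left untouched.
        self.centroids = [list(c) for c in centroids]
        self.counts = list(counts) if counts is not None else [1] * len(centroids)
        self.learning = learning

    def select(self, state):
        """Pick the operator whose centroid is nearest to the current state."""
        return nearest(self.centroids, state)

    def observe(self, operator, state):
        """Fold a new case into the operator's centroid as a running mean."""
        if not self.learning:
            return  # frozen: transferred experience is used but never changed
        self.counts[operator] += 1
        n = self.counts[operator]
        c = self.centroids[operator]
        for d in range(len(c)):
            c[d] += (state[d] - c[d]) / n


# Hypothetical two-dimensional states and centroids, chosen only for illustration.
transferred = [[0.0, 0.0], [4.0, 4.0]]
frozen = OperatorClusters(transferred, learning=False)
active = OperatorClusters(transferred, learning=True)

state = [2.0, 0.0]
op = active.select(state)
frozen.observe(op, state)
active.observe(op, state)
print(op, frozen.centroids[op], active.centroids[op])
```

With learning switched off the centroids never drift away from the transferred experience; with learning on, each new case pulls the selected operator's centroid towards the states in which it was applied.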
The best-performing algorithm among the three is RLABC-TC, which transfers previously gained experience into new runs and keeps learning active. It has been compared with some recently published and very competitive state-of-the-art works, labelled as GA [28], BABC [32], binDE [33], and GPSO [31], and the comparative results are plotted in Figure 7. All benchmark instances of the three sets are shown on the horizontal axis of the graph, while the solution quality is indicated on the vertical axis. Since SUKP is a maximisation problem, higher values are better, and the highest mean value is always delivered by RLABC-TC, which demonstrates the strength of the proposed approach.
Figure 7. Comparative results of the proposed approach, RLABC-TC, and the state-of-the-art methods GA [28], BABC [32], binDE [33], and GPSO [31] on the benchmark instances in Set 1, Set 2, and Set 3.

Conclusions and Future Work
This article described how transfer learning was used in a reinforcement learning-based adaptive operator selection scheme incorporated in an ABC algorithm to tackle SUKP as a combinatorial optimisation problem. The ABC algorithm uses a pool of operators from which the adaptive operator selection scheme identifies the best-fitting operator for the current state of the problem and the search conditions. This helps search the problem space efficiently. The operator selection scheme is developed and fine-tuned with the Q learning algorithm, embedded and empowered with the "Hard-C-Means" clustering algorithm. The knowledge and experience gained through this process are transferred into subsequent runs to be utilised for faster approximation and better-quality solutions. The experimental results demonstrated that transferring experience across runs yields slightly better solution quality but significantly faster convergence. Both scenarios, keeping learning ON and OFF, were tested, and each was observed to have its own advantages and disadvantages. It is clearly observed that learning through a single run helps in solving problem instances in subsequent runs in a much shorter time. This is because the gained experience is used to select more complementary operators one after another, cutting the computational time, while the quality of the solution improves slightly or at least remains the same.
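The run-to-run transfer summarised above can be illustrated with a small tabular Q learning sketch in Python. It is a simplified illustration under stated assumptions, not the scheme used in the paper: states are assumed to be discrete cluster indices, the reward is an arbitrary number, and the class name, the epsilon-greedy rule, and the parameters `alpha`, `gamma`, and `epsilon` are hypothetical choices. Only the operator names come from the text.

```python
import random

OPERATORS = ["binABC", "ibinABC", "DisABC"]  # operator pool named in the text


class QOperatorSelector:
    """Tabular Q learning over (state, operator) pairs.

    Passing `q` from an earlier run transfers its experience; `learning`
    decides whether that experience keeps being updated in the new run.
    """

    def __init__(self, n_states, q=None, alpha=0.1, gamma=0.9,
                 epsilon=0.1, learning=True, seed=0):
        if q is not None:
            self.q = [row[:] for row in q]  # copy the transferred table
        else:
            self.q = [[0.0] * len(OPERATORS) for _ in range(n_states)]
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.learning = learning
        self.rng = random.Random(seed)

    def choose(self, state):
        """Epsilon-greedy choice of an operator index for the given state."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(OPERATORS))
        row = self.q[state]
        return max(range(len(row)), key=row.__getitem__)

    def update(self, state, op, reward, next_state):
        """Standard Q learning update; skipped when learning is OFF."""
        if not self.learning:
            return
        target = reward + self.gamma * max(self.q[next_state])
        self.q[state][op] += self.alpha * (target - self.q[state][op])


# First run: learn from scratch (one hypothetical rewarded step for ibinABC).
first = QOperatorSelector(n_states=2, epsilon=0.0)
first.update(state=0, op=1, reward=1.0, next_state=1)

# Second run: start from the transferred table with learning OFF.
second = QOperatorSelector(n_states=2, q=first.q, epsilon=0.0, learning=False)
second.update(state=0, op=0, reward=5.0, next_state=1)  # ignored
print(OPERATORS[second.choose(0)])
```

Starting the second run from a copy of the first run's table is what lets it pick a useful operator immediately instead of re-exploring, which is the intuition behind the reported reduction in computational time.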
This study has considered the first level of experience and knowledge transfer in solving combinatorial optimisation problems, which means training the agents in one run and utilising their gained experience in subsequent runs. The next two levels, transfer across problem instances and across problem types, remain for future study, which is expected to achieve a significant breakthrough in building generic problem solvers. The proposed transfer learning method can be applied to a wide range of real-world problems, including but not limited to image recognition, speech recognition, and timetabling problems.