Cooperative Multi-Agent Interaction and Evaluation Framework Considering Competitive Networks with Dynamic Topology Changes

In recent years, reinforcement learning problems have become increasingly complex, and the computational demands of learning have grown accordingly. Various methods for effective learning have therefore been proposed. With the help of humans, the learning object can learn more accurately and quickly to maximize the reward. However, the rewards calculated by the system and those obtained via human intervention (which together make up the learning environment) differ and must be used accordingly. In this paper, we propose a framework for learning problems with competitive network topologies, wherein the agents change dynamically with the environment, by computing rewards both via the system and via human evaluation. The proposed method is adaptively updated with the rewards calculated via human evaluation, which makes it more stable and reduces the penalty incurred while learning. It also ensures learning accuracy, including for rewards generated from a complex network topology consisting of multiple agents. The proposed framework contributes to a fast training process through multi-agent cooperation. By implementing these methods as software programs, this study performs numerical analysis to demonstrate the effectiveness of the adaptive evaluation framework applied to the competitive network problem with dynamic environmental topology changes. In the numerical experiments, greater human intervention led to better learning performance with the proposed framework.


Introduction
Reinforcement learning is concerned with the problem of maximizing the rewards of learning objects that need to be effectively controlled within a defined environment. The more complex the human's behavior and system configuration are, the more difficult the problem is, and the longer it takes for the learning object to learn. These problems can be solved through additional tasks such as pre-learning or preprocessing; however, these tasks are not very effective because preprocessing and pre-learning take a long time to complete and corrupt the learning data. Therefore, to effectively solve complex and difficult reinforcement learning problems, methods for solving problems through intuitive and professional human intervention have been proposed. The benefit of learning with the help of humans is that as the learning objects are intended to resemble humans, their learning goals can be clearly defined, and learning can be completed quickly without pre-learning or preprocessing.
However, existing studies on reinforcement learning focusing on learning problems with human help were concerned with the single model problem of a simple environment, static simple network and single agent.
As such, this study focuses on two or more competing network topologies, each consisting of multiple agents (e.g., combat or soccer) that interact in a complex manner. To learn effectively, learners can model human evaluations in the form of quantitative real numbers to successfully achieve their goals and apply adaptive policies to those evaluations. Herein, we propose a framework whereby the test results are precisely and strategically calculated.
The following section provides a review of existing research and literature related to human-machine evaluations and reinforcement learning. Section 3 proposes the cooperative human-machine evaluation framework and its algorithm. The implementation of the proposed framework and the performance analysis of the experimental example are presented in Section 4. Finally, the concluding remarks and future scope of work are presented in Section 5.

Background and Literature Review
This study applies reinforcement learning to the effective integration of human-machine evaluations toward a goal and compares previous studies on reinforcement learning on the basis of several categories. The learning object may be a robot, a system, or a game, and reinforcement learning is being investigated in various fields that require complex learning toward goals. There are a number of control strategies and relevant applications. Roman, et al. [1] proposed an adaptive control using fuzzy components to control a tower crane. Zhang, et al. [2] applied a semi-global state synchronization method to actuator control under unknown nonuniform input delay.
When the learning object is a robot, these robots are generally classified as robot arms, humanoid robots, and industrial robots [3][4][5]. Robots with a high degree of freedom usually require complicated and difficult calculations because they require a goal that can perfectly mimic human behavior [6]. Additionally, robots learn how to behave similarly to humans so that they can also learn to expend a collaborative effort to help with human tasks or purposes [7]. The results from these studies can be applied to large-scale industrial applications that begin with learning and imitating simple movements, achieving goals through collaboration with humans, and applying them to real industries [8,9].
Reinforcement learning is also often used in areas such as games, where the goal of learning is clearly stated. This is because the algorithm of reinforcement learning is rewarded and updated according to the actions taken by the learning object. For example, when the object performs a mission in the game, the reward it obtains depends on its actions, which affects the final result and the target value [10].
Studies have also been conducted to learn systems, besides robots and games. System network and communication are complex and computationally heavy because of the need for a system that can optimize performance and goals. Furthermore, the computational load of networks that optimize paths such as an escape is heavy because it is necessary to derive the learning results by computing problems in real time. Therefore, learning methods that effectively deal with the real-time computation of a problem have been investigated [11][12][13].
Moreover, various methods have been studied regarding the problem of reinforcement learning to deal with values of measuring devices such as sensors. There are methods for learning via effectively estimating, interpolating, and approximating values before preprocessing, such as in the case of sensor values [14,15]. In addition, some studies have designed adaptive algorithms to ensure that learning objects are updated more reliably and effectively as they learn [16][17][18]. Table 1 summarizes the algorithms, contexts, and ideas proposed in previous studies and identifies the learning methods used in each study. These studies classified whether the learning object learned via the designed system itself or with human cooperation and also classified the learning objects into robots (robot arms and humanoids), games, or systems. In relation to the existing studies that have explored mimicking of human behavior, research related to the task of table lifting performed by humans and humanoid robots has been conducted. Humanoid robots move by predicting human movements through prediction-based algorithms, and the reliability of the prediction is arrived at by the motion predictor [19]. A study was conducted on dynamically walking and balancing robots that use reinforcement learning to learn dynamic gait without prior learning. It aimed to solve complex control problems with respect to the robot's motion control by mapping the motion space from discrete to continuous areas. The balanced learning method, which used the movement of the robot arms and legs to move the zero-moment point in the robot sole, can keep the biped robot in a static stable state. It showed that the robot can learn how to improve motion in terms of the walking speed in the proposed way [20].
There have also been studies exploring flipping of handkerchiefs and folding of t-shirts. Emphasis was placed on the learning of the robotic arms, and deep reinforcement learning was investigated to learn complex policies through high-level observations such as typing. Because deep reinforcement learning requires a large number of training samples, a method that improves the sample efficiency and learning stability with fewer samples by combining the characteristics of smooth policy updates with automatic feature extraction of deep neural networks was proposed [21]. Similarly, there has been a study involving learning robots that pick up and classify objects. To be able to interact with dynamic objects in an unstructured environment, robots need manipulation capabilities to handle the confusion, change, and object variability. The robots learn a closed-loop policy that maps depth camera inputs to motion commands and compare different approaches to make the problem easier to deal with, including the reward formation, curriculum learning, and the use of pre-trained policies with reduced work to pre-start tasks. Training the robots with heuristics helped achieve the desired behavior [22]. Collaborative robots are widely used in hybrid assembly tasks involving intelligent manufacturing. Research related to teaching-learning-collaboration models, where collaborative robots can learn through human demonstrations and support human partners in the working environment, has been put forth. This approach allows humans to control the robot using natural language instructions according to their personal work preferences. The robot then learns the assembly demonstrations from human using the maximum entropy inverse reinforcement learning algorithm and updates the task-based knowledge with the optimal assembly strategy. In the collaborative process, the robot can leverage the learned knowledge to actively support people in collaborative assembly work. 
As such, in the case of humanoid robot learning, an object is trained by imitating or pre-learning human behavior. Hence, pre-learning is essential at all stages of simulation or learning [23]. Robotic applications of reinforcement learning often undermine the autonomy of the learning process to achieve practical training time in real physical systems. To overcome this problem, recently developed deep reinforcement learning algorithms based on off-policy training for deep Q-functions can be extended to complex 3D manipulation tasks, and efficiently implement deep neural network policies to train the actual physical robot [24].
In contrast to the case involving interaction between a self-learning agent and the environment, an agent can be trained manually using an evaluative reinforcement framework that updates the rewards with feedback from a human trainer on the current state. Based on an evaluation of the agent's recent performance, the trainer can offer rewards in any form of representation that can be mapped to scalar values. The agent's goal is to act on the current state, using the human feedback to choose the action that will receive the most reward. For this purpose, the agent incorporates the reward function obtained from humans and selects the behavior that is expected to receive the highest reward. By learning a reward model of human feedback, agents can act on their goals even when there is no human feedback and choose a task that is expected to maximize their rewards when human feedback is provided. The agent attempts to maximize the immediate reward, assuming that the human trainer has already considered the impact of each behavior when giving feedback. This problem is consistent with supervised learning. Treating each action as a training sample, for the action selected at time t, the state $s_t$ at the current time t and the state $s_{t+1}$ at the next time t + 1 are considered attributes of the sample, and the human trainer's reward for that action is considered the label [25]. Building on this manual training of agents through an evaluative reinforcement framework, the corrective advice communicated by humans framework is a reinforcement learning method that models new human feedback. The authors used a binary signal in the action domain of the agent. Further, the reward value was updated by appropriately utilizing past human feedback [26].
Existing studies on learning effectively through human evaluation mainly deal with simple, uncomplicated problems of a single network. In a recent study dealing with the situation where two or more network topologies compete and coexist, an effective algorithm was proposed in which human evaluation is adaptively updated [27]. In this paper, we propose an algorithm that strategically updates rewards until a stable situation is reached, that deals with an environment where two or more dynamically changing networks compete with each other, and that adaptively updates the human evaluation.
Human evacuation frameworks used in emergencies have focused on modeling and simulating emergencies. Recently, reinforcement learning has been investigated for real-time shortest path calculation methods that can be used in emergency situations [28]. This method has both advantages and disadvantages in terms of two approaches: agent-based modeling and equation-based modeling. Because agent-based models are slow but accurate, studies are being conducted to produce fast linear models based on reinforcement learning. In addition, the formula-based modeling method is fast, but the range of erroneous measurements is rather large [29].
Deep reinforcement learning methods have been studied for mining and processing large amounts of data in a dynamically changing environment [30]. Additionally, studies dealing with the allocation of large amounts of resources and devices have used deep neural networks to learn the environment and make allocation decisions according to network conditions, such as service latency and requirements [31].
Previous studies related to effective object learning methods for tasks that require human expertise remain useful to overcome the limitations of preprocessing learning, which requires large amounts of computation and numerous samples. Reinforcement learning has evolved in recent years in terms of its applicability to complex sequential decision-making tasks which are generally modeled using the Markov decision process (MDP). Reinforcement learning is a methodology for deciding the next action after inferring the reward based on the achievement of the goal, which is in turn based on the agent's current state and the interactions between the systems exhibiting the current state [32].
The typical learning method among traditional reinforcement learning algorithms is the Q-learning method which infers the maximized reward value by calculating the Q-function, which is a behavioral value function as shown in Equation (1), at each time period [33].
In (1), $s_t$ is the state at time t, $a_t$ is the behavior at time t, $r_{t+1}$ is the reward at time t + 1, and α is the learning rate in the range (0,1). The closer α is to 1, the more weight learning places on the current situation; the share of rewards for predicted future behavior is adjusted through a discount rate, γ [34].
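As an illustration of the update that Equation (1) describes, the following minimal Python sketch applies one step of the standard tabular Q-learning rule, $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)]$. The equation itself is not reproduced in this text, so the standard textbook form is assumed here; the state labels, the action names, and the function name `q_learning_update` are hypothetical.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning step and return the new Q(state, action)."""
    # Greedy estimate of the value of the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Temporal-difference target and error.
    td_target = reward + gamma * best_next
    td_error = td_target - q[(state, action)]
    # Move the current estimate toward the target by the learning rate.
    q[(state, action)] += alpha * td_error
    return q[(state, action)]


q_table = defaultdict(float)   # unseen (state, action) pairs start at 0
actions = ["pass", "hold"]     # hypothetical action set
new_value = q_learning_update(q_table, "s0", "pass", 1.0, "s1", actions)
print(new_value)
```

With all values initialized to zero, a reward of 1 and α = 0.1, the first update moves Q("s0", "pass") one tenth of the way toward the target.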

Methodology of Cooperative Human-Robot Evaluation
This study is concerned with the artificial intelligence soccer game shown in Figure 1. This is a special case that differs from those previously investigated. In this paper, to deal with an environment in which two or more dynamically changing multi-agent networks compete and coexist with each other, we designed an artificial intelligence soccer game similar to the artificial intelligence basketball game covered in [27].
Appl. Sci. 2020, 10, x FOR PEER REVIEW 5 of 16
The learning goal of the soccer game in this study is to pass the ball by avoiding the red team soccer players who are obstructed by the blue team soccer players. In real-life football games, players move dynamically on the field and try to pass the ball successfully to a player from the same team.
As such, the players in a football game, described as multiple agents, form a network topology as shown in Figure 1, and the two network topologies represent the two teams competing against each other. Each team is given a network topology at time t. When this situation and environment are expressed as a reinforcement learning problem, one state serves as the reference, and another state follows it and competes in a similar manner.
To deal with the reinforcement learning problem in this special case, this study defines the Q-function as Equation (2) and calculates the maximum human evaluation reward value at every time period, t.
State $s^2_t$, which coexists with state $s^1_t$, is not an independent state; it is affected by the same action in the same environment. Therefore, it can be defined as Equation (3).
Traditional reinforcement learning problems often involved a single network topology with the learning objects. However, this study covers two or more network topologies, such as a dynamically changing competitive network. In a single network topology, the reward policy could be learned through a system reward update calculated from the learning object. Conversely, in this study, to establish a reward policy for a complex network in which two networks are learned in the same environment, the reward obtained through human evaluation and the system reward calculated from the learning object are combined. This involves a procedure that updates rewards through human intervention. Whereas existing studies [25,26] mapped human evaluation through a binary process, this study provides evaluation feedback in a more quantitative and accurate form: the human rating is entered as a real value between 0 and 1, and the reward policy is updated accordingly.
The human evaluation reward, $h_t$, obtained in the process of learning a number of times, can be modelled by a Gaussian distribution as shown in Equation (4), where µ is the evaluation mean and σ is the standard deviation of the distribution. In general, the mean of the Gaussian distribution is estimated as $\hat{h}_t$ using Equation (5), where n is the sample size, that is, the number of quantitative rewards from human evaluation learned up to time t. The standard deviation of the Gaussian distribution is estimated as $\hat{\sigma}$ using Equation (6). In repeated learning, when a correction value evaluated by a human is normalized through the distribution described above, $h_t$ is calculated and $\hat{h}_t$ is estimated. Subsequently, the reward value obtained through the estimated human evaluation via repeated learning is adaptively updated, as shown in Equations (7) and (8).
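The estimation step can be sketched as follows. This is a minimal illustration that assumes the usual sample mean and a standard deviation that divides by n; Equations (5) and (6) are not reproduced in this text, so the exact estimators used by the authors (for example, n versus n - 1) are an assumption, as are the function name and the example ratings.

```python
import math


def estimate_human_evaluation(ratings):
    """Return (mean, std) of the human ratings collected so far."""
    n = len(ratings)
    if n == 0:
        raise ValueError("at least one rating is required")
    mean = sum(ratings) / n
    # Population form (divide by n); the paper's Equation (6) may use n - 1.
    variance = sum((h - mean) ** 2 for h in ratings) / n
    return mean, math.sqrt(variance)


ratings = [0.8, 0.6, 0.7, 0.9]   # hypothetical human ratings in [0, 1]
mu_hat, sigma_hat = estimate_human_evaluation(ratings)
print(mu_hat, sigma_hat)
```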
Here, Equation (7) can be used to minimize the loss function $F(\hat{h}_{t+1})$, which defines the difference between the estimate $\hat{h}_{t+1}$ at time t + 1 and the estimate $\hat{h}_t$ at time t. When $\hat{h}_{t+1}$ is updated, the step size η should be set differently for each estimation iteration; hence, η is increased when the variation in the estimated human evaluation reward value is small and reduced when the variation is large. $G_t$ stores the sum of squares of the gradient values of $\hat{h}_t$ updated through estimation up to time t. When $\hat{h}_t$ is updated, the existing step size η is scaled by a factor inversely proportional to the square root of $G_t$. The reason is that if the estimated human evaluation value $\hat{h}_t$ has changed significantly, it will move less; conversely, if it has changed less, it will shift more. Here, ε is a small value between $10^{-4}$ and $10^{-8}$ that prevents division by zero. As per Equation (8), the reward value, $\hat{h}_t$, adaptively updated by the human evaluation is compared with the human evaluation reward, $h_t$, obtained from learning in the present iteration. As shown in Equation (9), $h_t$ is updated again by adopting the greater of the two values.
$$h_t = \max(\hat{h}_t, h_t) \quad (9)$$
In Equation (9), the correction value $h_t$ updated adaptively by human evaluation must then be combined with the system reward value derived from the update of the learning object to determine the final reward value, $h^*_t$.
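The adaptive step described above resembles an Adagrad-style update, and a minimal sketch under that reading is given below. Equations (7) and (8) are not reproduced in this text, so the quadratic loss, the placement of ε inside the square root, the default step size, and the function name `adaptive_update` are all assumptions rather than the authors' exact formulation.

```python
import math


def adaptive_update(h_est, h_obs, grad_sq_sum, eta=0.1, eps=1e-8):
    """One Adagrad-style step on the estimate, followed by the max rule."""
    # Assumed loss F = 0.5 * (h_est - h_obs)**2, so its gradient is the gap.
    grad = h_est - h_obs
    # Running sum of squared gradients (the role played by G_t in the text).
    grad_sq_sum += grad ** 2
    # Larger accumulated variation gives a smaller effective step.
    h_est = h_est - eta / math.sqrt(grad_sq_sum + eps) * grad
    # Keep the greater of the updated estimate and the current rating.
    h_used = max(h_est, h_obs)
    return h_est, h_used, grad_sq_sum


h_est, g = 0.5, 0.0
for h_obs in [0.9, 0.8, 0.85]:   # hypothetical ratings in [0, 1]
    h_est, h_used, g = adaptive_update(h_est, h_obs, g)
    print(round(h_est, 4), h_used)
```

In this example the estimate climbs toward the ratings in progressively smaller steps, while the max rule returns the rating itself whenever the estimate is still below it.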
In Equation (10), $h^*_t$ is calculated from the reward, $r_t$, of the system derived from the update of the learning object and the correction value, $h_t$, adaptively updated via human evaluation, weighted by the human intervention policy rate, δ. Figure 2 shows the framework that adaptively applies the quantitative human evaluation proposed in this study and combines it with the system reward value obtained through the update of the learning object.
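One simple way to combine the two rewards with a rate δ is a convex blend, sketched below. Equation (10) is not reproduced in this text, so the form $(1 - \delta) r_t + \delta h_t$ is an assumption chosen for illustration, not necessarily the authors' formula; the function name and the example values are hypothetical.

```python
def combined_reward(system_reward, human_reward, delta):
    """Blend system and human rewards with intervention rate delta in [0, 1]."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    # Assumed convex combination: delta = 0 ignores the human, 1 ignores the system.
    return (1.0 - delta) * system_reward + delta * human_reward


for delta in (0.0, 0.5, 1.0):
    print(delta, combined_reward(0.2, 0.9, delta))
```

Under this assumed form, δ = 0 reduces to learning from the system reward alone, and larger δ gives the human evaluation more weight.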
Table 2 shows the complete algorithm involved in the proposed cooperative human-machine evaluation framework. First, the human intervention rate, δ, the discount rate, γ, and the learning rate, α, are defined (lines 1-3). The agent observes the new state (line 6). Then, the exact quantitative evaluation h is received from the human (line 10), and the adaptive human evaluation reward is updated (lines 11-15). Finally, when the reward for the Q-function is updated and determined, the Q-function is calculated (lines 17 and 18). This process is repeated for a given time, T, to proceed with the learning process (line 4).
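The overall loop can be sketched in Python as follows. Table 2 is not reproduced in this text, so the sketch follows only the prose description of its steps. The toy states and actions, the simulated human rating, the epsilon-greedy exploration rate, the Adagrad-style estimate update, and the convex reward blend are all assumptions made for illustration; a real run would read the rating from the human evaluator.

```python
import math
import random
from collections import defaultdict


def train(T=200, delta=0.5, gamma=0.9, alpha=0.1, eta=0.1, eps=1e-8, seed=0):
    """Run T iterations of a toy human-in-the-loop Q-learning loop."""
    rng = random.Random(seed)
    states = [0, 1, 2]                       # toy placeholder states
    actions = ["pass_left", "pass_right"]    # toy placeholder actions
    q = defaultdict(float)
    h_est, grad_sq_sum = 0.5, 0.0
    state = rng.choice(states)
    for _ in range(T):
        # Epsilon-greedy action selection (exploration rate is an assumption).
        if rng.random() < 0.1:
            action = rng.choice(actions)
        else:
            action = max(actions, key=lambda a: q[(state, a)])
        # Stand-in for the environment: random next state, fixed system reward.
        next_state = rng.choice(states)
        system_reward = 1.0 if action == "pass_left" else 0.0
        # Simulated human rating in [0, 1].
        h_obs = min(1.0, max(0.0, 0.8 * system_reward + rng.uniform(0.0, 0.2)))
        # Adaptive update of the estimated human evaluation, then the max rule.
        grad = h_est - h_obs
        grad_sq_sum += grad ** 2
        h_est -= eta / math.sqrt(grad_sq_sum + eps) * grad
        h_used = max(h_est, h_obs)
        # Blend the rewards and apply the Q-learning update.
        reward = (1.0 - delta) * system_reward + delta * h_used
        best_next = max(q[(next_state, a)] for a in actions)
        q[(state, action)] += alpha * (reward + gamma * best_next
                                       - q[(state, action)])
        state = next_state
    return q


q = train()
for key in sorted(q):
    print(key, round(q[key], 3))
```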

System Implementation and Experimental Results
This section explains in detail the cooperative human-machine evaluation framework introduced in Section 3. The implementation of the proposed framework and the numerical analyses using the proposed software program are presented in this section. The software is developed using MATLAB© (made by MathWorks, Natick, MA, USA) and the C++ language. The implemented software program includes several panels and two graph windows.
The panels that make up the program deal with the network topology as shown in Figure 3. They form a network in two dimensions as shown in Figure 3a,b and in three dimensions as shown in Figure 3c and are expressed in a distribution to maximize visibility.
The graph window and the windows that show the calculated values are shown in Figure 4. The windows have several functions. In a soccer game depicted as dynamically changing competitive network topologies, as shown in Figure 4a, the players' statuses change and the system's reward is calculated and shown. At the same time, the user monitors the two competitive network topologies depicted as a football game and inputs feedback on the behavior of the current state as a real number between 0 and 1.
The evaluation feedback from the human is calculated by estimation and adaptive reward evaluation calculation, as shown in Table 2.
Afterward, the reward from the system in Figure 4a and the human evaluation in Figure 4b are combined according to the given level of human intervention, and the reward value for the behavior of the learning object is updated. Figure 5 shows a program that implements the proposed cooperative human-machine evaluation framework for the reinforcement learning of competitive network topologies. The panels and windows that make up the program are described in detail in Table 3. For instance, in a soccer game, two network topologies, that of one team and that of its opponent, coexist in the same environment.
In this study, a soccer game was assumed, and an experiment was carried out through the example of learning to deliver the ball to allies while avoiding the enemy network that obstructed the flow of the ball.
The players of both teams do not stay in a fixed position but are distributed over a defined range and move dynamically, thereby changing the network topology. The human evaluator visually confirms the degree to which the enemy obstructs the path along which the ball is to be delivered and enters a real number between 0 and 1 accordingly. When a game agent passes the ball without any interference, an expert (human) rates the pass close to 1 using the developed system. The human's evaluations are subjective and influenced by the locations of the opponents and other factors. The evaluation score modifies the reward values using Equations (9) and (10). Then, the game agent changes the passing action to obtain a higher reward using the algorithm shown in Table 2. In the iterative learning, the cooperative human-machine evaluation framework calculates and updates the reward from the system's own reward value and the human evaluation reward value. Figure 6 shows the results of learning a soccer game represented by two coexisting networks using the proposed cooperative human-machine evaluation framework. Two algorithms were prepared for comparison with the proposed cooperative human-machine evaluation framework.
Table 3. Implemented Cooperative Human-Machine Evaluation Software Program.

| Type | Function | Detailed Function | Configurations |
| Panel | Application of competitive network (e.g., soccer game) | Define an iteration | By defining the defense range and attack range, specify the range of the following actions |
| Panel | Network topology | Representation of a coexisting network topology with only the current statuses of both players' groups; monitoring of both players' (e.g., Blue and Red team) network topologies | Representation of a coexisting network topology with the current statuses and the following future statuses; representation of probability distributions depicting the following statuses |
| Window | Adaptive human evaluation | Input window for human reward evaluation; calculation of adaptive human evaluation | The network topology is learned and evaluated by humans in each iteration; it is updated by the adaptive human evaluation strategy between the estimated values and the current evaluated values in the iteration |
| Window | Reward calculation and updates | Analysis of integrated reward; measurement of system performance | The rewards obtained for each learning iteration are calculated; the average value is updated, and system performance is visualized |
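
The reward window described in Table 3 tracks a reward per learning iteration together with its average. The following minimal Python sketch illustrates one way such an average could be maintained. It assumes that the average is the cumulative mean over all iterations so far, which the text does not specify; the function name `running_average` and the example values are hypothetical and are not taken from the implemented program.

```python
def running_average(rewards):
    """Return the cumulative mean of the rewards after each iteration.

    Assumption: the displayed average is the mean of all rewards observed
    so far; the original program may use a different averaging rule.
    """
    averages = []
    total = 0.0
    for count, reward in enumerate(rewards, start=1):
        total += reward                  # accumulate the reward of this iteration
        averages.append(total / count)   # mean over iterations 1..count
    return averages


# Illustrative per-iteration rewards (made-up values, not results of this study).
example_rewards = [0.2, 0.4, 0.9, 0.5]
example_averages = running_average(example_rewards)
```

Plotting `example_rewards` and `example_averages` against the iteration index would give a per-iteration curve and an average curve for a single method.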
In Figure 6a, the curve marked with the symbol "o" shows the reward of the cooperative human-machine evaluation framework in each iteration, as calculated by the algorithm in Table 2, and the curve marked with the symbol "*" shows the average of that reward. Figure 6b shows the algorithm in which the human evaluation is binary (0 or 1): the curve marked "o" shows the reward of the binary evaluation in each iteration, and the curve marked "*" shows the average reward of the binary evaluation strategy. Figure 6c shows the traditional MDP method without human intervention: the curve marked "o" shows the reward of the MDP method in each iteration, and the curve marked "*" shows its average reward. As seen in Figure 6, the proposed cooperative human-machine evaluation achieved the fastest convergence.
This result is attributed to human intervention, which brings quick decisions based on specialized knowledge, in contrast to updating only the system reward of the existing learning objects. This study proposed a strategic method that adaptively and quickly updates the human evaluation scale so that learning converges quickly to the maximum value. Table 4 shows how the learning performance of the proposed cooperative human-machine evaluation framework differs according to the degree of human intervention. The results are summarized according to the human intervention rate δ. In the problem addressed in this experiment, two networks, each with 10 nodes, coexist in the learning environment simultaneously. The problem is solved in three cases by applying the proposed framework. In the first case, the problem was solved only with the system reward of the network topology inside the environment, without any human intervention. As the training iterations were repeated, the reward value increased and the system learned to converge to the maximum; however, the performance was poor and many iterations were required. In the second case, in which the reward update policy combined the system's internal reward with the reward calculated through the proposed cooperative human-machine evaluation framework at a 50% rate, learning tended to converge quickly to the maximum reward value. In the last case, where a 90% rate was applied, the highest Q-function value was obtained for the same number of learning iterations while converging to the target maximum reward value. This comparison is shown in Figure 7.

Table 4. Comparisons of three different experiment scenarios using the proposed framework.
These results may vary depending on the assumed conditions and parameters. However, the proposed cooperative human-machine evaluation framework is effective for complex network topologies in the context of human evaluation intervention in reinforcement learning.
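
The three cases above differ only in the human intervention rate δ (0%, 50%, and 90%). As a rough illustration of how such a rate could weight the two reward sources, the following Python sketch blends a system reward and a human evaluation score with a convex combination. This blending rule is an assumption made for illustration only; it is not a reproduction of Equations (9) and (10), and the names `blended_reward`, `system_reward`, and `human_score` as well as the numeric values are hypothetical.

```python
def blended_reward(system_reward, human_score, delta):
    """Combine a system reward with a human evaluation score.

    Assumption: a convex combination weighted by the intervention rate
    delta in [0, 1]; the paper's actual update rule may differ.
    """
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    if not 0.0 <= human_score <= 1.0:
        raise ValueError("human_score must lie in [0, 1]")
    # delta = 0 uses only the system reward; delta = 1 uses only the human score.
    return (1.0 - delta) * system_reward + delta * human_score


# Illustrative inputs (made-up values, not measurements from the experiments).
system_reward = 0.4
human_score = 1.0
rewards_by_rate = {
    delta: blended_reward(system_reward, human_score, delta)
    for delta in (0.0, 0.5, 0.9)  # the three intervention rates discussed above
}
```

Under this assumed rule, δ = 0 reduces to the system-only case, and larger values of δ move the reward toward the human evaluation score.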
In addition, a quantitative empirical study was conducted from two perspectives to show that the method proposed in this study is effective. As shown in Figure 8, three learning methods were compared in terms of the pass success rate achieved in the Artificial Intelligence (AI) soccer game within the same time and the time at which the pass success rate reached more than 95%. When the passing of players in the AI soccer game was learned with the proposed method, which strategically updates the reward as a real number through human evaluation, the pass success rate reached up to 96.6%. Within the same time, the binary reward method achieved a pass success rate of 86.6%, and the Markov Decision Process method achieved 62.6%. With the proposed method, the time required for the pass success rate to reach 95% was t = 29. In contrast, the binary reward method took t = 73, and the Markov Decision Process method took t = 93 to reach a 95% pass success rate.
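
The second perspective, the time needed to reach a 95% pass success rate, can be made concrete with a short Python sketch. The helper `first_time_at_or_above` is a hypothetical illustration of how such a time could be read from a success-rate curve (it assumes time steps are indexed from 0 and that reaching the threshold means being at or above it). The dictionary holds only the times reported above, from which the relative speed-up of the proposed method follows by simple division.

```python
def first_time_at_or_above(success_rates, threshold):
    """Return the first time index whose success rate is >= threshold.

    Returns None if the threshold is never reached. Indexing from 0 is an
    assumption; the paper does not state how t is counted.
    """
    for t, rate in enumerate(success_rates):
        if rate >= threshold:
            return t
    return None


# Times to reach a 95% pass success rate, as reported in the text.
reported_times = {
    "proposed": 29,  # real-valued human evaluation reward
    "binary": 73,    # binary human evaluation reward
    "mdp": 93,       # Markov Decision Process without human intervention
}

# Ratio of each method's time to the proposed method's time.
baseline = reported_times["proposed"]
time_ratios = {name: t / baseline for name, t in reported_times.items()}
```

With the reported values, the binary reward method needs about 2.5 times as long as the proposed method (73/29), and the Markov Decision Process method needs about 3.2 times as long (93/29).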

Conclusions
As reinforcement learning is applied to more goal-oriented systems, the difficulty of the problems and the complexity of the calculations increase, and various methods have been proposed to address them. In the search for algorithms that solve these problems with good performance, learning methods have been studied that add accurate and quantitative evaluation through the intervention of humans with expert knowledge and experience in the reinforcement learning process. This trend has led to the need for pre-processing or pre-learning to make learning objects resemble human behavior or appearance in reinforcement learning. Existing studies in which humans intervene during learning itself are relatively scarce; most have instead relied on prior learning.
This study proposed a new method of updating both the system rewards obtained by learning objects during learning and the rewards derived from the evaluation of humans with expertise in the problem. The framework also includes an adaptive strategy for updating rewards that yields stable and effective human evaluation. In addition, whereas learning objects have typically been treated as simple and independent, this study deals with the problem of complex network topologies. This type of problem calls for the cooperative human-machine evaluation proposed in this study. The proposed framework was implemented as a software program that supports cooperative human-machine evaluation, and its effectiveness was demonstrated by comparing the results in various scenarios.
The proposed multi-agent framework can be applied to cooperative tasks between humans and machines, such as human-robot interaction, autonomous driving, and artificial intelligence-based industrial tasks. In particular, it contributes to a fast learning process using multi-agent evaluations. However, the framework is limited by the fact that the human evaluation interface and the timing of evaluations are crucial for faster training. In addition, different industrial tasks require different human evaluation methodologies. For this reason, the framework and its implementation have to be adapted to the objectives of each application.
Further studies can consider how to determine more effectively the degree of human intervention required when combining rewards from human evaluation with rewards from system learning. In addition, the framework can be applied to several real-life applications and scenarios. In problems similar to the dynamically changing competitive network topology discussed in this study, learning with human intervention can proceed more quickly and efficiently.