Adaptive Human–Machine Evaluation Framework Using Stochastic Gradient Descent-Based Reinforcement Learning for Dynamic Competing Network

Complex problems require considerable work, extensive computation, and the development of effective solution methods. Recently, physical hardware- and software-based technologies have been utilized to support problem solving with computers. However, problem solving often involves human expertise and guidance. In these cases, accurate human evaluations and diagnoses must be communicated to the system, ideally as a series of real numbers. In previous studies, only binary numbers have been used for this purpose. To address this limitation, this paper proposes a new method of learning complex network topologies that coexist and compete in the same environment and interfere with one another's learning objectives. Considering the special problem of reinforcement learning in an environment in which multiple network topologies coexist, we propose a policy that computes and updates the rewards derived from quantitative human evaluation and combines them with the rewards of the system. The rewards derived from the quantitative human evaluation are designed to be updated quickly and easily in an adaptive manner. The new framework was applied to a basketball game for validation and demonstrated greater effectiveness than the existing methods.


Introduction
Artificial intelligence (AI) technologies are developing with a focus on designing systems for efficient learning, effective solution of complex problems, and rapid large-scale computation. In reinforcement learning (RL), a learning object acts in a defined system environment and learns from the rewards given for the resulting state changes [1]. Problem-solving approaches that involve RL require advanced methods due to the system complexity, as well as additional steps such as pre-learning or preprocessing. Therefore, to solve complex and difficult RL problems effectively, a strategic policy is used to update the system reward by obtaining feedback through human intervention [2,3]. Humans with expert knowledge of the problem to be solved can respond intuitively, accurately diagnose the system state, and quickly determine the required action. The learning object can therefore be defined clearly by exploiting its similarity to a human being. However, previous studies on learning from feedback through human intervention designed these evaluations as binary numbers, which reduces the accuracy of the human evaluation.
In this study, to address these shortcomings, algorithms were designed to ensure accurate and clear learning through quantitative evaluation in the form of real numbers. In addition, the stochastic gradient descent (SGD) algorithm was used to overcome the disadvantage of slow learning when human intervention is involved in RL. We designed an algorithm that learns faster by adaptively updating the real-valued reward derived from human evaluation. The adaptively updated reward is then combined with the system reward, which is updated as the learning object acts in the system environment, to calculate the final reward used for learning.
Further, a basketball game was designed as an example in which multiple network topologies coexist in a complex form. A basketball game is a complex problem in which the network changes in real time and the objective is correct passing of the ball among players on the same team to score points.
The remainder of this paper is organized as follows. Section 2 reviews the existing research and literature on the application of RL in various fields, focusing on RL studies for effective control of robots and machines that simulate humans. Section 3 describes in detail the adaptive update strategy framework of the human evaluation reward with the SGD algorithm proposed in this paper. Section 4 discusses the implementation and experimental examples of the proposed algorithm and framework and compares this approach with the methods used in previous studies. Finally, Section 5 summarizes the conclusions and directions for future research.

Background and Literature Review
RL generally involves an advanced model of the Markov decision process (MDP). In terms of sequential decision making, it is based on the interaction between the current state and the system. In this method, the reward is calculated, and then the action needed to achieve the learning objective is determined [1].
The most common learning method in RL is Q-learning, which calculates and updates the Q-function, the behavior value function of the learning object, at every time t. The algorithm is designed to calculate the Q-function by maximizing the reward value [4]:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \quad (1)$$
In (1), $s_t$ is the state at time t; $a_t$ is the behavior of the learning object at time t; $r_{t+1}$ is the reward value at time t + 1; and α is the learning rate. The closer α is to 1, the greater the weight given to the state and behavior observed at the current time t. The discount rate γ is used to adjust the weight given to rewards for future behavior [5].
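The update described above can be illustrated with a minimal tabular sketch. It assumes a dictionary-backed Q-table and the standard Q-learning rule; the names `q_learning_update`, `q_table`, and the action labels are hypothetical and are not taken from the paper's software.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Greedy estimate of the best value reachable from the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    # Move the current estimate toward the target by the learning rate alpha.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical states and actions, used only for illustration.
q_table = defaultdict(float)
actions = ["pass_left", "pass_right"]
new_value = q_learning_update(q_table, "s0", "pass_left", 1.0, "s1", actions)
```

With an all-zero table, a single update moves the entry toward the observed reward by a fraction α, which is the role the learning rate plays in (1).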
RL is used in a wide range of fields to solve important issues. This technique is applied according to the situation and environment, and methods of solving the corresponding problems are designed. In robot control, RL is an interesting topic and the most commonly used AI method. Research on various robotics topics, such as the use of robots in intelligence, soft robotics, and robot automation through navigation and autonomous control, has been conducted. Table 1 summarizes recent studies focusing on the relationship between humans and robots in relation to the learning of objects that are considered human interventions. It also lists the applications of these studies, keywords, and design methods of RL [6][7][8][9][10][11].
Table 1 (surviving entries): 1. Peg-in-hole task; 2. Slide in the groove assembly task; 3. Bolt-screwing task; Learning from demonstration.
For example, RL has been effectively applied to enable individuals suffering from limb paralysis to drink liquids directly with the help of robotic manipulator arms [6]. In that study, five algorithms were applied and compared. Learning was performed effectively with a software emulator program, and the developed solution gave the user the ability to manipulate the cup. An assistant robot was effectively designed with the focus of supplying liquid from the cup using feedback through sensors that provided direct interaction between the human and robot.
Subsequently, RL was applied for stable, dynamic walking of biped robots [7]. This study was performed in the absence of prior knowledge or information on dynamic models, and the robot operation was controlled by mapping the motion space from the discrete to continuous domain. The research objective was to solve complex control problems. Among the components constituting the robot legs, a zero-moment point was selected from the sole and mapped to the movement of the limbs to learn balance. This study proved that a robot can learn how to improve its motion in terms of walking speed. Further, the proposed algorithm was implemented in a physical robot to prove its validity and effectiveness.
Another article suggested a framework that includes a learning phase that mimics human behavior and an RL phase that learns robot behavior [8]. The two-stage learning framework, which combines imitation and RL, shows how to work with people to lift tables quickly and successfully. The first stage is for learning the existence and location of the object called a table, and the second stage is the learning stage for performing operations and tasks. The robot operation is controlled by combining two types of controllers. This research demonstrated the successful construction of a collaborative robot designed to predict human movements and take proactive actions.
In another study, RL was applied to pancake flipping, energy minimization of bipedal walking robots, and archery-based aiming robots [9]. The authors argued that the ultimate goal of RL is to provide robots with the abilities to learn, improve, adapt, and play in tasks with dynamically changing constraints based on navigation and self-learning. It was suggested that RL is appropriate for highly dynamic tasks with clear scales and argued that imitation learning should be easy to demonstrate, use clear practices, and be effective for slow work. The regression-based learning algorithm was effective when the goal was small.
Further, the design of collaborative robots for assistance in assembly operations in manufacturing was investigated [10]. Collaborative robots are used in intelligent manufacturing-related environments and are developed to learn from human demonstrations and support human partners in collaborative environments. According to the personal preferences of humans, natural language instructions can be used to teach robots. The robot learns from human demonstrations using the maximum entropy inverse RL algorithm, and the task-based learning is updated using the optimal assembly strategy. These studies have shown that RL can be effectively used in the design of human-robot collaboration.
Regarding detailed investigation of how to imitate human behavior, studies have been conducted on methods of demonstrating an example motion for a robot in assembly work and extracting a manipulation function for robot learning and motion imitation [11]. In one method, the robot can directly learn how to control its movement. In a second method, when designing a robotic arm, a motion sensor can be attached to a human arm to enable human behavior to be mimicked. Finally, remote operation and control boxes can be utilized to provide hints to a robot. Each method was used to establish a strategic method by direct or indirect human intervention for robot learning.
In addition to robot control, RL is applied and used effectively in various fields. For convenience in daily life, RL has been applied to drone delivery, home energy system optimization, autonomous driving, and automatic parking systems [12][13][14][15]. In Internet of Things devices and networks, RL is mainly used to control traffic and congestion in complex situations. To reduce the collisions between the system and client effectively, the access method is designed using rule-based algorithms and RL. RL is also utilized as a means of selecting the appropriate channel [16][17][18][19]. RL is applied to the problem of choosing a route to escape to a destination by avoiding obstacles.
Several existing studies have applied human-centered RL technologies to various applications. Kim and Lee [20] and Lee [21] applied RL techniques to several evacuation frameworks. In dynamic situations such as the sudden appearance of obstacles or removal of exits, these frameworks generated evacuation routes that consider human interactions and congestion. Another application of human-centered technologies and human-artificial hybrid intelligence is bio-signal processing between a human and a system with AI modules. Kim et al. [22] analyzed human-system interactions using a stimulus-producing electroencephalogram (EEG). In that study, EEG signals were obtained in real time and used to evaluate human satisfaction with the interface provided by a system with AI modules. Moreover, this area of research is directly related to drone control problems, where RL has been applied to design drones with obstacle avoidance. The data obtained from the sensor module mounted on the drone are used to configure the environment and state of the RL model, and the drone is controlled by designing an algorithm to maximize the reward value obtained from operation [12,23]. RL is also used to design energy management systems to determine the balance between agents and optimal scheduling strategies. The RL algorithm is designed to achieve an optimal equilibrium of agent rewards for balanced energy distribution and scheduling [13,24]. Studies in which RL has been applied to large-scale social infrastructures such as ships and aircraft have mainly dealt with ship route planning, aircraft radar design, and aircraft detection systems. The route planning problem is often addressed with an RL algorithm that yields the maximum reward value for an unmanned ship.
In aircraft detection systems and radar designs, RL is applied to optimal radar system design and aircraft image analysis to detect radio waves and minimize unnecessary interference [25][26][27].

Adaptive Human Evaluation Strategy Framework Using the SGD-Based Reinforcement Learning
The present study addresses the basketball game problem, which involves competition between two teams, as shown in Figure 1. A basketball game was chosen because it represents a network topology in which two independent states coexist in the same environment. In a competition between two teams, each team interferes with the goal of the other because the two network topologies coexist, which makes the game well suited to expressing competition. The proposed framework considers a dynamic competing network in which two human groups compete with each other. Because such an environment is volatile, human evaluations as well as machine learning techniques are essential. The framework was proposed and tested for this purpose, and a basketball game is used to illustrate its effectiveness. A basketball game is very fast, and the learning environment is highly complex; therefore, appropriate RL techniques should be applied. In this study, the learning goal of the basketball game was to pass the ball to a member of the blue team while the players on the red team tried to interrupt the pass. The network topologies of the players on the red and blue teams coexisted in the same environment. Unlike traditional RL problems, this problem involves multiple complex network topologies rather than a single topology.
Existing RL network problems consist of learning objectives involving a single network topology. However, a different method was necessary in this study, as it includes multiple network topologies that form a competing network topology in a dynamically changing state. When RL is applied in a single network topology, the reward policy can be learned through a system reward update computed from the learning object. This method is very simple, and the rewards that occur in a single network can be calculated and updated through actions in a given environment and current state.
In this paper, we propose a method of establishing reward policies for complex networks in which two network topologies learn in the same environment. This method updates the reward policy by applying the rewards obtained through human evaluation as well as the system rewards calculated from the learning target.
First, to address the RL problem, which consists of two complex network topologies, the Q-function is defined as (2), and the maximum human evaluation reward value is calculated, taking into account all time periods t: Q t+1 s 1
However, the interference between the networks is affected by operation $a_t$. Therefore, it can be defined as (3).
The important point here is that, unlike when the RL algorithm is applied to a single network topology, as in the existing research, the reward acquisition process is performed through human evaluation. At every time t, the state changes so that the learning object can learn with effective rewards, human intervention occurs, and the reward policy is evaluated accordingly. In previous studies [2,3], the learning object has been taught using a binary human evaluation method to solve complex problems through human intervention and evaluation.
However, in this study, we designed a human evaluation algorithm on the premise that, for complex network topology problems, human evaluation should not be a simple binary process but should provide quantitative, real-number feedback.
The human evaluation reward $her_t$ obtained during learning at each time t can be modeled as a Gaussian distribution, as shown in (4), where µ is the mean value of the evaluation and σ is the standard deviation. In general, the mean of the Gaussian distribution can be estimated from the observed values of $her_t$ using (5), where n is the sample size, that is, the number of quantitative rewards from human evaluation learned over all times t. The standard deviation of the Gaussian distribution can be estimated using (6).
In this study, human evaluation was performed by repeated learning, and the SGD algorithm was used to update $her_t$ adaptively. The complex network topologies covered in this study are computationally expensive due to the large number of human evaluations in the learning process. The general form of the SGD algorithm used in this study is shown in (7). The human evaluation value is estimated through repeated learning, and the difference between $her_t$ updated at the present time t and $her_{t+1}$ at the next environmental time point t + 1 is defined as the loss function $F(her_t)$. The gradient of this function is used to minimize it: at each time t, the value is moved by a certain amount in the opposite direction of the gradient to find the value of $her_t$ that minimizes $F(her_t)$. This update is defined by (7).
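For reference, Eqs. (4)-(7) can be written in their textbook forms. The following is a sketch under assumptions and not necessarily the authors' exact notation: the symbols $\widehat{her}_t$, $her_i$, and $\hat{\sigma}$ are introduced here for illustration, and the $n-1$ divisor in (6) is an assumption, since the text does not say which estimator was used.

```latex
% Assumed standard forms; \widehat{her}_t, her_i, and \hat{\sigma} are illustrative notation.
\begin{align}
her_t &\sim \mathcal{N}(\mu, \sigma^2) \tag{4} \\
\widehat{her}_t &= \frac{1}{n} \sum_{i=1}^{n} her_i \tag{5} \\
\hat{\sigma} &= \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( her_i - \widehat{her}_t \right)^2} \tag{6} \\
her_{t+1} &= her_t - \eta \, \nabla F(her_t) \tag{7}
\end{align}
```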
η is a predetermined step size. In general, the use of all of the data to calculate $F(her_t)$ is called batch gradient descent. However, this calculation requires excessive computation because $F(her_t)$ must be calculated for all of the data in one step. In this study, the computational complexity was higher than that in general problems because the two competing network topologies make the problem unusually complex.
To avoid this problem, SGD was used. In this method, $F(her_t)$ is calculated only for small subsets of the data instead of all of the data. Because each step is much faster, more steps can be performed in the same time, and when the process is repeated many times, it usually converges to the same result as batch gradient descent. SGD can also converge in a better direction without becoming trapped in the local minima in which batch gradient descent can get stuck. In this study, SGD was used to update the estimated human evaluation value in repetitive learning, as shown in Figure 2. To evaluate the two network topologies that compete in a complex manner in repetitive learning, humans assign scores in the form of real numbers, and these scores are updated using the SGD algorithm. In the early stages, however, these values can converge to local minima. To solve this problem, an adaptive SGD algorithm was used. The step size η was set differently for each estimation iteration, as shown in (8). Therefore, if the variation of the estimated human evaluation reward value is small, the effective step size increases, and if it is large, the effective step size decreases. $E_{t+1}$ accumulates the sum of squares of the gradients through which $her_t$ has moved up to time t. When $her_t$ is updated as in (9), it moves in inverse proportion to the square root of this accumulated value, scaled by the existing step size η. This means that if the step size becomes larger, $her_t$ moves considerably. Because this adaptive method sets the step size differently for each $her_t$, it is highly likely to approach the optimum when the state of the environment evaluated by humans appears frequently or under the same conditions; in that case, the value is fine-tuned with small step sizes. When the variation of $her_t$ is small, the step size is increased so that the optimum value is reached sooner. This method moves in a direction in which the loss can be reduced quickly and is a strategic and effective way of updating the human evaluation in the competitive problem of network topologies coexisting in a complex environment.
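The adaptive step size described above resembles an AdaGrad-style update, in which each step is divided by the square root of the accumulated squared gradients. The sketch below illustrates that idea under stated assumptions: the per-sample loss is taken to be 0.5 * (h - score)**2, the small constant `eps` is added for numerical safety, and the function names and sample scores are hypothetical. It is not the authors' implementation of (8) and (9).

```python
import math


def adaptive_sgd_step(estimate, grad, accum, eta=1.0, eps=1e-8):
    """One AdaGrad-style step: accumulate the squared gradient, then scale the step."""
    accum = accum + grad * grad
    estimate = estimate - eta / math.sqrt(accum + eps) * grad
    return estimate, accum


def fit_human_reward(scores, eta=1.0, epochs=200):
    """Track an estimate of noisy human scores using single-sample gradients.

    Assumed loss per sample: F(h) = 0.5 * (h - score)**2, so dF/dh = h - score.
    """
    estimate, accum = 0.0, 0.0
    for _ in range(epochs):
        for score in scores:
            grad = estimate - score
            estimate, accum = adaptive_sgd_step(estimate, grad, accum, eta)
    return estimate


# Hypothetical real-valued human ratings within the range [-5, 5].
human_scores = [3.5, 4.0, 2.5, 3.0, 4.5]
estimate = fit_human_reward(human_scores)
```

Under this assumed loss, the estimate settles near the mean of the entered scores, and the effective step shrinks as squared gradients accumulate.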
As shown in (10), $her_t$ is updated again by adopting the maximum of $her_t$ and its adaptively estimated value. The correction value $her_t$ adaptively updated by human evaluation must then be combined with the system reward $r_t$, which is derived by updating the learning object, to determine the final reward value $h^*_t$. Note that $h^*_t$ is the reward finally used for repetitive learning, and $her_t$ is the reward calculated by human intervention and evaluation during the learning process.
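A minimal sketch of this final step follows. The maximum follows the description of (10); the weighted sum used to combine the human and system rewards is an assumption made for illustration, since the text does not state the exact combination rule. The function names and the `human_weight` parameter are hypothetical.

```python
def corrected_human_reward(raw_score, estimated_score):
    """Keep the larger of the raw and adaptively estimated human rewards, as described for (10)."""
    return max(raw_score, estimated_score)


def final_reward(system_reward, raw_score, estimated_score, human_weight=1.0):
    """Combine the system reward with the corrected human reward.

    The weighted sum is an assumption for illustration only; the source does not
    specify how the two rewards are combined.
    """
    her = corrected_human_reward(raw_score, estimated_score)
    return system_reward + human_weight * her


h_star = final_reward(system_reward=1.0, raw_score=2.5,
                      estimated_score=3.2, human_weight=0.5)
```

Exposing the weight as a parameter keeps the sketch neutral about how strongly human evaluation should influence learning.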
To design human interventions and evaluations adaptively in an iterative RL process, the SGD algorithm is used to implement the reward policy, and the resulting human evaluation rewards are combined with the rewards computed within competing systems that have complex coexisting network topologies. A framework that summarizes these interactions is shown in Figure 3. In this paper, we propose a method of effectively updating rewards in a complex network topology. To set and update the rewards in the RL process, quantitative human evaluation is performed using real numbers, and the reward policy is updated using the adaptive SGD algorithm. Afterwards, the system combines these rewards with its own, yielding a more advanced RL model. Algorithm 1 details the overall algorithm of this framework.
Algorithm 1. RL algorithm using adaptive human evaluation reward updating to establish reward policies.
1: ← constant: human intervene rate
2: γ ← constant: discount rate
3: ← constant: learning rate
4: for 1 ≤ ≤

System Implementation and Experimental Results
This section describes in detail the implementation of the proposed adaptive human evaluation strategy framework and presents the numerical analysis performed using software programs. The software program implemented as shown in Figure 4 consists of six different functional panels. Table 2 summarizes the functions of each panel.

The first panel is called "Application for basketball AI game," where the artificially coexisting and dynamically changing network topology problem discussed in this paper is applied to an AI basketball game. To determine the basketball game conditions, the number of times to repeat the lesson and the ranges of the attacking and defending teams are determined.
The second panel, called "Basketball AI game player network and system reward calculation," shows the basketball team network fluctuating dynamically during repeated learning, while calculating the system reward using the general RL algorithm.
The third panel, "Basketball AI game player position network topology distribution," depicts the position distribution of basketball players on the two-dimensional (2D) plane in the repetitive learning, as well as the three-dimensional (3D) mesh. Each change is shown in detail in Figure 5, and the cumulative changes as the learning is repeated are evident.
The fourth panel is called "Adaptive human evaluation estimation and calculation," where the evaluation is made through human intervention when the players of the attacking team choose the direction in which to pass the ball, and the reward is given as a real number between -5 and 5. The user directly enters the number in the form of a real number into the software system. The human evaluation rewards evaluated in this manner are handled more effectively in the fifth panel.
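As an illustration of how such real-valued input might be checked before use, the following sketch parses a typed score and rejects values outside the stated range of -5 to 5. The function name `parse_human_score` is hypothetical and is not part of the described software.

```python
def parse_human_score(text, low=-5.0, high=5.0):
    """Convert user input to a float and check that it lies within [low, high]."""
    score = float(text)  # raises ValueError for non-numeric input
    if not (low <= score <= high):
        raise ValueError(f"score {score} is outside [{low}, {high}]")
    return score


score = parse_human_score("3.75")
```

Rejecting out-of-range input at entry keeps later reward updates within the scale the evaluator was asked to use.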
In the fifth panel, "Updating human evaluation reward using SGD algorithm (in repeated learning process)," the SGD algorithm is used adaptively to update the reward value that the human user entered through evaluation.
Finally, as shown in Figure 6, the system rewards and adaptive human evaluation rewards calculated in the second and fifth panels are combined to derive the final reward value and proceed with the learning. This final sixth panel is called "Reward update monitoring" and shows the rewards and average rewards in repeated learning of basketball games depicted using a complex and dynamically changing network with the proposed framework.
Figure 7 compares the RL method using the adaptive human evaluation strategy proposed in this paper with those proposed in previous studies. The existing methods compared with the proposed method were human evaluation with binary updating, the SARSA method, and the MDP method. Table 3 defines the experimental conditions, which apply equally to all methods. All of the methods involve learning basketball games with the complex and dynamically changing network topologies discussed in Section 3.
Table 4 compares, for each method, the type of human evaluation, the human intervention rate, the average reward value, the calculated Q-function value, and the point of convergence to the maximum reward value. As shown in Table 4, the RL method with the proposed adaptive human evaluation strategy achieved convergence with the highest maximum reward value. In addition, it exhibits the fastest convergence to the maximum reward in Figure 7. This result demonstrates that, compared with updating only the system reward from the learning object as in the existing methods, learning reaches the maximum reward value faster when quantitative evaluation is performed through human intervention.
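For context on the SARSA baseline named above, the sketch below shows the standard tabular SARSA update, which bootstraps from the action actually taken in the next state rather than from the greedy maximum used by Q-learning. The table layout and the state and action labels are hypothetical; the paper does not describe how the baseline was implemented.

```python
from collections import defaultdict


def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.9):
    """On-policy update: bootstrap from the action actually chosen in next_state."""
    td_target = reward + gamma * q[(next_state, next_action)]
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical table entries, used only for illustration.
q_table = defaultdict(float)
q_table[("s1", "pass_right")] = 2.0
value = sarsa_update(q_table, "s0", "pass_left", 1.0, "s1", "pass_right")
```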

Conclusions
RL has been studied in various forms to train learning objects to achieve desired goals. This paper proposed an algorithm that applies RL by mapping an environment with high complexity and time-sensitive dynamic changes to a network topology.
To learn dynamically changing network topologies effectively, we designed a system to evaluate the status and update the rewards through human intervention. Unlike in the existing methods, quantitative and clear evaluations are made using real numbers. In addition, the reward value obtained from human intervention is updated by applying the SGD algorithm, which establishes an adaptive update strategy that improves the learning speed. This RL method is stable and effective and enables accurate reward updating through human intervention. The system rewards and the adaptive rewards estimated from the human evaluations are then combined and updated accordingly. To demonstrate the effectiveness of this technique, a basketball game was mapped to a competing network topology and investigated experimentally. The proposed framework was compared with the existing RL methods (binary evaluation, the SARSA method, and the MDP method) in the software environment. The proposed adaptive human evaluation strategy converged to the maximum reward value the fastest and produced a high Q-function value.
In future research, methods of effectively learning two or more opposing objects in a physical environment should be considered, as interventions that deliver human evaluations directly to learning objects in physical environments must be designed with more sophisticated and advanced reward updating strategies.

Conflicts of Interest:
The authors declare no conflict of interest.