Mobile User Interface Adaptation Based on Usability Reward Model and Multi-Agent Reinforcement Learning

Abstract: Today, reinforcement learning is one of the most effective machine learning approaches for automatically adapting computer systems to user needs. However, integrating this technology into a digital product requires addressing a key challenge: determining the reward model in the digital environment. This paper proposes a usability reward model for multi-agent reinforcement learning. Well-known mathematical formulas used for measuring usability metrics were analyzed in detail and incorporated into the usability reward model. With the usability reward model, any neural network-based multi-agent reinforcement learning algorithm can be used as the underlying learning algorithm. This paper presents a study using independent and actor-critic reinforcement learning algorithms to investigate their impact on the usability metrics of a mobile user interface. Computational experiments and usability tests were conducted in a specially designed multi-agent environment for mobile user interfaces, enabling the implementation of various usage scenarios and real-time adaptations.


Introduction
The development of an innovative technology for automatic predictive adaptation of user interfaces for smartphones, tablets, and other mobile devices based on algorithms, models, and methods of deep machine learning is a relevant task. Currently, there is a demand in the IT industry to reduce the complexity of mobile app development, optimize computations in interface rendering, and improve usability in solving user tasks [1,2]. On average, a typical user installs about 100 applications on their mobile device but uses no more than 50 of them in a month and fewer than 10 apps every day [3,4]. It is assumed that predictive adaptation technologies for mobile user interfaces will significantly reduce the redundancy of applications, functions, and data and greatly increase the speed of information processing by the user [5,6].
In general, adaptivity can be defined as the capability of a system to achieve its goals in a changing environment by selectively executing and switching between behavior models [7]. An adaptive user interface is an interface that can autonomously adjust its content, layout, or style to better suit the user's capabilities and interests [8]. Usually, automatic adaptive processes optimize the system state under conditions of limited resources [9]. The adaptation of a mobile user interface using a simple algorithm that adjusts the positions of elements based on the frequency of use shows a dependence of adaptation frequency on training epochs, similar to the known law of evolutionary adaptation in nature [10]. Research in this direction using reinforcement learning technologies suggests that an intelligent agent based on deep machine learning models can acquire adaptive properties similar to those found in living organisms [11].
In reinforcement learning theory, intelligent agents learn to optimize their control over the environment [12]. An agent interacts with the environment through a sequence of observations, actions, and rewards. Deep neural networks can be used to approximate the action value function, representing the maximum sum of discounted rewards at each time step achievable through some agent behavior policy [13].
Various reinforcement learning approaches are applied to enhance human-machine interaction. These include single-agent approaches, where the interface agent adapts the entire system content, including elements, widgets, and menus [14]; multi-agent approaches, where the user-agent and interface-agent interact to improve cooperation when changing the user interface content [15]; and multi-agent approaches with multiple widget agents (UI elements) participating in adaptation [16]. It is natural to view interface elements as multiple agents adapting to different usage scenarios [17]. Therefore, this work utilizes the Multi-Agent Reinforcement Learning (MARL) approach to simultaneously train multiple agents.
The goal of the agent is to find a policy for optimal actions that maximizes cumulative rewards. Such rewards are called external and are generated by the environment surrounding the agent or defined as a model using expert knowledge about the task [18]. A typical study examines the effectiveness of creating reward models in the context of adapting a user interface using RL and a reward model obtained on the basis of predictive human-computer interaction, augmented with human feedback [19,20].
This paper proposes a usability reward (UR) model in multi-agent reinforcement learning to evaluate the action values of agents based on environmental and agent state parameters in a mobile user interface. The reward model can include various components, such as the target state of the environment, qualitative and quantitative characteristics of the intelligent agent, indicators of task completion, and expert evaluation. To address the challenge of selecting reward model components, effective methods for exploring the environment and the state-action space of agents are needed. The better the components of the reward model are chosen, the more efficient the reinforcement learning process will be. One or several parameters can simultaneously influence the agent's reward. The use of the UR model and multi-agent reinforcement learning aims to enhance the usability and user experience of the mobile user interface. Therefore, the paper also examines the impact of standard MARL algorithms on the usability of the mobile user interface.
The paper is organized as follows. Section 2 provides a literature review. Section 3 describes the proposed model and the contribution of this paper. Section 4 discusses the experiment and the research environment for evaluating the proposed model's effectiveness. Section 5 summarizes the obtained results and serves as the conclusion.

Related Work
Research in the field of adaptive user interfaces and reinforcement learning (RL) algorithms has been conducted for several decades. Broadly, these studies can be divided into two major groups: improving usability and creating reward models.

Improving Usability
The user interface can be considered as a means of mapping user tasks to system tools [14]. Here, context and adaptation are important characteristics of user-system interaction that can be used to simplify the representation task and thus enhance the interface. A system based on these features can adapt its actions according to the given context. For instance, the interface agent-assistant proposed in [21] learns to assist the user by observing their actions and imitating them, receiving feedback from the user when incorrect actions are taken.
Implicit learning of user preferences is implemented by observing user actions [22]. The agent analyzes user reactions to presented documents and extracts individual user profiles. Reinforcement learning is used to adapt weights in the user profile to represent user preferences optimally.
On the other hand, the motivation for creating adaptive human-machine systems that learn through trial and error is the automatic formation of a user-friendly human-machine dialogue [23]. Theoretical learning of such systems can be performed using human users or corpora of human-machine interactions. However, exploring the extensive space of possible dialogue states is impractical without using user action modeling tools.
The results of user research show that people use reward signals not only to receive feedback on past actions, but also to model future rewards by adjusting new actions [24]. This work demonstrates the importance of understanding the "human-teacher/agent-learner" system for developing algorithms that enhance the efficiency of intelligent agent learning.
Reinforcement learning approaches are used for mapping natural language instructions to sequences of executed actions [25]. It is assumed that the developer has access to a reward model that determines the quality of executed actions. The policy gradient algorithm is used to estimate the parameters of a linear model of action selection results, demonstrating that this approach can compete with supervised learning methods without requiring many annotated training examples. The approach is applied to interpret instructions of two types: troubleshooting guides for operating systems and tutorials for computer games.
Convenient user interfaces can be characterized by their ability to support logical reasoning, planning under uncertainty, short-term adaptation, and long-term experience-based learning [26]. The corresponding structure of the human-machine interaction for such interfaces can be based on the model of Partially Observable Markov Decision Processes (POMDPs). The advantages of this approach are demonstrated in the example of a simple iPhone application interface controlled by gestures. Moreover, the hypothesis that people use POMDP-like models for action planning under uncertainty is considered. Users can directly manipulate system functions to express their interests, and reinforcement learning is used to model the user, allowing the system to solve the exploration-exploitation dilemma well known in RL [27]. This approach provides users with a more efficient way to manage their information search.
A conceptual reference structure for adapting the user interface has been defined, containing a set of adaptation properties [16]. This set of properties can be used to compare different methods and generate new ideas for adaptation. Although most software developers recognize the relevance and benefits of adaptation properties, they are not yet considered in the design phase. Perhaps the key task of the machine learning method is to propose the right adaptation at the right time and place to make it as valuable as possible for the end user. Otherwise, adaptation will be subject to limitations that may prevent the expected benefits if not addressed: the risk of mismatch, disruption of cognitive functions, lack of prediction, lack of explanations, lack of engagement, and risks regarding confidentiality.
A model-oriented approach to reinforcement learning is aimed at solving all the mentioned adaptation problems and expanding the scenarios of using adaptive interfaces [8]. The approach has been applied to adapt the layout of web page designs, placement of icons on the main screens of mobile devices, and reorganization of application menus. Depending on the application, the goal of adaptation may vary and include minimizing selection time, increasing the importance of elements, reducing cognitive load, increasing user engagement, or a combination of these goals. This approach can handle various types of adaptation, including the visual representation of graphic elements (position, size, color, etc.) or their behavior (number of elements, animation, etc.) [28]. The main idea of the approach is based on special models of the human-machine interaction, where it is necessary to accurately determine the impact of adaptation on selected functional goals.

Creating a Reward Model
A model-oriented optimization of user satisfaction during interaction is the main goal of the adaptive interface proposed in [29]. The reward model in this work is a simple sum of the L1 and L2 distances between the vertices of the object (o) and the target (t):

r = ||o − t||_1 + ||o − t||_2. (1)

Thus, the reward model (1) measures the distance to the target in a user-defined interaction task.
After taking an action, the agent immediately receives a reward from the environment, set as the negative value of the expected time cost of each action [30]:

r = −R(N) = −(R_d(N) + R_touch), (2)

where R(N) is the time cost of choosing a target from N candidates, R_d(N) is the time cost of deciding which object to choose among the N candidates, and R_touch is the time cost of the action of selecting a target. An empirical decision-making time model (in seconds), R_d(N) = 0.08 log_2(N) + 0.24, was used to evaluate the decision time, and the Fitts model, in an extended and refined form of Fitts' law, was applied to estimate the touch time [31].
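The time-cost reward above can be sketched directly from the quoted empirical model; here `touch_time` is treated as a given input (in practice it would come from a Fitts-law estimate, which is not reproduced here):

```python
import math

def decision_time(n_candidates: int) -> float:
    """Empirical decision-making time (in seconds) for choosing among N
    candidates, as quoted in the text: R_d(N) = 0.08 * log2(N) + 0.24."""
    return 0.08 * math.log2(n_candidates) + 0.24

def selection_reward(n_candidates: int, touch_time: float) -> float:
    """Reward as the negative total time cost: -(R_d(N) + R_touch).
    touch_time is assumed to be supplied by a Fitts-law model."""
    return -(decision_time(n_candidates) + touch_time)
```

For example, with N = 4 candidates the decision time is 0.08 * 2 + 0.24 = 0.40 s, so a touch time of 0.5 s yields a reward of -0.90.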
With the help of Multi-Agent Reinforcement Learning (MARL) based on the popular Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm [32], it is possible to design adaptive systems by simulating reciprocal adaptation between a user and an intelligent tool [33]. In this case, the reward of the agents is defined by an indicator function, equal to −1 at each step of the agents and +10 if the agents together reach the target position. Thus, agents must collaborate to achieve the goal in the minimum number of steps. This multi-agent approach demonstrates that agents can learn realistic action policies, with cooperative agent actions being 30% more effective compared to a single-agent approach.
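The indicator-style team reward described for this MADDPG setup can be sketched as follows; the scalar positions and the tolerance parameter are illustrative assumptions, not the cited work's exact interface:

```python
def cooperative_step_reward(agent_positions, target_positions, tol=0.0):
    """Team reward in the indicator style described in the text:
    -1 for every step taken, +10 once ALL agents jointly reach
    their target positions (within an assumed tolerance tol)."""
    reached = all(abs(p - t) <= tol
                  for p, t in zip(agent_positions, target_positions))
    return 10.0 if reached else -1.0
```

Because the per-step penalty accumulates, maximizing the return pushes the agents to cooperate and reach the joint goal in as few steps as possible.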
The reward model includes two weighted conditions, creating a compromise between task accuracy and task execution time: how much the current input observation x differs from the target state g, and how much time is required for the action [15]:

r = −(α • ε_gd + (1 − α)(T_M + T_D)), (3)

where ε_gd is the difference between the input observation and the target state, α is the weighting coefficient, T_M is the predicted movement time, which depends on the current position and endpoint parameters, and T_D is the decision-making time. Therefore, the high-level action policy of the agent must consider how interface elements relate to the interaction task.
The agent can receive an immediate reward from the environment once it takes an action [34]. The reward is determined as the negative value of the time cost for performing the action. The estimation of time costs was based on empirical data obtained during the study of menu selection by multiple users. At each step, the agent forms a vector of values [sim, s_finger, s_focus, p_i], where sim is the similarity between the current interface element and the target element, s_finger is the position of the element for the pointer, s_focus is the position of the element for the agent's selection, and p_i is the probability of the target element being in a specific position within the interface, based on memorized positions. The range of s_finger and s_focus is normalized to [0, 1] for better learning performance. Each value in this range is specified by the distribution of menu item positions. The Deep Q-Network (DQN) algorithm [11] was used for agent training, which performs well in reinforcement learning tasks where the state space is continuous and the action space is discrete.
There is a separate research direction related to finding a previously unknown reward model of the environment based on agent observation data. The model of internal symbolic rewards by [35] defines rewards as low-dimensional decision tree structures and uses them for standard reinforcement learning.
Table 1 presents the main methods of implementing the fundamental principles of user interface adaptation, their models, and reinforcement learning algorithms. Reinforcement learning neural network algorithms demonstrate the key trends in modern deep multi-agent learning: independent learning, centralized training with decentralized execution, and the use of basic neural network architectures such as fully connected, convolutional, recurrent, and actor-critic networks. The approach in this paper applies the RL algorithms IQL (a multi-agent modification of DQN) and MADDPG (a multi-agent modification of DDPG), together with the Dec-POMDP model (a multi-agent modification of POMDP), to implement an adaptive mobile user interface, which will be demonstrated in Sections 3 and 4.

Decentralized POMDP Model
The uncertainty associated with the inability of a single agent to obtain the exact state of the environment complicates the application of the classical Markov decision process model for reinforcement learning. To overcome the uncertainty of the environment's state, a mathematical model called the Partially Observable Markov Decision Process (POMDP) was developed, which, in the multi-agent case, is extended into a Decentralized POMDP (Dec-POMDP) model by introducing joint actions and joint observations of agents [36].
The Dec-POMDP model consists of a tuple ⟨N, S, A, T, R, O, O, h, b_0⟩, where N is the set of agents, S is the set of states, A is the set of joint actions, T is the transition function, R is the reward function, O is the set of joint observations, O is the observation function, h is the task's time horizon, and b_0 is the initial state distribution. The set of joint actions A consists of A = ×_{i∈N} A_i, where A_i is the set of actions of agent i. At each time t, agent i performs some action a_i, which forms the joint action of the multi-agent system ā = ⟨a_1, a_2, ..., a_n⟩. The effect of the joint action ā on the environment is determined by the transition function T as the probability T(s′ | s, ā) of transitioning from state s to state s′. The set of joint observations O consists of O = ×_{i∈N} O_i, where O_i is the set of observations available to agent i. At each time t, agent i receives some observation o_i, which forms the joint observation ō = ⟨o_1, o_2, ..., o_n⟩. The observation function O determines the probability O(ō | s′, ā) of the agents perceiving the environment's state as the joint observation ō.
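The components of the Dec-POMDP tuple can be collected in a small container type; all names here are illustrative placeholders for the mathematical objects above, not a library API:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class DecPOMDP:
    """Container mirroring the tuple (N, S, A, T, R, O, O, h, b0).
    Transition and observation functions return probabilities."""
    agents: Sequence              # N: set of agents
    states: Sequence              # S: set of states
    joint_actions: Sequence       # A = x_i A_i: set of joint actions
    transition: Callable          # T(s' | s, a): transition probability
    reward: Callable              # R(s, a): reward function
    joint_observations: Sequence  # set of joint observations
    observation: Callable         # O(o | s', a): observation probability
    horizon: int                  # h: time horizon of the task
    initial_belief: Callable      # b0(s): initial state distribution
```

A trivial single-agent, single-state instance shows how the pieces fit together, e.g. a deterministic transition function that always returns probability 1.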
In the Dec-POMDP model, the set of agents learns to optimize their actions by maximizing the expected cumulative reward Q = Σ_t γ^t R(s_t, ā_t) with a discount parameter γ ∈ [0, 1] over some time horizon h, considering the initial state distribution of the environment b_0: S → [0, 1] at time t = 0 (Figure 1).
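The discounted cumulative reward over a finite horizon can be computed with a one-line helper:

```python
def discounted_return(rewards, gamma):
    """Cumulative reward Q = sum_t gamma^t * R_t over a finite horizon,
    where rewards[t] is the reward received at time step t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

For example, rewards [1, 1, 1] with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75.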

Usability Reward Model

To solve the task of selecting reward model components, efficient methods for exploring the environment and the agents' "state-action" space are needed. Additionally, to improve the accuracy and reliability of the reward model, a combination of parameters is required, such as the target state of the environment, qualitative and quantitative characteristics of the agent, usability metrics of the interface, and indicators of task completion. In the UR model, the reward of agent i is determined for each parameter k at time t ≠ 0 by the formula:

r_ik = β_k (s′_k − s_k), (4)

where s′_k is the target value of parameter k, characterizing the state of agent i and the environment, s_k is the current value of the parameter, and β_k is the coefficient of parameter k. The values of s′_k are normalized in the range from 0 to 1.

The overall reward of agent i in the UR model consists of the sum of rewards across all k parameters, the sum of rewards across all u usability metrics, and the reward for task completion by the agent:

R^UR_i = Σ_k r_ik + Σ_u r_iu + λ • success_i, (5)

where r_ik is the reward of agent i for each parameter k from (4), r_iu is the reward of agent i for each metric u from Table 2, and success_i is the indicator function equal to 1 if the task of agent i is successfully completed (activated) and 0 otherwise. The result of the success_i function is discounted by the coefficient λ ∈ [0, 1].

The average reward value of agent i across all k parameters, without considering the indicator function success_i, can also be used in determining the quality of agent actions:

r̄_i = (1/K) Σ_k r_ik. (6)

In Section 4, we will demonstrate how the reward is computed using expressions (4)-(6) using the example of a multi-agent environment of a mobile kitchen user interface. Measurable usability metrics will allow expanding the reward model components to enhance the quality of the adaptive interface. Table 2 presents popular metrics that can be used in the Usability Reward model.
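The UR reward computation can be sketched in Python. The β-scaled difference form of the per-parameter reward and the value lam = 0.5 in the test below are stated assumptions for illustration, not values fixed by the model:

```python
def parameter_reward(target, current, beta):
    """Per-parameter reward: beta-scaled difference between the target
    value s'_k and the current value s_k (values normalized to [0, 1])."""
    return beta * (target - current)

def ur_reward(targets, currents, betas, metric_rewards, success, lam=0.5):
    """Overall UR reward: sum of per-parameter rewards, plus the sum of
    usability-metric rewards r_iu, plus the success indicator discounted
    by the coefficient lam in [0, 1]."""
    r_params = sum(parameter_reward(t, c, b)
                   for t, c, b in zip(targets, currents, betas))
    return r_params + sum(metric_rewards) + lam * (1.0 if success else 0.0)
```

For one parameter with target 1.0, current 0.5, and beta 2.0, plus one metric reward 0.3 and a completed task with lam = 0.5, the overall reward is 1.0 + 0.3 + 0.5 = 1.8.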

Name of Usability Metric (Reference) Definition
Structured Text Entry [37] STE = 100 • Σ_i Structured_Text_Entry(i)/n, where Structured_Text_Entry() returns 1 if the input element displays a mask, 0 otherwise; n is the total number of input elements that accept data in an exact format.
Essential Efficiency [38] EE = 100 • S_essential/S_enacted, where S_essential is the number of user steps in the description of the primary (best) use case, and S_enacted is the number of steps required to perform the use case with the current user interface design.

Built in Icon [37]
BI = 100 • Σ_i Built_in_Icon(i)/n, where Built_in_Icon() returns 1 if the action element displays a system icon, 0 otherwise; n is the total number of action elements in the interface.
Density Measure [37] DM = 100 • Σ_{i=1..n} a_i/a_frame, where a_i and a_frame represent the area of object i and the area of the frame, respectively, and n is the number of objects in the frame.
Layout Appropriateness [38] LA = 100 • C_optimal/C_designed, where the cost of a layout is C = Σ_{i,j} P_{i,j} • D_{i,j}, P_{i,j} is the transition frequency between visual components i and j, and D_{i,j} is the distance between visual components i and j.

Default Value [37]
DV = 100 • Σ_i a_i/n, where a_i equals 1 if input element i has a default value, 0 otherwise, and n is the total number of input elements.
Task Concordance [38] TC = 100 • D/(N(N − 1)/2), where N is the number of tasks to be ranked and D is the discordance index, i.e., the number of task pairs whose difficulties are ranked correctly minus the number of pairs whose difficulties are ranked incorrectly.

Target Element Size [37]
TES = 100 • Σ_i a_i/n, where a_i returns 1 if the area of object i is greater than or equal to 44 pt × 44 pt (or 88 pt × 88 pt), otherwise 0, and n is the number of interactive objects in the interface.
Text Size [37] TS = 100 • Σ_i FontSize_i/n, where FontSize_i returns 1 if the font size for text input i is greater than or equal to 16 px, otherwise 0, and n is the number of text inputs in the interface.
Task Visibility [38] TV = 100 • Σ_i V_i/S_total, where S_total is the total number of implemented steps to complete the use case, and V_i is the visibility of the implemented step i, ranging from 0 to 1.

Balance [37]
BL = 1 − (|BL_vert| + |BL_hor|)/2, where BL_vert is the vertical balance and BL_hor is the horizontal balance, with BL_vert = (W_L − W_R)/max(W_L, W_R) and BL_hor = (W_T − W_B)/max(W_T, W_B). Here L, R, T, and B refer to the left, right, top, and bottom edges, respectively, and W_j is the weight of the j-th side of the interface (left, right, top, or bottom).

Horizontal (BH) or Vertical Balance (BV) [38] where W 1 is the weight of the first side, W 2 is the weight of the second side.
The weight of a side (W_i) = the number of pixels used × the distance of the side from the center. The center = halfway between the left edge of the leftmost visual element and the right edge of the rightmost element.
Element Readability [8] where B(i l ) is the activation of element i at point l, and δ is the constant for careful examination of the element in the absence of activation.
Element Selection [8] where a p and b p are the Fitts' law constants for estimating the time to select an element (for example, a p = 10.3 and b p = 4.8), for the target element i at point l.

Local Search [8]
where T c is a constant penalty for unexpectedness (element found where it was not expected), δ is the constant cost of careful examination of the element, N local is the number of nearby elements being checked, T trail is the constant time for the cursor to follow the gaze of the eye.
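As an illustration of how a Fitts-law-based entry such as Element Selection can be evaluated, the sketch below uses the example constants a_p = 10.3 and b_p = 4.8 quoted in the table; the Shannon form of the index of difficulty, log2(D/W + 1), is an assumption, and the units of the constants follow whatever the cited source uses:

```python
import math

def selection_time(distance: float, width: float,
                   a_p: float = 10.3, b_p: float = 4.8) -> float:
    """Fitts-law estimate of the time to select an element at a given
    distance with a given target width, using the Shannon formulation
    of the index of difficulty: ID = log2(D/W + 1)."""
    return a_p + b_p * math.log2(distance / width + 1.0)
```

With distance equal to width, ID = log2(2) = 1, so the estimate is a_p + b_p = 15.1.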

The usability of the interface is automatically calculated by the environment, where a reinforcing reward signal is generated based on the presented metrics (Table 2), and it is considered in the reward model of the multi-agent environment after each action performed by the agents.The values of the usability metrics r iu , according to the expressions described above, are in the range from 0 to 1, where approaching r iu → 0 indicates low interface quality and r iu → 1 indicates high quality.
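Two of the Table 2 metrics can be sketched in the normalized [0, 1] form that feeds the reward model directly (the 0-1 scaling instead of percentages is an assumption made here so the values match the range of r_iu):

```python
def essential_efficiency(s_essential: int, s_enacted: int) -> float:
    """Essential Efficiency: ratio of ideal use-case steps to the steps
    the current design requires, kept in [0, 1] for use as r_iu."""
    return s_essential / s_enacted

def target_element_size(areas_ok: list) -> float:
    """Share of interactive elements meeting the minimum touch-target
    size (e.g., 44 pt x 44 pt); areas_ok holds 1/0 per element."""
    return sum(areas_ok) / len(areas_ok)
```

For instance, a 5-step essential use case that takes 10 steps in the current design scores 0.5, as does an interface where two of four elements meet the size threshold.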

IQL Algorithm
Thus, the adaptive user interface combines the Dec-POMDP model, which defines the agent's interaction process with the environment; the UR model, which generates rewards; and the Independent Q-Learning (IQL) algorithm, which performs multi-agent learning. The IQL algorithm, using a fully connected neural network, is based on the classical DQN algorithm [39]. The three main innovations IQL borrows from deep Q-learning are the use of deep neural networks, a replay buffer, and a target neural network.
The IQL algorithm implements the idea of independent learning by embedding an identical neural network architecture into each individual agent, given the absence of explicit communication between agents. The main and target neural networks are initialized with random weights θ ← ξ and θ′ ← ξ (step 1). The replay buffer Eb is initialized as empty, with a capacity Bl set to 10,000. The batch size Bs from the replay buffer is set to 32. The weight-copying period Cs for the main and target neural networks is chosen to be 100 episodes. To explore the agent's action space, the IQL algorithm applies ε-greedy action selection (step 5), where the value of ε linearly decreases over time from 1.0 to 0.1. The agent selects a possible action a according to the maximum Q-value of the neural network's nonlinear approximator for state s, with neural network weights θ and considering the ε parameter. The IQL algorithm for agent i, where α is the learning rate, γ is the discount factor for future rewards, ε is the exploration-exploitation parameter, and Ne is the number of episodes, is presented in Algorithm 1 below.
Algorithm 1 Deep Independent Learning for Agent i Using Fully Connected Neural Network.

1. Initialize the main and target neural networks with random weights θ ← ξ and θ′ ← ξ.

2. Initialize the replay buffer Eb as empty (capacity Bl); repeat steps 3-15 for each of the Ne episodes.

3. Get the state s.

4. Determine the possible actions a ∈ A_i in state s.

5. Pass the state s into the main neural network and select a possible action a according to the maximum value of Q_1(s, a; θ), Q_2(s, a; θ), . . . , Q_K(s, a; θ), considering the parameter ε.

6. Perform the possible action a.

7. Obtain the reward R_UR_i from the UR model and the next state s′.

8. Save (s, a, R_UR_i, s′) in the replay buffer Eb. If the capacity Bl is exceeded, replace the oldest value according to the FIFO queue principle.

9. Randomly select a minibatch of tuples (s, a, R_UR_i, s′) of size Bs from the replay buffer Eb.

10. For each tuple (s, a, R_UR_i, s′) in the minibatch, calculate Q_j(s′, a′; θ′) by passing s′ to the target neural network.

11. Calculate an intermediate variable to assess the optimality of the agent's actions: y_j = R_UR_i + γ max_{a′} Q_j(s′, a′; θ′).

12. For each tuple j (s, a, R_UR_i, s′) in the minibatch, calculate Q_j(s, a; θ) by passing s to the main neural network.

13. Calculate the loss function: L_j(θ) = (y_j − Q_j(s, a; θ))².

14. Update the weights θ of the main neural network using L_j(θ) and the learning rate α.

15. Replace the weights θ′ of the target neural network with the weights θ of the main neural network if the next period Cs has ended.
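Steps 8-13 of Algorithm 1 (replay buffer, TD target from the target network, and the squared loss) can be sketched in plain Python. The class and function names are illustrative, and the Q-functions are assumed here to be simple callables rather than the paper's neural-network approximators:

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO replay buffer for experience tuples (steps 8-9)."""
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)  # oldest items dropped first

    def add(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        return random.sample(self.buf, min(batch_size, len(self.buf)))

def td_targets(batch, q_target, gamma=0.99):
    """Step 11: y_j = r + gamma * max_a' Q(s', a'; theta')."""
    return [r + gamma * max(q_target(s_next)) for (_, _, r, s_next) in batch]

def mse_loss(batch, targets, q_main):
    """Step 13: mean squared TD error over the minibatch."""
    errs = [(y - q_main(s)[a]) ** 2 for (s, a, _, _), y in zip(batch, targets)]
    return sum(errs) / len(errs)
```

Minimizing this loss with learning rate α (step 14) and periodically copying θ into θ′ (step 15) completes the update loop.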
The main advantages of the IQL algorithm are as follows: simplicity of implementation, speed of operation, decentralized architecture, and robustness in achieving good results when applied to various practical tasks. This algorithm is iterative, involving decisions about adaptations (changes in the interface element's state) based on differences between the original and target user interfaces, formed from data obtained from human-machine interaction processes and a pre-formed usability context.

Multi-Agent Environment
For conducting comparative experiments on multi-agent reinforcement learning, a "sandbox" platform for a mobile user interface was developed. The platform is based on the popular Kivy software library v2.0.0, which provides rapid development capabilities for applications equipped with new user interfaces, such as multi-touch applications [40]. Based on multithreading, decentralized control is ensured over a multitude of interface elements (widgets) deployable in popular mobile ecosystems such as Android. Each widget element is controlled by an independent learning agent that has access only to local observations of the environment.

Figure 2 shows the graphical interface of the developed multi-agent environment of a mobile kitchen user interface. The investigated environment addresses the adaptation of the mobile user interface in solving user tasks: cooking dishes. In the interface of the multi-agent environment, usability metrics (Figure 3) are implemented based on expressions from Table 2: density metric (DM), target element size (TeS), balance (BL), element readability (TR), element selection (TP), local search (TL), layout appropriateness (LA), task visibility (TV), horizontal balance (BH), and vertical balance (BV). Figure 3 shows the developer interface for selecting UR model components and configuring the target user interface vector. Due to the lack of required input data for the solved task in the developed environment, the following metrics were not utilized (Table 2): structured text entry (STE), essential efficiency (EE), built-in icons (BIs), default value (DV), text size (TxS), and task consistency (TC). For example, for the "Essential Efficiency (EE)" metric, the number of user steps in the description of the main (best) use case is not known in advance, and for the "Default Value (DV)" metric, the input elements with default values are unknown.
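As an illustration of how such layout metrics are typically computed, the following sketch implements a horizontal-balance measure in the spirit of Ngo's aesthetic metrics; the exact expression used for BH in the paper's Table 2 may differ:

```python
def horizontal_balance(elements, frame_width):
    """Horizontal balance BH in [0, 1], 1.0 = perfectly balanced.
    Each side's weight = sum of element area * distance of the element's
    centre from the frame's vertical midline. Each element is given as
    (x_center, width, height). The formulation follows Ngo-style aesthetic
    measures and is illustrative, not the paper's exact Table 2 expression."""
    mid = frame_width / 2
    w_left = sum(w * h * (mid - x) for x, w, h in elements if x < mid)
    w_right = sum(w * h * (x - mid) for x, w, h in elements if x > mid)
    if max(w_left, w_right) == 0:
        return 1.0
    return 1.0 - abs(w_left - w_right) / max(w_left, w_right)
```

A mirror-symmetric layout scores 1.0, while a layout with all weight on one side scores 0.0, matching the 0-to-1 metric range used by the reward model.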

Use Cases
The developed multi-agent environment includes the use of interface agents representing basic ingredients from three different cuisines: European, Slavic, and Asian.In such a reinforcement learning task, the state space and the action space are discrete.
The actions of the agents include the following operations: moving the element in four different directions, rotations, and resizing (Figure 4). The state vector of each interface element (agent) includes the identifier, number of clicks, X position, Y position, size, and rotation. This work conducts an experiment to find the target state of the kitchen interface using three cuisine categories (European, Slavic, and Asian) and the user-chef preparing dish ingredients. Adaptation involves placing dish elements (moving, resizing) so that the user can quickly select ingredients for dishes of a particular cuisine.
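The state vector and discrete action set described above could be represented as follows; the identifier names are illustrative assumptions, not the paper's actual code:

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    """Eight discrete actions per interface agent: four move directions,
    two rotations, and two resize operations (names are illustrative)."""
    MOVE_LEFT = 0
    MOVE_RIGHT = 1
    MOVE_UP = 2
    MOVE_DOWN = 3
    ROTATE_CW = 4
    ROTATE_CCW = 5
    GROW = 6
    SHRINK = 7

@dataclass
class WidgetState:
    """State vector of one interface element: identifier, number of
    clicks, X/Y position, size, and rotation, as listed in the paper."""
    element_id: int
    clicks: int
    x: float
    y: float
    size: float
    rotation: float
```

The eight-action set is consistent with the "eight possible actions" per agent mentioned later in the Limitations section.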
The target user interface is a user-friendly interface for performing a specific user task, such as cooking dishes from a particular cuisine. In such an interface, all elements are given specific positions and sizes, and some may be absent altogether. However, the target interface is unknown in advance and can only be determined through online adaptations based on human-machine interaction and multi-agent reinforcement learning. The results of the usability experiment to find the target user interface are presented in Figure 5. The experiment was conducted with 50 student participants (33 masculine, 17 feminine, and 0 others) aged 19 to 23 (mean 22) who were residents of Eastern Europe with varying educational backgrounds and were opportunistically recruited. All participants reported frequent desktop, mobile, and web usage.
The source code of the developed multi-agent environment, based on the Dec-POMDP model, which defines the agent's interaction with the environment, the UR model, which generates rewards, and the IQL and MADDPG algorithms for multi-agent learning, is available at: https://github.com/dimi3o/MobAdaptUi (accessed on 23 March 2024).


Reward Computation
For the proposed environment, the state space of each interface agent is defined by the following parameters: horizontal position s′_x, vertical position s′_y, scale s′_s, and tilt s′_a of the element. The parameter s′_t, representing the number of user touches, characterizes the agent's task completion. Thus, the state s′_k of the agent in expression (4) is defined by a tuple of values (s′_x, s′_y, s′_s, s′_a), which is determined for the current s′_k and target s′(target)_k interface states. Therefore, for calculating the total reward in expression (5), there will be k = 4 local rewards r_ik, each of which returns a value in the range from 0 to 1. If the reward r_ik assigns a value close to 0, then this parameter k significantly differs from the target, and if the value is close to 1, it is close to the target.
The indicator function I_success_i in expression (5) equals 1 if task i of the agent is completed. Thus, the parameter s′_t, the number of user touches, defines the value of I_success_i when the element is activated by the user. For example, if the user clicks on the element, the agent-element is rewarded 0.1 for each click. The value of I_success_i is 1 if the user makes enough clicks (activation).

The overall reward of agent i in the UR model consists of the sum of rewards across all k parameters, the sum of rewards across all u usability metrics, and the reward for task completion by the agent, where r_ik is the reward of agent i for each parameter k from (4), r_iu is the reward of agent i for each metric u from Table 2, and I_success_i is the indicator function equal to 1 if the task of agent i is successfully completed (activated). The result of the I_success_i function is discounted by the coefficient λ ∈ [0, 1].

The average reward value of the agent i for each parameter k without considering the indicator function I_success_i can also be used in determining the quality of agent actions.

In Section 4, we will demonstrate how the reward is computed using expressions (4), (5), and (6) using the example of a multi-agent environment of a mobile kitchen user interface. Measurable usability metrics will allow expanding the reward model components to enhance the quality of the adaptive interface. Table 2 presents popular metrics that can be used in the Usability Reward model.
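A minimal sketch of how the UR-model reward of expression (5) could be computed; the distance-based normalization of the local rewards and the handling of the coefficient λ are assumptions for illustration, not the paper's exact expressions (4)-(6):

```python
def local_reward(current, target, value_range):
    """r_ik in [0, 1]: 1.0 when parameter k exactly matches its target.
    The linear distance normalization is an assumption for illustration."""
    return max(0.0, 1.0 - abs(current - target) / value_range)

def total_reward(params, targets, ranges, metric_rewards, success, lam=1.0):
    """Sketch of the UR-model total reward: sum of the k parameter rewards,
    plus the usability-metric rewards r_iu, plus lam * task-completion
    indicator (lam in [0, 1])."""
    r_params = sum(local_reward(c, t, rng)
                   for c, t, rng in zip(params, targets, ranges))
    return r_params + sum(metric_rewards) + lam * (1.0 if success else 0.0)
```

With k = 4 parameters exactly at their targets, no metric rewards, and a completed task (λ = 1), the sketch yields the maximum per-action total of 5 that the paper derives for this environment.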

If an element is positioned as in the target interface (Figure 5) and activated, the average arithmetic reward of the corresponding agent-element approaches 1. Therefore, for k = 4 parameters and local rewards r_ik, parameter s′_t, and the indicator function I_success_i, the maximum total reward of agent i resulting from performing one action (5) is R_UR_i ≤ 5.

The graph showing the average value of the reward r_imean obtained from the experiment, as described in Equation (6), is presented in Figure 6. This graph illustrates the quality of agent actions without considering user click activation. Interface adaptation occurred over a time interval of 140 s, encompassing 1000 adaptations for each agent. To alter the interface state, agent actions were executed randomly. The decreasing trend in the reward function indicates a deterioration in interface usability in the absence of a defined agent action policy.

Under similar conditions, the expected cumulative average reward of all agents r_imean with a discount factor of γ = 0.99 over the course of an episode (14 s, 2000 adaptations) reaches a plateau of limiting values, indicating the completion of the user interface usability improvement phase, as presented in Figure 7.

Experimental Results and Analysis
Usability testing of the adapted interfaces was conducted using the Questionnaire for User Interface Satisfaction (QUIS) 7.0 methodology [41,42], and key usability characteristics are highlighted in Figure 11. Thus, an analysis of the quality and usability of the mobile kitchen user interface obtained through adaptation based on the IQL and MADDPG algorithms is performed, along with their comparison in terms of usability criteria [42,43].

The resulting interface based on adaptation using the IQL algorithm, according to users, meets the "Effectiveness" criterion. This criterion signifies the interface's ability to enable users to accomplish tasks with minimal effort and time. In general, the higher the efficiency of the interface, the better users can achieve their goals and complete tasks with less time and effort. However, the adaptation result based on the IQL and MADDPG algorithms was perceived by users as unclear, meaning it does not meet the "Comprehensibility" criterion. This criterion measures the level of clarity and comprehensibility of the interface based on labels, instructions, and other interface elements, as well as the ease of understanding and using available functions and features. Importantly, in the survey results, users considered the obtained interface to meet the "Adaptability" criterion. That is, according to users, the obtained interface is adapted for use in various contexts: solving a task (preparing dishes of a certain cuisine), different devices, and screen resolutions.

Model Validation
The proposed approach based on the UR model was evaluated through empirical assessment of the time taken to search for and select user interface elements of interest, compared to two interfaces: a typical static interface (Static) and a frequency-adaptive interface (Frequency) [44].The static interface represents a grid world where all elements are stationary and positioned randomly.In the frequency-adaptive interface, elements are moved depending on the selected priority position on the screen and their frequency of use and are also arranged in a grid.In the UR model approach, a set of RL agents interact with the environment, learning through trial and error to find the optimal interface state and adjusting the position and size of elements in any way.
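The Frequency baseline described above amounts to ranking elements by usage count and assigning them to priority grid slots; a sketch under that assumption (function and parameter names are illustrative):

```python
def frequency_layout(elements, use_counts, grid_slots):
    """Sketch of a frequency-adaptive layout: the most frequently used
    elements are assigned the highest-priority grid slots. The ordering
    logic is an assumption about the Frequency baseline, not its source."""
    ranked = sorted(elements, key=lambda e: use_counts.get(e, 0), reverse=True)
    return dict(zip(ranked, grid_slots))
```

Unlike the UR-model condition, which lets RL agents adjust position and size freely, this baseline only reorders elements within a fixed grid.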
To compare these approaches, an experiment was conducted in which participants performed the task of selecting an element under three conditions (Static, Frequency, and UR model). Fifty participants aged 18 to 24 (average age 21) with varying levels of education were involved. All participants reported consistent active use of mobile devices. Interface layouts with 40 icons of dish ingredients were randomly generated for the experiment. The icons were evenly selected from common categories: European, Slavic, and Asian cuisines. An interface was created for each of the three conditions for each participant. The name of the target element was displayed in a tooltip at the top of the application window. Participants had to select the target element correctly to complete the trial. Clicking on the target element hid the menu and provided a short pause. The selection time was recorded automatically.

One-way analysis of variance of the search and selection time (in milliseconds) as the dependent variable, with the conditions Static, Frequency, and UR model as the fixed independent variable, is presented in Figure 12.
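The one-way ANOVA behind Figure 12 reduces to the ratio of between-group to within-group mean squares; a self-contained sketch (equivalent in result to `scipy.stats.f_oneway`):

```python
def one_way_anova_F(*groups):
    """One-way ANOVA F-statistic: between-group mean square divided by
    within-group mean square, for any number of sample groups."""
    all_vals = [v for g in groups for v in g]
    grand = sum(all_vals) / len(all_vals)          # grand mean
    k, n = len(groups), len(all_vals)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

Here the three groups would be the selection times (in ms) measured under the Static, Frequency, and UR model conditions.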


Limitations
The main foundation for the approach considered in the paper is multi-agent reinforcement learning, which offers its advantages and imposes its limitations [45].
Overall, the multi-agent approach in constructing a system in a nondeterministic and dynamic environment provides the following advantages:

• Robustness: The ability of a multi-agent system to continue functioning in the event of failures in individual agents' operations.

• Efficiency: Parallel data processing in a multi-agent system by autonomous agents accelerates the system's operation.

• Adaptability: A decentralized multi-agent system is capable of dynamically changing its behavior in response to changes in the environment.

• Encapsulation: A multi-agent system is modular by nature, allowing flexible changes to its structure, abstraction of data, and protection of internal implementations.

• Scalability: In information perception and decision-making, a multi-agent system has no limitations on centralized data control, which enables better handling of increased workloads with added resources.
However, applying machine learning methods to multi-agent systems also leads to new challenges.
Firstly, multi-agent learning suffers from the so-called "curse of dimensionality," caused by the exponential growth of the state-action space with an increase in the number of agents. For example, in the approach described in this paper, with single-agent (independent) reinforcement learning for 40 agents and eight possible actions each, 320 Q-values would need to be stored for each state of the environment. However, in multi-agent learning with joint actions, it would be necessary to consider 8^40 ≈ 1.329 × 10^36 Q-values.
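The arithmetic behind this example can be checked directly: with 40 agents and 8 actions, independent learning stores 40 × 8 Q-values per state, while the joint action space grows as 8^40:

```python
def independent_q_count(n_agents, n_actions):
    """Independent learning: each agent keeps its own per-state Q-values."""
    return n_agents * n_actions

def joint_q_count(n_agents, n_actions):
    """Joint-action learning: Q-values over the joint action space,
    which grows exponentially with the number of agents."""
    return n_actions ** n_agents
```

For the paper's setting this gives 320 versus roughly 1.329 × 10^36 Q-values per state, illustrating why independent learning (IQL) remains tractable where joint-action learning does not.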
Secondly, many theoretical guarantees that existed and were formally proven in single-agent learning no longer apply. For instance, the classical Q-learning algorithm guarantees convergence to an optimal strategy for a finite Markov Decision Process. But in multi-agent learning, in the considered case of a mobile user interface with the Dec-POMDP model, agents have to settle not for an optimal strategy but for some averaged equilibrium achieved in interactions with each other and the environment.
Thirdly, the exploration-exploitation problem becomes much more complex in multi-agent learning, as the transition of one agent from exploration to exploitation, or vice versa, implicitly affects the strategies of all other agents. For example, in the mobile user interface being considered, this means that interface elements may be inconveniently positioned, usability criteria may not be met, and elements may move only in one direction and be pressed against the screen edges, which is very inconvenient for the user to interact with.
Finally, the fourth challenge is related to coordination. The effect of each agent's action depends on the actions or inactions of other agents. Balancing between finding effective joint actions of agents and searching for optimal strategies is a constant challenge in coordination. Although this problem is more typical for cooperative multi-agent learning, the lack of coordination negatively affects independent learning agents as well. For example, due to the problem of multi-agent coordination, interface elements and agents from one type of cuisine may not immediately adapt to Slavic, European, and Asian cuisines.

Future Directions
Multi-agent systems, as a distinct discipline in computer science and computer engineering, began to develop at the end of the last century. Over the past twenty years, interest in this area has steadily increased [46], leading today to over half a million search results in the "Google Scholar" system.
On the one hand, this growth is associated with the belief that multi-agent systems are one of the promising paradigms for software development. The shift of focus from developing applications for personal workstations to developing distributed systems that interact remotely with each other has led to the complexity of these systems and their acquisition of properties characteristic of multi-agent systems.
On the other hand, the key property of an agent operating in an unknown environment is the ability to learn based on the experience gained from interacting with this environment or with other agents. The evident successes in recent years of deep single-agent learning in recognition and gaming tasks motivate researchers and developers to extend the potential of these methods to multi-agent learning.
Multi-agent learning today is a fundamental component of multi-agent systems. From a scientific perspective, researching the interaction between a multitude of simultaneously learning agents yields new knowledge that can be used in analyzing social structures of society, economically competitive environments, and control theory. From an engineering perspective, a multitude of simultaneously learning agents provides a conceptually new way of developing communication systems, robotics, transportation systems, human-machine interfaces, and software.
Currently, the main trend in AI is Large Language Models (LLMs). For example, recent works have successfully utilized LLM capabilities to gather abstract knowledge about global physics for decision-making problems [47,48]. Applying LLMs directly in the approach of the current work is challenging because LLMs occupy tens of gigabytes and cannot be loaded onto a mobile device for autonomous use. Additionally, we require online learning; thus, our approach can function as grounding for RL [47]. In this way, the present approach connects the LLM trend with RL.
The experiments were conducted on the Android mobile operating system; however, the entire experimental setup can be ported to other mobile operating systems. In the development of mobile user interfaces, including the functions of various applications, inter-process communication technology is commonly used for data exchange between threads. Similar mechanisms enable applications to communicate with each other and with the operating system through data requests, function calls, and responses to signals. This opens broader possibilities for using the adaptive mechanism of multi-agent reinforcement learning across the full technology stack of mobile interfaces.

Conclusions
This study explores the usability reward model in multi-agent reinforcement learning for the mobile user interface. The model consists of parameters of the target state of the environment, qualitative and quantitative characteristics of the agent, and a task completion indicator. The classic Dec-POMDP model together with the independent and actor-critic deep learning algorithms was employed for multi-agent learning. The proposed UR model allows for explicit reward determination in the mobile digital environment, reducing the time for selecting user interface elements.
An original multi-agent sandbox environment was developed, enabling the autonomous collection of high-quality user data for various human-machine interaction tasks. It provides the ability to create interface elements as agents that adapt to usage scenarios, to train multiple agents simultaneously with the multi-agent reinforcement learning approach, and to perform real-time interface adaptations. A key feature of this work is the investigation of the impact of multi-agent reinforcement learning (MARL) algorithms on the usability of mobile user interfaces. Usability tests have shown that the IQL and MADDPG algorithms have their own strengths and weaknesses in terms of a positive user experience; however, the IQL algorithm has some advantages in terms of usability criteria.
Computational experiments in the proposed environment show that a user-friendly interface (based on the criterion of minimal element selection time) can be defined through online adaptations using human-machine interaction and multi-agent reinforcement learning.A comparison of the search and selection times between the UR model and two other interface types, Static and Frequency, reveals a difference of 105 ms, with the Frequency approach being approximately 9% slower than the UR model.
Finally, these results also indicate that RL remains one of the leading approaches in the field of tinyML and user interfaces: on the one hand, it opens up new levels of usability for users; on the other hand, on the technical side, it lays the foundation for the full integration of LLMs into mobile agents alongside RL.

Figure 1. Dynamics of the Dec-POMDP model operation in a multi-agent mobile user interface.

Here p*_j is the target value of parameter j, characterizing the state of agent i and the environment, p_j(t) is the current value of the parameter, and k_j is the coefficient of parameter j. The values of p_j are normalized in the range from 0 to 1.
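As a minimal sketch of this reward legend, the following Python fragment computes a reward that grows as each normalized parameter approaches its target; the function and parameter names, and the unit bonus for task completion, are illustrative assumptions rather than the authors' exact formulation.

```python
def ur_reward(targets, currents, coeffs, task_done):
    """Hypothetical UR-style reward: each parameter contributes its
    coefficient scaled by closeness of the current value to the target
    (all values assumed normalized to [0, 1]); a task-completion
    indicator adds a fixed bonus. Names are illustrative assumptions."""
    r = sum(k * (1.0 - abs(t - c)) for t, c, k in zip(targets, currents, coeffs))
    return r + (1.0 if task_done else 0.0)  # indicator bonus on success
```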


Figure 2. Multi-agent environment of a mobile kitchen user interface.


Figure 4. Actions of interface agents: move left, move right, move down, move up, rotate, resize.

Here success_i indicates successful completion of agent i's task; the maximum total reward of agent i resulting from performing one action (5) is bounded: R_i ≤ 5.

Figure 6. The average reward value of agent i in the UR model over the course of an episode (1000 adaptations).

Figure 7. Expected cumulative average reward of the set of interface element agents over the course of an episode (2000 adaptations).

Figures 8-10 show the results of Mobile User Interface (MUI) adaptation experiments during training using the multi-agent reinforcement learning algorithms IQL and MADDPG. The training episodes were set to a horizon of 200 environment adaptations. The results are aggregated over several dozen episodes (steps) to calculate the moving average loss during training (mean loss), as well as the local (reward) and cumulative (total reward) rewards for the agents.
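The smoothing described above (moving averages of per-step loss and reward) can be sketched as follows; the window size is an illustrative assumption, not a value reported in the paper.

```python
from collections import deque

def moving_average(values, window=50):
    """Sliding-window moving average, as used to smooth per-step
    training loss and reward curves. Early entries average over the
    partially filled window."""
    buf, out = deque(maxlen=window), []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out
```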


Figure 8. Moving average loss during the training process for the IQL and MADDPG algorithms.

Figure 9. Local reward based on the UR model in multi-agent learning for the IQL and MADDPG algorithms.

Figure 10. Cumulative reward based on the UR model in multi-agent learning for the IQL and MADDPG algorithms.


Figure 11. Comparison of the quality of adaptation of the MUI in usability criteria.

Multimodal Technol. Interact. 2024, 8, 26

One-way analysis of variance of the search and selection time (in milliseconds) as the dependent variable, with the conditions Static, Frequency, and UR model as the fixed independent variable, is presented in Figure 12. Vertical bars in the figure indicate 95% confidence intervals. The condition had a statistically significant effect on the selection time, with means of Static = 1285 ms, Frequency = 1273 ms, and the UR model = 1168 ms. A post-hoc Tukey-Kramer test showed that the UR model (1168 ms) was significantly faster (shorter search and selection time) than both Frequency (1273 ms) and Static (1285 ms); the difference between Static and Frequency was not statistically significant. Thus, the difference in search and selection time between the UR model and Frequency is 105 ms (i.e., the Frequency approach is approximately 9% slower than the UR model).
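The statistic underlying this analysis (one-way ANOVA) can be sketched in pure Python; the helper below computes only the F statistic, and the data in the test are illustrative, not the selection times from the study.

```python
def one_way_anova_F(*groups):
    """F statistic for a one-way ANOVA: ratio of between-group to
    within-group mean squares. A minimal sketch of the analysis
    design described above (conditions as groups of selection times)."""
    k = len(groups)                       # number of conditions
    n = sum(len(g) for g in groups)       # total observations
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

In practice, library routines (e.g., from SciPy or a statistics package) would also provide p-values and the post-hoc pairwise comparisons reported above.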

Figure 12. Difference in search and selection time for elements between the UR model approach in the current study and the other interfaces, Static and Frequency.


Table 1. Reinforcement Learning Algorithms and Models in User Interface Adaptation Tasks.
Table 2 presents popular metrics that can be used in the Usability Reward model.

Table 2. Measurable usability metrics.

Here Structured_Text_Entry() returns 1 if the input element displays a mask and 0 otherwise, and n is the total number of input elements that accept data in an exact format. success_i is the indicator function equal to 1 if the task of agent i is successfully completed (activated). An agent-element is also rewarded when the element is activated by the user; for example, if the user clicks on the element, the agent-element is rewarded 0.1 for each click.
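As a minimal sketch of the structured-text-entry metric just described, the fragment below computes the share of format-constrained input elements that display an input mask; the data structure and field name are hypothetical assumptions.

```python
def structured_text_entry_metric(inputs):
    """Share of format-constrained input elements that display a mask.

    inputs: list of dicts, one per input element that accepts data in
    an exact format, each with a boolean 'has_mask' flag (hypothetical
    representation). Returns a value in [0, 1]; 0.0 for an empty list.
    """
    n = len(inputs)  # total number of format-constrained input elements
    if n == 0:
        return 0.0
    return sum(1 for e in inputs if e["has_mask"]) / n
```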