DroidbotX: Test Case Generation Tool for Android Applications Using Q-Learning

Abstract: Android applications provide benefits to mobile phone users by offering useful functionality and interactive user interfaces. However, application crashes give users an unsatisfactory experience and negatively impact the application's overall rating. Android application crashes can be avoided through intensive and extensive testing. In the related literature, graphical user interface (GUI) test generation tools focus on generating tests and exploring application functions using different approaches. Such tools must choose not only which user interface element to interact with, but also which type of action to perform, in order to increase the percentage of code coverage and to detect faults within a limited time budget. However, a common limitation of these tools is low code coverage, caused by their inability to find the right combination of actions that can drive the application into new and important states. A Q-Learning-based test coverage approach, implemented in DroidbotX, is proposed to generate GUI test cases for Android applications that maximize instruction coverage, method coverage, and activity coverage. The overall performance of the proposed solution was compared to five state-of-the-art test generation tools on 30 Android applications. The DroidbotX test coverage approach achieved 51.5% instruction coverage, 57% method coverage, and 86.5% activity coverage. It triggered 18 crashes within the time limit and with the shortest event sequence length among the compared tools. The results demonstrate that the adaptation of Q-Learning with upper confidence bound (UCB) exploration outperforms existing state-of-the-art solutions. An empirical evaluation demonstrates the effectiveness of the proposed approach through a comparison with GUI test-generation tools on 30 Android apps.
Four criteria were used to evaluate and compare GUI test-generation tools: (i) instruction coverage, (ii) method coverage, (iii) activity coverage, and (iv) number of detected crashes. The experimental results revealed that the Q-Learning-based test coverage approach outperforms the state of the art in coverage and in the number of detected crashes within the shortest event sequence length. For future work, DroidbotX will be extended to include input text data, which may integrate text prediction to improve coverage.


Introduction
Android operates on 85% of mobile phones, with over 2 billion active devices per month worldwide [1]. The Google Play Store is the official market for Android applications (apps) and distributes over 3 million Android apps in 30 categories, including entertainment, customization, education, and financial apps [2]. A previous study [3] indicated that a mobile device, on average, has between 60 and 90 apps installed. In addition, an Android user spends, on average, 2 hours and 15 minutes on apps every day. Therefore, checking an app's reliability is a significant task. Recent research [2] showed that the number of Android apps downloaded is increasing drastically every year. Unfortunately, 17% of Android apps were still considered to be low-quality apps in 2019 [4]. Another study found that 53% of users would avoid using an app after it crashed [5]. A mobile app crash not only creates a poor user experience but also negatively impacts the app's overall rating [6,7]. The inferior quality of Android apps can be attributed to insufficient testing caused by rapid development practices. Android apps are ubiquitous, operate in complex environments, and evolve under market pressure. Android developers neglect appropriate testing practices because they consider testing time-consuming, expensive, and full of repetitive tasks. Mobile app crashes are avoidable through intensive and extensive testing of mobile apps [6]. Through a graphical user interface (GUI), mobile app testing verifies the functionality and accuracy of mobile apps before they are released to the market [8][9][10]. Automated mobile app testing starts by generating test cases that include event sequences of the GUI components. In the mobile app environment, the test input (or test data) is based on user interaction and system interaction (e.g., app notifications).
The development of GUI test cases usually takes a lot of time and effort because of the non-trivial structure and highly interactive nature of GUIs. Android apps [11,12] usually possess many states and transitions, which can lead to an arduous testing process and poor testing performance for large apps. Over the past decade, Android test generation tools have been developed to automate user interaction and system interaction as inputs [13][14][15][16][17][18]. The purpose of these tools is to generate test cases and explore the app's functions by employing different techniques: random-based, model-based, systematic, and reinforcement-learning-based. However, existing tools suffer from low code coverage [11,19,20], due to their inability to explore app functions extensively, because some app functions can only be reached through a specific sequence of events [21]. Such tools must choose not only which GUI component to interact with but also which type of input to perform, since each combination of input type and GUI component has the potential to improve coverage. Coverage is an important metric for measuring the efficiency of testing [22]. Combining different granularities, from instruction to method and activity coverage, is beneficial for better results in testing Android apps. The reason is that activities and methods are vital to app development, so the numeric values of activity and method coverage are intuitive and informative [23]. An activity is the primary interface for user interaction and comprises several methods and the underlying code logic. Each method in every activity comprises a different number of lines of code. Instruction coverage provides information about the amount of code that has been executed. Hence, improving instruction and method coverage ensures that more of the app's functionalities associated with each activity are explored and tested [23][24][25].
Similarly, activity coverage is a necessary condition for detecting crashes that can occur when interacting with the app's UI. The more coverage the tool achieves, the more likely it is to discover potential crashes [26].
This research proposes an approach that generates GUI test cases based on the Q-Learning technique. This approach systematically selects events and guides the exploration to expose the functionalities of an application under test (AUT), maximizing instruction, method, and activity coverage while minimizing the redundant execution of events.
This approach was implemented in a test tool named DroidbotX (https://github.com/husam88/DroidbotX), which is publicly available. A problem-based learning resource teaching DroidbotX is also available on the ALIEN (Active Learning in Engineering) (https://virtual-campus.eu/alien/problems/droidbotxgui-testing-tool/) virtual platform. The tool was used to evaluate the practical usefulness and applicability of our approach. DroidbotX constructs a state-transition model of an app and generates test cases that follow the sequences of events most likely to explore the app's functionalities. The proposed approach was evaluated against state-of-the-art test generation tools: DroidbotX was compared with Android Monkey [27], Sapienz [16], Stoat [15], Droidbot [28], and Humanoid [29] on 30 Android apps from the F-Droid repository [30].
In this study, instruction coverage, method coverage, activity coverage, and crash detection were analyzed to assess the performance of the approach. DroidbotX achieved higher instruction coverage, method coverage, and activity coverage, and detected more crashes than the other tools on the 30 subject apps. Specifically, DroidbotX achieved 51.5% instruction coverage, 57% method coverage, and 86.5% activity coverage, and triggered 18 crashes, outperforming the five tools.
The rest of this paper is organized as follows. Section 2 describes test case generation for Android apps. Section 3 discusses reinforcement learning with a focus on Q-Learning. Section 4 presents the background of Android apps, and Section 5 discusses related GUI testing tools. Section 6 presents the proposed approach, while Section 7 presents an empirical evaluation. Section 8 analyzes and discusses the findings. Section 9 describes threats to validity, and Section 10 concludes the paper.

Test Case Generation for Android Apps
Test case generation is one of the most attention-demanding testing activities because of its strong impact on the efficiency of the overall testing process [31]. The total cost, time, and effort required for the overall testing depend on the total number of test cases [32]. A pre-specified test case is a set of inputs provided to the application to obtain the desired output. Android apps are context-aware because of their ability to sense and react to a great number of different inputs from user and system interactions [33,34]. An app is tested with an automatically generated sequence of events simulating user interaction with the GUI, from the user's perspective down to the persistence layers. For example, interaction usually involves clicking, scrolling, or typing text into a GUI element, such as a button, image, or text block. Android apps can also sense and respond to multiple inputs from system interactions [33]. Interaction with system events includes SMS notifications, app notifications, or events coming from sensors. The underlying software responds by executing an event handler, i.e., an ActionListener, in one of several ways. These are some of the events that need to be addressed in testing Android apps, as they effectively increase the complexity of app testing [35].

Q-Learning
Q-Learning is a model-free reinforcement learning (RL) technique [36]. RL is a branch of machine learning. Unlike other branches, such as supervised and unsupervised learning, its algorithms are trained through reward and punishment while interacting with the environment. It is based on a concept from behavioral psychology in which learning occurs by interacting directly with an environment, a key component in artificial intelligence. In RL techniques, a reward is observed when the agent reaches an objective. RL techniques include Actor-Critic, Deep Q-Network (DQN), State-Action-Reward-State-Action (SARSA), and Q-Learning. The major components of RL are the agent and the environment. The agent is an independent entity that performs unconstrained actions within the environment in order to achieve a specific goal. The agent performs an action on the environment and uses trial-and-error interactions to gain information about the environment. There are four other basic concepts in an RL system along with the agent and the environment: (i) policy, (ii) reward, (iii) action, and (iv) state. The state describes the present situation of the environment. A model of the environment mimics its behavior: given the current state and action, the model might predict the resultant next state and the next reward. Models are used to plan and decide on a course of action by considering possible future situations before they are experienced. The reward is an abstract concept for evaluating actions; it refers to the immediate feedback received after performing an action. The policy defines the agent's approach to selecting an action from a given state. It is the core of the RL agent and is sufficient to determine behavior. In general, policies may be stochastic. An action is a possible move in a particular state.
Q-Learning is used to find an optimal action-selection policy for the given AUT, where the policy sets out the rule that the agent must follow when choosing a particular action from a set of actions [37]. Choosing an action is immediately followed by its execution, which moves the agent from the current state to a new state. The agent receives a reward r upon executing the action a. The value of the reward is measured using the reward function R. For the agent, the main aim of Q-Learning is to learn how to act in an optimal way that maximizes the cumulative reward. Thus, the reward is accumulated as an entire sequence of actions is carried out.
Q-Learning uses its Q-values to solve RL problems. For each policy π, the action-value function or quality function (Q-function) Q^π is defined. The value Q^π(s, a) is the expected cumulative reward that can be achieved by executing a sequence of actions that starts with action a from state s and then follows the policy π. The optimal Q-function Q* is the maximum expected cumulative reward achievable for a given (state, action) pair over all possible policies: Q*(s, a) = max_π Q^π(s, a). Intuitively, if Q* is known, the optimal strategy at each time step t is to take the action that maximizes the sum r_t + γ max_a Q*(s_{t+1}, a), where r_t is the immediate reward of the current step and t + 1 denotes the next step. The discount factor (γ) is introduced to control the relevance of long-term rewards relative to immediate ones. Figure 1 presents the RL mechanism in the context of Android app testing. In automated GUI testing, the AUT is the environment, a state is the set of actions available on the AUT screen, the GUI actions are the set of actions available in the current state of the environment, and the testing tool is the agent. Initially, the testing tool knows nothing about the AUT. As the tool generates and executes test event inputs based on trial-and-error interaction, its knowledge about the AUT is updated to find a policy that facilitates systematic exploration and efficient future action-selection decisions. This exploration generates event sequences that can be used as test cases.
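The update above can be sketched in a few lines of tabular Q-Learning. This is an illustrative sketch, not the DroidbotX implementation; the state and action encodings and the values of α and γ are assumptions:

```python
from collections import defaultdict

# Tabular Q-function: Q[state][action] -> estimated cumulative reward.
Q = defaultdict(lambda: defaultdict(float))

ALPHA = 0.5  # learning rate (assumed value)
GAMMA = 0.9  # discount factor controlling long-term reward relevance

def q_update(state, action, reward, next_state, next_actions):
    """One Q-Learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max((Q[next_state][a] for a in next_actions), default=0.0)
    Q[state][action] += ALPHA * (reward + GAMMA * best_next - Q[state][action])
```

With γ = 0, only immediate rewards matter; values closer to 1 make the agent weigh long-term rewards more heavily, matching the role of the discount factor described above.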

Android Apps Execution
There are four key components of an Android app: (i) activities, (ii) services, (iii) broadcast receivers, and (iv) content providers. Each component represents a point through which the user or the system communicates with the app. These components must be declared in the corresponding XML (eXtensible Markup Language) file. The Android app manifest is an invaluable XML file stored in the root directory of the app's source as AndroidManifest.xml. When the app is compiled, the manifest file is converted into a binary format. This file provides the necessary details about the app to the Android system, such as the package name and App ID, the minimum API (application programming interface) level required, the list of mandatory permissions, and the hardware specifications.
An activity is the interface layer of the application that the user manipulates to engage with it. Each activity contains a group of layouts, such as the linear layout, which organizes the screen items horizontally or vertically. The interface includes GUI elements, known as widgets or controls, such as buttons, text boxes, search bars, switches, and number pickers. These elements allow users to interact with the app. Activities are handled as task stacks within the system. When an app is launched in the Android system, a new activity starts by default; it is placed at the top of the current stack and automatically becomes the running activity. The previous activity remains in the stack just below it and does not return to the foreground until the new activity exits. The activity at the top of the stack is the one visible on the screen. Activities are the primary target of testing tools for Android apps, as the user navigates through the screens. The complete lifecycle of an activity is described by the following activity callback methods: created, paused, resumed, and destroyed. These callbacks are invoked as the activity changes status. The activity lifecycle is tightly coupled with the Android framework and is managed by an essential service called the Activity Manager [38].
An activity comprises a set of views and fragments that present information to the user while interacting with the application. A fragment is a class that contains a portion of the app's user interface or behavior and can be placed as part of an activity. Fragments support more dynamic and flexible user interface (UI) designs on large screens such as tablets, and have been available in Android from API level 11 onwards. A fragment must always be embedded in an activity, and its lifecycle is directly affected by the lifecycle of the host activity: fragments inside an activity are stopped if the activity is stopped and destroyed if the activity is destroyed.

Related Works
Researchers have developed approaches to automate test generation for Android apps. This section highlights the existing tools with corresponding approaches. Table 1 classifies Android test generation tools based on seven features as follows: (1) technique, (2) test case generation approach, (3) test inputs, (4) testing environment, (5) test artifacts, (6) basis, and (7) availability.

Automated Graphical User Interface (GUI) Testing with Q-Learning
Mariani et al. [39] proposed AutoBlackTest, the first Q-Learning-based GUI testing tool for Java desktop software. AutoBlackTest first extracts an abstract representation of the current state of the GUI and generates a behavioral model. This model is updated according to the current state reached and the immediate utility of the executed action. The behavioral model is then used to select the next action to execute, after which the loop restarts. TESTAR [40], another Q-Learning-based tool, generates GUI test sequences for web applications. Its Q-Learning algorithm delivered significant performance with an adequate set of parameters.
GunPowder [41] is a test input generation tool for search-based test data generation using deep RL. GunPowder was developed specifically for C applications and consists of three phases: (i) instrumentation, (ii) execution, and (iii) fitness evaluation. In the instrumentation phase, it adds instrumentation code that allows the tool to control and monitor the execution of the program. In the execution phase, the tool builds and executes the program, and in the fitness evaluation phase, a machine learning algorithm is applied to generate test inputs. Currently, the fitness function supported by GunPowder aims to improve branch coverage. Although GunPowder is not suitable for Android app testing, other studies have adopted RL techniques for Android testing [24,42,43].
Vuong and Takada [42] proposed a Q-Learning-based automated test case generation tool for Android apps that uses a Markov model to describe the AUT. The tool learns the most relevant behavioral model of the AUT and generates test cases based on this model. The tool executes a sequence of a fixed number of events, called an episode. After finishing an episode, the tool selects a random state from those already visited and starts a new episode from it. However, this tool has multiple limitations; for example, it only generates UI events and does not cover activities triggered by system events.
Adamo, Khan, Koppula, and Bryce [43] introduced a Q-Learning-based automated test case generation tool for Android apps built on top of Appium and UI Automator. During test case generation, the tool chooses the event with the highest Q-value from the set of available events in each state. The test case generation process is quite similar to previous work [42]: it is divided into episodes, and the states used in previous episodes serve as starting points for new episodes. The authors define a state as the set containing the unique actions available.
AndroFrame [24] is a Q-Learning-based exploration tool that generates test cases. Instead of using a random-based approach, it explores the GUI based on a pre-approximated probability distribution that satisfies a test objective. It creates a Q-matrix showing the probabilities of reaching the test objective, which is used to select the next action. However, AndroFrame has inconsistent activity coverage and only works with single-objective fitness functions, where each run has only one objective: to increase activity coverage or to search for crashes.

Automated GUI Testing with Reinforcement Learning
AimDroid [44] is a model-based test case generation tool for Android apps that implements an RL-guided random approach. AimDroid performs two activities: it runs a breadth-first search to discover unexplored activities, and it insulates each discovered activity in a "cage" and intensively exploits it using an RL-guided fuzzing algorithm. The tool divides testing into episodes; each episode generates a bounded number of events and focuses on a single activity by disabling activity transitions. Furthermore, AimDroid uses an RL algorithm called SARSA to learn which events are likely to discover new activities, to "look ahead", and to greedily select events that are more likely to trigger new activities and crashes. AimDroid also has some limitations. For example, it disables activity transitions, which may miss some faults caused by the activity lifecycle. Moreover, AimDroid does not learn the second-best event to choose: it only knows the best SARSA-based event, and for all other events, it chooses randomly.
Humanoid [29] was developed along with DroidBot [28] and was introduced to learn how users interact with Android apps. Humanoid uses a GUI model to comprehend and analyze the behavior of the AUT and gives preference to UI components that humans tend to interact with. Humanoid works in two stages: (1) an offline learning phase, in which a deep neural network model is used to learn the relationship between GUI contexts and user-performed interactions, and (2) an online testing phase, in which Humanoid builds a UI transition model for the AUT. In the second phase, it uses the UI transition model and the interaction model to determine the type of test input to send. The UI transition model guides Humanoid in navigating the explored UI states, while the interaction model guides the discovery of new UI states. As a limitation, this tool does not show a coverage improvement compared to other tools and is unable to use the textual information available in the app to generate test cases.

Automated GUI Testing Approaches
Several approaches have been developed to facilitate test case construction and to explore Android apps: (i) random-based, (ii) model-based, and (iii) systematic testing.
Random-based testing is one of the popular techniques for detecting system-level faults within an app. It floods the app with events, many of them infeasible, and thus fails to explore all of the app's functionalities [45]. This approach generates events efficiently, which makes it suitable for stress testing. Android Monkey [27] is a black-box GUI testing tool in the Android SDK (Software Development Kit). Among the current test generation tools, this random-based testing tool has gained considerable popularity in the community. Apart from its simplicity, it has demonstrated good compatibility with a myriad of Android platforms, making it the most commonly used tool for numerous industrial applications [19,46]. However, Android Monkey requires more time to generate a long sequence of events. These events include redundant events that repeatedly jump between app activities and unfruitful events that click on non-interactive areas of the screen [44,47,48]. Hu et al. [49] developed an approach on top of Android Monkey to automatically detect UI crashes. Dynodroid [13] uses two heuristic algorithms that select events relevant to the app's current state and repeats the process in an observe-select-execute cycle. However, Dynodroid uses instrumentation to infer relevant events to guide exploration; in contrast, our approach generates UI-guided inputs without instrumentation. SmartMonkey [50] uses the random testing approach proposed by [45] to reduce the number of test cases and the time required to identify the first fault. It constructs a transition model of the app based on random interaction and generates test cases through a random-walk technique. SmartMonkey generates test cases consisting of sequences of user and system events.
Model-based testing uses a graph-based model to represent user interaction with the app's GUI. The model is designed either manually or automatically from the AUT's specifications, such as code or XML configuration files, or through direct interaction with the app. Model-based exploration can be guided toward specific unexplored parts using a systematic strategy such as depth-first exploration, breadth-first exploration, or a hybrid [23], or a stochastic model [15]. The model-based technique suffers from inaccurate modeling. More specifically, dynamic behaviors in GUIs can produce inaccurate models or state explosion issues due to non-deterministic changes in the GUI. In such cases, a model-based approach may ignore the changes, treat the event as unimportant, and proceed with a different discovery path. Explicitly, a GUI model that captures only a limited range of the possible behavioral space will reduce the effectiveness of tests. Baek et al. [51] conducted a multi-level state representation study showing that different levels of abstraction affect the effectiveness of the modeling tool. A3E [23] explores apps with two strategies: depth-first exploration, which systematically analyzes apps while running on actual devices and without access to the source code, and targeted exploration, which prioritizes the exploration of activities starting from the initial activity on a static activity transition graph. A3E represents every activity as an individual state without considering that an activity can exist in different states. This leads to missing some behaviors of the application, as not all states of the activities are explored. Orbit [52] statically analyzes the app's source code to generate the relevant events supported by the app. However, it uses a simple depth-first exploration that restarts the app from its initial state to backtrack to previous states.
PUMA [53] includes a generic UI Automator that implements the same basic random approach as Android Monkey; however, it differs in its design and uses dynamic analysis to trigger changes in the environment during app execution. Stoat [15] performs stochastic model testing in two phases. First, it creates a probabilistic model by dynamically exploring and analyzing the app's GUI interactions. Second, it optimizes the state model by performing Gibbs sampling and directs test generation from the optimized model to maximize code and activity coverage. Ape [54] uses a dynamic modeling approach that optimizes the initial GUI model by leveraging runtime information.
Systematic testing uses more sophisticated techniques, such as symbolic execution and evolutionary algorithms, to generate specific inputs. The strength of this approach is that it can leverage the source code to generate tests that reveal previously uncovered application behavior. Thor [55] executes existing test cases under adverse conditions. However, Thor does not generate test cases and relies on injecting existing test cases with sequences of events that do not affect the outcome of the original test cases. CrashScope [56] automatically detects AUT crashes using a hybrid strategy that combines systematic exploration and static analysis. AppDoctor [57] uses an approximate execution approach to speed up testing and automatically classifies most reports as bugs or false positives. This approach offers the ability to replay bugs and presents stack traces to the developer. However, it must replay crash traces to prune false positives, and it does not offer highly detailed and expressive reports. SIG-Droid [58] is an automated system input generation tool for Android apps that combines program analysis techniques with symbolic execution to achieve high code coverage. Sapienz [16] uses a fully automated multi-objective search-based testing approach. It adapts genetic algorithms to optimize test sequences to maximize code coverage and fault detection while minimizing test sequence length. Sapienz explores the app components by using specific GUIs and complex sequences of input events with a pre-defined pattern. This pre-defined pattern, termed the motif gene, captures the experience of testers. Thus, Sapienz produces higher code coverage by concatenating atomic events.

Proposed Approach: Q-Learning to Generate Test Case for Android Apps
The idea behind using Q-Learning is that the tabular Q-function assigns a reward to each selection of a possible action over the app. This reward may vary according to the test objective. Thus, events that have never been selected can present a higher reward than events that have already been executed, which reduces the redundant execution of events and increases coverage.
Q-Learning has been used in software testing in the past and has shown better results than the random exploration strategy [24,42,43,59]. However, a common limitation of all these tools is that the reward function assigns the highest reward when an event is executed for the first time, to maximize coverage or locate crashes. In the proposed approach, by contrast, the environment does not offer direct rewards to the agent; the agent itself tries to visit all states to collect more rewards. The proposed approach uses tabular Q-Learning like the other approaches but adopts an effective exploration strategy that reduces the redundant execution of actions and uses different state and action spaces. Action selection is the main part of Q-Learning in finding an optimal policy. The policy is the process that decides on the next action from the set of current actions. Unlike previous studies, the proposed approach utilizes the upper confidence bound (UCB) exploration-exploitation strategy as a learning policy to create an efficient exploration strategy for GUI testing. UCB tries to ensure that each action is explored well and is the most widely used solution for multi-armed bandit problems [60]. The UCB strategy is based on the principle of optimism in the face of uncertainty.
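A UCB1-style selection rule of the kind described above can be sketched as follows. This is a hedged illustration: the exploration constant and the bookkeeping structures are hypothetical, not taken from DroidbotX:

```python
import math

C = 1.4  # exploration constant (assumed value)

def ucb_select(actions, q_values, visit_counts):
    """Select the action maximizing Q(s, a) + C * sqrt(ln N / n(a)),
    where N is the total visit count in this state and n(a) the count
    of action a; unvisited actions are always tried first."""
    total = sum(visit_counts.get(a, 0) for a in actions)

    def score(a):
        n = visit_counts.get(a, 0)
        if n == 0:
            return float("inf")  # optimism: untried actions rank highest
        return q_values.get(a, 0.0) + C * math.sqrt(math.log(total) / n)

    return max(actions, key=score)
```

The second term shrinks as an action accumulates visits, so the rule balances exploiting high-value actions against exploring rarely tried ones, which is the exploration-exploitation trade-off UCB is designed to manage.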

Implementation
The Q-Learning technique with a UCB exploration strategy was adopted to generate GUI test cases for Android apps to improve coverage and crash detection. This approach was built into a test tool named DroidbotX, whose main purpose was to evaluate the practical usefulness and applicability of the proposed approach. DroidbotX works with Droidbot [28], a UI-guided input generation tool used mainly for malware detection and compatibility testing. Droidbot was chosen because it is open-source and can test apps without access to the apps' source code. Moreover, it can be used on an emulator or a real device without instrumentation and is compatible with all Android APIs. The DroidbotX algorithm tries to visit all states because it assumes "optimism in the face of uncertainty". This principle is a well-known heuristic in sequential decision-making problems and a common element of exploration methods. The agent believes that it can obtain more rewards by reaching unexplored parts of the state space [61]. Under this principle, actions are selected greedily, but strong optimistic prior beliefs are placed on their payoffs, so that strong contrary evidence is needed to eliminate an action from consideration. This technique has been used in several RL algorithms, including the interval exploration method [62]. In other words, visiting new states and performing new actions brings the agent more reward than revisiting old states and repeating old actions. Therefore, the algorithm starts from an empty Q-function matrix and assumes that every state-action pair rewards the agent with +1. When it visits state s and performs action a, the Q-function value Q(s, a) decreases, and the priority of that action for the state becomes lower. Our DroidbotX approach generates sequences of test inputs for Android apps that do not have an existing GUI model. The overall DroidbotX architecture is shown in Figure 2.
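The optimistic initialization described above can be sketched as follows. The exact decay rule is an assumption for illustration; the text states only that Q(s, a) decreases each time the pair is visited:

```python
from collections import defaultdict

INITIAL_Q = 1.0  # optimistic prior: every untried action is assumed rewarding
DECAY = 0.5      # hypothetical decay; the paper only says Q(s, a) decreases

# Unseen (state, action) pairs default to the optimistic value, so they
# always outrank pairs that have already been executed.
Q = defaultdict(lambda: defaultdict(lambda: INITIAL_Q))

def visit(state, action):
    """Executing an action lowers its Q-value and thus its future priority."""
    Q[state][action] *= DECAY
    return Q[state][action]
```

Because every unexecuted action keeps the full optimistic value of +1, the selection strategy is steered toward unexplored parts of the state space, which is exactly the "optimism in the face of uncertainty" behavior described above.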
In Figure 2, the adapter acts as a bridge between the test environment and the test generation algorithm. The adapter is connected to an Android device or an emulator via the Android Debug Bridge (ADB). The adapter's observer monitors the AUT and sends the current state to the test generator; simultaneously, the executor receives the test inputs generated by the algorithm and translates them into commands. The test generator interacts with and explores the app's functionalities following the observe-select-execute strategy: all GUI actions of the current state of the AUT are observed, one action is selected based on the selection strategy under consideration, and the selected action is executed on the AUT. Similar to other test generators, DroidbotX uses a GUI model, called a UI transition graph (UTG), to record transitions. The UTG guides the tool in navigating between the explored UI states. It is constructed dynamically at runtime as a directed graph whose nodes are UI states and whose edges are the actions that lead to UI state transitions. A state node contains the GUI information, the running process information, and the methods triggered by the action. DroidbotX uses the Q-Learning-based test coverage approach shown in Algorithm 1 and constructs the UI transition graph with Algorithm 2.
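The observe-select-execute strategy above can be sketched as a simple loop. This is a minimal illustration, not DroidbotX's actual code; `device`, `observe`, `enabled_actions`, `execute`, and `FakeDevice` are hypothetical stand-ins for the adapter's observer and executor roles.

```python
def observe_select_execute(device, select, max_events=1000):
    """Observe the current state, select one action, execute it, repeat."""
    utg = []  # UI transition graph edges: (state, action, next_state)
    for _ in range(max_events):
        state = device.observe()             # observer: current GUI state
        actions = device.enabled_actions(state)
        if not actions:                      # no new actions: stop exploring
            break
        action = select(state, actions)      # test generator picks an action
        next_state = device.execute(action)  # executor sends it to the device
        utg.append((state, action, next_state))
    return utg

# A toy two-screen device to exercise the loop.
class FakeDevice:
    def __init__(self):
        self.screen = "A"
    def observe(self):
        return self.screen
    def enabled_actions(self, state):
        return ["next"] if state == "A" else []
    def execute(self, action):
        self.screen = "B"
        return self.screen

utg = observe_select_execute(FakeDevice(), lambda s, acts: acts[0])
assert utg == [("A", "next", "B")]  # one transition, then exploration stops
```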

States and Actions Representation
In an Android app, all the UI widgets of an app activity are organized in a GUI view tree [51]. The GUI tree can be extracted via UI Automator, a tool provided by the Android SDK. UI widgets include buttons, text boxes, search bars, switches, and number pickers. Users interact with the app using click, long-click, scroll, swipe up, swipe down, input text, and other gestures, collectively called GUI actions or actions. Every action is represented by its action type and target location coordinates. A GUI action is either (1) widget-dependent, such as click and text input, or (2) widget-independent, such as back, which presses the hardware back button. An action is denoted by a 5-tuple a = (w, t, v, k, id), where w is a widget on a particular state, t is the type of action that can be performed on the widget (e.g., click, scroll, swipe), and v holds arbitrary text if widget w is a text field; for all non-text-field widgets, the value v is empty. Moreover, k is a key event, which includes the back, menu, and home buttons of the device, and id is the widget ID. Note that DroidbotX also sends intent actions that install, uninstall, and restart the app.
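The 5-tuple a = (w, t, v, k, id) can be sketched as a small record type. Field names here are illustrative, not DroidbotX's own identifiers.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class GuiAction:
    widget: Optional[str]             # w: widget on a particular state (None for key events)
    action_type: str                  # t: click, long-click, scroll, swipe, ...
    text: str = ""                    # v: arbitrary text, only for text-field widgets
    key_event: Optional[str] = None   # k: back, menu, or home hardware button
    widget_id: Optional[str] = None   # id: the widget's ID

# A widget-dependent action: clicking a button on the current screen.
click = GuiAction(widget="btn_new_game", action_type="click", widget_id="id/new_game")
# A widget-independent action: pressing the hardware back button.
back = GuiAction(widget=None, action_type="key", key_event="back")

assert click.text == ""       # non-text-field widgets carry an empty value v
assert back.widget is None    # widget-independent actions need no widget
```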
State abstraction refers to the procedure that identifies equivalent states. In this approach, state abstraction deems two states equivalent if (1) they have similar GUI content, which includes package, activity, widget type, position, and widget parent-child relationships, and (2) they have the same set of actions on all interactive widgets; this criterion is widely used in previous GUI testing techniques [44,63,64]. A GUI state s ∈ S describes the attributes of the current screen of the Android device, where S denotes the set of all states. A content-based comparison together with the set of enabled actions decides state equivalence: two states with different UI contents and different enabled actions are assumed to be different states.
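The content-based equivalence check can be sketched by hashing the state attributes listed above. The dictionary layout and field names are a hypothetical simplification of what UI Automator actually reports.

```python
import hashlib
import json

def state_signature(state):
    """Hash the GUI content and enabled actions; equal hashes mean equivalent states."""
    content = {
        "package": state["package"],
        "activity": state["activity"],
        # widget type, bounds (position), and parent-child relationship
        "widgets": sorted(
            (w["type"], tuple(w["bounds"]), w.get("parent", ""))
            for w in state["widgets"]
        ),
        "actions": sorted(state["enabled_actions"]),
    }
    blob = json.dumps(content, sort_keys=True, default=str)
    return hashlib.sha1(blob.encode()).hexdigest()

s1 = {"package": "com.hotdeath", "activity": "Main",
      "widgets": [{"type": "Button", "bounds": [0, 0, 50, 20], "parent": "root"}],
      "enabled_actions": ["click"]}
s2 = dict(s1, activity="Settings")

assert state_signature(s1) == state_signature(dict(s1))  # same content: equivalent
assert state_signature(s1) != state_signature(s2)        # different activity: distinct
```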
To illustrate the state and action representation, take the Hot Death app as an example. Hot Death is a variation of the classic card game. The main page includes new game, settings, help, about, and exit buttons. Figure 3 shows a screenshot of the app's main activity, its initial state, and the related widgets with their set of enabled actions. A widget-dependent action is detected when its related widget exists on the screen; for example, a click action exists only if there is a related widget with the attribute clickable set to true. Widget-independent actions are available in all states, because the user can press device hardware buttons such as home at any time.

Exploration Strategy
Android apps can have complex interactions among the events that UI widgets can trigger, the states that can be reached, and the resulting coverage. In automated testing, the test generator must choose not only which widget to interact with, but also what type of action to perform, and each type of action on each widget may improve coverage differently. Our goal is to interact with the app's widgets by dynamically sending relevant actions for each widget. This reduces the number of ineffective actions performed and explores as many app states as possible. Thus, UCB was used as an exploration policy to seek out new states and try new actions. For each state, all potential widgets are extracted with their IDs and location coordinates, and the approach then systematically chooses among five action types (i.e., click, long-click, scroll, swipe left/right/up/down, and input text data) to interact with each widget. Next, the approach checks whether the action brings the app to a new state by comparing its contents with all other states in the state model. If the agent identifies a new state, the exploration policy is applied recursively on the new state to discover unexplored actions. The exploration policy does not know the consequences of each action in advance; the decision is made based on the Q-function. When exploration of a state terminates, an intent is executed to restart the AUT. An Android intent is a message passed between Android app components, such as the startActivity method used to invoke an activity. Examples of termination are an action that causes the AUT to crash, an action that switches to another app, or a click on the home button. The home action always closes the AUT, while the back action often does. The exploration passes login screens by searching a set of pre-defined inputs; some existing tools, such as Android Monkey [27], stop at the login screen and fail to exercise the app beyond it.
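The per-widget action extraction described above can be sketched as a mapping from widget attributes to applicable action types. The attribute names mirror those reported by UI Automator, but the exact mapping is an assumption for illustration.

```python
def actions_for_widget(widget):
    """Generate only the action types a widget's attributes support,
    avoiding ineffective events on widgets that cannot react to them."""
    actions = []
    if widget.get("clickable"):
        actions.append("click")
    if widget.get("long_clickable"):
        actions.append("long_click")
    if widget.get("scrollable"):
        actions.extend(["scroll", "swipe_up", "swipe_down"])
    if widget.get("editable"):
        actions.append("input_text")
    return actions

assert actions_for_widget({"clickable": True}) == ["click"]
assert actions_for_widget({"scrollable": True}) == ["scroll", "swipe_up", "swipe_down"]
assert actions_for_widget({}) == []  # no matching attribute, no event is sent
```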

Observer and Rewarder
The goal of the observer is to monitor the results of actions on the AUT; the Q-function then rewards the actions based on those results. Algorithm 1 uses the input parameters to explore the GUI and produces a set of event sequences as a test case for the AUT. The Q-function Q(s, a) takes a state s and an action a. The Q-function matrix is constructed based on the current state: each row in the matrix represents the expected Q-values for a particular state, and the row size equals the number of possible actions for that state. The getEventfromActor function at lines 23-26 obtains all the GUI actions of the current state of the AUT. The initial values of the actions in the current state are assigned as 1 at line 26. The UpdateQFunction function at lines 13-21 decreases the value of an action to 0.99 when the test generator executes it in the state. When all action values are 0.99, the maximum value becomes 0.99, and the test generator starts choosing among these actions again. Then one action is selected and executed, and when a new state is found, the Q-function trainer receives the next state and updates the Q-function matrix for the previous state. The test generator sends KeyEvents such as the back button at lines 27-28 if the state is the last or if there are no new actions in the current state.
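The row initialization and decay behavior described above can be sketched as follows. This is a minimal illustration of the optimistic-initialization scheme, assuming the paper's values (initial 1.0, decayed to 0.99); `init_row` and `update_q_row` are hypothetical names, not the algorithm's own getEventfromActor/UpdateQFunction.

```python
def init_row(actions):
    """Each state's row starts optimistically: every action is worth 1.0."""
    return {a: 1.0 for a in actions}

def update_q_row(row, executed):
    """An executed action drops to 0.99, lowering its priority."""
    row[executed] = 0.99
    return row

row = init_row(["click_ok", "click_cancel", "back"])
update_q_row(row, "click_ok")
assert max(row.values()) == 1.0   # unexplored actions are still preferred
for a in list(row):
    update_q_row(row, a)
assert max(row.values()) == 0.99  # all explored: the generator re-selects among them
```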
The Q-Learning algorithm uses Equation (2) to estimate the value of Q(s, a) iteratively. The Q-function is initialized with a default value. Whenever the agent executes an action a from state s to reach s′ and receives a reward r, the Q-function is updated as in Equation (2), where α is a learning-rate parameter between 0 and 1 and γ is a discount rate.
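Written out in the standard tabular form (which matches the symbols α, γ, s, a, s′, and r defined above), the update referenced as Equation (2) reads:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```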

Action Selector
The action selection strategy is a crucial feature of DroidbotX. The right actions can increase the likelihood of, and decrease the time necessary for, navigating to various app execution states. In the initial state, the test generator chooses the first action based on a randomized exploration policy, to avoid handling GUI layouts in a fixed systematic order in each state. Then, the test generator selects actions from the new states and generates event sequences in a way that attempts to visit all states. The Q-function calculates the expected future rewards for actions based on the set of states visited. In each state, the test generator chooses the action with the highest expected Q-value from the set of available actions using the getSoftArgmaxAction function at lines 32-36; the predicted Q-value for that action is then reduced, so the test generator will not choose it again until all other actions have been tried. Formally, the selected action is picked by Equation (3).
Equation (3) depicts the basic idea of the UCB strategy: the expected overall reward of action a is Q(s, a), N(s) denotes how often actions have been selected in state s, N(s, a) is the number of times action a was selected in state s, and c is a confidence value that controls the level of exploration (set to 1). This method is known as "exploration through optimism": it gives less-explored actions a higher value and encourages the test generator to select them. The test generator uses the Q-function learned by Equation (2) and the UCB strategy to select each action intelligently, balancing exploration and exploitation of the AUT.
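The UCB rule can be sketched as selecting the action maximizing Q(s, a) + c·sqrt(ln N(s) / N(s, a)), with c = 1 as in the text. This is an illustrative implementation, not DroidbotX's code; never-selected actions get an infinite bonus so that each action is tried at least once.

```python
import math

def ucb_select(q, n_state, n_action, actions, c=1.0):
    """Pick argmax_a [ Q(s,a) + c * sqrt(ln N(s) / N(s,a)) ] for the current state."""
    def score(a):
        if n_action.get(a, 0) == 0:
            return float("inf")  # optimism: untried actions are selected first
        bonus = c * math.sqrt(math.log(n_state) / n_action[a])
        return q.get(a, 0.0) + bonus
    return max(actions, key=score)

q = {"click_ok": 0.5, "click_cancel": 0.2}
n_action = {"click_ok": 3, "click_cancel": 1}  # N(s, a) counts; N(s) = 4
# "back" has never been tried in this state, so UCB picks it first.
assert ucb_select(q, n_state=4, n_action=n_action,
                  actions=["click_ok", "click_cancel", "back"]) == "back"
```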

Test Case Generation
A test case is defined as a sequence of transitions T = (s1, a1, s2), (s2, a2, s3), . . . , (sn, an, sn+1), where n is the length of the test case. Each episode is considered a test case, and each test suite is a set of test cases. A transition is a 3-tuple (start-state s; action a; end-state s′). Algorithm 2 dynamically constructs a UI transition graph to navigate between the explored UI states. It takes three input parameters: (1) the app under test, (2) the Q-function for all state-action pairs generated by Algorithm 1, and (3) a test suite completion criterion, namely a fixed number of event sequences (set to 1000). When DroidbotX's test generator explores a new state s_t, it adds a new edge <s_{t−1}; a_{t−1}; s_t> to the UI transition graph, where s_{t−1} is the last observed UI state and a_{t−1} is the action performed in s_{t−1}. For instance, consider the generation of a test suite for the Hot Death Android app. DroidbotX creates an empty UI transition graph (line 1), explores the current state of the AUT (line 3), observes all GUI actions of the current state (line 5), and constructs a Q-function matrix. Then one action is selected and executed based on the getSoftArgmaxAction function (line 7); when a new state is found, the UpdateQFunction function receives the next state and updates the Q-function matrix for the previous state. The transition consisting of the executed action, next state, and previous state is added to the graph (line 15). The Q-value of the executed action is decreased to avoid re-using the same action in the current state. The process is repeated until the target number of actions is reached. Figure 4 shows an example UTG from the Hot Death Android app.
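The UTG construction in Algorithm 2 can be sketched as a directed graph keyed by (state, action) pairs. The adjacency-dict layout below is a hypothetical simplification of the model DroidbotX builds at runtime.

```python
class UITransitionGraph:
    """Directed graph: nodes are UI states, edges are the actions
    that caused each observed state transition."""
    def __init__(self):
        self.edges = {}  # (start_state, action) -> end_state

    def add_edge(self, start_state, action, end_state):
        self.edges[(start_state, action)] = end_state

    def successors(self, state):
        """All known transitions out of a state, for navigation."""
        return {a: s2 for (s1, a), s2 in self.edges.items() if s1 == state}

utg = UITransitionGraph()
utg.add_edge("main", "click_new_game", "game")
utg.add_edge("main", "click_help", "help")
assert utg.successors("main") == {"click_new_game": "game", "click_help": "help"}
```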

Empirical Evaluation
This section provides an evaluation of DroidbotX against state-of-the-art tools using 30 Android apps. The evaluation employs the empirical case study method used in software engineering, as reported in [65,66], and intends to answer the following research questions: RQ.1: How does the coverage achieved by DroidbotX compare to that of the state-of-the-art tools? RQ.2: How effective is DroidbotX at detecting unique app crashes compared to the state-of-the-art tools? RQ.3: How does DroidbotX compare to the state-of-the-art tools in terms of test sequence length?

Case Study Criteria
Four criteria were used for the evaluation, as follows: (1) Instruction coverage refers to coverage of the Smali [67] code instructions obtained by decompiling the APK installation package. It is the ratio of instructions triggered in the app's Java instruction code to the total number of instructions. Huang et al. [68] first proposed the concept of instruction coverage, which is used in many studies as an indicator of test efficiency [24,44,54,64]. It is an accurate and valid test coverage criterion that reflects the adequacy of testing results for closed-source apps [25]. (2) Method coverage is the ratio of the number of methods called during execution of the AUT to the total number of methods in the app's source code. By improving method coverage, more functionalities of the app are explored and tested [11,23,24,26]. (3) Activity coverage is defined as the ratio of activities explored during execution to the total number of activities in the app. (4) Crash detection: an Android app crashes when there is an unexpected exit caused by an unhandled exception [69]. Crashes terminate the app's processes, and a dialogue is displayed to notify the user about the crash. The further the tool explores the code, the more likely it is to discover potential crashes.
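The three coverage criteria are all simple ratios reported as percentages; a one-line sketch with illustrative counts (not taken from the paper's subject apps):

```python
def coverage(covered, total):
    """Coverage criterion as a percentage: covered items / total items."""
    return round(100.0 * covered / total, 1)

assert coverage(515, 1000) == 51.5  # e.g., instructions triggered / total instructions
assert coverage(57, 100) == 57.0    # e.g., methods called / total methods
assert coverage(173, 200) == 86.5   # e.g., activities explored / total activities
```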

Subject Selections
We used 30 Android apps chosen from the F-Droid repository [30] for the experimental analysis. The apps were chosen based on their number of activities and the user permissions they require, as determined from each app's Android manifest file. User permissions were considered in order to evaluate how the tools react to different system events such as call logs, Bluetooth, Wi-Fi, location, and the device camera. Table 2 lists the apps by type, along with the package name and the number of activities, methods, and instructions in each app (which offers a rough estimate of app size). Acvtool [70] was used to collect instruction coverage and method coverage; this tool does not require the app's source code.

Experimental Setup
Our experiments were executed on a 64-bit octa-core machine with a 3.50 GHz Intel Xeon® CPU and 8 GB of RAM running Ubuntu 16.04. Five state-of-the-art test generation tools for Android apps were installed on this dedicated machine: Sapienz [16], Stoat [15], Droidbot [28], Humanoid [29], and Android Monkey [27].
The Android emulator x86 ABI (Application Binary Interface) image was used for experiments. All comparative experiments ran on emulators because the publicly available version of Sapienz only supports emulators. In contrast, DroidbotX, Droidbot, Humanoid, Stoat, and Android Monkey support both emulators and real devices. Moreover, Sapienz and Stoat ran on the same version of Android 4.4.2 (Android KitKat, API level 19) because of their compatibility as described in previous studies [15,16]; DroidbotX, Droidbot, Humanoid, and Android Monkey ran on Android 6.0.1 (Android Marshmallow, API level 23).
To achieve a fair comparison, a new Android emulator was used for each run to avoid any potential side effects between tools and apps. All tools were used with their default configurations. Following previous studies [11,54], Sapienz and Android Monkey were set to a 200-millisecond delay for GUI state updates. Each testing tool was given one hour to test each app, as in other studies [11,16,44]. To compensate for the possible effect of randomness during testing, each test (one testing tool on one applicable app) was repeated five times. The final coverage and the progressive coverage were recorded after each action, and the average of the five runs was taken as the final result.

Results
In this section, the research questions are answered by measuring and comparing four aspects: (i) instruction coverage, (ii) method coverage, (iii) activity coverage, and (iv) the number of detected crashes achieved by each testing tool on the selected apps. Table 3 shows the results obtained from the six testing tools. The gray background cells in Table 3 indicate the maximum value achieved during the test. Each percentage value is rounded up from the average of the five test iterations performed on each AUT. Table 3. Results on instruction coverage, method coverage, and activity coverage by test generation tools. 1) Instruction coverage: on average, DroidbotX achieved 51.5% instruction coverage, the highest across the compared tools. It achieved the highest value on 9 of 30 apps (including four ties, i.e., apps where DroidbotX covered the same number of instructions as another tool). Sapienz achieved 48.1%, followed by Android Monkey (46.8%), Humanoid (45.8%), Stoat (45%), and Droidbot (45%). Figure 5 presents the boxplots, where x indicates the mean of the final instruction coverage across target apps; the boxes show the minimum, mean, and maximum coverage achieved by the tools. DroidbotX's better results can be explained by its ability to accurately identify which parts of the app are inadequately explored: it checks all actions available in each state and avoids re-using already explored actions to maximize coverage. In comparison, Humanoid achieved a 45.8% average and had the highest coverage on 4 of 30 apps due to its ability to prioritize critical UI components; Humanoid chooses among 10 actions available in each state that human users are likely to perform. Android Monkey's coverage was close to Sapienz's during the one-hour test.
Sapienz uses Android Monkey to generate events and an optimized evolutionary algorithm to increase coverage. Stoat and Droidbot achieved lower coverage than the other four tools. First, Droidbot explores UIs in depth-first order; although this greedy strategy can reach deep UI pages early on, it may get stuck because the order of event execution is fixed at runtime. Second, Droidbot does not explicitly revisit previously explored states, and may therefore miss new code reachable only through different sequences.

2) Method coverage: DroidbotX significantly outperformed the state-of-the-art tools in method coverage, with an average of 57%. It achieved the highest value on 9 of 30 apps (including three ties where it matched another tool's method coverage). Table 3 shows that the instruction coverage obtained by the tools is lower than their method coverage, indicating that covering a method does not imply covering every statement within it. On average, Sapienz, Android Monkey, Humanoid, Stoat, and Droidbot achieved 53.7%, 52.1%, 51.2%, 50.9%, and 50.6% method coverage, respectively. Stoat and Droidbot did not reach 50% method coverage on 10 of 30 apps after five rounds of testing; in contrast, DroidbotX achieved method coverage above 50% on 24 of the tested apps, while Android Monkey obtained less than 50% method coverage on eight apps. This study concludes that the AUT's functionalities can be explored using the observe-select-execute strategy on standard equipment. Sapienz displayed the best method coverage on 5 of 30 apps (including four ties where it matched another tool's method coverage). Sapienz's coverage was significantly higher for some apps, such as "WLAN Scanner", "HotDeath", "ListMyApps", "SensorReadout", and "Terminal Emulator".
These apps have functionality that requires complex interactions with validated text input fields. Sapienz uses Android Monkey's input generation, which continuously generates events without waiting for the effect of the previous event. Moreover, Sapienz and Android Monkey can generate several types of events, broadcasts, and text inputs that the other tools do not support. DroidbotX obtained the best results for several other apps, notably "A2DPVolume", "Blockinger", "Ethersynth", "Resdicegame", "Weather Notification", and "World Clock". The DroidbotX approach assigns Q-values that encourage the execution of actions leading to new or partially explored states. This enables the approach to repeatedly execute high-value action sequences and revisit the subset of GUI states that provides access to most of the AUT's functionality. Figure 6 presents the boxplots, where x indicates the mean of the final method coverage across target apps. DroidbotX performed best among the compared tools. Android Monkey is used as a reference in most Android testing evaluations and can be considered a baseline, since it ships with the Android SDK and is popular among developers; it obtained lower coverage than DroidbotX because of its redundant, random exploratory approach.

3) Activity coverage: activity coverage is measured by intermittently observing the activity stack of the AUT and recording all activities listed in the Android manifest file. This metric was chosen because, once DroidbotX has reached an activity, it can explore most of that activity's actions. The results show clear activity coverage differences between DroidbotX and the other state-of-the-art tools. The average values reveal that all tools performed better on activity coverage than on instruction and method coverage, as shown in Table 3. DroidbotX outperformed the other tools in activity coverage, as it did for instruction and method coverage, with an average of 86.5%; it achieved the best value on the "Alarm Clock" app and tied for the highest coverage on 28 other apps.
DroidbotX outperformed the other tools because its reward function keeps it from explicitly revisiting previously explored states. It was followed by Sapienz and Humanoid, with mean activity coverage of 84% and 83.3%, respectively. Stoat outperformed Android Monkey in activity coverage, with an average of 83%, thanks to intrusive null-intent fuzzing that can start an activity with empty intents. All tools under study covered more than 50% of activities on 25 apps, and four testing tools reached 100% activity coverage on 15 apps. Android Monkey, however, achieved less than 50% activity coverage on three apps and the lowest average activity coverage overall, at 80%. Figure 7 shows the variance of the mean activity coverage over 5 runs across all 30 apps; the horizontal axis shows the tools under comparison, and the vertical axis the percentage of activity coverage. Activity coverage was higher than instruction and method coverage. DroidbotX, Droidbot, Humanoid, Sapienz, Stoat, and Android Monkey reached up to 100% coverage from mean coverages of 89%, 85%, 86.6%, 87.3%, 85.8%, and 83.4%, respectively. All tools covered above 50% of activities. Although Android Monkey implements more types of events than the other tools, it achieved the lowest activity coverage: it generates random events at random positions in the app's activities, so its activity coverage can differ significantly from app to app and may depend on the number of event sequences generated. To sum up, the high coverage of DroidbotX was mainly due to its ability to perform meaningful sequences of actions that drive the app into new activities.

RQ.2: How effective is DroidbotX at detecting unique app crashes compared to the state-of-the-art tools?
A crash is uniquely identified by its error message and the crashing activity. LogCat [71], a command-line tool that dumps logs of all system-level messages, was used to repeatedly check for crashes encountered during AUT execution. Log reports were manually analyzed to identify unique crashes from the error stack, following the protocol of Su et al. [15]. First, crashes unrelated to the app's execution were filtered out by retaining only exceptions containing the app's package name, excluding crashes of the tool itself and initialization errors of the apps in the Android emulator. Second, a hash was computed over the sanitized stack trace of each crash to identify unique crashes; different crashes should have different stack traces and thus different hashes. Each unique crash exception was recorded per tool, and the execution process was repeated five times to mitigate randomness in the results. The number of unique app crashes is used as the measure of crash-detection performance. Crashes detected by tools on different versions of Android were not compared via normalized stack traces, because different versions of Android have different framework code: Android 6.0 uses the ART runtime while Android 4.4 uses the Dalvik VM, and different runtime environments have different thread entry methods. As Figure 8 shows, the compared tools complement one another in crash detection, and each has its advantages. DroidbotX triggered an average of 18 unique crashes in 14 apps, followed by Sapienz (16), Stoat (14), Droidbot (12), Humanoid (12), and Android Monkey (11). As with activity coverage, Android Monkey has the least capacity to detect crashes, due to an exploratory approach that generates many ineffective and redundant events. Figure 8 summarizes the distribution of crashes across the six testing tools. Most of the bugs are caused by accessing null references.
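The deduplication protocol above can be sketched as follows: keep only exceptions mentioning the AUT's package, sanitize the stack trace, and hash it so identical crashes collapse to one entry. The trace format and the sanitization rule (stripping addresses and line numbers) are assumptions for illustration.

```python
import hashlib
import re

def unique_crashes(logcat_traces, package):
    """Count unique crashes: filter by package, sanitize, hash, deduplicate."""
    hashes = set()
    for trace in logcat_traces:
        if package not in trace:
            continue  # drop crashes unrelated to the app under test
        # strip memory addresses and source line numbers that vary between runs
        sanitized = re.sub(r"0x[0-9a-f]+|:\d+", "", trace)
        hashes.add(hashlib.sha1(sanitized.encode()).hexdigest())
    return len(hashes)

traces = [
    "java.lang.NullPointerException at com.hotdeath.Main.onPause(Main.java:42)",
    "java.lang.NullPointerException at com.hotdeath.Main.onPause(Main.java:42)",
    "java.lang.IllegalStateException at android.os.Looper.loop(Looper.java:91)",
]
# the duplicate collapses, and the crash outside the AUT's package is dropped
assert unique_crashes(traces, "com.hotdeath") == 1
```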
Common reasons are that developers forget to initialize references, access references that have already been cleaned up, skip checks of null references, or fail to verify certain assumptions about the environment [57]. DroidbotX is the only tool to detect an IllegalArgumentException in the "World Clock" app, because it manages the exploration of states and systematically sends back-button events that may change the activity lifecycle. This bug is caused by an incorrect redefinition of the activity's onPause method; Android apps may misbehave due to mismanagement of the activity lifecycle. Sapienz uses Android Monkey to generate an initial population of event sequences (including both user and system events) prior to genetic optimization, which allows it to trigger other types of exceptions, including ArrayIndexOutOfBoundsException and ClassCastException. For the "Alarm Clock" app, DroidbotX and Droidbot detected a crash on an activity that no other tool discovered in the five runs. Several randomly selected crashes were manually inspected to confirm that they also appear in the original APK; no discrepancy was found between the behavior of the original and the instrumented APK.
RQ.3: How does DroidbotX compare to the state-of-the-art tools in terms of test sequence length?
The effect of event sequence length on test coverage and crash detection was investigated. The event sequence length generally shows the number of steps a test input generation tool needs to detect a crash; it is important because of its significant effect on time, testing effort, and computational cost. Figure 9 depicts the progressive coverage of each tool over the allotted time (i.e., 60 minutes). The progressive average coverage over all 30 apps was calculated every 10 minutes for each test generation tool, and a direct comparison of the final coverage was reported. In the first 10 minutes, coverage for all testing tools increased rapidly, as the apps had just started. At 30 minutes, DroidbotX achieved the highest coverage of all tools. The reason is that the UCB exploration strategy implemented in DroidbotX selects events based on their reward and Q-value, preferring previously unexecuted or rarely executed events, and thus aims for high coverage. Sapienz's coverage also increased rapidly at the start, when all UI states were new, but could not exceed the peak it reached after 40 minutes; Sapienz has a high tendency to re-explore visited states, which generates more event sequences. Stoat, Droidbot, and Humanoid had almost the same results and better activity coverage than Android Monkey. Android Monkey could not exceed the peak it reached after 50 minutes: its random approach generates the same set of redundant events, reducing its ability to explore activities. These redundant events produced insignificant coverage improvement as the time budget increased.
Table 4 shows that the Q-Learning approach implemented in DroidbotX achieved 51.5% instruction coverage, 57% method coverage, and 86.5% activity coverage, and triggered 18 crashes with the shortest event sequence length of the compared tools. The results show that adapting Q-Learning with the UCB strategy can significantly improve the effectiveness of the generated test cases. DroidbotX generated a sequence length of 50 events per AUT state, with an average of 623 events per run across all apps (smaller than Sapienz's default maximum sequence length). DroidbotX completed exploration before reaching the maximum number of events (set to 1000) within the time limit. Sapienz produced 6000 events and optimized event sequence lengths through the generation of 500 events per AUT state; it created the largest number of events after Android Monkey, yet its coverage improvement was close to that of Humanoid and Droidbot, which generated far fewer events (1000 events per hour each). Sapienz relies on Android Monkey, which requires many events, including many redundant ones, to achieve high coverage; hence the coverage gained by Android Monkey increases only slightly as the number of events grows. Thus, a long event sequence length had only a minor positive effect on coverage and crash detection. Table 5 shows statistics of the models built by Droidbot, Humanoid, and DroidbotX. These tools use the UI transition graph to record state transitions. The graph model enables DroidbotX to manage the exploration of states systematically, avoid being trapped in a particular state, and minimize unnecessary transitions.
DroidbotX generates an average of 623 events to construct the graph model, while Droidbot and Humanoid generate 969 and 926 events on average, respectively. Droidbot cannot exhaustively explore app functions due to its simple exploration strategies: its depth-first systematic strategy is, surprisingly, much less effective than a random strategy, since it visits UIs in a fixed order and spends much time restarting the app when no new UI components are found. Stoat requires more time for test execution because of its time-consuming model construction in the initial phase. Model-free tools such as Android Monkey and Sapienz can easily be misled during exploration because they lack connectivity information between GUIs [54]. The model constructed by DroidbotX is still incomplete, since it cannot capture all possible behaviors during exploration; complete behavioral models remain an important research goal in GUI testing [15]. Events that are not properly modeled, such as system events and events from motion sensors (e.g., accelerometer, gyroscope, and magnetometer), would introduce non-deterministic behavior. Motion sensors are used for gesture recognition, which refers to recognizing meaningful body motions, including movements of the fingers, hands, arms, head, face, or body, performed with the intent to convey information or to interact with the environment [72]. DroidbotX will be extended in the future to include more system events.

Threats to Validity
There are threats and limitations to the validity of our study. Regarding internal validity, the non-deterministic behavior of the tools yields different coverage on each run. Multiple runs were therefore executed to reduce this threat and to remove outliers that could critically affect the study: each testing tool was run five times, and the results were averaged to yield the final and progressive coverage. Another threat to internal validity is Acvtool's instrumentation effect, which could affect the integrity of the results; errors may be triggered by Acvtool's incorrect handling of the binary code or by errors in our experimental scripts. To mitigate this risk, the traces of our experiments on the subject apps were manually inspected.
External validity is threatened by how closely the apps and tools used in this study reflect the real world. Moreover, the generalizability of the results is limited by the small number of subject apps. To mitigate this, a standard set of subject apps from various domains, including fitness, entertainment, and tools, was used in our experiment. The subject apps were carefully selected from F-Droid, which is commonly used in Android GUI testing studies, and the selection process is detailed in Section 7.2. Therefore, our test is not prone to selection bias.

Conclusion
This research presented a Q-Learning-based test coverage approach to generate GUI test cases for Android apps. The approach adopts a UCB exploration strategy to minimize redundant execution of events, improving coverage and crash detection. It generates inputs that visit unexplored app states and uses the app's execution on the generated inputs to construct a state-transition model at runtime. This research also provided an empirical evaluation of the effectiveness of the proposed approach and a comparison with GUI test-generation tools for Android apps on 30 Android apps. Four criteria, (i) instruction coverage, (ii) method coverage, (iii) activity coverage, and (iv) number of detected crashes, were used to evaluate and compare the GUI test-generation tools. The experimental results revealed that the Q-Learning-based test coverage approach outperforms the state of the art in coverage and in the number of detected crashes, within the shortest event sequence length. For future work, DroidbotX will be extended to handle text input data, possibly integrating text prediction to improve coverage.