Unsupervised Detection of Changes in Usage-Phases of a Mobile App

Abstract: Under fierce competition and budget constraints, most mobile apps are launched without sufficient testing. Thus, there exists a great demand for automated app testing. Recent developments in various machine learning techniques have made automated app testing a promising alternative to manual testing. This work proposes novel approaches for one of the core functionalities of automated app testing: the detection of changes in usage-phases of a mobile app. Because of the flexibility of app development languages and the lack of standards, each mobile app is very different from other apps. Furthermore, the graphical user interfaces for similar functionalities are rarely consistent or similar. Thus, we propose methods that detect usage-phase changes through object recognition and metrics utilizing graphs and generative models. Contrary to existing change detection methods, which require learning models, the proposed methods eliminate the burden of training models. This elimination of training suits mobile app testing, in which a typical usage-phase is composed of fewer than 10 screenshots. Our experimental results on commercial mobile apps show promising improvement over the state-of-the-practice method based on SIFT (scale-invariant feature transform).


Introduction
As users prefer mobile devices over conventional personal computers as a platform for news and entertainment, mobile apps now dominate software usage. However, developers have faced new challenges. Under the fierce competition and budget constraints, most developers do not have time for detecting bugs and potential crashes in their apps; thus, most apps are launched without sufficient testing. Testing technologies have yet to catch up, and mobile app testing still depends on manual methods, while reliable automated testing tools are rare [1].
In the domain of automated app testing, Google's Android Monkey tool is regarded as the state-of-the-practice for automated testing for the Android system [2]. Android Monkey takes a practical solution for GUI (graphical user interface) based testing: a random testing approach ("monkey testing") for generating random events [3,4]. Although the "monkey" approach is cost-effective, the unintelligent manner of testing leaves room for improvement. For example, recent developments in object recognition have been utilized for improving random testing [5]. In [6], a reinforcement learning based approach was proposed for identifying how individual UI widgets are interacting. Saumya et al. introduced the idea of the automatic generation of worst case test inputs from a model of program behavior in order to test programs under extreme loads [7]. More detailed descriptions on approaches for automated testing of mobile apps were introduced in [8]. Automated mobile app testing requires a number of functionalities such as feature extraction, event generation, event execution, etc. AI (artificial intelligence)/ML (machine learning) based approaches can augment or improve the performance of some functionalities for example by recognizing UI widgets in GUI and extracting keywords through OCR (optical character recognition).
In this paper, we introduce methods for detecting changes in mobile app usage based on GUI screenshots. The detection of changes in usage is important for generating input events and backtracking past executions. From the viewpoint of detecting changes in data streams, there already exist excellent works on concept drift [9,10]. However, the characteristics of mobile app testing make it difficult to employ existing achievements in the field of concept drift. Typical concept drift methods train models in order to measure the error rate or to estimate densities. In the field of mobile app testing, each usage-phase is typically composed of less than 10 images, and it is practically impossible to construct a training dataset due to time limits or flexible and inconsistent designs and implementations of the same functionalities. Thus, we are unable to utilize training based methods (supervised approaches).
In order to tackle this obstacle, we propose detection methods not requiring training (unsupervised approaches). Our detection methods compare GUI screenshots of a mobile app. In order to measure the difference between consecutive images, we utilize graph entropy [11], graph kernels [12], a probability distribution comparison metric, a generative model (Chapter 1 in [13]), and a sequence of log-likelihood values. Experimental results on 50 commercial apps report that the proposed methods achieve encouraging performance compared to the current state-of-the-practice method based on SIFT (scale-invariant feature transform) [14,15], and it is possible to detect changes in a data stream in an unsupervised manner.
The structure of the rest of this work is as follows. Section 2 summarizes the relevant works. Section 3 describes the details of the dataset and the proposed methods. Section 4 includes the experimental results. Finally, we give concluding remarks in Section 5.

Related Work
Many applications require the detection of significant changes in data streams such as video streams or streams of signals. The changes of interest can be detected by methods utilizing concept drift learning [9] or methods based on anomaly detection [16] in general. The goal of concept drift techniques is to mine inherent patterns from data streams by learning the underlying distribution over time [10]. In order to obtain inherent patterns, a typical procedure for drift detection is composed of data retrieval, data modeling, the calculation of test statistics, and hypothesis testing steps [10]. Although data retrieval for data streaming is also an important research topic [17], the key to drift detection is the measurement of dissimilarity. Gama et al. introduced a detection method based on the error rate of a learner in which once the error rate of a model reaches a threshold (the warning level), then the training step for a new learner is started [18]. For fast training, extreme learning machine (ELM) was introduced, whose architecture has a single hidden layer, and only the connections between the hidden layer and output layer are trained [19]. Xu et al. [20] conceived of a dynamic extreme learning machine improving ELM by adjusting the architecture of a model according to the classification performance of the current model.
The distance or dissimilarity between the distribution of the past and new data can be utilized to detect changes. In [21], the relative entropy was used to compare the dissimilarity between two distributions while the bootstrap method enhanced the statistical significance. In least-squares density-difference (LSDD) estimation, a density-difference model is fit to the target density-difference function through the squared loss [22]. For unknown distributions, the difference between two distributions represented by LSDD can be estimated by the Gaussian kernel function, thus enabling change detection [23]. Ross et al. improved traditional sequential monitoring methods through nonparametric charts capable of detecting arbitrary changes to the process distribution [24]. With an estimated distribution, a sequence of the likelihood of the data stream can be computed.
The distributions of log-likelihood were compared so as to detect changes by computing the symmetric Kullback-Leibler divergence [25].
The occurrence of changes in a data stream can be interpreted as anomalies [26]. Recently, graph based approaches have been actively researched [16]. In order to represent graphs as points in a non-Euclidean space and detect changes, adversarial training of autoencoders was employed for graph embeddings on constant-curvature manifolds, and change detection tests were performed considering embedded graphs [27]. Dissimilarity from prototype graphs was also utilized for change detection [28].
The test for detecting changes can be interpreted as a procedure for deciding to either reject a hypothesis that a new instance is generated from the current distribution or not reject it. The Student t-test quantifies how significant the difference between two datasets is (chap10 in [29]). For sequential data, the Mann-Kendall test and the CUSUM (cumulative sum) test are suitable [30]. With multiple parameters, the sequence of log-likelihood is a good indicator for change detection. For a log-likelihood stream, t-statistics, Kolmogorov-Smirnov, or Lepage can be utilized as the test statistic [25].

Materials and Methods
The term "usage-phase" denotes states comprised of a user's experience with a mobile app such as login, viewing goods, writing/reading a post, etc. In most cases, changes of usage-phases accompany changes in GUI compositions. Some changes in usage-phase are easy to detect. However, some changes are so subtle that they make automated testing difficult. For example, if a user reads a long review, dragging down causes changes in screen composition, but the sequence of slightly different images constitutes the same usage-phase semantically. Without semantic knowledge on the current usage, an automated tester is likely to decide that each image corresponds to a new usage-state and devise a complex testing strategy. The proposed methods aim to detect changes in usage-phases while minimizing the false-positive ratio. In order to achieve our goal in an unsupervised manner, we propose methods based on graphs and probability distributions in a screenshot. In Sections 3.1.1 and 3.1.2, we formulate the problem and describe our dataset. Section 3.2 describes the basic change detection algorithm and its proposed variations briefly. Section 3.3 discusses a method based on SIFT. Section 3.4 describes graph kernel and graph entropy based change detection methods. In Section 3.5, methods utilizing probability distributions are explained.

Problem Statement
In this research, a user experience is represented as a stream of images, $D = \{y_1, \ldots, y_n\}$, where each image corresponds to a screenshot of the currently used mobile app. With a given distance metric $d(\cdot, \cdot)$, a change in usage-phase is detected if the measured difference between two consecutive screenshots $(y_i, y_{i+1})$ is greater than a threshold $\tau$:
$$d(y_i, y_{i+1}) > \tau \tag{1}$$
The choice of $d(\cdot, \cdot)$ and $\tau$ is important for implementing error-robust test systems. For the construction of error-robust test systems, we consider not $D$ itself, but a sequence of graphs converted from $D$ (i.e., undirected labeled graphs characterized by nodes representing UI widgets and edges denoting relations between UI widgets) and a stream of probability distributions representing each screenshot. For conversion, UI widgets in $y_i$ are recognized through Faster R-CNN [31].

Data
For our study, we collected 13,272 screenshots from 50 Android apps. After installing each app, a user used the installed app and produced screenshots utilizing the ADB (Android Debug Bridge) tool. In general, approximately 4 to 5 screenshots were sampled per second. After sampling, the user grouped a set of consecutive screenshots as a usage-phase manually. Details on the collected datasets are provided in Table 1.

Basic Change Detection Algorithm
Algorithm 1 provides the basic working of the proposed methods. In essence, a set of usage-phases, $U$, is constructed by computing Equation (1) on $D = \{y_1, \ldots, y_n\}$. If the computed value is larger than a threshold, $\tau$, a change is regarded as detected. However, there exist variations of Algorithm 1 according to the employed methods and the selection of the threshold, $\tau$. The threshold is determined by one of three methods: $\frac{\text{Min}+\text{Max}}{2}$, the mean, and the empirical threshold. In order to compute $\frac{\text{Min}+\text{Max}}{2}$, we observe the differences between consecutive screenshots in the current usage-phase, $U_t$. "Min" and "Max" are the minimum and maximum difference in $U_t$, respectively. Thus, in each usage-phase, $U_t$, $\tau$ is adjusted dynamically. The "mean" is computed by first observing the differences between consecutive screenshots in a mobile app, then averaging these differences. As a result, each app has a distinct "mean" value. In the case of the empirical threshold, we fix $\tau$ to an empirically determined number.
There are also variations on the computation of Equation (1). With the SIFT based method (Section 3.3), Equation (2) is utilized for computing Equation (1). For graph based methods (Sections 3.4.1 and 3.4.2), Algorithm 1 becomes more complicated. Firstly, the input, $D = \{y_1, \ldots, y_n\}$, is converted into a set of graphs, $G = \{g_1, \ldots, g_n\}$ (for more details, refer to Section 3.4). Then, for graph entropy based detection (Section 3.4.1), the conditional graph entropy (Equation (4)) between two consecutive graphs $g_i$ and $g_{i+1}$ is utilized as a measure of the difference between the two corresponding GUI screenshots $y_i$ and $y_{i+1}$. For graph kernel based detection (Section 3.4.2), the dissimilarity (measured by Equation (5)) between two graphs $g_i$ and $g_{i+1}$ is utilized as a measure of the difference between the two corresponding GUI screenshots.
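Algorithm 1. Basic change detection.
Input: $D = \{y_1, \ldots, y_n\}$, distance metric $d(\cdot, \cdot)$, threshold $\tau$
Output: set of usage-phases $U$
Initialize $U$ as an empty set and $U_t$ with $y_1$.
for $i = 1$ to $n - 1$ do
    if $d(y_i, y_{i+1}) > \tau$ then
        Add $U_t$ to $U$.
        Initialize $U_t$ with $y_{i+1}$.
    else
        Add $y_{i+1}$ to $U_t$.
    end if
end for
Add $U_t$ to $U$.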
For probability distribution based methods (Sections 3.5.1 and 3.5.2), the input, $D = \{y_1, \ldots, y_n\}$, is converted into a set of probability distributions, $P = \{P_1, \ldots, P_n\}$ (for more details, refer to Section 3.5). The Kullback-Leibler divergence between consecutive probability distributions $P_i$ and $P_{i+1}$ is computed to measure the difference between the corresponding GUI screenshots $y_i$ and $y_{i+1}$ (Section 3.5.1). In Section 3.5.2, a generative model, $M_t$, for the current usage-phase, $U_t$, is constructed. Then, the likelihood of a new image $y_{new}$ (more precisely, the likelihood of its converted probability distribution $P_{new}$) is computed by Equation (9). By computing Equation (9), a sequence of likelihood values is obtained. We also attempt to detect changes by testing the hypothesis that the probability distribution representing $y_{new}$ is generated by the current usage-phase model ($M_t$), based on the sequence of likelihood values (Section 3.5.3).
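The segmentation loop described in this section can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the names `segment`, `midrange_threshold`, and `fallback_tau` are hypothetical, and the use of a fallback threshold while the current usage-phase has no observed difference yet is an assumption, since the text does not state how that case is handled.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def midrange_threshold(diffs: Sequence[float], fallback: float) -> float:
    """(Min + Max) / 2 over the differences observed in the current phase."""
    if not diffs:
        # Assumption: no difference observed yet, so use a fallback value.
        return fallback
    return (min(diffs) + max(diffs)) / 2.0


def segment(stream: Sequence[T],
            distance: Callable[[T, T], float],
            fallback_tau: float) -> List[List[T]]:
    """Split a stream into usage-phases using a dynamically adjusted threshold."""
    if not stream:
        return []
    phases: List[List[T]] = []
    current: List[T] = [stream[0]]
    diffs: List[float] = []  # differences seen inside the current phase
    for prev, nxt in zip(stream, stream[1:]):
        d = distance(prev, nxt)
        tau = midrange_threshold(diffs, fallback_tau)
        if d > tau:
            phases.append(current)  # change detected: close the current phase
            current = [nxt]
            diffs = []
        else:
            current.append(nxt)
            diffs.append(d)
    phases.append(current)
    return phases


# Toy stream of numbers standing in for screenshots.
print(segment([0, 2, 3, 50, 51], lambda a, b: abs(a - b), fallback_tau=10.0))
```

Any of the distance measures discussed in the following sections (SIFT similarity, graph kernel dissimilarity, conditional graph entropy, KLD) could be passed as `distance`; the "mean" and empirical thresholds correspond to replacing `midrange_threshold` with a constant.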

SIFT Based Method
The scale-invariant feature transform (SIFT) is utilized to detect local features in an image [32]. Because features extracted by SIFT are invariant to orientation, changes in illumination, and uniform scaling, SIFT is widely used for image and video analysis [33,34]. Target image search based on local features (TISLF) was introduced as a method for comparing target images and images in video sources by means of local features [34]. TISLF is composed of a video segmentation step, a recognition step, and an estimation step. In TISLF, SIFT keypoints are utilized as a similarity measure between two consecutive images $y_i$ and $y_{i+1}$ (Equation (2)). The value computed by Equation (2) is interpreted as the probability of two successive images belonging to the same interval. These values can be concatenated into the vector $W = (P(y_1, y_2), \ldots, P(y_{i-1}, y_i))$ representing the similarities over successive images. We utilize Equation (2) for detecting changes and denote it as the "SIFT" method in this paper. The SIFT based method is regarded as the state-of-the-practice method in this work.

Graph Based Methods
After recognizing UI widgets in a screenshot, a graph can be obtained from the set of recognized UI widgets and their corresponding coordinates. The given user experience, $D = \{y_1, \ldots, y_n\}$, is converted into a stream of graphs, $G = \{g_1, \ldots, g_n\}$, by Prim's algorithm. For graph conversion, each recognized UI widget is considered as a node, and an edge between two nodes carries a label denoting the minimum distance between the two UI widgets. In detail, a complete graph consisting of every recognized UI widget and an edge between every possible pair of nodes is converted into a minimum spanning tree by Prim's algorithm. The resulting minimum spanning trees act as inputs for the graph based detection methods. With the converted graphs, we can choose $d(\cdot, \cdot)$ in Equation (1) based on measures such as graph entropy [11] or graph kernels [12].
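The conversion from recognized widgets to a minimum spanning tree can be illustrated with a small, dependency-free version of Prim's algorithm. This is a sketch under stated assumptions: the `Widget` tuple format and the function name `prim_mst` are hypothetical, and the Euclidean distance between widget centers is used as a stand-in for the "minimum distance between two UI widgets", whose exact definition is not given in the text.

```python
import math
from typing import List, Tuple

# Hypothetical widget format: (label, center_x, center_y).
Widget = Tuple[str, float, float]


def prim_mst(widgets: List[Widget]) -> List[Tuple[int, int, float]]:
    """Return the MST edges (i, j, distance) of the complete widget graph."""
    n = len(widgets)
    if n < 2:
        return []

    def dist(i: int, j: int) -> float:
        # Assumption: distance between widget centers.
        return math.hypot(widgets[i][1] - widgets[j][1],
                          widgets[i][2] - widgets[j][2])

    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection cost to the tree
    parent = [-1] * n       # tree node that provides that connection
    best[0] = 0.0
    edges: List[Tuple[int, int, float]] = []
    for _ in range(n):
        # Pick the cheapest node that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        # Relax the connection costs of the remaining nodes.
        for v in range(n):
            if not in_tree[v]:
                d = dist(u, v)
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges


widgets = [("button", 0.0, 0.0), ("text", 3.0, 0.0), ("image", 3.0, 4.0)]
print(prim_mst(widgets))
```

A screenshot with a single recognized widget yields no edges, which is the small-graph situation discussed later in the case studies.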

Graph Entropy Based Detection
Entropy based similarity can be interpreted as a measure of the information needed to describe a distribution $P$ utilizing another distribution $Q$ [35]. Therefore, the conditional entropy, $H(X|Y)$, which is the amount of information required for quantifying the outcome of a random variable $X$ given another random variable $Y$, gives a lead for the similarity between two graphs $g_1$ and $g_2$. The graph entropy of a graph $G$ and a random variable $X$ is defined by Equation (3), and the conditional graph entropy by Equation (4). In Equation (4), $W$, $X$, $Y$ form a Markov chain, that is, $p(w|x, y) = p(w|x)$. From Equation (3) and the definition of the relative entropy, we are able to compute the entropy of a graph $g_i$ and compare two consecutive graphs.

Graph Kernel Based Detection
A kernel $k(x, x')$ is a similarity measure between $x$ and $x'$. The role of a graph kernel is to evaluate similarity in the graph structure (an extensive study on graph kernels was provided in [36]). A graph can be interpreted as bags of vertices and edges. Then, the level of similarity between two graphs, $g_1$ and $g_2$, can be computed by comparing all pairs of vertex labels from $g_1$ and $g_2$:
$$k_{VL}(g_1, g_2) = \sum_{v_i \in V_1} \sum_{v_j \in V_2} k(l(v_i), l(v_j)) \tag{5}$$
where $V_1$ and $V_2$ are the vertex sets of $g_1$ and $g_2$, $l(v_i)$ denotes the label of vertex $v_i$, and $k(\cdot, \cdot)$ is the equality indicator function. $k_{VL}$ acts as a linear function of the vertex labels in the two graphs. Thus, two consecutive graphs can be compared by graph kernels, especially the vertex label histogram kernel [37].
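A vertex label kernel of this kind can be sketched as a dot product of label histograms, which equals the count of equal-label vertex pairs. The function names below are hypothetical, and the conversion of the kernel value into a dissimilarity through cosine-style normalization is an assumption made for illustration; the text does not specify how similarity is turned into a difference score.

```python
import math
from collections import Counter
from typing import Sequence


def vertex_label_kernel(labels_1: Sequence[str], labels_2: Sequence[str]) -> int:
    """Count the vertex pairs (one vertex from each graph) with equal labels."""
    h1, h2 = Counter(labels_1), Counter(labels_2)
    # Dot product of the two label histograms.
    return sum(count * h2[label] for label, count in h1.items())


def kernel_dissimilarity(labels_1: Sequence[str], labels_2: Sequence[str]) -> float:
    """Assumption: 1 minus the normalized kernel serves as the difference score."""
    k11 = vertex_label_kernel(labels_1, labels_1)
    k22 = vertex_label_kernel(labels_2, labels_2)
    if k11 == 0 or k22 == 0:
        return 1.0  # an empty graph is treated as maximally different
    k12 = vertex_label_kernel(labels_1, labels_2)
    return 1.0 - k12 / math.sqrt(k11 * k22)


g_i = ["button", "button", "text"]
g_next = ["button", "image"]
print(vertex_label_kernel(g_i, g_next))
print(kernel_dissimilarity(g_i, g_next))
```

Because only vertex labels enter the computation, two screenshots with the same widget categories but different layouts receive a dissimilarity of zero under this sketch.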

Probability Distribution Based Methods
Each screenshot can be represented as a probability distribution of UI widgets and their connections. For a screenshot $y_i$, the probability distribution of $y_i$ is defined as:
$$P(y_i) = (P(v_1), \cdots, P(v_n), P(e_1), \cdots, P(e_m)) \tag{6}$$
where $n$ denotes the number of UI widget categories and $m$ is the number of possible combinations of UI widgets (note: because the categories of UI widgets are fixed, we are able to generate the possible combinations of nodes (UI widgets) and assign a unique identification label to each edge connecting two nodes). Based on Equation (6), a detection method based on the Kullback-Leibler divergence (KLD) measure, a generative approach utilizing the likelihood measure, and a method based on hypothesis testing are conceptualized.
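One way to build such a vector from a recognized graph is sketched below. The category list, the function name `screenshot_distribution`, the inclusion of same-category pairs among the edge types, and the normalization by the total count are all assumptions for illustration; the text only states that widget categories are fixed and that each possible edge receives a unique label.

```python
from collections import Counter
from itertools import combinations_with_replacement
from typing import List, Sequence, Tuple

# Hypothetical widget categories; must be kept in sorted order so that
# sorted edge labels match the entries of EDGE_TYPES.
CATEGORIES = ("button", "image", "text")
EDGE_TYPES = list(combinations_with_replacement(CATEGORIES, 2))


def screenshot_distribution(node_labels: Sequence[str],
                            edges: Sequence[Tuple[str, str]]) -> List[float]:
    """Return (P(v_1..v_n), P(e_1..e_m)) as one normalized vector."""
    node_counts = Counter(node_labels)
    # Edges are undirected, so the two endpoint labels are sorted.
    edge_counts = Counter(tuple(sorted(edge)) for edge in edges)
    counts = ([node_counts[c] for c in CATEGORIES]
              + [edge_counts[e] for e in EDGE_TYPES])
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)  # nothing was recognized in the screenshot
    return [c / total for c in counts]


p = screenshot_distribution(
    ["button", "text", "text"],
    [("text", "button"), ("text", "text")],
)
print(p)
```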

KLD Based Detection
The Kullback-Leibler divergence, Equation (7), is a measure for comparing two probability distributions. For discrete probability distributions $P$ and $Q$ defined on a probability space $\mathcal{X}$, the Kullback-Leibler divergence is:
$$D_{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)} \tag{7}$$
Because we convert each screenshot into a probability distribution by Equation (6), Equation (7) is a natural choice for computing Equation (1). The procedure is simple: firstly, two consecutive screenshots $y_i$ and $y_{i+1}$ are represented as probability distributions based on Equation (6). Then, the KLD value between the two resulting distributions, $P(y_i)$ and $P(y_{i+1})$, is computed by Equation (7).
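The KLD computation between two screenshot distributions can be sketched as follows. Screenshot vectors usually contain zeros (widget categories that do not appear), for which the divergence is undefined or infinite; the additive smoothing with `eps` used here is an assumption, as the text does not say how zero entries are handled.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-9) -> float:
    """Kullback-Leibler divergence D(P || Q) for discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Assumption: additive smoothing avoids log(0) and division by zero.
    ps = [x + eps for x in p]
    qs = [x + eps for x in q]
    zp, zq = sum(ps), sum(qs)
    return sum((a / zp) * math.log((a / zp) / (b / zq)) for a, b in zip(ps, qs))


p_i = [0.2, 0.0, 0.4, 0.4]      # hypothetical distribution of screenshot y_i
p_next = [0.1, 0.3, 0.3, 0.3]   # hypothetical distribution of screenshot y_{i+1}
print(kl_divergence(p_i, p_next))
print(kl_divergence(p_i, p_i))
```

Note that the divergence is not symmetric, so `kl_divergence(p_i, p_next)` and `kl_divergence(p_next, p_i)` generally differ; which direction is used for consecutive screenshots is a design choice.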

Usage-Phase Model Based Detection
Once we detect usage-phase changes, we are able to collect the screenshots deemed to belong to the same usage-phase, $U_t = \{y_{t,1}, y_{t,2}, \ldots, y_{t,i}\}$, where $y_{t,i}$ denotes the $i$th screenshot belonging to the $t$th usage-phase, $U_t$. With the screenshots in $U_t$ converted into probability distributions, we build a model, $M_t$, for $U_t$:
$$M_t = \left(P(v^{(1)}|U_t), \cdots, P(v^{(n)}|U_t), P(e^{(1)}|U_t), \cdots, P(e^{(m)}|U_t)\right) \tag{8}$$
where $P(v^{(k)}|U_t) = \frac{1}{Z}\sum_i v^{(k)}_{t,i}$ and $P(e^{(s)}|U_t) = \frac{1}{Z}\sum_i e^{(s)}_{t,i}$ ($Z$ is a normalization constant that makes the sum of all elements in $M_t$ equal to one; $v^{(k)}_{t,i}$ and $e^{(s)}_{t,i}$ denote the number of the $k$th UI widget and the $s$th edge in the $i$th screenshot in $U_t$, respectively). That is, $P(v^{(k)}|U_t)$ is computed from the number of occurrences of the UI widget $v^{(k)}$ in the screenshots belonging to $U_t$. From Equation (8), we can compute the likelihood $L(y_{new}|M_t)$: the probability of a new screenshot, $y_{new}$, being sampled from the probability distribution represented by $M_t$. The log-likelihood of $y_{new}$ under $M_t$ is computed by Equation (9), in which $v^{(k)}_{new}$ and $e^{(l)}_{new}$ denote the number of the $k$th UI widget and the $l$th edge in $y_{new}$, respectively. If the computed likelihood is greater than a threshold, $y_{new}$ is accepted as a member of $U_t$; otherwise, $y_{new}$ is regarded as a member of a new usage-phase, $U_{t+1}$.
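The usage-phase model and the likelihood check can be sketched as below. Two points are assumptions rather than details taken from the text: the log-likelihood is written in a multinomial style (count times log-probability, summed over widgets and edges, without the multinomial coefficient), and add-`alpha` smoothing is applied so that widgets unseen in the phase do not yield a log of zero. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def build_phase_model(count_vectors: Sequence[Sequence[int]],
                      alpha: float = 1.0) -> List[float]:
    """Sum the widget/edge counts over a phase and normalize them by Z."""
    if not count_vectors:
        raise ValueError("a usage-phase needs at least one screenshot")
    # Assumption: add-alpha smoothing keeps every probability positive.
    totals = [sum(column) + alpha for column in zip(*count_vectors)]
    z = sum(totals)  # normalization constant Z
    return [t / z for t in totals]


def log_likelihood(new_counts: Sequence[int], model: Sequence[float]) -> float:
    """Assumed multinomial-style log-likelihood of a new count vector."""
    # With alpha = 0 a zero model entry and a positive count would fail here.
    return sum(c * math.log(p) for c, p in zip(new_counts, model) if c > 0)


# Hypothetical counts: [widget type 1, widget type 2, edge type 1].
phase = [[2, 1, 0], [2, 1, 0]]
model = build_phase_model(phase)
similar = log_likelihood([2, 1, 0], model)
different = log_likelihood([0, 0, 3], model)
print(similar, different)
```

In this toy example the screenshot that repeats the composition of the phase scores a higher log-likelihood than the one dominated by an element the phase never contained, which is the behavior the threshold test relies on.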

Hypothesis Testing Based Detection
By computing Equation (9) on $U_t$, a set of likelihood values is obtained. We are able to determine whether to accept the hypothesis that the set of parameters representing the unknown distribution of $y_{new}$ is equal to the set of parameters specifying $U_t$ by running a likelihood ratio test (Equation (10)). In order to compute Equation (10) in our setting, we regard the probability values in Equation (8) as the set of parameters specifying the hypothesis. The elements of the probability distribution (Equation (8)) produced by $U_t$ constitute $\Omega_0$. We calculate the maximum likelihood of the new screenshot $y_{i+1}$, which is observed after the screenshot in $U_t$ ($g_{t,i}$), based on the assumed model described by $\Omega_0$ (for more details on the likelihood ratio test, refer to [29]).
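For orientation, the textbook form of the likelihood ratio statistic is shown below. It is the standard definition rather than a transcription of Equation (10): the symbols $\Lambda$, $\theta$, and $\Omega$ (the unrestricted parameter space) are notation assumed here, while $\Omega_0$ is the restricted parameter set built from $U_t$ as described above.

```latex
\Lambda(y_{new}) =
  \frac{\sup_{\theta \in \Omega_0} L(\theta \mid y_{new})}
       {\sup_{\theta \in \Omega} L(\theta \mid y_{new})},
\qquad 0 \le \Lambda(y_{new}) \le 1
```

Small values of $\Lambda$ (equivalently, large values of $-2 \log \Lambda$) count as evidence against the hypothesis that $y_{new}$ was generated by the current usage-phase.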

Results and Discussions
The performance of each method was compared based on datasets from 50 commercial mobile apps. In order to measure performance in terms of the fraction of correctly estimated changes and the fraction of detected changes, TP (true positive), TN (true negative), FP (false positive), and FN (false negative) were used. TP is a case in which the proposed method detects a change successfully. TN is a case in which the proposed method correctly reports no change between screenshots in the same usage-phase. If the proposed method claims a change in a stream of screenshots that in fact belongs to the same usage-phase, this is considered as FP. Finally, FN is a case where the proposed method fails to detect a change. The performance of each proposed method was quantified utilizing precision ($\text{precision} = \frac{TP}{TP+FP}$), recall ($\text{recall} = \frac{TP}{TP+FN}$), and accuracy ($\text{accuracy} = \frac{TP+TN}{TP+FP+TN+FN}$). Precision is the fraction of reported changes that are correct, and recall is the fraction of actual changes that are detected. Accuracy is the fraction of all decisions (change or no change) that are correct.
Table 2 reports the overall performance of the proposed methods. In terms of precision and accuracy, graph based methods ("graph kernel" and "graph entropy" in Table 2) and probability distribution based methods ("KLD" and "likelihood" in Table 2) achieved better results compared to the SIFT based method and the hypothesis testing based method when $\tau$ was determined dynamically. In terms of recall, however, the SIFT based method and the hypothesis testing based method reported better results than the graph based methods or the other probability distribution based methods. The difference in performance tells us that the SIFT based method and the hypothesis testing based method were too sensitive to changes in the observed features. Thus, they detected more changes, but many of them incorrectly.
For detecting changes in usage-phases, reducing the ratio of misdetection (the cases of wrongly detecting changes within a usage-phase) is as important as increasing accuracy, precision, and recall. In order to measure performance in terms of misdetection, we defined the new measures NPrecision (negative precision, $\text{NPrecision} = \frac{TN}{TN+FN}$) and NRecall (negative recall, $\text{NRecall} = \frac{TN}{TN+FP}$). Table 3 reports the performance in terms of NPrecision and NRecall. Each method reported similar performance in terms of NPrecision, but the graph based method ("graph kernel") and the method focusing on the difference between two probability distributions ("KLD") achieved better performance than the others in terms of NRecall. From Tables 2 and 3, we concluded that (1) the "graph kernel" based method and the "KLD" based method achieved promising results compared to the other methods, and (2) dynamically adjusted thresholds were better than fixed thresholds.
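The five measures can be computed from the four confusion counts as in the following sketch. The function name `detection_metrics` and the example counts are hypothetical and are not taken from Tables 2 or 3; returning 0.0 for an empty denominator is a convention chosen here.

```python
from typing import Dict


def detection_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Precision, recall, accuracy, NPrecision, and NRecall from confusion counts."""

    def ratio(numerator: int, denominator: int) -> float:
        # Convention assumed here: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "nprecision": ratio(tn, tn + fn),
        "nrecall": ratio(tn, tn + fp),
    }


# Hypothetical counts for one app: changes are rare compared to non-changes.
metrics = detection_metrics(tp=8, tn=80, fp=2, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With counts of this shape, accuracy and the negative measures stay high even when recall is low, which illustrates why accuracy alone is not sufficient for judging a change detector on streams where most consecutive screenshots belong to the same usage-phase.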

Case Studies
Dividing a stream of screenshots into distinct usage-phases is not easy. Figure 1 provides an example. In this example, it is possible to group Figure 1a, 1b, and 1c into one usage-phase (because they all represent searching) or to group only Figure 1b and 1c, as they show search results. Because it is ambiguous how to form distinct groups from screenshots such as those in Figure 1, we refer to these cases as ambiguous cases and observe the performance with them taken into account. Table 4 reports the performance on datasets considering ambiguous cases. Of the 13,272 screenshots, 736 were designated as ambiguous cases. As expected, the performance on the ambiguous cases (numbers in parentheses in Table 4) was poor. The performance on datasets excluding ambiguous cases was only slightly better than the performance reported in Table 2 because screenshots corresponding to ambiguous cases amounted to only 5.5% of all screenshots.
Graph based methods require multiple nodes and edges to construct a graph. If a screenshot contains too few UI widgets (Figure 2) or the number of successfully recognized UI widgets is too small, graph based methods are likely to perform poorly. Table 5 reports the performance on datasets excluding cases that generate too small graphs (a graph with fewer than three nodes) and the performance on datasets corresponding to too small graphs (numbers in parentheses in Table 5). From Tables 2 and 5, we can see that the performance on datasets excluding small graphs was better than the performance on datasets considering all screenshots. However, the improvement was modest due to the small number of screenshots corresponding to small graphs (4.5% of all screenshots).
Besides precision, recall, and accuracy, the number of estimated usage-phases gave insight into the performance of the proposed methods. Table 6 provides the average number of estimated usage-phases (the number of manually labeled usage-phases is provided in Table 1). The results in Table 6 inform us that the methods based on SIFT (when utilizing the "mean" threshold), graph entropy (when utilizing the "empirical threshold"), and hypothesis testing were too sensitive. Compared to the average number of manually labeled usage-phases (265.44), it seemed that the SIFT based method and hypothesis testing achieved better results. However, we should be cautious when comparing the number of detected usage-phases. The SIFT and hypothesis testing based methods detected more usage-phases than the other methods, but the usage-phases detected by these two methods contained more false positive cases.
In this research, we proposed various candidates for detecting changes in a mobile app's usage in an unsupervised manner. The empirical results in Tables 2 and 3 state that each method was relatively good at avoiding misdetection, but poor at detecting changes. Graph based methods utilizing the graph kernel and graph entropy and probability distribution based methods based on Kullback-Leibler divergence (KLD) and the usage-phase model (the "likelihood" method) achieved promising results in terms of accuracy. However, the performance from the perspective of precision and recall showed that we could not confirm the superiority of any method compared to other methods.
However, our work provided valuable insights for other researchers. The first insight was the need for a dynamically adjusted threshold. Our research confirmed that in order to detect changes, a threshold should be adjusted dynamically (Tables 2, 4, and 5). The empirical results showed that conventional change detection methods based on hypothesis testing ([28,30]) were not suitable for detecting changes in the usage-phases of a mobile app. The inferior performance of the hypothesis testing based method resulted from the lack of sufficient screenshots in each usage-phase. In our datasets, each usage-phase consisted of 6.0 screenshots on average. Thus, conventional hypothesis testing based methods may be unsuitable for app testing.
The second insight was the need for utilizing probability distributions resulting from screenshots. Although a graph kernel based method achieved better results overall, graph based methods had some deficiencies due to external causes. Firstly, current object recognition techniques are not perfect, so there existed some UI widgets that were unrecognized or wrongly recognized. Secondly, each screenshot from a mobile app contained a relatively small number of UI widgets. Thus, the failure of recognition could hinder the formation of graphs.

Conclusions
This paper presented change detection methods based on graph entropy, graph kernel, the Kullback-Leibler divergence, a generative model, and a hypothesis testing method. By utilizing recent advancements in object recognition, we were able to convert the input sequence of GUI screenshots into a sequence of graphs or probability distributions, thus constructing more robust change detection methods compared to the current state-of-the-practice method. The proposed methods detected changes in an unsupervised manner, not requiring training. This elimination of training requirements was a very significant advantage compared to the current change detection methods. Contrary to the datasets analyzed by existing change detection methods, our intended application, mobile app testing, typically had less than 10 GUI screenshots per usage-phase; thus, we could not obtain enough instances to train a model. In addition, the requirements of mobile app testing did not allow time for training. Our experimental results demonstrated that the proposed methods achieved promising results compared to the current state-of-the-practice method, but the result also clarified that the proposed methods should be improved before being employed for mobile app testing.
Our experience from this research and from the daily usage of mobile apps tells us that usage-phases undeniably exist within an app, but that defining them in an objective manner is very difficult and that the concept of the usage-phase is inherently ambiguous. In our future works, we will develop methods based on these findings. Rather than attempting to fix change-occurring points, we will search for micro usage-phases (composed of 2 to 3 screenshots), then combine these micro usage-phases into bigger clusters dynamically. Although this method is likely to be inferior to the methods proposed in this research in terms of memory cost, the hierarchical combination of micro usage-phases may overcome the inherent ambiguity of usage-phases. We also plan to enhance the proposed methods by utilizing additional features such as event generation.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: