Article

Unsupervised Detection of Changes in Usage-Phases of a Mobile App

1 Department of Computer Science and Engineering, Kangwon National University, Chuncheon-si, Gangwon-do 24341, Korea
2 Apptest.ai, Songpa-gu, Seoul 05854, Korea
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2020, 10(10), 3656; https://doi.org/10.3390/app10103656
Submission received: 17 April 2020 / Revised: 15 May 2020 / Accepted: 22 May 2020 / Published: 25 May 2020

Abstract

Under fierce competition and budget constraints, most mobile apps are launched without sufficient testing. Thus, there is great demand for automated app testing. Recent developments in various machine learning techniques have made automated app testing a promising alternative to manual testing. This work proposes novel approaches for one of the core functionalities of automated app testing: the detection of changes in the usage-phases of a mobile app. Because of the flexibility of app development languages and the lack of standards, each mobile app differs greatly from other apps, and the graphical user interfaces for similar functionalities are rarely consistent or similar. Thus, we propose methods that detect usage-phase changes through object recognition and metrics based on graphs and generative models. In contrast to existing change detection methods that require trained models, the proposed methods eliminate the burden of model training. This elimination of training suits mobile app testing, where a typical usage-phase is composed of fewer than 10 screenshots. Our experimental results on commercial mobile apps show promising improvement over the state-of-the-practice method based on SIFT (scale-invariant feature transform).

1. Introduction

As users prefer mobile devices over conventional personal computers as a platform for news and entertainment, mobile apps now dominate software usage. However, developers face new challenges. Under fierce competition and budget constraints, most developers do not have time to detect bugs and potential crashes in their apps; thus, most apps are launched without sufficient testing. Testing technologies have yet to catch up, and mobile app testing still depends on manual methods, while reliable automated testing tools remain rare [1].
In the domain of automated app testing, Google’s Android Monkey tool is regarded as the state-of-the-practice for automated testing on the Android system [2]. Android Monkey takes a practical approach to GUI (graphical user interface) based testing: a random testing approach (“monkey testing”) that generates random events [3,4]. Although the “monkey” approach is cost-effective, its unintelligent manner of testing leaves room for improvement. For example, recent developments in object recognition have been utilized to improve random testing [5]. In [6], a reinforcement learning based approach was proposed for identifying how individual UI widgets interact. Saumya et al. introduced the automatic generation of worst-case test inputs from a model of program behavior in order to test programs under extreme loads [7]. More detailed descriptions of approaches for automated testing of mobile apps are given in [8]. Automated mobile app testing requires a number of functionalities such as feature extraction, event generation, and event execution. AI (artificial intelligence)/ML (machine learning) based approaches can augment or improve some of these functionalities, for example, by recognizing UI widgets in the GUI and extracting keywords through OCR (optical character recognition).
In this paper, we introduce methods for detecting changes in mobile app usage based on GUI screenshots. The detection of changes in usage is important for generating input events and backtracking past executions. From the viewpoint of detecting changes in data streams, there already exist excellent works on concept drift [9,10]. However, the characteristics of mobile app testing make it difficult to employ existing achievements in the field of concept drift. Typical concept drift methods train models in order to measure the error rate or to estimate densities. In mobile app testing, each usage-phase is typically composed of fewer than 10 images, and it is practically impossible to construct a training dataset due to time limits and the flexible, inconsistent designs and implementations of the same functionalities. Thus, we are unable to utilize training based methods (supervised approaches).
In order to tackle this obstacle, we propose detection methods that do not require training (unsupervised approaches). Our detection methods compare GUI screenshots of a mobile app. In order to measure the difference between consecutive images, we utilize graph entropy [11], graph kernels [12], a probability distribution comparison metric, a generative model (Chapter 1 in [13]), and a sequence of log-likelihood values. Experimental results on 50 commercial apps show that the proposed methods achieve encouraging performance compared to the current state-of-the-practice method based on SIFT (scale-invariant feature transform) [14,15] and that it is possible to detect changes in a data stream in an unsupervised manner.
The structure of the rest of this work is as follows. Section 2 summarizes the relevant works. Section 3 describes the details of the dataset and the proposed methods. Section 4 includes the experimental results. Finally, we give concluding remarks in Section 5.

2. Related Work

Many applications require the detection of significant changes in data streams such as video streams or streams of signals. The changes of interest can generally be detected by methods utilizing concept drift learning [9] or methods based on anomaly detection [16]. The goal of concept drift techniques is to mine inherent patterns from data streams by learning the underlying distribution over time [10]. To obtain inherent patterns, a typical procedure for drift detection is composed of data retrieval, data modeling, calculation of test statistics, and hypothesis testing steps [10]. Although data retrieval for data streaming is also an important research topic [17], the key to drift detection is the measurement of dissimilarity. Gama et al. introduced a detection method based on the error rate of a learner: once the error rate of a model reaches a threshold (the warning level), the training of a new learner is started [18]. For fast training, the extreme learning machine (ELM) was introduced; its architecture has a single hidden layer, and only the connections between the hidden layer and the output layer are trained [19]. Xu et al. [20] conceived a dynamic extreme learning machine that improves ELM by adjusting the architecture of a model according to the classification performance of the current model.
The distance or dissimilarity between the distribution of the past and new data can be utilized to detect changes. In [21], the relative entropy was used to compare the dissimilarity between two distributions while the bootstrap method enhanced the statistical significance. In least-squares density-difference (LSDD) estimation, a density-difference model is fit to the target density-difference function through the squared loss [22]. For unknown distributions, the difference between two distributions represented by LSDD can be estimated by the Gaussian kernel function, thus enabling change detection [23]. Ross et al. improved traditional sequential monitoring methods through nonparametric charts capable of detecting arbitrary changes to the process distribution [24]. With an estimated distribution, a sequence of the likelihood of the data stream can be computed. The distributions of log-likelihood were compared so as to detect changes by computing the symmetric Kullback–Leibler divergence [25].
The occurrence of changes in a data stream can be interpreted as anomalies [26]. Recently, graph based approaches have been actively researched [16]. In order to represent graphs as points in a non-Euclidean space and detect changes, adversarial training of autoencoders was employed for graph embeddings on constant-curvature manifolds, and change detection tests were performed considering embedded graphs [27]. Dissimilarity from prototype graphs was also utilized for change detection [28].
The test for detecting changes can be interpreted as a procedure for deciding whether to reject the hypothesis that a new instance is generated from the current distribution. The Student t-test quantifies how significant the difference between two datasets is (Chapter 10 in [29]). For sequential data, the Mann–Kendall test and the CUSUM (cumulative sum) test are suitable [30]. With multiple parameters, the sequence of log-likelihood values is a good indicator for change detection. For a log-likelihood stream, the t-statistic, the Kolmogorov–Smirnov statistic, or the Lepage statistic can be utilized as the test statistic [25].

3. Materials and Methods

The term “usage-phase” denotes a state composing a user’s experience with a mobile app, such as login, viewing goods, or writing/reading a post. In most cases, changes of usage-phase accompany changes in GUI composition. Some changes in usage-phase are easy to detect. However, some changes are so subtle that they make automated testing difficult. For example, if a user reads a long review, dragging down causes changes in screen composition, but the sequence of slightly different images semantically constitutes the same usage-phase. Without semantic knowledge of the current usage, an automated tester is likely to decide that each image corresponds to a new usage-phase and devise an unnecessarily complex testing strategy. The proposed methods aim to detect changes in usage-phases while minimizing the false-positive ratio. In order to achieve our goal in an unsupervised manner, we propose methods based on graphs and probability distributions derived from a screenshot. In Section 3.1.1 and Section 3.1.2, we formulate the problem and describe our dataset. Section 3.2 briefly describes the basic change detection algorithm and its proposed variations. Section 3.3 discusses a method based on SIFT. Section 3.4 describes graph kernel and graph entropy based change detection methods. In Section 3.5, methods utilizing probability distributions are explained.

3.1. Experiment Description

3.1.1. Problem Statement

In this research, a user experience is represented as a stream of images, $D = \{y_1, \ldots, y_n\}$, where each image corresponds to a screenshot of the currently used mobile app. With a given distance metric $d(\cdot, \cdot)$, a change in usage-phase can be detected if the measured difference between two consecutive screenshots $(y_i, y_{i+1})$ is greater than a threshold $\tau$:

$$d(y_i, y_{i+1}) > \tau. \qquad (1)$$

The choice of $d(\cdot, \cdot)$ and $\tau$ is important for implementing error-robust test systems. For the construction of error-robust test systems, we consider not $D$ itself, but a sequence of graphs converted from $D$ (i.e., undirected labeled graphs whose nodes represent UI widgets and whose edges denote relations between UI widgets) and a stream of probability distributions representing each screenshot. For the conversion, UI widgets in $y_i$ are recognized through Faster R-CNN [31].

3.1.2. Data

For our study, we collected 13,272 screenshots from 50 Android apps. After installing each app, a user used the installed app and produced screenshots utilizing the ADB (Android Debug Bridge) tool. In general, approximately 4 to 5 screenshots were sampled per second. After sampling, the user grouped a set of consecutive screenshots as a usage-phase manually. Details on the collected datasets are provided in Table 1.

3.2. Basic Change Detection Algorithm

Algorithm 1 provides the basic working of the proposed methods. In essence, a set of usage-phases, $U$, is constructed by computing Equation (1) on $D = \{y_1, \ldots, y_n\}$. If the computed value is larger than a threshold, $\tau$, a change is regarded as detected. However, there exist variations of Algorithm 1 according to the employed measure and the selection of the threshold, $\tau$. The threshold, $\tau$, is determined by one of three methods: $(Min + Max)/2$, the mean, and the empirical threshold. In order to compute $(Min + Max)/2$, we observe the differences between two consecutive screenshots in the current usage-phase, $U_t$; “Min” and “Max” are the minimum and maximum difference in $U_t$, respectively. Thus, in each usage-phase, $U_t$, $\tau$ is adjusted dynamically. The “mean” is computed by first observing the differences between consecutive screenshots in a mobile app and then averaging these differences. As a result, each app has a distinct “mean” value. In the case of the empirical threshold, we fix $\tau$ to an empirically determined number.
There are also variations in the computation of Equation (1). With the SIFT based method (Section 3.3), Equation (2) is utilized for computing Equation (1). For graph based methods (Section 3.4.1 and Section 3.4.2), Algorithm 1 becomes more complicated. Firstly, the input, $D = \{y_1, \ldots, y_n\}$, is converted into a set of graphs, $G = \{g_1, \ldots, g_n\}$ (for more details, refer to Section 3.4). Then, for graph entropy based detection (Section 3.4.1), the conditional graph entropy (Equation (4)) between two consecutive graphs $g_i$ and $g_{i+1}$ is utilized as a measure of the difference between the two corresponding GUI screenshots $y_i$ and $y_{i+1}$. For graph kernel based detection (Section 3.4.2), the dissimilarity (measured through Equation (5)) between two graphs $g_i$ and $g_{i+1}$ is utilized as a measure of the difference between the two corresponding GUI screenshots.
Algorithm 1:Basic change detection algorithm.
For probability distribution based methods (Section 3.5.1 and Section 3.5.2), the input, $D = \{y_1, \ldots, y_n\}$, is converted into a set of probability distributions, $P = \{P_1, \ldots, P_n\}$ (for more details, refer to Section 3.5). The Kullback–Leibler divergence between consecutive probability distributions $P_i$ and $P_{i+1}$ is computed to measure the difference between the corresponding GUI screenshots $y_i$ and $y_{i+1}$ (Section 3.5.1). In Section 3.5.2, a generative model, $M_t$, for the current usage-phase, $U_t$, is constructed. Then, the likelihood of a new image $y_{new}$ (more precisely, the likelihood of the converted probability distribution $P_{new}$) is computed by Equation (9). By computing Equation (9) repeatedly, a sequence of likelihood values is obtained. We also attempt to detect changes by testing the hypothesis that the probability distribution representing $y_{new}$ is generated by the current usage-phase model ($M_t$) based on the sequence of likelihood values (Section 3.5.3).
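Since Algorithm 1 is available only as a figure in the published layout, the following Python sketch restates the basic detection loop under our assumptions: `distance` stands for any of the measures $d(\cdot, \cdot)$ introduced in Section 3.3, Section 3.4 and Section 3.5, and the dynamically adjusted $(Min + Max)/2$ threshold is recomputed from the differences observed within the current usage-phase.

```python
from typing import Any, Callable, List

def detect_phases(stream: List[Any],
                  distance: Callable[[Any, Any], float]) -> List[List[Any]]:
    """Sketch of Algorithm 1: split a stream of screenshots into usage-phases.

    A change is declared when d(y_i, y_{i+1}) exceeds the dynamic threshold
    tau = (Min + Max) / 2 of the differences observed in the current phase.
    """
    if not stream:
        return []
    phases = [[stream[0]]]          # U_1 starts with the first screenshot
    diffs_in_phase: List[float] = []  # in-phase differences used for the threshold

    for prev, cur in zip(stream, stream[1:]):
        d = distance(prev, cur)
        if diffs_in_phase:
            tau = (min(diffs_in_phase) + max(diffs_in_phase)) / 2.0
        else:
            tau = float("inf")      # bootstrap: need one in-phase difference first
        if d > tau:                 # Equation (1): change detected -> open U_{t+1}
            phases.append([cur])
            diffs_in_phase = []
        else:                       # still the same usage-phase
            phases[-1].append(cur)
            diffs_in_phase.append(d)
    return phases
```

The "mean" and "empirical threshold" variants replace the `tau` computation by a per-app average or by a fixed constant, respectively.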

3.3. SIFT Based Method

The scale-invariant feature transform (SIFT) is utilized to detect local features in an image [32]. Because features extracted by SIFT are invariant to orientation, changes in illumination, and uniform scaling, SIFT is widely used for image and video analysis [33,34]. Target image search based on local features (TISLF) was introduced as a method for comparing target images and images in video sources by means of local features [34]. TISLF is composed of a video segmentation step, a recognition step, and an estimation step. In TISLF, SIFT keypoints are utilized as a similarity measure between two consecutive images $y_i$ and $y_{i+1}$ by:

$$p(y_i, y_{i+1}) = \frac{\text{number of matching points}}{\text{total number of keypoints}}. \qquad (2)$$

The value computed by Equation (2) is interpreted as the probability of two successive images belonging to the same interval. These values can be concatenated into the vector $W = \left[\, p(y_1, y_2), \ldots, p(y_{i-1}, y_i) \,\right]$ representing the similarities over successive images. We utilize Equation (2) for detecting changes and denote it as the “SIFT” method in this paper. The SIFT based method is regarded as the state-of-the-practice method in this work.
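As a rough illustration, the matching ratio of Equation (2) can be computed with OpenCV’s SIFT implementation; note that the ratio-test constant, the grayscale conversion, and the choice of denominator (the smaller keypoint set) are our own assumptions rather than details taken from [34].

```python
import cv2

def sift_similarity(img_a, img_b, ratio: float = 0.75) -> float:
    """Rough analogue of Equation (2): matched keypoints divided by the
    number of keypoints (here: of the smaller of the two keypoint sets)."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY), None)
    kp_b, des_b = sift.detectAndCompute(cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY), None)
    if des_a is None or des_b is None:
        return 0.0                         # no keypoints detected in one image
    matches = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    # Lowe's ratio test keeps only distinctive matches.
    good = [m for m in matches if len(m) == 2 and m[0].distance < ratio * m[1].distance]
    return len(good) / min(len(kp_a), len(kp_b))
```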

3.4. Graph Based Methods

After recognizing the UI widgets in a screenshot, a graph can be obtained from the set of recognized UI widgets and their corresponding coordinates. The given user experience, $D = \{y_1, \ldots, y_n\}$, is converted into a stream of graphs, $G = \{g_1, \ldots, g_n\}$, by Prim’s algorithm. For the graph conversion, each recognized UI widget is considered as a node, and an edge between two nodes carries a label denoting the minimum distance between the two UI widgets. In detail, a complete graph consisting of every recognized UI widget as a node and an edge between every possible pair of nodes is converted into a minimum spanning tree by Prim’s algorithm. The resulting minimum spanning trees act as inputs for the graph based detection methods. With the converted graphs, we can choose $d(\cdot, \cdot)$ in Equation (1) based on measures such as graph entropy [11] or graph kernels [12].
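A minimal sketch of this conversion, assuming each detected widget is supplied as a (label, x, y) tuple with its center coordinates; using center-to-center distance and networkx’s Prim implementation are simplifications of ours, whereas the paper labels edges with the minimum distance between widgets.

```python
import math
from itertools import combinations
import networkx as nx

def screenshot_to_graph(widgets):
    """Convert detected UI widgets [(label, x, y), ...] into a minimum
    spanning tree: nodes carry widget labels, edge weights are distances."""
    complete = nx.Graph()
    for idx, (label, x, y) in enumerate(widgets):
        complete.add_node(idx, label=label, pos=(x, y))
    # Complete graph: one edge per pair of widgets, weighted by their distance.
    for i, j in combinations(range(len(widgets)), 2):
        (_, xi, yi), (_, xj, yj) = widgets[i], widgets[j]
        complete.add_edge(i, j, weight=math.hypot(xi - xj, yi - yj))
    # Prim's algorithm keeps only the minimum-distance skeleton of the layout.
    return nx.minimum_spanning_tree(complete, algorithm="prim")

# Example: three widgets recognized in one screenshot (hypothetical labels).
mst = screenshot_to_graph([("Button", 100, 40), ("TextView", 100, 200), ("ImageView", 300, 200)])
```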

3.4.1. Graph Entropy Based Detection

Entropy based similarity can be interpreted as a measure of the information needed to describe a distribution $P$ utilizing another distribution $Q$ [35]. Therefore, the conditional entropy, $H(X|Y)$, which is the amount of information required to quantify the outcome of a random variable $X$ given another random variable $Y$, gives a lead for the similarity between two graphs $g_1$ and $g_2$. The graph entropy of a graph $G$ and a random variable $X$ is defined as:

$$H_G(X) = \min_{X \in W \in \Gamma(G)} I(W; X). \qquad (3)$$

In Equation (3), $\Gamma(G)$ denotes the family of independent sets of $G$ (here, a set of vertices of $G$ is considered independent if there exists no interconnection between them), $I(W; X)$ is the mutual information $I(W; X) = H(W) - H(W|X)$, and $W$ takes values in $\Gamma(G)$. From the basic definition of graph entropy, the conditional graph entropy [11] is:

$$H_G(X|Y) = \min_{X \in W \in \Gamma(G)} I(W; X|Y). \qquad (4)$$

In Equation (4), $W, X, Y$ form a Markov chain; that is, $p(w|x, y) = p(w|x)$. From Equation (3) and the definition of the mutual information, we are able to compute the entropy of a graph $g_i$ and compare two consecutive graphs.

3.4.2. Graph Kernel Based Detection

A kernel $k(x, x')$ is a similarity measure between $x$ and $x'$. The role of a graph kernel is to evaluate similarity in graph structure (an extensive study on graph kernels is provided in [36]). A graph can be interpreted as bags of vertices and edges. Then, the level of similarity between two graphs, $g_1$ and $g_2$, can be computed by comparing all pairs of vertex labels from $g_1$ and $g_2$:

$$k_{VL}(g_1, g_2) = \sum_{v_i \in g_1} \sum_{v_r \in g_2} k\big(l(v_i), l(v_r)\big), \qquad (5)$$

where $l(v_i)$ denotes the label of vertex $v_i$ and $k(\cdot, \cdot)$ is the equality indicator function. $k_{VL}$ acts as a linear function of the vertex labels in the two graphs. Thus, two consecutive graphs can be compared by graph kernels, especially the vertex label histogram kernel [37].
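Because $k(\cdot, \cdot)$ is the equality indicator, Equation (5) reduces to the dot product of the two graphs’ vertex label histograms; a small sketch with hypothetical widget labels:

```python
from collections import Counter

def vertex_label_histogram_kernel(labels_g1, labels_g2):
    """Equation (5): sum of equality indicators over all vertex pairs,
    which equals the dot product of the two label histograms."""
    hist_1, hist_2 = Counter(labels_g1), Counter(labels_g2)
    return sum(count * hist_2[label] for label, count in hist_1.items())

# Example with hypothetical widget labels from two consecutive screenshots.
k = vertex_label_histogram_kernel(
    ["Button", "TextView", "TextView", "ImageView"],
    ["Button", "TextView", "ImageView", "ImageView"])
# k = 1*1 + 2*1 + 1*2 = 5
```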

3.5. Probability Distribution Based Methods

Each screenshot can be represented as a probability distribution over UI widgets and their connections. For a screenshot $y_i$, the probability distribution of $y_i$ is defined as:

$$P(y_i) = \big(P(v_1), \ldots, P(v_n), P(e_1), \ldots, P(e_m)\big), \qquad (6)$$

where $n$ denotes the number of UI widget categories and $m$ is the number of possible combinations of UI widgets (note: because the categories of UI widgets are fixed, we are able to generate the possible combinations of nodes (UI widgets) and assign a unique identification label to each edge connecting two nodes). Based on Equation (6), a detection method based on the Kullback–Leibler divergence (KLD) measure, a generative approach utilizing the likelihood measure, and a method based on hypothesis testing are conceptualized.
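A sketch of this conversion, assuming fixed global orderings of the $n$ widget categories and the $m$ edge labels (the category names in the example are hypothetical):

```python
import numpy as np

def screenshot_distribution(widget_counts, edge_counts,
                            widget_categories, edge_labels):
    """Equation (6): normalized histogram over UI widget categories and
    edge labels observed in one screenshot."""
    vec = np.array(
        [widget_counts.get(c, 0) for c in widget_categories] +
        [edge_counts.get(e, 0) for e in edge_labels],
        dtype=float)
    total = vec.sum()
    return vec / total if total > 0 else vec

# Hypothetical example: two widget categories and one edge label.
p = screenshot_distribution({"Button": 2, "TextView": 1},
                            {("Button", "TextView"): 2},
                            ["Button", "TextView"], [("Button", "TextView")])
# p = [0.4, 0.2, 0.4]
```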

3.5.1. KLD Based Detection

The Kullback–Leibler divergence, Equation (7), is a measure for comparing two probability distributions. For discrete probability distributions $P$ and $Q$ defined on a probability space $\mathcal{X}$, the Kullback–Leibler divergence is:

$$D_{KL}(P \,||\, Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}. \qquad (7)$$

Because we convert each screenshot into a probability distribution by Equation (6), Equation (7) is a natural choice for computing Equation (1). The procedure is simple: first, two consecutive screenshots $y_i$ and $y_{i+1}$ are represented as probability distributions based on Equation (6); then, the KLD value between the two resulting distributions, $P(y_i)$ and $P(y_{i+1})$, is computed by Equation (7).
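Equation (7) is undefined whenever $Q(x) = 0$ while $P(x) > 0$, which easily happens with the sparse histograms of Equation (6); the sketch below therefore applies additive smoothing, which is our own choice rather than part of the original method.

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-6) -> float:
    """Equation (7) with additive smoothing to avoid division by zero."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))
```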

3.5.2. Usage-Phase Model Based Detection

Once we detect usage-phase changes, we are able to collect the screenshots deemed to belong to the same usage-phase ($U_t = \{y_{t,1}, y_{t,2}, \ldots, y_{t,i}\}$, where $y_{t,i}$ denotes the $i$th screenshot belonging to the $t$th usage-phase, $U_t$). With the probability-distribution-converted screenshots in $U_t$, we build a model, $M_t$, for $U_t$:

$$M_t = \big(P(v^{(1)}|U_t), \ldots, P(v^{(n)}|U_t), P(e^{(1)}|U_t), \ldots, P(e^{(m)}|U_t)\big), \qquad (8)$$

where $P(v^{(k)}|U_t) = \frac{1}{Z} \sum_{i=1}^{|U_t|} v_{t,i}^{(k)}$ and $P(e^{(s)}|U_t) = \frac{1}{Z} \sum_{i=1}^{|U_t|} e_{t,i}^{(s)}$ ($Z$ is a normalization constant making the summation of all elements in $M_t$ equal to one; $v_{t,i}^{(k)}$ and $e_{t,i}^{(s)}$ denote the number of occurrences of the $k$th UI widget and the $s$th edge in the $i$th screenshot of $U_t$, respectively). That is, $P(v^{(k)}|U_t)$ is computed from the number of occurrences of the UI widget $v^{(k)}$ in the screenshots belonging to $U_t$. From Equation (8), we can compute the likelihood $L(y_{new}|M_t)$: the probability of a new screenshot, $y_{new}$, being sampled from the probability distribution represented by $M_t$. The log-likelihood $L(y_{new}|M_t)$ is computed by:

$$L(y_{new}|M_t) = \log C_{M_t} + \sum_{k=1}^{n} \log P(v^{(k)}|U_t)^{\,v_{new}^{(k)}} + \sum_{k=1}^{m} \log P(e^{(k)}|U_t)^{\,e_{new}^{(k)}}, \qquad (9)$$

where $C_{M_t} = \frac{\left(\sum_i v_{new}^{(i)} + \sum_j e_{new}^{(j)}\right)!}{\prod_k v_{new}^{(k)}! \times \prod_l e_{new}^{(l)}!}$, and $v_{new}^{(k)}$ and $e_{new}^{(l)}$ denote the number of occurrences of the $k$th UI widget and the $l$th edge in $y_{new}$, respectively. If the computed likelihood is greater than a threshold, $y_{new}$ is accepted as a member of $U_t$; otherwise, $y_{new}$ is regarded as a member of a new usage-phase, $U_{t+1}$.
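Equation (9) is the log-likelihood of a multinomial model. The sketch below uses gammaln for $\log C_{M_t}$ and assumes that the widget and edge counts of $y_{new}$ are concatenated into one vector; the additive smoothing of the model probabilities is our own assumption.

```python
import numpy as np
from scipy.special import gammaln

def phase_log_likelihood(new_counts: np.ndarray, model_probs: np.ndarray,
                         eps: float = 1e-6) -> float:
    """Equation (9): log C_{M_t} + sum_k counts_k * log P_k, where counts are
    the widget/edge occurrences in y_new and P_k are the M_t probabilities."""
    probs = (model_probs + eps) / (model_probs + eps).sum()
    n_total = new_counts.sum()
    # Log of the multinomial coefficient: log(N!) - sum_k log(counts_k!).
    log_coeff = gammaln(n_total + 1) - gammaln(new_counts + 1).sum()
    return float(log_coeff + (new_counts * np.log(probs)).sum())
```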

3.5.3. Hypothesis Testing Based Detection

By computing Equation (9) on $U_t$, a set of likelihood values is obtained. We are able to determine whether to accept the hypothesis that the set of parameters representing the unknown distribution of $y_{new}$ is equal to the set of parameters specifying $U_t$ by running a likelihood ratio test:

$$\lambda = \frac{L(\hat{\Omega}_0)}{L(\hat{\Omega})} = \frac{\max_{\Theta \in \Omega_0} L(\Theta)}{\max_{\Theta \in \Omega} L(\Theta)}. \qquad (10)$$

In order to compute Equation (10) in our setting, we regard the probability values in Equation (8) as the set of parameters specifying the hypothesis. The elements of the probability distribution (Equation (8)) produced by $U_t$ constitute $\Omega_0$. We calculate the maximum likelihood of the new screenshot $y_{i+1}$, which is observed after the last screenshot in $U_t$ (the graph $g_{t,i}$), based on the assumed model described by $\Omega_0$ (for more details on the likelihood ratio test, refer to [29]).
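For a multinomial model, the denominator of Equation (10) is maximized by the empirical proportions of $y_{new}$, and the multinomial coefficient cancels in the ratio. The sketch below compares $-2 \log \lambda$ against a chi-squared quantile, which is the textbook recipe [29]; the degrees of freedom and the smoothing constant are our own assumptions.

```python
import numpy as np
from scipy.stats import chi2

def likelihood_ratio_change(new_counts: np.ndarray, model_probs: np.ndarray,
                            alpha: float = 0.05, eps: float = 1e-6) -> bool:
    """Equation (10): reject the hypothesis that y_new was generated by the
    current usage-phase model M_t when -2 * log(lambda) is too large."""
    p0 = (model_probs + eps) / (model_probs + eps).sum()   # constrained: M_t parameters
    p1 = (new_counts + eps) / (new_counts + eps).sum()     # unconstrained MLE of y_new
    log_lambda = float((new_counts * (np.log(p0) - np.log(p1))).sum())
    dof = len(new_counts) - 1                              # assumed degrees of freedom
    return -2.0 * log_lambda > chi2.ppf(1.0 - alpha, dof)  # True -> change detected
```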

4. Results and Discussions

The performance of each method was compared based on datasets from 50 commercial mobile apps. In order to measure performance in terms of the fraction of correctly estimated changes and the fraction of detected changes, TP (true positive), TN (true negative), FP (false positive), and FN (false negative) counts were used. TP denotes a case in which the proposed method detects a change successfully. TN is a case where the proposed method does not report a change within the same usage-phase. If the proposed method claims a change in a stream of screenshots that in fact belong to the same usage-phase, this is counted as FP. Finally, FN is a case where the proposed method fails to detect a change. The performance of each proposed method was quantified utilizing precision ($\text{precision} = \frac{TP}{TP + FP}$), recall ($\text{recall} = \frac{TP}{TP + FN}$), and accuracy ($\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$). Precision is the ratio of correctly detected changes among all detections, and recall is the fraction of actual changes that were detected. Accuracy measures the fraction of correct decisions among all estimations.

4.1. Overall Performance

Table 2 reports the overall performance of the proposed methods. In terms of precision and accuracy, the graph based methods (“graph kernel” and “graph entropy” in Table 2) and the probability distribution based methods (“KLD” and “likelihood” in Table 2) achieved better results than the SIFT based method and the hypothesis testing based method when τ was determined dynamically. In terms of recall, however, the SIFT based method and the hypothesis testing based method reported better results than the graph based methods or the other probability distribution based methods. This difference in performance tells us that the SIFT based method and the hypothesis testing based method were overly sensitive to changes in the observed features: they detected more changes, but often incorrectly.
For detecting changes in usage-phases, reducing the ratio of misdetection (the cases of wrongly detecting changes within a usage-phase) is as important as increasing the accuracy, precision, and recall. In order to measure performance in terms of misdetection, we defined the new measures NPrecision (negative precision, $\text{NPrecision} = \frac{TN}{TN + FN}$) and NRecall (negative recall, $\text{NRecall} = \frac{TN}{TN + FP}$). Table 3 reports the performance in terms of NPrecision and NRecall. The methods reported similar performance in terms of NPrecision, but the graph based method (“graph kernel”) and the method focusing on the difference between two probability distributions (“KLD”) achieved better performance than the others in terms of NRecall. From Table 2 and Table 3, we concluded that (1) the “graph kernel” based method and the “KLD” based method achieved promising results compared to the other methods, and (2) dynamically adjusted thresholds were better than fixed thresholds.
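For reference, the five counting measures reduce to straightforward ratios; a small sketch (the zero-division guard is our own addition), checked here against the graph kernel row of Table 2:

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Precision, recall, accuracy, NPrecision, and NRecall from the four counts."""
    safe = lambda num, den: num / den if den else 0.0  # guard against empty denominators
    return {
        "precision":  safe(tp, tp + fp),
        "recall":     safe(tp, tp + fn),
        "accuracy":   safe(tp + tn, tp + fp + tn + fn),
        "nprecision": safe(tn, tn + fn),
        "nrecall":    safe(tn, tn + fp),
    }

# Example: the graph kernel row of Table 2 with the (Min + Max)/2 threshold.
print(detection_metrics(tp=323, fp=368, tn=10832, fn=1749))
# precision ~0.467, recall ~0.156, accuracy ~0.840, nprecision ~0.861, nrecall ~0.967
```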

4.2. Case Studies

Dividing a stream of screenshots into distinct usage-phases is not easy. Figure 1 provides an example. In this example, it is possible to group Figure 1a–c into one usage-phase (because they all represent searching) or to group only Figure 1b,c as they show search results. Because it is ambiguous how to form distinct groups from such screenshots, we refer to these cases as ambiguous cases and examine the performance with respect to them.
Table 4 reports the performance on the datasets considering ambiguous cases. Of the 13,272 screenshots, 736 were designated as ambiguous cases. As expected, the performance on the ambiguous cases (numbers in parentheses in Table 4) was poor. The performance on the datasets excluding ambiguous cases was slightly better than the performance reported in Table 2 because the screenshots corresponding to ambiguous cases amounted to only 5.5% of all screenshots.
Graph based methods require multiple nodes and edges to construct a graph. If a screenshot contains too few UI widgets (Figure 2) or the number of successfully recognized UI widgets is too small, graph based methods are likely to produce poor performance. Table 5 reports the performance on the datasets excluding cases that generate too-small graphs (graphs with fewer than three nodes) and the performance on the datasets corresponding to too-small graphs (numbers in parentheses in Table 5). From Table 2 and Table 5, we can see that the performance on the datasets excluding small graphs was better than the performance on the datasets containing all screenshots. However, the improvement was not impressive due to the small number of screenshots corresponding to small graphs (4.5% of all screenshots).
Besides precision, recall, and accuracy, the number of estimated usage-phases gave insight into the performance of the proposed methods. Table 6 provides the average number of estimated usage-phases (the number of manually labeled usage-phases is provided in Table 1). The performance in Table 6 informs us that methods based on SIFT (when utilizing the “mean” threshold), graph entropy (when utilizing the “empirical threshold”), and hypothesis testing were too sensitive.
Compared to the average value of the number of the manually-labeled usage-phases (265.44), it seemed that the SIFT based method and hypothesis testing achieved better results. However, we should be cautious when comparing the number of detected usage-phases. SIFT and hypothesis testing based methods detected more usage-phases than other methods, but the usage-phases detected by these two methods contained more false positive cases.
In this research, we proposed various candidates for detecting changes in a mobile app’s usage in an unsupervised manner. The empirical results in Table 2 and Table 3 state that each method was relatively good at avoiding misdetection, but poor at detecting changes. Graph based methods utilizing the graph kernel and graph entropy and probability distribution based methods based on Kullback–Leibler divergence (KLD) and the usage-phase model (the “likelihood” method) achieved promising results in terms of accuracy. However, the performance from the perspective of precision and recall showed that we could not confirm the superiority of any method compared to other methods.
However, our work provided valuable insights for other researchers. The first insight was the need for a dynamically adjusted threshold. Our research confirmed that in order to detect changes, a threshold should be adjusted dynamically (Table 2, Table 4, and Table 5). The empirical results showed that conventional change detecting methods based on hypothesis testing ([28,30]) were not suitable for detecting changes in usage-phases of a mobile app. The inferior performance of the hypothesis testing based method resulted from the lack of sufficient screenshots in each usage-phase. In our datasets, each usage-phase consisted of 6.0 screenshots on average. Thus, conventional hypothesis testing based methods may be unsuitable for app testing.
The second insight was the need for utilizing probability distributions resulting from screenshots. Although a graph kernel based method achieved better results overall, graph based methods had some deficiencies due to external causes. Firstly, current object recognition techniques are not perfect, so there existed some UI widgets that were unrecognized or wrongly recognized. Secondly, each screenshot from a mobile app contained a relatively small number of UI widgets. Thus, the failure of recognition could hinder the formation of graphs.

5. Conclusions

This paper presented change detection methods based on graph entropy, graph kernels, the Kullback–Leibler divergence, a generative model, and a hypothesis testing method. By utilizing recent advancements in object recognition, we were able to convert the input sequence of GUI screenshots into a sequence of graphs or probability distributions, thus constructing more robust change detection methods compared to the current state-of-the-practice method. The proposed methods detected changes in an unsupervised manner, not requiring training. This elimination of training requirements was a very significant advantage compared to the current change detection methods. Contrary to the datasets analyzed by existing change detection methods, our intended application, mobile app testing, typically had fewer than 10 GUI screenshots per usage-phase; thus, we could not obtain enough instances to train a model. In addition, the requirements of mobile app testing did not allow time for training. Our experimental results demonstrated that the proposed methods achieved promising results compared to the current state-of-the-practice method, but the results also clarified that the proposed methods should be improved before being employed for mobile app testing.
Our experience from this research and from daily usage of mobile apps tells us that the existence of usage-phases in an app cannot be denied, but that it is very difficult to define usage-phases in an objective manner and that the concept of a usage-phase is inherently ambiguous. In our future work, we will develop methods based on these findings. Rather than attempting to fix change-occurring points, we will search for micro usage-phases (composed of 2–3 screenshots) and then combine these micro usage-phases into bigger clusters dynamically. Although this approach is likely to be inferior to the methods proposed in this research in terms of memory cost, the hierarchical combination of micro usage-phases may overcome the inherent ambiguity of usage-phases. We also have plans to enhance the proposed methods by utilizing additional features such as event generation.

Author Contributions

Conceptualization, H.-S.S.; Methodology, H.-S.S.; Software, H.C. and R.K.; Validation, H.C., R.K., and H.-S.S.; Formal analysis, H.-S.S.; Investigation, H.C., R.K., and H.-S.S.; Data curation, H.C. and R.K.; Writing, original draft preparation, H.-S.S.; Writing, review and editing, H.-S.S.; Visualization, H.C. and R.K.; Supervision, H.-S.S.; Project administration, H.-S.S.; Funding acquisition, H.-S.S. All authors read and agreed to the published version of the manuscript.

Funding

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2018R1D1A1B07047156).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial intelligence
CUSUM: Cumulative sum
ELM: Extreme learning machine
FN: False negative
FP: False positive
GUI: Graphical user interface
KLD: Kullback–Leibler divergence
LSDD: Least-squares density-difference
ML: Machine learning
NPrecision: Negative precision
NRecall: Negative recall
OCR: Optical character recognition
R-CNN: Regions with convolutional neural network
SIFT: Scale-invariant feature transform
TISLF: Target image search based on local features
TN: True negative
TP: True positive
UI: User interface

References

  1. Mao, K.; Harman, M.; Jia, Y. Sapienz: Multi-objective automated testing for android applications. In Proceedings of the 25th International Symposium on Software Testing and Analysis, Saarbrücken, Germany, 18–20 July 2016; pp. 94–105. [Google Scholar]
  2. Google. Android Monkey. 2017. Available online: https://developer.android.com/studio/test/monkey (accessed on 25 May 2020).
  3. Wetzlmaier, T.; Ramler, R.; Putschögl, W. A framework for monkey GUI testing. In Proceedings of the 9th IEEE International Conference on Software Testing, Verification and Validation, Chicago, IL, USA, 11–15 April 2016; pp. 416–423. [Google Scholar]
  4. Nyman, N. Using monkey test tools. STQE 2000, 29, 18–23. [Google Scholar]
  5. White, T.D.; Fraser, G.; Brown, G.J. Improving random GUI testing with image-based widget detection. In Proceedings of the ACM SIGSOFT International Symposium on Software Testing and Analysis, Beijing, China, 15–19 July 2019; pp. 307–317. [Google Scholar]
  6. Degott, C.; Borges, N.P., Jr.; Zeller, A. Learning user interface element interactions. In Proceedings of the ACM SIGSOFT International Symposium on Software Testing and Analysis, Beijing, China, 15–19 July 2019; pp. 296–306. [Google Scholar]
  7. Saumya, C.; Koo, J.; Kulkarni, M.; Bagchi, S. XSTRESSOR: Automatic generation of large-scale worst-case test inputs by inferring path conditions. In Proceedings of the 12th IEEE International Conference on Software Testing, Verification and Validation, Xi’an, China, 22–27 April 2019; pp. 1–12. [Google Scholar]
  8. Moran, K.; Linares-Vásquez, M.; Bernal-Cárdenas, C.; Vendome, C.; Poshyvanyk, D. Automatically discovering, reporting and reproducing android application crashes. In Proceedings of the IEEE International Conference on Software Testing, Verification and Validation, Chicago, IL, USA, 10–15 April 2016; pp. 33–44. [Google Scholar]
  9. Iwashita, A.S.; Papa, J.P. An overview on concept drift learning. IEEE Access 2019, 7, 1532–1547. [Google Scholar] [CrossRef]
  10. Lu, J.; Liu, A.; Dong, F.; Gu, F.; Gama, J.; Zhang, G. Learning under concept drift: A review. IEEE Trans. Knowl. Data Eng. 2019, 31, 2346–2363. [Google Scholar] [CrossRef] [Green Version]
  11. Orlitsky, A.; Roche, J.R. Coding for computing. IEEE Trans. Inf. Theory 2001, 47, 903–917. [Google Scholar] [CrossRef]
  12. Vishwanathan, S.V.N.; Schraudolph, N.N.; Kondor, R.; Borgwardt, K.M. Graph kernels. J. Mach. Learn. Res. 2010, 11, 1201–1242. [Google Scholar]
  13. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006; pp. 42–43. [Google Scholar]
  14. Li, X.; Hu, W.; Shen, C.; Zhang, Z.; Dick, A. A survey of appearance models in visual object tracking. ACM Trans. Intell. Syst. Technol. 2013, 58, 1–48. [Google Scholar] [CrossRef] [Green Version]
  15. Wang, Y.; Du, L.; Dai, H. Unsupervised SAR image change detection based on SIFT keypoints and region information. IEEE Geosci. Remote Sens. Lett. 2016, 13, 931–935. [Google Scholar] [CrossRef]
  16. Akoglu, L.; Tong, H.; Koutra, D. Graph-based anomaly detection and description: A survey. Data Min. Knowl. Discov. 2015, 29, 626–688. [Google Scholar] [CrossRef] [Green Version]
  17. Ramírez-Gallego, S.; Krawczyk, B.; García, S.; Woźniak, M.; Herrera, F. A survey on data processing for data stream mining: Current status and future directions. Neurocomputing 2017, 239, 39–57. [Google Scholar] [CrossRef]
  18. Gama, J.; Medas, P.; Castillo, G.; Rodrigues, P. Learning with drift detection. In Advances in Artificial Intelligence—SBIA 2004. SBIA 2004. Lecture Notes in Computer Science; Bazzan, A.L.C., Labidi, S., Eds.; Springer: Berlin/Heidelberg, Germany, 2004; Volume 3171, pp. 286–295. [Google Scholar]
  19. Huang, G.; Huang, G.-B.; Song, S.; You, K. Trends in extreme learning machines: A review. Neural Netw. 2015, 61, 32–48. [Google Scholar] [CrossRef]
  20. Xu, S.; Wang, J. Dynamic extreme learning machine for data stream classification. Neurocomputing 2017, 238, 433–449. [Google Scholar] [CrossRef]
  21. Dasu, T.; Krishnan, S.; Venkatasubramanian, S.; Yi, K. An information-theoretic approach to detecting changes in multi-dimensional data streams. In Proceedings of the 38th Symposium on the Interface of Statistics, Computing Science, and Applications, Pasadena, CA, USA, 24–27 May 2006; pp. 1–24. [Google Scholar]
  22. Nguyen, T.D.; Du Plessis, M.C.; Kanamori, T.; Sugiyama, M. Constrained least-squares density-difference estimation. IEICE Trans. Inf. Syst. 2014, 97, 1822–1829. [Google Scholar] [CrossRef] [Green Version]
  23. Bu, L.; Alippi, C.; Zhao, D. A PDF-free change detection test based on density difference estimation. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 324–334. [Google Scholar] [CrossRef] [PubMed]
  24. Ross, G.J.; Adams, N.M. Two nonparametric control charts for detecting arbitrary distribution changes. J. Qual. Technol. 2012, 44, 102–116. [Google Scholar] [CrossRef]
  25. Alippi, C.; Boracchi, G.; Carrera, D.; Roveri, M. Change detection in multivariate datastreams: Likelihood and detectability loss. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, New York, NY, USA, 9–15 July 2016; pp. 1368–1374. [Google Scholar]
  26. Ranshous, S.; Shen, S.; Koutra, D.; Harenberg, S.; Faloutsos, C.; Samatova, N.F. Anomaly detection in dynamic networks: A survey. WIREs Comput. Stat. 2015, 7, 223–247. [Google Scholar] [CrossRef]
  27. Grattarola, D.; Zambon, D.; Alippi, C.; Livi, L. Change detection in graph streams by learning graph embeddings on constant-curvature manifolds. arXiv 2019, arXiv:1805.06299v3. [Google Scholar] [CrossRef] [Green Version]
  28. Zambon, D.; Alippi, C.; Livi, L. Concept drift and anomaly detection in graph streams. IEEE Trans. Neur. Netw. Learn. Syst. 2018, 29, 5592–5605. [Google Scholar] [CrossRef] [Green Version]
  29. Wackerly, D.D.; Mendenhall, W., III.; Scheaffer, R.L. Likelihood ratio tests. In Mathematical Statistics with Applications; Thomson Brooks/Cole: Belmont, NV, USA, 2008; pp. 549–550. [Google Scholar]
  30. Alippi, C.; Roveri, M. An adaptive CUSUM-based test for signal change detection. In Proceedings of the International Symposium on Circuits and Systems, Island of Kos, Greece, 21–24 May 2006; pp. 5752–5755. [Google Scholar]
  31. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [Green Version]
  32. Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; pp. 1150–1157. [Google Scholar]
  33. Park, M.-H.; Park, R.-H.; Lee, S.W. Shot boundary detection using scale invariant feature matching. Proceedings of SPIE Visual Communications and Image Processing; SPIE: Bellingham, WA, USA, 2006; pp. 569–577. [Google Scholar]
  34. Guan, B.; Ye, H. Target image video search based on local features. arXiv 2019, arXiv:1808.03735v2. [Google Scholar]
  35. Korhonen, A.; Krymolowski, Y. On the robustness of entropy-based similarity measures in evaluation of subcategorization acquisition systems. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), Taipei, Taiwan, 31 August–1 September 2002; pp. 1–7. [Google Scholar]
  36. Kriege, N.M.; Johansson, F.D.; Morris, C. A survey on graph kernels. Appl. Netw. Sci. 2020, 5, 6. [Google Scholar] [CrossRef] [Green Version]
  37. Sugiyama, M.; Borgwardt, K.M. Halting in Random Walk Kernels. Adv. Neural Inf. Process. Syst. 2015, 28, 1630–1638. [Google Scholar]
Sample Availability: Samples of the GUI screenshots are available from the authors.
Figure 1. An example of the ambiguous division of usage-phases. (a) An initial search window. (b) Corresponding search result. (c) The same search result with images.
Figure 2. An example of a screenshot generating a too-small graph.
Table 1. A list of tested apps and the number of usage-phase screenshots (numbers in parentheses give the number of manually determined usage-phases / the average number of GUI screenshots per usage-phase).

4shared: 231 (34/6.8) | alba heaven: 92 (8/11.5) | albamon: 176 (20/8.8) | amazon: 201 (35/5.7) | baedal: 135 (8/16.9) | cars: 294 (42/7.0) | cbs sports: 252 (18/14.0) | cgv: 154 (36/4.3)
coupang: 133 (19/7.0) | deep booster: 433 (55/7.9) | door dash: 580 (187/3.1) | drink water reminder: 217 (22/9.9) | ebay: 143 (24/6.0) | emart: 158 (13/12.2) | espn: 97 (15/6.5) | facebook: 225 (27/8.3)
fish brain: 310 (40/7.8) | google translator: 189 (32/5.9) | groupon: 335 (47/7.1) | hawhae: 196 (20/9.8) | home&shopping: 177 (46/3.8) | kakao bus: 279 (59/4.7) | korail: 212 (40/5.3) | little caesars: 193 (48/4.0)
mcdonalds: 373 (69/5.4) | melon: 201 (11/18.3) | messenger: 388 (53/7.3) | my fitness: 387 (49/7.9) | naver webtoon: 150 (17/8.8) | news break: 273 (25/10.9) | offerup: 365 (32/11.4) | papago: 173 (27/6.4)
pinterest: 394 (49/8.0) | pluto tv: 299 (32/9.3) | poshmark: 348 (74/4.7) | roku: 333 (22/15.1) | shareit: 458 (63/7.3) | smart news: 263 (28/9.4) | spotify: 263 (57/4.6) | the weather channel: 195 (18/10.8)
today home: 227 (31/7.3) | triple: 237 (72/3.3) | tubi: 227 (14/16.2) | wayfair: 549 (96/5.7) | wish: 393 (68/5.8) | workout for woman: 359 (56/6.4) | yanolja: 258 (63/4.1) | yelp: 411 (76/5.4)
yeogieotae: 208 (44/4.7) | zumo: 128 (31/4.1)

The mean of the manually-labeled usage-phases: 265.44
Table 2. Change detection performance in 50 apps.

Threshold (Min + Max)/2:
Method | TP | FP | TN | FN | Precision | Recall | Accuracy
SIFT | 684 | 1413 | 9787 | 1388 | 0.326 | 0.330 | 0.789
Graph kernel | 323 | 368 | 10,832 | 1749 | 0.467 | 0.156 | 0.840
Graph entropy | 369 | 682 | 10,518 | 1703 | 0.351 | 0.178 | 0.820
KLD | 264 | 428 | 10,772 | 1808 | 0.382 | 0.127 | 0.832
Likelihood | 260 | 432 | 10,768 | 1812 | 0.376 | 0.125 | 0.831
Hypothesis testing | 1084 | 5481 | 5719 | 988 | 0.165 | 0.523 | 0.513

Threshold Mean:
Method | TP | FP | TN | FN | Precision | Recall | Accuracy
SIFT | 1031 | 4306 | 6894 | 1041 | 0.193 | 0.498 | 0.597
Graph kernel | 682 | 1550 | 9650 | 1390 | 0.306 | 0.329 | 0.778
Graph entropy | 953 | 2413 | 8787 | 1119 | 0.283 | 0.460 | 0.734
KLD | 778 | 2458 | 8742 | 1294 | 0.240 | 0.375 | 0.717
Likelihood | 781 | 2533 | 8667 | 1291 | 0.236 | 0.377 | 0.712
Hypothesis testing | 1085 | 5480 | 5720 | 987 | 0.165 | 0.524 | 0.513

Empirical threshold:
Method | TP | FP | TN | FN | Precision | Recall | Accuracy
SIFT | 706 | 1141 | 10,059 | 1366 | 0.382 | 0.341 | 0.811
Graph kernel | 569 | 927 | 10,273 | 1503 | 0.380 | 0.275 | 0.817
Graph entropy | 1159 | 3921 | 7279 | 913 | 0.228 | 0.559 | 0.636
KLD | 538 | 1105 | 10,095 | 1534 | 0.327 | 0.260 | 0.801
Likelihood | 751 | 2214 | 8986 | 1321 | 0.253 | 0.362 | 0.734
Hypothesis testing | 1085 | 5482 | 5718 | 987 | 0.165 | 0.524 | 0.513
Table 3. Change detection performance in 50 apps: misdetection.

Threshold (Min + Max)/2:
Method | NPrecision | NRecall
SIFT | 0.876 | 0.874
Graph kernel | 0.861 | 0.967
Graph entropy | 0.861 | 0.939
KLD | 0.856 | 0.962
Likelihood | 0.856 | 0.961
Hypothesis testing | 0.853 | 0.511

Threshold Mean:
Method | NPrecision | NRecall
SIFT | 0.869 | 0.616
Graph kernel | 0.874 | 0.862
Graph entropy | 0.887 | 0.785
KLD | 0.871 | 0.781
Likelihood | 0.870 | 0.774
Hypothesis testing | 0.853 | 0.511

Empirical threshold:
Method | NPrecision | NRecall
SIFT | 0.880 | 0.898
Graph kernel | 0.872 | 0.917
Graph entropy | 0.889 | 0.650
KLD | 0.868 | 0.901
Likelihood | 0.872 | 0.802
Hypothesis testing | 0.853 | 0.511
Table 4. Performance on the dataset excluding ambiguous cases (numbers in parentheses give the performance on the dataset of ambiguous cases).

Threshold (Min + Max)/2:
Method | TP | FP | TN | FN | Precision | Recall | Accuracy
SIFT | 610 (74) | 1401 (12) | 9643 (144) | 882 (506) | 0.303 (0.860) | 0.409 (0.128) | 0.818 (0.296)
Graph kernel | 208 (115) | 355 (13) | 10,689 (143) | 1284 (465) | 0.369 (0.898) | 0.139 (0.198) | 0.869 (0.351)
Graph entropy | 294 (75) | 672 (10) | 10,372 (146) | 1198 (505) | 0.304 (0.882) | 0.197 (0.129) | 0.851 (0.300)
KLD | 190 (74) | 442 (6) | 10,622 (150) | 1302 (506) | 0.310 (0.925) | 0.127 (0.128) | 0.862 (0.304)
Likelihood | 187 (73) | 426 (6) | 10,618 (150) | 1305 (507) | 0.305 (0.924) | 0.125 (0.126) | 0.862 (0.303)
Hypothesis testing | 819 (265) | 5401 (80) | 5643 (76) | 673 (315) | 0.132 (0.768) | 0.549 (0.457) | 0.515 (0.463)

Threshold Mean:
Method | TP | FP | TN | FN | Precision | Recall | Accuracy
SIFT | 905 (126) | 4251 (55) | 6793 (101) | 587 (454) | 0.176 (0.696) | 0.607 (0.217) | 0.614 (0.308)
Graph kernel | 480 (202) | 1505 (45) | 9539 (111) | 1012 (378) | 0.242 (0.818) | 0.322 (0.348) | 0.799 (0.425)
Graph entropy | 774 (179) | 2379 (34) | 8665 (122) | 718 (401) | 0.245 (0.840) | 0.519 (0.309) | 0.753 (0.409)
KLD | 577 (202) | 2409 (43) | 8635 (113) | 915 (378) | 0.193 (0.824) | 0.387 (0.348) | 0.735 (0.428)
Likelihood | 582 (200) | 2485 (44) | 8559 (112) | 910 (380) | 0.190 (0.820) | 0.390 (0.345) | 0.729 (0.424)
Hypothesis testing | 821 (265) | 5403 (80) | 5641 (76) | 671 (315) | 0.132 (0.768) | 0.550 (0.457) | 0.515 (0.463)

Empirical threshold:
Method | TP | FP | TN | FN | Precision | Recall | Accuracy
SIFT | 629 (77) | 1127 (14) | 9917 (142) | 863 (503) | 0.358 (0.846) | 0.422 (0.133) | 0.841 (0.298)
Graph kernel | 392 (177) | 891 (36) | 10,153 (120) | 1100 (403) | 0.306 (0.831) | 0.263 (0.305) | 0.841 (0.404)
Graph entropy | 956 (203) | 3869 (52) | 7175 (104) | 536 (377) | 0.198 (0.796) | 0.641 (0.350) | 0.649 (0.417)
KLD | 406 (131) | 1087 (18) | 9957 (138) | 1086 (449) | 0.272 (0.879) | 0.272 (0.226) | 0.827 (0.365)
Likelihood | 549 (201) | 2175 (34) | 8869 (122) | 943 (379) | 0.202 (0.855) | 0.368 (0.347) | 0.751 (0.439)
Hypothesis testing | 819 (265) | 5399 (80) | 5645 (76) | 673 (315) | 0.132 (0.768) | 0.549 (0.457) | 0.516 (0.463)
Table 5. Performance on datasets capable of forming large enough graphs (numbers in parentheses give the performance on datasets forming relatively small graphs).

Threshold (Min + Max)/2:
Method | TP | FP | TN | FN | Precision | Recall | Accuracy
SIFT | 621 (63) | 1404 (9) | 9515 (272) | 1136 (252) | 0.307 (0.875) | 0.353 (0.200) | 0.800 (0.562)
Graph kernel | 254 (69) | 351 (17) | 10,568 (264) | 1503 (246) | 0.420 (0.802) | 0.145 (0.219) | 0.854 (0.559)
Graph entropy | 342 (27) | 661 (21) | 10,258 (260) | 1415 (288) | 0.341 (0.563) | 0.195 (0.086) | 0.836 (0.482)
KLD | 207 (57) | 415 (13) | 10,504 (268) | 1550 (258) | 0.333 (0.814) | 0.118 (0.181) | 0.845 (0.545)
Likelihood | 203 (57) | 420 (12) | 10,499 (269) | 1554 (258) | 0.326 (0.826) | 0.116 (0.181) | 0.844 (0.547)
Hypothesis testing | 983 (101) | 5436 (45) | 5483 (236) | 774 (214) | 0.153 (0.692) | 0.559 (0.321) | 0.510 (0.565)

Threshold Mean:
Method | TP | FP | TN | FN | Precision | Recall | Accuracy
SIFT | 907 (124) | 4220 (86) | 6699 (195) | 850 (191) | 0.177 (0.590) | 0.516 (0.394) | 0.600 (0.535)
Graph kernel | 585 (97) | 1510 (40) | 9409 (241) | 1172 (218) | 0.279 (0.708) | 0.333 (0.308) | 0.788 (0.567)
Graph entropy | 862 (91) | 2385 (28) | 8534 (253) | 895 (224) | 0.265 (0.765) | 0.491 (0.289) | 0.741 (0.577)
KLD | 680 (99) | 2407 (45) | 8512 (236) | 1077 (216) | 0.220 (0.688) | 0.387 (0.314) | 0.730 (0.562)
Likelihood | 684 (98) | 2484 (45) | 8435 (236) | 1073 (217) | 0.216 (0.685) | 0.389 (0.311) | 0.720 (0.560)
Hypothesis testing | 985 (101) | 5438 (45) | 5481 (236) | 772 (214) | 0.153 (0.692) | 0.561 (0.321) | 0.510 (0.565)

Empirical threshold:
Method | TP | FP | TN | FN | Precision | Recall | Accuracy
SIFT | 640 (66) | 1129 (12) | 9790 (269) | 1117 (249) | 0.362 (0.846) | 0.364 (0.210) | 0.823 (0.562)
Graph kernel | 473 (96) | 887 (40) | 10,032 (241) | 1284 (219) | 0.348 (0.706) | 0.269 (0.305) | 0.829 (0.565)
Graph entropy | 1064 (95) | 3892 (29) | 7027 (252) | 693 (220) | 0.215 (0.766) | 0.606 (0.302) | 0.638 (0.582)
KLD | 458 (79) | 1079 (26) | 9840 (255) | 1299 (236) | 0.298 (0.752) | 0.261 (0.251) | 0.812 (0.560)
Likelihood | 649 (101) | 2164 (45) | 8755 (236) | 1108 (214) | 0.231 (0.692) | 0.369 (0.321) | 0.742 (0.565)
Hypothesis testing | 983 (101) | 5434 (45) | 5485 (236) | 774 (214) | 0.153 (0.692) | 0.559 (0.321) | 0.510 (0.565)
Table 6. Average number of the estimated usage-phases (the average number of manually-labeled usage-phases, “Real”, is 41.44).

Method | (Min + Max)/2 | Mean | Empirical threshold
SIFT | 41.94 | 106.74 | 36.94
Graph kernel | 13.82 | 44.64 | 26.92
Graph entropy | 21.02 | 67.32 | 101.6
KLD | 13.84 | 64.62 | 32.84
Likelihood | 13.84 | 66.22 | 59.18
Hypothesis testing | 131.3 | 131.38 | 131.26
