E-SERS: An Enhanced Approach to Trust-Based Ranking of Apps

Abstract: The number of mobile applications ("Apps") has grown significantly in recent years. App Stores rank/recommend Apps based on factors such as average star ratings and the number of installs. Such rankings do not focus on the internal artifacts of Apps (e.g., security vulnerabilities). If internal artifacts are ignored, users may fail to estimate the potential risks associated with installing Apps. In this research, we present a framework called E-SERS (Enhanced Security-related and Evidence-based Ranking Scheme) for comparing Android Apps that offer similar functionalities. E-SERS uses internal and external artifacts of Apps in the ranking process. E-SERS is a significant enhancement of our past evidence-based ranking framework called SERS. We have evaluated E-SERS on publicly accessible Apps from the Google Play Store and compared our rankings with prevalent ranking techniques. Our experiments demonstrate that E-SERS, leveraging its holistic approach, excels in identifying malicious Apps and consistently outperforms existing alternatives in ranking accuracy. By emphasizing comprehensive assessment, E-SERS empowers users, particularly those less experienced with technology, to make informed decisions and avoid potentially harmful Apps. This contribution addresses a critical gap in current App-ranking methodologies, enhancing the safety and security of today's technologically dependent society.


Introduction
Mobile application ("App") markets ("App Stores"), such as the Google Play Store, Apple App Store, Amazon App Store, and Windows Phone App Store, currently have more than 5 million Apps (https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/, accessed on 4 March 2021). These markets provide reviews and star ratings of Apps on a scale from 1 to 5 and use the weighted average star rating score to promote specific Apps [1]. Many studies have indicated that App ratings and associated reviews correlate positively with downloads and sales of Apps [2][3][4][5][6].
To assess this premise, we created a simple survey with one question ("In general, what is the most important factor that users consider to assess an App before downloading?") and distributed the survey to attendees of our session at the IEEE TPS conference in 2019. The participants were randomly chosen for this survey and consisted of academics, students, and practitioners. The survey was conducted anonymously, and we did not request the participants to provide their demographic data. We received 130 responses. The response summary is given below.
We recognize that the sample size was rather small and the participants in our survey were rather homogeneous, but the responses (as indicated in Figure 1) show a similar outcome as that described by Lin et al. [7]. As reviews and rating scores are important in selecting an App, developers try to manipulate these two factors. Third parties also provide App-promoting assistance (e.g., fake reviews [8]) and guarantee a certain rank desired by the developers for a certain time. In addition, user-provided rating scores have limitations: the average rating is often influenced by users' two extreme preferences of either one star or five stars [9].
In our opinion, the average star rating score is not comprehensive enough for selecting a particular App, as the star ratings are not always consistent with the user comments, and many times these comments tend to be unstructured and less focused on the technical aspects of the Apps [10]. In addition, we have found that, for many Apps, the average ratings, as evident from the associated narratives [10], do not address issues related to security risks (e.g., data leakage). Many Apps provide personalized services (e.g., SMS services) to the users. Such Apps usually ask users for explicit permissions to obtain personal information (e.g., contact details). A wrong setting of permissions may result in potential risks associated with the unintended disclosure of their sensitive data; malicious Apps have been reported in numerous studies [11][12][13][14][15][16]. Once a user's data is compromised, they may incur significant hardship while trying to contain the impact of the exposure. In addition, as we had highlighted [10], there tends to be a disparity between the internal (e.g., programmatic features) and external (e.g., user reviews) views of Apps.
The above discussion indicates a need for a comprehensive ranking approach that will encompass several factors, including trust about the behavior of the Apps. Such an approach will enable users to pick a trustworthy App from the available choices. We had proposed an approach, SERS (Security-related and Evidence-based Ranking Scheme), that addressed this need [17,18]. SERS uses principles of the theory of evidence [19], subjective logic (SL) [20,21], static taint analysis, and natural language processing. The trust of an App, in SERS, is defined as the ability of an App to deliver the promised behavior under various operating situations and not to disclose any critical data. SERS computes a comprehensive trust score for an App by considering its internal and external artifacts. (We recognize that App-related cybersecurity is a vast topic with many facets. Our focus in that study, and in this one as well, has been rather narrow: providing a holistic view that considers internal and external factors of an App and its role in ranking similar Apps. We are of the opinion that such a view, as it considers multiple pieces of evidence, will empower users to make a proper selection out of the available choices for their specific needs.) SERS, however, did not consider the presence of multiple sources to generate evidence, temporal and reputational features of user reviews, and the reputation of the sources used to generate internal evidence. Here, we describe E-SERS (Enhanced SERS), which specifically addresses these three issues. To examine the acceptance of these enhancements, we conducted another informal survey with the same audience and asked the following question: "Which one of the following ranking schemes could be the right fit to evaluate an App?". We, again, received 130 responses. These responses indicated that a combined ranking scheme (43.8%) is more acceptable than rankings based solely on the average user rating, users' review sentiments, internal factors, or external factors.

The principles behind E-SERS are generic, but this study evaluates the principles and the prototype in the Google Play Store context and compares it with other ranking techniques. Future work may extend to applying E-SERS to other App Stores. Hence, the specific contributions of this paper are as follows: (i) E-SERS formalizes SERS so that it can support any number of sources for generating the necessary evidence for a given App. (ii) This framework includes a reputation score for each of the sources used to generate internal and external evidence. (iii) The system features an enhanced risk assessment matrix associated with user permissions. (iv) The methodology quantifies and uses temporal and reputational aspects of user reviews. (v) The approach incorporates feedback from surveys within the computing community, highlighting the preference for combined ranking schemes over simplistic rating-based approaches.
In this study, we address the problem of ranking similar Apps by considering a holistic view and empirically evaluating the proposed approach using Apps from the Google Play Store. The rest of the paper is organized as follows: Section 2 surveys related efforts. Section 3 discusses the E-SERS framework. Section 4 presents the evaluation of E-SERS. Section 5 presents experimental results. Finally, Section 6 states the threats to validity and concludes the paper with a summary of the research.

Related Literature
Sentiment analysis (SA): SA has been used to analyze reviews about products and movies [22][23][24]. A few studies have also applied SA to App Store reviews [25,26]. Sangani et al. [27] applied the review-to-topic mapping approach to pinpoint the features of Apps most demanded by users. Pagano and Maalej [3] and Palomba et al. [28] have examined the types of user feedback and unveiled how developers monitor user reviews and correlate them to users' ratings. A few research efforts have computed trust tuples based on the reviews of Apps [29,30]. These efforts have focused only on the user reviews; we, in E-SERS, combine internal and external views of Apps to generate trust quantification of Apps.
Data flow analysis: User permissions play an important role [31] in identifying possible malicious activities of Apps. There are studies (e.g., DroidRanger [32] and DroidRisk [33]) that have assessed permission-based risks of Apps. DroidRisk considers the frequency and the number of permissions an App requires. Sarma et al. [31] and Gates et al. [34] have assigned high-risk quantification to severe permissions. However, permissions alone are not sufficient to assess and quantify risks, as not all requested permissions are actively utilized during execution [35]. In E-SERS, we focus only on the faulty data flows and corresponding permissions, similar to the approach suggested by Mirzaei et al. [36] to categorize the data flows into benign and malicious classes.
Trust: Trust has been studied in networks [41], the Internet of Things [42], and social [43] and legal [44] communities; trust is established between the trustor and trustee through observing prior events [20,45]. In our past work [46], we had presented a comprehensive survey of trust in the software domain. In [30], we developed a trust model that is based on subjective logic for incorporating trust with events [47]. Here, we have enhanced our previous models [10,17,18,30] and formalized the evidence-based trust management framework to infer direct and indirect trust artifacts for any given App.
Fraud act detection: Hernandez et al. [48] presented the 'Racketstore' platform, which collects App usage details and reviews to detect any fraudulent activity that an App's developer may practice to increase the rank of their App. Here, the authors' approach is based entirely on the indirect trust artifacts of an App. In [49], the authors have proposed a methodology to increase the trustworthiness of user engagement metrics (e.g., number of installs) by identifying incentivized App installations, which is also based on external artifacts (e.g., offer details). E-SERS focuses on both direct and indirect trust artifacts. It aims to empower users by providing the trust score of an App instead of reporting fraud acts or discovering App Store policy violations.
Traditional methods for App ratings: Popular App Stores, such as the Google Play Store, provide an average star rating (between 1 and 5) for an App based on individual user ratings. E-SERS uses a more comprehensive scheme to rate Apps.
Ranking of Apps: Existing research efforts (e.g., [50,51]) are based on either an internal or an external view. Zhu et al. [50] presented a hybrid ranking principle that combines risk scores and the overall rating. The risk factor is established based on the permissions requested by the App, and the risk value is determined by examining each of the dangerous permissions an App requests. Using permissions alone to estimate risk has serious limitations and is inaccurate. Cen et al. [51] used a crowd-sourcing ranking approach to solve the App risk assessment problem from users' comments. However, users' comments are subjective; thus, E-SERS focuses on both programmatic and user perspectives of an App.

Architecture
The conceptual architecture of E-SERS is illustrated in Figure 2 (discussed in this section). The four basic components of E-SERS, and the notations that we use throughout the paper, are as follows:

App's Artifacts (AAs): The AAs are categorized into "Direct Trust Artifacts" (DTAs) and "Indirect Trust Artifacts" (ITAs). DTAs indicate various internal evidence about an App and are gathered from APK files, source code, and jar files of an App. In contrast, user opinions, such as ratings and reviews, contribute to the ITAs of an App.
Evidence Sources: The evidence source set, S, for an App X is divided into two mutually exclusive subsets, S_DT and S_IT, which denote the lists of sources that are used to generate the DTAs and the ITAs, respectively. Each evidence source (S_i ∈ S) generates a set of evidence, EV_Si^X = {ev_1, ..., ev_n}. Each piece of evidence, ev_i, can be positive, negative, or neutral. Different techniques are used for extracting various types of evidence.
Evidence Processors: Each S_i, for an App X, has an associated evidence processor, EP_i. An EP_i maps the set of evidence, EV_Si^X, to an opinion ω_Si^X. Each source may produce different evidence and, therefore, before fusing such different opinions, we need to normalize them so that evidence from a reputed source has more weight than evidence from non-reputed ones. To do so, we have introduced the concept of source reputation into E-SERS. The reputation of each source, ω_ri^Si, is combined with the opinion ω_Si^X to compute the weighted opinion ω_ri:Si^X. Like the technique suggested in [21], we use the discounting (or weighted) operator (⊗) to represent the degree of trust about an evidence source.
Opinion Fusion: Opinions from different sources, ω_S1^X, ..., ω_Sn^X, can be combined into a single opinion (ω_⊕S^X) using the consensus operator (⊕) [52]. However, the consensus operator treats opinions equally; hence, in E-SERS, we have used the cumulative weighted fusion operator [53] to combine opinions and create a trust score for an App.
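The two operators above can be sketched as follows. This is a minimal, illustrative implementation of Jøsang's discounting and consensus operators over (b, d, u, a) tuples, assuming equal base rates across sources; the cumulative weighted fusion operator used in E-SERS generalizes the consensus step shown here.

```python
def discount(rep, op):
    """Discounting operator (⊗): weight opinion `op` by the source's
    reputation opinion `rep`; both are (b, d, u, a) tuples."""
    rb, rd, ru, ra = rep
    b, d, u, a = op
    # Belief and disbelief are scaled by the reputation's belief;
    # everything not vouched for by the source becomes uncertainty.
    return (rb * b, rb * d, rd + ru + rb * u, a)


def consensus(op1, op2):
    """Consensus operator (⊕): fuse two opinions about the same App,
    assuming equal base rates."""
    b1, d1, u1, a1 = op1
    b2, d2, u2, a2 = op2
    k = u1 + u2 - u1 * u2  # normalization factor
    return ((b1 * u2 + b2 * u1) / k,
            (d1 * u2 + d2 * u1) / k,
            (u1 * u2) / k,
            a1)
```

Note that fusing two opinions reduces uncertainty: the fused u is the (normalized) product of the input uncertainties, which is smaller than either input.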

Evidence to Opinion Mapping
We use SL to represent an opinion about an App. The opinion about an App X, created by a source S_i, (ω_Si^X), is indicated by a (b, d, u, a) tuple. Here, b, d, and u represent the belief, disbelief, and uncertainty that a proposition (that we can trust the App X) is true, and a is the base rate that the proposition is correct in the absence of any evidence. The (b, d, u, a) tuple is calculated using the following equations [54]:

b = p / (p + q + n)    (1)
d = q / (p + q + n)    (2)
u = n / (p + q + n)    (3)
a = 1 / n              (4)

In these formulae, p and q are the numbers of positive and negative pieces of evidence, and 'n' indicates the possible outcomes about any evidence. In E-SERS, 'n' is equal to 2 because an evidence can either be present or absent in an App. A trust score of an App X, obtained from an opinion, ω^X, is measured as the expected value (E_X) that indicates the probability that X is trustworthy and is calculated as:

E_X = b + a · u    (5)

3.2.2. Algorithms for Computing the Trust Score for an App X

Algorithm 1 accepts DTAs, ITAs, sources of evidence, and the user-desired weights (α and β) for both views of an App X and computes its trust score.
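A minimal sketch of the evidence-to-opinion mapping described above, assuming the standard subjective-logic mapping from [54] in which p positive and q negative pieces of evidence are combined with n = 2 possible outcomes:

```python
def evidence_to_opinion(p, q, n=2):
    """Map p positive and q negative pieces of evidence to a
    (b, d, u, a) opinion tuple (Equations (1)-(4))."""
    total = p + q + n
    b = p / total      # belief grows with positive evidence
    d = q / total      # disbelief grows with negative evidence
    u = n / total      # uncertainty shrinks as evidence accumulates
    a = 1.0 / n        # base rate in the absence of any evidence
    return b, d, u, a


def expected_value(b, d, u, a):
    """Trust score E_X = b + a * u (Equation (5))."""
    return b + a * u
```

For example, 8 positive and 2 negative pieces of evidence yield b = 8/12, d = 2/12, u = 2/12, and an expected trust of 0.75.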

Algorithm 1. Computation of the Trust Score for an App

procedure calculateTrustScore(DTA_X, ITA_X, S_DT, S_IT, α, β)
    # Generate internal opinion from DTA for an App X.
    ω_⊕S_DT^X ← create_internal_opinion(DTA_X, S_DT)
    # Generate external opinion from ITA for an App X.
    ω_⊕S_IT^X ← ...(ITA_X, S_IT, α, β)
    # Apply Formula (5) to compute expected value and normalize

Algorithm 2 maps input DTA_X to the direct trust-based tuple, ω_⊕S_DT^X. Different evidence, generated from S_DT, is classified as positive or negative based on its behavior towards the App. Equations (1) to (4) are then used to compute the direct trust tuple of the App. The discounting operator is used to combine a source's reputation with ω_Si^X to compute ω_ri:Si^X. Opinions from all sources are merged using the consensus operator to create a single opinion, ω_⊕S_DT^X. Algorithm 3 maps input ITA_X to the indirect trust tuple, ω_⊕S_IT^X. Here, for each piece of evidence, the reputation of each review and the associated temporal weight are used to determine the influence of the evidence on the indirect trust tuple of the App. The core of Algorithm 3 proceeds as follows:

for each source S_i ∈ S_IT do
    Apply Equations (1) to (4) to determine (b, d, u, a), ω_Si^X.
    Evaluate reputation (r_i) of S_i based on F1-score, ω_ri^Si.
    Calculate weighted opinion of S_i, ω_ri:Si^X, using the discounting operator.
end for
Apply consensus operator to fuse opinions from different sources and compute ω_⊕S_IT^X.
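A speculative, runnable sketch of Algorithm 1's final step; the assumption here (ours, not the paper's exact formulation) is that α and β linearly weight the internal and external opinions before Formula (5) is applied:

```python
def trust_score(op_internal, op_external, alpha, beta):
    """Combine the internal and external opinions of an App with
    user-desired weights alpha and beta, then apply E = b + a * u."""
    b_i, d_i, u_i, a = op_internal
    b_e, d_e, u_e, _ = op_external
    w = alpha + beta
    b = (alpha * b_i + beta * b_e) / w   # weighted belief
    u = (alpha * u_i + beta * u_e) / w   # weighted uncertainty
    return b + a * u                     # Formula (5)
```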

E-SERS Approach and Evaluation
We created a prototype, using the above algorithms, and empirically evaluated it in the context of the Google Play Store. We identified five popular categories in the Google Play Store (Shopping, Travel, Insurance, Finance, and News). From these categories, we selected 25 Apps for this empirical evaluation.

Computation of Direct Trust
To generate the DTAs of an App from the above set of Apps, like [18], we used an open-source static taint analyzer tool, FlowDroid [37]; other researchers have also used this tool [36,55]. FlowDroid traces sensitive information associated with an App by identifying source-sink pairs. It then returns detailed information (e.g., the name of an API method that tries to read/write sensitive information from the App to third parties) about unauthorized leaks of any confidential data. In our experimentation, we also used another tool (FindBugs). For the sake of brevity, here we only discuss the results obtained from FlowDroid. Any identified leaks are considered as internal evidence. This set of evidence is expressed as ω_S1^X and is mapped to the internal trust tuple as described below.

Evidence Mapping to Trust Tuple Creation
Mapping Evidence of S_1 to ω_S1^X. In [18], we had introduced a four-step analysis for mapping sensitive data leaks to trust tuples. It consisted of: (i) identifying sensitive source-sink pairs using FlowDroid, (ii) classifying sources and sinks into various categories using SuSi [56], (iii) assessing the risk factors associated with these pairs using NIST guidelines [57,58], and (iv) computing the internal trust tuple using Equations (1)-(4). Step (iii) is enhanced in E-SERS by employing a 4 × 4 risk assessment matrix (as opposed to the preliminary 3 × 3 heuristic-based matrix used in [17,18]), and its output is used in step (iv). Hence, below we describe only steps (iii) and (iv).
Step (iii): In this step, we assess the risk factors associated with permissions that are granted to sensitive APIs. Android divides the permissions into different protection levels that affect whether run-time permission requests are required or not. Potential risks of using the permissions are characterized as Normal, Signature, and Dangerous. We collected 91 permission identifiers (36 Normal, 29 Signature, and 26 Dangerous permission identifiers) from the Android site [59] and mapped them to the corresponding APIs using PScout [60], which is a technique to map API calls to permission identifiers.
NIST guidelines for risk management of information technology systems [57,58] are followed to assess the quantitative risk associated with the Android permissions. According to these guidelines, risk assessment is defined as:

R(P) = L(P) × I(P)    (6)

where P is the requested permission, R(P) is the risk of P, and L(P) and I(P) are the likelihood and the impact of P, respectively. Likelihood indicates the probability that a potential vulnerability may be exercised within the construct of the threat environment. Impact is used to measure the level of risk resulting from a successful threat exercise of a vulnerability. The determination of these risk levels is subjective, due to the assignment of a probability to the likelihood of each threat level and a value for its impact. We used the enhanced risk assessment matrix shown in Table 1; it contains four levels of likelihood and impact. By applying Equation (6), the risk assessment scale is divided into three distinct categories: High (>50 to 100), Moderate (>10 to 50), and Low (1 to 10). Based on the permission that is requested, the level of impact is classified into four different categories: Catastrophic (identifiers that fall into the Dangerous permission identifiers category), Critical (identifiers that fall into the Signature permission identifiers category), Marginal (identifiers that fall into the Normal permission identifiers category), and Negligible (identifiers that do not belong to any of the permission identifiers categories). The source and sink categories are placed into different likelihood categories based on their appearance in the App's source code.
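The risk computation of Equation (6), together with the three-way bucketing, can be sketched as follows. The numeric scales assigned to the likelihood and impact levels here are illustrative assumptions; the paper's Table 1 defines the actual matrix values.

```python
# Illustrative numeric scales; the actual values come from Table 1.
LIKELIHOOD = {"Frequent": 1.0, "Probable": 0.75, "Remote": 0.5, "Improbable": 0.25}
IMPACT = {"Catastrophic": 100, "Critical": 75, "Marginal": 50, "Negligible": 10}


def risk(likelihood, impact):
    """R(P) = L(P) * I(P), bucketed into High/Moderate/Low (Equation (6))."""
    r = LIKELIHOOD[likelihood] * IMPACT[impact]
    if r > 50:
        return r, "High"
    if r > 10:
        return r, "Moderate"
    return r, "Low"
```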
We selected three malware data sets, one from VirusShare (https://virusshare.com) and two others from Drebin (https://drebin.mlsec.org/), accessed on 6 March 2021, which together contain 2555 malicious Apps. On these Apps, we ran FlowDroid and stored the source and sink categories that it reported. If any of the source/sink categories appear in all three malware data sets, then those are classified as belonging to the Frequent class; if they appear in two of the observed data sets, then those belong to the Probable class; if they appear in only one data set, they are classified into the Remote class; and the rest of the categories are considered as belonging to the Improbable class. The assignment of source and sink distributions to different likelihood categories is given below in Table 2.

Step (iv): Any evidence that ensures data confidentiality is positive evidence, and any that involves information leakage is negative evidence. Along with the analysis report generated by FlowDroid, we also keep track of the run-time log file. From that log file, we extract the number of total sources (ST) that exist in an App's code. If there is no leak, then ST is considered as the total number of positive evidence. If data leaks are found, then the positive evidence is calculated by subtracting the number of faulty sources (SF) from ST, where SF indicates the sources involved in information leakage. Once all evidence is generated, Equations (1)-(4) are applied to compute the (b, d, u, a) tuple that reflects the opinion ω_S1^X.
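The likelihood classification and the evidence counting of step (iv) reduce to two small helpers; the data shapes used here are assumptions for illustration:

```python
def likelihood_class(datasets_seen_in):
    """Likelihood class of a source/sink category, from the number of
    malware data sets (out of three) in which it appears."""
    return {3: "Frequent", 2: "Probable", 1: "Remote", 0: "Improbable"}[datasets_seen_in]


def direct_evidence(total_sources, faulty_sources):
    """Step (iv): positive evidence = ST - SF, negative evidence = SF."""
    return total_sources - faulty_sources, faulty_sources
```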

Computing Opinion of Direct Trust
After computing the opinion of FlowDroid (henceforth referred to as S_1), ω_S1^X, about an App X, we then evaluate the reputation of S_1, indicated as ω_r1^S1. We have used precision, recall, and F1-score to compute ω_r1^S1. To assess FlowDroid (i.e., S_1), DroidBench [61] is utilized [36]. The result of this benchmarking effort is presented in Table 3 [37]. The reputation tuple is based on the F1-score, as it gives a better measure of the wrongly classified instances than the accuracy metric [62]. The F1-score is considered as the value of belief, and the rest is assigned to disbelief. Here, the uncertainty remains zero due to the assumption that domain experts formulate the benchmarks, so there is hardly any chance for ambiguity. The reputation score of S_1, using the details provided in Table 3, is (0.89, 0.11, 0, 0.5). Next, Equation (5) is applied to compute ω_r1^S1. To compute the direct trust of Apps, we used a single source (FlowDroid) to generate evidence; hence, the fusion of opinions is not required here (i.e., ω_⊕S_DT^X = ω_r1:S1^X).

We selected five Apps, each from the Shopping, Travel, Insurance, Finance, and News categories in the Google Play Store. These categories have been identified by NowSecure (https://www.nowsecure.com/, accessed on 7 March 2021), a leading security company, in their research efforts [10,63]. There are other solutions that detect harmful viruses present in Apps (e.g., Google Play Protect, https://developers.google.com/android/play-protect, accessed on 7 March 2021). However, the warnings generated by such alternatives are not quantifiable. NowSecure generates a risk score for an App; this score is based on the Common Vulnerability Scoring System (CVSS). The major difference between our approach and NowSecure is that we have introduced a mapping scheme to compute a CVSS score for the good practices too. We investigated the association between NowSecure and the insights based on DTAs; we do not have a subscription to NowSecure's paid service, so we could not gather any evidence about the Apps in the data set. In each of these five categories, we selected one App that was used by NowSecure in their study. After that, we identified four other Apps that were "similar in functionality" (indicated by the Google Play Store) to that App and had a reasonable number (the average number of reviews per App is 2100) of user reviews. The data set that we created for experimentation contained reviews from 23 July to 19 October 2019. For each App, we collected three different data items using an in-house review crawler: the App's basic details (e.g., user rating, total number of reviews and installs, etc.), its Newest reviews, and its Most Relevant reviews. The Google Play Store characterizes an App's reviews into three distinct categories: Newest, Most Relevant, and Rating; we focused only on the Newest and Most Relevant data sets. Reviews are converted to Unicode and then stored. Before sentiment analysis, reviews are decoded, via the Unicode data library [64], to remove umlauts, accents, and other similar features.
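The source-reputation computation described above (F1-score as belief, the remainder as disbelief, zero uncertainty) can be sketched as follows, assuming true-positive, false-positive, and false-negative counts from a benchmark such as DroidBench:

```python
def source_reputation(tp, fp, fn, base_rate=0.5):
    """Reputation tuple of an evidence source: F1-score becomes belief,
    the remainder disbelief, and uncertainty stays zero (expert-built
    benchmarks are assumed unambiguous)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return (f1, 1.0 - f1, 0.0, base_rate)
```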

Mapping Sentiment Value to Opinion Model
The IBM Watson Natural Language Understanding [65] tool ("Watson", also denoted as S_2 in the following discussion) is used to predict the sentiment of preprocessed reviews. Watson returns a sentiment score in the range of [−1, +1], indicating whether a given review reflects the positive or negative sentiment of the user. Watson's opinion is mapped to compute ω_S2^X = (b_S2^X, d_S2^X, u_S2^X, a_S2^X) as discussed below.

Sentiment Score to (b, d, u) Tuple Mapping
We followed a conversion scheme with boundary cases, like Gallege [29], while mapping Watson's opinion to ω_S2^X. However, they used a linear regression model, whereas we used a random forest regression model [66], as the mean absolute error is typically higher for linear regression than for random forest regression. Table 4 contains the boundary cases for converting textual sentiments from a sentiment score to a (b, d, u) tuple. Here, (0, 1, 0) represents extreme disbelief and (1, 0, 0) represents extreme belief about a review. These boundary cases are fed into a random forest regression model to predict b and d, and then compute u.
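The paper feeds the boundary cases of Table 4 into a random forest regression model; as a dependency-free stand-in, the sketch below linearly interpolates between the same anchor points to map a sentiment score to a (b, d, u) tuple:

```python
# Anchor points from Table 4, as (sentiment score, b, d); u = 1 - b - d.
ANCHORS = [(-1.0, 0.0, 1.0), (-0.75, 0.0, 0.75), (-0.5, 0.0, 0.5),
           (-0.25, 0.0, 0.25), (0.25, 0.25, 0.0), (0.5, 0.5, 0.0),
           (0.75, 0.75, 0.0), (1.0, 1.0, 0.0)]


def sentiment_to_bdu(score):
    """Map a sentiment score in [-1, +1] to a (b, d, u) tuple by
    interpolating between the neighboring boundary cases."""
    for (s0, b0, d0), (s1, b1, d1) in zip(ANCHORS, ANCHORS[1:]):
        if s0 <= score <= s1:
            t = (score - s0) / (s1 - s0)
            b = b0 + t * (b1 - b0)
            d = d0 + t * (d1 - d0)
            return b, d, 1.0 - b - d
    raise ValueError("sentiment score outside [-1, +1]")
```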

Table 4. Boundary cases for converting a sentiment score to a (b, d, u) tuple.

Sentiment Score  (b, d, u)          Sentiment Score  (b, d, u)
−1               (0, 1, 0)          +1               (1, 0, 0)
−0.75            (0, 0.75, 0.25)    +0.75            (0.75, 0, 0.25)
−0.5             (0, 0.5, 0.5)      +0.5             (0.5, 0, 0.5)
−0.25            (0, 0.25, 0.75)    +0.25            (0.25, 0, 0.75)

Review Reputation
To determine the reputation of reviews, researchers have adopted reviewer-centric methods [67,68]. Such a reviewer-centric approach is not feasible here, as the Google Play Store does not provide reviewer details. Hence, we used a review-centric approach to determine the reputation of reviews. The Most Relevant category contains the set of reviews that were liked by other users. We use this category to establish the reputation of any review; we utilize the 'num of likes' and the 'sentiment score' of the Most Relevant reviews. Next, the mapping mechanism mentioned above is applied to convert the sentiment score of a review to a (b, d, u) tuple. The (b, d, u) tuples of Most Relevant reviews are clustered (using k-means [69]) into different clusters (C_1, C_2, ..., C_N). Finally, the average number of 'total likes' (L_r) over all reviews (∀r) that belong to a cluster C_i is used as the weight for that cluster and computed as:

weight(C_i) = ( Σ_{∀r ∈ C_i} L_r ) / |C_i|

Once the weight is determined for each cluster, we predict the cluster membership for reviews in the Newest data set. Based on the cluster determination, the corresponding weight is assigned to each review. A high value of the weight represents a highly reputed review, and a low value denotes lower importance of that review (probably a fake review). Thus, this approach reduces the influence of fake reviews while computing the trust score of an App.
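The cluster-weighting step can be sketched as follows, assuming the Most Relevant reviews have already been clustered (the paper uses k-means on their (b, d, u) tuples) and are given as (cluster_id, num_likes) pairs:

```python
from collections import defaultdict


def cluster_weights(reviews):
    """Weight of each cluster = average 'total likes' of its reviews.
    `reviews` is a list of (cluster_id, num_likes) pairs."""
    likes = defaultdict(list)
    for cid, num_likes in reviews:
        likes[cid].append(num_likes)
    return {cid: sum(v) / len(v) for cid, v in likes.items()}


def review_reputation(cluster_id, weights):
    """A Newest review inherits the weight of its predicted cluster;
    a low weight flags a low-reputation (possibly fake) review."""
    return weights[cluster_id]
```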

Determination of Temporal Weight
App developers routinely release new versions that fix bugs and update features. Therefore, it is appropriate to treat old and recent reviews differently. We have introduced a temporal weight for each review to reduce the impact of older reviews. The weight is determined by a Hawkes process, a self-exciting spatio-temporal point-process model [70]. To this model, we feed the timestamps of reviews from the Newest reviews data set. The model then learns to exponentially weigh reviews going back in time and returns the corresponding weight for each timestamp. For simplicity, we have normalized the temporal weights to a scale of 10.
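A minimal sketch of such temporal down-weighting, using a plain exponential decay in place of the fitted Hawkes-process intensity (the decay rate and day-based timestamps are illustrative assumptions):

```python
import math

def temporal_weights(timestamps, decay=0.01, scale=10.0):
    """Exponentially down-weight older reviews; the newest review gets `scale`."""
    newest = max(timestamps)
    # weight decays exponentially with the age of the review
    return [scale * math.exp(-decay * (newest - t)) for t in timestamps]

# Review timestamps in days (illustrative); day 400 is the most recent review.
weights = temporal_weights([100, 300, 400])
```

In the actual pipeline, the decay shape would come from the Hawkes model fitted to the Newest reviews rather than from a fixed rate.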
Prior work has similarly mapped Watson's opinion to ω_X^(S2); however, that work used the linear regression model, whereas we used the random forest regression model [66], as the mean absolute error is typically higher for linear regression than for random forest regression. Table 4 contains the boundary cases for converting textual sentiments from a sentiment score to a (b, d, u) tuple. Here, (0, 1, 0) represents extreme disbelief and (1, 0, 0) represents extreme belief about a review. These boundary cases are fed into a random forest regression model to predict b and d; u is then computed from them.


Computing Opinion of Indirect Trust
Three elements are required to determine ω_X^(S2): the review sentiment score, the temporal weight, and the weight of the review reputation. By multiplying the two weights, we compute the total weight for a review [71]. A review with a sentiment score between 0 and 1 is treated as positive evidence, and one with a score between 0 and −1 as negative evidence. After generating all evidence, Formulae (1)-(4) are applied to compute the (b, d, u, a) tuple that constitutes the opinion ω_X^(S2). After computing the opinion of the tool S2, we need to evaluate its reputation. The existing literature provides Watson's (i.e., S2's) F1-score for data sets such as movie reviews and Twitter comments. As reviews of Apps are conceptually different from these data sets, we created a benchmark based on the collected reviews to assess S2. We asked four domain experts to manually label the sentiment of 750 reviews each, for a total of 3000 reviews. To ensure the quality of the labels, we exchanged the reviews between these experts and cross-verified the outcomes. If a discrepancy was observed, the review was relabeled according to the majority judgment. From this labeled data set, we randomly picked 1000 positive reviews and 1000 negative reviews to create the benchmarking set; the confusion matrix for this benchmark data set is shown in Table 5. Based on this matrix, the Precision and Recall values for Watson are 0.89 and 0.85, and the F1-score of Watson is 0.87. Thus, the reputation of S2 (ω_(S2)^(r2)) is (0.87, 0.13, 0, 0.5). Next, the discounting operator is applied to compute ω_X^(r2:S2). To compute the opinion of indirect trust, we have used a single source (Watson) to generate evidence; hence, the fusion of opinions is not required here (ω_X^(r2:S2) ↔ ω_X^(⊕S_IT)).
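The evidence-to-opinion mapping and the discounting step can be sketched as below; we assume Formulae (1)-(4) take the standard subjective-logic form with a non-informative prior weight W = 2, and that `discount` follows Jøsang's trust-discounting operator (the evidence counts are illustrative):

```python
def opinion_from_evidence(p, n, W=2, a=0.5):
    """Map positive/negative evidence counts to a (b, d, u, a) opinion;
    the standard subjective-logic mapping assumed for Formulae (1)-(4)."""
    total = p + n + W
    return (p / total, n / total, W / total, a)

def discount(rep, op):
    """Josang's trust-discounting operator: weigh a source's opinion by
    the reputation opinion (rep) held about that source."""
    bt, dt, ut, _ = rep
    b, d, u, a = op
    return (bt * b, bt * d, dt + ut + bt * u, a)

watson_rep = (0.87, 0.13, 0.0, 0.5)           # from the F1-score benchmark
review_op = opinion_from_evidence(p=8, n=2)   # illustrative evidence counts
discounted = discount(watson_rep, review_op)  # discounting raises uncertainty
```

Note how discounting by a less-than-perfect reputation moves probability mass from belief/disbelief into uncertainty, which is the intended effect of weighing Watson's output by its benchmarked reliability.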

Evidence Processor and Opinion Fusion
After computing the opinions for the direct trust (ω_X^(⊕S_DT)) and indirect trust (ω_X^(⊕S_IT)) of an App, we combine them into a single opinion using the cumulative weighted fusion operator. The direct trust-based evidence is likely to have less ambiguity, as it solely focuses on the functional perspectives of an App. Hence, we assign a lower weight to ω_X^(⊕S_IT) than to ω_X^(⊕S_DT); the assigned weights are 30% and 70%, respectively. These weights can be adjusted as the user desires. (We understand that a user's ability to tolerate risk, and hence their trust in an unknown App, is subjective and also depends upon their technical background; the notion of trust is thus inherently user-dependent. What E-SERS provides to users is a framework that considers many facets of any App. The trust scores, and hence the rankings, of similar Apps provided by E-SERS are intended to empower users in their selection process: the user is given a choice of weighting internal and external evidence as per their preference, and that will affect the ranking of similar Apps.) This resultant opinion (ω_X^(⊕(S_DT, S_IT))) accounts for all available evidence and thus provides a more reliable quantification of the trust associated with each App than the average star ratings provided by the Google Play Store. The ω_X^(⊕(S_DT, S_IT)) allows us to calculate the trust score (E_X) using Equation (5), which is normalized to a scale of 5. The value of E_X helps to rank-order similar Apps. The ranking generated by E-SERS is compared with other alternatives using the Kendall Tau distance method [72].
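A deliberately simplified sketch of this combination step; `blend` is a plain linear mix standing in for the cumulative weighted fusion operator, and Equation (5) is assumed to be the expected probability E = b + a·u scaled to 5 (the opinion values are illustrative):

```python
def blend(op_dt, op_it, w_dt=0.7, w_it=0.3):
    """Weighted linear blend of two opinions; a simplified stand-in for
    the cumulative weighted fusion operator (weights are user-adjustable)."""
    return tuple(w_dt * x + w_it * y for x, y in zip(op_dt, op_it))

def trust_score(op, scale=5.0):
    """Assumed form of Equation (5): expected probability E = b + a*u,
    normalized to a scale of 5."""
    b, d, u, a = op
    return scale * (b + a * u)

direct = (0.6, 0.3, 0.1, 0.5)    # illustrative direct-trust opinion
indirect = (0.2, 0.6, 0.2, 0.5)  # illustrative indirect-trust opinion
fused = blend(direct, indirect)
score = trust_score(fused)       # rank-order similar Apps by this score
```

Raising the indirect-trust weight pulls the score toward what the reviews say; raising the direct-trust weight pulls it toward the static-analysis evidence, which matches the user-adjustable trade-off described above.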

Experimental Results
In our study, we have created a data set of 25 popular Android Apps from distinct categories available in the Google Play Store. We selected categories of Apps that were identified by NowSecure in their research effort [10,63]: Shopping, Travel, Insurance, Finance, and National and Local News. From each category, five different Apps were picked for our experiments. In each category, we selected one App that had been used by NowSecure in their study. After that, we identified four other Apps that were "similar in functionality" (as indicated by the Google Play Store) to that App and had a reasonable number of user reviews (the average number of reviews per App is 2100). The selected Apps span different ranges of popularity (most popular, popular, and less popular) in terms of the number of installs. In our previous work [18], we addressed the correlation between the traditional star rating, popularity (number of installs), and trust of an App. In [18], we performed our experiment on a data set of 35 Apps taken from the Google Play Store. That data set indicated the following behavior: ■ If we consider only the traditional star ratings of all the Apps, as a typical App user would, we find that there is hardly any difference between Apps; however, the number of installs for each App varies widely. This highlights the fact that the traditional star rating does not accurately reflect the trust of an App.

■
In our experimental data set, based on the associated evidence that SERS generated, a less popular App (in terms of the number of downloads) was assessed as more secure than other, more popular Apps. Thus, SERS provides users with a comprehensive view of an App and helps them select a more secure App instead of simply following the traditional ratings.
In the following discussion, we do not disclose Apps' identifiable details to maintain anonymity.

Findings from DTA Sources
The number of data leaks identified by FlowDroid for each category of Apps, along with the reported source and sink categories, is presented in Table 6. Source and sink APIs that belong to NO CATEGORY are not reported here, as they refer to non-sensitive data flows in SuSi [56]. In [33], the authors identify that sources categorized as NETWORK INFORMATION and UNIQUE IDENTIFIER are more likely to occur in malware Apps than in benign Apps. In addition, that study indicates that malware Apps are more prone to use the short message service (SMS) as a sink to leak data to third parties; such scenarios are found in our test data set too. For the News category Apps, we noticed that the numbers of source APIs belonging to the UNIQUE IDENTIFIER category and of sink APIs that refer to SMS_MMS were higher than in the other categories. An interesting insight from the direct trust-based results is that the Apps selected from the News category were more likely to leak sensitive information than the ones from the other categories. A similar observation was reported by NowSecure, who indicated that almost all local news Apps (in their data set) leaked user data, and that 40% of them had severe security vulnerabilities that could lead to sensitive information being compromised.

Findings from ITA Sources
As indicated, we collected a data set of 25 Apps from five distinct categories. The data set of the associated user reviews is described in Table 7. The average number of words per review indicates that the Most Relevant reviews are always more detailed than the reviews in the Newest category. After examining the review sentiments and the corresponding ratings, we found mismatches. Consider this review, for example: "Don't care for this app. Too confusing, even when it works.". The user provided a rating of 5 for this review, whereas Watson returned a negative sentiment, reflecting a mismatch. We also performed a review-based evidence analysis between the Newest and Most Relevant reviews data sets, presented in Figure 4. For every category, there is a clear mismatch of evaluation based on these two data sets.
For example, the Newest reviews of App2 in the Shopping category mostly indicate positive sentiment, whereas feedback in the Most Relevant data set indicates a mix of positive and negative sentiments. However, we noticed a significant difference in the News category. Here, the sentiment score for each App's Newest reviews data set deviates from a high score down to a low score for the Most Relevant reviews data set. For example, the sentiment score of App3 in the News category deviates from [0.75, −0.25] to [0, −0.25]. This indicates that in the News category, users are experiencing similar difficulties (such as ads, malware, bugs, etc.) to those previously highlighted by others. Consider the following partial review with a high number of likes in the News category: "Used to be 5 stars until ads started popping up. There are ads running continuously on the top of the screen. ... I have to delete this App because its ruined now. ..."; the number of likes for this comment is 1765 and the sentiment score is −0.909597.
From the above discussion, it can be inferred that looking only at the Newest review data set is not ideal, as it fails to unfold the detailed behavior of an App from the users' point of view. Therefore, the user should observe the Most Relevant reviews as well. However, in most of the cases in the Most Relevant review data set, the sentiment score was found to be negative. We can hence, for our data set, conclude that the reviews in the Most Relevant category tend to have more negative sentiment than those in the Newest category, which reflects that users are more inclined to 'like' criticism rather than appreciation of an App. Overall, users give 'like's or write reviews to express their dissatisfaction or the problems that they are facing. We also examined the number of reviews specific to bug or security concerns (presented in Table 7). To determine that, we created a list of keywords containing bug, fix, problem, issue, defect, crash, solve, permission, privacy, security, spy, spam, malicious, and leaks; most of these keywords are described by Maalej et al. [73] under the bug reports review type.
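The keyword screening can be sketched as follows; this uses exact word matching for simplicity (so, e.g., 'crashing' would not match 'crash' without stemming), and the review texts are illustrative:

```python
KEYWORDS = {"bug", "fix", "problem", "issue", "defect", "crash", "solve",
            "permission", "privacy", "security", "spy", "spam",
            "malicious", "leaks"}

def flag_reviews(reviews):
    """Return reviews mentioning at least one bug- or security-related keyword."""
    flagged = []
    for text in reviews:
        # crude tokenization: lowercase, strip basic punctuation, split on spaces
        words = set(text.lower().replace(",", " ").replace(".", " ").split())
        if words & KEYWORDS:
            flagged.append(text)
    return flagged

reviews = [
    "App keeps crashing on startup",  # 'crashing' != 'crash': no exact match
    "Please fix the login bug",
    "Great app, love it",
]
flagged = flag_reviews(reviews)
```

A production version would likely apply stemming or lemmatization so that morphological variants of the keywords are also caught.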

The keyword distribution, shown in Table 8, indicates that users have provided more bug-related feedback than security-related concerns. However, the small total number of bug- and security-related reviews indicates that typical users are not aware of these internal issues. We found that one of the most popular Apps (App2) in the Finance category, with more than 10 million installs, actively leaks sensitive user information.

Comparison of Different Ranking Schemes
Five different ranking schemes are devised using the outcomes of our experiments: (i) ranking based on the internal view, (ii) ranking based on the external view, (iii) E-SERS ranking, combining the internal and external views, (iv) ranking based on average star ratings, and (v) Google Play Store rank (https://www.appbrain.com/stats/google-play-rankings/top/free/application/us#types, accessed on 19 March 2021).
We illustrate different scenarios for comparing these ranking schemes, similar to our approach described in [19]. The rank-orders differ from one scheme to another; therefore, we conducted an empirical analysis to identify the reasons behind this behavior. Table 9 shows Kendall Tau distances for four such comparisons across five distinct categories.
Average Ratings and Indirect Trust: In an ideal case, the reviews should be consistent with the average star ratings given by the users. The Kendall Tau distance, as shown in Table 9, is between 0% and 40% when we compare these two rankings, indicating that they are reasonably similar to each other. The observed differences could be due to two potential reasons: (i) for our review data set, we assigned two additional weights (the review-centric reputation score and the temporal weight), whereas in an average rating score all reviews are treated equally; and (ii) a mismatch is frequently observed between review sentiments and the associated rating scores. Hence, the star ratings are not always true representations of the corresponding review narratives.
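The normalized Kendall Tau distance used in these comparisons counts the fraction of App pairs that two schemes order differently; a minimal sketch (the App names and rank positions are illustrative):

```python
from itertools import combinations

def kendall_tau_distance(rank_a, rank_b):
    """Percentage of item pairs ordered differently by two rankings;
    each ranking maps an App name to its rank position."""
    items = list(rank_a)
    discordant = sum(
        1 for x, y in combinations(items, 2)
        if (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) < 0
    )
    n = len(items)
    return 100.0 * discordant / (n * (n - 1) / 2)

avg_rating = {"App1": 1, "App2": 2, "App3": 3, "App4": 4, "App5": 5}
esers = {"App1": 2, "App2": 1, "App3": 3, "App4": 5, "App5": 4}
dist = kendall_tau_distance(avg_rating, esers)  # 2 of 10 pairs disagree
```

A distance of 0% means two schemes produce identical orderings; 100% means one is the exact reverse of the other.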
Average Ratings and Direct Trust: The Kendall Tau distance between these two schemes (Table 9) is between 30% and 60%. We selected an App, App4, from the News category that has opposite ranks: it has a rank of 2 out of 5 based on the user ratings and a rank of 5 based on the direct trust score. For further investigation of App4, we collected a total of 84 reviews (3.2% of the total reviews) that matched one of the keywords mentioned in Table 8. This low number (3.2%) suggests that most of the users are not concerned about the internal features of that App. Also, among these 84 reviews, most reported crashes of the App. During the internal evidence analysis, we found critical security vulnerabilities in this App. We noticed that the data leaks associated with App4 involve Dangerous permission access (e.g., READ_PHONE_STATE).
Average Ratings and Google Play Store Rank: The Kendall Tau distance for these two schemes (Table 9) is between 30% and 40%. Factors that influence the Google Play Store ranking are App name, App description, ratings and reviews, backlinks, in-App purchases, updates, downloads and engagement, and other hidden factors [74]. However, leading App Stores do not disclose how the ranking factors are weighted. To understand the correlation between average rating scores and Google Play Store ranks, we conducted an experiment. We fetched a data set of 500 Apps from AppBrain (https://www.appbrain.com/stats/google-play-rankings/top_free/applications/us, accessed on 19 March 2021) and the Google Play Store, which contained Google Play rankings, rating scores, the number of installs, and the number of reviews. This data set served as the training set for a machine learning model (XGBRegressor, https://www.datatechnotes.com/2019/06/regression-example-with-xgbregressor-in.html, accessed on 19 March 2021) used to predict an App's rank. The outcome of this experiment indicated that the star rating and the number of reviews have higher importance scores than the number of installs. Since the rating score is an influential factor for the Google Play Store ranking, the disparity between these two ranking schemes is not that high.
We selected an App, App1, from the Shopping category that had opposite ranking positions: it had a rank of 4 out of 5 based on the user ratings and a rank of 2 based on the Google Play Store scheme. About 45% of the total reviews for this App carry a rating below 4. On the other hand, the number of installs (more than 5 million) and the number of reviews (91,857) for this App are relatively higher than those of the others. So, these factors, combined with other Google Play Store ranking factors, give the App a higher rank.

E-SERS and Google Play Store Rank: The previous sections have indicated that rankings based on partial evidence result in significantly different orderings. Thus, there is a need to combine direct and indirect trust-based evidence to provide a comprehensive ranking scheme: the E-SERS approach. As we have stated, trust is subjective and indicates a user's tolerance of the risk associated with installing any App on their mobile device; we recognize that different users may assign different levels of importance to the direct and indirect artifacts of an App. We, however, adhere to the view that the direct trust evidence better reflects an App's ability to protect private information, and thus we assigned 70% weight to direct trust and 30% weight to indirect trust in our experiments. These weights can be adjusted as users desire. As can be seen from Table 9, the distance between the E-SERS ranking and the Google Play Store ranking varies from 30% to 50% and is smaller than the other distances. A higher weight for the indirect trust score reduces the distance from the Google Play Store rank, whereas a lower weight increases it.
Such a scenario is illustrated with the help of App2 in the Shopping category. App2 is one of the top Apps ranked by the Google Play Store. User reviews and the rating scores depict a similar scenario, with approximately 78% of reviews rated 4 stars or above. Based on the review sentiment, 70% of reviews reflect positive sentiment for App2. E-SERS assigns a lower rank to App2 when it is evaluated based on direct trust-based evidence. During the internal evidence analysis, we found severe security vulnerabilities in this App. Through further investigation, we found that the data leaks associated with App2 involve Dangerous permission access (e.g., ACCESS_FINE_LOCATION and ACCESS_COARSE_LOCATION) and that these sensitive data are written to SMS_MMS, thereby again highlighting the fact that user reviews often fail to grasp the real state of an App, and anyone relying only on reviews or star scores may regret their selection.

Conclusions
The first threat to the validity of E-SERS is that the 25 Apps used in this experiment might not be representative of the entire App Store. To address this threat, we have made our data available at https://tinyurl.com/E-SERS, accessed on 7 March 2021. Moreover, the E-SERS approach is independent of the number of Apps used in the study. The second threat is that static data-flow detection tools require all code to be accessible for analysis. We cannot fully address this issue, as we do not have access to an App's source code; however, we used standard tools that have been employed in other research studies of Android Apps. Third, static code analysis tools may return false-positive warnings. To overcome this limitation, we have considered the reputation scores of the tools. Finally, E-SERS considered, based on a small informal survey, the top two influencing factors (average rating and user reviews) and ignored the other factors (such as the number of installs, App size, and developer info) in the trust computations. A larger survey sample may result in different top factors and, thus, may change the trust computations. Our goal is to educate and empower users in their App selection ("caveat emptor") and not to "censor their freedom of choice" by filtering or de-ranking Apps. It will be a user's responsibility to select an App, hopefully based on the trust score that we provide for that App. We recognize that some users may either choose to ignore the trust score or may not be able to comprehend its importance. As can be seen from Figure 5, the users are presented with all the trust scores (i.e., direct, indirect, and E-SERS-based), and it is up to them to use these details in their App selection.
We recognize that App-related cybersecurity is a vast topic with many facets. Our focus here is narrower: providing a holistic view of an App that considers both its internal and external factors and using that view to rank similar Apps. We are of the opinion that such a view, as it considers multiple sources of evidence, will empower users to make a proper selection from the available choices for their specific needs.
This paper has proposed a ranking scheme, called E-SERS, which is an enhanced version of our past work. The enhancements comprise the formalism, a quantified risk-assessment matrix, temporal weights, the reputation of reviews, and the incorporation of the outcomes of practitioner surveys. E-SERS computes direct and indirect trust scores for an App using internal and external evidence and aggregates the results using subjective logic operations. The rank-ordering of similar Apps from the Google Play Store generated by E-SERS is based on a more comprehensive analysis than prevalent alternatives. E-SERS, by using the direct trust artifacts, mitigates the limitation posed by the scarcity of reviews for newly published Apps and is hence useful to developers, users, and society.

Figure 1 .
Figure 1. Survey response on App evaluating factors.


Table 4 .
Mapping of sentiment score to (b, d, u).

Figure 3.
Figure 3 presents the sentiment scores for each review in our data set, where every point denotes the score for an individual review. The box plot shows the median, first and third quartiles, and minimum and maximum sentiment scores for each rating on a scale of 1 to 5. However, a significant number of outliers are evident for the ratings of 1, 2, and 5.

Algorithm 2. Computation of opinion from DTA.
procedure createInternalOpinion(DTA_X, S_DT)
    for S_i ∈ S_DT do
        positive_evidence ← null
        negative_evidence ← null
        S_i:ev(X) ← generate_internal_evidence(X)
        for e ∈ S_i:ev(X) ≠ null do
            if e is positive evidence then positive_evidence++
            else negative_evidence++
        end for
        Apply Formulae (1) to (4) to determine (b, d, u, a), ω_X^(Si)
        Evaluate reputation (r_i) of S_i based on F1-score, ω_(Si)^(ri)
        Calculate weighted opinion of S_i, ω_X^(ri:Si), using the discounting operator
    end for
    Apply consensus operator to fuse opinions from different sources and compute ω_X^(⊕S_DT)

Algorithm 3. Computation of opinion from ITA.
procedure createExternalOpinion(ITA_X, S_IT)
    for S_i ∈ S_IT do
        positive_evidence ← null
        negative_evidence ← null
        S_i:ev(X) ← generate_external_evidence(X)
        for e ∈ S_i:ev(X) ≠ null do
            review_reputation_weight ← apply Formula (6)    # normalized to scale 10
            temporal_weight ← assign highest score to recent reviews
            weight[e] ← review_reputation_weight × temporal_weight
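The consensus step at the end of Algorithm 2 can be sketched with the standard cumulative (consensus) fusion operator of subjective logic; this form assumes non-zero uncertainties, averaging the base rates is a simplifying assumption, and the per-tool opinions are illustrative:

```python
def cumulative_fusion(op1, op2):
    """Standard cumulative (consensus) fusion of two (b, d, u, a) opinions;
    assumes both uncertainties are non-zero and averages the base rates."""
    b1, d1, u1, a1 = op1
    b2, d2, u2, a2 = op2
    k = u1 + u2 - u1 * u2
    return ((b1 * u2 + b2 * u1) / k,
            (d1 * u2 + d2 * u1) / k,
            (u1 * u2) / k,
            (a1 + a2) / 2)

tool1_op = (0.8, 0.1, 0.1, 0.5)  # illustrative per-tool opinions
tool2_op = (0.6, 0.2, 0.2, 0.5)  # (after discounting by each tool's reputation)
fused = cumulative_fusion(tool1_op, tool2_op)
# Fusing independent evidence reduces uncertainty: fused u < both inputs.
```

Intuitively, each source's belief and disbelief are weighted by the other source's uncertainty, so agreeing sources reinforce each other and the fused opinion is less uncertain than either input.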

Table 2 .
Likelihood categorization based on appearance.


Table 6 .
Data leaks details generated by FlowDroid.

Table 7 .
Statistics of collected user review data set.

Table 8 .
Reviews related to bug and security scope.