Article

Quantifying Privacy Risk of Mobile Apps as Textual Entailment Using Language Models

The Department of Computer Science, The Hang Seng University of Hong Kong, Hong Kong, China
J. Cybersecur. Priv. 2025, 5(4), 111; https://doi.org/10.3390/jcp5040111
Submission received: 10 November 2025 / Revised: 4 December 2025 / Accepted: 10 December 2025 / Published: 12 December 2025
(This article belongs to the Section Privacy)

Abstract

Smartphones have become an integral part of our lives in modern society, as we carry and use them throughout the day. However, this “body part” may maliciously collect and leak our personal information without our knowledge. When we install mobile applications on our smartphones and grant their permission requests, these apps can use the sensors embedded in the smartphones and the stored data to gather and infer our personal information, preferences, and habits. In this paper, we present our preliminary results on quantifying the privacy risk of mobile applications by assessing, through textual entailment decided by language models (LMs), whether the requested permissions are necessary given the app descriptions. We observe that despite incorporating various improvements to LMs proposed in the literature for natural language processing (NLP) tasks, the performance of the trained model remains far from ideal.

1. Introduction

In today’s society, smartphones have become an integral extension of ourselves, as we rely on them throughout the day. However, this “extension” can also pose risks, as it may gather and expose our personal information in harmful ways without our knowledge. This is possible because smartphones are embedded with numerous sensors, with even more added in newer models, which can capture our every action. When we install curious, or even malicious, mobile applications (apps) on our smartphones and grant their permission requests, such as location, camera, motion and orientation, contacts, etc., these apps can utilize the embedded sensors and the stored data to collect and infer our personal information, preferences, and habits, such as sleeping habits, personality traits, or even medical conditions.
In recent years, with data scandals of apps, websites, and technology companies being widely covered by the news media, people have become more aware of the privacy risks associated with the use of these technologies. However, such awareness does not necessarily make users act more carefully to safeguard their privacy. One possible reason is that people overlook the potential risks of granting certain permissions and providing certain personal information to an app. This happens because users may not be fully aware of the exact set of permissions necessary for each function of an app, or of the potential impact of granting a permission, and hence they may underestimate the privacy threat of the app.
With so many mobile applications available both officially and otherwise, we believe it is beneficial to assess the privacy risk of these applications by focusing on the legitimacy of their permission requests. As a result, there are two major contributions of this work.
  • First, we design a machine learning-based framework to determine the privacy risk of mobile applications in terms of requested permissions;
  • Second, we implement the framework and assess the privacy risk of mobile applications with empirical studies.
Novelty Statement: Unlike traditional approaches that treat privacy risk quantification as a simple classification task, we formulate the problem as a textual entailment [1,2] task. We prepared a dataset consisting of app descriptions and their associated permission requests, and developed a framework that utilizes language models to perform this task.
Textual entailment, a natural language processing (NLP) task, seeks to determine whether one text fragment can be inferred from another by analyzing their semantic relationships. This involves understanding context, word meanings, and the underlying logic of the fragments of texts. By leveraging the semantic meanings of both app descriptions and permission descriptions, this approach enhances the model’s flexibility to accommodate new permissions that may emerge in the future. This raises the question of which language model performs best for this task.
We also observe that the labeled dataset is highly imbalanced for many permissions. This is primarily because we only classify a permission request as legitimate when the app explicitly requests it; we currently do not label non-requested permissions. An imbalanced dataset, if not addressed appropriately, could adversely affect the ability of the trained model to generalize. As a result, we also explore strategies such as data augmentation and the application of different loss functions to mitigate the effects of this imbalance.
In summary, this paper investigates the following research questions:
  • RQ1: How do various language models perform in quantifying the legitimacy of permissions formulated as textual entailment?
  • RQ2: What is the impact of employing different loss functions and data augmentation strategies on handling imbalanced data within the context of this study?
  • RQ3: How does training with multiple permission requests or utilizing pretraining with app descriptions influence the performance of the model?
The organization of the paper is as follows: We cover related work in Section 2. We provide problem formulation in Section 3. We present the solution in Section 4, the experiment setup in Section 5, and the related results and discussion in Section 6. We conclude and give future directions in Section 7.

2. Related Work

Our work touches upon several related fields: security of mobile applications (apps), language models for natural language processing (NLP) tasks, and strategies to handle imbalanced datasets.

2.1. Security of Mobile Applications

Our problem is closely related to security analysis of mobile applications (apps), such as identifying malicious apps, where many studies have been performed in the literature, e.g., [3]. However, as pointed out by Lin et al. [4], there are fundamental differences between the two. In particular, context matters as we decide whether a permission is necessary for the functions of an app—precise location is usually necessary for a map application, but definitely not for a calculator application. Consequently, merely analyzing whether an app accesses and uploads a piece of private information is insufficient to determine if it infringes on users’ privacy. Our work complements these security analysis efforts by focusing specifically on the legitimacy of permission requests, providing a more complete understanding of privacy risks associated with mobile applications.
Before training the model and performing the analysis, we needed to prepare our own dataset of apps, which includes their descriptions, permission requests, and privacy policies. As we were unable to follow the framework used in prior work [5] to retrieve and download apps from the Google Play Store, we devised our own web-crawling approach, implemented as a Python (3.9) script using Selenium (v4.3.0) and BeautifulSoup (v4.11.1). Apps are retrieved from the Play Store through its “search” function using a dictionary attack with 10,000 words, similar to the approach used in that work [5].
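To illustrate the dictionary-attack retrieval step, the sketch below builds one Play Store search URL per dictionary word; a Selenium-driven browser would then load each URL, and BeautifulSoup would parse the rendered page for app links. This is a minimal illustration under assumptions: the query-parameter format shown and the helper name are ours, not the actual crawler code.

```python
from urllib.parse import quote_plus

def build_search_urls(wordlist, base="https://play.google.com/store/search"):
    """Build one Play Store search URL per dictionary word.

    Each URL would be visited by a Selenium-driven browser, and the
    rendered page parsed with BeautifulSoup to extract app links.
    """
    return [f"{base}?q={quote_plus(word)}&c=apps" for word in wordlist]

urls = build_search_urls(["calculator", "step counter"])
```

With a 10,000-word dictionary, this yields 10,000 search pages to crawl, from which app identifiers can be deduplicated before downloading metadata.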
Researchers have been studying the privacy risk of mobile applications. For instance, AC-Net [6] studied the correspondences between the semantics of descriptions and the permission usage. A variant of the RNN architecture with the Gated Recurrent Unit (GRU) was used. However, only 1415 popular Android apps were collected and 24,724 sentences were labeled, and permissions were grouped into 11 categories for easier analysis.

2.2. Language Models for Natural Language Processing (NLP) Tasks

In order to understand the meaning of app descriptions and the permissions, language models are used in this work. Various language models have been proposed in the literature for natural language processing (NLP) tasks, such as sentiment analysis, text summarization, part-of-speech labeling, translation, etc.
BERT [7] is one of the most notable early language models to utilize the Transformer architecture [8]. BERT was pretrained on large corpora using masked language modeling (MLM) [9] and next sentence prediction tasks. Later, a number of other language models based on BERT with improved performance were proposed. For example, RoBERTa [10] has more extensive training such as longer training time, more data, and bigger batches; DeBERTa [11] uses the disentangled attention mechanism where each word is represented using two vectors that encode its content and position.
Many other language models based on the Transformer architecture were also proposed in the literature. For example, ERNIE [12] was trained using both large-scale textual corpora and knowledge graphs, which provide rich structured knowledge facts for better language understanding.
There are also other works proposed to improve the performance of existing language models on downstream tasks. For instance, continued pretraining of the language model [9] using domain data is proposed to adapt the model for the downstream task.

2.3. Strategies to Handle Imbalanced Dataset

The collected data are found to be highly imbalanced for many of the permissions. As a result, a number of approaches for handling such imbalanced data are also explored.
One approach considers the use of a modified loss function, although it is also suggested that all losses are created equal [13] when the neural network has sufficient approximation power and the training is performed for sufficiently many iterations, resulting in a Neural Collapse phenomenon. For instance, the contrastive loss function [14,15] in Supervised Contrastive Learning [16] addresses class imbalance by leveraging pairwise comparisons to learn robust representations of different classes, focusing on similarities and differences rather than direct class labels. This improves the generalization capabilities of the model, leading to better performance on minority class examples. Meanwhile, the Focal Loss function [17] is a modified cross-entropy loss, which focuses more on the minority class and hard-to-classify examples, and leads to better performance when there is a significant class imbalance. The Dice Loss function [18] handles the data imbalance issue by attaching similar importance to false positives and false negatives, and associating training examples with dynamically adjusted weights to de-emphasize easy-negative examples.
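As a concrete illustration of how Focal Loss down-weights easy examples relative to cross-entropy, the following is a minimal pure-Python sketch of its binary form, FL(p_t) = −α_t (1 − p_t)^γ log(p_t), from [17]; the defaults γ = 2 and α = 0.25 follow that paper, and the function name is ours.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single predicted probability p of the
    positive class and true label y in {0, 1}.

    p_t is the probability assigned to the true class; the modulating
    factor (1 - p_t)**gamma down-weights easy, well-classified examples.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct example contributes far less loss than a hard one.
easy = focal_loss(0.95, 1)
hard = focal_loss(0.30, 1)
```

With γ = 0 and α = 1, the expression reduces to the standard cross-entropy loss, which is why Focal Loss is described as a modified cross-entropy.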
Data augmentation and oversampling are also used in the literature for handling imbalanced data. For example, SMOTE [19] is an oversampling technique by synthesizing samples for the minority class using existing ones and interpolating them. However, SMOTE operates within the feature space, resulting in synthesizing data that do not represent any real text. Additionally, while SMOTE utilizes KNN, the feature spaces for NLP problems are vast, and KNN can struggle significantly in such high-dimensional contexts. Meanwhile, nlpAug [20] is a library specifically designed for data augmentation in NLP tasks. Its augmentation techniques include word insertion, substitution, swapping, and deletion.
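The feature-space interpolation that SMOTE performs can be sketched as follows: pick a random minority point, find its k nearest neighbours, and synthesize a point on the segment between the point and one neighbour. This is an illustrative toy version, not the reference implementation, and, as noted above, the resulting vectors do not decode back to any real text.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Synthesize feature-space samples by interpolating each minority
    point toward one of its k nearest neighbours (SMOTE-style sketch)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # nearest neighbours by squared Euclidean distance (excluding x)
        nbrs = sorted((p for p in minority if p is not x),
                      key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]
        nb = rng.choice(nbrs)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + lam * (b - a) for a, b in zip(x, nb)])
    return synthetic

new_pts = smote_like([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]], n_new=4)
```

Every synthetic point is a convex combination of two real minority points, which is exactly why, in a high-dimensional language feature space, such points need not correspond to any plausible description.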
In our experiment, we will compare the performance of these different language models and strategies to handle imbalanced datasets.

3. Problem Formulation

The Android permission system is a security framework that regulates how mobile applications (apps) access sensitive user data and system features on Android devices. For an app to access such data and features, it must request permission and seek user approval. Many functions, such as providing location-based services, taking photos, and storing data, necessitate specific permission requests. However, granting such requests whenever an app asks for them could expose a user’s sensitive information and privacy to potential misuse. Therefore, it is important for users to carefully evaluate permission requests to ensure their data remains secure.
To assist users in making informed decisions, it is essential for apps to clearly explain the necessity of permission requests in their descriptions by illustrating the functions they provide. However, users may still find it challenging to read through these descriptions and make the right choice. Therefore, our goal is to develop a framework that assesses the legitimacy of an app’s permission requests, by analyzing and interpreting its description, and informs the users of the assessment results.

4. Privacy Risk Quantification

To fully capture the meaning of a permission and provide the flexibility to cater to new permissions in the future, we decided to formulate the problem of determining whether a permission request is legitimate—through the description of app functions—as a textual entailment task. In this task, the text, t, is the description of the app, and the hypothesis, h, is the short description of the permission (column Description (as in Play Store) in Table A4), and we need to decide whether “t entails h”, i.e., t ⊨ h. Notice that unlike the usual textual entailment task, which has three outcomes, namely, entail, contradict, and neutral, our task only consists of entail and not entail. In particular, entail means an application is legitimate to request a permission according to its description, while not entail means it is illegitimate to do so. We let entail be positive and not entail be negative in the confusion matrix.
Modeling the problem as a textual entailment task offers the advantage that we do not need to separate sentences related to a permission from the overall app description during inference. This is because it is highly unlikely that different sentences in the app description would contain contradictory information regarding the need of a permission. Additionally, entailment can still hold even if only a portion of the text is relevant to the hypothesis. This means we only need to evaluate the app description in its entirety during inference.
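Concretely, each labeled permission request becomes a (text, hypothesis, label) triple that can be fed to a sentence-pair classifier. The field names in the sketch below are illustrative only, not the dataset's actual schema:

```python
def make_entailment_pairs(app, permission_descriptions):
    """Turn one app record into (text, hypothesis, label) triples.

    text       = the full app description (no sentence selection needed);
    hypothesis = the short Play-Store-style description of a permission;
    label      = 1 for entail (legitimate request), 0 for not entail.
    """
    return [(app["description"], permission_descriptions[p], label)
            for p, label in app["labeled_permissions"].items()]

perm_desc = {"ACCESS_FINE_LOCATION": "access precise location"}
app = {"description": "Find restaurants near you on a live map.",
       "labeled_permissions": {"ACCESS_FINE_LOCATION": 1}}
pairs = make_entailment_pairs(app, perm_desc)
```

In practice, each triple would be tokenized as a sentence pair (text and hypothesis concatenated with a separator token) before fine-tuning the language model.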
As we were labeling the data, we noticed that the labels for many permissions were highly imbalanced. For instance, for some permissions, most of the requests are legitimate, and hence we have mostly positive examples. To balance the samples, however, we cannot simply treat all the non-requesting apps as negative examples for such permissions. This is because it may still be legitimate, judging from its description, for an app that does not request a particular permission to have requested it. This is particularly true for permissions that may have been introduced after the release of the app. As such, we also explored various ways to handle the data imbalance issue, including loss functions that are claimed to be less sensitive to imbalanced datasets and different data augmentation techniques, as discussed in Section 2.
In the experiments, we fine-tune a separate BERT model ([7], bert-base-uncased) for each individual permission, establishing the baseline performance for each of them. Then, various proposed pretraining, fine-tuning, and data augmentation approaches, together with different language models, are explored to determine their effectiveness on improving the classification result. These approaches include
  • The use of other language models, such as RoBERTa ([10], roberta-base), DeBERTa v3 ([21], DeBERTa-v3-base-mnli-fever-anli), and ERNIE ([12], ernie-2.0-base-en);
  • The use of different loss functions, such as Focal Loss [17], and also contrastive loss functions, such as Supervised Contrastive Loss [16] and InfoNCE loss [15], apart from the default cross-entropy loss;
  • The use of more elaborated permission description to assist with fine-tuning and inference;
  • The use of different data augmentation techniques, such as SMOTE [19] and the ones on word substitution provided in nlpAug [20];
  • Fine-tuning a model using a single permission versus multiple permissions together with (1) similar categories; (2) unrelated categories; and (3) different majority classes;
  • Pretraining the model with either the collected data or similar data from previous studies [6] to make the model better suited to the task.

5. Experiment Setup

5.1. Data Collection

A total of 216,711 mobile applications, 317 permissions, and 2,221,174 permission requests were collected from the Google Play Store for analysis. They were collected in January 2023.

5.2. Data Labeling

The collected permission requests were processed manually to assess their legitimacy in relation to the app description. Three student helpers and two research assistants majoring in computer science were recruited for this task. These students were hired because they had general programming, or even app development, experience, making them better positioned to determine whether an app function requires certain permissions.
In order to improve the labeling correctness, the author, along with the two research assistants and the three student helpers, first had discussions on the potential functions associated with each permission, providing brief explanations for context. Each permission was then assigned to one student helper, who, along with the author and two research assistants, labeled at least 10 diverse samples. This approach ensured that contributions came from four individuals, facilitating comparison and a clearer understanding of each permission’s use cases. Table 1 presents the inter-annotator agreement of the initial samples, measured using Light’s Kappa, which is the average of all possible pairwise Cohen’s Kappa coefficients among the annotators. It is important to note that while some permissions (e.g., 5, 17, 21, and 27) show very low or even zero values of Light’s Kappa during the initial labeling, this does not reflect a lack of agreement. Instead, it is due to the highly skewed request entailment for these permissions, which leads to highly imbalanced classification results (see Table A1) and a Kappa value that fails to capture the meaningful agreement between raters.
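Light's Kappa as used in Table 1 can be computed as the mean of the pairwise Cohen's Kappa values. The minimal sketch below (for binary labels) also reproduces the skew effect noted above: near-total agreement on a heavily skewed permission can still yield a low Kappa. Treating two identical single-label annotations as agreement 1.0 is our convention here, since Kappa is formally undefined in that degenerate case.

```python
from itertools import combinations

def cohens_kappa(a, b):
    """Cohen's kappa between two annotators' label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n                      # observed
    labels = set(a) | set(b)
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)   # chance
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)

def lights_kappa(annotations):
    """Light's kappa: mean pairwise Cohen's kappa over all annotators."""
    ks = [cohens_kappa(a, b) for a, b in combinations(annotations, 2)]
    return sum(ks) / len(ks)

# Four annotators, ~95% raw agreement on a skewed permission,
# yet the averaged kappa stays low.
skewed = lights_kappa([[1]*9 + [0], [1]*10, [1]*10, [1]*9 + [0]])
```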
Following the initial labeling, the labeled samples were compared, and any discrepancies were thoroughly discussed and clarified. Only after this review did the student helpers proceed to label up to 600 examples for each permission, or the total number of samples collected, whichever was smaller. Note that different student helpers were assigned to work on different permission requests. During labeling, the student helpers could also label some requests as “uncertain.” They would then discuss with the author and the two research assistants to determine the correct label, ensuring accurate interpretation and consistent application of the labeling criteria.
Due to limited resources, however, only 38 permissions, each with at most 600 requests, were labeled and made available for the experiments. The list of permissions used in the experiments is provided in Table A4, and the distribution of textual entailment labels for the request of each permission is listed in Table A1. The dataset can be accessed through https://github.com/chrismahsu/AndroidApps (accessed on 11 July 2025). Notice that the IDs of the permissions were assigned according to the order in which they were retrieved from the Google Play Store. Some permissions are omitted as they are identical to the ones listed but classified under different categories. Also notice from Table A1 and Figure A1 that many of the labeled permission requests are imbalanced in terms of textual entailment results.

5.3. Metrics

Various metrics are used in the experiments to quantify the performance of the language models and improvement strategies. These metrics include the F-beta score (where beta is chosen such that Recall is considered beta times as important as Precision), the Matthews Correlation Coefficient (MCC, also known as the Phi coefficient), the P4 metric (also known as Symmetric F), and the Area under the Receiver Operating Characteristic Curve score (AUROC). With TP, FP, TN, and FN denoting True Positives, False Positives, True Negatives, and False Negatives, respectively, the three metrics F-beta, MCC, and P4 are defined as follows:

$$F_\beta = \frac{(1+\beta^2) \times Precision \times Recall}{\beta^2 \times Precision + Recall} = \frac{(1+\beta^2)\,TP}{\beta^2(TP+FN) + (TP+FP)},$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}},$$

$$P_4 = \frac{4}{\frac{1}{Precision} + \frac{1}{Recall} + \frac{1}{Specificity} + \frac{1}{NPV}} = \frac{4 \times TP \times TN}{4 \times TP \times TN + (TP+TN) \times (FP+FN)},$$

where $Precision = \frac{TP}{TP+FP}$, $Recall = \frac{TP}{TP+FN}$, $Specificity = \frac{TN}{TN+FP}$, and $NPV = \frac{TN}{TN+FN}$. AUROC is defined as the area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate ($TPR = Recall = \frac{TP}{TP+FN}$) against the false positive rate ($FPR = \frac{FP}{TN+FP}$). For MCC, if exactly one of the four sums in the denominator is zero, the denominator is set to one. If more than one sum is zero, then MCC is undefined, and we remove the result of that fold from averaging during cross-validation.
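The definitions above, including the convention for zero sums in the MCC denominator, can be computed directly from confusion-matrix counts; the helper below is an illustrative sketch, not the experiment code.

```python
import math

def metrics(tp, fp, tn, fn, beta=1.0):
    """F-beta, MCC, and P4 from confusion-matrix counts.

    MCC follows the convention above: with exactly one zero sum the
    denominator is set to 1; with more than one, MCC is undefined
    (returned as None, so the fold can be dropped from averaging).
    """
    f_beta = (1 + beta**2) * tp / (beta**2 * (tp + fn) + (tp + fp))
    sums = [tp + fp, tp + fn, tn + fp, tn + fn]
    zeros = sums.count(0)
    if zeros > 1:
        mcc = None
    else:
        denom = 1.0 if zeros == 1 else math.sqrt(math.prod(sums))
        mcc = (tp * tn - fp * fn) / denom
    p4 = 4 * tp * tn / (4 * tp * tn + (tp + tn) * (fp + fn))
    return f_beta, mcc, p4

f1, mcc, p4 = metrics(tp=40, fp=10, tn=40, fn=10)
```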
Notice that apart from the usual F1 score, we also consider the F0.5 score. This is because, first, the available samples can be highly imbalanced, and second, the impact of a false positive (granting an app an illegitimate permission request) is worse than that of a false negative (flagging a legitimate permission request as illegitimate).
To give a more balanced and symmetric consideration of both positive and negative entailment classification, the MCC, P4, and AUROC metrics are also considered. In particular, MCC measures the association between predicted and actual binary outcomes, with a range from −1 to 1; a value of +1 represents a perfect prediction, 0 an average random prediction, and −1 an inverse prediction. P4 also considers other aspects of a classifier’s performance, as it takes Precision, Recall, Specificity, and Negative Predictive Value (NPV) into consideration, with a range of 0 to +1; a value of +1, again, represents a perfect prediction. AUROC assesses the classifier’s ability to distinguish between positive and negative classes, with a range from 0 to 1; a value of 1 represents perfect discrimination between classes, 0.5 no discriminative power (equivalent to random guessing), and values below 0.5 worse-than-random performance. The higher the AUROC, the better the model’s ability to correctly classify positive and negative instances.

6. Experiment Results and Discussion

Most of the simulations are run on Google Colab Enterprise using machines with an Nvidia L4 GPU, while some are run on a local laptop equipped with an Nvidia GeForce RTX 3070 Ti GPU. All reported values represent the average over five folds of cross-validation, using a learning rate of 1 × 10⁻⁵ and default dropout probabilities of 0.1. All language models are optimized using AdamW [22] during fine-tuning. Unless stated otherwise, a permission is specified using the short description (column Description (as in Play Store) in Table A4).
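The five-fold split can be sketched as follows: sample indices are shuffled once and partitioned into five near-equal folds, with each fold serving once as the test set and the remainder used for fine-tuning. This is an illustrative sketch; the actual fold assignment in the experiments may differ.

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Partition sample indices 0..n-1 into k shuffled, near-equal folds;
    return (train_indices, test_indices) for each of the k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]

splits = kfold_indices(10, k=5)
```

The per-permission metrics are then computed on each held-out fold and averaged, which is how all values reported in this section are obtained.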
We notice that each simulation with 5-fold cross-validation for 33 permissions together using ERNIE and RoBERTa takes around five to six hours to complete on Google Colab Enterprise, while that using DeBERTa takes around nine to ten hours to complete.

6.1. Baseline Performance

Figure 1 depicts the performance of BERT-base measured by various metrics using the cross-entropy loss function and a batch size of 4 for each individual permission, while Figure 2 depicts that of DeBERTa. The corresponding numeric values are listed in Table A2 and Table A3.
From Figure 1a,b, we can observe that for permissions with a majority of samples in entailment, the baseline performance measured using the F1 and F0.5 scores is generally better. Meanwhile, when the majority of samples are in not-entailment, the baseline performance can be very poor. This is because the two metrics place more emphasis on positive classification, and having the majority of samples in entailment makes the model learn and perform better in positive classification. Conversely, having a minority of samples in entailment makes the model fail to learn positive classification accurately, which results in poorer F1 and F0.5 scores.
When both positive and negative classifications are considered, as shown in Figure 1c–e with the use of the P4, MCC, and AUROC metrics, the performance of BERT-base is more accurately reflected. For instance, the “good” performance of the model on positive classification is now overshadowed by the poor performance on negative classification, such as for permission 16 and, more extremely, permissions 21 and 27. As such, we opt to use P4, MCC, and AUROC for performance comparison.
When we compare the results in Table A2 and Table A3, we notice that the performance of DeBERTa is slightly better than that of BERT if we consider the F1 score, F0.5 score, MCC, and AUROC metrics, with average improvements of 0.0111, 0.0137, 0.0128, and 0.0036, respectively; however, BERT-base performs slightly better if we consider the P4 metric, with an average improvement of 0.0056.
Summary: P4, MCC, and AUROC are more appropriate metrics than the F1 and F0.5 scores to quantify the performance, as they consider both positive and negative classifications.

6.2. RQ1: How Do Various Language Models Perform in Quantifying the Legitimacy of Permissions Formulated as Textual Entailment?

Table 2 lists the experiment results of four language models, namely BERT-base-uncased, RoBERTa-base, DeBERTa-v3-base-mnli-fever-anli, and ERNIE-2.0-base-en, with four different loss functions, namely cross-entropy loss, Focal Loss, Supervised Contrastive Loss, and InfoNCE Loss, using the labeled data of 10 permissions (IDs 1, 2, 5, 6, 7, 8, 9, 11, 12, and 16 in Table A4) for training and testing. In the experiments, the default values of learning rate and temperatures for Supervised Contrastive Loss and InfoNCE Loss are used. The reported values are the average performance of individual permissions.
From Table 2, we can observe that when the same loss function is used, the four models, namely BERT, RoBERTa, DeBERTa, and ERNIE, demonstrate comparable performance when the 10 permissions are considered together. Among the four models, DeBERTa exhibits consistently slightly better performance.
Note that in the tables, the values of −0.6970 in P4, −0.5294 in MCC, and −0.2483 in AUROC, when compared to the BERT baseline, correspond to absolute values of 0 in P4 and MCC, and 0.5 in AUROC, which represent performances equivalent to random guessing.
Summary: The four language models considered, namely BERT, RoBERTa, DeBERTa, and ERNIE, exhibit comparable performance, with DeBERTa consistently achieving slightly better results.

6.3. RQ2: What Is the Impact of Employing Different Loss Functions and Data Augmentation Strategies on Handling Imbalanced Data Within the Context of This Study?

Table 3 lists the performance of various loss functions applied to BERT during the training process, including Focal Loss, Supervised Contrastive Loss, and InfoNCE Loss, using the labeled data of 33 permissions individually (all the permissions in Table A4). The reported values are the difference from the BERT baseline performance.
Table 4, Table 5 and Table 6 depict the experiment results when the language models are enhanced with different refinements in terms of loss function and data preparation for RoBERTa, DeBERTa, and ERNIE, respectively, using the labeled data of 33 permissions together for training and testing. In the experiments, the default values of learning rate and temperatures for Supervised Contrastive Loss and InfoNCE Loss are also used. Again, the reported values are the average performance of individual permissions.
Similarly, Table 7 lists the performance of various data augmentation approaches applied to BERT, including SMOTE and SynonymAug (SynAug) provided by nlpAug, and the reported values are the difference from the BERT baseline performance. In the table, SynAug x% means that x% of the words are replaced with synonyms. Please note that “N/A” in SynonymAug is due to the relatively balanced sample set of those permissions, which results in no augmentation being needed on the minority class. In contrast, “N/A” in SMOTE indicates insufficient samples for SMOTE to operate, leading to a lack of augmentation on the minority class.
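The SynAug x% substitution can be sketched as follows: roughly x% of the words are replaced, restricted to words for which a synonym entry exists. The tiny synonym table below is illustrative only; nlpAug's SynonymAug draws its synonyms from a lexical database such as WordNet.

```python
import random

def synonym_aug(text, synonyms, pct=0.1, seed=0):
    """Replace roughly pct of the words with synonyms, mimicking a
    SynonymAug-style augmentation at a given substitution rate."""
    rng = random.Random(seed)
    words = text.split()
    n_swap = max(1, round(pct * len(words)))
    # only words with a synonym entry are eligible for substitution
    candidates = [i for i, w in enumerate(words) if w.lower() in synonyms]
    for i in rng.sample(candidates, min(n_swap, len(candidates))):
        words[i] = rng.choice(synonyms[words[i].lower()])
    return " ".join(words)

syn = {"photos": ["pictures", "images"], "track": ["record", "log"]}
aug = synonym_aug("track your runs and share photos", syn, pct=0.2)
```

As the next subsection discusses, such word-level substitution preserves the sentence structure almost entirely, which limits how much diversity it can add to app descriptions.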

6.3.1. Effects of Different Loss Functions

Table 3 shows that the use of other loss functions, namely Focal Loss, Supervised Contrastive Loss, and InfoNCE Loss, does not lead to improvements in textual entailment decision on average when permissions are considered individually. In fact, the performance significantly declines when Supervised Contrastive Loss and InfoNCE Loss are employed using their default parameters, while that of Focal Loss is mixed.
Table 2 also shows that when cross-entropy loss (CE) is replaced with Focal Loss (FL) with 10 permissions being considered together, some language models demonstrate slightly better performance on some performance metrics. However, using Supervised Contrastive Loss (SCL) or InfoNCE Loss would result in significantly worse performance for all of the models across all of the considered performance metrics, similar to that observed from Table 3 when permissions are considered individually.
When we compare Table 4, Table 5 and Table 6, where 33 permissions are considered together, we also notice that the performance of the models is very poor when contrastive loss functions, namely SCL or InfoNCE Loss, are used instead of the default cross-entropy loss, irrespective of the language model used. A careful examination of the results reveals that in some of the folds when the two contrastive loss functions are used, the decisions returned by the models are skewed to one side only—i.e., either all the requests are inferred as legitimate or all are illegitimate—but the skewness can also be inconsistent across different folds. Although increasing the batch size from 4 could improve the performance of the models, they are still worse than the other two loss functions, namely, cross-entropy loss and Focal Loss, under the same batch size with the default parameter setup. Meanwhile, the use of default cross-entropy loss and Focal Loss in model training gives acceptable results consistently without any completely skewed answers in any of the folds.
Summary: Cross-entropy loss and Focal Loss yield comparable performance, whereas Supervised Contrastive Loss (SCL) and InfoNCE Loss consistently perform significantly worse.

6.3.2. Effects of Data Augmentation

From Table 7, we can observe that performing data augmentation with relatively simple approaches, namely SMOTE and synonym substitution, does not generally improve textual entailment decisions, and the performance worsens on average under the MCC and P4 metrics. This is because replacing words with their synonyms, as in the SynonymAug of nlpAug, does little to increase diversity in terms of the mobile app descriptions that may be encountered. Working in the feature space, as SMOTE does, also has limited effectiveness because the space of language is of very high dimension.
Summary: Implementing data augmentation using relatively simple approaches yields minimal to no benefits.

6.4. RQ3: How Does Training with Multiple Permission Requests or Utilizing Pretraining with App Descriptions Influence the Performance of the Model?

Table 8 lists the performance of BERT-base when the data of two permissions are used together to fine-tune the model. Different combinations of the two permissions are used—(i) Same category and same majority label; (ii) Unrelated categories and same majority label; and (iii) Same category but different majority labels. A negative value means worse than baseline, while a positive value means better than baseline.

6.4.1. Effects of Training with Multiple Permission Requests

Table 8 reveals that fine-tuning a model on two permissions at once leads to inconsistent performance. Whether the two permission requests come from the same category with the same majority class, from unrelated categories with the same majority class, or from the same category with different majority classes, the combined training does not consistently improve performance on either permission individually.
We also observe from Table 4, Table 5 and Table 6 that supplementing permission descriptions with their details yields inconsistent performance. In certain setups, such as when using InfoNCE Loss, the inclusion of permission description details can deteriorate the model’s overall performance. Conversely, when cross-entropy (CE) loss is employed, the addition of these details may enhance the performance of some models.
Summary: Training a single model with multiple permissions does not consistently lead to improved performance for the relevant permissions. Additionally, the inclusion of detailed permission descriptions produces mixed results.

6.4.2. Effects of Pretraining Models Using Data from the Target Domain

Table 5 and Table 6 also show that pretraining the language model using the app descriptions of the 33 permissions together with masked language modeling, similar to task-adaptive pretraining [9], yields mixed results. Specifically, the performance of DeBERTa declines, particularly when the batch size is small during fine-tuning, while ERNIE shows a slight improvement when compared with the baseline. However, the performance of the pretrained models is consistently worse than that of models fine-tuned without pretraining under the same batch size. Possible reasons for this observation include a lack of sufficient samples to effectively pretrain the model for the context, as well as the general nature of app descriptions, which may render further adaptation unnecessary.
Summary: The performance of pretrained models is consistently worse than those without pretraining.
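For context, masked language modeling constructs pretraining examples roughly as in the sketch below. This is a simplified version: BERT-style masking also keeps some selected tokens unchanged or swaps in random tokens, which is omitted here, and the 15% masking probability is the conventional default rather than a setting reported in this work.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", rng=None):
    """Build one masked-language-modeling example: each token is selected
    with probability mask_prob, replaced by mask_token in the input, and
    kept as the prediction target."""
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)    # the model must reconstruct this token
        else:
            inputs.append(tok)
            labels.append(None)   # no loss is computed at this position
    return inputs, labels

tokens = "this app lets you back up and share your photos".split()
inputs, labels = mask_tokens(tokens)
```

With only tens of thousands of short, generic app descriptions, such a masking objective may not expose the model to enough domain-specific signal to outperform the off-the-shelf checkpoint, consistent with the mixed results above.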

6.5. Overall Observations

From the experiment results, we observed that the various language models yielded comparable performance. We also implemented multiple enhancement strategies proposed in the literature, but their effects were mixed, with many proving ineffective or even detrimental to performance. The highest performance achieved thus far is an AUROC of 0.8221, attained by DeBERTa with cross-entropy as the loss function when training on the 33 permissions together with a batch size of 4.
While an AUROC of 0.8221 indicates a reasonably good ability to discern legitimate permission requests, it still falls short of ideal performance in this context. Specifically, this score corresponds to a true positive rate of 0.7982 and a false positive rate of 0.1540. Consequently, out of 100 illegitimate permission requests made by potentially malicious apps, approximately 15 would be misclassified as legitimate and approved. This highlights the need for further refinement of the model to minimize the false positive rate while maintaining a high true positive rate.
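The arithmetic behind this reading of the operating point follows directly from the rate definitions, sketched here with a legitimate permission request taken as the positive class:

```python
def rates(tp, fp, tn, fn):
    """True and false positive rates from confusion-matrix counts,
    with a legitimate permission request as the positive class."""
    tpr = tp / (tp + fn)   # legitimate requests correctly approved
    fpr = fp / (fp + tn)   # illegitimate requests wrongly approved
    return tpr, fpr

# At the reported operating point (TPR = 0.7982, FPR = 0.1540):
wrongly_approved = round(100 * 0.1540)      # per 100 illegitimate requests
wrongly_denied = round(100 * (1 - 0.7982))  # per 100 legitimate requests
assert wrongly_approved == 15 and wrongly_denied == 20
```

The same arithmetic shows the complementary cost: roughly 20 of every 100 legitimate requests would be denied at this operating point.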

7. Conclusions

In this paper, we presented our preliminary results on quantifying the privacy risk of mobile applications by determining whether a requested permission is necessary based on the app description. We modeled the problem as textual entailment, also known as Natural Language Inference (NLI), a fundamental task in natural language processing (NLP), and used language models to solve it. We highlighted the difficulties encountered with the approach, such as the lack of reliable or balanced training data. We found that even after incorporating various improvements proposed in the literature for NLP tasks, the performance of the trained model remains far from ideal.
In the future, we will consider using more labeled data from more permissions to evaluate the performance of various language models and their refinements, such as different loss functions, model training strategies, and data preparation approaches. Large language models (LLMs) may also be utilized to assist with the task, including using them to explain the implications of specific permissions to aid data labeling and model training; to identify what makes a permission request legitimate given the description; or even to quantify privacy risk directly, with Chain-of-Thought [23], few-shot [24], or even zero-shot [25,26] learning.

Funding

This research was funded by grant UGC/FDS14/E01/21 from the Research Grants Council of the Hong Kong Special Administrative Region, and was partially supported by grant SDSC-SRG049 from the School of Decision Sciences of the Hang Seng University of Hong Kong.

Data Availability Statement

The dataset can be accessed through https://github.com/chrismahsu/AndroidApps (accessed on 11 July 2025).

Acknowledgments

The author would like to thank research assistants Andrew LEE and Barry CHENG for their discussions and assistance in preparing some of the script files, cleansing the dataset, and labeling the sample data. The author would also like to express gratitude to student helpers Stephanie SO, Katy LAM, and Chiu Faat LEUNG for their work on data labeling. During the preparation of this manuscript, the author utilized PoE for text editing purposes.

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A

Table A1. Number of labeled permission samples in each class for each permission.
ID | Entail | Not Entail | ID | Entail | Not Entail | ID | Entail | Not Entail
19348914925072526568
2375214151394602627571
5559271658416275855
656920171674322918566
743814018166434301590
83532461912347631137457
957921203881973243541
1082518215891133130459
11237363224691313442555
123822182328231135134448
1326574241734203857318
Table A2. Average F1 and F0.5 scores, P4, MCC, and AUROC from 5-fold cross-validation of BERT with cross-entropy as the loss function for the 33 permissions individually, as the BERT baseline performance. Values are obtained after 5 epochs per fold, batch size is 4, and a simple permission description is used (column Description in Table A4).
ID | F1 Score | F0.5 Score | P4 | MCC | AUROC
1 | 0.6130 | 0.6606 | 0.7395 | 0.5752 | 0.7705
2 | 0.8246 | 0.8212 | 0.7452 | 0.5325 | 0.7530
5 | 0.9792 | 0.9767 | 0.6892 | 0.5470 | 0.7535
6 | 0.9884 | 0.9847 | 0.7282 | 0.6030 | 0.7474
7 | 0.8859 | 0.8755 | 0.7210 | 0.5123 | 0.7476
8 | 0.8116 | 0.8012 | 0.7612 | 0.5348 | 0.7635
9 | 0.9846 | 0.9806 | 0.5698 | 0.4749 | 0.7124
10 | 0.5558 | 0.5616 | 0.6901 | 0.5067 | 0.7565
11 | 0.8504 | 0.8531 | 0.8759 | 0.7595 | 0.8780
12 | 0.7680 | 0.7830 | 0.6787 | 0.4101 | 0.7053
13 | 0.3433 | 0.4009 | 0.4156 | 0.3626 | 0.6489
14 | 0.4892 | 0.5088 | 0.6266 | 0.4397 | 0.7162
15 | 0.7227 | 0.7327 | 0.8082 | 0.6463 | 0.8197
16 | 0.9846 | 0.9836 | 0.4611 | 0.3446 | 0.6515
17 | 0.5787 | 0.6434 | 0.6906 | 0.4790 | 0.7140
18 | 0.8003 | 0.8117 | 0.8580 | 0.7345 | 0.8637
19 | 0.7056 | 0.7299 | 0.7996 | 0.6492 | 0.8143
20 | 0.8887 | 0.8657 | 0.8082 | 0.6409 | 0.8032
21 | 0.9898 | 0.9849 | 0.0000 | −0.0017 | 0.5992
22 | 0.8832 | 0.8669 | 0.6163 | 0.3803 | 0.6728
23 | 0.8661 | 0.8530 | 0.8666 | 0.7445 | 0.8701
24 | 0.7525 | 0.7507 | 0.8159 | 0.6616 | 0.8314
25 | 0.4530 | 0.5106 | 0.5218 | 0.4656 | 0.7003
26 | 0.2043 | 0.1987 | 0.2671 | 0.2094 | 0.6282
27 | 0.9956 | 0.9931 | 0.0000 | 0.0000 | 0.8000
29 | 0.0625 | 0.0543 | 0.0995 | 0.0426 | 0.5289
30 | 0.4667 | 0.5762 | 0.5737 | 0.4648 | 0.6778
31 | 0.8612 | 0.8416 | 0.9065 | 0.8226 | 0.9232
32 | 0.4729 | 0.4941 | 0.5861 | 0.4549 | 0.7155
33 | 0.9290 | 0.9246 | 0.9530 | 0.9086 | 0.9568
34 | 0.5864 | 0.6977 | 0.7273 | 0.6043 | 0.7407
35 | 0.7720 | 0.7794 | 0.8436 | 0.7121 | 0.8546
38 | 0.9911 | 0.9889 | 0.8310 | 0.7287 | 0.8240
avg. | 0.7291 | 0.7421 | 0.6447 | 0.5137 | 0.7558
stdev. | 0.2425 | 0.2291 | 0.2462 | 0.2191 | 0.0936
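For reference, the metrics reported in these appendix tables can be computed directly from confusion-matrix counts, as sketched below using the standard definitions of F-beta and MCC, and P4 as the harmonic mean of precision, recall, specificity, and negative predictive value. The numeric counts in the example are illustrative only and not taken from the experiments.

```python
import math

def f_beta(tp, fp, fn, beta=1.0):
    """F_beta score: weighted harmonic mean of precision and recall."""
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return (1 + beta**2) * prec * rec / (beta**2 * prec + rec)

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient over all four confusion-matrix cells."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

def p4(tp, fp, tn, fn):
    """P4: harmonic mean of precision, recall, specificity, and NPV."""
    num = 4 * tp * tn
    return num / (num + (tp + tn) * (fp + fn)) if num else 0.0

# Illustrative counts (not from the paper's experiments):
scores = (f_beta(40, 10, 10), f_beta(40, 10, 10, beta=0.5),
          mcc(40, 10, 40, 10), p4(40, 10, 40, 10))
```

Unlike F1, both MCC and P4 involve the true negatives, which is why heavily imbalanced permissions can show a near-perfect F1 alongside a P4 of zero, as in rows 21 and 27.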
Figure A1. Fraction of legitimate permission requests, sorted in ascending order.
Table A3. Average F1 and F0.5 scores, P4, MCC, and AUROC from 5-fold cross-validation of DeBERTa with cross-entropy as the loss function for the 33 permissions individually, as the DeBERTa baseline performance. Values are obtained after 5 epochs per fold, batch size is 4, and a simple permission description is used (column Description in Table A4).
ID | F1 Score | F0.5 Score | P4 | MCC | AUROC
1 | 0.7043 | 0.7880 | 0.8057 | 0.6957 | 0.8074
2 | 0.8473 | 0.8414 | 0.7706 | 0.5754 | 0.7772
5 | 0.9830 | 0.9763 | 0.5766 | 0.4880 | 0.7006
6 | 0.9867 | 0.9810 | 0.5498 | 0.4632 | 0.6732
7 | 0.8844 | 0.8723 | 0.6726 | 0.4846 | 0.7156
8 | 0.8148 | 0.8149 | 0.7704 | 0.5550 | 0.7745
9 | 0.9828 | 0.9789 | 0.5670 | 0.4600 | 0.6866
10 | 0.5091 | 0.5918 | 0.6525 | 0.4889 | 0.7058
11 | 0.9026 | 0.8990 | 0.9192 | 0.8416 | 0.9222
12 | 0.8383 | 0.8280 | 0.7627 | 0.5511 | 0.7709
13 | 0.0000 | 0.0000 | 0.0000 | −0.0046 | 0.4991
14 | 0.7101 | 0.6822 | 0.8093 | 0.6613 | 0.8530
15 | 0.7278 | 0.7567 | 0.8136 | 0.6616 | 0.8181
16 | 0.9863 | 0.9863 | 0.5593 | 0.4409 | 0.7015
17 | 0.7832 | 0.7825 | 0.8440 | 0.7082 | 0.8568
18 | 0.8039 | 0.8413 | 0.8622 | 0.7477 | 0.8575
19 | 0.7600 | 0.8327 | 0.8428 | 0.7255 | 0.8248
20 | 0.8409 | 0.8386 | 0.7410 | 0.5235 | 0.7572
21 | 0.9915 | 0.9875 | 0.2134 | 0.1867 | 0.6442
22 | 0.8910 | 0.8673 | 0.6372 | 0.4263 | 0.6733
23 | 0.8843 | 0.8994 | 0.8922 | 0.7893 | 0.8929
24 | 0.8035 | 0.8494 | 0.8610 | 0.7465 | 0.8538
25 | 0.3954 | 0.4186 | 0.5104 | 0.4109 | 0.7135
26 | 0.4584 | 0.5251 | 0.5643 | 0.4759 | 0.7012
27 | 0.9939 | 0.9923 | 0.0000 | −0.0025 | 0.7983
29 | 0.0750 | 0.0577 | 0.1077 | 0.0759 | 0.5635
30 | 0.5314 | 0.5304 | 0.5543 | 0.5208 | 0.7625
31 | 0.8713 | 0.8777 | 0.9147 | 0.8356 | 0.9145
32 | 0.3910 | 0.4190 | 0.5138 | 0.4098 | 0.7182
33 | 0.9049 | 0.9182 | 0.9373 | 0.8800 | 0.9330
34 | 0.4947 | 0.5548 | 0.6100 | 0.4859 | 0.7088
35 | 0.6880 | 0.7660 | 0.7744 | 0.6608 | 0.7926
38 | 0.9884 | 0.9869 | 0.4809 | 0.4049 | 0.6888
avg. | 0.7402 | 0.7558 | 0.6391 | 0.5265 | 0.7594
stdev. | 0.2552 | 0.2489 | 0.2520 | 0.2238 | 0.0991
Table A4. Details of the first 33 permissions retrieved from the Google Play Store.
ID | Category | Description (as in Play Store) | Permission Name (as in AndroidManifest.xml) | No. of Apps Requested 1 | Permission Description Details (as in https://developer.android.com/reference/android/Manifest.permission) 2
1 | Device and app history | retrieve running apps | android.permission.GET_TASKS | 8535 | Allows the app to retrieve information about currently and recently running tasks. This may allow the app to discover information about which applications are used on the device.
2 | Wi-Fi connection information | view Wi-Fi connections | android.permission.ACCESS_WIFI_STATE | 101,727 | Allows the app to view information about Wi-Fi networking, such as whether Wi-Fi is enabled and the name of the connected Wi-Fi devices.
5 | Storage | read the contents of your USB storage | android.permission.READ_EXTERNAL_STORAGE | 135,861 | Allows the app to test a permission for the SD card that will be available on future devices.
6 | Storage | modify or delete the contents of your USB storage | android.permission.WRITE_EXTERNAL_STORAGE | 127,762 | Allows the app to write to the SD card.
7 | Other | receive data from Internet | com.google.android.c2dm.permission.RECEIVE | 120,229 | Allows apps to accept cloud to device messages sent by the app’s service. Using this service will incur data usage. Malicious apps could cause excess data usage.
8 | Other | full network access | android.permission.INTERNET | 207,864 | Allows the app to create network sockets and use custom network protocols. The browser and other applications provide means to send data to the internet, so this permission is not required to send data to the internet.
9 | Other | prevent device from sleeping | android.permission.WAKE_LOCK | 172,854 | Allows the app to prevent the phone from going to sleep.
10 | Other | disable your screen lock | android.permission.DISABLE_KEYGUARD | 3843 | Allows the app to disable the keylock and any associated password security. For example, the phone disables the keylock when receiving an incoming phone call, then re-enables the keylock when the call is finished.
11 | Other | control vibration | android.permission.VIBRATE | 97,442 | Allows the app to control the vibrator.
12 | Other | view network connections | android.permission.ACCESS_NETWORK_STATE | 198,058 | Allows the app to view information about network connections such as which networks exist and are connected.
13 | Other | run at startup | android.permission.RECEIVE_BOOT_COMPLETED | 116,194 | Allows the app to have itself started as soon as the system has finished booting. This can make it take longer to start the phone and allow the app to slow down the overall phone by always running.
14 | Microphone | record audio | android.permission.RECORD_AUDIO | 35,595 | Allows the app to record audio with the microphone. This permission allows the app to record audio at any time without your confirmation.
15 | Location | approximate location (network-based) | android.permission.ACCESS_COARSE_LOCATION | 57,838 | Allows the app to obtain your approximate location. This location is derived by location services using network location sources such as cell towers and Wi-Fi. These location services must be turned on and available to your device for the app to use them. Apps may use this to determine approximately where you are.
16 | Identity | add or remove accounts | android.permission.MANAGE_ACCOUNTS | 3259 | Allows the app to perform operations like adding and removing accounts, and deleting their password.
17 | Camera | take pictures and videos | android.permission.CAMERA | 67,875 | Allows the app to take pictures and videos with the camera. This permission allows the app to use the camera at any time without your confirmation.
18 | Other | close other apps | android.permission.RESTART_PACKAGES, android.permission.KILL_BACKGROUND_PROCESSES | 2838 | Allows the app to end background processes of other apps. This may cause other apps to stop running.
19 | Other | pair with Bluetooth devices | android.permission.BLUETOOTH | 29,571 | Allows the app to view the configuration of the Bluetooth on the phone, and to make and accept connections with paired devices.
20 | Other | change network connectivity | android.permission.CHANGE_NETWORK_STATE | 12,018 | Allows the app to change the state of network connectivity.
21 | Other | use accounts on the device | android.permission.USE_CREDENTIALS | 5801 | Allows the app to request authentication tokens.
22 | Other | connect and disconnect from Wi-Fi | android.permission.CHANGE_WIFI_STATE | 13,672 | Allows the app to connect to and disconnect from Wi-Fi access points and to make changes to device configuration for Wi-Fi networks.
23 | Other | change your audio settings | android.permission.MODIFY_AUDIO_SETTINGS | 25,244 | Allows the app to modify global audio settings such as volume and which speaker is used for output.
24 | Other | access Bluetooth settings | android.permission.BLUETOOTH_ADMIN | 14,764 | Allows the app to configure the local Bluetooth phone, and to discover and pair with remote devices.
25 | Other | manage document storage | android.permission.MANAGE_DOCUMENTS | 1371 | Allows the app to manage document storage.
26 | Other | read Google service configuration | com.google.android.providers.gsf.permission.READ_GSERVICES | 17,798 | Allows this app to read Google service configuration data.
27 | Other | Google Play license check | com.android.vending.CHECK_LICENSE | 15,424 | Market license check
29 | Device ID & call information | read phone status and identity | android.permission.READ_PHONE_STATE | 41,888 | Allows the app to access the phone features of the device. This permission allows the app to determine the phone number and device IDs, whether a call is active, and the remote number connected by a call.
30 | Other | SmartcardService Permission label | org.simalliance.openmobileapi.SMARTCARD | 102 | Enables Android applications to communicate with Secure Elements, e.g. SIM card, embedded Secure Elements, Mobile Security Card or others.
31 | Other | read battery statistics | android.permission.BATTERY_STATS | 630 | Allows an application to read the current low-level battery use data. May allow the application to find out detailed information about which apps you use.
32 | Other | draw over other apps | android.permission.SYSTEM_ALERT_WINDOW | 23,800 | Allows the app to draw on top of other applications or parts of the user interface. They may interfere with your use of the interface in any application, or change what you think you are seeing in other applications.
33 | Other | modify system settings | android.permission.WRITE_SETTINGS | 9642 | Allows the app to modify the system’s settings data. Malicious apps may corrupt your system’s configuration.
34 | Other | send sticky broadcast | android.permission.BROADCAST_STICKY | 3112 | Allows the app to send sticky broadcasts, which remain after the broadcast ends. Excessive use may make the phone slow or unstable by causing it to use too much memory.
35 | Other | allow Wi-Fi Multicast reception | android.permission.CHANGE_WIFI_MULTICAST_STATE | 6350 | Allows the app to receive packets sent to all devices on a Wi-Fi network using multicast addresses, not just your phone. It uses more power than the non-multicast mode.
38 | Contacts | find accounts on the device | android.permission.GET_ACCOUNTS | 16,073 | Allows the app to obtain the list of accounts known by the phone. This may include any accounts created by applications you have installed.
1 out of the 216,711 collected apps; 2 accessed on 30 May 2023.

References

1. Dagan, I.; Glickman, O.; Magnini, B. The PASCAL recognising textual entailment challenge. In Proceedings of the Machine Learning Challenges Workshop; Springer: Berlin/Heidelberg, Germany, 2005; pp. 177–190.
2. Dagan, I.; Roth, D.; Zanzotto, F.; Sammons, M. Recognizing Textual Entailment: Models and Applications; Springer Nature: Berlin/Heidelberg, Germany, 2022.
3. Felt, A.P.; Finifter, M.; Chin, E.; Hanna, S.; Wagner, D. A survey of mobile malware in the wild. In Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices, Chicago, IL, USA, 17 October 2011; pp. 3–14.
4. Lin, J.; Amini, S.; Hong, J.I.; Sadeh, N.; Lindqvist, J.; Zhang, J. Expectation and purpose: Understanding users’ mental models of mobile app privacy through crowdsourcing. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing, Pittsburgh, PA, USA, 5–8 September 2012; pp. 501–510.
5. Viennot, N.; Garcia, E.; Nieh, J. A measurement study of Google Play. In Proceedings of the 2014 ACM International Conference on Measurement and Modeling of Computer Systems, Austin, TX, USA, 16–20 June 2014; pp. 221–233.
6. Feng, Y.; Chen, L.; Zheng, A.; Gao, C.; Zheng, Z. AC-Net: Assessing the consistency of description and permission in Android apps. IEEE Access 2019, 7, 57829–57842.
7. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, p. 2.
8. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998.
9. Gururangan, S.; Marasović, A.; Swayamdipta, S.; Lo, K.; Beltagy, I.; Downey, D.; Smith, N.A. Don’t stop pretraining: Adapt language models to domains and tasks. arXiv 2020, arXiv:2004.10964.
10. Liu, Y. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692.
11. He, P.; Liu, X.; Gao, J.; Chen, W. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv 2020, arXiv:2006.03654.
12. Zhang, Z.; Han, X.; Liu, Z.; Jiang, X.; Sun, M.; Liu, Q. ERNIE: Enhanced language representation with informative entities. arXiv 2019, arXiv:1905.07129.
13. Zhou, J.; You, C.; Li, X.; Liu, K.; Liu, S.; Qu, Q.; Zhu, Z. Are all losses created equal: A neural collapse perspective. arXiv 2022, arXiv:2210.02192.
14. Gunel, B.; Du, J.; Conneau, A.; Stoyanov, V. Supervised contrastive learning for pre-trained language model fine-tuning. arXiv 2020, arXiv:2011.01403.
15. Oord, A.v.d.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748.
16. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 2020, 33, 18661–18673.
17. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2999–3007.
18. Li, X.; Sun, X.; Meng, Y.; Liang, J.; Wu, F.; Li, J. Dice loss for data-imbalanced NLP tasks. arXiv 2019, arXiv:1911.02855.
19. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
20. Ma, E.K. nlpaug: Data Augmentation for NLP. Available online: https://github.com/makcedward/nlpaug (accessed on 10 February 2025).
21. He, P.; Gao, J.; Chen, W. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. arXiv 2021, arXiv:2111.09543.
22. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
23. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837.
24. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165.
25. Wei, J.; Bosma, M.; Zhao, V.Y.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned language models are zero-shot learners. arXiv 2021, arXiv:2109.01652.
26. Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large language models are zero-shot reasoners. Adv. Neural Inf. Process. Syst. 2022, 35, 22199–22213.
Figure 1. (a) F1 score; (b) F0.5 score; (c) P4; (d) MCC; (e) AUROC. F1 and F0.5 scores, P4, MCC, and AUROC from 5-fold cross-validation of BERT with cross-entropy as the loss function for the 33 permissions individually (as BERT baseline performance). Values are obtained after 5 epochs per fold, batch size is 4, and a simple permission description is used (column Description in Table A4). Permissions are sorted in ascending order of the fraction of legitimate requests.
Figure 2. (a) F1 score; (b) F0.5 score; (c) P4; (d) MCC; (e) AUROC. F1 and F0.5 scores, P4, MCC, and AUROC from 5-fold cross-validation of DeBERTa with cross-entropy as the loss function for the 33 permissions individually (as DeBERTa baseline performance). Values are obtained after 5 epochs per fold, batch size is 4, and a simple permission description is used (column Description in Table A4). Permissions are sorted in ascending order of the fraction of legitimate requests.
Table 1. Agreement among the four data annotators in the initial round, measured using Light’s Kappa.
ID | Value | ID | Value | ID | Value | ID | Value
1 | 0.6862 | 12 | 0.6519 | 21 | 0.0000 | 31 | 0.5671
2 | 0.7265 | 13 | 0.4787 | 22 | 0.1589 | 32 | 0.2334
5 | 0.0000 | 14 | 0.7159 | 23 | 0.6813 | 33 | 0.4679
6 | 0.5931 | 15 | 0.6398 | 24 | 0.5899 | 34 | 0.0848
7 | 0.7439 | 16 | 0.1286 | 25 | 0.2000 | 35 | 0.4326
8 | 0.6000 | 17 | 0.6390 | 26 | 0.6667 | 38 | 0.0889
9 | 0.3671 | 18 | 0.7963 | 27 | 0.0000 | |
10 | 0.2693 | 19 | 0.6561 | 29 | 0.2231 | |
11 | 0.6217 | 20 | 0.3473 | 30 | 0.4211 | |
Table 2. Average metric values from 5-fold cross-validation for the performance of BERT, RoBERTa, DeBERTa, and ERNIE with various loss functions compared with the BERT baseline, trained and tested using 10 permissions together. Five epochs per fold and batch size = 4, and using a simple permission description. Negative values mean worse than baseline (in red), and positive values mean better than baseline.
Model | P4 | MCC | AUROC
BERT + CE (Cross-Entropy Loss) | 0.0099 | 0.0451 | 0.0213
BERT + SCL (Supervised Contrastive Loss) | −0.6063 | −0.4790 | −0.2397
BERT + InfoNCE (Contrastive Loss) | −0.6824 | −0.6372 | −0.2882
BERT + FL (Focal Loss) | −0.0031 | 0.0432 | 0.0095
RoBERTa + CE (Cross-Entropy Loss) | 0.0407 | 0.0705 | 0.0282
RoBERTa + SCL (Supervised Contrastive Loss) | −0.6970 | −0.5294 | −0.2483
RoBERTa + InfoNCE (Contrastive Loss) | −0.6928 | −0.5611 | −0.2592
RoBERTa + FL (Focal Loss) | −0.0184 | 0.0327 | 0.0016
DeBERTa + CE (Cross-Entropy Loss) | 0.0831 | 0.1258 | 0.0672
DeBERTa + SCL (Supervised Contrastive Loss) | −0.6970 | −0.5294 | −0.2483
DeBERTa + InfoNCE (Contrastive Loss) | −0.5894 | −0.4479 | −0.2141
DeBERTa + FL (Focal Loss) | 0.0403 | 0.1012 | 0.0480
ERNIE + CE (Cross-Entropy Loss) | 0.0424 | 0.0769 | 0.0340
ERNIE + SCL (Supervised Contrastive Loss) | −0.6799 | −0.6372 | −0.2949
ERNIE + InfoNCE (Contrastive Loss) | −0.6970 | −0.5294 | −0.2483
ERNIE + FL (Focal Loss) | −0.0153 | 0.0463 | 0.0194
Table 3. Average P4 and MCC from 5-fold cross-validation of BERT compared with the BERT baseline, using different loss functions for the 33 permissions individually. Values are obtained after 5 epochs per fold, batch size is 4, and a simple permission description is used. Negative values mean worse than baseline (in red), and positive values mean better than baseline.
P4 | | | | MCC | | |
ID | Focal | SCL | InfoNCE | ID | Focal | SCL | InfoNCE
1 | −0.0759 | −0.6991 | −0.6398 | 1 | −0.0541 | −0.6590 | −0.5308
2 | −0.0347 | −0.7452 | −0.6586 | 2 | −0.0646 | −0.5325 | −0.6421
5 | 0.0198 | −0.6892 | −0.5546 | 5 | 0.0381 | −0.5470 | −0.6031
6 | 0.0479 | −0.7282 | −0.5881 | 6 | 0.0598 | −0.6030 | −0.4833
7 | −0.0316 | −0.7126 | −0.6797 | 7 | −0.0394 | −0.5283 | −0.4834
8 | −0.0275 | −0.7612 | −0.7612 | 8 | −0.0544 | −0.5348 | −0.5348
9 | 0.1136 | −0.4814 | −0.5698 | 9 | 0.0836 | −0.4214 | −0.4749
10 | 0.0015 | −0.6388 | −0.6474 | 10 | −0.0040 | −0.5432 | −0.5699
11 | −0.0119 | −0.8759 | −0.8759 | 11 | −0.0266 | −0.8691 | −0.7595
12 | −0.0024 | −0.6787 | −0.6787 | 12 | 0.0125 | −0.4101 | −0.4619
13 | −0.0716 | −0.4156 | −0.3959 | 13 | −0.0975 | −0.3626 | −0.3628
14 | 0.0688 | −0.5888 | −0.6266 | 14 | 0.0786 | −0.3961 | −0.4397
15 | −0.0032 | −0.8082 | −0.6288 | 15 | 0.0137 | −0.6463 | −0.5279
16 | −0.0845 | −0.4611 | −0.3323 | 16 | −0.0500 | −0.3446 | −0.2564
17 | 0.0593 | −0.6350 | −0.6906 | 17 | 0.0620 | −0.5327 | −0.4790
18 | 0.0308 | −0.7791 | −0.6815 | 18 | 0.0503 | −0.9910 | −0.6563
19 | 0.0035 | −0.7996 | −0.5143 | 19 | −0.0100 | −0.6492 | −0.4310
20 | −0.0446 | −0.6560 | −0.8082 | 20 | −0.0770 | −0.5326 | −0.7291
21 | 0.1423 | 0.1106 | 0.0000 | 21 | 0.0987 | 0.0580 | −0.0862
22 | −0.0661 | −0.5829 | −0.6163 | 22 | −0.0180 | −0.4139 | −0.3803
23 | −0.0080 | −0.6674 | −0.8513 | 23 | −0.0096 | −0.8223 | −1.0030
24 | 0.0171 | −0.7652 | −0.8159 | 24 | 0.0316 | −0.8270 | −0.6616
25 | −0.3489 | −0.5218 | −0.5218 | 25 | −0.3139 | −0.4656 | −0.4656
26 | −0.0726 | −0.2671 | −0.2671 | 26 | −0.0580 | −0.2094 | −0.2094
27 | 0.0000 | 0.1683 | 0.0000 | 27 | −0.0017 | 0.0965 | 0.0000
29 | −0.0995 | −0.0995 | 0.0250 | 29 | −0.0488 | −0.0529 | 0.0453
30 | −0.2804 | −0.4430 | −0.4515 | 30 | −0.2467 | −0.3759 | −0.4019
31 | −0.0142 | −0.9065 | −0.7796 | 31 | −0.0175 | −0.8226 | −0.8298
32 | −0.1424 | −0.5861 | −0.5350 | 32 | −0.1029 | −0.4549 | −0.4254
33 | −0.0361 | −0.9530 | −0.9530 | 33 | −0.0645 | −0.9086 | −0.9949
34 | 0.0037 | −0.7110 | −0.7091 | 34 | −0.0152 | −0.5941 | −0.5947
35 | 0.0168 | −0.8283 | −0.6839 | 35 | 0.0296 | −0.8729 | −0.5803
38 | −0.1605 | −0.8276 | −0.5986 | 38 | −0.1261 | −0.8269 | −0.5426
avg. | −0.0331 | −0.6071 | −0.5785 | avg. | −0.0285 | −0.5332 | −0.5017
stdev. | 0.0969 | 0.2633 | 0.2391 | stdev. | 0.0860 | 0.2629 | 0.2343
Table 4. Average metric values from 5-fold cross-validation for the performance of RoBERTa with various refinements, trained and tested using 33 permissions together, compared with BERT baseline. Five epochs per fold. Negative values mean worse than baseline (in red), and positive values mean better than baseline.
Model | P4 | MCC | AUROC
RoBERTa + CE (Cross-Entropy Loss) (batch size = 4) | 0.0459 | 0.0513 | 0.0238
RoBERTa + CE + Detailed Permission Description (batch size = 4) | 0.0532 | 0.0577 | 0.0305
RoBERTa + SCL (Supervised Contrastive Loss) (batch size = 4) | −0.6447 | −0.5137 | −0.2520
RoBERTa + SCL (Supervised Contrastive Loss) (batch size = 24) | −0.5748 | −0.4961 | −0.2755
RoBERTa + InfoNCE Loss (batch size = 4) | −0.6447 | −0.5137 | −0.2445
RoBERTa + InfoNCE Loss (batch size = 24) | −0.4431 | −0.4432 | −0.1592
RoBERTa + InfoNCE Loss + Detailed Permission Description (batch size = 24) | −0.5075 | −0.4050 | −0.1797
RoBERTa + FL (Focal Loss) (batch size = 4) | 0.0396 | 0.0428 | 0.0211
RoBERTa + FL + Detailed Permission Description (batch size = 4) | 0.0509 | 0.0547 | 0.0282
Table 5. Average metric values from 5-fold cross-validation for the performance of DeBERTa V3 with various refinements, trained and tested using 33 permissions together, compared with BERT baseline. Five epochs per fold. Negative values mean worse than baseline (in red), and positive values mean better than baseline.
Model | P4 | MCC | AUROC
DeBERTa + CE (Cross-Entropy Loss) (batch size = 4) | 0.1088 | 0.1186 | 0.0663
DeBERTa + CE + Detailed Permission Description (batch size = 4) | 0.0632 | 0.0837 | 0.0377
DeBERTa + CE (Cross-Entropy Loss) (batch size = 16) | 0.0753 | 0.0932 | 0.0472
DeBERTa + CE + Detailed Permission Description (batch size = 16) | 0.0680 | 0.0781 | 0.0359
DeBERTa + SCL (Supervised Contrastive Loss) (batch size = 4) | −0.5576 | −0.4464 | −0.2128
DeBERTa + SCL (Supervised Contrastive Loss) (batch size = 16) | −0.6447 | −0.5137 | −0.2498
DeBERTa + InfoNCE Loss (batch size = 4) | −0.5642 | −0.4519 | −0.2084
DeBERTa + InfoNCE Loss (batch size = 16) | −0.4863 | −0.4842 | −0.2244
DeBERTa + InfoNCE Loss + Detailed Permission Description (batch size = 16) | −0.6350 | −0.6241 | −0.2982
DeBERTa + FL (Focal Loss) (batch size = 4) | 0.0922 | 0.1016 | 0.0564
DeBERTa + FL + Detailed Permission Description (batch size = 4) | 0.0761 | 0.0928 | 0.0439
Pretrained DeBERTa + CE (batch size = 4) | −0.2697 | −0.2375 | −0.1201
Pretrained DeBERTa + CE (batch size = 8) | −0.0835 | −0.0987 | −0.0585
Table 6. Average metric values from 5-fold cross-validation for the performance of ERNIE with various refinements, trained and tested using all 33 permissions together, compared with the BERT baseline (5 epochs per fold). Negative values indicate worse performance than the baseline, and positive values indicate better performance.
| Model | P4 | MCC | AUROC |
|---|---|---|---|
| ERNIE + CE (Cross-Entropy Loss) (batch size = 4) | 0.0393 | 0.0439 | 0.0232 |
| ERNIE + CE + Detailed Permission Description (batch size = 4) | 0.0380 | 0.0469 | 0.0228 |
| ERNIE + CE (Cross-Entropy Loss) (batch size = 24) | 0.0482 | 0.0520 | 0.0175 |
| ERNIE + CE + Detailed Permission Description (batch size = 24) | 0.0511 | 0.0516 | 0.0247 |
| ERNIE + SCL (Supervised Contrastive Loss) (batch size = 4) | −0.6447 | −0.5137 | −0.2528 |
| ERNIE + SCL (Supervised Contrastive Loss) (batch size = 24) | −0.5201 | −0.4147 | −0.1984 |
| ERNIE + InfoNCE Loss (batch size = 4) | −0.5020 | −0.5220 | −0.2544 |
| ERNIE + InfoNCE Loss (batch size = 24) | −0.5003 | −0.5828 | −0.2915 |
| ERNIE + InfoNCE Loss + Detailed Permission Description (batch size = 24) | −0.6373 | −0.6404 | −0.3116 |
| ERNIE + FL (Focal Loss) (batch size = 4) | 0.0612 | 0.0658 | 0.0328 |
| ERNIE + FL + Detailed Permission Description (batch size = 4) | 0.0490 | 0.0475 | 0.0257 |
| Pretrained ERNIE + CE (batch size = 4) | −0.0166 | −0.0103 | −0.0087 |
| Pretrained ERNIE + CE (batch size = 24) | 0.0211 | 0.0251 | 0.0068 |
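Tables 4–6 also compare loss functions. Focal loss (FL) reshapes cross-entropy so that well-classified examples contribute less, concentrating training on hard cases; the following is a minimal per-example sketch for binary labels, with illustrative γ and α defaults rather than the paper's implementation:

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    # p: predicted probability of the positive class; y: gold label (0 or 1).
    # The factor (1 - p_t)^gamma shrinks the loss of confident, correct
    # predictions; with gamma = 0 and alpha = 1 it reduces to cross-entropy.
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

An easy positive (p = 0.9) thus incurs a much smaller loss than a harder one (p = 0.6), which is the intended effect on imbalanced permission labels.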
Table 7. Average P4 and MCC from 5-fold cross-validation of BERT compared with the BERT baseline, using different data augmentation approaches, with cross-entropy as the loss function, for the 33 permissions individually. Values are obtained after 5 epochs per fold, with batch size 4 and a simple permission description. Negative values indicate worse performance than the baseline, and positive values indicate better performance.
| ID | SMOTE (P4) | SynAug 0.5% (P4) | SynAug 2% (P4) | SynAug 4% (P4) | SMOTE (MCC) | SynAug 0.5% (MCC) | SynAug 2% (MCC) | SynAug 4% (MCC) |
|---|---|---|---|---|---|---|---|---|
| 1 | −0.0114 | −0.0747 | −0.1188 | −0.2298 | −0.0149 | −0.0715 | −0.1339 | −0.1796 |
| 2 | 0.0003 | N/A | N/A | N/A | 0.0025 | N/A | N/A | N/A |
| 5 | −0.1088 | 0.1185 | 0.0188 | 0.0995 | −0.0759 | 0.1515 | 0.0495 | 0.1271 |
| 6 | 0.0516 | 0.0479 | −0.0123 | 0.0471 | 0.0748 | 0.0598 | −0.0179 | 0.0717 |
| 7 | −0.0065 | −0.0843 | −0.0836 | −0.0369 | −0.0022 | −0.0518 | −0.0615 | −0.0396 |
| 8 | −0.0079 | N/A | N/A | N/A | 0.0022 | N/A | N/A | N/A |
| 9 | −0.1167 | 0.1037 | −0.0095 | 0.0599 | −0.1479 | 0.0791 | −0.0355 | 0.0343 |
| 10 | −0.0240 | −0.1059 | −0.0459 | −0.0865 | −0.0188 | −0.0757 | 0.0067 | −0.0342 |
| 11 | 0.0226 | N/A | N/A | N/A | 0.0412 | N/A | N/A | N/A |
| 12 | 0.0128 | N/A | N/A | N/A | 0.0649 | N/A | N/A | N/A |
| 13 | −0.1098 | −0.1100 | −0.0257 | −0.4156 | −0.1110 | −0.1285 | −0.0413 | −0.3626 |
| 14 | 0.0708 | −0.0162 | −0.1024 | −0.0156 | 0.0674 | −0.0060 | −0.0833 | −0.0032 |
| 15 | −0.0110 | −0.0340 | 0.0092 | −0.0059 | −0.0142 | 0.0043 | 0.0227 | 0.0048 |
| 16 | −0.0776 | 0.1245 | 0.0979 | 0.0980 | −0.0568 | 0.1152 | 0.0793 | 0.0980 |
| 17 | 0.0631 | −0.0933 | −0.0213 | −0.0060 | 0.0727 | −0.0397 | −0.0031 | 0.0084 |
| 18 | 0.0189 | 0.0232 | 0.0169 | 0.0458 | 0.0373 | 0.0411 | 0.0295 | 0.0802 |
| 19 | 0.0321 | −0.0899 | −0.0715 | −0.0542 | 0.0479 | −0.1171 | −0.1071 | −0.0493 |
| 20 | −0.0431 | N/A | N/A | N/A | −0.0656 | N/A | N/A | N/A |
| 21 | 0.3745 | 0.2996 | 0.0996 | 0.2996 | 0.2896 | 0.2896 | 0.0896 | 0.2896 |
| 22 | 0.0628 | −0.0026 | −0.0929 | −0.1442 | 0.0717 | −0.0141 | −0.0672 | −0.0742 |
| 23 | −0.0265 | N/A | N/A | N/A | −0.0549 | N/A | N/A | N/A |
| 24 | −0.0947 | −0.0038 | −0.0361 | −0.0955 | −0.1061 | −0.0027 | −0.0414 | −0.0980 |
| 25 | −0.2221 | −0.1411 | −0.2563 | −0.4688 | −0.1984 | −0.1583 | −0.2550 | −0.4492 |
| 26 | −0.0989 | −0.2671 | −0.0565 | 0.0431 | −0.0851 | −0.2094 | −0.0445 | 0.0286 |
| 27 | N/A | 0.0000 | 0.0000 | 0.0000 | N/A | −0.0017 | −0.0017 | −0.0017 |
| 29 | 0.0331 | −0.0995 | −0.0995 | −0.0995 | 0.0290 | −0.0426 | −0.0426 | −0.0470 |
| 30 | −0.2204 | −0.0847 | −0.1497 | −0.1049 | −0.1924 | −0.0179 | −0.0958 | −0.0542 |
| 31 | 0.0008 | −0.0321 | −0.0162 | −0.0160 | 0.0056 | −0.0457 | −0.0259 | −0.0270 |
| 32 | −0.0267 | −0.1066 | −0.0768 | −0.0783 | −0.0133 | −0.0873 | −0.0859 | −0.0767 |
| 33 | −0.0044 | −0.0303 | 0.0006 | −0.0084 | −0.0054 | −0.0452 | 0.0025 | −0.0131 |
| 34 | −0.0096 | −0.0431 | −0.1448 | −0.0315 | −0.0310 | −0.0766 | −0.1367 | −0.0426 |
| 35 | 0.0138 | 0.0067 | 0.0183 | 0.0151 | 0.0307 | 0.0145 | 0.0339 | 0.0430 |
| 38 | −0.1666 | 0.0096 | −0.1884 | −0.1970 | −0.1431 | 0.0184 | −0.1780 | −0.1917 |
| avg. | −0.0197 | −0.0254 | −0.0499 | −0.0514 | −0.0156 | −0.0155 | −0.0424 | −0.0355 |
| stdev. | 0.1046 | 0.1062 | 0.0802 | 0.1522 | 0.0943 | 0.1000 | 0.0777 | 0.1424 |
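Table 7's SMOTE columns refer to oversampling the minority class with synthetic examples. The core SMOTE step interpolates between a minority sample and one of its minority-class nearest neighbours in feature space; the hedged sketch below shows only that step (the neighbour search, and any adaptation to text embeddings, is omitted, and the function name is mine):

```python
import random

def smote_sample(x_i, x_nn, rng=random):
    # One synthetic minority example: move a random fraction of the way
    # from minority point x_i towards its minority-class neighbour x_nn.
    lam = rng.random()  # interpolation factor in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x_i, x_nn)]
```

Because the same interpolation factor is applied to every feature, the synthetic point lies on the line segment joining the two real minority samples.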
Table 8. Average metric values from 5-fold cross-validation for the performance of BERT compared with the BERT baseline, trained and tested using 2 permissions together. Cross-entropy is used as the loss function, with 5 epochs per fold, batch size 4, and a simple permission description. Negative values indicate worse performance than the baseline, and positive values indicate better performance.
| Permission Request Combination | IDs | ID | P4 | MCC | AUROC |
|---|---|---|---|---|---|
| Same category and same majority class | 5, 6 | 5 | 0.0725 | 0.0861 | 0.0025 |
| | | 6 | −0.2179 | −0.1627 | −0.0403 |
| | 7, 8 | 7 | 0.0333 | 0.0613 | 0.0153 |
| | | 8 | 0.0144 | 0.0425 | 0.0200 |
| Unrelated categories but same majority class | 5, 7 | 5 | 0.0302 | 0.0905 | −0.0220 |
| | | 7 | −0.0167 | 0.0139 | −0.0245 |
| | 5, 8 | 5 | 0.0810 | 0.1041 | 0.0305 |
| | | 8 | −0.0027 | 0.0066 | 0.0009 |
| | 6, 7 | 6 | −0.2955 | −0.2240 | −0.0232 |
| | | 7 | −0.0071 | 0.0340 | −0.0173 |
| | 6, 8 | 6 | −0.0302 | −0.0128 | 0.0085 |
| | | 8 | −0.0366 | −0.0509 | −0.0162 |
| Same category but opposite majority classes | 5, 25 | 5 | −0.0405 | −0.0162 | −0.0739 |
| | | 25 | −0.2713 | −0.2741 | −0.1240 |
| | 6, 25 | 6 | −0.0296 | 0.0185 | −0.0065 |
| | | 25 | 0.0221 | −0.0408 | −0.0169 |
Ma, C.Y.T. Quantifying Privacy Risk of Mobile Apps as Textual Entailment Using Language Models. J. Cybersecur. Priv. 2025, 5, 111. https://doi.org/10.3390/jcp5040111
