Article

Quantifying Privacy Risk of Mobile Apps as Textual Entailment Using Language Models

The Department of Computer Science, The Hang Seng University of Hong Kong, Hong Kong, China
J. Cybersecur. Priv. 2025, 5(4), 111; https://doi.org/10.3390/jcp5040111
Submission received: 10 November 2025 / Revised: 4 December 2025 / Accepted: 10 December 2025 / Published: 12 December 2025
(This article belongs to the Section Privacy)

Abstract

Smartphones have become an integral part of our lives in modern society, as we carry and use them throughout the day. However, this “body part” may maliciously collect and leak our personal information without our knowledge. When we install mobile applications on our smartphones and grant their permission requests, these apps can use the sensors embedded in the smartphones and the stored data to gather and infer our personal information, preferences, and habits. In this paper, we present our preliminary results on quantifying the privacy risk of mobile applications by assessing, through textual entailment decided by language models (LMs), whether the requested permissions are necessary given the app descriptions. We observe that despite incorporating various improvements to LMs proposed in the literature for natural language processing (NLP) tasks, the performance of the trained model remains far from ideal.

1. Introduction

In today’s society, smartphones have become an integral extension of ourselves, as we rely on them throughout the day. However, this “extension” can also pose risks, as it may gather and expose our personal information in harmful ways without our knowledge. This is possible because smartphones are embedded with numerous sensors, with even more added in newer models, which can capture our every action. When we install curious, or even malicious, mobile applications (apps) on our smartphones and grant their permission requests, such as location, camera, motion and orientation, contacts, etc., these apps can utilize the embedded sensors and the stored data to collect and infer our personal information, preferences, and habits, such as sleeping habits, personality traits, or even medical conditions.
In recent years, with data scandals of apps, websites, and technology companies being widely covered by the news media, people have become more aware of the privacy risks associated with the use of these technologies. However, such awareness does not necessarily make users act more carefully to safeguard their privacy. One possible reason is that people overlook the potential risks of granting certain permissions and providing certain personal information to an app. This happens because users may not be fully aware of the exact set of permissions necessary for each function of an app, or of the potential impact of granting a permission, and hence they may underestimate the privacy threat of the app.
With so many mobile applications available both officially and otherwise, we believe it is beneficial to assess the privacy risk of these applications by focusing on the legitimacy of their permission requests. As a result, there are two major contributions of this work.
  • First, we design a machine learning-based framework to determine the privacy risk of mobile applications in terms of requested permissions;
  • Second, we implement the framework and assess the privacy risk of mobile applications with empirical studies.
Novelty Statement: Unlike traditional approaches that treat privacy risk quantification as a simple classification task, we formulate the problem as a textual entailment [1,2] task. We prepared a dataset consisting of app descriptions and their associated permission requests, and developed a framework that utilizes language models to perform this task.
Textual entailment, a natural language processing (NLP) task, seeks to determine whether one text fragment can be inferred from another by analyzing their semantic relationships. This involves understanding context, word meanings, and the underlying logic of the fragments of texts. By leveraging the semantic meanings of both app descriptions and permission descriptions, this approach enhances the model’s flexibility to accommodate new permissions that may emerge in the future. This raises the question of which language model performs best for this task.
We also observe that the labeled dataset is highly imbalanced for many permissions. This is primarily because we only classify a permission request as legitimate when the app explicitly requests it; we currently do not label non-requested permissions. An imbalanced dataset, if not addressed appropriately, could adversely affect the ability of the trained model to generalize. As a result, we also explore strategies such as data augmentation and the application of different loss functions to mitigate the effects of this imbalance.
In summary, this paper investigates the following research questions:
  • RQ1: How do various language models perform in quantifying the legitimacy of permissions formulated as textual entailment?
  • RQ2: What is the impact of employing different loss functions and data augmentation strategies on handling imbalanced data within the context of this study?
  • RQ3: How does training with multiple permission requests or utilizing pretraining with app descriptions influence the performance of the model?
The organization of the paper is as follows: We cover related work in Section 2. We provide problem formulation in Section 3. We present the solution in Section 4, the experiment setup in Section 5, and the related results and discussion in Section 6. We conclude and give future directions in Section 7.

2. Related Work

Our work touches upon several related fields: security of mobile applications (apps), language models for natural language processing (NLP) tasks, and strategies to handle imbalanced datasets.

2.1. Security of Mobile Applications

Our problem is closely related to security analysis of mobile applications (apps), such as identifying malicious apps, where many studies have been performed in the literature, e.g., [3]. However, as pointed out by Lin et al. [4], there are fundamental differences between the two. In particular, context matters as we decide whether a permission is necessary for the functions of an app—precise location is usually necessary for a map application, but definitely not for a calculator application. Consequently, merely analyzing whether an app accesses and uploads a piece of private information is insufficient to determine if it infringes on users’ privacy. Our work complements these security analysis efforts by focusing specifically on the legitimacy of permission requests, providing a more complete understanding of privacy risks associated with mobile applications.
Before training the model and performing the analysis, we needed to prepare our own dataset of apps, which includes their descriptions, permission requests, and privacy policies. As we were unable to follow the framework used in prior work [5] to retrieve and download apps from the Google Play Store, we devised our own web-crawling approach, implemented as a Python (3.9) script using Selenium (v4.3.0) and BeautifulSoup (v4.11.1). Apps are retrieved from the Play Store through its “search” function using a dictionary attack with 10,000 words, similar to the approach used in that work [5].
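To illustrate the dictionary-attack retrieval step, the sketch below builds one Play Store search URL per dictionary word; a Selenium-driven browser would then load each URL, and BeautifulSoup would parse the rendered page for app links. This is a minimal illustration under assumptions: the query-parameter format shown and the helper name are ours, not the actual crawler code.

```python
from urllib.parse import quote_plus

def build_search_urls(wordlist, base="https://play.google.com/store/search"):
    """Build one Play Store search URL per dictionary word.

    Each URL would be visited by a Selenium-driven browser, and the
    rendered page parsed with BeautifulSoup to extract app links.
    """
    return [f"{base}?q={quote_plus(word)}&c=apps" for word in wordlist]

urls = build_search_urls(["calculator", "step counter"])
```

With a 10,000-word dictionary, this yields 10,000 search pages to crawl, from which app identifiers can be deduplicated before downloading metadata.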
Researchers have been studying the privacy risk of mobile applications. For instance, AC-Net [6] studied the correspondences between the semantics of descriptions and the permission usage. A variant of the RNN architecture with the Gated Recurrent Unit (GRU) was used. However, only 1415 popular Android apps were collected and 24,724 sentences were labeled, and permissions were grouped into 11 categories for easier analysis.

2.2. Language Models for Natural Language Processing (NLP) Tasks

In order to understand the meaning of app descriptions and the permissions, language models are used in this work. Various language models have been proposed in the literature for natural language processing (NLP) tasks, such as sentiment analysis, text summarization, part-of-speech labeling, translation, etc.
BERT [7] is one of the most notable early language models to utilize the Transformer architecture [8]. BERT was pretrained on large corpora using masked language modeling (MLM) [9] and next sentence prediction tasks. Later, a number of other language models based on BERT with improved performance were proposed. For example, RoBERTa [10] has more extensive training such as longer training time, more data, and bigger batches; DeBERTa [11] uses the disentangled attention mechanism where each word is represented using two vectors that encode its content and position.
Many other language models based on the Transformer architecture were also proposed in the literature. For example, ERNIE [12] was trained using both large-scale textual corpora and knowledge graphs, which provide rich structured knowledge facts for better language understanding.
There are also other works proposed to improve the performance of existing language models on downstream tasks. For instance, continued pretraining of the language model [9] using domain data is proposed to adapt the model for the downstream task.

2.3. Strategies to Handle Imbalanced Dataset

The collected data are found to be highly imbalanced for many of the permissions. As a result, a number of approaches for handling such imbalanced data are also explored.
One approach considers the use of a modified loss function, although it is also suggested that all losses are created equal [13] when the neural network has sufficient approximation power and the training is performed for sufficiently many iterations, resulting in a Neural Collapse phenomenon. For instance, the contrastive loss function [14,15] in Supervised Contrastive Learning [16] addresses class imbalance by leveraging pairwise comparisons to learn robust representations of different classes, focusing on similarities and differences rather than direct class labels. This improves the generalization capabilities of the model, leading to better performance on minority class examples. Meanwhile, the Focal Loss function [17] is a modified cross-entropy loss, which focuses more on the minority class and hard-to-classify examples, and leads to better performance when there is a significant class imbalance. The Dice Loss function [18] handles the data imbalance issue by attaching similar importance to false positives and false negatives, and associating training examples with dynamically adjusted weights to de-emphasize easy-negative examples.
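As a concrete illustration of how Focal Loss down-weights easy examples relative to cross-entropy, the following is a minimal pure-Python sketch of its binary form, FL(p_t) = −α_t (1 − p_t)^γ log(p_t), from [17]; the defaults γ = 2 and α = 0.25 follow that paper, and the function name is ours.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single predicted probability p of the
    positive class and true label y in {0, 1}.

    p_t is the probability assigned to the true class; the modulating
    factor (1 - p_t)**gamma down-weights easy, well-classified examples.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct example contributes far less loss than a hard one.
easy = focal_loss(0.95, 1)
hard = focal_loss(0.30, 1)
```

With γ = 0 and α = 1, the expression reduces to the standard cross-entropy loss, which is why Focal Loss is described as a modified cross-entropy.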
Data augmentation and oversampling are also used in the literature for handling imbalanced data. For example, SMOTE [19] is an oversampling technique by synthesizing samples for the minority class using existing ones and interpolating them. However, SMOTE operates within the feature space, resulting in synthesizing data that do not represent any real text. Additionally, while SMOTE utilizes KNN, the feature spaces for NLP problems are vast, and KNN can struggle significantly in such high-dimensional contexts. Meanwhile, nlpAug [20] is a library specifically designed for data augmentation in NLP tasks. Its augmentation techniques include word insertion, substitution, swapping, and deletion.
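The feature-space interpolation that SMOTE performs can be sketched as follows: pick a random minority point, find its k nearest neighbours, and synthesize a point on the segment between the point and one neighbour. This is an illustrative toy version, not the reference implementation, and, as noted above, the resulting vectors do not decode back to any real text.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Synthesize feature-space samples by interpolating each minority
    point toward one of its k nearest neighbours (SMOTE-style sketch)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # nearest neighbours by squared Euclidean distance (excluding x)
        nbrs = sorted((p for p in minority if p is not x),
                      key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]
        nb = rng.choice(nbrs)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + lam * (b - a) for a, b in zip(x, nb)])
    return synthetic

new_pts = smote_like([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]], n_new=4)
```

Every synthetic point is a convex combination of two real minority points, which is exactly why, in a high-dimensional language feature space, such points need not correspond to any plausible description.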
In our experiment, we will compare the performance of these different language models and strategies to handle imbalanced datasets.

3. Problem Formulation

The Android permission system is a security framework that regulates how mobile applications (apps) access sensitive user data and system features on Android devices. For an app to access such data and features, it must request permission and seek user approval. Many functions, such as providing location-based services, taking photos, and storing data, necessitate specific permission requests. However, granting such requests whenever an app asks for them could expose a user’s sensitive information and privacy to potential misuse. Therefore, it is important for users to carefully evaluate permission requests to ensure their data remains secure.
To assist users in making informed decisions, it is essential for apps to clearly explain the necessity of permission requests in their descriptions by illustrating the functions they provide. However, users may still find it challenging to read through these descriptions and make the right choice. Therefore, our goal is to develop a framework that assesses the legitimacy of an app’s permission requests, by analyzing and interpreting its description, and informs the users of the assessment results.

4. Privacy Risk Quantification

To fully capture the meaning of a permission and provide the flexibility to cater to new permissions in the future, we decided to formulate the problem of determining whether a permission request is legitimate—through the description of app functions—as a textual entailment task. In this task, the text, t, is the description of the app, and the hypothesis, h, is the short description of the permission (column Description (as in Play Store) in Table A4), and we need to decide whether “t entails h”, i.e., t ⊨ h. Notice that unlike the usual textual entailment task, which has three outcomes, namely, entail, contradict, and neutral, our task only consists of entail and not entail. In particular, entail means an application is legitimate to request a permission according to its description, while not entail means it is illegitimate to do so. We let entail be positive and not entail be negative in the confusion matrix.
Modeling the problem as a textual entailment task offers the advantage that we do not need to separate sentences related to a permission from the overall app description during inference. This is because it is highly unlikely that different sentences in the app description would contain contradictory information regarding the need of a permission. Additionally, entailment can still hold even if only a portion of the text is relevant to the hypothesis. This means we only need to evaluate the app description in its entirety during inference.
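Concretely, each labeled permission request becomes a (text, hypothesis, label) triple that can be fed to a sentence-pair classifier. The field names in the sketch below are illustrative only, not the dataset's actual schema:

```python
def make_entailment_pairs(app, permission_descriptions):
    """Turn one app record into (text, hypothesis, label) triples.

    text       = the full app description (no sentence selection needed);
    hypothesis = the short Play-Store-style description of a permission;
    label      = 1 for entail (legitimate request), 0 for not entail.
    """
    return [(app["description"], permission_descriptions[p], label)
            for p, label in app["labeled_permissions"].items()]

perm_desc = {"ACCESS_FINE_LOCATION": "access precise location"}
app = {"description": "Find restaurants near you on a live map.",
       "labeled_permissions": {"ACCESS_FINE_LOCATION": 1}}
pairs = make_entailment_pairs(app, perm_desc)
```

In practice, each triple would be tokenized as a sentence pair (text and hypothesis concatenated with a separator token) before fine-tuning the language model.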
As we were labeling the data, we noticed that the labels for many permissions were highly imbalanced. For instance, for some permissions, most of the requests are legitimate, and hence we have mostly positive examples. To balance the samples, however, we cannot simply treat all the non-requesting apps as negative examples for such permissions. This is because it may still be legitimate, judging from its description, for an app that does not request a particular permission to have requested it. This is particularly true for permissions that may have been introduced after the release of the app. As such, we also explored various ways to handle the data imbalance issue, including loss functions that are claimed to be less sensitive to imbalanced datasets and different data augmentation techniques, as discussed in Section 2.
In the experiments, we fine-tune a separate BERT model ([7], bert-base-uncased) for each individual permission, establishing the baseline performance for each of them. Then, various proposed pretraining, fine-tuning, and data augmentation approaches, together with different language models, are explored to determine their effectiveness on improving the classification result. These approaches include
  • The use of other language models, such as RoBERTa ([10], roberta-base), DeBERTa v3 ([21], DeBERTa-v3-base-mnli-fever-anli), and ERNIE ([12], ernie-2.0-base-en);
  • The use of different loss functions, such as Focal Loss [17], and also contrastive loss functions, such as Supervised Contrastive Loss [16] and InfoNCE loss [15], apart from the default cross-entropy loss;
  • The use of more elaborated permission description to assist with fine-tuning and inference;
  • The use of different data augmentation techniques, such as SMOTE [19] and the ones on word substitution provided in nlpAug [20];
  • Fine-tuning a model using a single permission versus multiple permissions together with (1) similar categories; (2) unrelated categories; and (3) different majority classes;
  • Pretraining the model with either the collected data or similar data from previous studies [6] to make the model better suited to the task.

5. Experiment Setup

5.1. Data Collection

A total of 216,711 mobile applications, 317 permissions, and 2,221,174 permission requests were collected from the Google Play Store for analysis. They were collected in January 2023.

5.2. Data Labeling

The collected permission requests were processed manually to assess their legitimacy in relation to the app description. Three student helpers and two research assistants majoring in computer science were recruited for this task. These students were hired because they had general programming, or even app development, experience, making them better positioned to determine whether an app function requires certain permissions.
In order to improve the labeling correctness, the author, along with the two research assistants and the three student helpers, first had discussions on the potential functions associated with each permission, providing brief explanations for context. Each permission was then assigned to one student helper, who, along with the author and two research assistants, labeled at least 10 diverse samples. This approach ensured that contributions came from four individuals, facilitating comparison and a clearer understanding of each permission’s use cases. Table 1 presents the inter-annotator agreement of the initial samples, measured using Light’s Kappa, which is the average of all possible pairwise Cohen’s Kappa coefficients among the annotators. It is important to note that while some permissions (e.g., 5, 17, 21, and 27) show very low or even zero values of Light’s Kappa during the initial labeling, this does not reflect a lack of agreement. Instead, it is due to the highly skewed request entailment for these permissions, which leads to highly imbalanced classification results (see Table A1) and a Kappa value that fails to capture the meaningful agreement between raters.
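Light's Kappa as used in Table 1 can be computed as the mean of the pairwise Cohen's Kappa values. The minimal sketch below (for binary labels) also reproduces the skew effect noted above: near-total agreement on a heavily skewed permission can still yield a low Kappa. Treating two identical single-label annotations as agreement 1.0 is our convention here, since Kappa is formally undefined in that degenerate case.

```python
from itertools import combinations

def cohens_kappa(a, b):
    """Cohen's kappa between two annotators' label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n                      # observed
    labels = set(a) | set(b)
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)   # chance
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)

def lights_kappa(annotations):
    """Light's kappa: mean pairwise Cohen's kappa over all annotators."""
    ks = [cohens_kappa(a, b) for a, b in combinations(annotations, 2)]
    return sum(ks) / len(ks)

# Four annotators, ~95% raw agreement on a skewed permission,
# yet the averaged kappa stays low.
skewed = lights_kappa([[1]*9 + [0], [1]*10, [1]*10, [1]*9 + [0]])
```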
Following the initial labeling, the labeled samples were compared, and any discrepancies were thoroughly discussed and clarified. Only after this review did the student helpers proceed to label up to 600 examples for each permission, or the total number of samples collected, whichever was smaller. Note that different student helpers were assigned to work on different permission requests. During labeling, the student helpers could also label some requests as “uncertain.” They would then discuss with the author and the two research assistants to determine the correct label, ensuring accurate interpretation and consistent application of the labeling criteria.
Due to limited resources, however, only 38 permissions, each with at most 600 requests, were labeled and made available for the experiments. The list of permissions used in the experiments is provided in Table A4, and the distribution of textual entailment labels for the request of each permission is listed in Table A1. The dataset can be accessed through https://github.com/chrismahsu/AndroidApps (accessed on 11 July 2025). Notice that the IDs of the permissions were assigned according to the order in which they were retrieved from the Google Play Store. Some permissions are omitted as they are identical to the ones listed but classified under different categories. Also notice from Table A1 and Figure A1 that many of the labeled permission requests are imbalanced in terms of textual entailment results.

5.3. Metrics

Various metrics are used in the experiments to quantify the performance of the language models and improvement strategies. These metrics include the F-beta score (where beta is chosen such that Recall is considered beta times as important as Precision), the Matthews Correlation Coefficient (MCC, also known as the Phi coefficient), the P4 metric (also known as Symmetric F), and the Area under the Receiver Operating Characteristic Curve score (AUROC). With TP, FP, TN, and FN denoting True Positives, False Positives, True Negatives, and False Negatives, respectively, the three metrics F-beta, MCC, and P4 are defined as follows:

$$F_\beta = \frac{(1+\beta^2) \times Precision \times Recall}{\beta^2 \times Precision + Recall} = \frac{(1+\beta^2)\,TP}{\beta^2(TP+FN) + (TP+FP)},$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}},$$

$$P_4 = \frac{4}{\frac{1}{Precision} + \frac{1}{Recall} + \frac{1}{Specificity} + \frac{1}{NPV}} = \frac{4 \times TP \times TN}{4 \times TP \times TN + (TP+TN) \times (FP+FN)},$$

where $Precision = \frac{TP}{TP+FP}$, $Recall = \frac{TP}{TP+FN}$, $Specificity = \frac{TN}{TN+FP}$, and $NPV = \frac{TN}{TN+FN}$. AUROC is defined as the area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate ($TPR = Recall = \frac{TP}{TP+FN}$) against the false positive rate ($FPR = \frac{FP}{TN+FP}$). For MCC, if exactly one of the four sums in the denominator is zero, the denominator is set to one. If more than one sum is zero, then MCC is undefined, and we remove the result of that fold from averaging during cross-validation.
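The definitions above, including the convention for zero sums in the MCC denominator, can be computed directly from confusion-matrix counts; the helper below is an illustrative sketch, not the experiment code.

```python
import math

def metrics(tp, fp, tn, fn, beta=1.0):
    """F-beta, MCC, and P4 from confusion-matrix counts.

    MCC follows the convention above: with exactly one zero sum the
    denominator is set to 1; with more than one, MCC is undefined
    (returned as None, so the fold can be dropped from averaging).
    """
    f_beta = (1 + beta**2) * tp / (beta**2 * (tp + fn) + (tp + fp))
    sums = [tp + fp, tp + fn, tn + fp, tn + fn]
    zeros = sums.count(0)
    if zeros > 1:
        mcc = None
    else:
        denom = 1.0 if zeros == 1 else math.sqrt(math.prod(sums))
        mcc = (tp * tn - fp * fn) / denom
    p4 = 4 * tp * tn / (4 * tp * tn + (tp + tn) * (fp + fn))
    return f_beta, mcc, p4

f1, mcc, p4 = metrics(tp=40, fp=10, tn=40, fn=10)
```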
Notice that apart from the usual F1 score, we also consider the F0.5 score. This is because, first, the available samples can be highly imbalanced, and second, the impact of a false positive (granting an app an illegitimate permission request) is worse than that of a false negative (flagging a legitimate permission request as illegitimate).
To give a more balanced and symmetric consideration of both positive and negative entailment classification, the MCC, P4, and AUROC metrics are also considered. In particular, MCC measures the association between predicted and actual binary outcomes, with a range from −1 to 1; a value of +1 represents a perfect prediction, 0 an average random prediction, and −1 an inverse prediction. P4 also considers other aspects of a classifier’s performance, as it takes Precision, Recall, Specificity, and Negative Predictive Value (NPV) into consideration, with a range of 0 to +1; a value of +1, again, represents a perfect prediction. AUROC assesses the classifier’s ability to distinguish between positive and negative classes, with a range from 0 to 1; a value of 1 represents perfect discrimination between classes, 0.5 no discriminative power (equivalent to random guessing), and values below 0.5 worse-than-random performance. The higher the AUROC, the better the model’s ability to correctly classify positive and negative instances.

6. Experiment Results and Discussion

Most of the simulations are run on Google Colab Enterprise using machines with an Nvidia L4 GPU, while some are run on a local laptop equipped with an Nvidia GeForce RTX 3070 Ti GPU. All reported values represent the average over five folds of cross-validation, using a learning rate of 1 × 10⁻⁵ and default dropout probabilities of 0.1. All language models are optimized using AdamW [22] during fine-tuning. Unless stated otherwise, a permission is specified using the short description (column Description (as in Play Store) in Table A4).
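The five-fold split can be sketched as follows: sample indices are shuffled once and partitioned into five near-equal folds, with each fold serving once as the test set and the remainder used for fine-tuning. This is an illustrative sketch; the actual fold assignment in the experiments may differ.

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Partition sample indices 0..n-1 into k shuffled, near-equal folds;
    return (train_indices, test_indices) for each of the k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]

splits = kfold_indices(10, k=5)
```

The per-permission metrics are then computed on each held-out fold and averaged, which is how all values reported in this section are obtained.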
We notice that each simulation with 5-fold cross-validation for 33 permissions together using ERNIE and RoBERTa takes around five to six hours to complete on Google Colab Enterprise, while that using DeBERTa takes around nine to ten hours to complete.

6.1. Baseline Performance

Figure 1 depicts the performance of BERT-base measured by various metrics using the cross-entropy loss function and a batch size of 4 for each individual permission, while Figure 2 depicts that of DeBERTa. The corresponding numeric values are listed in Table A2 and Table A3.
From Figure 1a,b, we can observe that for permissions with a majority of samples in entailment, the baseline performance measured using the F1 and F0.5 scores is generally better. Meanwhile, when the majority of samples are in not-entailment, the baseline performance can be very poor. This is because the two metrics place more emphasis on positive classification, and having the majority of samples in entailment makes the model learn and perform better in positive classification. Conversely, having a minority of samples in entailment makes the model fail to learn positive classification accurately, which results in poorer F1 and F0.5 scores.
When both positive and negative classifications are considered, as shown in Figure 1c–e with the use of the P4, MCC, and AUROC metrics, the performance of BERT-base is more accurately reflected. For instance, the “good” performance of the model on positive classification is now overshadowed by the poor performance on negative classification, such as for permission 16 and, more extremely, permissions 21 and 27. As such, we opt to use P4, MCC, and AUROC for performance comparison.
When we compare the results in Table A2 and Table A3, we notice that the performance of DeBERTa is slightly better than that of BERT if we consider the F1 score, F0.5 score, MCC, and AUROC metrics, with average improvements of 0.0111, 0.0137, 0.0128, and 0.0036, respectively; however, BERT-base performs slightly better if we consider the P4 metric, with an average improvement of 0.0056.
Summary: P4, MCC, and AUROC are more appropriate metrics than the F1 and F0.5 scores to quantify the performance, as they consider both positive and negative classifications.

6.2. RQ1: How Do Various Language Models Perform in Quantifying the Legitimacy of Permissions Formulated as Textual Entailment?

Table 2 lists the experiment results of four language models, namely BERT-base-uncased, RoBERTa-base, DeBERTa-v3-base-mnli-fever-anli, and ERNIE-2.0-base-en, with four different loss functions, namely cross-entropy loss, Focal Loss, Supervised Contrastive Loss, and InfoNCE Loss, using the labeled data of 10 permissions (IDs 1, 2, 5, 6, 7, 8, 9, 11, 12, and 16 in Table A4) for training and testing. In the experiments, the default values of learning rate and temperatures for Supervised Contrastive Loss and InfoNCE Loss are used. The reported values are the average performance of individual permissions.
From Table 2, we can observe that when the same loss function is used, the four models, namely BERT, RoBERTa, DeBERTa, and ERNIE, demonstrate comparable performance when the 10 permissions are considered together. Among the four models, DeBERTa exhibits consistently slightly better performance.
Note that in the tables, the values of −0.6970 in P4, −0.5294 in MCC, and −0.2483 in AUROC, when compared to the BERT baseline, correspond to absolute values of 0 in P4 and MCC, and 0.5 in AUROC, which represent performances equivalent to random guessing.
Summary: The four language models considered, namely BERT, RoBERTa, DeBERTa, and ERNIE, exhibit comparable performance, with DeBERTa consistently achieving slightly better results.

6.3. RQ2: What Is the Impact of Employing Different Loss Functions and Data Augmentation Strategies on Handling Imbalanced Data Within the Context of This Study?

Table 3 lists the performance of various loss functions applied to BERT during the training process, including Focal Loss, Supervised Contrastive Loss, and InfoNCE Loss, using the labeled data of 33 permissions individually (all the permissions in Table A4). The reported values are the difference from the BERT baseline performance.
Table 4, Table 5 and Table 6 depict the experiment results when the language models are enhanced with different refinements in terms of loss function and data preparation for RoBERTa, DeBERTa, and ERNIE, respectively, using the labeled data of 33 permissions together for training and testing. In the experiments, the default values of learning rate and temperatures for Supervised Contrastive Loss and InfoNCE Loss are also used. Again, the reported values are the average performance of individual permissions.
Similarly, Table 7 lists the performance of various data augmentation approaches applied to BERT, including SMOTE and SynonymAug (SynAug) provided by nlpAug, and the reported values are the difference from the BERT baseline performance. In the table, SynAug x% means that x% of the words are replaced with synonyms. Please note that “N/A” in SynonymAug is due to the relatively balanced sample set of those permissions, which results in no augmentation being needed on the minority class. In contrast, “N/A” in SMOTE indicates insufficient samples for SMOTE to operate, leading to a lack of augmentation on the minority class.
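The SynAug x% substitution can be sketched as follows: roughly x% of the words are replaced, restricted to words for which a synonym entry exists. The tiny synonym table below is illustrative only; nlpAug's SynonymAug draws its synonyms from a lexical database such as WordNet.

```python
import random

def synonym_aug(text, synonyms, pct=0.1, seed=0):
    """Replace roughly pct of the words with synonyms, mimicking a
    SynonymAug-style augmentation at a given substitution rate."""
    rng = random.Random(seed)
    words = text.split()
    n_swap = max(1, round(pct * len(words)))
    # only words with a synonym entry are eligible for substitution
    candidates = [i for i, w in enumerate(words) if w.lower() in synonyms]
    for i in rng.sample(candidates, min(n_swap, len(candidates))):
        words[i] = rng.choice(synonyms[words[i].lower()])
    return " ".join(words)

syn = {"photos": ["pictures", "images"], "track": ["record", "log"]}
aug = synonym_aug("track your runs and share photos", syn, pct=0.2)
```

As the next subsection discusses, such word-level substitution preserves the sentence structure almost entirely, which limits how much diversity it can add to app descriptions.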

6.3.1. Effects of Different Loss Functions

Table 3 shows that the use of other loss functions, namely Focal Loss, Supervised Contrastive Loss, and InfoNCE Loss, does not lead to improvements in textual entailment decision on average when permissions are considered individually. In fact, the performance significantly declines when Supervised Contrastive Loss and InfoNCE Loss are employed using their default parameters, while that of Focal Loss is mixed.
Table 2 also shows that when cross-entropy loss (CE) is replaced with Focal Loss (FL) with 10 permissions being considered together, some language models demonstrate slightly better performance on some performance metrics. However, using Supervised Contrastive Loss (SCL) or InfoNCE Loss would result in significantly worse performance for all of the models across all of the considered performance metrics, similar to that observed from Table 3 when permissions are considered individually.
When we compare Table 4, Table 5 and Table 6, where 33 permissions are considered together, we also notice that the performance of the models is very poor when contrastive loss functions, namely SCL or InfoNCE Loss, are used instead of the default cross-entropy loss, irrespective of the language model used. A careful examination of the results reveals that in some of the folds when the two contrastive loss functions are used, the decisions returned by the models are skewed to one side only—i.e., either all the requests are inferred as legitimate or all are illegitimate—but the skewness can also be inconsistent across different folds. Although increasing the batch size from 4 could improve the performance of the models, they are still worse than the other two loss functions, namely, cross-entropy loss and Focal Loss, under the same batch size with the default parameter setup. Meanwhile, the use of default cross-entropy loss and Focal Loss in model training gives acceptable results consistently without any completely skewed answers in any of the folds.
Summary: Cross-entropy loss and Focal Loss yield comparable performance, whereas Supervised Contrastive Loss (SCL) and InfoNCE Loss consistently perform significantly worse.

6.3.2. Effects of Data Augmentation

From Table 7, we can observe that performing data augmentation with relatively simple approaches, namely SMOTE and synonym substitution, does not generally improve textual entailment decisions, and the performance worsens on average under the MCC and P4 metrics. This is because replacing words with their synonyms, as in the SynonymAug of nlpAug, does little to increase diversity in terms of the mobile app descriptions that may be encountered. Working in the feature space, as SMOTE does, also has limited effectiveness because the space of language is of very high dimension.
Summary: Implementing data augmentation using relatively simple approaches yields minimal to no benefits.

6.4. RQ3: How Does Training with Multiple Permission Requests or Utilizing Pretraining with App Descriptions Influence the Performance of the Model?

Table 8 lists the performance of BERT-base when the data of two permissions are used together to fine-tune the model. Different combinations of the two permissions are used—(i) Same category and same majority label; (ii) Unrelated categories and same majority label; and (iii) Same category but different majority labels. A negative value means worse than baseline, while a positive value means better than baseline.

6.4.1. Effects of Training with Multiple Permission Requests

Table 8 reveals that fine-tuning a model on two permissions at once leads to inconsistent performance. Whether the two permission requests come from the same category with the same majority class, from unrelated categories with the same majority class, or from the same category with different majority classes, the combined training does not consistently improve performance on either permission individually.
We also observe from Table 4, Table 5 and Table 6 that supplementing permission descriptions with their details yields inconsistent performance. In certain setups, such as when using InfoNCE Loss, the inclusion of permission description details can deteriorate the model’s overall performance. Conversely, when cross-entropy (CE) loss is employed, the addition of these details may enhance the performance of some models.
Summary: Training a single model with multiple permissions does not consistently lead to improved performance for the relevant permissions. Additionally, the inclusion of detailed permission descriptions produces mixed results.

6.4.2. Effects of Pretraining Models Using Data from the Target Domain

Table 5 and Table 6 also show that pretraining the language model using the app descriptions of the 33 permissions together with masked language modeling, similar to task-adaptive pretraining [9], yields mixed results. Specifically, the performance of DeBERTa declines, particularly when the batch size is small during fine-tuning, while ERNIE shows a slight improvement when compared with the baseline. However, the performance of the pretrained models is consistently worse than that of models fine-tuned without pretraining under the same batch size. Possible reasons for this observation include a lack of sufficient samples to effectively pretrain the model for the context, as well as the general nature of app descriptions, which may render further adaptation unnecessary.
Summary: The performance of pretrained models is consistently worse than those without pretraining.
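For context, masked language modeling constructs pretraining examples roughly as in the sketch below. This is a simplified version: BERT-style masking also keeps some selected tokens unchanged or swaps in random tokens, which is omitted here, and the 15% masking probability is the conventional default rather than a setting reported in this work.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", rng=None):
    """Build one masked-language-modeling example: each token is selected
    with probability mask_prob, replaced by mask_token in the input, and
    kept as the prediction target."""
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)    # the model must reconstruct this token
        else:
            inputs.append(tok)
            labels.append(None)   # no loss is computed at this position
    return inputs, labels

tokens = "this app lets you back up and share your photos".split()
inputs, labels = mask_tokens(tokens)
```

With only tens of thousands of short, generic app descriptions, such a masking objective may not expose the model to enough domain-specific signal to outperform the off-the-shelf checkpoint, consistent with the mixed results above.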

6.5. Overall Observations

From the experiment results, we observed that the various language models yielded comparable performance. We also implemented multiple enhancement strategies proposed in the literature, but their effects were mixed, with many proving ineffective or even detrimental to performance. The highest performance achieved thus far is an AUROC of 0.8221, attained by DeBERTa with cross-entropy as the loss function when training on the 33 permissions together with a batch size of 4.
While an AUROC of 0.8221 indicates a reasonably good ability to discern legitimate permission requests, it still falls short of ideal performance in this context. Specifically, this score corresponds to a true positive rate of 0.7982 and a false positive rate of 0.1540. Consequently, out of 100 illegitimate permission requests made by potentially malicious apps, approximately 15 would be misclassified as legitimate and approved. This highlights the need for further refinement of the model to minimize the false positive rate while maintaining a high true positive rate.
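The arithmetic behind this reading of the operating point follows directly from the rate definitions, sketched here with a legitimate permission request taken as the positive class:

```python
def rates(tp, fp, tn, fn):
    """True and false positive rates from confusion-matrix counts,
    with a legitimate permission request as the positive class."""
    tpr = tp / (tp + fn)   # legitimate requests correctly approved
    fpr = fp / (fp + tn)   # illegitimate requests wrongly approved
    return tpr, fpr

# At the reported operating point (TPR = 0.7982, FPR = 0.1540):
wrongly_approved = round(100 * 0.1540)      # per 100 illegitimate requests
wrongly_denied = round(100 * (1 - 0.7982))  # per 100 legitimate requests
assert wrongly_approved == 15 and wrongly_denied == 20
```

The same arithmetic shows the complementary cost: roughly 20 of every 100 legitimate requests would be denied at this operating point.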

7. Conclusions

In this paper, we presented our preliminary results on quantifying the privacy risk of mobile applications by determining whether a requested permission is necessary based on the app description. We modeled the problem as textual entailment, also known as Natural Language Inference (NLI), a fundamental task in natural language processing (NLP), and used language models to solve it. We highlighted the difficulties encountered with the approach, such as the lack of reliable or balanced training data. We found that even after incorporating various improvements proposed in the literature for NLP tasks, the performance of the trained model remains far from ideal.
In the future, we will consider using more labeled data from more permissions to evaluate the performance of various language models and their refinements, such as different loss functions, model training strategies, and data preparation approaches. Large language models (LLMs) may also be utilized to assist with the task, including using them to explain the implications of specific permissions to aid data labeling and model training; to identify what makes a permission request legitimate given the description; or even to quantify privacy risk directly, with Chain-of-Thought [23], few-shot [24], or even zero-shot [25,26] learning.

Funding

This research was funded by grant UGC/FDS14/E01/21 from the Research Grants Council of the Hong Kong Special Administrative Region, and was partially supported by grant SDSC-SRG049 from the School of Decision Sciences of the Hang Seng University of Hong Kong.

Data Availability Statement

The dataset can be accessed through https://github.com/chrismahsu/AndroidApps (accessed on 11 July 2025).

Acknowledgments

The author would like to thank research assistants Andrew LEE and Barry CHENG for their discussions and assistance in preparing some of the script files, cleansing the dataset, and labeling the sample data. The author would also like to express gratitude to student helpers Stephanie SO, Katy LAM, and Chiu Faat LEUNG for their work on data labeling. During the preparation of this manuscript, the author utilized PoE for text editing purposes.

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A

Table A1. Number of labeled permission samples in each class for each permission.
ID | Entail | Not Entail | ID | Entail | Not Entail | ID | Entail | Not Entail
19348914925072526568
2375214151394602627571
5559271658416275855
656920171674322918566
743814018166434301590
83532461912347631137457
957921203881973243541
1082518215891133130459
11237363224691313442555
123822182328231135134448
1326574241734203857318
Table A2. Average F1 and F0.5 scores, P4, MCC, and AUROC from 5-fold cross-validation of BERT with cross-entropy as the loss function for the 33 permissions individually, as the BERT baseline performance. Values are obtained after 5 epochs per fold, batch size is 4, and a simple permission description is used (column Description in Table A4).
ID | F1 Score | F0.5 Score | P4 | MCC | AUROC
1 | 0.6130 | 0.6606 | 0.7395 | 0.5752 | 0.7705
2 | 0.8246 | 0.8212 | 0.7452 | 0.5325 | 0.7530
5 | 0.9792 | 0.9767 | 0.6892 | 0.5470 | 0.7535
6 | 0.9884 | 0.9847 | 0.7282 | 0.6030 | 0.7474
7 | 0.8859 | 0.8755 | 0.7210 | 0.5123 | 0.7476
8 | 0.8116 | 0.8012 | 0.7612 | 0.5348 | 0.7635
9 | 0.9846 | 0.9806 | 0.5698 | 0.4749 | 0.7124
10 | 0.5558 | 0.5616 | 0.6901 | 0.5067 | 0.7565
11 | 0.8504 | 0.8531 | 0.8759 | 0.7595 | 0.8780
12 | 0.7680 | 0.7830 | 0.6787 | 0.4101 | 0.7053
13 | 0.3433 | 0.4009 | 0.4156 | 0.3626 | 0.6489
14 | 0.4892 | 0.5088 | 0.6266 | 0.4397 | 0.7162
15 | 0.7227 | 0.7327 | 0.8082 | 0.6463 | 0.8197
16 | 0.9846 | 0.9836 | 0.4611 | 0.3446 | 0.6515
17 | 0.5787 | 0.6434 | 0.6906 | 0.4790 | 0.7140
18 | 0.8003 | 0.8117 | 0.8580 | 0.7345 | 0.8637
19 | 0.7056 | 0.7299 | 0.7996 | 0.6492 | 0.8143
20 | 0.8887 | 0.8657 | 0.8082 | 0.6409 | 0.8032
21 | 0.9898 | 0.9849 | 0.0000 | −0.0017 | 0.5992
22 | 0.8832 | 0.8669 | 0.6163 | 0.3803 | 0.6728
23 | 0.8661 | 0.8530 | 0.8666 | 0.7445 | 0.8701
24 | 0.7525 | 0.7507 | 0.8159 | 0.6616 | 0.8314
25 | 0.4530 | 0.5106 | 0.5218 | 0.4656 | 0.7003
26 | 0.2043 | 0.1987 | 0.2671 | 0.2094 | 0.6282
27 | 0.9956 | 0.9931 | 0.0000 | 0.0000 | 0.8000
29 | 0.0625 | 0.0543 | 0.0995 | 0.0426 | 0.5289
30 | 0.4667 | 0.5762 | 0.5737 | 0.4648 | 0.6778
31 | 0.8612 | 0.8416 | 0.9065 | 0.8226 | 0.9232
32 | 0.4729 | 0.4941 | 0.5861 | 0.4549 | 0.7155
33 | 0.9290 | 0.9246 | 0.9530 | 0.9086 | 0.9568
34 | 0.5864 | 0.6977 | 0.7273 | 0.6043 | 0.7407
35 | 0.7720 | 0.7794 | 0.8436 | 0.7121 | 0.8546
38 | 0.9911 | 0.9889 | 0.8310 | 0.7287 | 0.8240
avg. | 0.7291 | 0.7421 | 0.6447 | 0.5137 | 0.7558
stdev. | 0.2425 | 0.2291 | 0.2462 | 0.2191 | 0.0936
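For reference, the metrics reported in these appendix tables can be computed directly from confusion-matrix counts, as sketched below using the standard definitions of F-beta and MCC, and P4 as the harmonic mean of precision, recall, specificity, and negative predictive value. The numeric counts in the example are illustrative only and not taken from the experiments.

```python
import math

def f_beta(tp, fp, fn, beta=1.0):
    """F_beta score: weighted harmonic mean of precision and recall."""
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return (1 + beta**2) * prec * rec / (beta**2 * prec + rec)

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient over all four confusion-matrix cells."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

def p4(tp, fp, tn, fn):
    """P4: harmonic mean of precision, recall, specificity, and NPV."""
    num = 4 * tp * tn
    return num / (num + (tp + tn) * (fp + fn)) if num else 0.0

# Illustrative counts (not from the paper's experiments):
scores = (f_beta(40, 10, 10), f_beta(40, 10, 10, beta=0.5),
          mcc(40, 10, 40, 10), p4(40, 10, 40, 10))
```

Unlike F1, both MCC and P4 involve the true negatives, which is why heavily imbalanced permissions can show a near-perfect F1 alongside a P4 of zero, as in rows 21 and 27.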
Figure A1. Fraction of legitimate permission requests, sorted in ascending order.
Table A3. Average F1 and F0.5 scores, P4, MCC, and AUROC from 5-fold cross-validation of DeBERTa with cross-entropy as the loss function for the 33 permissions individually, as the DeBERTa baseline performance. Values are obtained after 5 epochs per fold, batch size is 4, and a simple permission description is used (column Description in Table A4).
ID | F1 Score | F0.5 Score | P4 | MCC | AUROC
1 | 0.7043 | 0.7880 | 0.8057 | 0.6957 | 0.8074
2 | 0.8473 | 0.8414 | 0.7706 | 0.5754 | 0.7772
5 | 0.9830 | 0.9763 | 0.5766 | 0.4880 | 0.7006
6 | 0.9867 | 0.9810 | 0.5498 | 0.4632 | 0.6732
7 | 0.8844 | 0.8723 | 0.6726 | 0.4846 | 0.7156
8 | 0.8148 | 0.8149 | 0.7704 | 0.5550 | 0.7745
9 | 0.9828 | 0.9789 | 0.5670 | 0.4600 | 0.6866
10 | 0.5091 | 0.5918 | 0.6525 | 0.4889 | 0.7058
11 | 0.9026 | 0.8990 | 0.9192 | 0.8416 | 0.9222
12 | 0.8383 | 0.8280 | 0.7627 | 0.5511 | 0.7709
13 | 0.0000 | 0.0000 | 0.0000 | −0.0046 | 0.4991
14 | 0.7101 | 0.6822 | 0.8093 | 0.6613 | 0.8530
15 | 0.7278 | 0.7567 | 0.8136 | 0.6616 | 0.8181
16 | 0.9863 | 0.9863 | 0.5593 | 0.4409 | 0.7015
17 | 0.7832 | 0.7825 | 0.8440 | 0.7082 | 0.8568
18 | 0.8039 | 0.8413 | 0.8622 | 0.7477 | 0.8575
19 | 0.7600 | 0.8327 | 0.8428 | 0.7255 | 0.8248
20 | 0.8409 | 0.8386 | 0.7410 | 0.5235 | 0.7572
21 | 0.9915 | 0.9875 | 0.2134 | 0.1867 | 0.6442
22 | 0.8910 | 0.8673 | 0.6372 | 0.4263 | 0.6733
23 | 0.8843 | 0.8994 | 0.8922 | 0.7893 | 0.8929
24 | 0.8035 | 0.8494 | 0.8610 | 0.7465 | 0.8538
25 | 0.3954 | 0.4186 | 0.5104 | 0.4109 | 0.7135
26 | 0.4584 | 0.5251 | 0.5643 | 0.4759 | 0.7012
27 | 0.9939 | 0.9923 | 0.0000 | −0.0025 | 0.7983
29 | 0.0750 | 0.0577 | 0.1077 | 0.0759 | 0.5635
30 | 0.5314 | 0.5304 | 0.5543 | 0.5208 | 0.7625
31 | 0.8713 | 0.8777 | 0.9147 | 0.8356 | 0.9145
32 | 0.3910 | 0.4190 | 0.5138 | 0.4098 | 0.7182
33 | 0.9049 | 0.9182 | 0.9373 | 0.8800 | 0.9330
34 | 0.4947 | 0.5548 | 0.6100 | 0.4859 | 0.7088
35 | 0.6880 | 0.7660 | 0.7744 | 0.6608 | 0.7926
38 | 0.9884 | 0.9869 | 0.4809 | 0.4049 | 0.6888
avg. | 0.7402 | 0.7558 | 0.6391 | 0.5265 | 0.7594
stdev. | 0.2552 | 0.2489 | 0.2520 | 0.2238 | 0.0991
Table A4. Details of the first 33 permissions retrieved from the Google Play Store.
ID | Category | Description (as in Play Store) | Permission Name (as in AndroidManifest.xml) | No. of Apps Requested 1 | Permission Description Details (as in https://developer.android.com/reference/android/Manifest.permission) 2
1 | Device and app history | retrieve running apps | android.permission.GET_TASKS | 8535 | Allows the app to retrieve information about currently and recently running tasks. This may allow the app to discover information about which applications are used on the device.
2 | Wi-Fi connection information | view Wi-Fi connections | android.permission.ACCESS_WIFI_STATE | 101,727 | Allows the app to view information about Wi-Fi networking, such as whether Wi-Fi is enabled and the name of the connected Wi-Fi devices.
5 | Storage | read the contents of your USB storage | android.permission.READ_EXTERNAL_STORAGE | 135,861 | Allows the app to test a permission for the SD card that will be available on future devices.
6 | Storage | modify or delete the contents of your USB storage | android.permission.WRITE_EXTERNAL_STORAGE | 127,762 | Allows the app to write to the SD card.
7 | Other | receive data from Internet | com.google.android.c2dm.permission.RECEIVE | 120,229 | Allows apps to accept cloud to device messages sent by the app’s service. Using this service will incur data usage. Malicious apps could cause excess data usage.
8 | Other | full network access | android.permission.INTERNET | 207,864 | Allows the app to create network sockets and use custom network protocols. The browser and other applications provide means to send data to the internet, so this permission is not required to send data to the internet.
9 | Other | prevent device from sleeping | android.permission.WAKE_LOCK | 172,854 | Allows the app to prevent the phone from going to sleep.
10 | Other | disable your screen lock | android.permission.DISABLE_KEYGUARD | 3843 | Allows the app to disable the keylock and any associated password security. For example, the phone disables the keylock when receiving an incoming phone call, then re-enables the keylock when the call is finished.
11 | Other | control vibration | android.permission.VIBRATE | 97,442 | Allows the app to control the vibrator.
12 | Other | view network connections | android.permission.ACCESS_NETWORK_STATE | 198,058 | Allows the app to view information about network connections such as which networks exist and are connected.
13 | Other | run at startup | android.permission.RECEIVE_BOOT_COMPLETED | 116,194 | Allows the app to have itself started as soon as the system has finished booting. This can make it take longer to start the phone and allow the app to slow down the overall phone by always running.
14 | Microphone | record audio | android.permission.RECORD_AUDIO | 35,595 | Allows the app to record audio with the microphone. This permission allows the app to record audio at any time without your confirmation.
15 | Location | approximate location (network-based) | android.permission.ACCESS_COARSE_LOCATION | 57,838 | Allows the app to obtain your approximate location. This location is derived by location services using network location sources such as cell towers and Wi-Fi. These location services must be turned on and available to your device for the app to use them. Apps may use this to determine approximately where you are.
16 | Identity | add or remove accounts | android.permission.MANAGE_ACCOUNTS | 3259 | Allows the app to perform operations like adding and removing accounts, and deleting their password.
17 | Camera | take pictures and videos | android.permission.CAMERA | 67,875 | Allows the app to take pictures and videos with the camera. This permission allows the app to use the camera at any time without your confirmation.
18 | Other | close other apps | android.permission.RESTART_PACKAGES, android.permission.KILL_BACKGROUND_PROCESSES | 2838 | Allows the app to end background processes of other apps. This may cause other apps to stop running.
19 | Other | pair with Bluetooth devices | android.permission.BLUETOOTH | 29,571 | Allows the app to view the configuration of the Bluetooth on the phone, and to make and accept connections with paired devices.
20 | Other | change network connectivity | android.permission.CHANGE_NETWORK_STATE | 12,018 | Allows the app to change the state of network connectivity.
21 | Other | use accounts on the device | android.permission.USE_CREDENTIALS | 5801 | Allows the app to request authentication tokens.
22 | Other | connect and disconnect from Wi-Fi | android.permission.CHANGE_WIFI_STATE | 13,672 | Allows the app to connect to and disconnect from Wi-Fi access points and to make changes to device configuration for Wi-Fi networks.
23 | Other | change your audio settings | android.permission.MODIFY_AUDIO_SETTINGS | 25,244 | Allows the app to modify global audio settings such as volume and which speaker is used for output.
24 | Other | access Bluetooth settings | android.permission.BLUETOOTH_ADMIN | 14,764 | Allows the app to configure the local Bluetooth phone, and to discover and pair with remote devices.
25 | Other | manage document storage | android.permission.MANAGE_DOCUMENTS | 1371 | Allows the app to manage document storage.
26 | Other | read Google service configuration | com.google.android.providers.gsf.permission.READ_GSERVICES | 17,798 | Allows this app to read Google service configuration data.
27 | Other | Google Play license check | com.android.vending.CHECK_LICENSE | 15,424 | Market license check
29 | Device ID & call information | read phone status and identity | android.permission.READ_PHONE_STATE | 41,888 | Allows the app to access the phone features of the device. This permission allows the app to determine the phone number and device IDs, whether a call is active, and the remote number connected by a call.
30 | Other | SmartcardService Permission label | org.simalliance.openmobileapi.SMARTCARD | 102 | Enables Android applications to communicate with Secure Elements, e.g. SIM card, embedded Secure Elements, Mobile Security Card or others.
31 | Other | read battery statistics | android.permission.BATTERY_STATS | 630 | Allows an application to read the current low-level battery use data. May allow the application to find out detailed information about which apps you use.
32 | Other | draw over other apps | android.permission.SYSTEM_ALERT_WINDOW | 23,800 | Allows the app to draw on top of other applications or parts of the user interface. They may interfere with your use of the interface in any application, or change what you think you are seeing in other applications.
33 | Other | modify system settings | android.permission.WRITE_SETTINGS | 9642 | Allows the app to modify the system’s settings data. Malicious apps may corrupt your system’s configuration.
34 | Other | send sticky broadcast | android.permission.BROADCAST_STICKY | 3112 | Allows the app to send sticky broadcasts, which remain after the broadcast ends. Excessive use may make the phone slow or unstable by causing it to use too much memory.
35 | Other | allow Wi-Fi Multicast reception | android.permission.CHANGE_WIFI_MULTICAST_STATE | 6350 | Allows the app to receive packets sent to all devices on a Wi-Fi network using multicast addresses, not just your phone. It uses more power than the non-multicast mode.
38 | Contacts | find accounts on the device | android.permission.GET_ACCOUNTS | 16,073 | Allows the app to obtain the list of accounts known by the phone. This may include any accounts created by applications you have installed.
1 out of the 216,711 collected apps; 2 accessed on 30 May 2023.

References

1. Dagan, I.; Glickman, O.; Magnini, B. The PASCAL recognising textual entailment challenge. In Proceedings of the Machine Learning Challenges Workshop; Springer: Berlin/Heidelberg, Germany, 2005; pp. 177–190.
2. Dagan, I.; Roth, D.; Zanzotto, F.; Sammons, M. Recognizing Textual Entailment: Models and Applications; Springer Nature: Berlin/Heidelberg, Germany, 2022.
3. Felt, A.P.; Finifter, M.; Chin, E.; Hanna, S.; Wagner, D. A survey of mobile malware in the wild. In Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices, Chicago, IL, USA, 17 October 2011; pp. 3–14.
4. Lin, J.; Amini, S.; Hong, J.I.; Sadeh, N.; Lindqvist, J.; Zhang, J. Expectation and purpose: Understanding users’ mental models of mobile app privacy through crowdsourcing. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing, Pittsburgh, PA, USA, 5–8 September 2012; pp. 501–510.
5. Viennot, N.; Garcia, E.; Nieh, J. A measurement study of Google Play. In Proceedings of the 2014 ACM International Conference on Measurement and Modeling of Computer Systems, Austin, TX, USA, 16–20 June 2014; pp. 221–233.
6. Feng, Y.; Chen, L.; Zheng, A.; Gao, C.; Zheng, Z. AC-Net: Assessing the consistency of description and permission in Android apps. IEEE Access 2019, 7, 57829–57842.
7. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, p. 2.
8. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998.
9. Gururangan, S.; Marasović, A.; Swayamdipta, S.; Lo, K.; Beltagy, I.; Downey, D.; Smith, N.A. Don’t stop pretraining: Adapt language models to domains and tasks. arXiv 2020, arXiv:2004.10964.
10. Liu, Y. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692.
11. He, P.; Liu, X.; Gao, J.; Chen, W. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv 2020, arXiv:2006.03654.
12. Zhang, Z.; Han, X.; Liu, Z.; Jiang, X.; Sun, M.; Liu, Q. ERNIE: Enhanced language representation with informative entities. arXiv 2019, arXiv:1905.07129.
13. Zhou, J.; You, C.; Li, X.; Liu, K.; Liu, S.; Qu, Q.; Zhu, Z. Are all losses created equal: A neural collapse perspective. arXiv 2022, arXiv:2210.02192.
14. Gunel, B.; Du, J.; Conneau, A.; Stoyanov, V. Supervised contrastive learning for pre-trained language model fine-tuning. arXiv 2020, arXiv:2011.01403.
15. Oord, A.v.d.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748.
16. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 2020, 33, 18661–18673.
17. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2999–3007.
18. Li, X.; Sun, X.; Meng, Y.; Liang, J.; Wu, F.; Li, J. Dice loss for data-imbalanced NLP tasks. arXiv 2019, arXiv:1911.02855.
19. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
20. Ma, E.K. nlpaug: Data Augmentation for NLP. Available online: https://github.com/makcedward/nlpaug (accessed on 10 February 2025).
21. He, P.; Gao, J.; Chen, W. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. arXiv 2021, arXiv:2111.09543.
22. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
23. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837.
24. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165.
25. Wei, J.; Bosma, M.; Zhao, V.Y.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned language models are zero-shot learners. arXiv 2021, arXiv:2109.01652.
26. Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large language models are zero-shot reasoners. Adv. Neural Inf. Process. Syst. 2022, 35, 22199–22213.
Figure 1. (a) F1 score; (b) F0.5 score; (c) P4; (d) MCC; (e) AUROC. F1 and F0.5 scores, P4, MCC, and AUROC from 5-fold cross-validation of BERT with cross-entropy as the loss function for the 33 permissions individually (as BERT baseline performance). Values are obtained after 5 epochs per fold, batch size is 4, and a simple permission description is used (column Description in Table A4). Permissions are sorted in ascending order of the fraction of legitimate requests.
Figure 2. (a) F1 score; (b) F0.5 score; (c) P4; (d) MCC; (e) AUROC. F1 and F0.5 scores, P4, MCC, and AUROC from 5-fold cross-validation of DeBERTa with cross-entropy as the loss function for the 33 permissions individually (as DeBERTa baseline performance). Values are obtained after 5 epochs per fold, batch size is 4, and a simple permission description is used (column Description in Table A4). Permissions are sorted in ascending order of the fraction of legitimate requests.
Table 1. Agreement among the four data annotators in the initial round, measured using Light’s Kappa.
ID | Value | ID | Value | ID | Value | ID | Value
1 | 0.6862 | 12 | 0.6519 | 21 | 0.0000 | 31 | 0.5671
2 | 0.7265 | 13 | 0.4787 | 22 | 0.1589 | 32 | 0.2334
5 | 0.0000 | 14 | 0.7159 | 23 | 0.6813 | 33 | 0.4679
6 | 0.5931 | 15 | 0.6398 | 24 | 0.5899 | 34 | 0.0848
7 | 0.7439 | 16 | 0.1286 | 25 | 0.2000 | 35 | 0.4326
8 | 0.6000 | 17 | 0.6390 | 26 | 0.6667 | 38 | 0.0889
9 | 0.3671 | 18 | 0.7963 | 27 | 0.0000 | |
10 | 0.2693 | 19 | 0.6561 | 29 | 0.2231 | |
11 | 0.6217 | 20 | 0.3473 | 30 | 0.4211 | |
Table 2. Average metric values from 5-fold cross-validation for the performance of BERT, RoBERTa, DeBERTa, and ERNIE with various loss functions compared with the BERT baseline, trained and tested using 10 permissions together. Five epochs per fold and batch size = 4, and using a simple permission description. Negative values mean worse than baseline (in red), and positive values mean better than baseline.
Model | P4 | MCC | AUROC
BERT + CE (Cross-Entropy Loss) | 0.0099 | 0.0451 | 0.0213
BERT + SCL (Supervised Contrastive Loss) | −0.6063 | −0.4790 | −0.2397
BERT + InfoNCE (Contrastive Loss) | −0.6824 | −0.6372 | −0.2882
BERT + FL (Focal Loss) | −0.0031 | 0.0432 | 0.0095
RoBERTa + CE (Cross-Entropy Loss) | 0.0407 | 0.0705 | 0.0282
RoBERTa + SCL (Supervised Contrastive Loss) | −0.6970 | −0.5294 | −0.2483
RoBERTa + InfoNCE (Contrastive Loss) | −0.6928 | −0.5611 | −0.2592
RoBERTa + FL (Focal Loss) | −0.0184 | 0.0327 | 0.0016
DeBERTa + CE (Cross-Entropy Loss) | 0.0831 | 0.1258 | 0.0672
DeBERTa + SCL (Supervised Contrastive Loss) | −0.6970 | −0.5294 | −0.2483
DeBERTa + InfoNCE (Contrastive Loss) | −0.5894 | −0.4479 | −0.2141
DeBERTa + FL (Focal Loss) | 0.0403 | 0.1012 | 0.0480
ERNIE + CE (Cross-Entropy Loss) | 0.0424 | 0.0769 | 0.0340
ERNIE + SCL (Supervised Contrastive Loss) | −0.6799 | −0.6372 | −0.2949
ERNIE + InfoNCE (Contrastive Loss) | −0.6970 | −0.5294 | −0.2483
ERNIE + FL (Focal Loss) | −0.0153 | 0.0463 | 0.0194
Table 3. Average P4 and MCC from 5-fold cross-validation of BERT compared with the BERT baseline, using different loss functions for the 33 permissions individually. Values are obtained after 5 epochs per fold, batch size is 4, and a simple permission description is used. Negative values mean worse than baseline (in red), and positive values mean better than baseline.
P4 | | | | MCC | | |
ID | Focal | SCL | InfoNCE | ID | Focal | SCL | InfoNCE
1 | −0.0759 | −0.6991 | −0.6398 | 1 | −0.0541 | −0.6590 | −0.5308
2 | −0.0347 | −0.7452 | −0.6586 | 2 | −0.0646 | −0.5325 | −0.6421
5 | 0.0198 | −0.6892 | −0.5546 | 5 | 0.0381 | −0.5470 | −0.6031
6 | 0.0479 | −0.7282 | −0.5881 | 6 | 0.0598 | −0.6030 | −0.4833
7 | −0.0316 | −0.7126 | −0.6797 | 7 | −0.0394 | −0.5283 | −0.4834
8 | −0.0275 | −0.7612 | −0.7612 | 8 | −0.0544 | −0.5348 | −0.5348
9 | 0.1136 | −0.4814 | −0.5698 | 9 | 0.0836 | −0.4214 | −0.4749
10 | 0.0015 | −0.6388 | −0.6474 | 10 | −0.0040 | −0.5432 | −0.5699
11 | −0.0119 | −0.8759 | −0.8759 | 11 | −0.0266 | −0.8691 | −0.7595
12 | −0.0024 | −0.6787 | −0.6787 | 12 | 0.0125 | −0.4101 | −0.4619
13 | −0.0716 | −0.4156 | −0.3959 | 13 | −0.0975 | −0.3626 | −0.3628
14 | 0.0688 | −0.5888 | −0.6266 | 14 | 0.0786 | −0.3961 | −0.4397
15 | −0.0032 | −0.8082 | −0.6288 | 15 | 0.0137 | −0.6463 | −0.5279
16 | −0.0845 | −0.4611 | −0.3323 | 16 | −0.0500 | −0.3446 | −0.2564
17 | 0.0593 | −0.6350 | −0.6906 | 17 | 0.0620 | −0.5327 | −0.4790
18 | 0.0308 | −0.7791 | −0.6815 | 18 | 0.0503 | −0.9910 | −0.6563
19 | 0.0035 | −0.7996 | −0.5143 | 19 | −0.0100 | −0.6492 | −0.4310
20 | −0.0446 | −0.6560 | −0.8082 | 20 | −0.0770 | −0.5326 | −0.7291
21 | 0.1423 | 0.1106 | 0.0000 | 21 | 0.0987 | 0.0580 | −0.0862
22 | −0.0661 | −0.5829 | −0.6163 | 22 | −0.0180 | −0.4139 | −0.3803
23 | −0.0080 | −0.6674 | −0.8513 | 23 | −0.0096 | −0.8223 | −1.0030
24 | 0.0171 | −0.7652 | −0.8159 | 24 | 0.0316 | −0.8270 | −0.6616
25 | −0.3489 | −0.5218 | −0.5218 | 25 | −0.3139 | −0.4656 | −0.4656
26 | −0.0726 | −0.2671 | −0.2671 | 26 | −0.0580 | −0.2094 | −0.2094
27 | 0.0000 | 0.1683 | 0.0000 | 27 | −0.0017 | 0.0965 | 0.0000
29 | −0.0995 | −0.0995 | 0.0250 | 29 | −0.0488 | −0.0529 | 0.0453
30 | −0.2804 | −0.4430 | −0.4515 | 30 | −0.2467 | −0.3759 | −0.4019
31 | −0.0142 | −0.9065 | −0.7796 | 31 | −0.0175 | −0.8226 | −0.8298
32 | −0.1424 | −0.5861 | −0.5350 | 32 | −0.1029 | −0.4549 | −0.4254
33 | −0.0361 | −0.9530 | −0.9530 | 33 | −0.0645 | −0.9086 | −0.9949
34 | 0.0037 | −0.7110 | −0.7091 | 34 | −0.0152 | −0.5941 | −0.5947
35 | 0.0168 | −0.8283 | −0.6839 | 35 | 0.0296 | −0.8729 | −0.5803
38 | −0.1605 | −0.8276 | −0.5986 | 38 | −0.1261 | −0.8269 | −0.5426
avg. | −0.0331 | −0.6071 | −0.5785 | avg. | −0.0285 | −0.5332 | −0.5017
stdev. | 0.0969 | 0.2633 | 0.2391 | stdev. | 0.0860 | 0.2629 | 0.2343
Table 4. Average metric values from 5-fold cross-validation for the performance of RoBERTa with various refinements, trained and tested using 33 permissions together, compared with BERT baseline. Five epochs per fold. Negative values mean worse than baseline (in red), and positive values mean better than baseline.
Model | P4 | MCC | AUROC
RoBERTa + CE (Cross-Entropy Loss) (batch size = 4) | 0.0459 | 0.0513 | 0.0238
RoBERTa + CE + Detailed Permission Description (batch size = 4) | 0.0532 | 0.0577 | 0.0305
RoBERTa + SCL (Supervised Contrastive Loss) (batch size = 4) | −0.6447 | −0.5137 | −0.2520
RoBERTa + SCL (Supervised Contrastive Loss) (batch size = 24) | −0.5748 | −0.4961 | −0.2755
RoBERTa + InfoNCE Loss (batch size = 4) | −0.6447 | −0.5137 | −0.2445
RoBERTa + InfoNCE Loss (batch size = 24) | −0.4431 | −0.4432 | −0.1592
RoBERTa + InfoNCE Loss + Detailed Permission Description (batch size = 24) | −0.5075 | −0.4050 | −0.1797
RoBERTa + FL (Focal Loss) (batch size = 4) | 0.0396 | 0.0428 | 0.0211
RoBERTa + FL + Detailed Permission Description (batch size = 4) | 0.0509 | 0.0547 | 0.0282
Table 5. Average metric values from 5-fold cross-validation for the performance of DeBERTa V3 with various refinements, trained and tested using 33 permissions together, compared with BERT baseline. Five epochs per fold. Negative values mean worse than baseline (in red), and positive values mean better than baseline.
Model | P4 | MCC | AUROC
DeBERTa + CE (Cross-Entropy Loss) (batch size = 4) | 0.1088 | 0.1186 | 0.0663
DeBERTa + CE + Detailed Permission Description (batch size = 4) | 0.0632 | 0.0837 | 0.0377
DeBERTa + CE (Cross-Entropy Loss) (batch size = 16) | 0.0753 | 0.0932 | 0.0472
DeBERTa + CE + Detailed Permission Description (batch size = 16) | 0.0680 | 0.0781 | 0.0359
DeBERTa + SCL (Supervised Contrastive Loss) (batch size = 4) | −0.5576 | −0.4464 | −0.2128
DeBERTa + SCL (Supervised Contrastive Loss) (batch size = 16) | −0.6447 | −0.5137 | −0.2498
DeBERTa + InfoNCE Loss (batch size = 4) | −0.5642 | −0.4519 | −0.2084
DeBERTa + InfoNCE Loss (batch size = 16) | −0.4863 | −0.4842 | −0.2244
DeBERTa + InfoNCE Loss + Detailed Permission Description (batch size = 16) | −0.6350 | −0.6241 | −0.2982
DeBERTa + FL (Focal Loss) (batch size = 4) | 0.0922 | 0.1016 | 0.0564
DeBERTa + FL + Detailed Permission Description (batch size = 4) | 0.0761 | 0.0928 | 0.0439
Pretrained DeBERTa + CE (batch size = 4) | −0.2697 | −0.2375 | −0.1201
Pretrained DeBERTa + CE (batch size = 8) | −0.0835 | −0.0987 | −0.0585
Table 6. Average metric values from 5-fold cross-validation for the performance of ERNIE with various refinements, trained and tested using all 33 permissions together, compared with the BERT baseline (5 epochs per fold). Negative values indicate worse performance than the baseline, and positive values indicate better performance.
| Model | P4 | MCC | AUROC |
|---|---|---|---|
| ERNIE + CE (Cross-Entropy Loss) (batch size = 4) | 0.0393 | 0.0439 | 0.0232 |
| ERNIE + CE + Detailed Permission Description (batch size = 4) | 0.0380 | 0.0469 | 0.0228 |
| ERNIE + CE (Cross-Entropy Loss) (batch size = 24) | 0.0482 | 0.0520 | 0.0175 |
| ERNIE + CE + Detailed Permission Description (batch size = 24) | 0.0511 | 0.0516 | 0.0247 |
| ERNIE + SCL (Supervised Contrastive Loss) (batch size = 4) | −0.6447 | −0.5137 | −0.2528 |
| ERNIE + SCL (Supervised Contrastive Loss) (batch size = 24) | −0.5201 | −0.4147 | −0.1984 |
| ERNIE + InfoNCE Loss (batch size = 4) | −0.5020 | −0.5220 | −0.2544 |
| ERNIE + InfoNCE Loss (batch size = 24) | −0.5003 | −0.5828 | −0.2915 |
| ERNIE + InfoNCE Loss + Detailed Permission Description (batch size = 24) | −0.6373 | −0.6404 | −0.3116 |
| ERNIE + FL (Focal Loss) (batch size = 4) | 0.0612 | 0.0658 | 0.0328 |
| ERNIE + FL + Detailed Permission Description (batch size = 4) | 0.0490 | 0.0475 | 0.0257 |
| Pretrained ERNIE + CE (batch size = 4) | −0.0166 | −0.0103 | −0.0087 |
| Pretrained ERNIE + CE (batch size = 24) | 0.0211 | 0.0251 | 0.0068 |
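Tables 4–6 also compare loss functions. Focal loss (FL) reshapes cross-entropy so that well-classified examples contribute less, concentrating training on hard cases; the following is a minimal per-example sketch for binary labels, with illustrative γ and α defaults rather than the paper's implementation:

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    # p: predicted probability of the positive class; y: gold label (0 or 1).
    # The factor (1 - p_t)^gamma shrinks the loss of confident, correct
    # predictions; with gamma = 0 and alpha = 1 it reduces to cross-entropy.
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

An easy positive (p = 0.9) thus incurs a much smaller loss than a harder one (p = 0.6), which is the intended effect on imbalanced permission labels.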
Table 7. Average P4 and MCC from 5-fold cross-validation of BERT compared with the BERT baseline, using different data augmentation approaches, with cross-entropy as the loss function, for the 33 permissions individually. Values are obtained after 5 epochs per fold, with batch size 4 and a simple permission description. Negative values indicate worse performance than the baseline, and positive values indicate better performance.
| ID | SMOTE (P4) | SynAug 0.5% (P4) | SynAug 2% (P4) | SynAug 4% (P4) | SMOTE (MCC) | SynAug 0.5% (MCC) | SynAug 2% (MCC) | SynAug 4% (MCC) |
|---|---|---|---|---|---|---|---|---|
| 1 | −0.0114 | −0.0747 | −0.1188 | −0.2298 | −0.0149 | −0.0715 | −0.1339 | −0.1796 |
| 2 | 0.0003 | N/A | N/A | N/A | 0.0025 | N/A | N/A | N/A |
| 5 | −0.1088 | 0.1185 | 0.0188 | 0.0995 | −0.0759 | 0.1515 | 0.0495 | 0.1271 |
| 6 | 0.0516 | 0.0479 | −0.0123 | 0.0471 | 0.0748 | 0.0598 | −0.0179 | 0.0717 |
| 7 | −0.0065 | −0.0843 | −0.0836 | −0.0369 | −0.0022 | −0.0518 | −0.0615 | −0.0396 |
| 8 | −0.0079 | N/A | N/A | N/A | 0.0022 | N/A | N/A | N/A |
| 9 | −0.1167 | 0.1037 | −0.0095 | 0.0599 | −0.1479 | 0.0791 | −0.0355 | 0.0343 |
| 10 | −0.0240 | −0.1059 | −0.0459 | −0.0865 | −0.0188 | −0.0757 | 0.0067 | −0.0342 |
| 11 | 0.0226 | N/A | N/A | N/A | 0.0412 | N/A | N/A | N/A |
| 12 | 0.0128 | N/A | N/A | N/A | 0.0649 | N/A | N/A | N/A |
| 13 | −0.1098 | −0.1100 | −0.0257 | −0.4156 | −0.1110 | −0.1285 | −0.0413 | −0.3626 |
| 14 | 0.0708 | −0.0162 | −0.1024 | −0.0156 | 0.0674 | −0.0060 | −0.0833 | −0.0032 |
| 15 | −0.0110 | −0.0340 | 0.0092 | −0.0059 | −0.0142 | 0.0043 | 0.0227 | 0.0048 |
| 16 | −0.0776 | 0.1245 | 0.0979 | 0.0980 | −0.0568 | 0.1152 | 0.0793 | 0.0980 |
| 17 | 0.0631 | −0.0933 | −0.0213 | −0.0060 | 0.0727 | −0.0397 | −0.0031 | 0.0084 |
| 18 | 0.0189 | 0.0232 | 0.0169 | 0.0458 | 0.0373 | 0.0411 | 0.0295 | 0.0802 |
| 19 | 0.0321 | −0.0899 | −0.0715 | −0.0542 | 0.0479 | −0.1171 | −0.1071 | −0.0493 |
| 20 | −0.0431 | N/A | N/A | N/A | −0.0656 | N/A | N/A | N/A |
| 21 | 0.3745 | 0.2996 | 0.0996 | 0.2996 | 0.2896 | 0.2896 | 0.0896 | 0.2896 |
| 22 | 0.0628 | −0.0026 | −0.0929 | −0.1442 | 0.0717 | −0.0141 | −0.0672 | −0.0742 |
| 23 | −0.0265 | N/A | N/A | N/A | −0.0549 | N/A | N/A | N/A |
| 24 | −0.0947 | −0.0038 | −0.0361 | −0.0955 | −0.1061 | −0.0027 | −0.0414 | −0.0980 |
| 25 | −0.2221 | −0.1411 | −0.2563 | −0.4688 | −0.1984 | −0.1583 | −0.2550 | −0.4492 |
| 26 | −0.0989 | −0.2671 | −0.0565 | 0.0431 | −0.0851 | −0.2094 | −0.0445 | 0.0286 |
| 27 | N/A | 0.0000 | 0.0000 | 0.0000 | N/A | −0.0017 | −0.0017 | −0.0017 |
| 29 | 0.0331 | −0.0995 | −0.0995 | −0.0995 | 0.0290 | −0.0426 | −0.0426 | −0.0470 |
| 30 | −0.2204 | −0.0847 | −0.1497 | −0.1049 | −0.1924 | −0.0179 | −0.0958 | −0.0542 |
| 31 | 0.0008 | −0.0321 | −0.0162 | −0.0160 | 0.0056 | −0.0457 | −0.0259 | −0.0270 |
| 32 | −0.0267 | −0.1066 | −0.0768 | −0.0783 | −0.0133 | −0.0873 | −0.0859 | −0.0767 |
| 33 | −0.0044 | −0.0303 | 0.0006 | −0.0084 | −0.0054 | −0.0452 | 0.0025 | −0.0131 |
| 34 | −0.0096 | −0.0431 | −0.1448 | −0.0315 | −0.0310 | −0.0766 | −0.1367 | −0.0426 |
| 35 | 0.0138 | 0.0067 | 0.0183 | 0.0151 | 0.0307 | 0.0145 | 0.0339 | 0.0430 |
| 38 | −0.1666 | 0.0096 | −0.1884 | −0.1970 | −0.1431 | 0.0184 | −0.1780 | −0.1917 |
| avg. | −0.0197 | −0.0254 | −0.0499 | −0.0514 | −0.0156 | −0.0155 | −0.0424 | −0.0355 |
| stdev. | 0.1046 | 0.1062 | 0.0802 | 0.1522 | 0.0943 | 0.1000 | 0.0777 | 0.1424 |
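Table 7's SMOTE columns refer to oversampling the minority class with synthetic examples. The core SMOTE step interpolates between a minority sample and one of its minority-class nearest neighbours in feature space; the hedged sketch below shows only that step (the neighbour search, and any adaptation to text embeddings, is omitted, and the function name is mine):

```python
import random

def smote_sample(x_i, x_nn, rng=random):
    # One synthetic minority example: move a random fraction of the way
    # from minority point x_i towards its minority-class neighbour x_nn.
    lam = rng.random()  # interpolation factor in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x_i, x_nn)]
```

Because the same interpolation factor is applied to every feature, the synthetic point lies on the line segment joining the two real minority samples.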
Table 8. Average metric values from 5-fold cross-validation for the performance of BERT compared with the BERT baseline, trained and tested using 2 permissions together. Cross-entropy is used as the loss function, with 5 epochs per fold, batch size 4, and a simple permission description. Negative values indicate worse performance than the baseline, and positive values indicate better performance.
| Permission Request Combination | IDs | ID | P4 | MCC | AUROC |
|---|---|---|---|---|---|
| Same category and same majority class | 5, 6 | 5 | 0.0725 | 0.0861 | 0.0025 |
| | | 6 | −0.2179 | −0.1627 | −0.0403 |
| | 7, 8 | 7 | 0.0333 | 0.0613 | 0.0153 |
| | | 8 | 0.0144 | 0.0425 | 0.0200 |
| Unrelated categories but same majority class | 5, 7 | 5 | 0.0302 | 0.0905 | −0.0220 |
| | | 7 | −0.0167 | 0.0139 | −0.0245 |
| | 5, 8 | 5 | 0.0810 | 0.1041 | 0.0305 |
| | | 8 | −0.0027 | 0.0066 | 0.0009 |
| | 6, 7 | 6 | −0.2955 | −0.2240 | −0.0232 |
| | | 7 | −0.0071 | 0.0340 | −0.0173 |
| | 6, 8 | 6 | −0.0302 | −0.0128 | 0.0085 |
| | | 8 | −0.0366 | −0.0509 | −0.0162 |
| Same category but opposite majority classes | 5, 25 | 5 | −0.0405 | −0.0162 | −0.0739 |
| | | 25 | −0.2713 | −0.2741 | −0.1240 |
| | 6, 25 | 6 | −0.0296 | 0.0185 | −0.0065 |
| | | 25 | 0.0221 | −0.0408 | −0.0169 |
Ma, C.Y.T. Quantifying Privacy Risk of Mobile Apps as Textual Entailment Using Language Models. J. Cybersecur. Priv. 2025, 5, 111. https://doi.org/10.3390/jcp5040111
