Personal Information Classification on Aggregated Android Application's Permissions

Abstract: Android offers millions of apps on the Google Play-store through application publishers. However, those publishers often have a parent organization and share information with it. Through the 'Android permission system', a user permits an app to access sensitive personal data. Large-scale personal data integration can reveal user identity, enabling new insights and revenue for the organizations. Similarly, aggregation of Android app permissions by the app-owning parent organizations can also cause privacy leakage by revealing the user profile. This work classifies risky personal data by proposing a threat model on the large-scale aggregation of app permissions by app publishers and the associated owners. A Google Play application programming interface (API) assisted web app is developed that visualizes all the permissions an app owner can collectively gather through multiple apps released via several publishers. The work empirically validates the performance of the risk model with two case studies. The top two Korean app owners, seven publishers, 108 apps and 720 sets of permissions are studied. With reasonable accuracy, the study finds the contact number, biometric ID, address, social graph, human behavior, email, location and unique ID to be frequently exposed data. Finally, the work concludes that real-time tracking of aggregated permissions can limit the odds of user profiling.


Introduction
The proliferation of personal data breaches for gaining valuable insights into user preferences has become a common phenomenon in the data-driven industrial revolution. Popular services (software, apps and websites) rely on personal data to provide commercial services that generate lucrative revenue streams. Since media content consumption is concentrated on mobile devices, the risk of personal data leakage in this environment is growing. Once a company receives personal data, it not only analyzes the data but also trades it to other companies as a business policy [1]. Once compromised, invaders can use the data to regain personally identifiable information (PII) and potential personally identifiable information (PPII), or partial identity, to harm users through social engineering attacks. Therefore, personal data has already become the new oil, and privacy threat modeling is at the crux of the next industrial revolution [2]. Besides, recent data breach incidents also demand further research in the privacy domain [3][4][5][6].

Problem Statement and Motivation
While using apps, users are less likely to consider who actually owns the app (publisher and owner). In addition, users also prefer to pick apps from the same company. However, companies often exchange personal data, even more so within the subsidiaries of an identical parent association [24][25][26][27][28]. The simultaneous outage of key Facebook subsidiaries (Facebook, Instagram, Messenger, WhatsApp, etc.) also suggests the use of shared storage or infrastructure [29,30]. Therefore, it is highly likely that app permissions are shared among linked publishers and owners [31][32][33][34], and this exchange of sensitive Android permissions might amplify identity leakage [35][36][37][38]. Tracking and managing the data shared through multiple apps is quite challenging for users. Thus, it is substantially important to consider the re-identification caused by data aggregation at the ultimate app owners' end (Figure 1). Therefore, a concrete classification of PII on the above issue is highly needed to reduce the risk of specific PII leakage [23]. Existing privacy risk models do not consider this data aggregation [20,39,40,41]. The study believes this work is the first in the context of PII classification where aggregation of app permissions by the same owner (publisher and owner) is measured. On one side, users demand richer services from apps; on the other hand, personal data flow should also be limited, so further study is needed to lessen this large-scale data re-identification threat.

Key Consideration and Contribution
In response to the aforementioned issues, the study analyzes how permissions can be efficiently managed to reduce PII leakage. Figure 1 provides an overall idea of the identity threat model associated with app permission aggregation. Partial information is collected through multiple apps, so the parent organization (PO) can analyze and infer user identities by combining that partial information. A parent organization re-identifies a user identity by combining different parts of information collected from the individual apps (App) published through several publishers (P; Figure 1). With 2.6 million apps and 95 billion downloads, Android is the top mobile app platform [42]. Therefore, this study considers all the permissions owned by the top two South Korean Android app owners, Kakao [43] and Naver [44]. Each of them owns several publishers that publish a variety of apps. This study considers five publishers of Kakao [43] and two of Naver [44]. In total, 720 rows of permission sets made with 26 dangerous Android permissions of 118 apps are used. Finally, the 720 permission sets gathered by app owners are classified into 8 PII classes. A web app ('Privacy Analysis on Mobile App' [45]) is developed by integrating the Google Play-store application programming interface (API) to analyze the PII threat caused by permission aggregation.
The key resolutions of the study are (a) to address how Android app permissions collected by an identical owner are exposing user PII, (b) to classify PII on aggregated Android permissions using the machine learning technique and (c) to offer suggestions for reducing large-scale permission re-identification.

Roadmap
The rest of this study is organized as follows. Section 2 describes the 'Android permission system' associated jargon and existing works. Section 3 proposes the personal information risk model associated with aggregated Android permissions. Section 4 validates the proposed model with the case studies. Section 5 discusses the pros and cons of the risk model along with the future scope and recommendations. Finally, Section 6 concludes the work, followed by the necessary references.

Personal Information and Privacy
Personal data is something that reveals an individual's identity or features. PII is a popular term used in the USA, Korea and Australia [46]. The European Union uses 'personal data' [47]. The National Institute of Standards and Technology (NIST) [48] defines PII as "any information about an individual maintained by an agency, including (PII) any information that can be used to distinguish or trace an individual's identity, such as name, social security number, date and place of birth, mother's maiden name, or biometric records; and (linked PII or PPII) any other information that is linked or linkable to an individual, such as medical, educational, financial, and employment information". Studies [49][50][51] have also mentioned that a combination of partial identities can produce a full identity. Therefore, throughout the study, partial identity and PPII were considered the same object. References [48][49][50][51] define partial identity or PPII as "A partial identity is a subset of attribute values of a complete identity, where a complete identity is the union of all attribute values of all identities of this person". Table 1 shows a few examples of PII and PPII [37,46,52,53]. De-identification (Did) is a method by which information is modified in a way that the data can no longer directly or indirectly indicate an identity. After the Did, the data is known as non-personally identifiable information (NPII). Data that is unable to distinguish any identity individually is NPII (shareable and risk-free information). Australian privacy law and practice [54] defines it as "non-identifiable data, which have never been labeled with individual identifiers or from which identifiers have been permanently removed, and by means of which no specific individual can be identified. A subset of nonidentifiable data is those that can be linked with other data, so it can be known that they are about the same data subject, although the person's identity remains unknown".
Finally, re-identification (Rid) of personal data is done by matching anonymous identities with available partial information to expose the exact identity of the data owner [55]. In the Android ecosystem, app owners (publishers and owners) Rid user data in order to generate an app user's unique identity. Our previous works [35,37,56] have already stated that Android app permissions affect PII. Besides phones, permissions are being collected by apps running on smart vehicles, TVs, smart cars, smart toys, Internet of Things (IoT) devices, etc.

Table 1. Personal information (personally identifiable information (PII) and potential personally identifiable information (PPII)).
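As a hedged illustration of Rid (all records, attribute names and values below are hypothetical, not from the study's dataset), joining two partial leaks on a shared quasi-identifier recovers a richer profile than either source alone:

```python
# Re-identification (Rid) sketch: two apps each leak a partial identity;
# joining them on a shared quasi-identifier recovers a fuller profile.
# All records here are hypothetical illustrations.

app_a_leak = [  # e.g., from location + calendar permissions
    {"device_id": "d1", "home_area": "Gangnam", "routine": "gym 7am"},
    {"device_id": "d2", "home_area": "Mapo", "routine": "cafe 9am"},
]
app_b_leak = [  # e.g., from contacts + account permissions
    {"device_id": "d1", "email": "user1@example.com"},
    {"device_id": "d2", "email": "user2@example.com"},
]

def reidentify(partial_a, partial_b, key):
    """Join two partial-identity tables on a shared key (quasi-identifier)."""
    index = {row[key]: row for row in partial_b}
    merged = []
    for row in partial_a:
        if row[key] in index:
            merged.append({**row, **index[row[key]]})  # union of attributes
    return merged

profiles = reidentify(app_a_leak, app_b_leak, key="device_id")
# Each profile now links email + home area + routine: a richer identity
# than either app alone could expose.
```

The same join logic scales to any number of partial sources, which is why aggregation across apps sharing a common identifier is the core threat studied here.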
Both Apple (iOS; Figure 2a) and Google (Android; Figure 2b) collect user consent before using sensitive user data (Figure 2) [57,58]. As this study only focused on Android, the study domain was limited to Android apps. An Android app requires user data, and to gather it, Android introduced the app 'permission system'. So far, Android has introduced four types of permissions: normal, signature, dangerous and special. A dangerous Android permission collects user consent that allows an app to access a user's sensitive personal data [59] (Table 2).
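As a small illustration of how requested permissions can be inspected and the dangerous ones flagged (the manifest snippet and the four-entry dangerous list below are illustrative subsets, not the full 26):

```python
# Sketch: flag "dangerous" permissions requested in an AndroidManifest.xml.
# The DANGEROUS set below is a small illustrative subset of the 26.
import xml.etree.ElementTree as ET

DANGEROUS = {
    "android.permission.CAMERA",
    "android.permission.RECORD_AUDIO",
    "android.permission.READ_CONTACTS",
    "android.permission.ACCESS_FINE_LOCATION",
}
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

manifest_xml = """<manifest xmlns:android="http://schemas.android.com/apk/res/android">
  <uses-permission android:name="android.permission.CAMERA"/>
  <uses-permission android:name="android.permission.INTERNET"/>
  <uses-permission android:name="android.permission.READ_CONTACTS"/>
</manifest>"""

root = ET.fromstring(manifest_xml)
# android:name attributes are stored under the expanded namespace key
requested = [e.get(ANDROID_NS + "name") for e in root.iter("uses-permission")]
dangerous_requested = [p for p in requested if p in DANGEROUS]
# -> ['android.permission.CAMERA', 'android.permission.READ_CONTACTS']
```

This static view of the manifest is the same information the study's web app aggregates per publisher and per owner.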

Android Application Permission Associated Privacy Risk
Android boosted its privacy by changing its 'permission system' in the last update [59]. The latest Android version requests permissions from users at runtime, which users are least likely to accept. However, almost 50% of current Android systems still collect 'dangerous permissions' at installation time [60]. Android [37] defines them as follows: "Dangerous permissions cover areas where the app wants data or resources that involve the user's private information, or could potentially affect the user's stored data or the operation of other apps." In other words, a 'dangerous permission' gathers consent to access sensitive data. The app user can either choose not to use the app or allow the 'dangerous permissions' to access personal data, so 'dangerous permissions' must be handled carefully [61]. The study classifies the risky PII only on the collective availability of those 'dangerous permissions' to an app owner (Table 2).
For instance, the calendar stores personal data like routine, activity and behavior. Call logs and phone status can reveal the contacts of a user's family, friends and connected people [10]. Similarly, the camera, microphone and sensors collect very sensitive biological and voice data. Location collects not only the exact physical position but also the coarse location [62]. Finally, storage and SMS are considered an excessive source of personal data as they store group photos, photo IDs, audio clips, video clips, etc. [63].
As classification helps to differentiate attack surfaces and prepare accordingly [64], studies [7,9,10,17,65] have already analyzed app permissions for privacy threat modeling [66][67][68]. A study [69] identifies five ways privacy is breached through the app permission system: (a) consistent access of the apps to local storage, (b) allowing more permissions than needed, (c) hiding the reason for data gathering, (d) collection of data without any context to the app and (e) app developers' illiteracy.
Another study [70] analyzed 300 state-of-the-art studies on Android security and privacy collected from IEEE, ACM, Springer, ScienceDirect, etc. It identified PII exposure via permissions as one of the key sources of privacy leakage through Android apps.
Li et al. [71] identified 'significant Android permissions' out of the long list of app permissions. The study identified 22 vital permissions through a machine learning-based method to improve overall privacy. However, it overlooked the threat that permissions pose to PII.
Arora et al. [72] introduced a privacy awareness methodology based on both permission (static) and network (dynamic) analysis. This study also used machine learning methods to classify malicious Android samples. It mentioned that static (app permission) risks are often ignored by researchers, so our proposed study focuses on the risks associated with 'dangerous Android permissions'.
Fritsch et al. [35] proposed that partial identities can be generated from app permissions. The study considered an experimental survey and successfully proved that a combination of 'dangerous Android permissions' could form PII. The authors classified sensitive personal information based on the survey output. However, the study explored permission leakage only at the app level; instead, our study also considers publishers' and owners' scope of data aggregation.
Liang et al. [38] detect Android malware by considering permissions declared in the app manifest. The study examines the AndroidManifest.xml file and combines Android permission information to classify malware and benign apps. It implements 'droid combine', which accumulates and analyzes six Android permission files to enhance the security of the Android system. Shuba et al. [73] classified PII leakage on mobile devices with AntShield [39]. It performs deep packet inspection (DPI) in the middle of the network with machine learning to find PII leakage with prior knowledge. It explicitly considers the potential data aggregation among mobile devices, which was overlooked by other similar studies [74,75]. However, legal issues discourage the DPI method [76].
He et al. [7] analyzed over 150 apps and found that the deployment of third-party libraries in apps might put user identities in danger. It revealed that third-party apps collect permissions without complying with Android privacy terms. With the same notion, our study considers publishers and owners (third parties) that might threaten user privacy.
Although the General Data Protection Regulation (GDPR) [47], the NIST [48] and the Health Insurance Portability and Accountability Act (HIPAA) [77] are safeguarding privacy, somehow user identities are very often re-identified by the service providers.
Finally, a few of our works [36,37,78] have already stated that permission aggregation can easily Rid user identity. Those studies also stated the necessity of PII risk classification on Android permission aggregation to lower the Rid risk. To propose an efficient PII classification, this study considers the chance of large-scale permission aggregation by the service providers (publishers and owners).

Risk Modeling
This model aims to identify the risk of personal information exposure due to user profiling by Android apps. While designing the risk model, the study considered the identity exposure threat caused by permission aggregation. This section first summarizes the proposed method, followed by a discussion of the Android permission system (Section 3.1), personal information types (Section 3.2) and the risk model (Section 3.3) in the light of Sections 3.1 and 3.2. The study used five major entities (Figure 3).

1. Application (app) permissions: This study considered sensitive information sensing 'Android app permissions'. An app provides services in exchange for those permissions.
2. Android permission aggregation: This study considered the accumulation of the available Android permissions by apps, associated app publishers and the ultimate app owners.
3. Permission generates PII: This study listed PII from several established resources. The study blended a heuristic approach with a real-life survey to link the app's permissions to a specific PII.
4. Classification of PII on aggregated permissions: After coupling a specific permission set to PII, the PII was classified. This study used several machine learning approaches to verify the classification.
5. Identification of risky PII: The study identified the classified PII as a risky PII set.

The risk model considered the odds of permission (partial information) flow from an Android app to the publishers and the ultimate owner. Many Android apps are being published by multiple subsidiaries. As most of those subsidiaries (publishers) are owned by very few companies, the chance of permission aggregation is high. This study considered the earlier stated facts to explore a set of risky PII classes by proposing a risk model. Since gaining a specific PII requires a precise set of Android permissions, a classification of PII on Android permissions is highly needed. In short, the risk model identified easily exposed PII caused by Android permission aggregation-based user profiling.
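The pipeline above (aggregate permissions, then map aggregated sets to PII classes) can be sketched as follows; the permission-to-PII rules here are illustrative placeholders, not the study's full mapping:

```python
# Sketch of the risk-model pipeline: given the permissions an owner has
# aggregated, report which PII classes are exposed. The rules below are
# illustrative examples, not the study's complete mapping.

PII_RULES = {
    "biometric ID":   {"CAMERA", "RECORD_AUDIO", "BODY_SENSORS"},
    "location":       {"ACCESS_FINE_LOCATION", "ACCESS_COARSE_LOCATION"},
    "contact number": {"READ_CALL_LOG", "READ_CONTACTS"},
}

def exposed_pii(aggregated_permissions):
    """A PII class counts as exposed once every permission in its rule
    is collectively held by the owner (across all apps and publishers)."""
    held = set(aggregated_permissions)
    return sorted(pii for pii, needed in PII_RULES.items() if needed <= held)

owner_permissions = {"CAMERA", "RECORD_AUDIO", "BODY_SENSORS",
                     "ACCESS_FINE_LOCATION", "ACCESS_COARSE_LOCATION"}
print(exposed_pii(owner_permissions))  # ['biometric ID', 'location']
```

The design choice is deliberate: exposure is decided on the owner's aggregated set, not per app, which is exactly the threat surface the risk model targets.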

Android Permission Scope Model
This section defines the likelihood of personal information distribution on the Android platform. Android platform associated stakeholders and their permission gathering mechanism (user consent) were discussed here. The key stakeholders of the Android ecosystem were defined here:

Definition 1. Application: software intended to perform special functions and tasks for individuals through a handphone, smart TV, IoT device, smart vehicle, etc. on the Android platform.

Out of all permissions, Android marks 26 as dangerous permissions, which involve the user's private information or can affect the user's identity and require user consent to access [59]. Greater data possession can lead to stronger data mining. Thus, the study provides three data distribution associated assumptions for the applications, publishers and owners, respectively.

Assumption 1. Not all of {Pn1, Pn2, …, Pn26} are concurrently needed by an application. Each Android application demands a specific set of dangerous permissions. For example, application A1 might require the set {Pn1, Pn5, Pn6, Pn12, Pn13, Pn16, Pn23, Pn26}; application A2 might require a fully different set, {Pn2, Pn3, Pn8, Pn15, Pn19, Pn21, Pn25}; and application A3 might require another set, {Pn1, Pn3, Pn8, Pn13, Pn16, Pn21, Pn25}. The above example indicates that more than one application may be needed to collect all available {Pn1, Pn2, …, Pn26}. Therefore, the study inferred the possibility of dangerous permission aggregation from multiple apps.
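Assumption 1 can be illustrated with plain set arithmetic (permission names follow the Pn notation above): the union of several apps' permission sets covers far more of the 26 dangerous permissions than any single app.

```python
# Assumption 1 illustrated: no single app requests all 26 dangerous
# permissions, but several apps together cover a much larger share.
P = {f"Pn{i}" for i in range(1, 27)}  # Pn1 .. Pn26

A1 = {"Pn1", "Pn5", "Pn6", "Pn12", "Pn13", "Pn16", "Pn23", "Pn26"}
A2 = {"Pn2", "Pn3", "Pn8", "Pn15", "Pn19", "Pn21", "Pn25"}
A3 = {"Pn1", "Pn3", "Pn8", "Pn13", "Pn16", "Pn21", "Pn25"}

union = A1 | A2 | A3  # permissions reachable by aggregating all three apps
print(len(A1), len(A2), len(A3), len(union))  # 8 7 7 15
```

Here three apps of 7 to 8 permissions each jointly expose 15 of the 26 dangerous permissions, which is the aggregation effect the assumption describes.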

Assumption 2. Several apps are released on the Play-store through publishers. Usually, each publisher issues multiple applications. Naturally, a publisher uses a common data repository to serve its underlying services [30]. In addition, a publisher faces no legal obstacles that refrain it from aggregating data out of its published applications [80,81]. Therefore, {W1, W2, …, WNPUB} automatically gains control over the dangerous permissions of those applications. The study also inferred the possibility of {Pn1, Pn2, …, Pn26} aggregation by an app publisher from all of its published apps.
Assumption 3. For brand recognition, financial consideration and capital raising, giant companies often possess subsidiaries and sub-organizations. Those subsidiaries and sub-organizations can also be Android application publishers. As in Assumption 2, {K1, K2, …, KNOWN} also faces no legal difficulties, which inspires the parent organization toward data aggregation out of several Android publishers and their associated applications.

For example, apps A, B and C collected personal data through P1 and P2; P9 and P10; and P15, P16, P18, P19, P25 and P26, respectively, for publishers J and K. Afterward, owner X gathered all the data required to get δ from publishers J and K. Finally, owner X owned δ, which could identify a PII of an application user. In short, δ ⊂ A ∪ B ∪ C ∪ J ∪ K, or the PII was owned by X.
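The nested aggregation in Assumptions 2 and 3 can be sketched as unions flowing upward from apps to publishers to the owner; the app, publisher and permission names mirror the worked example, and δ is a hypothetical PII-revealing combination:

```python
# Owner-level aggregation: owner X controls publishers J and K;
# each publisher controls some apps; permissions flow upward by union.
publishers = {
    "J": {"A": {"P1", "P2"}, "B": {"P9", "P10"}},
    "K": {"C": {"P15", "P16", "P18", "P19", "P25", "P26"}},
}

def publisher_permissions(apps):
    """Union of the dangerous permissions over a publisher's apps."""
    out = set()
    for perms in apps.values():
        out |= perms
    return out

owner_x = set()
for apps in publishers.values():
    owner_x |= publisher_permissions(apps)

# delta: a hypothetical PII that needs this exact permission combination,
# which no single app or publisher holds on its own.
delta = {"P1", "P9", "P16", "P26"}
print(delta <= owner_x)  # True: only owner X can re-identify this PII
```

Note that δ spans both publishers, so only the top of the hierarchy ever holds the full combination: this is the large-scale aggregation risk the model formalizes.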
Only a particular group of PPIIs generates a specific PII; similarly, only a specific set of dangerous permissions generates a specific PII. The second objective of the study, 'classification of PIIs according to aggregated dangerous permissions', was achieved by creating a dangerous permission versus PII matrix. The study classified for which particular set of {Pn1, Pn2, …, Pn26} a specific {S1, S2, …, SN} was threatened. Table 3 demonstrates an example of the second study objective. A company X' could re-identify four PIIs through four combinations of dangerous permission sets (Table 3). As a different set of dangerous permissions led to a different PII, this study separated those permission sets to classify the correlations (which PII was leaked for a certain permission set). Four PIIs (α, β, δ and γ) of a user were attained by X' if the permission sets (5th, 10th, 15th and 23rd); (1st, 15th, 9th and 26th); (3rd, 4th, 9th and 17th) and (5th, 6th, 23rd and 26th) were owned (aggregated) by the owner X', respectively. Sections 3.3.1 and 3.3.2 explain the proposed PII classification in detail.
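The dangerous-permission versus PII matrix can be represented as 26-wide binary rows; the four example sets for α, β, δ and γ follow the Table 3 illustration above:

```python
# Build the permission-vs-PII matrix from the Table 3 example: each PII
# class becomes a 26-wide binary row marking its required permission set.
PII_SETS = {
    "alpha": {5, 10, 15, 23},
    "beta":  {1, 15, 9, 26},
    "delta": {3, 4, 9, 17},
    "gamma": {5, 6, 23, 26},
}

def to_row(indices, width=26):
    """Binary row: 1 where the i-th dangerous permission is required."""
    return [1 if i in indices else 0 for i in range(1, width + 1)]

matrix = {pii: to_row(idx) for pii, idx in PII_SETS.items()}

def classify(held_indices):
    """Return every PII whose full permission set the owner has aggregated."""
    held = set(held_indices)
    return sorted(p for p, idx in PII_SETS.items() if idx <= held)

print(classify({5, 10, 15, 23, 26}))  # ['alpha']
```

Each aggregated permission set thus maps deterministically to the PII classes it threatens, which is the correlation the classification step learns.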

Connection Between Android Permission and PPII
Since Android permissions can convey significant partial information directly or indirectly to a company, this study considered those app permissions as PPII. Tables 2 and 4 describe the Android permission system in detail. This study provided enough evidence to assume that the PPII (Table 1) and the dangerous Android permission (Table 2) were a single entity [35-37,56,82,83].

Association of Android Permission and PPII
This section used both a heuristic method and a real-life user survey on the available data to correlate the dangerous permissions (PPII) with the PII.

The heuristic method for Android application permission and PII mapping:
The heuristic systemic model (HSM) [84] is already well practiced in classification reasoning. This study heuristically selected eight PII classes that were highly engaged with the Android app's permissions. While doing this heuristic modeling of PII, the authors used the already available knowledge base and literature [46,73,82]. However, this study used practical assumptions to detect the eight risky PII classes (Table 5), where the modeling was logical and rational rather than evidence-centric. Therefore, to confirm the HSM, this study underwent another level of validation by conducting a survey. Table 5 reflects the final set of PII that are threatened due to aggregated app permissions. Before the survey, participants undertook a 3 h long workshop where PII, PPII, the Android permission gathering system and privacy risks were taught. The survey result provided around 2000 unique ways of PII leaking through the dangerous Android permission sets. Among those, only 720 unique permission-providing sets were chosen for this study. Figures 5 and 6 provide a few screenshots of the web app [45] where several functions of our proposed web app were visualized. Figure 5 shows the overall state of the 'Privacy Analysis on Mobile App' web application, followed by Figure 6a,b, which show detailed functions of the application in a closer view.
To classify PII, this study only correlated the 'dangerous permissions' with the PII. An example of the survey output for a student 'X' is given in Table 6. After selecting all the used apps from Kakao (owner), permissions like camera, record_audio, body_sensors, read_calendar, access_network, call_phone, send_sms, get_account, read_call_log, etc. were provided to it. Then, the rightmost column, the PII class, was filled by the user's choice. Therefore, based on user 'X''s opinion, (1) biometric ID was given with legal consent as a result of permitting camera, record_audio, body_sensors and use_fingerprint; (2) location was given with legal consent as a result of permitting read_ext_storage, read_calendar, access_network and access_coarse_location and (3) contact number was given with legal consent as a result of permitting call_phone, send_sms, get_account and read_call_log. The dataset used by this study contained similar information to Table 6.
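A survey row like the one in Table 6 can be encoded as (permission set → PII class) samples. The sketch below is illustrative: it uses user 'X''s three answers and a simple Jaccard nearest-neighbour stand-in rather than the heavier machine learning models the study actually applies:

```python
# Encode Table 6-style survey rows as (permission set, PII class) samples,
# then predict the PII class of a new permission set by best overlap.
samples = [
    ({"camera", "record_audio", "body_sensors", "use_fingerprint"},
     "biometric ID"),
    ({"read_ext_storage", "read_calendar", "access_network",
      "access_coarse_location"}, "location"),
    ({"call_phone", "send_sms", "get_account", "read_call_log"},
     "contact number"),
]

def predict(perm_set):
    """Nearest neighbour over Jaccard similarity; a lightweight stand-in
    for the machine learning classifiers used in the study."""
    def jaccard(a, b):
        return len(a & b) / len(a | b)
    return max(samples, key=lambda s: jaccard(perm_set, s[0]))[1]

print(predict({"camera", "record_audio", "use_fingerprint"}))
# -> biometric ID
```

With 720 such rows, any standard classifier can be trained on the binary permission vectors; the point here is only the data shape, not the model choice.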

Case Studies
This section empirically validates the proposed risk model with case studies. To analyze large-scale data aggregation, all the app permissions owned by the two owners were considered. For an efficient analysis and visualization, this study presents the evaluation results separately in two case studies [85]. Section 4.1 elaborates the experimental environment, followed by the two case studies on Kakao (Section 4.2) and Naver (Section 4.3).

Case Study Selection Criteria
The work studied Kakao [43] and Naver [44], the top two South Korean Android app owners (publishers), based on the following factors:

1. The number of app users: Both Kakao and Naver own a significant number of users. For instance, the 'KakaoTalk' application has 220 million registered and 50 million active users [86]. Similarly, 94% of the local search market is owned by the 'Naver search' application [87].
2. Application brand popularity in the local market: Both Kakao and Naver are top application brands in South Korea. Out of the top 10 popular apps, three are from Kakao and two are from Naver [88].
3. The similarity in service: The availability of diverse apps such as chatting, map, dictionary, translator, social, etc. from both owners is another reason for selecting Kakao and Naver.
4. Availability of multiple publishers: Both Kakao and Naver publish apps via several publishers.

Android Privacy Permission Analyzing Web Application
During this work, a web application tool was developed for analyzing and visualizing the risk model. 'Privacy analysis on mobile app' (http://52.79.237.144:3000/) [45] was built for analyzing large-scale privacy permission aggregation risk. The proposed app consisted of three parts.

• A Node.js based web application server (WAS): The application gathered data from the 'Google Play-store' [57] using the open API. An asynchronous input/output (I/O) based service mechanism was implemented using Node.js to provide a non-blocking facility. Regardless of the response success rate of the used API, the proposed application could perform independently.

• A complete database: The database included 18,500 mobile applications with 160,000 dangerous Android permissions. The record for each mobile application included related information such as the title, publisher, downloads and rating of the individual application, as well as its permission list. Our web application used the Google Play-store [57] API to integrate applications of the Android Play store with our database. In addition, any application not yet in our database could be added immediately for future use.

• The last part was the web-based service portal: It provided functions like searching, listing and grouping of data in terms of the owning company. The application also allowed users to download an Excel file containing the Android permission set and the associated app owner's identity.

Android Privacy Permission Collection
This study listed all the available subsidiaries of Kakao and Naver. Afterward, a previously stated web application was used to download applications, permissions and associated publishers of all the listed subsidiaries of Kakao [43] and Naver [44] for further study. At this point, each entry of the excel sheet contained the owner, publisher, app and associated dangerous permissions. This study combined all the permissions with respect to the publishers and owners. Figure 7 shows two examples of an extensive collection of dangerous permissions (top half of the figure) by applications (bottom half of the figure). Figure 7a,b shows 'Kakao games corp' and 'Daum corp' of Kakao as an example [37].
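The owner-level accumulation described above amounts to a set union of dangerous permissions across apps and publishers. The following minimal Python sketch, using hypothetical publisher/app names and abbreviated permission sets rather than the actual collected data, illustrates how the owner's aggregate view grows:

```python
# Hypothetical owner-side aggregation of dangerous permissions.
# Publisher/app names and permission sets are illustrative only,
# not the dataset collected in this study.
publisher_apps = {
    "Publisher-1": {
        "app-a": {"CAMERA", "RECORD_AUDIO"},
        "app-b": {"READ_CALENDAR", "ACCESS_COARSE_LOCATION"},
    },
    "Publisher-2": {
        "app-c": {"CALL_PHONE", "SEND_SMS", "READ_CALL_LOG"},
    },
}

def aggregate_owner_permissions(publisher_apps):
    """Union of every app's dangerous permissions, as seen by the common owner."""
    owner_view = set()
    for apps in publisher_apps.values():
        for permissions in apps.values():
            owner_view |= permissions
    return owner_view

print(sorted(aggregate_owner_permissions(publisher_apps)))
```

No single app grants the whole set, yet the owner ends up holding all seven permissions, which is precisely the aggregation risk the excel-sheet entries capture.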

Description of the Evaluation Tool
This study used Weka [89], a machine learning tool, for training and testing the dataset. The classifiers evaluated were random forest [90], decision tree [91], PART and bagging CART [92,93]. For evaluating the empirical results, four key metrics were used: accuracy, precision, recall and F1 score [94]. Accuracy is one of the key measures of whether a proposed classification is correct; it indicates the proportion of true positives and true negatives among all predictions. The closer the accuracy is to 1, the better the classification performs. Precision is the fraction of instances classified as positive that are truly positive. Recall is the percentage of the total relevant results correctly classified by the algorithm, that is, the number of real positives that are identified accurately. As precision and recall often come at the cost of one another, it is highly unlikely to maximize both metrics simultaneously. Consequently, another significant metric, the F1 score, which is the harmonic mean of the two, is used.
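The four metrics above can be computed directly from confusion-matrix counts. A minimal Python sketch with hypothetical counts (not the study's actual results) makes the relationships explicit:

```python
# Accuracy, precision, recall and F1 computed from confusion-matrix counts.
# The counts below are hypothetical and only illustrate the formulas.

def classification_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # correct / all predictions
    precision = tp / (tp + fp)                    # predicted positives that are real
    recall = tp / (tp + fn)                       # real positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(round(acc, 2), round(prec, 2), round(rec, 2), round(f1, 2))  # prints: 0.7 0.8 0.67 0.73
```

Raising precision (fewer false positives) usually lowers recall (more false negatives), which is why the F1 score is reported alongside both.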

Description of the Dataset
As the total number of instances in the Kakao dataset was 326, this study used 85% (277) randomly chosen instances for training purposes and the rest (15%; 49) for testing. The study used 26 dangerous permissions and eight PII classes for the Kakao case study (Table 7). Figure 8 shows the performance measuring metrics, such as the accuracy, precision, recall and F1 score, against key machine learning classification algorithms like random forest, decision tree, PART and bagging CART. The factors and classification methods used in this study are globally recognized [94]. Figure 8 shows that, for the Kakao [43] dataset, the decision tree achieved a better outcome than the other classifiers in terms of accuracy. The maximum accuracy achieved while classifying PII from dangerous Android permissions was 71%. For the Kakao case, the decision tree-based classifier achieved a maximum precision of 0.76. This study achieved a 0.70 recall for the decision tree. Finally, the F1 score (F-measure) was recorded as 0.69. Figure 8 shows the performance record for Kakao as tested by the study.
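The 85/15 random split described above can be reproduced with a simple random partition. The sketch below uses placeholder instance indices rather than the actual labeled Kakao dataset, and the fixed seed is only for reproducibility of the sketch:

```python
import random

# Hypothetical reproduction of the 85/15 train/test split used in the
# Kakao case study: 326 instances -> 277 for training, 49 for testing.
random.seed(0)  # fixed seed only so this sketch is repeatable

instances = list(range(326))           # stand-ins for the labeled instances
train = random.sample(instances, 277)  # 85% chosen uniformly at random
test = [i for i in instances if i not in set(train)]

print(len(train), len(test))  # prints: 277 49
```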

Description of the Dataset
As the total number of instances in the Naver dataset was 394, this study used 85% (335) randomly chosen instances for training and the rest (15%; 60) for testing. The study used 26 dangerous permissions and eight PII classes for the Naver case study (Table 8). Figure 9 shows the performance measuring metrics, such as the accuracy, precision, recall and F1 score, against the globally recognized machine learning classification algorithms like random forest, decision tree, PART and bagging CART [94]. Figure 9 shows that a maximum of 75% accuracy was achieved with the Naver [44] data while classifying PII with random forest classification. The maximum precision was recorded as 0.81. This study achieved a 0.75 recall for the random forest. The stated F1 score by this work for Naver was 0.74. Figure 9 shows the detailed PII classification performance record for Naver.

Key Learning From the Study
Based on the survey results and empirical evidence, the study highlighted the possible trends of permission aggregation.


• On the severity of access, Android permissions were of two types: one-time access and continuous access. PII like 'biometric IDs' and 'contact numbers' were directly exposed through one-time access of 'USE_FINGERPRINT' and 'CALL_PHONE'. Alternatively, 'social graph' and 'human behavior' required continuous access to 'READ_CALL_LOG', 'READ_EXT_STORAGE', 'READ_CALENDAR', etc.

• PII exposing risk depended on two sub-factors: the quantitative and qualitative attributes of the Android permissions. For example, PII like 'location' (location graph) became more accurate after reiteration of the 'FINE_LOCATION' and 'COARSE_LOCATION' permissions. That is, the more specific the Android permissions used, the more PII is exposed. Alternatively, the permission 'BODY_SENSORS' sensed the surroundings, preferably the human body, continuously. However, while some body sensors might collect a high-quality attribute, others might not. Thus, the user profiling risk (PII exposing rate) also depended on the quality of the data gathered through permissions.

• The PII revealing risk did not depend on a single Android app (permission). As the publishers and owners could aggregate permissions from different apps, awareness and research on a single Android application (permission) were not enough anymore. Due to the owner-side permission accumulation risk, risk modeling studies must evaluate from a multiple-app perspective.

Probabilistic Analysis of the Risk Model
The proposed study was a novel approach that linked the leakage of PII with aggregated Android permissions. Although a quantitative perspective or the accuracy of the study was limited, consideration of permission aggregation at the manufacturer side increased the qualitative value of the risk model. So far, no other study has considered how heavyweight organizations can gather PII through Android applications, more specifically, through aggregated Android permissions.
The probabilistic analysis of the risk model was discussed in two phases. Firstly, permission aggregation by publishers from multiple underlying applications was considered. The likelihood of breaching individual PII via the app permissions of App_1 to App_n towards Pub_1 to Pub_n was P_a(A), P_b(B), P_c(C), …, P_n(N). Similarly, the likelihood of obtaining a specific PII from aggregated app permissions was distributed as P_a(A) ∪ P_b(B) ∪ P_c(C). Since individual PII classes are not mutually exclusive in relation to Android permissions, the same privacy permission may be responsible for leaking several PII classes. In contrast, the leakage of a single PII to a specific publisher from the list Pub_1 to Pub_n is mutually exclusive across publishers, that is, P(A ∩ B ∩ C ∩ … ∩ N) = 0. Secondly, the study proposed the chance of obtaining the eight PII classes (contact number, biometric ID, address, social graph, human behavior, unique ID, email and location) from the combination of a specific permission set. To discuss the probability of a particular PII class leaking from Android permissions, we first take a brief look at the probability estimation of decision trees and random forests from the existing literature [95,96]. The class likelihood for an end node was projected by the relative frequency of that class in the terminal node (Figure 10). The probability estimate of the tree for a new instance is the likelihood of the particular class linked with the terminal node (PII 1 to 4; Figure 10). The result estimation for the random forest was done by aggregating the individual probabilities of a new instance across all the trees. In effect, each tree forming the random forest classification algorithm was translated into a logistic regression probabilistic model. Figure 10 illustrates the aforementioned single-tree representation concept in detail. For a terminal node or class (PII 1 to 4), all the events required to reach that terminal node were considered. For example, the conditional probability for PII 4 could be estimated as P(y = 1 | x_1 ≤ p_1, x_2 ≤ p_2, x_3 ≤ p_3), where y is the dichotomous outcome of the probability of leaking a particular class. Similarly, p_1, p_2 and p_3 were the points where the tree had been split for decision-making. If all the trees that constitute the random forest are translated by the above procedure, followed by the re-estimation of the probability for each individual tree, the probability of the proposed PII classification can be calculated. Similarly, relative class frequency (RF) and Laplace estimation can also be used to build a probability estimation tree (PET) [95].
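The two leaf-node estimators mentioned, relative class frequency and Laplace estimation, are standard for probability estimation trees [95]. A hedged reconstruction, assuming k is the number of training instances of class PII_i in a leaf, n is the total number of instances in that leaf and C is the number of classes (here, the eight PII classes), is:

```latex
% Relative class frequency (RF) estimate at a terminal node
P_{\mathrm{RF}}(\mathrm{PII}_i \mid \text{leaf}) = \frac{k}{n}

% Laplace-smoothed estimate at the same terminal node
P_{\mathrm{Laplace}}(\mathrm{PII}_i \mid \text{leaf}) = \frac{k + 1}{n + C}
```

The Laplace form pulls estimates away from 0 and 1 in small leaves, which is why it is often preferred when leaves contain few instances.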

Recommendations
The study made the following recommendations for Android application users:

• Along with the assessment of an application's privacy features, the publisher's and owner's identities must be noted. The study recommended using applications from different publishers and owners.

• The study also recommended that users track the total flow of dangerous Android permissions to any specific application publisher or owner. For example, the user should closely monitor which sensitive PII is about to be revealed to a specific publisher or owner.

• Most existing studies dealt with Android permission to personal data leak classification by considering only the first-level (application-level) information. However, this study recommended that researchers consider the second-level (publisher) and third-level (owner) data (permission) aggregation. Otherwise, the overall research outcomes and objectives might be missed.
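The permission-flow monitoring recommended above can be sketched in a few lines. Assuming a hypothetical permission-to-PII mapping (an illustrative subset in the spirit of Table 6, not the study's full matrix), the check below reports which PII classes an owner's aggregated permissions fully cover:

```python
# Hypothetical check of which PII classes a set of aggregated permissions
# can expose. The mapping below is an illustrative subset inspired by
# Table 6, not the study's complete permission-to-PII matrix.
PII_REQUIREMENTS = {
    "biometric ID": {"CAMERA", "RECORD_AUDIO", "BODY_SENSORS", "USE_FINGERPRINT"},
    "contact number": {"CALL_PHONE", "SEND_SMS", "GET_ACCOUNT", "READ_CALL_LOG"},
    "location": {"READ_EXT_STORAGE", "READ_CALENDAR", "ACCESS_NETWORK",
                 "ACCESS_COARSE_LOCATION"},
}

def exposed_pii(aggregated_permissions):
    """Return the PII classes whose required permission set is fully covered."""
    return sorted(pii for pii, required in PII_REQUIREMENTS.items()
                  if required <= aggregated_permissions)

granted = {"CALL_PHONE", "SEND_SMS", "GET_ACCOUNT", "READ_CALL_LOG", "CAMERA"}
print(exposed_pii(granted))  # only 'contact number' is fully covered
```

A user (or a tracking tool) re-running such a check as new apps from the same owner are installed would see the exposed-PII list grow, which is the warning signal the recommendation aims at.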

Limitations
As the main goal of the risk model was to consider the privacy risk caused by permission aggregation under identical owners, the study might have limited accuracy. The accuracy of our proposed classification showed a lower score in comparison to other Android application-based PII classification approaches [68,97]. The size of the dataset was one of the major limitations of this study. The study was ongoing work in which only two owners were considered in the case studies. During the machine learning algorithm evaluation, we could train and test our proposal with only a limited amount of data. Typically, 5-10 publishers per owner publish apps in the Play store [57]. The permission versus PII dataset with manufacturer tagging had to be built manually. This study might also be affected by the lack of a weighting factor for individual Android permissions, as it assumed that each permission carried equal attributes (data significance). Similarly, the proposed data aggregation-based classification overlooked the effect of minor factors such as race, gender, age, country, etc. while classifying the PII threat. Therefore, the PII might have been classified slightly differently based on the aforementioned factors. Finally, the quality of the data features analyzed in the case studies was reasonably good; however, the number of considered features was small, as only the identified dangerous permissions were considered. Therefore, in future research, every Android permission will be measured.

Future Consideration
Our future study plan is to consider the following points to make the work efficient and user-friendly.

• Considering the diverse data sharing scope among stakeholders: The data sharing scope among stakeholders (owners, publishers and applications) is variable. In the future, the study plans to consider a distinct data sharing likelihood from the application to the publisher (α), the publisher to the owner (β) and owner to owner (γ). The inter-company relationship and data-storing architecture decide the data sharing options. The permission data acquisition rate would vary based on the aforesaid factors.

• Weighting of the individual Android permissions: Not all Android permissions carry equal information that can form PII. The study plans to introduce a data value weight (ω_1, ω_2, …, ω_N) on each dangerous Android permission based on its contribution to PII formation. The dangerous Android permission to PII formation matrices will be redefined based on these weights.

• Redesigning the analyzing app: Finally, the study plans to improve the web app 'Privacy analysis on mobile App' (http://52.79.237.144:3000/) [45] based on the aforesaid factors. The web application will be reshaped in such a way that the user can track the percentage of the overall personal data flow through Android permissions towards a specific owner.

• In the future, a larger dataset with bigger application-owning enterprises, such as Alphabet, Facebook, eBay, Amazon, etc., would be considered.

Conclusions
This study presented a novel personal information risk classification model inspired by the permission aggregation concept. Android permissions congregated through mobile applications, publishers and owners were considered by the risk model. Different classifiers were used and compared to classify vulnerable PII. A web application (http://52.79.237.144:3000/) [45] was developed during the study in order to visualize the actual state of large-scale privacy permission aggregation. With the latest Google-Play API, our web application [45] provided users with a better understanding of the large-scale Android permission aggregation by influential owners.
The study illustrated how permission aggregation empowered entrepreneurs with a large amount of partial information. Consequently, this partial information amplifies the chance of user re-identification (user profiling). While classifying the leakage of specific PII from a set of Android permissions, the study considered the top two South Korean app owners (Kakao and Naver) along with their publishers and their published apps. The study achieved a satisfactory accuracy by adopting decision tree and random forest classification algorithms to validate the data aggregation-caused PII risk model. As data accumulation has become a key business imperative that challenges user identity, the study concluded that the contact number, biometric ID, address, social graph, human behavior, unique ID, email and location information are frequently exposed PII.
The analytical results demonstrated that the classification approach could identify a higher rate of PII risk towards large enterprises. The proposed risk model concept could also be used for marketing policy-making in the data-driven economy. In the future, the study plans to develop a privacy-aware permission sharing and tracking tool that can check the real-time permission movement towards the decisive owners. In addition, the study plans to apply the proposed risk model to the IoT domain as well, since most IoT devices are managed by Android applications.