According to a report released by the Interactive Advertising Bureau (IAB) of the United States on 26 April 2017 [1
], digital advertising has shifted from personal computers to mobile devices. In 2016, the annual revenue of digital advertising in the United States was USD 72.5 billion, of which revenue from mobile ads exceeded 50% for the first time, reaching USD 36.6 billion. The Mobile Application Industry Report for 2015 [2
] revealed more about the popularity and importance of mobile advertising: 82% of app developers made a profit from advertising, and 91% were still using banner ads. Consumers were evidently unwilling to pay to download apps, so app developers turned to free apps and relied on ads for profit.
The report on malicious mobile software evolution released by Kaspersky in February 2017 [3
] listed 8,526,221 malicious apps detected in 2016, three times as many as in 2015. An increasing number of information security reports related to mobile ads have since been published. A report by Trend Micro in June 2017 [4
] showed that a Trojan Android ad library called Xavier could steal users’ personal information and transmit it elsewhere without user permission once users downloaded an app in which it was embedded. According to the Trend Micro data, over 800 Google Play Android apps, which had been downloaded millions of times, contained the Trojan ad lib. These included utility apps such as photo editing apps and wallpaper and ringtone changer apps. Xavier had a self-protection mechanism to avoid detection, and could also download and execute other malicious code.
Doctor Web, an anti-malware company, indicated that in June 2016 a Trojan called Android.Spy.305 had been embedded in 155 Google Play apps, and estimated that more than 2.8 million people had downloaded and installed them [5
]. The new Trojan, Android.Spy.305.origin, had originally been placed in an ad lib, and was embedded in apps when developers used this ad lib to generate advertising revenue. In total, 155 apps made by 8 app development companies were known to be infected. Once a user had installed an app containing the Android.Spy.305.origin module with the ad lib, the module connected to a Command and Control server to download the additional Android.Spy.306.origin module. The additional module would then begin to steal personal data, including Google account e-mail logins and passwords, installed app lists, system languages, mobile brands, device names, IMEI numbers, OS versions, screen resolution, and telecom operators. In addition, third party apps would be installed during the app installation, which would then display various malicious advertisements from time to time. Many researchers have noticed the security issues caused by ad libs, and considerable effort has been made in recent years to address them. Some of this work is introduced below.
Athanasopoulos et al. [6
] estimated that more than half of the apps available on Google Play contained ad libs linked to third party advertisers, posing a significant security risk to mobile app users. They therefore proposed the Native Code Isolation for Android Applications (NaClDroid) architecture to separate the program code of an ad lib from that of an app, thus preventing permission sharing. Kumar et al. [7
] noted that many ad libs required too many privileges or used privileges for which they did not have authority. Some observed apps could also sniff network traffic to obtain package content across the ad requests of multiple ad networks, making a user’s personal information more easily accessible. They also discussed how a few notorious ad libs used online third parties to stealthily transfer personal information to an unknown server. Gao et al. [8
] noted that because ad libs and apps were compiled after their merging, it was impossible to prevent the ad lib from using unauthorized permissions that exceeded the ad lib instructions. They therefore designed the Permission Supervision for Android Advertising (PmDroid) system to block ad libs’ unauthorized use of permissions to transfer information. PmDroid employed a graphical interface to present the seriousness of any unauthorized usage. To understand the actual actions of these SDKs, Gao et al. wrote 53 different apps, each with a different ad lib embedded. The apps did not do anything, but announced all the privileges of the Android system to which they had access. The packet traffic of the apps was then recorded in order to understand how the ad libs abused permissions. Because the apps themselves did nothing, all network traffic was the result of ad lib activity. The authors concluded that unauthorized use of permissions by ad libs was very serious.
Narayanan et al. [9
] observed that it was difficult to judge ad lib behavior using only the ad lib program code due to the widespread use of modern obfuscation tools. They used 26 different ads in their experimental dataset in order to test such obfuscation tools. They then proposed the AdDetect framework to assist in detecting ad libs and their behaviors in apps. AdDetect used semantic analysis to check ad libs, and used a support vector machine (SVM) to make classification judgments. Liu et al. [10
] proposed their system, called PEDAL, to de-escalate privileges for ad libs in mobile apps. The study reported that, even if ad libs used obfuscation tools, PEDAL had a 98% accuracy in detecting them. Yan et al. [11
] designed a new Android model, RTDroid, which basically modified the internal components of the Android operating system, and made use of a real-time Virtual Machine (VM) instead of the original Android Dalvik VM. This ensured that the execution of any app and its ad lib had greater predictability.
Book and Wallach [12
] noted that, while ad libs could use the privileges of the host app to secretly transmit data, the host app could also use the privileges of the ad lib to engage in extra, unauthorized actions. That is, app developers and ad networks were colluding to carry out aggressive activities. The authors collected 114,000 apps, and collected statistics for the 20 most frequently used advertisers, identifying a total of 64,000 apps using those 20 ad libs. By observing the behavior of the 64,000 apps, they concluded that app developers often actively collected too much personal user information to supply to ad networks in pursuit of high advertising profits. In addition, they found that the greater the popularity of an app, the easier it was to engage in such behavior, since as the number of users of an app increased, so did the motivation for advertisers to engage in such profit-seeking actions. Ruiz et al. [13
] discussed the problems caused by ad lib updates. According to their experimental data, over 90% of apps were free, and advertising was the only income for these app developers, so it was very important to ensure that ad libs embedded in apps could bring the expected profit. If ad libs did not achieve the expected profit, they were replaced or updated. The authors collected 13,983 versions of 5,937 apps, and found that nearly 50% of these apps had changed their ad libs within 12 months by addition, removal or update. Ad lib maintenance was thus a burden on app developers. Su et al. [14
] developed a data exploration method for HTTP dataflow. The features adopted were quantitative, timing and semantic. The authors claimed that their traffic identification of malicious ad libs could achieve an accuracy of 95% in their experiments. Kuzuno and Magata [15
] used differences in HTTP traffic to identify ad libs. They adopted 1000 known advertising pictures to identify others. The experimental results showed a 76% detection rate for known advertisement images and 96% for manually sorted advertisement images.
Kajiwara et al. [16
] observed that ad libs periodically used ad request packets to transfer personal information to ad servers, and received ad reply packets from ad servers. These reply packets were mainly advertisement pictures which appeared on the apps, changing the window screen. It was thus possible to estimate whether an app had an embedded ad lib by mathematically processing the HTTP frequencies online and screen changes. Crussell et al. [17
] focused on the issue of MAdFraud, wherein app developers used background processing to connect to ad servers and ask for advertisements for profit, without users’ knowledge, or have the program automatically click ads, thereby deceiving the ad networks. The PrivacyGuard system proposed by Song and Hengartner [18
] had a number of functions which could not only track the flow of sensitive information, but also handled sensitive information to protect against illegal access. Backes et al. [19
] noted the trend of ads being embedded in free apps, with released apps often using an old version of an ad lib and thus carrying hidden security weaknesses. The authors designed a system to help users check whether the ad libs contained in downloaded apps had security concerns, including whether malicious instructions were hidden by obfuscation. Lee et al. [20
] proposed the use of Contextual and Semantic perspectives to distinguish between app behavior and ad behavior. Tang et al. [21
] carried out a static analysis of 10,710 apps, and found that 76.08% of them had obvious unauthorized use problems, and of those, 424 apps’ sensitive permissions were only used by ad libs, instead of the host apps. This study also deals with the abuse of permissions by ad libs in a semantic way. Liu et al. [22
] discussed the possibility that analytics libraries were more likely to leak users’ personal information than ad libs. Analytics libraries are the mechanisms for tracking ad presentation and ad clicking on mobile phones.
Stevens et al. [23
Today, what users most want to know in this context is what ad libs are embedded in their downloaded apps. If an ad lib is well known, it may be relatively safe, and vice versa. This study runs apps on an emulator, and analyzes their network behaviors related to advertising. Most ad libs exhibit different behavior patterns, which are plotted into graphs to determine whether an ad lib comes from a trusted advertising company, using similarities between the graphs. The remainder of this paper is arranged as follows. Section 2
describes the operation of ad libs and related knowledge. Section 3
describes the proposed method of graph drawing according to an ad lib’s network behavior patterns. Section 4
presents experimental results, and Section 5
2. The Operation of Ad Libs and Their Security Issues
Since 2010, mobile networks have undergone rapid growth and development. The boom included the adoption of mobile devices to send messages quickly and reliably. This sudden ubiquity of mobile devices resulted in a new mobile advertising market worth billions of U.S. dollars each year. The revenue of Internet advertising took 23 years to catch up with that of TV advertising, but mobile advertising took only 6 years to surpass computer-based advertising, doing so in 2016 [24
]. However, as shown in a Purdue University and Microsoft report [25
], the cost of using these free apps, which depended on ads for income, was the power consumption of the mobile phone and leakage of users’ personal information. The surprise finding was that up to 75% of mobile device electrical power was used for advertising services, or tracking and uploading the relevant information of the user. There were already a variety of proper solutions for the advertising problems caused by websites, which could be addressed by computers via browsers. However, the information security problems caused by the mobile app ads had not yet been completely solved.
The reason these mobile advertising security problems were so difficult to solve was that ad libs were embedded in host apps and compiled together into APK executable files. That is, the ad lib became a part of the entire app and could use all the permissions granted to the app. For example, an ad lib could claim to use only permissions P1 and P2, while the host app claimed permissions P3, P4 and P5. Once merged and compiled, the ad lib would be able to use all permissions, i.e., P1, P2, P3, P4 and P5. When installing this app, the Android system only informs the user that the app will use P1, P2, P3, P4 and P5; once the app is installed, the system does not distinguish between permission use by the host app and by the ad lib, and does not prevent the ad lib from using the privileges belonging to permissions P3, P4 and P5.
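The P1 to P5 example above can be sketched with simple set operations. This is only an illustration of the reasoning, not part of the original study; the permission labels are the placeholders from the text, and the function names are hypothetical.

```python
# Placeholder permission labels taken from the example in the text.
AD_LIB_DECLARED = {"P1", "P2"}
HOST_APP_DECLARED = {"P3", "P4", "P5"}


def merged_permissions(host, ad_lib):
    """Permissions requested by the compiled APK: the union of both sets."""
    return set(host) | set(ad_lib)


def excess_for_ad_lib(host, ad_lib):
    """Permissions the ad lib can use after merging but never claimed itself."""
    return merged_permissions(host, ad_lib) - set(ad_lib)


apk_permissions = merged_permissions(HOST_APP_DECLARED, AD_LIB_DECLARED)
ad_lib_excess = excess_for_ad_lib(HOST_APP_DECLARED, AD_LIB_DECLARED)

print(sorted(apk_permissions))  # what the user is told at install time
print(sorted(ad_lib_excess))    # what the ad lib gains without declaring it
```

The user only ever sees the merged set, so the second set (the permissions inherited from the host app) is invisible at install time.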
Figure 1 shows the flowchart of app advertising processing.
The AppBrain website [26
] listed the top five hundred ad networks around the world, of which Google’s AdMob was the most popular. According to the latest information released by this website in December 2017, 61.52% of all apps installed had the AdMob ad lib embedded. The second most popular network was Unity, with 18.73% of all apps installed having this ad lib embedded. Third was Chartboost, with 14.00% of installed apps using this ad lib. The less well-known ad networks may offer higher advertising profits to attract app developers, but their security risk is higher. Ad networks provide documentation on their official websites, but some collect more personal information than their permissions allow, and app developers are either not aware of it or do not mind, because of their desire for profit. This is because, if an ad lib cannot collect enough personal information to include in ad request messages, ad servers determine that they are unable to provide effective advertising images to potential customers, and will thus not reply to the ad request messages. This means that app developers lose financially on the app, as free app developers rely on ads being sent to the users’ phones for income.
According to Ruiz et al. [27
], as a result of a large increase in free apps, the reply rate to ad requests of the Top 40 ad networks was lower than 18%. Therefore, free app developers had to turn to lesser-known ad networks and/or embed multiple ad libs from different companies in a single app so as to increase the chance of receiving advertising pictures. The authors also collected more than 625,000 apps on these ad networks. Analysis showed that 34.88% of those apps had two or more ad libs embedded. A small number of apps had as many as 28 ad libs embedded.
Wei et al. [28
] found that even apps with a good reputation and flagged as normal by anti-virus software were likely to be connected to malicious websites during their operation. They combined static (decompiling and checking program code) with dynamic (running apps for two hours and clicking as many links as possible through the tools) methods to observe who an app would communicate with. They collected 13,500 normal and popular apps, and found that in the course of their execution, these apps were connected to 254,022 URLs. In addition, 1,260 known malicious apps were collected, and it was found that they were connected to 19,510 URLs in their execution processes. According to the check returns based on Web-Of-Trust (WOT) [29
] and VirusTotal [30
], the authors divided all the above URLs into four categories: good websites, low-reputation websites, bad websites and malicious websites. Of the normal and popular apps, 8.8% were connected to malicious websites, 15% to bad websites, and 73% to low-reputation websites. A total of 74% were connected to websites unsuitable for children. For the known malicious apps, the situation could be expected to be worse; otherwise, the authors found that their URL distribution was similar to that of normal apps. This paper revealed an important point: even thorough and effective anti-virus software cannot guarantee that a certain app is safe, because the problem may lie not in the app itself, but in the websites contacted during execution. If connected to malicious or bad websites, a normal app could cause serious damage.
The above authors [28
] also found that static decompilation of apps alone was not sufficient to examine all possible online URLs, because a website could reconnect to other URLs through HTTP redirect mechanisms. Such problems are more difficult to predict because of the ad libs embedded in the apps. In fact, online advertising companies could resell ad slots to other ad networks (usually lesser-known ones) through an ad exchange [31
] so as to maximize advertising profits. This increases the advertising security risk, as the website could connect to multiple URLs when, for example, a free online game app is executed. Aside from the game server(s), the site could connect to ad server(s), redirect or unnamed server(s) by an ad resale mechanism.
3. The Proposed Method
Unlike static analysis, dynamic analysis focuses on the behavior of program execution, by analyzing the behavior of an app in an emulator. In some cases, better results may be obtained by dynamic analysis because it is resistant to obfuscation tools. Some researchers have emphasized the importance of dynamic analysis [32
] for this reason. Since ad messages are carried in HTTP packets, an understanding of HTTP is necessary to analyze the behavior patterns of an ad lib, including the meaning of each field and the information contained in it, so that the data required in this research can be obtained.
In this study, an app was executed in an emulator and the packets of all network behavior were recorded, from which the packets related to advertising were picked using the proposed method. The tools used in this study were BlueStacks, tcpdump, ADB and self-created software. In the BlueStacks emulator, tcpdump was used to record network traffic from the virtual network adapter. The Android Debug Bridge (ADB) tool could directly access the Android emulator. Its “logcat” command produced the required log files and its “pull” command exported the packet files (PCAP format) from the virtual machine. Because the captured packets were extremely numerous and messy, a program was designed to filter the packets related to advertising. Figure 2
a shows a part of the proposed program, Figure 2
b shows an ad request message, and Figure 2
c shows an ad reply message.
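The filtering step described above might look roughly like the following sketch. It is not the authors' program (shown only in Figure 2a); it assumes the PCAP file has already been parsed into simple HTTP records, and the record fields, the host hints, the example hosts, and the mapping from content type to the IMG/HTML/JS labels are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class HttpRecord:
    """One HTTP exchange extracted from the capture (hypothetical layout)."""
    host: str
    uri: str
    content_type: str = ""


# Hypothetical substrings that mark a host as advertising-related.
AD_HOST_HINTS = ("ads.", "adserver.")


def is_ad_related(record, hints=AD_HOST_HINTS):
    """Return True when the record's host contains any of the hints."""
    host = record.host.lower()
    return any(hint in host for hint in hints)


def filter_ad_records(records, hints=AD_HOST_HINTS):
    """Keep only ad-related records, preserving capture order."""
    return [r for r in records if is_ad_related(r, hints)]


def action_type(record):
    """Map a reply's content type to an IMG, HTML or JS label (assumed mapping)."""
    ctype = record.content_type.lower()
    if ctype.startswith("image/"):
        return "IMG"
    if "javascript" in ctype:
        return "JS"
    return "HTML"


capture = [
    HttpRecord("game.example.com", "/level/1", "application/json"),
    HttpRecord("ads.example.net", "/request?id=1", "text/html"),
    HttpRecord("ADS.example.net", "/banner.png", "image/png"),
]
ad_records = filter_ad_records(capture)
actions = [action_type(r) for r in ad_records]
print(actions)
```

Keeping the records in capture order matters here, because the graph construction in the proposed method works on the ordered sequence of actions.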
The proposed graph-based method first identified the main behaviors of the ad lib, each of which was expressed by one vertex in the graph. All vertices were connected according to the proposed algorithm, and then an undirected graph was constructed to represent the network behavior of the ad lib. The PChome [37
] ad lib was taken as an example to illustrate as follows. Figure 3
b, was constructed using the algorithms, as shown in Figure 4
According to the algorithm in Figure 4
, the input array of actions is first scanned from left to right to find the first IMG, which is taken as the first IMG point (vertex) in the graph, called the MainImg in the algorithm. Each element of the input array to the left of the first IMG forms its own vertex, which is drawn on the right side of the MainImg and linked to the MainImg by an edge marked “url”. The elements to the right of the first IMG in the input array are then processed in turn. Each IMG to the right forms its own vertex, which is drawn on the left side of the MainImg and linked to the MainImg by an edge marked “cookies”. If an element is not an IMG (that is, it is HTML or JS), it forms one vertex, which is connected by an edge marked “url” to the vertex of the nearest IMG to its left in the input array.
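The construction rules just described can be sketched as follows. This is one reading of the prose description, not a reproduction of the algorithm in Figure 4; the function name, the vertex naming scheme and the sample input are hypothetical, and the left/right drawing positions are not modeled because they do not affect the structure of the undirected graph.

```python
def build_ad_graph(actions):
    """Build (vertices, edges) from an ordered list of "IMG"/"HTML"/"JS" actions.

    Each edge is a tuple (vertex_a, vertex_b, label); the graph is undirected,
    so the order of the two vertices carries no meaning.
    """
    if "IMG" not in actions:
        raise ValueError("the description assumes at least one IMG action")

    # Name vertices by type and position so that every element is distinct.
    vertices = [f"{kind}{i}" for i, kind in enumerate(actions)]
    main = actions.index("IMG")  # first IMG from the left: the MainImg
    edges = []

    # Elements before the MainImg link to it with an edge marked "url".
    for i in range(main):
        edges.append((vertices[i], vertices[main], "url"))

    # Elements after the MainImg: IMGs link to the MainImg with "cookies";
    # HTML/JS elements link to the nearest IMG on their left with "url".
    nearest_img = main
    for i in range(main + 1, len(actions)):
        if actions[i] == "IMG":
            edges.append((vertices[i], vertices[main], "cookies"))
            nearest_img = i
        else:
            edges.append((vertices[i], vertices[nearest_img], "url"))

    return vertices, edges


vertices, edges = build_ad_graph(["HTML", "JS", "IMG", "IMG", "HTML"])
for edge in edges:
    print(edge)
```

For the sample input, the two leading elements attach to the MainImg, the second IMG attaches to the MainImg with a “cookies” edge, and the trailing HTML attaches to that second IMG.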
4. Experiment Results
This section gives more examples to demonstrate the effectiveness of the proposed approach. In Figure 5
, on the left of each subgraph is the main ad behavior of an ad lib obtained from the network traffic of the emulator, and on the right side of each subgraph is the undirected graph based on the algorithm, as shown in Figure 4
Some of the advertising companies listed on the AppBrain website [26
] and their ad libs were chosen and processed according to the above process, with the results presented in Table 1
uses ad lib as the index, while Table 2
uses the graph as the index. In Table 2
, all graphs are categorized into different types, from A to P.
Using this formula, the corresponding values of different graphs could be obtained, and the corresponding values of some graphs are shown in Table 3
. If an unknown pattern of ad lib produced the advertising behavior shown in Figure 6
, the graph was generated as Figure 6
b. The value obtained by the suggested formula is 22, according to the numbers of different types of vertices. However, there was no matched graph value in Table 3
, which indicated that it was a newly found advertising behavior model. Therefore, the content of the ad packet needed to be further analyzed, and it was found that this was the behavior of Mydas ad lib, shown in Figure 6
c. Finally, the newly acquired information was added to Table 1
, Table 2
and Table 3
, in order to expand the content of known advertising patterns. The method proposed in this paper made it possible to more quickly classify the ad lib in an app. Some advertisers or app developers may deliberately hide the Host name. In this situation, the ad lib could still be classified by checking the ad behavior graph. If two or more ad libs shared the same graphs, the range of candidates was significantly reduced because of the classification.
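The lookup-and-extend workflow described above can be illustrated with a small sketch. The paper's graph-value formula is not reproduced in this text, so the sketch substitutes a plain tuple of vertex-type counts as the lookup key; this stand-in, the function names and the ad lib names are assumptions made for illustration only.

```python
from collections import Counter


def signature(actions):
    """Stand-in for the paper's graph value: counts of IMG, HTML and JS vertices."""
    counts = Counter(actions)
    return (counts["IMG"], counts["HTML"], counts["JS"])


def classify(actions, known_patterns):
    """Return the candidate ad libs whose stored signature matches, if any."""
    return list(known_patterns.get(signature(actions), []))


def register(actions, ad_lib_name, known_patterns):
    """Add a newly analyzed pattern so that later lookups can match it."""
    known_patterns.setdefault(signature(actions), []).append(ad_lib_name)


known = {}
register(["HTML", "IMG"], "libA", known)          # hypothetical ad lib names
register(["JS", "IMG", "IMG"], "libB", known)

unknown = ["HTML", "JS", "IMG", "HTML"]
print(classify(unknown, known))                   # no match: a new pattern
register(unknown, "libC", known)                  # added after manual analysis
print(classify(unknown, known))
```

When several ad libs share one key, the lookup returns all of them, which mirrors the point that a shared graph still narrows the range of candidates.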