Real-Time Filtering Non-Intentional Bid Request on Demand-Side Platform

Thi-Thanh-An Nguyen; Duy-An Ha; Wen-Yuan Zhu; Shyan-Ming Yuan

doi:10.3390/app122312228

¹ EECS International Graduate Programs, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan

² TenMax AD Tech Lab, Taipei 222, Taiwan

³ Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan

* Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(23), 12228; https://doi.org/10.3390/app122312228

Abstract

While real-time bidding brings huge profits to online businesses, it has also become a potential target for malicious activity. In real-time bidding, bid request traffic can be classified into two kinds: intentional and non-intentional. Intentional bid requests come from ordinary web users, while non-intentional bid requests come from abnormal web users. From the perspective of a demand-side platform (DSP), the budget of advertisers should be used as effectively as possible by limiting non-intentional traffic. Therefore, it is essential to classify and predict these two kinds of bid request traffic. In this research, we propose a real-time filtering bid requests (RFBR) model to predict whether an incoming bid request is intentional or non-intentional from the DSP’s viewpoint. Our model is built in three stages. In the first stage, we analyzed all potential attributes in the bid request schema and identified the relations between abnormal behaviors and these attributes; in the second stage, a classification model was built to classify normal and abnormal audiences using the extracted features and self-defined thresholds; in the third stage, an RFBR model was built to classify intentional and non-intentional bid requests. The experimental results show that our system can effectively classify incoming bid requests.

Keywords: online advertising; demand-side platform (DSP); real-time bidding; anomaly detection; data mining

1. Introduction

Online advertising, also known as internet advertising, is any form of promotion of brands or services on online platforms, used to increase sales and brand awareness without being restricted by time and space. It accounts for a major share of the revenue of most websites and mobile apps. According to a report from the Interactive Advertising Bureau (IAB), the annual revenue of online advertising surpassed 189 billion US dollars in 2021, an increase of 35.4% (roughly 50 billion US dollars) over the previous year. Real-time bidding (RTB) is a newer, highly effective form of online advertising that allows advertisers and publishers to trade advertising impressions through real-time auctions on an advertising exchange platform. With its growth, RTB has also become a lucrative target for fraudulent activities, which attempt to defraud advertising platforms for financial gain. Abnormal traffic generated by fraudulent activities can come from fraud in impressions, clicks, conversions, data events, etc. However, abnormal traffic comes not only from non-human users but also from crawlers and from real users who are diverted to websites or applications without their control or awareness. For these reasons, a study [1] has classified bid request traffic into two groups: (1) intentional bid requests, which come from ordinary web users; and (2) non-intentional bid requests, which come from abnormal web users. Non-intentional bid requests may harm the online advertising ecosystem. From the perspective of a DSP, this kind of traffic should be filtered out as much as possible so that the budget is used as effectively as possible. Moreover, the DSP serves advertisers by managing their advertising budget and delivering ads to the right audiences. Unfortunately, in many cases the advertisers do not interact with “real” users.
Previous works indicate that fraudulent activities are out of control in online advertising [1,2,3]. Therefore, classifying and predicting this kind of bid request traffic is very important and remains a challenge in both industry and academia. As a result, this research proposes an RFBR model to classify incoming bid requests. In particular, all potential attributes in the bid request schema are first analyzed to identify the relation between abnormal behaviors and these attributes. After abnormal behaviors are identified by extracting the features, a classification model is built from these features and corresponding thresholds to classify normal and abnormal audiences. Finally, an RFBR model is built to classify intentional and non-intentional bid requests.

Accordingly, this work makes the following contributions:

We discovered the attributes of bid requests to extract the features that are useful in detecting abnormal behaviors of audiences;
We built a model to label abnormal audiences by applying the extracted features and self-defined thresholds;
We proposed the RFBR model to classify incoming bid requests as intentional or non-intentional based on the labeled audiences.

2. Online Advertising and Real-Time Bidding

In this section, we present a brief overview of the online advertising ecosystem, real-time bidding, and DSP. Types of fraudulent activities in online advertising and the techniques used to counter fraud are also reviewed. Finally, some prior studies are described.

2.1. Online Advertising and Real-Time Bidding

Online advertising refers to any form of promotion of brands or services on online platforms, such as social media, websites, search engines, etc. The main objective of online advertising is to increase sales and brand awareness without being restricted by time and space. RTB is the part of online advertising that automates the buying and selling of advertising impressions. RTB is beneficial to both publishers and advertisers. On the advertiser side, RTB allows advertisers to buy more efficiently and display more relevant advertisements to users. On the publisher side, RTB allows publishers to increase their revenue and fill rates by opening a real-time auction. The whole RTB process happens in a very short time (less than one second), as described in Figure 1. The definitions of some important terms in online advertising and RTB are presented below:

Figure 1. Real-Time Bidding Process.

Advertiser: buys advertising space on the internet to promote their products and services;
Publisher: owns the online platforms with digital space to sell and show the advertisements;
Demand Side Platform (DSP): purchases ad impressions on publisher advertising inventory in an automated manner;
Supply Side Platform (SSP): sells advertising in an automated manner;
Ad Exchange: mediates advertising transactions between SSPs and DSPs;
Ad network: connects multiple platforms by matching advertisers to websites looking to host advertisements;
Agency: helps advertisers display advertisements in the most suitable place and optimize the budget;
Impression: the number of times advertisements have appeared on any device screen within the publisher’s network;
Click: a click event is created when users have clicked on an advertisement to reach online information about the advertiser’s products and services;
Click Through Rate (CTR): the ratio of clicks to impressions;
Conversion: a conversion event is generated when a user clicks on an advertisement and keeps on performing some activities on the site such as download, sign up, purchase of a product, etc.

2.2. Advertising Fraud in Regard to Ad Exchange

Advertising fraud is a principal concern for all marketers. Advertising fraud is any attempt to defraud advertising platforms for the purpose of financial gain. In particular, it is the practice of fraud in impressions, clicks, conversion, or data events and this will affect both advertisers and publishers from the ad exchange’s perspective. There are different methods by which fraudsters can carry out advertising fraud. The following are several common types of advertising fraud:

Botnet advertising fraud: causes a huge number of fake clicks, fake visits to websites, and fake traffic [4]. A botnet is an interconnected network of users’ machines on which malicious software or pieces of code are installed to perform automated tasks and execute fraudulent actions;
Click hijacking: steals a click by redirecting a click on one ad to a different, illegitimate ad. To execute a click hijacking attack, fraudsters compromise a user’s device, website, or proxy server;
Click spamming, click injection: a form of mobile advertising fraud which automatically generates a large number of fake clicks or impressions to steal an advertiser’s budget [4];
Ad stacking: one type of impression fraud with programmatic ad placements. In order to execute this attack, a fraudster stacks layers of multiple ads on top of each other to hide illegitimate ads under legitimate ads. In this way, the illegitimate ad also gains impressions, even though it is hidden behind another ad [5];
Ad injection: inserts illegitimate ads into legitimate ad space on a website without the publisher’s permission, via malicious browser extensions or malware. In this way, an illegitimate ad can replace the existing ad entirely or display itself on websites that are not supposed to contain ads.

2.3. Advertising Fraud Detection in Regard to Ad Exchange

Advertising fraud has been hurting online advertising at various levels. It is essential to have countermeasures to detect and prevent these kinds of frauds. Due to various forms of advertising fraud, various methods are used to deal with ad fraud in the context of an ad exchange [5].

Statistical-based anomaly detection: analyzes a large-sized bid request dataset and creates an algorithm to detect abnormal activity. This type of detection is used to identify fake traffic;
Anomaly-based detection: observes the unexpected increases in the number of impressions and CTR of the publisher by using the publisher’s history. This detection can be used to recognize unusual publishers [6];
Signature-based detection: a process by which static rules are defined to decide whether ad traffic should be considered abnormal or not. These static rules are conducted by looking for characteristics within a large amount of ad traffic to identify malicious behaviors. Signature-based detection is only useful in recognizing attacks that are already known; however, it cannot identify unknown attacks [6];
Website popularity and page ranking: checks the reliability of websites by comparing their actual amount of traffic and the suggested page ranking. There are some ranking websites such as Alexa or Compete [6].

3. Related Works

Recent works have shown increased interest in ad fraud detection and have proposed various solutions. Several approaches address different advertising fraud scenarios, including botnets [3,5], clicks and conversions [1,2,7,8,9,10], and RTB [3,10,11,12,13,14,15]. For handling bot traffic, Stone analyzed the traffic generated by malware to study the behavior of a bot sample and to identify the location of the bot’s server [6], and Pastor scored ad traffic to classify intentional and non-intentional ad traffic in real time [3]. For click and conversion fraud, Dave built a realistic-looking landing page to observe and measure click-spam across ten major ad networks and four types of ads [2], and Wang developed conditional random fields (CRF) to learn users’ click behaviors [8]. In the case of pay-per-view (PPV) and pay-per-click (PPC) networks, Springborn identified PPV networks and analyzed the traffic purchased from them for a set of honeypot websites, showing how such networks inflate advertising impressions on websites [9].

In addition, several studies have used machine learning techniques to build models that classify and predict fraudulent activities as well as user behaviors in online advertising [8,16,17,18]. Specifically, Cetintas and Wang proposed predicting user visits and viewability for online display advertising with models based on a probabilistic latent model [19,20]. Tian used a clustering method to develop a novel crowd fraud detection method for search engine advertising [5]. Similarly, Chapelle presented a machine learning framework based on logistic regression for modeling response prediction in display advertising [16]. However, these supervised and semi-supervised learning approaches do not appear to be effective in detecting fraud when applied to a real DSP bid request dataset in RTB.

From a methodological point of view, Pastor designed a system for the detection of invalid ad traffic [3]. That work classifies invalid traffic by computing the Shannon entropy of the distribution of bid requests across IP addresses for each domain. Stone, on the other hand, analyzed a dataset containing transactions for ingress and egress ad traffic from an ad network and detected fraudulent activities at the publisher level by determining dynamic thresholds on some features to flag anomalous ad traffic [6]. The research to date has tended to focus on detecting and classifying non-intentional traffic in the context of publishers rather than audiences. For this reason, it may miss non-intentional bid requests that do not come from a group of abnormal publishers. In this research, we approach the detection of fraudulent traffic from the perspective of the audience. This provides the ability to detect non-intentional bid requests not only from groups but also from individuals. Moreover, we offer a system that can feasibly be integrated into the RTB process to filter non-intentional bid requests.
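To make the entropy-based idea attributed to Pastor [3] more concrete, the following Python sketch computes the Shannon entropy of the bid request distribution across IP addresses for each domain. It only illustrates the quantity described above and is not the cited system; the `(domain, ip)` record layout, the function name, and the toy data are assumptions, and the sketch deliberately stops short of turning entropy values into a decision.

```python
import math
from collections import Counter, defaultdict


def ip_entropy_per_domain(records):
    """Shannon entropy (in bits) of the IP distribution of bid requests, per domain.

    records: iterable of (domain, ip) pairs (hypothetical layout).
    """
    per_domain = defaultdict(Counter)
    for domain, ip in records:
        per_domain[domain][ip] += 1

    entropy = {}
    for domain, ip_counts in per_domain.items():
        total = sum(ip_counts.values())
        # H = sum over IPs of p * log2(1 / p), where p = n / total.
        entropy[domain] = sum(
            (n / total) * math.log2(total / n) for n in ip_counts.values()
        )
    return entropy


# Toy data: one domain reached evenly from four IPs, one reached from a single IP.
sample = [("a.example", f"10.0.0.{i}") for i in range(4)]
sample += [("b.example", "10.0.0.9")] * 4
entropy = ip_entropy_per_domain(sample)
```

A domain whose requests are spread evenly over many IP addresses has high entropy, while a domain whose requests all come from one IP address has zero entropy.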

4. Research Methodology

In this section, we analyzed all potential attributes in bid requests to explore the abnormal access behaviors of audiences and publishers from the perspective of a DSP. Moreover, the dataset used in this research is also described.

4.1. Dataset Description

The dataset used in this research is a 1% sample of the full real-world dataset provided, without labels, by TenMax AD Tech Lab Co., LTD in Taiwan, covering 14 days (1–14 July 2018). A summary of our dataset is shown in Table 1. Each bid request contains user and publisher information such as time, audienceId, IP address, user agent, location, URL, host, domain, etc. The schema of the bid request is defined by OpenRTB [17].

Table 1. The basic statistics of dataset.

4.2. Building Features

Since access behavior is concerned with the important characteristics for recognizing an abnormal audience, we focus on analyzing the potential attributes to extract the features that represent the corresponding abnormal access behavior. To study the associated access behavior of a bid request, it can best be treated under three headings: identifier, measuring, and observation period.

Identifier: advertising fraud traffic can be identified under two categories of source: audience and publisher. Most prior works have focused on publishers, such as [1,7,9,11]; however, we can also discover suspicious audiences that are used to generate fraudulent advertising traffic. In particular, the associated identifiers of audiences include:

AudienceId: identifies the browser via third-party cookies. For browsers, the audienceId is renewed if the cookie is wiped;
(IP, UA): the combination of IP address (IP) and user agent (UA). IP is the IP address of the user’s machine. The user agent identifies the kind of browser, browser version, OS, and OS version. This combination identifies a user better than the IP address alone, since one IP address can be used by multiple users in some cases.

Similarly, associated identifiers of publisher comprise three attributes:

URL: the link accessed by the bid request;
domain: the domain of the URL;
publisherId: the unique ID assigned to each publisher.

Measuring: following previous works [9,12,14], we summarize three characteristics used to estimate and measure non-intentional access behavior.

Traffic volume: an extraordinarily large number of bid requests from a specific identifier indicates non-intentional access behavior;
Behavior frequency: an audience accessing a small number of publishers, or a publisher accessed by a small number of audiences, also implies non-intentional access behavior;
Behavior regularity: traffic from an audience or publisher that follows a regular pattern also suggests non-intentional behavior, for example, botnet behavior.

Observation period: the access behavior of non-intentional bid requests can be observed under four observation periods: second, minute, hour, and day.

In this research, we focus on analyzing and observing in detail the access behaviors of audiences with reference to audienceId and (IP, UA).

4.2.1. AudienceId and (IP, UA) Information

The distributions of bid requests in audienceId and (IP, UA) are shown in Figure 2. As can be seen from the figure, more than 99% of total audienceIds and (IP, UA) have a very small number of bid requests. To observe the access behavior of non-intentional audienceId and (IP, UA), we pay attention to analyzing the audienceIds and (IP, UA) which have very high numbers of bid request. In particular, we focus on studying the top 1000 audienceIds and (IP, UA) and analyzing some top audienceIds and (IP, UA) in detail.

Figure 2. The CDF graph of the number of bid requests of audienceId.

4.2.2. The Access Behavior by Day and Hour of AudienceId

Traffic volumes of top audienceIds and (IP, UA) by day and hour are shown in Figure 3. In our observation, some audienceIds and (IP, UA) accessed more than 700 times a day; this number is unusually large. Moreover, some audiences appeared continuously for 24 h a day; this also indicates an abnormal access behavior. In summary, two simple heuristics of normal and abnormal access behavior of audienceId and (IP, UA) based on day and hour are:

Figure 3. The number of bid requests of some top audienceIds and (IP, UA) by day and hour.

General trend: The number of bid requests for each audience in one day is not very high, and the audience shows up for only some hours of the day;
Abnormal behavior: Unnaturally large numbers of bid requests in one day. Appearing in almost all hours of the day.
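The two day-and-hour measurements above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `(audience, timestamp)` record layout, the function name, and the toy records are assumptions.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical bid-request records: (audience identifier, ISO timestamp).
requests = [
    ("a1", "2018-07-01T00:10:00"),
    ("a1", "2018-07-01T00:40:00"),
    ("a1", "2018-07-01T13:05:00"),
    ("a2", "2018-07-01T09:00:00"),
]


def daily_volume_and_hours(records):
    """Return {(audience, date): (bid request count, distinct active hours)}."""
    counts = defaultdict(int)
    hours = defaultdict(set)
    for audience, ts in records:
        t = datetime.fromisoformat(ts)
        key = (audience, t.date().isoformat())
        counts[key] += 1          # traffic volume for that day
        hours[key].add(t.hour)    # hours of the day in which the audience appeared
    return {key: (counts[key], len(hours[key])) for key in counts}


stats = daily_volume_and_hours(requests)
```

The same function applies to (IP, UA) if the first element of each record is the (IP, UA) pair instead of the audienceId.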

4.2.3. The Interval Time

Since our dataset is a 1% sample of the full dataset, it does not accurately reflect the interval times between a fraudster’s requests. In the sample, the average interval time is not short enough, and does not follow a regular enough pattern, for the interval time to be helpful in detecting abnormal behaviors of audienceIds and (IP, UA). In the full dataset, however, this is an important rule for detecting abnormal behaviors.

Because interval times are not effective in detecting abnormal behaviors in our sample, we use another time-based measure. Based on ordinary observations, legitimate audiences do not click on a link more than once or twice within one second. As shown in Figure 4, most audiences have only one or two requests within one second. For this reason, a high number of bid requests within one second is also considered abnormal behavior. As a result, two simple heuristics of normal and abnormal access behavior of audienceId and (IP, UA) based on the number of bid requests in one second are summarized below:

Figure 4. The number of bid requests of audienceId and (IP, UA) in one second.

General trend: an audience has one or two bid requests within one second;
Abnormal behavior: having more than two bid requests within one second.
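A minimal Python sketch of this per-second check is shown below. It assumes that "within one second" means a fixed calendar-second bucket rather than a sliding window, which the text does not specify; the record layout and names are likewise hypothetical.

```python
from collections import Counter


def peak_requests_per_second(records):
    """Return {identifier: largest number of bid requests in any one-second bucket}.

    records: iterable of (identifier, epoch_seconds) pairs (hypothetical layout).
    Timestamps are truncated to whole seconds, so buckets are calendar seconds.
    """
    buckets = Counter((ident, int(ts)) for ident, ts in records)
    peak = {}
    for (ident, _second), n in buckets.items():
        peak[ident] = max(peak.get(ident, 0), n)
    return peak


sample = [
    ("bot", 100.1), ("bot", 100.5), ("bot", 100.9),  # three requests in second 100
    ("human", 100.2), ("human", 101.3),              # one request per second
]
peaks = peak_requests_per_second(sample)
# More than two bid requests within one second is treated as abnormal.
suspicious = {ident for ident, n in peaks.items() if n > 2}
```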

4.2.4. URL Attributes

Advertising fraud can trick advertisers into paying more for advertising space on a website than they should, with methods such as domain spoofing, URL substitution, etc. We observed a notable relationship between the number of bid requests and the number of distinct URLs of audienceIds and (IP, UA), as shown in Figure 5. For most identifiers, the number of distinct URLs grows with the number of bid requests; however, there are some anomalous points with high numbers of bid requests but low numbers of distinct URLs. These points are considered abnormal audienceIds and (IP, UA), and whether the number of distinct URLs keeps pace with the number of bid requests is used as a rule to identify abnormal bid requests. In summary, two simple heuristics of normal and abnormal access behavior of audienceId and (IP, UA) based on URL attributes are:

Figure 5. The number of distinct URLs of top 1000 audienceId and (IP, UA).

General trend: the more bid requests of audienceId and (IP, UA), the greater the number of different URLs.
Abnormal behavior: having large numbers of bid requests but very few different URLs in audienceId and (IP, UA).
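The URL heuristic can be expressed as a ratio of distinct URLs to bid requests per identifier. The sketch below is only an illustration under assumed names and an assumed `(identifier, url)` record layout; the example URLs are made up.

```python
from collections import defaultdict


def distinct_url_ratio(records):
    """Return {identifier: number of distinct URLs / number of bid requests}.

    records: iterable of (identifier, url) pairs (hypothetical layout).
    """
    counts = defaultdict(int)
    urls = defaultdict(set)
    for ident, url in records:
        counts[ident] += 1
        urls[ident].add(url)
    return {ident: len(urls[ident]) / counts[ident] for ident in counts}


# Toy data: "bot" requests the same URL 40 times; "human" visits 4 different pages.
sample = [("bot", "https://site.example/page")] * 40
sample += [("human", f"https://site.example/article/{i}") for i in range(4)]
ratios = distinct_url_ratio(sample)
```

A ratio close to 1 means almost every bid request came from a different URL, while a ratio close to 0 means many bid requests came from very few URLs.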

4.3. Feature Extraction

The access behaviors of audience are discovered after the potential associated attributes were observed and preprocessed. Some conclusions of abnormal behavior of audienceId and (IP, UA) can be summarized as below:

AudienceId/(IP, UA) has a very high number of bid requests by day;
AudienceId/(IP, UA) appears in almost all hours of the day;
AudienceId/(IP, UA) has more than two bid requests within one second;
AudienceId/(IP, UA) has large numbers of bid requests but very few distinct URLs.

Given these points, the features that are useful in capturing properties of fraudulent traffic based on audiences are listed in Table 2. This table shows the features related to abnormal access behavior from the audience side, including audienceId and (IP, UA). We have a total of eight features from both audienceId and (IP, UA), represented by a vector $R = \{r_{1}, r_{2}, \dots, r_{8}\}$. Given these extracted rules, the identifiers, including audienceIds and (IP, UA), are combined with the associated historical bid requests and access logs from the previous days, and the access behavior statistics are built and stored offline.

Table 2. Feature Extraction.

5. Our Proposed Approach

In this section, we formulate the problem definition of detecting an incoming bid request and define the rules and the threshold to classify intentional and non-intentional bid requests.

5.1. Problem Definitions

The problem of classification of intentional and non-intentional bid requests in RTB can be formulated as the below definition:

Given a set of rules $r_{i} \in R$ and its corresponding threshold $t_{i} \in T$, first, a classification of audience is built by the function $f_{1}(\cdot)$, $y_{au} = f_{1}(r, t)$, where $y_{au} \in \{0, 1\}$; $y_{au} = 1$ flags abnormal audiences and 0 flags normal audiences. Finally, the incoming bid request is classified by building the function $y_{bid} = f_{2}(y_{au})$, where $y_{bid} \in \{0, 1\}$; $y_{bid} = 1$ denotes a non-intentional bid request and 0 denotes an intentional bid request.
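One way to write the two functions out explicitly is given below. It assumes, following Section 5.3, that an audience is abnormal as soon as any single rule meets its threshold, and, following Section 6.2, that a bid request is non-intentional if either of its identifiers is abnormal. The predicate "meets" is left abstract because the direction of the comparison depends on the rule, and the superscripts are notation introduced here only for illustration.

$$
y_{au} = f_{1}(r, t) =
\begin{cases}
1, & \text{if some rule value } r_{i} \text{ meets its threshold } t_{i}, \\
0, & \text{otherwise,}
\end{cases}
\qquad
y_{bid} = f_{2}(y_{au}) = \max\left(y_{au}^{\text{audienceId}},\; y_{au}^{(\text{IP, UA})}\right).
$$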

5.2. Rules

According to the feature extraction in Section 4.3, the following rules are used to detect abnormal audiences:

Rule 1 (the percentage of bid requests by day): Any audience generating a very high number of bid requests could be considered suspicious, and its percentage of the day’s bid requests is correspondingly higher than that of the others. The reason for this rule is that a fraud source such as a bot stays on one machine and generates a high number of bid requests;
Rule 2 (the number of appeared hours): Any audience appearing for many hours within one day would be considered suspicious. The reason for this rule is that a normal audience very rarely appears in most hours of one day;
Rule 3 (the number of bid requests within one second): Any audience generating a high number of bid requests within one second indicates suspicious activity. The reason for this rule is that a usual audience could not generate more than two bid requests within such a short period of time;
Rule 4 (distinct URLs per bid request): Any audience generating a large number of bid requests but very few distinct associated URLs is also considered fraudulent. The reason for this rule is that the more bid requests a normal audience generates, the more distinct URLs it has.

5.3. Threshold

A set of rules is not enough to classify audiences; a specific threshold for each rule is also required. If an audience meets the threshold of at least one rule, it is considered an abnormal audience; otherwise, it is considered a normal audience. Each threshold is applied to the rule’s value measured over a period of one day or one second and is estimated using statistical methods.

5.3.1. Threshold for the First Rule

The first rule is determined by counting the number of bid requests of each audienceId and (IP, UA) and calculating its percentage of the total bid requests in one day. Figure 6 shows that over 99% of audienceIds have a percentage of less than 0.03 of the total bid requests on that day. In our observation, these audienceIds have a very small number of bid requests, or a number that is not large enough to conclude that the audience is abnormal. However, the remaining top 1%, whose percentage is equal to or greater than 0.03, usually show anomalous behavior and a very high number of bid requests, so we conclude that they are abnormal audiences. Therefore, 0.03 was chosen as the threshold to identify abnormal audienceIds for the first rule.

Figure 6. The CDF graph for the percentage of top 1% of audienceIds in one day.

In the same way, 0.02 was chosen as the threshold to identify abnormal (IP, UA) for the first rule as shown in Figure 7.

Figure 7. The CDF graph for the percentage of top 1% of (IP, UA) in one day.
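As a rough illustration of the first rule, the sketch below computes each identifier's share of one day's bid requests and compares it with the audienceId threshold. It assumes the threshold 0.03 is expressed in percent, which the text suggests but does not state explicitly; the function name and the toy day are hypothetical.

```python
from collections import Counter


def daily_share_percent(identifiers):
    """Return {identifier: percentage of the day's total bid requests}.

    identifiers: one entry per bid request seen on a single day.
    """
    counts = Counter(identifiers)
    total = sum(counts.values())
    return {ident: 100.0 * n / total for ident, n in counts.items()}


# Assumption: the threshold is read as a percentage (0.03 for audienceId).
AUDIENCE_ID_SHARE_THRESHOLD = 0.03

# Toy day with 10,000 bid requests: "heavy" sends 5, every other audience sends 1.
day = ["heavy"] * 5 + [f"user{i}" for i in range(9995)]
shares = daily_share_percent(day)
flagged = {i for i, s in shares.items() if s >= AUDIENCE_ID_SHARE_THRESHOLD}
```

For (IP, UA), the same computation would be compared against 0.02 instead.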

5.3.2. Threshold for the Second Rule

The second rule is determined by counting the number of appeared hours of each audienceId and (IP, UA). It can be seen from the result in Figure 8 that very few audienceIds and (IP, UA) generated bid requests for more than five hours in one day. In reality, if a usual audienceId or (IP, UA) shows up for 5, 6, or 7 h in one day, it is reasonable. However, when an audienceId or (IP, UA) shows up for more than 20 h, this is unusual behavior and it can be explained by the behavior coming from non-humans such as naïve bots, etc. As a result, when an audienceId or (IP, UA) accesses for more than 20 h in one day, it can be flagged as an abnormal audience and 20 is chosen as the threshold of both audienceId and (IP, UA) for the second rule.

Figure 8. The number of appeared hours of audienceIds and (IP, UA).

5.3.3. Threshold for the Third Rule

The third rule is determined by counting the number of bid requests of each audienceId and (IP, UA) within one second. It is apparent from Figure 9 that very few audienceIds and (IP, UA) generated more than two bid requests within one second. A normal audienceId or (IP, UA) only generates one bid request within one second; some audiences will double click and create two bid requests within one second, but this can be explained. However, if an audienceId or (IP, UA) generates more than two bid requests within one second, this indicates fraud. For this reason, three was chosen as the threshold of both audienceId and (IP, UA) for the third rule.

Figure 9. The number of bid requests of audienceIds and (IP, UA) within one second.

5.3.4. Threshold for the Fourth Rule

The fourth rule is determined by dividing the number of distinct URLs by the number of bid requests of each audienceId and (IP, UA). Figure 10 shows a CDF graph of the distinct URLs per bid request. As can be seen from Figure 10, over 99% of audienceIds and (IP, UA) have a fraction greater than 0.05 each day. In our observation, these audienceIds and (IP, UA) have either a very small number of bid requests and distinct URLs, or a large number of bid requests and a large number of distinct URLs; this is not considered abnormal behavior. However, the remaining 1%, whose fraction is 0.05 or less, usually have a very large number of bid requests but access only a few distinct URLs; this is abnormal behavior, and we conclude that they are abnormal audiences. In this case, 0.05 was chosen as the threshold of both audienceId and (IP, UA) for the fourth rule.

Figure 10. CDF—#URL/#bid requests of top 1% of audienceId and (IP, UA).

5.3.5. Summary of Building Model

A summary of the data rules and the corresponding thresholds for detecting abnormal audiences is presented in Table 3. According to our observations, these thresholds produce a good result in that they filter out a large number of abnormal audienceIds and (IP, UA).

Table 3. Thresholds for detecting abnormal audiences.
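The thresholds stated in Sections 5.3.1 to 5.3.4 can be combined into a single check, sketched below in Python. The threshold values come from the text; everything else is an assumption made for illustration, including the names, the reading of the first threshold as a percentage, the exact boundary conditions (for example `>` versus `>=`), and the reading of Rule 4 as flagging a low ratio, since that rule targets many bid requests with few distinct URLs.

```python
from dataclasses import dataclass


@dataclass
class DailyStats:
    share_percent: float   # Rule 1: percentage of the day's bid requests
    active_hours: int      # Rule 2: number of hours in which the identifier appeared
    max_per_second: int    # Rule 3: peak bid requests within one second
    url_ratio: float       # Rule 4: distinct URLs / bid requests


# Threshold values as stated in the text, per identifier kind.
THRESHOLDS = {
    "audienceId": {"share": 0.03, "hours": 20, "per_second": 3, "url_ratio": 0.05},
    "ip_ua": {"share": 0.02, "hours": 20, "per_second": 3, "url_ratio": 0.05},
}


def violated_rules(stats, kind):
    """Return the list of rule numbers whose threshold is met (assumed boundaries)."""
    t = THRESHOLDS[kind]
    rules = []
    if stats.share_percent >= t["share"]:
        rules.append(1)
    if stats.active_hours > t["hours"]:
        rules.append(2)
    if stats.max_per_second >= t["per_second"]:
        rules.append(3)
    if stats.url_ratio <= t["url_ratio"]:
        rules.append(4)
    return rules


def is_abnormal(stats, kind):
    """An identifier is abnormal if at least one rule meets its threshold."""
    return bool(violated_rules(stats, kind))


normal = DailyStats(share_percent=0.001, active_hours=3, max_per_second=1, url_ratio=0.8)
bot = DailyStats(share_percent=0.05, active_hours=24, max_per_second=5, url_ratio=0.01)
borderline = DailyStats(share_percent=0.025, active_hours=3, max_per_second=1, url_ratio=0.8)
```

With these toy values, `borderline` is normal as an audienceId but abnormal as an (IP, UA), because only the first-rule threshold differs between the two identifier kinds.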

6. Model Building

In this section, we present how to build the models to classify audiences (denoted as audienceId and (IP, UA)) and the incoming bid requests. In particular, we introduce two main procedures. First, an audience classification is constructed to label normal and abnormal audiences. Next, an RFBR model is built to classify the incoming bid requests.

6.1. Audience Classification

The audience classification includes two main parts: audienceId classification and (IP, UA) classification. In the case of audienceId classification, it can be built through the following stages:

First, the data rules of audienceId are extracted from attributes of the dataset;
Second, the data rules of audienceId are compared with the corresponding thresholds. If the audienceId’s data rules meet one of the threshold conditions, the audienceId is labeled as abnormal; otherwise, the audienceId is labeled as normal.

After audienceId is labeled, the abnormal audienceIds are appended to AudienceId_BlackList for further processes through the steps below:

AudienceId_BlackList is created on the first day in the dataset;
AudienceId_BlackList at day $d$ is obtained by appending the abnormal audienceId derived at day $d - 1, d - 2, \dots$ ;

AudienceId_BlackList also contains the last appearance of audienceId. If an audienceId does not appear within 60 days, it is removed from AudienceId_BlackList.

The process of building (IP, UA) classification is similar to the process of audienceId classification.
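The blacklist maintenance described above can be sketched as a small Python class. This is a hypothetical structure, not the authors' code: the class and method names are invented, and the handling of an entry last seen exactly 60 days ago is an assumption.

```python
from datetime import date, timedelta

EXPIRY_DAYS = 60  # entries not seen within this many days are removed (from the text)


class BlackList:
    """Maps each blacklisted identifier to the date of its last appearance."""

    def __init__(self):
        self.last_seen = {}

    def append_abnormal(self, identifiers, day):
        """Add the identifiers labeled abnormal on the given day."""
        for ident in identifiers:
            self.last_seen[ident] = day

    def touch(self, ident, day):
        """Record that an already blacklisted identifier appeared again."""
        if ident in self.last_seen:
            self.last_seen[ident] = day

    def expire(self, today):
        """Drop identifiers whose last appearance is older than EXPIRY_DAYS."""
        cutoff = today - timedelta(days=EXPIRY_DAYS)
        self.last_seen = {i: d for i, d in self.last_seen.items() if d >= cutoff}

    def __contains__(self, ident):
        return ident in self.last_seen


blacklist = BlackList()
blacklist.append_abnormal({"a1", "a2"}, date(2018, 7, 1))
blacklist.touch("a1", date(2018, 8, 15))   # a1 shows up again in August
blacklist.expire(date(2018, 9, 15))        # a2 has not been seen for over 60 days
```

The same class can back both AudienceId_BlackList and IPUA_BlackList, with (IP, UA) tuples as keys in the second case.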

6.2. Real-Time Bid Request Filtering Process

The bid requests are labeled in real time based on AudienceId_BlackList and IPUA_BlackList. Figure 11 illustrates our proposed RFBR model and how it could be integrated with the bidding process in the DSP. Our RFBR process is integrated as a plug-in program in the pre-bidding stage of the real-time process on the DSP side. The main purpose of the RFBR process is to classify bid requests as intentional or non-intentional. Specifically, a bid request on day $d$ that contains information about audienceId, IP address, user agent, URL, etc., is checked against the audienceIds and (IP, UA) in the blacklists of day $(d - 1)$. If it matches, it is labeled as non-intentional; otherwise, it is labeled as intentional. In particular, our real-time bid request labeling method goes through the following steps:

Figure 11. Real-Time Filtering Bid Requests.

Step 1: First, the DSP receives a bid request with the audience and publisher information (audienceId, IP address, user agent, country, URL, publisherId, etc.) as the input of the RFBR process;
Step 2: Second, the audienceId and (IP, UA) information of the bid request are checked against AudienceId_BlackList and IPUA_BlackList, respectively. If either matches, the bid request is labeled as non-intentional; otherwise, it is labeled as intentional. This label is the output of the RFBR process;
Step 3: The label of the bid request is written to the request reply. If the bid request is non-intentional, it is dropped; otherwise, it is sent to the auction process.
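The three steps above can be sketched as a small pre-bidding filter in Python. This is a minimal illustration only: the bid request is modeled as a plain dictionary, the field names `audienceId`, `ip`, and `userAgent` are hypothetical rather than taken from the OpenRTB schema, and the blacklists are modeled as sets.

```python
def label_bid_request(bid_request, audience_blacklist, ipua_blacklist):
    """Return 1 (non-intentional) if either identifier is blacklisted, else 0."""
    audience_id = bid_request.get("audienceId")
    ip_ua = (bid_request.get("ip"), bid_request.get("userAgent"))
    if audience_id in audience_blacklist or ip_ua in ipua_blacklist:
        return 1
    return 0


def filter_bid_request(bid_request, audience_blacklist, ipua_blacklist):
    """Drop non-intentional bid requests; pass intentional ones on to the auction.

    Returns None for a dropped request, otherwise the request with its label attached.
    """
    label = label_bid_request(bid_request, audience_blacklist, ipua_blacklist)
    if label == 1:
        return None
    return {**bid_request, "label": label}


# Blacklists of day d - 1 (toy contents).
audience_blacklist = {"aud-123"}
ipua_blacklist = {("203.0.113.7", "ExampleBot/1.0")}

clean = {"audienceId": "aud-999", "ip": "198.51.100.4", "userAgent": "Mozilla/5.0"}
bad_audience = {"audienceId": "aud-123", "ip": "198.51.100.4", "userAgent": "Mozilla/5.0"}
bad_ipua = {"audienceId": "aud-555", "ip": "203.0.113.7", "userAgent": "ExampleBot/1.0"}

forwarded = filter_bid_request(clean, audience_blacklist, ipua_blacklist)
dropped_1 = filter_bid_request(bad_audience, audience_blacklist, ipua_blacklist)
dropped_2 = filter_bid_request(bad_ipua, audience_blacklist, ipua_blacklist)
```

Because each check is a set membership test, the per-request cost stays constant regardless of blacklist size, which matters for a filter that sits in front of a sub-second auction.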

7. Results and Evaluations

In this section, we deploy our proposed approach on the real sample dataset. In particular, the audienceIds and (IP, UA) are labeled day by day, and the abnormal ones are stored in the AudienceId_BlackList and IPUA_BlackList, respectively. Then, the results of real-time bid request labeling are obtained. At the end of this section, we also evaluate the efficiency of our proposed system.

7.1. Results of Classifying Audience

In the first model of audience classification, after processing fourteen days of the real sample dataset, the results of labeling audienceIds and (IP, UA) are shown in the first two graphs of Figure 12. In our observation, the four rules are effective in labeling abnormal audienceIds and (IP, UA). The average number of audienceIds labeled abnormal each day is 6 via the first rule, 6 via the second rule, 70 via the third rule, 18 via the fourth rule, and 92 via all rules combined. In the same way, the average number of (IP, UA) labeled abnormal each day is 10 via the first rule, 7 via the second rule, 88 via the third rule, 19 via the fourth rule, and 117 via all rules combined. In general, the third rule labels a greater number of audienceIds and (IP, UA) than the other rules.

Figure 12. Results of labeling abnormal (IP, UA) and audienceIds by four rules; AudienceId_BlackList and IPUA_BlackList in each day.

Correspondingly, the sizes of AudienceId_BlackList and IPUA_BlackList on each day are shown in the last graph of Figure 12. On average, 80 abnormal audienceIds and 102 abnormal (IP, UA) are appended each day to AudienceId_BlackList and IPUA_BlackList, respectively. On the last day of our dataset, the two blacklists contain 1046 audienceIds and 1353 (IP, UA).

7.2. Results of Real-Time Filtering Bid Requests Process

After classifying audienceId and (IP, UA), the proposed RFBR system is also implemented. In particular, the first day of our RFBR process starts from the second day of the real sample bid request dataset and it continues until the last day (2–14 July 2018). The first day (1 July 2018) is used to generate and append the abnormal audienceIds and (IP, UA) to AudienceId_BlackList and IPUA_BlackList.

The results obtained from the real-time bid request labeling process are shown in the first two graphs of Figure 13. The average number of bid requests labeled non-intentional in real time is 3539, representing 0.4% of the total number of bid requests per day. This number is not high, but we consider it a good result since the dataset is only a sample and some rules could not be extracted from this kind of dataset. Moreover, the percentage of bid requests labeled non-intentional on the first three days is not high; however, as more and more abnormal audienceIds and (IP, UA) are appended to the AudienceId_BlackList and IPUA_BlackList, the percentage of bid requests labeled non-intentional also increases in the following days. In particular, the numbers of bid requests labeled non-intentional on the first three days are 3012 (0.339%), 2305 (0.258%), and 2208 (0.245%), respectively, of the total number of bid requests on each day. Over the next three days, the number of labeled bid requests increased steadily. However, the number of labeled bid requests on the ninth day fell again compared to the previous days; this can be explained by the fact that far fewer abnormal audienceIds and (IP, UA) were observed on that day than on other days.

Figure 13. (1) Percentage of non-intentional bid requests in RTB process; (2) the number of non-intentional bid requests by day; and (3) the number of distinct URLs extracted from non-intentional bid requests.
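As a rough consistency check on the reported 0.4%, the share can be recomputed from the totals in Table 1. The sketch below assumes that the 12,490,828 bid requests in Table 1 cover the same fourteen days and are spread evenly across them, which the text does not state explicitly; the variable names are illustrative.

```python
# Figures taken from Table 1 and from the paragraph above.
total_bid_requests = 12_490_828   # Table 1, assumed to cover all 14 days
days_in_dataset = 14
avg_flagged_per_day = 3539        # average non-intentional bid requests per day

# Assumption: traffic is spread evenly across the days of the dataset.
avg_daily_requests = total_bid_requests / days_in_dataset
share_percent = 100 * avg_flagged_per_day / avg_daily_requests

print(f"average daily bid requests: {avg_daily_requests:,.0f}")
print(f"non-intentional share: {share_percent:.2f}%")
```

Under these assumptions the share comes out slightly below 0.4%, which agrees with the rounded figure reported above.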

Together with the bid requests labeled non-intentional in the real-time process, we also extract the URLs and domains from these non-intentional bid requests to observe and evaluate. The average number of distinct extracted URLs per day is 1411, and after thirteen days of the process, we obtained 17,134 distinct URLs and 180 distinct domains. These results of extracting URLs and domains from non-intentional bid requests show that we have obtained a large number of distinct URLs but a small number of domains.
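The extraction of distinct URLs and domains can be sketched with Python's standard `urllib.parse` module. This is an illustrative sketch only: the record fields `label` and `url` and the function name are hypothetical, and the URL's hostname is used as a stand-in for the domain, whereas grouping by registrable domain would additionally need a public suffix list.

```python
from urllib.parse import urlsplit


def distinct_urls_and_domains(bid_requests):
    """Collect distinct URLs and hostnames from non-intentional bid requests."""
    urls, domains = set(), set()
    for request in bid_requests:
        if request.get("label") != "non-intentional":
            continue  # only flagged traffic is examined
        url = request.get("url")
        if not url:
            continue
        urls.add(url)
        host = urlsplit(url).hostname  # None when the URL has no network location
        if host:
            domains.add(host)
    return urls, domains


# Made-up records, for illustration only.
sample = [
    {"label": "non-intentional", "url": "http://ads.example.com/a?x=1"},
    {"label": "non-intentional", "url": "http://ads.example.com/b"},
    {"label": "intentional", "url": "http://news.example.org/story"},
]
urls, domains = distinct_urls_and_domains(sample)
```

In this sample, two distinct URLs map to a single hostname, which mirrors the pattern reported above of many URLs concentrated on few domains.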

7.3. Evaluation

It is challenging to evaluate the efficiency of our analysis and RFBR model since our experiment uses a large unlabeled dataset. To the best of our knowledge, there is no specific method available for evaluating fraud, invalid traffic, etc. in ad networks [3,6]. Additionally, from the results of labeling non-intentional bid requests, we can derive a number of URLs and domains, and this information is one of the potential factors in evaluating the performance of our approach. Given these points, and following the evaluation method from [6], we first establish a ground truth about trusted and untrusted domains and then apply this ground truth to evaluate the extracted domains.

7.3.1. Ground Truth

To distinguish between trusted and untrusted sites, we construct the ground truth of untrusted sites from observing fraudulent sites and referencing [6] with the following considerations:

The site has no content, or content is hidden by ad banners;
The content of the site is full of ads or the ad content is more than site content;
The site has content which is stolen or overlapped from other sites;
A group of sites have exactly the same design and content;
The site contains content that appears to be illegal or makes no sense.

On the other hand, trusted sites can be recognized by the following considerations:

The site has a good ranking from similarweb (LTD, n.d.).
The site is well-designed and contains useful or trusted content.
The site has good interactions with users such as comments, reposts, likes, etc.

7.3.2. Evaluation

In this part, we evaluate the efficiency of our proposed method by analyzing the domains extracted from non-intentional bid requests. To be considered either trusted or untrusted, a domain has to match at least one of the considerations above. After thirteen days of the real-time process (2–14 July 2018), we collected a total of 180 distinct domains, which we examined in this section to determine which are trusted and which are untrusted. During our observation, we found the following patterns among untrusted domains:

Several domains or sites of the domain appear regular but receive a very large number of bid requests from only one or two audiences, and more than one or two months later, these sites had stopped working;
Several domains or sites of the domain hold entirely ad-based content as shown in Figure 14;

Figure 14. Untrusted domains contain entirely ad-based content.
Several domains or sites of the domain have the same site frames and hold entirely ad-based content as shown in Figure 15;

Figure 15. This is a figure. Schemes follow the same formatting.

Several domains or sites of the domain are blank;
Several domains or sites of the domain appear to be regular domains; however, clicking on any topic on the site redirects to irregular ads;
Several domains or sites of the domain contain illegal content;
Several domains or sites of the domain have an unusually high number of ads; ads make up a larger share of the page than user content.

Additionally, Table 4 shows the results of classifying trusted and untrusted domains. Of the bid requests labeled non-intentional, 7% come from trusted domains, 76% come from untrusted domains, and 17% come from unidentified domains (listed as “Not defined” in Table 4). Unidentified domains are those that appear to be regular domains but contain several ads and have a site ranking in similarweb for the Taiwan area of more than 1000. These preliminary results on trusted and untrusted domains suggest that our proposed RFBR method is good at capturing non-intentional bid requests in a real-time process.

Table 4. The result of classifying trusted and untrusted domains.
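The percentages in Table 4 follow directly from the bid request counts per category. The short sketch below recomputes them from the counts reported in the table; the variable names are illustrative.

```python
# Bid request counts per domain category, as reported in Table 4.
bid_requests_by_category = {
    "Untrusted": 34_925,
    "Trusted": 3_264,
    "Not defined": 7_818,
}

total = sum(bid_requests_by_category.values())

# Share of non-intentional bid requests per category, rounded to whole percent.
shares = {
    category: round(100 * count / total)
    for category, count in bid_requests_by_category.items()
}

for category, share in shares.items():
    print(f"{category}: {share}%")
```

The counts sum to the 46,007 bid requests given in the last row of Table 4, and the rounded shares match the 76%, 7%, and 17% quoted in the text.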

7.3.3. Our Proposed RFBR Approach’s Applicability

To be applicable when integrated with the pre-bidding process on the DSP side, our proposed RFBR approach needs to meet the following requirements of DSPs and advertisers:

Scalability: In reality, a DSP has to handle more than eighty million bid requests per day, which means that it processes almost one thousand bid requests per second. Therefore, the computational requirement of the plug-in program must be as low as possible. Since our proposed RFBR approach extracts only four features and labels the audienceId and (IP, UA) of the previous day with fixed thresholds, it does not need high computational capacity. Therefore, integrating our proposed method does not require heavy computation or configuration, and it is straightforward to deploy in a real system;
Delay: The delay introduced by our approach adds to the overall delay of the DSP; therefore, it must be minimized. In our approach, the bid request is filtered by a lookup in the already available AudienceId_BlackList and IPUA_BlackList; thus, this filtering does not noticeably affect the overall delay of the system.
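The throughput figure in the scalability requirement can be checked with a short calculation. It assumes, for illustration, that the eighty million daily bid requests arrive evenly over the day; real traffic is burstier, so peak rates would be higher.

```latex
\[
\frac{80{,}000{,}000 \ \text{bid requests}}{24 \times 3600 \ \text{s}}
= \frac{80{,}000{,}000}{86{,}400 \ \text{s}}
\approx 926 \ \text{bid requests per second}
\]
```

At this average rate, a filter that handled requests strictly one at a time would have roughly $1/926 \approx 1.08$ ms per request, which is why a simple blacklist lookup is attractive in the pre-bidding stage.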

7.3.4. The Proposed Approach’s Limitations

Despite its advantages and capabilities, our proposed approach still has some limitations that we need to consider and overcome:

Thresholds are defined by an estimation analysis method; thus, some abnormal audienceIds and (IP, UA) are missed;
Since rules are predefined, we can only detect frauds that follow these rules; frauds that do not follow them go undetected.

8. Conclusions and Future Works

In this research, we present an RFBR model that serves as a plug-in program in the pre-bidding stage of the real-time process on the DSP side to detect non-intentional bid requests. The results show that our proposed approach is effective in detecting non-intentional bid requests since it can filter a large number of abnormal audienceIds and (IP, UA) as well as non-intentional bid requests. Furthermore, integrating our proposed approach into the DSP process is simple and feasible without affecting the delay and scalability of the DSP process.

Given the effectiveness of the RFBR model on the sample dataset, in future research we will extend our approach to the full real dataset and simulate the execution of the RTB process on the DSP side. We will also address the limitations of our approach mentioned above. Since our system can be observed and evaluated through untrusted sites and domains, this is also a potential method for detecting and classifying fraudulent traffic. Moreover, the development of deep learning opens up many methods for classifying fraudulent traffic in online advertising.

Author Contributions

T.-T.-A.N., D.-A.H., W.-Y.Z. and S.-M.Y. researched and designed this topic. T.-T.-A.N., D.-A.H. and W.-Y.Z. preprocessed the dataset and built features. T.-T.-A.N. and D.-A.H. worked on problem definition, identified data rules and thresholds, and labeled the audience. T.-T.-A.N. designed the model, implemented experiments, and evaluated the results. T.-T.-A.N. and S.-M.Y. organized and wrote the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ministry of Science and Technology of Taiwan grant number 111-2410-H-A49-070-MY2.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Acknowledgments

This study is supported by the study of TenMax AD Tech Lab Co., LTD and our lab (Distributed Computing System Laboratory, National Chiao Tung University, Taiwan): “Identifying Non-Intentional Ad Traffic on the Demand-Side in Display Advertising” [18].

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Stitelman, O.; Perlich, C.; Dalessandro, B.; Hook, R.; Raeder, T.; Provost, F. Using Co-Visitation Networks for Detecting Large Scale Online Display Advertising Exchange Fraud. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining—KDD’13, Chicago, IL, USA, 11–14 August 2013.
2. Dave, V.; Guha, S.; Zhang, Y. Measuring and fingerprinting click-spam in ad networks. ACM SIGCOMM Comput. Commun. Rev. 2012, 42, 175.
3. Pastor, A.; Parssinen, M.; Callejo, P.; Vallina, P.; Cuevas, R.; Cuevas, A.; Kotila, M.; Azcorra, A. Nameles: An Intelligent System for Real-Time Filtering of Invalid ad Traffic. In Proceedings of the Web Conference 2019—Proceedings of the World Wide Web Conference WWW, San Francisco, CA, USA, 13–17 May 2019.
4. Daswani, N.; Mysen, C.; Rao, V.; Weis, S.; Gharachorloo, K.; Ghosemajumder, S. Online Advertising Fraud. Crimeware Underst. New Attacks Def. 2008, 40, 1–28.
5. Tian, T.; Zhu, J.; Xia, F.; Zhuang, X.; Zhang, T. Crowd Fraud Detection in Internet Advertising. In Proceedings of the 24th International Conference on World Wide Web—WWW’15, Florence, Italy, 18–22 May 2015.
6. Stone-Gross, B.; Stevens, R.; Zarras, A.; Kemmerer, R.; Kruegel, C.; Vigna, G. Understanding Fraudulent Activities in Online Ad Exchanges. In Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference—IMC’11, Berlin, Germany, 2–4 November 2011.
7. Oentaryo, R.; Lim, E.P.; Finegold, M.; Lo, D.; Zhu, F.; Phua, C.; Cheu, E.Y.; Yap, G.E.; Sim, K.; Nguyen, M.N.; et al. Detecting Click Fraud in Online Advertising: A Data Mining Approach. J. Mach. Learn. Res. 2014, 15, 99–140.
8. Wang, C.J.; Chen, H.H. Learning User Behaviors for Advertisements Click Prediction; SIGIR 2011 Workshop Internet Advertising; National Taiwan University: Taipei City, Taiwan, 2011.
9. Springborn, K.; Barford, P. Impression Fraud in Online Advertising via Pay-Per-View Networks. In Proceedings of the 22nd USENIX Conference on Security, Washington, DC, USA, 14–16 August 2013.
10. Soldo, F.; Metwally, A. Traffic Anomaly Detection Based on the IP Size Distribution. In Proceedings of the IEEE INFOCOM, Orlando, FL, USA, 25–30 March 2012.
11. Ahmad, S.; Purdy, S. Real-Time Anomaly Detection for Streaming Analytics. arXiv 2016, arXiv:1607.02480.
12. Way, H. Real-Time Bidding: The Online Ad Exchange. Park. Assoc. 2012. Available online: https://www.parksassociates.com/bento/shop/samples/Parks%20Assoc%20Real-time%20Bidding%20The%20Online%20Ad%20Exchange.pdf (accessed on 17 September 2021).
13. Yuan, S.; Wang, J.; Zhao, X. Real-time Bidding for Online Advertising: Measurement and Analysis. In Proceedings of the Seventh International Workshop on Data Mining for Online Advertising ADKDD, Chicago, IL, USA, 11 August 2013.
14. Wang, J.; Zhang, W.; Yuan, S. Display Advertising with Real-Time Bidding (RTB) and Behavioural Targeting. arXiv 2017, arXiv:1610.03013.
15. Wismans, L.; Romph, E.; Friso, K.; Zantema, K. Real Time Traffic Models, Decision Support for Traffic Management. Procedia Environ. Sci. 2014, 22, 220–235.
16. Chapelle, O.; Manavoglu, C.E.; Rosales, M.R. Simple and Scalable Response Prediction for Display Advertising. ACM Trans. Intell. Syst. Technol. 2014, 5, 61.
17. IAB Tech Lab. Available online: https://www.iab.com/guidelines/real-time-bid (accessed on 5 March 2021).
18. Ha, D.A.; Nguyen, T.T.A.; Zhu, W.Y.; Yuan, S.M. Identifying Non-Intentional Ad Traffic on the Demand-Side in Display Advertising. In Proceedings of the 2021 International Conference on Technologies and Applications of Artificial Intelligence (TAAI), Taichung, Taiwan, 18–20 November 2021.
19. Cetintas, S.; Chen, D.; Si, L. Forecasting user visits for online display advertising. Inf. Retr. 2013, 16, 369–390.
20. Wang, C.; Kalra, A.; Borcea, C.; Chen, Y. Viewability Prediction for Online Display Ads. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management—CIKM’15, Melbourne, Australia, 18–23 October 2015.

Figure 1. Real-Time Bidding Process.

Figure 2. The CDF graph of the number of bid requests of audienceId.

Figure 3. The number of bid requests of some top audienceIds and (IP, UA) by day and hour.

Figure 4. The number of bid requests of audienceId and (IP, UA) in one second.

Figure 5. The number of distinct URLs of top 1000 audienceId and (IP, UA).

Figure 6. The CDF graph for the percentage of top 1% of audienceIds in one day.

Figure 7. The CDF graph for the percentage of top 1% of (IP, UA) in one day.

Figure 8. The number of appeared hours of audienceIds and (IP, UA).

Figure 9. The number of bid requests of audienceIds and (IP, UA) within one second.

Figure 10. CDF—#URL/#bid requests of top 1% of audienceId and (IP, UA).

Table 1. The basic statistics of dataset.

Category	Count
bid request	12,490,828
bid response	6,350,292
audienceId	9,107,971
(IP address, user agent)	12,411,316
click	4187

Table 2. Feature Extraction.

Category	Identifier	Time Duration
The number of bid requests	AudienceId, (IP, UA)	Daily
The number of appeared hours	AudienceId, (IP, UA)	Hours in a day
The number of bid requests within one second	AudienceId, (IP, UA)	By Second
The number of distinct URLs and the number of bid requests	AudienceId, (IP, UA)	Daily

Table 3. Thresholds for detecting abnormal audiences.

Rules	Threshold of AudienceId	Threshold of (IP, UA)
The percent of bid requests by day	0.03	0.02
The number of appeared hours	20	20
Bid requests within one second	3	3
Distinct URLs per bid request	0.05	0.05

Table 4. The result of classifying trusted and untrusted domains.

Domain	# Domain	# Bid Request	%
Untrusted	89	34,925	76%
Trusted	38	3264	7%
Not defined	53	7818	17%
All	180	46,007	100%

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
