Improving Fraudster Detection in Online Auctions by Using Neighbor-Driven Attributes

Abstract: Online auction websites use a simple reputation system to help their users evaluate the trustworthiness of sellers and buyers. However, fraudulent users can easily deceive such a reputation system by creating fake transactions to improve their reputation. This inflated-reputation fraud poses a major problem for online auction websites because it can lead legitimate users into scams. Numerous approaches have been proposed in the literature to address this problem, most of which use social network analysis (SNA) to derive critical features (e.g., k-core, center weight, and neighbor diversity) for distinguishing fraudsters from legitimate users. This paper discusses the limitations of these SNA features and proposes a class of SNA features referred to as neighbor-driven attributes (NDAs). The NDAs of users are calculated from the features of their neighbors. Because fraudsters require collusive neighbors to provide them with positive ratings in the reputation system, NDAs can be helpful for detecting fraudsters. Although the idea of NDAs is not entirely new, experimental results on a real-world dataset show that using NDAs improves classification accuracy compared with state-of-the-art methods that use the k-core, center weight, and neighbor diversity.


Introduction
Online shopping has recently become a major part of people's lifestyles. It allows people to directly buy or sell goods or services over the Internet by using a web browser. Additionally, it offers various methods for buying and selling goods or services. For example, on shopping websites such as Amazon.com and Tmall, sellers can list goods for sale only at a fixed price, whereas on other shopping websites such as eBay, Ruten, and Taobao, sellers can list goods for sale either at a fixed price or through an auction.
In online shopping, a consumer typically lacks the first-hand experience with both the merchandise and the seller that is expected in a physical store. To compensate for this, most online shopping websites employ a reputation system or a review system that collects users' feedback on their shopping experience to assist potential buyers in selecting suitable merchandise and trustworthy sellers. For example, Amazon.com and Rakuten (www.rakuten.co.jp) employ a unidirectional reputation system in which only buyers can rate sellers and merchandise, but not vice versa. By contrast, some shopping websites, such as eBay, Ruten (www.ruten.com.tw), Taobao (www.taobao.com), and Tmall (www.tmall.com), use a bidirectional reputation system in which the buyer and the seller in a transaction can rate each other. The reputation system allots a high reputation score to users who have received many positive ratings and few or no negative ratings. Houser and Wooders [1] indicated that seller reputation has an economically and statistically significant effect on the price of merchandise in an online auction. For example, a buyer is often willing to purchase the same merchandise at a higher price from a more reputable seller than from a less reputable seller. Therefore, maintaining a high reputation score is crucial for an online seller to attract sales and gain a price advantage.
Due to the lucrative opportunity associated with reputation scores, many fraudulent users attempt to tamper with the reputation system to obtain a high reputation score. Typically, a group of collusive users create fake transactions within the group for low-price merchandise and give each other positive ratings [2]. Such a fraudulent scheme, referred to as "inflated-reputation fraud" [3], has become common and can damage the trustworthiness of the reputation system if appropriate measures are not taken.
Numerous approaches have been proposed in the literature to detect inflated-reputation fraud [3-9]. Most approaches build a social network of users based on their transaction history, in which nodes and links represent users and transactions, respectively. Social network analysis (SNA) is then applied to derive effective features for fraudster detection. Fraudsters improve their reputation scores by relying on collusive users to create many fake transactions, thus forming a cohesive group in the social network. Therefore, several studies have focused on detecting cohesive groups in social networks [3-6]. By contrast, a recent study [7] suggested that the neighbors of a fraudster in a social network exhibit similar behavior, leading to low neighbor diversity; thus, the concept of neighbor diversity was proposed for fraudster detection. Although neighbor diversity has been shown to outperform previous approaches, it often falsely identifies legitimate users with considerably low neighbor diversity as fraudsters.
To improve the neighbor diversity approach, this paper proposes the concept of neighbor-driven attributes (NDAs) for fraudster detection. In brief, an NDA is an attribute of a node that is calculated from a user-selected feature of the node's neighbors. We present several NDAs that are calculated as the maximum or mean of neighbors' feature values, based on features such as the k-core value, the number of received ratings, and the number of canceled transactions [3,5,7]. The objective of this study is to extend neighbor diversity to NDAs so that the shortcomings of neighbor diversity are avoided and the performance of fraudster detection is improved. Experimental results on a real-world dataset show that these NDAs improve classification accuracy compared with state-of-the-art methods that use SNA features such as the k-core value, center weight, and neighbor diversity.
The remainder of this paper is organized as follows: Section 2 reviews SNA features for fraudster detection in the literature. Section 3 proposes the concept of NDAs. Section 4 describes the data collection process. Section 5 presents the performance study and discusses the results. Finally, Section 6 concludes the paper and provides directions for future studies.

SNA for Fraudster Detection
This section reviews essential SNA features that are particularly effective for fraudster detection in online auctions. However, not all fraudster detection approaches are SNA-based. For example, user-level features such as the median, sum, mean, or standard deviation of the prices of merchandise that a user bought or sold within a certain period were used in [8,9]; transaction-related features (including price, frequency, comment, and connectedness in the transaction network) and user-level features (including reputation and age) were used in [10]. For a brief review of recent fraudster detection approaches, please refer to [7]. For surveys on online auction fraud in general, please refer to [11,12].

Construction of a Social Network
Before applying SNA for fraudster detection, a social network of users must be created. One option is to use a transaction network, where each node represents a user and each link represents a transaction between a buyer and a seller. Another option is to use a rating network, where each node represents a user and each link represents a rating given by a user to another user after a transaction.
After a transaction on an auction website, such as eBay or Ruten, the buyer and the seller can give each other a positive, neutral, or negative rating to reflect their experience in the transaction. Thus, each link in the rating network has a corresponding link in the transaction network; however, because a user is not required to give a rating after a transaction, not all links in the transaction network have a corresponding link in the rating network.
On auction websites such as eBay, Taobao, Ruten, and Yahoo! Kimo, the rating history used to construct the rating network is accessible to the public. However, the transaction history used to construct the transaction network is not available except with the permission of the auction website [10]. As a result, most previous studies used the rating network [4-6,13]. Because inflated-reputation fraud relies on the accumulation of positive ratings, this study also used the rating network.

SNA Features
As described in Section 1, numerous SNA-based approaches for fraudster detection have focused on detecting cohesive groups in a social network [3-6]. In SNA, various measurements for cohesive groups are available, such as component, clique, community, k-core, and k-plex. Among them, k-core is the most effective for fraudster detection because fraudsters tend to appear in k-core subgraphs with k = 2 [4]. A k-core of a graph G is a maximal connected subgraph of G in which all nodes have a degree of at least k. A node may be present in several k-core subgraphs, each with a different k value. The maximum of these k values is referred to as the k-core value of the node. To calculate the k-core value of each node in a graph, Batagelj and Zaversnik proposed an O(m)-time algorithm (Figure 1), where m is the number of links in the graph [14]. The algorithm repeatedly prunes the least connected nodes and thus disentangles the hierarchical structure of the graph. Although the k-core value is widely used for detecting fraudsters [3-6], it often results in low precision.
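The pruning procedure can be sketched in Python as follows. This is a simplified version that repeatedly removes a minimum-degree node, not the bucket-based O(m) algorithm of Batagelj and Zaversnik referenced in Figure 1; the function name and representation are illustrative.

```python
from collections import defaultdict

def core_numbers(edges):
    """Compute the k-core value of every node by repeatedly pruning
    a node with the smallest remaining degree. Pruning any minimum-degree
    node yields correct core numbers, though this simple version is
    slower than the bucket-based O(m) algorithm."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:                      # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    core = {}
    k = 0
    remaining = set(adj)
    while remaining:
        v = min(remaining, key=lambda n: degree[n])
        k = max(k, degree[v])           # the core value never decreases
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core

# A triangle plus a pendant node: the triangle forms a 2-core,
# and the pendant node has a k-core value of 1.
print(core_numbers([(1, 2), (2, 3), (1, 3), (3, 4)]))
```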
Wang and Chiu [4] suggested using the "center weight" feature included in the SNA program Pajek [15] to improve precision. A "robbery" algorithm (Figure 2) is used to calculate the center weight of each node in a graph [15]. Initially, the algorithm sets the center weight of each node to the degree of the node. Nodes with larger weights then repeatedly steal weight from adjacent nodes with smaller weights. Finally, only a small number of nodes have a center weight greater than zero, and all of their adjacent nodes have a center weight equal to zero. Essentially, the nodes with a non-zero center weight are the centers of the network and are classified as fraudsters. Although the center weight improves the precision of fraudster detection, it reduces the recall [4].
Lin and Khomnotai [7] suggested that the neighbors of fraudsters in social networks exhibit similar patterns because fraudsters need their collusive neighbors to give them positive ratings in the reputation system. Thus, the concept of neighbor diversity was proposed for fraudster detection, where nodes with low neighbor diversity are likely to be fraudsters. Before calculating neighbor diversity, a node feature (e.g., the number of received ratings, the number of canceled transactions, the k-core value, or the join date) is selected. All nodes in the network are then divided into several groups based on their values of the selected feature. Finally, the neighbor diversity of a node on the selected feature is calculated as the Shannon entropy [16] of the group distribution of the node's neighbors, as follows:
D_attr(x) = −Σ_{i=1}^{n} p_i(x) log p_i(x),  (1)
where D_attr(x) denotes the neighbor diversity of the node x on the selected feature attr, n is the number of groups, and p_i(x) is the number of x's neighbors in group i divided by the total number of x's neighbors.
Although neighbor diversity has been shown to outperform the k-core value and the center weight for fraudster detection [7], it has three limitations. First, neighbor diversity tends to misclassify nodes with few neighbors as fraudsters. Consider the extreme case of a user with only one neighbor. Because such a user does not have many collusive neighbors with whom to engage in inflated-reputation fraud, the user is unlikely to be a fraudster; however, the neighbor diversity of a one-neighbor user is minimal, so such users are falsely classified as fraudsters. Second, if all neighboring nodes of a node x belong to the same group (i.e., n = 1 and p_1 = 1 in Equation (1)), then x's neighbor diversity also reaches the minimum regardless of the number of x's neighbors. This observation suggests that neighbor diversity cannot distinguish between a node with only one neighbor and a node with numerous neighbors that all belong to the same group. Third, the method used to divide the nodes into groups based on a selected feature is ad hoc and requires adjustments for different features.
Some approaches define their features in a recursive manner, similar to the Authority and Popularity scores in HITS and Google's PageRank [17], so the values of these features are also calculated recursively. For example, Chau et al. [9] constructed a Markov random field (MRF) model from the transaction history among all traders and applied the belief propagation algorithm to calculate the probabilities of being a fraudster, an accomplice, or a normal user for each node. The same approach was used in [18], except that the observed values of the nodes in the MRF model were instantiated to a constant. Additionally, Bin and Faloutsos [19] used the loopy belief propagation algorithm instead of the belief propagation algorithm to derive the feature values.
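The neighbor-diversity computation described above can be sketched in Python. The grouping of neighbors is assumed to be given; the actual grouping scheme in [7] is feature-specific, and the choice of logarithm base here (base 2) is an assumption.

```python
import math
from collections import Counter

def neighbor_diversity(node, adj, group_of):
    """Shannon entropy of the group distribution of a node's neighbors.
    `adj` maps a node to its set of neighbors; `group_of` maps a node
    to its group index on the selected feature."""
    neighbors = adj[node]
    counts = Counter(group_of[v] for v in neighbors)
    total = len(neighbors)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Node 0 has four neighbors: three in group 1 and one in group 2,
# so its diversity is the entropy of the distribution (0.75, 0.25).
adj = {0: {1, 2, 3, 4}}
group_of = {1: 1, 2: 1, 3: 1, 4: 2}
print(round(neighbor_diversity(0, adj, group_of), 3))
```

A node whose neighbors all fall into one group has diversity 0, which illustrates the second limitation discussed above: the entropy is 0 whether the node has one neighbor or a hundred same-group neighbors.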

Neighbor-Driven Attributes
Although the k-core value, center weight, and neighbor diversity all have limitations, the concept of neighbors is crucial in calculating these features. This motivates the idea of NDAs: deriving a new attribute of a node from a feature of the node's neighbors. To derive an NDA, an existing node feature is first selected. An NDA based on the selected feature can then be defined as the mean or maximum of the feature values of the node's neighbors. Unlike neighbor diversity, the neighbors are not required to be divided into groups according to the selected feature, which simplifies the calculation and avoids the ad hoc grouping of neighbor diversity.
In this study, we chose one of three features (i.e., the k-core value, the number of received ratings, and the number of canceled transactions) for calculating NDAs, because these three features have been shown to be related to inflated-reputation fraud in online auctions [3,5,7]. Six NDAs were defined, as shown in Table 1.
Our proposed method uses each of these NDAs, alone and in conjunction with the k-core value, center weight, and neighbor diversity, to build a classifier for fraudster detection. The algorithm used to construct a classifier is a decision tree or a support vector machine. The experimental results in Section 5 show that using some of these NDAs significantly improves the classification accuracy for detecting fraudsters.

NDA      Description
N_k(x)   the mean of the k-core values of node x's neighbors
N̂_k(x)   the max of the k-core values of node x's neighbors
N_r(x)   the mean of the numbers of received ratings of node x's neighbors
N̂_r(x)   the max of the numbers of received ratings of node x's neighbors
N_c(x)   the mean of the numbers of canceled transactions of node x's neighbors
N̂_c(x)   the max of the numbers of canceled transactions of node x's neighbors
Notably, the idea of deriving features of a node from the node's neighbors is not new; neighbor diversity is an example. Features such as the k-core value and the center weight depend on all nodes in a connected graph, not just the neighboring nodes, and are therefore more complex to compute than the NDAs in Table 1, which require only a mean or max operation. Moreover, each proposed NDA can be viewed as a simple extension of an existing feature; for example, N_k(x) and N̂_k(x) in Table 1 are extensions of the k-core value. The same process can be applied to other existing features to derive new NDAs. Furthermore, the choice of the aggregation function for an NDA is likely to be application-dependent. In this study, the mean and max were chosen because of the intensive interaction between the members of a collusive group.
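Because each NDA in Table 1 reduces to a mean or max over a neighbor feature, the computation is straightforward to sketch. The feature values below (received ratings) are illustrative.

```python
from statistics import mean

def nda(node, adj, feature):
    """Return the (mean, max) NDA pair for a node over a selected feature.
    `adj` maps a node to its set of neighbors; `feature` maps each node to
    its value on the selected feature (e.g., k-core value, number of
    received ratings, or number of canceled transactions)."""
    values = [feature[v] for v in adj[node]]
    return mean(values), max(values)

# Node 0's neighbors have 10, 20, and 60 received ratings, so its
# NDAs on received ratings are mean 30 and max 60.
adj = {0: {1, 2, 3}}
received = {1: 10, 2: 20, 3: 60}
print(nda(0, adj, received))  # (30, 60)
```

Unlike neighbor diversity, no grouping of neighbors is needed, which is exactly the simplification argued for above.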

Datasets for Experiment
Ruten (www.ruten.com.tw) is one of the largest auction websites in Taiwan, funded by eBay and PChome Online [13]. In this performance study, we used a dataset collected from Ruten's website. This dataset was also used in [7] and was collected in a level-wise manner, as in [4-6,10,13]. The data collection process started with 932 accounts (denoted as the first-level accounts) that were posted on the suspended list by Ruten during July 2013. It then extended to the next level of accounts, including those accounts that had given ratings to or received ratings from the first-level accounts. The same process was then repeated for the next level. Finally, the dataset, denoted as D_all, contained 4407 accounts.
If an account in D_all was suspended by Ruten due to fraudulent behavior (e.g., fake bidding, evaluation hype, selling counterfeit products, or failure to deliver products), then the account was classified as a fraudster; otherwise, the account was classified as a non-fraudster. As a result, the D_all dataset contained 1080 fraudsters and 3327 non-fraudsters.
To calculate the neighbor diversity and NDAs for each account in D_all, we needed to build a social network containing all the accounts in D_all and their neighboring accounts. Thus, we repeated the same data collection process one more level down so that all neighbors of the accounts in D_all were identified. This step identified 233,169 new accounts. Finally, a social network with 237,576 nodes (i.e., the 4407 accounts of D_all plus the 233,169 new accounts) and 348,259 undirected links was built, where each node represented an account and each undirected link represented a positive rating from one account to another. Notably, the social network was an undirected, unweighted graph, where duplicate links between any two nodes were removed. Furthermore, because our list of suspended accounts was collected during July 2013, the social network did not include links for ratings that occurred after 31 July 2013.
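The construction of the undirected, unweighted rating network with duplicate links removed can be sketched as follows; the adjacency-set representation is an assumption, chosen because set insertion deduplicates links automatically.

```python
from collections import defaultdict

def build_rating_network(positive_ratings):
    """Build an undirected, unweighted graph from (rater, ratee) pairs.
    Self-ratings are dropped, and duplicate links between the same two
    accounts collapse into a single undirected link, as in the dataset
    construction described above."""
    adj = defaultdict(set)
    for rater, ratee in positive_ratings:
        if rater != ratee:
            adj[rater].add(ratee)
            adj[ratee].add(rater)
    return dict(adj)

# Two mutual ratings and one repeat between a and b yield a single link.
ratings = [("a", "b"), ("b", "a"), ("a", "b"), ("b", "c")]
net = build_rating_network(ratings)
print(sorted((u, sorted(vs)) for u, vs in net.items()))
```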
For each node in the social network, the following features were first calculated: the k-core value (Figure 1), the center weight (Figure 2), the number of received ratings, and the number of canceled transactions. Then, the following features were calculated only for the nodes representing the 4407 accounts in D_all: the neighbor diversity on the number of received ratings [7] and the six NDAs in Table 1. Notably, we did not calculate the neighbor diversity and the NDAs for the 233,169 new accounts because the social network was only a part of the complete rating network and did not include all neighbors of these new accounts. The D_all dataset with 4407 records and nine features (i.e., the k-core value, the center weight, the neighbor diversity on the number of received ratings, and the six NDAs) was used for the experiment in Section 5.
As described in Section 2, neighbor diversity often classifies a node whose neighbors all have similar numbers of received ratings as a fraudster. Misclassification could occur when the number of its neighbors is small. To verify this statement, we constructed two sub-datasets, D_one and D_more, of D_all for the experiments in Sections 5.2 and 5.3. The D_one dataset contained the nodes whose neighbors had similar numbers of received ratings, and the D_more dataset contained the nodes whose neighbors had diverse numbers of received ratings. Specifically, this was achieved by first dividing all nodes into several groups based on their numbers of received ratings, where the first group contained nodes with a number of received ratings between 0 and 49, and the ith group contained nodes with a number of received ratings between 25 × 2^(i−1) and 25 × 2^i, for i > 1, as in [7]. Subsequently, the D_one dataset included the nodes whose neighbors were all in the same group; by Equation (1), those nodes had low neighbor diversity. The D_more dataset included the nodes whose neighbors appeared in more than one group; by Equation (1), those nodes had higher neighbor diversity than the nodes in D_one.
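The power-of-two grouping scheme can be sketched as follows. The treatment of the bin boundaries (half-open intervals) is an assumption, since the ranges stated above overlap at their endpoints.

```python
def rating_group(num_ratings):
    """Group index for a node by its number of received ratings:
    group 1 covers 0-49; group i (i > 1) covers the half-open range
    [25 * 2^(i-1), 25 * 2^i). Boundary handling is a sketch assumption."""
    if num_ratings < 50:
        return 1
    i = 2
    while num_ratings >= 25 * 2 ** i:
        i += 1
    return i

# Under these bins: 0-49 -> group 1, 50-99 -> group 2, 100-199 -> group 3, ...
print([rating_group(n) for n in (0, 49, 50, 99, 100, 200, 400)])
```

D_one then collects the nodes whose neighbors all map to one group index, and D_more collects the rest.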
To check whether a classifier learned from one dataset could be effectively applied to another, unrelated dataset, we constructed two sub-datasets, D_train and D_test, of D_all for the experiment in Section 5.4. Specifically, this was achieved by randomly dividing the social network into two disjoint subgraphs of approximately equal size, with each subgraph corresponding to a sub-dataset. Notably, no link existed between an account in D_train and an account in D_test. Thus, if a classifier learned from D_train could detect the fraudsters in D_test with high accuracy, then the features and the classification algorithm used to construct the classifier were effective. Table 2 summarizes the number of fraudsters and the number of non-fraudsters in these datasets.

Experimental Settings and Results
The experimental study contained four classification tests. In Tests 1-3, the datasets D_all, D_one, and D_more, respectively, were used to conduct 10-fold cross-validation. In Test 4, the dataset D_train was used to train a classifier, and the classifier was then used to detect the fraudsters in D_test.
In each test, the following experiments were conducted. First, to study the effectiveness of the NDAs, we used only one NDA at a time as the input to a classification algorithm. Then, to compare the performance of the existing approaches with that of our approach, we used the following three attribute combinations as inputs to a classification algorithm: D_r; k-core and CW; and k-core, CW, and D_r, where D_r, k-core, and CW refer to the neighbor diversity on the number of received ratings [7], the k-core value, and the center weight [4], respectively. Finally, we repeated the same experiments with the addition of an NDA to study how the NDA could help improve the performance of the existing approaches.
Two classification algorithms (the J48 decision tree and the support vector machine (SVM)) from Weka [20] were used in this study. The default parameter settings of both algorithms in Weka were adopted. Because previous work [7] also used the same classification algorithms and parameter settings, our performance results reflect the impact of using the NDAs as inputs to the classification algorithms.
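A rough analogue of this experimental setup can be sketched with scikit-learn, where DecisionTreeClassifier stands in for Weka's J48 (a C4.5-style decision tree) and SVC for Weka's SVM; the feature matrix below is synthetic, not the paper's dataset, and its columns merely mimic features such as k-core, center weight, neighbor diversity, and an NDA.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for the real feature matrix (400 accounts, 4 features).
X = rng.normal(size=(400, 4))
# Synthetic fraud label driven mostly by the NDA-like column (column 3).
y = (X[:, 3] + 0.5 * X[:, 0] > 0).astype(int)

for name, clf in [("tree", DecisionTreeClassifier(random_state=0)),
                  ("svm", SVC())]:
    acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
    print(f"{name}: mean 10-fold accuracy = {acc:.3f}")
```

The 10-fold cross-validation here mirrors Tests 1-3; Test 4 would instead fit on one subset and score on a disjoint one.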

Test 1: D_all Dataset
Table 3 shows the performance of using only one NDA to classify all accounts in the D_all dataset. As shown in Table 2, the D_all dataset contained 3327 non-fraudsters among 4407 accounts; thus, 75.5% (3327/4407) was considered the baseline for the classification accuracy. The two NDAs based on the k-core value (i.e., N_k and N̂_k) underperformed in this experiment. Thus, for the rest of the experiments, we retained only the four NDAs (i.e., N_r, N̂_r, N_c, and N̂_c) whose classification accuracy (Table 3) was above 83%. Among them, N_r and N̂_r performed the most effectively. Boldface type indicates the best performance in each column of Tables 3-18. Table 4 shows the performance of using D_r with or without an NDA. Compared with the use of only D_r (Row 1 in Table 4), the addition of N_r, N̂_r, N_c, or N̂_c invariably improved the accuracy, precision, and recall. Overall, the addition of N_r or N̂_r achieved the strongest improvement. In Tables 4-6, 8-10, 12-14, and 16-18, italics indicate the performance values with an NDA that were lower than the corresponding values in Row 1 (i.e., the case without any NDA).
Table 5 shows the performance of using k-core and CW with or without an NDA. Compared with the use of only k-core and CW (Row 1 in Table 5), the addition of N_r or N̂_r improved the classification accuracy of J48 from 82.9816% to over 90%, and that of SVM from 82.9362% to over 85%. Thus, J48 was a more suitable algorithm for this problem than SVM. The addition of N_c or N̂_c often improved the accuracy and precision but reduced the recall.
Table 6 shows the performance of using k-core, CW, and D_r with or without an NDA. Compared with the use of only k-core, CW, and D_r (Row 1 in Table 6), the addition of N_r or N̂_r considerably improved the performance, and J48 achieved superior results to those of SVM.
Overall, the existing approaches (i.e., without NDAs) using D_r (Row 1 in Table 4), k-core and CW (Row 1 in Table 5), or k-core, CW, and D_r (Row 1 in Table 6) all achieved a classification accuracy above 75.5% (the baseline for the D_all dataset), indicating the effectiveness of these approaches. However, with the addition of NDAs (particularly N_r or N̂_r), the performance was further improved.

Test 2: D_one Dataset
According to the results in Section 5.1, N_r and N̂_r achieved the best performance. Thus, for the rest of the performance study, we focused on these two NDAs. As shown in Table 2, the D_one dataset contained 716 fraudsters among 1146 accounts; thus, 62.5% (716/1146) was considered the baseline for the classification accuracy. Table 7 shows the performance of using only one NDA. The accuracy of using N_r (above 76%) or N̂_r (above 77%) was significantly higher than the baseline of 62.5%.
The first row of Tables 8-10 shows the performance of the existing approaches using D_r; k-core and CW; or k-core, CW, and D_r, respectively. Using these three combinations resulted in poor classification accuracy, near the baseline value of 62.5%. The D_one dataset contained accounts whose neighbors were all from the same group (Section 4), and the existing approaches that used D_r; k-core and CW; or k-core, CW, and D_r were not effective on such datasets (Section 2). However, with the addition of N_r or N̂_r, the performance was significantly improved (Rows 2 and 3 in Tables 8-10). When N_r or N̂_r was used alone (Table 7), the classification accuracy exceeded 76% for N_r and 77% for N̂_r. When N_r or N̂_r was used with D_r; k-core and CW; or k-core, CW, and D_r, the classification accuracy improved to up to 79% (Tables 8-10). Therefore, the addition of N_r or N̂_r was essential for detecting fraudsters in the D_one dataset.

Test 3: D_more Dataset
The D_more dataset contained 2897 non-fraudsters among 3261 accounts (Table 2); thus, 88.83% (2897/3261) was considered the baseline for the classification accuracy. Table 11 shows the performance of using N_r or N̂_r, which achieved a classification accuracy above 92.7%, higher than the baseline value of 88.83%. However, for the D_more dataset, the improvement was not as significant as that observed for the D_one dataset, in which the classification accuracy improved from the baseline of 62.5% to over 76% (Table 7). Therefore, N_r and N̂_r performed more effectively on the D_one dataset than on the D_more dataset.
The first rows of Tables 12-14 show the performance of the existing approaches using D_r; k-core and CW; and k-core, CW, and D_r, respectively. All three combinations achieved a classification accuracy greater than 90%. Compared with the baseline of 88.83% for the D_more dataset, these results indicate that the existing approaches are effective on the D_more dataset. As described in Section 4, the D_more dataset contained accounts whose neighbors came from various groups, whereas the D_one dataset contained accounts whose neighbors came from the same group. The existing approaches (i.e., without NDAs) thus performed more satisfactorily on the D_more dataset than on the D_one dataset. Finally, the addition of N_r^avg or N_r^max further improved the performance of the existing approaches with the J48 classifier (Rows 2 and 3 in Tables 12-14).

Test 4: D train and D test Datasets
In Test 4, D_train was used to train a classifier, which was then used to detect fraudsters in D_test. Tables 15-18 show the performance results on D_test. The D_test dataset contained 1646 non-fraudsters in 2202 accounts (Table 2); thus, 74.75% (1646/2202) was considered the baseline for the classification accuracy. Table 15 shows the performance of using N_r^avg or N_r^max. Using N_r^avg or N_r^max achieved a classification accuracy of over 84%, which exceeded the baseline of 74.75%.
The first rows of Tables 16-18 show the performance of the existing approaches using D_r; k-core and CW; and k-core, CW, and D_r, respectively. All three combinations achieved a classification accuracy greater than 78%. However, with the addition of N_r^avg or N_r^max, the accuracy, precision, recall, and F1 were further improved (Rows 2 and 3 in Tables 16-18). These results are consistent with the 10-fold cross-validation results of Test 1 on the dataset D_all.
Because D_all was the union of its two disjoint subsets D_one and D_more, and D_test was a subset of D_all, every account in D_test was in either D_one or D_more. Thus, we could examine the results in Tables 16-18 in more detail by computing the classification accuracy on the two disjoint subsets of D_test: D_test ∩ D_one and D_test ∩ D_more. Table 19 shows how the 2202 accounts of D_test were distributed between D_test ∩ D_one and D_test ∩ D_more. Table 20 shows the percentage of correctly classified accounts in D_test ∩ D_one and D_test ∩ D_more. The addition of either N_r^avg or N_r^max did not significantly change the percentage of correctly classified accounts in D_test ∩ D_more, but it significantly improved that percentage in D_test ∩ D_one. These results are consistent with the 10-fold cross-validation results of Tests 2 and 3.
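The subset breakdown can be sketched with plain set operations (the account IDs below are made up for illustration):

```python
# Split the correctly classified accounts of D_test by intersecting
# with the disjoint subsets D_one and D_more.
d_one = {"a1", "a2", "a3", "a4"}
d_more = {"b1", "b2", "b3", "b4", "b5", "b6"}
d_test = {"a1", "a2", "b1", "b2", "b3"}
correct = {"a1", "b1", "b2", "b3"}  # accounts the classifier got right

def subset_accuracy(subset):
    # Accuracy restricted to D_test ∩ subset.
    part = d_test & subset
    return len(correct & part) / len(part)

print(subset_accuracy(d_one))   # 0.5
print(subset_accuracy(d_more))  # 1.0
```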

Discussion
There are four main findings in the performance study above. First, the existing approaches using D_r; k-core and CW; or k-core, CW, and D_r performed well on the D_more dataset but poorly on the D_one dataset (Row 1 of Tables 8-10 and 12-14). As mentioned in Section 4, the D_one dataset contained accounts whose neighbors came from the same group and had a similar number of received ratings, so the neighbor diversity of any account in the D_one dataset was minimal. This result shows the ineffectiveness of the existing approaches for accounts with low neighbor diversity.
Second, using N_r^avg or N_r^max alone (Tables 3, 7, 11 and 15) achieved considerable improvement over the baseline performance on all four datasets: D_all, D_one, D_more, and D_test. Thus, N_r^avg and N_r^max have high potential to improve the existing approaches.
Third, the addition of either N_r^avg or N_r^max to the existing approaches improved the classification performance (Tables 4-6, 8-10, 12-14 and 16-18). The Wilcoxon signed-rank test was used to compare the classification accuracy without any NDA against that with either N_r^avg or N_r^max added. The results were in the expected direction and significant (Z = −4.172, p < 0.005 for the addition of N_r^avg, and Z = −4.2, p < 0.005 for the addition of N_r^max). The Wilcoxon signed-rank test also showed no significant difference in classification accuracy between adding N_r^avg and adding N_r^max (Z = −0.973, p = 0.330). Specifically, the accuracy and recall improved significantly, although the precision was occasionally reduced when using the SVM classifier. The J48 decision tree classifier therefore seems more suitable than the SVM classifier for the problem of fraudster detection; moreover, the decision tree classifier was adopted as the classification algorithm in previous studies [5,8,21].
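For readers who wish to reproduce this kind of comparison, a minimal pure-Python sketch of the Wilcoxon signed-rank z-statistic follows (normal approximation, no zero or tie corrections; the paired accuracies are hypothetical, not the paper's measurements):

```python
import math

def wilcoxon_z(x, y):
    # Signed-rank test on paired samples: rank |x - y|, sum the ranks
    # of the positive differences (W+), and normalize W+ to a z-score.
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average ranks to ties in |diff|
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mu) / sigma

# Hypothetical paired accuracies: without any NDA vs. with an NDA added.
without_nda = [0.63, 0.64, 0.62, 0.65, 0.63, 0.66, 0.64, 0.62]
with_nda    = [0.77, 0.78, 0.76, 0.79, 0.78, 0.80, 0.77, 0.76]
print(round(wilcoxon_z(without_nda, with_nda), 2))  # ≈ -2.52
```

In practice, a library routine such as `scipy.stats.wilcoxon` would be used instead; the sketch only makes the statistic's construction explicit.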
Fourth, according to Table 3, the NDAs based on the k-core (i.e., N_k^avg and N_k^max) did not achieve satisfactory results. Because the k-core value of an account depends only on the number of neighbors in the social network, it fails to capture repeated ratings between two accounts. A possible remedy is to use a weighted graph for the social network in which the weight of a link reflects the number of ratings between the two connected nodes. The algorithm for calculating the k-core in Figure 1 would then need to be adjusted to work on a weighted graph.
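A minimal sketch of such a weighted k-core, assuming an adjacency map whose link weights count the ratings exchanged between two accounts (the graph below is illustrative, not from the paper's dataset):

```python
def weighted_core_number(adj):
    # adj: {node: {neighbor: weight}}; returns {node: core number}.
    # Iterative peeling: repeatedly remove the node with the smallest
    # weighted degree (sum of incident link weights).
    deg = {u: sum(nbrs.values()) for u, nbrs in adj.items()}
    core, peak = {}, 0
    while deg:
        u = min(deg, key=deg.get)
        peak = max(peak, deg[u])  # core values never decrease while peeling
        core[u] = peak
        del deg[u]
        for v, w in adj[u].items():
            if v in deg:
                deg[v] -= w
    return core

# Two accounts trading repeatedly (weight 10) stand out, even though
# their plain neighbor counts match the peripheral accounts'.
adj = {
    "f1": {"f2": 10, "n1": 1},
    "f2": {"f1": 10, "n2": 1},
    "n1": {"f1": 1},
    "n2": {"f2": 1},
}
core = weighted_core_number(adj)
print(core["f1"], core["n1"])  # 10 1
```

An unweighted k-core would assign all four nodes the same core number here, which is precisely the repeated-ratings blindness noted above.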

Conclusions
Most recent approaches use SNA to derive essential features (e.g., k-core, center weight, and neighbor diversity) for distinguishing between fraudsters and normal users. In this study, we proposed the concept of NDAs and identified two NDAs, the mean and the maximum of the number of received ratings of a user's neighbors (N_r^avg and N_r^max), that improved the performance of fraudster detection. Previous approaches performed poorly on users with low neighbor diversity (Section 5.2); our results suggest that these two NDAs can help in such situations.
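As a minimal sketch, the two NDAs can be computed from an adjacency structure as follows (the data layout and values are assumptions for illustration, not the paper's implementation):

```python
# ratings received per account, and each account's immediate neighbors
received = {"a": 3, "b": 120, "c": 5, "d": 4}
neighbors = {"a": {"c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a"}}

def nda(user):
    # Mean and maximum of the received-rating counts of the user's
    # immediate neighbors: (N_r^avg, N_r^max).
    counts = [received[v] for v in neighbors[user]]
    return sum(counts) / len(counts), max(counts)

print(nda("a"))  # (4.5, 5): neighbors with few received ratings
print(nda("b"))  # (3.0, 3)
```

The intuition is that fraudsters' collusive neighbors exist mainly to inflate ratings and therefore receive few ratings themselves, which pushes both NDA values down.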
In this study, a user's neighbors in both the NDAs and neighbor diversity refer to immediate neighbors, that is, those with whom the user directly interacts. However, the same concept can be extended to indirect neighbors within a predefined distance. As the relationships between fraudsters and their collusive groups become more sophisticated, it will be necessary to search beyond immediate neighbors to reveal their networks. Such extensions of NDAs and neighbor diversity warrant further study.
Three possible extensions to our data collection process are worth further investigation. First, the dataset for the experimental study was collected using breadth-first search. Because the number of collected nodes grew rapidly during the breadth-first search, we proceeded to only three levels. Other network sampling methods, such as depth-first search or random walk, could provide different views of the rating behavior in online auctions.
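A breadth-first sampler limited to three levels can be sketched as follows (the rating graph here is a stand-in for crawled data, not the actual crawler):

```python
from collections import deque

def bfs_sample(graph, seeds, max_depth=3):
    # Collect every account reachable from the seeds within max_depth
    # rating links; record each account's depth of discovery.
    seen = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        if seen[u] == max_depth:
            continue  # do not expand beyond the depth limit
        for v in graph.get(u, ()):
            if v not in seen:
                seen[v] = seen[u] + 1
                queue.append(v)
    return seen

graph = {"s": ["a", "b"], "a": ["c"], "c": ["d"], "d": ["e"]}
print(bfs_sample(graph, ["s"]))  # "e" is at depth 4 and is not collected
```

A depth-first or random-walk sampler would replace the queue with a stack or with random neighbor selection, yielding different coverage of the rating network.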
Second, because the dataset was crawled from Ruten's website and not acquired directly from Ruten, its content is limited to what was on the website.Consequently, some important features related to fraudster detection (e.g., account creation time) are not available in our dataset.Extending the dataset to include those features could help unveil fraudsters' activities.
Third, because the dataset was collected during July 2013, it covers fraudsters' activities only during that period. Because fraudsters constantly change their fraudulent techniques, collecting datasets from different periods would allow us to investigate how those techniques evolve and to validate the effectiveness of fraudster detection approaches over time.

Figure 2. Robbery algorithm for calculating center weight.

Table 3. Performance of an NDA on the D_all dataset.

Table 4. Performance of D_r with or without an NDA on the D_all dataset.

Table 5. Performance of k-core and CW with or without an NDA on the D_all dataset.

Table 6. Performance of k-core, CW, and D_r with or without an NDA on the D_all dataset.

Table 7. Performance of an NDA on the D_one dataset.

Table 8. Performance of D_r with or without an NDA on the D_one dataset.

Table 9. Performance of k-core and CW with or without an NDA on the D_one dataset.

Table 10. Performance of k-core, CW, and D_r with or without an NDA on the D_one dataset.

Table 11. Performance of an NDA on the D_more dataset.

Table 12. Performance of D_r with or without an NDA on the D_more dataset.

Table 13. Performance of k-core and CW with or without an NDA on the D_more dataset.

Table 14. Performance of k-core, CW, and D_r with or without an NDA on the D_more dataset.

Table 15. Performance of an NDA on the D_test dataset.

Table 16. Performance of D_r with or without an NDA on the D_test dataset.

Table 17. Performance of k-core and CW with or without an NDA on the D_test dataset.

Table 18. Performance of k-core, CW, and D_r with or without an NDA on the D_test dataset.

Table 19. Distribution of accounts in the D_test dataset.

Table 20. Percentage of correctly classified accounts in D_test ∩ D_one or D_test ∩ D_more.