Learning DOM Trees of Web Pages by Subpath Kernel and Detecting Fake e-Commerce Sites

Abstract: The subpath kernel is a class of positive definite kernels defined over trees, which has the following advantages for the purposes of classification, regression and clustering: it can be incorporated into a variety of powerful kernel machines, including SVM; it is invariant regardless of whether input trees are ordered or unordered; it can be computed by significantly fast linear-time algorithms; and, finally, its excellent learning performance has been proven through intensive experiments in the literature. In this paper, we leverage recent advances in tree kernels to solve real problems. As an example, we apply our method to the problem of detecting fake e-commerce sites. Although the problem is similar to phishing-site detection, the fact that mimicking existing authentic sites is harmful for fake e-commerce sites marks a clear difference between these two problems. We focus on fake e-commerce site detection for three reasons: e-commerce fraud is a real problem that companies and law enforcement have been cooperating to solve; inefficiency hampers existing approaches because datasets tend to be large, while subpath-kernel learning overcomes these performance challenges; and we offer increased resilience against attempts to subvert existing detection methods by incorporating robust features that adversaries cannot easily change: the DOM trees of websites. Our real-world results are remarkable: our method has exhibited accuracy as high as 0.998 when training an SVM with 1000 instances and evaluating accuracy on almost 7000 independent instances. Its generalization efficiency is also excellent: with only 100 training instances, the accuracy score reached 0.996.


Introduction
Tree-structured data has been the focus of a great deal of machine learning research for a long time. In 1979, Tai [1] extended the Levenshtein edit distance [2] to trees, and since then, we have been able to use distance measures to evaluate the similarity of trees. In 2001, Collins and Duffy [3] introduced the first tree kernel, opening new avenues for leveraging a variety of powerful kernel machines to study trees. Furthermore, recent work has shown that the subpath kernel is both more efficient and more accurate than other tree kernels: significantly fast algorithms that compute the subpath kernel in time linear in the size of the input trees (the number of vertices) have been proposed [4,5], and intensive experiments with various datasets have shown that the subpath kernel outperforms other tree kernels in classification accuracy when used with SVM.
Thus, the recent advances in tree kernel methods have made it possible to leverage powerful kernel machines for the purpose of analyzing various real tree data. Two of the most important advantages of using kernel machines are the wide variety of methods based on multivariate analysis and the excellent generalization performance obtained even when analyzing small datasets. In this regard, the main aims of this paper are to introduce:
• A generic system design for effective and efficient analysis of tree data with the subpath kernel;
• A detailed application of the subpath kernel to document object model (DOM) trees, an important example of tree data; and
• The remarkably high performance observed in the analysis of DOM trees for detecting fake e-commerce sites with the subpath kernel system.
Fake e-commerce sites have been serious threats to customers and providers of e-commerce. They pretend to be authentic sites that are authorized by companies operating e-commerce platforms and attempt to obtain money by fraud, steal personal information such as credit card numbers, ruin the reputation of the platform operators, and so forth. The number of reported incidents of fake e-commerce sites started to increase in 2012, and it continues to increase rapidly. In Japan, the monetary damage reported in 2012 was merely 48 million JPY, but it rapidly grew to 2.9 billion JPY in 2014 and 3.1 billion JPY in 2015.
Although fake e-commerce sites may seem akin to a popular class of malicious sites referred to as phishing sites, there is a crucial difference. Phishing pages attempt to resemble a single legitimate site, meaning they must mimic one particular site as closely as possible. By contrast, fake e-commerce sites do not have any specific targets, and similarity to specific existing authentic sites is even harmful for them, because the similarity may raise consumers' awareness of their illegality and, as a result, shorten their lifetime. Therefore, phishing detection approaches do not work for fake e-commerce site detection, because a fake e-commerce site cannot be compared against a single known-good target site [6].
Nevertheless, it will be helpful to review the techniques developed for phishing detection. Surveys in the literature (e.g., [7][8][9][10]) have reported that the major sources of information consist of web-page contents, URLs of websites and blacklists.
• Web-page contents can provide information at the finest granularity, and the range of features that can be extracted from them is the widest. The features of hyperlinked URLs [11], linked images [11,12] and embedded executable code [13] are all extracted from page contents. Distributions of various information, such as terms used [14], tag contents [11], content types [14] and terms' TF-IDF values [15,16], are also useful features. In exchange, detailed investigation of web-page contents may damage users' computers by executing embedded malware, and features extracted from texts are prone to be language-specific. More importantly, content-based features can be shallow in many cases, that is, it is not difficult for adversaries to alter them so as to circumvent detection.
• The URL of a website can also provide useful information for phishing detection. The features include not only tokens from the URL [15][16][17] but also metadata obtained from external sources such as DNS [15], Google PageRank [15] and Amazon Alexa [14]. Because the length of URL strings is limited, the information obtained is also limited. Link redirection and changes of web contents can be effective circumvention methods as well.
• A blacklist is created in a centralized manner and specifies the URLs of known phishing sites. Since its trustworthiness is high, it is widely used in real services such as Google Safe Browsing [18] and eBay Toolbar [7]. Nevertheless, automated crawlers can be easily detected, and changing web contents can easily outdate a blacklist. The most significant disadvantage of blacklists is the time lag between the discovery of phishing sites and their registration on blacklists.
In this paper, we are interested in methods that can automatically detect newly born (zero-day) fake websites. Therefore, the blacklist approach is out of the scope of this paper. In general, the use of machine learning techniques is known to be effective for detecting zero-day attacks, and the survey [19] introduces the latest attempts to apply machine learning techniques to the problem of detecting fake websites. Some leverage the remarkable advances in research on artificial neural networks (ANN) to obtain excellent detection accuracy [20][21][22][23], while others rely on established methods including SVM, Decision Tree, Random Forest and Gradient Boosting [23,24].
Many of the reported methods that can detect zero-day fake sites show significantly high detection accuracy. For example, an accuracy of 0.999 is reported in [14,15]. They, however, rely on shallow features, whose values adversaries can change to bypass detection without insuperable difficulty. This is the problem with the existing methods that we recognize and try to solve.
For this purpose, we develop a far more resilient approach. We propose a method that takes advantage of the structural information of a web page's document object model (DOM) tree. A DOM tree of a web page is a tree that represents the nesting structure of its hypertext markup language (HTML) tags. Figure 1 shows an extremely simple example of an HTML document and the DOM tree that represents it. Real DOM trees usually include thousands of vertices.
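To make the DOM-tree representation concrete, the following minimal sketch builds a tag-only tree from an HTML string using Python's standard library. It is an illustrative assumption, not the paper's actual pipeline: a production system would use a tolerant parser (e.g., lxml or html5lib) to sanitize malformed markup first, and would also have to handle void elements such as `<br>`.

```python
# Minimal DOM-tree extraction sketch using only the standard library.
from html.parser import HTMLParser

class DomTreeBuilder(HTMLParser):
    """Builds a DOM tree as nested (tag, children) pairs from HTML."""
    def __init__(self):
        super().__init__()
        self.root = ("#root", [])      # artificial root above <html>
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)  # attach to current parent
        self.stack.append(node)         # descend into the new node

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()            # ascend to the parent

def dom_tree(html):
    builder = DomTreeBuilder()
    builder.feed(html)
    return builder.root[1][0]  # first top-level element, e.g. <html>

tree = dom_tree("<html><head></head><body><p>hi</p></body></html>")
print(tree)  # ('html', [('head', []), ('body', [('p', [])])])
```

Note that text content is deliberately ignored: only the nesting structure of the tags, which is what the proposed method analyzes, is retained.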
This approach is based on a discovery reached through joint investigation with the National Police Agency of Japan and Rakuten Inc., an e-commerce company that operates the largest e-commerce platform in Japan. Law enforcement agencies and e-commerce companies have been cooperating to identify fake sites as quickly as possible in order to minimize the damage to consumers. They collect information on fake sites through complaint forms and allocate staff to analyze the collected information. Once they discover fake sites, they add their URLs to public blacklists. This initiative has shut down thousands of fake sites, usually very quickly. As a result of this police work, the average lifetime of a fake site is short, and the criminals must continuously develop and release new fake sites quickly. Because of this short lifespan, a single group of criminals must operate many fake sites simultaneously in order to make profits worth the effort required to maintain a set of short-lived sites. Thus, criminals cannot spend much time or effort developing each new site from scratch. On the other hand, they have to make each site look different to fool consumers. Changing content and layout is easy and affordable, but modifying page structure is expensive. Thus, we expect that the DOM trees of web pages can carry effective and sustainable features for the purpose of discriminating fake sites from authentic sites.
Using HTML tags as features for detecting fake websites has been proposed in the literature. For example, the five categories recommended in [25] as important sources of features are: web page text; web page source code; URLs; images; and linkage, and HTML tags are obtained from the source code of web pages. The tags are, however, used individually, and their structural information is not incorporated into the detection of fake websites. In contrast, the method presented in this paper evaluates only the structural information of HTML tags, that is, DOM trees, not the content of web pages, such as texts or images. This does not mean, however, that features obtained from the other sources are useless. As recommended in [25], it is desirable that real fake-website detection tools rely on features from a variety of sources for robust performance, and we recommend including structural features of DOM trees to measure the similarity of web pages.
In [26], the relationship between elements of fake-website detection tools and users' acceptance of the tools is studied. In particular, accuracy and speed are critical for earning users' acceptance. While structural information includes effective cues that can lead to improvements in accuracy, analyzing structural information is generally costly in time. In this regard, the subpath kernel can satisfy both of the seemingly contradictory requirements of high accuracy and high efficiency. In fact, as we see in Section 5, the accuracy rate observed in our experiments reaches as high as 0.998, and prediction for a single website with a support vector machine (SVM) classifier requires only a few tens of milliseconds.
In addition, because a (positive definite) tree kernel can be viewed as a method to plot trees as points in Euclidean (Hilbert) spaces, the subpath kernel can be combined with a variety of multi-variate analysis methods including SVM, principal component analysis (PCA), multiple regression analysis, and Gaussian process analysis, and we can select the most effective analysis tool according to our objective.
The remainder of this paper is organized as follows. In Section 2, we give a description of the data that we use in our research. In Section 3, we briefly review a theory of similarity measures to evaluate the similarity of DOM trees, including tree kernels, the class of measures that we use. Section 4 provides a specific description of the subpath kernel. Section 5 describes our method, experiments, and further applications. Although the target application in this paper is the detection of fake e-commerce sites, the method has broader applications.

The Data Used in Our Research
The initial form of the data that we use in our research is a list of URLs of real e-commerce sites, annotated as to whether each site is fake or authentic. We obtained the data by different means for fake sites and authentic sites.

Positive Examples-Fake Sites
Rakuten, Inc. has provided us with a list of 3597 URLs of fake sites. Rakuten is a Japanese electronic commerce and Internet company, which operates Rakuten Ichiba, the largest e-commerce platform in Japan. Rakuten is a large target for fraud because it is ranked 14th worldwide in revenue. The fake sites that Rakuten provided claim to be authorized by Rakuten, but try to defraud consumers with low-quality goods, or products that are never delivered.
To confront this threat, Rakuten has formed an Internet patrol team whose mission is to investigate fraudulent sites. Their current method relies on manual investigation, which they would like to improve through automation.

Negative Examples-Authentic Sites
The number of authentic sites found by the aforementioned investigation is very small compared to the number of fake sites. Therefore, we need to collect a comparable number of URLs of authentic sites to apply machine learning properly. For this purpose, we developed a crawler program that visits authentic sites by tracing links starting from one of Rakuten's official sites, https://ranking.rakuten.co.jp. This method produced 3349 URLs of authentic sites.
We know, however, that the web pages that this method collects include pages whose purpose is not to sell goods to consumers. For example, our dataset includes Rakuten's help pages about its e-commerce platform and credit cards. Nevertheless, we decided to use all of these pages for two reasons: first, the positive examples also include non-shopping sites; and second, the percentage of such non-shopping pages is smaller than 1%, so their impact on the machine learning analysis is expected to be limited.

Partition into Training and Validation Datasets
We have selected 500 fake sites and 500 authentic sites to include in a training dataset, while the remaining 3097 fake sites and 2849 authentic sites are preserved for validation.
The training dataset will be used to train classifiers, while the validation dataset will be used to evaluate how well the classifiers have been trained.

Similarity Measures for DOM Trees
As described, we expect that DOM trees include features that persist despite adversaries' efforts to slip past detection attempts. Nevertheless, not all features of DOM trees are appropriate for our purpose. In the following, we describe which features of DOM trees we should leverage in our proposal. Figure 2 shows histograms of the vertex number s(t), the leaf number l(t) and the height h(t) of the DOM trees t in the training dataset: a leaf is a vertex of t with no children; h(t) is the length of the longest upward path from a leaf to the root in t. We can observe that thresholds of 970, 498 and 23 for s(t), l(t) and h(t) separate the distributions of fake and authentic sites clearly, and the accuracy rates reach 0.941, 0.943 and 0.974, respectively. Despite the high accuracy of these threshold methods, we cannot use them for our purpose, because adversaries can easily change these values by adding dummy tags to HTML documents without altering their logical meaning or layout. It is also important to note that an appropriate detection method must not be affected by these indices. Otherwise, adversaries can lower the effective detection accuracy by equalizing the sizes of trees.

Measures Based on Shallow Features
One alternative we might consider for evaluating the similarity of trees is the use of edit distances with distance-based classifiers (for example, k-nearest neighbors). However, this method also relies on shallow features. The edit distance is a widely used class of dissimilarity measures and has been intensively studied for a long time. The Tai distance [1] is a well-known example, and many variations have been derived from it. Among them, the subpath distance, which is based on subpaths in trees, has proven to be excellent in classification accuracy [27]: a subpath is an upward sequence of one or more contiguous vertices of a tree (Figure 3). When SP(t) denotes the entire set of subpaths of a tree t, the following formula defines the distance:

d_SP(t1, t2) = s(t1) + s(t2) − 2 · max{ |p| : p ∈ SP(t1) ∩ SP(t2) }.

This approach does not meet our purposes, because d_SP(t1, t2) relies on a single longest subpath shared between t1 and t2. In fact, when t1 is fake and t2 is authentic, adversaries can modify only a small portion of t1, limiting the resulting change to its logical meaning and layout to the minimum, so that t1 and t2 share a longer path, which results in a decrease of d_SP(t1, t2). Any variation of the Tai distance suffers from the same problem, since it evaluates similarity only by maximal shared substructures in the same way as the subpath distance.

Measures Based on Deep Features
In this paper, we pursue measures that evaluate the similarity of the entire spectrum distributions of substructures in trees.
For example, Figure 4 depicts the spectrum distributions of SP(t) for a fake tree, an authentic tree and a tree whose authenticity is unknown: the x axis represents the subpaths that appear in t, while the y axis indicates their occurrence numbers. Whether the third tree is fake or authentic should be determined by which of the first two the unknown spectrum distribution is more similar to. As mentioned, the edit distance d_SP(t1, t2) leverages only the length of the longest common subpath that appears in the distributions and ignores the remainder of the spectrum distributions. In contrast, the similarity measure that we are pursuing should incorporate every local similarity of the distributions into the evaluation: the entire set of substructures of a tree t (for example, SP(t)) determines a neighborhood system around the vertices of t, and hence, structural information around every vertex definitely affects the entire spectrum distribution; it is difficult for adversaries to intentionally control changes of a spectrum distribution, because every small change to a tree (for example, a substitution, insertion or deletion of a vertex) will spread widely over the distribution.
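The spectrum distribution of SP(t) described above can be sketched as follows. The tree encoding (a parent map plus a label map) and the example tree are illustrative assumptions; the essential point is that every vertex contributes one subpath per ancestor it can reach upward.

```python
# Sketch: the subpath spectrum of a labeled tree, i.e. the multiset of
# upward label sequences of contiguous vertices.
from collections import Counter

def subpath_spectrum(parent, label):
    """Counts every upward sequence of contiguous vertices by its labels."""
    spectrum = Counter()
    for v in parent:                 # each vertex starts some subpaths
        path = []
        while v is not None:         # walk upward toward the root
            path.append(label[v])
            spectrum[tuple(path)] += 1
            v = parent[v]
    return spectrum

# Tiny example tree:  html -> body -> {p, p}
parent = {"html": None, "body": "html", "p1": "body", "p2": "body"}
label  = {"html": "html", "body": "body", "p1": "p", "p2": "p"}
spec = subpath_spectrum(parent, label)
print(spec[("p", "body")])  # 2: both <p> vertices start this subpath
```

A single vertex insertion changes every subpath passing through it, which illustrates why the whole distribution is hard for an adversary to control.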
We can search for such measures in two major categories: divergences and positive definite mapping kernels.

• The Kullback-Leibler divergence is the best-known example, but it does not satisfy the axiom of symmetry. Symmetric divergences include the Jensen-Shannon divergence, Bhattacharyya distance, Hellinger distance, Lp-norm and Brownian distance. Divergences can be used with distance-based methods such as the nearest-centroid and k-nearest-neighbor algorithms for classification, and the k-means and k-medoids algorithms for clustering.
• A positive definite kernel K : X × X → R can be viewed as an inner product operator ⟨·, ·⟩ of some inner product space H [28] (reproducing kernel Hilbert space, RKHS). That is, by some mapping Φ : X → H, an element x ∈ X can be identified with a vector Φ(x) ∈ H, and K(x, y) = ⟨Φ(x), Φ(y)⟩ holds. In particular, normalization of K(x, y) yields the well-known cosine similarity.
More importantly, by identifying X with a subspace of H, we can extensively apply many useful multivariate analysis techniques to X, including PCA for feature extraction, the SVM classifier for classification, the kernel k-means algorithm for clustering, and Gaussian process analysis for regression. A positive definite mapping kernel [27], in addition, evaluates spectrum distributions.
Although both divergences and positive definite mapping kernels can be used for our purpose of evaluating the similarity between DOM trees, this paper deploys the latter, since the advantage of multivariate analysis is significant.

Positive Definite Mapping Kernels
The mapping kernel is a generalization of the well-known convolution kernel [29] and is defined as follows.
For sets x and y, we call a subset µ ⊆ x × y a mapping from x to y and allocate a set of mappings M_{x,y} to each pair (x, y). Hence, when X denotes a family of sets, we have a mapping system M determined by M = {M_{x,y} | (x, y) ∈ X × X}. Furthermore, we let k : Ω × Ω → R be an arbitrary kernel defined over Ω = ∪_{x ∈ X} x. Then, the mapping kernel associated with X and k is determined by

K(x, y) = Σ_{µ ∈ M_{x,y}} Π_{(v,w) ∈ µ} k(v, w).    (1)

For example, a convolution kernel is a mapping kernel with a particular setting of the mapping sets M^H_{x,y}. Positive definiteness of a kernel allows us to view it as the inner product operator of some inner product space (RKHS). With respect to the positive definiteness of mapping kernels, we have Theorem 1 (Shin et al. [30]): the mapping kernel K is positive definite for an arbitrary positive definite kernel k if, and only if, the mapping system M is transitive.
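A brute-force reading of the mapping-kernel definition can be sketched as follows: the kernel value sums, over every mapping µ in M_{x,y}, the product of the vertex kernel k over the pairs in µ. The explicit toy mapping set and the label-equality vertex kernel below are assumptions chosen only to make the definition concrete; real mapping systems are defined structurally over trees.

```python
# Brute-force sketch of the mapping-kernel definition:
#   K(x, y) = sum over mu in M_{x,y} of prod over (v, w) in mu of k(v, w)
from math import prod

def mapping_kernel(mappings, k):
    return sum(prod(k(v, w) for v, w in mu) for mu in mappings)

# Vertex kernel in the spirit of k_{alpha,beta}: alpha on equal labels,
# beta otherwise (here vertices are represented by their labels).
def k_ab(alpha, beta):
    return lambda v, w: alpha if v == w else beta

# Toy mapping set (each mapping is a set of vertex pairs)
mappings = [
    {("a", "a")},                # one matching pair   -> alpha
    {("a", "a"), ("b", "b")},    # two matching pairs  -> alpha**2
    {("a", "b")},                # mismatched pair     -> beta
]
print(mapping_kernel(mappings, k_ab(0.5, 0.0)))  # 0.5 + 0.25 + 0.0 = 0.75
```

Enumerating mappings explicitly is exponential in general; the point of the cited linear-time algorithms is precisely to avoid this enumeration for the subpath mapping system.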
For more details, refer to Appendix A.

Tree Mapping Kernels
In this paper, a tree always means a rooted ordered tree. A rooted ordered tree is equipped with two orders on its vertices: a generation order and a traversal order. The generation order determines an ancestor-to-descendant relation of vertices: v < w indicates that v is an ancestor of w; in particular, v ⋖ w means that v is the immediate ancestor (parent) of w. The traversal order ≺, on the other hand, determines a left-to-right relation. The nearest common ancestor of v and w is denoted by v ∧ w. In Figure 3, for example, html < figure, body ⋖ figure, head ≺ figure and figure ∧ tbody = body hold. For a formal description, refer to Appendix B.
Furthermore, we assume that the vertices of trees are all labeled: we denote the labeling function by ℓ : Ω → A, where Ω is the entire set of labeled vertices of our target trees and A is the alphabet of labels.
When x ⊆ Ω and y ⊆ Ω represent two trees, a mapping set M_{x,y} to be used for a mapping kernel should be determined so that it reflects the interior structures of x and y. The minimum requirements for µ ∈ M_{x,y} are:
1. µ is a one-to-one partial mapping;
2. µ preserves the generation orders of x and y, that is, µ(v) < µ(w) if, and only if, v < w;
3. µ preserves the traversal orders of x and y, that is, µ(v) ≺ µ(w) if, and only if, v ≺ w.
In [31], 32 different mapping systems are defined, and the associated tree mapping kernels are investigated. Among them, we see two examples.

• The elastic tree kernel [32] is an important example of tree kernels in the literature and is also defined as an instance of mapping kernels with mapping sets M^E_{x,y}.
• The subpath kernel [4,33] is also a mapping kernel, determined by M^SP_{x,y}, which consists of µ ⊆ x × y such that Dom(µ) and Ran(µ) are subpaths. Figure 6 gives an example.
For the interior kernel k : Ω × Ω → R to be used in Equation (1), we can use the following kernel with two weight parameters α and β:

k_{α,β}(v, w) = α, if ℓ(v) = ℓ(w); β, otherwise.    (2)

It is easy to see that k_{α,β} is positive definite if α ≥ β ≥ 0. For the elastic and subpath tree kernels, we further constrain β to be zero.

A Comparison of Tree Mapping Kernels
In [31,33], the tree mapping kernels obtained from Equations (1) and (2) are comprehensively investigated with respect to classification accuracy when used with support vector classifiers (SVC). A mapping system M is chosen out of 32 different types, including M^E_{x,y} and M^SP_{x,y}, while (α, β) moves over grid points in the region 0 ≤ β ≤ α ≤ 1. The comparison is based on averaged ten-fold cross-validation scores (accuracy scores) across ten independent datasets. Figure 7 shows the result of the comparison, focusing on the best six tree kernels investigated. For the subpath, parse and elastic tree kernels, β was fixed to zero, while α was adjusted to show the best accuracy performance. For the other kernels, both α and β were adjusted. In the chart, the values in the hatched rectangles show the averaged rank of each kernel, while the figures in parentheses that follow the kernel names are p-values of the Hommel test [34], as recommended by [35]. We should remark that the subpath kernel overwhelmingly outperforms the other kernels. Figure 7. The averaged ranks and the p-values of the Hommel test, displayed in parentheses, of the parse tree kernel [3], the elastic tree kernel [32], and the SPI, CPI and CD kernels [30], with the subpath kernel as control. The red thick line represents the critical distance for the significance level 0.01.

Conclusion on Kernel Selection
To confirm the superiority of the subpath kernel in classification accuracy, we ran an additional experiment using a dataset with 25 fake DOM trees and as many authentic DOM trees. In this experiment, we compared the subpath kernel with 59 tree kernels, including the kernels tested in [31,33]. Figure 8 shows the result. The accuracy scores were measured through ten-fold cross-validation. Only the subpath kernel marks an accuracy of 1.0, while the scores of all the other kernels fall between 0.54 and 0.93. Thus, the subpath kernel outperforms the others even on the dataset of this paper. In addition, the subpath kernel has the following advantages.

• Linear-time algorithms to compute the subpath kernel are known [4,5]. Since the time complexity of computing other tree kernels is mostly quadratic in the size of the input trees, this advantage is significant. In particular, it enables real-time detection of fake sites even for large-scale DOM trees with thousands of vertices.
• The definition of the subpath kernel is invariant no matter whether trees are ordered or unordered. Although DOM trees are derived as ordered trees, adversaries may change the traversal order of vertices without impacting their logical meaning or layout. Thus, the results of the subpath kernel are robust against the attack of changing the traversal order.

The Subpath Kernel
In this section, we look into the subpath kernel more closely.

Efficient Computing of the Subpath Kernel
In [5], a linear-time algorithm to compute the subpath kernel is presented. It takes a longest common prefix (LCP) array as input, and the LCP array is computed from suffix arrays extracted from the trees [5].
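Before turning to the suffix-array machinery, it helps to see what value is being computed. With β = 0, each subpath label sequence p shared by x and y contributes α^|p| once per pair of occurrences, so K_SP can be written as an α-weighted dot product of the two subpath spectra. The quadratic-space sketch below is a reference implementation under that reading, not the cited linear-time algorithm; the tree encoding (parent map plus label map) is an illustrative assumption.

```python
# Brute-force reference for the subpath kernel via subpath spectra.
from collections import Counter

def spectrum(parent, label):
    """Multiset of upward label sequences (subpaths) of a tree."""
    spec = Counter()
    for v in parent:
        path = []
        while v is not None:
            path.append(label[v])
            spec[tuple(path)] += 1
            v = parent[v]
    return spec

def subpath_kernel(t1, t2, alpha):
    s1, s2 = spectrum(*t1), spectrum(*t2)
    # Each shared subpath p contributes alpha**len(p) per occurrence pair.
    return sum(s1[p] * s2[p] * alpha ** len(p) for p in s1.keys() & s2.keys())

x = ({"html": None, "body": "html"},
     {"html": "html", "body": "body"})
y = ({"html": None, "body": "html", "p": "body"},
     {"html": "html", "body": "body", "p": "p"})
print(subpath_kernel(x, y, 1.0))  # shared: html, body, body:html -> 3.0
```

The cited algorithms [4,5] obtain the same quantity in time linear in the tree sizes, which is what makes the method practical for DOM trees with thousands of vertices.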

Extraction of Suffixes
For better efficiency of computation, the first step of suffix-array extraction is the assignment of a unique identifier to each HTML tag; in this section, we use the assignment illustrated in Figure 9. A suffix is the sequence of identifiers of the HTML tags along the subpath from a vertex of a DOM tree to the root. Figure 9 exemplifies a DOM tree and all the suffixes extracted from it. Since the DOM tree consists of 17 vertices, 17 suffixes are extracted. For example, the subpath tr:tbody:table:body:html corresponds to the suffix [25, 24, 23, 8, 3]. In the suffix array, the suffixes are sorted in lexicographical order. A linear-time algorithm to compute the sorted suffix array is presented in [4].
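The extraction step above can be sketched directly: each vertex yields the root-ward sequence of tag identifiers, and the suffix array is the sorted list of these sequences. The tag-to-identifier mapping below is an assumption chosen to reproduce the [25, 24, 23, 8, 3] example from the text (the real system's full assignment is not shown here), and sorting is done with Python's built-in sort rather than the cited linear-time algorithm.

```python
# Sketch of suffix extraction (Section 4.1.1) for a tag-labeled tree.
ids = {"html": 3, "body": 8, "table": 23, "tbody": 24, "tr": 25}

def suffix_array(parent, tag):
    suffixes = []
    for v in parent:
        path, u = [], v
        while u is not None:          # climb from the vertex to the root
            path.append(ids[tag[u]])
            u = parent[u]
        suffixes.append(path)
    return sorted(suffixes)           # lexicographic order over id lists

parent = {"html": None, "body": "html", "table": "body",
          "tbody": "table", "tr": "tbody"}
tag = {v: v for v in parent}
for s in suffix_array(parent, tag):
    print(s)
# The suffix for the <tr> vertex is [25, 24, 23, 8, 3], matching the
# tr:tbody:table:body:html subpath in the text.
```

One suffix is produced per vertex, so a tree with 17 vertices yields 17 suffixes, as stated above.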

Generation of Least Common Prefix Arrays
To generate an LCP array for two trees x and y, the suffix arrays of x and y are first merged and then sorted in lexicographical order. Table 1 shows an example: the suffixes extracted from x are displayed in red, while those from y are in black.
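The merge-and-LCP step can be sketched as follows: after merging and sorting the two suffix arrays, the LCP array records, for each adjacent pair of suffixes, the length of their longest common prefix. The tiny suffix lists below are illustrative assumptions.

```python
# Sketch of LCP-array generation (Section 4.1.2) for two suffix arrays.
def lcp_array(suffixes_x, suffixes_y):
    merged = sorted(suffixes_x + suffixes_y)   # merge, then sort
    lcps = []
    for a, b in zip(merged, merged[1:]):       # adjacent suffix pairs
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        lcps.append(n)
    return merged, lcps

sx = [[3], [8, 3], [23, 8, 3]]
sy = [[3], [8, 3], [25, 8, 3]]
merged, lcps = lcp_array(sx, sy)
print(lcps)  # [1, 0, 2, 0, 0]
```

Adjacent equal suffixes (one from each tree) produce large LCP values, which is exactly the shared-subpath information the linear-time kernel algorithm of [5] consumes.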

Understanding the Decay Factor α
When computing K_SP(x, y), a subpath with n vertices shared between x and y is counted with a factor of α^n. Therefore, the contribution of long subpaths increases as α becomes greater. To be precise, if a subpath with n vertices is shared by x and y, they also share i subpaths with n + 1 − i vertices for each i, and therefore, its total contribution to the kernel value is evaluated by

γ(n, α) = Σ_{i=1}^{n} i · α^{n+1−i}.

Figure 10 shows the change of the ratio γ(10, α)/γ(m, α) when m and α change. We see that a single subpath with ten vertices is equivalent to 91 subpaths with two vertices for α = 1.5, and the ratio rapidly decreases as m increases and α decreases.
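The contribution γ(n, α) above (as reconstructed here from the surrounding text) can be computed directly, and the computation reproduces the factor-of-91 comparison stated for α = 1.5:

```python
# Sketch: total kernel contribution of one shared subpath with n vertices.
# A shared subpath with n vertices contains i subpaths with n + 1 - i
# vertices, each weighted by alpha to that length.
def gamma(n, alpha):
    return sum(i * alpha ** (n + 1 - i) for i in range(1, n + 1))

# One 10-vertex shared subpath vs. 2-vertex shared subpaths at alpha = 1.5:
print(round(gamma(10, 1.5) / gamma(2, 1.5)))  # 91, as stated in the text
```

This makes concrete how the decay factor α trades off the influence of long shared subpaths against short ones, which matters for the overfitting behavior observed in Section 5.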

The Proposed Method
After presenting the entire picture of the steps of our detection system, we examine experimental results on the performance of each step. In the experiments, we use an iMac Pro with a 14-core 2.5 GHz Intel Xeon W and 128 GB of memory. Figure 11 depicts the entire structure of our proposed system. HTML documents are first input into an HTML parsing/sanitization program, which extracts DOM trees from them after eliminating or correcting any non-standard markup. The obtained DOM trees are input into a suffix-extracting program (Section 4.1.1), and the resulting suffix arrays are stored in a database. Since the suffix array of a DOM tree will be used multiple times to compute subpath kernel values with other DOM trees, storing the suffix array in a database improves the efficiency of the entire process.

The Entire Picture
In the next step, the necessary subpath kernel values are computed. When a training dataset consists of DOM trees x_1, ..., x_n, the complete Gram matrix G = [k(x_i, x_j)]_{i,j=1,...,n} is computed, where k(x_i, x_j) is the normalized value of the subpath kernel K_SP, determined by

k(x, y) = K_SP(x, y) / √(K_SP(x, x) · K_SP(y, y)).

The normalization is necessary to eliminate undesirable effects caused by the different sizes of DOM trees (Section 3.1). For prediction, kernel values are computed only between the DOM trees in a test dataset and the DOM trees that comprise the support vectors in the training dataset. To compute the subpath kernel value K_SP(x, y), the LCP array between x and y is first computed (Section 4.1.2) and is then input into the linear-time algorithm presented in [5]. In the parameter optimization step, the optimal values of the two hyper-parameters, the decay factor α and the regularization coefficient C of SVC, are computed through a grid search.
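The Gram-matrix step with cosine normalization can be sketched as follows. The stand-in kernel on small feature vectors is an assumption used only so the example runs without the full subpath-kernel machinery; any symmetric positive definite function can be plugged in.

```python
# Sketch: normalized Gram matrix, with k(x, y) = K(x, y)/sqrt(K(x,x)K(y,y)).
import math

def normalized_gram(items, kernel):
    raw = [[kernel(a, b) for b in items] for a in items]
    n = len(items)
    return [[raw[i][j] / math.sqrt(raw[i][i] * raw[j][j])
             for j in range(n)] for i in range(n)]

# Stand-in kernel: an inner product on toy feature vectors.
feats = {1: (1.0, 0.0), 2: (1.0, 1.0), 3: (0.0, 2.0)}
k = lambda a, b: sum(p * q for p, q in zip(feats[a], feats[b]))
G = normalized_gram([1, 2, 3], k)
print(round(G[0][1], 4))  # 0.7071; the diagonal is all 1.0
```

After normalization, every diagonal entry is 1.0 regardless of tree size, which is exactly the size-invariance motivated in Section 3.1.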
In the prediction step, a separating hyper-plane in the reproducing kernel Hilbert space is computed with the training dataset and the associated optimal hyper-parameters, and then, a predicted label of a DOM tree in the test dataset is computed by SVC.

Suffixes Extraction
The suffix-array extraction program of our system takes a batch of DOM trees as input and then computes the associated suffix arrays efficiently through parallel computation. Figure 12 shows the averaged time in milliseconds to extract suffix arrays from a single DOM tree when changing the batch size from 50 to 1000. Although the time scores for n = 50 and 100 are higher because of the overhead of parallel computation, we see that our program can process a single DOM tree in 20 milliseconds on average.

Gram Matrix Generation
Since a Gram matrix is symmetric, computing an n-dimensional Gram matrix requires the computation of n(n+1)/2 kernel values. Hence, the effect of deploying parallel computation is expected to be greater than for suffix-array extraction. Figure 13 shows the total run-time scores of computing Gram matrices for n varying from 50 to 1000. Although the total run-time (the blue line) should in theory be proportional to the number of kernel values to compute (the orange line), we can see a gap between them caused by the overhead of parallel computation.
In fact, Figure 14 plots the averaged time to compute a single kernel value, that is, the ratio of the total run-time to the number of kernel values. The effect of the overhead is evident when n is small. As n increases, the contribution of parallel computation becomes remarkable, and the averaged time drops to only 60 microseconds.

Hyper-Parameter Optimization
To build optimal models, there are two adjustable hyper-parameters: the decay factor α for the subpath kernel and the regularization coefficient C for SVC. Our program for hyper-parameter optimization performs a simple grid search, changing α from 0.1 to 2.0 with an interval of 0.1 and log10 C from −5 to 5 with an interval of 1. The total number of combinations of α and C to test is 220. Each combination is evaluated using five-fold cross-validation scores. We evaluate the performance of this step from both the efficiency and the accuracy points of view. Figure 15 shows that the relationship between the total run-time for optimizing the hyper-parameters and the size n of the training dataset fits an increasing regression line with a determination coefficient as high as 0.968. Furthermore, the run-time for n = 1000 is 3004 s ≈ 50 min. Due to the excellent generalization performance of SVC, we can obtain good models even for n ≤ 1000, and hence, model updates can be performed quickly. Figure 16a consists of a plane view and a contour plot of the cross-validation scores (accuracy scores) obtained through the grid search for n = 1000. Except for a steep drop-off, where the score falls from 1.0 to 0.857 within the rectangular area with 1.4 ≤ α ≤ 2.0 and −5 ≤ log10 C ≤ −1, the score is mostly 1.0.
When investigating the cases of fake and authentic sites separately (Figure 16b,c), the accuracy score for the fake sites, computed by TP/(TP + FN), varies within a narrow range between 0.976 and 1.0, while the score for the authentic sites, computed by TN/(TN + FP), has a steep drop-off in the same rectangular area as the score for the entire dataset. For the smaller training datasets we have tested, we observe the same property in most cases (Figure 17). In particular, the maximum cross-validation score is 1.0 in all cases, including when the dataset size is as small as 50.

Prediction
Next, using the 5946 validation examples, we test only the parameter combinations that have shown the highest cross-validation score of 1.0. Figure 18a shows the obtained accuracy scores. Interestingly, it turns out that the score drops rapidly from 0.98 to 0.8 when the decay factor exceeds 1.3. This implies that overfitting has occurred. When we focus on the fake sites, as Figure 18b shows, the accuracy score varies only within the narrow range between 0.992 and 0.999, and the identification of fake sites turns out not to be sensitive to hyper-parameter selection. On the other hand, Figure 18c shows that learning authentic sites with a decay factor higher than 1.3 is likely to generate overfitting models, whose accuracy score can be as low as 0.579.
As seen in Section 4.2, the subpath kernel with a high decay factor weights longer shared subpaths more heavily. Hence, the validation results show that pages of authentic sites are better characterized by shorter subpaths than by longer ones. Furthermore, Figure 19 shows the validation results when we use training datasets whose sizes are smaller than 1000. Surprisingly, even small datasets can exhibit very high accuracy; for example, a dataset with only 50 examples shows an accuracy score of 0.994. Also, we can observe overfitting occurring in the same way and under almost the same conditions as with a dataset of size 1000, and SVC shows the best accuracy when the parameters are selected from the area 0.9 ≤ α ≤ 1.1 and 0 ≤ log10 C ≤ 5.
Once the hyper-parameters α and C are specified, we can predict whether unknown sites are fake or authentic based on their DOM trees.
The first step of prediction is to train an SVC based on the hyper-parameter values obtained in the preceding optimization process and the training dataset used. The output of training an SVC is a model that specifies the separating hyperplane computed through training.
In our system, we assume that, once an SVC is trained, we continue to use the same SVC to make many predictions. Hence, the time efficiency of training an SVC is not critical. Figure 20 shows the runtime t in seconds to train an SVC when the size n of a dataset varies. When excluding one outlier (n = 400), the relation between t and n fits a line with determination coefficient 0.990. When n = 400, the objective function converged exceptionally slowly, and the program of libSVM [36] reached the default upper limit of 10,000,000 iterations. Even including this exceptional datum, the total runtime for training does not greatly exceed one minute. The actual output of training an SVC is a subset of examples of the training dataset, the support vectors, which uniquely determines a hyperplane. To make a prediction on a new DOM tree, it suffices to compute the kernel values between the new DOM tree and the support vectors. Since these DOM trees have been converted into suffix arrays in the suffix array extraction process, computing these kernel values can be performed very quickly by taking advantage of the algorithm introduced by [5].
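The point that prediction needs only kernel values against the support vectors can be sketched as follows. All names and the toy model (scalar "trees", product kernel, made-up coefficients) are illustrative assumptions; the real system uses the subpath kernel over suffix arrays and a model trained by libSVM.

```python
def decision_value(x_new, support_vectors, dual_coefs, bias, kernel):
    # Only kernel values against the support vectors are needed, so the
    # cost of one prediction is linear in the number of support vectors.
    return sum(a * kernel(sv, x_new)
               for sv, a in zip(support_vectors, dual_coefs)) + bias

def predict(x_new, support_vectors, dual_coefs, bias, kernel):
    d = decision_value(x_new, support_vectors, dual_coefs, bias, kernel)
    return "fake" if d >= 0 else "authentic"

# Toy "trained model": two support vectors with signed dual coefficients.
svs, coefs, b = [1.0, -1.0], [0.5, -0.5], 0.0
kernel = lambda u, v: u * v
```

This is why Figure 21 shows prediction time closely proportional to the number of support vectors rather than to the training set size.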
In fact, Figure 21 shows the averaged runtime to make a prediction for a single unknown DOM tree and the number of support vectors for eleven training datasets with different sizes. We can see that the runtime is closely proportional to the number of support vectors. Although the runtime for n = 200 is exceptionally high, it is only 65 milliseconds. For the other datasets the scores are lower than 25 milliseconds. This indicates that our system can realize real-time detection of fake sites. Table 2 exhibits the comparison of our method, trained with 100 and 1000 data respectively, against seven benchmark methods for phishing site detection in terms of accuracy performance. The accuracy performance is evaluated using the four measures of precision, recall, F-score and accuracy rate, where possible. We see that our method trained with 1000 data (500 fake and 500 authentic) outperforms the others on all four measures. Interestingly, even when our method is trained with only 100 data, the measurements are almost comparable with the best of the benchmark methods.

Comparison in Accuracy Performance
Mainly because the datasets used for evaluation differ among the methods, we cannot statistically conclude that our method is superior to the benchmark methods. However, we should emphasize that our method leverages robust features derived from distributions of subpaths in DOM trees. In contrast, the benchmark methods mainly rely on shallow features, which adversaries can alter without difficulty.
Furthermore, we should note that our method exhibited high accuracy using small datasets for training and significantly larger datasets for testing. This observation is clear evidence of the excellent generalization performance of our method. Its practical implications are twofold: we can reduce the frequency of model renewal, and model renewal itself can be performed very efficiently.

Conclusions
We have shown a method to realize highly accurate and real-time detection of fake e-commerce sites based on DOM tree similarity. We demonstrated its accuracy and speed on a real dataset provided by Rakuten. While the efficiency and performance of our method alone make it compelling, another advantage is the minimal amount of training data required. We believe our method has the potential to increase the safety of e-commerce solutions. Not only does it detect fake sites quickly and with high accuracy, given that it requires only small training data sets, it allows web security software to update itself more quickly against new threats as they emerge.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. A Theory of Positive Definite Kernels
A real-valued kernel is a symmetric bivariate function K : X × X → R defined over a set X. A positive definite kernel, on the other hand, is defined as follows. Definition A1. A kernel K : X × X → R is positive definite, if, and only if, ∑_{i=1}^{n} ∑_{j=1}^{n} c_i c_j K(x_i, x_j) ≥ 0 holds for any n ∈ N, {x_1, . . . , x_n} ⊆ X and {c_1, . . . , c_n} ⊆ R.
The simplest example of a positive definite kernel is the dot product over a finite-dimensional vector space X = R^n: the dot product of v = (v_1, . . . , v_n) and w = (w_1, . . . , w_n) in R^n is v · w = ∑_{i=1}^{n} v_i w_i. Conversely, Theorem A1 asserts that any positive definite kernel can be viewed as an inner product defined over a linear space.
An n-dimensional matrix [K(x_i, x_j)]_{i,j=1,...,n} given for {x_1, . . . , x_n} ⊂ X is referred to as a Gram matrix, and K is positive definite, if, and only if, the Gram matrix for any {x_1, . . . , x_n} ⊂ X has no negative eigenvalues [28]. For a positive definite kernel, we have Theorem A1 (Schönberg [28]). K is positive definite, if, and only if, there exists a projection π of X into a real Hilbert space H with inner product ⟨·, ·⟩ such that K(x, y) = ⟨π(x), π(y)⟩ for any (x, y) ∈ X × X. This H is called the reproducing kernel Hilbert space (RKHS).
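The eigenvalue criterion above is easy to check numerically; a minimal sketch (the function name and the dot-product example are our own illustrations):

```python
import numpy as np

def is_positive_definite_gram(G, tol=1e-10):
    """A kernel is positive definite iff every Gram matrix it produces
    has no negative eigenvalues; eigvalsh handles the symmetric case."""
    return bool(np.all(np.linalg.eigvalsh(G) >= -tol))

# Gram matrix of the dot-product kernel on three vectors: always valid.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
G = X @ X.T

# A symmetric matrix with a negative eigenvalue (eigenvalues are +1, -1):
# it cannot be the Gram matrix of any positive definite kernel.
H = np.array([[0.0, 1.0], [1.0, 0.0]])
```

The small tolerance absorbs floating-point noise around zero eigenvalues, which occur whenever the points are linearly dependent.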
A Hilbert space H is a linear space with inner product such that any Cauchy series with respect to the topology induced from the inner product converges to a point in H. The dimension of the RKHS of finite X is bounded above by |X |.
When X ⊆ R^D, positive definite kernels are commonly used to obtain linear separability, an important condition for applying some classifiers such as the support vector classifier (SVC). When a hyperplane separates a set of positive examples P = {v_1, . . . , v_m} from a set of negative examples N = {w_1, . . . , w_n}, we say that P and N are linearly separable. For P and N that are not linearly separable, we can leverage Theorem A1 to map the points of X into a Hilbert space H so that the images of P and N become linearly separable in H. For this purpose, we can use Gaussian kernels, also known as RBF kernels, defined by G(x, y) = exp(−‖x − y‖² / (2σ²)), for example.
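The Gaussian kernel formula above translates directly to code; a minimal sketch (the function name is ours):

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """RBF kernel G(x, y) = exp(-||x - y||^2 / (2 sigma^2)).

    Values lie in (0, 1]: 1 exactly when x == y, decaying toward 0
    as the points move apart, with sigma controlling the decay rate.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

For identical points the kernel is exactly 1, and at distance 1 with σ = 1 it is exp(−0.5) ≈ 0.607.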
Theorem A1 proves its merits when the elements of X are not vectors. In fact, we assume that X is a space of (DOM) trees in this paper. The projection π maps trees to points in H, and hence, we can apply multivariate analysis to the projected points. Moreover, many techniques of multivariate analysis rely only on inner products of the projected points, and these inner products can be computed through the kernel function. This proves to be significantly faster in many cases than direct computation of inner products in H, which can be high-dimensional.
To obtain positive definite kernels for trees, the convolution kernel and Haussler's theorem are fundamental tools. We assume a kernel k : Ω × Ω → R defined over a set Ω and let X consist of finite subsets of Ω. For (x, y) ∈ X × X, K(x, y) = ∑_{v ∈ x} ∑_{w ∈ y} k(v, w) defines a convolution kernel. In our setting, Ω is a space of vertices, and k is a similarity function for vertices. A typical example of k is Kronecker's delta function δ_{v,w} on labels of vertices: if two vertices v and w have the same label, k(v, w) = 1, and k(v, w) = 0, otherwise. Moreover, we denote a tree with a vertex set x ⊆ Ω by the same symbol x, and therefore, the associated convolution kernel is a tree kernel.
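The convolution kernel with the delta similarity can be sketched directly from the double sum above (the function names are ours; vertices are represented by their labels for simplicity):

```python
def delta(v, w):
    # Kronecker's delta on vertex labels.
    return 1.0 if v == w else 0.0

def convolution_kernel(x, y, k=delta):
    """K(x, y) = sum over all pairs (v, w) in x * y of k(v, w).

    With the delta similarity this counts label-matching pairs, i.e.
    the sum over labels of count_x(label) * count_y(label).
    """
    return sum(k(v, w) for v in x for w in y)
```

For example, between vertex sets with labels ["a", "b", "a"] and ["a", "c"], the two "a" vertices on the left each match the single "a" on the right, giving a kernel value of 2.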
Theorem A2. (Haussler [29]). If k is positive definite, the associated convolution kernel K is positive definite.

Appendix B. Rooted Ordered Trees
We first define rooted trees; in this paper, a rooted tree is a particular class of partially ordered sets.
We let (V, ≤) be a partially ordered set. Definition A2. A partially ordered set (V, ≤) is a rooted tree, if, and only if, the following conditions are met.
1. The root r ∈ V exists such that r ≤ v for any v ∈ V. 2. For any v ∈ V, V_v = {w | w ≤ v} is totally ordered.
In the definition, V is the set of vertices of a tree, and the order ≤ determines a generation order. We say v ∈ V is an ancestor of w ∈ V, if v < w holds. If v is the nearest ancestor of w, we say that v is the parent of w and that w is a child of v. A vertex with no children is called a leaf. Also, the nearest common ancestor v ∧ w of {v, w} ⊆ V is the maximum vertex u such that u ≤ v and u ≤ w. For any (v, w) ∈ V × V, v ∧ w always exists and is unique. On the other hand, a rooted tree t with an entire set of leaves L is called an ordered tree under the following condition. Definition A4. If the leaves are numbered as L = {l_1, . . . , l_{|L|}} so that l_i ∧ l_j ∧ l_k = l_i ∧ l_k holds for any 1 ≤ i < j < k ≤ |L|, we say that the tree is ordered. Figure A1 exemplifies a numbering of leaves of an ordered tree. Also, a leaf numbering canonically introduces a traversal order among the vertices. Definition A5. For two distinct vertices v and w of a rooted ordered tree (V, ≤), we define v ≺ w, if, and only if, max{i | l_i ≥ v} < min{i | l_i ≥ w} holds.
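The nearest common ancestor defined above can be computed directly from parent pointers; a minimal sketch, where the `parent` dictionary encoding of a tree is our own illustrative assumption:

```python
def ancestors(parent, v):
    """The chain V_v = {w | w <= v}: v, its parent, ..., up to the root."""
    chain = [v]
    while parent[v] is not None:
        v = parent[v]
        chain.append(v)
    return chain

def nearest_common_ancestor(parent, v, w):
    """v ^ w: the deepest vertex below (or equal to) both v and w."""
    above_v = set(ancestors(parent, v))
    for u in ancestors(parent, w):     # walk upward from w
        if u in above_v:
            return u                   # first hit is the nearest one

# Tree:  r -> a -> {b, c},  r -> d   (None marks the root's parent)
parent = {"r": None, "a": "r", "b": "a", "c": "a", "d": "r"}
```

Because every V_v is totally ordered (condition 2 of Definition A2), the two ancestor chains always meet, which is why v ∧ w exists and is unique.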

Appendix C. Edit Distances for Trees
We briefly study edit distances for trees. In particular, we introduce the constrained distance (Appendix C.3) and the degree-two distance (Appendix C.4).

Appendix C.1. Taï Distance and Its Variations
Levenshtein [2] first defined the edit distance for strings, and later, Taï [1] extended it to trees. The Taï distance is defined as the minimum length of a path from a tree t_1 to another tree t_2 in the entire graph of trees. The graph includes trees as vertices, while an edge is defined between two trees, if, and only if, one is converted into the other by substitution, deletion or insertion of a single vertex. For example, Figure A2 depicts the shortest path t_1 → t_2 → t_3 → t_4, and hence, the Taï distance between t_1 and t_4 is 3. Many efforts to improve the Taï edit distance, in particular in terms of computational efficiency, followed, and we have many variations [39-47]. Taï distance is not only the first tree edit distance proposed in the literature but also the most common. An important problem with Taï distance is its high computational complexity. Computing Taï distances for unordered rooted trees is known to be NP-hard. Although its computational complexity for ordered rooted trees is polynomial-time, the original algorithm presented in [1] required significantly heavy computation in practice.
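The string ancestor of Taï distance is short enough to state in full; a textbook dynamic-programming sketch of Levenshtein distance (the function name is ours):

```python
def levenshtein(s, t):
    """Minimum number of single-character substitutions, deletions and
    insertions converting s into t -- the string case that Tai [1]
    generalized to trees."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                    # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                    # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete s[i-1]
                          d[i][j - 1] + 1,         # insert t[j-1]
                          d[i - 1][j - 1] + cost)  # substitute
    return d[m][n]
```

The tree case is far harder because the dynamic program must respect ancestry and sibling order, which is exactly where the complexity results discussed below come from.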
Much effort has been made to develop efficient algorithms to compute Taï distances.
• Zhang and Shasha [40] proposed an algorithm of O(|X||Y| min(w(X), h(X)) min(w(Y), h(Y)))-time: X and Y denote rooted ordered trees; |X|, w(X) and h(X) denote the size (the number of vertices), the width (the number of leaves) and the height (the length of the longest path from the root to a leaf) of X. Depending on the shapes of the trees, this varies between O(|X||Y|) and O(|X|²|Y|²).
• Klein [48] improved the efficiency to O(|X|²|Y| log |Y|)-time by taking advantage of decomposition strategies [49].
• Demaine et al. [50] further optimized this technique and presented an algorithm of O(|X|³)-time. Demaine's algorithm is the fastest in terms of asymptotic evaluation, but it easily lapses into the worst case. Therefore, the algorithm of Zhang and Shasha in fact outperforms Demaine's algorithm in many practical cases.
• RTED, an algorithm developed by Pawlik and Augsten [51], not only has the same asymptotic complexity as Demaine's algorithm but also almost always outperforms its competitors in practice.
Furthermore, the space complexity of Zhang's algorithm, Demaine's algorithm and RTED is O(|X||Y|), which is practically small.
Although the aforementioned improvements in efficiency were remarkable, the asymptotic time complexity of O(|X|³) is still too heavy for some practical applications. In the literature, several new distances have been proposed to replace Taï distance. In the following, we introduce two of the most important examples.

Appendix C.2. Mappings Associated with Edit Paths
In Section 3.1, we introduced a graph of trees whose vertices are trees, while an edge connecting two trees X and Y means that X is converted into Y by a single edit operation: substituting a new label for the label of a vertex of X, deleting a vertex of X, or inserting a vertex of Y (Figure A2). An edit path, on the other hand, is a path in this graph, and therefore represents a sequence of edit operations that converts the source tree into the destination tree.
An edit path from X to Y associates a vertex v of X that the edit path does not delete with the vertex µ(v) of Y with which the edit path replaces v. The entire collection of such pairs (v, µ(v)) determines a mapping, that is, a subset µ ⊆ X × Y. For simplicity, we let the same symbols X and Y denote the vertex sets of the trees X and Y. We call this µ the mapping associated with the edit path. A mapping associated with an edit path is one-to-one. Furthermore, Taï proved Theorem A3 (Ref. [1]). For rooted ordered trees X and Y, a mapping µ ⊆ X × Y is associated with some edit path, if, and only if, µ preserves the generation and traversal orders of X and Y.
This theorem asserts that M^T_{X,Y}, the set of order-preserving mappings, is identical to the set of mappings associated with edit paths. With M^T_{X,Y}, Taï distance is determined by d_T(X, Y) = min_{µ ∈ M^T_{X,Y}} { |X| + |Y| − 2|µ| + ∑_{(v,w) ∈ µ} (1 − δ_{v,w}) }. The function δ_{v,w} is 1, if the labels of v and w are identical, and 0, otherwise.
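The cost that a mapping assigns to an edit path can be sketched as follows: vertices of X outside the mapping are deleted, vertices of Y outside it are inserted, and mapped pairs with differing labels are substituted. This is the standard formulation; the function name and the toy trees are ours.

```python
def mapping_cost(size_x, size_y, mapping, label_x, label_y):
    """Edit cost of the path associated with mapping mu:
    (|X| - |mu|) deletions + (|Y| - |mu|) insertions, plus one
    substitution per mapped pair whose labels differ (1 - delta_{v,w})."""
    relabels = sum(1 for v, w in mapping if label_x[v] != label_y[w])
    return (size_x - len(mapping)) + (size_y - len(mapping)) + relabels

# Hypothetical trees with three vertices each; vertices 0 and 1 are mapped.
label_x = {0: "a", 1: "b", 2: "c"}
label_y = {0: "a", 1: "x", 2: "y"}
cost = mapping_cost(3, 3, [(0, 0), (1, 1)], label_x, label_y)
```

Taï distance is then the minimum of this cost over all order-preserving mappings in M^T_{X,Y}; here the cost is one deletion, one insertion, and one relabeling.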

Appendix C.3. The Constrained Distance
The constrained distance [52] is defined by imposing the constraint described below on the mappings of Taï distance. Zhang has also presented an efficient algorithm to compute constrained distances, whose time and space complexity is O(|X||Y|). Although Richter [53] independently introduced the structure-respecting distance, tailoring Taï distance to particular applications, Bille [54] has shown that the constrained and structure-respecting distances are identical.
To describe the constraint of the constrained distance, we have to introduce the concept of separable vertex sets.
Definition A6. We let S and T be two subsets of vertices of a tree and let S^∧ (T^∧) denote the nearest common ancestor of the vertices in S (T). S and T are separable, if, and only if, neither S^∧ ≤ T^∧ nor S^∧ ≥ T^∧ holds. Definition A7. A mapping µ ⊆ X × Y is said to be separable, if, and only if, (1) µ preserves the generation and traversal orders of X and Y and (2) µ(S) and µ(T) are separable in Y for any separable subsets S and T in X.
The partial mapping depicted in Figure A3 is separable. Definition A8. An edit path of Taï distance is said to be separable, if, and only if, the associated mapping is separable.
We denote the entire set of separable edit paths from X to Y by Π^S_{X,Y} and the entire set of mappings associated with edit paths in Π^S_{X,Y} by M^S_{X,Y}. Then, the constrained distance is determined by d_C(X, Y) = min_{µ ∈ M^S_{X,Y}} { |X| + |Y| − 2|µ| + ∑_{(v,w) ∈ µ} (1 − δ_{v,w}) }. Figure A3. A separable mapping {(v_i, w_i)}_{i=1,...,5} ∈ M^S_{X,Y}.
Appendix C.4. The Degree-Two Distance

The degree-two distance [41] imposes the following constraint on the primitive edit operations of deletion and insertion: roots must not be deleted or inserted, and only vertices of degree one or two can be deleted or inserted. The degree of a vertex is the number of edges incident to it, and the degree-two distance is the minimum length of edit paths under this constraint. The time and space complexity of computing the degree-two distance is O(|X||Y|). Figure A4 exemplifies an edit path of the degree-two distance. In X, the vertex v_d is of degree one, and hence, we can delete it under the constraint of the degree-two distance. By deleting v_d, the degree of v_b changes from three to two in X_2, and hence, we are allowed to delete it. Also, we can insert v_f between v_a and v_g, because the resulting degree of v_f in X_5 is two. For the same reason, we can insert v_d below v_f. The length of this edit path is five. Also, it is easy to see that this is the shortest edit path under the constraints, and hence the degree-two distance between X and Y turns out to be five. We have Theorem A4. For µ ⊆ X × Y that preserves the generation and traversal orders, µ is a mapping associated with an edit path of the degree-two distance, if, and only if, (v ∧ v′, w ∧ w′) ∈ µ holds for any (v, w) ∈ µ and (v′, w′) ∈ µ.
Thus, we see that M^E_{X,Y} is identical to the entire set of mappings associated with edit paths of the degree-two distance. Hence, the degree-two distance between X and Y is determined by d_{D2}(X, Y) = min_{µ ∈ M^E_{X,Y}} { |X| + |Y| − 2|µ| + ∑_{(v,w) ∈ µ} (1 − δ_{v,w}) }. Figure A4. An edit path of the degree-two distance (substitute v_g for v_e; insert v_f; insert v_d).
turns out to be normal as well. Since a normal triangular matrix is always diagonal, A is a diagonal matrix, and its diagonal elements constitute the eigenvalues of G. In particular, all of the diagonal elements of A are non-negative, since G is positive (semi-)definite, and a diagonal matrix √A such that √A·√A = A can be determined. Hence, we have G = (√A U)^T (√A U). We let √A U = [v_1, . . . , v_n] with column vectors v_i ∈ R^n and define π by π(x_i) = v_i. Then K(x_i, x_j) = v_i^T v_j holds.
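The projection π constructed above can be realized numerically. The sketch below uses a symmetric eigendecomposition (`numpy.linalg.eigh`), which reaches the same factorization G = V V^T as the triangular-decomposition argument in the text; the Gram matrix is a small made-up example.

```python
import numpy as np

# Gram matrix of a positive (semi-)definite kernel on three points.
G = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])

# Symmetric eigendecomposition: G = U diag(lam) U^T with real lam.
lam, U = np.linalg.eigh(G)

# Scale each eigenvector column by sqrt(eigenvalue): V = U diag(sqrt(lam)).
# The i-th row of V is the feature vector pi(x_i).
V = U * np.sqrt(np.maximum(lam, 0.0))

# Entrywise, K(x_i, x_j) = <pi(x_i), pi(x_j)> must recover G.
reconstruction_ok = bool(np.allclose(V @ V.T, G))
```

Non-negativity of `lam` is exactly the eigenvalue characterization of positive definiteness from Appendix A, and clamping with `maximum(lam, 0.0)` only guards against floating-point noise.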