Deep Web Search Interface Identification: A Semi-Supervised Ensemble Approach

Abstract: To surface the Deep Web, one crucial task is to predict whether a given web page has a search interface (a searchable HyperText Markup Language (HTML) form) or not. Previous studies have focused on supervised classification with labeled examples. However, labeled data are scarce, hard to obtain and require tedious manual work, while unlabeled HTML forms are abundant and easy to obtain. In this research, we consider the plausibility of using both labeled and unlabeled data to train better models that identify search interfaces more effectively. We present a semi-supervised co-training ensemble learning approach that uses both neural networks and decision trees to deal with the search interface identification problem. We show that the proposed model outperforms previous methods that use only labeled data. We also show that adding unlabeled data improves the effectiveness of the proposed model.


Introduction
The Deep Web (also called the Invisible Web and the Hidden Web) refers to a part of World Wide Web content that is different from the Surface Web, which can be crawled and easily indexed by traditional search engines [1]. The vast information of the Deep Web is located behind specific web search interfaces, usually in the form of HTML forms [2], and can be surfaced only by formulating a search query on such interfaces [3]. Understanding the search interfaces [4], sampling the web databases [5] and classifying and integrating web sources [6,7] are key problems that often arise in mining the Deep Web.
In dealing with the above-mentioned problems, one usually assumes that search interfaces have already been identified and are ready to use. However, identifying whether a web page contains search interfaces or not is still challenging and non-trivial to date. If the scale of the Deep Web under investigation is small, we could manually judge and collect the search interfaces, as was done in [8,9]. However, this method will not work for the real Deep Web, which is estimated to be 500 times larger than the Surface Web [1].
The search interface identification process has to be automatic. In a machine learning environment, a binary classifier is needed to differentiate searchable interfaces from non-searchable ones. The past decade has seen various approaches to identify search interfaces, including decision trees [2], adaptations of random forests [10] and other combinations of machine learning techniques [11]. Regardless of which learning algorithm is used, most identification approaches are supervised and have to face the problem of the scarcity of labeled data.
In this paper, we introduce a semi-supervised search interface identification approach, which is different from most previous supervised classification techniques [2,10,12,13] and from the semi-supervised method proposed in [14]. In our approach, two base classifiers, namely neural networks and decision trees, are trained alternately to obtain additional diversity data from unlabeled examples. We then use the unlabeled diversity data to increase the diversity of the base classifiers in the ensemble. Experiments show that our proposed approach, SSCTE (semi-supervised co-training ensemble), outperforms state-of-the-art supervised classifiers, such as K Nearest Neighbours (KNN), Support Vector Machines (SVM), boosting and random forests, in most cases.
As a second contribution, we provide a group of search interface identification datasets for the Deep Web research community. Previous research mainly adopted the University of Illinois at Urbana-Champaign (UIUC) Web Integration Repository [15] in their experiments, but the number of examples actually used in related experiments varies. This is because some of the URLs in the datasets are out-of-date and no longer available, and thus, their experiments might not be verifiable. This inspired us to create a repository of our own and to make it public with both URLs and the downloaded HTML forms. Our datasets can be downloaded from the web site (https://github.com/whcsu/sscte/) or will be sent upon request. It is by far the largest dataset available in terms of labeled search interface examples.
The rest of the paper is organized as follows. Section 2 overviews related work in search interface identification and semi-supervised ensemble learning. In Section 3, we propose a novel search interface identification approach based on semi-supervised co-training ensembles. The experimental setup and result analysis are described in Section 4. Finally, in Section 5, we conclude the paper.

Related Work
In this section, we briefly review previous studies on search interface identification and take a glimpse at semi-supervised classification and semi-supervised ensembles in particular. Later on, we shall develop a novel search interface identification algorithm within the semi-supervised ensemble learning framework.

Search Interface Identification
The search interface identification problem can be stated formally in the following way: given a set of web pages $W_A$, find $W_S \subset W_A$ that contains searchable interfaces. Since HTML forms constitute a large majority of search interfaces on the Deep Web, most studies focus on HTML form-based search interfaces for identification.
A search form, through which a query can be issued by modifying the controls and then submitted for further processing, is usually a section of an HTML file that begins with a <form> tag and ends with a </form> tag. A typical search form on www.arxiv.org and its corresponding source code are shown in Figures 1 and 2.
To automatically identify the search interfaces among a vast amount of HTML forms, a number of supervised machine learning algorithms have been proposed, and these methods fall into two categories: pre-query and post-query. Pre-query algorithms use a form classifier to judge an HTML file according to the form features in the interface. Post-query methods identify the form searchability by submitting queries through HTML forms in the interface, and a decision is made based on the returned result pages.
Cope et al. [2] proposed a pre-query approach with the C4.5 decision tree as the learning algorithm to classify HTML forms. This method was further developed by [11,12], but far fewer features were used. Shestakov [13] also applied decision tree algorithms in identifying search interfaces, but divided all HTML forms into two groups based on the number of visible controls and implemented two separate binary classifiers for classification. They demonstrated that such separation improves the system accuracy. Ye et al. [10] extended the random forest algorithm by applying a weighted feature selection during the building of individual tree classifiers. Wang et al. [16] proposed a hierarchical framework, which used an ontology in the web page classifier and in the form content classifier, and a C4.5 decision tree in the form structure classifier. Marin-Castro et al. [17] created eight heuristic rules based on extensive heuristic analysis to discriminate non-searchable from searchable HTML forms.
Bergholz and Chidlovskii [18] gave an example of the post-query approach. They constructed a form analyzer using different heuristics, such as the length of the input field and the presence or absence of password protection, to identify whether a form was queryable. They then applied a query prober to manage the automatic form filling and decided the usefulness of the form according to the results of probing. Lin and Zhou [19] studied simple search interfaces that have only a single input field but can accept multiple search keywords of different attributes, and provided a search interface probing strategy based on the query words' hit rate and the reappearance frequency of the query words in the result pages.
From the above, we may notice that the post-query approach requires automatic filling of HTML forms, which is very challenging for HTML forms with multiple input controls; it has therefore found very limited application so far. For the same reason, pre-query approaches dominate today's search interface identification solutions.
Previous search interface identification approaches have demonstrated the power of supervised classification algorithms. However, in building supervised classifiers, we often face the problem of scarce labeled examples together with a vast amount of unlabeled data. As semi-supervised learning (SSL) methods can exploit both labeled and unlabeled data to obtain a higher accuracy, we might turn to SSL for help [20,21].

Semi-Supervised Ensemble Learning
SSL falls between unsupervised learning (without any labeled examples) and supervised learning (with labeled examples only) [20]. In a semi-supervised classification framework, l labeled examples $L = \{x_1, \ldots, x_l\}$ and u unlabeled examples $U = \{x_{l+1}, \ldots, x_{l+u}\}$ are given and exploited together in training to find a good classifier with improved accuracy. SSL algorithms usually work in the following way: first, a classifier is trained on L, and an initial hypothesis H is obtained; then, H is used to classify the examples in U; the examples with high confidence are labeled, added to L and deleted from U; this process is iterated for a fixed number of times or stopped when U becomes empty.
Semi-supervised ensemble learning (also called semi-supervised multiple classifier systems, semi-supervised learning by disagreement) [22,23] is a kind of SSL algorithm that exploits unlabeled data collaboratively to increase the performance of the ensemble. Since highly diverse base classifiers are the key to the success of ensemble systems [24,25], different strategies in supervised ensemble learning, such as bagging [26], boosting [27] and random subspace [28], are extended to the semi-supervised framework to obtain a higher ensemble diversity and, hence, a higher classification accuracy.
Semi-supervised MarginBoost (SSMBoost), proposed in [29], generalized AdaBoost [30] to semi-supervised classification by redefining the margin notion to include unlabeled data. Like SSMBoost, Adaptive Semi-Supervised Ensemble (ASSEMBLE) [31] also adopted the MarginBoost notation for both labeled and unlabeled data. The major difference is that ASSEMBLE assigns pseudo-classes to the unlabeled data in constructing the ensembles. SemiBoost [32] exploited both the clustering assumption and the large margin criterion. It also used pairwise similarity measurements to guide the selection of unlabeled examples at each iteration. Different from the above boosting-based semi-supervised ensembles, co-forest [33] incorporated random forest in the semi-supervised framework and demonstrated the strength of bagging and the random subspace method in computer-aided medical diagnosis.
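The generic SSL loop described above can be illustrated with a minimal self-training sketch. It is not the method of this paper: the nearest-centroid classifier on one-dimensional points, the confidence score and the names `self_train`, `threshold` and `max_iter` are all assumptions made for illustration.

```python
from statistics import mean

def train_centroids(labeled):
    """Fit a toy nearest-centroid classifier on 1-D points: {label: centroid}."""
    by_label = {}
    for x, y in labeled:
        by_label.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_label.items()}

def predict_with_confidence(centroids, x):
    """Return (label, confidence) for a two-class model; confidence is in [0.5, 1]."""
    dists = {y: abs(x - c) for y, c in centroids.items()}
    label = min(dists, key=dists.get)
    total = sum(dists.values())
    confidence = 1.0 if total == 0 else 1.0 - dists[label] / total
    return label, confidence

def self_train(labeled, unlabeled, threshold=0.8, max_iter=10):
    """Train on L, label confident examples in U, move them to L, and repeat."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_iter):
        if not unlabeled:
            break  # U is empty
        h = train_centroids(labeled)  # current hypothesis H
        confident, rest = [], []
        for x in unlabeled:
            y, conf = predict_with_confidence(h, x)
            (confident if conf >= threshold else rest).append((x, y))
        if not confident:
            break  # nothing new can be labeled with high confidence
        labeled.extend(confident)
        unlabeled = [x for x, _ in rest]
    return train_centroids(labeled), unlabeled

L = [(0.0, "neg"), (1.0, "neg"), (9.0, "pos"), (10.0, "pos")]
U = [0.5, 1.5, 8.5, 9.5, 5.0]
model, remaining = self_train(L, U)
```

In this toy run the ambiguous point 5.0 is never labeled, which mirrors the idea that only high-confidence predictions are moved from U to L.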

A Semi-Supervised Co-Training Ensemble
Our proposed method addresses the search interface identification problem by turning it into an ensemble learning problem using both labeled and unlabeled data. As the key to successful ensemble learning methods is to construct individual base classifiers that are as diverse as possible, thus increasing the ensemble's diversity, a strategy of varying both the training data and the base classifiers themselves is adopted in the proposed algorithm.
In our approach, the most popular methods for achieving diversity from labeled training data, namely resampling techniques such as bagging and random subspace, are applied; these methods are detailed in the first part of this section. Later on, we focus on how to obtain further diversity from unlabeled training examples, which is also the major novelty of the proposed algorithm. Besides the data diversity, we also consider the diversity caused by the base classification algorithms themselves and discuss how to combine the outputs of these base classifiers effectively. Finally, we present our semi-supervised co-training ensemble (SSCTE) learning algorithm.

Diversity Generation from Data
In the proposed algorithm, we exploit both bagging and random subspace methods in training the base classifiers, as is done in co-forest [33]. In bagging [26], base classifiers are trained on a bootstrapped sample of the original training set. In random subspace (attribute bagging) [28], various feature subsets of the original data are created by sampling the whole feature set without replacement. Training data with different bootstrapped versions and feature subsets are then used to train diverse base classifiers.
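A minimal sketch of how the two resampling schemes can be combined to produce one training view per base classifier is given below. The function names, the `subspace_size` parameter and the returned out-of-bag index list are assumptions for illustration and are not taken from the authors' implementation.

```python
import random

def bootstrap_indices(n_examples, rng):
    """Bagging: draw n_examples row indices with replacement."""
    return [rng.randrange(n_examples) for _ in range(n_examples)]

def random_subspace(n_features, subspace_size, rng):
    """Random subspace: draw a feature subset without replacement."""
    return sorted(rng.sample(range(n_features), subspace_size))

def make_training_view(X, y, subspace_size, rng):
    """Return a bootstrapped, feature-subsampled copy of (X, y) plus OOB rows."""
    rows = bootstrap_indices(len(X), rng)
    cols = random_subspace(len(X[0]), subspace_size, rng)
    X_view = [[X[r][c] for c in cols] for r in rows]
    y_view = [y[r] for r in rows]
    # Rows never drawn by the bootstrap form the out-of-bag set.
    oob = sorted(set(range(len(X))) - set(rows))
    return X_view, y_view, cols, oob

rng = random.Random(42)
X = [[i * 100 + j for j in range(18)] for i in range(12)]  # 12 examples, 18 features
y = [i % 2 for i in range(12)]
X_view, y_view, cols, oob = make_training_view(X, y, subspace_size=6, rng=rng)
```

Each call yields a different pair of row and column samples, so base classifiers trained on successive views see different data.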
However, when a large amount of unlabeled data is available, diversity obtained from labeled data alone is not enough, and we should try to gain further diversity from the unlabeled examples, as well.
Melville et al. [34] introduced a supervised ensemble learning algorithm, called DECORATE, which artificially generates new wrongly-labeled examples (termed diversity data) and merges the diversity data with the training data to train new base classifiers. Their experiments show that the diversity data did increase the diversity among the base classifiers and, therefore, improved the classification accuracy. In this paper, instead of obtaining diversity data from labeled training examples, we want to use the information from the vast amount of unlabeled data to create it.
First, a base classifier trained on the labeled data is applied to make predictions on the unlabeled data. Then, unlabeled examples with a highly confident prediction probability are selected. These highly confident examples are deliberately misclassified (i.e., given class labels opposite to the predictions of the base classifier) and are used to create the diversity data pool. The total number of examples to be selected is specified as a fraction, $D_{fra}$, of the original training set size. The process of diversity data creation is shown in Algorithm 1 (diversity data creation).
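The selection-and-relabeling procedure described above can be sketched as follows. This is an illustrative reading rather than the authors' code: the callable `predict_proba`, the binary 0/1 label encoding and the argument `n_train` (the original training set size) are assumptions.

```python
def create_diversity_data(predict_proba, unlabeled, p_conf, d_fra, n_train):
    """Build diversity data from unlabeled examples.

    predict_proba(u) is assumed to return (predicted_label, probability)
    with labels encoded as 0 or 1.
    """
    pool = []
    for u in unlabeled:
        label, p = predict_proba(u)
        if p > p_conf:
            # Deliberately assign the class opposite to the prediction.
            pool.append((p, u, 1 - label))
    # Keep the most confident examples, up to d_fra times the training set size.
    pool.sort(key=lambda item: item[0], reverse=True)
    n_select = int(d_fra * n_train)
    return [(u, flipped) for _, u, flipped in pool[:n_select]]

def toy_model(u):
    """Toy base classifier: the score u itself is the probability of class 1."""
    return (1, u) if u >= 0.5 else (0, 1 - u)

D = create_diversity_data(toy_model, [0.99, 0.6, 0.02, 0.97, 0.5],
                          p_conf=0.95, d_fra=0.5, n_train=4)
```

With these toy inputs, three examples pass the confidence threshold and the two most confident ones are returned with flipped labels.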

Diversity Obtained from Base Classification Algorithms
Unstable classification algorithms (i.e., those for which a small change in the training set can lead to a remarkable change in the produced model), such as neural networks and decision trees, are good candidates for base classifiers in ensemble learning [24,26]. Most supervised or semi-supervised ensemble learning algorithms only consider one kind of base classification algorithm, either decision trees [33,35,36] or neural networks [37].
Neural network and decision tree classifiers are drastically different in nature, so they may make different (uncorrelated) errors on the same examples; i.e., examples misclassified by neural networks might be corrected by decision trees and vice versa. This built-in complementary mechanism increases the base classifiers' diversity and should therefore improve the classification accuracy of the ensemble as a whole.
Thus, instead of using only one kind of base learning algorithm, we use both neural networks and decision trees as the base algorithms, and these two kinds of classifiers are trained alternately in the whole process, as is done in the co-training algorithm [38]. This is why we call the algorithm semi-supervised co-training ensemble.

Aggregating the Base Classifiers
In ensemble learning algorithms, different base classifiers usually have different prediction capabilities. Classifiers that have high predictive power should be given higher weights. Thus, in SSCTE, we adopt a weighted majority mechanism in combining the base classifiers, and the weights of the base classifiers are based upon their performance on the so-called out-of-bag data.
In SSCTE, each base classifier is constructed using a different bootstrap sample from the original labeled data plus some diversity data. About 1/3 of the labeled data are left out of the bootstrap sample and not used in the i-th iteration. The out-of-bag (OOB) data generated in the i-th iteration are used to examine the performance of the i-th base classifier. The resulting OOB prediction accuracy $oob_i$ is then used to calculate the weight $w[i]$ of the i-th base classifier according to the following formula:
$$w[i] = \begin{cases} \dfrac{oob_{max} - oob_i}{oob_{max} - oob_{min}}, & oob_{min} < oob_i < oob_{max}; \\ 0, & oob_i \le \max(oob_{min}, 0.5), \end{cases} \qquad (1)$$
where $oob_{max}$ and $oob_{min}$ are the maximum and minimum values among all OOB accuracy values $oob_i$, respectively.
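The "about 1/3" out-of-bag fraction mentioned above follows from the standard bootstrap argument. The short derivation below assumes that each bootstrap sample draws n examples uniformly with replacement from n labeled examples; the symbol n is introduced here only for this illustration.

```latex
P(\text{example } x \text{ is out of bag})
  = \left(1 - \frac{1}{n}\right)^{n}
  \xrightarrow[n \to \infty]{} e^{-1} \approx 0.368
```

So roughly 37% of the labeled examples are expected to be absent from any given bootstrap sample, which is why OOB data are available for evaluating every base classifier.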

The Algorithm
In SSCTE (see Algorithm 2), a semi-supervised ensemble is generated iteratively.
Algorithm 2. A semi-supervised co-training ensemble (SSCTE) learning algorithm.
15: $C = C \cup C_i$
16: end for
17: calculate the weights $w[i]$ for all classifiers based on their $oob_i$ according to Formula (1)
Output: the learning ensemble C.
In prediction, a sample x is assigned the class label $y^*$ that receives the weighted majority of the votes:
$$y^* = \arg\max_{y} \sum_{i:\, C_i(x) = y} w[i].$$
Initially, a bootstrapped sample is generated from the original labeled dataset. In each iteration, the current base classifier is trained on a different bootstrapped dataset plus some diversity data created by the previous base classifier using Algorithm 1. The diversity data are then removed from the unlabeled data. This iterative process continues until the unlabeled data become empty or the maximum number of iterations is reached. Finally, the base classifiers trained in all iterations are combined using weighted majority voting.
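The final combination step, weighted majority voting, can be sketched independently of how the weights are obtained. The function name and the example labels and weights below are hypothetical.

```python
from collections import defaultdict

def weighted_majority_vote(predictions, weights):
    """Return the label whose supporting classifiers have the largest total weight."""
    totals = defaultdict(float)
    for label, w in zip(predictions, weights):
        totals[label] += w
    return max(totals, key=totals.get)

# Three base classifiers vote on one form; the weights are made up.
votes = ["search", "non-search", "non-search"]
weights = [0.9, 0.3, 0.4]
winner = weighted_majority_vote(votes, weights)
```

Here a single strongly weighted classifier (0.9) outvotes two weaker ones (0.3 + 0.4), which is the behavior that distinguishes weighted from plain majority voting.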

Experimental Results
In this paper, we classify a held-out test set using the SSCTE algorithm learned on a training set consisting of both labeled and unlabeled examples. We first describe the Deep Web search interface dataset used in the experiments and then discuss the two evaluation metrics and the statistical tests used. Finally, we present the experimental results.

Dataset Description
We evaluate the performance of our algorithm on real-world search interface datasets in Deep Web mining. The most related dataset used in previous research is the UIUC Web Integration Repository, provided by [15] in 2004. However, as time has gone by, many of the web links among the 447 query interfaces in the repository have become broken, and some domain names have even disappeared. Consequently, it is no longer adequate to compare the algorithms' performance on the outdated UIUC dataset. Thus, in this study, we only consider the dataset recently collected by the authors, for which both the original HTML forms and the processed data are publicly available. It is by far the largest search interface dataset used in similar research and can be downloaded from https://github.com/whcsu/sscte/. The Comma-Separated Values (CSV) and Attribute-Relation File Format (ARFF) files of the dataset are also available upon request. A brief description of the dataset is given in Table 1.
An important step in distinguishing search forms from non-search forms is to characterize the HTML forms embedded in the Deep Web HTML files. This is usually done by extracting certain features within the forms. As shown in [2], features such as the numbers of "text" INPUT controls and SELECT controls are good indicators of form searchability. As suggested in [10], FORM attributes, such as "method" and "action", are also considered in our research. These features are extracted from the form element or control structures and are called "structural features" in our research. One may notice that in Figure 2, the word "search" occurs five times within the form body, and such semantically "positive" words help a human to determine form searchability. Generally, the LABEL name, the FORM action URL and the textual contents within the form are of great semantic importance, and they become "semantical features" in our extracted feature set.
In our approach, we have extracted 18 features to characterize Deep Web search forms. Statistical distributions of all 18 features in search and non-search forms are shown in Figure 3, where feature occurrences of search forms and non-search forms are indicated by red and blue colors, respectively.
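To make the distinction between structural and semantic features concrete, the sketch below extracts a handful of them from a form using only the Python standard library. It does not reproduce the paper's 18 features; the feature names (`text_inputs`, `selects`, `search_hits`) and the sample form are hypothetical.

```python
from html.parser import HTMLParser

class FormFeatureExtractor(HTMLParser):
    """Collects a few features from a page assumed to contain a single form."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.features = {"text_inputs": 0, "selects": 0,
                         "method": "", "action": "", "search_hits": 0}

    def _count_search(self, text):
        # Semantic cue: occurrences of the word "search" (case-insensitive).
        self.features["search_hits"] += (text or "").lower().count("search")

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
            # HTML default: the method is GET when the attribute is absent.
            self.features["method"] = (attrs.get("method") or "get").lower()
            self.features["action"] = attrs.get("action") or ""
        if not self.in_form:
            return
        for value in attrs.values():
            self._count_search(value)
        # An INPUT without a type attribute defaults to a text field.
        if tag == "input" and (attrs.get("type") or "text").lower() == "text":
            self.features["text_inputs"] += 1
        elif tag == "select":
            self.features["selects"] += 1

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

    def handle_data(self, data):
        if self.in_form:
            self._count_search(data)

SAMPLE = """
<form method="GET" action="/search">
  <label for="q">Search term</label>
  <input type="text" name="q">
  <select name="field"><option>Title</option><option>Author</option></select>
  <input type="submit" value="Search">
</form>
"""

extractor = FormFeatureExtractor()
extractor.feed(SAMPLE)
features = extractor.features
```

The resulting dictionary can be turned into one row of a feature matrix; a real extractor would need to handle pages with several forms and malformed markup.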

Evaluation Metrics and Statistical Tests
We use the following abbreviations for ease of explanation: P (# positive, i.e., search interface examples), N (# negative, i.e., non-search interface examples), TP (# true positives), TN (# true negatives), FP (# false positives) and FN (# false negatives). The evaluation metrics considered in this paper are classification accuracy (ACC) and area under the receiver operating characteristic (ROC) curve (AUC).
Classification accuracy can be calculated easily by the following formula:
$$ACC = \frac{TP + TN}{P + N}.$$
The AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative one [39]. The value of the AUC will always be between zero and one. Additionally, all workable classifiers should have an AUC larger than 0.5, i.e., better than random guessing. Usually, a classifier that has a greater area will have a better average performance.
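Both metrics can be computed directly from their definitions. The sketch below treats accuracy as the fraction of correct predictions and computes the AUC from its probabilistic interpretation by comparing every positive-negative pair, with ties counted as one half; the function names and toy data are illustrative only.

```python
def accuracy(y_true, y_pred):
    """ACC = (TP + TN) / (P + N): the fraction of correctly classified examples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 0]          # 1 = search form, 0 = non-search form
scores = [0.9, 0.4, 0.5, 0.1]  # classifier scores for the positive class
y_pred = [1, 0, 1, 0]          # labels after thresholding the scores at 0.5

acc_value = accuracy(y_true, y_pred)
auc_value = auc(y_true, scores)
```

The pairwise AUC computation is quadratic in the number of examples; it is meant to show the definition, not to be efficient.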
To compare the ACCs and AUCs of different classifiers, the Friedman non-parametric test, based on the average ranks of the classification algorithms in all runs of the experiments, is applied. We calculate Friedman's test statistic [40] according to the following formula (given here in its standard form, without correction for ties):
$$F_T = \frac{12}{n\,m\,(m+1)} \sum_{j=1}^{m} \left( \sum_{i=1}^{n} r_i^j \right)^{2} - 3n(m+1),$$
where n denotes the number of experiments, m the number of classifiers and $r_i^j$ the rank of classifier j on the i-th run. The statistic approximately follows a chi-square distribution, and if $F_T$ is large enough, we can reject the null hypothesis that there is no significant difference among the compared classifiers; a post hoc Nemenyi test can further be applied to locate the differences [40].
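As an illustration, the standard Friedman statistic without tie correction can be computed from a rank matrix as follows. This is a sketch under that assumption; statistical packages may apply a tie correction and therefore return different values on data with tied ranks. The function name and the toy rank matrix are hypothetical.

```python
def friedman_statistic(ranks):
    """Friedman statistic without tie correction.

    ranks[i][j] is the rank of classifier j on run i (1 = best).
    """
    n, m = len(ranks), len(ranks[0])
    rank_sums = [sum(run[j] for run in ranks) for j in range(m)]
    return (12.0 / (n * m * (m + 1)) * sum(s * s for s in rank_sums)
            - 3.0 * n * (m + 1))

# Four runs, three classifiers; the first classifier always ranks best.
toy_ranks = [[1, 2, 3],
             [1, 2, 3],
             [1, 3, 2],
             [1, 2, 3]]
f_t = friedman_statistic(toy_ranks)
```

For this toy matrix the rank sums are 4, 9 and 11, giving a statistic of 6.5 out of a maximum possible value of n(m - 1) = 8.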
In the Nemenyi test, denote by $R_j$ the mean rank of classifier $C_j$ over all n experiments: $R_j = \frac{1}{n} \sum_{i=1}^{n} r_i^j$. The statistic z for two classifiers $C_1$ and $C_2$ is calculated as follows:
$$z = \frac{R_1 - R_2}{\sqrt{\dfrac{m(m+1)}{6n}}}.$$
$C_1$ and $C_2$ are significantly different in performance if the z value is larger than the critical difference value [40].

Results and Analysis
Here, we randomly partition the DMOZ-labeled data into two parts: labeled training data (70%, 628 examples) and labeled test data (30%, 269 examples).

Parameter Sensitivity
First, we want to test the performance of SSCTE with different numbers of iterations k, where the parameters $D_{fra}$ and $p_{conf}$ use the algorithm default settings of one and 0.95, respectively. As shown in Figure 4, SSCTE's classification accuracy and AUC increase slowly when the number of iterations increases at the very beginning. This indicates that a larger ensemble size will lead to a relatively better performance. However, when k ≥ 40, SSCTE is no longer very sensitive to the number of iterations. Thus, for efficiency and accuracy reasons, we choose k = 50 as the default setting in later experiments.
Next, we want to discover SSCTE's performance using various amounts of diversity data, indicated by the $D_{fra}$ factor. As shown in Figure 5, diversity data help to increase SSCTE's performance when the amount of diversity data is small (1 ≤ $D_{fra}$ ≤ 15). However, when $D_{fra}$ > 16, SSCTE's performance begins to deteriorate and becomes unstable. This is because the resulting training set contains too much noisy data. For this reason, we recommend setting $D_{fra}$ between 1 and 10; in our later experiments, $D_{fra}$ is set to one.
Here, we test SSCTE's performance under different combinations of base classifiers: SSCTE with neural networks and decision trees (SSCTE, the proposed algorithm), SSCTE with only neural networks (SSCTEnn) and SSCTE with only decision trees (SSCTEtree). For all three algorithms, the same settings, i.e., $p_{conf}$ = 0.95, $D_{fra}$ = 0.1 and k = 50, are applied. In the experiments, the unlabeled data are kept unchanged, and different sizes (controlled by α) of labeled examples are used as the labeled training set.
The results shown in Figures 6 and 7 demonstrate that SSCTE beats the other two in most cases under both the classification accuracy and AUC metrics. SSCTE is also much more stable than SSCTEnn and SSCTEtree. This confirms our assumption that neural networks and decision trees are complementary to each other in the ensemble, and that combining two different kinds of base classifiers does improve the algorithm's performance.

Comparisons
Finally, we compare our SSCTE algorithm with state-of-the-art supervised learning algorithms: monolithic classifiers, such as SVM, KNN and J48 (Java C4.5); and ensemble classifiers, such as boosting (GBDT, gradient boosted decision trees) and random forests (RF).
The proposed SSCTE algorithm is implemented in R, a free software programming language for statistical computing. For SSCTE, we randomly choose α% (α ∈ [7,99]) of data from the DMOZ-labeled training data and use it with the DMOZ-unlabeled data to form a new training set. In SSCTE, $p_{conf}$, $D_{fra}$ and k are set to 0.95, 1 and 50, respectively. For supervised methods, only α% of the DMOZ-labeled data are used in training, and the default parameters suggested in the R [41] packages are adopted. Hereafter, the reported performance of SSCTE and other methods corresponds to the result out of 93 runs on different percentages of the labeled training data, with unlabeled training data only for SSCTE.
Experimental results with different α in terms of classification accuracy and AUC are shown in Figures 8 and 9, respectively.
As shown in Figures 8 and 9, in terms of classification accuracy, SSCTE surpasses SVM, KNN, J48, GBDT and random forest in 98%, 97%, 80%, 81% and 81% of cases, respectively, and is inferior to these algorithms in 2%, 3%, 20%, 19% and 19% of cases. In terms of AUC, SSCTE outperforms SVM, KNN, J48, GBDT and random forest in 99%, 100%, 100%, 94% and 92% of cases, respectively, and is inferior to these models in far fewer cases (1%, 0%, 0%, 6% and 8%). As expected, all approaches improve their results as α, i.e., the size of the labeled data, increases, especially at the very beginning. When α > 20, however, most approaches are not very sensitive to the size of the labeled data, while SSCTE still shows a relatively sharp increase in performance. This suggests that SSCTE continues to benefit when more labeled data become available.
The Friedman rank sum test statistics for the above comparisons in terms of ACC and AUC are 319.853 and 425.725, respectively. This is significant (the corresponding p-value is less than $2.2 \times 10^{-16}$), and a post hoc Nemenyi test was then applied to find out which pairs of algorithms are significantly different.
For ease of notation, denote the classifiers SSCTE, SVM, KNN, J48, GBDT and RF by A, B, C, D, E and F, respectively. The average ranks of classifiers A, B, C, D, E and F in terms of ACC, together with the Nemenyi z statistics for both metrics, are listed after the figure captions at the end of the paper. For a significance level of α = 0.05, the critical value of the studentized range distribution is $q_\alpha$ = 4.121; dividing $q_\alpha$ by $\sqrt{2}$ gives the Nemenyi test critical value of 2.9140. It can be seen that in terms of ACC, four Nemenyi statistics, $z_{BA1}$, $z_{CA1}$, $z_{DA1}$ and $z_{EA1}$, exceed 2.9140, while in terms of AUC, all five Nemenyi statistics exceed the critical value. Thus, there exist significant differences between the proposed SSCTE and SVM, KNN, J48 and GBDT in terms of ACC and also significant differences between SSCTE and SVM, KNN, J48, GBDT and random forests in terms of AUC. In other words, SSCTE is significantly better than SVM, KNN, J48 and GBDT under both the ACC and AUC metrics and is also significantly better than RF in terms of AUC.
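The reported ACC statistics can be checked with a few lines of arithmetic. The sketch below assumes the usual Nemenyi standard error, the square root of m(m+1)/(6n), with m = 6 classifiers and n = 93 runs, and uses the average ACC ranks reported in this paper; the function and variable names are illustrative.

```python
import math

def nemenyi_z(rank_a, rank_b, m, n):
    """z statistic for two mean ranks over n runs of m classifiers."""
    return abs(rank_a - rank_b) / math.sqrt(m * (m + 1) / (6.0 * n))

# Average ACC ranks reported in the paper.
acc_ranks = {"SSCTE": 2.096774, "SVM": 4.623656, "KNN": 4.505376,
             "J48": 4.537634, "GBDT": 3.021505, "RF": 2.215054}
m, n = 6, 93

critical = 4.121 / math.sqrt(2)  # q_alpha divided by sqrt(2)
z = {name: nemenyi_z(rank, acc_ranks["SSCTE"], m, n)
     for name, rank in acc_ranks.items() if name != "SSCTE"}
significant = sorted(name for name, value in z.items() if value > critical)
```

Under these assumptions the computed values agree with the reported z statistics to within rounding, and only the comparison with RF stays below the critical value for ACC.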

Conclusions and Future Work
Determining whether an HTML web form is searchable or not is vital to further mining of the vast information of the Deep Web. Different from previous supervised approaches, we have adopted a semi-supervised co-training ensemble using neural networks and decision trees as the base classifiers.
In the proposed SSCTE algorithm, the bagging and random subspace methods are exploited together to create a diverse ensemble. In SSCTE, the data used to diversify the training set are generated from unlabeled data, which extends the diversity data notion to a semi-supervised learning framework. Furthermore, the combination of neural networks and decision trees, instead of only one kind of base learner, also increases the ensemble's diversity.
Experimental results on the DMOZ datasets have shown that SSCTE outperforms state-of-the-art classifiers, such as SVM, KNN, random forests, boosting and C4.5 decision trees. In our ongoing work, we will work toward optimizing SSCTE's parameters under different scenarios and seeking other base learners suitable for semi-supervised ensemble learning.

Figure 1. arXiv search interface: a simple search form.

Figure 3. Feature distributions in HTML forms.

Average ranks of the classifiers in terms of ACC: $R_A$ = 2.096774, $R_B$ = 4.623656, $R_C$ = 4.505376, $R_D$ = 4.537634, $R_E$ = 3.021505, $R_F$ = 2.215054.
Using the proposed algorithm (SSCTE, A) as the control algorithm and computing the z statistic of the Nemenyi test for different classifier pairs, we obtain: $z_{BA1}$ = 9.2104, $z_{CA1}$ = 8.7792, $z_{DA1}$ = 8.8968, $z_{EA1}$ = 3.3706, $z_{FA1}$ = 0.4311.
Similarly, in the case of the AUC metric, the corresponding z statistic values of the five classifier pairs are: $z_{BA2}$ = 15.0110, $z_{CA2}$ = 12.8553, $z_{DA2}$ = 2.9787, $z_{EA2}$ = 8.8185, $z_{FA2}$ = 6.5453.

Algorithm 1. Diversity data creation.
Input: U, unlabeled data; BaseLearn, base learning algorithm; $p_{conf}$, confidence probability threshold; $D_{fra}$, factor that determines the number of diversity data.
2: use BaseLearn to calculate the class probability $p_i$ for each example $u_i$ in U;
3: label $u_i$ with the class opposite to the prediction if its $p_i > p_{conf}$;
4: add all $u_i$ from Steps 2 and 3 to the diversity data pool;
5: select the $D_{fra}$ most confident examples in the pool to form D.
Output: D, diversity data

Table 1. Dataset used in the experiments. The dataset contains 897 labeled examples and 18,624 unlabeled examples and was extracted by crawling some of the web links indexed in the largest web directory, www.DMOZ.org, between April 2011 and March 2012. The DMOZ dataset covers a range of site types, from general sites, academic sites and recreation sites to social sites and science sites.