Feature Selection and Recognition Methods for Discovering Physiological and Bioinformatics RESTful Services

Many physiology and bioinformatics research institutions and websites have opened their own data analysis services and other related Web services. It is therefore very important to be able to quickly and effectively select and extract features from the Web service pages to learn about and use these services. This facilitates the automatic discovery and recognition of Representational State Transfer or RESTful services. However, this task is still challenging. Following the description feature pattern of a RESTful service, the authors proposed a Feature Pattern Search and Replace (FPSR) method. First, they applied a regular expression to perform a matching lookup. Then, a custom string was used to substitute the relevant feature pattern to avoid the segmentation of its feature pattern and the loss of its feature information during the segmentation process. Experimental results showed in the visualization that FPSR obtained a clearer and more obvious boundary with fewer overlaps than the test without using FPSR, thereby enabling a higher accuracy rate. Therefore, FPSR allowed the authors to extract RESTful service page feature information and achieve better classification results.


Introduction
In order to facilitate data sharing and the exchange of analytical methods, and their use by other organizations, many physiological and bioinformatics research institutions and websites have opened their own data analysis services and other related Web services.However, it is still a challenge to find and use these Web services quickly and effectively [1,2].Unlike traditional Simple Object Access Protocol (SOAP) Web services, it is crucial to effectively extract and select the features of Representational State Transfer or RESTful service pages for the automated identification and discovery of these pages.
Representational State Transfer (REST) is a Web software architecture style that was proposed in 2000 by Dr. Roy Thomas Fielding in his doctoral thesis.After participating in the formulation of the HTTP standard and the URL standard, Fielding put together these two design styles, which he summarized as a new architectural style, namely, the REST architecture style [3].The purpose of this style was to facilitate the transmission of information between different software programs in the network (e.g., the Internet).The Web services that conform to the REST design style are called RESTful services.
For the most part, Web services are currently divided into traditional SOAP services and RESTful services.Because of its simple implementation, lightweight quality and extensibility, RESTful services have been widely applied, and have become mainstream in Web service development [4].Traditional structured SOAP Web services rely on the Web Services Description Language (WSDL) to describe the services.WSDL is a structured language based on XML, and the computer can distinguish and understand its descriptive information [5].However, the description of RESTful services does not have a unified standard.Many people have extracted many description models [6][7][8], but these models have not been widely used.The description of RESTful services primarily uses semi-structured HTML documents to describe the functions and interfaces and invokes service methods with natural language [9].For a computer, the description page of the service is the same as the general document.It makes RESTful service description inundated with massive web pages, and a computer cannot identify RESTful service pages directly, which hinders the automated discovery of RESTful services.
For this paper, the authors proposed the FPSR method to preprocess the RESTful service pages related to physiological and bioinformatics information, and to extract and recognize the features of the RESTful service pages.The method ensured that their features were not disrupted by subsequent segmentation.The experimental analysis showed that this method could effectively improve classification performance.Compared with the traditional method, this method achieved a higher accuracy, precision, recall rate and f-score, thereby extracting RESTful service page features more effectively.

Related Work
The recognition of RESTful service pages is a binary classification of web pages.Web classification is different from the traditional plain text classification.Compared to that classification [10], HTML is a semi-structured document that contains information (e.g., labels, images, links, code, and scripts), which makes automatic classification difficult.In recent years, machine learning algorithms, such as Support Vector Machine (SVM), Naive Bayes [11], K-Nearest Neighbors (KNN) [12], and Decision tree [13] have been widely used in web page classification.There are many aspects that are worthy of research and exploration to select and extract features effectively in web pages and use machine learning algorithms for classification.
Kan and Thi [14] extracted feature information from URL and used SVM to carry out fast web page classification.Hu et al. [15] extracted pictures and text information in pages and used multi-kernel learning to classify drug-related websites.Zhao et al. [9] used the Naive Bayes algorithm and SVM to classify RESTful services by extracting features from web page content and document structure information.Rajalakshmi and Xaviar [16] conducted an experimental study on URL-based web page classification feature-weighting techniques and used SVM to perform web page classification.Altay et al. [17] used context-sensitive and keyword density to extract features and used SVM, maximum entropy, and Extreme learning machine for malicious web page detection.Siddiqui et al. [18] used HITEx (Health Information Text Extraction), which is an open-source natural processing software, to extract text features and used SVM and the Naive Bayes algorithm to identify health information web pages.Onan [19] analyzed and compared four kinds of feature selection methods (correlation, consistency, information gain and chi-square-based feature selection) and four different algorithms (Naive Bayes, KNN, C4.5, and FURIA) on the predictive effect of web page classification.Mohamed et al. [13] used Term Occurrence, Term Frequency, and TD-IDF for feature selection and extraction, and used the Naive Bayes, KNN, and Decision tree algorithms for web page classification.Kiziloluk and Ozer [20] selected the HTML pages' tags as features, then used the Parliamentary Optimization Algorithm (POA) for web page classification.

Dependent Libraries
Python and related libraries were used to implement the web crawler program, data processing, feature selection and extraction, model training, and database reading/writing operations.The tools and libraries used are as follows.
Requests: A powerful HTTP Client library for Python.It is very easy to send a custom HTTP request and to retrieve HTML pages using Requests.
Beautiful Soup: A Python library used for parsing HTML or XML, which can parse, query, and modify the document tree.It is convenient to analyze the HTML pages and extract information.
scikit-learn: An excellent machine learning library in Python, which provides many out-of-the-box machine learning algorithms, data preprocessing algorithms, and simple and effective data mining and data analysis tools.Many machine learning algorithms can be easily used through this library.
re: This is a Python module regular expression, which provides regular expression matching operations.Regular expressions use a single string to describe and match the string corresponding to the pattern rules.
pymongo: A Python library for operating MongoDB, providing support for reading, writing, querying, deleting, and other operations of MongoDB.

Data Acquisition
To obtain datasets for training and testing, the authors wrote a web crawler program using Python's Requests library, and obtained RESTful service pages for biological and physiological information research institutions and organizations, including Kyoto Encyclopedia of Genes and Genomes (KEGG) [21], the European Bioinformatics Institute (EMBL-EBL) [22], German Neuroinformatics Node (G-Node) [23], etc.In addition, to increase the diversity and quantity of datasets, they also retrieved description pages of RESTful services and some other non-RESTful pages from companies such as GitHub, Google, and Facebook.They saved the retrieved URLs and the original web pages to the MongoDB database for subsequent reading and analysis.MongoDB is a NoSQL database which can save the data in JavaScript Object Notation (JSON)-like documents.Because it is schema-free and has scalability, the data storage modal can be easily changed.For saving the file, MongoDB is more suitable than other relational databases.

Datasets
The size and representativeness of the dataset determined the upper bound of the training model, which played an important role in the training result of the model.The authors obtained 731 RESTful service pages, of which 265 pages were related to physiological information, and 1739 non-RESTful pages, for a total of 2470 samples.The dataset was saved in the local MongoDB database.

Web Crawler
The authors used the Requests and Beautiful Soup libraries in Python for the implementation of Web Crawler.First, they found all the URLs about the RESTful service using the Beautiful Soup library to parse the directory page of the service.Then, the Requests library was used to send all the HTTP requests for each URL and get all the web pages for each URL, and then each page and URL was saved to MongoDB.The read and write operations of MongoDB required the help of the pymongo library in Python.For some websites, access could not be gained without web proxy.Requests was therefore used to set the proxies argument to access URLs through the parameter setting of proxies.

Preprocessing of Web Pages
The preprocessing of web pages plays a significant role in the feature extraction and selection and the performance of the classifier.
For this study, the document captured by Web Crawler was in HTML format.HTML is a semi-structured text format, and the tags in a HTML document are used to annotate texts, images, and other contents.JavaScript and Cascading Style Sheets (CSS) can also be used to describe a web page's functionality and presentation, information which was not meaningful for this study.The authors removed it in preprocessing and retained the other information, thereby obtaining the plain text content.
The preprocessing of web pages is shown in Figure 1.During the processing, the Beautiful Soup library was used to finish the HTML parsing, the searching and deleting operation of the target tag, and the extraction and filter of the tag attribute, which ensured efficient processing.
Information 2018, 9, x FOR PEER REVIEW 4 of 11

Preprocessing of Web Pages
The preprocessing of web pages plays a significant role in the feature extraction and selection and the performance of the classifier.
For this study, the document captured by Web Crawler was in HTML format.HTML is a semistructured text format, and the tags in a HTML document are used to annotate texts, images, and other contents.JavaScript and Cascading Style Sheets (CSS) can also be used to describe a web page's functionality and presentation, information which was not meaningful for this study.The authors removed it in preprocessing and retained the other information, thereby obtaining the plain text content.
The preprocessing of web pages is shown in Figure 1.During the processing, the Beautiful Soup library was used to finish the HTML parsing, the searching and deleting operation of the target tag, and the extraction and filter of the tag attribute, which ensured efficient processing.The web page preprocessing steps are as follows: Useless tag removal: HTML documents may include some tags such as <img>, <script>, and <style>, which are used to display images, control the behavior of a page, and describe the appearance.These noise or non-sense tags were removed in preprocessing.By using Beautiful Soup (HTML parser), these tags were found by tag name, then removed.
Plain text extraction: The other tags, such as <p>, <title>, <body>, <header>, and many others, include information about the page content description.Beautiful Soup was used to remove the markup information and obtain the plain text.
Feature pattern search and replace: As shown in Figure 2, the authors annotated two types of pattern string in segments from RESTful service pages of KEGG (https://www.kegg.jp/kegg/rest/keggapi.html)and G-Node (http://g-node.github.io/g-nodeportal/key_functions/data_api/0.1/api_specification.html).The URL is the base part of the link of a special web service, and the path corresponds to a special web service from a website.Almost all RESTful pages include a similar special pattern.Like a URL, the special path corresponds to a special service and email address.These special patterns in RESTful service pages are more frequent than in other web pages.Depending on different word segmentation methods, however, these feature patterns can be split, or the same pattern can be treated as a different feature.To solve this problem, the authors proposed the Feature Pattern Search and Replace method.First, they used the regular expression to search and match all the pattern strings.Then, they replaced the same pattern string The web page preprocessing steps are as follows: Useless tag removal: HTML documents may include some tags such as <img>, <script>, and <style>, which are used to display images, control the behavior of a page, and describe the appearance.These noise or non-sense tags were removed in preprocessing.By using Beautiful Soup (HTML parser), these tags were found by tag name, then removed.
Plain text extraction: The other tags, such as <p>, <title>, <body>, <header>, and many others, include information about the page content description.Beautiful Soup was used to remove the markup information and obtain the plain text.
Feature pattern search and replace: As shown in Figure 2, the authors annotated two types of pattern string in segments from RESTful service pages of KEGG (https://www.kegg.jp/kegg/rest/keggapi.html)and G-Node (http://g-node.github.io/g-node-portal/key_functions/data_api/0.1/api_specification.html).The URL is the base part of the link of a special web service, and the path corresponds to a special web service from a website.Almost all RESTful pages include a similar special pattern.Like a URL, the special path corresponds to a special service and email address.These special patterns in RESTful service pages are more frequent than in other web pages.Depending on different word segmentation methods, however, these feature patterns can be split, or the same pattern can be treated as a different feature.To solve this problem, the authors proposed the Feature Pattern Search and Replace method.First, they used the regular expression to search and match all the pattern strings.Then, they replaced the same pattern string with a special word.This method helped to avoid the problem of pattern destruction in the tokenizing process and to save the feature pattern information.The details of the feature pattern search and replace are shown in Table 1.
Save the results: Finally, the processed text was saved to MongoDB. Figure 3 is an example of the HTML document preprocessing.with a special word.This method helped to avoid the problem of pattern destruction in the tokenizing process and to save the feature pattern information.The details of the feature pattern search and replace are shown in Table 1.
Save the results: Finally, the processed text was saved to MongoDB. Figure 3 is an example of the HTML document preprocessing.with a special word.This method helped to avoid the problem of pattern destruction in the tokenizing process and to save the feature pattern information.The details of the feature pattern search and replace are shown in Table 1.
Save the results: Finally, the processed text was saved to MongoDB. Figure 3 is an example of the HTML document preprocessing.

Feature Extraction and Selection
BoW (bag of words) is a common document representation method in information retrieval [10].BoW lists the words with their frequency per document.The frequency of words in a document can be used to represent the document, which allows users to compare documents.BoW ignores the effect of the grammar and the order of words.
Feature extraction is the process of converting raw data to numerical information which a computer can recognize.The computer cannot recognize the raw text information directly.The raw text can be represented by a vector using BoW, followed by a machine learning algorithm used to process the vector.
Feature selection is the process of removing redundant and irrelevant features, that is, to select the most effective feature from the raw features [24].It can reduce the dimensionality of the datasets and improve the performance of the algorithm.In web page classification, the dimensionality of features is very large as well as the number of samples.Many of the raw features were irrelevant to this study.It is easy to overload the algorithm and create a problem of calculation and memory consumption if feature selection is not performed.
Most machine learning algorithms need input with a fixed length, Because machine learning cannot process the raw text directly.The raw text must be converted to a numerical value with a fixed length, namely, a vector.The document d can be represented as a vector of terms (e.g., (t 1 , t 2 , t 3 , • • • , t i )), in which t i is a feature of d.
The process of dictionary initialization is shown in Figure 4: Information 2018, 9, x FOR PEER REVIEW 6 of 11

Feature Extraction and Selection
BoW (bag of words) is a common document representation method in information retrieval [10].BoW lists the words with their frequency per document.The frequency of words in a document can be used to represent the document, which allows users to compare documents.BoW ignores the effect of the grammar and the order of words.
Feature extraction is the process of converting raw data to numerical information which a computer can recognize.The computer cannot recognize the raw text information directly.The raw text can be represented by a vector using BoW, followed by a machine learning algorithm used to process the vector.
Feature selection is the process of removing redundant and irrelevant features, that is, to select the most effective feature from the raw features [24].It can reduce the dimensionality of the datasets and improve the performance of the algorithm.In web page classification, the dimensionality of features is very large as well as the number of samples.Many of the raw features were irrelevant to this study.It is easy to overload the algorithm and create a problem of calculation and memory consumption if feature selection is not performed.
Most machine learning algorithms need input with a fixed length, Because machine learning cannot process the raw text directly.The raw text must be converted to a numerical value with a fixed length, namely, a vector.The document d can be represented as a vector of terms (e.g., , , , ⋯ , ), in which is a feature of d.
The process of dictionary initialization is shown in Figure 4: In the dictionary initialization phase, texts were split by using space and punctuation as delimiters to extract words of at least two letters.The dictionary consisted of the words occurring in documents more than 10 times and in less than 90% of the number of documents.Stop words were removed.The detail of the algorithm is showed in Algorithm  In the dictionary initialization phase, texts were split by using space and punctuation as delimiters to extract words of at least two letters.The dictionary consisted of the words occurring in documents more than 10 times and in less than 90% of the number of documents.Stop words were removed.The detail of the algorithm is showed in Algorithm 1: Algorithm 1 The detail process of dictionary initialization Input: the text T, stop words W stop , the Threshold of DF max d f , min d f Output: the set of words W dict for each text In this process, the authors had to measure the words with their document frequency and the words with their word counts per document.This information was then used to convert a document into a vector.
Finally, term frequency-inverse document frequency (TF-IDF) was used to adjust the weight of words.The more documents the word appeared in, the less valuable that word was as a feature.The words found frequently in all documents were systematically discounted.That left only the frequent and distinctive words as the feature.First, TD-IDF measured the number of times that the words occurred in a given document, and then it adjusted the number by the number of documents in which the words appeared.The formula is as follows: where t f ij is the number of occurrences of word i in document j, d f i is the number of documents containing i, and N is the total number of documents.Principal Component Analysis (PCA), a tool for finding patterns in high-dimensional data, is a commonly used linear dimension reduction algorithm.It selects the features with high variance, and improves the calculation speed without the loss of model accuracy [25].The generated dictionary contained 8974 words.Therefore, the document feature was an 8974-dimensional vector.The number of features was higher than that of the samples.PCA was used to reduce the feature dimension, thereby improving the speed of the model training.

Classification Method
SVM is a supervised learning model with associated learning algorithms that analyzes data used for classification and regression analysis [26].SVM belongs to a general linear classifier, and it minimizes empirical errors and maximizes geometric edge regions.To solve the problem of the nonlinear discrimination of datasets in low-dimensional space, a suitable kernel function was selected.Thus, the original finite-dimensional space was mapped into a much higher-dimensional space by nonlinear transformation.
SVM has been used in most of the related works and has proved its success in text classification.SVM chooses the hyperplane to obtain the maximal distance between it and the nearest training-data point, so it has a good generalization ability.With a small amount of instance data to analyze, the SVM algorithm can give a better result.In order to train and test the feature extraction and selection, the authors chose the SVM algorithm of the linear kernel function as the training model of their classifier, with other parameters set to the default value.The linear kernel function has been suggested for text classification [27].

Results and Discussion
This section presents the visualization of each document processed with the FPSR method and without the FPSR method, respectively, and the classification results obtained from the different methods.

Dimensionality Reduction and Visualization Results
The authors conducted the feature extraction of files processed using the FPSR method, and the size of the generated dictionary was 8974.The eigenvectors of each file were reduced to 2D using PCA and visualized in a 2D plane.The result using the FPSR method is shown in Figure 5.
Traditional processing without the FPSR method was also done.When the preprocessing of the web pages was maintained, and the feature extraction and the dictionary initialization were left unchanged, the size of the generated dictionary was 9000.The visualization result is shown in Figure 6 after the same dimension reduction operation.The number of the features without the FPSR method is higher than the one using the FPSR method, because the FPSR method treats different URLs, paths or email strings as the same pattern.By comparing the visualization results of two methods, it is evident that the boundary of the figure using the FPSR method is clearer and more obvious with fewer overlaps.The classifier presented in this paper would make it easier to distinguish between these two categories of data.
Information 2018, 9, x FOR PEER REVIEW 8 of 11 using the FPSR method is clearer and more obvious with fewer overlaps.The classifier presented in this paper would make it easier to distinguish between these two categories of data.

Classification Results
There were 731 RESTful service description pages and 1739 other types of pages in the database.From the data, 80% of each type of data were chosen as the training set, and the remaining 20% was chosen as the test set.First, the dimension was reduced to 2D using PCA, and then a linear kernel function was selected by the SVM algorithm to train the data.The results for precision, recall rate, and f1-score with and without FPSR are listed in Tables 2 and 3, respectively.The accuracy with FPSR reached 99.4%, whereas that of the traditional method is 94.7%, and the recall rate of the RESTful service pages was only 87.1%.using the FPSR method is clearer and more obvious with fewer overlaps.The classifier presented in this paper would make it easier to distinguish between these two categories of data.

Classification Results
There were 731 RESTful service description pages and 1739 other types of pages in the database.From the data, 80% of each type of data were chosen as the training set, and the remaining 20% was chosen as the test set.First, the dimension was reduced to 2D using PCA, and then a linear kernel function was selected by the SVM algorithm to train the data.The results for precision, recall rate, and f1-score with and without FPSR are listed in Tables 2 and 3, respectively.The accuracy with FPSR reached 99.4%, whereas that of the traditional method is 94.7%, and the recall rate of the RESTful service pages was only 87.1%.

Classification Results
There were 731 RESTful service description pages and 1739 other types of pages in the database.From the data, 80% of each type of data were chosen as the training set, and the remaining 20% was chosen as the test set.First, the dimension was reduced to 2D using PCA, and then a linear kernel function was selected by the SVM algorithm to train the data.The results for precision, recall rate, Information 2018, 9, 227 9 of 11 and f1-score with and without FPSR are listed in Tables 2 and 3, respectively.The accuracy with FPSR reached 99.4%, whereas that of the traditional method is 94.7%, and the recall rate of the RESTful service pages was only 87.1%. Figure 7 indicates the confusion matrices of the test result with different preprocessing methods.As shown in Figure 7, the test dataset included 147 RESTful service pages and 348 other pages, and the FPSR method obtained better results, which correctly classified almost all samples with a classification accuracy of 99.4%.The recall score was more important than the precision in this study, because the authors had to identify as many of the RESTful service pages as possible.However, the recall score of the preprocessing without FPSR was only 87.1%.Therefore, the authors concluded that the FPSR method could extract the characteristics of RESTful services more effectively.Figure 7 indicates the confusion matrices of the test result with different preprocessing methods.As shown in Figure 7, the test dataset included 147 RESTful service pages and 348 other pages, and the FPSR method obtained better results, which correctly classified almost all samples with a classification accuracy of 99.4%.The recall score was more important than the precision in this study, because the authors had to identify as many of the RESTful service pages as possible.However, the recall score of the preprocessing without FPSR was only 87.1%.Therefore, the authors concluded that the FPSR method could extract the characteristics of RESTful services more effectively.

Conclusions and Future Work
In this paper, the authors proposed a FPSR method for feature extraction and selection of RESTful service description pages related to physiological information.The RESTful service page feature pattern was searched and replaced with a specific string to avoid the destruction of its pattern in the segmentation stage and to retain its feature information.Finally, the results of the dimensionality reduction visualization and classification of the test set showed that this method could more effectively select and extract RESTful service feature information, thereby achieving a higher classification accuracy and a lower error rate.This will play an important role in the automated discovery of RESTful services related to physiological and bioinformatic information.
In future studies, researchers will have to increase the number of samples in the dataset and enrich its diversity, and then verify the validity of the method on the new dataset.Furthermore, they will have to study the relevant classifier algorithms and evaluate the indicators of each algorithm,

Conclusions and Future Work
In this paper, the authors proposed a FPSR method for feature extraction and selection of RESTful service description pages related to physiological information.The RESTful service page feature pattern was searched and replaced with a specific string to avoid the destruction of its pattern in the segmentation stage and to retain its feature information.Finally, the results of the dimensionality reduction visualization and classification of the test set showed that this method could more effectively

Figure 2 .
Figure 2. Example of Kyoto Encyclopedia of Genes and Genomes (KEGG) and German Neuroinformatics Node (G-Node) RESTful description.

Figure 2 .
Figure 2. Example of Kyoto Encyclopedia of Genes and Genomes (KEGG) and German Neuroinformatics Node (G-Node) RESTful description.

Figure 2 .
Figure 2. Example of Kyoto Encyclopedia of Genes and Genomes (KEGG) and German Neuroinformatics Node (G-Node) RESTful description.

Figure 3 .
Figure 3. Example of HTML preprocessing.Figure 3. Example of HTML preprocessing.

Figure 3 .
Figure 3. Example of HTML preprocessing.Figure 3. Example of HTML preprocessing.

Figure 7 .
Figure 7. (a) The confusion matrix with the FPSR method; (b) The confusion matrix without the FPSR method.

Figure 7 .
Figure 7. (a) The confusion matrix with the FPSR method; (b) The confusion matrix without the FPSR method.

Table 1 .
Details of Feature Pattern Search and Replace (FPSR).

Table 1 .
Details of Feature Pattern Search and Replace (FPSR).