Semantic Features with Contextual Knowledge-Based Web Page Categorization Using the GloVe Model and Stacked BiLSTM

Abstract: Internet technologies are evolving rapidly nowadays, due to which web pages are generated exponentially. Web page categorization is required for searching and exploring relevant web pages based on users’ queries, but it is a tedious task. The majority of web page categorization techniques ignore the semantic features and contextual knowledge of the web page. This paper proposes a web page categorization method that categorizes web pages based on semantic features and contextual knowledge. Initially, the GloVe model is applied to capture the semantic features of the web pages. Thereafter, a Stacked Bidirectional Long Short-Term Memory (BiLSTM) network with a symmetric structure is applied to extract the contextual and latent symmetry information from the semantic features for web page categorization. The performance of the proposed model has been evaluated on the publicly available WebKB dataset. The proposed model shows superiority over existing state-of-the-art machine and deep learning methods.


Introduction
Nowadays, the information available on the World Wide Web (WWW) is growing exponentially, due to which finding user-relevant web pages has become challenging and tedious. A search engine returns web pages as per the query submitted by the user, but these web pages may or may not be relevant to the user. For example, a search engine either returns too many results or misinterprets the user query due to linguistic ambiguity [1]. Typically, search engines collect the results of prior user searches to respond to future queries depending on the previous results with the highest hits [2]. Thus, there is a need for an efficient web page categorization method that works as per the user's requirements. Web page categorization is a supervised learning problem in which a set of labeled categories is assigned to web pages. Web page categorization plays a crucial role in improving the performance of search engines. Due to a lack of semantic and contextual information, many search engines rely solely on keywords and suffer from linguistic ambiguity. The main objective of the proposed work was to develop an effective technique for resolving linguistic ambiguity by considering the semantic and contextual knowledge of the web pages. Information retrieval, parental filtering, recommendation systems, and focused crawling are popular application areas of web page categorization [3,4].
Many researchers have concentrated on categorizing web pages based on textual information by counting the appearance of the terms in the textual corpus. This is called frequency-based web page categorization, and it examines the frequency with which terms appear on the web page [5]. Content information such as the 'title' and 'description' and structure information such as 'keywords' and 'tags' may be available on web pages. In recent years, researchers have used structure-based feature vectors such as HTML tags to categorize web pages [6,7].
The term frequency and structure-based feature vector approaches are insufficient to discover the semantic meaning of textual information, and text sequence information remains unutilized. Researchers have used the Natural Language Processing (NLP) word embedding technique and the Long Short-Term Memory (LSTM) layer in recent advances in deep learning [8]. The application domains of the stated techniques include text classification, chatbots, sentiment analysis, object classification, and sequence problems [9-11]. The word embedding technique has been frequently utilized in NLP to produce feature vectors that are semantically correlated. To extract contextual features and latent symmetry information from sequential terms, the BiLSTM model has been utilized [10,12].
The principal approaches for the web page categorization discussed in the next section are based on the set of textual and structural features of the web pages. Previous methods lack the ability to utilize the semantic and contextual features of the web pages. Recent advancements in deep learning, specifically in NLP and LSTM, have motivated what is proposed in the present work. In this paper, a model is proposed for web page categorization that utilizes the GloVe word embedding model to capture the semantic features of terms and transform words into word vectors of web pages. In addition, the Stacked BiLSTM model has also been applied to conduct the feature extraction of sequential word vectors. Finally, the Softmax classifier has effectively categorized web pages based on rich semantic and contextual information. The main contributions of the proposed work are:

• The combination of GloVe and BiLSTM is used to categorize web pages by resolving the linguistic ambiguity problem. Linguistic ambiguity is the ability to derive more than one meaning from a user query and is also referred to as lexical ambiguity in some articles [13].
• Previous research on web page categorization, mentioned in the next section, uses the term frequency of the text and the structure of the web pages [5,6,14]. In comparison, the use of semantic and contextual features for efficient web page categorization is the novel idea suggested in this paper.
The rest of this paper is organized as follows. Section 2 presents the background and related works on web page classification and categorization. Section 3 describes the proposed model for web page categorization. The experimental setup, dataset description, and hyperparameter settings are described in Section 4. Section 5 presents the experimental results and a performance comparison of the proposed model with baseline and existing methods. Finally, Section 6 concludes this work.

Background and Related Work
Earlier research approaches have addressed the problem of web page classification as per the user's preferences [15]. This is a simple document classification problem based on the textual contents and features of web pages. A common approach classifies web pages by counting the frequency of text terms to form a term frequency feature vector. These feature vectors are used to train a classifier to classify web pages. For example, Lipras et al. [16] classified the web pages of news articles into four categories using unigram, bigram, trigram, and four-gram frequency feature vectors with a random forest classifier.
Li et al. [5] proposed an entity similarity network based on Wikipedia to categorize web pages into ten classes using the term frequency feature vector. Feature vectors extracted from the title and main text of the web pages were utilized by the naive Bayesian classifier. The accuracy was improved by applying a probability distribution function (PDF) to all the terms in the feature vector; these probability distribution functions were created from Wikipedia entity words. Document frequency-based feature vectors have been used by past researchers to categorize web pages. Jinhua et al. [17] proposed a semi-supervised learning algorithm to categorize WebKB and 20 Newsgroups web pages based on their textual content. In this study, a semi-supervised algorithm based on expectation maximization presented the best results. The F1-score of this method was superior to that of the supervised machine learning algorithms.
Mulahuwaish et al. [18] recently proposed a method to classify web documents using the Support Vector Machine (SVM), Decision Tree (DT), and K-Nearest Neighbour (KNN). These approaches rely on document frequency-based features to improve classification efficiency.
The primary issue with the word frequency feature vector is that it may suffer from sparsity, which can be resolved using feature selection methods. To address this issue, Tian et al. [19] proposed a method to categorize web pages into seven categories. It leveraged information gain to reduce the dimensionality of the term frequency feature vector, which was then used to train an SVM.
To classify web pages of Yahoo sports news, Selamat et al. [20] utilized principal component analysis (PCA) for dimensionality reduction of the term frequency-inverse document frequency (TF-IDF) feature vector before classification with a neural network. TF-IDF was a popular method for document classification and natural language processing, but the inclusion of PCA showed improved results.
Researchers have also used an unsupervised learning-based approach to reduce the dimension of feature vectors. Li et al. [21] proposed an approach that utilized the Autoencoder feature reduction method based on unsupervised learning to classify the emotions of social comments using the deep neural network. Wei et al. [22] improved the web page classification accuracy by utilizing the Latent Dirichlet Allocation (LDA) unsupervised learning model for feature reduction and classification using SVM.
In addition to textual information, a web page also contains structural information, i.e., labels and tags. In this situation, TF-IDF treats the same term appearing in separate tags as different features to improve web page classification accuracy. Ozal et al. [6] proposed a web page categorization technique based on a genetic algorithm using tagged terms as features. In this study, each web page was represented as a chromosome with a list of feature weights between 0 and 1. The genetic algorithm-based binary web page classification achieved a better classification accuracy than KNN and the naïve Bayesian classifier.
Lee et al. [7] proposed a binary classification of web pages using tagged terms as features and classified given data into either art or non-art, science or non-science, health or non-health, and computer or non-computer. This article used a meta-heuristic approach called simplified swarm optimization (SSO) to learn the best weight of every feature from the tagged term feature list. The SSO-based web page classifier achieved a better F1-score than a genetic algorithm, the Bayesian classifier, and the KNN classifier.
Vinod et al. [14] proposed an efficient method for web page categorization where feature vectors were based on feature extraction and a weight assignment process to each text term. These feature weights were assigned according to the domain-specific web page keyword list, and this extensive feature set was used by SVM to categorize web pages.
Web page ontologies have also been used for web page classification, since they capture semantic features. Recently, some research on web page categorization has focused on the semantic features of the textual content of web pages. Saleh et al. [23] proposed a web page classification technique called classification of multilayer domain ontology (CMOD), based on a page analysis module, a page importance module, and a page classification module.
Wai et al. [24] developed a web page categorization technique by extracting the semantic features of web pages using an ontology. Finally, the classification was performed by an enhanced version of C4.5 and the Naïve Bayes classifier.
Ensemble learning combines multiple machine learning models, each developed to solve a simpler subproblem, in order to solve a larger problem. Recently, Gupta et al. [25] proposed an ensemble learning-based method for web page classification. It utilized a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to create feature vectors and a deep inception model with residual connections.
These are state-of-the-art techniques in the field of web page classification. In summary, through the study of existing works, a knowledge gap was identified: existing web page categorization is based only on a web page's textual and structural features.
In this paper, the novelty of the research lies in the utilization of the semantic and contextual features of web pages.

Proposed Methodology
Web page categorization may assign web pages to binary or multiclass categories. In binary classification, each web page in {w_1, w_2, w_3, ..., w_n} is categorized into one of two categories {c_1, c_2}. However, in multi-class classification, each web page in {w_1, w_2, w_3, ..., w_n} is categorized into one of many categories {c_1, c_2, c_3, ..., c_m}. The proposed web page categorization model categorizes web pages into different categories.
This section comprises a detailed description of the proposed model's overall methodology and a step-by-step procedure, referred to as Algorithm 1:
1. The web pages were gathered from the WebKB dataset described in Section 4.2.
2. Two samples of the WebKB dataset were taken to conduct experiments with the proposed model shown in Figure 1.
3. Data preprocessing was performed, as explained in Section 3.1.
4. GloVe word embedding was used to build feature vectors for categorization, as explained in Section 3.2; it produced semantically related feature vectors that were passed to the BiLSTM layer.
5. The BiLSTM layers extracted contextual features, as explained in Section 3.3.
6. The contextual features were passed to the softmax classifier, described as the output layer in Section 3.4, to categorize each web page into its category.
7. The evaluation metrics precision, recall, accuracy, and F1-score were calculated using the confusion matrix, and the ROC curve of the proposed model was plotted.
8. A comparative performance analysis of the proposed model with existing techniques was completed, as shown in Section 5.2.
9. For the generalization of the proposed model, experiments were performed on a more extensive dataset, DMOZ, as described in Section 5.3.
10. The evaluation metrics precision, recall, accuracy, and F1-score were also calculated for the DMOZ dataset using a confusion matrix.
The proposed model is divided into four modules, as shown in Figure 1, beginning with the input of WebKB data for data preprocessing, followed by word embedding, the stacked BiLSTM module, and the output layer, as discussed in the following sections. Section 4.2 provides a detailed description of the arrangement of the WebKB dataset used in this paper.

Data Preprocessing
The WebKB dataset is a collection of web pages that was used to perform the experiments for the proposed research. The web pages are in HTML format, so the required data are extracted from the HTML files. First, in data preprocessing, the HTML body, paragraph, title, and tags are scanned to extract only the text data. After that, punctuation and digits are removed from the extracted text data, which is converted to lowercase. Finally, all the stop words are removed, and stemming is applied to the preprocessed text data. After tokenization, the preprocessed data are passed to the word embedding phase.
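The preprocessing steps above can be sketched as follows; this is a minimal illustration, in which the stop-word list, the toy suffix-stripping "stemmer", and the sample HTML are illustrative stand-ins (a real pipeline would use a full stop-word list and a Porter-style stemmer):

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text content found between HTML tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

# Illustrative subset only; a real pipeline uses a full stop-word list.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def crude_stem(token):
    # Toy suffix stripping, standing in for a Porter-style stemmer.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(html):
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks)
    # Drop punctuation and digits, then lowercase.
    text = re.sub(r"[^a-zA-Z\s]", " ", text).lower()
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]

tokens = preprocess("<html><title>CS 101</title><body><p>Teaching courses!</p></body></html>")
print(tokens)  # -> ['cs', 'teach', 'course']
```

The resulting token list is what the word embedding phase would consume.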

Word Embedding
Word embedding is the distributed representation of words, also called word vector representation [26]. The embedding layer extracts semantic features from the output of data preprocessing and passes them as real-valued vectors to the subsequent layers of the neural network learning model. The one-hot encoding method and distributed word representation are widely used for word vector representations. One-hot encoding leads to high-dimensional word vector representations; distributed word representation avoids this by mapping each word into a low-dimensional dense vector [8]. In addition, distributed word embedding captures semantic information from words, so words such as faculty and teacher are not considered to be two different features.

GloVe Word Embedding
The GloVe word embedding technique [27] is used in the proposed model to extract semantic features from the text on web pages. GloVe is an abbreviation of global vectors, and GloVe embedding is an unsupervised learning algorithm for the distributed word representation of the text extracted from the web pages [28]. The GloVe method is easier to train over the data due to its parallel implementation. It captures the semantic relationships of words in the vector space. A global co-occurrence matrix X was created using the words found in the Wikipedia dataset to train the GloVe word embedding model. In this paper, pre-trained word vectors (glove.6B.zip) with a vector size of 100 were used in the embedding layer; these vectors were trained on a 400 k word vocabulary.
In the co-occurrence matrix X, the entry X_ij represents the number of times word j appears in the context of word i. The GloVe model minimizes the following weighted least-squares objective function:

J = Σ_{i,j=1}^{V} f(X_ij) (w_i^T w̃_j + b_i + b̃_j − log X_ij)^2 (1)

where V is the vocabulary size, w_i and w̃_j are the word and context word vectors, b_i and b̃_j are their bias terms, and f is a weighting function that limits the influence of rare and very frequent co-occurrences.
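As a small illustration of how such pre-trained vectors are consumed, the sketch below parses two lines in the GloVe text format (a word followed by its vector components) and compares the resulting vectors with cosine similarity. The two 4-dimensional vectors are made up for illustration; the real glove.6B files contain 100-dimensional vectors for 400 k words.

```python
import math

# Made-up lines in the GloVe text file format (word, then components).
sample_lines = [
    "faculty 0.2 0.8 0.1 0.4",
    "teacher 0.25 0.75 0.05 0.45",
]

def parse_glove(lines):
    """Builds a {word: vector} dictionary from GloVe-format lines."""
    vectors = {}
    for line in lines:
        parts = line.split()
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(u, v):
    """Cosine similarity: close to 1 for semantically related words."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

vecs = parse_glove(sample_lines)
sim = cosine(vecs["faculty"], vecs["teacher"])
```

In a real embedding layer, these vectors would populate an embedding matrix indexed by the tokenizer's vocabulary.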

Stacked Bidirectional LSTM
The standard architecture of an LSTM cell, which effectively overcomes the vanishing gradient problem and conveys useful information within the LSTM network, is shown in Figure 2 [29]. Due to the vanishing gradient problem during model training, the traditional Recurrent Neural Network (RNN) may fail to capture long-term dependencies among feature vectors [30]. The LSTM cell consists of three gates, namely, the input gate (i_t), the forget gate (f_t), and the output gate (o_t). These gates control the memory cell activation vector.

• Forget gate: This is used to determine the amount of information from the previous state c_{t−1} that should be retained or forgotten based on the present input x_t and hidden state h_{t−1}. The output of the forget gate lies between 0 and 1. The forget gate is expressed by Equation (2):

f_t = σ(w_xf · x_t + w_hf · h_{t−1} + b_f) (2)

where σ denotes the logistic sigmoid function, b_f is the bias vector, w_xf is the weight matrix between x_t and f_t, and w_hf is the weight matrix between h_{t−1} and f_t.
• Input gate: This is used to determine the amount of the network's input x_t that must be retained in the current cell state c_t. The input gate is expressed by Equation (3):

i_t = σ(w_xi · x_t + w_hi · h_{t−1} + b_i) (3)

where b_i is the bias vector, w_xi is the weight matrix between x_t and i_t, and w_hi is the weight matrix between h_{t−1} and i_t.
• Output gate: This is used to determine the amount of information conveyed from the cell state c_t to the rest of the LSTM network through the current output value h_t. LSTM gates are fully connected networks whose input is a vector and whose output is a real number. The output gate is expressed by Equation (4):

o_t = σ(w_xo · x_t + w_ho · h_{t−1} + b_o) (4)

where b_o is the bias vector, w_xo is the weight matrix between x_t and o_t, and w_ho is the weight matrix between h_{t−1} and o_t.
The final outputs of the LSTM cell are the cell output state c_t and the layer output h_t, which are expressed in Equations (5) and (6), respectively:

c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t (5)

h_t = o_t ⊙ tanh(c_t) (6)

The intermediate cell input state is represented by c̃_t, which is expressed by Equation (7):

c̃_t = tanh(w_xc · x_t + w_hc · h_{t−1} + b_c) (7)

where b_c is the bias vector, w_xc is the weight matrix between x_t and c̃_t, and w_hc is the weight matrix between h_{t−1} and c̃_t.
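A single-unit, scalar sketch of one LSTM time step following Equations (2)-(7) is shown below; the uniform weight values are arbitrary toy numbers, not trained parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    # Gates (Equations (2)-(4)): each mixes the current input with the
    # previous hidden state and squashes the result into (0, 1).
    f_t = sigmoid(p["w_xf"] * x_t + p["w_hf"] * h_prev + p["b_f"])  # forget gate
    i_t = sigmoid(p["w_xi"] * x_t + p["w_hi"] * h_prev + p["b_i"])  # input gate
    o_t = sigmoid(p["w_xo"] * x_t + p["w_ho"] * h_prev + p["b_o"])  # output gate
    # Candidate cell state (Equation (7)) and state updates (Equations (5)-(6)).
    c_hat = math.tanh(p["w_xc"] * x_t + p["w_hc"] * h_prev + p["b_c"])
    c_t = f_t * c_prev + i_t * c_hat
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t

# Toy parameters: every weight and bias set to 0.5 for illustration.
params = {k: 0.5 for k in ("w_xf", "w_hf", "b_f", "w_xi", "w_hi", "b_i",
                           "w_xo", "w_ho", "b_o", "w_xc", "w_hc", "b_c")}
h, c = lstm_step(1.0, 0.0, 0.0, params)
```

In the real model each gate is a vector operation over matrices, but the arithmetic per unit is exactly this.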
In the recently published work by Peng et al. [31], LSTM was applied with fruit fly optimization to NN3 data for a time series analysis. On the other hand, LSTM was also used by Qin et al. [32] in a totally different domain to predict the life span of a gear in a mechanical transmission system. In this way, LSTM has been identified as a robust and efficient technique for solving various research problems, which motivated its use in the proposed architecture.
LSTM cannot utilize the contextual information from the future tokens of web pages, so the bidirectional LSTM (BiLSTM) layer consists of two separate hidden LSTM layers that process the sequence in opposite directions and feed the same output [33,34]. With this structure, the output layer can utilize the long-term related information of web pages from both previous and future text contexts.
A bidirectional LSTM (BiLSTM) comprises two separate LSTMs that combine information from both directions of a phrase to obtain word annotations [35]. The information acquired as word annotations from the web pages is then integrated: the forward LSTM processes the input from left to right, the backward LSTM processes it from right to left, and their hidden states are concatenated. A deep hierarchical model can further improve the performance of web page categorization [36], as stacking BiLSTM layers captures more complex features than a shallow model; therefore, the proposed model defines a stacked BiLSTM network to utilize the local contextual and latent symmetry information of web pages. The output of the lower layer becomes the input of the upper layer in a stacked BiLSTM network.
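The bidirectional and stacking ideas can be illustrated with a toy sketch in which a simple tanh recurrent cell stands in for the LSTM cell; the weights and input sequence are arbitrary illustrative values:

```python
import math

def rnn_pass(seq, w_x=0.5, w_h=0.5):
    """One directional recurrent pass; returns the hidden state per position."""
    h, states = 0.0, []
    for x in seq:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def bidirectional(seq):
    fwd = rnn_pass(seq)              # left-to-right context
    bwd = rnn_pass(seq[::-1])[::-1]  # right-to-left context, realigned to positions
    # Each position's annotation pairs the forward and backward states.
    return [(f, b) for f, b in zip(fwd, bwd)]

layer1 = bidirectional([0.1, 0.9, 0.4])
# Stacking: the lower layer's output sequence feeds the upper layer.
# Here each (forward, backward) pair is summarized to a scalar first.
layer2 = bidirectional([f + b for f, b in layer1])
```

The key point is that every position's annotation in `layer1` already encodes both past and future context, and `layer2` re-reads that enriched sequence.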
The summary in Sections 3.2 and 3.3 of the GloVe word embedding technique shows that it is helpful to identify relevant words from the text in the web pages by computing the probability ratio that those are semantically related. On the other hand, LSTM captures the long-term dependencies between token sequences, which is better for text classification. Due to these factors, we combined the GloVe Model and Stacked BiLSTM in our proposed model.

Output Layer
Finally, the output layer of the proposed model generates four output values, one for each category, in order to predict the web page category with the highest output value. The SoftMax activation function is utilized in the output layer to ensure that each output value ranges from 0 to 1, which is then used as the predicted probability of the web page category. The predicted probability is computed using Equation (9):

p(c | x) = exp(z_c) / Σ_k exp(z_k) (9)

where c is the actual class label of a web page, x denotes the web page features, and z_c is the output score for category c. The Adam optimizer adjusts the learning rate to attain the optimum hyperparameter values, and the categorical cross-entropy loss is used to train the proposed model. A dropout layer is also utilized to avoid overfitting, with a dropout value of 0.25. The cross-entropy loss increases as the predicted probability diverges from the actual category of a web page. The cross-entropy loss is calculated using Equation (10):

L(ŷ, y) = −(1/N) Σ_{i=1}^{N} y_i · log(ŷ_i) (10)

where L(ŷ, y) is the categorical cross-entropy loss for the predicted results, y_i is the actual output for the ith web page (actual category), ŷ_i is the predicted output for the ith web page, and N is the total number of web pages.
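A minimal sketch of the softmax and categorical cross-entropy computations in the four-category setting is shown below; the output scores and the one-hot label are illustrative values:

```python
import math

def softmax(scores):
    # Equation (9): normalized exponentials, so outputs lie in (0, 1) and sum to 1.
    # Shifting by the max is a standard numerical-stability trick.
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(y_true, y_pred):
    # Equation (10): averaged over N web pages; y_true rows are one-hot labels.
    n = len(y_true)
    return -sum(t * math.log(p)
                for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p)) / n

probs = softmax([2.0, 1.0, 0.1, -1.0])  # four output scores -> four categories
loss = categorical_cross_entropy([[1, 0, 0, 0]], [probs])
```

The predicted category is simply the index of the largest probability; the loss shrinks toward 0 as that probability approaches 1 for the true category.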

Experiment Setup
This section discusses the datasets, performance metrics, and optimal hyperparameter values used during the implementation of the proposed model.

Dataset Description
The WebKB benchmark dataset [37] is used in the proposed model to categorize web pages. It comprises web pages from four distinct universities: Cornell (867 web pages), Texas, Washington, and Wisconsin. Table 1 shows the number of Train and Test web pages of the WebKB dataset with the four categories considered in the experiments of the present research work. Another publicly available dataset is an extensive repository of web pages named the DMOZ dataset [38]. It consists of web pages of different domains, including Arts, Business, Computers, Games, Health, Home, News, Recreation, Reference, Science, Shopping, Society, and Sports, used in the experiments of the present article. Table 2 shows the number of Train and Test web pages of the DMOZ dataset.

Performance Metrics
To measure the effectiveness of the proposed model, the well-known measures of precision, recall, F1-score, accuracy, and the confusion matrix, which represent the performance of any classifier, were evaluated [39]. Table 3 represents the confusion matrix for binary classification. True positive (TP) represents the correctly identified web pages, false positive (FP) represents the incorrectly identified web pages, true negative (TN) represents the correctly rejected web pages, and false negative (FN) represents the incorrectly rejected web pages. Precision (P) shows the correctness of the proposed model and is calculated using Equation (11); Recall (R) shows the completeness of the proposed model and is calculated using Equation (12); Accuracy shows the exactness of the proposed model and is calculated using Equation (13); and the F1-score is the harmonic mean of Precision and Recall, calculated using Equation (14):

P = TP / (TP + FP) (11)

R = TP / (TP + FN) (12)

Accuracy = (TP + TN) / (TP + TN + FP + FN) (13)

F1-score = 2PR / (P + R) (14)
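Equations (11)-(14) can be computed directly from the confusion matrix counts; the counts below are made-up illustrative numbers, not results from the experiments:

```python
def metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)                           # Equation (11)
    recall = tp / (tp + fn)                              # Equation (12)
    accuracy = (tp + tn) / (tp + fp + tn + fn)           # Equation (13)
    f1 = 2 * precision * recall / (precision + recall)   # Equation (14)
    return precision, recall, accuracy, f1

# Illustrative confusion matrix counts for one category.
p, r, acc, f1 = metrics(tp=90, fp=10, tn=85, fn=15)
```

For the multi-class case, these values are computed per category from that category's one-vs-rest confusion matrix and then averaged.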

Hyperparameter Setting
Many deep learning models expose explicitly defined parameters, called hyperparameters, that control memory usage and execution cost. A deep learning model's success depends on the hyperparameter values, which must be set before applying the learning algorithm, and the optimal values differ from task to task. Table 4 shows the proposed model's hyperparameter values.

Result Analysis
This section describes the performance of the proposed model and the performance comparison of the proposed model with machine learning classifiers, deep learning methods, methods published in recent articles, and the applicability with the DMOZ dataset.

Performance of the Proposed Model
The confusion matrix represents the TP, FP, TN, and FN values used to calculate the evaluation metrics and to draw the Receiver Operating Characteristic (ROC) curve of the proposed model. Using these values, the Precision, Recall, and F1-scores were calculated for the four categories of the WebKB dataset used in the experiments. Figure 3 illustrates the confusion matrix of the proposed model with four classes, in which each column indicates the predicted category of the web page, while each row represents the actual category of the web page [40]. Thus, the sum of the cell values in each row corresponds to the total number of data instances for that category. Experiments were also conducted considering only three classes of the WebKB dataset for an apples-to-apples comparison with existing work. Figure 4 illustrates the confusion matrix of the proposed model with three classes. To better understand the significance of the values in the confusion matrix, consider the course category, which produced fewer false-positive and false-negative values; the proposed model correctly categorized 262 web pages in the course category. On the other hand, the project category generated more false-positive and false-negative values because its pages may be associated with other categories. Table 5 shows the evaluation metrics of the proposed model with four classes, listing the Precision, Recall, and F1-scores for each category. The proposed model achieved an average F1-score of 83.03% and an average accuracy of 85.32%. In addition, Table 6 shows the evaluation metrics of the proposed model with three classes. In this case, the proposed model achieved an F1-score of 88.84% and an accuracy of 88.73%. The ROC curve was plotted between two parameters, i.e., the True Positive Rate (TPR) and the False Positive Rate (FPR). Thus, it shows the performance of the categorization model at all possible thresholds.
For example, Figure 5 depicts the course category with the highest Area Under the Curve (AUC) value of 0.99, while the AUC value for the faculty category was 0.95, the lowest. Similarly, Figure 6 depicts the ROC curves for three classes, where the course category again had the highest AUC value of 0.99, while the AUC value for the faculty category was 0.94, the lowest.

Comparative Performance Analysis
This section is further divided into two subsections. The first subsection is dedicated to the experimental analysis, which comprises the implementation details of the baseline models. The second subsection presents a comparative analysis of the proposed model with state-of-the-art methods from past research.

Experimental Analysis
In this section, to test the effectiveness of the proposed model, it is compared with the deep learning and machine learning models. For this purpose, the following models were implemented.

• Naïve Bayesian (NB): The Naïve Bayesian (NB) machine learning model is a simple multi-class classifier which is generally used in text classification [41,42]. It has a certain error rate because it determines the posterior probability from prior knowledge and data. Nevertheless, NB produces accurate results in a short span of time with minimal training data. In this paper, feature vectors formed by the TF-IDF approach and Multinomial Naïve Bayes (MNB) were applied with the parameter value alpha = 1 for web page categorization.
• SVM: The Support Vector Machine is a supervised machine learning model based on structural risk minimization and was introduced by Vapnik [11,42]. SVM creates a hyperplane that separates the data into two sets with the maximum margin. In this paper, TF-IDF was used to obtain the frequency of the words for utilizing the features of each web page. Web page categorization was performed using a support vector classifier (SVC) with the radial basis function (RBF) kernel and the parameter value C = 1 [40].
• LSTM: The LSTM model captures the contextual information of the text of the web pages in the forward direction and utilizes the last hidden state to categorize the web pages [18,43]. The word embedding layer resolves the data sparsity problem present in the BOW and TF-IDF machine learning approaches. In this paper, a deep learning model was also implemented, using a word embedding layer of 100 dimensions for feature extraction, followed by an LSTM layer and finally a softmax layer for web page categorization.
• BiLSTM: As LSTM lacks a backward layer, it only gathers contextual information from a web page in one direction. The BiLSTM model, on the other hand, collects the web page's contextual information simultaneously from left to right and from right to left [6,23]. Another model was implemented in this paper using a word embedding layer of 100 dimensions for feature extraction, followed by a BiLSTM layer and finally a softmax layer for web page categorization.
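For reference, the TF-IDF weighting used by the NB and SVM baselines can be sketched in a simplified form; this minimal variant uses raw term frequency times a log inverse document frequency, without the smoothing that library implementations such as scikit-learn's `TfidfVectorizer` apply:

```python
import math

def tf_idf(docs):
    """docs: list of token lists; returns one {term: weight} dict per document."""
    n = len(docs)
    df = {}  # document frequency of each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weighted = []
    for doc in docs:
        weights = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)
            # Note: a term present in every document gets idf = 0 here;
            # smoothed variants avoid that.
            idf = math.log(n / df[term])
            weights[term] = tf * idf
        weighted.append(weights)
    return weighted

# Toy corpus of preprocessed token lists.
docs = [["course", "lecture", "course"], ["faculty", "research"], ["course", "faculty"]]
vectors = tf_idf(docs)
```

Terms concentrated in few documents (like "lecture" above) receive higher weights than terms spread across the corpus, which is what makes the weighting discriminative for the classifiers.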
Various experiments were conducted to evaluate the performance of the different machine learning and deep learning models. Figure 7 shows a bar chart comparison on the basis of Precision, Recall, F1-score, and accuracy for the four baseline models. In this bar chart, the machine learning approaches NB and SVM achieved F1-scores of 75.34% and 76.50%, respectively. In comparison, the deep learning approaches LSTM and BiLSTM, which are suited to sequence processing tasks, achieved F1-scores of 80.02% and 80.50%, respectively, showing a better performance. The proposed model outperformed the other models because the semantic information generated by GloVe was passed to the stacked BiLSTM, and the stacking of BiLSTM layers extracted additional contextual features. This arrangement achieved a better performance in terms of precision (84.01%), recall (82.39%), F1-score (83.03%), and accuracy (85.32%).

Comparative Performance Analysis with Existing Work
The comparative performance analysis of the proposed model with state-of-the-art methods in terms of Precision, Recall, F1-score, and accuracy is shown in Tables 7 and 8. Gupta et al. [25] considered the same four categories that were used in the experiments of the proposed model. They reported an accuracy of 79.00% using the BERT pre-trained model with a deep residual inception module. The proposed model achieved an accuracy of 85.32%, which demonstrates a comparatively better performance, as shown in Table 7.
Bhalla et al. [14] proposed an automatic web page classification method in which domain-specific feature extraction was followed by an SVM classifier, achieving an F1-score of 86.33% with three classes of the WebKB dataset. The proposed model was also developed for three-class categorization of web pages and showed a better performance, with an F1-score of 88.84%, as shown in Table 8.

Applicability of the Proposed Model with the DMOZ Dataset
The proposed method was fine-tuned for the WebKB dataset, and it can also easily be fine-tuned on other web page datasets for categorization. A few experiments were also performed with the DMOZ dataset to prove the proposed model's generality and applicability. Figure 8 illustrates the confusion matrix of the proposed model with the thirteen classes of the DMOZ dataset. The effectiveness of the proposed model, evaluated in terms of Precision, Recall, and F1-score for each category of the DMOZ dataset, is shown in Table 9. The accuracy of the proposed model with the DMOZ dataset was 80.23%, and the F1-score was 77.49%. These experiments show that the proposed model performs well when applied to real-world benchmark datasets available for future research.

Discussion
A summary of the experimental results obtained throughout the development of the present article is shown in Table 10. Detailed results are quoted and discussed in the previous sections, but a summary is compiled here to demonstrate that the proposed method outperformed the alternatives in various scenarios. With the help of this table, one can easily observe that the proposed model showed a better F1-score and accuracy than the machine learning and deep learning models. Moreover, the proposed model performed significantly well on a more extensive dataset such as DMOZ with more categories. As far as applicability is concerned, the web page categorization model proposed here can be useful for improving the efficiency of search engines by providing keyword-based categorization of web pages before mapping them to the user query.

Conclusions
This paper proposed and implemented a model for web page categorization that utilized the GloVe and Stacked BiLSTM. Feature extraction and classifier design are crucial processes to achieve this task, and many machine learning models have shown a better performance in this field. However, it is still challenging to understand the semantic features of words in web pages, and the categorization accuracy needs to be improved. In this paper, the GloVe model extracted the semantic features of words in web pages, followed by the Stacked BiLSTM model which extracted the contextual features of the web pages. Finally, the Softmax classifier was applied to the extracted features to categorize the web pages into pre-defined categories. The experimental results demonstrate that the proposed web page categorization model performed better than the machine learning classifiers, deep learning models, and existing methods. Specifically, implementing the proposed model considering three categories of the WebKB dataset gave an accuracy of 89.09%, the same dataset with four categories gave an accuracy of 85.32%, and the DMOZ dataset considering thirteen classes gave an accuracy of 80.23%.
As far as limitations are concerned, the proposed work, based on a deep learning model, requires substantial hyperparameter tuning, and its computation time is also somewhat longer. On the other hand, the proposed model gave good accuracy values and F1-scores compared to machine learning models such as SVM and NB. The proposed method is also applicable to publicly available datasets, such as the DMOZ dataset used in this article. The proposed model resolves linguistic ambiguity, so it can also be applied to enhance the performance of search engines. In the future, web pages categorized by the proposed model may be used as a web corpus for domain-specific search engines, recommendation systems, and information retrieval to improve the quality of web searches. In addition, other word embedding models, such as fastText and BERT, may be utilized to enhance the accuracy of web page categorization.