A Rule-Based Approach to Embedding Techniques for Text Document Classiﬁcation

Abstract: With the growth of online information and the sudden expansion in the number of electronic documents provided on websites and in electronic libraries, categorizing text documents has become difficult. A rule-based approach is one solution to this problem; the purpose of this study is to classify documents using a rule-based approach combined with an embedding technique, document to vector (doc2vec). An experiment was performed on two data sets, Reuters-21578 and 20 Newsgroups, to classify the top ten categories of each using a document-to-vector rule-based method (D2vecRule). This method provided good classification results according to the F-measure and implementation-time metrics. In conclusion, our D2vecRule algorithm compared favorably with other algorithms such as JRip, OneR, and ZeroR applied to the same Reuters-21578 dataset.


Introduction
There has been an urgent need to classify the information available online over the past ten years, prompting researchers to focus on automatic text classification (ATC). A widely used research approach to this problem depends on rule-based and embedding techniques. Rule-based approaches began to emerge in the 1960s and became more common in the 1970s and 1980s [1]. The late 1980s witnessed the introduction of concurrent operation and activation of rules within production systems, which carried on into the following decade. A rule-based system includes a set of rules that can be applied for many purposes, including decision support or predictive decision-making in real applications. Methods for creating rules can be divided into the categories of 'separate and conquer' [2] and 'divide and conquer' [3]. The latter produces classification rules in the intermediate form of a decision tree, as in C4.5, C5.0, and ID3 [2]. Covering techniques [4], by contrast, induce rules directly in 'if-then' form. The structure of rule-based systems depends on the type of logic used, such as deterministic, fuzzy, or probabilistic logic. Rule-based systems can accordingly be divided into fuzzy, deterministic, and probabilistic systems, and further characterized by the structure of their rule bases, which may comprise single rules, chained rules, or modular rule bases [5]. In practice, ensemble learning may be performed in parallel, in a distributed manner, or on mobile platforms, according to the given computing environment; rule-based systems are correspondingly divided into three types, namely distributed, mobile, and parallel [6].
The Reuters-21578 newswire benchmark and 20 Newsgroups are the most widely used benchmark corpora in the text categorization research community. They appear in comparative studies of different approaches using flat (i.e., non-hierarchical) category systems [7]. Hierarchical text classifiers are among the first works in this field; experiments with two classifiers on a subset of the Reuters collection were reported by Koller and Sahami [8]. Our rule-based and embedding models contributed to classifying the categories of the Reuters dataset (such as Acq, Corn, Crude, Earn, Grain, and Ship) according to their contents. The objective of this manuscript is to provide deeper information about the performance of embedding applied to rule-based text classification. The main research question is how varying the rule base affects the performance of text classification, and we investigate the performance differences when combining our rule-based method with an embedding model (doc2vec) in the task of text classification on different datasets. We implement a number of steps to acquire robust rule-based and embedding models using the Reuters-21578 and 20 Newsgroups corpora so as to make text categorization (TC) easier. Finally, Figure 1 summarizes the essential steps of the rule-based approach.

Related Work
Various studies relating to classification have been carried out taking a number of approaches. In classification systems, a rule-based learning approach to text categorization is utilized. Imaichi and Yanase suggested using rule-based methods selectively depending on the nature of the information to be extracted, making comparisons with machine learning [9]. A rule-based learning model consists of a set of rules learned from data [10]. Han Liu introduced an integrated framework for the design of rule-based systems for categorization tasks, which included the processes of rule representation, rule generation, and rule simplification. The study stressed the importance of combining different types of rule learning algorithms via ensemble learning [5]. Rule-based machine translation (RBMT) treats ambiguities of morphology and lexicon as serious challenges; a contribution by Rios and Göhring [11] describes an approach to resolving morphologically ambiguous verb forms when a rule-based decision is not possible due to tagging or parsing errors. Cronin et al. developed automated patient portal message classifiers with a rule-based approach using natural language processing (NLP) and bag of words [12]. Ganglberger et al. discuss different automatic spike detection methods in order to improve detection performance and establish a user-adjustable sensitivity parameter, mainly by examining the functioning of a rule-based system, artificial neural networks (ANNs), and random forests [13]. Accordingly, a rule-based system needs feature selection to classify text documents. Feature selection can be performed by following one of three approaches: filter, wrapper, or embedded [14]. This study depends on embedded methods to select features; with the embedded method, the optimal parameters are learned while performing feature selection [15].
INRA (Institut national de la recherche agronomique) and CNRS (Centre national de la recherche scientifique) at University Paris Saclay proposed a two-step method to normalize multi-word terms with concepts from a domain-specific ontology. In this method, they used vector representations of terms computed with word embedding information and hierarchical information from ontology concepts [16]. Mikolov et al. presented word2vec, and Le and Mikolov later introduced the doc2vec algorithm, based on adjusted techniques for learning text embeddings analogous to word2vec, thus making doc2vec an extension of word2vec [17]. In their work, doc2vec was applied to model embeddings for text categorization. The motivation of this study was to classify text documents by taking a rule-based approach to embedding techniques, and this work will assist in determining acceptable methods to follow for text categorization based on measuring the related criteria. This manuscript comprises eight sections containing all the necessary information related to the rule-based approach using automatic text classification for the top ten categories of the Reuters-21578 and 20 Newsgroups data sets. The paper is structured as follows: Section 1 introduces the rule-based and text classification approaches. Section 2 presents related work. Section 3 explains the research methodology in detail. Section 4 presents the data analysis and results. Section 5 discusses the study results. Finally, Sections 6-8 provide a conclusion, future research directions, and limitations.

Embedding Methods
The presently applied rule-based approach with an embedding technique comprises numerous factors and is used in many applications, one being text categorization. Embedding is one of the promising applications of unsupervised learning as well as transfer learning, because embeddings are induced from large unlabeled corpora. Reference [18] introduced a novel approach that used two character-level embedding models (fastText and ELMo) and two document-level models (doc2vec and InferSent) for comparison with word-level word2vec. Our rule classification system uses the doc2vec model, which is a document embedding and one of the embedding methods. There are three types of embedding, as shown in the following: word, character, and document embedding.

Word Embedding
A "word embedding" can be defined as a content representation in which words of similar meaning receive similar representations. This method deals with representing documents and words and may be seen as one of the key advances of deep learning on natural language processing (NLP) problems. It is a category of methods in which real-valued vectors in a predefined vector space represent individual words. Each word is mapped to a vector, and the vector values are learned in a manner resembling a neural network; hence, the technique is usually grouped within the deep learning field. The key to the approach is the use of a dense distributed representation for each word: a real-valued vector with often tens or hundreds of dimensions. This contrasts with the thousands or millions of dimensions required by sparse word representations, such as one-hot encoding [19].
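The contrast between sparse one-hot vectors and dense embeddings can be sketched as follows. The vocabulary and the dense vector values below are illustrative only, not taken from a trained model:

```python
# Sketch contrasting a sparse one-hot encoding with a dense word embedding.
# The vocabulary and the dense vector values are made up for illustration.

vocab = ["grain", "wheat", "ship", "oil"]

def one_hot(word):
    """Sparse representation: one dimension per vocabulary word."""
    return [1.0 if w == word else 0.0 for w in vocab]

# A dense embedding uses far fewer dimensions; values here are illustrative.
dense = {
    "grain": [0.9, 0.1, 0.0],
    "wheat": [0.8, 0.2, 0.1],   # close to "grain": similar meaning
    "ship":  [0.1, 0.9, 0.3],
    "oil":   [0.0, 0.3, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

# One-hot vectors of distinct words are always orthogonal (similarity 0),
# while dense vectors can encode that "grain" and "wheat" are related.
print(cosine(one_hot("grain"), one_hot("wheat")))  # 0.0
print(cosine(dense["grain"], dense["wheat"]))      # close to 1
```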

Character Embedding
Word2vec is arranged based on character n-grams in a character embedding model. As character n-grams are shared across words, assuming a closed-world alphabet, these models can generate embeddings for out-of-vocabulary (OOV) words as well as words that occur infrequently. The two character-level embedding models, fastText and ELMo, may be used as in References [20,21] and are described in the following manner:

• fastText: applies a 300-dimensional model pre-trained on Common Crawl and Wikipedia via the Continuous Bag of Words (CBOW) architecture. To generate a representation for joint multiword expressions (MWEs), fastText treats every word as whitespace-delimited, so it removes the spaces and handles the expression as a single compound; for instance, 'couch potato' becomes 'couchpotato.' In the case of paraphrases, it uses the same word-averaging technique as word2vec.

• ELMo: utilizes the ElmoEmbedder class of Python's allennlp library, pre-trained on SNLI and SQuAD, with a dimensionality of 1024. The essential use case for ELMo is generating embeddings in context; however, no context is provided in the input here, for compatibility with the other models, so the full potential of this model remains unknown. ELMo is therefore not suitable, since the relative compositionality of a compound is often predictable from its component words alone [18], so the present study makes use of doc2vec instead.

Document Embedding
Doc2vec is a proposal for paragraph-level embeddings from the research team responsible for word2vec. The doc2vec approach can be used to learn a model that creates an embedding for a specific document. In contrast to some well-known methods (such as averaging word vectors, n-gram models, and bag of words (BOW)), doc2vec is general and can be used to create embeddings from text of any length. Doc2vec can be trained in a totally unsupervised fashion on large corpora of raw text, and it operates effectively when applied to represent extended texts [22]. In this paper, the doc2vec Distributed Memory model is used. Doc2vec is an extension of the present word embedding models, which are a popular way of learning word vectors, and it can be divided into two variants.

A Distributed Memory Model
This variant of doc2vec contributes to our study; our methodology for learning doc2vec is a model inspired by the techniques for learning word vectors. The inspiration is to train the model on the task of predicting the next word in a sentence. Although the word vectors are initialized randomly, as an indirect result of the prediction task they come to capture semantics. We use the same idea in our doc2vec in an identical manner: the paragraph vectors also help in the task of predicting the next word, given the many contexts sampled from the paragraph. In our doc2vec framework (see Figure 2), every paragraph is mapped to a unique vector, represented by a column in matrix D, and every word W likewise has a vector. The paragraph vector and word vectors are concatenated or averaged to predict the next word in a context; in our experiments, the concatenation method was used to combine the vectors. The paragraph token can be thought of as another word, and it acts as a memory recalling what is missing from the current context, or the topic of the paragraph. For this reason, this model is named the Distributed Memory Model of Paragraph Vectors (PV-DM). The contexts are fixed-length and are sampled from a sliding window over a paragraph. The paragraph vector is shared across all contexts generated from the same paragraph, but not across paragraphs. In this model, the fourth word can be predicted by using the concatenation or the average of the paragraph vector together with a context of three words. The paragraph vector represents the information missing from the current context and can act as a memory of the paragraph topic. After being trained, the paragraph vectors can be used as features for the paragraph. In summary, the algorithm has two main stages:

1. A training stage to acquire the word vectors W, the softmax weights U, b, and the paragraph vectors D on already-observed paragraphs.

2. An inference stage to acquire paragraph vectors D for new paragraphs by adding more columns to D and performing gradient descent on D while holding W, U, and b fixed. D is then used to make predictions about specific labels using a standard classifier [17].
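The data flow of the PV-DM training stage can be sketched as follows: each training example pairs a paragraph id (standing in for a column of matrix D) and a fixed-length window of context words with the next word to predict. Vector lookup and the softmax classifier are omitted; the sample documents are illustrative:

```python
# Sketch of how PV-DM forms its training examples: a paragraph id plus a
# fixed-length window of context words is used to predict the next word.

def pvdm_examples(paragraphs, window=3):
    """Yield (paragraph_id, context_words, target_word) triples.

    The paragraph id stands in for the column of matrix D that is
    concatenated with the context word vectors (columns of W).
    """
    examples = []
    for pid, text in enumerate(paragraphs):
        words = text.split()
        for i in range(len(words) - window):
            context = words[i:i + window]   # fixed-length sliding window
            target = words[i + window]      # the next word to predict
            examples.append((pid, context, target))
    return examples

docs = ["the cat sat on the mat", "grain prices rose sharply today"]
ex = pvdm_examples(docs, window=3)
print(ex[0])  # (0, ['the', 'cat', 'sat'], 'on')
```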

Distributed Bag of Words Model
The method described earlier involves predicting the following words in a text window by using a concatenation of the paragraph vector with word vectors. Another approach is to ignore the context words in the input and instead force the model to predict words randomly sampled from the paragraph in the output. In practice, this means that for each iteration of stochastic gradient descent, a text window is sampled, then a random word is sampled from that window, and a classification task is formed given the paragraph vector, as shown in Figure 3 [17]. To build the doc2vec model, the major training stages need to be prepared and tested with a dataset, as shown in the following.
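The PV-DBOW training task above can be sketched in the same style: given only the paragraph id, the model must predict words sampled at random from a window of that paragraph. Vector lookup and the classifier are again omitted, and the sample documents are illustrative:

```python
import random

# Sketch of the PV-DBOW training task: sample a text window from a
# paragraph, then sample a random word from that window, forming a
# (paragraph_id, word) classification pair.

def pvdbow_samples(paragraphs, window=4, n_samples=3, seed=0):
    """Yield (paragraph_id, sampled_word) classification pairs."""
    rng = random.Random(seed)
    pairs = []
    for pid, text in enumerate(paragraphs):
        words = text.split()
        for _ in range(n_samples):
            start = rng.randrange(max(1, len(words) - window + 1))
            win = words[start:start + window]     # sampled text window
            pairs.append((pid, rng.choice(win)))  # random word from window
    return pairs

docs = ["the cat sat on the mat", "grain prices rose sharply today"]
for pid, word in pvdbow_samples(docs):
    assert word in docs[pid].split()  # every target comes from its own paragraph
```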


Data Sets Types
A dataset is defined as a collection of related but separate items of data that can be accessed individually or in combination, and that can be formed and organized into some type of data structure. For example, a dataset may consist of a collection of business data (identities, salaries, names, addresses, contact information, etc.). It is possible to treat a database as a set of data and to connect the data inside it with a particular type of information. We will evaluate the doc2vec model for our rule-based approach on the following two datasets.

Reuters-21578 Dataset
The documents in the Reuters-21578 collection appeared on the Reuters newswire in 1987; this is a publicly available version of the well-known Reuters-21578 "ApteMod" corpus for text categorization. Personnel from Reuters Ltd. (S. Weinstein, S. Dobbins, and M. Topliss) and the Carnegie Group, Inc. (M. Cellio, P. Andersen, P. Hayes, I. Nirenburg, and L. Knecht) collected and indexed these documents according to certain categories. The Reuters-21578 collection is distributed in 22 files. Each of the first 21 files (reut2-000.sgm through reut2-020.sgm) contains 1000 documents, while the last (reut2-021.sgm) contains 578 documents. The files are in SGML format. Rather than going into the details of the SGML language, we describe how the SGML tags are used to divide each file, and each document, into sections. Each of the 22 files begins with a document type declaration line: <!DOCTYPE lewis SYSTEM "lewis.DTD">. The documents of Reuters-21578 are divided into training and test sets. Each document has five category tags, namely TOPICS, PLACES, PEOPLE, ORGS, and EXCHANGES. Each category lists the topics used for a document, but this study focuses on the TOPICS category only.
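Pulling the TOPICS tags and body text out of such a file can be sketched with regular expressions. The SGML fragment below is a simplified, illustrative stand-in for a real Reuters-21578 document, which carries more attributes and entities; a robust reader should use a proper SGML parser:

```python
import re

# Minimal sketch of extracting TOPICS and body text from a Reuters-21578
# style SGML fragment. The fragment is illustrative, not a real document.

sample = '''<REUTERS TOPICS="YES" NEWID="1">
<TOPICS><D>grain</D><D>wheat</D></TOPICS>
<BODY>Grain shipments rose this week.</BODY>
</REUTERS>'''

def parse_reuters(sgml):
    docs = []
    for block in re.findall(r"<REUTERS.*?</REUTERS>", sgml, re.S):
        topics = re.findall(r"<D>(.*?)</D>", block)
        body = re.search(r"<BODY>(.*?)</BODY>", block, re.S)
        docs.append({"topics": topics,
                     "body": body.group(1).strip() if body else ""})
    return docs

print(parse_reuters(sample))
# [{'topics': ['grain', 'wheat'], 'body': 'Grain shipments rose this week.'}]
```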

20 Newsgroups Dataset
About 20,000 newsgroup documents are collected in the 20 Newsgroups dataset and these documents are partitioned (almost) equally into 20 separate newsgroups. The data are organized into 20 different newsgroups, each corresponding to a different topic. Some of the newsgroups are very closely related to each other (including newsgroups such as comp.sys.mac.hardware and comp.sys.ibm.pc.hardware), while others are highly unrelated (including newsgroups such as misc.forsale and soc.religion.christian) [23].

Pre-Processing Data Sets
Pre-processing is an important step for initializing the text, and it takes a certain amount of processing time. Pre-processing includes several steps, such as tokenization, punctuation removal, stop word removal, and stemming.

Tokenizing
Tokenizing is the process of cutting the input text into pieces (words/tokens), preserving their sequence in the text while discarding certain characters, such as punctuation [24]. Tokenizing is thus defined as breaking documents down into words or terms called tokens. The entire text is lowercased and all punctuation removed when applying the tokenization process [25].
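A minimal tokenizer in the spirit described above can be sketched in a few lines: lowercase the text and split it into word tokens, discarding punctuation characters:

```python
import re

# Minimal tokenizer sketch: lowercase the input and return its word
# tokens in order, discarding punctuation and other non-word characters.

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("Grain prices ROSE sharply, analysts said."))
# ['grain', 'prices', 'rose', 'sharply', 'analysts', 'said']
```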

Punctuation
Punctuation is defined as a set of marks used to make sentences flow smoothly and express meaning accurately. These marks determine where to pause or give a signifying feeling to our words. Punctuation clarifies sentences by separating ideas; moreover, it points out quotes, titles, and other principal parts of the language. Finally, punctuation occurs in any text, which necessitates handling it during pre-processing. Examples include ",", "!", "?", "*".
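Stripping these marks can be done with Python's standard `string.punctuation` set, a common pre-processing step before tokenization:

```python
import string

# Remove all punctuation characters using the standard library's
# string.punctuation set and str.translate.

def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

print(remove_punctuation("Wait, what?!"))  # 'Wait what'
```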

Stop Word Removal
One important step in text classification is eliminating stop words. A stop word belongs to a list of commonly used words that have a grammatical function in a text but carry little meaning. Stop words in a text are removed to reduce noise terms so that, as a result, the keywords remain [26]. Stop words are common words occurring in most documents, such as "the," "and," "from," "are," "to," etc. This processing is required because stop words cannot decide the category of a document in the categorization system [25].
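Stop word removal amounts to filtering tokens against a stop list. The list below is a small illustrative subset; real systems use larger curated lists (e.g., the one shipped with NLTK):

```python
# Remove stop words against a small illustrative stop list.

STOP_WORDS = {"the", "and", "from", "are", "to", "of", "a", "in"}

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["the", "price", "of", "grain", "rose"]))
# ['price', 'grain', 'rose']
```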

Stemming
When acquiring information, stemming changes a word form to its root by means of specific principles related to the target language [27]. This is vital due to the presence of affixes, which consist of prefixes, infixes, suffixes, and confixes (combinations of prefixes and suffixes) in derived words [28]. Stemming is a process of reducing the terms to their roots. For example, words such as "working," "worker" and "worked" are reduced to "work" and "crumbling" and "crumbled" are reduced to "crumb." This process is used to reduce the computing time and space as different forms of words are stemmed into a single word. In fact, this is the main advantage of this process [25].
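The idea can be illustrated with a deliberately naive suffix-stripping stemmer; a production system would use a full algorithm such as Porter's stemmer (e.g., NLTK's PorterStemmer) rather than this sketch:

```python
# Naive suffix-stripping stemmer sketch: strip a few common English
# suffixes, keeping at least a three-letter stem. Illustrative only.

SUFFIXES = ("ing", "ed", "er", "s")

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["working", "worker", "worked", "works"]])
# ['work', 'work', 'work', 'work']
```

As in the text above, different inflected forms collapse to one stem, which reduces the feature space and the computing time.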

Local Dictionary Creation
A local dictionary performs feature selection in text categorization with a different set of features selected for each category. Several studies have been conducted using this type of dictionary. In a local dictionary, a distinct set of features is selected for each category, independently of the other categories; this dictionary speeds up the classification process for each category by selecting the most important features in that category. Table 1 introduces the local dictionary for a number of categories in the dataset.
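Building such a per-category dictionary can be sketched with term-frequency counts: for each category, count terms over that category's training documents only and keep the top-k terms as its feature set. The corpus below is illustrative:

```python
from collections import Counter

# Local dictionary sketch: for each category, count term frequencies over
# that category's documents only, then keep the top-k terms.

def local_dictionary(docs_by_category, k=3):
    dictionary = {}
    for category, docs in docs_by_category.items():
        counts = Counter(token for doc in docs for token in doc.split())
        dictionary[category] = [term for term, _ in counts.most_common(k)]
    return dictionary

corpus = {
    "grain": ["wheat grain export", "grain harvest wheat grain"],
    "crude": ["oil barrel price", "crude oil price oil"],
}
print(local_dictionary(corpus))
```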

Rule-Based Approach
The rule-based approach is considered one of the most flexible methods, as it opens the black box of the text classification process. The details of the classification process can be observed, and tools or new instructions can be added to obtain good results. The next subsections briefly explain the rule-based approach.

Rules-Preliminary
A rule-based system commonly comprises a set of if-then rules [29]. There are various approaches to knowledge representation in the area of artificial intelligence, but the most famous may be the if-then rule, defined as: "IF cause (antecedent) THEN effect (consequent)."

1. Rules: the best-known symbolic representation of knowledge derived from data; a natural and readable form of representation, allowing inspection and interpretation by humans;
2. A standard form of the rules;
3. Other forms: Class IF conditions; conditions → class.
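A minimal encoding of such "IF cause THEN effect" classification rules can be sketched as (condition, class) pairs applied in order. The terms and classes below are illustrative only:

```python
# If-then rules as (antecedent, consequent) pairs: the antecedent is a
# predicate over a document's token set, the consequent a class label.

RULES = [
    (lambda doc: "wheat" in doc or "corn" in doc, "grain"),
    (lambda doc: "oil" in doc and "barrel" in doc, "crude"),
]

def classify(tokens, default="unknown"):
    """Fire the first rule whose antecedent holds; else fall back."""
    for condition, label in RULES:
        if condition(tokens):
            return label
    return default

print(classify({"wheat", "export"}))         # 'grain'
print(classify({"oil", "barrel", "price"}))  # 'crude'
print(classify({"football"}))                # 'unknown'
```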

Rules-Formal Notations
Rule-based systems, also known as expert or production systems, represent a type of artificial intelligence. The rules in such a system serve as the learning representation for the information coded into the system [30]. The expert system is the fullest realization of rule-based systems: it copies the reasoning of human experts in solving a knowledge-intensive problem. Instead of representing knowledge in a declarative, static way, as a set of things that are true, rule-based systems represent knowledge as a set of rules determining what to do or what to conclude in various situations.

Structure of a Rule-Based Expert System
In the early 1970s, Newell and Simon from Carnegie Mellon University proposed the production system model, which is the foundation of modern rule-based expert systems [31]. The idea of the production model is that whenever humans apply knowledge (expressed as production rules), they can solve a problem represented by problem-specific information. The problem-specific information, or facts, is held in short-term memory, while the production rules are stored in long-term memory. A rule-based expert system has five components: the database, the knowledge base, the explanation facilities, the inference engine, and the user interface [32].

Classification Methods
Classification is a data mining technique that assigns items in a set to target classes. The aim of classification is to predict the target category correctly for each case in the data [33]. Three rule-based classification methods are applied as benchmark algorithms, in addition to our rule-based (D2VRule) method, in the study of the Reuters-21578 and 20 Newsgroups datasets.

JRip (RIPPER)
This algorithm is one of the essential and best-known rule learners. JRip (RIPPER) examines the classes in increasing size and generates an initial set of rules for each class using incremental reduced-error pruning, treating all the instances of a particular class in the training data as that class. It produces a set of rules that covers every member of the class, then proceeds to the next class and does the same, repeating the process until every class is covered [34].
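The covering loop at the heart of this family of algorithms can be sketched as follows. This is a strong simplification: it grows only single-condition rules and chooses the purest, most-covering (attribute, value) test, whereas the real JRip grows multi-condition rules and prunes them with incremental reduced-error pruning. The data are illustrative:

```python
# Simplified sequential-covering sketch in the spirit of RIPPER: repeatedly
# pick the (attribute, value) test whose covered instances all share one
# label, preferring wider coverage, then remove the covered instances.

def cover(instances):
    """instances: list of ({attr: value}, label) -> [(attr, value, label)]."""
    rules = []
    remaining = list(instances)
    while remaining:
        best = None  # (attr, value, label, n_covered)
        for features, label in remaining:
            for attr, value in features.items():
                covered = [(f, l) for f, l in remaining if f.get(attr) == value]
                if all(l == label for _, l in covered):  # pure rule only
                    if best is None or len(covered) > best[3]:
                        best = (attr, value, label, len(covered))
        if best is None:
            break  # no pure single-condition rule left
        attr, value, label, _ = best
        rules.append((attr, value, label))
        remaining = [(f, l) for f, l in remaining if f.get(attr) != value]
    return rules

data = [
    ({"topic": "wheat"}, "grain"),
    ({"topic": "corn"}, "grain"),
    ({"topic": "oil"}, "crude"),
]
print(cover(data))
```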

One Rule (OneR)
Abbreviated OneR, this method uses a simple algorithm that creates a one-level decision tree for classification. From the training instances, OneR infers simple but fairly accurate classification rules. In spite of its simplicity, OneR is able to treat missing values and numeric attributes flexibly. The OneR algorithm generates one rule for each predictor (attribute) in the data and selects the rule with the minimum error rate, following the principle of one rule per attribute in the training data [35].
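A compact sketch of OneR, under the assumption of nominal attributes and a toy weather-style dataset (missing-value and numeric handling omitted):

```python
from collections import Counter, defaultdict

# OneR sketch: for each attribute, build one rule per attribute value
# (predict the majority class for that value), then keep the attribute
# whose rule set makes the fewest errors on the training data.

def one_r(instances):
    """instances: list of (features_dict, label) -> (attr, value->label)."""
    best_attr, best_rule, best_errors = None, None, None
    for attr in instances[0][0]:
        by_value = defaultdict(Counter)
        for features, label in instances:
            by_value[features[attr]][label] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
        if best_errors is None or errors < best_errors:
            best_attr, best_rule, best_errors = attr, rule, errors
    return best_attr, best_rule

data = [
    ({"outlook": "sunny", "windy": "no"},  "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rain",  "windy": "yes"}, "stay"),
    ({"outlook": "rain",  "windy": "no"},  "stay"),
]
attr, rule = one_r(data)
print(attr, rule)  # outlook {'sunny': 'play', 'rain': 'stay'}
```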

ZeroR
ZeroR is considered the simplest classification method: it relies on the target and disregards all predictors. Although ZeroR has no predictive power, it is useful for establishing a baseline performance as a benchmark for other classification methods. ZeroR constructs a frequency table for the target and selects its most frequent value [36].
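ZeroR fits in a few lines: ignore all attributes and always predict the most frequent class in the training data. The labels below are illustrative:

```python
from collections import Counter

# ZeroR sketch: build a frequency table of the training labels and
# always predict the most frequent one; a baseline, nothing more.

def zero_r(labels):
    return Counter(labels).most_common(1)[0][0]

train_labels = ["earn", "earn", "acq", "earn", "crude"]
baseline = zero_r(train_labels)
print(baseline)  # 'earn'
# Baseline accuracy is just the majority-class frequency:
print(train_labels.count(baseline) / len(train_labels))  # 0.6
```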

Evaluation Measures
The performance of the feature selection approaches is evaluated using recall (R) and precision (P) [37].

Precision
Precision is defined as the percentage of the documents retrieved by the system that are correctly retrieved (TP) with respect to all documents the system retrieved (TP + FP) [37]:

P = TP / (TP + FP)

where:
• TP (true positive) is the number of documents correctly assigned to Class (i).
• FP (false positive) is the number of documents incorrectly assigned to Class (i) by the classifier, which actually do not belong to that class.

Recall
Recall is the percentage of the documents relevant to humans (TP + FN) that are correctly retrieved by the system (TP); in other words, recall equals the ratio of retrieved relevant documents to all relevant documents [37]:

R = TP / (TP + FN)

where:
• TN (true negative) is the number of documents the classifier does not assign to Class (i) that actually do not belong to Class (i).
• FN (false negative) is the number of documents the classifier does not assign to Class (i) that actually do belong to Class (i).

F-Measure
This measure is defined as a global estimate of the performance of an information retrieval (IR) system, combining precision (P) and recall (R) in a single measure called the F-measure [37]:

F = 2PR / (P + R)

Error Rate (Inverse of Accuracy)
The error rate is defined as the inverse of accuracy, i.e., the percentage of documents incorrectly classified by the system: Error rate = 1 − Accuracy.

Accuracy
Accuracy is defined as the percentage of documents correctly classified by the system [36]:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
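The evaluation measures above can be computed directly from the per-class counts TP, FP, FN, TN; the counts below are illustrative:

```python
# Evaluation measures computed from per-class counts (illustrative values).

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r):
    return 2 * p * r / (p + r)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

tp, fp, fn, tn = 80, 20, 10, 90
p = precision(tp, fp)           # 0.8
r = recall(tp, fn)              # 80/90
f1 = f_measure(p, r)
acc = accuracy(tp, tn, fp, fn)  # 170/200 = 0.85
err = 1 - acc                   # error rate is the inverse of accuracy
print(p, r, f1, acc, err)
```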

Experiment Setup
In this section, we describe the experimental setup of the text classification system, which includes the pre-processing, document representation, and rule-based induction, in addition to the evaluation metrics.

Preparing the Dataset
A collection of documents from the Reuters-21578 dataset was used (for the training dataset) and the top ten categories were selected for the 20 Newsgroups dataset. The first step of preparing the dataset was implemented using:
1. Tokenization
2. Punctuation removal
3. Stop word removal
4. Stemming
These approaches are explained in detail in the previous sections. Figure 4 shows the steps of preparing a dataset for the rule-based approach.
Appl. Sci. 2020, 10, x FOR PEER REVIEW


Rule-Based Processing (Documents Representation)
The previous steps were necessary to begin the rule creation process; the following sections describe building our rule base using the doc2vec approach, under the title document-to-vector rule-based (D2VRule).

Terms Indexing
Term indexing was considered a necessary step to build the dictionary and had benefits for the classification processes. This dictionary was named a local dictionary; it was considered the main dictionary for applying feature selection in text categorization. In this dictionary, a different set of features was selected from each category. Several studies have been performed using the local dictionary policy. In the local dictionary, a contrasting set of features was selected from each category independently of the other categories, and this dictionary helped to increase the ability of the classification process for each category by selecting the most important terms in that category.
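A small sketch of the local dictionary policy, assuming documents are given as token lists grouped by category (`top_k` is a hypothetical cut-off, not a value from the paper):

```python
from collections import Counter

def build_local_dictionary(docs_by_category, top_k=100):
    """For each category, independently select its most frequent terms,
    yielding one local dictionary per category."""
    local_dict = {}
    for category, docs in docs_by_category.items():
        counts = Counter(term for doc in docs for term in doc)
        local_dict[category] = [t for t, _ in counts.most_common(top_k)]
    return local_dict
```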

Doc2vec Creation
In this step, the doc2vec approach (explained in detail in the previous sections) was applied. A doc2vec model was built using the documents of the training dataset. This step was necessary in order to determine the similarity between vocabularies, which were sets of familiar words in the language of a document of the local dictionary as well as the training documents, in order to acquire the important selected features (vocabularies). These were used to classify the text documents of the test corpus.

Computing Similarity of Vocabularies
The vocabulary was extracted from the documents of the training dataset such that words that were similar or had a related meaning to other words were extracted. This can be of benefit when one wishes to avoid repeating the same word, by concentrating on vocabulary whose similarity value is near 1 and removing vocabulary whose value is near 0 (zero), depending on a threshold value. The similarity procedure was performed by building a doc2vec model to prepare the documents and computing the similarity of the vocabularies in the local dictionary with doc2vec itself, using special instructions in Python (most_similar).

Sorting of Vocabularies
The values of the similarities of the vocabularies were arranged according to threshold values, defined as points beyond which the program changes its behavior. In particular, the threshold value was represented as the value of the similarity of terms in documents, by which the important words in these documents were determined. Figure 5 presents the steps used to implement the rule.
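The sorting-and-thresholding step can be sketched as follows (the similarity scores here are invented for illustration):

```python
def select_vocabulary(similarities, threshold=0.5):
    """Sort terms by similarity score (descending) and keep those at or
    above the threshold: values near 1 are kept, values near 0 are dropped."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [(term, score) for term, score in ranked if score >= threshold]

# Hypothetical similarity values for four vocabulary terms.
kept = select_vocabulary({"wheat": 0.91, "vessel": 0.12, "corn": 0.84, "port": 0.40})
```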


Rule-Based Induction
Promising results can be obtained when applying our rule-based approach (D2VRule) to a number of standard problems in text classification. To classify objects, most learning algorithms must first transform these objects into a representation suitable for concept learning; the transformation process of electronic texts is discussed in the previous sections (Part 1 and Part 2). In D2VRule, as in other rule induction systems, a decision rule is defined as a set of Boolean clauses linked by logical (AND, OR) operators which together imply membership in a particular class. A classification hypothesis is usually built as a sequence of rules ending in a default rule with an empty set of clauses. When the classification process is applied, the core of the rule base can be divided into two parts: the left-hand sides of the rules are evaluated sequentially until one of them evaluates to true, and the right-hand side of that rule is offered as the class prediction.
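The rule structure described above can be sketched as an ordered rule list with a default rule (the example rules and class names are invented for illustration):

```python
def classify(doc_terms, rules, default="earn"):
    """rules: list of (clauses, label); a rule fires when all of its
    Boolean clauses hold (AND). Rules are tried in order; if none fires,
    the default rule (empty clause set) supplies the prediction."""
    terms = set(doc_terms)
    for clauses, label in rules:
        if all(clause(terms) for clause in clauses):  # left-hand side
            return label                              # right-hand side: prediction
    return default                                    # default rule

rules = [
    ([lambda t: "wheat" in t, lambda t: "export" in t], "grain"),  # wheat AND export
    ([lambda t: "vessel" in t or "port" in t], "ship"),            # OR inside a clause
]
```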

Set of Rule-Based Instructions
Text categorization was implemented based on a measurement metric called the feature selection metric. Its general idea was to determine the importance of words (vocabularies) using a measure that can remove non-informative words and retain informative ones.
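One simple instance of such a measure (a hypothetical ratio of in-class to total document frequency, not the paper's exact metric) can be sketched as:

```python
def informativeness(term, docs, labels, target):
    """Score how strongly `term` is associated with the `target` category:
    the fraction of documents containing the term that belong to `target`.
    Terms scoring near 0 are non-informative for that category."""
    in_class = sum(1 for d, y in zip(docs, labels) if y == target and term in d)
    out_class = sum(1 for d, y in zip(docs, labels) if y != target and term in d)
    return in_class / (in_class + out_class) if in_class + out_class else 0.0
```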

Rule-Based Evaluation
The rule-based categories were checked against the dataset categories and classification rules, and the evaluation measurements were then computed. The evaluation measurements include:
1. Precision measurements
2. Recall measurements
3. F-Measures
4. Error rate
5. Accuracy
Examples of an induction rule and the evaluation metrics are shown in Figure 6. Finally, the preprocessing of our rule-based approach can be arranged according to the block diagrams in Figure 7, building the block diagram of the rule-based technique. The following figure shows the processing of the rule-based approach for two partitions, which was used in the text classification technique.

Results
In this section, we present extensive investigations of the precision, recall, F-measure, error rate, and accuracy criteria. The precision and recall formulations (Equations (1)-(4)) were used for the Reuters-21578 and 20 Newsgroups datasets to classify the top ten categories individually. The computations were compared in order to select an acceptable method for implementing the text classification. In addition, our rule-based approach examined the acq, corn, crude, earn, grain, interest, money-fx, ship, trade, and wheat top ten categories of the Reuters-21578 dataset; for the 20 Newsgroups dataset, it examined the top ten categories of that dataset.
As seen in Figures 8-12, we explored the precision, recall, F-measures, error rates, and accuracy of a rule-based approach to classify the test documents when we selected the top ten categories of the Reuters-21578 and 20 Newsgroups datasets.

As shown in Figures 13-18, we explored the precision, recall, and accuracy of the rule-based approach in classifying the test documents when we selected the top ten categories of the 20 Newsgroups dataset. Finally, when the JRip, OneR, and ZeroR rules were applied to the Reuters-21578 dataset, we obtained F-measure and accuracy metrics of 0.713-0.752, 0.506-0.598, and 0.219-0.39 for JRip, OneR, and ZeroR, respectively. Table 2 introduces the comparison measurements among the three rule-based classification methods; the precision and recall of the system were averaged using the micro-average method.
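The micro-average pools the confusion counts over all classes before taking the ratios; a short sketch (the counts are invented for illustration):

```python
def micro_average(per_class_counts):
    """Micro-averaged precision and recall: sum TP/FP/FN across all
    classes first, then compute the two ratios from the pooled totals."""
    tp = sum(c["tp"] for c in per_class_counts)
    fp = sum(c["fp"] for c in per_class_counts)
    fn = sum(c["fn"] for c in per_class_counts)
    return tp / (tp + fp), tp / (tp + fn)
```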


Discussion
The development of computer technologies, rule-based techniques, and automatic learning techniques can make information retrieval technology easier and more efficient. There are many approaches to decision-making, such as rule-based methods and artificial neural networks. The rule-based approach is considered one of the most flexible methods, by which the black box of the text classification process can be opened: the details of the classification process can be seen, and tools or new instructions can be added to obtain good results. All preprocessing on the two datasets (Reuters-21578 and 20 Newsgroups) was implemented using the Python programming language, an open-source tools framework, and a document-level embedding (doc2vec) technique to represent the text documents, which appears to be more effective in the preparation of data. In addition, the rule-based approach supports the classification approach by improving the recall, precision, and accuracy measurements of the classification.
A suitable vocabulary (informative words) is selected according to the following criteria:
1. The highest value of similarity of the feature;
2. The highest term frequency (number of repetitions of important words in documents);
3. The highest document frequency (number of documents including the feature).
The recall, precision, F-measures, error rate, and accuracy are obtained according to a suitable choice of vocabulary selection. It is clear that precision and the other evaluation metrics of the rule-based approach in classifying the categories of the test datasets are affected by the above criteria. According to Ligęza [29], symbolic rules are among the most popular knowledge representation and reasoning methods. Therefore, we have many reasons to view the rule-based approach as superior to other approaches. Firstly, there is the naturalness of expression: expert knowledge can be used as guiding rules. Secondly, there is modularity, in which the rule-based approach can be considered an independent method. Thirdly, the restriction of syntax allows the construction of rules and the checking of consistency using other programs [38]. Fourthly, it is a compact representation of general knowledge; it can easily form the representation of general knowledge about a problem. Fifthly, there is the provision of explanations, represented by the ability of the rule-based approach to provide explanations for any derived conclusions in a direct manner, which is considered a vital feature [39]. The information extraction techniques of the rule-based approach have been used effectively in commercial systems and are favored because they are easily understood and controlled [40]. A rule-based approach and a temporal specificity score (TSS)-based classification approach were proposed in [41]; the results show that the proposed rule-based classifier outperforms the other four algorithms by achieving 82% accuracy, whereas the TSS classification achieves 77% accuracy. In 2019, Li et al.
[42] proposed a model whose performance was still good and mostly stable with respect to the F-measure; on the curve of this measure, when the number of extracted keywords N was 7, the F-measure reached a maximum of 43.1%, compared with Xia's work [43], in which the basic idea of TextRank for keyword extraction was introduced along with the process of constructing candidate keywords, and the F-measure reached up to 37.28%. All of these previous results were lower than ours, where the F-measure reached 76.75%. Decision table, Ridor, OneR, DTNB, and PART are five algorithms that were applied to the chess end-game dataset; using evaluation metrics to check the performance of these algorithms, the results show that PART is the most acceptable rule-based classification algorithm when compared with the other studied rule-based algorithms.
On the other hand, the OneR algorithm showed an overall low performance for every parameter, and when these results were matched with ours after applying the OneR rule to the Reuters-21578 dataset, it became clear that the OneR algorithm had low values of precision, recall, and F-measures [44]. A single attribute-based classification (SAC) divides the original dataset into multiple one-dimensional datasets; the experimental results show that SACs performed better than the classical OneR algorithm when the performance of different classification methods was examined on a large dataset [45].
The algorithms tested were SMO, J48, ZeroR, OneR, RPart, and the Naive Bayes algorithm. The highest error was found in the ZeroR classifier, with an average score of approximately 0.5, while the other algorithms averaged 0.1-0.2; therefore, the ZeroR technique is not a good option for the classification of any dataset due to its many errors [46], and these results are in agreement with our conclusion. The performance of three rule classifier algorithms, namely RIDOR, JRIP, and Decision Table, on the Iris dataset was calculated using cross-validation. It was observed that the JRIP technique is not a good option for classification [47], and when applying those algorithms to our dataset, it became apparent that our results agree with those of the other algorithms. In Reference [48], an improved hierarchical clustering algorithm was developed based on association rules and tested on the benchmark Reuters-21578 dataset; the results (F-measures) produced by the association rule-based hierarchical clustering (ARBHC) method are better than those of the traditional hierarchical algorithm, but these results (an F-measure of 29%) are much lower than ours. uRule is a rule-based classification and prediction algorithm proposed to classify a limited amount of uncertain data; the accuracy of the uRule classifier remains relatively stable, like that of our rule-based approach, but our rules were applied to a huge number of documents within the Reuters and 20 Newsgroups datasets [49].
Reference [50] presents a new technique using state-of-the-art machine learning methods, namely deep learning, to solve the problem of choosing the best neural network structures and architectures out of many possibilities. It introduced RMDL (random multimodel deep learning) for classification, which combines multiple deep learning models to produce better performance. This approach was evaluated on datasets such as the Web of Science (WOS), Reuters, MNIST, CIFAR, IMDB, and 20 Newsgroups, and it shows improvement in classification accuracy for both text and image classification. In the best case, the accuracy for the Reuters-21578 and 20 Newsgroups datasets is 90.69% and 87.91%, respectively, whereas our accuracy for the same datasets was 90.72% and 90.07%, respectively.
This provides a better classification process according to evaluation metrics.

Conclusions
We selected our rule-based approach to classify text documents into ten categories for two datasets, which in our case were the Reuters-21578 and 20 Newsgroups datasets. The programs were implemented in Python. These results are expected to be beneficial for information retrieval systems, and this work has assisted us in identifying acceptable methods for text classification based on the precision, recall, and accuracy approaches. In conclusion, the results for Reuters-21578 were 79% precision, 75% recall, a 76.75% F-measure, a 9.28% error rate, and 90.72% accuracy. For the 20 Newsgroups dataset, the results were 76% precision, 66.64% recall, a 70.98% F-measure, a 9.93% error rate, and 90.07% accuracy. When we compared our algorithm with other algorithms (JRip, OneR, and ZeroR) on the Reuters-21578 dataset, using the performance factors of precision, recall, F-measure, error rate, and classification accuracy, it was observed that our algorithm performed better than the other algorithms and provided a good classification process. Our intention is to make some improvements to the rule-based approach so that it is more active with the real-time dataset of the Reuters agency, as well as to select new types of machine learning.

Future Work
We intend to make further contributions, with some enhancements to the rule-based approach so that it is more active with real-time datasets, such as newspaper datasets. Content or products can be tagged using categories as a way to improve browsing or to identify related content on a website. Platforms such as e-commerce sites, news agencies, content curators, blogs, and directories can use automated technologies to classify and tag content and products.

Limitations
Text classification is an important research problem in many fields. However, there are several challenges remaining in the processing of textual data [51].

1. Our results pertain to two specific datasets, namely Reuters-21578 and 20 Newsgroups.

2. We worked to improve the classification technique by taking a large number of documents in the training part of the dataset, since the volume of the training data plays an important role in learning a model. Training data must be labeled and large enough to cover all upcoming classes.