Semantic Ontology-Based Approach to Enhance Arabic Text Classification

Abstract: Text classification is the process of assigning textual contents to a set of predefined classes and categories. As enormous numbers of documents and contextual contents are introduced every day on the Internet, it becomes essential to use text classification techniques for different purposes such as enhancing search retrieval and recommendation systems. A lot of work has been done to study different aspects of English text classification techniques. However, little attention has been devoted to studying Arabic text classification due to the difficulty of processing the Arabic language. Consequently, in this paper, we propose an enhanced Arabic topic-discovery architecture (EATA) that can use ontology to provide an effective Arabic topic classification mechanism. We have introduced a semantic enhancement model to improve Arabic text classification and the topic discovery technique by utilizing the rich semantic information in Arabic ontology. We rely in this study on the vector space model (term frequency-inverse document frequency (TF-IDF)) as well as the cosine similarity approach to classify new Arabic textual documents.


Introduction
Nowadays, text classification has become a vital technique to assign unclassified contents to pre-defined classes. Such a technique can help in finding interesting information and can enhance decision-making techniques. It is used in a wide range of applications such as search engines [1][2][3], web mining [4], recommender systems [5][6][7][8], and sentiment analysis [9][10][11][12][13]. Due to the increase of Arabic content on the Internet, there is a need for an effective Arabic text classifier that considers the rich morphology, syntax and semantics of the Arabic language.
Over the last few years, many researchers have introduced different methods for effective Arabic text classification such as k-nearest neighbor (kNN), artificial neural networks (ANN), naive Bayesian classifier (NB), and support vector machine (SVM) [14][15][16][17][18][19]. However, relatively little work in the literature analyzes Arabic text. Hence, we propose the enhanced Arabic topic-discovery architecture (EATA), which can provide an improved Arabic topic discovery mechanism for documents. We use the Arabic multi-disciplinary ontology introduced by Hawalah [20], which was constructed from multiple resources. This ontology consists of a number of topics, which are linked with sub-topics. We have introduced a semantic model that uses a vector space model (term frequency-inverse document frequency (TF-IDF)) and the cosine similarity approach to improve Arabic classification and topic discovery techniques. Finally, we have evaluated the performance of EATA for Arabic text classification by using common information retrieval evaluation techniques.
The rest of this paper is structured as follows. Section 2 discusses related work in the fields of topic discovery and text classification. Section 3 presents our approach of EATA. Section 4 presents the evaluation of our proposed framework and lists the experimental results. Finally, the paper ends with conclusions and future work.

Previous Work
Topic discovery, text categorization, or text classification can be defined as the process of labeling unstructured data (e.g., in webpages or documents) by using pre-defined categories in order to define such data and organize them [21][22][23][24][25][26]. A lot of work has been done to study the different aspects of text classification techniques and their usage in knowledge discovery. Similarly, many studies have been conducted on comparing different text classification algorithms for Arabic textual contents.
Mamoun et al. [27] compared the performance of different classification algorithms when applied to Arabic texts. The authors overcame the lack of Arabic corpora in the literature by forming a small corpus that consists of 10 categories and 1000 documents. This corpus was used in that work to compare three classifiers: support vector machine, naive Bayesian, and k-nearest neighbor. The authors claimed that the SVM classifier outperformed the other classifiers. However, one limitation of this work is that the number of categories used in the experiment is small, and there is no relationship between them.
Hussien et al. [28] compared the performance of three classifiers (SMO, naive Bayesian, and a decision tree classifier) by applying them to Arabic texts. A dataset of 2363 documents, categorized into 6 categories, was used to evaluate this work. The authors showed that applying pre-processing techniques to the Arabic texts can improve the accuracy of all the classifiers. However, the evaluation was applied to a small dataset that consists of six unrelated categories.
Wahbeh et al. [29] conducted a comparison of three text classifiers on Arabic text: support vector machine, naive Bayesian, and a decision tree classifier. They suggested a pre-processing approach, which includes removing stop words and normalization in order to reduce the noise of Arabic texts. In this experiment, they used a dataset of four categories and a total of 1000 documents. Moreover, they used two evaluation methods: k-fold cross-validation and a percentage split of testing and learning datasets. The results showed that the naive Bayesian classifier outperformed the other algorithms. One limitation of this work is that the dataset is very small for evaluating the algorithms.
Al-Thwaib et al. [30] conducted a comparison study of two classifiers, support vector machine (SVM) and k-nearest neighbor (kNN), when applied to Arabic texts. About 800 documents were classified under four categories. In contrast to [29], this paper demonstrated that the SVM algorithm outperformed the kNN algorithm in both of the proposed experiments: the percentage split method and the k-fold cross-validation method.
Al-diabat and Mofleh [31] evaluated different rule-based classification algorithms for Arabic text corpora, such as one rule, rule induction and decision trees. In this experimental study, they found that a customized hybrid approach performed better than the traditional algorithms.
The authors of [32][33][34] utilized naive Bayesian to perform supervised sentiment analysis for Arabic. In [32], a multi-algorithm approach was used, with naive Bayesian and a bag-of-words model, and the algorithms were applied to a dataset collected from e-commerce websites; the results showed an accuracy of 93.87% for naive Bayesian. In [33], the hyper algorithm was applied with naive Bayesian and with unigram and bigram features to a Twitter dataset; the results showed an accuracy of up to 90%. In [34], naive Bayesian was used with precision, recall and f-measure metrics; the results showed an accuracy of the naive Bayesian lexicon of up to 90%.
The authors of [35,36] applied naive Bayesian in their experiments with a dataset collected from social networks including Twitter and Facebook. The study in [35] addressed the issue related to the lack of lexicons for analysis and testing in the Arabic sentiment analysis context. A relatively large dataset of modern standard Arabic (MSA) comments and reviews was collected. The reported accuracy of naive Bayesian is up to 85%. Moreover, a supervised approach was used by [36]; the study used hyper-languages (Iraqi, Egyptian, and Lebanese dialect Arabic) with 10-fold cross-validation. The results showed an accuracy of 87.9%.
There are several studies on the pre-processing techniques of Arabic documents to enhance the classification performance. Larkey et al. [37] proposed a novel stemming approach to improve information retrieval for Arabic documents using the TREC Arabic dataset. El-Khair and Ibrahim Abu [38] observed that the classification effectiveness and accuracy of Arabic documents is enhanced by eliminating the stop words in these documents.
Reviewing the previous research, we noted that most researchers used traditional classification techniques, but little attention has been paid to the effectiveness of more advanced techniques, such as ontologies and the semantic web, in improving Arabic text classification.

Enhanced Arabic Topic-Discovery Architecture (EATA)
In 2018, we published a paper introducing a framework for constructing an Arabic multi-disciplinary ontology from various resources. This framework had two main phases: building an Arabic ontology and enhancing the Arabic ontology [20]. In this paper, we have introduced a third phase, in which we provide a novel way to utilize the prepared ontology and provide an effective mechanism for Arabic topic discovery. Figure 1 represents the complete EATA including the relationship with the previous phase.

Phase 3: Semantic Topic Discovery
In this phase, we aim at proposing a semantic enhancement model to improve the Arabic classification and topic discovery technique by utilizing the rich semantic information in the Arabic ontology. We rely in this phase on the vector space model (VSM) as well as the cosine similarity approach to classify new Arabic text in the Arabic ontology [39,40].
According to Salton et al. [41], classifying textual contents by using VSM and ranking them based on the similarity weights is effective and simple. For this reason, many researchers have used this technique for different applications. For example, Challam et al. [42] suggested VSM to provide users with more precise search results based on their pre-collected profiles. Castells et al. [43] proposed an ontology-based model to enhance the search mechanism by using VSM and a ranking algorithm. Mamoun et al. [27] observed that VSM performs better than the other classifiers such as k-nearest neighbor (kNN) and naive Bayesian (NB) in classifying Arabic documents. Odeh et al. [44] proposed an approach that uses the VSM to categorize Arabic textual contents.
In this project, VSM is used to convert all the Arabic ontology terms into vectors of weights with the term frequency-inverse document frequency (TF-IDF) technique. Once all the terms in the Arabic ontology are weighted, the cosine similarity approach is used to map any new Arabic text onto the Arabic ontology using Equation (4).
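The weighting and matching steps described above can be sketched as follows. This is a minimal illustration that assumes the common textbook form of TF-IDF (relative term frequency multiplied by the natural logarithm of N over the document frequency) and the standard cosine similarity; the exact variant used in this work may differ, and the function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse {term: weight} vector per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append({
            # ASSUMPTION: tf = count / length, idf = ln(N / df), no smoothing.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With this unsmoothed variant, a term that occurs in every document receives a weight of zero, so only discriminative terms contribute to the similarity between a new text and an ontology category.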
Many researchers in this stage have attempted to improve the classification approach by taking advantage of the semantic information in ontologies. An ontology offers a conceptual representation of information, which includes semantic entities, relations, and characteristics of all the conceptual topics. In this project, the Arabic ontology is used to provide rich semantic information to discover hidden semantic knowledge.
Various studies have been published in recent years that exploit the rich semantic information in ontologies to improve the traditional VSM classifier by applying different techniques. One of these techniques is presented by Trajkova and Gauch [45], who suggested using the similarity of textual documents and categories $c_i$ as well as super-categories $c_i$. Costa and Lima [46] extracted rich semantic knowledge from the relations of the ontology to improve the classification process. Du et al. [47] improved the classification process by introducing a semantic similarity vector space model, which combines both the vector terms and the semantic similarities between them.
However, most of the past studies do not consider noise while classifying documents. Similarly, the same sub-categories with different meanings may appear under different super-categories, such as 'education → software' and 'computing → software' in the ODP. In this paper, we propose a semantic clustering mechanism (SCM) to improve the classification process and overcome such challenges. The SCM aims at extracting the semantic relations that are hidden between documents and ontological categories. We first compute the cosine similarities between the ontological categories and textual documents. Then, we formulate semantic clusters from the n most similar results. Each cluster contains all the categories that are directly associated with each other. Algorithm 1 and Figure 2 present the working of SCM.
Figure 2 shows an example of how the SCM works. Consider a document that can be classified as a software document, and can be mapped to three sub-categories related to the ontology in Figure 2 with different relatedness scores. These are 'biology software', 'computer software' and 'robotics software'. To implement the SCM, we created clusters based on all the directly-related categories (i.e., clusters 1-3 in Figure 2) using Algorithm 1.
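The cluster-formation step can be sketched in a few lines. This is an illustrative reading of Algorithm 1, not the authors' implementation: the function name, the representation of ontology links as a set of unordered pairs, and the sample category names are all assumptions.

```python
def build_clusters(top_categories, links):
    """Group the top-matching categories into clusters of directly linked ones.

    top_categories: list of category names (the n best cosine matches).
    links: set of frozenset({a, b}) pairs that are directly linked in the
           ontology (hypothetical representation chosen for this sketch).
    """
    remaining = list(top_categories)
    clusters = []
    while remaining:
        # Seed a new cluster with the next unassigned category.
        cluster = [remaining.pop(0)]
        # The list grows while we iterate over it, so categories added to
        # the cluster are also checked for their own direct neighbours.
        for member in cluster:
            linked = [c for c in remaining if frozenset((member, c)) in links]
            for c in linked:
                cluster.append(c)
                remaining.remove(c)
        clusters.append(cluster)
    return clusters
```

For example, if 'computing' is directly linked to both 'computer software' and 'robotics', those three categories end up in one cluster, while an unlinked 'biology software' forms a cluster of its own.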
Then, we assume that a larger cluster with more categories and a higher average cosine similarity should have more relevance to a document. Therefore, to reflect this assumption, we first compute the average of each cluster using Equation (5):

$$CL_j.Average = \frac{\sum_{c_i \in CL_j} SW.c_i d}{|c \in CL_j|} \qquad (5)$$

where $CL_j$ is the cluster whose average we need to compute, $SW.c_i d$ is the similarity weight of $c_i$, and $|c \in CL_j|$ is the total number of $c \in CL_j$.
Next, we compute the strength of a cluster by dividing the total number of categories under the cluster by the total number of all interesting categories in all the clusters, as represented by Equation (6). This gives more importance to a larger cluster with more categories.

$$CL_j.Strength = \frac{|c \in CL_j|}{\sum_{k} |c \in CL_k|} \qquad (6)$$

where $|c \in CL_j|$ is the total number of categories under one cluster $CL_j$, which is divided by the total number of categories under all clusters. Next, we compute the weight of a cluster by using Equation (7). We then apply the weight of a cluster to all the concepts in that cluster in order to select the top ontological concept by using Equation (8).
$$SW.c_i d = SW.c_i d + CL_j.ClusterWeight \qquad (8)$$

where $SW.c_i d$ is the similarity weight between $c_i$ and $d$, and $CL_j.ClusterWeight$ is the weight of the cluster $CL_j$ that includes $c_i$.
Finally, once we spread the cluster weight to all the concepts in each cluster, the concept with the highest semantic similarity among all the clusters is selected to represent the input document.
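The weighting and selection steps can be illustrated with a short sketch. The cluster average, the cluster strength, and the final addition follow the description above; the form of Equation (7) is not given in the text, so the sketch assumes, purely for illustration, that the cluster weight is the product of average and strength. The function name and sample scores are hypothetical.

```python
def pick_top_concept(clusters, similarity):
    """Return (best concept, boosted scores) after spreading cluster weights.

    clusters: list of lists of concept names.
    similarity: dict mapping concept name -> cosine similarity to the document.
    """
    total = sum(len(cluster) for cluster in clusters)
    boosted = {}
    for cluster in clusters:
        # Cluster average: mean similarity of the concepts in the cluster.
        average = sum(similarity[c] for c in cluster) / len(cluster)
        # Cluster strength: share of all clustered concepts in this cluster.
        strength = len(cluster) / total
        # ASSUMPTION: Equation (7) is not reproduced in the text; the product
        # of average and strength is used here only as a placeholder.
        weight = average * strength
        for c in cluster:
            # Spread the cluster weight to every concept in the cluster.
            boosted[c] = similarity[c] + weight
    return max(boosted, key=boosted.get), boosted
```

Under this assumed weighting, a concept in a large, consistently similar cluster can overtake an isolated concept whose raw cosine similarity is slightly higher, which is the behaviour the mechanism is intended to produce.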

Evaluation
In this section, we aim at evaluating the proposed semantic clustering mechanism (SCM) in comparison to three baseline classifiers: support vector machine (SVM), naive Bayesian (NB) and decision tree (DT). In more detail, we applied the following steps:

Step 1: Dataset Preparation
As there is a lack of Arabic datasets in general, and Arabic ontological datasets in particular, we first created an Arabic ontological dataset using the following approaches:

(1) We used the AODP ontology dataset that was proposed by Hawalah [20]. This dataset contained 12 domains with 107 categories that contained a total of 5021 documents mapped to these categories; Table 1 shows the 13 domains of AODP. Hawalah [20] also used many pre-processing techniques to clean and process the textual contents of the categories, including lexical analysis and dimensionality reduction techniques. Hawalah [20] then proposed a mechanism to enrich the AODP by populating the sub-categories with related websites from a search engine.

(2) We found that some of the sub-categories have limited representative textual content, which might affect the evaluation of the proposed semantic clustering mechanism (SCM), so we decided to populate these sub-categories with manually selected webpages. To accomplish this task, we invited 53 participants to populate the sub-categories that contain little or no content with pre-labeled webpages. Each participant was asked to select two sub-categories from two different domains. For each sub-category, participants were asked to search for and find 10 related webpages. For each webpage, we applied the pre-processing and statistical textual analysis using the mechanisms proposed in [25]. In more detail, these mechanisms, which were used originally to create the AODP, clean the noise of the collected webpages by first removing all the non-textual contents including menus, HTML code, images, and videos. Then, a lexical analysis was conducted to remove the following: non-Arabic characters, numbers, symbols, extra blank lines and spaces, and Arabic diacritical markings. Then, the dimensionality reduction technique removed common Arabic stop words, and finally, we normalized Arabic terms using the Arabic stemmer that was proposed by [37] and cleaned the following prefixes and suffixes: -Replacing , , and with .
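The lexical analysis and stop-word removal described above can be sketched with regular expressions. This is a minimal illustration under stated assumptions: the Unicode ranges are the standard ones for basic Arabic letters and diacritics, the stop-word list is a tiny hypothetical sample rather than the list used for the AODP, and stemming and letter normalization are left out.

```python
import re

# Arabic diacritical marks (tashkeel), U+064B to U+0652.
DIACRITICS = re.compile(r"[\u064B-\u0652]")
# Any character outside the basic Arabic letter range U+0621 to U+064A
# that is not whitespace (covers Latin letters, digits, and symbols).
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")

# ASSUMPTION: a tiny illustrative stop-word list, not the list used in the paper.
STOP_WORDS = {"في", "من", "على", "إلى", "عن"}


def preprocess(text, stop_words=STOP_WORDS):
    """Strip diacritics, non-Arabic characters, and stop words; return tokens."""
    # Diacritics are deleted first so that they do not split a word when the
    # remaining non-Arabic characters are replaced by spaces.
    text = DIACRITICS.sub("", text)
    text = NON_ARABIC.sub(" ", text)
    # split() also collapses extra whitespace and blank lines.
    return [token for token in text.split() if token not in stop_words]
```

The returned token list could then be passed to a stemmer before the TF-IDF weighting step.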
Table 2 shows the details of the prepared dataset that will be used for the evaluation. Finally, we divided the prepared dataset into two subsets: a training dataset and a testing dataset. For this evaluation, we use the 10-fold cross-validation approach: at each stage of the cross-validation process, one of the n folds is used as test data, and the remaining n − 1 folds are used as training data.
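The fold rotation in k-fold cross-validation can be sketched as below. This is an illustrative helper, not the evaluation code used in this work; the function name is hypothetical, and the documents are assumed to have been shuffled beforehand.

```python
def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        # One contiguous block is held out for testing ...
        test = indices[start:start + size]
        # ... and everything else is used for training.
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size
```

Each document appears in exactly one test fold, so averaging a measure over the k rounds uses every document for testing once.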

Step 2: Evaluation Metrics
As different methods can be used to examine classification accuracy, we use in this paper the accuracy, precision, recall, and F1 measures as follows:

- Accuracy measure: the number of correct classifications divided by the total number of classifications, calculated as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

- Precision measure: the ratio of relevant data among the retrieved data, calculated as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

- Recall measure: the ratio of retrieved relevant data among all the relevant data, calculated as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

- F1 measure: a combination of the precision and recall, calculated as follows:

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where:
- TP: true positive.
- TN: true negative.
- FP: false positive.
- FN: false negative.
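The four measures can be computed directly from the confusion counts. The sketch below uses the standard definitions; the function name and the choice to return 0.0 when a denominator is zero are assumptions made for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: share of retrieved items that are relevant.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of relevant items that were retrieved.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For a multi-class setting such as the domains evaluated here, the counts would typically be taken per domain and the resulting measures averaged.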

Step 3: Evaluation Process
We aim at examining the impact of the proposed SCM on the Arabic classification performance, as well as comparing the performance of the SCM with three baseline methods: support vector machine (SVM), naive Bayesian (NB), and decision tree (DT). Furthermore, we have calculated the precision, recall, f-measure, and accuracy for each domain separately and compared all the results by calculating the averages of all the measures.

Evaluation Results
We have calculated the precision, recall, f-measure, and accuracy for each domain of the dataset separately using four classifiers: SVM, NB, DT, and SCM. Figures 3-6 show the comparison results of each domain when applying the DT, NB, SVM, and SCM, respectively. Table 3 summarizes the results of SCM compared to the baseline classifiers (i.e., DT, NB, and SVM) in terms of four evaluation measures: precision, recall, f-measure and accuracy. The SCM achieved the highest average accuracy for all domains (0.856), an improvement of 5.2%, 6.2%, and 13.3% over the SVM, NB, and DT classifiers, respectively. In contrast, the DT classifier achieved the lowest accuracy as well as the lowest results on all the other measures. We also notice that the SVM achieved the second best results on all the selected measures, which might be because the SVM usually performs well in text classification. The overall results show that the proposed SCM classifier can capture the semantic relations that might exist between related categories more effectively than traditional classifiers.

Conclusions and Future Work
Numerous studies have been conducted on topic discovery based on text classification algorithms. However, most of these studies target the English language rather than the Arabic language. Due to the complexity of the Arabic language, the same approaches are not very effective. Therefore, in this paper, we have presented the enhanced Arabic topic-discovery architecture (EATA), which provides a semantic ontology-based approach to enhance Arabic text classification. The lack of studies in the information retrieval domain for the Arabic language has motivated us to present a novel approach to improving the traditional classification algorithms to cope with the challenges of processing the Arabic language. In this paper, we have proposed a semantic enhancement model to improve the Arabic classification and topic-discovery technique, which utilizes the rich semantic information in the Arabic ontology. A semantic clustering mechanism is suggested to capture the semantic relations between the potential topics that can be mapped onto the textual content. We set up a mathematical model to compute the importance of such clusters and reflect it on the encompassed topics. Finally, we have conducted an experiment to validate the effectiveness of the EATA. The results show that the proposed framework outperforms the traditional classification algorithms, reaching an average accuracy of 0.856 compared to the SVM, NB, and DT baselines.
In the future, we will attempt to increase the size of the Arabic ontology by including more domains and multi-level topics. In addition, we will examine the effect of different semantic relations in the Arabic ontology and use some of the semantic reasoning mechanisms.

Algorithm 1: Semantic Clustering Mechanism
AO = the Arabic ontology
c_i.do = the document that contains the description of c_i
CS.Compute(x, y) = the cosine similarity method to compute the similarity between x and y
SW.c_i d = the semantic similarity weight between c_i and d
TC-Array = the top concept array that holds all the semantic similarity results between a document and all c ∈ AO
Input: d = a textual document that needs to be classified to a c ∈ AO
Output: The top similar concept to d
// Step 1) Compute the cosine similarity between d and all c_i ∈ AO
Initialize TC-Array;
foreach c_i ∈ AO do
    SW.c_i d.Add(CS.Compute(d, c_i.do));
    TC-Array.Add(SW.c_i d, c_i);
// Step 2) Apply semantic clustering mechanism
TC-Array.Sort(asc);
foreach c_i ∈ TC-Array do
    Create-new-cluster(CL_k);
    CL_k.Add(c_i);
    TC-Array.Remove(c_i);
    foreach c_l ∈ CL_k do
        foreach c_j ∈ TC-Array do
            if (c_l, c_j) are linked directly then
                CL_k.Add(c_j);
                TC-Array.Remove(c_j);

Figure 2. A sample of a hierarchical ontology.

Figure 3. The comparison results of the decision tree (DT) classifier for each domain.

Figure 4. The comparison results of the naive Bayesian (NB) classifier for each domain.

Figure 5. The comparison results of the support vector machine (SVM) classifier for each domain.

Figure 6. The comparison results of the SCM classifier for each domain.

Table 1. Number of domains and categories of AODP.

Table 2. Number of domains and categories of AODP.

Table 3. The comparison results of the proposed SCM classifier and other baseline classifiers.