Topical and Non-Topical Approaches to Measure Similarity between Arabic Questions

Abstract: Questions are crucial expressions in any language. Many Natural Language Processing (NLP) and Natural Language Understanding (NLU) applications, such as question-answering systems, chatbots, digital virtual assistants, and opinion mining, can benefit from accurately and efficiently identifying similar questions. We detail methods for identifying similarities between Arabic questions posted online by Internet users and organizations. Our novel approach combines a non-topical rule-based methodology with topical information (textual similarity, lexical similarity, and semantic similarity) to determine whether a pair of Arabic questions are paraphrases of each other. Our method computes the lexical and linguistic distances between the questions. Additionally, it classifies questions according to their format and scope using expert hypotheses (rules) that have been experimentally shown to be useful and practical. Even if there is a high degree of lexical similarity between a When question (Timex Factoid, inquiring about time) and a Who question (Enamex Factoid, asking about a named entity), they will not be similar. In an experiment using 2200 question pairs, our method attained an accuracy of 0.85, which is remarkable given the simplicity of the solution and the fact that we did not employ any language models or word embeddings. To cover common Arabic queries posed by Arabic Internet users, we gathered the questions from various online forums and resources. In this study, we describe a unique method for detecting question similarity that does not require intensive processing, a sizable linguistic corpus, or a costly semantic repository. Because rich Arabic textual resources are scarce, this is especially important for informal Arabic text processing on the Internet.


Introduction
It is a significant challenge to determine whether two utterances (lexical units, sentences, questions) are similar using Natural Language Processing (NLP) [1]. Similarity detection underlies the success and the substantially improved results reported for many NLP engines; examples include text-based Information Retrieval (IR) [2,3], machine translation (MT) [4], text clustering [5], opinion mining, and sentiment analysis [6][7][8].
The topic of text similarity has been addressed by many researchers from various angles. Some approaches focus on similarity between strings or character sub-sequences of the texts, such as the longest common sub-sequence (LCS). Alternatively, other approaches, such as cosine similarity and Jaccard similarity, emphasize the importance of the lexical units, where two utterances are similar if they share common words (lexical units) [9]. These are considered efficient methods for identifying similarity between utterances based on shared lexical units.
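As an illustration, these three measures are straightforward to compute. The following Python sketch (illustrative only, not the paper's implementation) computes Jaccard similarity and cosine similarity over token lists, and the LCS length over strings:

```python
from collections import Counter
from math import sqrt

def jaccard(tokens_a, tokens_b):
    """Jaccard similarity of two token lists: |A ∩ B| / |A ∪ B|."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 1.0

def cosine(tokens_a, tokens_b):
    """Cosine similarity of two token lists over term-frequency vectors."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def lcs_len(s, t):
    """Length of the longest common sub-sequence of two strings
    (classic dynamic-programming formulation, O(|s| * |t|))."""
    prev = [0] * (len(t) + 1)
    for ch in s:
        cur = [0]
        for j, ct in enumerate(t, 1):
            cur.append(prev[j - 1] + 1 if ch == ct else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]
```

For example, `jaccard(["a", "b", "c"], ["b", "c", "d"])` yields 0.5, since the two token sets share two of four distinct tokens.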
By comparison, semantic similarity seeks logical relatedness between utterances even when their surface texts are not similar to one another [10]. For instance, two questions may differ at the word level yet still concern the same topic; a pair of questions about the Titanic, one of which is "When did the Titanic ship sink?", may ask about different aspects of the same topic.
Non-topical similarity focuses on the interrogative tools (words) used to form the question, regardless of the topic of the question. For topical similarity, we use lexical and semantic similarity measures. In particular, we use Normalized Google Distance (NGD) [22] for semantic similarity, and we use rule-based approaches to address non-topical similarity.
It is common among researchers in this domain to consider only corpus data-driven algorithms to perform clustering and classification tasks on textual data (including questions) [23]. We believe that this is an important aspect of measuring question similarity. However, without the aid of a corpus, basic and straightforward rules may be hypothesized to improve the processing of the questions and to streamline their categorization.
For example, the following two Arabic questions are not similar, despite having high character sub-sequence similarity, high word-to-word similarity, and even high topical semantic similarity, merely because Question 1 asks about a time and Question 2 asks about a location. Question 1 (English gloss): "When did the Battle of Dignity (Al Karamah) occur?" Question 2 (English gloss): "Where did the Battle of Dignity (Al Karamah) occur?" Our approach can detect that Q1 and Q2 are topically similar but differ non-topically.
In this article, we present a comprehensive approach to analyzing Arabic questions, and utilize that approach in Arabic question similarity detection with high accuracy given the limited linguistic resources of the Arabic language.
The structure of this article is as follows: The most pertinent previous research is discussed in Section 2. In Section 3, we present our method to measure topical similarity. Section 4 discusses the proposed non-topical similarity measures. Section 5 outlines our data acquisition and preparation. In Section 6, we present our experimental results, followed by evaluation and assessment remarks in Section 7. Finally, Section 8 lists our conclusions.

Text Similarity Approaches
We can view similarity between utterances as character similarity, lexical similarity, and semantic similarity. The focus of this article (question similarity) is a special case of the above similarities.
Character similarity [24] depends on the character arrangement of the text: two utterances are identical if they contain the same strings and characters, and the longest common sub-sequence (LCS) is among the most frequently used character-level algorithms. Distributional semantic models, by contrast, determine relatedness from co-occurrence statistics; an example is Extracting DIStributionally similar words using COoccurrences (DISCO) [42], a statistical model based on a large vocabulary.
The above algorithms determine similarity considering word and text collocations, and they need a large and well-maintained textual corpus to function reliably and efficiently.
In order to improve the accuracy and coverage of the semantic similarity engine, a semantic network such as Wordnet [43] is often coupled with it.
In reality, a large number of scholars use WordNet extensively to calculate similarity, and WordNet-based measures are often regarded as standalone semantic similarity metrics. This is beneficial for resource-rich languages such as English (the English WordNet contains 155,327 words structured into 175,979 synsets). The case of question similarity is special because questions usually have a short and limited context; hence, determining question similarity is a challenging task. The challenge increases for Arabic, where semantic similarity algorithms cannot be fully utilized because of the absence of rich textual resources. As a result, we present a hybrid technique that takes advantage of character similarity, lexical similarity, and semantic similarity but does not need enormous textual resources, access to which is still a challenge for poorly resourced languages such as Arabic.

Topical Similarity
Topical similarity between questions measures the distance between the topics of the questions regardless of the question type or scope. For example, two questions would be considered similar if they both asked about World War II, regardless of which aspects of World War II the two questions address. To determine topical similarity, our approach extracts two groups of features from each question: 1. Character and lexical features; 2. Semantic features.
Accordingly, we measure distances between the features of a pair of questions. The next subsections provide more details.

Character and Lexical Similarity of Arabic Questions
Here, we process a pair of Arabic questions (AQ1, AQ2) to determine their textual similarity (string and lexical similarity). We use a number of text similarity metrics, which provide a set of features for each pair. In order to create the set of features that belong to the pair, Algorithm 1 processes AQ1 and AQ2 as follows.
The algorithm analyzes a whole array of question couples, C. It starts by sending each question in every couple to an Arabic text normalizer, followed by a special question normalizer (described in Algorithm 2) that tries to eliminate nonstandard question words. This unifies the questions and removes avoidable variations, which increases the accuracy of the topical similarity. Algorithm 2 uses a dictionary of nonstandard question words mapped to standard words; for example, a nonstandard question phrase (English gloss: "in what city is ... located?") is mapped to the standard question word (English gloss: "where").
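The dictionary-based question normalization of Algorithm 2 can be sketched as follows. This is a minimal illustration, not the paper's implementation; the mapping entries are hypothetical English glosses, whereas the paper's dictionary holds Arabic forms:

```python
# Hypothetical mapping of nonstandard question phrases to standard
# question words (English glosses for readability; the paper's
# Algorithm 2 uses an Arabic dictionary of this shape).
NONSTANDARD_TO_STANDARD = {
    "in what city is": "where",
    "at what time did": "when",
    "for what reason": "why",
}

def normalize_question(question: str) -> str:
    """Replace a leading nonstandard question phrase with its
    standard question word, after basic text normalization."""
    q = " ".join(question.lower().split())  # lowercase, collapse whitespace
    for phrase, standard in NONSTANDARD_TO_STANDARD.items():
        if q.startswith(phrase):
            q = standard + q[len(phrase):]
            break
    return q
```

Unifying question words in this way means that downstream lexical measures compare the questions' content rather than incidental variation in the interrogative phrasing.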
We used the FARASA Arabic tool [44] for the processing pipeline of the Arabic text of each couple.
In summary, Algorithm 1 produces the following features for every couple in C:
1. Longest common sub-sequence (LCS) similarity for AQ1, AQ2;
2. Cosine similarity for AQ1, AQ2 after the normalization of their bag of words (BOW);
3. Jaccard similarity for AQ1, AQ2 after the normalization of their BOW;
4. Euclidean distance for AQ1, AQ2 after the normalization of their BOW;
5. Jaccard similarity for AQ1, AQ2 after the normalization of their Named Entities;
6. Cosine similarity for AQ1, AQ2 after the normalization of their Named Entities;
7. Jaccard similarity for AQ1, AQ2 after the Part of Speech (PoS) analysis of their normalized form;
8. Cosine similarity for AQ1, AQ2 after the Part of Speech (PoS) analysis of their normalized form;
9. The starting similarity measure, calculated according to Algorithm 3;
10. The ending similarity measure, calculated according to Algorithm 4;
11. The question word similarity, calculated according to Algorithm 5.
The following is Algorithm 3, which calculates the starting similarity measure; it receives the normalized bags of words of a question couple and returns a score of −1, 0, or 1. If the first two words in bowaq1 and bowaq2 are the same, Algorithm 3 returns 1; if only the first word is the same, it returns 0. Otherwise, it returns −1.

3: If bowaq1[1] == bowaq2[1] && bowaq1[2] == bowaq2[2]
4:   Return 1
5: Elseif bowaq1[1] == bowaq2[1]
6:   Return 0
7: Else
8:   Return −1

The following is Algorithm 4, which calculates the ending similarity measure; it receives the normalized bags of words of a question couple and returns a score of −1, 0, or 1. If the last two words in bowaq1 and bowaq2 are the same, Algorithm 4 returns 1; if only the last word is the same, it returns 0. Otherwise, it returns −1.
The advantage of this feature is that certain couples may exhibit high string and lexical similarity, yet the dissimilarity of the last few words of the questions may completely alter the questions' meaning. Algorithm 5 receives the normalized bags of words of a question couple and returns a score of −1, 0, or 1. It determines similarity by relying on the scope of the question: if AQ1 and AQ2 have the same type and scope, it returns 1; if their scopes are related, it yields 0; and if they are wholly unlike, it returns −1. A function called findaqw identifies the question word or words used in the question. Section 4 ("Non-Topical Similarity") further discusses question types and scopes. This feature is a non-topical feature because it is determined purely by the question type rather than the topic of the question.

5: If the scope of aqw1 and aqw2 is the same
6:   Return 1
7: Elseif the scopes of aqw1 and aqw2 are related
8:   Return 0
9: Else
10:   Return −1
11: // end of Algorithm 5
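The starting and ending similarity measures of Algorithms 3 and 4 can be sketched in Python as follows (a minimal, illustrative sketch over token lists, not the paper's implementation; the pseudocode's 1-based word indices become 0-based slices here):

```python
def start_sim(bow1, bow2):
    """Algorithm 3 sketch: 1 if the first two normalized words match,
    0 if only the first word matches, otherwise -1."""
    if bow1[:2] == bow2[:2]:
        return 1
    if bow1[:1] == bow2[:1]:
        return 0
    return -1

def end_sim(bow1, bow2):
    """Algorithm 4 sketch: the same comparison applied to the last two words."""
    if bow1[-2:] == bow2[-2:]:
        return 1
    if bow1[-1:] == bow2[-1:]:
        return 0
    return -1
```

For instance, two questions that both begin with "when did" score 1 on the starting measure even if their endings differ, and vice versa for the ending measure.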

Semantic Similarity (Normalized Google Distance)
We use the Normalized Google Distance (NGD) to determine semantic similarity. NGD is a semantic similarity metric computed from the number of results returned by the Google search engine in response to a given query string.
Words with different meanings tend to be farther apart on the Normalized Google Distance scale than terms that are semantically linked to one another.
To be more exact, we can calculate the NGD of t and r (where t and r are both search terms) according to the following formula:

NGD(t, r) = (max{log f(t), log f(r)} − log f(t, r)) / (log G − min{log f(t), log f(r)})

where f(t) is the number of results produced by a Google search for the term t; f(r) is interpreted in the same way; f(t, r) is the number of hits returned when Google is searched for t and r together; and G is the total number of pages indexed by Google. NGD(t, r) will be close to 0 if the terms t and r are related. We use NGD for Arabic question couples because it is practically convenient, computationally efficient, and does not require a corpus (unlike most other semantic similarity algorithms). Algorithm 6 shows the steps towards determining NGD similarity. Algorithm 6 receives a couple of Arabic questions and returns their NGD similarity. It should be noted that Algorithm 6 removes question words using the RemoveQW function (which is the opposite of findaqw). The number of results returned by a search for a very common word ("the") is used in Algorithm 6 to estimate the total number of pages that Google has indexed.
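Given the hit counts, the formula above reduces to a few lines of code. The sketch below is illustrative; f(t), f(r), f(t, r), and the index-size estimate G are supplied by the caller, and the actual querying of the search engine (as in Algorithm 6) is omitted:

```python
from math import log

def ngd(f_t, f_r, f_tr, G):
    """Normalized Google Distance from raw hit counts.

    f_t, f_r: hits for each term alone; f_tr: hits for both terms
    together; G: estimated number of indexed pages (the paper
    estimates it via the hit count of a very common word)."""
    lt, lr, ltr = log(f_t), log(f_r), log(f_tr)
    return (max(lt, lr) - ltr) / (log(G) - min(lt, lr))
```

When two terms always co-occur (f(t) = f(r) = f(t, r)), the numerator vanishes and the distance is 0; terms that rarely co-occur yield larger distances.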

Non-Topical Similarity
In this section, we investigate non-topical similarity (interrogative similarity) between Arabic questions. The focus here is on the interrogative tool used to form the question rather than on the topic of the question. This can be very helpful in determining the overall distance between two questions. Table 1 shows the most important scopes of questions asked in Arabic; each scope is labeled according to the kind of response the question expects. For example, the response to a Timex Factoid question is undoubtedly a time or a date, whereas the response to a Location Factoid question is a geographical region or a location. Semantically, two such questions (Timex Factoid, Location Factoid) will probably yield two different answers; consequently, we can deduce a semantic distance even in the presence of high lexical (topical) similarity. We calculate a similarity metric for two Arabic questions by comparing the scope of the interrogative words (question words) in each of the questions. When developing the similarity criteria, we make use of both experimental and theoretical approaches.
For instance, it is obvious that a Method question beginning with "How" will not be the same as a Timex Factoid question beginning with "when"; on this basis, we can construct the following rule: If AQ1.sid = M and AQ2.sid = T then aqw1 = −1. Empirical experiments can validate or invalidate this hypothetical rule. Similar rules can be crafted; for instance, if two questions are of the same scope, then the rule assigns them a similarity of 1. Through our experiments, we found that some pairs of different scopes had unproven similarity or distance; in such cases, the rules assign a score of 0.
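Such scope rules can be represented as a simple lookup over unordered scope pairs. The sketch below is illustrative, with a hypothetical two-entry rule table and invented scope identifiers standing in for the paper's full rule set from Table 1:

```python
# Illustrative scope identifiers (assumptions, not the paper's full
# inventory): "T" = Timex Factoid, "L" = Location Factoid, "M" = Method.
SAME, RELATED, DIFFERENT = 1, 0, -1

# Hypothetical rule table over unordered scope pairs.
RULES = {
    frozenset({"M", "T"}): DIFFERENT,  # a How question is never a When question
    frozenset({"T", "L"}): DIFFERENT,  # a When question is never a Where question
}

def scope_rule(sid1, sid2):
    """Non-topical similarity score for two question scopes."""
    if sid1 == sid2:
        return SAME
    # scope pairs with unproven similarity or distance default to 0
    return RULES.get(frozenset({sid1, sid2}), RELATED)
```

Using `frozenset` keys makes each rule symmetric, so `scope_rule("M", "T")` and `scope_rule("T", "M")` return the same score.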

Data Preparation
To test our proposed approach, we compiled 3382 Arabic questions from the Internet. A total of 2932 Arabic questions were extracted from Ejaaba.com (accessed on 1 February 2022), which is a collaborative Arabic community for answering casual questions. In addition, 450 questions were extracted semi-automatically from various Frequently Asked Questions pages, such as those of United Nations organizations, universities, and NGOs.
The 3382 questions were used to randomly generate 2200 Arabic question pairs. Each couple was labeled as T or F (where T indicates a similarity, and F indicates no similarity). In total, 679 couples were given a T label, and 1518 were given an F label.
It was statistically difficult to find a natural occurrence of T couples. Therefore, most of the T-labeled couples were crafted using various paraphrasing approaches by native speakers. We used the same approach for paraphrasing 150 F couples.
Normalization was performed on each of the couples in the dataset, which comprised 2200 couples. Normalization included Arabic text normalization and Arabic question normalization. Then, Algorithm 1 generated the proposed topical and non-topical features.
The distribution of the 3382 questions across their respective scopes is given in Table 2. The size of our dataset is larger than (or comparable to) that of similar Arabic and non-Arabic experiments conducted on labeled data; Table 3 lists the dataset sizes of similar experiments (for example, Nagoudi [49] used 2400 English-Arabic pairs for Arabic-English short-text similarity).

Experimentation and Results
The 2200 couples were divided into a training set of 1450 couples and a test set of the remaining 750 couples. Although the literature does not prescribe a single ideal split ratio between training and test sets, we chose a split of 65.91% training to 34.09% testing for the following reasons:
1. A training share of 60-70% is common [6,7] and was successfully used in similar experiments of comparable size and dimensionality [8].
2. Our split approximately satisfies the ratio suggested by [10] to achieve optimality, a training-to-test ratio of √p : 1, where p is the "effective number of parameters". In our case p = 4, so the suggested ratio is close to 2:1, which is close to the ratio we used.
The resulting dataset was subjected to a variety of classifiers. We note that these classifiers were selected based on the guidelines outlined in [50].
As shown in Table 4, the Random Forest classifier [51] with nine-fold cross-validation produced the best results in terms of accuracy, recall, and F-measure. The outcomes generated by the Random Forest classifier are listed in Table 5. In order to evaluate our proposed methodology and features, we repeated the experiment without our unique features; that is, we did not use (1) EndSim, (2) StartSim, (3) QWordSim, or (4) NGDSimilarity. As a result, the evaluation depended only on elementary features extracted by measures such as cosine similarity, the Jaccard distance, the Euclidean distance, and the LCS.
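A minimal sketch of this evaluation setup is shown below, assuming scikit-learn's RandomForestClassifier as a stand-in for the paper's classifier and a synthetic feature matrix (not the paper's dataset) in place of the eleven features produced by Algorithm 1:

```python
# Illustrative only: synthetic data standing in for the 11 features
# (LCS, cosine, Jaccard, ..., QWordSim) of each question couple.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 11))                    # 200 couples x 11 features
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)    # stand-in T/F similarity label

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=9, scoring="accuracy")  # nine-fold CV
print(round(scores.mean(), 3))
```

Dropping columns from X (e.g., the last four, analogous to EndSim, StartSim, QWordSim, and NGDSimilarity) and re-running the same cross-validation mirrors the ablation comparison between Tables 5-7.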
The results of the same test, excluding our topical and non-topical features, are shown in Table 6. As shown, the performance of the identical algorithms decreased significantly: (1) precision dropped by 21%, meaning that our measures have a positive effect on precision; (2) recall dropped by 19%, meaning that our measures have a positive effect on recall; (3) F1 dropped by 22%, meaning that our measures have a positive effect on F1.
Furthermore, we ran the test without using the non-topical features, only relying on topical features, including the semantic NGD measure. The results are shown in Table 7. It can be noted that there are noticeable improvements between the results shown in Tables 6 and 7, which highlight the importance of topical features, including NGD features.

Evaluation and Assessment
With an average F1 of 0.85, our method is successful in recognizing question paraphrasing and synonymy. The accuracy was enhanced due to the non-topical similarity metrics that were presented, particularly for the F-labeled questions. These findings were achieved without the use of a lexical dictionary, a semantic dictionary, or an ontological dictionary.
We infer from Table 5 that the precision for T-labeled questions is much lower than that for F-labeled questions. A possible explanation is that non-topical measures are extremely useful in deciding whether two questions are distant (for instance, the proposed rules make it clear that "How" questions cannot be similar to "Who" questions). Identifying similar questions within the same scope, by comparison, requires more than a resemblance in question types. We observed that some of the inaccuracies in T-labeled couples could be remedied by the use of a synonym lexicon (semantic network).
Our accuracy results are better than those achieved in similar Arabic [56] and non-Arabic experiments [57,58], as shown in Table 8. We acknowledge that the approaches below use different datasets and different performance metrics. However, Table 8 gives a clear indication that our approach achieves better or comparable results without using domain-dedicated dictionaries, word embeddings, or semantic networks, whereas all of the approaches below use word embeddings and/or a semantic network. Furthermore, [56,59] in particular ran experiments using datasets of similar sizes and similar performance metrics, and our system showed improved results. We think that making use of domain-specific dictionaries, word embeddings, language models, and semantic networks would enhance the outcomes even further, and this will be a primary focus of future work. However, in this experiment, we aimed to prove that good results can be achieved without expensive and rich lexical resources.

Conclusions
Using topical and non-topical data and features, this research demonstrated a unique approach for calculating the degree to which Arabic questions are similar to one another. The topical techniques relied on string, lexical, and semantic similarity measures between the Arabic texts of the questions, whereas the non-topical approaches focused on the interrogative tools that were utilized by the Arabic questions. Both of the approaches showed effectiveness in accurately detecting similarity. For semantic similarity, we used Normalized Google Distance (NGD) as it does not require a textual corpus.
We presented the results of an experiment on a dataset of 2200 couples of Arabic questions collected from the Internet. Our proposed topical and non-topical features increased the accuracy of the results significantly in comparison to a simple model that utilizes baseline features. Our experiment results were closely comparable to those of other Arabic and non-Arabic experiments, despite not using a textual corpus or a lexical/semantic network. We believe that the results can be further improved with the utilization of a multidomain Arabic lexical network, which will be part of our future work.