Topical and Non-Topical Approaches to Measure Similarity between Arabic Questions
Abstract
1. Introduction
2. Text Similarity Approaches
- Jaro–Winkler [25]: based on the Jaro distance, which measures the edit distance between strings; it is used in computational linguistics and bioinformatics;
- Needleman–Wunsch [26]: used mostly in bioinformatics;
- Longest common sub-sequence (LCS) [27]: used mainly in computational linguistics, bioinformatics, and data compression;
- Damerau–Levenshtein [28]: based on the Levenshtein distance, which is used in bioinformatics, NLP, and fraud detection.
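To make the string-based measures concrete, here is a short Python sketch of the LCS measure from the list above (an illustration, not the paper's implementation). The normalized score divides twice the LCS length by the combined string length, yielding a value in [0, 1]:

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j]
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[-1]))
        prev = curr
    return prev[-1]

def lcs_sim(a: str, b: str) -> float:
    """Normalize the LCS length into a [0, 1] similarity score."""
    return 2 * lcs_length(a, b) / (len(a) + len(b)) if a or b else 1.0
```

The same dynamic-programming scheme underlies the Needleman–Wunsch and Levenshtein measures, with different cell-update rules.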
- Bidirectional Encoder Representations from Transformers (BERT) [34];
- Word2Vec [35];
- Explicit Semantic Analysis (ESA) [36]: a vector-based statistical model;
- Hyperspace Analogue to Language (HAL) [37]: a statistical model based on word co-occurrences;
- Pointwise Mutual Information—Information Retrieval (PMI-IR) [38]: a statistical model based on a large vocabulary;
- Second-order co-occurrence pointwise mutual information (SCO-PMI) [39]: a statistical model based on a large vocabulary;
- Latent Semantic Analysis (LSA) [40]: a vector-based statistical model;
- Generalized Latent Semantic Analysis (GLSA) [41]: a vector-based statistical model;
- Normalized Google Distance (NGD) [22]: a statistical model based on a large vocabulary from the Google Search engine;
- Extracting DIStributionally similar words using COoccurrences (DISCO) [42]: a statistical model based on a large vocabulary.
3. Topical Similarity
- Text features (characters and lexical features);
- Semantic features.
3.1. Character and Lexical Similarity of Arabic Questions
Algorithm 1: Main algorithm for processing question pairs
1: QuestionAnalyzer (C [ ])
2: //C is an array of Arabic question pairs;
3: //each element of C is a pair (AQ1, AQ2)
4: //start of Algorithm 1
5: For every pair cd (AQ1, AQ2) in C
6: normq1 = Normalize (AQ1)
7: normq2 = Normalize (AQ2)
8: normqq1 = QNorm (normq1)
9: normqq2 = QNorm (normq2)
10: bowaq1 = BOW (normqq1)
11: bowaq2 = BOW (normqq2)
12: neraq1 = NER (AQ1)
13: neraq2 = NER (AQ2)
14: posaq1 = pos (normqq1)
15: posaq2 = pos (normqq2)
16: F [ d ] [ ] = {
17: lcs (normq1, normq2),
18: cosine (bowaq1, bowaq2),
19: jac (bowaq1, bowaq2),
20: euclidean (bowaq1, bowaq2),
21: jac (neraq1, neraq2),
22: cosine (neraq1, neraq2),
23: jac (posaq1, posaq2),
24: cosine (posaq1, posaq2),
25: StartingSim (bowaq1, bowaq2),
26: EndingSim (bowaq1, bowaq2),
27: QWordSim (bowaq1, bowaq2)
28: }
29: Return F
30: //end of Algorithm 1
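A minimal Python skeleton of Algorithm 1, with the pre-processing steps (normalization, NER, PoS tagging) abstracted into caller-supplied feature functions; the names below are illustrative, not taken from the paper's implementation:

```python
def question_analyzer(pairs, feature_fns):
    """Algorithm 1 skeleton: build one feature vector per question pair.

    pairs: iterable of (aq1, aq2) question strings.
    feature_fns: callables mapping (aq1, aq2) to a float feature; in the
    paper these correspond to LCS, cosine/Jaccard/Euclidean over BOW,
    NER and PoS similarities, and Algorithms 3-5.
    """
    return [[fn(q1, q2) for fn in feature_fns] for q1, q2 in pairs]

# usage with two toy feature functions
exact = lambda a, b: float(a == b)
overlap = lambda a, b: float(len(set(a.split()) & set(b.split())))
F = question_analyzer([("where is petra", "where is petra located")],
                      [exact, overlap])
```

Keeping the feature functions pluggable mirrors the structure of the pseudocode: each entry of F[d] is produced by one measure applied to the (pre-processed) pair.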
Algorithm 2: Question normalization
1: QNorm (AQ)
2: //start of Algorithm 2
3: input dictionary (nonstand, stand) [ ]
4: //each entry in the dictionary has a standard question “interrogative” form and a non-standard form
5: //entries of the dictionary are ordered by descending word count, so the longest entries are applied first
6: Foreach entry d (nonstand, stand) of dictionary [ ]
7: Replace nonstand with stand in AQ
8: Return AQ
9: //end of Algorithm 2
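Algorithm 2 amounts to an ordered dictionary-replacement pass; a Python sketch (with English placeholder phrases for readability; the actual dictionary holds Arabic interrogative forms):

```python
def qnorm(question: str, replacements: dict) -> str:
    """Algorithm 2 sketch: rewrite non-standard interrogative phrases
    into their standard form. Entries are applied longest-first (by
    word count) so multi-word phrases are not shadowed by shorter ones.
    """
    for nonstand in sorted(replacements,
                           key=lambda k: len(k.split()), reverse=True):
        question = question.replace(nonstand, replacements[nonstand])
    return question

rules = {"in what location": "where", "what location": "where"}
qnorm("in what location is Petra", rules)  # "where is Petra"
```

The longest-first ordering matters: applying "what location" before "in what location" would leave a dangling "in" in the output.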
1. bowaq1 and bowaq2: two bags of words corresponding to the normalized AQ1 and AQ2, respectively;
2. neraq1 and neraq2: two sets of named entities extracted from AQ1 and AQ2, respectively;
3. posaq1 and posaq2: two forms representing the Part-of-Speech (PoS) tagging of AQ1 and AQ2.
- Longest common subsequence for AQ1, AQ2 (after their text and question normalization);
- Cosine similarity for AQ1, AQ2 after the normalization of their bag of words (BOW);
- Jaccard similarity for AQ1, AQ2 after the normalization of their bag of words (BOW);
- Euclidean distance for AQ1, AQ2 after the normalization of their bag of words (BOW);
- Jaccard similarity for AQ1, AQ2 after the normalization of their Named Entities;
- Cosine similarity for AQ1, AQ2 after the normalization of their Named Entities;
- Jaccard similarity for AQ1, AQ2 after the Part of Speech (PoS) analysis of their normalized form;
- Cosine similarity for AQ1, AQ2 after the Part of Speech (PoS) analysis of their normalized form;
- Starting similarity measure that was calculated according to Algorithm 3;
- Ending similarity measure that was calculated according to Algorithm 4;
- Question word similarity that was calculated according to Algorithm 5.
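The set-based measures over bags of words can be sketched in Python as follows, assuming binary (presence/absence) term vectors; function names are illustrative:

```python
from math import sqrt

def jaccard(a: set, b: set) -> float:
    """|intersection| / |union| of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def cosine(a: set, b: set) -> float:
    """Cosine over binary term vectors: dot product = |a & b|."""
    return len(a & b) / sqrt(len(a) * len(b)) if a and b else 0.0

def euclidean(a: set, b: set) -> float:
    """Euclidean distance over binary vectors: the symmetric
    difference counts the dimensions where the vectors differ."""
    return sqrt(len(a ^ b))

b1 = {"where", "is", "petra", "located"}
b2 = {"where", "is", "petra"}
```

For these two example sets, Jaccard is 3/4 and the Euclidean distance is 1, since the sets differ in exactly one token.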
Algorithm 3: Starting similarity algorithm
1: StartingSim (bowaq1, bowaq2)
2: //start of Algorithm 3
3: If bowaq1[1] == bowaq2[1] && bowaq1[2] == bowaq2[2]
4: Return 1
5: Elseif bowaq1[1] == bowaq2[1]
6: Return 0
7: Else
8: Return −1
9: //end of Algorithm 3
Algorithm 4: Ending similarity algorithm
1: EndingSim (bowaq1, bowaq2)
2: //start of Algorithm 4
3: If bowaq1[n] == bowaq2[n] && bowaq1[n−1] == bowaq2[n−1]
4: Return 1
5: Elseif bowaq1[n] == bowaq2[n]
6: Return 0
7: Else
8: Return −1
9: //end of Algorithm 4
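Algorithms 3 and 4 can be written compactly in Python, assuming each question is given as an ordered token list (the pseudocode indexes the tokens positionally, so order matters):

```python
def starting_sim(t1: list, t2: list) -> int:
    """Algorithm 3: compare the first two tokens of each question."""
    if t1[:2] == t2[:2]:
        return 1       # same first two tokens
    if t1[:1] == t2[:1]:
        return 0       # same first token only
    return -1

def ending_sim(t1: list, t2: list) -> int:
    """Algorithm 4: compare the last two tokens of each question."""
    if t1[-2:] == t2[-2:]:
        return 1       # same last two tokens
    if t1[-1:] == t2[-1:]:
        return 0       # same last token only
    return -1
```

Slicing (rather than direct indexing) keeps the sketch safe for questions shorter than two tokens.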
Algorithm 5: Question type similarity
1: QWordSim (bowaq1, bowaq2)
2: //start of Algorithm 5
3: aqw1 = findaqw (bowaq1)
4: aqw2 = findaqw (bowaq2)
5: If the scope of aqw1 and aqw2 is the same
6: Return 1
7: Elseif the scopes of aqw1 and aqw2 are related
8: Return 0
9: Else
10: Return −1
11: //end of Algorithm 5
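A sketch of Algorithm 5 in Python; the scope table and the "related" relation below are illustrative stand-ins for the paper's findaqw lookup and the scopes of Table 1:

```python
# illustrative question-word -> scope table (subset of Table 1 scopes)
SCOPES = {"أين": "L", "متى": "T", "كم": "N", "كيف": "M", "لماذا": "P"}
# scope pairs treated as "related" (a hypothetical example)
RELATED = {frozenset({"L", "NE"})}

def qword_sim(tokens1, tokens2) -> int:
    """Algorithm 5 sketch: compare the scopes of the question words."""
    s1 = next((SCOPES[t] for t in tokens1 if t in SCOPES), None)
    s2 = next((SCOPES[t] for t in tokens2 if t in SCOPES), None)
    if s1 == s2:
        return 1       # same scope (or both questions lack a known word)
    if frozenset({s1, s2}) in RELATED:
        return 0       # related scopes
    return -1
```

For example, two questions opening with متى ("when") both map to scope T and score 1, while a متى/كم pair maps to unrelated scopes and scores −1.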
3.2. Semantic Similarity (Normalized Google Distance)
Algorithm 6: Normalized Google Distance similarity
1: NGDSim (AQ1, AQ2)
2: //start of Algorithm 6
3: nonQT1 = RemoveQW (AQ1)
4: nonQT2 = RemoveQW (AQ2)
5: ft = callgooglesearch (nonQT1)
6: fr = callgooglesearch (nonQT2)
7: ftr = callgooglesearch (nonQT1 + nonQT2)
8: G = callgooglesearch (“the”)
9: sim = (max (log ft, log fr) − log ftr)/(log G − min (log ft, log fr))
10: Return sim
11: //end of Algorithm 6
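Once the hit counts are retrieved, the NGD computation of Algorithm 6 reduces to a one-line formula. A Python sketch with stand-in counts (the live callgooglesearch step is omitted here):

```python
from math import log

def ngd(f1: float, f2: float, f12: float, n: float) -> float:
    """Normalized Google Distance from raw hit counts.

    f1, f2: hits for each query alone; f12: hits for the joint query;
    n: total indexed pages, approximated in Algorithm 6 by the hit
    count of a very common word such as "the".
    """
    return ((max(log(f1), log(f2)) - log(f12))
            / (log(n) - min(log(f1), log(f2))))
```

Two terms that always co-occur (f1 = f2 = f12) give a distance of 0; terms that rarely co-occur give larger values, so NGD is a dissimilarity, not a similarity, score.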
4. Non-Topical Similarity
5. Data Preparation
6. Experimentation and Results
- Many researchers have reported that a 67% training to 33% test split produces optimal results when datasets are small [9].
- Our split satisfies the ratio suggested by [10] to achieve optimality, which is √p : 1 (training to test), where p is the “effective number of parameters.” In our case, p = 4, so the optimal split should be close to 2:1, which is close to the ratio we used.
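Assuming the rule referenced from [10] is a √p : 1 training-to-test ratio (our reading; the cited source should be consulted), the split arithmetic checks out:

```python
from math import sqrt

p = 4                        # effective number of parameters (from the text)
train_to_test = sqrt(p)      # assumed form of the rule in [10]: sqrt(p) : 1
test_fraction = 1 / (1 + train_to_test)
train_fraction = 1 - test_fraction   # -> roughly 0.67, i.e. a 2:1 split
```

With p = 4 this yields a test fraction of 1/3, matching the 67%/33% split used in the experiments.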
1. EndSim;
2. StartSim;
3. QWordSim;
4. NGDSimilarity.
1. Precision dropped by 21%, meaning that our measures have a positive effect on precision;
2. Recall dropped by 19%, meaning that our measures have a positive effect on recall;
3. F1 dropped by 22%, meaning that our measures have a positive effect on F1.
7. Evaluation and Assessment
8. Conclusions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Vijaymeena, M.K.; Kavitha, K. A survey on similarity measures in text mining. Mach. Learn. Appl. Int. J. 2016, 3, 19–28. [Google Scholar]
- Sayed, A.; al Muqrishi, A. An efficient and scalable Arabic semantic search engine based on a domain specific ontology and question answering. Int. J. Web Inf. Syst. 2016, 12, 242–262. [Google Scholar] [CrossRef]
- Ye, X.; Shen, H.; Ma, X.; Bunescu, R.; Liu, C. From word embeddings to document similarities for improved information retrieval in software engineering. In Proceedings of the 38th International Conference on Software Engineering, Austin, TX, USA, 14–22 May 2016; pp. 404–415. [Google Scholar]
- Wieting, J.; Berg-Kirkpatrick, T.; Gimpel, K.; Neubig, G. Beyond BLEU: Training Neural Machine Translation with Semantic Similarity. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, 28 July–2 August 2019; pp. 4344–4355. [Google Scholar]
- Aggarwal, C.C.; Zhai, C.X. A survey of text clustering algorithms. In Mining Text Data 9781461432; Springer: Boston, MA, USA, 2012; pp. 77–128. [Google Scholar]
- Seki, K.; Ikuta, Y.; Matsubayashi, Y. News-based business sentiment and its properties as an economic index. Inf. Process. Manag. 2022, 59, 102795. [Google Scholar] [CrossRef]
- Guellil, I.; Adeel, A.; Azouaou, F.; Chennoufi, S.; Maafi, H.; Hamitouche, T. Detecting hate speech against politicians in Arabic community on social media. Int. J. Web Inf. Syst. 2020, 16, 295–313. [Google Scholar] [CrossRef]
- Daoud, M.; Daoud, D. Sentimental event detection from Arabic tweets. Int. J. Bus. Intell. Data Min. 2020, 17, 471–492. [Google Scholar] [CrossRef]
- Wang, Y.; Han, L.; Qian, Q.; Xia, J.; Li, J. Personalized Recommendation via Multi-dimensional Meta-paths Temporal Graph Probabilistic Spreading. Inf. Process. Manag. 2022, 59, 102787. [Google Scholar] [CrossRef]
- Han, M.; Zhang, X.; Yuan, X.; Jiang, J.; Yun, W.; Gao, C. A survey on the techniques, applications, and performance of short text semantic similarity. Concurr. Comput. Pract. Exp. 2021, 33, e5971. [Google Scholar] [CrossRef]
- Levshina, N. Corpus-based typology: Applications, challenges and some solutions. Linguist. Typology 2021, 26, 129–160. [Google Scholar] [CrossRef]
- Alwaneen, T.H.; Azmi, A.M.; Aboalsamh, H.A.; Cambria, E.; Hussain, A. Arabic question answering system: A survey. Artif. Intell. Rev. 2021, 55, 207–253. [Google Scholar] [CrossRef]
- Shumanov, M.; Johnson, L. Making conversations with chatbots more personalized. Comput. Human Behav. 2021, 117, 106627. [Google Scholar] [CrossRef]
- Gruber, T.R.; Brigham, C.D.; Keen, D.S.; Novick, G.; Phipps, B.S. Using Context Information to Facilitate Processing of Commands in A Virtual Assistant; United States Patent and Trademark Office: Washington, DC, USA, 2018.
- Suhaili, S.M.; Salim, N.; Jambli, M.N. Service chatbots: A systematic review. Expert Syst. Appl. 2021, 184, 115461. [Google Scholar] [CrossRef]
- Jurczyk, T.; Deshmane, A.; Choi, J.D. Analysis of Wikipedia-based Corpora for Question Answering. arXiv 2018, arXiv:1801.02073. [Google Scholar]
- Hamza, A.; En-Nahnahi, N.; Zidani, K.A.; Ouatik, S.E. An arabic question classification method based on new taxonomy and continuous distributed representation of words. J. King Saud Univ. Comput. Inf. Sci. 2020, 33, 218–224. [Google Scholar] [CrossRef]
- Daoud, M. Building Arabic polarized lexicon from rated online customer reviews. In Proceedings of the 2017 International Conference on New Trends in Computing Sciences, ICTCS 2017, Amman, Jordan, 11–13 October 2017; pp. 241–246. [Google Scholar]
- Silveira, C.R.; Santos, M.T.P.; Ribeiro, M.X. A flexible architecture for the pre-processing of solar satellite image time series data—The SETL architecture. Int. J. Data Min. Model. Manag. 2019, 11, 129–143. [Google Scholar]
- Daoud, D.; Daoud, M. Extracting terminological relationships from historical patterns of social media terms. In Lecture Notes in Computer Science (including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); 9623 LNCS; Springer Science+Business Media: Berlin, Germany, 2018; pp. 218–229. [Google Scholar]
- Grosan, C.; Abraham, A. Rule-Based Expert Systems. In Intelligent Systems Reference Library; Springer International Publishing: Cham, Switzerland, 2011; pp. 149–185. [Google Scholar]
- Azad, H.K.; Deepak, A. Query expansion techniques for information retrieval: A survey. Inf. Process. Manag. 2019, 56, 1698–1735. [Google Scholar] [CrossRef] [Green Version]
- Prakoso, D.W.; Abdi, A.; Amrit, C. Short text similarity measurement methods: A review. Soft Comput. 2021, 25, 4699–4723. [Google Scholar] [CrossRef]
- Tien, N.H.; Le, N.M.; Tomohiro, Y.; Tatsuya, I. Sentence modeling via multiple word embeddings and multi-level comparison for semantic textual similarity. Inf. Process. Manag. 2019, 56, 102090. [Google Scholar] [CrossRef] [Green Version]
- Ma, D.; Zhang, S.; Kong, F.; Cahyono, S.C. Comparison of document similarity measurements in scientific writing using Jaro-Winkler Distance method and Paragraph Vector method. IOP Conf. Ser. Mater. Sci. Eng. 2019, 662, 052016. [Google Scholar]
- Perumalla, S.R.; Eedi, H. Needleman–wunsch algorithm using multi-threading approach. In Advances in Intelligent Systems and Computing; Springer: Berlin/Heidelberg, Germany, 2020; Volume 1090, pp. 289–300. [Google Scholar]
- Abdeljaber, H.A. Automatic Arabic Short Answers Scoring Using Longest Common Subsequence and Arabic WordNet. IEEE Access 2021, 9, 76433–76445. [Google Scholar] [CrossRef]
- Zhao, C.; Sahni, S. String correction using the Damerau-Levenshtein distance. BMC Bioinform. 2019, 20, 277. [Google Scholar] [CrossRef] [PubMed]
- Wang, J.; Dong, Y. Measurement of Text Similarity: A Survey. Information 2020, 11, 421. [Google Scholar] [CrossRef]
- Hamza, A.; Ouatik, S.e.; Zidani, K.A.; En-Nahnahi, N. Arabic duplicate questions detection based on contextual representation, class label matching, and structured self attention. J. King Saud Univ. Comput. Inf. Sci. 2020, 34, 3758–3765. [Google Scholar] [CrossRef]
- Park, K.; Hong, J.S.; Kim, W. A Methodology Combining Cosine Similarity with Classifier for Text Classification. Appl. Artif. Intell. 2020, 34, 396–411. [Google Scholar] [CrossRef]
- Wahyuningsih, T.; Henderi, H.; Winarno, W. Text Mining an Automatic Short Answer Grading (ASAG), Comparison of Three Methods of Cosine Similarity, Jaccard Similarity and Dice’s Coefficient. J. Appl. Data Sci. 2021, 2, 45–54. [Google Scholar] [CrossRef]
- Hasan, A.M.; Noor, N.M.; Rassem, T.H.; Noah, S.A.M.; Hasan, A.M. A Proposed Method Using the Semantic Similarity of WordNet 3.1 to Handle the Ambiguity to Apply in Social Media Text. Lect. Notes Electr. Eng. 2020, 621, 471–483. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. [Google Scholar]
- Jatnika, D.; Bijaksana, M.A.; Suryani, A.A. Word2Vec Model Analysis for Semantic Similarities in English Words. Procedia Comput. Sci. 2019, 157, 160–167. [Google Scholar] [CrossRef]
- Sangeetha, M.; Keerthika, P.; Devendran, K.; Sridhar, S.; Raagav, S.S.; Vigneshwar, T. Compute Query and Document Similarity using Explicit Semantic Analysis. In Proceedings of the 2022 6th International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 29–31 March 2022; pp. 761–766. [Google Scholar]
- Baruah, N.; Gogoi, A.; Sarma, S.K.; Borah, R. Utilizing Corpus Statistics for Assamese Word Sense Disambiguation. In Advances in Computing and Network Communications; Springer: Singapore, 2021; Volume 736, pp. 271–283. [Google Scholar]
- Ahmad, R.; Ahmad, T.; Pal, B.L.; Malviya, S. Approaches for Semantic Relatedness Computation for Big Data. In Proceedings of the 2nd International Conference on Advanced Computing and Software Engineering (ICACSE) 2019, Sultanpur, India, 8–9 February 2019. [Google Scholar]
- Tabassum, N.; Ahmad, T. Extracting Users’ Explicit Preferences from Free-text using Second Order Co-occurrence PMI in Indian Matrimony. Procedia Comput. Sci. 2020, 167, 392–402. [Google Scholar] [CrossRef]
- Kim, S.; Park, H.; Lee, J. Word2vec-based latent semantic analysis (W2V-LSA) for topic modeling: A study on blockchain technology trend analysis. Expert Syst. Appl. 2020, 152, 113401. [Google Scholar] [CrossRef]
- Mittal, H.; Devi, M.S. Subjective Evaluation: A Comparison of Several Statistical Techniques. Appl. Artif. Intell. 2018, 32, 85–95. [Google Scholar] [CrossRef]
- Prasetya, D.D.; Wibawa, A.P.; Hirashima, T. The performance of text similarity algorithms. Int. J. Adv. Intell. Inform. 2018, 4, 63–69. [Google Scholar] [CrossRef] [Green Version]
- McCrae, J.P.; Rademaker, A.; Rudnicka, E.; Bond, F. English WordNet 2020: Improving and Extending a WordNet for English using an Open-Source Methodology. In Proceedings of the LREC 2020 Workshop on Multimodal Wordnets (MMW2020), Marseille, France, 11 May 2020; pp. 14–19. [Google Scholar]
- Abdelali, A.; Darwish, K.; Durrani, N.; Mubarak, H. Farasa: A Fast and Furious Segmenter for Arabic; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 11–16. [Google Scholar]
- Cer, D.; Diab, M.; Agirre, E.; Lopez-Gazpio, I.; Specia, L. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, BC, Canada, 3–4 August 2017; pp. 1–14. [Google Scholar]
- Nakov, P.; Màrquez, L.; Moschitti, A.; Magdy, W.; Mubarak, H.; Freihat, A.A.; Glass, J.; Randeree, B. SemEval-2016 Task 3: Community Question Answering. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, CA, USA, 16–17 June 2016; pp. 525–545. [Google Scholar]
- Chen, X.; Zeynali, A.; Camargo, C.; Flöck, F.; Gaffney, D.; Grabowicz, P.; Hale, S.; Jurgens, D.; Samory, M. SemEval-2022 Task 8: Multilingual news article similarity. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), Seattle, WA, USA, 14–15 July 2022; pp. 1094–1106. [Google Scholar]
- Mihaylova, T.; Karadzhov, G.; Atanasova, P.; Baly, R.; Mohtarami, M.; Nakov, P. SemEval-2019 Task 8: Fact Checking in Community Question Answering Forums. In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, MN, USA, 6–7 June 2019; pp. 860–869. [Google Scholar]
- Nagoudi, E.M.B.; Ferrero, J.; Schwab, D.; Cherroun, H. Word Embedding-Based Approaches for Measuring Semantic Similarity of Arabic-English Sentences. Commun. Comput. Inf. Sci. 2018, 782, 19–33. [Google Scholar]
- Kadhim, A.I. Survey on supervised machine learning techniques for automatic text classification. Artif. Intell. Rev. 2019, 52, 273–292. [Google Scholar] [CrossRef]
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef] [Green Version]
- Alsultanny, Y.A. Machine Learning by Data Mining REPTree and M5P for Predicating Novel Information for PM10. Cloud Comput. Data Sci. 2020, 1, 40–48. [Google Scholar] [CrossRef]
- Wang, F.; Li, Z.; He, F.; Wang, R.; Yu, W.; Nie, F. Feature Learning Viewpoint of Adaboost and a New Algorithm. IEEE Access 2019, 7, 149890–149899. [Google Scholar] [CrossRef]
- Triayudi, A.; Widyarto, W.O. Comparison J48 And Naïve Bayes Methods in Educational Analysis. J. Phys. Conf. Ser. 2021, 1933, 012062. [Google Scholar] [CrossRef]
- Kurani, A.; Doshi, P.; Vakharia, A.; Shah, M. A Comprehensive Comparative Study of Artificial Neural Network (ANN) and Support Vector Machines (SVM) on Stock Forecasting. Ann. Data Sci. 2021, 2021, 1–26. [Google Scholar] [CrossRef]
- Einea, O.; Elnagar, A. Predicting semantic textual similarity of arabic question pairs using deep learning. In Proceedings of the 2019 IEEE/ACS 16th International Conference on Computer Systems and Applications (AICCSA), Abu Dhabi, United Arab Emirates, 3–7 November 2019. [Google Scholar]
- Nakov, P.; Hoogeveen, D.; Màrquez, L.; Moschitti, A.; Mubarak, H.; Baldwin, T.; Verspoor, K. SemEval-2017 Task 3: Community Question Answering. arXiv 2017, arXiv:1912.00730. [Google Scholar]
- Galbraith, B.V.; Pratap, B.; Shank, D. Talla at SemEval-2017 Task 3: Identifying Similar Questions Through Paraphrase Detection. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, BC, Canada, 3–4 August 2017. [Google Scholar]
- Franco-Salvador, M.; Kar, S.; Solorio, T.; Rosso, P. UH-PRHLT at SemEval-2016 Task 3: Combining Lexical and Semantic-based Features for Community Question Answering. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, CA, USA, 16–17 June 2016; pp. 814–821. [Google Scholar]
- Wu, H.; Huang, H.; Jian, P.; Guo, Y.; Su, C. BIT at SemEval-2017 Task 1: Using Semantic Information Space to Evaluate Semantic Textual Similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, BC, Canada, 3–4 August 2017; pp. 77–84. [Google Scholar]
ID | Scope | Answer | Formal Interrogative Form | Paraphrased Words |
---|---|---|---|---|
L | Factoid-Fact | Location | أين Where | Arabic: في أيّ مكان English: In what/which location Arabic: ما موقع English: What is the location Arabic: في اي حي/قرية/شارع English: in what/which neighborhood/town/street |
N | Factoid-Fact | Numeric value | كم How many How much | Arabic: ما عدد English: what is the count Arabic: ما قياس English: what is the measurement Arabic: ما هو طول English: what is the length |
T | Factoid-Fact | Time | متى, أيّان “when” | Arabic: ما تاريخ English: what is the date Arabic: في ايّ وقت English: at what time |
NE | Factoid-Fact | Named Entity | لمن Whose | Arabic: الى من English: for whom Arabic: من هو English: Who is Arabic: لاي English: For whom |
NED | Definition | Named Entity | من, ما What | Arabic: ما تعريف English: what is the definition Arabic: من هو English: Who is |
M | Method | Method | كيف How | Arabic: ما هي طريقة English: What is the method Arabic: ما هو وصفة English: What is the recipe Arabic: ما الخطوات English: What are the steps |
P | Purpose | Purpose | لماذا Why | Arabic: ما هو السبب English: what is the reason Arabic: ما المسبب English: What causes |
C | Cause | Cause | ماذا What | Arabic: ما الذي English: What |
L | List | List | اذكر, عدد List | |
YN | Yes/No | Yes/No | هل Is/was/are… | Arabic: ء English: interrogative Hamzah |
Scope | Number of Questions |
---|---|
T | 494 |
L | 446 |
N | 389 |
NE | 152 |
NED | 311 |
M | 440 |
P | 271 |
C | 254 |
L | 108 |
YN | 517 |
Name | Language | Task | Size |
---|---|---|---|
SemEval-2017 Task 1 [45] | Multilingual, including Arabic | Semantic Textual Similarity | 1101 Arabic pairs |
SemEval-2016 Task 3, subtask B [46] | English | Question Similarity | 317 original, 1999 Q-Q pairs |
SemEval-2022 Task 8 [47] | Multilingual, including Arabic | News Similarity | 548 Arabic Pairs |
SemEval-2019 Task 8 [48] | English | Question Answering | 2310 questions |
Nagoudi [49] | Arabic–English | Short Text similarity | 2400 English-Arabic pairs |
Classifier | Top Average Precision |
---|---|
Random Forest | 0.84 |
REPTree [52] | 0.82 |
ADABoost [53] | 0.80 |
J48 [54] | 0.83 |
Naïve Bayes [54] | 0.69 |
SVM [55] | 0.75 |
ANN (4 dense layers, 20 epochs) [55] | 0.81 |
Class | Precision | Recall | F1 |
---|---|---|---|
T | 84% | 59% | 70% |
F | 87% | 96% | 91% |
Average | 84% | 85% | 85% |
Class | Precision | Recall | F1 |
---|---|---|---|
T | 39% | 32% | 34% |
F | 72% | 80% | 76% |
Average | 63% | 66% | 63% |
Class | Precision | Recall | F1 |
---|---|---|---|
T | 53% | 45% | 50% |
F | 77% | 80% | 76% |
Average | 63% | 66% | 65% |
Name | Language | Task | Approach | Results |
---|---|---|---|---|
[56] | Arabic | Question similarity | Word embedding, and Deep learning | Accuracy, 58%, 77% on two different experiments on two different datasets |
[59] | English | Question Similarity | Semantic networks: BabelNet, FrameNet | Average Precision, 76.7% |
[49] | English | Question Similarity | Word embedding, machine translation | Accuracy, based on human judgment, 76% |
[60] | Arabic | Question Similarity | Semantic networks: WordNet, and word embedding | Accuracy, based on human judgment, 75% |
Share and Cite
Daoud, M. Topical and Non-Topical Approaches to Measure Similarity between Arabic Questions. Big Data Cogn. Comput. 2022, 6, 87. https://doi.org/10.3390/bdcc6030087