Enhancing Attention’s Explanation Using Interpretable Tsetlin Machine
Abstract
1. Introduction
2. Related Works
3. Proposed Architecture: TM Initialized Attention Model
3.1. Clause Score from Tsetlin Machine Architecture
3.2. Attention Based Neural Network
4. Experiments and Results
- MR is a movie review dataset for binary sentiment classification with just one sentence per review [40]. The corpus contains 5331 positive and 5331 negative reviews. In this study, we used the training/test split from [41] (https://github.com/mnqu/PTE/tree/master/data/mr, accessed on 24 February 2022).
- Reuters: The Reuters-21578 dataset has two subsets, R8 and R52 (all-terms version). R8 has eight categories, with 5485 training and 2189 test documents; R52 has 52 categories, with 6532 training and 2568 test documents.
- TF-IDF+LR: A bag-of-words model with term frequency-inverse document frequency weighting; the classifier is logistic regression (a minimal sketch of this baseline follows this list).
- CNN: CNN-rand, which uses randomly initialized word embeddings [45].
- LSTM: The LSTM model we employ is from [46]; it represents the entire text with its last hidden state. We evaluate both variants, with and without pretrained word embeddings.
- Bi-LSTM: Bi-directional LSTMs are widely used for text classification; they model both forward and backward context.
- PV-DBOW: A paragraph vector model in which word order is not considered; a logistic regression classifier is trained on the resulting document embeddings [47].
- PV-DM: A paragraph vector model that takes word order into account [47].
- fastText: This baseline uses the average of the word embeddings provided by fastText as document embedding. The embedding is then fed to a linear classifier [48].
- SWEM: SWEM applies simple pooling techniques over the word embeddings to obtain a document embedding [49].
- Graph-CNN-C: A graph CNN model uses convolutions over a word embedding similarity graph [50], employing a Chebyshev filter.
- Tsetlin Machine: A Tsetlin Machine trained on a simple bag-of-words representation, without feature enhancement.
- Bi-GRU+Attn: Bi-directional GRUs are widely used for text classification. We compare our model with a Bi-GRU fed with pretrained word embeddings and an attention layer on top.
- TM+Bi-GRU+Attn: The proposed model, a Bi-GRU with pretrained word embeddings whose input layer is additionally initialized with pretrained TM clause scores (a hedged architecture sketch follows this list).
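As a concrete illustration of the TF-IDF+LR baseline above, the pipeline can be sketched with scikit-learn. This is a minimal sketch; the vectorizer options and solver settings are assumptions, not the authors' exact configuration.

```python
# Hedged sketch of the TF-IDF+LR baseline: bag-of-words counts re-weighted by
# inverse document frequency, classified with logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

def tfidf_lr_accuracy(train_texts, train_labels, test_texts, test_labels):
    model = make_pipeline(
        TfidfVectorizer(lowercase=True),      # assumed preprocessing
        LogisticRegression(max_iter=1000),    # assumed solver settings
    )
    model.fit(train_texts, train_labels)
    return accuracy_score(test_labels, model.predict(test_texts))
```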
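The following Keras sketch illustrates one plausible realization of the proposed TM+Bi-GRU+Attn architecture: pretrained GloVe embeddings concatenated with a per-token TM clause score feature, a Bi-GRU encoder, and an additive attention layer. Layer sizes, the dropout rate, and the exact way the TM score is injected are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of a TM-score-augmented Bi-GRU with additive attention.
import tensorflow as tf
from tensorflow.keras import Model, layers

MAX_LEN, VOCAB, EMB_DIM, GRU_UNITS, NUM_CLASSES = 100, 20000, 300, 64, 2

def build_tm_bigru_attention(glove_matrix):
    tokens = layers.Input(shape=(MAX_LEN,), dtype="int32", name="tokens")
    tm_scores = layers.Input(shape=(MAX_LEN, 1), name="tm_clause_scores")  # pretrained TM score per token

    # Frozen pretrained GloVe embeddings, concatenated with the TM clause score feature.
    emb = layers.Embedding(VOCAB, EMB_DIM, weights=[glove_matrix], trainable=False)(tokens)
    x = layers.Concatenate()([emb, tm_scores])

    # Bi-directional GRU encoder with dropout regularization.
    h = layers.Bidirectional(layers.GRU(GRU_UNITS, return_sequences=True))(x)
    h = layers.Dropout(0.5)(h)

    # Additive attention: score each time step, softmax over tokens, weighted sum.
    u = layers.Dense(64, activation="tanh")(h)
    scores = layers.Dense(1, use_bias=False)(u)
    weights = layers.Softmax(axis=1)(scores)
    context = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([h, weights])

    out = layers.Dense(NUM_CLASSES, activation="softmax")(context)
    model = Model([tokens, tm_scores], out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model
```

In this sketch, the TM clause scores would be produced offline by a pretrained Tsetlin Machine (Section 3.1) and aligned with the token sequence before the attention model is trained.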
4.1. Performance Comparison with State-of-the-Arts
4.2. Explainability
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Zhang, Y.; Marshall, I.J.; Wallace, B.C. Rationale-Augmented Convolutional Neural Networks for Text Classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; Volume 2016, pp. 795–804. [Google Scholar]
- Wang, W.; Yang, N.; Wei, F.; Chang, B.; Zhou, M. Gated Self-Matching Networks for Reading Comprehension and Question Answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; Volume 1, pp. 189–198. [Google Scholar]
- Lakkaraju, H.; Bach, S.H.; Leskovec, J. Interpretable Decision Sets: A Joint Framework for Description and Prediction. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16), San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 1675–1684. [Google Scholar]
- Mahoney, C.J.; Zhang, J.; Huber-Fliflet, N.; Gronvall, P.; Zhao, H. A Framework for Explainable Text Classification in Legal Document Review. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; pp. 1858–1867. [Google Scholar]
- Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 13–17 August 2016. [Google Scholar]
- Sundararajan, M.; Taly, A.; Yan, Q. Axiomatic Attribution for Deep Networks. In Proceedings of the 34th International Conference on Machine Learning (ICML’17), Sydney, Australia, 6–11 August 2017; JMLR: Volume 70, pp. 3319–3328. [Google Scholar]
- Camburu, O.M.; Rocktäschel, T.; Lukasiewicz, T.; Blunsom, P. e-SNLI: Natural Language Inference with Natural Language Explanations. In Proceedings of the NeurIPS, Montréal, QC, Canada, 3–8 December 2018. [Google Scholar]
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2015, arXiv:1409.0473. [Google Scholar]
- Parikh, A.; Täckström, O.; Das, D.; Uszkoreit, J. A Decomposable Attention Model for Natural Language Inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 2249–2255. [Google Scholar]
- Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; Volume 1, pp. 4171–4186. [Google Scholar]
- Simonyan, K.; Vedaldi, A.; Zisserman, A. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. arXiv 2014, arXiv:1312.6034. [Google Scholar]
- Jain, S.; Wallace, B.C. Attention is not Explanation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; Volume 1, pp. 3543–3556. [Google Scholar]
- Wiegreffe, S.; Pinter, Y. Attention is not not Explanation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 11–20. [Google Scholar]
- Lipton, Z.C. The Mythos of Model Interpretability. Queue 2018, 16, 31–57. [Google Scholar] [CrossRef]
- Granmo, O.C. The Tsetlin Machine—A Game Theoretic Bandit Driven Approach to Optimal Pattern Recognition with Propositional Logic. arXiv 2018, arXiv:1804.01508. [Google Scholar]
- Granmo, O.C.; Glimsdal, S.; Jiao, L.; Goodwin, M.; Omlin, C.W.; Berge, G.T. The Convolutional Tsetlin Machine. arXiv 2019, arXiv:1905.09688. [Google Scholar]
- Yadav, R.K.; Jiao, L.; Granmo, O.C.; Goodwin, M. Human-Level Interpretable Learning for Aspect-Based Sentiment Analysis. In Proceedings of the AAAI, Vancouver, BC, Canada, 2–9 February 2021. [Google Scholar]
- Bhattarai, B.; Granmo, O.C.; Jiao, L. Explainable Tsetlin Machine framework for fake news detection with credibility score assessment. arXiv 2021, arXiv:2105.09114. [Google Scholar]
- Abeyrathna, K.D.; Bhattarai, B.; Goodwin, M.; Gorji, S.R.; Granmo, O.C.; Jiao, L.; Saha, R.; Yadav, R.K. Massively Parallel and Asynchronous Tsetlin Machine Architecture Supporting Almost Constant-Time Scaling. In Proceedings of the ICML, PMLR, Online, 2021; pp. 10–20. [Google Scholar]
- Yadav, R.K.; Jiao, L.; Granmo, O.C.; Goodwin, M. Interpretability in Word Sense Disambiguation using Tsetlin Machine. In Proceedings of the 13th International Conference on Agents and Artificial Intelligence (ICAART), Vienna, Austria, 4–6 February 2021. [Google Scholar]
- Yadav, R.K.; Jiao, L.; Granmo, O.C.; Goodwin, M. Enhancing Interpretable Clauses Semantically using Pretrained Word Representation. In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, Punta Cana, Dominican Republic, 11 November 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 265–274. [Google Scholar]
- Zaidan, O.; Eisner, J.; Piatko, C. Using Annotator Rationales to Improve Machine Learning for Text Categorization. In Proceedings of the Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, Rochester, NY, USA, 22–27 April 2007; Association for Computational Linguistics: Stroudsburg, PA, USA, 2007; pp. 260–267. [Google Scholar]
- Cho, K.; van Merrienboer, B.; Gülçehre, Ç.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the EMNLP, Doha, Qatar, 25–29 October 2014. [Google Scholar]
- Pennington, J.; Socher, R.; Manning, C.D. Glove: Global Vectors for Word Representation. In Proceedings of the EMNLP, Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
- Doshi-Velez, F.; Kim, B. Towards A Rigorous Science of Interpretable Machine Learning. arXiv 2017, arXiv:1702.08608. [Google Scholar]
- Liu, H.; Yin, Q.; Wang, W.Y. Towards Explainable NLP: A Generative Explanation Framework for Text Classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 5570–5581. [Google Scholar]
- Freitas, A.A. Comprehensible classification models: A position paper. SIGKDD Explor. 2014, 15, 1–10. [Google Scholar] [CrossRef]
- Zhang, Q.; Wu, Y.N.; Zhu, S.C. Interpretable Convolutional Neural Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8827–8836. [Google Scholar]
- Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.C.; Salakhutdinov, R.; Zemel, R.S.; Bengio, Y. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of the ICML, Lille, France, 6–11 July 2015. [Google Scholar]
- Bao, Y.; Chang, S.; Yu, M.; Barzilay, R. Deriving Machine Attention from Human Rationales. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 1903–1913. [Google Scholar]
- Mohankumar, A.K.; Nema, P.; Narasimhan, S.; Khapra, M.M.; Srinivasan, B.V.; Ravindran, B. Towards Transparent and Explainable Attention Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Seattle, WA, USA, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 4206–4216. [Google Scholar]
- McDonnell, T.; Lease, M.; Kutlu, M.; Elsayed, T. Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments. In Proceedings of the HCOMP, Austin, TX, USA, 30 October–3 November 2016. [Google Scholar]
- Zhang, X.; Jiao, L.; Granmo, O.C.; Goodwin, M. On the Convergence of Tsetlin Machines for the IDENTITY- and NOT Operators. IEEE Trans. Pattern Anal. Mach. Intell. 2021. [Google Scholar] [CrossRef] [PubMed]
- Sharma, J.; Yadav, R.; Granmo, O.C.; Jiao, L. Human Interpretable AI: Enhancing Tsetlin Machine Stochasticity with Drop Clause. arXiv 2021, arXiv:2105.14506. [Google Scholar]
- Lei, J.; Wheeldon, A.; Shafik, R.; Yakovlev, A.; Granmo, O.C. From Arithmetic to Logic Based AI: A Comparative Analysis of Neural Networks and Tsetlin Machine. In Proceedings of the 27th IEEE International Conference on Electronics, Circuits and Systems (ICECS 2020), Online, 2020. [Google Scholar]
- Lei, J.; Rahman, T.; Shafik, R.; Wheeldon, A.; Yakovlev, A.; Granmo, O.C.; Kawsar, F.; Mathur, A. Low-Power Audio Keyword Spotting Using Tsetlin Machines. J. Low Power Electron. Appl. 2021, 11, 18. [Google Scholar] [CrossRef]
- Mikolov, T.; Karafiát, M.; Burget, L.; Černocký, J.; Khudanpur, S. Recurrent neural network based language model. In Proceedings of the Interspeech, Makuhari, Japan, 26–30 September 2010. [Google Scholar]
- Chung, J.; Gülçehre, Ç.; Cho, K.; Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar]
- Pang, B.; Lee, L. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. In Proceedings of the Association for Computational Linguistics, Ann Arbor, MI, USA, 26–31 July 2005; Association for Computational Linguistics: Stroudsburg, PA, USA, 2005; pp. 115–124. [Google Scholar]
- Tang, J.; Qu, M.; Mei, Q. PTE: Predictive Text Embedding through Large-Scale Heterogeneous Text Networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13 August 2015; Association for Computing Machinery: New York, NY, USA, 2015; pp. 1165–1174. [Google Scholar]
- Chollet, F.; et al. Keras. 2015. Available online: https://keras.io (accessed on 1 March 2022).
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2015, arXiv:1412.6980. [Google Scholar]
- Srivastava, N.; Hinton, G.E.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
- Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 1746–1751. [Google Scholar]
- Liu, P.; Qiu, X.; Huang, X. Recurrent Neural Network for Text Classification with Multi-Task Learning. In Proceedings of the IJCAI, New York, NY, USA, 9–15 July 2016; pp. 2873–2879. [Google Scholar]
- Le, Q.; Mikolov, T. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on Machine Learning, PMLR, Beijing, China, 21–26 June 2014; Volume 32, pp. 1188–1196. [Google Scholar]
- Joulin, A.; Grave, E.; Bojanowski, P.; Mikolov, T. Bag of Tricks for Efficient Text Classification. In Proceedings of the EACL, Valencia, Spain, 3–7 April 2017; ACL: Stroudsburg, PA, USA, 2017; Volume 2, pp. 427–431. [Google Scholar]
- Shen, D.; Wang, G.; Wang, W.; Min, M.R.; Su, Q.; Zhang, Y.; Li, C.; Henao, R.; Carin, L. Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms. In Proceedings of the ACL, Melbourne, Australia, 15–20 July 2018; ACL: Stroudsburg, PA, USA, 2018; Volume 1, pp. 440–450. [Google Scholar]
- Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; Volume 29. [Google Scholar]
Type I feedback (s denotes the specificity hyperparameter):

Action | Probability | Clause = 1, Literal = 1 | Clause = 1, Literal = 0 | Clause = 0, Literal = 1 | Clause = 0, Literal = 0
---|---|---|---|---|---
Include Literal | P(Reward) | (s − 1)/s | NA | 0 | 0
 | P(Inaction) | 1/s | NA | (s − 1)/s | (s − 1)/s
 | P(Penalty) | 0 | NA | 1/s | 1/s
Exclude Literal | P(Reward) | 0 | 1/s | 1/s | 1/s
 | P(Inaction) | 1/s | (s − 1)/s | (s − 1)/s | (s − 1)/s
 | P(Penalty) | (s − 1)/s | 0 | 0 | 0
Type II feedback:

Action | Probability | Clause = 1, Literal = 1 | Clause = 1, Literal = 0 | Clause = 0, Literal = 1 | Clause = 0, Literal = 0
---|---|---|---|---|---
Include Literal | P(Reward) | 0 | NA | 0 | 0
 | P(Inaction) | 1.0 | NA | 1.0 | 1.0
 | P(Penalty) | 0 | NA | 0 | 0
Exclude Literal | P(Reward) | 0 | 0 | 0 | 0
 | P(Inaction) | 1.0 | 0 | 1.0 | 1.0
 | P(Penalty) | 0 | 1.0 | 0 | 0
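The two tables above give the Type I and Type II feedback probabilities that drive Tsetlin Automaton learning. Below is a minimal, illustrative sketch of how Type I feedback could be sampled for a single literal, using the standard (s − 1)/s and 1/s probabilities shown in the table; the function name and interface are ours, not the authors' implementation.

```python
# Illustrative sampling of Type I feedback for one literal, following the table above.
import random

def type_i_feedback(include, clause_value, literal_value, s):
    """Return 'reward', 'inaction', or 'penalty' for one Tsetlin Automaton decision."""
    if include:
        if clause_value == 1 and literal_value == 1:
            p_reward, p_penalty = (s - 1) / s, 0.0
        elif clause_value == 1 and literal_value == 0:
            return "inaction"  # NA in the table: an included false literal cannot yield a true clause
        else:  # clause_value == 0
            p_reward, p_penalty = 0.0, 1 / s
    else:  # literal is excluded from the clause
        if clause_value == 1 and literal_value == 1:
            p_reward, p_penalty = 0.0, (s - 1) / s
        else:
            p_reward, p_penalty = 1 / s, 0.0
    r = random.random()
    if r < p_reward:
        return "reward"
    if r < p_reward + p_penalty:
        return "penalty"
    return "inaction"
```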
Models | MR | R8 | R52 |
---|---|---|---|
TF-IDF+LR | 74.59 | 93.74 | 86.95 |
CNN | 74.98 | 94.02 | 85.37 |
LSTM | 75.06 | 93.68 | 85.54 |
Bi-LSTM | 77.68 | 96.31 | 90.54 |
PV-DBOW | 61.09 | 85.87 | 78.29 |
PV-DM | 59.47 | 52.07 | 44.92 |
SWEM | 76.65 | 95.32 | 92.94 |
Graph-CNN-C | 77.22 | 96.99 | 92.75 |
Tsetlin Machine | 75.14 | 96.16 | 84.62 |
Bi-GRU+Attn | 77.15 | 96.20 | 94.85 |
TM+Bi-GRU+Attn | 77.95 | 97.53 | 95.71 |
Metric | MR | R8 | R52 |
---|---|---|---|
Precision (macro) | 73.22 | 86.12 | 79.18 |
Recall (macro) | 70.42 | 87.44 | 75.44 |
F-Score (macro) | 69.32 | 88.32 | 76.66 |
Precision (micro) | 70.42 | 94.82 | 85.28 |
Recall (micro) | 70.42 | 94.82 | 85.28 |
F-Score (micro) | 70.42 | 94.82 | 85.28 |
Precision (weighted) | 73.22 | 95.02 | 85.51 |
Recall (weighted) | 70.42 | 95.12 | 85.12 |
F-Score (weighted) | 69.32 | 95.02 | 85.28 |
Metric | MR | R8 | R52 |
---|---|---|---|
Precision (macro) | 75.21 | 88.69 | 82.32 |
Recall (macro) | 72.20 | 90.66 | 79.26 |
F-Score (macro) | 71.34 | 89.26 | 79.87 |
Precision (micro) | 72.20 | 95.52 | 95.63 |
Recall (micro) | 72.20 | 95.52 | 95.63 |
F-Score (micro) | 72.20 | 95.52 | 95.63 |
Precision (weighted) | 75.21 | 95.60 | 95.33 |
Recall (weighted) | 72.20 | 95.23 | 95.63 |
F-Score (weighted) | 71.34 | 95.49 | 95.34 |
Metric | MR | R8 | R52 |
---|---|---|---|
Precision (macro) | 75.63 | 94.70 | 83.81 |
Recall (macro) | 74.62 | 93.32 | 80.23 |
F-Score (macro) | 74.61 | 93.39 | 80.67 |
Precision (micro) | 74.62 | 96.52 | 96.82 |
Recall (micro) | 74.62 | 96.52 | 96.82 |
F-Score (micro) | 74.62 | 96.52 | 96.85 |
Precision (weighted) | 75.63 | 96.58 | 96.51 |
Recall (weighted) | 74.62 | 96.52 | 96.52 |
F-Score (weighted) | 74.61 | 96.51 | 96.49 |
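The macro-, micro-, and weighted-averaged scores reported in the tables above can be reproduced with scikit-learn; note that for single-label classification the micro-averaged precision, recall, and F-score coincide, which is why the micro rows repeat a single value. Below is a minimal sketch, with y_true and y_pred as placeholders for the test labels and a model's predictions on a given dataset.

```python
# Sketch of the averaged metrics used in the tables above.
from sklearn.metrics import precision_recall_fscore_support

def report_averaged_metrics(y_true, y_pred):
    # Compute precision, recall, and F-score under the three averaging schemes.
    for average in ("macro", "micro", "weighted"):
        p, r, f, _ = precision_recall_fscore_support(
            y_true, y_pred, average=average, zero_division=0
        )
        print(f"{average:>8}: precision={p:.4f} recall={r:.4f} f-score={f:.4f}")
```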