Polynomial Perceptrons for Compact, Robust, and Interpretable Machine Learning Models
Abstract
1. Introduction
2. Related Work
3. Theoretical Foundation
3.1. Perceptron
3.2. Polynomial Perceptrons
- The coefficient represents the weight of a non-additive interaction of .
- The degree of each term is the sum of the exponents of all x components in that term.
- The degree of , denoted as d, is the maximum degree among its terms.
- For every term of any of degree d, it holds that .
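As an illustration of the definitions above, the following minimal sketch enumerates the monomials of a polynomial of degree d and evaluates a single polynomial perceptron. The function names, the sigmoid default for the activation g, and the ordering of the terms are assumptions made for illustration; they are not taken from the authors' implementation. For n input variables the number of coefficients is C(n + d, d), which for two inputs gives 3, 6, and 10 coefficients at degrees 1, 2, and 3.

```python
from itertools import combinations_with_replacement
from math import comb, exp


def monomial_exponents(n_vars, degree):
    """Exponent tuples of all monomials in n_vars variables with total degree <= degree."""
    exps = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(range(n_vars), d):
            e = [0] * n_vars
            for i in combo:
                e[i] += 1  # each occurrence of variable i raises its exponent by one
            exps.append(tuple(e))
    return exps


def polynomial_perceptron(x, coeffs, exps, g=lambda z: 1.0 / (1.0 + exp(-z))):
    """Evaluate g(P(x)) with P(x) = sum_k coeffs[k] * prod_i x[i] ** exps[k][i]."""
    p = 0.0
    for c, e in zip(coeffs, exps):
        term = c
        for xi, ei in zip(x, e):
            term *= xi ** ei
        p += term
    return g(p)


# Hypothetical example: two inputs, degree 2 (terms 1, x1, x2, x1^2, x1*x2, x2^2).
exps = monomial_exponents(n_vars=2, degree=2)
expected_terms = comb(2 + 2, 2)  # closed-form count of coefficients
output = polynomial_perceptron([1.0, 2.0], [0.0] * len(exps), exps)
```

With all coefficients set to zero, the polynomial evaluates to zero and the sigmoid output is 0.5; passing `g=lambda z: z` exposes the raw polynomial value instead.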
3.3. Learning Process for a Polynomial Perceptron
3.4. Interpretability of Polynomial Perceptrons
- If , the monomial does not contribute to the marginal influence of .
- If , the interaction term scales proportionally with the product of the other involved variables.
- If , the influence of becomes nonlinear even in isolation.
- First-order terms represent global linear effects.
- Second-order terms represent pairwise interactions.
- Higher-order terms represent multi-variable cooperative effects.
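The marginal reading of a monomial can be made concrete with partial derivatives. The sketch below assumes that the three cases above correspond to the variable of interest having exponent 0, exponent 1, or an exponent greater than 1 in the monomial; the polynomial, its coefficients, and the function names are hypothetical.

```python
def monomial_partial(coeff, exps, x, i):
    """Partial derivative of coeff * prod_j x[j] ** exps[j] with respect to x[i]."""
    if exps[i] == 0:
        return 0.0  # the monomial does not depend on x[i]
    value = coeff * exps[i] * x[i] ** (exps[i] - 1)
    for j, (xj, ej) in enumerate(zip(x, exps)):
        if j != i:
            value *= xj ** ej  # remaining variables act as a multiplicative scale
    return float(value)


def marginal_influence(terms, x, i):
    """Sum of the partial derivatives of all (coeff, exps) terms with respect to x[i]."""
    return sum(monomial_partial(c, e, x, i) for c, e in terms)


# Hypothetical polynomial: P(x1, x2) = 0.5 + 2*x1 + 3*x1*x2 - x1^2 + 4*x2
terms = [(0.5, (0, 0)), (2.0, (1, 0)), (3.0, (1, 1)), (-1.0, (2, 0)), (4.0, (0, 1))]
x = (1.5, 2.0)
influence_x1 = marginal_influence(terms, x, 0)  # 2 + 3*x2 - 2*x1
influence_x2 = marginal_influence(terms, x, 1)  # 3*x1 + 4
```

In this example the first-order term contributes a constant, the pairwise term contributes an amount proportional to the other variable, and the squared term contributes an amount that depends on the variable itself.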
4. Polynomial Perceptrons in Practical Use Cases
4.1. Binary Classification
4.1.1. Architecture
4.1.2. Evaluation
4.1.3. Interpretability
1. (Structural Conservation)
2. (Activation Independence) The quantities depend exclusively on the polynomial structure of and are independent of the choice of activation function g.
3. (Monotonic Invariance) If g is strictly monotonic, then the relative influence ordering induced by is preserved in .
- (2) Activation Independence.
- (3) Monotonic Invariance.
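The monotonic invariance property can be checked numerically. The sketch below takes the partial derivatives of the polynomial as the influence measure, which is an assumption made for illustration; the polynomial, the evaluation point, and the helper names are hypothetical. By the chain rule, the gradient of g(P(x)) equals g'(P(x)) times the gradient of P(x), so a strictly increasing g rescales all partial derivatives by the same positive factor and leaves their ordering unchanged.

```python
from math import exp, tanh


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def poly(x1, x2):
    """Hypothetical polynomial P(x) = 0.5 + 2*x1 - 3*x2 + x1*x2."""
    return 0.5 + 2.0 * x1 - 3.0 * x2 + x1 * x2


def poly_grad(x1, x2):
    """Analytic gradient of P: (2 + x2, -3 + x1)."""
    return (2.0 + x2, -3.0 + x1)


def numeric_grad(f, x1, x2, h=1e-6):
    """Central finite-difference gradient of f at (x1, x2)."""
    d1 = (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)
    d2 = (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)
    return (d1, d2)


def influence_ranking(grad):
    """Input indices sorted by decreasing absolute partial derivative."""
    return sorted(range(len(grad)), key=lambda i: -abs(grad[i]))


point = (0.4, 0.1)
rank_p = influence_ranking(poly_grad(*point))
rank_sigmoid = influence_ranking(numeric_grad(lambda a, b: sigmoid(poly(a, b)), *point))
rank_tanh = influence_ranking(numeric_grad(lambda a, b: tanh(poly(a, b)), *point))
```

At the chosen point the second input has the larger absolute influence under the raw polynomial, and the same ordering is recovered after applying either activation.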
4.2. Multiclass Classification
4.2.1. Architecture
4.2.2. Evaluation
4.2.3. Interpretability
4.3. Image Classification
4.3.1. Preliminary Architecture
4.3.2. Robust Architecture
- PP-Flat: Global degree-2 polynomial applied to flattened input.
- PP-Local (): Non-overlapping localized polynomial units.
- PP-Local (): Overlapping localized polynomial units.
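The difference between the three variants can be sketched by counting polynomial terms and patch positions. The sketch assumes 28 × 28 single-channel inputs and 10 output classes; under that assumption a global degree-2 polynomial per class has 10 · C(786, 2) = 3,085,050 coefficients, which matches the parameter count reported for PP-Flat. The patch size and strides used for the local units are hypothetical values chosen for illustration and are not the settings used in the experiments.

```python
from math import comb


def n_poly_terms(n_inputs, degree):
    """Number of monomials of total degree <= degree in n_inputs variables."""
    return comb(n_inputs + degree, degree)


def patch_origins(image_size, patch, stride):
    """Top-left corners of the square patches visited with the given stride."""
    positions = range(0, image_size - patch + 1, stride)
    return [(r, c) for r in positions for c in positions]


# Global unit: one degree-2 polynomial per class over all 28 * 28 pixels.
pp_flat_params = 10 * n_poly_terms(28 * 28, 2)

# Local units: hypothetical 4 x 4 patches.
# stride == patch gives non-overlapping units; stride < patch gives overlapping units.
non_overlapping = patch_origins(image_size=28, patch=4, stride=4)
overlapping = patch_origins(image_size=28, patch=4, stride=2)
terms_per_patch = n_poly_terms(4 * 4, 2)
```

Because each local unit sees only a few pixels, its number of degree-2 terms stays small, and the total parameter count grows with the number of patches rather than with the square of the image size.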
4.3.3. Evaluation
- PP-Flat: Global degree-2 polynomial applied to flattened input.
- PP-Local (): Non-overlapping localized polynomial units.
- PP-Local (): Overlapping localized polynomial units.
- MLP-100: One hidden layer with 100 units (capacity-matched to PP-Local ).
- MLP-224: One hidden layer with 224 units (capacity-matched to PP-Local ).
- CNN-Low: Two-layer CNN.
- CNN-Medium: Two-layer CNN.
- MLP-100:
- MLP-224:
- CNN-Low:
  - Conv(1, 32, ) + ReLU.
  - MaxPool().
  - Conv(32, 64, ) + ReLU.
  - MaxPool().
  - Fully Connected (to 10 classes).
- CNN-Medium:
  - Conv(1, 48, ) + ReLU.
  - Conv(48, 96, ) + ReLU.
  - MaxPool().
  - Fully Connected (to 10 classes).
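The capacity of the baselines can be sanity-checked with a short parameter count. The sketch assumes 28 × 28 single-channel inputs, 3 × 3 convolutions with padding that preserves the spatial size, and 2 × 2 max pooling; these settings are assumptions, although under them the counts coincide with the parameter totals reported for MLP-100, MLP-224, CNN-Low, and CNN-Medium. The helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution."""
    return c_in * c_out * k * k + c_out


def mlp_params(n_in, hidden, n_classes):
    """One hidden layer followed by the output layer."""
    return dense_params(n_in, hidden) + dense_params(hidden, n_classes)


def cnn_params(channels, k, n_pools, image_size=28, n_classes=10):
    """Two size-preserving convolutions, n_pools 2 x 2 poolings, one dense output layer."""
    c1, c2 = channels
    side = image_size // (2 ** n_pools)  # each pooling halves the spatial side
    return (
        conv_params(1, c1, k)
        + conv_params(c1, c2, k)
        + dense_params(side * side * c2, n_classes)
    )


mlp_100 = mlp_params(28 * 28, 100, 10)
mlp_224 = mlp_params(28 * 28, 224, 10)
cnn_low = cnn_params((32, 64), k=3, n_pools=2)
cnn_medium = cnn_params((48, 96), k=3, n_pools=1)
```

Most of the CNN-Medium parameters come from the final dense layer, since a single pooling stage leaves a 14 × 14 × 96 feature map to be connected to the 10 outputs.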
4.3.4. Interpretability via Structured Contribution Decomposition
4.4. Natural Language Processing
4.4.1. Architecture
- From the corpus , the TfidfVectorizer was trained considering exclusively the vocabulary contained in a pre-trained embedding model (GloVe, 100 dimensions), loaded via the Gensim (v4.3.3) package.
- Relying on the resulting term–document matrix, let denote the j-th token in the i-th document, with TF–IDF weight , and let be its embedding vector. The weighted embedding of each token is then computed as follows:
- Subsequently, the document-level embedding for the i-th document is defined as the normalized weighted average: where denotes the number of tokens in the i-th document.
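The steps above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the dictionaries `tfidf` and `embeddings` stand in for the fitted TF–IDF weights of one document and the pre-trained embedding vectors, the toy values are invented, and normalization is taken to be division by the number of in-vocabulary tokens. The function name is hypothetical.

```python
def document_embedding(tokens, tfidf, embeddings):
    """Average of the TF-IDF-weighted embeddings of the tokens of one document."""
    known = [t for t in tokens if t in tfidf and t in embeddings]
    if not known:
        return None  # no token of the document is covered by the vocabulary
    dim = len(embeddings[known[0]])
    total = [0.0] * dim
    for t in known:
        for k in range(dim):
            total[k] += tfidf[t] * embeddings[t][k]  # weighted token embedding
    return [v / len(known) for v in total]


# Toy two-dimensional embeddings and weights (invented values).
embeddings = {"good": [1.0, 0.0], "bad": [0.0, 1.0]}
tfidf = {"good": 0.5, "bad": 0.25}
doc_vector = document_embedding(["good", "bad", "unknown"], tfidf, embeddings)
```

Tokens outside the embedding vocabulary are skipped, which mirrors fitting the vectorizer only on words that have a pre-trained vector.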
4.4.2. Evaluation
- The SemEval Baseline, based on a linear SVM trained on a TF-IDF representation, with hyperparameters set to the default configuration of the scikit-learn (v1.7.0) Python (v3.13.5) library.
- The SemEval Winner, also relying on SVM but using an RBF kernel and leveraging Google’s Universal Sentence Encoder to obtain sentence-level embeddings, training solely on the given dataset.
- The SemEval Third-place, consisting of a stacked Bidirectional Gated Recurrent Unit (BiGRU) architecture, using fastText word embeddings as input features.
4.4.3. Interpretability
- For each token t, we retrieve its associated TF–IDF weight and its corresponding GloVe embedding vector .
- The contribution of token t is isolated by removing its scaled embedding from the overall embedding sum of the input, thereby yielding a modified representation that excludes the influence of t.
- The modified embedding is then propagated through the trained model, yielding the prediction , which reflects the model’s output in the absence of token t.
- Finally, the contribution of token t is quantified as the difference between the baseline prediction , obtained using the full input, and the counterfactual prediction , obtained after removing t. Such a contribution can be expressed as follows:
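The procedure above amounts to a leave-one-token-out comparison, sketched below. The sketch makes several assumptions: the document embedding is the sum of TF–IDF-weighted token vectors divided by the number of tokens, the divisor is kept unchanged after a token is removed, and `predict` is a stand-in for the trained polynomial model. All names and toy values are hypothetical.

```python
from math import exp


def token_contributions(tokens, tfidf, embeddings, predict):
    """Map each token to (prediction on full input) - (prediction without that token)."""
    dim = len(embeddings[tokens[0]])
    n = len(tokens)
    total = [0.0] * dim
    for t in tokens:
        for k in range(dim):
            total[k] += tfidf[t] * embeddings[t][k]
    baseline = predict([v / n for v in total])
    contributions = {}
    for t in tokens:
        # Remove the scaled embedding of t from the overall embedding sum.
        reduced = [total[k] - tfidf[t] * embeddings[t][k] for k in range(dim)]
        counterfactual = predict([v / n for v in reduced])
        contributions[t] = baseline - counterfactual
    return contributions


def toy_model(v):
    """Hypothetical degree-2 model: sigmoid(2*v0 - 2*v1 + v0*v1)."""
    return 1.0 / (1.0 + exp(-(2.0 * v[0] - 2.0 * v[1] + v[0] * v[1])))


embeddings = {"good": [1.0, 0.0], "bad": [0.0, 1.0]}
tfidf = {"good": 0.5, "bad": 0.25}
contrib = token_contributions(["good", "bad"], tfidf, embeddings, toy_model)
```

A positive value indicates that the token pushes the prediction upward, and a negative value indicates that removing the token raises the prediction.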
5. Limitations and Scalability Considerations
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; Ruan, C.; et al. Deepseek-v3 technical report. arXiv 2024, arXiv:2412.19437.
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
- Strubell, E.; Ganesh, A.; McCallum, A. Energy and policy considerations for modern deep learning research. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI: Washington, DC, USA, 2020; Volume 34, pp. 13693–13696.
- Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215.
- Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning Internal Representations by Error Propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition; Rumelhart, D.E., McClelland, J.L., Eds.; MIT Press: Cambridge, MA, USA, 1986; Volume 1, pp. 318–362.
- Williams, R.J.; Zipser, D. Learning representations by back-propagating errors. Neural Comput. 1989, 1, 270–280.
- Giles, C.L.; Maxwell, T. Learning, invariance, and generalization in high-order neural networks. Appl. Opt. 1987, 26, 4972–4978.
- Pao, Y.H. Adaptive Pattern Recognition and Neural Networks; Addison-Wesley: Reading, MA, USA, 1989.
- McCullagh, P.; Nelder, J.A. Generalized Linear Models, 2nd ed.; Chapman and Hall: New York, NY, USA, 1989.
- Schölkopf, B.; Smola, A.J. Learning with Kernels; MIT Press: Cambridge, MA, USA, 2002.
- Livni, R.; Shalev-Shwartz, S.; Shamir, O. On the computational efficiency of training neural networks. Adv. Neural Inf. Process. Syst. 2014, 27, 855–863.
- Bhati, D.; Amiruzzaman, M.; Zhao, Y.; Guercio, A.; Le, T. A Survey of Post-Hoc XAI Methods From a Visualization Perspective: Challenges and Opportunities. IEEE Access 2025, 13, 120785–120806.
- Madsen, A.; Reddy, S.; Chandar, S. Post-hoc Interpretability for Neural NLP: A Survey. ACM Comput. Surv. 2022, 55, 155.
- Simonyan, K.; Vedaldi, A.; Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv 2013, arXiv:1312.6034.
- Sundararajan, M.; Taly, A.; Yan, Q. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, Australia, 6–11 August 2017; pp. 3319–3328.
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: New York, NY, USA, 2017; pp. 618–626.
- Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2016; pp. 1135–1144.
- Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
- Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference; Part I 13; Springer: Berlin/Heidelberg, Germany, 2014; pp. 818–833.
- Kim, B.; Wattenberg, M.; Gilmer, J.; Cai, C.; Wexler, J.; Viegas, F. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 4186–4195.
- Ghorbani, A.; Zou, J.Y. Neuron shapley: Discovering the responsible neurons. Adv. Neural Inf. Process. Syst. 2020, 33, 5922–5932.
- Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev. 1958, 65, 386.
- McCulloch, W.S.; Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 1943, 5, 115–133.
- Minsky, M.; Papert, S. Perceptrons; MIT Press: Cambridge, MA, USA, 1969.
- Basile, V.; Bosco, C.; Fersini, E.; Nozza, D.; Patti, V.; Pardo, F.M.R.; Rosso, P.; Sanguinetti, M. Semeval-2019 task 5: Multilingual detection of hate speech against immigrants and women in twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 54–63.

| Model | Configuration | Test Accuracy | Parameters |
|---|---|---|---|
| PP | Degree 1 | 0.8850 | 3 |
| PP | Degree 2 | 0.8783 | 6 |
| PP | Degree 3 | 0.9967 | 10 |
| SVM | RBF kernel | 0.9983 | Implicit |
| MLP | 1 layer (2 neurons) | 0.8850 | 9 |
| MLP | 1 layer (4 neurons) | 0.8883 | 17 |
| MLP | 1 layer (8 neurons) | 0.8983 | 33 |
| MLP | 1 layer (16 neurons) | 0.9967 | 65 |

| Model | Configuration | Test Accuracy | Parameters |
|---|---|---|---|
| PP | Degree 1 | 0.8333 | 9 |
| PP | Degree 2 | 0.8567 | 18 |
| PP | Degree 3 | 0.8667 | 30 |
| PP | Degree 4 | 0.8700 | 45 |
| SVM | RBF kernel | 0.8667 | implicit |
| MLP | 1-layer (4 neurons) | 0.8617 | 27 |
| MLP | 1-layer (8 neurons) | 0.8733 | 51 |
| MLP | 1-layer (16 neurons) | 0.8700 | 99 |

| Model | Test Accuracy | Parameters | Structure |
|---|---|---|---|
| PP-Flat | 88.9% | 3,085,050 | Global Polynomial |
| PP-Local () | 92.7% | 78,062 | Local Polynomial |
| PP-Local () | 94.1% | 175,627 | Overlapping Local Polynomial |
| MLP-100 | 88.3% | 79,510 | Dense |
| MLP-224 | 89.1% | 178,090 | Dense |
| CNN-Low | 90.4% | 50,186 | Convolutional |
| CNN-Medium | 92.0% | 230,218 | Convolutional |

| Model | Test Accuracy | Test F1-Score | Test Precision | Test Recall | Parameters |
|---|---|---|---|---|---|
| SemEval Baseline (SVM, TF-IDF) | − | 0.4510 | − | − | − |
| SemEval Winner (Google, SVM-RBF kernel) | − | 0.6510 | − | − | − |
| SemEval Third-place (fastText, BiGRU) | − | 0.5350 | − | − | − |
| Polynomial (tfidf, unbalanced, deg = 2) | 0.5601 | 0.3071 | 0.4609 | 0.2302 | 1326 |
| Polynomial (tfidf, balanced, deg = 2) | 0.5318 | 0.4676 | 0.5419 | 0.4111 | 1326 |
| Polynomial (tfidf, unbalanced, deg = 3) | 0.5609 | 0.2071 | 0.4396 | 0.1355 | 23,426 |
| Polynomial (tfidf, balanced, deg = 3) | 0.5405 | 0.4914 | 0.5502 | 0.4440 | 23,426 |
| Polynomial (tfidf+GloVe, unbalanced, deg = 2) | 0.5856 | 0.6349 | 0.5063 | 0.8511 | 5151 |
| Polynomial (tfidf+GloVe, balanced, deg = 2) | 0.7142 | 0.7115 | 0.7185 | 0.7046 | 5151 |
| Polynomial (tfidf+GloVe, unbalanced, deg = 3) | 0.5703 | 0.6254 | 0.4955 | 0.8476 | 176,851 |
| Polynomial (tfidf+GloVe, balanced, deg = 3) | 0.6998 | 0.7006 | 0.6986 | 0.7027 | 176,851 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Aldana-Bobadilla, E.; Molina-Villegas, A.; Cesar-Hernandez, J.; Garza-Fabre, M. Polynomial Perceptrons for Compact, Robust, and Interpretable Machine Learning Models. Entropy 2026, 28, 453. https://doi.org/10.3390/e28040453

