Revisiting SMS Spam Detection: The Impact of Feature Representation on Classical Machine Learning Models
Abstract
1. Introduction
1.1. Motivation and Contribution
1.2. Organization
2. Literature Review
2.1. Classical Machine Learning
2.2. Deep Learning Approaches
2.3. Transformer/LLM Approaches
2.4. FE and Hybrid/Robustness Approaches
2.5. Comparative Perspective
3. Methodology and Experimental Setup
3.1. Dataset Description
3.2. Text Preprocessing
3.3. Text Feature Representation Techniques
3.3.1. Count Vectorization
3.3.2. Binary BoW
3.3.3. TF-IDF
3.3.4. Word N-Gram
3.3.5. Character N-Grams
3.3.6. Enhanced TF-IDF
3.3.7. Hybrid Feature Representation
3.3.8. Feature Dimension and Sparsity Summary
3.4. Classification Models
- The NB model was included as a baseline, as it is particularly suitable for BoW and TF-IDF representations due to its efficiency in handling sparse text data.
- The Linear SVM model was chosen for its strong performance in text classification tasks with high-dimensional sparse features, making it effective for spam detection.
- The RF model was included to evaluate ensemble-based methods capable of capturing non-linear feature interactions, providing a complementary perspective to linear models.
- The KNN model was selected as an instance-based approach to compare performance under different feature representations, despite its potential sensitivity to high-dimensional spaces.
- The LR model was used as a linear classifier with strong interpretability and efficiency, providing a baseline for comparison with other linear methods.
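The five classifiers above can be sketched side by side. This is a hedged, illustrative scikit-learn analogue (the paper's experiments use MATLAB, per Section 3.5.1); the toy corpus, vectorizer defaults, and hyperparameters here are assumptions, not the authors' exact setup.

```python
# Illustrative sketch of the five-model comparison in Section 3.4,
# using scikit-learn equivalents of the paper's MATLAB classifiers.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Toy labelled messages (1 = spam, 0 = ham) standing in for the SMS corpus.
texts = [
    "WIN a free prize now, claim at once",
    "Call now for your cash reward",
    "Are we still meeting for lunch today",
    "See you at the station at five",
    "Free entry: text WIN to claim cash",
    "Thanks, I got home safe",
]
labels = [1, 1, 0, 0, 1, 0]

vec = TfidfVectorizer()          # sparse TF-IDF features
X = vec.fit_transform(texts)

models = {
    "NB": MultinomialNB(),
    "SVM-L": LinearSVC(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "LR": LogisticRegression(max_iter=1000),
}
for name, clf in models.items():
    clf.fit(X, labels)
    acc = clf.score(X, labels)   # training accuracy on the toy data
    print(f"{name}: {acc:.2f}")
```

All five models accept the same sparse feature matrix, which is what makes this kind of representation-versus-classifier grid study practical.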
3.5. Evaluation Strategy and Performance Metrics
3.5.1. Implementation Details and Hyperparameter Settings
3.5.2. Execution Time Analysis
4. Results and Discussion
4.1. Comparison of Results Obtained from Existing Studies
4.2. Limitations
5. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| AI | Artificial Intelligence |
| ANN | Artificial Neural Networks |
| BERT | Bidirectional Encoder Representations from Transformers |
| BoW | Bag-of-Words |
| CNN | Convolutional Neural Networks |
| DNN | Deep Neural Networks |
| DT | Decision Tree |
| ET | Extra Trees |
| FE | Feature Engineering |
| FRNN | Fuzzy Recurrent Neural Network |
| GB | Gradient Boosting |
| GBDT | Gradient Boosted Decision Trees |
| GBM | Gradient Boosting Machine |
| GPT | Generative Pre-trained Transformer |
| GRU | Gated Recurrent Units |
| HHO | Harris Hawk Optimization |
| HMM | Hidden Markov Model |
| ICA | Independent Component Analysis |
| IG | Information Gain |
| KELM | Kernel Extreme Learning Machine |
| KNN | K-Nearest Neighbors |
| LGBM | Light Gradient Boosting Machine |
| LLM | Large Language Model |
| LoRA | Low-Rank Adaptation |
| LR | Logistic Regression |
| LSTM | Long Short-Term Memory |
| MLR | Multinomial Logistic Regression |
| MNB | Multinomial Naive Bayes |
| NB | Naive Bayes |
| PCA | Principal Component Analysis |
| RBF | Radial Basis Function |
| ReFT | Reinforcement Fine-Tuning |
| RF | Random Forest |
| RFE | Recursive Feature Elimination |
| RNN | Recurrent Neural Networks |
| SGD | Stochastic Gradient Descent |
| SVM | Support Vector Machine |
| TF | Term Frequency |
| TF-IDF | Term Frequency–Inverse Document Frequency |
References
1. Sethi, P.; Bhandari, V.; Kohli, B. SMS Spam Detection and Comparison of Various Machine Learning Algorithms. In Proceedings of the 2017 International Conference on Computing, Communication and Technologies for Smart Nation (IC3TSN), New Delhi, India, 12–14 October 2017; pp. 28–31.
2. Theodorus, A.; Prasetyo, T.K.; Hartono, R.; Suhartono, D. Short Message Service (SMS) Spam Filtering using Machine Learning in Bahasa Indonesia. In Proceedings of the 2021 3rd East Indonesia Conference on Computer and Information Technology (EIConCIT), Surabaya, Indonesia, 9–11 April 2021; pp. 199–202.
3. Sharma, N. A Methodological Study of SMS Spam Classification Using Machine Learning Algorithms. In Proceedings of the 2022 2nd International Conference on Intelligent Technologies (CONIT), Karnataka, India, 24–26 June 2022; pp. 1–5.
4. Jain, H.; Mahadev, M. An Analysis of SMS Spam Detection using Machine Learning Model. In Proceedings of the 2022 Fifth International Conference on Computational Intelligence and Communication Technologies (CCICT), Roorkee, India, 17–18 June 2022; pp. 151–155.
5. De Luna, R.G.; Enriquez, K.L.; Española, A.M.; Ramos, M.; Magnaye, V.C.; Astorga, D.; Lanting, B.A.; Redondo, J.; Reaño, R.A.L.; Celestial, T.; et al. A Machine Learning Approach for Efficient Spam Detection in Short Messaging System (SMS). In Proceedings of the TENCON 2023—2023 IEEE Region 10 Conference (TENCON), Chiang Mai, Thailand, 31 October–3 November 2023; pp. 53–58.
6. Jain, V. Optimizing SMS Spam Detection: An In-Depth Analysis of Machine Learning Approaches. In Proceedings of the 2024 5th International Conference on Data Intelligence and Cognitive Informatics (ICDICI), Tirunelveli, India, 20–22 November 2024; pp. 847–852.
7. Airlangga, G. Optimizing SMS Spam Detection Using Machine Learning: A Comparative Analysis of Ensemble and Traditional Classifiers. J. Comput. Netw. Archit. High Perform. Comput. 2024, 6, 1234–1245.
8. Hafidi, N.; Khoudi, Z.; Nachaoui, M.; Lyaqini, S. Enhanced SMS Spam Classification Using Machine Learning with Optimized Hyperparameters. Indones. J. Electr. Eng. Comput. Sci. 2025, 37, 356–364.
9. Britto, R.V.; Jasirullah, N.; Prabhu, R.S.; Kodhai, E. Combatting SMS Spam: A Machine Learning Approach for Accurate and Scalable Detection. In Proceedings of the 2025 International Conference on Data Science, Agents, and Artificial Intelligence (ICDSAAI), Chennai, India, 24–25 January 2025; pp. 1–5.
10. Ozoh, P.; Ibrahim, M.; Ojo, R.; Sunmade, A.G.; Oyetayo, T. SMS Spam Detection Using Machine Learning Approach. Int. STEM J. 2025, 6, 10–27.
11. Nawaz, I.; Khosa, S.N.; Fatima, R.; Saeed, M.; Hashmi, M.S.A. Smart Filters for SMS Spam: A Machine Learning Approach to SMS Classification. SES J. 2025, 2025, 71–95.
12. Ahmadi, M.; Khajavi, M.; Varmaghani, A.; Ala, A.; Danesh, K.; Javaheri, D. Leveraging LLMs for Cybersecurity: Enhancing SMS Spam Detection with Robust and Context-Aware Text Classification. Cyber-Phys. Syst. 2025, 1–25.
13. Hossain, F.; Uddin, M.N.; Halder, R.K. Analysis of Optimized Machine Learning and Deep Learning Techniques for Spam Detection. In Proceedings of the 2021 IEEE International IOT, Electronics and Mechatronics Conference (IEMTRONICS), Toronto, ON, Canada, 21–24 April 2021; pp. 1–6.
14. Altunay, H.C.; Albayrak, Z. SMS Spam Detection System Based on Deep Learning Architectures for Turkish and English Messages. Appl. Sci. 2024, 14, 11804.
15. Ghourabi, A.; Mahmood, M.A.; Alzubi, Q.M. A Hybrid CNN-LSTM Model for SMS Spam Detection in Arabic and English Messages. Future Internet 2020, 12, 156.
16. Chowdhury, S.H.; Morzina, M.S.; Hussain, M.I.; Hossain, M.M.; Shovon, M.; Mamun, M. LoRA and ReFT Optimized Explainable Machine Learning and Deep Learning Framework for SMS Spam Detection. In Proceedings of the 2025 International Conference on Quantum Photonics, Artificial Intelligence, and Networking (QPAIN), Rangpur, Bangladesh, 1–2 February 2025; pp. 1–10.
17. Liu, X.; Lu, H.; Nayak, A. A Spam Transformer Model for SMS Spam Detection. IEEE Access 2021, 9, 80253–80263.
18. Ahmed, M.N.; Ahamed, A.S.M.S.; Tamim, F.S. Optimizing SMS Spam Detection with Large Language Models and Transformer Architectures. In Proceedings of the 2025 International Conference on Electrical, Computer and Communication Engineering (ECCE), Chattogram, Bangladesh, 13–15 February 2025; pp. 1–8.
19. Srinivasarao, U.; Sharaff, A. Machine Intelligence Based Hybrid Classifier for Spam Detection and Sentiment Analysis of SMS Messages. Multimed. Tools Appl. 2023, 82, 31069–31099.
20. Salman, M.; Ikram, M.; Kaafar, M.A. Investigating Evasive Techniques in SMS Spam Filtering: A Comparative Analysis of Machine Learning Models. IEEE Access 2024, 12, 24306–24329.
21. Srinivasarao, U.; Sharaff, A. SMS Sentiment Classification Using an Evolutionary Optimization Based Fuzzy Recurrent Neural Network. Multimed. Tools Appl. 2023, 82, 42207–42238.
22. Almeida, T.A.; Gómez Hidalgo, J.M. SMS Spam Collection v.1. UCI Machine Learning Repository. 2012. Available online: https://archive.ics.uci.edu/dataset/228/sms+spam+collection (accessed on 23 January 2026).
23. Ramanujam, E.; Abirami, A.M.; Sakthiprakash, K.; Sumitra, S. Efficient Extraction and Evaluation of Hand-Crafted Meta-Data Features for Dravidian Spam SMS Classification. Evolving Syst. 2026, 17, 1–18.
24. Verma, A.R.K.; Sadana, S. Textual, Non-Textual, and Hybrid Feature Engineering for SMS Spam Classification. IEEE Access 2025, 13, 176901–176914.
25. Xia, T.; Chen, X. A Discrete Hidden Markov Model for SMS Spam Detection. Appl. Sci. 2020, 10, 5011.
26. Abdel-Jaber, H. Detecting Spam and Ham SMS Messages Using Natural Language Processing and Machine Learning Algorithms. PeerJ Comput. Sci. 2025, 11, e3232.
27. Shen, L.; Wang, Y.; Li, Z.; Ma, W. SMS Spam Detection Using BERT and Multi-Graph Convolutional Networks. Int. J. Intell. Netw. 2025, 6, 79–88.
28. Yan, D.; Li, K.; Gu, S.; Yang, L. Network-Based Bag-of-Words Model for Text Classification. IEEE Access 2020, 8, 82641–82652.
29. Jasim, A.K.; Al-Ibeahimi, F.A.F.; Alkaabi, H.A. Explainable AI for SMS Spam Filtering: A Novel Hybrid Architecture Combining Fuzzy Logic and Bidirectional LSTM Networks. Franklin Open 2025, 14, 100466.
30. Xu, H.; Qadir, A.; Sadiq, S. Malicious SMS Detection Using Ensemble Learning and SMOTE to Improve Mobile Cybersecurity. Comput. Secur. 2025, 154, 104443.

| Feature Representation | Feature Dimensions | Sparsity (%) |
|---|---|---|
| Count Vectorization | 2483 | 99.63 |
| Binary BoW | 2483 | 99.63 |
| TF-IDF | 2483 | 99.63 |
| Word (Bigrams) | 7941 | 99.94 |
| Character N-grams (3-grams) | 4274 | 98.51 |
| Enhanced TF-IDF | 2489 | 99.59 |
| Hybrid (Words + Char) | 2583 | 98.96 |
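The sparsity figures in the table above are the percentage of zero entries in each document-term matrix. A minimal sketch of how such a figure is computed, using a toy SciPy sparse matrix (illustrative values, not the paper's data):

```python
# Computing the sparsity percentage of a document-term matrix,
# as reported in the feature-dimension table. Toy matrix only.
import numpy as np
from scipy import sparse

# Toy 4-document x 6-term count matrix (mostly zeros, as in real text data).
X = sparse.csr_matrix(np.array([
    [1, 0, 0, 2, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 3, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 0],
]))

n_total = X.shape[0] * X.shape[1]         # 24 cells in total
sparsity = 100.0 * (1 - X.nnz / n_total)  # percentage of zero entries
print(f"{sparsity:.2f}")                  # 6 non-zeros out of 24 -> 75.00
```

At the scale of the table (thousands of dimensions, >98% zeros), this is why sparse storage and classifiers that handle sparse input efficiently matter.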
| Component | Settings |
|---|---|
| Text Preprocessing | Lowercasing enabled; punctuation removed; digits replaced with space; stopwords removed using MATLAB default stopword list; optional lemmatization when supported. |
| Count Vectorization | MATLAB bagOfWords; minimum document frequency = 3; no maximum feature cap; vocabulary built inside each training fold only. |
| Binary BoW | Same vocabulary as Count Vectorization; binary weighting (term presence only). |
| TF-IDF | MATLAB tfidf; IDF computed within each training fold; default smoothing applied. |
| Word N-grams | Bigrams (n = 2); minimum document frequency = 3; vocabulary fitted per training fold. |
| Character N-grams | Character 3-grams (n = 3); minimum document frequency = 3; fitted within each training fold. |
| Enhanced TF-IDF | Standard TF-IDF augmented with six metadata features (message length, number of capital letters, capital letter ratio, number of digits, number of special characters ($, !, %, £, €), and URL indicator). |
| Hybrid Representation | Concatenation of TF-IDF word features and character 3-gram features; features normalized before training. |
| Logistic Regression | fitclinear; Learner = logistic; Regularization = ridge (L2); Lambda = 0.001; Solver = lbfgs; Standardize = true. |
| Linear SVM | fitcsvm; KernelFunction = linear; BoxConstraint = 1; Standardize = true. |
| RBF SVM | fitcsvm; KernelFunction = rbf; KernelScale = auto; BoxConstraint = 10; Standardize = true. |
| Random Forest | TreeBagger; Number of Trees = 100; other tree growth parameters (e.g., maximum depth, minimum leaf size, split criteria) kept at MATLAB defaults (no explicit depth constraint). |
| KNN | fitcknn; NumNeighbors = 7; Distance = euclidean; Standardize = true. |
| Naive Bayes | fitcnb; default distribution (Gaussian for numeric features); prior probabilities estimated from training fold; no kernel density estimation applied. |
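The six metadata features listed for the Enhanced TF-IDF row can be extracted with a few lines of code. This is a hedged sketch: the function name, the exact special-character set, and the URL pattern are illustrative assumptions, not the authors' MATLAB implementation.

```python
# Illustrative extraction of the six Enhanced TF-IDF metadata features:
# message length, capital-letter count, capital-letter ratio, digit count,
# special-character count ($, !, %, £, €), and a URL indicator.
import re

SPECIALS = set("$!%£€")

def metadata_features(msg: str) -> list[float]:
    n = len(msg)
    caps = sum(ch.isupper() for ch in msg)
    return [
        float(n),                                  # message length
        float(caps),                               # number of capital letters
        caps / n if n else 0.0,                    # capital letter ratio
        float(sum(ch.isdigit() for ch in msg)),    # number of digits
        float(sum(ch in SPECIALS for ch in msg)),  # special characters
        1.0 if re.search(r"https?://|www\.", msg) else 0.0,  # URL indicator
    ]

print(metadata_features("WIN £100 now at www.example.com!"))
# -> [32.0, 3.0, 0.09375, 3.0, 2.0, 1.0]
```

These six values would then be concatenated onto the standard TF-IDF vector for each message.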
| Model | Time Range (s) | Category |
|---|---|---|
| LR | 0.5–0.9 | Linear |
| Linear SVM | 41.0–176.6 | Linear |
| KNN | 12.1–62.5 | Instance-Based |
| RBF SVM | 89.3–539.9 | Kernel |
| RF | 81.9–899.8 | Ensemble |
| NB | 2101.3–8046.8 | Probabilistic |
| Feature | Dimensions |
|---|---|
| Word N-grams | 7941 |
| Count Vectorization | 2483 |
| Model | Acc.(%) | Prec.(%) | Rec.(%) | F1(%) | Spec.(%) | AUC |
|---|---|---|---|---|---|---|
| NB (BoW) | 97.85 | 94.12 | 86.21 | 89.99 | 99.56 | 0.9712 |
| SVM-L (BoW) | 98.03 | 95.67 | 87.15 | 91.21 | 99.62 | 0.9789 |
| SVM-R (BoW) | 97.94 | 95.23 | 86.88 | 90.86 | 99.58 | 0.9776 |
| RF (BoW) | 97.67 | 93.89 | 85.54 | 89.52 | 99.45 | 0.9698 |
| KNN (BoW) | 96.89 | 90.45 | 82.73 | 86.42 | 99.12 | 0.9534 |
| LR (BoW) | 97.98 | 95.45 | 86.98 | 91.01 | 99.60 | 0.9782 |
| NB (TF-IDF) | 97.92 | 94.56 | 86.61 | 90.40 | 99.58 | 0.9745 |
| SVM-L (TF-IDF) | 98.15 | 96.12 | 87.68 | 91.70 | 99.68 | 0.9812 |
| SVM-R (TF-IDF) | 98.08 | 95.89 | 87.42 | 91.46 | 99.65 | 0.9798 |
| RF (TF-IDF) | 97.78 | 94.23 | 85.95 | 89.89 | 99.51 | 0.9721 |
| KNN (TF-IDF) | 97.02 | 91.12 | 83.28 | 87.02 | 99.18 | 0.9567 |
| LR (TF-IDF) | 98.10 | 96.01 | 87.55 | 91.58 | 99.66 | 0.9805 |
| NB (Binary BoW) | 97.78 | 93.89 | 85.87 | 89.69 | 99.52 | 0.9698 |
| SVM-L (Binary BoW) | 97.96 | 95.34 | 86.75 | 90.84 | 99.59 | 0.9768 |
| SVM-R (Binary BoW) | 97.89 | 95.01 | 86.48 | 90.54 | 99.55 | 0.9754 |
| RF (Binary BoW) | 97.62 | 93.67 | 85.28 | 89.28 | 99.42 | 0.9685 |
| KNN (Binary BoW) | 96.78 | 89.98 | 82.35 | 85.99 | 99.05 | 0.9512 |
| LR (Binary BoW) | 97.92 | 95.21 | 86.61 | 90.71 | 99.57 | 0.9761 |
| NB (Word N-grams) | 97.45 | 92.78 | 84.87 | 88.64 | 99.35 | 0.9645 |
| SVM-L (Word N-grams) | 97.72 | 94.12 | 85.68 | 89.70 | 99.48 | 0.9712 |
| SVM-R (Word N-grams) | 97.65 | 93.89 | 85.41 | 89.45 | 99.45 | 0.9698 |
| RF (Word N-grams) | 97.38 | 92.56 | 84.61 | 88.40 | 99.32 | 0.9632 |
| KNN (Word N-grams) | 96.52 | 89.23 | 81.75 | 85.32 | 98.92 | 0.9478 |
| LR (Word N-grams) | 97.68 | 94.01 | 85.55 | 89.58 | 99.46 | 0.9705 |
| NB (Character N-grams) | 98.05 | 95.78 | 87.28 | 91.33 | 99.64 | 0.9798 |
| SVM-L (Character N-grams) | 98.42 | 97.89 | 90.23 | 93.91 | 99.71 | 0.9885 |
| SVM-R (Character N-grams) | 98.21 | 97.32 | 89.69 | 93.34 | 99.58 | 0.9871 |
| RF (Character N-grams) | 98.12 | 96.98 | 89.42 | 93.04 | 99.52 | 0.9865 |
| KNN (Character N-grams) | 97.45 | 93.56 | 85.95 | 89.59 | 99.35 | 0.9678 |
| LR (Character N-grams) | 98.55 | 98.54 | 90.50 | 94.35 | 99.79 | 0.9893 |
| NB (TF-IDF Enhanced) | 97.98 | 95.12 | 86.88 | 90.81 | 99.60 | 0.9768 |
| SVM-L (TF-IDF Enhanced) | 98.25 | 96.78 | 88.42 | 92.41 | 99.69 | 0.9845 |
| SVM-R (TF-IDF Enhanced) | 98.18 | 96.45 | 88.15 | 92.11 | 99.66 | 0.9832 |
| RF (TF-IDF Enhanced) | 97.95 | 95.01 | 86.75 | 90.68 | 99.58 | 0.9758 |
| KNN (TF-IDF Enhanced) | 97.25 | 92.34 | 84.21 | 88.09 | 99.28 | 0.9612 |
| NB (Hybrid (Word + Character features)) | 98.02 | 95.45 | 87.01 | 91.04 | 99.62 | 0.9778 |
| SVM-L (Hybrid (Word + Character features)) | 98.28 | 97.45 | 89.96 | 93.54 | 99.63 | 0.9878 |
| SVM-R (Hybrid (Word + Character features)) | 98.15 | 96.89 | 89.55 | 93.07 | 99.55 | 0.9858 |
| RF (Hybrid (Word + Character features)) | 97.98 | 95.67 | 87.28 | 91.28 | 99.60 | 0.9785 |
| KNN (Hybrid (Word + Character features)) | 97.32 | 92.89 | 84.48 | 88.49 | 99.31 | 0.9632 |
| LR (Hybrid (Word + Character features)) | 98.32 | 97.56 | 90.09 | 93.67 | 99.65 | 0.9881 |
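The strongest configuration in the table above (LR on character 3-grams) can be sketched as a scikit-learn pipeline. This is an analogue, not a reproduction: the paper trains with MATLAB's fitclinear (ridge penalty, Lambda = 0.001), and the toy corpus, min_df = 1, and C value below are illustrative assumptions.

```python
# Illustrative scikit-learn analogue of the best-performing configuration:
# character 3-gram TF-IDF features fed to an L2-regularized Logistic Regression.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    # Character 3-grams; the paper uses min document frequency 3,
    # min_df=1 here only so the tiny toy corpus keeps its vocabulary.
    ("chargrams", TfidfVectorizer(analyzer="char", ngram_range=(3, 3), min_df=1)),
    # L2 (ridge) regularization, matching the MATLAB configuration in spirit.
    ("lr", LogisticRegression(penalty="l2", C=1000.0, max_iter=1000)),
])

texts = ["free prize claim now", "see you at lunch", "win cash now", "home safe"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham
pipe.fit(texts, labels)
print(pipe.predict(["claim your free cash"]))
```

Character n-grams are attractive here because they capture sub-word cues (obfuscated spellings, currency symbols in context) that word-level tokens miss.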
| Best Configuration | Value |
|---|---|
| Representation | Character N-grams (3-grams) |
| Classifier | LR |
| Accuracy | 98.55% |
| Precision | 98.55% |
| Recall | 90.50% |
| F1-score | 94.32% |
| Specificity | 99.79% |
| AUC | 0.9893 |
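All of the reported metrics follow from a binary confusion matrix with spam as the positive class. A short worked sketch (the counts below are illustrative, not the paper's confusion matrix):

```python
# How accuracy, precision, recall, specificity, and F1 derive from
# confusion-matrix counts (tp/fp/tn/fn), spam = positive class.
def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # a.k.a. sensitivity
    specificity = tn / (tn + fp)       # true-negative rate on ham
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

# Example: 90 spam caught, 10 spam missed, 1 ham misflagged, 499 ham passed.
acc, prec, rec, spec, f1 = metrics(tp=90, fp=1, tn=499, fn=10)
print(f"acc={acc:.4f} prec={prec:.4f} rec={rec:.4f} spec={spec:.4f} f1={f1:.4f}")
```

Note how a handful of missed spam (fn) drags recall well below accuracy while specificity stays near 100%, the same pattern visible in the results tables.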
| Study | Dataset Size | Used Method | Performance |
|---|---|---|---|
| Sethi et al. [1] | 5574 messages | NB with IG | 98.445% (Accuracy) |
| Sharma [3] | 5572 messages | Extra Trees with TF-IDF | 98.5% (Accuracy) |
| Jain [6] | 5574 messages | LR | 98% (F1-score) |
| Airlangga [7] | 5572 messages | SVM with TF-IDF | 98.57% (Accuracy) |
| Altunay & Albayrak [14] | 5574 messages | Hybrid CNN + GRU | 99.07% (Accuracy), 99.22% (F1-score) |
| Nawaz et al. [11] | 5574 messages | LGBM with metadata | 100% (All metrics) |
| Liu et al. [17] | SMS Spam Collection v1 | Modified Transformer | High performance on imbalanced data |
| Ahmed et al. [18] | 5169 messages | H2O-Danube (LLM) | 94% (Macro F1-score) |
| Chowdhury et al. [16] | SMS Spam dataset | XGBoost with LoRA + GPT | 99.82% (Accuracy), 83.33% (F1-score) |
| Proposed Method | 5574 messages | Character N-grams + LR | 98.55% (Accuracy), 94.32% (F1-score), 98.55% (Precision), 90.50% (Recall) |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Soysaldı Şahin, M.; Şahin, D.Ö.; Salah, A.F. Revisiting SMS Spam Detection: The Impact of Feature Representation on Classical Machine Learning Models. Electronics 2026, 15, 894. https://doi.org/10.3390/electronics15040894

