A Discrete Hidden Markov Model for SMS Spam Detection

Abstract: Many machine learning methods have been applied to short messaging service (SMS) spam detection, including traditional methods such as naïve Bayes (NB), the vector space model (VSM), and the support vector machine (SVM), and novel methods such as long short-term memory (LSTM) and the convolutional neural network (CNN). These methods are based on the well-known bag of words (BoW) model, which assumes documents are unordered collections of words. This assumption overlooks an important piece of information, i.e., word order. Moreover, the term frequency, which counts the number of occurrences of each word in an SMS, is unable to distinguish the importance of words, due to the length limitation of SMS. This paper proposes a new method based on the discrete hidden Markov model (HMM) to use the word order information and to solve the low term frequency issue in SMS spam detection. The widely adopted SMS spam dataset from the UCI machine learning repository is used for performance analysis of the proposed HMM method. The overall performance is comparable to that of deep learning methods employing CNN and LSTM models. A Chinese SMS spam dataset with 2000 messages is used for further performance evaluation. Experiments show that the proposed HMM method is not language-sensitive and can identify spam with high accuracy on both datasets.


Introduction
Nowadays, one of the most popular and common communication services is the short message service, known as SMS. SMS traffic volumes rose from 1.46 billion in 2000 to 7.9 trillion in 2012 [1]. The number of SMS-capable mobile phone users reached 6.1 billion by the year 2015 [2]. The growth of mobile users has generated a great deal of revenue [1]. Based on the latest statistics [3], global SMS revenue is predicted to hit 83.2 billion dollars in 2022, even though the revenue has continued to decrease since 2017. In addition, about half (43 billion dollars) of the global SMS revenue belongs to the global P2P (person-to-person) SMS market and the other half (40.2 billion dollars) belongs to A2P (application-to-person). A2P messages are sent by companies, such as bulksmsonline.com and bulksms.com, that provide bulk SMS sending services for commercial needs, e.g., verification codes, e-commerce notifications, and express delivery notifications.
While we enjoy the convenience of communication via electronic devices, unexpected advertising and even malicious information have flooded our email and phone message boxes. Such spam usually consists of unwanted or unsolicited electronic messages sent in bulk to a group of recipients [4]. It is sent by spammers or even criminals who are driven by the highly profitable spamming business. Spam first spread explosively, mainly via email, in the first decade of the 21st century, as indicated by the statistical results provided in [5]. As SMS is low-cost, bulk-sending, and reliably reaches the ...

1. We are the first to propose a hidden Markov model for spam SMS detection based on word order. This method uses word order information, which is of key importance in human language but has been ignored by many traditional methods based on the BoW model.
2. This research solves the issue that the TF.IDF algorithm for word weighting does not work well in SMS spam detection, due to the extremely low term frequency.
3. The proposed method can be applied to alphabetic text (e.g., English) and logographic text (e.g., Chinese). It is not language-sensitive.
The rest of the paper is organized as follows. Related work is discussed in Section 2. The problem formulation and the proposed SMS spam detection method based on the discrete HMM are presented in Section 3. The experimental results and performance comparisons with well-known models are outlined in Section 4. The conclusions are drawn and future work is discussed in Section 5.

Rule-Based Filtering Technologies
Rule-based filtering techniques are popular in commercial products. SpamAssassin [17] is a successful forerunner of typical rule-based systems (RBSs). It has been adopted by antispam industry companies, such as Symantec and McAfee [18]. The next-generation RBS, Wirebrush4SPAM [19], was then developed to increase throughput. Both SpamAssassin and Wirebrush4SPAM host a set of scored rules and run a score-based mechanism. A spam email is detected when the sum of scores from triggered rules is greater than or equal to a global threshold, called the required score. However, throughput is a challenging issue for RBSs, as the time complexity of their filtering algorithms could not be reduced to an acceptable level. To address the throughput issue, a constant time complexity spam detection algorithm was developed by Xia [20].

Content-Filtering Technologies
Content-filtering technologies utilize machine learning to combat spam. One of the most common technologies is the Bayesian classification filter [21]. Bayesian methods, such as naïve Bayes, work efficiently and have become important machine learning algorithms in information retrieval. Naïve Bayes is based on Bayes' theorem with a strong (naïve) independence assumption that treats each word as a single, independent feature. It is defined as a graphical probabilistic model for multivariate analysis, in which the nodes of the directed graph represent problem variables and the edges represent conditional dependencies between these variables. Jiang et al. [6] put forward deep feature weighting (DFW) for naïve Bayes and applied it to text classification. Moreover, to enhance their accuracy, Bayesian methods are often hybridized with other algorithms. Sable et al. [7] introduced a hybrid system for SMS classification based on a naïve Bayes classifier and the Apriori algorithm. Ebadati and Ahmadzadeh [22] proposed a genetic algorithm (GA)-naïve Bayes method for spam email detection, with the GA used for feature extraction. Arifin et al. [23] focused on SMS spam detection with a naïve Bayes classifier and frequent pattern (FP) tree mining, known as FP-Growth.
The vector space model is based on the BoW model. It represents documents as vectors of word weights and classifies documents based on the cosine similarity of these vectors. VSM is often used for text classification and information retrieval. Santos et al. [24] filtered spam by representing e-mails with the enhanced topic-based vector space model (eTVSM). The support vector machine is a binary classifier. It learns a decision function that separates an n-dimensional representation of the data into two regions using a hyperplane, which leads to high accuracy. It is popular because it is robust in many circumstances and classifies quickly. In natural language processing (NLP) research, the n-dimensional space of SVM is the same BoW vector space used by VSM. Chan et al. [25] proposed a word attack strategy and a feature reweighting method for SVM in the SMS scenario, where the length of a message is limited. Tekerek [9] compared the results of NB, K-nearest neighbors (KNN), SVM, random forest (RF), and random tree (RT), and found that SVM had the best result.
Entropy, a concept from information theory, is also used with co-training by Zhang et al. [26,27] to combat spam reviews, which promote sales or defame competitors by misleading consumers. The decision tree is another method for combating spam. Gashti [28] proposed a hybrid of the harmony search algorithm (HSA) and a decision tree for selecting the best features and for classification.
Deep learning has attracted extensive attention in recent years. Pumrapee et al. [10] proposed an SMS spam detection method based on long short-term memory (LSTM). Roy et al. [11] used convolutional neural network (CNN) and LSTM models in spam SMS detection and achieved the highest accuracy so far.
In addition, researchers also investigated hybrid models for performance improvement. Uysal et al. [29] investigated the impact of feature extraction and selection of the BoW model and then used KNN and SVM for spam SMS filtering. Karthika et al. [30] applied a latent semantic indexing (LSI)-based SVM model for email spam classification. Arijit et al. [31] filtered SMS spam by a recurrent neural network and LSTM. Yang et al. [32] used a multi-modal fusion, which applied LSTM and CNN models to process the text. Zhao et al. [33] applied six classifiers in the basic module and a deep neural network in the combination module. There are also other models for SMS spam detection, such as the neural network [34], KNN [35], and negative selection algorithm (NSA) [36]. Recently, Shang [37] developed a score-based filtering mechanism in consensus of hybrid multi-agent systems with malicious nodes, which can also be applied for spam detection.

Hidden Markov Model for Spam Detection
HMM and its variants have found a wide variety of applications. Examples include a hierarchical hidden Markov model (HHMM) for real-time finger motion synthesis [38], a hierarchical multivariate HMM with reactive interpolation functionality for full-body motion reconstruction [39], a combination of a speaker-specific Gaussian mixture model (GMM) with a syllable-based HMM for speaker recognition [40], and a Spherical-Self Organizing Map (S-SOM) with HMM for classifying sets of time series [41], to list just a few. HMMs are also well established in NLP, including part-of-speech tagging in many languages [42,43] and named entity recognition [44].
However, based on our literature review, including the latest review papers about SMS spam detection techniques [4,45,46], there is no report on using HMM for SMS spam detection based on word order. Rafique and Farooq [47] used HMM for SMS spam detection at the byte level, which is the low communication level of SMS delivery. Gordillo and Conde [48], as forerunners in this field, proposed an HMM for detecting spam mail in 2007. Their paper focused on detecting obfuscated words, such as m0ney or mo.ney for the word money. Therefore, instead of words, they focused on spam detection at the character level, such as letters in English. They treated the characters in spam emails like a DNA chain and applied a method similar to DNA chain classification to spam emails. Ebrahimi et al. [49] built an HMM for the detection and classification of duplicate bug reports (BRs) by focusing on the relation between current BRs and incoming BRs. Similarly, Washha et al. [50] put forward a topic-based HMM for spam tweet filtering and predicted the class of a tweet based on an assumed high dependency among successive tweets. Vennila et al. [51] used HMM for spam detection over internet telephony in voice, which belongs to a research field far from this study. The existing work thus focused on emails, bug reports, tweets, and so on. Unlike the existing work, we aim to use a hidden Markov model for spam SMS detection based on word order, which is a new application of HMM.

Problem Formulation and Notations
A typical SMS contains sequential words with punctuation. In English, words are separated by blank spaces, so an English SMS is easy to split, whereas in some other languages, e.g., Chinese, there are no blank spaces between words. Such SMS messages must first be fed into a segmentation algorithm to extract words. In any case, each SMS text is first split or segmented into sequential words with punctuation.
Not all words are suitable for NLP. Punctuation and words used only for positioning do not carry much semantic information. In SMS in particular, many informal words, shortened and abbreviated words, social media acronyms, and some strange character sequences often appear, and some of them are also meaningless. These are called stop words. Therefore, the stop words and punctuation are removed from the sequential words.
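The preprocessing described above can be sketched in a few lines of Python. This is only an illustrative sketch: the regular-expression tokenizer and the tiny `STOP_WORDS` set are assumptions made here, since the paper does not list its stop words, and a Chinese SMS would need a word segmentation step instead of the regular expression.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = {"about", "in", "out", "from"}

def preprocess(sms):
    """Split an English SMS into words, drop punctuation and stop words, keep word order."""
    # Matching runs of letters and digits also discards punctuation.
    tokens = re.findall(r"[A-Za-z0-9]+", sms)
    return [t for t in tokens if t.lower() not in STOP_WORDS]

ham_sms = "What you thinked about me. First time you saw me in class."
spam_sms = "Are you unique enough? Find out from 30th August. www.areyouunique.co.uk"

print(preprocess(ham_sms))
print(preprocess(spam_sms))
```

Duplicated words such as you and me are kept on purpose, because the observation sequence must preserve every occurrence in its original position.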
After this preprocessing, each SMS is refined to a word sequence consisting of meaningful words. Let N denote the total number of remaining meaningful words in the SMS messages, including duplicates. These N sequential words form the observation sequence, denoted as $X = \{o_1, o_2, o_3, \cdots, o_N\}$. The corresponding hidden state sequence is denoted as $Y = \{q_1, q_2, q_3, \cdots, q_N\}$, satisfying the Markov property. The structure of these two sequences is represented by the directed graph in Figure 1.
The HMM, denoted by λ, can be defined by a three-tuple:
$$\lambda = (\pi, A, B),$$
where π is the initial probability distribution, A is the state transition probability matrix, and B is the observation probability distribution matrix.
For SMS spam detection, the set of hidden states is $S = \{s_1, s_2\}$, whose two states correspond to ham and spam. Thus, π is a 2 × 1 initial probability distribution over the states, and A is a 2 × 2 matrix, $A = (a_{ij})_{2 \times 2}$, where $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i)$, for $i, j = 1, 2$, and $\sum_{j=1}^{2} a_{ij} = 1$, for $i = 1, 2$.
Let $W = \{w_1, w_2, w_3, \cdots, w_n\}$ denote the set of observation states, which includes all different words in every spam and ham SMS; n is the total number of different words in both sets. B is then a 2 × n matrix, $B = (b_{ij})_{2 \times n}$, where $b_{ij} = P(o_t = w_j \mid q_t = s_i)$.
The proposed hidden Markov model for spam detection is shown in Figure 2.
Consider the following two SMS messages as an example:
ham: What you thinked about me. First time you saw me in class.
spam: Are you unique enough? Find out from 30th August. www.areyouunique.co.uk.
Please note that the example dataset only contains 2 SMS messages and that the words you and me are duplicated in the two SMS messages.
Then, the observation state set W is generated, i.e., W = {What, you, thinked, me, First, time, saw, class, Are, unique, enough, Find, 30th, August, www, areyouunique, co, uk}. Each word in W appears in the ham and spam SMS sets with a certain frequency. These occurrence frequencies will be used to obtain the observation probability distribution.
In addition, the two refined word sequences are concatenated to form the observation sequence. In this example, X = {What, you, thinked, me, First, time, you, saw, me, class, Are, you, unique, enough, Find, 30th, August, www, areyouunique, co, uk}. The original word order is clearly kept.
Therefore, the observation sequence $X = \{o_1, o_2, o_3, \cdots, o_N\}$ is formed by concatenating, in order, the sequential words of all SMS messages. Please note that different SMS messages may have different lengths.
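As a small illustration of how W and X relate, the following sketch builds both from the two refined example sequences. The variable names and the integer encoding of words are assumptions made for this example; the paper does not describe its data structures.

```python
ham_words = ["What", "you", "thinked", "me", "First", "time", "you", "saw", "me", "class"]
spam_words = ["Are", "you", "unique", "enough", "Find", "30th", "August",
              "www", "areyouunique", "co", "uk"]

# Observation sequence X: concatenate the per-SMS sequences, preserving word order.
X = ham_words + spam_words

# Observation state set W: the distinct words, listed in order of first appearance.
W = list(dict.fromkeys(X))

# Encode words as integer indices so the sequence can feed a discrete HMM.
word_to_index = {w: i for i, w in enumerate(W)}
X_indices = [word_to_index[w] for w in X]

print(len(X), len(W))
```

X keeps all 21 word occurrences, while W holds the 18 distinct words, which matches the example sets listed above.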

Label Each Word in Observation Sequence for HMM Learning
Each training SMS in the UCI repository dataset is labeled as either spam or ham, so the labeled SMS dataset can be described as a collection of SMS messages paired with their labels. Consider the example messages again. The SMS, What you thinked about me. First time you saw me in class., is labeled as ham, and the other one, Are you unique enough? Find out from 30th August. www.areyouunique.co.uk, is labeled as spam. Thus, the example dataset is represented as [What, you, thinked, me, First, time, you, saw, me, class] with the label ham and [Are, you, unique, enough, Find, 30th, August, www, areyouunique, co, uk] with the label spam.
However, each word in the observation sequence should be marked as ham or spam for the HMM learning. As the UCI repository dataset only has a label for each SMS, we use a compromise method to label all words in the observation sequence, i.e., each word is labeled with the label of the SMS that contains it. The labels for the observation sequence can be represented as: {ham, ham, ham, ..., ham, ..., ham, ham, ham, ..., ham, ..., spam, spam, spam, ..., spam}. For the instances above, as the first SMS is a ham one, each word in its refined sequence is labeled as ham. As the second SMS is a spam one, each word in its sequence is labeled as spam.
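The compromise labeling can be sketched as follows. The function name `label_words` and the input layout (a list of word-list and label pairs) are assumptions chosen for illustration.

```python
def label_words(messages):
    """messages: list of (word_list, sms_label) pairs.

    Returns the concatenated observation sequence and one label per word,
    where each word simply inherits the label of its SMS.
    """
    observations, word_labels = [], []
    for words, sms_label in messages:
        observations.extend(words)
        word_labels.extend([sms_label] * len(words))
    return observations, word_labels

messages = [
    (["What", "you", "thinked", "me", "First", "time", "you", "saw", "me", "class"], "ham"),
    (["Are", "you", "unique", "enough", "Find", "30th", "August",
      "www", "areyouunique", "co", "uk"], "spam"),
]
observations, word_labels = label_words(messages)
print(word_labels)
```

Note that the word you receives the label ham in the first message and spam in the second, which is exactly the imprecision that the compromise accepts.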

Observation Probability Distribution
We calculate the probability of each observation state, i.e., each word in W, appearing in the ham and spam SMS sets. The probability distribution is depicted in Figure 3. The higher the frequency of a certain word, the higher its probability in the distribution. As the word order in W is fixed in Figure 3, we can visually compare the probability of each word in the ham and spam sets. It is found that:

• Some words have a non-zero probability in only one set and a probability of zero in the other, which indicates that these words appear only in the spam message set or only in the ham message set;
• The probabilities of many words differ considerably between the two sets, from which it can be inferred that these words appear in both sets with very different word frequencies;
• Only a very small portion of the words appear in both sets evenly.
It is true that the BoW model also takes advantage of this information to design term-weighting algorithms. However, the TF algorithm does not work well in the SMS scenario, because terms rarely occur more than once in a message.
In this paper, we first calculate two observation state distributions, one in the spam subset and one in the ham subset. The two distributions are combined to form the initial value of the HMM observation probability distribution matrix, $B = (b_{ij})_{2 \times n}$.
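One plausible way to compute this initial matrix is to normalize per-class word counts, as in the sketch below. The helper `initial_emissions` is hypothetical, and the sketch assumes relative frequencies without smoothing, with row 0 for ham and row 1 for spam; the paper does not spell out these details.

```python
from collections import Counter

import numpy as np

def initial_emissions(ham_words, spam_words):
    """Return the vocabulary and a 2 x n matrix of per-class relative word frequencies."""
    vocab = list(dict.fromkeys(ham_words + spam_words))
    counts = [Counter(ham_words), Counter(spam_words)]
    # A Counter returns 0 for unseen words, so class-exclusive words get probability 0.
    B = np.array([[c[w] for w in vocab] for c in counts], dtype=float)
    B /= B.sum(axis=1, keepdims=True)  # normalize each row into a distribution
    return vocab, B

ham_words = ["What", "you", "thinked", "me", "First", "time", "you", "saw", "me", "class"]
spam_words = ["Are", "you", "unique", "enough", "Find", "30th", "August",
              "www", "areyouunique", "co", "uk"]
vocab, B0 = initial_emissions(ham_words, spam_words)
print(B0[:, vocab.index("you")])
```

In this toy example the word you has probability 2/10 in the ham row and 1/11 in the spam row, while a ham-only word such as What has probability zero in the spam row.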

HMM Learning
The Baum-Welch algorithm [15] is typically used for finding the HMM parameters λ = (π, A, B). That is, given an HMM with initial parameters $\lambda_0$ and the observation sequence X, the algorithm updates λ = (π, A, B) iteratively to find the parameters that (locally) maximize the likelihood of the observed data, i.e., $\arg\max_{\lambda} P(X \mid \lambda)$.
The initial parameters $\lambda_0 = (\pi_0, A_0, B_0)$ are set as follows:
$\pi_0 = [0.5, 0.5]$ indicates that the state can start from ham or spam with the same probability.
$A_0 = \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{bmatrix}$ means each state transition has the same probability.
$B_0 = (b_{ij})_{2 \times n}$ is the initial observation probability distribution calculated in Section 3.4.
The Baum-Welch algorithm starts with the initial parameters and then repeatedly takes two steps, the expectation step (E-step) and the maximization step (M-step), until convergence (i.e., the difference in log-likelihood is less than a small number d), as shown in Figure 4.
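To make the E-step and M-step concrete, here is a minimal NumPy sketch of Baum-Welch for a single sequence of word indices, using per-step scaling to avoid underflow. It is not the authors' implementation (the paper reports using the pomegranate package), it does not use the per-word labels except through the initial B, and the toy sequence, the function names, and the threshold value are assumptions. It also assumes every observed word has non-zero probability in at least one state.

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """Scaled forward-backward pass for one sequence of word indices."""
    T, K = len(obs), len(pi)
    alpha, beta, scale = np.zeros((T, K)), np.ones((T, K)), np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1]) / scale[t + 1]
    return alpha, beta, scale

def baum_welch(pi, A, B, obs, d=1e-6, max_iter=100):
    """Return re-estimated (pi, A, B) and the log-likelihood seen at each iteration."""
    obs = np.asarray(obs)
    history = []
    for _ in range(max_iter):
        # E-step: state posteriors (gamma) and transition posteriors (xi).
        alpha, beta, scale = forward_backward(pi, A, B, obs)
        history.append(float(np.log(scale).sum()))
        gamma = alpha * beta
        xi = (alpha[:-1, :, None] * A[None, :, :]
              * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / scale[1:, None, None]
        # M-step: re-estimate the parameters from the expected counts.
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        counts = np.zeros((B.shape[1], B.shape[0]))
        np.add.at(counts, obs, gamma)  # accumulate gamma per observed word
        B = counts.T / gamma.sum(axis=0)[:, None]
        # Stop once the log-likelihood gain drops below the small number d.
        if len(history) > 1 and history[-1] - history[-2] < d:
            break
    return pi, A, B, history

# Toy data (assumed): 5 distinct words, the first 4 positions from ham, the last 4 from spam.
obs = [0, 1, 2, 1, 3, 4, 1, 4]
pi0 = np.array([0.5, 0.5])
A0 = np.full((2, 2), 0.5)
B0 = np.array([[0.25, 0.5, 0.25, 0.0, 0.0],
               [0.0, 0.25, 0.0, 0.25, 0.5]])
pi_hat, A_hat, B_hat, history = baum_welch(pi0, A0, B0, obs)
print(np.round(A_hat, 3))
```

Because zero entries in the initial B stay zero under these updates, words seen only in one class remain tied to that state, so the initialization from class-wise frequencies strongly shapes the learned model.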

SMS Property Prediction
Finding the optimal hidden state sequence, given the observation sequence and the trained hidden Markov model, is a typical decoding problem in HMM. The Viterbi algorithm [15] is applied to find the most likely hidden state sequence for the word sequence of each testing SMS. Formally, we are given the testing observation sequence $\{o_t\}_{t=1}^{N}$ and the trained HMM with parameters λ = (π, A, B), and we seek the most likely state sequence, that is, $\arg\max_{q_1, \cdots, q_N} P(q_1, \cdots, q_N \mid o_1, \cdots, o_N, \lambda)$. Via the Viterbi decoding algorithm, the optimal hidden state sequence is produced for each word sequence of the testing SMS. The state sequence is a combination of ham and spam states. The prediction of the SMS property is based on the majority rule, i.e., an SMS will be labeled as ham if the optimal hidden state sequence has more hams than spams. Otherwise, the SMS will be labeled as spam.
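The decoding and majority-rule steps can be sketched as below. The Viterbi recursion is standard and computed in the log domain; the parameter values are made up for illustration (they are not trained values from the paper), and state 0 is assumed to be ham and state 1 spam.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state path for a sequence of word indices (log domain)."""
    with np.errstate(divide="ignore"):  # log(0) = -inf is intended
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    T, K = len(obs), len(pi)
    delta = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        # scores[i, j]: best log-probability of a path ending in state i, then moving to j.
        scores = delta[t - 1][:, None] + log_A
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

def classify(pi, A, B, obs, ham_state=0):
    """Majority rule: ham only if strictly more ham states than spam states."""
    path = viterbi(pi, A, B, obs)
    hams = sum(1 for s in path if s == ham_state)
    return "ham" if hams > len(path) - hams else "spam"

# Illustrative parameters (assumed): 2 states, 3 distinct words.
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]])
print(classify(pi, A, B, [0, 1, 0]), classify(pi, A, B, [2, 2, 1]))
```

A tie between the two state counts falls to spam here, following the rule as stated in the text.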

Data Preparation and HMM Learning
Step 1: The training SMS dataset is first split or segmented into word sequences that keep their original order. Then, stop words are removed from the sequences, and the remaining meaningful words form the observation sequence.
Step 2: The observation state probability distributions in the ham and spam datasets are obtained by statistical analysis.
Step 3: The compromise word label sequences are generated based on the labeled training SMS messages.
Step 4: The discrete HMM is initialized as $\lambda_0 = (\pi_0, A_0, B_0)$. The initial parameters are given in Section 3.5.
Step 5: The hidden state sequence and the observation sequence are fed to the discrete HMM. The discrete HMM is optimized by the Baum-Welch algorithm until convergence.
The training process workflow is shown in Figure 5.

SMS Classification
Given the trained HMM and an observation sequence, the classification process involves finding the optimal hidden state sequence and predicting the SMS property based on the majority rule. The SMS classification workflow is shown in Figure 6.
Step 1: As in the preprocessing of the training workflow, the SMS dataset for classification is first split or segmented into word sequences that keep their original word order. Then, stop words are removed from the sequences and the observation sequence is formed.
Step 2: Use the Viterbi decoding algorithm to find the optimal hidden state sequence for each SMS.
Step 3: Predict the SMS property based on the majority rule.

Dataset and Analysis
This research uses the widely adopted UCI repository dataset for performance evaluation. This unbalanced dataset contains a total of 5574 English SMS messages, of which 747 are spam and 4827 are ham, as shown in Table 1. These messages were collected from Grumbletext, a UK public forum (www.grumbletext.co.uk), the SMS corpus from the National University of Singapore, and Caroline Tagg's Ph.D. thesis [58]. The experimental code is developed with Python 3.7 and Python packages including pyhanlp, pomegranate, and collections. The code runs on a MacBook with an Intel Core i7-7820 CPU and 16 GB of memory.
We first split the SMS messages, extract words, and remove the stop words. This results in 9955 meaningful words in total, which form the set of observation states W. Then, we calculate the term frequency of each word in each SMS. The summary of the statistical results is shown in Table 2. We find that:

• 9272 words appear only once in a single SMS, accounting for 93.13%.
• The words that appear three times or more account for only 1.03% in total.

Evidently, most of the words occur only once in a single SMS. Therefore, feature extraction algorithms such as TF.IDF do not work well for SMS spam detection.
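A statistic of this kind can be gathered with `collections.Counter`. The sketch below takes one possible reading of the analysis, namely that each distinct word is grouped by the highest number of times it occurs within any single SMS; this interpretation and the helper name `max_tf_histogram` are assumptions, as the paper does not give the exact procedure behind Table 2.

```python
from collections import Counter

def max_tf_histogram(messages):
    """messages: list of word lists, one per SMS.

    For every distinct word, find its highest within-SMS term frequency,
    then count how many words fall at each frequency level.
    """
    max_tf = {}
    for words in messages:
        for word, count in Counter(words).items():
            max_tf[word] = max(max_tf.get(word, 0), count)
    return Counter(max_tf.values())

messages = [
    ["What", "you", "thinked", "me", "First", "time", "you", "saw", "me", "class"],
    ["Are", "you", "unique", "enough", "Find", "30th", "August",
     "www", "areyouunique", "co", "uk"],
]
histogram = max_tf_histogram(messages)
print(dict(histogram))
```

On the two example messages, 16 of the 18 distinct words never occur more than once in an SMS, which mirrors on a tiny scale the low term frequencies reported above.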

Evaluation Metrics
The well-known and widely used evaluation metrics for classification are precision (P), recall (R), F-measure (F1), and accuracy (A) [59]. The parameters used to calculate these metrics are shown in Table 3. Among these metrics, accuracy is the most important [28].
Precision (P) is the fraction of relevant instances among all retrieved instances.
Recall (R), also called sensitivity, is the fraction of all relevant instances that are actually retrieved.
F-measure (F1) is the harmonic mean of precision and recall, balancing the two.
Accuracy (A) is the fraction of SMS messages, both spam and ham, that are correctly predicted among all SMS messages.
Area under the curve (AUC) is also a well-known criterion for classification. It is the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate (TPR) against the false positive rate (FPR). The greater the AUC value, the better the model separates the two classes.
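The four confusion-matrix metrics follow the standard definitions and can be computed as in the sketch below. The counts used in the example are hypothetical and are not taken from the paper's tables.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)                    # correct among predicted positives
    recall = tp / (tp + fn)                       # correct among actual positives
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)    # correct among all messages
    return precision, recall, f1, accuracy

# Hypothetical counts, with spam treated as the positive class.
precision, recall, f1, accuracy = classification_metrics(tp=80, fp=20, fn=10, tn=90)
print(round(precision, 3), round(recall, 3), round(f1, 3), round(accuracy, 3))
```

Swapping the roles of the two classes gives the ham-side precision, recall, and F1, which is how per-class figures such as those reported later can be obtained from one confusion matrix.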

Result of the Discrete HMM on the UCI Repository Dataset
To evaluate the performance of the proposed HMM method, we split the UCI repository dataset into two datasets. One is the training dataset, containing 66% (about 2/3) of the spam and 66% of the ham SMS messages, and the other is the testing dataset, containing the remaining 34%, as shown in Table 4. We divide the dataset in this way because we want to compare the performance of the proposed HMM method with the best performance achieved by the CNN method in [11]. Roy et al. [11] used the same UCI dataset and divided it into 2/3 for training and 1/3 for testing in their experiment.
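A class-wise 66/34 split of this kind might be implemented as follows. The shuffling, the fixed seed, and the rounding rule are assumptions made for this sketch; the paper states only the ratios.

```python
import random

def stratified_split(messages, train_ratio=0.66, seed=0):
    """messages: list of (text, label) pairs. Splits each class separately."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted({lbl for _, lbl in messages}):
        group = [m for m in messages if m[1] == label]
        rng.shuffle(group)
        cut = round(train_ratio * len(group))  # size of this class's training share
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Hypothetical toy corpus: 100 ham and 50 spam messages.
corpus = [("ham message %d" % i, "ham") for i in range(100)]
corpus += [("spam message %d" % i, "spam") for i in range(50)]
train, test = stratified_split(corpus)
print(len(train), len(test))
```

Splitting per class keeps the spam-to-ham proportion of the unbalanced dataset the same in the training and testing parts.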
Following the procedures described in Section 3.7, we obtain the confusion matrix shown in Table 5. The evaluation results are shown in Table 6. To illustrate the performance of the proposed HMM method, we compare the results with those obtained by other machine learning models, including naïve Bayes (NB), support vector machine (SVM), non-negative matrix factorization (NMF), latent Dirichlet allocation (LDA), long short-term memory (LSTM), and convolutional neural network (CNN). The comparison is shown in Table 7. As pointed out before, all methods listed in Table 7 are evaluated on the same UCI dataset. The results of the NB, SVM, NMF, and LDA methods were presented by Nagwani and Sharaff [56]. The results of LSTM and CNN were presented by Roy et al. [11]. The highest accuracy is achieved by the CNN method. The proposed HMM method has slightly better accuracy than LSTM and is comparable to the CNN method.

Result of the Discrete HMM on Other SMS Dataset in Chinese
The HMM presented in this paper can be applied to other languages. To evaluate its performance in a different language, we apply the proposed HMM method to a Chinese SMS dataset containing 2000 SMS messages. The dataset is derived from the production environment of a partner SMS service company that provides SMS services in the East China area.
We choose 700 spam SMS and 700 ham SMS from the Chinese SMS dataset as the training set and the remaining 300 spam SMS and 300 ham SMS as the testing set. The statistics of the training and testing datasets are shown in Table 8. The resulting confusion matrix is shown in Table 9. The evaluation results of the proposed HMM are shown in Table 10.

UCI Repository Dataset Results
We use 2/3 of the dataset for training and 1/3 for testing, which is the same split ratio as in [11]. The model classified spam and ham SMS with an excellent accuracy of 0.959, which is better than those of NB, SVM, NMF, LDA, and LSTM. In addition, the model performs even better in ham classification, with precision 0.969, recall 0.983, and F1 0.976. Although the spam performance is slightly below expectation, with precision 0.892, recall 0.816, and F1 0.852, it is still better than that of LSTM.
The performance could be better if the HMM were trained on more data. English words have many different forms. Verbs have past tense, present tense, future tense, and third-person singular forms, and a similar situation applies to nouns and adjectives. Therefore, compared with the size of the English vocabulary, the total of 9955 words is not sufficient for the observation states in the experiment, and words in the training dataset are less likely to reappear in the testing dataset. As a result, the spam classification performance is affected, as the model does not know how to label words it was not trained on.
In addition, LSTM and CNN are very complex models that consume considerable computing resources. The HMM proposed in this paper is relatively simple and is fast in both training and prediction. Thus, our model is easy to implement in commercial applications and to apply to other languages.

Chinese SMS Dataset Results and Its Non-Language-Sensitivity
Based on the experimental results, the proposed HMM works better in classifying Chinese SMS messages. In particular, it achieved a remarkable accuracy of 98.5% in classifying spam and ham SMS. Compared to the experiment on English SMS messages, the division ratio of training and testing SMS is similar, but the results are clearly much better.
The reason lies in the English and Chinese languages themselves. In English, many words have different forms. As these word forms were not merged in this research, the different forms of a word are treated as different observation states. In contrast, Chinese words do not change form. Compared to English, the words in the Chinese SMS training dataset are more likely to reappear in the testing dataset. Therefore, the HMM introduced in this paper works better in Chinese SMS spam detection.
Furthermore, the LSTM and CNN models presented in [11] were evaluated on SMS written in English only. However, the HMM presented in this paper may be implemented widely in the future because it is not language-sensitive.

Conclusions and Future Work
This paper proposed a discrete hidden Markov model for SMS spam detection, and it is the first research to take advantage of word order information to detect spam SMS. Compared to other traditional and even novel machine learning models, the proposed HMM method achieved excellent results. In addition, HMM is a relatively simple machine learning model. It can be deployed in the spam filtering industry to meet the huge throughput requirement. This paper also resolved the issue that traditional feature extraction algorithms, such as TF.IDF, do not work well for SMS spam detection, due to the extremely low term frequency. The proposed HMM is not language-sensitive, which was validated on Chinese SMS spam detection. The overall performance of the proposed HMM is better on the Chinese dataset than on the English dataset.
The proposed HMM still has limitations. The accuracy depends highly on the size of the training set. The bigger the training set, the more likely SMS words will reoccur in the testing dataset and the better the achieved accuracy. In our future research, we will provide an improved HMM model to make it suitable to a small training set scenario. In addition, as there is no training dataset that has a label for each word of the SMS, each word was labeled based on the property of the SMS in this research. This compromised word labeling method also affected the spam classification accuracy. We will tackle this issue by applying artificial neural networks. Furthermore, the other HMM variants will be explored for SMS spam detection.
Author Contributions: T.X. and X.C. conceived the idea and developed the algorithm; T.X. developed the experimental software and performed the data analysis. T.X. prepared the manuscript; T.X. and X.C. wrote and made revisions to the paper. Both authors have read and agreed to the published version of the manuscript.
Funding: This work was partially supported by the Soft Engineering of Key Subjects Construction in Shanghai Polytechnic University, grant number xxkzd1604 and US National Science Foundation, grants number CNS-1801811.

Conflicts of Interest:
The authors declare no conflict of interest.