Machine learning modeling is the core of our classification approach. In this step, we chose to evaluate several machine learning algorithms for classifying text messages as spam or not-spam. We prepared two categories of algorithms: traditional machine learning algorithms and deep learning algorithms. The idea is to test these different algorithms and select the one that delivers the best performance. Based on the experimental results detailed in the next section, we concluded that a hybrid deep learning model combining two methods (CNN and LSTM) gave the best results compared to the others.
3.3.2. Deep Learning
In this paragraph, we describe the deep learning-based model for detecting spam in a collection of Arabic and English text messages. To implement this model, we tried three different deep learning architectures. First, we created an architecture using the CNN (Convolutional Neural Network) method. Second, we tried a model based on LSTM (Long Short-Term Memory). Finally, we implemented a hybrid model combining the two previous methods, CNN and LSTM. Based on the experimental results explained in the next section, the hybrid model gave us the best results in terms of performance. For this reason, we limit ourselves here to the description of the CNN-LSTM hybrid model.
In Figure 3 and Algorithm 1, we describe the architecture and operating principle of the CNN-LSTM model. The output of the Embedding layer, as described above, is connected to a one-dimensional Convolution (Conv1D) layer. Subsequently, a MaxPooling layer is applied to reduce the dimensionality of the CNN output. Next, we connect an LSTM layer. Finally, we apply a Dense output layer with a single neuron and a sigmoid activation function to make the decision between the two classes, spam and not-spam. In the following, we describe in detail the role and configuration of each layer.
Algorithm 1: Convolutional Neural Network (CNN)-Long Short-Term Memory (LSTM) model

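As an illustrative sketch, the layer stack described above could be expressed in Keras as follows. The vocabulary size, embedding dimension, filter count, and pooling size are assumptions made for this sketch; the layer order (Embedding, Conv1D, MaxPooling, LSTM, Dense with a single sigmoid neuron), the filter window k = 3, and the 64 LSTM units with dropout 0.2 come from the configuration described later in this section.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dense

VOCAB_SIZE = 10000  # assumed vocabulary size
EMBED_DIM = 100     # assumed embedding dimension

model = Sequential([
    # Word Embedding layer: maps token ids to dense d-dimensional vectors
    Embedding(VOCAB_SIZE, EMBED_DIM),
    # Conv1D with filter window k = 3 (as in the text); 64 filters assumed
    Conv1D(filters=64, kernel_size=3, activation='relu'),
    # MaxPooling keeps only the strongest activations
    MaxPooling1D(pool_size=2),
    # Single LSTM layer: 64 units, dropout 0.2 (as in the text)
    LSTM(64, dropout=0.2),
    # Dense output: one neuron + sigmoid for the binary spam/not-spam decision
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```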
The main interest of the Convolution layer is to extract relevant features from the text data. This is done by applying the convolution operation to the word vectors generated by the Word Embedding layer. The convolution task presented in this paragraph is based on the work in [19]. Let ${x}_{i}\in {\mathbb{R}}^{d}$ be the $d$-dimensional word vector corresponding to the $i$th word in the message, and let $x\in {\mathbb{R}}^{L\times d}$ denote the input message, where $L$ is the length of the message. For each position $j$ in the message, consider a window vector ${w}_{j}$ of $k$ consecutive word vectors, represented as:

$${w}_{j} = [{x}_{j}, {x}_{j+1}, \dots, {x}_{j+k-1}]$$
A convolution operation involves a filter $p\in {\mathbb{R}}^{k\times d}$, which is applied to each window vector to produce a new feature map $c\in {\mathbb{R}}^{L-k+1}$. Each feature element ${c}_{j}$ for window vector ${w}_{j}$ is calculated as follows:

$${c}_{j} = f({w}_{j}\odot p + b)$$
where ⊙ is the element-wise product, $b\in \mathbb{R}$ is a bias term, and $f$ is a nonlinear function. In our case, we used the Rectified Linear Unit (ReLU) as the nonlinear function. It is defined as:

$$\mathrm{ReLU}(x) = \max(0, x)$$

The ReLU activation function returns $x$ if the value is positive and 0 otherwise. For the configuration of the convolution layer, we used a one-dimensional convolution associated with a filter window of size $k = 3$. Algorithm 2 describes the detailed working of the CNN algorithm.
Algorithm 2: CNN 

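To make the convolution step concrete, the following NumPy sketch computes the feature map ${c}_{j} = f({w}_{j}\odot p + b)$ for a toy message. The message length L = 5, dimension d = 4, bias value, and random inputs are illustrative assumptions; the window size k = 3 matches the configuration above.

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(0, x)."""
    return np.maximum(0.0, x)

def conv1d_feature_map(x, p, b):
    """Slide a k x d filter p over the L x d message x:
    c_j = ReLU(sum(w_j * p) + b), giving L - k + 1 features."""
    L, _ = x.shape
    k = p.shape[0]
    return np.array([relu(np.sum(x[j:j + k] * p) + b) for j in range(L - k + 1)])

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))  # toy message: L = 5 words, d = 4 (assumed)
p = rng.normal(size=(3, 4))  # one filter with window k = 3, as in the text
c = conv1d_feature_map(x, p, b=0.1)
print(c.shape)  # (3,) = L - k + 1
```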
The feature maps generated by the convolution operation are characterized by a high-level vector representation. To reduce this representation, we added a MaxPooling layer after the Convolution layer, which keeps only the most important information by discarding weak activations. This is useful to avoid overfitting due to noisy text.
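A minimal sketch of the max-pooling idea on a one-dimensional feature map (the pool size of 2 and the sample values are assumptions; Keras' MaxPooling1D layer applies the same window-wise maximum):

```python
import numpy as np

def max_pool_1d(c, pool_size=2):
    """Keep only the strongest activation in each window of the feature map."""
    n = len(c) // pool_size * pool_size      # drop any trailing remainder
    return c[:n].reshape(-1, pool_size).max(axis=1)

c = np.array([0.2, 0.9, 0.0, 0.4, 0.7, 0.1])  # toy feature map
print(max_pool_1d(c))  # [0.9 0.4 0.7]
```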
CNN is very useful for extracting relevant features from text data. However, it is unable to correlate the current information with past information. This can be done with another deep learning method: LSTM.
LSTM (Long Short-Term Memory) is a kind of RNN architecture capable of learning long-term dependencies. The architecture of LSTM contains a range of repeated units, one for each time step. Each unit, at time step $t$, is composed of a cell ${c}_{t}$ (the memory part of LSTM) and three gates that regulate the flow of information inside the LSTM unit: an input gate ${i}_{t}$, an output gate ${o}_{t}$, and a forget gate ${f}_{t}$. These gates collectively decide how to update the current memory cell ${c}_{t}$ and the current hidden state ${h}_{t}$. The transition functions between the LSTM units are defined as follows [19]:

$${i}_{t} = \sigma ({W}_{i}\cdot [{h}_{t-1},{x}_{t}] + {b}_{i})$$
$${f}_{t} = \sigma ({W}_{f}\cdot [{h}_{t-1},{x}_{t}] + {b}_{f})$$
$${o}_{t} = \sigma ({W}_{o}\cdot [{h}_{t-1},{x}_{t}] + {b}_{o})$$
$${\tilde{c}}_{t} = \tanh ({W}_{c}\cdot [{h}_{t-1},{x}_{t}] + {b}_{c})$$
$${c}_{t} = {f}_{t}\odot {c}_{t-1} + {i}_{t}\odot {\tilde{c}}_{t}$$
$${h}_{t} = {o}_{t}\odot \tanh ({c}_{t})$$
Here, ${x}_{t}$ is the input vector of the LSTM unit, $\sigma$ is the sigmoid function, tanh denotes the hyperbolic tangent function, the operator ⊙ denotes the element-wise product, and $W$ and $b$ are the weight matrices and bias vector parameters learned during training. In the architecture of our model, we used a single LSTM layer placed directly after the MaxPooling layer. This layer contains 64 LSTM units and uses a Dropout rate of 0.2 as a regularization parameter to prevent the model from overfitting. Algorithm 3 describes the detailed working of the LSTM algorithm.
Algorithm 3: LSTM 

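The gate equations above can be sketched as a single LSTM time step in NumPy. The toy input and hidden sizes, the random weights, and the zero initial states are assumptions for illustration; the gate computations follow the standard transition functions described in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step: compute the i, f, o gates and candidate
    memory from [h_{t-1}, x_t], then update the cell and hidden state."""
    z = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W['i'] @ z + b['i'])    # input gate
    f_t = sigmoid(W['f'] @ z + b['f'])    # forget gate
    o_t = sigmoid(W['o'] @ z + b['o'])    # output gate
    c_hat = np.tanh(W['c'] @ z + b['c'])  # candidate memory
    c_t = f_t * c_prev + i_t * c_hat      # update memory cell
    h_t = o_t * np.tanh(c_t)              # update hidden state
    return h_t, c_t

rng = np.random.default_rng(1)
d_in, d_h = 4, 3  # toy input and hidden sizes (assumed)
W = {g: rng.normal(size=(d_h, d_h + d_in)) for g in 'ifoc'}
b = {g: np.zeros(d_h) for g in 'ifoc'}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, b)
print(h.shape, c.shape)  # (3,) (3,)
```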
Dense is the last layer of our model. It is also called the fully connected layer, and it is used to classify text messages according to the output of the LSTM layer. Since our classification model is binary, we used a Dense layer with a single neuron and a sigmoid activation function to give predictions of 0 or 1 for the two classes (not-spam and spam). The sigmoid function is a logistic function that returns a value between 0 and 1, as defined by the formula in Equation (6):

$$\sigma (x) = \frac{1}{1+{e}^{-x}}$$
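A small sketch of the sigmoid decision at the output neuron. The 0.5 decision threshold and the sample scores are assumptions for illustration; the text only states that the sigmoid output lies between 0 and 1.

```python
import math

def sigmoid(x):
    """Logistic function: maps any real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Threshold the probability at 0.5 (assumed) for the binary decision
for score in (-2.0, 0.0, 2.0):
    p = sigmoid(score)
    label = 'spam' if p >= 0.5 else 'not-spam'
    print(f"{score:+.1f} -> {p:.3f} ({label})")
```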