Hyperparameter Tuning for Machine Learning Algorithms Used for Arabic Sentiment Analysis

: Machine learning models are used today to solve problems within a broad span of disciplines. If the proper hyperparameter tuning of a machine learning classiﬁer is performed, signiﬁcantly higher accuracy can be obtained. In this paper, a comprehensive comparative analysis of various hyperparameter tuning techniques is performed; these are Grid Search, Random Search, Bayesian Optimization, Particle Swarm Optimization (PSO), and Genetic Algorithm (GA). They are used to optimize the accuracy of six machine learning algorithms, namely, Logistic Regression (LR), Ridge Classiﬁer (RC), Support Vector Machine Classiﬁer (SVC), Decision Tree (DT), Random Forest (RF), and Naive Bayes (NB) classiﬁers. To test the performance of each hyperparameter tuning technique, the machine learning models are used to solve an Arabic sentiment classiﬁcation problem. Sentiment analysis is the process of detecting whether a text carries a positive, negative, or neutral sentiment. However, extracting such sentiment from a complex derivational morphology language such as Arabic has been always very challenging. The performance of all classiﬁers is tested using our constructed dataset both before and after the hyperparameter tuning process. A detailed analysis is described, along with the strengths and limitations of each hyperparameter tuning technique. The results show that the highest accuracy was given by SVC both before and after the hyperparameter tuning process, with a score of 95.6208 obtained when using Bayesian Optimization.


Introduction
Social media has attracted billions of people to interact with each other. Today, many people use such media on a daily basis, not only as a platform to socialize with one another, but also to share opinions, commentary, and experiences. The significant impact of social media has been addressed in abundant research literature [1][2][3]. The importance of sentiment analysis comes from its ability to find conclusions and draw indirect inferences from a huge amount of given data. Analyzing a text written in a language with a complex morphology, such as the Arabic language, has always been a multi-level challenging process. Machine learning has become a prominent part of such a process. Today, machine learning is everywhere; we encounter many machine learning applications in our daily lives, from checking the weather for the next week or transferring our handwritten notes to a typed document, to adjusting the car seat to our preferred position. Machine learning algorithms have gained popularity in such a way that almost every scientific field now uses them. They have become integrated into numerous scientific disciplines, such as networking [4], transportation [5], text analysis [6,7], and bioinformatics [8]. This broad range of machine learning disciplines is due to their promising results and predictive performance in solving classification problems.
Machine learning algorithms automatically learn and hence adjust their internal parameters based on data. These types of parameters are called "model parameters", or simply "parameters" for short. However, there are other parameters that are not adjusted during the learning process but, rather, have to be pre-configured before the learning process starts. Such parameters are often referred to as "hyperparameters". The model parameters indicate how the input data are converted into the desired output, while the hyperparameters indicate how the model is structured. The performance of a machine learning model can drastically change depending on the choice and the values of its hyperparameters.
For example, the decision tree algorithm has a "tree_depth" hyperparameter; setting a moderate value for this hyperparameter can obtain good results, while a high value can lower the algorithm's performance. For this reason, hyperparameters should be set very carefully. Hyperparameters could be set using a number of methods for a specific dataset. One is to set them manually and calculate the accuracy accordingly. Then, other values of the hyperparameters can be tested and, with every adjustment, the corresponding accuracy can be calculated. Manually setting the values for the hyperparameters in such a trial-and-error fashion is a cumbersome and time-consuming process. Another method to find an appropriate hyperparameter configuration is to safely choose the default values of hyperparameters that are recommended by the software packages used in the implementation, which are in turn based on recommendations from the literature, as well as experience. Sometimes, the default values work well for a certain dataset, but this does not always mean that they give the best accuracy.
Alternatively, we can use hyperparameter optimization strategies. These strategies are data-dependent optimization algorithms, which try to minimize the expected generalization error of a machine learning model over the hyperparameter search space of considered candidate configurations. The ML algorithms have been analyzed first using the default hyperparameters values to be compared afterwards with the results obtained when using hyperparameter tuning algorithms. Most of the manuscripts in literature usually test the effect of using one hyperparameter tuning approach on the accuracy of one or more machine language techniques when solving a certain classification problem. The nature of the problem plays an important role in both the classification accuracy of the machine learning technique and the hyperparameters combination that provide the best classification accuracy.
The novelty of the manuscript is in the number of hyperparameter tuning algorithms being compared, the number of machine learning techniques being tested, and, the most important, the nature of the classification problem which is Arabic sentiment analysis. A number of manuscripts have used machine learning approaches to solve Arabic sentiment analysis, but none (as far as we know) have used hyperparameter tuning algorithms to find the best hyperparameters that will lead to the best classification accuracy for the machine learning algorithm used. It should be noted that this set of hyperparameters is not the same for every classification problem and differs according to the nature of the problem.
In this paper, five search strategies are used; these are Grid Search, Random Search, Bayesian Optimization, Particle Swarm Optimization (POS), and Genetic Algorithm (GA). They are used to hyperparameter-tune six machine learning algorithms, namely Logistic Regression (LR), Ridge Classifier (RC), Support Vector Machine Classifier (SVC), Decision Tree (DT), Random Forest (RF), and Naive Bayes (NB) classifiers. These algorithms are used in Arabic sentiment analysis in order to determine the tonality of Arabic reviews using our dataset, which contains 7000 reviews written in different forms of the Arabic language. The accuracy of the algorithms mentioned is calculated when the default values of the hyperparameters are used, and then calculated again after each of the hyperparameter tuning strategies are used. A before-and-after comparison is given. The contributions of our work can be summarized as follows: • The paper reviews common hyperparameter tuning techniques, their benefits, and their drawbacks.
• A comprehensive comparative analysis among five hyperparameter tuning algorithms is given, as most of the previous and current literature typically focuses only on Grid Search and Random Search. • Hyperparameter tuning for six machine learning models is performed to analyze sentiments over an Arabic text. • The Arabic language is a challenging language; this paper is considered the first hyperparameter tuning study performed on an Arabic text.
The paper is organized as follows. Related work is given in Section 2. The five hyperparameter optimization models are presented in Section 3. Section 4 describes our Arabic sentiment analysis scheme. The experimental results and conclusions are given in Sections 5 and 6, respectively.

Related Work
In this paper, we present the use of hyperparameter tuning of machine learning algorithms to tackle the sentiment analysis problem of Arabic reviews. It is important to shed light on previous related work on both hyperparameter tuning and sentiment analysis.

Hyperparameter Tunning
Most of the previous work in hyperparameter tuning tends to focus only on Grid Search and Random Search, or a comparison between them [9][10][11][12][13]. A comprehensive study on tunability is given in [14]; the authors constructed the tuning problem in statistical terms as well as suggesting tunability quantifying measures of algorithms' hyperparameters. They presented a comprehensive study based on 38 datasets from the OpenML platform. In [15], the authors showed that the performance of SVM can be improved significantly using parameter optimization. They applied two methods which are Grid Search and Genetic Algorithm. Based on the average running time on different datasets, GA was almost 16 times faster than Grid Search. Another comparison between Grid search and Genetic Algorithms was introduced in [16]. The result of the implementation showed that Genetic Algorithm can find the hyperparameters with faster computational time than Grid search. The authors tested their comparison against Support Vector Machine, Random forest, Adaptive Boosting,and K Nearest Neighbour. In [17], the authors proposed the use of an evolutionary algorithm named SHADE to optimize the configuration of a deep learning model for the sentiment analysis of Spanish tweets. Their results showed that the hyperparameters found by the evolutionary algorithm enhanced the performance of the deep learning method. A Word2Vec model based on a convolutional neural network was constructed in [18]. The authors used a dataset collected from different newspapers over a number of Arabic countries. Our work differs from the work listed here in the sense that our main concern is increasing the accuracy of a machine learning algorithm through selection of its internal hyperparameters. An ensemble of surface and deep features for Arabic sentiment analysis was proposed in [19], and the model was evaluated on three datasets of Arabic tweets. The authors concluded that the Word2Vec method gives much better results than sentiment-specific embeddings.

Sentiment Analysis
Sentiment analysis has been a very rich topic for research. However, very few works have been conducted involving sentiment analysis of an Arabic text. In this section, we shed light on the relevant literature. In [20], the authors used three machine learning approaches, Naïve Bayes, Support Vector Machine, and K-Nearest Neighbor classifiers, to classify their in-house dataset of tweets. Their study found that SVM gives the highest precision, while K-Nearest Neighbor gives the best recall. Meanwhile, in [21], the authors studied the effect of preprocessing techniques on the classification behavior of machine learning algorithms. In [22], a sentiment analysis study on financial context news, gathered from Lithuanian language websites, has been introduced. Three machine learning algorithms have been applied (SVM, Naive Bayes, and long short term memory). The Naive Bayes algorithm has shown the best accuracy. A hybrid framework of machine learning and deep learning is proposed in [23], which combines Convolutional Neural Network and Random Forest classifier for sentiment analysis. The experimental result showed that the proposed model gives much better accuracy values than the existing base models.
The authors in [24] used two datasets, one with binary labels and the other with multiclass labels, to explore various natural language processing (NLP) methods to perform sentiment analysis. For the binary classification they applied the bag of words, and skipgram Word2Vec models followed by Random Forest, SVM, or logistic regression classifiers. For the multi-class case, they implemented the recursive neural tensor networks (RNTN). A comprehensive comparative analysis of various machine learning classifiers is provided in [6], and a review of sentiment analysis in the Arabic Language is given in [25]. A rich study of Algerian newspaper comments is presented in [26], in which the authors created their own corpus and used Support Vector Machine and Naïve Bayes to classify Arabic comments into positive and negative sentiments.
As mentioned above, very limited research has been conducted in sentiment analysis involving the Arabic language, and there is almost no research concerning hyperparameter optimization for sentiment analysis on Arabic texts. To the best of our knowledge, this paper is considered the first study to tune machine learning hyperparameters used in the sentiment analysis of an Arabic text.

Hyperparameter Tuning
Speaking in statistical terms, hyperparameter tuning captures a snapshot of the current performance of a model, and compares this snapshot with others taken previously. In any machine learning algorithm, hyperparameters need to be initialized before a model starts the training. Fine-tuning the model hyperparameters maximizes the performance of the model on a validation set. In a machine learning context, a hyperparameter is a parameter whose value is set before initiating the learning process. On the other hand, the values of model parameters are derived via training the data. Model parameters refer to the weights and coefficients, which are derived from the data by the algorithm. Every algorithm has a defined set of hyperparameters; for example, for a Decision Tree, this is a depth parameter.
Before explaining our hyperparameter tuning approach, it is important to explain a process called "cross-validation", as it is considered an important step in the hyperparameter tuning process. Cross-validation (CV) is a statistical method used to estimate the accuracy of machine learning models. Once the model is trained, we cannot be certain of how well it will work on data that have not been encountered before. Assurance is needed regarding the accuracy of the prediction performance of the model. To evaluate the performance of a machine learning model, some unseen data are needed for the test. Based on the model's performance on unseen data, we can determine whether the model is underfitting, overfitting, or well-generalized. Cross-validation is considered a very helpful technique to test how effective a machine learning model is when the data in hand are limited. To perform cross-validation, a subset of the data should be set aside for testing and validating; this subset will not be used to train the model, but rather saved for later use. K-Fold is one of the most common techniques of cross-validation, and it is also the cross-validation technique that we used to validate our model (see Figure 1).
In K-Fold cross-validation, the parameter K indicates the number of folds or sections that a given dataset is split into. One of the folds is retained as a validating set and the machine learning model is trained using the remaining K-1 folds. Each fold of the K-Folds is used as a validating set at some point, with K scores (accuracy) given as a result. Finally, we average the model against each of the folds to obtain a final score for the model, as shown in Figure 1. The importance of hyperparameters lies in their ability to directly control the behavior of the training algorithm. Choosing appropriate hyperparameters plays a a very important role in the performance of the model being trained. It is important to have three sets into which the data are divided, i.e., a training, testing, and validation set, whenever the default parameter is altered in order to obtain the necessary accuracy, so as to prevent data leaks. Thus, hyperparameter tuning can be simply defined as the process of finding the best hyperparameter values of a learning algorithm that produce the best model (see Figure 2).
The approach to finding the best hyperparameter values with respect to a specific dataset has traditionally been performed manually. To set these values, researchers often depend on their past experience of training these machine learning algorithms in solving similar problems. The problem is that the best setting for hyperparameters used to solve one problem may not be the best settings for another one, as these values will change with different datasets. Hence, it is difficult to define the hyperparameter values based on previous experience. A more automated, guided method is needed to explore alternative configurations of the machine learning model under study. Common hyperparameter optimization algorithms include Grid Search, Random Search, and Bayesian Optimization.

Grid Search
The most intuitive traditional approach for performing hyperparameter optimization is perhaps Grid Search [9]. It generates a Cartesian product of all possible combinations of hyperparameters. Grid Search trains the machine learning algorithm for all combinations of hyperparameters; this process should be guided by a performance metric, typically measured using the "cross-validation" technique on the training set. This validation technique ensures that the trained model obtains most of the patterns from the dataset. Grid Search is obviously the most straightforward hyperparameter tuning method. With this technique, we simply build a grid with each possible combination of all the hyperparameter values provided, calculating the score of each model, in order to evaluate it, and then selecting the model that gives the best results. To perform Grid Search, one selects a finite set of reasonable values for each hyperparameter; the Grid Search algorithm then trains the model with each combination of the hyperparameters in the Cartesian product. The performance of each combination is evaluated on a held-out validation set or through internal cross-validation on the training set. Finally, the Grid Search algorithm outputs the settings that achieve the highest performance in the validation procedure. The best set of hyperparameter values chosen in the Grid Search is then used in the actual model. Grid Search guarantees the detection of the best hyperparameters. However, one of its drawbacks is that it suffers severely when it comes to rapid convergence and dimensionality. If k parameters with n distinct values are tested, the complexity of Grid Search is expected to grow exponentially at a rate of O(n k ) [27].

Random Search
As seen in the previous section, Grid Search is an exhaustive search for selecting a model. In Grid Search, we set up a grid of hyperparameter values, train a model, and calculate scores for the testing data for each combination, which can be very inefficient. Thus, searching 10 different hyperparameter values for each of five parameters, for example, will require 100,000 trials. If 10-fold cross-validation is used, this requires 1,000,000 model fits and 1,000,000 predictions, which would be costly in terms of both computing power and time. In contrast, Random Search [10] samples the search space and evaluates sets from a specified probability distribution. In brief, it is a technique in which random combinations of the hyperparameters are used to find the best solution for the model under consideration. For example, instead of rotating through all 100,000 samples, only 1000 random samples of hyperparameter sets are checked. The number of evaluations in Random Search must be set in the beginning, before the hyperparameter optimization process starts, and hence, the complexity of Random Search running n evaluations is O(n) [28]. A significant drawback of the Random Search algorithm is that it does not use information from prior trials to select the next set and it also does not use a strategy to predict the next trial.

Bayesian Optimization
A more appealing approach to fine-tune hyperparameters through automated model tuning is the Bayesian Optimization algorithm. Bayesian Optimization is an informed search algorithm, which means that each iteration of this algorithm learns from the previous one, and the results of one iteration help in creating the next one. Bayesian Optimization resembles the Random Search method in the sense that it samples a subset of hyperparameter combinations; however, they differ in the way in which each combination is chosen. Bayesian Optimization [29] views the hyperparameter tuning process as the optimization of a black-box function. The function to be optimized is the model's final prediction score (i.e., accuracy) on a held-out test set. Any global optimization framework can then be applied to minimize this function. The model used to approximate the objective function is called a surrogate model. A number of surrogate models for Bayesian Optimization are in use, but perhaps the most common one is the Gaussian Process (GP). This is also the one used in our study. First, a few hyperparameter combinations are randomly chosen and tested. These randomly selected hyperparameters' values are then used to produce the first model of the objective function. Once a first draft of the model is obtained, there are two methods to choose the next combination: one can either choose a value close to the highest obtained value, where performance is likely to increase, or one can explore another subset of the hyperparameter search space, where another maximum could be found. Applying Bayesian Optimization with the Gaussian Process fitness function over a dataset of size n has a time complexity of O(n 3 ) and space complexity of O(n 2 ) [30]. The advantage of Bayesian Optimization is that it does not sample each combination in the search space, as Grid Search does, and, at the same time, it proceeds in a more systematic manner than Random Search.

Genetic Algorithm
Metaheuristic algorithms are mainly those approaches inspired by biological theories; they are known for their ability to solve non-continuous and non-convex optimization problems. One common example of a metaheuristic algorithm is Genetic Algorithm (GA). Metaheuristic algorithms generally begin by creating a population that represents a generation at each iteration. Each generation consists of a number of candidate solutions called individuals, which in turn have a number of properties represented by chromosomes. Each candidate solution (individual) in every generation is then evaluated until a global optimum is found [31]. Metaheuristic algorithms differ in the methods used to generate and select populations.
Genetic Algorithm (GA) [32] is based on the evolutionary theory that individuals with the best capability to adapt to environmental changes are more likely to survive and hence pass on their capabilities to future generations. As the next generation inherits the characteristics of the parents, they may produce individuals with better or worse versions of the parents. Better individuals will be more likely to survive and consequently have better offspring, while the worse individuals will gradually disappear. After several generations, the individual carrying the best chromosomes (properties) will be identified as the global optimum [33].
When applying GA to the hyperparameter optimization problem, each hyperparameter is represented by a chromosome, and the value of the hyperparameter is set to the decimal value of the representative chromosome. Every chromosome has several genes represented in binary digit format. Chromosome selection, crossover, and mutation operations are then performed on the genes of this chromosome to identify the optimal parameters. Chromosomes with good fitness function values will be selected and passed to the next generation with higher probability. In the next generation, these chromosomes generate new ones carrying the best characteristics of their parents. Crossover is used to generate new chromosomes by swapping proportions of the genes of different chromosomes, which represent the solution, in order to obtain the mixing of solutions in the search space [34]. Mutation is another kind of operation that can also be used to generate new chromosomes by randomly changing one or more genes of a chromosome [35]. Crossover and mutation operations ensure the diversity of the later generations, enable them to possess different characteristics, and reduce the chance of missing good ones [36].
Setting up the initialization step with an appropriate estimate is always considered an important step in any optimization problem, as it may speed up the convergence of the algorithm. Even if the initial values will iteratively be enhanced during the convergence process, it is always wise to choose initial values that bring the convergence closer to the optimum point without being trapped in unpromising local regions within the search space.
One of the advantages of Genetic Algorithm is that its initial values are randomly selected, making it very easy to implement without requiring excessive effort in choosing good initial values. This is because the mutation, crossover, and selection operations ensure that the the global optimum is not missed. However, being a sequential execution algorithm, the possibility to parallelize is lower. The time complexity of GA is O(n 2 ) [37], which is considered a low convergence speed.

Particle Swarm Optimization
Another powerful metaheuristic example is Particle Swarm Optimization (PSO) [38,39], which is commonly used to solve optimization problems. PSO solves a problem by trying to optimize a solution in an iterative way with respect to some measure of quality by enabling a group of particles (swarm) to scan the search space in a semi-random manner [40]. PSO algorithms find the optimal solution through information sharing and cooperation among the particles in a group. In PSO, a swarm has a group of particles and each particle is represented by a vector containing the current position, the current velocity, and the best known position of the particle so far. After initializing the position and velocity of each particle, the current position and performance score of every particle are calculated. In the next iteration, the velocity of each particle is changed based on the calculated information from previous iterations, namely the position and the current global optimal position. The particles then move based on their new velocity vectors [41]. These steps are repeated until some convergence or termination criteria are reached.
PSO is easier to implement than GA, since PSO does not include crossover and mutation operations. In GA, all chromosomes share information with each other, so the entire population moves uniformly toward the optimal region, while, in PSO, only the individual best particle and the global best particle are transmitted to others. The computational complexity of the PSO algorithm is O(nlogn) [42]. Generally, PSO has faster convergence than GA. Moreover, PSO is easy to parallelize since particles operate independently and only need to share information with each other after each iteration [36].
Unlike GA, the population in the PSO algorithm needs proper initialization; otherwise, it might only find a local optimum and lose the global optimum, especially for discrete hyperparameters. Many population initialization approaches have been developed to improve the performance and convergence of PSO [43]. However, using one of the popula-tion initialization approaches will increase the complexity and the execution time of PSO. Table 1 provides a complexity, parallelization, and initialization comparison among the different techniques. Table 1. Comparison of hyperparameter tuning algorithms (where k is the number of hyperparameters and n is the number of values for hyperparameters).

HP Approach Complexity Enable Parallelization Easy Initialization
Grid

Proposed Architecture for Arabic Sentiment Analysis
We used our own constructed dataset for hotel reviews. To better analyze the polarity of these reviews as being positive or negative, useless content was removed. A brief discussion is given in the following subsections to explain the steps undertaken by our system in order to be fully constructed. Below is a description of each step.

Data Collection
We used the Booking.com website to collect hotel reviews. A total of 3500 positive and 3500 negative hotel reviews were collected, generating a total of 7000 reviews. The number of words in the dataset is 95,906 words, with an avarage of 13 words per review. The corpus was named the Reviews Sentiment Analysis Corpus (RSAC), and it is available at the link provided in [44]. The reviews in RSAC are written in both standard Arabic and dialectical Arabic, which contains non-standard Arabic vocabulary, such as onomatopoeia and mimetic words. Figure 3 shows sample records from our constructed dataset (RSAC), and an English translation of the dataset is provided in Figure 4.

Cleaning and Annotation
In this step, RSAC was cleaned by removing whitespace and punctuation, correcting spelling mistakes, and deleting replicate instances. Each review instance in RSAC was manually annotated as being either positive or negative.

Preprocessing
In the preprocessing stage, data were prepared for a text mining process. Preprocessing subtasks were applied to the cleaned reviews, as discussed below.
• Normalization: Normalization converts all possible conjugations of a specific word to its standard form. Our normalizer has been fed with a predefined set of rules to provide a standardized form of the collected Arabic reviews. For example, the normalizer removes unnecessary characters such as punctuation, numbers, non-Arabic characters, and special characters. It also removes repeated letters, usually used to express certain impressions, such as exaggeration or affirmation, and replaces this repetition with a single occurrence of the character. • Stop Word Removal: Stop words are frequently used words that appear in a text but are not semantically related to the context in which they exist. These words can be removed from the text without affecting the classification task's performance. • Stemming: Stemming is the process of returning a word to its stem. Stemming trims words by reducing all forms of the word (i.e., adjectives, adverbs, etc.) to its origin base. Many Arabic words have the same stem. Hence, all such words are replaced with one word. This technique has a large impact on improving the efficiency of text categorization.

Feature Extraction
At this stage, the text is ready for processing, which starts with the feature extraction step. During this step, the annotated data are converted into a feature vector, which in turn is fed into the machine learning classifiers for training. A specific combination of features leads to a specific class. Each piece of text is converted into a feature with a certain weight. The weight of the word (feature) changes as the document changes because it is calculated with respect to the document containing the word under consideration. In our model, Term Frequency-Inverse Document Frequency, which is known as TF-IDF and considered to be one of the most popular term-weighting schemes, was used to construct the feature vectors. Other examples of term-weighting schemes include Boolean and Term Frequency (TF). TF-IDF measures the importance of a word according to how many times this word appears in a document. Hence, TF-IDF is defined as the frequency with which a word appears in a text fragment divided by the frequency of the word within the whole dataset.

Training and Testing Classifiers
This step in the sentiment analysis process depends on the approach used for classification. Six supervised ML algorithms are used; they are Logistic Regression (LR) [45], Ridge Classifier (RC) [46], Support Vector Machine (SVC) [47], Decision Tree (DT) [48], Random Forest (RF), and Naive Bayes (NB) classifiers. During this step, the machine learning classifier was trained to classify each Arabic review in the dataset as carrying either a positive tone or a negative tone. During the training stage, the classifier algorithm learns from the labelled data, which were hotel reviews in our case. Consequently, during the testing stage, the classifier needed to have the ability to classify new, non-labelled hotel reviews. Each classifier was then evaluated by measuring the accuracy.

Using Default Hyperparameters
First, the accuracy of each machine learning model in classifying the Arabic reviews was calculated using the default values of the hyperparameters. We used the default values specified by the Python scikit-learn library package [49]. Figure 5 presents a box plot of the accuracy for all classifiers, while Figure 6 contains a tabular representation of the accuracy of each classifier. Figure 7 shows how each classifier was calibrated during the global optimum search. Figure 8 gives the confusion matrix of each classifier for better performance evaluation. The figures show that the Support Vector Classifier (SVC) outperformed the other classifiers.

Using Hyperparameter Tuning Techniques
This section presents the results after the Grid Search, Random Search, Bayesian Optimization, Particle Swarm Optimization (PSO), and Genetic Algorithm (GA) strategies were used to tune the hyperparameters of each machine learning model, with the accuracy was calculated accordingly. Our system was implemented using Python running on a MacBook Air platform.
For Ridge Classifier (RC), Table 2 shows the five hyperparameter tuning techniques, the corresponding score (accuracy), and the set of the best hyperparameters that represent the best model configuration obtained by each technique and hence with the best accuracy. Figure 9 depicts the score corresponding to each hyperparameter tuning strategy, as well as the accuracy obtained using the default values for the hyperparameters. Figure 10 shows the score for each parameter value in the search space for Ridge Classifier (RC).
It should be noted that, sometimes, default values work very well for some machine learning models and the specified dataset; however, this is not always the case when changing the model and/or the dataset used. For example, in terms of our dataset, setting the default value for the 'var_smoothing' hyperparameter in the Naïve Bayes model, which was 1 × 10 −9 , gave an accuracy value of 70.5013; however, a higher accuracy value of 71.4402 was obtained when setting this hyperparameter to 1 × 10 −9 . Similarly, setting the 'max_feature' hyperparameter to 'log2' in the Random Forest model yielded higher accuracy than setting it to its default value, which was 'auto'.
Another issue that should be noted is that some values of the hyperparameters should not be used together. For example, in the Logistic Regression classifier, the solver hyperparameter value 'sag' supports only 'l2' or 'no' penalties. Therefore, one should pay careful attention when setting the values. Of course, one can anticipate that Grid Search requires more iterations than the other two search algorithms as it rotates over every hyperparameter set. In the Ridge Classifier, for example, Grid Search tested 561 hyperparameter combination sets to find the best set, while, in both Random Search and Bayesian Optimization, we set the number of iterations to 100.
Similarly, Table 3 and Figures 11 and 12 show the results for Decision Tree (DT). Table 4 and Figures 13 and 14 show the results for Random Forest (RF). Table 5 and Figures 15 and 16 show the results for the Support Vector Classifier (SVC). Table 6 and Figures 17 and 18 show the results for Logistic Regression (LR). Finally, Table 7 and and Figure 19 show the results for Naive Bayes (NB).                The RC algorithm clearly showed the highest accuracy obtained by Grid Search. The Grid Search, Random Search, and Bayesian Optimization hyperparameter tuning strategies obtained almost the same accuracy for LG, SVC, RF, and NB. Meanwhile, Bayesian Optimization showed noticeable superiority over the other hyperparameter tuning strategies for the DT algorithm. However, PSO and GA showed significant improvements for the NB accuracy. It is shown in the tables and figures that SVC gave the best accuracy both before and after hyperparameter tuning. Bayesian Optimization yielded slightly higher accuracy than Grid Search and Random Search when tuning SVC.

Conclusions
Sentiment analysis has become a very useful method to understand how a population feels about a given product, service, place, or even a public issue. In general, a machine learning approach is used when a certain text needs to be sentimentally analyzed. The classification accuracy of each machine learning model can be further optimized through hyperparameter tuning to achieve better accuracy than that obtained when using the model's default values of the hyperparameters. In this paper, five hyperparameter tuning approaches are presented: Grid Search, Random Search, Bayesian Optimization, Particle Swarm Optimization, and Genetic Algorithm. These approaches are used to perform the hyperparameter tuning of six machine learning algorithms in order to sentimentally analyze Arabic reviews. This paper is considered the first machine learning hyperparameter tuning study on an Arabic text. Our results show that the Support Vector Classifier offers the best accuracy both before and after hyperparameter tuning, with the highest score of 95.6208 obtained when using Bayesian Optimization. Meanwhile, Particle Swarm Optimization (PSO) and Genetic Algorithm (GA) dramatically enhanced the score of the Naive Bayes classifier.