Improvement of Tourists Satisfaction According to Their Non-Verbal Preferences Using Computational Intelligence

Abstract: In the tourism industry, the information obtained from customers is often varied, dispersed, and voluminous. In this context, the automatic analysis of information has been proposed through electronic customer relationship management (eCRM), which refers to marketing activities, tools and techniques delivered through electronic channels for the specific purpose of locating, building and improving long-term relationships with customers, to enhance their individual potential. In this paper, we refer to the analysis of information in three aspects: customer satisfaction, the study of customer behavior and the forecast of tourist demand. Specifically, we have created a novel dataset comprising the non-verbal preference assessment of tourists who are clients of the Sol Cayo Guillermo hotel, which belongs to the Melia hotel chain, in Jardines del Rey, Cuba. Then, by applying Computational Intelligence algorithms to this dataset, we segment customers according to their non-verbal preferences, in order to increase their satisfaction and therefore client profitability. To perform this task well, we propose two modifications of the Naïve Associative Classifier, whose results are compared with the most relevant state-of-the-art algorithms. The experimentally obtained values of balanced accuracy and averaged F1 measure show that our proposal clearly improves on the results of the state-of-the-art algorithms and is adequate for the successful use of eCRM in the tourist services provided by hotel chains.


Introduction
Customer Relationship Management (CRM) is defined as all marketing activities directed towards the establishment, development, and maintenance of satisfactory relational exchanges with customers [1]. Moreover, Grönroos defines it as "identifying and establishing, maintaining and improving and, when necessary, also terminating relationships with customers and other interested parties, for profit, so that the objectives of all parties are met, and let this be done through mutual exchange and the fulfillment of promises" [2]. CRM has also been defined as "a business strategy that includes a combination of people, processes and technologies through all points of contact with customers, including marketing, sales, and customer service" [3]. Tourism companies increasingly make use of CRM in activities aimed at attracting new customers to the different services offered in hotel facilities and in tourist sites around the world. Tourism entrepreneurs continually invest resources to offer clients innovations in services. In this context, CRM plays a fundamental role, because the use of cutting-edge technologies that are associated with CRM makes it possible to [...]
The research methodology is illustrated in four steps in the schematic diagram of Figure 1: (i) application of the questionnaire to hotel clients to determine their non-verbal preferences; (ii) application of data clustering algorithms to obtain groups of clients; (iii) application of supervised classification algorithms, trained with the clients' non-verbal preferences; and (iv) determination of the client type of new customers, also by supervised classification.
In summary, we applied a questionnaire to hotel clients and used the responses to determine their non-verbal preferences; we then applied clustering algorithms to group the clients according to these preferences and used the group number as the class label. Subsequently, we trained a supervised classifier. Once a new guest arrives at the hotel, we invite them to fill in the questionnaire and use the resulting non-verbal preferences for supervised classification. Finally, we obtain the client type of the new guest and give that information to the staff, so that they can provide personalized attention according to the guest's non-verbal preferences, favoring client satisfaction.

Data Collection
The Sol Cayo Guillermo hotel has 268 rooms of three types: double, double sea view and superior sea view. In high season (November to February), hotel occupancy is practically at 100% of its capacity, while in low season (March-October), occupancy ranges between 40-60%. The data used were obtained from customer surveys, in December 2019.
A total of 73 customers, aged between 24 and 81 years old, were surveyed. This sample is representative of the population of hotel guests, and we consider that the classification obtained in this research is applicable to other guests of this and other hotels. The distribution of clients by sex and by country of origin is shown in Figure 2.

Of the customers surveyed, 38 were returning customers, and 35 were new customers. The variables chosen are the essential ones that make up the non-verbal communication system, in addition to being the most feasible to evaluate in clients. A line of future work will be to corroborate the influence of other variables of non-verbal communication on customer satisfaction. The non-verbal system is made up of subsystems such as the kinesic, paralinguistic, proxemic and chronemic subsystems, among others. In the design of the questionnaire, the indicators that make up these subsystems were taken into account as part of the client's communication preferences, provided they were feasible to evaluate in clients. The 22 variables analyzed were considered feasible to evaluate by the hotel's clientele. The instrument was adapted from similar ones validated by Rey-Benguría [47] to establish communicative preferences in teachers. The form of measurement of these questionnaires was maintained.
The variables that were considered, coming from the survey carried out, and that were used to characterize each client are shown in Table 1. The survey covered six categories of non-verbal preferences: gesture, posture, emotional atmosphere, tone, quasi-lexical expressions, and proxemics. Several images, audio recordings and videos were presented to the clients to determine their preferences. Non-verbal behavior can modify, contradict, substitute, complete, accentuate, and regulate verbal signs [48]. Our work rests on the importance of non-verbal communication in interpersonal relationships: paying attention to all these types of non-verbal communication helps to influence customer perception and satisfaction. The goal of the hotel facility is customer satisfaction. During their trip, customers may be subject to unexpected situations that require the attention of the quality department. The staff must then give specialized attention through non-verbal communication, which favors communicative interactions and the perception of the client. These variables are part of the non-verbal communication system. The quality and customer service department, which receives customer complaints and concerns, does not have the alternative of improving hotel infrastructure and services provided. Non-verbal communication is therefore chosen to mediate the perception of the client and influence their satisfaction.
In addition to the preferences, we also surveyed some general information about the clients (sex, age, country, and whether they are repeat guests). The results obtained with the described instrument allowed us to design and create a new dataset, which contains valuable information related to clients of tourism companies. The resulting dataset will be donated to the University of California, Irvine (UCI) Machine Learning Repository, to be publicly available. Another viable alternative would be to observe the non-verbal behavior of clients. The advantage of our proposal is that the questionnaire allows better data collection and processing. The questionnaire is voluntary, and clients are free to express their preferences, rather than having them inferred through observation. In this way, investigator bias is avoided, which is why the voluntary response is believed to be more objective.
After surveying the clients, we proceeded to the automatic formation of client groups, with the purpose of determining what types of non-verbal communication elements each of the client groups prefer in their treatment.

Client Segmentation by Clustering
For the segmentation of the customers, it was considered that the data obtained are described by numerical variables (for example, age) and categorical variables (for example, sex), and that they also have missing information (not all clients answered all the questions). Due to these elements, three clustering algorithms were applied, which are designed for handling mixed and incomplete data, such as those obtained from clients.
The number of clustering algorithms for mixed and incomplete data is much smaller than the number available for numeric data. Among them, a few behave well in practice when a small number of clusters (fewer than 20) is sought. Previous research [49,50] states that KMSF performs very well, followed by k-Prototypes and AGKA.
In the specialized literature, a large number of clustering algorithms can be found. However, the vast majority only support patterns with numeric features. Our research requires clustering algorithms to carry out customer segmentation, but the patterns that describe the tourist clients in our dataset have a particularity: the data obtained from the questionnaire are described by numerical variables (for example, age) and categorical variables (for example, sex), and they also have missing information (not all clients answered all the questions). These constraints drastically reduce the number of applicable clustering algorithms, since very few support incomplete patterns with mixed features. For these reasons, these three clustering algorithms were selected: they are among the few designed for handling mixed and incomplete data, such as those obtained from clients. Handling the data in their original mixed and incomplete form has a positive effect on the results, as shown in the tables and in the discussion section.
For the application of these algorithms, the EEUC (Experimenter Environment for Unsupervised Classification) software was used, which allows the application and evaluation of data grouping algorithms [54]. EEUC is efficient and user-friendly, and it includes the three algorithms selected for the segmentation of tourist clients. Consequently, the segmentation results are obtained reliably and efficiently.
The number of groups to be obtained was varied from two to 10. The resulting groupings were then evaluated to determine their quality using the Dunn index, which is widely recommended for assessing cluster quality [55].
Clustering belongs to the unsupervised learning paradigm. Measures of clustering quality fall broadly into three classes: internal validation, which is based on properties of the resulting clusters; relative validation, which compares partitions generated by the same algorithm with different parameters or different subsets of the data; and external validation, which compares the partition generated by the clustering algorithm with a given partition of the data. In our research we are interested in the internal properties of groups of tourist clients with respect to non-verbal communication. Internal indices are appropriate for this purpose because they aim to identify sets of clusters that are compact, with a small variance between members of the cluster, and well separated, where the means of different clusters are sufficiently far apart compared to the within-cluster variance. The three most important options in this area are the Dunn index, the Silhouette index and the Davies-Bouldin index. For a given assignment of clusters, a higher Dunn index indicates better clustering, and it has advantages over the other alternatives when the number of clusters is small. Since our dataset has this property, we chose the Dunn index to assess cluster quality. In the experiments reported in [55], the authors found that the Dunn index attains better rank correlation, which is why it is widely recommended. This supports the quality of the resulting clusters of tourist clients.
The Dunn index is given by the ratio between the smallest distance between two groups $G_i$, $G_j$, and the size of the largest group (Equations (1)-(3)).
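The ratio described above can be illustrated with a minimal Python sketch. It is not the EEUC implementation: the function names are hypothetical, the points are purely numeric, the distance between groups is taken as the dissimilarity between centroids, and the size of a group is assumed to be the mean dissimilarity of its members to the centroid.

```python
from itertools import combinations
from statistics import mean


def centroid(group):
    """Component-wise mean of a group of numeric points."""
    return [mean(column) for column in zip(*group)]


def euclidean(a, b):
    """Plain Euclidean distance, used here only as a stand-in dissimilarity."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def dunn_index(groups, dissim=euclidean):
    """Smallest between-group distance divided by the largest group size."""
    cents = [centroid(g) for g in groups]
    # Between-group distance: dissimilarity between centroids (assumption).
    min_separation = min(dissim(ci, cj) for ci, cj in combinations(cents, 2))
    # Group size: mean dissimilarity of members to their centroid (assumption).
    max_size = max(mean(dissim(x, c) for x in g) for g, c in zip(groups, cents))
    return min_separation / max_size


groups = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
print(dunn_index(groups))
```

In this toy example the centroids are 10 units apart and every point lies 1 unit from its centroid, so the index is 10. The sketch divides by zero if every group is a single point, a case a real implementation would need to handle.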
There are several ways to define the distance between groups and the size of a group. In this case, the EEUC software uses the dissimilarity between the centroids $g_i$, $g_j$, and the average intergroup dissimilarity, respectively. As a measure of dissimilarity, the HEOM (Heterogeneous Euclidean-Overlap Metric) function is used, which allows the handling of any type of data [56]. Many functions can be used to measure dissimilarity between patterns. The best known are the Minkowski distances, among which three cases stand out: the city block distance (order 1), the Euclidean distance (order 2), and the chessboard distance (infinite order). However, these distance functions are only defined for patterns with numerical features. If the patterns contain categorical features, mixed features, or missing values, they cannot be applied directly. In that setting, HEOM is one of the best options for measuring dissimilarity between incomplete patterns with mixed features, and for this reason it is adopted in our research. The HEOM function is shown in Equations (4)-(7).
Figure 3 shows the results of applying the Dunn index to the compared grouping algorithms. As can be seen, the highest quality grouping is the one corresponding to the KMSF algorithm, with six groups formed, since at this point the Dunn index reaches its maximum value. The resulting groups were inspected manually by the director of the Quality and Customer Service Department of the Sol Cayo Guillermo hotel and received the approval of this expert, who has more than 20 years of experience in the tourism field.
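As an illustration of how HEOM copes with mixed and incomplete patterns, the following Python sketch follows the commonly published definition of the metric: a per-attribute distance of 1 when either value is missing, the overlap distance for categorical attributes, and the range-normalized absolute difference for numeric attributes. The names `heom`, `numeric`, `ranges` and the `"?"` missing-value marker are assumptions made for this example, not the EEUC code.

```python
import math

MISSING = "?"  # hypothetical marker for a missing answer


def heom(x, y, numeric, ranges):
    """HEOM dissimilarity between two instances x and y.

    numeric[a] is True when attribute a is numeric; ranges[a] holds
    max - min of attribute a over the data (unused for categorical ones).
    """
    total = 0.0
    for a, (xa, ya) in enumerate(zip(x, y)):
        if xa == MISSING or ya == MISSING:
            d = 1.0  # a missing value counts as maximal distance
        elif numeric[a]:
            d = abs(xa - ya) / ranges[a] if ranges[a] else 0.0
        else:
            d = 0.0 if xa == ya else 1.0  # overlap distance for categories
        total += d * d
    return math.sqrt(total)


# Two hypothetical clients described by (age, sex, one survey answer).
a = (30, "F", "?")
b = (60, "M", "yes")
print(heom(a, b, numeric=[True, False, False], ranges=[60, None, None]))
```

Here the age difference contributes 0.5 (30 years over a range of 60), while the differing sex and the missing answer each contribute 1, giving a dissimilarity of 1.5.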
The distribution of clients in the six groups is shown in Figure 4.
The characteristics exhibited by the individuals of each of the six clusters obtained are as follows.
Individuals of cluster 1: They are characterized by being distant people who prefer a formal treatment and a friendly tone.
Individuals of cluster 2: They prefer personal or social treatment, are indifferent to the use of an authoritarian tone, and like the gestures of the staff.
Individuals of cluster 3: They are characterized by being repeat customers adapted to Cuban cultural dynamics, including linguistics, which makes them receptive to interactions with staff. They prefer intimate treatment.
Individuals of cluster 4: They are also repeat customers, but prefer a more personal rather than intimate treatment. They are indifferent to the gestures of the staff, but the use of measured quasi-lexical elements bothers them.
Individuals of cluster 5: These are individuals who prefer social or even public treatment. They reject expressions of kindness and interest on the part of the staff.
Individuals of cluster 6: Unlike the previous ones, these individuals require docile treatment by the staff, reflected in their non-verbal preferences. Any other type of non-verbal behavior on the part of the staff is perceived as conflict.
After obtaining the six types of clients, according to their non-verbal preferences, each client was assigned the type of the group in which they were included. Thus, a labeled data set was obtained, where the label corresponds to the type of customer preference.
Considering these client segments, the management of the Sol Cayo Guillermo hotel designed a personalized strategy to increase client satisfaction, and therefore client profitability. Other possible ways to gather information about the opinion of the clients would be to evaluate the reviews of the hotel left on the different platforms, or to process the quality surveys that clients answer during their stay. The disadvantage of these alternatives is that solutions to the problems raised come a posteriori. Our proposal has a clear advantage: a profile of a client's non-verbal preferences gives us a tool to anticipate their behavior and therefore their satisfaction. Another advantage lies in the value of non-verbal behavior in interpersonal relationships. It is important to emphasize that the strategy of our proposal is now included in the eCRM process of the hotel.

Supervised Classification of Clients
In order to determine the preference of a new customer, it is necessary to apply the survey, and subsequently, with said data, use a supervised classification algorithm capable of determining what type of preference they belong to. The No Free Lunch theorems [57] hold that there is no superiority of one classifier over others, over all data sets and all performance measures. However, it is possible to analyze the performance of the classifiers in some scenarios.

Execution of State-of-the-Art Algorithms
For this, we tested several state-of-the-art classifiers able to deal with mixed and incomplete data. The classifiers were Nearest Neighbor (NN) [58], Naïve Bayes (NB) [59], C4.5 [60], Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [61], Voting Algorithm (ALVOT) [62], Assisted Classification for Imbalance Data (ACID) [63], Extended Gamma (EG) [64,65] and Naïve Associative Classifier (NAC) [8,66]. The parameter values of the compared classifiers are given in Table 2 and were selected according to the suggestions found in the corresponding papers. We used the Knowledge Extraction based on Evolutionary Learning (KEEL) environment [67] to test the C4.5 and RIPPER classifiers, and the EPIC environment [68,69] for the remaining ones. The classifiers in Table 2 are adopted as the benchmark for comparison.
All the parameter values of the eight compared classifiers are included in Table 2, and their performance in Table 3. The parameters of the proposed classifier are specified immediately after Table 4. These are all the parameters used in the manuscript. In all cases, we have respected the suggestions of the authors of each of the compared classifiers. We did not experiment with other values because that is not the objective of our paper. In addition, we believe that such experimentation is not necessary, because authors typically publish the best values of the parameters of their classifiers, and those are the values we used for comparisons.
We used the same partitions of the dataset for all the algorithms. Due to the imbalance of the data, with an imbalance ratio IR = 9, 5 × 2 cross validation was used, where the data set is divided five times, in two parts each time.
The three main methods of model validation are hold-out, k-fold cross-validation, and leave-one-out. However, for imbalanced datasets 5 × 2 cross validation is recommended, which is why it is adopted in this study.
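The 5 × 2 scheme can be sketched in a few lines of Python. This is only an illustration of the partitioning idea (five random halvings, each half used once for training and once for testing); the function name and the seed are hypothetical, and the sketch does not stratify by class, which the actual experimental partitions may well have done.

```python
import random


def five_by_two_splits(n_instances, seed=0):
    """Yield 10 (train_idx, test_idx) pairs for 5 x 2 cross validation."""
    rng = random.Random(seed)
    indices = list(range(n_instances))
    half = n_instances // 2
    for _ in range(5):  # five repetitions
        rng.shuffle(indices)
        first, second = indices[:half], indices[half:]
        yield sorted(first), sorted(second)  # train on first, test on second
        yield sorted(second), sorted(first)  # then swap the roles


splits = list(five_by_two_splits(73))  # 73 surveyed clients
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Because every classifier is evaluated on the same ten train/test pairs, differences in performance can be attributed to the algorithms rather than to the partitioning.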
Because we are dealing with multiclass imbalanced data, to test the performance of the algorithms we used a multiclass confusion matrix (Figure 5).
We compute two measures of performance: the balanced accuracy and the averaged F1 measure. Such measures are robust in the presence of imbalanced data and can be easily obtained from a multiclass confusion matrix.
The most widely used measure of performance for supervised classifiers is accuracy. However, accuracy is strongly biased towards the majority class and thereby disregards the minority class. In order to take into account both the majority and the minority classes, the specialized literature [70] strongly recommends the balanced accuracy and the averaged F1 measure, which are adopted as measures of performance in our research.
In a classification problem with k classes, the balanced accuracy takes into consideration the total of correctly classified instances from each class, relative to the total of instances of that class. The averaged F1 measure considers both precision and recall for each of the classes. These performance measures allow us to evaluate the global performance of classification algorithms over all the classes in the problem, without bias towards the majority class.
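Both measures can be read off a multiclass confusion matrix, as the following Python sketch shows. It assumes that rows are true classes and columns are predicted classes, that balanced accuracy is the mean per-class recall, and that the averaged F1 measure is the unweighted (macro) mean of the per-class F1 values; the function names are chosen for this example only.

```python
def per_class_precision_recall(cm):
    """Return a (precision, recall) pair for each class of confusion matrix cm."""
    k = len(cm)
    stats = []
    for c in range(k):
        tp = cm[c][c]
        actual = sum(cm[c])  # row total: instances of class c
        predicted = sum(cm[r][c] for r in range(k))  # column total
        recall = tp / actual if actual else 0.0
        precision = tp / predicted if predicted else 0.0
        stats.append((precision, recall))
    return stats


def balanced_accuracy(cm):
    """Mean of the per-class recalls."""
    stats = per_class_precision_recall(cm)
    return sum(recall for _, recall in stats) / len(stats)


def averaged_f1(cm):
    """Unweighted mean of the per-class F1 values."""
    f1_values = [
        2 * p * r / (p + r) if p + r else 0.0
        for p, r in per_class_precision_recall(cm)
    ]
    return sum(f1_values) / len(f1_values)


# Toy imbalanced problem: 10 instances of class 0, 2 instances of class 1.
cm = [[8, 2],
      [1, 1]]
print(balanced_accuracy(cm), averaged_f1(cm))
```

In this toy matrix plain accuracy is 9/12 = 0.75, while the balanced accuracy is (0.8 + 0.5) / 2 = 0.65, which exposes the weaker performance on the minority class.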
The experiments carried out show different performance values for the algorithms (Table 3), according to the balanced accuracy and averaged F1 measure. As can be seen, the best result was obtained by the NAC algorithm, with a balanced accuracy of 0.7181 and an averaged F1 measure of 0.6747.
Regarding execution times, the fastest algorithm in training is Nearest Neighbor, while the fastest in testing are the RIPPER and NAC classifiers. In general, all algorithms are fast: the total time of the slowest algorithm does not exceed six seconds for the whole validation procedure.
Considering the above results, these performance values are not good enough for deployment in a real-time eCRM inside a hotel facility. Thus, we introduce a novel classification algorithm, based on the best-performing one, the NAC classifier, to improve the classification of the non-verbal preferences of the clients.

Customized Naïve Associative Classifier (CNAC)
The NAC classifier has a predefined similarity function, based on the Mixed and Incomplete Data Similarity Operator (MIDSO). It can also use feature weights, which can be computed by means of metaheuristic algorithms [55]. In addition, NAC considers the overall similarity of the instance to classify with respect to every class of instances, which makes it suitable for dealing with imbalanced data [7].
The first modification to NAC is to replace the MIDSO operator with a customized feature similarity operator, the Customized Mixed and Incomplete Data Similarity Operator (CMIDSO). By replacing MIDSO with a customized similarity operator that can use feature weights and compare the differences between feature values, we extend the NAC algorithm to a Customized Naïve Associative Classifier (CNAC). In this way, we preserve the nature of NAC while making it more flexible and useful for specific problems.
Other feasible alternatives are: (a) designing a novel classifier for mixed and incomplete data, based on other paradigms, and (b) using data preprocessing techniques to enhance the results. However, (a) is complex, and (b) would require more data than we have.
The main advantage of adopting this operator is that it is based on feature comparison criteria, which makes it suitable for designing customized similarity functions. It is a generalization of the MIDSO operator and allows users to define the similarity they need for any specific problem.
This affects the results positively: the experiments show a significant increase in the performance of the classifier when using the proposed CMIDSO operator. In our opinion, the increase is due to the use of a similarity function that takes into consideration the problem specifications, as well as the way the feature values are compared.
Let $x$ and $y$ be two instances, described by a set of features $A = \{A_1, \cdots, A_m\}$. If an attribute value is missing, it is denoted by "?". Each instance belongs to a unique class from a set of classes $K = \{K_1, \cdots, K_p\}$, and the set of attributes $A$ may have an associated set of attribute weights $W = \{w_1, \cdots, w_m\}$. The total similarity $s_t$ of an instance $y$ with respect to the instance $o$ to be classified is computed as $s_t(o, y) = \sum_{i=1}^{m} w_i \cdot \mathrm{CMIDSO}(o, y, A_i)$. By treating the CMIDSO function as an algorithm parameter, we can customize the classifier while maintaining the advantages of NAC.
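The weighted sum above, with the feature similarity passed in as a parameter, can be sketched in Python as follows. This is not the authors' pseudocode: `example_feature_similarity` is a hypothetical stand-in for CMIDSO (exact match scores 1, anything else or a missing value scores 0), and the decision rule, which picks the class with the highest average total similarity, is an assumption made to keep the example short.

```python
MISSING = "?"  # missing-value marker, following the notation in the text


def example_feature_similarity(o, y, i):
    """Hypothetical stand-in for CMIDSO on feature i (not the paper's operator)."""
    if o[i] == MISSING or y[i] == MISSING:
        return 0.0
    return 1.0 if o[i] == y[i] else 0.0


def total_similarity(o, y, weights, feature_sim):
    """s_t(o, y): weighted sum of per-feature similarities."""
    return sum(w * feature_sim(o, y, i) for i, w in enumerate(weights))


def classify(o, training, weights, feature_sim=example_feature_similarity):
    """Return the class whose instances are, on average, most similar to o."""
    sums, counts = {}, {}
    for y, label in training:
        s = total_similarity(o, y, weights, feature_sim)
        sums[label] = sums.get(label, 0.0) + s
        counts[label] = counts.get(label, 0) + 1
    # Averaging per class is an assumption of this sketch.
    return max(sums, key=lambda label: sums[label] / counts[label])


# Hypothetical clients described by (treatment, tone), labeled by cluster.
training = [
    (("formal", "friendly"), 1),
    (("formal", "?"), 1),
    (("intimate", "warm"), 3),
]
print(classify(("formal", "friendly"), training, weights=[1.0, 0.5]))
```

Passing `feature_sim` as an argument mirrors the idea of treating the similarity operator as a parameter: a problem-specific operator can be swapped in without touching the rest of the classifier.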
The pseudocode of the proposed Customized Naïve Associative Classifier is shown in Figure 6.
Figure 6. Pseudocode for the Customized Naïve Associative Classifier.
To solve the problem of classifying the non-verbal preferences of the clients, we introduce a novel operator suitable for comparing those preferences. The proposed CMIDSO is defined in terms of the features that describe the clients' preferences (Table 1).
We clarify that, in the definition of the similarity operators $s_i$, we use a programming-based approach: the first condition is evaluated; if it is false, the second condition is evaluated, and so on. The order in which the conditions are presented therefore matters, because the second, third and fourth conditions assume that the previous ones were not fulfilled.
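The order-dependent evaluation described above can be illustrated with a short sketch. The conditions, the returned values and the `tolerance` parameter below are hypothetical and chosen only to show why order matters; they are not the operators $s_i$ defined in this paper.

```python
MISSING = "?"  # marker for a missing attribute value


def ordered_feature_similarity(a, b, tolerance=1.0):
    """Evaluate conditions top to bottom; the first one that holds wins.

    All conditions and return values are illustrative assumptions.
    """
    # Condition 1: at least one value is missing.
    if a == MISSING or b == MISSING:
        return 0.0
    # Condition 2: reached only when neither value is missing.
    if a == b:
        return 1.0
    # Condition 3: reached only when both values are present and different.
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        if abs(a - b) <= tolerance:
            return 0.5
    # Fallback: no earlier condition was fulfilled.
    return 0.0
```

If the equality test were placed first, two missing values ("?" and "?") would be treated as identical and score 1.0; with the order shown, they score 0.0. This is the sense in which later conditions assume that the earlier ones were not fulfilled.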
We tested the proposed CNAC with the above-mentioned CMIDSO and obtained an improvement of 9% in both Balanced Accuracy and F1 measure in the classification results (Table 4).
For this test, we used the same parameters as NAC for the Differential Evolution procedure: Np = 25, It = 1000, F = 0.5 and CR = 1.0. However, the proposed CNAC was slower than NAC. In our opinion, this is due to the change in the similarity operator, because MIDSO is faster than the proposed CMIDSO.
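For readers unfamiliar with the procedure, the following is a minimal sketch of a common Differential Evolution variant (DE/rand/1/bin) using the parameter values stated above as defaults. The exact variant used for NAC and CNAC is not specified in this section, so the mutation scheme, the clipping of weights to [0, 1], and the function and argument names are assumptions; `fitness` stands for any user-supplied function that scores a weight vector (higher is better).

```python
import random


def differential_evolution(fitness, dim, np_=25, it=1000, f=0.5, cr=1.0, seed=0):
    """Search for a weight vector in [0, 1]^dim that maximizes `fitness`."""
    rng = random.Random(seed)
    # Random initial population of np_ candidate weight vectors.
    pop = [[rng.random() for _ in range(dim)] for _ in range(np_)]
    scores = [fitness(ind) for ind in pop]
    for _ in range(it):
        for k in range(np_):
            # Pick three distinct individuals, all different from k.
            a, b, c = rng.sample([j for j in range(np_) if j != k], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated gene
            trial = []
            for d in range(dim):
                if rng.random() < cr or d == j_rand:
                    v = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    v = min(1.0, max(0.0, v))  # keep weights in [0, 1] (assumption)
                else:
                    v = pop[k][d]
                trial.append(v)
            s = fitness(trial)
            # Greedy selection: keep the trial only if it is not worse.
            if s >= scores[k]:
                pop[k], scores[k] = trial, s
    best = max(range(np_), key=scores.__getitem__)
    return pop[best], scores[best]
```

Each iteration evaluates the fitness function once per individual, so the total number of fitness evaluations grows with it * np, which is why the cost of the similarity operator inside the fitness function dominates the training time.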
The computational complexity of the NAC and CNAC algorithms comprises their storage complexity and their training and classification complexity. The storage complexity of NAC is bounded by O(n * m) + O(3m), where n is the number of instances and m is the number of features, because NAC stores the training set as well as the minimum, maximum and standard deviation values of each feature. For CNAC, the storage complexity is bounded by O(n * m), because it only stores the training set. Regarding storage, CNAC is therefore slightly less complex.
The training complexity of NAC is given by the computation of the standard deviations of the features, bounded by O(n * m), and the use of Differential Evolution (DE) to obtain the feature weights. Using DE has a complexity bounded by O(it * np * f_NAC), where it is the number of evaluations, np is the population size and f_NAC is the complexity of computing the fitness function (the classifier performance of NAC over a portion of the testing set). Thus, the training complexity of NAC is bounded by O(n * m) + O(it * np * f_NAC).
The training complexity of CNAC is also given by the use of Differential Evolution to obtain the feature weights. That is, it is bounded by O(it * np * f_CNAC), where f_CNAC is the complexity of computing the fitness function (the classifier performance of CNAC over a portion of the testing set).
Regarding training complexity, there are two differences between NAC and CNAC: (a) the computation of standard deviations in NAC, and (b) the similarity function used in the classifier, which affects the complexity of computing the fitness function. For the case study, which has a small number of instances, we can disregard difference (a). Because we use the same values of it and np, we consider that the difference in execution time between NAC and CNAC is due to the use of a different similarity function.
The classification complexity of NAC is given by the similarity comparison of the unclassified instance with respect to the instances in the training set. The complexity of computing the similarity with the MIDSO operator is O(m * 1) = O(m), because it only compares the feature values in a predefined way. Thus, the classification complexity of NAC is bounded by O(n * m).
The classification complexity of CNAC is also given by the similarity comparison of the unclassified instance with respect to the instances in the training set. This complexity is O(n * s), where s is the complexity of computing the similarity between two instances. The complexity of the CMIDSO operator is bounded by O(c * m), where c is the complexity of computing the feature similarity criterion. Thus, the total classification complexity of CNAC is bounded by O(n * m * c).
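The nested structure behind the O(n * m * c) bound can be seen in a short sketch. The decision rule used here (assigning the class with the highest average total similarity) is an assumption made for illustration and may differ from the rule in Figure 6; the function names are hypothetical, and `feature_similarity` stands for any per-feature criterion of cost c.

```python
from collections import defaultdict


def classify(o, training_set, weights, feature_similarity):
    """Return the label whose training instances are, on average, most similar to o.

    training_set is a list of (instance, label) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for y, label in training_set:              # n training instances
        s = sum(
            w * feature_similarity(o[i], y[i])  # m features, cost c each
            for i, w in enumerate(weights)
        )
        totals[label] += s
        counts[label] += 1
    # Assumed decision rule: highest average similarity per class.
    return max(totals, key=lambda k: totals[k] / counts[k])
```

The outer loop runs n times, the inner sum visits m features, and each call to `feature_similarity` costs c, which gives the stated bound.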
Regarding classification, CNAC is more complex than NAC, because the customized similarity adds complexity to the classifier.
Nevertheless, spending only ten seconds once to train the model is acceptable for use in a real environment, within the electronic Customer Relationship Management system implemented in the hotel.
Apart from the quantitative comparison, a qualitative comparison between our proposed CNAC and the other supervised classifiers is provided in Table 5. Because all the compared classifiers deal with mixed and incomplete data and offer some degree of transparency in their decisions, we include other aspects in the qualitative comparison: the possibility of using a user-defined similarity function, the inclusion of embedded procedures for feature selection and feature weighting, and the computational complexity of the training and testing phases of the algorithms. As shown, the proposed CNAC differs from the other compared classifiers, because none of them has the same qualitative characteristics as CNAC.