Validation of a Satisfaction Questionnaire on Spa Tourism

The authors’ line of research lies within the existing methodological debate around the concepts of quality of services, quality of destinations, and quality measurement methods. The authors consider that the most appropriate way to measure quality is to develop instruments according to the destination and context in question, defining the quality of the tourist destination for practical purposes based on the satisfaction experienced by the tourist or the SERVPERF model, weighted and used to measure the quality of sun and beach tourist destinations. The authors of this work identify knowledge of spa tourism, its quality, and its level of satisfaction as a research gap and take it as a starting point to validate a questionnaire that allows parameters to be measured and compared with other segments already studied. The questionnaire can also serve as a measuring instrument for tourist segments with similar characteristics that are less well known in the international literature, such as inland, ecological, or nature tourism. Good internal reliability results were obtained in all items and in all dimensions. The factor analysis distributed the weights of the variables as in the theoretical model, and construct validity was supported by the association between the global evaluation by dimension, general satisfaction, and the score of the main questionnaire, which was statistically significant.


Introduction
The evolution of research in medical tourism includes different groups of topics, such as issues related to health, well-being, thermal tourism, and quality of service, as well as topics related to medical treatments and tourism [1]. The basis for the implementation of a quality system in any company is the strategic quality plan. This involves objectively evaluating the current situation and contemplating the client's vision [2,3]. There is a flood of definitions about the concept of quality, suggesting that it is a broad and multidimensional term with different interpretations. Quality has been defined from perspectives that are, in some cases, complementary (objective quality versus subjective quality; internal quality versus external quality) and, in other cases, antagonistic (static quality versus dynamic quality; absolute quality versus relative quality). Summarizing the literature, six concepts of quality have been identified [4]:

• Quality as excellence;
• Quality as conformity with specifications;
• Quality as uniformity;
• Quality as fitness for use;
• Quality as satisfaction of customer expectations;
• Quality as value creation, understood as the degree of satisfaction of all the key stakeholders of the organization or total quality.
Intuitively, a relationship can be determined between customer opinion and future results. Measuring perceived quality is "listening" to the customer and using it as a tool to organize opinions and determine areas for improvement on which to act. Quality is a multidimensional concept that encompasses many independent attributes (any item that is

Literature Review
Many methodologies based on questionnaires establish theoretical satisfaction as a response to quality services, and there is a broad debate on the advantages and disadvantages of the use of each. Likert-type scales are easy to construct [18] and have high reliability and validity [15,19].
The debate stems from the fact that to fully measure the image of the destination, several components must be captured. These include attribute-based imagery; holistic impressions; and functional, psychological, unique, and common characteristics that require a combination of structured and unstructured methodologies to measure the image of the destination as envisioned in the conceptual framework [18].
Sasser et al. [20] consider that expectations result in attributes. Grönroos [21], representative of the Nordic model, states that the client compares the expected service with the service received. Parasuraman et al. [22,23], representatives of the American school, consider that to determine quality, it is necessary to measure the difference between perceptions and expectations, and they elaborated a scale, SERVQUAL, with five dimensions. Cronin and Taylor [14] establish quality based solely on perceptions; that is, their SERVPERF scale eliminates the measurement of expectations but maintains the items of the SERVQUAL scale created by Parasuraman et al. [22,23]. Rust et al. [24] consider only three dimensions: result, delivery, and service environment. Dabholkar et al. [25] propose a hierarchy of quality in three levels or primary dimensions, and these, in turn, are divided into sub-dimensions. Brady and Cronin [26] consider dimensions, sub-dimensions, and level: reliability, responsiveness, and empathy. This lack of consensus means that there are different models to assess quality.
The first aspect that arises when evaluating the quality of services is the measurement of expectation, which has been modeled on two widely used scales, SERVQUAL, which compares expectations and perceptions [27], and SERVPERF based solely on the measurement of perceptions [28]. Some authors doubt its validity due to theoretical difficulties and practical application measuring expectations [28]; others consider that the comparison between expectations and perceptions of quality is the most appropriate way to measure it [27]. Alén [29] makes a comparison of scales for the measurement of perceived quality in thermal establishments in which he concludes that the SERVPERF scale based on perceptions has better psychometric properties than the one based on the subtractive paradigm (perceptions-expectations).
Another point of debate is whether, when evaluating quality, the importance that users attach to each attribute or dimension should be measured. Cronin and Taylor [27] deduced from the results of their research that weighted models are different for studying the quality of services. Teas [30] compares weighted and unweighted models and concludes that the latter are more appropriate in statistical terms. Quester et al. [31] make the same comparison and conclude that the predictive power of SERVQUAL and SERVPERF improves with the inclusion of importance scores. Among those in favor of including importance scores, there are also disagreements about how they should be measured. Parasuraman et al. [28] believe that scores should only be obtained for each factor, while others [31] argue that importance should be asked individually for each attribute. In addition, there is a debate about whether importance scores should be obtained directly by asking the respondents [22] or, on the contrary, indirectly through statistical procedures [13] applied to other information collected in the questionnaires.
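The three scoring approaches discussed above can be contrasted in a short sketch. It is only an illustration under stated assumptions: the function names, the four-item example, and the 1-7 ratings are hypothetical and are not taken from any of the cited scales.

```python
from statistics import mean


def servqual_score(perceptions, expectations):
    """Gap (subtractive) score: mean of perception minus expectation per item."""
    return mean(p - e for p, e in zip(perceptions, expectations))


def servperf_score(perceptions):
    """Performance-only score: mean of the perception ratings."""
    return mean(perceptions)


def weighted_score(item_scores, importances):
    """Importance-weighted mean; importances are rescaled to sum to 1."""
    total = sum(importances)
    return sum(s * w for s, w in zip(item_scores, importances)) / total


# Hypothetical ratings for four attributes on a 1-7 scale.
perceptions = [6, 5, 7, 4]
expectations = [7, 6, 6, 5]
importances = [40, 30, 20, 10]  # hypothetical importance points

gap = servqual_score(perceptions, expectations)
perf = servperf_score(perceptions)
weighted_perf = weighted_score(perceptions, importances)
print(gap, perf, weighted_perf)
```

The sketch makes the practical difference visible: the gap score needs a second set of ratings (expectations), whereas the perception-only and weighted variants use a single set of ratings, optionally combined with importance scores.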
The diversity of authors who have tried to define the tourist destination construct [4,32-36] indicates that the concept of tourist destination encompasses a great diversity of services offered by both private companies and public administrations, and even infrastructure and natural resources. Therefore, to evaluate the quality of a tourist destination, the methodologies developed in the field of quality can be applied, although taking into account that it is necessary to adapt them to a specific context [13].
Kozak and Rimmington [11] in Mallorca followed the line of measuring quality by developing and validating a destination measurement instrument composed of a multiattribute questionnaire, and according to Robinson [37], the best way to measure quality is by adapting the different instruments to the context. Otero and Otero [13], on Costa del Sol, used a multiattribute questionnaire that they validated using the CFA technique and developed a model to measure the quality of the tourist destination. They found high levels of satisfaction with the different attributes of the destination, and other investigations obtained results along the same line [11,38-41].

Questionnaire Reliability: Test-Retest and Internal Reliability
A cross-sectional survey design was used. The basic instrument for obtaining the sample information was the questionnaire validated by Otero and Otero [13], which in turn was inspired by the SERVPERF methodological school [28]. The cultural adaptation to spa tourism was based on the contributions of a group of experts made up of university professors, professionals from the sector, and regular users of the spas. The result was a questionnaire with slight modifications to the original that serves to measure a complex construct, such as satisfaction in spa tourism or in segments with similar conditions, for which an operational definition was agreed in meetings with the experts. Content validity was established following Streiner and Norman [42].
Under the previous premises, it was considered a priori that the questionnaire was made up of 7 factors or dimensions, which was later confirmed using the statistical technique of Confirmatory Factor Analysis (n = 725).
Once the adaptation was made, test-retest reproducibility was assessed with 30 users (12 men and 18 women, aged 25-58 years) who did not belong to the main sample, at three study spas, among the cases with a valid response, with the two administrations separated by one to two days. The intraclass correlation coefficient (ICC) or the Kappa index was used depending on the type of variable (quantitative or qualitative, respectively), and the values were interpreted with the Landis and Koch scale [43]: <0.00: poor; 0.00-0.20: slight; 0.21-0.40: fair; 0.41-0.60: moderate; 0.61-0.80: substantial; 0.81-1.00: almost perfect.
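As an illustration of the agreement measures described above, the following sketch computes an unweighted Cohen's kappa for a qualitative item answered twice and labels it with the Landis and Koch bands. The function names and the ten test-retest answers are hypothetical; the study's own computations were carried out with statistical packages, not with this code.

```python
from collections import Counter


def cohen_kappa(first, second):
    """Unweighted Cohen's kappa for two ratings of the same subjects.

    Undefined (division by zero) when chance agreement equals 1,
    i.e. when every answer falls in a single category.
    """
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    count_first, count_second = Counter(first), Counter(second)
    expected = sum(count_first[c] * count_second[c] for c in count_first) / (n * n)
    return (observed - expected) / (1 - expected)


def landis_koch(value):
    """Map an agreement coefficient to the Landis and Koch label."""
    if value < 0.0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")]:
        if value <= upper:
            return label
    return "almost perfect"


# Hypothetical answers of ten users to "intention to recommend" (test, retest).
test = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes"]
retest = ["yes", "yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes"]

kappa = cohen_kappa(test, retest)
print(round(kappa, 3), landis_koch(kappa))
```

For quantitative items the same classification function can be applied to an ICC value; the ICC itself is not computed here because the text does not state which ICC form was used.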
The internal consistency of the 52-item questionnaire was measured with Cronbach's alpha coefficient excluding the item, which indicates that the questions that measure the same phenomenon are correlated with one another. The global alpha was obtained for each of the dimensions (accommodation, restoration, spa, sports facilities, leisure/culture/shopping, public roads/urban and natural environment, and transport infrastructure and other services) and by item in relation to its dimension.
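A minimal sketch of the two internal-consistency statistics mentioned above (global alpha per dimension and alpha excluding each item) is given below. It uses the standard variance-based formula for Cronbach's alpha; the function names and the small three-item data set are hypothetical and unweighted, whereas the study analyzed weighted observations.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha; rows holds one list of item scores per respondent."""
    k = len(rows[0])
    item_variances = [variance(column) for column in zip(*rows)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def alpha_if_item_deleted(rows):
    """Alpha of the dimension recomputed with each item left out in turn."""
    k = len(rows[0])
    return [
        cronbach_alpha([[v for j, v in enumerate(row) if j != i] for row in rows])
        for i in range(k)
    ]


# Hypothetical 1-7 ratings of six users on three items of one dimension.
ratings = [[7, 6, 7], [5, 5, 6], [6, 6, 5], [3, 4, 3], [4, 3, 4], [6, 7, 6]]

alpha = cronbach_alpha(ratings)
alpha_without = alpha_if_item_deleted(ratings)
print(round(alpha, 3), [round(a, 3) for a in alpha_without])
```

An item whose removal raises alpha noticeably is a candidate for revision, which is the usual reading of the "alpha excluding the item" column.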
The literature suggests that Likert-type and semantic differential scales are easy to construct and manage [18]. Empirical research shows that Likert-type and semantic differential scales have high reliability and validity [19]. The "delighted-terrible" scale has been reported to reduce the skewness of satisfaction responses [15,16]. A "Do not know" option was also included in the scale for those who might not have an opinion because they had no direct experience with the destination attribute. To obtain a weighting factor, the user was asked to rate on a scale of 1 to 100 how important each factor was to him or her. The third section of the questionnaire, made up of four questions, was designed to determine overall satisfaction with the destination on the same seven-point scale, the intention to recommend it with three categories, and the intention to return with five, in which the user was asked only about perceptions.

Data Collection Procedures
The sample was designed based on information gathered from external sources, in particular, a telephone and/or email survey of individuals responsible for managing spas. The relevant data were used to calculate the number of users per year per spa, as well as their distribution by age group and length of stay. The overall population size was estimated at 53,231 users per year with an age equal to or greater than 15 years old. These individuals were thus defined as this study's primary subjects.
Stratified sampling was used, with each stratum coinciding with one of the participating spas. The sample size was estimated using Equation (1):

$$n_c = n \times \mathit{Deff} \qquad (1)$$

in which $n_c$ is the size adjusted for the sampling design (i.e., strata), $n$ is the size required under simple random sampling, and $\mathit{Deff}$ is the design effect, that is, the ratio between the variance under stratified sampling and under simple random sampling. The design effect was arbitrarily set a priori at 1.5, which was shown to be an overestimate. Any proportion calculated in this study had to meet the requirements of an accuracy of 5% and a 95% confidence level, with the worst case considered to be p = 0.5. This implied that n = 384 and $n_c$ = 576, which was set as the minimum. The sample size per stratum was the result of proportional allocation (i.e., the largest spas received more than one visit during the survey). The final result was 725 valid questionnaires, which is consistent with some authors' suggestion [44] that sample size be based on 10 subjects per analyzed variable. The sample size's adequacy was thus confirmed, since 500 subjects would be an acceptable sample size and 1000 or more subjects would be an excellent sample size.
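The reported figures can be reproduced with a short calculation. The sketch below assumes the usual normal-approximation formula for a proportion, n = z²p(1 − p)/e², which the text does not state explicitly; the function name is hypothetical, while p = 0.5, the 5% accuracy, the 95% confidence level (z = 1.96), and the design effect of 1.5 come from the text.

```python
def srs_sample_size(p=0.5, margin=0.05, z=1.96):
    """Sample size for estimating a proportion under simple random sampling."""
    return z ** 2 * p * (1 - p) / margin ** 2


design_effect = 1.5  # a priori value reported in the text

n = round(srs_sample_size())   # size under simple random sampling
n_c = round(n * design_effect)  # size adjusted for the stratified design
print(n, n_c)
```

With these inputs the unrounded simple-random-sampling size is 384.16, which matches the n = 384 and adjusted minimum of 576 given above.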

Factor Analysis, Calculation of Dimensions, and Validity of the Construct
In the factorial model of Otero and Otero [13] applied to sun and beach tourism, seven dimensions were obtained (accommodation, restaurants, beaches, sports facilities, leisure/culture/shopping, public roads/urban and natural environment, and transport infrastructures). In the present study, CFA was carried out to determine the degree to which the group of factors identified for sun and beach tourism is capable of representing the data from the spa tourism matrix.
To obtain a parsimonious model (that is, to maximize the amount of variance of the variables that can be explained by the minimum number of underlying factors or components), the Cattell slope (scree) test was performed, which consists of graphically representing the extracted factors (placed on the abscissa axis) against their eigenvalues (placed on the ordinate axis) to establish an inflection point on the graph. To obtain a more easily interpretable grouping of variables, the orthogonal rotation of factors was performed using the Varimax method, as it is the most frequently used [3,45]. To calculate the weight of each dimension, weights were constructed from the importance that each user gave to each dimension, rescaled so that each user's weights add up to 100. The weight given by the i-th user to factor j was defined as follows:

$$w_{ij} = 100 \times \frac{I_{ij}}{\sum_{k=1}^{7} I_{ik}}$$

where $I_{ij}$ is the importance given by user i to factor j.
With this definition, the sum of the weights for each individual should be equal to 100.
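The rescaling of importance ratings into weights can be sketched as follows, assuming that each user's weight for a factor is that factor's importance divided by the sum of the user's importances and multiplied by 100, which is the reading consistent with the weights adding up to 100. The function name and the seven ratings are hypothetical.

```python
def importance_weights(importances):
    """Rescale one user's importance ratings so that the weights sum to 100."""
    total = sum(importances)
    return [100 * value / total for value in importances]


# Hypothetical 1-100 importance ratings of one user for the seven dimensions.
ratings = [80, 60, 100, 40, 50, 40, 30]

weights = importance_weights(ratings)
print(weights, sum(weights))
```

Because the weights are relative to each user's own total, two users who rate every factor at 50 and at 100, respectively, end up with identical weights.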
The weight variable multiplies the value of each dimension to estimate the total score. The percentage distribution was not calculated for the weights, since it is the means that are relevant, taken from Otero and Otero [13]. The score of the items of global assessment by dimension and general satisfaction in users (n = 725) corresponded to the single question by dimension and global satisfaction questions and are different from the score by dimension estimated from the main questionnaire of 52 items.
The internal consistency of the questionnaire was complemented with questions of its future intention. A priori, it was to be expected that the response of intention to recommend it and intention to return would be related to the level of satisfaction, as observed in the satisfaction study of sun and beach tourism [10]. To determine the association between global assessment by dimension and general satisfaction and the score of the main questionnaire, the Pearson correlation coefficient was calculated.
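For reference, the Pearson coefficient used for this association can be computed as in the sketch below. It is an unweighted illustration with hypothetical data and a hypothetical function name; the study's coefficients were obtained from weighted observations.

```python
import math


def pearson_r(x, y):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: single-question rating of a dimension versus the
# score of the same dimension estimated from its questionnaire items.
single_question = [7, 6, 5, 6, 3, 4, 7, 5]
estimated_score = [6.8, 6.1, 5.2, 5.5, 3.4, 4.4, 6.5, 5.1]

r = pearson_r(single_question, estimated_score)
print(round(r, 3))
```

Values near 1 indicate that the single global question and the multi-item score rank users in nearly the same way, which is the pattern expected if both measure the same construct.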
To calculate the association between the variables of future intention and the scores estimated in the main questionnaire (52 items and 725 users), the comparison of means was used. The aim was to check whether the difference between the two means was due to satisfaction influencing the intention to recommend it and the intention to return, or if the observed differences could simply be due to random variability. For this, the Student's t test was applied for independent samples, with weighted samples. The null hypothesis was that both means are equal in the population. The alternative hypothesis maintained that the means were different and that both had different effects.
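The comparison of means can be illustrated with the classical pooled-variance Student's t statistic for two independent samples. The sketch is unweighted, whereas the study applied the test to weighted samples, and the function name and scores are hypothetical.

```python
import math
from statistics import mean, variance


def student_t(sample_a, sample_b):
    """Pooled-variance t statistic and degrees of freedom for two groups."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(pooled * (1 / n_a + 1 / n_b))
    return t, df


# Hypothetical questionnaire scores of users who would recommend the spa
# and of users who would not.
would_recommend = [6, 7, 6, 5, 7]
would_not_recommend = [4, 5, 3, 4, 4]

t_stat, dof = student_t(would_recommend, would_not_recommend)
print(round(t_stat, 3), dof)
```

The null hypothesis of equal means is rejected at the 5% level when the absolute value of the statistic exceeds the two-sided critical value of the t distribution for the given degrees of freedom (2.306 for 8 degrees of freedom).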
In all statistical analyses, the weighting of the observations was taken into account. For descriptive statistics, the SPSS Windows 15.0 program (SPSS Inc., Chicago, IL, USA) was used, and for analytical statistics (calculation of p values, as well as confidence intervals), the SUDAAN 7.0 program (RTI, RTP, Durham, NC, USA), indicating the STRWR sample design (stratified with replacement, the conglomerates being the strata). It was not corrected for a finite population, since the sample fraction (ratio between sample size and population) was much less than 10%. All the statistical procedures used, both those discussed in terms of reliability and validity and those carried out to evaluate the association of sociodemographic variables and characteristics of the type of visit by tourists to spas, are expressed in detail at the bottom of each results table.

Sociodemographic Description of the Users and the Visit to the Spa
The average age of spa tourists in Andalusia is 56 years old, and the tourists are predominantly female. Fifty-three percent of users are retired, and 34% are employed. Moreover, 29.8% have an average monthly income of EUR 500, 33.1% have an average monthly income between EUR 501 and 1000, and 26.6% are in the interval between EUR 1001 and 1500 (Table 1). The origin of these users is mainly local, with 82.9% from the Andalusian community, with no foreigners in the sample (Table 2). In addition, 51.3% of spa tourists in Andalusia go as a couple and 19.7% as a family. In the accommodation modality, 49.6% choose a three-star hotel. The highest percentage of overnight stays is among the IMSERSO (48.4%), followed by weekend and long-weekend tourists (28.1%), with an average stay of one week. Moreover, 46.8% of users who go to a spa repeat their visit, contracting the trip without intermediaries (99.1%) (Table 1). At the population level, 92.4% express the intention of recommending the spa compared to 2.0% who have no intention of recommending it. Additionally, 63.6% of the population express the intention to return next year compared to 1.6% who have no intention of ever returning (Table 2).

Test-Retest
For the diagnostic agreement of the 52 items of the main questionnaire, a test-retest was applied (with a separation between 1 and 2 days), analyzed with the Intraclass Correlation Coefficient (ICC). Diagnostic concordance was found with scores higher than 0.65 for almost all the items of the accommodation dimension. In the restoration dimension, values higher than those of accommodation were found, mostly above 0.67; in the spa dimension, values were higher than 0.51; in the sports facilities dimension, values were above 0.67; for leisure/culture/shopping, values higher than 0.61 were obtained; for the dimension of public roads/urban and natural environment, the values were above 0.36 for almost all items; and finally, for the dimension of transport infrastructure and other services, the ICC was higher than 0.61 in all variables (Table 3). In the general assessment questions (global assessment by dimension, importance, general satisfaction, satisfaction in quality/price, intention to recommend it, and intention to return), an ICC greater than 0.61 was obtained in all the items of the quantitative variables and a kappa index higher than 0.70 for the qualitative ones (Table 4).
Table notes: b: arithmetic mean ± SD, unweighted. c: the kappa was calculated after eliminating the case (3.3%) that did not answer in the initial assessment.

Internal Reliability
The internal consistency of the questionnaire measured with Cronbach's alpha coefficient indicates a high correlation between the items. The global values were as follows: accommodation, 0.871; restoration, 0.919; spa, 0.833; sports facilities, 0.907; leisure/culture/shopping, 0.830; public roads/urban and natural environment, 0.869; and transport infrastructure and other services, 0.803 (Table 5).
By item (in relation to its dimension), we found a corrected item-total correlation within the accommodation dimension for "cleanliness and hygiene" of 0.756 and for "aesthetics" of 0.730 as the highest values. The lowest value was obtained by "facility for children and the disabled" at 0.389. For the restoration dimension, the value was 0.840 for "food quality" and 0.829 for "value for money" as high values, and 0.664 for "environment" as the lowest value. For the spa dimension, these values were between 0.669 for "cleanliness" and 0.425 for "disabled access". In the sports facilities dimension, we obtained 0.879 for "other sports facilities" and the lowest value of 0.533 for "children". In the case of the leisure/culture/shopping dimension, the highest value was 0.786 for "shops", and the lowest value was 0.357 for "tourist office". In public roads/urban and natural environment, we obtained 0.705 for "sense of security" and 0.527 for "people's attitude". Finally, in the case of the dimension of transport infrastructures and other services, the highest value was 0.710 for "post office", and the lowest values were 0.309 for "taxis" and 0.449 for "roads" (Table 5).

Factor Analysis and Calculation of Dimensions
The KMO test, which compares the Pearson correlation coefficients between each pair of variables with their respective partial correlation coefficients and whose values range between 0 and 1, gave an excellent value of sampling adequacy for the weighted sample, KMO = 0.923 (Table 6). The Bartlett sphericity test, whose null hypothesis is that there is no correlation between variables, gave a p value <0.001, so the null hypothesis of no correlation between variables is rejected, which indicates the adequacy of the CFA (Table 6).
Table 6. Diagnostic criteria of the factor analysis (n = 725).
The Cattell slope test (Figure 1) indicates that eight factors should be extracted, coinciding with the theoretical model (38). After the Varimax orthogonal rotation, a grouping of variables was obtained that distributed the weights over the dimensions indicated in the theoretical model, with slight imbalances in the variables that distributed a similar weight in two different factors, such as the variables ease for children and the disabled, medical service, complementary services, people's attitudes, and natural environment. The infrastructure and other services dimension was divided into two, leaving, on the one hand, roads and other services and, on the other, the variables referring to transport, such as taxis, urban buses, and line buses (Table 7). The variance explained with rotated factors (Varimax rotation) is shown in Table 8.

Validity of the Construct
The association (Pearson correlation) between the global assessment by dimension, general satisfaction, and the score of the main questionnaire (n = 725) in the weighted sample was statistically significant in all cases (p < 0.001). Pearson's linear correlation coefficient r showed high positive values (for example, 0.91 for the restoration dimension and 0.69 for transport infrastructure and other services), which indicates a close and direct association between them (Table 9). The association (comparison of means) between the variables of future intention and the scores estimated with the main questionnaire of 52 items (n = 725) in the weighted sample was also significant (p < 0.001), with greater satisfaction among those who intended to recommend and to return (Table 10).

Discussion and Conclusions
The questionnaire had three structured sections for the purpose of measuring the satisfaction of spa tourists with general identification questions, specific questions of satisfaction, their probability of returning, and their probability of recommending the destination. The second section, made up of 52 items, was based on a seven-point scale from delighted to terrible, measuring seven factors, namely, accommodation, restaurants, spa, sports facilities, leisure/culture/shopping, public roads/urban and natural environment, and infrastructures and other services.
Our research coincides with other studies that have identified specific attributes of the destination at a lower cost [11,13].
The reference model was taken into account and contrasted with the suggestions of the experts for adaptation to the context, finding a good general fit with the theoretical model and some particularities, such as the variables of ease for children and the disabled, medical service, complementary services, the attitude of the people, and the natural environment, which distributed their weight in two different factors and which the authors define as complex variables [46]. The dimension "infrastructures and other services" was divided into two, leaving, on the one hand, roads and other services and, on the other, the variables related to transport, such as taxis, urban buses, and scheduled buses. Otero and Otero [13] justify the categorization of the items based on the advantage of considering that the dimensions within several factors obtain more useful information to carry out quality policies, since it allows the identification of those responsible. On the other hand, it would be logical to disaggregate transport infrastructures and other services based on the practical knowledge acquired in this work. The CFA for five dimensions explains 48% of the total variance compared to the study by Otero and Otero [13], where it only explains 36%. However, in the analysis of the results with eight components, it explains 65.1% of the total variance.
All this questions the validity of the construct. However, the results of the global assessment obtained by dimension in the weighted sample corresponding to the single question per dimension of the questionnaire and different from the one estimated from the main questionnaire of 52 items, as well as the general questions regarding general satisfaction, stood out with higher average values. The study found statistical significance between the highest satisfaction values and the variables of future intention. Other investigations obtained similar results [11,13,32,[39][40][41]. The validated questionnaire of satisfaction on sun and beach tourism in Costa del Sol is adapted to the spa sector, adjusting to the seven initial dimensions, except for transport infrastructures, which is divided into two, on the one hand, roads and public telephones, and on the other hand, taxis and buses.
The findings of this research show a solid instrument that can be used by different institutions to replicate and use as a barometer of spa tourism and culturally adapted to be used for similar segments. With this instrument, one can obtain both the level of satisfaction and the factors associated with it, and even segment the market.
The analysis is limited to spas located in Southern Spain, so the results should be extrapolated with caution to other spas in Spain or in countries with similar climatic conditions.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.