Examining the Feasibility and Acceptability of Valuing the Arabic Version of SF-6D in a Lebanese Population

Objectives: The SF-6D is a preference-based measure of health developed to generate utility values from the SF-36. The aim of this pilot study was to examine the feasibility and acceptability of using the standard gamble (SG) technique to generate preference-based values for the Arabic version of SF-6D in a Lebanese population. Methods: The SF-6D was translated into Arabic using forward and backward translations. Forty-nine states defined by the SF-6D were selected using an orthogonal design and grouped into seven sets. A gender-occupation stratified sample of 126 Lebanese adults from the American University of Beirut were recruited to value seven states and the pits using SG. The sample size is appropriate for a pilot study, but smaller than the sample required for a full valuation study. Both interviewers and interviewees reported their understanding and effort levels in the SG tasks. Mean and individual level multivariate regression models were fitted to estimate preference weights for all SF-6D states. The models were compared with those estimated in the UK. Results: Interviewers reported few problems in completing SG tasks (0.8% with a lot of problems) and good respondent understanding (5.6% with little effort and concentration), and 25% of respondents reported the SG task was difficult. A total of 992 SG valuations were useable for econometric modeling. There was no significant change in the test–retest values from 21 subjects. The mean absolute errors in the mean and individual level models were 0.036 and 0.050, respectively, both of which were lower than the UK results. The random effects model adequately predicts the SG values, with the worst state having a value of 0.322 compared to 0.271 in the UK. Conclusion: This pilot confirmed that it was feasible and acceptable to generate preference values with the SG method for the Arabic SF-6D in a Lebanese population. 
However, further work is needed to extend this to a more representative population, and to explore why no utility values below zero were observed.


Introduction
The rapid growth of medical technologies and treatments increasingly requires cost-utility analyses (CUA) and cost-effectiveness analyses (CEA) to decide on the optimal treatment for every health condition [1]. Agencies that advise on reimbursement, such as the National Institute for Health and Care Excellence (NICE), commonly require health-related quality of life (HRQoL) outcomes expressed as quality-adjusted life years (QALYs) derived from preference-based measures as part of the decision-making process [2,3]. To generate QALYs, HRQoL must be valued on a scale anchored at 1 for full health and 0 for death, by eliciting preferences from the general population [4,5].
The SF-36 [23,24] served as the base for the SF-6D used in several valuation studies. Previously, a scoring algorithm for the SF-6D has been derived from the general UK population using the standard gamble (SG) technique [12]. This has been used to elicit values for several countries, including China [19], Japan [20], Hong Kong [21], Brazil [22], Portugal [23], and Australia [24]. There were significant differences between UK values and the other countries, which suggested cultural differences in values. It is likely that significant differences in the preferences for different health states may also exist in a Lebanese population relative to other populations. To the best of our knowledge, there have been no studies in any Middle Eastern country to elicit valuations for the SF-6D health states.
However, a few steps must be taken before applying the SF-6D to the Lebanese population. First, it is imperative to confirm the ability to generate a Lebanese preference-based valuation for the multi-dimensional SF-6D health states using preference elicitation tasks. Second, it is compulsory to check if a valuation of the representative population can be used to produce a scoring algorithm to generate utility values for all possible SF-6D states. Since most studies using SG have been conducted on Western populations, little is known about its feasibility, validity, and reliability in Middle Eastern populations including the Lebanese population.
Thus, the aim of this pilot study is to examine the feasibility and acceptability of valuing the Arabic SF-6D in a Lebanese population using the SG method. If the results are positive, preference-based measures of health such as the SF-6D could be valued by the Lebanese population to generate a definitive value set for Lebanon. This may enable the inclusion of the Lebanese population in global and multi-ethnic pharmacoeconomic evaluation studies.
In the following sections, we describe the methods of the SF-6D valuation survey and the data collection process, and outline the modeling of the valuation data. We then present our findings in Section 3 and finish with a discussion of the results and their implications, and a brief consideration of possible future studies.

The SF-6D
The SF-6D is derived from the SF-36. It is composed of six health dimensions, including physical functioning, role limitation, social functioning, bodily pain, mental health and vitality, each having between four and six levels [12]. Defining a health state requires choosing a level from each dimension, hence creating 18,000 possible combinations. Since every possible health state is described by six digits, from 1 to 6, the perfect health state (full health) is indicated by the combination 111,111, whereas the "pits" (worst health state) is indicated by 645,655.
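The count of 18,000 states follows directly from the number of levels per dimension. The short sketch below is an illustration only: it reads the level counts off the "pits" state 645,655, and the dimension labels are shorthand chosen here rather than names taken from the instrument.

```python
from itertools import product

# Levels per dimension, read off the "pits" state 645,655
# (order: PF, RL, SF, pain, MH, VIT). Labels are illustrative shorthand.
LEVELS = {"PF": 6, "RL": 4, "SF": 5, "PAIN": 6, "MH": 5, "VIT": 5}


def all_states():
    """Yield every SF-6D state as a six-digit string, e.g. '111111'."""
    ranges = [range(1, n + 1) for n in LEVELS.values()]
    for combo in product(*ranges):
        yield "".join(str(level) for level in combo)


states = list(all_states())
n_states = len(states)                 # 6 * 4 * 5 * 6 * 5 * 5
best, worst = states[0], states[-1]    # full health and the "pits"
```

Here `n_states` is the product of the six level counts, and the first and last states enumerated are full health and the "pits".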

Subjects
Lebanese adults aged between 18 and 70 years old were recruited at the American University of Beirut (AUB), stratified based on gender (male/female) and on occupation (faculty/staff and employees/students). Potential participants were contacted by phone and/or email to schedule an interview session. However, those who could not be reached after two attempts and those unwilling to participate were excluded from the study.
As this was a feasibility study, a formal sample size calculation was not undertaken. Previous experiences with the SF-6D have shown that 15 observations per health state are adequate to estimate a new model [19]. Hence, a total of 126 people, 21 in each of the six gender-occupation groups, were interviewed out of 170 initially contacted potential participants, thus giving a response rate of 74%. Each one of the seven sets of health states (see below for further details) was valued by three respondents from every group (gender-occupation) for a total of 18 valuations per health state. In order to assess the reliability of the questionnaire, a random sample of 21 participants across all six groups was interviewed a second time 2-4 weeks after the initial interview.

Data Collection Procedure
An Arabic version of the original SF-6D Health Survey was developed by forward and backward translations using professional translators. The latter has been done in collaboration with a team in Egypt and the United Arab Emirates (UAE) [25], for which the English equivalence has been approved by the developers Brazier and Kharroubi. Given that the SF-6D is known to be an elaborated descriptive system, with 18,000 possible outcomes, a sample of 49 health states was generated using the orthoplan procedure in SPSS (SPSS Inc., Chicago, IL, USA). For the sake of future comparison, the 49 health states chosen were the same used in the feasibility study of Chinese SF-6D valuation by Lam et al. [19], and which included every level of every dimension at least once. Those states were then distributed over seven sets, each containing seven health states each represented by a six-digit number, where each digit denotes a level from the SF-6D dimensions in the following sequence: Physical functioning (PF), role limitation (RL), social functioning (SF), pain, mental health (MH), and vitality (VIT). In addition, each respondent valued "pits" (worst health state).
Interview sessions took place between late January 2019 and early March 2019. The interview officially started after briefly explaining the study to the participants and obtaining their written consent. The sets of health states were used in a rotational manner to reduce the interviewer learning effect. The interview session followed a certain sequence of events, where the subject was asked to: (1) Answer the Arabic version of the SF-6D; (2) rank eight health states (the set of seven health states and the "pits" state); (3) value the seven health states and the pits ranked using the SG technique used by Brazier et al. [12] in a random order to reduce the bias effect that could arise from the order of the states; (4) provide some information about their demographics; and (5) fill in an evaluation survey about the interview. The study has been ethically approved by the Institutional Review Board (IRB) at the AUB.
The interview protocol was analogous to the one applied in the UK valuation study [12], to allow a fair comparison across the two valuation studies. Each respondent was asked to rank and value eight health states using the McMaster 'ping pong' variant of the SG [26]. The SG technique asked the respondents to value seven of the eight SF-6D health states against the perfect health state and the "pits" state. Respondents were then asked in the eighth SG question to value "pits". Depending on whether they thought this state was better or worse than death, they were asked to consider one of the following choices: (i) the certain prospect of being in the "pits" state and the uncertain prospect of perfect health or immediate death; or (ii) the certain prospect of death and the uncertain prospect of perfect health or the "pits" state. The chance of the best outcome occurring is varied until the respondent is indifferent between the certain and uncertain prospects. The negative of the indifference probability of the best outcome is used to value states worse than death, which has the effect of bounding negative values at −1 [27]. The other seven health states were then chained onto the zero to one scale, where 0 is given to states perceived as equivalent to being dead and 1 is given to perfect health [12]. Having valued the "pits" state (P), the seven intermediate SF-6D health state valuations (SG) are adjusted using the formula SG + (1 − SG) × P, where the best SF-6D state is 1 and death is 0, for use in the modelling.
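The chaining step can be illustrated with a minimal sketch. It assumes that an intermediate state's raw SG value lies on a scale where "pits" is 0 and perfect health is 1, and that P is the value of "pits" on the death (0) to perfect health (1) scale. The function names and the example numbers are hypothetical and are not taken from the study data.

```python
def pits_value(indifference_prob, worse_than_death=False):
    """Value of "pits" on the death (0) to perfect health (1) scale.

    If "pits" is judged worse than death, the negative of the
    indifference probability is used, so the value is bounded at -1.
    """
    if not 0.0 <= indifference_prob <= 1.0:
        raise ValueError("indifference probability must lie in [0, 1]")
    return -indifference_prob if worse_than_death else indifference_prob


def chain(sg, p):
    """Adjust a raw SG value using SG + (1 - SG) * P."""
    return sg + (1.0 - sg) * p


# Hypothetical respondent: "pits" valued at 0.3, one state with raw SG 0.6.
p = pits_value(0.3)
adjusted = chain(0.6, p)   # 0.6 + 0.4 * 0.3
```

A raw SG of 0 maps to P itself and a raw SG of 1 stays at 1, so the adjusted values share the same scale as the "pits" valuation.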
The interview material was in Arabic and the interviews were conducted by a trained interviewer, who after the interview reported their views on the understanding, effort, and concentration of the subject. The respondent also reported how they found the SG tasks.

Patient and Public Involvement
Patients or the public were not involved in the design, or conduct, or reporting, or dissemination of the research.

Data Analysis and Outcome Measures
This study evaluated different aspects of the Arabic SF-6D. First, the feasibility of the health survey was assessed by (1) the completion rate of the interviews; (2) the percentage of states with useable values; (3) the duration of the interview; (4) respondent understanding, effort, and concentration as reported by the interviewer; and (5) respondents' own rating of how they found the SG tasks, including their effort, frustration, and boredom.
Data were considered unusable if the results obtained from a respondent met any one of the following three conditions: (1) all health states had the same valuation; (2) fewer than two health states were valued; or (3) the pits state was not valued. The valuation of the pits state was essential in order to chain the respondents' health state values onto the full health-death scale, where full health had a value of 1, dead had a value of zero, and any negative value was bounded by −1. These adjusted SG values form the dependent variable (y) in the models discussed below.
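The three exclusion rules can be written as a small filter. This is a sketch under stated assumptions: the data layout (a list of seven SG values with `None` for a state that was not valued, plus a separate pits value) and the function name are hypothetical, and rule (1) is applied across all valued states including the pits.

```python
def is_usable(state_values, pits):
    """Apply the three exclusion rules to one respondent's SG data.

    state_values: SG values for the seven states (None = not valued).
    pits: SG value for the "pits" state, or None if it was not valued.
    """
    valued = [v for v in state_values if v is not None]
    if pits is None:                      # rule (3): pits not valued
        return False
    if len(valued) < 2:                   # rule (2): fewer than two states valued
        return False
    if len(set(valued + [pits])) == 1:    # rule (1): identical valuations throughout
        return False
    return True


# Hypothetical respondents, for illustration only.
ok = is_usable([0.9, 0.8, 0.75, 0.6, 0.6, 0.5, 0.4], 0.3)
flat = is_usable([0.5] * 7, 0.5)
```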
The test-retest reliability of the survey was assessed by analyzing the results obtained from the 21 re-interviewed subjects using the mean difference between test and retest results (statistical significance tested by paired t-test), and intraclass correlation (ICC) calculated using the two-way mixed effects model where respondents' effects were considered as random and interviewers' effects were fixed. As for the validity of applying standard modelling techniques, this was assessed by fitting the models to Lebanese SG data and comparing predictive ability and consistency of the model coefficients with the results of the UK SF-6D.
To understand the size and potential importance of differences between the UK value set and this Lebanese population, we also compared the distribution of values, the mean health state values for the 39 common states, and their intraclass correlation. We also compared the ranking of the coefficients from the models.

Modelling
The modelling methods followed the same methods as the UK study [12]. Models have been estimated at the aggregate level; that is, the explanatory variables were used to estimate the mean value given to each of the states by the respondents that valued them (the mean level model). Models have also been estimated at the individual level that takes into account the variation both within and between respondents using a random effects (RE) model.
The general model for health state valuations, following [12], is:

$$y_{ij} = g(\beta' x_{ij} + \theta' r_{ij} + \delta' z_j) + \varepsilon_{ij}$$

where $i = 1, 2, \ldots, n_j$ indexes individual health state values, $j = 1, 2, \ldots, m$ indexes individual respondents, $g$ is a function specifying the appropriate form, and $\varepsilon_{ij}$ is an error term whose properties depend on the assumptions of the model [12]. The dependent variable, $y_{ij}$, is the adjusted SG score for health state $i$ valued by respondent $j$, and $x$ is a vector of binary dummy variables for each level $\lambda$ of dimension $\delta$ of the descriptive system, where the best level of each dimension represents the baseline for that dimension. For example, $x_{32}$ denotes dimension $\delta = 3$ (social functioning), level $\lambda = 2$ (health limits social activities a little of the time). For any given health state, $x_{\delta\lambda}$ is defined as:
$x_{\delta\lambda} = 1$ if, for this state, dimension $\delta$ is at level $\lambda$
$x_{\delta\lambda} = 0$ if, for this state, dimension $\delta$ is not at level $\lambda$
In all, there are 25 of these terms. Hence, for a simple linear model, the intercept represents state 111,111, and summing the coefficients of the 'on' dummies derives the value of all other states. The $r$ term is a vector of terms to account for interactions between the levels of different attributes. However, given the small sample size, we did not look at interaction terms here. Finally, $z$ is a vector of respondent level characteristics such as age, sex, or socio-economic factors.
Mean level models were estimated using ordinary least squares (OLS), and random and fixed effects models were also estimated using generalized least squares (GLS) and maximum likelihood estimation in order to take into account repeated observations for each individual [12]. For the random effects (RE) model the error term, $\varepsilon_{ij}$, is subdivided as follows:

$$\varepsilon_{ij} = u_j + e_{ij}$$

where $u_j$ represents the individual random effect, assumed to be random across individual respondents, and $e_{ij}$ represents the random error term for the health state valuation $i$ of individual $j$.
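To make the dummy coding concrete, the sketch below builds the 25 level dummies for a state and sums the 'on' coefficients in a simple linear model. It is an illustration under assumptions: the term labels (e.g. `PF6`) mirror those used when reporting coefficients, the helper names are hypothetical, the intercept default of 1 follows the restriction of the constant to unity reported in the results, and the coefficient dictionary in the example contains only the PF6 estimate quoted later for the Lebanese RE model.

```python
# Levels per dimension in the order PF, RL, SF, pain, MH, VIT.
LEVELS = [("PF", 6), ("RL", 4), ("SF", 5), ("PAIN", 6), ("MH", 5), ("VIT", 5)]

# One dummy per non-baseline level: 5 + 3 + 4 + 5 + 4 + 4 = 25 terms.
TERMS = [f"{dim}{lvl}" for dim, n in LEVELS for lvl in range(2, n + 1)]


def dummies(state):
    """Return the 25 binary dummies for a state such as '645,655'."""
    digits = [int(c) for c in state.replace(",", "")]
    if len(digits) != len(LEVELS):
        raise ValueError("a state needs exactly six digits")
    on = set()
    for (dim, n_levels), d in zip(LEVELS, digits):
        if not 1 <= d <= n_levels:
            raise ValueError(f"level {d} is out of range for {dim}")
        if d > 1:                      # level 1 is the baseline, no dummy
            on.add(f"{dim}{d}")
    return [1 if term in on else 0 for term in TERMS]


def predict(state, coefs, intercept=1.0):
    """Simple linear prediction: intercept plus the 'on' coefficients."""
    return intercept + sum(coefs.get(t, 0.0) * x
                           for t, x in zip(TERMS, dummies(state)))


# Illustration with a single coefficient; all others are treated as 0 here.
value = predict("611,111", {"PF6": -0.173})
```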
The models were evaluated considering the following criteria: (1) Inconsistencies in the estimated coefficients, as the coefficients of dummy variables representing each level of SF-6D are expected to be negative and increasing in absolute size as the level of severity increases (amongst coefficients with statistical significance); (2) adjusted $R^2$, mean absolute error, and the proportion of predictions outside 0.05 (% absolute error > 0.05) and 0.10 (% absolute error > 0.10) ranges on either side of the observed value. Predictions were further tested in terms of bias (t-test). Analysis was performed using SPSS version 24.0 (SPSS Inc., Chicago, IL, USA) (Statistical Package for Social Sciences. Available from: http://www.spss.com/software/) and R 2.9.1 (R Development Core Team, Vienna, Austria) (R: A language and environment for statistical computing. R Foundation for Statistical Computing. Available from: www.R-project.org).
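The predictive-ability criteria in (2) reduce to a few lines of arithmetic. The following sketch computes the mean absolute error and the share of predictions more than 0.05 and 0.10 away from the observed value. The function name and the example values are hypothetical and are not study data.

```python
def fit_summary(observed, predicted):
    """Mean absolute error and % of absolute errors above 0.05 and 0.10."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(o - p) for o, p in zip(observed, predicted)]
    n = len(errors)
    return {
        "mae": sum(errors) / n,
        "pct_abs_err_gt_0.05": 100.0 * sum(e > 0.05 for e in errors) / n,
        "pct_abs_err_gt_0.10": 100.0 * sum(e > 0.10 for e in errors) / n,
    }


# Made-up mean observed and predicted values for four states.
summary = fit_summary([0.80, 0.60, 0.40, 0.30], [0.82, 0.52, 0.55, 0.30])
```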

Participants
One hundred and twenty-six participants were recruited from AUB and belonged to one of three categories: (1) faculty, (2) staff, and (3) students. The mean age of the participants was 32.45 years, which is very close to the mean age of the Lebanese general adult population (31 years) [28]. The gender distribution (male/female) of the subjects (50.8%/49.2%) was in line with that of the general population (50.2%/49.8%). However, 88.7% of the participants hold a degree or above, far more than in the general population (13.8%), and 71.3% have a total household income higher than 2200 USD. The discrepancy in educational level and the high total household income are due to our sample being recruited from an educational institution. As for housing, a large proportion of respondents (41.9%) live with their parents, since one third of our data was collected from students. The marital status is consistent with that of the general population, since 63.7% of the Lebanese are listed as single. More information about the sociodemographic characteristics of the interviewed population is available in Table 1.

Feasibility and Acceptability
The 126 recruited participants completed all parts of the questionnaire, thus providing a 100% completion rate. Two subjects out of the 126 participants gave the same valuation for all eight valued health states, including the pits, and were excluded from the data analysis. No respondents were excluded for failing to value two or more health states or for failing to value the pits state. The mean time for completing the whole interview was 26.98 min (SD 8.62, range 11 to 70).
Table 2 shows the interviewer and respondent evaluations of the process of the ranking and SG exercises. According to interviewers' evaluations, the vast majority of respondents had no problems or only some problems performing and concentrating on the ranking task (over 99% and 94.4%, respectively) and SG (over 99% and 93.5%). In regard to the respondents' evaluations, almost all of the respondents (92.7%) mentioned trying their best in answering the questionnaire. Half of the respondents (50.0%) said that they considered three or more dimensions in the SG decision, indicating that many were not lexicographic in their preferences. None of the respondents found the task very difficult and none thought the quality of their answers was poor. The process was acceptable to most subjects, with 75.0% evaluating the ranking and SG tasks as easy or neutral and 79.0% reporting no degree of irritation or boredom during any of the ranking or SG exercises.

Test-Retest Reliability
Three to four weeks after the first interview, a random pool of 21 respondents was selected for a repeat interview, in order to check the reliability of the questionnaire. The ranking of the best health state card as the top was consistent in both interviews for all 21 respondents. On the other hand, six (28.6%) respondents reversed the order of the pits and severe cards between the first and second interviews (two ranked a severe card the lowest in the first interview but the pits health state lowest in the second interview; and four ranked the pits health state the lowest in the first interview but a severe card the lowest in the second interview). There were 168 paired health state values for the assessment of the test-retest reliability. The mean difference of SG valuations between baseline and post-test was 0.0092 (95% CI −0.02, 0.04), which was not statistically significant by the paired t-test (t = 0.549, p = 0.583). The ICC of SG valuations between baseline and post-test was 0.667 (95% CI 0.55, 0.75), which was close to the standard of 0.7 for group comparison [29].

SF-6D Valuation
Each of the 126 subjects valued seven health states from the SF-6D in addition to "pits", resulting in a total of 1008 health state valuations (882 observations for the health states and 126 for "pits"). The number of observations was evenly distributed across the 49 health states selected by orthoplan using SPSS. All 126 participants were able to value the seven health states in addition to "pits". However, two of them provided the same valuation for all eight health states (including "pits"). Therefore, in total, we had 992 (98.4%) useable observations (868 observations for the health states and 124 for "pits"). Table 3 shows the mean with SD, median, minimum, and maximum values and the number of usable values of the 49 SF-6D health states valued in the sample and "pits". These results were compared to the valuations from the UK study. However, it is important to note that some health states valued in this pilot study were not part of the original UK study (health states 124,125, 135,312, 212,145, 221,452, 334,521, 425,131, 432,621, 523,551, 534,113, and 611,221), and hence their corresponding cells in Table 3 were left empty.
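The observation counts in this paragraph can be checked with simple arithmetic. The sketch below uses only the figures reported in the text; the variable names are illustrative.

```python
N_RESPONDENTS = 126          # interviewed participants
STATES_PER_RESPONDENT = 7    # SF-6D states valued by each, besides "pits"
N_STATES = 49                # states selected by the orthogonal design
N_EXCLUDED = 2               # respondents with identical valuations throughout

total_obs = N_RESPONDENTS * (STATES_PER_RESPONDENT + 1)   # states plus "pits"
obs_per_state = N_RESPONDENTS * STATES_PER_RESPONDENT // N_STATES

usable_respondents = N_RESPONDENTS - N_EXCLUDED
usable_state_obs = usable_respondents * STATES_PER_RESPONDENT
usable_pits_obs = usable_respondents
usable_obs = usable_state_obs + usable_pits_obs
usable_pct = round(100.0 * usable_obs / total_obs, 1)
```

Under these figures each of the 49 states receives the same number of valuations before exclusions, which matches the design described in the Subjects section.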
It can be seen that the observed values for the "pits" state (645,655) in this pilot ranged between 0.100 and 0.… The left skewness in elicited utility values at the individual level is shown in the histogram for the 992 individual health state values in Figure 1. There are no negative values observed in the Lebanese sample. However, a large proportion of the values were above 0.9 (19%), as also observed in the UK (23%). There were no utility values at 1.0, which indicates that all participants were willing to risk a worse health state to have a chance for a better state.

Modelling
The models used for the analysis were random effects (RE) models at the individual level and the ordinary least squares (OLS) model at the aggregate level (using the mean values of the 50 valued health states). In both models, the constant was restricted to unity. The beta coefficients estimated for each level in every dimension, the model predictive ability (MAE and number of absolute errors greater than 0.05 or 0.10), and the number of inconsistent preference-based coefficients are presented in Table 4. Our results are compared to those of the UK valuation study by Brazier et al. [12]. Coefficients found to be significantly different from zero at the α = 0.05 level are marked in bold. For the RE model, 17 out of the 25 parameters were significant. At the α = 0.10 level, an additional parameter (VIT4) became significant, giving a total of 18 out of 25 significant parameters. The parameter estimates for physical functioning and social functioning were very similar to those of the UK. For instance, the coefficient for PF6 was −0.173 compared to −0.160 in the UK, and that of SF5 was −0.116 compared to −0.109. However, there were marked differences in the coefficients for pain across all levels, with level 6 scoring −0.093 in Lebanon compared to −0.178 in the UK. There were smaller but nonetheless potentially important differences in the coefficients for the other dimensions. Ranking the dimensions by the size of their decrements in the Lebanese model puts PF first, followed by RL, SF, Pain, MH, and VIT. This contrasts with the UK, which also had PF first, but followed by Pain, MH, SF, VIT, and RL. All coefficients in the UK study were negative in the RE model [12]. In our study, two parameters had positive coefficients (PAIN3 and VIT2), neither of which was statistically significant. The MAE in the RE model for Lebanon was better than that of the UK: 0.050 compared to 0.078.
Two significant inconsistent coefficients were found, where the estimated effect decreases from level 2 to level 3 for the physical functioning (i.e., PF2 (−0.061) vs. PF3 (−0.056)) and level 2 to level 3 for the role limitations (i.e., RL2 (−0.057) vs. RL3 (−0.039)) in the RE model. However, the UK model had four such inconsistencies.
As for the OLS mean model, the UK study observed 23 significant parameters, while we observed 14 out of 25 parameters to be significant at the α = 0.05 level. At the α = 0.10 level, an additional parameter (RL4) became significant, for a total of 15 out of 25 parameters. This smaller number may have been a result of a much smaller sample size. The UK mean model had two positive coefficients (PF3 and PAIN2), whereas in the Lebanese mean model VIT3 was positive. The MAE for Lebanon was smaller than that of the UK, 0.036 and 0.074, respectively, as is the case with the RE model. Again, there were important differences in the parameter coefficients estimated from the OLS mean model for Lebanon compared to those of the UK. This time the ordering of decrements was PF followed by SF, Pain, MH, RL, and VIT. The UK mean model places pain at the top, followed by MH, PF, VIT, SF, and RL. This suggests there may be major differences in the relative weights for these dimensions.
Overall, the models on the Lebanese valuation data performed well. Two significant inconsistent coefficients were found in the OLS mean model, where the estimated effect decreases from level 2 to level 3 for role limitations (i.e., RL2 (−0.049) vs. RL3 (−0.004)) and from level 4 to level 5 for mental health (i.e., MH4 (−0.098) vs. MH5 (−0.064)). In comparison, the UK model had five such inconsistencies. The adjusted $R^2$ of the OLS mean model for Lebanon was almost double that of the UK, 0.950 and 0.508, respectively. Figure 2 presents the actual and predicted valuations for the RE model for the 49 valued health states and the pits. The RE model predicts the observed health state values quite well and, in contrast to the UK model, does not seem to suffer from the tendency to over predict at low health state values (i.e., poor health states).

Int. J. Environ. Res. Public Health 2020, 17, 1037
Figure 2. Actual and predicted valuations for the RE model, plotted by six-digit health state code.
Finally, there is a key finding from the models that is worth mentioning. Namely, for the Lebanese sample, there is almost no discrimination in preference-based coefficients as a function of severity for either the pain or vitality dimensions. For both of these dimensions, coefficients for all but the most severe level of each are not statistically significant (and have small magnitude), and even the most severe level of vitality is not statistically significant for the aggregate model. In the results above, we pointed to the fact that there were fewer inconsistencies in magnitude of coefficients across dimension severity levels for the Lebanese sample than for the UK sample in the models, but we only focused on statistically significant coefficients in this count.
Whether we use the actual values of non-significant coefficients, or if we just consider their values to be 0, there are many more inconsistencies for the Lebanese sample. We consider this finding in more detail in Section 4.

Discussion
Health state valuation is a relatively new research area in the Middle East, with two studies investigating the validity and reliability of the Arabic version of the EQ-5D-3L in Jordan and Saudi Arabia [30,31] and only one study focused on testing the feasibility of eliciting EQ-5D-5L values from a general public sample in the UAE [32]. The Arabic translation of EQ-5D-3L appeared to be valid and reliable in measuring the quality of life in Jordanian and Saudi people. In addition, results suggested that it is feasible to generate meaningful health-state values in the UAE and most of the respondents stated that their religious beliefs influenced their responses to the valuation tasks.
The results of this pilot study supported the feasibility and acceptability of using the SG method to generate health state utility values for the SF-6D in Lebanon, to generate QALYs, and hence to conduct cost utility analysis of health care interventions. The Lebanese SF-6D preference weights estimated here offer a method for producing utility values from existing SF-36 data. We believe that using SF-6D health state preference values from the Lebanese population to conduct cost-effectiveness studies in Lebanon is more appropriate than using values obtained in other countries. The results from our study sample were positive in a sense that preference-based measures of health such as the SF-6D could be adapted nationally to the Lebanese population who can then be included in global and multi-ethnic pharmacoeconomic studies.
After excluding unusable data from two participants, we had a completion rate of 98.4%, which is much higher than in the UK study, where 36.8% of respondents were excluded from the data analysis [12]. This may be because we had a well-educated sample and only two well-trained interviewers, who made sure that respondents valued every health state. The Lebanese values were higher than those of the UK for 36 out of the 39 comparable states. Furthermore, the ordering of the dimension coefficients indicates that, relative to the UK population, the Lebanese sample gave more weight to PF, RL, and SF than to pain and MH. This points to possibly important differences in health preferences between the two cultures.
The mean values for the 49 valued states were broadly consistent with the severity of the health state. That is, the mean value decreased as the number of dimensions at severe levels increased, as summarized by the misery score, which is the sum of the severity levels across all dimensions (e.g., the misery score for state 511114 = 5 + 1 + 1 + 1 + 1 + 4 = 13). For instance, health state 511114 had a mean value of 0.858, while state 512242 had a value of 0.603.
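The misery score described above can be illustrated with a minimal sketch. The function name `misery_score` and the representation of a state as a six-character string are assumptions made for this illustration; they are not taken from the study's analysis code.

```python
def misery_score(state: str) -> int:
    """Sum the severity levels of a six-digit SF-6D state code.

    Hypothetical helper for illustration: each character of `state`
    is read as the severity level of one of the six dimensions.
    """
    if len(state) != 6 or not state.isdigit():
        raise ValueError(f"expected a six-digit state code, got {state!r}")
    return sum(int(level) for level in state)


# The two states quoted in the text.
print(misery_score("511114"))  # 5 + 1 + 1 + 1 + 1 + 4 = 13
print(misery_score("512242"))  # 5 + 1 + 2 + 2 + 4 + 2 = 16
```

In this example the state with the higher misery score (512242) is also the one with the lower mean value (0.603 versus 0.858), which matches the pattern described in the text.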
The performance of the Lebanese models was compared with that of the UK models. Both Lebanese models performed well relative to their UK counterparts, with an MAE of 0.036 compared to 0.074 and an adjusted R² of 0.950 compared to 0.508 for the OLS mean model, and an MAE of 0.050 compared to 0.078 for the RE model. These results support the validity of the preference-based valuation of the SF-6D by SG in a Lebanese population for the generation of scoring algorithms applicable to that population.
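For readers unfamiliar with the MAE used in this comparison, the following minimal sketch shows how it is computed as the average absolute difference between observed and predicted state values. The function name is an assumption for this illustration, and the predicted values below are hypothetical numbers chosen only to demonstrate the calculation; they are not model outputs from this study.

```python
def mean_absolute_error(observed, predicted):
    """Average absolute difference between paired observed and predicted values."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# Observed mean values for two states quoted in the text.
observed = [0.858, 0.603]
# Hypothetical predictions, for illustration only (not from the fitted models).
predicted = [0.840, 0.630]

mae = mean_absolute_error(observed, predicted)
print(f"MAE = {mae:.4f}")
```

A lower MAE means the model's predicted values lie closer, on average, to the observed values.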
However, the UK model had four such inconsistencies. Two more significant inconsistent coefficients were found in the OLS mean model, where the estimated effect decreases from level 2 to level 3 for role limitations (i.e., RL2 (−0.049) vs. RL3 (−0.004)) and from level 4 to level 5 for mental health (i.e., MH4 (−0.098) vs. MH5 (−0.064)), whereas the UK model had five such inconsistencies. These results further support the validity and quality of the data from the Lebanese population, and they are promising given the smaller size of the Lebanese sample compared to the UK.
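The notion of an inconsistency used here can be sketched as a simple check: a worse severity level should carry a decrement at least as large as the level below it, so a coefficient that is less negative at the higher level is flagged. The function name and the dictionary layout are assumptions for this illustration, and only the four coefficients quoted in the text are used.

```python
def find_inconsistencies(coefs):
    """Return (lower, higher) level pairs whose coefficients are inconsistent.

    `coefs` maps a severity level to its estimated decrement (a negative
    number). Consecutive levels present in the mapping are compared; a pair
    is flagged when the higher level has the smaller decrement.
    """
    levels = sorted(coefs)
    return [
        (lower, higher)
        for lower, higher in zip(levels, levels[1:])
        if coefs[higher] > coefs[lower]
    ]


# OLS mean model coefficients quoted in the text.
role_limitations = {2: -0.049, 3: -0.004}
mental_health = {4: -0.098, 5: -0.064}

print(find_inconsistencies(role_limitations))  # [(2, 3)]
print(find_inconsistencies(mental_health))     # [(4, 5)]
```

This sketch ignores statistical significance; the counts reported in the text consider only significant coefficients, so a fuller check would filter on significance first.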
The models show that two of the six dimensions, pain and vitality, have small and insignificant coefficients for all severity levels except the worst level(s), and for vitality there are no significant coefficients in the mean level model. While values on some dimensions may be expected to differ across cultures (e.g., the importance of social functioning could plausibly differ between cultures), we would expect pain severity in particular to affect health state preferences. This raises the question of whether the finding would be observed in a larger sample of respondents and in a valuation study including a larger number of health states. If the result is replicated in a larger study, it raises the further question of whether pain and vitality are relevant for inclusion in a preference-based measure in Lebanon, given that the Lebanese population would not be reporting that mild and moderate problems on these dimensions affect their utility. This is an important issue and is the subject of further work.
Limitations of this study include the use of a small sample of 126 people. This is much smaller than the UK study, which involved 611 people, and it may limit the generalizability of the preference values found. More sophisticated models could be tested with data obtained from a larger sample of health states (Kharroubi et al. [33–39]). However, this study aimed to test the feasibility and acceptability of using SG to value the SF-6D in Lebanon, in order to proceed with a valuation study involving a larger number of participants. Further research is underway to assess this.
In particular, an ongoing valuation study of a sample of 249 health states defined by the SF-6D, using standard gamble with a nationally representative sample of 577 participants matched to the national proportions of gender and age category across all Lebanese governorates, has very promising preliminary results. Upon completion, it would be the first valuation study of the SF-6D in the Middle East, and neighbouring countries could therefore benefit from this value set until similar studies are conducted in the region.
Whilst previous studies have found a relatively low number of utility values below zero elicited using the SG technique (for example, the UK valuation of the SF-6D using SG found that 7% of responses were below zero [12]), the complete lack of utility values below zero in this study is surprising. One possibility is that this was due to the small sample size, but a small number of values below zero might still be expected with the sample size analysed here. There are many potential reasons why no values below zero were observed, including attitudes to risk, characteristics of the particular survey sample, and the perceived severity of the states. There could be cultural or religious reasons why participants are not willing to say that a health state is worse than being dead. For these reasons, the interviewers may have been reluctant to move to the task for states worse than dead. Future research is recommended to explore this further, since this is the first valuation study conducted in the Lebanese population.
An additional concern is the understanding and concentration of study participants. Whilst the study participants reported that they tried their best to answer (92.7%) and few felt bored or irritated (21.0%), only 50% of participants considered three or more of the dimensions in making their choices, and 25.0% of participants found the task difficult. Some difficulty is to be expected given the complexity of the task. The fact that 50% of people considered two or fewer dimensions when making their choices may reflect a simplifying heuristic, also observed in discrete choice experiments, whereby participants focus on a small number of dimensions to make the tasks easier to complete. This question is rarely asked in valuation surveys, so it is not possible to say whether this is unusual. However, the interviewers' reports of participant effort, concentration, and problems in performing the tasks indicate that only a small number of participants had a lot of problems with the SG task (0.8%) or were reported to have shown little effort and concentration (6.5%).
Our study sample was recruited from AUB; hence, the majority of our respondents were well educated (88.7% held a degree or above). This raises a concern about whether SG is feasible and acceptable for use with people with low education levels, because SG requires the respondent to think in abstract terms of probability. Furthermore, the small number of health states valued could affect the accuracy of the econometric modelling. Overall, the SF-6D preference-based coefficients generated from this pilot study should not be regarded as necessarily representative of the general population of Lebanon. Further studies with a larger and more representative sample from the general population are required to generate a definitive SF-6D value set for the Lebanese population.

Conclusions
This study has demonstrated that generating a scoring algorithm for the SF-6D for the Lebanese population using the SG technique is, overall, feasible and acceptable. The performance of the econometric models derived from the Lebanese data compared favorably to the UK study, particularly given the smaller sample size. The encouraging nature of the results suggests that health state utility elicitation using SG could be used in Lebanon and in other Arab populations in the MENA region. The large differences observed in the parameter estimates between the UK and Lebanon suggest that it is important to have a local value set. However, further research is recommended to determine whether SG is feasible and acceptable in a Lebanese sample with lower education levels, and to generate a definitive value set for Lebanon using a representative sample of the general population.