The HLS19-COM-P, a New Instrument for Measuring Communicative Health Literacy in Interaction with Physicians: Development and Validation in Nine European Countries

Background: Sufficient communicative health literacy (COM-HL) is important for patients actively participating in dialogue with physicians, expressing their needs and desires for treatment, and asking clarifying questions. There is a lack of instruments combining communication and HL proficiency. Hence, the aim was to establish an instrument with sufficient psychometric properties for measuring COM-HL. Methods: The HLS19-COM-P instrument was developed based on a conceptual framework integrating HL with central communicative tasks. Data were collected using different data collection modes in nine countries from December 2019 to January 2021 (n = 18,674). Psychometric properties were assessed using Rasch analysis and confirmatory factor analysis. Cronbach’s alpha and Person separation index were considered for reliability. Results: The 11-item version (HLS19-COM-P-Q11) and its short version of six items (HLS19-COM-P-Q6) fit sufficiently the unidimensional partial credit Rasch model, obtained acceptable goodness-of-fit indices and high reliability. Two items tend to under-discriminate. Few items displayed differential item functioning (DIF) across person factors, and there was no consistent pattern in DIF across countries. All items had ordered response categories. Conclusions: The HLS19-COM-P instrument was well accepted in nine countries, in different data collection modes, and could be used to measure COM-HL.

considered or tested as a stand-alone scale. The 47-item version of the European HL-Survey Questionnaire (HLS-EU-Q47) [26] also includes a few items concerning COM-HL but without systematic reference to the broader research on healthcare communication. O'Hara et al. [31] published the concept of the "Conversational Health Literacy Assessment Tool (CHAT)" to provide a short actionable survey tool for the clinical context to assess patients' ability to interact with health professionals, but the 10 items mainly focus on general health information-seeking behaviour and health promotion activities. Only one item focuses on interactive behaviour.
In summary, previous instruments meant to measure COM-HL were either developed to measure only certain communicative tasks or outcomes or to capture only certain aspects of healthcare communication. As far as we know, no previous instrument integrated systematic findings from communication research and HL research into one instrument. We would also like to consider COM-HL as a relational proficiency. Hence, there is a need for a new instrument that covers the COM-HL skills necessary for actively participating in health communication with healthcare professionals, especially physicians.
Based on this background, this article aims to establish an international instrument with sufficient psychometric properties, intending to measure communicative health literacy in patient-physician communication.

Development of the Instrument for Measuring Communicative Health Literacy in Patient-Physician Communication
The new instrument for measuring COM-HL with physicians in healthcare, HLS19-COM-P, was developed in the context of the Health Literacy Population Survey Project 2019-2021 (HLS19), a European HL survey within 17 participating countries from the WHO-Europe region [3]. The HLS19-COM-P is based on a comprehensive theoretical framework that integrates Nutbeam's [22] idea of COM-HL, the basic competencies of information processing according to the HL framework of the HLS-EU Consortium [25,26], and the main communicative tasks of the Calgary-Cambridge Guide framework (C-CG) [32]. The C-CG framework has been developed over the last 25 years and integrates the results of different research traditions to serve as a guide for teaching health professionals patient-centred communication skills. It is also used as a framework for assessing the communicative skills of health professionals [33]. The C-CG describes 56 single communicative practices of a health professional in six main phases of a routine interaction in healthcare. Within these six main phases, the C-CG identifies the communicative tasks of patients that need to be considered in the conceptual framework for COM-HL (see Figure 1).
Based on our conceptual framework and definition of COM-HL (see above), the HLS19-COM-P was developed in a multistage process by a working group of representatives from HLS19 countries interested in COM-HL (see Figure 2). After creating the conceptual framework in the first step, the HLS-EU-Q47 [26,34] instrument for measuring general HL was reviewed for suitable items in the second step.
Five items on communication in healthcare were identified (Q5, Q8, Q9, Q13 and Q16), but, according to the C-CG, these items only measure aspects of two main phases of interactions between patients and health professionals (explanation and planning; closing the session). In particular, key patient communication tasks were not captured by the HLS-EU-Q47, e.g., presenting their concerns and preferences, and asking questions. Therefore, as a third step, a targeted search of the English- and German-language literature was conducted to identify existing instruments and possible items for measuring COM-HL. In addition to the HLS-EU-Q47, a total of 20 different instruments were found. Since none of these instruments covers all relevant aspects of the underlying conceptual framework and definition of COM-HL, the working group decided to develop a set of items inspired by these instruments. Hence, in the fourth step, the most relevant items were selected from the pool of 183 items and tested in an expert panel and in two focus group interviews. The aim was to identify at least one item per C-CG main phase and the related main communicative task (see Figure 1 and Table 1) to capture the main HL-related challenges in healthcare communication. A preliminary set of 15 items (including 5 items from the HLS-EU-Q47) was selected and adapted to the question format of the HLS19-HL instruments. In accordance with the HLS19 methodology [3], the items are formulated as direct questions (see Table 1) and are rated using a four-point Likert response scale: very easy (4), easy (3), difficult (2), very difficult (1). The comprehensibility and importance of the individual items were assessed in two focus groups involving potential survey participants in Austria (using a German preliminary version of the instrument).
Of the 14 focus group participants, four were men; participants were aged 18-54; four had a university degree, eight had graduated from high school, and two had compulsory school as their highest completed education. Three of the participants were chronically ill. In general, the 15 items were well received and understood. However, the focus group interviews revealed that the term "health professional" was not well accepted by participants because their experiences varied by type of health professional. The term was, therefore, perceived as too vague, making it difficult to form opinions and respond to the items. These general insights were included in creating the original English version. In addition, the set of items and item wording were discussed in several feedback loops within the working group, with the HLS19 International Coordination Center, and with the HLS19 Consortium to ensure transferability to different national contexts. The discussion in the working group indicated that the status of different health professions varies widely in the participating countries, while the status of physicians seems to be quite similar and comparable. Based on these considerations, the COM-HL instrument focuses exclusively on physician-patient communication. To create an independent instrument for measuring COM-HL, the five items from the HLS-EU-Q47 were excluded, resulting in a set of 11 items (HLS19-COM-P-Q11) reflecting the conceptual framework. The HLS19-COM-P instrument measures all six main communicative phases of physician-patient interactions according to the C-CG and can be used to analyze the dimensions of COM-HL in accordance with Nutbeam [22] and the basic competencies of information processing according to the conceptual model of HL developed by the HLS-EU Consortium [25,26].
In addition, a 6-item short form (HLS19-COM-P-Q6) (see Table 1) was suggested. The short form was proposed based on content considerations, and the shorter length allowed more countries to include HLS19-COM-P in their national survey. The short form might also be included more easily in future studies with patients.
[SI]), the translation process followed a two-step procedure: first, two forward translations were prepared, one by the National Study Centre (NSC) and one by the data collection agency (DCA), and, second, a comparison of the two translations was carried out by the NSC, with the most appropriate translation being selected in consensus with the DCA in case of differences. The AT and DE versions were also aligned. In three countries (BE (French translation), Bulgaria (BG) and Czech Republic (CZ)), only one forward translation was performed. Back-translation was conducted in CZ and SI. In France (FR), the BE French translation was used with very minor adaptations.

Data Collection
The HLS19-COM-P instrument was included as an optional package in the HLS19 survey [3]. Countries that chose this optional package could either use the 11-item version (HLS19-COM-P-Q11) or the 6-item short version (HLS19-COM-P-Q6). Data on COM-HL were collected in nine countries (AT, BE, BG, CZ, DE, DK, FR, HU, and SI). The 11-item version was applied in three of these (AT, DE, and SI). Data were collected using different modes of data collection (Table 2) from December 2019 to June 2021 based on multi-stage random sampling or quota sampling procedures in most countries. A mix of survey methods was used in three countries (BG, CZ, and SI). Except for DE, all surveys took place during the COVID-19 pandemic, which had an impact on possible data collection modes. The sample size in the countries varied from 865 (BG) to 3602 (DK) respondents, and data on COM-HL were collected from a total of 18,674 respondents.
Table 2 (only one row survives extraction; country not recoverable): multi-stage random sampling; 9-15 March 2020 and 9 June 2020-10 August 2020; n = 3342.
Table 2 notes: AT = Austria; BE = Belgium; BG = Bulgaria; CAPI = computer-assisted personal interviews; CATI = computer-assisted telephone interviews; CAWI = computer-assisted web interviews; CZ = Czech Republic; DE = Germany; DK = Denmark; FR = France; HU = Hungary; PAPI = paper-assisted personal interviews; SI = Slovenia. i The number of respondents who have answered one or more HLS19-COM-P items. ii Only 12 individuals responded using paper and pencil. These records were excluded from the analyses.

Analyses
At the overall level, the psychometric properties of the HLS19-COM-P-Q11 and its short version HLS19-COM-P-Q6 were assessed using Rasch analysis and one-factorial confirmatory factor analysis (CFA). CFA provides detailed information about the overall fit of the model [35] but has some shortcomings (e.g., the results are sample- and scale-dependent and the standard error of measurement is constant) [36,37]. Rasch models are considered parsimonious models meeting the requirements of fundamental measurement [38]. Rasch analysis also provides detailed information about the items. Hence, Rasch analysis was performed at both overall and item levels [39]. As mentioned above, the HLS19-COM-P items are interpreted as ordinal scaled. In the International Report on the Methodology, Results, and Recommendations of the European Health Literacy Population Survey 2019-2021 (HLS19) of M-POHL [3], the analyses were based on dichotomous data. In this article, the psychometric properties are assessed for both the polytomous version of the COM-HL items (very difficult, difficult, easy, very easy) and a dichotomized version (very easy/easy versus difficult/very difficult) to explore which one might be preferable. The internal consistency and reliability were assessed using Cronbach's alpha and the Person Separation Index (PSI). If conclusions are to be drawn at the individual or group level, the indexes are recommended to exceed 0.85 or 0.65, respectively [40]. Omega for categorical data was used as an index for composite reliability [41]. In addition, the average variance extracted (AVE) was evaluated. An AVE value of ≥0.5 could be considered acceptable [42]. The analyses were conducted for each country and separately for different modes of data collection if more than one mode was used within a country.
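As an illustration of the internal-consistency index mentioned above, the following minimal sketch computes Cronbach's alpha from complete item responses. It is not the code used in the study; the function name `cronbach_alpha` and the toy responses are hypothetical, and missing values are assumed to have been handled beforehand.

```python
from statistics import pvariance


def cronbach_alpha(rows):
    """Cronbach's alpha for complete data.

    rows: list of respondents, each a list of item scores
    (here 1 = very difficult ... 4 = very easy).
    """
    k = len(rows[0])
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    # Variance of each item across respondents
    item_vars = [pvariance([r[i] for r in rows]) for i in range(k)]
    # Variance of the raw sum score across respondents
    total_var = pvariance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses to six items (illustration only, not study data)
toy = [
    [4, 4, 3, 4, 3, 4],
    [3, 3, 3, 2, 3, 3],
    [2, 2, 1, 2, 2, 1],
    [4, 3, 4, 4, 4, 3],
    [3, 2, 3, 3, 2, 3],
]
alpha = cronbach_alpha(toy)
print(f"alpha = {alpha:.3f}")
```

The resulting value can then be compared with the recommended limits of 0.85 (individual level) and 0.65 (group level) cited above.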
In terms of Rasch analysis, data were tested against the partial-credit parameterization [43] of the unidimensional Rasch model [44]. Analyses at the overall level included data-model fit, targeting (mean person location), and dimensionality [39]. Chi-square statistics were applied to assess the data-model fit. The targeting of the HLS19-COM-P-Q11 and the HLS19-COM-P-Q6 was assessed by comparing the item and person location distributions on the same metric. An instrument could be deemed well-targeted if the mean person location values are around zero [45]. Graphical displays for targeting were also inspected. Dimensionality was assessed using the combined procedure of principal component analysis (PCA) of residuals and paired t-tests [46-48]. Based on the PCA of residuals, two subsets of items were formed, and paired t-tests were used to examine whether the subsets provided significantly different person location estimates. A scale could be considered sufficiently unidimensional if the proportion of individuals with significantly different person location estimates on the pair of compared subscales does not exceed 5% (or if the lower bound of the binomial 95% confidence interval (CI) does not exceed 5%) [46,48].
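The decision rule for unidimensionality described above can be sketched as follows. This is a minimal illustration, not the procedure implemented in the Rasch software: the function name is hypothetical, and the confidence interval uses the normal approximation to the binomial, which is an assumption (the text does not state which interval method was applied).

```python
import math


def unidimensionality_check(n_significant, n_persons, limit=0.05, z=1.96):
    """Return (proportion, lower CI bound, verdict) for the paired t-test rule.

    n_significant: persons whose two subset-based location estimates differ
                   significantly; n_persons: persons compared.
    The 95% CI is a normal-approximation interval (an assumption).
    """
    p = n_significant / n_persons
    half_width = z * math.sqrt(p * (1 - p) / n_persons)
    lower = max(0.0, p - half_width)
    # Sufficiently unidimensional if the proportion, or at least the
    # lower CI bound, does not exceed the 5% limit
    return p, lower, (p <= limit or lower <= limit)


# Hypothetical counts, for illustration only
print(unidimensionality_check(30, 500))   # 6% significant, lower bound below 5%
print(unidimensionality_check(60, 500))   # 12% significant, lower bound above 5%
```

With these hypothetical counts, a proportion slightly above 5% still passes because the lower bound of its interval falls below the limit, which is the situation the parenthetical criterion is meant to cover.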
CFA was performed for a one-factor model of the HLS19-COM-P-Q11 and the HLS19-COM-P-Q6 using a WLSMV estimator with diagonally weighted least squares [49-51]. The following goodness-of-fit (GOF) indices were considered: standardized root mean square residual (SRMR), root mean square error of approximation (RMSEA), comparative fit index (CFI), Tucker-Lewis index (TLI), goodness-of-fit index (GFI) and adjusted goodness-of-fit index (AGFI). Schumacker and Lomax [52] recommend SRMR < 0.05, RMSEA ≤ 0.05-0.08, and CFI, TLI, GFI and AGFI values close to 0.90 or 0.95, whereas Hu and Bentler [53] claim that SRMR values close to or below 0.08 indicate sufficient overall fit. An overview of GOF indices with reference values can also be found in Table S1: Fit indices considered in confirmatory factor analysis.
Rasch analyses at a finer level included assessing item fit, response dependence, ordering of response categories and differential item functioning (DIF) [39]. Chi-square probability values above a Bonferroni-adjusted p-value of 5% and fit residuals within the range of ±2.5 indicate adequate item fit [45]. In addition, the mean square residual fit statistic (MNSQ) infit was used to assess item fit. For measuring COM-HL at the population level, infit between 0.7 and 1.3 was considered sufficient [54]. A residual correlation above 0.3 was taken as an indicator of response dependency. In addition, residual correlations were assessed relative to each other [55]. For ordinal data, the threshold ordering was inspected both statistically and graphically to examine whether the response categories could be considered to work as intended. A key requirement of measurement is that items measure invariantly across levels of different person factors, such as gender, age, and education. Lack of invariance in measurement across person factors is called differential item functioning (DIF) [56,57].
Uniform DIF means that there are consistent systematic differences in responses across person factor levels, whereas nonuniform DIF is present if the DIF varies along the latent trait, i.e., the person factor interacts with the latent trait [56]. Items were inspected for DIF across different levels of person factors (gender, age, educational level, status of employment, ability to pay bills, self-perceived level in society and self-reported general health status), both statistically using two-way analysis of variance of standardized residuals and graphically by inspecting item characteristic curves [58]. Statistical significance was assumed at a Bonferroni-adjusted p-value ≤ 5%. An overview of the tests performed with reference values is also found in Table S2: Analyses and tests with reference values considered in Rasch analyses.
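The item-level reference values listed above can be applied mechanically, as in the following minimal sketch. The function names are hypothetical, and dividing the 5% level by the number of tests is the usual Bonferroni adjustment; that the number of tests equals the number of items is an assumption made here for illustration.

```python
def item_fit_flags(fit_residual, infit, residual_limit=2.5, infit_range=(0.7, 1.3)):
    """Check one item against the reference values for fit residual and infit."""
    low, high = infit_range
    return {
        "fit_residual_ok": abs(fit_residual) <= residual_limit,
        "infit_ok": low <= infit <= high,
    }


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after Bonferroni adjustment."""
    return alpha / n_tests


# Values reported in the results for item COM4 in AT (CATI) data:
# fit residual 3.87, infit 1.21
print(item_fit_flags(3.87, 1.21))

# Assumption for illustration: one test per item of the 11-item version
print(round(bonferroni_alpha(0.05, 11), 5))
```

For the COM4 example, the fit residual exceeds ±2.5 while the infit stays inside 0.7-1.3, which matches the description of an item that tends to under-discriminate without failing the infit criterion.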
Since chi-square statistics are sensitive to sample size, there is a risk of drawing false conclusions due to large sample sizes [59]. Therefore, the amend sample size function in the software RUMM2030 was used to draw a random sub-sample for analyses concerning data-model fit, item fit and DIF. As recommended, the sample size was calculated by multiplying the number of items (11 or 6) by the number of thresholds (3 for polytomous items) and by 10-30 persons per threshold [58]. This indicates that a sample size of 330-990 for the 11-item version (11 × 3 × 10 to 11 × 3 × 30) and 180-540 for the 6-item version (6 × 3 × 10 to 6 × 3 × 30) can be deemed adequate for these analyses.
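The sub-sample sizes quoted above follow from a simple product, sketched below. The function name `subsample_size` is hypothetical; only the arithmetic is taken from the text.

```python
def subsample_size(n_items, persons_per_threshold, n_thresholds=3):
    """Sample size = items x thresholds x persons per threshold."""
    return n_items * n_thresholds * persons_per_threshold


# Lower bound, midpoint and upper bound of the 10-30 persons-per-threshold rule
for n_items in (11, 6):
    sizes = [subsample_size(n_items, p) for p in (10, 20, 30)]
    print(n_items, "items:", sizes)
```

The middle values (20 persons per threshold) correspond to the sub-samples of 660 and 360 that are used in the results alongside the lower and upper bounds.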
Items with a negative item location estimate could be considered relatively easy to endorse, whereas the opposite is the case for items with a positive item location estimate. A higher value indicates that the item has a higher difficulty level and is, consequently, harder to endorse [58].
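To make the link between item location and ease of endorsement concrete, the sketch below evaluates the category probabilities of the partial credit model for one person and two items. The threshold values and the person location are hypothetical, and the item location is taken as the mean of the item's thresholds, which is an assumption about the parameterization rather than something stated in the text.

```python
import math


def pcm_probabilities(theta, thresholds):
    """Category probabilities under the partial credit model.

    theta: person location (logits); thresholds: ordered threshold
    locations of one item. Returns probabilities for categories
    0 .. len(thresholds), lowest category first.
    """
    # Cumulative sums of (theta - threshold); 0 for the lowest category
    cumulative = [0.0]
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (theta - tau))
    weights = [math.exp(c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


theta = 1.5                       # hypothetical person location
easy_item = [-3.0, -1.0, 1.0]     # hypothetical thresholds, mean location -1
hard_item = [-1.0, 1.0, 3.0]      # hypothetical thresholds, mean location +1

p_easy = pcm_probabilities(theta, easy_item)
p_hard = pcm_probabilities(theta, hard_item)
print([round(p, 3) for p in p_easy])
print([round(p, 3) for p in p_hard])
```

For the same person, the item with the lower location gives a higher probability of the top category ("very easy"), which is what "easier to endorse" means in this context.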
Convergent and discriminant validity are also facets of construct validity [60]. As the HLS19-COM-P intends to measure an aspect of HL, COM-HL, it would be expected that the COM-HL score is positively (moderately) correlated with general HL (GEN-HL) and with navigational HL (HL-NAV; convergent validity) but is still a separate construct (discriminant validity). For convergent validity, Pearson's correlations were used to assess the associations between COM-HL and GEN-HL and HL-NAV. A positive moderate correlation between these scores would be expected, as the instruments intend to capture different aspects of HL. To assess discriminant validity, the combined procedure of PCA of residuals and paired-sample t-tests was applied to investigate whether the different HL instruments could be considered to measure distinct constructs. GEN-HL and HL-NAV were measured using the HLS19-Q12 and the HLS19-NAV instruments, each consisting of 12 items [3].
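The convergent-validity check rests on Pearson's correlation between two scores per respondent. A minimal sketch is given below; the function name and the score vectors are hypothetical and serve only to show the computation, not to reproduce any result of the study.

```python
import math


def pearson_r(x, y):
    """Pearson's product-moment correlation for two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical COM-HL and GEN-HL scores for six respondents (illustration only)
com_hl = [55, 70, 62, 80, 45, 90]
gen_hl = [60, 58, 75, 70, 50, 72]
r = pearson_r(com_hl, gen_hl)
print(f"r = {r:.2f}")
```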
Rasch analyses were performed using the software RUMM2030Plus [61] and ACER ConQuest 5 [62]; the lavaan package [51] for R [63] was applied for CFA, and the correlation analyses were conducted using R.

Missing
On average, there were few missing values. In most countries, the share of missing values varied between 0 and 2 or 3%. In BG data, COM11 had 11% missing values. In the Rasch analyses, missing data were handled through full information maximum likelihood (FIML) estimation, whereas the other analyses included respondents who had at least 80% valid responses. Table 3 provides details regarding the main characteristics of the samples, including the key demographics, socioeconomic variables, and health status.
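The inclusion rule for the non-Rasch analyses (at least 80% valid responses) can be sketched as follows. Encoding a missing answer as `None` and the function name are assumptions made for this illustration.

```python
def has_enough_valid(responses, min_share=0.8):
    """True if the share of non-missing item responses is at least min_share.

    responses: list of item scores for one respondent, with None for missing.
    """
    valid = sum(1 for r in responses if r is not None)
    return valid / len(responses) >= min_share


# Six-item version: five valid answers (83%) pass, four (67%) do not
print(has_enough_valid([3, 4, None, 3, 2, 4]))
print(has_enough_valid([3, None, None, 3, 2, 4]))

# Eleven-item version: nine valid answers (82%) pass, eight (73%) do not
print(has_enough_valid([3] * 9 + [None] * 2))
print(has_enough_valid([3] * 8 + [None] * 3))
```

Under this rule a respondent may skip at most one item of the 6-item version and at most two items of the 11-item version.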

Rasch Analyses at the Overall Level
At the overall level, the polytomously scored HLS19-COM-P-Q11 displayed misfit in all countries when a sample size of n = 660 (20 persons for each of the 33 thresholds (11 items × 3 thresholds)) was considered (Table 4, [39]). When the sample size was reduced to 330 in each country, the HLS19-COM-P-Q11 displayed acceptable overall data-model fit. With a sample size of n = 360, the polytomously scored HLS19-COM-P-Q6 displayed acceptable overall data-model fit in AT and DE data. When the sample size was reduced to 180 (6 items × 3 thresholds × 10), the short version also displayed acceptable data-model fit in the other countries. The proportion of significantly different person location estimates across subtests varied between 4.8% (SI; CAPI) and 7.9% (DE; PAPI) for the HLS19-COM-P-Q11, and between 3.0% (HU; CATI and SI; CAPI) and 7.5% (DK; CAWI) for the HLS19-COM-P-Q6, indicating that the scales could be considered sufficiently unidimensional in all countries. The targeting of both the long and the short version could have been better, as the items, on average, were quite easy to endorse (mean person location varying between 1.38 (DE; PAPI) and 2.73 (SI; CAWI), and between 1.21 (DE; PAPI) and 2.47 (SI; CAWI), for HLS19-COM-P-Q11 and HLS19-COM-P-Q6, respectively) (Table 4, [39]).
In countries applying different modes of data collection, higher mean person location was observed for data obtained from CAWI than from CAPI (SI (long and short version) and BG). A higher mean person location was also observed in data obtained from CATI compared with CAWI (CZ; Figure 3a).
Assessing fit to the Rasch model based on dichotomized items indicated low power of analysis of fit and a decreased PSI. The results based on dichotomous data also led to a sharp increase in the number of records with extreme scores (4-12 times more extreme records). Reasonable power of analyses of fit was observed only for the DE version of the dichotomized HLS19-COM-P-Q11. In DE data, the number of extreme records increased from 81 when analyses were based on polytomous data to 642 when based on dichotomous data. The PSI decreased from 0.89 to 0.62. Hence, the following results from Rasch analyses provided in this paper are based on polytomous data.
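Why dichotomization inflates the number of extreme records can be seen in a small sketch. A record is extreme when its raw score is the minimum or maximum possible, in which case no finite person location can be estimated. The function names and the response pattern below are hypothetical; the recoding follows the dichotomization described in the Analyses section (very easy/easy versus difficult/very difficult).

```python
def dichotomize(score):
    """Recode the four-point scale: very easy (4) / easy (3) -> 1,
    difficult (2) / very difficult (1) -> 0."""
    return 1 if score >= 3 else 0


def is_extreme(scores, min_category, max_category):
    """True if the raw score equals the minimum or maximum possible score."""
    total = sum(scores)
    n = len(scores)
    return total == n * min_category or total == n * max_category


# Hypothetical respondent who finds every task "easy" or "very easy"
polytomous = [3, 4, 3, 3, 4, 3]
dichotomous = [dichotomize(s) for s in polytomous]

print(is_extreme(polytomous, 1, 4))    # not extreme: raw score 20 of 24
print(is_extreme(dichotomous, 0, 1))   # extreme: raw score 6 of 6
```

Any respondent who mixes "easy" and "very easy" answers keeps a non-extreme polytomous score but reaches the maximum dichotomous score, which is consistent with the increase in extreme records reported above for a scale whose items are, on average, easy to endorse.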

Confirmatory Factor Analysis
Regardless of whether polytomous or dichotomous data were considered, most goodness-of-fit indices for both the HLS19-COM-P-Q11 and the HLS19-COM-P-Q6 could be considered acceptable when using a one-factor model (Table 5). In DE and SI data, the RMSEA for HLS19-COM-P-Q11 was above the recommended reference value. In countries applying different data collection modes, the goodness-of-fit indices were approximately the same, except for the SI HLS19-COM-P-Q6 data, where data collected using CAPI had somewhat better fit than CAWI data. Comparing goodness-of-fit indices based on dichotomous versus polytomous data, the SRMR was either equivalent or lower when analyses were based on polytomous data, whereas the opposite was the case for the RMSEA.
Table 5 notes: AT = Austria; BE = Belgium; BG = Bulgaria; CAPI = computer-assisted personal interviews; CATI = computer-assisted telephone interviews; CAWI = computer-assisted web interviews; CFI = comparative fit index; CZ = Czech Republic; DE = Germany; DK = Denmark; FR = France; GFI = goodness-of-fit index; HU = Hungary; PAPI = paper-assisted personal interviews; RMSEA = root mean square error of approximation; SI = Slovenia; SRMR = standardized root mean square residual; TLI = Tucker-Lewis index.

Reliability
Both the HLS19-COM-P-Q11 and HLS19-COM-P-Q6 obtained acceptable to high reliability indices (Table 6). Based on polytomous data, the PSI, Cronbach's alpha, and omega for the HLS19

Fit at the Item Level
Using a sample of 990 from each country, the HLS19-COM-P-Q11 item COM1 ("describe to your doctor your reasons for coming to the consultation") displayed significant misfit (p < 0.001) in all three countries but had acceptable infit and fit residual. This was also the case for DE (PAPI) and SI (CAPI and CAWI) when reducing the sample size to 660. The item misfit was not significant at Bonferroni 5% for AT considering a sample size of 660. In data from AT (CATI), items COM4 ("get enough time in the consultation with your doctor"; fit residual of 3.87 and infit of 1.21) and COM7 ("understand the words used by your doctor"; fit residual of 3.21 and infit of 1.18) tended to under-discriminate (Table S3: Item fit statistics for HLS19-COM-P-Q11 for each country, [39]). The other items displayed acceptable fit.
Most HLS19-COM-P-Q11 items worked invariantly across different levels of person factors. However, in DE data, item COM7 ("understand the words used by your doctor") displayed significant DIF for education, where those having at most upper secondary school as the highest completed education (ISCED 0 to 3) scored significantly lower than those having higher education despite the same location on the latent trait. This was also evident in SI CAWI data. In addition, item COM7 also displayed significant DIF for paying bills in SI CAWI data. In SI CAWI data, item COM6 ("get the information you need from your doctor") displayed DIF for age and education. However, the DIF was not evident when reducing the sample size to 660. The same was the case for age in item COM4 ("get enough time in the consultation with your doctor") in AT data. In SI CAPI data, item COM10 ("recall the information you get from your doctor") displayed DIF for age depending on how the variable was categorized (Table S3: Item fit statistics for HLS19-COM-P-Q11 for each country, [39]). None of the items displayed DIF for self-reported health.
Response dependency was observed between items COM1 ("describe to your doctor your reasons for coming to the consultation") and COM3 ("explain your health concerns to your doctor") (r = 0.35) in the DE data (not reported in the Table). The response categories worked well for all items in all countries. Applying the HLS19-COM-P-Q11, item COM1 ("describe to your doctor your reasons for coming to the consultation") was the easiest to endorse in all countries, whereas items COM4 ("get enough time in the consultation with your doctor"), COM7 ("understand the words used by your doctor"), COM5 ("express your personal views and preferences to your doctor") and COM9 ("be involved in decisions about your health in dialogue with your doctor") were the hardest in AT, DE, SI CAWI and SI CAPI data, respectively (Table S3: Item fit statistics for HLS19-COM-P-Q11 for each country, [39]).
For the HLS19-COM-P-Q6, most items worked well in most countries. However, item COM4 ("get enough time in the consultation with your doctor") under-discriminated in BG (CAPI data: fit residual of 3.02 and infit of 1.36, CAWI data: fit residual of 2.01 and infit of 1.27) and DK data (fit residual of 7.21 and infit of 1.36), while item COM10 ("recall the information you get from your doctor") tended to under-discriminate in BE (fit residual of 3.02 and infit of 1.29), CZ (CATI data: fit residual of 1.98 and infit of 1.39, CAWI data: fit residual of 3.48 and infit of 1.34), DK (fit residual of 4.87 and infit of 1.37) and HU (fit residual of 1.58 and infit of 1.33) data (Table S4: Item fit statistics for HLS19-COM-P-Q6 for each country, [39]). The item COM4 ("get enough time in the consultation with your doctor") also displayed significant DIF across age categories (depending on categorization), level of education and employment status in BG CAPI data. In addition, the item displayed DIF for self-perceived social level in society and self-reported general health status, but this was not significant when reducing the sample size to 360 (Table S4: Item fit statistics for HLS19-COM-P-Q6 for each country, [39]). Significant uniform and nonuniform DIF across age categories (depending on categorization) was also observed for item COM3 ("explain your health concerns to your doctor") in BG CAWI data. For the other countries, no significant DIF was observed when considering a sample size of 360.
Neither response dependency nor unordered response categories were observed for HLS19-COM-P-Q6. In most countries, item COM3 ("explain your health concerns to your doctor") was the easiest to endorse (in BE data and SI CAWI data, COM8 ("ask your doctor questions in the consultation") was the easiest and, in HU data, COM10 ("recall the information you get from your doctor") was the easiest), whereas item COM4 ("get enough time in the consultation with your doctor") was the hardest in most countries (except for CZ CATI data, CZ CAWI data, FR and SI CAPI data, where COM10 ("recall the information you get from your doctor"), COM9 ("be involved in decisions about your health in dialogue with your doctor"), COM5 ("express your personal views and preferences to your doctor") and COM9 ("be involved in decisions about your health in dialogue with your doctor") were the hardest to endorse, respectively; Table S4: Item fit statistics for HLS19-COM-P-Q6 for each country).

Invariance across Modes and Countries
In SI data, the HLS19-COM-P-Q11 items COM4 ("get enough time in the consultation with your doctor"), COM5 ("express your personal views and preferences to your doctor"; CAPI > CAWI) and COM7 ("understand the words used by your doctor"; CAPI < CAWI) displayed DIF across mode (n = 990). When the sample size was reduced to n = 660, DIF was only evident in item COM5 (F-ratio: 12.75, p < 0.001; Figure 4). Item COM5 also displayed DIF across mode in data from the SI six-item version, but this was not significant when the sample size was reduced to 540. Using approximately equal sample sizes from CZ CATI and CAWI HLS19-COM-P-Q6 data, item COM10 ("recall the information you get from your doctor") displayed DIF across modes. Item COM3 ("explain your health concerns to your doctor") displayed DIF across mode in BG data. These DIFs were not significant at the Bonferroni-adjusted 5% level when applying a sample size of 540.
Using equal sample sizes (n = 500) from countries applying CATI (AT, CZ, and HU), items COM3 ("explain your health concerns to your doctor"), COM4 ("get enough time in the consultation with your doctor") and COM10 ("recall the information you get from your doctor") displayed DIF across the countries. The same items displayed DIF when drawing a random sample of 400 for each country applying CAPI/PAPI (BG, DE, and SI). Using random samples of 500 from countries applying CAWI (BE, BG, CZ, DK, FR and SI), items COM3 ("explain your health concerns to your doctor"), COM4 ("get enough time in the consultation with your doctor"), COM9 ("be involved in decisions about your health in dialogue with your doctor") and COM10 ("recall the information you get from your doctor") displayed DIF. The mean person location for the random samples of HLS 19 -COM-P-Q6 data collected using CATI, CAPI/PAPI and CAWI were 2.08, 1.60 and 1.89, respectively.
Figure 4. Graphical comparison between means of CAPI and CAWI in Slovenian data for item COM5 ("express your personal views and preferences to your doctor").

Convergent and Discriminant Validity
The long and the short version of the instrument were, as expected, highly correlated; r varied from 0.97 (AT) to 0.98 (SI).
The scores obtained from HLS 19 -COM-P-Q11 and HLS 19 -COM-P-Q6 were moderately to highly correlated with scores of GEN-HL (measured using HLS 19 -Q12) and HL-NAV (measured using HLS 19 -NAV) when analyses were based on polytomous data (Table 7). When conducting analyses based on dichotomous data, the correlation between COM-HL and related HL scores could be considered as small to large (Table 7). Lower correlation coefficients were observed in all countries when analyses were based on dichotomous data compared to polytomous.

Distribution of COM-HL Score
The score based on dichotomous items shows a left-skewed distribution with a clear ceiling effect in all countries, for both the long and short versions, regardless of the survey method (see Figure S1a: Distribution of HLS 19 -COM-P-Q11 dichotomous score by country and survey mode; and Figure S1b: Distribution of HLS 19 -COM-P-Q6 dichotomous score by country and survey mode). The score based on polytomous items is rather normally distributed in most countries, both for the long and short version, although, in some countries, the positive extreme values are disproportionately represented in the distribution (Figure S2a: Distribution of HLS 19 -COM-P-Q11 polytomous score by country and survey mode; and Figure S2b: Distribution of HLS 19 -COM-P-Q6 polytomous score by country and survey mode).
Table 7. Correlation between COM-HL scores (based on HLS 19 -COM-P-Q11 to the left and HLS 19 -COM-P-Q6 to the right) and general (GEN-HL) and navigational (HL-NAV) health literacy scores, based on polytomous and dichotomous (marked in grey) data. Results are divided by country and data collection mode.

Discussion
Based on our theoretical framework that integrates the idea of COM-HL of Nutbeam [22], the basic competencies of information processing according to the HL framework of the HLS Consortium [3,25,26] and the main communicative tasks of the C-CG framework [32], we succeeded in developing a brief international instrument with acceptable psychometric properties and strong reliability for measuring COM-HL in patient-physician interaction.

Construct Validity and Reliability
The HLS 19 -COM-P-Q11 and HLS 19 -COM-P-Q6 data display acceptable fit to the unidimensional Rasch model (considering a reduced sample size) and acceptable goodness-of-fit indices in CFA. Both HLS 19 -COM-P-Q11 and HLS 19 -COM-P-Q6 yielded sufficiently unidimensional data, implying that it could be statistically defensible to calculate a total score [66] of COM-HL based on these instruments. The sufficiently high reliability indices also allow for drawing conclusions about COM-HL at both group and individual levels [40].
However, the targeting of both HLS 19 -COM-P-Q11 and HLS 19 -COM-P-Q6 could have been better. Overall, the items were quite easy to endorse, implying a ceiling effect. Mistargeting might decrease reliability, as the precision of the instrument becomes poorer [56]. Hence, the instrument could benefit from adding items that are harder to endorse. On the other hand, for identifying groups with difficulties in COM-HL, the instrument performs well.
Most items displayed acceptable fit to the Rasch model. However, item COM10 ("recall the information you get from your doctor") under-discriminated in CZ, DK, and HU HLS 19 -COM-P-Q6 data. Under-discriminating items also tend to measure something else that is not positively correlated with the latent trait [67], here, COM-HL. The item is about recalling health information, which might depend on cognitive processes other than HL. Hence, in future studies, one might consider replacing this item with item COM11 in the short version of the instrument. This item might also be more in line with the cognitive domain to "apply" health information in the conceptual model of Sørensen et al. [25]. In addition, item COM4 ("get enough time in the consultation with your doctor") under-discriminated in BG (CAPI) and DK HLS 19 -COM-P-Q6 data. In AT data, the fit residual was also somewhat elevated, but the infit could be deemed acceptable. Experiences of having sufficient time in consultation with a physician may depend on other things in addition to COM-HL, such as the number and type of health issues that the patient would like to discuss and the patient's age [68]. In our conceptual framework, there are no other items covering this dimension. However, item COM4 might be replaced by item COM6 ("get the information you need from your doctor"), which could also be an indicator for understanding and following the agenda. In the short version, there are no items covering the dimension opening the session and giving initial information. However, these communicative tasks might be somewhat overlapping, which was confirmed in DE data, as response dependency was observed between items COM1 and COM3.
Few items showed DIF across different levels of person factors and, where DIF was present, there was no consistent pattern across countries. This indicates that the instrument works quite invariantly. However, the HLS 19 -COM-P-Q11 item COM7 ("understand the words used by your doctor") displayed DIF for education in DE and SI CAWI data. The source of DIF might be that patients with low education are less familiar with medical jargon than those with higher education. In BG HLS 19 -COM-P-Q6 CAPI data, COM4 ("get enough time in the consultation with your doctor") displayed DIF for several person factors. As mentioned above, there might be several reasons that some patients perceive a need for more time in consultation with physicians.
On the one hand, the COM-HL score was moderately to highly correlated with GEN-HL and HL-NAV scores, indicating that the instruments measure something common, which supports convergent validity. The scores are all based on related instruments, all intending to measure certain aspects of HL. On the other hand, the combined PCA of residuals and t-test procedure shows that they are measuring distinctive constructs (discriminant validity). Hence, we conclude that the instruments for measuring COM-HL, GEN-HL and HL-NAV are measuring different aspects but could be considered parts of the family of HL instruments. Content and face validity are also ensured by using the theory-based model and definition of communicative HL with physicians in healthcare for selecting and operationalizing the included indicators. Concurrent predictive validity should be further explored in future studies.

Using Dichotomous or Polytomous Scores
Due to a change in the labelling of the response categories from the HLS-EU [26] to the HLS 19 survey, the HLS 19 Consortium decided to use dichotomized scores when reporting on HL in the international report [3] to ease the comparison between the surveys. However, no items of the HLS 19 -COM-P-Q11 or HLS 19 -COM-P-Q6 displayed unordered response categories, indicating that the four-point response categories used in the HLS 19 survey worked well, at least for the HLS 19 -COM-P instruments. Conducting Rasch analyses based on dichotomous items resulted in an increased number of extreme records. The Cronbach's alpha values based on polytomous data were also higher than those reported based on dichotomized data, which would be expected as more response categories yield more scoring points. The correlation between COM-HL and other HL scores was also stronger when analyses were based on polytomous scores. This is in line with Jiao et al. [69], who also found that results based on polytomous scores have slightly higher measurement precision compared to results from dichotomous scoring. Dichotomized scoring might be easier to understand but yields a loss of information and a loss of power [70]. However, dichotomization might reduce the effect of outliers [71].
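The loss of scoring points under dichotomization can be illustrated with a minimal sketch. It assumes, for illustration only, that the four response categories are coded 1 to 4 and that the two highest categories are collapsed to 1 and the two lowest to 0; the function names and the cut point are hypothetical and are not taken from the HLS 19 scoring documentation.

```python
from typing import List, Sequence


def n_sum_scores(n_items: int, n_categories: int) -> int:
    """Number of distinct raw sum scores for n_items with n_categories each."""
    return n_items * (n_categories - 1) + 1


def dichotomize(responses: Sequence[int], cut: int = 3) -> List[int]:
    """Collapse 1..4 responses to 0/1 (assumed cut: categories >= 3 become 1)."""
    return [int(r >= cut) for r in responses]


# Distinct raw scores available in the six-item and eleven-item versions.
q6_poly, q6_dich = n_sum_scores(6, 4), n_sum_scores(6, 2)
q11_poly, q11_dich = n_sum_scores(11, 4), n_sum_scores(11, 2)

# Two respondents who differ on every polytomous item ...
a = [3, 3, 3, 3, 3, 3]
b = [4, 4, 4, 4, 4, 4]
# ... receive the same (maximum) dichotomous sum score.
same_after_dichotomization = sum(dichotomize(a)) == sum(dichotomize(b))
```

Under these assumptions, the six-item version offers 19 distinct polytomous raw scores but only 7 dichotomous ones, and respondents `a` and `b` both end up at the dichotomous maximum, which is consistent with the larger number of extreme records reported for the dichotomous analyses.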

Data Collection Mode
Different data collection modes were applied across countries and, for some countries, also within the country. The advantage is that the instrument was evaluated for different modes. However, according to Bowling [72], using different modes might bring more response bias than using a single mode. For countries that have used multiple modes, we also found that the mean person location differed across modes. Especially in CZ data, the difference in mean person location estimates between CATI and CAWI data was significant. In CZ, there was a predominance of younger people who responded to the CAWI version, whereas the CATI responses were dominated by older people, as these were hard to reach in CAWI sampling. Hence, the CAWI and CATI samples in CZ are also incomparable due to including different age groups. This was also the case in SI data. Some items also displayed DIF across different data collection modes, implying that people responding to questionnaires operationalized by different data collection modes might interpret the items differently. However, the DIF across modes was marginal.
Even though the results should be interpreted with caution, due to DIF across countries, the mean person location was, on average, highest for countries that collected data by CATI and lowest in countries that used CAPI/PAPI. Both might be affected by response bias, but CAPI/PAPI brings less cognitive burden and is usually the most preferred data collection mode for the respondents [72]. Braekman et al. [73] also found that, in a health survey, responses collected via self-administered modes (web versus PAPI) were more comparable than responses collected via self- and interviewer-administered modes (web versus CAPI). However, the authors also found that simple and factual questions (such as healthcare use) are less prone to mode differences when comparing self- and interviewer-administered modes. Future studies which intend to compare scores across countries and modes of data collection should take actions to minimize mode effects, such as providing instructions for different modes in order to provide the same perceived stimuli to respondents. To reduce DIF across languages or countries, the translated versions of the instrument should also be assessed to ensure that items are interpreted in the same way across countries.

How to Use the Instrument
The HLS 19 -COM-P intends to measure COM-HL in general adult populations and comprises skills that are necessary to actively participate in an interaction with physicians within a healthcare setting. The COM-HL score is standardized in the range of 0 to 100. Scores are only computed for respondents who have answered at least 80% of the HLS 19 -COM-P items. If fewer than 80% of the items contain valid responses, the score is set to "missing". A higher score value signifies a higher level of COM-HL. The score should be interpreted in light of contextual factors related to the health system of the present country.
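The scoring rule described above can be sketched as follows. The 80% threshold and the 0 to 100 range come from the text; the item coding (1 to 4, higher meaning easier) and the linear rescaling of the mean of valid items are assumptions made for illustration and may differ from the official HLS 19 scoring syntax. The function name `com_hl_score` is hypothetical.

```python
from typing import Optional, Sequence


def com_hl_score(
    responses: Sequence[Optional[int]],
    min_valid_share: float = 0.8,
    n_categories: int = 4,
) -> Optional[float]:
    """Return a 0-100 score, or None ("missing") if too few items are valid.

    responses: one entry per item, coded 1..n_categories (assumed coding),
    with None marking a missing or invalid answer.
    """
    valid = [r for r in responses if r is not None]
    # Rule from the text: at least 80% of the items must have valid responses.
    if not responses or len(valid) / len(responses) < min_valid_share:
        return None
    mean_item = sum(valid) / len(valid)
    # Assumed standardization: linear rescaling of the item mean to 0-100.
    return 100.0 * (mean_item - 1) / (n_categories - 1)


full = com_hl_score([4, 4, 4, 4, 4, 4])              # all six items answered
one_missing = com_hl_score([4, 4, 4, 4, 4, None])    # 5/6 valid, above 80%
two_missing = com_hl_score([4, 4, 4, 4, None, None]) # 4/6 valid, below 80%
```

With six items, one missing answer still yields a score (5/6 is about 83%), whereas two missing answers do not (4/6 is about 67%); with eleven items, up to two missing answers would be tolerated under the same rule (9/11 is about 82%).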
The instrument belongs to the HLS 19 Consortium. The use of the instrument is free, but any use of it needs a contractual agreement between the nonprofit applicant and the HLS 19 Consortium. Further information can be found here: https://m-pohl.net/tools (accessed on 1 June 2022).

Strengths and Limitations
The psychometric properties of the instrument were assessed in large country-representative samples from nine countries and in different data collection modes. The development of the instrument relies on a theory-based definition and conceptual framework of COM-HL.
A limitation is that the instrument only measures patients' COM-HL in interaction with physicians. However, other healthcare professionals, such as nurses, are also providing health communication in healthcare settings. Hence, a version of the instrument comprising COM-HL in interacting with nurses should also be piloted. As the instrument intends to measure COM-HL in interaction with physicians in a healthcare setting, the instrument should be further tested among patients in clinical settings, especially with relevant indicators for predictive validity of the instrument. The validity and reliability of the HLS 19 -COM-P should also be further explored in people with chronic illnesses.
In most countries, data were collected during the COVID-19 pandemic, which could have had an impact on the responses, as face-to-face encounters in this period were restricted to some extent. However, in DE, data were collected before the pandemic, and the psychometric properties of the HLS 19 -COM-P do not differ much from DE to the other countries.
As the analyses are based on self-reported data, and some also from interviews, there might be a risk of response bias, such as social desirability. There is also a risk of recall misclassification as the experiences of physician-patient interactions might vary because of diverse factors, such as time since the last interaction, frequency of interactions, individual dependence on healthcare, cognitive skills, etc. Selection bias might also have occurred.

Conclusions
The HLS 19 -COM-P-Q11 and the short version HLS 19 -COM-P-Q6 worked quite well in the nine countries and across different data collection modes, even though misfit was found in a few items in some countries. The scale was also well accepted in all countries, with few missing values. Hence, this instrument could be used for identifying COM-HL in populations, and results from this instrument can be used to give recommendations for policy, practice and for COM-HL interventions (e.g., communication training for physicians). To our understanding, this is the first instrument to measure COM-HL as a separate construct in the family of related HLS instruments. However, the HLS 19 -COM-P should be further evaluated in clinical settings and should be adapted to measure COM-HL also in relation to other health professions.