Measurement Characteristics of the Most Commonly Used International Physical Activity Questionnaires for Adults Across EU-28 in National Language Versions: Systematic Review and Meta-Analysis

This review and meta-analysis (PROSPERO registration number: CRD42020138845) critically evaluates the test-retest reliability, concurrent validity and criterion validity of different physical activity (PA) levels assessed by the three most commonly used international PA questionnaires (PAQs) in national language versions across EU-28: IPAQ-SF, GPAQ, and EHIS-PAQ. In total, 4969 abstracts were read, 287 full-text articles were identified as relevant to the study objectives, and 19 studies were included in the final meta-analysis. The studies' results and quality were evaluated using the Quality Assessment of Physical Activity Questionnaires (QAPAQ) checklist. Results indicate that only five EU-28 countries validated national language versions of the selected PAQs. The meta-analysis revealed that assessment of moderate-to-vigorous PA (MVPA) is the most relevant PA level outcome, since no publication bias was detected in any of its measurement properties, while test-retest reliability was moderately high (rw=0.74), and validity was moderate for both criterion (rw=0.41) and concurrent validity (rw=0.72). Reporting of methods and results in the studies was poor, with an overall moderate risk of bias (total score of 0.43). In conclusion, where only self-reporting is feasible, assessment of MVPA with the selected PAQs in the EU-28 adult population is recommended for policymaking.


Introduction
Increasing the level of physical activity (PA) has become one of the priorities of public health policies in most developed countries of the world [1]. Over the last thirty years, we have witnessed an accelerated increase in the number of interventions to increase PA worldwide, although with limited effects [2,3]. Creating optimal policies and planning effective interventions aimed at increasing PA is not possible without reliable data on the prevalence of physical inactivity. Hence, numerous global authorities have called for concerted efforts in PA surveillance [4–7]. However, how to carry out PA monitoring is not entirely clear. Firstly, although methods for the assessment of PA are numerous, given the complex nature of PA, none of the currently available methods can assess all PA dimensions.
Based on the literature review, scientific methods for determining PA can be classified as either direct observation (objectively assessed PA) or indirect (subjectively assessed PA) [8,9]. Large PA surveillance systems have, until recently, relied solely on PA questionnaires (PAQs) as one form of subjective assessment of PA. Questionnaires are easy to apply in large groups of individuals and are therefore the basic method of assessing PA in large epidemiological studies. However, this method

Materials and Methods
The meta-analysis was performed and reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [31,32]. The present work was registered at the International Prospective Register of Systematic Reviews (PROSPERO), identification code CRD42020138845.

Search strategy
An identical search strategy was employed in the PubMed, SportDiscus, Scopus, Dart and ResearchGate databases, examining studies describing measurement properties of the three international PAQs, through April and May 2018 and again through April and May 2020. We used the following search string: "name of the questionnaire e.g. IPAQ AND (valid* OR reliab* OR repeat* OR reproducib* OR assess* OR measure*)". Additional studies were identified by manually searching the reference lists of the full papers identified during the search. Grey literature was additionally reviewed through ResearchGate, Google Scholar and Mendeley, using only the keyword "name of the questionnaire e.g. IPAQ AND valid*", and through personal communication of members of the research team with other scientists. Additional literature meeting the eligibility criteria of the present review was also obtained through an online questionnaire on the 1KA platform (University of Ljubljana, Faculty of Social Sciences) with the help of the World Health Organization within the EUPASMOS project activities. National HEPA focal points were asked to report on any national research, reports and doctoral theses published in their national languages that examined the measurement properties of any of the three PAQs included in this study. All articles generated from the initial search were stored in Mendeley reference management software & researcher network (Elsevier, Amsterdam, Netherlands), which removed duplicate references.

Eligibility criteria
Studies included in the present review had to be performed in healthy adults (18 years or older), carried out in one of the European Union's countries (28 countries included) and published in one of the 24 official EU languages. Studies were included if they examined the following PAQs: IPAQ, GPAQ and EHIS-PAQ, and reported on: (i) PAQ translation protocol, (ii) mode of administration (interview, self-administered) and (iii) reliability or (iv) concurrent validity or (v) criterion validity of the included PAQs. Studies performed in special populations (e.g. participants with specific medical conditions) were excluded.
Reliability refers to the consistency of a response either across multiple tests within a single assessment, generally called internal consistency, or across multiple assessments, known as test-retest or stability reliability [44]. The time interval between test and retest had to be described and had to be short enough that the participant's PA could not have changed, but long enough to prevent recall [35]. For PA assessed over the current or past week, a test-retest interval of 1 day to 3 months was considered appropriate, whereas for lifetime PA assessment, an interval of 1 day to 12 months was considered appropriate [35].

Risk of bias assessment
The risk of bias of the included studies was assessed using criteria previously applied by Sneck [45] and Sember [46], which include a criterion on power calculations. Each study received '0' (does not meet the criterion) or '1' (meets the criterion) for each criterion, based on the reporting in the original article. Methodological quality was assessed following the QAPAQ checklist [47], which was developed specifically for the qualitative assessment of PA questionnaires. Risk of bias and methodological quality were assessed by two independent reviewers (V.S. and K.M.), who also screened studies through the search process to identify potentially relevant articles. In case of uncertainty, a third and fourth reviewer (G.J. and G.S.) screened the article. Summary tables of entered data were checked against the trial protocol and the latest trial report or publication. Any discrepancies or unusual patterns were checked with the study's principal investigator. The Hunter-Schmidt estimate was used to reduce the amount of bias, and Fisher's z transformation was applied to the sample correlations to display publication bias [42,43]. We also assessed publication bias with the Egger bias test [48] for all PA constructs, separately for reliability, concurrent validity and criterion validity.
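To make the publication-bias step more concrete, the following is a minimal sketch of an Egger-type regression applied to Fisher z-transformed correlations. It is an illustration under stated assumptions, not the authors' actual analysis code: the function name `egger_intercept`, the use of 1/sqrt(n - 3) as the standard error of z, and the plain ordinary least squares fit are all assumptions introduced here.

```python
import math


def egger_intercept(correlations, sample_sizes):
    """Egger-type regression on Fisher z-transformed correlations.

    Assumption (not stated in the review): each correlation r is transformed
    with z = atanh(r) and given the standard error 1 / sqrt(n - 3).
    The standardized effect (z / se) is regressed on precision (1 / se);
    the intercept of that regression is the bias estimate.
    Returns (intercept, intercept_se, t_statistic, degrees_of_freedom).
    """
    if len(correlations) != len(sample_sizes) or len(correlations) < 3:
        raise ValueError("need at least three (r, n) pairs of equal length")

    z = [math.atanh(r) for r in correlations]             # Fisher's z
    se = [1.0 / math.sqrt(n - 3) for n in sample_sizes]   # SE of z
    x = [1.0 / s for s in se]                             # precision
    y = [zi / s for zi, s in zip(z, se)]                  # standardized effect

    k = len(x)
    mean_x = sum(x) / k
    mean_y = sum(y) / k
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("sample sizes must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual variance and the standard error of the intercept (simple OLS).
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (k - 2)
    intercept_se = math.sqrt(s2 * (1.0 / k + mean_x ** 2 / sxx))
    t_stat = intercept / intercept_se if intercept_se > 0 else float("nan")
    return intercept, intercept_se, t_stat, k - 2
```

An intercept far from zero suggests funnel-plot asymmetry, for example when small studies report systematically larger correlations than large ones. The p-value would be read from a t distribution with the returned degrees of freedom; that step is left out here to keep the sketch dependency-free.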
For further analysis, weighted mean correlation coefficients (rw) were determined using the Hunter-Schmidt approach [49], in which each study's correlation is weighted by its sample size (r × N). The generalizability of rp was corrected using an artefact correction and the sample variance. For the weighted mean (rw), the 95% credibility interval (CIw = rw ± 1.96√Vp) and the I² and Q statistics, which measure heterogeneity of effect sizes (ES), were calculated.
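The weighting scheme described above can be sketched as a "bare-bones" Hunter-Schmidt calculation. This is a hedged illustration rather than the authors' implementation: the function name `hunter_schmidt`, the sampling-error variance formula based on the mean sample size, and the computation of Q and I² on Fisher z values with weights n - 3 are assumptions made here, since the text does not specify these details.

```python
import math


def hunter_schmidt(correlations, sample_sizes):
    """Bare-bones Hunter-Schmidt summary for a set of correlations.

    Returns a dict with the sample-size-weighted mean correlation (rw),
    the estimated population variance (Vp), the 95% credibility interval
    rw +/- 1.96 * sqrt(Vp), and the heterogeneity statistics Q and I2 (%).
    """
    if len(correlations) != len(sample_sizes) or not correlations:
        raise ValueError("need non-empty (r, n) lists of equal length")

    total_n = sum(sample_sizes)
    k = len(correlations)

    # Weighted mean: each correlation is weighted by its sample size.
    rw = sum(n * r for r, n in zip(correlations, sample_sizes)) / total_n

    # Observed (weighted) variance of the correlations around rw.
    var_obs = sum(n * (r - rw) ** 2
                  for r, n in zip(correlations, sample_sizes)) / total_n

    # Assumed sampling-error variance, using the mean sample size.
    mean_n = total_n / k
    var_err = (1 - rw ** 2) ** 2 / (mean_n - 1)

    # Population variance after removing sampling error (floored at zero).
    vp = max(var_obs - var_err, 0.0)
    half_width = 1.96 * math.sqrt(vp)

    # Assumed heterogeneity statistics on the Fisher z scale, weights n - 3.
    z = [math.atanh(r) for r in correlations]
    w = [n - 3 for n in sample_sizes]
    z_bar = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
    q = sum(wi * (zi - z_bar) ** 2 for wi, zi in zip(w, z))
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0

    return {"rw": rw, "Vp": vp,
            "ci": (rw - half_width, rw + half_width),
            "Q": q, "I2": i2}
```

For two hypothetical studies with r = 0.5 (N = 100) and r = 0.7 (N = 300), the weighted mean is (50 + 210) / 400 = 0.65, which is closer to the larger study, as intended by the sample-size weighting.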

Data synthesis
Results of 19 studies for moderate (MPA), vigorous (VPA) and total PA (tPA) were synthesised into four categories: (1) General characteristics of selected studies of PAQs across EU (Table 1); (2) Reliability of PAQs in selected studies across EU (Table 2); (3) Concurrent validity of PAQs in selected studies across EU (Table 2); and (4) Criterion validity of PAQs in selected studies across EU (Table 2).

Grading the level of evidence
Levels of evidence for reliability were formulated following van Poppel and colleagues (2010): (1) adequate time interval between test and retest and an intraclass correlation coefficient (ICC), Kappa or Concordance reliability score > 0.7; (2) inadequate time interval between test and retest and an ICC, Kappa or Concordance reliability score < 0.7, or adequate time interval between test and retest and a Pearson/Spearman correlation > 0.7; (3) inadequate time interval between test and retest and a Pearson/Spearman correlation < 0.7. An additional grade was given based on the number of participants and the magnitude of the index or correlation. A positive score (+) was given to studies with > 50 participants and reliability coefficients > 0.70. A negative score (-) was given to studies with < 50 participants and reliability coefficients < 0.70. Pearson and Spearman correlations were considered inadequate because they do not account for systematic errors [50]; therefore, only ICC, Kappa or Concordance qualified for level (1) of evidence. Validity is the degree to which an instrument measures the construct it is intended to measure [47]. The highest level of criterion validity evidence would come from comparing PAQs with the gold standard, doubly labelled water (DLW) [51]. However, DLW also includes basal metabolic rate and the thermic effect of food; therefore, other validated instruments are more suitable for establishing construct validity. Accordingly, PAQs were compared with other PAQs (concurrent validity) and with accelerometers (criterion validity). For concurrent and criterion validity, the research team established the following levels of evidence: (1) validity score > 0.8; (2) 0.8 > validity score > 0.5; (3) validity score < 0.5. A positive score (+) was given to studies with > 50 participants and a negative score (-) to studies with < 50 participants.
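The validity grading rule above can be expressed as a small function. This is a sketch for illustration only: the function name `grade_validity` is hypothetical, and the handling of boundary values is an assumption, because the text does not say how a score of exactly 0.8 or 0.5, or a sample of exactly 50 participants, should be graded.

```python
def grade_validity(score, n_participants):
    """Grade a concurrent or criterion validity coefficient.

    Levels follow the thresholds given in the text:
      level 1: score > 0.8
      level 2: 0.5 < score < 0.8
      level 3: score < 0.5
    Assumption: a score of exactly 0.8 falls into level 2 and a score of
    exactly 0.5 into level 3, since the text leaves these cases undefined.

    The sign is '+' for more than 50 participants and '-' for fewer than 50.
    Assumption: exactly 50 participants yields no sign ('').
    Returns a string such as '2+' or '3-'.
    """
    if score > 0.8:
        level = 1
    elif score > 0.5:
        level = 2
    else:
        level = 3

    if n_participants > 50:
        sign = "+"
    elif n_participants < 50:
        sign = "-"
    else:
        sign = ""
    return f"{level}{sign}"
```

For instance, a hypothetical association of 0.62 from a study with 120 participants would be graded `2+`, while 0.30 from a study with 35 participants would be graded `3-`.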

Results
The flow of the review process is shown in Figure 1. In total, 4969 abstracts were read, 287 full-text articles were identified and read, and 19 studies were finally included in the present review (Figure 1). The characteristics of the included studies are presented in Table 1. We included studies from 17 different EU countries, mostly from the United Kingdom (7), Spain (4) and Germany (3). Three studies were cross-national [38,52,53]. Table 1 presents information from all 19 studies included in the present review of selected PAQs [36–38,52,54–70], including the country where the study was carried out, sample size, participants' age and gender, sample description, and the modes and means of administration. Altogether, 5902 people participated in 22 different sub-studies. The age of participants across all studies ranged from 18 to 75 years. The gender proportion of participants was reported in 17 of 19 studies, whereas in two studies it is unknown [59,60]. Regarding sample description, 13 studies used convenience sampling (68.4%), 4 random sampling (21.05%), 1 quota sampling (5.26%) and 1 multistage stratified probability sampling (5.26%). Most of the studies (n=12) used a self-administered mode of administration, 4 used interviews and 2 used telephone interviews. One study used both a self-administered questionnaire and an interview. All of the studies assessed duration and frequency of physical activity. All interventions meeting all predefined criteria are presented in Table 1. Table 2 presents test-retest reliability information from the selected studies across EU [37,38,57,59,62,64,69,70], including information about measurement interval, results (Pearson r, Spearman ρ, Lin's concordance correlation and Phi coefficient) and quality rating. Test-retest reliability was assessed most often for MPA (30) and least often for MVPA (5). Information on concurrent validity was reported in 9 PAQ studies across EU [36–38,56,60,62,64,71,72].
Information about comparison method, measured construct, correlation coefficient results and quality rating is shown in Table 2. Most of these studies assessed concurrent validity for tPA (11) and the least for MVPA (6). Table 2 presents information from 13 studies regarding criterion validity of PAQs in selected studies across EU [37,38,52,57,58,62–64,66,67,69,70,73], including information about the country where the study was carried out, duration of objective assessment, number of valid days and minutes per day, method for validity comparison, cut-off points, epoch length, definition of non-wear time and measured construct. Criterion validity was assessed most often for VPA and tPA (both 11) and least often for MPA and MVPA (both 9).
Based on weighted correlation means, concurrent validity was best for VPA (rw=0.72), where 4 associations were graded with level of evidence 1 (rw=0.82) and 5 associations with level of evidence 2 (rw=0.62) (Table 3). Concurrent validity was the lowest for tPA (rw=0.22), where 9 associations were graded with level of evidence 2 (rw=0.64) and 2 with level of evidence 3 (rw=0.38). On the other hand, although VPA showed the highest validity (rw=0.72), it should be noted that the Egger test (-5.63) showed significant bias between the included correlation coefficients for VPA (p<0.0001).
Based on weighted correlation means, criterion validity was best for VPA (rw=0.48), where 4 associations were graded with level of evidence 2 (rw=0.64) and 7 associations with level of evidence 3 (rw=0.30), whereas the worst criterion validity was noted for MPA (rw=0.14) (Table 3), with all 9 associations graded with level of evidence 3. Once again, although the highest criterion validity was noted for VPA, the Egger test (-5.59) showed significant bias between the included correlation coefficients for VPA (p<0.0001). Weighted correlation coefficients for test-retest reliability, concurrent validity and criterion validity across all included studies, stratified by PA intensity, are presented in Figure 2. The Egger bias test [48] provided evidence of publication bias for concurrent validity of VPA (bias=-5.63, 95% CI: -6.80 to -4.46, p<0.0001) and criterion validity of VPA (bias=-5.59, 95% CI: -7.38 to -3.81, p<0.0001), but not for concurrent validity of tPA (bias=-0.14, 95% CI: -6.47 to 6.20, p=0.97) or criterion validity of tPA (bias=-3.22, 95% CI: -6.55 to 0.11, p=0.09) (Table 3). The results of the risk-of-bias assessment are shown in Table 4. The total average risk of bias of all included studies was moderate (0.43). Of the 19 studies, only two were rated as having a low risk of bias (> 67% of total score), with an average of 0.73 of total score; 9 were rated as having a moderate risk of bias (> 33% and < 67% of total score), with an average of 0.45 of total score; and 8 were rated as having a high risk of bias (< 33% of total score), with an average of 0.32 of total score. Only 6 studies (33%) reported power calculations to determine sufficient sample size and only 3 studies met the assumption of randomization, although randomization is of limited importance for determining the reliability and validity of questionnaires [74].
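The risk-of-bias categories reported above follow from the 0/1 criterion scoring and the 33% and 67% cut-offs. A minimal sketch of that classification is shown below; the function name `risk_of_bias` and the treatment of scores lying exactly on a cut-off are assumptions introduced for illustration, and the list of criteria itself is not reproduced here.

```python
def risk_of_bias(criteria_met):
    """Classify a study's risk of bias from its 0/1 criterion scores.

    `criteria_met` is a list with one entry per criterion:
    1 if the study meets the criterion, 0 if it does not.
    The total score is the proportion of criteria met.
      score > 0.67          -> 'low' risk of bias
      0.33 <= score <= 0.67 -> 'moderate' risk of bias
      score < 0.33          -> 'high' risk of bias
    Assumption: scores exactly on a cut-off are treated as 'moderate'.
    Returns (score, label).
    """
    if not criteria_met:
        raise ValueError("at least one criterion is required")
    if any(c not in (0, 1) for c in criteria_met):
        raise ValueError("each criterion must be scored 0 or 1")

    score = sum(criteria_met) / len(criteria_met)
    if score > 0.67:
        label = "low"
    elif score < 0.33:
        label = "high"
    else:
        label = "moderate"
    return score, label
```

A hypothetical study meeting 8 of 11 criteria scores about 0.73 and is classed as low risk, whereas one meeting 1 of 4 criteria scores 0.25 and is classed as high risk.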

Discussion
This systematic review and meta-analysis investigated the test-retest reliability, concurrent validity and criterion validity of the three most commonly used PAQs across EU-28 in national language versions: IPAQ-SF, GPAQ and EHIS-PAQ. We identified 19 studies that adequately tested the selected PAQs in the 17-year period between 2003 and 2020.
The main findings include: i) GPAQ and IPAQ-SF are among the most widely used international PAQs in the EU [35], accounting for more than 90% of the records identified through the search strategy; ii) IPAQ and GPAQ are validated for MPA, MVPA and VPA in only 5 countries across EU-28; iii) the assessment of MVPA is the most relevant PA outcome, since no publication bias was detected in any of its measurement characteristics, while test-retest reliability was moderately high (rw=0.74), and validity was moderate for both criterion (rw=0.41) and concurrent validity (rw=0.72); iv) reporting of methods and results in the studies was poor, leading to a high risk of bias in 8 studies and a moderate risk of bias in 9 studies, and resulting in an overall moderate risk of bias with a total score of 0.43.
Our results revealed that MPA reached the lowest overall correlations for all measurement characteristics (reliability rw=0.42; concurrent validity rw=0.51; criterion validity rw=0.13). VPA reached the highest overall correlations (reliability rw=0.53; concurrent validity rw=0.72; criterion validity rw=0.48), but we also found publication bias in concurrent and criterion validity for this PA construct. All measurement characteristics were moderately high for MVPA (reliability rw=0.74; concurrent validity rw=0.41; criterion validity rw=0.40). Since we did not detect publication bias in any of the measurement characteristics for MVPA, we suggest that the assessment of MVPA is the most relevant PA outcome. More broadly, research findings indicate that MVPA in particular positively influences the public health of the adult population, which has also resulted in the development of recommendations for policymakers to increase PA in the European population [1].
Although there is no single rule of thumb for adequate sample size, test-retest interval and statistical analysis, academics have recommended an acceptable ratio of survey items to participants of 1:5 [34,76], the use of ICC and the Pearson correlation coefficient [47], and a test-retest interval of between three and eight days [76]. Based on our qualitative rating, only 8 of 305 PA constructs within the different measurement characteristics received grade 1, 142 constructs received grade 2 and 145 received grade 3. This is a cause for concern, since more than half of the constructs did not satisfy the preferred recommendations for assessing reliability and validity of PAQs, and it calls for more rigorous study designs in future reliability and validity investigations.
It is promising that reliability was found to be moderate to high (rw=0.40 to 0.74). More importantly, the test-retest intervals, with the exception of two studies [59,77], were within the optimal range [76] and were mostly between three and eight days. Since the reliability of MVPA and tPA was high even in the two aforementioned studies [77,78], which used a one-month interval between repeated assessments, this methodological weakness [34] does not undermine the conclusions of this study.
PAQs showed low to moderately high validity (rw=0.14 to 0.48) against objective measures of PA and moderate to high validity against subjective measures of PA (other PAQs). Our results are comparable to those reported previously [32,79], which showed validity of PAQs ranging from 0.1 to 0.50 against objective measures of PA [80]. However, it should be noted that criterion validity was assessed in only 5 different national versions of IPAQ (Ireland, Lithuania, Spain, Sweden and United Kingdom) and 4 different national versions of GPAQ (Austria, Belgium, Spain and United Kingdom) across EU-28. The results indicate differences in validity between all of the versions; therefore, other countries assessing PA cannot determine how valid their data are. Moreover, factors explaining the variation in the validity of PAQs may relate to differences in qualitative attributes of PAQs, such as recall period and number of items, as well as to heterogeneity of the population. It is well documented that there are differences in the prevalence of overweight and obesity [81] and in physical fitness level between nations and countries [82], which is a governing factor when assessing PA with a questionnaire. PAQs assess the subjective perception of PA, which is conditioned by physical fitness. Accordingly, it is surprising that only a few studies reported the reliability and validity of PAQs in relation to body mass index [36,66], whereas we did not find a single study relating them to physical fitness or fatness. We believe that one of the important factors affecting the variability of PAQ validity is the differing physical fitness levels of participants; therefore, inclusion of these factors might allow a more objective assessment of PA, as well as better international comparability of PA data. The rather low concurrent validity scores found in our study may be explained by the different recall periods of the investigated PAQs.
Furthermore, objective measures of PA are less dependent on long-term variation and can more accurately capture sporadic and intermittent behaviours [32], which results in higher validity of the measured PA construct but lower criterion validity of PAQs. Moreover, it was often unclear which dimension of PA the PAQ was supposed to measure, which sometimes made assessing concurrent validity impossible. It was also extremely difficult to assess whether the same or somewhat modified versions of PAQs were used in some studies, and it was not always clear whether the data were derived from a self-report questionnaire or whether the questionnaire was part of an interview [35]. Nevertheless, most of the studies enthusiastically concluded that the PAQ is valid without taking risk of bias and quality assessment into account. When we applied criteria for risk of bias and quality assessment, we found these conclusions to be overoptimistic, which is in concordance with previous reviews [35].

Limitations
There are several limitations of this study that should be acknowledged: i) although we systematically searched the five biggest databases in the field of PA twice and with several investigators, it is possible that not all relevant studies are included in the present meta-analysis; ii) the most commonly used PAQs in the present review were IPAQ and GPAQ. GPAQ uses a typical week to assess PA data. However, a typical week can differ between European countries due to weather conditions, yielding different PA levels; and iii) even though the quality of each study was assessed, findings from studies of lower quality were given no less importance than the other findings.

Conclusion
This meta-analysis examined the reliability and validity of the most commonly used international PAQs across EU-28 in national language versions and identified the widespread use of GPAQ and IPAQ-SF, which differ significantly in type, duration, recall period and complexity. Our results revealed that the quality of most studies was poor. Reliability was higher than validity for all included PA constructs, whilst MVPA reached the highest scores in test-retest reliability and generally performed the best across all measurement characteristics. Thus, where only self-report is affordable due to the time availability of participants and resources, assessment of MVPA with GPAQ and IPAQ-SF in the European adult population is recommended. When possible, a combination of objective measures of PA and self-report will provide the most valid and reliable results for MVPA. Further investigation of the measurement characteristics of PAQs should focus on studying differences in validity between groups of participants according to their physical fitness and/or fatness, since results from PAQs represent a subjective assessment of PA, which is closely related to fitness and fatness. This could contribute to a more critical evaluation of PA data assessed by PAQs.

Conflicts of Interest:
The authors declare no conflict of interest.