Validity and Reliability of International Physical Activity Questionnaires for Adults across EU Countries: Systematic Review and Meta Analysis

This review and meta-analysis (PROSPERO registration number: CRD42020138845) critically evaluates test-retest reliability, concurrent validity and criterion validity of different physical activity (PA) levels of the three most commonly used international PA questionnaires (PAQs) in official language versions of the European Union (EU): International Physical Activity Questionnaire (IPAQ-SF), Global Physical Activity Questionnaire (GPAQ), and European Health Interview Survey-Physical Activity Questionnaire (EHIS-PAQ). In total, 1749 abstracts were screened, 287 full-text articles were identified as relevant to the study objectives, and 20 studies were included. The studies’ results and quality were evaluated using the Quality Assessment of Physical Activity Questionnaires checklist. Results indicate that only ten EU countries validated official language versions of selected PAQs. A meta-analysis revealed that assessment of moderate-to-vigorous PA (MVPA) is the most relevant PA level outcome, since no publication bias was detected in any of the measurement properties, while test-retest reliability was moderately high ($r_w = 0.74$), criterion validity was moderate ($r_w = 0.41$) and concurrent validity was moderately high ($r_w = 0.72$). Reporting of methods and results of the studies was poor, with an overall moderate risk of bias and a total score of 0.43. In conclusion, where only self-reporting of PA is feasible, assessment of MVPA with selected PAQs in EU adult populations is recommended.


Introduction
Increasing the level of physical activity (PA) has become one of the priorities of public health policies in most developed countries in the world [1]. Over the last thirty years, we have witnessed an accelerated increase in the number of interventions to increase PA worldwide, although with limited effects [2][3][4][5]. Creating optimal policies and planning effective interventions aimed at increasing PA is not possible without reliable data on the prevalence of physical inactivity [1]. Hence, numerous global authorities have called for concerted efforts in PA surveillance [6][7][8]. However, how to execute PA monitoring is not entirely clear. Although methods for the assessment of PA are numerous, given the complex nature of PA, none of the currently available methods can assess all PA dimensions (duration, frequency, intensity and type of PA).
Based on the literature review, we can classify scientific methods for determining PA as direct observations or objectively assessed PA and indirect or subjectively assessed PA [9][10][11]. Large PA differentiate between two intensities of PA (vigorous and moderate) [35]. Both GPAQ and IPAQ were designed to compare PA levels in different cultural settings around the world. On the other hand, EHIS-PAQ is an EU-specific questionnaire within the European Health Interview Survey. EHIS-PAQ is a domain-specific questionnaire with a last-30-days recall, which includes 8 questions, covering three domains of PA (work-related, transport-related and leisure time), and distinguishes between aerobic and muscle-strengthening PA [46]. Although some reviews and meta-analyses of measurement properties of PAQs have already been published [38,47–49], there is still a lack of knowledge addressing this issue in the European population, which is very multi-national, multi-cultural and multi-lingual.
Therefore, the purpose of this systematic review and meta-analysis is to critically appraise, compare and summarize the measurement properties (reliability, criterion validity, construct validity) of PAQs most commonly used in trans-national surveillance systems for adults in EU-official language versions, taking the methodological quality of these studies, as well as the quality of the evidence, into account.

Materials and Methods
The meta-analysis was performed and reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) guidelines [50,51]. The present work was registered at the International Prospective Register for Systematic Reviews, identification code CRD42020138845.

Search Strategy
An identical search strategy was employed in PubMed, SportDiscus, Scopus, Dart and ResearchGate databases, looking for studies describing measurement properties of three international PAQs from April to May 2018. The search was later updated to include articles published between May 2018 and May 2020. We used the following search string "name of the questionnaire e.g., IPAQ AND (valid* OR reliab* OR repeat* OR reproducib* OR assess* OR measure*)". Additional studies were identified by manually searching the reference lists of the full papers identified during the search. Grey literature was additionally reviewed through ResearchGate, Google Scholar and Mendeley, using only the keyword "name of the questionnaire e.g., IPAQ AND valid*" and through personal communication of members of the research team with other scientists. Additional literature that corresponds to the eligibility criteria of the present review was also obtained through an online questionnaire posted on the platform 1KA (University of Ljubljana, Faculty of Social Sciences) with the help of the World Health Organization within EUPASMOS project activities. National health-enhancing physical activity (HEPA) focal points were asked to report on any national research, reports and doctoral theses published in their national languages that examined the measurement properties of any of the three PAQs included in this study. All articles generated from the initial search were stored in the Mendeley reference management software and researcher network (Elsevier, Amsterdam, The Netherlands), which was used to remove duplicate references.

Eligibility Criteria
Studies included in the present review had to be peer-reviewed, include healthy adults (18 years old or older), be carried out in one of the EU countries (28 countries included-United Kingdom was still part of the EU and was, therefore, included in this review) and be published in one of the EU's 24 official languages. For the purposes of the present review, only those studies which examined one or more of the most commonly used standardized PAQs in the EU [35][36][37][38][39][40][41] were included: IPAQ-SF, which was the first developed PA surveillance instrument [36] and is the most frequently used PAQ in the EU [37,38]; GPAQ, which is used in 120 countries and is the most used PAQ in the world [35,40,41]; and EHIS-PAQ, which is the only available EU surveillance system used by all EU member countries [33,42]. Studies needed to report the following characteristics: (i) PAQ translation protocol, (ii) mode of administration (interview, self-administered) and (iii) reliability or (iv) concurrent validity or (v) criterion validity of the included PAQ. Studies performed in special populations (e.g., participants with specific medical conditions) were excluded.
The time interval between the test and retest must have been described and short enough that the subject's PA could not have changed, but long enough to prevent recall [37]. For PA assessment during the current or previous week, a recall period of 1 day to 3 months was considered appropriate [37].

Quality and Risk of Bias Assessment
The assessment of the risk of bias of included studies was conducted using the criteria previously used by Sneck [52] and Sember [53], which include the criterion of power calculations. Each study received "0" (does not meet the criterion) or "1" (meets the criterion) for each criterion based on an analysis of the reporting in the original article. Methodological quality was assessed following the QAPAQ checklist [54], which was developed specifically for qualitative assessment of PA questionnaires. Risk of bias and methodological quality assessments were performed by two independent reviewers (Vedrana Sember and Kaja Meh).
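The binary scoring described above can be illustrated with a minimal sketch. It assumes that a study's total score is the proportion of criteria met and applies the cut-offs reported in the Results (≥67% low, between 33% and 67% moderate, <33% high risk of bias). The function name `risk_of_bias` and the handling of a score of exactly 0.33 are assumptions made for illustration, not part of the original protocol.

```python
def risk_of_bias(criteria_met):
    """Return (total_score, risk_label) for one study.

    criteria_met: list of 0/1 values, one per risk-of-bias criterion.
    Assumption: the total score is the share of criteria scored "1".
    """
    if not criteria_met or any(v not in (0, 1) for v in criteria_met):
        raise ValueError("criteria must be a non-empty list of 0/1 values")
    score = sum(criteria_met) / len(criteria_met)
    if score >= 0.67:
        label = "low"        # >= 67% of the total score
    elif score > 0.33:
        label = "moderate"   # between 33% and 67% of the total score
    else:
        label = "high"       # < 33%; exactly 0.33 is assigned here by assumption
    return score, label


# Hypothetical study meeting 4 of 9 criteria
print(risk_of_bias([1, 1, 1, 1, 0, 0, 0, 0, 0]))
```

Averaging the per-study scores returned by such a function would give an overall score comparable in form to the total reported for all included studies.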

Data Extraction and Statistical Analysis
Abstract and full-text article screening, data extraction and quality assessment were performed by two independent reviewers (Vedrana Sember and Kaja Meh) who also checked all databases and identified potential studies through the search process to identify potentially relevant articles. In case of uncertainty, a third and fourth reviewer (Gregor Jurak and Gregor Starc) screened the article. Summary tables of entered data were checked with the trial protocol and latest trial report or publication. Any discrepancies or unusual patterns were checked with the study principal investigator. A Hunter-Schmidt estimate was used for reducing the amount of bias and Fisher's z transformation was applied to samples' correlations to display publication bias [50,51]. We also assessed publication bias with Egger's bias test [55] for all PA constructs, separately for reliability, concurrent and criterion validity.
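For readers unfamiliar with Egger's test, the following sketch shows one common formulation applied to correlations: each correlation is Fisher z-transformed, the standard normal deviate (z divided by its standard error) is regressed on precision (the inverse of the standard error), and the intercept is read as the bias estimate. This is an illustrative assumption about the specification; the exact settings used in the present analysis are not restated here, and the function name `egger_intercept` and the example data are hypothetical.

```python
import math


def egger_intercept(correlations, sample_sizes):
    """Egger regression intercept (bias) and its standard error.

    Assumes Fisher z effect sizes with standard error 1 / sqrt(n - 3).
    """
    if len(correlations) != len(sample_sizes) or len(correlations) < 3:
        raise ValueError("need at least three studies with matching sizes")
    precision = [math.sqrt(n - 3) for n in sample_sizes]            # 1 / SE
    snd = [math.atanh(r) * p for r, p in zip(correlations, precision)]
    k = len(precision)
    mean_x = sum(precision) / k
    mean_y = sum(snd) / k
    sxx = sum((x - mean_x) ** 2 for x in precision)
    if sxx == 0:
        raise ValueError("all studies have the same precision")
    # Ordinary least squares fit: snd = intercept + slope * precision
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in zip(precision, snd)) / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((y - intercept - slope * x) ** 2
                      for x, y in zip(precision, snd))
    s2 = residual_ss / (k - 2)
    se_intercept = math.sqrt(s2 * (1 / k + mean_x ** 2 / sxx))
    return intercept, se_intercept


# Hypothetical data: smaller studies report larger correlations
bias, se = egger_intercept([0.8, 0.6, 0.45, 0.4], [20, 50, 150, 400])
print(bias, se)
```

An intercept far from zero relative to its standard error suggests small-study effects; a p-value would additionally require a t distribution with k − 2 degrees of freedom, which is omitted here to keep the sketch dependency-free.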
For further analysis, weighted mean correlation coefficients ($r_w$) were determined by the Hunter-Schmidt approach [55,56], in which each study's correlation is weighted by its sample size ($r \times N$). The generalizability of $r_p$ was corrected using an artefact correction and variance sample. For the weighted means ($r_w$), the 95% credibility interval, $CI_w = r_w \pm 1.96\sqrt{V_p}$, and the $I^2$ and $Q$ statistics measuring heterogeneity of effect sizes (ES) were calculated. Statistical analysis is explained in more detail elsewhere [53]. A forest plot was generated with the online software "DistillerSR Forest Plot Generator" from Evidence Partners.
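The computation outlined above can be sketched as follows. This is a bare-bones Hunter-Schmidt variant that corrects for sampling error only, which is an assumption: the artefact corrections mentioned in the text are not reproduced, and the heterogeneity statistics are computed on Fisher z-transformed correlations, which may differ from the procedure actually used. The function names and example values are hypothetical.

```python
import math


def hunter_schmidt(correlations, sample_sizes):
    """Return (r_w, v_p, (lower, upper)).

    r_w: sample-size-weighted mean correlation, sum(r_i * N_i) / sum(N_i)
    v_p: estimated population variance (observed minus sampling-error variance)
    interval: 95% credibility interval, r_w +/- 1.96 * sqrt(v_p)
    """
    if not correlations or len(correlations) != len(sample_sizes):
        raise ValueError("need one sample size per correlation")
    k = len(correlations)
    total_n = sum(sample_sizes)
    r_w = sum(r * n for r, n in zip(correlations, sample_sizes)) / total_n
    # Sample-size-weighted observed variance of the study correlations
    var_obs = sum(n * (r - r_w) ** 2
                  for r, n in zip(correlations, sample_sizes)) / total_n
    # Variance expected from sampling error alone (uses the mean sample size)
    var_err = (1 - r_w ** 2) ** 2 / (total_n / k - 1)
    # Assumption: sampling error is the only artefact corrected here
    v_p = max(var_obs - var_err, 0.0)
    half_width = 1.96 * math.sqrt(v_p)
    return r_w, v_p, (r_w - half_width, r_w + half_width)


def heterogeneity(correlations, sample_sizes):
    """Cochran's Q and I^2 on Fisher z-transformed correlations."""
    z = [math.atanh(r) for r in correlations]
    w = [n - 3 for n in sample_sizes]  # inverse variance of Fisher z
    z_bar = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
    q = sum(wi * (zi - z_bar) ** 2 for wi, zi in zip(w, z))
    df = len(z) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, i2


# Hypothetical correlations from two studies with N = 100 and N = 300
print(hunter_schmidt([0.5, 0.7], [100, 300]))
print(heterogeneity([0.5, 0.7], [100, 300]))
```

Weighting by sample size lets larger studies dominate the pooled estimate, while flooring $V_p$ at zero prevents a negative variance when the observed spread is smaller than sampling error alone would predict.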

Data Synthesis
Results of 20 studies were synthesized into four categories: (1) general characteristics of selected studies of PAQs across the EU; (2) reliability of PAQs in selected studies across the EU; (3) concurrent validity of PAQs in selected studies across the EU; and (4) criterion validity of PAQs in selected studies across the EU. The systematic review synthesized 20 studies and the meta-analysis synthesized only 17 studies, since it was performed only for moderate (MPA), moderate-to-vigorous (MVPA), vigorous (VPA) and total PA (tPA), and 3 studies failed to report these metrics.

Grading the Level of Evidence
Reliability levels of evidence were formulated following van Poppel and colleagues (2010) levels of evidence: (1) adequate time between test and retest and use of intraclass correlation (ICC), Kappa or Concordance reliability score >0.7; (2) inadequate time interval between test and retest and use of ICC, Kappa or Concordance reliability score <0.7, or adequate time interval between test and retest and Pearson/Spearman correlation >0.7; (3) an inadequate time interval between test and retest and Pearson/Spearman correlation <0.7. An additional grade was given depending on the number of participants and the level of index or correlation. A positive score (+) was given for studies with >50 participants and reliability coefficients >0.70. A negative (−) score was assigned to studies with <50 participants and reliability coefficients <0.70. Pearson and Spearman correlation were considered inadequate due to known systematic errors [57] and therefore only ICC, Kappa or Concordance were deployed in level (1) of evidence. Validity is the degree to which an instrument measures constructs [54]. The highest level of criterion validity evidence would be comparing PAQs to the gold standard-doubly labelled water (DLW) [58]. However, DLW also includes basal metabolic rate and the thermic effects of food, and therefore the use of other validated instruments is more reliable for obtaining construct validity. This is done by comparing a PAQ to another PAQ (concurrent validity) and to accelerometers (criterion validity). For concurrent and criterion validity, the research team established the following levels of evidence: (1) validity score >0.8; (2) 0.8> validity score ≥0.5; (3) validity score <0.5. A positive score (+) was given for studies with >50 participants and a negative (−) score was given for studies with <50 participants.
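The validity grading rule above can be expressed as a short sketch. Two boundary cases are not defined in the text and are resolved here by assumption: a score of exactly 0.8 is placed in level 2, and a study with exactly 50 participants receives no sign. The function name `grade_validity` is hypothetical.

```python
def grade_validity(score, n_participants):
    """Return a grade such as "1+", "2-" or "3" for one validity coefficient."""
    if score > 0.8:
        level = 1
    elif score >= 0.5:   # assumption: exactly 0.8 falls into level 2
        level = 2
    else:
        level = 3
    if n_participants > 50:
        sign = "+"
    elif n_participants < 50:
        sign = "-"
    else:
        sign = ""        # assumption: exactly 50 participants gets no sign
    return f"{level}{sign}"


# Hypothetical coefficient of 0.62 from a study with 30 participants
print(grade_validity(0.62, 30))
```

The reliability levels are not coded here because the text does not cover every combination of interval adequacy, statistic type and coefficient size, so any implementation would need further assumptions.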

Results
The flow of the review process is shown in Figure 1. In total, 4969 abstracts were identified, 1749 records were screened, 287 full-text articles were identified and read and 20 studies were finally included in the present review (Figure 1). The characteristics of the included studies are presented in Table 1. We included studies from 18 different EU countries, mostly from the United Kingdom (7), Spain (5) and Germany (3). Three studies were cross-national [33,59,60]. Table 1 presents information from all 20 studies included in the present review of selected PAQs [33,35,46,59–75], including the country where the study was carried out, the sample size, participants' age and gender, sample description, and modes and means of administration. Altogether, 5997 people in 23 different sub-studies participated. The age range of included participants in all studies was between 18 and 75 years. In 18 out of 20 studies, the gender proportion of participants was reported, whereas in two studies, gender proportion was unknown [75,76]. Regarding sampling procedures, 13 studies used convenience sampling (65%), 4 random sampling (20%), 1 quota sampling (5%), 1 multistage stratified probability sampling (5%) and one study did not report a sample description [61]. Most of the studies (n = 13) used a self-administered mode of administration, 4 used an interview and 2 used telephone interviews. In one study, both self-administered questionnaires and an interview mode were used. All of the included studies assessed the duration and frequency of physical activity.
Table 2 presents information from eight studies regarding the reliability of PAQs in selected studies across the EU [33,46,64,65,68,70,72,76], including information about measurement interval, results (Pearson r, Spearman ρ, Lin's concordance correlation and Phi coefficient) and quality ratings. Test-retest reliability was assessed most often for MPA (30) and least often for MVPA (5). The information for concurrent validity was reported in seven PAQ studies across the EU [33,35,46,69,70,72,75]. Information about comparison method, measured construct, correlation coefficient results and quality ratings is shown in Table 2. Concurrent validity was assessed most often for tPA (11) and least often for VPA (6). Table 2 also presents information from 13 studies regarding the criterion validity of PAQs in selected studies across the EU [33,46,59,62–65,68,70–74], including information on the country where the study was carried out, the duration of the objective assessment, the number of valid days and minutes per day, the method for validity comparison, cut-off points, epoch length, the definition of non-wear time and measured constructs. Criterion validity was assessed most often for VPA and tPA (both 11) and least often for MPA (9).

Based on weighted correlation means, test-retest reliability was best for MVPA ($r_w = 0.74$), where 3 of 5 associations were graded with level of evidence 1 ($r_w = 0.74$) and 2 with level of evidence 2 ($r_w = 0.73$); it was worst for MPA ($r_w = 0.40$) (Table 3), where 28 of 30 associations were graded with level of evidence 3 ($r_w = 0.41$) and only 2 with level of evidence 2 ($r_w = 0.58$).
Based on weighted correlation means, concurrent validity was best for VPA ($r_w = 0.72$), where 4 associations were graded with level of evidence 1 ($r_w = 0.82$) and 5 associations with level of evidence 2 ($r_w = 0.62$) (Table 3). Concurrent validity was lowest for tPA ($r_w = 0.22$), where 9 associations were graded with level of evidence 2 ($r_w = 0.64$) and 2 with level of evidence 3 ($r_w = 0.38$). Although VPA showed the highest concurrent validity, it should be noted that the Egger test (−5.63) showed a significant bias among the included correlation coefficients for VPA (p < 0.0001).
Based on weighted correlation means, criterion validity was best for VPA ($r_w = 0.48$), where 4 associations were graded with level of evidence 2 ($r_w = 0.64$) and 7 associations with level of evidence 3 ($r_w = 0.30$); the worst criterion validity was noted for MPA ($r_w = 0.14$) (Table 3), with all 9 associations graded with level of evidence 3. Once again, although the highest criterion validity was noted for VPA, the Egger test (−5.59) showed a significant bias among the included correlation coefficients for VPA (p < 0.0001). Results of weighted correlation coefficients for test-retest reliability, concurrent validity and criterion validity across all included studies stratified by PA intensity are presented in Figure 2.
Egger's bias test [53] provided evidence of publication bias for concurrent validity of VPA (bias = −5.63, 95% CI: −6.80 to −4.46, p < 0.0001) and criterion validity of VPA (bias = −5.59, 95% CI: −7.38 to −3.81, p < 0.0001), but not for concurrent validity of tPA (bias = −0.14, 95% CI: −6.47 to 6.20, p = 0.97) or criterion validity of tPA (bias = −3.22, 95% CI: −6.55 to 0.11, p = 0.09) (Table 3). The results of the risk-of-bias assessment are shown in Table 4. The total average risk of bias of all included studies was moderate (0.43).
Of the 20 studies, only two were rated as having a low risk of bias (≥67% of total score) with an average of 0.73 of the total score; 10 were rated as having a moderate risk of bias (>33 and <67% of the total score) with an average of 0.45 of the total score and 8 studies were rated as having a high risk of bias (<33% of total score) with an average of 0.32 of the total score. Only 6 studies (33%) reported power calculations to determine a sufficient sample size and only 3 studies met the assumption of randomization, which is of limited importance when determining the reliability and validity of questionnaires [77].
Table 4. Results of the risk-of-bias assessment.


Discussion
This systematic review and meta-analysis investigated the test-retest reliability, concurrent validity and criterion validity of the three most commonly used PAQs across the EU in national language versions: IPAQ-SF, GPAQ and EHIS-PAQ. We identified 20 studies that adequately tested selected PAQs in the recent 17-year period between 2003 and 2020.
The main findings include the following: (i) IPAQ, GPAQ and EHIS-PAQ were validated for MPA, MVPA and VPA in only 10 countries across the EU; (ii) the assessment of MVPA is the most relevant PA outcome, since no publication bias was detected in any of the measurement characteristics and test-retest reliability was moderately high ($r_w = 0.74$), while both criterion ($r_w = 0.41$) and concurrent validity ($r_w = 0.72$) were judged to be moderate; (iii) reporting of methods and results of the studies was rather poor, leading to a high risk of bias in 8 studies and a moderate risk of bias in 10 studies, resulting in an overall moderate risk of bias with a total score of 0.43; and (iv) the representation of different EU countries may be biased, since out of 20 studies, 7 were from the UK, 5 from Spain, 3 from Germany, 2 from Lithuania and 1 each from the other countries.
Our results revealed that MPA reached the lowest overall correlations for reliability and criterion validity (reliability $r_w = 0.42$; criterion validity $r_w = 0.14$) and MVPA reached the lowest correlations for concurrent validity ($r_w = 0.41$). VPA reached the highest overall correlations (reliability $r_w = 0.53$; concurrent validity $r_w = 0.72$; criterion validity $r_w = 0.48$), but we also found publication bias in concurrent and criterion validity for this PA construct. All measurement characteristics were moderate-to-high for MVPA (reliability $r_w = 0.74$; concurrent validity $r_w = 0.41$; criterion validity $r_w = 0.41$). Since we did not detect publication bias in any of the measurement characteristics for MVPA, we suggest that the assessment of MVPA is the most relevant PA outcome. Research findings increasingly indicate that MVPA in particular positively influences the health of the adult population, which has also resulted in the development of recommendations for policymakers to increase the MVPA of the European population [1].
Although there is no single rule of thumb relating to an adequate sample size, test-retest intervals and statistical analysis, academics have recommended an acceptable ratio of survey items to participants of 1:5 [49,78], a test-retest interval between three and eight days [78] and the use of ICC and the Pearson correlation coefficient [54]. Based on our qualitative rating, only 8 out of 311 PA constructs within different measurement characteristics received grade 1, 144 constructs were awarded grade 2 and 149 grade 3. Low qualitative ratings were mostly given because studies did not use the intraclass correlation (ICC), Kappa or Concordance reliability score; instead, the majority of studies used the Spearman coefficient of association. We recommend that researchers use Kappa or ICC in the future, because they also take into account rater bias [79]. This is a cause for concern, since more than half of the constructs did not satisfy the preferred recommendations for assessing the reliability and validity of PAQs, and it calls for a more rigorous study design in future reliability and validity investigations.
It is promising that the reliability of the investigated PAQs was found to be moderate to high ($r_w = 0.40$ to $0.74$). Of even greater importance, the test-retest intervals, with the exception of two studies [46,76], were within the optimal range [78] and were mostly between three and eight days. Since the reliability of MVPA and tPA was high even in the two aforementioned studies [46,76] that used a one-month interval between repeated assessments, this methodological weakness [49] does not hamper the conclusions of this study.
PAQs showed low-to-moderate validity ($r_w = 0.13$ to $0.48$) against objective measures of PA and moderate-to-high validity against subjective measures of PA (other PAQs). Our results are comparable with previous reports [48,80] that showed the validity of PAQs to range from 0.1 to 0.50 against objective measures of PA [81]. However, it should be noted that the criterion validity was validated in only six different national versions for IPAQ (Ireland, Lithuania, Spain, Sweden, Finland and United Kingdom) and four different national versions for GPAQ (Austria, Belgium, Spain and the United Kingdom) across the EU. Results indicate differences in the validity between different versions, and therefore the remaining countries assessing PA do not even know how valid their data are. Moreover, factors explaining the variation in the validity of PAQs may relate to differences in the qualitative attributes of PAQs, such as recall period and number of items, as well as heterogeneity of the population. It is well documented that there are differences in the prevalence of overweight and obesity [82] and in physical fitness levels between different nations and countries [83], which are governing factors when assessing PA with a questionnaire. PAQs assess the subjective perception of PA, which is conditioned by physical fitness. It is therefore striking that only a few studies reported the reliability and validity of PAQs while observing differences in validity between countries and sexes according to body mass index (BMI) [35,62], whereas we have not found a single study that used physical fitness as a criterion. It has been found that a high BMI can reduce the accuracy of devices such as accelerometers and heart rate monitors [84]. Additionally, PA data from self-reports seem to be over- or under-estimated among participants with higher BMI [84].
We believe one of the important factors affecting the variability of PAQs' validity to be the different physical fitness levels of the participants, and therefore the inclusion of this control might allow for a more objective assessment of PA, as well as better international comparability of PA data. The rather low concurrent validity scores found in our study may be explained by the different recall periods in the investigated PAQs. Next, objective measures of PA are less dependent on long-term variation, and can more accurately capture sporadic and intermittent behaviors [48], which results in a higher validity of measured PA constructs, but a lower criterion validity of PAQs. It was often unclear which dimension of PA a PAQ was supposed to measure, which sometimes made assessing concurrent validity impossible. Moreover, it was extremely difficult to assess whether the same or somewhat modified versions of PAQs were used in some studies, and it was not always clear whether the data were derived from a self-report questionnaire or whether the questionnaire was part of an interview [37]. Nevertheless, most of the studies enthusiastically concluded that the PAQ is valid, without taking into account risk of bias and quality assessment. When we applied criteria for risk of bias and quality assessment, we found this conclusion to be over-optimistic, which is in concordance with a previous review [37].

Limitations
There are several limitations of this study that should be acknowledged: (i) although we systematically searched the five biggest databases in the field of PA twice and with different investigators, it is possible that not all relevant studies are included in the present meta-analysis; (ii) the most commonly used PAQs in the included studies were IPAQ (7) and GPAQ (6), while EHIS-PAQ was included because it is the only questionnaire that is a part of the PA surveillance system of all EU member states [40]; in addition, GPAQ uses a typical week to assess PA data, but a typical week can differ between European countries because weather conditions yield different PA levels; (iii) the season of the assessed PA was not taken into account, and therefore different results could be reported from studies since the EU has four seasons; (iv) even though the quality of each study was assessed, findings from studies of a lower quality were given no less importance than the other findings; (v) sample type might have a potential impact on the results of the study, since 13 out of 20 studies used convenience sampling; (vi) the meta-analysis included only 17 studies, whereas the systematic review included 20 studies; (vii) coefficients of association were reported whether or not they were significant in the initial studies, potentially leading to different results if only significant results were used; (viii) according to the PROSPERO register we left Eurobarometer out of the manuscript since we did not find any validation studies; (ix) this review includes studies from the UK, although at the time of publication, the UK is no longer a part of the EU; (x) although there exist other widely used PA questionnaires targeting specific parts of the population, such as the Physical Activity Scale for the Elderly [85], we focused only on the questionnaires targeting the general adult population; and (xi) the results of the present meta-analysis refer only to the adult population and are not necessarily valid in other populations such as the elderly, children and patients.

Conclusions
Where only self-reporting is affordable due to the time limitations and resources of large-scale PA monitoring in EU adults, assessment of MVPA with GPAQ, IPAQ-SF or EHIS-PAQ is recommended. All EU countries should validate the translated PAQs in their national settings. In the validation studies, it would be advisable to employ BMI, physical fitness indicators or objective assessments of PA as validation criteria. Lastly, in order to further improve the validity and reliability of PAQs in adults, researchers should report the results in a standardized manner to allow for improved quality of assessment and a lower risk of bias.