Validity and Reliability of IPAQ-SF and GPAQ for Assessing Sedentary Behaviour in Adults in the European Union: A Systematic Review and Meta-Analysis

Current lifestyles are marked by sedentary behaviour; thus, it is of great importance for policymaking to have valid and reliable tools to measure sedentary behaviour in order to combat it. Therefore, the aim of this review and meta-analysis is to critically review, assess, and compile the reliability, criterion validity, and construct validity of the single-item sedentary behaviour questions within national language versions of the most commonly used international physical activity questionnaires for adults in the European Union: the International Physical Activity Questionnaire-Short Form and the Global Physical Activity Questionnaire. A total of 1749 records were screened, 287 full-text papers were read, and 14 studies were included in the meta-analysis. The results and quality of studies were evaluated by the Quality Assessment of Physical Activity Questionnaires checklist. Meta-analysis indicated moderate to high reliability (r_w = 0.59) and concurrent validity (r_w = 0.55) of national language versions of single-item sedentary behaviour questions. Criterion validity was rather low (r_w = 0.23) but in concordance with previous studies. The risk of bias analysis highlighted the poor reporting of methods and results, with a total bias score of 0.42. Thus, we recommend using multi-item SB questionnaires and smart trackers for providing information on SB rather than single-item sedentary behaviour questions in physical activity questionnaires.


Introduction
Sedentariness has taken over our lives in recent decades. Knowledge-based work usually requires sitting during working hours [1], and transport, leisure activities, and socialisation are associated with an increase in sedentary behaviour (SB) [2]. For the purpose of this study, SB is defined as any behaviour in the waking state with an energy expenditure of ≤1.5 metabolic equivalents (METs) while in a sitting or lying position [3].
Although SB and physical inactivity appear to be synonyms, there are major differences between the two terms. People who do not meet the WHO physical activity (PA) guidelines [4] are classified as physically inactive, while people who spend most of their waking hours sitting are classified as sedentary [3]. People meeting PA guidelines may be sedentary, and physically inactive people are not necessarily sedentary [2,5]. Both SB and physical inactivity have been associated with increased health risks [6,7,8,9], and light PA may be more beneficial than prolonged sitting [2]. One of the positive findings of recent studies is that high PA mitigates the negative effects of prolonged sitting, such as an increased risk of mortality [10]. As with PA, SB can be measured by various methods, which are divided into subjective and objective methods. Subjective methods rely on participants' recollection or recording of activities [11,12]; the most commonly used are self-reported measures, such as questionnaires [13,14]. Objective methods measure SB (or PA) directly with wearable monitors [11] and through device-based measures, such as accelerometry [14]. In large surveys, questionnaires are typically used [13,15] because they are inexpensive, can be used with a wide range of participants [16], and include information about the setting, context, and type of SB [13]. Questionnaires can collect information exclusively on SB, or the SB questions can be part of a PA questionnaire [5]. In addition, SB can be measured with a single item or by a combination of several questions [13]. Measuring SB with PA questionnaires is convenient in large-scale population surveys, because a lot of data can be gathered with one self-report instrument. On the other hand, these instruments are not designed to measure SB, which can affect the results, and the reliability and validity of SB questions in PA questionnaires are rarely tested.
Device-based measures are generally considered more valid and reliable [14,17] but are based on the recording of body movements at precise times [18] or time spent in specific postures [5]. As a result, the devices lack contextual information [5] that can provide crucial information about the typical SB of each participant or group of participants (e.g., sitting at work, screen time).
Most PA questionnaires include single-item SB questions [13], and many national and international PA surveys are based on such PA questionnaires [5]. Therefore, the reliability and validity of these specific single-item SB questions are essential for an accurate and precise assessment of sedentariness in society. In a recent meta-analysis based on 96 studies, the validity of SB questions was found to be moderate to low [19], which is consistent with previous findings [5,14]. Studies focusing on instruments with single items measuring SB also reported low validity (Spearman ρ ranging from 0.012 to 0.46), with insufficient and non-detailed information on comparability with sitting time measured with accelerometers [20,21,22]. The reliability of instruments assessing SB tends to be higher than their validity, ranging from moderate to high [5,19,20,21], and regularly performed behaviours (sitting at work, watching TV) tend to have higher reliability coefficients [14]. There are statistically significant differences in criterion validity of single-item SB questions compared to multi-item questions and activity logs, and a recent meta-analysis reported between-group differences with higher validity for multi-item instruments [5]. Conversely, another meta-analysis found no differences in correlation coefficients between single-item and multi-item SB questions [19], highlighting the need for further validity testing of single-item SB questionnaires. As none of the previous meta-analyses focused exclusively on single-item SB questions that are part of an international PA questionnaire, there is a need for further research. In the context of the EU, only single-item SB questions are used to measure SB; therefore, the reliability and validity of these questions should be tested and analysed.
The EUPASMOS project aims to create a unified framework for measuring PA, SB, and sport participation in the European Union (EU) member states. In this meta-analysis, we included studies on the measurement properties of the three most common international PA questionnaires used for PA surveillance conducted in the EU. Only studies that used official national language versions of the selected questionnaires were included. Studies met the inclusion criteria if they were based on one of the three most frequently used PA questionnaires with a single-item SB question in the EU, described elsewhere [23]. All selected PA questionnaires, namely the International Physical Activity Questionnaire-Short Form (IPAQ-SF) [24], the Global Physical Activity Questionnaire (GPAQ) [25,26], and the European Health Interview Survey-Physical Activity Questionnaire (EHIS-PAQ) [27,28], contain a single-item SB question. Although the questionnaires are similar, and none of them measures dimensions or type of SB, the main difference is the recall period. IPAQ-SF contains a question about sitting time on a workday during the last seven days (participants indicate hours and minutes) and is the first internationally recognised PA surveillance instrument [29,30]. GPAQ contains a question about usual sitting time on a workday (participants indicate hours and minutes) and is the most widely used PA questionnaire in the world, with use in 120 countries [29,30]. Lastly, EHIS-PAQ contains a multiple-choice question about sitting and lying down during the day in a usual week [28]. The reported reliability and validity of both the IPAQ-SF and GPAQ SB questions vary considerably across studies. The largest differences for all three constructs are seen for the IPAQ-SF (e.g., concurrent validity: Spearman ρ = 0.23 to 0.97 [20,31,32,33]). For the GPAQ, smaller differences are found for all three constructs, but results still differ between studies (e.g., concurrent validity: Spearman ρ ranging from 0.56 to 0.89) [25,34].
Several systematic reviews and meta-analyses of the reliability and validity of SB measurement have been published in recent years [5,13,19], but there is no research focusing on single-item SB questions, although most surveillance systems for measuring and reporting SB in society are based on single-item questions. Moreover, there is a lack of evidence regarding measurement properties of single-item SB questions in the EU, while the same questionnaires (and hence the same SB questions) are used for national and European policymaking to combat SB, as well as for cross-national comparisons (e.g., Eurobarometer). It is therefore important to examine the validity and reliability of the data gathered using single-item SB questions.
Thus, this systematic review and meta-analysis aims to critically review, compile, and assess the reliability, criterion validity, and construct validity of the single-item SB questions within national language versions of the most commonly used international PA questionnaires in the EU.

Materials and Methods
The present systematic review and meta-analysis were performed according to a protocol designed a priori following recommendations set by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [35,36]. The present work has been registered at the International Prospective Register for Systematic Reviews, identification code CRD42020138845.

Search Strategy
We first searched the PubMed, SportDiscus, and Scopus databases in April and May 2018 for studies reporting measurement characteristics of IPAQ-SF, GPAQ, and EHIS-PAQ. Subsequently, the Dart, ResearchGate, and Google Scholar databases were also searched. The same search was repeated in May 2020 to include articles published between May 2018 and May 2020. We used the following search string: "the name of the questionnaire (e.g., IPAQ) AND (valid* OR reliab* OR repeat* OR reproducib* OR assess* OR measure*)." Additional studies were identified by searching the reference lists of the full papers that met the eligibility criteria. Grey literature was additionally screened through ResearchGate, Google Scholar, and Mendeley, using only the keyword "name of the questionnaire, e.g., GPAQ AND valid*". Additional literature that met the eligibility criteria of the present review was also obtained through an online questionnaire published on the platform 1KA (University of Ljubljana, Faculty of Social Sciences) with the help of WHO in the framework of the EUPASMOS project activities. National focal points for health-enhancing physical activity (HEPA) were asked to report any national research, reports, and doctoral theses published in their national languages that examined the measurement properties of IPAQ-SF or GPAQ. All articles generated from the initial search were stored in the Mendeley reference management software and researcher network (Elsevier, Amsterdam, The Netherlands), which was used to remove duplicate references. Even though we performed the database search for EHIS-PAQ, we were unable to include it in the analysis, as we did not identify suitable research papers.
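The search string quoted above follows a simple pattern: the questionnaire name combined with a set of truncated keywords. A minimal sketch of how such strings could be generated for each questionnaire is given below; the function name `build_search_string` and the constant `TERMS` are illustrative assumptions, and the exact syntax accepted by each database may differ.

```python
# Truncated keywords taken from the search string reported in the text.
TERMS = ("valid", "reliab", "repeat", "reproducib", "assess", "measure")


def build_search_string(questionnaire):
    """Combine a questionnaire name with the wildcard keywords (hypothetical helper)."""
    wildcard_terms = " OR ".join(f"{term}*" for term in TERMS)
    return f"{questionnaire} AND ({wildcard_terms})"


# One string per questionnaire searched in the review.
for name in ("IPAQ", "GPAQ", "EHIS-PAQ"):
    print(build_search_string(name))
```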

Eligibility Criteria
All included studies were peer-reviewed, included healthy adults (>18 years; special populations (i.e., participants) were excluded), and were carried out in one of the EU countries (28 countries were included; the United Kingdom was still part of the EU and was therefore included in this review). Two independent reviewers (V.S. and K.M.) performed eligibility screening and included studies published in one of the 24 official languages of the European Union. The included studies needed to report modes of administration, translation protocols, and coefficients for reliability, concurrent validity, and criterion validity.

Quality and Risk of Bias Assessment
Methodological quality and risk of bias assessment were performed by two independent reviewers (V.S. and K.M.). The methodological quality of the included articles was examined using the Quality Assessment of Physical Activity Questionnaire checklist [37], which was developed specifically for assessing methodological quality in systematic reviews of PA questionnaires. The risk of bias assessment of the final sample of 14 included articles was conducted following the criteria previously set by Sneck [38] and Sember [23,39], in line with Cochrane's guidelines for Systematic Reviews and Interventions [40].

Data Extraction and Statistical Analysis
Systematic searches, article screening, and data extraction were performed by two independent reviewers (V.S. and K.M.), who extracted reliability and validity coefficients for single-item SB questions. In case of ambiguity, a third reviewer (G.J.) reviewed the article. In addition, senior investigators (G.J. and P.R.) checked the summary tables of the entered data and checked for any discrepancies in the data. Meta-analysis was performed following the Hunter-Schmidt approach [41], explained in more detail elsewhere [23]. Credibility intervals and the I² and Q statistics were calculated to measure the heterogeneity of effect sizes. The forest plots were generated with the help of MS Excel.
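To illustrate the quantities named above, the following is a minimal sketch of a bare-bones Hunter-Schmidt pooling of correlation coefficients. It is not the authors' implementation: the function name, the 80% credibility level, the chi-square form of Q, and the I² formula (Q minus its degrees of freedom, divided by Q) are assumptions of this sketch, and the input values are hypothetical rather than taken from the included studies.

```python
import math


def hunter_schmidt_summary(correlations, sample_sizes):
    """Bare-bones Hunter-Schmidt pooling of correlations (illustrative only)."""
    if len(correlations) != len(sample_sizes) or len(correlations) < 2:
        raise ValueError("need at least two (r, n) pairs of equal length")

    k = len(correlations)
    total_n = sum(sample_sizes)

    # Sample-size-weighted mean correlation.
    r_bar = sum(n * r for r, n in zip(correlations, sample_sizes)) / total_n

    # Sample-size-weighted observed variance of the correlations.
    var_obs = sum(
        n * (r - r_bar) ** 2 for r, n in zip(correlations, sample_sizes)
    ) / total_n

    # Variance expected from sampling error alone, based on the mean sample size.
    mean_n = total_n / k
    var_err = (1 - r_bar ** 2) ** 2 / (mean_n - 1)

    # Residual variance attributed to true differences, floored at zero.
    sd_rho = math.sqrt(max(var_obs - var_err, 0.0))

    # 80% credibility interval (the 80% level is an assumption of this sketch).
    cri = (r_bar - 1.28 * sd_rho, r_bar + 1.28 * sd_rho)

    # Chi-square homogeneity statistic Q with k - 1 degrees of freedom.
    q = sum(
        (n - 1) * (r - r_bar) ** 2 for r, n in zip(correlations, sample_sizes)
    ) / (1 - r_bar ** 2) ** 2

    # I^2: share of variability beyond what chance would produce, in percent.
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0

    return {"r_bar": r_bar, "cri": cri, "Q": q, "I2": i2}


# Hypothetical inputs, not data from the included studies.
summary = hunter_schmidt_summary([0.2, 0.6], [100, 300])
print(summary)
```

Weighting by sample size means that larger studies pull the pooled correlation towards their own estimate, which is why the pooled value here lies closer to 0.6 than to 0.2.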

Grading the Level of Evidence
The levels of evidence were formulated using criteria proposed by van Poppel and colleagues [30] (Table 1). Regarding reliability levels of evidence, Pearson and Spearman correlations were considered insufficient due to known systematic errors [42]; therefore, only ICC, Kappa, or Concordance were classified as the highest level (1) of evidence. The highest level of evidence for criterion validity would be the comparison of the PA questionnaires with the gold standard: doubly labelled water (DLW) [43]. However, DLW also includes basal metabolic rate and the thermic effect of food; therefore, the use of other validated instruments is more reliable for obtaining construct validity. Thus, we assessed the construct validity by analysing studies that compared one PA questionnaire with another PA questionnaire (concurrent validity) or with accelerometers (criterion validity).

Results
Figure 1 shows the flow of the study selection process. Of the 4716 studies identified through the search, 14 studies [20,21,25,31,32,33,34,44,45,46,47,48,49,50] were included in the present systematic review and meta-analysis (Figure 1). General characteristics, such as the country where the study was conducted, population, and modes and means of administration, are presented in Table 2. We included studies from 18 different EU countries, mainly from the United Kingdom (5) and Spain (3). Four studies were cross-national [20,21,46,47].
A total of 5294 people participated in the selected studies. The age range of participants included in all studies was between 18 and 75 years. In 13 of 14 studies, the participants' gender ratio was included (93%), with the gender ratio unknown in one study [47]. In terms of sampling, nine studies used convenience samples (64%), three used random samples (21%), and one used a multistage stratified probability sample (7.5%), while one study did not provide a sample description (7.5%) [44]. Most studies (n = 7, 50%) used a self-administered mode of administration, four studies used interviews (29%), two studies used a self-administered mode and an interview (14%), and one used a telephone interview (7%).
Tables 3 and 4 show results on test-retest reliability, concurrent validity, and criterion validity for SB in selected studies from the EU, including study population, measured construct (sitting or SB), comparison method (type of accelerometer or PA questionnaires), results (Spearman, Pearson, ICC), and methodological quality grade (1− to 3+).
Even though only five studies assessed test-retest reliability [20,21,32,34,47], the most associations (53) were found for this measurement construct. There was considerable variability in the reported data (Figure 2), and the weighted mean for test-retest reliability amounted to r_r = 0.59 (95% CI = 0.55 to 0.63). Only one study [34] was graded with the highest methodological grade, twenty-three associations were graded with grade 2, and thirty associations with the lowest methodological grade 3.
Five studies assessed concurrent validity [20,25,33,34,50] and reported eighteen different associations. The range of results was considerable, similar to reliability (Figure 3), and the weighted mean for concurrent validity was r_c = 0.55 (95% CI = 0.42 to 0.68). Most of the associations that assessed concurrent validity were graded with grade 1, four of them were graded with grade 2, and three associations were graded with methodological grade 3.
Criterion validity was assessed in 10 of 14 included studies, with 24 associations for this measurement construct [20,21,31,32,34,44,45,46,48]. One result with negative criterion validity stood out (r = −0.48); otherwise, criterion validity results were low but close to each other (Figure 4). The weighted mean for criterion validity was r_cr = 0.23 (95% CI = 0.19 to 0.27). All associations for criterion validity were given the lowest methodological grade: 3 (Table 4).
We included the qualitative assessment of the studies as a moderator in the analysis. For criterion validity, all studies were graded with grade 3; thus, no moderating effects were found. High heterogeneity of these studies due to differences in experimental design, analysis, poor study quality, etc. (Q = 10, p = 0.33) could also be the reason for the low results and grading (Table 5). For reliability and concurrent validity, we detected moderating effects of the quality assessment. Each of the three grades was a moderator of reliability and concurrent validity with statistically significant results (p < 0.00 for both constructs).
Notes (Table 5): N, number of studies; k, number of associations for selected construct and measurement characteristics; n, number of participants; CI, confidence interval; CRI, credibility interval; I², index of heterogeneity; Q, chi-square test of heterogeneity.
Egger's bias test [51] showed no publication bias for any of the measurement characteristics regarding SB (p > 0.30 for all) (Table 4). The results of the risk-of-bias assessment are presented in Table 6. The total average risk of bias for all included studies was moderate (0.42). Of the fourteen studies, only one was rated as having a low risk of bias (≥67% of total score) with a score of 0.78 of the total score, six were rated as having a moderate risk of bias (>33% and <67% of the total score) with an average of 0.48 of the total score, and seven studies were rated as having a high risk of bias (<33% of total score) with an average of 0.31 of the total score. In addition, only four studies (29%) reported power calculations to determine sufficient sample size [52].
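The risk-of-bias categories and the overall score reported above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name is an assumption, and because the stated thresholds leave a score of exactly 33% unassigned, treating it as high risk is an assumption of this sketch.

```python
def classify_risk_of_bias(score):
    """Map a proportion of the total score (0 to 1) to a risk-of-bias category."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a proportion between 0 and 1")
    if score >= 0.67:
        return "low"
    if score > 0.33:
        return "moderate"
    # Assumption: a score of exactly 0.33 is treated as high risk.
    return "high"


# Group sizes and group averages as reported in the text.
groups = {"low": (1, 0.78), "moderate": (6, 0.48), "high": (7, 0.31)}

total_studies = sum(count for count, _ in groups.values())
overall = sum(count * mean for count, mean in groups.values()) / total_studies

print(total_studies)                    # 14 studies in total
print(round(overall, 2))                # weighted average of the group means
print(classify_risk_of_bias(overall))   # category of the overall score
```

The weighted average of the three group means, (1 × 0.78 + 6 × 0.48 + 7 × 0.31) / 14, rounds to 0.42, matching the reported total score and falling in the moderate band.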

Discussion
This systematic review and meta-analysis examined test-retest reliability, and concurrent and criterion validity of the single-item SB question in national language versions of the two most commonly used PA questionnaires in the EU: IPAQ-SF and GPAQ. Through the literature review, we identified 14 studies that adequately tested selected PA questionnaires over the last 17-year period between 2003 and 2020.
The main findings of our study are: (i) IPAQ-SF and GPAQ were criterion-validated for SB in only 10 EU countries, with a low weighted mean of r_cr = 0.23 (95% CI 0.19 to 0.27), (ii) reporting of study methods and results was rather poor, with only one study with low risk of bias and 13 studies with moderate or high risk of bias, resulting in an overall moderate risk of bias with a total score of 0.42, and (iii) the representation of the different EU countries might be biased, as fewer than half of the EU countries were included in SB measurement characteristics studies.
Our research highlighted that SB is not reported in all studies using PA questionnaires, although all of these studies collected data on SB. Thirteen EU countries were represented in 14 studies (6 from the UK, 3 from Spain and the Netherlands, 2 from Belgium, Finland, France, Germany, Portugal, and Sweden, and one from Austria, Italy, Lithuania, and Slovenia). Furthermore, four of the studies were conducted internationally [20,21,46,47], two of them with small sample sizes [21,46], and one study did not report the results separately for each participating country [46]. Therefore, the representation of the different EU countries may be biased. We can assume that PA is still considered a more relevant factor or contributor to a healthy lifestyle, even though studies in recent years have highlighted the effect physical inactivity and SB have on health [2,53], and the 24 h movement behaviour approach advocates for the integration of all movement behaviours across the 24 h daily span in the movement guidelines [54].
Furthermore, single-item SB questions in IPAQ-SF and GPAQ were criterion-validated in only 10 EU countries (Belgium, Finland, France, Germany, Lithuania, the Netherlands, Slovenia, Spain, Sweden, and the United Kingdom). As expected, the results for criterion validity are low but still perplexing, with coefficients r_w = 0.19 to 0.27, similar to the low correlation coefficients for MVPA presented in the previous meta-analysis on the same PA questionnaires [23]. Similarly startling results were reported in a recent meta-analysis, in which both IPAQ-SF and GPAQ showed underreporting of SB compared to accelerometers (mean difference for IPAQ-SF = −161.67 and for GPAQ = −219.85), whereas multi-item SB questionnaires performed much better (e.g., mean difference for SBQ = −5.8 and Marshall sitting questionnaire = 83.85) [5]. Poor validity results for MVPA and SB are even more concerning when recommendations and PA-enhancing measures are based on them. Some previous validation studies using the same PA questionnaires on non-European populations came to similar conclusions [5,14], while studies including multiple questions on SB reported higher correlations [19]. Comparison of single-item SB questions in PA questionnaires and multiple-item SB questionnaires from previous meta-analyses demonstrated advantages of the latter, which are better suited for the precise measurement of SB: single-item questions significantly underreport sitting time compared to multi-item questionnaires [5,14], and multi-item questionnaires are more domain-specific [19]. Multi-item SB questionnaires can help participants to remember more of their SB (e.g., transport-related SB, sitting at work) and provide more detailed information for researchers and healthcare providers [10,55].
Therefore, multi-item questionnaires can improve the criterion validity, since they are more complex [19]. Although test-retest reliability and concurrent validity are moderate to high, we aim for high criterion validity, as it is the measure that compares the subjective method (the SB questionnaire) with the objective measure. Therefore, for more valid and reliable results, we propose to use questionnaires with multiple-item SB questions and to include the results in the 24 h movement behaviour guidelines. This information would provide policymakers with more valid data about the time spent sitting and the context of SB.
Results obtained using the SB questions need to be critically evaluated and not taken for granted, as the results are likely to underestimate (or overestimate) the time spent sitting [44,45,50]. In our review, 71% of the included studies in 10 countries (Belgium, Germany, Finland, France, Lithuania, the Netherlands, Slovenia, Spain, Sweden, UK) also analysed the data for criterion validity. Although the percentage of studies is high, only one third of the EU countries conducted a validation study using objective methods on their national version of the PA questionnaire. Without an acceptably validated national language version of the most commonly used PA questionnaires and the corresponding SB question, it is impossible to compare data between countries in the EU, and health programmes risk being based on the results of inadequate questionnaires [56,57,58,59].
As expected, the test-retest reliability of single-item SB questions was moderate to high (r_w = 0.55 to 0.63), which is similar to the reliability of PA constructs in the same PA questionnaires (e.g., moderate PA r_w = 0.37 to 0.43 and vigorous PA r_w = 0.49 to 0.58) [23]. The test-retest interval should be between three and eight days [52], and the recall periods of all included studies were within the desired range and consistent with previous studies [5,14,19]. Moreover, the concurrent validity of single-item SB questions was found to be moderate to high, with a weighted mean ranging from 0.42 to 0.68. The weighted mean for moderate to vigorous PA from a recent meta-analysis of the same PA questionnaires was lower than that for SB (r_w for MVPA = 0.36 to 0.46) [23]. As highlighted in a meta-analysis by Bakker and colleagues [19], measuring subjective perceptions of physical (in)activity behaviours, from sedentariness to vigorous PA, is complicated; therefore, validity results are moderate.
The analysis of risk of bias highlighted poor reporting of methods and results, with a total bias score of 0.42. Only 14 of 110 SB measurement characteristics received grade 1, 28 were rated with grade 2, and 68 with grade 3. In particular, criterion validity received low grades due to low correlation coefficients between objective and subjective methods, and all 24 criterion validity associations were given the lowest grade 3. Low qualitative ratings were usually awarded because the studies did not use the intraclass correlation coefficient (ICC), Kappa, or Concordance reliability score, but instead mostly used the Spearman coefficient of association.
We conclude that EU studies validating SB questions in IPAQ-SF or GPAQ are methodologically inadequate, which is in line with the meta-analysis conducted on PA and on the same PA questionnaires in the EU [23]. This is a cause for concern, as more than half of the constructs did not follow the preferred recommendations for the assessment of the reliability and validity of PA questionnaires, and this requires a more rigorous study design in future reliability and validity studies. We recommend that researchers use Kappa or ICC, as these also take the rater bias into account [60].
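The preference for ICC over simple correlation coefficients can be illustrated with a short worked example. In the sketch below, every hypothetical participant reports exactly 60 minutes more sitting at retest than at test. The Pearson correlation is perfect because the ranking and spacing of participants are unchanged, whereas an absolute-agreement ICC is penalised by the systematic shift. The choice of the two-way random-effects, absolute-agreement, single-measure form of the ICC, the function names, and the data are all assumptions of this illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def icc_absolute_agreement(test, retest):
    """Two-way random-effects, absolute-agreement, single-measure ICC."""
    n, k = len(test), 2  # n participants, k = 2 measurement occasions
    rows = list(zip(test, retest))
    grand = (sum(test) + sum(retest)) / (n * k)
    row_means = [sum(row) / k for row in rows]
    col_means = [sum(test) / n, sum(retest) / n]

    # Sums of squares for participants, occasions, and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in rows for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical sitting times in minutes per day; retest is shifted by +60.
test_minutes = [240, 300, 360, 420, 480]
retest_minutes = [m + 60 for m in test_minutes]

print(pearson_r(test_minutes, retest_minutes))
print(icc_absolute_agreement(test_minutes, retest_minutes))
```

With these numbers the Pearson coefficient equals 1, while the ICC is about 0.83, so a systematic difference between occasions that a correlation coefficient cannot detect lowers the agreement estimate.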

Strengths and Limitations
Eurobarometer, as a cross-national comparison in the EU, and most of the national surveillance systems for measuring and reporting SB are based on single-item SB questions. Our research examined several available sources to find validation studies on the three most common PA questionnaires in EU national languages. However, the present systematic review and meta-analysis has several limitations that should be considered when extrapolating the results: (i) the review was limited to only two of the most commonly used PA questionnaires in the EU, which include only single-item SB questions; (ii) although we searched five databases, it is possible that not all relevant studies are included in the present review; (iii) not all questionnaires and studies were conducted in the same week or period of the year, so differing weather conditions across the EU may have affected the reported SB; (iv) one of the studies reported SB for weekdays and weekends, which is not consistent with the IPAQ-SF protocol [50]; (v) although the quality of each study was assessed, the results for criterion validity did not differ, as all studies were graded with the lowest grade, which also influenced the high heterogeneity of said construct; (vi) coefficients of association were reported regardless of whether they were significant or insignificant in the initial studies, which could lead to different results if only significant results were used; (vii) this review includes studies from the UK, although the UK is no longer a part of the EU at the time of publication; and (viii) the results of the present meta-analysis refer only to the adult population and are not necessarily valid for other populations, such as the elderly, children, and patients.

Conclusions
The single-item SB questions in the EU national versions of GPAQ and IPAQ-SF have low criterion validity; therefore, the assessment of SB in EU populations based on these PA questionnaires is problematic. Policymakers should be aware of the limitations of information on SB collected with the aforementioned PA questionnaires when evaluating policies. We recommend the use of commercial fitness trackers to monitor sedentary behaviour within the 24 h movement behaviour paradigm in combination with multi-item SB questionnaires.
Such a combination could provide more valid and inexpensive information on the time and context of SB necessary to determine behavioural interventions. Additionally, the surveillance of physical fitness as a measure of health (PA, physical inactivity, and SB influence physical fitness) should be considered as a much better alternative to PA and SB questionnaires. Finally, researchers should report the results of SB questions in a standardised manner to improve the quality of the assessment and reduce the risk of bias.

Data Availability Statement:
The data presented in this study are available in this paper.