Reliability and Concurrent Validity of Global Physical Activity Questionnaire (GPAQ): A Systematic Review

This study aimed to systematically review previous studies on the reliability and concurrent validity of the Global Physical Activity Questionnaire (GPAQ). A systematic literature search was conducted (n = 26) using the online EBSCOHost databases, PubMed, Web of Science, and Google Scholar up to September 2019. A previously developed coding sheet was used to collect the data. The Modified Quality Assessment Tool for Observational Cohort and Cross-Sectional Studies was employed to assess risk of bias and study quality. It was found that the GPAQ was primarily revalidated in adult populations in Asian and European countries. The sample size ranged from 43 to 2657, with a wide age range (i.e., 15–79 years old). Different populations yielded inconsistent results concerning the reliability and validity of the GPAQ. Short-term (i.e., one- to two-week interval) and long-term (i.e., two- to three-month interval) test–retest reliability was good to very good. The concurrent validity using accelerometers, pedometers, and physical activity (PA) logs was poor to fair. In some validation studies, the GPAQ data and accelerometer/pedometer/PA log data were not based on the same measurement period. Studies with more rigorous research designs are needed before any conclusions concerning the concurrent validity of the GPAQ can be reached.


Introduction
Participation in physical activity (PA) on a regular basis is well documented as a critical component of a healthy lifestyle and disease prevention [1]. Public health organizations, educational institutions, and others conducting intervention projects therefore need valid and reliable scales for measuring PA [2][3][4]. Without validated PA measurement scales, it is impossible to accurately and fairly assess and monitor the progress of PA interventions across different settings [5,6]. Although technologies such as pedometers and accelerometers have increased the objectivity and accuracy of PA measures [7,8], self-reported survey methods still have the advantage of reaching a large segment of the general population at low cost [5,9]. To date, several different PA questionnaires have been developed and validated for use in developed countries [2,6]. These questionnaires, however, are of limited use outside the populations in which they were validated.

Inclusion/Exclusion Criteria
Peer-reviewed journal articles; written in English; analyzed/discussed the reliability and/or validity of the GPAQ; studies merely using the GPAQ to collect PA data were excluded; conference abstracts and papers were excluded; articles discussing the GPAQ without actual reliability and validity data were not selected; the time frame was set from 2002 to September 2019.
Search Terms
The same terms were used in PubMed, Google Scholar, and Web of Science: validity, concurrent validity, reliability, validation, global physical activity questionnaire, and GPAQ.

Data Reduction and Harmonization
A coding sheet was first developed based on the purposes of the study and the PRISMA-P checklist [25]. Previously published systematic review studies on similar topics [5,26,27] were also examined to help generate the complete coding sheet. The following data were coded: Countries in which the GPAQ was tested, first author's name, year of publication, research design, the number of participants, participants' mean age, gender, ethnicity, and occupation, comparison devices/methods, and values of validity and reliability measure(s).
Using the above pre-established inclusion/exclusion criteria, two independent reviewers with a PA questionnaire validation background initially screened all abstracts of the 391 retrieved articles, rating each abstract as yes, no, or maybe for inclusion. Abstracts rated yes were kept, and those rated no by both reviewers were eliminated. Abstracts rated maybe were reviewed by a third reviewer, and consensus on any disagreement was reached through discussion among the three reviewers. A total of 26 articles met the inclusion criteria and were retrieved for complete review and analysis. The entire text of each selected article was examined by three investigators. Two other authors served as reliability coders, independently coding all 26 articles to ensure acceptable inter-coder reliability (agreement > 80%) [28]. Discrepancies between coders were discussed among all team members until consensus was reached.
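The inter-coder agreement threshold used here can be read as simple percent agreement, i.e., the proportion of items both coders assigned the same code. A minimal sketch of that calculation (the function name and the example codes are illustrative, not actual study data):

```python
def percent_agreement(coder_a, coder_b):
    """Proportion of items on which two coders assigned the same code."""
    assert len(coder_a) == len(coder_b), "coders must rate the same items"
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)

# Illustrative codes for 10 articles (e.g., a research-design category)
coder_a = ["cross-sectional"] * 8 + ["cohort"] * 2
coder_b = ["cross-sectional"] * 7 + ["cohort"] * 3
print(f"{percent_agreement(coder_a, coder_b):.0%}")  # 90%, above the 80% threshold
```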

Methodological Quality and Risk of Bias
The methodological quality and risk of bias of the selected articles were assessed using the Modified Quality Assessment Tool for Observational Cohort and Cross-Sectional Studies [29]. The tool was chosen based on the purpose of the study and the content domains included, allowing us to assess the appropriateness of the methodology, study design, participant selection, and data analysis. The following two modifications were made. First, instead of using a checklist style (i.e., yes or no), a three-point Likert scale (i.e., 3 = good, 2 = fair, and 1 = poor) was employed to evaluate the items included in the tool. Second, the content of the assessment tool was modified according to previous research on PA questionnaire reliability and validity [3,5]. Specifically, item 13 was not applicable to our paper and was therefore deleted. The standards used by Chinapaw et al. [3] and Hidding [5] were adopted to assess the methodological quality of test-retest reliability studies regarding the time interval: between >1 day and <3 months for questionnaires recalling a standard week, as used in the GPAQ [6].
Two of the researchers independently rated the articles on each of the items listed in the tool. Because there were only 26 articles with specific reliability and/or validity data, the raters coded all studies separately, adding specific comments on the weaknesses of the studies rated poor. The ratings were then compared, and discrepancies among raters were discussed until 100% agreement was achieved.

Reliability
Studies were included for synthesis if they reported intraclass correlation coefficients (ICC) for continuous measurements across raters [30], Pearson's r for continuous variables, Spearman's correlation coefficients (rho) for ranked variables, and/or agreement measures using Cohen's simple kappa (κ) for binary ratings and weighted κ for inter-rater agreement assigning different weights to different levels of disagreement [30,31]. It was noted that a single study may report values of kappa, weighted kappa, and ICC when continuous data are used [32]. According to Warner [31], the cut-off values of r, rho, or κ for poor, moderate (acceptable), and strong are <0.4, 0.4-0.8, and >0.8, respectively. For ICC, the cut-off values are slightly different (i.e., poor: ICC < 0.50, moderate: 0.50-0.70, good: 0.70-0.90, excellent: ICC > 0.90). Because the sample size and participant characteristics varied greatly among the selected studies, the mean value of test-retest reliability was not calculated. Instead, the range of reliability and validity values was examined.
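These cut-offs can be written as a small helper for labeling reported coefficients. A sketch assuming the Warner [31] bands for r/rho/κ and the ICC bands described above (function names are ours):

```python
def classify_r(value):
    """Label an r, rho, or kappa value using Warner's bands."""
    v = abs(value)
    if v < 0.4:
        return "poor"
    if v <= 0.8:
        return "moderate"
    return "strong"

def classify_icc(icc):
    """Label an ICC value using the bands adopted in this review."""
    if icc < 0.50:
        return "poor"
    if icc < 0.70:
        return "moderate"
    if icc < 0.90:
        return "good"
    return "excellent"

print(classify_r(0.53))    # moderate
print(classify_icc(0.85))  # good
```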

Concurrent or Criterion Validity
Concurrent validity is defined as the degree to which the measurement values of the GPAQ are consistent with a criterion-related standard [4,20,22,32]. The criterion-related standard was established by both self-reported (i.e., PA log and IPAQ) and objectively measured data (i.e., pedometers and accelerometers). Concurrent validity was often tested using Spearman's rho and Bland-Altman plots, the latter visually assessing the absolute agreement between two different methods measuring the same variable [28,30]. The mean bias could also be examined using Bland-Altman plots. The aforementioned cut-off values for rho were used to assess the magnitude of concurrent validity. The concurrent validity of the three PA domains and sedentary behavior (SB) was also synthesized.
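The Bland-Altman statistics behind such plots reduce to the mean of the paired differences (the bias) and the 95% limits of agreement (bias ± 1.96 SD of the differences). A minimal numeric sketch with made-up weekly MVPA minutes (the data and function name are hypothetical):

```python
import statistics

def bland_altman(method_a, method_b):
    """Mean bias and 95% limits of agreement for two paired measurements."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Hypothetical weekly MVPA minutes: self-report vs. accelerometer
gpaq = [300, 150, 420, 90, 240]
accel = [210, 160, 330, 120, 200]
bias, (lo, hi) = bland_altman(gpaq, accel)
print(round(bias, 1), round(lo, 1), round(hi, 1))  # bias 36.0, LoA -72.8 to 144.8
```

A positive bias here would indicate self-reported overestimation relative to the criterion, which is the pattern the review reports for MVPA.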
The selected studies were completed in 18 countries; 10 of these were Asian countries (i.e., Bangladesh, China, India, Korea, Malaysia, Saudi Arabia, Singapore, Thailand, the United Arab Emirates (UAE), and Vietnam), accounting for 20 publications (i.e., 76.9%). The rest were from European countries and the US. Two validation studies spanned multiple countries: one reported validation data from nine countries [1], and the other was done in Belgium, Spain, and the UK [44].
The representation of other continents was limited, given that only one study each was found in South America [19] and Africa [39], even though it was reported that the GPAQ has been used in many African countries [1,2]. Moreover, only one article included participants who were younger than 18 [36]. No studies specifically examined the reliability and validity of the GPAQ in the elderly, even though many studies included participants older than 60 years [7,11,12,20,36,40,42,43].
The GPAQ can be interviewer-administered using face-to-face interviews or self-administered [2,18,47,48]. An interviewer-administered GPAQ requires a trained interviewer, whereas the self-administered mode may be more cost-effective, provided it is valid and reliable. Chu and colleagues [18] tested the differences between the self- and interviewer-administered modes and found a similar level of comparability between the two administrations (see Table 2).

Methodological Quality and Risk of Bias
All studies used a cross-sectional research design (see Table 3). The GPAQ data were compared to accelerometer and/or pedometer data, except for one study that used a PA log [11]. The study quality scores ranged from 1 to 3 (the higher the mean, the better the study quality). The most common methodological shortcoming was that pedometers or accelerometers were not worn during the typical week that the GPAQ aimed to measure. A small sample size was the second most common weakness (see Table 4).

Concurrent Validity of GPAQ
A cross-sectional research design was employed by all studies. Most of the selected studies (24 out of 26) examined the concurrent validity of the GPAQ by comparing the weekly GPAQ data with one or more criterion measures. Eight studies used pedometer, PA log, and/or IPAQ data as the criterion measure, while the remaining studies used accelerometers (note: some studies used more than one criterion standard) (see Tables 5-7). Spearman's rho was calculated to examine criterion validity. Bland-Altman plots were also used to test the agreement between the two sets of data.

Concurrent Validity Results
As noted earlier, the GPAQ data were compared to accelerometers, pedometers, a PA log, and/or a previously validated PA questionnaire (e.g., the IPAQ). Data collected by the ActiGraph GT3X (AG) were most often employed in the validation studies (see Table 5). Criterion validity was examined for the overall PA, the three PA domains, and SB, respectively. The concurrent validity for work-related PA (r: −0.03-0.50), transport PA (r: 0.04-0.49), and leisure PA (r: 0.02-0.41) was poor to fair, as was the SB concurrent validity (r: 0.07-0.47). In addition, the concurrent validity of moderate-to-vigorous PA (MVPA) using accelerometers was the most often investigated, followed by moderate PA (MPA) and vigorous PA (VPA). Specifically, MVPA concurrent validity was −0.01-0.69 using accelerometers and −0.01-0.54 using pedometers, a PA log, and the IPAQ, respectively. In essence, the results of the criterion validity for various GPAQ measures ranged from poor to fair using Spearman's rho (see Tables 5-7).
Surprisingly, the concurrent validity results using pedometers and accelerometers were about the same (r < 0.5), even though accelerometers have been found to generate more accurate PA data than pedometers [45]. Moreover, pedometers only measure steps without PA intensity data; instead, total daily steps were used to indirectly represent PA of different intensities. For instance, Sitthipornvorakul and colleagues [32] classified pedometer data (steps/day) as inactive (<5000), moderately active (5000-9999), and highly active (≥10,000). This is how pedometer data were used to quantify PA intensity (see Table 6). The PA domain-specific concurrent validity of the GPAQ was also poor to fair when compared to measurements from accelerometers, pedometers, or a PA log. Only one study, using the IPAQ, found very good concurrent validity of the GPAQ in India [36].
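The step-count bands used by Sitthipornvorakul and colleagues translate directly into a classification rule; a minimal sketch (the function name is ours):

```python
def classify_steps(steps_per_day):
    """Activity level from average daily pedometer steps.

    Bands follow Sitthipornvorakul et al. [32]: inactive (<5000),
    moderately active (5000-9999), highly active (>=10,000).
    """
    if steps_per_day < 5000:
        return "inactive"
    if steps_per_day < 10000:
        return "moderately active"
    return "highly active"

print(classify_steps(4200))   # inactive
print(classify_steps(7600))   # moderately active
print(classify_steps(11300))  # highly active
```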
When the GPAQ was analyzed at the country level, each country had a slightly different result. For instance, using the same method of comparing GPAQ data with AG accelerometer data, some researchers reported higher correlations for minutes of MVPA in the US (r = 0.26) [42], China (r = 0.26-0.52) [11], and the UK (r = 0.48) [16] than that found by Bull and colleagues [1] in lower-income countries such as South Africa (r = −0.03). Moreover, Bland-Altman plots showed that participants tended to overestimate their MVPA and underestimate their SB using the GPAQ.

Thailand
Sitthipornvorakul et al. [32] The Yamax Digiwalker CW-700 pedometer was removed when the body was immersed in water; participants who had four instead of seven daily measurements were also included in the study; PA intensities were classified using pedometer steps; there is a lack of information on whether the pedometer data were collected during a typical week when GPAQ data were measured.

Bangladesh
Mumu et al. 2017 [34] Water-based activities were excluded, resulting in underestimates of PA by the accelerometer; participants who wore the accelerometer for ≥3 days were also included; there is a lack of information on whether the accelerometer data were collected during a typical week when GPAQ data were measured.
Chile Aguilar-Farias et al. 2017 [19] The accelerometer data were not measured during a typical week when GPAQ data were measured.

China
Hu et al. 2015 [11] Self-reported PA log data were used as the criterion-referenced standards for GPAQ data; there is a lack of information on whether the accelerometer data were collected during a typical week when GPAQ data were measured.

India
Misra et al. 2014 [36] Pedometers were used as the criterion-referenced standard.
Mathews et al. 2016 [45] Less than 100 participants were recruited (n = 47 women); total PA was not measured; there is a lack of information on whether the accelerometer data were collected during a typical week when GPAQ data were measured.

Korea
Lee et al. 2019 [46] There is a lack of information on whether the accelerometer data were collected during a typical week when GPAQ data were measured.

Malaysia
Lingesh et al. 2016 [9] Less than 100 participants were recruited (n = 43 females only); there is a lack of information on whether the accelerometer data were collected during a typical week when GPAQ data were measured.
Soo et al. 2015 [37] Pedometers were used as the criterion-referenced standard; average pedometer steps were compared to GPAQ min. data; there is a lack of information on whether the pedometer data were collected during a typical week when GPAQ data were measured.
Saudi Arabia Alkahtani, 2016 [38] Only 62 male participants were recruited; those who wore an accelerometer for ≥4 days were included; there is a lack of information on whether the accelerometer data were collected during a typical week when GPAQ data were measured.

Singapore
Chu et al. 2015 [10] There is a lack of information on whether the accelerometer data were collected during a typical week when GPAQ data were measured.
Chu et al. 2018 [7] Only 78 participants were involved in the study with 69.0% of females; there is a lack of information on whether the accelerometer data were collected during a typical week when GPAQ data were measured.

South Africa
Watson et al. 2017 [39] 95 pregnant women were recruited; there is a lack of information on whether the accelerometer data were collected during a typical week when GPAQ data were measured.

Spain
Ruiz-Casado et al. 2016 [40] There is a lack of information on whether the accelerometer data were collected during a typical week when GPAQ data were measured.

Switzerland
Wanner et al. 2017 [47] There is a lack of information on whether the accelerometer data were collected during a typical week when GPAQ data were measured.

UAE
Doyle et al. 2019 [48] Less than 100 participants were recruited (n = 93 for the reliability study, n = 43 for the concurrent validity study); there is a lack of information on whether the accelerometer data were collected during a typical week when GPAQ data were measured.

UK
Cleland et al. 2014 [16] There is a lack of information on whether the accelerometer data were collected during a typical week when GPAQ data were measured.

US
Gorzelitz et al. 2018 [41] There is a lack of information on whether the accelerometer data were collected during a typical week when GPAQ data were measured.
Herrmann et al. 2013 [42] Only 68 participants were included; there is a lack of information on whether the accelerometer data were collected during a typical week when GPAQ data were measured.
Hoos et al. 2012 [43] Less than 100 participants (n = 72) were included; there is a lack of information on whether the accelerometer data were collected during a typical week when GPAQ data were measured.
Metcalf et al. 2018 [20] There is a lack of information on whether the accelerometer data were collected during a typical week when GPAQ data were measured.

Test-Retest Reliability of GPAQ
Fourteen studies (i.e., 53.8%) tested the reliability of the GPAQ (see Table 8). The research design of these studies was similar, using a test-retest procedure to measure the consistency of the GPAQ, in line with what has been reported in the literature [4,5]. The sample size ranged from 16 to 940, and one study had a sample size smaller than 30. Sample size is worth noting because Hogg et al. [49] pointed out that small samples (i.e., n < 30) for PA measures tend to generate variability from which no conclusions can appropriately be drawn.
The test-retest reliability values varied from moderate to very good, and only one study reported poor reliability (r < 0.40). For the overall PA, the reliability was good to very good (r = 0.58-0.89). Similar findings were observed for the overall vigorous PA (VPA) and moderate PA (MPA). When the overall PA was converted into metabolic equivalent of task (MET) values, however, only moderate reliability was found. The reliability ranges for work-related PA were about the same for MPA (r = 0.41-0.99) and VPA (r = 0.59-0.92). For the transport and recreation PA domains, the reliability value ranges were also about the same. The time intervals for short-term reliability ranged from three days to three weeks, with a median of seven days. Only one study each had an interval of three days [36] and three weeks [43]. There were two long-term test-retest reliability studies, with r = 0.55 at a two-month interval [12] and r = 0.53 at a three-month interval [42].

Discussion
The importance of accurately monitoring PA levels on a regular basis in the general population cannot be overstated. The current study contributed to the knowledge base by providing an overview and synthesis of the reliability and validity of the GPAQ across different countries. Accurate PA surveillance ultimately has the potential to save billions of dollars in medical treatment costs caused by sedentary lifestyles and to increase quality of life [5,50]. Overall, the highlighted findings were: (a) Inconsistent reliability and validity among adults in free-living settings in different countries, and (b) the lack of revalidation in specific age groups as well as in other continents such as Africa, and North and South America. Each of these highlights is discussed in the following sections.

GPAQ Reliability and Validity in Adults in Free-Living Settings
It is critical to note the limitations of the GPAQ to help readers better understand the findings that emerged from this systematic review. First, the GPAQ relies solely on self-reported data, and some participants reported spending more than 8 h in work-related PA, which may not be accurate unless they worked overtime; however, it is almost impossible to verify the accuracy of the GPAQ data. Second, only MVPA lasting at least 10 min is measured by the GPAQ. It is therefore possible that the GPAQ underestimates the total time spent in MVPA and cannot estimate total PA. Third, work-, transport-, and leisure-related PA can only be accurate when participants can clearly differentiate among the three categories. Fourth, individuals who do not have a job cannot report work-related PA. This may be the reason that the GPAQ has only been validated in adult populations, limiting its ability to track or monitor PA among children and youth. Fifth, the GPAQ uses a typical week to estimate PA data. However, a typical week can be seasonal in many developing countries [9,12], yielding different PA data.

Sample Size
Although there is no single rule of thumb related to adequate sample size for questionnaire validation studies, scholars have recommended an acceptable ratio of survey items and participants to be 1:5 and preferably 1:10 [28]. Therefore, the minimum sample size should be 90-160 and 95-190 for the short and long versions of the GPAQ, respectively. However, about one-third of the revalidation studies (i.e., 8 out of 23) had a sample size smaller than 100 and one study had a sample size smaller than 30 (see Tables 5-7). This is a cause for concern as a smaller sample size may negatively affect the representativeness of the population [4,28].
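The ratio rule above reduces to a simple calculation: the minimum sample size is the number of survey items times the recommended participants-per-item ratio. A sketch of that arithmetic (the item counts passed in below are placeholders, not official GPAQ item counts):

```python
def min_sample_size(n_items, ratio=5):
    """Minimum participants under an items-to-participants ratio rule.

    ratio=5 reflects the acceptable 1:5 recommendation; ratio=10 the
    preferred 1:10 recommendation [28].
    """
    return n_items * ratio

# Placeholder item counts -- substitute the actual item count of the
# GPAQ version being validated.
for items in (16, 19):
    print(items, min_sample_size(items), min_sample_size(items, ratio=10))
```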

Concurrent Validity
To date, PA wearables are commonly used for testing the concurrent validity of self-reported PA questionnaires [4,5,51]. Relatively low concurrent validity was found when Pearson's r and/or Spearman's rho were calculated [34,40] using accelerometers as the criterion standard. This was surprising given that accelerometers are typically found to be more accurate than pedometers [52]; interestingly, no large differences in concurrent validity results were revealed between studies using the two wearables (see Tables 5 and 6).
Such results may be explained by the following reasons. First, they might be caused by the differences in measurement methods between the GPAQ and the criterion-related measures. The GPAQ asks participants about their SB and their PA lasting at least 10 min of MVPA, excluding light PA, in a typical week. Accelerometers and pedometers, on the other hand, were used to measure PA for whatever duration the researchers chose in the selected studies. Hence, the week spent wearing accelerometers/pedometers may not have been a typical week, because no studies noted that a typical PA week was chosen for participants to wear an accelerometer and/or pedometer or to use a PA log. This means that the GPAQ data were not validated against the same measures obtained from accelerometers/pedometers. Second, the inconsistency in PA measurements between the two methods may have resulted in errors in measuring PA. Accelerometers may underestimate upper body movements and movements involving few vertical motions, such as cycling, as the accelerometers were usually worn on the dominant hip of the participants [51]. Furthermore, studies have found that pedometers may not accurately record steps for people with abnormal gait patterns and people who are obese [53]. Pedometers also may not accurately capture activities in which the lower body is stationary (e.g., pushing, lifting, and carrying). In fact, low concurrent validity using accelerometers and pedometers as criteria has been found consistently in the literature, resulting from the inherent limitations of PA wearables serving as a 'gold standard' for self-report questionnaires [15]. Future validation studies should include only bouts of at least 10 min of MVPA from accelerometer data to make them more comparable to the GPAQ, which measures activities lasting at least 10 min.
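The suggested restriction to ≥10-min bouts can be sketched on minute-level accelerometer epochs. This is a simplified illustration (the function name is ours, and real bout definitions, e.g., those tolerating brief interruptions below the intensity threshold, are more lenient than this strict-run version):

```python
def mvpa_bout_minutes(is_mvpa, min_bout=10):
    """Sum minutes occurring in unbroken MVPA runs of at least min_bout minutes.

    is_mvpa is a minute-by-minute sequence of booleans (True = epoch at or
    above the MVPA intensity threshold).
    """
    total = 0
    run = 0
    for active in list(is_mvpa) + [False]:  # sentinel flushes the last run
        if active:
            run += 1
        else:
            if run >= min_bout:
                total += run
            run = 0
    return total

# A 12-min bout, a 5-min bout (discarded), then a 10-min bout
minutes = [True] * 12 + [False] * 3 + [True] * 5 + [False] * 2 + [True] * 10
print(mvpa_bout_minutes(minutes))  # 22
```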
There is a concern with using pedometer data as the criterion-related standard for testing the GPAQ's concurrent validity in the reported studies (see Table 6). Pedometers measure steps without intensity unless steps are recorded at a specific intensity, such as MPA steps, VPA steps, or MVPA steps. Moreover, unlike the GPAQ, pedometers capture all steps. As such, steps per day recorded by pedometers are not comparable to the MVPA time measured by the GPAQ. It is also important to point out that a previously validated self-reported questionnaire such as the IPAQ, when used as the comparison standard for concurrent validity [9,37,43], is not a true gold standard. Many studies have shown significant differences between data collected from the IPAQ versus accelerometers [51]. In addition, the reason for developing and validating the GPAQ was that the IPAQ short form (IPAQ-SF) does not measure occupational or transport-related PA, which is the dominant form of PA in many developing countries, while the long form of the IPAQ was deemed too long and too complex for studies with a large sample size [2,6]. Therefore, it is not surprising that low concurrent validity was found when the IPAQ was used as the so-called gold standard in developing countries. On the other hand, the use of PA logs as a criterion standard for concurrent validity is also a cause for concern, as GPAQ and PA log data are not comparable if a typical week of PA is not measured. Unfortunately, none of the selected validation studies noted that the criterion-standard PA data were measured during a typical PA week, as specified in the GPAQ (see Table 4).

Reliability
It is encouraging that the range of reliability was found to be good to very good, with only one study showing poor reliability (r < 0.30). Of greater importance, the time intervals for the test-retest varied from three days to three weeks for short-term reliability, or repeatability, which is within the recommended range [28]. It is alarming, however, that the two studies on the long-term repeatability/reliability (i.e., two or three months apart) of the GPAQ [42,43] ignored actual PA change between test and retest, without ensuring that participants' PA patterns remained the same within the two- or three-month time span. This methodological flaw in the research design makes the findings dubious [4].

Revalidation in Various Age Groups
It is surprising that, despite the wide age range (i.e., 15-79) in the selected studies, no research has specifically examined the reliability and validity of the GPAQ in elderly adults. Two reasons underscore the necessity of validating the GPAQ in elderly groups. First, age affects the accuracy of self-reported PA due to the complexity of estimating PA patterns consisting of intensity, time, frequency, and type [54]; elderly individuals may not be able to correctly remember their typical PA levels. Second, light/functional PA, which is not measured by the GPAQ, plays an important role in the elderly's overall health [54,55]. This means that the current domain-specific PA (i.e., work, transportation, and recreation) may not be suitable for the elderly, considering that they usually do not have a job and perform only limited functional PA.

Revalidations of GPAQ in Other Continents
Although it has been noted that the GPAQ has been used in many African countries [2], it is unclear why it has primarily been validated in Asia and Europe. This gap in knowledge warrants more attention from professionals in the field of PA measurement and assessment. More validation studies are needed in continents other than Asia and Europe to ensure that the GPAQ is truly a global PA scale.

Limitations
The present review has some limitations that should be acknowledged. Although multiple investigators thoroughly searched the four aforementioned databases more than once, it is still possible that not all relevant studies were identified using the search strategies described in the study. In addition, even though the quality of each study was assessed, the studies were not weighted or ranked; thus, findings from studies with poorer quality and smaller sample sizes were given no less importance than other findings. Caution therefore needs to be exercised when interpreting the results of the current project. Finally, factors such as education level, gender, season, and type of residential area (i.e., rural, suburban, and urban) have been found to affect the reliability and validity of the GPAQ; future research is needed on concurrent validity differences across these factors.

Conclusions
As a global instrument for measuring PA, the GPAQ has been translated into different languages and validated among adults in more than 20 countries, primarily in Asia and Europe. The GPAQ demonstrated good to very good test-retest reliability with time intervals ranging from three days to three weeks. Poor to fair concurrent validity was found when the GPAQ data were compared to the data measured by PA wearables (i.e., accelerometers and pedometers) and PA logs over seven days. Mixed findings were reported concerning the effects of educational level and sex on the reliability and validity of the GPAQ. Incomparable data were used to test the concurrent validity of the GPAQ against accelerometers, pedometers, and PA logs. As such, it is premature to draw any conclusions concerning the concurrent validity of the GPAQ, and great care must be taken when interpreting the existing findings. Future research should focus on validating the GPAQ with matching data measured by relatively objective tools such as accelerometers and/or pedometers. More studies are needed that use bouts of at least 10 min of MVPA from accelerometer data, to make them comparable to GPAQ data focusing on MVPA lasting at least 10 min.