The Relative Validity and Reproducibility of Food Frequency Questionnaires in the China Kadoorie Biobank Study

Background: Short versions of qualitative and quantitative food frequency questionnaires (FFQs) are widely used to assess usual food intake. However, few studies have evaluated their relative validity and reproducibility in the Chinese population. Methods: This study compared 12-day 24-h dietary recalls with qualitative and quantitative FFQs designed by the China Kadoorie Biobank (CKB) study to assess the relative validity. Two FFQs were administered in the second and third seasons and compared to evaluate the reproducibility. Statistical tests included Spearman correlation coefficients, weighted kappa, and cross-classification. Results: A total of 432 participants were eligible after stratifying by age, sex, and four regions. In the validation of the qualitative FFQ, adjusted Spearman coefficients were between 0.23 and 0.59, and weighted kappa coefficients ranged from 0.61 to 0.88, except for fresh vegetables. The percentage of correct classification was highest in fresh vegetables and lowest in fresh fruit, but the percentages of extreme classification were below 3.0%. Corresponding Spearman and kappa coefficients for the reproducibility were 0.17–0.56 and 0.62–0.90. Furthermore, the correct classification constituted between 35.6% and 93.3% of all participants. Regarding the relative validity of the quantitative FFQ, Spearman coefficients ranged from 0.14 to 0.69, except for dried vegetables and carbonated soft drinks. For items that more than two-thirds of participants consumed, weighted kappa coefficients were from 0.57 to 0.79, and correct classification percentages were between 34.6% and 67.5%. Spearman and kappa coefficients for the reproducibility of the quantitative FFQ were 0.15–0.71 and 0.60–0.86, respectively; correct classification percentages varied from 47.8% to 71.6%. Conclusion: Most food items from the qualitative FFQ showed acceptable or even good relative validity and reproducibility in the CKB study. Likewise, major food items in the quantitative FFQ were valid and reproducible, but the poor performance of dried vegetables and carbonated soft drinks indicated the need for modification and validation in future research.


Introduction
Diet is a pivotal modifiable risk factor in the progression of various chronic diseases. Dietary records, dietary recalls, and food frequency questionnaires (FFQs) are commonly used to assess dietary intake in population-based studies. The FFQ is the most time- and cost-effective way to assess long-term dietary intake and is widely administered in epidemiological studies [1]. FFQs are either qualitative or quantitative, depending on whether amounts are estimated. Several previous studies showed that estimating food weights explained only a limited percentage of between-person variation [2][3][4][5], while demanding trained staff and time. Although the food items in an FFQ should be as informative as possible, researchers have to compromise by reducing the number of items in light of research aims and respondent burden. Notably, less detailed food items could lead to rough definitions and thereby introduce bias into weight estimation [1]. Hence, studies should design an appropriate FFQ based on their purposes and resources. In addition, the validity and reproducibility of an FFQ, especially a short one, are crucial for future analyses of dietary information. Lacking a gold standard, most validation studies used multiple dietary records or recalls as the optimal reference and reported correlation coefficients between 0.4 and 0.6 for the quantitative FFQ and between 0.2 and 0.5 for the qualitative FFQ [4].
Long FFQs have been used to measure nutrient levels in the Chinese population, such as in the Chinese National Nutrition and Health Survey (149 food items) [6] and the Shanghai Women's and Men's Health Study (79 and 81 food items, respectively) [7,8]. However, large observational studies usually have limited resources to collect detailed dietary information and less need to measure macronutrient and micronutrient levels [9,10]. For example, the China Kadoorie Biobank (CKB), which enrolled around half a million adults aged 30-79 years in 10 sites, administered a 12-item qualitative FFQ at baseline and a 20-item quantitative FFQ in the second resurvey to describe the long-term intake of common food groups [11,12]. In this context, a short FFQ with good validity and reproducibility is more realistic and practical, but evidence about short FFQs in the Chinese population is scarce [7,8,13]. Thus, this study aims to assess the relative validity and reproducibility of the short qualitative and quantitative FFQs in the CKB study, which other Chinese studies can adopt in the future.
The short qualitative FFQ comprised 12 food items chosen according to recommendations from the Chinese Dietary Guidelines: rice, wheat products, other staple foods (millet, corn, etc.), meat, poultry, fish/seafood, eggs, fresh vegetables, fresh fruit, dairy products, preserved vegetables, and soya products. The five frequency options were never or rarely, monthly, 1-3 days/week, 4-6 days/week, and daily.
The quantitative FFQ retained the first nine food items in the qualitative FFQ and split the remaining three items into two or three subgroups (Supplementary Table S1). In addition, four new items were added: pure fruit/vegetable juice, dried vegetables, carbonated soft drinks, and other cold soft drinks. The frequency options remained the same as in the qualitative FFQ. Participants estimated the average amount with the aid of colour plates picturing the usual size and weight of food items.

Relative Validity and Reproducibility of FFQ
Supplementary Figure S1 illustrates the field survey flow. Multiple 24-h dietary records or dietary recalls are widely used as the "gold" standard to assess the relative validity [1]. Considering that dietary records depend on the education level and compliance of participants, the present study took multiple 24-h dietary recalls (24 h DRs) as the reference. To avoid the bias caused by the seasonal food supply, dietary information was collected on four consecutive days in each of three seasons (summer, winter, and spring or autumn). The four investigation days included three workdays and one weekend day. The interval between seasons was more than two months. Trained interviewers asked participants about all the foods they had consumed, and the corresponding amounts, during the past 24 h on each day. For food recipes recorded in China Food Composition (2004 and 2009 editions) [14,15], participants estimated the overall weight; otherwise, participants reported each ingredient and its weight, except for condiments.
In the reproducibility study, participants completed the first FFQ before 24 h DRs in the second season; in the third season, they answered the second FFQ after 24 h DRs. Colour plates from the second resurvey were provided as well.
This study was approved by the Institutional Review Board of Peking University Health Science Center. All participants gave their written consent before joining the study.

Study Population
Considering the geographical location (urban/rural, southern/northern), food availability and dietary diversity in each site, the present study chose 13 villages or administrative communities from 4 out of 10 CKB study sites, including 1 urban site (Qingdao) and 3 rural sites (Zhejiang, Sichuan and Henan), to represent the CKB population. Eligible participants satisfied three criteria: (1) joining the baseline survey and the first and second resurveys; (2) being aged less than 70 years by 31 December 2016; (3) completing all questionnaires and signing the informed consent form. When multiple individuals in one household met the criteria, one participant was randomly selected if they were of the same sex; otherwise, the male one was selected because there were fewer eligible male individuals. Among these candidates, the study randomly selected participants by sex and age group (<50, 50-59, ≥60 years). Individuals in either of two circumstances were excluded because it was difficult to perform the face-to-face interview: (1) unemployed and having more than half of lunches and suppers outside the home; (2) employed and having more than half of suppers outside the home.
To validate the FFQ, 200-300 individuals are recommended for 3-day 24 h DRs and 100-200 individuals for 14-28 days of 24 h DRs [1]. After consultation with nutritional epidemiologists, the present study set the sample size at 480, taking a 20% loss to follow-up rate into account. The field survey started in September 2015 and ended in August 2016. Finally, 432 participants qualified for the qualitative FFQ and 416 for the quantitative FFQ after exclusion of those with an average daily energy intake outside the 2nd-99th percentiles in the 24 h DRs.

Quality Control
After completing the field survey in each season, interviewers entered questionnaires into a predesigned website and coded ingredients or recipes according to China Food Composition tables [14,15]. Ten percent of all questionnaires were randomly selected with stratification by survey site and interviewer. Then, staff checked input errors and calculated percentages of missing, duplicate, and wrong items. If any percentage exceeded 1%, the corresponding interviewer re-examined all questionnaires he or she had completed. This process was repeated until these indicators were lower than 1%. Finally, independent nutritional epidemiologists reviewed food codes.
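The checking procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the study's actual software: the record fields (`site`, `interviewer`, `n_items`, `n_missing`, `n_duplicate`, `n_wrong`), the function names, the rule of drawing at least one questionnaire per stratum, and the use of checked items as the denominator of each percentage are all hypothetical choices that the text does not specify.

```python
import random
from collections import defaultdict


def draw_qc_sample(questionnaires, fraction=0.10, seed=0):
    """Randomly draw about `fraction` of questionnaires within each
    (site, interviewer) stratum. Taking at least one per stratum is an
    assumption of this sketch."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for q in questionnaires:
        strata[(q["site"], q["interviewer"])].append(q)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def interviewers_to_recheck(checked, threshold_pct=1.0):
    """Return interviewers for whom any error percentage (missing,
    duplicate, or wrong items) exceeds the threshold. Percentages are
    computed over checked items, which is an assumption."""
    totals = defaultdict(lambda: defaultdict(int))
    for record in checked:
        for field in ("n_items", "n_missing", "n_duplicate", "n_wrong"):
            totals[record["interviewer"]][field] += record[field]
    flagged = []
    for interviewer in sorted(totals):
        t = totals[interviewer]
        rates = [100.0 * t[f] / t["n_items"]
                 for f in ("n_missing", "n_duplicate", "n_wrong")]
        if any(rate > threshold_pct for rate in rates):
            flagged.append(interviewer)
    return flagged
```

In this sketch the two functions would be called in a loop: flagged interviewers re-examine their questionnaires, a new sample is drawn, and the loop ends once no interviewer is flagged.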

Statistical Analyses
In FFQs, we assigned a midpoint value to each frequency level (0, 0.5, 2, 5, and 7 days per week) and treated it as a continuous variable. The average daily amount was this value multiplied by the estimated amount and divided by seven. In 24 h DRs, consuming a food item on 0, 1, 2-6, 7-10, and 11-12 days corresponded to the five frequency options in FFQs, respectively. The continuous frequency level (days per week) was the number of days on which a participant consumed a specific food item multiplied by 7/12. The total weight of a particular item divided by 12 gave the average daily amount, which was then categorized into three groups by tertiles.
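As an illustration only, the conversions in this paragraph can be written as a few small Python functions. The function and variable names are hypothetical, and the reading of "estimated amount" as grams eaten on a day when the food is consumed is an assumption; the midpoints and the day ranges are taken from the text.

```python
# Midpoint days per week for the five FFQ frequency options (from the text).
FFQ_MIDPOINT_DAYS_PER_WEEK = {
    "never or rarely": 0.0,
    "monthly": 0.5,
    "1-3 days/week": 2.0,
    "4-6 days/week": 5.0,
    "daily": 7.0,
}


def ffq_daily_amount(frequency_option, amount_g):
    """Average daily amount from an FFQ answer: midpoint frequency times
    the estimated amount (assumed to be grams per eating day), over 7."""
    return FFQ_MIDPOINT_DAYS_PER_WEEK[frequency_option] * amount_g / 7.0


def recall_frequency_option(days_consumed):
    """Map days consumed out of 12 recall days to an FFQ frequency option
    (0, 1, 2-6, 7-10, 11-12 days)."""
    if not 0 <= days_consumed <= 12:
        raise ValueError("days_consumed must be between 0 and 12")
    if days_consumed == 0:
        return "never or rarely"
    if days_consumed == 1:
        return "monthly"
    if days_consumed <= 6:
        return "1-3 days/week"
    if days_consumed <= 10:
        return "4-6 days/week"
    return "daily"


def recall_days_per_week(days_consumed):
    """Continuous frequency level: days consumed times 7/12."""
    return days_consumed * 7.0 / 12.0


def recall_daily_amount(daily_weights_g):
    """Average daily amount: total weight over the 12 recall days / 12."""
    if len(daily_weights_g) != 12:
        raise ValueError("expected weights for 12 recall days")
    return sum(daily_weights_g) / 12.0
```

For example, an FFQ answer of "4-6 days/week" with an estimated 70 g gives 5 × 70 / 7 = 50 g per day, and a food eaten on 6 of the 12 recall days has a continuous frequency of 3.5 days per week.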
Percentages of frequency levels and median daily amounts were listed and compared between 24 h DRs and the two FFQs using Wilcoxon tests. Cross-classification (percentages classified into the same, adjacent and extreme groups) and weighted kappa statistics were used to test the agreement at the group level [16]. Performance was considered good if more than 50% of the respondents were correctly classified and less than 10% were grossly misclassified, and poor if the correct classification percentage was below 50% and the extreme classification percentage exceeded 10% [16,17]. The weight for kappa was defined as 1 if frequency levels were in the same group, 0.5 if they were in adjacent groups, and 0 if they were in extreme groups [17]. A kappa value ≥0.61 represents a good outcome, 0.20-0.60 an acceptable one, and <0.20 a poor one [16]. Because the data were skewed, age-, sex-, and region-adjusted Spearman coefficients were calculated to examine the strength and direction of the association at the individual level. The average daily energy intake derived from 24 h DRs was additionally adjusted for when evaluating the relative validity of the quantitative FFQ. Spearman coefficients greater than or equal to 0.50, between 0.20 and 0.49, and less than 0.20 indicate good, acceptable, and poor outcomes, respectively [16].
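A minimal Python sketch of the cross-classification and weighted kappa calculations is given below for the tertile case, where the weights of 1, 0.5, and 0 cover every possible pair of groups. It is not the study's code: the function names and the coding of tertiles as 0, 1, 2 are assumptions, and the sketch does not cover the five frequency levels, for which the text does not state the weights of pairs that are two or three levels apart.

```python
from collections import Counter

# Agreement weights from the text: same group 1, adjacent 0.5, extreme 0.
WEIGHTS = {0: 1.0, 1: 0.5, 2: 0.0}


def cross_classification(groups_a, groups_b):
    """Percentages of participants placed in the same, adjacent, and
    extreme (opposite) tertiles by two methods. Tertiles are coded 0, 1, 2."""
    if len(groups_a) != len(groups_b) or not groups_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(groups_a)
    gaps = Counter(abs(a - b) for a, b in zip(groups_a, groups_b))
    return {
        "same": 100.0 * gaps[0] / n,
        "adjacent": 100.0 * gaps[1] / n,
        "extreme": 100.0 * gaps[2] / n,
    }


def weighted_kappa(groups_a, groups_b):
    """Weighted kappa = (observed - expected) / (1 - expected), where both
    terms are weighted agreement and the expected term uses the product
    of the two marginal distributions."""
    if len(groups_a) != len(groups_b) or not groups_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(groups_a)
    observed = sum(WEIGHTS[abs(a - b)] for a, b in zip(groups_a, groups_b)) / n
    share_a = Counter(groups_a)
    share_b = Counter(groups_b)
    expected = sum(
        WEIGHTS[abs(i - j)] * (share_a[i] / n) * (share_b[j] / n)
        for i in range(3)
        for j in range(3)
    )
    if expected == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one group")
    return (observed - expected) / (1.0 - expected)
```

With these weights and three groups, the statistic is the linearly weighted form of Cohen's kappa.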

Results
A total of 432 participants completed all surveys. About 49.8% were men, 22.5% were urban residents, and the mean age was 55.0 years (standard deviation: 7.7 years) (Table 1). The median interval between seasons was 3.3 months (interquartile range: 3.0-4.7 months).
Twenty-four-hour DRs reported higher percentages of daily wheat consumption but lower percentages of daily meat, eggs, and fresh fruit consumption compared with the two qualitative FFQs (Supplementary Table S2). Daily wheat and fresh fruit intakes were more common in the first FFQ than in the second FFQ. In particular, more than 95% of participants consumed fresh vegetables every day. In 24 h DRs, foods from the qualitative FFQ contributed 88.8% of average daily energy intake and those from the quantitative FFQ accounted for 89.1% of average daily energy intake (Supplementary Table S3).
Comparisons between 24 h DRs and qualitative FFQs showed that 62.1% (preserved vegetables) to 99.6% (fresh vegetables) of participants were in the same or adjacent frequency levels (Table 2). In particular, 89.3% of respondents reported daily consumption of fresh vegetables in both methods. Percentages of extreme classification were all low, the highest being 2.2% (fresh fruit). Except for fresh vegetables, average weighted kappa coefficients ranged from 0.61 (meat) to 0.88 (rice), and Spearman coefficients were between 0.23 (other staple foods) and 0.59 (fish/seafood) after adjusting for age, sex, and region. Comparisons between each FFQ and 24 h DRs are listed in Supplementary Tables S4 and S5.
In the reproducibility study, individuals reporting the same frequency levels constituted about 35.6% (soya products) to 93.3% (fresh vegetables), and the percentage choosing extreme frequency levels was highest in dairy products (5.3%) (Table 3). Except for fresh vegetables, average weighted kappa coefficients ranged from 0.62 (poultry) to 0.90 (rice), and adjusted Spearman coefficients varied between 0.17 (soya products) and 0.56 (rice).
Table notes: FFQ: food frequency questionnaire. The weight for kappa was defined to be 1 if the frequency levels were in the same group, 0.5 if they were in adjacent groups, and 0 if they were in extreme groups. Spearman coefficients were adjusted for age, sex and region. * Coefficients were not significant (p > 0.05).

Relative Validity and Reproducibility of the Quantitative FFQ
Quantitative FFQs showed a higher intake of fresh and salted vegetables but a lower intake of wheat products, other staple foods, and soya products (excluding liquids) in comparison with 24 h DRs (Table 4). The median levels for most food items were similar in the two FFQs, except for eggs (15.7 g/d in the first FFQ vs. 31.4 g/d in the second FFQ).
Validity analyses showed that average Spearman coefficients ranged from 0.14 (fresh vegetables) to 0.69 (pickled vegetables) after adjustment for age, sex, region and daily energy intake, but those of dried vegetables (0.04) and carbonated soft drinks (0.05) were not significant (Table 5). For some food groups, cross-classification and weighted kappa statistics could not be calculated because more than two-thirds of respondents reported never or rare consumption in FFQs. For the remaining items, 34.6% (dried vegetables) to 67.5% (rice) of participants were correctly classified into the same tertile, while those who were grossly misclassified into opposite tertiles varied from 0.7% (wheat products) to 23.6% (salted vegetables). Weighted kappa coefficients for these food items ranged between 0.57 for fresh vegetables and 0.79 for rice. Comparisons of each FFQ with 24 h DRs are given in Supplementary Tables S6 and S7.
Adjusted Spearman correlation coefficients to assess the reproducibility ranged from 0.15 (other staple foods) to 0.71 (pickled vegetables), except for dried vegetables (0.06, p > 0.05) and carbonated soft drinks (0.04, p > 0.05) (Table 6). Participants in the same tertile accounted for about 47.8% (dried vegetables) to 71.6% (rice), and those in opposite tertiles constituted between 0.2% (rice) and 29.1% (salted vegetables). The weighted kappa was highest in salted vegetables (0.86) and lowest in fresh vegetables (0.60).
Table notes: Original groups refer to food items shared by the qualitative and quantitative FFQ. Split groups refer to food items in the qualitative FFQ but split into subgroups in the quantitative FFQ. Added groups refer to new food items in the quantitative FFQ. The weight for kappa was defined to be 1 if the frequency levels were in the same group, 0.5 if they were in adjacent groups, and 0 if they were in extreme groups. Spearman coefficients were adjusted for age, sex, and region. * Comparisons using the Wilcoxon test were significant (p < 0.05). ‡ No participants consumed pure fruit or vegetable juice in the 24 h DRs.
Table notes: The blank cell indicated the percentage of zero consumption exceeded 66.7%. * Coefficients were not significant (p > 0.05).

Discussion
This study compared repeated short qualitative and quantitative FFQs of CKB to assess the reproducibility and used 12-day 24-h dietary recalls as the reference method to evaluate the relative validity. Numerous studies have assessed the relative validity and reproducibility of FFQs and suggested good performance with a correlation coefficient greater than 0.5 and acceptable performance with a coefficient between 0.20 and 0.49 [16][17][18]. Good performance was also implied when the kappa statistic was greater than 0.60, or when the extreme classification percentage was below 10% and the correct classification percentage was above 50% [16]. In the present study, the qualitative FFQ showed acceptable or even good relative validity and reproducibility. In the quantitative FFQ, food items demonstrated acceptable validity and reproducibility except for dried vegetables, pure fruit/vegetable juice, carbonated soft drinks, and other soft drinks.
Rather than measuring the favourable effects of particular nutrients, the purpose of the CKB baseline survey was to describe characteristics of habitual consumption [19], investigate disease risks contributed by certain food items or the overall dietary pattern [20,21], and avoid confounding bias due to diet. The short food list with broad definitions posed great challenges to weight estimation. Therefore, the CKB study administered only a qualitative FFQ at baseline. Later, the second resurvey used a quantitative FFQ among a randomly selected subpopulation, aiming to estimate usual portion sizes for food groups at baseline [20,22].
The method to assess the validity and reproducibility in this study was in line with that of prior studies such as the Chinese National Nutrition and Health Survey, Shanghai Women's and Men's Health Study, European Prospective Investigation into Cancer and Nutrition, and UK Biobank [6][7][8]23,24]. The dietary record is usually recognized as the "gold standard" to evaluate the validity, but it is more applicable to respondents with high motivation and literacy. Hence, this study chose dietary recalls as the second-best method, as in previous studies [7,25,26]. To minimize recall bias, participants were encouraged to record foods and beverages according to the time of consumption. Participants were interviewed for 12 days (including working and weekend days) in three seasons to address the influence of day-to-day variation and seasonality as far as possible. When assessing the reproducibility, a longer interval between two FFQs could result in underestimation because of long-term variation [27,28], but a shorter interval might lead to overestimation since individuals tend to remember their last answers. The two FFQs were 3.3 months apart, which was in accordance with the recommendation for an FFQ collecting dietary habits over one year [1].
Nutrients 2022, 14, 794
The quantitative CKB FFQ showed good or acceptable validity and reliability for the nine food items shared by the qualitative and quantitative FFQs, except for fresh vegetables. The consumption level of fresh vegetables might still be influenced by diversity and accessibility across seasons, subsequently causing large variations in the amount. The merely acceptable performance of other staple foods resulted from the rough definition, which made it difficult for participants to estimate the average amount. The most probable explanation for the poor performance of dried vegetables was that the second resurvey did not clearly distinguish between wet and dried weight. The poor results for carbonated and other soft drinks were due to infrequent consumption in the target population. Spearman coefficients for other groups were acceptable, but researchers need to interpret the results carefully since more than two-thirds of respondents did not consume these foods in the present study.
In the qualitative FFQ, weighted kappa coefficients were greater than 0.60 and Spearman coefficients exceeded 0.2 in all food groups except for fresh vegetables. Although correct classification percentages were below 50% in most groups, a majority of respondents were classified into adjacent frequency groups, and misclassification percentages were still below 10%. This could result from the five frequency levels in the FFQ, in contrast to the three or four groups used in other studies when describing cross-classification [16]. Both the kappa and Spearman coefficients of fresh vegetables were not significant, but this was caused by the high prevalence of daily consumption (>90%) [29]. High percentages of correct classification (about 90%) and low percentages of extreme classification (<1%) still indicated good validity and reproducibility. However, because the frequency levels discriminate poorly for fresh vegetables, this item can contribute little variation in future studies. This indicates that food groups with high-frequency intake need more precise assessments in the Chinese population, such as daily frequency, amount, or type of vegetables.
The present study investigated multiple days of 24 h DRs, including weekdays and weekends in three seasons, to minimize within-person variation and seasonal influences and capture dietary habits throughout the year. We selected these four sites based on north-south and rural-urban dissimilarities, as well as their diet cultures, to represent the CKB population to a great extent. A large sample size also increased the power compared with other studies [7,23,24]. Yet, several limitations should be acknowledged. Firstly, the validity and reproducibility of FFQs are usually assessed before they are administered in the target population. The CKB study originally focused on the disease risk associated with a variety of environmental factors, such as smoking and alcohol consumption, with adjustment for covariates such as dietary behaviours. A detailed evaluation of FFQs was indeed neglected in the first place. Still, the present study found good or acceptable outcomes for the major food items. In addition, the CKB study periodically performed resurveys and offered an opportunity to upgrade the FFQ with a better discriminative ability or more comprehensive definitions for some items. Secondly, the great diversity within each food group impeded the calculation of nutrient levels and their associations with disease risks. Thirdly, respondents should be representative of the entire population. However, the CKB participants were geographically scattered, making stratified random sampling impractical [1]. This study balanced the feasibility of the field survey and representativeness as much as possible.

Conclusions
In summary, the present study evaluated the relative validity and reproducibility of qualitative and quantitative FFQs administered in the CKB baseline and resurveys and found that major food items performed well or acceptably. However, items such as dried vegetables and carbonated soft drinks performed poorly and are not suitable for use in further research without modification.
Figure S1. The study design to assess the relative validity and reproducibility of qualitative and quantitative FFQs in the China Kadoorie Biobank study.