Reliability Generalization Meta-Analysis of Internet Gaming Disorder Scale

The objective of this study was to carry out a reliability generalization meta-analysis of the Internet Gaming Disorder Scale (IGDS) to determine whether it presents adequate reliability values that justify its application in its four versions: original and abbreviated, each with dichotomous or ordinal response formats. A systematic review was conducted covering 26 studies that applied this instrument between 2015 and June 2022. For each version, a reliability generalization meta-analysis was performed with the random coefficients (RC) and varying coefficients (VC) models. Results showed reliability levels > 0.80 in the ordinal versions (IGDS9P and IGDS27P) and in the dichotomous 27-item version (IGDS27D), while the dichotomous 9-item version (IGDS9D) produced levels > 0.70. High heterogeneity was found in all versions (I² > 95%; Rb > 90%). The mean Cronbach's α values under the two models (RC-VC) did not differ significantly except for the IGDS9D version. It is concluded that, considering the dependence of the results on sample size, the abbreviated versions do not guarantee that reliability remains acceptable, and the dichotomous versions may provide limited but acceptable reliability. Finally, the IGDS27P version is recommended in contexts where high precision is required.


Literature Review
Since the inclusion of Internet gaming disorder (IGD) in the DSM-5 [1], a set of controversies and opportunities has arisen for mental health professionals regarding the evaluation and treatment of IGD [2,3]. This incorporation reinforced the development and adaptation of psychometric instruments that evaluate IGD, producing a set of proposals based on the DSM-5 diagnostic criteria [4].
According to the DSM-5, IGD is characterized by a pattern of persistent and recurrent gaming behavior that leads to clinically significant impairment or distress over 12 months, in which five or more symptoms are present, such as preoccupation with gaming, withdrawal symptoms when not gaming (sadness, anxiety, and irritability), tolerance, inability to reduce or stop gaming, giving up other pleasurable activities, continuing to game despite problems, misleading family or others about the amount of gaming time, using gaming to relieve negative moods, and risking or losing a job or relationship over gaming [1]. Although this disorder is labeled "online" due to its association with specific online games, it can also develop with offline games [1] (p. 796). It is also included in the ICD-11 as a "video game disorder" [5].
In a systematic review and meta-analysis of IGD prevalence in 17 countries [6], the worldwide prevalence was estimated to be 3.05%. These rates exceed the prevalence of problem gaming, and are similar to those of some substance use disorders and obsessive-compulsive disorder [6].
In a previous search, no examples of this type of meta-analysis were found for the IGDS. However, while this study was being carried out, another study was identified that included a reliability generalization meta-analysis of the IGDS [31]; there are substantial differences between that study and the present one in the included IGDS versions, the meta-analytic models employed, the sample size, and the assessment of heterogeneity. Therefore, this study includes the four main versions of the IGDS, as well as additional models and reliability and heterogeneity estimators, and discusses which of the four versions may be more reliable. A more comprehensive differentiation between the two studies is included in Appendix A.
Given that IGD usually begins and develops during adolescence, such a study could benefit clinical practice by providing evidence of the reliability of a diagnostic tool for IGD, as well as promoting the prevention and early treatment of IGD.

Goal Setting
To carry out this study, a main research question was posed: does the IGDS, in its four versions, present adequate reliability values to justify its application? Consequently, the main objective of this study is to carry out a reliability generalization meta-analysis to estimate the internal consistency of the IGDS and determine whether it presents adequate values that justify its application in all its versions. Likewise, it is intended to analyze whether the reliability indices are affected by the characteristics of the studies. It was hypothesized that the IGDS, in its four versions, would remain an instrument with good internal consistency reliability.

Information Sources
After not identifying a systematic review or meta-analysis similar to the one proposed, a systematic review was carried out following the PRISMA 2020 method [32] in the Web of Science (WoS, Main Collection), PsycInfo, Scopus, and Dialnet databases.

Eligibility Criteria
A protocol was registered in PROSPERO with the identification code CRD42022330840. For screening, the following inclusion criteria were proposed: (a) original empirical studies that apply the IGDS, (b) studies including the IGDS in the original or a translated language, (c) studies including the IGDS in the original or reduced version, (d) validations and adaptations of the IGDS, and (e) studies reporting the reliability of the instrument using Cronbach's α or another indicator. The search was not limited by type of sample, since the instrument has been adapted to different populations. Likewise, studies that met any of the following exclusion criteria were discarded: (a) not using the IGDS, and (b) book chapters or books.

Search Strategy
First, manuscripts using the IGDS instrument [9] were identified in three iterations. The search was carried out in July 2021 and updated in June 2022. This process was carried out by one of the authors and corroborated by another using the Covidence tool.

First Iteration
The first search was performed in the Web of Science (WoS Main Collection), PsycInfo, Scopus, and Dialnet databases. In all databases, the term "Internet Gaming Disorder Scale" was entered to retrieve all works that used this instrument. Given the generality of the search profile, the results were refined to include only works published between 2015 and June 2022, both inclusive, and written in English or Spanish.
The search was limited to the field of psychology. In the WoS database, the search was limited to the categories "social sciences", "psychology", "applied psychology", "clinical psychology", "developmental psychology", "experimental psychology", "multidisciplinary psychology", and "social psychology". In Scopus, the fields of study "psychology" and "social sciences" were selected. In Dialnet, the search was not limited by field of study. In PsycInfo, the field of study was likewise not limited, since it is a psychology database.

Second Iteration
The second search was manual, reviewing the references of the studies found in the first search.

Third Iteration
A second manual search was performed in Google Scholar to expand the sample.

Selection of Studies
Once duplicates were eliminated with the RefWorks bibliographic manager, the Covidence software was used to carry out a blind peer review by title and abstract, and then by full text, following the eligibility criteria. In longitudinal studies, or those that included more than one measurement of the same subjects, the first measurement was selected or, failing that, the first that reported a reliability coefficient. Studies using more than one sample, each with its own reliability value, were treated as independent samples.

Data Extraction
The reliability coefficients of the selected studies (α, McDonald's omega (ω), or test-retest) were manually extracted. Both reported (i.e., the study's own coefficient) and induced reliability values were considered. For this scale, the use of a single total score (and total α) is indicated, even though several specific dimensions exist, and this total was used in the correlations of the original study. Accordingly, no study reports the reliability of the nine dimensions of the scale separately.
Regarding the induced reliability, it was subdivided into three categories: omitted, vague, and precise [33]. It was considered omitted when no reliability coefficient was reported in the study, vague when "good" reliability was expressed by citing other previous studies, and precise when the exact value of a previous study was reported.
From the studies with reported reliability values, information was extracted regarding the year of publication, version of the IGDS, the language of the IGDS, country of application, application method, sample size, general characteristics of the sample, and classification of the gamers in the sample (include only players or players and non-players), mean age, mean IGDS scores and standard deviation, percentage of women in the sample, mean weekly gaming hours, data collection method, adjustment indices, use of test-retest, presence of statistical validation of the IGDS, and percentage IGD+ in the sample. Data extraction was performed with Covidence and Microsoft Excel.

Analysis
After obtaining the sample, a reliability generalization meta-analysis was performed for each version of the IGDS following the recommendations of the REGEMA checklist (Appendix B).

Description and Evaluation of α Coefficients
The adequacy of the reliability in each study, and of the mean α coefficient, was evaluated by comparing their 95% confidence intervals against a null value [34]. This null value was set at two values of the α coefficient: 0.70 and 0.80. These values were chosen because they are the usual minimum criteria for generically determining the appropriateness of a reliability coefficient [35].
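The comparison logic above can be sketched briefly. A minimal Python illustration (the study's own analyses were run in R; `alpha_ci` and `adequacy` are hypothetical helper names, and the interval is built with Bonett's log transformation, described in the Modeling section, with a normal quantile):

```python
import math

def alpha_ci(alpha, n, j):
    """95% CI for Cronbach's alpha via Bonett's log transformation:
    L = ln(1 - alpha), Var(L) = 2J / ((J - 1)(n - 2))."""
    z = 1.96  # normal quantile for a 95% interval
    L = math.log(1.0 - alpha)
    V = 2.0 * j / ((j - 1) * (n - 2))
    half = z * math.sqrt(V)
    # back-transform; the bounds swap because alpha = 1 - exp(L) decreases in L
    return 1.0 - math.exp(L + half), 1.0 - math.exp(L - half)

def adequacy(alpha, n, j, cutoff=0.70):
    """'adequate' if the whole CI lies above the null cutoff, 'inadequate'
    if it lies below, 'inconclusive' if the CI contains the cutoff."""
    lo, hi = alpha_ci(alpha, n, j)
    if lo > cutoff:
        return "adequate"
    if hi < cutoff:
        return "inadequate"
    return "inconclusive"
```

For example, α = 0.72 with n = 200 and 9 items yields a CI that straddles 0.70, so against that cutoff the result is inconclusive rather than inadequate.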

Modeling
Due to characteristics of the studies that could influence the conclusions (i.e., inter-study and intra-study heterogeneity, and the size of the selected studies), the modeling decision was oriented towards two approaches: the random coefficients (RC) model [36] and the varying coefficients (VC) model [37].
Random coefficients model. Several contextual factors (e.g., a large number of measures applied to an examinee, evaluation monitored by an examiner or unmonitored, etc.), individual variables (e.g., mood, motivation, effort, personality attributes, etc.), data quality issues (e.g., multivariate outliers, response trends, etc.), and interactions between them can produce variability in the estimation of the reliability of the obtained scores [38,39]. Thus, the first model used to meta-analyze the α coefficients of the IGDS was the random coefficients (RC) model [36]. The RC model has several underlying assumptions: first, the estimates vary from study to study, both because of real differences between studies and because of sampling variation. Second, the study sample is assumed to be a random selection from a normally distributed superpopulation of α coefficients. Third, the RC model is generally accepted as the recommended option for generalizing results to future studies not identical to those included, which is one of the preferred goals of research [40].
For the estimation of the between-study variance (τ²) [41], the restricted maximum likelihood (REML) method [42] was implemented, and the Knapp-Hartung modification [43] was applied to the statistical test of variability (Q) [44], τ², and the confidence intervals. This modification is based on the Student's t-distribution and tends to provide more robust estimates under various data conditions [45]. The measures of heterogeneity of the selected studies were: (a) the Q test [44], with the significance level set at 0.10 [46], and (b) estimators of the size of heterogeneity: I² [47], Rb [48], and CVb and CVw [48] as measures of between-study and intra-study variability, respectively.
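For reference, Q and I² can be computed directly from the (transformed) coefficients and their sampling variances. A simplified Python sketch (the study used metafor in R; function names here are illustrative):

```python
def q_statistic(effects, variances):
    """Cochran's Q: weighted sum of squared deviations of the effects
    from their fixed-effect (inverse-variance weighted) mean."""
    w = [1.0 / v for v in variances]
    mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    return sum(wi * (yi - mean) ** 2 for wi, yi in zip(w, effects))

def i_squared(effects, variances):
    """I^2: percentage of total variability attributable to between-study
    heterogeneity rather than sampling error (Higgins-Thompson form)."""
    q = q_statistic(effects, variances)
    df = len(effects) - 1
    return max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
```

With identical effects Q is zero and I² is 0%; widely separated effects with small variances drive I² toward 100%, the situation reported below for all IGDS versions.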
Following the recommendations of Sánchez-Meca et al. [52] and Romano et al. [53], under the RC model the α coefficients were transformed with the Bonett method [54], L_i = ln(1 − α_i), and their sampling variance [54] was obtained as V_i = 2J / [(J − 1)(n_i − 2)], where J is the number of items and n_i is the sample size of study i. The studies were weighted to obtain the mean α using the general approach of inverse-variance weights [55]. To assess the independence between study size (i.e., number of participants) and the α coefficient [37,56], both parameters were correlated. For IGDS9D, IGDS9P, IGDS27D, and IGDS27P, respectively, the following was obtained: r = 0.149 (p = 0.53), r = −0.213 (p = 0.78), r = 0.382 (p = 0.61), and r = −0.793 (p = 0.108). Given the lack of statistical significance, the weights could be applied with little apparent risk of bias in the estimation of the mean α for IGDS9D, IGDS9P, and IGDS27D, but the interpretation of the results requires caution for IGDS27P due to the size of the correlation.
Varying coefficients model. Although meta-analytic research usually uses the RC model, the analysis was also conducted with the varying coefficients (VC) model [37]. This model was chosen due to: (a) the unlikely fulfillment of the assumption of normality of the hypothetical population of α coefficients, (b) the actual absence of random selection of manuscripts, and (c) the small number of selected studies (i.e., fewer than 6 in IGDS9P, IGDS27D, and IGDS27P). These are conditions that make it difficult to justify the RC model, particularly when the identification of a well-defined population of studies is problematic [37]. Specifically, in our study, variability is observed in multiple characteristics of the samples. Firstly, there are differences in gamer conceptualization. On the one hand, some studies define a gamer according to a minimum gaming frequency (e.g., playing at least once a month [9,21,57–61]), while others only consider as gamers those who currently play [18–20,24,62–67]. On the other hand, some studies did not specify what they considered a gamer, since in most of these cases the diagnosis of IGD was not their main objective [15,20,68–74]. Differences are also observed in the characteristics of the samples, which sometimes include players and non-players and sometimes exclusively gamers. In some studies, gaming intensity (hours and days of play), gamer profiles (players for fun, amateurs, professionals, etc.), and video game genre (e.g., MMORPG) were considered. Methodological variability was also observed. Finally, both the size and the age groups of the study samples are variable, ranging from 204 to 2078 participants, and including adolescents, young adults, adults, and the general community. Consequently, heterogeneity is present in different combinations of the aforementioned variables.
The VC model is an appropriate approach when the number of meta-analyzed studies is small (<30) [52], when strong heterogeneity is present (as observed in the Results section), and when there was no randomized selection of studies [75]. In contrast to the RC model, the generalizability of the VC results is restricted to a population of studies similar to those selected [52]. VC does not assume compliance with the common assumptions of fixed-effects and random coefficients methods [76]. In the VC method, the log-transformation is applied to the mean α (ln[1 − α_mean]) [37] to stabilize the variance [54], and the studies are not weighted to obtain the meta-analytic mean.
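The unweighted VC logic can be contrasted with the weighted RC pooling in a short sketch. This is an approximation under stated assumptions: the CI uses a normal quantile and takes the variance of the unweighted mean as the average sampling variance divided by the number of studies, whereas vcmeta's exact expressions may differ (`vc_mean_alpha` is a hypothetical name):

```python
import math

def vc_mean_alpha(alphas, ns, j, z=1.96):
    """Unweighted varying-coefficients (VC) mean alpha with an approximate
    95% CI on Bonett's transformed scale."""
    m = len(alphas)
    L = [math.log(1.0 - a) for a in alphas]
    L_mean = sum(L) / m                      # studies are NOT weighted
    V = [2.0 * j / ((j - 1) * (n - 2)) for n in ns]
    se = math.sqrt(sum(V)) / m               # SE of the unweighted mean of L_i
    lo = 1.0 - math.exp(L_mean + z * se)
    hi = 1.0 - math.exp(L_mean - z * se)
    return 1.0 - math.exp(L_mean), (lo, hi)
```

The design choice is visible in one line: every study contributes equally to L_mean, so a very large study cannot dominate the estimate as it can under inverse-variance weighting.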

Sources of Heterogeneity
Due to the small number of studies analyzed (<25), the identification of sources of heterogeneity was explored in the IGDS version with the largest number of meta-analyzed studies (i.e., IGDS9D, n studies = 20) with K-means cluster analysis, within a cluster-covariate dependency analysis [77]. Accordingly, (a) strictly exploratory clusters were identified with the K-means procedure, and (b) these clusters were compared with the natural groupings existing in the studies. This comparison was made using the χ² independence test and the Cramér's V effect size estimator. Similarity found in this comparison would support the substantive interpretation of these new groupings and avoid capitalizing on chance. The descriptive variables were: the language of the scale (dichotomous classification: original vs. English), mode of application, age characteristics of the sample (adolescents, young people, adults, or general community), and the gamer condition of the sample (only gamers, or mixed samples with gamers and non-gamers) (Appendix C). Differences in mean α estimates for each group of studies were estimated using Bonett's method [37], based on the confidence interval of the differences (CI ∆diff).
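The two steps of this procedure, clustering the α coefficients and testing their association with a categorical study feature, can be sketched in a few lines of Python. This is an illustration only (the study used R's stats::kmeans and a χ²/Cramér's V comparison; the 1-D Lloyd's algorithm and min/max seeding here are simplifying assumptions):

```python
import math

def kmeans_1d(values, k=2, iters=50):
    """Tiny Lloyd's algorithm for 1-D data (e.g., alpha coefficients),
    seeded with the minimum and maximum values for k = 2."""
    centers = [min(values), max(values)][:k]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centers[c]))
                  for v in values]
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

def cramers_v(x, y):
    """Cramér's V between two categorical vectors, from the chi-square
    statistic of their contingency table."""
    xs, ys = sorted(set(x)), sorted(set(y))
    n = len(x)
    table = {(a, b): 0 for a in xs for b in ys}
    for a, b in zip(x, y):
        table[(a, b)] += 1
    chi2 = 0.0
    for a in xs:
        for b in ys:
            row = sum(table[(a, bb)] for bb in ys)   # row marginal
            col = sum(table[(aa, b)] for aa in xs)   # column marginal
            expected = row * col / n
            if expected > 0:
                chi2 += (table[(a, b)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (min(len(xs), len(ys)) - 1)))
```

A Cramér's V near 1 would indicate that the exploratory clusters reproduce an existing natural grouping (supporting substantive interpretation), while a value near 0 suggests the clusters reflect sampling variation.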
In the versions with a smaller number of meta-analyzed studies (i.e., IGDS9P, IGDS27D, and IGDS27P; in all cases n ≤ 5), subgroup identification was performed on a quantitative-qualitative basis, within a qualitative evidence synthesis (QES) framework [78]. This was conducted to identify distinguishing characteristics of the studies that could be associated with the variability of the mean α coefficient. The procedure was: (a) quantitative identification of homogeneous groups with K-means cluster analysis, (b) content exploration of the groups identified by the K-means analysis (i.e., qualitative examination of the characteristics of their studies based on the descriptive variables: instrument language, etc.), (c) assignment of the apparent quality that distinguishes these groups, and (d) evaluation of the reproducibility of the three previous steps (performed independently by one of the authors).

Outliers and Robust Estimation
As part of the heterogeneity assessment, outliers were detected for each study, and the mean α was robustly re-estimated excluding them. For each study, its impact on τ² was also estimated using the VRATIO and TAURATIO statistics [79]; the cut-off points (Qvratio and Qtauratio) used to identify the strength of the impact of each study (VRATIO > Qvratio; TAURATIO > Qtauratio) were established with 1000 bootstrap samples [79].
Regarding the software, the following R packages were used: metafor for RC modeling [80], vcmeta for VC modeling [81], boutliers for assessing the impact of each study on variability [79], dmetar for outlier detection and robust estimation [82], hetmeta for the alternative measures of heterogeneity (Rb, CVb, CVw) [48], and stats for the K-means cluster analysis [83].


Results of the Study Selection Process
The identification, screening, and selection process carried out according to PRISMA 2020 [32] is detailed in Figure 1. First, using the database tools, a total of 1095 articles were identified: 204 from Scopus, 196 from WoS, 691 from PsycInfo, and 4 from Dialnet. Additionally, 10 articles were identified in the second (n = 3) and third (n = 7) iterations. After removing duplicates (n = 237), the remaining 868 articles were screened against the eligibility criteria. In total, 38 articles were selected for full-text reading. Nine articles were excluded: for not administering the instrument (n = 1), for not specifying the IGDS version (n = 1), for administering a version whose answers are provided by third parties (n = 1), for being second measures from longitudinal studies (n = 2), and for using repeated samples (n = 4). Of the 29 remaining studies, 3 only included induced reliability values (omitted (n = 2) and precise (n = 1)). Only articles that reported reliability (n = 26) were included in the meta-analysis. These studies were divided according to the version of the IGDS administered.
As mentioned, the studies that presented several independent samples that reported the reliability values and the N of each one of them were coded as independent samples. The distribution was as follows: IGDS27P (n = 5), IGDS27D (n = 4), IGDS9P (n = 4), and IGDS9D (n = 20) (Appendix C).

Description and Evaluation of the Level
Results are presented in Table 1. Regarding IGDS9D, 60% (n = 12) of the studies produced an α coefficient > 0.70, while the rest were below this level (20%, n = 4) or had an inconclusive result (20%, n = 4) because the confidence interval of the difference included zero. Meanwhile, fewer than half of the studies produced coefficients > 0.80. Regarding IGDS9P and IGDS27P, all coefficients were > 0.80. Finally, IGDS27D coefficients were predominantly > 0.80.

Mean Reliability
The sample sizes of the studies ranged from 204 to 3938 (M = 941.1, Md = 780). Based on 20 studies (n total = 20) for IGDS9D, the mean reliability was higher than 0.70. In contrast, the mean point estimates for the remaining versions, IGDS9P (n total = 4), IGDS27D (n total = 4), and IGDS27P (n total = 5), were higher than 0.90. In the population variability range (95% CI), the coefficients varied between approximately 0.75 and 0.97, above 0.70. The prediction intervals for IGDS9D, IGDS9P, and IGDS27D placed the predicted reliability at low levels, whereas for IGDS27P the predicted mean coefficient remained above 0.90.
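For interpretation, a prediction interval of this kind is obtained on the transformed scale as the mean plus or minus a quantile times sqrt(τ² + SE²), then back-transformed. A hedged Python sketch with purely illustrative inputs (not the study's values; a normal quantile stands in for the t quantile used with Knapp-Hartung):

```python
import math

def prediction_interval(alpha_mean, tau2, se_mean, z=1.96):
    """Approximate 95% PI for alpha in a future study: on Bonett's
    transformed scale, L_mean +/- z * sqrt(tau^2 + SE^2), back-transformed."""
    L = math.log(1.0 - alpha_mean)
    half = z * math.sqrt(tau2 + se_mean ** 2)
    return 1.0 - math.exp(L + half), 1.0 - math.exp(L - half)
```

Because the PI adds the between-study variance τ² to the uncertainty of the mean, it is always at least as wide as the CI, which is why a version can show an acceptable mean α yet a low PI lower limit.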

Heterogeneity Estimation
The smallest part of the heterogeneity came from random variation (<0.10%), while a strong degree of between-study heterogeneity (I² > 95%; Rb > 90%) was predominant in the analysis of all IGDS versions. Between-study variability (CVb) was trivial in the IGDS27P studies and was similar in the rest of the IGDS versions (CVb between 0.189 and 0.206), while the variability from intra-study differences was strong in the IGDS9D compared to the rest (CVw between 0.383 and 0.473) (Table 2).

Exploratory Analysis
IGDS9D. In the cluster-covariate analysis [77], two exploratory solutions were identified in the IGDS9D (Table 3): one of three groups (cluster 1 = 6 studies, cluster 2 = 2 studies, cluster 3 = 12 studies), and another of two groups (cluster 1 = 12 studies, cluster 2 = 8 studies). The χ² independence test did not reject the null hypothesis of independence between these two solutions and the existing groupings in the data (language of the scale, mode of application, sample, and gamer condition; see Table 4). These results suggest that the association found (Cramér's V) may fall within sampling variation. Advancing within a qualitative evidence synthesis (QES) framework [78], regarding the three-group solution: in cluster 1, two articles share samples of young people (under 20 years of age) that include only gamers; in cluster 2, all studies agree on administration of the IGDS in European languages and online administration; and in cluster 3, no similarities are found. Between clusters, there are no clear differences in the variables studied.
Table 3. Exploratory analysis results.

IGDS9P. The minimum interpretable solution was two groups (see Table 3). Qualitative analysis of the similarity between this clustering and the study characteristics suggested that the cluster 1 studies were similar in including general population samples, with similar mean ages, and gamer-only samples (although this information is missing from the Lei et al. [19] report). The difference in the α coefficient between the cluster 1 studies (studies 1 and 2; α mean = 0.94) and cluster 2 (studies 9 and 18; α mean = 0.87) can be established as statistically significant, but also as small (∆α = 0.07; 95% CI ∆diff = 0.05, 0.08), because the range of the difference (95% CI ∆diff) was small and close to 0.0.
IGDS27D. The minimum interpretable solution was two unbalanced groups (see Table 3). Qualitative analysis of the similarity between this grouping and the characteristics of the studies suggested that the studies of cluster 1 include only gamers in their samples, compared to the study of cluster 2, which has a mixed sample. However, the difference in the α coefficient between the study identified as cluster 2 (study 4, α = 0.81) and the other three studies (α mean = 0.92) can be established as trivial (∆α = −0.11; 95% CI ∆diff = −0.13, −0.08).
IGDS27P. The minimum interpretable solution was two unbalanced groups (see Table 3). Qualitative analysis suggested that the only difference between the groups was the presence of both gamers and non-gamers in the sample of the second study. However, the difference in the α coefficient between the study identified as the only member of its group (study 2, α = 0.97) and the remaining studies (α mean = 0.95) can be established as trivial (∆α = −0.01; 95% CI ∆diff = −0.01, −0.00).

Discussion
The objective of this study is to perform a reliability generalization meta-analysis of the IGDS. For this purpose, the internal consistency values of the IGDS in different samples are analyzed, and possible causes of the observed variations are examined [84].
The appropriateness of the α coefficients of the retrieved studies depended on instrument length (number of items) and response scaling (i.e., dichotomous or polytomous); both structural characteristics had a clear influence, as all versions of the IGDS using ordinal scaling (IGDS9P and IGDS27P), as well as the long version with 27 dichotomous items (IGDS27D), almost entirely produced scores with reliabilities > 0.80. In contrast, the short version with dichotomous items (IGDS9D) produced levels below 0.80.
Two major issues can be drawn for elaboration in this discussion. Ordinal scales and a larger number of items can be taken into account when choosing the IGDS version; in terms of brevity, the IGDS9P may be the recommended version. In a first look at the influence that each study had on the overall estimation of the degree of heterogeneity (with VRATIO and TAURATIO) [79], almost all individual studies potentially produced non-negligible changes in the heterogeneity of the mean α of each version (Table 1). In the versions with few meta-analyzed studies (IGDS9P, IGDS27D, and IGDS27P), there was an apparent hypersensitivity and consequent Type I error: the studies detected as sources of statistically important impact subsequently showed trivial differences from the rest of the coefficients compared.
Regarding the α coefficient meta-analysis, the point estimates of the mean α produced by the VC and RC models were not substantially different, except for the IGDS9D version (α mean = 0.775; 95% CI = 0.74, 0.80). The remaining versions can achieve a level of score accuracy that is usually considered optimal for group description purposes and basic research [35,85].
In both the RC and VC models, the reliability of the IGDS27P was shown to be exceptionally high (lower CI limit > 0.93), while the most widely used version, IGDS9D, remained at an acceptable level (lower CI limit > 0.70), though not optimal compared to the rest of the IGDS versions. One implication of these results is that the IGDS27P may be the best option when highly reliable scores are needed, particularly when the context of IGDS use demands this level of precision, such as the classification of individuals or the differentiation of gamers vs. non-gamers. On the contrary, with both estimated models (i.e., RC and VC), IGDS9D showed acceptable levels of reliability, but these are likely to be unacceptable when high precision in the interpretation of its scores is required, as in clinical practice [25]. Due to the number of meta-analyzed studies (n studies = 20) for the IGDS9D version, this conclusion is relatively reliable and can be taken as a reference.
The prediction intervals (PI) did not set optimal minimum values for future studies using the IGDS9D, IGDS9P, and IGDS27D. Using the lower limit of the PI as a reference, the lowest estimates obtained (IGDS9D = 0.56, IGDS9P = 0.51, and IGDS27D = 0.48) showed that the scores may contain a high proportion of error variance, even at levels at which no instrument would be eligible for research or applied use (<0.60). On the other hand, the upper limit of the PI indicated that the predicted levels can exceed 0.85. The IGDS27P consistently achieved the best levels in this regard, with consistently very low error variance. Three implications follow from these results: first, to maintain high and optimal true variance in IGDS scores, the IGDS27P is the best option in the context of high intra-study variability; second, the abbreviated versions of the IGDS (IGDS9D and IGDS9P) do not guarantee that reliability remains within an acceptable range; and third, the dichotomous versions of the IGDS may yield a limited but minimally acceptable reliability. Therefore, the IGDS27P appears to be the recommended option for a wide range of applications, especially when high precision is required. As a further note, the confidence intervals generated by RC tended to be wider than the CIs generated by VC, a problem intrinsic to the RC model when the number of studies is small [37].
In the robust analysis, the re-estimation of the mean α could only be conducted for the IGDS9D, because no outliers were detected in the rest of the versions. The robust mean α obtained was slightly higher (α mean = 0.781) than the non-robust estimate (α mean = 0.775), and the 95% CI was very similar, so both estimates can be considered equivalent. This equivalence may be because the outliers were symmetrically distributed, producing little bias in the estimation [86]. In contrast, the PI was different in the robust analysis (95% PI: 0.695, 0.843), because the predicted level of the α coefficient indicates an acceptable level of reliability. An implication of this is that users can expect the internal consistency of the IGDS9D to be maintained at the level declared in previous paragraphs.
It should be mentioned that the outlier detection method used (i.e., Harrer et al. [82]) did not show apparent sensitivity in the studies analyzed for IGDS9P, IGDS27D, and IGDS27P, as no outlier studies were detected. This suggests that the heterogeneity detected could not be explained by the presence of studies with extreme α coefficients (i.e., outlier studies). Given the strong heterogeneity found, the likely reason for this lack of sensitivity of the method [82] is the small number of studies in each of these versions (n studies ≤ 5).
Usual (I²) and newer (Rb) indicators of heterogeneity converged in indicating that the amount of variability was strong (>90%), and that an important source appears to be intra-study variability (CVw) rather than inter-study variability (CVb). The exploration of heterogeneity between the studies gave no clear clue as to the sources of variability of the alpha coefficients. Qualitative inspection, appropriate when the number of meta-analyzed studies is small [78], suggested that only some studies were an apparent source of differentiation (i.e., the study by Evren et al. [19] for the IGDS9P, Zemestani et al. [67] for the IGDS27D, and Evren et al. [19] for the IGDS27P; Table 3), but the differences produced in the αmean coefficients when these studies were removed were of trivial size.
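As a reference for the heterogeneity figures discussed above, the classical I² statistic can be obtained directly from Cochran's Q and the number of studies k; the values below are hypothetical and only illustrate how a Q far above its degrees of freedom translates into I² > 90%.

```python
def i_squared(q, k):
    """I^2: percentage of total variability in effect sizes that reflects
    true heterogeneity rather than within-study sampling error."""
    df = k - 1
    if q <= df:
        return 0.0  # truncated at zero by convention
    return 100.0 * (q - df) / q

# Hypothetical values: Q = 400 across k = 20 studies
print(i_squared(400.0, 20))  # → 95.25
```

I² is a relative measure, so with many large samples even modest absolute differences between α coefficients can yield I² above 90%; this is one reason the text complements it with Rb, CVb, and CVw.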
The exploration of heterogeneity carried out suggests other possible and realistic explanations for the degree of heterogeneity found. These can be attributed to characteristics of the individual studies, which may be causally linked to the distance of each alpha coefficient from its αmean. Specifically, methodological or artifactual heterogeneity [87] may have played an important and non-negligible role. These methodological aspects concern the quality of the data and the calibration of the IGDS in each sample. Regarding the first, possible response biases were not explicitly incorporated into the analytical procedures of the selected studies, which requires a set of a priori decisions about their detection or treatment. Such biases have been termed random responding [88], content-independent responding [89], insufficient effort responding [90], or careless responding [91].
Empirical literature has shown the effects of such responses on a long chain of statistical estimates, such as spurious variability in responses [38,39], the internal structure of the instrument [39,92], and, in general, Type I and Type II error rates [93]. In the study of web-mediated behavioral addictions, this issue is no less critical than in other areas of research, and there is an explicit call to address it [94], all the more so because even a small prevalence of C/IE responses can produce non-ignorable changes in quantitative estimates [95] that are unrealistic for the measurement of the construct of interest [92].
Second, in the selected substantive studies, the internal structure of the IGDS was not explicitly verified; instead, the validity argument was supported by previous results. This is a problem of inducing the validity of the instrument (specifically, its structure or the interpretation of its scores) from background evidence without corroborating it with the available data [27,28,96]. Given the natural variability of samples and application conditions, it does not seem reasonable to expect the internal structure of an instrument to remain intact, even more so if the instrument contains several items that capitalize on sampling and methodological variability. These changes may be expressed as different magnitudes of intra-study or between-study factor loadings for the set of items, the presence of correlated residuals between items, or the emergence of a general method factor. This means that even obtaining a valid estimate of score reliability requires this corroboration [35,97], and, as a general rule, it should be addressed even in non-psychometric studies. As part of good reporting practices, it is proposed that the internal structure of the IGDS be checked and the psychometric fit obtained be reported.
Modern reliability estimation aims to compute coefficients that better represent the structure of the items of a measure, such as the coefficient ω [98] for congeneric unidimensional measures [99], that is, measures with variability in item factor loadings. This reliability measure was hardly ever calculated in the selected IGDS studies, and it is reasonable to conclude that the reliability estimates obtained with the alpha coefficient are a lower bound on the reliability that would be obtained with omega [98,99]. A practical implication is that the mean reliabilities of the IGDS versions may be higher than those obtained in the present study. How much higher? It is not possible to give an approximate or precise answer, due to the high prevalence of induced validity and the consequent lack of knowledge of the factor loadings. In practice, it may be advisable to report both coefficients, α and ω.
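The α ≤ ω relation mentioned above can be illustrated with a small sketch under a congeneric one-factor model (unit factor variance, uncorrelated errors). The loadings below are hypothetical, chosen only to show that when loadings are unequal, α computed from the model-implied covariances falls below ω.

```python
def omega_total(loadings, error_vars):
    """McDonald's omega for a one-factor congeneric model."""
    s = sum(loadings)
    return s * s / (s * s + sum(error_vars))

def alpha_from_model(loadings, error_vars):
    """Cronbach's alpha computed from the model-implied covariance matrix."""
    k = len(loadings)
    item_vars = sum(l * l + e for l, e in zip(loadings, error_vars))
    total_var = sum(loadings) ** 2 + sum(error_vars)  # implied total-score variance
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical unequal loadings, standardized items (variance 1):
lam = [0.8, 0.7, 0.6, 0.5]
err = [1 - l * l for l in lam]
print(round(alpha_from_model(lam, err), 3), round(omega_total(lam, err), 3))  # → 0.742 0.749
```

The gap here is small, but it grows as the loadings become more unequal; with equal loadings (tau-equivalence) the two coefficients coincide, which is why reporting both is informative.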
Finally, in comparison with a previous meta-analysis analyzing the reliability of various instruments [31], the two polytomous versions of the IGDS (IGDS27P and IGDS9P) show higher reliability than that found for other similar instruments. This suggests that these versions may be more suitable for the assessment of IGD.

Limitations
Firstly, the present study considered the number of studies included (20 for the IGDS9D; between 4 and 5 for the IGDS9P, IGDS27D, and IGDS27P) to be the minimum reasonable evidence to quantitatively address the generalization of reliability. Although each version contributed a substantial number of participants (n participants > 2000), the accumulation of more studies may be required to reach more reliable conclusions about the αmean and the sources of heterogeneity. Secondly, the interpretation of the PI can be considered reliable for the IGDS9D, but conclusions for the IGDS9P, IGDS27D, and IGDS27P should be drawn with caution, due to the small number of studies included for these three versions. Thirdly, the conclusions about the αmean of each version analyzed must be contextualized by the degree of heterogeneity found, especially for the versions where the number of studies was less than six (i.e., IGDS9P, IGDS27D, and IGDS27P). Adding to this general situation, in our analyses the number of retrieved studies was small (particularly for the IGDS9P, IGDS27D, and IGDS27P versions), and a few studies cannot serve to reliably identify sources of heterogeneity [100]; for this reason, the heterogeneity analysis was conducted in an exploratory manner, combining quantitative and qualitative means to maximize the opportunity to recognize these sources. Fourthly, it may be necessary, within a general sensitivity framework, to implement new estimators [101] adapted to the study of reliability generalization, or to take advantage of the Bayesian approach [102] in the generalization of the reliability of the IGDS.

Practical Implications and Future Research
The continuous evolution of video game consumption, the increase in rates of problematic consumption, and the expansion of consumption beyond adolescence highlight the social relevance of studying IGD. Carrying out a reliability generalization meta-analysis of evaluation or diagnostic instruments guides professionals on which scales are more reliable for evaluating a given construct, as well as under what circumstances [25]. Given that psychological intervention in social settings covers all social sectors (children, adolescents, young adults, etc.), the availability of reliable instruments can be helpful in prevention, diagnosis, evaluation, and psychological intervention, as well as in the choice of therapeutic strategies. Therefore, given the onset and prevalence of IGD in adolescence, the results of this study have implications for clinical practice, clarifying the precision of the instrument for the early detection of, and intervention in, this disorder. All of this has a direct positive impact on the promotion of well-being in this life stage and beyond, as well as on prevention at earlier ages.
On a practical level, in light of the results obtained, the IGDS27P emerges as the most recommendable version for contexts where high reliability is required. However, although the IGDS9D is the most frequently used version, its polytomous counterpart (IGDS9P) is recommended when an abbreviated version is required.
For future research, it is proposed to continue studying IGD in countries with differing patterns of video game consumption, and to analyze variables that could positively or negatively influence the development of IGD. Moreover, once a sufficient number of new studies has accumulated, it would be important to repeat the reliability generalization meta-analysis of this instrument, or even to include more instruments measuring this construct, in order to corroborate the results found and to study the psychometric properties of these instruments.

Conclusions
The present work carried out a reliability generalization meta-analysis of the IGDS from 2015 to June 2022, considering its four versions: original and abbreviated, with dichotomous and polytomous responses. In the meta-analysis, it was observed that a large part of the sample of reviewed studies lacked some data and presented high heterogeneity, which made it difficult to correctly interpret the generalization of reliability. The results suggested that, to achieve highly reliable scores, it is advisable to use the IGDS with ordinal response, for both the 9-item and 27-item versions, the latter being the most appropriate when high precision is required. In contrast, the IGDS9D version had the lowest reliability, which could compromise the interpretation of its scores, and was therefore the least recommended. These conclusions depend, however, on the number of studies meta-analyzed for each version; considering that the nine-item dichotomous version was the one with the largest number of meta-analyzed studies, the conclusion regarding it may be the most generalizable. Regarding the exploration of heterogeneity, high variability was found, and it was not possible to classify the studies based on the characteristics of the recorded data (language of the scale, mode of application, age characteristics of the sample, and gamer status of the sample). For this reason, a qualitative analysis was carried out, which highlighted the use of samples composed only of gamers in the original and abbreviated polytomous versions and the original dichotomous version as a possible cause of variability, but this apparent source of differentiation was trivial. Regarding other sources of heterogeneity, the studies did not include information to identify methodological variability (e.g., response bias control, outliers, etc.), which could play an important role in the heterogeneity found.
Finally, the need to strengthen reliability reporting practices in primary studies is highlighted.

Acknowledgments:
The authors thank those who informally assisted with information processing and searches.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
The main difference between this study and that of Yoon et al. [31] is that Yoon et al. focused only on the nine-item dichotomous response version of the IGDS. In our study, the four original versions [9] are analyzed, including the 9-item and 27-item versions with polytomous and dichotomous responses.
The choice of this instrument by Yoon et al. [31] is based on the results of the systematic review by King et al. [103], in which only the 9-item dichotomous and the 27-item dichotomous and polytomous versions were included [103] (p. 4, Table 1). Therefore, not considering all four versions could constitute a search error, since our sample (collected in 2021 and updated in June 2022, consulting the Scopus, WoS, PsycInfo, and Dialnet databases) already contains articles that use the version not included when the review by King et al. [103] was conducted in these and other databases.
According to our systematic search, all four versions of the IGDS report adequate reliability values. Although the nine-item dichotomous version is the most widely used, some studies report inadequate reliability values for it, the only version in which this occurs.
It should be noted that in our study, gray literature was included due to the small sample size. This fact can be both a strength, because it allows greater generalization of results by having a larger sample, and a weakness if it implies a reduction in quality. In the study by Yoon et al. [31], the quality criterion used was that the journal of publication had peer review, and therefore, we reviewed our gray literature and concluded that it also met this criterion. All the articles that formed part of the gray literature were published in peer-reviewed journals, thus providing a larger sample and more generalizable results.
Regarding the meta-analytic model used, Yoon et al. [31] used the random effects model, while we used two models: the random coefficients model and the varying coefficients model. These two models were employed because the former is the generally accepted and preferred model in research, and the latter was appropriate given the characteristics of the sample (the unlikely fulfillment of the normality assumption for the hypothetical population of α coefficients, the absence of random selection of the manuscripts, and the small number of selected studies).
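As a sketch of why models for α coefficients often operate on a transformed scale, the following illustrates Bonett's ln(1 − α) transformation, commonly used when meta-analyzing alpha. The unweighted average shown is a simplification of the precision-weighted estimators actually used in published analyses, and the input coefficients are hypothetical.

```python
import math

def pooled_alpha_bonett(alphas):
    """Average Cronbach's alphas on Bonett's ln(1 - alpha) scale and
    back-transform (unweighted for simplicity; real analyses weight
    each study by the precision of its transformed coefficient)."""
    t = [math.log(1 - a) for a in alphas]
    mean_t = sum(t) / len(t)
    return 1 - math.exp(mean_t)

# Hypothetical alpha coefficients from three studies:
print(round(pooled_alpha_bonett([0.80, 0.85, 0.90]), 4))  # → 0.8558
```

The transformation maps the bounded α scale onto an unbounded one, making near-normal sampling distributions more plausible; this is one motivation for working with transformed coefficients when normality of the raw α population is doubtful.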
Regarding heterogeneity assessment, the study by Yoon et al. [31] reported tau (τ), tau-squared (τ²), and I², whereas this study reported these measures plus the Q-statistic test and the heterogeneity size estimators Rb, CVb, and CVw.
Regarding the sources of heterogeneity analyzed, Yoon et al. [31] focused on sample type, study location, and instruments for measuring video game addiction. In our study, we considered the language of the instrument, the mode of application, the type of sample, the total sample size, and the presence of gamers/non-gamers in the sample. We also included a cluster analysis and a qualitative analysis of the meaning of these clusters.
All of this reflects greater rigor in conducting the meta-analysis and in assessing heterogeneity, as well as in exploring all versions of the IGDS. Thus, this study offers greater depth in the meta-analytic study of reliability, with more focused generalizability. Table A1. Checklist for the corroboration of the meta-analytical report according to REGEMA.