Can Retracted Social Science Articles Be Distinguished from Non-Retracted Articles by Some of the Same Authors, Using Benford’s Law or Other Statistical Methods?

Abstract: Various ways to detect problems in small-sample social science surveys have been discussed by previous authors. Here, several new approaches for detecting anomalies in large samples are presented and their use illustrated through comparisons of seven retracted or corrected journal articles with a control group of eight articles published since 2000 by a similar group of authors on similar topics; all the articles involved samples from several hundred to many thousands of participants. Despite the small sample of articles (k = 15) and the resulting low statistical power, only 2 of 12 individual anomaly comparisons were not statistically significant, and large effect sizes (d > 0.80) were common for most of the anomaly comparisons. A six-item total anomaly scale featured a Cronbach alpha of 0.92, suggesting that the six anomalies were moderately correlated rather than isolated issues. The total anomaly scale differentiated the two groups of articles, with an effect size of 3.55 (p < 0.001); an anomaly severity scale derived from the same six items, with an alpha of 0.94, yielded an effect size of 3.52 (p < 0.001). Deviations from the predicted distribution of first digits in regression coefficients (Benford's Law) were associated with anomalies and with differences between the two groups of articles; the results were mixed in terms of statistical significance, though the effect sizes were large (d ≥ 0.90). The methodology was able to detect unusual anomalies in both retracted and non-retracted articles. In conclusion, the results provide several approaches that may be helpful for detecting questionable research practices, especially data or results fabrication, in social science, medical, or other scientific research.


Introduction
How may editors and their reviewers detect problems in submitted papers before those papers are accepted and later retracted for methodological problems? Solutions that would allow editors or reviewers to detect such problems may not be easy or obvious.
The number of articles retracted on account of scientific misconduct has increased in recent decades, even in medicine [1][2][3][4]; some academics have had dozens of their articles retracted [5][6][7][8][9][10][11], in spite of the grave consequences of scientific misconduct being exposed [12]. Sometimes the misconduct has involved apparent or possible fabrication of data, which is one of the most serious types of scientific misconduct, although relatively rare [13,14].
What are reviewers and journal editors to do? We would like to suggest several statistical methods for detecting data anomalies, which may reflect fabrication of data and/or results.
There have been few systematic studies of retracted papers, especially with respect to papers from authors with multiple retractions [15] (p. 277). Several anomalies had been noted by Pickett [16] in some of the articles of concern, published in top-tier journals from 2000 to 2020. The particular concerns expressed by Pickett [16] are as follows: (1) a high ratio of beta coefficients and standard errors that were identical across multiple models; (2) a high number of "hand-calculated" t-test values; (3) an absence of zeros in second or third decimal points; (4) binary values that were impossible; and (5) frequent omissions of important statistical information. Some editors responded to Pickett's concerns by retracting or correcting certain articles, with some of the problems acknowledged by those articles' authors. It has been argued that it would be very desirable to develop statistical measures to permit the identification of fabricated or manipulated data [17] (p. 193).
Therefore, our most general aim was to find new ways to detect potentially fraudulent research using statistical methods. More specifically, our primary objective was to test the following general hypothesis: do retracted articles differ from control articles in statistically significant ways? We tested three specific hypotheses:

Hypothesis 1. The retracted group of articles will differ from the control group of articles in terms of six anomalies, measured as percentages and by ordinal breakdowns of those percentages, including two scales derived from two different sums of the six anomalies.

Hypothesis 2. Comparison of the two groups of articles using expected values of first digits of regression coefficients (Benford's Law) will yield larger deviations from expected values for the retracted group of articles than for the control group of articles.

Hypothesis 3. Measures of deviations from Benford's Law will be correlated significantly with the two scales derived from the six ratings of the anomalies.

Sample
A sample of articles was developed cumulatively. Seven articles were among those retracted or corrected, as reported by Pickett [16]; six of these are also available in the Retraction Watch database [https://retractionwatch.com (accessed on 1 February 2023)]. The corrected article was by Mears, Stewart, Warren, and Simons [18]; the remaining six were retracted [19][20][21][22][23][24]. Thus, Pickett was the original instigator calling for the retraction of the articles [18][19][20][21][22][23][24], but journal editors made the final decisions on retractions or corrections. We selected a set of eight control articles, written by many of the same authors who wrote the retracted articles, including Gertz, Mears, Pickett, and Simons [25][26][27][28][29][30][31][32]. The control articles were selected by searching Google Scholar for articles related to criminology by co-authors of Dr. Stewart. The total sample came to 15 articles, listed in Table 1. The number of Google citations for each article as of 5 January 2023 was recorded via a search of each article in Google Scholar. The year of publication of each article was recorded from an inspection of the article and its Google Scholar citation. Whether the article reported support from a state or federal grant was determined through an inspection of each article and its credits for grant support. The total number of authors of each article was obtained from a count of the authors listed for each article. Each article was coded as either a control article (coded as 0) or a retracted/corrected article (coded as 1). Sample size was derived from author reports within each article, using the largest sample available if more than one sample was used. Total pages used was assessed through page counts. See Table 2.

Individual Measures of Anomalies
Several measures were created to identify possible anomalies in the 15 articles under consideration. These are reported as percentages in Table 3, but were analyzed in decimal form (i.e., 53.4% = 0.534).

Hand Calculation
Hand calculation was measured by dividing unstandardized regression coefficients by their standard errors [B/SE] and noting whether the reported t-value was replicated exactly to two or three decimals. Brown and Heathers [33] have provided more details on this issue of hand calculation.
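This check is easy to script. A minimal sketch in Python (the function names and table layout are our own illustration, not from the article): statistical software computes t from full-precision B and SE, so an exact match between the reported t and B/SE computed from the rounded table values, repeated across many rows, suggests hand calculation.

```python
def hand_calculated(b, se, t_reported, decimals=2):
    """True if the reported t-value exactly equals B/SE computed from the
    rounded table values, at the reported precision."""
    return round(b / se, decimals) == round(t_reported, decimals)

def hand_calc_rate(rows, decimals=2):
    """rows: (B, SE, reported t) triples; returns the percentage flagged."""
    flags = [hand_calculated(b, se, t, decimals) for b, se, t in rows]
    return 100.0 * sum(flags) / len(flags)
```

For example, B = 0.534 with SE = 0.123 gives B/SE = 4.34 at two decimals; a reported t of 4.34 would be flagged, while a reported 4.36 (plausibly computed by software from unrounded values) would not.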

Excess Identical Unstandardized Regression Coefficients (Betas) or Standard Errors
An excess of identical betas or SEs was determined by calculating how many adjacent identical pairs were possible and creating a ratio of identical pairs to all possible adjacent pairs. If a table had five models, each row could have four adjacent identical pairs, and so forth. If all SEs were identical across all rows and columns, the ratio would be 1.0. Other approaches that we did not use might have counted how many parameters were the same across a row of results even if not adjacent, or counted a match if the last parameter in a row matched the first parameter in that row.
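The adjacent-pair ratio can be sketched as follows (our own illustration; values are kept as printed strings so that, e.g., "0.120" and "0.12" are not silently conflated):

```python
def adjacent_identical_ratio(rows):
    """rows: each row lists one parameter's printed values (strings) across
    the models/columns of a table.  Returns the ratio of identical adjacent
    pairs to all possible adjacent pairs."""
    possible = identical = 0
    for row in rows:
        for left, right in zip(row, row[1:]):
            possible += 1
            if left == right:
                identical += 1
    return identical / possible if possible else 0.0
```

A row of five identical SEs across five models contributes four identical pairs out of four possible, yielding a ratio of 1.0 for that row.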

Shortage/Excess of Zeroes in Terminal Digits of Regression Coefficients or Standard Errors
A shortage (or excess) of zeros in terminal digits [34,35] was determined by counting all the listed data points (regression coefficients, standard errors) in the regression tables (not including data from correlation matrices, factor loadings, odds ratios, t-tests, other test statistics [e.g., Exp(b)], intercept values, and squared variance values) that had two or three decimal points, and counting how many ended in a digit of zero. The ratio was turned into a percentage. We did not assess terminal zeroes in tables of means and standard deviations. Pickett [16] reported that the retracted articles appeared to avoid zeroes as terminal digits; therefore, we focused on that issue rather than on unusually high or low frequencies of the digits 1 to 9, which could also suggest data problems [34]. Research in other situations might investigate shortfalls in all digits rather than just zero.
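A sketch of the terminal-zero count (our own illustration): values must be handled as printed strings, because as floats a value like 0.50 would lose its trailing zero. Roughly 10% of terminal digits should be zero if the final digit is effectively uniform.

```python
def terminal_zero_pct(printed_values):
    """printed_values: coefficients/SEs exactly as printed (strings).
    Considers only values with two or three decimals and returns the
    percentage whose terminal digit is zero (~10% expected by chance)."""
    usable = [v for v in printed_values
              if "." in v and len(v.split(".")[1]) in (2, 3)]
    zeros = sum(1 for v in usable if v.endswith("0"))
    return 100.0 * zeros / len(usable)
```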

Mathematically Incorrect Standard Deviations for Binary Variables
All binary variables were checked to see whether their standard deviations were computed correctly from the reported mean values. A ratio of incorrectly calculated standard deviations to all binary variables used was computed and turned into a percentage. In one or two articles, the authors reported binary mean scores but did not report standard deviations. Binary results were coded as "bad" if off by more than 0.02; incorrect results off by 0.02 or less were coded as "close". We created two variables: one from the percentage of "bad" results, and one from the total percentage of "bad" and "close" results.
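For a binary (0/1) variable, the sample standard deviation is fully determined by its mean: sqrt(p(1 − p) · n/(n − 1)). A sketch of the consistency check (the tolerance cutoffs mirror the article's "bad"/"close" distinction; the exact lower bound for "close" is our assumption):

```python
import math

def binary_sd_status(mean, reported_sd, n, bad_tol=0.02, close_tol=0.01):
    """Compare a reported binary-variable SD against the value implied by
    its mean: 'bad' if off by more than 0.02, 'close' if off by more than
    0.01 but not more than 0.02, 'ok' otherwise."""
    expected = math.sqrt(mean * (1 - mean) * n / (n - 1))
    err = abs(reported_sd - expected)
    if err > bad_tol:
        return "bad"
    if err > close_tol:
        return "close"
    return "ok"
```

For instance, a mean of 0.35 with a reported SD of 0.50 is inconsistent (the implied SD is about 0.477), whereas a mean of 0.47 with an SD of 0.50 is consistent.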

Benford's Law Deviations
Another method available for detecting fraudulent research has been discussed elsewhere in more detail [8,36,37]. Benford's Law indicates that the left-most digits in a genuine set of data will follow a pattern of declining percentages from 1 to 9: approximately 30.1%, 17.6%, 12.5%, 9.7%, 7.9%, 6.7%, 5.8%, 5.1%, and 4.6%, respectively. However, Benford's Law may be most useful for fraud detection when fraud is rampant, when the first three digits are considered rather than just the first digit [38], and when using unstandardized regression coefficients [39]. Results have been mixed with respect to using Benford's Law for detecting scientific fraud [17]. Benford's Law has been used to validate, as well as to raise suspicions about, published research [40]. Absolute values of differences between the observed proportions of initial digits in regression coefficients and the values expected under Benford's Law were summed for nine (DIFF9) and three (DIFF3) digits. For example, suppose an article featured 60 regression coefficients, of which 20, 10, and 6 featured left-hand digits of 1, 2, and 3, respectively, for percentages of 33.3, 16.7, and 10.0. Taking the absolute differences from Benford's Law in decimal form, computed to the fifth decimal point, would yield the sum of absolute values of [(0.33333 − 0.30103 = 0.03230) + (0.17609 − 0.16667 = 0.00942) + (0.12494 − 0.10000 = 0.02494)] = 0.06666. Thus, DIFF3 for that article would be 0.06666; we did not divide by three to average the differences. We initially applied Benford's Law to means, standard deviations, regression coefficients, and standard errors but found little relationship with other anomalies except in the case of regression coefficients (mostly unstandardized in the retracted and control group articles).
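The DIFF9 and DIFF3 measures can be sketched as follows (our own illustration; the Benford proportions are computed exactly as log10(1 + 1/d) rather than from rounded tables):

```python
import math
from collections import Counter

# Expected first-digit proportions under Benford's Law
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x):
    """Left-most significant digit of a nonzero number."""
    x = abs(x)
    while x < 1:
        x *= 10
    while x >= 10:
        x /= 10
    return int(x)

def benford_diffs(coefficients):
    """Sum of absolute deviations of observed first-digit proportions from
    Benford's Law over all nine digits (DIFF9) and digits 1-3 only (DIFF3)."""
    counts = Counter(first_digit(c) for c in coefficients if c != 0)
    n = sum(counts.values())
    diffs = {d: abs(counts[d] / n - BENFORD[d]) for d in range(1, 10)}
    return sum(diffs.values()), diffs[1] + diffs[2] + diffs[3]
```

Replicating the worked example above (60 coefficients with 20, 10, and 6 first digits of 1, 2, and 3) gives DIFF3 ≈ 0.0667, regardless of how the remaining 24 coefficients are distributed among digits 4 to 9.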

Creation of Ordinal Anomaly Scales
To expedite an overall analysis of the anomalies, percentage values were converted to ordinal measures, using the terms no issue (coded 0), avoided (1), slight (2), moderate (3), and major (4), as presented in Table 4 below.

Missing Data
The term avoided was used when a series of parameters could have been reported but was not. In some articles, binary variable means were reported but not their standard deviations. In other cases, beta coefficients were reported but not their standard errors. Because avoiding obvious statistics would be an issue in itself, we coded that situation as 1. In most cases, tables of results would present more than one column of data, each column representing a different model, allowing for comparison of regression coefficients and standard errors from one model in a table to another model in the same table. However, in one article [19], the two models were presented in separate tables and were therefore compared across tables.

Hand Calculation, Regression Coefficients, and Standard Errors
For the percentages associated with hand calculation, regression coefficients, and standard errors, ordinal items were created by coding the percentages as follows: 0.0 to 5.99% was coded as no issue, 6.00 to 29.99% as a slight issue, 30.0 to 65.99% as a moderate issue, and 66% or more as a major issue. For example, if 34% of the possible adjacent standard errors in an article were identical to three digits, the standard error variable would be coded as "moderate" for that article. More leeway was granted for the other variables because some situations would be more likely to occur naturally and only more extreme situations would be indicative of serious problems.
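As a sketch, the cutoffs above translate directly into an ordinal coding function (code 1, "avoided", is assigned separately when statistics were omitted rather than computed):

```python
def ordinal_code(pct):
    """Ordinal coding for the hand-calculation, identical-coefficient, and
    identical-SE percentages (0-100 scale).
    Codes: 0 = no issue, 2 = slight, 3 = moderate, 4 = major."""
    if pct < 6.0:
        return 0
    if pct < 30.0:
        return 2
    if pct < 66.0:
        return 3
    return 4
```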

Shortage/Excess of Zeroes
For the zeros variable, the recoding scheme was centered on the expected value of 10%: percentages less than 3% or more than 20% were coded as a major issue (major deviation from the expected value), and percentages from 3% to 4.99% or from 15% to 19.99% were coded as moderate deviations. Values between 5.00% and 6.99% or between 13% and 14.99% were coded as slight deviations. Values from 7% to 12.99% were coded as not an issue. The coding pattern was not symmetric because there were no values above 13.1%, and we wanted to make some distinctions among those less frequent values rather than coding them identically. For example, if an article contained 200 regression coefficients and their standard errors and only 2 of them ended in a digit of zero, the value of 1.0% for zeroes would be coded as "major".
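A sketch of the terminal-zero coding, centered on the expected 10% (cutoffs as described above):

```python
def zeros_ordinal_code(pct):
    """Ordinal coding of the terminal-zero percentage (0-100 scale)."""
    if 7.0 <= pct < 13.0:
        return 0                      # no issue
    if 5.0 <= pct < 7.0 or 13.0 <= pct < 15.0:
        return 2                      # slight
    if 3.0 <= pct < 5.0 or 15.0 <= pct < 20.0:
        return 3                      # moderate
    return 4                          # major (< 3% or >= 20%)
```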

Binary Variable Standard Deviations Relative to Their Means
For the "bad" binary variable, 0% was coded as no issue, 0.01% to 24.99% as a slight issue, 25.0% to 49.99% as a moderate issue, and 50% or more as a major issue. The coding was more sensitive because accurate computer calculations should seldom, if ever, produce a major error in the standard deviations of binary variables. For example, if ten binary variables were reported in an article and four of their standard deviations were in error by 0.05 units and one by 0.01 units, then the "bad" binary variable (i.e., errors > 0.02) would be coded as "moderate" while the total "bad and close" (i.e., errors ≥ 0.01) binary variable would be coded as "major".
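A sketch of the stricter binary-variable coding described above:

```python
def binary_ordinal_code(pct):
    """Ordinal coding of the percentage of binary variables with incorrect
    standard deviations (0-100 scale)."""
    if pct == 0.0:
        return 0      # no issue
    if pct < 25.0:
        return 2      # slight
    if pct < 50.0:
        return 3      # moderate
    return 4          # major
```

In the example above, four "bad" standard deviations out of ten (40%) code as moderate, while five "bad or close" out of ten (50%) code as major.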

Benford's Law Measurement
For each article, the percentage of first digits that were 1, 2, 3, 4, 5, 6, 7, 8, and 9 was calculated. Our first measure related to Benford's Law was those percentages averaged across all of the articles (k = 15), the retracted articles (k = 7), and the control group articles (k = 8). Those results were compared to the expectations of Benford's Law and the absolute values of the differences summed. Next, the absolute value of the difference between the expectation of Benford's Law and the result for each of the nine digits was calculated. For one measure, the sum of the absolute values of the differences across all nine digits was calculated (DIFF9); for a second measure, the sum of the absolute values was calculated for only digits 1, 2, and 3 (DIFF3).

Total Anomalies Scale
The total anomaly scale was computed by adding the ordinal scores for hand calculation, percentage of zeros, percentage of adjacent standard errors, percentage of adjacent betas, percentage of incorrect binary standard deviations (> 0.02), and percentage of incorrect binary standard deviations (≥ 0.01). Measurement characteristics for this scale are reported in Section 3.3.3.

Anomaly Severity Scale
An anomaly severity scale score was also developed by coding avoided or slight ratings as 0.25, moderate ratings as 0.50, and major ratings as 1.0, and summing these values across the six measures of anomalies. Measurement characteristics for this scale are reported in Section 3.3.4.
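The two scales differ only in how the six ordinal codes are aggregated; a sketch:

```python
# Ordinal codes: 0 = no issue, 1 = avoided, 2 = slight, 3 = moderate, 4 = major
SEVERITY_WEIGHTS = {0: 0.0, 1: 0.25, 2: 0.25, 3: 0.50, 4: 1.0}

def total_anomaly_score(ordinal_codes):
    """Total anomaly scale: simple sum of the six ordinal codes."""
    return sum(ordinal_codes)

def severity_score(ordinal_codes):
    """Anomaly severity scale: weighted sum of the same six codes."""
    return sum(SEVERITY_WEIGHTS[c] for c in ordinal_codes)
```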

Analyses
Pearson zero-order correlations were used to correlate the key variables, while t-tests were used to compare scores for the group of retracted articles versus those for the control group of articles. SPSS 28.0 was used for all statistical calculations, including the calculation of Cohen's d [41,42] to assess effect sizes, using the convention that 0.50 to 0.79 is a moderate effect size and 0.80 or greater is a large effect size. A repeated measures analysis with group (retracted vs. control) as a between-subjects variable and digit percentages over nine digits as a within-subjects factor was used to assess main effects and the group-by-digit interaction term. The SPSS SCALE/RELIABILITY program was used to calculate Cronbach's alpha, a measure of the internal consistency reliability of scales. A website [www.escal.site] was used to convert correlations to Cohen's d for the equivalent effect size of the correlations.
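The conversion from a correlation to an equivalent Cohen's d uses the standard formula d = 2r / sqrt(1 − r²) (for equal group sizes); a sketch:

```python
import math

def r_to_cohens_d(r):
    """Convert a correlation coefficient to an equivalent Cohen's d,
    assuming equal group sizes: d = 2r / sqrt(1 - r^2)."""
    return 2 * r / math.sqrt(1 - r ** 2)
```

For example, r = 0.5 corresponds to d ≈ 1.15, and r = 0.8 to d ≈ 2.67.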

Descriptive Data and Retraction Status
The authors, year of publication, sample size, grant funding, and Google citations as of 5 January 2023 are presented in Table 1.

Comparing Retracted and Control Articles' Data
Basic descriptive statistics for the variables in Table 1 are presented in Table 2. For the variables presented in Table 2, compared across the retracted and control group articles, there were no significant differences as a function of retracted status.

Anomaly Percentage Values
Table 3, presented earlier, classified each article under consideration in terms of percentage levels of each type of anomaly measured.

Anomaly Ordinal Values
Table 4 presents the results of classifying each article under consideration in terms of ordinal levels of each type of anomaly measured. The correlation between retracted status and the total anomaly scale score was r = 0.89 (p < 0.001), an indication of predictive validity.

Anomaly Severity Scale
The anomaly severity scale had a mean of 2.37 (SD = 2.26, range 0-6, median = 1.00). The Cronbach alpha for the severity scale was 0.94, and it would be 0.96 if the hand-calculated severity rating were excluded from the scale. Deleting any one of the other five items changed the value of the Cronbach alpha to between 0.92 and 0.94. The anomaly severity scores ranged between zero and 1.0 for the control group and between 1.25 and 6.00 for the retracted group of articles. The severity scale was also correlated at r = 0.88 (p < 0.001) with retracted status, an indication of predictive validity.

Retraction Status and Anomalies
Table 5 shows the percentages (expressed in decimals; 0 = 0%, 1 = 100%) for each of the six anomalies, as well as the ordinal recoding of the percentages, compared across the two groups. Results for the hypotheses are presented below. One-sided t-tests were used, given the a priori expectation that retracted articles would feature more problems than the control articles. When Levene's test for equality of variances was violated, separate variance estimates were used for the reported t-tests and degrees of freedom. Numbers were rounded up when the third or fourth digit was 5 or higher. Missing data are reflected in the degrees of freedom reported.

Hypothesis 1
The first hypothesis was that the retracted group of articles would differ from the control group of articles in terms of six anomalies, measured as percentages and by ordinal breakdowns of those percentages, including two scales derived from two different sums of the six anomalies.
Table 5 presents the results for hypothesis 1. The results for the anomalies in terms of decimal percentages were significant (p < 0.05) except for hand calculation (p < 0.09) using one-tailed tests; the other tests would remain significant using two-tailed tests. The effect sizes ranged between 0.86 and 6.30, above the "large" threshold [41,42]. Using nonparametric Mann-Whitney U tests to compare the percentage ratings, all results were significant (p < 0.05, two-tailed) except for hand calculation.
The results for the ordinal ratings of the anomalies were all significant (p < 0.05, one-tailed) except for hand calculation (p < 0.09). Effect sizes ranged between 0.86 and 5.10, above the "large" threshold [41,42]. Using Mann-Whitney U tests to compare the ratings, all results were significant (p < 0.05, two-tailed) except for hand calculation.
The t-test results for the two overall measures of anomalies were both significant (p < 0.001, one-tailed), with effect sizes between 3.52 and 3.55, far above Cohen's [41,42] "large" size. Using the nonparametric Mann-Whitney U test, both results were significant (p = 0.006, two-sided).
Even though our results for the hand-calculation anomaly were not significant, Appendix A illustrates the difference between computer-generated results and hand calculation; 20% of the computer-generated t-values differed from the hand-calculated values.
Thus, our results supported hypothesis 1 in terms of statistical significance and in terms of large effect sizes for all anomaly variables and scales except for hand calculation.

Hypothesis 2
The second hypothesis was that a comparison of the two groups of articles using expected values of first digits of regression coefficients (Benford's Law) would yield larger deviations from expected values for the retracted group of articles than for the control group of articles.
Deviations from Benford's Law were assessed as shown in Table 6. Although the effect sizes were in the "large" range (0.90 and 1.08), the t-test results were mixed (p = 0.053 and p = 0.029, one-tailed). A Mann-Whitney U test obtained a two-sided exact significance of 0.040 for DIFF3, while the result for DIFF9 was not significant. Thus, our results supported hypothesis 2 in terms of effect sizes but only partially in terms of significance levels. In terms of DIFF3, however, both the effect size and the significance level supported hypothesis 2.

Hypothesis 3
The third hypothesis was that measures of deviations from Benford's Law would be correlated significantly with the two scales derived from the six ratings of the anomalies.

Additional within Control Group Analysis
Visual inspection of the values for two of the control group articles [29,30] revealed relatively high values for beta and standard error anomalies. Results comparing those two articles' anomalies with the same anomalies for the other six control articles are presented in Appendix B. On four of the six t-tests, there were significant differences between the two subdivided control groups, with Cohen's d ranging from 1.41 to 5.97; had separate variance estimate t-tests been used for the other two, all six tests would have been significant. Comparing the two-article control group with the retracted article group led to five of six tests being significant (p < 0.05), with Cohen's d ranging from 0.41 to 2.65, while comparing the six-article control group with the retracted article group led to all of the comparisons being significant (p < 0.005), with Cohen's d ranging from 2.35 to 6.06.

Discriminant Analysis for Sensitivity and Specificity
We performed a discriminant analysis using the groups (retracted/control) and the six anomaly variables. Entering all six variables at once into the analysis, 100% of the retracted articles were predicted as retracted (sensitivity), while 100% of the control articles were predicted as controls (specificity). However, one article in each group ([23], retracted; [27], control) came close to being assigned to the other group; therefore, a more conservative approach would indicate sensitivity as low as 85.7% (6/7 retracted articles predicted as retracted) and specificity as low as 87.5% (7/8 control articles predicted as controls).

Discussion
Good data are hard to fake. Good data may have systematic patterns but will also have randomness; therefore, too much or too little consistency may signal that something is amiss. Among the seven retracted or corrected articles discussed here, most had outstanding reviews of the literature, convincing theory, reasonable and useful conclusions, and even more total pages than might be typical. However, high-quality narrative portions, even the theoretical portions and useful conclusions, do not guarantee valid data or valid statistical analysis.
Most of the results for our hypotheses were statistically significant. Our total measures of anomalies and of anomaly severity yielded significant differences as a function of retracted status, with substantial (d > 3.50) effect sizes. Results for violations of Benford's Law were mixed but promising for larger samples using unstandardized regression coefficients, especially for deviations of the left-most digits 1, 2, and 3 (DIFF3), for which correlations with both anomaly scales were significant (p < 0.05), as was rho (p < 0.05), though not r (p < 0.06), with respect to retracted status. This methodology appears capable of detecting unusual anomalies with high sensitivity and specificity in both retracted [19][20][21][22][23][24] and non-retracted articles [29,30], even though the articles were published by a variety of scholars.
When time is not adequate to permit detailed testing for anomalies, we would suggest some rules of thumb for editors and reviewers. For apparent cases of hand calculation of results, or adjacent identical regression coefficients or standard errors, we would suggest 50% or more as a threshold suggesting serious problems. For binary variables, if 50% or more are inaccurate or inconsistent (e.g., means of 0.35 and 0.47 both have standard deviations of 0.50), then we would suspect serious problems. In the case of second or third (presumably random) decimal points for regression coefficients or standard errors, we would question any situation in which any digit from 0 to 9 appeared in 2% or fewer of the cases. Benford's Law is more difficult to simplify, but we would suggest that if the percentage of left-most digits of "1" falls below 20% or above 40%, there should be further investigation. Such levels would be "red flags" to us; other levels might still raise questions, especially if several of these rules of thumb were violated at the same time within any one paper.
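These rules of thumb can be bundled into a quick screening function; a sketch under our own packaging and parameter names (the thresholds are the ones suggested above):

```python
def red_flags(hand_calc_pct, identical_pct, bad_binary_pct,
              min_terminal_digit_pct, benford_ones_pct):
    """Quick screen applying the rules of thumb.  All inputs are
    percentages on a 0-100 scale; min_terminal_digit_pct is the lowest
    frequency among terminal digits 0-9, and benford_ones_pct is the
    share of left-most digits equal to 1."""
    flags = []
    if hand_calc_pct >= 50:
        flags.append("apparent hand calculation")
    if identical_pct >= 50:
        flags.append("adjacent identical coefficients/SEs")
    if bad_binary_pct >= 50:
        flags.append("binary SDs inconsistent with means")
    if min_terminal_digit_pct <= 2:
        flags.append("terminal-digit shortfall")
    if not 20 <= benford_ones_pct <= 40:
        flags.append("first-digit '1' frequency out of range")
    return flags
```

An article triggering several of these flags at once would warrant closer inspection; a single flag might still arise naturally.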
Thus, our research provides scholars with several ways to detect anomalies that may reflect falsified or fabricated data or results, using either several detailed statistical approaches or several simpler rules of thumb for assessing the extent of unusual anomalies.

Table 1. Basic descriptive information for articles reviewed.

Table 2. Sample data summary characteristics for the 15 articles used in this study.

Table 3. Characteristics of anomalies.

Table 4. Summary of anomalies in ordinal measurement.

Table 5. Differences between control and retracted articles on anomaly variables.

Table 6. Using Benford's Law to compare retracted and control groups of articles.