Spearman’s Hypothesis Tested Comparing 47 Regions of Japan Using a Sample of 18 Million Children

Many groups differ in their mean intelligence score. Spearman’s hypothesis states that the differences are a function of cognitive complexity. There tend to be large differences on subtests of high cognitive complexity and small differences on subtests of low cognitive complexity. Spearman’s hypothesis has been supported by a large number of studies. Can Spearman’s hypothesis be generalized to regions of a country, where these regions differ in mean intelligence? We utilized data from 86 different cognitive tests from all 47 Japanese prefectures and correlated the g loadings of 86 subtests with standardized differences on the same subtests. Spearman’s hypothesis was clearly supported: the biggest differences between the regions were on the tests that were of the greatest complexity, meaning that Spearman’s hypothesis may be generalizable from groups to regions. In Japan, g loadings offer a better explanation of group differences in intelligence than cultural differences. Future research should explore whether Spearman’s hypothesis is also supported for differences between regions of other countries.


Introduction
Charles Spearman was the first to postulate that there exists a single general factor of human intelligence, something that reflects positive correlations among different cognitive tasks. Spearman coined the term "g factor" [1]. He hypothesized g as something like mental energy which enables various cognitive activities, such as memory, deduction, induction, grasping abstract relationships, rule inference, and finding similarities and dissimilarities. Following this lead, Jensen maintained that g should be treated as a distilled entity obtainable by factor analysis from all mental tasks because, he argued, it stems from individuals' neural and physiological substrates [2].
Some researchers have been doubtful of the existence of g, claiming that it is a statistical artifact, or that all intelligence tests are measuring a similar or single construct of various mental processes [3,4]. However, Jensen pointed out that g has been reported to correlate with widely different sets of variables, such as stature, head and brain size, reaction time, and myopia; EEG (electroencephalogram) measures, such as the frequency of alpha waves and the latency and amplitude of evoked brain potentials; and physiological measures, such as cerebral glucose metabolism, brain and peripheral nerve conduction velocity, and brain pH [2,5]. It would be hard to imagine that an artificial characteristic limited to intelligence tests and their construction could somehow be correlated with so many human variables.
Moreover, Jensen proposed a method to use the concept of g in order to examine the cause of group differences in mean intelligence test scores between Blacks and Whites in the United States. Although it had been known for a long time that there are large racial differences in intelligence test scores [6], many attempts have been made to explain these differences using cultural factors [7].
Psych 2019, 1 27
Instead, Jensen looked at a fundamentally different source, namely differences in g loadings between tests, reflecting differences in cognitive complexity between tests [8], and formulated what he called "Spearman's hypothesis" as follows: The varying magnitude of the mental difference between Black and White populations on a variety of mental tests is directly related to the size of test's loading on g, the general factor, common to all complex tests of mental ability.
Jensen devised the method of correlated vectors (MCV) to formally test this idea and supported it with previous reports on Black/White differences on a total of 171 diverse psychometric tests from 17 independent data sets totaling 45,000 Black and 245,000 White subjects [8,9], as summarized in Jensen [2]. Spearman's hypothesis was further supported using the Armed Forces Qualification Test [10]. The estimate of the correlation coefficient between the Black/White difference and g, based on various studies, is approximately r = 0.60.
Spearman's hypothesis has been extended from its original focus on Black/White comparisons to all kinds of comparisons between racial groups, and many empirical tests have been conducted, including various meta-analyses. In the US, the relationship has been ascertained for Chinese-Americans and Whites [11], Amerindians and Whites [12], and Jews and non-Jewish Whites [13]. The relationship has also been ascertained for Black/White differences in South Africa [14,15] and Zimbabwe [16], and for Dutch natives and non-Western immigrants [17-19]. There are some exceptions, such as the study of East Asians in the United States [20,21]. te Nijenhuis et al. suggest that when groups have a profile of being relatively better on performance subtests (lower g) than on verbal subtests (higher g), this generally leads to a lack of support for Spearman's hypothesis. In these cases, analyzing only the verbal subtests or only the performance subtests leads to a confirmation of Spearman's hypothesis in approximately half of the comparisons [22].
More specifically, a study by Helms-Lorenz, van de Vijver, and Poortinga claims that the intelligence test performance of non-Western immigrants is more strongly correlated with cultural factors than with the cognitive complexity of the tests [23]. However, te Nijenhuis and van der Flier argue that Helms-Lorenz et al. used an unbalanced collection of tests, leading to atypical outcomes [24]. A meta-analysis also showed that a highly balanced collection of tests leads to substantially stronger support for Spearman's hypothesis [15]. So, the general trend seems clear: a strong validation of Spearman's hypothesis.
Why have so many papers been written on Spearman's hypothesis? Jensen supplies the answer [2]: Why is Spearman's hypothesis so important? Because, if proven true, not only would it answer the question, at least in part, of why the magnitude of the W-B differences varies across different tests, but, of greater general importance, it would tell us that the main source of the W-B difference across various cognitive tests is essentially the same as the main source of differences between individuals within each racial group, namely g. This proposition would imply that a scientific understanding of the nature of the W-B difference in fact depends on understanding the nature of g.
There has been criticism of Jensen's test of Spearman's hypothesis [25-28], and detailed replies to these critics can be found in various articles [29-31], to which we refer the reader. Woodley et al. conclude that if one combines MCV with psychometric meta-analysis then one is able to correct for statistical artifacts. This process, they demonstrate, leads to reliable outcomes [32].
So, Spearman's hypothesis appears to be generalizable from Black/White differences to all kinds of other group differences, but the key question is: How strongly does Spearman's hypothesis generalize?
In attempting to answer this question all kinds of groups could be compared. However, in this study we will test the generalizability of Spearman's hypothesis using average IQ score differences between regions within a country. Even if, within a country, there do not exist such distinct gene pools as for Blacks and Whites in the United States, the populations of many countries consist of one or multiple genetic gradients. To date, a variety of studies have found average differences in IQ score between the regions of countries. Lynn [33,34] examined the well-known gap in development between the north and south of Italy, showed that there were substantially lower IQ scores in the south, and suggested a causal influence on regional differences in income and education. However, Lynn's position has also been criticized [35,36]. Kura reported mean IQ scores for the different regions of Japan and showed a large difference between the highest-scoring region and the lowest-scoring region [37]. Similarly, regional differences in IQ, often correlating in the expected direction with measures such as wealth and educational level, have been reported between, for example, the different regions of the UK [38,39], Spain [40], Germany [41], Turkey [42], and between northern and southern Egypt [43], though these studies did not test Spearman's hypothesis.
In this study we test whether Spearman's hypothesis can be generalized to differences between regions in Japan. When Spearman's hypothesis is not supported, this suggests that cultural factors play an important role in explaining differences in intelligence scores between regions. In fact, IQ differences that are caused by purely environmental differences do not support Spearman's hypothesis [44].
Using published achievement test results from all 47 Japanese prefectures taken between 2007 and 2018 as proxies of IQ tests, we estimated the g loadings of these tests. Tests' g loadings were correlated with the gaps between regions for these achievement tests to examine whether or not there was a clear relationship. It should be noted that, in Western countries, results of school achievement tests have been found to correlate with IQ at around 0.7 [45,46], so these are excellent proxies for IQ tests. We also note that Warne [21] used Advanced Placement academic achievement scores to test Spearman's hypothesis and reports support for it.

Sample
Samples consisted of all students in the sixth and ninth grades in Japan in the years 2007-2009, 2013-2015, and 2017-2018. In 2010 and 2012, a representative 30-40% of students were included. The total number of test takers was about 18 million; because many students took the tests on two occasions, specifically when they were 11 and 14 years old, we estimate that 12 million took the test at least once. The number in the four prefectures directly compared in this article (Akita, Fukui, Kochi, and Okinawa) was 0.7 million.

Tests
We used the National Achievement Tests conducted by the Japanese Ministry of Education, Culture, Sports, Science, and Technology [47]. The test has been uniformly administered nationwide to 11- and 14-year-old students from 2007 to 2018. It consists of four subtests: basic verbal, advanced verbal, basic mathematics, and advanced mathematics. In the official statement from the Ministry, the basic tests "Japanese A" and "Mathematics A" consist of "questions mainly concerning knowledge", whereas the advanced tests "Japanese B" and "Mathematics B" consist of "questions mainly concerning applications". A "Science" test was also conducted in addition to these four tests, but only in the years 2012, 2015, and 2018.
Tests were administered to all students who went to public schools at the beginning of their sixth and ninth academic years (April in the Japanese system). The tests in 2010 and 2012 were administered to a representative 30-40% of students. The average test scores for these 2 years were reported as 95% confidence intervals, and we treated the midpoints of these intervals as the average scores. Due to earthquakes, test administration was cancelled in 2011 and 2016. In summary, eight test scores (four subtests in each of two grades) were obtained for each of the years 2007-2010, 2013-2014, and 2017, and ten scores (the same subtests plus Science) were obtained for each of 2012, 2015, and 2018, producing 86 test scores in total for 47 prefectural populations.

Statistical Analyses
All statistical analyses were carried out with the statistical package R [48]. Principal axis factoring, using the maximum likelihood method, was employed for factor analyses.

Computing g Loadings
There were 86 test scores for 47 prefectural populations. We simply used a factor analysis to estimate g loadings as a principal component of these test scores.
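The idea of reading g loadings off a first component can be illustrated with a minimal sketch. It is not the authors' R procedure: it assumes that the loadings are taken from the first principal component of the correlation matrix between tests, and it runs on synthetic stand-in data (47 rows, 6 hypothetical tests) rather than the real prefectural scores. The function names and the power-iteration approach are illustrative choices.

```python
import math
import random

def correlation_matrix(scores):
    """Pearson correlations between the columns (tests) of a rows-by-columns table."""
    n_rows, n_cols = len(scores), len(scores[0])
    z = []
    for j in range(n_cols):
        col = [row[j] for row in scores]
        mean = sum(col) / n_rows
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / n_rows)
        z.append([(x - mean) / sd for x in col])
    return [[sum(a * b for a, b in zip(z[i], z[j])) / n_rows
             for j in range(n_cols)] for i in range(n_cols)]

def first_component_loadings(corr, iterations=500):
    """Loadings on the first principal component, found by power iteration."""
    k = len(corr)
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the leading eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(k)) for i in range(k))
    loadings = [math.sqrt(eigenvalue) * x for x in v]
    # The sign of an eigenvector is arbitrary; orient it so loadings are mostly positive.
    if sum(loadings) < 0:
        loadings = [-x for x in loadings]
    return loadings

# Synthetic stand-in data (assumption): 47 "prefectures", 6 hypothetical tests,
# each test being a weighted mix of one common factor and independent noise.
random.seed(0)
true_weights = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
scores = []
for _ in range(47):
    g = random.gauss(0, 1)
    scores.append([w * g + math.sqrt(1 - w * w) * random.gauss(0, 1)
                   for w in true_weights])

loadings = first_component_loadings(correlation_matrix(scores))
print([round(x, 2) for x in loadings])
```

With the real data, `scores` would be the 47 x 86 table of prefectural mean scores from the Supplementary Materials, yielding one loading per test.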

Computing ds
Akita and Okinawa showed the highest and the lowest g-factor scores, respectively, of all the 47 prefectures. For each of the 86 tests, we calculated the difference score between these two prefectures (Cohen's d) by subtracting the Okinawa score from the Akita score and dividing this difference by the SD. To obtain SDs of test scores for this mixed sample (i.e., students in Akita and Okinawa aggregated), the reported SDs of these samples were weighted by their respective number of test takers [3,5].
We also calculated the differences between Fukui and Kochi, which showed the second highest and the second lowest g-factor scores, respectively. Differences between Fukui and Kochi were not as pronounced in magnitude as those between Akita and Okinawa: the average effect size d was 0.55 SD for Akita and Okinawa, whereas it was 0.35 SD for Fukui and Kochi.
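The d computation described above can be sketched as follows. The text does not give the exact weighting formula, so this sketch assumes a simple average of the two reported SDs weighted by the number of test takers; weighting the variances instead would be an equally plausible reading. All numbers are hypothetical and serve only to show the arithmetic.

```python
def weighted_sd(sd_a, n_a, sd_b, n_b):
    """SD for the combined sample, taken here as the n-weighted mean of the two SDs.

    Assumption: the source says only that the reported SDs were weighted by
    the number of test takers; this is one straightforward reading of that.
    """
    return (sd_a * n_a + sd_b * n_b) / (n_a + n_b)

def cohens_d(mean_high, sd_high, n_high, mean_low, sd_low, n_low):
    """Standardized difference: higher-scoring minus lower-scoring prefecture."""
    return (mean_high - mean_low) / weighted_sd(sd_high, n_high, sd_low, n_low)

# Hypothetical values for one test (not taken from the real data).
d = cohens_d(mean_high=70.0, sd_high=20.0, n_high=1000,
             mean_low=60.0, sd_low=20.0, n_low=1500)
print(d)
```

Repeating this for each of the 86 tests gives the vector d that is later correlated with the g loadings.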

Average Distance D
To utilize the full information from 47 prefectural test scores, we calculated the average distance D between randomly selected pairs as follows. First, the 47 prefectures were sorted according to their g-factor scores. Second, all possible prefectural differences were calculated by subtracting lower-ranked prefectural scores from higher-ranked prefectural scores. This operation produced (46 + 45 + ... + 2 + 1) = 1081 prefectural differences. Third, these differences were averaged. Fourth, the average difference obtained via this method was divided by the standard deviation of the test score. This procedure yielded the average effective difference D of a specific test.
Since this measure averaged all possible differences among prefectures, this value should be regarded as the most general form of prefectural performance differences of a test.
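The four steps for computing D can be sketched in a few lines. This is a minimal illustration under the assumption that the scores for one test are already ordered by the prefectures' g-factor scores (highest first); the function name and the example values are hypothetical.

```python
from itertools import combinations

def average_distance(scores_by_g_rank, sd):
    """Average pairwise difference D for one test, in SD units.

    scores_by_g_rank: mean scores on this test, ordered from the prefecture
    with the highest g-factor score to the one with the lowest.
    sd: standard deviation of the test score.
    """
    # combinations() keeps list order, so each pair is (higher-ranked, lower-ranked).
    # Ranking is by g-factor score, not by this test, so a difference can be negative.
    diffs = [high - low for high, low in combinations(scores_by_g_rank, 2)]
    return (sum(diffs) / len(diffs)) / sd

# With 47 prefectures there are 47 * 46 / 2 pairs.
n_pairs = len(list(combinations(range(47), 2)))
print(n_pairs)

# Hypothetical example with four prefectures and an SD of 10.
D = average_distance([66.0, 63.0, 61.0, 60.0], sd=10.0)
print(D)
```

Because the sign of each difference follows the g-factor ranking, a test on which lower-ranked prefectures happen to score higher pulls D down rather than up.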

Testing Spearman's Hypothesis
We examined Spearman's hypothesis with the method of correlated vectors by computing the correlation between the vectors g and d for Akita and Okinawa and for Fukui and Kochi, and by computing the correlation between g and D for 86 achievement test scores. The Supplementary Materials for this study contain the test scores of the 47 prefectures and the nationwide standard deviations of these 86 tests.
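The method of correlated vectors reduces to correlating two vectors of equal length, one of g loadings and one of effect sizes. The sketch below computes both the Pearson and the Spearman rank correlation in plain Python; the four-element vectors are hypothetical placeholders for the 86-element vectors used in the study.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical vectors: one g loading and one effect size per test.
g_loadings = [0.60, 0.70, 0.80, 0.90]
effect_sizes = [0.30, 0.45, 0.40, 0.60]

r = pearson(g_loadings, effect_sizes)
rho = spearman(g_loadings, effect_sizes)
print(round(r, 3), round(rho, 3))
```

A positive r and rho indicate that tests with higher g loadings show larger differences, which is the pattern Spearman's hypothesis predicts.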

Results
Figure 1 shows that positive relationships exist between the g loadings of achievement tests and the performance differences d between Akita and Okinawa (orange circles with their regression line in red), and also between Fukui and Kochi (dark gray circles with their regression line in blue). The results were strongly affirmative, with the Pearson correlation coefficient r = 0.77 (p < 0.001) and the Spearman rank correlation ρ = 0.66 (p < 0.001) for Akita and Okinawa. For Fukui and Kochi, the g and d vectors showed similar correlations: r = 0.77 and ρ = 0.74 (Table 1). The difference between Akita and Okinawa is on average 0.55 SD, and extending the regression line shows that a perfectly g-saturated test (i.e., a test with a g loading of unity) would produce a 0.80 SD difference. This is equivalent to an IQ difference of 12 points, which is close to the estimated IQ difference between these two prefectures of 11 points [37]. Furthermore, not surprisingly, we found that the average g loading of the 43 tests for 14-year-olds (0.86) was higher than that of the tests for 11-year-olds (0.73).
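The extrapolation to a perfectly g-saturated test can be sketched as an ordinary least-squares line evaluated at a g loading of 1, followed by a conversion to IQ points. The sketch assumes the conventional IQ standard deviation of 15, which is what the reported 0.80 SD and 12 points imply; the data points are hypothetical and chosen to lie exactly on a line.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return intercept, slope

IQ_SD = 15  # assumption: conventional IQ scale with a standard deviation of 15

# Hypothetical g loadings and effect sizes lying on one line (not the real data).
g_loadings = [0.6, 0.7, 0.8, 0.9]
effect_sizes = [0.4, 0.5, 0.6, 0.7]

intercept, slope = fit_line(g_loadings, effect_sizes)
d_at_unity = intercept + slope * 1.0   # predicted d for a test with g loading 1
iq_gap = d_at_unity * IQ_SD            # the same gap expressed in IQ points
print(round(d_at_unity, 2), round(iq_gap, 1))
```

Applied to the reported figure, a difference of 0.80 SD at a g loading of unity corresponds to 0.80 x 15 = 12 IQ points.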
As for the average performance differences between prefectures, Figure 2 shows the same high correlation between the average distance D and g loadings of test scores (r = 0.75 and ρ = 0.64, p < 0.001) as was found in the previous analyses using only comparisons between two prefectures.

Discussion
Spearman's hypothesis states that Black/White differences on IQ test scores are a function of how well these tests measure g, instead of a manifestation of cultural differences between Blacks and Whites. The hypothesis has been extended to comparisons between other racial groups and is supported in the large majority of studies. In this study we have explored how far Spearman's hypothesis can be generalized, focusing on differences in mean IQ scores between regions of countries, in this case differences between the prefectures of Japan.
We used 86 subtests of the National Achievement Test from 2007 to 2018 as proxies of IQ test batteries for 47 prefectural populations. The effect size of the prefectural performance difference was highly correlated with the test's g loading at r = 0.75 for the average distance D. In other words, a test with a lower g loading tended to show a narrower gap between prefectures and a test with a higher g loading tended to show a larger gap between prefectures. We conclude that the Spearman-Jensen hypothesis is strongly supported with Japanese prefectural data.
In fact, there is a genetic gradient in Japan, which dates to about 2000 years ago [49]. The aboriginal Jomon people (who contribute around 30% genetically to the modern Japanese population) were gradually dominated by an influx of migrants, namely the Yayoi people, who came from regions of modern-day China and Korea. Around 70% of genes in the modern Japanese population were contributed by the Yayoi. However, the two populations merged to such an extent as to render the present culture highly homogeneous and, thus, a single ethnic group. These results indicate that Spearman's hypothesis also holds within this racial group, not just between racial groups. This means that differences between prefectures can be parsimoniously explained with g loadings and less parsimoniously with cultural differences between the prefectures.
Compared with the reported Black/White racial difference of 1 SD in the United States, even the largest performance difference between Japanese prefectures is much smaller. The average d between Akita and Okinawa for the 86 achievement tests was 0.55. Although the differences within Japan are clearly smaller than the Black/White differences in the US, Spearman's hypothesis is supported more strongly for comparisons between regions in Japan: the values of 0.64 and 0.75 are higher than the value of 0.62 reported for B/W differences [2]. One of the reasons for this stronger correlation could be the reliability of the data due to the sheer number of subjects involved. To calculate the g loadings of the subtests and d between the highest and lowest performing prefectures, the data were obtained from more than 18 million and 0.7 million students, respectively. This is an extremely large dataset, using the largest number of subtests to date, which yielded a high-quality test of Spearman's hypothesis.
Our results, however, should be considered in light of some limitations. The most important limitation is that the scores on the achievement tests were not reported at the individual level, but only at the prefectural level, which is not in line with Jensen's requirements for testing Spearman's hypothesis. Although we cannot assume that the structures of mental ability are distinct for Japanese subpopulations, questions relating to this problem were not examined in this study. However, it is possible that Spearman's hypothesis is such a powerful phenomenon that a positive correlation between g and d will occur whether using scores of individuals or scores of groups. Note that the default assumption should be that group differences are nothing more than aggregated individual differences, and an a priori assumption is that these results should generally apply to individuals (though probably with weaker factor loadings). Our findings of support for Spearman's hypothesis when comparing regions of a country may or may not generalize to differences in IQ scores between regions of other countries. Additional empirical research will shed light on this question.
In conclusion, the Japanese National Achievement Test shows clear support for Spearman's hypothesis. The more a test is g loaded, the larger the performance difference between higher- and lower-performing prefectures. It appears that Spearman's hypothesis may be generalizable from groups to regions; in Japan, g loadings again offer a better explanation of group differences in intelligence than cultural differences, as argued by Rushton and Jensen [50].
Funding: This research received no external funding.

Figure 1. The g loadings and the mean differences (effect size d) between Akita and Okinawa, and Fukui and Kochi on 86 achievement test scores from 2007 to 2018 (r = 0.77, r = 0.71, respectively).

Table 1. The correlations between g loadings and d (mean difference) for test scores from 2007 to 2018.