3.1. Association Across Linguistic Variants Disregarding Space
The association graph in Figure 6
visualises the variant categories that are used together in the survey locations, and the strength of the connection, proportional to the Jaccard index. Parallel to analyses in traditional dialectology delimiting dialect areas based on isogloss bundles, e.g., [36
], this overlap analysis shows the degree to which dialect areas can be discovered based on the 37 lexical variables considered, but, importantly, independent of geography. Having no spatial association thus means that our overlap analysis avoids the spatial bias that is present when creating isoglosses by drawing lines on maps. The strongest connections ultimately mean exclusive overlap of the areas covered by the variants in question. The network visualisation in Figure 6
uses the Fruchterman-Reingold algorithm, which also conveys that the positions or distances of nodes are not supposed to be spatially interpreted [85
]. The colour saturation and the width of the edges correspond to the absolute weight and are scaled relative to the strongest weights in the graph.
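As an illustration of the edge weights described above, the following minimal sketch computes the Jaccard index between the sets of survey locations in which two variant categories are used. The variant names and location identifiers are hypothetical toy data, not values from the LAJ, and the function is only one straightforward way to compute the index.

```python
def jaccard(sites_a, sites_b):
    """Jaccard index of two sets of survey locations: |A & B| / |A | B|."""
    a, b = set(sites_a), set(sites_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical toy data: variant category -> locations where it is used.
usage = {
    "variant_A": {1, 2, 3, 4},
    "variant_B": {3, 4, 5},
    "variant_C": {9},
}

# Build the weighted edge list of the association graph;
# pairs without any shared location get no edge.
edges = {}
names = sorted(usage)
for i, u in enumerate(names):
    for v in names[i + 1:]:
        weight = jaccard(usage[u], usage[v])
        if weight > 0:
            edges[(u, v)] = weight

print(edges)
```

An index of 1 corresponds to the exclusive overlap mentioned above: the two variant categories are used in exactly the same set of locations.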
In Figure 6
, two main clusters can be identified, with the association strength gradually fading out around their centres. The larger cluster is associated with the Standard Japanese variants, usually found spread across large areas on the main island Honshu. The more central a node’s position within such a cluster, the more the variant category overlaps with others, signalling its ubiquitous distribution. The spatial relations of such variant categories can be verified on LAJ’s individual variable maps. The smaller cluster on the right is associated with variant categories used in Okinawa, hinting at the fact that variants used here usually do not overlap with the variants used in Honshu, Shikoku or Kyushu. The close-knit cluster suggests a high degree of exclusivity, except for its top right part, which represents variant categories used on the Sachishima-islands, the westernmost island group in the Ryukyu Islands, with what appears to be a distinct dialect based on our data.
Based on the 37 variables, however, the classic definition of dialect areas established using isogloss bundles can be neither proved nor disproved. On the one hand, even the close-knit clusters of overlapping variant categories include only a few of the variant categories rather than one variant category from most variables. On the other hand, the variant categories in the largest cluster are used throughout vast areas in Honshu, a finding which would not qualify for delimiting dialect areas. This pattern invites the question whether the linguistically opposing concept, the dialect continuum theory, can be warranted based on the data available. This interpretation is connected to analysing the results of the MDS.
3.2. Linguistic Distances Mapped
Calculating the linguistic distance matrix allows producing maps with different reference locations, i.e., presenting linguistic distances in reference to certain localities. This kind of visualisation goes back to Goebl’s dialectometry [34
]. Figure 7
maps the linguistic distance from the following six localities: the north of Hokkaido, a rural site in Aomori prefecture in the north of Honshu, Tokyo, Kyoto, Matsue city in Shimane prefecture in the west of Honshu, and Okinawa’s capital city, Naha. Tokyo (formerly Edo
) and Kyoto are the present and the past capitals and cultural centres of Japan, and are therefore thought to have affected the language of the whole country by being the starting points of the (hierarchical) diffusion of many linguistic innovations [74
]. Aomori, at the northern extreme of Honshu, is far away from both capitals and, as such, is associated with preserving dialectal features less affected by standardisation. Matsue is the centre of the so-called Umpaku dialect area, which has a unique historical aspect. Hokkaido was settled by the Japanese primarily from the end of the 19th century, exactly when the respondents were acquiring their mother tongue; the settlers came from different parts of Japan, but mostly from the Tohoku (NE) and Hokuriku (the western shore of central Honshu) areas. Because of this, the language history is not deep, and respondents are assumed to have inherited their ancestors’ language, leading to a dialectally mixed area in which Standard Japanese gained ground more easily. It is not attested in our 37 variables whether the antecedent Ainu population of Hokkaido affects the variants used. Lastly, Okinawa, as an archipelago, used to be a semi-independent kingdom mostly isolated from imperial Japan until it was incorporated as a prefecture in 1879, also shortly before the LAJ respondents’ mother tongue acquisition. Because of the historical isolation, vast differences are expected between Okinawan and “mainland” varieties.
In general, Figure 7
shows that the closer a locality is to the reference locality, the smaller their linguistic distance, but Okinawa tends to show uniformly larger linguistic distances, while Hokkaido’s localities are never extremely different from the reference localities. The northerly Hokkaido locality seems to be lexically close to various areas, attesting a mixture of dialects or the degree to which Standard Japanese is used in different parts of the country. Interestingly, the north of Honshu (Aomori, Tohoku) is among the areas linguistically most different from this locality. The Aomori locality seems to have only a small area of linguistic similarity, with most of Honshu, Kyushu and Shikoku being different. At the same time, the southern tip of Hokkaido seems more similar, which hints at the language connection present throughout history. Linguistic distances to Tokyo (the birthplace of Standard Japanese) tend to be smaller throughout Honshu, and most Hokkaido localities express a similarity with it. The largest distances are found in the north of Honshu (Tohoku) and the south of Kyushu, the geographically farthest areas. Kyoto, as the former capital, is expressly similar to its surroundings in the so-called Kinki area, with its similarity gradually fading with distance and levelling off at the highest differences in the north of Honshu and the south of Kyushu. Interestingly, similarity between the Tokyo and Kyoto areas is not salient in these maps, based on our set of variables. The area looking similar to Matsue spreads farther than Kyoto’s and also looks concentric, with the exception of Hokkaido. Okinawa is, finally, uniformly different in all maps with reference points on the larger islands. Its own reference map, centred on Naha, the capital city of Okinawa, shows extreme difference from all other prefectures of Japan, and even within Okinawa prefecture itself the Sachishima-islands in the west present a relatively large difference.
Based on the linguistic distance matrix, for each locality an average linguistic distance can be calculated to all other localities by taking the mean of each matrix row. Mapping these average values, used also in [53
], adds a technique to Nerbonne’s inventory of mapping aggregate variation [28
]. The resulting map in Figure 8
can thus be interpreted as a degree of overlap between the locally used lexicon and all other localities’ lexicon. Importantly, similar colours do not correspond to linguistic similarity, but to a similar mean difference from all other localities. Although the lexical distance is calculated based on only 37 variables, the map shows several interesting points. The most conspicuous feature of the map is that Okinawan varieties are the most different from all others on average, as expected. The localities closest to all other sites are found on Hokkaido, attesting the mixed nature of the local varieties. In Honshu, the area spanning from the north of Kanto (the area containing Tokyo) to the west of Kansai (the area encompassing Kyoto, Osaka, and the cultural centre of Japan before the Edo-era) is a seemingly average area, fading out into the extremes of the three main islands: to the north of Honshu, and the south of Shikoku and Kyushu.
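The averaging step described above can be sketched as follows. The 3 × 3 matrix is hypothetical toy data, and excluding the diagonal (each locality's zero distance to itself) is an assumption made here; taking the plain row mean, as the text states, only rescales every value by the same factor and leaves the ranking of localities unchanged.

```python
def mean_distance_to_others(dist):
    """Mean linguistic distance of each locality to all other localities.

    `dist` is a square, symmetric distance matrix given as a list of rows.
    The diagonal is skipped (an assumption of this sketch).
    """
    n = len(dist)
    return [
        sum(row[j] for j in range(n) if j != i) / (n - 1)
        for i, row in enumerate(dist)
    ]


# Hypothetical toy matrix: locality 2 is far from both others.
dist = [
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.8],
    [0.9, 0.8, 0.0],
]

means = mean_distance_to_others(dist)
most_different = max(range(len(means)), key=means.__getitem__)
print(means, most_different)
```

In this toy case the third locality has the highest mean distance, which is the role Okinawan localities play in the map discussed above.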
3.3. Dialectal Variation in Space
Having performed the multidimensional scaling (MDS) on the 2400 × 2400 linguistic distance matrix, we can represent the dialectal variation in a three-dimensional space, which is readily interpretable. These three dimensions are assigned to the RGB colours. Interpreting the similarly coloured clusters and spatial areas similar in either the 3D plot or the map in Figure 9
is practically equivalent to finding similar survey sites with regards to all 37 variables and therefore to accounting for dialect areas. Figure 9
excludes the Ryukyu Islands (containing Okinawa) due to their large linguistic distance from all other parts of Japan. Despite the removal of the outlying Ryukyu Islands, no genuinely isolated clusters are visible in the 3D plot. Although contrasting colours and certain central areas can be identified in the map, such as the northern part of Honshu, the Kanto area centred around Tokyo, or the south of Kyushu, the transitions in between remain gradual, attesting to the theory of dialect continua. As expected, Hokkaido’s localities seem to be mixed and brownish in colour, which indicates an equal mixture of RGB colours and thus centrality. In essence, there is a contrast between the MDS map based on the 37 variables at hand, and the classic area formation map of Japan, e.g., [78
]. Based on the MDS map, the dialect areas usually delimited by sharp lines can be considered representations of some core varieties, while the MDS map paints a fuzzier picture of the transitions present between these cores.
An additional MDS conducted only on the Ryukyu Islands revealed that isolated clusters can be found based on the 37 variables, despite the small subset. Four isolated clusters were found, namely, from west to east, the Yaeyama Islands, the Miyako Islands, the Okinawa Islands (containing the capital), and the Amami Islands (belonging to the Satsuma domain of South Japan since 1624, rather than the then Ryukyu Kingdom). This hints at historical isolation not only of the Ryukyu Islands from mainland Japan, but also among the island groups themselves.
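To make the colour assignment concrete, the sketch below maps three MDS coordinates to RGB values. The min-max rescaling of each axis to 0..255 is an assumption of this sketch (the text only states that the three dimensions are assigned to the RGB colours), and the coordinates are hypothetical.

```python
def to_rgb(coords):
    """Rescale each of three MDS axes to 0..255 and read them as (R, G, B)."""
    lo = [min(c[k] for c in coords) for k in range(3)]
    hi = [max(c[k] for c in coords) for k in range(3)]
    colours = []
    for c in coords:
        rgb = []
        for k in range(3):
            span = hi[k] - lo[k]
            # A constant axis carries no information; place it mid-range.
            frac = (c[k] - lo[k]) / span if span > 0 else 0.5
            rgb.append(round(255 * frac))
        colours.append(tuple(rgb))
    return colours


# Hypothetical 3D MDS coordinates for four localities.
coords = [
    (-1.0, -1.0, -1.0),
    (0.0, 0.0, 0.0),   # central on every axis
    (1.0, 1.0, 1.0),
    (1.0, -1.0, 0.0),
]

colours = to_rgb(coords)
print(colours)
```

Under this scaling, a locality that is central on all three axes receives equal red, green and blue components, which is why central, mixed varieties appear in muted rather than saturated colours.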
3.4. Correlations with Spatial Measurements
We calculated several values estimating the potential of dialect contact across localities in the LAJ. For the continuous values, we built spatial distance matrices similarly to the linguistic distance and for all matrix pairs, Pearson product-moment correlation was calculated. Figure 10
shows the correlation coefficients across the explanatory variables for the entire survey area. A high correlation present among GCD
is not surprising, given the size of Japan. These values are negatively correlated with the logarithms of the TLGI
values, as they represent an influence
, therefore similarity, rather than distance.
It is expected that the logarithm of the spatial distances will have a greater explanatory power on the linguistic variation, for the following reason. While linguistic distance can grow only up to a certain point (i.e., until total dissimilarity), spatial distance can keep growing. It is expected (similarly to most dialectological studies) that in a large area, such as Japan, large linguistic differences will be reached before the most extreme spatial distance from a certain point is reached.
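The reasoning above can be illustrated with a small sketch that correlates the pairwise entries of two distance matrices, once with raw and once with log-transformed spatial distances. All numbers are hypothetical: four localities on a line, with a linguistic distance that saturates at larger spatial distances. The helper names are introduced here for illustration only.

```python
import math


def upper_triangle(matrix):
    """Flatten the entries above the diagonal (each locality pair once)."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical spatial distances (km) and saturating linguistic distances.
spatial = [
    [0, 10, 100, 1000],
    [10, 0, 90, 990],
    [100, 90, 0, 900],
    [1000, 990, 900, 0],
]
linguistic = [
    [0.0, 0.2, 0.5, 0.9],
    [0.2, 0.0, 0.5, 0.9],
    [0.5, 0.5, 0.0, 0.85],
    [0.9, 0.9, 0.85, 0.0],
]

s = upper_triangle(spatial)
l = upper_triangle(linguistic)
r_raw = pearson(s, l)
r_log = pearson([math.log(d) for d in s], l)
print(r_raw, r_log)
```

In this toy case the logarithmic version yields the higher coefficient, because the linguistic distance levels off while the spatial distance keeps increasing.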
Correlation coefficients with the linguistic distance for the entire survey area and the functional subsets are given in Table 1
. We first compare the effects of the spatial distances (rows) and then discuss the different data subsets (columns). As seen in Figure 10
, the correlation of GCD
are almost total so, unsurprisingly, all of them explain a similar amount of variance in the dialectal differences. It is also due to this fact that we tested the travel distance (lengthwise shortest paths with regards to the network) and travel time matrices coming from different resources rather than sourcing both from OSRM.
The correlation of GCD
with the linguistic distance is presented in a heatmap (Figure 11
), due to the large number of locality pairs. Hexagons are coloured by the number of points (locality pairs) in each cell, thus plotting the density of points. The correlation is undoubtedly positive, but solely based on the plot, its linear or logarithmic nature cannot be warranted. The correlation tests reveal that the logarithm of GCD
explains slightly more variance in the linguistic distance (r
= 0.6462 and r
= 0.6714, respectively). The difference proves to be statistically significant based on Meng et al.’s z
], calculated using the R
]. This test is applied for finding whether any correlation coefficient is significantly different from another, given their difference and the sample sizes.
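For orientation, the following is a minimal sketch of the z statistic of Meng, Rosenthal and Rubin (1992) for comparing two correlations that share one variable, here the linguistic distance. It is not the R implementation used in the study. The two coefficients are those reported above; the correlation between the two predictors (`r12`) and the sample size (`n`) are hypothetical placeholders, since neither is stated in this section.

```python
import math


def meng_z(r1, r2, r12, n):
    """z-test for two dependent correlations sharing one variable.

    r1, r2 : correlations of predictor 1 and predictor 2 with the outcome
    r12    : correlation between the two predictors (must be below 1)
    n      : number of observations
    Returns the z statistic and a two-sided p-value.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)      # Fisher z-transform
    r_sq_mean = (r1 ** 2 + r2 ** 2) / 2
    f = min((1 - r12) / (2 * (1 - r_sq_mean)), 1.0)
    h = (1 - f * r_sq_mean) / (1 - r_sq_mean)
    z = (z1 - z2) * math.sqrt((n - 3) / (2 * (1 - r12) * h))
    p = math.erfc(abs(z) / math.sqrt(2))         # two-sided normal p-value
    return z, p


# r1, r2 from the text; r12 and n are hypothetical illustration values.
z, p = meng_z(r1=0.6462, r2=0.6714, r12=0.9, n=1000)
print(z, p)
```

The statistic grows with the square root of the sample size, which is consistent with the observation that even small differences between coefficients become significant when the number of locality pairs is very large.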
The high correlation with TD should be treated with some scepticism due to the high rate of missing values. The logarithmic correlation is significantly higher in this case too. With a much smaller rate of missing values, the correlation obtained with contemporary TT is lower than that of GCD, but its logarithm seems to match the logarithm of GCD. The large number of locality pairs, however, renders this difference statistically significant.
Correlation values for HT and their logarithms are similar, but lower than the previously discussed values, inviting the question whether our model is less valid for the estimation of dialect contact (resulting in the dialect landscape of the first half of the 20th century) or whether dialect variation is not governed as much by potential least cost paths as we determined at the scale of the entire country, and for our pool of variables.
To level out the uncertainty due to missing values in TD
, correlation with linguistic distance is calculated for the subset of locality pairs where all spatial distance values are available (
). For this subset, containing 70.7% of all locality pairs, the correlation coefficients are given in the second row of Table 1
. These results, however biased by not taking into account distances between Okinawa, Hokkaido and the three most populous islands, show that the spatial distance-based estimations of contact deliver very similar explained variance. We assume that this is due to the fact that the overwhelming majority of locality pairs lack the possibility for direct contact because of large distances. In such cases only indirect contact is present and thus the way we measure the inability
of contact makes little difference. This convergence at the global level invites the investigation of the local impact of different estimations of contact.
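Restricting the comparison to locality pairs for which every spatial measure is available can be sketched as follows. The measure names follow the abbreviations used in the text, but the values, the use of `None` for missing entries, and the function name are hypothetical.

```python
def complete_pairs(measures):
    """Keep only the locality pairs for which every measure has a value.

    `measures` maps a measure name to a list with one entry per locality
    pair; `None` marks a missing value. Returns the filtered measures and
    the share of pairs that was retained.
    """
    names = list(measures)
    n = len(measures[names[0]])
    keep = [
        i for i in range(n)
        if all(measures[m][i] is not None for m in names)
    ]
    subset = {m: [measures[m][i] for i in keep] for m in names}
    return subset, len(keep) / n


# Hypothetical values for four locality pairs.
measures = {
    "GCD": [10.0, 100.0, 1000.0, 90.0],
    "TD": [12.0, None, None, 95.0],
    "TT": [0.3, 2.0, None, 1.5],
}

subset, share = complete_pairs(measures)
print(subset, share)
```

Computing all coefficients on the same retained pairs makes them directly comparable, at the cost of dropping exactly those pairs (for example across islands) for which some measure cannot be obtained.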
Within Hokkaido, a lower correlation is expected, since the island has been populated by Japanese speakers more extensively only since the end of the 19th century, slightly before or around the LAJ respondents’ mother tongue acquisition. As the settlers came from all over Japan, we find that their varieties are much closer to the Standard because of the newly established, cosmopolitan environment. Besides, varieties tend to resemble those of the various areas from which the settlers of Hokkaido settlements came. The mixed pattern visible in Figure 7
, Figure 8
and Figure 9
and the low correlation values are hard to explain by local geographic factors, given that the historical scenarios leading to the then Hokkaido language variation are not only formed in Hokkaido but necessarily around Honshu, the ancestral home of the majority of LAJ respondents in Hokkaido. It is the TLGI
that explains the most variance in Hokkaido. The difference between its two measures is not statistically significant. Moreover, the correlation coefficient for
, −0.3407, is not significantly higher than for the logarithm of TT
, 0.2782, due to the low number of samples. This means that contact patterns based on migration and hierarchy do not characterise dialectal variation in Hokkaido more than elsewhere.
Honshu encompasses two thirds of the survey locations, and most pairwise values for explanatory variables could be calculated. The large distances within Honshu encompass mostly indirect contact, so that the spatial distance values explain a similar amount of variance, with TT yielding the lowest values. However, due to the high number of samples, most of these correlation coefficients are significantly different, resulting in the logarithm of TD being the best explanatory variable.
Unsurprisingly, the resulting correlation coefficients for the united subset of Honshu, Kyushu and Shikoku (HSK) resemble those in . The nuanced differences present stem from incorporating location pairs in that are within Hokkaido or other islands with multiple survey localities. This also demonstrates the degree to which the three most populous islands outweigh all other areas when accounting for correlations at the global scale, inviting the question of testing correlations at (more) local scales.
Shikoku has the highest correlation coefficients of all subsets, with the logarithm of TD
scoring the highest (0.7943); however, this value is not statistically significantly higher than that of TD
and their logarithms, and the logarithm of GCD
, due to the small number of localities on Shikoku. These r
values mean that the spatial distance measures explain about 62% of the variance in linguistic distances, leaving much less room for other, sociodemographic variables to influence the lexical variation. Because of this, it would be interesting to investigate the role of geographic factors in linguistic variation on the island of Shikoku in more depth. Shikoku’s geography is defined by rugged mountains, which crucially shape communication among the four prefectures located on it. The centres of these prefectures are relatively isolated from each other, and in part have better chances of communication with Honshu via sea, e.g., [102
Kyushu’s correlation coefficients are almost as high as Shikoku’s, with statistically no difference between GCD, TD, their logarithms and the logarithm of TT. The relatively lower correlation with HT could be influenced by the fact that the number of Edo-era ports for the Kyushu subset is also relatively low. This shows the propagating effect of small differences in local models and the importance of limitations regarding the realistic estimation of contact potential.
In the case of Okinawa, as an archipelago, the fact of not having roads in between islands renders the HT
as the potential interaction estimation similar to GCD
, with the difference of elevated importance of port access. TD
data are retained for less than half of the point pairs. Correlation with TLGI
is relatively high for Okinawa, probably because of its relatively small size and because the frequency of access across islands might historically correlate with their population, the proportions of which might not have changed much. However,
is not significantly lower than the logarithm of HT
. Huisman [14
] notes that in archipelago languages “diversity is a reflection of time since divergence, as a result of limited contact due to the geographic isolation of islands”. High correlation of Okinawan linguistic difference with the remaining explanatory variables means that even though Okinawa is relatively small and its variation is very different from all others in general, linguistic differences within Okinawa itself are large and spatially autocorrelated. Further, a large part of this linguistic difference can be explained by distance contact patterns over sea.
In each set of locations TLGI has a lower explanatory power, which would mean that even bigger cities are impeded from communication by long distances. It might, however, show that the communication patterns across the country characteristic of the Meiji-era cannot be very well explained by influence characteristics representing 1975 and 2005, despite them being scaled down to local population densities.
3.5. Effects of Administrative Boundaries
We report the tests investigating the dialect separation effect of administrative boundaries in two ways. On the one hand, Table 2
shows the aggregate effect of the boundaries, testing the within
groups’ overlap when cumulated for all administrative regions. All Mann-Whitney U
tests result in statistically significant separation values, therefore only their effect sizes are reported, by giving the Vargha-Delaney A
and their interpretation as defined in the R
]. On the other hand, Figure 12
and Figure 13
map the underlying effect sizes contributed by each of the administrative regions, prefectures and domains respectively, showing the results of the calculations with a 150 km cut-off. The colours of the regions correspond to the effect size categories. Besides, density plots show the distribution of linguistic distances in the within
groups, respectively, with within
groups expected to have smaller values.
Vargha and Delaney’s A reports the probability that a randomly chosen value from one group will be greater than a randomly chosen value from the other group. A value of 0.5 would indicate stochastic equality of the two groups. A value of 1 would indicate that the within group shows complete stochastic domination over the separated group, and a value of 0 would indicate the other way around, the separated group showing larger linguistic distances in all cases.
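The effect size described above can be sketched directly from its definition, counting ties as one half. The linguistic distances below are hypothetical, and the order in which the two groups are passed fixes the direction of the statistic; which group comes first is an assumption of this sketch, not a statement about the implementation used in the study.

```python
def vargha_delaney_a(x, y):
    """Vargha-Delaney A: P(X > Y) + 0.5 * P(X == Y) over all value pairs."""
    greater = sum(1 for a in x for b in y if a > b)
    ties = sum(1 for a in x for b in y if a == b)
    return (greater + 0.5 * ties) / (len(x) * len(y))


# Hypothetical linguistic distances for locality pairs within one region
# and for pairs separated by its boundary.
within = [0.1, 0.2, 0.3]
separated = [0.2, 0.4, 0.5]

a_sep = vargha_delaney_a(separated, within)
a_within = vargha_delaney_a(within, separated)
print(a_sep, a_within)
```

Swapping the arguments gives the complementary value, so the two numbers always sum to 1, and a value of 0.5 corresponds to the stochastic equality mentioned above.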
In all cases, at the aggregate scale in Table 2
, locality pairs separated by the domains’ boundaries show little to negligible stochastic dominance with regards to linguistic distance, which means that a domain boundary between two survey locations would not by itself mean a much greater chance of a higher linguistic distance. The small stochastic difference between the within domain
and the separated
groups is also visible in the density plot in Figure 13
. In contrast, for prefecture boundaries, the higher the distance cut-off value chosen, the larger the effect size, reaching the medium and large categories.
In Figure 12
and Figure 13
it is often the larger regions for which a smaller effect is present. However, not all large prefectures show this pattern, and some of the largest ones’ boundaries actually show a large effect. In the case of the domains, the three largest ones in the north of Honshu show smaller effects, while the other domains showing small and medium effects overlap the areas of the prefectures that also show small and medium effects. Smaller effects of larger regions’ boundaries might be due to the larger linguistic distances possible within
large regions, which might go hand in hand with smaller distances across their boundaries. However, as this is not always the case, it is safe to say that spatial variation is present. This spatial variation is also marked by boundaries that changed as domains were reorganised into prefectures. As changing boundaries often show a change in effect size category as well, we might confirm the presence of the modifiable areal unit problem (MAUP). When aggregated, however, the differences across individual regions level out. Nonetheless, the underlying variation in effect statistics invites analysing such boundaries in more detail, focusing on certain sections rather than investigating the whole length of a region’s boundaries, together with their historical context and changes of location, stability and porosity. Finally, the larger effect visible at longer distances for prefectures might be interpreted as the effect of distance, rather than the effect of boundaries.