With three strategies to deal with the issue of representativeness, the next step is to ask under which conditions each of them is more appropriate. For studies interested in analyzing broad typological generalizations, the sparsity of massively parallel texts strongly favors the strategy adopted by the Typological tradition. Practical considerations are also decisive for studies interested in comparing individual items/constructions across multiple languages: we have seen that the Contrastive parallel corpus architecture cannot easily be generalized beyond three languages, so that researchers with a comparative focus on individual items/constructions should resort to the strategy of the Translation Mining
tradition. Turning to contrastive studies, the choice is one between the representativeness strategies of the Contrastive and the Translation Mining
traditions, and this is the choice we focus on in this section. Providing a full decision tree to decide between the two lies beyond the scope of this paper, but we here set the stage for future studies to build on. On the one hand, we introduce a generalized version of the measure of mutual correspondence (see Section 2.2) that can be applied to parallel corpora in the two traditions (Section 5.1). This will allow future studies to compare their respective outcomes more easily and evaluate the impact of their respective representativeness strategies. On the other hand, we argue that the assumptions about monolingual and translation corpora that motivate the representativeness strategy of the Contrastive tradition (see Section 2.2) may be correct for big corpora but may not always hold for smaller corpora (Section 5.2). This, in turn, influences the choice between the representativeness strategies of the Contrastive and the Translation Mining traditions (Section 5.3).
5.1. Generalizing Mutual Correspondence
In Section 2.2, we presented a number of measures that are used in the Contrastive tradition and illustrated them with English talk
and Norwegian snakke
data. The unidirectional measures in (1) and (2) were concerned with how often a given form in the target language is a translation of a given form in the source language. These measures can easily be generalized to apply to any pair of languages, independently of whether they are source or target languages. The same does not hold for the bidirectional measure defined in (3): mutual correspondence. The problem that presents itself is that mutual correspondence is based on data from two independent samples of source and translated texts. Corpora in the Translation Mining
tradition do not come with such independent samples, and we consequently have to reconsider the rationale behind mutual correspondence to arrive at a bidirectional measure that can be applied in both traditions, allowing for an easy comparison of their respective findings. We argue that normalized pointwise mutual information (NPMI, Bouma 2009), a measure that originates in Information Theory, provides us with such an alternative rationale. For concreteness, we work with a parallel dataset containing all French indicative verb forms (n = 389) from the first chapter of Camus’ L’Étranger
and their translations to Mandarin. We focus on the French imparfait
and the way it relates to the Mandarin progressive marker zai.
The intuition behind our use of NPMI can best be understood with a small analogy. Imagine person A is tossing a coin and person B is throwing a die. The probability of person A ending up with heads is one out of two, and the probability of person B ending up with six is one out of six. The probability of them ending up with these results in the same turn is the product of the probabilities of each of the results, viz., 1 out of 12. With two fully independent processes, we can thus calculate the probability of ending up with a given pair of results by multiplying the probabilities of the individual results.
Moving to parallel texts, the turns in the coin-and-die example become pairs of expressions that occur as each other’s counterparts. We refer to these pairs as counterpart pairs, or CPs for short. Frequencies give us a handle on the probabilities of individual expressions occurring in CPs. For example, for our Camus data, the probability of finding the French imparfait in a CP is 140 out of 389. Likewise, the probability of finding Mandarin zai in a CP is 6 out of 389. With the probabilities of these individual expressions in place, we can calculate what the probability of them occurring in the same CP would be if co-occurrence in a CP were random. In the same way as for the coin-and-die example, we simply have to take the product of the individual probabilities; namely, 840 out of 151,321.
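As a quick check on this arithmetic, the hypothetical probability under independence is simply the product of the two marginal probabilities. A minimal sketch with exact fractions, using the counts reported above for the Camus dataset:

```python
from fractions import Fraction

# counts from the Camus dataset: 389 counterpart pairs (CPs) in total
p_imparfait = Fraction(140, 389)  # probability of the imparfait occurring in a CP
p_zai = Fraction(6, 389)          # probability of zai occurring in a CP

# hypothetical probability of the two ending up in the same CP
# if co-occurrence were random: the product of the marginals
p_random = p_imparfait * p_zai
print(p_random)  # 840/151321
```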
Clearly, co-occurrence in a CP is not random, but, given that we can calculate what the probability of two expressions co-occurring in a CP would be if it were, we can compare the actual probability of finding them together in a CP in the corpus to this hypothetical probability. By dividing the actual probability by the hypothetical probability, we then get a measure of how strongly the two expressions are associated with each other across their respective languages. For the imparfait and zai in our Camus data, the actual probability is 6 out of 389. If we divide this by their hypothetical probability, we end up with 2.78, indicating that the actual probability of finding the imparfait in the same CP as zai is approximately three times higher than we would expect on the basis of random co-occurrence.
NPMI builds on the actual/hypothetical probability ratio we have introduced, but puts it through two further transformations. The first is to take the (binary) logarithm of this ratio. The main effect of this operation is that the cutoff point between actual probabilities that are higher than the hypothetical ones and those that are lower is moved from 1 to 0. The rationale behind this first transformation is internal to Information Theory, where information is measured in (binary) bits, and random co-occurrence is taken to have no information value. The second transformation consists in dividing the result of the first transformation by the negative value of the logarithm of the actual probability. The latter value equals the result of the first transformation when the two expressions in question only occur together, i.e., when the ratio is at its highest. The effect of this operation is to project all values onto a scale from −1 to 1, while maintaining 0 as the cutoff point. This second transformation thus counts as a normalization and allows for easy comparison across datasets. The final NPMI value for the imparfait and zai in our Camus dataset is then log(2.78)/−log(6/389) ≈ 0.25.
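The two transformations can be made concrete in a few lines of code. The sketch below is ours, not taken from Bouma (2009); it reproduces the Camus figures from the counts given above (the unrounded value is 0.245, which rounds to the 0.25 reported in the text):

```python
import math

def npmi(joint, freq_x, freq_y, n_cps):
    """NPMI for two expressions occurring freq_x and freq_y times in
    n_cps counterpart pairs (CPs), and joint times in the same CP."""
    p_joint = joint / n_cps                          # actual co-occurrence probability
    p_random = (freq_x / n_cps) * (freq_y / n_cps)   # hypothetical probability under independence
    pmi = math.log2(p_joint / p_random)              # first transformation: binary log of the ratio
    return pmi / -math.log2(p_joint)                 # second transformation: normalization to [-1, 1]

# Camus data: imparfait in 140/389 CPs, zai in 6/389 CPs, together in 6 CPs
print(round(npmi(6, 140, 6, 389), 3))  # 0.245
```

Note that the measure is symmetric: swapping freq_x and freq_y leaves the result unchanged, which is what makes it a bidirectional measure that does not distinguish source from target language.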
With the definition of NPMI in place, we are in a good position to come back to mutual correspondence and discuss the way the two measures relate to one another. We argue that both measures quantify the strength of association between expressions across languages. Mutual correspondence does so by comparing the frequency of expressions in source texts to the frequency of their counterparts in translations. The more frequently their counterparts occur as their translation equivalents, the stronger the association is between the expressions and their counterparts. NPMI follows a different route and compares the actual probability of two expressions occurring as each other’s counterparts to the hypothetical probability of the two randomly occurring as each other’s counterparts. The more NPMI approaches a value of 1, the stronger the association is between the two expressions. Despite the fact that mutual correspondence and NPMI are clearly mathematically different, we conclude that they do measure the same construct, viz., the association between expressions across languages. NPMI comes out as the more general measure, as it does not rely on two independent samples of source and translated texts. It can consequently be used in parallel corpora from both the Contrastive and the Translation Mining traditions and thus allows for easy comparison of their outcomes and of the impact of their respective representativeness strategies.
5.2. Assumptions of the Contrastive Tradition and Corpus Size
In Section 2.2, we pointed out that there are two assumptions that underlie the representativeness strategy of the Contrastive tradition. The first is that source texts are representative of the target language, whereas translated texts are less so; the second is that the differences between source and translated texts within
a language are to be related to the process of translation underlying the latter. Even though we agree with these assumptions, we also want to caution against too strict an interpretation, in particular for smaller corpora similar to the ones used in the different traditions discussed in this paper. The argument we develop is as follows: if (potential) source texts were representative of the target language, we would expect there to be little to no variation between them. We show that this expectation is not borne out and conclude that taking source texts as the ultimate touchstone for target language representativeness in parallel corpus research is not a foolproof strategy.
To make our discussion as concrete as possible, we go back to Johansson’s study on hate and love in the ENPC and remind the reader that Johansson observes that English hate and love are less frequent in translated texts than in source texts, and that he relates this fact to the influence of translation. On the strongest interpretation of Johansson’s reasoning, there should be no independent reason for hate and love to be less frequent in the translated texts of his corpus. However, this is exactly where smaller corpora are at a disadvantage: unless we have a corpus that is balanced for the phenomenon under study, there is no way to exclude independent factors from intervening in the frequencies of individual expressions. To get a feel for the size of corpus that would be required to be able to abstract away from the influence of such independent factors, we extracted two hate and love
datasets from the Corpus of Contemporary American English (COCA, Davies 2008). Similar to the ENPC used by Johansson, COCA is a balanced corpus, but in contrast to the ENPC, COCA has over a billion words and contains over 20 million words for every year from 1990 to 2019 in the same balanced design as the overall corpus. For comparison, we note that the English and Norwegian source text subcorpora of the ENPC each contain between 600k and 700k words.
The first dataset we extracted, presented in Table 5, opposes the frequencies of hate and love in the years 1992 (23.8m words) and 1993 (24.5m words). What we find is that hate and love are clearly more frequent in 1993 than in 1992. We checked the differences for each verb with the same log-likelihood test as the one used by McEnery and Xiao, and found that the distribution of the two verbs is significantly different between the two years (hate: LL value = 37.54 (p < 0.001), love: LL value = 37.21 (p < 0.001)). What this dataset shows then is that, even in a far bigger corpus than the ENPC, there is no way to guarantee that there are no independent fluctuations in the frequencies of individual expressions. The relevance of this observation lies in the fact that COCA is a monolingual corpus, and therefore the fluctuations cannot be related to translation.
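For readers who want to run this kind of check themselves, the standard two-corpus log-likelihood calculation (the G2 statistic commonly used in corpus linguistics) can be sketched as follows. The raw frequency counts behind the figures above are not repeated here, so the illustration uses hypothetical counts:

```python
import math

def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Two-corpus log-likelihood (G2) for one word: observed frequencies
    freq_a and freq_b in corpora of size_a and size_b tokens."""
    total = freq_a + freq_b
    exp_a = size_a * total / (size_a + size_b)  # expected frequency in corpus A
    exp_b = size_b * total / (size_a + size_b)  # expected frequency in corpus B
    ll = 0.0
    for obs, exp in ((freq_a, exp_a), (freq_b, exp_b)):
        if obs > 0:                              # 0 * log(0) is taken to be 0
            ll += obs * math.log(obs / exp)
    return 2 * ll

# hypothetical counts: 100 vs. 200 hits in two 1m-word subcorpora
print(round(log_likelihood(100, 200, 1_000_000, 1_000_000), 2))  # 33.98
```

With one degree of freedom, a log-likelihood value above 10.83 corresponds to p < 0.001, which is how the significance levels reported above are obtained.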
The second dataset we extracted moves to an even higher level of aggregation and opposes the frequencies of hate and love in the years 1990–1994 to those in the years 1995–1999. Where the data in Table 5 were still concerned with two subcorpora of around 20m words, we now move to two subcorpora with over 100m words (139m and 147m words, respectively). The data are presented in Table 6.
The difference we found in Table 5 in the 1992 and 1993 subcorpora has clearly become smaller, especially if we were to focus on relative frequencies. It is still significant, though (hate: LL value = 4.12 (p < 0.05)). For love, moving to this higher level of aggregation makes little difference, even in relative frequencies, and the difference between the two subcorpora in Table 6 remains highly significant (love: LL value = 584.19 (p < 0.001)). This second dataset thus further strengthens our claim that independent fluctuations in the frequencies of individual expressions are difficult to avoid and that these need not be related to translation in any way. We conclude that the comparison between source and translated texts can inform us about the influence of translation, but that this comparison should be handled with care. This holds for big corpora and a fortiori
for the smaller corpora used in the traditions discussed in this paper. Corpus size is, of course, relative to the phenomenon under study, and lexical phenomena are likelier to require bigger corpora than more grammatical phenomena.
5.3. Corpus Size and Choosing a Representativeness Strategy
Our discussion in Section 5.2 cautions against an overreliance on the comparison between source and translated texts within a language to control for the influence of translation. What the data from COCA show is that the variation we find might well be due to factors that have little or nothing to do with translation. The question that arises is how best to deal with this extra complication, in particular for smaller corpora. The answer, we argue, lies in the extended research design of the Translation Mining tradition.
In Section 4.2, we already pointed out that the Translation Mining
tradition does not rely on one corpus but replicates the parallel vs. monolingual perspective of the Contrastive tradition across studies of multiple parallel corpora with different source languages. The advantage of this approach is that it maintains the parallel vs. monolingual perspective but, at the same time, forces researchers to pay attention to the individual characteristics of each corpus and invites them to systematically reflect on different sources of variation. Predictions based on one corpus are checked on the next, and hypotheses on why predictions are borne out or not are systematically evaluated.
A further research design feature that we did not treat in Section 4.2
but does play a role in teasing apart different types of variation in the Translation Mining
tradition is related to an architectural feature of its corpora, whose relevance can be best highlighted here. Major corpus compilation projects from the Contrastive tradition include fragments of different source texts from different authors and translations from different translators. Corpora in the Translation Mining
tradition are markedly different in the sense that they are typically built around a single source novel and one translation per language at a time. The rationale behind this move is that it allows researchers to keep constant as many variables as possible, while actively looking for variation between subparts of the corpus. In Le Bruyn et al. and van der Klis et al., this strategy leads to the opposition between dialogue and narrative discourse in their analysis of the tense use in the first volume of the Harry Potter
series by J.K. Rowling and its translations to a number of Western European languages. The inclusion of this opposition is a direct consequence of the difference in the use of the have-perfect they found between Chapter 1 and Chapters 16/17. On closer analysis, this difference turned out to be due to the fact that Chapter 1 contains little-to-no dialogue, whereas dialogue abounds in Chapters 16/17. This active search for variation within a single corpus typically turns up variation that is independent of translation, and complements the attention to different types of variation across multiple corpora that we noted above. Together, they allow research within the Translation Mining
tradition to lead—across multiple studies—to the critical mass required to support claims about cross-linguistic variation, on the one hand, and about the influence of translation, on the other.