3. Vector Analysis of Translations Based on Deep-Language Variables
Independently of the different parallel channels (one for each variable), the correlation noise is in most cases larger than the regression noise, indicating that every translation tries as much as possible to be unbiased, i.e., to approach unit slope, but it cannot avoid being decorrelated, with correlation coefficients that decrease, approximately, across the regression lines related to characters, words, sentences and interpunctions, and to the four deep-language variables.
While different translations of a New Testament text into the same language can be mathematically quite different, this is always the case for different languages [1], as we now show explicitly for Matthew by using a graphical tool developed in [2], namely the vector plane.
The vector plane is a useful graphical tool for synthetically comparing different literary texts. As can be noticed in Figure 18 of [39], for most New Testament texts the vector plane allows one to assess how mathematically similar different texts are, by considering the four deep-language variables, which authors do not explicitly control when writing. It considers six vectors, whose components are given by the average values of these variables, and their resulting vector, whose coordinates, given by Equation (13), are the sums of the corresponding components of the six vectors.
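The construction can be sketched numerically as follows; the six component vectors below are hypothetical placeholders (the actual deep-language averages are those of Table 1), and the resulting vector is taken as the component-wise sum, as in Equation (13):

```python
# Six hypothetical component vectors (x, y), one per pair of
# deep-language averages; the values below are placeholders only.
vectors = [
    (0.8, 0.3), (0.5, 0.9), (0.2, 0.4),
    (0.7, 0.1), (0.4, 0.6), (0.9, 0.2),
]

# Resulting vector: component-wise sum of the six vectors (Equation (13)).
x_r = sum(x for x, _ in vectors)
y_r = sum(y for _, y in vectors)

# Euclidean distance between two resulting vectors, e.g., a translation
# and the Greek original, as used in Figure 3.
def distance(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
```

On the vector plane, each text is then a single point, and translations can be seen to cluster around, or drift away from, the Greek original.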
Table 1 reports the average values of the four deep-language variables found in Matthew in the indicated languages.
Figure 2 shows the vector plane reporting the coordinates of Equation (13), and Figure 3 shows the distance of each translation from the Greek Matthew. Latin, as found in [1], is the translation closest to the original Greek. From these figures we can see that all translations are mathematically quite diverse from the original Greek text and show a large variability, as in [1]. Similar results can be found for other New Testament texts, not considered here for brevity.
It is interesting to notice that, within any language family (Table 1), this distance varies over a range of approximately 7 dB. Some translations in some language families practically coincide, e.g., Italian, French and Spanish, or Icelandic, Norwegian and Swedish.
Because it is cumbersome to consider all New Testament texts studied in [1] to assess whether, within a language, the mathematical mutual relationships of the original Greek texts are saved or lost, in the following we use only Matthew as the reference text in every language. From the large spread shown in Figure 1, we can imagine the even larger spread likely to be found when studying all languages, a topic studied in depth in [1].
In conclusion, the results reported below, obtained by assuming Matthew as the reference text, are sufficient for giving a reliable answer to the issues mentioned in the Introduction. In other words, for the purpose of deriving the full characteristics of the extended theory, it is sufficient to consider only Matthew and its translations, and to show how a text is mathematically related, within the same language, to other texts, namely the gospels according to Mark (Mk), Luke (Lk) and John (Jh), and Acts (Ac) and Apocalypse (Revelation) (Ap).
Moreover, of the many linguistic channels linking two texts, for brevity we consider the channel that links their sentences, discussed in the next section.
4. The Sentences Channel and Its Theoretical Signal-to-Noise Ratio
“Translation” can also refer, as discussed in Section 2, to the case in which a text is compared to another text, both written in the same language. We can investigate, for instance, how the number of sentences in one text is “translated” into the number of sentences in another text for the same number of words. This comparison can be done, of course, by considering average values and regression lines, but now the theory also allows us to consider the correlation coefficient (i.e., the noise defined in Equation (5)) and provides insight because it models linguistic channels according to parameters of communication theory, such as the signal-to-noise ratio (and possibly the channel minimum capacity [1]). For our study of the several linguistic channels, we consider, for illustration, only the sentences channel.
Let us consider the Greek texts of the New Testament, and let us compare Matthew, in turn, to Mark, Luke, John, Acts and Apocalypse.
Notice that, in any translation, all texts have been processed as detailed in [1]. For each chapter, we have counted words, sentences and interpunctions (full stops, question marks, exclamation marks, commas, colons, semicolons) after deleting all extraneous material added to the original text by translators/commentators, such as titles, footnotes, et cetera. At the end of this lengthy and laborious work, only the original text was left to be studied. Of course, it is not required to understand any of the translation languages (the theory does not consider meaning), because the process consists of just counting characters and sequences of characters. In the end, for example, in any language Matthew is made of 28 chapters; therefore, all regression lines, or any other data processing, are always based on 28 pairs of data.
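As an illustrative sketch of this counting step (the exact tokenization rules used in [1] may differ in detail; here words are runs of letters, sentence endings are full stops, question marks and exclamation marks, and interpunctions are those plus commas, colons and semicolons):

```python
import re

SENTENCE_MARKS = ".?!"
INTERPUNCTIONS = ".?!,:;"

def chapter_counts(text: str) -> dict:
    """Count words, sentences and interpunctions in one chapter."""
    words = len(re.findall(r"[^\W\d_]+", text))  # runs of letters
    sentences = sum(text.count(c) for c in SENTENCE_MARKS)
    interpunctions = sum(text.count(c) for c in INTERPUNCTIONS)
    return {"words": words, "sentences": sentences,
            "interpunctions": interpunctions}

counts = chapter_counts(
    "In the beginning was the Word. "
    "And the Word was with God; and the Word was God."
)
```

Applying such a function chapter by chapter yields the 28 pairs of data on which the regression lines are based.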
According to Section 2, to apply the theory to the sentences channel, we need to know the slope and the correlation coefficient of the regression line between the number of sentences per chapter (dependent variable) and the number of samples of another variable (independent variable) for each text. As the independent variable, we consider the number of words per chapter; therefore, the input parameters refer to the regression line between sentences and words. By eliminating words, the theory compares sentences for an equal number of words, i.e., it studies the sentences channel.
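A minimal sketch of how the input parameters of the sentences channel can be obtained (the chapter counts below are hypothetical; regression lines are taken through the origin, consistent with the theory using a single slope per text):

```python
# Hypothetical per-chapter counts for two texts.
words_A = [320, 410, 290, 380, 450, 360]   # words per chapter, text A
sents_A = [16, 21, 14, 19, 23, 18]         # sentences per chapter, text A
words_B = [300, 420, 310, 400, 440, 350]
sents_B = [12, 17, 13, 16, 18, 14]

def slope(x, y):
    """Slope of the regression line through the origin, y = m * x."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

m_A = slope(words_A, sents_A)   # sentences per word, text A
m_B = slope(words_B, sents_B)   # sentences per word, text B

# Eliminating words links the two texts directly: for the same number
# of words, sentences_A is approximately (m_A / m_B) * sentences_B.
m_channel = m_A / m_B
```

The ratio `m_channel` plays the role of the channel slope: it converts sentence counts of one text into the sentence counts expected in the other for the same number of words.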
For example, Table 2 shows the regression parameters found in the original Greek texts. Notice that we have kept 4 decimal digits because some values differ only from the third digit onwards. For example, on average, for 100 words we find slightly different numbers of sentences in Matthew and in Luke (the text closest to Matthew when considering all deep-language variables, as reported in [39]), and noticeably fewer sentences in Apocalypse.
Because we always consider Matthew as the dependent text, the data of Table 2 can be used to compare the sentences channel of Matthew with itself (self-channel) and with the other texts (cross-channels). Mathematically, the input data (reference, independent data) to the theory are the slope and the correlation coefficient given, in turn, by Mark, Luke, John, Acts and Apocalypse, and the dependent data are always the values of Matthew.
Table 3 reports, for each cross-channel, the values calculated with Equation (4) and with Equation (8), the total signal-to-noise ratio calculated with Equations (11) and (12), and the partial values calculated with Equations (6), (7), (9) and (10).
For example, according to Table 3, the number of sentences estimated in Matthew is given by the number of sentences in Luke multiplied by the channel slope, with a high correlation coefficient. Therefore, a fraction of 98.76% of the variance of the sentences in Matthew is due to the regression line, while only 1.24% is due to decorrelation. The large difference between the partial signal-to-noise ratios makes the total value practically determined by the smaller of the two, because in Equation (11) the larger noise term dominates.
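The way the partial values combine can be sketched as follows, assuming (as the discussion of Equation (11) suggests) that the partial noise-to-signal ratios add in linear units:

```python
import math

def total_snr_db(snr1_db, snr2_db):
    """Total S/N (dB) from two partial S/N values, assuming the
    corresponding noise-to-signal ratios add in linear units."""
    nsr = 10 ** (-snr1_db / 10) + 10 ** (-snr2_db / 10)
    return -10 * math.log10(nsr)

# With very unequal partial values, the total is practically set by
# the smaller one:
total = total_snr_db(20.0, 3.0)   # slightly below 3 dB
```

This mirrors the behaviour of parallel noise sources in communication channels: the noisiest contribution dominates the total.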
For the values linked by the regression line, Matthew is closer to Luke than to Mark (the other synoptic gospel) or to John, but when the correlation coefficient is also considered, a different situation emerges: Matthew is closer to John than to Mark or Luke, although these differences are small and might be due to statistical noise, because of the small number of samples (28) used to establish the data of Table 2 (see Section 5).
Let us apply the theory to all languages.
Figure 4, Figure 5 and Figure 6 show how the signal-to-noise ratio changes with language when Matthew is compared to Mark, Luke, John, Acts and Apocalypse (cross-channels; color key: blue = Mt vs. Mk; red = Mt vs. Lk; magenta = Mt vs. Jh; green = Mt vs. Ac; black = Mt vs. Ap).
We can make the following general remarks:
The signal-to-noise ratio of both self- and cross-channels depends very much on language, with some translations giving significantly larger (more common) or smaller values than the Greek (language no. 1).
Only a few translations are very similar to Greek, as their signal-to-noise ratio in the cross-channels falls on the magenta line, especially in Mt vs. Mk and Mt vs. Lk. For Mt vs. Mk they are: Romanian (language number 7), English (10), Armenian (27) and Somali (35). For Mt vs. Lk: Spanish (8), Icelandic (13), Ukrainian (23), Estonian (24), Cebuano (31), Chichewa (33) and Somali (35).
The range of the signal-to-noise ratio can be quite different from language to language, and can be biased, i.e., displaced mostly upwards (Mk, Lk, Ac, Ap) or mostly downwards (Jh), compared to Greek.
Figure 6 (right panel) also shows the relative range (%) found in a language, i.e., the range of the signal-to-noise ratio divided by the range found in the Greek texts. The relative range can be largely compressed (below the magenta line) or expanded (above the magenta line), therefore biasing readers’ appreciation of the texts’ style. Very few languages maintain the range of the Greek texts, namely Latin (2), Swedish (15), Albanian (26) and Tagalog (32). In other words, texts which in Greek are mathematically quite different can be very similar in another language, or vice versa.
However, at this point some important observations must be highlighted. The slope and the correlation coefficient of a regression line are stochastic variables, therefore characterized by average values (e.g., those reported in Table 2 for Greek, calculated by standard algorithms) and standard deviations. The extended theory would yield improved estimates, of course, if the standard deviation were a very small percentage of the average value. However, with a sample size of at most 28 (as in Matthew, and even fewer samples in the other New Testament texts), the standard deviations of the slope and of the correlation coefficient can produce too large variations in the signal-to-noise ratio predicted by the theory and reported in Table 3 and in Figure 4, Figure 5 and Figure 6.
Because the largest values of the signal-to-noise ratio fall in the steepest region of Figure 1, small statistical fluctuations in the slope or in the correlation coefficient, or in both, are amplified in the signal-to-noise ratio. Only when the input parameters are more diverse, as with Acts and Apocalypse (Table 2), is the larger mathematical distinction maintained (see the values in Table 3), because in this case the signal-to-noise ratio falls in the flat region of Figure 1 and the total noise effectively masks the sensitivity to the slope or to the correlation coefficient.
To avoid this inaccuracy, which is due to the small sample size from which the regression lines are calculated (see Appendix B) and not to the theory of Section 2, we adopt a kind of “renormalization” based on Monte Carlo simulations, whose results we consider as “experimental”, defined and discussed in the next section.
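The effect of the small sample size can be illustrated with a bootstrap-style Monte Carlo sketch (the chapter counts below are hypothetical, and the actual simulation of Section 5 may differ in detail): resampling the 28 chapters with replacement shows how much the slope fluctuates at this sample size.

```python
import random

random.seed(0)

def slope_through_origin(x, y):
    """Slope of the regression line through the origin, y = m * x."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# 28 hypothetical (words, sentences) chapter pairs with scatter around
# a true slope of 0.05 sentences per word.
words = [300 + 20 * k for k in range(28)]
sentences = [0.05 * w + random.gauss(0, 2) for w in words]

# Resample the 28 chapters with replacement and recompute the slope.
slopes = []
for _ in range(2000):
    idx = [random.randrange(28) for _ in range(28)]
    m = slope_through_origin([words[i] for i in idx],
                             [sentences[i] for i in idx])
    slopes.append(m)

mean_m = sum(slopes) / len(slopes)
spread = (sum((m - mean_m) ** 2 for m in slopes) / len(slopes)) ** 0.5
```

The non-zero `spread` quantifies the statistical fluctuation of the slope that, in the steep region of Figure 1, gets amplified in the predicted signal-to-noise ratio.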
6. Self- and Cross-Channel Signal-to-Noise Ratios in Reduced Texts
Let us study how the signal-to-noise ratio changes in self- and cross-channels when the output text is reduced. This analysis can be useful for indicating whether two texts are mathematically indistinguishable.
For this analysis, we perform a Monte Carlo simulation like that described in Section 5, but now the number of chapters is varied from the maximum (28 for Matthew) down to a minimum of 3. Then, for each text reduction, we calculate the experimental values of the total and partial signal-to-noise ratios, as outlined in Section 5.
Figure 13 (left panel) shows the average values of the total and partial signal-to-noise ratios as a function of the fraction of text considered, the latter given by the average total number of words found in the simulation with the reduced text divided by the average total number of words found in the simulation with the full text (28 chapters). The normalization to 100% takes care of the small differences in totals mentioned in Section 5.
We notice that in the self-channel the total signal-to-noise ratio is practically determined by the smaller of the two partial values, as already observed for the full text (Table 3). The cross-channels follow a similar trend but with a very important difference, highlighted in Figure 13 (right panel), which shows the difference between the signal-to-noise ratio at 100% and that at the indicated fraction. The most striking finding is that in the self-channel, at half text, the total and partial signal-to-noise ratios are all reduced by 3 dB. In other words, in a large range of the text fraction, the signal-to-noise ratios (in linear units) are proportional to the fraction itself.
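This proportionality can be written explicitly (a sketch assuming, as the 3 dB loss at half text suggests, that the linear signal-to-noise ratio of the self-channel scales with the text fraction):

```python
import math

def reduced_snr_db(snr_full_db, fraction):
    """Self-channel S/N at a text fraction F, assuming the linear S/N
    is proportional to F, i.e., S/N(F) = S/N(1) + 10*log10(F) in dB."""
    return snr_full_db + 10 * math.log10(fraction)

loss_at_half = 20.0 - reduced_snr_db(20.0, 0.5)   # about 3 dB
```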
On the contrary, the reduction is much smaller in the cross-channels, whose results are also shown in Figure 13 (right panel). Mathematically, this is because the self-channel lies in the steepest range of Figure 1, where the signal-to-noise ratio drops rapidly, while the cross-channels lie in the flat range. This is confirmed by the results shown in Figure 14, which clearly show that the slope is practically constant, regardless of the text fraction, while the correlation coefficient varies significantly.
This characteristic can be considered another check to assess whether a text can be confused with another text, and is therefore indistinguishable from it; it is more stringent than the mere similarity of signal-to-noise ratios, such as that between Matthew and Luke. In other words, if in a self-channel and in its cross-channels the signal-to-noise ratios are proportional to the text fraction, then we can be reasonably confident that the two texts are more than similar: one can be confused with the other, not excluding the further hypothesis that the author is the same.
The same characteristics can be found in any text. For example, Figure 15 shows the results found in the self-channels of the Greek Jewish War (JW) by Flavius Josephus, the English David Copperfield (DC) by Charles Dickens, and the Italian I Promessi Sposi (PS) by Alessandro Manzoni. For each text, regardless of epoch and language, the total signal-to-noise ratio is reduced in proportion to the text fraction.
7. Channel Probability of Error and Likeness Index
In this section we explore a way of comparing the signal-to-noise ratios of self- and cross-channels objectively and automatically, and possibly also of getting more insight into the mathematical likeness of texts.
In the sentences channel explicitly studied in the present paper (our development can be applied to any other linguistic channel), can we “measure”, in the Monte Carlo simulations, how close Matthew is to itself (self-channel), or to other texts (cross-channels), with an index based on probability? In other words, how confident can we be that a text can be mistaken, mathematically, for another text, e.g., Matthew for Luke or John, by studying self- and cross-channels? Because the Monte Carlo simulations yield probability densities such as those shown in Figure 7 (right panel) for Greek, we must deal with continuous functions. In other words, can the Mt self-channel, described statistically by its probability density shown in Figure 7, be confused with one of the cross-channels, also described by a probability density in Figure 7, implying, for example, that Matthew and Luke are very similar, while Matthew and Acts are not? The probability problem is binary because a decision must be taken between two alternatives.
The problem is classical in binary digital communication channels affected by noise, as recalled in
Appendix C. In this field, “error” means that bit 1 is mistaken for bit 0 or vice versa, therefore the channel performance worsens as the error frequency (i.e., the probability of error) increases.
Now, in the sentences self- and cross-channels, to be specific, “error” means that a text can be more or less mistaken, or confused, for another text; consequently, two texts are more similar as the probability of error increases.
According to Equation (A8), the average minimum probability of error in a binary channel with equiprobable “events” (as we assume, of course, for self- and cross-channels) is given by Equation (14), in which the two parameters are the signal-to-noise ratios in the indicated channels.
The decision threshold, as shown in Appendix C, is given by the intersection of the known probability density functions of the cross-channel and of the self-channel, i.e., of the experimental probability densities shown in Figure 7. The integration limits are fixed as shown in Equation (14) because of the relative position of the two densities.
Let us study the range of the probability of error. It is zero if there is no intersection between the two densities, i.e., their average values are far apart, or the two densities have collapsed to Dirac delta functions. It reaches its maximum value if the two densities are identical, e.g., when a self-channel is compared with itself. In conclusion, the probability of error ranges between these two extremes: at the lower extreme, cross- and self-channels can be considered totally uncorrelated; at the upper extreme, self- and cross-channels coincide, and the two texts are mathematically identical. Instead of reporting results on the probability of error, we define and show results of the following normalized “likeness index”: in Equation (15), the likeness index ranges from 0 to 1, where 0 means totally uncorrelated texts and 1 means totally correlated texts.
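A hedged sketch of the decision problem under Gaussian models (equal priors and, for simplicity, equal standard deviations, so the decision threshold, i.e., the intersection of the two densities, is the midpoint of the means; the normalization of the likeness index is taken here as the linear map from the range of the error probability onto [0, 1]):

```python
import math

def error_probability(mu_cross, mu_self, sigma):
    """Minimum average error probability for two equiprobable Gaussian
    densities with equal sigma; the decision threshold is the midpoint
    of the means, giving P = 0.5 * erfc(|mu1 - mu0| / (2*sqrt(2)*sigma))."""
    d = abs(mu_self - mu_cross)
    return 0.5 * math.erfc(d / (2 * math.sqrt(2) * sigma))

def likeness_index(p_error):
    """Normalized likeness index: 0 (totally uncorrelated texts)
    to 1 (mathematically identical texts)."""
    return 2.0 * p_error

p_same = error_probability(10.0, 10.0, 1.5)   # identical densities -> 0.5
p_far = error_probability(2.0, 20.0, 1.5)     # well-separated -> ~0
```

With identical densities the channel cannot do better than guessing (error probability one half), so the likeness index reaches 1; with well-separated densities errors become vanishingly rare and the index approaches 0.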
Let us apply Equation (15) to the probability densities of the self- and cross-channels (“bit 1” and “bit 0”). According to Figure 7 (right panel), both can be well modelled, over a large range, by Gaussian probability density functions (not shown for brevity), with average values and standard deviations given in Table 4 for the Greek Matthew self- and cross-channels.
In the left panel of Figure 16 we show the likeness index in the indicated cross-channels, compared to the Matthew self-channel. It is evident that, when the text of Matthew is referred to Luke (Mt vs. Lk, red line), the likeness index is the closest to that of the Matthew self-channel (Mt vs. Mt): in other words, Matthew can be confused with Luke (and vice versa) more than with Mark, John, Acts or Apocalypse.
The results shown in the right panel of Figure 16 highlight the asymmetry of linguistic channels [1]. Here we show cross- and self-channels of both Matthew and Luke, each referred to its self-channel. For example, the likeness index of the cross-channel in which Matthew is “read” as Luke and compared to Luke (Mt vs. Lk/Lk, solid green line) is smaller than that of the cross-channel in which Luke is “read” as Matthew and compared to Matthew (Lk vs. Mt/Mt).
From Figure 16, we can draw the following conclusions on a possible use of the likeness index:
- (a) In the self-channel Mt vs. Mt, the likeness index remains large for any text fraction; therefore, even 30% of Matthew compared with its full text retains a large likeness.
- (b) In the cross-channel Mt vs. Lk, the likeness index is large for full texts (100%), therefore indicating a large likeness when the full Mt is compared to the full Lk.
- (c) In the reverse channel Lk vs. Mt/Mt (right panel), the likeness index is larger, therefore indicating a larger likeness when Luke is compared to Matthew.
- (d) In the cross-channel Lk vs. Mt/Lk (right panel), the likeness index is markedly larger than in the cross-channel Mt vs. Lk/Mt. This finding may support the conjecture, shared by many scholars (see [39]), that Matthew was written before Luke, and that Luke might have known Matthew when he wrote his text.
- (e) In the cross-channels Mt vs. Mk and Mt vs. Jh, the likeness index is smaller.
- (f) In the cross-channels Mt vs. Ac and Mt vs. Ap, the likeness index is the smallest.
In conclusion, the likeness index seems reliable because it confirms known relationships among the Greek New Testament texts (e.g., [39]). In particular, it confirms that Matthew and Luke are the most similar texts.
Similar results are found in English, shown in Figure 17, where the likeness index refers to the Mt self-channel. Compared to Greek (Figure 16, left panel), distortions are clearly evident, because the likeness index is considerably smaller, and yields different rankings, than what is found in Greek.
Finally, notice the universal result that in self-channels the likeness index is practically given by the same function of the text fraction, with one behaviour below a given fraction and another above it, features which evidently characterize self-channels, as also shown in Section 8.
In conclusion, the likeness index can be considered another useful index for automatically comparing texts in a multidimensional space of indices.
9. Concluding Remarks
We have extended the general theory of translation [
1] to texts written in the same language. To be specific, we have applied the extended theory to New Testament translations already studied in [
1], and have assessed how much the mutual linguistic mathematical relationships present in the original Greek texts have been saved or lost in 36 languages. In general, we have found that in many languages/translations the original relationships have been lost and consequently texts have been mathematically distorted.
After defining the mathematical problem in general terms, we have assessed the sensitivity of the signal-to-noise ratio of a linguistic channel to its input parameters. The theory is based on the properties of linear regression lines, therefore on the slope and the correlation coefficient. The slope is the source of the “regression noise”, because it differs from unity; the correlation coefficient is the source of the “correlation noise”, because it is smaller than unity, as discussed in [1] for translation channels; now, however, the theory also applies to texts written in the same language.
Because it is cumbersome to consider all New Testament texts to assess whether, within a language, the mathematical mutual relationships of the original Greek texts are saved or lost, we have studied only the gospel according to Matthew as reference text in any language. However, the results reported are sufficient for giving a reliable answer to the question.
For the purpose of being specific in deriving the full characteristics of the extended theory, we have shown how Matthew is mathematically related, within the same language, to the gospels according to Mark (Mk), Luke (Lk) and John (Jh), and to Acts (Ac) and Apocalypse (Revelation) (Ap). The channels so defined are termed “cross-channels”. The channel in which Matthew is compared with itself is the “self-channel”.
Of the many linguistic channels linking two texts [1], we have considered only the channel that links their sentences, referred to as the “sentences channel”. We have investigated how the number of sentences in one text is “translated” into the number of sentences in another text for the same number of words. This comparison can be done, of course, by considering average values and regression lines, but the theory has also allowed us to consider the correlation coefficient, and has provided insight because it models linguistic channels according to parameters of communication theory, such as the signal-to-noise ratio.
To avoid the inaccuracy due to the small sample size from which the regression lines are calculated, we have adopted a kind of “renormalization” based on Monte Carlo simulations, whose results we consider as “experimental”. We have compared theoretical and experimental signal-to-noise ratios and have found that, for several languages, the cross-channel maxima fall in a limited (ordinate) range of dB values. Beyond these values there is saturation, i.e., a horizontal asymptote. Before saturation, the experimental values follow the theoretical ones (approximately a 45° line). In other words, below the saturation range, theory and simulation agree, indicating that the values of slope and correlation coefficient which determine the signal-to-noise ratio are sufficiently accurate to be used conservatively as input to the theory, without performing a Monte Carlo simulation.
We have also studied how the signal-to-noise ratio changes when the output text is reduced, according to the text fraction considered. This analysis can be useful for indicating whether two texts are mathematically indistinguishable. We have found that the signal-to-noise ratio in the self-channel (in linear units) is proportional to the text fraction, while in the cross-channels the reduction is much smaller. Operationally, this can be another check to assess whether a text can be confused with another text.
We have found the same characteristics in the self-channels of the Greek Jewish War (JW) by Flavius Josephus, the English David Copperfield (DC) by Charles Dickens, and the Italian I Promessi Sposi (PS) by Alessandro Manzoni. For each text, regardless of epoch and language, the total signal-to-noise ratio is reduced in proportion to the text fraction.
We have also explored a way of comparing the signal-to-noise ratios of self- and cross-channels objectively and automatically, by applying concepts of binary communication channels affected by noise, and possibly also a way of getting more insight into the mathematical likeness of texts. To this end, we have defined a “likeness index” and have shown how it can reveal similarities or differences between texts.
Finally, notice that, because the theory deals with linear regression lines, it can be applied any time a scientific/technical problem involves two or more linear regression lines; therefore, it is not limited to linguistic variables but is universal.