Article

Long-Range Dependence in Word Time Series: The Cosine Correlation of Embeddings

by Paweł Wieczyński 1,† and Łukasz Dębowski 2,*,†
1 Independent Researcher, 80-180 Gdańsk, Poland
2 Institute of Computer Science, Polish Academy of Sciences, 01-248 Warsaw, Poland
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Entropy 2025, 27(6), 613; https://doi.org/10.3390/e27060613
Submission received: 14 May 2025 / Revised: 7 June 2025 / Accepted: 8 June 2025 / Published: 9 June 2025
(This article belongs to the Special Issue Complexity Characteristics of Natural Language)

Abstract

We analyze long-range dependence (LRD) for word time series, understood as a slower than exponential decay of the two-point Shannon mutual information. We achieve this by examining the decay of the cosine correlation, a proxy object defined in terms of the cosine similarity between the word2vec embeddings of two words and constructed in analogy to the Pearson correlation. By the Pinsker inequality, half the squared cosine correlation between two random vectors lower bounds the mutual information between them. Using the Standardized Project Gutenberg Corpus, we find that the cosine correlation between word2vec embeddings exhibits a readily visible stretched exponential decay for lags roughly up to 1000 words, thus corroborating the presence of LRD. By contrast, for the Human vs. LLM Text Corpus, which comprises texts generated by large language models, there is no systematic signal of LRD. Our findings may support the need for novel memory-rich architectures in large language models that exceed not only hidden Markov models but also Transformers.

1. Introduction

Consider a time series $(W_i)_{i \in \mathbb{Z}}$ such as a text in natural language, a sequence of real numbers, or a sequence of vectors. Let $I(W_i; W_{i+n})$ be the Shannon mutual information between two random variables separated by $n$ positions. By short-range dependence (SRD), we understand an asymptotic exponential bound for the decay of this dependence measure,
$$I(W_i; W_{i+n}) = O(\exp(-\delta n)), \quad \delta > 0. \qquad (1)$$
By long-range dependence (LRD), we understand any sort of decay of the dependence measure that does not fall under (1). In particular, under LRD, we may have a power-law decay of the dependence measure,
$$I(W_i; W_{i+n}) \propto n^{-\gamma}, \quad \gamma > 0, \qquad (2)$$
which resembles a more standard definition of LRD for the autocorrelation function by Beran [1], or we may have a stretched exponential decay thereof,
$$I(W_i; W_{i+n}) \propto \exp(-\delta n^{\beta}), \quad \delta > 0, \ 0 < \beta < 1. \qquad (3)$$
The SRD is characteristic of mixing Markov and hidden Markov processes ([2], Theorem 1), which assume that the probability of the next token depends only on a finite number of preceding tokens or on a bounded memory. Hence, the observation of LRD for sufficiently large lags implies that the time series generation cannot be modeled by a mixing Markov process of a relatively small order or—via the data-processing inequality ([3], Chapter 2.8)—by a mixing hidden Markov process with a small number of hidden states.
By contrast, it has often been claimed that texts in natural language exhibit LRD [2,4,5,6,7,8,9,10,11]. Several empirical studies analyzing textual data at different linguistic levels, such as characters [2,4], words [9], or punctuation [11], have indicated that correlations in natural language persist over long distances. This persistent correlation suggests that dependencies in human language extend far beyond adjacent words or short phrases, spanning entire paragraphs or even longer discourse structures.
The LRD should be put on par with other statistical effects signaling that natural language is not a finite-state hidden Markov process, a theoretical linguistic claim that dates back to [12,13,14]. Let us write blocks of words $W_j^k := (W_j, W_{j+1}, \ldots, W_k)$. A power-law growth of the block mutual information
$$I(W_1^n; W_{n+1}^{2n}) \propto n^{\beta}, \quad 0 < \beta < 1, \qquad (4)$$
is known as Hilberg's law or as the neural scaling law [15,16,17]. Another observation [18] is a power-law growth in $\log n$ of the maximal repetition length,
$$L(W_1^n) \propto (\log n)^{\alpha}, \quad \alpha > 1, \qquad (5)$$
where we denote the maximal repetition length
$$L(W_1^n) := \max\left\{ k \ge 1 : W_{i+1}^{i+k} = W_{j+1}^{j+k} \text{ for some } 0 \le j < i \le n - k \right\}. \qquad (6)$$
The long-range dependence (2) or (3), Hilberg's law (4), and the maximal repetition law (5) have all been reported for natural language, whereas it can be mathematically proved that none of them is satisfied by finite-state hidden Markov processes [19,20].
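To make definition (6) concrete, the following minimal Python sketch (ours, not code from the paper) computes the maximal repetition length of a short token sequence by brute force; it only illustrates the quantity and is not meant to be efficient.

```python
# Illustrative brute-force computation of the maximal repetition length L(W_1^n)
# from Equation (6): the largest k such that some block of k consecutive tokens
# occurs at least twice in the sequence.
def maximal_repetition(tokens):
    n = len(tokens)
    best = 0
    for k in range(1, n):
        seen = set()
        repeated = False
        for i in range(n - k + 1):
            block = tuple(tokens[i:i + k])
            if block in seen:
                repeated = True
                break
            seen.add(block)
        if repeated:
            best = k
        else:
            break  # if no block of length k repeats, no longer block repeats either
    return best

print(maximal_repetition("to be or not to be".split()))  # prints 2 (the block "to be")
```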
The LRD, Hilberg's law, and the maximal repetition law independently, and for different reasons, support the necessity of using complex memory architectures in contemporary large language models (LLMs). Neural networks designed for natural language processing must incorporate mechanisms capable of mimicking these laws. The older generation of n-gram models struggles with this requirement for reasons that can be analyzed mathematically. By contrast, it has been unclear whether Transformers [21], with their attention-based mechanisms, can leverage these extensive relationships. Understanding the nature of the LRD, Hilberg's law, and the maximal repetition law in textual data may shed some light on neural architectures that can make progress on language modeling tasks.
Various smoothing techniques were proposed to discern LRD at the character or phoneme level [2,4,6,7]. Without advanced estimation, the power-law decay of the Shannon mutual information between two characters dissolves into noise beyond lags of about 10 characters [4]. By contrast, Lin and Tegmark [2] considered sophisticated estimation techniques and reported the power-law decay of the Shannon mutual information between two characters for much larger lags.
Because of the arbitrariness of word forms relative to the semantic content of the text, we are not convinced that the results by Lin and Tegmark [2] are not an artifact of their estimation method. For this reason, following the idea of Mikhaylovskiy and Churilov [9], we have decided to seek the LRD at the level of words. We have supposed that pairs of words rather than pairs of characters better capture the long-range semantic coherence of the text. For this reason, we have expected that the LRD effect extends over a larger distance at the level of words than at the level of characters. Indeed, in the present study, we report a lower bound on the Shannon mutual information between two words that is salient for lags up to 1000 words, which is four orders of magnitude larger than the unsmoothed effect for characters.
A modest goal of this paper is to systematically explore a simple measure of dependence to check whether texts in natural language and those generated by large language models exhibit the LRD. Rather than directly investigating the Shannon mutual information, which is difficult to estimate for large alphabets and strongly dependent sources, we adopt a measure of dependence called the cosine correlation. This object is related to the cosine similarity of two vectors and somewhat resembles the Pearson correlation. Formally, the cosine correlation between two random vectors $U$ and $V$ equals
$$\operatorname{CC}(U;V) := \mathbf{E}\left[ \frac{U}{\|U\|} \cdot \frac{V}{\|V\|} \right] - \mathbf{E}\left[ \frac{U}{\|U\|} \right] \cdot \mathbf{E}\left[ \frac{V}{\|V\|} \right], \qquad (7)$$
where $\mathbf{E} X$ is the expectation of random variable $X$, $U \cdot V$ is the dot product, and $\|U\| := \sqrt{U \cdot U}$ is the norm. By contrast, the cosine similarity of two non-random vectors $u$ and $v$ is
$$\cos(u; v) := \frac{u}{\|u\|} \cdot \frac{v}{\|v\|}. \qquad (8)$$
In order to compute the cosine correlation or the cosine similarity for actual word time series, we need a certain vector representation of words. As a practical vector representation of words, one may consider word2vec embeddings used in large language models [22,23]. Word embeddings capture semantic relationships between words by mapping them into continuous spaces, allowing for a more meaningful measure of similarity between distant words in a text. In particular, Mikhaylovskiy and Churilov [9] observed an approximate power-law decay for the expected cosine similarity $\mathbf{E}\cos(U;V)$ of word embeddings.
The paper by Mikhaylovskiy and Churilov [9] lacked, however, the following important theoretical insight. As a novel result of this paper, we demonstrate that the cosine correlation $\operatorname{CC}(U;V)$, rather than the expected cosine similarity $\mathbf{E}\cos(U;V)$, provides a lower bound for the Shannon mutual information $I(U;V)$. Applying the Pinsker inequality [24,25], we obtain the bound
$$I(U;V) \ge \frac{\operatorname{CC}(U;V)^2}{2}. \qquad (9)$$
This approach provides an efficient alternative to direct statistical estimation of mutual information, which is often impractical due to the sparse nature of natural language data. In particular, a slower than exponential decay of the cosine correlation implies LRD. Thus, a time series with a power-law or stretched exponential decay of the cosine correlation is not a mixing Markov process or a mixing hidden Markov process.
Indeed, on the experimental side, we observe a stretched exponential decay of the cosine correlation, which is clearly visible roughly for lags up to 1000 words—but only for natural texts. By contrast, artificial texts do not exhibit this trend in a systematic way. Our source of natural texts is the Standardized Project Gutenberg Corpus [26], a diverse collection of literary texts that offers a representative sample of human language usage. Our source of artificial texts is the Human vs. LLM Text Corpus [27]. To investigate the effect of semantic correlations, we also consider the cosine correlation between moving sums of neighboring embeddings, a technique that we call pooling. Curiously, pooling does not make the stretched exponential decay substantially slower. The lack of a prominent LRD signal was already noticed for the previous generation of language models by Takahashi and Tanaka-Ishii [6,7].
Our observation of the slow decay of the cosine correlation broadly confirms the prior results of Mikhaylovskiy and Churilov [9] and supports the hypothesis of LRD. We note that Mikhaylovskiy and Churilov [9] did not try to fit a stretched exponential decay to their data and that their power-law fit was not visually convincing. Both the theoretical and the experimental findings of this paper contribute to the growing body of statistical evidence that natural language is not a finite-state hidden Markov process.
What is more novel is that our findings may support the view that natural language cannot be adequately generated by Transformer-based large language models either, given the absence of a systematic decay trend of the cosine correlation for the Human vs. LLM Text Corpus. As mentioned, the LRD, Hilberg's law, and the maximal repetition law independently substantiate the necessity of sophisticated memory architectures in modern computational linguistic applications. These results open avenues for further research into the theoretical underpinnings of language structure, potentially informing the development of more effective models for language understanding and generation.
The organization of the article is as follows. Section 2 presents the theoretical results. Section 3 discusses the experiment. In particular, Section 3.1 presents our data. Section 3.2 describes the experimental methods. Section 3.3 presents the results. Section 3.4 offers the discussion. Section 4 contains the conclusion.

2. Theory

Similarly to Mikhaylovskiy and Churilov [9] but differently from Li [4], Lin and Tegmark [2], and Takahashi and Tanaka-Ishii [6,7], we will seek LRD at the level of words rather than at the level of characters or phonemes. The Shannon mutual information between words is difficult to estimate for large alphabets and strongly dependent sources. Thus, we consider its lower bound defined via the cosine correlation of word2vec embeddings [22,23].
Let $\mathbf{E} X := \int X \, dP$ denote the expectation of a real random variable $X$. Let $\ln x$ be the natural logarithm of $x$ and let $H(X) := \mathbf{E}\left[-\ln p(X)\right]$ be the Shannon entropy of a discrete random variable $X$, where $p(X)$ is the probability density of $X$ with respect to a reference measure ([3], Chapters 2.1 and 8.1). The Shannon mutual information between variables $X$ and $Y$ ([3], Chapters 2.4 and 8.5) equals
$$I(X;Y) := H(X) + H(Y) - H(X,Y). \qquad (10)$$
By contrast, the Pearson correlation between real random variables $X$ and $Y$ is defined as
$$\operatorname{Corr}(X;Y) := \frac{\operatorname{Cov}(X;Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}, \qquad (11)$$
where we denote the covariance $\operatorname{Cov}(X;Y) := \mathbf{E}[XY] - \mathbf{E}X\,\mathbf{E}Y$ and the variance $\operatorname{Var}(X) := \operatorname{Cov}(X;X)$. By the Cauchy–Schwarz inequality, we have $|\operatorname{Corr}(X;Y)| \le 1$.
We will introduce an analog of the Pearson correlation coefficient for vectors, which we call the cosine correlation. First, let us recall three standard concepts. For vectors $u = (u_1, u_2, \ldots, u_d)$ and $v = (v_1, v_2, \ldots, v_d)$, we consider the dot product
$$u \cdot v := \sum_{k=1}^{d} u_k v_k, \qquad (12)$$
the norm $\|u\| := \sqrt{u \cdot u}$, and the cosine similarity
$$\cos(u;v) := \frac{u}{\|u\|} \cdot \frac{v}{\|v\|}. \qquad (13)$$
By the Cauchy–Schwarz inequality, we have $|\cos(u;v)| \le 1$.
Now, we consider something less standard. For vector random variables $U$ and $V$, we define the cosine correlation
$$\operatorname{CC}(U;V) := \mathbf{E}\left[\frac{U}{\|U\|} \cdot \frac{V}{\|V\|}\right] - \mathbf{E}\left[\frac{U}{\|U\|}\right] \cdot \mathbf{E}\left[\frac{V}{\|V\|}\right] = \mathbf{E}\left[\left(\frac{U}{\|U\|} - \mathbf{E}\frac{U}{\|U\|}\right) \cdot \left(\frac{V}{\|V\|} - \mathbf{E}\frac{V}{\|V\|}\right)\right]. \qquad (14)$$
If $U$ and $V$ are discrete and we denote the difference of measures
$$\Delta(u,v) := P(U = u, V = v) - P(U = u) P(V = v), \qquad (15)$$
then we may write
$$\operatorname{CC}(U;V) = \sum_{u,v} \Delta(u,v) \cos(u;v). \qquad (16)$$
We observe that if random variables $U$ and $V$ are unidimensional, then $\cos(U;V) = 1$ holds with probability 1 and $\operatorname{CC}(U;V) = 0$. Similarly, $\operatorname{CC}(U;V) = 0$ if $\cos(U;V)$ is constant with probability 1 or if $U$ and $V$ are independent.
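As a quick numerical illustration (our sketch, not an excerpt from the paper), the following Python snippet evaluates the cosine correlation of two discrete vector-valued variables directly from identity (16). For a pair of orthogonal unit vectors that are perfectly coupled it returns 0.5, and for an independent pair it returns 0.

```python
# Cosine correlation (14) computed via identity (16) for discrete vector variables:
#   CC(U;V) = sum_{u,v} [P(U=u,V=v) - P(U=u)P(V=v)] * cos(u;v).
import numpy as np

def cosine_correlation(joint, support_u, support_v):
    # joint[i, j] = P(U = support_u[i], V = support_v[j])
    pu = joint.sum(axis=1)
    pv = joint.sum(axis=0)
    cc = 0.0
    for i, u in enumerate(support_u):
        for j, v in enumerate(support_v):
            cos_uv = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
            cc += (joint[i, j] - pu[i] * pv[j]) * cos_uv
    return cc

support = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
coupled = np.array([[0.5, 0.0], [0.0, 0.5]])        # U = V with probability 1
independent = np.outer([0.5, 0.5], [0.5, 0.5])      # product measure, Delta = 0
print(cosine_correlation(coupled, support, support))      # 0.5
print(cosine_correlation(independent, support, support))  # 0.0
```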
To build more intuition, let us note the following three facts. First of all, the cosine correlation between two copies of a random vector lies in the unit interval.
Theorem 1.
We have
$$0 \le \operatorname{CC}(U;U) \le 1. \qquad (17)$$
Proof. 
Let us write $U' := U / \|U\|$. We have
$$\operatorname{CC}(U;U) = \mathbf{E}[U' \cdot U'] - \mathbf{E}U' \cdot \mathbf{E}U' = 1 - \sum_{k=1}^{d} (\mathbf{E} U'_k)^2 \ge 1 - \sum_{k=1}^{d} \mathbf{E}(U'_k)^2 = 1 - \mathbf{E}\sum_{k=1}^{d} (U'_k)^2 = 1 - \mathbf{E}[U' \cdot U'] = 0. \qquad (18)$$
Hence, the claim follows.    □
Second, the cosine correlation satisfies a version of the Cauchy–Schwarz inequality.
Theorem 2.
We have
$$|\operatorname{CC}(U;V)| \le \sqrt{\operatorname{CC}(U;U)\operatorname{CC}(V;V)} \le 1. \qquad (19)$$
Proof. 
Let us write $U' := U / \|U\|$ and $V' := V / \|V\|$. By the Cauchy–Schwarz inequalities $|\operatorname{Cov}(X;Y)| \le \sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}$ for random scalars $X$ and $Y$ and $\sum_{k=1}^{d} \sqrt{x_k y_k} \le \sqrt{\sum_{k=1}^{d} x_k} \sqrt{\sum_{k=1}^{d} y_k}$ for real numbers $x_k, y_k \ge 0$, we obtain
$$|\operatorname{CC}(U;V)| \le \sum_{k=1}^{d} |\operatorname{Cov}(U'_k; V'_k)| \le \sum_{k=1}^{d} \sqrt{\operatorname{Var}(U'_k)\operatorname{Var}(V'_k)} \le \sqrt{\sum_{k=1}^{d} \operatorname{Var}(U'_k)} \sqrt{\sum_{k=1}^{d} \operatorname{Var}(V'_k)} = \sqrt{\operatorname{CC}(U;U)\operatorname{CC}(V;V)}. \qquad (20)$$
Hence, the claim follows by (17).    □
Third, we will show that cosine correlation $\operatorname{CC}(U;V)$ provides a lower bound for mutual information $I(U;V)$.
Theorem 3.
We have
$$I(U;V) \ge \frac{\operatorname{CC}(U;V)^2}{2}. \qquad (21)$$
Proof. 
Let us recall the Pinsker inequality
$$\sum_x p(x) \ln \frac{p(x)}{q(x)} \ge \frac{1}{2} \left( \sum_x |p(x) - q(x)| \right)^2 \qquad (22)$$
for two discrete probability distributions $p$ and $q$ [24,25]. By the Pinsker inequality (22), the Cauchy–Schwarz inequality $|\cos(u;v)| \le 1$, and identity (16), we obtain
$$I(U;V) \ge \frac{1}{2} \left( \sum_{u,v} |\Delta(u,v)| \right)^2 \ge \frac{1}{2} \left( \sum_{u,v} |\Delta(u,v)|\,|\cos(u;v)| \right)^2 \ge \frac{1}{2} \left( \sum_{u,v} \Delta(u,v) \cos(u;v) \right)^2 = \frac{\operatorname{CC}(U;V)^2}{2}. \qquad (23)$$
Hence, the claim follows.    □
We note in passing that the Pinsker inequality can be replaced by the Bretagnolle–Huber bound
$$\sum_x p(x) \ln \frac{p(x)}{q(x)} \ge \ln \frac{1}{1 - \frac{1}{4} \left( \sum_x |p(x) - q(x)| \right)^2} \qquad (24)$$
for probability distributions $p$ and $q$ [28,29]. Respectively, we obtain
$$I(U;V) \ge \ln \frac{1}{1 - \frac{\operatorname{CC}(U;V)^2}{4}}. \qquad (25)$$
This bound is weaker than (21) since $|\operatorname{CC}(U;V)| \le 1$.
Let $(W_i)_{i \in \mathbb{Z}}$ be the text in natural language treated as a word time series. Let $\phi(w) = (\phi_1(w), \phi_2(w), \ldots, \phi_d(w))$ be an arbitrary vector representation of word $w$, such as word2vec embeddings [22,23], and let $F_i := (F_{i1}, F_{i2}, \ldots, F_{id}) := \phi(W_i)$. In particular, since embeddings $F_i = \phi(W_i)$ are functions of words $W_i$, by the data-processing inequality ([3], Chapter 2.8) and by the cosine correlation bound (21), we obtain
$$I(W_i; W_j) \ge I(F_i; F_j) \ge \frac{\operatorname{CC}(F_i; F_j)^2}{2}. \qquad (26)$$
Wrapping up, a slow decay of cosine correlation $\operatorname{CC}(F_i; F_{i+n})$ implies a slow decay of mutual information $I(W_i; W_{i+n})$. Since $I(W_i; W_{i+n})$ is damped exponentially for any mixing Markov or hidden Markov process $(W_i)_{i \in \mathbb{Z}}$ by Theorem 1 of Lin and Tegmark [2], observing a power-law or a stretched exponential decay of cosine correlation $\operatorname{CC}(F_i; F_{i+n})$ is enough to demonstrate that process $(W_i)_{i \in \mathbb{Z}}$ is not a mixing Markov or hidden Markov process.
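The following toy computation (our sketch, not part of the paper) illustrates the chain of inequalities (26) on a two-word vocabulary with hypothetical two-dimensional embeddings: it evaluates the Shannon mutual information in nats, the cosine correlation via identity (16), and checks that the bound holds.

```python
# Toy check of the bound I(W_i; W_j) >= CC(F_i; F_j)^2 / 2 from (26).
import numpy as np

embedding = {"cat": np.array([0.9, 0.1]), "dog": np.array([0.2, 0.8])}  # hypothetical
words = list(embedding)
joint = np.array([[0.40, 0.10],   # P(cat, cat), P(cat, dog)
                  [0.10, 0.40]])  # P(dog, cat), P(dog, dog)
p_i, p_j = joint.sum(axis=1), joint.sum(axis=0)

# Shannon mutual information in nats.
mi = sum(joint[a, b] * np.log(joint[a, b] / (p_i[a] * p_j[b]))
         for a in range(2) for b in range(2) if joint[a, b] > 0)

def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Cosine correlation of the embeddings via identity (16).
cc = sum((joint[a, b] - p_i[a] * p_j[b]) * cos(embedding[words[a]], embedding[words[b]])
         for a in range(2) for b in range(2))

print(mi, cc**2 / 2, mi >= cc**2 / 2)  # approx. 0.193, 0.019, True
```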
The framework that we have constructed in this section has a precedent in the literature. We remark that Mikhaylovskiy and Churilov [9] investigated estimates of the expectation $\mathbf{E}\cos(F_i; F_{i+n})$ rather than the cosine correlation $\operatorname{CC}(F_i; F_{i+n})$. That approach required estimation and subtraction of the asymptotic constant term. Mikhaylovskiy and Churilov [9] observed an approximate power-law decay but they did not state the cosine correlation bound (21) explicitly in their discussion.

3. Experiment

3.1. Data

Our data consisted of three elements: a dictionary of embedding vectors for a subset of human languages, a corpus of texts written by humans in these languages, and a corpus of texts in English created by artificial intelligence. The considered set of human languages included 17 languages. Originally, we planned to use 20 languages with the largest text counts in the considered corpora but three of them, Esperanto, Chinese, and Tagalog, had to be excluded because the embedding dictionary did not cover these languages.
In particular, the source of pretrained word embeddings was chosen as the NLPL repository [23]. To provide a uniform baseline across languages, for all considered languages, we used 100-dimensional embedding vectors trained on the CoNLL17 corpora with the same algorithm, namely the word2vec continuous skipgram algorithm. None of these embedding vector spaces involves lemmatization. The vocabulary sizes of the embedding spaces for the considered 17 languages are presented in Table 1.
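For concreteness, one possible way to load such an embedding space is sketched below; this is our illustration, not necessarily the authors' exact pipeline, and it assumes that an NLPL model has been downloaded and unpacked to the hypothetical local path nlpl_en/model.txt in word2vec text format.

```python
# Loading a 100-dimensional NLPL word2vec model with gensim (illustrative sketch).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("nlpl_en/model.txt", binary=False)
print(vectors.vector_size)        # expected: 100
print(len(vectors.key_to_index))  # vocabulary size, cf. Table 1
print(vectors["language"][:5])    # first coordinates of one embedding, if the word is in the vocabulary
```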
As the source of texts written by humans, we chose the Standardized Project Gutenberg Corpus (SPGC) [26]. The corpus provides texts after some preprocessing and tokenization, as detailed in [26]. We filtered the SPGC to obtain a more manageable yet representative subset of texts. As we have mentioned, we restricted the corpus to the 17 languages with the largest text counts that are simultaneously covered by the applied NLPL embedding dictionary. Moreover, we filtered out files larger than 1000 KB and we sampled up to 100 texts per language in order to achieve roughly balanced subsets across particular languages.
To provide a comparison with texts generated by artificial intelligence, we also considered the Human vs. LLM Text Corpus (HLLMTC) [27]. All texts in the HLLMTC are in English. To make this corpus more computationally tractable, we sampled 1000 human-written texts and 6000 LLM-generated texts, namely 1000 texts for each of the six selected large language models. To convert these texts into word time series, we used an off-the-shelf tokenizer [30].
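Reference [30] is the gensim library; a tokenization along the following lines (our sketch, using gensim.utils.simple_preprocess) turns a raw HLLMTC text into a word time series, while tokens absent from the embedding vocabulary are dropped later, as described in Section 3.2.

```python
# Converting a raw text into a word time series with an off-the-shelf tokenizer.
from gensim.utils import simple_preprocess

text = "Long-range dependence shows up far beyond adjacent words."
tokens = simple_preprocess(text)
print(tokens)
# ['long', 'range', 'dependence', 'shows', 'up', 'far', 'beyond', 'adjacent', 'words']
```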
Table 2 provides the summary statistics of the obtained subsets of the Standardized Project Gutenberg Corpus and the Human vs. LLM Text Corpus. In particular, we report the token counts and the coverage of the sampled texts, i.e., the fraction of word tokens of texts that appear in the respective NLPL embedding dictionary.

3.2. Methods

In this section, we briefly describe what we measured and in what way. We supposed that the LRD on the level of words is due to semantic coherence of the text over longer distances. In particular, mutual information between two words is large as long as the text around these words concerns a similar topic. We supposed that the embedding of this local topic can be roughly estimated as the sum of embeddings of all words in the neighborhood, called a pooled embedding. Let $F_i$ be the embedding of the $i$-th word in the text. The pooled embeddings are defined as
$$F_i^{(k)} := \sum_{j=0}^{k-1} F_{i+j} \qquad (27)$$
for the pooling order $k \ge 1$. In particular, pooled embeddings for $k = 1$ equal word embeddings, $F_i^{(1)} = F_i$.
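The pooled embeddings of Equation (27) can be computed for all positions at once with a cumulative sum over the time axis, as in the following sketch (ours, assuming the embeddings are stored as a NumPy array of shape (N, d)).

```python
# Pooled embeddings F_i^(k): sums of k consecutive word embeddings, Equation (27).
import numpy as np

def pooled_embeddings(F, k):
    # F: array of shape (N, d); returns array of shape (N - k + 1, d)
    csum = np.vstack([np.zeros((1, F.shape[1])), np.cumsum(F, axis=0)])
    return csum[k:] - csum[:-k]   # window sums of length k

F = np.random.default_rng(0).normal(size=(10, 4))
assert np.allclose(pooled_embeddings(F, 1), F)                    # k = 1 recovers F_i
assert np.allclose(pooled_embeddings(F, 3)[0], F[0] + F[1] + F[2])
```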
The object that we wanted to measure was the cosine correlation for pooled embeddings, namely
$$C(n \mid k) := \operatorname{CC}(F_i^{(k)}; F_{i+n}^{(k)}). \qquad (28)$$
Function $C(n \mid k)$ is substantially larger for $0 \le n < k$ since the summations for variables $F_i^{(k)}$ and $F_{i+n}^{(k)}$ range partly over overlapping embeddings $F_i$. Thus, if one wants to estimate the functional form of the decay of $C(n \mid k)$, it makes sense to fit the respective function exclusively to data points where $n \ge k$.
Let us proceed to the estimation of function $C(n \mid k)$. Let $\phi(w)$ be the embedding of word $w$ according to the considered word2vec dictionary. From each text, we removed all word tokens that did not have an embedding in the dictionary. In this way, we obtained a collection of word time series $(W_1, W_2, \ldots, W_N)$, corresponding word embeddings $F_i = \phi(W_i)$, and pooled embeddings $F_i^{(k)}$ given by formula (27). We estimated the expectations as the averages over the time series. That is, we computed the estimator of $C(n \mid k)$ defined as
$$\hat{C}(n \mid k) := \frac{1}{N - n} \sum_{i=1}^{N-n} U_i^{(k)} \cdot U_{i+n}^{(k)}, \qquad (29)$$
where we used the auxiliary time series
$$U_i^{(k)} := \frac{F_i^{(k)}}{\|F_i^{(k)}\|} - \frac{1}{N} \sum_{j=1}^{N} \frac{F_j^{(k)}}{\|F_j^{(k)}\|}. \qquad (30)$$
We observe that $F_{i+1}^{(k)} = F_i^{(k)} - F_i + F_{i+k}$. Therefore, the computational complexity of estimator $\hat{C}(n \mid k)$ for fixed $n$ and $k$ is of order $O(Nd)$, where $N$ is the text length and $d$ is the dimension of embeddings $F_i$.
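A direct implementation of the estimator (29)-(30) is sketched below (our code, not the released implementation): each pooled embedding is normalized, the empirical mean direction is subtracted, and dot products at lag n are averaged.

```python
# Estimator C_hat(n|k) of Equations (29)-(30) for a series G of (pooled) embeddings.
import numpy as np

def cosine_correlation_estimate(G, n):
    U = G / np.linalg.norm(G, axis=1, keepdims=True)  # F_i^(k) / ||F_i^(k)||
    U = U - U.mean(axis=0)                            # auxiliary series (30)
    N = len(U)
    if n >= N:
        raise ValueError("lag n must be smaller than the series length")
    return np.einsum("id,id->i", U[:N - n], U[n:]).mean()  # average in (29)

rng = np.random.default_rng(1)
G = rng.normal(size=(5000, 100))           # i.i.d. vectors: no dependence
print(cosine_correlation_estimate(G, 10))  # close to 0 for independent embeddings
```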
For each text $(W_1, W_2, \ldots, W_N)$, we computed estimators $\hat{C}(n \mid k)$ for lags $n \in A \cap [1, N]$, where
$$A := \left\{ 1, 1.1, 1.1^2, 1.1^3, \ldots \right\}, \qquad (31)$$
and pooling orders $k \in \{1, 3, 3^2, 3^3\}$. We observed that the plot of the absolute value $|\hat{C}(n \mid k)|$ for the considered texts usually dissolved into random noise around $n = 1000$ and there was a hump for $n < k$, as expected. Hence, to estimate the functional form of the decay of $|\hat{C}(n \mid k)|$, we restricted the fitting procedure to the range $k \le n \le 1000$.
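In practice the lags must be integers, so a natural realization of the grid (31), which is our assumption rather than a detail stated in the paper, rounds the geometric sequence and removes duplicates, as in the following sketch.

```python
# Geometric lag grid A of Equation (31), rounded to unique integer lags.
def lag_grid(n_max, ratio=1.1):
    lags, x = [], 1.0
    while x <= n_max:
        lags.append(round(x))
        x *= ratio
    return sorted(set(lags))

print(lag_grid(100))  # 1, 2, 3, ... becoming sparser towards 100
```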
The parameter estimation was performed using the curve_fit function from the SciPy library [31], which employs the trust region reflective algorithm. We selected this method due to its compatibility with bound constraints. We estimated parameters of two functions: the power-law decay
$$f(n \mid c, \gamma) := c n^{-\gamma}, \quad c \in \mathbb{R}, \ \gamma > 0, \qquad (32)$$
and the stretched exponential decay
$$f(n \mid b, \delta, \beta) := \exp(-\delta n^{\beta} + b), \quad b \in \mathbb{R}, \ \delta > 0, \ 0 < \beta < 1, \qquad (33)$$
with parameters $\gamma$, $\delta$, and $\beta$ implicitly depending on the pooling order $k$. As a goodness-of-fit metric, we calculated the sum of squared logarithmic residuals
$$\mathrm{SSLR} := \sum_{n \in A \cap [k, 1000]} \left( \log |\hat{C}(n \mid k)| - \log f(n \mid \cdot) \right)^2, \qquad (34)$$
divided by the number of degrees of freedom (ndf), equal to $|A \cap [k, 1000]|$ minus the number of parameters of $f(n \mid \cdot)$.
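The fitting step can be realized as in the sketch below (ours): scipy.optimize.curve_fit with parameter bounds, which triggers the trust region reflective method, applied to the stretched exponential model (33). We fit in log space so that the least-squares objective coincides with the SSLR (34); whether the authors fitted in log or linear space is our assumption.

```python
# Fitting the stretched exponential model (33) and reporting SSLR / ndf, cf. (34).
import numpy as np
from scipy.optimize import curve_fit

def log_stretched_exp(n, b, delta, beta):
    return -delta * n**beta + b   # logarithm of exp(-delta * n^beta + b)

def fit_stretched_exp(lags, c_hat):
    y = np.log(np.abs(c_hat))
    bounds = ([-np.inf, 0.0, 0.0], [np.inf, np.inf, 1.0])  # b free, delta > 0, 0 < beta < 1
    params, _ = curve_fit(log_stretched_exp, lags, y, p0=[0.0, 1.0, 0.5], bounds=bounds)
    residuals = y - log_stretched_exp(lags, *params)
    return params, np.sum(residuals**2) / (len(lags) - len(params))

# Synthetic example: a noisy stretched exponential on a geometric lag grid.
lags = np.unique(np.round(1.1 ** np.arange(73)))   # lags 1, 2, ..., up to about 1000
noise = 1 + 0.05 * np.random.default_rng(2).normal(size=len(lags))
c_hat = np.exp(-2.0 * lags**0.3) * noise
print(fit_stretched_exp(lags, c_hat))              # recovers delta ~ 2, beta ~ 0.3
```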
We investigated the dependence of the results on the source, understood as the particular language for human-written texts or the particular language model for LLM-generated texts. To check whether there are significant differences in the distribution of a parameter $\alpha \in \{\gamma, \delta, \beta\}$ across particular sources, we used the non-parametric Kruskal–Wallis test with the null hypothesis
$$H_0 : P_1 = P_2 = \cdots = P_J, \qquad (35)$$
where $P_j$ is the distribution of parameter $\alpha$ for the $j$-th source. To further explore differences among particular sources, we employed the post hoc Dunn test with the Bonferroni correction for multiple comparisons.
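One possible implementation of this comparison (our choice of libraries; the paper does not name the implementation of the Dunn test) combines scipy.stats.kruskal with the posthoc_dunn function from the scikit-posthocs package, as sketched below on hypothetical fitted values of β for three sources.

```python
# Kruskal-Wallis test across sources and Dunn's post-hoc test with Bonferroni correction.
import numpy as np
from scipy.stats import kruskal
import scikit_posthocs as sp

rng = np.random.default_rng(3)
# Hypothetical fitted beta values for three sources (e.g., three languages).
samples = [rng.normal(0.16, 0.11, size=100),
           rng.normal(0.28, 0.15, size=100),
           rng.normal(0.61, 0.45, size=20)]

h_stat, p_value = kruskal(*samples)
print(p_value)                                          # reject H0 of (35) if p < 0.01
print(sp.posthoc_dunn(samples, p_adjust="bonferroni"))  # pairwise adjusted p-values
```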

3.3. Results

Visually, the decay of the absolute cosine correlation estimates $|\hat{C}(n \mid k)|$ for $k \le n \le 1000$ usually follows a stretched exponential form rather than an exact power-law decay for human-written texts. By contrast, no systematic decay for $k \le n$ can be detected for LLM-generated texts. This tendency can be seen in Figure 1, which is a diagnostic plot of the absolute cosine correlation estimates $|\hat{C}(n \mid k)|$ for two texts: Cecilia: A Story of Modern Rome, in English, from the SPGC corpus, and Text no. 702 by GPT-3.5, which is the longest LLM-generated text in the sampled subset of the HLLMTC corpus.
In Table 3, Table 4, Table 5, Table 6 and Table 7, we report the means and the standard deviations of the fitted parameters $c$ and $\gamma$ of the power-law model (32) and $b$, $\delta$, and $\beta$ of the stretched exponential model (33). The values are reported separately for each language of the human-written texts and for each language model of the LLM-generated texts. When fitting the models, the optimization algorithm sometimes did not converge. The failure rates and the overall goodness of fit are reported in Table 8. Despite the visual appeal of the stretched exponential model, the mean SSLR given by Formula (34) is smaller for the power-law model. This does not necessarily mean that the power-law model is better, however, since the standard deviation of the SSLR is greater than its mean for the stretched exponential model.

3.4. Discussion

Similarly to Mikhaylovskiy and Churilov [9] but differently from Li [4], Lin and Tegmark [2], and Takahashi and Tanaka-Ishii [6,7], we have sought the LRD at the level of words rather than at the level of characters or phonemes. We have hypothesized that word-level dependencies yield a more prominent effect due to the semantic coherence of lexical units over longer distances, as compared to phoneme-level correlations, which tend to decay faster in view of the arbitrariness of word forms.
Indeed, analyzing the cosine similarity of word embeddings, as Mikhaylovskiy and Churilov [9] did, or their cosine correlation, as in the present study, one observes a clearly visible LRD effect for natural, i.e., human-written texts. Mikhaylovskiy and Churilov [9] reported a rough power-law decay without considering an alternative model. By contrast, we have considered both a power-law model and a stretched exponential model, and both natural texts and LLM-generated texts.
We report that the slow decay of the cosine correlation extends up to 1000 words for natural texts, whereas it is dominated by noise for LLM-generated texts, as was already observed for the previous generation of language models [6,7]. These effects can be seen in the diagnostic Figure 1 and are independently witnessed by Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8, where fitting to random noise results in highly unstable estimates and outliers that inflate the standard deviations beyond the means. Curiously, the decay of the cosine correlation does not change systematically as the pooling order $k$ increases, despite our prior expectation that the cosine correlation would increase monotonically with $k$.
The distributions of the fitted parameters $c$ and $\gamma$ of the power-law model (32) and $b$, $\delta$, and $\beta$ of the stretched exponential model (33) vary significantly across different human languages and different large language models ($p < 0.01$ for the Kruskal–Wallis tests). This means that the cosine correlation decays at different source-specific rates. At the moment, we are unable to state clearly what the cause of this variation may be.
For example, the Japanese language seems to be an outlier in many categories but this need not be directly caused by language typology. We notice that the available texts in Japanese are very short and their coverage in terms of embeddings is much lower than for other languages. Perhaps our experimental methodology fails for very short texts in general. This might be an alternative explanation of the poor fitting results for LLM-generated texts, which are also very short, as shown in Table 2 and Table 8.

4. Conclusions

In this paper, we have provided empirical support for the claim that texts in natural language exhibit long-range dependence (LRD), understood as a slower than exponential decay of the two-point mutual information. Similar claims have been reiterated in the literature [2,4,5,6,7,8,9,10,11] but we hope that we have provided more direct and convincing evidence.
First, as a theoretical result, we have shown that half the squared cosine correlation lower bounds the Shannon mutual information between two random vectors. Under this bound, a power-law or a stretched exponential decay of the cosine correlation implies the LRD. In particular, a vector time series which exhibits such a slow decay of the cosine correlation cannot be a mixing Markov or hidden Markov process by Theorem 1 of Lin and Tegmark [2].
Second, using the Standardized Project Gutenberg Corpus [26] and vector representations of words taken from the NLPL repository [23], we have shown experimentally that the estimates of the cosine correlation of word embeddings follow a stretched exponential decay. This decay extends for lags up to 1000 words without any smoothing, which is four orders of magnitude larger than the unsmoothed, presumably LRD, effect for characters [4].
Third, the stability of this decay suggests that the LRD is a fundamental property of natural language, rather than an artifact of specific preprocessing methods or statistical estimation techniques. The observation of the slow decay of the cosine correlation for natural texts not only supports the hypothesis of LRD but also reaffirms the prior results of Mikhaylovskiy and Churilov [9], who reported a rough power-law decay of the expected cosine similarity of word embeddings.
Fourth, like Takahashi and Tanaka-Ishii [6,7], we have observed the LRD only for natural data. We stress that, as far as we were able to observe, artificial data do not exhibit the LRD in a systematic fashion. Our source of artificial texts was the Human vs. LLM Text Corpus [27]. We admit that the texts in this corpus may be too short to draw firm conclusions and further research on longer LLM-generated texts is necessary to confirm this preliminary claim.
As we have mentioned in the introduction, non-Markovianity effects such as the LRD, Hilberg’s law [15,16,17], and the maximal repetition law [18] may have implications for understanding the limitations and capabilities of contemporary language models. The presence of such effects in natural texts in contrast to texts generated by language models highlights the indispensability of complex memory mechanisms, potentially showing that state-of-the-art architectures such as Transformers [21] are insufficient.
Future research might explore whether novel architectures could capture quantitative linguistic constraints such as the LRD more effectively [32]. Further studies may also explore alternative embeddings or dependence measures and their impact on the stability of the LRD measures such as the stretched exponential decay parameters. Investigating other linguistic corpora, text genres, and languages could also provide valuable insights into the universality of these findings.

Author Contributions

Conceptualization, Ł.D. and P.W.; methodology, Ł.D.; software, P.W.; validation, Ł.D. and P.W.; formal analysis, Ł.D.; investigation, Ł.D. and P.W.; resources, P.W.; data curation, P.W.; writing—original draft preparation, Ł.D.; writing—review and editing, Ł.D. and P.W.; visualization, P.W.; supervision, Ł.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The code and instructions to reproduce the experiments are available at https://github.com/pawel-wieczynski/long_range_dependencies (accessed on 6 June 2025). More figures are available at https://github.com/pawel-wieczynski/long_range_dependencies/tree/main/figures (accessed on 6 June 2025).

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT 4o for the purpose of drafting some passages of the text in the introduction and conclusion. The authors have reviewed and heavily edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
CC: cosine correlation
HLLMTC: Human vs. LLM Text Corpus
LLM: large language model
LRD: long-range dependence
SPGC: Standardized Project Gutenberg Corpus
SSLR: sum of squared logarithmic residuals
SRD: short-range dependence

References

  1. Beran, J. Statistics for Long-Memory Processes; Chapman & Hall: New York, NY, USA, 1994. [Google Scholar]
  2. Lin, H.W.; Tegmark, M. Critical Behavior in Physics and Probabilistic Formal Languages. Entropy 2017, 19, 299. [Google Scholar] [CrossRef]
  3. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley & Sons: New York, NY, USA, 2006. [Google Scholar]
  4. Li, W. Mutual Information Functions versus Correlation Functions. J. Stat. Phys. 1990, 60, 823–837. [Google Scholar] [CrossRef]
  5. Altmann, E.G.; Pierrehumbert, J.B.; Motter, A.E. Beyond Word Frequency: Bursts, Lulls, and Scaling in the Temporal Distributions of Words. PLoS ONE 2009, 4, e7678. [Google Scholar] [CrossRef] [PubMed]
  6. Takahashi, S.; Tanaka-Ishii, K. Do neural nets learn statistical laws behind natural language? PLoS ONE 2017, 12, e0189326. [Google Scholar] [CrossRef] [PubMed]
  7. Takahashi, S.; Tanaka-Ishii, K. Evaluating computational language models with scaling properties of natural language. Comput. Linguist. 2019, 45, 481–513. [Google Scholar] [CrossRef]
  8. Tanaka-Ishii, K. Statistical Universals of Language: Mathematical Chance vs. Human Choice; Springer: New York, NY, USA, 2021. [Google Scholar]
  9. Mikhaylovskiy, N.; Churilov, I. Autocorrelations Decay in Texts and Applicability Limits of Language Models. arXiv 2023, arXiv:2305.06615. Available online: https://arxiv.org/abs/2305.06615 (accessed on 6 June 2025).
  10. Stanisz, T.; Drożdż, S.; Kwapień, J. Complex systems approach to natural language. Phys. Rep. 2024, 1053, 1–84. [Google Scholar] [CrossRef]
  11. Bartnicki, K.; Drożdż, S.; Kwapień, J.; Stanisz, T. Punctuation Patterns in “Finnegans Wake” by James Joyce Are Largely Translation-Invariant. Entropy 2025, 27, 177. [Google Scholar] [CrossRef] [PubMed]
  12. Chomsky, N. Three models for the description of language. IRE Trans. Inf. Theory 1956, 2, 113–124. [Google Scholar] [CrossRef]
  13. Chomsky, N. Syntactic Structures; Mouton & Co.: The Hague, The Netherlands, 1957. [Google Scholar]
  14. Chomsky, N. A Review of B. F. Skinner’s Verbal Behavior. Language 1959, 35, 26–58. [Google Scholar] [CrossRef]
  15. Hilberg, W. Der bekannte Grenzwert der redundanzfreien Information in Texten—Eine Fehlinterpretation der Shannonschen Experimente? Frequenz 1990, 44, 243–248. [Google Scholar] [CrossRef]
  16. Takahira, R.; Tanaka-Ishii, K.; Dębowski, L. Entropy Rate Estimates for Natural Language—A New Extrapolation of Compressed Large-Scale Corpora. Entropy 2016, 18, 364. [Google Scholar] [CrossRef]
  17. Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling Laws for Neural Language Models. arXiv 2020, arXiv:2001.08361. Available online: https://arxiv.org/abs/2001.08361 (accessed on 6 June 2025).
  18. Dębowski, L. Maximal Repetitions in Written Texts: Finite Energy Hypothesis vs. Strong Hilberg Conjecture. Entropy 2015, 17, 5903–5919. [Google Scholar] [CrossRef]
  19. Dębowski, L. Maximal Repetition and Zero Entropy Rate. IEEE Trans. Inf. Theory 2018, 64, 2212–2219. [Google Scholar] [CrossRef]
  20. Dębowski, L. A Refutation of Finite-State Language Models Through Zipf’s Law for Factual Knowledge. Entropy 2021, 23, 1148. [Google Scholar] [CrossRef] [PubMed]
  21. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  22. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of the Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, NV, USA, 5–8 December 2013; pp. 3111–3119. [Google Scholar]
  23. Fares, M.; Kutuzov, A.; Oepen, S.; Velldal, E. Word vectors, reuse, and replicability: Towards a community repository of large-text resources. In Proceedings of the 21st Nordic Conference on Computational Linguistics, Gothenburg, Sweden, 22–24 May 2017; pp. 271–276. [Google Scholar]
  24. Pinsker, M.S. Information and Information Stability of Random Variables and Processes; Holden-Day: San Francisco, CA, USA, 1964. [Google Scholar]
  25. Csiszár, I.; Körner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems; Cambridge University Press: Cambridge, UK, 2011. [Google Scholar]
  26. Gerlach, M.; Font-Clos, F. A Standardized Project Gutenberg Corpus for Statistical Analysis of Natural Language and Quantitative Linguistics. Entropy 2020, 22, 126. [Google Scholar] [CrossRef] [PubMed]
  27. Grinberg, Z. Human vs. LLM Text Corpus. 2024. Available online: https://www.kaggle.com/datasets/starblasters8/human-vs-llm-text-corpus (accessed on 13 May 2025).
  28. Bretagnolle, J.; Huber, C. Estimation des densités: Risque minimax. In Séminaire de Probabilités XII; Lecture Notes in Mathematics; Springer: New York, NY, USA, 1978; Volume 649, pp. 342–363. [Google Scholar]
  29. Canonne, C. A short note on an inequality between KL and TV. arXiv 2022, arXiv:2202.07198. Available online: https://arxiv.org/abs/2202.07198 (accessed on 6 June 2025).
  30. Řehůřek, R.; Sojka, P. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, 22 May 2010; pp. 45–50. Available online: http://is.muni.cz/publication/884893/en (accessed on 6 June 2025).
  31. Virtanen, P.; Gommers, R.; Oliphant, T.E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat. Methods 2020, 17, 261–272. [Google Scholar] [CrossRef] [PubMed]
  32. Behrouz, A.; Zhong, P.; Mirrokni, V. Titans: Learning to Memorize at Test Time. arXiv 2025, arXiv:2501.00663. Available online: https://arxiv.org/abs/2501.00663 (accessed on 6 June 2025).
Figure 1. Estimates $|\hat{C}(n \mid k)|$ for two diagnostic texts.
Table 1. Vocabulary sizes of the chosen embedding spaces.
Language | Vocabulary Size
Catalan (ca) | 799,020
Danish (da) | 1,655,886
German (de) | 4,946,997
Greek (el) | 1,183,194
English (en) | 4,027,169
Spanish (es) | 2,656,057
Finnish (fi) | 2,433,286
French (fr) | 2,567,698
Hungarian (hu) | 2,702,663
Italian (it) | 2,469,122
Japanese (ja) | 3,989,605
Latin (la) | 555,381
Dutch (nl) | 2,610,658
Norwegian (no) | 223,763
Polish (pl) | 4,420,598
Portuguese (pt) | 2,536,452
Swedish (sv) | 3,010,472
Table 2. Summary of the used subset of corpora.
Source | # of Texts | # of Tokens (Mean) | # of Tokens (Std) | Coverage (Mean) | Coverage (Std)
Standardized Project Gutenberg Corpus:
Catalan (ca) | 32 | 36,827.44 | 20,376.28 | 0.97 | 0.02
Danish (da) | 66 | 51,832.92 | 30,748.01 | 0.97 | 0.02
German (de) | 100 | 41,532.84 | 32,192.86 | 0.97 | 0.02
Greek (el) | 100 | 29,487.20 | 17,775.53 | 0.91 | 0.03
English (en) | 100 | 47,362.22 | 41,189.09 | 1.00 | 0.01
Spanish (es) | 100 | 62,873.16 | 37,040.17 | 0.98 | 0.02
Finnish (fi) | 100 | 35,095.02 | 28,948.42 | 0.94 | 0.03
French (fr) | 100 | 53,948.66 | 39,585.75 | 0.96 | 0.01
Hungarian (hu) | 100 | 50,510.30 | 31,976.38 | 0.95 | 0.02
Italian (it) | 100 | 54,386.01 | 37,917.55 | 0.95 | 0.03
Japanese (ja) | 20 | 268.05 | 371.40 | 0.78 | 0.19
Latin (la) | 76 | 26,769.57 | 31,199.93 | 0.93 | 0.07
Dutch (nl) | 100 | 43,055.05 | 31,465.48 | 0.98 | 0.01
Norwegian (no) | 19 | 39,497.00 | 24,798.87 | 0.93 | 0.03
Polish (pl) | 29 | 14,225.28 | 16,859.88 | 0.96 | 0.06
Portuguese (pt) | 100 | 18,485.80 | 18,533.72 | 0.96 | 0.01
Swedish (sv) | 100 | 37,474.97 | 29,310.51 | 0.97 | 0.02
Human vs. LLM Text Corpus:
GPT-3.5 | 1000 | 444.40 | 278.66 | 0.9996 | 0.0025
GPT-4 | 1000 | 628.69 | 228.94 | 0.9996 | 0.0023
Human | 1000 | 666.63 | 881.15 | 0.9980 | 0.0055
LLaMA-13B | 1000 | 437.87 | 268.76 | 0.9987 | 0.0058
LLaMA-30B | 1000 | 404.65 | 261.39 | 0.9988 | 0.0053
LLaMA-65B | 1000 | 369.19 | 252.73 | 0.9988 | 0.0061
LLaMA-7B | 1000 | 489.27 | 263.58 | 0.9986 | 0.0070
Table 3. Means and standard deviations of parameter c.
Source | k = 1 | k = 3 | k = 9 | k = 27   (pooling order k)
Standardized Project Gutenberg Corpus:
ca | 0.0248 ± 0.0035 | 0.069 ± 0.013 | 0.099 ± 0.024 | 0.17 ± 0.12
da | 0.0203 ± 0.0045 | 0.052 ± 0.013 | 0.080 ± 0.024 | 0.150 ± 0.091
de | 0.0286 ± 0.0079 | 0.070 ± 0.021 | 0.108 ± 0.047 | 0.27 ± 0.67
el | 0.031 ± 0.012 | 0.072 ± 0.024 | 0.100 ± 0.041 | 0.19 ± 0.20
en | 0.0258 ± 0.0092 | 0.070 ± 0.026 | 0.108 ± 0.053 | 0.20 ± 0.18
es | 0.033 ± 0.022 | 0.088 ± 0.037 | 0.120 ± 0.049 | 0.19 ± 0.12
fi | 0.0501 ± 0.0082 | 0.100 ± 0.019 | 0.144 ± 0.046 | 0.24 ± 0.13
fr | 0.033 ± 0.013 | 0.087 ± 0.033 | 0.127 ± 0.076 | 0.27 ± 0.57
hu | 0.0353 ± 0.0058 | 0.086 ± 0.016 | 0.119 ± 0.030 | 0.21 ± 0.18
it | 0.0327 ± 0.0093 | 0.085 ± 0.025 | 0.117 ± 0.047 | 0.19 ± 0.16
ja | 0.148 ± 0.094 | 0.45 ± 0.39 | 3.5 ± 9.6 | 21,259 ± 72,964
la | 0.067 ± 0.031 | 0.154 ± 0.076 | 0.24 ± 0.16 | 0.63 ± 0.96
nl | 0.0257 ± 0.0078 | 0.071 ± 0.026 | 0.113 ± 0.057 | 0.24 ± 0.27
no | 0.0136 ± 0.0051 | 0.040 ± 0.012 | 0.068 ± 0.019 | 0.110 ± 0.041
pl | 0.046 ± 0.011 | 0.118 ± 0.025 | 0.179 ± 0.077 | 0.42 ± 0.48
pt | 0.026 ± 0.011 | 0.087 ± 0.034 | 0.16 ± 0.12 | 0.32 ± 0.50
sv | 0.027 ± 0.012 | 0.065 ± 0.022 | 0.101 ± 0.039 | 0.19 ± 0.14
Human vs. LLM Text Corpus:
GPT-3.5 | 0.024 ± 0.022 | 0.05 ± 0.74 | 2 ± 57 | 5.4 × 10^30 ± 1.7 × 10^31
GPT-4 | 0.029 ± 0.019 | 0.1 ± 3.2 | 0.02 ± 0.52 | −0.1 ± 1.2
Human | 0.028 ± 0.021 | 0.039 ± 0.061 | 3 ± 59 | 1.0 × 10^7 ± 3.1 × 10^8
LLaMA-13B | 0.027 ± 0.027 | 0.045 ± 0.088 | 2 ± 31 | 1.0 × 10^30 ± 3.1 × 10^31
LLaMA-30B | 0.027 ± 0.024 | 0.040 ± 0.048 | 0.3 ± 4.8 | 2.6 × 10^19 ± 8.0 × 10^20
LLaMA-65B | 0.026 ± 0.023 | 0.04 ± 0.18 | 1 ± 23 | 1.7 × 10^7 ± 4.4 × 10^8
LLaMA-7B | 0.029 ± 0.027 | 0.044 ± 0.051 | 0.3 ± 4.0 | 6.2 × 10^11 ± 1.9 × 10^12
Table 4. Means and standard deviations of parameter γ.
Source | k = 1 | k = 3 | k = 9 | k = 27   (pooling order k)
Standardized Project Gutenberg Corpus:
ca | 0.449 ± 0.055 | 0.523 ± 0.063 | 0.546 ± 0.083 | 0.62 ± 0.13
da | 0.373 ± 0.067 | 0.442 ± 0.092 | 0.49 ± 0.11 | 0.58 ± 0.17
de | 0.440 ± 0.079 | 0.49 ± 0.11 | 0.53 ± 0.14 | 0.62 ± 0.23
el | 0.405 ± 0.083 | 0.48 ± 0.13 | 0.51 ± 0.16 | 0.57 ± 0.23
en | 0.330 ± 0.067 | 0.418 ± 0.098 | 0.47 ± 0.12 | 0.54 ± 0.19
es | 0.373 ± 0.090 | 0.44 ± 0.11 | 0.45 ± 0.12 | 0.50 ± 0.16
fi | 0.574 ± 0.084 | 0.552 ± 0.099 | 0.57 ± 0.13 | 0.62 ± 0.17
fr | 0.415 ± 0.079 | 0.47 ± 0.11 | 0.49 ± 0.13 | 0.55 ± 0.20
hu | 0.442 ± 0.075 | 0.482 ± 0.096 | 0.49 ± 0.11 | 0.55 ± 0.17
it | 0.421 ± 0.089 | 0.47 ± 0.12 | 0.48 ± 0.13 | 0.52 ± 0.19
ja | 0.34 ± 0.19 | 0.57 ± 0.39 | 0.79 ± 0.69 | 2.3 ± 1.6
la | 0.40 ± 0.17 | 0.47 ± 0.23 | 0.51 ± 0.24 | 0.60 ± 0.33
nl | 0.390 ± 0.074 | 0.46 ± 0.11 | 0.51 ± 0.13 | 0.60 ± 0.20
no | 0.347 ± 0.041 | 0.451 ± 0.060 | 0.522 ± 0.073 | 0.59 ± 0.10
pl | 0.550 ± 0.075 | 0.63 ± 0.12 | 0.65 ± 0.18 | 0.73 ± 0.26
pt | 0.416 ± 0.071 | 0.57 ± 0.16 | 0.63 ± 0.22 | 0.70 ± 0.47
sv | 0.407 ± 0.084 | 0.459 ± 0.095 | 0.51 ± 0.12 | 0.60 ± 0.17
Human vs. LLM Text Corpus:
GPT-3.5 | 0.31 ± 0.36 | 0.19 ± 0.36 | 0.9 ± 1.9 | 1.2 ± 2.0
GPT-4 | 0.47 ± 0.30 | 0.39 ± 0.42 | 0.6 ± 1.3 | 0.8 ± 1.6
Human | 0.28 ± 0.26 | 0.29 ± 0.38 | 0.7 ± 1.3 | 0.9 ± 1.3
LLaMA-13B | 0.19 ± 0.20 | 0.25 ± 0.27 | 0.6 ± 1.3 | 0.8 ± 1.4
LLaMA-30B | 0.20 ± 0.20 | 0.25 ± 0.35 | 0.6 ± 1.3 | 0.8 ± 1.3
LLaMA-65B | 0.19 ± 0.20 | 0.23 ± 0.27 | 0.6 ± 1.2 | 0.8 ± 1.2
LLaMA-7B | 0.21 ± 0.20 | 0.25 ± 0.24 | 0.6 ± 1.0 | 0.7 ± 1.1
Table 5. Means and standard deviations of parameter b.
Source | k = 1 | k = 3 | k = 9 | k = 27   (pooling order k)
Standardized Project Gutenberg Corpus:
ca | 0.9 ± 6.1 | 192 ± 524 | 26 ± 114 | 24 ± 127
da | 119 ± 552 | 107 ± 490 | 69 ± 322 | 131 ± 571
de | 137 ± 586 | 56 ± 246 | 134 ± 531 | 89 ± 480
el | 126 ± 501 | 63 ± 206 | 128 ± 481 | 80 ± 335
en | 93 ± 472 | 39 ± 157 | 108 ± 529 | 36 ± 183
es | 147 ± 615 | 124 ± 334 | 34 ± 188 | 50 ± 255
fi | 26 ± 40 | 151 ± 406 | 97 ± 419 | 127 ± 504
fr | 81 ± 333 | 110 ± 478 | 121 ± 461 | 106 ± 466
hu | 10 ± 28 | 40 ± 158 | 21 ± 130 | 53 ± 197
it | 114 ± 483 | 86 ± 231 | 83 ± 402 | 108 ± 527
ja | 195 ± 652 | 449 ± 992 | 252 ± 902 | 479 ± 1239
la | 406 ± 1044 | 176 ± 672 | 231 ± 701 | 375 ± 1029
nl | 4 ± 31 | 19 ± 101 | 14 ± 129 | 47 ± 257
no | 271 ± 745 | 5 ± 27 | −0.7 ± 3.9 | 84 ± 356
pl | 157 ± 416 | 387 ± 1071 | 364 ± 725 | 610 ± 1189
pt | 353 ± 915 | 338 ± 884 | 496 ± 1207 | 588 ± 1174
sv | 114 ± 455 | 106 ± 534 | 28 ± 150 | 69 ± 455
Human vs. LLM Text Corpus:
GPT-3.5 | 1575 ± 1629 | 709 ± 1156 | 421 ± 877 | 198 ± 590
GPT-4 | 2520 ± 1489 | 1749 ± 1510 | 1127 ± 1333 | 325 ± 793
Human | 1211 ± 1462 | 724 ± 1144 | 624 ± 1126 | 581 ± 1149
LLaMA-13B | 513 ± 951 | 502 ± 959 | 662 ± 1129 | 522 ± 1029
LLaMA-30B | 569 ± 1002 | 469 ± 893 | 664 ± 1108 | 494 ± 993
LLaMA-65B | 554 ± 982 | 479 ± 912 | 606 ± 1090 | 458 ± 987
LLaMA-7B | 553 ± 1010 | 506 ± 946 | 719 ± 1173 | 526 ± 1062
Table 6. Means and standard deviations of parameter δ.
Source | k = 1 | k = 3 | k = 9 | k = 27   (pooling order k)
Standardized Project Gutenberg Corpus:
ca | 4.7 ± 6.0 | 195 ± 524 | 29 ± 114 | 27 ± 127
da | 123 ± 552 | 110 ± 490 | 72 ± 322 | 134 ± 571
de | 141 ± 586 | 59 ± 246 | 137 ± 531 | 92 ± 480
el | 130 ± 501 | 66 ± 206 | 131 ± 481 | 83 ± 335
en | 97 ± 472 | 42 ± 157 | 111 ± 529 | 39 ± 183
es | 151 ± 615 | 127 ± 334 | 37 ± 188 | 52 ± 255
fi | 29 ± 40 | 153 ± 406 | 100 ± 419 | 130 ± 504
fr | 84 ± 333 | 112 ± 478 | 123 ± 461 | 108 ± 466
hu | 13 ± 28 | 43 ± 158 | 23 ± 130 | 56 ± 197
it | 118 ± 483 | 88 ± 231 | 86 ± 402 | 111 ± 527
ja | 198 ± 652 | 451 ± 992 | 254 ± 900 | 480 ± 1238
la | 409 ± 1044 | 178 ± 672 | 234 ± 701 | 377 ± 1028
nl | 8 ± 31 | 21 ± 101 | 17 ± 129 | 50 ± 257
no | 275 ± 744 | 8 ± 27 | 2.6 ± 3.8 | 87 ± 356
pl | 160 ± 416 | 389 ± 1071 | 367 ± 725 | 612 ± 1188
pt | 356 ± 915 | 341 ± 884 | 498 ± 1207 | 590 ± 1173
sv | 118 ± 455 | 109 ± 534 | 31 ± 150 | 72 ± 455
Human vs. LLM Text Corpus:
GPT-3.5 | 1579 ± 1628 | 714 ± 1155 | 426 ± 876 | 203 ± 589
GPT-4 | 2523 ± 1488 | 1753 ± 1509 | 1132 ± 1331 | 330 ± 792
Human | 1214 ± 1462 | 728 ± 1144 | 627 ± 1125 | 584 ± 1148
LLaMA-13B | 517 ± 951 | 506 ± 958 | 665 ± 1129 | 526 ± 1027
LLaMA-30B | 573 ± 1001 | 473 ± 893 | 667 ± 1107 | 498 ± 992
LLaMA-65B | 558 ± 982 | 483 ± 911 | 610 ± 1089 | 462 ± 986
LLaMA-7B | 557 ± 1009 | 510 ± 946 | 723 ± 1172 | 530 ± 1060
Table 7. Means and standard deviations of parameter β.
Source | k = 1 | k = 3 | k = 9 | k = 27   (pooling order k)
Standardized Project Gutenberg Corpus:
ca | 0.16 ± 0.11 | 0.090 ± 0.093 | 0.16 ± 0.15 | 0.24 ± 0.21
da | 0.23 ± 0.17 | 0.17 ± 0.11 | 0.21 ± 0.16 | 0.23 ± 0.19
de | 0.14 ± 0.15 | 0.14 ± 0.12 | 0.18 ± 0.15 | 0.27 ± 0.19
el | 0.18 ± 0.13 | 0.14 ± 0.15 | 0.18 ± 0.17 | 0.30 ± 0.24
en | 0.28 ± 0.15 | 0.17 ± 0.14 | 0.21 ± 0.15 | 0.31 ± 0.23
es | 0.17 ± 0.12 | 0.11 ± 0.15 | 0.15 ± 0.15 | 0.23 ± 0.20
fi | 0.071 ± 0.067 | 0.067 ± 0.065 | 0.12 ± 0.11 | 0.21 ± 0.15
fr | 0.14 ± 0.14 | 0.13 ± 0.15 | 0.16 ± 0.19 | 0.24 ± 0.23
hu | 0.086 ± 0.069 | 0.093 ± 0.090 | 0.18 ± 0.16 | 0.26 ± 0.25
it | 0.119 ± 0.095 | 0.10 ± 0.13 | 0.13 ± 0.15 | 0.20 ± 0.18
ja | 0.61 ± 0.45 | 0.55 ± 0.49 | 0.59 ± 0.48 | 0.49 ± 0.48
la | 0.20 ± 0.22 | 0.32 ± 0.25 | 0.40 ± 0.30 | 0.52 ± 0.37
nl | 0.19 ± 0.15 | 0.15 ± 0.13 | 0.19 ± 0.15 | 0.27 ± 0.21
no | 0.30 ± 0.18 | 0.22 ± 0.11 | 0.187 ± 0.081 | 0.23 ± 0.18
pl | 0.070 ± 0.068 | 0.090 ± 0.089 | 0.15 ± 0.17 | 0.24 ± 0.24
pt | 0.19 ± 0.16 | 0.13 ± 0.16 | 0.18 ± 0.20 | 0.23 ± 0.25
sv | 0.16 ± 0.12 | 0.17 ± 0.13 | 0.21 ± 0.15 | 0.29 ± 0.24
Human vs. LLM Text Corpus:
GPT-3.5 | 0.07 ± 0.24 | 0.11 ± 0.30 | 0.18 ± 0.37 | 0.28 ± 0.44
GPT-4 | 0.02 ± 0.12 | 0.04 ± 0.19 | 0.06 ± 0.22 | 0.11 ± 0.30
Human | 0.10 ± 0.27 | 0.19 ± 0.35 | 0.27 ± 0.41 | 0.32 ± 0.44
LLaMA-13B | 0.24 ± 0.38 | 0.26 ± 0.39 | 0.25 ± 0.40 | 0.32 ± 0.45
LLaMA-30B | 0.21 ± 0.36 | 0.24 ± 0.38 | 0.24 ± 0.40 | 0.31 ± 0.44
LLaMA-65B | 0.20 ± 0.36 | 0.23 ± 0.38 | 0.26 ± 0.42 | 0.33 ± 0.45
LLaMA-7B | 0.25 ± 0.39 | 0.27 ± 0.39 | 0.27 ± 0.41 | 0.30 ± 0.44
Table 8. Failure rates and goodness of fit for power-law (PL) decay and stretched exponential (SE) decay.
Pooling Order | PL Failure Rate (%) | PL Avg. SSLR | SE Failure Rate (%) | SE Avg. SSLR
Standardized Project Gutenberg Corpus:
1 | 0.00 | 0.43 ± 0.47 | 14.08 | 2 ± 29
3 | 0.00 | 0.28 ± 0.43 | 29.06 | 0.4 ± 2.6
9 | 0.00 | 0.26 ± 0.44 | 20.27 | 0.5 ± 3.2
27 | 0.37 | 0.25 ± 0.48 | 17.21 | 0.6 ± 4.6
Human vs. LLM Text Corpus:
1 | 0.00 | 1.4 ± 1.1 | 0.03 | 11 ± 262
3 | 0.00 | 1.3 ± 4.6 | 0.13 | 7 ± 220
9 | 0.00 | 4 ± 36 | 0.33 | 1.6 ± 7.7
27 | 4.20 | 1 ± 12 | 4.99 | 2 ± 37
