# Syntactic Parameters and a Coding Theory Perspective on Entropy and Complexity of Language Families

## Abstract


## 1. Introduction

#### 1.1. Principles and Parameters

#### 1.2. Syntactic Parameters, Codes and Code Parameters

#### 1.3. Position with Respect to the Asymptotic Bound

#### 1.4. Complexity of Languages and Language Families

## 2. Language Families as Codes

#### 2.1. Code Parameters

#### 2.2. Parameter Spoiling

- A code $C_1 = C \star_i f$ in $\mathbb{F}_2^{n+1}$, for a map $f: C \to \mathbb{F}_2$, whose code words are of the form $(x_1, \dots, x_{i-1}, f(x_1, \dots, x_n), x_i, \dots, x_n)$ for $w = (x_1, \dots, x_n) \in C$. If $f$ is a constant function, $C_1$ is an $[n+1, k, d]_2$-code. If all pairs $w, w' \in C$ with $d_H(w, w') = d$ satisfy $f(w) \ne f(w')$, then $C_1$ is an $[n+1, k, d+1]_2$-code.
- A code $C_2 = C \star_i$ in $\mathbb{F}_2^{n-1}$, whose code words are given by the projections $(x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)$ of the code words $w = (x_1, \dots, x_n) \in C$.
- A code $C_3 = C(a, i) \subset C \subset \mathbb{F}_2^n$, given by the level set $C(a, i) = \{ w = (x_k)_{k=1}^n \in C \mid x_i = a \}$. Taking $C(a, i) \star_i$ gives an $[n-1, k', d']_2$-code with $k - 1 \le k' < k$ and $d' \ge d$.
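These three operations are simple enough to sketch in code. The following minimal Python illustration represents a code as a set of binary tuples; the function names are mine, not notation from the text:

```python
from itertools import combinations

def hamming(u, v):
    """Hamming distance between two equal-length binary tuples."""
    return sum(a != b for a, b in zip(u, v))

def min_distance(code):
    """Minimum pairwise Hamming distance d of a code with at least two words."""
    return min(hamming(u, v) for u, v in combinations(code, 2))

def star_insert(code, i, f):
    """First spoiling operation C *_i f: insert the letter f(w) at position i (0-based)."""
    return {w[:i] + (f(w),) + w[i:] for w in code}

def star_delete(code, i):
    """Second spoiling operation C *_i: delete (puncture) the i-th letter of each word."""
    return {w[:i] + w[i + 1:] for w in code}

def level_set(code, a, i):
    """Third spoiling operation C(a, i): subcode of words whose i-th letter equals a."""
    return {w for w in code if w[i] == a}
```

Inserting a constant letter increases $n$ by one while leaving $k$ and $d$ unchanged, matching the first bullet above.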

#### 2.3. Examples

**Example 1.** Consider a code $C$ formed out of the languages $\ell_1 = $ Italian, $\ell_2 = $ Spanish, and $\ell_3 = $ French, and let us consider only the first six syntactic parameters of Table A of [3], so that $C \subset \mathbb{F}_2^n$ with $n = 6$. The code words for the three languages are

| | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| $\ell_1$ | 1 | 1 | 1 | 0 | 1 | 1 |
| $\ell_2$ | 1 | 1 | 1 | 1 | 1 | 1 |
| $\ell_3$ | 1 | 1 | 1 | 0 | 1 | 0 |

- Throughout the entire set of 28 languages considered in [3], the first two parameters are set to the same value 1; hence, for the purpose of comparative analysis within this family, we can regard a code like the one above as a twice-spoiled code $C = C' \star_1 f_1 = (C'' \star_2 f_2) \star_1 f_1$, where both $f_1$ and $f_2$ are constant equal to 1 and $C'' \subset \mathbb{F}_2^4$ is the code obtained from the one above by deleting the first two letters of each code word.
- Conversely, we have $C'' = C' \star_2$ and $C' = C \star_1$, in terms of the second spoiling operation described above.
- To illustrate the third spoiling operation, one can see from the table, for instance, that $C(0, 4) = \{\ell_1, \ell_3\}$, while $C(1, 6) = \{\ell_1, \ell_2\}$.
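The level sets in this example can be checked mechanically; a small self-contained Python sketch, with parameter positions 1-based to match the table:

```python
# Example 1 code words: the first six parameters of Table A in [3].
l1 = (1, 1, 1, 0, 1, 1)  # Italian
l2 = (1, 1, 1, 1, 1, 1)  # Spanish
l3 = (1, 1, 1, 0, 1, 0)  # French
C = {l1, l2, l3}

def level_set(code, a, i):
    """C(a, i): code words whose i-th parameter (1-based) equals a."""
    return {w for w in code if w[i - 1] == a}

print(level_set(C, 0, 4) == {l1, l3})  # → True
print(level_set(C, 1, 6) == {l1, l2})  # → True
```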

#### 2.4. The Asymptotic Bound

#### 2.5. Code Parameters of Language Families

**Example 2.** Consider the set $C = \{L_1, L_2, L_3\}$ with languages $L_1 = $ Arabic, $L_2 = $ Wolof, and $L_3 = $ Basque. We exclude from the list of Table A of [3] all those parameters that are entailed and made irrelevant by some other parameter in at least one of these three chosen languages. This gives us a list of 25 remaining parameters, which are those numbered as 1–5, 7, 10, 20–21, 25, 27–29, 31–32, 34, 37, 42, 50–53, 55–57 in [3], and the following three code words:

| | 1 | 2 | 3 | 4 | 5 | 7 | 10 | 20 | 21 | 25 | 27 | 28 | 29 | 31 | 32 | 34 | 37 | 42 | 50 | 51 | 52 | 53 | 55 | 56 | 57 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $L_1$ | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| $L_2$ | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| $L_3$ | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
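From these code words one can recompute the code parameters directly. The Python sketch below assumes the rate convention $R = \log_2(\#C)/n$ for these binary examples; the minimum distance comes out to $d = 13$, so $\delta = 13/25 = 0.52 > 1/2$, which is what places this code point in the region forbidden by the asymptotic Plotkin bound:

```python
from itertools import combinations
from math import log2

# Example 2 code words over F_2 with n = 25 (the rows of the table above).
L1 = "1111110101010111111010000"  # Arabic
L2 = "1110011010100101100111111"  # Wolof
L3 = "1101001000111011011111100"  # Basque
code = [L1, L2, L3]
n = len(L1)

# minimum pairwise Hamming distance
d = min(sum(a != b for a, b in zip(u, v)) for u, v in combinations(code, 2))
R = log2(len(code)) / n  # transmission rate, with k = log2(#C)
delta = d / n            # relative minimum distance

print(d, delta)  # → 13 0.52
```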

#### 2.6. Comparison with Other Bounds

#### 2.7. Entailment and Dependency of Parameters

**Example 3.** Let $C = \{L_1, L_2, L_3\}$ be the code obtained from the languages $L_1 = $ Arabic, $L_2 = $ Wolof, and $L_3 = $ Basque, as a code in $\mathbb{F}_3^n$ with $n = 63$, using the entire list of parameters in [3]. The code parameters $(R = 0.0252, \delta = 0.4643)$ of this code no longer violate the Plotkin bound. In fact, they satisfy $R < 1 - H_3(\delta)$, so the code $C$ now also lies below the GV bound.
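The GV-bound comparison can be reproduced numerically from the $q$-ary entropy $H_q(x) = x \log_q(q-1) - x \log_q x - (1-x)\log_q(1-x)$; a minimal Python check using the values stated above:

```python
from math import log

def H_q(x, q):
    """q-ary entropy function H_q(x)."""
    if x in (0, 1):
        return 0.0 if x == 0 else log(q - 1, q)
    return x * log(q - 1, q) - x * log(x, q) - (1 - x) * log(1 - x, q)

R, delta = 0.0252, 0.4643  # code parameters stated in Example 3
gv = 1 - H_q(delta, 3)     # the GV line 1 - H_3(delta) for q = 3
print(R < gv)              # → True: the code point lies below the GV curve
```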

## 3. Entropy and Complexity for Language Families

#### 3.1. Why the Asymptotic Bound?

#### 3.2. Entropy and Statistics of the Gilbert–Varshamov Line

#### 3.3. Kolmogorov Complexity

#### 3.4. Kolmogorov Complexity and the Asymptotic Bound

#### 3.5. Entropy and Complexity Estimates for Language Families

## 4. Conclusions

- Do languages belonging to the same historical-linguistic family always yield codes below the asymptotic bound or the GV bound? How often does the same happen across different linguistic families? How much can code parameters be improved by eliminating spoiling effects caused by dependencies and entailment of syntactic parameters?
- Codes near the GV curve typically come from the Shannon Random Code Ensemble, where code words and letters of code words behave like independent random variables; see [26,27]. Are there families of languages whose associated codes are located near the GV bound? Do their syntactic parameters mimic the uniform Poisson distribution of random codes?
- The asymptotic bound for error-correcting codes was related in [16] to Kolmogorov complexity, and the measure of complexity for language families that we proposed here is estimated in terms of the position of the code point with respect to the asymptotic bound. There are other notions of complexity, most notably the type of organized complexities discussed in [33,34,35]. Can these be related to loci in the space of code parameters? What do these represent when applied to codes obtained from syntactic parameters of a set of natural languages?
- Is there a more direct linguistic complexity measure associated to a family of natural languages that would relate to the position of the resulting code above or below the asymptotic bound?

- How much will the conclusions obtained for a given family of languages depend on data pre-processing (removal of “spoiling” features, etc.)?
- To what extent is the proposed criterion (above or below the asymptotic bound) really an objective property of a set of languages?

## Acknowledgments

## Conflicts of Interest

## References

1. Chomsky, N. Lectures on Government and Binding; Foris: Dordrecht, The Netherlands, 1981.
2. Longobardi, G. Methods in parametric linguistics and cognitive history. Linguist. Var. Yearb. 2003, 3, 101–138.
3. Longobardi, G.; Guardiano, C. Evidence for syntax as a signal of historical relatedness. Lingua 2009, 119, 1679–1706.
4. Longobardi, G.; Guardiano, C.; Silvestri, G.; Boattini, A.; Ceolin, A. Toward a syntactic phylogeny of modern Indo-European languages. J. Hist. Linguist. 2013, 3, 122–152.
5. Aziz, S.; Huynh, V.L.; Warrick, D.; Marcolli, M. Syntactic Phylogenetic Trees. 2016; in preparation.
6. Park, J.J.; Boettcher, R.; Zhao, A.; Mun, A.; Yuh, K.; Kumar, V.; Marcolli, M. Prevalence and Recoverability of Syntactic Parameters in Sparse Distributed Memories. 2015.
7. Port, A.; Gheorghita, I.; Guth, D.; Clark, J.M.; Liang, C.; Dasu, S.; Marcolli, M. Persistent Topology of Syntax. 2015.
8. Siva, K.; Tao, J.; Marcolli, M. Spin Glass Models of Syntax and Language Evolution. 2015.
9. Syntactic Structures of the World’s Languages (SSWL) Database of Syntactic Parameters. Available online: http://sswl.railsplayground.net (accessed on 18 March 2016).
10. TerraLing. Available online: http://www.terraling.com (accessed on 18 March 2016).
11. Haspelmath, M.; Dryer, M.S.; Gil, D.; Comrie, B. The World Atlas of Language Structures; Oxford University Press: Oxford, UK, 2005.
12. Tsfasman, M.A.; Vladut, S.G. Algebraic-Geometric Codes. In Mathematics and Its Applications (Soviet Series); Springer: Amsterdam, The Netherlands, 1991; Volume 58.
13. Manin, Y.I. What is the maximum number of points on a curve over $\mathbb{F}_2$? J. Fac. Sci. Univ. Tokyo Sect. 1A Math. 1982, 28, 715–720.
14. Tsfasman, M.A.; Vladut, S.G.; Zink, T. Modular curves, Shimura curves, and Goppa codes, better than Varshamov–Gilbert bound. Math. Nachr. 1982, 109, 21–28.
15. Vladut, S.G.; Drinfel’d, V.G. Number of points of an algebraic curve. Funct. Anal. Appl. 1983, 17, 68–69.
16. Manin, Y.I.; Marcolli, M. Kolmogorov complexity and the asymptotic bound for error-correcting codes. J. Differ. Geom. 2014, 97, 91–108.
17. Bane, M. Quantifying and measuring morphological complexity. In Proceedings of the 26th West Coast Conference on Formal Linguistics, Berkeley, CA, USA, 27–29 April 2007.
18. Clark, R. Kolmogorov Complexity and the Information Content of Parameters; Institute for Research in Cognitive Science: Philadelphia, PA, USA, 1994.
19. Tuza, Z. On the context-free production complexity of finite languages. Discret. Appl. Math. 1987, 18, 293–304.
20. Barton, G.E.; Berwick, R.C.; Ristad, E.S. Computational Complexity and Natural Language; MIT Press: Cambridge, MA, USA, 1987.
21. Sampson, G.; Gil, D.; Trudgill, P. (Eds.) Language Complexity as an Evolving Variable; Oxford University Press: Oxford, UK, 2009.
22. Longobardi, G. A minimalist program for parametric linguistics? In Organizing Grammar: Linguistic Studies in Honor of Henk van Riemsdijk; Broekhuis, H., Corver, N., Huybregts, M., Kleinhenz, U., Koster, J., Eds.; Mouton de Gruyter: Berlin, Germany, 2005; pp. 407–414.
23. Clark, R.; Roberts, I. A computational model of language learnability and language change. Linguist. Inq. 1993, 24, 299–345.
24. Manin, Y.I.; Marcolli, M. Error-correcting codes and phase transitions. Math. Comput. Sci. 2011, 5, 133–170.
25. Manin, Y.I. A Computability Challenge: Asymptotic Bounds and Isolated Error-Correcting Codes. 2011.
26. Barg, A.; Forney, G.D. Random codes: Minimum distances and error exponents. IEEE Trans. Inf. Theory 2002, 48, 2568–2573.
27. Coffey, J.T.; Goodman, R.M. Any code of which we cannot think is good. IEEE Trans. Inf. Theory 1990, 36, 1453–1461.
28. Manin, Y.I. Complexity vs Energy: Theory of Computation and Theoretical Physics. 2014.
29. Baker, M.C. The Atoms of Language: The Mind’s Hidden Rules of Grammar; Basic Books: New York, NY, USA, 2001.
30. Li, M.; Vitányi, P. An Introduction to Kolmogorov Complexity and Its Applications; Springer: New York, NY, USA, 2008.
31. Grünwald, P.; Vitányi, P. Shannon Information and Kolmogorov Complexity. 2004.
32. Manin, Y.I. A Course in Mathematical Logic for Mathematicians, 2nd ed.; Springer: New York, NY, USA, 2010.
33. Bennett, C.; Gacs, P.; Li, M.; Vitányi, P.; Zurek, W. Information distance. IEEE Trans. Inf. Theory 1998, 44, 1407–1423.
34. Delahaye, J.P. Complexité Aléatoire et Complexité Organisée; Éditions Quæ: Paris, France, 2009. (In French)
35. Gell-Mann, M.; Lloyd, S. Information measures, effective complexity, and total information. Complexity 1996, 2, 44–52.
36. Marcolli, M.; Perez, C. Codes as fractals and noncommutative spaces. Math. Comput. Sci. 2012, 6, 199–215.

© 2016 by the author; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons by Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Marcolli, M.
Syntactic Parameters and a Coding Theory Perspective on Entropy and Complexity of Language Families. *Entropy* **2016**, *18*, 110.
https://doi.org/10.3390/e18040110
