# Using the Relative Entropy of Linguistic Complexity to Assess L2 Language Proficiency Development

## Abstract

## 1. Introduction

- From the perspective of information gain, how distinct are the differences in language proficiency between lower-level and higher-level L2 learners, compared with the differences between intermediate-level and higher-level L2 learners?
- Does the algorithm of relative entropy have advantages over the frequency-based algorithms for lexical and syntactic complexity in detecting development patterns of L2 language proficiency?

## 2. Background

#### 2.1. Linguistic Complexity and the Development of Language Proficiency in L2

#### 2.2. Relative Entropy

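$$
\begin{aligned}
\mathrm{KLD}(1820\text{s} \,\|\, 1810\text{s}) &= \sum p(\text{“take”} \mid 1820\text{s}) \left( \log_{2} p(\text{“take”} \mid 1820\text{s}) - \log_{2} p(\text{“take”} \mid 1810\text{s}) \right) \\
&= \sum p(\text{“take”} \mid 1820\text{s}) \log_{2} p(\text{“take”} \mid 1820\text{s}) - \sum p(\text{“take”} \mid 1820\text{s}) \log_{2} p(\text{“take”} \mid 1810\text{s})
\end{aligned}
$$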

- p(“take”|1820s) = 556/1435 = 0.3875,
- p(“took”|1820s) = 327/1435 = 0.2279,
- p(“taken”|1820s) = 345/1435 = 0.2404,
- p(“taking”|1820s) = 144/1435 = 0.1003,
- p(“takes”|1820s) = 63/1435 = 0.0439.

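$$
\begin{aligned}
\sum p(\text{“take”} \mid 1820\text{s}) \log_{2} p(\text{“take”} \mid 1820\text{s}) &= 0.3875 \log_{2}(0.3875) + 0.2279 \log_{2}(0.2279) + 0.2404 \log_{2}(0.2404) \\
&\quad + 0.1003 \log_{2}(0.1003) + 0.0439 \log_{2}(0.0439) \\
&= -2.041431
\end{aligned}
$$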

- p(“take”|1810s) = 728/1415 = 0.5145,
- p(“took”|1810s) = 168/1415 = 0.1187,
- p(“taken”|1810s) = 228/1415 = 0.1611,
- p(“taking”|1810s) = 133/1415 = 0.094,
- p(“takes”|1810s) = 158/1415 = 0.1117.

$$
\sum p(\text{“take”} \mid 1820\text{s}) \log_{2} p(\text{“take”} \mid 1810\text{s}) = -2.1864
$$
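The arithmetic above can be checked with a short script. This is a minimal sketch that assumes the sums run over the five forms of “take” and uses the raw counts from the frequency table; the variable names are illustrative only. Subtracting the second sum from the first gives a KLD of roughly 0.145 bits for the 1820s relative to the 1810s.

```python
import math

# Counts of the five forms of "take" per decade, copied from the frequency table.
counts_1810s = {"take": 728, "took": 168, "taken": 228, "taking": 133, "takes": 158}
counts_1820s = {"take": 556, "took": 327, "taken": 345, "taking": 144, "takes": 63}


def to_probs(counts):
    """Turn raw counts into relative frequencies."""
    total = sum(counts.values())
    return {form: n / total for form, n in counts.items()}


p = to_probs(counts_1820s)  # newer decade
q = to_probs(counts_1810s)  # older decade, used as the reference encoding

# First sum: p * log2(p). Second sum: p * log2(q). The KLD is their difference.
self_term = sum(p[w] * math.log2(p[w]) for w in p)
cross_term = sum(p[w] * math.log2(q[w]) for w in p)
kld = self_term - cross_term

print(f"sum p*log2(p) = {self_term:.4f}")
print(f"sum p*log2(q) = {cross_term:.4f}")
print(f"KLD(1820s || 1810s) = {kld:.4f} bits")
```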

## 3. Materials and Methods

#### 3.1. Material

#### 3.2. Method

- Relative entropy and the discrimination of information distribution.

“$\mathrm{Level}_h$” refers to L2 learners at a higher level, and “$\mathrm{Level}_l$” to L2 learners at a lower level.

“$\mathrm{Level}_l$” is the distribution of linguistic phenomena that learners have encountered at a lower level, and “$\mathrm{Level}_h$” is the new distribution that learners will encounter at a higher level. More importantly, the algorithm of relative entropy examines the information differences between the same linguistic units as encoded by two groups of L2 learners. This avoids a problem that characterized previous studies, which ignored the weights of different units and simply placed them under the same category.

The KLD measures the number of additional bits needed to encode “L2 learners at a higher level ($\mathrm{Level}_h$)” by using an encoding optimized for “L2 learners at a lower level ($\mathrm{Level}_l$)”. When applied to the comparison of sub-corpora of the EFCAMDAT2, the KLD indicates, in bits, the degree of difference between two sub-corpora (representing two groups of L2 learners), and it identifies the linguistic units that are primarily associated with that difference. That is to say, a high KLD indicates that the linguistic units need many additional bits for encoding. The KLD can therefore serve as an indicator of change when we slide across the groups of L2 learners in the EFCAMDAT2 and compare adjacent groups.
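To illustrate how the KLD can point to the units that drive a difference, the sketch below splits the total into per-unit contributions. It is not the authors' pipeline: the function name, the toy token lists, and the additive smoothing constant `alpha` (used so that units unseen at one level do not produce a logarithm of zero) are all assumptions made for this example.

```python
import math
from collections import Counter


def unit_contributions(units_high, units_low, alpha=1.0):
    """Per-unit contribution (in bits) to KLD(Level_h || Level_l).

    alpha is an additive-smoothing constant (an assumption of this sketch)
    that keeps units unseen at one level from producing log(0).
    """
    high, low = Counter(units_high), Counter(units_low)
    vocab = set(high) | set(low)
    n_high = sum(high.values()) + alpha * len(vocab)
    n_low = sum(low.values()) + alpha * len(vocab)
    contrib = {}
    for unit in vocab:
        p = (high[unit] + alpha) / n_high  # smoothed probability at the higher level
        q = (low[unit] + alpha) / n_low    # smoothed probability at the lower level
        contrib[unit] = p * math.log2(p / q)
    return contrib


# Toy token lists standing in for two sub-corpora (invented for illustration).
lower = "i like cat i like dog i have cat".split()
higher = "although i like cats i prefer dogs because dogs are loyal".split()

contrib = unit_contributions(higher, lower)
kld = sum(contrib.values())
top_units = sorted(contrib, key=contrib.get, reverse=True)[:3]
print(f"KLD = {kld:.3f} bits; largest contributors: {top_units}")
```

The same function could be fed POS trigrams or lemmas instead of tokens, since it only needs a sequence of hashable units.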

- Language units (measures).

- Traditional approaches to lexical/syntactic complexity and stationary time series.

## 4. Results

#### 4.1. The Results from the KLD

#### 4.2. The Results from Syntactic and Lexical Complexity

## 5. Discussion

#### 5.1. Conflicting Results from Different Studies

#### 5.2. The Developmental Patterns of Language Proficiency in L2 Learners

#### 5.3. Consistency with the Other Measures

## 6. Conclusions

## Appendix A. JSD Algorithm

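$$
\mathrm{JSD}(\mathrm{Level}_h \,\|\, \mathrm{Level}_l) = \frac{1}{2}\,\mathrm{KLD}\!\left(\mathrm{Level}_h \,\Big\|\, \frac{\mathrm{Level}_h + \mathrm{Level}_l}{2}\right) + \frac{1}{2}\,\mathrm{KLD}\!\left(\mathrm{Level}_l \,\Big\|\, \frac{\mathrm{Level}_h + \mathrm{Level}_l}{2}\right).
$$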
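The JSD formula translates directly into code. The following is a minimal sketch, assuming each level is given as a dictionary of unit probabilities; the POS-trigram labels and probabilities are invented for illustration. Unlike the KLD, the JSD is symmetric and stays finite when a unit is missing from one level, because each level is compared with the mixture of the two.

```python
import math


def kld(p, q):
    """KLD(p || q) in bits; terms with p[k] == 0 contribute nothing."""
    return sum(pk * math.log2(pk / q[k]) for k, pk in p.items() if pk > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: mean KLD of p and q from their mixture."""
    keys = set(p) | set(q)
    p = {k: p.get(k, 0.0) for k in keys}
    q = {k: q.get(k, 0.0) for k in keys}
    m = {k: 0.5 * (p[k] + q[k]) for k in keys}  # mixture (Level_h + Level_l) / 2
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)


# Hypothetical POS-trigram distributions for two levels (invented values).
level_h = {"NN-VB-NN": 0.5, "DT-NN-VB": 0.3, "IN-DT-NN": 0.2}
level_l = {"NN-VB-NN": 0.7, "DT-NN-VB": 0.3}

print(f"JSD = {jsd(level_h, level_l):.4f} bits")
```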

## Appendix B. The Results from the JSD

Cross-Proficiency Levels of L2 | Level Group | JSD of Grammar: POS-Trigram | (coef, p) | JSD of Grammar: Sub-Conj. | (coef, p) | JSD of Lexicon: Token | (coef, p) | JSD of Lexicon: Lemma | (coef, p) |
---|---|---|---|---|---|---|---|---|---|
A1→A2 | A1→(A2, B1, B2, C1) | 0.05 | (coef = 0.03, p = 0.003) | 0.09 | (coef = 0.007, p = 0.34) | 0.17 | (coef = 0.003, p = 0.23) | 0.06 | (coef = 0.02, p = 0.17) |
A1→B1 | | 0.08 | | 0.117 | | 0.18 | | 0.11 | |
A1→B2 | | 0.1 | | 0.1 | | 0.18 | | 0.13 | |
A1→C1 | | 0.13 | | 0.12 | | 0.18 | | 0.12 | |
A2→B1 | A2→(B1, B2, C1) | 0.03 | (coef = 0.2, p = 0.18) | 0.04 | (coef = 0.02, p < 0.001) | 0.11 | | 0.08 | (coef = 0.005, p = 0.33) |
A2→B2 | | 0.04 | | 0.06 | | 0.12 | | 0.09 | |
A2→C1 | | 0.07 | | 0.08 | | 0.1 | | 0.09 | |
B1→B2 | B1→(B2, C1) | 0.02 | | 0.014 | | 0.1 | | 0.06 | |
B1→C1 | | 0.03 | | 0.017 | | 0.09 | | 0.06 | |
B2→C1 | B2→(C1) | 0.02 | | 0.009 | | 0.07 | | 0.05 | |

Cross-Proficiency Levels of L2 | Level Group | JSD of Grammar: POS-Trigram | (coef, p) | JSD of Grammar: Sub-Conj. | (coef, p) | JSD of Lexicon: Token | (coef, p) | JSD of Lexicon: Lemma | (coef, p) |
---|---|---|---|---|---|---|---|---|---|
A1→C1 | (A1, A2, B1, B2)→C1 | 0.13 | (coef = −0.04, p = 0.043) | 0.12 | (coef = −0.04, p = 0.034) | 0.18 | (coef = −0.034, p = 0.09) | 0.11 | (coef = −0.02, p = 0.015) |
A2→C1 | | 0.07 | | 0.075 | | 0.1 | | 0.09 | |
B1→C1 | | 0.03 | | 0.017 | | 0.09 | | 0.06 | |
B2→C1 | | 0.02 | | 0.009 | | 0.07 | | 0.05 | |
A1→B2 | (A1, A2, B1)→B2 | 0.1 | (coef = −0.04, p = 0.26) | 0.1 | (coef = −0.043, p = 0.008) | 0.18 | (coef = −0.04, p = 0.18) | 0.13 | (coef = −0.035, p = 0.05) |
A2→B2 | | 0.03 | | 0.058 | | 0.12 | | 0.09 | |
B1→B2 | | 0.02 | | 0.014 | | 0.1 | | 0.06 | |
A1→B1 | (A1, A2)→B1 | 0.08 | | 0.12 | | 0.18 | | 0.1 | |
A2→B1 | | 0.03 | | 0.04 | | 0.11 | | 0.08 | |
A1→A2 | (A1)→A2 | 0.05 | | 0.09 | | 0.17 | | 0.06 | |

Cross-Proficiency Levels of L2 | Level Group | JSD of Grammar: POS-Trigram | (coef, p) | JSD of Grammar: Sub-Conj. | (coef, p) | JSD of Lexicon: Token | (coef, p) | JSD of Lexicon: Lemma | (coef, p) |
---|---|---|---|---|---|---|---|---|---|
A1→A2 | adjacent levels | 0.05 | (coef = −0.01, p = 0.087) | 0.09 | (coef = −0.027, p = 0.06) | 0.17 | (coef = −0.03, p = 0.046) | 0.06 | (coef = −0.005, p = 0.48) |
A2→B1 | | 0.03 | | 0.04 | | 0.11 | | 0.08 | |
B1→B2 | | 0.02 | | 0.015 | | 0.1 | | 0.06 | |
B2→C1 | | 0.02 | | 0.009 | | 0.07 | | 0.05 | |

## Appendix C. The Data on Syntactic and Lexical Complexity

**Table A4.** Syntactic complexity for each L2 level in EFCAMDAT2 and the differences between L2 levels.

Levels | MLS | MLT | MLC | C/S | VP/T | C/T | DC/C | DC/T | T/S | CT/T | CP/T | CP/C | CN/T | CN/C |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

A1 | 8.56 | 8.17 | 7.04 | 1.21 | 1.3 | 1.16 | 0.11 | 0.13 | 1.05 | 0.09 | 0.24 | 0.2 | 0.58 | 0.5 |

A2 | 11.59 | 10.09 | 7.58 | 1.53 | 1.33 | 1.33 | 0.21 | 0.29 | 1.15 | 0.23 | 0.32 | 0.24 | 0.76 | 0.57 |

B1 | 13.28 | 12.01 | 8.53 | 1.56 | 1.41 | 1.41 | 0.27 | 0.37 | 1.11 | 0.3 | 0.3 | 0.21 | 1.09 | 0.78 |

B2 | 14.85 | 13.6 | 8.93 | 1.66 | 1.5 | 1.52 | 0.33 | 0.5 | 1.09 | 0.34 | 0.32 | 0.21 | 1.34 | 0.88 |

C1 | 16.32 | 14.58 | 9.54 | 1.71 | 1.53 | 1.53 | 0.33 | 0.5 | 1.12 | 0.34 | 0.39 | 0.26 | 1.56 | 1.02 |

A1_C1 | 7.76 | 6.41 | 2.5 | 0.49 | 0.23 | 0.37 | 0.22 | 0.37 | 0.07 | 0.24 | 0.16 | 0.05 | 0.98 | 0.52 |

A2_C1 | 4.72 | 4.49 | 1.96 | 0.18 | 0.2 | 0.2 | 0.11 | 0.21 | −0.03 | 0.11 | 0.08 | 0.02 | 0.8 | 0.45 |

B1_C1 | 3.04 | 2.56 | 1.01 | 0.15 | 0.12 | 0.12 | 0.06 | 0.13 | 0.01 | 0.04 | 0.09 | 0.04 | 0.47 | 0.24 |

B2_C1 | 1.47 | 0.98 | 0.61 | 0.05 | 0.02 | 0 | 0 | −0.01 | 0.03 | −0.01 | 0.07 | 0.04 | 0.22 | 0.14 |

A2_A1 | 3.04 | 1.93 | 0.54 | 0.31 | 0.3 | 0.17 | 0.1 | 0.16 | 0.1 | 0.13 | 0.08 | 0.03 | 0.17 | 0.07 |

B1_A1 | 4.72 | 3.85 | 1.49 | 0.34 | 0.53 | 0.25 | 0.16 | 0.25 | 0.06 | 0.2 | 0.06 | 0.01 | 0.51 | 0.27 |

B2_A1 | 6.29 | 5.44 | 1.89 | 0.45 | 0.75 | 0.36 | 0.22 | 0.38 | 0.04 | 0.29 | 0.09 | 0.01 | 0.76 | 0.38 |

C1_A1 | 7.76 | 6.41 | 2.5 | 0.49 | 0.77 | 0.37 | 0.22 | 0.37 | 0.07 | 0.28 | 0.16 | 0.05 | 0.98 | 0.52 |

A2_A1 | 3.04 | 1.93 | 0.54 | 0.31 | 0.3 | 0.17 | 0.1 | 0.16 | 0.1 | 0.13 | 0.08 | 0.03 | 0.17 | 0.07 |

B1_A2 | 1.68 | 1.92 | 0.95 | 0.03 | 0.23 | 0.08 | 0.05 | 0.09 | −0.04 | 0.07 | −0.01 | −0.02 | 0.34 | 0.21 |

B2_B1 | 1.57 | 1.59 | 0.4 | 0.11 | 0.22 | 0.12 | 0.07 | 0.13 | −0.01 | 0.08 | 0.02 | 0 | 0.25 | 0.1 |

C1_B2 | 1.47 | 0.98 | 0.61 | 0.05 | 0.02 | 0 | 0 | −0.01 | 0.03 | −0.01 | 0.07 | 0.04 | 0.22 | 0.14 |

**Table A5.** Lexical complexity for each L2 level in EFCAMDAT2 and the differences between L2 levels.

Levels | ld | ls1 | ls2 | vs1 | vs2 | cvs1 | ttr | msttr | cttr | rttr | logttr | uber | lv | vv1 | svv1 | cvv1 | vv2 | nv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

A1 | 0.54 | 0.56 | 0.88 | 0.02 | 43.42 | 4.66 | 0.03 | 0.73 | 17.32 | 24.49 | 0.74 | 22.29 | 0.05 | 0.02 | 60.94 | 5.52 | 0.01 | 0.07 |

A2 | 0.53 | 0.52 | 0.86 | 0.04 | 71.49 | 5.98 | 0.04 | 0.75 | 17.55 | 24.82 | 0.75 | 22.32 | 0.08 | 0.04 | 106.91 | 7.31 | 0.01 | 0.11 |

B1 | 0.52 | 0.47 | 0.93 | 0.02 | 143.27 | 8.46 | 0.02 | 0.79 | 19.82 | 28.04 | 0.72 | 23.45 | 0.03 | 0.02 | 167.3 | 9.15 | 0.01 | 0.04 |

B2 | 0.51 | 0.48 | 0.96 | 0.01 | 180.81 | 9.51 | 0.01 | 0.79 | 20.61 | 29.14 | 0.71 | 23.98 | 0.02 | 0.01 | 198.99 | 9.97 | 0 | 0.03 |

C1 | 0.52 | 0.49 | 0.93 | 0.02 | 186.36 | 9.65 | 0.02 | 0.8 | 20.15 | 28.5 | 0.73 | 23.47 | 0.03 | 0.02 | 218.08 | 10.44 | 0.01 | 0.05 |

A2_A1 | −0.01 | −0.04 | −0.02 | 0.02 | 28.07 | 1.32 | 0.01 | 0.02 | 0.23 | 0.33 | 0.01 | 0.03 | 0.03 | 0.02 | 45.97 | 1.79 | 0 | 0.04 |

B1_A1 | −0.02 | −0.09 | 0.05 | 0 | 99.85 | 3.8 | −0.01 | 0.06 | 2.5 | 3.55 | −0.02 | 1.16 | −0.02 | 0 | 106.36 | 3.63 | 0 | −0.03 |

B2_A1 | −0.03 | −0.08 | 0.08 | −0.01 | 137.39 | 4.85 | −0.02 | 0.06 | 3.29 | 4.65 | −0.03 | 1.69 | −0.03 | −0.01 | 138.05 | 4.45 | −0.01 | −0.04 |

C1_A1 | −0.02 | −0.07 | 0.05 | 0 | 142.94 | 4.99 | −0.01 | 0.07 | 2.83 | 4.01 | −0.01 | 1.18 | −0.02 | 0 | 157.14 | 4.92 | 0 | −0.02 |

A2_A1 | −0.01 | −0.04 | −0.02 | 0.02 | 28.07 | 1.32 | 0.01 | 0.02 | 0.23 | 0.33 | 0.01 | 0.03 | 0.03 | 0.02 | 45.97 | 1.79 | 0 | 0.04 |

B1_A2 | −0.01 | −0.05 | 0.07 | −0.02 | 71.78 | 2.48 | −0.02 | 0.04 | 2.27 | 3.22 | −0.03 | 1.13 | −0.05 | −0.02 | 60.39 | 1.84 | 0 | −0.07 |

B2_B1 | −0.01 | 0.01 | 0.03 | −0.01 | 37.54 | 1.05 | −0.01 | 0 | 0.79 | 1.1 | −0.01 | 0.53 | −0.01 | −0.01 | 31.69 | 0.82 | −0.01 | −0.01 |

C1_B2 | 0.01 | 0.01 | −0.03 | 0.01 | 5.55 | 0.14 | 0.01 | 0.01 | −0.46 | −0.64 | 0.02 | −0.51 | 0.01 | 0.01 | 19.09 | 0.47 | 0.01 | 0.02 |
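Several columns of Table A5 (ttr, cttr, rttr, logttr, uber) are type-token measures. The sketch below uses their commonly cited definitions; the source does not spell out its formulas, so the definitions, the function name, and the base-10 logarithm in the Uber index are assumptions of this example. One supporting observation is that the rttr and cttr columns in the table differ by a factor of about √2, as these definitions predict.

```python
import math


def ttr_family(tokens):
    """Type-token measures for a list of word tokens.

    Assumes at least two tokens and at least one repeated token,
    because logttr and uber are undefined otherwise.
    """
    n = len(tokens)       # number of word tokens (N)
    t = len(set(tokens))  # number of word types (T)
    return {
        "ttr": t / n,                              # type-token ratio
        "cttr": t / math.sqrt(2 * n),              # corrected TTR
        "rttr": t / math.sqrt(n),                  # root TTR
        "logttr": math.log(t) / math.log(n),       # bilogarithmic TTR
        "uber": math.log10(n) ** 2 / math.log10(n / t),  # Uber index, base 10 assumed
    }


# Toy sentence (invented): 11 tokens, 5 types.
sample = "the cat saw the dog and the dog saw the cat".split()
measures = ttr_family(sample)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```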

**Figure 1.** Relative entropy among L2 learners across different proficiency levels (EFCAMDAT2). Note that the JSD results are also visualized in this figure.

**Figure 2.** Differences in syntactic complexity between L2 levels. Each L2 level (A1, A2, B1, B2, C1) can be treated as a point in time order (date). The x-axis shows the syntactic complexity measures, and the y-axis shows the difference in each measure across proficiency levels (discussed in the Methods section). The top-left plot shows that MLC, MLS, and MLT increase gradually: B1_A1 is higher than A2_A1, B2_A1 is higher than B1_A1, and C1_A1 is higher than B2_A1. A regular increase in a metric indicates that the measure can detect patterns of L2 proficiency development. By contrast, in the top-right plot, such a regular increase is found in only 4 of the 12 metrics. In the bottom two plots, none of the metrics shows a regular increase. Irregular changes suggest that these metrics cannot capture the patterns of L2 proficiency development.
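The notion of a regular increase can be made concrete with a small check. The sketch below reads it as a strict increase across the differences measured against A1; that reading, the function name, and the choice of metrics are assumptions of this example. The values are copied from the A2_A1, B1_A1, B2_A1, and C1_A1 rows of Table A4.

```python
# Differences relative to A1, in the order A2_A1, B1_A1, B2_A1, C1_A1 (Table A4).
diffs_vs_a1 = {
    "MLS": [3.04, 4.72, 6.29, 7.76],
    "MLT": [1.93, 3.85, 5.44, 6.41],
    "MLC": [0.54, 1.49, 1.89, 2.5],
    "T/S": [0.1, 0.06, 0.04, 0.07],
    "CP/C": [0.03, 0.01, 0.01, 0.05],
}


def is_regular_increase(values):
    """True when every value is strictly larger than the one before it."""
    return all(later > earlier for earlier, later in zip(values, values[1:]))


regular = {metric: is_regular_increase(v) for metric, v in diffs_vs_a1.items()}
for metric, flag in regular.items():
    print(f"{metric}: {'regular increase' if flag else 'irregular'}")
```

Under this reading, the three length-based metrics pass the check, while T/S and CP/C do not.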

Word | 1810s | 1820s | 1830s | 1840s |
---|---|---|---|---|

Take | 728 | 556 | 665 | 529 |

Took | 168 | 327 | 351 | 333 |

Taken | 228 | 345 | 344 | 324 |

Taking | 133 | 144 | 164 | 165 |

Takes | 158 | 63 | 76 | 86 |

Total | 1415 | 1435 | 1600 | 1437 |

**Table 2.** Composition of the five sub-corpora of the essays section of the EFCAMDAT2 by language proficiency level.

L2 Learners’ Proficiency Levels | Texts | Learners | Tokens | Lemmas |
---|---|---|---|---|

A1 | 625,985 | 103,742 | 28.8 M | 27,065 |

A2 | 307,996 | 52,734 | 24 M | 32,051 |

B1 | 168,361 | 32,852 | 18.4 M | 26,276 |

B2 | 61,329 | 13,951 | 9.3 M | 21,312 |

C1 | 14,698 | 2839 | 2.8 M | 16,464 |

Cross-Proficiency Levels of L2 | Level Group | KLD of Grammar: POS-Trigram | (coef, p) | KLD of Grammar: Sub-Conj. | (coef, p) | KLD of Lexicon: Token | (coef, p) | KLD of Lexicon: Lemma | (coef, p) |
---|---|---|---|---|---|---|---|---|---|
A1→A2 | A1→(A2, B1, B2, C1) | 0.37 | (coef = 0.21, p < 0.001) | 0.56 | (coef = 0.07, p = 0.31) | 1.41 | (coef = 0.03, p = 0.18) | 0.52 | (coef = 0.15, p = 0.2) |
A1→B1 | | 0.58 | | 0.81 | | 1.41 | | 0.84 | |
A1→B2 | | 0.78 | | 0.68 | | 1.42 | | 1.09 | |
A1→C1 | | 1.0 | | 0.82 | | 1.52 | | 0.93 | |
A2→B1 | A2→(B1, B2, C1) | 0.2 | (coef = 0.16, p = 0.13) | 0.24 | (coef = 0.12, p = 0.016) | 1.29 | (coef = −0.25, p = 0.13) | 0.71 | (coef = 0.025, p = 0.66) |
A2→B2 | | 0.3 | | 0.35 | | 1.13 | | 0.81 | |
A2→C1 | | 0.51 | | 0.47 | | 0.78 | | 0.76 | |
B1→B2 | B1→(B2, C1) | 0.13 | | 0.085 | | 1.1 | | 0.62 | |
B1→C1 | | 0.21 | | 0.11 | | 0.87 | | 0.54 | |
B2→C1 | B2→(C1) | 0.16 | | 0.052 | | 0.52 | | 0.51 | |
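The table reports a coefficient and a p-value for each group of comparisons but not how they were obtained. As a hedged illustration, the sketch below assumes that "coef" is an ordinary least-squares slope of the KLD values against their position in the group (1, 2, 3, 4); the function name is hypothetical and the p-value is not reproduced. For the A1→(A2, B1, B2, C1) group, the slopes come out close to the reported 0.21, 0.07, 0.03, and 0.15.

```python
def ols_slope(values):
    """Least-squares slope of values against their 1-based position."""
    xs = list(range(1, len(values) + 1))
    x_mean = sum(xs) / len(xs)
    y_mean = sum(values) / len(values)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


# KLD values for A1→A2, A1→B1, A1→B2, A1→C1, copied from the table.
kld_from_a1 = {
    "POS-Trigram": [0.37, 0.58, 0.78, 1.0],
    "Sub-Conj.": [0.56, 0.81, 0.68, 0.82],
    "Token": [1.41, 1.41, 1.42, 1.52],
    "Lemma": [0.52, 0.84, 1.09, 0.93],
}

slopes = {unit: ols_slope(values) for unit, values in kld_from_a1.items()}
for unit, slope in slopes.items():
    print(f"{unit}: slope = {slope:.3f}")
```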

Cross-Proficiency Levels of L2 | Level Group | KLD of Grammar: POS-Trigram | (coef, p) | KLD of Grammar: Sub-Conj. | (coef, p) | KLD of Lexicon: Token | (coef, p) | KLD of Lexicon: Lemma | (coef, p) |
---|---|---|---|---|---|---|---|---|---|
A1→C1 | (A1, A2, B1, B2)→C1 | 1.0 | (coef = −0.28, p = 0.056) | 0.82 | (coef = −0.27, p = 0.034) | 1.52 | (coef = −0.29, p = 0.12) | 0.93 | (coef = −0.15, p = 0.034) |
A2→C1 | | 0.51 | | 0.47 | | 0.78 | | 0.76 | |
B1→C1 | | 0.21 | | 0.11 | | 0.87 | | 0.54 | |
B2→C1 | | 0.16 | | 0.05 | | 0.52 | | 0.51 | |
A1→B2 | (A1, A2, B1)→B2 | 0.78 | (coef = −0.33, p = 0.171) | 0.68 | (coef = −0.3, p = 0.04) | 1.42 | (coef = −0.16, p = 0.28) | 1.09 | (coef = −0.24, p = 0.07) |
A2→B2 | | 0.3 | | 0.35 | | 1.13 | | 0.81 | |
B1→B2 | | 0.13 | | 0.085 | | 1.1 | | 0.62 | |
A1→B1 | (A1, A2)→B1 | 0.58 | | 0.81 | | 1.41 | | 0.84 | |
A2→B1 | | 0.2 | | 0.24 | | 1.28 | | 0.71 | |
A1→A2 | (A1)→A2 | 0.37 | | 0.56 | | 1.41 | | 0.52 | |

Cross-Proficiency Levels of L2 | Level Group | KLD of Grammar: POS-Trigram | (coef, p) | KLD of Grammar: Sub-Conj. | (coef, p) | KLD of Lexicon: Token | (coef, p) | KLD of Lexicon: Lemma | (coef, p) |
---|---|---|---|---|---|---|---|---|---|
A1→A2 | adjacent levels | 0.37 | (coef = −0.07, p = 0.16) | 0.56 | (coef = −0.16, p = 0.086) | 1.41 | (coef = −0.28, p = 0.063) | 0.52 | |
A2→B1 | | 0.2 | | 0.24 | | 1.28 | | 0.71 | |
B1→B2 | | 0.13 | | 0.085 | | 1.1 | | 0.62 | |
B2→C1 | | 0.16 | | 0.081 | | 0.52 | | 0.51 | |

