Article

The Influence of Input Frequency and L2 Proficiency on the Representation of Collocations for Chinese EFL Learners

School of Foreign Languages, Ocean University of China, Qingdao 266005, China
Behav. Sci. 2025, 15(1), 46; https://doi.org/10.3390/bs15010046
Submission received: 7 November 2024 / Revised: 29 December 2024 / Accepted: 2 January 2025 / Published: 4 January 2025

Abstract

Collocations typically refer to habitual word combinations, which not only occur in texts but also constitute an essential component of the mental lexicon. This study focuses on the mental lexicon of Chinese learners of English as a foreign language (EFL), investigating the representation of collocations and the influence of input frequency and L2 proficiency by employing a phrasal decision task. The findings reveal the following: (1) Collocations elicited faster response times and higher accuracy rates than non-collocations. (2) Higher input frequency improved the accuracy of judgments. (3) High-proficiency Chinese EFL learners exhibited higher accuracy and faster response times in collocation judgments. (4) Input frequency and L2 proficiency interactively affected both response time and accuracy rate. These results indicate that L2 learners have a processing advantage for collocations, which function as independent entries in the mental lexicon. Both input frequency and L2 proficiency are crucial factors in collocational representation, with increased input frequency and proficiency shifting the representation from analytic retrieval toward holistic recognition along a continuum.

1. Introduction

Collocation is indicative of the horizontal relationship between words, integrating the lexical, grammatical, semantic, and pragmatic knowledge of languages, playing a crucial role in language use and cross-linguistic communication (Nation, 2001). It is widespread in texts and reflects the construction and development of lexical networks of language users’ mental lexicons (Ellis et al., 2009). In linguistic systems, collocations function as essential units of language processing, with their distributional information identifiable throughout comprehension and production (Siyanova-Chanturia, 2015). Most research on collocations has primarily focused on native speakers, with limited attention to second language (L2) learners. Recent studies have shifted the focus to investigating the L2 mental lexicon, particularly how L2 collocations are represented under the influence of various linguistic and non-linguistic factors (Cao, 2016; Gyllstad & Wolter, 2016; Öksüz et al., 2024). However, there is ongoing debate regarding whether collocations are represented holistically or analytically in the mental lexicons of Chinese EFL learners. Moreover, the impact of input frequency and L2 proficiency remains under-explored. On the one hand, the frequency of collocation extracted from the native speaker corpus lacks representativeness in reflecting the input of L2 collocations for Chinese EFL learners. On the other hand, the current studies are heavily skewed toward adult learners with higher L2 proficiency (especially undergraduate and graduate students), while adolescent learners with lower proficiency have received little attention. 
Therefore, utilizing a corpus-based psycholinguistic paradigm and extracting collocation frequency from a self-constructed corpus of English textbooks for Chinese EFL learners, this study aims to examine the impact of input frequency and L2 proficiency on the representation of collocations, targeting both lower-proficiency adolescent and higher-proficiency adult Chinese EFL learners. The study focuses on collocations whose meaning and form are consistent in English and Chinese. For instance, the English phrase “similar concept” corresponds to the Chinese “相似的概念”. Both are adjective–noun word pairs, with “相似的” and “概念” serving as the translation equivalents of “similar” and “concept”, respectively.

2. Literature Review

Collocations have long been a central topic in linguistic research, garnering significant scholarly attention and resulting in a substantial body of work. Although the term “collocation” was first introduced in the 1930s (Palmer, 1933), it was J. R. Firth who clearly defined and elaborated it in 1957. Firth distinguished collocation from grammar, defining it as the habitual juxtaposition of lexical items, which refers to the co-occurrence of words. His famous assertion, “You shall know a word by the company it keeps” (Firth, 1957, p. 179), underscores the importance of collocation. Afterward, scholars have approached collocations from the lexical, structural, and semantic perspectives, leading to diverse definitions and classification systems. Two dominant approaches to defining collocations have emerged in the state-of-the-art literature: the phraseological approach and the frequency-based approach (Gablasova et al., 2017; Howarth, 1998; McEnery & Hardie, 2012; Nesselhauf, 2005; Öksüz et al., 2024). The former emphasizes semantic associations between words which contribute to compositional meaning (Howarth, 1998; Nesselhauf, 2005), while the latter draws on quantitative evidence of word co-occurrence in corpora within a limited range (Gablasova et al., 2017; McEnery & Hardie, 2012; Öksüz et al., 2024).
In the realm of theoretical inquiry, a central debate revolves around how language users cognitively process collocations. Generative linguistics and functional linguistics provide contrasting explanations regarding whether multi-word sequences are processed in the same way as single words. The generative approach (Chomsky, 1995; Pinker, 1999) advocates a dual-system model, suggesting that lexical knowledge and grammar are represented separately in the brain. The storage and retrieval of lexical knowledge rely on declarative memory, while multi-word sequences (including collocations) that involve the use of rule-governed aspects of grammar are supported by procedural memory (Ullman, 2001; Ullman, 2004). In contrast, the usage-based approach (Bybee, 2007; Langacker, 1987) denies the existence of abstract rules that are independent of language use. It is believed that language acquisition is not the learning of words and rules, but rather a process of abstracting frequency-oriented language patterns based on language experience. The usage-based approach supports a single-system model, arguing that both single words and larger linguistic units are processed through the same cognitive mechanism.
The usage-based approach has received substantial empirical support, particularly from studies examining how native speakers process multi-word sequences. The findings indicate that multi-word sequences are holistically represented in the mental lexicon of native speakers (Jiang et al., 2020; Tremblay et al., 2011). As the co-occurrence frequency of words in multi-word sequences increases, the entrenchment of the word strings as a whole in psychological representation strengthens. Native speakers exhibit sensitivity to multi-word co-occurrence frequency in both language comprehension (Hernández et al., 2016; Siyanova-Chanturia et al., 2011) and production (Arnon & Priva, 2013, 2014). Thus, collocation can be regarded as the fundamental unit in language processing and accelerator of language comprehension and production for native speakers.
Research on L2 learners indicates that fixed language forms in multi-word sequences, such as lexical bundles (e.g., in the middle of) and binomials (e.g., knife and fork), are not only ubiquitous in texts but also have psychological reality. In contrast, collocations represent a relatively flexible form of lexical combination, wherein the constituent words can form new collocations with other words. However, while collocational components exhibit a degree of flexibility, they are also subject to certain constraints, which pose greater challenges for language acquisition. How collocations are represented in the mental lexicon of L2 learners has therefore become a key issue. Along this line, Cao (2016) argued that the representation of collocations was closely linked to learners’ L2 proficiency, with low-proficiency learners relying entirely on analytic representation and high-proficiency learners employing both holistic and analytic representation. In other words, only advanced L2 learners can fully acquire collocation knowledge, while words tend to remain isolated in the mental lexicon of low-proficiency L2 learners, preventing lexical connections. These L2 proficiency-related differences in the representation of collocations were significant. However, contrary to this view, evidence from the studies by Wolter and Yamashita (2017) and Fang and Zhang (2019) showed that both the component words and the entire collocation could be activated simultaneously for high- and low-proficiency L2 learners, indicating the coexistence of holistic and analytic representation, with high-proficiency learners showing a greater reliance on the holistic representation. Thus, learners at different proficiency levels can engage in top-down processing, with differences mainly reflecting varying degrees of automaticity. In addition to L2 proficiency, collocation frequency plays a significant role in the representation of collocations.
Collocation frequency moderates the processing channels, with L2 learners favoring holistic representation for high-frequency collocations and analytic representation for low-frequency collocations (Lou, 2022; Zhang & Fang, 2020). Furthermore, the sensitivity to collocation frequency varies among learners of different L2 proficiency levels. Studies by Gyllstad and Wolter (2016), Wolter and Yamashita (2017), and Zhang and Fang (2020) suggested that L2 learners’ sensitivity to collocation frequency increased with proficiency, potentially reaching native-speaker levels. Conversely, Sonbul (2015) found that L2 learners exhibit limited sensitivity and struggle to perceive collocation frequency in L2 input, with the frequency effect remaining stable regardless of proficiency levels.
Given the mixed findings, the representation of collocations in L2 learners’ mental lexicon is unclear, particularly regarding how input frequency and L2 proficiency shape the representation of collocations. Additionally, the majority of the existing studies have focused on adult L2 learners with high proficiency, largely overlooking adolescent L2 learners with low proficiency. Furthermore, prior studies have typically extracted language materials and frequency indicators from native-speaker corpora (e.g., COCA, BNC, etc.), which may not align with the input that L2 learners are exposed to. In response, this study draws on a self-constructed corpus of English textbooks for Chinese learners (CETCL), focusing on both adolescent and adult Chinese EFL learners, to investigate the effects of input frequency and L2 proficiency on the representation of collocations. Specifically, the research questions are:
(1) How do Chinese EFL learners represent collocations in the mental lexicon?
(2) To what extent does the frequency of occurrence of particular collocations in the input and learners’ L2 proficiency influence the representation of collocations by Chinese EFL learners?

3. Method

This study adopted a two-factor mixed design using a phrasal decision task, with collocation frequency (low-frequency/high-frequency) as the within-subject variable and English proficiency (low/high) as the between-subject variable. Response time and accuracy rate were treated as the dependent variables.

3.1. Participants

Participants were recruited from a high school and a university in mainland China. To ensure the validity of the study and the comparability between the two groups, the sample comprised 40 third-year undergraduate English majors and 40 12th-grade high school students (the last year of high school in China). All the participants were native Mandarin speakers learning English as a second language, with 25 males and 55 females (mean age = 19.76). They participated in the experiment after completing their English courses in the semester. Prior to the main experiment, all the participants signed an informed consent form and filled out a personal information questionnaire (see Supplementary Table S1). This procedure ensured voluntary participation, assured the confidentiality of their personal information and responses, and collected demographic information, including age, gender, grade, starting age of learning English, length of formal English education, and duration of residence in English-speaking countries.
Subsequently, the Oxford Quick Placement Test (OQPT) was employed to determine group assignment. Based on OQPT placement criteria, 32 high school participants were assigned to the low-proficiency non-native speaker (hereafter, LNNS) group, and 34 university participants were assigned to the high-proficiency non-native speaker (hereafter, HNNS) group. To verify the comparability between the groups in terms of learning-related variables, independent samples t-tests were conducted. The results showed no significant differences in the participants’ starting age of learning English (t (64) = −1.16, p = 0.25). However, significant differences were found in the length of English learning experience (t (64) = −7.99, p < 0.001) and OQPT scores (t (64) = −9.17, p < 0.001), demonstrating that HNNS had a longer history of English study and a higher proficiency than LNNS. All the participants were right-handed with normal or corrected-to-normal vision, ensuring consistency in motor and visual conditions that could influence experimental performance. Only one participant reported prior experience traveling in an English-speaking country, and this exposure was limited to less than one month, minimizing the impact of immersive language experience on the study. The participants were compensated upon the completion of the experiment as a gesture of appreciation for their time and effort. Table 1 summarizes the participants’ background information.
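The group comparisons reported above can be illustrated with a minimal sketch of an independent-samples t statistic. The sketch assumes the equal-variance (pooled) form, which is only inferred here from the reported degrees of freedom (32 + 34 − 2 = 64); the function name `pooled_t` and any input values are hypothetical and are not taken from the study's data.

```python
import math
from statistics import mean, variance


def pooled_t(sample_a, sample_b):
    """Return (t, df) for an equal-variance independent-samples t-test.

    Assumption: the pooled-variance (Student) form, inferred from
    df = n_a + n_b - 2 in the reported results.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_value = (mean(sample_a) - mean(sample_b)) / standard_error
    return t_value, df
```

With group sizes of 32 and 34, the function returns df = 64, matching the degrees of freedom reported for the participant comparisons.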

3.2. Item Development

This study adopted a frequency-based approach to define collocations, focusing on the co-occurrence of two words that are semantically transparent, structurally compliant, and non-random. Semantic transparency implies that the meaning of a collocation is derived directly from the meanings of its component words (e.g., blue sky), in contrast to the opaque collocation (e.g., blue tooth). Structural compliance refers to the collocation conforming to grammatical rules. Non-random co-occurrence means that the score of mutual information (MI) between the component words in a corpus is not less than 3 (Yi, 2018; Yi et al., 2022).
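The non-randomness criterion can be sketched as follows. The article does not spell out the MI formula, so the code assumes the common pointwise definition, MI = log2(observed / expected), with the expected count estimated as f(w1) × f(w2) / N and no adjustment for span size. The function names and the example counts are hypothetical.

```python
import math


def mutual_information(pair_count, w1_count, w2_count, corpus_size):
    """Pointwise mutual information (base 2) of a two-word combination.

    Assumed formula: log2(observed / expected), where
    expected = w1_count * w2_count / corpus_size.
    """
    if min(pair_count, w1_count, w2_count, corpus_size) <= 0:
        raise ValueError("all counts must be positive")
    expected = w1_count * w2_count / corpus_size
    return math.log2(pair_count / expected)


def is_non_random(mi_score, threshold=3.0):
    """Apply the criterion stated in the text: an MI score of at least 3."""
    return mi_score >= threshold


# Hypothetical counts for illustration only, not drawn from the CETCL.
example_mi = mutual_information(
    pair_count=20, w1_count=400, w2_count=500, corpus_size=1_000_000
)
```

A pair that co-occurs exactly as often as chance predicts receives an MI of 0 under this definition, so the threshold of 3 corresponds to co-occurring eight times more often than expected.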
The experimental materials were derived from a self-constructed corpus of English textbooks for Chinese learners (CETCL), which represents L2 input and contains a total of 1,422,612 word tokens. Prior to the construction of the CETCL, a questionnaire was administered to collect data on the sources of L2 input and the editions of English textbooks for Chinese EFL learners. Ultimately, a total of 40 textbooks were selected, as textbook-based L2 input was reported by over 80% of the respondents. The CETCL encompassed all the content encountered by Chinese EFL learners across textbooks, including passages, exercises, and vocabulary lists, from primary school to college. The content was first extracted and converted into “.txt” format using the ABBYY FineReader software (version 12), followed by annotation with the TreeTagger software (version 3) and manual verification. The CETCL was designed to contain four sub-corpora, namely the corpus of textbooks for Chinese primary students, the corpus of textbooks for Chinese middle school students, the corpus of textbooks for Chinese high school students, and the corpus of textbooks for Chinese college students in English majors. To ensure consistency in parts of speech, adjective–noun collocations were extracted, as these combinations tend to exhibit less variability in determiners compared to verb–(object)–noun combinations (e.g., make a mistake vs. make progress) (Wolter & Gyllstad, 2013). Then, to establish frequency bands, non-lemmatized frequency counts were utilized at both the lexical and collocational levels, with the threshold set at over 10 occurrences per million for high-frequency collocations and under 5 occurrences per million for low-frequency collocations. Based on the results obtained from this step, 200 adjective–noun collocations were randomly selected as the candidate materials.
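The frequency banding described above can be sketched as a normalization to occurrences per million followed by a threshold check. This is an illustrative sketch: normalizing by the full CETCL token count is an assumption (the target items were drawn from sub-corpora whose sizes are not given here), and the label `"excluded"` for items falling between the two thresholds is hypothetical.

```python
CORPUS_TOKENS = 1_422_612  # CETCL size reported in the text


def per_million(raw_count, corpus_tokens=CORPUS_TOKENS):
    """Convert a raw corpus count to occurrences per million tokens."""
    return raw_count * 1_000_000 / corpus_tokens


def frequency_band(raw_count, corpus_tokens=CORPUS_TOKENS):
    """Assign a band using the thresholds stated in the text.

    Over 10 per million is high-frequency; under 5 per million is
    low-frequency. Items in between are treated as belonging to neither
    band (an assumption of this sketch).
    """
    rate = per_million(raw_count, corpus_tokens)
    if rate > 10:
        return "high"
    if rate < 5:
        return "low"
    return "excluded"
```

Under these assumptions, a collocation needs at least 15 raw occurrences in a corpus of this size to exceed 10 per million, and at most 7 to fall under 5 per million.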
Next, two questionnaires were administered to minimize interference from familiarity and semantic congruency. The first questionnaire assessed familiarity with the high- and low-frequency collocations and was completed by eight students who did not participate in the main experiment (four 12th-grade high school students and four third-year undergraduate English majors). The second questionnaire evaluated the semantic congruency of the collocations and was completed by four PhD students majoring in Applied Linguistics. These steps refined the candidate materials to include only familiar and congruent collocations (i.e., English collocations that have translation equivalents in Chinese, e.g., similar concept), with mean scores greater than 4 on a five-point Likert scale.
Following this, 60 target adjective–noun collocations for each group of participants were selected, comprising 30 high-frequency and 30 low-frequency collocations. These collocations were structurally consistent and semantically equivalent between Chinese and English. The target collocations for the LNNS group were derived from the corpus of textbooks for Chinese high school students (including content in English textbooks from primary school to high school), while those for the HNNS group were derived from the corpus of textbooks for Chinese college students in English majors (covering content in English textbooks from primary school to university). Both types of collocations consisted of 8 to 14 characters, with no significant differences in length (i.e., the total number of letters for collocations; LNNS: t (58) = −1.30, p = 0.20; HNNS: t (58) = −1.16, p = 0.38). The same held for noun frequency (frequency per million; LNNS: t (58) = −1.56, p = 0.11; HNNS: t (58) = −1.34, p = 0.18) and adjective frequency (frequency per million; LNNS: t (58) = 1.63, p = 0.11; HNNS: t (58) = −0.06, p = 0.96), whereas the two types differed significantly in collocation frequency (frequency per million; LNNS: t (58) = −6.99, p < 0.001; HNNS: t (58) = −7.86, p < 0.001).
Ultimately, 60 non-collocations were selected as baseline items for the phrasal decision task. These items consisted of word combinations containing either grammatical errors (e.g., deeply hole) or semantic anomalies (e.g., broad price), with constituent words drawn from the top 5000 words in the COCA, ensuring high familiarity for the participants (Milton, 2009). To eliminate potential interference from word repetition and inconsistency in the length of collocations and non-collocations, no constituent words from the target collocations appeared in the non-collocation baseline items, and the string length of the baseline items adhered to the same range as the collocations (M = 11.05, SD = 1.32). The classification of non-collocations was verified by individually checking each baseline item in the COCA to confirm that the non-collocations failed to occur in the corpus or their MI scores were smaller than 1, suggesting that the expressions are peculiar rather than significant co-occurrences in English (Wolter & Yamashita, 2017). A complete description of the items is provided in Table 2, and the research items adopted in the phrasal decision task are listed in Supplementary Table S2.

3.3. Procedures

A phrasal decision task was employed, requiring participants to make a judgment about whether the given item was a possible sequence in English, consistent with the method employed by Hernández et al. (2016). The task was administered using the E-prime software (version 3.0). All the items were presented on a computer screen in an individually determined, randomized order. The participants began the experiment by reading the instructions displayed on the computer screen and pressing the space bar to start. Ten practice trials were provided to familiarize the participants with the procedure, and only after confirming their understanding did they proceed to the main trials.
The procedure for each trial was as follows: a red fixation point “+” appeared at the center of the computer screen for 500 milliseconds (hereafter, ms) to alert the participants, followed by a blank screen for 50 ms. When the target item appeared on the screen, the participants were required to make a judgment within 4000 ms before the next trial began. The participants needed to quickly assess whether the test item represented a correct English collocation and respond using the keyboard. If they considered the target item a valid English collocation, they pressed the “J” key with the right hand; otherwise, they pressed the “F” key with the left hand. The item disappeared automatically if no response was made within 4000 ms. To prevent fatigue effects, the participants took a self-paced break after every 60 items. The participants completed the task independently in a quiet computer lab, taking approximately 8 min to finish it. The task procedure is visualized in Figure 1.
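The response rule of the phrasal decision task can be sketched as a small scoring function. This is not the E-prime script used in the study; the names `score_trial`, `TIMEOUT_MS`, `YES_KEY`, and `NO_KEY` are hypothetical, and treating a missing, late, or unrecognized response as incorrect is an assumption of the sketch.

```python
TIMEOUT_MS = 4000  # response window stated in the text
YES_KEY = "J"      # "valid English collocation"
NO_KEY = "F"       # "not a valid English collocation"


def score_trial(is_collocation, key, rt_ms):
    """Return (correct, rt_ms) for one trial.

    Assumption: no response, a response after the 4000 ms window, or a
    key other than J/F is scored as incorrect.
    """
    if key is None or rt_ms is None or rt_ms > TIMEOUT_MS:
        return False, None
    pressed = key.upper()
    if pressed not in (YES_KEY, NO_KEY):
        return False, rt_ms
    judged_valid = pressed == YES_KEY
    return judged_valid == is_collocation, rt_ms
```

For example, pressing “F” on a baseline item such as “deeply hole” would be scored as correct, while pressing “J” on the same item would be scored as incorrect.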

4. Analysis and Results

Data from 66 participants, comprising a total of 7920 items, were automatically recorded by the E-prime software (version 3.0) and then stored in Excel files. The dataset includes both accuracy rate (hereafter, ACC) and reaction time (hereafter, RT) data. Before fitting the model, data trimming was performed in three steps: (1) removing the data of participants with ACCs below 60%; (2) eliminating RT data linked to incorrect judgments; (3) excluding RT outliers below 450 ms, above 4000 ms, or exceeding the mean by more than 2.5 standard deviations. In total, 333 data items were removed, representing 3.28% of the trials in the HNNS group and 5.18% in the LNNS group. The descriptive statistics obtained for the trimmed RT and ACC data in the phrasal decision task are listed in Table 3.
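The three trimming steps can be sketched as below. Several details are assumptions rather than facts from the article: the record layout (dictionaries with `participant`, `correct`, and `rt` keys), computing the mean and standard deviation per participant on the RTs that survive the absolute cutoffs, and applying the 2.5 SD criterion only to the upper tail.

```python
from collections import defaultdict
from statistics import mean, stdev


def trim_rts(trials, min_acc=0.60, low_ms=450, high_ms=4000, sd_cutoff=2.5):
    """Return the RT records that survive the three trimming steps."""
    by_participant = defaultdict(list)
    for trial in trials:
        by_participant[trial["participant"]].append(trial)

    kept = []
    for rows in by_participant.values():
        # Step 1: drop participants whose accuracy is below the threshold.
        accuracy = sum(1 for r in rows if r["correct"]) / len(rows)
        if accuracy < min_acc:
            continue
        # Step 2: keep RTs from correct judgments only.
        rts = [r for r in rows if r["correct"]]
        # Step 3a: absolute cutoffs (450 ms and 4000 ms).
        rts = [r for r in rts if low_ms <= r["rt"] <= high_ms]
        # Step 3b: remove RTs more than 2.5 SD above the participant mean
        # (per-participant, upper-tail trimming is an assumption).
        if len(rts) >= 2:
            values = [r["rt"] for r in rts]
            limit = mean(values) + sd_cutoff * stdev(values)
            rts = [r for r in rts if r["rt"] <= limit]
        kept.extend(rts)
    return kept
```

As a consistency check on the reported figures, 66 participants × 120 items gives the 7920 recorded items, and 3.28% of 34 × 120 trials plus 5.18% of 32 × 120 trials comes to roughly 333 removed items.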
The trimmed data were then imported into the R statistical platform (version 4.2.2), where the RTs were log-transformed, and all the numeric variables were centered and standardized prior to the analysis. A top-down approach, specifically backward stepwise regression, was used to build the models. Linear mixed-effects models were fitted to the RT data using the lmer() function, while generalized linear models were fitted to the ACC data using the glm() function (Kuznetsova et al., 2017). Post hoc tests were conducted using the emmeans() function (Naimi et al., 2014).
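The preprocessing step (log transformation followed by centering and standardizing) can be sketched in a few lines. The study performed this step in R; the Python version below is only an illustration, and it assumes a natural logarithm and the sample standard deviation (n − 1 denominator), neither of which is specified in the text. The function names are hypothetical.

```python
import math
from statistics import mean, stdev


def log_transform(rts_ms):
    """Natural-log transform of reaction times (the base is an assumption)."""
    return [math.log(rt) for rt in rts_ms]


def standardize(values):
    """Center on the mean and scale by the sample standard deviation."""
    center, spread = mean(values), stdev(values)
    return [(value - center) / spread for value in values]


# Hypothetical reaction times in ms, for illustration only.
z_scores = standardize(log_transform([500, 1000, 2000]))
```

Because 500, 1000, and 2000 ms are evenly spaced on a log scale, the standardized values in this example are symmetric around zero, which shows how the transformation reduces the right skew typical of RT data.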
Based on the participants’ processing of collocations versus non-collocations, mixed-effects model I for RTs and generalized linear model I for ACCs were established, with the results presented in Table 4 and Table 5, respectively. In these models, both the participants and items were treated as random variables, while type served as a fixed effect. For RTs, the result of mixed-effects model I indicated a main effect of type (β = 0.48, t = 39.83, p < 0.001), revealing that the participants processed collocations significantly faster than non-collocations. Similarly, the result of generalized linear model I for ACCs demonstrated a main effect of type (β = −1.36, z = −17.22, p < 0.001), indicating that the participants had significantly higher ACCs in judging collocations compared to non-collocations. These findings highlight the processing advantage for L2 collocations over non-collocations.
Based on the participants’ judgment responses to collocations, mixed-effects model II for RTs and generalized linear model II for ACCs were established, with the results presented in Table 6 and Table 7, respectively. In these models, both the participants and items were treated as random variables, while group, collocation frequency, and their interactions served as fixed effects.
For RTs, the result of mixed-effects model II exhibited a significant main effect of group (β = 0.13, t = 2.66, p = 0.01), showing that LNNS had significantly longer RTs when judging collocations compared to HNNS. Additionally, there was a significant interaction effect between the group and collocation frequency (β = 0.08, t = 10.68, p < 0.001). The post hoc analysis revealed that LNNS processed high-frequency collocations significantly faster than low-frequency collocations (β = −0.07, t = −6.86, p < 0.001), while there was no difference in RTs between the high- and low-frequency collocations for HNNS (β = 0.01, t = 1.17, p = 0.24). In both the high-frequency (β = −0.13, t = −2.66, p = 0.01) and low-frequency (β = −0.21, t = −4.41, p < 0.001) conditions, the differences between LNNS and HNNS were statistically significant.
For ACCs, the result of generalized linear model II indicated a significant main effect of the group (β = −1.29, z = −4.59, p < 0.001), with LNNS showing lower ACCs in judging collocations compared to HNNS. The main effect of collocation frequency was also significant (β = −1.44, z = −4.62, p < 0.001), indicating that the participants had significantly lower ACCs in processing low-frequency collocations compared to high-frequency collocations. Furthermore, there was a significant interaction between the group and collocation frequency (β = 1.13, z = −3.08, p = 0.01). The post hoc analysis revealed that LNNS showed no significant differences in ACCs between the high- and low-frequency collocations (β = 0.31, z = 1.62, p = 0.11), whereas HNNS demonstrated significantly higher ACCs for the high-frequency collocations compared to low-frequency collocations (β = 1.44, z = 4.62, p < 0.001). In the high-frequency condition, the difference between the two groups was significant (β = 1.29, z = 4.59, p < 0.001), with HNNS achieving higher ACCs; however, no significant difference was observed between the groups in the low-frequency condition (β = 0.16, z = 0.70, p = 0.48). These results suggested that both L2 proficiency and collocation frequency influence the representation of collocations in the L2 mental lexicon.

5. Discussion

5.1. The Representation of L2 Collocations

This study found that, compared to non-collocations, Chinese EFL learners exhibited shorter reaction times and higher accuracy rates for collocations, demonstrating a clear processing advantage for collocations over non-collocations. Similar findings have been reported in other studies, despite the use of different research paradigms (e.g., phrasal decision task (Arnon & Snider, 2010), eye-tracking studies (Sonbul, 2015), and ERP studies (Hughes, 2018)) and the focus on L2 learners with varied native language backgrounds (e.g., Swedish learners of English (Wolter & Gyllstad, 2013) and Turkish learners of English (Öksüz et al., 2024)). These results suggest that collocations have distinct entries in learners’ L2 mental lexicon and that the words in collocations are stored holistically. The holistic representation of collocations allows for their simultaneous retrieval from memory during processing, eliminating the need for the separate activation of component words and the synthesis of grammatical knowledge. This finding extends previous research on multi-word sequences, such as lexical bundles (Tremblay et al., 2011) and binomials (Sonbul et al., 2023), to collocations. It indicates that despite the relatively low degree of fixedness of collocations, which reflects the flexibility of language, collocations exhibit independent accessibility as a whole. The psychological reality of collocations corresponds to the principles of the usage-based theory of language acquisition and processing. The acquisition of collocations, similar to other skills, relies on general cognitive abilities, such as categorization, chunking, analogy, and rich memory storage, rather than on innate language acquisition mechanisms. For Chinese EFL learners, repeated exposure to numerous language exemplars helps reinforce and abstract regular language forms.
Linguistic constructions then emerge through natural language use and are cognitively represented in the brain. The basic unit of language is not the word but the construction. Therefore, collocations, in contrast to non-collocations, are reinforced through repeated exposure, making them easier to store, recall and retrieve. Collocations, along with larger linguistic units and even sentences, are all considered constructions due to their conventionalized form and meaning, thereby possessing psychological reality in language users’ mental lexicon (Sonbul et al., 2023).
Additionally, the processing advantage of collocations over non-collocations can be explained by the transitional probability of words in English (McDonald & Shillcock, 2003; Valsecchi et al., 2013), as well as the differences in collocational formation between Chinese and English. The frequent co-occurrence of two words aids language users in predicting collocational components, facilitating the activation between words and reducing the processing time for collocations (Öksüz et al., 2024). In contrast, non-collocations, as mentioned in the material selection section, lack the probability of co-occurrence in the input. Therefore, Chinese EFL learners must undergo a process of forming predictions about collocational components, experiencing mismatches, followed by a stage of retrieval of word items from the mental lexicon, ultimately leading to judgment behavior. The low-transitional probability of words in non-collocations increases the cognitive load, as evidenced by slower processing speeds and lower accuracy rates in this study. Additionally, it is further supported by longer reading times and higher fixation counts in eye-tracking studies (Sonbul, 2015), as well as the N400 components responding to semantic violations for non-collocations in ERP studies (Hughes, 2018). From the perspective of linguistic typology, the frequently examined collocations share the same structural patterns in both Chinese and English (e.g., noun–adjective, verb–noun, noun–noun, and adverb–adjective collocations) (Fang & Zhang, 2019). However, the verb–preposition combinations involved in this study are inconsistent in Chinese. For instance, in Chinese, the verb “arrive” can directly connect with a place without the need for prepositions such as “in” or “at”. The absence of such verb–preposition collocations in Chinese makes it challenging for Chinese EFL learners to judge the correctness of “arrive under”. 
The structural inconsistency of certain word pairs complicates the processing of non-collocations for Chinese EFL learners, amplifying the differences in processing collocations and non-collocations in English.

5.2. The Role of Input Frequency and L2 Proficiency

For the main effect of input frequency, the results revealed that Chinese EFL learners exhibited sensitivity to the input frequency of collocations during language processing, as evidenced by the higher accuracy rates in judgments for high-frequency collocations compared to low-frequency ones. The finding contradicts Wray’s (2002) claim that L2 learners have limited sensitivity to collocation frequency. Instead, it supports Bybee’s (2007) “frequency effect”, which includes three primary effects of frequency: the maintenance effect, the reduction effect, and the autonomy effect. This study confirms the maintenance effect, where the repeated use of linguistic units reinforces their representations in memory, as well as the autonomy effect, in which high-frequency units are accessible under context-independent conditions. Increased frequency solidifies the entrenchment of corresponding linguistic structures, enabling the component words of collocations to be represented as cohesive units within the cognitive system, thereby enhancing the degree of holistic representation. Consequently, high-frequency co-occurrences are more likely to be psychologically salient than low-frequency co-occurrences (Cangır et al., 2017; Durrant & Doherty, 2010; Fioravanti et al., 2024).
For the main effect of L2 proficiency, the results showed a distinction between the high- and low-proficiency Chinese EFL learners in collocational processing, with the high-proficiency learners exhibiting shorter reaction times and higher accuracy rates. This between-group difference may be closely related to the level of L2 exposure. Low-proficiency Chinese EFL learners tend to have less L2 exposure than high-proficiency learners. For these learners, the memory trace of a collocation may fade before the collocation is encountered again, which could impede the acquisition of implicit word-pair knowledge related to collocations (Durrant & Schmitt, 2010).
Input frequency and L2 proficiency interactively influence the representation of collocations. Processing differences among learners of varying proficiency levels are more pronounced for high-frequency collocations. For low-proficiency Chinese EFL learners, increased input frequency enhances the holistic representation of collocations, whereas high-proficiency Chinese EFL learners exhibit similar processing patterns for both high- and low-frequency collocations. These findings may be attributed to familiarity. Adolescent Chinese EFL learners primarily acquire collocations from classroom-based textbooks, leading to limited exposure to diverse forms of language input. In contrast, adult Chinese EFL learners, particularly in university settings, are exposed to a broader range of language input, which reduces their dependence on textbooks for collocational learning. Consequently, increased language exposure enhances their familiarity with collocations as holistic units, thereby reducing the frequency effect (Lu, 2024).
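As a descriptive illustration of this interaction, the sketch below computes, for each group, the cost of low-frequency relative to high-frequency collocations from the raw means reported in Table 3. It is a simple arithmetic check on the descriptive statistics, not a re-analysis: the inferential results come from the mixed-effects models, and the dictionary layout and the function name `frequency_effect` are assumptions made for this example.

```python
# Mean RTs (ms) and accuracy rates (%) copied from Table 3.
table3 = {
    "LNNS": {"high": (1162.03, 94.27), "low": (1391.58, 89.48)},
    "HNNS": {"high": (1042.09, 97.94), "low": (1158.09, 95.00)},
}

def frequency_effect(group):
    """Return (RT cost in ms, accuracy cost in percentage points)
    of low- relative to high-frequency collocations."""
    rt_high, acc_high = table3[group]["high"]
    rt_low, acc_low = table3[group]["low"]
    return round(rt_low - rt_high, 2), round(acc_high - acc_low, 2)

for group in ("LNNS", "HNNS"):
    rt_cost, acc_cost = frequency_effect(group)
    print(f"{group}: RT cost = {rt_cost} ms, accuracy cost = {acc_cost} points")
```

On these raw means, the reaction-time cost of low frequency is about 230 ms for the low-proficiency group and 116 ms for the high-proficiency group.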
Building on the preceding discussion, the present study proposes that the representation of collocations in the L2 mental lexicon exists as a continuum between analytic and holistic patterns, which advances the dual-route pattern suggested by Carrol and Conklin (2014). Increasing input frequency and improving L2 proficiency facilitate a transition toward a more holistic representation of collocations. This reflects a dynamic interplay between analytic retrieval and holistic recognition, with language experience serving as a key factor in shifting from the former to the latter. From a language development perspective, the holistic pattern of collocational representation gradually becomes entrenched in the mental lexicon of Chinese EFL learners.

6. Conclusions

This study employs a corpus-based psycholinguistic research paradigm to investigate how input frequency and L2 proficiency influence the representation of L2 collocations. The findings indicate that collocations possess a certain degree of psychological reality, with both input frequency and L2 proficiency significantly affecting the representation of collocations in the L2 mental lexicon of Chinese EFL learners. As input frequency and L2 proficiency increase, collocational representation tends to shift from analytic retrieval to holistic recognition, exhibiting a continuum pattern. It is worth noting that this study is limited by its exclusive focus on adjective–noun collocations, the lack of explanatory quality analysis, and the reliance on CETCL-derived collocations for Chinese EFL learners. Future research could address these limitations to provide a more comprehensive understanding of the representation of L2 collocations.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/bs15010046/s1, Table S1: Personal Information Questionnaire & Informed Consent Form. Table S2: List of Items Used in Phrasal-decision Task.

Author Contributions

Conceptualization, M.Y.; methodology, M.Y. and S.X.; software, M.Y.; validation, S.X., L.Y. and S.C.; formal analysis, M.Y. and S.X.; investigation, M.Y.; resources, M.Y., S.X. and L.Y.; data curation, M.Y.; writing—original draft preparation, M.Y.; writing—review and editing, S.X. and L.Y.; visualization, M.Y. and S.X.; supervision, L.Y. and S.C.; project administration, L.Y.; funding acquisition, L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Social Science Fund of China, grant number 17AYY023.

Institutional Review Board Statement

This study was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Review Committee of the School of Foreign Languages, OUC (protocol code OUCIRB2024012 and approval date 15 May 2024).

Informed Consent Statement

Written informed consent was obtained from all participants prior to enrollment in this study.

Data Availability Statement

The datasets generated during and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Arnon, I., & Priva, U. C. (2013). More than words: The effect of multi-word frequency and constituency on phonetic duration. Language and Speech, 56(3), 349–371.
  2. Arnon, I., & Priva, U. C. (2014). Time and again: The changing effect of word and multiword frequency on phonetic duration for highly frequent sequences. The Mental Lexicon, 9(3), 377–400.
  3. Arnon, I., & Snider, N. (2010). More than words: Frequency effects for multi-word phrases. Journal of Memory and Language, 62(1), 67–82.
  4. Bybee, J. L. (2007). Frequency of use and the organization of language. Oxford University Press.
  5. Cangır, H., Büyükkantarcıoğlu, S. N., & Durrant, P. (2017). Investigating collocational priming in Turkish. Journal of Language and Linguistic Studies, 13(2), 465–486. Available online: https://dergipark.org.tr/en/pub/jlls/issue/36120/405622 (accessed on 6 November 2024).
  6. Cao, Y. (2016). The representation of Chinese EFL learners’ second language collocation and influencing factors. Wuhan University Press.
  7. Carrol, G., & Conklin, K. (2014). Getting your wires crossed: Evidence for fast processing of L1 idioms in an L2. Bilingualism: Language and Cognition, 17(4), 784–797.
  8. Chomsky, N. (1995). The minimalist program. MIT Press.
  9. Durrant, P., & Doherty, A. (2010). Are high-frequency collocations psychologically real? Investigating the thesis of collocational priming. Corpus Linguistics and Linguistic Theory, 6(2), 125–155.
  10. Durrant, P., & Schmitt, N. (2010). Adult learners’ retention of collocations from exposure. Second Language Research, 26(2), 163–188.
  11. Ellis, N. C., Frey, E., & Jalkanen, I. (2009). The psycholinguistic reality of collocation and semantic prosody. In U. Römer, & R. Schulze (Eds.), Exploring the lexis-grammar interface (pp. 89–114). John Benjamins. Available online: http://digital.casalini.it/9789027289803 (accessed on 6 November 2024).
  12. Fang, N., & Zhang, P. (2019). Effects of translational congruency and L2 lexical frequency on English collocational processing. Modern Foreign Languages, 42(1), 85–97. Available online: https://link.cnki.net/urlid/44.1165.H.20181102.0259.014 (accessed on 6 November 2024).
  13. Fioravanti, I., Siyanova-Chanturia, A., & Lenci, A. (2024). Collocation in the mind: Investigating collocational priming in second language speakers of Italian. Language Learning. Early View.
  14. Firth, J. R. (1957). Papers in linguistics, 1934–1951. Oxford University Press.
  15. Gablasova, D., Brezina, V., & McEnery, T. (2017). Collocations in corpus-based language learning research: Identifying, comparing and interpreting the evidence. Language Learning, 67(1), 155–179.
  16. Gyllstad, H., & Wolter, B. (2016). Collocational processing in light of the phraseological continuum model: Does semantic transparency matter? Language Learning, 66(2), 296–323.
  17. Hernández, M., Costa, A., & Arnon, I. (2016). More than words: Multiword frequency effects in non-native speakers. Language, Cognition and Neuroscience, 31(6), 785–800.
  18. Howarth, P. (1998). Phraseology and second language proficiency. Applied Linguistics, 19(1), 24–44.
  19. Hughes, J. (2018). The psychological validity of collocation: Evidence from event-related brain potentials [Doctoral Thesis, Lancaster University]. Available online: https://eprints.lancs.ac.uk/id/eprint/127732 (accessed on 6 November 2024).
  20. Jiang, S., Jiang, X., & Siyanova-Chanturia, A. (2020). The processing of multiword expressions in children and adults: An eye-tracking study of Chinese. Applied Psycholinguistics, 41(4), 901–931.
  21. Kuznetsova, A., Brockhoff, P. B., & Christensen, R. H. B. (2017). lmerTest package: Tests in linear mixed effects models. Journal of Statistical Software, 82(13), 1–26.
  22. Langacker, R. W. (1987). Foundations of cognitive grammar. Stanford University Press.
  23. Lou, J. (2022). Collocational frequency and transparency in L2 English collocation processing. Foreign Language Research, 45(4), 98–104. Available online: https://link.cnki.net/doi/10.16263/j.cnki.23-1071/h.2022.04.014 (accessed on 6 November 2024).
  24. Lu, X. (2024). Enhancing the processing advantage: Two psycholinguistic investigations of formulaic expressions in Chinese as a second language. International Review of Applied Linguistics in Language Teaching. Early View.
  25. McDonald, S. A., & Shillcock, R. C. (2003). Low-level predictive inference in reading: The influence of transitional probabilities on eye movements. Vision Research, 43(16), 1735–1751.
  26. McEnery, T., & Hardie, A. (2012). Corpus linguistics: Method, theory and practice. Cambridge University Press.
  27. Milton, J. (2009). Measuring second language vocabulary acquisition. Multilingual Matters.
  28. Naimi, B., Hamm, N., Groen, T., Skidmore, A., & Toxopeus, A. (2014). Where is positional uncertainty a problem for species distribution modelling? Ecography, 37(2), 191–203.
  29. Nation, I. (2001). Learning vocabulary in another language. Cambridge University Press.
  30. Nesselhauf, N. (2005). Collocations in a learner corpus. John Benjamins.
  31. Öksüz, D., Brezina, V., Monaghan, P., & Rebuschat, P. (2024). Individual word and phrase frequency effects in collocational processing: Evidence from typologically different languages, English and Turkish. Journal of Experimental Psychology: Learning, Memory, and Cognition, 50(8), 1287–1314.
  32. Palmer, H. (1933). Second interim report on English collocations. Kaitakusha.
  33. Pinker, S. (1999). Words and rules: The ingredients of language. Basic Books.
  34. Siyanova-Chanturia, A. (2015). On the ‘holistic’ nature of formulaic language. Corpus Linguistics and Linguistic Theory, 11(2), 285–301.
  35. Siyanova-Chanturia, A., Conklin, K., & van Heuven, W. J. B. (2011). Seeing a phrase “time and again” matters: The role of phrasal frequency in the processing of multiword sequences. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37(3), 776–784.
  36. Sonbul, S. (2015). Fatal mistake, awful mistake, or extreme mistake? Frequency effects on off-line/on-line collocational processing. Bilingualism: Language and Cognition, 18(3), 419–437.
  37. Sonbul, S., El-Dakhs, D. A. S., Conklin, K., & Carrol, G. (2023). “Bread and butter” or “butter and bread”? Non-natives’ processing of novel lexical patterns in context. Studies in Second Language Acquisition, 45(2), 370–392.
  38. Tremblay, A., Derwing, B., Libben, G., & Westbury, C. (2011). Processing advantages of lexical bundles: Evidence from self-paced reading and sentence recall tasks. Language Learning, 61(2), 569–613.
  39. Ullman, M. T. (2001). The neural basis of lexicon and grammar in first and second language: The declarative/procedural model. Bilingualism: Language and Cognition, 4(2), 105–122.
  40. Ullman, M. T. (2004). Contributions of memory circuits to language: The declarative/procedural model. Cognition, 92(1–2), 231–270.
  41. Valsecchi, M., Künstler, V., Saage, S., White, B. J., Mukherjee, J., & Gegenfurtner, K. R. (2013). Advantage in reading lexical bundles is reduced in non-native speakers. Journal of Eye Movement Research, 6(5), 1–15.
  42. Wolter, B., & Gyllstad, H. (2013). Frequency of input and L2 collocational processing: A comparison of congruent and incongruent collocations. Studies in Second Language Acquisition, 35(3), 451–482.
  43. Wolter, B., & Yamashita, J. (2017). Word frequency, collocational frequency, L1 congruency, and proficiency. Studies in Second Language Acquisition, 40(2), 395–416.
  44. Wray, A. (2002). Formulaic language and the lexicon. Cambridge University Press.
  45. Yi, W. (2018). Statistical sensitivity, cognitive aptitudes, and processing of collocations. Studies in Second Language Acquisition, 40(4), 831–856.
  46. Yi, W., Man, K. W., & Maie, R. (2022). Investigating first and second language speaker intuitions of phrasal frequency and association strength of multiword sequences. Language Learning, 73(1), 266–300.
  47. Zhang, P., & Fang, N. (2020). Effects of lexical frequency, semantic congruency, and language proficiency on English collocational processing by Chinese EFL learners. Foreign Language Teaching and Research, 52(4), 532–545. Available online: https://link.cnki.net/doi/10.19923/j.cnki.fltr.2020.04.005 (accessed on 6 November 2024).
Figure 1. Procedure of the phrasal decision task.
Table 1. Biographical data for participants in the phrasal decision task (standard deviation in parentheses).
| Group | N | Age | Gender (M/F) | SAL | LOS (Year) | LOR (Month) | OQPT Score |
|---|---|---|---|---|---|---|---|
| LNNS | 32 | 17.16 (0.44) | 11/21 | 7.97 (1.47) | 10.19 (1.42) | 0.03 (0.17) | 34.05 (3.25) |
| HNNS | 34 | 21.15 (0.66) | 2/32 | 8.35 (1.20) | 12.79 (1.20) | 0.00 (0.00) | 43.37 (3.34) |
Note. N = number of participants; SAL = starting age of learning English; LOS = length of studying English through formal education; LOR = length of residence in English-speaking countries (periods shorter than one month being counted as none); OQPT = Oxford Quick Placement Test.
Table 2. Description of research items in the phrasal decision task.
| Type | Frequency | Number | Example |
|---|---|---|---|
| Collocations | High | 30 | correct form |
| Collocations | Low | 30 | similar concept |
| Non-collocations | —— | 60 | deeply hole |

Note. The symbol “——” denotes the absence of frequency.
Table 3. Mean reaction times (in milliseconds) and accuracy rates (in percentage) with their standard deviations in parentheses.
| Group | Type | Frequency | RTs (ms) | ACCs (%) |
|---|---|---|---|---|
| LNNS | Collocation | High | 1162.03 (468.84) | 94.27 (23.25) |
| LNNS | Collocation | Low | 1391.58 (493.75) | 89.48 (30.70) |
| LNNS | Non-collocation | —— | 1681.07 (598.86) | 78.75 (40.92) |
| HNNS | Collocation | High | 1042.09 (370.08) | 97.94 (14.21) |
| HNNS | Collocation | Low | 1158.09 (380.61) | 95.00 (21.81) |
| HNNS | Non-collocation | —— | 1553.27 (518.96) | 82.55 (37.96) |

Note. The symbol “——” denotes the absence of frequency.
Table 4. Results of the linear mixed-effects model Ⅰ (collocation as reference categories).
| Random Effects | Variance | SD |
|---|---|---|
| Participant (Intercept) | 0.05 | 0.23 |
| Item (Intercept) | 0.04 | 0.19 |
| Residual | 0.05 | 0.23 |

| Fixed Effects | Estimate | Std. Error | df | t Value | p value |
|---|---|---|---|---|---|
| (Intercept) | 6.96 | 0.03 | 159.20 | 220.41 | <0.001 |
| Type | 0.47 | 0.01 | 5660.35 | 39.83 | <0.001 |
Table 5. Results of the generalized linear model Ⅰ (collocation as reference categories).
| | Estimate | Std. Error | z Value | p value |
|---|---|---|---|---|
| (Intercept) | 2.80 | 0.07 | 40.98 | <0.001 |
| Type | −1.36 | 0.08 | −17.22 | <0.001 |
Table 6. Results of the linear mixed-effects model Ⅱ (HNNS and high-frequency collocations as reference categories).
| Random Effects | Variance | SD |
|---|---|---|
| Participant (Intercept) | 0.04 | 0.19 |
| Item (Intercept) | 0.10 | 0.31 |
| Residual | 0.01 | 0.12 |

| Fixed Effects | Estimate | Std. Error | df | t Value | p value |
|---|---|---|---|---|---|
| (Intercept) | 6.98 | 0.05 | 144.72 | 144.43 | <0.001 |
| Group | 0.13 | 0.05 | 64.84 | 2.66 | 0.01 |
| Collocation frequency | −0.01 | 0.01 | 3548.00 | −1.17 | 0.24 |
| Group × Collocation frequency | 0.08 | 0.01 | 3481.00 | 10.68 | <0.001 |
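To help read the fixed-effect estimates in Table 6, the following sketch converts them into approximate reaction times for each cell of the design. It rests on two assumptions that are not stated in this table: that reaction times were modelled on the natural-log scale, and that both predictors were dummy coded (0 = HNNS or high frequency, 1 = LNNS or low frequency), in line with the reference categories named in the caption. The function name `predicted_rt` is hypothetical.

```python
import math

# Fixed-effect estimates copied from Table 6.
INTERCEPT, GROUP, FREQ, INTERACTION = 6.98, 0.13, -0.01, 0.08

def predicted_rt(is_lnns, is_low_freq):
    """Back-transform the linear predictor to milliseconds.

    Assumes ln(RT) as the response and 0/1 dummy coding
    (both are assumptions of this sketch).
    """
    eta = (INTERCEPT
           + GROUP * is_lnns
           + FREQ * is_low_freq
           + INTERACTION * is_lnns * is_low_freq)
    return math.exp(eta)

for group, g in (("HNNS", 0), ("LNNS", 1)):
    for freq, f in (("high", 0), ("low", 1)):
        print(f"{group} {freq}: {predicted_rt(g, f):.0f} ms")
```

Under these assumptions, the positive interaction term means that the low-frequency cost applies mainly to the low-proficiency group, whereas the predicted values for the high-proficiency group are nearly identical across the two frequency levels.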
Table 7. Results of the generalized linear model II (HNNS and high-frequency collocations as reference categories).
| | Estimate | Std. Error | z Value | p value |
|---|---|---|---|---|
| (Intercept) | 3.93 | 0.24 | 16.10 | <0.001 |
| Group | −1.29 | 0.28 | −4.59 | <0.001 |
| Collocation frequency | −1.44 | 0.31 | −4.62 | <0.001 |
| Group × Collocation frequency | 1.13 | 0.37 | 3.08 | 0.002 |
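The estimates in Table 7 can be read in a similar way. The sketch below converts them into predicted accuracy rates, assuming that the model used a logit link for binary accuracy and that both predictors were dummy coded (0 = HNNS or high frequency, 1 = LNNS or low frequency). Neither assumption is confirmed by the table itself, and the function name `predicted_accuracy` is hypothetical.

```python
import math

# Estimates copied from Table 7.
INTERCEPT, GROUP, FREQ, INTERACTION = 3.93, -1.29, -1.44, 1.13

def predicted_accuracy(is_lnns, is_low_freq):
    """Apply the inverse logit to the linear predictor.

    Assumes a logit link and 0/1 dummy coding
    (both are assumptions of this sketch).
    """
    eta = (INTERCEPT
           + GROUP * is_lnns
           + FREQ * is_low_freq
           + INTERACTION * is_lnns * is_low_freq)
    return 1.0 / (1.0 + math.exp(-eta))

for group, g in (("HNNS", 0), ("LNNS", 1)):
    for freq, f in (("high", 0), ("low", 1)):
        print(f"{group} {freq}: {100 * predicted_accuracy(g, f):.1f}%")
```

Under these assumptions, the reference cell (HNNS, high frequency) yields a predicted accuracy of about 98%, which is close to the corresponding raw mean in Table 3.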
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yu, M.; Xu, S.; Yang, L.; Chen, S. The Influence of Input Frequency and L2 Proficiency on the Representation of Collocations for Chinese EFL Learners. Behav. Sci. 2025, 15, 46. https://doi.org/10.3390/bs15010046
