Article

Enhancing Code-Switching Research Through Comparable Corpora: Introducing the El Paso Bilingual Corpus

by Margot Vanhaverbeke 1, Renata Enghels 1,*, María del Carmen Parafita Couto 2 and Iva Ivanova 3
1 Spanish Linguistics Department, Ghent University, 9000 Ghent, Belgium
2 Centre for Linguistics, Leiden University, 2311 EZ Leiden, The Netherlands
3 Department of Psychology, University of Texas at El Paso, El Paso, TX 79968, USA
* Author to whom correspondence should be addressed.
Languages 2025, 10(7), 174; https://doi.org/10.3390/languages10070174
Submission received: 12 February 2025 / Revised: 5 June 2025 / Accepted: 17 June 2025 / Published: 21 July 2025

Abstract

Research on language contact outcomes, such as code-switching, continues to face theoretical and methodological challenges, particularly due to the difficulty of comparing findings across studies that use divergent data collection methods. Accordingly, scholars have emphasized the need for publicly available and comparable bilingual corpora. This paper introduces the El Paso Bilingual Corpus, a new Spanish–English bilingual corpus recorded in El Paso (TX) in 2022, designed to be methodologically comparable to the Bangor Miami Corpus. The paper is structured in three main sections. First, we review the existing Spanish–English corpora and examine the theoretical challenges posed by studies using non-comparable methodologies, thereby underscoring the gap addressed by the El Paso Bilingual Corpus. Second, we outline the corpus creation process, discussing participant recruitment, data collection, and transcription, and provide an overview of these data, including participants’ sociolinguistic profiles. Third, to demonstrate the practical value of methodologically aligned corpora, we report a comparative case study on diminutive expressions in the El Paso and Bangor Miami corpora, illustrating how shared collection protocols can elucidate the role of community-specific social factors on bilinguals’ morphosyntactic choices.

1. Introduction

In recent years, studies on bi/multilingualism have increasingly recognized the value of taking into account the variation inherent in speakers’ bi/multilingual experiences to advance our understanding of how bi/multilingual speakers use their languages in their daily lives (Beatty-Martínez et al., 2018; Parafita Couto et al., 2021). There are over 7000 languages spoken worldwide (Eberhard et al., 2024), yet only around 140 nation-states exist; in some of these countries, more than a hundred languages coexist. These widespread situations of language contact give rise to a range of linguistic phenomena, including loanwords, calques, interferences or transferences, code-switching, and patterns of convergence and divergence between the languages in contact (e.g., Palacios, 2014; Myers-Scotton, 1993; Muysken, 2013). These linguistic outcomes are not uniform but are shaped by the specific social settings and community norms in which they occur. As such, bi/multilingualism manifests in diverse ways, with its outcomes highly dependent on sociolinguistic, historical, and demographic factors (Muysken, 2013; Deuchar, 2020 among others). Accordingly, examining how the same language pair interacts under different community norms and social conditions provides invaluable insights into the dynamics of language contact (e.g., Blokzijl et al. (2017) and Balam et al. (2020, 2022)).
The contact between Spanish and English lends itself particularly well to such comparative study. Not only are these two languages among the most widely spoken worldwide, but they are also in contact in numerous and varied contexts across the world. As a consequence, their interactions reveal patterns that can be studied comparatively across different regions. As of 2021, an estimated 63.6 million individuals who identify as Hispanic (approximately 19% of the U.S. population) reside in the United States (Moslimani & Noe-Bustamante, 2023). Spanish–English bilingualism is particularly observed in communities along the U.S.–Mexico border, in states such as California, Texas, New Mexico, and Arizona, but significant bilingual populations also exist far beyond the border, such as in Chicago, New York, and Florida.
Still, the influence of local language practices on bilingual speech patterns remains underrepresented in much contemporary code-switching research (Parafita Couto et al., 2021). Studies on code-switching and bi/multilingualism have traditionally focused on a single community, making it difficult to distinguish the community-specific patterns from more universal bi/multilingual structures. While not discounting the relevance of these studies to the description of the respective communities, the need for systematic cross-community comparisons has repeatedly been raised (Parafita Couto et al., 2021). An important contributor to the scarcity of such comparative studies is the lack of naturalistic data that have been collected using comparable methodologies and that are freely available to the academic public (see also Toribio, 2017).
Yet, the growing interest in spoken interaction is at odds with the notable scarcity of empirical datasets that capture the nuances of everyday informal bilingual speech. To date, the Bangor Miami Corpus is the only corpus of colloquial/spontaneous Spanish–English bilingual speech that is freely accessible online. The El Paso Bilingual Corpus project (Vanhaverbeke et al., 2022) was developed precisely to address this gap. It documents spontaneous, conversational Spanish–English speech as it is produced in the multilingual setting of El Paso, Texas (United States). In this way, the project deepens our understanding of a mode of language use (code-switching) that is underrepresented in the existing research and that, despite its ubiquity and the sophisticated cognitive skills it demonstrates, continues to be stigmatized instead of celebrated. Moreover, by adopting the same methodological design as that of the Bangor Miami Corpus, this project contributes to the systematization of data collection methods, thereby facilitating a comparative analysis of language use across bilingual communities.
Centered around the El Paso Bilingual Corpus, this paper has three aims: to highlight the gap it fills, to present it, and to provide an example of its utility. To this end, Section 2 reviews the existing Spanish–English bilingual corpora, outlining their characteristics, methodological approaches, and limitations, particularly in terms of accessibility, data collection methods, and discourse genres. As such, this section establishes the rationale for compiling the El Paso Bilingual Corpus. Section 3 then presents the corpus, detailing its data collection methodology, participant recruitment, recording protocols, and transcription procedures. The sociolinguistic profiles of the participants are also discussed, along with the steps taken to ensure methodological comparability with the Bangor Miami Corpus. Next, Section 4 illustrates the analytical potential of the corpus through a case study on diminutive expressions in bilingual speech. In particular, this analysis explores cross-community differences between El Paso and Miami bilinguals in terms of diminutive formation strategies and the influence of language contact on morphological choices, showcasing how comparative studies of this kind may contribute to our understanding of the influence of sociolinguistic factors on bilingual speech. Finally, Section 5 summarizes the key contributions of the corpus and discusses its implications for code-switching research and research on bilingualism overall.

2. Datasets of Spontaneous Spanish–English Bilingual Speech: State of the Art

Why, one might ask, should we focus on recording everyday spoken language, an effort as ambitious as it is complex? The answer lies in the profound insights that oral and colloquial speech offer into the underlying mechanisms of language use. Over recent decades, alongside the rise of functional and cognitive linguistics, there has been a growing interest in the dynamic and context-sensitive nature of spoken language. Overall, by representing authentic speech, oral corpora bridge the gap between theoretical models and naturally occurring language; they enhance our understanding of how language is structured, processed, and used in diverse communicative settings.
We begin by addressing the available spoken corpora that specifically document Spanish–English bilingual contact scenarios, and then highlight how the El Paso Bilingual Corpus adds to the existing body of work. Section 2.1 provides a concise overview of each corpus identified through our research. In particular, we will first briefly discuss three corpora that are explicitly centered on bilingual speech (the Bangor Miami Corpus, the Bilinguals in the Midwest Corpus, and the New Mexico Spanish–English Bilingual Corpus), after which we will present five corpora that focus on varieties of Spanish (and English) spoken in the United States and that can also document linguistic contact phenomena. Subsequently, Section 2.2 addresses three key challenges that make it more difficult to conduct cross-community studies of language contact phenomena.

2.1. An Overview

First, the Bangor Miami Corpus of Spanish–English bilingual speech is a collection of 56 informal conversations recorded between 2008 and 2011 with 84 bilingual speakers. The dataset comprises approximately 35 hours of speech, yielding a total of about 265,000 transcribed words¹, and is freely available online. It provides audio recordings in addition to speech transcripts and detailed metadata about the speakers (Deuchar et al., 2014a). Furthermore, a detailed exposition of the collection methods, participants, and transcription procedures is provided by Deuchar et al. (2014b). Accordingly, this resource not only offers valuable insights into bilingual language use and interaction in the Miami area, but, because of its open-access availability and the comprehensive documentation of its design, also serves as a methodological benchmark for developing comparable corpora whose data can then be set against the Miami data.
The Bangor Miami Corpus is currently the only freely accessible corpus of naturalistic Spanish–English bilingual speech. While two other Spanish–English bilingual corpora—the Bilinguals in the Midwest (BILinMID) Corpus and the New Mexico Spanish–English Bilingual (NMSEB) Corpus—provide valuable resources for the study of bilingualism, their focus on elicited rather than spontaneous, unsupervised interaction limits their suitability for examining authentic bilingual speech (see infra). In particular, the Bilinguals in the Midwest (BILinMID) Corpus documents Spanish–English bilingualism in the Midwest and contains annotated transcripts of picture-elicited short stories narrated by 82 bilinguals, recorded in 2021–2022 (Hurtado, 2022). Complete transcripts are accessible online, and users can search by KWIC (keyword in context), lemma, or speaker. Although audio files are not available, the corpus includes detailed speaker background information, such as bilingual status, generation, gender, and age. The New Mexico Spanish–English Bilingual (NMSEB) Corpus documents the bilingual speech of northern Nuevomexicanos, who frequently use both Spanish and English in their daily interactions. The data were collected through sociolinguistic interviews conducted between 2010 and 2011, during which participants shared their life stories. The corpus comprises 31 recordings from 40 speakers, amounting to approximately 29 hours of material and roughly 300,000 transcribed words. Access to the corpus is available upon request from the principal investigators (Torres Cacoullos & Travis, 2020).
The remaining datasets do not document bilingual speech but rather focus on different varieties of Spanish spoken in the U.S. Accordingly, even if some of the participants used English to a certain extent, these datasets generally lack sufficient bilingual discourse data to support comparative analyses of bilingual language use. Moreover, like the BILinMID and NMSEB corpora, most of these datasets are based on elicited speech, further limiting their applicability for studying spontaneous bilingual interaction. First, the Chicago Spanish (CHISPA) Corpus comprises 124 sociolinguistic interviews in Spanish of approximately 1 hour, carried out with Mexicans, Puerto Ricans, and MexiRicans across three generations who lived in Chicago between 2006 and 2010 (Potowski & Torres, 2023). Detailed information about the design of the corpus, the data, and the participant profiles is provided by Potowski and Torres (2023). A subset of these interviews forms part of the PRESEEA (“Project for the Sociolinguistic Study of Spanish from Spain and America”; PRESEEA, 2014) corpus and is hence publicly available, but access to the entire corpus is only possible upon request (Potowski, n.d.).
Second, the Corpus del Español en los Estados Unidos (CORPEEU) aims to document the use of Spanish in the United States since 1960, both in written and spoken forms (Moreno-Fernández, 2018). The spoken part includes transcripts of interviews, speeches, public discourses, interactions, and audiovisual media productions, which are accessible online. Furthermore, the corpus offers advanced search capabilities, allowing users to filter by lemma, POS tag, or geographic criteria. However, a comprehensive overview of the spoken section’s data has yet to be published. Preliminary searches suggest that the corpus incorporates material from multiple sources, including (a) the PRESEEA corpus (with data from Texas and New York); (b) datasets contributed by individual researchers, such as samples of Spanish spoken in New York (from the ‘Otheguy–Zentella Corpus’), Los Angeles (collected by C. Silva-Corvalán), Chicago (the CHISPA corpus by K. Potowski, cf. supra), and Texas (the Spanish in Texas corpus, cf. infra); and (c) the Corpus of Heritage Spanish (coordinated by M. Polinsky).
Third, the Corpus of Spanish in Southern Arizona (CESA, Corpus del Español en el Sur de Arizona) documents the variety of Spanish spoken in Arizona. The corpus includes sociolinguistic interviews primarily in Spanish, though participants were allowed to switch to English as needed. These interviews of approximately 1 hour each involved 78 members of Spanish-speaking communities in Arizona and were recorded between 2012 and 2020 (Carvalho, 2012). The corpus includes audio recordings of the interviews, transcriptions, and detailed metadata about the speakers, such as their language background and social characteristics. Access to the corpus is available online upon request.
Fourth, the Corpus Bilingüe del Valle (CoBiVa) is an ongoing project documenting Spanish–English bilingual speech in the Rio Grande Valley, South Texas. Established in 2017 as a sister corpus to CESA, the CoBiVa corpus consists of unilingual sociolinguistic interviews conducted with speakers from the region, capturing naturally occurring speech in both languages, albeit not in a code-switching mode. The dataset includes audio recordings, transcripts, and detailed metadata about the speakers, such as linguistic background and demographic information. The corpus currently comprises 69 recordings of approximately 1 hour of material per interview. CoBiVa serves as a valuable resource for the study of bilingual language use in a historically Spanish-speaking community along the U.S.–Mexico border. Access to the corpus can be requested online through its official website (Christoffersen & Ciller, 2024).
Finally, the Spanish in Texas Corpus offers a rich video dataset that documents Spanish and bilingual Spanish–English speech. The data were collected through interviews and conversations, primarily recorded between 2011 and 2013, with speakers from diverse regional and cultural backgrounds across Texas. The corpus contains over 500,000 words from 97 bilingual speakers and includes downloadable video and audio files, full transcripts, and part-of-speech (POS) annotations. It is freely accessible through the Texas Data Repository (Bullock & Toribio, 2024).
For ease of reference, an overview of these corpora and their main properties is presented in Table 1, classifying them based on their discourse setting, such as interviews, spontaneous conversations, or data collected through other elicitation techniques, as described by the corpus creators. Additionally, the table provides details on the access procedures for each corpus.
Additionally, it is relevant to mention the Multilingual Hispanic Speech in California corpus (MuHSiC), which is currently under development at UC Santa Cruz (PI: Mark Amengual), California, and which will soon be available to the research community (Amengual et al., 2025). This resource aims to document multilingual speech among Hispanic communities in California through a dataset consisting of recordings, transcriptions, and tools for linguistic analysis. The corpus is currently in the transcription phase and will be accessible through a website in the near future.

2.2. Three Key Challenges

Given this overview, at least three key challenges emerge regarding the current availability of data for bilingual speech research: the overall scarcity of corpora, the variability in their discursive nature, and the restricted access to the existing datasets. First, the number of accessible datasets is disproportionately small compared to the large population of Spanish–English bilingual speakers. Notably, most of these corpora were recorded in the early 21st century, with little evidence of significant new data collection efforts in recent years. The most significant explanation for this scarcity lies in the substantial resources required to create spoken corpora. The processes of preparation, recording, transcription, and publication are both time-consuming (the process of annotation and transcription, even with the first-pass help of an automatic speech recognition system, can take years) and expensive (e.g., Christoffersen et al., 2021; Gullberg et al., 2009). Furthermore, data collection projects of this kind are often difficult to secure funding for, given their high costs and relatively modest perceived return on investment. Indeed, most of these datasets are small in terms of participant numbers and overall word counts. For comparison, while the Spanish corpus EsTenTen in Sketch Engine contains over 28.6 billion words and the Corpus del Español exceeds 10 billion words, the largest corpus in Table 1 includes only 500,000 words. Moreover, spoken datasets face inherent challenges, particularly in achieving standardized transcription and annotation. These limitations hamper the ability to compare datasets and highlight the complexities of building reliable resources for the study of bilingual speech.
Due to the scarcity of spontaneous oral corpora, researchers have increasingly turned to alternative data collection methods to study bilingual behavior. These methods primarily involve offline and online experimental approaches to examine both the production and comprehension of specific structures and phenomena (e.g., González-Vilbazo & Koronkiewicz, 2016; Bellamy et al., 2017; Pablos et al., 2018). Additionally, alternative discourse genres have become a focus of analysis, such as bilingual behavior in television series (e.g., Beseghi, 2019) and various written outputs of bilingualism, including social media interactions (e.g., Vilares et al., 2016) and fictional texts (e.g., Sebba, 2012).
However, a longstanding debate persists regarding the extent to which these alternative datasets can be considered ‘authentic’ representations of natural bilingualism as it occurs ‘in the wild’. In particular, some scholars argue that these alternative sources present strong methodological limitations, noting significant differences between, for instance, literary code-switching and real-life conversational code-switching (e.g., Keller, 1979; Lipski, 1982). Others adopt a more optimistic perspective, emphasizing that similar structures and constraints on code-switching are often observed across conversational and literary corpora, suggesting greater validity for such data (Callahan, 2004). To resolve these differing perspectives, more systematic and empirically grounded comparisons of bilingual phenomena across various datasets, both spoken and written, are needed. Such comparisons would help clarify the extent to which alternative data sources can complement spontaneous oral corpora in the study of bilingual behavior.
A second challenge, closely related to the first, lies in the inherent nature of the available corpora. A closer examination of their detailed properties reveals that these datasets vary in their structure and content. Notably, the majority of the corpora are based on semi-directed sociolinguistic interviews, with only one corpus, the Bangor Miami Corpus, featuring spontaneous conversational data. This distinction is important because it is well-established that different discourse genres and settings can influence the productivity and variability of linguistic phenomena. This is particularly relevant for highly expressive features of oral discourse, such as the use of pragmatic markers, forms of address, and strategies for mitigation and intensification (e.g., Koch & Oesterreicher, 1985; Biber, 2012; Enghels & Azofra Sierra, 2018). These differences call for caution when comparing datasets drawn from different discourse genres and interaction settings, and when generalizing from structured interactions to natural conversation.
By way of illustration, consider the comparison presented in Table 2, which examines intensification strategies in two datasets: the Havana PRESEEA Corpus and the Havana Ameresco Corpus. The former comprises data from sociolinguistic interviews conducted in the Cuban capital (see: https://preseea.uah.es/ (accessed on 10 February 2025)), while the latter includes spontaneous conversations recorded in the same city (see: https://corpusameresco.org/coloquial/web/datos (accessed on 10 February 2025)).
The comparison indicates a reduction in the variability of intensification strategies in the more directed corpus of PRESEEA. In this dataset, speakers tend to rely more frequently on the most ‘prototypical’ markers, such as muy (e.g., muy grande, ‘very big’) and the cultivated suffix -ísimo (e.g., grandísimo, ‘very big’). In contrast, in the spontaneous conversations of the Ameresco corpus, speakers demonstrate a more creative range, varying between these markers and other prefixes and suffixes, such as -udo (e.g., cabezudo, ‘big head’). This suggests that studying phenomena typical of orality exclusively through semi-directed data may lead to an incomplete understanding of their full productivity and variability.
A third challenge, and certainly not the least, is the limited accessibility of the datasets, as highlighted in the right column of Table 1. Despite the general expectation of accountability from researchers, only about half of these datasets are openly accessible through independent websites or central repositories, such as Dataverse, which are dedicated to curating and sharing data. The remaining datasets are only accessible upon request, posing additional barriers to their use for broader research purposes. Several factors may explain this limited accessibility, including concerns about the quality of the data, privacy-related issues regarding the speakers, and even competition between researchers.
Given the current situation, including the obstacles and challenges outlined, we strongly advocate for the systematic and principled collection of additional corpora of spoken informal bilingual data. One of the most significant benefits of such corpora is their high level of ecological validity. In the context of naturalistic data, ecological validity refers to the extent to which the data reflect real-life conditions and behaviors, documenting language as it is naturally used in authentic contexts rather than in artificial or controlled settings. When collected thoughtfully and with a well-designed methodology, these corpora provide uniquely representative sociolinguistic samples, enabling researchers to have full control over the data. This control, in turn, facilitates innovative investigations, such as automatic sentiment analysis or other emerging analytical approaches. Moreover, such corpora are often used to complement other research methods, for example, to inform experimental design or to provide context for interpreting experimental results.
In this respect, the El Paso Bilingual Corpus project contributes to resolving the scarcity of natural speech data for the study of language contact phenomena and to systematizing corpus building methods. Specifically, given that the Bangor Miami Corpus is to date the only Spanish–English bilingual corpus that comprises informal speech data, this project has adopted its methodological design (see Deuchar et al., 2014b, for an elaborate discussion of the Bangor Miami Corpus’ creation) to compile a new conversational corpus of bilingual speech in the Spanish–English community of El Paso, Texas.

3. Presentation of the El Paso Bilingual Corpus

El Paso, the oldest and largest border city of the United States, was one of the earliest U.S. regions where Spanish and English came into contact (Velázquez, 2009). Today, this city accommodates the largest Hispanic population in the U.S., with 82.9% of its residents identifying as Hispanic (El Paso City Hall, 2024). Of these Hispanic–Americans, 93.9% trace their roots to Mexico, but only 25% are not U.S.-born (U.S. Census Bureau, 2022a). Consequently, the large majority of Hispanic–Americans in this community are second- or third-generation bilinguals. Furthermore, the community is characterized by a sustained history of bilingualism (Velázquez, 2013), with speakers who still very much use their own Spanish variety and are also highly proficient in English. More specifically, 64.2% of the El Paso population speaks Spanish at home, while 61.9% reported that they also speak English ‘very well’ (U.S. Census Bureau, 2022b). With the ultimate goal of making the El Paso Bilingual Corpus publicly available in the near future, the next section serves as the official introduction of the corpus to the wider academic public.
The El Paso Bilingual Corpus (Vanhaverbeke et al., 2022) comprises audio recordings of spontaneous bilingual conversations between acquainted speakers from El Paso, Texas. In this section, we will expand on the data collection methods used to create this corpus (Section 3.1) and offer a first glimpse into the data that have been compiled (Section 3.2). Moreover, we will present the sociolinguistic profiles of the participant sample (Section 3.3), discuss the data transcription methods (Section 3.4), and touch upon the future endeavors to share this corpus in an open-access repository (Section 3.5).

3.1. Data Collection

The El Paso Bilingual Corpus was recorded between April 2022 and January 2023 as part of a project² funded by the Research Foundation – Flanders (FWO) and in collaboration with the Language and Communication Lab³ at The University of Texas at El Paso (henceforth UTEP). In total, 42 spontaneous and informal conversations were recorded between 84 bilingual speakers living in El Paso, Texas. Each conversation lasted between 30 and 45 minutes, leading to a total of 31 hours of speech or about 270,000 words of text⁴. For comparison, the Bangor Miami Corpus comprises 35 hours of speech (from 56 different conversations involving 84 speakers), resulting in about 265,000 words of text (Deuchar, 2008).
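To make the size comparison between the two corpora easier to inspect, the figures reported above can be turned into simple derived measures. The following is only an illustrative sketch: the class and method names (`CorpusSummary`, `words_per_hour`, `mean_minutes_per_conversation`) are our own hypothetical choices, and the inputs are the rounded totals given in the text, so the derived values are approximations rather than official corpus statistics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusSummary:
    """Rounded headline figures for one corpus, as reported in the text."""
    name: str
    conversations: int
    speakers: int
    hours: float   # approximate total duration of speech
    words: int     # approximate number of transcribed words

    def words_per_hour(self) -> float:
        # Approximate transcription density (words per hour of speech).
        return self.words / self.hours

    def mean_minutes_per_conversation(self) -> float:
        # Approximate average length of a single recording.
        return self.hours * 60 / self.conversations


# Figures taken from the paragraph above (all approximate).
EL_PASO = CorpusSummary("El Paso Bilingual Corpus", 42, 84, 31, 270_000)
MIAMI = CorpusSummary("Bangor Miami Corpus", 56, 84, 35, 265_000)

if __name__ == "__main__":
    for corpus in (EL_PASO, MIAMI):
        print(
            f"{corpus.name}: "
            f"{corpus.words_per_hour():.0f} words/hour, "
            f"{corpus.mean_minutes_per_conversation():.1f} min/conversation"
        )
```

Because the inputs are rounded totals, such derived measures should be read as rough indicators of comparability in scale, not as precise properties of either corpus.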

3.1.1. Participant Recruitment

To minimize the potential influence of the researcher’s language use on participants’ linguistic behavior, a bilingual research assistant who belongs to the local community and is affiliated with UTEP was trained to recruit participants and organize the recording sessions. Participants were recruited through multiple channels, including via the research assistant’s extended social network (following the ‘friend of a friend’ method by Milroy, 1987), by distributing flyers across UTEP’s public spaces, and via announcements posted on UTEP’s digital announcement platform. Prospective participants were invited to fill in an online prescreening survey administered via the experience management software platform Qualtrics (2020). This prescreening contained questions regarding their age, self-reported proficiency in Spanish and English, period of residency in El Paso, and contact details. Those individuals who met our eligibility criteria, namely adult (18+) bilinguals who were (highly) proficient in Spanish and English and had been living in El Paso for at least four years, were subsequently contacted by the research assistant to schedule a recording session.
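The eligibility criteria described above can be expressed as a simple filter over prescreening responses. The sketch below is hypothetical and is not the procedure actually implemented in Qualtrics: the field names, the 1–4 proficiency scale, and the cut-off `MIN_PROFICIENCY` are assumptions made purely for illustration, since the text specifies only that participants had to be adults, (highly) proficient in both languages, and resident in El Paso for at least four years.

```python
from dataclasses import dataclass

MIN_AGE = 18               # adults only (stated in the text)
MIN_YEARS_IN_EL_PASO = 4   # at least four years of residency (stated in the text)
MIN_PROFICIENCY = 3        # ASSUMPTION: hypothetical cut-off on a hypothetical 1-4 scale


@dataclass
class PrescreeningResponse:
    """One prescreening answer; all field names are hypothetical."""
    age: int
    spanish_proficiency: int   # self-reported, assumed 1-4 scale
    english_proficiency: int   # self-reported, assumed 1-4 scale
    years_in_el_paso: float


def is_eligible(response: PrescreeningResponse) -> bool:
    """Return True if the response meets all three eligibility criteria."""
    return (
        response.age >= MIN_AGE
        and response.spanish_proficiency >= MIN_PROFICIENCY
        and response.english_proficiency >= MIN_PROFICIENCY
        and response.years_in_el_paso >= MIN_YEARS_IN_EL_PASO
    )


if __name__ == "__main__":
    candidates = [
        PrescreeningResponse(22, 4, 4, 10),   # meets every criterion
        PrescreeningResponse(17, 4, 4, 17),   # under 18
        PrescreeningResponse(30, 4, 2, 6),    # English below the assumed cut-off
        PrescreeningResponse(25, 3, 4, 2),    # resident for fewer than four years
    ]
    print([is_eligible(c) for c in candidates])
```

Stating the criteria explicitly in this way makes it straightforward to apply the same screening to a future comparable corpus, with thresholds adjusted to the local context.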
During the collection of the El Paso Bilingual Corpus, various measures were taken to maximize the naturalistic conversational environment in which bilingual communication and code-switching normally occur. In particular, eligible participants were asked to invite someone they knew well (preferably a friend or relative) to record the conversation with, as previous research has shown that bilingual speakers code-switch more frequently with familiar interlocutors than with lesser known, or unknown, conversation partners (Dewaele, 2010; Resnik, 2012, as cited in Dewaele & Li, 2014). Furthermore, the participant pairs were free to select the location for their recording session, although it was recommended to choose a (relatively) quiet place where they felt comfortable. This enhanced the probability of informal conversation occurring between the participants, as bilingual speakers tend to avoid using both languages in more formal speech events (Dewaele & Li, 2014). Given that the participant pool primarily comprised UTEP students, most conversations were recorded on campus (24 conversations). Other locations included one of the participants’ homes (1 conversation), backyards (2 conversations), or outside balconies (1 conversation), a coffee shop (10 conversations), and the office (4 conversations). For some recordings made in public spaces, the environmental noise was minimized and the acoustic quality enhanced using the free audio-editor program Audacity (Audacity Team, 1999). Each participant was financially compensated for their time upon completing the recording session.

3.1.2. Recording Session

At the start of the recording session, participants were given a short briefing about the project by the research assistant. As with the Bangor Miami Corpus, the research goal of the project was kept as general as possible, described as aiming to investigate the communicative behavior of Spanish–English bilingual speakers who use both languages in their daily conversations. Moreover, participants were informed about the measures taken to ensure the confidentiality of their data (cf. infra), and how their data would be handled after the recordings. The research assistant then set up the recording equipment and instructed the participants on how to use it. For all sessions, a Marantz PMD 661 MKII mobile voice recorder⁵ with built-in stereo condenser microphones was used. During the briefing and the instructions, the research assistant herself was instructed to address the participants in both languages to help prime them into a bilingual mode (cf. Grosjean, 1985). To encourage natural and spontaneous conversation, participants were told that they could talk about any topic they liked. However, if the participants requested some guidance, the research assistant proposed a few casual topics, such as recent or future vacations, movies they saw or wanted to see, work-related events, family gatherings, hobbies, weekend plans, etc., as these were topics that also arose in the audio recordings of the Bangor Miami Corpus.
After having provided all the instructions, the research assistant exited the room or space where the recording was taking place, leaving the participants to converse unsupervised. This measure was taken so that the presence of the assistant would not influence the speakers’ language behavior during the conversation and to minimize the effects of the Observer’s Paradox (Labov, 1972). The participants were left to talk for about 45 minutes, after which the research assistant returned to conclude the session.

3.1.3. Sociolinguistic Background Questionnaire

At the end of the recording session, participants were asked to complete a background survey probing sociodemographic and sociolinguistic information, including age, gender, current profession, educational level, locations where the speakers have lived, language input received at home and at school, self-assessed language proficiency levels, linguistic attitudes, and their habits of language use in general and with specific interlocutors. These data were collected to be able to investigate the variation in the data from a sociolinguistic perspective. To ensure comparability between the sociolinguistic data collected in El Paso and Miami, the El Paso Bilingual Corpus project employed the same questionnaire developed for the Bangor Miami Corpus, with minor adaptations to suit the local context. The full questionnaire can be downloaded from the Bangor Miami Corpus website.
More specifically, one such adaptation involved the section on speakers’ self-assessed language proficiency. In the questionnaire used in Miami, speakers were asked to assess their proficiency in Spanish and English by answering the following question: On a scale of 1 to 4, how well do you feel you can speak [English/Spanish]? To obtain a more nuanced understanding of the speakers’ self-assessed proficiency, this original question was divided into four subquestions in the El Paso version of the survey (a–d). This revision allowed for the separate evaluation of the speakers’ writing, reading, speaking, and comprehension skills in each language, as speakers can diverge in their command of a language when taking into account different skills (Hulstijn, 2011). The answer scale was maintained.
(a) How well do you feel you can speak [Spanish/English]?
(b) How well do you feel you can read [Spanish/English]?
(c) How well do you feel you can write [Spanish/English]?
(d) How well do you feel you can understand spoken [Spanish/English]?
A second adaptation concerned the question probing speakers’ attitudes toward code-switching. The original questionnaire included the statement People should avoid mixing their languages in everyday conversation, whereby participants had to indicate on a five-point Likert scale to what extent they agreed with this statement. In the El Paso version of the survey, this statement was rephrased more positively as I like that people use both their languages when talking to me to reduce the likelihood of eliciting potentially biased responses.
In total, the questionnaire comprised 26 questions and took about ten minutes to complete. The survey was presented digitally on Qualtrics and was available to participants in Spanish and English, allowing them to choose their preferred language. A total of five individuals selected the Spanish version and 79 the English version of the survey.

3.1.4. Ethical Considerations

The procedures involved in the compilation of the El Paso Bilingual Corpus conform to the U.S. federal guidelines for the protection of human subjects and to the European Union General Data Protection Regulation (GDPR), and were approved by the UTEP Institutional Review Board and the Ethics Committee at Ghent University. In the consent form, explicit consent was requested to publish the pseudonymized recordings and transcripts (cf. infra) in an open-access repository. Eight participants (including two complete pairs of conversation partners) did not consent to the publication of their conversation recordings. As a result, six of the 42 conversations are affected and only 36 conversations will be included in the publicly available version of the corpus. Moreover, the data of two participants were deleted because they did not reconsent to the use of their data after the full aims of the project (i.e., to study code-switching behavior) were disclosed in the debriefing at the end of the session.
To protect the participants’ confidentiality, each participant was assigned a unique ID code prior to the recording session, consisting of a randomly generated combination of two letters and four digits (e.g., ZQ0416). This code made it possible to link the different types of data to the corresponding individual. No identifiable information was tied to the research data.
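By way of illustration, ID codes of the format described above (two letters followed by four digits) could be generated as in the following minimal Python sketch. The function name `make_participant_id` and the uniqueness check are assumptions made for this example; the source does not specify which tool was actually used to generate the codes.

```python
import secrets
import string


def make_participant_id(existing):
    """Return a new code such as 'ZQ0416' that is not yet in `existing`.

    `existing` is a set of codes already handed out (hypothetical bookkeeping);
    the new code is added to it before being returned.
    """
    while True:
        letters = "".join(secrets.choice(string.ascii_uppercase) for _ in range(2))
        digits = "".join(secrets.choice(string.digits) for _ in range(4))
        code = letters + digits
        if code not in existing:  # retry on the (rare) collision
            existing.add(code)
            return code


assigned = set()
participant_ids = [make_participant_id(assigned) for _ in range(84)]
```

With 26 × 26 × 10,000 possible codes, collisions are unlikely for a sample of this size, but the explicit check keeps every code unique.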

3.2. Description of the Data

The conversations in the El Paso Bilingual Corpus were named after the research assistant responsible for organizing the recording sessions, followed by a chronological numbering system (e.g., Dominguez01, Dominguez02, etc.). Following the practices established in the Bangor Miami Corpus, the first five minutes of each conversation were removed under the assumption that those initial minutes of conversation might not fully reflect natural speech, as the speakers were likely still becoming accustomed to being recorded. Nevertheless, the remaining data clearly reflect spontaneous, unplanned speech, as evidenced by the numerous interruptions, self-repetitions, overlaps, ellipses, hesitations, and false starts observed in the recordings (Dufour et al., 2014; Ward, 1989).
Still, in some of the conversations (e.g., Dominguez08 and Dominguez36), the participants made remarks toward the end that show they were still conscious of being recorded. For instance, in excerpt (1), speaker DN7658 comments on how long they still need to talk before their recording ends. In a similar vein, in excerpt (2), speaker IB6767 comments that they have been talking for 45 minutes and that the research assistant has not yet returned. However, given the informal and colloquial nature of the conversations, including the participants’ candid exchanges on personal topics (e.g., in excerpt (3)), we believe that this awareness had only a minimal impact, if any, on the overall naturalness of their language use.
(1) DN7658: Sí nos falta un minutito un minutito
‘Yeah we need one more minute, one little minute’
KV9880: Vamos a decirles un secreto vamos va chiquito
‘We’re gonna tell them a secret, let’s tell them a little one’
(El Paso Bilingual Corpus, Dominguez08)
(2) IB6767: Dijo que ahorita como a los cuarenta y cinco regresaba
‘She said that now like at forty five she would return’
GZ5538: En efecto whatever
‘Actually’
IB6767: Yo digo que le dejemos cinco minutitos más
‘I say we leave it about five minutes more’
(El Paso Bilingual Corpus, Dominguez36)
(3) GF4964: sí sí sí hace tres años por como cuatro años era abusivo mi papa
‘yeah yeah yeah three years ago for like four years he was abusive my dad’
VK3709: [xxx] en Juarez
   ‘in Juarez’
GF4964: Sí (.) Pero no no era físicamente nomas era (..) psicológicamente
‘yeah but it was not physically it was rather psychologically’
gritaba mucho güey yo tengo muchos problemas todavía de eso
‘he yelled a lot, dude, I have many problems still because of that’
(El Paso Bilingual Corpus, Dominguez38)
Similar to the Bangor Miami Corpus, the conversations in the El Paso Bilingual Corpus exhibit considerable variation in bilingual language use. While some conversations took place predominantly in English with switches to Spanish, the reverse is true for others. Table 3 provides a categorical division of the number of conversations that were predominantly in Spanish, predominantly in English, or that comprised approximately equal proportions of Spanish and English words. In the predominantly Spanish conversations, over 65% of the words were Spanish and less than 30% English; the two percentages do not add up to 100% because a certain percentage of the words can belong to either language. The reverse holds for the predominantly English conversations, while in the conversations with roughly equal proportions of Spanish and English, the share of words in each language ranged between 35% and 55%.
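The categorical division described above can be sketched in a few lines of Python. The function name, the category labels, and the "unclassified" fallback for word counts that fall between the stated thresholds are assumptions made for this illustration; only the percentage thresholds themselves come from the description above.

```python
def classify_conversation(n_spanish, n_english, n_ambiguous=0):
    """Assign a conversation to a language category from its word counts."""
    total = n_spanish + n_english + n_ambiguous
    if total == 0:
        raise ValueError("conversation contains no words")
    pct_spanish = 100 * n_spanish / total
    pct_english = 100 * n_english / total

    if pct_spanish > 65 and pct_english < 30:
        return "predominantly Spanish"
    if pct_english > 65 and pct_spanish < 30:
        return "predominantly English"
    if 35 <= pct_spanish <= 55 and 35 <= pct_english <= 55:
        return "balanced"
    # Assumption: shares outside the stated ranges are left for manual review.
    return "unclassified"


print(classify_conversation(7000, 2500, 500))  # 70% Spanish, 25% English
print(classify_conversation(4800, 4700, 500))  # 48% Spanish, 47% English
```

Ambiguous words are kept in the denominator here, which is why the Spanish and English shares need not sum to 100%.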
However, it is important to acknowledge that, in reality, the conversations more accurately represent a continuum with varying degrees of code-switching. At one end of this spectrum, the corpus comprises several conversations with very frequent code-switching, such as Dominguez01 (excerpt (4)) and Dominguez06 (excerpt (5)). In these excerpts, Spanish discourse is indicated in italics and English discourse in normal font.
(4) ZQ0416: I normally last time los lavé (.) pero ya mira hasta me cayó ahí el café
           ‘I washed them, but well look it even got the coffee’
que me escupiste
‘that you spat on me’
FX7977: it was by accident, ay todos tus [xxx]
         ‘ah all your’
ZQ0416: that’s why I [//] viste? ahí tengo el café que me escupiste
      ‘you see? Here I have the coffee you spat on me’
FX7977: It was an accident, I didn’t mean it
ZQ0416: Los tengo que volver a lavar, tú cómo los lavas?
‘I have to wash them again, how do you wash them?’
FX7977: qué?
‘what?’
ZQ0416: los zapatos
‘your shoes’
FX7977: I don’t mi mamá like she is like “pásame los zapatos” y yo “okay”
    ‘my mom’      ‘pass me the shoes’ and I’
(.) but she puts them in la lavadora
           ‘the washing machine’
ZQ0416: I’ve seen like different techniques he visto unos que dicen que
                 ‘I’ve seen a few that say that’
pasta de dientes y luego otros que dicen like baking soda
‘tooth paste and then others that say’
(El Paso Bilingual Corpus, Dominguez01)
(5) SU9854: “Okay pues ya acabé con tu tire it’s gonna be [/] it’s usually forty but
     ‘well I already finished your’
te cobro treinta cinco” and I was like “okay” y luego ya pues y le pagué
‘I’ll charge you thirty-five’       ‘and then well and I paid him’
and then he was like [/] we went into like the little shop para darle
                         ‘to give him’
mi tarjeta y todo and then he’s like “tienes Facebook”? &=chuckles and
‘my card and all’        ‘do you have Facebook?’
I was like “sí pero también tengo novio” y l(ueg)o dice “ay pues está bien”
     ‘yes but I also have a boyfriend’ ‘and then he says “ah well that’s fine”’
y luego le salía así como no Add Friend le salía no más Message
‘and then it showed him like not’  ‘it showed just’
he’s already my friend
KZ1980: no sé
‘I don’t know’
SU9854: does that mean [//] porque hazte cuenta que he gave me his phone
           ‘because note that’
in Facebook and then I typed there my name y luego I clicked
                    ‘and then’
on my profile y luego no salía así Add Friend or Remove Friend
      ‘and then it didn’t show like’
na(da) más decía Message
‘it just said’
(El Paso Bilingual Corpus, Dominguez06)
On the other end of the continuum, two conversations need to be discussed in more detail. Specifically, in conversation Dominguez10, the bilingual speakers talked exclusively in English. These speakers were two colleagues who recorded a conversation in their office during lunch break hours, as a result of which we believe that the physical setting may have affected their language mode. In addition, in Dominguez08, the interlocutors addressed each other (mainly) in one language during predetermined blocks of time. That is to say, at certain times during the conversation, speaker DN7658 indicated that they should switch to the other language. As a result, the alternation between languages in this conversation is more artificial, although there is still minor switching to the other language in each of the blocks.
(6) DN7658: Mejor no lo veo
‘It’s better if I don’t see it’
KV9880: Habrán visto algo
‘they will have seen something’
DN7658: Speak in English
KV9880: English?
DN7658: Yes. I mean I don’t really know the study is about but I am assuming
that it is +...
(El Paso Bilingual Corpus, Dominguez08)
Apart from the recordings of the conversations (in .wav format) and the responses to the sociodemographic background questionnaire, additional metadata (a–g) about the conversations and the interlocutors were systematically registered by the research assistant on a technical information sheet during the recording sessions. These metadata will also be made available.
(a) Date of the recording.
(b) Duration of the recording.
(c) Place of the recording.
(d) Number of active participants and passive bystanders.
(e) Relationship between the participants.
(f) Socio-demographic information about the participants.
  • Gender and age.
  • Education level.
  • Current profession.
  • Languages spoken by the participant before the recording.
  • Languages spoken by the participant after the recording.
(g) Field notes (i.e., any extra information that may be relevant for the data analysis).

3.3. Participant Profiles

All conversations took place between two interlocutors with varying social relationships (Table 4). The majority of the recorded conversations occurred between friends (71.4%, n = 30/42). In six of the conversations, the interlocutors were family members. Specifically, four conversations were recorded between siblings (9.5%) and two between a mother and her young-adult daughter (4.8%). In four instances, the interlocutors were a couple (9.5%), and finally, in two cases the speakers were colleagues (4.8%).
For two of the conversations (i.e., Dominguez29, a conversation between siblings, and Dominguez33, between friends), the speech of only one of the interlocutors was transcribed, as the other did not reconsent to the use of their data after the debriefing (cf. supra). As such, those two participants will not be taken into account when describing the participant sample.
Based on the sociolinguistic information provided by the participants in the questionnaire, we are able to create a detailed overview of their sociolinguistic profiles. We will discuss a few of the more prominent sociolinguistic variables here. Of the 82 participants in the El Paso Bilingual Corpus, 49 speakers were women (59.8%), 31 were men (37.8%), and 2 participants defined themselves as non-binary (2.4%). Participants’ ages ranged from 18 to 58, with an average age of 22 years (SD = 6 years, 8 months). Because participants were primarily recruited from the UTEP student community, the large majority of the participants in El Paso were between 18 and 25 years old at the time of the recording (85.4%, n = 70/82). Ten participants were between 26 and 45 years old (12.2%) and the remaining two participants were between 46 and 60 years old (2.4%). Table 5 presents the distribution of the participants by age and gender.
Furthermore, just under half of the El Paso bilinguals who participated in the creation of the corpus were born and raised in the U.S. (43.9%, n = 36/82), as shown in Figure 1. In fact, the great majority of these participants were born and raised in the city of El Paso itself (n = 34/36). A somewhat smaller group of participants (37.8%, n = 31/82) were born and raised outside of the U.S., moving to El Paso when they were already over 12 years old. Given El Paso’s proximity to the Mexican border, all but one of these speakers were born and raised in Mexico. The remaining participant was born and raised in the Dominican Republic. Furthermore, 11.0% (n = 9/82) of the speakers were born outside of the U.S. but moved to El Paso in the first years of their lives. In particular, eight individuals were born in Juarez, El Paso’s neighboring city across the border, while one participant was born in Puerto Rico. Finally, one participant (1.2%) was born in the U.S. but spent the majority of the first twelve years of their life in Mexico. Five participants (6.1%) did not disclose any information in the background survey about the places where they had lived.
Regarding the participants’ ages of acquisition of Spanish and English (Figure 2), the large majority of the speakers in the El Paso Bilingual Corpus acquired Spanish before they learned English. In particular, 75 speakers learned Spanish before the age of four. A total of 34 of these participants started learning English in primary school (41.5%), 13 participants reported learning English in secondary school (15.9%), and 2 participants acquired English in adulthood (2.4%). The remaining 26 of these individuals (31.7%) also acquired English before they were four years old, as a result of which they were deemed early bilinguals. Furthermore, four participants (4.9%) indicated that they learned Spanish as an adult (one of whom reported also learning English as an adult, and three learning English in secondary school). However, all of them also reported that their parents spoke Spanish to them at home and all but one were taught in Spanish during primary school. Finally, three participants in El Paso did not provide any information regarding their age of acquisition (3.7%).
Finally, let us consider the self-reported proficiency levels of the participants. Based on the four questions probing their speaking, reading, writing, and comprehension skills (cf. supra), a mean proficiency score was calculated for each participant. On a scale of 1 to 4, the average self-reported proficiency levels of the speakers were 3.59 (s = 0.51) for Spanish and 3.79 (s = 0.36) for English. Moreover, as shown in Figure 3, the majority of the speakers indicated that they were equally proficient in Spanish and English (59.8%, n = 49/82). A total of 23 participants assessed their proficiency to be higher in English than in Spanish (28.0%), whereas the remaining ten speakers considered themselves more proficient in Spanish (12.2%).
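As a worked illustration of the scoring described above, the following Python sketch averages the four skill ratings per language and compares the two means. The dictionary keys, function names, and the use of exact equality to define "equally proficient" are assumptions made for this example; the source does not state how ties or near-ties were operationalized.

```python
from statistics import mean

SKILLS = ("speak", "read", "write", "understand")


def mean_proficiency(ratings):
    """Average the four self-ratings (each on the 1-4 scale) for one language."""
    for skill in SKILLS:
        if not 1 <= ratings[skill] <= 4:
            raise ValueError(f"rating for {skill!r} must lie between 1 and 4")
    return mean(ratings[skill] for skill in SKILLS)


def compare_proficiency(spanish, english):
    """Label a participant by comparing the two mean scores (assumed rule)."""
    score_es = mean_proficiency(spanish)
    score_en = mean_proficiency(english)
    if score_es == score_en:
        return "equal"
    return "higher in Spanish" if score_es > score_en else "higher in English"


# Invented ratings for a hypothetical participant.
spanish = {"speak": 4, "read": 3, "write": 3, "understand": 4}
english = {"speak": 4, "read": 4, "write": 4, "understand": 4}
print(mean_proficiency(spanish), compare_proficiency(spanish, english))
```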

3.4. Transcription Methods

As the El Paso Bilingual Corpus has been collected fairly recently, the transcription of the conversations is still ongoing. Five people have been involved in creating the transcriptions of the corpus: Anavictoria Dominguez, the research assistant responsible for collecting the data; Marcela De La Torre, another native bilingual speaker from the El Paso community; Lucía Pascual Perales and Patricia Vidal, two native Spanish students from Spain with a high proficiency in English who did an internship at the Spanish Linguistics Department at Ghent University; and the principal author.
The recordings are being transcribed in the annotation program ELAN (The Language Archive, 2024). ELAN is a computer program for annotating sound or video files developed at the Max Planck Institute for Psycholinguistics in Nijmegen, The Netherlands. It runs on all major operating systems (Windows, Mac OS, Linux) and is available in a number of different interface languages (to date, Catalan, Dutch, English, French, German, Japanese, Portuguese, Spanish, and Swedish).
The multimodal annotation tool ELAN allows one to create tiers (e.g., one main or ‘parent’ tier for each speaker) on which the speech of the participant can be transcribed in segmented speech units that are automatically linked to the timeline of the audio or video file (Berez, 2007). Optionally, additional ‘child’ tiers can be created and connected to a main tier, which may contain, for instance, translations of the segmented utterance, but also phonological and/or phonetic representations, glosses, and comments and contextual notes. An unlimited number of tiers are allowed in the program, as long as the parent–child relationships between the tiers are correctly conceived of. Tier configurations can also be saved as templates to expedite the creation of future files with the same participant structure (Berez, 2007).
This annotation program was chosen for its user-friendly interface and its interoperability with a number of tools that are prominent in linguistics research, such as FLEx, Praat, and CLAN with its CHAT format. The TalkBank databank, where the Bangor Miami Corpus is available, makes use of CLAN as the standard software system and hence requires its files to be uploaded in CHAT format. By using ELAN, we keep open the option of publishing the El Paso Bilingual Corpus in this database. Moreover, ELAN allows exporting .csv and tab-delimited text files, facilitating the transfer of data to spreadsheet applications such as Excel or to statistical environments such as R for quantitative analysis. ELAN can also export lists of unique word or annotation values, optionally with an occurrence count (Sloetjes et al., 2011).
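To give a concrete sense of how such a tab-delimited export might feed into a quantitative analysis, the following Python sketch counts the number of words per speaker tier. The column layout (tier name, begin time in ms, end time in ms, annotation text) and the sample rows are assumptions made for this illustration; the actual columns depend on the options selected when exporting from ELAN.

```python
import csv
import io
from collections import Counter

# Hypothetical export with four columns: tier, begin (ms), end (ms), annotation.
SAMPLE_EXPORT = (
    "ZQ0416\t1000\t2500\tlos zapatos\n"
    "FX7977\t2600\t4000\tI don't mi mamá\n"
    "ZQ0416\t4100\t6000\tI've seen like different techniques\n"
)


def words_per_tier(handle):
    """Count whitespace-separated tokens per tier in a tab-delimited export."""
    counts = Counter()
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for tier, _begin_ms, _end_ms, annotation in reader:
        counts[tier] += len(annotation.split())
    return counts


counts = words_per_tier(io.StringIO(SAMPLE_EXPORT))
for tier, n_words in sorted(counts.items()):
    print(tier, n_words)
```

For a real export, `io.StringIO(SAMPLE_EXPORT)` would be replaced by a file opened with `open(path, encoding="utf-8", newline="")`.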
For the transcription of the El Paso Bilingual Corpus, a detailed protocol was developed for the transcribers in which instructions were provided on how to work with the program. Furthermore, transcription conventions were established in this protocol to ensure that the transcriptions were made in a consistent manner. In general, the orthographic norms of the Diccionario de la Lengua Española (Real Academia Española, 2014) and the Diccionario de Americanismos (Asociación de Academias de la Lengua Española, 2010) were followed for the Spanish stretches of speech and those of the Merriam Webster’s Dictionary (Merriam-Webster, 2025) for the English segments. This was a deliberate decision, as an orthographic transcription facilitates automatic searches within the corpus. Nevertheless, given that this approach limits the ability to transcribe particular phonetic features, a future version of the transcriptions may include an additional tier with this information.
Furthermore, care was taken to use transcription conventions that were as similar as possible to those of the Bangor Miami Corpus. In particular, the following conventions were implemented:
| Code | Explanation | Examples |
|---|---|---|
| &= | To indicate a paralinguistic or extralinguistic aspect. | &=cough, &=laugh, &=sneeze |
| [xxx] | To indicate unintelligible words or sequences. | I wanted to [xxx] |
| [ ] | To indicate words or phrases that are uncertain to the transcriber; to indicate alternatives when the transcriber is uncertain. | I wanted to [invite] him; I wanted to [invite/ignite] him |
| ( ) | To indicate a consonant or syllable that has been elided and thus not pronounced. | I have (e)nough.; ya (es)toy lista. |
| << >> | To indicate direct quotations. It is important to not use the symbol “ in ELAN, as it causes the program to crash. | He said << what are you going to do? >> |
| [/] | Used when a speaker begins to say something, stops, and then repeats the earlier material without change. The material being retraced is enclosed in angle brackets. | <I wanted> [/] I wanted to invite him. |
| [//] | Used when a speaker begins to say something, stops, and then starts a new phrase while maintaining the same idea. The material being retraced is enclosed in angle brackets. | <I wanted> [//] I think I wanted to invite him. |
| [///] | Used when a speaker begins to say something, stops, and then says something else. The material being retraced is enclosed in angle brackets. | <I wanted> [///] Oh I forgot to tell you that the cat’s gone. |
| (.) | To indicate short pauses in the speech unit. | I wanted to (.) invite him. |
| (..) | To indicate longer pauses in the speech unit. | I wanted to invite him (..) but didn’t. |
| +… | To indicate that the speech unit is incomplete, but not interrupted. The speaker has trailed off. | Smells good enough for +… |
| +< | Used at the beginning of an utterance that overlaps with a previous utterance. | SP1: I wanted to invite him; SP2: +< but you didn’t. |
| +/ | Used at the end of an utterance that is incomplete because the speaker is interrupted. | SP1: I wanted to invite +/; SP2: +< Mommy! |
| /+ | Used at the beginning of the utterance that completes a previously interrupted speech unit. | SP1: I wanted to invite +/; SP2: +< Mommy!; SP1: /+ him, but didn’t. |
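Because the conventions above are machine-readable, plain word lists can be recovered from a transcribed utterance with a handful of regular expressions. The following Python sketch is one possible way of doing so and is not part of the corpus tooling. The function name `plain_words` and several design decisions (dropping retraced material, keeping only the first alternative of an uncertain word, restoring elided segments) are assumptions made for this illustration.

```python
import re


def plain_words(utterance: str) -> list[str]:
    """Strip transcription codes and return the remaining word tokens."""
    text = utterance
    # Retraced material plus its marker, e.g. "<I wanted> [/]" (assumption: drop it).
    text = re.sub(r"<[^<>]*>\s*\[/{1,3}\]", " ", text)
    # Retrace markers that appear without angle brackets.
    text = re.sub(r"\[/{1,3}\]", " ", text)
    # Paralinguistic or extralinguistic events, e.g. "&=laugh".
    text = re.sub(r"&=\S+", " ", text)
    # Unintelligible stretches.
    text = text.replace("[xxx]", " ")
    # Short and longer pauses: "(.)" and "(..)".
    text = re.sub(r"\(\.{1,2}\)", " ", text)
    # Trailing off, overlap, interruption, and completion markers.
    text = re.sub(r"\+(?:…|\.\.\.|<|/)|/\+", " ", text)
    # Quotation delimiters "<<" and ">>".
    text = text.replace("<<", " ").replace(">>", " ")
    # Elided segments are restored: "(e)nough" becomes "enough".
    text = re.sub(r"\((\w+)\)", r"\1", text)
    # Uncertain words: "[invite/ignite]" keeps the first alternative.
    text = re.sub(r"\[(\w+)(?:/\w+)*\]", r"\1", text)
    # Word tokens: letters, optionally joined by an apostrophe.
    return re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text)


print(plain_words("<I wanted> [/] I wanted to (.) invite him &=laugh"))
print(plain_words("ya (es)toy lista."))
```

Whether retraced material should count toward word totals is an analytic choice; a study of disfluencies would want to keep it rather than discard it.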

3.5. Future Endeavors for the Publication of the Corpus

Currently, a draft transcription exists for all conversations in the corpus, each of which has undergone at least one round of revision. This first round of revisions, which focused on verifying the accuracy of the transcripts, was carried out by Anavictoria Dominguez, the principal author, and Simon Claassen, a colleague at Ghent University. However, before the corpus can be shared in an open-access repository, additional rounds of revision will be required to address any remaining orthographic errors and inconsistencies in transcription conventions. Subsequently, it may also be beneficial to conduct an inter-transcriber reliability analysis to assess the consistency of the transcription practices, as was done for the Bangor Miami Corpus (for a discussion of this reliability analysis, see Deuchar et al., 2014b).
In addition, following the good practices of the Bangor Miami Corpus, it would be useful to provide glosses and language markers for each word. Depending on where the corpus is eventually published, the language marking system may need to meet certain requirements. For instance, in TalkBank, conversations need to be assigned a default language, after which only words that are not in the default language are required to be marked. In any case, language-marking more than 270,000 words of text is an arduous process and not without difficulties. The marking will be based on the appearance of the word in a monolingual reference dictionary of Spanish or English. When a word can be found in the dictionaries of both languages, that word receives a neutral language marker, unless its pronunciation makes its language membership clear. In the same vein, place names and brand names that are the same in both languages will be tagged with a neutral language marker. For the non-English stretches of speech, translations will also be provided to facilitate the use of the data by researchers who are not necessarily familiar with Spanish.
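The dictionary-based marking procedure outlined above can be approximated as follows. This is a minimal Python sketch under stated assumptions: the tiny word sets stand in for full monolingual reference dictionaries, the marker labels ("spa", "eng", "neutral", "unknown") and the function name are invented for this example, and the pronunciation-based exception cannot be captured by a lookup and would still require manual review.

```python
# Toy stand-ins for monolingual reference dictionaries (hypothetical content).
SPANISH_LEXICON = {"no", "sé", "luego", "zapatos", "mi", "mamá"}
ENGLISH_LEXICON = {"no", "shoes", "friend", "my", "mom"}
# Place and brand names that are the same in both languages.
NEUTRAL_NAMES = {"facebook"}


def mark_language(word, spanish=SPANISH_LEXICON, english=ENGLISH_LEXICON,
                  neutral_names=NEUTRAL_NAMES):
    """Return a language marker for one orthographic word."""
    key = word.lower()
    if key in neutral_names:
        return "neutral"
    in_spanish = key in spanish
    in_english = key in english
    if in_spanish and in_english:
        return "neutral"  # found in both dictionaries
    if in_spanish:
        return "spa"
    if in_english:
        return "eng"
    return "unknown"  # left for manual inspection


utterance = ["no", "sé", "Facebook", "friend"]
print([(word, mark_language(word)) for word in utterance])
```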
Finally, the audio recordings and transcript text files will be pseudonymized. For the recordings, this entails erasing any personal names, both of the individuals who are participants and any non-participant individuals mentioned during the recorded conversation, as well as any other information that could possibly lead to the identification of participants (e.g., street names). The pseudonymization will be carried out in the audio tool Audacity. For the transcripts, the personal names will be changed to randomly selected pseudonyms that reflect the mentioned individual’s gender. In instances where unexpected speakers joined or interrupted the conversation (e.g., a bartender who comes to take the order of the participants who are recording in a coffee shop), their speech will be deleted from the recordings and transcripts, as they did not provide explicit consent to being recorded.
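For the transcripts, the replacement of personal names by gender-matched pseudonyms could be sketched in Python as follows. The pseudonym pools, the names in the usage example, and the function names are all invented for this illustration and do not reflect the actual procedure or any real participant; in practice, the list of names to replace would have to be compiled manually for each conversation.

```python
import random
import re

# Hypothetical pseudonym pools, keyed by the gender of the mentioned individual.
PSEUDONYM_POOLS = {
    "f": ["Ana", "Laura", "Emily", "Carla"],
    "m": ["Luis", "Marco", "Kevin", "Pedro"],
}


def build_mapping(names_with_gender, seed=0):
    """Assign each real name a distinct, randomly chosen pseudonym of the same gender."""
    rng = random.Random(seed)
    real_names = set(names_with_gender)
    mapping = {}
    for gender, pool in PSEUDONYM_POOLS.items():
        # Never reuse a real name as a pseudonym.
        candidates = [p for p in pool if p not in real_names]
        rng.shuffle(candidates)
        for name, name_gender in names_with_gender.items():
            if name_gender == gender:
                mapping[name] = candidates.pop()  # IndexError if the pool is too small
    return mapping


def pseudonymize(text, mapping):
    """Replace every whole-word occurrence of a mapped name in a single pass."""
    if not mapping:
        return text
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], text)


mapping = build_mapping({"Sofia": "f", "Diego": "m"})
print(pseudonymize("Sofia le dijo a Diego que Sofia viene", mapping))
```

Replacing all names in a single regular-expression pass avoids the situation in which a freshly inserted pseudonym is itself replaced by a later substitution.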
Once these steps have been performed, the 36 conversations for which both interlocutors gave permission for public sharing, together with the corresponding transcripts, the responses to the sociolinguistic questionnaire, and the technical metadata, will be made available in an open-access repository. At this moment, however, the El Paso Bilingual Corpus is stored on a secured SharePoint site hosted by Ghent University, to which limited access can be granted to interested researchers, provided that the privacy conditions are respected. Access can be requested by sending an email to the corresponding author of this paper, Prof. Dr. Renata Enghels (renata.enghels@ugent.be).
In the next section, we aim to illustrate the value and suitability of the El Paso Bilingual Corpus for research on code-switching, bilingualism, and language contact in general, by providing a brief overview of possible studies that can be carried out using this corpus and by presenting a concrete case study.

4. Case Study: Diminutive Expressions in Spanish–English Bilingual Speech

The spontaneous, bilingual, informal speech data of the El Paso Bilingual Corpus provide a wealth of opportunities for linguistic research in this language contact setting from various perspectives. To provide just a few examples, from a syntactic point of view, the cross-linguistic influence of Spanish and English in bilingual discourse can be explored, as well as the predictive power of the various code-switching frameworks and theoretical approaches (e.g., the Matrix Language Frame model, Myers-Scotton, 1993; Poplack’s Universal Constraints, Poplack, 1980; the Minimalist Approach, MacSwan, 1999). In particular, the El Paso Bilingual Corpus contains a variety of interesting linguistic phenomena that have been the subject of study in prior research of other communities, such as gender assignment strategies in determiner phrases (7) (Cisneros et al., 2023; Bellamy & Parafita Couto, 2022; Balam et al., 2021) and bilingual light verbs with hacer (Balam, 2015; Balam et al., 2022) (8).
(7) VM4054: Fuimos   a   la             school  juntos
            we.went  to  the.art.f[sg]          together⁶
            ‘We went to school together’
(El Paso Bilingual Corpus, Dominguez32)
(8) DN7658: ¿Qué  haces   gamble?
            What  do.you
            ‘What do you gamble?’
    KV9880: Dinero
            ‘money’
(El Paso Bilingual Corpus, Dominguez17)
From a perspective of pragmatics and discourse analysis, the data allow for the study of how bilinguals use code-switching strategically to achieve conversational goals, such as signaling topic shifts, emphasizing information, or negotiating identity (Bailey, 2022). From a sociolinguistic perspective, the corpus data provide opportunities for analyzing how extralinguistic factors, such as speakers’ gender, age, language proficiency, linguistic attitudes, etc., influence bilingual language use patterns, including code-switching. Moreover, the recordings enable phonological and phonetic research into cross-linguistic phonetic interference, shedding light on how bilinguals’ pronunciation is affected by their dual linguistic systems (Bullock & Gerfen, 2004; Olson, 2024; Stefanich & Amaro, 2018).
In addition, the highly parallel design of the El Paso Bilingual Corpus and the Bangor Miami Corpus provides valuable opportunities for comparative analyses of local speech practices and for determining the community-specific conventions that underlie bilinguals’ language use. Although Miami and El Paso are both cities in the southern United States, each has been shaped by its own migration history, as a result of which the two communities differ in some interesting ways. In particular, both El Paso and Miami accommodate large populations of Hispanic individuals, but they differ rather strongly in their ethnic composition. Whereas the large majority of Hispanic Americans in El Paso originate from Mexico (U.S. Census Bureau, 2022a; cf. also supra Section 3.3), Miami’s Hispanic population primarily traces its roots to Cuba, Nicaragua, and Colombia (P. M. Carter & Lynch, 2018). Additionally, while both communities are characterized by a sustained history of bilingualism, research suggests that Miami bilinguals are shifting toward English as their preferred and dominant language in their daily interactions (Zurer Pearson & McGee, 1993; Portes & Schauffler, 1996, as cited in P. M. Carter & Lynch, 2015; Hakuta & D’Andrea, 1992; Lambert & Taylor, 1996, as cited in Eilers et al., 2006). In particular, despite Miami’s rich cultural diversity, high levels of bilingualism, and generally positive attitudes toward Spanish, recent research shows that English is increasingly favored by younger generations (even in domestic settings, where heritage languages are typically most resilient), leading to a gradual decline in Spanish use (P. M. Carter & Lynch, 2018; P. Carter et al., 2020; Eilers et al., 2006). In El Paso, on the other hand, Spanish remains the dominant language for demographic and socio-economic reasons, given the city’s proximity to Mexico (Achugar & Pessoa, 2009).
According to the 2022 American Community Survey, 64.2% of El Paso residents reported speaking Spanish at home. Notably, the proportion of children (aged 5 to 17 years) who use Spanish in domestic contexts is also higher in El Paso than in Miami (17.2% vs. 10.5%), signaling that Spanish is retained somewhat better in El Paso (U.S. Census Bureau, 2022a, 2022b). These trends are also reflected in the two corpora. Although both are bilingual corpora, approximately 63% of the words in the Bangor Miami Corpus are English and about 34% are Spanish (the remaining 3% being ambiguous as to their language membership). In contrast, in the El Paso Bilingual Corpus, Spanish constitutes 59% of the lexical output, compared to 38% for English. Moreover, while 43% of the Miami participants reported in the sociodemographic background questionnaire that they converse more in English than in Spanish with their contacts, only 16% of the El Paso participants reported doing so. Conversely, 66% of the participants in the El Paso Bilingual Corpus indicated speaking more in Spanish than in English with their contacts, compared to 41% of the participants in Miami. Finally, a nearly equal proportion of participants reported using Spanish and English equally frequently when speaking with others (16% in Miami and 18% in El Paso)⁷. Accordingly, these data suggest that English indeed plays a more significant role in the bilingual community of Miami than in that of El Paso. In short, the sociolinguistic situation in these communities is quite different, which makes them valuable sites for investigating the influence of community-related social factors on bilingual speakers’ linguistic behaviors.
To illustrate the potential of such comparative analyses, we present a brief case study on the use of diminutive expressions by Spanish–English bilingual speakers in these communities. As a highly expressive phenomenon, diminutives are typically used in oral and colloquial discourse, especially between conversation partners who are well-acquainted with each other (Gorzycka, 2020; King & Melzi, 2004; Martín Zorraquino, 2012; Nieuwenhuis, 1985). The El Paso Bilingual Corpus and Bangor Miami Corpus, both consisting of spontaneous informal conversations between familiar interlocutors, thus serve as ideal datasets for the study of this phenomenon in language contact settings.

4.1. The Diminutive Construction

The diminutive is considered a near-universal linguistic primitive because its semantic–pragmatic values are shared across virtually all languages (Jurafsky, 1993, 1996; Nieuwenhuis, 1985). Traditionally, the diminutive is defined as a linguistic element that indicates diminution of the size or quantity of the referent it modifies (Ponsonnet, 2018; Schneider, 2013). Beyond mere objective meanings of reduction (e.g., a small tree), the diminutive can also express affective connotations, conveying the speaker’s positive or negative attitude toward the referent (e.g., a lovely little gift vs. the hideous little beast), and even pragmatic values, functioning as a metalinguistic hedge that downplays the illocutionary force of the speech act (Bialy, 2016; Dressler & Merlini Barbaresi, 1994; Mendoza, 2005). As such, the diminutive can be used to express politeness (9) or mitigate the speech act (10).
(9)  Gusta     un  cafe-cit-o?
     you.like  a   coffee.cn.m-dim.sx-m[sg]
     ‘Would you like some coffee?’
     (Mendoza, 2005, p. 164)
(10) tienes    un  problem-it-a                también  ahí
     you.have  a   problem.cn.m:-dim.sx-m[sg]  too      here
     ‘You have a little problem there too’
     (Bangor Miami Corpus, Zeledon05)
While these diminutive meanings occur with astonishing regularity across languages (Jurafsky, 1996), their formal features can diverge cross-linguistically. Most commonly, diminutives are formed synthetically by means of affixes. However, in some languages, diminutives are formed through lexical markers, creating analytical constructions. In addition, the contexts in which diminutives occur and their frequency of use are culture-specific (Bakema & Geeraerts, 2004). In cultures where emotions, particularly affection, are expressed more overtly, the diminutive paradigm of that language tends to be richer and more productive, as diminutives play a crucial role in the manifestation of emotions (Wierzbicka, 1991, as cited in Bakema & Geeraerts, 2004).
Hispanic culture, for example, is typically thought of as open, oriented to ‘in-group’ goals and needs, valuing physical contact and focused on building relationships (AyiConnect Staff, 2022; Ruiz, 2005). Correspondingly, Spanish features a rich and productive synthetic diminutive system, comprising an extensive inventory of diminutive affixes (King & Melzi, 2004; Nieuwenhuis, 1985; Sáenz, 1999). The Royal Spanish Academy (Real Academia Española, 2011) registers the suffixes -ito/a, -ico/a, -illo/a, -uco/a, -ín/ino/a, -iño/a, -uelo/a, -ejo/a, and -ete/a as diminutive markers, as well as the prefix mini- (Real Academia Española, 2014). These affixes can be added to a wide range of grammatical categories, including nouns (e.g., ratito < rato ‘a little while’), adjectives (e.g., verdecito < verde ‘greenish’), adverbs (e.g., ahí mismito < mismo ‘right there’), and even verb forms (e.g., andandito < andando ‘strolling’). Not all these suffixes are equally productive across Spanish-speaking regions. For example, the suffix -ico/a is typically associated with Caribbean Spanish and the peninsular Spanish dialects of Aragón, Navarra, Murcia, and Granada (Kornfeld, 2016; Real Academia Española, 2011), whereas -illo/a is characteristic of Sevillian Spanish and Mexican Spanish (Gaarder, 1966; Náñez Fernández, 1973). The suffix -iño/a is typical of Galician Spanish, owing to its close contact with and relation to Portuguese (which has a diminutive suffix -inho/a) (Martín Zorraquino, 2012). Of all the affixes, however, the suffix -ito/a is the most widespread across regions (Real Academia Española, 2011) and is therefore considered the non-marked Spanish diminutive marker (Náñez Fernández, 1973; Nieuwenhuis, 1985).
Finally, Spanish speakers can also resort to other types of markers to express smallness, such as the adjectives pequeño and chico (both meaning ‘little’) and the phrasal quantifier un poco (‘a little’), to form analytic diminutive constructions, but these are not considered official diminutive markers by the Royal Spanish Academy.
In contrast, Anglophone culture is generally perceived as a more closed culture, focused on the individual, respecting their personal space and privacy (AyiConnect Staff, 2022; Stone et al., 2006). In parallel, it has been maintained that the English diminutive system is unproductive, and that English has few, if any, diminutive markers (Bagasheva-Koleva, 2013; Jespersen, 1948; King & Melzi, 2004, p. 242; Kuzic, 2019; Spasovski, 2012; Turner, 1973). Indeed, English speakers have access to a limited set of synthetic affixes, including -y, -let, -ette, mini-, micro-, and -ish (Huddleston & Pullum, 2002). Nevertheless, English strongly favors analytic constructions over synthetic ones for the expression of diminutivity, in line with its typological nature as an analytic language (Dressler & Merlini Barbaresi, 1994; Gorzycka, 2020; Khachikyan, 2015; Lockyer, 2012; Schneider, 2003, 2013). Specifically, the following adjectives, which modify nominal base forms, have been categorized as English diminutive markers: little, small, teeny, tiny, wee, diminutive, minute, miniature, minimal, lilliput, and petite (Schneider, 2003, 2013). Of these analytic markers, little is generally thought of as ‘the functional equivalent to diminutive suffixes in other languages’ (Kruisinga, 1942, as cited in Schneider, 2003, p. 123), and thus considered the non-marked, default diminutive marker in English. Finally, although this is not frequently discussed in the literature on English diminutives, adjectives, adverbs, and verbal forms can also be diminutivized through the adverbial phrase markers a little and a bit (Bolinger, 1972, as cited in Dressler & Merlini Barbaresi, 1994, p. 115; Gorzycka, 2020).
In sum, although the diminutive is a near-universal linguistic phenomenon, it is not equally productive in every language, since its use is to some extent culture-dependent. In this regard, Spanish speakers use diminutives abundantly, while English diminutives are considered less productive. Furthermore, if Spanish and English were placed on an analytic–synthetic continuum regarding diminutive formation, Spanish would be positioned closer to the synthetic end, while English would be located more toward the analytic end. Given this variability in productivity and formation, the diminutive construction represents a potential conflict site (as understood by Poplack & Meechan, 1998) in Spanish–English bilingual contexts (Vanhaverbeke & Enghels, 2021). Consequently, the intriguing question arises of how Spanish–English bilinguals form and use diminutive expressions in their speech.
In the remainder of this section, we will present an investigation of the diminutive paradigms used by Spanish–English bilinguals from El Paso and Miami to explore the impact of language contact on this construction. Given that the diminutive construction is formed differently in Spanish (i.e., primarily synthetically) and English (i.e., primarily analytically), the first goal of this case study is to examine how Spanish–English bilinguals form diminutive expressions in language contact settings. In particular, we will investigate how the productivity of this phenomenon is manifested in Miami and El Paso, in terms of their frequency of occurrence (i.e., token frequency) and of the particular diminutive markers that are used to create diminutives (i.e., type frequency). Additionally, we will examine whether El Paso and Miami bilinguals show a predisposition for a specific formation strategy and/or diminutive language. Moreover, a comparative analysis between the two communities will allow us to examine the patterns of convergence and divergence in diminutive formation and use across these bilingual communities. As such, we will explore the extent to which bilingual speakers from different communities exhibit different or shared patterns of diminutive formation and usage.

4.2. Productivity of the Diminutive Construction and Its Types

Based on a close reading of the corpus transcriptions and attentive listening to the conversations, 995 diminutive expressions were extracted from the El Paso Bilingual Corpus and 891 from the Bangor Miami Corpus8. Given that the corpora are not entirely equivalent in word count, we calculated normalized frequencies per 10,000 words (Fn/10,000), as provided in Table 6. From these normalized frequencies, we can conclude that El Paso bilinguals use more diminutives in their speech, namely about 38 diminutives per 10,000 words, than Miami bilinguals, who use around 34 diminutives per 10,000 words. Moreover, this difference in frequency of use is confirmed to be statistically significant in a two-sample test for equality of proportions with continuity correction (χ2(1, N = 525,716) = 6.11, p = 0.0135).
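The normalization and the significance test described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' own script: the function names are introduced here for convenience, and the counts in the usage example are toy values rather than the corpus figures, which are reported in Table 6. The test is written as a 2 × 2 chi-squared test with Yates' continuity correction on diminutive versus remaining word tokens, which is what a two-sample test for equality of proportions with continuity correction amounts to for two groups. With one degree of freedom, the p-value follows from the complementary error function.

```python
import math


def normalized_frequency(hits: int, words: int, per: int = 10_000) -> float:
    """Occurrences per `per` words (Fn/10,000 by default)."""
    return hits * per / words


def p_value_df1(chi2: float) -> float:
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


def two_proportion_test(hits_a: int, words_a: int, hits_b: int, words_b: int):
    """Yates-corrected chi-squared test that two corpora share one rate."""
    table = [[hits_a, words_a - hits_a], [hits_b, words_b - hits_b]]
    row_totals = [words_a, words_b]
    col_totals = [hits_a + hits_b, (words_a - hits_a) + (words_b - hits_b)]
    n = words_a + words_b
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            # Continuity correction: shrink each deviation by 0.5, never below 0.
            deviation = max(abs(table[i][j] - expected) - 0.5, 0.0)
            chi2 += deviation ** 2 / expected
    return chi2, p_value_df1(chi2)


if __name__ == "__main__":
    # Toy counts for illustration only (hypothetical, not the corpus data).
    print(normalized_frequency(120, 30_000))
    print(two_proportion_test(120, 30_000, 90, 30_000))
```

Working with hits per fixed number of words keeps the two corpora comparable even though their sizes differ, while the test itself operates on the raw counts.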
Accordingly, bilingual speakers from El Paso produce more diminutive expressions in their speech than bilinguals from Miami. This difference in productivity might result from the difference in language use between the two communities discussed above. Specifically, in El Paso the Hispanic culture and Spanish language, characterized by a prolific use of diminutive constructions, seem to have maintained a stronger foothold due to the geographic proximity of this community to Mexico (Achugar & Pessoa, 2009; U.S. Census Bureau, 2022a, 2022b). In contrast, Miami’s linguistic landscape is characterized by a stronger influence of English and its more closed American culture (P. M. Carter & Lynch, 2015; Eilers et al., 2006), which may be reflected in the somewhat less productive use of diminutive expressions among Miami bilinguals.

4.2.1. Formation Strategies

Apart from the frequency of use of diminutive expressions, we will also explore the paradigm of diminutive markers that are employed in each of the communities and identify the extent to which bilingual speakers from different communities make use of the same or different systems. In particular, given that the dominant community language of Miami is English, whereas in El Paso it is Spanish, it is possible that the higher exposure to and use of these respective languages also influences the use of particular diminutive strategies (i.e., analytic or synthetic) and diminutive markers (e.g., -ito vs. -y, little vs. pequeño).
Table 7 provides the distribution of synthetic and analytic markers used in the El Paso Bilingual Corpus and Bangor Miami Corpus. From this table, it can be observed that both communities prefer to use synthetic markers to express diminutivity, although the El Paso community to a stronger degree than Miami. Specifically, in Miami, 63.2% of the constructions are synthetic (n = 563/891), whereas the proportion amounts to 75.8% in El Paso (n = 754/995). From this, it follows that Miami bilinguals use analytic diminutive constructions somewhat more frequently (36.8%, n = 328/891) than bilingual speakers in El Paso (24.2%, n = 241/995).
In this respect, a Pearson’s chi-squared test with Yates’ continuity correction and the Cramér’s V effect size reveal a weak but significant association between community and formation strategy (χ2(1, N = 1886) = 34.78, p < 0.0001; Cramér’s V = 0.136). Specifically, based on the standardized residuals9 provided in Table 8, we can conclude that, in El Paso, the synthetic formation strategy is used significantly more frequently (r = 2.25) and the analytic strategy significantly less frequently than expected (r = −3.42), whereas the reverse situation is true in Miami (synthetic: r = −2.37; analytic: r = 3.61).
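As a sketch of how these figures can be recomputed from the counts reported for Table 7, the following Python snippet calculates the Yates-corrected chi-squared statistic, Cramér’s V, and the cell-wise residuals. Several points are assumptions made for illustration: the residuals are taken to be Pearson residuals, (observed − expected)/√expected; Cramér’s V is derived from the corrected statistic; and the function name and data layout are introduced here. The snippet is not the authors' analysis script.

```python
import math

# Counts reported in the text for Table 7: (synthetic, analytic) per community.
COUNTS = {"El Paso": (754, 241), "Miami": (563, 328)}


def analyze_2x2(counts):
    """Yates-corrected chi-squared, p-value, Cramér's V and Pearson residuals."""
    labels = list(counts)
    table = [list(counts[label]) for label in labels]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    residuals = {}
    for i, label in enumerate(labels):
        residuals[label] = []
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            diff = table[i][j] - expected
            # Yates' correction applies to the test statistic only.
            chi2 += max(abs(diff) - 0.5, 0.0) ** 2 / expected
            residuals[label].append(diff / math.sqrt(expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # valid for 1 degree of freedom
    cramers_v = math.sqrt(chi2 / n)  # min(rows - 1, cols - 1) = 1 in a 2x2 table
    return chi2, p_value, cramers_v, residuals


if __name__ == "__main__":
    chi2, p_value, cramers_v, residuals = analyze_2x2(COUNTS)
    print(f"chi2 = {chi2:.2f}, p = {p_value:.2g}, V = {cramers_v:.3f}")
    for community, (synthetic, analytic) in residuals.items():
        print(f"{community}: synthetic {synthetic:.2f}, analytic {analytic:.2f}")
```

With the counts above, the statistic comes out at about 34.78 and Cramér’s V at about 0.136. Each residual is scaled by the expected count of its own cell, so the two cells of a row share the same raw deviation but differ in magnitude.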
Accordingly, these findings show that the bilingual speakers in both communities lean toward the default formation strategy of their ethnolect. Nevertheless, the fact that Miami bilinguals employ significantly more analytic diminutive constructions aligns with our earlier observation that English appears to have a greater influence on their diminutive paradigm due to its status as dominant community language (P. M. Carter & Lynch, 2015; Eilers et al., 2006). In fact, if we look at the distribution of the Spanish and English synthetic and analytic markers in the communities, provided in Table 9, these findings seem to further support this idea. Specifically, in Miami, the large majority of analytic diminutive constructions is created using an English marker (84.8%, n = 278/328, compared to 63.1%, n = 152/241 in El Paso). Additionally, a larger percentage of the synthetic constructions comprise an English suffix (5.7%, n = 32/563) in comparison to El Paso (1.2%, n = 9/754), despite the prevalence of Spanish synthetic constructions in both communities. Accordingly, Miami bilingual speakers make more frequent use of English diminutive expressions than their El Paso counterparts.

4.2.2. Paradigm of Diminutive Markers

Let us now provide a bit more detail and explore the concrete paradigm of diminutive types attested in the corpora in order to uncover whether and to what extent bilingual speakers in El Paso and Miami make use of the same or different diminutive types to express diminutivity. In Table 10, the synthetic diminutive paradigms of the El Paso and Miami bilinguals are presented.
In particular, the findings indicate that in both communities, bilingual speakers primarily use the non-marked, default suffix -ito to form synthetic constructions (93.4%, n = 704/754 in El Paso and 86.1%, n = 485/563 in Miami), as shown in examples (11) and (12). While five other affixes are attested in both communities, these are used much less frequently. Interestingly, the second most frequent suffix in El Paso is -illo (13) (5.0%, n = 38/754), whereas the second most frequent type in Miami is -ico (14) (7.8%, n = 44/563). Given that the suffix -ico is typical of the Caribbean Spanish regiolect (Kornfeld, 2016) and -illo is very prevalent in Mexico (Gaarder, 1966), these findings demonstrate that bilingual speakers retain varietal features of their ethnic language variety.
(11) MG4783: tenía    la   cara  rendond-it-a
             she.had  the  face  round.adj-dim.sx-f[sg]
     ‘She had a roundish face’
     (El Paso Bilingual Corpus, Dominguez07)
(12) TIM: se    te   va a      llevar  tu    apartement-it-o
          them  you  going to  take    your  apartment.cn.m-dim.sx-m[sg]
     ‘it’s gonna take your little apartment with man’
     (Bangor Miami Corpus, Herring12)
(13) TH5643: los  papell-ill-o-s          esos   que   dicen
             the  paper.cn.m-dim.sx-m-pl  those  that  say
     ‘those little papers that say’
     (El Paso Bilingual Corpus, Dominguez13)
(14) VAN: faltan     dos  minut-ic-o-s
          they.miss  two  minute.cn.m-dim.sx-m-pl
     ‘I need two little minutes’
     (Bangor Miami Corpus, Herring13)
Moreover, the synthetic markers used in Miami and El Paso are not all Spanish types. In particular, two English suffixes are attested in both communities, namely -y and -ish. Whereas in El Paso these suffixes are used only rarely (-y: 0.8%, n = 6/754 and -ish: 0.4%, n = 3/754) (15)–(16), these suffixes are slightly more frequent in Miami (-y: 4.4%, n = 25/563 and -ish: 1.1%, n = 6/563) (17)–(18). Finally, the prefix mini- is used once in both communities, thus constituting a hapax case (19)–(20).
(15) TD6891: Dimitri and Rueben were on that little ride thing-ie
                                                         thing.cn.m-dim.sx-m[sg]
     (El Paso Bilingual Corpus, Dominguez41)
(16) IB6767: es     un  mero  comediante  famoso-ish
             he.is  a   mere  comedian    famous.adj-dim.sx
     ‘He is a mere famous-ish comedian’
     (El Paso Bilingual Corpus, Dominguez41)
(17) MAR: all become little fish-ie-s
                            fish.cn-dim.sx-pl
     (Bangor Miami Corpus, María10)
(18) MAR: we don’t like the red-d-ish tone one
                            red.adj-dim.sx
     (Bangor Miami Corpus, María16)
(19) AQ4720: hay        un  chorro  de  mini-grupos
             there.are  a   bunch   of  dim.pref-group.cn.m[pl]
     ‘There are a bunch of minigroups’
     (El Paso Bilingual Corpus, Dominguez24)
(20) CHA: he was just this  little          mini-guy
                            little.dim.adj  dim.pref-guy.cn
     (Bangor Miami Corpus, Zeledon09)
Regarding the analytic paradigm in El Paso and Miami, the variety of markers must be investigated within their class (i.e., adjective or phrasal marker), as the type of marker that is used in analytic constructions depends on the grammatical category of its base form. On the one hand, adjective markers cannot combine with verbal (*moving little, *nos solíamos pequeño ‘we tanned little’) or adverbial forms (*small differently, *pequeño rápido ‘little fastly’). On the other hand, phrasal markers cannot combine with nominal bases (*a bit flowers, *un poco estante ‘a little bit rack’). Table 11 presents the analytic types found in the El Paso Bilingual Corpus and the Bangor Miami Corpus.
In particular, in both communities, the same five adjective markers have been attested, though with differing frequencies. Little is used in the majority of cases in both El Paso and Miami to create analytic constructions (71.7%, n = 104/145 in El Paso and 86.8%, n = 184/212 in Miami) (21)–(22). Accordingly, just as El Paso and Miami bilinguals resort to the Spanish default marker -ito to create synthetic constructions, they tend to use the English default marker in their analytic diminutives. In contrast, the remaining two English adjectives are employed only sporadically. In El Paso, tiny is used in only 4 constructions (2.8%, n = 4/145) (23) and small in only 3 (2.1%, n = 3/145) (24), whereas in Miami small accounts for 12 of the analytic diminutive expressions (5.7%, n = 12/212) (25) and tiny for 3 (1.4%, n = 3/212) (26).
Nevertheless, El Paso and Miami bilinguals also make use of two Spanish adjective markers. In El Paso, chico even turns out to be the second most productive adjective type after little (20.7%, n = 30/145) (27). In Miami, this adjective is less productive, occurring in only 5.2% of the analytic nominal constructions (n = 11/212) (28). The Spanish pequeño is in turn less productive, occurring in only 4 constructions in El Paso (2.8%) (29) and in 2 constructions in Miami (0.9%) (30).
(21) FX7977: driving around  en  esos   little          karts
                             in  these  little.dim.adj  karts.cn[pl]
     ‘driving around in these little karts’
     (El Paso Bilingual Corpus, Dominguez01)
(22) LUK: that  little          girlfriend         of mine
                little.dim.adj  girlfriend.cn[sg]
     (Bangor Miami Corpus, Herring9)
(23) XF0788: a family of ten would live in a  tiny          house
                                              tiny.dim.adj  house.cn[sg]
     (El Paso Bilingual Corpus, Dominguez01)
(24) PQ8810: or into a  small          house
                        small.dim.adj  house.cn[sg]
     (El Paso Bilingual Corpus, Dominguez28)
(25) LUK: and a  small          square         at that
                 small.dim.adj  square.cn[sg]
     (Bangor Miami Corpus, Herring9)
(26) GIL: we put “just kidding” in really  tiny          letters
                                           tiny.dim.adj  letters.cn[pl]
     (Bangor Miami Corpus, Zeledon9)
(27) BS2736: así como  un  espacio   chiquito
             like      a   space.cn  small.dim.adj:dim.sx
             que también  lo  tiene
             that also    it  has
     ‘like a small little space it also has’
     (El Paso Bilingual Corpus, Dominguez04)
(28) AUD: es     un  niño          chiquito                    todavía
          he.is  a   kid.cn.m[sg]  small.dim.adj:dim.sx.m[sg]  still
     ‘he’s a little kid still’
     (Bangor Miami Corpus, Herring9)
(29) AQ4720: se hacen  tres   círculos         pequeños
             are.made  three  circle.cn.m[pl]  small.dim.adj.m[pl]
     ‘three small circles are made’
     (El Paso Bilingual Corpus, Dominguez24)
(30) KEV: esto  es  un  pequeño               pocket
          this  is  a   little.dim.adj.m[sg]  pocket.cn.m[sg]
     ‘This is a little pocket’
     (Bangor Miami Corpus, Sastre1)
With respect to the phrasal markers, three types are attested in both communities, while a fourth type, un chin, occurs exclusively in Miami (31). However, given that this marker is used in only one construction, thus constituting a hapax case, it cannot be considered a productive type. In El Paso, over half of the non-nominal analytic constructions are formed with the Spanish marker un poco (57.3%, n = 55/96) (32). In Miami, this type occurs in around a third of the constructions (31.0%, n = 36/116) (33), surpassed only by the English marker a bit (39.7%, n = 46/116) (34). The latter form constitutes the second most productive type in El Paso (24.0%, n = 23/96) (35). The remaining 18.8% (n = 18/96) of the non-nominal analytic constructions in El Paso and 28.4% (n = 33/116) in Miami are formed with the English type a little (36)–(37).
(31) LIL:      es     casi    igual  un chin        más
          yes  it.is  almost  equal  a bit.dim.phr  more.adv
     ‘yes it is almost the same, a bit more…’
     (Bangor Miami Corpus, Sastre5)
(32) EA3159: me  siento  también  un poco        raro
             me  I.feel  also     a bit.dim.phr  weird.adj.m[sg]
     ‘I feel also a little weird’
     (El Paso Bilingual Corpus, Dominguez15)
(33) CON: la historia  que   ella  me  contó  es
          the story    that  she   me  told   is
          un poquito               diferente
          a little.dim.adv:dim.sx  different.adj
     ‘the story that she told me is a little bit different’
     (Bangor Miami Corpus, Herring14)
(34) MAR: it is  a bit          chilly      right now
                 a bit.dim.phr  chilly.adj
     (Bangor Miami Corpus, María19)
(35) TD6891: I’m  a little bit   deaf      y’all
                  a bit.dim.phr  deaf.adj
     (El Paso Bilingual Corpus, Dominguez15)
(36) JM6290: she gets  a little          jealous
                       a little.dim.phr  jealous.adj
     (El Paso Bilingual Corpus, Dominguez03)
(37) FLA: it’s getting  a little          more      carnoso  aquí
                        a little.dim.phr  more.adv  fleshy
     ‘it’s getting a little more fleshy around here’
     (Bangor Miami Corpus, Zeledon08)
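To make the distinction between token frequency (how often a class of markers is used) and type frequency (how many distinct markers are attested) concrete, the counts quoted in this subsection for the analytic markers can be tabulated as in the Python sketch below. The dictionary layout and the helper name `summarize` are illustrative assumptions; the snippet simply recomputes totals and percentage shares from the counts given in the running text and is not part of the authors' workflow.

```python
# Analytic marker counts as quoted in the running text of this subsection.
ANALYTIC_COUNTS = {
    "El Paso": {
        "adjective": {"little": 104, "chico": 30, "pequeño": 4, "tiny": 4, "small": 3},
        "phrasal": {"un poco": 55, "a bit": 23, "a little": 18},
    },
    "Miami": {
        "adjective": {"little": 184, "small": 12, "chico": 11, "tiny": 3, "pequeño": 2},
        "phrasal": {"a bit": 46, "un poco": 36, "a little": 33, "un chin": 1},
    },
}


def summarize(marker_counts):
    """Return (token frequency, type frequency, percentage share per marker)."""
    tokens = sum(marker_counts.values())
    types = len(marker_counts)
    shares = {
        marker: round(100 * count / tokens, 1)
        for marker, count in marker_counts.items()
    }
    return tokens, types, shares


if __name__ == "__main__":
    for community, classes in ANALYTIC_COUNTS.items():
        for marker_class, marker_counts in classes.items():
            tokens, types, shares = summarize(marker_counts)
            print(community, marker_class, tokens, types, shares)
```

Recomputing the shares from the raw counts in this way also serves as a quick consistency check, since the counts within each class should add up to the class total reported in the text.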
In sum, the findings suggest that El Paso and Miami bilinguals make use of a very similar paradigm of synthetic and analytic diminutive markers, though some markers are more productive in one or the other community. In particular, bilingual speakers chiefly turn to the default diminutive markers of their languages, -ito and little, to create synthetic and analytic diminutive expressions, respectively. In addition, we found that bilingual speakers still retain diminutive types that are typically used in the country of origin of their heritage varieties. Finally, the findings demonstrate that Spanish–English bilinguals employ not only Spanish synthetic and English analytic markers, but also Spanish analytic and English synthetic markers.

5. Conclusions

This paper introduced the El Paso Bilingual Corpus, a newly compiled dataset of spontaneous Spanish–English bilingual speech, with the aim of advancing research on bilingual language use, including code-switching. It has been shown that the corpus provides a unique resource that enables systematic, data-driven investigations into bilingual speech patterns, sociolinguistic variation, and the cognitive underpinnings of code-switching.
First, a review of the existing Spanish–English bilingual corpora revealed a lack of freely accessible, large-scale datasets that document spontaneous, conversational bilingual behavior and allow for contrastive studies to investigate the impact of sociolinguistic factors in shaping bilingual language patterns. Many available corpora focus on sociolinguistic interviews or controlled experimental speech, limiting their generalizability for studying bilingual interaction in everyday settings. The El Paso Bilingual Corpus was designed to complement these resources by capturing naturally occurring bilingual conversations, following the methodological framework of the Bangor Miami Corpus to ensure comparability. The data collection process prioritized authenticity by recording informal conversations between acquainted bilingual speakers in familiar environments. This methodological approach reduced observer effects and allowed for a more accurate representation of bilingual speech practices. The corpus includes 42 recorded conversations, totaling approximately 270,000 words, and features diverse bilingual speech patterns reflecting the sociolinguistic profile of the El Paso community.
Next, a case study on diminutive expressions illustrated the empirical value of the corpus, showcasing how the data can be used to uncover community-specific and community-transcendent speech patterns in comparative studies. In particular, the analysis revealed distinct patterns in diminutive formation between the El Paso and Miami bilingual communities, with El Paso speakers favoring Spanish-derived synthetic diminutives (e.g., -ito, -illo), while Miami speakers showed an increased tendency to use English analytic diminutives (e.g., little, a bit). These findings align with broader sociolinguistic trends, reflecting differences in community language dominance and contact dynamics.
To conclude, by documenting spontaneous bilingual speech, the El Paso dataset supports interdisciplinary research in linguistics, psycholinguistics, and sociolinguistics, offering insights into bilingual processing, language variation, and community-specific linguistic conventions. The corpus facilitates comparative studies with other bilingual datasets, helping to refine theoretical models of language contact and multilingual communication. More broadly, this study promotes a further positive conceptualization of bilingualism and code-switching. While historically stigmatized as an indicator of linguistic deficiency, code-switching is, in fact, a structured, linguistically intricate, richly expressive and socially sensitive mode of speaking that forms an integral part of the ways bilinguals use language and is supported by a highly complex adaptive system of cognitive mechanisms (Beatty-Martínez et al., 2025). So, rather than positioning bilingual speech as a suboptimal variety relative to monolingual norms, it should be recognized as a communicative standard in its own right. The continued development and analysis of bilingual corpora will play a crucial role in enhancing our understanding of the cognitive and social dimensions of multilingualism.

Author Contributions

Conceptualization, M.V., R.E., M.d.C.P.C., I.I.; formal analysis, M.V.; data curation, M.V.; writing—original draft preparation, M.V., R.E.; writing—review and editing, R.E., M.d.C.P.C., I.I.; supervision, R.E., M.d.C.P.C., I.I.; funding acquisition, M.V., R.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by The Research Foundation—Flanders (fwo.be), grant numbers 1186523N and G020223N, as well as by the National Science Foundation (award 2021124) and the National Institute of Child Health and Human Development (R21HD109797).

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of Ghent University (code 2021-59, date of approval: 18 January 2022) as well as the Institutional Review Board of The University of Texas at El Paso (protocol code 1863588, date of approval: 28 March 2022).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The original data presented in the study are openly available in DataverseNO (TrolLing) at https://doi.org/10.18710/7LGSXY.

Acknowledgments

The authors are ever grateful to the bilingual speakers from El Paso who contributed to the creation of the El Paso Bilingual Corpus, to research assistant Anavictoria Dominguez for her role in the collection of the data, and to the transcribers, Anavictoria Dominguez, Lucía Pascual Perales, Patricia Vidal, Marcela de La Torre, and Simon Claassen for their contribution to the transcription of the recordings. They would also like to thank the reviewers for taking the time and effort necessary to review this manuscript and for their valuable comments and suggestions, which helped the authors to improve the quality of this paper.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Notes

1
In the documentation file of the Bangor Miami Corpus, the number of words of text mentioned is 242,475. However, this number does not take into account the speech of one of their participants, María, who recorded multiple conversations (for more information on the collection process of the Bangor Miami Corpus, see Deuchar, 2008; or Deuchar et al., 2014b).
2
This Ph.D. project, titled Diminutive expressions in Spanish-English language contact settings. A multifactorial and multimethod account, has been carried out by the first author under the main supervision of Prof. Dr. Renata Enghels and co-supervision of Dr. M. Carmen Parafita Couto, co-authors of this paper, and Prof. Dr. Robert Hartsuiker. Dr. M. Carmen Parafita Couto was also involved in the creation of the Bangor Miami Corpus.
3
This research lab is led by Prof. Dr. Iva Ivanova, co-author of this paper.
4
As the transcription process has not been finalized yet, this number should be taken as a preliminary index which may still change in the future.
5
However, for one session, the recorder had run out of battery prior to the recording, as a result of which the conversation was recorded using the research assistant’s iPhone 13 Pro Max. Although the recording quality was not as high as that of the Marantz recorder, it turned out to be sufficiently high to allow transcription of the conversation.
6
The glossing of the examples has been carried out following the Leipzig Glossing Rules (https://www.eva.mpg.de/lingua/pdf/Glossing-Rules.pdf, accessed on 20 August 2021).
7
These data form a strong abstraction of the responses retrieved from the following survey request: Make a list below of five of the people you speak to most in your everyday life, either in person or on the phone (...). Then note which language(s) you mostly speak with that person. For interested researchers, the detailed responses to the questionnaire for Miami can be found on the Bangor Miami Corpus website. The questionnaire data for the El Paso Bilingual Corpus will be freely accessible once the corpus is published in an open-access repository.
8
It needs to be noted that lexicalized diminutive forms (e.g., zorrillo ‘skunk’, almohadilla ‘inkpad’) are not included in these datasets, as they have acquired a proper meaning and have become autonomous lexemes in the process. Moreover, diminutive forms of proper names (e.g., Carlitos) are also excluded. In the Bangor Miami Corpus, all names have been pseudonymized, as a result of which we cannot know whether the transcriber has copied the diminutive form of the original name that was expressed or used a diminutivized pseudonym. In other words, the transcriber may have employed the pseudonym Juanito in lieu of the original name, which could have been, for example, Carlitos or just as well Carlos.
9
Residuals measure the degree to which an observed frequency deviates from the expected value under the null hypothesis (Lowry, 2023). Standardized residuals, then, adjust these differences by scaling them with the standard deviation, thereby quantifying the magnitude of deviation in standard deviation units (Pardoe et al., 2018). A positive standardized residual indicates that the observed frequency exceeds the expected value, while a negative residual suggests it is lower than expected (Lowry, 2023). Typically, a standardized residual with an absolute value greater than 2 is considered noteworthy, as it implies that the observed deviation is unlikely under the null hypothesis (Pardoe et al., 2018).

References

  1. Achugar, M., & Pessoa, S. (2009). Power and place: Language attitudes towards Spanish in a bilingual academic community in Southwest Texas. Spanish in Context, 6(2), 199–223.
  2. Amengual, M., Kim, J.-Y., & Davidson, J. (2025). Multilingual Hispanic Speech in California (MuHSiC) [Dataset]. University of California. Available online: https://muhsic.acad.ucsc.edu/ (accessed on 9 January 2025).
  3. Asociación de Academias de la Lengua Española. (2010). Diccionario de americanismos [online]. Available online: https://www.asale.org/damer/ (accessed on 12 January 2025).
  4. Audacity Team. (1999). Audacity(R): Free audio editor and recorder (Version 3.3.3) [Computer software]. Muse Group & Contributors. Available online: https://www.audacityteam.org/ (accessed on 14 January 2025).
  5. AyiConnect Staff. (2022, December 12). Latin America and Anglo America: What are their differences? AyiConnect. Available online: https://www.ayiconnection.com/blog/latin-america-and-anglo-america-what-are-their-differences#:~:text=The%20Hispanic%20culture%20is%20more,important%2C%20their%20goal%20is%20stability (accessed on 24 May 2024).
  6. Bagasheva-Koleva, M. (2013). Some correlates between diminutive words in Bulgarian, Russian and English. Bulgaria Research Papers, 51(1b), 138–147.
  7. Bailey, B. (2022). Social/interactional functions of code switching among Dominican Americans. Pragmatics. Quarterly Publication of the International Pragmatics Association (IPrA), 10, 165–193.
  8. Bakema, P., & Geeraerts, D. (2004). Diminution and augmentation. In G. Booij, C. Lehmann, J. Mugdan, & S. Skopeteas (Eds.), Morphology: An international handbook on inflection and word-formation (pp. 1045–1052). Walter de Gruyter.
  9. Balam, O. (2015). Code-switching and linguistic evolution: The case of ‘Hacer + V’ in Orange Walk, Northern Belize. Lengua y Migración/Language and Migration, 7(1), 83–109.
  10. Balam, O., Lakshmanan, U., & Parafita Couto, M. C. (2021). Gender assignment strategies among simultaneous Spanish/English bilingual children from Miami, Florida. Studies in Hispanic and Lusophone Linguistics, 14(2), 241–280.
  11. Balam, O., Parafita Couto, M. C., & Stadthagen-González, H. (2020). Bilingual verbs in three Spanish/English code-switching communities. International Journal of Bilingualism, 24(5–6), 952–967.
  12. Balam, O., Stadthagen-González, H., Rodríguez-González, E., & Parafita Couto, M. C. (2022). On the grammaticality of passivization in bilingual compound verbs. International Journal of Bilingualism, 27(4), 415–431.
  13. Beatty-Martínez, A. L., Parafita Couto, M. C., Ameka, F. K., & Aboh, E. O. (2025). Codeswitching. Reference Module in Social Sciences, 1–4.
  14. Beatty-Martínez, A. L., Valdés Kroff, J., & Dussias, P. E. (2018). From the field to the lab: A converging methods approach to the study of codeswitching. Languages, 3(2), 19.
  15. Bellamy, K., Child, M. W., González, P., Muntendam, A., & Parafita Couto, M. C. (Eds.). (2017). Multidisciplinary approaches to bilingualism in the Hispanic and Lusophone world. John Benjamins Publishing Company.
  16. Bellamy, K., & Parafita Couto, M. C. (2022). Gender assignment in mixed noun phrases. In D. Ayoun (Ed.), The acquisition of gender: Crosslinguistic perspectives (pp. 13–48). John Benjamins Publishing Company.
  17. Berez, A. L. (2007). EUDICO Linguistic Annotator (ELAN) from Max Planck Institute for psycholinguistics. Language Documentation & Conservation, 1(2), 283–289.
  18. Beseghi, M. (2019). The representation and translation of identities in multilingual TV series: Jane the Virgin, a case in point. MonTi: Monografías de Traducción e Interpretación, 4, 145–172.
  19. Bialy, P. (2016). The usage of diminutives in polite phrases as a way to express positive/negative politeness or to formulate face-threatening acts in Polish. In E. Bogdanowska-Jakubowska (Ed.), New ways to face and (im)politeness (pp. 133–155). Wydawnictwo Uniwersytetu Śląskiego.
  20. Biber, D. (2012). Register as a predictor of linguistic variation. Corpus Linguistics and Linguistic Theory, 8(1), 9–37.
  21. Blokzijl, J., Deuchar, M., & Parafita Couto, M. C. (2017). Determiner asymmetry in mixed nominal constructions: The role of grammatical factors in data from Miami and Nicaragua. Languages, 2(4), 20.
  22. Bullock, B. E., & Gerfen, C. (2004). Phonological convergence in a contracting language variety. Bilingualism: Language and Cognition, 7(2), 95–104.
  23. Bullock, B. E., & Toribio, A. J. (2024). Spanish in Texas corpus [Dataset]. Texas Data Repository.
  24. Callahan, L. (2004). Spanish/English code-switching in a written corpus. John Benjamins Publishing Company.
  25. Carter, P., López Valdez, L., & Sims, N. (2020). New dialect formation through language contact: Vocalic and prosodic developments in Miami English. American Speech, 95(2), 120–149.
  26. Carter, P. M., & Lynch, A. (2015). Multilingual Miami: Current trends in sociolinguistic research. Language and Linguistics Compass, 9(9), 369–385.
  27. Carter, P. M., & Lynch, A. (2018). On the status of Miami as a southern city. Defining language and region through demography and social history. In W. Wolfram, K. Wojcik, E. Wilbanks, & J. Reaser (Eds.), Language variety in the new south: Contemporary perspectives on change and variation. The University of North Carolina Press.
  28. Carvalho, A. M. (2012). Corpus del Español en el Sur de Arizona (CESA) [Dataset]. University of Arizona. Available online: https://cesa.arizona.edu/ (accessed on 9 January 2025).
  29. Christoffersen, K., Besset, R. M., & Carvalho, A. M. (2021). Technologically-aided transcription methods for bilingual sociolinguistic corpora: Findings, resources, and considerations project overview. Writing and Language Studies Faculty Publications and Presentations University of Texas Rio Grande Valley. Available online: https://scholarworks.utrgv.edu/wls_fac/123/ (accessed on 12 January 2025).
  30. Christoffersen, K., & Ciller, J. (2024). Corpus Bilingüe del Valle (CoBiVa). University of Texas Rio Grande Valley. Available online: https://utrgv.edu/cobiva (accessed on 9 January 2025).
  31. Cisneros, M., Rodríguez-González, E., Bellamy, K., & Parafita Couto, M. C. (2023). Gender strategies in the perception and production of mixed nominal constructions by New Mexico Spanish-English bilinguals. Isogloss. Open Journal of Romance Linguistics, 9(2), 1–30.
  32. Deuchar, M. (2008). The Miami Corpus: Documentation file. Available online: https://bangortalk.org.uk/docs/Miami_doc.pdf (accessed on 9 November 2020).
  33. Deuchar, M. (2020). Code-Switching in linguistics: A position paper. Languages, 5(2), 22.
  34. Deuchar, M., Carter, D., Davies, P., Donnelly, K., Parafita Couto, M. C., Stammers, J., Aveledo, F., Fusser, M., Jones, L., Lloyd-Williams, S., Myfyr, P., & Robert, E. (2014a). Bangor Miami corpus [Dataset]. Bangortalk.
  35. Deuchar, M., Davies, P., Herring, J. R., Parafita Couto, M. C., & Carter, D. (2014b). Building bilingual corpora. In E. M. Thomas, & I. Mennen (Eds.), Advances in the study of bilingualism (pp. 93–111). Multilingual Matters.
  36. Dewaele, J.-M. (2010). Emotions in multiple languages. Palgrave Macmillan.
  37. Dewaele, J.-M., & Li, W. (2014). Intra-and inter-individual variation in self-reported code-switching patterns of adult multilinguals. International Journal of Multilingualism, 11(2), 225–246.
  38. Dressler, W. U., & Merlini Barbaresi, L. (1994). Morphopragmatics: Diminutives and intensifiers in Italian, German, and other languages. M. de Gruyter.
  39. Dufour, R., Estève, Y., & Deléglise, P. (2014). Characterizing and detecting spontaneous speech: Application to speaker role recognition. Speech Communication, 56, 1–18.
  40. Eberhard, D. M., Simons, G. F., & Fennig, C. D. (Eds.). (2024). How many languages are there in the world? Ethnologue. Available online: https://www.ethnologue.com/insights/how-many-languages/ (accessed on 11 February 2025).
  41. Eilers, R. E., Pearson, B. Z., & Cobo-Lewis, A. B. (2006). Social factors in bilingual development: The Miami experience. In P. McCardle, & E. Hoff (Eds.), Childhood bilingualism (pp. 68–90). Multilingual Matters.
  42. El Paso City Hall. (2024). Population demographics. Elpasotexas.Gov. Available online: https://www.elpasotexas.gov/economic-development/economic-snapshot/population-demographics/ (accessed on 15 April 2024).
  43. Enghels, R., & Azofra Sierra, M. E. (2018). Sobre la naturaleza de los corpus y la comparabilidad de resultados en lingüística histórica: Estudio de caso del marcador pragmático sabes. Spanish in Context, 15(3), 465–489.
  44. Gaarder, A. B. (1966). Los llamados diminutivos y aumentativos en el español de México. PMLA, 81(7), 585.
  45. González-Vilbazo, K., & Koronkiewicz, B. (2016). Tú y yo can codeswitch, nosotros cannot: Pronouns in Spanish-English code-switching. In R. E. Guzzardo Tamargo, C. M. Mazak, & M. C. Parafita Couto (Eds.), Spanish-English code-switching in the Caribbean and the US (Vol. 11, pp. 237–260). John Benjamins Publishing Company.
  46. Gorzycka, D. (2020). Diminutive constructions in English. Peter Lang.
  47. Grosjean, F. (1985). The bilingual as a competent but specific speaker-hearer. Journal of Multilingual and Multicultural Development, 6(6), 467–477.
  48. Gullberg, M., Indefrey, P., & Muysken, P. (2009). Research techniques for the study of code-switching. In B. E. Bullock, & A. J. Toribio (Eds.), The Cambridge handbook of linguistic code-switching (1st ed., pp. 21–39). Cambridge University Press.
  49. Huddleston, R., & Pullum, G. K. (2002). The Cambridge grammar of the English language (1st ed.). Cambridge University Press.
  50. Hulstijn, J. H. (2011). Language proficiency in native and nonnative speakers: An agenda for research and suggestions for second-language assessment. Language Assessment Quarterly, 8(3), 229–249.
  51. Hurtado, I. (2022, June 20–25). BILinMID: A Spanish-English corpus of the US Midwest. Thirteenth Language Resources and Evaluation Conference (LREC 2022) (pp. 5511–5516), Marseille, France.
  52. Jespersen, O. (1948). Growth and structure of the English language. Basil Blackwell.
  53. Jurafsky, D. (1993, February 12–15). Universals in the semantics of the diminutive. Nineteenth Annual Meeting of the Berkeley Linguistics Society: General Session and Parasession on Semantic Typology and Semantic Universals (pp. 423–436), Berkeley, CA, USA.
  54. Jurafsky, D. (1996). Universal tendencies in the semantics of the diminutive. Language, 72(3), 533–578.
  55. Keller, G. (1979). The literary strategems available to the bilingual Chicano writer. In F. Jiménez (Ed.), The identification and analysis of Chicano literature (pp. 262–316). Bilingual Press.
  56. Khachikyan, S. (2015). Diminutives as intimacy expressions in English and Armenian. Armenian Folia Anglistika, 11(2), 78–83.
  57. King, K., & Melzi, G. (2004). Intimacy, imitation and language learning: Spanish diminutives in mother-child conversation. First Language, 24(2), 241–261.
  58. Koch, P., & Oesterreicher, W. (1985). Sprache der Nähe—Sprache der Distanz. Romanistisches Jahrbuch, 36(85), 15–43.
  59. Kornfeld, L. M. (2016). “Una propuestita astutita”: El diminutivo como recurso atenuador. Revista Internacional de Lingüística Iberoamericana, 14(1), 123–135.
  60. Kuzic, I. (2019). Diminutives in Portuguese and their equivalents in English [Master’s thesis, Zagreb University].
  61. Labov, W. (1972). Sociolinguistic patterns. University of Pennsylvania Press.
  62. Lipski, J. M. (1982). Spanish-English language switching in speech and literature: Theories and models. Bilingual Review/La Revista Bilingüe, 9(3), 191–212.
  63. Lockyer, D. (2012). Such a tiny little thing: Diminutive meanings in Alice in Wonderland as a comparative translation study of English, Polish, Russian and Czech. Verges: Germanic & Slavic Studies in Review, 1(1), 10–22.
  64. Lowry, R. (2023). Chi-square, Cramer’s V, and lambda. Vassarstats Net. Available online: http://vassarstats.net/newcs.html (accessed on 20 November 2024).
  65. MacSwan, J. (1999). A minimalist approach to intrasentential code switching. Garland.
  66. Martín Zorraquino, M. A. (2012). Sobre los diminutivos en español y su función en una teoría de la cortesía verbal (con referencia especial a un cuento de Antonio de Trueba). In T. E. Jiménez Juliá, B. López Meirama, V. Vázquez Rozas, & A. Veiga Rodríguez (Eds.), Cum corde et in nova grammatica: Estudios ofrecidos a Guillermo Rojo (pp. 555–569). Universidade de Santiago de Compostela, Servicio de Publicaciones e Intercambio Científico.
  67. Mendoza, M. (2005). Polite diminutives in Spanish. In R. T. Lakoff, & S. Ide (Eds.), Broadening the horizon of linguistic politeness (pp. 163–173). John Benjamins Publishing.
  68. Merriam-Webster. (2025). Merriam-Webster’s dictionary of English usage. Available online: https://www.merriam-webster.com/ (accessed on 14 January 2025).
  69. Milroy, L. (1987). Language and social networks (2nd ed.). Blackwell.
  70. Moreno-Fernández, F. (Director). (2018). CORPEEU: Corpus del Español en los Estados Unidos [Dataset]. With the col. of F. Javier Pueyo Mena. Instituto Cervantes at Harvard University—ANLE. Available online: https://corpeeu.org/ (accessed on 9 January 2025).
  71. Moslimani, M., & Noe-Bustamante, L. (2023, August 16). Facts on Latinos in the U.S. Pew Research Center. Available online: https://www.pewresearch.org/race-and-ethnicity/fact-sheet/latinos-in-the-us-fact-sheet/ (accessed on 6 February 2025).
  72. Muysken, P. (2013). Language contact outcomes as the result of bilingual optimization strategies. Bilingualism: Language and Cognition, 16(4), 709–730.
  73. Myers-Scotton, C. (1993). Duelling languages: Grammatical structure in code-switching. Clarendon Press.
  74. Náñez Fernández, E. (1973). El diminutivo: Historia y funciones en el español clásico y moderno. Universidad Autónoma de Madrid.
  75. Nieuwenhuis, P. (1985). Diminutives [Ph.D. thesis, University of Edinburgh].
  76. Olson, D. J. (2024). Code-switching and language mode effects in the phonetics and phonology of bilinguals. In M. Amengual (Ed.), The Cambridge handbook of bilingual phonetics and phonology (1st ed., pp. 677–698). Cambridge University Press.
  77. Pablos, L., Parafita Couto, M. C., Boutonnet, B., De Jong, A., Perquin, M., De Haan, A., & Schiller, N. O. (2018). Adjective-noun order in Papiamento-Dutch code-switching. Linguistic Approaches to Bilingualism, 9(4–5), 710–735.
  78. Palacios, A. (2014). Variación y cambio lingüístico en situaciones de contacto: Algunas precisiones teóricas. In P. M. Butragueño, & L. Orozco (Eds.), Argumentos cuantitativos y cualitativos en sociolingüística: Segundo coloquio de cambio y variación lingüística (pp. 267–294). El Colegio de México.
  79. Parafita Couto, M. C., Greidanus Romaneli, M., & Bellamy, K. (2021). Code-switching at the interface between language, culture, and cognition. Lapurdum, 1–26. Available online: https://shs.hal.science/halshs-03280922v1 (accessed on 11 February 2025).
  80. Pardoe, I., Simon, L., & Young, D. (2018). 9.3—Identifying outliers (unusual Y values). Stat 462: Applied Regression Analysis. Available online: https://online.stat.psu.edu/stat462/node/172/ (accessed on 20 November 2024).
  81. Ponsonnet, M. (2018). A preliminary typology of emotional connotations in morphological diminutives and augmentatives. Studies in Language, 42(1), 17–50.
  82. Poplack, S. (1980). Sometimes I’ll start a sentence in Spanish y termino en español: Toward a typology of code-switching. Linguistics, 18(7), 581–618.
  83. Poplack, S., & Meechan, M. (1998). Introduction: How languages fit together in codemixing. International Journal of Bilingualism, 2(2), 127–138.
  84. Potowski, K. (n.d.). CHISPA. Kim Potowski homepage. Available online: https://www.potowski.org/chispa (accessed on 28 April 2025).
  85. Potowski, K., & Torres, L. (2023). The Chicago Spanish (CHISPA) corpus. In K. Potowski, & L. Torres (Eds.), Spanish in Chicago (1st ed., pp. 35–70). Oxford University Press.
  86. PRESEEA. (2014). Corpus del Proyecto para el estudio sociolingüístico del español de España y de América [Dataset]. Universidad de Alcalá. Available online: http://preseea.uah.es/ (accessed on 9 January 2025).
  87. Qualtrics. (2020). Qualtrics XM (Version April, 2023) [Computer software]. Qualtrics. Available online: https://www.qualtrics.com (accessed on 12 February 2025).
  88. Real Academia Española. (2011). La derivación apreciativa. In Nueva gramática de la lengua española manual (pp. 163–172). Espasa.
  89. Real Academia Española. (2014). Diccionario de la lengua española. Espasa. Available online: https://dle.rae.es/ (accessed on 12 February 2025).
  90. Ruiz, E. (2005). Hispanic culture and relational cultural theory. Journal of Creativity in Mental Health, 1(1), 33–55.
  91. Sáenz, F. S. (1999). Conceptual interaction and Spanish diminutives. Cuadernos de Investigación Filológica, 25, 173–190.
  92. Schneider, K. P. (2003). Diminutives in English. De Gruyter.
  93. Schneider, K. P. (2013). The truth about diminutives, and how we can find it: Some theoretical and methodological considerations. SKASE Journal of Theoretical Linguistics, 10(1), 137–151.
  94. Sebba, M. (2012). Multilingualism in written discourse: An approach to the analysis of multilingual texts. International Journal of Bilingualism, 17(1), 97–118.
  95. Sloetjes, H., Wittenburg, P., & Somasundaram, A. (2011, August 27–31). ELAN—Aspects of interoperability and functionality. Interspeech 2011, 12th Annual Conference of the International Speech Communication Association (pp. 3249–3252), Florence, Italy.
  96. Spasovski, L. (2012). Morphology and pragmatics of the diminutive: Evidence from Macedonian [Master’s thesis, Arizona State University].
  97. Stefanich, S., & Amaro, J. C. (2018). Phonological factors of Spanish/English word internal code-switching. In L. López (Ed.), Issues in Hispanic and Lusophone linguistics (Vol. 19, pp. 195–222). John Benjamins Publishing Company.
  98. Stone, D. L., Johnson, R. D., Stone-Romero, E. F., & Hartman, M. (2006). A comparative study of Hispanic-American and Anglo-American cultural values and job choice preferences. Management Research: Journal of the Iberoamerican Academy of Management, 4(1), 7–21.
  99. The Language Archive. (2024). ELAN (Version 6.4) [Computer software]. Max Planck Institute for Psycholinguistics. Available online: https://archive.mpi.nl/tla/elan (accessed on 11 February 2025).
  100. Toribio, A. J. (2017). Structural approaches to code-switching: Research then and now. In R. E. V. Lopes, J. Ornelas De Avelar, & S. M. L. Cyrino (Eds.), Romance languages and linguistic theory (Vol. 12, pp. 213–234). John Benjamins Publishing Company.
  101. Torres Cacoullos, R., & Travis, C. E. (2020). Code-switching and bilinguals’ grammars. In E. Adamou, & Y. Matras (Eds.), The Routledge handbook of language contact (1st ed., pp. 252–275). Routledge.
  102. Turner, G. W. (1973). Stylistics. Penguin Books.
  103. United States Census Bureau. (2022a). ACS demographic and housing estimates. Data.Census.Gov. Available online: https://data.census.gov/table/ACSDP1Y2022.DP05?g=160XX00US4824000 (accessed on 15 April 2024).
  104. United States Census Bureau. (2022b). S1601 Language spoken at home. United States Census Bureau. Available online: https://data.census.gov/table/ACSST1Y2022.S1601?q=language (accessed on 23 February 2024).
  105. Vanhaverbeke, M., Dominguez, A., Ivanova, I., Parafita Couto, M. C., & Enghels, R. (2022). El Paso Bilingual Corpus [Conversational corpus]. Ghent University.
  106. Vanhaverbeke, M., & Enghels, R. (2021). Diminutive constructions in bilingual speech: A case study of Spanish-English code-switching. Belgian Journal of Linguistics, 35, 183–213.
  107. Velázquez, I. (2009). Intergenerational Spanish transmission in El Paso, Texas: Parental perceptions of cost/benefit. Spanish in Context, 6(1), 69–84.
  108. Velázquez, I. (2013). Individual discourse, language ideology and Spanish transmission in El Paso, Texas. Critical Discourse Studies, 10(3), 245–262.
  109. Vilares, D., Alonso, M. A., & Gómez-Rodríguez, C. (2016, May 23–28). EN-ES-CS: An English-Spanish code-switching Twitter corpus for multilingual sentiment analysis. Tenth International Conference on Language Resources and Evaluation (pp. 4149–4153), Reykjavik, Iceland.
  110. Ward, W. (1989, October 15–18). Understanding spontaneous speech. Workshop on Speech and Natural Language—HLT ’89 (pp. 137–141), Cape Cod, MA, USA.
Figure 1. Distribution of participants according to where they were born and raised.
Figure 2. Distribution of participants according to their ages of acquisition.
Figure 3. Distribution of participants according to their self-reported dominant language proficiency.
Table 1. Overview of (digital) datasets representing Spanish–English bilingual speech in the U.S.

| Corpus | Period | Extension | Language Focus | Discourse Setting | Data Types | Access (All URLs Have Been Accessed on 9 January 2025) |
|---|---|---|---|---|---|---|
| Bangor Miami corpus | 2008–2011 | 84 speakers; 35 h; 265,000 words | bilingual | informal conversations | audio; transcripts; metadata | online data repository (Talkbank) http://bangortalk.org.uk |
| Bilinguals in the Midwest Corpus (BILinMID) | 2021–2022 | 82 speakers | bilingual | picture elicited short stories | transcripts; (some) metadata | online website https://ihurta3.shinyapps.io/bilinmid-corpus/ |
| New Mexico Spanish-English bilingual corpus (NMSEB) | 2010–2011 | 40 speakers; 29 h; 300,000 words | bilingual | sociolinguistic interviews | audio; transcripts; metadata | request PI https://nmcode-switching.la.psu.edu/bilingual-corpus/ |
| Chicago Spanish corpus (CHISPA) | 2006–2010 | 124 speakers, approx. 1 h per recording | Spanish | sociolinguistic interviews | audio; transcripts; metadata | request PI |
| Corpus del Español en los Estados Unidos (CORPEEU) | 1960–now | no info available | Spanish | written + interviews + public interactions | transcripts | online website https://corpeeu.org/ |
| Corpus of Spanish in Southern Arizona (CESA) | 2012–2020 | 78 speakers, approx. 1 h per recording | Spanish | sociolinguistic interviews | transcripts; audio; metadata | request PI https://cesa.arizona.edu/ |
| Corpus Bilingüe del Valle (CoBiVa) | 2017–now | 69 speakers, approx. 1 h per recording | unilingual interviews | sociolinguistic interviews | audio; transcripts; metadata | online website https://www.utrgv.edu/cobiva/index.htm |
| Spanish in Texas Corpus | 2011–2013 | 97 speakers; 500,000 words, approx. 53 h | Spanish | sociolinguistic interviews & conversations | transcripts; video & audio; metadata | online data repository https://corpus.spanishintexas.org/ |
Table 2. Comparison of intensification strategies across discourse genres in Havana, Cuba ¹.

| Strategy | PRESEEA (Havana): Sociolinguistic Interviews | AMERESCO (Havana): Free Conversational Data |
|---|---|---|
| muy | ++++ | +++ |
| super- | +/- | ++ |
| -ísimo | + | ++ |
| -ón | +/- | ++ |
| -azo | +/- | + |
| -ote | +/- | + |
| -udo | - | + |

¹ For reasons of clarity and to emphasize relative frequency over absolute numbers, exact counts are not provided in this table. The symbols indicate how frequently each strategy appears relative to others: ++++ = overwhelmingly dominant, with minimal presence of alternatives; +++ = clearly dominant, though other options are occasionally observed; ++ = moderately frequent, but not dominant; + = marginal presence; +/- = isolated occurrences (one or two instances only); - = not attested in the data.
Table 3. Distribution of conversations based on their word frequencies in Spanish and English.

| Dominant Language in the Conversations | # | % |
|---|---|---|
| Spanish | 20 | 47.6 |
| English | 12 | 28.6 |
| both | 10 | 23.8 |
| Total | 42 | 100.0 |
Table 4. Types of social relationships between the participants of the El Paso Bilingual Corpus.

| Social Relationship | # | % |
|---|---|---|
| friends | 30 | 71.4 |
| siblings | 4 | 9.5 |
| couple | 4 | 9.5 |
| parent–child | 2 | 4.8 |
| colleagues | 2 | 4.8 |
| Total | 42 | 100.0 |
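Table 5. Distribution of participants’ age and gender.

| Age | Female n | Female % | Male n | Male % | Other n | Other % | Total n | Total % |
|---|---|---|---|---|---|---|---|---|
| GEN2 (18–25) | 40 | 48.8 | 28 | 34.1 | 2 | 2.4 | 70 | 85.4 |
| GEN3 (26–45) | 7 | 8.5 | 3 | 3.7 | - | - | 10 | 12.2 |
| GEN4 (46–60) | 2 | 2.4 | - | - | - | - | 2 | 2.4 |
| Total | 49 | 59.8 | 31 | 37.8 | 2 | 2.4 | 82 | 100.0 |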
Table 6. Absolute and normalized frequencies of diminutive expressions in the El Paso Bilingual Corpus and Bangor Miami Corpus.

| Corpus | n | Fn/10,000 |
|---|---|---|
| El Paso Bilingual Corpus | 995 | 37.9 |
| Bangor Miami Corpus | 891 | 33.8 |
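The normalization behind Table 6 can be illustrated with a minimal Python sketch. It assumes that Fn/10,000 means occurrences per 10,000 words of the corpus. The function names `per_10k` and `implied_total_words` are hypothetical. Because the word totals used as denominators are not given in the table, the sketch only back-calculates the totals implied by the reported, rounded rates. These implied totals are approximations and not figures reported in the paper.

```python
def per_10k(count: int, total_words: float) -> float:
    """Normalized frequency: occurrences per 10,000 words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return count / total_words * 10_000


def implied_total_words(count: int, rate_per_10k: float) -> float:
    """Invert the normalization to estimate the word total behind a reported rate."""
    if rate_per_10k <= 0:
        raise ValueError("rate_per_10k must be positive")
    return count / rate_per_10k * 10_000


# Counts and rates as reported in Table 6.
reported = {
    "El Paso Bilingual Corpus": (995, 37.9),
    "Bangor Miami Corpus": (891, 33.8),
}

for corpus, (n, rate) in reported.items():
    words = implied_total_words(n, rate)  # approximate, since the rate is rounded
    print(f"{corpus}: n={n}, rate={rate}/10,000, implied size ~{words:,.0f} words")
```

Normalizing by corpus size is what makes the two raw counts comparable: a larger corpus would yield more diminutives even at an identical rate of use.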
Table 7. Absolute and relative frequencies of diminutive formation strategies used in the El Paso Bilingual Corpus and Bangor Miami Corpus.

| Diminutive Strategy | El Paso Bilingual Corpus n | El Paso Bilingual Corpus % | Bangor Miami Corpus n | Bangor Miami Corpus % |
|---|---|---|---|---|
| synthetic | 754 | 75.8 | 563 | 63.2 |
| analytic | 241 | 24.2 | 328 | 36.8 |
| Total | 995 | 100.0 | 891 | 100.0 |
Table 8. Standardized residuals of the chi-squared test: formation strategy × community.

| Diminutive Strategy | El Paso Bilingual Corpus (r) | Bangor Miami Corpus (r) |
|---|---|---|
| synthetic | 2.25 | −2.37 |
| analytic | −3.42 | 3.61 |
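As an illustration, the residuals in Table 8 can be recomputed from the counts in Table 7. The sketch below assumes the Pearson form of the standardized residual, (observed − expected) / √expected, with expected counts derived from the row and column totals. The source does not state which formula was used, but this form reproduces the reported values to two decimals. The function name `pearson_residuals` and the dictionary layout are hypothetical.

```python
from math import sqrt

# Observed counts from Table 7 (formation strategy x community).
observed = {
    "synthetic": {"El Paso": 754, "Miami": 563},
    "analytic": {"El Paso": 241, "Miami": 328},
}


def pearson_residuals(table):
    """Return (observed - expected) / sqrt(expected) for every cell."""
    row_totals = {row: sum(cols.values()) for row, cols in table.items()}
    col_totals = {}
    for cols in table.values():
        for col, n in cols.items():
            col_totals[col] = col_totals.get(col, 0) + n
    grand_total = sum(row_totals.values())

    residuals = {}
    for row, cols in table.items():
        residuals[row] = {}
        for col, obs in cols.items():
            # Expected count under independence of strategy and community.
            expected = row_totals[row] * col_totals[col] / grand_total
            residuals[row][col] = (obs - expected) / sqrt(expected)
    return residuals


residuals = pearson_residuals(observed)
for row, cols in residuals.items():
    for col, r in cols.items():
        flag = "noteworthy" if abs(r) > 2 else "within expectation"
        print(f"{row:9s} {col:8s} r = {r:+.2f} ({flag})")
```

All four cells exceed the threshold of 2 in absolute value, in line with the reading that El Paso speakers use synthetic diminutives more than expected and analytic diminutives less, with the reverse pattern in Miami.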
Table 9. Diminutive formation strategies and language of the diminutive marker in the El Paso Bilingual Corpus and Bangor Miami Corpus.

| Diminutive Strategy | Diminutive Language | El Paso Bilingual Corpus n | El Paso Bilingual Corpus % | Bangor Miami Corpus n | Bangor Miami Corpus % |
|---|---|---|---|---|---|
| synthetic | | 754 | 75.8 | 563 | 63.2 |
| | Spanish | 745 | 98.8 | 531 | 94.3 |
| | English | 9 | 1.2 | 32 | 5.7 |
| analytic | | 241 | 24.2 | 328 | 36.8 |
| | Spanish | 89 | 36.9 | 50 | 15.2 |
| | English | 152 | 63.1 | 278 | 84.8 |
| Total | | 995 | 100.0 | 891 | 100.0 |
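Table 9 mixes two percentage bases. The strategy rows are shares of each corpus total, while the language rows are shares of their own strategy subtotal. The following sketch makes that explicit, using the counts from the table. The nested dictionary and the helper `shares` are hypothetical names introduced only for illustration.

```python
# Counts from Table 9: corpus -> strategy -> language of the diminutive marker.
counts = {
    "El Paso": {
        "synthetic": {"Spanish": 745, "English": 9},
        "analytic": {"Spanish": 89, "English": 152},
    },
    "Miami": {
        "synthetic": {"Spanish": 531, "English": 32},
        "analytic": {"Spanish": 50, "English": 278},
    },
}


def shares(counts_by_label):
    """Percentage of each label relative to the sum of all labels, one decimal."""
    total = sum(counts_by_label.values())
    return {label: round(100 * n / total, 1) for label, n in counts_by_label.items()}


for corpus, strategies in counts.items():
    # Strategy rows: each strategy subtotal as a share of the corpus total.
    subtotals = {name: sum(langs.values()) for name, langs in strategies.items()}
    print(corpus, "strategy shares:", shares(subtotals))
    # Language rows: each language as a share of its strategy subtotal.
    for name, langs in strategies.items():
        print(f"  {name} by marker language:", shares(langs))
```

Reading the language rows against the strategy subtotal is what supports statements such as "almost all synthetic diminutives carry a Spanish marker" in both communities.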
Table 10. Synthetic markers attested in the El Paso Bilingual Corpus and Bangor Miami Corpus.

| Synthetic Markers | Marker | El Paso Bilingual Corpus n | El Paso Bilingual Corpus % | Bangor Miami Corpus n | Bangor Miami Corpus % |
|---|---|---|---|---|---|
| Suffix | | 753 | 99.9 | 562 | 99.8 |
| | -ito | 704 | 93.4 | 485 | 86.1 |
| | -ico | - | - | 44 | 7.8 |
| | -illo | 38 | 5.0 | 2 | 0.4 |
| | -y | 6 | 0.8 | 25 | 4.4 |
| | -ish | 3 | 0.4 | 6 | 1.1 |
| | -ino | 2 | 0.3 | - | - |
| Prefix | | 1 | 0.1 | 1 | 0.2 |
| | mini- | 1 | 0.1 | 1 | 0.2 |
| Total | | 754 | 100.0 | 563 | 100.0 |
Table 11. Analytic markers attested in the El Paso Bilingual Corpus and Bangor Miami Corpus.

| Analytic Markers | Marker | El Paso Bilingual Corpus n | El Paso Bilingual Corpus % | Bangor Miami Corpus n | Bangor Miami Corpus % |
|---|---|---|---|---|---|
| Adjective | | 145 | 60.2 | 212 | 64.6 |
| | little | 104 | 71.7 | 184 | 86.8 |
| | chico | 30 | 20.7 | 11 | 5.2 |
| | tiny | 4 | 2.8 | 3 | 1.4 |
| | pequeño | 4 | 2.8 | 2 | 0.9 |
| | small | 3 | 2.1 | 12 | 5.7 |
| Phrasal | | 96 | 39.8 | 116 | 35.4 |
| | un poco | 55 | 57.3 | 36 | 31.0 |
| | a bit | 23 | 24.0 | 46 | 39.7 |
| | a little | 18 | 18.8 | 33 | 28.4 |
| | un chin | - | - | 1 | 0.9 |
| Total | | 241 | 100.0 | 328 | 100.0 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
