4. The Stra-ParlaTO Corpus
The data for this study come from the Stra-ParlaTO module of the KIParla corpus (Mauri et al., 2019). The KIParla corpus is currently the largest existing corpus of spoken Italian accessible online. The corpus has a modular structure and is currently made up of seven separate subcorpora that, despite their internal differences, share the same general structure and participant metadata system. Such a modular structure allows for the continuous expansion of the corpus over time while ensuring that it remains searchable as a whole. Up to 2024, the data included in the KIParla corpus had been collected exclusively among L1 speakers of Italian, thus excluding people who have learnt and use Italian as a second language and, more generally, people with a multilingual repertoire that goes beyond Romance dialects or historical minority languages. This gap was filled by the collection and publication of two new modules between 2024 and 2026, namely, the Stra-ParlaTO and Stra-ParlaBO modules, which comprise speech by either L2 Italian speakers with an international migratory background or people born in Italy but with a multilingual repertoire that includes a heritage/home language. As far as Stra-ParlaTO is concerned, the data were collected in Turin (Italy) among multilingual speakers with Moroccan Arabic, Peruvian Spanish, or Romanian/Moldovan as part of their linguistic repertoire. Speakers’ metadata for this module include gender, age range, place of birth, age upon arrival in Italy, number of years spent in Italy, L1, educational qualification, and occupation. All speakers were asked to read and sign an informed consent form that fully complies with current European data protection regulations. The recordings were transcribed with ELAN (Wittenburg et al., 2006) following the same transcription conventions already used for the other corpus modules. These guidelines are based on a simplified version of the Jefferson (2004) transcription system for Conversation Analysis and are accessible on the corpus website.
4.1. Data Collection: A Community-Based Approach
Data collection presented significant methodological challenges that prompted a shift away from conventional approaches (Labov, 1982; Tagliamonte, 2006) towards a Community-Based Language Research (CBLR) framework (Czaykowska-Higgins, 2009; Rice, 2010; Stenzel, 2014; Bischoff & Jany, 2018). Under the project’s original corpus design, data collection required two interaction types: semi-structured interviews conducted mostly in Italian, with conversational asymmetry between interviewer and interviewee, and spontaneously occurring free conversations recorded through non-participant observation, with no constraints imposed by the researcher and no conversational asymmetry among participants. Although this methodology had proved successful for the other KIParla modules, several practical obstacles made it difficult to implement for the Stra-ParlaTO subcorpus.
First, interview participants tended to be selected based on their Italian proficiency, as individuals with lower competence would not have been able to give an interview in Italian. Language choices were further constrained by the interaction format and the interviewer’s own repertoire: when the interviewer shares the interviewee’s full linguistic repertoire, languages other than Italian may surface in the interaction; but when Italian is the only language in common, the interviewee is effectively confined to monolingual Italian, regardless of their everyday multilingual practice.
For free conversations, the most naturally occurring interactional contexts that we could access were predominantly monolingual in the L1 or home language. Furthermore, participants were rarely available for recording in private settings, and the spontaneous format only proved workable with a specific subset of informants, namely, second-generation university students already involved in the project.
These tensions between a fixed research design (with predetermined data constraints, goals, and timelines) and the fluid, context-dependent nature of community engagement led us to adopt a participatory, community-based approach, in which, as Czaykowska-Higgins (2009, p. 17) argues, knowledge is constructed not merely for researchers/linguists but “for, with, and by community members”. This framework foregrounds several key principles, including collaborative research design, shared ownership of outcomes, training of community members, practical long-term impact, equitable distribution of resources, and genuine consideration of the needs of all parties involved (Czaykowska-Higgins, 2009).
We were able to adopt this methodology for the Peruvian and Romanian/Moldovan cases. In the former case (the latter will not be addressed here), our partner was Paradero NOMiS (Nuove Opportunità per Minori Stranieri), a project working with young people aged 14–23 and their families from Latin America and providing them with practical support to facilitate integration upon arrival. The collaboration with Paradero NOMiS developed through three consecutive activities. The first, Relatar, was a task-based, collaborative workshop designed to produce a mock-up tourist guide of Latin American Turin. Jointly designed with the Paradero team—who had expressed the need for a summer activity that would also serve as Italian language practice—the workshop consisted of a series of meetings where participants had to discuss the format of the final product, the themes to be addressed, and the type of content to be included. Participants were trained in interviewing and recording techniques, which allowed them to collect photos, videos, and oral testimonies of local Latin American people and places in the city. Then, they collaboratively selected materials, wrote captions and texts, and made design choices. The workshop yielded 5 hours of interviews and 2 hours of conversation. The second activity, World Anthropology Day (February 2025), grew directly out of the Relatar collaboration. Four young participants from Paradero planned and led a guided walking tour of the Borgo San Paolo neighbourhood. During four preparatory weekly meetings, tour routes were discussed, contacts established, and the tour rehearsed. A further 5 hours of conversations were collected across these sessions.
Reflecting on the outcomes of this approach, the CBLR framework yielded several advantages over more traditional methods: it enabled a successful collaboration with the communities that benefited both parties, produced richer and more ecologically valid data by incorporating ethnographic knowledge and emic perspectives, and opened access to a broader range of speaker profiles from within the same community. At the same time, the approach unavoidably entailed some trade-offs: it required significantly greater investment of time and effort at the design and data-collection stages; introduced longer and less predictable timelines; and demanded a series of methodological compromises. Most notably, first, the majority of participants in both projects were under 16, whereas speakers in the existing corpus modules are typically aged 16 or above due to privacy management reasons, and second, the task-based nature of the interactions differed from the free conversations envisioned in the original corpus design, which typically involve unconstrained topics. To overcome the former problem, we adapted the informed consent form for participants under 16 and involved parents in both the explanation of the research and the signing process. To address the latter, we adapted the interaction classification labels accordingly: these recordings are now classified in the corpus as ‘free conversations’ but specified as ‘task-oriented’ rather than having a ‘free topic’.
4.2. Overall Characteristics of the Corpus
The Stra-ParlaTO subcorpus currently amounts to 48:28:26 h of speech (approximately 48.5 h). The module includes 51 semi-structured sociolinguistic interviews (27:37:36 h) and 31 free conversations (20:50:50 h). The distribution of recordings across speakers, based on their linguistic repertoire, is shown in Figure 3.
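The overall duration can be cross-checked against the two partial durations reported above. The following is a minimal sketch of that arithmetic; the helper names `parse_hms` and `format_hms` are illustrative and are not part of any corpus tool.

```python
from datetime import timedelta


def parse_hms(text: str) -> timedelta:
    """Parse an 'H:MM:SS' string, allowing hour values above 23."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_hms(duration: timedelta) -> str:
    """Format a duration as 'H:MM:SS' without rolling hours into days."""
    total_seconds = int(duration.total_seconds())
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Durations as reported in the text.
interviews = parse_hms("27:37:36")
conversations = parse_hms("20:50:50")

total = interviews + conversations
print(format_hms(total))  # 48:28:26
```

The sum of the interview and conversation durations matches the reported total for the subcorpus.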
The data collection involved 138 participants, 53 of whom were born in Italy and 85 in another country. Most participants are between 21 and 35 years old (59), 33 are aged 36–50, 18 are older than 50, 25 are between 16 and 20, and only three are between 11 and 15. At the time of recording, 61 of them held a degree or were university students, 60 held a high-school diploma, 13 held a middle-school diploma, and four held a PhD. Most participants declared that they have more than one L1, whereas only 40 stated that Italian is their only L1. In this respect, the participants were categorised based on what they identified as their dominant language, whereas information concerning other languages within individual repertoires was collected from interviews and conversations and was not included in the corpus metadata. Regarding the number of years already spent in Italy, of the 85 speakers born abroad, 24 had been in Italy for 0–5 years, 17 for 6–15 years, 37 for 16–30 years, and 4 for more than 30 years. One speaker had a discontinuous stay during their childhood and was therefore labelled as a separate case.
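As a quick consistency check, the age, education, and birthplace breakdowns can each be summed and compared with the number of participants. This is a minimal sketch using only the figures reported above; the dictionary keys and the `sums_to_total` helper are illustrative labels, not corpus metadata fields.

```python
TOTAL_PARTICIPANTS = 138

# Breakdowns as reported in the text (labels are illustrative).
age_bands = {"11-15": 3, "16-20": 25, "21-35": 59, "36-50": 33, "over 50": 18}
education = {
    "degree or university student": 61,
    "high-school diploma": 60,
    "middle-school diploma": 13,
    "PhD": 4,
}
birthplace = {"Italy": 53, "abroad": 85}


def sums_to_total(breakdown: dict[str, int], expected: int) -> bool:
    """Return True if the category counts add up to the expected total."""
    return sum(breakdown.values()) == expected


for name, breakdown in [
    ("age", age_bands),
    ("education", education),
    ("birthplace", birthplace),
]:
    print(name, sums_to_total(breakdown, TOTAL_PARTICIPANTS))
```

All three breakdowns add up to the 138 participants.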
As for the Peruvian Spanish subcorpus, 42 participants were involved (26 females and 16 males), 31 of whom were born in Peru and 10 in Italy; 1 person was born in Venezuela to Peruvian parents. Most participants are between 16 and 20 years old (11) or between 21 and 35 (12), 10 are aged 36–50, and 9 are older than 50 (only 2 are between 61 and 65 years old). The majority of them hold (or are obtaining) a high-school diploma (13 from a Peruvian institution, 14 from an Italian institution), 10 hold a university degree, and only 5 hold a middle-school diploma. Most speakers identified Spanish as their L1, while 10 stated their L1 was Italian; only 3 people reported having both Spanish and Italian as their L1s. As for the indigenous languages of Peru, during both conversations and interviews, a minority of speakers were able to recall isolated words or idioms in Quechua, but none of them claimed to be a fluent speaker of the language, while still acknowledging its cultural importance both in Peru and in Peruvian communities in Italy. Lastly, of the 32 participants born outside Italy, 12 had been in Italy for 1–3 years, 7 for 6–15 years, 6 for 16–25 years, and 6 for 26–35 years; only 1 speaker experienced a discontinuous stay in Italy and was labelled separately.
5. The Multilingual Past-Tense Construction in the Stra-ParlaTO Corpus
Drawing on the data of the Stra-ParlaTO corpus, we now focus on a specific case study to provide evidence of how DCxG can be a useful framework for explaining language-contact phenomena. In particular, we examine how the compound past tense construction is used by Peruvian Spanish speakers of Italian.
Romance languages distinguish between two main types of past tense forms encoding a perfective meaning, namely, a simple form (e.g., French passé simple, Italian passato remoto, Spanish pretérito perfecto simple, Portuguese pretérito perfeito simples) and a compound one made up of an auxiliary in the present tense and a past participle (e.g., French passé composé, Italian passato prossimo, Spanish pretérito perfecto compuesto, Portuguese pretérito perfeito composto). Formal and functional similarities across these constructions are easily found, as they share the same etymology and are used in similar contexts.
Let us look more closely at the two main languages in contact in the community under scrutiny, namely Italian and Peruvian Spanish. Although the two past tenses overlap to a good extent, they cannot be claimed to be identical in form, function, or internal variation. From a formal point of view, the two languages select auxiliaries differently: Italian verbs require either essere ‘be’ (if unaccusative) or avere ‘have’ (if unergative) (4), whereas Spanish always requires haber ‘have’ (5).
| (4) | Italian | | |
| | (a) | sono | and-at-o |
| | | be.1SG.PRS | go-PTCP-M.SG |
| | | ‘I went’ | |
| | (b) | ho | mangi-ato |
| | | have.1SG.PRS | eat-PTCP |
| | | ‘I ate’ | |
| (5) | Spanish | | |
| | (a) | he | i-do |
| | | have.1SG.PRS | go-PTCP |
| | | ‘I went’ | |
| | (b) | he | com-ido |
| | | have.1SG.PRS | eat-PTCP |
| | | ‘I ate’ | |
From a functional point of view, the two tenses might not share all the possible uses. In Peruvian Spanish, the pretérito perfecto compuesto has been described as expressing specific pragmatic, epistemic, and discourse-related values (Jara Yupanqui, 2011; Howe, 2018), which do not straightforwardly correspond to the ordinary distribution of the Italian passato prossimo. Conversely, in Italian, diatopic variation plays an important role in the distribution of simple and compound past tenses in perfective contexts, with the compound form, passato prossimo, predominating in northern varieties and the simple one, passato remoto, remaining more frequent in southern varieties (Lepschy & Lepschy, 1981). However, in spite of these and other potential mismatches, both constructions share the same core meaning of reference to a past, finished event. As shown in other theoretical frameworks (cf. ‘pivot matching’ in Matras, 2009), it is precisely this partial similarity, whereby two forms share sufficient structural and semantic overlap to be perceived as functionally equivalent by bilingual speakers, that makes the passato prossimo and the pretérito perfecto compuesto a plausible site for hybrid constructions, such as (6) and (7).
| (6) | hemos visitato il museo de radio y tecnologia |
| | ‘we visited the museum of radio and technology’ |
| | [Speaker: PST001; Recording: STIS001] |
| (7) | desde il giorno che he arrivado sono diciassette anni |
| | ‘since the day I arrived, it’s been seventeen years’ |
| | [Speaker: PST061; Recording: STIS009] |
In hemos visitato (6), the Spanish auxiliary combines with the Italian past participle, while in he arrivado (7), the Spanish auxiliary combines with a hybrid past participle, which in turn joins the Italian lexical stem of the verb arrivare ‘arrive’ with the Spanish inflectional morpheme -ado.
Such cases challenge a monolingual-based view on contact, as these productions cannot be described as mechanical transfers of a specific element from one language into another. In this respect, Diasystematic Construction Grammar offers a valuable analytical framework for examining these phenomena, as it allows us to account for the interplay of constructions across different (though similar) languages at a more abstract level.
Looking at the Stra-ParlaTO corpus data, compound past tense forms produced within Spanish–Italian multilingual practices amount to 704 occurrences and can be grouped into four main categories (relative frequencies are specified in brackets).
- A.
Forms that are fully language-specific, in which both the auxiliary and the past participle belong to the same language, either Spanish, as in (8), or Italian, as in (9).
| (8) | io l’altra volta que he ido alla questura |
| | ‘I, that time that I went to the central police station’ |
| | [Speaker: PST006; Recording: STIS003] |
| (9) | quando abbiamo iniziato la scuola abbiamo avuto un po’ più di facilità |
| | ‘when we started going to school, it was easier (lit. we have had a little more ease)’ |
| | [Speaker: PST065; Recording: STIS012] |
- B.
Forms that combine the Spanish auxiliary with a past participle that is homophonous in Italian and Spanish (1.28%), such as visto ‘seen’ in (10), in a context where no single matrix language can be identified, and the production mostly resembles congruent lexicalisation (Muysken, 2000).
| (10) | però quando se trovan entre loro un è compromiso sì // he visto che lo parlan he sentito che lo parlan |
| | ‘but when they see each other it’s an agreement, yes, I saw that they speak it, I heard that they speak it’ |
| | [Speaker: PST061; Recording: STIS009] |
- C.
Forms that combine the auxiliary of one language with the past participle of the other (7.95%). Within this group, examples like (13) are also included, where the participial form is “entirely” Spanish, even though the verbal root is identical to the Italian form, and the inflectional suffixes are quasi-homophonous (Italian pensato vs. Spanish pensado).
| (11) | solo he lavorato aquí // eh como le ditto solo weekend |
| | ‘I only worked here, as I said, only during weekends’ |
| | [Speaker: PST004; Recording: STIA001] |
| (12) | io non ho trovato gente del mio paese che parla il quechua // cioè a me mi hubiese piaciuto impararlo |
| | ‘I didn’t find people from my country who speak Quechua, I mean, I would have liked to learn it’ |
| | [Speaker: PST074; Recording: STIS015] |
| (13) | io ho pensado // mh // il fegato io ho pensado prima che le ha dato il caffè |
| | ‘I thought mh the liver, I thought before giving her coffee’ |
| | [Speaker: PST074; Recording: STIS015] |
- D.
Forms that combine the Spanish or Italian auxiliary with a mixed past participle, in which the verbal lexeme comes from one language and the inflectional morpheme from the other (7.39%). In this group we also include cases like (7) above and (14), where the verbal root could be analysed as Spanish but carries the Italian meaning rather than the Spanish one (e.g., emparar meaning ‘learn’, as in Italian imparare, and not ‘protect’; sentir meaning ‘hear’, as in Italian sentire, and not ‘feel/be sorry’).
| (14) | sì sardegna è bello he sentido che questo mare è es- bellissimo |
| | ‘yes in Sardinia it’s beautiful, I heard that this sea is beautiful’ |
| | [Speaker: PST160; Recording: STCS009] |
| (15) | he trovado eh io la // un ristorante piemontese |
| | ‘I found, eh, I the, a Piedmontese restaurant’ |
| | [Speaker: PST008; Recording: STIS004] |
| (16) | non ha estudiato non è che tu puoi mettere quello che hai estudiato a a perù |
| | ‘(he) didn’t study, you cannot put what you studied in Peru’ |
| | [Speaker: PST104; Recording: STCS005] |
| (17) | mi sono avvicinata a scuola senza chiedere niente io a mio figlio // sì perché poi // mh // ho parlado con il suo tutore |
| | ‘I approached the school without asking anything to my son, yes because then, mh, I spoke with his advisor’ |
| | [Speaker: PST161; Recording: STCS009] |
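The relative frequencies given in brackets can be converted back into approximate token counts. The following is a minimal sketch, assuming that each percentage is computed over the 704 occurrences and that the four groups are exhaustive; both points are assumptions, the variable names are illustrative, and the derived figures are estimates rather than values reported for the corpus.

```python
TOTAL = 704  # compound past tense occurrences reported in the text

# Relative frequencies (in %) reported for groups B, C, and D.
shares = {"B": 1.28, "C": 7.95, "D": 7.39}

# Approximate token counts, rounded to the nearest whole occurrence.
counts = {group: round(TOTAL * pct / 100) for group, pct in shares.items()}

# Assumption: the four groups are exhaustive, so the remainder is group A.
remainder = TOTAL - sum(counts.values())
remainder_share = round(remainder / TOTAL * 100, 2)

print(counts)
print(remainder, remainder_share)
```

Under these assumptions, the mixed forms in groups B to D together account for a small minority of the occurrences, with the remainder being fully language-specific forms.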
What these instances have in common is that they occur in utterances that cannot be clearly assigned to one language or the other. Rather, the speakers appear to engage, to varying degrees, in what may be broadly described as a bilingual mode (Grosjean, 2012), in which lexical and grammatical resources from their repertoires are systematically combined. Within such a mode, speakers of Peruvian Spanish in Turin appear to have developed a more general schema encompassing both the Italian passato prossimo and the Spanish pretérito perfecto compuesto, which they use productively in the formation of multilingual past-tense constructions. The selection of the lexical and grammatical material instantiating this construction in individual usage events can plausibly be accounted for within a usage-based perspective, and is likely to reflect both the availability of particular elements at certain points in discourse and their frequency in the input. This aspect, however, will need to be examined more systematically in future quantitative research.
Let us now turn to how the inferential path of ‘interlingual identification’ mentioned above works. We identified three different levels. First, speakers recognise that, at the syntactic level, in both languages the compound past tense is formed by an auxiliary verb (Italian essere/avere and Spanish haber) and a past participle. Then, the more specific, though partial, similarity regarding the auxiliary verb is identified. The two language-specific constructions share the possibility of having avere/haber ‘have’ as an auxiliary; thus, speakers tend to generalise and adopt it as an overarching feature, setting aside the Italian option of selecting essere ‘be’, which indeed was not found in any mixed construction in the corpus. Evidence of such generalisation is provided by examples (7) and (12) above, where Spanish haber is used with an Italian past participle, even though Italian arrivare ‘arrive’ and piacere ‘like’ would require the auxiliary essere.
At the morphophonological level, speakers recognise the similarity in the past participle formation across the two languages. What clearly stands out from a comparison of the two forms is the correspondence between the Spanish voiced /d/ and the Italian voiceless /t/ in the inflectional morpheme (Italian -ato/-ito and Spanish -ado/-ido). As shown in examples (14)–(17), speakers seem to have stored a more abstract pattern where either consonant can be used to form a past participle, regardless of the language from which they select the lexical verb. Both cases where the Spanish inflectional morpheme is attached to an Italian lexical verb (e.g., trovado, parlado) and, conversely, where the Italian inflectional morpheme is attached to a Spanish lexical verb (e.g., estudiato) are attested. After identifying partial similarities, speakers reorganise their mental constructicon by introducing a new overarching diaconstruction, namely, a more abstract pattern that is language-unspecific and displays the characteristics shared by the two idioconstructions. Such reorganised networks (which are just a section of the whole mental constructicon) are visualised in Figure 4. What the figure shows is that, once interlingual identification has occurred, the two language-specific constructions (at the bottom, 〈C_P.Spanish〉 and 〈C_Italian〉) are generalised into a more abstract pattern (on top) that is language-unspecific (thus lacking the language-specificity notation). This diaconstruction is then inserted in the constructional network at a more abstract level and is connected to the specific schemas, i.e., the idioconstructions, through a vertical inheritance link.
A third step of interlingual identification occurs at the lexical level and mainly takes two forms. The first concerns irregular participle forms, such as visto ‘seen’ (from Italian vedere and Spanish ver), that are homophonous and fully coincide in both languages. We argue that, in these cases, the participle form is stored as a separate lexical item, rather than as a pattern that allows for generalisation. The second concerns lexemes that share a similar or identical form but differ in terms of semantic nuances. This is the case of Italian andare ‘go’ and Spanish andar ‘walk’, in which the two verbs have an identical lexical root and- but convey slightly different meanings: andare includes all types of motion regardless of the means of transport or the direction, whereas andar is more specific and refers to ‘moving by walking’. Despite this difference, both verbs share the core meaning of ‘motion’, which is what speakers identify when comparing the two languages in search of similarities. As in the case of the auxiliary illustrated above, it seems that, in such cases, the broader, shared meaning is the one that is most readily generalised and stored within the diaconstruction. This is supported by the attested use of the lexical root and- with the meaning of ‘go’ even in apparently fully Spanish forms, as in (18).
| (18) | quando hemos andado a vedere il mondiale in russia |
| | ‘when we went to see the World Cup in Russia’ |
| | [Speaker: PST061; Recording: STIS009] |
The process of interlingual identification and construction reorganisation at the lexical level is portrayed in Figure 5. On the left, the case of homophonous participle forms is reported, while the network on the right represents how similarities are identified and generalised in lexical verbs with almost identical form and shared core semantics.
The process described so far allows us to assume that the complete past tense diaconstruction (19) that emerged from our data of Italian–Spanish contact features three partially specified slots, namely, the auxiliary verb, the verbal root, and the past participle ending, each with its own constraints or requirements. The meaning associated with this form is the general one of ‘reference to a past event’, given that no other specific semantic or pragmatic nuances seem to have been generalised into this schema.
| (19) | [ HAVE_AUX + [V-]/t-d/o ] | ↔ | ‘reference to a past event’ |
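The form side of the schema in (19) can be approximated as a surface pattern. The following is a rough sketch under stated assumptions: the list of present-tense ‘have’ forms and the regular expression are our own simplifications, the thematic vowels are restricted to the -a-/-i- endings mentioned above, and a string-based match of this kind is over-inclusive and cannot replace manual annotation. The name `matches_schema` is hypothetical.

```python
import re

# Assumption: present-tense forms of Italian avere and Spanish haber.
HAVE_FORMS = {
    "ho", "hai", "ha", "abbiamo", "avete", "hanno",  # Italian avere
    "he", "has", "hemos", "habéis", "han",           # Spanish haber ("ha" is shared)
}

# Assumption: any stem, a thematic vowel (a or i), then t or d, then final -o.
PARTICIPLE = re.compile(r"\w*[ai][td]o")


def matches_schema(auxiliary: str, participle: str) -> bool:
    """Check whether an auxiliary + participle pair fits the sketch of (19)."""
    is_have = auxiliary.lower() in HAVE_FORMS
    is_regular_participle = PARTICIPLE.fullmatch(participle.lower()) is not None
    return is_have and is_regular_participle


print(matches_schema("he", "trovado"))    # True: hybrid form from (15)
print(matches_schema("ha", "estudiato"))  # True: hybrid form from (16)
print(matches_schema("sono", "andato"))   # False: essere is outside the schema
print(matches_schema("he", "visto"))      # False: irregular participle
```

The last case is in line with the analysis of visto as a stored lexical item rather than an instance of the /t-d/ pattern.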
A final, more thorough representation of the constructional network that encompasses both the more general diaconstructions and the specific idioconstructions, following the interlingual identification at the syntactic and morphological levels, is proposed in Figure 6.
In this network, three levels of schematicity of the construction under scrutiny are represented, namely, the more abstract, semi-schematic level of the diaconstruction; the more specific, though still semi-schematic, level of the two idioconstructions; and, at the bottom, the lexically specified level of concrete realisations of the construction, i.e., constructs. All three levels are connected to one another through vertical links, meaning that each construction at a lower level inherits properties from the overarching one. Language-specific constructs, such as Spanish he comido or Italian ho mangiato, are direct realisations of their respective, more general and still language-specific, idioconstructions. In turn, each idioconstruction is linked to the overarching diaconstruction that is the result of the process of interlingual identification and constructicon reorganisation outlined above. Lastly, the diaconstruction itself licenses new constructs in which the empty slots are filled with lexical material from either language, as in he trovado or ho parlado. These latter cases must be regarded as community-specific, in that they reflect the social conventions of the community under scrutiny, as well as the variable availability of Italian and Spanish repertoires for its members, and hence their mobilisation in language production (see Section 3.2). However, it has not yet been demonstrated to what extent the diaconstructions identified here are conventionalised within the community as a whole, or represent individual or context-dependent strategies. For this reason, we remain agnostic as to whether the mentioned constructs are to be considered as parts of an emerging ethnolect (Vietti, 2005) or fused lect (Auer, 1999). We therefore consider it safer at the current state of investigation to leave the 〈C_X〉 notation unspecified in Figure 6, in order to merely state that diaconstructions of this type cannot be assigned categorically to any of the monolingual repertoires involved.
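The three levels of schematicity and their vertical links can be illustrated with a small data structure. This is only a sketch of the network as described in the text, not a reproduction of Figure 6: the node labels and the `inheritance_chain` helper are hypothetical, and each node is assumed to have a single parent.

```python
# Vertical (inheritance) links: each node points to the construction it inherits from.
# Node labels are illustrative, not the notation used in the figures.
PARENT = {
    "idioconstruction:P.Spanish": "diaconstruction:compound-past",
    "idioconstruction:Italian": "diaconstruction:compound-past",
    "he comido": "idioconstruction:P.Spanish",    # language-specific construct
    "ho mangiato": "idioconstruction:Italian",    # language-specific construct
    "he trovado": "diaconstruction:compound-past",  # mixed construct
    "ho parlado": "diaconstruction:compound-past",  # mixed construct
}


def inheritance_chain(node: str) -> list[str]:
    """Follow vertical links from a node up to the most abstract construction."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


print(inheritance_chain("he comido"))
print(inheritance_chain("he trovado"))
```

Language-specific constructs reach the diaconstruction through their idioconstruction, whereas mixed constructs are linked to it directly.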
Having let the data guide us towards a theoretical explanation of what linguistic knowledge lies behind such mixed constructions, we now turn back to the data to verify that the proposed constructional network is indeed able to account for the different types of forms attested in the corpus.
Cases that fall into group A, with fully language-specific realisations such as he ido ‘I went’ (8) or abbiamo avuto ‘we had’ (9), instantiate the idioconstruction and show full inheritance from a language-bound schema, without activating the more abstract diaconstruction. Moving to group C, forms like he lavorato (11) reveal a different pattern: here, the Spanish auxiliary he and the Italian past participle lavorato are combined, suggesting that speakers are drawing directly on the more schematic diaconstruction and filling its slots with material selected independently from each language. This becomes even more evident in cases such as io ho pensado (13), where the participle can be interpreted either as “entirely” Spanish in morphology or as partially language-unspecific, given that the lexical root pens- has the same form in Italian. Group D examples like he trovado (15) and ho parlado (17) provide the strongest confirmation of the proposed model, as they involve hybrid participles in which the lexical root and the inflectional morpheme come from different languages. Such forms can only be explained if we assume that, at the morphophonological level, the participial pattern itself has been abstracted into the diaconstruction, allowing speakers to recombine stems and endings across languages. Finally, the insertion of homophonous irregular participle forms belonging to group B, as in he visto (10), constitutes a slightly different case. Here, we assume that the speaker relies on the even more general diaconstruction [AUX + PP] and instantiates it by retrieving the past participle visto as a non-language-specific unit (see Figure 5) and inserting it in the available PP slot. Overall, these examples show that speakers operate flexibly across different levels of the network, with the diaconstruction(s) serving as an overarching template that licenses both fully language-specific and mixed realisations.
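The mapping between the four groups and the provenance of each slot can be summarised as a decision procedure. The following is a minimal sketch, assuming that each token has already been annotated manually for the language of the auxiliary, the verbal stem, and the participle ending; the `CompoundPast` record, its field names, and the `shared_participle` flag (set by the annotator for homophonous participles in mixed contexts) are hypothetical and do not correspond to existing corpus annotations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompoundPast:
    aux_lang: str                    # "es" or "it"
    stem_lang: str                   # language of the verbal stem
    suffix_lang: str                 # language of the participle ending
    shared_participle: bool = False  # homophonous participle, e.g. visto


def classify(form: CompoundPast) -> str:
    """Assign a compound past form to group A, B, C, or D."""
    if form.shared_participle:
        return "B"  # participle retrieved as a non-language-specific unit
    if form.stem_lang != form.suffix_lang:
        return "D"  # hybrid participle: stem and ending from different languages
    if form.aux_lang == form.stem_lang:
        return "A"  # fully language-specific realisation
    return "C"      # auxiliary of one language, participle of the other


examples = {
    "he ido": CompoundPast("es", "es", "es"),
    "he visto": CompoundPast("es", "es", "es", shared_participle=True),
    "he lavorato": CompoundPast("es", "it", "it"),
    "ho pensado": CompoundPast("it", "es", "es"),
    "he trovado": CompoundPast("es", "it", "es"),
    "ha estudiato": CompoundPast("it", "es", "it"),
}

for text, form in examples.items():
    print(text, classify(form))
```

The sketch only restates the grouping criteria; the annotation of each slot, and in particular the decision on whether a context is mixed, remains a matter of qualitative analysis.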