Non-Word Repetition and Vocabulary in Arabic-Swedish-Speaking 4–7-Year-Olds with and without Developmental Language Disorder

Abstract: The Arabic-speaking community in Sweden is large and diverse, yet linguistic reference data are lacking for Arabic-Swedish-speaking children. This study presents reference data from 99 TD children aged 4;0–7;11 on receptive and expressive vocabulary in the minority and the majority language, as well as for three types of non-word repetition (NWR) tasks. Vocabulary scores were investigated in relation to age, language exposure, and socio-economic status (SES). NWR performance was explored in relation to age, type of task, item properties, language exposure, and vocabulary. Eleven children with DLD were compared to the TD group. Age and language exposure were important predictors of vocabulary scores in both languages, but SES did not affect vocabulary scores in either language. Age and vocabulary size had a positive effect on NWR accuracy, whilst increasing item length and presence of clusters had an adverse effect. There was substantial overlap between the TD and DLD children for both vocabulary and NWR performance. Diagnostic accuracy was at best suggestive for NWR; no task or type of item was better at separating the two groups. Reports from parents and teachers on developmental history, language exposure, and functional language skills emerged as important factors for correctly identifying DLD in bilinguals.


Introduction
This study investigates non-word repetition and vocabulary in a large group of bilingual Arabic-Swedish-speaking children with typical language development, compared to a smaller group of children with a DLD diagnosis. A large proportion of children in Sweden today grow up in a bilingual setting. According to official statistics (2022), 29% of all school children aged 7-15 are entitled to mother tongue instruction, which means that they speak a home language other than (or in addition to) Swedish. During the past decades, the number of Arabic speakers has increased substantially, and Arabic is now considered to be the language with the second-highest number of native speakers in the country, after Swedish (National Agency for Education 2022; Parkvall 2016). Despite the fact that as many as a quarter of all Swedish children are bilingual, there is a lack of large-scale studies that investigate these children's language skills in both languages.
Developmental Language Disorder (DLD) is a common condition in children that negatively affects their oral communication, literacy and educational progress (Norbury et al. 2016).¹ DLD typically emerges in early childhood and manifests as a pronounced deficit in the development of language skills, which cannot be attributed to hearing impairment, intellectual disability, medical syndromes or neurological disorders (Bishop 1997, pp. 21-23; Leonard 2014, p. 3). Uncertainty about what should be considered 'normal' language development in bilingual children can lead to both over- and underdiagnosis of developmental disorders of language and literacy (Dollaghan and Horner 2011; Grimm and Schulz 2014).
Languages 2022, 7, 204
More than two decades ago, an epidemiological study found that bilingual children in Sweden were referred to a Speech and Language Pathologist (SLP) for assessment at a later age than monolinguals, and they were also more likely to be considered to have severe DLD (Salameh et al. 2002). More recently, a very high proportion (82%) of Swedish child healthcare nurses have been found to believe that bilingualism causes language delay, and these nurses were more inclined to simplify screening and delay referrals for bilinguals (Nayeb et al. 2015). In a survey investigating the prevalence of severe DLD in five regions of the national healthcare service in Sweden, bilinguals were heavily overrepresented (51%) and bilingualism was reported to be a confounding factor, making it difficult for SLPs to make clinical judgments about the presence and severity of DLD (SOU 2016). This confusion can largely be attributed to insufficient assessment materials, a lack of reference data and patchy knowledge about developmental trajectories in bilinguals (Letts 2013). Furthermore, the overlap between many of the linguistic features associated with DLD and common patterns in typical L2 acquisition adds to this confusion (Boerma et al. 2017a; Paradis and Crago 2000).
Although recommendations abound that bilinguals with suspected DLD should be assessed in both languages (ASHA 2004; World Health Organization 1992), and although it is frequently argued that DLD must manifest in both languages in bilinguals for a child to qualify for a diagnosis (Kohnert 2010; Salameh et al. 2002; Thordardottir 2015, p. 349), evidence-based recommendations about how to interpret language test scores for bilinguals are rare (Peña et al. 2016).
Typically, language test scores are converted into a standardised score so that the performance of an individual child can be compared against a reference/norm group. If performance is below a certain cut-off, this is interpreted as a language deficit and may lead to a diagnosis of DLD. Different countries have different clinical practices for assessing and identifying DLD, for instance regarding which cut-offs are utilised. As Thordardottir (2015) reports, clinical guidelines for diagnosing DLD in monolinguals in different European countries use cut-offs on standardised language tests that range from a z-score of −2 (identifying the lowest-scoring 2.3%) to a z-score of −1 (identifying the lowest-scoring 15.9%). Two large-scale epidemiological studies investigating the prevalence of DLD in monolingual children have proposed cut-offs of −1.25 (Tomblin et al. 1997) and −1.5 (Norbury et al. 2016) for composite language scores in a language domain or modality as a yardstick for diagnosis.
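The relationship between a z-score cut-off and the proportion of children it identifies can be made concrete with a short sketch. It assumes that standardised scores are normally distributed in the norm group, which is the assumption behind the percentages quoted above; the function name `percent_below` and the labels in the dictionary are illustrative only.

```python
import math


def percent_below(z: float) -> float:
    """Percentage of a standard normal distribution that falls below z."""
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Cut-offs mentioned in the text, as z-scores relative to the norm-group mean.
# The labels are for illustration only.
cutoffs = {
    "strictest guideline": -2.0,
    "Norbury et al. 2016": -1.5,
    "Tomblin et al. 1997": -1.25,
    "most lenient guideline": -1.0,
}

for label, z in cutoffs.items():
    print(f"{label:>24}: z = {z:+.2f} -> lowest-scoring {percent_below(z):.1f}%")
```

Under this assumption, the two epidemiological cut-offs fall between the strictest and the most lenient national guidelines, both in z-score terms and in the share of children they flag.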

Vocabulary
Vocabulary is a cornerstone of general language skills and important for later academic achievement. Vocabulary is a linguistic domain that is maximally influenced by quantitative as well as qualitative aspects of language input. More exposure (child-directed speech) is associated with larger vocabularies and steeper vocabulary growth curves in children (Hart and Risley 1995; Rowe 2012). Qualitative aspects of the input, such as variation in syntax and rich vocabulary in child-directed speech and communication styles that are conducive to verbal interaction between adult and child, show positive effects on children's vocabulary growth (Cartmill et al. 2013; Rowe 2012). For bilingual children, language input is more variable, both concerning the amount of exposure to each language and the contexts and sources of such input (Paradis and Grüter 2014). As children grow older, their receptive and expressive vocabulary grows too (Haman et al. 2017), but in bilingual children this may not happen to the same extent in both languages. While many studies find that bilinguals increase their vocabulary scores in the majority language over time, vocabulary in the minority language may not increase to the same extent, or may even stagnate (Cobo-Lewis et al. 2002a, 2002b; Gagarina et al. 2014; Ganuza and Hedman 2019; Gathercole and Thomas 2009; Lindgren and Bohnacker 2020; Öztekin 2019). The influence of socio-economic status (SES) on vocabulary scores is also frequently reported in the literature. At group level, bilingual children from families with high SES have been found to perform better on vocabulary tests in the majority language than children from families with low SES (Buac et al. 2014; Calvo and Bialystok 2014; Cobo-Lewis et al. 2002a; Gathercole et al. 2016; Leseman 2000; Prevoo et al. 2014). The effect of SES on the minority language is less consistent. While Cobo-Lewis et al. (2002b) found that children from low SES families perform better than children from high SES families on certain vocabulary tasks, other studies have not found an effect of SES in the minority language (Buac et al. 2014; Leseman 2000; Prevoo et al. 2014).
Children with DLD often have deficits in the lexical domain, with a slower rate of vocabulary growth (Rice and Hoffman 2015; Smolander et al. 2021) and smaller vocabularies than their typically developing peers. Such deficits in the lexical domain have been described for both monolingual and bilingual children with DLD (Boerma et al. 2017b; Khoury Aouad Saliby et al. 2017b; Spaulding et al. 2013; Thordardottir and Brandeker 2013).
At the same time, bilingual children with typical language development may have smaller vocabularies compared to monolinguals in one of their languages or both, depending on the relative amount of exposure to each language (Thordardottir 2011). Relative amount of exposure has been identified in several studies as a key predictor of majority and minority language vocabulary size (Prevoo et al. 2014; Unsworth 2016). Furthermore, the timing of the onset of bilingualism has also been investigated in relation to vocabulary development. Studies in this area generally find that while a binary categorisation of age of onset as early vs. late in itself is not a significant predictor of vocabulary scores later in life (Thordardottir 2011; Unsworth 2016), length of exposure (treated as a continuous variable) affects vocabulary size, with longer exposure times being associated with higher vocabulary scores. The association between length of exposure and vocabulary scores is modulated by the relative amount of exposure from age of onset to age at assessment, often referred to as cumulative exposure (Smolander et al. 2021; Thordardottir 2019).
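The difference between length of exposure and cumulative exposure can be illustrated with a small sketch. It assumes one simple operationalisation, namely months of exposure weighted by the share of daily input in the language during each period; this is an assumption made for illustration and not necessarily the measure used in the studies cited. The class `ExposurePeriod`, both functions, and the two example children are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ExposurePeriod:
    months: int      # length of the period in months
    share_pct: int   # share of daily input in the language, in percent (0-100)


def length_of_exposure(periods: list[ExposurePeriod]) -> int:
    """Months since onset, ignoring how much input the child received."""
    return sum(p.months for p in periods if p.share_pct > 0)


def cumulative_exposure(periods: list[ExposurePeriod]) -> float:
    """Exposure-weighted months: each period counts in proportion to its share."""
    return sum(p.months * p.share_pct for p in periods) / 100


# Two hypothetical children, both with 36 months of exposure to Swedish.
child_a = [ExposurePeriod(months=12, share_pct=30), ExposurePeriod(months=24, share_pct=50)]
child_b = [ExposurePeriod(months=36, share_pct=20)]

for name, periods in (("A", child_a), ("B", child_b)):
    print(name, length_of_exposure(periods), cumulative_exposure(periods))
```

Both hypothetical children have the same length of exposure, but child A has accumulated roughly twice as much weighted input, which is the kind of difference a cumulative measure is meant to capture.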
In sum, vocabulary is affected by both bilingualism (due to variability in language exposure) and DLD. If a bilingual child scores low on vocabulary tasks in one language or both, it may be difficult to determine whether this is due to little exposure or due to DLD. Since vocabulary is probably the linguistic domain that is the most input-dependent, differences in exposure are likely to be reflected in unevenly sized vocabularies in each language. Moreover, it is frequently reported that typically developing bilinguals who are only assessed in the majority language perform significantly below the monolingual norm on standardised language tests targeting vocabulary (Boerma et al. 2017b; Peña et al. 2016) as well as general language skills (Andersson et al. 2019). By contrast, non-word repetition (NWR), which is discussed in the next section, is a task that has been said to be suitable for children of diverse cultural and linguistic backgrounds, as it is less biased than standardised language tests (Dollaghan and Campbell 1998; Thordardottir and Brandeker 2013).

NWR as a Diagnostic Tool for Identifying DLD
Non-word repetition is a task that entails imitating a sequence of phonological nonsense forms (non-words). Poor NWR performance has been known for over three decades to be a clinical marker of DLD in monolinguals in many different languages (Chiat 2015). For bilinguals as well, NWR has been described as a promising diagnostic tool. A number of studies have reported that poor NWR performance in bilingual children is an indicator of DLD (Boerma et al. 2015; de Almeida et al. 2017; Hamann and Abed Ibrahim 2017). Other work, however, has raised doubts as to whether NWR can reliably be used clinically for identifying DLD in bilingual children (Gutiérrez-Clellen and Simon-Cereijido 2010; Kohnert et al. 2006; Ortiz 2021).
Compared to other language measures, NWR is relatively little affected by language exposure, as it does not depend directly on language knowledge but rather on the processing of new language information (Archibald 2008). However, NWR performance is affected by a number of factors related to the characteristics of the non-words as well as by participant-related factors (for an overview, see Chiat 2015). For instance, item length (operationalised as number of syllables) and phonological complexity have been reported to affect repetition accuracy, where items generally become more difficult to repeat as length and complexity increase (Boerma et al. 2015; dos Santos and Ferré 2018; Ellis Weismer et al. 2000; Jones et al. 2010; Radeborg et al. 2006; Thordardottir and Brandeker 2013). Although not as well-studied, phonotactic probability, word-likeness, and prosodic features are also reported to affect NWR performance. NWR items with lower phonotactic probability, items carrying prosodic features with lower saliency, and items with a lower degree of word-likeness are typically more difficult to repeat (Chiat and Roy 2007; Gathercole 1995; Jones et al. 2010; Sahlén et al. 1999). Participant-related factors that influence NWR performance are, for instance, chronological age and lexical knowledge. NWR performance typically increases with age. The association between NWR performance and vocabulary size has been well known for several decades (Coady and Evans 2008). The relationship is likely to be bidirectional, meaning that better NWR capabilities facilitate vocabulary learning and that having a larger vocabulary facilitates NWR performance (Gathercole 2006). Several studies with bilingual participants also report an association between language exposure and performance on NWR tasks, especially when the items have language-specific features (Gibson et al. 2015; Kohnert et al. 2006; Sorenson Duncan and Paradis 2016; Thordardottir and Brandeker 2013). In light of this, some researchers have suggested that language-specific NWR tasks are unsuitable to use with bilinguals, and that NWR tasks that are constructed to be compatible with the phonological structure of many different languages may be better suited to identify DLD in bilinguals (Boerma et al. 2015).
Keeping in mind the influence of language exposure and vocabulary on repetition accuracy in language-specific tasks, a framework for constructing NWR tasks with different properties was developed within the COST Action IS0804 research network (Chiat 2015). In this framework, two main types of tasks are contrasted against the kind of language-specific tasks (LS) that have traditionally been used in NWR assessment (Gathercole et al. 1994; Radeborg et al. 2006). The first type is the so-called 'crosslinguistic' task (CL) with 2-5-syllable items and simple syllable structure (i.e., consonant-vowel syllables with no clusters or coda) that are constructed to be compatible with the phonological structure of many different languages (Chiat 2015; Boerma et al. 2015).² The second type is the so-called 'quasi-universal' task (QU), which has items with 1-3 syllables of varying syllabic complexity (clusters and codas), and probes phonological complexity (dos Santos and Ferré 2018).³ In the present study, all three types of NWR tasks (LS, CL and QU) are used.
Information about early language development, risk factors of developmental disorders of language and literacy, as well as parental reports about functional language abilities may be useful in addition to standardised language tests when diagnosing DLD, particularly in bilinguals (Thordardottir 2015; Tuller 2015). A late emergence of the first word or the first multi-word utterance is associated with a greater risk of developing persistent language disorder later in life (Paradis et al. 2010; Trauner et al. 2000). At the same time, bilinguals with typical language development are expected to reach these early milestones at the same time as monolinguals, although they might not appear at the same time in both languages (Hoff et al. 2014). A family history of speech, language or literacy difficulties has been identified as a risk factor for DLD in both monolinguals and bilinguals (Kalnak et al. 2012; Restrepo 1998). In addition to parental reports, useful information about the child's language and communication can also be obtained from teachers and preschool staff. Teachers see the child every day, know about their learning outcomes, and observe them in interaction with peers and adults. Thus, teacher evaluations provide ecologically valid reports of children's functional language skills. Teacher descriptions of children's language abilities have been found to correlate with results on standardised language tests, and may also reveal language difficulties that are not always straightforwardly captured by standardised language tests (Botting et al. 1997; Purse and Gardner 2013).

The Present Study
Although there are many Arabic-Swedish-speaking children in Sweden, little is known about their language skills. Published studies are generally limited to certain aspects of morphosyntax and word-association in small groups of children (e.g., Holmström 2015; Salameh et al. 2004; Håkansson et al. 2003). There is still a lack of large-scale studies that investigate both the majority and minority language and that also take into account age and environmental factors such as language exposure and SES. Furthermore, there is hardly any research on the NWR performance of bilingual children in Sweden. The present study aims to address this knowledge gap by presenting reference data for vocabulary in both the minority and majority language and for three types of NWR tasks for a large sample (99 TD children) of Arabic-Swedish-speaking bilinguals aged 4-7. The relative effect of age, language exposure, and SES on vocabulary comprehension and production is investigated for both the majority and the minority language. Additionally, NWR performance is investigated in relation to age, language exposure, vocabulary and properties of the non-word items. Finally, this study explores whether bilingual children with a diagnosis of DLD can be distinguished from children with typical language development, based on their performance on vocabulary and NWR tasks. The following research questions are posed:
RQ1: How do vocabulary comprehension and production develop with age in the two languages of 4-7-year-old Arabic-Swedish-speaking bilinguals without DLD, and how do language exposure and SES influence that development?
RQ2: How do 4-7-year-old Arabic-Swedish-speaking bilinguals without DLD perform on NWR tasks, and how is their performance affected by language exposure, vocabulary size, and properties of the non-words (length, phonological complexity, and language (non-)specificity)?
RQ3: By comparison, how do Arabic-Swedish-speaking bilingual children with a DLD diagnosis perform on vocabulary and NWR tasks? Does one particular type of NWR task identify DLD better in this bilingual group?

Participants
The participants were 110 Arabic-Swedish-speaking children aged 4;0-7;11, 99 with typical language development (the TD sample), and 11 with a diagnosis of DLD (the DLD sample). The two groups are described in the following.
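Ages in this study are given in the years;months notation that is conventional in child language research (e.g., 4;0 means four years and zero months). The following sketch shows how such ages can be converted to months for analysis and back again; the function names are illustrative and not taken from the study.

```python
def age_to_months(age: str) -> int:
    """Convert an age in 'years;months' notation (e.g. '6;2') to months."""
    years_str, months_str = age.split(";")
    years, months = int(years_str), int(months_str)
    if not 0 <= months <= 11:
        raise ValueError(f"months part must be 0-11, got {months}")
    return years * 12 + months


def months_to_age(total_months: int) -> str:
    """Convert an age in months back to 'years;months' notation."""
    return f"{total_months // 12};{total_months % 12}"


# Age range of the participants as stated in the text.
youngest, oldest = age_to_months("4;0"), age_to_months("7;11")
print(youngest, oldest, oldest - youngest)
```

Expressed this way, the stated age range of 4;0-7;11 runs from 48 to 95 months.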

The TD Sample
A total of 116 children were recruited for the TD sample. They were recruited by contacting a large number of (pre)schools, as well as congregations and associations arranging activities for Arabic-speaking children. Some participants were also recruited via personal contacts of Arabic-speaking members of the research team. Of the 116 children, 17 were excluded for various reasons. The reasons for exclusion included speaking an Arabic variety that was too distant from the prepared dialect versions of the vocabulary tasks (2/17), having only rudimentary knowledge in one language (6/17), not being able to complete all tasks (5/17), not having reached their fourth birthday (2/17), or not speaking one of the target languages (1/17). One further child was excluded from the TD sample because she was recruited for the DLD sample 18 months later, by then having received a DLD diagnosis. Only children who could speak both languages were included in the study. The 99 children in the TD sample (see Table 1) attended 53 different (pre)schools in Eastern Central Sweden. According to parental report, the children in the TD sample had no known hearing problems, language disorders or neuropsychiatric disorders at the time of testing. Slightly more than half of the children were born in Sweden (56%), and the rest (42%) had migrated with their families from an Arabic-speaking country (or in one case from a third country). The majority spoke a Levantine variety of Arabic (Syrian: 43%, Palestinian: 26%, Lebanese: 9%), 17% spoke Iraqi Arabic, and 4% spoke Egyptian Arabic. Many children were exposed to more than one Arabic variety, beyond the variety spoken in the home. A handful of children were also exposed to a third language in addition to Arabic and Swedish, which was either English, Kurdish (Sorani) or Neo-Aramaic.
In nearly all families, both parents had Arabic as their L1 (96%).⁴ In a few cases, information was available for only one parent (3%), or missing for both parents (2%). In one family, one parent stated that their L1 was not Arabic (but presumably Kurdish). Virtually all parents were first-generation immigrants, with residence lengths varying from 10 months to 31 years. A few parents had come to Sweden as children, but most had immigrated as adults. Only one parent had been born in Sweden.
All but one child had received regular input in Arabic from birth. One child was reported to have started to hear Arabic shortly after age 1, and for one child, such information was missing. For Arabic then, there was hardly any variation in age of onset. By contrast, age of onset varied considerably for Swedish. A bit less than half (48%) had an age of onset for Swedish that was before age 3;0 (this included 6% with regular input in Swedish from birth). Twenty children had had less than two years (24 months) of exposure to Swedish at the time of testing. Yet these children were immersed in the Swedish language in preschool and could complete all tasks in Swedish. We decided not to exclude children with short residence lengths or late exposure to Swedish a priori. As long as the children could complete the tasks in both languages, they were included in the study.
All children attended institutional childcare, mostly 25-40 h a week. The 4- and 5-year-olds, as well as four 6-year-olds, attended förskola (preschool). All other 6-year-olds attended Swedish-medium förskoleklass (a preparatory year for primary school), and the 7-year-olds were in first grade of primary school. Generally, schooling was in Swedish, but 13 children had attended or were attending a bilingual Arabic-Swedish preschool and two children a bilingual English-Swedish preschool, according to parental report.
The children came from a wide variety of socio-economic backgrounds, both concerning parental occupations and education, where all levels from less than six years of primary education to doctorate degrees were represented (i.e., levels 0-8 on the 9-level ISCED 2011 classification, UNESCO Institute for Statistics 2012).

The DLD Sample
Eighteen children with a DLD diagnosis and their parents were invited by their SLP to participate in the study. Of these, four families turned down participation. Three further children had to be excluded, as one child spoke an Arabic variety that was too distant, another child lived too far away for data collection to be feasible, and one child could not be seen due to the outbreak of the COVID-19 pandemic. The 11 children in the final DLD sample were recruited via SLPs working in both public healthcare and private SLP clinics, as well as SLPs working in preschools and schools. Inclusion criteria were: (1) age 4;0-7;11, (2) being regularly exposed to Swedish and an Arabic variety that matched the TD sample (i.e., Levantine, Iraqi or Egyptian), (3) being able to speak at least some Swedish and Arabic, and (4) having a DLD diagnosis. All children had been assessed and diagnosed by a licensed SLP. All but two children (BiAraLI-07 and BiAraLI-09) had been assessed in both Swedish and Arabic by a bilingual SLP or via an interpreter (BiAraLI-08), and (save for one child, BiAraLI-05) had had extensive contact with an SLP in the clinic or at school, often for several years.⁵ Diagnoses could include mixed comprehension and production difficulties (Swe: generell språkstörning) as well as primarily comprehension difficulties (Swe: impressiv språkstörning) or production difficulties (Swe: expressiv språkstörning), but not exclusively phonological or articulatory difficulties (Swe: fonologisk språkstörning). Exclusion criteria were: (1) having a known biomedical condition associated with language difficulties (e.g., Down syndrome), (2) a diagnosis within the autism spectrum, or (3) intellectual disability.
Although ADHD was not an exclusion criterion for the DLD sample, none of the children had ADHD.
There were fewer girls (4) than boys (7) in the DLD sample, and the age range (5;0-7;3) was narrower compared to the TD sample (4;0-7;11). The mean age of the DLD sample (6;2) was similar to the TD sample (6;1). As can be seen in Table 2, all children were exposed to Arabic from birth, and six of them had received regular exposure to Swedish before age 3. Roughly half of the children (6/11) were reported to have even exposure to both languages, and four of them had slightly more Swedish (60%) than Arabic (40%) in their daily exposure. Only one child was reported to hear mostly Arabic (80%). Six children spoke an Iraqi variety, which differed from the TD sample where only 17% spoke an Iraqi variety. Six children attended preschool, five attended förskoleklass, and one child was in first grade of primary school. All children but two had the diagnosis generell språkstörning (mixed receptive and expressive language disorder). One child had an unspecified diagnosis, but the SLP suspected generell språkstörning, and one child had an expressive language disorder.

The Cross-linguistic lexical task (CLT) is a picture-based vocabulary assessment material (Haman et al. 2015). Each CLT has four subtasks: noun comprehension, verb comprehension, noun production, and verb production. Each subtask consists of 30 items plus two practice items, making the maximum score 60 for each part, comprehension and production. The comprehension part is a picture selection task. The experimenter asks a prompt question (e.g., 'who is pouring?') and the child has to identify the correct response from an array of four pictures. The production part is a picture-naming task, where the child is shown one picture at a time and is asked to answer the prompt question (e.g., 'what is this?') with a word that corresponds to the picture. The CLT was developed specifically for assessing vocabulary in both languages of bilingual children and is currently available in more than 30 different languages (https://multilada.pl/en/projects/clt/, accessed 20 June 2022). For a detailed description of the construction of the CLT, please see Haman et al. (2015). In the current study, the Swedish version (Ringblom et al. 2014) and an Arabic CLT version (Haddad 2017) that was adapted from the Lebanese Arabic version (Khoury Aouad Saliby et al. 2017a) were used. Since only a few of the children in the present study spoke Lebanese Arabic, the existing Lebanese version was adapted to the Arabic varieties most relevant to the Swedish context. For the CLT comprehension tasks, new prompts were constructed for all test items in the respective dialect, so that no child was disadvantaged by being asked about a word in a dialect they were not familiar with. For the CLT production tasks, the Lebanese target words needed to be complemented by other dialect synonyms, particularly Syrian, Palestinian and Iraqi, as well as Modern Standard Arabic (MSA). Four different adaptations were developed for Syrian, Palestinian, Lebanese and Iraqi Arabic (Haddad 2017).⁶
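The CLT score structure described above (four subtasks of 30 scored items, combined into a comprehension and a production score with a maximum of 60 each) can be sketched as follows. The sketch assumes one point per correct item and that practice items are not scored; the function `clt_totals`, the dictionary keys, and the example scores are hypothetical.

```python
ITEMS_PER_SUBTASK = 30  # scored items; the two practice items are assumed unscored

SUBTASKS = (
    "noun_comprehension",
    "verb_comprehension",
    "noun_production",
    "verb_production",
)


def clt_totals(scores: dict[str, int]) -> dict[str, int]:
    """Combine the four subtask scores into comprehension and production totals."""
    for name in SUBTASKS:
        if not 0 <= scores[name] <= ITEMS_PER_SUBTASK:
            raise ValueError(f"{name} must be between 0 and {ITEMS_PER_SUBTASK}")
    return {
        "comprehension": scores["noun_comprehension"] + scores["verb_comprehension"],
        "production": scores["noun_production"] + scores["verb_production"],
    }


# Hypothetical scores for one child in one language.
example = {
    "noun_comprehension": 25,
    "verb_comprehension": 22,
    "noun_production": 18,
    "verb_production": 15,
}
print(clt_totals(example))
```

Because each child is tested with both a Swedish and an Arabic CLT, this computation would be run once per language, giving four totals per child.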

Non-Word Repetition Tasks
In the present study, three NWR tasks were used, all developed for children of preschool and early school age.First, a Swedish language-specific task (LS-Swe), originally developed by Barthelom and Åkesson (1995), was used, for which reference data for 4-6-year-old monolinguals is available (Radeborg et al. 2006).The LS-Swe encompasses 24 test items of 2-5 syllables (6 of each syllable length) that adhere to Swedish phonotactics and contain phonemes that are typical of Swedish; nineteen consonant phonemes (/p, b, t, d, k, g, m, n, duction.The comprehension part is a picture selection task.The experimenter ask prompt question (e.g., 'who is pouring?') and the child has to identify the correct respon from an array of four pictures.The production part is a picture-naming task, where child is shown one picture at a time and is asked to answer the prompt question (e 'what is this?') with a word that corresponds to the picture.The CLT was developed s cifically for assessing vocabulary in both languages of bilingual children and is curren available in more than 30 different languages (https://multilada.pl/en/projects/clt/,( cessed 20 June 2022)).For a detailed description of the construction of the CLT, please Haman et al. (2015).In the current study, the Swedish version (Ringblom et al. 2014) a an Arabic CLT version (Haddad 2017) that was adapted from the Lebanese Arabic versi (Khoury Aouad Saliby et al. 
2017a) were used.Since only a few of the children in present study spoke Lebanese Arabic, the existing Lebanese version was adapted to Arabic varieties most relevant to the Swedish context.For the CLT comprehension tas new prompts were constructed for all test items in the respective dialect, so that no ch was disadvantaged by being asked about a word in a dialect they were not familiar wi For the CLT production tasks, the Lebanese target words needed to be complemented other dialect synonyms, particularly Syrian, Palestinian and Iraqi, as well as Mode Standard Arabic (MSA).Four different adaptations were developed for Syrian, Palest ian, Lebanese and Iraqi Arabic (Haddad 2017). 6

Non-Word Repetition Tasks
In the present study, three NWR tasks were used, all developed for children of p school and early school age.First, a Swedish language-specific task (LS-Swe), origina developed by Barthelom and Åkesson (1995), was used, for which reference data for 4 year-old monolinguals is available (Radeborg et al. 2006).The LS-Swe encompasses 24 t items of 2-5 syllables (6 of each syllable length) that adhere to Swedish phonotactics a contain phonemes that are typical of Swedish; nineteen consonant phonemes (/p, b, t k, ɡ, m, n, ŋ , ɾ, f, v, s, ɕ, ɧ, ʂ, h, j, l/) and fifteen vowel phonemes (/i, ɪ, y, ʏ, e, ɛ, oe, ɑ, a ɔ, u, ʊ, ʉ, ɵ/).The items have syllables with varying phonological complexity: there open and closed syllables with and without consonant clusters (13 items with clusters, 9 items with one cluster, and 2 items with two clusters) in onset and coda.T items are pronounced with stress patterns that are typical of Swedish, i.e., with vary main stress and vowel duration in different syllables, for example /spɵɾɪfɾaˈɡoːl/ an flɛtɛmɪŋɛˈɾoːf/.The LS-Swe items were recorded by a female speaker of Swed speaking a central Swe-dish dialect, which is close to standard Swedish.
Second, a Swedish version of the cross-linguistic NWR task (Chiat 2015) was us (CL-Swe).The task was designed to be compatible with the lexical phonology of ma languages.As such, it contains items of 2-5 syllables (4 of each syllable length), with consonant clusters and no codas (only open syllables).The full range of phonemes cludes eleven consonants (/p, b, t, d, k, ɡ, s, z, m, n, l/) and three vowels (/a, i, u/).For purpose of this study, a Swedish version was created.From a list of 84 candidate items, items were chosen, for example, /lɪmɪka/and/tʊlɪɡasʊmʊ/, excluding items that contain ph nemes that do not exist in Swedish (e.g., /z/), or contain real words or inflections in t language.The CL-Swe items were recorded by the same female speaker who recorded LS-Swe items.All items were pronounced with quasi-neutral prosody (Chiat 2015, p. 13 where all syllables were equally stressed (i.e., they carried equal length and pitch) ap from final-syllable lengthening and pitch drop marking the end of an utterance.
two practice items, making the maximum score 60 for each part, comprehension and production. The comprehension part is a picture selection task. The experimenter asks a prompt question (e.g., 'who is pouring?') and the child has to identify the correct response from an array of four pictures. The production part is a picture-naming task, where the child is shown one picture at a time and is asked to answer the prompt question (e.g., 'what is this?') with a word that corresponds to the picture. The CLT was developed specifically for assessing vocabulary in both languages of bilingual children and is currently available in more than 30 different languages (https://multilada.pl/en/projects/clt/, accessed 20 June 2022). For a detailed description of the construction of the CLT, please see Haman et al. (2015). In the current study, the Swedish version (Ringblom et al. 2014) and an Arabic CLT version (Haddad 2017) that was adapted from the Lebanese Arabic version (Khoury Aouad Saliby et al. 2017a) were used. Since only a few of the children in the present study spoke Lebanese Arabic, the existing Lebanese version was adapted to the Arabic varieties most relevant to the Swedish context. For the CLT comprehension tasks, new prompts were constructed for all test items in the respective dialect, so that no child was disadvantaged by being asked about a word in a dialect they were not familiar with.
For the CLT production tasks, the Lebanese target words needed to be complemented by other dialect synonyms, particularly Syrian, Palestinian and Iraqi, as well as Modern Standard Arabic (MSA). Four different adaptations were developed for Syrian, Palestinian, Lebanese and Iraqi Arabic (Haddad 2017).⁶

Non-Word Repetition Tasks
In the present study, three NWR tasks were used, all developed for children of preschool and early school age. First, a Swedish language-specific task (LS-Swe), originally developed by Barthelom and Åkesson (1995), was used, for which reference data for 4-6-year-old monolinguals are available (Radeborg et al. 2006). The LS-Swe encompasses 24 test items of 2-5 syllables (6 of each syllable length) that adhere to Swedish phonotactics and contain phonemes that are typical of Swedish: nineteen consonant phonemes (/p, b, t, d, k, ɡ, m, n, ŋ, ɾ, f, v, s, ɕ, ɧ, ʂ, h, j, l/) and fifteen vowel phonemes (/i, ɪ, y, ʏ, e, ɛ, œ, ɑ, a, o, ɔ, u, ʊ, ʉ, ɵ/). The items have syllables with varying phonological complexity: there are open and closed syllables with and without consonant clusters (13 items with no clusters, 9 items with one cluster, and 2 items with two clusters) in onset and coda. The items are pronounced with stress patterns that are typical of Swedish, i.e., with varying main stress and vowel duration in different syllables, for example /spɵɾɪfɾaˈɡoːl/ and /flɛtɛmɪŋɛˈɾoːf/. The LS-Swe items were recorded by a female speaker of a central Swedish dialect, which is close to standard Swedish.
Second, a Swedish version of the cross-linguistic NWR task (Chiat 2015) was used (CL-Swe). The task was designed to be compatible with the lexical phonology of many languages. As such, it contains items of 2-5 syllables (4 of each syllable length), with no consonant clusters and no codas (only open syllables). The full range of phonemes includes eleven consonants (/p, b, t, d, k, ɡ, s, z, m, n, l/) and three vowels (/a, i, u/). For the purpose of this study, a Swedish version was created. From a list of 84 candidate items, 16 items were chosen, for example, /lɪmɪka/ and /tʊlɪɡasʊmʊ/, excluding items that contain phonemes that do not exist in Swedish (e.g., /z/), or contain real words or inflections in that language. The CL-Swe items were recorded by the same female speaker who recorded the LS-Swe items. All items were pronounced with quasi-neutral prosody (Chiat 2015, p. 138), where all syllables were equally stressed (i.e., they carried equal length and pitch) apart from final-syllable lengthening and pitch drop marking the end of an utterance.
Finally, the third task was the Non-word repetition task-Lebanese (NWRT-Leb, Abou Melhem et al. 2011), a Lebanese version of the QU task, modelled on the NWR-FRENCH task (dos Santos and Ferré 2018). This task was constructed to investigate how phonological complexity impacts NWR performance. The task contains 30 items of 1-3 syllables (6 items with one syllable, 14 items with two syllables and 10 items with three syllables) with and without consonant clusters (15 items with no clusters, 13 items with one cluster, and 2 items with two clusters) and codas. There were three different types of syllables, all present in Lebanese Arabic (and also in other spoken varieties of Arabic), French and English: CV, CCV or CVC. This task includes only seven phonemes: four consonants (/b, l, k, f/) and three vowels (/a, i, u/), all of which exist in Lebanese (and other varieties of) Arabic, French and English. Each item contained three to seven phonemes, for instance /fablu/ and /bifakub/. The NWRT-Leb items were recorded at the Department of Speech and Language Therapy, St Joseph University, by a female speaker of Lebanese Arabic.
For all three NWR tasks, audio files were created in which the items were played one after the other, with a three-second pause between non-words. The LS-Swe and the CL-Swe were presented with increasing level of difficulty, i.e., starting with the shortest items (those with the lowest number of syllables) and gradually increasing by one syllable at a time. The NWRT-Leb items were presented in randomised order. The audio recordings were incorporated into audio-visual PowerPoint presentations. A list of all items in the NWR tasks is provided in Table A1 in Appendix A.

Parental Questionnaire
The parental questionnaire used in the present study was developed for a large-scale childhood multilingualism research project at Uppsala University, BiLI-TAS (PI: Ute Bohnacker; BiLI-TAS is short for Bilingualism & Language Impairment Turkish/Arabic/Swedish). The questionnaire could be answered in either Arabic or Swedish, whichever the parents preferred. The questionnaire provided information about the social and linguistic background of the children and their parents. The questions targeted (early) language development, family history of speech, language and literacy difficulties, concerns about language development, language exposure, language use in the family, language activities in the home such as book reading and storytelling, as well as parental education, occupation and language skills.
The questionnaire administered to the parents of the children in the DLD sample was identical to the TD questionnaire, but in addition included questions that queried for how long the child had been in contact with an SLP, and who took the initiative for SLP assessment (e.g., parents, preschool staff, or the child healthcare nurse).

Interviews with Parents, Teachers and SLPs of the DLD Children
The questions for the interviews with parents, SLPs and teachers were developed by the BiLI-TAS team, and first used during a clinical study of Turkish-speaking children, as described in Öztekin (2019). The original interview templates were slightly modified in order to suit the current study. The parents were interviewed in connection with Arabic data collection in the home, or by telephone. The questions asked during the interview concerned the same topics as in the questionnaire, but provided more in-depth information, for instance on how the parents viewed their child's language development over time, their attitudes and beliefs regarding language development and bilingualism, and whether they were concerned about their child's language development. The teachers were interviewed during a preschool visit, in connection with data collection in Swedish. Teacher interviews included questions about the child's language skills, their communicative and social behaviour, whether they could follow instructions, and how they behaved during book reading and group activities that promoted linguistic awareness (e.g., rhyming and language games). The SLPs were interviewed by telephone. Interview questions included: how the child had been assessed (in which language(s), and which materials were used), age at referral, language therapy and the child's development over time, the parents' attitudes towards therapy, current diagnosis, and what the SLP considered to be most striking or problematic about the child's language.

Procedure
The study was planned and carried out in accordance with Swedish legislation on research ethics and data protection, and adheres to the university ethical code of conduct (Codex) that came into place halfway through the BiLI-TAS research project. Prior to participation, the parents of all children gave their informed written consent. They could revoke their participation at any time.

Data Collection
Data were collected between September 2017 and March 2019 for the TD sample, and between January and September 2019 for the DLD sample.
The children were assessed with the CLT and NWR tasks as part of a test battery that also included narrative tasks and a fourth NWR task (an Arabic version of the CL task).⁷ Each child was seen on two separate occasions, one in each language, either at (pre)school, in the home, or at a community centre, with each session lasting 30-45 min. The median interval between sessions was 7 days. The order of the languages as well as the order of the tasks were counterbalanced. Tasks were administered by trained native speakers, and the experimenter spoke to the child only in the language of testing in order to be able to assess the children's knowledge of each language separately. The dialectal CLT items and the Arabic variety spoken by the experimenter were matched with the variety spoken by the child. Sessions were video- and audio-recorded, so that all responses could be checked afterwards.
The CLT was administered via coloured picture booklets, following the standard procedure described by Haman et al. (2015). During the session, responses were noted on paper forms, and the experimenter gave only neutral feedback (e.g., aha, mhm, okay) irrespective of whether the child had provided a correct answer or not. After each session, responses were transcribed and scored.
The NWR tasks were administered via audio-visual PowerPoint presentations and presented to the children as an imitation task. The LS-Swe and the CL-Swe tasks feature a parrot and the NWRT-Leb features an alien that the child is instructed to imitate. The task was presented to the child on a smartphone, and the audio was played via noise-cancelling headphones. All tasks were audio-recorded, and the responses were transcribed and scored after the session.
The child was always praised at the end of each task, irrespective of the actual outcome, and rewarded with stickers.

Scoring
All CLT child responses were transcribed and scored. The maximum score for each subtask was 60 points. Every child completed all four subtasks. The total number of responses was 26,400 (= 110 children × 2 languages × 120 test items (i.e., 60 for comprehension + 60 for production)). The scoring was done by native speakers of Swedish and Arabic. As there is no standardised published procedure for scoring the CLTs, scoring was done as follows. One point was awarded for each correct response in the language of testing. For the comprehension tasks, only target picture identification was scored as correct. For the production tasks, a point was awarded if the child produced the target word, for example, Arabic dabdab, zaḥaf or ḥaba ('crawl') on the Arabic CLT, or Swedish krypa ('crawl') on the Swedish CLT, in response to a picture of a baby crawling. Moreover, the following responses were also scored as correct: (i) adult-like synonyms, (ii) words that were more specific than the target word and corresponded to the picture (e.g., Swe. meta 'to angle' instead of the target fiska 'to fish'), and (iii) word forms that were pronounced slightly off-target but were still recognisable as the target lemma. All other types of responses were scored as incorrect. Thus, words not in the target language, words that corresponded to the picture but were less specific than the target word (e.g., Swe. städa 'clean' instead of sopa 'sweep'), paraphrases and circumlocutions, forms belonging to a different word class, and forms that phonologically and/or morphologically strongly deviated from the target word, were scored zero. The scoring of items was carefully checked for consistency. Unclear items were discussed by the authors and Arabic- and Swedish-speaking team members until consensus was reached. Whenever necessary, the audio and video recordings were consulted.
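The production scoring rule described above can be sketched as a simple lookup. This is only an illustration in Python: the response-type labels and function names are hypothetical, since the study describes the categories in prose and does not publish a coding scheme in this form.

```python
# Hypothetical labels for the response categories described in the text.
# One point is given for the target word and the three accepted alternatives.
CORRECT_RESPONSE_TYPES = {
    "target_word",
    "adult_like_synonym",
    "more_specific_word",
    "minor_mispronunciation",
}


def score_production_response(response_type: str) -> int:
    """Return 1 point for an accepted response type, otherwise 0."""
    return 1 if response_type in CORRECT_RESPONSE_TYPES else 0


def total_clt_responses(children: int, languages: int, items_per_language: int) -> int:
    """Number of scored CLT responses across the whole sample."""
    return children * languages * items_per_language


# Made-up example: four responses from one child.
example = ["target_word", "less_specific_word", "adult_like_synonym", "other_language"]
print(sum(score_production_response(r) for r in example))

# Figures from the text: 110 children, 2 languages, 60 + 60 items per language.
print(total_clt_responses(110, 2, 60 + 60))
```

Keeping the accepted categories in one set makes it easy to see that everything not listed (paraphrases, other-language words, less specific words) scores zero.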
All NWR tasks were audio-recorded for later transcription and analysis. The total number of responses was 7636 (2592 for LS-Swe (108 participants × 24 test items), 1744 for CL-Swe (109 participants × 16 test items), and 3300 for NWRT-Leb (110 participants × 30 test items)). The responses were transcribed phonemically by a native speaker of Swedish (LS-Swe and CL-Swe) and Arabic (NWRT-Leb), respectively. As there is no standardised procedure for scoring any of these NWR tasks, scoring was done as follows. The participants received 1 point for each correctly repeated non-word, and 0 points for any response containing an error (the whole item correct vs. incorrect approach). Allowances were made for minor articulation deviances, such as non-adultlike or indistinct pronunciation of /r/ and /s/ (as these phonemes are challenging to articulate and may be difficult to pronounce even for some adults). Any phonological substitution processes that were consistent in the child's speech were disregarded. Errors of voicing (/p/ vs. /b/) and minor vowel deviations (e.g., /œ/ vs. /ø/) were also disregarded. However, major vowel substitutions, such as substituting /a/ for /i/, were not allowed. Finally, any additions of syllables or phonemes before or after the otherwise correctly repeated item were also disregarded (i.e., children were not penalised for hesitation noises). The scoring of items was carefully checked for internal consistency. Moreover, for interrater reliability, an independent researcher transcribed and scored the responses of 15 randomly sampled participants, 12 TD children (12%) and three DLD children (27%), for all three NWR tasks. The interrater agreement rate was 98.0% (1029/1050 items).
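The response counts and the agreement rate reported above follow from simple arithmetic, which the Python sketch below reproduces. The participant and item numbers are taken from the text; the dictionary layout and variable names are assumptions made for illustration.

```python
# Participant and item counts as reported in the text.
nwr_tasks = {
    "LS-Swe": {"participants": 108, "items": 24},
    "CL-Swe": {"participants": 109, "items": 16},
    "NWRT-Leb": {"participants": 110, "items": 30},
}

responses_per_task = {
    name: task["participants"] * task["items"] for name, task in nwr_tasks.items()
}
total_nwr_responses = sum(responses_per_task.values())

# Reliability sample: 15 children, each assumed to have completed all three tasks.
items_per_child = sum(task["items"] for task in nwr_tasks.values())
double_scored_items = 15 * items_per_child

# 1029 of the double-scored items received the same score from both raters.
agreement_rate = 1029 / double_scored_items

print(responses_per_task)
print(total_nwr_responses, double_scored_items, round(agreement_rate * 100, 1))
```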

Questionnaire Data
In the present study, four variables from the questionnaire data were investigated with respect to the performance on the vocabulary and the NWR tasks: chronological age, length of exposure, daily exposure, and SES. In the following, it will be described how they were operationalised.
First, chronological age was the child's age at testing, measured in number of months. Second, age of onset (AoO) for Swedish was the reported age at which the child started to receive regular exposure to Swedish. AoO was transformed into Length of Exposure to Swedish (LoESwe) by subtracting AoO (months) from the child's chronological age (months). As AoO for Arabic was at birth for all but two children,⁸ there was an almost complete overlap between chronological age and LoE for Arabic, and they could not be investigated as separate variables. Third, the child's current daily exposure to each language was estimated by the parents on a scale with seven levels ranging from almost only Arabic (95% Arabic and 5% Swedish) to almost only Swedish (5% Arabic and 95% Swedish). Parents could also note a different distribution. For the purpose of statistical analyses, the variable of daily exposure was split into two separate variables, one for each language. Daily exposure to Arabic (Daily exp Ara) thus indicated the percentage of daily exposure to Arabic, and Daily exposure to Swedish (Daily exp Swe) indicated the percentage of daily exposure to Swedish. Finally, SES was operationalised as parental education. The questionnaire queried the highest level of education of each parent. The responses were coded according to the 9-level ISCED 2011 classification of education (UNESCO Institute for Statistics 2012). Then, the education level was averaged across both parents. For a couple of children, information was available only for one parent (e.g., single-parent households). In such cases, the SES variable was based on the education level of that one parent.
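The operationalisations above can be illustrated with a few small Python functions. This is a sketch under stated assumptions: the function names are hypothetical, the two daily exposure percentages are assumed to sum to 100 (as in the scale endpoints quoted in the text), and `None` stands in for a parent whose education level is unknown.

```python
from typing import Optional, Sequence, Tuple


def length_of_exposure(age_months: int, age_of_onset_months: int) -> int:
    """LoE to Swedish: chronological age minus age of onset, both in months."""
    return age_months - age_of_onset_months


def split_daily_exposure(percent_arabic: float) -> Tuple[float, float]:
    """Split one parental estimate into (Daily exp Ara, Daily exp Swe).

    Assumes the two percentages sum to 100.
    """
    return percent_arabic, 100 - percent_arabic


def ses_score(parent_isced_levels: Sequence[Optional[int]]) -> Optional[float]:
    """Average the ISCED levels of the parents for whom data are available."""
    known = [level for level in parent_isced_levels if level is not None]
    if not known:
        return None  # no education data for either parent
    return sum(known) / len(known)


# Made-up example child: 66 months old, regular Swedish exposure from 18 months.
print(length_of_exposure(66, 18))
print(split_daily_exposure(95))
print(ses_score([6, 3]), ses_score([5, None]))
```

Returning `None` when no parental level is known keeps missing SES data visible, so that it can be handled explicitly at the modelling stage.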

Interview Data
The interview data was arranged thematically in a spreadsheet according to the questions posed to the informant. Next, all responses were systematically searched for descriptions of the child's language abilities, as well as their communicative behaviour. The parents' answers were further searched for information about delayed language development, and the teachers' answers were searched for information on the child's classroom behaviour and peer relations. Finally, the SLPs' answers were searched for descriptions of behaviour and progress in assessment and therapy. For a condensed overview, see Table A2 in Appendix A. More details are provided in Öberg (2020).

Statistical Analyses
All analyses were conducted in R (R Core Team 2021). Questionnaire data was missing for one seven-year-old in the TD sample, so this participant was excluded from all analyses that contained background variables. All correlations were calculated with Pearson's correlation coefficient (Pearson's r). For all statistical analyses, the level of significance was set at p < 0.05 (two-tailed).
For all vocabulary and NWR tasks, age development was investigated by correlating age in months with raw scores. Vocabulary comprehension and production were investigated separately for each language. Due to the different number of items (and thus, different maximum scores) in the three NWR tasks, total scores were converted into proportions before the performance on the three tasks was compared with a one-way ANOVA. Bonferroni correction for multiple comparisons was used in the subsequent post-hoc tests.
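The conversion to proportions and the Bonferroni correction mentioned above can be sketched as follows. The maximum scores are the item counts given for each task; the function names and the example p-values are hypothetical, and the study's own analyses were run in R, not with this code.

```python
from typing import List

# Maximum raw score per task, equal to the number of test items.
MAX_SCORES = {"LS-Swe": 24, "CL-Swe": 16, "NWRT-Leb": 30}


def to_proportion(task: str, raw_score: int) -> float:
    """Convert a raw NWR score into the proportion of items repeated correctly."""
    max_score = MAX_SCORES[task]
    if not 0 <= raw_score <= max_score:
        raise ValueError(f"raw score {raw_score} is outside 0-{max_score} for {task}")
    return raw_score / max_score


def bonferroni(p_values: List[float]) -> List[float]:
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# The same raw score of 12 means different things on different tasks.
for task in MAX_SCORES:
    print(task, to_proportion(task, 12))

# Three pairwise comparisons between three tasks (made-up p-values).
print(bonferroni([0.01, 0.5, 0.002]))
```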
For vocabulary, multivariate linear regression models were fitted, with vocabulary score as the dependent variable. Comprehension and production were analysed separately for each language; thus, there were four separate models. All independent variables were centred before modelling; thus, the intercept indicates the mean of the whole sample. As SES (parental education) data was missing for five children, SES data was imputed using regression imputation in order to avoid excluding data points from the sample.
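Why centring makes the intercept equal to the sample mean can be seen in a minimal one-predictor example. The Python sketch below uses made-up ages and scores and a hand-written least-squares fit; it is not the authors' model, which had several predictors and was fitted in R.

```python
from statistics import mean
from typing import Sequence, Tuple


def fit_simple_ols(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = intercept + slope * x for a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up data: age in months and a vocabulary score for five children.
age = [48, 60, 72, 84, 95]
score = [20, 28, 33, 41, 44]

# Centring subtracts the sample mean from every value of the predictor.
age_centred = [a - mean(age) for a in age]

raw_intercept, raw_slope = fit_simple_ols(age, score)
centred_intercept, centred_slope = fit_simple_ols(age_centred, score)

# The slope is unchanged; the centred intercept is the mean score.
print(raw_intercept, raw_slope)
print(centred_intercept, centred_slope, mean(score))
```

With the uncentred predictor, the intercept is the predicted score at age 0 months, which lies far outside the sampled range; after centring, it is the predicted score for a child of average age.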
Item-related and participant-related effects on the accuracy of repetition of NWR items were investigated with logistic mixed-effects regression models, using the function glmer from the lme4 package (Bates et al. 2015). The dependent variable, accuracy, was a categorical variable, where each data point indicates correct (1p) or incorrect (0p) repetition of an NWR item. All continuous variables were standardised prior to modelling. Mixed-effects regression models account for dependencies in the data by so-called random effects. This type of model is suitable when data points are not independent of each other. Since participants and non-word items are repeated many times in the data set, random effects account for these dependency structures. In sum, all of the logistic mixed-effects models investigated which of the independent variables could predict whether a response was correct or not, while accounting for non-independence. The mixed-effects models were evaluated with pseudo-R² and concordance index (c-index). Pseudo-R² was obtained with the r.squaredGLMM function from the MuMIn package (Bartoń 2020). For mixed-effects models, the marginal R² expresses the amount of variance that is explained by the fixed effects alone, and the conditional R² expresses the amount of variance explained by the full model, including random effects (Nakagawa et al. 2017; Nakagawa and Schielzeth 2013). The c-index is a measure of concordance between a model's predicted probabilities for each data point and the actual outcome. A value above 0.8 is generally considered to be a good model (Baayen 2008, p. 204).
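To make the c-index concrete, the Python sketch below computes it directly from its pairwise definition for a binary outcome: the share of (correct, incorrect) response pairs in which the correct response received the higher predicted probability, with ties counted as half. The function name and the example values are hypothetical; the study obtained its values in R.

```python
from typing import Sequence


def c_index(predicted: Sequence[float], observed: Sequence[int]) -> float:
    """Concordance index for binary outcomes (1 = correct, 0 = incorrect)."""
    concordant = 0.0
    pairs = 0
    for p_correct, y_i in zip(predicted, observed):
        if y_i != 1:
            continue
        for p_incorrect, y_j in zip(predicted, observed):
            if y_j != 0:
                continue
            pairs += 1
            if p_correct > p_incorrect:
                concordant += 1.0
            elif p_correct == p_incorrect:
                concordant += 0.5  # ties count as half-concordant
    if pairs == 0:
        raise ValueError("need at least one correct and one incorrect response")
    return concordant / pairs


# Made-up predicted probabilities for five responses and their actual outcomes.
predicted = [0.9, 0.5, 0.4, 0.3, 0.6]
observed = [1, 1, 0, 0, 0]
print(c_index(predicted, observed))
```

A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of correct from incorrect responses.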
In order to compare the performance of the children in the DLD sample on the vocabulary and NWR tasks to that of the TD sample, age-adjusted z-scores were calculated. First, z-scores were calculated for all children in the TD sample, based on the raw score for each child and the mean and SD for that child's age group. Next, the raw scores of the children in the DLD group were transformed into z-scores, based on the mean and SD for the corresponding age group in the TD sample. Thus, all z-scores indicate how each individual performed on a specific task compared to age-group peers in the TD sample.
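The two-step z-score procedure can be sketched in Python as follows. The scores are made up, the function names are hypothetical, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption, as the text does not say which SD formula was applied.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def age_group_norms(td_scores: Dict[int, List[float]]) -> Dict[int, Tuple[float, float]]:
    """Mean and SD of the TD raw scores for each age group (age in years)."""
    return {group: (mean(scores), stdev(scores)) for group, scores in td_scores.items()}


def z_score(raw_score: float, age_group: int, norms: Dict[int, Tuple[float, float]]) -> float:
    """Age-adjusted z-score relative to TD peers in the same age group."""
    group_mean, group_sd = norms[age_group]
    return (raw_score - group_mean) / group_sd


# Made-up TD raw scores per age group.
td_scores = {4: [10, 12, 14], 5: [16, 20, 24]}
norms = age_group_norms(td_scores)

# Step 1: z-scores for the TD children themselves.
print([z_score(s, 4, norms) for s in td_scores[4]])

# Step 2: a child in the DLD group is compared with the TD norms for their age group.
print(z_score(9, 4, norms))
```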

Descriptive Statistics
First, total scores and scores by age groups are reported separately for comprehension and production in Arabic and Swedish. As can be seen in Table 3, scores increased with age in all tasks.⁹ In the following, age will be treated as a continuous variable. There were positive correlations between linear age and scores on all vocabulary tasks. The correlation with age was stronger for comprehension than production in both languages (Arabic comprehension: df = 97, r = 0.50, p < 0.001; Arabic production: df = 97, r = 0.33, p < 0.001; Swedish comprehension: df = 97, r = 0.51, p < 0.001; Swedish production: df = 97, r = 0.46, p < 0.001).
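As a reminder of how the reported df and r relate, the Python sketch below computes Pearson's r by hand and the t-statistic that is commonly used to test it, with df = n − 2 (97 for the 99 TD children). The function names and the small data set are hypothetical, and the text does not state which test the authors' R routine applied.

```python
import math
from typing import Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r: float, n: int) -> Tuple[float, int]:
    """t-statistic and degrees of freedom for testing r against zero."""
    df = n - 2
    return r * math.sqrt(df / (1 - r ** 2)), df


# Made-up ages (months) and scores.
print(pearson_r([48, 60, 72, 84], [20, 31, 30, 42]))

# Reported value for Arabic comprehension: r = 0.50 with 99 children.
print(t_statistic(0.50, 99))
```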

Age, Language Exposure, SES and Vocabulary
In order to investigate the relative effect of age (in months), amount of daily exposure (to Arabic or Swedish), length of exposure (for Swedish) and SES on vocabulary scores, four multivariate linear regression models were run, separately for comprehension and production in Arabic and Swedish, respectively (see Tables 4 and 5). Model 1 explained 36% of the variance for Arabic comprehension scores, with only age and daily exposure being significant predictors. As is evident from the standardised estimates, age was a stronger predictor (β = 0.57, p < 0.001) than daily exposure (β = 0.37, p < 0.001). Model 2 explained 36% of the variance for Arabic production scores. For both Arabic production and comprehension, only age and daily exposure to Arabic were significant, but for Arabic production, daily exposure (β = 0.52, p < 0.001) was a stronger predictor than age (β = 0.41, p < 0.001). SES was not a significant predictor of Arabic comprehension (p = 0.55) or production (p = 0.19) scores. Model 3 explained 53% of the variance for Swedish comprehension scores. Only age, length of exposure and daily exposure were significant predictors, with length of exposure (β = 0.42, p < 0.001) being the most influential, followed by age (β = 0.35, p < 0.001) and daily exposure (β = 0.17, p < 0.05). Similar patterns were found for Swedish production, where Model 4 explained 51% of the variance. Again, only age, length of exposure and daily exposure were significant predictors, with length of exposure (β = 0.43, p < 0.001) having more of an impact than age (β = 0.28, p < 0.001) or daily exposure (β = 0.24, p < 0.01).
SES was not a significant predictor of Swedish comprehension (p = 0.09) or production (p = 0.33) scores.
First, total scores and scores by age groups are reported for all three NWR tasks. As can be seen in Table 6, overall performance was lowest on the LS-Swe task, with a mean accuracy of 55.0%. The CL-Swe task was in the middle, with a mean accuracy of 76.1%, and the NWRT-Leb task had the highest overall performance, with a mean accuracy of 83.7%. A one-way ANOVA revealed that the differences in accuracy between tasks were significant (F(2, 291) = 97.65, p < 0.001, ηp² = 0.40). Pairwise comparisons showed that the differences were significant between all tasks (p < 0.01 for all comparisons).
As shown in Table 6, mean scores increased with age between all age groups, and the ranges generally decreased.¹⁰ In the following analyses, age will be treated as a continuous variable. There were positive correlations between linear age and scores on all NWR tasks, but they were slightly weaker for the CL-Swe task (df = 96, r = 0.27, p < 0.01) than for the LS-Swe task (df = 95, r = 0.41, p < 0.001) and for the NWRT-Leb (df = 97, r = 0.45, p < 0.001).
Note. For LS-Swe, N = 97, as only 20/22 four-year-olds did the task. For CL-Swe, N = 98, as only 21/22 four-year-olds completed the task.
There were differences in the proportion of children who scored high or low on each task, reflecting different overall task difficulty. For instance, there was a striking difference between the LS-Swe task and the NWRT-Leb, where 38% of all children scored below 50% on the LS-Swe task, but only 3% did so on the NWRT-Leb. The reverse pattern emerged when investigating the proportion of children who scored 90% or better; only one child did so on the LS-Swe task, but 41% did so on the NWRT-Leb.¹¹ For the CL-Swe task, most children scored between 50% and 90% correct, with fewer children scoring below 50% (10% of the children) or above 90% (12% of the children).

NWR Accuracy in Relation to Task, Item Properties, Language Exposure, and Vocabulary
As described in the literature (see Introduction), several item-/task-related factors and participant-related factors have an impact on NWR performance. First, exploratory analyses were conducted for NWR tasks by investigating correlations with language exposure and vocabulary. Unsurprisingly, for LS-Swe, there were positive correlations with length of exposure to Swedish (df = 94, r = 0.36, p < 0.001) and Swedish vocabulary comprehension (df = 95, r = 0.45, p < 0.001). More surprisingly, for the CL-Swe, there were also positive correlations with length of exposure to Swedish (df = 95, r = 0.23, p < 0.05) and Swedish vocabulary comprehension (df = 96, r = 0.37, p < 0.001). For the NWRT-Leb, there was a positive correlation with Arabic vocabulary comprehension (df = 97, r = 0.36, p < 0.001). For none of the NWR tasks were there any significant correlations with daily exposure or SES.
Next, accuracy in terms of the percent of correctly repeated items was investigated in relation to item length (number of syllables), as visualised separately for each NWR task in Figure 1. As presented in Table 7, accuracy was generally highest for items with fewer syllables and without consonant clusters. In other words, accuracy decreased as a function of increasing number of syllables and the presence of consonant clusters. However, the accuracy patterns for items with vs. without clusters were not the same for the LS-Swe task and the NWRT-Leb. In the NWRT-Leb, accuracy decreased by similar amounts for items with and without clusters as the number of syllables increased. By contrast, for the LS-Swe task, accuracy levels were similar for 2-4-syllable items without clusters, but decreased steeply at five syllables, whilst accuracy levels for the items with clusters decreased for each added syllable. This difference is visualised in Figure 2. Potential interactions between item length and presence of clusters will be explored further in the multivariate mixed-effects regression models.
Next, logistic mixed-effects regression models were fitted to investigate the effect of task-/item-related factors and participant-related factors on repetition accuracy. Since the non-word tasks differed from each other on a number of fundamental properties (i.e., number of syllables, presence of consonant clusters, language-(non-)specificity, etc.), all tasks could not be directly compared to each other. Therefore, three separate models were fitted.

[Table column headings: LS-Swe (all items, w/o clusters, with clusters); CL-Swe (all items); NWRT-Leb (all items, w/o clusters, with clusters)]
Model 5 (see Table 8) investigated the effect of the presence of clusters, item length (number of syllables), age, Arabic vocabulary, and the interaction between presence of clusters and number of syllables on repetition accuracy in the NWRT-Leb. Chronological age (B = 0.47, p = 0.001) and Arabic vocabulary (B = 0.32, p = 0.02) had a positive influence on NWR scores. That is, older children and children with larger vocabularies in Arabic had higher accuracy. Consonant clusters (B = −0.72, p = 0.02) and an increasing number of syllables (B = −1.13, p < 0.001) both contributed to lower repetition accuracy. However, there was no interaction between clusters and syllables (B = 0.05, p = 0.88), indicating that as the number of syllables increased, accuracy decreased to a similar degree for items with and without clusters. The full model's explanatory power was considerable (conditional R² = 0.52) and larger than that of the fixed effects alone (marginal R² = 0.27). The c-index of 0.88 indicates a good model fit.
Model 6 (see Table 9) explored the effect of age, length of exposure to Swedish, Swedish vocabulary, clusters, item length (number of syllables), and the interaction between clusters and number of syllables on LS-Swe accuracy. While chronological age (B = 0.29, p = 0.01) had a positive effect on the repetition accuracy of LS-Swe items, there was no effect of length of exposure to Swedish (B = 0.10, p = 0.43). Somewhat surprisingly, Swedish vocabulary scores had no significant effect (B = 0.26, p = 0.06) on LS-Swe accuracy. The presence of consonant clusters had a negative impact (B = −1.69, p < 0.001), and accuracy also decreased with increasing item length (syllables: B = −0.65, p = 0.007). Furthermore, there was an interaction between clusters and syllables (B = −0.81, p = 0.03), where the negative effect of clusters was stronger with increasing item length (number of syllables). The explanatory power of the full model was considerable (conditional R² = 0.53) and greater than that of the fixed effects alone (marginal R² = 0.34). The c-index was 0.88, indicating a good model fit.
Finally, Model 7 (see Table 10) investigated the effect of task for the LS-Swe and the CL-Swe (a comparison between language-specific vs. non-language-specific test items), age, Swedish vocabulary, and item length, as well as the interaction between task and item length and the interaction between task and Swedish vocabulary scores. Chronological age (B = 0.27, p = 0.01) and Swedish vocabulary scores (B = 0.35, p = 0.002) had a positive effect on repetition accuracy. There was a task effect (B = 1.76, p < 0.001); items from the CL-Swe task generally had higher accuracy than items from the LS-Swe task. Accuracy decreased with increasing item length (syllables: B = −1.08, p < 0.001), and there was also an interaction between task and item length (B = −0.80, p = 0.04): the adverse effect of item length was not the same for both tasks. Accuracy decreased by similar amounts for each added syllable in the LS-Swe task (see Figure 1 and Table 7). By contrast, for the CL-Swe task, the decrease was rather small for each added syllable in the shortest items, and accuracy then dropped steeply at five syllables. Finally, there was no interaction between task and Swedish vocabulary (B = 0.04, p = 0.70), indicating that the positive effect of Swedish vocabulary was similar for the language-specific items in LS-Swe and the non-language-specific items in CL-Swe. The explanatory power of the full model was considerable (conditional R² = 0.61), and better than that of the fixed effects alone (marginal R² = 0.38). The c-index of 0.90 showed a good model fit.
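To make the size of the reported coefficients more tangible, the following minimal Python sketch converts B values into odds ratios. It rests on an assumption that is not restated in this section, namely that the models are logistic mixed-effects regressions and that each B is a log-odds coefficient (the reporting of a c-index points in this direction). The helper name `odds_ratio` is purely illustrative, and since the scaling of the predictors is not given here, each ratio applies per unit of the predictor as it entered the model.

```python
import math


def odds_ratio(b: float) -> float:
    """Convert a log-odds coefficient B into an odds ratio."""
    return math.exp(b)


# Fixed effects reported for Model 5 (NWRT-Leb); treated as log-odds (assumption).
model5 = {
    "age": 0.47,
    "arabic_vocabulary": 0.32,
    "clusters": -0.72,
    "syllables": -1.13,
}

for name, b in model5.items():
    # Ratios above 1 mean higher odds of a correct repetition, below 1 lower odds.
    print(f"{name}: B = {b:+.2f}, odds ratio = {odds_ratio(b):.2f}")
```

Under this reading, a negative B such as the one for clusters corresponds to an odds ratio below 1, i.e., lower odds of a correct repetition when a cluster is present.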

Performance on Vocabulary and NWR Tasks
In this section, the performance of the children in the DLD sample will be compared to that of the TD group. As the NWR tasks and the CLT vocabulary tasks utilised in the present study have not been normed, we had no indication which cut-off would be best to identify DLD in our sample of Arabic-Swedish-speaking bilinguals on these particular tasks. Therefore, we opted for a cut-off of z-score below −1.25 (i.e., identifying the lowest-scoring 10.6%) in accordance with Tomblin et al. (1997).
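The correspondence between a z-score of −1.25 and the lowest-scoring 10.6% can be checked with a short Python sketch. It assumes normally distributed scores; the function names and the example mean and standard deviation are hypothetical and are not taken from the study's data.

```python
import math

CUTOFF = -1.25  # z-score cut-off used in this section


def normal_cdf(z: float) -> float:
    """Proportion of a standard normal distribution falling below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def z_score(raw: float, group_mean: float, group_sd: float) -> float:
    """Standardise a raw score against the mean and SD of an age group."""
    return (raw - group_mean) / group_sd


def below_cutoff(z: float, cutoff: float = CUTOFF) -> bool:
    """True only if the z-score is strictly below the cut-off."""
    return z < cutoff


# Share of a normal distribution below the cut-off (about 10.6%).
print(f"{100 * normal_cdf(CUTOFF):.1f}%")

# Hypothetical child: raw score 10 in an age group with mean 15 and SD 4.
z = z_score(10, 15, 4)
print(z, below_cutoff(z))
```

Note that a child scoring exactly at the cut-off is not counted as below it in this sketch, which matches how scores "at the cut-off" are reported separately in the text.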
Most children in the DLD group scored within the range of their age group in both vocabulary comprehension and production in both languages, but below the mean, which is shown by the predominantly negative z-scores (see Table 11). For the children in the DLD group as a whole, vocabulary z-scores were generally lower in Arabic than in Swedish. The low vocabulary scores in Arabic are noteworthy, since the DLD children (like the TD group) had ample exposure to Arabic from birth. For NWR, a bit more than half (6/11) of the children in the DLD sample scored within the range of their age peers on all three NWR tasks. Overall performance on the NWR tasks was generally below the mean, as reflected by the overall negative z-scores.
In Figure 3, age-adjusted z-scores are plotted for Arabic and Swedish vocabulary comprehension (a) and production (b) for the children in the TD and DLD samples. As is evident in Table 11 and Figure 3, more than half (7/11) of the children in the DLD group had a z-score below −1.25 for comprehension in one language, but only one child (BiAraLI-01) received a z-score below −1.25 for comprehension in both languages. The pattern was similar for vocabulary production scores; six of the children in the DLD group scored below −1.25 in one language (additionally, one child, BiAraLI-01, scored at the cut-off in Arabic). Thus, it was not the case that all or even the majority of the DLD children scored below −1.25 in either task or in both languages. For both comprehension and production, there was a notable overlap in scores for the children in the DLD sample and children in the TD sample.
In Figure 4, z-scores are plotted for the three NWR tasks for the children in the TD and the DLD samples, for the (a) LS-Swe task, (b) CL-Swe task, and (c) NWRT-Leb. As can be seen in Table 11 and Figure 4, there was a notable overlap in NWR performance in the two samples.
We have refrained from calculating sensitivity and specificity for the vocabulary and the NWR tasks since, as the scatterplots in Figures 3 and 4 show, there is a lot of overlap between the TD and the DLD groups and we see no straightforward solution for getting around this overlap (e.g., by exploring different cut-offs).
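For readers less familiar with the two measures mentioned above, the following Python sketch shows how sensitivity and specificity would be derived from z-scores at a given cut-off, and why overlapping groups limit both. The z-scores are invented for illustration and are not the study's data; the function name is likewise hypothetical.

```python
def sensitivity_specificity(td_z, dld_z, cutoff=-1.25):
    """Return (sensitivity, specificity) when scores below `cutoff` count as positive."""
    true_positives = sum(z < cutoff for z in dld_z)   # DLD children flagged
    true_negatives = sum(z >= cutoff for z in td_z)   # TD children not flagged
    return true_positives / len(dld_z), true_negatives / len(td_z)


# Hypothetical, overlapping z-scores (NOT taken from the study).
td_z = [0.8, 0.1, -0.4, -1.4, 1.2, -0.9, 0.3, -1.3]
dld_z = [-1.6, -0.7, -1.3, 0.2, -1.1]

for cutoff in (-1.0, -1.25, -1.5):
    sens, spec = sensitivity_specificity(td_z, dld_z, cutoff)
    print(f"cut-off {cutoff}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

With overlapping distributions of this kind, moving the cut-off trades one measure against the other without making both satisfactory at once.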
To summarise, although the children in the DLD sample typically scored below the mean on both vocabulary comprehension and production in both languages, it was rare that they scored low (i.e., below −1.25 z-scores) in both languages. For non-word repetition, performance was also generally below the mean score of their age peers, but only six DLD children scored below −1.25 (in z-scores) in at least one task. We will now investigate whether the DLD children's performance on certain types of NWR items differs more from that of their TD peers.

Performance of the TD and the DLD Children on NWR Items with Different Properties
In this section, NWR accuracy (% correct responses) in the TD sample and the DLD sample will be compared by item length (syllables) and presence of clusters separately for each task. Since the age range was narrower in the DLD sample (5;0-7;3) compared to the TD sample (4;0-7;11), we excluded the four-year-old TD children here. Due to the large difference in sample size between the two groups (TD: N = 77; DLD: N = 11), only descriptive statistics will be reported.
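The kind of descriptive breakdown described here (percentage of correct responses per group, item length, and cluster status) can be sketched in Python as follows. The data layout, the function name, and the example responses are assumptions made for illustration only and do not reproduce the study's data or scoring procedure.

```python
from collections import defaultdict


def accuracy_by_cell(responses):
    """Percentage correct per (group, syllables, has_cluster) cell.

    `responses` is an iterable of (group, syllables, has_cluster, correct)
    tuples, one per child-by-item repetition attempt.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [n_correct, n_total]
    for group, syllables, has_cluster, correct in responses:
        cell = counts[(group, syllables, has_cluster)]
        cell[0] += int(correct)
        cell[1] += 1
    return {key: 100.0 * n_correct / n_total
            for key, (n_correct, n_total) in counts.items()}


# Hypothetical responses (NOT the study's data).
example = [
    ("TD", 2, True, True), ("TD", 2, True, True),
    ("TD", 2, True, False), ("TD", 2, True, True),
    ("DLD", 2, True, True), ("DLD", 2, True, False),
]

for cell, pct in sorted(accuracy_by_cell(example).items()):
    print(cell, f"{pct:.0f}%")
```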
As evident in Table 12, at group level, the DLD children scored lower than the TD children on all tasks, at all item lengths, and for items with and without clusters alike, with two exceptions. For the shortest LS-Swe items (two syllables) with clusters, accuracy was higher for the DLD group (91%) compared to the TD group (84%), and for the shortest NWRT-Leb items (one syllable) without clusters, accuracy was at ceiling (TD = 99%, DLD = 100%) for both groups. We could not discern any tendencies for the DLD children to perform disproportionally worse on one type of NWR task or on certain types of items, such as long items and/or phonologically complex items (with clusters).
Table 12. Accuracy (% correct answers) for NWR tasks by number of syllables and presence of clusters for the 5-7-year-olds in the TD sample (N = 77) and the children in the DLD sample (N = 11).
In the next section, the DLD children's test results will be analysed in conjunction with background information and reports about functional language abilities from parents, teachers, and SLPs.

Background Information and Functional Language Abilities of the Children in the DLD Sample
According to parental report, 9 of the 11 children in the DLD sample had delayed language development in their first language (Arabic), and 6 of the 11 had late development in their second language (Swedish). Six children in the DLD sample (55%) produced their first word and/or first word combination late, i.e., later than 12 and 24 months, respectively, and the same proportion of children (55%) had a close relative with language or literacy problems. By comparison, 27% of the children in the TD sample were reported to have a late onset of the first word or word combination, and only 7% had a close relative with language or literacy difficulties.
The reports from parents, teachers and SLPs regarding functional language skills and communicative behaviour of the children in the DLD sample are briefly summarised in Table A2 in Appendix A. Most children in the DLD sample had difficulties with both comprehension (e.g., having difficulties understanding instructions) and production (e.g., being difficult to understand or having difficulty making oneself understood, having deficits in expressive morphosyntax, or speaking in rudimentary utterances). The children in the DLD sample will now be characterised in terms of their performance on the vocabulary and NWR tasks and discussed in light of the information provided by parents, teachers, and SLPs.
BiAraLI-03 and BiAraLI-05 had poor vocabulary scores (particularly in Arabic), but received positive or only slightly negative z-scores in the NWR tasks. At the same time, both children had functional communication difficulties according to the parents, SLPs, and teachers (although the mother of BiAraLI-05 thought that his Arabic was age-appropriate).
Two children (BiAraLI-04 and BiAraLI-07) scored moderately low in both vocabulary and NWR tasks. In comparison to the other children in the DLD group, these two children had seemingly milder problems. BiAraLI-04 recently had his diagnosis changed from general to expressive language disorder. Although the SLP reported deficits in expressive morphosyntax in both languages, the parents and the preschool staff were of the opinion that he only had problems with pronunciation, and the preschool staff thought that he had a small Swedish vocabulary due to poor exposure. BiAraLI-07 had problems with both comprehension and production according to the parents, the SLP, and the preschool teacher. However, it was reported that he played well with other children with few misunderstandings or conflicts.
Five children (BiAraLI-01, BiAraLI-02, BiAraLI-06, BiAraLI-08, and BiAraLI-10) performed poorly on all NWR tasks and had poor vocabulary scores in one or both languages. Four of these children were described by parents, SLPs, and teachers alike as having severe language difficulties, to the extent that there were frequent misunderstandings or conflicts with peers. Reports about the fifth child (BiAraLI-06) were mixed, as the parents and the preschool teacher did not find his language skills to be severely affected, but the SLP reported great difficulties in several language domains.
BiAraLI-11 had a large discrepancy between performance in the vocabulary tasks and the NWR tasks. She had good vocabulary scores in Arabic, and surprisingly good scores in Swedish considering that her age of onset for Swedish was late (age 4;0-5;0). However, NWR performance was poor, especially in the tasks with higher phonological complexity (LS-Swe and NWRT-Leb). Her comprehension seemed to be better than production according to parent, SLP and teacher interviews.
Finally, BiAraLI-09 scored high or very high in all NWR tasks. His vocabulary scores were slightly below the mean in Arabic and far above the mean in Swedish. The reports from parents, the SLP, and the school were inconsistent. The parents said that they were concerned about his early language development, and the child had been seen by different SLPs over the course of a couple of years. Eventually, he was diagnosed with DLD at the SLP clinic, but the parents were not sure that the diagnosis was accurate. At the same time, the school staff perceived the child to be very shy and reported that he did not like to speak in class, but that he seemed to have age-appropriate expressive skills during individual sessions with the special education teacher. Considering that BiAraLI-09 had high scores in all NWR tasks, very high vocabulary scores in Swedish, and Arabic scores just below the TD mean, it could be the case that this child was subject to overdiagnosis of DLD. According to the SLP, he had only been assessed in Swedish and his language scores were compared against monolingual Swedish norms.

Discussion
In this study, we investigated vocabulary comprehension and production, as well as the NWR performance of 110 Arabic-Swedish-speaking bilinguals aged 4-7. The relative effect of age, language exposure and SES on vocabulary comprehension and production was investigated for the minority and the majority language. NWR performance was investigated in relation to age, language exposure, vocabulary, and properties of the non-word items. We also explored whether bilingual children with a diagnosis of DLD could be distinguished from TD children, based on their performance on vocabulary and NWR tasks, and whether one particular type of NWR task might identify DLD better.

Vocabulary in the TD Sample
We found that vocabulary comprehension and production scores increased with age in both the minority language Arabic and the majority language Swedish. These results accord with findings reported in the literature, namely that there is a clear development with age in the majority language (Bialystok et al. 2010; Cobo-Lewis et al. 2002a; Prevoo et al. 2014). However, our finding of a clear development with age in Arabic differs from many previous studies that report small or no gains with age in the minority language (Ganuza and Hedman 2019; Gathercole and Thomas 2009; Leseman 2000; Öztekin 2019, chp. 4). Recall that slightly fewer than half of the children (42%) were not born in Sweden, 48% had an age of onset for Swedish after age 3, and 20% had less than two years of exposure to Swedish at the time of testing. As mentioned in the introduction, Arabic speakers are the largest linguistic minority in Sweden, and many children in our sample had several sources of input outside the home (e.g., in (pre)school or at community centres arranging activities for Arabic-speaking children). This means that at group level, the children in our study had a high amount of cumulative exposure to Arabic from various interlocutors, which probably supported their development of the minority language.
Vocabulary scores increased as a function of the proportion of daily exposure. The effect was seen for both languages, and it was stronger for production than for comprehension. These findings are in line with earlier studies demonstrating a relationship between the relative amount of exposure and vocabulary comprehension in the minority language (Prevoo et al. 2014) and the majority language (Unsworth 2016), as well as for vocabulary production in the majority language (Öztekin 2019, chp. 4; Prevoo et al. 2014). They are also in line with Thordardottir's (2011) observation that the effect of relative amount of exposure is stronger for vocabulary production than comprehension. In the present study, language exposure was further investigated by exploring the effect of length of exposure (LoE) (in months). For the minority language Arabic, LoE could not be investigated separately as it coincided with chronological age. For the majority language Swedish, LoE emerged as the most influential predictor of vocabulary scores, overshadowing age and daily exposure. Interestingly, these results go against those of Thordardottir (2019), who found that LoE was not a significant predictor of vocabulary comprehension scores in slightly older children (Canadian 7-9-year-olds with French as their common language) when cumulative exposure was also accounted for. Since the present study measured current amount of exposure as a separate variable, our length of exposure variable is likely to capture length as well as cumulative (amount of) exposure.
There was no effect of SES (parental education) on vocabulary comprehension or production, either in the minority language Arabic or in the majority language Swedish. Considering previous reports of null results for SES and vocabulary in the minority language (Cobo-Lewis et al. 2002b; Öztekin 2019, chp. 4; Prevoo et al. 2014), it was unsurprising that SES was not a significant predictor of Arabic vocabulary. Surprisingly, however, SES was not a significant predictor of Swedish vocabulary either. This result does not match previous studies from other countries, where higher SES is generally associated with better vocabulary scores in the majority language (Buac et al. 2014; Calvo and Bialystok 2014; Cobo-Lewis et al. 2002a; Prevoo et al. 2014). There may be several reasons why SES was not a significant predictor of majority language vocabulary. Higher SES tends to co-vary with a higher degree of majority language use in the home (Prevoo et al. 2014), or better majority language proficiency of the parents (Buac et al. 2014), which may in turn boost the child's majority language skills if the parents speak the majority language in the home. In the present study, almost 80% of the participating families reported that both parents spoke to their child only or mainly in the minority language Arabic. Overall, there was very little parental input in Swedish, and such input is therefore unlikely to have boosted majority language vocabulary in our sample of Arabic-Swedish-speaking bilinguals. Yet another explanation could be the Swedish setting, where access to institutional childcare (preschool) is not dependent on family income or SES (as it often is in other countries). Consequently, some differences between children from different SES backgrounds may be levelled out, and their vocabulary development may be influenced more strongly by the quantity and quality of input in (pre)school and language-fostering activities in the home, which in turn are not directly related to the parents' educational level (Bohnacker et al. 2021).

Non-Word Repetition in the TD Sample
In the present study, we found age effects for all three types of NWR tasks: the LS-Swe, the CL-Swe, and the NWRT-Leb. Scores increased with age, mirroring findings from several previous studies (Chiat and Roy 2007; Kalnak et al. 2014; Radeborg et al. 2006). These age effects held for all tasks also when controlling for vocabulary scores and for length of exposure to Swedish (for the LS-Swe task) in the multivariate regression models. For all three NWR tasks, accuracy decreased with increased non-word length, in line with earlier studies using other stimulus items (Boerma et al. 2015; Chiat and Roy 2007; Dollaghan and Campbell 1998; Ellis Weismer et al. 2000; Thordardottir and Brandeker 2013). Increased phonological complexity (presence of consonant clusters) had an adverse effect on repetition accuracy in the two tasks that contained items with clusters (LS-Swe and NWRT-Leb). This is in line with previous studies reporting lower accuracy for phonologically more complex items (Abed Ibrahim and Hamann 2017; dos Santos and Ferré 2018; Jones et al. 2010).
It was not the case, though, that tasks containing items with higher phonological complexity were generally more difficult. Task difficulty differed greatly; the two tasks that contained items with clusters had the highest (NWRT-Leb) and the lowest (LS-Swe) overall repetition accuracy. Interestingly, the effects of item length (number of syllables) and presence of clusters on accuracy differed markedly between the NWR tasks. For instance, accuracy of the LS-Swe items decreased by similar amounts for each added syllable. By contrast, for the CL-Swe task, accuracy decreased only slightly for each added syllable between two and four syllables, but then declined abruptly at five syllables. Accuracy on the NWRT-Leb was only marginally lower for items with vs. without clusters at all syllable lengths, whilst for the LS-Swe task, there were large discrepancies in accuracy for items with vs. without clusters at different syllable lengths. For 2-syllable items with clusters, accuracy was even somewhat higher than for items without clusters, but for items with 3-5 syllables it was much lower. Recall, however, that all tasks utilised different phoneme inventories, the LS-Swe having a wide variety of language-specific Swedish phonemes, whilst the NWRT-Leb had a very restricted phoneme inventory. We speculate that there is an interplay between phoneme inventory, syllabic complexity and item length (number of syllables), affecting item and overall task difficulty.
There was no correlation between any of the tasks and SES, mirroring several previous studies that did not find an association between SES and NWR performance (Boerma et al. 2015; Chiat and Roy 2007; Kalnak et al. 2014). Daily exposure did not correlate with performance on the LS-Swe task (daily exposure to Swedish), the CL-Swe task (daily exposure to Swedish), or the NWRT-Leb (daily exposure to Arabic). Furthermore, length of exposure to Swedish was not a significant predictor of performance on the LS-Swe task when chronological age was controlled for. Swedish vocabulary (comprehension) was a significant predictor of performance for the LS-Swe items. This is congruent with several previous studies finding an association between NWR performance and vocabulary (Gibson et al. 2015; Kohnert et al. 2006; Sorenson Duncan and Paradis 2016; Thordardottir and Brandeker 2013). Interestingly, there was a vocabulary effect on all three NWR tasks. Contrary to expectation, the effect of Swedish vocabulary on repetition accuracy was not stronger for the language-specific LS-Swe items than for the non-language-specific CL-Swe items. Additionally, the impact of Arabic vocabulary on NWRT-Leb is somewhat surprising as this task was constructed to be language-independent and to minimise the impact of vocabulary (dos Santos and Ferré 2018).
Finally, as many as 30% of the children in the TD sample (4-, 5- and 6-year-olds) scored below −1 SD on the language-specific LS-Swe task compared to monolingual reference data (Radeborg et al. 2006). Interestingly, this proportion is similar to that reported by Sorenson Duncan and Paradis (2016), who found that 29% of the bilingual Canadian children in their sample scored below −1 SD on a language-specific English NWR task. Unlike the present study, however, they found that performance was affected by (cumulative) exposure to English. Thus, even though NWR performance on the LS-Swe task was not measurably related to language exposure per se, it cannot be assumed that the NWR performance of bilingual children can be compared against monolingual norms.

Vocabulary and Non-Word Repetition in the DLD Sample
Vocabulary was assessed in both Arabic and Swedish. At group level, the children in the DLD group scored below the cross-sectional mean, but within the range for their age group on comprehension and production in Arabic and Swedish. There were, however, large individual differences, and not all children had poor scores in both languages. While only one child had a z-score below −1.25 in both languages (in both modalities in Swedish but only in comprehension in Arabic), five children had a z-score below −1.25 in Arabic comprehension, and four children in Arabic production. Thus, having a z-score below −1.25 in both languages may not be a valid criterion for identifying DLD in this group of children. As described in the introduction, it is frequently argued that language difficulties must show in both languages of bilingual children (Kohnert 2010; Salameh et al. 2002; Thordardottir 2015), but evidence-based recommendations for interpreting language test scores and choosing suitable cut-offs are rare. Peña et al. (2016) investigated whether cut-offs established for monolingual populations could accurately classify Spanish-English bilinguals with balanced exposure to both languages as DLD or TD on a task targeting semantic skills. They found that scoring below the monolingual cut-off in both languages correctly classified the children as DLD, whereas considering only one language led to overidentification. In the current study, there was much variation in age of onset for Swedish and in the proportion of relative exposure to each language. Notably, most DLD children with poor vocabulary scores had low scores in Arabic, despite the fact that they had received continuous exposure to the language from birth. In light of this, we propose that having low vocabulary scores in the home language despite early onset and continuous input may be a warning sign for DLD.
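The difference between requiring a low score in both languages and accepting a low score in either language can be illustrated with a brief Python sketch. The child labels and z-scores are hypothetical and do not correspond to any participant; the function names are illustrative only.

```python
CUTOFF = -1.25  # z-score cut-off discussed above


def low_in_both(arabic_z: float, swedish_z: float, cutoff: float = CUTOFF) -> bool:
    """Stricter criterion: below the cut-off in both languages."""
    return arabic_z < cutoff and swedish_z < cutoff


def low_in_either(arabic_z: float, swedish_z: float, cutoff: float = CUTOFF) -> bool:
    """More lenient criterion: below the cut-off in at least one language."""
    return arabic_z < cutoff or swedish_z < cutoff


# Hypothetical (Arabic z, Swedish z) pairs, NOT taken from the study.
children = {
    "child_A": (-1.8, -1.5),  # low in both languages
    "child_B": (-1.6, 0.2),   # low in Arabic only
    "child_C": (-0.3, -0.4),  # low in neither language
}

for label, (arabic_z, swedish_z) in children.items():
    print(label, low_in_both(arabic_z, swedish_z), low_in_either(arabic_z, swedish_z))
```

A child with the profile of the hypothetical child_B would be missed under the both-languages criterion, which is the pattern most DLD children with poor vocabulary scores showed in the present sample.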
Let us now move on to NWR, as this has been described as a promising diagnostic tool for bilingual children. At group level, the children in the DLD group scored below the mean of their TD age peers on most NWR tasks. However, there was much individual variation and only six (out of 11) DLD children had a z-score below −1.25 in at least one NWR task. Additionally, there was considerable overlap between the TD and the DLD groups on all tasks, with some DLD children scoring above the mean and some TD children scoring below the −1.25 z-score cut-off. Although poor NWR performance is frequently reported to be associated with DLD (Boerma et al. 2015; Dollaghan and Campbell 1998; Kalnak et al. 2014), there is not a perfect relationship between low NWR scores and presence of DLD (Ellis Weismer et al. 2000). Moreover, there are several reports in the literature of poorer diagnostic accuracy and a higher degree of overlap between TD and DLD groups on NWR tasks in bilingual populations compared to monolingual populations. Poorer diagnostic accuracy in bilinguals has not only been attested for language-specific NWR tasks, but also for language-non-specific and quasi-universal tasks (Abed Ibrahim and Hamann 2017; Boerma et al. 2015; dos Santos and Ferré 2018; Schwob et al. 2021; Thordardottir and Brandeker 2013). In the current study, the DLD children did not perform disproportionally worse on a certain task or item type compared to the TD children. Rather, the DLD children generally scored lower than the TD children on all item types, and accuracy decreased for both groups with increased item length and the presence of consonant clusters. These findings are in line with previous studies reporting that both TD and DLD children have more difficulty with NWR as stimulus length increases (Boerma et al. 2015; Schwob et al. 2021). At the same time, results are mixed with regard to whether DLD children have disproportionally more difficulties with longer items and/or phonologically complex items compared to TD children (Boerma et al. 2015; Schwob et al. 2021). In conclusion, we did not find that one type of NWR task or items with certain properties were superior in the identification of DLD.
Two conclusions can be drawn from these observations about vocabulary and NWR performance in our sample of Arabic-Swedish bilinguals. The first is that bilinguals with DLD do not necessarily perform poorly in both languages, even when comparing them to peers who grew up in the same country, speaking the same language combination. The second is that performance on NWR tasks cannot reliably distinguish all children with a DLD diagnosis when comparing them to a large group of children with (according to parental report) typical language development. Bearing in mind that vocabulary is the linguistic domain that is probably the most affected by language exposure, perhaps it is not surprising to find a large overlap in vocabulary scores, as fluctuating patterns of exposure give rise to much variation in performance on vocabulary tasks in both the TD and the DLD groups. However, our study indicates that claims such as "bilinguals with DLD must perform low in both languages in order to qualify for a DLD diagnosis" must be taken with caution. In the area of vocabulary and non-word repetition, they are clearly not supported by the empirical evidence.
It is noteworthy that NWR performance was clearly poor in only half of the children with a DLD diagnosis, with much overlap in performance between the DLD and TD groups. As Norbury et al. (2016) point out, cut-offs are arbitrary in the sense that they do not say anything about how a certain score corresponds to functional communicative abilities. Thus, receiving a z-score below −1.25 on a given task is not necessarily associated with poor functional language skills. Conversely, individuals who score above (and in some cases well above) −1.25 on a certain task do not necessarily have sufficient functional language skills. Notably, there was a subgroup of around half of the children in the DLD sample who were described by parents, teachers and SLPs alike as having severe communication difficulties that often led to peer conflicts and had a negative impact on their learning outcomes. These children scored low on the majority of the NWR tasks. However, there was also another subgroup in the DLD sample who were described as having somewhat milder language problems, and who scored low but still within the typical range on the NWR tasks. In a clinical setting, for the most severe cases (like the children in the first subgroup), a language disorder is usually not difficult to determine. The difficult cases are rather those falling into the second category. NWR does not seem to have good diagnostic accuracy for the bilingual children in our study; it is at best suggestive. Therefore, we argue that it is crucial to interpret language test scores in light of a detailed case history and reports from parents and teachers.
As is well known, the initial diagnosis of DLD is generally difficult in bilinguals, with a risk of misdiagnosis by the experts. The children with a DLD diagnosis in our sample had undergone careful and often repeated assessment, sometimes by several SLPs. However, we cannot completely rule out that there may have been some misdiagnosis, particularly in one case (BiAraLI-09). Furthermore, as DLD is a heterogeneous condition, individual children may have relative strengths or weaknesses in one modality (comprehension or production) or one or more linguistic domains (phonology, morphosyntax, vocabulary, discourse or pragmatics), which may explain why some of the DLD children scored unexpectedly high on a certain task.
In the present study, a much higher proportion of children in the DLD sample had late language development and/or heredity for speech, language or literacy problems. These findings accord with earlier research showing that delayed language development and heredity for disorders of language or literacy are disproportionally more common in children with DLD compared to their TD peers (Kalnak et al. 2012; Paradis et al. 2010; Trauner et al. 2000). Parents, teachers and SLPs were asked to characterise the children's language and communication. Most children in the DLD sample were described as having deficits in their functional language skills. Descriptions of poor language comprehension were common (e.g., having difficulties understanding instructions or complex syntax), which is frequently reported in the literature about children with DLD (Bishop 1997; Friedmann and Novogrodsky 2004; Norbury et al. 2016). The children were also reported to have poor expressive abilities, for instance having deficits in expressive morphosyntax, which is also a common feature among children with DLD (Paradis et al. 2022; Reuterskiöld et al. 2021). In conclusion, we found that it was necessary to combine a formal assessment of vocabulary and NWR with a detailed case history and reports from parents, teachers and SLPs about early language development, heredity for language and literacy problems, language exposure, and functional language skills for identifying language difficulties, particularly in those children who performed borderline poor on NWR. This solution is also supported by other studies finding that combining NWR with parental questionnaires probing early language milestones and parental concern about the child's language development can improve diagnostic accuracy (Boerma and Blom 2017; Paradis et al. 2013).

Conclusions
For this understudied language combination of Arabic-Swedish-speaking bilinguals, we found that language exposure had a large impact on minority and majority vocabulary scores, but SES was not a significant predictor of vocabulary in any language.Age and vocabulary size had a positive effect on NWR performance; longer items and items containing clusters had lower repetition accuracy.A language-specific Swedish NWR task was evaluated for the first time for a large group of bilinguals, and results showed that although language exposure did not measurably affect NWR scores, these bilinguals were disadvantaged when compared against monolingual norms.There was a substantial overlap between TD and DLD children in performance on both vocabulary and NWR tasks.Low vocabulary scores in the minority language despite ample and continuous exposure from birth emerged as a warning sign for DLD.Diagnostic accuracy seemed at best suggestive for NWR, and we could not discern any particular task or type of item that was clearly superior for identifying DLD in our sample.Most children with DLD did not score below the −1.25 z-score cut-off in both languages (for vocabulary), and many scored above this cut-off on a majority of the NWR tasks.Reports from parents and teachers on language exposure, language development, concerns, functional language skills and communication difficulties are crucial when assessing suspected DLD in bilinguals.Future research should include a larger group of DLD children, as well as use longitudinal designs in order to investigate and confirm the results we have shown here for our cross-sectional sample.

Appendix A
Table A1. All items in the NWR tasks: the LS-Swe (language-specific Swedish), the CL-Swe (cross-linguistic Swedish), and the NWRT-Leb (Non-word repetition task-Lebanese).

LS-Swe
/ɡlʏˈvoː/ /aˈpɛt/ /ɪˈfʉːm/ /ˈɧɔɾjɛ/ /naˈkiːt/ /ˈspʉːmɛ/
/lɛbʊˈsʉːf/ /mɵstɾɛˈfalj/ /ɡlɛ[…]ɛˈsɵlp/ /salʊˈtɑːn/ /hœntˈpʉːlɛ/ /nɛsʊˈloː/
/ˈma[…]ɛʂblɛɡɛ/ /ɛlʊˈmɔkɪ/ /ɔlɪˈtʉːkɛ/ /spɵɾɪfɾaˈɡoːl/ /tɪbɛˈfiːmɛ/ /lɵtʊspɛˈlʉːn/
/tœlɪpaˈleːɾʊ/ /ɕɵlɛkɾɔmpaˈmiːd/ /fɪmɪɡlaˈnɛftɪ/ /hɪlʊteɾaˈpʉːd/ /flɛtɛmɪŋɛˈɾoːf/ /dalabɛlˈhiːmɛ/

This also includes three single-parent households where data was available for that single parent only.

5 Note that there are no standardised procedures or clinical guidelines for assessing suspected language disorders in bilinguals in Sweden.
6 For a detailed description of the adaptation procedure, see Bohnacker et al. (2021).
7 Since the CL-Ara results were generally similar to the CL-Swe task, they are not reported here. For results on the CL-Ara task, see Öberg (2020).
8 For one seven-year-old, data was missing, and for one five-year-old, AoO to Arabic was at age 1.
9 Descriptive statistics are reported here for age groups so that they can be used as reference data. For comparisons between the two languages (Arabic and Swedish), modalities (comprehension and production), and age groups, see Öberg (2020) and Bohnacker et al. (2021).
10 Descriptive statistics are reported here for age groups so that they can be used as reference data. Age-group comparisons are available in Öberg (2020).
11 The low performance on the LS-Swe task is also noteworthy from another perspective. This is the only NWR task for which published reference data exists (for monolingual Swedish children aged 4–6, N = 200, Radeborg et al. 2006). Compared to the monolingual reference data, the proportion of the bilinguals in the present study that scored 1 SD or more below the mean was 30% (22/73). This is twice as many as expected if performance were the same for monolinguals and bilinguals.
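
The comparison in note 11 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the counts (22 of 73) come from the note, whereas the assumption that the monolingual reference scores are approximately normally distributed is ours and is not stated in the source. Under that assumption, roughly 16% of children would be expected to score 1 SD or more below the mean.

```python
from statistics import NormalDist

# Counts taken from note 11: 22 of the 73 bilingual children scored
# 1 SD or more below the monolingual reference mean on the LS-Swe task.
n_low, n_total = 22, 73
observed = n_low / n_total

# Assumption (ours, not from the source): reference scores are roughly
# normally distributed, so the expected share at or below -1 SD is the
# standard normal cumulative probability at -1.
expected = NormalDist().cdf(-1.0)

# How many times larger the observed share is than the expected share.
ratio = observed / expected

print(f"observed share: {observed:.1%}")
print(f"expected share: {expected:.1%}")
print(f"observed / expected: {ratio:.2f}")
```

The resulting ratio is close to two, which is in line with the statement that about twice as many bilinguals as expected fell below this threshold.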

Figure 2. Accuracy (% correct responses) for items in LS-Swe and NWRT-Leb by number of syllables and presence of consonant clusters.


Figure 3. Scatterplots showing age-adjusted z-scores of (a) Arabic and Swedish vocabulary comprehension and (b) Arabic and Swedish vocabulary production of the children in the DLD sample (triangles and labels) compared to the children in the TD sample (circles). Dashed lines at −1.25.

Figure 4. Scatterplots showing age-adjusted z-scores of the (a) LS-Swe, (b) CL-Swe and (c) NWRT-Leb tasks of the children in the DLD sample (triangles and labels) compared to the children in the TD sample (circles). Dashed lines at −1.25.


Table 1. Participants in the TD sample. Number of participants, sex, mean age (years; months) and age range (years; months) per age group.

Table 2. Age at testing, age of onset for Arabic and Swedish, daily exposure, Arabic variety, (pre)school type, and diagnosis for the children in the DLD sample.

Table 3. Means, standard deviations (SD), and ranges for each CLT vocabulary task by age groups and total scores. Maximum score for all tasks = 60 points.

Table 6. Mean scores, standard deviations (SD), ranges, mean accuracies in %, and SDs for each NWR task by age groups and total scores.

Table 7. Accuracy (% correct responses) by task and number of syllables for all items, items with clusters and items without (w/o) clusters.

Note. Logistic mixed-effects regression model with random effects: random intercepts for participant and test item. Model fit with maximum likelihood (Laplace approximation). The reference level for categorical variables is the first category. *** p < 0.001, ** p < 0.01, * p < 0.05.
Note. Logistic mixed-effects regression model with random effects: random intercepts for participant and test item, and by-participant random slopes for task. Model fit with maximum likelihood (Laplace approximation). The reference level for categorical variables is the first category. *** p < 0.001, ** p < 0.01, * p < 0.05.
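
As a reading aid, the random-effects structures named in the two notes can be written out in general form. This is a sketch under assumptions: all symbols are introduced here for illustration and do not appear in the source. $y_{ij}$ is the correct (1) or incorrect (0) response of participant $i$ on test item $j$, $\mathbf{x}_{ij}$ collects the fixed-effect predictors (left generic, since the notes do not list them), $u_i$ and $w_j$ are the participant and item intercepts, and $\mathbf{z}_{ij}$ codes the task of the observation. Normally distributed random effects are the conventional assumption for models of this kind and are not confirmed by the notes.

```latex
% First note: random intercepts for participant i and test item j
\operatorname{logit} P(y_{ij} = 1)
  = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_i + w_j,
\qquad u_i \sim \mathcal{N}(0, \sigma_u^{2}),
\quad  w_j \sim \mathcal{N}(0, \sigma_w^{2})

% Second note: as above, plus by-participant random slopes s_i for task
\operatorname{logit} P(y_{ij} = 1)
  = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_i
    + \mathbf{z}_{ij}^{\top}\mathbf{s}_i + w_j
```

The only difference between the two specifications is the term $\mathbf{z}_{ij}^{\top}\mathbf{s}_i$, which allows the effect of task to vary between participants.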

Table 11. Vocabulary comprehension (comp), vocabulary production (prod), and NWR scores (raw scores and age-adjusted z-scores) for children in the DLD sample.