Acoustic Cues to Automatic Identification of Phrase Boundaries in Lithuanian: A Preparatory Study

Kalašinskaitė-Zavišienė, Eidmantė; Raškinis, Gailius; Kazlauskienė, Asta

doi:10.3390/languages10080192

by Eidmantė Kalašinskaitė-Zavišienė ¹,*, Gailius Raškinis ² and Asta Kazlauskienė ¹

¹ Department of Lithuanian Studies, Vytautas Magnus University, LT-44248 Kaunas, Lithuania
² Department of System Analysis, Vytautas Magnus University, LT-44248 Kaunas, Lithuania
* Author to whom correspondence should be addressed.

Languages 2025, 10(8), 192; https://doi.org/10.3390/languages10080192

Submission received: 7 May 2025 / Revised: 10 July 2025 / Accepted: 28 July 2025 / Published: 14 August 2025

Abstract

This study investigates whether specific acoustic features can reliably indicate phrase boundaries for automatic detection. It includes (1) an analysis of acoustic markers at the end of prosodic units (intonational phrases, intermediate phrases, and words) and (2) the evaluation of these features in an automatic boundary detection algorithm. Data were drawn from professionally and expressively read speech (893 words), news broadcasts (732 words), and interviews (361 words). Key features analyzed were pause duration, final sound lengthening, intensity, and F0 changes. Findings show that pauses and their duration are the most consistent indicators of phrase boundaries, especially at intonational phrase ends. Final sound lengthening and reductions in intensity and F0 also contribute but are less reliable for intermediate phrases. In automatic detection, phonetic cues predicted 69% of the phrase boundaries assigned by phoneticians. Read speech yielded better results than spontaneous speech. Among the features, pause presence and length were the most reliable, while F0 and intensity changes played a minor role.

Keywords:

Lithuanian; intonation phrase; intermediate phrase; pause; lengthening of the final sound; intensity; F0

1. Introduction

Speech is realized in larger or smaller fragments that form a hierarchical prosodic structure, which is relatively consistent across languages (Beckman & Pierrehumbert, 1986). However, structural similarities do not imply identical production or processing.

The prosodic units that make up speech are organized hierarchically, ranging from the largest to the smallest as follows: intonational phrase (IP), intermediate (or phonological) phrase (ip), phonological word (w), foot, syllable, and mora (Féry, 2017). The first three units, known as subphrasal units, are associated with morphosyntactic structures though not determined by them. These prosodic constituents can be partially described in syntactic terms, but not all syntactic information is relevant to the prosodic structure of a language. Analyses of the syntax–phonology interface have intensified over the past few decades (see, e.g., Nespor & Vogel, 1986; Inkelas & Zec, 1990; Elordieta, 2008; Truckenbrodt, 2007; Selkirk, 1984, 2014; Elfner, 2018; Frota & Vigário, 2018, and the literature cited therein). Researchers argue that syntactic and phonological representations are distinct, and that their interaction gives rise to prosodic structure. While syntactic information may play a role, phonological phenomena are considered primary in the description of prosodic organization. It is the prosodic structure that directly determines the relevant acoustic cues.

Appropriate phrasing is essential to fluent speech, whether natural or synthesized. However, one of the greatest challenges in analyzing and describing intonation and in developing large-scale speech corpora is the identification of phrase boundaries. The juncture between prosodic units can be signaled by tonal cues (e.g., Liberman & Pierrehumbert, 1984; Jun, 1998; Frota et al., 2007; Petrone et al., 2017), durational parameters (pauses, lengthening of final segments, etc.; e.g., Kohler, 1983; Liberman & Pierrehumbert, 1984; Jun, 1998, 2003; Krivokapić, 2007; Michelas & D’Imperio, 2012; Paschen et al., 2022; Harrington Stack & Watson, 2023; Steffman et al., 2024), intensity changes (Wagner & Watson, 2010), and articulatory clarity (amplitude reduction at the end of a phrase and increase at its beginning, glottalization, non-modal phonation, etc.; e.g., Kohler, 1994; Fougeron & Keating, 1996; Kuzla et al., 2007). For a concise yet highly informative overview of cross-linguistic studies, see Ip and Cutler (2022). Research on phrase boundary cues in Lithuanian is still in its early stages. A few preparatory or survey studies (Kundrotas, 2012, 2018; Kazlauskienė & Sabonytė, 2018; Dereškevičiūtė & Kazlauskienė, 2022; Kazlauskienė & Dereškevičiūtė, 2022a, 2022b; Melnik-Leroy et al., 2022) have laid the groundwork for more detailed research. Kazlauskienė et al. (2023, pp. 112–155), after examining a 23 h annotated speech corpus, concluded that pauses are significant indicators of phrase boundaries: 65% of intermediate phrases and 94% of intonational phrases are separated by a pause. Both phrase types predominantly end in a low tone (IP: 75%; ip: 63%) or a descending tone (IP: 19%; ip: 20%). Interestingly, only 44% of intonational phrases exhibit a decrease in intensity, whereas the majority of intermediate phrases (87%) end with unchanged intensity. Additionally, 65% of intonational phrases end with a lengthened final sound, whereas this phenomenon is less frequent in intermediate phrases (44%).

This article presents a part of a broader study aimed at identifying and describing the prosodic cues of phrase boundaries in Lithuanian, as well as developing recommendations for corpus annotators and principles for automatic phrase boundary detection. The research phase described in this article aims to examine whether selected acoustic features can serve as reliable indicators of phrase boundaries for automatic detection. To achieve this goal, the study was divided into two parts: (1) the analysis of the acoustic features at the ends of prosodic units (intonational and intermediate phrases, prosodic words) in recordings annotated by phoneticians, and (2) the application of the analyzed features in automatic phrase boundary detection and evaluation of the results. Not all potential acoustic features mentioned in other studies were analyzed; this stage of the research focuses on pauses, duration of the final sound of a word, intensity, and fundamental frequency (F0) changes at the end of words. The choice of these features was informed by the findings of previous research on phrasing in Lithuanian (Kazlauskienė et al., 2023).

2. Materials and Methods

2.1. Audio Recordings and Annotations

The empirical material of the study consists of recordings made by one professional male speaker (51 years old, speaking standard Lithuanian): National Dictation recordings from three years (2017, 2018, and 2024), three news broadcast recordings, and one interview. The National Dictation is a public event in which anyone can participate by writing down a dictated text at a predetermined time. Altogether, the analyzed recordings amounted to 25 min and 2 s. All recordings are publicly available online and were not produced specifically for this study, though the author of the recordings provided written permission to use them. The dictation recordings were chosen because they represent clearly enunciated, carefully read speech, where every word is articulated distinctly to ensure that the listener can easily understand the content and phrasing. Additionally, the dictations offer a wide variety of syntactic structures and sentence types, which allows for the exploration of the relationship between individual units and phrasing, and their interaction within a broader structural context.

In the news recordings, the speaker also reads from a script, but the goal is to convey as much information as possible within a limited time, resulting in a much faster speaking rate. These recordings have been used in previous studies on intonation and other aspects of speech (Kazlauskienė et al., 2023), and the text has already been time-aligned with the audio.

In the interview, the speaker talks spontaneously, which leads to a more varied speech rate, including both faster and slower segments. Fillers and hesitation phenomena may occur, and phrasing is not influenced by written language conventions such as punctuation marks.

For this study, text annotations were also time-aligned with the interview and dictation recordings. Textual annotations were prepared to mark phrase boundaries, as well as various segmental and prosodic phenomena (e.g., stress, word linking). In annotating the text, IPs were identified primarily on the basis of perceptually salient psychoacoustic cues, including pauses, final-sound lengthening, changes in F0 and intensity, the presence or absence of assimilation, degemination, and accommodation of adjacent sounds, as well as qualitative and articulatory changes (e.g., strengthening or weakening of articulation). Syntactic and semantic considerations were also taken into account, albeit to a lesser extent.

In read Lithuanian speech, phrases do not always correspond to full sentences. The number of IPs identified was 152 in the dictation recordings, 67 in the news recordings, and 80 in the interview.

When marking ips, all psychoacoustic, semantic, and syntactic criteria were considered equally important. In cases of uncertainty, psychoacoustic perception was treated as the decisive factor. Intermediate phrases that occurred at the end of IPs were not counted separately, as they were considered prosodically indistinct from the IP boundary. Consequently, they were analyzed as forming the terminal part of IPs, rather than as separate ips. The number of independently marked ips (excluding those integrated into IP-final positions) was as follows: 276 in the dictation recordings, 243 in the news recordings, and 83 in the interview.

The textual annotations were initially carried out independently by two phoneticians and subsequently merged into a unified annotation. To assess the level of subjectivity in this unified version, an inter-annotator agreement analysis was performed. Two additional phoneticians independently annotated a subset of the corpus—specifically, a National Dictation recording consisting of 333 words. Each word was labeled as either phrase-internal (0), ip-boundary (1), or IP-boundary (2). The inter-annotator agreement, measured using Krippendorff’s alpha (Krippendorff, 2019), yielded a value of 0.88, indicating strong consistency among annotators and suggesting minimal risk of impressionistic bias during annotation.
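As an illustration of the agreement measure mentioned above, the following is a minimal sketch of Krippendorff's alpha for nominal labels, computed from a coincidence matrix. The function name and the toy labels are hypothetical and serve only as an example; they are not the annotations or the software used in the study.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of per-word label collections, one label per annotator
    (None marks a missing label). Words with fewer than two labels are skipped.
    """
    coincidences = Counter()
    for labels in units:
        labels = [x for x in labels if x is not None]
        m = len(labels)
        if m < 2:
            continue
        # Every ordered pair of labels from different annotators adds 1 / (m - 1).
        for a, b in permutations(range(m), 2):
            coincidences[(labels[a], labels[b])] += 1 / (m - 1)

    totals = Counter()
    for (c, _k), value in coincidences.items():
        totals[c] += value
    n = sum(totals.values())

    observed = sum(v for (c, k), v in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k] for c in totals for k in totals if c != k)
    if expected == 0:
        return float("nan")  # undefined when only one category occurs
    return 1 - (n - 1) * observed / expected


# Hypothetical labels for eight words: 0 = phrase-internal, 1 = ip boundary, 2 = IP boundary.
annotator_a = [0, 0, 1, 0, 2, 0, 1, 2]
annotator_b = [0, 0, 1, 0, 2, 0, 0, 2]
alpha = krippendorff_alpha_nominal(list(zip(annotator_a, annotator_b)))
print(f"alpha = {alpha:.2f}")
```

Treating the three labels as nominal categories is an assumption of this sketch; the text does not state which distance metric was used, and an ordinal metric would give a different value.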

For the purposes of this study, the text annotations were also time-aligned with the interview, news, and dictation recordings. Time alignment was performed automatically using a Lithuanian speech recognizer based on the HTK toolkit (Young et al., 2006) operating in forced alignment mode. During this process, pauses were automatically detected and labeled. Pauses were defined as acoustic segments with a minimum duration of 30 ms, which could optionally occur at the end of any word. All time alignments were subsequently reviewed and refined by two phoneticians, who analyzed the alignments separately in two successive passes.

Below is an example of one sentence from a dictation marked with IP and ip boundaries (‘The sharp-toothed little monkeys were gliding across the surface of the water, drawing ever closer.’):

((Tos aštriadantės beždžionėlės)_ip	(čiuožė vandens paviršiumi)_ip	(vis artyn)_ip)_IP
((²ˈtoːs ɐʃʲtʲrʲɛ²ˈdɐnʲˑtʲeːs bʲɛʒʲʤʲo̟ː²ˈnʲeːlʲeːs)_ip	(²ˈʧʲu̟ɔʒʲeː ʋɐnʲ²ˈdʲɛnˑs pɐ²ˈʋʲɪrʲˑʃʲʊ̟mʲɪ)_ip	(ˈʋʲɪs ɐrʲ²ˈtʲiːn)_ip)_IP

Figure 1 presents a Praat excerpt of the sentence. The first tier displays sounds, the second shows the orthographic transcription, the third shows ip boundaries, and the fourth indicates IP boundaries. The visualization reveals that the intermediate phrases are separated by pauses of varying length, accompanied by a lowering of F0 and intensity at the phrase boundary. Additionally, phrase-final segments exhibit durational lengthening: for example, of the final [s] sounds in aštriadantės and beždžionėlės, two words of comparable length, the one at the end of the ip (beždžionėlės) is longer.

2.2. Analysis of Acoustic Markers at the End of Prosodic Units

For the first part of the study, the following measurements were taken: the number and duration of pauses, and the average F0 and intensity of (a) all voiced segments in the phrase, (b) the syllable nuclei of the phrase-final word, and (c) the nucleus of the final syllable of that word. In addition, the duration of the sounds [ɪ], [ɐ], [oː], and [s] was measured in different positions: as the final sound of a word in the middle of a phrase and at the end of a phrase. These sounds were chosen because they were the only ones that occurred in all analyzed positions across the entire dataset.

The Praat software was used to manually measure all the aforementioned acoustic parameters (Boersma & Weenink, 2018). Statistical analyses were performed using SPSS v30.

For all parameters except intensity, the ratios in this study were calculated by dividing one measured value by another. For example, if the average F0 of the final syllable in a phrase was 80 Hz and the average F0 of all syllable nuclei in the phrase was 125 Hz, the resulting ratio was 0.64 (80/125 = 0.64). For intensity, which is measured on a logarithmic scale, the difference in decibels was calculated first and then converted into a ratio. For instance, if the phrase-final intensity was 55.7 dB and the average phrase intensity was 61.1 dB, the ratio was calculated as: 10^{(55.7−61.1)/10} = 0.288.
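The two calculations can be reproduced with a short sketch. The function names are hypothetical; the input values are the ones given in the worked examples above.

```python
def simple_ratio(final_value, phrase_value):
    """Ratio of a phrase-final measurement to the phrase-level average (e.g., F0 in Hz)."""
    return final_value / phrase_value


def intensity_ratio(final_db, phrase_db):
    """Convert a decibel difference into a ratio: 10 ** (difference_in_dB / 10)."""
    return 10 ** ((final_db - phrase_db) / 10)


print(round(simple_ratio(80, 125), 2))        # 0.64
print(round(intensity_ratio(55.7, 61.1), 3))  # 0.288
```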

2.3. Automatic Boundary Detection Algorithm

An attempt was made to develop a computational model for automatic prosodic phrasing—i.e., a model that, given an audio signal along with time-aligned phone-level and word-level text annotations, can automatically identify which word boundaries correspond to intermediate and intonational phrase boundaries. To this end, a dataset comprising 1990 instances was constructed, each representing a word-final boundary, utilizing the same set of recordings analyzed in the initial phase of the study. These instances were categorized into three classes: word-final boundaries coinciding with intonational phrase boundaries (15%), those aligning with intermediate phrase boundaries (30%), and regular word-final boundaries (55%).

The feature set included fundamental acoustic parameters examined previously, such as absolute pause duration, absolute duration of the word-final sound, and relative changes in intensity and fundamental frequency (F0) of the final syllable in comparison to the entire word. In addition, derived features were incorporated, including intensity and F0 ratios relative to their global averages, as well as the tempo-adjusted duration of the word-final sound. Tempo adjustment was calculated based on the average duration of the same sound type across all word-final positions. Missing feature values were imputed using the mean of the corresponding feature.
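The two preprocessing steps described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say whether tempo adjustment divides by or subtracts the per-sound average, so division is assumed here, and the field names `sound` and `dur_ms` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def tempo_adjust(instances):
    """Divide each word-final duration by the mean duration of the same sound type.

    Assumption: adjustment is a ratio to the per-sound mean over all
    word-final positions in the given data.
    """
    by_sound = defaultdict(list)
    for inst in instances:
        by_sound[inst["sound"]].append(inst["dur_ms"])
    means = {sound: mean(durs) for sound, durs in by_sound.items()}
    return [inst["dur_ms"] / means[inst["sound"]] for inst in instances]


def impute_mean(values):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical word-final sounds with durations in milliseconds.
finals = [
    {"sound": "s", "dur_ms": 100},
    {"sound": "s", "dur_ms": 300},
    {"sound": "ɐ", "dur_ms": 60},
]
print(tempo_adjust(finals))
print(impute_mean([0.8, None, 1.2]))
```

A value above 1 then marks a word-final sound that is longer than is typical for that sound type, independently of how long the sound is on average.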

Given the dataset’s size, classical machine learning approaches were employed to derive classification rules for prosodic boundary identification, as contemporary deep learning techniques typically necessitate substantially larger datasets for effective training. The algorithms explored included decision tree induction (Quinlan, 1993), rule induction (Cohen, 1995), instance-based learning (Aha & Kibler, 1991), multilayer perceptron (Rumelhart et al., 1986), multinomial logistic regression (McCullagh & Nelder, 1989), and logistic model trees (LMT) (Landwehr et al., 2005). The implementations of these algorithms were sourced from the WEKA software suite (Hall et al., 2009).

Among the evaluated methods, logistic model trees demonstrated superior performance, as determined by 20-fold cross-validation. Consequently, all subsequent confusion matrices presented in Section 3.2 of this paper are derived from models constructed using the LMT algorithm.
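For readers unfamiliar with the procedure, k-fold cross-validation can be sketched as below: the data are split into k parts, and each part serves once as the test set while the remaining parts are used for training. This is a generic illustration with a hypothetical function name; the evaluation procedure built into WEKA may differ in detail (for instance, in how instances are shuffled or distributed across folds).

```python
import random


def k_fold_indices(n_instances, k=20, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# With the 1990 word-final boundaries of this study and k = 20,
# each test fold holds 99 or 100 instances.
for train, test in k_fold_indices(1990, k=20):
    assert len(train) + len(test) == 1990
```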

To assess potential overfitting, the trained LMT model was evaluated on held-out, out-of-domain, multi-speaker data. The evaluation dataset was selected from a 23 h speech corpus previously used in intonation research (Kazlauskienė et al., 2023). Two radio-broadcast drama performances were randomly chosen from this corpus, featuring three male and two female speakers, yielding a total of 1800 test instances (word boundaries). Some audio excerpts were extremely short, comprising only a single utterance, which precluded the calculation of reliable global average estimates for intensity and F0 normalization. Additionally, the speech articulation rate in this dataset ranged from 6 to 22 sounds per second, extending beyond the range observed in the training data (12–17 sounds per second). The model achieved an average accuracy of 75.3% on the held-out data, with individual speaker accuracies ranging from 65.8% to 84.4%. These results suggest that the model generalizes reasonably well, despite the domain mismatch.

3. Results

3.1. Acoustic Features at Phrase Boundaries

Number and duration of pauses. While reading the dictation, the speaker paused after all IPs and after 84% of ips (see Table 1). In the news readings, pauses followed 84% of IPs and 13% of ips. In spontaneous speech, pauses followed 89% of IPs and 29% of ips. In all recordings, some pauses also occurred within phrases; however, such interruptions were rare: only 5% in the dictations, 3% in the news recordings, and 2% in the interview. These may be actual phrase boundaries that were not marked during annotation because of very tight syntactic or semantic connections between the word groups.

The data were also statistically evaluated. To determine whether there was a statistically significant relationship between phrase boundaries and the distribution of pauses, a Chi-square test (McHugh, 2013) was conducted. The results showed that in all cases, the relationship between phrase boundaries and pauses was statistically significant (p < 0.05). Although pauses followed only 13% of ips in the news recordings, the Chi-square test still indicated a significant association, confirming that pauses at these boundaries were not random.

The average pause durations show that the longest pauses between IPs and ips occurred while reading the dictations (averages of 910 ms and 266 ms, respectively). On average, pauses between IPs were 3.4 times longer than pauses between ips. In the interview, pauses between IPs were about half as long as those in the dictation recordings, with an average duration of 478 ms. Pauses between ips were also shorter, but only slightly: their average of 234 ms is about 0.9 times the dictation value. In the interview, pauses after IPs were, on average, about twice as long as pauses between ips. The shortest pauses occurred while reading the news. Here, the average duration of pauses between IPs was 438 ms, and between ips it was only 156 ms, so pauses between IPs were, on average, 2.8 times longer than those between ips. The large standard deviations observed indicate that the pause durations vary greatly, and in some cases, the duration of pauses between IPs and ips may overlap. Therefore, the relationship between pause duration and phrase type should always be interpreted cautiously.

Figure 2 further illustrates the variation in pause duration by showing the distribution of phrase-final pauses, with pauses after IPs consistently longer than those after ips, and the greatest variability observed in dictations.

To determine whether the differences in pause durations were statistically significant, the Mann–Whitney U test (Mann & Whitney, 1947) was used to compare the pause durations between IPs and ips in all recordings. In dictation, news, and interview recordings, it was found that the pause durations between IPs and ips differ significantly (p < 0.001, p < 0.001, p = 0.0003, respectively).
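The logic of this rank-based comparison can be sketched as follows. The statistical analyses in the study were run in SPSS; the function below is only an illustration that uses the normal approximation without tie or continuity correction, and the pause durations are hypothetical values, not data from the recordings.

```python
import math


def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value from the normal approximation)."""
    pooled = sorted((value, group) for group, sample in enumerate((x, y)) for value in sample)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of the tied ranks i+1 .. j
        rank_sum_x += average_rank * sum(1 for _, g in pooled[i:j] if g == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, p


# Hypothetical pause durations in milliseconds.
ip_pauses = [120, 150, 180, 210, 260]
IP_pauses = [400, 450, 520, 610, 900]
u, p = mann_whitney_u(ip_pauses, IP_pauses)
print(f"U = {u}, p = {p:.4f}")
```

For small samples or many tied values, an exact test or a tie-corrected variance would be preferable to this approximation.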

The results indicate that pauses are a clear marker of phrase boundaries, although their occurrence may depend on the type of phrase and speech. While reading the dictation, the speaker always separates IPs with pauses, and there are many pauses between ips as well. In the news recordings, pauses are both less frequent and shorter. This is influenced by the speech rate: the articulation rate was 17 sounds per second in the news, 15 in the interview, and 12 in the dictations. Speech rate is measured here in sounds rather than in words or syllables because a Lithuanian syllable can contain from one to seven sounds (Kazlauskienė, 2023), e.g., išskleisk [ɪ.²ˈsʲklʲɛɪˑsk], ‘spread’. In the interview, pauses may also arise from the spontaneity of speech: when speech is not prepared in advance, the speaker needs time to think about what to say. These findings align with the broader observation that the use of pausing to signal prosodic boundaries varies across languages. For instance, in German, intonational phrase boundaries are only occasionally marked by pauses (Kohler et al., 2017, as cited in Ip & Cutler, 2022), while in Mandarin, pausing is a much more frequent cue (Wang et al., 2019, as cited in Ip & Cutler, 2022).
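A rate in sounds per second can be computed as in the sketch below. It assumes that pause time is excluded from the denominator, which is the usual definition of articulation rate but is not stated explicitly in the text; the function name and the numbers are hypothetical.

```python
def articulation_rate(n_sounds, total_s, pause_s=0.0):
    """Sounds per second of speaking time, with pause time excluded (assumption)."""
    return n_sounds / (total_s - pause_s)


# Hypothetical stretch of speech: 120 sounds in 12 s, of which 2 s are pauses.
print(articulation_rate(120, 12.0, 2.0))  # 12.0
```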

Intensity. In the dictations and in spontaneous speech, intensity decreased in a similar way at the end of IPs and ips. Compared to the overall intensity of the IP, the last word’s intensity decreased by 50% in the dictations and 40% in the interview. The decrease in intensity of the last word of the ip was minor, around 10% in both types of recordings. The intensity of the last syllable, compared to the overall intensity of the IP, decreased by 70% in both types of recordings, while the decrease in intensity of the last syllable of the ip was much smaller: 30% in the dictations and 40% in the interview (see Table 2).

While reading the news, the intensity of the final word in an IP decreased by 30%, and the final syllable’s intensity was reduced by half. This shows a pattern similar to that observed in other types of recordings, although the decrease in intensity was somewhat smaller. However, the intensity of the final syllable of ip was about 10% higher than that of the entire phrase and its final word. This suggests that when reading the news, the speaker does not reduce intensity between ips—likely due to the relatively fast pace of speech—resulting in a somewhat flatter intonation pattern. A similar pattern of inconsistency in intensity marking phrase boundaries has been observed in studies on other languages; for example, Kim et al., 2004 (as cited in Wagner & Watson, 2010) found that stronger boundaries were sometimes—but not consistently—associated with lower pre-boundary intensity across speakers.

Figure 3 illustrates this pattern across all speech types: intensity consistently decreases in the final syllable of IPs, while the final syllable of ips often retains or even exceeds the intensity of the overall phrase or final word. This trend is most pronounced in the news recordings, where intensity remains relatively high between intermediate phrases, in contrast to the more marked decline observed in dictations and interviews.

The Kruskal–Wallis test (Kruskal & Wallis, 1952) showed that differences in intensity between IPs and their final syllables were statistically significant across all recordings (p < 0.05). When comparing the final word of these phrases with the final syllable, a statistically significant difference was found only in the news recordings; in the dictation and interview recordings, the intensity differences between these elements were not statistically significant (p = 0.28 and p = 0.08, respectively). When comparing the intensity of ips, a statistically significant difference was observed in only one position: in the news recordings, between the final word of the phrase and its final syllable. In all other cases, the differences were not statistically significant. This suggests that, in most cases, the intensity of the final word and that of its final syllable do not differ significantly.

Fundamental frequency (F0). F0 may either fall or rise at the end of a phrase. A falling F0 is typical of declarative statements, while a rising F0 often signals a question or an unfinished thought. For this part of the study, we analyzed only phrases exhibiting falling F0 contours, as those with rising contours often require detailed semantic and syntactic analysis. While such cases were noted, they were not examined in depth and are left for future investigation. Thus, this section analyzes the phrases with falling F0: 125 IPs (82%) and 162 ips (59%) from the dictation recordings, 34 IPs (51%) and 131 ips (54%) from the news recordings, and 49 IPs (61%) and 41 ips (49%) from the interview recordings.

Additionally, in 3% of the dictation and news recordings and 10% of the interview recordings, the final F0 of IPs could not be determined due to creaky voice. Similarly, the F0 at the end of 1% of ips in the dictation and interview recordings was also unmeasurable for the same reason. These instances also indicate the end of a phrase; however, they are excluded from this study.

The data from the analyzed material (see Table 3) show that the F0 of the final syllable is, on average, one-fifth lower than that of the entire IP or ip. In the dictations and news recordings, the F0 of the final syllable in IPs decreased by 30% compared to the entire phrase (differences in Hz are shown in Table 3), while in the interview recordings it decreased by 20%. The F0 of the final syllable in ips decreased by 20% in dictations and by 10% in both the news and interview recordings. However, the data on the F0 of the final word compared to the entire phrase are less consistent. In IPs, the F0 of the final word decreased by 20% in dictations and by 10% in interviews and showed no change in the news recordings. Similarly, no F0 change was observed in the final word of ips. Additionally, F0 in the news recordings was noticeably higher than in dictations or spontaneous speech. In the read speech recordings, the F0 of ips—whether of the entire phrase, the final word, or the final syllable—was higher than that of IPs.

Statistical analysis using the Kruskal–Wallis test revealed that in the dictation recordings, the F0 of IPs differed significantly when comparing the entire phrase and the final word with the final syllable (p < 0.05). In ips from the dictations, the F0 of the entire phrase also differed significantly from that of the final syllable (p < 0.05). However, the difference between the F0 of the final word and its final syllable was not statistically significant (p = 0.09). In the news recordings, the differences in F0 between the final word and the final syllable were not statistically significant for either IPs or ips (p = 0.25 and p = 0.05, respectively). In the interview recordings, none of the compared F0 differences were statistically significant (for IPs: phrase vs. syllable, p = 0.05; word vs. syllable, p = 0.08; for ips: phrase vs. syllable, p = 0.37; word vs. syllable, p = 0.56).

As shown in Table 4, an examination of phrases with rising F0 contours reveals that in the dictation recordings, the F0 of the final word and its final syllable in an IP is lower than that of the entire phrase. However, the F0 of the final syllable is slightly higher than that of the final word as a whole, although the difference is minimal. In all other cases (both IPs and ips), the F0 of the final word is either slightly higher than or comparable to the overall phrase F0, while the F0 of the final syllable increases on average by about 20%. These cases involving rising F0 contours warrant more detailed analysis, which is planned for future research.

The boxplot in Figure 4 includes both falling and rising F0. It shows that F0 tends to decrease at the end of IPs, especially in dictations and news recordings, while ips—particularly in interviews—often maintain or even show a slight increase in final-syllable F0. This reflects the influence of rising contours and indicates that final lowering is not consistent across all phrase types and speech styles.

These patterns correspond to some previous observations summarized by Yuan and Liberman (2014), who argue that F0 declination rate is shaped by speaking style and sentence type. Read speech has been shown to display steeper and more frequent F0 declination than spontaneous speech, with greater control over the declination slope in read contexts (Laan, 1997; Lieberman et al., 1984; Tøndering, 2011, as cited in Yuan & Liberman, 2014). Sentence modality also plays a role: declaratives typically exhibit the steepest falling contours, while syntactically unmarked questions and non-terminal utterances show flatter or rising intonation (Thorsen, 1980, as cited in Yuan & Liberman, 2014). Thus, instances with rising F0 require more detailed syntactic and semantic interpretation, which will be undertaken in future work.

Final Sound Duration. In relevant positions—that is, at the end of a word in the middle of a phrase and at the end of both types of phrases—the following sounds were observed: [s], [ɪ], [ɐ], and [oː]. However, their frequency of occurrence was not equal; in some positions, only two or three instances were found. Nevertheless, general trends can still be identified.

Although the sounds were expected to lengthen the most at the end of IPs, the results only partially confirm this hypothesis (see Table 5). However, one consistent pattern is quite evident: in all cases, the final sound of any phrase is longer than the final sound of a word located in the middle of a phrase. In the dictation recordings, the final sound of both IPs and ips was, on average, 1.8 times longer than the corresponding final sound of a word in the middle of a phrase. In the interviews, the final sound of IPs was on average 1.9 times longer, and of ips 1.7 times longer than the respective medial word-final sound. In the news recordings, the final vowel of IPs was on average 1.4 times longer, and the final consonant 2.3 times longer. In contrast, the final sound of ips was 1.3 times longer than the corresponding sound in phrase-medial position. Noticeable differences in sound duration were observed between the dictation and news recordings, which can be attributed to the much faster reading pace in the news. Here, sounds are shorter by a third or even by half compared to those in the dictation recordings.

Another Kruskal–Wallis test was conducted to determine whether the differences in the duration of the final sounds are statistically significant. The results show that in read speech, the duration of all sounds differs significantly when comparing the duration of a final sound in a word located in the middle of a phrase with the final sound of either phrase type (p < 0.05). However, when comparing the duration of final sounds between IPs and ips, a statistically significant difference was found only for the sound [s] (p < 0.05).

In the interview recordings, a statistically significant difference was observed only for the consonant [s] and the vowel [ɐ] when comparing phrase-medial position with either phrase-final position (p < 0.05). The duration differences in the other vowels were not statistically significant (for sound [ɪ], p = 0.12; for [oː], p = 0.41). Lithuanian has a phonological vowel length contrast (for more details on the phonological structure of Lithuanian, see Girdenis, 2014). Our analysis includes two short vowels and one long vowel. The results from the dictation and news recordings do not show a clear difference in the lengthening of these vowels. However, in the interview data, short vowels are lengthened more at the end of ip than at the end of IP, whereas the long vowel [oː] is lengthened more at the end of IP than ip. Nevertheless, we cannot determine whether phonological length contrasts remained stable (cf. significant findings in Paschen et al., 2022), as this was not the aim of our study. Future research should explore these issues further.

3.2. Reliability of Acoustic Features in Automatic Phrasing

Based on all the examined features, the overall accuracy of prosodic unit identification using logistic model trees (LMT) (Landwehr et al., 2005) reached 80%. However, the accuracy varied depending on the type of unit.

Table 6 presents confusion matrices of the annotated and predicted prosodic units, which help to clarify (1) how the target unit was predicted and with which other categories it was confused, and (2) which units were interpreted as the unit in question.

The row data provide the answer to the first question: of the 299 IP boundaries marked by annotators, 247 (83%) were correctly predicted as IPs, 44 were interpreted as ip, and 8 were marked as word-final boundaries. Of the 605 annotated intermediate phrase boundaries, 298 (49%) were correctly predicted, 38 were mispredicted as IPs, and as many as 269 were interpreted as word-final boundaries. Nearly all word-final boundaries were correctly predicted as such, with only 30 marked as ip and 5 as IP.

The answer to the second question is reflected in the column data: 290 instances were predicted as IP boundaries, of which 247 were annotated as IPs, 38 as ips, and 5 as word-final boundaries. A total of 372 instances were predicted as ip boundaries, of which 298 were annotated as ips, 30 as word-final boundaries, and 44 as IPs. Of the 1328 instances predicted as word-final boundaries, 1051 were annotated as such, 269 as ips, and 8 as IPs.
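The row-wise and column-wise readings correspond to what is commonly called recall and precision. The sketch below computes both from the entire-sample matrix in Table 6; the terms recall and precision, and all identifiers, are introduced here for illustration and are not used in the original analysis.

```python
LABELS = ("IP", "ip", "w")

# Entire-sample confusion matrix from Table 6.
# Rows: units marked by annotators; columns: predicted units.
CONFUSION = [
    [247, 44, 8],
    [38, 298, 269],
    [5, 30, 1051],
]


def recall(matrix, i):
    """Row-wise reading: share of annotated class i predicted as class i."""
    return matrix[i][i] / sum(matrix[i])


def precision(matrix, i):
    """Column-wise reading: share of predictions of class i annotated as i."""
    return matrix[i][i] / sum(row[i] for row in matrix)


for i, label in enumerate(LABELS):
    print(f"{label}: recall {recall(CONFUSION, i):.0%}, "
          f"precision {precision(CONFUSION, i):.0%}")
```

Read this way, ip has the lowest row-wise rate, while the column-wise rate for word-final predictions is reduced mainly by the 269 annotated ips that end up in that column.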

Thus, 69% of phrase boundaries marked by annotators were predicted as phrase boundaries. Still, the accuracy of identifying the specific phrase type (IP or ip) was lower—60%—due to some IPs being misclassified as ips and vice versa. The greatest confusion occurred within the ip group: as many as 45% of ip boundaries annotated by linguists were not predicted as phrase boundaries, and a small portion (6%) were misclassified as IPs. These findings prompted a separate evaluation of different types of recordings.
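The three summary figures (overall accuracy, boundary detection irrespective of type, and type-specific accuracy) follow directly from Table 6. A minimal sketch of the arithmetic, with illustrative function names that are assumptions rather than the authors' code:

```python
# Entire-sample confusion matrix from Table 6.
# Rows: annotated IP, ip, w; columns: predicted IP, ip, w.
CONFUSION = [
    [247, 44, 8],
    [38, 298, 269],
    [5, 30, 1051],
]
PHRASE = (0, 1)  # indices of the two phrase-boundary classes (IP, ip)


def overall_accuracy(m):
    """Share of all units whose predicted class equals the annotated class."""
    correct = sum(m[i][i] for i in range(len(m)))
    return correct / sum(map(sum, m))


def boundary_detection_rate(m):
    """Share of annotated phrase boundaries predicted as either IP or ip."""
    annotated = sum(sum(m[i]) for i in PHRASE)
    detected = sum(m[i][j] for i in PHRASE for j in PHRASE)
    return detected / annotated


def phrase_type_accuracy(m):
    """Share of annotated phrase boundaries predicted with the correct type."""
    annotated = sum(sum(m[i]) for i in PHRASE)
    return sum(m[i][i] for i in PHRASE) / annotated


for name, metric in [
    ("overall accuracy", overall_accuracy),
    ("boundary detected", boundary_detection_rate),
    ("correct phrase type", phrase_type_accuracy),
]:
    print(f"{name}: {metric(CONFUSION):.0%}")
```

The gap between the second and third figures is accounted for by the 44 IPs predicted as ips and the 38 ips predicted as IPs.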

In the news recordings, annotators marked only 67 instances of IP, which tend to be very long (cf. dictation recordings, where 152 IPs were marked, although the number of phonetic words differs only slightly: 893 in dictations vs. 732 in news recordings). A total of 82% of the IPs annotated by humans were predicted as IPs; nearly all of the remainder were classified as ips, typically when the phrases were very short, such as greetings.

However, only 30% of the annotated ips were predicted as ips, while as many as 67% were misclassified as word-final boundaries. This makes ips the most poorly predicted prosodic unit group in news recordings. Such low ip accuracy significantly lowers the overall phrase boundary identification rate: only 47% of the phrase boundaries annotated by humans were predicted as such.

News recordings are characterized by a particularly high speech rate: 17 sounds per second, compared to 15 in interviews and 12 in dictations (measuring articulation rate only, excluding pauses and filled pauses). As shown in the first part of the study, news recordings also contain the fewest pauses (see Table 1), which means that pausing, one of the main prosodic markers, is often absent. Moreover, final sounds in ips are lengthened the least in news recordings (see Table 5). News segments are read with relatively high pitch: the average F0 median in news is 108 Hz, compared to 84 Hz in interviews and 85 Hz in dictations. This somewhat limits the possible F0 variation, although the ratio of the average phrase F0 to the final syllable F0 does not differ significantly across the three types of recordings (see Table 3). News recordings also differ from the other recording types in terms of intensity patterns (see Table 2). While intensity significantly decreases at the end of ips in dictations and interviews, it remains the same or even slightly increases in news.

The annotators marked phrase boundaries based not only on the features analyzed in this paper (pause, lengthening of the final sound, F0, and intensity) but also on articulation-related features such as assimilation, degemination, the presence or absence of accommodation of adjacent sounds, qualitative characteristics of the sounds, and the strengthening or weakening of articulation. These features were not included in the current analysis because the functioning of Lithuanian sounds in connected speech remains under-researched. In preparing the recordings, the primary focus was on aligning phonetic words with their textual representations, and context-driven or prosodic modifications of sounds were largely disregarded. Since annotators relied on additional criteria when marking phrasing in the data, the automatic identification of ips in news recordings was unsuccessful.

Almost all (97%) word-final boundaries in news recordings were correctly predicted as such.

In dictation recordings, 86% of the IPs annotated by humans were predicted as IPs, while the remainder were labeled as ips. The analysis of these cases reveals that the pauses were somewhat shorter than those at typical IP boundaries, and other acoustic cues were less pronounced (for example, final sounds were lengthened less). However, in some cases, the final syllable was produced with a creaky voice, which usually signals an IP boundary in Lithuanian.

More ips were correctly predicted in dictations than in news—72%. A total of 8% of ips were classified as IPs, and in some cases, the acoustic cues indeed suggested IP boundaries, but due to syntactic and semantic considerations, annotators labeled them as ips. There were also a few cases where the final syllable was produced with a sharply rising F0, even though it was not a question. Since this pitch contour deviates from the typical pattern, such cases were classified as IPs during automatic analysis.

A total of 20% of the annotated ips were predicted as within-phrase word boundaries rather than as phrase-final boundaries. Analysis shows that these boundaries may have been cued not by the selected acoustic features but by other phonetic phenomena mentioned earlier in relation to news recordings, such as the absence of assimilation and degemination (processes typical of word boundaries within a phrase but not of phrase boundaries), articulatory clarity or weakening, and F0 resetting at the beginning of a new phrase. This last cue should be included in future versions of the algorithm. These mispredicted boundaries most often occurred between subject and predicate phrase groups.

In summary, 87% of phrase boundaries marked by annotators in dictation recordings were predicted as phrase boundaries in general, although the identification of the specific type (IP or ip) was slightly lower at 77%. Thus, phrase boundaries in read speech were predicted quite accurately based solely on the analyzed acoustic features and without accounting for syntactic and semantic factors.

As in the news recordings, nearly all (97%) word-final boundaries in dictations were predicted as such. A very small number were classified as ips, and in nearly all those cases, the acoustic features did suggest an ip, but syntactic and semantic factors led annotators not to mark them.

It is important to note that the dictation texts exhibit considerable syntactic complexity, including intricate sentence structures and numerous punctuation marks, all of which influenced the speaker’s phrasing. Annotators deliberated extensively over certain phrasing decisions. For example, homogeneous parts of a sentence are often read as an intermediate variant between IP and ip—that is, without pauses and with only minor changes in intensity and F0. In each case, the final decision was made after careful comparative analysis with similar examples.

In interviews, 76% of annotated IPs were predicted as IPs, but only 29% of ips were predicted as such, and 61% were labeled as word boundaries. This significantly reduced the overall phrase boundary identification accuracy—only 64% of annotator-marked boundaries were predicted as phrase boundaries. As in read speech, nearly all (94%) word-final boundaries were correctly predicted.
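The per-recording percentages quoted above can be reproduced from the three right-hand matrices in Table 6. The sketch below focuses on the fate of annotated ips and on boundary detection regardless of type; identifiers are illustrative assumptions.

```python
# Per-recording confusion matrices from Table 6.
# Rows: annotated IP, ip, w; columns: predicted IP, ip, w.
BY_RECORDING = {
    "news": [[55, 11, 1], [7, 73, 163], [1, 11, 411]],
    "dictations": [[131, 21, 0], [23, 198, 55], [0, 12, 453]],
    "interview": [[61, 12, 7], [8, 24, 51], [4, 7, 187]],
}


def ip_outcomes(m):
    """Percentages of annotated ips predicted as IP, ip, and w."""
    row = m[1]
    return tuple(round(100 * n / sum(row)) for n in row)


def boundary_detection(m):
    """Percentage of annotated IP/ip boundaries predicted as IP or ip."""
    annotated = sum(m[0]) + sum(m[1])
    detected = m[0][0] + m[0][1] + m[1][0] + m[1][1]
    return round(100 * detected / annotated)


for recording, matrix in BY_RECORDING.items():
    as_IP, as_ip, as_w = ip_outcomes(matrix)
    print(f"{recording}: ips -> IP {as_IP}%, ip {as_ip}%, w {as_w}%; "
          f"boundaries detected {boundary_detection(matrix)}%")
```

The comparison makes the contrast explicit: in dictations most annotated ips are recovered as ips, whereas in news and interview recordings the majority are predicted as word-final boundaries.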

Spontaneous speech exhibits specific characteristics that likely influence phrasing. Speakers often begin an IP with high intensity and pitch, which gradually decrease throughout the phrase. As a result, at the end of the IP, intensity and F0 are often too low to be measured reliably, and vowel quality and sound duration may also be unclear. This made it difficult to assess and automatically detect ip boundaries in the second part of IPs.

Moreover, this recording is an excerpt from a talk show. Thought transitions are common, and the speech rate varies, making it difficult to rely on final sound duration for boundary detection. Although interviewer speech was excluded and interrupted segments were avoided, the conversational setting may still have affected the speaker’s phrasing. This is partially reflected in IPs ending with a sharply rising F0, even though they are not questions. The speaker may be seeking confirmation from the interviewer or signaling that something is left unsaid.

The interview also includes many phrases ending with non-modal phonation (specifically, creaky voice), which makes F0 measurements nearly impossible. Such instances also occurred in read speech but were rare. Since this feature was not included in the study, it was not used to detect phrase boundaries.

During recording preparation and analysis of acoustic correlates, it became evident that not all features are equally relevant or salient. Therefore, we evaluated the accuracy of prosodic unit identification based on individual features. The results confirmed the hypothesis: identification accuracy varied depending on the feature used (see Table 7). The most informative feature was the presence of a pause: prosodic units were predicted with 76% accuracy using this cue alone. In comparison, combining all features yielded 80% accuracy, meaning the additional features improved accuracy by only 4 percentage points.

Final sound lengthening was considerably less informative, yielding 65% accuracy. Intensity and F0 changes were the least effective (59% and 60% accuracy, respectively), contrary to common assumptions in Lithuanian phonetics. However, these low values were strongly influenced by the news recordings, in which ip-final intensity remained nearly unchanged.

Naturally, the most relevant aspect is phrase identification—specifically, whether a phrase boundary is detected regardless of its type. The results indicate that 60% of phrase boundaries are detected based on pauses, 57% based on final sound lengthening, 33% based on F0 changes, and only 25% based on intensity changes (cf. 69% when all features are combined).

The findings clearly demonstrate that the examined features vary in significance when identifying specific types of phrases. The accuracy of IP detection based on pauses is 74%, while intensity and F0 contribute similarly, with accuracies of 52% and 50%, respectively. The poorest accuracy was obtained using only the final sound duration parameter (8%), which may be partly due to difficulties in precisely identifying the boundaries of the final IP sound, as intensity tends to decrease substantially.

The highest accuracy in identifying ip was achieved based on final sound lengthening (49%), while pauses proved less reliable, with 38% accuracy. Changes in intensity and F0 were considerably less relevant for identifying ips than for IPs, yielding only 3% and 11% accuracy, respectively.
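The single-feature figures discussed in this section can be derived from the four matrices in Table 7. The sketch below recomputes them and ranks the features by overall accuracy; the dictionary keys and function name are illustrative assumptions.

```python
# Single-feature confusion matrices from Table 7.
# Rows: annotated IP, ip, w; columns: predicted IP, ip, w.
FEATURE_MATRICES = {
    "pause": [[221, 56, 22], [37, 232, 336], [5, 31, 1050]],
    "final sound": [[23, 174, 102], [22, 299, 284], [7, 105, 974]],
    "intensity": [[156, 5, 138], [50, 16, 539], [55, 21, 1010]],
    "F0": [[150, 27, 122], [54, 68, 483], [60, 58, 968]],
}


def summarize(m):
    """Rounded percentages: overall accuracy, boundary detection, IP and ip."""
    total = sum(map(sum, m))
    phrase_total = sum(m[0]) + sum(m[1])
    detected = m[0][0] + m[0][1] + m[1][0] + m[1][1]
    return {
        "accuracy": round(100 * (m[0][0] + m[1][1] + m[2][2]) / total),
        "boundary": round(100 * detected / phrase_total),
        "IP": round(100 * m[0][0] / sum(m[0])),
        "ip": round(100 * m[1][1] / sum(m[1])),
    }


SUMMARY = {feature: summarize(m) for feature, m in FEATURE_MATRICES.items()}
RANKING = sorted(SUMMARY, key=lambda f: SUMMARY[f]["accuracy"], reverse=True)

for feature in RANKING:
    print(feature, SUMMARY[feature])
```

Sorting by overall accuracy places pauses first and final sound lengthening second, with F0 and intensity close together at the bottom; the per-type columns show that this ordering does not hold for ips, where final sound lengthening outperforms pauses.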

4. Conclusions

The analysis of the acoustic features of prosodic unit boundaries in read and spontaneous Lithuanian speech recordings confirms the patterns observed in other languages:

Pauses and their duration are a significant indicator of phrase boundaries, and their length may also partially help to distinguish phrase types.
A decrease in intensity at the end of IPs is consistent and systematic; intensity changes at the end of ips are significant in expressively read and spontaneous speech but not in monotonously and quickly read speech (e.g., news recordings).
The F0 tends to fall at the end of IPs, while at the end of ips, the decrease is less pronounced in the selected dataset. F0 may also rise at the end of phrases; however, these instances require further, more detailed analysis.
Final sound lengthening at the end of a word can serve as a reliable indicator of a phrase boundary, though it does not always allow for the accurate determination of phrase type.

The analysis of acoustic feature reliability for automatic speech phrasing leads to the following conclusions:

More than two-thirds of phrase boundaries were predicted based on the four examined features. However, the accuracy of identifying specific phrase types was lower, mainly due to the ambiguous nature of ip features. In some cases, these features may resemble those of IPs but can also be similar to word-boundary features within phrases.
The accuracy of automatic phrasing depends on the mode of speech realization. IPs were predicted with relatively high accuracy in read speech based on the four features, while in spontaneous speech, they were more often confused with ips. ips were relatively well predicted in expressively read speech (dictations), but very poorly predicted in news and interview recordings, where they were often mistaken for the end of a word within a phrase.
The studied features can be ranked as follows (from most to least significant) in terms of their importance for predicting prosodic unit boundaries: pauses and their duration, final sound lengthening, F0 changes, and intensity changes—the latter two being similarly important.
IP boundaries are most reliably signaled by pauses and their duration, decreased intensity, and falling F0. ip boundaries are best indicated by pauses and final sound lengthening.

The analysis of the first stage of automatic phrasing results revealed certain annotator biases, most notably a tendency to prioritize syntactic–semantic criteria in ambiguous or debatable cases. This complicates not only acoustic feature-based automatic phrasing but also the analysis of prosodic unit features. Such bias should especially be avoided when annotating spontaneous speech, which is characterized by numerous linguistically unmotivated pauses, false starts, repetitions, etc. Another prominent tendency is to equate IPs in read speech with sentences (also partly due to syntax), even in the case of complex sentences containing multiple predicates, where psychoacoustic cues often indicate numerous IPs.

Future research should focus on the following: (a) analyzing phrase-initial features (changes in intensity and F0), qualitative changes in sounds at phrase boundaries (e.g., final vowel reduction, F1 and F2 changes), and determining whether the entire rhyme portion of the syllable (nucleus and coda) is lengthened at phrase ends; (b) testing the developed automatic phrasing algorithm with a more extensive and more diverse speaker sample; and (c) developing annotator guidelines based on the findings of the current research.

Author Contributions

Conceptualization, A.K.; methodology, A.K. and E.K.-Z.; software, G.R.; validation, A.K. and E.K.-Z.; investigation, G.R. and E.K.-Z.; resources, A.K. and E.K.-Z.; writing—original draft preparation, A.K. and E.K.-Z.; writing—review and editing, A.K., G.R. and E.K.-Z.; funding acquisition, G.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Vytautas Magnus University. The APC was funded by the project “Creation of the Comprehensive Lithuanian Speech Corpus (LIEPA-3)” (grant no. 02-023-K-0001) funded by the Economic Recovery and Resilience Facility under the “New Generation Lithuania” plan.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Part of the audio data presented in this study were derived from the following resources available in the public domain: National Dictation 2018 (https://www.lrt.lt/radioteka/irasas/1013686204/nacionalinis-diktantas-2018 (accessed on 27 July 2025)) and National Dictation 2024 (https://www.lrt.lt/radioteka/irasas/2000328388/nacionalinis-diktantas (accessed on 27 July 2025)). Some of the audio data will be made available by the authors upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

F0	Fundamental frequency
IP	Intonational phrase
ip	Intermediate (or phonological) phrase
w	Phonological word

References

Aha, D., & Kibler, D. (1991). Instance-based learning algorithms. Machine Learning, 6, 37–66.
Beckman, M., & Pierrehumbert, J. (1986). Intonational Structure in Japanese and English. Phonology Yearbook, 3, 255–309.
Boersma, P., & Weenink, D. (2018). Praat: Doing phonetics by computer (Version 6.035) [Computer program]. Available online: https://www.fon.hum.uva.nl/praat/ (accessed on 20 February 2025).
Cohen, W. W. (1995, July 9–12). Fast effective rule induction. Twelfth International Conference on Machine Learning (pp. 115–123), Tahoe City, CA, USA.
Dereškevičiūtė, S., & Kazlauskienė, A. (2022). Prosodic phrasing in Lithuanian: Preparatory study. Baltic Journal of Modern Computing, 10(3), 317–325.
Elfner, E. (2018). The syntax-prosody interface: Current theoretical approaches and outstanding questions. Linguistics Vanguard, 4(1).
Elordieta, G. (2008). An overview of theories of the syntax-phonology interface. International journal of Basque linguistics and philology (ASJU), 42(1), 209–286.
Féry, C. (2017). Intonation and prosodic structure. Cambridge University Press.
Fougeron, C., & Keating, P. A. (1996). Articulatory strengthening in prosodic domain-initial position. UCLA Working Papers in Phonetics, 92, 61–87.
Frota, S., D’Imperio, M., Elordieta, G., Prieto, P., & Vigário, M. (2007). The phonetics and phonology of intonational phrasing in Romance. In P. Prieto (Ed.), Segmental and prosodic issues in Romance phonology (pp. 131–154). John Benjamins.
Frota, S., & Vigário, M. (2018). Syntax–Phonology Interface. Oxford Research Encyclopedia of Linguistics.
Girdenis, A. (2014). Theoretical foundations of Lithuanian phonology. Eugrimas.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 10–18.
Harrington Stack, C., & Watson, D. G. (2023). Pauses and parsing: Testing the role of prosodic chunking in sentence processing. Languages, 8(3), 157.
Inkelas, S., & Zec, D. (Eds.). (1990). The phonology syntax connection. Chicago University Press.
Ip, M. H. K., & Cutler, A. (2022). Juncture prosody across languages: Similar production but dissimilar perception. Laboratory Phonology: Journal of the Association for Laboratory Phonology, 13(1), 1–49.
Jun, S.-A. (1998). The accentual phrase in the Korean prosodic hierarchy. Phonology, 15, 189–226.
Jun, S.-A. (2003). Prosodic phrasing and attachment preferences. Journal of Psycholinguistic Research, 32, 219–249.
Kazlauskienė, A. (2023). Bendrinės lietuvių kalbos skiemuo: Monografija. Vytauto Didžiojo Universitetas.
Kazlauskienė, A., & Dereškevičiūtė, S. (2022a). Observations on basic intonational patterns of questions and statements in standard Lithuanian. Studies about Languages/Kalbų Studijos, 40, 90–102.
Kazlauskienė, A., & Dereškevičiūtė, S. (2022b). Observations on the prosodic marking of narrow focus in Lithuanian. Baltic Journal of Modern Computing, 10(3), 307–316.
Kazlauskienė, A., Dereškevičiūtė, S., & Sabonytė, R. (2023). Bendrinės lietuvių kalbos intonacija: Frazės centras, ribos ir žymėjimas. Vytauto Didžiojo Universitetas.
Kazlauskienė, A., & Sabonytė, R. (2018). F0 in Lithuanian: The indicator of stress, syllable accent, or intonation? In K. Muischnek, & K. Müürisep (Eds.), Proceedings of the 8th international conference human language technologies—The Baltic perspective (pp. 55–62). IOS Press.
Kohler, K. (1983). Prosodic boundary signals in German. Phonetica, 40, 89–134.
Kohler, K. (1994). Glottal stops and glottalization in German: Data and theory of connected speech processes. Phonetica, 51, 38–51.
Krippendorff, K. (2019). Content analysis: An introduction to its methodology (4th ed.). SAGE Publications.
Krivokapić, J. (2007). Prosodic planning: Effects of phrasal length and complexity on pause duration. Journal of Phonetics, 35, 162–179.
Kruskal, W. H., & Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260), 583–621.
Kundrotas, G. (2012). Intonacinė tipologija. Edukologija.
Kundrotas, G. (2018). Lietuvių kalbos intonacinė sistema (sisteminis-tipologinis tyrimo aspektas). Indra.
Kuzla, C., Cho, T., & Ernestus, M. (2007). Prosodic strengthening of German fricatives in duration and assimilatory devoicing. Journal of Phonetics, 35, 301–320.
Landwehr, N., Hall, M., & Frank, E. (2005). Logistic model trees. Machine Learning, 59, 161–205.
Liberman, M. Y., & Pierrehumbert, J. (1984). Intonational invariance under changes in pitch range and length. In M. Aronoff, & R. T. Oehrle (Eds.), Language sound structure: Studies in phonology presented to Morris Halle (pp. 157–233). MIT Press.
Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18(1), 50–60.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). Chapman and Hall.
McHugh, M. L. (2013). The Chi-square test of independence. Biochemia Medica, 23(2), 143–149.
Melnik-Leroy, G. A., Bernatavičienė, J., Korvel, G., Navickas, G., Tamulevičius, G., & Treigys, P. (2022). An overview of Lithuanian intonation: A linguistic and modelling perspective. Informatica, 33(4), 795–832.
Michelas, A., & D’Imperio, M. (2012). When syntax meets prosody: Tonal and duration variability in French accentual phrases. Journal of Phonetics, 40, 816–829.
Nespor, M., & Vogel, I. (1986). Prosodic phonology. Foris, Dordrecht.
Paschen, L., Fuchs, S., & Seifart, F. (2022). Final lengthening and vowel length in 25 languages. Journal of Phonetics, 94, 1–22.
Petrone, C., Truckenbrodt, H., Wellmann, C., Holzgrefe-Lang, J., Wartenburger, I., & Höhle, B. (2017). Prosodic boundary cues in German: Evidence from the production and perception of bracketed lists. Journal of Phonetics, 61, 71–92.
Quinlan, R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann Publishers.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
Selkirk, E. O. (1984). Phonology and syntax: The relation between sound and structure. The MIT Press.
Selkirk, E. O. (2014). The syntax-phonology interface. In J. Goldsmith, J. Riggle, & A. Yu (Eds.), The handbook of phonological theory (2nd ed., pp. 435–484). Blackwell Publishing Ltd.
Steffman, J., Kim, S., Cho, T., & Jun, S. (2024). Speech rate and prosodic phrasing interact in Korean listeners’ perception of temporal cues. Speech Prosody, 2024, 1090–1094.
Truckenbrodt, H. (2007). The syntax-phonology interface. In P. de Lacy (Ed.), The Cambridge handbook of phonology (pp. 435–456). Cambridge University Press.
Wagner, M., & Watson, D. G. (2010). Experimental and theoretical advances in prosody: A review. Language and Cognitive Processes, 25(7–9), 905–945.
Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., & Woodland, P. (2006). The HTK book (for HTK version 3.4). Cambridge University Engineering Department.
Yuan, J., & Liberman, M. (2014). F0 declination in English and Mandarin broadcast news speech. Speech Communication, 65, 67–74.

Figure 1. Excerpt from Praat showing one of the sentences with phrase boundaries.

Figure 2. Distribution of phrase-final pause durations.

Figure 3. Intensity variation in final syllables (syl), words (w), and whole intonational (IP) and intermediate phrases (ip) in dictation (D), news (N), and interview (I) recordings.

Figure 4. F0 variation in final syllables (syl), words (w), and whole intonational (IP) and intermediate phrases (ip) in dictation (D), news (N), and interview (I) recordings.

Table 1. Number and duration of pauses. The percentages in the table indicate the proportion of pauses found in recordings of each type.

	Total No. of Pauses	Pauses Between ips: Quantity	Pauses Between ips: Duration (ms)	Pauses Between IPs: Quantity	Pauses Between IPs: Duration (ms)	Duration Ratio ip:IP
Dictations	403	233 (84%)	266 (±164)	149 (100%)	910 (±438)	1:3.4
News	88	31 (13%)	156 (±111)	54 (84%)	438 (±258)	1:2.8
Interview	97	24 (29%)	234 (±179)	70 (89%)	478 (±373)	1:2.0

Table 2. Intensity (in dB) of the entire phrase (IP and ip), the final word (w), and its final syllable (σ).

	IP:w(IP):σ(IP)		ip:w(ip):σ(ip)
	Means (dB)	Ratios	Means (dB)	Ratios
Dictations	70.7(±1.8):68(±3.2):65.8(±3.9)	1:0.5:0.3	71.6(±2):71.1(±2.3):70(±3.2)	1:0.9:0.7
News	79.6(±0.9):78.1(±1.8):76.6(±2.9)	1:0.7:0.5	79.9(±0.9):79.9(±1.3):80.4(±2.2)	1:1:1.1
Interview	67.5(±3.4):65.6(±4.1):61.7(±7.3)	1:0.6:0.3	69.5(±0.9):69.1(±0.9):67.6(±0.9)	1:0.9:0.6

Table 3. Falling fundamental frequency (F0) of the entire phrase (IP and ip), the final word (w), and the final syllable (σ) of that word.

	IP:w(IP):σ(IP)		ip:w(ip):σ(ip)
	Means (Hz)	Ratios	Means (Hz)	Ratios
Dictations	87(±12):73(±15):59(±8)	1:0.8:0.7	93(±17):89(±17):77(±17)	1:1:0.8
News	150(±20):145(±34):111(±46)	1:1:0.7	151(±27):155(±30):131(±38)	1:1:0.9
Interview	104(±27):98(±28):79(±19)	1:0.9:0.8	95(±23):91(±24):82(±21)	1:1:0.9

Table 4. Rising fundamental frequency (F0) of the entire phrase (IP and ip), the final word (w), and the final syllable (σ) of that word.

	IP:w(IP):σ(IP)		ip:w(ip):σ(ip)
	Means (Hz)	Ratios	Means (Hz)	Ratios
Dictations	84(±12):72(±19):77(±21)	1:0.9:0.9	93(±20):91(±19):101(±23)	1:1:1.1
News	157(±17):171(±27):189(±21)	1:1.1:1.2	146(±31):151(±36):171(±37)	1:1:1.2
Interview	93(±26):95(±30):108(±34)	1:1:1.2	98(±19):98(±22):110(±33)	1:1:1.1

Table 5. Duration of certain final word sounds. # marks the final sound in phrase-medial position.

		Means (ms)	Ratios
Dictations	[s]_#:[s]_ip:[s]_IP	93(±31):156(±31):177(±28)	1:1.7:1.9
	[ɪ]_#:[ɪ]_ip:[ɪ]_IP	99(±38):175(±50):167(±42)	1:1.8:1.7
	[ɐ]_#:[ɐ]_ip:[ɐ]_IP	94(±40):167(±38):174(±21)	1:1.8:1.9
	[oː]_#:[oː]_ip:[oː]_IP	98(±35):163(±44):170(±37)	1:1.7:1.7
News	[s]_#:[s]_ip:[s]_IP	60(±17):78(±24):136(±36)	1:1.3:2.3
	[ɪ]_#:[ɪ]_ip:[ɪ]_IP	65(±25):85(±36):104(±43)	1:1.3:1.6
	[ɐ]_#:[ɐ]_ip:[ɐ]_IP	58(±19):75(±24):85(±6)	1:1.3:1.5
	[oː]_#:[oː]_ip:[oː]_IP	67(±18):84(±26):85(±12)	1:1.3:1.3
Interview	[s]_#:[s]_ip:[s]_IP	79(±36):121(±59):157(±76)	1:1.5:2
	[ɪ]_#:[ɪ]_ip:[ɪ]_IP	82(±54):156(±115):104(±73)	1:1.9:1.3
	[ɐ]_#:[ɐ]_ip:[ɐ]_IP	63(±27):119(±71):104(±56)	1:1.9:1.7
	[oː]_#:[oː]_ip:[oː]_IP	70(±34):112(±75):183(±216)	1:1.6:2.6

Table 6. Confusion matrices for identification of prosodic units. Rows show the units marked by annotators; columns show the predicted units.

	Entire Sample			News			Dictations			Interview
	IP	ip	w	IP	ip	w	IP	ip	w	IP	ip	w
IP	247	44	8	55	11	1	131	21	0	61	12	7
ip	38	298	269	7	73	163	23	198	55	8	24	51
w	5	30	1051	1	11	411	0	12	453	4	7	187

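Table 7. Confusion matrices for identification of prosodic units based on individual features. Rows show the units marked by annotators; columns show the predicted units.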

	Pause			The Final Sound			Intensity			F0
	IP	ip	w	IP	ip	w	IP	ip	w	IP	ip	w
IP	221	56	22	23	174	102	156	5	138	150	27	122
ip	37	232	336	22	299	284	50	16	539	54	68	483
w	5	31	1050	7	105	974	55	21	1010	60	58	968

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).