Article

Articulatory Control by Gestural Coupling and Syllable Pulses

by
Christopher Geissler
Department of Eastern, Slavic, and German Studies, Boston College, Chestnut Hill, MA 02467, USA
Languages 2025, 10(9), 219; https://doi.org/10.3390/languages10090219
Submission received: 14 December 2024 / Revised: 24 June 2025 / Accepted: 19 August 2025 / Published: 29 August 2025
(This article belongs to the Special Issue Research on Articulation and Prosodic Structure)

Abstract

Explaining the relative timing of consonant and vowel articulations (C-V timing) is an important function of speech production models. This article explores how C-V timing might be studied from the perspective of the C/D Model, particularly the prediction that articulations are coordinated with respect to an abstract syllable pulse. Gestural landmarks were extracted from kinematic data from English CVC monosyllabic words in the Wisconsin X-Ray Microbeam Corpus. The syllable pulse was identified using velocity peaks, and temporal lags were calculated among landmarks and the syllable pulse. The results directly follow from the procedure used to identify pulses: onset consonants exhibited stable timing to the pulse, while vowel-to-pulse timing was comparable in stability to C-V timing. Timing relationships with jaw displacement and jaw-based syllable pulse metrics were also explored. These results highlight current challenges for the C/D Model, as well as opportunities for elaborating the model to account for C-V timing.

1. Introduction

This paper examines the relationship between syllable structure and articulatory timing by focusing on the relative timing of consonant and vowel articulations (C-V timing). This topic will be viewed from the perspective of the Converter/Distributor (C/D) Model (Fujimura, 2000), which is a theory of the phonetics-phonology interface based on prosodically defined syllable structure. Taking a C/D Model perspective on C-V timing highlights areas not yet developed in the C/D framework and suggests the kind of developments that would be needed to account for these observations.
The prediction at the heart of this paper is that articulatory movements should exhibit relatively stable timing with respect to the C/D Model’s abstract syllable pulse. As a point of comparison, the pairwise timing of gestural onsets is used, following common practice in Articulatory Phonology. These two approaches will be referred to as the “syllable pulse hypothesis” and the “pairwise-coupling hypothesis,” respectively. This study reports fundamental methodological challenges in assessing timing to a syllable pulse, examines covariation between oral articulations and movements of the jaw, and concludes with suggestions for future elaboration of the C/D Model.

1.1. C-V Timing and Pairwise Coupling

The idea that the gestural onsets of a CV syllable are particularly stable traces back to Kozhevnikov and Chistovich (1965). Subsequent work in Task Dynamics (E. L. Saltzman & Munhall, 1989) and Articulatory Phonology explored the temporal relationship between consonant and vowel gestures, often emphasizing coordination between gestural onsets (Goldstein et al., 2009; Nam, 2007; Nam & Saltzman, 2003). Other works, such as Gafos (2002), reference a number of different gestural landmarks besides onsets. Prosody has been implemented in Articulatory Phonology with pi- and mu-gestures that modulate gestural activation in time; see Byrd and Krivokapić (2021) for discussion. Meanwhile, other works, such as Turk and Shattuck-Hufnagel (2020), instead emphasize the timing of articulatory targets, rather than onsets.
Considerable attention in Articulatory Phonology has been devoted to the study of C-V timing. Stability between articulatory events has been used across a number of studies of syllable structure (Kramer et al., 2023; Pastätter & Pouplier, 2017; J. Shaw et al., 2009). Quantifying the stability of a temporal interval with relative standard deviation (RSD), defined as 100 times the standard deviation (SD) divided by the mean, is adopted from J. Shaw et al. (2009) and Durvasula et al. (2021). The standard deviation of an interval can also be plotted against the mean; Burroni (2023) uses this to identify cases with relatively low SD compared to the mean.
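For concreteness, the RSD as defined here can be computed in a few lines; the lag values in this sketch are invented for illustration, not drawn from any of the cited studies:

```python
import numpy as np

def rsd(intervals):
    """Relative standard deviation: 100 * SD / mean (after J. Shaw et al., 2009)."""
    intervals = np.asarray(intervals, dtype=float)
    return 100.0 * intervals.std(ddof=1) / intervals.mean()

# Hypothetical interval measurements in ms (illustrative values only)
lags = [18.0, 22.0, 20.0, 24.0, 16.0]
print(round(rsd(lags), 2))
```

Note that, as discussed later in this paper, the RSD becomes unstable when the mean interval approaches zero, since the denominator shrinks while the SD does not.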
Articulatory Phonology and related research has also drawn attention to C-V lag, the difference in time between the onset of a syllable-initial consonant gesture and that of the tautosyllabic vowel gesture. Near-zero C-V lag is a prediction of in-phase coupling, which is identified as characteristic of onset-to-vowel timing in work such as Browman and Goldstein (2000), Haken et al. (1985), and Nam and Saltzman (2003). Longer C-V lag values suggest anti-phase or intermediate modes of coordination (Browman & Goldstein, 2000; Gafos, 2002; Gao, 2008). Likewise, covariation between intergestural lag and the duration of the first gesture also indexes non-in-phase timing, as in J. A. Shaw et al. (2021).
When comparing models, it is important to note that the jaw is not generally considered to be a “crucial” or “primary” articulator in these traditions. Browman and Goldstein (1988) and subsequent works describe articulatory targets in terms of location (“constriction location”) and distance from the passive articulator (“constriction degree”). Across contexts and across tokens, different articulators may contribute more or less to the attainment of a particular target. For instance, both the tongue and jaw move to attain a closure between the tongue tip and alveolar ridge, but the relative contributions of the tongue and jaw will differ across tokens. It is uncontroversial in both Articulatory Phonology and the C/D Model that jaw movement serves to support the attainment of other articulatory targets, but the jaw also plays an additional role in the C/D Model, as explained in Section 1.2 and Erickson (this issue).
The term gesture has a precise, technical definition in Articulatory Phonology that is not adopted in this paper. Gestures in Articulatory Phonology are abstract units of speech planning and linguistic contrast; they govern movements of the vocal tract but are not themselves directly observable. However, as Pouplier (2020) notes, “In phonetics, the term articulatory gesture is often used as a general descriptive term to refer to a particular movement cycle in a recorded signal of articulator motion, usually associated with the production of a particular phoneme.” The latter, non-technical definition is the sense in which gesture is used in this paper.

1.2. Syllable Pulse Hypothesis

The C/D Model, short for Converter/Distributor Model, is a theory of phonetic implementation. As articulated by Fujimura (2000), it takes as input the phonological metrical structure of an utterance and a sequence of syllables. The first component, the Converter, combines these inputs and generates a “base function” composed of a syllable nucleus articulation along with jaw, voicing, and tonal specifications. The Distributor component then assembles “impulse response functions,” which activate consonant gestures at the margins of each syllable. See the article by Erickson in this issue for more information on the C/D Model.
One unique feature of the C/D Model is its use of organizational structures in the shape of non-overlapping isosceles triangles. Each triangle corresponds to one syllable, with the height of the triangle determined by the prominence of the syllable. The vertical line is referred to as a “syllable pulse” and represents an abstract center of the syllable. The base of the triangle represents an abstract syllable duration, and the corners along the base provide reference points for the activation of onset and coda consonants. The angle opposite the base, the “shadow angle,” is held constant throughout an utterance. Thus, more-prominent syllables have longer abstract durations. Gaps between syllable triangles indicate prosodic boundaries.
In order to construct syllable triangles from articulatory data, three pieces of information are needed: the height of the syllable pulse, the location of the syllable pulse, and the shadow angle. Fujimura (1992) derived the height of syllable pulses from phonological stress. Working with kinematic data, Erickson and Kawahara (2015) quantified syllable prominence as the maximum magnitude of jaw lowering in each syllable. The location of syllable pulses was identified as the midpoint of two stable points, one on each side of the syllable. To identify these stable points, Fujimura (1986) used an “iceberg method,” which selects points in the consonant gesture that are relatively invariant across tokens. As an alternative, Erickson and Kawahara (2015) used the velocity peaks of the “crucial articulator” for the opening of the onset consonant constriction and closing of the coda consonant. The shadow angle is the largest possible angle between adjacent triangles—that is, triangles corresponding to adjacent syllables with no intervening boundary.
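As a numerical illustration of this geometry, a syllable triangle can be reconstructed from the three quantities above. The pulse time, pulse height, and 60° shadow angle below are invented values, and treating the shadow angle as the triangle's apex angle is a simplifying assumption:

```python
import math

def syllable_triangle(pulse_time, pulse_height, shadow_angle_deg=60.0):
    """Sketch of a C/D Model syllable triangle: the apex sits at the
    syllable pulse, and a constant apex ("shadow") angle makes the base
    width scale linearly with pulse height (syllable prominence)."""
    half_base = pulse_height * math.tan(math.radians(shadow_angle_deg) / 2)
    # Corners of the base: reference points for onset and coda consonants
    return (pulse_time - half_base, pulse_time + half_base)

# A more prominent syllable (taller pulse) yields a wider base:
onset1, coda1 = syllable_triangle(pulse_time=100.0, pulse_height=10.0)
onset2, coda2 = syllable_triangle(pulse_time=100.0, pulse_height=15.0)
assert (coda2 - onset2) > (coda1 - onset1)
```

Because the angle is held constant, the consonant reference points move away from the pulse in proportion to prominence, which is the dispersion effect discussed below.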
Though C-V timing has not been a focus of empirical research in the C/D Model, the model does speak to the topic. It takes the syllable to be the main unit of phonological representation, with one vowel per syllable. Consonants are arranged on the edges of the syllable, as onset and coda or more marginal structures called p-fix and s-fix. Each of these consonantal components—p-fix, onset, coda, and s-fix—contains features but not segments. Since the linear order of segments is predictable (at least in English and Japanese), the precedence relations can be determined by their corresponding rules (Fujimura, 1979, 1992).
One consequence of the fundamentally syllabic structure of the C/D Model is that consonantal articulations become farther apart in more prominent syllables (Fujimura, 1981). As the “height” of syllable triangles increases, so does the size of the “base,” and thus, all the features become more widely dispersed in time. However, the precise temporal relationship between features and motor instructions (called “Impulse Response Functions”) is not entirely explicit.
Even without a theory of the Impulse Response Functions, it follows from the C/D Model that consonant articulations are timed with respect to the syllable pulse. For simplex onsets, the subject of this study, the consonantal articulations are associated with the left corner of the triangle base. The distance between this corner and the syllable pulse is predicted to vary with syllable prominence but should remain consistent in a given prosodic context. Vowel articulations should also be coordinated with respect to the syllable pulse. Unlike in the pairwise-coupling approach, the C/D Model claims that any two gestures are not coordinated with each other directly but rather through their relationship to the syllable pulse. Stability between the syllable pulse and gestural landmarks would thus provide evidence for the coordination proposed by the C/D Model.

1.3. Aims

The overall aim of this paper is to explore C-V timing from the perspective of the C/D Model. This is intended to showcase what such research could look like and to highlight areas where additional research is needed.
As evidence for articulatory coordination, two types of measurements are used: the stability of the interval between articulatory landmarks and the covariation between two different intervals. In particular, the syllable-pulse hypothesis predicts that articulatory gestures should be coordinated—and thus stably timed—with respect to a syllable pulse. These results are compared to pairwise C-V onset-to-onset timing, which serves to contextualize measures of stability.

2. Materials and Methods

2.1. Data

Kinematic data were drawn from the Wisconsin X-ray Microbeam Database (XRMB). This system tracked the position of gold pellets attached to vocal articulators, including the mandible, lips, and tongue. The data collection procedure, including post-processing and smoothing, is described in Westbury et al. (1994). Acoustic annotations from Sprouse (2017) were used to help locate sections of interest.
Trajectories were analyzed from 14 monosyllabic CVC words (see Appendix A). These were the items that had non-glottal consonants and were elicited five times in wordlist reading tasks by each of the 48 speakers for whom kinematics and annotations were available. Segmental composition varied, but given the wordlist task, all should have been produced in a consistent prosodic context. In total, 2667 usable tokens were present in the XRMB, with an average of 3.97 tokens per word per speaker. For a smaller subset of these, landmarks were successfully identified using the procedure described in Section 2.2.

2.2. Definition of Landmarks

The landmarks used in this study are common in Articulatory Phonology. All but one (MAXC) were referenced by Gafos (2002) as potential sites for intergestural coordination. Likewise, all but one (NMID) correspond to those identified by the popular lp_findgest procedure in Mview (Tiede, 2005), a software package commonly used to process EMA data. The label names used are from Tiede (2005), with the exception of “NMID,” the midpoint of the gesture plateau, which Gafos (2002) calls the “c-center.” The landmarks used are summarized in Table 1.
For each CVC syllable, four gestures were counted in total, corresponding to the onset consonant, vowel, coda consonant, and jaw. For the jaw, the vertical component of the mandibular incisor pellet trajectory was used. For the other gestures, the pellet on a “crucial articulator” was used; we adopt the term “crucial articulator” from Fujimura (2000) and Svensson Lundmark and Erickson (2024), corresponding to “primary articulator” (Byrd & Krivokapić, 2021) or “major articulator” (Browman & Goldstein, 1988). For consonants, the trajectory used was the vertical component of the lower lip, tongue tip (T1), or tongue dorsum (T4) pellet. For vowels, the trajectory used was either the vertical or horizontal component of the tongue blade (T3) or tongue dorsum (T4). The crucial articulators and corresponding trajectories are listed in Appendix A.
Kinematic data analysis was conducted with a script written in Python 3.10.11 (Van Rossum & Drake, 2009). Starting with the annotations from Sprouse (2017), relevant trajectories were extracted for each crucial articulator in a window starting 200 ms before the acoustic start of the word and ending 200 ms after the end of the word. Within that window, landmarks were identified as shown in Figure 1. Two sets of landmarks were identified when onset and coda used the same crucial articulator. First, the position extrema were recorded as MAXC. Next, velocity maxima (in different directions) were recorded as PVEL and PVEL2. The acceleration maxima before and after each of these landmarks were recorded as GONS, NONS, NOFF, and GOFF. Finally, the mean of NONS and NOFF was recorded as NMID.
In this study, the start and end of the gestures (GONS, GOFF) and plateaus (NONS, NOFF) were identified using acceleration peaks. This follows the comparison of three different methods for identifying gesture onsets by Svensson Lundmark et al. (2021). While acceleration peaks are, by definition, after the onset of movement, the use of this metric avoids mistakenly labeling small, non-crucial movements. One alternative, commonly implemented in Tiede (2005), labels these landmarks as points where 20% of peak velocity is attained. While these serve similar functions, acceleration peaks were chosen because of evidence that acceleration peaks may be associated with acoustic segment boundaries and the attainment of acoustic targets (Svensson Lundmark, 2023, 2024).
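A minimal sketch of this style of landmark identification, applied to a synthetic single-cycle trajectory, is given below. It is a simplified stand-in for the actual extraction pipeline: the function is invented for illustration, and real data would require smoothing and handling of non-crucial movements:

```python
import numpy as np

def gesture_landmarks(pos, dt=0.005):
    """Toy landmark finder for one closing-then-opening movement cycle
    (e.g., lower-lip height for a bilabial). Landmark names follow
    Tiede (2005); the logic is simplified relative to the real pipeline."""
    vel = np.gradient(pos, dt)
    acc = np.gradient(vel, dt)
    maxc = int(np.argmax(pos))                    # MAXC: position extremum
    pvel = int(np.argmax(vel[:maxc]))             # PVEL: closing velocity peak
    pvel2 = maxc + int(np.argmin(vel[maxc:]))     # PVEL2: opening velocity peak
    gons = int(np.argmax(acc[:pvel]))             # GONS: acceleration peak before PVEL
    nons = pvel + int(np.argmin(acc[pvel:maxc]))  # NONS: deceleration peak after PVEL
    return {"GONS": gons, "PVEL": pvel, "NONS": nons, "MAXC": maxc, "PVEL2": pvel2}

# Synthetic raised-cosine movement sampled at 200 Hz: closure, then release
t = np.linspace(0.0, 1.0, 201)
pos = 1.0 - np.cos(2.0 * np.pi * t)
lm = gesture_landmarks(pos)
assert lm["GONS"] < lm["PVEL"] < lm["NONS"] < lm["MAXC"] < lm["PVEL2"]
```

The 20%-of-peak-velocity alternative mentioned above would replace the acceleration-peak steps with threshold crossings on the velocity signal.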

2.3. Identifying Syllable Pulses

Syllable pulses were identified using the method outlined in Erickson and Kawahara (2015) as the midpoint between the PVEL2 of the onset consonant and the PVEL of the coda consonant. This is an alternative to the “iceberg” method of Fujimura (1986), which requires closely comparing many parallel productions to find stretches of relative invariance within trajectories on either side of the syllable. The iceberg method would not be possible with the relatively small number of repetitions per speaker in the XRMB. Instead, the consonant-velocity method allows a syllable pulse to be identified for each token individually.
While this study uses consonant-velocity syllable pulses throughout, Section 3.5 compares this method with three other metrics derived from the jaw kinematics alone. These are three ways to quantify the center of the jaw movement: the position extremum (MAXC), midpoint of acceleration peaks (NMID), and midpoint of velocity peaks.
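Given landmark times, the consonant-velocity pulse and the three jaw-based alternatives reduce to simple arithmetic. The landmark times below (in ms) are invented for illustration:

```python
def pulse_candidates(c_onset_pvel2, c_coda_pvel, jaw_maxc,
                     jaw_nons, jaw_noff, jaw_pvel, jaw_pvel2):
    """Candidate syllable-pulse times (all arguments are landmark times in ms)."""
    return {
        # Consonant-velocity pulse, after Erickson and Kawahara (2015)
        "consonant_velocity": (c_onset_pvel2 + c_coda_pvel) / 2,
        # Jaw-based alternatives
        "jaw_maxc": jaw_maxc,                       # jaw position extremum
        "jaw_nmid": (jaw_nons + jaw_noff) / 2,      # midpoint of jaw plateau
        "jaw_vel_mid": (jaw_pvel + jaw_pvel2) / 2,  # midpoint of jaw velocity peaks
    }

p = pulse_candidates(c_onset_pvel2=120.0, c_coda_pvel=260.0, jaw_maxc=195.0,
                     jaw_nons=160.0, jaw_noff=230.0, jaw_pvel=140.0, jaw_pvel2=250.0)
assert p["consonant_velocity"] == 190.0   # (120 + 260) / 2
```

Because the consonant-velocity pulse is built from the onset and coda consonant landmarks, any stability measured between it and those landmarks is partly guaranteed by construction, a point that recurs in the Results.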

2.4. Key Measurements

This paper assumes that the relative stability of landmarks can be taken as evidence for control structures that reference these landmarks. Stability is quantified with SD and RSD, but these are unreliable for values of different magnitudes and near-zero values, respectively. As such, they are supplemented by examining the residuals of a regression model, as in Section 3.1. The use of stability follows practice in Articulatory Phonology, summarized in Section 1.1, and is extended to the temporal relationship of articulatory landmarks with the syllable pulse.
In addition to stability, this study also reports covariation between measurements as evidence of structural relationships. This includes the covariation between two landmark-to-landmark durations but also between a duration and the magnitude of jaw movements in Section 3.4.
Statistical analysis was conducted in R (R Core Team, 2024). Linear mixed-effects models were constructed with the lme4 package (Bates et al., 2015).

3. Results

3.1. Stability Results

The lags between gestural landmarks and the consonant-velocity-defined syllable pulses are presented in Figure 2. The syllable pulse generally takes place shortly after the end of the consonant gesture and just past the middle of the vowel gesture.
Note that the lags from the vowel landmarks appear substantially more variable than the lags from the consonant landmarks. This difference may result from the trajectories or the nature of the landmarks, but it also reflects the fact that the syllable pulse is defined by the onset and coda consonant gestures, including C_PVEL2 shown here.
The relationship between mean and standard deviation of the same lags is presented in Figure 3. If each landmark were coordinated to the syllable pulse with equivalent stability, then the SD would increase as the mean values increased. Among the consonant landmarks, this pattern appears to hold: landmarks farther from the syllable pulse have lags with higher SDs, with the exception of C_GOFF, the C gesture offset. For the vowel, however, the lag to all the landmarks shows a relatively high SD.
If a landmark had a particularly stable relationship to the syllable pulse, it would appear as an outlier in the bottom left or bottom right of Figure 3—that is, its lag would have a low SD relative to its mean. For most of the consonant landmarks, the SD of the lag increased slowly relative to the mean lag, with no visibly apparent outlier. Thus, even though C_PVEL2 was used to define the syllable pulse, its position on the chart did not stand out from the other C gestures. Interestingly, the consonant landmark with the highest lag SD, C_GOFF, is also temporally closest to the syllable pulse. This could result from the release being less controlled than the closure, but in some cases, it may also reflect coarticulation with the coda consonant. As for the vowel landmarks, this figure broadly suggests that no particular relationship exists between the syllable pulse and the vowel. If anything, the landmarks closest to the beginning and end of the vowel gesture may be somewhat more stable than expected.
The Relative Standard Deviation (RSD) values of these lags are presented in Table 2.
In interpreting the RSD values, it is important to consider them in the context of the mean values. This is particularly the case when values cluster near zero, where the RSD is unhelpfully large. The lowest RSD value, both in the overall data and in the two sample words, is that of C_GONS, which is notable because it is lower than that of the other consonant landmarks despite having the largest lag.
In light of the shortcomings of the RSD, an alternative metric for stability was used: the residuals of a regression model fit to the data. A linear mixed-effects model was fit to the landmark-to-pulse intervals, with a fixed effect of landmark and random intercepts for speaker and word (models with random slopes failed to converge). See Appendix C for details of the model. The standard deviations of the residuals are presented in Table 2: the larger the SD of the residuals, the more variability is present beyond what can be explained by the model.
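The logic of this residual-SD diagnostic can be illustrated with synthetic data. The sketch below substitutes simple fixed-effects demeaning for the actual lme4 mixed model, and all values are invented:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic landmark-to-pulse lags: per-landmark mean + per-speaker offset + noise
landmarks = np.repeat(np.arange(4), 50)   # 4 landmarks x 50 tokens each
speakers = np.tile(np.arange(10), 20)     # 10 speakers, balanced design
lag = (np.array([-80.0, -60.0, -40.0, -20.0])[landmarks]  # landmark means
       + rng.normal(0, 5, 10)[speakers]                   # speaker offsets
       + rng.normal(0, 8, 200))                           # token-level noise

# Simplified analogue of the mixed model: remove landmark and speaker means,
# then take the SD of what remains unexplained.
resid = lag.copy()
for grp in (landmarks, speakers):
    for g in np.unique(grp):
        resid[grp == g] -= resid[grp == g].mean()
residual_sd = resid.std(ddof=1)
print(round(residual_sd, 2))   # should land near the generating noise SD of 8
```

A landmark whose timing to the pulse is genuinely controlled would contribute less token-level noise than its neighbors, and so would show a visibly lower residual SD.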
Since C_PVEL2 was used to identify the syllable pulse, it is unsurprising that this yielded the lowest SD of the residuals (as well as the lowest SD of observations). If no consonant landmarks exhibited particularly stable timing to the syllable pulse, the landmarks farther from C_PVEL2 should have produced progressively larger Residual SD values. Conversely, if a landmark had a lower Residual SD value than its neighbors, this would suggest a stable relationship to the pulse. Since no landmark exhibited a dramatically lower Residual SD than the others, there is no evidence that C_GONS or any other consonantal landmark is particularly stably timed to the syllable pulse.
As for the vowel landmarks, the values are substantially larger, and no clear pattern emerges. It is worth noting that the SD of the residuals generally tracks the SD of the observations, though not exactly: the most-stable interval by observed SD and RSD is V_GONS, while the Residual SD indicates V_NONS and V_NMID (the latter being the midpoint of V_NONS and V_NOFF). Given the small difference in these values and the lack of obvious pattern, there is no clear evidence that V_GONS or any other vowel landmark is particularly stably timed to the syllable pulse.
Overall, the results follow from the fact that the C_PVEL2 landmark was used in calculating the syllable pulse. Consonant landmarks exhibited more stable timing to the syllable pulse than the vowel landmarks did, with the most stable being C_PVEL2.

3.2. C-V Lag

This section compares the lag between consonant and vowel to the lags discussed in Section 3.1. The GONS-to-syllable-pulse lags, which had the most stable timing to the syllable pulse, are also plotted in Figure 4. Here, they are joined by two forms of C-V lag: from consonant GONS to vowel GONS and from consonant NMID to vowel GONS.
It thus appears that C-V lag, whether measured from the consonant GONS or NMID, is about as stable as vowel-to-pulse measures and thus less stable than consonant-to-pulse measures. The mean, SD, and RSD values are shown in Table 3.
As above, a linear mixed-effects model was fit to the intervals, with a fixed effect of lag type and random intercepts and slopes for speaker and word. See Appendix C for details of the model. The standard deviation results of the residuals are presented in Table 3. The SD of the residuals mirrors the RSD and original SD of the observations.
These results indicate that C-V lag, whether measured from the start of the consonant or the midpoint of its plateau, is comparably stable to the V-to-pulse interval. The C-to-pulse interval is more stable than these, which is consistent with the pulse being defined by the consonants.

3.3. Timing Covariation

The relationship between the duration of gestures and C-V lag is presented in Figure 5. If C-V timing is purely a matter of onset coordination, then the C-V lag should be independent of gesture duration. Conversely, if gestural coordination references points within the gesture after the onset, then gesture length should affect the C-V lag. Specifically, longer consonant gestures would result in longer C-V lag, while longer vowel gestures would result in shorter C-V lag.
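The prediction in the last two sentences can be made concrete with a toy calculation, assuming (purely for illustration, with invented names and values) that the consonant's plateau midpoint falls at half its duration:

```python
# If the vowel onset (V_GONS) is timed at a fixed offset from a point inside
# the consonant gesture (here its plateau midpoint) rather than from C_GONS,
# then the GONS-to-GONS C-V lag grows with consonant duration.
def cv_lag_given_midpoint_timing(c_duration_ms, fixed_offset_ms=10.0):
    c_midpoint = c_duration_ms / 2   # simplifying assumption
    return c_midpoint + fixed_offset_ms

assert cv_lag_given_midpoint_timing(120.0) > cv_lag_given_midpoint_timing(80.0)
```

Symmetrically, if the consonant is timed to a point inside the vowel gesture, lengthening the vowel pushes that internal point later relative to V_GONS and shortens the measured C-V lag.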
A series of linear mixed-effects models were fit to investigate the relationship between each of these variables and C-V lag, as shown in Table 4. A baseline model included random intercepts for speaker and word (models with random slopes failed to converge). This was compared to a series of models that each included a fixed effect of one of the durations in Figure 5.
Comparison of the AIC and BIC shows that the models including C duration and V duration performed better than the baseline, while the model that included jaw duration did not. The model that included factors for both C duration and V duration improved over both the models that included just one of these factors.
In the combined model, consonant duration had a small positive effect on C-V lag (slope: 0.15, SE: 0.01, F(1) = 11.198, p = 0.0009), while vowel duration had a slightly larger negative effect (slope: −0.32, SE: 0.03, F(1) = 134.511, p < 0.0001). These results are not consistent with the hypothesis of purely onset-based coordination.
The relationship between the duration of gestures and their timing relative to the syllable pulse is presented in Figure 6. If the gesture onset is coordinated with the syllable pulse, there should be no relationship between gesture duration and timing to the syllable pulse. Conversely, if the syllable pulse is coordinated to a later point in the gesture, then longer gestures should be associated with earlier gesture onsets relative to the syllable pulse. Since the C_GONS-to-syllable-pulse values are negative, longer consonant gestures are predicted to be negatively correlated with the C_GONS-to-syllable-pulse timing.
Linear mixed-effects models were fit to investigate the relationship between each of these variables and the C_GONS-to-pulse interval. A baseline model included random intercepts for speaker and word (models with random slopes failed to converge). This was compared to models that each included a fixed effect of C duration or jaw duration.
Comparison of the AIC and BIC in Table 5 shows that the model including C duration improved over the baseline, while the model that included jaw duration did not. In the C duration model, consonant duration had a negative effect on the C_GONS-to-pulse interval (slope: −0.28, SE: 0.02, F(1) = 254.58, p < 0.0001). These results, and those presented earlier in this section, are not consistent with the hypothesis of purely onset-based coordination.

3.4. Jaw Magnitude

In the C/D Model, the magnitude of jaw movement is crucially proportional to the abstract syllable duration (see Erickson, this issue). Longer abstract syllable durations (the “base” of the triangles) would in turn predict that larger jaw excursions should be correlated with consonants timed farther from the syllable pulse. Larger jaw displacement may also be associated with longer jaw and vowel gestures. Figure 7 shows several duration measures as a function of jaw displacement.
Linear mixed-effects models were fit to investigate the potential effect of jaw displacement on the various durations. Each included random intercepts for speaker and word (models with random slopes failed to converge) and a fixed effect of jaw displacement. The values are summarized in Table 6.
Of the results presented in Table 6, jaw displacement was only associated with longer durations of jaw movement. Jaw displacement was not a significant predictor of any other factor. While there was a random effect of word, the data were not sufficient to include a fixed effect of vowel type, which likely would be strongly related to jaw displacement.

3.5. Alternative Syllable Pulses

The results presented so far have used a single metric for defining the syllable pulse, namely, the midpoint of vowel-adjacent consonant velocity peaks. This section also considers three other landmarks that may index syllable timing: the midpoint of jaw velocity peaks, midpoint of jaw plateau (defined by acceleration peaks), and onset of jaw movement. Previous research in the C/D Model has not used jaw kinematics to identify the temporal location of the syllable pulse; this section considers jaw kinematics given the important role of the jaw magnitude in the C/D Model (see Erickson, this issue). Two landmarks, the midpoints of jaw plateau and jaw velocity peaks, might be considered alternative candidates for a syllable pulse landmark. The third, the jaw onset, has been included given the relatively stable timing already observed for the onsets of movement in C and V gestures.
For each of these points, the temporal distances between this point and the C_GONS and V_GONS landmarks were calculated. These durations were compared with each other and with the duration between C_GONS and the consonant-velocity-defined syllable pulse (as in Section 3.1). The results are visualized in Figure 8, and the mean, SD, and RSD values are presented in Table 7.
The three syllable-central landmarks are quite close together. The distance from syllable pulse to jaw plateau midpoint is 2.72 ± 12.17, from syllable pulse to jaw velocity peak midpoint is 0.52 ± 12.86, and from jaw plateau midpoint to jaw velocity peak midpoint is 1.35 ± 6.71.
While visual inspection of Figure 8 is not particularly informative, the SD and RSD values in Table 7 show that the original syllable pulse landmark produced more stable timing to both C_GONS and V_GONS. This suggests that these jaw landmarks alone are not a useful way to identify syllable pulses.

3.6. Summary of Results

Overall, the timing of the syllable pulse reported in Section 3.1, Section 3.2 and Section 3.3 follows from its definition in terms of the C_PVEL2 landmark. The pulse was stably timed to consonant landmarks and less stably timed to vowel landmarks. C-V lag, the interval between C and V gestural onsets, was small in magnitude and comparably stable to the vowel-to-pulse interval.
Jaw kinematics exhibited some interesting correlations. Larger jaw movements tended to also have longer durations. The magnitude of a jaw excursion was correlated with longer C-V lag, but there was no relationship between C-V lag and the duration of the jaw movement (see Section 3.3 and Section 3.4).
Finally, alternative definitions of the syllable pulse based directly on jaw kinematics were briefly explored. These included the jaw gesture onset, midpoint of jaw plateau, and midpoint of jaw velocity peaks. All three exhibited less stable timing to both consonant and vowel gestures (Section 3.5).

4. Discussion

This study examined the stability and covariation of articulatory timing from the perspective of the C/D Model and a simplified model of pairwise gestural coordination. The key prediction of the C/D Model is that consonant gestures would exhibit stable timing to an abstract syllable pulse. Following discussions of the results of this study, Section 4.4 considers what future work is needed to account for these topics in the C/D Model.
The timing between onset consonants and the syllable pulse was indeed stable, consistent with the syllable-pulse timing hypothesis, but this is readily explained by the fact that the method used to identify the pulses relied on the consonant kinematics. Vowel onsets were relatively stable, consistent with the pairwise-coupling hypothesis, but no evidence was found indicating stability of particular consonant landmarks. Larger jaw movements were associated with longer consonant-to-pulse intervals, while no such relationship was observed between jaw movements and vowel-to-pulse intervals. Overall, the stability of consonant gestures can be attributed to the method used to identify syllable pulses.

4.1. Syllable Pulse Interpretation

The stability results reported in this paper follow from the fact that the syllable pulse effectively serves as a proxy for the onset consonant’s PVEL2 landmark, which was used in defining the pulse (see Section 1.2). Consonant-to-vowel and vowel-to-pulse intervals had similar standard deviations, roughly double that of the consonant-to-pulse intervals. The consonant-to-pulse stability is thus best interpreted as the natural consequence of relative stability among landmarks of the same gesture. Since stability to the pulse decreased for each landmark farther from PVEL2, this work finds no evidence that any other consonant landmark is coordinated to this syllable pulse.
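As a concrete illustration of the stability diagnostic, the relative standard deviation (RSD) reported in Table 2 and Table 3 divides an interval's SD by its mean. The sketch below is illustrative only: the synthetic lag samples merely echo the approximate means and SDs of the reported consonant-to-pulse and vowel-to-pulse intervals, and are not the study's data.

```python
import numpy as np

def rsd(lags_ms):
    """Relative standard deviation (percent): SD / mean * 100.
    A scale-free stability measure for temporal lag distributions;
    negative values arise when the mean lag is negative."""
    lags = np.asarray(lags_ms, dtype=float)
    return 100.0 * lags.std(ddof=1) / lags.mean()

rng = np.random.default_rng(0)
# Illustrative lag samples (ms), loosely echoing Table 2's pattern:
c_to_pulse = rng.normal(-51.5, 8.6, 500)   # consonant GONS to pulse
v_to_pulse = rng.normal(-36.4, 19.8, 500)  # vowel GONS to pulse

# The consonant interval is the more stable one: smaller |RSD|.
assert abs(rsd(c_to_pulse)) < abs(rsd(v_to_pulse))
```

Because RSD normalizes by the mean, it lets intervals of different lengths be compared on one stability scale, which is why it can blow up for near-zero mean lags (e.g., C_GOFF in Table 2).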
As for the vowel-to-pulse and C-V measures, the vowel data were noticeably more variable than the consonant data. One source of this variability could be the fact that non-labial consonants were used: since the consonant and vowel articulations both involved the tongue, there could have been interference from coarticulation.
The covariation between consonant duration and consonant-to-pulse interval (Figure 6) could be interpreted in another way. The same result would be obtained if the syllable pulse is coordinated not to the onset of the consonant but to a point later in its trajectory. However, this would further predict that one of the non-GONS landmarks would have a particularly stable timing to the pulse, which is not found in Section 3.1. Thus, a simpler explanation is that the syllable pulse was again functioning as a proxy for the C_PVEL2 landmark.
In addition to the question of onset-based or non-onset-based timing, this result raises another important question about identifying articulatory coordination from kinematics. That is, what kind of evidence should be used? This paper leans heavily on measures of stability: if two articulations are coordinated with each other, it intuitively follows that the temporal relationship should be stable relative to less-coordinated movements. This has the advantage of allowing for some amount of variation, which is important in noisy speech data. However, another approach instead emphasizes covariation as evidence for coordination. This uses variability to illuminate patterns that might not be visible due to other factors, such as speech rate and stress. Some covariation results are presented here, but the study is limited by the fact that it was designed around a stability diagnostic.
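The contrast between the two kinds of evidence can be made concrete with a small simulation. This sketch is entirely hypothetical: a tightly coupled interval shows low dispersion (the stability diagnostic), while an interval that scales with a shared factor such as speech rate is noisier yet still lawful, which a correlation (the covariation diagnostic) can reveal.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
rate = rng.normal(1.0, 0.15, n)   # hypothetical speech-rate factor

# Two simulated regimes for a C-to-V interval (ms):
stable_lag = rng.normal(15.0, 4.0, n)                  # tight coupling: low SD
covarying_lag = 40.0 * rate + rng.normal(0.0, 8.0, n)  # scales with rate

# Stability diagnostic: compare dispersion directly.
assert stable_lag.std(ddof=1) < covarying_lag.std(ddof=1)

# Covariation diagnostic: the noisier interval is still lawful,
# correlating with the shared factor that inflates its variance.
r = float(np.corrcoef(rate, covarying_lag)[0, 1])
assert r > 0.3
```

A stability-only design would classify the second interval as poorly coordinated, whereas the covariation test recovers the underlying relationship; a design built around both diagnostics would address the limitation noted above.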

4.2. Articulatory Landmarks

In this study, the GONS landmarks were identified using acceleration peaks, which is not without precedent (Svensson Lundmark, 2023) but must occur after acceleration—and therefore after movement—has already begun. Where an acceleration peak occurs could differ according to the movement profile of the trajectory: relatively earlier for a shorter, faster movement and later for a longer, slower movement. Future research should continue to investigate the effects of using different landmarks but also explore other ways to characterize the timing of gestures.
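To make the landmark definitions concrete, the toy sketch below (not the study's actual pipeline) differentiates a synthetic closing movement and locates MAXC, PVEL, and GONS as a position extremum, velocity peak, and preceding acceleration peak, respectively. The sampling rate, trajectory shape, and single-movement assumption are all illustrative.

```python
import numpy as np

def landmarks_from_position(x, fs):
    """Toy landmark finder for a single closing movement.
    x: position samples (higher = closer to constriction); fs: Hz.
    Returns sample indices for GONS (acceleration peak), PVEL
    (velocity peak), and MAXC (position extremum)."""
    v = np.gradient(x) * fs          # velocity
    a = np.gradient(v) * fs          # acceleration
    maxc = int(np.argmax(x))
    pvel = int(np.argmax(v[:maxc])) if maxc > 0 else 0
    gons = int(np.argmax(a[:pvel])) if pvel > 0 else 0
    return gons, pvel, maxc

# Synthetic sigmoid-like closing gesture, assumed 145 Hz sampling
fs = 145
t = np.arange(0.0, 0.4, 1.0 / fs)
x = 1.0 / (1.0 + np.exp(-(t - 0.2) * 60.0))  # smooth rise to constriction

gons, pvel, maxc = landmarks_from_position(x, fs)
# GONS precedes PVEL, which precedes maximum constriction; note that
# the acceleration peak necessarily falls after movement has begun.
assert gons < pvel < maxc
```

The sketch makes the point in the paragraph above visible: even on a clean trajectory, the acceleration peak sits somewhat inside the movement, and its position shifts with the movement's shape.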
The automated method of identifying landmarks did not succeed for all tokens. In Table A2, for example, some landmarks were identified in as few as 1145 out of 2669 tokens. These failures likely had several causes. Perhaps the search windows were sometimes too large or too small, and more manual attention is needed to identify gestures properly. In inspecting many individual trajectories, however, it is also apparent that clearly defined movements are not always visible, especially in continuous speech. Both the tracking of fleshpoints (as in XRMB and EMA) and the search for landmarks simplify the complex reality of articulatory data, and this work may be missing important information as a result.
A simple account of purely onset-based coordination is contradicted by the C-V lag covariation results presented in Section 3.3. Longer C-V lag was correlated with longer consonant gestures, which, assuming pairwise timing, could only result from the vowel gesture being timed to a point later in the consonant gesture. An alternative explanation, that both gesture duration and C-V lag are correlated with speech rate, is contradicted by the negative correlation of C-V lag with vowel gesture duration. One explanation could be that both the consonant and vowel were timed with respect to a third entity, which could be a syllable pulse. Alternatively, some versions of pairwise timing that allow non-onset-based coordination, such as Gafos (2002), may also be able to account for these data.

4.3. Jaw Kinematics

This study also raises questions about the role of the jaw in articulatory timing. Section 3.5 attempted to use jaw kinematics alone to independently define a syllable pulse, but the resulting pulses were less stably timed to consonant and vowel gestures than those from the consonant-velocity method. Intriguingly, though, the jaw kinematics were more stably timed to consonant GONS than to vowel GONS, suggesting that the jaw is more closely coordinated with consonants than with vowels. This result is consistent with the idea that the position of the jaw is used to help meet other articulatory goals.
While this study did not use syllable triangles, it did begin to examine the role of jaw displacement, which Erickson and Kawahara (2015) used to define triangle height. With 14 words and seven vowel types represented in the data, this study is not able to draw conclusions about the relationship between jaw displacement and vowel kinematics. The magnitude of the maximal jaw displacement was found to be independent of consonant and vowel timing, which indicates that the degree of jaw opening is the result of other factors. Such factors could include vowel height and, as in the C/D Model, prosodic prominence. Further research with more diverse stimuli is required to properly examine the relationship between jaw movement magnitude and articulatory timing.

4.4. Future Prospects

While existing research in the C/D Model has derived the “height” of syllable pulses from articulatory data, work remains to be done on identifying the temporal location of the pulses. As discussed above, the method used to identify syllable pulses in this paper is not suitable for studying the relationship between pulses and articulatory gestures. The alternative “iceberg method” of Fujimura (1986) would also have yielded similar results; while not using the same landmarks, this method is also based on consonant kinematics. Preliminary investigation of jaw kinematics does not motivate their use for identifying syllable pulses.
At an early stage of the research process, the intention was to examine disyllabic words, whose triangles would be adjacent. The maximal jaw excursion in each syllable would be measured and used as the height of the triangles, enabling the angles and location of the pulses to be calculated. However, in the vast majority of cases, the disyllabic words lacked two distinct jaw opening movements. Impressionistically, it appeared that the unstressed syllables either lacked jaw cycles altogether or had jaw cycles that looked like temporary slowing in the slope toward the opening of the stressed syllable. There may well be an underlying syllable pulse in these cases, but the blended targets of several articulations did not result in a distinct excursion for each syllable. As an alternative, the velocity-peak method was used, but this meant that the syllable pulse was defined in terms of consonant kinematics.
Future experiments could, however, make use of the method originally planned for this study. They would need adjacent syllables with distinct jaw excursions and no intervening prosodic boundaries. The disyllabic words in English investigated here did not exhibit two distinct jaw excursions. A more promising option would be to elicit adjacent stressed syllables, perhaps monosyllables in particular phrasal positions such as those used in Erickson et al. (2024). Given the jaw position extrema of the two syllables, the location of the pulses could be calculated. Moreover, this method would also identify the “corners” of the syllable triangles, which the C/D Model hypothesizes to be the points to which articulatory movements are coordinated. This method would allow an actual test of the stability between articulatory movements and the coordinative structures of the C/D Model.
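Under strong simplifying assumptions (symmetric triangles, apexes located at the pulses, and coinciding inner corners), the geometry sketched above can be written down directly. The function below is a hypothetical illustration, not an implementation of any published C/D procedure; the pulse times and heights are invented values.

```python
import math

def shared_corner(t1, h1, t2, h2):
    """Hypothetical C/D-style sketch. Two adjacent symmetric syllable
    triangles have apexes (pulses) at times t1 < t2 (s) with heights
    h1, h2 (e.g., mm of maximal jaw excursion). If their inner corners
    coincide, both triangles share one base slope:
        slope = (h1 + h2) / (t2 - t1)
    and the shared corner, a candidate coordination point for the coda
    of syllable 1 and the onset of syllable 2, falls at:
        t1 + h1 / slope
    Returns (corner_time, base_angle_degrees)."""
    slope = (h1 + h2) / (t2 - t1)
    corner = t1 + h1 / slope
    return corner, math.degrees(math.atan(slope))

# A taller first syllable (greater jaw excursion) claims
# proportionally more of the inter-pulse interval.
corner, theta = shared_corner(t1=0.30, h1=12.0, t2=0.75, h2=8.0)
assert abs(corner - 0.57) < 1e-9
```

The same geometry also yields the outer corners (at distance height/slope on the far side of each pulse), which is what would permit the stability test between articulatory movements and triangle corners described above.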
Once syllable pulses can be reliably identified, a theory of Impulse Response Functions would need to be developed. Such a theory could be based, at least in part, on existing kinematic models such as Task Dynamics (E. Saltzman & Kelso, 1987) or General Tau Theory (Lee, 1998). The timing patterns of the kinematic model would be systematically associated with the C/D Model’s syllable triangles—presumably, onset and coda consonants would be timed to the corners, for example. Since the triangles are symmetrical, an important challenge for the model will be explaining the differences in behavior of onsets and codas. For instance, onset consonant gestures are often overlapped in particular ways, dubbed the “C-center effect,” which is not observed in codas (Browman & Goldstein, 1988).
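As one illustration of how a kinematic model could supply such response functions, a point-attractor gesture in the spirit of Task Dynamics can be simulated as a critically damped second-order system. The parameter values and simple Euler integration below are illustrative assumptions, not the C/D Model's machinery or the Task Dynamics implementation.

```python
import numpy as np

def critically_damped_gesture(x0, target, omega, dur, fs=1000):
    """Sketch of a point-attractor gesture in the spirit of Task
    Dynamics (Saltzman & Kelso, 1987): a critically damped
    second-order system
        x'' = -omega**2 * (x - target) - 2 * omega * x'
    integrated with explicit Euler steps. omega sets movement speed;
    a C/D-style account would have to specify how such parameters
    are driven by syllable-pulse magnitude."""
    n = int(dur * fs)
    dt = 1.0 / fs
    x, v = float(x0), 0.0
    traj = np.empty(n)
    for i in range(n):
        a = -omega**2 * (x - target) - 2.0 * omega * v
        v += a * dt
        x += v * dt
        traj[i] = x
    return traj

traj = critically_damped_gesture(x0=0.0, target=1.0, omega=40.0, dur=0.3)
# Critically damped: the trajectory approaches the target smoothly,
# without oscillation or (appreciable) overshoot.
assert traj[-1] > 0.99 and traj.max() <= 1.0 + 1e-3
```

Linking the output of such a dynamical system to triangle corners, rather than to other gestures, is precisely the elaboration the C/D Model would require.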
However, the C/D Model also presents unique opportunities in the study of consonant clusters. In addition to syllable triangles, whose corners are associated with onset and coda, the model also includes adjacent half-triangles corresponding to “p-fix” and “s-fix”. The latter provide a representational mechanism to explain some patterns of overlap in consonant clusters. For example, Hermes et al. (2017) found that onset clusters in Tashlhiyt Berber exhibit much more overlap than those of Polish. A C/D account could posit that multiple Tashlhiyt consonants are associated with the onset position, while Polish has one consonant associated with the onset and one with the p-fix.
The C/D Model also has the potential to explain how consonants become more temporally dispersed under higher degrees of prominence. For example, Gu (2023) found longer C-V lag in stressed syllables than unstressed ones. Rather than invoking a prosodic gesture as in Articulatory Phonology, a C/D account would reference the fact that syllable triangle bases expand in proportion to their height. If the C/D Model can be elaborated to incorporate consonant timing, it should be possible to predict and test the proportional relationship between prominence and C-V lag.
In summary, this study draws attention to an important gap in the C/D Model, namely, the temporal relationship between syllable triangles and C-V timing. Properly studying this relationship would require a method for identifying the timing of the syllable pulse that is independent of consonant kinematics. This would enable the kind of temporal analyses presented in this paper. Such work could allow the C/D Model, or its future developments, to better account for the relationship between prosodic and articulatory data.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data and code are available on OSF: https://osf.io/cxr2z/?view_only=c6d4e8d0a6c64c7e97a4a750cb03e240, accessed on 12 March 2025.

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A. Wordlist

The table below lists the words from the Wisconsin XRMB that were examined in this study, along with the articulators for each word. “x” refers to horizontal movement and “y” to vertical movement. “LL” = “lower lip”, and tongue sensors are numbered in order from the tongue tip (“T1”) to the tongue dorsum (“T4”). Consonants were always tracked as raising movements, while the movement of the crucial articulator for each vowel varied, as indicated in the “V traj” column. Segments, following Westbury et al. (1994), are transcribed in X-SAMPA.
Table A1. XRMB items used.
Word | Onset C | Onset Traj | Vowel | V Traj | Coda | Coda Traj
back | B | LLy | AE1 | T4x (front) | N | T1y
both | B | LLy | OW1 | T4x (back) | DH | T1y
but | B | LLy | AH1 | T3y (low) | T | T1y
cash | K | T4y | AE1 | T3y (low) | SH | T1y
coat | K | T4y | OW1 | T4x (back) | T | T1y
light | L | T1y | AY1 | T3y (low) | T | T1y
long | L | T1y | AO1 | T4x (back) | NG | T4y
much | M | LLy | AH1 | T3y (low) | CH | T1y
right | R | T1y | AY1 | T3y (low) | T | T1y
ship | SH | T1y | IH1 | T4x (front) | P | LLy
shoot | SH | T1y | UW1 | T4x (back) | T | T1y
Note: The low-front AE1 vowel was measured differently in different words. For cash, the lowering of the T3 sensor was used, while in back, it was the fronting of the T4 sensor. This was done to avoid coarticulation. In cash, the tongue was moving from a velar constriction, lowering, then forming a post-alveolar constriction, so the most salient movement was lowering. In back, however, the tongue was not raised to start, and so the most salient movement was fronting.

Appendix B. Number of Landmarks Identified

An automated procedure was used to identify articulatory landmarks, and it succeeded only for a subset of tokens. TextGrid annotations from Sprouse (2017) were used to identify the timestamps for each word and its corresponding segments. Then, local extrema were identified in the position trajectory, and from there, the peaks in velocity and acceleration were labeled as in Figure 1. Thus, while 2669 word tokens were used, the number of successful measurements is presented below. Also presented are the numbers of tokens for which syllable pulses were identified; the “C velocity” method was used throughout this paper, while the other two appear in Section 3.5.
Table A2. Number of tokens for which each landmark was identified.
Landmark | C (Syllable Onset) | Vowel | Jaw Trajectory
GONS | 2223 | 2482 | 2343
PVEL | 2361 | 2600 | 2498
NONS | 2067 | 1949 | 1606
MAXC | 2667 | 2665 | 2654
NOFF | 2011 | 1855 | 1150
PVEL2 | 2348 | 2565 | 2375
GOFF | 1870 | 1850 | 1145

Syllable Pulse Type | Count
C velocity | 1760
Jaw PVEL mid | 2375
Jaw plateau mid | 1145

Appendix C. LMM Parameters

Table A3. Model predicting temporal intervals from articulatory landmarks to the syllable pulse. Residuals presented in Table 2. Fixed effects of landmark are reported; there were also random intercepts for speaker and word (models with random slopes failed to converge). The reference level is C_GONS, i.e., the interval from C_GONS to the pulse. lmer syntax: duration ∼ landmark + (1|speaker) + (1|word).
Fixed Effect | Estimate | Std. Error | df | t | p-Value
(Intercept) | −51.3080 | 1.1101 | 18.1621 | −46.22 | <0.00001
C_PVEL | 6.0949 | 0.4769 | 26,099.2841 | 12.78 | <0.00001
C_NONS | 12.1358 | 0.4769 | 26,099.2841 | 25.45 | <0.00001
C_MAXC | 23.1977 | 0.4769 | 26,099.2841 | 48.65 | <0.00001
C_NMID | 22.4864 | 0.4769 | 26,099.2841 | 47.16 | <0.00001
C_NOFF | 32.8369 | 0.4769 | 26,099.2841 | 68.86 | <0.00001
C_PVEL2 | 39.0983 | 0.4769 | 26,099.2841 | 81.99 | <0.00001
C_GOFF | 48.6438 | 0.4769 | 26,099.2841 | 102.01 | <0.00001
V_GONS | 15.0240 | 0.4839 | 26,099.5637 | 31.05 | <0.00001
V_PVEL | 24.6425 | 0.4793 | 26,099.3650 | 51.42 | <0.00001
V_NONS | 29.8811 | 0.5118 | 26,100.9380 | 58.38 | <0.00001
V_MAXC | 43.9716 | 0.4769 | 26,099.2841 | 92.21 | <0.00001
V_NMID | 43.1146 | 0.5179 | 26,101.3014 | 83.26 | <0.00001
V_NOFF | 57.2009 | 0.5179 | 26,101.3014 | 110.46 | <0.00001
V_PVEL2 | 64.5579 | 0.4806 | 26,099.3897 | 134.32 | <0.00001
V_GOFF | 76.0705 | 0.5183 | 26,101.3152 | 146.76 | <0.00001
Table A4. Model predicting temporal intervals of C-V lag and from C and V onsets to the syllable pulse. Residuals presented in Table 3. Fixed effects of landmark are reported; there were also random intercepts and slopes for speaker and word. The reference level is C_GONS-V_GONS. lmer syntax: duration ∼ landmark + (landmark|speaker) + (landmark|word).
Fixed Effect | Estimate | Std. Error | df | t | p-Value
(Intercept) | 15.600 | 2.839 | 12.685 | 5.496 | 0.00011
C_NMID-V_GONS | −22.246 | 1.151 | 12.892 | −19.331 | <0.00001
C_GONS-Pulse | −67.176 | 3.559 | 12.913 | −18.872 | <0.00001
V_GONS-Pulse | −51.148 | 1.544 | 14.073 | −33.121 | <0.00001

References

  1. Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. [Google Scholar] [CrossRef]
  2. Browman, C. P., & Goldstein, L. (1988). Some notes on syllable structure in articulatory phonology. Phonetica, 45(2–4), 140–155. [Google Scholar] [CrossRef]
  3. Browman, C. P., & Goldstein, L. (2000). Competing constraints on intergestural coordination and self-organization of phonological structures. Les Cahiers de l’ICP. Bulletin de la communication parlée, 5, 25–34. [Google Scholar]
  4. Burroni, F. (2023, August 7–11). Lexical tones are timed to articulatory gestures. 20th International Congress of Phonetic Sciences, Prague, Czech Republic. [Google Scholar]
  5. Byrd, D., & Krivokapić, J. (2021). Cracking prosody in articulatory phonology. Annual Review of Linguistics, 7(1), 31–53. [Google Scholar] [CrossRef]
  6. Durvasula, K., Ruthan, M. Q., Heidenreich, S., & Lin, Y.-H. (2021). Probing syllable structure through acoustic measurements: Case studies on American English and Jazani Arabic. Phonology, 38(2), 173–202. [Google Scholar] [CrossRef]
  7. Erickson, D., & Kawahara, S. (2015). A practical guide to calculating syllable prominence, timing and boundaries in the C/D model. Journal of the Phonetic Society of Japan, 19(2), 16–21. [Google Scholar]
  8. Erickson, D., Lundmark, M. S., & Huang, T. (2024). Jaw opening patterns and their correspondence with syllable stress patterns. OSF Preprint. [Google Scholar] [CrossRef]
  9. Fujimura, O. (1979). An analysis of English syllables as cores and affixes. STUF-Language Typology and Universals, 32(1–6), 471–476. [Google Scholar] [CrossRef]
  10. Fujimura, O. (1981). Temporal organization of articulatory movements as a multidimensional phrasal structure. Phonetica, 38(1–3), 66–83. [Google Scholar] [CrossRef]
  11. Fujimura, O. (1986). Relative invariance of articulatory movements: An iceberg model. In J. S. Perkell, & D. H. Klatt (Eds.), Invariance and variability in speech processes. Lawrence Erlbaum Associates, Inc. [Google Scholar]
  12. Fujimura, O. (1992). Phonology and phonetics-A syllable-based model of articulatory organization. Journal of the Acoustical Society of Japan (E), 13(1), 39–48. [Google Scholar] [CrossRef]
  13. Fujimura, O. (2000). The C/D model and prosodic control of articulatory behavior. Phonetica, 57(2–4), 128–138. [Google Scholar] [CrossRef] [PubMed]
  14. Gafos, A. I. (2002). A grammar of gestural coordination. Natural Language & Linguistic Theory, 20(2), 269–337. [Google Scholar] [CrossRef]
  15. Gao, M. (2008). Tonal alignment in Mandarin Chinese: An articulatory phonology account [Unpublished doctoral dissertation (Linguistics), Yale University]. [Google Scholar]
  16. Goldstein, L., Nam, H., Saltzman, E., & Chitoran, I. (2009). Coupled oscillator planning model of speech timing and syllable structure. In Frontiers in phonetics and speech science (pp. 239–250). The Commercial Press. [Google Scholar]
  17. Gu, Y. (2023). Exploring the effect of stress on gestural coordination. Proceedings of the Linguistic Society of America, 8(1), 5539. [Google Scholar] [CrossRef]
  18. Haken, H., Kelso, J. S., & Bunz, H. (1985). A theoretical model of phase transitions in human hand movements. Biological Cybernetics, 51(5), 347–356. [Google Scholar] [CrossRef] [PubMed]
  19. Hermes, A., Mücke, D., & Auris, B. (2017). The variability of syllable patterns in Tashlhiyt Berber and Polish. Journal of Phonetics, 64, 127–144. [Google Scholar] [CrossRef]
  20. Kozhevnikov, V. A., & Chistovich, L. A. (1965). Speech: Articulation and perception (Joint Publications Research Service Trans.). US Department of Commerce Clearinghouse for Federal Scientific and Technical Information. [Google Scholar]
  21. Kramer, B. M., Stern, M. C., Wang, Y., Liu, Y., & Shaw, J. A. (2023, August 7–11). Synchrony and stability of articulatory landmarks in English and Mandarin CV sequences. 20th International Congress of Phonetic Sciences, Prague, Czech Republic. [Google Scholar]
  22. Lee, D. N. (1998). Guiding movement by coupling taus. Ecological Psychology, 10(3/4), 221. [Google Scholar] [CrossRef]
  23. Nam, H. (2007). A gestural coupling model of syllable structure. Yale University. [Google Scholar]
  24. Nam, H., & Saltzman, E. (2003, August 3–9). A competitive, coupled oscillator model of syllable structure. 15th International Congress of the Phonetic Sciences, Barcelona, Spain. [Google Scholar]
  25. Pastätter, M., & Pouplier, M. (2017). Articulatory mechanisms underlying onset-vowel organization. Journal of Phonetics, 65, 1–14. [Google Scholar] [CrossRef]
  26. Pouplier, M. (2020). Articulatory phonology. In M. Aronoff (Ed.), Oxford research encyclopedia of linguistics. Oxford University Press. [Google Scholar] [CrossRef]
  27. R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Available online: https://www.R-project.org/ (accessed on 9 October 2024).
  28. Saltzman, E., & Kelso, J. A. (1987). Skilled actions: A task-dynamic approach. Psychological Review, 94(1), 84–106. [Google Scholar] [CrossRef]
  29. Saltzman, E. L., & Munhall, K. G. (1989). A dynamical approach to gestural patterning in speech production. Ecological Psychology, 1(4), 333–382. [Google Scholar] [CrossRef]
  30. Shaw, J., Gafos, A. I., Hoole, P., & Zeroual, C. (2009). Syllabification in Moroccan Arabic: Evidence from patterns of temporal stability in articulation. Phonology, 26(1), 187–215. [Google Scholar] [CrossRef]
  31. Shaw, J. A., Oh, S., Durvasula, K., & Kochetov, A. (2021). Articulatory coordination distinguishes complex segments from segment sequences. Phonology, 38(3), 437–477. [Google Scholar] [CrossRef]
  32. Sprouse, R. (2017). xray_microbeam_database. Available online: https://github.com/rsprouse/xray_microbeam_database (accessed on 15 October 2024).
  33. Svensson Lundmark, M. (2023). Rapid movements at segment boundaries. The Journal of the Acoustical Society of America, 153(3), 1452–1467. [Google Scholar] [CrossRef]
  34. Svensson Lundmark, M. (2024). Magnitude and timing of acceleration peaks in stressed and unstressed syllables. In Interspeech 2024 (pp. 2630–2634). ISCA. [Google Scholar] [CrossRef]
  35. Svensson Lundmark, M., & Erickson, D. (2024). Segmental and syllabic articulations: A descriptive approach. Journal of Speech, Language, and Hearing Research, 67(10S), 3974–4001. [Google Scholar] [CrossRef] [PubMed]
  36. Svensson Lundmark, M., Frid, J., Ambrazaitis, G., & Schötz, S. (2021). Word-initial consonant–vowel coordination in a lexical pitch-accent language. Phonetica, 78(5–6), 515–569. [Google Scholar] [CrossRef] [PubMed]
  37. Tiede, M. (2005). Mview: Software for visualization and analysis of concurrently recorded movement data. Haskins Laboratory. [Google Scholar]
  38. Turk, A., & Shattuck-Hufnagel, S. (2020). Timing evidence for symbolic phonological representations and phonology-extrinsic timing in speech production. Frontiers in Psychology, 10, 2952. [Google Scholar] [CrossRef] [PubMed]
  39. Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace. [Google Scholar]
  40. Westbury, J. R., Turner, G., & Dembowski, J. (1994). X-ray microbeam speech production database user’s handbook. version 1.0. University of Wisconsin Waisman Center.
Figure 1. Schematic visualization of the landmarks described in Table 1. Note that all can be identified in the acceleration trajectory as either local maxima, local minima, or zero-crossings.
Figure 2. Lag from onset consonant (C) and vowel (V) landmarks to the syllable pulse, as identified with consonant velocities. The dashed vertical line indicates the syllable pulse. Landmark labels as described in Section 2.2.
Figure 3. Mean and SD values of lags from onset consonant (C) and vowel (V) landmarks to the syllable pulse, as identified with consonant velocities. Diagonal lines have a slope of 1 and −1. Landmark labels as described in Section 2.2. More-stable lags appear at the bottom of the chart, particularly toward the bottom corners. V_GONS appears most stable of the V landmarks; C lags are generally similar to each other, but C_GONS also appears stable (see text later in this section).
Figure 4. Mean and SD values of lags between consonant and vowel, as compared to lags between consonant or vowel to syllable pulse, as identified with consonant velocities. Diagonal lines have a slope of 1 and −1. Landmark labels as described in Section 2.2.
Figure 5. Relationship between total gesture duration (GONS-GOFF) and C-V lag (CGONS-to-VGONS). Each point represents one token.
Figure 6. Relationship between total gesture duration (GONS-GOFF) and timing to syllable pulse (GONS). Each point represents one token.
Figure 7. Relationship between maximum jaw displacement and several duration measures.
Figure 8. Distance to consonant GONS from jaw movement and syllable pulse measurements. The dashed vertical line indicates the consonant GONS.
Table 1. Gestural landmarks used in this study. All are defined for a particular articulator, within a window defined by acoustic annotations.
Label | Description | Definition
GONS | Gesture Onset | Acceleration peak
PVEL | Peak velocity (toward target) | Velocity maximum
NONS | Nuclear Onset (beginning of plateau) | Deceleration peak
MAXC | Max constriction | Position extremum
NMID | Nuclear midpoint (midpoint of plateau) | Mean of NONS and NOFF
NOFF | Nuclear Offset (end of plateau) | Acceleration peak (opposite direction)
PVEL2 | Peak velocity #2 (away from target) | Velocity maximum (opposite direction)
GOFF | Gesture offset | Deceleration peak
Table 2. RSD of lag from gestural landmarks to syllable pulse. Pooled data from all 14 CVC items are presented. The “Residual SD” presents the standard deviation of the residuals from a linear mixed-effects model that was fit to the data.
Landmark | Mean, SD | RSD | Residual SD
C_GONS | −51.50 ± 8.63 | −16.75 | 8.07
C_PVEL | −45.41 ± 8.08 | −17.80 | 7.70
C_NONS | −39.37 ± 7.82 | −19.86 | 7.70
C_MAXC | −28.31 ± 7.60 | −26.84 | 7.77
C_NMID | −29.02 ± 5.73 | −19.75 | 6.22
C_NOFF | −18.67 ± 5.94 | −31.79 | 6.98
C_PVEL2 | −12.41 ± 5.34 | −43.03 | 6.82
C_GOFF | −2.86 ± 11.07 | −387.09 | 12.82
V_GONS | −36.43 ± 19.75 | −54.23 | 18.35
V_PVEL | −26.84 ± 21.54 | −80.26 | 20.18
V_NONS | −21.60 ± 19.53 | −90.41 | 17.80
V_MAXC | −7.52 ± 22.42 | −297.58 | 20.97
V_NMID | −8.44 ± 16.42 | −194.54 | 14.72
V_NOFF | 5.65 ± 18.98 | 336.20 | 17.73
V_PVEL2 | 13.07 ± 19.97 | 152.78 | 18.92
V_GOFF | 24.50 ± 20.50 | 83.67 | 19.80
Table 3. Stability measures of C-V lags, along with lag from gestural landmarks to syllable pulse.
Lag | Mean, SD | RSD | Residual SD
CGONS-VGONS | 15.20 ± 20.27 | 133.40 | 17.67
CNMID-VGONS | −7.56 ± 20.06 | −265.39 | 16.90
CGONS-Pulse | −51.50 ± 8.63 | −16.75 | 6.75
VGONS-Pulse | −36.43 ± 19.75 | −54.23 | 17.01
Table 4. Comparison of models predicting C-V lag.
Model | Parameters | AIC | BIC | logLikelihood
baseline | 4 | 4996.4 | 5013.9 | −2494.2
+ C duration | 5 | 4995.3 | 5017.3 | −2492.7
+ V duration | 5 | 4884.2 | 4906.1 | −2437.1
+ Jaw duration | 5 | 4998.2 | 5020.1 | −2494.1
+ C duration and V duration | 6 | 4875.5 | 4901.9 | −2431.8
Table 5. Comparison of models predicting CGONS-to-syllable-pulse.
Model | Parameters | AIC | BIC | logLikelihood
baseline | 4 | 3995.3 | 4012.8 | −1993.6
+ C duration | 5 | 3790.0 | 3811.9 | −1890.0
+ Jaw duration | 5 | 3993.2 | 4015.1 | −1991.6
Table 6. Comparison of models predicting durations from maximum jaw displacement. p-values from lmerTest.
Model | Slope | Std. Error | df | t-Value | p-Value
C-V lag | 0.14 | 0.23 | 105.76 | 0.607 | 0.545
C-to-pulse | 0.18 | 0.12 | 312.42 | 1.488 | 0.138
V-to-pulse | 0.25 | 0.23 | 124.83 | 1.059 | 0.292
C duration | −0.168 | 0.222 | 10.251 | −0.763 | 0.446
V duration | 0.339 | 0.3193 | 136.04 | 1.062 | 0.29
Jaw duration | 0.72 | 0.27 | 133.27 | 2.677 | 0.008
Table 7. Distance to consonant GONS from jaw movement and syllable pulse measurements.
Measure | Mean, SD | RSD
to Consonant GONS:
jaw velocity midpoint | 49.87 ± 15.42 | 30.92
jaw acceleration midpoint | 54.22 ± 14.24 | 26.26
jaw GONS (acceleration peak) | 24.32 ± 15.40 | 63.31
syllable pulse (velocity peaks) | 51.50 ± 8.63 | 16.75
to Vowel GONS:
jaw velocity midpoint | 34.82 ± 22.44 | 64.43
jaw acceleration midpoint | 37.77 ± 23.21 | 61.46
jaw GONS (acceleration peak) | 8.98 ± 22.72 | 253.06
syllable pulse (velocity peaks) | 36.43 ± 19.75 | 54.23
