1. Introduction
Studies of speech articulation based on measured kinematic data, e.g., X-Ray Microbeam (XRMB), Electromagnetic Articulography (EMA), ultrasound, palatograms, MRI, are relatively new, at least compared with acoustic studies of spoken language. The first two, XRMB (e.g.,
Fujimura et al., 1973;
J. R. Westbury et al., 1994) and EMA (e.g.,
https://www.de/), involve tracking pellets/sensors on the lips, tongue, and lower incisor tooth (to track mandible movement) to see how the various articulators move while speaking; these methods have been especially relevant for developing articulatory models of speech organization, e.g., Articulatory Phonology (AP) (
Browman & Goldstein, 1992), the DIVA model (Directions Into Velocities of Articulators) (
Guenther, 1994), the Fujimura C/D model (
Fujimura, 2000;
Erickson, 2024a;
Fujimura & Williams, 2015) and the Geppeto Model (
Perrier, 2014). Other more recently introduced models include the XT/3C Model (
Turk & Shattuck-Hufnagel, 2020;
Turk et al., 2025), the Segmental Articulatory Phonetics Model (
Svensson Lundmark, 2023,
2025;
Svensson Lundmark & Erickson, 2024) and the Articulatory Prosody Model (
Erickson & Niebuhr, 2023;
Erickson, 2024b).
The goal of this paper is to focus on the C/D model, a model that has not received as much attention as other articulatory models. Thus, this paper can be seen as a tutorial on some of the basic tenets of the C/D model; it also includes examples from articulatory studies illustrating some of the tenets, specifically, the importance of the mandible in producing syllables as well as its role in organizing strings of syllables to output a spoken utterance. The merit of the C/D model as an articulatory model is that it can account for the effects of prosody on utterance-temporal organization of speech articulation in a way that other models cannot (e.g.,
Section 2.2.1). However, it is a complicated, not yet fully experimentally tested model. A basic tenet, as mentioned above, is that the mandible plays an essential role in producing syllables and in organizing syllables in phrasal units.
The C/D model, as proposed by Fujimura (e.g.,
Fujimura, 2000), is theoretically grounded in articulatory observations of X-ray microbeam (XRMB) data that show that utterance syllable prominence patterns, as manifested by the amount of jaw lowering per syllable, “dictate” the size, timing and phrasing of articulatory movements. In this regard, the C/D model stands out from the other acoustic and articulatory models, which hold that prosodic information is “suprasegmental,” e.g.,
Lehiste (
1970). To elaborate, speech is often analyzed as a series of consonant and vowel segments, with prosodic (“non-segmental”) information, e.g., duration, intensity, F0, added onto (above) the segmental information. This contrasts with the C/D model point of view in which phoneme segments per se do not exist; rather, the articulatorily defined syllable is the concatenative unit, generated by the phonologic prosodic specifications of the utterance, e.g.,
Section 2.2.1. Thus, the model posits prosody is the underpinning of articulation. In this way, the C/D model is unique. To date, no other articulatory model starts with prosody, i.e., has phonological/prosodic information as its input. One of the goals of this paper is to encourage further experimental studies to examine and document the importance of prosodic structure on articulatory events.
Some caveats are in order: our interpretations of the model are based on working within the model’s framework, and as such, there will be simplifications (e.g.,
Section 2.2.1). Moreover, an underlying assumption of the model is that strength of articulation rather than timing of articulation is the governing principle of temporal organization of speech; thus, there is no attempt in the current version of the model to connect articulation to specific measurable time points in the acoustic signal.
The C/D model was first proposed by Osamu Fujimura in 1991, as a paper presented at the 12th International Congress of Phonetic Sciences, entitled “Prosodic effects on articulatory gestures—A model of temporal organization” (
Fujimura et al., 1991). In subsequent publications, he continued to elaborate and revise the C/D model (
Fujimura, 1994,
2000,
2002,
2008). (See also publications in
Erickson & Imaizumi, 2015;
Erickson & Kawahara, 2015; also,
Erickson, 2024a). The C/D model was intended to explain the temporal organization of speech, based on intensive examination of articulatory patterns observed in X-Ray Microbeam articulatory data. Trained as a physicist, Fujimura proposed “a model to tackle a complex system that has aspects of discrete—symbolic—information processing and physical movement as well as sound production at the end” (Reiner Wilhelms Tricario, p.c.).
In order to highlight some of the novel points of the C/D model with regard to prosody, presented here is a brief comparison with AP, which is currently the most widely used model by linguists to describe speech articulation. Both AP and the C/D model describe articulatory kinematics of speech. However, their approach is radically different in terms of the underlying (timing) framework. The framework for AP is a coupled oscillator model (Task dynamics (TD), e.g.,
Saltzman and Munhall (
1989), inspired by work on coordinated arm motion by
Saltzman and Kelso (
1987)). We briefly note here that the applicability of TD to speech articulation remains an open question: the arm is a jointed system, in contrast to the speech articulators. The tongue is a soft-tissue articulator, while the jaw joint has fewer degrees of freedom than the ball-and-socket joints of the arms.
AP describes articulatory gestures as the smallest phonological units; these are organized sequentially as second-order differential equations of the TD-coupled oscillator model to account for the production of consonant and vowel sequences. In the original interpretation of AP, timing of gestural onsets is coordinated as being either in-phase or anti-phase within the TD framework; the timing of the gestures is tightly connected with their acoustic output of a series of consonants and vowels that make up the spoken utterance. In the traditional AP model, there is no acknowledgment per se of prosody; prosody, especially prominence, is a by-product of how the gestures coordinate. In later versions of AP, e.g.,
Saltzman et al. (
2008), gestural planning and modulation oscillators are proposed in order to give a suprasegmental account of prosody, e.g., phrasing in an utterance. Also see
Byrd and Krivokapić (
2021) for handling prosody in AP with timing modulating gestures.
As for the C/D model, the framework is the abstract phonological/prosodic structure, as represented by an augmented metrical tree specifying syllable stress levels (i.e., syllable magnitudes) (e.g.,
Fujimura, 2000); the smallest phonological unit is the syllable. In this sense, timing is relative to the other syllable members of the utterance; there is no mention of timing in terms of absolute measurable durations of syllable units, nor of segments within a syllable, e.g., C–V timing relationships. In the C/D model, the magnitude of the syllable affects the magnitude/strength of the “segmental” articulators, and also, very importantly, the strength of the various boundaries (e.g., word and phrase boundaries) in the spoken utterance. Thus, in the C/D model, strength of articulation is the organizing principle; it is this strength that affects the timing of syllable units.
In contrast to the C/D model, gestures are the phonological units in AP; each gesture involves sets of articulators working together to produce the desired place and degree of oral constriction for an acoustic segment (e.g., the LIP gesture, which involves LP (lip protrusion) and LA (lip aperture), works with the VEL gesture (velic aperture) and the GLO gesture (glottal aperture) to produce a bilabial nasal constriction, i.e., /m/). The AP model has no gesture for the mandible (jaw) per se. The kinematics of the other gestures, by default, incorporate the jaw position for each acoustic segment.
In the C/D model, instead of gestures there are ballistic movements which are referred to as impulse response functions (IRFs); these trigger the appropriate articulators to produce the onset and coda portions of the syllable. Instead of a vowel gesture, the vocalic portion is described in terms of tongue horizontal and vertical positions. The phonological unit in the C/D model is the syllable; the syllable articulator is the jaw which provides a skeleton framework describing the prominence patterns of the utterance.
Thus, another crucial difference between the C/D model and AP is the role of the syllable. In the C/D model, the syllable organizes the prosodic phonological structure of an utterance, in that the magnitude, i.e., prominence, of the syllable is commensurate with the magnitude of the “segmental” articulators and also the magnitude of the various boundary pulses. The C/D model is the only model that has prosody, and specifically, the syllable as implemented by jaw lowering, as the underlying framework of temporal organization of speech, e.g., stress patterns, articulatory strength and boundary strength.
The organization of this paper is as follows:
Section 2 describes the basic components of the C/D Model;
Section 3 presents a summary of how the C/D model observes and interprets articulatory events to account for temporal organization of spoken language;
Section 4 addresses applications of the C/D Model’s approach to prosody: transference of first language prosody patterns to second languages; first language acquisition, mandible patterns and neural nesting; clinical applications, specifically, stuttering and Parkinson’s disease; and new insights into prosodic phonology;
Section 5 describes a new tool for investigating jaw movement;
Section 6, entitled “Now what?”, summarizes strengths of the C/D model and brings up further yet-to-be investigated aspects of the C/D model.
2. Components of the C/D Model
The model is called the Converter/Distributor Model (C/D Model) because it takes the abstract prosodic and phonological information as its input, which is subsequently Converted to strings of syllables; then the prosodic and phonological information is Distributed to articulatory movements, which are implemented by control function/signal generators.
The C/D model with its many component levels is complicated. In the following sections, I try to present the basics of the model. The model starts with the phonological prosodic input to spoken utterances, in terms of metrical phonological information (
Section 2.1).
Section 2.2 describes the role the mandible plays in implementing this information. This section also details an articulatory experiment illustrating how the jaw is both the syllabic and prosodic articulator and how syllable prominence/jaw lowering determines the location and size of phrase boundaries.
Section 2.2 and
Section 2.3 show how the phonological prosodic information affects strength of articulation of the syllable, not only the syllable nucleus but also the onset and coda. Syllable strength in the C/D model is referred to as “syllable magnitude,” represented in the model by syllable pulses whose height represents the magnitude of the syllable nucleus, as well as the magnitude of the syllable edge articulations. Syllable edge articulations are referred to as Impulse Response Functions (IRFs) which describe the feature specifications for the syllable onsets and codas. The IRFs specify sets of place, manner and voicing features, i.e., place of constriction in the vocal tract, degree of constriction in the vocal tract, and voicing of the vocal folds during constriction (discussed in more detail in
Section 2.2.3). These features are then implemented by specific articulators by means of neural commands to the appropriate muscles to move the articulators; the strengths of the neural commands/muscle movements are specified by the phonological prosodic input to the utterance.
The C/D model challenges us to view spoken utterances in a different light—as patterns of syllable prominences articulated by varying degrees of jaw lowering. The reader is urged to take on this challenge as they read through the next sections. Note: the following explication of the model is couched in terms of prosodic organization of English utterances; however, the model theoretically is applicable to all languages.
2.1. Prosodic Input of the C/D Model
As mentioned above, for the C/D model, prosodic organization is the driver of articulatory speech kinematics. The term “prosodic organization” here refers to stress/prominence patterns, as described by metrical trees (e.g.,
Liberman & Prince, 1977). A simplified approach to metrical trees is the observation that syllables in an utterance are chunked into smaller phrase units (e.g., foot, phrase (also known as accent phrase or intermediate phrase), and utterance). Within each phrasal unit, one word has more prominence than the others in that phrasal unit. The amount of prominence in each phrase unit is commensurate with the size of the unit, such that the amount of prominence on the prominent word in the utterance is the largest of all the prominences; this prominent word is often referred to as broad focus or nuclear stress. This pattern of prominences can be described in terms of a metrical tree, showing branches of strong-weak (s-w) syllables, with syllables with the most s-assignments having the largest prominence and those with the fewest s-assignments having the weakest prominence. In this way, the metrical tree generates numerical values of syllable prominence within an utterance.
In the C/D model, the metrical trees generate the utterance skeleton, a series of pulses, the height of which represents the strength/magnitude of the syllable, both the syllable nucleus and the syllable edges. Fundamental frequency (F0) patterns are part of the base function (see
Section 2.2.3 for more discussion). Thus, the framework of the C/D model is such that (1) the syllable is the basic unit of speech, (2) the “syllable magnitude” (syllable prominence) is a product of the metrical organization of the utterance, (3) increased prominence is implemented by increased articulatory strength, and (4) increased articulatory strength also yields larger phrase breaks within utterances. This aspect of the C/D model, that increased articulatory strength affects phrase boundaries, is illustrated in
Section 2.2.1.
A diagram of the input to the C/D model is shown in
Figure 1 (from
Fujimura et al., 1991). The prosodic phonological input for the utterance
That’s wonderful is described in terms of a metrical tree plus utterance parameters with numeric controls, e.g., speed, formality, excitement, dialect, speaker age, specified by the small letters on the left of the figure. The strong-weak branches of a metrical tree, along the lines of
Liberman and Prince (
1977), describe the arrangement of syllable magnitudes. The beginning and end of the utterance are marked by
$, and the phrase break after
that’s is marked with %.
2.2. Converter Component of the C/D Model
The Converter takes all the information in the input and outputs a base function, which includes, among other things, the utterance skeleton. The skeleton describes the prominence and phrasing patterns (morpheme, word, and phrase boundaries) of the prosodic input of the C/D model.
2.2.1. The Skeleton
Converting Syllable Prominence Values to Syllable Pulse Heights
The Converter takes the prominence value of each syllable, together with the other utterance parameters as specified by the metrical tree, and converts these prominence values to syllable pulses. The height of each pulse represents the magnitude (prominence) of each syllable. An utterance is thus defined as having a “skeleton” consisting of a series of syllable pulses of varying heights representing various syllable magnitudes. In the C/D model, the syllable articulator is the mandible, also known as the jaw. The working hypothesis is that the syllable magnitude is to a first approximation commensurate with the amount of jaw lowering for each syllable as measured from the occlusal plane. This hypothesis is substantiated by the following experimental research findings.
Review of Experimental Findings Supporting the Converter Component
First, that jaw lowering increases when a word (syllable) is emphasized/focused has been well documented in English (e.g.,
Kent & Netsell, 1971;
Stone, 1981;
Macchi, 1995,
1998;
Summers, 1987;
J. Westbury & Fujimura, 1989;
Beckman & Edwards, 1994;
de Jong, 1995;
Erickson, 1998a,
1998b,
2002,
2003;
Harrington et al., 2000;
Menezes, 2003,
2004) as well as in other languages, e.g., French (
Loevenbruck, 1999,
2000;
Tabain, 2003) and Japanese (
Erickson et al., 2000). These studies examined emphasis on words containing low /ɑ/ vowels; increased jaw lowering is also reported for emphasized high and mid vowels, e.g.,
Erickson (
2002,
2003),
Harrington et al. (
2000).
Second, that jaw lowering also correlates with syllable stress/prominence levels has been reported by, e.g.,
Erickson (
2004),
Erickson et al. (
2012,
2015, in press),
Erickson and Niebuhr (
2023),
Menezes (
2003,
2004),
Svensson Lundmark and Erickson (
2024). These studies confirm a strong connection between a syllable’s stress level and the amount of jaw lowering (jaw displacement) relative to the occlusal plane for that syllable. An example of this in English is shown in
Figure 2. The bottom panel of
Figure 2 shows jaw tracings from electromagnetic articulographic (EMA) recordings of an American English speaker producing “five bright highlights in the sky tonight”, taken from the longer utterance, “
Yes, I saw five bright highlights in the sky tonight.” Notice that the jaw opens and closes for each monosyllabic content word, each of which contains the same phonological vowel /ɑɪ/, yet notice that the amount of jaw displacement varies for each syllable. The arrows point to the monosyllabic words with the most jaw displacement: the largest jaw displacement is on
sky, the next on
high(lights), and the next on
five. As shown in the metrical grid in the top panel, this pattern of jaw displacement correlates with the prominence/stress values of each of the syllables, with
sky having nuclear stress,
high having phrasal stress, and
five, foot stress. Regression analyses, as reported in
Erickson et al. (
2012), show a significant correlation between the amount of jaw displacement and the syllable stress patterns as shown in the stress level row of the metrical grid. Note about metrical grids: similarly to metrical trees, metrical grids also show hierarchical prominence patterns. Metrical grids are easier to draw than metrical trees. In
Figure 2, the prominence values for each word are calculated by assigning a prominence to a syllable, then another one for word, another one for foot, another one for phrase and, finally, one for utterance. Adding up the number of filled-in squares yields the numerical prominence value of a specific syllable in the utterance. Thus, the largest prominence value for this speaker for this utterance, i.e., the nuclear stress word (broad focus), was on
sky. For more discussion of nuclear stress and jaw displacement patterns, see
Erickson et al. (
2012),
Erickson and Niebuhr (
2023). For more information about metrical grids, see
Selkirk (
1982) and
Hayes (
1995).
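The counting procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the level names follow the text, but the function name `prominence`, the dictionary `highest_level`, and the level assigned to *bright* are hypothetical and are not read off Figure 2.

```python
# Grid levels, from lowest to highest, as named in the text.
LEVELS = ["syllable", "word", "foot", "phrase", "utterance"]

def prominence(highest):
    """Return the number of filled grid squares for one syllable.

    A syllable marked at a given level is assumed to be marked at every
    lower level as well, so the count is the level's rank in LEVELS.
    """
    return LEVELS.index(highest) + 1

# Hypothetical highest grid level per word. The levels for sky, high and
# five follow the text (nuclear, phrasal and foot stress); the level for
# bright is an illustrative assumption.
highest_level = {
    "five": "foot",
    "bright": "word",
    "high": "phrase",
    "sky": "utterance",
}

values = {word: prominence(level) for word, level in highest_level.items()}
ranked = sorted(values, key=values.get, reverse=True)
```

Under these assumptions the ranking places *sky* above *high* above *five*, the same ordering the text reports for jaw displacement.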
The words in
Figure 2 all contain the same phonological vowel in order to not introduce jaw height as a complicating factor, given that jaw displacement is greatest for low vowels and least for high vowels (
Menezes & Erickson, 2013;
Williams et al., 2013). A pilot study normalizing the amount of jaw displacement across vowel heights shows that jaw lowering correlates with syllable stress, thus supporting the C/D Model hypothesis that the amount of jaw displacement is commensurate with syllable magnitude.
Under emphasis/focus, jaw displacement increases even more (see, e.g.,
Erickson, 2004;
Erickson et al., 2015). As reported by
Svensson Lundmark et al. (
2023), narrow focus on a normally produced weak syllable (w) will increase mandible lowering on that syllable such that the syllable is now a strong syllable (s) and mandible lowering on the next syllable is reduced. The result is that the utterance prominence pattern in terms of weak (w) and strong (s) syllables is changed. They reported that the sentence “The fat cat sat with Matt,” spoken with broad focus, where both
cat and
Matt are strong syllables (
cat has phrasal stress and
Matt, nuclear stress), has a prominence pattern of ws wS, where “s” indicates more jaw lowering and “S” indicates the most jaw lowering. When focus was put on
fat, the jaw lowered more, and the prominence pattern changed in the first phrase from ws to sw.
Thus, experimental studies support the C/D model hypothesis that the amount of jaw lowering per syllable is commensurate with the prominence of that syllable. The converter component of the C/D model generates a series of syllable pulses, where the height of each represents the syllable magnitude.
Converting Location (Timing) of Syllable Pulses
The objective of this section is to show how the C/D model observes and interprets articulatory events to account for temporal organization of spoken language, not in terms of durational timing between articulation and acoustics, but in terms of rhythmic organization of syllable prominences and subsequent syllable boundaries. The rough details of how this is done are described in this section. First, however, in order to calculate the timing of the syllable pulse, it is necessary to assess the magnitude of each syllable, i.e., the prominence value of each syllable, as output by the metrical information and implemented by jaw displacement commensurate with the magnitude value of each syllable. Hence, in the above section jaw displacement was discussed as a function of prominence. Now, we turn toward timing of syllables and boundaries.
According to the C/D model, the height of the syllable pulse is to a first approximation based on the amount the jaw lowers below the occlusal plane for that syllable. The timing of the syllable pulse, however, is NOT at the point where the jaw is maximally low. This is an important aspect of the model. The timing of the pulse within the syllable is determined by the velocity of the crucial articulators (CA) of the onset and coda. (Note: actually, Fujimura referred to “iceberg points”, which are discussed in more detail in
Section 6).
What follows are reports of experimental applications of some of the principles of the C/D model. The experiment was reported in a number of earlier publications (e.g.,
Erickson et al., 2015;
Kim et al., 2015;
Erickson & Kawahara, 2015;
Erickson, 2024a). The application to the C/D model was first outlined as an invited lecture entitled, “Converter/Distributor model: for describing spoken language rhythm,” presented at the ABRALIN conference, 31 October 2023 in Curitiba, Brazil. Later this was written up as a short dictionary entry in Speech Sciences (
Erickson, 2024a). Permission has been obtained to include parts of this entry in this manuscript.
The articulatory experiment involved the sentence
Pam said bat that fat cat at that mat, spoken by two speakers, who varied the position of emphasis in the utterance, i.e., on
bat,
that,
fat,
cat,
mat. The sentences were presented on a PowerPoint display, and the speakers were asked to emphasize the word in bold letters.
Figure 3 shows articulatory tracings of the segmental articulators (referred to in the C/D model as Crucial Articulators TD, TT, LL) and the syllable articulator (mandible/jaw) for the utterance
Pam said bat that fat cat at that mat, where
bat is emphasized. The vowels in this utterance are all /æ/ vowels, except for /ɛ/ in said, yet each syllable shows a different amount of mandible lowering (i.e., jaw displacement). Based on the amount of mandible lowering (from the occlusal bite plane) for each syllable in the utterance, a string of syllable pulses is created. The articulatory data shown in
Figure 3 is position data (vertical dimension) of the Crucial Articulators (CAs) for the syllable onset and coda for each of the monosyllabic words. For instance, the CAs for
Pam, the initial word of the utterance, are the Lower Lip (LL) for both the syllable onset and coda; for the emphasized word
BAT, the crucial articulator for the syllable onset is the LL and for the coda, the Tongue Tip (TT). For the syllable onset of
that, the CA is the Tongue Dorsum (TD). The position data for the jaw is shown in the bottom panel. As described in the section above, the converter creates a string of syllable pulses, whose heights are commensurate with the stress pattern, as articulated by the syllable articulator, the jaw.
For the utterance shown in
Figure 3, auditory impressions indicate that the speaker correctly produced emphasis on
bat but also added prominence to fat. Acoustically, both
bat and
fat have increased duration and increased intensity compared to the other words in the utterance, with
bat longer than
fat by 0.032 s while
fat is louder than
bat by 1.3 rms. Both
bat and
fat were produced with pitch accents;
bat with an H*+L and
fat with an L*+H, and
fat having a higher maximum F0 than
bat by 24.8 Hz. As for jaw displacement, both
bat and
fat have more jaw lowering than the other words in the utterance, but the jaw lowers more for
bat than
fat by 2.71 mm. The C/D model in its current, not yet fully developed form focuses on articulation of syllables and how syllables relate to adjoining syllables; it does not address acoustic characteristics per se. For a more in-depth discussion about acoustic and articulatory cues as they relate to perception of prominence, the reader is referred to
Erickson et al. (
in press).
As seen in
Figure 3, the bottom panel shows the jaw tracings. According to the C/D model hypothesis, the amount of jaw opening per syllable represents the numerical amount of prominence per syllable. However, the location of the syllable pulse is NOT the point in time when the jaw is maximally low. In order to position the syllable pulses in the utterance, the Converter creates a time location for each syllable as the midpoint between the maximum velocities of the syllable onset and coda CAs. Note that the C/D model referred to “iceberg points” instead of maximum velocity times. With regard to a difference between “iceberg points” and “maximum velocity points”, the “iceberg” threshold is an optimal point of relative invariance of velocity, which differs from the peak velocity (
Bonaventura, 2003). But for simplicity, we use maximum velocity points. A discussion about “iceberg points” can be found in
Fujimura (
1986,
2000), and
Bonaventura and Fujimura (
2007); for a comparison of iceberg points with maximum velocity points, see
Kim et al. (
2015).
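The midpoint rule described above can be sketched as follows. This is a simplified illustration, assuming maximum velocity points rather than iceberg points, and assuming the onset and coda movements of the crucial articulators have already been segmented into separate position traces; the function names are hypothetical and the velocity is a plain finite difference.

```python
def peak_velocity_time(times, positions):
    """Return the time at which the absolute finite-difference velocity peaks.

    The velocity between two samples is assigned to the midpoint of the
    sampling interval.
    """
    best_time, best_speed = None, -1.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        speed = abs(positions[i] - positions[i - 1]) / dt
        if speed > best_speed:
            best_speed = speed
            best_time = 0.5 * (times[i] + times[i - 1])
    return best_time

def syllable_pulse_time(onset_t, onset_pos, coda_t, coda_pos):
    """Place the syllable pulse midway between the onset CA's and the
    coda CA's maximum velocity times."""
    t_onset = peak_velocity_time(onset_t, onset_pos)
    t_coda = peak_velocity_time(coda_t, coda_pos)
    return 0.5 * (t_onset + t_coda)

# Toy traces (seconds, mm): an onset CA release and a coda CA closure.
pulse = syllable_pulse_time(
    [0.00, 0.01, 0.02, 0.03], [0.0, 1.0, 4.0, 5.0],
    [0.20, 0.21, 0.22, 0.23], [5.0, 4.0, 1.0, 0.0],
)
```

Note that the jaw trace plays no part in this calculation, which is why the pulse time need not coincide with maximum jaw lowering.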
Figure 4 is like
Figure 3, except that it also includes velocity information of the syllable onset and coda CAs, necessary to locate the syllable pulse within the syllable.
The yellow vertical lines on each side of the syllable mark the point in time of maximum velocities of the onset and coda CA, e.g., the red arrows marking the maximum velocity of the LL articulator for the onset of
bat and
fat and the blue arrows marking the maximum velocity of the TT articulator for the coda of
bat and
fat. The white lines in the center between the yellow maximum velocity lines mark the point in time where the syllable pulses occur. Notice that sometimes the syllable pulse coincides with the maximum jaw lowering, but when the syllable is emphasized (
bat) or has more stress (
fat), the syllable pulse occurs before the maximum jaw displacement. For a report on timing of maximum jaw displacement relative to onset of vowel as a function of emphasis, see
Erickson et al. (
2024a).
An indirect by-product of positioning the syllable pulse at the midpoint between the maximum velocities of the syllable onset and coda CAs is boundary strength information. As can be observed from
Figure 4, the yellow lines marking the CA max velocities do not overlap. The distances between the contiguous yellow lines are related to the distance between syllables, i.e., the syllable boundaries. How the Converter calculates abstract syllable durations is described in the next section.
Converting Syllable Boundaries via Syllable Triangles
Thus, the timing of the syllable pulses is at the midpoint between the maximum velocities of the CAs, while the heights are related to the amount of jaw lowering for each syllable. To calculate abstract syllable durations, the following process is used.
An isosceles triangle is constructed for each syllable, with its apex at the height of the syllable pulse. The base angle, which is the same for all syllable triangles in the utterance, is determined by the hypothesis that no two sides of the syllable triangles may overlap, with only one set of triangle sides touching. A MATLAB algorithm for constructing syllable triangles can be found in
Erickson et al. (
2015).
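One way to read the constraint above is sketched below in Python. It is not the published MATLAB algorithm: the function name `syllable_triangles` and the choice to express the shared angle as a side slope (pulse height per unit time) are assumptions made for illustration. Each triangle has half-width height/slope, so the shallowest slope at which no adjacent triangles overlap is the largest value of (summed heights)/(pulse spacing) over adjacent pairs; the pair that sets this value is the one whose sides touch.

```python
def syllable_triangles(pulse_times, pulse_heights):
    """Fit isosceles triangles with one shared side slope to a pulse train.

    Requires at least two syllables. Returns a tuple of
    (slope, abstract syllable durations, abstract boundary durations).
    """
    pairs = list(zip(pulse_times, pulse_heights))
    adjacent = list(zip(pairs, pairs[1:]))
    # Slope at which each adjacent pair would just touch on the time axis.
    needed = [(h0 + h1) / (t1 - t0) for (t0, h0), (t1, h1) in adjacent]
    # The largest requirement guarantees no overlap anywhere.
    slope = max(needed)
    # Base width of each triangle: the abstract syllable duration.
    durations = [2.0 * h / slope for h in pulse_heights]
    # Gap between neighbouring triangle bases: the abstract boundary duration.
    boundaries = [(t1 - t0) - (h0 + h1) / slope for (t0, h0), (t1, h1) in adjacent]
    return slope, durations, boundaries

# Toy pulse train: times in seconds, heights in mm of jaw lowering.
slope, durations, boundaries = syllable_triangles(
    [0.0, 0.4, 0.6], [6.0, 3.0, 3.0]
)
```

In this toy example the second and third triangles touch, so their boundary is zero, while the gap after the first syllable comes out at about 0.1 s.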
Figure 5 shows the results of the algorithm for calculating syllable triangles and syllable boundaries for the utterance,
Pam said bat that fat cat at that mat. Based on the algorithm, the two syllable triangles that touch are
cat and
at. Since the other syllable triangles all have the same angle as these two syllables, the result is (a) each syllable has its own abstract syllable duration and (b) each syllable is accompanied by an abstract boundary duration. The magnitude of each syllable is represented by the height of the syllable pulse, and the magnitude of the boundaries, by the distance between each syllable triangle. In this utterance, emphasized
bat has the largest syllable pulse, which is expected, since it was the emphasized word. This is followed by
fat. The largest syllable boundary, according to the output of the algorithm, follows
that, the next largest follows
fat. Tentative confirmation of this approach can be found in the pilot study by
Erickson et al. (
2015); they reported that listeners’ perceptions of prominence and boundaries show a significant correlation with the syllable (jaw displacement) and boundary magnitudes generated by the Converter’s syllable triangle algorithm.
By applying “reverse engineering,” we can construct a possible metrical tree for this utterance, as shown in
Figure 6. Note that although the speaker was instructed to place emphasis on
bat, he also put more prominence on
fat. In doing so, it seems he separated the utterance into two major phrases. The tentative results presented here offer further support of the Converter’s approach to generating syllable and boundary magnitudes from articulatory kinematics.
The above discussion is offered as an example of how theoretically the C/D model calculates syllable boundaries. What is intriguing to me is that the syllable-triangle/iceberg method of the C/D model outputs a metrical arrangement of syllable relationships akin to that of the prosodic phonological input for the utterance. In effect, the C/D model provides a reverse-engineering approach to recovering the phonologic prosodic input to a spoken utterance. To what extent it provides realistic information about temporal organization of spoken utterances needs further testing.
2.2.2. The Base
The base function includes the skeleton, as well as the syllable features and the melody (F0 patterns). An overview of the Base function can be seen in
Figure 7, which provides a detailed look at how the Converter handles a single syllable, e.g., /kit/. The various components of the Converter are displayed in four levels/rows.
The top level is the syllable magnitude information, the second level is the syllable features specification level, the third level describes tongue movement for the vocalic information, and the last level, the voicing information. There is also a fifth level to account for f0 movements, i.e., the melody, which will be discussed later.
A brief review of the top level, as shown in
Figure 7: The top level is the syllable magnitude information, which, according to the model, is commensurate with the amount the mandible lowers for making this syllable. The pulse also includes information about the vowel nucleus, which is implemented in the third level specifying tongue advancement. The black vertical line in the top panel indicates the height of the syllable pulse, based on the theoretical prominence value of the single-syllable utterance; the dashed diagonal lines on either side are the syllable triangle lines. The two edges of the triangle indicate the start in time of the (abstract) syllable onset and coda. (Note that the isosceles triangle is a modification of the original thinking presented in
Fujimura et al., 1991). On the left side of the syllable base, the blue line indicates the magnitude of the syllable onset pulse; the purple line to the right of the syllable, the magnitude of the syllable coda pulse. Notice that the onset and coda pulses are the same height (magnitude) as the syllable pulse. This is an important tenet of the C/D model; it implies that the kinematic strengths of the syllable pulse and onset/coda pulses are the same, i.e., a syllable with large prominence is produced with increased jaw displacement together with increased CA strength.
2.2.3. Syllable Features for Describing Syllable Onset, Nucleus, and Coda
The phonological information is not specified in terms of consonant and vowel segments; it is formatted in terms of feature sets: place, manner and voicing. ‘Place’ refers to where the constriction in the vocal tract occurs; ‘manner’ refers to the nature of the constriction, e.g., complete or partial; and ‘voicing’ refers to vocal fold adduction, which is handled in the fourth level of the Base function, according to
Figure 7.
Impulse Response Functions (IRFs)
The second layer in
Figure 7 shows the Impulse Response Functions (IRFs), which are triggered by the onset and coda pulses. The IRFs consist of feature sets. In this case, the initial IRF set is indicated by {K, τ} to specify a velar place (K) and stop manner (τ) syllable onset, while the final IRF feature set is indicated by {T, τ} to indicate an apical place (T) and stop manner (τ) as the syllable coda. (See
Fujimura, 1994 for a description of features and their symbols). The IRFs generate a response curve, the dashed blue and purple curved lines; note that the peak of the slope does not align with the IRF pulse, and the onset of the curve starts before and ends after the pulse. The strengths of the IRFs are dictated by the magnitude of the onset and coda pulses, which are the same magnitude as that of the syllable pulse. The bold blue and purple horizontal lines for the syllable onset and coda indicate the duration of the closure period of the articulators for producing velar K and apical T, respectively. Notice the closure for the onset starts before the onset pulse and ends right at the coda pulse. Presumably, this is meant to describe the asymmetric patterns between the onset and coda response curves.
The third level is the vocalic level. The underlying base of the syllable is from the blue onset pulse to the purple coda pulse, marked by dashed red horizontal lines. Since the vocalic syllable pulse (red upward arrow) generates a tongue advancement which starts before, and ends after, the onset and coda pulses, the surface duration extends beyond the base duration.
The last level is the voicing level. This level specifies laryngeal adduction for the voicing feature (which is not marked in the feature specifications if the syllable margin IRF is voiceless). It is triggered by the magnitude of the onset and coda pulses, and the IRF pulses. The horizontal dashed blue line indicates the (abstract) duration of syllable voicing; the green curve indicates the surface laryngeal adduction curve, which, again, starts before, and ends after, the onset and coda pulses. The voiced portion of the syllable is indicated by the solid green horizontal bar, which starts at the green dashed vertical line marked “on” and ends with the green vertical dashed line marked “off”. As for the closure part of the stop, it starts with the green laryngeal adduction curve and ends at the blue vertical line. The aspiration period is the distance between the end of the stop and the beginning of the voicing for the vowel, that is, VOT is displayed as the discrepancy between articulatory release of the stop constriction and voice onset of the vowel. As the magnitude of the syllable pulse/onset pulse affects the strength of syllable margin features (e.g., voicing), it follows that syllable magnitude also affects VOT (see
Matsui, 2017).
The voice quality component of the C/D model is not yet developed and therefore is not shown in
Figure 7. F0 is described as part of voice quality, which along with other types of voice qualities, “may play crucial roles in prosodic control” (
Fujimura, 2008, p. 316), including the intonation contours, i.e., melody. The concept of F0 as part of voice quality opens the door to thinking of F0 as more than just an F0 contour displayed in a spectrogram, but rather part of the complicated source-filter interactions involved in producing different voice qualities (see, e.g.,
Obert et al., 2023). However, this part of the model has not yet been developed. As for describing Japanese pitch-accents, a Fujisaki-type model was proposed in
Fujimura (
2008).
2.3. Distributor, Actuators and Signal Generator
The specifications in the CONVERTER are fed to the DISTRIBUTOR which selects “elemental gestures” to be enlisted to implement the feature sets. Then a multidimensional set of ACTUATORS assembles the stored feature sets of the Impulse Response Functions and sends these to CONTROL FUNCTION/SIGNAL GENERATORS. Although these parts of the model are yet to be implemented, the ultimate goal is for all the component parts of the model to work together to output acoustic signals of an utterance.
3. Summary of How the C/D Model Observes and Interprets Articulatory Events to Account for Temporal Organization of Spoken Language
The C/D model proposes a novel way for handling the effect of syllable prominence on articulation; this approach has been substantiated by pilot articulatory studies by Erickson and colleagues. Specifically, these studies support the C/D model’s conversion of prosodic patterns into a skeleton of syllable pulses representing the syllable magnitude patterns in an utterance; they also lend support to the positioning of the pulses in each syllable halfway between the maximum velocities of the onset and coda CAs; and they encourage future investigation into how abstract syllable durations and boundaries are calculated via isosceles syllable triangles.
The C/D model uniquely proposes that syllable boundaries are derivatives of syllabic articulation strengths, i.e., jaw and onset/coda crucial articulators. More studies with more data are needed to explore this hypothesis. Is there indeed a correlation between syllable magnitude (amount of jaw displacement) and magnitude of crucial articulators? In the current version of the C/D model, syllable magnitude is measured in terms of the maximum amount of jaw displacement during the syllable. How is the magnitude of the onset and coda Crucial Articulators best measured? One way might be in terms of duration of the consonant (see, e.g.,
McGuire et al. (
2024), which reports that initial consonants of stressed syllables are longer than those of unstressed syllables and that jaw displacement is also greater). Another suggestion proposed by
Svensson Lundmark (
2024) would be to measure magnitude of CAs in terms of magnitude of acceleration.
As mentioned in the introduction, the concept of timing in the C/D model is not in terms of durational relationships, but rather in terms of magnitude relationships. In contrast to other articulatory models, such as AP, intrasyllabic timing is not discussed per se in the C/D model. Nevertheless, exploring the timing of articulatory events relative to acoustic events within the framework of the C/D model needs to be performed. Christopher Geissler, in this same issue, examines the timing between intrasyllabic kinematic units within the framework of a gestural coupling model and along the lines of the C/D model between syllable pulse and intrasyllabic kinematic units. His study includes 11 monosyllabic CVC words consisting of various vowel types (not vowels with the same vowel height as has been previously used when examining the C/D model). The results pinpoint some interesting kinematic timing relationships with the syllable pulse. However, since his study only included monosyllabic isolated words, he could not examine how prominence might affect intrasyllabic kinematic timing. This is a study that needs to be conducted. For such a study, however, it is important to separate vowel quality from prominence or to have a method of normalizing jaw displacement across vowels, as discussed later in this section.
With regard to articulatory boundary strengths, currently used is the combination of (a) the mid-distance between maximum velocity points of the CAs to determine the abstract center of the syllable and (b) the isosceles triangles algorithm. The (abstract) duration between the bases of two consecutive triangles indicates the strength of the syllable boundaries. Would using acceleration peaks or jerks, as proposed by
Svensson Lundmark (
2023) and
Svensson Lundmark and Erickson (
2024), lead to a better estimate of articulatory boundary strengths? With regard to the syllable triangle algorithm, currently it is an ad hoc solution that seems to work. Is there an explanation why the algorithm can generate articulatory boundaries perceived by listeners, as reported in
Erickson et al. (
2015)?
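Step (a) above, placing the abstract syllable center at the mid-distance between the maximum velocity points of the onset and coda CAs, can be sketched in a few lines of Python. The function names, the central-difference velocity estimate, and the sample trajectories are illustrative assumptions; they are not taken from any published implementation of the C/D model.

```python
def peak_speed_time(times, positions):
    """Return the sample time at which |velocity| is largest.

    Velocity is estimated with central differences between neighbouring
    samples (an assumed estimator, chosen only for simplicity).
    """
    best_t, best_speed = None, -1.0
    for i in range(1, len(times) - 1):
        v = (positions[i + 1] - positions[i - 1]) / (times[i + 1] - times[i - 1])
        if abs(v) > best_speed:
            best_t, best_speed = times[i], abs(v)
    return best_t


def syllable_pulse_time(onset_track, coda_track):
    """Midpoint between the peak-speed times of the onset and coda CAs."""
    t_onset = peak_speed_time(*onset_track)
    t_coda = peak_speed_time(*coda_track)
    return 0.5 * (t_onset + t_coda)


# Hypothetical CA trajectories: (times in ms, vertical positions in mm).
onset = ([0, 100, 200, 300, 400], [0.0, 0.5, 3.0, 5.5, 6.0])
coda = ([500, 600, 700, 800, 900], [6.0, 5.5, 3.0, 0.5, 0.0])

pulse_ms = syllable_pulse_time(onset, coda)
```

With these made-up trajectories the onset CA moves fastest at 200 ms and the coda CA at 700 ms, so the pulse is placed at 450 ms. Replacing `peak_speed_time` with an acceleration-based landmark would be one way to compare the alternatives raised in this paragraph.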
With regard to intonation patterns, as discussed above, the C/D model refers to this aspect of prosody as the “melody” of the utterance which is part of the base function. The C/D model views laryngeal articulation as a complex aspect of the model which encompasses intonation, tonal F0 patterns and various voice quality issues. The laryngeal component of the model, including how to account for intonational patterns, awaits development. The work by Esling and colleagues (
Esling et al., 2019) about the larynx as a laryngeal articulator might dovetail nicely with the C/D model.
A final point to be discussed is that the amount of jaw displacement in a syllable is affected by both prominence values and vowel height. As reported by
Williams et al. (
2013) and
Menezes and Erickson (
2013), the jaw for a low vowel is 2 mm lower than for a mid vowel, and 4 mm lower than for a high vowel. To date, exploration of the C/D model has focused on examining jaw displacement values in utterances whose vowels all have the same height. These results indicate a relation among (a) magnitude of jaw displacement per syllable, (b) magnitude of syllable prominence within an utterance and (c) magnitude of boundary strengths between syllables. However, in order to pursue application of the C/D model for analyzing spoken utterances, a way to normalize across vowels is needed. An approach to normalizing vowel height was proposed by
Williams et al. (
2013). Using EMA, they recorded the vertical jaw position of an American English speaker producing six repetitions of three-word sentences where the CVC target mono-syllabic words (shown in
Table 1) occurred in initial, middle and final positions. The consonants (Cs) were voiceless /p/, /t/ or /k/, and the vowels (Vs) were /ɪ/, /ɛ/ and /æ/. The sentences were “X type first,” “Type ‘X’ first,” and “First type X,” where “X” was the target CVC. Nuclear stress was placed on the first word of each sentence, i.e.,
X,
Type, and
First.
In order to determine the effect that prosodic structure has on jaw displacement independent from vowel height, they proposed a simple equation (see
Williams et al., 2013, pp. 3–4).
A two-step vowel normalization algorithm is shown in
Figure 8. By factoring out the vowel height (V) effect, as well as other potential effects such as consonants and speech style, the effect of metrical prominence on the amount of jaw displacement could be seen.
Figure 9 is a graphic display of the vowel neutralization procedure for the sentences ‘Kip met Pat’ and ‘Pat met Kip’, where nuclear stress is on the final syllable in each of the utterances. Raw jaw displacement measurements are shown in the left-hand panel for these two sentences. Notice that the amount of jaw displacement for “Pat” (which contains the low front vowel /æ/) is larger than that for “Kip” (which contains the high front vowel /ɪ/). In the right-hand panel, we see the neutralized jaw displacement values for these sentences (note that this panel carries the label “Raw Data”). Notice that after neutralization, the jaw displacement for “Pat” has decreased and that for “Kip” has increased. The bottom panel displays hypothesized metrical grids based on the neutralized values of jaw displacement; regardless of the vowel height, both utterances have the same metrical pattern, i.e., the most jaw displacement is on the final word, the word with nuclear stress.
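The general idea of factoring out vowel height can be illustrated with a minimal Python sketch. It is not the equation of Williams et al. (2013), which is not reproduced here; it simply assumes a constant offset per vowel-height class, using the 2 mm steps between high, mid and low vowels mentioned above. The raw displacement values are hypothetical and serve only to show the direction of the adjustment.

```python
# Assumed offsets (mm) relative to a mid vowel: a low vowel lowers the jaw
# about 2 mm more than a mid vowel, a high vowel about 2 mm less.
HEIGHT_OFFSET_MM = {"high": -2.0, "mid": 0.0, "low": 2.0}


def neutralize(displacements_mm, vowel_heights):
    """Subtract the assumed vowel-height offset from each raw displacement."""
    return [d - HEIGHT_OFFSET_MM[h] for d, h in zip(displacements_mm, vowel_heights)]


# Hypothetical raw jaw displacements (mm); these are not measured data.
kip_met_pat = neutralize([3.0, 4.0, 10.0], ["high", "mid", "low"])
pat_met_kip = neutralize([8.0, 4.0, 6.0], ["low", "mid", "high"])

print(kip_met_pat)  # [5.0, 4.0, 8.0]
print(pat_met_kip)  # [6.0, 4.0, 8.0]
```

In this toy example the value for “Pat” decreases and the value for “Kip” increases after the offset is removed, and in both sentences the largest remaining value falls on the final word, the pattern described for Figure 9.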
Finally, to date, most of the experimental explorations of the C/D model have focused on the prominence patterns of language. The details of implementing syllable onsets and codas using IRFs have yet to be investigated experimentally; an examination of vowel articulation is also needed.
To conclude this section, as stated in
Section 2.2.1, the objective of this manuscript is to show how the C/D model observes and interprets articulatory events to account for temporal organization of spoken language, not in terms of durational timing between articulation and acoustics, but in terms of rhythmic organization of syllable prominences and subsequent syllable boundaries. This section ends with an admittedly partial list of “unfinished business”: things that still need to be examined and fleshed out in the model. The hope is that if indeed the C/D model accounts for prominence and boundaries, then perhaps following through with the yet “unfinished business” of the model might yield important insights into understanding articulation of speech. Along these lines, the next section summarizes some of the applications of the C/D model’s approach to prosody.
6. Now What?
The C/D model presents an innovative approach to understanding temporal organization of spoken utterances. The starting point is the phonological/prosodic syllabic input, presumably formulated by neural nesting patterns of syllable amplitudes of first language acquisition. This hypothesis is not explicit in the C/D model but is a logical extension that current technology allows us to explore.
The hypothesis that syllable articulation is the framework of speech organization contrasts with the AP model, which holds that speech is organized by how different types and degrees of vocal tract constrictions are timed with respect to each other. In AP, the jaw affects the various speech articulators, but the jaw per se is not a speech articulator. In the C/D model, the jaw is the syllable articulator providing the skeleton/framework of temporal speech organization.
The most innovative aspect of the C/D model is the hypothesis that syllable magnitudes, the output from the phonological/prosodic input, account for speech organization. This contrasts with the usually held linguistic notion that prominence is suprasegmental, i.e., it is “added above” consonant and vowel segments. In the C/D model, the syllable magnitude is articulatorily realized by how much the jaw lowers for each syllable (below the occlusal plane). The amount of jaw lowering is represented in the model by the height of the “syllable pulse”, and the height of the syllable pulse, i.e., magnitude, is transferred to the height/magnitude of the ballistic pulses that specify the onset and coda features of the syllable. The hypothesis that prominence affects onset articulation is indirectly substantiated by research showing less consonant-vowel co-articulation with stressed syllables, e.g.,
McGuire et al. (
2024).
That jaw lowering and syllable prominence are correlated is substantiated by numerous studies about articulation of emphasis and nuclear stress (broad focus), described in the sections above. Generally reported in the above studies is how prominence increases the amount of jaw lowering on a specific syllable.
As concerns the timing of the syllable pulse, Fujimura chose the midpoint between the two “iceberg” points of the onset and coda crucial articulators to represent the location of the syllable pulse. The iceberg points were determined by overlaying the articulatory movement tracings of a single crucial articulator of a number of repetitions of the same utterance spoken by the same speaker. The point where most of the tracings overlapped was considered to represent the iceberg point. The hypothesis was that the iceberg region represented a region of invariance in articulatory movement. The duration between these two points represented the “pure, ideal” syllable, i.e., the syllable preserved from coarticulatory effects of the enveloping consonants (syllable onset and coda (p.c. Caroline Menezes)). This is an interesting hypothesis that has yet to be tested.
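One very simple reading of “the point where most of the tracings overlapped” is the sample at which repeated tracings differ least from one another. The Python sketch below implements only that reading. It assumes the repetitions have already been time-aligned and resampled to the same number of points, which the actual iceberg procedure would have to handle; the function name and the sample values are hypothetical.

```python
def iceberg_index(tracings):
    """Return the sample index where the repetitions agree most closely.

    Agreement is measured as the smallest max-minus-min spread of position
    across tracings (an assumed criterion, used here for illustration).
    """
    n_samples = len(tracings[0])
    spreads = []
    for i in range(n_samples):
        column = [trace[i] for trace in tracings]
        spreads.append(max(column) - min(column))
    return spreads.index(min(spreads))


# Hypothetical vertical positions (mm) of one crucial articulator
# over three repetitions of the same utterance.
repetitions = [
    [0.0, 1.0, 4.0, 7.5, 9.0],
    [0.5, 1.8, 4.1, 6.9, 8.2],
    [-0.4, 0.6, 3.9, 7.8, 9.6],
]

print(iceberg_index(repetitions))  # 2
```

Here the three tracings nearly coincide at the third sample, so that sample is returned as the candidate iceberg point. Applying the same search to the onset and coda articulators would give the two points whose midpoint is taken as the syllable pulse location.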
The hypothesis also was that each speaker has “unique” icebergs, i.e., the relation between jaw lowering and onset/coda articulation is speaker idiosyncratic (p.c. Caroline Menezes). This hypothesis, also, needs to be tested. A clinical application of this hypothesis is relevant to speech pathologies, such as dysarthria, e.g., people with spastic dysarthria will show “pathological” icebergs in that the space for the tongue mass to move is limited due to the limited jaw movement of their pathology. The assumption, thus, is that syllable magnitude affects crucial articulator velocities. This also is a hypothesis that needs exploration.
Another innovative aspect of the C/D model is that articulatory boundaries between linguistic prosodic units are the result of syllable magnitudes. The prominence patterns, as articulated by how much the jaw lowers/opens for each syllable, account for utterance rhythmic patterns of prominence and phrasing. Thus, to reiterate, prosodic prominence patterns are not suprasegmental, but are the foundation of utterance organization.
Fujimura proposed a method for calculating an articulatory syllable duration based on syllable magnitudes; the amount of space (duration) between each articulatory syllable represents the size of the boundary unit, be it a morpheme boundary, a word boundary, a foot boundary, or a phrase boundary. The approach for calculating articulatory syllable duration (which Fujimura referred to as “abstract syllable duration”) was to construct isosceles syllable triangles around each syllable pulse. The angle of the triangle was determined by locating in the utterance two syllable triangles whose two adjacent vertices touched. The rule was that only one set of syllables could have touching vertices. This resulted in a series of syllable triangles of varying heights according to the prominence of each syllable with spaces between the triangles representing the various durations of boundaries. This is the approach explained in
Section 2.2.1.
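The triangle construction described above can be sketched in Python under one assumed reading of the touching rule: all triangles share a single base angle, and that angle is set by the adjacent pair of syllables that would be the first to touch as the triangles widen. Every other pair is then left with a gap, which is read as the boundary duration. The function name, the units, and the pulse values are hypothetical, and pulse magnitudes are assumed to be positive.

```python
def boundary_durations(pulse_times, magnitudes):
    """Return (k, gaps) for isosceles triangles centred on each pulse.

    k is the base half-width per unit of pulse height, shared by all
    triangles. It is fixed by the tightest adjacent pair, whose triangles
    touch (gap 0); every other adjacent pair keeps a non-negative gap.
    """
    pairs = list(zip(pulse_times, magnitudes))
    ratios = [
        (t2 - t1) / (m1 + m2)
        for (t1, m1), (t2, m2) in zip(pairs, pairs[1:])
    ]
    k = min(ratios)
    gaps = [
        (t2 - t1) - k * (m1 + m2)
        for (t1, m1), (t2, m2) in zip(pairs, pairs[1:])
    ]
    return k, gaps


# Hypothetical pulse times (ms) and magnitudes (mm of jaw lowering).
k, gaps = boundary_durations([0, 300, 540, 900], [2, 4, 2, 4])

print(k)     # 40.0
print(gaps)  # [60.0, 0.0, 120.0]
```

In this made-up utterance the second and third syllables touch, and the largest gap (120 ms) precedes the final syllable, which under this reading would be the strongest boundary. If two adjacent pairs tie for the tightest fit, both would touch; the sketch does not treat that case specially.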
A question remains about the validity of the isosceles triangle approach. Fujimura stated that the isosceles syllable triangles were an ad hoc solution (p.c. Erickson, Menezes). Pilot experimental results presented in
Section 2.2.1 suggest that the solution works to show that syllable magnitude affects the size of boundaries. Yet it is not clear why, or if there might be a better approach to assessing the effect of syllable magnitudes on boundaries. This needs to be investigated further.
Another question along these lines is how to determine the syllable pulse location in a CV or V syllable, i.e., a syllable with no coda or onset. One approach suggested by Fujimura was that all syllables begin and end with some type of constriction; in the case of CV or V syllables, there is glottal constriction. The topic of edge constrictions, treated as IRFs in the C/D model, awaits further implementation. As for geminate consonants, a special syllable concatenator was proposed by
Fujimura and Williams (
2008).
Experimental explorations of the C/D model are sparse; most of the work by Erickson and colleagues has focused on the relationship between prominence and jaw displacement. However, these works are limited to monosyllabic words all containing the same phonological vowel. A pilot study introduced a method for normalizing vowel quality to illustrate how prominence patterns for utterances with different vowel heights are produced with similar jaw lowering patterns. As reported in
Section 3, high, mid and low vowels spoken with nuclear stress on the final word, e.g.,
Kip met Pat and
Pat met Kip, show that the largest jaw lowering occurs on the final word even though the vowels vary in terms of vowel height. More work with normalizing vowel qualities is necessary in order to assess the merits of the C/D model. Also necessary are experiments with polysyllabic words, not just monosyllabic words. Once a robust vowel normalization procedure is available, it would be possible to then examine jaw patterns of any spoken utterances. A challenge lies with reduced syllables which often show no distinct jaw lowering; a solution might be to measure the jaw value at the midpoint of the vowel. It might be that in rapid spontaneous speech, there is a regrouping of syllable-level jaw lowering, resulting in more of a pattern of foot or phrase-level jaw lowering. These are questions to be pursued experimentally.
An important finding of the research conducted to date within the framework of the C/D model is that each language has its unique pattern of jaw lowerings, that these are transferred to second language patterns of jaw lowering, and that jaw training can change these transferred first language patterns to be more “native-like” in the second language. More data is needed about jaw lowering patterns across languages, as well as across various clinical populations. With the advent of easier ways to measure jaw lowering, such as the MARRYS helmet, this is an area of research that has great potential.
The goal of this paper was to illustrate the importance of prominence in orchestrating articulation, to open a window for viewing syllable prominence as the impetus for all aspects of speech articulation: syllable edges, nucleus and boundaries. However, the research conducted so far has been primarily on the jaw lowering aspects of the model, and how this relates to prominence and boundary patterns. What is needed now is an exploration of the other parts of the C/D model, i.e., the Base function for describing F0 patterns, voice quality, syllable edge and nucleus articulation. Much still needs to be investigated.
Finally, experiments are needed to assess the connection between the C/D model and the acoustic signal in order to shed light on various applications of the model.
In summary, the C/D model offers a novel way to view the effect of prosody (syllable prominence patterns) on articulation and temporal organization of speech. Fujimura’s Converter/Distributor (C/D) model is unique in using the syllable, rather than phonemes, as the basic concatenative unit, with prosodic control initiated by linearly ordered syllable and boundary pulses with specified pulse magnitudes. More experiments are needed to substantiate the details of how the model can be implemented. The hope is that this tutorial of the C/D model and its application to research on prosodic control of speech will benefit both expert and new researchers in the field.