Articulatory Control by Gestural Coupling and Syllable Pulses
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Overview:
This paper examines two proposals for how the gestures of syllables are coordinated. The first model is the C/D model, which proposes some abstract “pulse” that C and V gestures are timed to. The second is a somewhat strict interpretation of AP/TD where gestural onsets are either in-phase or anti-phase coordinated. The authors use data from the Wisconsin X-Ray Microbeam database to test the predictions from each theory.
My main criticism of this paper is one the authors themselves bring up on multiple occasions, but to me the way they bring it up reads more as a dismissal of the issue than as an attempt to address it. Namely, it seems vacuously true to define a pulse by referencing the consonant gesture, and then note that the consonant gesture is quite consistently timed to the pulse. The authors argue that earlier consonant landmarks, like gestural onset, have an extremely low relative SD (RSD), which suggests that the pulse is the real coordination target. This wasn’t convincing to me: I would expect that different parts of one gesture would be highly coordinated with each other. So the low RSD should come as a given, because the mean lag is just longer.
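The arithmetic behind this last point can be sketched as follows. This is only an illustration, assuming that RSD means the sample standard deviation divided by the mean lag; the lag values and the names `relative_sd`, `peak_velocity_lags`, and `onset_lags` are hypothetical and are not taken from the manuscript.

```python
import statistics


def relative_sd(lags):
    """Relative SD: sample standard deviation divided by the mean lag."""
    return statistics.stdev(lags) / statistics.mean(lags)


# Hypothetical lags (ms) from two landmarks of the same gesture to the pulse.
# Both carry exactly the same absolute jitter; the gestural onset is simply
# further away from the pulse than the peak-velocity landmark.
jitter = [-10, -5, 0, 5, 10]
peak_velocity_lags = [50 + j for j in jitter]
onset_lags = [150 + j for j in jitter]

# Same absolute variability, but the longer mean lag shrinks the ratio.
same_sd = abs(statistics.stdev(onset_lags) - statistics.stdev(peak_velocity_lags)) < 1e-9
onset_is_lower = relative_sd(onset_lags) < relative_sd(peak_velocity_lags)
print(same_sd, onset_is_lower)
```

With identical absolute jitter, the landmark with the longer mean lag has the lower RSD, so a low RSD for an early landmark does not by itself show tighter coordination.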
A second major criticism is that the authors really only give consideration to a strict, onset-only coordination hypothesis in AP, which has been falling out of favor. See, for example, Gafos 2002 for early proposals, but also Karlin 2018/2022, Shaw and Chen 2019, and Turk and Shattuck-Hufnagel 2020, all of which propose some version of downstream targets for coordination, using quite different data. Some of the data in the present paper is consistent with these “downstream coordination” hypotheses. The authors do bring this up (3.3) but never actually say that their data is consistent with it! They only say that it is not consistent with a strict onset-only model (line 293).
My last criticism is how the authors handle jaw movement. Jaw movement will be part of the consonant-to-vowel transition, especially if the active articulators are not adjusted for jaw placement, which I do not believe they are in the authors’ handling of the data. Thus, I would suspect that there would be a very strong correlation between PVEL (which contributes to the definition of the pulse) and jaw movement. Jaw movement is also influenced (as they note) by vowel height; they note that their model could not handle an effect of vowel, but again it seems like this is an important thing to consider.
Minutiae:
- 163 “were extract” -> were extracted
- A schematic illustrating the gestural landmarking would be helpful, if only to make it easier to get familiar with the specific abbreviations used
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
Please see attached file.
Comments for author File: Comments.pdf
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The manuscript reports a study on the temporal organization of articulatory gestures within syllables by examining durations and variability of temporal intervals of several gestural landmarks of CVC words obtained from articulatory data, as well as their timings relative to the syllable pulse, an element based on the “pulse” of the C/D model, which is defined here as the midpoint between velocity peaks of the onset and coda. The reported study tested key predictions of the pulse-based model and the coupled-oscillator model – stable timing relation between gestures and a syllable pulse, and stable onset-vowel timing, respectively. The most stable timing relation, found between onset consonants (rather than vowels) and the syllable peaks, and the correlation between the C-V lag and the duration of the onset gesture, among other things, are taken as support for a model that uses an abstract syllable pulse to time articulatory gestures.
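The pulse definition summarized above can be written out as a short sketch. It assumes only what the summary states (the pulse is the midpoint between the onset and coda velocity peaks); the function names and the times in milliseconds are hypothetical.

```python
def syllable_pulse(onset_pvel_time, coda_pvel_time):
    """Pulse time as the midpoint between the onset and coda velocity peaks."""
    return (onset_pvel_time + coda_pvel_time) / 2


def lag_to_pulse(landmark_time, pulse_time):
    """Signed lag of a landmark relative to the pulse (negative = before it)."""
    return landmark_time - pulse_time


# Hypothetical landmark times (ms) for one CVC token.
onset_pvel, coda_pvel = 120, 360
pulse = syllable_pulse(onset_pvel, coda_pvel)

# By construction, the onset velocity peak lies half the inter-peak interval
# before the pulse, so its lag is fully determined by that interval.
onset_lag = lag_to_pulse(onset_pvel, pulse)
half_interval = (coda_pvel - onset_pvel) / 2
print(pulse, onset_lag, half_interval)
```

Under this definition the lag of the onset velocity peak is always minus half the interval between the two peaks, which is one way to see why consonant-to-pulse stability may partly follow from how the pulse is defined.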
The manuscript certainly merits publication. Temporal organization of speech is an important area of study. Since the C/D model and the Coupled-Oscillator model assume different tools in utterance planning (an external master clock vs. pair-wise coupling), studies that compare predictions from these models (or models based on them) are extremely illuminating. The manuscript is also very clearly written, with detailed descriptions of the study’s methods and a clear presentation of the reasoning.
The only small points of improvement/clarification concern the purpose and the claims presented in the manuscript. I would like the manuscript to be published after quick revisions to the Abstract, Aims (section 1.3), and Discussion (section 4) so as to make the purposes and the claims clearer, as suggested below:
- Purpose: The purpose of the study could be stated more clearly and consistently in the Abstract, Aims, and Discussion. Section 1.3 (Aims) reads as though the paper aims to test the predictions of the two models (one with syllable pulses and the other with coupled oscillators) in order to determine which model is favored, but this aim (if it is indeed the aim) is not clearly stated in the Abstract and Discussion. The Abstract also does not mention the coupled-oscillator model. The study’s aim to examine data against the syllable pulse-based predictions is clear, but what the study does with the coupled-oscillator model could be more clearly stated in all three places.
- Claims: Claims based on the stable timing between onset consonant and the syllable pulse (lines 379-380) could be qualified or immediately cautioned, just as the claim based on the more stable timing from the GONS than NMID landmarks (lines 309-402) is cautioned. While the stability of the onset consonant and the syllable pulse is the major finding of the study, it may be an artifact of the definition of the syllable pulse (as explained in lines 391-398).
Author Response
Thank you for your feedback! Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
Review: Articulatory Control by Gestural Coupling…
This paper is somewhat outside my own area of expertise, but it’s interesting. It’s very clear (both in the results and mostly in the explanation of them), so it’s a good paper for someone with a somewhat different specialization to read. I think the results are solid despite my statistical misgivings below, and I look forward to seeing the paper published.
Section 1.1: Thank you for this very helpful and overall very clear review of the two competing models. I think it would really help to include a figure showing the triangles and calculation of relevant time points for a sample utterance, for readers who are less familiar with these models.
l. 52, regarding Erickson & Kawahara (2015): maximum velocity peak of what? Jaw movement?
p. 3 l. 109: “midpoint of plateau midpoints”: Is this a typo? If it’s correct, how can you have a midpoint of a midpoint? I may be missing something here.
Section 2.1: 17 words seems somewhat low for number of items, but the number of speakers and tokens is impressive. I appreciate the discussion of why the dataset had to be limited to these 17 CVC words. I did wind up very unclear on how the authors arrived at 7578 tokens.
p. 4: Table I could also really use a sample figure, even though the explanation is very well done.
Figure 2: I’m really not clear on how I should be evaluating the results in this figure, except for the fact that the points form a cloud with no clear pattern. Is there a correlation we should be looking for and failing to find?
Section 3.1 Statistical results (next to last paragraph of this section), and statistics for subsequent sections: I know there are 48 speakers and 17 items, with 5+ repetitions, and a total of 7578 tokens. There are phrases here like “pooled across words” and “compared across words.” The df’s are extremely large, such as (1,6608) and (33, 6576). This concerns me, because it looks like the three random factors of speaker, word, and repetition are all being pooled as if they were a single random factor, as if multiple repetitions of the same word by the same speaker were independent data points. Pooling over multiple random factors is generally a bad idea, and creates little pockets of related data, violating the independence assumption. (The 5+ repetitions of one word by one speaker will be more similar than productions of other words by that speaker, or productions of the same word by other speakers, creating little clusters of non-independent data.) Furthermore, with df’s as high as these, just about any difference is likely to be significant. Basically, I’m worried that the statistical procedure here drastically inflates Type I error. However, although I have used Levene’s test for homogeneity of variance before, I’ve never used it as the primary test statistic to reach conclusions from. This is an interesting approach that makes sense for the question of this paper. This still leaves me very worried about pooling over three random factors and treating them as a single one. When using ANOVA, my own approach would be to average over the multiple repetitions and then average over the items/words, leaving only speaker as the remaining random factor. This would give very stable data and avoid inflating the df’s. Is that inappropriate for the current question? I can see where it might be, since the question is about variability of gestural timing within a single token. 
I think in any case, the paper needs considerably more explanation of how the three random factors are handled and of why Levene’s test is being used this way. If each token is being treated as an independent data point, this would need considerable justification.
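The aggregation described above (average over repetitions, then over words, leaving speaker as the only random factor) can be sketched as follows. This is a minimal illustration, not the authors’ procedure; the speaker labels, the words “bat” and “pit”, and the lag values are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_speaker(tokens):
    """Average repetitions within each (speaker, word) cell, then average
    the cell means within each speaker, giving one value per speaker."""
    cells = defaultdict(list)
    for speaker, word, lag in tokens:
        cells[(speaker, word)].append(lag)

    per_speaker = defaultdict(list)
    for (speaker, _word), lags in cells.items():
        per_speaker[speaker].append(mean(lags))

    return {speaker: mean(cell_means) for speaker, cell_means in per_speaker.items()}


# Hypothetical tokens: (speaker, word, lag in ms), with unequal repetition counts.
tokens = [
    ("s1", "bat", 100.0), ("s1", "bat", 110.0),
    ("s1", "pit", 80.0), ("s1", "pit", 90.0), ("s1", "pit", 100.0),
    ("s2", "bat", 120.0), ("s2", "bat", 130.0),
    ("s2", "pit", 95.0),
]
speaker_means = aggregate_by_speaker(tokens)
print(len(tokens), len(speaker_means), speaker_means)
```

Eight tokens collapse to two data points, one per speaker, so the degrees of freedom reflect the number of speakers rather than the number of tokens. The cost is that token-level variability, which is what the paper’s question is about, is averaged away.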
Section 3.2, statistics: Here, the df’s are (1,8186) and in that range. I understand that the df’s vary because different numbers of tokens have to be excluded for each test. But since the total number of tokens in the whole study was 7578, I’m not clear on how we get to a df(denom) that’s even larger than that. I haven’t looked up equations for the df’s for Levene’s test, so I may just be missing something, but I think quite a bit more explanation of the statistical methods (and serious consideration of the independence violation problem) would be helpful.
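For reference, Levene’s statistic is compared against an F distribution with (k − 1, N − k) degrees of freedom, where k is the number of groups and N is the total number of observations entered into the test. The sketch below applies that standard formula to the df values quoted in this report. The function names are hypothetical, and what counts as a group or an observation in the authors’ analysis is not stated here, so the implied N values are only a back-calculation.

```python
def levene_df(n_groups, n_observations):
    """Degrees of freedom (numerator, denominator) for Levene's test:
    (k - 1, N - k) for k groups and N observations in total."""
    if n_groups < 2 or n_observations <= n_groups:
        raise ValueError("need at least 2 groups and more observations than groups")
    return n_groups - 1, n_observations - n_groups


def implied_observations(df_num, df_denom):
    """Invert the formula: N = (k - 1) + (N - k) + 1."""
    return df_num + df_denom + 1


total_tokens = 7578  # token count quoted in the report

# df pairs quoted in the report for Sections 3.1 and 3.2.
for df_num, df_denom in [(1, 6608), (33, 6576), (1, 8186)]:
    n = implied_observations(df_num, df_denom)
    print((df_num, df_denom), "implies N =", n, "| exceeds token count:", n > total_tokens)
```

Under this formula, both df pairs quoted for Section 3.1 imply the same N, while a denominator of 8186 implies more observations than there are tokens. That could arise only if some tokens contribute more than one measurement to a test, which this record does not confirm.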
Figure 4, caption: Again here, I’d like to know how the multiple random factors are handled. Is each dot in these figures a single token, or are these averaged over repetitions or speakers?
Table 4 and par. above it: I’m glad to see that the issue of words and speakers is addressed here (random intercepts for both), but what about repetitions?
Figure 7 and throughout: Very minor issue, but it would be really nice to have the font size for text in figures large enough to read. I know people read online these days and can zoom in, but not having to zoom in on every figure (or take off my glasses and read from 2 inches from the screen) would be helpful.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 5 Report
Comments and Suggestions for Authors
See the attached file
Comments for author File: Comments.pdf
Author Response
Please see the attachment. The feedback was very helpful, and I am grateful for the time and effort you put into the review.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Please see attached
Comments for author File: Comments.pdf
Author Response
Please see the attachment
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This is the second cycle of me reviewing this article. The question the article poses is an interesting one, namely what the relationship between articulatory timing and syllable structure is. To address it, the article compares the C/D Model to the planning oscillator model of syllable structure. In my first review of the paper, I listed several reasons why the paper was inadequate for publication. Unfortunately, the current version does not manage to address the concerns raised in the first reviewing cycle.
First and foremost, the paper misrepresents and inadequately uses the coupled oscillators model (COM henceforth) of the syllable. This issue pervades many aspects of the model. To mention some important cases, there is a profound confusion between the concepts of planning vs. emergence of syllable structure as well as between the concepts of planning oscillators vs. gestural timing. There is also a mix-up between coupling oscillators and nested oscillators. And, although a description of nested oscillators is attempted, it is unsuccessfully executed and results in wrong statements. Moreover, there is a confusing mixture and comparison of measures and articulatory landmarks coming from different dynamical-model-based approaches, which do not, however, necessarily adopt, comply with or even attach themselves to COM. This is so confusing that it gives the impression that the paper is asking a different question than the one it states, since it diverges from testing COM’s account to dumping a series of articulatory landmarks, measures or definitions of articulatory gestures as proposed by other accounts. As a note, some of these accounts select the landmarks/measures they use because it is practically impossible to detect gestural onsets consistently, others do so because they are after a different dynamical model of gestures than the one originally proposed by Articulatory Phonology (AP), and other accounts examine different models of timing. On the other hand, syllable-level oscillators are not brought up at all (especially as part of the hypothesis to be tested), although they are a follow-up proposal made by the authors of COM, one which in fact has important similarities to the C/D Model. Another issue stems from the usage of the term c-center.
C-center is a core concept for COM and its predecessors, but the paper does not use the term as defined in COM, but instead as used in another paper, for which the definition is methodological and has nothing to do with the COM meaning. The example issues listed above are manifested across the manuscript, which, in addition to inappropriately representing COM, makes the text impossible to parse.
The second thread of drawbacks of the manuscript is methodological. Most of the issues mentioned in my first review remain unresolved, but here I will mention the main ones. If the goal is to test the COM hypothesis, then one needs to work with onsets (and targets, as a way to infer anti-phase). If onsets are to be measured, then one should consider not using velocity extrema, since these are points of extreme pressure from neighboring constrictions, and using a more stabilized part of the gesture’s initiation phase instead (hence, the mview thresholds). The latter would also facilitate direct comparison to the COM papers. Moreover, although stability measures make sense, COM tests timing, and thus this should be the main dimension examined if COM’s hypothesis is what needs to be tested. Next, without normalization or some kind of treatment of the prosodic context of each measured gesture, the results of the gestural duration analysis are void, because these words are all monosyllables (CVC) and thus stressed, and stress is known to affect the first consonant in ways that can fully account for the results of covariation with gestural duration reported here. So, there is a serious confounding effect in this analysis that is not considered at all. Additionally, using either the vertical or the horizontal dimension for measuring the vowel gestures is extremely problematic, especially from COM’s perspective. This is most likely the reason that all of the vowel-based lags are so variable (see Figure 2) and why all of them fail to show the expected timing. Relatedly, the paper should state for which vowels it picked the vertical and for which the horizontal dimension, and why. No argumentation or explanation is provided for that. Finally, the comments from the first review cycle on the jaw analysis remain relevant for the second version.
Another major issue of the paper is that it does not have any concrete conclusion. This is not surprising given how unsystematic it is in how it builds the COM framework (by using bits and pieces from different dynamical-model-based accounts, and not just COM) and the multiple confounding effects that do not get addressed in the analysis.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf