Review Reports - Preschoolers Mark Focus Types Through Multimodal Prominence: Further Evidence for the Precursor Role of Gestures

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The authors investigate the development of multimodal cues for three different focus types in 3- to 5-year-old Catalan-speaking children. With this, the authors aim to address a gap in the literature, namely the combination of gestures and prosodic prominence used in language production and their development in young children, which has been largely overlooked in research. The authors show children's ability to mark the three focus types differently, as well as differences in the use of prosodic prominence and gestures across age groups, more specifically the difference between 3-year-olds on the one hand and 4- to 5-year-olds on the other.

Overall, I think the manuscript is very well written. It is structured coherently and therefore it is easy to follow the authors' train of thought. The introduction seems to cover all the relevant literature and gives an appropriate overview of the topic. The description in the methods section is detailed enough to allow for potential replications. The task used appears to be adequate to investigate the research questions as well as for the age group under investigation. The results are presented in a clear way and are discussed properly with regard to the authors' research questions as well as previous findings. Therefore, my comments and questions are of a relatively minor nature, and target specific sentences/paragraphs in the manuscript.

In what follows, I will list my comments and questions in order of appearance within the manuscript.

With respect to the formatting of your research questions and hypotheses (p. 7-8), it may be easier (for the reader) to link the research questions to the hypotheses by numbering the questions instead of using a bullet list. I was also a bit thrown off by your hypotheses and the section where you explain how you will address your research questions being part of that bullet list. It led me to assume that you were continuing with your research questions at first.
In 2.2, you mention that "the introduction of puppets also enhanced the task's interactive quality by creating a more engaging and dynamic environment." Is this something you found during your piloting process?
In 2.3, I think the first paragraph can be reduced a bit, the sentence "The puppets were introduced [...]" and the following sentence can be omitted as you have a good description in section 2.4 of the procedure anyway. Similarly, the paragraph (p. 11) starting with "The participants' productions in the task [...]" would perhaps be better placed in section 2.4. These two changes would make section 2.3 focus more on your different conditions which seem not as prominent in this section.
In 2.3 (p. 11), you mention that Catalan is a pro-drop language which is why in certain contexts omission of the verb is permitted. I did not see the connection between these two. I would argue that verb omission is also plausible in certain contexts of non-pro-drop languages like English or German.
For your analyses, I was wondering what is the reasoning behind using different age groups rather than age as a continuous variable.

Author Response

Comment 1: With respect to the formatting of your research questions and hypotheses (p. 7-8), it may be easier (for the reader) to link the research questions to the hypotheses by numbering the questions instead of using a bullet list. I was also a bit thrown off by your hypotheses and the section where you explain how you will address your research questions being part of that bullet list. It led me to assume that you were continuing with your research questions at first.

Response: Thank you for pointing out this issue. We have numbered the research questions, removing the bullet list format (see pages 7-8). We noticed a formatting issue with the two paragraphs following the bullet list that you mention here (they were not properly indented), which indeed made them seem like a continuation of the list. This issue has now been corrected. Additionally, we have rephrased the opening sentence of the paragraph containing the hypotheses to clarify that the focus of the paragraph is on presenting the hypotheses (see page 8).

Comment 2: In 2.2, you mention that "the introduction of puppets also enhanced the task's interactive quality by creating a more engaging and dynamic environment." Is this something you found during your piloting process?

Response: Thank you for this question. During the piloting process, we observed that children demonstrated higher levels of engagement when interacting with the puppets than when completing the task without them, where they interacted only with the experimenter. This increased engagement was reflected in their use of multimodal strategies in the puppet version of the task compared to the non-puppet version. Specifically, in the non-puppet version, 33.3% of target items in the corrective focus condition included a gesture, and 14% featured a prosodically prominent focused word. In contrast, in the puppet version, these percentages rose to 80% and 70%, respectively. We have briefly incorporated this information into the manuscript on page 10 as follows:

This was reflected in an increase in children’s multimodal strategies, as shown by the results from the pilot. Compared to the initial version of the task without puppets, the version with puppets led to a higher proportion of gestures (from 33.3% to 80%) and prosodically prominent focused words (from 14% to 70%) in corrective focus.

Comment 3: In 2.3, I think the first paragraph can be reduced a bit, the sentence "The puppets were introduced [...]" and the following sentence can be omitted as you have a good description in section 2.4 of the procedure anyway. Similarly, the paragraph (p. 11) starting with "The participants' productions in the task [...]" would perhaps be better placed in section 2.4. These two changes would make section 2.3 focus more on your different conditions which seem not as prominent in this section.

Response: Thank you for these suggestions. We have now removed from section 2.3 the two sentences mentioned (see page 10). The first paragraph of the section now reads as follows:

The experimental materials of the Train Task consisted of a toy train, three puppets manipulated by the experimenter, a series of small objects representing everyday items (i.e., the task objects), as well as a tablet identifying some of these objects (i.e., the target object; see Figure 1). To video record participants, a Panasonic AG-CX7 camera with a Panasonic AG-MC200GC microphone was used.

Additionally, we have moved the paragraph regarding the expected productions of the participants to section 2.4, as suggested in this comment (see page 13).

Comment 4: In 2.3 (p. 11), you mention that Catalan is a pro-drop language which is why in certain contexts omission of the verb is permitted. I did not see the connection between these two. I would argue that verb omission is also plausible in certain contexts of non-pro-drop languages like English or German.

Response: Thank you for pointing this out. We agree with you that non-pro-drop languages can also show this type of omission. We have clarified this issue as follows (see page 13):

Since the action of 'taking' remains constant throughout the task and the verb agafar (“to take”) is explicitly mentioned in the puppet's prompt, its meaning is contextually given, making it unnecessary to state the verb in the children’s response. As a result, children could produce a reduced structure without the verb (e.g., la sabata lila, “the purple shoe”) while still fulfilling the task requirements.

Comment 5: For your analyses, I was wondering what is the reasoning behind using different age groups rather than age as a continuous variable.

Response: Thank you for this question. The decision to treat age as a categorical variable is based on several considerations. First, our primary goal was to determine when and how children develop prosodic and gestural abilities to mark focus types. A categorical approach allows us to compare distinct developmental stages, directly addressing the "when" in our aims. Therefore, we believe that this approach aligns better with our research questions. In particular, the fourth research question, on precursor effects, could not have been addressed if age had been treated as a continuous variable. Second, our approach is consistent with previous research on prosodic focus marking in children, which has typically compared developmental stages (e.g., Chen, 2011; Destruel et al., 2024; Wonacott & Watson, 2007). Finally, model comparisons confirmed that treating age as categorical provided a better statistical fit than modeling it as a continuous variable (e.g., for prosodic prominence: LR(3) = 11.86, p < 0.01). We have briefly addressed this issue in the manuscript (page 18):

A total of three models were run in R Statistical Software (R Core Team 2024). In all three of them, age was included as a categorical predictor to allow for the comparison of distinct developmental stages. This approach is particularly relevant to our research questions, which aim to identify developmental shifts rather than assume a continuous, linear effect of age.
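As an illustrative aside on the model comparison quoted in this response, the p-value implied by a likelihood-ratio statistic can be checked by hand. The sketch below assumes that the statistic is referred to a chi-square distribution with 3 degrees of freedom (read from "LR(3)"); the function name chi2_sf_df3 is hypothetical, and this is not the authors' R code.

```python
import math


def chi2_sf_df3(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom.

    Closed form: Q(3/2, x/2) = erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    """
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Statistic quoted in the response; df = 3 is an assumption read from "LR(3)".
lr_statistic = 11.86
p_value = chi2_sf_df3(lr_statistic)
print(f"p = {p_value:.4f}")
```

Under these assumptions the computed value falls below 0.01, in line with the reported "p < 0.01".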

Reviewer 2 Report

Comments and Suggestions for Authors

This is a very interesting and sound study of multimodal focus marking, based on an understudied language such as Catalan. The results are relevant both for prosody studies and for gesture studies, and therefore I recommend publishing it, provided that the issue below is addressed.

My only concern is about the way in which "gesture" and in particular "co-speech gestures" are described and (maybe) coded. Because the authors are not restricting the definition of gestures to the hands and arms, I think first of all that this clarification should come earlier in the paper, together with examples of each of the categories included. A good support in this direction is Bavelas (2022), in which "co-speech gesture" is used in a very similar way. Furthermore, I would also clarify what they mean by pragmatic function here, and how this is related to the category of "pragmatic gestures" proposed by Kendon (1995; 2004). Finally, the authors may want to clarify what they coded as strokes in non-manual gestures: given that the notion of stroke is usually applied to triphasic movements, I wonder how a head movement could have a stroke.

Author Response

Comment 1: My only concern is about the way in which "gesture" and in particular "co-speech gestures" are described and (maybe) coded. Because the authors are not restricting the definition of gestures to the hands and arms, I think first of all that this clarification should come earlier in the paper, together with examples of each of the categories included. A good support in this direction is Bavelas (2022), in which "co-speech gesture" is used in a very similar way.

Response: Thank you for the comment. We have expanded the second paragraph in Section 1.4 to clarify our understanding of gestures. The following text has been added (pages 6–7):

We follow Kendon’s (2004) definition of gesture as “a visible action of any body part, when it is used as an utterance or as part of an utterance” (p. 7). This broad definition includes meaningful, communicative movements produced not only by the hands but also by the head and other body parts. Such a comprehensive approach has been widely used in multimodal discourse studies (see, e.g., Bavelas, 2022) and is particularly relevant for investigating children’s gestures, which, as initial observations of our data indicated, frequently involve the use of non-manual articulators.

Comment 2: Furthermore, I would also clarify what they mean by pragmatic function here, and how this is related to the category of "pragmatic gestures" proposed by Kendon (1995; 2004).

Response: Thank you for this comment. Indeed, the gestures analyzed in our study could be related to Müller's (1998) discourse gestures or Kendon's (2004) pragmatic gestures with a parsing function (which are meant to stress parts of the utterance). We do not call the analyzed gestures "pragmatic gestures" or "parsing gestures" because we adopt the M3D system (Rohrer et al., 2023), a model offering a multi-dimensional perspective in which any gesture, independently of form or semantic category, can contribute to pragmatics (e.g., mark focus, manage interaction, convey politeness). This implies that gestures that are referential in nature (e.g., pointing gestures) can also be pragmatically relevant, depending on their communicative context. Therefore, we consider that 'pragmatic' is not a category of gesture but a transversal dimension that affects all gestures. This multidimensional approach to gestures is consistent with McNeill’s (2005) idea that gesture types are dimensions rather than categories and, consequently, can display several features at the same time (e.g., pointing gestures that encode direction, iconic gestures that contain a superimposed beat). A similar idea was already present in Kendon (2017), who argued that the form of various pointing gestures is shaped by pragmatic intent. To clarify this, we have added the following lines to the manuscript (page 15):

The M3D system (Rohrer et al., 2023) proposes a dimensionalized approach to gesture categorization, considering form, pragmatic meaning, and prosodic characteristics as interrelated yet distinct aspects of gesture. According to this proposal, any gesture can serve pragmatic functions. For this reason, all gestures regardless of their form (e.g., pointing, open hand, nod, tilt) or semantic category (e.g., deictic, iconic, non-referential) were annotated.

Comment 3: Finally, the authors may want to clarify what they coded as strokes in non-manual gestures: given that the notion of stroke is usually applied to triphasic movements, I wonder how a head movement could have a stroke.

Response: Thank you for this comment. The stroke is defined as “the movement that contains the action of the gesture, and as such, it is the only obligatory phase of a gesture” (Rohrer et al., 2023). Therefore, it can exist without preparation or recovery phases. Head gestures have also been described as exhibiting a stroke phase in the literature (see, e.g., Rohrer et al., 2013; Wagner et al., 2014). The stroke of a head nod, for example, is usually defined as the downward movement of the head. Several studies on head gestures have also considered this phase in particular in their analyses (see, e.g., Cargnan et al., 2024; Esteve-Gibert et al., 2017). For this reason, we coded gesture strokes in non-manual head gestures as well. We adopted the same approach with the remaining non-manual gestures, identifying a stroke phase in them, for two main reasons. First, we wanted to ensure comparability with manual and head gesture annotations. Since strokes were the focus of our analysis for manual and head gestures, applying the same criteria to all non-manual gestures allowed for a consistent and systematic annotation across all gesture articulators. Second, we wanted to filter out gestures that were not clearly associated with focus domains. Some body movements can be long and overlap with multiple words in speech. By annotating only the stroke, we ensured that our analyses captured the most meaningful phase of the gesture, preventing instances where a prolonged movement might align with different parts of the utterance. For example, if a large torso movement started long before the focus word (i.e., when the verb was produced) but continued past it or finished while it was being uttered, considering only the stroke helped distinguish whether the gesture was truly associated with the focus word or a different segment of speech. We have clarified this in the text in the following way (page 16):

The stroke phase has traditionally been studied in both manual and non-manual head gestures (see Wagner et al., 2014, for a review on head gestures). To ensure comparability with existing annotations and to exclude large movements unrelated to focus marking (e.g., large body movements that began while the verb was uttered but extended across multiple words, including the focused word), we applied this stroke-based annotation approach to all other non-manual articulators (i.e., eyebrows, torso, and legs). Following the definition of stroke for head gestures (see, e.g., Wagner et al., 2014), we identified the stroke of other non-manual gestures as the phase leading to the articulator's point of maximal extension, occurring just before a change in movement direction. For instance, in a forward torso movement, the stroke extended until its furthest point before the torso began moving backward. When a clearly defined stroke phase was not identifiable, the entire movement of the articulator was coded as the stroke.
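The filtering rationale described in this response can be illustrated with a small sketch. It assumes, purely for illustration, that a gesture counts as associated with the focused word when the two annotated time intervals overlap; the manuscript excerpt does not state the exact alignment criterion, and the Interval class, the overlaps function, and all time stamps below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A labelled time span in seconds (hypothetical annotation format)."""
    start: float
    end: float


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if the two intervals share any stretch of time."""
    return a.start < b.end and b.start < a.end


# Invented example: a long torso movement that begins during the verb
# and continues past the focused word, with its stroke during the verb.
focus_word = Interval(1.20, 1.65)
whole_movement = Interval(0.40, 1.90)
stroke = Interval(0.55, 0.95)

print(overlaps(whole_movement, focus_word))  # True: the full movement spans the focus word
print(overlaps(stroke, focus_word))          # False: the stroke itself falls on the verb
```

In this invented case, coding the whole movement would link the gesture to the focused word, whereas coding only the stroke would not, which is the distinction the response describes.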