1. Introduction
Speech sound disorder (SSD) refers to difficulties producing and using speech sounds and speech segments, resulting in reduced accuracy and clarity of speech production. It is the most prevalent of all childhood communication difficulties [1], affecting 3.4–5.6% of pre-school-aged children [2] and comprising more than 70% of a Speech–Language Pathologist’s (S-LP) caseload [3]. Children with SSD are more likely to experience adverse social, educational and psychological outcomes than children without SSD [4,5]. These difficulties may further limit employment opportunities throughout their lifespan [6]. Minimizing the impact of SSD is contingent on providing an accurate diagnosis to direct intervention approaches.
The causes of SSD can be organic or functional; organic SSDs arise from an underlying structural (e.g., cleft palate), motor/neurological or sensory/perceptual cause, while functional SSDs, which are more prevalent [5], are idiopathic and include articulation and phonological disorders [7]. Diagnosis of functional SSDs seeks to identify the factors underlying the speech difficulties, for example, identifying whether the child is having difficulties learning the linguistic–phonological rules of the target language (i.e., a phonological impairment) and/or difficulties with the motor aspects of speech production. The differential diagnosis of SSD, however, is challenging due to the nature of SSDs and the limitations of current clinical practices [8,9,10]. McCabe, Korkalainen and Thomas [9] highlight that the “overlap between the symptoms of different disorders with the same speech features … from multiple different breakdowns…” complicates SSD diagnosis, while Littlejohn and Maas [8] note that diagnosis is impacted by a “poor understanding of, and limited focus on the underlying impairment(s)” (p. 2).
As part of the assessment process, S-LPs are encouraged to conduct a comprehensive case history; an assessment of oral structures and hearing function; and an error analysis from a connected speech sample to identify a child’s phonetic inventory and phonological error patterns or processes, as well as obtain measures of the phonological mean length of utterance and quantify speech intelligibility [11]. In practice, however, time and ease of use are key factors influencing clinical decisions [12], with S-LPs routinely using standardized naming tasks to evaluate speech sound inventory and error patterns [12,13]. Measures of phoneme accuracy within the single-word naming tests, including the percentage of consonants correct (PCC), the percentage of vowels correct (PVC) and the percentage of phonemes correct (PPC), are frequently used to determine the presence and severity of SSDs [14].
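As an illustration of how these accuracy measures are typically derived, the following minimal sketch computes PCC, PVC and PPC from a list of scored phonemes. The `PhonemeScore` structure, the function name and the sample scores are hypothetical and are not taken from any published scoring protocol; clinical tools may apply additional rules (e.g., for distortions) that are not modeled here.

```python
from dataclasses import dataclass

@dataclass
class PhonemeScore:
    kind: str      # "C" for consonant, "V" for vowel (hypothetical coding)
    correct: bool  # True if the transcribed production matched the target

def percent_correct(scores, kind=None):
    """Percentage of phonemes scored correct, optionally restricted to 'C' or 'V'."""
    selected = [s for s in scores if kind is None or s.kind == kind]
    if not selected:
        raise ValueError("no phonemes of the requested kind")
    return 100.0 * sum(s.correct for s in selected) / len(selected)

# Hypothetical scores for two consonant-vowel-consonant targets (e.g., "map", "bob"),
# where the final consonant of the first word was produced in error.
sample = [
    PhonemeScore("C", True), PhonemeScore("V", True), PhonemeScore("C", False),
    PhonemeScore("C", True), PhonemeScore("V", True), PhonemeScore("C", True),
]

pcc = percent_correct(sample, "C")  # 3 of 4 consonants correct
pvc = percent_correct(sample, "V")  # 2 of 2 vowels correct
ppc = percent_correct(sample)       # 5 of 6 phonemes correct
```

All three measures share one definition (correct targets divided by attempted targets, times 100) and differ only in which phoneme class is counted.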
The determination of speech sound error patterns typically relies on auditory–perceptual analysis using the International Phonetic Alphabet (IPA) transcription [14,15]. While this is a fundamental part of diagnosis, using auditory–perceptual assessment alone is limiting [16,17,18], and there is no gold standard validation of perceptual measures that can discriminate SSD from typically developing (TD) speech. Auditory–perceptual assessments do not allow clinicians to determine the contribution of speech motor control to the production difficulties of a child [19]. Further, the reliance on assessment tools framed predominantly on linguistic models of speech production is problematic for differential diagnosis of SSD because these tools tend to focus on investigating phonological deficits [20], as opposed to the speech movement patterns associated with underlying constraints at the level of speech motor control.
McCauley and Strand’s [21] 2008 review of standardized tests that evaluate the speech motor performance of children concluded that “clinicians are in the position of having no tests that can be considered well developed for use with children with motor speech disorders” (p. 89). While new standardized assessment tools have been developed since this review, for example, the Dynamic Evaluation of Motor Speech Skill (DEMSS) [22], a recent review of assessment and intervention approaches for SSD identified 37 published assessment tools for SSD, with the majority focusing on specific skills and only four assessing combined articulatory, phonetic and motor-based development [4]. In their 2024 review of tools and approaches supporting the diagnosis of childhood motor speech disorders, McCabe, Korkalainen and Thomas [9] state, “there are not yet validated tools for comprehensively assessing all speech production processes” (p. 9).
A tool recently developed to measure speech motor control is the Motor Speech Hierarchy–Probe Words (MSH-PWs) [23]. The MSH [23] comprises seven stages that reflect the hierarchical (i.e., increasing levels of motor complexity) and interactive development of speech motor control: tone (Stage I), phonatory control (Stage II), mandibular control (Stage III), labial–facial control (Stage IV), lingual control (Stage V), sequenced movements (Stage VI) and prosody (Stage VII). The Probe Words (PWs) cover Stages III to VI, with 10 words and one phrase in each stage. The PWs were selected to reflect the primary articulatory movements required for word production at that stage of developing speech motor control [23]. Each PW is scored visually and auditorily by observing a child say the target word and judging whether the speech movements look and sound appropriate or inappropriate based on criteria outlined in the MSH-PW manual. For example, the criteria for mandibular control focus on assessing vertical jaw movements (e.g., through words containing bilabial sounds, mid-low vowels and simple syllable structures, like map, bob and papa) and how jaw control integrates with voicing and nasalization. These criteria include three measures of mandibular control (appropriate jaw range, appropriate jaw control stability and appropriate close–open phase) and two measures that reflect linguistic accuracy: appropriate voicing transitions and correct syllable structure.
In 2021, Namasivayam and colleagues [23] reported key measures of validity and reliability of the MSH-PWs. Their data indicate high content, construct and criterion-related validity, as well as high reliability on measures of internal consistency and intra-rater reliability and moderate agreement on inter-rater reliability. This validation study, however, did not undertake a fine-grained analysis of individual scoring criteria (e.g., jaw range and jaw control), and the MSH-PW assessment tool was only validated on children aged 3 to 10 years with moderate-to-severe motor speech disorder. Therefore, construct validity in the form of distinguishing between the speech motor skills of TD children and children with SSD, and the scoring criteria involved in diagnosing impaired speech motor control, have yet to be established for the MSH-PWs.
Furthermore, despite perceptual measures being used to judge articulatory control, the gold standard for evaluating speech motor control is based on instrumental analysis. Researchers have long advocated the need to combine perceptual analysis with the instrumental analysis of kinematic (i.e., the study of motion: displacement, velocity and acceleration) and acoustic measures [15,18,21]. In recent years, the enormous potential of machine learning (ML) to support the diagnosis of SSD has been recognized, and the use of the kinematic analysis of speech has progressed with the development of new video and motion-tracking technologies [24,25] that could potentially be used in combination with clinical tools such as the MSH-PWs. Mogren et al. [26], for example, demonstrated the ability to reliably extract measures of jaw movement using markers located on the middle upper lip, middle lower lip and chin (pogonion). While these methods provide precise motion data, they are typically constrained to laboratory settings and are not yet practical for everyday clinical environments.
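To make the three kinematic quantities concrete, the sketch below derives displacement, velocity and acceleration from a sampled marker trajectory using finite differences. It is an illustrative assumption rather than the procedure of any cited study: the function names, the sampling rate and the chin positions are hypothetical, and real motion data would normally be smoothed before differentiation.

```python
def finite_difference(values, dt):
    """Numerical derivative: central differences inside, one-sided at the ends."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    out = []
    for i in range(n):
        lo = max(i - 1, 0)
        hi = min(i + 1, n - 1)
        out.append((values[hi] - values[lo]) / ((hi - lo) * dt))
    return out

def kinematics(position_mm, fps):
    """Return displacement (mm), velocity (mm/s) and acceleration (mm/s^2)."""
    dt = 1.0 / fps
    rest = position_mm[0]  # assumption: the first sample is the resting position
    displacement = [p - rest for p in position_mm]
    velocity = finite_difference(position_mm, dt)
    acceleration = finite_difference(velocity, dt)
    return displacement, velocity, acceleration

# Hypothetical vertical chin-marker positions (mm) sampled at 10 frames per second.
chin_y = [0.0, 1.0, 4.0, 9.0, 16.0]
disp, vel, acc = kinematics(chin_y, fps=10)
```

Because differentiation amplifies measurement noise, acceleration estimates in particular are sensitive to tracking jitter, which is one reason marker-based laboratory systems have traditionally been preferred for this kind of analysis.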
Computer vision-based approaches to measuring facial movements offer an objective and well-defined standard for detecting atypical speech patterns [27,28] and have demonstrated the potential for detecting facial movements associated with disorders [28,29,30]. While promising, the features used by these systems are not well grounded in the existing clinical understanding of which facial movements are involved in which aspects of speech motor control. As such, the decisions made by these systems lack explicative transparency.
This study focused on the assessment of mandibular control in young children using the MSH-PWs and reports on the development of a computer vision-based analysis system that offers objective measures of jaw movement for use in everyday clinical environments. (This study is part of a larger study, currently being undertaken, that integrates automatic facial tracking data with perceptual scoring at all levels of the MSH-PWs.) Motor speech control develops in a sequential yet non-linear manner, with mandibular control providing a foundation for the later refinement of labial and lingual movements [23,25,31,32,33]. Early speech production is characterized by reliance on jaw movements for oral closure, with young children showing greater mandibular displacement and less articulatory independence [34]. As development progresses, control shifts from jaw dominance to more independent and refined lip and tongue movements [32,35]. Kinematic measures have captured these developmental changes, highlighting the foundational role of the jaw in early speech motor organization [36]. Importantly, poor mandibular control has also been identified as a feature of children with SSD, including reduced range of motion of jaw movements during vowel production [26] and increased variability in movement trajectories of the jaw, indicating impaired lateral jaw stability [37]. Furthermore, there is evidence that jaw control and stability may be useful markers in determining subtypes of SSD [38].
Therefore, given the importance of assessing mandibular control in young children with speech production difficulties, we investigated whether MSH-PW mandibular control scores obtained from expert clinicians could distinguish TD children from children diagnosed with SSD. The aim was to validate the perceptual scoring of MSH-PWs as being sensitive to individual differences in speech motor skills at the level of mandibular control and identify MSH criteria that could be predictive of disordered mandibular control, relative to TD children.
We also employed a state-of-the-art facial mesh detection and tracking algorithm [26] to extract measurements of facial movements identified as clinically salient in the assessment of speech motor control from recorded videos of children speaking words from the mandibular stage of the MSH-PWs. We aimed to evaluate how well these extracted facial movement measurements agree with perceptual scores for the jaw range and jaw control criteria of the MSH-PWs. We selected these two criteria because, when scoring jaw range and control, clinicians rely predominantly on the child’s facial movements.
This preliminary study, therefore, sought to answer the following questions:
- Q1. Do the MSH-PW criteria, using expert consensus scoring, discriminate TD children from those with SSD? We predicted TD children would score more highly than SSD children in relation to the MSH-PW mandibular control criteria and that mandibular control criteria could be predictive of whether a child was TD or had an SSD.
- Q2. Can kinematic measurements derived from automated facial tracking accurately predict the expert consensus perceptual scores of the MSH-PW jaw range and jaw control criteria? We expected good agreement between objective measures obtained from facial tracking and expert clinician judgements of appropriate and inappropriate jaw range and jaw control, as indicated by the predictive accuracy in logistic regression classification models, with objective measures as predictors and clinician judgements as the outcome.
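The kind of model described in Q2 can be sketched as follows: a logistic regression with objective measures as predictors and a binary clinician judgement as the outcome. This is a minimal illustration under stated assumptions, not the analysis pipeline used in the study. The feature values, labels, learning rate and epoch count are all hypothetical, the model is fitted by plain gradient descent rather than a statistics package, and accuracy is computed on the training data, which is optimistic compared with cross-validated accuracy.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def standardize(rows):
    """Rescale each column to zero mean and unit standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and intercept by batch gradient descent on the log-loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        b -= lr * grad_b / n
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
    return w, b

def predict(X, w, b, threshold=0.5):
    """Classify each row as 1 (appropriate) or 0 (inappropriate)."""
    return [int(sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) >= threshold)
            for xi in X]

# Hypothetical data: [mouth-opening velocity, lateral pogonion movement] per production,
# with the clinician consensus judgement as outcome (1 = appropriate, 0 = inappropriate).
raw = [[60, 0.6], [55, 0.8], [70, 0.5], [65, 1.0],
       [90, 2.8], [40, 3.1], [85, 2.5], [50, 3.4]]
judgement = [1, 1, 1, 1, 0, 0, 0, 0]

X = standardize(raw)
w, b = fit_logistic(X, judgement)
pred = predict(X, w, b)
acc = sum(p == t for p, t in zip(pred, judgement)) / len(judgement)
```

With a small sample, a held-out or leave-one-out estimate would give a fairer picture of predictive accuracy than the in-sample figure computed here.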
4. Discussion
In this paper, we report on a preliminary study aimed at exploring the potential of the MSH-PWs to discriminate the speech of children with SSD from TD speech through evaluation of the Stage III mandibular control word set. We first aimed to determine whether the observations of speech pathologists on the five MSH-PW mandibular control features (appropriate mandibular range, mandibular control/stability, open–close or close–open movements, voicing transitions and syllable structure) could accurately distinguish children with TD speech from those with an SSD. Secondly, we evaluated the agreement between the subjective visual observations, in the form of consensus scores completed by three S-LPs, and kinematic measurements extracted by a state-of-the-art facial mesh detection and tracking algorithm. These research questions were selected to inform the development of norms for the MSH-PWs for assessing impaired speech motor control in children and to evaluate the feasibility of supporting S-LPs with objectively acquired measurements of motor speech control, framed within the MSH-PW scoring criteria. Each research aim is discussed in turn.
4.1. Classification of Children Based on Perceptual Scoring
We found significant differences in MSH-PW jaw range, voicing transitions and mandibular total scores between children with TD speech and those with SSD when scored perceptually by S-LPs. Discrimination analysis indicates a significant correlation between perceptually scored MSH-PW mandibular criteria and diagnostic class, as determined by a battery of diagnostic tools used in current clinical practice. This suggests that the perceptual scoring of the MSH-PW mandibular subtest can discriminate between children with TD speech and those with SSD, with potential for the MSH-PWs to be used by S-LPs in diagnosing impaired speech motor control.
The finding of no significant difference in jaw control contrasts with the existing literature that has established the significance of jaw control and stability in speech sound production [25,56,57,58,59]. Jaw control is considered a foundational aspect of speech development, with research indicating that its maturation follows a protracted trajectory. In early development, jaw movement velocities are slower and more variable [35]. By around 6–8 years of age, these movements become more refined and less variable [26,36], reflecting the gradual development of stable control. Wilson and Nip [60] highlighted the importance of jaw control and stability in supporting lip and tongue movements, noting its involvement in nearly all articulatory positions. Poor jaw control has been identified as a feature of SSD. For example, Mogren and colleagues found children with SSD had larger lateral jaw movements than children with TD speech [26]. Similarly, Terband et al. [37] identified clear deviances in lateral jaw movement within their SSD group compared to their sample of TD participants. We considered several possible reasons for the contrast between these findings and those of the current study.
Firstly, previous studies examining motor control tend to feature participants with clearly identified motor speech disorders (e.g., ref. [59]), including childhood apraxia of speech (e.g., ref. [61]) and/or may differentiate between various subtypes of SSD. For example, Terband et al. [37] differentiated children with phonetic articulation disorder, phonological disorder and childhood apraxia of speech. Participants in this study reflected real-world referral patterns, and those in the SSD group had not been identified as having an SSD before their participation. Some parents expressed uncertainty over whether their child’s speech was developing within age expectations; however, difficulties accessing developmental screening had impeded prior access to speech pathology services. Information from the WA Department of Health, Child and Adolescent Health Services indicates that between July 2019 and June 2020, only 44% of 12-month-old children and 30.2% of 2-year-old children were seen for their universal health check, which includes a developmental screen. The COVID-19 pandemic was noted to have further reduced access to this service in the preceding years [62]. As such, it is possible that participants in the SSD group demonstrated milder SSD features and/or that children within this group primarily had difficulties at the phonological level rather than motor-based difficulties. This suggestion is supported by Terband et al.’s finding that a participant with a phonetic articulation disorder had “very normal values” on lateral jaw movement. It is therefore plausible that the results of this current study reflect the characteristics of participants in the SSD group. Further analysis of children with diagnosed motor speech difficulties may yield classification differences across a wider range of MSH-PW criteria.
Secondly, the mandibular word set items were intentionally selected to include only bilabial consonants and low vowels, with targets achieved through open–close (e.g., um), close–open (e.g., ba), close–open–close (e.g., map) and close–open–close–open (e.g., papa) jaw movements. With the vowel target determining jaw height, differences in the PVC and PPC between the TD and SSD groups indicate there are differences in speech production accuracy from an auditory–perceptual perspective that may not be evident in perceptual movement analysis over the ten mandibular target items. Furthermore, difficulties in jaw control, specifically, may not be evident until motor complexity increases in the higher MSH-PW stages. Further investigations are required to determine the impact on jaw control as young children are required to integrate jaw stability with labial–facial and lingual movements and to sequence these movements in multisyllabic and phrase-level speech.
Finally, perceptual scoring of the MSH-PWs involves determining whether the child’s production meets the criterion of appropriate or inappropriate. This binary scoring system may have obscured mild features of SSD in the SSD group participants. Further research may consider whether an ordinal scoring system enhances the identification of children with mild features of SSD or subclinical features that may indicate speech delays.
4.2. Agreement Between Perceptual Scoring and Kinematic Measures
Our second research question sought to explore the agreement between the perceptual scores of jaw range and the kinematic measure labeled mouth-opening, and between the perceptual scores of jaw control and two kinematic measures labeled mouth-opening velocity and lateral movement of the pogonion. We found good agreement for both jaw range and jaw control.
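To show how measures of this kind might be derived from tracked facial landmarks, the sketch below computes a maximum mouth-opening, a peak opening velocity and a maximum lateral pogonion deviation from per-frame landmark coordinates. The exact definitions used in the study are not reproduced here; the landmark names, the pixel coordinates, the frame rate and the choice of the first frame as the reference position are all assumptions made for illustration.

```python
import math

def jaw_measures(frames, fps):
    """Summarize jaw movement from per-frame (x, y) landmark coordinates.

    frames: list of dicts with hypothetical keys 'upper_lip', 'lower_lip', 'pogonion'.
    """
    # Mouth-opening: distance between mid upper lip and mid lower lip in each frame.
    opening = [math.dist(f["upper_lip"], f["lower_lip"]) for f in frames]
    # Opening velocity: frame-to-frame change in opening, scaled to units per second.
    velocity = [(b - a) * fps for a, b in zip(opening, opening[1:])]
    # Lateral movement: horizontal offset of the pogonion from its first-frame position.
    x0 = frames[0]["pogonion"][0]
    lateral = [f["pogonion"][0] - x0 for f in frames]
    return {
        "max_opening": max(opening),
        "peak_opening_velocity": max(velocity),
        "max_lateral_deviation": max(abs(d) for d in lateral),
    }

# Hypothetical landmark positions (pixels) for four consecutive video frames.
frames = [
    {"upper_lip": (100, 200), "lower_lip": (100, 204), "pogonion": (100, 240)},
    {"upper_lip": (100, 200), "lower_lip": (100, 212), "pogonion": (101, 248)},
    {"upper_lip": (100, 200), "lower_lip": (100, 224), "pogonion": (103, 260)},
    {"upper_lip": (100, 200), "lower_lip": (100, 210), "pogonion": (101, 246)},
]
measures = jaw_measures(frames, fps=30)
```

Raw pixel distances depend on camera distance and face size, so a practical system would likely normalize them (for example, by a stable facial dimension) and compensate for head movement before comparing children.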
The rating of jaw range required consensus raters to make a binary judgement of appropriate or inappropriate for age, where a judgement of inappropriate was given for movements considered restricted or overextended, as defined for each vowel height position for each word in the MSH-PW scoring manual. Our preliminary results suggest the objective measure of mouth-opening could be used to support speech pathologists in their assessment of jaw range.
It is proposed that the agreement between the consensus raters with the mouth-opening measurement resulted from their already established internal representation of jaw height and that this representation was aligned with the objective kinematic measures, resulting in good agreement. This proposal is based on the knowledge that the consensus scorers are familiar with the vowel quadrilateral that describes jaw and tongue positions, as well as the established body of literature that specifies that jaw height adjustments contribute to the production of vowels [63,64]. It is, therefore, conceivable that the raters utilized this knowledge, along with their experience, to inform their decision-making. That is, when a child produced a vowel error, the associated jaw height position could be evaluated as too high or too low with respect to the intended target.
Similarly, there was good agreement between the consensus scores and kinematic measures for jaw control. The rating of jaw control required the consensus raters to make a binary judgement of appropriate or inappropriate for age, based on velocity and the midline or anterior–posterior stability of the jaw. The finding that the combined measures showed greater agreement than either measure alone is likely reflective of the multidimensionality of the jaw control criterion and jaw control movements in general. Multidimensionality is an essential feature of jaw movement: the integration and balancing of vertical, lateral and rotational movements with precise timing and velocity enables speakers to adapt to the acoustic and articulatory demands of different phonemes and provides a dynamic scaffold for tongue and lip movements [58,59,61]. As outlined, the rating for jaw control is based on several criteria reflective of controlled, smooth speech movements within the vertical plane. The velocity of mouth-opening and lateral displacement of the pogonion are both key metrics of jaw control, and the interplay of each movement contributes to the production of fluent, intelligible speech.
Agreement between consensus scorers and kinematic measurements derived from automated facial tracking was likely aided by the high level of experience the consensus scorers had in the assessment of speech motor control. Further research should explore the level of agreement in ratings of jaw range and jaw control with S-LPs who have less experience in the assessment of speech motor control and determine whether kinematic measures can support the clinical judgements of S-LPs of differing levels of experience when scoring jaw range and jaw control criteria of the MSH-PWs.
4.3. Clinical Application
Perceptual, single-word speech assessments, such as the GFTA-3, are a critical component in the assessment of children’s speech and the diagnosis of SSD [12,13], providing timely and convenient measures of speech development and accuracy. As highlighted, there are limitations with the current perceptual assessments, including their focus on identifying phonological deficits rather than also assessing underlying speech motor control difficulties [19]. The MSH-PWs were designed to measure inappropriate speech motor control through the perceptual, visual and auditory assessment of single-word production. The findings of this preliminary study of the Stage III mandibular control level of the MSH-PWs indicate the assessment tool can identify perceptual differences in the appropriateness of jaw range and voicing transitions between children with TD speech and those with SSD. They also support further research into the additional levels of the MSH-PWs: Stage IV (labial–facial control), Stage V (lingual control) and Stage VI (sequenced movements). The generation of normative data for children’s performance at these levels of speech motor control, along with the MSH-PW total scores, would be beneficial.
5. Limitations
Data were analyzed from a small sample of children comprising 41 TD participants and 13 SSD participants. All children were within the age range of 3;0 to 3;6 years. The small sample size and narrow age range limit generalization of the findings to a larger population and to children younger or older than this target age. Children aged between 3;0 and 3;6 years frequently present with speech sound errors (e.g., ref. [65]), and it is possible that participants in the TD sample scored within age expectations on standardized assessments at the time of their participation in the study but may be identified with SSD as they get older and error patterns that are currently developmentally appropriate persist [45]. Similarly, the SSD group comprised children who had not previously been identified as having SSD, suggesting their SSD features may have been less severe and are not a broad representation of the severity of SSD in children seeking speech pathology assessment and intervention. Additionally, the SSD group was not diagnosed according to subtype using a classification framework (e.g., phonological disorder and childhood apraxia of speech) [66]. The SSD children, as a group, scored significantly lower than the TD controls on the VMPAC, indicating poorer oromotor and sequencing skills, which suggests some speech motor involvement within the SSD group. Future research with a larger sample size is needed to investigate the role of the MSH-PW word sets in differentially diagnosing subtypes of SSD.
Data analysis was also restricted to the MSH-PW mandibular word set. These 10 words contain a limited set of consonants (e.g., m, p and b), which may not be sufficient to accurately perceive differences in jaw control, phase and syllable structure. Work is in progress to analyze the remaining stages (thirty words and four phrases) of the MSH-PWs.
Finally, this study sought to discover if it was possible to associate three different facial measurements extracted from recorded videos with the mandibular range and control criteria of the MSH-PWs. While these measurements have been used in previous studies (e.g., ref. [26]), the accuracy/precision of their extraction for use in this study was not evaluated. Irrespective of this, strong (though preliminary) positive correlations were found between the measurements and the clinician-scored criteria. Assessing how accurately these measurements record salient facial movement is the focus of a study currently in progress. The results of that study will allow the measurement process to be refined, likely improving the extracted characterization of facial movements necessary to distinguish between disordered and typically developing mandibular control. Further investigation of facial measurements other than those tested herein is also warranted, especially given the likely multidimensional nature of the MSH-PW criteria.