A Psychometric Evaluation of the Dysphagia Handicap Index Using Rasch Analysis

Background/Objectives: The Dysphagia Handicap Index (DHI) is commonly used in oropharyngeal dysphagia (OD) research as a self-report measure of functional health status and health-related quality of life. The DHI was developed and validated using classic test theory. The aim of this study was to use item response theory (Rasch analysis) to evaluate the psychometric properties of the DHI. Methods: Prospective, consecutive patient data were collected at dysphagia or otorhinolaryngology clinics. The sample included 256 adults (53.1% male; mean age 65.2) at risk of OD. The measure’s response scale, person and item fit characteristics, differential item functioning, and dimensionality were evaluated. Results: The rating scale was ordered but showed a potential gap in the rating category labels for the overall measure. The overall person (0.91) and item (0.97) reliability was excellent. The overall measure reliably separated persons into at least three distinct groups (person separation index = 3.23) based on swallowing abilities, but the subscales showed inadequate separation. All infit mean squares were in the acceptable range except for the underfitting for item 22 (F). More misfitting was evident in the Z-Standard statistics. Differential item functioning results indicated good performance at an item level for the overall measure; however, contrary to expectation, an OD diagnosis presented only with marginal DIF. The dimensionality of the DHI showed two dimensions in contrast to the three dimensions suggested by the original authors. Conclusions: The DHI failed to reproduce the original three subscales. Caution is needed using the DHI subscales; only the DHI total score should be used. A redevelopment of the DHI is needed; however, given the complexities involved in addressing these issues, the development of a new measure that ensures good content validity may be preferred.

Patient self-evaluation comprises two different aspects: functional health status (FHS) and health-related quality of life (HR-QoL) [4,8]. FHS is the impact of a given disease on the ability to perform tasks in multiple domains (including physical, social, role, and psychological functioning). The FHS aims to quantify the symptomatic severity and (loss of) function due to the disease and/or treatment and the effects on daily life as experienced by the individual at a particular point in time [7]. HR-QoL refers to the unique personal perception of someone's health, considering social, functional, and psychological issues [9]. Although FHS and HR-QoL are considered two distinct concepts, self-evaluation questionnaires frequently combine both, without distinguishing disease-related functioning from disease-related quality of life as experienced by the patient [10].
A measure's robust psychometric properties must be demonstrated before implementing it in healthcare or research [11,12]. Two systematic reviews summarising the evidence for the measurement properties of patient self-reported measures developed for people with dysphagia reported poor and incomplete psychometric data [13,14]. Nearly all studies apply the principles of Classical Test Theory (CTT) when evaluating the psychometric robustness of measures, while only a few studies use the more contemporary item response theory (IRT) framework [15,16,17]. CTT and IRT are the most common frameworks used in instrument development and the evaluation of psychometric properties [18]. CTT analyses evaluate the performance of a measure as a whole, whereas the IRT framework uses the item as the unit of analysis. Also, contrary to CTT, results in IRT are not bound by the test population [11,19]. Consequently, psychometric studies using CTT analyses may yield different results than studies incorporating IRT principles and, therefore, may lead to different recommendations or guidelines about which measures to implement in clinics or research [10,15,20].
Commonly used self-report measures for patients with dysphagia include the MD Anderson Dysphagia Inventory (MDADI; [21]), the Swallowing Quality of Life Questionnaire (SWAL-QOL; [22]), the Eating Assessment Tool (EAT-10; [23]) and the Dysphagia Handicap Index (DHI; [24]). To date, only two self-reported measures targeting people with oropharyngeal dysphagia have been evaluated using IRT analyses: the SWAL-QOL [15] and the 20,25]. In contrast to previous studies reporting on both measures' good validity and reliability using CTT, more recent studies using IRT analyses identified major psychometric weaknesses in both measures, calling for further evaluation of the underlying structure and possible redevelopment using IRT.
Overall, both CTT and IRT principles should be considered when developing new instruments and evaluating the psychometric properties of existing measures. Repeating limited CTT analyses for a single measure (for example, repeated cross-cultural validation of a measure into numerous languages [26]) may not further strengthen the psychometric evidence to support its use. Introducing the IRT framework alongside CTT principles will lead to a better understanding of the robustness of the psychometric properties of a measure, prioritising the quality over the quantity of psychometric analyses.
Originally, the Dysphagia Handicap Index (DHI) was developed and validated by Silbergleit, Schultz [24] using CTT. The DHI is a patient-administered questionnaire comprising 25 items across three subscales: the emotional (7 items), functional (9 items), and physical aspects of individuals' lives (9 items) [24]. Items are scored using a three-point ordinal scale (i.e., never = 0; sometimes = 2; or always = 4), with higher scores indicating a higher degree of disability or impact on patients' quality of life. The questionnaire concludes with a single question on the patients' perceived severity of dysphagia using a seven-point scale with three anchor values (1 = normal swallowing; 4 = moderate swallowing problem; and 7 = severe swallowing problem). The item descriptions are provided in Table 1.
Since its publication, several studies have examined the psychometric properties of the DHI using CTT analyses; none have used IRT principles. For example, many studies evaluated hypothesis testing (e.g., convergent validity; [27,28]) and cross-cultural validity (e.g., [29,30]), while minimal data on responsiveness can be obtained from the literature, and no data on measurement error and structural validity have been published.
To address the gap in research, this study aimed to apply an IRT approach to determine the psychometric robustness of the DHI. Using the Rasch measurement model, this study evaluated the response scale, the person and item fit characteristics, differential item functioning, and the scale's dimensionality.
Notes. Item description from Silbergleit, Schultz [28]; blue = physical items; green = functional items; pink = emotional items.

Participants and Procedure
Prospective, consecutive patient data were collected from January 2017 to February 2018 at clinics for dysphagia or otorhinolaryngology at the Leiden University Medical Center, the Netherlands. Only adult patients (i.e., 18 years and older) who were at risk of dysphagia and who underwent either a videofluoroscopic swallowing study (VFSS) or a fiberoptic endoscopic evaluation of swallowing (FEES) were included in this study. Patients with severe cognitive problems or esophageal dysphagia were excluded.
All patients completed the DHI independently, after which a VFSS or FEES was performed as part of standard clinical care. The diagnosis of OD was confirmed through a visuoperceptual evaluation of VFSS or FEES recordings by an experienced speech and language pathologist and/or laryngologist. Further, patient characteristics were collected on both age and gender, in addition to oral intake data (i.e., Functional Oral Intake Scale [FOIS]) [31] as completed by the speech and language pathologist.
In line with COSMIN criteria for adequate sample size for psychometric studies [32], the sample size needed to be five times the number of items, with a minimum sample size of 100. This study was approved by the local Medical Ethics Committee Leiden (approval code: G16.100; date: 17 January 2017) at the Leiden University Medical Center.

Instrument
In 2012, a prototype patient self-report DHI was developed based on a composite series of 60 complaints from dysphagia patients over a one-month period [24]. Twenty-one items were eliminated (i.e., item total correlations r < 0.50 [n = 21] or redundancy/similar wording [n = 14]). Four items with low item total correlations were included in the final DHI as they were considered by the authors to have high content validity or provide pertinent clinical information. The final DHI version was subsequently reduced to 25 items across three subscales: an emotional (7 items), a functional (9 items), and a physical subscale (9 items). The authors chose three response levels to facilitate patients' understanding of response requirements and added a final item on dysphagia severity as perceived by the patient.

Statistical Analysis
Rasch analyses were employed to evaluate the reliability and validity of the DHI. Winsteps version 3.92.0 [33] was used to analyse the data, using joint maximum likelihood estimation for the rating scale model [34]. The initial step was to analyse all 25 DHI items. An iterative process was then used to remove poor-fitting items in various combinations and re-run the analysis to obtain the best overall item fit, person separation, and dimensionality statistics. All investigations included the analyses as described below. Figure 1 provides a schematic representation of all the Rasch domains that were evaluated.

Rating Scale Validity
To confirm whether the ordinal response scale for all items stays true to the assumption that higher ratings indicate "more" and lower ratings indicate "less" within the DHI measure, a Rating Scale Model (RSM) was used to examine the rating scale validity. The three situations in which the partial credit model in Winsteps can be used [34] do not apply to the DHI scale structure, and all DHI items have the same scale structure. To align with the DHI response options, the original categories (i.e., never = 0; sometimes = 2; or always = 4) were recoded as Never (0), Sometimes (1), and Always (2) to comply with Rasch requirements for an ordinal scale [19].
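The recoding step can be illustrated with a minimal sketch. The function name `recode_dhi` and the handling of missing values as `None` are assumptions made for illustration; they are not part of the Winsteps workflow described here.

```python
# Hypothetical helper: map original DHI item scores (0/2/4) onto
# consecutive ordinal categories (0/1/2) before a Rasch analysis.
RECODE = {0: 0, 2: 1, 4: 2}  # never, sometimes, always


def recode_dhi(responses):
    """Return recoded responses; None (missing) is passed through unchanged."""
    recoded = []
    for r in responses:
        if r is None:
            recoded.append(None)
        elif r in RECODE:
            recoded.append(RECODE[r])
        else:
            # Anything other than 0, 2, or 4 is not a valid DHI item score.
            raise ValueError(f"unexpected DHI response: {r!r}")
    return recoded


example = recode_dhi([0, 2, 4, 4, None])
print(example)  # [0, 1, 2, 2, None]
```

Rejecting unexpected values rather than silently mapping them keeps data-entry errors visible before estimation.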
Category response data were examined for an even distribution or category disorder to determine if the rating response scales were being used in an expected manner. Nonuniformity or category disordering may occur when poorly designed items that do not measure the construct are included. Average measure scores that increase monotonically as the category increases indicate ordered categories. Misfitting categories and disordering, indicated by mean squares (MnSq) outside 0.7-1.4, can be considered for collapsing into an adjacent category [19].
To assess step disordering, Andrich thresholds were used; each threshold estimates the point at which a response in either of two adjacent categories is equally probable. Andrich thresholds measure the distance between categories, and it is expected that this distance progresses monotonically, without overlap and without too large a gap. Where step disordering is identified, the category may define a narrow section of the variable, but step disordering does not imply that the category definitions are out of sequence. On a 5-category scale, an increase of at least 1.0 logit indicates distinct categories within the measure. An increase of >5.0 logits indicates gaps in the variable [35].
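The threshold checks described above can be sketched as follows. The function `threshold_report`, its parameter names, and the example threshold values are hypothetical; the 1.0 and 5.0 logit limits are the ones quoted in the text, and the lower limit is the one stated for a 5-category scale.

```python
def threshold_report(thresholds, min_gap=1.0, max_gap=5.0):
    """Summarise ordering and spacing of Andrich thresholds (in logits).

    thresholds: threshold estimates listed from the lowest to the
    highest category boundary.
    """
    # Distance between each pair of adjacent thresholds.
    gaps = [round(b - a, 2) for a, b in zip(thresholds, thresholds[1:])]
    return {
        "ordered": all(g > 0 for g in gaps),   # thresholds advance monotonically
        "gaps": gaps,
        "too_narrow": [g for g in gaps if 0 < g < min_gap],
        "too_wide": [g for g in gaps if g > max_gap],  # possible gap in the variable
    }


# Hypothetical thresholds for a 5-category scale (four boundaries).
report = threshold_report([-3.0, -0.5, 0.2, 6.0])
print(report["gaps"])        # [2.5, 0.7, 5.8]
print(report["too_narrow"])  # [0.7]
print(report["too_wide"])    # [5.8]
```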

Person and Item Fit Statistics
Fit statistics, reported as log-odds units (logits), were used to assess construct validity. Patterns of responses for each person and misfitting items were analysed to determine the reliability of an individual's responses. Logits also indicate whether the items contribute to the main construct (i.e., swallowing difficulty). Infit and outfit are both described as unstandardised MnSq or Z-Standard (Z-STD) statistics. Infit and outfit MnSqs should be close to 1.0 with an acceptable range of 0.7-1.4 [36]. Infit and outfit Z-STD statistics should be close to 0 with an acceptable range of ±2 [36]. Where underfitting is found, further investigation is required to understand the reason. Though underfitting degrades the model, the same is not always true of overfitting; however, caution must still be used to avoid misinterpreting that the model has worked better than expected [36].
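As an illustration of how these cut-offs translate into fit labels, the sketch below classifies a single statistic. The function names and the example values are hypothetical; only the ranges (MnSq 0.7-1.4, Z-STD ±2) come from the text.

```python
def classify_mnsq(mnsq, low=0.7, high=1.4):
    """Label an infit or outfit mean square against the 0.7-1.4 range."""
    if mnsq < low:
        return "overfit"    # less variation than the model expects
    if mnsq > high:
        return "underfit"   # more variation than the model expects
    return "acceptable"


def classify_zstd(zstd, limit=2.0):
    """Label a standardised fit statistic against the +/-2 range."""
    if zstd < -limit:
        return "overfit"
    if zstd > limit:
        return "underfit"
    return "acceptable"


# Hypothetical values for one item.
print(classify_mnsq(1.52), classify_zstd(3.1))   # underfit underfit
print(classify_mnsq(0.95), classify_zstd(-2.4))  # acceptable overfit
```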
Person reliability, the IRT equivalent to Cronbach's alpha, is used to evaluate the internal consistency of the measure. Low values (<0.8) suggest that the measure has too few items or reduced variability in responses (i.e., there are few people with responses in the high or low ranges, indicating more extreme abilities).
To distinguish high performers (in swallowing) from low performers (in swallowing), person separation determines whether the test separates the sample into sufficient levels. When identified as accidental responses, outliers are managed using person separation. For clusters that represent true performances, people are classified using the person separation index (PSI)/strata, calculated as (4 × person separation + 1)/3. When person separation is low, it can be assumed that the measure is not sensitive enough to separate low and high performers. Reliability values of 0.5, 0.8, and 0.9, respectively, indicate separation into only one or two levels, 2-3 levels, and 3-4 levels [19]. To consistently identify three performance levels, a PSI/strata of 3 is required (the minimum level to attain a reliability of 0.9). An item hierarchy with <3 levels (high, medium, low) is verified using the item reliability. If item reliability is <0.9, then the sample is too small to confirm the measure's construct validity (item difficulty).
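The link between reliability, separation, and strata can be made concrete with a short sketch. It assumes the standard Rasch relationships (separation G = sqrt(R / (1 − R)), reliability R = G² / (1 + G²), strata = (4G + 1)/3); the function names are chosen for illustration only.

```python
import math


def separation_from_reliability(reliability):
    """Separation index G implied by a reliability coefficient R (0 <= R < 1)."""
    return math.sqrt(reliability / (1.0 - reliability))


def reliability_from_separation(separation):
    """Reliability R implied by a separation index G."""
    return separation**2 / (1.0 + separation**2)


def strata(separation):
    """Number of statistically distinct levels: (4G + 1) / 3."""
    return (4.0 * separation + 1.0) / 3.0


# A reliability of 0.9 corresponds to a separation of 3.
print(round(separation_from_reliability(0.9), 2))   # 3.0
# The person separation of 3.23 reported for the full DHI:
print(round(reliability_from_separation(3.23), 2))  # 0.91
print(round(strata(3.23), 2))                       # 4.64
```

Under these assumptions, the reported person separation of 3.23 is consistent with the reported person reliability of 0.91.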

Differential Item Analysis
A differential item functioning (DIF) analysis was performed to examine whether the scale items were used in the same way by all groups. DIF occurs when a characteristic other than the swallowing difficulty being assessed influences the rating of an item [36]. The DIF analysis was performed on all 25 items. We tested DIF in variables where we expected DIF (e.g., OD vs. no OD) and in variables where we did not expect DIF (e.g., sex). The sample was categorised by age (18-39 years vs. 40-59 years vs. 60-69 years vs. 70-79 years vs. >80 years), participant category (OD vs. no OD), sex (male vs. female), diagnostic category (neurological disorders vs. head and neck oncology vs. other disorders), and swallowing difficulty according to FOIS (nothing by mouth vs. tube dependent with minimal attempts of food or liquid vs. tube dependent with consistent oral intake of food or liquid vs. total oral diet of a single consistency vs. total oral diet with multiple consistencies, requiring special preparation or compensations vs. total oral diet with multiple consistencies without special preparation, but with specific food limitation vs. total oral diet with no restrictions).
These were variables of interest based on the current literature about OD. In addition, given that the DHI is a measure of swallowing difficulties, we needed to establish if it could detect differences in performance for those with and without swallowing difficulties, as we would expect this would impact their DHI scores [24]. Patients with neurological disorders (e.g., stroke, acquired brain injury, Parkinson's disease, multiple sclerosis, cerebral palsy, or Alzheimer's disease; [37]), head and neck cancer [38], and other disorders (e.g., structural deficits of the oral cavity, pharynx, or larynx; [39]) have been found to have poorer swallowing outcomes.
A significant DIF on a large number of items can indicate item bias. DIF based on age would be expected for older patients [40]. In terms of sex, previous research found that men and women experience similar rates of swallowing difficulty; as such, we did not expect DIF [41]. Swallowing difficulty as classified using the FOIS is expected to show DIF for those with more severe swallowing difficulty [42], as well as DIF for those diagnosed with OD using VFSS or FEES, compared to those without OD [5].
Differential item functioning contrast refers to the difference in difficulty of the item between both groups. Concerning the hypothesis 'this item has the same difficulty for two groups', DIF is noticeable when the DIF contrast (the reporting of the effect size in Winsteps) is at least 0.5 logits with a p-value < 0.05. Both conditions (a DIF contrast of at least 0.5 logits and a p-value < 0.05) need to be present, as statistical significance can be affected by sample size, and the sample size may not be large enough to exclude the possibility of the effect being accidental [19]. The direction of the logits in the DIF contrast scores indicates the difficulty of the item in comparison to what was expected (i.e., positive logits indicate that the item was more difficult than expected [lower scores], and negative logits indicate that the item was easier than expected [higher scores]). In determining DIF when comparing more than two groups (i.e., age, diagnoses, FOIS levels, and DHI severity) with the hypothesis 'this item has no overall DIF across all groups', the chi-square statistic and p-value < 0.05 are used [19]. There are two DIF methods used within Winsteps. The Mantel method is used for polytomous data, which are complete or almost complete. The Mantel-Haenszel method is used for uniform DIF analysis of complete or incomplete dichotomous data; for incomplete or sparse data, it uses a logistic uniform DIF method to estimate the difference between the Rasch item difficulties for the two groups, holding everything else constant. To overcome the limitation of incomplete data, Mantel/Mantel-Haenszel in Winsteps are (log-)odds estimators of the DIF size and significance based on the cross-tabulation of the observations of the two groups and use theta to stratify the data. Mantel and Mantel-Haenszel do not require a large sample [43], so they are suitable for our sample size. Winsteps also employs a non-uniform DIF logistic technique and a graphical non-uniform DIF approach. We used the Mantel and Mantel-Haenszel tests as they are considered the most authoritative for DIF analyses of dichotomous and polytomous variables [33].
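The two-group decision rule (a DIF contrast of at least 0.5 logits together with p < 0.05) can be sketched as below. The function `flag_dif` and the example contrasts and p-values are hypothetical and are not taken from the study's results; the sketch only applies the rule to statistics that have already been estimated elsewhere (e.g., exported from Winsteps).

```python
def flag_dif(contrast, p_value, min_contrast=0.5, alpha=0.05):
    """Return True only when both the effect-size and significance criteria hold.

    contrast: DIF contrast in logits (sign gives the direction of the DIF).
    p_value:  p-value of the accompanying DIF test.
    """
    return abs(contrast) >= min_contrast and p_value < alpha


# Hypothetical (contrast, p-value) pairs for three items.
examples = {
    "item_a": (0.72, 0.01),   # large and significant -> flagged
    "item_b": (0.61, 0.20),   # large but not significant -> not flagged
    "item_c": (-0.20, 0.03),  # significant but small -> not flagged
}
flags = {name: flag_dif(c, p) for name, (c, p) in examples.items()}
print(flags)  # {'item_a': True, 'item_b': False, 'item_c': False}
```

Requiring both criteria guards against flagging trivially small contrasts that reach significance and large contrasts that could be accidental.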

Dimensionality of the Scale
There are a number of ways to assess dimensionality, including (a) using negative point-biserial correlations to identify problematic items; (b) using Rasch fit indicators to identify misfitting items or persons; and (c) employing Rasch factor analysis using principal component analysis (PCA) on the standardised residuals [44]. A PCA of residuals checks the number of principal components to confirm that there are no second or further dimensions after the intended or Rasch dimension is removed. Where residuals for pairs of items are uncorrelated and normally distributed, it can be assumed that no second dimension is present. To determine if further dimensions in the residuals are present, the following criteria are recommended: (a) the Rasch factor uses a cut-off >60% of the explained variance; (b) on first contrast, an eigenvalue of <3 (equivalent to three items) is used; and (c) a first contrast of <10% of the explained variance is used [19].
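The three criteria listed above can be checked mechanically once the PCA of residuals has been run. The sketch below is a hypothetical helper (the function and key names are assumptions); the example inputs are the values reported in the Results for the 25-item scale (42.9% variance explained by the Rasch dimension, first-contrast eigenvalue 3.43, first contrast explaining 7.8%).

```python
def check_unidimensionality(rasch_variance_pct,
                            first_contrast_eigenvalue,
                            first_contrast_pct):
    """Apply the three recommended criteria for the PCA of residuals."""
    checks = {
        "rasch_variance_above_60pct": rasch_variance_pct > 60.0,
        "first_contrast_eigenvalue_below_3": first_contrast_eigenvalue < 3.0,
        "first_contrast_below_10pct": first_contrast_pct < 10.0,
    }
    # Evaluated before the summary key is added, so it covers only the criteria.
    checks["all_met"] = all(checks.values())
    return checks


result = check_unidimensionality(42.9, 3.43, 7.8)
for name, passed in result.items():
    print(f"{name}: {passed}")
```

With these inputs, only the third criterion is met, so the summary flag is false.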
Distributions of person abilities and item difficulties are represented using the person-item dimensionality map, using a logit scale framework. For this paper, person ability refers to a person's self-rated ability to swallow. Items on the DHI that are rarely endorsed, because very few people with swallowing problems will give these items a high rating, will be classified as "difficult" items. In contrast, "easy" items might refer to aspects of swallowing that occur regularly and will receive high self-ratings. Where two or more items represent similar difficulty, they will be placed in the same location on the logit scale. Gaps in the item difficulty continuum are identified when persons are represented with no corresponding item. Another indication of the overall distribution is the person measure score, using a mean measure score of 50 to determine the location on the person-item map. A centralised item mean score of lower than 50 implies that people in the sample were more able than the level of difficulties in the items; higher than 50 indicates a lower ability than the mean item difficulty.

Results
The sample of 256 records from people at risk of OD was used for Rasch analyses, thus meeting the COSMIN criteria of an adequate sample size (more than five times the number of items [5 × 25 = 125], and a minimum sample of 100) [32]; 53.1% were male, and 46.9% were female, with an overall mean age of 65.2 years (SD 14.2; range 18-96 years). A total of 188 patients with confirmed OD and 68 patients without OD were included. About one-third of patients were diagnosed with neurological disorders, one-third with head and neck oncology, and the remaining patients reported dysphagia due to other medical causes (e.g., dysphagia after surgery or presbyphagia). Oral intake data (FOIS) and dysphagia severity (DHI) data showed a wide spread of swallowing ability. No data were missing except for the DHI severity scale (missing data: 13/256; 5.1%). The participants' demographic information is reported in Table 2.
Notes. FOIS = Functional Oral Intake Scale; DHI = Dysphagia Handicap Index; MN = Mean; Med = Median; SD = Standard Deviation.
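The sample size check is simple arithmetic and can be sketched as follows. The function name is hypothetical, and reading the rule as "the larger of five times the number of items and 100" is an interpretation of the criterion as stated in the text.

```python
def required_sample_size(n_items, per_item=5, floor=100):
    """Minimum sample: per_item respondents per item, but never below floor."""
    return max(per_item * n_items, floor)


n_required = required_sample_size(25)  # 25 DHI items
n_observed = 256                       # records analysed in this study
print(n_required)                      # 125
print(n_observed >= n_required)        # True
```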

Rating Scale Validity
The Dysphagia Handicap Index (DHI) is a 25-item measure of three domains of quality of life (QoL) related to the physical aspects of dysphagia (9 items), functional aspects (9 items), and emotional aspects (7 items). The respondents rate the extent to which each statement applies to them with scores of (0) for never, (1) for sometimes, and (2) for always. This results in a maximum score of 50, with higher scores indicating poorer swallowing ability. Respondents also rate their perception of the severity of their swallowing difficulty on a scale from 1 (normal) to 7 (severe problems). We first examined the instrument overall, followed by individual analyses of the three subscales, and finally, we completed analyses to test the removal of items to determine if this improved the fit to the model.
We first examined the response category, item and person fit, dimensionality, and DIF for the DHI, and then for each of the subscales, physical, functional, and emotional aspects, and then finally examined the effect of removing each of the most misfitting items in the overall scale.

Category Order
The examination of the response category for the overall instrument revealed that as the category order increased (from 0 to 2), all fit statistics were in the acceptable range (MnSq = 0.7-1.4), with the average measure scores increasing monotonically, indicating three distinct, ordered categories (see Table 3 and Figure 2). The Andrich thresholds reflect the relative frequency of use of the categories, and these were not disordered, but the step difficulty in the categories advanced by >5 logits between categories 1 and 2 (+4.69) (4.69 − (−4.69) = 9.38 logits), indicating a potential gap in the measure of the variable (i.e., in the rating category labels).
Note: Missing data = 2; 0.03%.
We then examined the category order for each of the three subscales. Average measures for the physical and functional subscales increased monotonically, and the examination of the Andrich thresholds revealed they were not disordered but increased by <5 logits between categories 0 and 1 on the functional subscale (−4.80), but by >5 logits on the physical subscale (−7.54). The step difficulty increased by >5 logits between categories 1 and 2 on the functional subscale (+4.80) (4.80 − (−4.80) = 9.60 logits) and on the physical subscale (+7.54) (7.54 − (−7.54) = 15.08 logits). For the emotional subscale, the average measure did not increase monotonically, and the Andrich thresholds were ordered but increased by >5 logits between 0 and 1 (−8.23) and between 1 and 2 (+8.23) (8.23 − (−8.23) = 16.46 logits). The examination of the category fit statistics revealed no categories in the misfit range.

Person and Item Fit
The summary item and person ability infit and outfit statistics for the 25-item scale were examined (see Table 4). Item reliability was good (0.97), with an item separation of 5.76, and person reliability was 0.91. The person separation index (PSI) was 3.23, indicating that persons were reliably separated into at least three distinct groups based on the strata of abilities. The subscales' item reliability estimates were also good (0.96-0.98), with item separation ranging from 4.89 to 7.72, but person reliability was moderate (0.72-0.82). The PSIs were poor for the emotional (1.6) and physical (1.76) subscales, indicating that persons were not separated into at least two levels; for the functional subscale, the PSI was 2.13. An examination of the point measure correlations for all 25 items revealed that they were all positive, indicating that all items contributed to the measurement of the latent variable. This was also the case for point measure correlations of the subscales. Item fit statistics are provided in Table 5. The MnSq infit and outfit statistics should be close to 1 with an acceptable range of 0.7-1.4, and they are reported as overfitting if <0.7 and underfitting if >1.4 (Bond and Fox 2015 [36]). To fit the model, infit and outfit reported as Z-STD statistics (standardised fit statistics) have an expected outcome of 0 with an acceptable range of ±2. Values above +2 are reported as underfit and values below −2 as overfit (Bond and Fox, 2015). All infit MnSqs were in the acceptable range except for underfit for item 22 (F), and it also had underfit Z-STD statistics. Infit Z-STD values were also underfitting for items 3 (P), 5 (P), 7 (F), and 24 (P). Infit Z-STD values were overfitting for items 2 (P), 12 (E), 13 (E), 14 (F), 15 (F), 16 (F), and 23 (F). Outfit MnSqs were in the desired range except when underfitting in items 3 (P) and 24 (P), and overfitting in items 12 (E), 14 (F), 16 (F), and 18 (E). Outfit Z-STD values were also underfitting (>2) for items 1 (P), 3 (P), 7 (F), 17 (E), 20 (P), 21 (E), and 24 (P) and overfitting for items 12 (E), 13 (E), 14 (F), 15 (F), 16 (F), 18 (E), and 23 (F). When mean squares are acceptable, underfitting and overfitting Z-STD values can be ignored [19].
When examining the person fit statistics, 49 persons had some underfitting MnSqs or Z-STD scores. Twenty-nine persons had both underfitting infit and outfit MnSqs (>1.4). Overall, infit MnSq scores for 35 persons and 22 infit Z-STD scores were underfitting. Sixteen persons had both underfitting infit and outfit Z-STD scores. Overall, infit Z-STD scores were underfitting for 22 persons and the Z-STD outfit scores for 21 persons were underfitting. Infit statistics explain performance better because outfit statistics are sensitive to outlying scores. Too much variation in the responses results in underfitting (MnSq > 1.4; Z-STD > 2), and this is the biggest threat to the measure because it can degrade the model (Bond and Fox 2015 [36]). We then examined the item and person fit statistics for each subscale.
Notes. MnSq values outside the acceptable range of 0.7-1.4 and outfit Z-STD values that exceed ±2 are interpreted as not fitting the Rasch model [36]; PTM Corr. = point measure correlations; values that are in bold are outside the acceptable range and do not fit the Rasch model.
Fifty-one persons had at least one misfitting infit or outfit statistic. In total, 38 people had underfitting infit and outfit MnSq scores, 12 persons had underfitting Z-STD infit and outfit scores, and 12 persons had both infit and outfit MnSq and Z-STD scores that were underfitting.
Forty-three persons had at least one misfitting infit or outfit statistic. In total, 21 people had underfitting infit and outfit MnSq scores, 10 persons had underfitting Z-STD infit and outfit scores, and 10 persons had both infit and outfit MnSq and Z-STD scores that were underfitting.
Sixty-three persons had at least one misfitting infit or outfit statistic. Thirty-five persons had underfitting infit and outfit MnSq scores, five persons had both underfitting Z-STD infit and outfit scores, and five persons had both infit and outfit MnSq and Z-STD scores that were underfitting.
Differential item functioning was then examined for each of the subscales. As with the overall scale, no items showed significant DIF for all variables; however, three P items (3, 24, and 25) showed DIF on sex, and three P items (1, 2, and 20) showed significant DIF on diagnostic category. On the physical subscale, significant DIF was also evident for age on item 4, for OD vs. no OD on item 24, for FOIS on item 11, and for DHI severity on item 11. For the functional subscale, DIF was evident only for age and DHI severity on item 7, for DHI severity only on item 10, and for FOIS score on items 9, 22, and 23. No DIF was evident on the emotional subscale except for OD vs. no OD and FOIS on item 17.

Dimensionality
The dimensionality of the overall scale of 25 items was examined using the principal component analysis (PCA) of the residuals (Tables 7 and 8). Contrasts in the item residuals are examined for dimensions that are not explained by the Rasch dimension. The Rasch dimension explained 42.9% of the variance, with values >40% indicating a strong measurement dimension [19]. The examination of the explained variance showed that the item measures (22.3%) explained slightly more of the variance than the person measures (20.7%). However, the unexplained variance (57.1%) was greater than the explained variance. The raw variance explained by the items was only about three times the variance explained by the first contrast (7.8%), indicating a noticeable second dimension. The first contrast had an eigenvalue of 3.43, which is greater than the threshold of two eigenvalue units and confirms that there is a second dimension. The second contrast had an eigenvalue of 2.11 and explained 4.8% of the variance, which is the smallest amount that could indicate the possibility of a third dimension. The PCA divided the items into two groups: one with the Rasch dimension, comprising items 1, 2, 3, 4, 11, 20, 24, and 25 from the physical (P) subscale and items 17, 19, and 21 from the emotional (E) subscale, and another with a second dimension, comprising items 6, 7, 9, 10, 14, 15, 16, 22, and 23 from the functional (F) subscale and items 8, 12, 13, and 18 from the emotional (E) subscale. This would suggest, based on the theoretical logic for QoL, that for people with dysphagia, QoL is affected by physical symptoms and the functional impact on daily life. However, the results related to the dimensionality of the DHI suggest that the emotional impact is intertwined both with physical symptoms (e.g., having an emotional response [fear] to choking) and with the functional impact (e.g., having an emotional response [depression] to appearing in public). As indicated earlier, the point measure correlations were all in a positive direction, indicating that all items contributed to the measurement of the latent variable and should, therefore, be retained.
As presented in Figure 3, the person-item map showed that (a) there was a need for more easy and more difficult items, (b) many people were not aligned against items, and (c) there was very little item redundancy evident from items aligning at the same level. Items 8 (E), 14 (F), and 20 (P) aligned; 5 (P), 12 (E), 15 (F), and 24 (P) aligned; 2 (P), 7 (F), and 16 (F) aligned; 3 (P), 18 (E), and 23 (F) aligned; and items 10 (F) and 13 (E) aligned. However, of these, items 5 and 24 were both P items, so one was potentially redundant, and 7 and 16 were both F items, so one was potentially redundant. The other alignments can be explained because they are items belonging to differing subscales.
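The eigenvalue criterion used in the PCA of residuals above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the procedure or software used in the study: the residuals are synthetic (256 persons by 25 items, with a shared secondary factor added to the first eight items), and the function name `residual_contrast_eigenvalues` is hypothetical. The only value taken from the text is the cutoff of two eigenvalue units.

```python
import numpy as np


def residual_contrast_eigenvalues(std_residuals):
    """Eigenvalues (largest first) of the item correlation matrix of
    standardized residuals (rows = persons, columns = items)."""
    corr = np.corrcoef(std_residuals, rowvar=False)
    return np.linalg.eigvalsh(corr)[::-1]  # eigvalsh returns ascending order


# Synthetic residuals (an assumption, not the study data).
rng = np.random.default_rng(0)
noise = rng.standard_normal((256, 25))
factor = rng.standard_normal((256, 1))
loadings = np.zeros(25)
loadings[:8] = 1.0  # hypothetical secondary factor shared by 8 items
residuals = noise + factor * loadings

eigenvalues = residual_contrast_eigenvalues(residuals)
first_contrast = float(eigenvalues[0])
# Cutoff of two eigenvalue units, as described in the text.
second_dimension_flagged = first_contrast > 2.0
print(round(first_contrast, 2), second_dimension_flagged)
```

Because the eigenvalues of a correlation matrix sum to the number of items, each eigenvalue unit can be read as roughly one item's worth of variance, which is why a first contrast above two units is taken to indicate more than chance clustering. The reported ratio can be checked the same way: 22.3% / 7.8% ≈ 2.9, i.e., the item measures explain only about three times the variance of the first contrast.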
We also examined the dimensionality of each of the subscales separately. On the physical subscale, the Rasch dimension explained 38.4% of the variance, with persons and items explaining 18.5% and 20.3%, respectively. There was no evidence of a second dimension, with the first contrast variance being less than two eigenvalue units. On the functional subscale, 54.2% of the variance was explained by the Rasch dimension, with person and item variances of 28.2% and 26%, respectively, and no evidence of a second dimension. On the emotional subscale, 44.7% of the variance was explained by the Rasch dimension, with person and item variances of 23.8% and 20.9%, respectively, and no evidence of a second dimension. As with the overall measure, there was a need for more easy and more difficult items in each subscale. Many people were not aligned against items, and the only potential redundancies were in the physical subscale, with items 5 and 24 aligning at the same level (as observed in the overall scale) and items 1 and 2 aligning at the same level.
This process was then repeated with the removal of the most misfitting items, namely 18 (I feel handicapped because of my swallowing (E)), 24 (I feel a strangling sensation when I swallow (P)), and 3 (my mouth is dry (P)), first separately and then in the combinations of items 3 and 24 and items 3, 18, and 24. Even though item 22 (I must eat another way (e.g., feeding tube) because of my swallowing problem (F)) was misfitting, it was the only difficult item and was therefore not removed in these analyses. No significant changes were evident, and all models still indicated a second dimension (Tables 7 and 8), that is, measures of the two components of QoL impact, physical and functional, with emotional as a third component related to both physical and functional items. However, examining the person-item map revealed that the removal of both items 3 and 24 resulted in only items 7 and 16 showing redundancy.

Figure panels: Function subscale; Emotional subscale.
Notes. The areas highlighted in grey denote people without any easy (top) or difficult (bottom) items against them.

Summary Statistics
The summary statistics for item and person ability for all 25 items were good (i.e., high item and person reliability). When examining the subscales, item reliability estimates were good, but person reliability was moderate with poor person separation indices. As a result, people may not be separated into different levels (i.e., high versus low performers in relation to swallowing), supporting the need for more easy and more difficult items.

Rating Scale
When examining how the rating scale was used for the overall DHI, there was no disordering in the categories, and all fit statistics were within an acceptable range. However, step difficulty between the categories indicated a potential gap in the measure of the variable. An examination of the category fit statistics per subscale revealed no categories in the misfit range except for the emotional subscale, and gaps related to step difficulty in all three subscales were confirmed. Increasing the number of categories (i.e., response options) and providing clear descriptions of how categories differ from each other can help resolve these findings.

Person and Item Fit
Overall, more misfitting was evident in the Z-STD statistics, with only one underfitting MnSq infit statistic. Outfit MnSqs outside the acceptable range were mainly overfitting and not in contradiction to the outfit Z-STD scores, with overfitting being less of a threat to the model. However, care needs to be taken so that this is not misinterpreted as 'the model is working better than expected'.
When using the scale as a whole, all items except item 22 had acceptable mean square infit statistics, and therefore, the underfitting and overfitting of the Z-STD scores are considered less important. Outfit MnSqs were overfitting for four items, but the outfit Z-STD scores were also overfitting. The overfitting of MnSq and Z-STD is less concerning than underfitting. Outfit statistics are unweighted, so they are often regarded as less important than infit statistics as they are more sensitive to outliers. Although overfitting values show that the data were more predictable than the model expects, they do not usually degrade the model. Further, although some items had an MnSq infit or outfit score outside the desired range, suggesting their removal, additional analyses assessing the measure's dimensionality suggest that the measure may be improved as a two-dimensional model, with these questions retained but with different wording. Item 22 had infit MnSq and Z-STD scores that were underfitting, but this is likely due to it being about needing to use an alternative means of feeding (i.e., feeding tube). It should therefore be retained, as it would likely perform better with a larger sample that included more people using feeding tubes.
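The statement that unweighted outfit statistics are more sensitive to outliers than infit statistics can be illustrated with the standard mean-square definitions: outfit is the plain average of squared standardized residuals, whereas infit weights each squared residual by its model variance. The sketch below is illustrative only; the function name `infit_outfit` and the toy dichotomous responses are hypothetical and are not taken from the DHI data, which use a polytomous rating scale.

```python
def infit_outfit(observed, expected, variance):
    """Return (infit, outfit) mean squares for one person or item.

    observed: observed scores
    expected: model-expected scores
    variance: model variance of each observation
    """
    sq_resid = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(sq_resid, variance)) / len(sq_resid)
    # Infit: squared residuals weighted by their information (variance).
    infit = sum(sq_resid) / sum(variance)
    return infit, outfit


# Toy dichotomous example (hypothetical): four well-targeted items (p = 0.5)
# and one very easy item (p = 0.95) that is unexpectedly failed.
expected = [0.5, 0.5, 0.5, 0.5, 0.95]
variance = [p * (1 - p) for p in expected]
observed = [1, 0, 1, 0, 0]

infit, outfit = infit_outfit(observed, expected, variance)
print(round(infit, 2), round(outfit, 2))
```

In this toy case, the single unexpected response on the off-target item inflates the outfit value far more than the infit value, which is the behaviour described above.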

Differential Item Functioning
Differential item functioning analysis was used to examine the potential contrasting item-by-item profiles associated with sex, age, the presence or absence of a confirmed diagnosis of OD, medical diagnosis, FOIS, and DHI severity. Overall, the DIF results indicate good performance at an item level. Theoretically, DIF would be expected on most variables, but not for sex. The most obvious DIF was found for FOIS, suggesting a more optimal representation for the functional manifestation of swallowing problems. In contrast, the presence of dysphagia presented only with marginal DIF. The DIF domains age, medical diagnosis, and DHI severity also showed minor DIF, of which the limited DIF for age could be due to a limited distribution across age groups (i.e., 71.1% of participants were ≥60 years).

Dimensionality
The PCA of residuals performed to examine the dimensionality of the overall DHI measure indicated that the DHI consisted of two dimensions, in contrast to the three dimensions suggested by Silbergleit et al. [24]; each dimension consisted of items from either the functional or the physical subscale, supplemented with items originating from the emotional subscale. The person-item dimensionality map indicated little item redundancy but an obvious need to generate more easy and difficult items.
Because Rasch analyses could not confirm the three-dimensional nature of the DHI, it is recommended to avoid the use of subscales and consider only the full measure until dimensionality is addressed through future instrument redevelopment. During the redevelopment process, items from the emotional subscale may be distributed across the other two domains and/or reworded to reflect that physical and functional domains have an emotional element, rather than emotion being a separate dimension.

Future Recommendations
Failing to reproduce the three subscales suggests that the DHI does not meet content validity criteria. Future studies should focus on meeting criteria for all three aspects of content validity: (1) relevance (i.e., the degree to which all items of a measure are relevant for the construct of interest within a target population and purpose of use); (2) comprehensiveness (i.e., the degree to which all key concepts of the construct are included in a measure); and (3) comprehensibility (i.e., the degree to which items of a measure are easy to understand for respondents) [45].
The redevelopment of the DHI should also address the rephrasing and regrouping of misfitting items and the need for the generation of new, easy, and difficult items to improve the separation of people into different levels of low versus high performance (i.e., a low versus high degree of disability on patient's quality of life). In supporting a two-dimensional model, all emotion items should be reworded to reflect an emotional response to a physical or a functional challenge. For example, item 12, 'I feel depressed because I can't eat what I want', can be changed to 'Not being able to eat what I want makes me feel depressed', which then becomes a functional item with an emotional component. After the redevelopment of the DHI, the revised measure's psychometric properties, including its dimensionality, must be determined again using CTT and IRT analyses, preferably in the same, larger samples, meeting current international standards for instrument development.
As solving the current psychometric issues surrounding the DHI is challenging, the development of a new measure that ensures good content validity may be preferred. During the instrument development of patient self-report measures for dysphagia, careful consideration should be given to which constructs should be targeted and whether including the constructs functional health status and health-related quality of life may suffice [7]. In this context, functional health status refers to the impact of dysphagia on the ability to perform tasks in multiple domains (including physical, social, role, and psychological functioning) and aims to quantify the symptomatic severity and (loss of) function due to dysphagia and/or treatment and the impacts on daily life as experienced by patients at a particular point in time [7]. HR-QoL refers to the unique personal perception of someone's health, taking into account social, functional, and psychological issues [9].
A final consideration is the naming of the measure: the Dysphagia Handicap Index. When redeveloping the DHI, one may consider changing the measure's name, as the term 'handicap' is perceived as outdated and offensive [46]. Instead, preference should be given to recommended language as suggested by, for example, the United Nations [46], which refers to 'persons with disabilities' and to targeting patients' capabilities over disabilities.

Conclusions
In general, previous studies using CTT to determine the psychometric properties of the DHI confirmed its validity and reliability [24]. However, our current findings using IRT seem to contradict the results of these CTT studies to some degree and highlight the need for continuous instrument development. The main weakness of the DHI is related to the failure to reproduce its three subscales, suggesting that the DHI does not meet the content validity criteria. The DHI has two dimensions, not three, and the items from the emotional subscale should be reworded and integrated with the functional and physical subscales. The redevelopment of the DHI should focus on meeting all criteria for good content validity, address the rephrasing and regrouping of misfitting items, and include new, easy, and difficult items to improve the separation of people into different levels of swallowing ability. Given the complexity of addressing these issues, the development of a new measure that ensures good content validity may be preferable.

Table 1. Dysphagia Handicap Index items and domains.

Table 4. Item and person summary statistics.
* PSI, Person Separation Index/Strata; PSI = [4 × Person Separation + 1]/3. A person strata value of "3" (the minimum level to attain a reliability of 0.90) implies that three different levels of performance can be consistently identified using the test for samples like that tested; Rel = reliability; Sep = separation; bold text and thicker lines reflect that each analysis generates both person and item statistics.
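The strata formula in the note to Table 4 can be worked through in a few lines of Python. This is a hedged sketch rather than output from the study's software: the function names are hypothetical, the strata formula is the one given in the note, and the link between separation and reliability, Rel = Sep² / (1 + Sep²), is the standard Rasch relation, which is assumed here because it is not stated in the table note. The separation value of 3.23 is the overall person separation reported in the abstract.

```python
def strata(separation):
    """Number of statistically distinct levels: (4 * Sep + 1) / 3."""
    return (4 * separation + 1) / 3


def reliability_from_separation(separation):
    """Standard relation (an assumption here): Rel = Sep^2 / (1 + Sep^2)."""
    return separation ** 2 / (1 + separation ** 2)


overall_separation = 3.23  # overall person separation reported in the abstract
overall_strata = strata(overall_separation)
overall_reliability = reliability_from_separation(overall_separation)
print(round(overall_strata, 2), round(overall_reliability, 2))
```

Under these assumptions, a separation of 3.23 corresponds to a reliability of about 0.91, consistent with the overall person reliability reported in the abstract, and a separation of 3 corresponds to a reliability of 0.90.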

Table 5. Individual item fit statistics and principal component analysis for subscales.

Table 6. Summary of DIF analysis.