A Novel Method for Assessing Risk-Adjusted Diagnostic Coding Specificity for Depression Using a U.S. Cohort of over One Million Patients

Depression is a prevalent and debilitating mental health condition that poses significant challenges for healthcare providers, researchers, and policymakers. The diagnostic coding specificity of depression is crucial for improving patient care, resource allocation, and health outcomes. We propose a novel approach to assess risk-adjusted coding specificity for individuals diagnosed with depression using a vast cohort of over one million inpatient hospitalizations in the United States. Considering various clinical, demographic, and socioeconomic characteristics, we develop a risk-adjusted model that assesses diagnostic coding specificity. Results demonstrate that risk-adjustment is necessary and useful to explain variability in the coding specificity of principal (AUC = 0.76) and secondary (AUC = 0.69) diagnoses. Our approach combines a multivariate logistic regression at the patient hospitalization level to extract risk-adjusted probabilities of specificity with a Poisson Binomial approach at the facility level. This method can be used to identify healthcare facilities that over- and under-specify diagnostic coding when compared to peer-defined standards of practice.


Introduction
The International Classification of Diseases (ICD) is a medical coding system that is continuously updated and used to catalog health conditions by categories of similar diseases under more specific conditions [1]. The World Health Organization has been responsible for the ICD since 1992, providing a standardized method of recording and tracking diseases worldwide [1]. ICD-10 (10th revision) coding affects healthcare delivery, payments and reimbursements, and disease surveillance. ICD-10-CM, the clinical modification (CM) developed and maintained by the Centers for Disease Control and Prevention (CDC) and introduced shortly after, provided a 400% increase in diagnosis codes and an 18-fold increase in procedure codes [2]. This enhancement aimed to add granularity and specificity to clinical records for diagnoses and procedures, though some clinical specialties have been more directly affected than others [1][2][3]. However, a larger catalog of ICD-10 codes does not directly imply widespread use of the more granular coding options [4].
Medical coding is expected to be accurate, complete, and specific to the finest degree possible. Rigorous coding ultimately benefits patients, healthcare providers, and payors [5]. Coding errors can occur when the physician documentation is insufficient [6] or the coding staff is improperly trained. One essential aspect of the coding process is the concept of coding specificity, which can be regarded as medical coding to the greatest level of precision supported by a clinical diagnosis code [5]. Unspecified codes should be the last resort when a more specific diagnosis is not viable. The United States (U.S.) Centers for Medicare and Medicaid Services' (CMS) guidelines indicate that "When sufficient clinical information is not known or available about a particular health condition to assign a more specific code, it is acceptable to report the appropriate unspecified code" [7]. While increasing coding specificity rates has been recommended when appropriate [5], the focus on rates alone can be problematic. Rates can be sensitive to hospital volume, patient mix, and other factors, which may result in varying degrees of specificity, whether clinically supported or not. Additionally, coding to the highest degree of specificity is not always clinically justified, such as in instances where a specified secondary diagnosis may not be needed for the provision of treatment for the principal diagnosis during an inpatient or emergency stay or when resources are not available to provide that additional level of specificity [8][9][10].
In the inpatient setting, coding specificity is the responsibility of the provider and coder, who must work together to record a detailed clinical description of the diagnosis or procedure [5,11]. Accurate levels of diagnostic coding specificity, when possible, help align reimbursements with healthcare costs and provide patients with accurate medical records to more effectively guide treatment plans. Detailed documentation and coding of diagnoses influence reimbursement but also represent an additional cost for healthcare providers and payors. Higher degrees of coding specificity, especially upon introducing ICD-10-CM, require additional levels of expertise among coders [5], which may be a larger burden in facilities providing more general services or those with limited resources or personnel. Practices may also suffer financial loss if coding specificity is insufficient or inappropriate [5]. In some cases, payors may determine that codes lacking specificity are used improperly (or overused), potentially leading to a denied claim [5,7]. Conversely, over-specificity, when not clinically warranted, is problematic, as such coding may overstate patient care needs, unduly inflate reimbursement through Diagnosis Related Group (DRG) creep, and exaggerate a patient's clinical risk.
Despite the importance of accurate coding for patient conditions, enforcing and maintaining a high, yet appropriate, level of coding specificity has remained an issue since the inception of the ICD-10 system [5,11]. While studies have been conducted to identify sources of coding errors throughout a patient encounter or episode, there is minimal literature examining how best to quantify coding specificity as an independent metric or how to predict or identify where unspecified codes are (or have the potential to be) most misaligned with the patient's true diagnosis, that is, when the diagnosis may be accurate but the level of coded specificity is not appropriate for the clinical diagnosis [11,12]. Even in studies where coding specificity is observed or utilized as an analytical component, the methods used to develop the specificity metric are often vague, overlooked entirely, and/or narrowly defined in a disease-specific form, resulting in additional complexity to generalize across conditions [13][14][15]. The lack of standardized methods for measuring, quantifying, and analyzing coding specificity represents a significant gap in knowledge for the healthcare community.
While similar methodologies have been developed in the literature for measuring coding intensity [16,17], we aim to create a metric by which to measure and risk-adjust coding specificity that would allow for comparative analysis of facilities, thus identifying where coding specificity may need improvement against healthcare industry standards or aspirational peers. A metric with potential for widespread implementation would not require clinical inputs regarding the appropriateness of specificity, which may not be agreed upon across physicians, may change substantially over time, and can be costly to obtain and maintain, as well as less generalizable across health conditions. Such a metric should also be relatively easy to implement without major costs and with readily available patient and facility data, such as administrative claims data, though sufficiently flexible to account for other information when available. Depression, affecting 18.5% of the U.S. adult population in 2020 [18], has been identified as one of the conditions that is commonly reported with unspecified diagnosis codes [5]. Three of the most common codes produced by the ICD-10 criteria for depression include major depressive disorder (F32), dysthymic disorder (F34.1), and unspecified depression (F32.A) [5,19,20]. ICD-10 codes related to depression are also grouped within the DRG list of depressive neuroses (DRG 881) [20]. Recommendations for an initial diagnosis using ICD-10 codes require identifying five symptoms of depression lasting two weeks or more, which must include depressed mood or loss of interest [19,21]. However, depression should only be considered after accounting for the absence of medical conditions that can mimic symptoms of depression (e.g., thyroid problems or brain tumors) and after ruling out bereavement or sadness caused by life-altering events [21].
The degree to which coding specificity varies across providers for depression patients remains unclear. Facilities do not have a standard against which they can measure their levels of coding specificity of depression diagnoses during inpatient hospitalizations, particularly because of potential case mix differences across facilities. This calls for a method that risk-adjusts for such differences and provides an objective and standardized metric against which each facility can measure variation in coding specificity. This study aims to demonstrate a novel approach for measuring the risk-adjusted probability of coding specificity controlling for patient and facility characteristics, both across principal and secondary diagnoses of depression, while building an aggregated metric that can be used at coarser levels. Such an approach can be used by quality control personnel to enhance standards of practice around coding specificity, not only for individuals diagnosed with depression but also across a wide spectrum of health conditions.

Materials and Methods
Data were obtained from the Premier Healthcare Database (PHD), a national, hospital-based, service-level, and private all-payor database that contains information on inpatient discharges [22]. The analysis comprises N = 1,071,575 observations of acute inpatient hospitalizations of first patient stays with discharge dates in 2022 with an identified principal or secondary diagnosis of depression. Specificity for a depression principal diagnosis was identified, and, when multiple depression secondary diagnoses occurred, specificity for the secondary diagnosis was defined as specificity for at least one of these depression secondary diagnoses. The ICD-10 codes defining the patient cohort consisted of the F32 (depressive episode) and F33 (major depressive disorder, recurrent) codes.
The data consist of the following information, in addition to masked patient and facility identifiers: (1) binary response variables representing specificity of principal and secondary diagnoses of depression; (2) patient characteristics, which include age, sex, race, length of stay (log-transformed due to its large right-skewness), primary payor, point of origin, discharge status, count of procedures performed during the inpatient stay, CMS fiscal year indicator, five county-level Agency for Toxic Substances and Disease Registry (ATSDR) social vulnerability indices (SVIs) [23], COVID-19 indicator, and Medicare Severity (MS)-DRG type; and (3) facility characteristics, including teaching status, academic status, urban/rural status, ownership status, bed count, hospital-level case mix index (CMI), and state. The primary payor variable refers to the insurance provider that assumes the primary responsibility for covering the costs of a healthcare claim. For example, "Medicare traditional" indicates that the patient is covered under Medicare, the U.S. government's insurance plan for patients aged 65 or older, while "Medicaid traditional" refers to the U.S. government's insurance plan for low-income patients and their families.
Descriptive statistics were calculated for all aforementioned variables, including means/counts and standard deviations/percentages. Categories with low counts and similar meanings (e.g., charity and indigent primary payor types) or adjacent ordered categories with low counts (e.g., ages 0-9) were grouped together.
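The category-grouping step above can be sketched in a few lines. A minimal illustration, assuming hypothetical payor labels and a simple merge rule (the actual PHD categories and count thresholds are not reproduced here):

```python
from collections import Counter

# Hypothetical primary payor labels for six hospitalizations
# (illustrative values, not from the Premier Healthcare Database).
payors = ["Medicare traditional", "Charity", "Indigent",
          "Medicaid traditional", "Charity", "Commercial"]

# Merge low-count categories with similar meanings into one level
rare = {"Charity", "Indigent"}
collapsed = [p if p not in rare else "Charity/Indigent" for p in payors]

# Descriptive statistics: counts and percentages per category
counts = Counter(collapsed)
for category, n in counts.most_common():
    print(f"{category}: {n} ({100 * n / len(collapsed):.0f}%)")
```

Collapsing sparse levels before fitting the logistic models keeps coefficient estimates stable and avoids separation issues in rarely observed categories.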
Univariate and multivariate logistic regression analyses were used to extract associations between patient-level and facility-level variables and the coding specificity of depression principal and secondary diagnoses. Odds ratios (ORs), 95% confidence intervals (CIs), and p-values were calculated and tabulated for all four analyses. The receiver operating characteristic (ROC) curve was constructed, and the corresponding area under the curve (AUC) was computed for both the depression principal and secondary diagnosis specificity multivariate logistic regression models.
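For a logistic model, odds ratios and their Wald confidence intervals follow directly from the fitted coefficients and standard errors (OR = exp(β), bounds = exp(β ± z·SE)). A small sketch; the coefficient and standard error below are illustrative stand-ins, not values from the fitted models:

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Turn a logistic-regression coefficient (log-odds scale) and its
    standard error into an odds ratio with a Wald confidence interval."""
    return (math.exp(beta),           # point estimate
            math.exp(beta - z * se),  # lower 95% bound
            math.exp(beta + z * se))  # upper 95% bound

# Illustrative only: a coefficient of -0.274 corresponds to roughly 24%
# lower odds, similar in spirit to the sex effect reported for the
# principal diagnosis model; the SE of 0.05 is a made-up value.
or_est, lo, hi = odds_ratio_ci(-0.274, 0.05)
print(f"OR = {or_est:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

The same transformation applies to every coefficient in the tabulated univariate and multivariate analyses.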
The unknown probability (π_p,f) of coding specificity of the principal, or secondary, diagnosis for patient hospitalization p in facility f (S_p,f) was modeled with a multiple logistic regression including covariates (X_p,f) that represent both patient and facility characteristics:

logit(π_p,f) = ln(π_p,f / (1 − π_p,f)) = α + β^T X_p,f,

where α is the intercept and the vector β^T contains the corresponding regression coefficients for X_p,f. Each patient hospitalization's coding specificity within a facility f was assumed to be independently distributed with an unequal, unknown probability π_p,f. Since the coding specificity events were not identically distributed, the total count for each facility f (i.e., ∑_{p⊂f} S_p,f) was assumed to follow a Poisson Binomial (PB) distribution with a probability vector π_{p⊂f}, composed of the probabilities for each patient hospitalization p within facility f (i.e., p ⊂ f). Upon extracting the estimated probabilities π̂_p,f for each patient hospitalization p and facility f, these were used to assess whether each facility's total count of specified diagnoses was under-, in line with, or over-specified compared with its healthcare industry peers via a user-defined probability threshold t.
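Because the per-hospitalization probabilities differ, the facility total follows a Poisson Binomial rather than an ordinary Binomial distribution. Its exact probability mass function can be built by convolving one Bernoulli trial at a time. A minimal sketch, with hypothetical probabilities standing in for the fitted π̂_p,f values:

```python
def poisson_binomial_pmf(probs):
    """Exact pmf of a sum of independent Bernoulli(p_i) trials,
    built by convolving one trial at a time (dynamic programming)."""
    pmf = [1.0]  # distribution of the count after zero trials
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # this hospitalization coded unspecified
            nxt[k + 1] += mass * p       # this hospitalization coded specified
        pmf = nxt
    return pmf

# Hypothetical risk-adjusted probabilities for one facility's five
# hospitalizations (illustrative values, not fitted estimates).
probs = [0.9, 0.7, 0.4, 0.4, 0.2]
pmf = poisson_binomial_pmf(probs)
mean_count = sum(k * m for k, m in enumerate(pmf))  # equals sum(probs)
```

The convolution runs in O(n²) per facility, which is tractable at typical facility volumes; the resulting pmf gives the peer-implied distribution of specified-diagnosis counts for that facility.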
Without loss of generality, we applied a common threshold t = 0.025 to identify facilities operating outside peer standards' confidence bounds (2.5th and 97.5th percentiles), denoted by Q_L and Q_U and representing under- and over-specificity, respectively, for facility f:

Q_L = F_f^(−1)(t),  Q_U = F_f^(−1)(1 − t),

where F_f denotes the cumulative distribution function of the facility-specific Poisson Binomial distribution. Visualizations of the facility-specific metrics were produced to demonstrate under-specifying (p < 0.025) and over-specifying (p > 0.975) facilities using the cumulative distribution function of the facility-specific Poisson Binomial distribution and the observed specificity count across patient hospitalizations for that facility. Geospatial U.S. maps of adjusted odds ratios of coding specificity by state were also produced across both outcomes, with New York selected as the reference state based on its having the largest healthcare expenditure per capita in the U.S. [24].
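A minimal sketch of the flagging rule, assuming a facility's observed specificity count is compared against the t and 1 − t tails of its Poisson Binomial CDF (the probabilities and counts below are hypothetical):

```python
def pb_pmf(probs):
    """Poisson Binomial pmf via trial-by-trial convolution."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, m in enumerate(pmf):
            nxt[k] += m * (1.0 - p)
            nxt[k + 1] += m * p
        pmf = nxt
    return pmf

def classify_facility(probs, observed, t=0.025):
    """Flag a facility whose observed specified-diagnosis count falls in
    the lower-t or upper-t tail of its peer-implied PB distribution."""
    pmf = pb_pmf(probs)
    cdf_at_obs = sum(pmf[:observed + 1])  # P(count <= observed)
    cdf_below = sum(pmf[:observed])       # P(count <  observed)
    if cdf_at_obs < t:
        return "under-specified"
    if cdf_below > 1 - t:
        return "over-specified"
    return "in line with peers"

# Hypothetical facility with 20 hospitalizations, each assumed to have
# a risk-adjusted specificity probability of 0.5 (expected count = 10).
probs = [0.5] * 20
print(classify_facility(probs, 2))   # far below the expected count
print(classify_facility(probs, 10))  # near the expected count
```

With these assumed inputs, an observed count of 2 lands in the lower tail and is flagged as under-specified, while a count of 10 is consistent with peer practice.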

Results
Table 1 provides descriptive statistics for all variables across N = 1,071,575 unique inpatient hospital admissions where depression was recorded as the principal or secondary diagnosis. Of these hospitalizations, 16,437 had depression as a principal diagnosis. Of the principal diagnoses, 4736 (28.8%) were coded as unspecified.
Most of the patients were aged 65 to 69 years old (12%), female (65%), and identified as White (80%). The median length of stay was 4 days, the average number of procedures per hospitalization was 2.9 (SD 2.7), and most hospitalizations occurred in the CMS 2022 fiscal year (75%). Traditional Medicare was the most common primary payor (29%), the most common point of origin was a non-healthcare facility (80%), and most patients were discharged to home or self-care (53%). Average scores for patients' SVI values were 0.53 (SD 0.26) for socioeconomic status, 0.51 (SD 0.25) for household characteristics, 0.66 (SD 0.24) for racial and ethnic minority status, 0.60 (SD 0.25) for housing type and transportation, and 0.58 (SD 0.25) for overall vulnerability. Seven percent of patients experienced COVID-19 during their hospitalization, and 74% of patients had a medical MS-DRG type. Most of the hospitals were non-teaching (73%), non-academic (83%), located in an urban setting (88%), voluntary non-profit private (65%), and had more than 400 beds (43%). The average patient case mix index was 1.7 (SD 0.29). Data were collected from facilities in all fifty states, but the five states with the largest numbers of observed hospitalizations were Florida (9%), New York (7%), Texas (6%), North Carolina (6%), and Ohio (5%).
Table 2 contains the univariate and multivariate logistic regression results, including odds ratio estimates, 95% confidence intervals, and p-values for the specificity of a principal depression diagnosis. Patient characteristics such as age, primary payor, and SVI had a significant association with depression principal diagnosis coding specificity across multiple categories based on the multivariate logistic regression model. The odds of depression principal diagnosis coding specificity were at least 46% higher among patients aged less than 80 years, with the exception of those less than 10 years old, when compared to patients 85+ years old (OR ≥ 1.459; p ≤ 0.041). Males experienced 24% lower odds of depression principal diagnosis specificity compared to females (OR = 0.76; p < 0.001). No significant differences in odds of specificity were found by race upon accounting for all other factors. However, every additional unit in the Racial and Ethnic Minority Status SVI was associated with approximately 49% lower odds of specificity (OR = 0.506; p = 0.003). Length of stay (log-transformed) was also positively associated with higher odds of principal diagnosis specificity (OR = 1.82; p < 0.001). Differences were found across some categories of primary payor, point of origin, and discharge status. However, there was no significant association between COVID-19 status, CMS fiscal year period, count of procedures, or other SVI measures and depression principal diagnosis specificity. Patients grouped with a surgical MS-DRG type experienced substantially lower odds of depression-related principal diagnosis specificity (OR = 0.288; p < 0.001) when compared to those with a medical MS-DRG.
Patients attending rural facilities did not experience statistically different odds of specificity of a depression principal diagnosis compared to those attending urban facilities (OR = 1.009; p = 0.917). Patients attending teaching facilities experienced lower odds of depression principal diagnosis specificity (OR = 0.680; p < 0.001), whereas those attending facilities with an academic status experienced higher odds (OR = 1.465; p = 0.001). All significant ownership categories were associated with lower odds of depression principal diagnosis specificity compared to the reference category (voluntary non-profit private). No clear pattern emerged by bed size, and the case mix index was not significantly associated with principal diagnosis coding specificity. However, substantial differences were detected by state when compared to New York as the reference state. For example, states like California experienced much higher odds of specificity of a depression principal diagnosis (OR = 1.995; p < 0.001), while others like New Jersey experienced substantially lower odds (OR = 0.247; p < 0.001).

Table 3 contains the univariate and multivariate logistic regression results, including odds ratio estimates, 95% CIs, and p-values, for the specificity of depression-related secondary diagnoses. The multivariate analysis demonstrates that individuals of all age groups experienced significantly higher odds of specificity of depression secondary diagnoses compared to those 85 and older (OR ≥ 1.116; p ≤ 0.036). Black individuals experienced 12.5% lower odds of secondary diagnosis specificity than White patients (OR = 0.875; p < 0.001). Males experienced approximately 5% higher odds of secondary diagnosis specificity than females (OR = 1.054; p < 0.001). Length of stay (log-transformed) was also positively associated with higher odds of depression secondary diagnosis specificity (OR = 1.237; p < 0.001). Primary payor type, point of origin, and discharge status all contained categories with statistically significant associations with the outcome. Patients who experienced larger numbers of procedures also experienced higher odds of secondary diagnosis specificity (OR = 1.007; p < 0.001). Those discharged in the 2023 CMS fiscal year experienced 4.2% higher odds of depression secondary diagnosis specificity (OR = 1.042; p < 0.001). All SVI indices were also significant, as was COVID-19 status, with COVID-19-positive patients experiencing approximately 7% lower odds of depression secondary diagnosis specificity (OR = 0.929; p < 0.001). Patients with a surgical MS-DRG also experienced 14.5% lower odds of depression secondary diagnosis specificity compared to those with a medical MS-DRG type (OR = 0.855; p < 0.001).
Patients admitted to teaching facilities experienced significantly higher odds of depression secondary diagnosis specificity (OR = 1.177; p < 0.001), while those attending academic facilities experienced lower odds of secondary diagnosis specificity (OR = 0.790; p < 0.001). Rural facilities provided higher odds of specificity to their patients (OR = 1.409; p < 0.001). Some differences were found by ownership status, and patients attending facilities with lower bed counts had lower odds of depression secondary diagnosis specificity compared to those with over 400 beds (OR ≤ 0.937; p < 0.001). Patients attending hospitals with larger case mix index values were associated with lower odds of secondary diagnosis specificity (OR = 0.873; p < 0.001). Finally, substantial state-based differences were detected, with most states experiencing higher odds of depression secondary diagnosis specificity than NY. For example, individuals in states like MN experienced substantially larger odds of depression secondary diagnosis specificity compared to NY (OR = 11.255; p < 0.001).

Figure 2 contains a visual representation of the use of the Poisson Binomial metric for identification of facilities' specificity of depression principal (a) and secondary (b) diagnoses against healthcare industry peers upon adjusting for patient and facility characteristics. A sample of 20 facilities is portrayed in each plot, with colors denoting coding specificity performance versus peers. Observed counts below the 95% CIs identify facilities that under-specify depression diagnoses compared with their peers (blue), while observed counts above the 95% CIs identify facilities that over-specify depression diagnoses versus peers (orange). Finally, those depicted in black represent facilities that specify depression diagnoses in line with their healthcare industry peers.

Discussion
We propose a two-step approach for modeling coding specificity at the facility level. First, a multivariate logistic regression model is proposed to measure, at the patient hospitalization level, the association between the coding specificity of principal and secondary diagnoses of depression and a set of patient- and facility-level characteristics. In a second step, a Poisson Binomial approach builds upon the risk-adjusted, logistic-derived patient-level specificity probabilities to estimate the anticipated 95% confidence interval for coding specificity counts per facility across patient hospitalizations if facilities were to operate in line with healthcare industry standards. Over- and under-specifying facilities are then identified upon comparing their observed coding specificity counts across patient hospitalizations with the aforementioned 95% confidence intervals. We then visualize the facility-specific metrics to demonstrate under- and over-specifying facilities. While outside the scope of this manuscript, facilities can also be ranked by risk-adjusted specificity, since the p-value-based metric already adjusts for both size (i.e., counts) and strength of evidence.
Patient characteristics were associated with the coding specificity of both the principal and secondary diagnoses. Higher odds of specificity for both principal and secondary diagnoses were generally associated with lower ages compared with those 85+ years old. This may be related to a larger complexity in diagnosis or the presence of more comorbidities. However, it could also be related to a lower quality of coding and/or care provided to older populations [25]. Race was not associated with differences in odds of specificity, with the exception of the secondary diagnosis, where Black patients experienced substantially lower odds of coding specificity, which may relate to differences in coding practices by practitioners and/or differences in information-seeking behaviors by patients [26]. This could reflect findings in prior research showing that disparities in the treatment of depression by race/ethnicity among older adults may still be present [27]. Males experienced substantially lower odds of principal diagnosis specificity but higher odds of secondary diagnosis specificity for depression compared to females. It is unclear whether this is confounded by other factors, such as age, due to differentials in life expectancy and sex-related imbalances in the age-sex pyramid, particularly in the U.S. [28].
Patients with longer stays experienced higher levels of specificity in both principal and secondary diagnoses. This could be due to the additional time and resources employed during the inpatient stay or a result of the complexity of their cases. Clinicians may spend less time documenting patients with shorter stays. Some differences were observed by primary payor. However, the patient mix by payor could also be heterogeneous.
For example, those with employer contracts as the primary payor may be experiencing higher odds of principal and secondary diagnosis coding specificity because they are a younger population than those receiving healthcare through Medicare, which is the reference category, though it could also relate to requirements related to workers' compensation. Some social vulnerability indices were also related to differing degrees of coding specificity. However, the information content in these variables likely overlaps with other variables such as age and race/ethnicity. Patients grouped with a surgical MS-DRG experienced lower odds of principal and secondary diagnosis specificity when compared to those with a medical MS-DRG. One possible explanation for this discrepancy is that surgical patients may receive a principal diagnosis that is primarily focused on their surgical condition, which can overshadow or lead to a less detailed assessment and diagnosis of mental health conditions such as depression. Surgical patients who undergo a range of medical tests and evaluations specific to their surgical procedures may experience a more limited extent to which mental health concerns are addressed and documented as the principal diagnosis during the inpatient hospitalization. Additionally, those performing surgical procedures who may be responsible for the patient during the inpatient stay may not be the same physicians identifying and/or treating any underlying depression diagnosis. Multiple procedures may require increased attention and precision, leading to more detailed physician consultations and billing practices that may impact coding specificity.
Facility-level characteristics were also associated with the specificity of both principal and secondary diagnoses. However, differences by diagnosis type were found. For example, patients who attended teaching facilities experienced lower odds of principal diagnosis specificity yet higher odds of secondary diagnosis specificity, whereas the reverse was seen for academic status. This could relate to high levels of collinearity affecting some of the facility-level variables, so cautious interpretation is advisable. Facilities' case mix index was significantly associated with lower odds of specificity for both types of diagnoses. This indicates that hospitals dealing with more complex cases tend to under-specify depression diagnoses. This could relate to the severity of cases and the potential need to allocate resources unevenly across health conditions. Substantial differences were observed by state, with the odds of coding specificity higher across multiple states when compared to NY. This again could reflect differences in patient composition or complexity by state, but also variability in spending per capita, price levels, overall healthcare affordability, and differences in uptake of Medicaid by state [29].
While of some interest, the ultimate purpose of this study is not to explore associations between these patient and facility characteristics and coding specificity outcomes but to leverage them to build a risk-adjusted estimate of the probability of coding specificity that can be used to evaluate facilities' standards of practice. The purpose of the multivariate logistic regression is to capture the probability of specificity, and the combination of patient and facility characteristics led to high levels of explanatory power, even though a large number of clinical factors were not included in this study. The AUC was 0.76 and 0.69 for the principal and secondary diagnosis specificity models, respectively. It would be reasonable to expect that principal diagnoses are specified at a higher level, since secondary diagnoses could be largely unrelated to the primary reason for the inpatient hospitalization, and an accurate diagnosis may not be needed to treat the patient's condition. However, good levels of explanatory power were also found among patient and facility characteristics for the secondary diagnosis model, which comprises the larger cohort of the two analyses. This explanatory power was achieved with relatively low levels of clinical information about the patient. Additional variables describing the clinical characteristics of the patient hospitalization would likely enhance the AUC levels substantially.
The AUC values across both types of diagnoses highlight that risk-adjustment of specificity outcomes is important when evaluating hospital coding specificity performance. Otherwise, facilities could be unfairly compared and evaluated. For example, a hospital treating a large population of younger patients may demonstrate high levels of overall coding specificity while actually providing low levels of risk-adjusted specificity. Risk-adjustment allows practitioners to adjust for industry-level differences, while it also allows policymakers to explore whether such differences are warranted or demonstrate disparities or inappropriate standards of practice at the industry level that need to be addressed.
Upon risk-adjusting for patient and facility characteristics, we demonstrate that substantial differences in coding specificity by facility still remain. These differences are more likely due to idiosyncrasies and facility-specific processes and practices. We demonstrate these differences in risk-adjusted specificity with a sample of facilities. Our proposed metric can help identify facilities that, after adjusting for common factors affecting variability in coding specificity, still deviate substantially from common healthcare practice.
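The facility-level step can be sketched as follows. Given each hospitalization's risk-adjusted probability of specific coding from the regression, the facility's count of specifically coded diagnoses follows a Poisson Binomial distribution. The function names and the uniform probabilities below are illustrative assumptions, not the study's implementation; the sketch builds the exact PMF by convolution and flags observed counts that fall outside the central 95% interval.

```python
# Hedged sketch of the facility-level Poisson Binomial check. Each
# hospitalization contributes an independent Bernoulli trial with its own
# model-estimated probability of specific coding.
import numpy as np

def poisson_binomial_pmf(probs):
    """Exact PMF of the number of successes among independent Bernoulli trials."""
    pmf = np.array([1.0])
    for p in probs:
        # Convolve with [P(failure), P(success)]; index = number of successes
        pmf = np.convolve(pmf, [1.0 - p, p])
    return pmf

def flag_facility(probs, observed, alpha=0.05):
    """Return 'under', 'over', or 'in line' vs. the peer-based expectation."""
    cdf = np.cumsum(poisson_binomial_pmf(probs))
    lower = int(np.searchsorted(cdf, alpha / 2))       # ~2.5% quantile
    upper = int(np.searchsorted(cdf, 1 - alpha / 2))   # ~97.5% quantile
    if observed < lower:
        return "under"
    if observed > upper:
        return "over"
    return "in line"

# Hypothetical facility with 200 hospitalizations and illustrative probabilities
rng = np.random.default_rng(1)
probs = rng.uniform(0.3, 0.9, size=200)
print(flag_facility(probs, observed=60))
```

In practice, the probabilities would come from the fitted regression for the facility's own patient mix, so the interval is a peer-defined, risk-adjusted benchmark rather than a raw specificity rate.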
From a practical standpoint, the model outcomes can serve multiple purposes toward enhancing clinical data abstraction, such as: (1) serve as flags for facilities that, upon risk-adjusting for their patient mix, may be operating at standards that differ widely from those of their peers, whether in the form of under-specificity or over-specificity; (2) serve as an intra-facility flag for physicians or units who may be under- or over-specifying when measured against peers, internal or external to the facility; (3) serve as an intra-facility flag for specificity practices across health conditions; and (4) serve to measure the clinical abstractors themselves for practical root cause analysis. In all cases, the actionable step upon flagging such differences against peers could be a more in-depth gathering of information: whether diagnoses are insufficiently precise, whether personnel are insufficiently versed in the granularity offered by ICD-10 codes, whether clinical abstraction can be enhanced (e.g., due to insufficient or incorrectly recorded clinical diagnoses), or whether the diagnoses are overly precise given the information within the respective clinical records. Our approach can be applied across health conditions and units, thus serving as an automated and low-cost first-warning system for coding specificity practices. Both quality-control personnel within facilities and those outside them (e.g., claims personnel) can thus assess coding practices that may depart from standard practice, with or without cause, and without the need for a full clinical assessment across patients, which would be substantially more costly. While false positives may occur and coding specificity practices may be warranted on a clinical basis, this approach can serve to identify the facilities, units, or physicians most likely to be true positives (intended or unintended) and departing from such practices in ways that may need to be addressed. Ultimately, this would benefit both facilities and patients, enhancing the quality of medical records and identifying and resolving inefficiencies where present. Facilities could benefit from the maximization of reimbursement (when under-specifying) and the minimization of risks (e.g., reputational or financial) due to over-specification [2,5].
As the U.S. and other countries look toward the implementation of ICD-11, standardized methods to measure variation in coding, such as those proposed here, will play an important role in providing hospitals with a fair benchmark against which coding practices can be evaluated. Given the lack of literature, it is unclear whether coding specificity differences already exist between the U.S. and other countries; further studies are needed across healthcare delivery systems to assess whether our findings also apply to systems that may be more centralized, such as the United Kingdom's National Health Service. The effect of the change from ICD-10 to ICD-11 on such potential differences across healthcare systems is also unclear. However, our model allows for a rolling estimation of specificity levels; thus, the impact of interventions, such as those derived from quality-control actions or from the transition from ICD-10 to ICD-11, could be measured with approaches such as interrupted time series analyses.
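Such an interrupted time series evaluation could look like the sketch below. The monthly series, intervention month, and effect sizes are simulated assumptions purely for illustration; a segmented regression estimates the level change and slope change at the intervention.

```python
# Hedged sketch of an interrupted time series (segmented regression) on a
# simulated monthly specificity rate, with a hypothetical intervention
# (e.g., an ICD-10 to ICD-11 transition) at month 24.
import numpy as np

rng = np.random.default_rng(2)
months = np.arange(36, dtype=float)         # 3 years of monthly data
post = (months >= 24).astype(float)         # 1 after the intervention
# Simulated rate: baseline 0.60, mild pre-trend, +0.08 level jump post-intervention
rate = 0.60 + 0.002 * months + 0.08 * post + rng.normal(0, 0.01, 36)

# Design matrix: intercept, pre-existing trend, level change, post-slope change
X = np.column_stack([np.ones(36), months, post, post * (months - 24)])
beta, *_ = np.linalg.lstsq(X, rate, rcond=None)
print(f"estimated level change at intervention: {beta[2]:.3f}")
```

Because the model yields a rolling, risk-adjusted specificity estimate, the same segmented-regression form could be applied to facility-level adjusted rates before and after a coding transition or quality-control action.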
While our approach does not provide a raw measure defining 'correct' levels of coding specificity, it provides the user with a peer-based metric. Institutions that aspire to perform in line with industry standards (or standards defined by a subset of peers) can compare themselves with these standards through the counterfactual outcomes of this model. To our knowledge, the approach demonstrated in this manuscript is the first to address, in a fully extrapolatable way, the issue of diagnostic coding specificity in a large, population-based study. Finally, while our approach is built on a logistic regression model, alternative approaches are possible. We proposed a logistic regression approach for the additional interpretability of the intermediate model outcomes. This approach is also useful because its natural intermediate outcome, the estimated probability of specificity, supports grouping/clustering across hospitalizations that share common underlying traits, such as hospitals, physicians, states, or any other clustering variable. However, other artificial intelligence/supervised learning approaches may be better suited when predictability at the hospitalization level is more relevant than the analysis of coding specificity practices.

Strengths and Limitations
Claims data are generally more readily available and more standardized than medical records, allowing for a larger observation cohort and greater generalizability of methods and results across diseases and patient cohorts. Our cohort of over one million observations is, to our knowledge, the largest in the literature for measuring and modeling the coding specificity practices of any disease. The primary limitation of relying on claims data is the lack of patient-level clinical data that would be contained in an electronic health record (EHR) or similar medical record. Clinical factors such as underlying health conditions, the severity of patient health concerns, or whether procedures are urgent or elective likely play a role in how patient diagnoses are coded and would improve the robustness of our evaluation metrics. However, the information contained in our claims data, including patient- and facility-level characteristics, is sufficient to develop a metric by which to evaluate coding specificity, and would only be improved by such additional information when available. Moreover, limiting a model to settings where EHRs are available would hamper its practical utility. Some variable categories were also grouped due to low value counts (e.g., ages 1-4 and 5-9 combined into a single 0-9 category), and some of these groupings could arguably be deemed subjective. Regardless, their impact on the results is unlikely to be relevant, given the very low counts as a proportion of the overall sample size.
The facility type and the distribution of physician specialties were not considered but could be relevant factors. Facilities that provide healthcare across a wide range of health conditions may not achieve the level of specialization among their physicians and coders found in more specialized facilities.
Race was included to measure potential inequity of care (i.e., of coding) and to demonstrate the approach in terms general to practitioners. However, the inclusion of this variable in the construction of risk-adjusted metrics remains a debated topic, with practitioners still using it to guide clinical decision-making [30]. Because of this ongoing gap between recommended and currently implemented practice, the variable was included to demonstrate differences by race, whether warranted by clinical diagnosis or not. The model can easily be adapted to exclude race and instead cluster practices, thus demonstrating differences by race and practice (or grouping across practices by race). This is outside the scope of this study and is left for future research.
While a facility may contain multiple hospitalizations per patient, we restricted our dataset to a single hospitalization per patient (specifically, the first with a 2022 discharge date) to avoid excessive influence from patients with large numbers of inpatient stays due to recurring needs. This exclusion helped mitigate the concern that subsequent stays would not be independent hospitalizations. This cohort definition can be relaxed by including additional patient hospitalizations and random effects per patient, or by including a factor to account for second and later hospitalizations. However, the computational complexity and burden of such an approach should be considered, as should the heterogeneity of such a population. Ultimately, coding specificity during the first inpatient stay is likely a lower bound for the specificity of subsequent inpatient stays with the same diagnoses, provided adequate records are maintained and clinical staff carefully review them, thus providing a conservative metric for each facility. At the facility level, random effects could also be used; however, this would increase the computational complexity substantially.
Multicollinearity was observed for several variables at both the patient and facility levels, so caution is recommended when drawing conclusions about individual variable relevance (or the directionality of any association) for risk adjustment. To account for this limitation, we performed univariate analyses in addition to the multivariate analyses, providing additional information with which to measure variable associations. It is also important to note that this multicollinearity does not impact overall model performance or the development of a metric to assess variations in coding specificity. The multivariate models' AUCs and the subsequent Poisson Binomial metrics are not affected by multicollinearity, and the model is therefore flexible enough to be expanded with additional variables, if available, even if they are highly correlated with existing ones. Also, state-level clustering of facilities, where hospitals may be part of a shared health system using centralized teams of coders or commonly defined standards, was not considered in this study. This may result in inter-facility correlations, in which case the borrowing of information across facilities could be explored, though that is outside the scope of this study.
Observations are likely not independent, since common latent factors could exist. For example, shared coders or physicians operating across facilities could breach the assumption of independence. Facilities may also have common ownership, which could in turn lead to similar standards of practice. However, these issues do not invalidate the proposed methodology. The grouping was demonstrated at the facility level, but it could be performed at any level, including the physician or facility-owner level.
The secondary diagnosis outcome was defined to reflect any specified secondary diagnosis of depression. However, when multiple secondary diagnoses of depression are present, this binary definition could be considered subjective. Regardless, only a small number of hospitalizations reflected multiple secondary diagnoses of depression, and an analysis using the alternative outcome definition of 'all specified diagnoses of depression' rendered very small differences in AUC.

Conclusions
This study demonstrates a novel approach for measuring risk-adjusted coding specificity, controlling for patient- and facility-level characteristics, for principal and secondary diagnoses of depression. The approach is extended to create an aggregate metric that can be used at coarser levels, grouping by any observable common factor, and is demonstrated here at the facility level. We propose a multivariate logistic regression model for the risk-adjusted coding specificity of depression principal and secondary diagnoses. Our findings demonstrate that both patient and facility characteristics commonly available in claims data are relevant to explaining variability in the coding specificity of both the principal and secondary diagnoses of depression. This approach represents one of the building blocks for designing a risk-adjusted, facility-specific index that quality-control personnel can use to compare facilities' coding specificity practices with peers across diseases. While we demonstrate our approach with a large cohort of patients diagnosed with depression during hospitalizations, the method can be applied to any disease cohort and any grouping-level variable. Our approach therefore fills a gap in the already scarce literature on coding specificity.
Informed Consent Statement: Data were de-identified and provided by Premier Inc. No informed consent was required.

Figure 1
Figure 1 contains the ROC curves resulting from the multivariate logistic regression analyses of the coding specificity of the principal (a) and secondary (b) diagnoses of depression. The corresponding AUC values were 0.7555 and 0.6874, respectively, indicating a slightly better fit for the model assessing the specificity of a depression principal diagnosis.

Figure 1.
Figure 1. Receiver operating characteristic (ROC) curves of the specificity of a depression-related principal diagnosis (a) and secondary diagnosis (b) using the multivariate logistic regression model.

Figure 2
Figure 2 contains a visual representation of the use of the Poisson Binomial metric for the identification of facilities' specificity of depression principal (a) and secondary (b) diagnoses against healthcare industry peers upon adjusting for patient and facility characteristics. A sample of 20 facilities is portrayed in each plot, with colors denoting coding specificity performance versus peers. Observed counts below the 95% CIs identify facilities that under-specify depression diagnoses compared with their peers (blue), while observed counts above the 95% CIs identify facilities that over-specify depression diagnoses versus peers (orange). Facilities depicted in black specify depression diagnoses in line with their healthcare industry peers.

Figure 2.
Figure 2. Observed counts of specificity of depression principal (a) and secondary (b) diagnoses by facility (dots) for two samples of 20 facilities, together with 95% confidence intervals based on the Poisson Binomial model. Facilities that under-specify depression diagnoses compared to healthcare industry peers are depicted in blue (p < 0.025), while those that over-specify depression diagnoses compared to peers are depicted in orange (p > 0.975). Facilities that specify depression diagnoses in line with peers are depicted in black.

Figure 3
Figure 3 contains U.S. maps representing the adjusted odds ratios for the two outcomes. States portrayed in grayscale represent those in which patients have similar odds of coding specificity of depression diagnoses compared with the reference state (New York). States where the odds of diagnosis specificity are below those of New York are represented in blue, while the other color scales represent different degrees of state-level over-specificity of depression diagnoses (see the Figure 3 legend). Both maps indicate that New York generally under-specifies principal and secondary depression diagnoses compared with most states.

Figure 3.
Figure 3. U.S. map representing the adjusted odds ratios, by state, of the specificity of depression-related principal (a) and secondary (b) diagnoses against the reference state of New York. Non-significant adjusted odds ratios are represented in gray. Under-specificity is represented in blue, while over-specificity is clustered into three groups (yellow, orange, and brown) based on adjusted odds ratio ranges.


Table 1.
Summary statistics, including counts (%) and means/proportions (standard deviations [SD]), of study outcomes as well as patient- and facility-level characteristics.

Table 2.
Odds ratios (ORs), 95% confidence intervals (CIs), and p-values for the univariate and multivariate logistic regression analyses of the coding specificity of a depression principal diagnosis.

Table 3.
Odds ratios (ORs), 95% confidence intervals (CIs), and p-values for the univariate and multivariate logistic regression analyses of the coding specificity of depression-related secondary diagnoses.