Cross-Document Emotion Consistency (CDEC): A Feature Family Framework for Financial Disclosure Risk Screening

McCarthy, Shawn; Alaghband, Gita

doi:10.3390/jrfm19040251

by Shawn McCarthy *,† and Gita Alaghband †

Department of Computer Science and Engineering, University of Colorado Denver, Denver, CO 80204, USA

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

J. Risk Financial Manag. 2026, 19(4), 251; https://doi.org/10.3390/jrfm19040251

Submission received: 7 March 2026 / Revised: 23 March 2026 / Accepted: 24 March 2026 / Published: 1 April 2026

(This article belongs to the Section Risk)

Abstract

Financial disclosure occurs through multiple channels with fundamentally different legal constraints. Mandatory SEC filings undergo extensive legal review under Sarbanes–Oxley Section 302, while the earnings call Q&A segments represent comparatively spontaneous communication protected by safe harbor provisions. This structural difference creates a natural experiment for detecting management of information asymmetry through emotional consistency analysis. This paper presents Cross-Document Emotion Consistency (CDEC), a framework measuring emotional alignment between Management’s Discussion and Analysis (MD&A) sections and earnings call Q&A using domain-adapted 28-dimensional emotion classification. The framework jointly analyzes MD&A and Q&A for cross-channel emotion alignment, linking 8-K filings as external event validation. Neither CDEC nor financial fundamentals alone achieve meaningful risk separation; their interaction does. The integrated framework achieves approximately twice the risk separation of the strongest 3-class sentiment baseline tested (27.4 percentage points vs. 14.6 pp for RoBERTa-based consistency and 8.7 pp for FinBERT), indicating that 28-dimensional emotion granularity captures disclosure risk not detected by standard sentiment classification. The Financial Health Indicator (FHI) achieves an out-of-sample Area Under the Curve (AUC) = 0.671, distinguishing firms likely to beat Standardized Unexpected Earnings (SUE) expectations (62.8% beat rate among large-cap sector-representative firms) from those likely to miss (35.4%). The framework serves as a risk-screening tool for due diligence, internal audit, and regulatory oversight among large-cap firms, identifying firms warranting scrutiny rather than generating trading signals.

Keywords: cross-document emotion consistency; financial disclosure analysis; natural language processing; earnings quality; red flag detection; behavioral finance

1. Introduction

Financial disclosure occurs through multiple channels, each governed by different legal frameworks, preparation protocols, and audience expectations. Mandatory Securities and Exchange Commission filings undergo extensive legal review and require executive certification under Sarbanes–Oxley Section 302, which mandates that the CEO and CFO personally certify the accuracy of financial disclosures and the effectiveness of internal controls; false certification can create significant liability, including criminal exposure under SOX Section 906. In contrast, earnings conference calls, particularly their question-and-answer segments, represent more spontaneous communication protected by the Private Securities Litigation Reform Act’s safe harbor provisions for forward-looking statements. This structural difference in disclosure channels creates a natural experiment for detecting the management of information asymmetry through analysis of emotional consistency.

The theoretical underpinning for this natural experiment lies in Cognitive Load Theory (Vrij et al., 2006). Maintaining a deceptive or strategically managed narrative is cognitively taxing: managers must suppress authentic emotions while fabricating alternatives. This effort is sustainable in prepared documents (MD&A) where revision is possible, but becomes unmanageable during the spontaneous, high-pressure environment of a Q&A session. Consequently, “emotional leakage” occurs, with inconsistencies emerging between the carefully crafted MD&A and the reactive Q&A. Valid financial narratives should be internally consistent: when management is genuinely confident, the emotional tenor of spontaneous Q&A responses should align with prepared MD&A filings. Divergence may reveal cognitive dissonance, strategic framing, or latent disclosure risk.

The CDEC premise also connects to two foundational theories of corporate disclosure. Under agency theory (Jensen & Meckling, 1976), managers as agents have incentive to strategically manage disclosure to protect their position; CDEC detects the emotional residue of this agency problem by comparing prepared (MD&A) and spontaneous (Q&A) channels. Under signaling theory (Spence, 1973), consistent emotions across channels signal credibility, while inconsistency signals potential adverse selection. Healy and Palepu (2001) documented that voluntary disclosure decisions are shaped by capital market incentives, including reducing information asymmetry, influencing stock price, and managing litigation risk. The cross-channel consistency framework operationalizes this insight by treating the MD&A-to-Q&A gap as a measurable signal of disclosure quality.

The evolution of financial text analysis has progressed through distinct phases, each addressing the limitations of previous approaches while building upon foundational insights. The journey began with the recognition that quantitative metrics alone fail to capture the full information content of financial disclosures. Sloan (1996) demonstrated in The Accounting Review that operating cash flows serve as more persistent predictors of future earnings than accrual-based earnings, and that investors systematically overvalue firms with high accruals. For example, a hedge portfolio (long low-accrual firms, short high-accrual firms) generated 10.4% size-adjusted annual abnormal returns in year $t + 1$ ($t$-statistic $= 4.71$), establishing the foundation for earnings quality analysis. This quantitative baseline was revolutionized when Loughran and McDonald (2011) identified that generic sentiment dictionaries misclassified approximately 75% of negative words in financial contexts, establishing the necessity of domain-specific analysis.

The current frontier of financial natural language processing, exemplified by FinBERT, achieves substantially higher accuracy in sentiment classification through domain-specific pre-training on financial tokens. A. H. Huang et al. (2023) demonstrated that their model substantially outperforms dictionary-based and generic language models in financial sentiment classification. Yet despite these advances, existing approaches remain constrained to sentiment polarity classification, missing the emotional granularity that behavioral finance research identifies as critical to understanding investor decision-making and market dynamics.

The framework addresses four critical gaps in the existing literature:

Gap 1 (Dimensionality Gap): Binary sentiment analysis (positive/negative) collapses fundamentally different emotions into the same category. A CFO expressing “fear” about a looming liquidity crisis signals a solvency risk, while “annoyance” at an analyst’s question signals defensiveness. Both are “negative,” yet they carry different implications. This motivates our 28-dimensional emotion taxonomy.

Gap 2 (Channel Isolation Gap): Research bifurcates into MD&A studies or conference call studies, with no systematic cross-channel comparison treating the difference between these channels as the primary signal. The framework exploits the natural experiment inherent in financial reporting: the same firm discussing the same quarter in two different psychological modes (prepared vs. spontaneous).

Gap 3 (Interaction Gap): Textual signals are proposed as standalone predictors, yet emotional inconsistency may be benign in healthy firms (who may simply be exuberant) but toxic in financially weak firms (who are obfuscating). Prior research fails to interact textual signals with fundamental accounting quality. The model explicitly captures CDEC × fundamental interactions, multiplicative terms (e.g., CDEC × return on equity, CDEC × earnings quality) that capture how emotional consistency’s predictive power varies with a firm’s underlying financial condition.

Gap 4 (Granularity Gap): Behavioral “red flags” in accounting are well-defined (e.g., Beneish M-Score, Sloan’s Accruals), but behavioral red flags from disclosure patterns remain vague or anecdotal. The framework formalizes mathematically defined, empirically validated indicators with calibrated thresholds.

This paper presents an integrated analytical framework organized into five feature families, progressing from quantitative foundations through granular emotion classification to cross-channel consistency measurement. The primary methodological contribution lies in Family III (Cross-Channel Analysis), which introduces the Cross-Document Emotion Consistency (CDEC) metric, a risk-screening tool that quantifies emotional alignment between mandatory filings and voluntary communications.

Summary of Contributions

The central empirical finding is that neither CDEC nor financial fundamentals alone achieve meaningful risk separation, but their interaction does. The integrated framework achieves an out-of-sample AUC = 0.671, distinguishing firms likely to beat expectations (62.8% beat rate) from those likely to miss (35.4%), a 27.4 percentage-point spread that is approximately twice the separation achieved by the strongest 3-class sentiment baseline tested (14.6 pp for RoBERTa, 8.7 pp for FinBERT; Section 4.5). CDEC alone is not predictive (AUC ≈ 0.51); the interaction with accounting quality, particularly CDEC × ROE, is the mechanism. This positions CDEC as an enabling construct whose value emerges through integration with financial fundamentals, not as a standalone predictor.

Methodologically, this paper contributes CDEC as a measurable cross-document construct (Equations (15)–(19)), a domain-adapted 28-dimensional emotion classifier (FinGoEmotion, preferred in 77.1% of LLM-as-judge comparisons; Section 3.2), and four empirically calibrated behavioral red flags. Of these, RF₁ and RF₄ are validated SUE predictors (both $p < 0.001$), while RF₂ shows suggestive evidence of predicting impairments (2.5× elevation; Section 4.6) and RF₃ serves a monitoring role.

The framework jointly analyzes MD&A, Q&A, and 8-K exhibits, exploiting the natural experiment created by different legal constraints across disclosure channels. The Fin-ALICE research program (McCarthy & Alaghband, 2024) provides the broader sector-level analytical context within which this firm-level analysis operates. CDEC is designed for risk screening, oversight, and monitoring among large-cap sector-representative firms, not alpha generation. The framework identifies firms warranting closer scrutiny rather than firms to trade. Throughout this paper, “prediction” refers to statistical association suitable for risk screening rather than high-precision forecasting.

This paper is structured as follows: Section 2 reviews the literature and presents the conceptual framework (Section 2.7). Section 3 details data sources, domain-adapted emotion classification (Section 3.2), cross-document consistency measurement (Section 3.6), and the synthesized Financial Health Indicator (Section 3.7). Section 4 presents empirical validation including model progression (Section 4.1), red flag validation (Section 4.2), SUE prediction (Section 4.3), ablation studies (Section 4.4), sentiment benchmark (Section 4.5), 8-K analysis (Section 4.6), health status classification (Section 4.7), and temporal performance (Section 4.8). Section 5 discusses implications and limitations. Section 6 concludes. The training corpus specifications and domain adaptation validation are provided in Supplementary Materials S1–S3; the mathematical derivations, extended analyses, and robustness details are provided in Appendices A–E.

2. Literature Review

2.1. Earnings Surprises and SUE

The foundation for our target variable lies in the Standardized Unexpected Earnings (SUE) literature. Latané and Jones (1977) introduced SUE as a measure of a firm’s earnings surprise relative to analyst expectations, scaled so that the surprise is comparable across firms and over time. Bernard and Thomas (1989) documented post-earnings announcement drift (PEAD), demonstrating that stocks beating expectations continue to earn positive abnormal returns for up to 60 days, while stocks missing expectations experience continued negative returns. They termed this persistent pattern the “SUE effect.” This anomaly persists despite decades of academic attention (see Fink (2021) for a comprehensive review of 216 published and 8 working papers on PEAD), suggesting that earnings surprises contain information that markets do not immediately incorporate.

We calculate SUE following Livnat and Mendenhall (2006) and Doyle et al. (2006):

$$SUE_{i,t} = \frac{EPS_{actual,t} - EPS_{forecast,t}}{\sigma_{forecast,t}} \tag{1}$$

where $EPS_{actual}$ is reported earnings per share, $EPS_{forecast}$ is the consensus analyst forecast, and $\sigma_{forecast}$ is estimated from the high–low range of analyst estimates assuming a 95% confidence interval (i.e., $\pm 1.96\sigma$): $\sigma_{forecast} = (EPS_{high} - EPS_{low}) / 3.92$. A positive SUE indicates the firm beat expectations; a negative SUE indicates a miss. This standardization enables comparison across firms with different earnings magnitudes and analyst coverage.
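As a minimal sketch of Equation (1), the calculation can be written in a few lines of Python. The function names and the example firm-quarter are illustrative assumptions, not taken from the paper; only the formula and the 3.92 divisor come from the text above.

```python
def forecast_sigma(eps_high: float, eps_low: float) -> float:
    """Approximate forecast dispersion from the analyst high-low range,
    treating that range as a 95% interval (2 * 1.96 = 3.92 sigmas wide)."""
    return (eps_high - eps_low) / 3.92


def sue(eps_actual: float, eps_forecast: float,
        eps_high: float, eps_low: float) -> float:
    """Standardized Unexpected Earnings as in Equation (1)."""
    sigma = forecast_sigma(eps_high, eps_low)
    if sigma <= 0:
        # Assumption: a zero or inverted range cannot be standardized.
        raise ValueError("analyst high-low range must be positive")
    return (eps_actual - eps_forecast) / sigma


# Hypothetical firm-quarter: actual EPS 1.30, consensus 1.20,
# analyst estimates ranging from 1.00 to 1.40.
example_sue = sue(1.30, 1.20, 1.40, 1.00)
```

In this hypothetical case the surprise of 0.10 is divided by a dispersion of roughly 0.102, giving a SUE just under 1, i.e., a beat of about one forecast standard deviation.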

Brown et al. (1987) established that analyst-based SUE outperforms time-series models for predicting future returns, making analyst forecasts the preferred benchmark. Our framework uses SUE as the alignment target: CDEC should correlate with subsequent earnings surprises if cross-channel emotional consistency reflects genuine management confidence.

2.2. Earnings Quality and the Accruals Anomaly

Sloan (1996) demonstrated that operating cash flows are more persistent predictors of future earnings than accruals, establishing the foundation for earnings quality analysis. The total accruals measure is calculated as:

$$TA_{i,t} = (\Delta CA_{i,t} - \Delta Cash_{i,t}) - (\Delta CL_{i,t} - \Delta STD_{i,t} - \Delta TP_{i,t}) - Dep_{i,t} \tag{2}$$

where subscripts $i$ and $t$ index the firm and fiscal quarter, respectively, $\Delta CA$ is the change in current assets, $\Delta Cash$ is the change in cash, $\Delta CL$ is the change in current liabilities, $\Delta STD$ is the change in short-term debt, $\Delta TP$ is the change in taxes payable, and $Dep$ is depreciation. Sloan showed that accruals-based earnings are less persistent:

$$Earnings_{i,t+1} = \alpha_{0} + \alpha_{1} \cdot Accruals_{i,t} + \alpha_{2} \cdot CashFlows_{i,t} + \varepsilon_{i,t+1} \tag{3}$$

where $Accruals_{i,t}$ corresponds to total accruals ($TA_{i,t}$) from Equation (2), $CashFlows_{i,t} = Earnings_{i,t} - TA_{i,t}$, $\alpha_{0}$ captures baseline expected earnings, $\alpha_{1}$ measures the persistence of the accrual component, and $\alpha_{2}$ measures the persistence of the cash flow component. A finding of $\alpha_{1} < \alpha_{2}$ indicates a lower persistence of accruals. This forms the theoretical basis for our earnings quality metric (Section 3.3, Equation (8)) and the accruals control in our feature set.
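The decomposition in Equations (2) and (3) can be illustrated with a short sketch. The dataclass, field names, and input figures below are hypothetical; the arithmetic follows the total accruals definition and the identity $CashFlows = Earnings - TA$ stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BalanceSheetChanges:
    """Quarter-over-quarter changes used in Equation (2) (hypothetical container)."""
    d_current_assets: float
    d_cash: float
    d_current_liabilities: float
    d_short_term_debt: float
    d_taxes_payable: float
    depreciation: float


def total_accruals(c: BalanceSheetChanges) -> float:
    """Total accruals: non-cash working capital change less depreciation."""
    non_cash_assets = c.d_current_assets - c.d_cash
    operating_liabilities = (c.d_current_liabilities
                             - c.d_short_term_debt
                             - c.d_taxes_payable)
    return non_cash_assets - operating_liabilities - c.depreciation


def cash_flow_component(earnings: float, accruals: float) -> float:
    """Cash flow component of earnings, as defined below Equation (3)."""
    return earnings - accruals


# Hypothetical firm-quarter, all figures in the same currency units.
changes = BalanceSheetChanges(50.0, 10.0, 30.0, 5.0, 5.0, 15.0)
ta = total_accruals(changes)          # (50 - 10) - (30 - 5 - 5) - 15
cf = cash_flow_component(100.0, ta)   # earnings of 100 split into TA and CF
```

Estimating $\alpha_{1}$ and $\alpha_{2}$ would then require a panel of such firm-quarters and a regression of next-period earnings on the two components, which is omitted here.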

2.3. Disclosure Channels and Incentives

Financial disclosure occurs through channels with fundamentally different legal constraints. As noted in Section 1, mandatory SEC filings (10-K annual reports, 10-Q quarterly reports) require SOX Section 302 certification with criminal liability exposure, resulting in extensively vetted, legalistic text designed to minimize liability.

In contrast, earnings conference calls, particularly Q&A segments, operate under the Private Securities Litigation Reform Act’s (PSLRA) safe harbor provisions for forward-looking statements. This affords managers greater flexibility in their communication, though the spontaneous nature of Q&A limits the ability to carefully manage emotional expression.

Matsumoto et al. (2011) documented that Q&A segments generate 34% higher abnormal absolute returns than management presentations (0.51% vs. 0.40%), indicating that investors view spontaneous responses as more informative. Price et al. (2012) found that textual tone in conference calls predicts future returns beyond quantitative earnings metrics. Cohen et al. (2013, 2020) documented “casting the call,” where firms strategically select which analysts participate, a practice associated with a 14% higher restatement likelihood per standard deviation of favoritism.

This structural difference creates a natural experiment: the same management team discusses the same quarter through two psychologically distinct modes (prepared vs. spontaneous). Our CDEC framework exploits this by measuring emotional alignment between channels.

2.4. Financial Text and Sentiment Analysis

The evolution of financial text analysis began with the recognition that generic sentiment dictionaries fail in financial contexts. Loughran and McDonald (2011) demonstrated that the Harvard-IV dictionary misclassifies approximately 75% of negative words in 10-K filings, with terms like “liability,” “tax,” and “cost” incorrectly flagged as negative. Their finance-specific dictionary classifies words into six categories (positive, negative, uncertainty, litigious, strong modal, weak modal), and sentiment is calculated as:

$$Sentiment_{i,t} = \frac{Positive_{i,t} - Negative_{i,t}}{Positive_{i,t} + Negative_{i,t} + \epsilon} \tag{4}$$

where $Positive_{i,t}$ represents the count of positive words using the financial-specific dictionary, $Negative_{i,t}$ represents the count of negative words, and $\epsilon = 10^{-6}$ prevents division by zero. This dictionary has become the standard for textual analysis of financial documents (see Loughran and McDonald (2016) for a comprehensive survey of textual analysis methods in accounting and finance).
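A minimal sketch of the dictionary-based score in Equation (4) follows. The two word sets are tiny hypothetical stand-ins for the positive and negative lists, not the actual Loughran–McDonald dictionary, and the whitespace tokenization is an assumption made for brevity.

```python
EPSILON = 1e-6  # the small constant from Equation (4)

# Hypothetical miniature word lists, for illustration only.
POSITIVE_WORDS = {"strong", "improved", "gain"}
NEGATIVE_WORDS = {"impairment", "loss", "decline"}


def dictionary_sentiment(text: str) -> float:
    """Net tone: (positive - negative) / (positive + negative + epsilon)."""
    tokens = text.lower().split()
    positive = sum(1 for tok in tokens if tok in POSITIVE_WORDS)
    negative = sum(1 for tok in tokens if tok in NEGATIVE_WORDS)
    return (positive - negative) / (positive + negative + EPSILON)


mixed_tone = dictionary_sentiment("strong growth offset by impairment loss")
no_tone = dictionary_sentiment("the board met on tuesday")
```

The second call shows the role of $\epsilon$: a passage with no dictionary hits yields a score of exactly zero instead of a division error.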

Tetlock (2007) established that media sentiment predicts market returns, spawning a literature on disclosure tone. Campbell et al. (2014) analyzed risk factor disclosures, finding that firms with more specific risk language experience lower future stock volatility. Bodnaruk et al. (2015) showed that “constraining” language (words indicating financial constraint) predicts future firm outcomes beyond standard financial ratios; Buehlmaier and Whited (2018) extended this by demonstrating that textual measures of financial constraints are priced in equity returns.

Recent advances in natural language processing, particularly transformer architectures (Devlin et al., 2019) and sentence-level embeddings (Reimers & Gurevych, 2019), have enabled context-aware sentiment analysis. A. H. Huang et al. (2023) developed FinBERT, which substantially outperforms dictionary-based and generic language models in financial sentiment classification through domain-specific pre-training. However, even these sophisticated models remain constrained to sentiment polarity (positive/negative/neutral), collapsing the emotional granularity that behavioral research identifies as decision-relevant.

Mayew and Venkatachalam (2012) demonstrated that vocal affect in conference calls predicts future firm performance beyond textual content. They modeled affect as:

$$AffectScore_{i,t} = \beta_{0} + \beta_{1} \cdot Pitch_{i,t} + \beta_{2} \cdot Variance_{i,t} + \beta_{3} \cdot Duration_{i,t} + \varepsilon_{i,t} \tag{5}$$

where vocal characteristics (pitch, variance, duration) predict future performance beyond textual content. This establishes that emotional signals matter beyond sentiment polarity, motivating our extension from binary sentiment to 28-dimensional emotion classification grounded in psychological theory (Plutchik, 1980), capturing distinctions (Fear vs. Annoyance, Optimism vs. Approval) invisible to polarity-based analysis.

The rapid advancement of large language models has further reshaped the landscape. Lopez-Lira and Tang (2023) demonstrated that GPT-4 achieves approximately 90% portfolio-day hit rates for initial market reactions from news headlines, with forecasting ability increasing with model scale. Kim et al. (2024) showed that ChatGPT summaries of MD&A sections amplify sentiment signals and better explain stock reactions than the original disclosures, with “bloated” (obfuscatory) language correlating with adverse capital market outcomes. Todd et al. (2024) survey the post-transformer trajectory of text-based sentiment analysis in finance, identifying cross-document analysis as an underexplored direction. Ergun and Sefer (2025) validate transfer learning approaches for financial sentiment, consistent with the domain adaptation strategy employed in FinGoEmotion. Despite these advances, the LLM frontier remains focused on single-document sentiment polarity. In prior work, McCarthy and Alaghband (2024) developed the Fin-ALICE framework for sector-level analyst consensus analysis, combining NLP-derived sentiment with causal econometric methods to evaluate sell-side recommendations across market regimes. The present study extends that research program from sector-level analyst evaluation to firm-level disclosure analysis, shifting the unit of observation from analyst reports to paired corporate filings (MD&A and Q&A). We did not identify prior work in this literature that constructs a paired cross-document emotion consistency measure between MD&A and earnings call Q&A at the 28-dimensional granularity level that behavioral theory motivates.

Table 1 provides an illustrative (rather than exhaustive) mapping of the most closely related prior work, identified through searches of Web of Science, Scopus, and Google Scholar using terms including “financial disclosure tone,” “cross-document sentiment,” “MD&A earnings call comparison,” and “emotion consistency.”

2.5. Distribution Comparison Metrics

Comparing probability distributions requires appropriate metrics. Kullback and Leibler (1951) introduced KL divergence, but its asymmetry and unboundedness limit applicability when neither distribution should be privileged. Lin (1991) proposed Jensen–Shannon divergence as a symmetric, bounded alternative:

$$JSD(P \parallel Q) = \frac{1}{2} D_{KL}(P \parallel M) + \frac{1}{2} D_{KL}(Q \parallel M) \tag{6}$$

where $D_{KL}(P \parallel M)$ denotes the Kullback–Leibler divergence measuring information loss when distribution $M$ approximates $P$, and $M = \frac{1}{2}(P + Q)$ is the pointwise average of the two input distributions. When computed with base-2 logarithms, JSD ranges from 0 (identical distributions) to 1 (no overlap), making it well-suited for comparing emotion distributions across disclosure channels without privileging either source.
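Equation (6) can be sketched directly from its definition. The implementation below is illustrative and assumes that both inputs are already normalized probability distributions of equal length; base-2 logarithms are used so that the result lies between 0 and 1.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence as in Equation (6)."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Pointwise average M; M_i > 0 wherever P_i > 0 or Q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


identical = js_divergence([0.5, 0.3, 0.2], [0.5, 0.3, 0.2])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

The two calls reproduce the endpoints described above: identical distributions give 0, and distributions with no shared support give 1. Swapping the arguments leaves the value unchanged, which is the symmetry that plain KL divergence lacks.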

Cosine similarity measures directional alignment between vectors regardless of magnitude:

$$S_{cos} = \frac{\vec{A} \cdot \vec{B}}{\|\vec{A}\| \times \|\vec{B}\|} \tag{7}$$

ranging from −1 (opposite direction) to +1 (identical direction). For 28-dimensional emotion vectors, cosine similarity captures whether MD&A and Q&A express similar emotional patterns even if intensities differ.
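The magnitude invariance of Equation (7) is easy to verify numerically. The sketch below uses short hypothetical vectors in place of full 28-dimensional emotion vectors; the function name is illustrative.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (Equation (7))."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an all-zero vector has no direction to compare.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Same emotional pattern at three times the intensity.
same_pattern = cosine_similarity([0.2, 0.1, 0.4], [0.6, 0.3, 1.2])
# No shared emotions at all.
no_overlap = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

If emotion scores are non-negative, as classifier probabilities are, the dot product cannot be negative, so in practice the similarity between two emotion vectors falls between 0 and 1 even though the general range is −1 to +1.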

Our CDEC metric combines both measures: cosine similarity captures directional alignment, while JS divergence captures distributional intensity differences. This dual approach detects both directional misalignment (optimistic MD&A paired with fearful Q&A) and intensity manipulation (muted MD&A paired with amplified Q&A emotions).

2.6. Behavioral Patterns in Financial Disclosure

Beyond textual sentiment, behavioral finance research identifies specific disclosure patterns that signal risk. We draw on three streams of literature to ground the red flag indicators operationalized in Section 3.6.

Management Overconfidence: Malmendier and Tate (2005) demonstrated that overconfident CEOs systematically overestimate returns on investment, leading to distorted capital allocation. Hirshleifer et al. (2012) extended this by showing that overconfident executives pursue riskier innovation strategies; Malmendier and Tate (2015) provide a comprehensive survey of the behavioral CEO literature. In the disclosure setting, overconfidence may manifest as excessive optimism during spontaneous Q&A responses, a pattern sustainable in prepared filings where language is carefully vetted, but difficult to suppress under real-time questioning. When Q&A optimism substantially exceeds baseline norms, it may signal unrealistic management expectations that precede earnings disappointment.

Cognitive Load and Disclosure Complexity: Vrij et al. (2006) established that maintaining deceptive or strategically managed narratives increases cognitive load, producing measurable behavioral signatures, including increased linguistic complexity and scattered emotional expression. In financial disclosure, this suggests that managers anticipating adverse outcomes may produce more complex MD&A language as they attempt to obscure unfavorable information while maintaining simpler, more controlled Q&A responses. Shannon (1948) entropy provides a natural metric for quantifying this complexity asymmetry across disclosure channels. Campbell et al. (2014) further demonstrated that the specificity of risk language in corporate filings predicts future outcomes, motivating our use of entropy ratios to detect abnormal complexity patterns.

Strategic Communication and Trust: Cohen et al. (2013, 2020) documented that firms strategically select which analysts participate in earnings calls (“casting the call”), a practice associated with 14% higher restatement likelihood per standard deviation of favoritism. We extend this concept from analyst selection to emotional expression: artificially elevated trust-building language in Q&A relative to the measured tone of mandatory filings may indicate strategic relationship management designed to deflect scrutiny.

These behavioral patterns (overconfidence, cognitive load, and strategic trust-building) provide the theoretical foundation for the four red flag indicators operationalized in Section 3.6.

2.7. Conceptual Framework

The framework addresses the identified gaps through a systematic architecture organized into five feature families (Figure 1, Table 2). The key insight motivating this multi-family design is that cross-channel emotional consistency alone is not predictive (AUC ≈ 0.51); predictive power emerges only when CDEC interacts with financial fundamentals. This validates the theoretical premise that emotional inconsistency means different things depending on a firm’s financial health—benign exuberance in healthy firms, but a warning sign in struggling ones.

Each enhancement addresses a specific limitation of traditional approaches. Family I adds the ROE trend analysis because static snapshots miss deteriorating trajectories; a firm with strong current earnings but declining ROE over four quarters presents different risk than one with stable profitability. Family II extends beyond sentiment polarity because collapsing 28 emotions into positive/negative loses decision-relevant distinctions: a CFO expressing fear about liquidity signals different risk than one expressing annoyance at analyst questions, yet both register as “negative sentiment.” Family III introduces cross-channel comparison because single-source analysis cannot detect strategic impression management; the same firm may present cautious MD&A language while projecting optimism in Q&A, a divergence invisible to studies examining either channel in isolation. Family IV operationalizes behavioral patterns into actionable thresholds because vague “red flags” provide no screening utility; practitioners need specific triggers (e.g., optimism > 0.1377) with validated predictive power. Family V synthesizes through Random Forest rather than linear regression because the interaction between CDEC and fundamentals is inherently non-linear: emotional inconsistency in a profitable firm signals potential trouble, while the same pattern in a struggling firm may simply reflect acknowledged difficulties.

3. Methodology

3.1. Data Sources and Alignment

The analysis integrates five data sources aligned at the firm-quarter level: (1) 10-Q/10-K MD&A sections from SEC EDGAR, representing the mandatory, legally-vetted disclosure channel; (2) earnings call transcripts (presentation and Q&A segments), representing the voluntary, spontaneous channel; (3) SEC Form 8-K filings (“Current Reports” filed within four business days of material corporate events) for adverse event validation, specifically Item 2.04 (Debt Covenants), Item 2.05 (Restructuring), and Item 2.06 (Impairments); (4) financial statement data (income statement, balance sheet, cash flow) for quantitative fundamentals; and (5) analyst forecast distributions for SUE construction. Documents are matched by CIK (Central Index Key, the SEC’s unique firm identifier), fiscal quarter, and filing date within ±30 days of earnings announcements. Sample sizes vary by data availability: 4453 firm-quarters have complete financial data; 4147 have matched MD&A and transcripts for CDEC calculation; 3368 have SUE data requiring analyst coverage.
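The matching rule described above (same CIK, same fiscal quarter, filing date within ±30 days of the earnings announcement) might be sketched as follows. The `Filing` container, the quarter label format, the CIK value, and the dates are all hypothetical; the paper does not describe its matching code.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MATCH_WINDOW = timedelta(days=30)  # the +/-30-day window stated in the text


@dataclass(frozen=True)
class Filing:
    cik: str             # SEC Central Index Key
    fiscal_quarter: str  # hypothetical label format, e.g. "2023Q2"
    filing_date: date


def matches_announcement(doc: Filing, cik: str, fiscal_quarter: str,
                         announcement_date: date) -> bool:
    """True when the document belongs to the same firm-quarter and was
    filed within the window around the earnings announcement."""
    within_window = abs(doc.filing_date - announcement_date) <= MATCH_WINDOW
    return (doc.cik == cik
            and doc.fiscal_quarter == fiscal_quarter
            and within_window)


# Hypothetical firm-quarter with one timely and one late document.
announcement = date(2023, 7, 27)
timely_doc = Filing("0001234567", "2023Q2", date(2023, 8, 4))
late_doc = Filing("0001234567", "2023Q2", date(2023, 9, 15))
timely_ok = matches_announcement(timely_doc, "0001234567", "2023Q2", announcement)
late_ok = matches_announcement(late_doc, "0001234567", "2023Q2", announcement)
```

Requiring all three conditions is one reason the sample shrinks from 4453 firm-quarters with financial data to 4147 with matched MD&A and transcripts.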

Following the Fin-ALICE framework (McCarthy & Alaghband, 2024), which analyzes the top companies per GICS sector as sector bellwethers, the S&P 100 sample provides an ideal testbed for establishing the CDEC construct: these firms have the highest analyst coverage density, most consistent transcript quality, and most reliable earnings surprise measurement across all 11 GICS sectors. This sector-representative design prioritizes data quality for construct validation; generalizability to mid- and small-cap firms is addressed as a limitation in Section 5.3.

This section details the mathematical implementation of each feature family, progressing from novel quantitative metrics (Equations (8) and (9)) through cross-document emotion consistency measurements (Equations (15)–(19)) to the validated Financial Health Indicator (Equation (23)). Standard equations from prior literature are presented in Section 2. The domain-specific emotion classifier (FinGoEmotion), which serves as a prerequisite tool for all emotion-based analyses, is described first, followed by the five feature families.

3.2. Domain-Specific Emotion Classification (FinGoEmotion)

The base GoEmotions model—a RoBERTa-based classifier (Demszky et al., 2020) trained on 58,000 Reddit comments across 28 emotion categories—requires domain adaptation for financial discourse due to distinct linguistic patterns and specialized vocabulary. We develop FinGoEmotion through supervised fine-tuning on 103,181 financial text examples across three tiers: Investopedia definitions (6090 examples), synthetic summaries for Loughran–McDonald terms (20,241 examples), and authentic 10-Q MD&A sentences (76,850 examples). Training employs weak supervision with tier-specific weighting (1.0×/1.5×/2.0×) to prioritize authentic regulatory language. Critically, the base GoEmotions model generates all training labels through inference on financial text, ensuring consistent annotation methodology. See Supplementary Materials S1 for detailed corpus composition.
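The effect of the tier weights on the training mix can be worked out arithmetically. The sketch below assumes that the weights 1.0×/1.5×/2.0× apply to the three tiers in the order listed and that each weight acts as a per-example multiplier on the training loss; neither detail is spelled out in the text, and the tier labels are illustrative.

```python
# (number of examples, assumed per-example weight) for each tier
TIERS = {
    "investopedia_definitions": (6090, 1.0),
    "synthetic_lm_summaries": (20241, 1.5),
    "mdna_sentences": (76850, 2.0),
}


def effective_shares(tiers: dict) -> dict:
    """Share of total weighted training signal contributed by each tier."""
    weighted = {name: count * weight for name, (count, weight) in tiers.items()}
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


total_examples = sum(count for count, _ in TIERS.values())
raw_mdna_share = TIERS["mdna_sentences"][0] / total_examples
shares = effective_shares(TIERS)
```

Under these assumptions the three tiers sum to the 103,181 examples reported above, and the authentic MD&A sentences rise from roughly 74% of examples to roughly 81% of the weighted signal, which is the stated intent of prioritizing regulatory language.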

To validate domain adaptation, we conducted an LLM-as-Judge evaluation using Claude Opus 4.5, comparing FinGoEmotion and base GoEmotions classifications across 500 high-divergence samples. The evaluation revealed a 7:1 win ratio favoring FinGoEmotion (77.1% vs. 11.0%), confirming successful adaptation to financial language. Qualitatively, positive sentiment alignment improved by 18%, while negative category detection appropriately shifted from social reactive emotions (disappointment, annoyance) to professional analytical categories (realization, neutral). The mean KL divergence of 0.026 indicates purposeful adaptation while maintaining prediction stability—the model learned domain-specific patterns without catastrophic forgetting. Complete validation methodology and illustrative cases are provided in Supplementary Materials S2.

3.3. Family I: Quantitative Foundations

Sloan (1996) established that the ratio of operating cash flow (OCF) to net income (NI) indicates earnings persistence—firms with high OCF/NI ratios exhibit more sustainable earnings. We operationalize this insight as an earnings quality metric and extend it with a novel grade classification system:

\mathrm{EQ}_{i,t} = \frac{\mathrm{OCF}_{i,t}}{\mathrm{NI}_{i,t}}

(8)

where $\mathrm{EQ}_{i,t}$ represents the earnings quality for firm i in quarter t, $\mathrm{OCF}_{i,t}$ represents the operating cash flow, and $\mathrm{NI}_{i,t}$ represents the net income. Our contribution: We translate the continuous ratio into actionable quality grades: Grade A ($\mathrm{EQ} > 1.1$, strong cash generation), Grade B ($1.0 < \mathrm{EQ} \leq 1.1$, positive cash conversion), Grade C ($0.9 < \mathrm{EQ} \leq 1.0$, moderate quality), and Grade F ($\mathrm{EQ} \leq 0.9$, poor quality). These categorical thresholds enable direct integration with risk-screening workflows.
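The grade thresholds can be sketched as a small lookup function. This is a minimal illustration rather than the authors' implementation: the name `earnings_quality_grade` is hypothetical, and raising an error for zero net income is an assumption, since the text does not say how such quarters are treated.

```python
def earnings_quality_grade(ocf: float, net_income: float) -> str:
    """Map the OCF/NI ratio of Equation (8) to the A/B/C/F grades.

    Assumption: zero net income is rejected. Negative net income is passed
    through unchanged because the source does not specify special handling.
    """
    if net_income == 0:
        raise ValueError("EQ is undefined when net income is zero")
    eq = ocf / net_income
    if eq > 1.1:
        return "A"  # strong cash generation
    if eq > 1.0:
        return "B"  # positive cash conversion
    if eq > 0.9:
        return "C"  # moderate quality
    return "F"      # poor quality
```

For example, operating cash flow of 105 against net income of 100 gives a ratio of 1.05 and therefore Grade B.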

Novel extension: ROE trend analysis, not present in prior literature, captures profitability momentum:

\mathrm{ROE\_Trend}_{i,t} = \frac{\sum_{k=1}^{4} (k - \bar{k}) (\mathrm{ROE}_{i,t-4+k} - \overline{\mathrm{ROE}})}{\sum_{k=1}^{4} (k - \bar{k})^{2}}

(9)

where $\mathrm{ROE\_Trend}_{i,t}$ represents the least-squares slope coefficient of return on equity over the past 4 quarters (positive values indicate improving profitability), $\bar{k} = 2.5$ is the mean of the quarter index k, and $\overline{\mathrm{ROE}}$ represents the mean ROE over the 4-quarter window.
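Equation (9) is the ordinary least-squares slope of ROE on the quarter index, which a few lines of code can make concrete. The sketch below assumes the four ROE values are supplied oldest first; the function name `roe_trend` is chosen here for illustration.

```python
from typing import Sequence


def roe_trend(roe_last_four: Sequence[float]) -> float:
    """Slope of ROE over four quarters, following Equation (9).

    Assumption: values are ordered oldest to newest (k = 1..4).
    """
    if len(roe_last_four) != 4:
        raise ValueError("exactly four quarterly ROE values are required")
    k_values = [1, 2, 3, 4]
    k_bar = sum(k_values) / 4            # 2.5, as stated in the text
    roe_bar = sum(roe_last_four) / 4     # mean ROE over the window
    numerator = sum(
        (k - k_bar) * (roe - roe_bar) for k, roe in zip(k_values, roe_last_four)
    )
    denominator = sum((k - k_bar) ** 2 for k in k_values)  # equals 5.0
    return numerator / denominator
```

A firm whose ROE rises by two percentage points each quarter (0.10, 0.12, 0.14, 0.16) has a trend of 0.02 per quarter; a flat series has a trend of zero.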

3.4. Family II: Textual Components (Emotion Classification)

We extend beyond the six-category sentiment approach of Loughran and McDonald (2011) to 28-emotion classification. The emotion vector for a document is:

\vec{E}_{doc} = \frac{1}{N} \sum_{s=1}^{N} w_{s} \cdot \vec{e}_{s}

(10)

where $\vec{E}_{doc}$ represents the document-level emotion vector (28-dimensional), N represents the number of sentences, $w_{s}$ represents the weight for sentence s based on position and keyword presence, and $\vec{e}_{s}$ represents the emotion vector for sentence s.

To convert the emotion vector to a probability distribution, we normalize:

p_{j} = \frac{E_{doc,j}}{\sum_{k=1}^{28} E_{doc,k}}

(11)

where $E_{doc,j}$ is the j-th component of the emotion vector. We then measure emotional entropy following Shannon (1948):

H(E) = -\sum_{j=1}^{28} p_{j} \log_{2}(p_{j})

(12)

where $H(E)$ represents the emotional entropy measuring complexity or potential obfuscation. Higher entropy indicates diffuse emotional expression (many emotions activated), while lower entropy indicates focused expression (few dominant emotions).
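Equations (10)–(12) chain together naturally, and the following sketch shows one way to compute them. It is illustrative only: the function names are hypothetical, the sentence weights are taken as given (the position and keyword weighting scheme is not specified here), and zero-probability terms are treated as contributing nothing to the entropy, which is the usual convention.

```python
import math
from typing import List, Sequence


def document_emotion_vector(
    sentence_vectors: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Equation (10): weighted sum of sentence emotion vectors divided by N."""
    if not sentence_vectors or len(sentence_vectors) != len(weights):
        raise ValueError("need at least one sentence and one weight per sentence")
    n = len(sentence_vectors)
    dims = len(sentence_vectors[0])
    return [
        sum(w * vec[j] for w, vec in zip(weights, sentence_vectors)) / n
        for j in range(dims)
    ]


def to_distribution(vector: Sequence[float]) -> List[float]:
    """Equation (11): rescale non-negative scores so that they sum to 1."""
    total = sum(vector)
    if total <= 0:
        raise ValueError("emotion vector must have a positive sum")
    return [v / total for v in vector]


def emotional_entropy(probabilities: Sequence[float]) -> float:
    """Equation (12): Shannon entropy in bits (zero-probability terms add 0)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)
```

With 28 categories the entropy is bounded above by $\log_{2}(28) \approx 4.81$ bits, reached when every emotion is equally active, and equals zero when a single emotion carries all of the mass.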

3.5. Earnings Call Weighted Emotions

We weight call sections to reflect their differential information content, based on the empirical finding that Q&A generates 34% higher abnormal returns than presentations (Matsumoto et al., 2011):

\vec{E}_{Call} = 0.35 \cdot \vec{E}_{Presentation} + 0.65 \cdot \vec{E}_{Q\&A}

(13)

The Q&A segment therefore receives 65% of the weight and the prepared presentation 35%.

Beyond the aggregate call emotion, we also measure consistency within the management team. Executive divergence captures disagreement among company executives:

\mathrm{ExecDiv}_{i,t} = \max_{k,l \in \mathrm{Execs}} \| \vec{E}_{k} - \vec{E}_{l} \|_{2}

(14)

where Execs represents company executives identified as speakers in prepared remarks, and $\| \cdot \|_{2}$ denotes the Euclidean distance. We report the maximum pairwise divergence, capturing the highest level of emotional disagreement within management.
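A short sketch of Equations (13) and (14) follows. The function names and the dictionary keyed by executive role are illustrative assumptions; returning 0.0 when fewer than two executives speak is also an assumption, because the text does not address that case.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def call_emotion_vector(
    presentation: Sequence[float],
    qa: Sequence[float],
    qa_weight: float = 0.65,
) -> List[float]:
    """Equation (13): 35% presentation plus 65% Q&A, component by component."""
    return [(1 - qa_weight) * p + qa_weight * q for p, q in zip(presentation, qa)]


def executive_divergence(exec_vectors: Dict[str, Sequence[float]]) -> float:
    """Equation (14): largest pairwise Euclidean distance among executives.

    Assumption: with fewer than two speakers the divergence is reported as 0.0.
    """
    if len(exec_vectors) < 2:
        return 0.0
    return max(
        math.dist(a, b) for a, b in combinations(exec_vectors.values(), 2)
    )
```

Taking the maximum rather than the mean means that a single outlying executive is enough to raise the score, which matches the stated aim of capturing the highest level of disagreement.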

3.6. Family III: Cross-Document Emotion Consistency (Primary Innovation)

The CDEC metric quantifies cross-channel emotional consistency through three complementary components, each targeting a different aspect of disclosure alignment:

Cosine Similarity ($S_{cos}$): measures directional alignment of emotion vectors regardless of intensity.
Jensen–Shannon Similarity ($S_{JS}$): measures distributional intensity differences between channels.
Red Flag Penalty ($S_{RF}$): penalizes behavioral patterns indicating potential deception or obfuscation.

We detail each component below.

Component 1—Directional Alignment via Cosine Similarity:

S_{cos} = \frac{\vec{E}_{MD\&A} \cdot \vec{E}_{Q\&A}}{\| \vec{E}_{MD\&A} \| \times \| \vec{E}_{Q\&A} \|}

(15)

where $\vec{E}_{MD\&A}$ represents the 28-dimensional emotion vector from the 10-Q MD&A section, $\vec{E}_{Q\&A}$ represents the 28-dimensional emotion vector from the earnings call Q&A segment, · denotes the dot product, and $\| \cdot \|$ denotes the Euclidean norm.

Cosine similarity isolates directional alignment rather than magnitude, the critical dimension for detecting strategic impression management. This design choice rests on cognitive load theory (Vrij et al., 2006): maintaining deceptive narratives requires suppressing authentic emotions while constructing alternatives. In general the metric ranges from −1 (complete opposition) to +1 (perfect alignment); because emotion scores are non-negative, values for emotion vectors fall between 0 and +1, with legitimate disclosures typically scoring 0.5–0.9. Scores below 0.3 warrant investigation, as they indicate fundamental emotional misalignment between prepared and spontaneous communications. The normalization by vector magnitudes ensures documents of different lengths remain directly comparable. See Appendix A.6 for detailed computation methodology and illustrative examples.
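Equation (15) can be computed directly from two emotion vectors, as in the minimal sketch below. The function name is illustrative, and rejecting an all-zero vector is an assumption, since the cosine is undefined in that case.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Equation (15): dot product divided by the product of Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Because both norms appear in the denominator, rescaling either vector leaves the score unchanged, which is the length-invariance property described above.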

As a robustness check, we calculate top-k emotion overlap:

S_{overlap} = \frac{| \mathrm{Top}_{k}(\vec{E}_{MD\&A}) \cap \mathrm{Top}_{k}(\vec{E}_{Q\&A}) |}{k}

(16)

where $\mathrm{Top}_{k}(\vec{E})$ returns the k emotions with highest probabilities and $k = 3$. This complements cosine similarity by measuring agreement on dominant emotional themes. The top-k overlap provides qualitative validation but is not included in the final CDEC score to avoid redundancy with cosine similarity.
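The overlap in Equation (16) reduces to a set intersection. The sketch below assumes each document's emotions are held in a dictionary keyed by emotion label (a representation chosen here for readability); ties are broken by insertion order, which the text does not specify.

```python
from typing import Dict, Set


def top_k_labels(scores: Dict[str, float], k: int) -> Set[str]:
    """Labels of the k highest-scoring emotions (ties keep insertion order)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def top_k_overlap(mdna: Dict[str, float], qa: Dict[str, float], k: int = 3) -> float:
    """Equation (16): share of the top-k emotions common to both documents."""
    return len(top_k_labels(mdna, k) & top_k_labels(qa, k)) / k
```

With $k = 3$ the score can only take the values 0, 1/3, 2/3, or 1, which is one reason it serves better as a qualitative check than as a scored component.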

Component 2—Distributional Intensity via Jensen–Shannon Divergence:

We measure the distributional similarity between the MD&A and Q&A sections by calculating the Jensen–Shannon divergence (JSD), which quantifies how much the two distributions differ from a common reference distribution M (Lin, 1991):

\mathrm{JSD}(P_{MD\&A} \,\|\, P_{Q\&A}) = \frac{1}{2} D_{KL}(P_{MD\&A} \,\|\, M) + \frac{1}{2} D_{KL}(P_{Q\&A} \,\|\, M)

(17)

where $D_{KL}(P \,\|\, M)$ denotes the Kullback–Leibler divergence measuring information loss when distribution M approximates P, and $M = \frac{1}{2}(P_{MD\&A} + P_{Q\&A})$ is the average distribution. Unlike the KL divergence, the JS divergence is symmetric and, with base-2 logarithms, bounded to [0, 1]. While cosine similarity captures directional alignment, JS divergence measures distributional differences in emotional intensity. Two documents may align directionally while exhibiting vastly different intensity distributions, indicating careful emotional management. The transformation $S_{JS} = 1 - \mathrm{JSD}$ converts divergence to similarity, where higher values indicate greater consistency.
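Equation (17) and the similarity transformation can be sketched as follows. Base-2 logarithms are used so that the divergence stays within [0, 1], consistent with the entropy definition in Equation (12); the function names are illustrative.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], m: Sequence[float]) -> float:
    """Kullback-Leibler divergence D_KL(P || M) in bits.

    Terms with p_j = 0 contribute nothing; m_j > 0 wherever p_j > 0 because
    M is the average of the two distributions being compared.
    """
    return sum(pj * math.log2(pj / mj) for pj, mj in zip(p, m) if pj > 0)


def js_similarity(p: Sequence[float], q: Sequence[float]) -> float:
    """S_JS = 1 - JSD, with JSD defined as in Equation (17)."""
    m = [(pj + qj) / 2 for pj, qj in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return 1 - jsd
```

Identical distributions give a similarity of 1, and distributions with no shared support give 0, the two ends of the bounded range.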

Component 3—Pattern-Based Red Flags:

Building on the behavioral patterns established in Section 2.6, we operationalize four red flag indicators. Each flag detects a specific disclosure pattern, with thresholds calibrated from empirical distributions across 4147 firm-quarters.

RF₁ (Overconfident Q&A) detects excessive optimism in spontaneous Q&A responses (Section 2.6, Management Overconfidence). The flag triggers when Q&A optimism exceeds the 90th percentile of the empirical distribution ($E_{optimism} > 0.1377$), firing at a 10.0% rate.

RF₂ (Disclosure Complexity) captures asymmetric complexity between channels (Section 2.6, Cognitive Load). Using Shannon entropy $H(E)$ (Equation (12)), which measures the complexity or scatter of emotional expression across the 28 dimensions, the flag triggers when Q&A emotional entropy is more than three times MD&A entropy ($H(Q\&A) / H(MD\&A) > 3.0$, 85th percentile), firing at 16.8%. A high ratio indicates that executives express substantially more diffuse or unfocused emotions in their live Q&A responses compared to the focused emotional tone of their prepared MD&A, consistent with cognitive load from narrative maintenance.

RF₃ (Trust Manipulation) detects artificially elevated trust-building language in Q&A relative to mandatory filings (Section 2.6, Strategic Communication). The trust composite averages five GoEmotions categories (admiration, approval, caring, gratitude, love). The flag triggers when the trust differential ($\Delta E^{trust} = E_{Q\&A}^{trust} - E_{MD\&A}^{trust}$) exceeds 0.10 (99th percentile), firing at 1.6%.

RF₄ (Mixed-Emotion Surge) identifies diffuse cognitive states where more than two emotions simultaneously exceed an activation threshold of 0.15 (Section 2.6, Cognitive Load). This pattern, inconsistent with focused communication, fires at 8.4%.
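The four trigger rules can be collected into a single function, sketched below under several stated assumptions. Emotion scores are held in dictionaries keyed by GoEmotions label; RF₁ reads the `optimism` score of the Q&A vector; RF₄ is evaluated on the Q&A vector (the text does not say which document it applies to); and entropy is computed after renormalizing the scores. All names are illustrative.

```python
import math
from typing import Dict

TRUST_LABELS = ("admiration", "approval", "caring", "gratitude", "love")


def entropy_bits(scores: Dict[str, float]) -> float:
    """Shannon entropy (Equation (12)) of the renormalized emotion scores."""
    total = sum(scores.values())
    probs = [v / total for v in scores.values() if v > 0]
    return -sum(p * math.log2(p) for p in probs)


def trust_composite(scores: Dict[str, float]) -> float:
    """Mean of the five trust-related categories; missing labels count as 0."""
    return sum(scores.get(label, 0.0) for label in TRUST_LABELS) / len(TRUST_LABELS)


def red_flags(mdna: Dict[str, float], qa: Dict[str, float]) -> Dict[str, bool]:
    """Evaluate RF1-RF4 with the thresholds quoted in the text."""
    h_mdna, h_qa = entropy_bits(mdna), entropy_bits(qa)
    return {
        "RF1": qa.get("optimism", 0.0) > 0.1377,
        "RF2": h_mdna > 0 and h_qa / h_mdna > 3.0,
        "RF3": trust_composite(qa) - trust_composite(mdna) > 0.10,
        # Assumption: the mixed-emotion count is taken over the Q&A vector.
        "RF4": sum(1 for v in qa.values() if v > 0.15) > 2,
    }
```

The guard on `h_mdna` avoids a division by zero when the MD&A vector is concentrated on a single emotion; how the original pipeline handles that case is not stated.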

Table 3 summarizes the four patterns with their calibrated thresholds, trigger rates, and validation status.

The red flag penalty score, $S_{RF}$, quantifies the combined impact of active red flags on disclosure consistency. It equals 1 when no flag is active and decreases toward 0 as more flags fire:

S_{RF} = \exp\left( -\sum_{k=1}^{K} \lambda_{k} \cdot \mathbb{1}_{RF_{k}} \right)

(18)

where K is the number of red flag patterns ($K = 4$), $\lambda_{k}$ is the calibrated weight for pattern k (as specified in Table 3), and $\mathbb{1}_{RF_{k}}$ is an indicator that equals 1 when the k-th flag is active and 0 otherwise, so that only active flags contribute a penalty.
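Equation (18) is a one-line computation once the flags and weights are known. The sketch below assumes the weights are the values reported alongside Equation (19) (0.08 for RF₁ and RF₄, 0.02 for RF₂ and RF₃); Table 3 is the authoritative source, and other values can be passed in.

```python
import math
from typing import Dict, Optional

# Assumption: default weights reuse the values quoted with Equation (19).
DEFAULT_LAMBDAS = {"RF1": 0.08, "RF2": 0.02, "RF3": 0.02, "RF4": 0.08}


def red_flag_penalty(
    flags: Dict[str, bool],
    lambdas: Optional[Dict[str, float]] = None,
) -> float:
    """Equation (18): exp of minus the summed weights of the active flags."""
    weights = DEFAULT_LAMBDAS if lambdas is None else lambdas
    active_weight = sum(weights[name] for name, active in flags.items() if active)
    return math.exp(-active_weight)
```

With these default weights the score stays between $e^{-0.2} \approx 0.82$ (all four flags active) and 1 (none active).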

Of the four flags, RF₁ and RF₄ are validated SUE predictors (both $p < 0.001$): firms with excessive Q&A optimism miss expectations by 0.84 standard deviations ($t = -3.90$), while those with diffuse emotional states miss by 0.93 standard deviations ($t = -3.93$). RF₂ does not predict SUE directly but is elevated 2.5× before material impairments (Section 4.6), serving as an impairment-monitoring indicator. RF₃ lacks statistical power in our large-cap sample due to its low trigger rate (1.6%). See Appendix A.3, Appendix A.4 and Appendix A.5 for extended theoretical foundations, calibration details, and retention rationales for RF₂ and RF₃.

Figure 2 presents the correlation structure between red flag indicators and financial quality metrics. The heatmap reveals that RF₂ correlates with total accruals ($r = -0.052$) and net income ($r = +0.087$), consistent with the classic earnings management pattern where firms report strong income while exhibiting poor cash flow quality.

Final CDEC Score:

The three components, directional alignment ($S_{cos}$), distributional similarity ($S_{JS}$), and behavioral penalties ($S_{RF}$), combine into a single consistency score. The weights were determined through empirical optimization on the training sample (pre-2020), maximizing the CDEC-SUE correlation while maintaining interpretability:

\mathrm{CDEC}_{i,t} = \alpha \cdot S_{cos} + \beta \cdot S_{JS} - \lambda_{1} \cdot RF_{1} - \lambda_{2} \cdot RF_{2} - \lambda_{3} \cdot RF_{3} - \lambda_{4} \cdot RF_{4}

(19)

The $\alpha > \beta$ weighting reflects the relative predictive power of each component: cosine similarity (directional alignment) shows stronger association with SUE ($p = 0.014$) than JS divergence (distributional intensity, $p = 0.098$). The red flag weights separate validated flags ($\lambda_{1} = \lambda_{4} = 0.08$ for RF₁ and RF₄, both SUE-significant at $p < 0.001$) from retained-but-unvalidated flags ($\lambda_{2} = \lambda_{3} = 0.02$ for RF₂ and RF₃). With these weights ($\alpha = 0.5$, $\beta = 0.3$), CDEC scores range from 0.0 to 0.8; scores above 0.6 indicate adequate consistency, while scores below 0.5 warrant scrutiny.
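Putting the reported weights into Equation (19) gives the following sketch. Note that Equation (19) applies the red flag weights as linear penalties on the binary flag indicators rather than through the exponential form of Equation (18). The label `intermediate` for scores between 0.5 and 0.6 is a placeholder introduced here, since the text names only the two outer bands.

```python
from typing import Dict

LAMBDAS = {"RF1": 0.08, "RF2": 0.02, "RF3": 0.02, "RF4": 0.08}


def cdec_score(
    s_cos: float,
    s_js: float,
    flags: Dict[str, bool],
    alpha: float = 0.5,
    beta: float = 0.3,
) -> float:
    """Equation (19): weighted similarities minus linear red flag penalties."""
    penalty = sum(LAMBDAS[name] for name, active in flags.items() if active)
    return alpha * s_cos + beta * s_js - penalty


def cdec_band(score: float) -> str:
    """Interpretation bands from the text; 'intermediate' is a hypothetical label."""
    if score > 0.6:
        return "adequate"
    if score < 0.5:
        return "scrutiny"
    return "intermediate"
```

The maximum of 0.8 is reached when both similarities equal 1 and no flag is active, since $\alpha + \beta = 0.8$.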

3.7. Family V: Synthesis and Financial Health Indicator

Family V synthesizes features from Families I–IV into a single Financial Health Indicator (FHI) that predicts earnings outcomes. This section describes the prediction task and feature architecture (Section 3.7.1), model specification (Section 3.7.2), feature importance analysis (Section 3.7.3), and validation summary (Section 3.7.4).

3.7.1. Prediction Task and Feature Architecture

The FHI predicts whether a firm will beat or miss earnings expectations in the subsequent quarter. As demonstrated in Section 4.3, CDEC features show a statistically significant relationship with earnings surprises and contribute to earnings outcome prediction when combined with financial fundamentals.

Through systematic analysis of candidate features, we identified an optimal 26-feature set organized into five groups:

Family I Quantitative (7 features): total_accruals, net_income, total_equity, revenue, roe, roe_trend, eq_ratio
Family III CDEC Components (3 features): $S_{cos}$ (cosine similarity), $S_{JS}$ (Jensen–Shannon similarity), cdec_score
CDEC × Financial Interactions (4 features): cdec_x_roe, cdec_x_eq, cdec_x_net_income, cdec_x_revenue
Red Flags (6 features): RF₁–RF₄, $S_{RF}$, eq21_regression
Prior Quarter Analyst Signals (6 features, lagged to avoid data leakage): prior_eq_grade, prior_analyst_score, prior_analyst_confidence, prior_analyst_signal, prior_analyst_score_top5, prior_analyst_confidence_top5

Complete definitions of all 26 features are provided in Appendix E, Table A5.

The feature vector is:

{\vec{F}}_{i, t} = [{Quantitative}_{i, t}, {CDEC}_{i, t}, {\vec{I}}_{i, t}, {RedFlags}_{i, t}, {AnalystSignals}_{i, t}]

(20)

The interaction terms $\vec{I}_{i,t}$ capture CDEC × Fundamental interactions, where “fundamentals” refers to the quantitative financial metrics from Family I (ROE, earnings quality EQ, net income, and revenue). These terms test whether emotional consistency’s predictive power varies with underlying financial condition:

\vec{I}_{i,t} = [\mathrm{CDEC}_{i,t} \times \mathrm{ROE}_{i,t}, \; \mathrm{CDEC}_{i,t} \times \mathrm{EQ}_{i,t}, \; \mathrm{CDEC}_{i,t} \times \mathrm{NI}_{i,t}, \; \mathrm{CDEC}_{i,t} \times \mathrm{Revenue}_{i,t}]

(21)

These interaction effects are critical to our findings. As demonstrated in Section 3.7.3, the CDEC × ROE term is the strongest predictor, suggesting that emotional consistency becomes especially consequential when profitability fundamentals are involved.
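The four interaction terms of Equation (21) are simple products, as the sketch below shows. The output keys follow the feature names listed in Section 3.7.1; the function name is illustrative, and no scaling of net income or revenue is applied because the text does not describe any.

```python
from typing import Dict


def interaction_features(
    cdec: float, roe: float, eq: float, net_income: float, revenue: float
) -> Dict[str, float]:
    """Equation (21): CDEC multiplied by each Family I fundamental."""
    return {
        "cdec_x_roe": cdec * roe,
        "cdec_x_eq": cdec * eq,
        "cdec_x_net_income": cdec * net_income,
        "cdec_x_revenue": cdec * revenue,
    }
```

Tree-based models can in principle learn such products on their own, but supplying them explicitly makes the conditioning of consistency on fundamentals directly visible in feature importance rankings.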

3.7.2. Model Specification

We train dual calibrated Random Forest classifiers, one for beat probability ($f_{RF,B}$) and one for miss probability ($f_{RF,M}$), using the feature vector in Equation (20). Here the subscript RF denotes Random Forest, not a red flag:

P(\mathrm{Beat}_{i,t+1} \mid \vec{F}_{i,t}) = f_{RF,B}(\vec{F}_{i,t}); \quad P(\mathrm{Miss}_{i,t+1} \mid \vec{F}_{i,t}) = f_{RF,M}(\vec{F}_{i,t})

(22)

The Financial Health Indicator is obtained from the difference of these probabilities:

\mathrm{FHI}_{i,t} = P(\mathrm{Beat}_{i,t+1}) - P(\mathrm{Miss}_{i,t+1})

(23)

where $\mathrm{FHI} \in [-1, +1]$ represents the Financial Health Indicator. Health Status classification uses ±0.05 thresholds: Healthy ($\mathrm{FHI} > 0.05$), Neutral ($-0.05 \leq \mathrm{FHI} \leq 0.05$), Unhealthy ($\mathrm{FHI} < -0.05$). The FHI synthesizes cross-document emotional consistency with financial fundamentals into a single diagnostic metric.
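Given the two calibrated probabilities, Equation (23) and the Health Status bands reduce to the sketch below. The probabilities are taken as inputs; fitting the two Random Forest classifiers is outside the scope of this illustration, and the function names are hypothetical.

```python
def financial_health_indicator(p_beat: float, p_miss: float) -> float:
    """Equation (23): beat probability minus miss probability, in [-1, +1]."""
    for p in (p_beat, p_miss):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_beat - p_miss


def health_status(fhi: float, threshold: float = 0.05) -> str:
    """Healthy above +0.05, Unhealthy below -0.05, Neutral in between."""
    if fhi > threshold:
        return "Healthy"
    if fhi < -threshold:
        return "Unhealthy"
    return "Neutral"
```

Because the beat and miss models are trained separately, their outputs need not sum to 1, which is why the difference rather than a single probability defines the indicator.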

The model uses a temporal train-test split with pre-2020 data for training ($n = 2272$) and 2020+ data for true out-of-sample (OOS) testing ($n = 1120$). Unless otherwise stated, reported OOS performance refers to this 2-way temporal split. A separate 3-way split (train: pre-2018, validation: 2018–2019, test: 2020+) is used only for feature importance analysis and hyperparameter tuning.

3.7.3. Feature Importance and Key Drivers

Table 4 presents the top five features by importance in the Random Forest model.

Two CDEC interaction terms appear in the top five, indicating that emotional consistency contributes predictive power beyond financial fundamentals alone. ROE and ROE Trend are more important for predicting misses, while CDEC-related features drive beat prediction, suggesting financial fundamentals detect deterioration while disclosure consistency signals positive outcomes. See Appendix B for full 26-feature rankings.

Figure 3 illustrates the Health Status outcome separation on the out-of-sample test set (2020–2025, $n = 1120$). Healthy firms ($\mathrm{FHI} > 0.05$) achieve a 62.8% beat rate compared to 35.4% for Unhealthy firms ($\mathrm{FHI} < -0.05$), indicating that the FHI provides meaningful discrimination between earnings outcome groups.

3.7.4. Validation Summary

The FHI achieves out-of-sample AUC = 0.671 on the temporal test set (2020–2025, $n = 1120$), indicating meaningful predictive power for earnings outcomes. Health Status groups show highly significant separation, validated by chi-square test ($\chi^{2} = 69.3$, $p < 0.000001$), which tests whether earnings outcome distributions differ significantly across health status categories. Healthy firms beat expectations at a 62.8% rate versus 35.4% for Unhealthy firms, a 27.4 percentage-point spread.

A key finding is that CDEC alone is not a sufficient predictor (AUC ≈ 0.51, not statistically significant). Predictive power emerges from combining CDEC with financial fundamentals and lagged analyst signals. This validates the multi-family framework: each family contributes incrementally, but synthesis in Family V is essential. The interaction term CDEC × ROE is the top feature for both Beat and Miss prediction, indicating that emotional consistency matters most when financial performance is strong. Companies with consistent disclosure and strong ROE beat expectations; inconsistent disclosure paired with weak fundamentals signals elevated miss risk.

Of the red flags included as features, only RF₁ ($r = -0.066$, $p < 0.001$) and RF₄ ($r = -0.068$, $p < 0.001$) are statistically validated predictors of SUE. RF₂ (entropy ratio) showed a non-significant positive correlation ($r = +0.018$, $p = 0.28$), indicating that disclosure complexity is not a validated predictor of earnings outcomes. RF₂ is retained as an impairment-monitoring indicator (not a SUE predictor); practitioners should use it to flag potential write-downs rather than earnings misses.

Full validation results, including ablation studies and robustness checks, are presented in Section 4.

4. Results

This section presents empirical validation of the CDEC framework as a risk-screening tool. We structure results to demonstrate the incremental value of each component: beginning with fundamentals-only baselines, progressing through text-only and CDEC-only specifications, and culminating in the combined CDEC × Fundamentals model that achieves our primary contribution. This progression establishes that emotional consistency matters most when financial stakes are high.

Sample: 102 S&P 100 companies, 2009–2025. The pre-2020 period serves as the training sample; 2020–2025 is held out for true out-of-sample testing. Sample sizes vary by data availability: $n = 4453$ firm-quarters with financial data; $n = 4147$ with valid CDEC scores (requiring matched 10-Q and earnings call transcripts); $n = 4112$ with complete red flag data; $n = 3368$ with SUE data (requiring analyst coverage); $n = 2978$ with complete analyst forecasts for regression analysis.

4.1. Model Progression: From Baseline to Combined

We evaluate out-of-sample predictive power (2020–2025 test set) for earnings outcomes (Beat vs. Miss) in Table 5, across increasingly comprehensive specifications.

Critical Finding: CDEC alone achieves AUC ≈ 0.51, which is not statistically different from random (0.50). The predictive power emerges only when CDEC is combined with financial fundamentals and their interactions. This validates the multi-family framework: each family contributes incrementally, but synthesis is essential.

Key Insight: Emotional consistency matters most when fundamentals are at stake. A profitable firm (High ROE) with consistent disclosure (High CDEC) is highly likely to beat expectations. A profitable firm with inconsistent disclosure is a major warning sign: why is the narrative fracturing if the numbers are good? This explains why the interaction term CDEC × ROE emerges as the top predictor (importance = 0.103).

4.2. Red Flag Validation

We validate each red flag by examining its correlation with key financial metrics across the full sample ($n = 4112$ firm-quarters with complete red flag data). Table 6 reports Pearson correlations between the four red flag indicators and financial quality measures (earnings quality ratio, total accruals, net income) as well as SUE. Statistically significant correlations indicate that the flags capture meaningful financial deterioration patterns.

Two flags (RF₁ and RF₄) are validated against the primary SUE outcome (both $p < 0.001$) and are used directly in earnings surprise prediction. Two flags (RF₂ and RF₃) serve complementary monitoring roles with different validation bases: RF₂ predicts material impairments (2.5× elevation, Section 4.6) but does not predict SUE ($p = 0.28$); RF₃ does not achieve statistical significance against any outcome and is retained solely for monitoring. This distinction should inform practitioner use: RF₁/RF₄ triggers warrant earnings-related review, while RF₂ triggers warrant impairment-monitoring workflows.

Group Comparisons: Although RF₂ does not predict SUE directly, it detects the classic earnings management pattern: RF₂-flagged firms report 45% higher net income ($2.2 B vs. $1.5 B) while simultaneously exhibiting ∼4.2× larger negative accruals ($4.5 B vs. $1.1 B). This divergence between reported income and cash flow quality is precisely the pattern Sloan (1996) identified as unsustainable. RF₂’s value lies in impairment prediction (Section 4.6) rather than earnings surprise forecasting.

4.3. SUE Prediction

We model next-period earnings surprises as:

\mathrm{SUE}_{i,t+1} = \theta_{0} + \theta_{1} \cdot \mathrm{CDEC}_{i,t} + \theta_{2} \cdot \mathrm{EQ}_{i,t} + \theta_{3} \cdot (\mathrm{CDEC}_{i,t} \times \mathrm{EQ}_{i,t}) + \varepsilon_{i,t+1}

(24)

where $\mathrm{SUE}_{i,t+1}$ represents standardized unexpected earnings in period $t + 1$ (positive values indicate beating analyst expectations; negative values indicate missing expectations); $\mathrm{CDEC}_{i,t}$ denotes Cross-Document Emotion Consistency (Equation (19)); $\mathrm{EQ}_{i,t}$ denotes the Earnings Quality ratio (Equation (8)); and $\theta_{0}, \theta_{1}, \theta_{2}, \theta_{3}$ are estimated regression coefficients. Following Livnat and Mendenhall (2006) and Doyle et al. (2006), we estimate forecast standard deviation from the high–low range of analyst estimates, treating that range as a 95% confidence interval (±1.96 standard deviations): $\sigma_{forecast} = (EPS_{high} - EPS_{low}) / 3.92$. This provides a standardized measure of earnings surprises relative to analyst uncertainty.
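The range-based standard deviation can be sketched as below. The divisor 3.92 comes from the text; the SUE numerator (actual EPS minus the consensus estimate) is the conventional definition and is an assumption here, because this section does not spell it out. Function and parameter names are illustrative.

```python
def forecast_sigma(eps_high: float, eps_low: float) -> float:
    """Range-based forecast standard deviation: (high - low) / 3.92."""
    return (eps_high - eps_low) / 3.92


def standardized_unexpected_earnings(
    actual_eps: float, consensus_eps: float, eps_high: float, eps_low: float
) -> float:
    """Assumed SUE definition: (actual - consensus) / forecast sigma."""
    sigma = forecast_sigma(eps_high, eps_low)
    if sigma <= 0:
        raise ValueError("analyst range must be positive to standardize")
    return (actual_eps - consensus_eps) / sigma
```

A firm covered by a single analyst, or by analysts who all agree, has a zero range and cannot be standardized this way, which is consistent with the smaller sample available for SUE analysis.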

Regression Results ($n = 2978$, Table 7):

Model fit: $R^{2} = 0.012$. The modest variance explained by the model is typical for earnings surprise prediction (Livnat & Mendenhall, 2006). However, our primary interest is in coefficient significance and direction. The positive $\theta_{1}$ confirms that higher emotional consistency predicts positive earnings surprises (beating expectations), while lower consistency predicts negative surprises (missing expectations). The positive $\theta_{3}$ (CDEC × EQ interaction, $p = 0.025$) indicates that emotional consistency’s predictive power is amplified in firms with higher earnings quality, consistent with the thesis that CDEC is most informative when financial fundamentals provide a meaningful baseline.

Component Analysis ($n = 3368$, Table 8):

Cosine similarity provides the primary signal; RF₁ and RF₄ are both negatively associated with SUE (excessive optimism and emotional dispersion both predict earnings misses). JS divergence contributes marginally but does not independently drive the result.

4.4. Ablation Studies

We quantify incremental contributions via progressive feature build-up (Table 9), using the same calibrated Random Forest specification and temporal split (train: pre-2020, test: 2020+, $n = 1120$).

Finding: ROE trend features (our enhanced addition) contribute the single largest incremental lift (+0.044). CDEC components alone do not improve the AUC (−0.002), likely because raw similarity is less informative without conditioning on fundamentals. However, CDEC × financial interactions unlock the predictive value (+0.006), supporting the finding that the interaction mechanism—not raw emotional similarity—drives CDEC’s contribution. Prior quarter analyst signals provide the largest incremental gain in the final layer (+0.016). Red flags show modest standalone contribution (+0.002), consistent with their role as interaction modifiers rather than standalone predictors.

4.5. Benchmark: Sentiment Polarity vs. Emotion Granularity

To test whether the 28-dimensional emotion taxonomy contributes beyond standard sentiment polarity (Table 10), ProsusAI/FinBERT (A. H. Huang et al., 2023) (financial domain, 3-class: positive/negative/neutral) and cardiffnlp/twitter-roberta-base-sentiment (Barbieri et al., 2020) (general domain, 3-class) were run on all 4056 available firm-quarter MD&A and Q&A pairs. Model comparison uses the 3308 firm-quarter subset with complete financial fundamentals and earnings outcomes. For each model, analogous cross-document consistency features were computed (cosine similarity and Jensen–Shannon similarity of the 3-dimensional vectors) and the same calibrated Random Forest was trained with equivalent interaction terms and temporal split (train: pre-2020, test: 2020+).

The integrated CDEC framework achieves a 27.4 percentage-point healthy–unhealthy spread, approximately twice the separation of the strongest 3-class sentiment baseline tested (14.6 pp for RoBERTa, 8.7 pp for FinBERT). Collapsing 28 emotions into three sentiment classes discards the granularity needed for effective disclosure risk screening, supporting the Dimensionality Gap (Gap 1). None of the three consistency measures is predictive alone (AUC ≈ 0.50 for each), reinforcing the ablation finding that the interaction with fundamentals is the mechanism.

The general-domain RoBERTa slightly outperforms the financial-domain FinBERT for consistency detection (AUC 0.667 vs. 0.662), suggesting that broad language understanding may matter more than financial sentiment specialization for cross-document comparison. FinBERT is optimized for sentiment polarity classification, a different task from detecting cross-channel emotional patterns.

To illustrate the practical significance, consider a risk analyst screening firms before earnings. The CDEC framework’s 27.4 pp spread means the “Healthy” group beats expectations at 62.8% versus 35.4% for the “Unhealthy” group, a gap wide enough to be operationally relevant for screening applications. As an illustrative implementation scenario (not a validated portfolio backtest), for a $1 B portfolio screened quarterly, this spread would translate to approximately 77 basis points of avoided downside per earnings cycle. By contrast, RoBERTa’s 14.6 pp spread would produce approximately 30 basis points per cycle, and FinBERT’s 8.7 pp spread approximately 25 basis points, both below the level of the CDEC-based screen.

4.6. 8-K Analysis: Anticipatory Obfuscation and Early Warning Value

Based on SEC Form 8-K reporting guidelines, corporate events fall into two categories: events that are foreseeable by management (anticipatable) and those discovered through audit processes (such as restatements). We test whether CDEC detects anticipatory obfuscation before foreseeable adverse events. The key insight is that CDEC can only detect obfuscation when managers anticipate bad news. Anticipatable events include Item 2.04 (debt covenants), Item 2.05 (restructuring), and Item 2.06 (impairments). Discovered events like restatements (Item 4.02) should show no elevation.

We measure RF₂ rates in the quarterly CDEC analysis preceding 8-K filings (Table 11). For each 8-K event, we examine the most recent prior quarter’s CDEC output to determine whether RF₂ was flagged before the event announcement.

Key Finding: RF₂ (Disclosure Complexity) is elevated 2.5× before material impairments (41.2% vs. 16.6% baseline, $p < 0.001$). While this result is adequately powered ($n = 34$ exceeds the minimum $n = 18$ for 80% power at this effect size), the limited sample means this finding should be treated as suggestive evidence warranting replication in broader samples rather than a definitive early-warning system. The pattern is consistent with managers anticipating impairments and giving more emotionally diffuse Q&A responses relative to a focused MD&A (the condition that triggers RF₂), providing preliminary evidence of 3–6 month advance warning before material write-downs.

RF₂ is included in the CDEC score (Equation (19)) with a reduced weight ($\lambda_{2} = 0.02$), reflecting its role as a monitoring indicator rather than a SUE predictor. Its primary value lies in impairment detection rather than earnings surprise prediction, hence the lower weight relative to RF₁ and RF₄ ($\lambda_{1} = \lambda_{4} = 0.08$). For practitioners, RF₂ triggers should prompt impairment-monitoring workflows rather than earnings miss alerts.

Contrast with Restatements: Item 4.02 (restatements) showed no RF₂ elevation because restatements are discovered by auditors, not anticipated by management. Without anticipation, there is no preemptive obfuscation to detect. However, if management suspects accounting irregularities before the formal filing, CDEC’s SUE-calibrated framework may capture resulting shifts in disclosure patterns across subsequent earnings calls—a direction for future research.

Economic Magnitude: RF₂ elevation provides 3–6 month advance warning before Item 2.06 asset impairment announcements. For a hypothetical $5B portfolio, the 2.5× lift in detecting impairments (41.2% vs. 16.6% baseline) could yield estimated savings of approximately $15M against screening costs of less than $31k (see Appendix C.3 for ROI calculation). The key insight is that CDEC’s value proposition is screening efficiency, not perfect prediction: narrowing analyst attention to the 19% of firms most likely to experience adverse events.

4.7. Health Status Classification

The Financial Health Indicator (FHI) classifies firms into Healthy, Neutral, and Unhealthy categories based on earnings outcome probabilities ($n = 1120$ out-of-sample, 2020–2025; Table 12).

4.8. Temporal Out-of-Sample Performance

To assess temporal stability, we evaluate the FHI performance by year across the out-of-sample period (2020–2025). Table 13 reports annual AUC values, revealing how the model adapts to different market regimes, including the COVID-19 disruption (2020), recovery (2021), and subsequent normalization (2022–2025).

Interpretation: Performance varies by year, with 2023 showing the strongest discrimination (AUC = 0.77) and 2020 showing the weakest (AUC = 0.58), plausibly because COVID-19 disrupted normal disclosure patterns. The model recovers post-2020, suggesting regime sensitivity but long-term stability.

5. Discussion and Implications

The empirical validation demonstrates two key findings: (1) CDEC successfully identifies firms with deteriorating financial quality, and (2) CDEC predicts earnings surprises with statistical significance ($\theta_{1} = +2.31$, $p < 0.001$, $R^{2} = 1.2\%$). This section interprets these findings through two lenses: theoretical contributions to behavioral finance and disclosure quality literature (Section 5.1), and practical applications for risk screening, audit, and regulatory oversight (Section 5.2). We conclude by acknowledging limitations and proposing extensions, including vocal affect analysis and expert validation of emotion labels (Section 5.3).

5.1. Theoretical Contributions

The framework makes several theoretical contributions to financial disclosure analysis. First, cross-document emotion consistency is formalized as a measurable construct (Equations (15)–(19)), extending beyond single-document analysis to capture strategic communication patterns across disclosure channels. Second, the results demonstrate that specific red flag patterns (RF₁ Overconfident Q&A and RF₄ Mixed-Emotion Surge) predict earnings surprises (both $p < 0.001$), while RF₂ (Disclosure Complexity) provides complementary advance warning of material impairments (2.5× elevation, Section 4.6). Together, these patterns identify firms with deteriorating financial quality through distinct mechanisms, providing empirical support for impression management theory in financial contexts. X. Huang et al. (2014) documented that managers strategically adjust disclosure tone to influence investor perceptions; the cross-document consistency approach extends this by detecting tone manipulation that manifests differently across mandatory and voluntary channels. Third, emotional granularity captured through 28-dimensional vectors (Equation (10)) enables detection of subtle patterns invisible to binary sentiment analysis.

5.2. Practical Applications

Risk Screening and Due Diligence: The CDEC framework serves as a screening tool that complements existing financial analysis and machine learning approaches to fraud detection (Bao et al., 2020) with disclosure-based behavioral signals. The framework achieves modest but statistically significant discriminatory power (AUC = 0.671), consistent with the inherent difficulty of earnings prediction where even sophisticated models explain only 1–3% of variance. Red flag detection (Table 3 and Equation (24)) identifies firms exhibiting patterns consistent with earnings management: reporting higher net income ($695 M higher, $2.2 B vs. $1.5 B) while simultaneously showing worse cash flow quality ($3.4 B more negative accruals). This pattern provides actionable signals for internal audit teams monitoring disclosure quality, credit analysts assessing default risk, M&A due diligence teams evaluating acquisition targets, and regulatory oversight bodies focusing examinations on high-risk firms.

Adverse Event Prediction: As demonstrated in Section 4.6, RF₂ shows preliminary evidence of 3–6 month advance elevation before material impairments (n = 34; replication in broader samples is warranted), suggesting potential for proactive risk management rather than reactive crisis response.

Earnings Miss Prediction (Validated): As validated in Section 3.7.4, the FHI predicts earnings outcomes rather than analyst grades, with significant Health Status separation. The optimal feature set combines CDEC × Financial interactions with emotion features (Section 3.7.3), positioning CDEC as a risk-screening tool for identifying firms likely to disappoint earnings expectations rather than a grading tool for investment recommendations.

Economic Magnitude: The SUE regression coefficient (θ₁ = +2.31) implies that a one-standard-deviation increase in CDEC is associated with a +2.31 standard deviation increase in SUE. For the median S&P 100 firm-quarter (actual EPS ≈ $0.90, implied forecast dispersion ≈ $0.07), this corresponds to an earnings surprise of approximately $0.15 per share, roughly 17% of median quarterly EPS. The benchmark comparison (Section 4.5) provides a complementary perspective: the integrated CDEC framework produces a 27.4 percentage point healthy–unhealthy spread, compared to 14.6 pp for RoBERTa and 8.7 pp for FinBERT. As an illustrative implementation scenario, for a screening universe of 100 large-cap firms evaluated quarterly, the CDEC framework correctly identifies 63% of the “healthy” group as subsequent earnings beats versus 35% of the “unhealthy” group. Under the FinBERT baseline, this separation narrows to 54% versus 46%, approaching chance levels. These figures represent extrapolations from the spread metric rather than validated portfolio backtests, but they indicate economically meaningful separation for screening applications. Appendix C.3 provides a more detailed cost–benefit analysis.

Limitations of Trading Applications: Our validation reveals that CDEC, despite predicting earnings surprises, does not generate trading alpha. This finding is consistent with multi-factor asset pricing models (Fama & French, 2015) suggesting market efficiency: emotional consistency signals are either already priced or too weak to overcome transaction costs in an active trading strategy (Treynor & Black, 1973). CDEC’s value lies in risk screening, not alpha generation.

Practitioner Implementation: For regulators, audit committees, and enterprise risk functions, CDEC offers a scalable surveillance tool that complements traditional financial ratio analysis. A firm flagged by elevated RF₂ or low CDEC consistency warrants deeper examination, not automatic intervention, consistent with the risk-screening purpose emphasized throughout this paper.

5.3. Limitations

Several limitations bound our findings:

Sample scope: The S&P 100 sample represents a deliberate sector-representative design following the Fin-ALICE framework (McCarthy & Alaghband, 2024), providing the densest analyst coverage and most reliable SUE measurement. However, smaller firms may exhibit different disclosure dynamics: less sophisticated investor relations teams, sparser analyst coverage, and shorter Q&A sessions could affect CDEC’s discriminatory power. Extending CDEC to mid- and small-cap firms is a natural next step that the Fin-ALICE sector-level infrastructure supports.
Cross-sectional vs. temporal: CDEC’s predictive power attenuates under firm fixed effects ($p = 0.248$), indicating that the signal is cross-sectional rather than within-firm temporal. The framework is accordingly positioned as a screening tool for comparing firms at a point in time, not for tracking individual firms over time.
Cross-industry comparability: The framework assumes emotional expressions are comparable across firms and industries, which may not hold universally.
Threshold stability: Weights in Equation (19) are empirically determined on historical data and may require periodic recalibration.
Taxonomy coverage: The 28-emotion taxonomy, while comprehensive, may not capture all relevant emotional nuances in financial communication.
Weak supervision bias: Fine-tuning relies on base GoEmotions labels, potentially propagating biases from Reddit-trained representations.
Technology sector anomaly: The inverse CDEC-SUE relationship in technology ($t = -2.04$) requires sector-specific calibration.

5.4. Future Extensions

Future work could incorporate vocal affect analysis following the methodology of Mayew and Venkatachalam (2012), detecting vocal–textual emotional divergence (optimistic words delivered with stressed vocal characteristics). Additionally, expert financial analysts could validate a subset of emotion labels to measure alignment between soft labels and human judgment. Cross-market validation on non-U.S. disclosures would test framework generalizability.

5.5. Robustness Checks

The results remain statistically significant under time fixed effects ($p = 0.001$), heteroskedasticity-robust standard errors ($p = 0.005$), and firm-clustered standard errors ($p = 0.047$), indicating that the findings are not driven by common macro shocks or serial correlation.

Under the more demanding two-way fixed-effects specification (firm and time), statistical significance becomes marginal ($p = 0.098$), and the firm fixed-effects specification alone yields non-significance ($p = 0.248$). Because firm fixed effects absorb all time-invariant cross-sectional differences, this attenuation suggests that CDEC’s predictive power arises primarily from variation across firms rather than from within-firm changes over time.

These results indicate that CDEC captures relatively stable firm-level differences in disclosure consistency rather than within-firm temporal shifts, which is an important design finding. Accordingly, CDEC is positioned as a cross-sectional screening tool for comparing one firm’s disclosure quality against its peers, rather than a within-firm monitoring instrument. Future work exploring within-firm dynamics (e.g., changes in management team or disclosure strategy) may recover a temporal signal. See Appendix B.3 for full robustness tables.

5.6. Practitioner Implementation Guide

For practitioners seeking to deploy CDEC in risk management workflows, Table 3 provides validated thresholds with recommended actions. RF₁ and RF₄ triggers warrant analyst review (both SUE-validated at $p < 0.001$), while RF₂ triggers should prompt impairment monitoring. Multiple red flags (≥2) warrant priority escalation; low CDEC scores (<0.60) warrant portfolio risk review. A detailed implementation workflow, including tiered response protocols and sector-specific calibration, is provided in Appendix D.

6. Conclusions

This paper presents Cross-Document Emotion Consistency (CDEC), operationalizing emotional alignment across financial disclosure channels as a measurable risk signal. By exploiting the legal and cognitive asymmetry between prepared filings and spontaneous disclosure, the framework detects patterns of strategic impression management that manifest differently across mandatory (MD&A) and voluntary (Q&A) channels.

The central finding is that CDEC alone is not predictive, but when interacting with financial fundamentals, the integrated framework achieves substantially greater risk separation than the strongest 3-class sentiment baseline tested (Section 4.5). The interaction mechanism is the contribution: emotion granularity enables detection of disclosure risk that only matters when financial stakes are high.

The 28-dimensional emotion taxonomy provides granularity that standard sentiment polarity cannot match. Benchmark comparison against FinBERT and RoBERTa indicates that collapsing emotions to three sentiment classes discards the information needed for effective risk screening (Section 4.5). Among the four behavioral red flags, two (RF₁, RF₄) are validated against the primary earnings outcome, while RF₂ provides suggestive advance warning of material impairments, and RF₃ serves a monitoring role (Section 4.2).

The framework achieves modest but statistically significant discriminatory power, consistent with the inherent difficulty of earnings prediction. It is designed for risk screening rather than trading signal generation. The S&P 100 sector-representative sample, following the Fin-ALICE framework (McCarthy & Alaghband, 2024), establishes the construct under favorable data conditions; extending to mid- and small-cap firms, incorporating vocal affect analysis following Mayew and Venkatachalam (2012), expert validation of emotion labels, and cross-market generalization to non-U.S. disclosures are natural next steps.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jrfm19040251/s1, S1: FinGoEmotion Training Corpus Details; S2: FinGoEmotion Domain Adaptation Validation; S3: Technical Implementation Details.

Author Contributions

Conceptualization, S.M. and G.A.; methodology, S.M.; software, S.M.; validation, S.M. and G.A.; formal analysis, S.M.; investigation, S.M.; resources, S.M.; data curation, S.M.; writing—original draft preparation, S.M.; writing—review and editing, S.M. and G.A.; visualization, S.M.; supervision, G.A.; project administration, G.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to licensing restrictions from the commercial data provider. The data and research project are made available for academic use only, subject to the terms of the original data license agreement.

Acknowledgments

The authors acknowledge the University of Colorado Denver for providing computational resources and research support. During the preparation of this manuscript, the authors used Claude (Anthropic) for the purposes of code generation, data analysis pipeline development, and editorial review. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AUC	Area Under the Curve
CDEC	Cross-Document Emotion Consistency
EQ	Earnings Quality
FHI	Financial Health Indicator
JSD	Jensen–Shannon Divergence
KL	Kullback–Leibler
MD&A	Management’s Discussion and Analysis
NLP	Natural Language Processing
OCF	Operating Cash Flow
OOS	Out-of-Sample
PEAD	Post-Earnings Announcement Drift
Q&A	Question and Answer
RF	Red Flag
ROE	Return on Equity
SEC	Securities and Exchange Commission
SOX	Sarbanes–Oxley Act
SUE	Standardized Unexpected Earnings

Appendix A. Mathematical Derivations

Appendix A.1. Emotion Vector Normalization

Emotion vectors are normalized to probability distributions before computing the JS divergence:

p_{j} = \frac{e_{j} + \epsilon}{\sum_{k=1}^{28} (e_{k} + \epsilon)}

(A1)

where $\epsilon = 10^{-10}$ prevents division by zero.
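As a minimal sketch of Equation (A1), the normalization can be written in a few lines of Python with NumPy. The function name `normalize_emotions` is illustrative and is not taken from the paper's pipeline; the input is assumed to be a vector of 28 non-negative emotion scores.

```python
import numpy as np

EPS = 1e-10  # smoothing constant epsilon from Equation (A1)

def normalize_emotions(e, eps=EPS):
    """Map a 28-dimensional emotion vector to a probability distribution."""
    e = np.asarray(e, dtype=float)
    shifted = e + eps               # e_j + eps keeps every entry strictly positive
    return shifted / shifted.sum()  # divide by sum_k (e_k + eps)

# An all-zero vector becomes uniform instead of triggering a division by zero.
p_zero = normalize_emotions(np.zeros(28))
```

Because every entry stays strictly positive, the logarithms used in entropy and divergence calculations remain finite.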

Appendix A.2. Red Flag Pattern Definitions (Calibrated Thresholds)

Thresholds were calibrated using empirical distributions from 4147 S&P 100 firm-quarter observations (see Table 3 in main paper for validation details).

RF₁: Overconfident Q&A Pattern:

RF_{1} = \mathbf{1}\left[E_{Q\&A}^{\mathrm{optimism}} > 0.1377\right]

(A2)

where $E_{Q\&A}^{\mathrm{optimism}}$ is the GoEmotions optimism probability in the Q&A section. The threshold corresponds to the 90th percentile of the Q&A optimism distribution, triggering at a 10.0% rate.

RF₂: Disclosure Complexity Pattern:

RF_{2} = \mathbf{1}\left[\frac{H(E_{Q\&A})}{H(E_{MD\&A})} > 3.0\right]

(A3)

where $H(E) = -\sum_{j=1}^{28} p_{j} \log_{2}(p_{j})$ is the Shannon entropy across the 28-emotion taxonomy. The threshold of 3.0 (85th percentile) triggers at a 16.8% rate.

RF₃: Trust Manipulation Pattern:

RF_{3} = \mathbf{1}\left[\left(E_{Q\&A}^{\mathrm{trust}} - E_{MD\&A}^{\mathrm{trust}}\right) > 0.10\right]

(A4)

where trust is constructed as a composite of five GoEmotions labels:

E_{doc}^{\mathrm{trust}} = \frac{1}{5}\left(E_{doc}^{\mathrm{admiration}} + E_{doc}^{\mathrm{approval}} + E_{doc}^{\mathrm{caring}} + E_{doc}^{\mathrm{gratitude}} + E_{doc}^{\mathrm{love}}\right)

(A5)

The calibrated threshold of 0.10 (99th percentile) triggers at a 1.6% rate.

RF₄: Mixed Emotion Surge Pattern:

RF_{4} = \mathbf{1}\left[\left|\{j : p_{j} > 0.15,\ j \in Q\&A\}\right| > 2\right]

(A6)

The flag triggers when more than two emotions are simultaneously active (8.4% rate), indicating a diffuse cognitive state inconsistent with focused communication.
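The four definitions in Equations (A2)–(A6) can be sketched together in Python. This is an illustration under stated assumptions rather than the authors' implementation: the label ordering follows the public GoEmotions taxonomy and may differ from the fine-tuned model's output order, entropy and the RF₄ count are computed on the Equation (A1) normalized vector, and the function names and example vectors are hypothetical.

```python
import numpy as np

# GoEmotions taxonomy (27 emotions + neutral); the ordering is an assumption.
LABELS = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]
IDX = {name: i for i, name in enumerate(LABELS)}
TRUST = ["admiration", "approval", "caring", "gratitude", "love"]  # Equation (A5)
EPS = 1e-10

def normalize(e):
    """Equation (A1): smooth and rescale to a probability distribution."""
    e = np.asarray(e, dtype=float) + EPS
    return e / e.sum()

def entropy_bits(e):
    """Shannon entropy (base 2) of the normalized emotion vector."""
    p = normalize(e)
    return float(-(p * np.log2(p)).sum())

def trust(e):
    """Mean of the five trust-related label scores."""
    return float(np.mean([e[IDX[k]] for k in TRUST]))

def red_flags(e_qa, e_mda):
    """Evaluate RF1-RF4 with the calibrated thresholds from Appendix A.2."""
    e_qa = np.asarray(e_qa, dtype=float)
    e_mda = np.asarray(e_mda, dtype=float)
    p_qa = normalize(e_qa)
    return {
        "RF1": bool(e_qa[IDX["optimism"]] > 0.1377),
        "RF2": bool(entropy_bits(e_qa) / entropy_bits(e_mda) > 3.0),
        "RF3": bool(trust(e_qa) - trust(e_mda) > 0.10),
        "RF4": bool(int((p_qa > 0.15).sum()) > 2),
    }

# Hypothetical vectors: a concentrated MD&A profile and a diffuse Q&A profile.
e_mda = np.zeros(28)
e_mda[IDX["neutral"]], e_mda[IDX["approval"]] = 0.98, 0.02
e_qa = np.zeros(28)
for name in ("optimism", "approval", "curiosity", "neutral"):
    e_qa[IDX[name]] = 0.25
flags = red_flags(e_qa, e_mda)
```

In this made-up example the diffuse Q&A profile trips RF₁, RF₂, and RF₄, while the trust gap stays below the RF₃ threshold.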

Appendix A.3. Red Flag Theoretical Foundations and Interpretations

RF₁ (Overconfident Q&A): Based on Malmendier and Tate (2005, 2015) and Hirshleifer et al. (2012), this flag detects excessive optimism in spontaneous Q&A responses, signaling management overconfidence. Unlike textual negativity (which may reflect transparency), excessive positivity predicts earnings disappointment. Firms with excessive Q&A optimism miss analyst expectations by 0.84 std ($t = -3.90$, $p < 0.001$). Threshold calibrated to 90th percentile of Q&A optimism distribution. Weight reflects strong empirical validation for SUE prediction.

RF₂ (Disclosure Complexity): Based on Vrij et al. (2006) cognitive load theory, Shannon (1948) entropy, and Campbell et al. (2014) risk factor analysis, this flag identifies complexity from two sources: (1) structural business complexity requiring scattered emotional expression (utilities, financials), and (2) cognitive load from narrative maintenance difficulty. A high entropy ratio correlates with larger negative accruals (cash flow stress). Sector variation: Utilities 56.5%, Financials 44.7%, and Technology 9.5%. Threshold set at 85th percentile.

RF₃ (Trust Manipulation): Based on Cohen et al. (2013, 2020), who found “casting the call” (selective analyst inclusion) to be associated with a 14% increase in the restatement likelihood per standard deviation of favoritism, this flag detects artificial relationship-building in Q&A relative to measured mandatory disclosure. A low trigger rate (99th percentile threshold) captures extreme cases.

RF₄ (Mixed Emotion Surge): Based on Vrij et al. (2006) cognitive load theory, this flag identifies diffuse cognitive state where >2 emotions simultaneously exceed the activation threshold, indicating scattered attention inconsistent with focused response. Strongly significant predictor of negative earnings surprises: firms miss expectations by 0.93 std ($t = -3.93$, $p < 0.001$).

Appendix A.4. Threshold Calibration Notes

Original thresholds derived from general deception literature assumed emotion magnitudes typical of conversational or social media text. The GoEmotions analysis of 4147 financial filing observations (102 S&P 100 firms, 2009–2025) revealed a substantially lower baseline emotional expression in financial text compared to conversational norms. Calibrated thresholds reflect empirical distributions while preserving theoretical foundations.

Empirical validation identified two red flags with significant SUE predictive power: RF₁ (Overconfident Q&A) and RF₄ (Mixed Emotion Surge) both predict earnings misses ($p < 0.001$). RF₂ and RF₃ validate against financial metrics but lack SUE predictive power in this large-cap sample; they are retained with lower weights for potential applicability to smaller firms.

Appendix A.5. RF₂ and RF₃ Retention Rationales

RF₂ Retention Rationale: While RF₂ (Disclosure Complexity) does not predict SUE in our S&P 100 sample ($r = +0.018$, $p = 0.28$), we retain it with reduced weight ($\lambda_{2} = 0.02$) for two reasons: (1) 8-K Event Validation: RF₂ shows strong predictive power for material impairments (Section 4.6), with 2.5× elevation before Item 2.06 events ($p < 0.001$). (2) Small-Cap Generalizability: S&P 100 firms maintain sophisticated investor relations teams that may normalize disclosure complexity.

RF₃ Retention Rationale: While RF₃ (Trust Manipulation) does not achieve statistical significance in our large-cap sample ($n = 65$ triggers, 1.6% rate), we retain it based on: (1) theoretical foundation from Cohen et al. (2013, 2020); (2) statistical power limitations at the 99th percentile threshold; (3) small-cap generalizability where relationship dynamics differ. The conservative weight ($\lambda_{3} = 0.02$) ensures RF₃ contributes minimally to CDEC scores in our validated sample.

Appendix A.6. Cosine Similarity Computation Details

Computation proceeds by first constructing 28-dimensional emotion vectors $\vec{E}_{MD\&A}$ and $\vec{E}_{Q\&A}$ through weighted aggregation of sentence-level emotion classifications derived from our fine-tuned GoEmotions model (Equation (10)). The dot product $\sum_{j=1}^{28} e_{j}^{MD\&A} \times e_{j}^{Q\&A}$ captures alignment across all emotion dimensions simultaneously, with strong positive contributions when corresponding emotions co-occur and minimal contribution when emotion profiles diverge. The magnitude normalization $\sqrt{\sum_{j=1}^{28} (e_{j})^{2}}$ prevents high-intensity emotional expression from artificially inflating similarity scores.

Scores range from −1 (complete emotional opposition) to +1 (perfect emotional alignment), with expected empirical distributions for legitimate corporate disclosures typically centered between 0.5 and 0.9 based on our analysis of 4147 firm-quarters. Scores below 0.3 indicate fundamental misalignment warranting investigation.
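The computation described above can be sketched in Python as follows. The function name and the example vectors are hypothetical, and the aggregated document-level vectors are taken as given. Note that when all emotion scores are non-negative, as probabilities are, the cosine for such inputs falls between 0 and 1; the −1 end of the range is reachable only with negative components.

```python
import numpy as np

def cosine_similarity(e_mda, e_qa):
    """Cosine similarity between two document-level emotion vectors."""
    a = np.asarray(e_mda, dtype=float)
    b = np.asarray(e_qa, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)  # product of vector magnitudes
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero vector")
    return float(a @ b / denom)                    # dot product over magnitudes

# Hypothetical 28-dimensional profiles (values are made up for illustration).
mda = np.zeros(28)
mda[0], mda[1] = 0.6, 0.2
qa_scaled = 2.0 * mda        # same profile at twice the intensity
qa_disjoint = np.zeros(28)
qa_disjoint[5] = 0.4         # activity on a different emotion only

s_same = cosine_similarity(mda, qa_scaled)
s_disjoint = cosine_similarity(mda, qa_disjoint)
```

The scaled profile scores the same as an identical one, which is the magnitude-normalization property the text describes, while profiles with no shared active emotions score zero.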

Appendix B. Extended Feature Importance Analysis

Appendix B.1. Full Feature Importance Comparison

Table A1. Feature importance—beat vs. miss comparison (All 26 Features).

Rank	Feature	Beat Imp.	Miss Imp.	Diff	Direction
1	cdec_x_roe	0.103	0.104	≈	Both
2	roe	0.091	0.115	−0.024	→ Miss
3	roe_trend	0.091	0.101	−0.010	→ Miss
4	cdec_x_net_income	0.078	0.072	+0.006	→ Beat
5	eq_ratio	0.074	0.067	+0.007	→ Beat
6	total_equity	0.074	0.078	≈	Both
7	cdec_x_eq	0.069	0.072	≈	Both
8	net_income	0.067	0.065	≈	Both
9	revenue	0.062	0.065	≈	Both
10	cdec_x_revenue	0.054	0.048	+0.005	→ Beat
11	total_accruals	0.039	0.039	≈	Both
12	s_js	0.034	0.027	+0.008	→ Beat
13	eq21_regression	0.034	0.027	+0.006	→ Beat
14	s_cos	0.030	0.025	+0.005	→ Beat
15	cdec_score	0.029	0.022	+0.007	→ Beat
16	prior_analyst_confidence	0.020	0.022	≈	Both
17–26	Remaining features	<0.02	<0.02	—	—

ROE and ROE Trend are more important for predicting misses, while CDEC-related features ($S_{JS}$, cdec_score, $S_{cos}$) are more important for predicting beats.

Appendix B.2. Temporal Out-of-Sample Performance by Year

Table A2. FHI performance by year (2020–2025).

Year	N	AUC (Beat)	AUC (Miss)	Base Miss Rate
2020	140	0.592	0.563	53.6%
2021	147	0.622	0.605	44.9%
2022	187	0.713	0.685	48.7%
2023	263	0.769	0.771	55.9%
2024	194	0.635	0.634	39.2%
2025	189	0.649	0.639	43.9%
Overall	1120	0.675	0.666	48.0%

2023 shows the strongest performance (AUC 0.77), likely due to post-COVID normalization. 2020 shows the weakest (beat AUC 0.59, miss AUC 0.56), reflecting COVID regime disruption.

Appendix B.3. Robustness Analysis Details

Table A3. Robustness analysis of CDEC-SUE relationship.

Specification	N	CDEC Coef	t-Stat	p-Value	Significant
Baseline OLS	2978	+0.199	2.77	0.006	Yes
Time FE	2978	+0.237	3.24	0.001	Yes
Robust SE (HC3)	2978	+0.199	2.82	0.005	Yes
Clustered SE (firm)	2978	+0.199	1.99	0.047	Yes
Two-way FE	2978	+0.130	1.65	0.098	Marginal
Firm FE	2978	+0.088	1.15	0.248	No

All specifications use $n = 2978$ firm-quarters with valid SUE and CDEC scores. Coefficients are standardized.

Appendix C. Power Analysis and Robustness Details

Appendix C.1. Statistical Power Analysis

The minimum sample size for detecting RF₂ elevation in a one-sample proportion test follows Fleiss et al. (2003):

n = \frac{\left[Z_{\alpha} \sqrt{p_{0}(1 - p_{0})} + Z_{\beta} \sqrt{p_{1}(1 - p_{1})}\right]^{2}}{(p_{1} - p_{0})^{2}}

(A7)

where $Z_{\alpha} = 1.645$ (one-sided $\alpha = 0.05$), $Z_{\beta} = 0.84$ (80% power), $p_{0} = 0.166$ (baseline RF₂ rate), and $p_{1}$ is the observed rate.

Table A4. Power analysis results.

Item	Observed Rate	Effect Size	Min n (80%)	Actual n	Adequate?
2.06 (Impairments)	41.2%	+24.6 pp	18	34	Yes
2.05 (Restructuring)	19.7%	+3.1 pp	931	66	No
2.04 (Covenants)	12.5%	−4.1 pp	471	8	No

Item 2.06 is adequately powered ($n = 34$ exceeds minimum $n = 18$), explaining its statistical significance ($p < 0.001$). Items 2.04 and 2.05 would require hundreds of observations to detect their small effect sizes reliably.
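Equation (A7) is straightforward to evaluate directly. The sketch below uses only the constants stated above; the function name is illustrative, and rounding the result up to the next whole observation is an assumption about how the table's minimum was obtained.

```python
import math

def min_sample_size(p0, p1, z_alpha=1.645, z_beta=0.84):
    """Equation (A7): minimum n for a one-sample proportion test."""
    numerator = (
        z_alpha * math.sqrt(p0 * (1 - p0))   # null-hypothesis term
        + z_beta * math.sqrt(p1 * (1 - p1))  # alternative-hypothesis term
    ) ** 2
    return numerator / (p1 - p0) ** 2

# Item 2.06 (impairments): baseline RF2 rate 16.6%, observed rate 41.2%.
n_impairment = min_sample_size(0.166, 0.412)
n_required = math.ceil(n_impairment)  # round up to whole observations
```

For the impairment row the formula returns a little over 17, which rounds up to the minimum of 18 reported in Table A4. Because the effect size enters as a squared denominator, halving the gap between the observed and baseline rates roughly quadruples the required sample.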

Appendix C.2. Extended Robustness Discussion

Interpretation of Firm FE Attenuation: The loss of significance under firm fixed effects ($p = 0.248$) deserves careful interpretation. CDEC’s predictive power derives substantially from between-firm variation in disclosure quality, which the firm FEs absorb by design. This is consistent with CDEC functioning as a cross-sectional screening tool that identifies firms with systematically different disclosure practices, rather than detecting purely within-firm temporal deterioration.

Causal Interpretation Caveat: The attenuation under firm fixed effects raises the possibility of omitted variable bias. Firms with consistently low CDEC may differ systematically in unobserved characteristics (management quality, corporate culture, investor relations sophistication) that jointly drive both disclosure patterns and earnings outcomes. While our cross-sectional screening application remains valid for identifying which firms warrant scrutiny, the results should not be interpreted as evidence that improving the CDEC causes better earnings outcomes.

Technology Sector Anomaly: The inverse CDEC-SUE relationship in technology ($t = -2.04$, $p < 0.05$) warrants specific discussion. Technology firms may exhibit systematically different disclosure norms: rapid innovation cycles create legitimate uncertainty that elevates emotional complexity, while market expectations already discount this sector’s volatility. We recommend sector-specific threshold calibration for technology applications.

Appendix C.3. Cost–Benefit Analysis

Consider an institutional investor managing $5 B across 100 positions:

Average impairment magnitude: 15% of market cap;
Expected loss without early warning: 10 impairments × $50 M average position × 15% = $75 M;
RF₂ detection rate: 41.2% of impairments flagged in advance;
Potential loss mitigation (assuming 50% position reduction on flag): 0.412 × $75 M × 0.50 = $15.5 M saved;
Screening cost: 19.1 flagged firms × 8 analyst hours × $200/h ≈ $30,500;
Net benefit: $15.5 M − $30,500 = $15.47 M (507× ROI on screening effort).
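The arithmetic in this scenario can be reproduced with a short Python sketch. All inputs are the illustrative figures listed above, and the variable names are hypothetical. Because the sketch carries unrounded intermediates, its totals differ slightly from the rounded figures in the list.

```python
# Scenario inputs taken from the list above (illustrative, not estimated).
n_impairments = 10
avg_position = 50e6       # $5 B spread across 100 positions
impairment_loss = 0.15    # average impairment as a share of market cap
detection_rate = 0.412    # share of impairments flagged in advance by RF2
position_cut = 0.50       # assumed position reduction when a flag fires
flagged_firms = 19.1
hours_per_firm = 8
hourly_rate = 200.0

expected_loss = n_impairments * avg_position * impairment_loss  # loss without warning
loss_avoided = detection_rate * expected_loss * position_cut    # mitigated portion
screening_cost = flagged_firms * hours_per_firm * hourly_rate   # analyst time
net_benefit = loss_avoided - screening_cost
roi_multiple = net_benefit / screening_cost
```

The result is dominated by the assumed impairment count and position-reduction rule, so varying those two inputs is the natural first sensitivity check.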

Appendix D. Practitioner Implementation Workflow

Quarterly Screening: Run CDEC analysis within 48 h of earnings call transcript availability. Priority: firms with prior red flag history or sector-specific risk factors.
Tiered Response Protocol:
- Single red flag: Document and monitor; no immediate action required;
- Two red flags: Escalate to senior analyst; request additional information from investor relations;
- Three or more red flags or CDEC $< 0.50$: Immediate portfolio review; consider position reduction pending resolution.
Sector Calibration: Adjust thresholds by ±15% for sector-specific norms:
- Utilities, Financials: Higher RF₂ baseline (+15% threshold) due to structural complexity;
- Technology: Interpret with caution; consider inverse relationship;
- Healthcare: Standard thresholds apply.
Integration Points:
- Credit analysis: Combine CDEC with traditional Altman Z-score and Merton model inputs;
- M&A due diligence: Screen targets for disclosure quality before LOI;
- Internal audit: Prioritize examination of firms with deteriorating CDEC trends.
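The tiered response protocol and the sector adjustment for RF₂ can be sketched as two small Python functions. The function names, the returned strings, and the handling of firms with no flags are assumptions made for illustration; reading "+15% threshold" as multiplying the RF₂ cutoff of 3.0 by 1.15 is also an assumption.

```python
BASE_RF2_THRESHOLD = 3.0                   # entropy-ratio cutoff from Equation (A3)
HIGHER_BASELINE_SECTORS = {"Utilities", "Financials"}

def rf2_threshold(sector):
    """RF2 cutoff with the +15% adjustment for structurally complex sectors."""
    if sector in HIGHER_BASELINE_SECTORS:
        return BASE_RF2_THRESHOLD * 1.15
    return BASE_RF2_THRESHOLD              # standard threshold otherwise

def response_tier(n_red_flags, cdec_score):
    """Map a firm-quarter's screening result to the tiered response."""
    if n_red_flags >= 3 or cdec_score < 0.50:
        return "immediate portfolio review"
    if n_red_flags == 2:
        return "escalate to senior analyst"
    if n_red_flags == 1:
        return "document and monitor"
    return "no action"                     # assumption: unflagged firms need no response

tier = response_tier(n_red_flags=2, cdec_score=0.72)
```

Checking the most severe condition first means a low CDEC score escalates a firm even when it has tripped only one red flag.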

Appendix E. FHI Model Feature Definitions

Table A5. Complete feature definitions (26 Features).

Feature	Family	Definition	Source
total_accruals	I	Total accruals $TA_{i,t}$ (Section 2.2)	Financial statements
net_income	I	Net income $NI_{i,t}$	Financial statements
total_equity	I	Total stockholders’ equity	Financial statements
revenue	I	Total quarterly revenue	Financial statements
roe	I	Return on equity = NI/Total Equity	Financial statements
roe_trend	I	4-quarter ROE slope (Equation (9))	Computed
eq_ratio	I	Earnings quality $EQ = OCF / NI$ (Equation (8))	Computed
s_cos	III	Cosine similarity $S_{cos}$ (Equation (15))	CDEC pipeline
s_js	III	Jensen–Shannon similarity $S_{JS}$ (Equation (17))	CDEC pipeline
cdec_score	III	Combined CDEC score (Equation (19))	CDEC pipeline
cdec_x_roe	Interaction	$CDEC_{i,t} \times ROE_{i,t}$	Computed
cdec_x_eq	Interaction	$CDEC_{i,t} \times EQ_{i,t}$	Computed
cdec_x_net_income	Interaction	$CDEC_{i,t} \times NI_{i,t}$	Computed
cdec_x_revenue	Interaction	$CDEC_{i,t} \times Revenue_{i,t}$	Computed
RF₁	IV	Overconfident Q&A flag (Table 3)	CDEC pipeline
RF₂	IV	Disclosure Complexity flag (Table 3)	CDEC pipeline
RF₃	IV	Trust Manipulation flag (Table 3)	CDEC pipeline
RF₄	IV	Mixed Emotion Surge flag (Table 3)	CDEC pipeline
s_rf	IV	Red flag penalty score (Equation (18))	CDEC pipeline
eq21_regression	IV	Accruals regression residual	Computed
prior_eq_grade	V-Lag	Prior quarter EQ grade (A/B/C/F)	Lagged
prior_analyst_score	V-Lag	Prior quarter consensus analyst score	Lagged
prior_analyst_confidence	V-Lag	Prior quarter analyst confidence	Lagged
prior_analyst_signal	V-Lag	Prior quarter analyst signal direction	Lagged
prior_analyst_score_top5	V-Lag	Prior quarter top-5 analyst score	Lagged
prior_analyst_confidence_top5	V-Lag	Prior quarter top-5 analyst confidence	Lagged

Family V-Lag features are derived from the prior quarter’s FHI model output and analyst consensus data (Section 3.1). “Top-5” refers to the five analyst firms per sector with the highest historical return spread (buy minus sell), ranked using a sector-aligned momentum evaluation framework. All V-Lag features are lagged by one quarter to prevent data leakage. See Section 3.7.1 for the complete feature enumeration.
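The ratio and interaction features in Table A5 follow directly from their definitions. The sketch below is a hypothetical illustration: the `FirmQuarter` container and `derived_features` function are not from the paper, the inputs are made-up values, and whether the paper standardizes inputs before forming the products is not specified here, so raw values are used.

```python
from dataclasses import dataclass

@dataclass
class FirmQuarter:
    """Raw inputs for one firm-quarter (hypothetical container)."""
    net_income: float
    total_equity: float
    operating_cash_flow: float
    revenue: float
    cdec_score: float

def derived_features(q):
    """Compute the ratio and interaction features defined in Table A5."""
    roe = q.net_income / q.total_equity              # return on equity
    eq_ratio = q.operating_cash_flow / q.net_income  # EQ = OCF / NI; undefined when NI is 0
    return {
        "roe": roe,
        "eq_ratio": eq_ratio,
        "cdec_x_roe": q.cdec_score * roe,
        "cdec_x_eq": q.cdec_score * eq_ratio,
        "cdec_x_net_income": q.cdec_score * q.net_income,
        "cdec_x_revenue": q.cdec_score * q.revenue,
    }

# Made-up inputs in arbitrary currency units.
example = derived_features(FirmQuarter(100.0, 500.0, 120.0, 1000.0, 0.8))
```

Each interaction term lets a model weight the same fundamental differently depending on disclosure consistency, which is the mechanism the main text credits for the framework's risk separation.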

References

Bao, Y., Ke, B., Li, B., Yu, Y. J., & Zhang, J. (2020). Detecting accounting fraud in publicly traded U.S. firms using a machine learning approach. Journal of Accounting Research, 58(1), 199–235. [Google Scholar] [CrossRef]
Barbieri, F., Camacho-Collados, J., Espinosa-Anke, L., & Neves, L. (2020). TweetEval: Unified benchmark and comparative evaluation for tweet classification. In Findings of the association for computational linguistics: EMNLP 2020 (pp. 1644–1650). Association for Computational Linguistics. [Google Scholar] [CrossRef]
Bernard, V. L., & Thomas, J. K. (1989). Post-earnings-announcement drift: Delayed price response or risk premium? Journal of Accounting Research, 27, 1–36. [Google Scholar] [CrossRef]
Bodnaruk, A., Loughran, T., & McDonald, B. (2015). Using 10-K text to gauge financial constraints. Journal of Financial and Quantitative Analysis, 50(4), 623–646. [Google Scholar] [CrossRef]
Brown, L. D., Griffin, P. A., Hagerman, R. L., & Zmijewski, M. E. (1987). An evaluation of alternative proxies for the market’s assessment of unexpected earnings. Journal of Accounting and Economics, 9(2), 159–193. [Google Scholar] [CrossRef]
Buehlmaier, M. M., & Whited, T. M. (2018). Are financial constraints priced? Evidence from textual analysis. Review of Financial Studies, 31(7), 2693–2728. [Google Scholar] [CrossRef]
Campbell, J. L., Chen, H., Dhaliwal, D. S., Lu, H., & Steele, L. B. (2014). The information content of mandatory risk factor disclosures in corporate filings. Review of Accounting Studies, 19(1), 396–455. [Google Scholar] [CrossRef]
Cohen, L., Lou, D., & Malloy, C. (2013). Playing favorites: How firms prevent the revelation of bad news (NBER Working Paper No. 19429). National Bureau of Economic Research. [CrossRef]
Cohen, L., Lou, D., & Malloy, C. J. (2020). Casting conference calls. Management Science, 66(11), 5015–5039. [Google Scholar] [CrossRef]
Davis, A. K., & Tama-Sweet, I. (2012). Managers’ use of language across alternative disclosure outlets: Earnings press releases versus MD&A. Contemporary Accounting Research, 29(3), 804–837.
Demszky, D., Movshovitz-Attias, D., Ko, J., Cowen, A., Nemade, G., & Ravi, S. (2020). GoEmotions: A dataset of fine-grained emotions. In Proceedings of the 58th annual meeting of the association for computational linguistics (ACL) (pp. 4040–4054). Association for Computational Linguistics.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies (pp. 4171–4186). Association for Computational Linguistics.
Doyle, J. T., Lundholm, R. J., & Soliman, M. T. (2006). The extreme future stock returns following I/B/E/S earnings surprises. Journal of Accounting Research, 44(5), 849–887.
Ergun, Z. E., & Sefer, E. (2025). FinSentiment: Predicting financial sentiment through transfer learning. Intelligent Systems in Accounting, Finance and Management, 32(3), e70015.
Fama, E. F., & French, K. R. (2015). A five-factor asset pricing model. Journal of Financial Economics, 116(1), 1–22.
Fink, J. (2021). A review of the post-earnings-announcement drift. Journal of Behavioral and Experimental Finance, 29, 100446.
Fleiss, J. L., Levin, B., & Paik, M. C. (2003). Statistical methods for rates and proportions (3rd ed.). Wiley.
Healy, P. M., & Palepu, K. G. (2001). Information asymmetry, corporate disclosure, and the capital markets: A review of the empirical disclosure literature. Journal of Accounting and Economics, 31(1–3), 405–440.
Hirshleifer, D., Low, A., & Teoh, S. H. (2012). Are overconfident CEOs better innovators? Journal of Finance, 67(4), 1457–1498.
Huang, A. H., Wang, H., & Yang, Y. (2023). FinBERT: A large language model for extracting information from financial text. Contemporary Accounting Research, 40(2), 806–841.
Huang, X., Teoh, S. H., & Zhang, Y. (2014). Tone management. The Accounting Review, 89(3), 1083–1113.
Jensen, M. C., & Meckling, W. H. (1976). Theory of the firm: Managerial behavior, agency costs and ownership structure. Journal of Financial Economics, 3(4), 305–360.
Kim, A., Muhn, M., & Nikolaev, V. V. (2024). Bloated disclosures: Can ChatGPT help investors process financial information? arXiv, arXiv:2306.10224v5.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86.
Latané, H. A., & Jones, C. P. (1977). Standardized unexpected earnings—A progress report. Journal of Finance, 32(5), 1457–1465.
Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145–151.
Livnat, J., & Mendenhall, R. R. (2006). Comparing the post-earnings announcement drift for surprises calculated from analyst and time series forecasts. Journal of Accounting Research, 44(1), 177–205.
Lopez-Lira, A., & Tang, Y. (2023). Can ChatGPT forecast stock price movements? Return predictability and large language models. arXiv, arXiv:2304.07619v3.
Loughran, T., & McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. Journal of Finance, 66(1), 35–65.
Loughran, T., & McDonald, B. (2016). Textual analysis in accounting and finance: A survey. Journal of Accounting Research, 54(4), 1187–1230.
Malmendier, U., & Tate, G. (2005). CEO overconfidence and corporate investment. Journal of Finance, 60(6), 2661–2700.
Malmendier, U., & Tate, G. (2015). Behavioral CEOs: The role of managerial overconfidence. Journal of Economic Perspectives, 29(4), 37–60.
Matsumoto, D., Pronk, M., & Roelofsen, E. (2011). What makes conference calls useful? The information content of managers’ presentations and analysts’ discussion sessions. The Accounting Review, 86(4), 1383–1414.
Mayew, W. J., & Venkatachalam, M. (2012). The power of voice: Managerial affective states and future firm performance. Journal of Finance, 67(1), 1–43.
McCarthy, S., & Alaghband, G. (2024). Fin-ALICE: Artificial linguistic intelligence causal econometrics. Journal of Risk and Financial Management, 17(12), 537.
Plutchik, R. (1980). Emotion: A psychoevolutionary synthesis. Harper & Row.
Price, S. M., Doran, J. S., Peterson, D. R., & Bliss, B. A. (2012). Earnings conference calls and stock returns: The incremental informativeness of textual tone. Journal of Banking & Finance, 36(4), 992–1011.
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 conference on empirical methods in natural language processing (EMNLP) (pp. 3982–3992). Association for Computational Linguistics.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423.
Sloan, R. G. (1996). Do stock prices fully reflect information in accruals and cash flows about future earnings? The Accounting Review, 71(3), 289–315.
Spence, M. (1973). Job market signaling. The Quarterly Journal of Economics, 87(3), 355–374.
Tetlock, P. C. (2007). Giving content to investor sentiment: The role of media in the stock market. Journal of Finance, 62(3), 1139–1168.
Todd, J., Bowden, J., & Moshiri, S. (2024). Text-based sentiment analysis in finance: Synthesising the existing literature and exploring future directions. Intelligent Systems in Accounting, Finance and Management, 31(1), e1549.
Treynor, J. L., & Black, F. (1973). How to use security analysis to improve portfolio selection. Journal of Business, 46(1), 66–86.
Vrij, A., Fisher, R., Mann, S., & Leal, S. (2006). Detecting deception by manipulating cognitive load. Trends in Cognitive Sciences, 10(4), 141–142.

Figure 1. Conceptual framework. Five data sources feed five feature families, producing risk screening outputs. Family III (CDEC) represents the primary methodological contribution; Family V synthesizes all families into the Financial Health Indicator.

Figure 2. Correlation heatmap showing red flag associations with financial quality metrics ($n = 4112$ firm-quarters).

Figure 3. Earnings outcomes by FHI Health Status (out-of-sample, 2020–2025, $n = 1120$). Healthy firms ($FHI > 0.05$) beat expectations at 62.8% versus 35.4% for Unhealthy firms ($FHI < -0.05$), a 27.4 pp spread ($\chi^{2} = 69.3$, $p < 0.000001$).

Table 1. Prior literature on financial disclosure tone analysis.

Study	Channel(s)	Document Type	Granularity	Cross-Doc Measure?
Loughran and McDonald (2011)	Single	10-K	6-category dictionary	No
Davis and Tama-Sweet (2012)	Two	Press release, MD&A	Binary tone	Compared levels, no metric
X. Huang et al. (2014)	Single	Earnings press release	Binary abnormal tone	No
Mayew and Venkatachalam (2012)	Single	Earnings call (audio)	Vocal affect	No
A. H. Huang et al. (2023)/FinBERT	Single	Various financial text	3-class sentiment	No
Kim et al. (2024)	Single	MD&A (ChatGPT 3.5)	Binary + bloat	No
Lopez-Lira and Tang (2023)	Single	News headlines	Binary direction	No
This paper (CDEC)	Two	MD&A + Q&A	28-dim emotion	Yes: cosine + JS + RF
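
The cosine and JS components named in the last row of Table 1 can be illustrated with a minimal sketch. It assumes each document is summarized as a non-negative emotion score vector (28 entries in the paper, 2 in the toy check); the function names and the base-2 logarithm are illustrative choices, not the authors' implementation, and how the paper converts the divergence into its similarity component is not specified in this section.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a non-negative score vector so it sums to 1."""
    total = sum(v)
    if total <= 0:
        raise ValueError("vector must have positive mass")
    return [x / total for x in v]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two emotion vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits (assumed base 2, so range is [0, 1])."""
    p, q = normalize(p), normalize(q)
    m = [(x + y) / 2 for x, y in zip(p, q)]

    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        # Terms with a_i == 0 contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical vectors give a cosine of 1 and a divergence of 0; vectors with no overlapping emotions give a cosine of 0 and a divergence of 1 bit.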

Table 2. Integrated feature family framework.

Feature Family	Traditional Component	Enhanced Feature	Financial Logic
I. Quantitative Foundations	Total Accruals and Cash Flow (Sloan, 1996)	ROE Trend and Accrual Quality Grades	Establishes baseline financial reality; CDEC must predict beyond this signal
II. Textual Components	Sentiment Polarity (Loughran & McDonald, 2011)	28-Dimension Emotion Vectors (FinGoEmotion)	Granularity detects specific states (Fear vs. Confusion) lost in binary sentiment
III. Cross-Channel Analysis	Single-Source Tone (Price et al., 2012; Tetlock, 2007)	Cross-Document Consistency (CDEC)	Measures “Authenticity Gap” between vetted MD&A and spontaneous Q&A
IV. Risk Indicators	Dictionary-Based Constraints (Bodnaruk et al., 2015)	Behavioral Red Flags (RF₁–RF₄)	Detects cognitive load patterns rather than keyword counts
V. Synthesis Model	Linear Regression/Logit	Financial Health Indicator (Random Forest)	Non-linear synthesis; identifies firms likely to miss earnings

Table 3. Red flag pattern operationalization.

Red Flag	Weight	Threshold	Rate	Validation	Action
RF₁: Overconfident Q&A	0.08	$E_{opt} > 0.1377$	10.0%	SUE: $p < 0.001$ ***	Analyst review
RF₂: Disclosure Complexity	0.02	$H(\text{Q\&A}) / H(\text{MD\&A}) > 3.0$	16.8%	8-K only ^†	Monitor impairments
RF₃: Trust Manipulation	0.02	$\Delta E^{trust} > 0.10$	1.6%	Not significant	Monitor
RF₄: Mixed Emotion Surge	0.08	>2 emotions > 0.15	8.4%	SUE: $p < 0.001$ ***	EQ deep dive

^† RF₂ is NOT a validated SUE predictor ($r = +0.018$, $p = 0.28$) but predicts impairments (2.5× elevation, *** $p < 0.001$). Sample sizes: RF₁/RF₄ $n = 3368$; RF₂/RF₃ $n = 4112$.
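
The thresholds in Table 3 can be applied mechanically, as in the following sketch. It rests on several assumptions: the optimism score and trust change are passed in as precomputed numbers, $H$ is Shannon entropy of the normalized emotion vector, RF₄ counts emotions in the Q&A vector, and `weighted_flag_score` is a hypothetical aggregate built from the Weight column (the paper's own red flag score formula is not given in this section).

```python
import math
from typing import Dict, Sequence

# Weights as listed in the "Weight" column of Table 3.
WEIGHTS: Dict[str, float] = {"RF1": 0.08, "RF2": 0.02, "RF3": 0.02, "RF4": 0.08}


def shannon_entropy(dist: Sequence[float]) -> float:
    """Shannon entropy (bits) of a non-negative vector after normalization."""
    total = sum(dist)
    if total <= 0:
        return 0.0
    return -sum((x / total) * math.log2(x / total) for x in dist if x > 0)


def entropy_ratio(qa: Sequence[float], mdna: Sequence[float]) -> float:
    """H(Q&A) / H(MD&A); the ratio does not depend on the log base."""
    h_qa, h_mdna = shannon_entropy(qa), shannon_entropy(mdna)
    if h_mdna == 0:
        return math.inf if h_qa > 0 else 0.0
    return h_qa / h_mdna


def red_flags(e_opt: float, delta_trust: float,
              qa: Sequence[float], mdna: Sequence[float]) -> Dict[str, bool]:
    """Evaluate the four Table 3 thresholds for one firm-quarter."""
    return {
        "RF1": e_opt > 0.1377,                       # overconfident Q&A
        "RF2": entropy_ratio(qa, mdna) > 3.0,        # disclosure complexity
        "RF3": delta_trust > 0.10,                   # trust manipulation
        "RF4": sum(1 for x in qa if x > 0.15) > 2,   # mixed emotion surge
    }


def weighted_flag_score(flags: Dict[str, bool]) -> float:
    """Hypothetical aggregate: sum of Table 3 weights for flags that fired."""
    return sum(WEIGHTS[name] for name, fired in flags.items() if fired)
```

For example, a diffuse Q&A vector such as `[0.2, 0.2, 0.2, 0.4]` paired with a concentrated MD&A vector such as `[0.97, 0.01, 0.01, 0.01]` triggers RF₂ and RF₄ in this sketch (toy 4-dimensional vectors stand in for the 28-dimensional ones).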

Table 4. Top five feature importance.

Rank	Feature	Importance	Interpretation
1	cdec_x_roe	0.103	CDEC × ROE interaction dominates
2	roe	0.103	Profitability fundamentals
3	roe_trend	0.096	ROE trajectory signals risk
4	cdec_x_net_income	0.075	Disclosure-earnings interaction
5	eq_ratio	0.071	Earnings quality ratio

Table reports feature importance averaged across the Beat and Miss classifiers. See Appendix B, Table A1 for the full comparison showing that ROE features are more important for predicting misses while CDEC features drive beat prediction.

Table 5. Model performance progression.

Specification	AUC	Key Features	Interpretation
Baseline: Fundamentals-only	0.63	ROE, accruals, equity ratio, revenue	Traditional financial metrics alone
CDEC components	0.51	cdec_score, $S_{cos}$, $S_{js}$	Cross-document similarity—not predictive alone
Red Flags only	0.51	RF₁–RF₄, $S_{rf}$	Emotion-derived warning signals—not predictive alone
CDEC × Fundamentals	0.67	26-feature combined model	Interaction drives prediction

Table 6. Red flag correlations with financial metrics.

Red Flag	EQ Ratio	Total Accruals	Net Income	SUE Correlation
RF₁ (Overconfident Q&A)	$r = -0.101$ ***	$r = 0.011$	$r = -0.028$	$r = -0.066$ ***
RF₂ (Disclosure Complexity)	$r = -0.005$	$r = -0.052$ ***	$r = +0.087$ ***	$r = +0.018$ (n.s.)
RF₃ (Trust Manipulation)	$r = 0.006$	$r = 0.005$	$r = -0.026$	$r = -0.024$ (n.s.)
RF₄ (Mixed Emotions)	$r = 0.030$ **	$r = 0.009$	$r = -0.055$ ***	$r = -0.068$ ***

*** $p < 0.01$, ** $p < 0.05$. EQ Ratio = Earnings Quality ratio (OCF/NI, Equation (8)). RF₁ and RF₄ are validated SUE predictors (both $p < 0.001$). RF₂ is not a validated SUE predictor (n.s.) but predicts impairments (Section 4.6).

Table 7. SUE regression results.

Coefficient	Estimate	t-Statistic	p-Value
$θ_{1}$ (CDEC)	+2.31	+22.2	<0.001 ***
$θ_{2}$ (EQ)	−0.17	−2.56	0.011 **
$θ_{3}$ (CDEC × EQ)	+0.20	+2.25	0.025 **

** $p < 0.05$; *** $p < 0.001$.
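
To read the interaction term in Table 7, it may help to write out the marginal effect of CDEC. The sketch below assumes the coefficients come from a standard linear interaction specification; the intercept $θ_{0}$ and error term $\varepsilon$ are assumptions, since the regression equation itself is not reproduced in this part of the document.

```latex
\mathrm{SUE} = \theta_{0} + \theta_{1}\,\mathrm{CDEC} + \theta_{2}\,\mathrm{EQ}
             + \theta_{3}\,(\mathrm{CDEC} \times \mathrm{EQ}) + \varepsilon

\frac{\partial\,\mathrm{SUE}}{\partial\,\mathrm{CDEC}}
  = \theta_{1} + \theta_{3}\,\mathrm{EQ}
  \approx 2.31 + 0.20\,\mathrm{EQ}
```

Under this form, the association between CDEC and SUE strengthens as the earnings quality ratio rises; at EQ = 1, for instance, the implied slope is about 2.51.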

Table 8. CDEC component correlations with SUE.

Component	r with SUE	p-Value
$S_{cos}$ (Cosine Similarity)	+0.042	0.014 **
$S_{JS}$ (JS Divergence)	+0.029	0.098
RF₁ (Overconfident Q&A)	−0.066	<0.001 ***
RF₄ (Mixed Emotion)	−0.068	<0.001 ***

** $p < 0.05$; *** $p < 0.001$.

Table 9. Progressive feature build-up.

Specification	Features	AUC	Δ AUC	Interpretation
Core fundamentals	5	0.605	—	Baseline
+ROE, ROE trend	7	0.649	+0.044	Largest single lift
+CDEC components	10	0.648	−0.002	Raw similarity not informative alone
+CDEC × Financial interactions	14	0.654	+0.006	Interactions unlock CDEC value
+Red flags + SUE regression	20	0.655	+0.002	Modest standalone contribution
Full model (+analyst signals)	26	0.671	+0.016	Largest final-layer gain

Table 10. Benchmark: emotion granularity (28-dim) vs. sentiment polarity (3-class).

Specification	Dim	Features	AUC	Spread (pp)	Healthy %	Unhealthy %
CDEC + Fundamentals (paper model)	28	26	0.671	27.4	62.8	35.4
RoBERTa + Fundamentals	3	21	0.667	14.6	55.1	40.5
FinBERT + Fundamentals	3	21	0.662	8.7	54.2	45.5
Fundamentals only	—	7	0.640	11.5	55.1	43.6
CDEC consistency only	28	3	0.524	3.7	—	—
FinBERT consistency only	3	3	0.497	−4.7	—	—
RoBERTa consistency only	3	3	0.491	−6.5	—	—

Spread = difference in earnings beat rate between firms classified as “Healthy” (FHI > 0.05) and “Unhealthy” (FHI < −0.05). Healthy/Unhealthy % = beat rate within each group. All specifications use identical temporal split and Random Forest hyperparameters.
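
The spread definition in the note above can be sketched in a few lines. The input format, a sequence of `(fhi, beat)` pairs per firm-quarter, and the function names are assumptions made for illustration; only the ±0.05 band and the beat-rate difference come from the table note.

```python
from typing import Dict, Iterable, List, Tuple


def health_status(fhi: float, band: float = 0.05) -> str:
    """Map an FHI score to the three status labels used in the note."""
    if fhi > band:
        return "Healthy"
    if fhi < -band:
        return "Unhealthy"
    return "Neutral"


def beat_rates(rows: Iterable[Tuple[float, bool]]) -> Dict[str, float]:
    """Beat rate (in percent) within each status group that has observations."""
    counts: Dict[str, List[int]] = {}
    for fhi, beat in rows:
        beats_and_total = counts.setdefault(health_status(fhi), [0, 0])
        beats_and_total[0] += int(beat)
        beats_and_total[1] += 1
    return {status: 100.0 * b / n for status, (b, n) in counts.items()}


def spread_pp(rows: Iterable[Tuple[float, bool]]) -> float:
    """Healthy beat rate minus Unhealthy beat rate, in percentage points."""
    rates = beat_rates(list(rows))
    return rates["Healthy"] - rates["Unhealthy"]
```

Neutral firm-quarters are classified but do not enter the spread, matching the Healthy-versus-Unhealthy comparison described in the note.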

Table 11. RF₂ rates before adverse events.

Event Type	n	RF₂ Rate	RF₂ vs. Baseline	p-Value
Baseline (All Quarters)	4147	16.6%	—	—
Item 2.04 (Debt Covenants)	8	12.5%	0.75×	0.77
Item 2.05 (Restructuring)	66	19.7%	1.19×	0.30
Item 2.06 (Impairments)	34	41.2%	2.48×	<0.001 ***

*** $p < 0.001$.

Table 12. Health status outcome separation.

Status	N	%	Beat Rate	Miss Rate
Healthy ($FHI > 0.05$)	650	58.0%	62.8%	37.2%
Neutral ($-0.05 \leq FHI \leq 0.05$)	103	9.2%	41.7%	58.3%
Unhealthy ($FHI < -0.05$)	367	32.8%	35.4%	64.6%
Spread	—	—	27.4 pp	—

Statistical Validation: $\chi^{2} = 69.3$, $p < 0.000001$. The 27.4 percentage point spread between Healthy and Unhealthy beat rates is highly significant. An “Unhealthy” firm is about 1.7 times as likely to miss earnings as a “Healthy” peer (Unhealthy miss rate 64.6% vs. Healthy miss rate 37.2%, ratio = 1.74×). Statistical validation uses a 2 × 2 contingency table (Healthy/Unhealthy × Beat/Miss, excluding Neutral) with Pearson’s $\chi^{2}$ test. Beat = SUE > 0; Miss Rate = 1 − Beat Rate.
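
As a rough check on the statistics above, the 2 × 2 table can be rebuilt from Table 12. The cell counts below are reconstructed by rounding the published group sizes times the published beat rates, so they are approximations, not the authors' data. Applying Yates' continuity correction is also an assumption; the source states only that Pearson's test was used, although the corrected statistic lands close to the reported value.

```python
def chi2_2x2(a: int, b: int, c: int, d: int, yates: bool = True) -> float:
    """Pearson chi-square for the table [[a, b], [c, d]], optionally Yates-corrected."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(diff - n / 2, 0.0)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Counts reconstructed from Table 12 (N and rounded beat rates); approximate.
healthy_n, unhealthy_n = 650, 367
healthy_beat = round(0.628 * healthy_n)      # 408
unhealthy_beat = round(0.354 * unhealthy_n)  # 130
healthy_miss = healthy_n - healthy_beat
unhealthy_miss = unhealthy_n - unhealthy_beat

chi2_corrected = chi2_2x2(healthy_beat, healthy_miss, unhealthy_beat, unhealthy_miss)
chi2_uncorrected = chi2_2x2(healthy_beat, healthy_miss, unhealthy_beat,
                            unhealthy_miss, yates=False)

# Relative risk of a miss: Unhealthy miss rate over Healthy miss rate.
miss_ratio = (unhealthy_miss / unhealthy_n) / (healthy_miss / healthy_n)

print(f"chi2 (Yates) = {chi2_corrected:.1f}, chi2 (uncorrected) = {chi2_uncorrected:.1f}")
print(f"miss-rate ratio = {miss_ratio:.2f}")
```

With these reconstructed counts the corrected statistic is about 69.3 and the miss-rate ratio is about 1.73, slightly below the 1.74× obtained from the rounded percentages.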

Table 13. Performance by year.

Year	N	AUC	Notes
2020	140	0.58	COVID-19 regime disruption
2021	147	0.62	Recovery
2022	187	0.70	Normalization
2023	263	0.77	Best performance (post-COVID-19)
2024	194	0.63	Maintained
2025	189	0.65	Preliminary
Overall	1120	0.671	—
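
A per-year breakdown like Table 13 can be produced by grouping out-of-sample predictions by year and computing a rank-based AUC within each group. The sketch below is one way to do this in plain Python; the `(year, score, label)` row format and function names are assumptions, and it is not the authors' evaluation code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_year(rows: Iterable[Tuple[int, float, int]]) -> Dict[int, float]:
    """Group (year, score, label) rows by year and compute AUC per group."""
    grouped: Dict[int, Tuple[List[float], List[int]]] = defaultdict(lambda: ([], []))
    for year, score, label in rows:
        grouped[year][0].append(score)
        grouped[year][1].append(label)
    return {year: auc(s, y) for year, (s, y) in sorted(grouped.items())}
```

The pairwise form is quadratic in group size, which is acceptable for yearly samples of a few hundred firm-quarters such as those in Table 13.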

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

