3.1. Data Sources and Alignment
The analysis integrates five data sources aligned at the firm-quarter level: (1) 10-Q/10-K MD&A sections from SEC EDGAR, representing the mandatory, legally-vetted disclosure channel; (2) earnings call transcripts (presentation and Q&A segments), representing the voluntary, spontaneous channel; (3) SEC Form 8-K filings (“Current Reports” filed within four business days of material corporate events) for adverse event validation, specifically Item 2.04 (Debt Covenants), Item 2.05 (Restructuring), and Item 2.06 (Impairments); (4) financial statement data (income statement, balance sheet, cash flow) for quantitative fundamentals; and (5) analyst forecast distributions for SUE construction. Documents are matched by CIK (Central Index Key, the SEC’s unique firm identifier), fiscal quarter, and filing date within ±30 days of earnings announcements. Sample sizes vary by data availability: 4453 firm-quarters have complete financial data; 4147 have matched MD&A and transcripts for CDEC calculation; 3368 have SUE data requiring analyst coverage.
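The matching rule above can be illustrated with a minimal sketch. It assumes each document is a plain dictionary, that the earnings call date stands in for the earnings announcement date, and that the field names (`cik`, `fiscal_year`, `fiscal_quarter`, `filing_date`, `call_date`) are hypothetical; it is not the pipeline used in the study.

```python
from datetime import date

def day_gap(filing, call):
    """Absolute number of days between a filing and an earnings call."""
    return abs((filing["filing_date"] - call["call_date"]).days)

def match_documents(filings, calls, max_gap_days=30):
    """Pair each filing with the closest same-firm, same-quarter call
    that falls within +/- max_gap_days of the filing date."""
    # Index calls by (CIK, fiscal year, fiscal quarter) for direct lookup.
    calls_by_key = {}
    for call in calls:
        key = (call["cik"], call["fiscal_year"], call["fiscal_quarter"])
        calls_by_key.setdefault(key, []).append(call)

    matched = []
    for filing in filings:
        key = (filing["cik"], filing["fiscal_year"], filing["fiscal_quarter"])
        candidates = [c for c in calls_by_key.get(key, [])
                      if day_gap(filing, c) <= max_gap_days]
        if candidates:
            # Keep the call closest in time to the filing.
            matched.append((filing, min(candidates, key=lambda c: day_gap(filing, c))))
    return matched

# Hypothetical records: firm 1 is inside the window, firm 2 is not.
filings = [
    {"cik": "0000000001", "fiscal_year": 2021, "fiscal_quarter": 1, "filing_date": date(2021, 2, 1)},
    {"cik": "0000000002", "fiscal_year": 2021, "fiscal_quarter": 1, "filing_date": date(2021, 5, 15)},
]
calls = [
    {"cik": "0000000001", "fiscal_year": 2021, "fiscal_quarter": 1, "call_date": date(2021, 1, 27)},
    {"cik": "0000000002", "fiscal_year": 2021, "fiscal_quarter": 1, "call_date": date(2021, 2, 1)},
]
pairs = match_documents(filings, calls)
```

Firm-quarters without a call inside the window drop out of the matched set, which is one reason the sample sizes differ across analyses.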
Following the Fin-ALICE framework (McCarthy & Alaghband, 2024), which analyzes the top companies per GICS sector as sector bellwethers, the S&P 100 sample provides an ideal testbed for establishing the CDEC construct: these firms have the highest analyst coverage density, most consistent transcript quality, and most reliable earnings surprise measurement across all 11 GICS sectors. This sector-representative design prioritizes data quality for construct validation; generalizability to mid- and small-cap firms is addressed as a limitation in Section 5.3.
This section details the mathematical implementation of each feature family, progressing from novel quantitative metrics (Equations (8) and (9)) through cross-document emotion consistency measurements (Equations (15)–(19)) to the validated Financial Health Indicator (Equation (23)). Standard equations from prior literature are presented in Section 2. The domain-specific emotion classifier (FinGoEmotion), which serves as a prerequisite tool for all emotion-based analyses, is described first, followed by the five feature families.
3.6. Family III: Cross-Document Emotion Consistency (Primary Innovation)
The CDEC metric quantifies cross-channel emotional consistency through three complementary components, each targeting a different aspect of disclosure alignment:
Cosine Similarity ($\text{CosSim}$): measures directional alignment of emotion vectors regardless of intensity.
Jensen–Shannon Similarity ($\text{JS}_{\text{sim}}$): measures distributional intensity differences between channels.
Red Flag Penalty ($\text{RF}_{\text{penalty}}$): penalizes behavioral patterns indicating potential deception or obfuscation.
We detail each component below.
Component 1—Directional Alignment via Cosine Similarity:
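$$\text{CosSim}(\mathbf{e}_{\text{MD\&A}}, \mathbf{e}_{\text{Q\&A}}) = \frac{\mathbf{e}_{\text{MD\&A}} \cdot \mathbf{e}_{\text{Q\&A}}}{\|\mathbf{e}_{\text{MD\&A}}\| \, \|\mathbf{e}_{\text{Q\&A}}\|} \tag{15}$$

where $\mathbf{e}_{\text{MD\&A}}$ represents the 28-dimensional emotion vector from the 10-Q MD&A section, $\mathbf{e}_{\text{Q\&A}}$ represents the 28-dimensional emotion vector from the earnings call Q&A segment, $\cdot$ denotes the dot product, and $\|\cdot\|$ denotes the Euclidean norm.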
Cosine similarity isolates directional alignment rather than magnitude, the critical dimension for detecting strategic impression management. This design choice rests on cognitive load theory (Vrij et al., 2006): maintaining deceptive narratives requires suppressing authentic emotions while constructing alternatives. In general the metric ranges from −1 (complete opposition) to +1 (perfect alignment), although emotion vectors with non-negative entries can only produce values between 0 and +1; legitimate disclosures typically score 0.5–0.9. Scores below 0.3 warrant investigation, as they indicate fundamental emotional misalignment between prepared and spontaneous communications. The normalization by vector magnitudes ensures that documents of different lengths remain directly comparable. See Appendix A.6 for detailed computation methodology and illustrative examples.
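The cosine computation described above can be sketched in a few lines. The three-dimensional toy vectors are illustrative stand-ins for the 28-dimensional emotion vectors, and returning 0.0 for an all-zero vector is an assumption made here for robustness, not a rule taken from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length emotion vectors."""
    if len(u) != len(v):
        raise ValueError("emotion vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat an empty emotion profile as unaligned
    return dot / (norm_u * norm_v)

# Toy profiles: the same three emotions, with the emphasis reversed.
mdna_vec = [0.6, 0.3, 0.1]
qa_vec = [0.1, 0.3, 0.6]
score = cosine_similarity(mdna_vec, qa_vec)

# Rescaling a vector leaves the score unchanged, so intensity is ignored.
scaled_score = cosine_similarity(mdna_vec, [2 * x for x in mdna_vec])
```

In this toy case the score is 0.21 / 0.46, roughly 0.46, which sits between the investigation threshold of 0.3 and the typical 0.5–0.9 band.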
As a robustness check, we calculate the top-k emotion overlap between the two emotion vectors, where the top-k operator returns the k emotions with the highest probabilities. This complements cosine similarity by measuring agreement on dominant emotional themes. The top-k overlap provides qualitative validation but is not included in the final CDEC score, to avoid redundancy with cosine similarity.
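One plausible way to compute such an overlap is sketched below. The normalization (shared labels divided by k) and the value k = 2 are assumptions for illustration only; the emotion labels are GoEmotions categories, but the probabilities are invented.

```python
def top_k_labels(dist, k):
    """Return the set of the k labels with the highest probabilities."""
    ranked = sorted(dist, key=lambda label: dist[label], reverse=True)
    return set(ranked[:k])

def top_k_overlap(p, q, k):
    """Share of the top-k labels that the two distributions have in common
    (assumed form: |intersection| / k)."""
    return len(top_k_labels(p, k) & top_k_labels(q, k)) / k

# Hypothetical emotion probabilities for the two channels.
mdna_dist = {"optimism": 0.40, "approval": 0.30, "neutral": 0.20, "fear": 0.10}
qa_dist = {"optimism": 0.35, "fear": 0.30, "neutral": 0.25, "approval": 0.10}

overlap = top_k_overlap(mdna_dist, qa_dist, k=2)
```

Here both channels rank optimism first but disagree on the second emotion, so the overlap is 0.5.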
Component 2—Distributional Intensity via Jensen–Shannon Divergence:
We measure the distributional similarity between the MD&A (Management's Discussion and Analysis) and Q&A sections by calculating the Jensen–Shannon divergence (JSD). This method quantifies how much the two distributions differ from a common reference distribution $M$ (Lin, 1991):

$$\text{JSD}(P \,\|\, Q) = \tfrac{1}{2} D_{\text{KL}}(P \,\|\, M) + \tfrac{1}{2} D_{\text{KL}}(Q \,\|\, M) \tag{17}$$

where $P$ and $Q$ are the MD&A and Q&A emotion distributions, $D_{\text{KL}}(P \,\|\, M)$ denotes the Kullback–Leibler divergence measuring information loss when distribution $M$ approximates $P$, and $M = \tfrac{1}{2}(P + Q)$ is the average distribution. Unlike the KL divergence, the JS divergence is symmetric and, with base-2 logarithms, bounded to [0, 1]. While cosine similarity captures directional alignment, JS divergence measures distributional differences in emotional intensity. Two documents may align directionally while exhibiting vastly different intensity distributions, indicating careful emotional management. The transformation $\text{JS}_{\text{sim}} = 1 - \text{JSD}(P \,\|\, Q)$ converts divergence to similarity, so that higher values indicate greater consistency.
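A minimal sketch of the Jensen–Shannon similarity follows. It assumes both inputs are probability distributions over the same ordered set of emotions and uses base-2 logarithms so that the divergence stays within [0, 1]; the function names are illustrative.

```python
import math

def kl_divergence(p, m):
    """Kullback-Leibler divergence D(p || m) in bits.
    Terms with p_i = 0 contribute nothing by convention."""
    return sum(pi * math.log2(pi / mi) for pi, mi in zip(p, m) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # average distribution M
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def js_similarity(p, q):
    """Convert the divergence to a similarity: higher means more consistent."""
    return 1.0 - js_divergence(p, q)

identical = js_similarity([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])
disjoint = js_similarity([1.0, 0.0], [0.0, 1.0])
```

Identical distributions give a similarity of 1 and distributions with no shared mass give 0, which are the two ends of the bounded range.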
Component 3—Pattern-Based Red Flags:
Building on the behavioral patterns established in Section 2.6, we operationalize four red flag indicators. Each flag detects a specific disclosure pattern, with thresholds calibrated from empirical distributions across 4147 firm-quarters.
RF1 (Overconfident Q&A) detects excessive optimism in spontaneous Q&A responses (Section 2.6, Management Overconfidence). The flag triggers when Q&A optimism exceeds the 90th percentile of the empirical distribution, firing at a 10.0% rate.
RF2 (Disclosure Complexity) captures asymmetric complexity between channels (Section 2.6, Cognitive Load). It uses Shannon entropy $H$ (Equation (12)), which measures the complexity or scatter of emotional expression across the 28 dimensions. The flag triggers when the ratio of Q&A emotional entropy to MD&A emotional entropy exceeds 3 (the 85th percentile), firing at 16.8%. A high ratio indicates that executives express substantially more diffuse or unfocused emotions in their live Q&A responses than in the focused emotional tone of their prepared MD&A, consistent with cognitive load from narrative maintenance.
RF3 (Trust Manipulation) detects artificially elevated trust-building language in Q&A relative to mandatory filings (Section 2.6, Strategic Communication). The trust composite averages five GoEmotions categories (admiration, approval, caring, gratitude, love). The flag triggers when the trust differential (Q&A trust composite minus MD&A trust composite) exceeds 0.10 (99th percentile), firing at 1.6%.
RF4 (Mixed-Emotion Surge) identifies diffuse cognitive states where more than two emotions simultaneously exceed an activation threshold of 0.15 (Section 2.6, Cognitive Load). This pattern, inconsistent with focused communication, fires at 8.4%.
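The four trigger rules can be sketched as simple threshold checks. Several details are assumptions made for illustration: "Q&A optimism" is read as the probability of the `optimism` label, the 90th-percentile cutoff is passed in as `optimism_p90` rather than computed, and RF4 is evaluated on the Q&A distribution. The entropy ratio does not depend on the logarithm base, so base 2 is used for convenience.

```python
import math

TRUST_LABELS = ("admiration", "approval", "caring", "gratitude", "love")

def shannon_entropy(dist):
    """Shannon entropy (bits) of an emotion distribution given as a dict."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def trust_composite(dist):
    """Average probability over the five trust-related categories."""
    return sum(dist.get(label, 0.0) for label in TRUST_LABELS) / len(TRUST_LABELS)

def red_flags(mdna, qa, optimism_p90):
    """Evaluate RF1-RF4 for one firm-quarter (thresholds from the text)."""
    h_mdna = shannon_entropy(mdna)
    h_qa = shannon_entropy(qa)
    return {
        "RF1": qa.get("optimism", 0.0) > optimism_p90,             # overconfident Q&A
        "RF2": h_mdna > 0 and h_qa / h_mdna > 3.0,                 # entropy ratio above 3
        "RF3": trust_composite(qa) - trust_composite(mdna) > 0.10,  # trust differential
        "RF4": sum(1 for p in qa.values() if p > 0.15) > 2,        # more than two active emotions
    }

# Hypothetical firm-quarter: a focused MD&A and a diffuse Q&A.
mdna_dist = {"neutral": 0.90, "approval": 0.10}
qa_dist = {"optimism": 0.30, "approval": 0.25, "gratitude": 0.20, "neutral": 0.25}
flags = red_flags(mdna_dist, qa_dist, optimism_p90=0.25)
```

In this example the Q&A entropy is about four times the MD&A entropy and four emotions exceed 0.15, so RF2 and RF4 fire, while the trust differential of 0.07 leaves RF3 inactive.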
Table 3 summarizes the four patterns with their calibrated thresholds, trigger rates, and validation status.
The red flag penalty score, $\text{RF}_{\text{penalty}}$, quantifies the combined impact of active red flags on disclosure consistency. It is an aggregate score that approaches 0 as risk increases and 1 as risk decreases. It is computed over the $K = 4$ red flag patterns, where $w_k$ is the calibrated weight for pattern $k$ (as specified in Table 3) and $\mathbb{1}_k$ is a binary indicator (0 or 1) that activates the $k$-th penalty.
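To make the aggregation concrete, the sketch below assumes one simple functional form: one minus the weighted sum of active flags, clipped to [0, 1]. Both this form and the weights are hypothetical placeholders chosen only so that validated flags (RF1, RF4) weigh more than retained-but-unvalidated ones (RF2, RF3); the calibrated values are those of Table 3.

```python
# Hypothetical weights for illustration; not the calibrated Table 3 values.
ILLUSTRATIVE_WEIGHTS = {"RF1": 0.3, "RF2": 0.1, "RF3": 0.1, "RF4": 0.3}

def red_flag_penalty(flags, weights):
    """Assumed form: 1 - sum_k w_k * indicator_k, clipped to [0, 1].
    Returns 1.0 when no flag is active and falls toward 0 as flags accumulate."""
    total = sum(weights[name] for name, active in flags.items() if active)
    return min(1.0, max(0.0, 1.0 - total))

no_flags = red_flag_penalty(
    {"RF1": False, "RF2": False, "RF3": False, "RF4": False}, ILLUSTRATIVE_WEIGHTS)
validated_only = red_flag_penalty(
    {"RF1": True, "RF2": False, "RF3": False, "RF4": True}, ILLUSTRATIVE_WEIGHTS)
all_flags = red_flag_penalty(
    {"RF1": True, "RF2": True, "RF3": True, "RF4": True}, ILLUSTRATIVE_WEIGHTS)
```

Under these placeholder weights the score moves from 1.0 with no flags to 0.4 with both validated flags and 0.2 with all four, matching the stated direction of the score.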
Of the four flags, RF1 and RF4 are validated SUE predictors: firms with excessive Q&A optimism miss expectations by 0.84 standard deviations, while those with diffuse emotional states miss by 0.93 standard deviations. RF2 does not predict SUE directly but is elevated 2.5× before material impairments (Section 4.6), serving as an impairment-monitoring indicator. RF3 lacks statistical power in our large-cap sample due to its low trigger rate (1.6%). See Appendix A.3, Appendix A.4, and Appendix A.5 for extended theoretical foundations, calibration details, and retention rationales for RF2 and RF3.
Figure 2 presents the correlation structure between red flag indicators and financial quality metrics. The heatmap reveals that RF2 correlates with total accruals and net income, consistent with the classic earnings management pattern where firms report strong income while exhibiting poor cash flow quality.
Final CDEC Score:
The three components, directional alignment ($\text{CosSim}$), distributional similarity ($\text{JS}_{\text{sim}}$), and behavioral penalties ($\text{RF}_{\text{penalty}}$), combine into a single consistency score. The weights were determined through empirical optimization on the training sample (pre-2020), maximizing the CDEC-SUE correlation while maintaining interpretability.
The component weighting reflects the relative predictive power of each component: cosine similarity (directional alignment) shows a stronger association with SUE than JS divergence (distributional intensity). The red flag weights separate validated flags (RF1 and RF4, both SUE-significant) from retained-but-unvalidated flags (RF2 and RF3). With these weights, CDEC scores range from 0.0 to 0.8; scores above 0.6 indicate adequate consistency, while scores below 0.5 warrant scrutiny.
3.7. Family V: Synthesis and Financial Health Indicator
Family V synthesizes features from Families I–IV into a single Financial Health Indicator (FHI) that predicts earnings outcomes. This section describes the prediction task and feature architecture (Section 3.7.1), model specification (Section 3.7.2), feature importance analysis (Section 3.7.3), and validation summary (Section 3.7.4).
3.7.1. Prediction Task and Feature Architecture
The FHI predicts whether a firm will beat or miss earnings expectations in the subsequent quarter. As demonstrated in Section 4.3, CDEC features show a statistically significant relationship with earnings surprises and contribute to earnings outcome prediction when combined with financial fundamentals.
Through systematic analysis of candidate features, we identified an optimal 26-feature set organized into five groups:
Family I Quantitative (7 features): total_accruals, net_income, total_equity, revenue, roe, roe_trend, eq_ratio
Family III CDEC Components (3 features): $\text{CosSim}$ (cosine similarity), $\text{JS}_{\text{sim}}$ (Jensen–Shannon similarity), cdec_score
CDEC × Financial Interactions (4 features): cdec_x_roe, cdec_x_eq, cdec_x_net_income, cdec_x_revenue
Red Flags (6 features): RF1–RF4, $\text{RF}_{\text{penalty}}$, eq21_regression
Prior Quarter Analyst Signals (6 features, lagged to avoid data leakage): prior_eq_grade, prior_analyst_score, prior_analyst_confidence, prior_analyst_signal, prior_analyst_score_top5, prior_analyst_confidence_top5
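The feature vector is:

$$\mathbf{x} = [\mathbf{x}_{\text{quant}},\ \mathbf{x}_{\text{CDEC}},\ \mathbf{x}_{\text{interact}},\ \mathbf{x}_{\text{RF}},\ \mathbf{x}_{\text{analyst}}] \in \mathbb{R}^{26} \tag{20}$$

The interaction terms $\mathbf{x}_{\text{interact}}$ capture CDEC × Fundamental interactions, where “fundamentals” refers to the quantitative financial metrics from Family I (ROE, earnings quality EQ, net income, and revenue). These terms test whether the predictive power of emotional consistency varies with underlying financial condition:

$$\mathbf{x}_{\text{interact}} = [\text{CDEC} \times \text{ROE},\ \text{CDEC} \times \text{EQ},\ \text{CDEC} \times \text{NetIncome},\ \text{CDEC} \times \text{Revenue}] \tag{21}$$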
These interaction effects are critical to our findings. As demonstrated in Section 3.7.3, the CDEC × ROE term is the strongest predictor, suggesting that emotional consistency becomes especially consequential when profitability fundamentals are involved.
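The interaction features are simple products, as the following sketch illustrates. The column names follow the feature list above; mapping `cdec_x_eq` to the `eq_ratio` column and multiplying raw (unscaled) values are assumptions, since the text does not state whether inputs are standardized first.

```python
# Interaction name -> Family I column it multiplies with cdec_score.
# The cdec_x_eq -> eq_ratio mapping is an assumption for illustration.
INTERACTION_SOURCES = {
    "cdec_x_roe": "roe",
    "cdec_x_eq": "eq_ratio",
    "cdec_x_net_income": "net_income",
    "cdec_x_revenue": "revenue",
}

def add_interactions(row):
    """Return a copy of the feature row with CDEC x fundamental products added."""
    out = dict(row)
    for name, source in INTERACTION_SOURCES.items():
        out[name] = row["cdec_score"] * row[source]
    return out

# Hypothetical firm-quarter with invented values.
features = add_interactions({
    "cdec_score": 0.5,
    "roe": 0.2,
    "eq_ratio": 0.8,
    "net_income": 10.0,
    "revenue": 40.0,
})
```

Because each term is a product, a tree-based model can split on it to separate, for instance, high-consistency, high-ROE firm-quarters from those where only one of the two is high.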
3.7.2. Model Specification
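We train dual calibrated Random Forest classifiers, one for beat probability ($P_{\text{beat}}$) and one for miss probability ($P_{\text{miss}}$), using the feature vector in Equation (20):

$$P_{\text{beat}} = \text{RandomForest}_{\text{beat}}(\mathbf{x}), \qquad P_{\text{miss}} = \text{RandomForest}_{\text{miss}}(\mathbf{x}) \tag{22}$$

The Financial Health Indicator is obtained from the difference of these probabilities:

$$\text{FHI} = P_{\text{beat}} - P_{\text{miss}} \tag{23}$$

where $\text{FHI}$ represents the Financial Health Indicator. Health Status classification uses ±0.05 thresholds: Healthy ($\text{FHI} > 0.05$), Neutral ($-0.05 \le \text{FHI} \le 0.05$), Unhealthy ($\text{FHI} < -0.05$). The FHI synthesizes cross-document emotional consistency with financial fundamentals into a single diagnostic metric.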
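The step from the two classifier outputs to a Health Status label can be sketched as follows. The probabilities are taken as given (the classifiers themselves are not reproduced here), and treating values exactly at ±0.05 as Neutral is an assumption about the boundary.

```python
def financial_health_indicator(p_beat, p_miss):
    """FHI as the difference between beat and miss probabilities."""
    return p_beat - p_miss

def health_status(fhi, threshold=0.05):
    """Map an FHI value to a Health Status label using +/- threshold."""
    if fhi > threshold:
        return "Healthy"
    if fhi < -threshold:
        return "Unhealthy"
    return "Neutral"  # assumption: boundary values count as Neutral

# Hypothetical classifier outputs for three firm-quarters.
examples = [(0.62, 0.30), (0.40, 0.42), (0.30, 0.55)]
labels = [health_status(financial_health_indicator(b, m)) for b, m in examples]
```

Because the two probabilities come from separate classifiers, they need not sum to one, so the FHI can take any value between −1 and +1.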
The model uses a temporal train-test split with pre-2020 data for training and 2020+ data for true out-of-sample (OOS) testing. Unless otherwise stated, reported OOS performance refers to this 2-way temporal split. A separate 3-way split (train: pre-2018, validation: 2018–2019, test: 2020+) is used only for feature importance analysis and hyperparameter tuning.
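Both splits can be expressed as filters on the fiscal year, as in the minimal sketch below. It assumes each observation carries a `fiscal_year` field (a hypothetical name) and that the year alone determines the partition.

```python
def two_way_split(rows, test_start=2020):
    """Train on years before test_start, test on test_start onward."""
    train = [r for r in rows if r["fiscal_year"] < test_start]
    test = [r for r in rows if r["fiscal_year"] >= test_start]
    return train, test

def three_way_split(rows, val_start=2018, test_start=2020):
    """Train before val_start, validate up to test_start, test afterward."""
    train = [r for r in rows if r["fiscal_year"] < val_start]
    val = [r for r in rows if val_start <= r["fiscal_year"] < test_start]
    test = [r for r in rows if r["fiscal_year"] >= test_start]
    return train, val, test

# Hypothetical observations, one per year.
rows = [{"fiscal_year": year} for year in (2017, 2018, 2019, 2020, 2021)]
train2, test2 = two_way_split(rows)
train3, val3, test3 = three_way_split(rows)
```

Splitting on time rather than at random keeps every test observation strictly later than every training observation, which is what makes the reported performance out-of-sample in the temporal sense.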
3.7.3. Feature Importance and Key Drivers
Table 4 presents the top five features by importance in the Random Forest model.
Two CDEC interaction terms appear in the top five, indicating that emotional consistency contributes predictive power beyond financial fundamentals alone. ROE and ROE Trend are more important for predicting misses, while CDEC-related features drive beat prediction, suggesting that financial fundamentals detect deterioration while disclosure consistency signals positive outcomes. See Appendix B for full 26-feature rankings.
Figure 3 illustrates the Health Status outcome separation on the out-of-sample test set (2020–2025). Healthy firms ($\text{FHI} > 0.05$) achieve a 62.8% beat rate compared to 35.4% for Unhealthy firms ($\text{FHI} < -0.05$), indicating that the FHI provides meaningful discrimination between earnings outcome groups.
3.7.4. Validation Summary
The FHI achieves out-of-sample AUC = 0.671 on the temporal test set (2020–2025), indicating meaningful predictive power for earnings outcomes. Health Status groups show highly significant separation, validated by a chi-square test of whether earnings outcome distributions differ across health status categories. Healthy firms beat expectations at a 62.8% rate versus 35.4% for Unhealthy firms, a 27.4 percentage-point spread.
A key finding is that CDEC alone is not a sufficient predictor (AUC ≈ 0.51, not statistically significant). Predictive power emerges from combining CDEC with financial fundamentals and lagged analyst signals. This validates the multi-family framework: each family contributes incrementally, but synthesis in Family V is essential. The interaction term CDEC × ROE is the top feature for both Beat and Miss prediction, indicating that emotional consistency matters most when financial performance is strong. Companies with consistent disclosure and strong ROE beat expectations; inconsistent disclosure paired with weak fundamentals signals elevated miss risk.
Of the red flags included as features, only RF1 and RF4 are statistically validated predictors of SUE. RF2 (entropy ratio) showed a non-significant positive correlation, indicating that disclosure complexity is not a validated predictor of earnings outcomes. RF2 is retained as an impairment-monitoring indicator (not a SUE predictor); practitioners should use it to flag potential write-downs rather than earnings misses.
Full validation results, including ablation studies and robustness checks, are presented in Section 4.