1. Introduction
Agricultural credit rationing remains a persistent challenge. When borrower risk is hard to verify and monitoring is costly, credit supply can remain constrained even when prices adjust, because screening and incentive problems do not disappear with higher rates [1,2]. This friction is particularly acute in agriculture, as biological production cycles, seasonal settlement patterns, and correlated shocks make short-run financial outcomes volatile, complicating the distinction between genuine distress and normal operating rhythms [3,4].
In practice, banks rely primarily on two inputs for underwriting: financial statements and collateral. Both are fragile in agricultural settings. Many agricultural SMEs have limited standardized disclosures, and their statements are often too infrequent or too noisy to support continuous risk monitoring [5]. Collateral is constrained because pledgeable assets in rural areas can be illiquid or difficult to enforce [6]. When both inputs lose reliability, lenders tighten thresholds, shorten maturities, or exit the segment altogether.
Over the past decade, however, firms have generated increasingly rich auditable traces outside financial statements. Tax filings, court records, and other administrative interactions create institutional digital footprints that are produced under formal rules and are more difficult to alter retroactively, potentially complementing conventional hard information [7]. Evidence from fintech-enabled lending further suggests that alternative data can improve default predictions, especially for thin-file borrowers [7,8], indicating that institutional traces may partially substitute for missing hard information and strengthen monitoring.
Applying such data to agriculture raises two methodological challenges. The first is signal stability: raw, high-frequency operational variables swing with planting and harvest cycles, and feeding them directly into flexible models risks overfitting seasonal patterns, particularly when events (Y = 1) are rare and the label distribution shifts over time [9]. An effective early-warning model must therefore extract persistent signals rather than transient fluctuations, motivating continuity-oriented feature design [7]. The second is institutional role differentiation: tax records mainly reflect ongoing operating intensity and compliance continuity, whereas judicial records mark discrete constraints (lawsuits, enforcement actions, or adverse case outcomes) that can immediately alter a firm’s financing feasibility [10]. Treating these structurally distinct footprints as interchangeable features obscures their different economic meanings and weakens interpretability.
This paper addresses both issues with a multi-source early-warning model built on institutional digital footprints from 1021 agricultural enterprises in China. We use a LightGBM framework for structured, heterogeneous tabular data. More importantly, we enforce temporal validity via a Default Event Isolation protocol: the model is trained only on information available before the prediction window, and observations at and after the event month are discarded to prevent look-ahead bias and post-event leakage [11].
This study differs from prior digital footprint research in three respects. First, whereas most alternative-data credit models target consumer lending or platform merchants [7,8,12], we focus on agricultural enterprises, a setting where seasonal noise, thin files, and heterogeneous institutional traces create distinct modeling challenges. Second, existing multi-source fusion studies often treat alternative data as homogeneous predictors [12,13]; we explicitly separate the economic roles of judicial boundary constraints and tax stability/continuity signals, enabling a mechanism-aware interpretation rather than a black-box prediction. Third, we enforce strict ex ante temporal validity through a Default Event Isolation protocol that discards observations at and after the event month, an anti-leakage safeguard less commonly documented in high-frequency panel settings [11,14].
Our results support a dual-core risk paradigm for agricultural credit early-warning: judicial baseline + tax stability. Judicial footprints function as boundary constraints, while tax footprints provide stability signals through intertemporal continuity rather than single-period levels. Multi-source fusion operates as institutional cross-checking, where signals from different subsystems jointly reduce uncertainty and suppress single-source blind spots [11,15]. We also apply explainable learning tools to ensure that performance gains are mechanism-consistent rather than driven by unstable proxies [14].
The remainder of this paper is organized as follows:
Section 2 reviews the relevant literature and develops the research hypotheses.
Section 3 describes the data sources, feature engineering pipeline, and the Default Event Isolation protocol.
Section 4 presents the empirical results, including model comparison, ablation analysis, explainability assessment, and robustness checks.
Section 5 discusses the findings and their implications.
3. Methodology
3.1. Research Design and Workflow
In this study, we develop an ex ante early-warning model for agricultural enterprise credit risk using multi-source institutional digital footprints. Agricultural firms interact with various institutional systems, such as tax authorities, courts, social security agencies, and banking institutions, and in doing so generate high-frequency, traceable, and relatively tamper-resistant digital records. We conceptualize these records as risk signals emitted by distinct societal subsystems. Specifically, judicial information functions as a boundary constraint that reflects an enterprise’s enforcement-limited viability (judicial baseline), while tax information primarily captures the steady-state continuity of operations (tax stability). The key advantage of multi-source fusion lies in establishing a logical cross-verification network, where mutually reinforcing “hard” signals reduce structural blind spots and mitigate noise-driven misclassification. As illustrated in Figure 1, we propose a system architecture that translates the theoretical framework into an end-to-end computational pipeline comprising five interconnected modules. The process begins with data ingestion, which aggregates heterogeneous records from judicial, tax, and financial subsystems, followed by governance to ensure entity alignment and strict temporal validity. Central to the system is dual-core engineering, which constructs the “judicial baseline” (Core I) and “tax stability” (Core II) feature sets while enforcing temporal guardrails. These features inform the modeling phase, where the LightGBM classifier is trained with interpretability constraints, culminating in deployment, which establishes a closed-loop feedback mechanism for continuous monitoring.
3.2. Data Sources, Sample Selection, and Temporal Window Design
The dataset used in this study was obtained through a research collaboration with a professional enterprise credit information agency. The agency aggregates multi-source enterprise records as part of its compliant credit information services and provided the research team with a de-identified dataset for academic use. All records were anonymized prior to delivery; the research team did not access any direct identifiers or raw administrative documents. The shared dataset contains no personally identifiable information and was used solely for academic research under a data-sharing agreement.
Our analysis draws on administrative and operational information on agricultural enterprises in China, covering January 2021 to April 2024. The collaborating agency performed cross-source linkage internally using its standard enterprise identifiers and released an analysis-ready firm–month panel containing feature-level variables derived from multiple subsystems. These subsystems include business registration and filing information, provider-curated financial and credit history indicators, tax-related compliance and penalty indicators, social insurance contribution indicators, judicial case and enforcement indicators, as well as innovation-related indicators such as patent registrations. After receipt, we conducted additional data cleaning and quality control, including timestamp harmonization, definition alignment across subsystems, duplicate removal, and cross-table consistency checks, to ensure that each firm–month corresponds to a single structured observation.
To ensure a valid ex ante prediction setting and avoid label truncation, we adopt an explicit observation-period/label-period design. Firm–month observations from January 2021 to April 2023 are used to generate predictors, while subsequent months up to April 2024 are reserved to define a fixed-length forward-looking outcome window. This design ensures that for any observation month t, the inputs are constructed strictly from information available at or before t, whereas the labels depend only on outcomes occurring after t.
Our sample selection follows a “business usability + continuity” principle. We exclude firms with only registration information but no operational traces to avoid conflating institutional silence (e.g., truly zero tax filing or no employment in a month) with random missingness. We require firms to exhibit sustained institutional interactions during the observation period (e.g., repeated tax filings, social insurance payments, or other administrative activities) consistent with agricultural operational continuity and seasonality, and to have at least one financial statement record to support baseline comparisons with traditional hard information. The final sample contains 1021 agricultural enterprises and 28,403 firm–month observations in an unbalanced panel.
3.3. Outcome Definition: Comprehensive Institutional Distress Events for Credit Early Warning
Credit impairment in agricultural enterprises may manifest first through institutional frictions before being formally recorded as a bank non-performing loan (NPL). We acknowledge that persistent tax non-compliance and judicial enforcement are administrative/legal events that are conceptually distinct from bank-defined credit defaults. Nevertheless, in information-opaque agricultural markets, such institutional signals can provide timely indications of risk escalation and therefore carry early-warning value.
Accordingly, to better reflect an ex ante risk escalation process, our primary prediction target is a comprehensive institutional distress event rather than a narrow bank-defined default label. This composite event is intended as an early-warning proxy for the deterioration of credit conditions, and it is not interpreted as equivalent to bank-recorded default. The event month is defined as the earliest occurrence among the three distress categories, consistent with an “early interruption” logic of risk escalation. Formally, for firm i at month t, the label Y equals 1 if a comprehensive institutional distress event occurs within the forward window of 12 months (t + 1 to t + 12) and 0 otherwise.
The event is triggered by the earliest occurrence of any of the following:
(1) Credit default event (bank-recorded NPL). We use non-performing loan (NPL) records and define the event month as the first month in which a firm is recorded with an NPL status.
(2) Tax distress event (administrative non-compliance). The firm exhibits persistent tax arrears (continuous arrears for 2 months).
(3) Judicial distress event (hard-constraint enforcement). This event captures the tightening of “hard constraints.” To distinguish substantial solvency crises from ordinary commercial disputes, a risk event is triggered by enforcement status (i.e., being listed as a judgment debtor or subject to execution orders) or by high-intensity litigation (i.e., defendant status with a case amount exceeding the sample’s 75th percentile). This definition aligns with the theoretical premise that judicial enforcement constitutes a binding boundary on firm viability.
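The label logic described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: months are assumed to be encoded as consecutive integers, and the helper names `first_event_month` and `forward_label` are hypothetical.

```python
from typing import Optional

HORIZON = 12  # forward window length in months, as stated in the text


def first_event_month(npl: Optional[int], tax: Optional[int],
                      judicial: Optional[int]) -> Optional[int]:
    """Earliest month index among the three distress categories.

    Each argument is the first month in which that category is triggered,
    or None if it never occurs. Integer month indexing is an assumption.
    """
    months = [m for m in (npl, tax, judicial) if m is not None]
    return min(months) if months else None


def forward_label(t: int, event_month: Optional[int],
                  horizon: int = HORIZON) -> int:
    """Return 1 if the first event falls in months t+1 .. t+horizon, else 0."""
    if event_month is None:
        return 0
    return int(t + 1 <= event_month <= t + horizon)


# Hypothetical firm: tax distress first seen in month 20, judicial in month 17.
tau = first_event_month(npl=None, tax=20, judicial=17)
labels = {t: forward_label(t, tau) for t in (3, 5, 16)}
```

In this toy case the composite event month is 17 (the earlier of the two triggers), so an observation at month 3 is labeled 0 because month 17 lies beyond its 12-month window, while observations at months 5 and 16 are labeled 1.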
We emphasize that the three triggering categories above are economically heterogeneous and are not treated as interchangeable “defaults.” In this study, we use the composite comprehensive institutional distress event as an early-warning proxy capturing an ex ante escalation process, where institutional frictions may precede bank-recognized NPL outcomes. To ensure that our findings are not an artifact of bundling heterogeneous events, we further evaluate the same optimized feature set under an alternative label specified in Section 4.5.4, namely, a strict credit-only label based solely on bank NPL records (AUC = 0.9089). The maintained performance under the credit-only label supports that judicial and tax footprints provide forward-looking early-warning information rather than merely redefining bank default.
3.4. Digital Footprint Features: Modular Construction and Denoising/Smoothing
We construct a modular digital footprint feature system from multi-source institutional records and map features to the proposed mechanisms to support ablation tests and interpretation. The feature set contains approximately 138 variables organized into four core signal modules plus a baseline module.
Judicial constraint module. This module reflects enforcement pressure and institutional boundary constraints. Beyond basic litigation, enforcement counts (or amounts) and status indicators, we emphasize the case closure rate to proxy the firm’s ability to resolve disputes and comply with judicial outcomes, which operationalizes the judicial baseline concept.
Tax and operational stability module. This module integrates tax payment/arrears/penalties with operational traces such as social insurance and contract filings. Because agricultural operations contain strong seasonal forcing, we focus on intertemporal continuity and stability signals rather than single-month levels. Rolling-window features (e.g., 3/6/12/24-month means, volatility measures, trend slopes, and consecutive compliance lengths) are used to extract the operational steady state and to suppress overfitting induced by seasonal noise.
Financing constraint module. Based on bank credit records, this module includes the loan frequency, maturity structure, credit availability changes, and the time since the most recent loan, reflecting a dynamic tightening of financing conditions as the financial subsystem updates its risk perceptions.
Cross-feature module. This module contains engineered cross-module consistency features designed to capture agreement or mismatch across institutional subsystems. Examples include the “contract–tax growth rate gap” and the “social security–contract matching degree” to summarize whether contract-related activity, tax-related activity, and employment-related traces evolve coherently over comparable horizons. The “contract–tax growth rate gap” measures the divergence between contract-side dynamics and tax-side dynamics, serving as a plausibility-oriented consistency indicator (rather than a direct validation of any latent “true revenue”). These features are designed to capture the synergistic triggering effect where simultaneous anomalies across distinct subsystems signal a nonlinear escalation in risk.
Baseline module. We also retain firm fundamentals and financial ratios as a traditional hard information baseline, including the firm size, age, sector classification, leverage, and solvency indicators, enabling a direct comparison between conventional hard information and digital footprints.
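To make the continuity-oriented features of the tax and operational stability module concrete, the sketch below computes a trailing-window mean, volatility, trend slope, and consecutive-compliance length for one firm. It is an illustration under stated assumptions: the function names, the use of population standard deviation as the volatility measure, and the toy series are ours, not the paper's exact definitions.

```python
from statistics import mean, pstdev
from typing import Dict, List


def rolling_window(values: List[float], end: int, width: int) -> List[float]:
    """Trailing window ending at index `end` (inclusive); never looks ahead."""
    start = max(0, end - width + 1)
    return values[start:end + 1]


def slope(window: List[float]) -> float:
    """Ordinary least-squares slope of the window against time 0..n-1."""
    n = len(window)
    if n < 2:
        return 0.0
    x_bar = (n - 1) / 2
    y_bar = mean(window)
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(window))
    den = sum((x - x_bar) ** 2 for x in range(n))
    return num / den


def compliance_streak(on_time: List[bool], end: int) -> int:
    """Number of consecutive compliant months ending at index `end`."""
    streak = 0
    for flag in reversed(on_time[:end + 1]):
        if not flag:
            break
        streak += 1
    return streak


def stability_features(tax_paid: List[float], on_time: List[bool],
                       end: int, width: int = 12) -> Dict[str, float]:
    """Stability summary for month `end`, built only from months <= `end`."""
    w = rolling_window(tax_paid, end, width)
    return {
        "mean": mean(w),
        "volatility": pstdev(w),  # population std. dev.; an assumed choice
        "slope": slope(w),
        "streak": compliance_streak(on_time, end),
    }


# Toy series for one hypothetical firm (four months, 3-month window).
feats = stability_features([10, 12, 14, 16], [True, False, True, True],
                           end=3, width=3)
```

Because every helper slices only up to `end`, the features for a given month cannot absorb later observations, which matches the temporal guardrails described for feature construction.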
3.5. Identification Strategy: Default Event Isolation Under Temporal Constraints
Early-warning credit modeling is inherently directional in time. If future information enters either the training–testing split or the feature construction process, predictive performance can be systematically inflated and will not generalize to real deployment. We therefore treat Default Event Isolation as a central identification strategy and apply two complementary constraints to mitigate look-ahead bias and information leakage.
Temporal Isolation. We use an out-of-time split that respects chronological order. The model is trained on historical data (January 2021–August 2022) and evaluated on future data (September 2022–April 2023), so that the model learns only from the past to predict future event windows.
Pre-event Truncation. For any firm that experiences a comprehensive institutional distress event, we retain only observations strictly before the first event month and remove the event month and post-event months. This truncation blocks post-event information from entering predictors and guarantees that the model uses only pre-event signals.
Together, these two constraints ensure that reported test performance reflects a genuinely forward-looking ability to identify events in temporally later data and is aligned with operational early-warning requirements.
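The two constraints can be illustrated with a small sketch. It assumes integer month indices and a hypothetical row layout of `(firm_id, month)` with features omitted; the function names are ours and the toy cutoff does not correspond to the paper's actual dates.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, int]  # (firm_id, month_index); feature columns omitted


def pre_event_truncate(rows: List[Row],
                       first_event: Dict[str, int]) -> List[Row]:
    """Keep only observations strictly before each firm's first event month."""
    kept = []
    for firm, month in rows:
        event = first_event.get(firm)  # None: the firm never has an event
        if event is None or month < event:
            kept.append((firm, month))
    return kept


def out_of_time_split(rows: List[Row],
                      last_train_month: int) -> Tuple[List[Row], List[Row]]:
    """Chronological split: train on months <= cutoff, test on later months."""
    train = [r for r in rows if r[1] <= last_train_month]
    test = [r for r in rows if r[1] > last_train_month]
    return train, test


# Toy panel: three hypothetical firms observed in months 1..6.
panel = [(firm, m) for firm in "ABC" for m in range(1, 7)]
events = {"A": 3, "B": 6}  # firm C never experiences an event

clean = pre_event_truncate(panel, events)
train, test = out_of_time_split(clean, last_train_month=4)
```

In this toy panel, firm A's event falls inside the training period, so truncation leaves it with no test-period rows, while firm B (event in month 6) is still evaluated on its pre-event observation in month 5.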
As illustrated in Figure 2, Panel A shows the temporal split between the training set and the test set, ensuring strict out-of-time validation. Panel B demonstrates the pre-event truncation protocol for defaulting firms: observations in and after the event month are discarded to prevent post-event information from contaminating predictors. Panel C depicts the forward-looking prediction window structure, where features are constructed from historical data (t − 12 to t) and labels are defined over a 12-month future horizon (t + 1 to t + 12). The anti-leakage guarantees box summarizes four safeguards: temporal isolation, entity isolation, pre-event truncation, and forward-looking label construction.
We note that this is an observation-level design: we truncate the event month and all post-event months (pre-event truncation). When combined with a temporal training–test split, this truncation can implicitly lead to no test-period observations for firms whose event occurs in the training period (an “entity isolation” effect), which may simplify the evaluated test population; we discuss the resulting trade-off in Section 5.
3.6. Model Specification and Training
We employ LightGBM (Light Gradient Boosting Machine) as the primary classifier. Although deep learning architectures (e.g., RNNs and Transformers) offer strong representation and sequential modeling power, we deliberately adopt a gradient-boosted decision tree (GBDT) framework for three reasons. First, the predictors in this study are predominantly structured, tabular, and heterogeneous institutional footprint features (including stability- and consistency-oriented engineered variables), a setting in which GBDTs are widely regarded as strong baselines and often competitive with or superior to deep neural networks on mid-sized tabular datasets [27]. Second, the prediction task involves an extreme class imbalance (an event rate of approximately 0.94%) and pronounced structural noise driven by seasonality; in such settings, higher-capacity deep models may be more prone to overfitting and unstable calibration without substantially larger sample sizes and careful regularization. Third, regulated credit decision contexts require auditable and interpretable decision logic. LightGBM, combined with SHAP-based attribution, provides a favorable accuracy–interpretability trade-off. Logistic regression is retained as a linear benchmark to quantify the incremental value of nonlinearity.
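To give a sense of what the stated imbalance implies, the sketch below derives a positive-class weight from the reported event rate and places it in a LightGBM-style parameter dictionary. The paper does not report its hyperparameters or whether class weighting was used, so every value here other than the 0.94% event rate is an illustrative assumption; the keys are standard LightGBM parameter names, and `imbalance_params` is a hypothetical helper.

```python
def imbalance_params(event_rate: float) -> dict:
    """Illustrative LightGBM-style parameters; values are NOT from the paper."""
    if not 0.0 < event_rate < 1.0:
        raise ValueError("event_rate must be strictly between 0 and 1")
    return {
        "objective": "binary",
        # Ratio of negatives to positives, a common reweighting heuristic.
        "scale_pos_weight": (1.0 - event_rate) / event_rate,
        # Conservative capacity and regularization (assumed values).
        "num_leaves": 15,
        "min_child_samples": 50,
        "learning_rate": 0.05,
        "lambda_l2": 1.0,
    }


# Event rate of approximately 0.94%, as reported in the text.
params = imbalance_params(0.0094)
```

At an event rate of 0.94% there are roughly 105 non-events per event, which is why threshold-free ranking metrics and careful regularization matter more here than raw accuracy.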
3.7. Evaluation and Explainability
Because accuracy can be misleading under an extreme class imbalance, we report AUC-ROC (Area Under the Receiver Operating Characteristic Curve) to evaluate overall ranking ability and PR-AUC (Area Under the Precision–Recall Curve) to capture precision–recall trade-offs in rare-event detection. To align the model assessment with credit operations under limited screening capacity, we further report business-oriented metrics such as Recall@TopK% (e.g., recall among the top 10% highest-risk firms), a metric that reflects the model’s ability to capture the high-risk tail. All metrics are computed on the independent test set after Default Event Isolation, ensuring that the evaluation reflects genuine ex ante predictability. For explainability, we apply SHAP to attribute LightGBM predictions to individual variables, examining both global importance and local explanations to validate the proposed judicial baseline and tax stability hypotheses. Furthermore, for the cross-module consistency feature set, we specifically inspect interaction patterns to document nonlinear risk escalation when multiple subsystems exhibit coherent anomalies (aligned abnormal signals), thereby providing an interpretable evidence chain for the proposed logical cross-verification mechanism.
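The ranking metrics named above can be sketched without external libraries. This is a minimal illustration, assuming scores are ranked at the observation level and using a pairwise definition of AUC-ROC that is quadratic in sample size; the function names and toy data are ours, and PR-AUC is not shown.

```python
from typing import Sequence


def recall_at_top_k(y_true: Sequence[int], scores: Sequence[float],
                    k_frac: float) -> float:
    """Share of all events captured in the top k_frac highest-scored rows."""
    n_top = max(1, int(round(len(scores) * k_frac)))
    ranked = sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)
    hits = sum(y for _, y in ranked[:n_top])
    total = sum(y_true)
    return hits / total if total else float("nan")


def auc_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random event outranks a random non-event.

    Ties count as 0.5. The pairwise loop is O(n_pos * n_neg), which is
    fine for a sketch but not for large panels.
    """
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy test set: 10 observations, 2 events.
y = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
s = [0.10, 0.20, 0.90, 0.30, 0.05, 0.40, 0.35, 0.15, 0.25, 0.12]

r10 = recall_at_top_k(y, s, 0.10)
r30 = recall_at_top_k(y, s, 0.30)
auc = auc_roc(y, s)
```

On this toy data the single highest-scored observation is an event, so Recall@Top10% is 0.5; widening the screen to the top 30% captures both events. The AUC is 15/16 because one non-event outranks the lower-scored event.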
5. Discussion
5.1. Theoretical Contributions
These findings support all three research hypotheses proposed in this study, establishing that multi-source digital footprints can construct a “logical cross-validation network” to overcome structural information deficiencies.
H1 (Judicial Constraint Hypothesis). The legal module’s 38.32% importance contribution and the significant AUC recovery from M3 to M4 confirm that judicial data provides “institutional hard constraint” signals that are difficult to manipulate and highly predictive of solvency boundaries. The case closure rate alone accounts for 32.41% of the predictive power, demonstrating that the ability to resolve disputes, rather than merely the presence of lawsuits, distinguishes high-risk from low-risk firms.
H2 (Tax Stability Hypothesis). The comparison between M5 (smoothed features, AUC = 0.9345) and M3 (raw features, AUC = 0.4307) provides strong evidence that temporal aggregation is essential. The ablation study further confirms this result: raw features yield AUC = 0.7906 with a training–test gap of 0.209, while smoothed features achieve AUC = 0.9198 with a gap of only 0.054. The predictive value of tax data lies in capturing operational continuity through temporal aggregation rather than short-term cash flow pulses.
H3 (Cross-Validation Synergy Hypothesis). We find evidence for non-linear synergies across data sources through three channels. First, tree-based models yield a +22.6% AUC improvement over linear models (p < 0.001, DeLong test), indicating that cross-feature interactions are meaningful for prediction. Second, the cross-feature module, which explicitly captures cross-source consistency signals such as the “contract–tax growth gap”, contributes to the overall predictive framework, with these engineered consistency features appearing among the 50 selected variables in M5. Third, the incremental model comparison shows super-additive gains: adding judicial data to an already-comprehensive feature set (M3 → M4) yields a +26.86pp AUC recovery, suggesting that different data sources provide complementary rather than redundant information. While we cannot precisely quantify the contribution of cross-module interactions versus individual features, the overall pattern supports the hypothesis that multi-source fusion yields gains beyond simple aggregation.
Our findings complement three adjacent literature streams. First, relative to consumer-oriented alternative-data studies, the predictive value here stems from governance-generated institutional traces rather than platform behavioral footprints, which is consistent with the view that institutional records can act as “hard” information channels when conventional disclosures are sparse. Second, the dominance of judicial-resolution-related signals is consistent with law-and-finance evidence that enforcement capacity shapes financing feasibility; we operationalize this mechanism at the firm level and show its relevance for early warning. Third, the sharp contrast between raw and smoothed high-frequency signals aligns with recent cautions on temporal dependence and distribution shift in panel-based ML: deployment-oriented credit modeling requires leakage-aware evaluation and continuity-oriented feature design rather than relying on random splits or raw monthly magnitudes.
5.2. Practical Recommendations
Based on our empirical findings, we recommend a targeted deployment strategy for agricultural credit risk early-warning systems. Financial institutions should implement priority monitoring by focusing enhanced due diligence on the top 10% risk-scored firms, which allows for the interception of nearly 90% of potential defaults with minimal false positives. To ensure system longevity, quarterly retraining is essential to incorporate new observations and adapt to potential regime changes, as evidenced by the performance dip observed during economic transitions. Regarding data infrastructure, the feature pipeline must enforce 12-month rolling aggregation for tax and operational features, because raw monthly data should not be used directly. Finally, the integration of judicial data should be prioritized: although judicial events are relatively rare, their predictive value for early warning far exceeds their prevalence.
The results also imply actionable risk management levers for agricultural companies themselves. (i) Compliance continuity: maintaining stable and timely tax filing and payment patterns reduces distress signals, which are driven by irregularity rather than one-off levels. (ii) Dispute resolution capacity: improving contract governance and dispute-resolution practices (e.g., timely settlement and execution compliance) can directly reduce the judicial hard-constraint exposure reflected in enforcement-related indicators. (iii) Traceability and documentation: strengthening auditable administrative traces (e.g., consistent social-insurance contributions, standardized filings, and procurement documentation) can improve verifiability and lower perceived information opacity. (iv) Early engagement: when institutional frictions emerge (tax arrears or enforcement signals), proactive engagement with lenders and regulators may prevent escalation into bank-recognized NPL outcomes.
5.3. Limitations
Several limitations of this study warrant acknowledgment and suggest directions for future research as follows:
1. Geographic scope. The current sample is restricted to a highly digitized city within the Yangtze River Delta. As such, the external validity of our framework in less developed agricultural contexts remains to be confirmed. Subsequent studies should examine the model’s robustness across heterogeneous regions to assess its adaptability to varying levels of institutional development.
2. Temporal coverage. The observation period (January 2021–April 2023) spans 28 months and overlaps substantially with the COVID-19 pandemic, which created unusual economic conditions. The performance dip observed in Fold 4 of our cross-validation suggests sensitivity to macroeconomic regime changes. Longer observation windows and explicit regime-switching models may improve robustness.
3. Predictive vs. causal claims. This study establishes predictive associations, not causal mechanisms. While the temporal precedence of judicial events relative to credit defaults is suggestive of a causal pathway, we cannot rule out common confounders driving both judicial involvement and subsequent credit distress. Causal identification would require exogenous variation in judicial exposure, which is beyond the scope of this observational study.
4. Default Event Isolation and deployment. Our leakage-aware evaluation relies on Default Event Isolation (DEI), implemented as pre-event truncation at the observation level. For each firm, we remove the event month and all subsequent post-event months to prevent post-event information from contaminating predictors. While necessary for strict ex ante validity in high-frequency administrative panels, this design implies a trade-off for deployment realism. In particular, when pre-event truncation is combined with the temporal split, firms whose event occurs during the training period will naturally have no remaining observations in the test period—an implicit “entity isolation” effect, which may yield a cleaner evaluated test distribution than some continuous monitoring practices. At the same time, firms that trigger events in the test period are still evaluated based on their pre-event observations, consistent with the intended use of an early-warning model to detect newly emerging risk states. Future work should explicitly quantify sensitivity to alternative operational policies (e.g., retaining firms with prior event histories under explicit masking rules, different truncation choices, or clearly defined re-entry criteria) to better bridge evaluation protocols and real-world monitoring workflows.
5. Composite label heterogeneity. The three event types in our composite label (bank NPL, tax distress, and judicial distress) differ in economic mechanism and severity. Although we provide a first-triggering decomposition in Section 4.1.3 and a credit-only robustness check in Section 4.5.4, we do not estimate event type-specific prediction models due to sample size constraints. Future research with larger samples could train type-specific models and compare whether dominant predictors and their economic interpretation differ across NPL, tax, and judicial outcomes, thereby further disentangling heterogeneous pathways through which institutional distress materializes.
5.4. Future Research Questions
Building on the current findings, several focused research questions merit investigation: (1) Generalization—how stable are the learned mechanisms across regions with different levels of institutional digitization and enforcement intensity? (2) Event-type modeling—do judicial, tax, and bank-NPL outcomes have systematically different dominant predictors when modeled separately, and can multi-task designs improve type-specific early-warning? (3) Drift-aware deployment—how should early-warning systems be updated under regime shifts (e.g., macro shocks), and can online learning or explicit drift detection improve robustness? (4) Decision analytics—how should thresholds be calibrated to operational costs (false positives vs. missed events), and what governance constraints (auditability, fairness, and privacy) shape deployable risk systems in rural finance?