1. Introduction
As a complex system integrating IT, operations, and culture, internal control in commercial banks plays a pivotal role in determining performance and risk governance [1,2]. The increasing complexity of the internal control system is mainly attributable to two factors: the continuous expansion of fintech, which has broadened the boundaries and connotations of internal control [3,4], and the increasingly strict regulatory constraints of the post-crisis era, especially the significantly enhanced compliance requirements under China’s “dual-pillar” framework [5,6]. To address this situation, the construction of a bank’s internal control system needs to satisfy dual standards: at the compliance level, it must deeply integrate domestic regulations and international practices; at the business level, it must balance efficient operation against the constraints of core risk indicators, such as the capital adequacy ratio. Nevertheless, the lack of transparency in internal control mechanisms leaves investors and regulators dependent on restricted disclosure channels, such as assessment reports, to obtain a direct picture of the system. Such reports often exhibit delayed disclosure, incomplete coverage of information, inconsistent statements, templated language, and a lack of unified evaluation standards [7]. A representative illustration can be found in the recent Internal Control Evaluation Report of a leading state-owned commercial bank. Its key conclusions are primarily delivered through binary selection formats, for instance, checking “No” for “major deficiencies in financial reporting” and “Effective” for the overall conclusion. The narrative then proceeds to a broad, standardized assertion that the bank has maintained effective controls “in all material respects,” while specific operational defects are often generalized as “rectified general deficiencies” without elaborating on the underlying risks. Consequently, critical details about control processes and risk response mechanisms are often omitted.
With the development and widespread application of text analytics and artificial intelligence technologies [8,9,10], new mitigation paths have emerged for the aforementioned predicaments. Corporate annual reports and Environmental, Social, and Governance (ESG) reports provide vast, heterogeneous, and more specific information sources with higher information entropy and more dispersed semantics. The narrative sections in annual reports (such as corporate governance, comprehensive risk management and compliance, business description, and major events and regulatory penalty rectification) present rich insights into internal control elements in a multi-dimensional and decentralized manner, forming a more complete evidence chain of “how internal control is implemented and improved in business processes”. ESG reports further strengthen compliance culture and risk awareness at the governance level, providing additional semantics for identifying the operational status of internal controls. An operational solution is therefore to integrate multiple technologies to process publicly disclosed bank annual reports and ESG reports, convert large-scale, complex, unstructured content into computable, structured indicators, and integrate these indicators into Business Intelligence (BI) decision support systems.
Nevertheless, significant challenges persist in deriving quantitative internal control metrics from unstructured text and embedding them into decision-support frameworks. A major hurdle is the “coarse-grained” nature of current metric construction. Since internal controls rely on multiple coupled elements, a single aggregated metric rarely captures the system’s full complexity. The text data itself presents further difficulties. Banks often use templated “defensive disclosures” that obscure specific risks in vague language, so traditional keyword matching captures noise rather than substance. High data dimensionality exacerbates this problem, making the modeling process even more challenging. Beyond these technical issues, there is a functional disconnect between data and decision-making. Most text mining stops at generating indicators without linking them to decision systems. This leaves regulators and managers unable to readily identify sources of risk or translate insights into action.
Our study focuses on three core research questions:
(RQ1) How can unstructured text in bank reports be turned into a multidimensional quantitative framework that captures the layered structure of internal controls?
(RQ2) Does a model that includes text-mined internal-control variables predict outcomes significantly better than models that use other internal-control variables?
(RQ3) How can text-driven indicators be operationalized to support model interpretation and risk prioritization in the banking sector?
To address these issues and make the results usable for BI decision support, we developed a workflow that links text, indicators, models, and dashboards. We leverage high-performance embedding models, such as the BAAI General Embedding (BGE) models from the Beijing Academy of Artificial Intelligence, along with a dual regulatory-semantic knowledge base to map disclosure texts to vector spaces. By computing hybrid probabilities against regulatory prototypes, we filter out noise and convert raw text into a rigorous, five-element internal control quality indicator system (IC-5Q). Validating this structure requires a two-step approach: we first use Partial Least Squares Structural Equation Modeling (PLS-SEM) to test construct validity, and then pair the indicators with Extreme Gradient Boosting (XGBoost) to evaluate their out-of-sample predictive performance for asset quality risk. To ensure practical utility, we embed SHapley Additive exPlanations (SHAP)-based explainable AI into BI dashboards, thereby creating an Intelligent Internal Control Decision Support System (IIC-DSS). This system visualizes the marginal contribution of each element, providing managers with intuitive risk assessments that directly support governance decisions.
The remainder of this paper is organized as follows.
Section 2 conducts a critical review of the literature on internal control measurement, text mining in internal control, and the technological integration of business intelligence, machine learning, and explainable artificial intelligence.
Section 3 explains how to extract the internal control indicator system from unstructured disclosure texts and presents an empirical validation framework for the index system: using PLS-SEM to test the construct validity, evaluating the out-of-sample prediction performance of asset quality risks based on models such as XGBoost, and introducing SHAP to provide traceable explanations at the internal control component level.
Section 4 describes the dataset construction process, reports verified empirical results, and presents the application scenario via a business intelligence dashboard.
Section 5 summarizes the research conclusions and proposes directions for future research.
3. Methodology
This section follows a single thread, moving from unstructured disclosures to interpretable, verifiable, and actionable internal control measurement and risk governance, and proposes an end-to-end methodological framework for implementing the IIC-DSS (Figure 1). Additionally, a roadmap designed for non-technical readers is provided in Figure S1 in the Supplementary Materials. In plain terms, the workflow proceeds in four stages: (i) Quantification, that is, turning disclosure text into structured indicators; (ii) Validation, that is, verifying that the indicators jointly define COSO-based internal control constructs and form the ICI; (iii) Prediction, that is, using ICI to forecast asset quality risk; and (iv) Diagnosis, that is, using explainable AI to identify which control elements drive each risk signal. To clarify the logical connection between the research objectives and the technical implementation, Table 1 maps each research question to its corresponding methodological section and the key technical approaches employed.
3.1. Developing a “Regulation–Semantics Dual-Driven” Internal Control Indicator System
To address RQ1, the following section proposes the “dual-driven regulatory-semantic” internal control indicator system, which forms the formative index IC-5Q and the composite index ICI. The indicator system is constructed using a bottom-up strategy, initially based on third-level indicators. Using knowledge-enhanced corpus modelling and detailed element mapping, the indices are ultimately aggregated layer by layer. Our disclosure corpus includes both annual reports and ESG reports. ESG reports are incorporated because their governance narratives often disclose internal-control arrangements, such as compliance culture, audit mechanisms, and risk governance. Additionally, ESG reports include numeric governance KPIs that can serve as “hard evidence” during feature construction.
3.1.1. Building a “Regulation–Semantics Dual-Driven” Knowledge Base
The “Regulatory-Semantic Dual-Driven” knowledge base aims to accumulate a rich regulatory corpus while clarifying “what constitutes knowledge and under what conditions it is permitted to enter the repository.” To achieve this, the evaluation of textual information is grounded in two complementary knowledge bases. It starts from institutionally defined conceptual boundaries: internal control frameworks and banking regulatory requirements issued by the Committee of Sponsoring Organizations of the Treadway Commission (COSO), the Basel Committee on Banking Supervision (BCBS), and the China Banking and Insurance Regulatory Commission (CBIRC) are used as references (the document titles and sources are listed in the Supplementary Materials), and these references delineate the core meaning and scope of the five components of internal control. In parallel, semantic “prototypes” are constructed from sentence embeddings to capture semantic equivalence in banking disclosures under synonym substitution, syntactic rewriting, and shifts in writing style, enabling a stable identification of differences in expression.
The regulatory-driven component requires translating institutional provisions into sentence-level evidence and forming actionable rules. First, we construct an internal control element mapping pattern library to serve as an anchor for coarse annotation of regulatory corpora and rule generation. The tag space is limited to the five COSO elements, and Chinese and English expressions and abbreviations are standardized. Next, seed term clusters for each component (such as governance structure, stress testing, segregation of duties, risk reporting, and internal audit rectification) are extracted from COSO/regulatory texts and domain glossaries. Multiple expressions within the same semantic cluster are merged into regular-expression patterns. Pilot runs using regulatory sentences as samples record hits, conflicts, and omissions; overly broad patterns are narrowed, while high-frequency omissions are supplemented. After stabilization, the patterns are consolidated into a version-maintainable mapping table. Once regulatory documents are parsed into sentences, a “regulatory sentence-element” correspondence can be generated. In the final phase of constructing the rule dictionary $K_{\text{lex}}$, we introduce embedding space validation: a candidate rule is formally added to the repository and archived under the five elements only when its cosine similarity with its source regulatory sentence in the embedding space exceeds the threshold τ.
The semantic-driven component incorporates prototype theory to mitigate discrepancies between standard terminology and banking disclosure expressions. Semantic prototype vectors are constructed for each of the five elements. Regulatory sentences are first mapped to elements $E \in \{\text{CE}, \text{RA}, \text{CA}, \text{IC}, \text{MA}\}$, and the seed set $S_E^{\text{seed}}$ is then filtered. Filtering criteria include rule-matching strength and explicit action-verb characteristics (e.g., “establish, implement, monitor, rectify, audit”). The Chinese sentence-embedding model fine-tuned for finance (BGE) computes the centroid vector $v_E^{\text{proto}}$ for the seed set (Equation (1)), which is incorporated into the knowledge base alongside $K_{\text{lex}}$.
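The two embedding operations described in this subsection (the cosine gate that admits a candidate rule, and the centroid of a seed set) can be illustrated with a minimal Python sketch. Plain lists stand in for BGE embeddings; the function names, the toy vectors, and the value `tau=0.8` are hypothetical, since the threshold value is not reported here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def admit_rule(rule_vec, source_vec, tau=0.8):
    """Admit a candidate rule only if its similarity to the source
    regulatory sentence exceeds tau (tau=0.8 is an assumed value)."""
    return cosine(rule_vec, source_vec) > tau


def prototype_centroid(seed_vecs):
    """Element-wise mean of the seed-set embeddings for one component."""
    n = len(seed_vecs)
    dim = len(seed_vecs[0])
    return [sum(vec[k] for vec in seed_vecs) / n for k in range(dim)]


# Toy 2-D "embeddings" standing in for real sentence vectors.
seed_set = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
proto = prototype_centroid(seed_set)
```

In a real pipeline the vectors would come from the embedding model, and the centroid would be L2-normalized before similarities are computed.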
3.1.2. Sentence–Component Mapping via Hybrid Probabilistic Constraints
To handle the interwoven nature of disclosure texts, we use a knowledge-based ‘neural-symbolic’ strategy. Appendix B illustrates how we process raw text into probability scores using a specific example. Each sentence $i$ is represented as a membership-probability vector over the five internal control components, allowing a single sentence to load on multiple components simultaneously.
The method combines two probability measures to handle both implicit context and explicit rules. For the embedding-based semantic probability $P_{i,E}^{\text{embed}}$, we L2-normalize the sentence embedding $e_i$ and the component prototype centroids $C_E$, compute dot-product similarities $s_{i,E}$, and then apply a numerically stable row-wise Softmax (shifted by the row maximum) to obtain a valid distribution:

$$P_{i,E}^{\text{embed}} = \frac{\exp\left(s_{i,E} - \max_{E'} s_{i,E'}\right)}{\sum_{E''} \exp\left(s_{i,E''} - \max_{E'} s_{i,E'}\right)} \tag{2}$$
To capture a more direct regulatory consistency signal, we derive the dictionary-based rule probability $P_{i,E}^{\text{lex}}$ by grouping $K_{\text{lex}}$ into weighted sub-items for each component. Hit weights are accumulated into $\text{Score}_{i,E}$, where we apply a saturation map defined as $1 - 1/(1 + \text{Score}_{i,E})$ to keep values within the [0,1) range before row normalization. The process also validates regular expressions and automatically falls back to fixed-string matching. We combine the two distributions into a mixed-membership probability $P_{i,E}^{\text{mix}}$, using the mixing weight α, as shown in Equation (3). The value of α is tuned by grid search over (0,0.6] for each year. In the normalized embedding space, we measure sentence dissimilarity with cosine distance d = 1 − cos(⋅). We select the fusion weight α as the value that yields the highest silhouette score for the clusters. To validate this method, we compared it against fixed α values ranging from 0.0 to 0.8. As shown in Appendix C, the model is robust. Specifically, when α is between 0.2 and 0.6, the rankings remain highly correlated, and the top-tier classifications stay consistent. However, performance drops at 0.8. This decline confirms that we should cap the weight at 0.6 to prevent semantic patterns from overpowering clear regulatory signals.
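The hybrid membership computation can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the convex form of the mixture, with α weighting the semantic part, is inferred from the remark that a large α lets semantic patterns overpower regulatory signals; the uniform fallback when no rule fires, the function names, and the toy patterns are hypothetical.

```python
import math
import re


def safe_compile(pattern):
    """Compile a regex; fall back to fixed-string matching if it is invalid."""
    try:
        return re.compile(pattern)
    except re.error:
        return re.compile(re.escape(pattern))


def embed_probs(sentence_vec, prototypes):
    """Softmax over dot products of L2-normalized vectors, shifted by the max."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    e = unit(sentence_vec)
    sims = [sum(a * b for a, b in zip(e, unit(c))) for c in prototypes]
    top = max(sims)
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    return [x / total for x in exps]


def lex_probs(sentence, weighted_patterns):
    """Accumulate hit weights per component, saturate, then row-normalize."""
    scores = [
        sum(w for pat, w in rules if safe_compile(pat).search(sentence))
        for rules in weighted_patterns
    ]
    saturated = [1.0 - 1.0 / (1.0 + s) for s in scores]
    total = sum(saturated)
    if total == 0.0:  # no hit at all: uniform fallback (an assumption)
        return [1.0 / len(scores)] * len(scores)
    return [x / total for x in saturated]


def mix_probs(p_embed, p_lex, alpha):
    """Assumed form: alpha weights the semantic part, (1 - alpha) the rules."""
    return [alpha * pe + (1.0 - alpha) * pl for pe, pl in zip(p_embed, p_lex)]


# Two components only, to keep the toy example readable.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
rules = [[("stress test(", 1.0)], [(r"internal audit", 1.0)]]
p_embed = embed_probs([1.0, 0.0], prototypes)
p_lex = lex_probs("the bank ran a stress test(2023)", rules)
p_mix = mix_probs(p_embed, p_lex, alpha=0.4)
```

The first toy pattern is deliberately an invalid regular expression (an unclosed parenthesis), so it exercises the fixed-string fallback.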
Considering the prevalence of “model sentences” and cross-year reuse in regulatory texts, Equation (4) further transforms the mixed probability into the final contribution weight. The quality term $\phi_i^{\text{qual}}$ integrates three constraints. Min-wise Independent Permutations Locality-Sensitive Hashing (MinHash-LSH) provides a non-duplication coefficient to penalize highly similar or cross-year-reused statements. PDF document tree reconstruction provides chapter position weights, giving greater importance to core sections such as risk management and internal control self-assessment. Numeric features, combined with strong action verbs, constitute an evidence-enhancing term, increasing the contribution of sentences containing quantitative information and substantive actions. In simple terms, this step acts as a ‘quality filter.’ It penalizes vague, ‘boilerplate’ language (sentences that look like copy-pasted templates) while rewarding specific, verifiable evidence (such as numbers or hard deadlines). This ensures that the final index reflects the substance of internal control rather than the mere volume of text.
3.1.3. Hierarchical Formative Index Construction and Aggregation
After completing the knowledge base construction and component mapping, we followed the bottom-up index construction logic and set the starting point for extracting and constructing text information at the third level of the index.
To remove the large volume of marketing statements and macro-level noise from the disclosure text, alleviate the intertemporal fluctuations in length and writing style caused by “disclosure overload”, and prevent noise from diluting effective signals and biasing indicator construction, we first preprocessed the raw texts using Python (v3.10) and then applied sentence-level screening. Specifically, for each sentence $i$, we calculate its relevance score $w_{i,E,t}$ for year $t$ and internal control element $E$. Instead of overwhelming the model with hundreds of repetitive keywords, we consolidate them into six distinct themes (such as ‘Disclosure Quality’ or ‘Hard Evidence’). This reduces noise and ensures that the indicators are robust across different writing styles. Then, based on the empirical distribution of this score across samples from that year, we use the Otsu dynamic threshold method [37] to determine the segmentation point $\tau_{t,E}^{\text{Otsu}}$ that distinguishes between relevant and irrelevant sentences. Because the threshold may be too low when the signal is weak, we imposed a minimum threshold constraint: the effective threshold for each year–element pair is the higher of the Otsu-derived cutoff and the prespecified lower bound. Finally, we retained all sentences whose relevance scores met or exceeded this threshold to construct the representative sentence subset for bank $b$ in year $t$ under element $E$. The screening rule and the resulting subset are detailed in Equation (5).
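The screening rule (Otsu cutoff, floored by a minimum threshold, then keep sentences at or above it) can be sketched as follows. This is a minimal illustration: the histogram resolution (`bins=64`), the floor value, the function names, and the toy scores are all assumptions, not values taken from the study.

```python
def otsu_threshold(scores, bins=64):
    """Histogram-based Otsu cutoff: maximize the between-class variance."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return lo
    width = (hi - lo) / bins
    hist = [0] * bins
    for s in scores:
        hist[min(int((s - lo) / width), bins - 1)] += 1
    centers = [lo + (k + 0.5) * width for k in range(bins)]
    total = len(scores)
    sum_all = sum(c * h for c, h in zip(centers, hist))
    best_var, best_k = -1.0, 0
    w0, sum0 = 0, 0.0
    for k in range(bins - 1):
        w0 += hist[k]
        sum0 += centers[k] * hist[k]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_var, best_k = between, k
    return lo + (best_k + 1) * width  # upper edge of the last "irrelevant" bin


def screen_sentences(scores, floor):
    """Effective threshold = max(Otsu cutoff, floor); keep scores >= threshold."""
    threshold = max(otsu_threshold(scores), floor)
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    return threshold, kept


# Toy relevance scores for one year-element pair (hypothetical values).
scores = [0.10, 0.11, 0.12, 0.14, 0.80, 0.82, 0.85, 0.90]
threshold, kept = screen_sentences(scores, floor=0.30)
```

With these toy scores the Otsu cutoff falls between the two clusters but below the floor, so the floor becomes the effective threshold and only the four high-scoring sentences are retained.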
We built the Level 3 system using “general” and “specific” dimensions. This dual approach evaluates both the format’s credibility and the content’s substance.
The general dimensions are designed to filter out purely formal noise, thereby keeping the indicators anchored in meaningful content for every internal control component. Here, we use relative attention and semantic coverage to gauge disclosure intensity and relevance. Additionally, we strengthened the “Hard Measures” dimension by extracting quantitative ESG data. Specifically, we scan governance sections in ESG reports for numeric values, such as audit frequency and board meeting counts. These figures serve as verifiable evidence.
The specific dimensions strictly correspond to the heterogeneity logic of the five COSO elements. That is, the control environment focuses on governance structure and culture; risk assessment emphasizes data quantification and foresight; control activities revolve around process automation and separation of duties; information communication examines the effectiveness of communication channels; and monitoring activities focus on the implementation of audit independence and the closure of rectification. All indicators and their calculation methods are shown in Appendix Table A1.
After constructing the third-level indicators (L3), we propose a two-stage weighting scheme that balances data distribution characteristics and theoretical priors by combining subjective and objective weighting. To make the L3 indicators more informative when they are rolled up to second-level indicators (L2), and to produce a stable IC-5Q index when L2 is further aggregated to first-level indices (L1), we adopt two weighting steps that address different needs. Because third-level indicators can be correlated, we apply the CRITIC method [38] in the L3-to-L2 mapping to reflect both the comparative strength and the degree of conflict among standardized indicators, thereby ensuring the resulting weights better reflect the distinguishability and value of each piece of information. When aggregating from L2 to L1 elements and constructing the IC-5Q index, we introduce a game-theoretic combinatorial weighting model. The data-driven weights are combined with a uniformly distributed (subjective) prior weight vector that serves as an uninformed baseline, and the combination coefficients are chosen by minimizing the sum of squared deviations between the candidate weight vectors, yielding the final weighting scheme (Equation (6)).
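The CRITIC step can be illustrated with a short Python sketch. It follows the commonly used formulation, in which each indicator's information content is its standard deviation (after min-max scaling) multiplied by its total conflict, the sum of one minus its Pearson correlation with every other indicator. The choice of population standard deviation, the function names, and the toy data are assumptions; the game-theoretic combination in the second stage is not covered here.

```python
import math


def minmax(col):
    """Rescale one indicator column to [0, 1]."""
    lo, hi = min(col), max(col)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in col]


def pstdev(col):
    """Population standard deviation (contrast intensity)."""
    mean = sum(col) / len(col)
    return math.sqrt(sum((x - mean) ** 2 for x in col) / len(col))


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either column is constant."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)


def critic_weights(columns):
    """CRITIC: weight_j proportional to sigma_j * sum_k (1 - r_jk)."""
    scaled = [minmax(c) for c in columns]
    info = []
    for j, col_j in enumerate(scaled):
        conflict = sum(
            1.0 - pearson(col_j, col_k)
            for k, col_k in enumerate(scaled) if k != j
        )
        info.append(pstdev(col_j) * conflict)
    total = sum(info)
    return [c / total for c in info]


# Three toy L3 indicators observed on four banks (hypothetical values).
# The first two are perfectly correlated, so they share their weight.
indicators = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 1, 3, 2]]
weights = critic_weights(indicators)
```

In this toy case the third indicator, which conflicts with the other two, receives half of the total weight, while the two redundant indicators receive a quarter each.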
3.2. Multi-Level Validation Framework: From Construct Validity to Predictive Power
We use a progressive, multilevel validation framework to examine measurement validity (whether the indicator system forms the intended construct), criterion validity (whether ICI relates to an established benchmark), and predictive validity (whether ICI explains future credit risk) to address RQ2.
3.2.1. Measurement and Criterion Validity via Formative PLS-SEM
We estimate the PLS-SEM model following standard hierarchical procedures [39,40]. Rather than entering high-dimensional L3 textual items directly, we consolidate them into six L2 dimensions per element: Disclosure Breadth, Quality, Distinctiveness, Regulatory Alignment, Hard Measures, and Specific Measures. These dimensions are treated as formative indicators because together they define the components of internal control rather than merely reflecting them. Each dimension captures a distinct aspect of disclosure, such as breadth of coverage or strength of supporting evidence, and these aspects are not interchangeable. Removing any single dimension would therefore inappropriately narrow the scope of the construct. To handle multicollinearity, we look beyond simple Variance Inflation Factor (VIF) values. We use CRITIC-based weighting during the aggregation phase to strictly reduce the impact of redundant data. We also ensure the stability of results by examining the dispersion of bootstrap weights. If diagnostics indicate potential overlap, we re-estimate the model using alternative specifications.
Structurally, the path model groups L2 dimensions into first-order elements ($E_k$), which then combine to form the second-order composite index (ICI). For element $k$ in year $t$, the formative measurement model is defined as:

$$E_{k,t} = \sum_{d=1}^{6} \gamma_{k,d}\, x_{k,d,t} + \zeta_{k,t} \tag{7}$$

where $x_{k,d,t}$ denotes the score of L2 dimension $d$, $\gamma_{k,d}$ represents the formative weight, and $\zeta_{k,t}$ the disturbance. Convergent validity is strongly supported by the redundancy analysis. The SEM-derived latent constructs are nearly identical to their corresponding aggregate targets ($\text{Target}_{E_k}$), with path coefficients consistently close to 1.0 and high $R^2$ values. This indicates that abstracting the six disclosure dimensions into first-order internal control elements results in minimal information loss, validating the reliability of the hierarchical structure.
At the second-order level, ICI is formed as:

$$\text{ICI}_t = \sum_{k=1}^{5} \omega_k\, E_{k,t} + \xi_t \tag{8}$$

where $\omega_k$ is the weight and $\xi_t$ the residual. We validate ICI by assessing its association with the DIB Internal Control Index (ICDI) and conducting supplementary panel regressions (Equation (9)) to ensure the index preserves benchmark ranking logic after controlling for firm characteristics.
3.2.2. Out-of-Sample Predictive Validity of the Internal Control Index (ICI)
To validate the predictive capability of the internal control index (ICI) for future credit risk, this section compares out-of-sample forecasting performance across multiple models. Specifically, this study discretizes the non-performing loan (NPL) change rate into a binary risk transition indicator to represent future credit risk (see Equation (11)). To avoid look-ahead bias when determining decision thresholds, the paper adopts the data-driven adaptive approach shown in Equation (10). To capture the tail risk of asset quality deterioration, we set the benchmark for the threshold parameter at the upper quartile of historical data. Although higher quantiles can better capture extreme crises, they tend to leave too few positive samples in small datasets, making it difficult for the model to converge. In contrast, the 75th percentile can effectively capture the early stages of asset deterioration and ensure sufficient information density for model training. In addition, we use the boundary condition $[\tau_{\min}, \tau_{\text{abs}}]$ to filter out fine noise during the stationary period without sacrificing sensitivity to crises.
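One plausible reading of this labelling scheme is sketched below in Python. It is an assumption-laden illustration, not the study's Equations (10) and (11): it takes the 75th percentile of past change rates only, clips it to the interval given by the two bounds, and flags a jump when the next change rate exceeds the clipped threshold. The bound values, the interpolation rule for the quantile, and all names are hypothetical.

```python
def quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def adaptive_threshold(history, q=0.75, tau_min=0.02, tau_abs=0.10):
    """Upper-quartile cutoff from past data only, clipped to [tau_min, tau_abs].

    tau_min and tau_abs are assumed illustrative bounds.
    """
    return min(max(quantile(history, q), tau_min), tau_abs)


def jump_label(change_rate, threshold):
    """Binary risk-transition indicator: 1 if the change exceeds the cutoff."""
    return int(change_rate > threshold)


# Past NPL change rates available at decision time (hypothetical values).
history = [0.00, 0.01, 0.02, 0.03, 0.04]
tau = adaptive_threshold(history)
labels = [jump_label(x, tau) for x in (0.01, 0.05)]
```

Because the threshold uses only data observed before the prediction date, the label for a given period never depends on information from that period or later.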
Once the non-performing loan change rate was transformed into a binary risk-transition indicator and the threshold criteria were defined, out-of-sample forecasting was performed using XGBoost as the primary model. Unlike traditional linear regression, which assumes risk factors act independently, XGBoost allows us to capture complex interactions. For instance, it can detect that a weak control environment becomes critically dangerous only when combined with rapid asset expansion, a nuance that simpler models would likely miss.
Rolling-window cross-validation was used to obtain a reliable assessment of the model’s predictive performance. For each prediction window between 2017 and 2023, the model was trained only on historical data before time T and tested on the current period at time T. Because the risk events in this paper are highly imbalanced, performance evaluation relies on PR-AUC and ROC-AUC for discrimination ability, Best F1 for the precision–recall trade-off, and the Brier score alongside the Top-K capture rate to quantify calibration and high-risk detection accuracy.
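The evaluation protocol can be sketched in plain Python. The split generator mirrors the rule stated above (train on years before T, test on year T); the Brier score and the Top-K capture rate follow their usual definitions, with the capture rate taken as the share of all true events that fall in the top fraction of predicted probabilities. Function names and toy data are hypothetical.

```python
import math


def rolling_splits(years, first_test=2017, last_test=2023):
    """Yield (T, train_idx, test_idx): train on years < T, test on year == T."""
    for T in range(first_test, last_test + 1):
        train = [i for i, y in enumerate(years) if y < T]
        test = [i for i, y in enumerate(years) if y == T]
        if train and test:
            yield T, train, test


def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def top_k_capture(y_true, probs, frac=0.10):
    """Share of all true events found among the top `frac` of predictions."""
    n_top = max(1, math.ceil(frac * len(probs)))
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    hits = sum(y_true[i] for i in order[:n_top])
    positives = sum(y_true)
    return hits / positives if positives else float("nan")


# Toy panel: one row per bank-year (hypothetical values).
years = [2015, 2016, 2016, 2017, 2018]
splits = list(rolling_splits(years))

y = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
p = [0.1, 0.2, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.3]
capture = top_k_capture(y, p, frac=0.10)
```

In the toy example the top decile contains a single observation, which is one of the two true events, so the capture rate is one half.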
3.3. The IIC-DSS Framework: SHAP-Based Diagnosis and Decision Support
By applying the TreeSHAP algorithm to the XGBoost framework, we isolate the marginal impact of the five internal control components on the risk of sudden NPL increases. This step essentially translates the model’s complex mathematical output into a human-readable explanation, identifying the specific why behind each risk prediction. On this basis, using probability calibration and natural language generation technologies, elaborate mathematical results are transformed into visual indicators and diagnostic reports within the business intelligence (BI) dashboard, and, ultimately, an internal control decision support system (IIC-DSS) integrating “prediction—interpretation—presentation” is constructed.
With TreeSHAP, we decompose the model output in log-odds space into an additive form of “base value + feature contributions”, and map it to the final jump probability through the logistic function σ(⋅). The risk prediction of bank $i$ at time point $t$ satisfies:

$$\hat{p}_{i,t} = \sigma\Big(\phi_0 + \sum_{j} \phi_{i,j}\Big) \tag{12}$$

In Equation (12), $\phi_0$ is the base term, and $\phi_{i,j}$ quantifies the marginal effect of feature $j$ on the prediction for bank $i$ (the time index is omitted for brevity). A positive SHAP value indicates that the feature increases risk; a negative value quantifies the buffering effect of effective internal control on risk.
During the empirical process, NPL jump labels in the training set are generated using dynamic hybrid thresholds, and the XGBoost model is trained on them. After training, TreeSHAP is called to perform attribution analysis, output the SHAP contribution matrix, and calculate the corresponding risk probability, making the decomposition of each prediction explicit. Beyond summarizing the SHAP importance of the five internal control elements, we report the mean importance and 95% confidence interval obtained through bootstrap resampling to robustly characterize “which type of internal control subsystem is more critical”, and generate a unique SHAP contribution vector for each bank for risk diagnosis.
After completing the attribution analysis, this study uses XAI to convert calibrated risk probabilities and attribution results into decision-support information and integrates it into the BI system. Because jumps in the non-performing loan ratio are infrequent and the predicted probability is easily perturbed by sample imbalance, a robust calibration mechanism is introduced into the model after obtaining SHAP values. Specifically, the Platt scaling method based on logistic regression [41] is preferred for probability calibration; if the results show instability, the prior correction strategy is enabled, in which a log-odds transformation aligns the predicted probability with the training set’s overall distribution.
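The fallback step can be illustrated with a simple intercept shift in log-odds space. This sketch is one common form of prior correction and is an assumption about the procedure, not a statement of the exact adjustment used in the study: it moves every prediction by the difference between the log-odds of a target base rate and the log-odds of the model's average prediction. The function names and rates are hypothetical, and Platt scaling itself is not shown.

```python
import math


def logit(p, eps=1e-12):
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Logistic function mapping log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def prior_correct(p, model_rate, target_rate):
    """Shift log-odds so a prediction equal to model_rate maps to target_rate."""
    shift = logit(target_rate) - logit(model_rate)
    return sigmoid(logit(p) + shift)


# Raw scores from a model whose average prediction (0.20) overstates
# an assumed training base rate of 0.05 (both values hypothetical).
raw = [0.05, 0.20, 0.60]
calibrated = [prior_correct(p, model_rate=0.20, target_rate=0.05) for p in raw]
```

The shift preserves the ranking of banks, so discrimination metrics such as ROC-AUC are unaffected while the probability scale is realigned.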
The constructed visual interaction platform consists of three core functional modules. The summary display module presents the calibrated risk probability distribution and constructs the “System Resilience Intensity” by summing the negative SHAP values of the five internal control elements, quantifying each element’s offsetting contribution to risk. The diagnostic analysis module performs a global importance assessment with bootstrap resampling and presents the contribution differences among internal control elements through error-bar plots of the confidence intervals. To support tiered management, the intervention recommendation module employs a dynamic threshold classification technique that searches the probability quantile matrix to determine the F1-optimal threshold and a safety floor. Based on these two thresholds, banks are segmented into three tiers: high, medium, and low. Building on these tiering results, the framework uses Natural Language Generation technology to generate differentiated reports that expose the essential weaknesses of high-risk banks along with their SHAP contributions, identify risks for medium-risk banks, and summarize the principal strengths of low-risk banks. Through this procedure, the IIC-DSS framework translates the outputs of complex statistical models into a set of internal control governance measures that can be implemented directly.
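The dashboard quantities described above reduce to a few arithmetic steps once SHAP values are available. The Python sketch below assumes the contributions for one bank are already computed (for example by a TreeSHAP implementation) and are expressed in log-odds units; the base value, the two cutoffs, and the contribution values are hypothetical.

```python
import math


def risk_probability(base_value, contributions):
    """Additive SHAP decomposition in log-odds space, mapped through sigma."""
    z = base_value + sum(contributions.values())
    return 1.0 / (1.0 + math.exp(-z))


def resilience_intensity(contributions):
    """Sum of the negative (risk-reducing) SHAP values."""
    return sum(v for v in contributions.values() if v < 0)


def risk_tier(prob, high_cut, floor_cut):
    """Three tiers from two thresholds: F1-optimal cutoff and safety floor."""
    if prob >= high_cut:
        return "high"
    if prob >= floor_cut:
        return "medium"
    return "low"


# SHAP values of the five internal control elements for one bank-year.
shap_row = {"CE": -0.4, "RA": 0.3, "CA": -0.1, "IC": 0.0, "MA": 0.2}
prob = risk_probability(base_value=-2.0, contributions=shap_row)
buffer_strength = resilience_intensity(shap_row)
tier = risk_tier(prob, high_cut=0.5, floor_cut=0.1)
```

Here the positive and negative contributions cancel, so the predicted probability equals the logistic transform of the base value, and the bank lands in the medium tier under the assumed cutoffs.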
4. Results and Discussion
4.1. Dataset and Descriptive Statistics
The text data are derived from the annual and ESG reports of commercial banks listed on China’s A-share market. Table A2 in Appendix A summarizes the step-by-step preprocessing pipeline that converts raw PDF annual/ESG reports into a sentence-level, section-tagged corpus with quality weights. At the numerical level, key financial and risk variables from the Wind database are merged with the Internal Control Index (ICDI) provided by the DIB Internal Control Index database.
To address the small number of missing values in the sample, we evaluated candidate imputation methods using a combination of rolling time-window cross-validation and ground-truth masking. The algorithms selected for evaluation include panel means and medians, k-Nearest Neighbors (k-NN), Random Forests, and MICE with Predictive Mean Matching (MICE-PMM). The evaluation process involves rolling training and validation sets annually and randomly masking known observations in the validation set before reconstruction. The normalized root mean squared error (NRMSE) and normalized mean absolute error (NMAE) were calculated between imputed and actual values. Based on the principle of minimizing NRMSE and NMAE, we ultimately employed the random forest algorithm for data imputation. The programming for the index construction and validation procedures described above was conducted using R (v4.1.0). The descriptive statistics of the processed data are presented in Table 2.
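The masking-based comparison can be sketched as follows. This is a simplified Python illustration (the study itself reports using R): known values are hidden at random, reconstructed by each candidate, and scored with NRMSE and NMAE. Only constant-fill mean and median imputers are shown as stand-ins for the full candidate set, the rolling time windows are omitted, and normalizing by the range of the series is an assumption.

```python
import math
import random
import statistics


def nrmse(truth, pred, spread):
    """Root mean squared error divided by a normalizing spread."""
    mse = sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth)
    return math.sqrt(mse) / spread


def nmae(truth, pred, spread):
    """Mean absolute error divided by a normalizing spread."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth) / spread


def masked_evaluation(values, imputer, mask_frac=0.2, seed=0):
    """Hide a random share of known values, impute them, and score the result."""
    rng = random.Random(seed)
    n_mask = max(1, int(mask_frac * len(values)))
    hidden = set(rng.sample(range(len(values)), n_mask))
    visible = [v for i, v in enumerate(values) if i not in hidden]
    truth = [values[i] for i in sorted(hidden)]
    pred = [imputer(visible)] * len(truth)  # constant fill from visible data
    spread = max(values) - min(values)      # range normalization (assumed)
    return {"nrmse": nrmse(truth, pred, spread),
            "nmae": nmae(truth, pred, spread)}


# One fully observed toy series (hypothetical values).
series = [1.2, 1.4, 1.1, 1.6, 1.3, 1.5, 1.7, 1.0, 1.8, 1.9]
candidates = [("mean", statistics.mean), ("median", statistics.median)]
results = {name: masked_evaluation(series, fn) for name, fn in candidates}
best = min(results, key=lambda name: results[name]["nrmse"])
```

Using the same seed for every candidate ensures that all methods are scored on the same hidden observations.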
4.2. Construct Validity and External Consistency of the Index System
Based on the methodology, we applied PLS-SEM to verify the construct validity and external consistency of the internal control index system constructed from complex textual content. Figure 2 illustrates the hierarchical formative path model used for this validation. The specific verification results are shown in Table 3.
The first-order measurement model results for Stage 1 show that the outer weights for the six process quality dimensions are all significantly positive (***), and the bias-corrected and accelerated (BCa) confidence intervals do not include zero, establishing the statistical significance of the indicators. The collinearity diagnosis shows that the variance inflation factor (VIF) of all indicators is below the critical value of 3.0, which rules out problematic multicollinearity and confirms that attributes such as disclosure breadth and consistency provide independent and non-redundant information. Together, these dimensions constitute the five elements of internal control: control environment (CE), risk assessment (RA), control activities (CA), information and communication (IC), and monitoring activities (MA). The weight ranges vary among elements. For example, the L2 dimension weights range from 0.302 to 0.541 for the control environment (CE) and from 0.244 to 0.517 for control activities (CA), indicating that the marginal contributions of the process dimensions are not balanced across governance semantics.
Similarly, at the second-order structural level (Stage 2), the five elements, as formative indicators of the internal control index (ICI), are also significant. The weight ranking shows that information and communication (IC, 0.319) contributes most to ICI, followed by monitoring activities (MA, 0.258), control activities (CA, 0.222), and risk assessment (RA, 0.218), while the control environment (CE, 0.162) makes the smallest contribution. Convergent validity is supported by the redundancy analysis: each construct’s path coefficient to its global single-item target variable is close to 1.0, and R² ranges from 0.959 to 0.994 (ICI: 0.988), indicating that the mapping from text features to latent construct scores exhibits no material information distortion.
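To make the Stage 2 weights concrete, the following Python sketch shows how formative weights could combine element scores into a composite. It is a simplification under stated assumptions: the weights are those reported above, the element scores for the example bank are hypothetical, and the final standardization that PLS-SEM software applies to latent scores is omitted.

```python
# Stage 2 formative weights reported in the text for the five elements.
STAGE2_WEIGHTS = {"IC": 0.319, "MA": 0.258, "CA": 0.222, "RA": 0.218, "CE": 0.162}


def relative_shares(weights):
    """Rescale formative weights so they sum to 1, for easier comparison."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def composite_score(element_scores, weights=STAGE2_WEIGHTS):
    """Weighted sum of standardized element scores (unstandardized ICI proxy)."""
    return sum(weights[name] * element_scores[name] for name in weights)


shares = relative_shares(STAGE2_WEIGHTS)
ranking = sorted(STAGE2_WEIGHTS, key=STAGE2_WEIGHTS.get, reverse=True)

# Hypothetical standardized element scores for one bank-year (not from the paper).
example_bank = {"CE": 0.5, "RA": -0.2, "CA": 0.1, "IC": 1.0, "MA": 0.0}
ici_proxy = composite_score(example_bank)
```

Because formative weights need not sum to one, `relative_shares` is included only to express each element's weight as a share of the total; on that scale IC accounts for roughly 27% of the summed weights.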
4.3. Out-of-Sample Predictive Performance
To assess the incremental predictive value of the textual Internal Control Index (ICI), we employed an optimized XGBoost model interacting ICI with proxies for organizational complexity (lnAssets), risk vulnerability (NPL_lag1), and performance incentives (ROE). Other controls (CAR, leverage, and LDR) primarily reflect regulatory buffers or balance-sheet structure; treating them as main effects already absorbs important financial differences, while interacting ICI with all controls would substantially increase feature dimensionality and can reduce stability and interpretability under the very low base rate of NPL jumps.
Table 4 shows that the best-performing XGBoost specification is Controls + ESG + ICI + ICI × (lnAssets, NPL_lag1, ROE), achieving ROC-AUC = 0.909 and PR-AUC = 0.0909, with the strongest overall classification quality (Best F1 = 0.167) and strong tail-event prioritization (Top-10 capture = 0.667). Importantly, adding ICI provides incremental value beyond ESG ratings: the Controls + ESG model captures only 33.3% of actual jump events in the top decile, whereas the ESG + ICI model captures 66.7%. This is practically meaningful in a rare-event setting (base rate ≈ 0.31%): when regulators or risk managers can intensively review only the top 10% of banks flagged by the model, incorporating ICI doubles the yield of true distressed cases relative to relying on financial controls and ESG scores alone. This superior tail-risk sensitivity suggests that incorporating textual internal control quality helps detect nonlinear risk precursors that linear models and general governance scores fail to capture.
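The top-decile capture rate discussed above can be computed as in the following Python sketch. The function name, the rounding rule for the size of the top decile, and the toy scores are assumptions for illustration; the toy data are constructed so that three true events reproduce the one-in-three versus two-in-three contrast, and they are not the study's predictions.

```python
def top_share_capture(probabilities, labels, share=0.10):
    """Fraction of true events ranked within the top `share` of predicted risk."""
    n_top = max(1, round(len(probabilities) * share))
    ranked = sorted(
        range(len(probabilities)), key=lambda i: probabilities[i], reverse=True
    )
    hits = sum(labels[i] for i in ranked[:n_top])
    total_events = sum(labels)
    return hits / total_events if total_events else float("nan")


# Toy example (hypothetical): 30 bank-years, 3 true jump events (indices 0-2).
labels = [1, 1, 1] + [0] * 27
baseline_scores = [0.90, 0.05, 0.04, 0.80, 0.70] + [0.10] * 25
enriched_scores = [0.90, 0.85, 0.04, 0.80, 0.70] + [0.10] * 25

capture_baseline = top_share_capture(baseline_scores, labels)
capture_enriched = top_share_capture(enriched_scores, labels)
```

With only three events, moving a single true event into the top decile changes the capture rate from one third to two thirds, which is why this metric should be read alongside the base rate.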
4.4. From Explanation to Action: SHAP Diagnostics and IIC-DSS Application
We integrated the SHAP attribution mechanism into the optimal XGBoost model, aiming to identify the core elements driving the risk jump and convert them into governance diagnostic bases under the IIC-DSS framework.
The IIC-DSS is operationalized as a deployable business intelligence platform with a streamlined user workflow. Users (regulators, risk managers, or investors) upload bank PDF reports through a drag-and-drop web interface. Once a report is uploaded, the backend automatically runs the full analysis pipeline. The results appear in an interactive “Risk Diagnosis” panel. This panel shows the calibrated risk probability, SHAP force plots that break down each element’s contribution, and auto-generated remediation suggestions in plain language. An offline snapshot of the dashboard interface is available in the
Supplementary Materials. The summary results indicate that the average calibrated predicted probability of an NPL jump event is approximately 0.93%. Based on the aggregated SHAP contributions of the five internal control elements, the “system resilience strength” is approximately 88.3%, indicating that, in the vast majority of sample banks, the current internal control system exerts a net inhibitory effect on credit risk and buffers potential risk exposure.
In the diagnostic analysis module, the global importance assessment based on bootstrap resampling (
Table 5) reveals differences in the contributions of internal control elements. The Control Environment (CE) has the highest weight (mean |SHAP| = 0.592), followed by Information & Communication (IC, 0.463) and Control Activities (CA, 0.422).
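A bootstrap estimate of global importance (mean |SHAP|) can be sketched in Python as follows. This is an illustrative sketch rather than the study's procedure: the percentile interval, the number of resamples, the function name, and the per-bank SHAP values are all assumptions, and the toy values are not those underlying Table 5.

```python
import random
import statistics


def bootstrap_mean_abs(shap_values, n_boot=1000, seed=0):
    """Bootstrap the mean absolute SHAP value.

    Returns (point estimate, lower bound, upper bound) of a 95% percentile interval.
    """
    rng = random.Random(seed)
    abs_vals = [abs(v) for v in shap_values]
    boot_means = sorted(
        statistics.fmean(rng.choices(abs_vals, k=len(abs_vals)))
        for _ in range(n_boot)
    )
    lower = boot_means[int(0.025 * n_boot)]
    upper = boot_means[int(0.975 * n_boot) - 1]
    return statistics.fmean(abs_vals), lower, upper


# Hypothetical per-bank SHAP values for two elements (not taken from Table 5).
toy_shap = {
    "CE": [0.9, -0.7, 0.5, -0.4, 0.8, 0.6],
    "IC": [0.3, -0.5, 0.4, 0.2, -0.6, 0.4],
}
importance = {name: bootstrap_mean_abs(vals) for name, vals in toy_shap.items()}
ranking = sorted(importance, key=lambda name: importance[name][0], reverse=True)
```

Taking absolute values before averaging means that an element with large contributions in both directions ranks as important even if its signed contributions cancel out.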
In the intervention recommendation module, TreeSHAP generates a contribution profile for each sample bank. By decomposing each prediction, the model quantifies the marginal driving or buffering effect of every internal control element on the risk probability. On this basis, the IIC-DSS implements a three-level hierarchical strategy of “dynamic threshold as the main approach, and quantile distribution as the auxiliary approach”. To remain sensitive to tail risks, the system combines an optimal F1 threshold with a head-protective mechanism to identify high-risk groups, thereby ensuring adequate coverage of the top 10% of risk samples. For the remaining banks, the model separates medium and low risk at the quartile points of the probability distribution. The system then generates differentiated attribution diagnoses that identify the main governance weaknesses and the direction of their contributions, giving management, regulatory authorities, and external investors targeted, actionable improvement recommendations. Representative sample results are shown in
Table 6.
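The three-level tiering logic described above can be sketched in Python under explicit assumptions. The text does not specify the exact rules, so this sketch assumes that a bank is "High" if its probability meets the F1-optimal threshold or it falls in the top 10% by predicted risk (the head-protective rule), and that the remaining banks are split into "Medium" and "Low" at the upper quartile of their probabilities. Function names, parameters, and the toy data are hypothetical.

```python
def best_f1_threshold(probabilities, labels):
    """Scan observed probabilities as cut-offs and return (threshold, F1) with max F1."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(probabilities)):
        pairs = list(zip(probabilities, labels))
        tp = sum(1 for p, y in pairs if p >= t and y == 1)
        fp = sum(1 for p, y in pairs if p >= t and y == 0)
        fn = sum(1 for p, y in pairs if p < t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


def assign_tiers(probabilities, f1_threshold, head_share=0.10, quantile=0.75):
    """Label each bank High, Medium, or Low (assumed rules, see lead-in)."""
    n = len(probabilities)
    ranked = sorted(range(n), key=lambda i: probabilities[i], reverse=True)
    head = set(ranked[: max(1, round(n * head_share))])  # head-protective rule
    high = {i for i in range(n) if probabilities[i] >= f1_threshold or i in head}
    rest = sorted(probabilities[i] for i in range(n) if i not in high)
    # Assumed Medium/Low boundary: upper quartile of the non-high probabilities.
    cut = rest[min(len(rest) - 1, int(quantile * len(rest)))] if rest else None
    tiers = []
    for i, p in enumerate(probabilities):
        if i in high:
            tiers.append("High")
        elif p >= cut:
            tiers.append("Medium")
        else:
            tiers.append("Low")
    return tiers


# Toy probabilities and outcomes for ten banks (hypothetical).
probs = [0.13, 0.06, 0.02, 0.015, 0.010, 0.008, 0.006, 0.004, 0.002, 0.001]
jumps = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
threshold, f1 = best_f1_threshold(probs, jumps)
tiers = assign_tiers(probs, threshold)
```

Combining the two high-risk rules with a logical OR guarantees that the top decile is always flagged, even when the F1-optimal threshold is set high on a sample with very few events.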
We use China Minsheng Bank to illustrate how algorithm results can turn into governance advice. The model estimates a 12.95% chance of an NPL jump and labels the bank as “High Risk.” SHAP then shows which factors increase the risk. The biggest drivers are the Control Environment (SHAP +0.356) and Information and Communication (SHAP +0.351). This points to governance culture and internal information sharing as the core problems, not day-to-day operating errors. Based on this, the IIC-DSS does not suggest adding broad capital buffers. Instead, it recommends specific governance fixes. For example, the bank can redesign internal reporting to reduce information silos and increase board-level monitoring. Conversely, the model supports a “maintenance” strategy for the low-risk Bank of Ningbo. A negative score for “Control Activities” (−0.592) confirms that current procedures effectively reduce risk, meaning no remediation is needed.
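The step from SHAP attributions to governance advice can be illustrated with a small rule-based Python sketch. The SHAP values are the ones reported above for the two banks; the advice wording, the `diagnose` function, and the rule of reporting the two largest positive drivers are hypothetical and are not the IIC-DSS implementation.

```python
# Illustrative advice templates keyed by internal control element (hypothetical wording).
ADVICE = {
    "CE": "Strengthen governance culture and board-level monitoring.",
    "RA": "Review risk identification and assessment procedures.",
    "CA": "Tighten approval and operating control procedures.",
    "IC": "Redesign internal reporting to reduce information silos.",
    "MA": "Increase the frequency and independence of internal audits.",
}


def diagnose(risk_tier, shap_by_element, top_k=2):
    """Turn a risk tier and per-element SHAP values into a simple diagnosis."""
    # Positive SHAP values push the predicted risk up; negative values buffer it.
    drivers = sorted(
        (name for name, value in shap_by_element.items() if value > 0),
        key=lambda name: shap_by_element[name],
        reverse=True,
    )[:top_k]
    buffers = [name for name, value in shap_by_element.items() if value < 0]
    if risk_tier == "High" and drivers:
        return {
            "strategy": "remediate",
            "focus": drivers,
            "actions": [ADVICE[name] for name in drivers],
        }
    return {"strategy": "maintain", "focus": buffers, "actions": []}


# SHAP values as reported in the text; other elements are omitted because
# their values are not given there.
minsheng = diagnose("High", {"CE": 0.356, "IC": 0.351})
ningbo = diagnose("Low", {"CA": -0.592})
```

For the high-risk case the sketch returns CE and IC as focus areas with the corresponding templates, while the low-risk case yields a maintenance strategy that lists CA as a buffering element.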
5. Conclusions and Limitations
This paper proposes a set of procedures for quantifying complex textual information to evaluate the internal control quality of Chinese listed banks, and integrates business intelligence to build a visual, intelligent internal control decision support system (IIC-DSS). The PLS-SEM and XGBoost validation results indicate that the indicator system exhibits good construct validity and performs well in predicting the probability of a jump in non-performing loans. Furthermore, the system dashboard integrates interpretability tools such as TreeSHAP, enabling the model to analyze the marginal contributions of internal control elements and automatically generate diagnostic reports for individual banks, helping them identify governance weak links and clarify improvement directions.
Several limitations remain, mainly related to data coverage, regulatory dependence, and external validity. The analysis uses Chinese A-share-listed banks because their annual reports are standardized and consistently accessible; as a result, the model may not fully capture the risk patterns of non-listed banks. In addition, the textual feature engineering was developed around China’s Basic Standard for Enterprise Internal Control and guidance from the National Financial Regulatory Administration. While COSO and Basel principles are widely applicable, their linguistic realization varies by jurisdiction, so applying the model under regimes such as the U.S. Sarbanes–Oxley Act or the European Banking Authority Guidelines would require revalidating the semantic dictionary. Cross-regional transfer is therefore not plug-and-play: parameters estimated from Chinese disclosures cannot be directly used for EU or U.S. banks, although the modular design supports adaptation. Components that transfer relatively well include the COSO five-element structure, preprocessing logic, the XGBoost framework, and SHAP-based interpretation. By contrast, regulatory seed terms, section-tagging rules, and semantic prototype vectors need to be rebuilt using local regulations and disclosure corpora. Generalizability may also be constrained by institutional differences, including the role of state ownership in China’s banking sector. Future work will broaden validation across regulatory settings, incorporate process-mining logs, and explore more advanced NLP methods to automate governance diagnostics further.