Article

From Unstructured Text to Automated Insights: An Explainable AI Approach to Internal Control in Banking Systems

1 China School of Banking and Finance, University of International Business and Economics, Beijing 100105, China
2 School of Public Finance and Economics, Shanxi University of Finance and Economics, Taiyuan 030012, China
* Author to whom correspondence should be addressed.
Systems 2026, 14(3), 234; https://doi.org/10.3390/systems14030234
Submission received: 13 January 2026 / Revised: 12 February 2026 / Accepted: 22 February 2026 / Published: 25 February 2026
(This article belongs to the Special Issue Business Intelligence and Data Analytics in Enterprise Systems)

Abstract

The complexity of internal control in commercial banks continues to increase, while the relevant reports suffer from notable disclosure lag and templated language. In response to the demand to transform unstructured disclosures into actionable insights, this study proposes an “augmented Business Intelligence (BI) framework” that integrates a text-based internal control quality assessment system, a dual-validation process, and the resulting Intelligent Internal Control Decision Support System (IIC-DSS). By combining large language models and neural-symbolic models of regulatory prototypes, a quality evaluation system for internal control based on complex text is constructed using a mixed probability mechanism to reduce interference from defensive disclosures. A dual validation process is designed with Partial Least Squares Structural Equation Modeling (PLS-SEM) and Extreme Gradient Boosting (XGBoost). PLS-SEM verification confirms the construct validity of this evaluation system, while XGBoost verification indicates that internal control quality has incremental predictive ability for asset quality deterioration. The IIC-DSS uses SHapley Additive exPlanations (SHAP) to explain XGBoost outputs, quantifying the marginal contribution of each control factor to the predicted risk. Overall, this study advances internal-control measurement by establishing a neural-symbolic, text-to-indicator representation within an augmented BI architecture and empirically demonstrating its utility in improving predictive power for bank asset quality deterioration and in enhancing decision transparency via explainable AI.

1. Introduction

As a complex system integrating IT, operations, and culture, internal control in commercial banks plays a pivotal role in determining performance and risk governance [1,2]. The growing complexity of the internal control system is mainly attributable to two factors: the continuous expansion of fintech, which has broadened the boundaries and connotations of internal control [3,4], and the increasingly strict regulatory constraints of the post-crisis era, especially the significantly enhanced compliance requirements under China’s “dual-pillar” framework [5,6]. To address this situation, the construction of the internal control system of banks needs to take into account dual standards: deeply integrating domestic regulations and international practices at the compliance level while seeking a balance of efficient operation within the constraints of core risk indicators, such as the capital adequacy ratio, at the business level. Nevertheless, the lack of transparency in internal control mechanisms leaves investors and regulators dependent on restricted disclosure channels, such as assessment reports, to obtain a direct picture of the system. Such reports, however, often exhibit delayed disclosure, incomplete coverage of information, inconsistent statements, templated language, and a lack of unified evaluation standards [7]. A representative illustration can be found in the recent Internal Control Evaluation Report of a leading state-owned commercial bank. Its key conclusions are primarily delivered through binary selection formats, for instance, checking “No” for “major deficiencies in financial reporting” and “Effective” for the overall conclusion. The narrative then proceeds to a broad, standardized assertion that the bank has maintained effective controls “in all material respects,” while specific operational defects are often generalized as “rectified general deficiencies” without elaborating on the underlying risks. Consequently, critical details about control processes and risk response mechanisms are often omitted.
With the development and widespread application of text analytics and artificial intelligence technologies [8,9,10], new mitigation paths have emerged for the aforementioned predicaments. Corporate annual reports and Environmental, Social, and Governance (ESG) reports provide vast, heterogeneous, and more specific information sources with higher information entropy and more dispersed semantics. The narrative sections in annual reports (such as corporate governance, comprehensive risk management and compliance, business description, major events and regulatory penalty rectification, etc.) present rich insights into internal control elements in a multi-dimensional and decentralized manner, forming a more complete evidence chain of “how internal control is implemented and improved in business processes”. ESG reports further strengthen compliance culture and risk awareness at the governance level, providing additional semantics for identifying the operational status of internal controls. By integrating multiple technologies to process publicly disclosed annual bank reports and ESG reports, converting large-scale, complex, unstructured content into computable, structured indicators, and integrating these indicators into Business Intelligence (BI) decision support systems, this approach can serve as an operational solution.
Nevertheless, significant challenges persist in deriving quantitative internal control metrics from unstructured text and embedding them into decision-support frameworks. A major hurdle is the “coarse-grained” nature of current metric construction. Since internal controls rely on multiple coupled elements, a single aggregated metric rarely captures the system’s full complexity. The text data itself presents further difficulties. Banks often use templated “defensive disclosures” that obscure specific risks in vague language, so traditional keyword matching captures noise rather than substance. High data dimensionality exacerbates this problem, making the modeling process even more challenging. Beyond these technical issues, there is a functional disconnect between data and decision-making. Most text mining stops at generating indicators without linking them to decision systems. This leaves regulators and managers unable to readily identify sources of risk or translate insights into action.
Our study focuses on three core research questions:
(RQ1) How can unstructured text in bank reports be turned into a multidimensional quantitative framework that captures the layered structure of internal controls?
(RQ2) Does a model that includes text-mined internal-control variables predict outcomes significantly better than models that use other internal-control variables?
(RQ3) How can text-driven indicators be operationalized to support model interpretation and risk prioritization in the banking sector?
To address these issues and make the results usable for BI decision support, we developed a workflow that links text, indicators, models, and dashboards. We leverage high-performance embedding models, such as the BAAI General Embedding (BGE) models from the Beijing Academy of Artificial Intelligence, along with a dual regulatory-semantic knowledge base to map disclosure texts to vector spaces. By computing hybrid probabilities against regulatory prototypes, we filter out noise and convert raw text into a rigorous, five-element internal control quality indicator system (IC-5Q). Validating this structure requires a two-step approach: we first use Partial Least Squares Structural Equation Modeling (PLS-SEM) to test construct validity, and then pair the indicators with Extreme Gradient Boosting (XGBoost) to evaluate their out-of-sample predictive performance for asset quality risk. To ensure practical utility, we embed SHapley Additive exPlanations (SHAP)-based explainable AI into BI dashboards, thereby creating an Intelligent Internal Control Decision Support System (IIC-DSS). This system visualizes the marginal contribution of each element, providing managers with intuitive risk assessments that directly support governance decisions.
The remainder of this paper is organized as follows. Section 2 conducts a critical review of the literature on internal control measurement, text mining in internal control, and the technological integration of business intelligence, machine learning, and explainable artificial intelligence. Section 3 explains how to extract the internal control indicator system from unstructured disclosure texts and presents an empirical validation framework for the index system: using PLS-SEM to test the construct validity, evaluating the out-of-sample prediction performance of asset quality risks based on models such as XGBoost, and introducing SHAP to provide traceable explanations at the internal control component level. Section 4 describes the dataset construction process, reports verified empirical results, and presents the application scenario via a business intelligence dashboard. Section 5 summarizes the research conclusions and proposes directions for future research.

2. Literature Review

2.1. Evaluation Methods for Internal Control Systems

There are various indices for internal control assessment, which can be classified into two categories: goal-oriented and process-oriented [11].
The goal-oriented indices measure quality by checking how well a company meets its targets in strategy, operations, reporting, and compliance [12,13]. Nevertheless, internal controls can provide only a reasonable level of assurance and cannot, by themselves, ensure the attainment of goals. As a result, these indices tend to reflect a company’s overall strength but struggle to pinpoint specific weaknesses, making diagnosis and targeted improvement harder.
Process-oriented indices adopt a different perspective by examining the system’s configuration, integrity, and efficiency. They assess quality mainly through disclosed deficiencies and the five standard components of internal control [14,15,16]. The downside is that they rely heavily on disclosure quality, and because scoring often depends on subjective expert judgment, results can vary significantly across different studies [17,18].
To better evaluate complex internal control systems, their shortcomings, and guidance for improvement, this study aims to develop an optimized process index. On the one hand, this study complements and explores appraisal indices through abundant unstructured “soft information.” On the other hand, it uses a weighted method combining subjective and objective approaches to adjust the weight coefficients and reduce the influence of human opinion on appraisal indices.

2.2. Evolution of Text Analysis in Internal Control

The application of text analysis to internal controls has deepened with advances in natural language processing technology in recent years. Early research in this field mainly depended on lexicon-based methods and automated content analysis. For example, Boritz et al. [19] identified words in audit reports related to IT weakness by building a lexicon. Rich et al. [20] argued that unstructured text provides clues about not only the control environment but also the text’s tone, which is strongly correlated with the quality of future internal control. With the advancement of text analysis tools, text mining and machine learning techniques have been increasingly adopted. For instance, Boskou et al. [21] extracted internal audit value by building a classification model using specific terminology and n-gram syntax, which improved performance. Similarly, Liu et al. [22] confirmed that text analysis based on Python, combined with machine learning, effectively measures internal control intent—a useful approach for controlling earnings management behavior.
Nowadays, innovations in deep learning are driving text analysis tools from conventional approaches to large language models (LLMs). Huang et al. [23] and Yang et al. [24] developed FinBERT and FinGPT, respectively, capable of analyzing unstructured information in the financial sector more deeply. Chiu and Hung [25] further advanced this line of research and developed a finance-specific LLaMA-2 model enhanced with an AI-driven summarization process. The results demonstrated superior performance in sentiment analysis and return prediction compared to existing approaches.
Currently, there is very little work on using large language models for deep semantic analysis of vast amounts of Chinese texts in the field of internal control. In particular, the question of how to systematically map and quantify multidimensional textual features onto specific elements of internal control requires further attention.

2.3. Business Intelligence, Machine Learning, and Explainable Artificial Intelligence in Enterprise Systems

BI is a cornerstone of management decision-making, and its theoretical framework and practical value have been extensively discussed. Chen et al. [26] systematically clarified the evolutionary trajectory and current state of application of BI. Visinescu et al. [27] further examined decision effectiveness and, by constructing a simplified model, revealed the internal mechanism by which BI enhances decision quality, thereby providing crucial theoretical support for understanding the relationship between BI and decision quality.
Compared with traditional BI analysis, the introduction of machine learning has endowed enterprise systems with deeper insights and is gradually being internalized as a core tool. The application scenarios of this technology are increasingly diverse: Ji and Li [28] combined gradient boosting decision trees with dynamic indicator selection to construct an enterprise financial risk prediction system that achieves high accuracy in identifying potential risks; Duan et al. [29] examined the internal audit process and constructed an evaluation model integrating machine learning and process mining, enriching the technical means of internal control through quantitative analysis of transaction anomalies.
However, the enhancement of model capabilities has also brought about trust and compliance pressures caused by “black boxes”, and explainable artificial intelligence (XAI) has thus become an important supplement. Barredo Arrieta et al. [30] reviewed XAI research, emphasizing the need to improve explainability to reduce concerns about AI applications and promote its implementation. Subsequently, XAI rapidly penetrated vertical fields: Weber [31] summarized different XAI paths in financial scenarios; Lu and Lin [32] integrated XGBoost with SHAP techniques to explore the determinants of voluntary disclosure, enhancing the interpretability of financial disclosure prediction; Kou et al. [33] further introduced large language models and XAI into annual report text analysis, proposing new digital measurement ideas to make complex text information more traceable and interpretable.
BI systems have been actively integrating machine learning and AI capabilities to deliver stronger end-to-end insights and decision support. Rane et al. [34] and Ebule [35] both note that embedding technologies such as NLP and computer vision into BI systems can substantially improve an organization’s capacity to generate actionable insights and automate decision-making. Chebrolu’s [36] review supports this with empirical data, showing that AI-driven automation reduces manual data processing by approximately 70% and improves prediction accuracy by 35–50%. These efficiency gains directly strengthen an enterprise’s ability to manage risk and make strategic moves.
In summary, existing research indicates that business intelligence has evolved from traditional data analysis tools into an intelligent decision-support system powered by two engines: machine learning and explainable artificial intelligence. This paper establishes a complete technical loop (“regulatory disclosure text → semantically enhanced quantification → IC-5Q indicator system → XGBoost machine learning risk prediction validation → SHAP-driven explainable BI dashboard (IIC-DSS)”) to organically integrate business intelligence, machine learning, and explainable AI within risk early-warning scenarios of internal control systems. This approach offers new research perspectives and practical pathways for the integrated application of these emerging technologies in internal control.

3. Methodology

This section follows a single thread, moving from unstructured disclosures to interpretable, verifiable, and actionable internal control measurement and risk governance, and proposes an end-to-end methodological framework for implementing the IIC-DSS (Figure 1). Additionally, as shown in Figure S1 in the Supplementary Materials, a roadmap designed for non-technical readers is available. In plain terms, the workflow proceeds in four stages: (i) Quantification, that is, turning disclosure text into structured indicators; (ii) Validation, that is, verifying that the indicators jointly define COSO-based internal control constructs and forming ICI; (iii) Prediction, that is, using ICI to forecast asset quality risk; and (iv) Diagnosis, that is, using explainable AI to identify which control elements drive each risk signal. To clarify the logical connection between the research objectives and the technical implementation, Table 1 maps each research question to its corresponding methodological section and the key technical approaches employed.

3.1. Developing a “Regulation–Semantics Dual-Driven” Internal Control Indicator System

To address RQ1, the following section proposes the “dual-driven regulatory-semantic” internal control indicator system, which forms the formative index IC-5Q and the composite index ICI. The indicator system is constructed using a bottom-up strategy, initially based on third-level indicators. Using knowledge-enhanced corpus modelling and detailed element mapping, the indices are ultimately aggregated layer by layer. Our disclosure corpus includes both annual reports and ESG reports. ESG reports are incorporated because their governance narratives often disclose internal-control arrangements, such as compliance culture, audit mechanisms, and risk governance. Additionally, ESG reports include numeric governance KPIs that can serve as “hard evidence” during feature construction.

3.1.1. Building a “Regulation–Semantics Dual-Driven” Knowledge Base

The building of a “Regulatory-Semantic Dual-Driven” knowledge base aims to accumulate a rich regulatory corpus while clarifying “what constitutes knowledge and under what conditions it is permitted to enter the repository.” To achieve this, the evaluation of textual information is grounded in two complementary knowledge bases. It starts from institutionally defined conceptual boundaries: internal control frameworks and banking regulatory requirements issued by Committee of Sponsoring Organizations of the Treadway Commission (COSO), the Basel Committee on Banking Supervision (BCBS), and the China Banking and Insurance Regulatory Commission (CBIRC) are used as references (the document titles and sources are listed in the Supplementary Materials), and these references delineate the core meaning and scope of the five components of internal control. In parallel, semantic “prototypes” are constructed from sentence embeddings to capture semantic equivalence in banking disclosures under synonym substitution, syntactic rewriting, and shifts in writing style, enabling a stable identification of differences in expression.
The regulatory-driven component translates institutional provisions into sentence-level evidence and actionable rules. First, we construct a mapping pattern library for the internal control elements, which serves as an anchor for the coarse annotation of regulatory corpora and for rule generation. The tag space is limited to the five COSO elements, and Chinese and English expressions and abbreviations are standardized. Next, seed term clusters for each component, such as governance structure, stress testing, segregation of duties, risk reporting, and internal audit rectification, are extracted from COSO/regulatory texts and domain glossaries. Multiple expressions within the same semantic cluster are merged into regular-expression patterns. Pilot runs on sample regulatory sentences record hits, conflicts, and omissions; overly broad patterns are narrowed, and high-frequency omissions are supplemented. After stabilization, the patterns are consolidated into a version-maintainable mapping table. Once the regulatory documents have been parsed into sentences, a “regulatory sentence–element” correspondence can be generated. In the final phase of constructing the lexical rule base $K^{lex}$, we introduce embedding-space validation: a candidate rule is formally added to the repository and archived under the five elements only when its cosine similarity with its source regulatory sentence exceeds a threshold $\tau$.
The semantic-driven component incorporates prototype theory to mitigate discrepancies between standard terminology and banking disclosure expressions. Semantic prototype vectors are constructed for each of the five elements. Regulatory sentences are first mapped to elements $E \in \{CE, RA, CA, IC, MA\}$, and the seed set $S_E^{seed}$ is then filtered. Filtering criteria include rule-matching strength and explicit action-verb characteristics (e.g., “establish, implement, monitor, rectify, audit”). The Chinese sentence-embedding model fine-tuned for finance (BGE) computes the centroid vector $v_E^{proto}$ for the seed set (Equation (1)), which is incorporated into the knowledge base alongside $K^{lex}$.
$$ v_E^{proto} = \frac{1}{\left| S_E^{seed} \right|} \sum_{s \in S_E^{seed}} \mathrm{Embed}(s) \tag{1} $$
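As a minimal sketch of Equation (1), the prototype is simply the arithmetic mean of the seed-sentence embeddings. The function names and the cue-word encoder `toy_embed` below are hypothetical stand-ins chosen for illustration; the study uses a BGE sentence-embedding model, whose interface is not reproduced here.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def prototype_centroid(seed_sentences: Sequence[str],
                       embed: Callable[[str], Vector]) -> Vector:
    """Mean of the seed-sentence embeddings, following Equation (1)."""
    if not seed_sentences:
        raise ValueError("seed set must not be empty")
    vectors = [embed(s) for s in seed_sentences]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def toy_embed(sentence: str) -> Vector:
    """Hypothetical encoder: counts three cue words (NOT a real embedding model)."""
    cues = ("audit", "risk", "board")
    words = sentence.lower().split()
    return [float(words.count(c)) for c in cues]


# Two illustrative seed sentences for a single component.
seeds = [
    "internal audit rectify audit findings",
    "audit committee reports to board",
]
centroid = prototype_centroid(seeds, toy_embed)
```

In practice `embed` would wrap the chosen sentence encoder, and one centroid would be computed per element.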

3.1.2. Sentence–Component Mapping via Hybrid Probabilistic Constraints

To handle the interwoven nature of disclosure texts, we use a knowledge-based ‘neural-symbolic’ strategy. Appendix B illustrates how we process raw text into probability scores using a specific example. Each sentence i is represented as a membership-probability vector over the five internal control components, allowing a single sentence to load on multiple components simultaneously.
The method combines two probability measures to handle both implicit context and explicit rules. For the embedding-based semantic probability $P_{i,E}^{embed}$, we L2-normalize the sentence embedding $e_i$ and the component prototype centroids $c_E$ (the prototype vectors of Equation (1)), compute dot-product similarities $s_{i,E}$, and then apply a numerically stable row-wise Softmax (shifted by the row maximum $m_i$) to obtain a valid distribution:
$$ s_{i,E} = \frac{e_i}{\lVert e_i \rVert_2} \cdot \frac{c_E}{\lVert c_E \rVert_2}, \qquad P_{i,E}^{embed} = \frac{\exp(s_{i,E} - m_i)}{\sum_{k=1}^{5} \exp(s_{i,k} - m_i)}, \qquad m_i = \max_{k \in \{1,2,3,4,5\}} s_{i,k} \tag{2} $$
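The following sketch illustrates Equation (2) in plain Python: cosine similarities between one sentence vector and five prototypes, followed by a Softmax shifted by the row maximum. The vectors are toy values and the function names are assumptions made for this example.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def embed_probabilities(sentence_vec: Sequence[float],
                        prototypes: Sequence[Sequence[float]]) -> List[float]:
    """Membership probabilities over the components, following Equation (2)."""
    e = l2_normalize(sentence_vec)
    sims = [sum(a * b for a, b in zip(e, l2_normalize(c))) for c in prototypes]
    m = max(sims)                          # row maximum m_i
    exps = [math.exp(s - m) for s in sims]  # shifted exponentials, all <= 1
    total = sum(exps)
    return [x / total for x in exps]


# Toy 3-dimensional vectors standing in for real embeddings.
toy_prototypes = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1]]
probs = embed_probabilities([1, 0, 0], toy_prototypes)
```

Subtracting the row maximum leaves the distribution unchanged but keeps every exponent non-positive, which avoids overflow.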
To capture a more direct regulatory consistency signal, we derive the dictionary-based rule probability $P_{i,E}^{lex}$ by grouping $K^{lex}$ into weighted sub-items for each component. Hit weights are accumulated into $Score_{i,E}$, to which we apply the saturation map $1 - 1/(1 + Score_{i,E})$ to keep values within the $[0,1)$ range before row normalization. The process also validates regular expressions and automatically falls back to fixed-string matching when a pattern fails validation. We combine the two distributions into a mixed-membership probability $P_{i,E}^{mix}$, using the mixing weight $\alpha$, as shown in Equation (3). The value of $\alpha$ is tuned by grid search over $(0, 0.6]$ for each year. In the normalized embedding space, we measure sentence dissimilarity with cosine distance $d = 1 - \cos(\cdot)$. We select the fusion weight $\alpha$ as the value that yields the highest silhouette score for the clusters. To validate this method, we compared it against fixed $\alpha$ values ranging from 0.0 to 0.8. As shown in Appendix C, the model is robust. Specifically, when $\alpha$ is between 0.2 and 0.6, the rankings remain highly correlated, and the top-tier classifications stay consistent. However, performance drops at 0.8. This decline confirms that we should cap the weight at 0.6 to prevent semantic patterns from overpowering clear regulatory signals.
$$ P_{i,E}^{mix} = (1 - \alpha)\, P_{i,E}^{embed} + \alpha\, P_{i,E}^{lex}, \qquad \alpha \in (0, 0.6] \tag{3} $$
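A hedged sketch of the rule-based probability and the mixture in Equation (3) is given below. The rule patterns, weights, and sample sentence are invented for illustration, and returning a uniform distribution when no rule fires is an assumption, since the text does not specify that case.

```python
import re
from typing import List, Sequence, Tuple

Rule = Tuple[str, float]  # (pattern, weight); both hypothetical


def count_hits(pattern: str, sentence: str) -> int:
    """Count regex matches; fall back to fixed-string matching if the pattern is invalid."""
    try:
        return len(re.findall(pattern, sentence))
    except re.error:
        return sentence.count(pattern)


def lex_probabilities(sentence: str, rules: Sequence[Sequence[Rule]]) -> List[float]:
    """Saturate the accumulated hit weights with 1 - 1/(1 + score), then row-normalize."""
    scores = [sum(w * count_hits(p, sentence) for p, w in comp) for comp in rules]
    saturated = [1.0 - 1.0 / (1.0 + s) for s in scores]
    total = sum(saturated)
    if total == 0.0:
        # Assumption: with no rule hits, use a uniform distribution.
        return [1.0 / len(rules)] * len(rules)
    return [x / total for x in saturated]


def mix_probabilities(p_embed: Sequence[float], p_lex: Sequence[float],
                      alpha: float) -> List[float]:
    """Equation (3): (1 - alpha) * P_embed + alpha * P_lex with alpha in (0, 0.6]."""
    if not 0.0 < alpha <= 0.6:
        raise ValueError("alpha must lie in (0, 0.6]")
    return [(1.0 - alpha) * pe + alpha * pl for pe, pl in zip(p_embed, p_lex)]


# One illustrative rule list per component (CE, RA, CA, IC, MA).
rules = [
    [("governance", 1.0)],
    [("risk", 1.0)],
    [("segregation", 1.0)],
    [("report", 1.0)],
    [("audit", 1.0), (r"rectif\w+", 2.0)],
]
sentence = "internal audit completed; audit findings rectified"
p_lex = lex_probabilities(sentence, rules)
p_mix = mix_probabilities([0.2] * 5, p_lex, alpha=0.5)
```

Because both inputs are probability distributions, the mixture also sums to one for any admissible `alpha`.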
Considering the prevalence of “model sentences” and cross-year reuse in regulatory texts, Equation (4) further transforms the mixed probability into the final contribution weight. The quality term $\phi_i^{qual}$ integrates three constraints. Min-wise Independent Permutations Locality-Sensitive Hashing (MinHash-LSH) provides a non-duplication coefficient to penalize highly similar or cross-year-reused statements. PDF document tree reconstruction provides chapter position weights, giving greater importance to core sections such as risk management and internal control self-assessment. Digital features, combined with strong action verbs, constitute an evidence-enhancing term, increasing the contribution of sentences containing quantitative information and substantive actions. In simple terms, this step acts as a ‘quality filter.’ It penalizes vague, ‘boilerplate’ language (sentences that look like copy-pasted templates) while rewarding specific, verifiable evidence (such as numbers or hard deadlines). This ensures that the final index reflects the substance of internal control rather than the mere volume of text.
$$ w_{i,E} = \frac{P_{i,E}^{mix}\, \phi_i^{qual}}{\sum_{j=1}^{n} P_{j,E}^{mix}\, \phi_j^{qual} + \varepsilon} \tag{4} $$
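Equation (4) can be illustrated with a short normalization routine. The quality terms are taken as given inputs here, because the text does not state how the three constraints are combined into $\phi_i^{qual}$; the numeric values and the default `eps` are assumptions.

```python
from typing import List, Sequence


def contribution_weights(p_mix: Sequence[float], phi_qual: Sequence[float],
                         eps: float = 1e-12) -> List[float]:
    """Equation (4) for one component: quality-adjusted, normalized sentence weights."""
    raw = [p * q for p, q in zip(p_mix, phi_qual)]
    denom = sum(raw) + eps  # eps guards against division by zero
    return [r / denom for r in raw]


# Sentences 0 and 1 have the same mixed probability, but sentence 1 is
# treated as reused boilerplate and receives a low (hypothetical) quality term.
p_mix_E = [0.6, 0.6, 0.2]
phi = [1.0, 0.25, 1.0]
weights = contribution_weights(p_mix_E, phi)
```

In this toy case the boilerplate sentence ends up with a quarter of the weight of its otherwise identical counterpart.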

3.1.3. Hierarchical Formative Index Construction and Aggregation

After completing the knowledge base construction and component mapping, we followed the bottom-up index construction logic and set the starting point for extracting and constructing text information at the third level of the index.
Disclosure texts contain a large amount of marketing statements and macro-level noise, and “disclosure overload” causes intertemporal fluctuations in length and writing style. To keep this noise from diluting effective signals and biasing indicator construction, we first preprocessed the raw texts using Python (v3.10) and then applied sentence-level screening. Specifically, for each sentence, we calculate its relevance score $w_{i,E,t}$ in year $t$ for the corresponding internal control element $E$. Instead of overwhelming the model with hundreds of repetitive keywords, we consolidate them into six distinct themes (such as ‘Disclosure Quality’ or ‘Hard Evidence’). This reduces noise and ensures that the indicators are robust across different writing styles. Then, based on the empirical distribution of this score across samples from that year, we use the Otsu dynamic threshold method [37] to determine the segmentation point $\tau_{t,E}^{Otsu}$ that distinguishes between relevant and irrelevant sentences. Because the Otsu threshold may be too low when the signal is weak, we impose a minimum threshold $\tau_{min}$ and use the higher of the Otsu-derived cutoff and this prespecified lower bound as the effective threshold for that year–element pair. Finally, we retain all sentences whose relevance scores meet or exceed this threshold to construct the representative sentence subset for bank $b$ in year $t$ under element $E$. The screening rule and the resulting subset are detailed in Equation (5).
$$ R_{b,t,E} = \left\{ s_i : i \in b,\ w_{i,E,t} \ge \tau_{t,E} \right\}, \qquad \tau_{t,E} = \max\left( \tau_{t,E}^{Otsu},\ \tau_{min} \right) \tag{5} $$
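The screening rule in Equation (5) can be sketched as follows. The Otsu step is written as an exhaustive search over cut points between sorted scores, which maximizes the same between-class variance as the usual histogram version; this variant, the function names, and the sample scores are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def otsu_threshold(scores: Sequence[float]) -> float:
    """Cut point that maximizes between-class variance (exhaustive 1-D search)."""
    xs = sorted(scores)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two scores")
    total = sum(xs)
    best_t, best_var = xs[0], -1.0
    left_sum = 0.0
    for k in range(1, n):               # left class xs[:k], right class xs[k:]
        left_sum += xs[k - 1]
        if xs[k] == xs[k - 1]:
            continue                     # no valid cut between equal values
        w0, w1 = k / n, (n - k) / n
        mu0, mu1 = left_sum / k, (total - left_sum) / (n - k)
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, (xs[k - 1] + xs[k]) / 2.0
    return best_t


def select_sentences(weights: Sequence[float],
                     tau_min: float) -> Tuple[float, List[int]]:
    """Equation (5): keep sentences with weight >= max(Otsu cutoff, tau_min)."""
    tau = max(otsu_threshold(weights), tau_min)
    return tau, [i for i, w in enumerate(weights) if w >= tau]


# Hypothetical relevance scores for one bank, year, and element.
scores = [0.01, 0.02, 0.03, 0.20, 0.25, 0.30]
tau, kept = select_sentences(scores, tau_min=0.05)
tau_floor, kept_floor = select_sentences(scores, tau_min=0.22)
```

The second call shows the lower bound taking over when it exceeds the Otsu cutoff.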
We built the Level 3 system using “general” and “specific” dimensions. This dual approach evaluates both the format’s credibility and the content’s substance.
The general dimensions are designed to filter out purely formal noise, thereby keeping the indicators anchored in meaningful content for every internal control component. Here, we use relative attention and semantic coverage to gauge disclosure intensity and relevance. Additionally, we strengthened the “Hard Measures” dimension by extracting quantitative ESG data. Specifically, we scan governance sections in ESG reports for numeric values, such as audit frequency and board meeting counts. These figures serve as verifiable evidence.
The specific dimensions strictly correspond to the heterogeneity logic of the five COSO elements. That is, the control environment focuses on governance structure and culture; risk assessment emphasizes data quantification and foresight; control activities revolve around process automation and separation of duties; information communication examines the effectiveness of communication channels; and monitoring activities focus on the implementation of audit independence and the closure of rectification. All indicators and their calculation methods are shown in Appendix Table A1.
After constructing the third-level indicators (L3), we propose a two-stage weighting scheme that balances data distribution characteristics and theoretical priors by combining subjective and objective weighting. To make the L3 indicators more informative when they are rolled up to second-level indicators (L2), and to produce a stable IC-5Q index when L2 is further aggregated to first-level indexes (L1), we adopt two weighting steps that address different needs. Because third-level indicators can be correlated, we apply the CRITIC method [38] in the L3-to-L2 mapping to reflect both the comparative strength and the degree of conflict among standardized indicators, thereby ensuring the resulting weights better reflect the distinguishability and value of each piece of information. When aggregating from L2 to L1 elements and constructing the IC-5Q index, we introduce a game-theoretic combinatorial weighting model. The data-driven weights are combined with a uniformly distributed (subjective) prior weight vector that serves as an uninformed baseline, and the combination coefficients are chosen by minimizing the sum of squared deviations between the candidate weight vectors, yielding the final weighting scheme (Equation (6)).
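$$
\begin{aligned}
\min_{\lambda_1, \lambda_2}\ & \left\lVert \lambda_1 \omega_{unif} + \lambda_2 \omega_{CRITIC} - \omega_{CRITIC} \right\rVert_2^2 + \left\lVert \lambda_1 \omega_{unif} + \lambda_2 \omega_{CRITIC} - \omega_{unif} \right\rVert_2^2 \\
\text{s.t.}\ & \lambda_1 + \lambda_2 = 1,\quad \lambda_1 \ge 0,\quad \lambda_2 \ge 0, \\
\text{yielding}\ & \omega = \lambda_1 \omega_{unif} + \lambda_2 \omega_{CRITIC},\quad \lambda_1 = \lambda_2 = 0.5
\end{aligned}
\tag{6}
$$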
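The solution $\lambda_1 = \lambda_2 = 0.5$ follows directly from the constraint: with $\lambda_1 + \lambda_2 = 1$, the two residuals in Equation (6) reduce to $\lambda_1(\omega_{unif} - \omega_{CRITIC})$ and $\lambda_2(\omega_{CRITIC} - \omega_{unif})$, so the objective equals $(\lambda_1^2 + \lambda_2^2)\lVert \omega_{unif} - \omega_{CRITIC} \rVert_2^2$, which is minimized at equal weights. The sketch below pairs this combination with the commonly used form of CRITIC (standard deviation times summed conflict $1 - r_{jk}$); that form, the function names, and the toy data are assumptions, since the text does not spell out the exact variant.

```python
import math
from typing import List, Sequence


def _mean(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs)


def _std(xs: Sequence[float]) -> float:
    m = _mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def _corr(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; assumes neither column is constant."""
    mx, my = _mean(xs), _mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denom = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / denom


def critic_weights(columns: Sequence[Sequence[float]]) -> List[float]:
    """CRITIC: contrast (std) times conflict (sum of 1 - r) per indicator, normalized."""
    info = []
    for j, col in enumerate(columns):
        conflict = sum(1.0 - _corr(col, other)
                       for k, other in enumerate(columns) if k != j)
        info.append(_std(col) * conflict)
    total = sum(info)
    return [c / total for c in info]


def combine_with_uniform(w_data: Sequence[float], lam_unif: float = 0.5) -> List[float]:
    """Equation (6) solution: lam_unif * uniform + (1 - lam_unif) * data-driven weights."""
    n = len(w_data)
    return [lam_unif / n + (1.0 - lam_unif) * w for w in w_data]


# Three toy indicators already scaled to [0, 1]; the third moves against the others.
cols = [[0.0, 0.5, 1.0], [0.0, 0.5, 1.0], [1.0, 0.5, 0.0]]
w_critic = critic_weights(cols)
w_final = combine_with_uniform(w_critic)
```

The third indicator receives the largest CRITIC weight because it conflicts with both of the others, and the uniform prior then pulls all weights toward one third.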

3.2. Multi-Level Validation Framework: From Construct Validity to Predictive Power

We use a progressive, multilevel validation framework to examine measurement validity (whether the indicator system forms the intended construct), criterion validity (whether ICI relates to an established benchmark), and predictive validity (whether ICI explains future credit risk) to address RQ2.

3.2.1. Measurement and Criterion Validity via Formative PLS-SEM

We estimate the PLS-SEM model following standard hierarchical procedures [39,40]. Rather than entering high-dimensional L3 textual items directly, we consolidate them into six L2 dimensions per element: Disclosure Breadth, Quality, Distinctiveness, Regulatory Alignment, Hard Measures, and Specific Measures. These dimensions are treated as formative indicators because together they define the components of internal control rather than merely reflecting them. Each dimension captures a distinct aspect of disclosure, such as breadth of coverage or strength of supporting evidence, and these aspects are not interchangeable. Removing any single dimension would therefore inappropriately narrow the scope of the construct. To handle multicollinearity, we look beyond simple Variance Inflation Factor (VIF) values. We use CRITIC-based weighting during the aggregation phase to strictly reduce the impact of redundant data. We also ensure the stability of results by examining the dispersion of bootstrap weights. If diagnostics indicate potential overlap, we re-estimate the model using alternative specifications.
Structurally, the path model groups L2 dimensions into first-order elements (Ek), which then combine to form the second-order composite index (ICI). For element k in year t, the formative measurement model is defined as:
$$E_{k,t} = \sum_{d=1}^{5} \gamma_{k,d}\, L_{k,d,t} + \zeta_{k,t} \tag{7}$$
where γk,d represents the formative weight and ζk,t the disturbance. Convergent validity is strongly supported by the redundancy analysis. The SEM-derived latent constructs are nearly identical to their corresponding aggregate targets (TargetEk), with path coefficients consistently close to 1.0 and high R2 values. This indicates that abstracting the six disclosure dimensions into first-order internal control elements results in minimal information loss, validating the reliability of the hierarchical structure.
At the second-order level, ICI is formed as:
$$ICI_{t} = \sum_{k=1}^{5} \omega_{k}\, E_{k,t} + \xi_{t} \tag{8}$$
where ωk is the weight and ξt the residual. We validate ICI by assessing its association with the DIB Internal Control Index (ICDI) and conducting supplementary panel regressions (Equation (9)) to ensure the index preserves benchmark ranking logic after controlling for firm characteristics.
$$ICDI_{i,t} = \beta\, ICI_{i,t} + \Gamma X_{i,t} + \varepsilon_{i,t} \tag{9}$$

3.2.2. Out-of-Sample Predictive Validity of the Internal Control Index (ICI)

To validate the predictive capability of the internal control index (ICI) for future credit risk, this section compares out-of-sample forecasting performance across multiple models. Specifically, this study discretizes the non-performing loan change rate into a binary risk transition indicator to represent future credit risk (see Equation (11)). To avoid forward-looking bias when determining decision thresholds, the paper adopts the data-driven adaptive approach shown in Equation (10). To capture the tail risk of asset quality deterioration, we set the benchmark for the parameter at the upper quartile of historical data. Although higher quantiles can better capture extreme crises, they are prone to causing a scarcity of positive samples in small datasets, thereby making it difficult for the model to converge. In contrast, the 75th percentile can effectively capture the early stages of asset deterioration and ensure sufficient information density for model training. In addition, we use the boundary condition [τmin, τabs] to filter out fine noise during the stationary period without sacrificing sensitivity to crises.
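$$\tau_{t} = \min\left\{ \tau_{abs},\ \max\left[ \tau_{min},\ Q_{q}\left( \{ \Delta NPL_{\cdot,\,t-1} \} \right) \right] \right\},\quad q = 0.75 \tag{10}$$
$$Y_{i,t+1} = I\left\{ \Delta NPL_{i,t+1} \ge \tau_{t} \right\} \tag{11}$$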
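A minimal sketch of Equations (10) and (11) follows. The linear-interpolation quantile convention, the function names, and the numeric bounds passed as `tau_min` and `tau_abs` are illustrative assumptions; the paper does not report the values it uses.

```python
def quantile(values, q):
    """Quantile with linear interpolation between order statistics (assumed convention)."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

def adaptive_threshold(history, tau_min, tau_abs, q=0.75):
    """Equation (10): historical quantile clipped to the band [tau_min, tau_abs]."""
    return min(tau_abs, max(tau_min, quantile(history, q)))

def jump_labels(delta_npl_next, tau):
    """Equation (11): 1 if next-period NPL change reaches the threshold, else 0."""
    return [1 if d >= tau else 0 for d in delta_npl_next]

# Hypothetical historical NPL change rates observed before the forecast origin.
history = [0.0, 0.1, 0.2, 0.3, 0.4]
tau = adaptive_threshold(history, tau_min=0.05, tau_abs=0.25)
labels = jump_labels([0.10, 0.25, 0.50], tau)
```

Because the quantile is computed only from observations available before the forecast origin, the threshold itself carries no information from the period being predicted.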
Once the non-performing loan change rate was transformed into a binary risk-transition indicator and the threshold criteria were defined, out-of-sample forecasting was performed using XGBoost as the primary model. Unlike traditional linear regression, which assumes risk factors act independently, XGBoost allows us to capture complex interactions. For instance, it can detect that a weak control environment becomes critically dangerous only when combined with rapid asset expansion, a nuance that simpler models would likely miss.
Rolling-window cross-validation was used to obtain a reliable assessment of the model’s predictive performance. For each prediction window between 2017 and 2023, the model was trained only on historical data from before time T and tested on observations from period T. Because the risk events studied here are rare and the classes are heavily imbalanced, performance evaluation relies on PR-AUC and ROC-AUC for discrimination ability, Best F1 for the precision–recall trade-off, and the Brier score alongside the Top-K capture rate to quantify calibration and high-risk detection accuracy.
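The evaluation protocol can be sketched as follows. The split generator assumes an expanding training window (all years before T), which is one reading of the description above, and the two metric helpers use their standard definitions. All names and the toy data are hypothetical.

```python
def rolling_splits(years, first_test_year, last_test_year):
    """Yield (T, train_idx, test_idx): train on years before T, test on year T."""
    for T in range(first_test_year, last_test_year + 1):
        train = [i for i, y in enumerate(years) if y < T]
        test = [i for i, y in enumerate(years) if y == T]
        if train and test:
            yield T, train, test

def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)

def top_k_capture(y_true, probs, frac=0.10):
    """Share of all positive events that fall in the top `frac` of predictions."""
    n_top = max(1, int(round(frac * len(probs))))
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    total = sum(y_true)
    hits = sum(y_true[i] for i in order[:n_top])
    return hits / total if total else float("nan")

# Hypothetical bank-year observations.
years = [2015, 2016, 2017, 2017, 2018]
splits = list(rolling_splits(years, 2017, 2018))
```

For example, with ten banks and `frac=0.10`, `top_k_capture` checks whether the single highest-ranked bank is one of the banks that actually experienced a jump.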

3.3. The IIC-DSS Framework: SHAP-Based Diagnosis and Decision Support

By applying the TreeSHAP algorithm to the XGBoost framework, we isolate the marginal impact of the five internal control components on the risk of sudden NPL increases. This step essentially translates the model’s complex mathematical output into a human-readable explanation, identifying the specific why behind each risk prediction. On this basis, using probability calibration and natural language generation technologies, elaborate mathematical results are transformed into visual indicators and diagnostic reports within the business intelligence (BI) dashboard, and, ultimately, an internal control decision support system (IIC-DSS) integrating “prediction—interpretation—presentation” is constructed.
With TreeSHAP, we decompose the model output in log-odds space into an additive form of “base value + feature contributions” and map it to the final jump probability through the logistic function σ(⋅). The predicted risk of bank i at time t satisfies:
$$p_{i,t+1} = \sigma\left( \phi_{0} + \sum_{j} \phi_{i,j} \right) \tag{12}$$
In Equation (12), ϕ0 is the base value, and ϕi,j quantifies the marginal contribution of feature j to the prediction for bank i. A positive SHAP value indicates that the feature increases risk; a negative value quantifies the buffering effect of effective internal control on risk.
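The additive decomposition can be made concrete with a short sketch. The base value and the per-element contributions below are hypothetical numbers chosen for illustration, not outputs of the fitted model.

```python
import math

def sigmoid(z):
    """Logistic function mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def risk_probability(base_value, contributions):
    """Base value plus the sum of SHAP contributions, mapped through the logistic function."""
    return sigmoid(base_value + sum(contributions.values()))

# Hypothetical log-odds contributions for one bank (positive raises risk).
base_value = -4.0
phi = {"CE": 0.9, "RA": -0.2, "CA": -0.4, "IC": 0.6, "MA": -0.1}
p_with_controls = risk_probability(base_value, phi)
p_baseline = sigmoid(base_value)
```

Because the contributions are summed before the logistic transform, the same SHAP value shifts the probability by a different amount depending on where the bank already sits on the log-odds scale.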
During the empirical process, NPL jump labels in the training set are generated using the dynamic hybrid thresholds, and the XGBoost model is trained on these labels. After training, TreeSHAP performs the attribution analysis and outputs the SHAP contribution matrix, from which the corresponding risk probability is calculated, making the decomposition of each prediction explicit. The analysis summarizes the SHAP importance of the five internal control elements and, through bootstrap resampling, reports the mean importance with a 95% confidence interval to achieve a robust characterization of “which type of internal control subsystem is more critical”. It also generates a unique SHAP contribution vector for each bank for risk diagnosis.
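A minimal sketch of the bootstrap summary for one internal control element is given below. It assumes a percentile bootstrap over banks; the paper does not state which interval construction it uses, and the SHAP values shown are hypothetical.

```python
import random

def bootstrap_mean_abs(values, n_boot=2000, seed=0):
    """Mean |SHAP| with a percentile 95% interval from resampling observations."""
    rng = random.Random(seed)
    abs_vals = [abs(v) for v in values]
    n = len(abs_vals)
    point = sum(abs_vals) / n
    # Resample with replacement and record the mean of each replicate.
    means = sorted(sum(rng.choices(abs_vals, k=n)) / n for _ in range(n_boot))
    lower = means[int(0.025 * n_boot)]
    upper = means[int(0.975 * n_boot) - 1]
    return point, lower, upper

# Hypothetical SHAP values of one element across eight bank-year observations.
shap_ce = [0.36, -0.12, 0.05, -0.40, 0.22, -0.08, 0.31, -0.15]
point, lower, upper = bootstrap_mean_abs(shap_ce)
```

Repeating this for each of the five elements gives the error bars used to compare their global importance.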
After completing the attribution analysis, this study converts the calibrated risk probabilities and attribution results into decision-support information and integrates it into the BI system. Because jumps in the non-performing loan ratio are low-frequency events and the predicted probability is easily distorted by sample imbalance, a robust calibration mechanism is applied after the SHAP values are obtained. Specifically, Platt scaling based on logistic regression [41] is the preferred method for probability calibration; if the results are unstable, a prior correction strategy is used instead. Through a log-odds transformation, the predicted probability is aligned with the training set’s overall distribution.
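The fallback prior correction can be sketched as an intercept shift in log-odds space. This is a generic formulation and an assumption about what the correction looks like; the argument names `model_rate` and `target_rate` are hypothetical and the paper does not give its exact formula.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def prior_correct(p, model_rate, target_rate):
    """Shift the log-odds of p so the implied base rate moves from model_rate to target_rate."""
    return sigmoid(logit(p) - logit(model_rate) + logit(target_rate))

# Hypothetical case: a model trained on rebalanced data (50% positives)
# is aligned with a rare-event base rate of 1%.
p_raw = 0.30
p_aligned = prior_correct(p_raw, model_rate=0.50, target_rate=0.01)
```

The shift preserves the ranking of banks, so discrimination metrics such as ROC-AUC are unchanged while the probability level is corrected.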
The constructed visual interaction platform consists of three core functional modules. The summary display module presents the calibrated risk probability distribution and constructs the “System Resilience Intensity” by summing the negative SHAP values of the five internal control elements, quantifying each element’s offsetting contribution to risk. The diagnostic analysis module performs a global importance assessment with bootstrap resampling and presents the differences in contribution across internal control elements in quantified form through confidence-interval error bar plots. To support precise and effective tiered management, the intervention recommendation module employs a dynamic threshold classification technique that searches the probability quantile matrix to determine the F1-optimal threshold and a safety floor. Based on these two thresholds, banks are segmented into three risk tiers: high, medium, and low. Building on these tiering results, the framework uses Natural Language Generation technology to generate differentiated reports that expose the essential weaknesses, along with their SHAP contributions, for high-risk banks, identify emerging risks for medium-risk banks, and highlight the principal strengths of low-risk banks. Through this procedure, the IIC-DSS framework translates the outputs of complex statistical models into a set of internal control governance measures that can be implemented directly.
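The tiering logic described above can be sketched as follows. The exhaustive search over observed probabilities, the way the floor is supplied, and all names and numbers are illustrative assumptions rather than the system's actual implementation.

```python
def f1_at(y_true, probs, threshold):
    """F1 score when every probability at or above the threshold is flagged."""
    tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, probs) if p < threshold and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_f1_threshold(y_true, probs):
    """Search the observed probabilities for the F1-maximizing cut-off."""
    return max(sorted(set(probs)), key=lambda t: f1_at(y_true, probs, t))

def assign_tier(p, high_threshold, floor_threshold):
    """Three tiers: high above the F1 threshold, medium above the floor, else low."""
    if p >= high_threshold:
        return "high"
    if p >= floor_threshold:
        return "medium"
    return "low"

def resilience_intensity(shap_row):
    """Sum of the negative (risk-buffering) SHAP contributions."""
    return sum(v for v in shap_row.values() if v < 0)

# Hypothetical validation outcomes and calibrated probabilities.
y_val = [0, 0, 1, 1]
p_val = [0.10, 0.40, 0.35, 0.80]
t_high = best_f1_threshold(y_val, p_val)
tiers = [assign_tier(p, t_high, floor_threshold=0.20) for p in [0.80, 0.25, 0.10]]
buffer = resilience_intensity({"CE": 0.2, "RA": -0.1, "CA": -0.3, "IC": 0.05, "MA": -0.05})
```

In this toy case the cut-off 0.35 flags both true events at the cost of one false alarm, which is why it beats the stricter cut-off 0.80 on F1.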

4. Results and Discussion

4.1. Dataset and Descriptive Statistics

The text data are derived from the annual and ESG reports of commercial banks listed on China’s A-share market. Table A2 in Appendix A summarizes the step-by-step preprocessing pipeline that converts raw PDF annual/ESG reports into a sentence-level, section-tagged corpus with quality weights. At the numerical level, key financial and risk variables are taken from the Wind database and combined with the Internal Control Index (ICDI) provided by the DIB Internal Control Index database.
To address the small number of missing values in the sample, we evaluated candidate imputation methods using a combination of rolling time-window cross-validation and ground-truth masking. The algorithms evaluated include panel means and medians, k-Nearest Neighbors (k-NN), Random Forests, and MICE with Predictive Mean Matching (MICE-PMM). The evaluation rolls the training and validation sets forward annually and randomly masks known observations in the validation set before reconstruction. Standardized NRMSE and NMAE were calculated between imputed and actual values. Based on the principle of minimizing NRMSE and NMAE, we ultimately employed the random forest algorithm for data imputation. The programming for the index construction and validation procedures described above was conducted using R (v4.1.0). The processed descriptive statistics are presented in Table 2.
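The ground-truth masking idea can be sketched on a single series with two of the simpler candidates (mean and median fill). Normalizing the errors by the observed range is an assumption, since the normalization is not specified, and the sketch uses one random mask rather than the rolling annual windows described above. All names and values are hypothetical.

```python
import math
import random
from statistics import mean, median

def nrmse(truth, pred, scale):
    """Root mean squared error divided by a scale (here the observed range)."""
    return math.sqrt(mean([(t - p) ** 2 for t, p in zip(truth, pred)])) / scale

def nmae(truth, pred, scale):
    """Mean absolute error divided by the same scale."""
    return mean([abs(t - p) for t, p in zip(truth, pred)]) / scale

def masked_evaluation(series, imputers, mask_frac=0.2, seed=0):
    """Hide known values, refill them with each imputer, and score the reconstruction."""
    rng = random.Random(seed)
    n_mask = max(1, int(mask_frac * len(series)))
    masked = set(rng.sample(range(len(series)), n_mask))
    observed = [v for i, v in enumerate(series) if i not in masked]
    truth = [series[i] for i in sorted(masked)]
    scale = max(series) - min(series)
    results = {}
    for name, fit in imputers.items():
        fill = fit(observed)  # constant fill value learned from unmasked data
        pred = [fill] * len(truth)
        results[name] = (nrmse(truth, pred, scale), nmae(truth, pred, scale))
    return results

# Hypothetical series of a financial ratio with one outlier.
series = [1.2, 1.4, 1.1, 1.9, 1.3, 1.5, 6.0, 1.4, 1.6, 1.2]
scores = masked_evaluation(series, {"mean": mean, "median": median})
best = min(scores, key=lambda name: scores[name])
```

The same loop extends to model-based imputers by replacing the constant fill with per-observation predictions.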

4.2. Construct Validity and External Consistency of the Index System

Based on the methodology, we applied PLS-SEM to verify the construct validity and external consistency of the internal control index system constructed from complex textual content. Figure 2 illustrates the hierarchical formative path model used for this validation. The specific verification results are shown in Table 3.
The first-order measurement model results for Stage 1 show that the outer weights for the six process quality dimensions are all significantly positive (***), and the bias-corrected and accelerated (BCa) bootstrap confidence intervals do not include zero, establishing the statistical significance of the indicators. The collinearity diagnosis shows that the variance inflation factor (VIF) of all indicators is below the critical value of 3.0, which rules out problematic multicollinearity. Attributes such as disclosure breadth and consistency therefore contribute independent, non-redundant information and together constitute the five elements of internal control: control environment (CE), risk assessment (RA), control activities (CA), information and communication (IC), and monitoring activities (MA). The weight ranges vary across elements. For example, the L2 dimension weights range from 0.302 to 0.541 in the control environment (CE) and from 0.244 to 0.517 in control activities (CA), indicating that the marginal contributions of the process dimensions are not balanced across governance semantics.
Similarly, at the second-order structural level (Stage 2), the five elements, as formative indicators of the internal control index (ICI), are also significant. The weight ranking shows that information and communication (IC, 0.319) contributes most to ICI, followed by monitoring activities (MA, 0.258), control activities (CA, 0.222), and risk assessment (RA, 0.218), while the control environment (CE, 0.162) contributes the least. Convergent validity is supported by the redundancy analysis: each construct’s path coefficient to its global single-item target variable is close to 1.0, and R2 ranges from 0.959 to 0.994 (ICI: 0.988), indicating that the mapping from text features to latent construct scores exhibits no material information distortion.

4.3. Out-of-Sample Predictive Performance

To assess the incremental predictive value of the textual Internal Control Index (ICI), we employed an optimized XGBoost model interacting ICI with proxies for organizational complexity (lnAssets), risk vulnerability (NPL_lag1), and performance incentives (ROE). Other controls (CAR, leverage, and LDR) primarily reflect regulatory buffers or balance-sheet structure; treating them as main effects already absorbs important financial differences, while interacting ICI with all controls would substantially increase feature dimensionality and can reduce stability and interpretability under the very low base rate of NPL jumps.
Table 4 shows that the best-performing XGBoost specification is Controls + ESG + ICI + ICI × (lnAssets, NPL_lag1, ROE), achieving ROC-AUC = 0.909 and PR-AUC = 0.0909, with the strongest overall classification quality (Best F1 = 0.167) and strong tail-event prioritization (Top-10 capture = 0.667). Importantly, adding ICI provides incremental value beyond ESG ratings: the Controls + ESG model captures only 33.3% of actual jump events in the top decile, whereas the ESG + ICI model captures 66.7%. This is practically meaningful in a rare-event setting (base rate ≈ 0.31%): it means that when regulators or risk managers can intensively review only the top 10% of banks flagged by the model, incorporating ICI doubles the yield of true distressed cases relative to relying on financial controls and ESG scores alone. This superior tail-risk sensitivity confirms that incorporating textual internal control quality enables the detection of nonlinear risk precursors that linear models and general governance scores fail to capture.

4.4. From Explanation to Action: SHAP Diagnostics and IIC-DSS Application

We integrated the SHAP attribution mechanism into the optimal XGBoost model, aiming to identify the core elements driving the risk jump and convert them into governance diagnostic bases under the IIC-DSS framework.
The IIC-DSS is operationalized as a deployable business intelligence platform with a streamlined user workflow. Users (regulators, risk managers, or investors) upload bank PDF reports through a drag-and-drop web interface. Once a report is uploaded, the backend automatically runs the full analysis pipeline. The results appear in an interactive “Risk Diagnosis” panel. This panel shows the calibrated risk probability, SHAP force plots that break down each element’s contribution, and auto-generated remediation suggestions in plain language. An offline snapshot of the dashboard interface is available in the Supplementary Materials. The summary results indicate that the average calibrated predicted probability of an NPL jump event is approximately 0.93%. Based on the aggregated SHAP contributions of the five internal control elements, the “System Resilience Intensity” is approximately 88.3%, indicating that, in the vast majority of sample banks, the current internal control system has exerted a net inhibitory effect on credit risk and effectively buffered potential risk exposure.
In the diagnostic analysis module, the global importance assessment based on bootstrap resampling (Table 5) reveals differences in the contributions of internal control elements. The Control Environment (CE) has the highest weight (mean |SHAP| = 0.592), followed by Information & Communication (IC, 0.463) and Control Activities (CA, 0.422).
In the intervention recommendation module, TreeSHAP generates a contribution profile for each sample bank. By decomposing the prediction results, the model quantifies the marginal driving or buffering effect of each internal control element on the risk probability. On this basis, the IIC-DSS implements a three-level hierarchical strategy of “dynamic threshold as the main approach, and quantile distribution as the auxiliary approach”. To ensure sensitivity to tail risks, the system employs an optimal F1 threshold, combined with a head-protective mechanism, to jointly identify high-risk groups, thereby achieving adequate coverage of the top 10% of risk samples. For banks outside the high-risk group, the model further delineates clear boundaries between medium and low risk based on the quartile points of the probability distribution. The system then generates differentiated attribution diagnoses that identify the main governance weaknesses and the direction of their contributions, giving management, regulatory authorities, and external investors a targeted view of weak links and actionable improvement recommendations. Representative sample results are shown in Table 6.
We use China Minsheng Bank to illustrate how algorithm results can turn into governance advice. The model estimates a 12.95% chance of an NPL jump and labels the bank as “High Risk.” SHAP then shows which factors increase the risk. The biggest drivers are the Control Environment (SHAP +0.356) and Information and Communication (SHAP +0.351). This points to governance culture and internal information sharing as the core problems, not day-to-day operating errors. Based on this, the IIC-DSS does not suggest adding broad capital buffers. Instead, it recommends specific governance fixes. For example, the bank can redesign internal reporting to reduce information silos and increase board-level monitoring. Conversely, the model supports a “maintenance” strategy for the low-risk Bank of Ningbo. A negative score for “Control Activities” (−0.592) confirms that current procedures effectively reduce risk, meaning no remediation is needed.

5. Conclusions and Limitations

This paper proposes a set of procedures for quantifying complex textual information to evaluate the internal control quality of Chinese listed banks and integrates them with business intelligence to develop a visual, intelligent internal control decision support system (IIC-DSS). The PLS-SEM and XGBoost validation results indicate that this indicator system exhibits good construct validity and performs well in predicting the probability of an increase in non-performing loans. Furthermore, the system dashboard integrates interpretability tools such as TreeSHAP, enabling the model to analyze the marginal contributions of internal control elements and automatically generate diagnostic reports for individual banks, helping them identify governance weak links more effectively and clarify improvement directions.
Several limitations remain, mainly related to data coverage, regulatory dependence, and external validity. The analysis uses Chinese A-share-listed banks because their annual reports are standardized and consistently accessible; as a result, the model may not fully capture the risk patterns of non-listed banks. In addition, the textual feature engineering was developed around China’s Basic Standard for Enterprise Internal Control and guidance from the National Financial Regulatory Administration. While COSO and Basel principles are widely applicable, their linguistic realization varies by jurisdiction, so applying the model under regimes such as the U.S. Sarbanes–Oxley Act or the European Banking Authority Guidelines would require revalidating the semantic dictionary. Cross-regional transfer is therefore not plug-and-play: parameters estimated from Chinese disclosures cannot be directly used for EU or U.S. banks, although the modular design supports adaptation. Components that transfer relatively well include the COSO five-element structure, preprocessing logic, the XGBoost framework, and SHAP-based interpretation. By contrast, regulatory seed terms, section-tagging rules, and semantic prototype vectors need to be rebuilt using local regulations and disclosure corpora. Generalizability may also be constrained by institutional differences, including the role of state ownership in China’s banking sector. Future work will broaden validation across regulatory settings, incorporate process-mining logs, and explore more advanced NLP methods to automate governance diagnostics further.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/systems14030234/s1, Figure S1: Methodological roadmap; Table S1: Detailed diagnostic results for the remaining sample banks; Table S2: Regulatory Documents Related to Internal Control of Commercial Banks (International and China); File S1: IIC-DSS Intelligent Internal Control Decision System (Offline Snapshot).

Author Contributions

Conceptualization, Y.L. and X.L.; methodology, Y.L. and X.L.; software, X.L.; validation, Y.L., X.L. and C.S.; formal analysis, Y.L. and X.L.; investigation, X.L.; resources, C.S.; data curation, X.L. and C.S.; writing—original draft preparation, X.L.; writing—review and editing, Y.L. and C.S.; visualization, X.L.; supervision, Y.L.; project administration, C.S.; funding acquisition, C.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Program for the Philosophy and Social Sciences Research of Higher Learning Institutions of Shanxi (No. 2024W066).

Data Availability Statement

The data presented in this study are available from the corresponding author upon request, as they are part of ongoing research.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Structure of the Internal Control Index System.
L1 Element | L2 Dimension | L3 Indicator | Operational Definition & Computation Method
Common Indicators (Applied to all 5 Elements) | Disclosure Breadth | Focus_E | Relative Attention: Sum of hybrid weights wi,E, normalized by total sentence count; measures the intensity of disclosure for element E.
 |  | Coverage_E | Semantic Coverage: Overlap between bank sentences and element-specific sub-theme embeddings, calculated via dynamic thresholds and sigmoid smoothing.
 |  | Topic Entropy_E | Thematic Diversity: Normalized Shannon entropy of embedding clusters; measures the diversity of topics in element E.
 | Disclosure Quality | Readability_E | Linguistic Readability: Weighted average (wi,E) of sentence readability scores (Gaussian-smoothed sentence length).
 |  | Commit_E | Implementation Strength: Weighted frequency of explicit action verbs (e.g., “establish”, “enforce”), penalizing vague/hedging expressions.
 | Distinctiveness | Spec_E | Peer Divergence: Jensen–Shannon Divergence (JSD) between the bank’s topic distribution vector and the peer group average; measures idiosyncrasy.
 | Reg. Alignment | Align_E | Prototype Affinity: Weighted cosine similarity between bank sentences and the Regulatory Semantic Prototype (vproto) of element E.
 |  | RegCover_E | Regulatory Breadth: The proportion of regulatory corpus sentences (from the RD database) semantically “covered” (Top-k similarity) by bank disclosures.
 | Hard Measures | Meas_E | Quantitative Density: Density of general quantitative tokens (numbers, percentages, currency) per 1000 characters within element E.
CE Control Environment | Specific Measures | GovScore_CE | Governance Structure: Weighted hit rate of corporate governance entities (e.g., “Board”, “Supervisory Board”, “Three Lines of Defense”).
 |  | Ethics_CE | Culture & Ethics: Weighted density of terms related to “integrity”, “code of conduct”, “compliance culture”, and “anti-corruption”.
 |  | Whistle_CE | Whistleblowing Mechanism: Intensity of disclosures regarding reporting channels, whistleblower protection, and anonymous reporting.
RA Risk Assessment | Specific Measures | RDQ_RA | Risk Data Quantification: Density of specific risk metrics (e.g., NPL ratio, LCR, VaR, stress test results) defined in risk dictionaries.
 |  | RiskClass_RA | Risk Taxonomy: The count of distinct risk types (e.g., Credit, Market, Liquidity, Climate) mentioned, normalized by the total risk taxonomy size.
 |  | Foresight_RA | Forward-looking Capability: Weighted frequency of future-oriented modal words and terms like “stress scenario”, “sensitivity analysis”.
CA Control Activities | Specific Measures | CAD_CA | Control Descriptors: Weighted density of procedural control terms (e.g., “approval”, “reconciliation”, “verification”, “limit management”).
 |  | Seg_CA | Segregation of Duties: Intensity of disclosures related to “incompatible posts”, “separation of duties (SoD)”, and “checks and balances”.
 |  | AutoCtrl_CA | Automated Controls: Density of terms related to IT General Controls (ITGC), RPA, system constraints, and rigid control embedding.
 |  | ChgMgmt_CA | Change Management: Hit rate of terms concerning system changes, UAT testing, code review, and version control.
IC Info & Comm | Specific Measures | ChannelDF_IC | Channel Diversity: Weighted summation of distinct communication channel mentions (e.g., “hotline”, “portal”, “app”, “matrix”).
 |  | DataGov_IC | Data Governance: Intensity of terms related to “data quality”, “data lineage”, “standardization”, and “privacy protection”.
 |  | ITInfra_IC | IT Infrastructure Depth: Product of IT infrastructure term density (e.g., “cloud”, “data lake”) and their entropy (diversity).
MA Monitoring Activities | Specific Measures | Assure_MA | Independent Assurance: Weighted presence of external audit terms, assurance opinions, and “unqualified opinion” declarations.
 |  | ContMon_MA | Continuous Monitoring: Density of terms related to “real-time monitoring”, “early warning”, “automatic detection”, and “continuous audit”.
 |  | Remedy_MA | Remediation Loop: Intensity of disclosures regarding “rectification”, “defects”, “tracking”, and “closed-loop management”.
 |  | MonFreq_MA | Monitoring Frequency: Weighted score based on temporal frequency keywords (Real-time = 5 > Daily = 4 > ... > Annual = 1).
 |  | ExtCons_MA | External Constraint: Hit rate of signals regarding regulatory inspections, notifications, and external supervision feedback.
Table A2. Step-by-Step Preprocessing Pipeline for Bank Disclosure Documents.
Step 1: Format Unification & Parsing
Technical Implementation:
  • Engine: Primary parsing via PyMuPDF; fallback to pdfminer or pdfplumber
  • OCR: Selective Tesseract OCR is applied only when the text density
  • Filter: Pages containing keywords like “Contents” or “Index” are identified as TOC and discarded.
Purpose & Output: Converts heterogeneous PDF formats (scanned/digital) into a unified text stream while removing non-substantive navigation pages.
Step 2: Structural Cleaning
Technical Implementation:
  • Header/Footer Removal: Frequency-based detection; lines appearing on >60% of pages (excluding page numbers) are stripped.
  • Noise Removal: Cleaning of control characters and normalization of Unicode.
Purpose & Output: Eliminates recurring page artifacts that create false duplicates and inflate noise levels.
Step 3: Segmentation & Normalization
Technical Implementation:
  • Split: Hybrid sentence segmentation using Regex (handling quotes/brackets) + HanLP NLP toolkit.
  • Min-Length: Sentences < 6 characters are dropped.
  • Conversion: Global conversion of Traditional Chinese to Simplified Chinese (via OpenCC/ZhConv) to unify script variants.
Purpose & Output: Transforms continuous text blocks into discrete, grammatically complete, and standardized sentence units for analysis.
Step 4: Numeric & Keyword Tagging
Technical Implementation: We apply regex matching to flag the following:
  • Hard Evidence: Numeric units (%, yuan, times, dates)
  • Boilerplate: High-frequency template phrases (e.g., “The Board guarantees truthfulness”).
  • Vague Terms: Hedging words (e.g., “basically”, “to a certain extent”).
Purpose & Output: Pre-computation step: Annotates each sentence with binary flags. Note: The specific weighting logic using these tags (e.g., penalties/bonuses) is detailed in Section 3.1.2.

Appendix B. Examples of Disclosure Assessment Based on the COSO Framework

Appendix B.1. High-Specificity Disclosure (Source: Annual Report)

We analyze the following sentence: “The Board of Directors annually reviews the effectiveness of the internal control system, approves the bank’s risk appetite statement, and ensures that material deficiencies identified by the internal audit department are remediated within 90 days.” The model begins by extracting key signals from the text. Phrases like “Board of Directors” point to the Control Environment, while “risk appetite” maps to Risk Assessment. The specific mention of “remediated within 90 days” supports Monitoring Activities. Additionally, the “internal audit department” indicates a clear reporting channel. Next, the embedding step compares the text to standard categories. It finds the strongest match with Control Environment (0.82), followed by Risk Assessment and Monitoring. The model then mixes these rule-based and embedding results. This hybrid approach produces a primary score focused on the Control Environment. The model also evaluates the quality of the writing. The sentence is readable and precise. Using concrete terms like “90 days” avoids vagueness and earns a quality bonus. As a result, the sentence retains its full weight. Ultimately, the analysis treats this as strong evidence of oversight, which improves the scores for Control Environment and Monitoring.

Appendix B.2. Low-Specificity Disclosure (Source: ESG Report)

We then evaluate the following sentence: “The Bank continuously improves its internal control management system and strives to ensure strict compliance with relevant national laws and regulations to support sustainable development.” In the first stage, rule-based keyword matching produces only weak signals. “Internal control management system” registers a mild hit under Control Activities, but the term is generic and lacks specificity. Similarly, “compliance” loosely maps to the Control Environment, though it reads more as a broad aspiration than a concrete governance mechanism. In the embedding check, the sentence sits close to a boilerplate pattern (0.85). It is far from the COSO component prototypes, with all scores below 0.35, so the topic focus is unclear. The hybrid step combines both sources, resulting in a scattered set of probabilities. No single component stands out. In the quality step, the base score is 0.90. The algorithm applies a −35% boilerplate penalty because the phrasing follows a common template, such as “continuously improves…”. It also applies a −25% vagueness penalty due to soft words such as “strives to,” “relevant,” and “support,” as well as the lack of concrete actions or targets. There is no numeric or verifiable detail, so no bonus is added. The final weight is 0.36 (0.90 × (1 − 0.35 − 0.25)). As a result, the sentence is down-weighted by 64% and adds little to the index. It is treated as defensive wording rather than solid evidence, which helps prevent score inflation from vague disclosure.
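The arithmetic in this example can be reproduced with a short sketch. The function name and the additive treatment of the bonus term are assumptions made for illustration; only the base score and the two penalties come from the example above.

```python
def sentence_weight(base, boilerplate_penalty=0.0, vagueness_penalty=0.0, bonus=0.0):
    """Quality weight: base score scaled by one minus the penalties plus any bonus."""
    return base * (1.0 - boilerplate_penalty - vagueness_penalty + bonus)

# Values from the low-specificity ESG sentence in Appendix B.2.
weight = sentence_weight(0.90, boilerplate_penalty=0.35, vagueness_penalty=0.25)
down_weighting = 1.0 - weight
```

With these inputs the weight evaluates to 0.36, so the sentence contributes roughly a third of what a fully weighted sentence would.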

Appendix C. Sensitivity Analysis of Hyperparameter α

Table A3. Robustness of the Internal Control Index (ICI) under Different α Values.
Comparison Scenario | Spearman’s ρ | Top 20% Overlap | Mean Absolute Rank Change | Quartile Change Rate
α = 0.0 vs. Baseline | 0.319 | 0.340 | 11.34 | 0.677
α = 0.2 vs. Baseline | 0.976 | 0.888 | 1.84 | 0.187
α = 0.4 vs. Baseline | 0.976 | 0.843 | 1.85 | 0.175
α = 0.6 vs. Baseline | 0.966 | 0.822 | 2.31 | 0.215
α = 0.8 vs. Baseline | 0.941 | 0.778 | 2.89 | 0.256
Overall Consistency (Kendall’s W) | 0.793
Notes: This table compares the ICI derived from fixed α values against the baseline (dynamically optimized α). Metrics are averages across the sample period. Kendall’s W is calculated across all scenarios (including baseline) to measure global agreement.
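Three of the stability metrics in Table A3 can be computed as in the sketch below. It assumes average ranks for ties and a rounded cut-off for the top-20% set; the function names and the two toy score vectors are hypothetical.

```python
import math

def ranks(scores):
    """Ranks with 1 for the highest score; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    out = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(a, b):
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in ra) * sum((y - mb) ** 2 for y in rb))

def mean_abs_rank_change(a, b):
    """Average absolute difference in rank position between two index versions."""
    return sum(abs(x - y) for x, y in zip(ranks(a), ranks(b))) / len(a)

def top_share_overlap(a, b, share=0.20):
    """Fraction of the top `share` banks under one version that stay on top under the other."""
    k = max(1, round(share * len(a)))
    top_a = set(sorted(range(len(a)), key=lambda i: -a[i])[:k])
    top_b = set(sorted(range(len(b)), key=lambda i: -b[i])[:k])
    return len(top_a & top_b) / k

# Hypothetical ICI scores for five banks under a fixed alpha and under the baseline.
ici_fixed = [0.9, 0.7, 0.5, 0.3, 0.1]
ici_base = [0.8, 0.9, 0.4, 0.3, 0.2]
rho = spearman(ici_fixed, ici_base)
marc = mean_abs_rank_change(ici_fixed, ici_base)
overlap = top_share_overlap(ici_fixed, ici_base)
```

In this toy case only the top two banks swap places, so the rank correlation stays high while the top-20% overlap drops to zero, which shows why the table reports both kinds of metric.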
Table A4. Structural Stability of Internal Control Components.
Panel A: Global Consistency Across All Scenarios
Element | Kendall’s W | Coefficient of Variation | Stability Assessment
Control Activities (CA) | 0.899 | 0.045 | Very High
Control Environment (CE) | 0.825 | 0.077 | High
Risk Assessment (RA) | 0.800 | 0.057 | High
Monitoring Activities (MA) | 0.749 | 0.117 | Moderate
Information & Comm (IC) | 0.705 | 0.119 | Moderate
Panel B: Boundary Testing (α = 0.8 vs. Baseline)
Element | Spearman’s ρ | MARC (Rank Change) | Interpretation of α = 0.8 Impact
Control Activities (CA) | 0.961 | 2.36 | Robust to high semantic weight.
Control Environment (CE) | 0.928 | 3.23 | Benefits from high semantic weight.
Risk Assessment (RA) | 0.881 | 3.88 | Degraded: Semantic drift dilutes governance rules.
Monitoring Activities (MA) | 0.778 | 5.79 | Degraded: Audit outcomes require strict rule matching.
Information & Comm (IC) | 0.736 | 6.73 | Degraded: Risk metrics require precise quantification.
Note: This table assesses whether the five internal control elements remain stable. Panel A reports the global consistency (Kendall’s W) across all α values. Panel B highlights the degradation of the elements marked “Degraded” (RA, MA, IC) when α exceeds the 0.6 cap.

References

  1. Aebi, V.; Sabato, G.; Schmid, M. Risk management, corporate governance, and bank performance in the financial crisis. J. Bank. Financ. 2012, 36, 3213–3226.
  2. Baugh, M.; Ege, M.S.; Yust, C.G. Internal Control Quality and Bank Risk-Taking and Performance. Audit. J. Pract. Theory 2020, 40, 49–84.
  3. Basel Committee on Banking Supervision. Sound Practices: Implications of Fintech Developments for Banks and Bank Supervisors; Bank for International Settlements: Basel, Switzerland, 2018; Available online: https://www.bis.org/bcbs/publ/d431.htm (accessed on 21 February 2026).
  4. The People’s Bank of China. Financial Technology Development Plan (2022–2025); The People’s Bank of China: Beijing, China, 2022. Available online: https://www.pbc.gov.cn/zhengwugongkai/4081330/4406346/4693549/4470403/index.html (accessed on 21 February 2026).
  5. Basel Committee on Banking Supervision. Basel III: Finalising Post-Crisis Reforms; The Bank for International Settlements: Basel, Switzerland, 2017; Available online: https://www.bis.org/bcbs/publ/d424.pdf (accessed on 21 February 2026).
  6. The People’s Bank of China. China Financial Stability Report (2025); The People’s Bank of China: Beijing, China, 2025. Available online: https://www.pbc.gov.cn/goutongjiaoliu/113456/113469/2025122616592613805/index.html (accessed on 21 February 2026).
  7. Kuang, Y.; Li, Z.; Liang, R. Disclosure of internal control evaluation reports of Chinese enterprises: History, problems and strategies. Financ. Res. Lett. 2024, 66, 105642.
  8. Senave, E.; Jans, M.J.; Srivastava, R.P. The application of text mining in accounting. Int. J. Account. Inf. Syst. 2023, 50, 100624.
  9. Bochkay, K.; Brown, S.V.; Leone, A.J.; Tucker, J.W. Textual Analysis in Accounting: What’s Next? Contemp. Account. Res. 2023, 40, 765–805.
  10. Monteiro, A.; Cepêda, C.; Da Silva, A.C.F.; Vale, J. The Relationship between AI Adoption Intensity and Internal Control System and Accounting Information Quality. Systems 2023, 11, 536.
  11. Chen, H.; Huang, X. Internal control indexes for listed firms in China: Logic, design and validation. Audit. Res. 2019, 207, 55–63.
  12. Gordon, L.A.; Loeb, M.P.; Tseng, C.-Y. Enterprise risk management and firm performance: A contingency perspective. J. Account. Public Policy 2009, 28, 301–327.
  13. Lin, B.; Lin, D.; Hu, W.; Xie, F.; Yang, Y. Research on goal-oriented internal-control index. Account. Res. 2014, 8, 16–24.
  14. Ashbaugh-Skaife, H.; Collins, D.W.; Kinney, W.R., Jr. The discovery and reporting of internal control deficiencies prior to SOX-mandated audits. J. Account. Econ. 2007, 44, 166–192.
  15. Doyle, J.; Ge, W.; McVay, S. Determinants of weaknesses in internal control over financial reporting. J. Account. Econ. 2007, 44, 193–223.
  16. Deumes, R.; Knechel, W.R. Economic incentives for voluntary reporting on internal risk management and control systems. Audit. J. Pract. Theory 2008, 27, 35–66.
  17. Lin, B.; Lin, D.; Hu, W.; Xie, F.; Yang, Y. Research of Internal Control Index Based on Information Disclosure. Account. Res. 2016, 12, 12–20.
  18. Chen, H.; Dong, W.; Han, H.; Zhou, N. A comprehensive and quantitative internal control index: Construction, validation, and impact. Rev. Quant. Financ. Account. 2017, 49, 337–377.
  19. Boritz, J.E.; Hayes, L.; Lim, J.-H. A content analysis of auditors’ reports on IT internal control weaknesses: The comparative advantages of an automated approach to control weakness identification. Int. J. Account. Inf. Syst. 2013, 14, 138–163.
  20. Rich, K.T.; Roberts, B.L.; Zhang, J.X. Linguistic Tone and Internal Control Reporting: Evidence from Municipal Management Discussion and Analysis Disclosures. J. Gov. Nonprofit Account. 2018, 7, 24–54.
  21. Boskou, G.; Kirkos, E.; Spathis, C. Classifying internal audit quality using textual analysis: The case of auditor selection. Manag. Audit. J. 2019, 34, 924–950.
  22. Liu, B.; Li, Y.; Chi, J.D. Internal control willingness, internal control level and earnings management methods—The measurement method based on text analysis and machine learning. Sci. Res. Manag. 2021, 42, 166–174.
  23. Huang, A.; Wang, H.; Yang, Y. FinBERT: A Large Language Model for Extracting Information from Financial Text. Contemp. Account. Res. 2022, 40, 806–841.
  24. Yang, H.; Liu, X.-Y.; Wang, C.D. FinGPT: Open-Source Financial Large Language Models. arXiv 2023, arXiv:2306.06031.
  25. Chiu, I.C.; Hung, M.-W. Finance-specific large language models: Advancing sentiment analysis and return prediction with LLaMA 2. Pac.-Basin Financ. J. 2025, 90, 102632.
  26. Chen, H.; Chiang, R.H.L.; Storey, V.C. Business Intelligence and Analytics: From Big Data to Big Impact. MIS Q. 2012, 36, 1165–1188.
  27. Visinescu, L.; Jones, M.; Sidorova, A. Improving Decision Quality: The Role of Business Intelligence. J. Comput. Inf. Syst. 2015, 57, 58–66.
  28. Ji, L.; Li, S. A dynamic financial risk prediction system for enterprises based on gradient boosting decision tree algorithm. Syst. Soft Comput. 2025, 7, 200189.
  29. Duan, H.K.; Vasarhelyi, M.A.; Codesso, M. Integrating Process Mining and Machine Learning for Advanced Internal Control Evaluation in Auditing. J. Inf. Syst. 2025, 39, 55–75.
  30. Barredo Arrieta, A.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115.
  31. Weber, P.; Carl, K.V.; Hinz, O. Applications of Explainable Artificial Intelligence in Finance—A systematic review of Finance, Information Systems, and Computer Science literature. Manag. Rev. Q. 2024, 74, 867–907.
  32. Lu, Y.-H.; Lin, Y.-C. The determinants of voluntary disclosure: Integration of eXtreme gradient boost (XGBoost) and explainable artificial intelligence (XAI) techniques. Int. Rev. Financ. Anal. 2024, 96, 103577.
  33. Kou, H.; Tang, R.; Chen, N. Enterprise Digitalization and ESG Performance: Evidence from Interpretable AI Large Language Models. Systems 2025, 13, 832.
  34. Rane, N.; Paramesha, M.; Choudhary, S.; Rane, J. Business Intelligence and Business Analytics with Artificial Intelligence and Machine Learning: Trends, Techniques, and Opportunities. SSRN Electron. J. 2024.
  35. Ebule, A. The Role of Business Intelligence and Artificial Intelligence in Real-Time Decision Making. Int. J. Sci. Res. Manag. (IJSRM) 2025, 13, 1902–1916.
  36. Chebrolu, S.K. AI-Powered Business Intelligence: A Systematic Literature Review on the Future of Decision-Making in Enterprises. Am. J. Sch. Res. Innov. 2025, 4, 33–62.
  37. Otsu, N. A Threshold Selection Method from Gray-Level Histograms. IEEE Trans. Syst. Man Cybern. 1979, 9, 62–66.
  38. Diakoulaki, D.; Mavrotas, G.; Papayannakis, L. Determining objective weights in multiple criteria problems: The critic method. Comput. Oper. Res. 1995, 22, 763–770.
  39. Becker, J.-M.; Klein, K.; Wetzels, M. Hierarchical latent variable models in PLS-SEM: Guidelines for using reflective-formative type models. Long Range Plan. Int. J. Strateg. Manag. 2012, 45, 359–394.
  40. Hair, J.; Hult, G.T.M.; Ringle, C.; Sarstedt, M. A Primer on Partial Least Squares Structural Equation Modeling (PLS-SEM), 3rd ed.; SAGE Publications, Inc.: Thousand Oaks, CA, USA, 2022.
  41. Niculescu-Mizil, A.; Caruana, R. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning; Association for Computing Machinery: New York, NY, USA, 2005; pp. 625–632.
Figure 1. End-to-end methodological framework of the IIC-DSS. The architecture integrates “unstructured disclosure text → structured evidence → formative index → validation and prediction → executable interventions” into a traceable analytical chain. Layer 1 parses PDF reports to build a structure-aware corpus, then constructs a dual-driven knowledge base by combining a rule-based regulatory dictionary with embedding-based semantic prototype vectors. Layer 2 applies an optimized hybrid membership-probability algorithm and a quality filter to convert unstructured disclosures into a quality-weighted component evidence matrix, mapping sentences to internal control components. Layer 3 uses adaptive Otsu thresholding to extract representative evidence, then aggregates micro-level indicators into a hierarchical formative index, IC-5Q, via the CRITIC method and a game-theoretic combined-weighting scheme. Layer 4 integrates measurement validity testing using PLS-SEM with rolling-window predictive validity testing using XGBoost; TreeSHAP attribution is used for diagnostic explanation, yielding actionable governance intervention recommendations. A control-theoretic feedback loop (red dashed line) feeds these insights back to guide future refinement of disclosure and internal control improvement.
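The Otsu thresholding step mentioned for Layer 3 can be illustrated on a one-dimensional list of sentence weights. The sketch below implements the classic Otsu criterion (choose the cut that maximizes between-class variance) in plain Python. It is an illustration under assumptions: the function name, the toy weights, and the use of the raw values instead of a histogram are hypothetical, and the "adaptive" variant used in the framework is not specified here and is not reproduced.

```python
def otsu_threshold(values):
    """Return the cut point that maximizes between-class variance (Otsu, 1979)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    best_cut, best_var = xs[0], -1.0
    left_sum = 0.0
    for i in range(1, n):  # candidate split: xs[:i] versus xs[i:]
        left_sum += xs[i - 1]
        if xs[i] == xs[i - 1]:
            continue  # equal neighbors cannot be separated by a cut
        w0, w1 = i / n, (n - i) / n                    # class proportions
        mu0 = left_sum / i                             # mean of the lower class
        mu1 = (total - left_sum) / (n - i)             # mean of the upper class
        between = w0 * w1 * (mu0 - mu1) ** 2           # between-class variance
        if between > best_var:
            best_var = between
            best_cut = (xs[i - 1] + xs[i]) / 2         # midpoint of the gap
    return best_cut


# Hypothetical quality-weighted sentence scores for one component.
weights = [0.10, 0.12, 0.15, 0.36, 0.80, 0.85, 0.90]
cut = otsu_threshold(weights)
representative = [w for w in weights if w > cut]  # evidence kept above the cut
```

With these toy weights the cut falls in the gap between the low-weight group and the three high-weight sentences, so only the latter are retained as representative evidence.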
Figure 2. Hierarchical Formative PLS-SEM Path Model. * p < 0.05, *** p < 0.001.
Table 1. Mapping of Research Questions to Methodological Framework.
Each entry gives the Research Question (RQ) and its Corresponding Section in brackets, followed by the Key Methods & Techniques.
(RQ1) How can unstructured text in bank reports be turned into a multidimensional quantitative framework? [Section 3.1]
  • Knowledge Base Construction: Merging regulatory rules with embedding-based semantic prototypes.
  • Neural-Symbolic Mapping: Using hybrid membership probability to map sentences to internal control components.
  • Index Aggregation: Constructing the hierarchical formative index (IC-5Q) via CRITIC and game-theoretic weighting.
(RQ2) Does a model that includes text-mined internal-control variables predict outcomes significantly better? [Section 3.2]
  • Construct Validity: Formative PLS-SEM to verify the structural relationship of indicators.
  • Predictive Validity: Out-of-sample rolling-window forecasting using XGBoost to test the incremental predictive power of the Internal Control Index (ICI) on asset quality.
(RQ3) How can text-driven indicators be operationalized to support model interpretation and risk prioritization? [Section 3.3]
  • Explainable AI: Applying TreeSHAP to isolate marginal contributions of control factors.
  • Risk Calibration: Using Platt scaling and dynamic thresholding for tiered risk management.
  • Decision Support: Integrating Natural Language Generation into a BI dashboard (IIC-DSS) for actionable governance interventions.
Table 2. Descriptive Statistics of Main Variables.
| Variable | Definition | Mean | SD | Min | Median | Max |
|---|---|---|---|---|---|---|
| Panel A: Internal Control Indices | | | | | | |
| ICI | Composite Internal Control Index (0–100) | 47.644 | 9.275 | 20.338 | 48.800 | 69.201 |
| CE | Control Environment Index | 52.112 | 8.631 | 23.485 | 52.921 | 81.123 |
| RA | Risk Assessment Index | 51.566 | 10.741 | 19.514 | 52.293 | 84.664 |
| CA | Control Activities Index | 42.397 | 12.365 | 13.345 | 43.153 | 74.968 |
| IC | Information & Communication Index | 47.907 | 11.465 | 15.270 | 48.623 | 77.893 |
| MA | Monitoring Activities Index | 45.267 | 10.644 | 18.032 | 45.817 | 72.196 |
| Panel B: Benchmark & Outcome Variables | | | | | | |
| ICDI | DIB Internal Control Index | 39.892 | 4.984 | 5.530 | 39.884 | 54.190 |
| NPL | Non-Performing Loan Ratio | 0.014 | 0.004 | 0.007 | 0.014 | 0.025 |
| Panel C: Control Variables | | | | | | |
| lnAssets | Natural Log of Total Assets | 27.934 | 1.686 | 24.992 | 27.730 | 31.519 |
| ROE | Return on Equity (%) | 0.116 | 0.029 | 0.034 | 0.113 | 0.264 |
| CAR | Capital Adequacy Ratio (%) | 0.139 | 0.019 | 0.105 | 0.136 | 0.339 |
| LDR | Loan-to-Deposit Ratio | 0.740 | 0.107 | 0.349 | 0.755 | 1.052 |
| Leverage | Leverage Ratio | 0.066 | 0.010 | 0.036 | 0.066 | 0.097 |
Notes: The sample consists of 420 bank-year observations representing 42 banks over a 10-year period. All Internal Control Indices in Panel A are standardized to a scale of 0 to 100.
Table 3. PLS-SEM Validation Results for the IC5Q/ICI System.
| Stage/Aspect | Construct/Path | Weight/Coeff. (β) | SE | 95% BCa CI | R² |
|---|---|---|---|---|---|
| Stage 1: First-Order Constructs | | | | | |
| Formative Weights (Range) | L2 → CE (6 items) | 0.302–0.541 *** | 0.018–0.048 | All Positive | |
| | L2 → RA (6 items) | 0.269–0.511 *** | 0.019–0.049 | All Positive | |
| | L2 → CA (6 items) | 0.244–0.517 *** | 0.012–0.030 | All Positive | |
| | L2 → IC (6 items) | 0.014–0.471 | 0.017–0.039 | [−0.021, 0.552] | |
| | L2 → MA (6 items) | 0.214–0.434 *** | 0.013–0.043 | All Positive | |
| Redundancy Analysis | CE → Target_CE | 0.995 *** | 0.001 | [0.992, 0.996] | 0.989 |
| Convergent Validity | RA → Target_RA | 0.989 *** | 0.002 | [0.986, 0.992] | 0.979 |
| | CA → Target_CA | 0.995 *** | 0.001 | [0.994, 0.996] | 0.990 |
| | IC → Target_IC | 0.979 *** | 0.004 | [0.969, 0.986] | 0.959 |
| | MA → Target_MA | 0.997 *** | 0.0005 | [0.996, 0.998] | 0.994 |
| Stage 2: Second-Order ICI | | | | | |
| Formative Weights | CE → ICI | 0.162 *** | 0.032 | [0.093, 0.216] | |
| | RA → ICI | 0.218 *** | 0.031 | [0.155, 0.274] | |
| | CA → ICI | 0.222 *** | 0.026 | [0.180, 0.284] | |
| | IC → ICI | 0.319 *** | 0.023 | [0.270, 0.361] | |
| | MA → ICI | 0.258 *** | 0.035 | [0.199, 0.333] | |
| Redundancy Analysis | ICI → Target_ICI | 0.994 *** | 0.001 | [0.991, 0.995] | 0.988 |
| Criterion Validity | ICI → ICDI | 0.244 * | 0.087 | [0.060, 0.397] | 0.060 |
Notes: * p < 0.05, *** p < 0.001. Standard errors and confidence intervals are obtained from 800 cluster bootstrap samples, with “bank” as the clustering unit and a BCa adjustment.
Table 4. Out-of-Sample Prediction Performance for NPL Jumps (2020–2023 Rolling CV).
| Model | Specification | ROC-AUC | PR-AUC | Brier | Best F1 | Top-10 Capture |
|---|---|---|---|---|---|---|
| XGBoost | Controls + ESG + ICI + ICI × (lnAssets, NPL_lag1, ROE) | 0.909 | 0.0909 | 0.179 | 0.167 | 0.667 |
| XGBoost | Controls + ICI + ICI × (lnAssets, NPL_lag1, ROE) | 0.755 | 0.0577 | 0.161 | 0.129 | 0.667 |
| XGBoost | Controls + ESG | 0.758 | 0.0596 | 0.155 | 0.133 | 0.333 |
| XGBoost | Controls Only | 0.752 | 0.0559 | 0.167 | 0.125 | 0.333 |
| XGBoost | Controls + G-score + ICI + ICI × (lnAssets, NPL_lag1, ROE) | 0.830 | 0.0508 | 0.333 | 0.097 | 0.333 |
| XGBoost | Controls + ICDI + ICDI × (lnAssets, NPL_lag1, ROE) | 0.570 | 0.0241 | 0.202 | 0.056 | 0.000 |
Notes: All metrics are computed from out-of-sample predictions obtained via annual rolling cross-validation for 2020–2023 and then pooled. Control variables include lnAssets, capital adequacy ratio (CAR), return on equity (ROE), lagged NPL ratio (NPL_lag1), leverage ratio, and loan-to-deposit ratio (LDR). ICI is the newly constructed composite internal control index; ICDI is the DIB internal control index. “XGBoost” denotes gradient-boosted decision trees. The Top-10 capture rate is, for each year, the proportion of all actual NPL jump events that fall within the top 10% of observations ranked by predicted jump probability.
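The Top-10 capture rate defined in the notes, together with the Brier score, can be written out in a few lines. The following is a hedged sketch for a single year: the function names and the toy probabilities are hypothetical, the top group is taken as the ceiling of 10% of the observations (an assumption), and the way yearly values are pooled in Table 4 is not reproduced.

```python
import math


def top_share_capture(probs, events, share=0.10):
    """Share of all actual events found among the top `share` ranked by probability."""
    n_events = sum(events)
    if n_events == 0:
        return None  # the rate is undefined for a year without any event
    k = max(1, math.ceil(len(probs) * share))
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    hits = sum(events[i] for i in ranked[:k])
    return hits / n_events


def brier_score(probs, events):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, events)) / len(probs)


# Hypothetical year with 20 banks and 3 NPL jump events (1 = jump occurred).
probs = [0.30, 0.25, 0.08, 0.05] + [0.01] * 16
events = [1, 1, 0, 1] + [0] * 16

capture = top_share_capture(probs, events)  # 2 of 3 events sit in the top 2 banks
brier = brier_score(probs, events)
```

In this toy year the top 10% contains two banks, both of which experienced a jump, while the third event is ranked lower, so two of the three events are captured.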
Table 5. SHAP Importance of Internal Control Elements (Bootstrap 95% CI).
| Element | Mean \|SHAP\| | 95% CI Lower | 95% CI Upper |
|---|---|---|---|
| Control Environment (CE) | 0.592 | 0.471 | 0.726 |
| Risk Assessment (RA) | 0.346 | 0.302 | 0.394 |
| Control Activities (CA) | 0.422 | 0.367 | 0.475 |
| Information & Communication (IC) | 0.463 | 0.421 | 0.505 |
| Monitoring Activities (MA) | 0.336 | 0.282 | 0.388 |
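A bootstrap interval for a mean absolute SHAP value, as reported in Table 5, can be sketched with the standard library alone. This is an illustration under assumptions, not the procedure used for the table: the function names, the percentile method, the observation-level resampling (no clustering by bank), the number of resamples, and the toy SHAP values are all hypothetical.

```python
import random


def mean_abs(values):
    """Mean of absolute values, the importance statistic being bootstrapped."""
    return sum(abs(v) for v in values) / len(values)


def bootstrap_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean absolute value."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    stats = sorted(mean_abs(rng.choices(values, k=n)) for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return stats[lo_idx], stats[hi_idx]


# Hypothetical local SHAP values of one element across eight observations.
shap_values = [0.8, -0.5, 0.6, -0.7, 0.4, -0.6, 0.9, -0.3]

importance = mean_abs(shap_values)
ci_lower, ci_upper = bootstrap_ci(shap_values)
```

If observations from the same bank are correlated over time, resampling whole banks instead of single observations would be the more cautious choice.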
Table 6. IIC-DSS intelligent reporting: summary of diagnostic results for representative banks.
| Bank | Predicted Jump Probability | Risk Level | Diagnostic Summary (Based on Local SHAP Attribution) |
|---|---|---|---|
| China Minsheng Bank | 12.947% | High | Classified as HIGH RISK. Key weaknesses: Control Environment (CE) (contribution +0.356); Information & Communication (IC) (contribution +0.351). |
| Xi’an Bank | 14.196% | High | Classified as HIGH RISK. Key weaknesses: Information & Communication (IC) (contribution +0.293). |
| China Merchants Bank | 0.124% | Low | Classified as LOW RISK. Key strength: Control Environment (CE) (contribution −0.530). |
| Bank of Shanghai | 0.294% | Low | Classified as LOW RISK. Key strength: Control Environment (CE) (contribution −0.618). |
| Industrial and Commercial Bank of China | 0.035% | Low | Classified as LOW RISK. Key strength: Control Environment (CE) (contribution −1.252). |
| Shanghai Pudong Development Bank | 1.671% | Medium | Classified as MEDIUM ATTENTION. Main potential issue: Control Environment (CE) (contribution +0.423). |
| Bank of Ningbo | 0.081% | Low | Classified as LOW RISK. Key strength: Control Activities (CA) (contribution −0.592). |
Notes: The banks listed in this table provide a representative example, and the complete results for the whole sample are in the Supplementary Materials.
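The diagnostic summaries in Table 6 follow a regular template, which suggests how a rule-based text generator could turn a predicted probability and local SHAP contributions into a sentence. The sketch below is purely illustrative: the tier cut-offs (1% and 5%), the contribution cut-off (0.25), and all function names are hypothetical choices that happen to be consistent with the rows shown, not the system's actual dynamic thresholds.

```python
NAMES = {
    "CE": "Control Environment",
    "RA": "Risk Assessment",
    "CA": "Control Activities",
    "IC": "Information & Communication",
    "MA": "Monitoring Activities",
}


def risk_level(prob, medium_cut=0.01, high_cut=0.05):
    """Map a predicted jump probability to a tier (cut-offs are hypothetical)."""
    if prob >= high_cut:
        return "High"
    return "Medium" if prob >= medium_cut else "Low"


def diagnostic_summary(prob, contributions, weak_cut=0.25):
    """Build a one-sentence diagnosis from local SHAP contributions."""
    def fmt(key):
        return f"{NAMES[key]} ({key}) (contribution {contributions[key]:+.3f})"

    level = risk_level(prob)
    if level == "High":
        # List every element pushing risk upward beyond the hypothetical cut-off.
        ordered = sorted(contributions, key=contributions.get, reverse=True)
        weak = [k for k in ordered if contributions[k] >= weak_cut]
        return ("Classified as HIGH RISK. Key weaknesses: "
                + "; ".join(fmt(k) for k in weak) + ".")
    if level == "Medium":
        key = max(contributions, key=contributions.get)  # largest upward push
        return f"Classified as MEDIUM ATTENTION. Main potential issue: {fmt(key)}."
    key = min(contributions, key=contributions.get)      # largest downward push
    return f"Classified as LOW RISK. Key strength: {fmt(key)}."


# Inputs taken from two rows of Table 6.
minsheng = diagnostic_summary(0.12947, {"CE": 0.356, "IC": 0.351})
icbc = diagnostic_summary(0.00035, {"CE": -1.252})
```

A positive contribution raises the predicted jump probability and is reported as a weakness, while a negative contribution lowers it and is reported as a strength, matching the sign convention visible in the table.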
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liu, Y.; Li, X.; Su, C. From Unstructured Text to Automated Insights: An Explainable AI Approach to Internal Control in Banking Systems. Systems 2026, 14, 234. https://doi.org/10.3390/systems14030234