Article

AI-Driven Corruption Risk Indicator Detection: A Comparative Evaluation of Transformer-Based NLP Models in Unstructured Procurement Data

by Nikolaos Peppes, Theodoros Alexakis, Emmanouil Daskalakis and Evgenia Adamopoulou *
Institute of Communication and Computer Systems, National Technical University of Athens, 15773 Athens, Greece
* Author to whom correspondence should be addressed.
Information 2026, 17(4), 329; https://doi.org/10.3390/info17040329
Submission received: 27 February 2026 / Revised: 19 March 2026 / Accepted: 26 March 2026 / Published: 28 March 2026

Abstract

The detection of corruption-related indicators within unstructured, textual procurement data remains a complex task due to linguistic ambiguity, contextual variation and domain-specific terminology. This study presents a comparative evaluation of three transformer-based Natural Language Processing (NLP) architectures (BERT-base-uncased, RoBERTa-base and DeBERTa-v3-base) for automated corruption risk indicator detection in procurement texts coming from heterogeneous sources. A unified dataset is constructed by linking unstructured technical documentation with structured procurement outcomes, enabling an outcome-driven risk labeling strategy. Performance evaluation is conducted through different metrics, including precision, recall, F1-score and ROC-AUC, complemented by explainability analysis using Integrated Gradients. The results demonstrate a clear performance progression and highlight the comparative strengths of the evaluated architectures. Overall, this study highlights the potential of contextual transformer models to support scalable, transparent and operational anti-corruption monitoring systems.

Graphical Abstract

1. Introduction

Corruption is not a new phenomenon. Ancient Egypt faced a wide range of corruption, from the offering and accepting of bribes to embezzlement and the theft or misuse of public money [1]. Corruption was also a prevalent crime in the Roman Empire, known as ambitus under ancient Roman law. More specifically, ambitus was a crime of political corruption, mainly a candidate’s attempt to influence the outcome (or direction) of an election through bribery or other forms of soft power [2]. On the other hand, anti-corruption efforts are not new either. In ancient Greece, the various and multiple anti-corruption measures of Athens sought to bring ‘hidden’ knowledge into the open and thereby remove information from the realm of individual judgment, placing it instead into the realm of collective judgment. The Athenian experience suggests that participatory democracy and a civic culture that fosters political equality rather than reliance on individual expertise provide a key bulwark against corruption [3].
In the modern world, corruption is still present and has evolved in terms of its forms, but without altering its purpose. So, according to Transparency International, corruption can still be defined as the abuse of entrusted power for private gain [4]. Thus, corruption poses one of the biggest obstacles to social justice, institutional trust, and worldwide economic stability. The amount of documentation, from company and financial records to procurement contracts, has increased dramatically as the public and private sectors move toward digital governance [5]. Signals of misconduct are frequently hidden in unstructured text in the ever-evolving digital and big data landscape, making manual oversight not only time-consuming but also inadequate [6]. The efficient tackling of corruption nowadays relies heavily on the ability to automatically detect corruption risk indicators.
Until recently, corruption detection relied heavily on structured data analysis, such as identifying irregularities in financial transactions or flagging statistical anomalies in procurement records [7,8]. While these structured approaches provided a necessary foundation for oversight, they remained fundamentally limited to numerical discrepancies, leaving the vast qualitative landscape of available unstructured linguistic data unmonitored. Current research is shifting towards Natural Language Processing (NLP) and Large Language Models (LLMs) to address the complexities of fraud concealed within unstructured text, ranging from email correspondence to intricate contract narratives [6,9]. Early textual methodologies relied mainly on rule-based systems and keyword matching, which, despite being highly interpretable and computationally efficient, often struggled with the linguistic ambiguity that characterizes corrupt exchanges [10]. The “coded” language that malicious actors frequently use to conceal their illegal intent was not captured by these static models, which treated language as a bag of isolated terms. Therefore, when confronted with sophisticated evasion strategies that bypassed explicit risk terminology, conventional techniques frequently generated high false-negative rates [11].
The emergence of transformer-based architectures has revolutionized this landscape by moving beyond simple lexical analysis to capture deep semantic meaning. Transformer architectures [12] are widely used in NLP, outperforming other neural models (e.g., Recurrent Neural Networks and Convolutional Neural Networks) in natural language generation and understanding tasks. Model pretraining on large generic corpora is an important advantage of transformer-based models, leading to increased efficiency in downstream tasks such as machine translation, summarization, language understanding and classification [13]. Bidirectional Encoder Representations from Transformers (BERT) [14], which follows a pretrained deep learning approach, has demonstrated an impressive capability in text detection, mining, processing, and analysis tasks, outperforming conventional methods in diverse scenarios [15].
Despite the growing body of research on corruption detection, existing approaches predominantly rely on structured data analysis or rule-based textual methods, with limited attention given to the systematic use of advanced transformer-based architectures for analyzing unstructured procurement documentation. In particular, the comparative effectiveness of recent transformer-based models, as well as the role of explainability in supporting transparent and interpretable corruption risk detection, remains underexplored. The current study aims to bridge this research gap regarding the automated semantic analysis of unstructured procurement documentation, where misleading language is used to hide fraudulent behavior. Specifically, this study investigates and evaluates the effectiveness of transformer-based architectures in detecting corruption risk indicators within complex technical specifications. It further examines how successive generations of models, specifically BERT, RoBERTa and DeBERTa-v3, compare across dimensions of predictive accuracy and operational efficiency. Additionally, this research investigates the extent to which explainability mechanisms, such as Integrated Gradients, provide the transparency and traceability necessary for human-in-the-loop oversight. By leveraging an outcome-driven labeling strategy grounded in Open Contracting Data Standard (OCDS) metrics, this research moves beyond subjective annotation to establish a reproducible, evidence-based detection framework that meets the practical requirements of large-scale public transparency initiatives and real-world operational scenarios.
The remainder of the paper is organized as follows: Section 2 presents related works, focusing on the domain of corruption detection and risk assessment using state-of-the-art technologies and Artificial Intelligence. Section 3 describes in detail the proposed methodology designed and developed, whilst Section 4 elaborates on the produced results. Finally, Section 5 concludes the paper.

2. Related Works

Modern corruption is so complex and dynamic that it requires the use of cutting-edge scientific approaches, as well as state-of-the-art computational tools, to model and detect activities that indicate corruption [16]. BERT models are increasingly used in contemporary research studies to tackle corruption cases. Damiano et al. [17] analyzed the annual reports of banks utilizing textual analysis methods for extracting indicators of potential corruption cases. More specifically, the authors combined sentiment analysis following a dictionary approach and a BERT model called FinBERT for the categorization of Environmental, Social, and Governance (ESG) sentences. Subsequently, they utilized Random Forest (RF), Support Vector Machines (SVMs), Naïve Bayes and Gradient boosting algorithms for the classification of potential corruption events. The authors concluded that specific textual measures could make an important contribution to the detection of corruption events.
Algorithmic Trading Systems allow trade execution to be implemented automatically, rapidly and efficiently. However, their complex structure often renders them susceptible to being utilized in financial corruption and fraud cases. Mohamed et al. [18] proposed a framework which combined different BERT variants for the semantic interpretation of financial logs with transformer models for modeling market behavior. The so-called TADST framework offered real-time detection of potentially fraudulent activities. Experimental testing of the framework proved its effectiveness, as it helped in improving existing benchmarks (98.7% improvement in efficiency and 97.4% in accuracy). In another research work for tackling financial fraud, Ergun and Sefer [19] proposed the so-called DeepFraud framework, which combined different Large Language Model embeddings (e.g., FinBERT, FinGPT, and FinLlama) and Long Short-Term Memory (LSTM), for detecting corruption incidents related to financial fraud. Experimental testing of the framework in financial records of a 30-year period (1995 to 2024) indicated a precision score of 86% and an F1-score of 84%, outperforming, in many scenarios, other contemporary models (e.g., SVM, XGBoost, Logistic Regression, and Autoformer).
A methodology combining BERT and NLP techniques was proposed by Lima et al. [20] for the detection of corruption indicators in public procurement texts describing the rules for hiring. More specifically, the methodology extracts red flags denoting potential fraud cases. Experimental testing of the proposed methodology indicated an 88.8% recall rate, which outperformed other contemporary models (i.e., Bottleneck and BiLSTM). Torres-Berru et al. [21] presented an NLP-based approach for the detection of gender bias and favoritism in public procurements. More specifically, the authors made use of a Word2-vec model, as well as a sentiment analysis algorithm, for analyzing the questions and answers registry platform for public procurement processes in Ecuador. Experimental testing of the methodology in a corpus of 303,076 procurement processes indicated high accuracy rates, i.e., 88% for favoritism detection and 90% for gender bias detection.
Combating corruption in public procurement from several different aspects is the focus of many scientific papers. Salazar et al. [22] developed a tool for detecting public procurement corruption cases and for prioritizing resources. Their tool took into consideration both deliberate corrupt actions taken by decision-makers and inefficiencies which may support corrupt cases. It also detected red flags, which were highly probable to be connected with corrupt deeds. For the classification tasks, Logistic regression and RF methods were used. Experimental testing indicated improvements in corruption detection as compared to other contemporary methods, achieving accuracy rates of up to 88.29%. On the other hand, Munoz-Cancino and Rios [23] presented a methodology for detecting corruption in government tenders based on Social Network Analysis and an Isolation Forest algorithm. The authors stressed the importance of specific network structural settings for the early detection of corruption. During the experimental testing of the methodology, supplier centrality, density and the number of connections of related entities, as well as supplier financial characteristics, were found to have a key importance in the detection of previously unknown anomaly patterns. Other complex characteristics were also highlighted by Pernica et al. [24] as important in the detection of corruption in military equipment procurement. More specifically, variables related to national culture and a government’s ability to combat corruption were indicated as important in detecting suspicious cases. The authors conducted a comparative case study spanning 16 years (from 2008 to 2023) across four countries (i.e., Norway, Lithuania, Slovakia, and Czechia) related to mass-produced military equipment procurement.
The importance of indicators related to the relationships between buyers and suppliers in identifying corruption in procurement contracts was stressed by Aldana et al. [25]. The authors concluded that such indicators were more important than those related to the characteristics of individual contracts. An ensemble model of RF classifiers was also proposed, which achieved an accuracy rate of up to 92% during its experimental testing. Ayobami et al. [26] utilized several parameters (e.g., contract values, timelines, and bidder characteristics) for detecting corruption indicators, bid-rigging cases, and conflicts of interest. The proposed framework was experimentally tested, yielding an accuracy rate of over 87% in the detection of suspicious transactions.
The data included in audit reports and governmental budgets can be used for the detection of corruption incidents. Based on NLP methods, Beltran [27] proposed a pipeline for detecting indicators of potential corruption cases in audit reports related to governmental budgets. The author utilized publicly available data from Supreme Audit Institutions (SAIs) for this pipeline. Firstly, a classification algorithm was used for determining which parts of the input texts were relevant. Subsequently, a Named Entity Recognition (NER) model was developed for extracting monetary values of budget discrepancies. The author also highlighted that although a discrepancy itself was not necessarily denoting a corruption incident, the proposed model could be a useful tool for fighting corruption and forming anti-corruption policies. In another research work focusing on audit and budget data, Ash et al. [28] proposed a Gradient Boosting model for detecting corruption cases. This tree-based model calculated a measure indicating the possibility of corruption issues, which could be used for empirical analysis and for supporting anti-corruption policy-making. Experimental testing of the proposed model indicated that it could be helpful in conducting targeted rather than random audits, yielding 83.6% more corruption cases as compared to random audits.
Social media posts are used in many research works related to tackling corruption. Xiao [29] proposed a deep learning methodology for detecting corruption incidents in texts from social media. The methodology encompassed preprocessing, feature extraction and selection and corruption detection based on a Convolutional Neural Network (CNN). In an experimental testing of the model using a dataset including 19,560 tweets, the model yielded an accuracy of about 90% in the detection of corruption incidents of different kinds (e.g., money laundering, bribery, and nepotism). On the other hand, Indriyanti et al. [30] proposed a methodology for the analysis of the public perception of corruption incidents based on posts from the X social media platform. In this context, BERT-based sentiment analysis was used together with Latent Dirichlet Allocation (LDA) for the identification of corruption-related dominant topics in the East Java province of Indonesia. The authors also highlighted the important role such a methodology could play in forming concrete policy recommendations based on social media data, as well as in strengthening accountability. Experimental testing of the methodology also indicated high accuracy in sentiment categorization of posts, reaching 98.51%.
Transformer-based architectures and, more specifically, LLMs offer enhanced capabilities for handling the ever-growing volumes of digital data encountered in digital forensics. In this light, LLMs provide a very strong tool for unstructured data analysis with superior semantic capabilities. The authors of [31] concentrated on the promising potential of incorporating LLMs into digital forensics in order to improve the effectiveness of investigations and deal with the massive amounts of data that are met in modern cybercrimes, including corruption. Their study indicated how LLMs could greatly speed up conventional forensic processes by automating the extraction, classification, and summarization of vast quantities of unstructured digital evidence, including emails, chat logs, and text files. More specifically, the experimental findings showed that LLMs could perform promisingly on age and gender prediction tasks while retaining computational efficiency, especially the Polyglot model with LoRA and QLoRA fine-tuning, achieving accuracy and F1-score for both categories over 70%. Li et al. [32] aimed at presenting how sophisticated semantic parsing could successfully reveal hidden behavioral patterns and unusual trajectories by using LLM-based architectures to analyze intricate geospatial movements and unstructured spatial narratives. Because of its reference architecture, it could be directly used to map and identify illicit trade networks that are relevant to smuggling operations and cross-border corruption.

3. Methodology

3.1. Overall System Architecture

The objective of the proposed methodology is the development of an operational NLP-based system capable of automatically identifying corruption-related risks through the analysis of unstructured procurement documentation. Unlike approaches relying exclusively on structured numerical indicators, the proposed framework integrates textual information originating from technical specifications and tender documentation with structured procurement data in order to enhance corruption risk assessment.
The overall system architecture consists of four interconnected stages. Initially, structured procurement information is collected from open procurement repositories following the OCDS. In parallel, unstructured technical documentation associated with each procurement procedure is retrieved and scraped from public procurement portals. These two data sources are subsequently linked through unique procurement identifiers, enabling the creation of a unified dataset containing textual data and measurable procurement outcomes.
Following data integration, a text processing pipeline prepares the documentation for transformer-based modeling. Three transformer architectures are then trained and evaluated for corruption risk prediction: BERT, as the baseline of this comparative analysis, RoBERTa, and DeBERTa-v3.
Finally, model predictions are transformed into risk scores that can be integrated into a monitoring or decision-support platform for practical anti-corruption analysis. The proposed architecture is designed with deployment feasibility and explainability in mind, to ensure that the extracted outputs remain interpretable and operationally useful and applicable within real-world monitoring environments and investigation procedures. Table 1 depicts the core components and their roles within this architecture.
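As a rough illustration of this final stage, the mapping from raw classifier outputs to normalized risk scores can be sketched as follows (a minimal sketch; the two-class setup and the example logit values are illustrative assumptions, not values reported in this study):

```python
import math

def softmax(logits):
    """Convert raw classifier logits into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def risk_score(logits, high_risk_index=1):
    """Return the probability assigned to the high-risk class as a 0-1 score."""
    return softmax(logits)[high_risk_index]

# Hypothetical (low-risk, high-risk) logits from a fine-tuned classifier
score = risk_score([-1.2, 2.3])
```

A score of this form can be thresholded or ranked directly inside a monitoring platform, so that analysts review the highest-scoring tenders first.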
The overall workflow of the proposed corruption risk detection framework is illustrated in Figure 1. The depicted architecture highlights the integration of structured procurement data and unstructured documentation, followed by preprocessing, transformer-based modeling, explainability analysis and deployment-oriented risk scoring. Additionally, this pipeline reflects the end-to-end operational design adopted in this study.

3.2. Data Acquisition and Integration

The data acquisition approach has been designed to create a multi-dimensional view of procurement by scraping, as an initial step, and subsequently unifying textual data and structured outcomes. This dual-source approach allows the model to learn and detect how specific linguistic patterns in tender documents correlate with high-risk procurement results and identified risk indicators.

3.2.1. Structured Procurement Datasets

The structured component of the dataset originates from open procurement repositories compliant with OCDS principles. These repositories contain contract notices, specified procurement procedures and contract award information describing the lifecycle of public procurement activities.
The existence of structured data is essential for the proposed methodology since it provides measurable, objective outcomes that can be transformed into corruption risk indicators and used during the training process. Key attributes extracted from procurement records include:
number of participating bidders,
procurement procedure type,
estimated and awarded contract value,
procurement status and
awarding criteria and associated metadata.
These variables represent operational indicators frequently used by anti-corruption researchers and public oversight organizations to evaluate procedural transparency and competition levels. The mapping of these OCDS fields to risk categories is summarized in Table 2.
Leveraging structured data as a baseline ensures that the proposed framework avoids subjective labeling approaches and instead relies on measurable procurement outcomes in order to improve the overall methodological robustness.

3.2.2. Unstructured Procurement Datasets

In addition to structured records, this study incorporates unstructured textual documentation published alongside tender announcements. Such documentation typically includes technical specifications, participation requirements, explanatory descriptions, and contractual guidelines.
These documents are particularly relevant because they contain detailed language describing technical requirements and eligibility constraints. Previous research and policy reports suggest that restrictive descriptions or overly specific requirements may indirectly limit market competition or favor specific suppliers. As a result, the linguistic analysis of these documents may provide additional insights beyond the numerical procurement indicators.
Documentation is collected from public procurement portals where downloadable files are available. Due to real-world data availability constraints, unstructured documentation exists only for a subset of procurement cases. Nevertheless, even partial availability provides sufficient data to explore the relationship between textual characteristics and procurement outcomes. Documents appear in heterogeneous formats, primarily PDF and text-based files, that require automated extraction pipelines to transform them into machine-readable textual content.
For the purposes of this study, only English-language procurement records were preserved to ensure consistency in the evaluation of the transformer architectures. Non-English or multilingual documents were excluded during the initial data filtering phase. This design choice enables a controlled experimental setting by minimizing linguistic variability, allowing for a more accurate assessment of architectural differences across models. While this approach may limit immediate generalizability, it establishes a baseline for future extensions of the proposed framework to multilingual procurement data. Given that the methodology is architecture-agnostic, it can be readily adapted to cross-lingual settings through the use of multilingual pretrained models, such as mBERT [33] or XLM-RoBERTa [34].

3.2.3. Unified Dataset Construction and Data Linking

The construction of the unified dataset involves a primary key-join on the procurement identifier, ensuring that the textual features from the specifications are directly paired with the ground-truth outcomes (e.g., number of bidders) from the structured records described in the previous subsection. This link is vital for the supervised learning phase, as it allows the BERT, RoBERTa and DeBERTa-v3 models to identify which linguistic constraints are statistically linked to non-competitive outcomes.
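The key-join on the procurement identifier can be sketched in a few lines (a minimal illustration; the field names, such as "ocid", and the sample records are hypothetical, not the actual schema of the dataset):

```python
# Hypothetical structured outcomes and unstructured documents sharing an identifier
structured = [
    {"ocid": "tender-001", "bidders": 1, "procedure": "negotiated"},
    {"ocid": "tender-002", "bidders": 5, "procedure": "open"},
]
documents = [
    {"ocid": "tender-001", "text": "The supplier must hold certificate X-99 ..."},
    {"ocid": "tender-003", "text": "General maintenance services ..."},
]

def link_records(structured_rows, document_rows, key="ocid"):
    """Inner-join documents with structured outcomes on the shared identifier."""
    outcomes = {row[key]: row for row in structured_rows}
    return [
        {**outcomes[doc[key]], **doc}
        for doc in document_rows
        if doc[key] in outcomes  # drop documents lacking structured outcomes
    ]

unified = link_records(structured, documents)
# only tender-001 appears in both sources, so only it survives the join
```

An inner join of this kind also explains why only a subset of the collected records reaches the modeling phase: documents without matching structured outcomes are discarded.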
Table 3 summarizes the key statistical characteristics of the dataset used in this study, providing an overview of its size, composition and class distribution. From the initial collection of 34,297 procurement records, a subset of 10,742 cases was identified with available unstructured and associated textual documentation. Following preprocessing and filtering steps, a total of 10,120 documents were preserved for the modeling phase. The dataset was subsequently partitioned into training, validation and testing subsets following a 70/15/15 split. As expected in real-world procurement scenarios, the dataset exhibits high class imbalance, with high-risk cases representing approximately 12% of the total observations. This characteristic reflects the fact that corruption-related signals are relatively rare in practice and motivated the use of a class-weighted loss function during the training phase to ensure that the models remained sensitive to the minority (high-risk) class.
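The class-weighted loss mentioned above typically uses weights that are inversely proportional to class frequency. A minimal sketch of one common weighting scheme follows (an assumption for illustration; the exact weighting used in the study is not specified in this section):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class inversely to its frequency, so that the rare
    high-risk class contributes comparably to the overall loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Toy label set mirroring the reported ~12% high-risk share
labels = [1] * 12 + [0] * 88
weights = inverse_frequency_weights(labels)
# weights[1] ~ 4.17 (minority, up-weighted), weights[0] ~ 0.57
```

In a deep learning setting, such per-class weights would be passed to the loss function (e.g., a weighted cross-entropy) so that misclassifying a high-risk case is penalized more heavily than misclassifying a low-risk one.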

3.3. Corruption Risk Indicator Definition

Apart from manually labeling text according to identified corruption indicators and signals, this study introduces an outcome-driven labeling strategy. Corruption risk indicators are derived directly from measurable procurement outcomes contained in structured datasets.
Examples of such corruption indicators include low competition scenarios (e.g., single-bidder tenders), procedural configurations associated with increased risk or other observable anomalies indicating reduced transparency or competitive pressure. These indicators serve as main labels representing potential corruption-related risk rather than direct evidence of any illegal behavior. This clarification is important, as the objective of the system is risk assessment and prioritization rather than definitive legal classification.
By grounding labels in measurable outcomes, the methodology ensures reproducibility and reduces potential bias associated with subjective annotation processes. Moreover, this approach allows trained models to learn implicit relationships between textual characteristics and procurement outcomes without relying on predefined keyword lists or definitions, while enhancing the interpretation of commonly used terms into mathematical ones. Table 4 summarizes the correlations between the extracted textual data and the identified corruption indicators.
The different thresholds presented in Table 4 are grounded in the prior literature and established policy frameworks on public procurement risk assessment. Indicators such as single-bidder tenders and restricted procedures have been widely associated with reduced competition and increased corruption risk [8]. Similarly, deviations between awarded and estimated contract values are commonly used as proxies for financial irregularities [7]. In addition, international guidelines such as the World Bank’s Red Flag framework [35] identify restricted procedures and abnormal pricing patterns as key signals of elevated procurement risk. These thresholds are therefore designed to capture operationally meaningful deviations from standard procurement conditions and serve as proxies for elevated risk rather than direct evidence of corruption.
It should be noted that the selected thresholds are derived from the established literature and policy frameworks and are treated as standardized indicators of procurement risk rather than tunable parameters. As such, no explicit sensitivity analysis was conducted in this study. The primary objective is to evaluate the capability of transformer-based models to detect these predefined signals within unstructured procurement text. Future work may explore the sensitivity of the proposed framework to variations in threshold definitions, particularly in the context of different regulatory environments or procurement datasets.
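The outcome-driven labeling logic described in this subsection can be sketched as a simple rule function (the field names and threshold values below are illustrative placeholders, not the exact definitions of Table 4):

```python
def outcome_driven_label(record,
                         single_bidder_threshold=1,
                         value_deviation_threshold=0.25):
    """Flag a procurement record as high-risk (1) or low-risk (0) using
    measurable outcomes only. Thresholds here are hypothetical examples,
    not the study's exact Table 4 values."""
    # Red flag: single-bidder (or no-competition) tender
    if record["bidders"] <= single_bidder_threshold:
        return 1
    # Red flag: large deviation between awarded and estimated value
    estimated = record["estimated_value"]
    awarded = record["awarded_value"]
    if estimated > 0 and abs(awarded - estimated) / estimated > value_deviation_threshold:
        return 1
    return 0

# A competitive tender whose awarded value far exceeds the estimate
label = outcome_driven_label(
    {"bidders": 4, "estimated_value": 100_000, "awarded_value": 180_000}
)
```

Because every label is derived from recorded outcomes rather than human judgment, re-running the rule over the same structured data reproduces the same labels, which is the reproducibility property the text emphasizes.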

3.4. Text Extraction and Preprocessing Pipeline

Unstructured documentation undergoes a multistage preprocessing pipeline (as shown in Table 5) designed to preserve core semantic information while ensuring compatibility with transformer architectures. Initially, the extracted documents are converted into plain text using automated extraction techniques capable of handling data with heterogeneous formats. Basic cleaning operations remove encoding artifacts, repetitive headers and non-informative symbols that may be introduced during document generation, extraction or conversion.
Subsequently, normalization steps are implemented to ensure consistency across documents originating from different data sources. Language-specific processing is minimized to preserve contextual information, as transformer models rely on raw textual inputs rather than more advanced or manually engineered linguistic features.
Since transformer models operate with fixed input lengths, long procurement documents are segmented or truncated according to predefined approaches that prioritize retaining their most informative sections. Specifically, documents that exceed the 512-token limit are segmented into overlapping windows to preserve contextual consistency. This step is critical because the extensive length of technical procurement documentation may affect the training process. The extracted, processed text constitutes the direct input for transformer-based risk modeling. In this study, segmentation into fixed-length token windows was adopted to ensure a consistent preprocessing strategy across all evaluated transformer architectures. This choice enables a fair and controlled comparison of architectural differences, while also providing a computationally efficient solution for large-scale procurement datasets. Although long-sequence transformer models, such as Longformer [36], may better capture extended contextual dependencies in lengthy documents, their evaluation was considered beyond the scope of the current study. The primary objective is to isolate the impact of architectural refinements under uniform input conditions. Future work may investigate whether long-sequence models further improve performance in cases involving complex or extended procurement documentation.
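The overlapping-window segmentation can be sketched as follows (the 512-token window follows the text; the 128-token overlap is an illustrative choice, not a value stated in this section):

```python
def sliding_windows(token_ids, max_len=512, stride=128):
    """Segment a long token sequence into overlapping fixed-length windows.

    Consecutive windows share `stride` tokens, so context spanning a
    window boundary is not lost."""
    if len(token_ids) <= max_len:
        return [token_ids]
    step = max_len - stride
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the final window already covers the end of the document
    return windows

# A 1000-token document yields three windows starting at 0, 384 and 768
chunks = sliding_windows(list(range(1000)), max_len=512, stride=128)
```

In practice, tokenizer libraries such as Hugging Face Transformers expose equivalent behavior through options like `return_overflowing_tokens` with a `stride` argument; the sketch above only makes the underlying logic explicit.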

3.5. Transformer-Based Corruption Risk Modeling

In the current study, the core of the suggested framework relies on leveraging state-of-the-art encoder-based transformer models to process the linguistic complexities of procurement documentation.

3.5.1. BERT-Based Classification Model

BERT is utilized as the fundamental baseline for this study’s corruption risk detection tasks. As the original bidirectional transformer-based encoder, BERT enables the model to consider the context of a word based on its surroundings (both left and right) simultaneously. The pretrained BERT-base-uncased model is fine-tuned on the unified procurement dataset with a classification head attached to the [CLS] token. This provides a performance floor against which the more optimized architectures, RoBERTa and DeBERTa-v3, are measured and analyzed in this study.

3.5.2. RoBERTa-Based Classification Model

RoBERTa is employed as the reference transformer architecture for corruption risk prediction. As an optimized variant of BERT, RoBERTa introduces improved training strategies, including dynamic masking, that enhance contextual representation learning.
The pretrained RoBERTa model is fine-tuned using the unified procurement dataset. A task-specific classification head is attached to the final encoder layer, allowing the model to produce corruption risk predictions from the provided textual content.
RoBERTa serves as a stable and widely recognized baseline model capable of capturing contextual relationships between terms, making it suitable for analyzing procurement language characterized by complex requirements and advanced technical terminology.

3.5.3. DeBERTa-v3-Based Classification Model

DeBERTa-v3 is the most advanced transformer architecture evaluated in this study. Its disentangled attention mechanism separates positional and semantic representations, enabling enhanced contextual modeling compared to conventional transformer models.
The model is fine-tuned under identical experimental settings to ensure fair comparison with RoBERTa. The main purpose of including DeBERTa-v3 is to evaluate whether architectural improvements in contextual representation lead to measurable gains in corruption risk detection performance.
By comparing two modern transformer architectures under identical conditions, this study aims to isolate the influence of model design on predictive capability within the procurement domain.

3.5.4. Training Configuration and Optimization

All transformer models are trained using equivalent configurations to eliminate experimental bias and provide a common comparison reference point. The dataset is divided into training, validation and testing subsets following standard supervised learning practices. Training procedures include:
adaptive learning rate optimization,
mini-batch gradient descent,
early stopping based on validation performance,
cross-entropy loss minimization.
Hyperparameter selection, as shown in Table 6, is performed using validation data to ensure generalization while mitigating overfitting risks common in domain-specific datasets.
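The listed training ingredients can be illustrated with a small self-contained sketch. The inverse-frequency weighting formula and the patience value are common conventions assumed here for illustration, not settings reported in Table 6:

```python
import math

def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_c),
    a common convention for counteracting class imbalance."""
    counts = {c: labels.count(c) for c in set(labels)}
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Mean class-weighted cross-entropy over predicted probabilities."""
    losses = [-weights[y] * math.log(p[y]) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)

def should_stop(val_losses, patience=2):
    """Early stopping: halt when no epoch in the last `patience`
    epochs improved on the best validation loss seen before them."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return all(v >= best_before for v in val_losses[-patience:])
```

For example, with three low-risk documents and one high-risk document, the minority class receives a weight of 2.0, so its misclassifications contribute three times as much to the loss as majority-class errors.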

3.6. Explainability and Risk Interpretability

Interpretability is a critical requirement for AI systems operating in anti-corruption contexts, where models’ outcomes must be transparent and traceable. Therefore, the proposed methodology incorporates explainability mechanisms capable of highlighting textual elements that may influence the extracted risk predictions.
The explainability strategy adopted in this study is summarized in Table 7, which outlines the main components of the interpretability framework, including local token-level attribution and decision-support outputs intended to assist users during investigation processes. Within this framework, attribution-based explainability techniques are employed to estimate the contribution of individual tokens to model predictions, allowing analysts to identify specific phrases or requirements associated with elevated risk levels.
Unlike traditional feature importance approaches used in structured machine learning models, transformer-based architectures require methods capable of quantifying token contributions within contextual representations. In this context, methods such as Integrated Gradients and SHAP (SHapley Additive exPlanations) have been considered for attribution analysis, as they assign numerical contribution scores indicating the positive or negative influence of textual elements on the final classification outcome. These explanations enhance transparency and traceability while facilitating manual validation, which is essential in procurement monitoring scenarios.
In the current study, the Integrated Gradients method has been selected as the primary explainability mechanism for token-level attribution analysis. This method computes a contribution score for each token relative to a baseline reference input, thereby quantifying its influence on the final classification outcome in a stable and deterministic manner, while preserving full compatibility with transformer-based architectures. The extracted attribution results are analyzed and compared across the BERT, RoBERTa and DeBERTa-v3 models in Section 4. Integrated Gradients also provides deterministic attribution outputs, ensuring that identical inputs and model parameters yield consistent explanations; this contrasts with perturbation-based methods such as LIME or SHAP, which may introduce variability due to sampling procedures. In this study, reproducibility was further ensured through fixed random seeds and consistent experimental settings across all model evaluations. While no formal quantitative stability analysis was performed, attribution patterns were observed to be consistent across runs and aligned with the qualitative and quantitative findings presented in Section 4.4. Future work may incorporate explicit stability and faithfulness metrics to further evaluate explanation robustness (Table 7).
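Conceptually, Integrated Gradients attributes to each input dimension the input-minus-baseline difference scaled by the path integral of the model's gradient along a straight line from baseline to input. The stdlib-only sketch below illustrates the computation and the completeness axiom on a toy quadratic function; for transformer models, a library implementation applied over the embedding layers would typically be used instead:

```python
def integrated_gradients(grad_f, x, baseline, steps=200):
    """Integrated Gradients for a scalar function f:
    attribution_i = (x_i - b_i) * integral_0^1 df/dx_i(b + a(x - b)) da,
    approximated here with a midpoint Riemann sum over `steps` points."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        a = (k + 0.5) / steps  # interpolation coefficient along the path
        point = [baseline[i] + a * (x[i] - baseline[i]) for i in range(n)]
        g = grad_f(point)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    return [(x[i] - baseline[i]) * avg_grad[i] for i in range(n)]

# Toy differentiable "model": f(x) = sum(x_i^2), with gradient 2x.
f = lambda x: sum(v * v for v in x)
grad_f = lambda x: [2.0 * v for v in x]

x = [1.0, -2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attr = integrated_gradients(grad_f, x, baseline)
# Completeness axiom: attributions sum to f(x) - f(baseline) = 14.
```

The completeness property is what makes the token-level scores interpretable as shares of the model's overall decision, and the zero baseline mirrors the neutral reference input mentioned above.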

3.7. Evaluation Methodology

Model performance is assessed using standard classification metrics, including precision, recall, F1-score and ROC-AUC. These metrics provide complementary perspectives on prediction quality, balancing detection capability against false-positive behavior. In the context of corruption detection, high recall is prioritized to ensure that suspicious documents are not overlooked, while sufficient precision is maintained to avoid overwhelming investigators with false alerts. To ensure the statistical validity of the results, the dataset is divided into training, validation and testing subsets using a common 70/15/15 split ratio. Additionally, a stratified sampling strategy is implemented to maintain the distribution of risk indicators across all subsets, ensuring that rare corruption signals are adequately represented. All experiments are conducted using a fixed random seed to guarantee the reproducibility of the findings. Furthermore, to address the inherent class imbalance found in procurement datasets (where “high-risk” cases are less frequent than standard ones), the training phase incorporates class weighting within the loss function, ensuring that the models do not become biased toward the majority “low-risk” class. To ensure a fair comparative evaluation, all models are evaluated under identical experimental conditions, using a consistent probability threshold of 0.5 for high-risk classification, which may be further tuned based on validation set performance.
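The threshold-based metric computation can be sketched as follows; the example probabilities in the test are illustrative, not outputs from the study's models:

```python
def classification_metrics(probs, labels, threshold=0.5):
    """Precision, recall and F1 for the high-risk (positive) class,
    thresholding predicted probabilities at `threshold`."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Raising the threshold above 0.5 trades recall for precision, which is why the text notes that it may be tuned on the validation set.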
Beyond statistical performance, the evaluation also considers operational aspects such as inference efficiency and stability across document types (Table 8). This reflects the applied orientation of the study, where models must operate reliably within realistic monitoring workflows. The detailed comparative evaluation between BERT, RoBERTa and DeBERTa-v3 enables an assessment of whether architectural complexity leads to meaningful improvements in corruption risk detection.
The proposed framework is designed for integration into operational anti-corruption monitoring systems. Once trained, the transformer models can operate as inference services exposed through application programming interfaces (APIs), enabling automated risk scoring of newly published procurement documentation. Furthermore, predictions can be visualized within dashboards or analytical platforms where investigators prioritize high-risk cases for further examination (Table 8).
The modular nature of the architecture allows incremental updates and further feature integration as new data becomes available. By highlighting deployment readiness, the presented methodology aims to bridge the gap between academic experimentation and practical anti-corruption applications.

4. Experimental Results and Discussion

4.1. Experimental Setup Overview

This section presents the experimental results obtained from the comparative evaluation of the BERT (considered the baseline model), RoBERTa and DeBERTa-v3 transformer architectures for corruption risk detection in procurement documentation. The experiments follow the methodological framework described in Section 3, including the unified dataset construction, the preprocessing pipeline, the training configuration and the evaluation methodology.
All models were trained under identical experimental conditions to ensure methodological consistency and fairness. This process includes a consistent training/validation/test split, common preprocessing strategies, equivalent optimization procedures and identical risk-label definitions derived from structured procurement outcomes. The objective of this experimental study is not only to evaluate the predictive performance but also to examine whether architectural improvements in newer or more advanced transformer models lead to practical gains for corruption risk assessment.
Beyond standard classification performance, the evaluation framework also investigates model stability, operational efficiency and interpretability in order to reflect the deployment-oriented nature of the proposed system.

4.2. Comparative Performance Results Overview

4.2.1. Overall Classification Performance Results

Table 9 summarizes the comparative results of the three transformer architectures across the main evaluation metrics described in Section 3.
To evaluate the statistical significance of the performance differences observed across the three architectures, 95% Confidence Intervals (CIs) were calculated for the F1-scores using the normal approximation method, which is commonly adopted for large-sample performance estimation. As shown in Table 9, the confidence intervals for BERT, RoBERTa and DeBERTa-v3 do not overlap, providing strong evidence that the performance improvements are statistically meaningful. Furthermore, McNemar’s test [37] was conducted to compare the classification error distributions of the baseline BERT model against the top-performing DeBERTa-v3. The test yielded a p-value of <0.001, confirming that the architectural superiority of DeBERTa-v3 is statistically significant at the 99.9% confidence level and not a result of incidental data variations. These statistical analyses were implemented using the statsmodels [38] library in Python v3.12.3.
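The two statistical procedures can be sketched as below. The discordant-pair counts in the test are hypothetical, not the study's actual values; the McNemar statistic uses the standard continuity correction and the chi-square survival function with one degree of freedom:

```python
import math

def f1_confidence_interval(f1, n, z=1.96):
    """Normal-approximation 95% CI for a proportion-like score
    estimated from n test examples."""
    half = z * math.sqrt(f1 * (1 - f1) / n)
    return f1 - half, f1 + half

def mcnemar_p_value(b, c):
    """McNemar's test with continuity correction on the discordant
    pairs: b = cases only the baseline gets wrong, c = the reverse.
    The statistic follows chi-square with 1 degree of freedom;
    its survival function is erfc(sqrt(stat / 2))."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return math.erfc(math.sqrt(stat / 2))
```

A strongly asymmetric split of discordant pairs (e.g., one model correcting far more of the other's errors than vice versa) yields a p-value below 0.001, matching the significance level reported above.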
The extracted results demonstrate a clear performance improvement from the baseline BERT model to the RoBERTa and DeBERTa-v3 models. Specifically, BERT provides solid performance and shows that contextual bidirectional representations already identify meaningful linguistic cues related to corruption risk. However, its lower recall values reveal limitations, indicating that the model is less reliable for identifying subtle or implicit risk-related patterns.
On the other hand, RoBERTa improves upon BERT across all metrics, suggesting enhanced contextual learning capabilities. The observed gains in recall and ROC-AUC indicate that RoBERTa more effectively identifies complex procurement language, including restrictive requirements and nuanced procedural descriptions.
Ultimately, DeBERTa-v3 achieves the highest overall performance. Its superior recall and F1-score highlight an improved sensitivity to high-risk cases, a particularly critical factor in corruption monitoring scenarios where missing suspicious cases can have significant operational consequences. Moreover, the increased ROC-AUC further confirms the model’s ability to distinguish between risk and non-risk documents.
The discriminative capability of the evaluated architectures is illustrated in Figure 2, where ROC curves confirm the progressive improvement from the baseline to the advanced transformer model. DeBERTa-v3 maintains consistently higher true positive rates across varying thresholds, an aspect that highlights its stronger contextual modeling ability.
Since corruption risk detection represents a class-imbalanced problem, additional evaluation is provided through Precision–Recall (PR) curves (Figure 3). The PR analysis confirms that DeBERTa-v3 maintains higher precision across broader recall ranges, suggesting improved robustness when identifying minority high-risk cases. This behavior is operationally beneficial, as it reduces false alerts while preserving detection capability.
Overall, these results confirm that architectural refinements beyond the original BERT model lead to measurable improvements in corruption risk detection performance.

4.2.2. Class-Level Performance Results

Table 10 provides a granular breakdown of how each model performs against specific corruption risk proxies, making the overall model behavior across different corruption risk indicators easier to interpret. The results indicate that while all models handle explicit procedural patterns well, the architectural refinements in DeBERTa-v3 significantly enhance the detection of more implicit linguistic patterns, such as those found in transparency and financial risks.
The analysis of these results demonstrates that performance varies depending on the type of corruption risk indicator. Models generally achieve higher accuracy for corruption indicators associated with explicit procedural patterns, while more implicit linguistic signals present increased difficulty. Ultimately, DeBERTa-v3 demonstrates consistent improvements across all risk categories, particularly for transparency-related indicators, where precise linguistic expressions may conceal restrictive practices. This indicates that its architectural design better identifies long-range dependencies and contextual relationships within procurement text.
These findings confirm that model improvements are not limited to overall metrics but extend across multiple risk categories, strengthening the robustness of the proposed transformer-based modeling approach.

4.2.3. Error Analysis

Despite the strong overall performance of the evaluated transformer-based models, several recurring error patterns were observed during qualitative inspection of misclassified cases. A first source of error relates to procurement documents containing generic legal or administrative language, where linguistic cues associated with corruption risk are subtle or ambiguous. In such cases, the models occasionally misclassify normal procedural descriptions as risky, leading to false-positive predictions.
A second category of errors emerges from long and complex technical documentation, where relevant restrictive clauses appear sparsely within extensive textual content. While transformer architectures are capable of modeling contextual dependencies, performance degradation can occur when key signals are obscured across lengthy documents. This observation partially explains the improved performance of DeBERTa-v3, whose disentangled attention mechanism appears more effective in maintaining contextual focus on relevant patterns. This limitation is partly related to the fixed-length segmentation strategy adopted in this study. While this approach ensures computational efficiency and methodological consistency, long or complex procurement documents may contain dispersed risk indicators that are more difficult to capture within standard-length input segments. Future work may explore the use of long-sequence transformer architectures, such as Longformer [36], to better model extended contextual dependencies and assess potential improvements in detecting such cases.
Finally, certain false negatives were associated with implicitly expressed risk indicators, where potentially problematic procurement conditions were described using neutral or legally compliant language. Such cases highlight the fundamental difficulty of detecting corruption-related patterns only through linguistic analysis and emphasize the importance of combining textual signals with structured procurement indicators.
Overall, the analysis presented in this subsection confirms that model limitations are primarily linked to ambiguity and contextual complexity rather than systematic model bias, supporting the robustness of the proposed framework while identifying areas for future improvement.

4.3. Operational Performance Evaluation

Reflecting the real-world application of this study, Table 11 evaluates the computational efficiency and stability essential for deployment into realistic monitoring workflows.
BERT serves as the baseline model, showing the lowest computational requirements and the fastest inference speed, making it suitable for resource-constrained environments or large-scale screening tasks. RoBERTa provides a balanced trade-off between efficiency and detection performance, making it a practical option for continuous monitoring applications. While DeBERTa-v3 is computationally heavier, it achieves superior predictive performance. Its increased resource consumption reflects a common trade-off in advanced transformer architectures, where improved contextual modeling comes at the cost of higher computational complexity.
These results highlight an important deployment consideration: model selection must align with operational priorities to effectively balance throughput requirements against detection sensitivity. While DeBERTa-v3 demonstrates superior predictive performance, it introduces higher computational complexity compared to lighter architectures such as BERT, which may result in increased inference latency. However, the observed latency remains within acceptable limits for the targeted application domain, where procurement risk analysis is typically performed in batch or near-real-time scenarios rather than strict real-time environments. To provide additional context, at a throughput of 61 documents per second, the entire processed dataset of over 10,000 documents (Table 3) can be analyzed in less than three minutes, demonstrating that DeBERTa-v3 remains highly scalable for large-scale procurement monitoring. Furthermore, the proposed framework can be optimized for deployment through standard techniques such as batch inference, GPU acceleration and model optimization strategies (e.g., distillation or quantization). These approaches can significantly reduce inference latency while maintaining high predictive performance, supporting the practical applicability of the approach in real-world operational settings.

4.4. Explainability Analysis

As described in Section 3.6, Integrated Gradients have been employed as the primary attribution-based explainability mechanism to quantify the contribution of individual tokens to corruption risk predictions. The selected method allows an estimation of token-level attribution scores relative to a neutral baseline input, providing stable and deterministic explanations across the different transformer architectures evaluated in the current study.
The attribution analysis reveals that all evaluated models consistently assign higher contribution scores to linguistically meaningful procurement constraints, such as restrictive eligibility criteria, proprietary certifications and unusually specific technical conditions. These patterns align with known corruption risk indicators described in the structured labeling framework.
However, qualitative comparison of attribution distributions highlights architectural differences. BERT frequently emphasizes isolated tokens, often assigning importance to individual words without fully capturing the broader semantic structure of restrictive clauses. RoBERTa demonstrates improved contextual grouping, assigning higher attribution scores to short multi-token segments. DeBERTa-v3 exhibits the most coherent attribution behavior, frequently concentrating importance on semantically complete phrases, indicating enhanced contextual modeling capabilities. To highlight the practical application of this framework, Table 12 compares the models’ focus areas, depicting how each model assigns importance to the same high-risk textual segments.
In addition to qualitative inspection, a quantitative perspective on attribution behavior was obtained by analyzing both the distribution and the concentration of attribution scores across tokens. Specifically, DeBERTa-v3 assigns higher cumulative attribution to a smaller number of semantically coherent token groups, while BERT more frequently distributes importance across isolated tokens. RoBERTa demonstrates intermediate behavior, capturing short phrase-level importance more effectively than BERT. A simple aggregation of top-k attribution scores further indicates that DeBERTa-v3 concentrates importance on fewer but more contextually meaningful tokens. Table 13 provides a quantitative summary of attribution concentration across the models, illustrating how importance is distributed over tokens; these results are consistent with the qualitative patterns shown in Table 12 and further support the conclusion that architectural improvements enhance not only predictive performance but also attribution coherence. While the current analysis focuses on practical interpretability, future work may incorporate formal faithfulness metrics, such as deletion and insertion tests, to further quantify explanation reliability.
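One simple way to operationalize the top-k aggregation described above is the share of total absolute attribution mass carried by the k most important tokens. The exact metric behind Table 13 is not specified in the text, so the following is an illustrative assumption:

```python
def topk_concentration(attributions, k=5):
    """Share of the total absolute attribution mass carried by the k
    highest-attributed tokens; values near 1 indicate importance
    concentrated on a few tokens, values near k/n indicate a uniform
    spread over n tokens."""
    mags = sorted((abs(a) for a in attributions), reverse=True)
    total = sum(mags)
    return sum(mags[:k]) / total if total else 0.0
```

Under this measure, a model that focuses on one restrictive phrase scores close to 1, while a model spreading importance evenly over a long document scores near k divided by the document length.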
As shown in Table 13, DeBERTa-v3 demonstrates the highest attribution concentration, prioritizing fewer but more semantically meaningful tokens compared to the BERT and RoBERTa architectures.
This difference in attribution coherence supports the quantitative findings presented in Section 4.2, where DeBERTa-v3 achieved superior recall and F1-score performance in comparison with the other two models evaluated. The explainability analysis, therefore, provides complementary evidence that architectural refinements lead not only to higher predictive accuracy but also to improved semantic interpretability.
From an operational point of view, attribution-based explanations enhance trust in automated corruption risk detection systems. By identifying the influential textual features associated with elevated risk scores, investigators can validate model outputs more efficiently and prioritize high-risk procurement cases for further examination.

4.5. Comparative Discussion of Transformer Architectures and Key Findings

The experimental results reveal a consistent performance progression across transformer generations. BERT establishes a strong baseline, demonstrating the viability of contextual embeddings for corruption detection. RoBERTa introduces clear improvements through optimized pretraining approaches, while DeBERTa-v3 achieves superior performance by leveraging disentangled attention mechanisms.
The comparative analysis of these three evaluations confirms that architectural innovation contributes directly to improved sensitivity and robustness in corruption risk detection tasks. However, increased performance comes with higher computational and latency requirements, emphasizing the need for a deployment-aware model selection strategy.
Overall, this study demonstrates that advanced transformer architectures provide significant advantages for extracting corruption-related indicators from unstructured procurement documentation, supporting the feasibility of AI-driven monitoring systems.

5. Conclusions

This study presented a comparative evaluation of three transformer-based NLP architectures (i.e., BERT, RoBERTa and DeBERTa-v3) for the detection of corruption risk indicators in procurement texts from heterogeneous sources. By combining textual analysis with structured outcome-based risk indicators, the study demonstrated that contextual language models can effectively identify linguistic patterns associated with elevated corruption risk. Moreover, the comparative evaluation of BERT, RoBERTa and DeBERTa-v3 confirmed a consistent performance progression across transformer generations, with DeBERTa-v3 achieving the strongest overall results, particularly in terms of precision, recall and F1-score, validating its superior predictive performance and contextual understanding in detecting corruption indicators. These findings highlight the importance of advanced contextual modeling when analyzing the complex procurement language often found in restrictive technical specifications.
Beyond predictive accuracy, this study emphasized explainability. Specifically, attribution-based explainability using Integrated Gradients allowed the identification of influential textual features contributing to risk predictions, supporting transparency, traceability and manual validation. The analysis results demonstrated that more advanced transformer architectures produced more coherent attribution patterns, reinforcing their suitability for operational environments, where interpretability and accountability are essential. From an applied perspective, the integration of automated risk scoring with attribution-driven explanations provides users with actionable insights that can support and enhance decision-making, while it can also reduce manual analysis effort.
The findings of this study provide substantial theoretical insights and practical policy recommendations for the modernization of public procurement oversight. Owing to complicated processes, vast volumes of financial transactions and sometimes subjective decision-making, public procurement is particularly vulnerable to corruption. As authorities move to digital e-procurement, they must deal with an overwhelming amount of data that cannot be examined manually. This study shows that using AI and machine learning tools in e-procurement can make oversight controls significantly more efficient, transparent and effective. Transformer-based NLP models help auditors and law enforcement agencies assess corruption risks and uncover illicit schemes by automating the analysis of vast datasets.
From a policy point of view, it is crucial that the use of AI-assisted automated systems for decision support follows strict ethical rules and frameworks, such as the EU AI Act, to retain accountability for any decision. Adding explainability features to AI models directly fulfills this policy requirement by making it clear which parts of the text triggered risk signals, ensuring that algorithmic decisions remain lawful, transparent and open to human review. Ultimately, standardizing such AI-powered monitoring systems can reduce discretionary decision-making, increase institutional transparency and encourage collaboration between institutions so that corruption patterns are detected more quickly across different public bodies.
Although the proposed method demonstrates strong performance and practical applicability, several directions for future research remain open. Expanding the availability and diversity of procurement documentation across jurisdictions represents a key step toward improving model robustness and generalization. Future work may also explore the integration of additional contextual features, including supplier networks, financial patterns or even graph-based relationships, to enrich corruption risk assessment beyond textual analysis. Moreover, systematic evaluation of explainability consistency and the incorporation of human-in-the-loop feedback mechanisms could enhance both model reliability and operational trust in the extracted results.
In addition, recent advancements in LLMs open promising opportunities for extending the proposed framework. While this study focused on transformer-based classification architectures to ensure stability and reproducibility, future applications may explore comparisons with zero-shot or few-shot LLMs (e.g., GPT-based approaches) for tasks such as advanced contextual reasoning, automated explanation generation, semantic summarization of procurement documents or interactive investigator assistance, which would also further contextualize the performance of fine-tuned transformer models. Furthermore, hybrid investigative approaches combining robust transformer-based risk prediction with LLM-driven analytical support could constitute a powerful next step toward intelligent and collaborative anti-corruption monitoring systems.

Author Contributions

Conceptualization, N.P., T.A., E.D. and E.A.; methodology, N.P., T.A., E.D. and E.A.; software, N.P.; validation, N.P., E.A. and E.D.; formal analysis, N.P., T.A., E.D. and E.A.; investigation, N.P., T.A. and E.D.; resources, N.P., E.A., E.D. and T.A.; data curation, N.P. and T.A.; writing—original draft preparation, N.P., T.A., E.D. and E.A.; writing—review and editing, E.A., E.D. and T.A.; visualization, N.P. and T.A.; supervision, E.A.; project administration, E.A. All authors have read and agreed to the published version of the manuscript.

Funding

Co-funded by the European Union within the Horizon Europe Program, under grant agreement No. 101121281 (Project FALCON). Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
APIs: Application Programming Interfaces
BERT: Bidirectional Encoder Representations from Transformers
BPE: Byte-Pair Encoding
CNN: Convolutional Neural Network
ESG: Environmental, Social, and Governance
GPU: Graphics Processing Unit
LDA: Latent Dirichlet Allocation
LLMs: Large Language Models
LSTM: Long Short-Term Memory
NER: Named Entity Recognition
NLP: Natural Language Processing
OCDS: Open Contracting Data Standard
PR: Precision–Recall
RF: Random Forest
SAIs: Supreme Audit Institutions
SHAP: SHapley Additive exPlanations
SVMs: Support Vector Machines
XAI: eXplainable AI

References

  1. Eyre, C. Patronage, Power, and Corruption in Pharaonic Egypt. Int. J. Public Adm. 2011, 34, 701–711. [Google Scholar] [CrossRef]
  2. Lintott, A. Electoral Bribery in the Roman Republic. J. Rom. Stud. 1990, 80, 1–16. [Google Scholar] [CrossRef]
  3. Taylor, C. Corruption and Anticorruption in Democratic Athens. In Anti-Corruption in History: From Antiquity to the Modern Era; Oxford University Press: Oxford, UK, 2017; ISBN 978-0-19-880997-5. [Google Scholar]
  4. Transparency International. What Is Corruption? Available online: https://www.transparency.org/en/what-is-corruption (accessed on 13 February 2026).
  5. Petheram, A.; Pasquarelli, W.; Stirling, R. The Next Generation of Anti-Corruption Tools: Big Data, Open Data & Artificial Intelligence. 2019. Available online: https://ec.europa.eu/futurium/en/system/files/ged/researchreport2019_thenextgenerationofanti-corruptiontools_bigdataopendataartificialintelligence.pdf (accessed on 28 February 2024).
  6. Parvanova, I. The Use of Big Data by Anticorruption Authorities; CHR, U4 Anti-Corruption Resource Centre, Michelsen Institute: Bergen, Norway, 2025. [Google Scholar]
  7. Mironov, M.; Zhuravskaya, E. Corruption in Procurement and the Political Cycle in Tunneling: Evidence from Financial Transactions Data. Am. Econ. J. Econ. Policy 2016, 8, 287–321. [Google Scholar] [CrossRef]
  8. Fazekas, M.; Kocsis, G. Uncovering High-Level Corruption: Cross-National Objective Corruption Risk Indicators Using Public Procurement Data. Br. J. Political Sci. 2017, 50, 155–164. [Google Scholar] [CrossRef]
  9. Bauer, M.; Zirker, A. Strategies of Ambiguity; Routledge: Abingdon, UK, 2024. [Google Scholar]
  10. OECD. Governing with Artificial Intelligence: The State of Play and Way Forward in Core Government Functions; OECD Publishing: Paris, France, 2025. [Google Scholar]
  11. Hajek, P.; Henriques, R. Mining Corporate Annual Reports for Intelligent Detection of Financial Statement Fraud—A Comparative Study of Machine Learning Methods. Knowl.-Based Syst. 2017, 128, 139–152. [Google Scholar] [CrossRef]
  12. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  13. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; Liu, Q., Schlangen, D., Eds.; Association for Computational Linguistics: Cedarville, OH, USA, 2020; pp. 38–45. [Google Scholar]
  14. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. Available online: https://arxiv.org/abs/1810.04805 (accessed on 26 February 2026).
  15. Aftan, S.; Shah, H. A Survey on BERT and Its Applications. In Proceedings of the 2023 20th Learning and Technology Conference (L&T), Jeddah, Saudi Arabia, 26 January 2023; pp. 161–166. [Google Scholar]
  16. Joly, M. Corruption: The Shortcut to Disaster. Sustain. Prod. Consum. 2017, 10, 133–156. [Google Scholar] [CrossRef]
  17. Damiano, R.; Polizzi, S.; Scannella, E.; Valenza, G. Corruption Detection Through Textual Analysis: Evidence from Eurozone Banks. Bus. Ethics Environ. Responsib. 2026, 35, 1017–1037. [Google Scholar] [CrossRef]
  18. Mohamed, A.N.; Manaa, M.E.; Soni, S.; Kizi, S.S.K.; Doss, D. Financial Fraud Detection in Algorithmic Trading Systems Using BERT Variants and Time-Series Embedding. In Proceedings of the 2025 3rd International Conference on Cyber Resilience (ICCR), Dubai, United Arab Emirates, 3–4 July 2025; pp. 1–6. [Google Scholar]
  19. Erva Ergun, Z.; Sefer, E. Financial Statement Fraud Detection via Large Language Models. Intell. Syst. Account. Financ. Manag. 2025, 32, e70021. [Google Scholar] [CrossRef]
  20. Lima, W.; Lira, R.; Paiva, A.; Silva, J.; Silva, V. Methodology for Automatic Extraction of Red Flags in Public Procurement. In Proceedings of the 2023 International Joint Conference on Neural Networks (IJCNN), Gold Coast, Australia, 18–23 June 2023; pp. 1–7. [Google Scholar]
  21. Torres Berrú, Y.; Batista, V.; Conde, L. A Data Mining Approach to Detecting Bias and Favoritism in Public Procurement. Intell. Autom. Soft Comput. 2023, 36, 3501–3516. [Google Scholar] [CrossRef]
  22. Salazar, A.; Pérez, J.F.; Gallego, J. VigIA: Prioritizing Public Procurement Oversight with Machine Learning Models and Risk Indices. Data Policy 2024, 6, e75. [Google Scholar] [CrossRef]
  23. Muñoz-Cancino, R.; Ríos, S.A. Data-Driven Transparency: Machine Learning and Social Network Analysis for Corruption Detection in Public Procurement. Procedia Comput. Sci. 2025, 270, 1788–1795. [Google Scholar] [CrossRef]
  24. Pernica, B.; Palavenis, D.; Dvorak, J. Small Arms Procurement and Corruption in Small NATO Countries. J. Public Procure. 2024, 24, 348–370. [Google Scholar] [CrossRef]
  25. Aldana, A.; Falcón-Cortés, A.; Larralde, H. A Machine Learning Model to Identify Corruption in México’s Public Procurement Contracts. arXiv 2022. [Google Scholar] [CrossRef]
  26. Ayobami, A.T.; Mike-Olisa, U.; Chidera Ogeawuchi, J.; Abayomi, A.A.; Agboola, O.A. Algorithmic Integrity: A Predictive Framework for Combating Corruption in Public Procurement through AI and Data Analytics. J. Front. Multidiscip. Res. 2023, 4, 130–141. [Google Scholar] [CrossRef]
  27. Beltran, A. Fiscal Data in Text: Information Extraction from Audit Reports Using Natural Language Processing. Data Policy 2023, 5, e7. [Google Scholar] [CrossRef]
  28. Ash, E.; Galletta, S.; Giommoni, T. A Machine Learning Approach to Analyze and Support Anti-Corruption Policy. SSRN J. 2021, 17, 162–193. [Google Scholar] [CrossRef]
  29. Xiao, Q. Automated Detection of Corruption Reports in Text via Deep Reinforcement Learning. Sci. Rep. 2025, 15, 36674. [Google Scholar] [CrossRef] [PubMed]
  30. Indriyanti, A.D.; Gernowo, R.; Sediyono, E. Machine Learning Approach for Sentiment and Topic Analysis on Social Media X: Case Study of Corruption Handling by the East Java Government. In Proceedings of the 2025 Eight International Conference on Vocational Education and Electrical Engineering (ICVEE), Surabaya, Indonesia, 24–25 September 2025; pp. 239–245. [Google Scholar]
  31. Cho, S.-H.; Kim, D.; Kwon, H.-C.; Kim, M. Exploring the Potential of Large Language Models for Author Profiling Tasks in Digital Text Forensics. Forensic Sci. Int. Digit. Investig. 2024, 50, 301814. [Google Scholar] [CrossRef]
  32. Li, M.; Zhang, Y.; Zou, W.; Chen, H.; Yang, X.; Chen, T. Geographical Network Analysis of Drug Trafficking in China (2012–2024): A Method Based on Large Language Models. J. Saf. Sci. Resil. 2025, 100273. [CrossRef]
  33. Anwar, M. mBERT: Multilingual BERT. Available online: https://anwarvic.github.io/cross-lingual-lm/mBERT (accessed on 18 March 2026).
  34. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-Lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J., Eds.; Association for Computational Linguistics: Red Hook, NY, USA, 2020; pp. 8440–8451. [Google Scholar]
  35. Kenny, C.; Musatova, M. ‘Red Flags of Corruption’ in World Bank Projects: An Analysis of Infrastructure Contracts. In International Handbook on the Economics of Corruption; Elgar Publishing: Camberley, UK, 2010; Volume Two, p. 499. [Google Scholar]
  36. Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The Long-Document Transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar]
  37. Smith, M.; Ruxton, G. Effective Use of the McNemar Test. Behav. Ecol. Sociobiol. 2020, 74, 133. [Google Scholar] [CrossRef]
  38. Seabold, S.; Perktold, J. Statsmodels: Econometric and Statistical Modeling with Python. In Proceedings of the 9th Python in Science Conference, Austin, TX, USA, 28 June–3 July 2010. [Google Scholar]
Figure 1. Overall workflow of the proposed AI-driven corruption risk detection framework.
Figure 2. ROC Curve Comparison between BERT, RoBERTa and DeBERTa-v3.
Figure 3. Precision–Recall Curve for BERT, RoBERTa and DeBERTa-v3 models.
Table 1. Overview of the Architectural Components and Functional Roles.
Stage | Process | Objective
1 | Data Ingestion and Linking | Synchronization of OCDS structured data with raw PDF/Text specifications
2 | NLP Preprocessing Procedures | Cleaning, segmentation and tokenization for Transformer compatibility
3 | Model Inference | Comparative risk prediction using BERT, RoBERTa and DeBERTa-v3
4 | Risk Scoring and XAI | Generation of interpretable risk scores and influential-feature extraction for users
Table 2. Mapping of OCDS Structured Fields to Corruption Risk Indicators.
OCDS Field | Data Type | Risk Interpretation
tender/numberOfTenderers | Integer | A single tenderer signals restricted competition (Competition Risk)
tender/procurementMethod | Categorical (String) | Non-open procedures signal reduced transparency (Transparency Risk)
awards/value/amount | Float (Numerical) | Award values substantially above the estimate signal overpricing (Financial Risk)
tender/selectionCriteria | Textual (Metadata) | Overly restrictive or tailored criteria signal potential favoritism
Table 3. Dataset Summary Statistics.
Category | Metric Description | Value
Total procurement records | Number of instances collected | 34,297
Records with textual input | Records with associated textual specifications | 10,742
Total processed documents | Documents remaining after preprocessing | 10,120
Average document length | Mean document length in tokens | 512
Final dataset samples | Training samples (70%) | 7084
Final dataset samples | Validation samples (15%) | 1518
Final dataset samples | Testing samples (15%) | 1518
Class distribution | High-risk records (positive class) | 1214 (12%)
Class distribution | Low-risk records (negative class) | 8906 (88%)
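As a consistency check, the split sizes and class shares reported in Table 3 follow directly from a 70/15/15 partition of the 10,120 processed documents (an illustrative sketch, not the authors' code):

```python
# Table 3 consistency check: 70/15/15 split of the 10,120 processed documents.
total_docs = 10_120

train = round(total_docs * 0.70)   # training samples
val = round(total_docs * 0.15)     # validation samples
test = total_docs - train - val    # remainder forms the test set

print(train, val, test)                       # 7084 1518 1518
print(f"{1214 / total_docs:.0%} high-risk")   # 12% high-risk
```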
Table 4. Correlation of Risk Indicators and Textual Outcomes.
Risk Indicator | Textual Outcome | Threshold
Competition Risk | Single-bidding submission | numberOfTenderers = 1
Transparency Risk | Restricted procedure | procurementMethod ≠ "open"
Financial Risk | Awarded price > estimated price | (awardValue / estimatedValue) > 1.2
Anomalous Risk | Abnormally short tender period | tenderPeriod < statutory minimum
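The thresholds in Table 4 translate directly into boolean labeling rules. A minimal sketch applying them to a single OCDS-style record; the helper name and record layout are illustrative, and the statutory minimum period is a placeholder value:

```python
# Outcome-driven labeling rules of Table 4 applied to one OCDS-style record.
def risk_flags(record, statutory_minimum_days=30):  # placeholder minimum
    tender, award = record["tender"], record["awards"][0]
    return {
        "competition": tender["numberOfTenderers"] == 1,
        "transparency": tender["procurementMethod"] != "open",
        # tender/value is the estimated value in OCDS
        "financial": award["value"]["amount"] / tender["value"]["amount"] > 1.2,
        "anomalous": tender["tenderPeriodDays"] < statutory_minimum_days,
    }

record = {
    "tender": {"numberOfTenderers": 1, "procurementMethod": "open",
               "value": {"amount": 100_000}, "tenderPeriodDays": 45},
    "awards": [{"value": {"amount": 130_000}}],
}
print(risk_flags(record))
# {'competition': True, 'transparency': False, 'financial': True, 'anomalous': False}
```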
Table 5. Different Stages of the NLP Preprocessing Pipeline.
Preprocessing Stage | Action | Objective
Extraction | PDF-to-Text conversion | Transformation of non-indexed PDF documents into machine-readable strings
Cleaning | Regex-based noise removal | Removal of non-semantic artifacts and headers
Tokenization | Sub-word Byte-Pair Encoding (BPE) | Preparation of raw text into vocabularies compatible with the BERT, RoBERTa and DeBERTa-v3 models
Segmentation | Truncation | Management of documents exceeding the 512-token transformer limit
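The cleaning and segmentation stages of Table 5 can be sketched as follows; the regex patterns are illustrative placeholders (the paper does not list its exact rules), and whitespace tokens stand in for the sub-word BPE pieces produced by the actual tokenizers:

```python
import re

MAX_TOKENS = 512  # transformer input limit handled by truncation

def clean(text: str) -> str:
    text = re.sub(r"Page \d+ of \d+", " ", text)  # strip running headers/footers
    text = re.sub(r"[^\S\n]+", " ", text)         # collapse spaces and tabs
    return text.strip()

def truncate(tokens: list[str]) -> list[str]:
    # Real pipelines truncate at the sub-word (BPE) level.
    return tokens[:MAX_TOKENS]

raw = "Tender Notice   Page 3 of 12\nRequires exclusive proprietary certification"
tokens = truncate(clean(raw).split())
print(tokens)
# ['Tender', 'Notice', 'Requires', 'exclusive', 'proprietary', 'certification']
```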
Table 6. Hyperparameter Selection for Model Training.
Technical Parameter | Value | Description
Learning Rate | 2 × 10⁻⁵ | Standard for fine-tuning transformer models
Weight Decay | 0.01 | Prevents overfitting on domain-specific procurement terminology
Warmup Steps | 10% of total steps | Ensures stable gradient updates in early epochs
Batch Size | 16 or 32 | Selected based on GPU memory constraints and document length
Dropout Rate | 0.1 | Regularization of the classification head
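Expressed as a configuration, the settings in Table 6 look as follows; the number of epochs is an illustrative assumption used only to show how the 10% warmup fraction translates into optimizer steps:

```python
# Hyperparameters of Table 6 as a fine-tuning configuration sketch.
config = {
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "warmup_ratio": 0.10,  # 10% of total optimization steps
    "batch_size": 16,      # or 32, depending on GPU memory
    "dropout": 0.1,        # applied to the classification head
}

# Assuming 3 epochs over the 7084 training samples (epoch count is assumed):
steps_per_epoch = 7084 // config["batch_size"]
total_steps = 3 * steps_per_epoch
warmup_steps = int(total_steps * config["warmup_ratio"])
print(total_steps, warmup_steps)  # 1326 132
```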
Table 7. Attribution-Based Explainability Framework.
Component | Technical Implementation | Description
Local Attribution Analysis | Integrated Gradients (token-level attribution) | Quantify the contribution of individual tokens to the predicted risk score
Comparative Model Analysis | Cross-model attribution comparison | Evaluate differences in semantic consistency and contextual focus
Operational Decision Support | Highlighted influential phrases for investigators | Facilitate manual validation and enhance transparency and traceability
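Integrated Gradients attributes a prediction to input features by integrating the gradient along a straight path from a baseline to the input. A self-contained sketch on a toy differentiable score; real use applies the same scheme to the token embeddings of the fine-tuned models, and the function and inputs here are illustrative:

```python
# Midpoint Riemann approximation of Integrated Gradients:
# attr_i = (x_i - b_i) * integral over a in [0, 1] of dF/dx_i at b + a*(x - b)
def integrated_gradients(grad_fn, x, baseline, steps=50):
    attrs = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_fn(point)
        for i in range(len(x)):
            attrs[i] += (x[i] - baseline[i]) * g[i] / steps
    return attrs

# Toy score F(x) = x0**2 + 3*x1, with analytic gradient (2*x0, 3).
grad = lambda p: [2 * p[0], 3.0]
attrs = integrated_gradients(grad, x=[2.0, 1.0], baseline=[0.0, 0.0])
print(attrs)  # completeness: attributions sum to F(x) - F(baseline) = 7
```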
Table 8. Operational Evaluation and Deployment Metrics.
Category | Metric | Objective
Detection Quality | Precision, Recall, F1, ROC-AUC | Ensures high-fidelity risk identification
System Latency | Inference time per document | Supports real-time monitoring of large-scale data portals
Robustness | Stability across textual formats | Ensures consistent performance
Scalability | API throughput | Handles concurrent requests from multiple client-side agents
Table 9. Comparative Performance of the BERT, RoBERTa and DeBERTa-v3 Models.
Model | Precision | Recall | F1-Score | F1-Score (95% CI) | ROC-AUC
BERT | 0.785 | 0.742 | 0.763 | [0.742, 0.784] | 0.812
RoBERTa | 0.824 | 0.807 | 0.807 | [0.787, 0.827] | 0.865
DeBERTa-v3 | 0.868 | 0.885 | 0.876 | [0.859, 0.893] | 0.904
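The F1-scores in Table 9 are the harmonic mean of precision and recall; for instance, the BERT and DeBERTa-v3 rows can be reproduced from their precision/recall pairs:

```python
# F1 as the harmonic mean of precision and recall (Table 9 spot-check).
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.785, 0.742), 3))  # 0.763 (BERT)
print(round(f1(0.868, 0.885), 3))  # 0.876 (DeBERTa-v3)
```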
Table 10. Class-Level Performance per Corruption Risk Indicator.
Model | Risk Indicator | Precision | Recall | F1-Score
BERT | Competition Risk | 0.742 | 0.763 | 0.812
RoBERTa | Competition Risk | 0.807 | 0.807 | 0.865
DeBERTa-v3 | Competition Risk | 0.885 | 0.876 | 0.904
BERT | Transparency Risk | 0.712 | 0.685 | 0.698
RoBERTa | Transparency Risk | 0.764 | 0.742 | 0.753
DeBERTa-v3 | Transparency Risk | 0.821 | 0.844 | 0.832
BERT | Financial Risk | 0.695 | 0.654 | 0.674
RoBERTa | Financial Risk | 0.752 | 0.718 | 0.735
DeBERTa-v3 | Financial Risk | 0.803 | 0.812 | 0.807
Table 11. Operational Performance Comparison between the Different Models.
Model | BERT | RoBERTa | DeBERTa-v3
Inference Time (ms/doc) | 9.2 | 11.8 | 16.4
Avg. GPU Memory Usage (GB) | 4.2 | 4.8 | 6.1
Throughput (docs/s) | 108 | 85 | 61
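The throughput figures in Table 11 agree, up to rounding, with the single-stream latencies, since documents per second is approximately 1000 ms divided by the per-document inference time:

```python
# Derive approximate throughput from per-document latency (Table 11).
latencies_ms = {"BERT": 9.2, "RoBERTa": 11.8, "DeBERTa-v3": 16.4}
rates = {model: round(1000 / ms, 1) for model, ms in latencies_ms.items()}
print(rates)  # {'BERT': 108.7, 'RoBERTa': 84.7, 'DeBERTa-v3': 61.0}
```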
Table 12. Comparison of Token- and Phrase-Level Attribution across the BERT, RoBERTa and DeBERTa-v3 Models.
Original Text Segment | BERT | RoBERTa | DeBERTa-v3
"Requires exclusive proprietary certification from X…" | "exclusive", "X" | "exclusive proprietary" | "exclusive proprietary certification"
"Immediate delivery within 24 h is mandatory…" | "24", "mandatory" | "24 h", "mandatory" | "Immediate delivery within 24 h"
"Specific technical brand-name only…" | "brand-name" | "technical brand-name" | "Specific technical brand-name"
Table 13. Quantitative Attribution Concentration Analysis.
Model | Avg. Top-5 Attribution (%) | Avg. Tokens Covering 50% of Attribution | Attribution Concentration
BERT | 41.2 | 9.8 | Low
RoBERTa | 52.7 | 6.3 | Medium
DeBERTa-v3 | 64.5 | 4.1 | High
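The two concentration metrics in Table 13 can be computed from a vector of per-token attribution magnitudes; the attribution values below are illustrative, not taken from the models:

```python
# Attribution concentration metrics of Table 13 for one document.
def concentration(attributions):
    total = sum(attributions)
    ranked = sorted(attributions, reverse=True)
    top5_share = 100 * sum(ranked[:5]) / total   # share held by top-5 tokens
    covered, n_tokens = 0.0, 0
    for a in ranked:                              # fewest tokens reaching 50%
        covered += a
        n_tokens += 1
        if covered >= total / 2:
            break
    return top5_share, n_tokens

attrs = [0.30, 0.25, 0.15, 0.10, 0.05, 0.05, 0.04, 0.03, 0.02, 0.01]
top5, n_half = concentration(attrs)
print(top5, n_half)  # top-5 tokens hold 85%; 2 tokens cover half the attribution
```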
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Peppes, N.; Alexakis, T.; Daskalakis, E.; Adamopoulou, E. AI-Driven Corruption Risk Indicator Detection: A Comparative Evaluation of Transformer-Based NLP Models in Unstructured Procurement Data. Information 2026, 17, 329. https://doi.org/10.3390/info17040329


Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
