Article

TextNeX: Text Network of eXperts for Robust Text Classification—Case Study on Machine-Generated-Text Detection

by Emmanuel Pintelas 1,*, Athanasios Koursaris 2, Ioannis E. Livieris 3 and Vasilis Tampakas 4

1 Department of Mathematics, University of Patras, GR 265-00 Patras, Greece
2 Department of Mechanical Engineering and Aeronautics, University of Patras, GR 265-00 Patras, Greece
3 Department of Statistics & Insurance Science, University of Piraeus, GR 185-32 Piraeus, Greece
4 Department of Electrical and Computer Engineering, University of Peloponnese, GR 263-34 Patras, Greece
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(10), 1555; https://doi.org/10.3390/math13101555
Submission received: 31 March 2025 / Revised: 26 April 2025 / Accepted: 29 April 2025 / Published: 9 May 2025
(This article belongs to the Section E1: Mathematics and Computer Science)

Abstract
Efficient and accurate text classification is essential for a wide range of natural language processing applications, including sentiment analysis, spam detection and machine-generated text identification. While recent advancements in transformer-based large language models have achieved remarkable performance, they often come with significant computational costs, limiting their applicability in resource-constrained environments. In this work, we propose TextNeX, a new ensemble model that leverages lightweight language models to achieve state-of-the-art performance while maintaining computational efficiency. The development of the TextNeX model follows a three-phase procedure: (i) Expansion: generation of a pool of diverse lightweight models via randomized model setups and variations of the training data; (ii) Selection: application of a clustering-based, heterogeneity-driven selection to retain the most complementary models; and (iii) Ensemble optimization: optimization of the selected models’ contributions using sequential quadratic programming. Experimental evaluations on three challenging text classification datasets demonstrate that TextNeX outperforms existing state-of-the-art ensemble models in accuracy, robustness and computational effectiveness, offering a practical alternative to large-scale models for real-world applications.

1. Introduction

Text classification is a fundamental task in Natural Language Processing (NLP), with applications ranging from spam detection and sentiment analysis to the increasingly critical task of machine-generated text (MGT) detection [1,2]. In recent years, the rapid development of large language models (LLMs), such as BERT [3], RoBERTa [4], DeBERTa [5] and XLNet [6], has led to significant improvements in classification performance across a wide range of NLP tasks [7,8]. These models are widely adopted due to their strong generalization capabilities and their ability to extract rich contextual representations from text.
Despite their success, these transformer-based models are computationally intensive, requiring substantial memory and processing power for both training and inference. This high resource demand makes their deployment challenging, especially in real-world, resource-constrained environments such as mobile devices, embedded systems, or large-scale production pipelines [9,10,11]. Additionally, ensemble approaches based on these models further amplify the computational burden, often involving complex combinations or meta-learning steps that lead to inefficiencies and hinder scalability. Therefore, there is a critical need for efficient yet accurate alternatives that maintain high classification performance while significantly reducing resource usage.
To address this limitation, we propose TextNeX (Text Network of eXperts), an ensemble framework based on lightweight transformer architectures. The aim is to develop a model that achieves competitive or superior performance compared to heavyweight solutions, while remaining practical and scalable in low-resource settings. The motivation behind our approach is the development of a prediction model, which optimally balances accuracy, computational efficiency and resource constraints. TextNeX is built utilizing a three-phase approach: (i) Expansion, where a diverse population of lightweight transformer-based models is generated through randomized training configurations and variations in the training data, ensuring a wide range of model behaviors and perspectives; (ii) Selection, where a heterogeneity-driven clustering and selection process identifies the most heterogeneous and complementary models (experts) for the final ensemble, minimizing redundancy and overfitting and (iii) Ensemble optimization, where the contributions of the selected experts are optimized using Sequential Quadratic Programming (SQP) with Powell’s derivative-free method [12]. This gradient-free optimization ensures robust and efficient weight assignment, balancing accuracy and computational efficiency, especially in scenarios where gradient-based methods are impractical.
Although text classification encompasses a wide range of tasks, we specifically focus on MGT detection as a case study for evaluating the proposed model. The rapid advancement of LLMs has made MGT detection increasingly relevant and challenging, as AI-generated text becomes harder to distinguish from human-written content. This task serves as an ideal benchmark for assessing the robustness, accuracy and efficiency of text classification models, particularly in resource-constrained environments. The main contributions of this work are summarized as follows:
  • We propose TextNeX, a new model that achieves state-of-the-art performance on text classification benchmarks, demonstrating its superiority over traditional ensemble-based models. In addition, it maintains significantly lower computational costs, making it a scalable solution for real-world applications.
  • We propose a new ensemble learning approach, which balances high accuracy and computational efficiency by leveraging lightweight transformer-based models.
  • We propose a clustering-based selection process, which identifies the most complementary models (experts) from a pool of trained ones for the final ensemble. Unlike traditional methods, which focus solely on validation performance, our approach prioritizes model diversity, reducing the risk of overfitting and improving generalization and robustness.
The remainder of this paper is organized as follows. Section 2 reviews the state-of-the-art text classification models based on ensemble approaches relevant to this work. Section 3 provides a detailed description of the TextNeX model, including its methodology and construction. Section 4 presents the datasets used for the evaluation, compares TextNeX with existing SoA models and discusses the experimental results. Finally, Section 5 concludes the paper with key findings and outlines directions for future work.

2. Related Work

Machine-Generated Text (MGT) detection has become increasingly prevalent in the last few years, primarily due to the ability of LLMs to generate high-quality text across a wide array of domains. The continuous improvement of LLMs in terms of capabilities and complexity has given rise to concerns about the malicious use of MGT, including misinformation, deception, fraudulent activity and academic dishonesty [13]. MGT detection is typically formulated as a classification problem, focusing on discerning between human-written and AI-generated text, while in some cases there have been attempts to address the problem of model attribution, i.e., identifying the model (GPT-4, LLaMa, Gemini, etc.) that generated a given textual passage. Recent research has demonstrated the effectiveness of transformer-based classifiers as well as ensemble classifiers that combine transformers with traditional machine learning (ML) models.
Gambini et al. [8] evaluated the performance of various ensemble stacking models, which exploit the predictions of models including BERT, BART, RoBERTa and DistilBERT. The evaluation was based on the Deepfake Tweets detection dataset, in which the fake tweets were produced by GPT-2 and GPT-3. The experimental analysis demonstrated that (i) the performance of all models depended considerably on the selection of hyperparameters as well as on the employed meta-learner; and (ii) none of the evaluated models was able to generalize well to GPT-3-generated tweets.
Mikros et al. [9] proposed a new ensemble model, which combines transformer-based learners (RoBERTa, ELECTRA and XLNet) with tree-based learners (XGBoost/LightGBM). The latter were fitted with features, such as the GPT-2 embeddings, as well as various stylometric features. The proposed model was compared with several single and ensemble-based models on two versions of the AuTexTification dataset, while the numerical experiments highlighted that the proposed model was able to achieve the best overall performance. Nevertheless, the primary limitation of the proposed approach is that stylometric features did not demonstrate a substantial ability to enhance overall performance.
Abburi et al. [2] proposed a variation of the stacking approach, based entirely on transformer models (BERT, RoBERTa, XLM-RoBERTa, DeBERTa). The predicted outputs were used to generate feature vectors, which in turn, were used to train an ensemble voting meta-classifier, composed of traditional ML models (logistic regression, random forest, naive Bayes and support vector machine). Based on the conducted numerical experiments, the authors stated that (i) the proposed ensemble model was able to outperform single transformer-based models and (ii) further experimentation and more sophisticated approaches are required to improve the results of the MGT classification task.
Preda et al. [11] proposed a new model based on the stacking methodology, combining the individual predictions of three transformer-based models (XLM-RoBERTa, TwHIN-BERT and multilingual BERT), while an XGBoost classifier was used as the meta-learner. The experimental evaluation on the AuTexTification benchmark reported that the proposed model exhibited the top performance among all evaluated ensemble models. In addition, the authors stated that the main limitation of their work was the employment of grid search for hyperparameter optimization and base-model selection to avoid overfitting.
Sheykhlan et al. [1] proposed a soft voting ensemble model by averaging the probabilities of the predicted classes produced by multiple LLMs, including BLOOM-560m, ErnieM and mDeBERTaV3. In their research, the authors evaluated the performance of the proposed model on two MGT sub-tasks of the IberAuTexTification benchmark (one binary and one multi-class). Their experiments showed that the proposed model achieved the best performance in the binary classification task and the second-best performance in the multi-class classification task. Furthermore, based on their findings, the authors stated that ensemble learning has considerable potential to improve the effectiveness of AI text detection systems.
While recent efforts have introduced stacking or hybrid ensemble strategies combining large-scale transformer models with traditional classifiers [2,8,9,11], they either rely heavily on computationally demanding models or involve ensemble processes that neglect the importance of model diversity. Moreover, most selection strategies focus exclusively on validation performance, often leading to overfitting and reduced generalizability [14]. To the best of our knowledge, no existing approach has explored the joint use of lightweight transformer models, clustering-based expert selection, and derivative-free ensemble optimization to simultaneously optimize accuracy, generalization, and computational cost. This gap motivates our proposal of TextNeX as a new, practical alternative.

3. TextNeX

TextNeX (Text Network of eXperts) is a new ensemble-based model designed for text classification tasks. While previous methodologies rely on heavy and large-scale transformer ensembles, TextNeX offers a lightweight and effective alternative by leveraging diverse small-scale text networks trained under heterogeneous conditions.
Figure 1 illustrates the core methodology for the development of TextNeX model, which follows a three-phase process: (i) Expansion, where a pool of diverse text models is generated through randomized hyperparameter configurations and training variations; (ii) Selection, which applies a clustering-based approach to identify heterogeneous and complementary models (experts), aiming at minimizing redundancy and overfitting and (iii) Ensemble optimization, where the selected experts are combined using an optimization framework to maximize predictive performance.

3.1. Expansion

The first phase involves the generation of a pool of diverse models, each trained under different configurations and variations of the training data to enhance heterogeneity. The primary goal is to ensure diversity in learned representations by encouraging the models to capture distinct linguistic patterns and text characteristics for the development of an accurate and powerful ensemble.
Specifically, at each step, various hyperparameter configurations are introduced, including changes in learning rates, dropout rates and weight initialization. To enhance the diversity of the generated models, the training process employs a bagging strategy, wherein each model is trained on a randomly sampled subset of the training data using bootstrapping (i.e., sampling with replacement). This approach ensures that each model is exposed to a unique subset of the training data, fostering greater variation in the learned representations. As a result, the trained models develop different decision-making behaviors, further promoting heterogeneity in the final ensemble.
Algorithm 1 presents a high-level description of the procedure for model generation and training. Initially, the pool of trained models, denoted as M, is defined as an empty set (Step 1). At each iteration (Steps 2–8), a base model architecture is randomly selected from a set of selected transformer-based architectures T (Step 3) and a subset of the training data D_train is sampled (Step 4). Next, a new instance of the base model T_i is initialized with randomly selected hyperparameter settings (Step 5) and trained on the bootstrapped data (Step 6). The trained model is then added to the pool of models M (Step 7). This process is repeated for N iterations to create a diverse set of models.
Algorithm 1: Expansion: Generating Diverse Models
  • Inputs:
  •   T: set of transformer-based architectures
  •   D_train: training dataset
  •   N: number of iterations
  • Output:
  •   Pool of trained models M
1: M = ∅
2: for i = 1 to N do
3:   Randomly select a base model architecture T_i from T
4:   Sample a subset D_train^(i) of the training data from D_train using bootstrapping
5:   Initialize a new instance of T_i with randomly selected hyperparameter settings H_i
6:   Train T_i on D_train^(i) using H_i
7:   Add the trained model T_i to M
8: end for
In our experiments, the number of iterations was set to N = 50, while the selected lightweight pre-trained transformer architectures were the following:
  • DistilBERT [15]: A distilled version of BERT, retaining 95% of its performance while being 40% smaller and 60% faster.
  • MiniLM [16]: A highly efficient transformer model with variants such as MiniLM-L6-H384 (6 layers, 384 hidden units) and MiniLM-L12-H384 (12 layers, 384 hidden units), offering a balance between size and performance.
  • MobileBERT [17]: A compact BERT variant optimized for mobile devices, featuring fewer parameters and faster inference.
These models were specifically chosen for their efficiency and suitability for resource-constrained environments as well as their ability to generalize well across various text classification tasks.
It is worth noting that the selected lightweight transformer architectures, DistilBERT, MiniLM, and MobileBERT, share the foundational transformer encoder structure but differ in depth, parameter size, and optimization objectives. DistilBERT is a distilled version of BERT, designed to retain performance with fewer layers. MiniLM variants compress the attention mechanism while preserving contextual richness through deep supervision, and MobileBERT adopts a bottleneck structure and inverted residuals to enhance efficiency on edge devices. These differences contribute to architectural diversity and enable the ensemble to capture complementary patterns.
Moreover, we acknowledge that the effectiveness of specific hyperparameter choices (e.g., learning rate, sequence length, dropout rate) can vary across languages and domains. Practitioners applying TextNeX to non-English or domain-specific corpora are encouraged to adapt the randomization ranges and consider language-specific preprocessing steps. Future work will explore multilingual extensions and domain-aware configurations to further enhance the generalizability of the framework.
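To make the Expansion procedure (Algorithm 1) concrete, the following minimal sketch mirrors its structure in Python. It is an illustrative approximation rather than the released implementation: scikit-learn classifiers over synthetic features stand in for fine-tuning DistilBERT, MiniLM and MobileBERT, and the architecture labels and hyperparameter ranges are assumptions chosen only to demonstrate the random architecture choice, bootstrap resampling and randomized hyperparameter draws.

```python
# Minimal sketch of the Expansion phase (Algorithm 1). Simple scikit-learn classifiers
# on synthetic features stand in for the lightweight transformers; the focus is on the
# random architecture choice, bootstrap resampling and randomized hyperparameters.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_train, y_train = make_classification(n_samples=2000, n_features=50, random_state=0)

# Stand-ins for the set T of lightweight transformer architectures (hypothetical labels).
ARCHITECTURES = {
    "distilbert-like": lambda hp: LogisticRegression(C=hp["C"], max_iter=500),
    "minilm-like": lambda hp: LogisticRegression(C=hp["C"], penalty="l1", solver="liblinear"),
    "mobilebert-like": lambda hp: LinearSVC(C=hp["C"], max_iter=2000),
}

N = 50     # number of expansion iterations, as in the paper
pool = []  # pool of trained models M (Step 1)
for i in range(N):
    arch = rng.choice(list(ARCHITECTURES))                     # Step 3: random architecture
    idx = rng.integers(0, len(X_train), len(X_train))          # Step 4: bootstrap sample of D_train
    hparams = {"C": float(10 ** rng.uniform(-2, 1))}           # Step 5: random hyperparameters H_i
    model = ARCHITECTURES[arch](hparams).fit(X_train[idx], y_train[idx])  # Step 6: train T_i
    pool.append({"architecture": arch, "hyperparameters": hparams, "model": model})  # Step 7: add to M

print(f"Expansion produced a pool of {len(pool)} candidate models")
```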

3.2. Selection

After the Expansion phase is complete, the pool of generated models undergoes a pruning/selection process to identify the most heterogeneous and complementary subset for the final ensemble. Unlike traditional performance-based selection, which often leads to validation overfitting [14], we employ a clustering-based approach, which is presented in Algorithm 2 and balances both model diversity and predictive performance.
Firstly, we extract the probabilistic outputs of each model on the validation set, treating them as high-dimensional feature vectors, which encode their decision-making behavior. These vectors are then projected into a lower-dimensional space using Uniform Manifold Approximation and Projection (UMAP) [18], which preserves both the local and global structure of the data. Next, a Gaussian Mixture Model (GMM) [19,20] is employed to cluster the models into groups based on their decision patterns. Notice that UMAP is employed for its ability to improve the clustering process by effectively grouping similar decision-making patterns and separating distinct ones, while GMM is chosen for its flexibility in modeling complex, multi-modal distributions and its ability to represent overlapping clusters, which is crucial when dealing with diverse models. In addition, to optimize the clustering process, the hyperparameters of both UMAP and GMM are fine-tuned by maximizing the Silhouette score using Bayesian optimization [21]. This allows for an efficient search of the parameter space, ensuring that the resulting clusters are both meaningful and well separated, thereby improving the overall selection process.
Next, the model of each cluster whose predictions are closest to its centroid is selected as the most representative one. The selected models are referred to as “experts” due to their ability to contribute unique and complementary decision-making patterns to the final ensemble. By focusing on the models, whose predictions are closest to the centroids of their respective clusters, we ensure that each expert brings a distinct perspective; therefore, enriching the diversity of the ensemble. The term “expert” is used because each model within a cluster represents a specialized area of knowledge or decision-making expertise, reflected in its behavior and decision patterns. The clustering-based approach allows us to identify these models as experts, as they have demonstrated the ability to capture and represent different aspects of the data, enhancing the robustness of the final ensemble. Note that the selected experts may include a mix of different base architectures depending on their complementary decision patterns.
Algorithm 2 summarizes the preceding procedure. In Step 1, the expert set M_E is initialized as an empty set. In Steps 2–4, the logits (probabilistic outputs) are computed on the validation set for all models. In Step 5, UMAP is applied to project these logits into a lower-dimensional representation space, while in Step 6, clustering is performed using a GMM on the reduced space. In Steps 7–10, the expert selection from each cluster is conducted. Specifically, for each cluster C_k, the centroid μ_k is calculated (Step 8), the model whose logits are closest to μ_k is identified (Step 9) and this model is added to M_E as a selected expert (Step 10).
Algorithm 2: Selection: Selecting Heterogeneous Models
  • Inputs:
  •   M: pool of trained models
  •   D_val: validation dataset
  • Output:
  •   M_E: set of selected models (experts)
1: M_E = ∅
2: for each model m_i ∈ M do
3:   Compute the logits of model m_i on D_val
4: end for
5: Project the logits into a lower-dimensional space using UMAP
6: Perform clustering on the reduced logits using GMM
7: for each cluster C_k do
8:   Compute the cluster centroid μ_k
9:   Select the model T_k whose logits are closest to μ_k
10:  Add T_k to M_E
11: end for
Note: The UMAP and GMM hyperparameters are fine-tuned by maximizing the Silhouette score using Bayesian optimization.
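The sketch below illustrates the Selection phase of Algorithm 2, assuming the umap-learn and scikit-learn packages are available. Synthetic validation probabilities stand in for the outputs of the pooled models, the UMAP/GMM settings are fixed rather than tuned by Bayesian optimization of the Silhouette score as described above, and the centroids are computed in the UMAP space, which is one possible reading of Step 8.

```python
# Minimal sketch of the Selection phase (Algorithm 2): UMAP projection, GMM clustering,
# and centroid-based expert selection. Synthetic probabilities stand in for the pool M.
import numpy as np
import umap
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
n_models, n_val = 50, 400

# Each row holds one model's predicted positive-class probabilities on D_val;
# three latent "behaviour groups" are simulated for illustration.
group = rng.integers(0, 3, n_models)
val_probs = np.clip(
    rng.normal(loc=0.3 + 0.2 * group[:, None], scale=0.05, size=(n_models, n_val)), 0.0, 1.0
)

# Step 5: project the decision profiles into a low-dimensional space with UMAP.
embedding = umap.UMAP(n_components=2, n_neighbors=10, random_state=0).fit_transform(val_probs)

# Step 6: cluster the embedded profiles with a Gaussian Mixture Model
# (a fixed configuration is used here instead of the tuned one described above).
gmm = GaussianMixture(n_components=3, random_state=0).fit(embedding)
labels = gmm.predict(embedding)
print("Silhouette score:", silhouette_score(embedding, labels))

# Steps 7-10: from each cluster, keep the model closest to the cluster centroid.
experts = []
for k in range(gmm.n_components):
    members = np.where(labels == k)[0]
    centroid = embedding[members].mean(axis=0)
    closest = members[np.argmin(np.linalg.norm(embedding[members] - centroid, axis=1))]
    experts.append(int(closest))
print("Selected expert indices:", experts)
```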

3.3. Ensemble Optimization

The last phase in the development of the TextNeX model involves computing the optimal weight contributions of the selected experts using Sequential Quadratic Programming (SQP). The goal of this phase is to ensure that the developed ensemble achieves the best possible performance while maintaining computational efficiency. Algorithm 3 presents the process for determining the optimal weights for each model in the selected expert set M_E. The process begins by defining the objective function, which is set to maximize the Geometric Mean (GM) score [22,23] on the validation dataset D_val (Step 1). This ensures that the weights are chosen in a way that optimizes the performance of the ensemble with respect to the validation data. In Step 2, the optimization is performed under two main constraints: (i) the sum of the weights w_i of the experts must equal 1, namely Σ w_i = 1, ensuring that the total weight distribution is normalized; and (ii) each individual weight w_i is constrained to lie between 0 and 1, that is 0 ≤ w_i ≤ 1, so that no model receives a negative or excessively large weight. In Step 3, to calculate the optimal set of weights, the algorithm employs Powell’s derivative-free method [24], which is particularly well suited for optimization problems where derivatives of the objective function are unavailable or difficult to compute. Finally, in Step 4, after the optimal weights {w_i*} are determined, these weights, together with the corresponding experts in M_E, are used to define the proposed TextNeX ensemble model.
At this point, it is worth noting that we prioritize a weighted average approach over stacking to effectively leverage the individual predictions of base learners. This decision is primarily driven by the need to mitigate overfitting on validation data and training instability, while also considering computational efficiency. Notice that stacking requires partitioning the training data to generate out-of-fold predictions for the meta-learner [10], introducing a trade-off between base learner performance and robust meta-learning. However, as the meta-learner is trained on validation predictions, it is particularly prone to overfitting, especially when dealing with limited datasets or highly correlated base models [10].
Algorithm 3: Ensemble Optimization: Generation of Optimal Weights
  • Inputs:
  •   M_E: set of selected models (experts)
  •   D_val: validation dataset
  • Output:
  •   TextNeX model
1: Set the objective function to maximize the GM score on D_val
2: Define the optimization constraints: Σ w_i = 1 and 0 ≤ w_i ≤ 1
3: Apply Powell’s derivative-free method to obtain the optimal weights {w_i*}
4: Construct the TextNeX ensemble model by aggregating the weighted outputs of the selected models in M_E, using the optimal weights {w_i*}
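A minimal sketch of this weight-optimization step is shown below, using SciPy’s Powell solver. Since that routine accepts bound constraints but not the equality constraint Σ w_i = 1, the weights are renormalized inside the objective, a practical simplification of the formulation above; the synthetic expert probabilities, the GM implementation and the 0.5 decision threshold are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of the Ensemble-optimization phase (Algorithm 3) with SciPy's Powell method.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_experts, n_val = 3, 500

y_val = rng.integers(0, 2, n_val)  # validation labels
# Synthetic positive-class probabilities standing in for the selected experts' outputs on D_val.
expert_probs = np.clip(y_val + rng.normal(0, 0.4, (n_experts, n_val)), 0, 1)

def geometric_mean(y_true, y_pred):
    # GM = sqrt(sensitivity * specificity) for a binary task.
    tp = np.sum((y_pred == 1) & (y_true == 1)); fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0)); fp = np.sum((y_pred == 1) & (y_true == 0))
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return np.sqrt(sens * spec)

def negative_gm(w):
    w = np.clip(w, 0, 1)
    w = w / w.sum() if w.sum() > 0 else np.full_like(w, 1 / len(w))  # enforce sum-to-one
    ensemble_probs = w @ expert_probs                                # weighted average of experts
    return -geometric_mean(y_val, (ensemble_probs >= 0.5).astype(int))

w0 = np.full(n_experts, 1 / n_experts)
result = minimize(negative_gm, w0, method="Powell", bounds=[(0, 1)] * n_experts)
w_opt = np.clip(result.x, 0, 1); w_opt /= w_opt.sum()
print("Optimal expert weights:", np.round(w_opt, 3), "GM:", -result.fun)
```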

4. Experimental Analysis

Next, we present an extensive experimental analysis, which includes a comparison study of the proposed TextNeX model against state-of-the-art models as well as an ablation study to investigate the impact of different model selection approaches within clusters on the final ensemble performance. The implementation code is available at https://github.com/EmmanuelPintelas/TextNeX (accessed on 10 April 2025).

4.1. Experimental Setup

The numerical experiments were conducted using three datasets from the area of MGT detection:
  • AuTexTification shared task (IberLEF 2023) [11]: This dataset focuses on AI-generated content detection and includes both human-written and machine-generated texts from different domains (tweets, legal documents, news articles, among others). The AI-generated texts were produced by six different models, among them BLOOM-7B1 and OpenAI’s text-davinci-003. The dataset is bilingual (English and Spanish) and comprises a training set of 33,845 samples (17,046 human/16,799 generated), as well as a test set of 21,832 samples (10,642 human/11,190 generated). Although the dataset is bilingual, only the English version was used in our experiments.
  • TweepFake—Twitter deep Fake text Dataset [1]: This dataset concerns both human and machine-generated tweets scraped directly from the Twitter API, with the AI-generated content being created by models, such as GPT-2, LSTMs or Markov Chains. It is divided into a training set of 20,712 samples (10,358 human/10,354 generated), a validation set of 2302 samples (1150 human/1152 generated) and a testing set of 2558 samples (1278 human/1280 generated).
  • AI Text Detection Pile [25]: This dataset is developed for AI-generated Text Detection tasks and it is mainly constituted by longer texts, such as reviews and essays from sources such as Reddit, OpenAI Webtext, Twitter and HC3. It contains 1,339,000 samples (990,000 human/340,000 generated) while the machine generated texts were generated from OpenAI’s GPT models, i.e., GPT2, GPTJ and ChatGPT (GPT-3.5-Turbo).
At this point, we recall that in this work we focused on MGT detection benchmarks to address the growing challenge of distinguishing machine-generated content from human-written text, which is critical for many real-world applications [9,26]. These selected benchmark datasets provide diverse and complex test cases, which challenge the robustness and efficiency of text classification models.
The numerical experiments include the performance evaluation of the following text classification models:
  • BERT [3]: A transformer-based model widely used for text classification.
  • RoBERTa [4]: An optimized variant of BERT with improved performance.
  • DeBERTa [5]: A transformer model that improves context understanding with disentangled attention.
  • DistilBERT [15]: A smaller, faster and lighter version of BERT, retaining most of its performance.
  • XLNet [6]: A generalized autoregressive pretraining method for text classification.
  • Majority Voting [9]: An ensemble model which uses RoBERTa, ELECTRA and XLNet as base learners; the final prediction is obtained by majority voting.
  • Soft Voting [1]: An ensemble model which uses BLOOM-560m, ErnieM and DeBERTaV3 as base learners; the final prediction is obtained by probability-weighted voting.
  • Stacking (SVC) [8]: An ensemble model combining the predictions of BERT, BART, RoBERTa and GPT-2 using a Support Vector Classifier (SVC) as the meta-learner.
  • Stacking (Voting) [2]: An ensemble model combining the predictions of BERT, RoBERTa, XLM-RoBERTa and DeBERTa using an ensemble voting classifier (Logistic Regression, Random Forest, Gaussian Naive Bayes and SVC) as the meta-learner.
  • Stacking (XGBoost) [11]: An ensemble model which uses XLM-RoBERTa, TwHIN-BERT and multilingual BERT as base learners and an XGBoost model as the meta-learner.
  • TextNeX: The proposed model, which consists of a heterogeneous ensemble of lightweight text models, utilizing a clustering-based selection process to maximize diversity.
All models were trained using the Adam optimizer [27] with an initial learning rate of 10^-4, together with a Reduce-on-Plateau scheduler [28], which reduces the learning rate by a factor of 0.7 after five epochs without improvement. To prevent overfitting, early stopping [29] was applied, terminating training if the validation performance did not improve for ten consecutive epochs.
The experiments were conducted on an NVIDIA RTX 3090 GPU (NVIDIA, Santa Clara, CA, USA) (24GB VRAM) and an AMD Ryzen 9 5950X CPU (AMD, Santa Clara, CA, USA) with 64GB RAM. Finally, the performance comparison of all models was evaluated using three classification metrics: Accuracy (Acc), which measures the overall correctness of the model’s predictions; Area Under the Curve (AUC), which assesses the model’s ability to distinguish between different classes across all decision thresholds; and Geometric Mean (GM), which balances sensitivity and specificity, making it particularly useful for imbalanced datasets [22,23].
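For clarity, the snippet below sketches this shared training configuration in PyTorch (Adam with a learning rate of 10^-4, Reduce-on-Plateau with factor 0.7 and patience 5, and early stopping with patience 10). The classifier head and the validation-score placeholder are illustrative stand-ins for the actual fine-tuning loop rather than the authors’ code.

```python
# Sketch of the training configuration: Adam (lr=1e-4), ReduceLROnPlateau (factor 0.7,
# patience 5) and early stopping (patience 10). Model and validation loop are placeholders.
import torch
from torch import nn, optim

model = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 2))  # placeholder head
optimizer = optim.Adam(model.parameters(), lr=1e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", factor=0.7, patience=5)

best_score, patience, epochs_no_improve = -float("inf"), 10, 0
for epoch in range(100):
    # train_one_epoch(model, optimizer) would go here
    val_score = 0.5 + 0.01 * epoch if epoch < 20 else 0.7  # placeholder validation metric
    scheduler.step(val_score)            # lowers the LR after 5 epochs without improvement
    if val_score > best_score:
        best_score, epochs_no_improve = val_score, 0
    else:
        epochs_no_improve += 1
    if epochs_no_improve >= patience:    # early stopping after 10 stagnant epochs
        print(f"Early stopping at epoch {epoch}, best validation score {best_score:.3f}")
        break
```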

4.2. Experimental Results

Table 1 summarizes the classification performance of all evaluated models across the three benchmark datasets. Notice that the best performance for each metric and dataset is highlighted in bold. The results indicate that the proposed TextNeX model consistently outperforms both single transformer-based models and ensemble-based models. Specifically, TextNeX achieves the highest Acc and GM across all datasets, and the best AUC score in AuTexTification and AI Text Detection Pile benchmarks. Compared to traditional ensemble methods, TextNeX considerably outperforms both voting-based models (Majority Voting and Soft Voting) while it provides competitive AUC performance and exhibits the highest GM score compared to all Stacking-based approaches. Based on the provided experiments, we conclude that TextNeX effectively balances predictive performance and computational efficiency, demonstrating the advantage of the proposed heterogeneous-based approach.
Next, for validating the effectiveness of the proposed TextNeX model, we conduct a statistical analysis to evaluate the hypothesis H_0 that all models perform equally. For this purpose, we employ the non-parametric Friedman Aligned-Ranks (FAR) test [30] to rank the models and the Finner post-hoc test [31] (with statistical significance α = 5%) in order to identify significant performance differences, without assuming any distribution of the performance scores [32].
Table 2, Table 3 and Table 4 present the statistical analysis for the Acc, AUC and GM metrics, respectively. TextNeX achieves the highest FAR ranking, demonstrating its superior performance, while the Finner post-hoc test rejects H_0 for all comparisons involving TextNeX with respect to the Acc and GM metrics, and for all but one comparison (Stacking (Voting)) with respect to AUC, since the corresponding p-values are below the significance threshold α.
In addition to the predictive performance, we evaluated the computational efficiency of TextNeX compared to the SoA baseline models on the AuTexTification dataset. For single models, inference time was measured per sample in a batch size of 32, capturing memory usage and processing time under realistic conditions. For ensemble models, we differentiate between execution methods:
  • Voting-based ensembles (e.g., majority voting, averaging) run models in parallel, meaning the total inference time is determined by the slowest model in the ensemble. However, the trade-off of parallel execution is reduced memory efficiency, as memory usage is computed as the sum of the memory requirements of all base learners in the ensemble.
  • Stacking-based ensembles, where outputs from multiple models are passed to a meta-learner sequentially, result in a total inference time equal to the sum of the base learners’ times. However, the advantage of sequential execution lies in its memory efficiency, as memory usage is determined by the model in the ensemble with the highest memory requirement. This accounting is illustrated by the short sketch below.
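The toy computation below makes this accounting explicit, using the per-model measurements of three single models from Table 5 purely for illustration (the ensembles reported in Table 5 use different member sets).

```python
# Toy illustration of the cost accounting described above: parallel voting ensembles are
# bounded by their slowest member in latency but pay the summed memory of all members,
# whereas sequential stacking sums latencies but only needs the largest member in memory.
base_learners = {"BERT": {"time_ms": 12.4, "mem_mb": 444},
                 "DistilBERT": {"time_ms": 7.2, "mem_mb": 280},
                 "XLNet": {"time_ms": 19.8, "mem_mb": 499}}

voting_time = max(m["time_ms"] for m in base_learners.values())    # parallel execution
voting_mem = sum(m["mem_mb"] for m in base_learners.values())

stacking_time = sum(m["time_ms"] for m in base_learners.values())  # sequential execution
stacking_mem = max(m["mem_mb"] for m in base_learners.values())

print(f"Voting:   {voting_time:.1f} ms, {voting_mem} MB")
print(f"Stacking: {stacking_time:.1f} ms, {stacking_mem} MB")
```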
Table 5 presents the inference time and memory usage for each model on the AuTexTification dataset. The results demonstrate that TextNeX achieves a significant reduction in inference time and memory usage compared to transformer-based ensemble models. Furthermore, TextNeX is comparable with single transformer-based models as regards both inference time and memory requirements. Consequently, we are able to conclude that by leveraging lightweight models and a heterogeneity-driven selection process, TextNeX provides a scalable alternative to large transformer ensembles, making it particularly suitable for real-world applications where computational resources are of major importance.
Summarizing the previous discussion, we highlight that TextNeX is able to outperform traditional ensemble methods in predictive performance and be competitive with single transformer-based models relative to computational requirements. These findings demonstrate its ability to achieve high accuracy while maintaining low inference time and memory usage, which makes it a compelling alternative for real-world applications, especially for resource-constrained environments. Furthermore, the efficiency of TextNeX enables seamless deployment in large-scale settings, where traditional ensemble approaches may be impractical due to their high computational overhead.

4.3. Ablation Study

In the sequel, we conduct an ablation study to evaluate the impact of different model selection approaches within clusters on the final ensemble performance. Specifically, we compare the Best-Valid-based Selection strategy, which selects the model with the highest validation performance within clusters, with the Centroid-based Selection (proposed), which emphasizes model heterogeneity by selecting the model closest to the centroid of the cluster.
Table 6 presents the comparative results for both selection approaches across the three benchmark datasets, regarding both validation and testing datasets. The results demonstrate that the Best-Valid-based Selection achieves higher validation scores for individual models; however, its test performance does not always reflect these high validation scores, indicating a tendency towards validation overfitting. This is particularly evident in the AuTexTification dataset, where models selected based on validation scores exhibit a significant drop in test performance, with the ensemble scoring GM = 0.753 on the test set, despite achieving GM = 0.890 on validation.
In contrast, the Centroid-based Selection consistently achieves better generalization across all datasets. By prioritizing heterogeneity within the selected models, this approach mitigates validation overfitting and enhances robustness. Notably, in the TweepFake dataset, the centroid-based ensemble achieves GM = 0.940, compared to 0.935 for the best validation-based ensemble. Similarly, for the AI Text Detection Pile dataset, the centroid-based approach improves test performance, scoring GM = 0.850, compared to 0.839 for the validation-based ensemble.
These findings confirm that relying solely on validation scores for model selection within clusters can lead to suboptimal generalization. Instead, the centroid-based approach effectively balances diversity and complementarity, leading to more accurate predictions. This result aligns with the motivation of the proposed model, emphasizing the importance of heterogeneous expert selection in text classification tasks.

4.4. Discussion of Research Objectives and Limitations

The main objective of this study was to design a robust, accurate and computationally efficient text classification framework that addresses the growing need for machine-generated text detection, particularly in resource-constrained environments. To achieve this, we proposed TextNeX, a three-phase ensemble methodology leveraging lightweight transformers, clustering-based model selection, and derivative-free ensemble optimization. The experimental evaluation across three challenging MGT detection datasets confirms that this objective was successfully achieved, as TextNeX demonstrated superior accuracy, generalization, and significantly reduced computational overhead compared to both single and ensemble-based state-of-the-art models.
Despite the promising results, there are certain limitations and potential threats to the validity of our study. First, the effectiveness of the clustering-based selection process heavily depends on the diversity of the initial model pool. If the expansion phase fails to generate sufficiently heterogeneous models, the selection process may converge on redundant or suboptimal experts. Second, although UMAP and GMM offer strong clustering capabilities, their sensitivity to parameter tuning may affect reproducibility across different datasets or domains. To mitigate this, we employed Bayesian optimization to systematically fine-tune clustering parameters. Third, while the current work focuses on English text, the performance of TextNeX on multilingual or domain-specific corpora remains an open question. Lastly, although we demonstrated improved generalization via centroid-based selection, further validation on imbalanced or adversarial datasets would strengthen the conclusions.
Future work will address these limitations by exploring adaptive model expansion strategies, evaluating the framework on multilingual benchmarks, and integrating model robustness as an explicit selection criterion during clustering. These extensions will help generalize the applicability of TextNeX across broader NLP tasks and deployment scenarios.

5. Conclusions

In this study, we proposed TextNeX, a new ensemble framework designed for efficient and robust text classification, with a particular focus on machine-generated text (MGT) detection. By leveraging a pool of lightweight transformer-based models, a clustering-based expert selection strategy, and a derivative-free ensemble optimization procedure, TextNeX effectively addresses the trade-off between accuracy and computational efficiency, two major limitations of existing large-scale transformer ensembles.
The proposed three-phase methodology (Expansion, Selection, Optimization) enabled the construction of a highly diverse and complementary ensemble of models. Our extensive experimental evaluation on three challenging MGT datasets demonstrated that TextNeX consistently outperformed both individual transformer models and existing ensemble techniques in terms of predictive performance, generalization ability, and computational cost. Additionally, our ablation study highlighted the benefits of centroid-based selection for enhancing generalization and mitigating validation overfitting.
To further enhance the applicability and scalability of TextNeX, several directions for future research can be explored. First, we aim to extend the model to multilingual and domain-specific datasets, assessing its adaptability to different linguistic and contextual settings. Second, integrating alternative clustering methods and expert selection criteria, such as robustness or interpretability, could strengthen the ensemble’s reliability. Third, we plan to investigate lightweight stacking mechanisms as an additional ensemble layer to further boost predictive performance without compromising efficiency. Lastly, we are interested in evaluating TextNeX on adversarially perturbed and low-resource text datasets to examine its robustness under challenging real-world scenarios.

Author Contributions

Conceptualization, E.P. and I.E.L.; methodology, E.P. and I.E.L.; software, E.P.; validation, E.P. and I.E.L.; formal analysis, E.P. and I.E.L.; investigation, E.P.; resources, E.P.; data curation, E.P.; writing, original draft preparation, E.P., A.K. and I.E.L.; writing, review and editing, E.P., A.K., I.E.L. and V.T.; visualization, I.E.L.; supervision, I.E.L. and V.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data will be made available upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sheykhlan, M.K.; Abdoljabbar, S.K.; Mahmoudabad, M.N. KaramiTeam at IberAuTexTification: Soft Voting Ensemble for Distinguishing AI-Generated Texts. In CEUR Workshop Proceedings, Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024) Co-Located with the Conference of the Spanish Society for Natural Language Processing (SEPLN 2024), Valladolid, Spain, 24 September 2024; Jiménez-Zafra, S.M., Chiruzzo, L., Rangel, F., Balouchzahi, F., Corrêa, U.B., Bonet-Jover, A., Gómez-Adorno, H., Barba, J.Á.G., Farías, D.I.H., Montejo-Ráez, A., et al., Eds.; CEUR-WS: Aachen, Germany, 2024; Volume 3756. [Google Scholar]
  2. Abburi, H.; Suesserman, M.; Pudota, N.; Veeramani, B.; Bowen, E.; Bhattacharya, S. Generative AI text classification using ensemble llm approaches. arXiv 2023, arXiv:2309.07755. [Google Scholar]
  3. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; Volume 1. [Google Scholar]
  4. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  5. He, P.; Liu, X.; Gao, J.; Chen, W. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. In Proceedings of the 9th International Conference on Learning Representations, Virtual Event, 3–7 May 2021. [Google Scholar]
  6. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates, Inc.: New York, NY, USA, 2019; Volume 32, pp. 5754–5764. [Google Scholar]
  7. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv 2020, arXiv:1910.10683. [Google Scholar]
  8. Gambini, M.; Fagni, T.; Falchi, F.; Tesconi, M. On pushing deepfake tweet detection capabilities to the limits. In Proceedings of the 14th ACM Web Science Conference 2022, Barcelona, Spain, 26–29 June 2022; pp. 154–163. [Google Scholar]
  9. Mikros, G.K.; Koursaris, A.; Bilianos, D.; Markopoulos, G. AI-Writing Detection Using an Ensemble of Transformers and Stylometric Features. In Proceedings of the IberLEF@ SEPLN, Andalusia, Spain, 26 September 2023. [Google Scholar]
  10. Sarmah, U.; Borah, P.; Bhattacharyya, D.K. Ensemble Learning Methods: An Empirical Study. SN Comput. Sci. 2024, 5, 924. [Google Scholar] [CrossRef]
  11. Preda, A.A.; Cercel, D.C.; Rebedea, T.; Chiru, C.G. UPB at IberLEF-2023 AuTexTification: Detection of Machine-Generated Text using Transformer Ensembles. arXiv 2023, arXiv:2308.01408. [Google Scholar]
  12. Larson, J.; Menickelly, M.; Wild, S.M. Derivative-free optimization methods. Acta Numer. 2019, 28, 287–404. [Google Scholar] [CrossRef]
  13. Crothers, E.N.; Japkowicz, N.; Viktor, H.L. Machine-generated text: A comprehensive survey of threat models and detection methods. IEEE Access 2023, 11, 70977–71002. [Google Scholar] [CrossRef]
  14. Domingo, J.D.; Aparicio, R.M.; Rodrigo, L.M.G. Cross validation voting for improving CNN classification in grocery products. IEEE Access 2022, 10, 20913–20925. [Google Scholar] [CrossRef]
  15. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
  16. Wang, W.; Wei, F.; Dong, L.; Bao, H.; Yang, N.; Zhou, M. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Adv. Neural Inf. Process. Syst. 2020, 33, 5776–5788. [Google Scholar]
  17. Sun, Z.; Yu, H.; Song, X.; Liu, R.; Yang, Y.; Zhou, D. Mobilebert: A compact task-agnostic bert for resource-limited devices. arXiv 2020, arXiv:2004.02984. [Google Scholar]
  18. Healy, J.; McInnes, L. Uniform manifold approximation and projection. Nat. Rev. Methods Prim. 2024, 4, 82. [Google Scholar] [CrossRef]
  19. Zhang, Y.; Li, M.; Wang, S.; Dai, S.; Luo, L.; Zhu, E.; Xu, H.; Zhu, X.; Yao, C.; Zhou, H. Gaussian mixture model clustering with incomplete data. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2021, 17, 1–14. [Google Scholar] [CrossRef]
  20. Patel, E.; Kushwaha, D.S. Clustering cloud workloads: k-means vs gaussian mixture model. Procedia Comput. Sci. 2020, 171, 158–167. [Google Scholar] [CrossRef]
  21. Victoria, A.H.; Maragatham, G. Automatic tuning of hyperparameters using Bayesian optimization. Evol. Syst. 2021, 12, 217–223. [Google Scholar] [CrossRef]
  22. Livieris, I.E. A novel forecasting strategy for improving the performance of deep learning models. Expert Syst. Appl. 2023, 230, 120632. [Google Scholar] [CrossRef]
  23. Naidu, G.; Zuva, T.; Sibanda, E.M. A review of evaluation metrics in machine learning algorithms. In Artificial Intelligence Application in Networks and Systems, Proceedings of 12th Computer Science On-line Conference 2023; Springer: Berlin/Heidelberg, Germany, 2023; Volume 3, pp. 15–25. [Google Scholar]
  24. Ragonneau, T.M.; Zhang, Z. PDFO: A cross-platform package for Powell’s derivative-free optimization solvers. arXiv 2024, arXiv:2302.13246. [Google Scholar] [CrossRef]
  25. artem9k. AI Text Detection Pile. Available online: https://huggingface.co/datasets/artem9k/ai-text-detection-pile (accessed on 22 January 2024).
  26. He, X.; Shen, X.; Chen, Z.; Backes, M.; Zhang, Y. Mgtbench: Benchmarking machine-generated text detection. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, Salt Lake City, UT, USA, 14–18 October 2024; pp. 2251–2265. [Google Scholar]
  27. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  28. Smith, L.N. Cyclical learning rates for training neural networks. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 27–29 March 2017; pp. 464–472. [Google Scholar]
  29. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  30. Hodges, J.L., Jr.; Lehmann, E.L. Rank methods for combination of independent experiments in analysis of variance. In Selected Works of E.L. Lehmann; Springer: Berlin/Heidelberg, Germany, 2011; pp. 403–418. [Google Scholar]
  31. Juarros-Basterretxea, J.; Aonso-Diego, G.; Postigo, Á.; Montes-Álvarez, P.; Menéndez-Aller, Á.; García-Cueto, E. Post-hoc tests in one-way ANOVA: The case for normal distribution. Methodology 2024, 20, 84–99. [Google Scholar] [CrossRef]
  32. Kiriakidou, N.; Livieris, I.E.; Pintelas, P. Mutual information-based neighbor selection method for causal effect estimation. Neural Comput. Appl. 2024, 36, 9141–9155. [Google Scholar] [CrossRef]
Figure 1. High-level description of the process for the development of TextNex model. The process consists of three phases: (i) Expansion, where diverse lightweight text networks are generated through randomized training configurations; (ii) Selection, where a clustering-based selection process identifies the most heterogeneous and complementary models (experts) and (iii) Ensemble optimization, where the selected experts are combined using Sequential Quadratic Programming (SQP) to optimize their contributions to the final ensemble.
Table 1. Performance Comparison of Text Classification Models. The best performance for each metric and dataset is highlighted in bold.
Model               | AuTexTification (Acc / AUC / GM) | TweepFake (Acc / AUC / GM) | AI Text Detection Pile (Acc / AUC / GM)
BERT                | 0.652 / 0.741 / 0.645 | 0.891 / 0.905 / 0.884 | 0.865 / 0.901 / 0.798
RoBERTa             | 0.675 / 0.765 / 0.670 | 0.896 / 0.912 / 0.890 | 0.877 / 0.909 / 0.812
DeBERTa             | 0.690 / 0.780 / 0.685 | 0.902 / 0.918 / 0.898 | 0.879 / 0.914 / 0.818
DistilBERT          | 0.731 / 0.805 / 0.725 | 0.887 / 0.899 / 0.876 | 0.851 / 0.892 / 0.780
XLNet               | 0.660 / 0.749 / 0.655 | 0.877 / 0.892 / 0.869 | 0.859 / 0.903 / 0.795
Majority Voting     | 0.725 / 0.810 / 0.720 | 0.914 / 0.926 / 0.910 | 0.890 / 0.924 / 0.825
Soft Voting         | 0.746 / 0.825 / 0.735 | 0.921 / 0.934 / 0.918 | 0.894 / 0.926 / 0.832
Stacking (SVC)      | 0.755 / 0.835 / 0.750 | 0.926 / 0.942 / 0.923 | 0.899 / 0.928 / 0.835
Stacking (Voting)   | 0.760 / 0.840 / 0.755 | 0.931 / 0.951 / 0.929 | 0.902 / 0.930 / 0.840
Stacking (XGBoost)  | 0.735 / 0.815 / 0.733 | 0.929 / 0.943 / 0.926 | 0.896 / 0.924 / 0.828
TextNeX             | 0.775 / 0.848 / 0.772 | 0.943 / 0.948 / 0.940 | 0.908 / 0.935 / 0.850
Table 2. Statistical analysis: FAR and Finner post-hoc test results relative to the Acc metric.
Model               | FAR Ranking | Finner p-Value | H_0
TextNeX             | 3.67        | -              | -
Stacking (Voting)   | 6.00        | 0.01963        | Rejected
Stacking (SVC)      | 8.67        | 6.37·10^-7     | Rejected
Stacking (XGBoost)  | 10.67       | 3.2·10^-12     | Rejected
Soft Voting         | 12.00       | 0.0            | Rejected
Majority Voting     | 18.00       | 0.0            | Rejected
DistilBERT          | 22.67       | 0.0            | Rejected
DeBERTa             | 23.33       | 0.0            | Rejected
RoBERTa             | 25.00       | 0.0            | Rejected
BERT                | 27.33       | 0.0            | Rejected
XLNet               | 29.67       | 0.0            | Rejected
Table 3. Statistical analysis: FAR and Finner post-hoc test results relative to the AUC metric.
Model               | FAR Ranking | Finner p-Value | H_0
TextNeX             | 5.00        | -              | -
Stacking (Voting)   | 5.67        | 0.50499        | Failed to reject
Stacking (SVC)      | 8.00        | 0.00299        | Rejected
Soft Voting         | 10.67       | 1.82·10^-8     | Rejected
Stacking (XGBoost)  | 11.17       | 9.96·10^-10    | Rejected
Majority Voting     | 16.83       | 0.0            | Rejected
DeBERTa             | 22.67       | 0.0            | Rejected
DistilBERT          | 25.00       | 0.0            | Rejected
RoBERTa             | 25.33       | 0.0            | Rejected
BERT                | 28.00       | 0.0            | Rejected
XLNet               | 28.67       | 0.0            | Rejected
Table 4. Statistical analysis: FAR and Finner post-hoc test results relative to the GM metric.
Model               | FAR Ranking | Finner p-Value | H_0
TextNeX             | 3.33        | -              | -
Stacking (Voting)   | 5.33        | 0.04550        | Rejected
Stacking (SVC)      | 8.67        | 1.07·10^-7     | Rejected
Soft Voting         | 11.33       | 1.67·10^-15    | Rejected
Stacking (XGBoost)  | 11.67       | 0.0            | Rejected
Majority Voting     | 18.00       | 0.0            | Rejected
DeBERTa             | 23.00       | 0.0            | Rejected
DistilBERT          | 24.33       | 0.0            | Rejected
RoBERTa             | 25.00       | 0.0            | Rejected
BERT                | 27.33       | 0.0            | Rejected
XLNet               | 29.00       | 0.0            | Rejected
Table 5. Computational Efficiency Comparison on the AuTexTification dataset.
Model                          | Inference Time (ms) | Memory Usage (MB)
BERT [3]                       | 12.4                | 444
RoBERTa [4]                    | 14.1                | 502
DeBERTa [5]                    | 16.5                | 580
DistilBERT [15]                | 7.2                 | 280
XLNet [6]                      | 19.8                | 499
MobileBERT [17]                | 5.8                 | 111
MiniLM [16]                    | 4.9                 | 142
Majority Voting Ensemble [9]   | 19.8                | 1445
Soft Voting Ensemble [1]       | 50.0                | 4302
Stacking Ensemble (SVC) [8]    | 60.4                | 870
Stacking Ensemble (Voting) [2] | 60.1                | 1086
Stacking (XGBoost) [11]        | 17.1                | 2292
TextNeX                        | 7.2                 | 534
Table 6. Ablation Study: Comparison of model selection approaches in cluster-based ensemble construction (Algorithm 2). Best validation-based selection focuses on validation performance within clusters, while centroid-based selection prioritizes heterogeneity across clusters. Generalization performance is assessed using the GM metric on validation (Valid) and test (Test) splits.
Selection Approach (Algorithm 2)    | AuTexTification (Model / Valid / Test) | TweepFake (Model / Valid / Test) | AI Text Detection Pile (Model / Valid / Test)
Best-Valid-based Selection          | T_15 / 0.858 / 0.782     | T_30 / 0.953 / 0.938     | T_7 / 0.910 / 0.818
                                    | T_42 / 0.845 / 0.684     | T_25 / 0.958 / 0.944     | T_34 / 0.894 / 0.830
                                    | T_9 / 0.832 / 0.729      | T_11 / 0.956 / 0.902     | T_21 / 0.880 / 0.835
                                    | Ensemble / 0.890 / 0.753 | Ensemble / 0.970 / 0.935 | Ensemble / 0.926 / 0.839
Centroid-based Selection (proposed) | T_12 / 0.830 / 0.735     | T_27 / 0.945 / 0.943     | T_4 / 0.882 / 0.824
                                    | T_38 / 0.828 / 0.748     | T_19 / 0.944 / 0.929     | T_29 / 0.890 / 0.835
                                    | T_5 / 0.815 / 0.752      | T_8 / 0.895 / 0.936      | T_18 / 0.873 / 0.828
                                    | Ensemble / 0.855 / 0.772 | Ensemble / 0.965 / 0.940 | Ensemble / 0.905 / 0.850