Integrated Survey Classification and Trend Analysis via LLMs: An Ensemble Approach for Robust Literature Synthesis
Abstract
1. Introduction
1.1. Research Questions and Hypotheses
- RQ1 (Multi-Domain Generalizability): How can ensemble learning techniques be effectively integrated with LLMs to create a framework that generalizes across different scientific domains while maintaining high classification accuracy and reliability?
- RQ2 (Methodological Robustness): What are the critical components of a robust methodological framework, including data preprocessing, prompt engineering, ensemble voting strategies, and evaluation metrics, that collectively optimize the accuracy and reliability of automated survey synthesis?
- RQ3 (Practical Effectiveness): Can the proposed ensemble methodology demonstrate superior performance compared to traditional single-model approaches across multiple domains, while providing quantifiable benefits in terms of cost-effectiveness and time savings?
1.2. Main Contributions
- (1) Confidence Calibration Framework: A non-parametric confidence calibration system for heterogeneous LLMs based on isotonic regression, which maps each model's raw confidence score to a calibrated probability $\hat{c}_m = f_m(c_m)$, where $f_m$ is the calibration function of model $m$. The approach achieves an 8.4% improvement over simple majority voting, and empirical validation shows consistent improvements across domains (QA: 10.0%, CV: 10.9%).
- (2) Adaptive Thresholding Mechanism: A dynamic threshold adjustment system that tunes the acceptance threshold according to inter-model agreement patterns and historical performance. This mechanism achieves a 5.2% improvement over fixed-threshold ensembles and a 3.8% improvement in consistency across domains.
- (3) Historical Performance Tie-Breaking: A tie-breaking strategy that leverages domain-specific model performance, scoring each tied category by the historical accuracies $a_i$ of the models that voted for it. This approach achieves 85.3% tie-resolution accuracy, compared to 61.2% for random tie-breaking.
- (4) Multi-Domain Validation Framework: A cross-domain generalizability assessment methodology that ensures consistent performance across different scientific domains. Our framework demonstrates 94% consistency between the QA and CV domains, with a generalizability score of 0.94 and a domain difference of only 0.9 percentage points.
1.3. Paper Organization
2. Related Work
2.1. Automated Literature Analysis and Synthesis
2.2. Ensemble Learning in Natural Language Processing
2.3. Large Language Models for Document Classification
2.4. Ensemble Methods for Large Language Models
2.5. Evaluation Methodologies for Ensemble Systems
2.6. Cost–Benefit Analysis of Automated Literature Systems
2.7. Gaps in Current Research
3. Methodology
3.1. Framework Overview
3.2. Data Acquisition and Preprocessing
3.2.1. Dataset Construction
3.2.2. Preprocessing Pipeline
3.3. Individual LLM Specifications
3.3.1. GPT-4
- Model Version: GPT-4 (2023 training cutoff)
- Temperature: 0.2 (to ensure consistent outputs while maintaining some flexibility)
- Max Tokens: 1024 (sufficient for comprehensive classification responses)
- Top-p: 0.95 (to maintain output diversity while ensuring coherence)
3.3.2. LLaMA 3.3
- Model Version: LLaMA 3.3 (70B parameter version)
- Temperature: 0.2 (consistent with GPT-4 configuration)
- Max Tokens: 1024 (consistent with GPT-4 configuration)
- Top-p: 0.95 (consistent with GPT-4 configuration)
3.3.3. Claude 3
- Model Version: Claude 3 Opus
- Temperature: 0.2 (consistent with other models)
- Max Tokens: 1024 (consistent with other models)
- Top-p: 0.95 (consistent with other models)
3.4. Prompt Engineering
3.4.1. Prompt Structure
- Task Definition: A clear statement of the classification task, including the specific domain (QA or CV) and the categories to be considered.
- Paper Information: The title and abstract of the paper to be classified, along with relevant metadata.
- Classification Instructions: Specific instructions for how to analyze the paper and assign it to the most appropriate category.
- Output Format: A structured format for the response, including the predicted category, confidence score, and brief justification.
- Reasoning Request: An explicit request for the model to explain its reasoning process before making a final classification decision.
- Knowledge-Based QA: Systems that answer questions by querying structured KBs
- Reading Comprehension: Systems that extract answers from provided text passages
- Open-Domain QA: Systems that answer questions across unrestricted topics
- Conversational QA: Systems that maintain dialogue context for interactive QA
- Visual QA: Systems that answer questions about images or visual content
- Specialized QA: Domain-specific QA systems (medical, legal, etc.)
- Analyze the paper’s title and abstract carefully
- Determine which category best describes the paper’s primary focus
- Provide a confidence score (0-100%) for your classification
- Briefly explain your reasoning
- Output Format:
- Category: [Selected category]
- Confidence: [0–100%]
- Reasoning: [Brief explanation of classification decision]
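As an illustration, a prompt following this structure can be assembled programmatically. The sketch below is indicative only: the `build_prompt` helper and the exact wording of the template are not the study's verbatim prompt.

```python
def build_prompt(title: str, abstract: str, domain: str, categories: list[str]) -> str:
    """Assemble a classification prompt: task definition, paper information,
    instructions, output format, and an explicit reasoning request."""
    category_list = "\n".join(f"- {c}" for c in categories)
    return (
        f"Task: Classify the following {domain} survey paper into exactly one of "
        f"these categories:\n{category_list}\n\n"
        f"Title: {title}\nAbstract: {abstract}\n\n"
        "Instructions: Analyze the title and abstract, determine which category best "
        "describes the paper's primary focus, provide a confidence score (0-100%), "
        "and briefly explain your reasoning before giving the final answer.\n\n"
        "Output format:\n"
        "Category: [Selected category]\n"
        "Confidence: [0-100%]\n"
        "Reasoning: [Brief explanation]"
    )

qa_categories = [
    "Knowledge-Based QA", "Reading Comprehension", "Open-Domain QA",
    "Conversational QA", "Visual QA", "Specialized QA",
]
prompt = build_prompt("Example title", "Example abstract ...", "QA", qa_categories)
```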
3.4.2. Prompt Refinement Process
3.5. Ensemble Voting Mechanism
3.5.1. Basic Voting Procedure
- Each model (GPT-4, LLaMA 3.3, and Claude 3) independently classifies the paper and provides a predicted category and confidence score.
- The predictions are aggregated using a weighted voting scheme, where each model’s vote is weighted by its confidence score.
- The category with the highest weighted vote total is selected as the ensemble prediction.
- In case of ties, a tie-breaking strategy is applied based on historical model performance.
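A minimal sketch of this confidence-weighted vote is shown below; the model names and scores are illustrative, and the confidences are assumed to be already calibrated to the [0, 1] range (see Section 3.5.2).

```python
from collections import defaultdict

def weighted_vote(predictions):
    """Aggregate per-model predictions into a weighted vote.

    predictions: dict mapping model name -> (category, calibrated confidence in [0, 1]).
    Returns the winning category, or a list of tied categories for downstream tie-breaking.
    """
    totals = defaultdict(float)
    for category, confidence in predictions.values():
        totals[category] += confidence              # each vote weighted by its confidence
    best = max(totals.values())
    winners = [c for c, w in totals.items() if w == best]
    return winners[0] if len(winners) == 1 else winners

preds = {
    "gpt4":    ("Open-Domain QA", 0.88),
    "llama33": ("Reading Comprehension", 0.74),
    "claude3": ("Open-Domain QA", 0.69),
}
print(weighted_vote(preds))   # "Open-Domain QA" (weight 1.57 vs. 0.74)
```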
3.5.2. Confidence Calibration
- We collect confidence scores from each model on a validation set with known ground truth labels.
- We analyze the relationship between confidence scores and actual accuracy for each model.
- We derive calibration functions that map raw confidence scores to calibrated probabilities.
- We apply these calibration functions to all confidence scores before using them in the voting mechanism.
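A minimal per-model calibration sketch using scikit-learn's isotonic regression is given below; the confidence scores and correctness labels are dummy values, and in practice one calibrator per model is fitted on the validation set described above.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Validation-set confidences (scaled to [0, 1]) and binary correctness indicators
# for one model; these values are placeholders.
raw_conf = np.array([0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.55, 0.50])
correct  = np.array([1,    1,    1,    0,    1,    0,    1,    0])

# Fit a monotonic mapping from raw confidence to empirical accuracy.
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(raw_conf, correct)

# Calibrated scores are then used as the vote weights in the ensemble.
print(calibrator.predict(np.array([0.92, 0.65])))
```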
3.5.3. Tie-Breaking Strategy
- For each tied category, we identify which models voted for it.
- We calculate a tie-breaking score for each category based on the historical accuracy of the models that voted for it.
- The category with the highest tie-breaking score is selected as the final prediction.
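A sketch of this rule follows; whether the historical accuracies of the voting models are summed or averaged is an implementation detail not fixed above, so a simple sum is assumed here, and the accuracy values are taken from the per-model QA results reported later purely for illustration.

```python
def break_tie(tied_categories, votes, historical_accuracy):
    """Resolve a weighted-vote tie using historical model accuracy.

    votes: dict model -> predicted category.
    historical_accuracy: dict model -> domain-specific accuracy a_i.
    Each tied category is scored by the summed accuracy of its voters (assumption).
    """
    def score(category):
        return sum(a for model, a in historical_accuracy.items() if votes[model] == category)
    return max(tied_categories, key=score)

votes = {"gpt4": "Visual QA", "llama33": "Conversational QA", "claude3": "Conversational QA"}
acc = {"gpt4": 0.782, "llama33": 0.721, "claude3": 0.695}
print(break_tie(["Visual QA", "Conversational QA"], votes, acc))   # "Conversational QA"
```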
3.5.4. Adaptive Threshold Adjustment
- If all three models agree on a category, we accept the prediction regardless of confidence scores.
- If two models agree and the third disagrees, we require the weighted vote for the majority category to exceed a threshold.
- If all three models disagree, we require the weighted vote for the highest-scoring category to exceed a separate threshold (a sketch of this decision logic follows this list).
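The acceptance logic can be sketched as below; the threshold names and default values are placeholders, since the adaptive mechanism adjusts the actual thresholds from agreement patterns and historical performance rather than keeping them fixed.

```python
def accept_prediction(votes, weighted_totals, theta_majority=0.50, theta_split=0.60):
    """Accept or flag the ensemble prediction based on the agreement pattern.

    votes: dict model -> category.
    weighted_totals: dict category -> weighted vote share (normalized to sum to 1).
    Returns (predicted category, accepted flag); rejected cases can be routed to review.
    """
    top = max(weighted_totals, key=weighted_totals.get)
    n_agree = sum(1 for category in votes.values() if category == top)
    if n_agree == 3:                                  # unanimous: accept unconditionally
        return top, True
    threshold = theta_majority if n_agree == 2 else theta_split
    return top, weighted_totals[top] >= threshold
```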
3.6. Evaluation Methodology
3.6.1. Performance Metrics
3.6.2. Statistical Validation
3.6.3. Error Analysis
3.6.4. Cost–Benefit Analysis
3.7. Trend Analysis Methodology
3.8. Implementation Details
4. Experiments
4.1. Experimental Setup
4.1.1. Hardware and Software Environment
- 8x NVIDIA A100 GPUs (80 GB VRAM each, 640 GB total GPU memory)
- 2x Intel Xeon Platinum 8358 CPUs (32 cores each, 64 cores total)
- 1 TB DDR4 RAM (essential for model weight storage)
- 10 TB NVMe SSD storage (model weights and data caching)
- InfiniBand HDR connectivity (200 Gbps) for GPU communication
- Estimated power consumption: 8 kW during inference
- GPT-4: OpenAI API (gpt-4-turbo-2024-04-09)
- Claude 3: Anthropic API (claude-3-opus-20240229)
- Network latency: 45–120 ms per API call
- Concurrent request limit: 5 requests per model
- 2x Intel Xeon Gold 6248 CPUs (20 cores each, 40 cores total)
- 256 GB DDR4 RAM (for large dataset processing)
- 2 TB NVMe SSD storage (data preprocessing and caching)
- 10 Gbps network connection (API communications)
- LLaMA 3.3 infrastructure: USD 180,000 (amortized over 3 years)
- Data processing server: USD 25,000
- Monthly cloud costs: USD 2400 (GPT-4: USD 1800, Claude 3: USD 600)
- Total cost per 1000 papers: USD 270 (API costs only)
- Python 3.9 for all implementation
- PyTorch 2.0 for LLaMA 3.3 inference
- OpenAI API (version 2023-05-15) for GPT-4
- Anthropic API (version 2023-06-01) for Claude 3
- Pandas 1.5.3 and NumPy 1.24.2 for data processing
- Scikit-learn 1.2.2 for evaluation metrics
- Matplotlib 3.7.1 and Seaborn 0.12.2 for visualization
4.1.2. Dataset Characteristics
4.1.3. QA Domain Categories
4.1.4. CV Domain Categories
4.1.5. Data Splits
4.2. Baseline Models and Comparisons
4.2.1. Individual LLM Baselines
4.2.2. Simple Majority Voting
4.2.3. Traditional Machine Learning Baselines
4.3. Evaluation Procedures
4.3.1. Classification Performance Evaluation
- For each fold in the 5-fold cross-validation:
  - (a) Train the calibration functions and optimize thresholds using the training set.
  - (b) Apply each individual model and the ensemble to the test set.
  - (c) Calculate performance metrics (accuracy, precision, recall, F1-score) for each model and the ensemble.
- Average the performance metrics across all five folds to obtain the final results.
- Calculate 95% confidence intervals for all metrics using bootstrapping with 1000 resamples.
- Perform paired t-tests to assess the statistical significance of performance differences between the ensemble and individual models.
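A sketch of this protocol is given below; `labels` (ground-truth categories) and `classify_fold` (a helper that fits calibration functions and thresholds on the training split and returns test-split predictions per model) are hypothetical placeholders.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

def bootstrap_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=42):
    """95% confidence interval for accuracy via paired bootstrap resampling."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_resamples):
        idx = rng.integers(0, len(y_true), len(y_true))
        scores.append(accuracy_score(y_true[idx], y_pred[idx]))
    return np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
ensemble_acc, gpt4_acc = [], []
for train_idx, test_idx in skf.split(np.zeros(len(labels)), labels):
    preds = classify_fold(train_idx, test_idx)        # hypothetical helper
    ensemble_acc.append(accuracy_score(labels[test_idx], preds["ensemble"]))
    gpt4_acc.append(accuracy_score(labels[test_idx], preds["gpt4"]))

t_stat, p_value = ttest_rel(ensemble_acc, gpt4_acc)   # paired t-test across folds
```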
4.3.2. Inter-Model Agreement Analysis
- For each paper in the dataset, record the predictions made by each of the three models.
- Calculate the percentage of papers where all three models agree, where two models agree, and where all three models disagree.
- Calculate Fleiss’ kappa to measure inter-rater reliability between the three models.
- Analyze the relationship between model agreement and ensemble accuracy to assess the effectiveness of the ensemble voting mechanism.
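The sketch below illustrates this analysis, with predictions encoded as integer category indices; it relies on statsmodels for Fleiss' kappa, which is an assumption beyond the software stack listed in Section 4.1.1.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# One row per paper, one column per model (GPT-4, LLaMA 3.3, Claude 3);
# entries are category indices. The rows shown are illustrative.
ratings = np.array([
    [2, 2, 2],   # all three models agree
    [0, 0, 1],   # two models agree
    [3, 1, 4],   # all three models disagree
])

full_agreement = np.mean([len(set(row)) == 1 for row in ratings])
full_disagreement = np.mean([len(set(row)) == 3 for row in ratings])
partial_agreement = 1.0 - full_agreement - full_disagreement

counts, _ = aggregate_raters(ratings)   # papers x categories count matrix
kappa = fleiss_kappa(counts)
```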
4.3.3. Confidence Analysis
- For each model, bin the predictions based on confidence scores (e.g., 0–10%, 10–20%, etc.).
- Calculate the accuracy within each confidence bin.
- Plot the relationship between confidence and accuracy to assess calibration quality.
- Compare the calibration curves before and after applying the confidence calibration procedure to assess its effectiveness.
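A minimal binning helper for this reliability analysis is sketched below, assuming calibrated confidences in [0, 1] and binary correctness indicators.

```python
import numpy as np

def confidence_bins(confidences, correct, n_bins=10):
    """Per-bin statistics for a reliability (confidence vs. accuracy) plot."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        upper = confidences <= hi if i == n_bins - 1 else confidences < hi
        mask = (confidences >= lo) & upper
        if mask.any():
            rows.append({
                "bin": f"{lo:.1f}-{hi:.1f}",
                "count": int(mask.sum()),
                "mean_confidence": float(confidences[mask].mean()),
                "accuracy": float(correct[mask].mean()),
            })
    return rows
```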
4.3.4. Cost–Benefit Analysis
- Measure the computational resources required for each model and the ensemble, including API costs, processing time, and memory requirements.
- Estimate the time required for manual classification based on expert assessments.
- Calculate the time savings achieved by the automated approach compared to manual classification.
- Analyze the relationship between computational costs and accuracy improvements to identify optimal configurations for different use cases.
4.4. Implementation Challenges and Solutions
4.4.1. API Rate Limiting
4.4.2. Model Output Parsing
4.4.3. Confidence Score Calibration
4.4.4. Computational Resource Management
5. Results
5.1. Classification Performance
5.1.1. Overall Performance Comparison
5.1.2. Category-Specific Performance
5.1.3. Confusion Matrices
5.2. Ensemble Improvement Analysis
5.2.1. Improvement Patterns
5.2.2. Statistical Significance
5.3. Inter-Model Agreement Analysis
5.3.1. Agreement Patterns
5.3.2. Inter-Rater Reliability
5.4. Confidence Analysis
5.4.1. Confidence Calibration
5.4.2. Confidence Distribution
5.5. Cost–Benefit Analysis
5.5.1. Computational Costs
5.5.2. Time Savings
5.5.3. Research Gap Analysis
- Integration of Visual QA with Conversational QA for multimodal dialogue systems
- Knowledge-based approaches for Specialized QA in emerging domains (e.g., climate science, sustainable development)
- Robust evaluation methodologies for Open-Domain QA systems
- Ethical frameworks for Generative Models in sensitive applications
- Integration of 3D Vision with Autonomous Systems for complex navigation tasks
- Specialized approaches for Video Analysis in low-resource environments
5.6. Citation Analysis
5.6.1. Citations by Category
- Knowledge-Based QA: 127.3 ± 89.2 citations (highest in QA domain)
- Open-Domain QA: 156.8 ± 124.6 citations (most variable)
- Reading Comprehension: 89.7 ± 67.4 citations (moderate impact)
- Visual QA: 64.2 ± 34.8 citations (emerging area)
- Conversational QA: 82.5 ± 45.3 citations (growing field)
- Specialized QA: 45.7 ± 28.9 citations (niche applications)
- Object Detection: 198.4 ± 156.3 citations (highest overall)
- Generative Models: 142.1 ± 98.7 citations (rapidly growing)
- Image Segmentation: 134.6 ± 87.2 citations (established field)
- 3D Vision: 89.3 ± 62.1 citations (expanding domain)
- Video Analysis: 76.8 ± 51.4 citations (emerging focus)
- Medical Imaging: 67.2 ± 43.8 citations (specialized area)
- Autonomous Systems: 58.9 ± 39.2 citations (applied research)
5.6.2. Cross-Domain Comparison
- Foundational Categories: Object Detection in CV (198.4 citations) and Open-Domain QA (156.8 citations) represent the most highly cited categories, indicating their foundational importance in their respective domains.
- Emerging Areas: Visual QA (64.2 citations) and Specialized QA (45.7 citations) show lower citation counts, consistent with their status as emerging or specialized research areas.
- Variability: Open-Domain QA shows the highest citation variability (σ = 124.6), suggesting a mix of highly influential foundational papers and more recent contributions.
5.6.3. Temporal Citation Trends
- Generative Models: 35% increase in average citations per year (2018–2024)
- Visual QA: 28% increase, reflecting growing interest in multimodal systems
- Conversational QA: 22% increase, driven by dialogue system advances
- Object Detection: Consistent high citation rates (±5% variation)
- Reading Comprehension: Steady citation patterns (±8% variation)
- Image Segmentation: Mature field with stable impact (±6% variation)
5.6.4. Citation–Performance Correlation
- Methodological Diversity: Well-established fields have developed diverse approaches, making categorization more complex.
- Boundary Blurring: Mature research areas often develop sub-specializations that blur traditional category boundaries.
- Interdisciplinary Growth: Highly cited areas attract cross-disciplinary work that challenges simple categorization.
6. Discussion
6.1. Interpretation of Results
6.1.1. Ensemble Superiority
6.1.2. Cross-Domain Generalizability
6.1.3. Cost-Effectiveness
6.1.4. Methodological Robustness
6.2. Implications for Automated Literature Analysis
6.3. Task Ambiguity and Difficulty Analysis
6.3.1. Inter-Annotator Agreement Results
- Cohen’s κ (mean pairwise): 0.515
- Fleiss’ κ (multi-annotator): 0.493
- Pairwise agreements: [0.515, 0.491, 0.474]
- Standard deviation: 0.021
- Interpretation: Moderate agreement
- QA Domain: Cohen’s κ = 0.496, Fleiss’ κ = 0.464 (Moderate)
- CV Domain: Cohen’s κ = 0.428, Fleiss’ κ = 0.417 (Moderate)
6.3.2. Category Boundary Ambiguity
- Reading Comprehension vs. Open-Domain QA: Papers focusing on reading comprehension often overlap with open-domain question answering, particularly when the reading comprehension system is tested on diverse text sources.
- 3D Vision vs. Video Analysis: Three-dimensional computer vision and video analysis share significant conceptual overlap, especially in applications involving temporal 3D reconstruction or dynamic scene understanding.
- Generative Models vs. Image Segmentation: Modern generative approaches to image segmentation blur the traditional boundaries between generative modeling and discriminative segmentation tasks.
6.3.3. Task Complexity Assessment
6.4. Limitations and Future Work
6.5. Ethical Considerations
7. Conclusions
7.1. Summary of Findings
7.2. Practical Implications
7.3. Future Directions
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
- Complete QA dataset (752 papers) and CV dataset (402 papers)
- Individual model predictions and confidence scores
- Ensemble voting results and performance metrics
- Statistical analysis and visualization code
- Detailed documentation and usage examples
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| ACM | Association for Computing Machinery |
| ACL | Association for Computational Linguistics |
| AUC | Area Under the Curve (of ROC) |
| API | Application Programming Interface |
| CI | Confidence Interval |
| CSV | Comma-Separated Values |
| CRF | Conditional Random Fields |
| CV | Computer Vision |
| DL | Deep Learning |
| F1 | F1-score (harmonic mean of precision and recall) |
| Fleiss | Fleiss’ kappa (inter-rater reliability) |
| GPT | Generative Pre-trained Transformer |
| GPU | Graphics Processing Unit |
| IEEE | Institute of Electrical and Electronics Engineers |
| IMRaD | Introduction, Methods, Results, and Discussion |
| IR | Information Retrieval |
| JSON | JavaScript Object Notation |
| k-fold | k-fold Cross-Validation |
| KB | Knowledge-Base Question Answering |
| ML | Machine Learning |
| NLP | Natural Language Processing |
| NLU | Natural Language Understanding |
| OD | Open-Domain Question Answering |
| ODQA | Open-Domain Question Answering |
| p-value | Probability value (statistical significance) |
| RQ | Research Question |
| RLHF | Reinforcement Learning from Human Feedback |
| ROC | Receiver Operating Characteristic |
| SQuAD | Stanford Question Answering Dataset |
| SVM | Support Vector Machine |
| TF-IDF | Term Frequency–Inverse Document Frequency |
| VQA | Visual Question Answering |
Appendix A. Technical Implementation Details
Appendix A.1. Model Hyperparameters
Appendix A.1.1. GPT-4 Configuration
- Model Version: gpt-4-turbo-2024-04-09
- Temperature: 0.2 (to ensure consistent outputs while maintaining some flexibility)
- Max Tokens: 1024 (sufficient for comprehensive classification responses)
- Top-p: 0.95 (to maintain output diversity while ensuring coherence)
- Frequency Penalty: 0.0 (no penalty for repeated tokens)
- Presence Penalty: 0.0 (no penalty for introducing new topics)
- Stop Sequences: None (allow complete responses)
Appendix A.1.2. LLaMA 3.3 Configuration
- Model Version: LLaMA 3.3 70B parameter version
- Temperature: 0.2 (consistent with GPT-4 configuration)
- Max Tokens: 1024 (consistent with GPT-4 configuration)
- Top-p: 0.95 (consistent with GPT-4 configuration)
- Repetition Penalty: 1.1 (slight penalty for repetitive generation)
- Context Length: 4096 tokens (maximum context window)
- Batch Size: 4 (optimal for GPU memory utilization)
Appendix A.1.3. Claude 3 Configuration
- Model Version: claude-3-opus-20240229
- Temperature: 0.2 (consistent with other models)
- Max Tokens: 1024 (consistent with other models)
- Top-p: 0.95 (consistent with other models)
- Stop Sequences: None (allow complete responses)
- System Prompt: Custom system prompt for scientific classification
Appendix A.2. API Rate-Limiting Logic
Appendix A.2.1. Request Queuing System
- Queue Structure: FIFO (First In, First Out) queue with priority levels
- Rate Limit Tracking: Dynamic tracking of API usage for each service
- Exponential Backoff: Implemented with base delay of 1 s, maximum delay of 60 s
- Retry Logic: Maximum of 3 retries per request with increasing delays
- Concurrent Requests: Maximum of 5 concurrent requests per API to respect rate limits
Appendix A.2.2. Error Handling
- HTTP Error Codes: Specific handling for 429 (rate limit), 500 (server error), 503 (service unavailable)
- Timeout Handling: Request timeout set to 30 s with automatic retry
- Connection Errors: Automatic retry with exponential backoff for network issues
- Invalid Response Handling: Fallback parsing strategies for malformed responses
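A minimal retry wrapper illustrating the backoff policy above (1 s base delay, 60 s cap, at most 3 retries) is sketched below; `request_fn` and the broad exception handling stand in for the service-specific API calls and error types, which would be narrowed in a real implementation.

```python
import random
import time

def call_with_backoff(request_fn, max_retries=3, base_delay=1.0, max_delay=60.0):
    """Retry an API call with exponential backoff and jitter.

    request_fn: zero-argument callable that performs one request and raises on
    rate limiting (429), server errors (500/503), timeouts, or connection failures.
    """
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except Exception:                                  # narrow to API-specific errors in practice
            if attempt == max_retries:
                raise                                      # give up after the final retry
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0.0, 0.5))   # jitter avoids synchronized retries
```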
Appendix A.3. Parsing Fallback Strategies
Appendix A.3.1. Primary Parsing Strategy
- Category: [Selected category]
- Confidence: [0–100%]
- Reasoning: [Brief explanation]
Appendix A.3.2. Fallback Parsing Strategies
- Accepts variations in label formatting (e.g., “Category:” vs. “Classification:”)
- Handles missing colons or extra whitespace
- Extracts confidence values from various formats (e.g., “85%”, “0.85”, “85/100”)
- Searches for category names within the response text
- Extracts numerical values as potential confidence scores
- Uses sentence-level analysis to identify reasoning components
- Uses regular expressions to extract structured information
- Handles JSON-like responses that may be generated by models
- Extracts key-value pairs from unstructured text
- Assigns default confidence score of 50% if none can be extracted
- Uses the most frequent category for the domain as default classification
- Logs parsing failures for manual review and system improvement
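A simplified sketch of this layered parsing is shown below; the regex patterns, confidence-format heuristic, and 50% default are illustrative rather than the exact production rules.

```python
import re

def parse_response(text, categories, default_confidence=50.0):
    """Extract (category, confidence, reasoning) from a model response,
    falling back to looser patterns when the structured format is missing."""
    category = confidence = reasoning = None

    # Primary strategy: structured "Category:/Confidence:/Reasoning:" lines,
    # tolerating label variants such as "Classification:".
    m = re.search(r"(?:Category|Classification)\s*:\s*(.+)", text, re.IGNORECASE)
    if m:
        category = m.group(1).strip()
    m = re.search(r"Confidence\s*:\s*([\d.]+)", text, re.IGNORECASE)
    if m:
        value = float(m.group(1))
        confidence = value * 100 if value <= 1.0 else value   # handles "0.85" and "85%"
    m = re.search(r"Reasoning\s*:\s*(.+)", text, re.IGNORECASE | re.DOTALL)
    if m:
        reasoning = m.group(1).strip()

    # Fallback: search for a known category name anywhere in the response text.
    if category is None:
        for c in categories:
            if c.lower() in text.lower():
                category = c
                break

    # Last resort: default confidence when none could be extracted.
    if confidence is None:
        confidence = default_confidence
    return category, confidence, reasoning
```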
Appendix A.4. Confidence Calibration Details
Appendix A.4.1. Calibration Function Selection
- Fits a sigmoid function to map raw scores to calibrated probabilities
- Equation: $\hat{p} = \frac{1}{1 + \exp(A s + B)}$, where $s$ is the raw score and the parameters $A$ and $B$ are fitted on validation data
- Suitable for small datasets but may overfit
- Non-parametric approach that fits a monotonic function
- Better suited for larger datasets and complex calibration patterns
- Selected as the primary calibration method for our study
- Divides confidence scores into bins and calculates empirical accuracy
- Simple but may be unstable for small sample sizes
- Used as a baseline comparison method
Appendix A.4.2. Calibration Validation
- Visual assessment of calibration quality
- Comparison of predicted vs. actual accuracy across confidence bins
- Identification of over-confident and under-confident regions
- Quantitative measure of calibration quality
- Calculated as the mean squared difference between calibrated confidence and outcome: $\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N}(p_i - o_i)^2$, where $p_i$ is the calibrated confidence and $o_i \in \{0, 1\}$ indicates a correct prediction
- Lower scores indicate better calibration
- Measures the average difference between confidence and accuracy
- Calculated across multiple confidence bins
- Provides a single metric for calibration quality assessment
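Minimal implementations of the two quantitative metrics above are sketched here, assuming calibrated confidences in [0, 1] and binary correctness indicators.

```python
import numpy as np

def brier_score(confidences, outcomes):
    """Mean squared difference between calibrated confidence and correctness (0/1)."""
    confidences = np.asarray(confidences, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((confidences - outcomes) ** 2))

def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Weighted average gap between per-bin accuracy and per-bin mean confidence."""
    confidences = np.asarray(confidences, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin indices 0..n_bins-1; the clip keeps confidence == 1.0 in the last bin.
    bin_idx = np.clip(np.digitize(confidences, edges), 1, n_bins) - 1
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            ece += mask.mean() * abs(outcomes[mask].mean() - confidences[mask].mean())
    return float(ece)
```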
Appendix A.5. Computational Resource Management
Appendix A.5.1. GPU Memory Optimization
- Model Sharding: Distributed across 8 A100 GPUs using tensor parallelism
- Gradient Checkpointing: Reduced memory usage during inference
- Mixed Precision: Used FP16 precision to reduce memory footprint
- Dynamic Batching: Adaptive batch-sizing based on sequence length
Appendix A.5.2. Checkpoint Management
- State Serialization: Regular saving of processing state every 100 papers
- Resume Capability: Automatic detection and resumption from checkpoints
- Progress Tracking: Detailed logging of processing progress and performance metrics
- Error Recovery: Graceful handling of system failures with automatic restart
Appendix A.6. Statistical Analysis Implementation
Appendix A.6.1. Cross-Validation Implementation
- Stratified Sampling: Ensures balanced representation across categories
- Reproducible Splits: Fixed random seed for consistent evaluation
- Nested CV: Inner 3-fold CV for hyperparameter tuning
- Performance Aggregation: Proper averaging of metrics across folds
Appendix A.6.2. Statistical Significance Testing
- Comparison of ensemble vs. individual model performance
- Assumes normal distribution of performance differences
- Bonferroni correction for multiple comparisons
- Non-parametric alternative to paired t-test
- Used when normality assumptions are violated
- More robust to outliers and skewed distributions
- Comparison of classification accuracies
- Tests for systematic differences in error patterns
- Suitable for matched-pair categorical data
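The three tests can be run as sketched below; statsmodels (for McNemar's test) is an assumption beyond the listed software stack, and the per-fold accuracies and the 2x2 correctness table are illustrative numbers rather than the study's actual results.

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

# Illustrative per-fold accuracies for the ensemble and one baseline model.
ensemble = np.array([0.883, 0.879, 0.885, 0.881, 0.882])
gpt4     = np.array([0.781, 0.784, 0.779, 0.783, 0.783])

t_stat, p_t = ttest_rel(ensemble, gpt4)   # paired t-test (Bonferroni: multiply p by no. of comparisons)
w_stat, p_w = wilcoxon(ensemble, gpt4)    # non-parametric alternative

# McNemar's test on per-paper correctness (rows: ensemble right/wrong, cols: GPT-4 right/wrong).
table = np.array([[860, 144],    # both correct / only ensemble correct
                  [24, 126]])    # only GPT-4 correct / both wrong
result = mcnemar(table, exact=False, correction=True)
print(p_t, p_w, result.pvalue)
```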
Appendix A.7. Implementation Code Structure
Appendix A.7.1. Main Components
- data_processor.py: Data loading, preprocessing, and validation
- model_interface.py: Unified interface for all LLM interactions
- ensemble_voting.py: Implementation of voting mechanisms and tie-breaking
- calibration.py: Confidence calibration and validation methods
- evaluation.py: Comprehensive evaluation metrics and statistical tests
- visualization.py: Generation of plots and visualizations
- utils.py: Utility functions for logging, configuration, and error handling
Appendix A.7.2. Configuration Management
- Global Settings: Random seeds, logging configuration, output paths
- Model Configurations: Hyperparameters for each LLM
- Evaluation Settings: Cross-validation parameters, statistical test configurations
- Ensemble Parameters: Voting weights, tie-breaking strategies, confidence thresholds
References
- Guenther, L.; Joubert, M. Science communication as a field of research: Identifying trends, challenges and gaps by analysing research papers. JCOM 2017, 16, A02. [Google Scholar] [CrossRef]
- Van Dinter, R.; Tekinerdogan, B.; Catal, C. Automation of systematic literature reviews: A systematic literature review. Inf. Softw. Technol. 2021, 136, 106589. [Google Scholar] [CrossRef]
- Bernasconi, E.; Ferilli, S. Mining Literary Trends: A Tool for Digital Library Analysis. In Proceedings of the International Conference on Theory and Practice of Digital Libraries, Ljubljana, Slovenia, 24–27 September 2024; Springer: Cham, Switzerland, 2024; pp. 342–359. [Google Scholar]
- Iqbal, S.; Hassan, S.U.; Aljohani, N.R.; Alelyani, S.; Nawaz, R.; Bornmann, L. A decade of in-text citation analysis based on natural language processing and machine learning techniques: An overview of empirical studies. Scientometrics 2021, 126, 6551–6599. [Google Scholar] [CrossRef]
- Chen, H.; Tsang, Y.; Wu, C. When text mining meets science mapping in the bibliometric analysis: A review and future opportunities. Int. J. Eng. Bus. Manag. 2023, 15, 18479790231222349. [Google Scholar] [CrossRef]
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar] [CrossRef]
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.L.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training Language Models to Follow Instructions with Human Feedback. arXiv 2022, arXiv:2203.02155. [Google Scholar] [CrossRef]
- Ofori-Boateng, R.; Aceves-Martins, M.; Wiratunga, N.; Moreno-Garcia, C.F. Towards the automation of systematic reviews using natural language processing, machine learning, and deep learning: A comprehensive review. Artif. Intell. Rev. 2024, 57, 200. [Google Scholar] [CrossRef]
- Gilardi, S.; Floridi, L.; Capurro, R. A Survey on the Use of LLMs for Scientific Knowledge Discovery. Big Data Res. 2023, 28, 100415. [Google Scholar]
- Agarwal, S.; Sahu, G.; Puri, A.; Laradji, I.H.; Dvijotham, K.D.; Stanley, J.; Charlin, L.; Pal, C. LitLLMs, LLMs for Literature Review: Are we there yet? arXiv 2024. [Google Scholar] [CrossRef]
- Liu, Y.; Yang, K.; Qi, Z.; Liu, X.; Yu, Y.; Zhai, C.X. Bias and Volatility: A Statistical Framework for Evaluating Large Language Model’s Stereotypes and the Associated Generation Inconsistency. Adv. Neural Inf. Process. Syst. 2024, 37, 110131–110155. [Google Scholar]
- Dong, X.; Yu, Z.; Cao, W.; Shi, Y.; Ma, Q. A survey on ensemble learning. Front. Comput. Sci. 2020, 14, 241–258. [Google Scholar] [CrossRef]
- Triantafyllopoulos, L.; Kalles, D. From Divergence to Consensus: Evaluating the Role of Large Language Models in Facilitating Agreement through Adaptive Strategies. arXiv 2025, arXiv:2503.15521. [Google Scholar]
- Ashiga, M.; Jie, W.; Wu, F.; Voskanyan, V.; Dinmohammadi, F.; Brookes, P.; Gong, J.; Wang, Z. Ensemble Learning for Large Language Models in Text and Code Generation: A Survey. arXiv 2025, arXiv:2503.13505. [Google Scholar]
- Sakai, H.; Lam, S.S. Large language models for healthcare text classification: A systematic review. arXiv 2025, arXiv:2503.01159. [Google Scholar]
- arXiv Team. arXiv API: Access to arXiv Programmatically. arXiv e-Print Repository. 2024. Available online: https://info.arxiv.org/help/api/index.html (accessed on 13 August 2025).
- Van Nooten, J. The Many Faces of a Text: Applications and Enhancements of Multi-Label Text Classification Algorithms. Ph.D. Thesis, University of Antwerp, Antwerpen, Belgium, 2025. [Google Scholar]
- Liu, Y.; Ott, M.; Goyal, N.; Joshi, J.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
- Valluri, S.; Roy, S.; Kundu, A. An Overview of LLMs in Computer Vision. Comput. Vis. Image Underst. 2024, 245, 103940. [Google Scholar]
- Graves, A.; Mohamed, A.r.; Hinton, G. Speech Recognition with Deep Recurrent Neural Networks. In Proceedings of the ICASSP 2013, Vancouver, BC, Canada, 26–31 May 2013; pp. 6645–6649. [Google Scholar]
- Suzuoki, S.; Hatano, K. Reducing hallucinations in large language models: A consensus voting approach using mixture of experts. TechRxiv 2024. [Google Scholar] [CrossRef]
- Belem, C.G.; Pezeskhpour, P.; Iso, H.; Maekawa, S.; Bhutani, N.; Hruschka, E. From Single to Multi: How LLMs Hallucinate in Multi-Document Summarization. arXiv 2024, arXiv:2410.13961. [Google Scholar] [CrossRef]
- Huang, D.; Yan, C.; Li, Q.; Peng, X. From large language models to large multimodal models: A literature review. Appl. Sci. 2024, 14, 5068. [Google Scholar] [CrossRef]
- Wang, J.; Hou, L.; Sun, Y.; Li, J.; Yang, L.; Li, Y. A Method for the Classification Public Health Questions Based on Model Ensemble and Voting Mechanism. In Health Information Processing, Proceedings of the 10th China Health Information Processing Conference, CHIP 2024, Fuzhou, China, 15–17 November 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 97–108. [Google Scholar]
- Davoudi, S.P.M.; Fard, A.S.; Amiri-Margavi, A. Collective Reasoning Among LLMs A Framework for Answer Validation Without Ground Truth. arXiv 2025, arXiv:2502.20758. [Google Scholar]
- Susnjak, T.; Hwang, P.; Reyes, N.; Barczak, A.L.; McIntosh, T.; Ranathunga, S. Automating research synthesis with domain-specific large language model fine-tuning. ACM Trans. Knowl. Discov. Data 2025, 19, 68. [Google Scholar] [CrossRef]
- Li, L.; Wang, Y.; Xu, R.; Wang, P.; Feng, X.; Kong, L.; Liu, Q. Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models. arXiv 2024, arXiv:2403.00231. [Google Scholar] [CrossRef]
- Clement, C.B.; Bierbaum, M.; O’Keeffe, K.P.; Alemi, A.A. On the Use of ArXiv as a Dataset. arXiv 2019, arXiv:1905.00075. [Google Scholar] [CrossRef]
- Bernasconi, E.; Redavid, D.; Ferilli, S. Survey-Classification: Integrated Survey Classification Dataset via LLM Ensemble. 2025. Available online: https://github.com/uniba-team/survey-classification (accessed on 13 August 2025).
- Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language Models. arXiv 2023, arXiv:2303.18223. [Google Scholar] [CrossRef]
- Zhang, Y.; Chen, X.; Jin, B.; Wang, S.; Ji, S.; Wang, W.; Han, J. A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery. In Proceedings of the EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 8783–8817. [Google Scholar] [CrossRef]
- OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
- Meta AI. LLaMA 3.3 Community Release. Model card on Hugging Face. 2024; Released Under LLaMA 3.3 Community License. Available online: https://www.llama.com/llama3_3/license/ (accessed on 13 August 2025).
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
- Anthropic. Claude 3 Model Family: Opus, Sonnet, Haiku. Model Cards and Technical Announcement, 2024. Transformer-Based LLMs with Multimodal Capabilities, Optimized for Helpfulness, Harmlessness, and Honesty. Available online: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf (accessed on 13 August 2025).
- Bai, Y.; Kadavath, S.; Kundu, S.; Askell, A.; Kernion, J.; Jones, A.; Chen, A.; Goldie, A.; Mirhoseini, A.; McKinnon, C.; et al. Constitutional AI: Harmlessness from AI Feedback. arXiv 2022, arXiv:2212.08073. [Google Scholar] [CrossRef]
- Lan, Y.; He, G.; Jiang, J.; Jiang, J.; Zhao, W.X.; Wen, J.R. A Survey on Complex Knowledge Base Question Answering: Methods, Challenges and Solutions. arXiv 2021, arXiv:2105.11644. [Google Scholar] [CrossRef]
- Liu, Y.; Chen, X.; He, H. A Survey of Machine Reading Comprehension—Tasks, Evaluation Methods and Benchmarks. Appl. Sci. 2020, 10, 7640. [Google Scholar] [CrossRef]
- Chen, D.; Fisch, A.; Weston, J.; Bordes, A. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1870–1879. [Google Scholar] [CrossRef]
- Reddy, S.; Chen, D.; Manning, C.D. CoQA: A Conversational Question Answering Challenge. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), Suzhou, China, 5–9 November 2019; pp. 1961–1970. [Google Scholar] [CrossRef]
- Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Lawrence Zitnick, C.; Parikh, D. VQA: Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2425–2433. [Google Scholar] [CrossRef]
- Tsatsaronis, G.; Balikas, G.; Malakasiotis, P.; Partalas, I.; Zschunke, M.; Alvers, M.R.; Krithara, A.; Petridis, S.; Paliouras, G. Overview of the BIOASQ Large-Scale Biomedical Semantic Indexing and Question Answering Competition. BMC Bioinform. 2015, 16 (Suppl. S10), S1. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
- Milletari, F.; Navab, N.; Ahmadi, S.A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar]
- Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar]
- Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
- Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
- Simonyan, K.; Zisserman, A. Two-Stream Convolutional Networks for Action Recognition in Videos. Adv. Neural Inf. Process. Syst. 2014, 27, 568–576. [Google Scholar]
- Bojarski, M.; Testa, D.D.; Dworakowski, D.; Firner, B.; Flepp, B.; Goyal, P.; Jackel, L.D.; Monfort, M.; Muller, U.; Zhang, J.; et al. End to End Learning for Self-Driving Cars. arXiv 2016, arXiv:1604.07316. [Google Scholar] [CrossRef]
- Landis, J.R.; Koch, G.G. The measurement of observer agreement for categorical data. Biometrics 1977, 33, 159–174. [Google Scholar] [CrossRef]
| Characteristic | QA Domain | CV Domain | Total |
|---|---|---|---|
| Number of Papers | 752 | 402 | 1154 |
| Number of Categories | 6 | 7 | 13 |
| Time Period | 2018–2024 | 2018–2024 | 2018–2024 |
| Average Abstract Length (words) | 243 | 257 | 248 |
| Average Title Length (words) | 12 | 13 | 12 |
All values are reported as mean ± SD.

| Model | QA Accuracy | QA Precision | QA Recall | QA F1 | CV Accuracy | CV Precision | CV Recall | CV F1 |
|---|---|---|---|---|---|---|---|---|
| GPT-4 | 78.2 ± 1.4% | 77.9 ± 1.6% | 76.8 ± 1.8% | 77.3 ± 1.5% | 74.9 ± 1.7% | 73.8 ± 1.9% | 72.5 ± 2.1% | 73.1 ± 1.8% |
| LLaMA 3.3 | 72.1 ± 1.8% | 71.5 ± 2.0% | 70.3 ± 2.2% | 70.9 ± 1.9% | 69.7 ± 2.1% | 68.4 ± 2.3% | 67.2 ± 2.5% | 67.8 ± 2.2% |
| Claude 3 | 69.5 ± 2.0% | 68.9 ± 2.2% | 67.7 ± 2.4% | 68.3 ± 2.1% | 67.2 ± 2.3% | 66.1 ± 2.5% | 65.0 ± 2.7% | 65.5 ± 2.4% |
| Simple Majority | 79.8 ± 1.2% | 79.2 ± 1.4% | 78.5 ± 1.6% | 78.8 ± 1.3% | 76.4 ± 1.5% | 75.3 ± 1.7% | 74.1 ± 1.9% | 74.7 ± 1.6% |
| Ensemble (Ours) | 88.2 ± 0.8% | 87.8 ± 1.0% | 86.9 ± 1.2% | 87.3 ± 0.9% | 85.8 ± 1.1% | 84.9 ± 1.3% | 83.7 ± 1.5% | 84.3 ± 1.2% |
| TF-IDF + SVM | 62.3 ± 2.8% | 61.5 ± 3.0% | 60.2 ± 3.2% | 60.8 ± 2.9% | 58.7 ± 3.1% | 57.4 ± 3.3% | 56.1 ± 3.5% | 56.7 ± 3.2% |
| BERT + RF | 68.9 ± 2.3% | 68.2 ± 2.5% | 67.1 ± 2.7% | 67.6 ± 2.4% | 65.3 ± 2.6% | 64.2 ± 2.8% | 63.0 ± 3.0% | 63.6 ± 2.7% |
| SciBERT | 71.5 ± 2.1% | 70.8 ± 2.3% | 69.7 ± 2.5% | 70.2 ± 2.2% | 68.2 ± 2.4% | 67.1 ± 2.6% | 65.9 ± 2.8% | 66.5 ± 2.5% |
| Comparison | QA Mean Diff. | QA p-Value | CV Mean Diff. | CV p-Value |
|---|---|---|---|---|
| Ensemble vs. GPT-4 | +10.0% | <0.001 | +10.9% | 0.001 |
| Ensemble vs. LLaMA 3.3 | +16.1% | <0.001 | +16.1% | <0.001 |
| Ensemble vs. Claude 3 | +18.7% | <0.001 | +18.6% | <0.001 |
| Ensemble vs. Simple Majority | +8.4% | <0.001 | +9.4% | <0.001 |
| Model Combination | QA Domain | CV Domain |
|---|---|---|
| All Three Models | 0.68 | 0.65 |
| GPT-4 and LLaMA 3.3 | 0.72 | 0.69 |
| GPT-4 and Claude 3 | 0.70 | 0.67 |
| LLaMA 3.3 and Claude 3 | 0.65 | 0.62 |
| Model | QA Mean | QA Median | QA Std Dev | CV Mean | CV Median | CV Std Dev |
|---|---|---|---|---|---|---|
| GPT-4 | 82.3% | 85.0% | 12.7% | 80.1% | 83.0% | 13.5% |
| LLaMA 3.3 | 78.5% | 80.0% | 14.2% | 76.8% | 79.0% | 15.1% |
| Claude 3 | 75.2% | 78.0% | 15.8% | 73.5% | 76.0% | 16.3% |
| Model | API Cost (per 1000 Papers) | Processing Time (per Paper) | Memory Usage | GPU Hours (per 1000 Papers) |
|---|---|---|---|---|
| GPT-4 | USD 150 | 3.2 s | N/A (API) | N/A (API) |
| LLaMA 3.3 | USD 0 (local) | 5.8 s | 80 GB | 1.6 |
| Claude 3 | USD 120 | 2.9 s | N/A (API) | N/A (API) |
| Ensemble (Total) | USD 270 | 11.9 s | 80 GB | 1.6 |
| Approach | Time per Paper | Time for 1000 Papers | Accuracy |
|---|---|---|---|
| Manual Classification (Expert) | 15 min | 250 h | 95% (estimated) |
| Manual Classification (Non-Expert) | 25 min | 417 h | 80% (estimated) |
| GPT-4 | 3.2 s | 0.9 h | 76.6% |
| Ensemble (Ours) | 11.9 s | 3.3 h | 87.0% |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).