Topic Editors

Dr. Affan Yasin
School of AI and Advanced Computing, Xi’an Jiaotong-Liverpool University, Suzhou 215123, China
Dr. Javed Ali Khan
Foundation of Software Engineering (FSE) Group, Department of Software Engineering, Faculty of Physics, Engineering, and Computer Science, University of Hertfordshire, Hatfield, UK
Dr. Lijie Wen
School of Software, Tsinghua University, Beijing, China

Applications of NLP, AI, and ML in Software Engineering

Abstract submission deadline: 30 June 2026
Manuscript submission deadline: 30 August 2026
Viewed by 75609

Topic Information

Dear Colleagues,

The integration of Natural Language Processing (NLP), Artificial Intelligence (AI), and Machine Learning (ML) into Software Engineering is revolutionizing the way software is developed, tested, and maintained. These advanced technologies enable the automation of complex tasks, improve accuracy in bug detection, and enhance code quality. By leveraging NLP, AI, and ML, software engineers can better manage requirements, optimize project workflows, and predict project risks.

This topic seeks to showcase cutting-edge research and practical applications that demonstrate the transformative potential of these technologies in the software engineering domain. We invite contributions that explore innovative methodologies, practical tools, and real-world case studies. High-quality studies comparing the efficiency of various algorithms on different datasets are also of particular interest. Such comparative analyses are crucial for understanding the strengths and weaknesses of different approaches, thereby guiding practitioners in selecting the most appropriate techniques for their specific needs. These studies provide valuable insights into algorithm performance, scalability, and adaptability across diverse software engineering contexts.

One compelling example of the application of NLP, AI, and ML in Software Engineering is the automated generation of code documentation. By utilizing NLP techniques, AI models can analyze the codebase and generate comprehensive documentation that explains the functionality of the code in human-readable language. This not only saves significant time for developers but also ensures that the documentation is always up-to-date with the latest code changes. Additionally, ML algorithms can be used to predict potential areas in the code that are prone to bugs or require refactoring, further enhancing the efficiency and reliability of the software development process.

Dr. Affan Yasin
Dr. Javed Ali Khan
Dr. Lijie Wen
Topic Editors

Keywords

  • natural language processing (NLP)
  • artificial intelligence (AI)
  • machine learning (ML)
  • software engineering
  • algorithm comparison
  • requirements engineering
  • bug detection
  • performance analysis
  • code quality
  • predictive analytics

Participating Journals

Journal Name (abbrev.): Impact Factor / CiteScore / Launched Year / First Decision (median) / APC

AI (ai): 5.0 / 6.9 / 2020 / 19.2 days / CHF 1800
Algorithms (algorithms): 2.1 / 4.5 / 2008 / 19.2 days / CHF 1800
Applied Sciences (applsci): 2.5 / 5.5 / 2011 / 16 days / CHF 2400
Electronics (electronics): 2.6 / 6.1 / 2012 / 16.4 days / CHF 2400
Machine Learning and Knowledge Extraction (make): 6.0 / 9.9 / 2019 / 27 days / CHF 1800
Software (software): - / - / 2022 / 28.8 days / CHF 1000

Preprints.org is a multidisciplinary platform offering a preprint service designed to facilitate the early sharing of your research. It supports and empowers your research journey from the very beginning.

MDPI Topics is collaborating with Preprints.org and has established a direct connection between MDPI journals and the platform. Authors are encouraged to take advantage of this opportunity by posting their preprints at Preprints.org prior to publication:

  1. Share your research immediately: disseminate your ideas prior to publication and establish priority for your work.
  2. Safeguard your intellectual contribution: protect your ideas with a time-stamped preprint that serves as proof of your research timeline.
  3. Boost visibility and impact: increase the reach and influence of your research by making it accessible to a global audience.
  4. Gain early feedback: receive valuable input and insights from peers before submitting to a journal.
  5. Ensure broad indexing: preprints are indexed in Web of Science (Preprint Citation Index), Google Scholar, Crossref, SHARE, PrePubMed, Scilit, and Europe PMC.

Published Papers (28 papers)

18 pages, 312 KB  
Article
Investigating the Refactoring Capabilities of Small Open-Weight Language Models
by Tamás Márton, Balázs Szalontai, Balázs Pintér and Tibor Gregorics
Software 2026, 5(2), 19; https://doi.org/10.3390/software5020019 - 29 Apr 2026
Viewed by 145
Abstract
Refactoring is essential for developing maintainable software. Using Large Language Models in software engineering is widespread, but compared to well-established domains such as code generation, reliable refactoring is still relatively underexplored. In this paper, we perform a broad analysis of the refactoring capabilities of small open-weight language models (SLMs) by evaluating 12 models on 3453 Python programs. Our study focuses on the two defining aspects of refactoring: behavior preservation and code quality improvement. We evaluate these properties using unit tests and various code metrics. Across models ranging from 0.5B to 8B parameters, most models improve code quality. Larger models are more reliable, as they preserve behavior more consistently. Reasoning models often make more significant changes while refactoring. Allowing models to generate reasoning traces improves performance, but only for models larger than 4B. For smaller models, reasoning in fact reduces refactoring reliability. The difficulty of the underlying task affects refactoring performance, with more complex tasks associated with higher failure rates. Our results indicate that current open SLMs can support refactoring tasks, especially larger ones with reasoning capabilities, but they are best used with human oversight. Full article
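The behavior-preservation check described in this abstract can be illustrated with a minimal sketch. This is not the authors' evaluation harness; the function names and test cases below are invented for illustration only. The idea is simply that a model-proposed refactoring is accepted only if it produces the same outputs as the original on every unit test:

```python
# Hedged sketch: accept a refactoring only if it preserves behavior on
# a suite of test inputs. All names here are illustrative.

def original(xs):
    total = 0
    for x in xs:
        total = total + x
    return total

def refactored(xs):
    # candidate refactoring, e.g. proposed by a small language model
    return sum(xs)

def preserves_behavior(f, g, test_inputs):
    """True only if f and g agree on every test case."""
    return all(f(i) == g(i) for i in test_inputs)

cases = [[], [1], [1, 2, 3], [-5, 5]]
ok = preserves_behavior(original, refactored, cases)  # True for this pair
```

In the paper's setting, the same gate is applied at scale (3453 programs), with code metrics measuring quality improvement on the refactorings that pass.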
(This article belongs to the Topic Applications of NLP, AI, and ML in Software Engineering)

36 pages, 1713 KB  
Article
Software Unfairness Detection in Machine Learning-Based Systems: A Systematic Mapping Study
by Roa Alharbi and Noureddine Abbadeni
Software 2026, 5(2), 18; https://doi.org/10.3390/software5020018 - 27 Apr 2026
Viewed by 197
Abstract
Machine learning-based systems are increasingly deployed in high-stakes domains, such as healthcare, finance, law, and e-commerce, where their predictions directly influence critical decisions. Although these systems offer powerful data-driven support, they also introduce serious concerns related to fairness, bias, and discrimination. As a result, detecting and addressing unfairness in machine learning software has become a central research challenge. This study presents a systematic mapping of research on software unfairness detection in machine learning systems, with the aim of consolidating existing fairness definitions, identifying major problem types, examining testing approaches, reviewing commonly used datasets, and highlighting open research gaps. A structured search was conducted across five major digital libraries and additional sources, covering publications from 2010 to 2025. From 1805 initially identified records, 67 primary studies met the inclusion and quality assessment criteria. The findings show that research activity has grown significantly since 2019, reaching a peak in 2022. Most studies were published in conference proceedings, accounting for 52% of the primary studies, followed by journals and workshop proceedings, which accounted for 42% and 6% of the primary studies, respectively. The literature encompasses multiple research themes, with 36% of the primary studies focusing on the analysis of existing fairness methods, 22% addressing bias mitigation strategies, 30% investigating testing techniques, and 12% proposing or evaluating evaluation frameworks. Fairness testing was conducted across multiple testing levels, including unit, integration, and system testing. Integration-level testing was the most prevalent, accounting for approximately 37.9% of the studies, followed by system-level testing at 27.3% and unit-level testing at 12.1%. Additionally, 22.7% of the studies applied fairness testing across more than one testing level. Frequently used datasets included COMPAS, Adult Census Income, and German Credit. Widely adopted tools, such as IBM AI Fairness 360, Themis, and Aequitas, were also identified. Overall, the systematic mapping study (SMS) highlights the progress made in fairness research while emphasizing the need for stronger integration of fairness into practical machine learning development. Full article

26 pages, 833 KB  
Article
Design of a RAG-Based Customer Service Chatbot Enhanced with Knowledge Graph and GPT Evaluation: A Case Study in the Import Trade Industry
by Nien-Lin Hsueh and Wei-Che Lin
Software 2026, 5(2), 15; https://doi.org/10.3390/software5020015 - 2 Apr 2026
Viewed by 1140
Abstract
Amid the wave of digital transformation and customer service automation, traditional chatbots are increasingly challenged by their inability to handle unstructured data and complex queries. This issue is particularly critical in the import trade industry, where customer service representatives must respond promptly to diverse inquiries involving quality anomalies, order tracking, and product substitution. Existing rule-based or keyword-driven chatbots often fail to provide accurate responses, resulting in reduced customer satisfaction and increased operational burdens. This study proposes and implements a “Retrieval-Augmented Generation (RAG)-based Customer Service Chatbot,” integrating the RAG framework with a Neo4j-based knowledge graph, specifically tailored for the import trade domain. The system constructs a dedicated QA dataset, knowledge graph, and dynamic learning mechanism. It semantically vectorizes internal documents, meeting records, quality assurance procedures, and historical dialogues, establishing interrelated knowledge nodes to enhance the chatbot’s comprehension and response accuracy. The study also incorporates GPT-based response evaluation and a high-score caching strategy, enabling dynamic learning and knowledge enhancement. Experiments were conducted using 101 representative enterprise-level queries across six categories, reflecting real-world operational scenarios and inquiry needs. The results demonstrate that the combination of knowledge graphs and RAG technology effectively reduces AI hallucinations and improves response coverage and accuracy, thereby addressing complex problems in customer service applications. This paper not only presents a feasible AI implementation model for the import trading industry but also offers a practical architectural reference for domain-specific knowledge management in the import trade and allied sectors. Full article
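The "high-score caching strategy" mentioned in this abstract can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the class name, threshold, and scoring scale are all assumptions. The idea is that answers a GPT-based evaluator rated highly are cached, so repeated queries skip retrieval and generation entirely:

```python
# Hedged sketch of a high-score answer cache for a RAG chatbot.
# Threshold and scoring scale (0-10) are illustrative assumptions.

class HighScoreCache:
    def __init__(self, threshold=8.0):
        self.threshold = threshold
        self._cache = {}

    def get(self, query):
        # cache hit bypasses retrieval + generation for a repeated query
        return self._cache.get(query)

    def maybe_store(self, query, answer, score):
        # keep only answers the evaluator rated above the threshold
        if score >= self.threshold:
            self._cache[query] = answer

cache = HighScoreCache(threshold=8.0)
cache.maybe_store("Where is my order?", "It ships Friday.", score=9.1)
cache.maybe_store("Refund policy?", "Unsure.", score=4.0)  # not cached
```

Low-scored answers are deliberately not cached, so weak responses are regenerated (and re-evaluated) on the next occurrence, which is the "dynamic learning" aspect the abstract refers to.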

35 pages, 1625 KB  
Article
Dynamic Feature Selection for Canadian GDP Forecasting: Machine Learning with Google Trends and Official Data
by Shafiullah Qureshi, Ba M. Chu, Fanny S. Demers, Najib Khan and Ateeq ur Rehman Irshad
Mach. Learn. Knowl. Extr. 2026, 8(3), 66; https://doi.org/10.3390/make8030066 - 9 Mar 2026
Viewed by 437
Abstract
We forecast monthly Canadian real GDP growth using machine learning models trained on Official macroeconomic indicators and Google Trends (GT) data. Predictors are selected dynamically in each rolling window using PDC-SIS, with cross-validation-based tuning to support real-time forecasting and avoid data leakage. The evaluation is conducted on the latest-available (final-vintage) series and should be interpreted as a pseudo out-of-sample forecasting exercise rather than real-time vintage nowcasting. We evaluate GBM, XGBoost, LightGBM, CatBoost, and Random Forest against an ARIMA baseline. Official data deliver the strongest performance at short and medium horizons, while combining Official and GT data yields the clearest improvement at the longest horizon. With GT data alone, LightGBM is the only ML model maintaining positive out-of-sample R2 across all horizons. Diebold–Mariano tests corroborate these patterns: LightGBM dominates other ML models under GT-only predictors, whereas with Official and combined data, the horizon-specific best models significantly outperform ARIMA, with smaller differences among leading tree-based methods. Full article
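The rolling-window protocol that prevents data leakage can be made concrete with a small sketch. This does not reproduce PDC-SIS or the paper's pipeline; it only shows the windowing discipline: each training slice ends strictly before the target it predicts, so feature selection and tuning inside the window never see future data.

```python
# Hedged sketch of leakage-free rolling windows for forecasting.
# Window size and horizon are illustrative.

def rolling_windows(series, window, horizon=1):
    """Yield (train_slice, target_index) pairs with no look-ahead:
    the train slice covers [end - window, end), the target sits at
    end + horizon - 1, always after the slice."""
    for end in range(window, len(series) - horizon + 1):
        yield series[end - window:end], end + horizon - 1

data = list(range(10))
splits = list(rolling_windows(data, window=4))
# first split trains on indices 0..3 and predicts index 4
```

Per-window predictor selection (as with PDC-SIS in the paper) would then run inside each `train_slice` only, which is what keeps the exercise a valid pseudo out-of-sample evaluation.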

15 pages, 1413 KB  
Article
An Adaptive Multi-Source Retrieval-Augmented Generation Framework Integrating Query Complexity Awareness and Confidence-Aware Fusion
by Wenxuan Dong, Mingguang Diao and Meiqi Yang
Appl. Sci. 2026, 16(5), 2495; https://doi.org/10.3390/app16052495 - 5 Mar 2026
Viewed by 722
Abstract
Retrieval-Augmented Generation (RAG) has been observed to encounter challenges in heterogeneous query scenarios characterised by varying evidence requirements and reasoning depths. In order to address this limitation, the present paper puts forward a proposal for an Adaptive Multi-Source RAG framework (AMSRAG) that integrates query complexity awareness with confidence-aware fusion. The framework performs query complexity classification with a pretrained language model, calibrates the classification confidence to guide the dynamic scheduling of retrieval paths and the adjustment of fusion weights, and enables a controllable balance between answer quality and retrieval efficiency through hierarchical path selection and cross-source weighting. The experiments conducted on multiple open-domain question-answering datasets demonstrate that the query complexity classifier achieves an accuracy of 85.9% and a Macro-F1 score of 85.4%. These outcomes indicate the potential for the classifier to generate a reliable decision signal, which can subsequently be utilised to guide the process of adaptive retrieval and fusion. The proposed framework demonstrates a marked improvement in terms of both answer accuracy and retrieval relevance when compared to the fixed-pipeline RAG. In scenarios involving high-confidence queries, the system has been shown to effectively avoid redundant retrieval, thereby reducing the average number of retrievals. In instances of low-confidence complex queries, the system has been shown to enhance evidence coverage and completeness of answers through multi-source retrieval and confidence-weighted fusion. This study proposes a novel methodology for enhancing the adaptability and resource efficiency of RAG systems in response to heterogeneous query conditions. Full article

28 pages, 2010 KB  
Article
Prompt Engineering Strategies for Generating Medical Case-Based MCQs with Large Language Models: A Multi-Model Comparative Study
by Somaiya Al Shuraiqi, Adhari AlZaabi and Abdulrahman Aal Abdulsalam
Mach. Learn. Knowl. Extr. 2026, 8(2), 41; https://doi.org/10.3390/make8020041 - 10 Feb 2026
Cited by 1 | Viewed by 1650
Abstract
The use of large language models (LLMs) to automate the generation of medical case-based multiple-choice questions (MCQs) is increasing, but their accuracy, reliability, and educational validity are still not well understood. This study in a comparative framework examined nine LLMs with four different prompting methods to evaluate LLM-produced MCQs for clinical coherence and readiness for assessment. A uniform evaluation pipeline was constructed to examine automatic text-similarity measures using automated metrics (BLEU, ROUGE, and METEOR), structural and parsability measures, and operational effectiveness (latency, cost, quality-efficiency ratios). Human validation was performed on the best-performing model and prompt combination (OpenBioLLM-70B with Chain-of-Thought) focusing on the model prompt that demonstrated the best linguistic fidelity and clinically aligned reasoning. Two clinical experts independently reviewed 88 items using a five-domain rubric covering appropriateness, clarity, relevance, distractor quality, and cognitive level. Results indicated significant variation across models and prompting strategies, with Chain-of-Thought yielding the best overall performance in comparison to other strategies. The OpenBioLLM-70B model demonstrated the best overall balance of quality, parsability, and efficiency, achieving a prompt template quality score of 90.4, a consistency score of 88.8, and a response time of 3.28 s, with a quality-per-dollar value of 134.11. The expert rating confirmed clinical alignment, but there was consensus that distractor quality needed further improvements. These results provide evidence that LLMs under optimal prompting conditions can reliably support MCQ generation and provide large-scale, cost-effective support for medical assessment production. Full article

18 pages, 1796 KB  
Article
SpADE-BERT: Multilingual BERT-Based Model with Trigram-Sensitive Tokenization, Tuned for Depression Detection in Spanish Texts
by Abdiel Reyes-Vera, Magdalena Saldana-Perez, Marco Moreno-Ibarra and Juan Pablo Francisco Posadas-Durán
AI 2026, 7(2), 48; https://doi.org/10.3390/ai7020048 - 1 Feb 2026
Viewed by 675
Abstract
This article proposes an automated approach, based on artificial intelligence techniques, for detecting indicators of depression in texts written in Spanish. Among the main contributions is the construction of a new specialized corpus, supervised by mental health professionals and based on the Beck Depression Inventory. Text processing included linguistic techniques such as lemmatization, stopword removal, and structural transformation using trigrams. As part of the work, SpADE-BERT was designed, a model based on multilingual BERT with a tokenization scheme adapted to incorporate trigrams directly from the input phase. This modification allowed for more robust interaction between the local context and semantic representations. SpADE-BERT was evaluated against multiple approaches reported in the literature, which employ algorithms such as logistic regression, support vector machines, decision trees, and Random Forest with advanced configurations and specialized preprocessing. In all cases, our model showed consistently superior performance on metrics such as precision, recall, and F1-score. The results show that integrating deep language models with adapted tokenization strategies can significantly strengthen the automated identification of linguistic signals associated with depression in Spanish texts. Full article
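The trigram construction that SpADE-BERT injects at the input phase can be sketched minimally. The paper's actual tokenizer modification is more involved; this only shows how overlapping trigrams are derived from a token sequence, with an invented Spanish example:

```python
# Hedged sketch: overlapping word trigrams from a token list, as a
# stand-in for the trigram-sensitive input step. Example is illustrative.

def trigrams(tokens):
    """Return all contiguous 3-token windows, in order."""
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

toks = ["me", "siento", "muy", "triste"]
tris = trigrams(toks)
# [("me", "siento", "muy"), ("siento", "muy", "triste")]
```

Feeding such units alongside the ordinary subword tokens is what gives the model the "local context" signal the abstract credits for the more robust semantic representations.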

37 pages, 2905 KB  
Article
A Slide Annotation System with Multimodal Analysis for Video Presentation Review
by Amma Liesvarastranta Haz, Komang Candra Brata, Nobuo Funabiki, Htoo Htoo Sandi Kyaw, Evianita Dewi Fajrianti and Sritrusta Sukaridhoto
Algorithms 2026, 19(2), 110; https://doi.org/10.3390/a19020110 - 1 Feb 2026
Viewed by 935
Abstract
With the rapid growth of online presentations, there has been an increasing need for efficient review of recorded materials. In typical presentations, speakers verbally elaborate on each slide, providing details not captured in the slides themselves. Automatically extracting and embedding these verbal explanations at their corresponding slide locations can greatly enhance the review process for audiences. This paper presents a Slide Annotation System that employs a robust hybrid two-stage detector to identify slide boundaries, extracts slide text through Optical Character Recognition (OCR), transcribes narration, and employs a multimodal Large Language Model (LLM) to generate concise, context-aware annotations that are added to their corresponding slide locations. For evaluations, the technical performance was validated on five recorded presentations, while the user experience was assessed by 37 participants. The results showed that the system achieved a macro-average F1 score of 0.879 (SD=0.024, 95% CI[0.849,0.909]) for slide segmentation and 90.0% accuracy (95% CI[74.4%,96.5%]) for annotation alignment. Subjective evaluations revealed high annotation validity and usefulness as rated by presenters, and a high System Usability Scale (SUS) score of 80.5 (SD=6.7, 95% CI[78.3,82.7]). Qualitative feedback further confirmed that the system effectively streamlined the review process, enabling users to locate key information more efficiently than standard video playback. These findings demonstrate the strong potential of the proposed system as an effective automated annotation system. Full article
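The macro-average F1 score reported for slide segmentation weights every class equally, regardless of class frequency. A minimal sketch of the metric (not the paper's evaluation code; the per-class counts below are illustrative):

```python
# Hedged sketch of macro-averaged F1 from per-class (tp, fp, fn) counts.

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f1(per_class_counts):
    """Average per-class F1 with equal weight per class."""
    scores = [f1(*c) for c in per_class_counts]
    return sum(scores) / len(scores)

# illustrative counts for two classes, e.g. "boundary" vs "no boundary"
score = macro_f1([(90, 10, 10), (85, 15, 5)])
```

Macro averaging is the natural choice here because slide-boundary frames are far rarer than non-boundary frames; a micro average would be dominated by the majority class.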

18 pages, 321 KB  
Article
Instruction-Tuned Decoder-Only Large Language Models for Efficient Extreme Summarization on Consumer-Grade GPUs
by Attia Fathalla Elatiky, Ahmed M. Hamad, Heba Khaled and Mahmoud Fayez
Algorithms 2026, 19(2), 96; https://doi.org/10.3390/a19020096 - 25 Jan 2026
Cited by 1 | Viewed by 546
Abstract
Extreme summarization generates very short summaries, typically a single sentence, answering the question “What is the document about?”. Although large language models perform well in text generation, fine-tuning them for summarization often requires substantial computational resources that are unavailable to many researchers. In this study, we present an effective method for instruction-tuning open decoder-only large language models under limited GPU resources. The proposed approach combines parameter-efficient fine-tuning techniques, such as Low-Rank Adaptation (LoRA), with quantization to reduce memory requirements, enabling training on a single consumer-grade GPU. We fine-tuned a pre-trained decoder-only model on the XSum dataset using an instruction-following format. Experimental results demonstrate that the proposed decoder-only approach achieves competitive performance on the XSum dataset under strict GPU memory constraints. On the full test set, the proposed 2G–1R pipeline attains ROUGE-1/2/L F1 scores of 46.0/22.0/37.0 and a BERTScore F1 of 0.917, outperforming the individual generator models in lexical overlap and semantic similarity. Evaluation was conducted using traditional overlap-based metrics (ROUGE) and semantic metrics, including BERTScore and G-Eval. While remaining competitive in ROUGE compared to strong encoder–decoder baselines, the pipeline consistently produces summaries with higher semantic quality. These findings demonstrate that large decoder-only language models can be efficiently fine-tuned for extreme summarization on limited consumer-grade hardware without sacrificing output quality. Full article
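The memory saving behind LoRA, which the abstract leans on for consumer-grade training, is simple arithmetic: a rank-r update A·B replaces the full d_out×d_in weight delta, so trainable parameters drop from d_out·d_in to r·(d_out + d_in). A hedged back-of-the-envelope sketch (the dimension and rank are illustrative, not the paper's configuration):

```python
# Hedged sketch: trainable-parameter counts for a full update vs a
# rank-r LoRA update on one weight matrix. Numbers are illustrative.

def full_params(d_out, d_in):
    return d_out * d_in

def lora_params(d_out, d_in, r):
    # two low-rank factors: A is d_out x r, B is r x d_in
    return r * (d_out + d_in)

d = 4096               # a typical hidden size, as an assumption
saving = lora_params(d, d, r=8) / full_params(d, d)  # ~0.4% of the params
```

Combined with quantizing the frozen base weights, this is what lets a multi-billion-parameter decoder-only model be instruction-tuned on a single consumer GPU.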

32 pages, 122293 KB  
Article
Hybrid Negation: Enhancing Sentiment Analysis for Complex Sentences
by Miftahul Qorib and Paul Cotae
Appl. Sci. 2026, 16(2), 1000; https://doi.org/10.3390/app16021000 - 19 Jan 2026
Viewed by 592
Abstract
A wealth of valuable information is available on the Internet, and many individuals rely on mass media as their primary source of information. Various views, comments, expressions, and opinions on social networks have been a tremendous source of information. Harvesting free, resourceful information through social media makes text mining a powerful tool for analyzing public opinions on various issues across diverse social networks. Various research projects have implemented text sentiment analysis through machine and deep learning approaches. Social media text often expresses sentiment through complex syntax and negation (e.g., implicit and double negation and nested clauses), which many classifiers mishandle. We propose hybrid negation, a clause-aware approach that combines (i) explicit/implicit/double-negation rules, (ii) dependency-based scope detection, (iii) a TextBlob back-off for phrase polarity, and (iv) an MLP-learned clause-weighting module that aggregates clause-level scores. Across 156,539 tweets (three-class sentiment), we evaluate six negation strategies and 228 model configurations with and without SMOTE (applied strictly within training folds). Hybrid Negation achieves 98.582% accuracy, 98.196% precision, 98.189% recall, and 98.193% F1 with BERT, outperforming rule-only and antonym/synonym baselines. Ablations show each component contributes to the model’s performance, with dependency scope and double negations offering the largest gains. Per-class results, confidence intervals, and paired tests with multiple-comparison control confirm statistically significant improvements. We release code and preprocessing scripts to support reproducibility. Full article
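One ingredient of the hybrid approach, the double-negation rule, can be shown in a toy form. This is not the paper's code: real scope detection uses dependency parses, whereas here the "scope" is simply the whole clause, and the negator list is a small illustrative set.

```python
# Toy sketch of explicit/double negation handling: each negator in a
# clause flips polarity, so an even count cancels out. Illustrative only.

NEGATORS = {"not", "never", "no"}

def clause_polarity(tokens, base_polarity):
    """Flip base_polarity once per negator; double negation re-flips."""
    flips = sum(1 for t in tokens if t in NEGATORS)
    return base_polarity if flips % 2 == 0 else -base_polarity

# "it is not bad": negative base word, one negator -> positive clause
flipped = clause_polarity("it is not bad".split(), base_polarity=-1)
```

In the full system, such rule outputs are combined with dependency-scoped flips, a TextBlob back-off, and learned clause weights; the toy rule only illustrates why naive bag-of-words polarity fails on "not bad" and "not un-happy"-style constructions.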

16 pages, 2030 KB  
Article
Chinese Text Readability Assessment Based on the Integration of Visualized Part-of-Speech Information with Linguistic Features
by Chi-Yi Hsieh, Jing-Yan Lin, Chi-Wen Hsieh, Bo-Yuan Huang, Yi-Chi Huang and Yu-Xiang Chen
Algorithms 2025, 18(12), 777; https://doi.org/10.3390/a18120777 - 9 Dec 2025
Viewed by 1091
Abstract
The assessment of Chinese text readability plays a significant role in Chinese language education. Due to the intrinsic differences between alphabetic languages and Chinese character representations, readability assessment is more challenging given the language’s inherent complexity in vocabulary, syntax, and semantics. The article proposes a conceptual analogy between Chinese readability assessment and the rhythm and tempo patterns of music, in which the syntactic structure of a Chinese sentence is transformed into an image. The Chinese Knowledge and Information Processing Tagger (CkipTagger) tool developed by Academia Sinica (Taiwan) is utilized to decompose the Chinese text into a set of tokens. These tokens are then refined through a user-defined token pool to retain meaningful units. An image with part-of-speech (POS) information is generated by using the token-versus-syntax alignment. A discrete cosine transform (DCT) is then applied to extract the temporal characteristics of the text. Moreover, the study integrated four categories of linguistic features for the readability assessment: type–token ratio, average sentence length, total word count, and vocabulary difficulty level. Finally, these features were fed into the Support Vector Machine (SVM) network for classification. Furthermore, a bidirectional long short-term memory (Bi-LSTM) network is adopted for quantitative comparisons. In the experiments, a total of 774 Chinese texts aligned with the Taiwan Benchmarks for the Chinese Language were selected and graded by Chinese language experts, consisting of equal numbers of basic, intermediate, and advanced texts. The findings indicated that the proposed POS representation combined with the linguistic features works well in the SVM network, and its performance matches that of more complex architectures such as the Bi-LSTM network in Chinese readability assessment. Full article
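The DCT step can be sketched directly: once POS tags are mapped to a numeric sequence, the DCT-II coefficients capture its "rhythm" at different frequencies. This is an illustrative reconstruction, not the paper's feature extractor; the input signal is invented.

```python
import math

# Hedged sketch: DCT-II (unnormalized) over a numeric POS "rhythm" signal.
# Low-frequency coefficients summarize the sentence's tempo-like structure.

def dct_ii(signal):
    n = len(signal)
    return [sum(x * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i, x in enumerate(signal))
            for k in range(n)]

# a perfectly uniform POS sequence has no rhythm: only the DC term survives
coeffs = dct_ii([3.0, 3.0, 3.0, 3.0])
```

A varied POS sequence would spread energy into the higher coefficients, which is exactly the kind of structural variation the readability classifier is meant to pick up.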

30 pages, 2155 KB  
Article
Extreme Multi-Label Text Classification for Less-Represented Languages and Low-Resource Environments: Advances and Lessons Learned
by Nikola Ivačič, Blaž Škrlj, Boshko Koloski, Senja Pollak, Nada Lavrač and Matthew Purver
Mach. Learn. Knowl. Extr. 2025, 7(4), 142; https://doi.org/10.3390/make7040142 - 11 Nov 2025
Viewed by 1548
Abstract
Amid ongoing efforts to develop extremely large, multimodal models, there is increasing interest in efficient Small Language Models (SLMs) that can operate without reliance on large data-centre infrastructure. However, recent SLMs (e.g., LLaMA or Phi) with up to three billion parameters are predominantly trained on high-resource languages, such as English, which limits their applicability to industries that require robust NLP solutions for less-represented languages and low-resource settings, particularly those requiring low latency and adaptability to evolving label spaces. This paper examines a retrieval-based approach to multi-label text classification (MLC) for a media monitoring dataset, with a particular focus on less-represented languages, such as Slovene. This dataset presents an extreme MLC challenge, with instances labelled using up to twelve thousand categories. The proposed method, which combines retrieval with computationally efficient prediction, effectively addresses challenges related to multilinguality, resource constraints, and frequent label changes. We adopt a model-agnostic approach that does not rely on a specific model architecture or language selection. Our results demonstrate that techniques from the extreme multi-label text classification (XMC) domain outperform traditional Transformer-based encoder models, particularly in handling dynamic label spaces without requiring continuous fine-tuning. Additionally, we highlight the effectiveness of this approach in scenarios involving rare labels, where baseline models struggle with generalisation. Full article
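The retrieval-plus-voting idea can be sketched as below; plain bag-of-words cosine similarity stands in for the learned embeddings an XMC system would use, and all names and data are illustrative:

```python
# Minimal sketch of retrieval-based extreme multi-label classification:
# embed the query, retrieve the nearest labelled neighbours, and vote
# their labels. Bag-of-words cosine similarity is an illustrative
# stand-in for dense retrieval; corpus and labels are made up.
from collections import Counter
from math import sqrt

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_labels(query, corpus, k=2, min_votes=1):
    """corpus: list of (text, label_set). Returns labels voted by the top-k docs."""
    ranked = sorted(corpus, key=lambda d: cosine(bow(query), bow(d[0])), reverse=True)
    votes = Counter(label for _, labels in ranked[:k] for label in labels)
    return {label for label, v in votes.items() if v >= min_votes}

corpus = [
    ("election results parliament vote", {"politics"}),
    ("new government coalition talks", {"politics", "economy"}),
    ("football cup final score", {"sports"}),
]
labels = retrieve_labels("parliament coalition vote", corpus, k=2)
```

Because prediction is a nearest-neighbour lookup, new labels can be added by inserting documents into the corpus, with no fine-tuning step.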

29 pages, 4141 KB  
Article
Integrating Structured Time-Series Modeling and Ensemble Learning for Strategic Performance Forecasting
by Liqing Tang, Shuxin Wang, Jintian Ji, Siyuan Yin, Robail Yasrab and Chao Zhou
Algorithms 2025, 18(10), 611; https://doi.org/10.3390/a18100611 - 29 Sep 2025
Viewed by 1084
Abstract
Forecasting outcomes in high-stakes competitive spectacles like the Olympic Games, World Cups, and professional league championships has grown increasingly vital, directly impacting strategic planning, resource allocation, and performance optimization across a multitude of fields. However, accurate forecasting remains challenging due to complex, nonlinear interactions inherent in high-dimensional time-series data, further complicated by socioeconomic indicators, historical influences, and host-country advantages. In this study, we propose a comprehensive forecasting framework integrating structured time-series modeling with ensemble learning. We extract key structural features via two novel indices: the Advantage Index (measuring a competitor’s dominance in specific areas) and the Herfindahl Index (quantifying performance outcome concentration). We also evaluate host-country advantage using a Difference-in-Differences (DiD) approach. Leveraging these insights, we develop a dual-branch predictive model combining an Attention-augmented Long Short-Term Memory (Attention-LSTM) network and a Random Forest classifier. Attention-LSTM captures long-term dependencies and dynamic patterns in structured temporal data, while Random Forest handles predictions for unrecognized contenders, addressing zero-inflation issues. Extensive stability and comparative analyses demonstrate that our model outperforms traditional and state-of-the-art methods, exhibiting strong resilience to input perturbations, consistent performance across multiple runs, and appropriate sensitivity to key features. Our key contributions include the development of a novel integrated forecasting framework, the introduction of two innovative structural indices for competitive dynamics analysis, and the demonstration of robust predictive performance that bridges technical innovation with practical strategic application. Finally, we transform our modeling results into actionable strategic insights.
This translation is powered by interpretable feature importance rankings and stability analysis that rigorously validate the robustness of key predictors. These insights apply across multiple dimensions—encompassing advantage assessment, resource distribution, strategic simulation, and breakthrough potential identification—providing comprehensive decision support for strategic planners and policymakers navigating competitive environments. Full article
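The two structural indices can be illustrated as follows. The Herfindahl index is the standard sum of squared shares; the Advantage Index shown here is a simple share-over-mean ratio assumed for illustration (the paper's exact definition may differ):

```python
# Sketch of the two structural indices described in the abstract.
# herfindahl() is the standard concentration measure; advantage_index()
# is an illustrative stand-in, not the paper's exact formula.

def herfindahl(counts):
    """Concentration of outcomes: sum of squared shares, in (0, 1]."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts) if total else 0.0

def advantage_index(own_share, mean_share):
    """> 1 means the competitor is over-represented in this area (illustrative)."""
    return own_share / mean_share if mean_share else 0.0

h_equal = herfindahl([10, 10, 10, 10])  # outcomes spread evenly
h_conc = herfindahl([37, 1, 1, 1])      # outcomes concentrated on one competitor
```

Higher Herfindahl values indicate a discipline dominated by few competitors, which is the concentration signal the framework feeds into its predictive model.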

29 pages, 13142 KB  
Article
Automatic Complexity Analysis of UML Class Diagrams Using Visual Question Answering (VQA) Techniques
by Nimra Shehzadi, Javed Ferzund, Rubia Fatima and Adnan Riaz
Software 2025, 4(4), 22; https://doi.org/10.3390/software4040022 - 23 Sep 2025
Cited by 1 | Viewed by 3296
Abstract
Context: Modern software systems have become increasingly complex, making it difficult to interpret raw requirements and effectively utilize traditional tools for software design and analysis. Unified Modeling Language (UML) class diagrams are widely used to visualize and understand system architecture, but analyzing them manually, especially for large-scale systems, poses significant challenges. Objectives: This study aims to automate the analysis of UML class diagrams by assessing their complexity using a machine learning approach. The goal is to support software developers in identifying potential design issues early in the development process and to improve overall software quality. Methodology: To achieve this, this research introduces a Visual Question Answering (VQA)-based framework that integrates both computer vision and natural language processing. Vision Transformers (ViTs) are employed to extract global visual features from UML class diagrams, while the BERT language model processes natural language queries. By combining these two models, the system can accurately respond to questions related to software complexity, such as class coupling and inheritance depth. Results: The proposed method demonstrated strong performance in experimental trials. The ViT model achieved an accuracy of 0.8800, with both the F1 score and recall reaching 0.8985. These metrics highlight the effectiveness of the approach in automatically evaluating UML class diagrams. Conclusions: The findings confirm that advanced machine learning techniques can be successfully applied to automate software design analysis. This approach can help developers detect design flaws early and enhance software maintainability. Future work will explore advanced fusion strategies, novel data augmentation techniques, and lightweight model adaptations suitable for environments with limited computational resources. Full article
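The fusion step (ViT image features combined with BERT question features) can be sketched as a simple late-fusion head; dimensions and weights below are made up, whereas the real model learns them end-to-end:

```python
# Sketch of late fusion for VQA: concatenate a visual feature vector
# (stand-in for ViT output) with a question embedding (stand-in for BERT
# output) and apply a linear classification head. All dimensions and
# weights are illustrative.
import random

random.seed(0)

def linear_head(features, weights, bias):
    """One output logit per class: dot(features, class weights) + bias."""
    return [sum(f * w for f, w in zip(features, col)) + b
            for col, b in zip(weights, bias)]

vit_features = [0.2, 0.7, 0.1]        # stand-in for ViT image features
bert_features = [0.5, 0.3]            # stand-in for BERT question features
fused = vit_features + bert_features  # simple concatenation fusion

n_classes = 4  # e.g. answer options for a complexity question
weights = [[random.uniform(-1, 1) for _ in fused] for _ in range(n_classes)]
bias = [0.0] * n_classes
logits = linear_head(fused, weights, bias)
```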

17 pages, 3896 KB  
Article
HFGAD: Hierarchical Fine-Grained Attention Decoder for Gaze Estimation
by Shaojie Huang, Tianzhong Wang, Weiquan Liu, Yingchao Piao, Jinhe Su, Guorong Cai and Huilin Xu
Algorithms 2025, 18(9), 538; https://doi.org/10.3390/a18090538 - 24 Aug 2025
Cited by 1 | Viewed by 1154
Abstract
Gaze estimation is a cornerstone of applications such as human–computer interaction and behavioral analysis, e.g., for intelligent transport systems. Nevertheless, existing methods predominantly rely on coarse-grained features from deep layers of visual encoders, overlooking the critical role that fine-grained details from shallow layers play in gaze estimation. To address this gap, we propose a novel Hierarchical Fine-Grained Attention Decoder (HFGAD), a lightweight fine-grained decoder that emphasizes the importance of shallow-layer information in gaze estimation. Specifically, HFGAD integrates a fine-grained amplifier MSCSA that employs multi-scale spatial-channel attention to direct focus toward gaze-relevant regions, and also incorporates a shallow-to-deep fusion module SFM to facilitate interaction between coarse-grained and fine-grained information. Extensive experiments on three benchmark datasets demonstrate the superiority of HFGAD over existing methods, achieving a remarkable 1.13° improvement in gaze estimation accuracy for in-car scenarios. Full article

16 pages, 396 KB  
Article
Investigating Reproducibility Challenges in LLM Bugfixing on the HumanEvalFix Benchmark
by Balázs Szalontai, Balázs Márton, Balázs Pintér and Tibor Gregorics
Software 2025, 4(3), 17; https://doi.org/10.3390/software4030017 - 14 Jul 2025
Cited by 1 | Viewed by 5305
Abstract
Benchmark results for large language models often show inconsistencies across different studies. This paper investigates the challenges of reproducing these results in automatic bugfixing with LLMs on the HumanEvalFix benchmark. To determine the cause of the differing results in the literature, we attempted to reproduce a subset of them by evaluating 12 models in the DeepSeekCoder, CodeGemma, CodeLlama, and WizardCoder model families, in different sizes and tunings. A total of 35 unique results were reported for these models across studies, of which we successfully reproduced 12. We identified several relevant factors that influenced the results. The base models can be confused with their instruction-tuned variants, making their results better than expected. Incorrect prompt templates, insufficient generation length, and 4-bit quantization can all decrease benchmark performance. Using sampling instead of greedy decoding can increase the variance, especially with higher temperature values. We found that precision and 8-bit quantization have less influence on benchmark results. Full article
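The temperature effect noted above can be illustrated with a small sketch: higher temperature flattens the next-token distribution, so sampled runs diverge more, while greedy decoding (argmax) stays deterministic. The logits here are made up:

```python
# Sketch of why sampling temperature inflates benchmark variance:
# softmax at a higher temperature flattens the next-token distribution,
# whereas greedy decoding is deterministic. Logits are illustrative.
import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [3.0, 1.0, 0.5]
p_low = softmax(logits, temperature=0.5)   # sharper: near-greedy behaviour
p_high = softmax(logits, temperature=2.0)  # flatter: more run-to-run variance
greedy = logits.index(max(logits))         # greedy decoding: always token 0
```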

18 pages, 1871 KB  
Article
Interpretable Reinforcement Learning for Sequential Strategy Prediction in Language-Based Games
by Jun Zhao, Jintian Ji, Robail Yasrab, Shuxin Wang, Liang Yu and Lingzhen Zhao
Algorithms 2025, 18(7), 427; https://doi.org/10.3390/a18070427 - 11 Jul 2025
Cited by 1 | Viewed by 1460
Abstract
Accurate and interpretable prediction plays a vital role in natural language processing (NLP) tasks, particularly for enhancing user trust and model transparency. However, existing models often struggle with poor adaptability and limited interpretability when applied to dynamic language prediction tasks such as Wordle. To address these challenges, this study proposes an interpretable reinforcement learning framework based on an Enhanced Deep Deterministic Policy Gradient (Enhanced-DDPG) algorithm. By leveraging a custom simulation environment and integrating key linguistic features, namely word frequency, letter frequency, and repeated letter patterns (rep), the model dynamically predicts the number of attempts needed to solve Wordle puzzles. Experimental results demonstrate that Enhanced-DDPG outperforms traditional methods such as Random Forest Regression (RFR), XGBoost, LightGBM, METRA, and SQIRL in terms of both prediction accuracy (MSE = 0.0134, R2 = 0.8439) and robustness under noisy conditions. Furthermore, SHapley Additive exPlanations (SHAP) are employed to interpret the model’s decision process, revealing that repeated letter patterns significantly influence low-attempt predictions, while word and letter frequencies are more relevant for higher attempt scenarios. This research highlights the potential of combining interpretable artificial intelligence (I-AI) and reinforcement learning to develop robust, transparent, and high-performance NLP prediction systems for real-world applications. Full article
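The reported regression metrics (MSE and R²) follow from their standard definitions, sketched below; the data is illustrative, not the paper's:

```python
# Standard definitions of the two metrics reported in the abstract
# (MSE = 0.0134, R^2 = 0.8439). The toy data below is illustrative.

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 4.0, 5.0, 4.0]  # e.g. actual attempts needed
y_pred = [2.9, 4.2, 4.8, 4.1]  # e.g. model predictions
```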

19 pages, 914 KB  
Article
RU-OLD: A Comprehensive Analysis of Offensive Language Detection in Roman Urdu Using Hybrid Machine Learning, Deep Learning, and Transformer Models
by Muhammad Zain, Nisar Hussain, Amna Qasim, Gull Mehak, Fiaz Ahmad, Grigori Sidorov and Alexander Gelbukh
Algorithms 2025, 18(7), 396; https://doi.org/10.3390/a18070396 - 28 Jun 2025
Cited by 4 | Viewed by 2120
Abstract
The detection of abusive language in Roman Urdu is important for secure digital interaction. This work investigates machine learning (ML), deep learning (DL), and transformer-based methods for detecting offensive language in Roman Urdu comments collected from YouTube news channels. Features were extracted with TF-IDF and a Count Vectorizer over unigrams, bigrams, and trigrams. Of the ML models (Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM), and Naïve Bayes (NB)), the best performance was achieved by the SVM. Among the DL models, Bi-LSTM and CNN architectures were evaluated, with the CNN outperforming the Bi-LSTM. Moreover, transformer variants such as LLaMA 2 and ModernBERT (MBERT) were instantiated and fine-tuned with LoRA (Low-Rank Adaptation) for better efficiency. LoRA enables efficient fine-tuning of large language models (LLMs) at very low computational cost by training only small low-rank weight updates. According to the experimental results, LLaMA 2 with LoRA attained the highest F1-score of 96.58%, greatly exceeding the performance of other approaches. LoRA-optimized transformers thus capture detailed linguistic nuances well, lending themselves to Roman Urdu offensive language detection. The study compares the performance of conventional and contemporary NLP methods, highlighting the relevance of effective fine-tuning methods. Our findings pave the way for scalable and accurate automated moderation systems for online platforms supporting multiple languages. Full article
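The n-gram feature extraction step can be sketched as below, with a hand-rolled TF-IDF standing in for library vectorizers; the Roman Urdu tokens are illustrative:

```python
# Sketch of word n-gram extraction (unigrams to trigrams) and TF-IDF
# weighting, the feature step feeding the classifiers above. A minimal
# hand-rolled TF-IDF stands in for library vectorizers; tokens are
# illustrative Roman Urdu.
import math
from collections import Counter

def ngrams(tokens, n_max=3):
    """All contiguous n-grams for n = 1..n_max, joined with spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf(docs):
    """docs: list of token lists. Returns one {ngram: weight} dict per doc."""
    grams = [Counter(ngrams(d)) for d in docs]
    df = Counter(g for c in grams for g in c)  # document frequency per n-gram
    n = len(docs)
    return [{g: tf * math.log(n / df[g]) for g, tf in c.items()} for c in grams]

vecs = tfidf([["yeh", "bohat", "bura", "hai"], ["yeh", "acha", "hai"]])
```

N-grams shared by every document (like "yeh" here) get zero weight, so the classifier focuses on discriminative phrases.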

56 pages, 1008 KB  
Review
Machine Learning Techniques for Requirements Engineering: A Comprehensive Literature Review
by António Miguel Rosado da Cruz and Estrela Ferreira Cruz
Software 2025, 4(3), 14; https://doi.org/10.3390/software4030014 - 28 Jun 2025
Cited by 6 | Viewed by 8887
Abstract
Software requirements engineering is one of the most critical and time-consuming phases of the software-development process. The lack of communication with stakeholders and the use of natural language for communication lead to misunderstanding and misidentification of requirements or the creation of ambiguous requirements, which can jeopardize all subsequent steps of the software-development process and compromise the quality of the final software product. Natural Language Processing (NLP) is a long-established area of research; however, it is currently being strongly and positively reshaped by recent advances in Machine Learning (ML), namely the emergence of Deep Learning and, more recently, transformer models such as BERT and GPT. Software requirements engineering is also being strongly affected by this evolution of ML and other areas of Artificial Intelligence (AI). In this article, we conduct a systematic review of how AI, ML, and NLP are being used in the various stages of requirements engineering, including requirements elicitation, specification, classification, prioritization, requirements management, requirements traceability, etc. Furthermore, we identify which algorithms are most used in each of these stages, uncover challenges and open problems, and suggest future research directions. Full article

51 pages, 9787 KB  
Article
AI-Driven Predictive Maintenance for Workforce and Service Optimization in the Automotive Sector
by Şenda Yıldırım, Ahmet Deniz Yücekaya, Mustafa Hekimoğlu, Meltem Ucal, Mehmet Nafiz Aydin and İrem Kalafat
Appl. Sci. 2025, 15(11), 6282; https://doi.org/10.3390/app15116282 - 3 Jun 2025
Cited by 8 | Viewed by 10111
Abstract
Vehicle owners often use certified service centers throughout the warranty period, which usually extends for five years after purchase. Nonetheless, after this timeframe concludes, a large number of owners turn to unapproved service providers, mainly motivated by financial factors. This shift represents a substantial drop in income for automakers and their certified service networks. To tackle this issue, manufacturers utilize customer relationship management (CRM) strategies to enhance customer loyalty, usually depending on segmentation methods to pinpoint potential clients. However, conventional approaches frequently do not successfully forecast which clients are most likely to need or utilize maintenance services. This research introduces a machine learning-driven framework aimed at forecasting the probability of monthly maintenance attendance for customers by utilizing an extensive historical dataset that includes information about both customers and vehicles. Additionally, this predictive approach supports workforce planning and scheduling within after-sales service centers, aligning with AI-driven labor optimization frameworks such as those explored in the AI4LABOUR project. Four machine learning algorithms were assessed for their forecasting capabilities: Decision Tree, Random Forest, LightGBM (LGBM), and Extreme Gradient Boosting (XGBoost). Of these, XGBoost showed the greatest accuracy and reliability in recognizing high-probability customers. In this study, we propose a machine learning framework to predict vehicle maintenance visits for after-sales services, leading to significant operational improvements. Furthermore, the integration of AI-driven workforce allocation strategies, as studied within the AI4LABOUR (reshaping labor force participation with artificial intelligence) project, has contributed to more efficient service personnel deployment, reducing idle time and improving customer experience.
By implementing this approach, we achieved a 20% reduction in information delivery times during service operations. Additionally, survey completion times were reduced from 5 min to 4 min per survey, resulting in total time savings of approximately 5906 h by May 2024. The enhanced service appointment scheduling, combined with timely vehicle maintenance, also contributed to reducing potential accident risks. Moreover, the transition from a rule-based maintenance prediction system to a machine learning approach improved efficiency and accuracy. As a result of this transition, individual customer service visit rates increased by 30%, while corporate customer visits rose by 37%. This study contributes to ongoing research on AI-driven workforce planning and service optimization, particularly within the scope of the AI4LABOUR project. Full article
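The customer-ranking idea can be sketched as follows; a hand-written logistic scorer with made-up weights stands in for the trained XGBoost model, and the feature names are hypothetical:

```python
# Illustrative sketch of ranking customers by predicted probability of a
# maintenance visit. A hand-written logistic scorer with made-up weights
# stands in for the trained XGBoost classifier; features are hypothetical.
import math

def visit_probability(vehicle_age_years, months_since_last_visit, in_warranty):
    # hypothetical "learned" weights; in the study these come from training
    score = (-0.5
             + 0.1 * vehicle_age_years
             - 0.05 * months_since_last_visit
             + 1.2 * in_warranty)
    return 1 / (1 + math.exp(-score))  # logistic link to a probability

customers = {
    "A": visit_probability(2, 3, True),    # newer car, still under warranty
    "B": visit_probability(8, 18, False),  # older car, long service absence
}
```

Ranking by these probabilities is what lets CRM teams target the customers most likely to attend, and what drives the workforce-scheduling step.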

17 pages, 4840 KB  
Article
SMART Restaurant ReCommender: A Context-Aware Restaurant Recommendation Engine
by Ayesha Ubaid, Adrian Lie and Xiaojie Lin
AI 2025, 6(4), 64; https://doi.org/10.3390/ai6040064 - 25 Mar 2025
Cited by 2 | Viewed by 5492
Abstract
With the rise of e-commerce and web application usage, recommendation systems have become important to our daily tasks. They provide personalized suggestions to assist with any task under consideration. While various machine learning algorithms have been developed for recommendation tasks, existing systems still face limitations. This research focuses on advancing context-aware recommendation systems by leveraging the capabilities of Large Language Models (LLMs) in conjunction with real-time data. The research exploits the integration of existing real-time data APIs with LLMs to enhance the capabilities of the recommendation systems already integrated into smart societies. The experimental results demonstrate that the hybrid approach significantly improves the user experience and recommendation quality, ensuring more relevant and dynamic suggestions. Full article

16 pages, 511 KB  
Article
Hybrid Machine Learning and Deep Learning Approaches for Insult Detection in Roman Urdu Text
by Nisar Hussain, Amna Qasim, Gull Mehak, Olga Kolesnikova, Alexander Gelbukh and Grigori Sidorov
AI 2025, 6(2), 33; https://doi.org/10.3390/ai6020033 - 8 Feb 2025
Cited by 12 | Viewed by 3265
Abstract
This study introduces a new model for detecting insults in Roman Urdu, filling an important gap in natural language processing (NLP) for low-resource languages. The transliterated nature of Roman Urdu also poses specific challenges from a computational linguistics perspective, including non-standardized grammar, variation in spellings for the same word, and high levels of code-mixing with English, which together make automated insult detection for Roman Urdu a highly complex problem. To address these problems, we created a large-scale dataset of 46,045 labeled comments from social media websites such as Twitter, Facebook, and YouTube. This is the first dataset for insult detection in Roman Urdu annotated with insulting and non-insulting content. Advanced preprocessing methods such as text cleaning, text normalization, and tokenization are used in the study, as well as feature extraction using TF–IDF over unigrams (Uni), bigrams (Bi), trigrams (Tri), and their union (Uni+Bi+Trigram). We compared ten machine learning algorithms (logistic regression, support vector machines, random forest, gradient boosting, AdaBoost, and XGBoost) and three deep learning topologies (CNN, LSTM, and Bi-LSTM). Ensemble models proved to give the highest F1-scores, reaching 97.79%, 97.78%, and 95.25%, respectively, for AdaBoost, decision tree, TF–IDF, and Uni+Bi+Trigram configurations. Deep learning models also performed on par, with the CNN achieving an F1-score of 97.01%. Overall, the results highlight the utility of n-gram features and the combination of robust classifiers in detecting insults. This study advances NLP for Roman Urdu, and future research building on pre-trained transformers and hybrid approaches could overcome the limitations of existing systems and platforms. 
This study has practical implications, mainly for the construction of automated moderation tools to achieve safer online spaces, especially on South Asian social media websites. Full article

23 pages, 602 KB  
Article
The Scalable Detection and Resolution of Data Clumps Using a Modular Pipeline with ChatGPT
by Nils Baumgartner, Padma Iyenghar, Timo Schoemaker and Elke Pulvermüller
Software 2025, 4(1), 3; https://doi.org/10.3390/software4010003 - 2 Feb 2025
Cited by 1 | Viewed by 2788
Abstract
This paper explores a modular pipeline architecture that integrates ChatGPT, a Large Language Model (LLM), to automate the detection and refactoring of data clumps, a prevalent type of code smell that complicates software maintainability. Data clumps refer to groups of code elements that are often repeated together and should ideally be refactored to improve code quality. The pipeline leverages ChatGPT’s capabilities to understand context and generate structured outputs, making it suitable for addressing complex software refactoring tasks. Through systematic experimentation, our study not only addresses the research questions outlined but also demonstrates that the pipeline can accurately identify data clumps, particularly excelling in cases that require semantic understanding, where localized clumps are embedded within larger codebases. While the solution significantly enhances the refactoring workflow, facilitating the management of distributed clumps across multiple files, it also presents challenges such as occasional compiler errors and high computational costs. Feedback from developers underscores the usefulness of LLMs in software development but also highlights the essential role of human oversight in correcting inaccuracies. These findings demonstrate the pipeline’s potential to improve maintainability, offering a scalable and efficient solution for addressing code smells in real-world, large-scale projects. Full article
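A purely static flavour of data-clump detection can be sketched with Python's ast module; the paper's pipeline instead delegates detection and refactoring to ChatGPT, so this is only an illustration of the smell itself:

```python
# Sketch of static data-clump detection: flag groups of parameters that
# recur across function signatures. The paper's pipeline uses ChatGPT for
# detection and refactoring; this ast-based check only illustrates the smell.
import ast
from collections import Counter
from itertools import combinations

def parameter_clumps(source, min_size=3, min_repeats=2):
    """Parameter groups of >= min_size shared by >= min_repeats functions."""
    tree = ast.parse(source)
    signatures = [frozenset(a.arg for a in node.args.args)
                  for node in ast.walk(tree)
                  if isinstance(node, ast.FunctionDef)]
    counts = Counter()
    for sig in signatures:
        for size in range(min_size, len(sig) + 1):
            counts.update(frozenset(c) for c in combinations(sorted(sig), size))
    return {group for group, n in counts.items() if n >= min_repeats}

code = """
def draw(x, y, z, color): ...
def move(x, y, z, speed): ...
def label(text): ...
"""
clumps = parameter_clumps(code)  # (x, y, z) recurs: a candidate for a class
```

The typical refactoring, which the LLM is asked to generate, is to extract the recurring group into its own class (e.g. a Point) and pass that instead.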

25 pages, 641 KB  
Article
A Lexicon-Based Framework for Mining and Analysis of Arabic Comparative Sentences
by Alaa Hamed, Arabi Keshk and Anas Youssef
Algorithms 2025, 18(1), 44; https://doi.org/10.3390/a18010044 - 13 Jan 2025
Viewed by 1782
Abstract
People tend to share their opinions on social media daily. These texts need to be accurately mined for different purposes, such as improving services and products. Mining and analyzing Arabic text has been a big challenge due to the many complications inherent in the Arabic language. Although many research studies have already investigated the Arabic text sentiment analysis problem, this paper investigates a more specific research topic, Arabic comparative opinion mining, which has not been widely studied. This paper proposes a lexicon-based framework which includes a set of proposed algorithms for the mining and analysis of Arabic comparative sentences. The proposed framework comprises a set of contributions, including an Arabic comparative sentence keyword lexicon and a proposed algorithm for the identification of Arabic comparative sentences, followed by a second proposed algorithm for the classification of identified comparative sentences into different types. The framework also comprises a third proposed algorithm that extracts relations between entities in each of the identified comparative sentence types. Finally, two proposed algorithms extract the preferred entity in each sentence type. The framework was evaluated using three different Arabic language datasets. The evaluation metrics used include precision, recall, F-score, and accuracy. The average values of the evaluation metrics for the proposed sentence identification algorithm reached 97%. The average values for the proposed sentence type identification algorithm reached 96%. Finally, the average results showed 97% relation word extraction precision for the proposed relation extraction algorithm. Full article
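The keyword-lexicon identification step can be sketched as below; the Arabic keywords are a tiny illustrative subset, not the paper's lexicon:

```python
# Sketch of lexicon-based comparative-sentence identification: a sentence
# is flagged as comparative when it contains a keyword from the lexicon.
# The keywords below are a tiny illustrative subset, not the paper's lexicon.

COMPARATIVE_LEXICON = {"أفضل", "أسوأ", "أكثر", "أقل"}  # better, worse, more, less

def is_comparative(sentence, lexicon=COMPARATIVE_LEXICON):
    """True if any whitespace-separated token matches a lexicon keyword."""
    return any(token in lexicon for token in sentence.split())

flags = [is_comparative(s) for s in [
    "هذا الهاتف أفضل من ذلك",  # "this phone is better than that one"
    "الخدمة جيدة",              # "the service is good"
]]
```

The subsequent algorithms in the framework then classify the flagged sentences by type and extract the compared entities and the preferred one.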

21 pages, 16988 KB  
Article
An End-to-End Adaptive Method for Remaining Useful Life Prediction of Rolling Bearings Using Time–Frequency Image Features
by Liang Chen, Hao Wang, Linshu Meng, Zhenzhen Xu, Lin Xue and Mingfa Ren
Mach. Learn. Knowl. Extr. 2024, 6(4), 2892-2912; https://doi.org/10.3390/make6040138 - 16 Dec 2024
Cited by 4 | Viewed by 2532
Abstract
The deep learning model has attracted widespread attention in the field of rolling bearing remaining useful life (RUL) prediction due to its advantages of less reliance on prior knowledge, high accuracy, and strong generalization. However, many prediction models use very complicated artificial feature extraction and selection methods to build the original input features of the deep learning model and its health indicator. These approaches do not fully exploit the capabilities of deep learning models, as they continue to rely heavily on prior knowledge; the accuracy of their predictions largely hinges on the quality of the input features, and the generalization of manually crafted features remains uncertain. To address these challenges, this paper proposes an end-to-end prediction model for the remaining useful life of rolling bearings, divided into three modules. First, a short-time Fourier transform module is incorporated into the model to automatically obtain the time–frequency information of the signal. Then, the convolutional next (ConvNext) module, a simple and efficient pure convolutional neural network, is utilized to extract features from the spectrogram. Finally, short-term and long-term dependencies are captured by two parallel channels, a Transformer and a self-attention convolutional long short-term memory (SA-ConvLSTM) network, and the self-attention mechanism is employed for the adaptive prediction of the bearing’s remaining useful life. This method offers a high-performance solution for predicting the remaining useful life of bearings, with minimal reliance on manual feature engineering, stronger fitting capabilities, and broad applicability. Full article
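The short-time Fourier transform step can be sketched as follows, using a naive O(n²) DFT to stay dependency-free (real pipelines use an FFT):

```python
# Sketch of the short-time Fourier transform step: slice the signal into
# frames and take a DFT of each, yielding the time-frequency image that
# the ConvNext module consumes. A naive O(n^2) DFT keeps this stdlib-only;
# real pipelines use an FFT. Frame sizes and signal are illustrative.
import cmath
import math

def stft_magnitudes(signal, frame_len=8, hop=4):
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    spectrogram = []
    for frame in frames:
        # one-sided magnitude spectrum of this frame
        mags = [abs(sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(frame)))
                for k in range(frame_len // 2 + 1)]
        spectrogram.append(mags)
    return spectrogram  # rows = time frames, columns = frequency bins

# pure tone at 1/4 of the sampling rate: energy should land in bin k = 2
sig = [math.cos(2 * math.pi * 0.25 * t) for t in range(16)]
spec = stft_magnitudes(sig)
```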
(This article belongs to the Topic Applications of NLP, AI, and ML in Software Engineering)
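The pipeline above begins with a short-time Fourier transform that turns the raw vibration signal into a time–frequency image for the downstream CNN. As a rough sketch of that first stage only (the function name, window choice, and synthetic signal are ours, not the paper’s), a spectrogram can be computed with NumPy:

```python
import numpy as np

def stft_spectrogram(signal, win_len=64, hop=32):
    """Short-time Fourier transform: slide a Hann window over the
    1-D signal and take the FFT magnitude of each frame, yielding
    a (freq_bins, time_steps) time-frequency image."""
    window = np.hanning(win_len)
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = signal[start:start + win_len] * window
        # rfft keeps only the non-negative frequencies (real input)
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.stack(frames, axis=1)

# Synthetic bearing-like signal: a 50 Hz tone plus noise,
# sampled at 1024 Hz for one second.
rng = np.random.default_rng(0)
t = np.arange(1024) / 1024
sig = np.sin(2 * np.pi * 50 * t) + 0.1 * rng.standard_normal(1024)
spec = stft_spectrogram(sig)  # shape (33, 31)
```

The 50% window overlap (hop = win_len / 2) is a common default that trades time resolution against redundancy; in the paper’s end-to-end setting these parameters would be fixed inside the model rather than hand-tuned.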

12 pages, 293 KB  
Article
Detecting Online Sexism: Integrating Sentiment Analysis with Contextual Language Models
by Faiza Belbachir, Thomas Roustan and Assia Soukane
AI 2024, 5(4), 2852-2863; https://doi.org/10.3390/ai5040137 - 10 Dec 2024
Cited by 5 | Viewed by 3215
Abstract
In the digital era, social media platforms have seen a substantial increase in the volume of online comments. While these platforms provide users with a space to express their opinions, they also serve as fertile ground for the proliferation of hate speech. Hate comments can be categorized into various types, including discrimination, violence, racism, and sexism, all of which can negatively impact mental health. Among these, sexism poses a significant challenge because it takes many forms and is difficult to define, which makes detection complex. Nevertheless, detecting and preventing sexism on social networks remains a critical issue. Recent studies have leveraged language models such as transformers, known for their ability to capture the semantic nuances of textual data. In this study, we explore different transformer models, including multiple versions of RoBERTa (A Robustly Optimized BERT Pretraining Approach), to detect sexism. We hypothesize that combining a sentiment-focused language model with models specialized in sexism detection can improve overall performance. To test this hypothesis, we developed two approaches. The first used classical transformers trained on our dataset, while the second combined embeddings generated by transformers with a Long Short-Term Memory (LSTM) model for classification. The probabilistic outputs of each approach were aggregated through various voting strategies to enhance detection accuracy. The LSTM-with-embeddings approach improved the F1-score by 0.2% over the classical transformer approach. Furthermore, combining both approaches confirmed our hypothesis, yielding a 1.6% improvement in the F1-score in each case; our best-performing configurations achieved an F1-score above 0.84. Additionally, we constructed our own dataset to train and evaluate the models. Full article
(This article belongs to the Topic Applications of NLP, AI, and ML in Software Engineering)
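The abstract aggregates each model’s probabilistic outputs through voting strategies. A minimal sketch of two standard strategies, soft (averaged-probability) and hard (majority) voting, is below; the model scores and weights are illustrative, not the paper’s data:

```python
import numpy as np

def soft_vote(prob_sets, weights=None):
    """Soft voting: average per-model probabilities (optionally
    weighted), then threshold at 0.5 for the final label."""
    probs = np.asarray(prob_sets, dtype=float)  # (n_models, n_samples)
    if weights is None:
        weights = np.ones(probs.shape[0]) / probs.shape[0]
    avg = np.average(probs, axis=0, weights=weights)
    return (avg >= 0.5).astype(int), avg

def majority_vote(prob_sets):
    """Hard voting: each model casts a 0/1 label; majority wins."""
    votes = (np.asarray(prob_sets) >= 0.5).astype(int)
    return (votes.sum(axis=0) * 2 > votes.shape[0]).astype(int)

# Three hypothetical models scoring four comments
# (1 = sexist, 0 = not sexist).
p = [[0.9, 0.2, 0.6, 0.4],
     [0.8, 0.1, 0.4, 0.6],
     [0.7, 0.3, 0.7, 0.2]]
labels, avg = soft_vote(p)  # labels -> [1, 0, 1, 0]
```

Soft voting retains each model’s confidence, which is why ensembles of a sentiment model and sexism-specific models can outperform any single member, as the abstract reports.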

22 pages, 1085 KB  
Article
SevPredict: Exploring the Potential of Large Language Models in Software Maintenance
by Muhammad Ali Arshad, Adnan Riaz, Rubia Fatima and Affan Yasin
AI 2024, 5(4), 2739-2760; https://doi.org/10.3390/ai5040132 - 5 Dec 2024
Cited by 5 | Viewed by 3428
Abstract
The prioritization of bug reports based on severity is a crucial aspect of bug triaging, enabling a focus on more critical issues. Traditional methods for assessing bug severity range from manual inspection to the application of machine and deep learning techniques. However, manual evaluation tends to be resource-intensive and inefficient, while conventional learning models often lack contextual understanding. This study explores the effectiveness of large language models (LLMs) in predicting bug report severity. We propose SevPredict, a novel severity prediction approach built on GPT-2, an advanced LLM, and compare it against state-of-the-art models. The comparative analysis suggests that SevPredict outperforms the state-of-the-art approaches on standard performance evaluation metrics: it improves on the best-performing prior approach (BERT-SBR) with 1.72% higher accuracy, 2.18% higher precision, and 4.94% higher MCC. The improvements are even more substantial when compared to the approach by Ramay et al., with SevPredict demonstrating 10.66% higher accuracy, 10.39% higher precision, 3.29% higher recall, 7.19% higher F1-score, and a remarkable 41.27% higher MCC. These findings not only demonstrate the superiority of our GPT-2-based approach in predicting the severity of bug reports but also highlight its potential to significantly advance automated bug triaging and software maintenance. Full article
(This article belongs to the Topic Applications of NLP, AI, and ML in Software Engineering)
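The largest reported gains above are in MCC, a metric that stays informative even when severe bugs are rare in the dataset. For readers unfamiliar with it, here is a minimal NumPy implementation for binary labels (severe vs. non-severe); this only illustrates the metric, not SevPredict itself:

```python
import numpy as np

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels:
    +1 = perfect, 0 = no better than chance, -1 = total disagreement.
    Uses all four confusion-matrix cells, so it is robust to
    class imbalance (common in bug-severity data)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Because MCC penalizes both false positives and false negatives symmetrically, a 41.27% MCC gap (as reported against Ramay et al.) indicates a substantially better-balanced classifier, not just higher raw accuracy.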

19 pages, 2078 KB  
Article
Enhancing Medical Image Classification with Unified Model Agnostic Computation and Explainable AI
by Elie Neghawi and Yan Liu
AI 2024, 5(4), 2260-2278; https://doi.org/10.3390/ai5040111 - 5 Nov 2024
Cited by 4 | Viewed by 3539
Abstract
Background: Advances in medical image classification have recently benefited from general augmentation techniques. However, these methods often fall short in performance and interpretability. Objective: This paper applies the Unified Model Agnostic Computation (UMAC) framework to the medical domain to demonstrate its utility in this critical area. Methods: UMAC is a model-agnostic methodology designed to develop machine learning approaches that integrate seamlessly with various paradigms, including self-supervised, semi-supervised, and supervised learning. By unifying and standardizing computational models and algorithms, UMAC ensures adaptability across different data types and computational environments while incorporating state-of-the-art methodologies. In this study, we integrate UMAC as a plug-and-play module within convolutional neural network (CNN) and Transformer architectures, enabling the generation of high-quality representations even with minimal data. Results: Our experiments across nine diverse 2D medical image datasets show that UMAC consistently outperforms traditional data augmentation methods, achieving a 1.89% improvement in classification accuracy. Conclusions: By additionally incorporating explainable AI (XAI) techniques, we enhance model transparency and reliability in decision-making. This study highlights UMAC’s potential as a powerful tool for improving both the performance and interpretability of medical image classification models. Full article
(This article belongs to the Topic Applications of NLP, AI, and ML in Software Engineering)
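The abstract’s central claim is that UMAC plugs into CNNs and Transformers without modifying either side. A toy sketch of that model-agnostic composition pattern follows; it is purely illustrative of the plug-and-play idea (the wrapper, flip augmentation, and pooling backbone are ours, not the UMAC algorithm):

```python
import numpy as np

def plug_in_module(backbone, transform):
    """Compose any preprocessing/augmentation stage with any feature
    extractor. Because both sides are plain callables, the same
    module works with a CNN, a Transformer, or a toy function,
    which is the essence of a model-agnostic plug-in."""
    def wrapped(x):
        return backbone(transform(x))
    return wrapped

# Hypothetical stand-ins: a horizontal flip and a global mean-pool.
flip = lambda img: img[:, ::-1]
pool = lambda img: img.mean(axis=(0, 1))

model = plug_in_module(pool, flip)
img = np.arange(12.0).reshape(3, 4)
out = model(img)  # mean is flip-invariant, so out == img.mean()
```

In the actual framework, the plugged-in stage would learn or select augmentations rather than apply a fixed flip, but the interface boundary shown here is what makes a module reusable across architectures.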
