Review

Large Language Models for Early-Stage Software Project Estimation: A Systematic Mapping Study

1 Faculty of Computer Science and Information Technology, West Pomeranian University of Technology in Szczecin, ul. Żołnierska 49, 71-210 Szczecin, Poland
2 Department of Information Technology in Management, University of Szczecin, ul. Cukrowa 8, 71-004 Szczecin, Poland
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(24), 13099; https://doi.org/10.3390/app152413099
Submission received: 16 November 2025 / Revised: 4 December 2025 / Accepted: 10 December 2025 / Published: 12 December 2025
(This article belongs to the Special Issue Natural Language Processing in the Era of Artificial Intelligence)

Abstract

Accurate estimation of software project characteristics during the early stages of development remains a persistent challenge in software projects. Recent research suggests that large language models (LLMs) offer new opportunities to support such estimation tasks through their ability to interpret natural language specifications and extract contextual information from project descriptions. This paper presents a mapping study providing an overview of research on the applications of LLMs in early software project estimation. Thirty primary studies were systematically identified and categorised to examine estimation targets, the models used, reference and supportive techniques, and the applied evaluation measures. The results provide insights into the methodological considerations, limitations, and challenges associated with LLM-based estimation approaches. These findings inform both researchers and practitioners about the current state and potential of LLMs for supporting early-stage software project estimation.

1. Introduction

Accurate estimation of software project characteristics during the early stages of development is one of the most critical challenges in software engineering (SE) [1,2,3]. Reliable effort estimation has a direct impact on project planning, resource allocation, budget constraints, and ultimately, product quality and project success [4,5,6]. Despite decades of research and the evolution of software development methodologies from traditional waterfall approaches to contemporary agile frameworks, software projects still encounter significant cost overruns and schedule delays [7]. Cost overruns of around 200% [8,9] or even 400% [10] have been reported.
Software estimation covers several dimensions that are crucial to project success. Size estimation focuses on determining the volume of the software to be developed and can be measured in various units, including lines of code, function points, or use case points. Size estimates serve as a foundation for subsequent effort and cost calculations, making them essential input for project planning activities.
Effort estimation aims to predict the amount of work required to complete a software project. Traditionally, effort has been expressed in person-hours or person-months, representing the time developers need to complete specific tasks [11,12,13]. In modern agile development environments, effort is frequently measured using story points [14,15], which capture the complexity, uncertainty, and work involved in implementing user stories. Story points enable teams to estimate work without reference to specific time durations, thereby reflecting the iterative nature of agile methodologies. The relationship between story points and actual development time varies across teams and projects, making this metric both flexible and context-dependent [16].
Traditional estimation approaches relied on several established techniques. Parametric models, such as COCOMO (Constructive Cost Model), use mathematical equations based on project attributes and historical data to provide estimates [17]. Analogy-based methods compare current projects with similar completed projects to provide estimates [18]. Machine learning (ML) approaches emerged as alternatives to traditional estimation methods. Techniques such as linear regression models, decision trees, support vector machines, random forests, and neural networks (especially those involving deep learning—DL) learn patterns from historical project data to generate predictions [11,19,20]. These data-driven methods can capture non-linear relationships between project attributes and effort, potentially improving estimation accuracy.
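To make the parametric approach concrete (this is general background knowledge, not a result of the mapping study), the basic COCOMO model estimates development effort E in person-months from the expected size in thousands of lines of code (KLOC) via a calibrated power law:

```latex
E = a \cdot (\mathrm{KLOC})^{b}
```

where a and b are coefficients fitted to historical project data; for instance, the basic model uses a = 2.4 and b = 1.05 for small, familiar ("organic") projects.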
The emergence of large language models (LLMs) [21,22] opened a novel avenue for addressing the challenges of software project estimation. These models, trained on vast text corpora and featuring millions to billions of parameters, were originally developed to execute a diverse range of Natural Language Processing (NLP) tasks, such as text generation, text analysis, translation, sentiment analysis, and question answering [23]. Note that some authors, e.g., [24], consider only models with at least 10 billion parameters to be LLMs; however, we do not follow such a restrictive view. LLMs’ remarkable capabilities in understanding natural language, combined with their ability to handle various forms of input data, capture semantic relationships, identify patterns, generalise knowledge, and perform versatile tasks without prior task-specific training (see, e.g., [25]), make them well suited to software project estimation tasks. They can process a wide range of textual project artefacts, taking into consideration multiple and complex project features as well as historical data. Most importantly, they can extract information from textual requirements, user stories, or issue descriptions that are already available at the early stages of software projects [14,26]. Moreover, they can be fine-tuned for specific tasks, and prompt engineering can be applied to augment their behaviour (e.g., Chain-of-Thought can bolster LLMs’ reasoning abilities) [27]. All this makes them potentially transformative tools, especially for early estimation and prediction tasks.
Despite the theoretical promise of LLMs for early software project estimation, their application in this domain remains underexplored. Although several studies have investigated the use of pre-trained language models, such as BERT and GPT-2, for story point estimation, the research landscape remains fragmented. Individual studies often employ different estimation targets, models, and techniques, as well as evaluation measures and experimental designs.
Although several literature reviews covered a similar research field (see Section 2 for an overview and comparison), none of them investigated the primary studies on the applications of LLMs to early-stage estimation in software projects. Specifically, most earlier reviews covered various SE tasks or phases, making them inherently less focused on early-stage estimation. In particular, the most recent reviews covering primary studies published in 2024–2025 did not focus on this application area. Most reviews investigated primary studies published more than two years ago, when the available LLMs offered significantly fewer capabilities. Some of these reviews did not investigate the use of LLMs at all, focusing instead on ML and DL techniques. The absence of a comprehensive synthesis of existing empirical evidence on applications of LLMs to early-stage estimation represents a significant gap in the literature. This limits both researchers seeking to advance the field and practitioners considering the application of LLM-based estimation approaches.
To address this gap, this paper presents a systematic mapping study that provides a comprehensive overview of empirical research on the applications of LLMs for early software project estimation. To ensure methodological appropriateness for the study’s goal, we based our approach primarily on well-established guidelines for systematic mapping studies [28,29], highly cited systematic mappings in software engineering [30,31,32,33,34], and partially on earlier related work in the analysed area [3,35,36]. We rigorously performed the search and selection process to identify primary studies that investigated the use of LLMs to support estimation activities based on natural language specifications. Then, we carefully evaluated and categorised these studies to produce the publication dataset, reflecting the state of the art in the analysed area. Our study examines various dimensions describing the use of LLM-based estimation. Specifically, our study investigated primary studies published between 2022 and 2025, which focused on the estimation of effort, size, user story quality, and project productivity. These studies primarily used textual information, such as user stories, requirements, use cases, or issue descriptions, as inputs to various BERT or GPT models. Most often, these studies evaluated the accuracy of estimation using regression-based measures, such as the mean absolute error (MAE). However, several studies also employed various measures for classification tasks, including accuracy, F1 score, and recall. Hence, this study provides researchers and practitioners with a consolidated body of knowledge on LLM-based approaches to early-stage software project estimation.
We recall the critical difference between similar types of literature reviews: systematic mapping studies are primarily concerned with structuring a research area, whereas systematic reviews aim at synthesising evidence, also considering the strength of evidence [28]. As this is a systematic mapping study and due to the identified heterogeneity across primary studies, we did not directly compare the results reported in the primary studies. Consequently, we did not make any claims that one approach is better than another based on the estimation accuracy or any other empirical criteria. Based on the counts of primary studies, our study primarily presents descriptive statistics characterising several dimensions of the research landscape, as well as patterns and relationships that demonstrate which combinations tend to co-occur. We reported the diversity of approaches without claiming direct comparability of their results.
This mapping study explored the following research questions (RQs):
  • RQ1. In which publication venues was research on LLM-based early software project estimation disseminated?
The publication landscape provides insights into the maturity and visibility of research in this domain. By identifying the venues, we can assess the level of interest across different research communities. This approach also reveals potential gaps in dissemination strategies and opportunities for cross-disciplinary engagement.
  • RQ2. Which countries and research institutions contributed to research on LLM-based early software project estimation?
Mapping the geographical and institutional distribution of research contributions enables the identification of leading research groups and potential regional variations in research focus or approach. This information is valuable for facilitating international collaboration and identifying centres of expertise.
  • RQ3. What has been the scholarly impact of published research on LLM-based early software project estimation?
Assessing citations indicates the influence and adoption of obtained findings within the broader research community. By examining studies most frequently cited, we can identify those with the most substantial impact on the community.
  • RQ4. What is the thematic scope of research on LLM-based early software project estimation, as reflected in author keywords?
Analysing author-defined keywords reveals the conceptual landscape and terminology describing this research domain. Identifying frequently occurring keywords and their relationships helps characterise the main themes and conceptual boundaries of the field.
  • RQ5. What types of primary studies investigated the use of LLMs for early software project estimation?
Understanding the nature and distribution of study types is essential for assessing the maturity and empirical foundation of research in this domain. We classify studies according to their contribution as conceptual, new (with adaptation and composition), empirical evaluation, and tool. This classification enables us to determine whether the field is dominated by conceptual proposals or empirically validated approaches, and whether research is focused on developing entirely new techniques or refining and combining existing methods.
  • RQ6. Which estimation targets were addressed in research on LLM-based early software project estimation?
Software estimation covers several dimensions, including effort, duration, size, and quality. By systematically identifying the estimation targets investigated in primary studies and the units of measurement employed, we can characterise the scope and diversity of estimation objectives addressed by LLM-based approaches. This analysis examined whether specific estimation targets received disproportionate attention and whether there were underexplored areas where LLMs could be beneficial.
  • RQ7. Which LLMs and LLM-based approaches, methods or tools have been applied or proposed for early software project estimation tasks?
This research question focuses on the technical landscape of LLM applications in estimation. We identified the specific LLM architectures and model families employed (e.g., BERT or GPT), as well as the supportive techniques (prompt engineering and fine-tuning), and the reference techniques (i.e., those against which LLM-based approaches were evaluated). Understanding the diversity of technical approaches provides insights into current best practices. This analysis also highlights the extent to which researchers explored the full range of available LLM capabilities or whether research has concentrated on a limited set of techniques.
  • RQ8. Which artefacts were used as inputs for LLM-based early software project estimation?
Investigating the types of textual inputs for LLMs (e.g., user stories, requirements specifications, use cases, issue descriptions) enables the assessment of the practical applicability of proposed approaches. This analysis examined whether LLM-based methods were used with realistic, naturally occurring software artefacts or relied on artificially structured or idealised inputs.
  • RQ9. Which evaluation measures were used to assess the performance of LLM-based early software project estimation approaches?
Rigorous empirical evaluation is fundamental to establishing the effectiveness of estimation approaches. Systematic identification of employed evaluation measures enables the assessment of consistency and comparability of evaluation practices across studies. This analysis highlights whether the community has converged on standard evaluation protocols or whether varying measures impede cross-study comparisons.
  • RQ10. What relationships exist between key dimensions identified in this mapping study?
Beyond examining dimensions in isolation, understanding the relationships between the dimensions investigated in the above RQs is crucial for identifying significant patterns and gaps.
This study makes the following contributions to the field of software engineering and project estimation using LLMs:
  • We provide the first systematic mapping of research on LLM-based approaches for early software project estimation, offering a structured and comprehensive overview of the current state of knowledge. This synthesis consolidates findings from diverse studies, enabling researchers and practitioners to quickly understand the current landscape of the field.
  • By analysing publication patterns, thematic scope, methodological approaches, and performance characteristics, we identify emerging trends and underexplored areas in the application of LLMs to early-stage software estimation. This analysis provides a foundation for prioritising future research efforts and highlights opportunities for novel contributions.
  • The detailed examination of the LLM architectures, reference and supportive techniques, input artefacts, and evaluation measures employed across studies provides actionable insights for researchers designing new studies and practitioners evaluating the suitability of LLM-based estimation approaches for their contexts.
The rest of this paper is organised as follows. Section 2 reviews related work, i.e., earlier literature reviews with a similar scope. Section 3 describes the methodology employed in this study, including procedures for publication search, data preparation, and analysis of publications. It also presents the main data characteristics. Section 4 presents the results of this mapping study, organised by research question. Section 5 discusses the findings, their implications for research and practice, the identified limitations, and the open challenges. Section 6 considers threats to validity. Finally, Section 7 draws conclusions and outlines directions for future work.

2. Related Work

Recent years have witnessed a significant growth in the number of literature reviews and mapping studies investigating the use of AI methods, including LLMs, in various software engineering areas. Table 1 presents a comparative overview of these studies.
Sofian et al. [36] conducted a systematic mapping study of 60 papers investigating various AI techniques in SE. Specifically, they classified the analysed papers into eight software development phases: planning, requirements engineering, design, system development, testing, deployment, training, and maintenance. The study considered the following classification of techniques: ML, HA (heuristic algorithm), DL, DD (data-driven), and various combinations of these. The study provided a mapping of papers to the SE phase and applied AI technique.
Watson et al. [37] analysed 128 studies of deep learning applications in SE. This study covered 22 SE areas, with the majority of papers focusing on programme synthesis, code comprehension, bug-fixing process, clone detection, and source code generation. It did not investigate early-stage project estimation, and the closest area was software reliability/defect prediction. The most frequently used techniques were Recurrent Neural Networks (RNNs), Encoder–Decoder Models, Convolutional Neural Networks (CNNs), and Feed-Forward Neural Networks.
Yang et al. [38] performed a literature survey of 142 studies on deep learning applications across SE tasks. The most frequently used techniques were CNN, LSTM, RNN, and FNN. This study covered the papers on the following SE activities: design, implementation, testing and debugging, maintenance, and management. Within the last one, there were six papers on project effort estimation. This survey provided a taxonomy of DL techniques and their applications, together with model architectures, training strategies, and evaluation methodologies.
Fan et al. [39] surveyed studies on LLMs for software engineering tasks. Their work provided a comprehensive taxonomy of LLM applications across software development activities, including requirements engineering and design; code generation and completion; testing; maintenance, evolution and deployment; and document generation, as well as the following research domains: software analytics and repository, human–computer interaction, SE process, and SE education. The study highlighted the need for techniques to eliminate hallucinations and for hybrid techniques combining traditional SE methods with LLMs.
Wang et al. [40] conducted one of the most extensive systematic literature reviews, analysing 1428 ML and DL studies published between 2009 and 2020. Their work systematically categorised 77 distinct SE tasks across seven major activities: requirements engineering, design and modelling, implementation, testing, defect analysis, maintenance and evolution, and project management. The study revealed that defect prediction and software maintenance tasks dominate the research landscape, accounting for over half of all publications. The study investigated the following areas of early-stage estimation: software effort/cost estimation (67 papers), software schedule estimation (16 papers), project outcome prediction (5 papers), and software size estimation (1 paper). The study analysed differences in data preprocessing, model training, and evaluation, and identified factors influencing model choice and reproducibility.
Hou et al. [23] performed an SLR providing a comprehensive overview of LLM applications in SE. Their research systematically categorised applications to 85 SE tasks across the entire software development lifecycle, from requirements elicitation to deployment and maintenance. Despite a high number of primary studies investigated (395), only two were related to effort estimation. The review examined the types of LLMs used, data collection and preprocessing practices, and optimisation strategies such as fine-tuning and prompt engineering. It also identified major challenges and proposed a research roadmap outlining future directions for improving the reliability and applicability of LLM-based approaches in software engineering.
Marques et al. [41] reviewed 22 studies on ChatGPT in the context of requirements engineering. Their work examined how LLMs were applied to requirements elicitation, specification, validation, and management tasks. The review demonstrated that LLMs can accelerate requirements documentation, improve stakeholder communication, and support the detection of inconsistencies in specifications. Key challenges persist in various biases related to training data, user interaction, training algorithms, contextual information, as well as hallucinations, explainability and transparency, susceptibility to attacks, and other factors.
Rivera Ibarra et al. [3] focused specifically on early estimation in agile software development projects through a systematic mapping study of 18 papers. Their work examined three estimation targets: size, cost, and effort. The analysed papers were classified according to the approaches, i.e., data- or expert-driven, and groups of techniques, i.e., algorithmic, ML, and non-algorithmic. The authors also identified the predictors, i.e., inputs to the estimation technique, as well as dataset characteristics. Finally, they highlighted the need for improved estimation support tools and future research on hybrid methods, as well as comprehensive, standardised project datasets.
Wang et al. [42] conducted a survey of 102 studies on the use of LLMs in software testing. Their work systematically examined LLM applications across the generation of test cases, oracles, and inputs, as well as bug analysis, debugging, and program repair. The authors analysed how LLMs were applied to software testing tasks, identified commonly used models, prompt engineering strategies, input artefacts, and other supportive techniques.
Zhang et al. [43] surveyed 112 studies on automated program repair (APR) involving ML/DL techniques. The study presented the typical workflow of learning-based APR (covering fault localisation, patch generation, ranking, validation, and correctness) and analysed datasets, evaluation metrics, and empirical studies. The authors also discussed industrial applications, open science issues, and emerging use of pre-trained models, concluding with open research challenges and practical guidelines for future work.
Alturayeif et al. [44] analysed 59 studies on ML/DL approaches, including LLMs, for automated software traceability. Their research mapped the evolution from traditional information retrieval techniques to sophisticated deep learning and transfer learning methods. The study investigated 174 datasets, the most common artefact pairs, and their roles across SDLC phases, including requirements analysis and implementation. The review examined classification and ranking prediction methods, as well as supervised, semi-supervised, and unsupervised learning approaches, and highlighted the growing adoption of deep learning and LLMs for traceability tasks. Key challenges include data scarcity, lack of benchmark datasets, and limited real-world validation. The authors also proposed a unified ML-based traceability framework involving transfer learning to guide future research and tool development.
Czarnacka-Chrobot [47] conducted a systematic review of 39 studies (2010–2025) on the use of AI in software functional size measurement (FSM). The study examined the use of techniques such as ML, NLP, DL, and fuzzy logic to automate FSM processes traditionally performed according to IFPUG and COSMIC methods. Although the major area of investigation was software size, the study also partially covered the estimation of effort, cost and duration. NLP and conventional ML were the dominant approaches. The review highlighted growing interest in hybrid models and identified gaps related to benchmark datasets and intelligent project comparison mechanisms.
Husein et al. [45] conducted a systematic literature review on the use of LLMs for code completion, analysing 23 primary studies. The analysis involved several dimensions, including granularity levels, model architectures, training methods, and evaluation metrics. The results showed that LLMs substantially improve code completion performance across programming languages and contexts, particularly through their ability to predict relevant code snippets from partial input.
Khalil et al. [35] conducted an SLR combining bibliometric and qualitative analyses to examine the use of ML and AI techniques, including LLMs, in project management. The study covered the following project knowledge areas: communications, cost, integration, quality, resource, risk, time, procurement, scope, and stakeholder management. The identified major thematic clusters highlight the increasing influence of AI in improving project planning, decision-making, and resource management. The research emphasised the importance of robust data governance frameworks and comprehensive training programmes to facilitate the adoption of AI in project management practices. With regard to estimation, the study only briefly mentioned that advanced ML algorithms outperformed traditional methods, addressing biases such as over-optimism in cost and duration predictions.
Syahputri et al. [46] conducted an SLR of 42 studies examining prompt engineering paradigms in SE. Their research identified four prominent approaches: manual prompt crafting, retrieval-augmented generation (RAG), chain-of-thought (CoT) prompting, and soft prompt tuning with automated prompt generation. The study covered a wide range of SE areas, from requirements to maintenance. Interdisciplinary collaboration among experts in AI, machine learning, and software engineering was identified as a key driver of innovation. To address several identified research gaps, the authors proposed a modular framework integrating human-in-the-loop design, automated optimisation, and version control mechanisms.
Analysis across these systematic reviews reveals several key observations. First, a clear evolution is evident from conventional ML techniques toward deep learning and, most recently, large language model-based approaches. It reflects both technological advancement and the increasing availability of large-scale code repositories and datasets. Second, while model performance has improved substantially, practical industrial adoption remains limited due to concerns about reliability, interpretability, and integration with existing development workflows. Third, evaluation methodologies vary considerably across studies, with no consensus on appropriate metrics for different SE tasks, limiting comparative assessment.
The increasing attention to prompt engineering and few-shot learning approaches indicates a shift toward more flexible and adaptable AI systems for specific project contexts, without the need for extensive retraining.
The earlier literature reviews investigated here demonstrated only limited overlap with our present study. Most notably, several of them were published a few years ago, when the landscape of available LLMs was vastly different from the present one [36,37,38,40]. Second, only very few studies covered the scope of early-stage project estimation (e.g., size or effort). Conversely, studies focused on such estimation targets covered the use of older approaches and techniques. Hence, this study fills a gap that was not explored in earlier literature reviews.

3. Materials and Methods

3.1. Publication Search

We performed the search for relevant papers on 15th October 2025 using the following publication databases:
  • ACM Digital Library—The ACM Guide to Computing Literature,
  • IEEE Xplore,
  • Scopus,
  • Springer,
  • Web of Science Core Collection.
For the sake of brevity, throughout the paper, we refer to these databases as ACM, IEEE, Scopus, Springer, and WoS, respectively. Each database provides different options for query formulation and search field selection. For example, they offer a general search that does not specify the bibliographic fields for the query, and an advanced search that requires providing these fields. We explored several variants with each database to obtain the most relevant publications. For IEEE, Springer, and WoS, the most effective approach turned out to be a general search, while for ACM and Scopus, it was the advanced search. Table 2 summarises the fields used and the corresponding raw publication counts after language filtering.
Initially, we also considered using the ScienceDirect database. However, due to its limitations, specifically a lack of support for wildcards and a maximum of eight Boolean connectors per field in search queries, we were unable to construct a sensible query covering the scope of the study. Hence, we decided not to use it.
Our search strategy followed the PICOC framework (Population, Intervention, Comparison, Outcome, Context) [29] to ensure conceptual completeness and consistency of the query. Specifically, the Population (P) was defined as software development activities, represented in the query by the term “software”, indicating a focus on software projects and related development activities. The Intervention (I) corresponds to the use of large language models (LLMs) and related generative AI technologies, expressed by terms such as “LLM*”, “large language model*”, “GPT”, “ChatGPT”, and “generative AI”. The Outcome (O) refers to the estimation or prediction of key software attributes, formulated through general verbs (“estimat*”, “predict*”, “forecast*”) and a wide set of measurable targets considered at the early stage of software development, including effort, cost, productivity, and size. The Context (C2) refers to the software engineering domain in which LLM-based estimation or prediction approaches were applied. There was no need to add another keyword to delimit the context, as the keyword used to denote the population (“software”) also effectively covers the context. We intentionally omitted the Comparison (C1) element, as the study primarily aims to map and characterise LLM-based approaches rather than to compare them to other methods (although we do report the comparisons performed in the analysed papers as part of the response to RQ5). Appendix A provides the search queries used with specific publication databases.
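The exact per-database queries are given in Appendix A; the sketch below merely illustrates, in Python, how the PICOC elements listed above combine into a generic Boolean query. The grouping and syntax are schematic and differ across databases.

```python
# Schematic reconstruction of the PICOC-based query (illustrative only;
# the exact per-database queries are given in Appendix A of the paper).
population = ["software"]
intervention = ["LLM*", "large language model*", "GPT", "ChatGPT", "generative AI"]
outcome_verbs = ["estimat*", "predict*", "forecast*"]
outcome_targets = ["effort", "cost", "productivity", "size"]

def or_group(terms):
    """Join terms into a parenthesised OR group, quoting multi-word phrases."""
    return "(" + " OR ".join(f'"{t}"' if " " in t else t for t in terms) + ")"

query = " AND ".join([
    or_group(population),       # P: software projects and development activities
    or_group(intervention),     # I: LLMs and related generative AI
    or_group(outcome_verbs),    # O: estimation/prediction verbs
    or_group(outcome_targets),  # O: early-stage estimation targets
])
print(query)
```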

3.2. Data Preparation

After gathering search results from the databases, we integrated them into a uniform dataset. We then performed a deduplication of the results as follows: we alphabetically sorted and grouped publications by title. We verified that particular groups actually contained identical entries and that no similar entries with minor variations, such as differences in letter capitalisation, punctuation or special characters, appeared outside these groups. We then removed repeated entries within groups while preserving information on all source databases from which such a paper was retrieved. This manual approach ensured the correct handling of duplicate entries.
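The deduplication itself was performed manually. Purely as an illustration of the grouping logic described above, a minimal sketch (with hypothetical record fields "title" and "source") could look as follows:

```python
import re
from collections import defaultdict

def normalise_title(title: str) -> str:
    """Lower-case and strip punctuation/whitespace variants so that
    near-identical titles fall into the same group."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def deduplicate(records):
    """records: list of dicts with 'title' and 'source' keys.
    Returns one entry per title, preserving all source databases."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalise_title(rec["title"])].append(rec)
    merged = []
    for recs in groups.values():
        entry = dict(recs[0])                               # keep first record's fields
        entry["sources"] = sorted({r["source"] for r in recs})  # union of databases
        merged.append(entry)
    return merged
```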
Then we performed additional manual filtering according to the set of inclusion and exclusion criteria. Specifically, in the final dataset, we included publications if they met all the following inclusion criteria:
  • The publication reported the use of LLMs for early software project estimation.
  • The target for estimation involved important project aspects from a management perspective, e.g., effort, cost, duration, productivity, or project size.
  • The publication presented research-based content, e.g., empirical study, conceptual framework, method, tool, experiment.
  • The publication reported sufficient details required for analysis within the scope of the research questions.
Similarly, we excluded the publication if any of the following applied:
  • The publication was not of sufficient length, e.g., abstract-only, poster, presentation summary.
  • The publication was not final, e.g., preprint.
  • The publication was not peer-reviewed, e.g., editorial, tutorial, keynote, technical report.
  • The publication was not written in English.
We performed this filtering in the following steps: first, both authors independently scanned the main bibliographic information, i.e., title, abstract, and keywords, for each publication in the dataset, and recommended whether to retain or remove each publication based on the formulated inclusion and exclusion criteria. In case of disagreement, both authors shared comments and reached a unanimous decision regarding each publication. This resulted in a reduced dataset. Then, we repeated this process of applying inclusion and exclusion criteria, this time to the full texts of publications. In this round, the first author provided recommendations, the second author verified them, and in the event of initial disagreement, both authors discussed a given paper to reach a consensus. Table 3 provides the publication counts remaining after each stage of data preparation.
After full-text screening, we also conducted a snowballing analysis to identify additional relevant publications. Specifically, we analysed the references in the papers retained in our database to determine whether they were suitable for our analysis and met the inclusion and exclusion criteria. Similarly, we analysed papers reported in search databases that cited those already present in our database. Moreover, given our prior research in software project estimation, we were already aware of other potentially relevant papers that we had read prior to conducting this study. Hence, we manually searched for these publications and evaluated those that we found for possible inclusion in the analysis. Snowballing and manual search delivered an additional 8 publications. Consequently, the final dataset for analysis consisted of 30 publications.
For the final set of publications, on 3rd November 2025, we manually gathered citation counts reported in each publication database. Although some databases reported these counts, several publications were not retrieved during the search process from a particular database, despite being indexed there. Additionally, we identified some publications through snowballing and a manual search. Hence, for a fair comparison, we gathered citation counts for all analysed publications on the same day.
We performed data preprocessing by removing several inconsistencies. Most importantly, we adjusted the keywords to ensure consistent spelling. For example, in several publications, the authors used varying capitalisation, plurality and usage of acronyms, e.g., referring to LLMs as follows: “LLM”, “Large Language Model”, “Large language model (LLM)”, “Large language models (LLMs)”, and “Large Language Models (LLMs)”. In such cases, we encoded these keywords by using the most frequent term for a given group of keyword variants across our publication database. As particular search databases encoded the lists of authors differently, we also adjusted the formatting of the author list to a consistent form.
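A minimal sketch of this keyword canonicalisation step, using the LLM variants quoted above as the (illustrative) variant group and assuming publications are dicts with a "keywords" list:

```python
from collections import Counter

def canonicalise_keywords(publications):
    """Map spelling/capitalisation/acronym variants of a keyword to the
    most frequent form across the corpus (variant groups are illustrative)."""
    variant_groups = [
        {"LLM", "Large Language Model", "Large language model (LLM)",
         "Large language models (LLMs)", "Large Language Models (LLMs)"},
    ]
    counts = Counter(kw for pub in publications for kw in pub["keywords"])
    mapping = {}
    for group in variant_groups:
        canonical = max(group, key=lambda kw: counts[kw])  # most frequent variant
        for kw in group:
            mapping[kw] = canonical
    for pub in publications:
        pub["keywords"] = [mapping.get(kw, kw) for kw in pub["keywords"]]
    return publications
```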

3.3. Analysis of Publications

Based on the formulated research questions, we defined a set of attributes and, where applicable, corresponding categories to classify the publications included in the final dataset. We split the attributes between the two authors, with each author responsible for annotating all publications according to the assigned subset. For each publication, the authors either applied predefined (closed set) attributes, such as study type or the use of supportive techniques, or identified relevant categories for open attributes, including estimation targets, measurement units, used LLMs, input artefacts for LLMs, and reference and supportive techniques. Subsequently, each author verified the other’s work. In cases of initial disagreement, we shared our justifications and reached a consensus in all such cases.
We did not apply any AI or NLP-based text analysis techniques (e.g., topic modelling or clustering) to the corpus of abstracts, keywords, or full texts, e.g., as in [48,49], as this was unnecessary for our study. Specifically, because of the relatively small number of publications in the final dataset, automated topic extraction methods were unlikely to provide meaningful additional insights. Instead, we identified the relevant categories, such as estimation targets, explicitly from the individual studies.

3.4. Data Characteristics

Figure 1 presents the number of publications retrieved from each source database, including only those retained in the final dataset after screening and filtering. Since some publications were indexed in multiple databases, the counts do not add up to the total number of unique publications. The figure also includes a category labelled “Manual”, representing publications identified through manual search and snowballing. The majority of publications were retrieved from Scopus and WoS, while IEEE, Springer, and the manual search with snowballing each delivered fewer but roughly similar numbers. ACM provided the fewest.
Eight publications were retrieved from a single database, while 14 were indexed in multiple databases: six in two databases, four in three, and four in four. The remaining eight publications were identified through manual search and snowballing.
Figure 2 presents the distribution of publications by year and venue type. The publication years range from 2022 to 2026, with the highest number published in 2024 (16), followed by 2025 (8). Although 2025 has not yet ended, one study was already released with a publication year of 2026.

4. Results

4.1. Publication Venues Chosen by Authors (RQ1)

Table 4 summarises the publication venues represented in the final dataset. Most venues published just a single paper, with two exceptions: Applied Sciences and IEEE Transactions on Software Engineering, which each featured two studies. The publications are distributed across a wide range of well-recognised journals and conferences, reflecting the multidisciplinary nature of research on LLM-based software project estimation. Notably, apart from those mentioned above, contributions appeared in leading software engineering venues, including Information and Software Technology, Journal of Systems and Software, Proceedings of the ACM on Software Engineering, and the International Conference on Software Engineering, indicating the topic’s visibility within the core software engineering research community.

4.2. Involved Research Institutions and Countries (RQ2)

There were 93 authors of publications in our dataset. They were affiliated with 50 institutions (two authors did not declare any institutional affiliation) from 19 countries. Figure 3 provides the number of publications by the countries of the authors’ affiliated institutions. Research institutions based in China produced the highest number of publications (4), followed by Brazil and Turkey (3 each). Eleven countries contributed two publications each: Australia, Canada, India, Indonesia, Italy, Japan, Morocco, Slovenia, Sweden, the United States, and the United Kingdom.
Table 5 provides a detailed breakdown of publication counts by the individual institutions represented by the authors of primary studies. The distribution reveals high institutional fragmentation within the research landscape. Specifically, only five institutions contributed more than one publication each, whilst the remaining 45 institutions contributed single publications to the dataset. This pattern suggests that no single institution has yet emerged as a dominant research centre in the field of LLM-based software estimation. The prevalence of single-paper contributions from the vast majority of institutions indicates that research efforts remain largely dispersed across the global academic community, with limited institutional continuity or capacity within individual organisations.

4.3. Impact of the Published Research (RQ3)

Table 6 summarises the citation counts reported for the analysed publications across major bibliographic databases. Twenty publications (67%) were cited at least once in at least one of the databases. The dashes indicate publications that were not retrieved by our queries from the respective sources. Citation coverage varied notably among databases, with Scopus and WoS indexing and citing the highest numbers of publications. Specifically, all 20 publications were indexed in Scopus, with 18 receiving at least one citation. Fifteen publications were indexed in WoS, with 13 cited at least once. The most frequently cited study was ref. [26], with over forty citations in both Scopus and IEEE, and 32 in WoS. Three studies (refs. [50,51,52]) published in 2025 were already cited. Overall, the citation patterns demonstrate substantial variation across databases, reflecting differences in indexing scope, citation coverage, and update frequency.

4.4. Thematic Scope of Research on LLM-Based Early Software Project Estimation (RQ4)

The analysis of the thematic scope considered both author keywords and citation counts. Figure 4 presents a treemap of the author-defined keywords within the publication corpus. For clarity, only the 21 keywords that appeared in at least two publications are included, while the remaining 65 single-occurrence keywords were omitted. These keywords were grouped into two categories: SE (shades of green) and AI (shades of coral). The keyword AI for SE belongs to both categories and is therefore displayed with a different colour. Darker shades represent general terms, whereas lighter shades represent specific ones. The area of each block is proportional to the number of occurrences, with both the count and the percentage of publications displayed. In total, twelve keywords fall under the category SE, eight under AI, and one under both (as noted earlier). The most frequently occurring keywords are Large language model, Natural language processing, BERT, and Software effort estimation.
The investigation of keyword co-occurrence revealed several associations among the terms used in the studied publications. Figure 5 illustrates the keyword co-occurrence map. For clarity, it displays only keywords that occurred at least twice. There were three pairs with the highest co-occurrence (4 publications), i.e., BERT with Natural language processing, COSMIC with Functional size measurement, and Deep learning with Software effort estimation. Each of them differs from the others as the first links two terms of AI, the second—of SE, and the third—between the two broad categories.
Several additional pairs showed slightly weaker but still notable relationships (3 publications), including BERT with COSMIC, BERT with Software effort estimation, COSMIC with Large language model, COSMIC with Natural language processing, Functional size measurement with Large language model, Machine learning with Software effort estimation, and Natural language processing with Software effort estimation. All these pairs form a bridge between AI and SE. Only one pair at this co-occurrence level, i.e., Effort estimation with Story points, did not involve the AI area. There were 23 other pairs of keywords that appeared in two publications.
To explore the relationship between keyword frequency and citation impact, we analysed citation data from the available bibliographic databases. Due to visualisation constraints, the analysis focused on the WoS and Scopus databases, selected for two reasons: (1) they contained citation information for the highest number of publications and (2) they are not limited to a specific publisher.
Figure 6 presents the relationship between total citation counts in WoS and Scopus for publications associated with each keyword. The majority of keywords follow an approximately linear relationship between the two databases, beginning with those that have low citation counts in both sources (Story point estimation, Agile development, Generative AI, Story points, and Software size measurement). This trend extends through keywords with moderate citation levels (Empirical software engineering, User stories, Machine learning, ChatGPT, and Natural language processing), and further to two highly cited keywords (Software effort estimation and BERT), culminating with AI for SE, which achieved the highest citation counts. For these keywords, the total citation counts reported in Scopus were approximately twice as high as those in WoS.
However, several keywords deviated from this general trend, i.e., Effort estimation, Deep learning, Artificial intelligence, Agile software development, and Software engineering, for which citation counts in Scopus were disproportionately higher than in WoS. These discrepancies suggest variations in database coverage or indexing policies that may influence the visibility of specific research topics.
We further examined the relationship between the occurrence of keywords in the publication corpus and the total citation counts of the corresponding publications, based on data from Scopus. We selected Scopus for this analysis because it covered the highest number of publications and consistently reported citation counts that were equal to or higher than those in WoS. Figure 7 presents the resulting relationship.
Keywords with low occurrence and citation counts included Agile development, Software measurement, and Story point estimation. In contrast, the two most frequently occurring keywords, i.e., Large language model and Natural language processing, showed relatively modest citation totals. Conversely, AI for SE occurred very rarely but had exceptionally high total citations, primarily due to a single highly cited publication [26]. The keywords BERT and Software effort estimation demonstrated both high occurrence frequencies and high cumulative citation counts, reflecting their central role in the analysed body of research.

4.5. Types of Studies That Investigated the Use of LLMs for Early Software Project Estimation (RQ5)

We classified publications into the following categories of study types:
  • Conceptual—the study proposed a concept-level method, solution or approach that was not yet implemented and evaluated.
  • New (Adaptation)—the main contribution of the study is a new method, solution or approach that was adapted from an earlier one.
  • New (Composition)—the main contribution of the study is a new method, solution or approach that was composed of several earlier ones without any adaptation or with minimal adaptation.
  • New (Adaptation/Composition)—the main contribution of the study is a new method, solution or approach that was composed of several earlier ones with significant adaptation either at the level of individual components or the whole composition.
  • Empirical evaluation—the study was primarily focused on evaluating existing methods, solutions or approaches.
  • Tool—the study demonstrated a novel tool.
All categories except ‘Tool’ were disjoint, i.e., only one could be assigned. The category ‘Tool’ was an additional one that could be added to any other. All publications that demonstrated a tool were of the primary type ‘New (Adaptation/Composition)’.
Figure 8 presents the distribution of publications across different types of studies. The most common type was empirical evaluation. However, 19 publications (63%) proposed new methods, solutions, or approaches, often involving various forms of adaptation or composition.

4.6. Addressed Estimation Targets (RQ6)

Figure 9 illustrates publication counts per estimation target and its unit of measurement. We only included estimation targets related to the scope of this study. Apart from these, some publications also addressed other targets, such as risks and challenges [56], or the time for implementing a number of unit tests, marking tasks as done based on a generic definition of “done” criteria, or checking whether requirements were met [64].
We identified four early-stage estimation targets. The most frequent was effort (21 studies), followed by size (7), story quality (2), and productivity (1). One study [68] addressed two targets: effort and productivity. Two studies covered a single target but used two units of measurement: size, as measured by COSMIC and MicroM function points [69], and effort, as measured by story points and person-hours [55].
Effort was most commonly expressed in story points and person-hours. In all studies, size was measured in function points, most often using the COSMIC variant (6 studies). Estimation of user story quality did not involve any specific unit of measurement. In the single study estimating productivity, it was represented as the number of user stories completed per fixed-duration iteration.
Figure 10 presents the temporal trends in estimation targets across publication years. Effort estimation studies spanned the longest period, covering four publication years (2022–2025), followed by size estimation studies, which appeared across three years. The highest numbers of effort estimation studies were published in 2024 (11) and 2025 (5), whereas most size estimation studies appeared in 2024 (4). We did not observe a clear temporal trend, such as a steady increase in publication counts. However, additional studies may still appear in 2025, and one publication has already been indexed with a publication year of 2026.

4.7. LLMs and Approaches or Tools Involving LLMs Applied or Proposed for Estimation Tasks (RQ7)

Figure 11 presents the distribution of publications by LLMs, grouped according to their model family. As several studies employed more than one model—either within the same or across different families—the counts do not sum to the total number of publications. The most frequently used model was BERT (9 studies), which combined both those that explicitly reported using BERT_base and those that referred to BERT more generally. Within the BERT family, other models used in more than one publication included BERT_SE, BERT_large, and SBERT. Among the GPT-based models, the most frequently used were GPT-4, GPT-2, and GPT-3.5-turbo.
Note that while almost all identified categories are LLMs, two of them are actually end-user tools built on top of LLMs. These include ChatGPT [59,70], which uses various versions of the GPT model family, and GitLab Duo [66,71], which integrates diverse models such as Claude and Mistral.
While LLMs were most often used to provide the estimates for particular targets, in seven studies they were used only in an earlier phase of the processing pipeline, for feature extraction [25,55,60,63,72,73,74], while the final estimates were obtained by other means, most often using models involving traditional ML techniques such as random forests [55], various boosted trees [60,72], various neural networks [74], or a collection of various ML techniques [63]. The pipelines proposed in three studies involved the distinct use of LLMs at more than one processing stage, interleaved with other techniques (e.g., [26,50,51]).
Figure 12 presents the distribution of reference techniques, grouped by family, that were used for comparison with the LLM-based approaches shown in Figure 11. Note that the reference techniques were not limited only to LLMs but also include other types popular in ML studies. The reference technique appearing in comparisons most often was Deep-SE [14] (6 studies). Among the LLM-based techniques, GPT2SP [26] was included in most comparisons (4 studies). Additionally, 4 studies employed human estimates as a reference, and another 4 used Mean and Median baselines. In total, 18 studies (60%) applied at least one reference technique, whereas 12 studies (40%) did not.
Figure 13 presents the publication counts for LLM families and the reference techniques used for comparisons. Because the number of individual reference techniques was large and most were used only once in particular publications, we applied the following categorisation for clarity. We only retained individual techniques that appeared in more than 2 publications, namely: Deep-SE, GPT2SP, Mean, Median, and Human. Techniques other than these, involving BERT or GPT, were grouped under their respective model families. A separate category labelled “none” represents studies that did not include any comparison. Finally, we grouped all other techniques under the category “other”.
Most publications (7) involved GPT-family models without comparison to any other technique. In 6 studies, GPT models were compared with “other” reference techniques. Five publications compared BERT-family models with Deep-SE, and another 5 with “other” reference techniques. Several studies compared BERT or GPT models with Mean and Median baselines or with estimates provided by humans. While some studies compared BERT models with other BERT or GPT variants, none compared GPT-family models with other BERT or GPT variants. LLMs outside the BERT and GPT families were most often not compared with any other technique (3 publications).
The analysis of supportive techniques, i.e., those that support the use of LLMs for a particular goal or task, focused on the applications of fine-tuning and prompt engineering. In 9 studies, the authors applied fine-tuning, while in 15, they employed prompt engineering. Table 7 provides detailed counts of studies that used these techniques. Specifically, 1 study reported the use of both techniques, whereas 7 did not report applying either. The authors used the following types of prompt engineering: zero-shot learning [65], one-shot learning [53], few-shot learning [50,57,73], Context Minimal Prompting and Context Rich Prompting [52], and ConceptAct [62].
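The primary studies used a variety of prompt styles, none of which is reproduced here. Purely as an illustration of the few-shot pattern mentioned above, a hypothetical prompt for story point estimation might be structured as follows (the wording and examples are invented for illustration):

```python
# Hypothetical few-shot prompt for story point estimation; the wording and
# examples are illustrative, not taken from any primary study.
FEW_SHOT_PROMPT = """You are an agile estimation assistant.
Estimate the story points (Fibonacci scale) for the final user story.

User story: As a user, I want to reset my password via email.
Story points: 3

User story: As an admin, I want to export all audit logs as CSV.
Story points: 5

User story: {user_story}
Story points:"""

def build_prompt(user_story: str) -> str:
    return FEW_SHOT_PROMPT.format(user_story=user_story)

print(build_prompt("As a customer, I want to filter products by price range."))
```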
The Sankey diagram in Figure 14 illustrates the counts of publications reporting the use of prompt engineering and fine-tuning across model families. No study that used models from the BERT family applied prompt engineering, whereas most studies using GPT models did. All studies involving models outside the BERT or GPT families also used prompt engineering. Fine-tuning was applied in a slight majority of studies that used BERT-based models. In contrast, for the vast majority of studies employing GPT or other models, fine-tuning was not used.

4.8. Artefacts Used as Inputs for LLM-Based Estimation (RQ8)

Figure 15 illustrates the publication counts for each type of artefact used as the main input for the LLMs. Descriptions of user stories were the most frequently used inputs. The category “requirements” denotes all types of textual requirements for which the authors did not provide additional details. Among the “other” category, there were names of microservices to be developed and task descriptions. One study [72] used not just one input artefact but integrated several of them, i.e., user stories, images, and severity levels.

4.9. Used Evaluation Measures (RQ9)

The primary focus of our study was on early-stage software project estimation, with the estimates typically presented in numerical form. Hence, most studies (20) used evaluation measures for regression, including the following: absolute error (AE), estimation accuracy, mean absolute error (MAE), median absolute error (MdAE), median magnitude of relative error (MdMRE), mean magnitude of relative error (MMRE), mean squared error (MSE), normalised mean absolute error (NMAE), relative error (RE), root mean squared error (RMSE), standardised accuracy (SA), prediction at level X (Pred), and MRE standard deviation (SDRMS). However, 11 studies delivered non-numerical estimates and therefore used evaluation measures for classification problems: accuracy, area under the ROC curve (AUC), false alarm rate (FAR), F1 score (F1), precision, recall, and specificity.
There were two measures with similar names: accuracy and estimation accuracy. The former is a widely used measure in classification problems (see Formula 1 in [75]). The authors of [66] defined the latter as actual/estimate × 100%. Hence, these two measures were categorised into different groups.
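To illustrate the most common regression measures on invented numbers, the following sketch computes AE, MAE, MMRE, Pred(25), and the estimation accuracy of [66]:

    actual = [10.0, 8.0, 13.0, 5.0]    # actual effort, e.g., in story points
    estimate = [12.0, 8.0, 10.0, 7.0]  # model-produced estimates

    n = len(actual)
    ae = [abs(a - e) for a, e in zip(actual, estimate)]       # absolute errors
    mre = [abs(a - e) / a for a, e in zip(actual, estimate)]  # relative errors

    mae = sum(ae) / n                         # mean absolute error
    mmre = sum(mre) / n                       # mean magnitude of relative error
    pred25 = sum(m <= 0.25 for m in mre) / n  # share of estimates within 25%
    est_acc = [a / e * 100 for a, e in zip(actual, estimate)]  # per [66]

    print(f"MAE={mae:.2f}, MMRE={mmre:.2f}, Pred(25)={pred25:.2f}")
    # MAE=1.75, MMRE=0.21, Pred(25)=0.75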
Figure 16 illustrates publication counts for each evaluation measure. The most frequently used measures were MAE (16 studies) and accuracy, F1, and recall (7 studies each). Because several studies used more than one measure, these counts do not sum to the total number of studies. In fact, 5 studies used at least one regression and one classification measure. The measure Pred was used in 4 studies, each with a different threshold, i.e., Pred(10), Pred(25), Pred(30), Pred(50). Arman et al. [59] and Bahi et al. [56] did not use any evaluation measures, as these studies were conceptual. Calikli and Alhamed [68] and Cabrero-Daniel et al. [64] did not report using any either; instead, they employed statistical tests and effect size measures to compare different treatments.

4.10. Relationships Between the Key Identified Dimensions (RQ10)

In this RQ, we analysed two groups of dimensions describing the publications. The first group contains the study type, the estimation target, and the type of input artefact to the LLMs. Figure 17 illustrates these relationships using a Sankey diagram.
Most effort estimation studies (13) proposed a new approach or method, with varying degrees of adaptation, composition, or both. Six effort estimation studies were empirical evaluations, 2 were conceptual, and 2 proposed a tool. Nearly all size estimation studies (6) proposed a new approach or method, 1 was an empirical evaluation, and 1 proposed a tool. Both story quality studies were empirical evaluations, as was the single study on productivity estimation. Empirical evaluations covered all estimation targets; studies proposing new approaches and tool demonstrations covered only effort and size; conceptual studies addressed only effort estimation.
Most effort estimation studies used user stories as input (14), while 6 studies used issues, 1 used requirements, and 2 used other input artefacts. The distribution of input artefacts for size estimation studies varied: 4 studies used requirements as input, 3 used use cases, and 1 used user stories. Both story quality studies naturally took user stories as inputs, as did the single study on productivity. User stories served as inputs in studies on all estimation targets, issues were used in 6 effort estimation studies, whereas use cases were used only in 3 size estimation studies.
The second group of dimensions involves estimation target, LLM family, and evaluation measure. Because of the high number of distinct LLMs and evaluation measures used in primary studies, we applied the following categorisation for clarity. Instead of particular LLMs, we present their families, as applied earlier in Section 4.7. We grouped the evaluation measures as follows. We retained individual measures that appeared in at least 5 publications, namely MAE, accuracy, F1, recall, and precision. We also retained a category “none” indicating studies which used no evaluation measure at all. Then, we combined other classification and regression measures into two respective groups. Figure 18 illustrates these relationships.
Ten effort estimation studies used models from the BERT family, 10 used GPT, and 4 used other models. Three size estimation studies used GPT models, 3 used BERT models, and 1 used another model (Qwen-72B). Both story quality studies used GPT models, whereas the productivity study used GPT and other models (Gemini and Llama). GPT models were used in studies on all estimation targets, BERT models only for effort and size, and other models for effort, productivity, and size.
Most effort estimation studies (12) used MAE as an evaluation measure, 3 used accuracy and F1, 8 used other regression measures, while 4 did not use any evaluation measure. Four size estimation studies used MAE, 3 used recall, 2 used accuracy, F1, or precision, and 6 used other regression measures. Story quality studies employed classification measures, including accuracy, F1 score, precision, and others. Eleven studies involving BERT models used MAE, 3 used accuracy, 2 used F1, and 9 used other regression measures. Among the studies involving GPT models, 7 used MAE, 5 used precision or recall, and 4 used accuracy, F1, or other regression measures. Among the studies that applied models other than BERT and GPT, 2 did not use any evaluation measures, 1 used F1, recall, or other classification measures, while 2 used other regression measures.

5. Discussion

5.1. Summary of Key Findings

The results demonstrate that the application of LLMs to early software project estimation is a rapidly emerging research field, with over 4/5 of the papers published within the last 24 months (see Figure 2). The field is highly dispersed, both in terms of publication venues (most of which have published only a single paper, see Table 4) [RQ1] and in terms of the involved research institutions (most of which have only a single affiliated paper, see Table 5) [RQ2]. Although China leads among the involved countries, its margin is minimal, as shown in Figure 3 [RQ2]. While not a particularly fashionable topic, it has not gone unnoticed by the research community, as evidenced by a total of 134 citations registered by Scopus (see Table 6) [RQ3]. The most cited paper is the one introducing GPT2SP (ref. [26]), marking a milestone in the development of this research area and a point of reference for authors publishing later [RQ3].
The analysis of author-provided keywords reveals an expected mix of terms related to SE and AI (see Figure 4), and the co-occurring keywords are also not surprising (see Figure 5), though some identified pairs are not obvious (e.g., linking BERT with COSMIC) [RQ4]. What was unexpected is the visible discrepancy in the citation counts reported by the two largest databases (Scopus and WoS).
The research on the topic is motivated by practice rather than theory: only two of the analysed papers are of a purely conceptual character, with the remainder reporting empirical research. While almost 1/3 of them evaluate existing methods and/or tools, about 2/3 introduce new ones involving adaptation and/or combination of existing methods and/or tools (see Figure 8), indicating that more can be gained than what the baseline methods offer [RQ5].
The analysed papers considered four estimation targets: effort, size, story quality, and productivity, with effort-focused studies dominating the field (70%), followed by those estimating project size (23%) [RQ6]. Delving into the metrics used, effort estimation most often addressed the number of story points (in 2/3 of the relevant papers), followed by person-hours (in about 1/4 of the relevant papers). Size estimation targeted three variants of function points (COSMIC, MicroM, and NESMA), most often the first of these (see Figure 9) [RQ6]. Quite unexpectedly, no study dealt with estimating the number of lines of source code.
Both LLMs based on the encoder-only architecture (namely, various versions of BERT [21]) and those based on the decoder-only architecture (primarily, the GPT series [22]) were considered in the analysed studies (see Figure 11). All but 4 studies accessed the respective model directly; the remaining ones used LLMs via intermediary AI tools (ChatGPT and GitLab Duo) [RQ7]. While the majority of studies (70%) exploited the predictive capabilities of LLMs as such, the rest used an LLM only as an important, yet intermediate, processing step (e.g., feature extraction [55]) in a pipeline culminating in an estimation produced by traditional ML techniques [RQ7]. The authors of half of the studies reported using prompt engineering (note that we do not know whether prompt engineering was actually used in the remaining studies or simply not reported). However, only 30% reported performing fine-tuning of the model (as this is a decidedly non-obvious step, we consider the reported number close to the actual one) [RQ7]. Prompt engineering was not mentioned in any of the analysed studies using BERT-family models, although it is applicable to them (see, e.g., [76,77]). In contrast, about half of the BERT-based studies reported performing model fine-tuning, whereas only 2 other studies did. This can most likely be explained by the higher computational and economic costs of fine-tuning the larger GPT-series models compared to the much smaller BERT-family ones [78].
The analysed studies explored the use of various artefacts available for the early software project estimation. The most frequently chosen were user stories, which were used in slightly over half of the papers, followed by requirements specifications and issue descriptions, each used in 1/6 of the papers (see Figure 15) [RQ8].
All empirical studies performed some kind of estimation evaluation, though the actual measures varied greatly. In 2/3 of the studies, numerical regression measures were used (most often MAE), whereas no less than 1/5 of the studies used at least one of the traditional classification measures (accuracy, precision, recall, and F1 [75]) (see Figure 16) [RQ9].
The presented results also exposed some relationships between the key identified dimensions [RQ10]. As shown in Figure 17, research on story quality estimation was limited to the evaluation of existing approaches, whereas research on size estimation attracted mostly novel proposals adapting and/or composing existing approaches. While most size estimation research was based on requirements and use cases, most effort estimation research processed user stories and issues as the input for estimation. As shown in Figure 18, only GPT-family models were used for story quality estimation, whereas both GPT-series and BERT-family models were used for effort and size estimation. Finally, while most effort estimation studies relied on MAE, most size estimation studies chose other regression measures for their evaluation.

5.2. Maturity and Evolution of the Topic

The theme under study has the obvious traits of an emerging research field: a short time frame (2022–2026), a limited yet quickly growing body of literature (the peak in 2024 visible in Figure 2 stems from 2024 being the last full year covered by the study, with papers from 2025 to 2026 not yet indexed in bibliographic databases or not yet published), a lack of solidified co-authorship clusters, and an underdeveloped citation network, with most of the papers having received only a few citations. The low maturity of the subject is strongly confirmed by the share of validation studies (about 1/3) compared to those introducing novel approaches (about 2/3) (see Figure 8).
While there is almost no institutional continuity according to the obtained results (see Table 5), we believe this is just another indication of the novelty of the theme, and at least some institutions will continue their studies in the future.
Regarding the theme evolution, despite BERT-family models [21] gaining popularity earlier than GPT [22], we have not identified any significant shift in interest between the two approaches (9 out of 13 papers using BERT vs. 8 out of 13 papers using GPT were published in 2024 or before). What we can observe is a slight shift in interest from effort estimation (16 out of 21 papers published in 2024 or earlier) to size estimation (3 out of 7).

5.3. Methodological Considerations

The results clearly reveal the problem of evaluation inconsistency across studies, primarily regarding the use of various evaluation measures. To illustrate this issue with an example, even though refs. [51,73] both discuss new approaches to size estimation and use the same measurement unit (COSMIC function points), their results are not easily comparable due to the use of different sets of evaluation measures (respectively, MAE and MSE in [51] vs. precision, recall, F1, and RE in [73]). Consequently, it is unclear which evaluation measures future authors should use to place their results in the context of prior work. Moreover, the difficulty of collecting comparable data along the same dimension from various papers will make it infeasible to perform a meta-analysis once the field matures.
The second facet of this problem is the lack of a standard evaluation dataset, which not only makes it difficult to compare results from various authors but also hampers the reproducibility of the presented results. While there is some consistency among papers dedicated to story-point-based effort estimation that use the Choetkiertikul [14] and/or TAWOS [79] datasets, many of the analysed studies use different experimental datasets. Therefore, even though refs. [51,58] both use MAE as the evaluation measure for size estimation with COSMIC function points, their results are still not directly comparable, as they come from experiments on different datasets, neither of which is accessible as open research data. Note that the authors’ liberty in selecting datasets for evaluation may also raise concerns regarding the generalizability of the presented results across project contexts.
The third aspect of this problem is the discrepancy in choosing the reference baseline against which the evaluation results are compared. As can be observed in Figure 12, apart from human evaluation, a plethora of techniques, both involving LLMs and not, were used in this role. A promising starting point for defining such a baseline are two techniques (Deep-SE [14] among the non-LLM-based ones, and GPT2SP [26] among the LLM-based ones) that have already attracted the interest of several authors, who have comparatively evaluated their results.
Nonetheless, even for directly comparable measurements, the feasibility of reproducing prior authors’ results remains a critical concern, considering that even small, seemingly nuanced changes in the experimental setup can lead to disproportionately large changes in the measured performance (cf. [37], p. 37). Hence, there is a need to develop a standardised evaluation protocol or, at the very least, to implement reproducibility checklists to be answered upon submitting a paper, ensuring that authors have provided sufficient information for the replicability and reproducibility of their findings (see, e.g., [40], p. 1219). Some suggestions on what should be included when reporting evaluation results of empirical studies in software engineering involving LLMs are provided by Wagner et al. [80].
Note also that the analysed studies often lack transparent reporting of the presented results. For instance, even though refs. [67,81] include measurements for prior work in their result comparisons, it is unclear whether they copied the results from the original sources or reproduced them independently. Such omissions, however, could be easily addressed by more careful proofreading and reviewing of works on the topic.

5.4. Limitations and Challenges

The application of LLMs to early software project estimation is not free from issues specific to LLMs in general. One of these is the risk of LLM hallucination, i.e., generating content that is “nonsensical or unfaithful to the provided source content” [82]. In the estimation context, this means producing output completely ungrounded in the input data fed to the LLM. Compared to LLM-based code generation, where code execution provides the objective verification needed to filter out hallucinated responses [39], there is no similarly easy way of detecting hallucinated estimation numbers. Marques et al. advise exposing the model to different contexts and information during training as a way to reduce the risk of hallucination [41]. This is not an option if a pre-trained model is used; in such a case, the proposed mitigation strategies include retrieval-augmented generation and prompt refinement mechanisms [46].
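As a minimal illustration of retrieval-augmented generation in this context, the sketch below grounds the prompt in the most similar already-estimated user stories; the retrieval method (TF-IDF with scikit-learn), the data, and the prompt wording are our assumptions, not a method from the primary studies:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    history = [  # already estimated stories with their story points
        ("As a user, I want to log in with my Google account.", 5),
        ("As a user, I want to reset my password via e-mail.", 3),
        ("As an admin, I want to export usage reports as CSV.", 8),
    ]

    def grounded_prompt(new_story: str, k: int = 2) -> str:
        texts = [story for story, _ in history]
        vec = TfidfVectorizer().fit(texts + [new_story])
        sims = cosine_similarity(vec.transform([new_story]),
                                 vec.transform(texts))[0]
        top = sorted(range(len(history)), key=lambda i: sims[i], reverse=True)[:k]
        context = "\n".join(f"- {history[i][0]} -> {history[i][1]} points"
                            for i in top)
        return (
            "Based only on these similar, already estimated stories:\n"
            f"{context}\nEstimate the story points for: {new_story}"
        )

    print(grounded_prompt("As a user, I want to sign in with my GitHub account."))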
Another issue specific to LLMs is the limited interpretability and explainability of the outputs they provide, which restricts their use in high-stakes settings [83]. This also means a limited understanding of when LLMs work well or poorly, which can cast doubt on their outputs and undermine the trustworthiness necessary for the produced estimates to be used in practice. According to Syahputri et al. (and the authors cited by them), the explainability of model outputs can be substantially enhanced with chain-of-thought prompting ([46], p. 11).
Moreover, LLMs are sensitive to input quality and prompt phrasing, which stresses the need for careful data preprocessing (see, e.g., [63], p. 6) and prompt engineering (see, e.g., [52], pp. 1430–1431). Another issue is the risk of bias, which can be mitigated by employing adequate data preprocessing techniques if the model is retrained or fine-tuned (see, e.g., [63], p. 20), but less so if a pre-trained model is used.
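A toy example of such preprocessing (our illustration; the markup conventions, e.g., Jira-style {code} blocks, are assumed) could look as follows:

    import re

    def clean_issue_text(raw: str) -> str:
        text = re.sub(r"<[^>]+>", " ", raw)  # drop HTML tags
        text = re.sub(r"\{code\}.*?\{code\}", " [code omitted] ",
                      text, flags=re.S)  # drop pasted stack traces/snippets
        return re.sub(r"\s+", " ", text).strip()  # normalise whitespace

    print(clean_issue_text("<b>Login fails</b> {code}stacktrace...{code} on v2.1"))
    # Login fails [code omitted] on v2.1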
The primary technical limitation of using LLMs (and consequently, an economic one) is their high computational resource requirement. The costs of the hardware infrastructure needed to host multi-billion-parameter models are prohibitive for small organisations [84]. The use of LLMs via cloud-based services, while economically viable even for individual users, is subject to another limitation: it raises privacy and security concerns regarding data transmitted over the Internet and potentially incorporated into the training data of future model versions. A shared model can also be vulnerable to input data poisoning by malicious users. Such security and privacy risks can be minimised if the Input and Output Privacy framework is applied to cloud-based systems providing access to LLMs [85].
The respective models also have their specific technical limitations, such as the maximum size of their context window (see, e.g., [86], p. 7459). This may require some input data preprocessing so that inputs fit the context window. A recently proposed solution advocates the use of dynamically generated soft prompts to obtain a concise yet semantically robust depiction of shortened input content [87].
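For instance, a naive mitigation, sketched below under the assumption that whitespace-separated words roughly approximate tokens, is to trim the input artefact to the window:

    def fit_to_context(text: str, max_tokens: int = 512) -> str:
        # Real budgets depend on the model's tokenizer; words are a rough proxy.
        words = text.split()
        if len(words) <= max_tokens:
            return text
        # Keep the beginning, where a user story typically states role and goal.
        return " ".join(words[:max_tokens]) + " [truncated]"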

5.5. Implications for Practice

Although the field is far from mature, LLM-supported estimation techniques are already being adopted by industry, as illustrated by examples of commercial solutions such as devtimate (https://devtimate.com/ai-generated-estimates, accessed on 4 December 2025) and Quanter (https://www.quanter.com/en/revolutionising-software-estimation-the-power-of-ai-and-natural-language-at-quanter/, accessed on 4 December 2025). In fact, their adoption by organisations managing software projects is a necessary prerequisite for their verification in real-world conditions, as only that can reveal the true, practical quality of the estimations they provide. Nevertheless, adopting the solutions proposed in the analysed studies is not equally easy. Using zero-shot learning with GPT (as described, e.g., in [65]) can be performed immediately using generally available tools, but some methods first require time-intensive work, such as model training (e.g., [67]) or at least model fine-tuning (e.g., [62]). The latter approaches, however, allow for incorporating the organisation’s own data in the training, which could yield predictions better tuned to the organisation’s specifics. On the other hand, this requires an organisation to collect and maintain data for the sake of training or fine-tuning the model. As some solutions go even further, modifying the structure of the fine-tuned model (e.g., [81]) or executing a complex processing pipeline involving an LLM at one or more of its steps (e.g., [74]), they could be difficult to apply in production environments unless dedicated tools comprehensively implementing such complex methods are provided. As shown in Figure 8, only three of the analysed papers introduced such tools ([26,50,88]).
Even simpler solutions, combining generally accessible models with their own unsophisticated yet proprietary processing, cannot be adopted when their source code is not publicly released (e.g., [73]).
The LLM-based estimation and prediction of software projects have the virtue of being quick to apply once the required software and/or model are set up. Practitioners considering its adoption should perform a cost–benefit analysis, appraising the benefits (particularly in terms of more reliable estimation) and time savings on one side, against the costs (associated with implementation and use) and risks of using LLMs on the other. While the benefits depend mainly on the quality of the solution chosen, the costs can be adjusted by selecting an LLM hosted on-premise or in the cloud [84] as well as choosing between commercial and open-source models ([73], p. 16).
The use of LLMs can be integrated with existing estimation practices, to which the organisation can resort if more time is available or if the results of LLM-based estimation appear implausible.
Our study focused specifically on LLM applications for software project estimation. However, the identified approaches may have broader applicability to project management and safety-critical domains where early risk and effort estimation are crucial. The capability of LLMs to process natural language specifications and extract semantic information could benefit other knowledge-intensive tasks, such as project planning or hazard analysis in safety-critical systems [35,89,90,91,92,93,94,95]. However, such cross-domain applications would require careful validation, because software projects have unique characteristics, including a high level of abstraction, frequently changing requirements, and iterative development processes. Hence, the results from our study may not be generalised directly. Future research could systematically explore these cross-domain applications.

6. Threats to Validity

The first threat to construct validity relates to potentially incomplete coverage of the search queries. To address this, we performed multiple query iterations and tests across databases using different search keywords and operators, as well as wildcards and synonyms for key terms. We also applied the PICOC framework to ensure conceptual completeness.
To address the database-specific query limitations, especially in ACM, where a generic query returned thousands of irrelevant publications, we adapted the query for ACM’s specific format and logical operators. We also manually verified the abstracts and then the full texts to ensure relevance.
We omitted the ScienceDirect database due to its limited support for wildcards and Boolean operators, which made sensible query construction impossible. Instead, we used five other major databases to ensure broad coverage and performed snowballing and a manual search to mitigate potential gaps.
To ensure data completeness and timeliness, we performed the search in all publication databases on the same day (15 October 2025). We also gathered citation counts for all publications on a single day (3 November 2025). We used multiple databases to ensure comprehensive citation coverage, as most publications indexed in multiple databases had varying citation counts. To ensure the reliability of citation data, we used ACM, IEEE, Scopus, Springer, and WoS as citation sources.
We mitigated subjective interpretation during screening by consistently applying predefined inclusion and exclusion criteria. In addition, one author screened all publications, while the other verified the outcomes and decisions. In consensus meetings, the authors shared and discussed the relevant justifications, resolving any disagreements that arose. We used the same process and division of responsibilities between the authors to address the subjectivity of study classifications later, during the analysis of publications. Such a double-review process with consensus resolution left no unresolved disagreements and was appropriate for the scale of our mapping study.
We focused on peer-reviewed publications only to ensure the quality and credibility of the analysed work. Snowballing and manual search helped identify additional peer-reviewed papers. We excluded grey literature and preprints to maintain quality standards.
Our exclusion of publications not written in English may introduce a language bias, potentially missing relevant work published in other languages. However, to ensure the comprehensibility of the analysed texts, we followed the standard practice in software engineering literature reviews, as the major databases primarily index English publications and the international SE community mostly publishes in English.
Regarding the geographic and institutional coverage, the publications were written by authors from 19 countries and 50 institutions. The distribution reflects international research activity. We did not apply geographical restrictions in the search or screening strategy. Similarly, we also did not apply any publication date restrictions. Since the analysed knowledge area contains only recently published works (the oldest published in 2022), there was no need to filter publications by date.
The performed search, involving five major databases and snowballing, resulted in a set of primary studies spanning 28 different venues (Table 4), including leading software engineering journals and conferences. This reflects the multidisciplinary nature of the research area, indicating broad coverage. We did not filter the results by venue or venue type because all papers that passed the other criteria were published in a journal, in conference proceedings, or as a book chapter, i.e., in venues that ensure sufficient quality. Nonetheless, work published in venues not indexed by our selected databases may be missing.
We addressed the threat of selection bias in inclusion and exclusion criteria by formulating and explicitly stating them before screening, and consistently applying them during screening with independent verification by two authors. To address the potential issue of missing relevant publications, we conducted snowballing and a manual search of authors’ private repositories for known relevant work, which identified an additional eight publications.
We did not perform a formal quality assessment of primary studies because our study is a systematic mapping, where such an assessment is optional (in contrast to systematic literature reviews, where it is more essential) [28,96,97,98]. Still, such an evaluation was built into the applied methodology. Specifically, we assessed whether each publication provided sufficient methodological detail to extract data for our research questions. This served as a threshold for methodological adequacy. Furthermore, during the analysis of RQ5, we classified primary studies into different types. The category “conceptual” reflects a study at a low maturity level that has not yet been implemented or evaluated. Finally, all 30 primary studies in our final dataset were peer-reviewed publications, which provides a baseline quality assurance.
Another threat was the alignment of research questions with analysis. We addressed it by carefully formulating each research question to explore essential elements of LLM-based estimation and to ensure that the analysis approach directly mapped to the research questions. Additionally, we developed the structured classification scheme prior to analysis.
The analysed publications span a limited timeframe from 2022 onwards. However, we did not impose this restriction a priori during the publication search or screening. Instead, it reflects the actual time span of publications on LLM applications in early-stage project estimation. We focused on contemporary LLM technologies suitable for the scope of this mapping study, but we also covered earlier ML/DL approaches in Section 2 to provide context.
The analysis covered a limited number of final publications, i.e., 30. However, such a dataset reflects the state of the research area. As presented in Section 2, some similar studies covered even fewer publications, including [3], which analysed as few as 18 papers. The sample size was small for some analyses, but it was still appropriate and sufficient for this mapping study, which effectively illustrates the research landscape. We did not perform statistical analyses because the study was descriptive in nature.
Another threat is the generalizability across software domains. To address this, we demonstrated that the analysed publications span diverse estimation targets and contexts. Additionally, they covered multiple LLMs or families and approaches, as well as various input artefacts and evaluation measures.
The rapid evolution of LLM technologies presents a temporal threat to validity. Studies published after our search date were not included, and newer solutions may render some findings less relevant. However, this limitation is inherent to any study of fast-evolving technologies and does not diminish the value of documenting the current state-of-the-art. Our findings capture a critical period (2022–2025) during which modern LLMs were applied to early-stage software estimation, providing a baseline for future comparative studies. Moreover, our focus on model families (BERT, GPT) rather than specific versions enhances the longevity of our insights. Finally, we submitted our manuscript to a journal known for its rapid handling and publication.
To ensure the reproducibility of results, we followed a systematic methodology based on established guidelines [29]. We provided detailed search queries in Appendix A. We also explicitly documented data preparation and analysis procedures. We also provided a complete list of analysed publications in Appendix B.
We did not apply any advanced NLP analysis because manual classification was sufficient for a relatively small dataset. In addition, automated topic extraction would unlikely provide meaningful additional insights. Instead, explicit identification from individual studies was more accurate.
We consider the following threats related to each other most serious: (1) the risk of rapid obsolescence caused by fast-moving LLM developments, which limits the temporal validity of our findings, and (2) the emerging nature of the analysed area, with only 30 primary studies potentially limiting generalisability. However, such types of threats are inherent to mapping studies covering rapidly evolving research areas and do not undermine our contribution in establishing the initial landscape.

7. Conclusions and Future Work Directions

To our knowledge, this is the first systematic mapping study of LLM-based early software project estimation. It contains a comprehensive overview of 30 primary studies (published since 2022) and provides a structured classification across the ten investigated research questions.
The provided insights deliver a clear entry point for researchers and practitioners seeking to familiarise themselves with the field. By analysing publication patterns, thematic scope, methodological approaches, and performance characteristics, we reveal emerging trends as well as underexplored areas in the application of LLMs to early-stage estimation. The detailed investigation of the employed LLM architectures, reference and supportive techniques, input artefacts, and evaluation measures delivers actionable insights for both researchers designing future studies and practitioners assessing the suitability of LLM-based estimation solutions in their contexts.
Future research could extend this analysis through a more detailed investigation of the primary studies. Specifically, a systematic literature review could synthesise further details reported in the primary studies, for example, regarding the particular techniques integrated with or supporting LLMs and the achieved estimation accuracy, and could involve a more in-depth quality assessment of the primary studies. Future work could also take the form of a meta-analysis of reported performance or other aggregated results. This, however, should be postponed until a larger body of literature on the early estimation of software projects becomes available.
Regarding potential technical research directions, future work may cover domain-specific LLM development for estimation, multimodal estimation (e.g., text with diagrams and historical data), the development and application of explainable AI techniques for estimation, uncertainty or confidence quantification methods, or ensemble methods that combine multiple LLMs.
Another possible path for future work is a comprehensive, integrated analysis of AI techniques across the entire software project lifecycle, which is notably absent from current research that focuses on individual tasks or phases treated in isolation (see, e.g., [99]). It could investigate potential synergies between various AI-supported tools and the possible additional benefits of obtaining consistent AI-based support from requirements through maintenance. Specifically, this could include analysing how LLM-based early estimation integrates with and supports subsequent AI-assisted activities in software projects, possibly creating coherent toolchains and involving emerging opportunities, such as continuous real-time estimation during project development and evolution.

Author Contributions

Conceptualisation, J.S.; Methodology, J.S. and Ł.R.; Software, Ł.R.; Validation, J.S. and Ł.R.; Formal analysis, Ł.R. and J.S.; Investigation, Ł.R. and J.S.; Resources, J.S. and Ł.R.; Data curation, J.S. and Ł.R.; Writing—original draft preparation, Ł.R. and J.S.; Writing—review and editing, J.S. and Ł.R.; Visualisation, Ł.R.; Supervision, J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

During the preparation of this manuscript, the authors used Perplexity, ChatGPT, and Grammarly to correct grammar errors and rephrase selected individual sentences and paragraphs of the manuscript for the improvement of its writing style and the enhancement of its clarity. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Search Queries

The following search query was used with publication databases IEEE, Scopus, Springer, and WoS:
  • (
      "LLM*" OR "large language model*" OR "GPT" OR "ChatGPT" OR
      "generative AI"
    )
    AND
    (estimat* OR predict* OR forecast*)
    AND
    (
      "effort" OR "cost" OR "man-hour*" OR "person-hour*" OR
      "development time" OR "duration" OR "schedule" OR "productivity" OR
      "size" OR "sizing" OR "function point*" OR "feature point*" OR
      "use case point" OR "use case points" OR "COSMIC" OR "FISMA" OR
      "NESMA" OR "IFPUG" OR "story point" OR "story points" OR "user story" OR
      "user stories" OR "lines of code" OR "LOC" OR "KLOC" OR "SLOC"
    )
    AND
    ("software")
The above query was adapted for search with ACM to the following form:
  • [[Title: "llm*"] OR [Title: "large language model*"] OR [Title: "gpt"] OR
    [Title: "chatgpt"] OR [Title: "generative ai"] OR [Abstract: "llm*"] OR
    [Abstract: "large language model*"] OR [Abstract: "gpt"] OR
    [Abstract: "chatgpt"] OR [Abstract: "generative ai"] OR
    [Keywords: "llm*"] OR [Keywords: "large language model*"] OR
    [Keywords: "gpt"] OR [Keywords: "chatgpt"] OR [Keywords: "generative ai"]]
    AND
    [[Title: estimat*] OR [Title: predict*] OR [Title: forecast*] OR
    [Abstract: estimat*] OR [Abstract: predict*] OR [Abstract: forecast*] OR
    [Keywords: estimat*] OR [Keywords: predict*] OR [Keywords: forecast*]]
    AND
    [[Title: "effort"] OR [Title: "cost"] OR [Title: "man-hour*"] OR
    [Title: "person-hour*"] OR [Title: "development time"] OR
    [Title: "duration"] OR [Title: "schedule"] OR [Title: "productivity"] OR
    [Title: "size"] OR [Title: "sizing"] OR [Title: "function point*"] OR
    [Title: "feature point*"] OR [Title: "use case point*"] OR
    [Title: "cosmic"] OR [Title: "fisma"] OR [Title: "nesma"] OR
    [Title: "ifpug"] OR [Title: "story point*"] OR [Title: "user story"] OR
    [Title: "user stories"] OR [Title: "lines of code"] OR [Title: "loc"] OR
    [Title: "kloc"] OR [Title: "sloc"] OR [Abstract: "effort"] OR
    [Abstract: "cost"] OR [Abstract: "man-hour*"] OR
    [Abstract: "person-hour*"] OR [Abstract: "development time"] OR
    [Abstract: "duration"] OR [Abstract: "schedule"] OR
    [Abstract: "productivity"] OR [Abstract: "size"] OR
    [Abstract: "sizing"] OR [Abstract: "function point*"] OR
    [Abstract: "feature point*"] OR [Abstract: "use case point*"] OR
    [Abstract: "cosmic"] OR [Abstract: "fisma"] OR [Abstract: "nesma"] OR
    [Abstract: "ifpug"] OR [Abstract: "story point*"] OR
    [Abstract: "user story"] OR [Abstract: "user stories"] OR
    [Abstract: "lines of code"] OR [Abstract: "loc"] OR [Abstract: "kloc"] OR
    [Abstract: "sloc"] OR [Keywords: "effort"] OR [Keywords: "cost"] OR
    [Keywords: "man-hour*"] OR [Keywords: "person-hour*"] OR
    [Keywords: "development time"] OR [Keywords: "duration"] OR
    [Keywords: "schedule"] OR [Keywords: "productivity"] OR
    [Keywords: "size"] OR [Keywords: "sizing"] OR
    [Keywords: "function point*"] OR [Keywords: "feature point*"] OR
    [Keywords: "use case point*"] OR [Keywords: "cosmic"] OR
    [Keywords: "fisma"] OR [Keywords: "nesma"] OR [Keywords: "ifpug"] OR
    [Keywords: "story point*"] OR [Keywords: "user story"] OR
    [Keywords: "user stories"] OR [Keywords: "lines of code"] OR
    [Keywords: "loc"] OR [Keywords: "kloc"] OR [Keywords: "sloc"]]
    AND
    [[Title: "software"] OR [Abstract: "software"] OR [Keywords: "software"]]

Appendix B. List of Analysed Publications

The analysis of this mapping study covered the following publications: [25,26,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,78,81,88].

References

  1. Hussain, I.; Kosseim, L.; Ormandjieva, O. Approximation of COSMIC Functional Size to Support Early Effort Estimation in Agile. Data Knowl. Eng. 2013, 85, 2–14. [Google Scholar] [CrossRef]
  2. Bisikirskienė, L.; Čeponienė, L.; Jurgelaitis, M.; Ablonskis, L.; Grigonytė, E. Compiling Requirements from Models for Early Phase Scope Estimation in Agile Software Development Projects. Appl. Sci. 2023, 13, 12353. [Google Scholar] [CrossRef]
  3. Rivera Ibarra, J.G.; Borrego, G.; Palacio, R.R. Early Estimation in Agile Software Development Projects: A Systematic Mapping Study. Informatics 2024, 11, 81. [Google Scholar] [CrossRef]
  4. Mahmood, Y.; Kama, N.; Azmi, A.; Khan, A.S.; Ali, M. Software Effort Estimation Accuracy Prediction of Machine Learning Techniques: A Systematic Performance Evaluation. Softw.-Pract. Exp. 2022, 52, 39–65. [Google Scholar] [CrossRef]
  5. Robles, G.; Capiluppi, A.; Gonzalez-Barahona, J.M.; Lundell, B.; Gamalielsson, J. Development Effort Estimation in Free/Open Source Software from Activity in Version Control Systems. Empir. Softw. Eng. 2022, 27, 135. [Google Scholar] [CrossRef]
  6. Sánchez-García, Á.J.; González-Hernández, M.S.; Cortés-Verdín, K.; Pérez-Arriaga, J.C. Software Estimation in the Design Stage with Statistical Models and Machine Learning: An Empirical Study. Mathematics 2024, 12, 1058. [Google Scholar] [CrossRef]
  7. Flyvbjerg, B.; Budzier, A.; Lee, J.S.; Keil, M.; Lunn, D.; Bester, D.W. The Empirical Reality of IT Project Cost Overruns: Discovering A Power-Law Distribution. J. Manag. Inf. Syst. 2022, 39, 607–639. [Google Scholar] [CrossRef]
  8. Flyvbjerg, B.; Budzier, A. Why Your IT Project May Be Riskier than You Think. SSRN Electron. J. 2011. [Google Scholar] [CrossRef]
  9. Keil, M.; Mann, J.; Rai, A. Why Software Projects Escalate: An Empirical Analysis and Test of Four Theoretical Models. MIS Q. 2000, 24, 631. [Google Scholar] [CrossRef]
  10. Bloch, M.; Blumberg, S.; Laartz, J. Delivering Large-Scale IT Projects on Time, on Budget, and on Value. Harv. Bus. Rev. 2012, 5, 2–7. [Google Scholar]
  11. Ali, A.; Gravino, C. A Systematic Literature Review of Software Effort Prediction Using Machine Learning Methods. J. Softw. Evol. Process 2019, 31, e2211. [Google Scholar] [CrossRef]
  12. Gautam, S.S.; Singh, V. The State-of-the-art in Software Development Effort Estimation. J. Softw. Evol. Process 2018, 30, e1983. [Google Scholar] [CrossRef]
  13. Jørgensen, M. A Review of Studies on Expert Estimation of Software Development Effort. J. Syst. Softw. 2004, 70, 37–60. [Google Scholar] [CrossRef]
  14. Choetkiertikul, M.; Dam, H.K.; Tran, T.; Pham, T.; Ghose, A.; Menzies, T. A Deep Learning Model for Estimating Story Points. IEEE Trans. Softw. Eng. 2019, 45, 637–656. [Google Scholar] [CrossRef]
  15. Fernandez-Diego, M.; Mendez, E.R.; Gonzalez-Ladron-De-Guevara, F.; Abrahao, S.; Insfran, E. An Update on Effort Estimation in Agile Software Development: A Systematic Literature Review. IEEE Access 2020, 8, 166768–166800. [Google Scholar] [CrossRef]
  16. Tawosi, V.; Moussa, R.; Sarro, F. On the Relationship Between Story Points and Development Effort in Agile Open-Source Software. In Proceedings of the 16th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, Helsinki, Finland, 19–23 September 2022; pp. 183–194. [Google Scholar] [CrossRef]
  17. Boehm, B.; Abts, C.; Brown, A.; Chulani, S.; Clark, B.; Horowitz, E.; Madachy, R.; Reifer, D.; Steece, B. Software Cost Estimation with COCOMO II; Prentice Hall: Hoboken, NJ, USA, 2000. [Google Scholar]
  18. Shepperd, M.; Schofield, C. Estimating Software Project Effort Using Analogies. IEEE Trans. Softw. Eng. 1997, 23, 736–743. [Google Scholar] [CrossRef]
  19. Martínez-Aguilar, S.; Sánchez-García, Á.J.; López-Martín, C.; Octavio Ocharán-Hernández, J. Systematic Literature Review on Effort Estimation by Software Development Life Cycle Phases. IEEE Access 2025, 13, 153340–153358. [Google Scholar] [CrossRef]
  20. Wen, J.; Li, S.; Lin, Z.; Hu, Y.; Huang, C. Systematic Literature Review of Machine Learning Based Software Development Effort Estimation Models. Inf. Softw. Technol. 2012, 54, 41–59. [Google Scholar] [CrossRef]
  21. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  22. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS’20), Red Hook, NY, USA, 6–12 December 2020. [Google Scholar]
  23. Hou, X.; Zhao, Y.; Liu, Y.; Yang, Z.; Wang, K.; Li, L.; Luo, X.; Lo, D.; Grundy, J.; Wang, H. Large Language Models for Software Engineering: A Systematic Literature Review. ACM Trans. Softw. Eng. Methodol. 2024, 33, 220. [Google Scholar] [CrossRef]
  24. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language Models. arXiv 2025. [Google Scholar] [CrossRef]
  25. Li, Y.; Ren, Z.; Wang, Z.; Yang, L.; Dong, L.; Zhong, C.; Zhang, H. Fine-SE: Integrating Semantic Features and Expert Features for Software Effort Estimation. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, Lisbon Portugal, 14–20 April 2024; pp. 1–12. [Google Scholar] [CrossRef]
  26. Fu, M.; Tantithamthavorn, C. GPT2SP: A Transformer-Based Agile Story Point Estimation Approach. IEEE Trans. Softw. Eng. 2023, 49, 611–625. [Google Scholar] [CrossRef]
  27. Yang, J.; Jin, H.; Tang, R.; Han, X.; Feng, Q.; Jiang, H.; Zhong, S.; Yin, B.; Hu, X. Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond. ACM Trans. Knowl. Discov. Data 2024, 18, 160. [Google Scholar] [CrossRef]
  28. Petersen, K.; Vakkalanka, S.; Kuzniarz, L. Guidelines for Conducting Systematic Mapping Studies in Software Engineering: An Update. Inf. Softw. Technol. 2015, 64, 1–18. [Google Scholar] [CrossRef]
  29. Petticrew, M.; Roberts, H. Systematic Reviews in the Social Sciences: A Practical Guide; John Wiley & Sons: Hoboken, NJ, USA, 2008. [Google Scholar]
  30. Alshuqayran, N.; Ali, N.; Evans, R. A Systematic Mapping Study in Microservice Architecture. In Proceedings of the 2016 IEEE 9th International Conference on Service-Oriented Computing and Applications (SOCA), Macau, China, 4–6 November 2016; pp. 44–51. [Google Scholar] [CrossRef]
  31. Di Francesco, P.; Lago, P.; Malavolta, I. Architecting with Microservices: A Systematic Mapping Study. J. Syst. Softw. 2019, 150, 77–97. [Google Scholar] [CrossRef]
  32. Li, Z.; Avgeriou, P.; Liang, P. A Systematic Mapping Study on Technical Debt and Its Management. J. Syst. Softw. 2015, 101, 193–220. [Google Scholar] [CrossRef]
  33. Pedreira, O.; García, F.; Brisaboa, N.; Piattini, M. Gamification in Software Engineering–A Systematic Mapping. Inf. Softw. Technol. 2015, 57, 157–168. [Google Scholar] [CrossRef]
  34. Riccio, V.; Jahangirova, G.; Stocco, A.; Humbatova, N.; Weiss, M.; Tonella, P. Testing Machine Learning Based Systems: A Systematic Mapping. Empir. Softw. Eng. 2020, 25, 5193–5254. [Google Scholar] [CrossRef]
  35. Khalil, M.; Bravo, A.; Vieira, D.; Carvalho, M.M.D. Mapping the AI Landscape in Project Management Context: A Systematic Literature Review. Systems 2025, 13, 913. [Google Scholar] [CrossRef]
  36. Sofian, H.; Yunus, N.A.M.; Ahmad, R. Systematic Mapping: Artificial Intelligence Techniques in Software Engineering. IEEE Access 2022, 10, 51021–51040. [Google Scholar] [CrossRef]
  37. Watson, C.; Cooper, N.; Palacio, D.N.; Moran, K.; Poshyvanyk, D. A Systematic Literature Review on the Use of Deep Learning in Software Engineering Research. ACM Trans. Softw. Eng. Methodol. 2022, 31, 32. [Google Scholar] [CrossRef]
  38. Yang, Y.; Xia, X.; Lo, D.; Grundy, J. A Survey on Deep Learning for Software Engineering. ACM Comput. Surv. 2022, 54, 206. [Google Scholar] [CrossRef]
  39. Fan, A.; Gokkaya, B.; Harman, M.; Lyubarskiy, M.; Sengupta, S.; Yoo, S.; Zhang, J.M. Large Language Models for Software Engineering: Survey and Open Problems. In Proceedings of the 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), Melbourne, Australia, 14–20 May 2023; pp. 31–53. [Google Scholar] [CrossRef]
  40. Wang, S.; Huang, L.; Gao, A.; Ge, J.; Zhang, T.; Feng, H.; Satyarth, I.; Li, M.; Zhang, H.; Ng, V. Machine/Deep Learning for Software Engineering: A Systematic Literature Review. IEEE Trans. Softw. Eng. 2023, 49, 1188–1231. [Google Scholar] [CrossRef]
  41. Marques, N.; Silva, R.R.; Bernardino, J. Using ChatGPT in Software Requirements Engineering: A Comprehensive Review. Future Internet 2024, 16, 180. [Google Scholar] [CrossRef]
  42. Wang, J.; Huang, Y.; Chen, C.; Liu, Z.; Wang, S.; Wang, Q. Software Testing with Large Language Models: Survey, Landscape, and Vision. IEEE Trans. Softw. Eng. 2024, 50, 911–936. [Google Scholar] [CrossRef]
  43. Zhang, Q.; Fang, C.; Ma, Y.; Sun, W.; Chen, Z. A Survey of Learning-based Automated Program Repair. ACM Trans. Softw. Eng. Methodol. 2024, 33, 55. [Google Scholar] [CrossRef]
  44. Alturayeif, N.; Hassine, J.; Ahmad, I. Machine Learning Approaches for Automated Software Traceability: A Systematic Literature Review. J. Syst. Softw. 2025, 230, 112536. [Google Scholar] [CrossRef]
  45. Husein, R.A.; Aburajouh, H.; Catal, C. Large Language Models for Code Completion: A Systematic Literature Review. Comput. Stand. Interfaces 2025, 92, 103917. [Google Scholar] [CrossRef]
  46. Syahputri, I.W.; Budiardjo, E.K.; Putra, P.O.H. Unlocking the Potential of the Prompt Engineering Paradigm in Software Engineering: A Systematic Literature Review. AI 2025, 6, 206. [Google Scholar] [CrossRef]
  47. Czarnacka-Chrobot, B. AI Applications in Software Functional Size Measurement: A Systematic Literature Review. J. Syst. Softw. 2026, 232, 112686. [Google Scholar] [CrossRef]
  48. Hankar, M.; Kasri, M.; Beni-Hssane, A. A Comprehensive Overview of Topic Modeling: Techniques, Applications and Challenges. Neurocomputing 2025, 628, 129638. [Google Scholar] [CrossRef]
  49. Sehra, S.K.; Brar, Y.S.; Kaur, N.; Sehra, S.S. Research Patterns and Trends in Software Effort Estimation. Inf. Softw. Technol. 2017, 91, 1–21. [Google Scholar] [CrossRef]
  50. De Vito, G.; Di Martino, S.; Ferrucci, F.; Gravino, C.; Palomba, F. LLM-Based Automation of COSMIC Functional Size Measurement From Use Cases. IEEE Trans. Softw. Eng. 2025, 51, 1500–1523. [Google Scholar] [CrossRef]
  51. Molla, Y.S.; Alemneh, E.; Yimer, S.T. COSMIC-Based Early Software Size Estimation Using Deep Learning and Domain-Specific BERT. IEEE Access 2025, 13, 28463–28475. [Google Scholar] [CrossRef]
  52. Sharma, A.; Kumar Tripathi, A. Evaluating User Story Quality with LLMs: A Comparative Study. J. Intell. Inf. Syst. 2025, 63, 1423–1451. [Google Scholar] [CrossRef]
  53. Ronanki, K.; Cabrero-Daniel, B.; Berger, C. ChatGPT as a Tool for User Story Quality Evaluation: Trustworthy Out of the Box? In Agile Processes in Software Engineering and Extreme Programming–Workshops; Kruchten, P., Gregory, P., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2024; Volume 489, pp. 173–181. [Google Scholar] [CrossRef]
  54. De Bortoli Fávero, E.M.; Casanova, D.; Pimentel, A.R. SE3M: A Model for Software Effort Estimation Using Pre-Trained Embedding Models. Inf. Softw. Technol. 2022, 147, 106886. [Google Scholar] [CrossRef]
  55. Alhamed, M.; Storer, T. Evaluation of Context-Aware Language Models and Experts for Effort Estimation of Software Maintenance Issues. In Proceedings of the 2022 IEEE International Conference on Software Maintenance and Evolution (ICSME), Limassol, Cyprus, 3–7 October 2022; pp. 129–138. [Google Scholar] [CrossRef]
  56. Bahi, A.; Gharib, J.; Gahi, Y. Integrating Generative AI for Advancing Agile Software Development and Mitigating Project Management Challenges. Int. J. Adv. Comput. Sci. Appl. 2024, 15. [Google Scholar] [CrossRef]
  57. Tawosi, V.; Alamir, S.; Liu, X. Search-Based Optimisation of LLM Learning Shots for Story Point Estimation. In Search-Based Software Engineering; Arcaini, P., Yue, T., Fredericks, E.M., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2024; Volume 14415, pp. 123–129. [Google Scholar] [CrossRef]
  58. Ünlü, H.; Tenekeci, S.; Çiftçi, C.; Oral, İ.B.; Atalay, T.; Hacaloğlu, T.; Musaoğlu, B.; Demirörs, O. Predicting Software Functional Size Using Natural Language Processing: An Exploratory Case Study. In Proceedings of the 2024 50th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), Paris, France, 28–30 August 2024; pp. 188–193. [Google Scholar] [CrossRef]
  59. Arman, A.; Di Reto, E.; Mecella, M.; Santucci, G. An Approach for Software Development Effort Estimation Using ChatGPT. In Proceedings of the 2023 IEEE International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), Paris, France, 14–16 June 2023; pp. 1–7. [Google Scholar] [CrossRef]
  60. Yalçıner, B.; Dinçer, K.; Karaçor, A.G.; Efe, M.Ö. Enhancing Agile Story Point Estimation: Integrating Deep Learning, Machine Learning, and Natural Language Processing with SBERT and Gradient Boosted Trees. Appl. Sci. 2024, 14, 7305. [Google Scholar] [CrossRef]
  61. Amasaki, S. On Effectiveness of Further Pre-training on BERT Models for Story Point Estimation. In Proceedings of the 19th International Conference on Predictive Models and Data Analytics in Software Engineering, San Francisco, CA, USA, 8 December 2023; 2023; pp. 49–53. [Google Scholar] [CrossRef]
  62. Zhao, Z.; Jiang, H.; Zhao, R.; He, B. Emergence of A Novel Domain Expert: A Generative AI-based Framework for Software Function Point Analysis. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27–31 October 2024; pp. 2245–2250. [Google Scholar] [CrossRef]
  63. Atoum, I.; Otoom, A.A. Enhancing Software Effort Estimation with Pre-Trained Word Embeddings: A Small-Dataset Solution for Accurate Story Point Prediction. Electronics 2024, 13, 4843. [Google Scholar] [CrossRef]
  64. Cabrero-Daniel, B.; Fazelidehkordi, Y.; Nouri, A. How Can Generative AI Enhance Software Management? Is It Better Done than Perfect? In Generative AI for Effective Software Development; Nguyen-Duc, A., Abrahamsson, P., Khomh, F., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2024; pp. 235–255. [Google Scholar] [CrossRef]
  65. Miranda, D.; Palma, D.; Fernández, A.; Noel, R.; Cechinel, C.; Munoz, R. Enhancing Agile Project Management Education with AI: ChatGPT-4’s Role in Evaluating Student Contributions. In Proceedings of the 2024 43rd International Conference of the Chilean Computer Science Society (SCCC), Temuco, Chile, 18–22 November 2024; pp. 1–4. [Google Scholar] [CrossRef]
  66. Pavlič, L.; Saklamaeva, V.; Beranič, T. Can Large-Language Models Replace Humans in Agile Effort Estimation? Lessons from a Controlled Experiment. Appl. Sci. 2024, 14, 12006. [Google Scholar] [CrossRef]
  67. Permana, B.; Ferdiana, R.; Pratama, A. Large Language Model Employment for Story Point Estimation Problems in AGILE Development. In Proceedings of the 2024 International Conference on Electrical Engineering and Computer Science (ICECOS), Palembang, Indonesia, 14–15 August 2024; pp. 391–398. [Google Scholar] [CrossRef]
  68. Calikli, G.; Alhamed, M. Impact of Request Formats on Effort Estimation: Are LLMs Different Than Humans? Proc. ACM Softw. Eng. 2025, 2, 1114–1135. [Google Scholar] [CrossRef]
  69. Ünlü, H.; Tenekeci, S.; Kennouche, D.E.; Demirörs, O. Automating Software Size Measurement with Language Models: Insights from Industrial Case Studies. J. Syst. Softw. 2026, 231, 112638. [Google Scholar] [CrossRef]
  70. Valdés-Souto, F.; Torres-Robledo, D. Is It Possible to Use ChatGPT to Perform Measurements Using the COSMIC Method? Program. Comput. Softw. 2024, 50, 674–689. [Google Scholar] [CrossRef]
  71. Saklamaeva, V.; Pavlič, L. Effort Estimation in Agile Software Development-Is AI a Resourceful Addition? In Proceedings of the Eleventh Workshop on Software Quality Analysis, Monitoring, Improvement, and Applications, Novi Sad, Serbia, 18–19 June 2024; Volume 3845, p. 21. [Google Scholar]
  72. Islam, M.R.; Sandborn, P. Multimodal Generative AI for Story Point Estimation. In Proceedings of the 2025 IEEE Conference on Artificial Intelligence (CAI), Santa Clara, CA, USA, 25–27 March 2025; pp. 659–662. [Google Scholar] [CrossRef]
  73. Laqrichi, S. A Hybrid Framework for COSMIC Measurement: Combining Large Language Models with a Rule-Based System. In Proceedings of the Joint Proceedings of the 33rd International Workshop on Software Measurement and the 18th International Conference on Software Process and Product Measurement (IWSM-MENSURA 2024), Montréal, CA, Canada, 7–9 October 2024; Volume 3852. [Google Scholar]
  74. Maiga, S.; Bilgaiyan, S.; Sagnika, S. Predicting Software Effort Using BERT-based Word Embeddings. Int. J. Syst. Assur. Eng. Manag. 2025, 16, 1728–1742. [Google Scholar] [CrossRef]
  75. Foody, G.M. Challenges in the real world use of classification accuracy metrics: From recall and precision to the Matthews correlation coefficient. PLoS ONE 2023, 18, e0291908. [Google Scholar] [CrossRef] [PubMed]
  76. Luo, X.; Xue, Y.; Xing, Z.; Sun, J. PRCBERT: Prompt Learning for Requirement Classification using BERT-based Pretrained Language Models. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, ASE ’22, Rochester, MI, USA, 10–14 October 2022; pp. 1–13. [Google Scholar] [CrossRef]
  77. Guo, W.; Ling, H.; Pan, L. BERT-Prompt Based Equipment to Support Domain Sentence Vector Training. J. Comput. Commun. 2025, 13, 289–310. [Google Scholar] [CrossRef]
  78. Alexander, L.; Jayadi, R. Machine Learning for Story Point Estimation: Do Large Language Models Outperform Traditional Methods? J. Theor. Appl. Inf. Technol. 2024, 102, 7387–7399. [Google Scholar]
  79. Tawosi, V.; Al-Subaihin, A.; Moussa, R.; Sarro, F. A Versatile Dataset of Agile Open Source Software Projects. In Proceedings of the 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), Pittsburgh, PA, USA, 23–24 May 2022; pp. 707–711. [Google Scholar] [CrossRef]
  80. Wagner, S.; Barón, M.M.; Falessi, D.; Baltes, S. Towards Evaluation Guidelines for Empirical Studies Involving LLMs. In Proceedings of the 2025 IEEE/ACM International Workshop on Methodological Issues with Empirical Studies in Software Engineering (WSESE), Ottawa, ON, USA, 27 April–3 May 2025; IEEE: New York, NY, USA, 2025; pp. 24–27. [Google Scholar] [CrossRef]
  81. Cheemaa, A.S.; Azhar, M.; Arif, F.; Ul Haq, Q.M.; Sohail, M.; Iqbal, A. EGPT-SPE: Story Point Effort Estimation Using Improved GPT-2 by Removing Inefficient Attention Heads. Appl. Intell. 2025, 55, 994. [Google Scholar] [CrossRef]
82. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 2023, 55, 248. [Google Scholar] [CrossRef]
83. Singh, C.; Inala, J.P.; Galley, M.; Caruana, R.; Gao, J. Rethinking Interpretability in the Era of Large Language Models. arXiv 2024, arXiv:2402.01761. [Google Scholar] [CrossRef]
  84. Wani, S.G.; Khurana, T.; Ellison, D.; Ziegler, M.; Upton, J. On-Premise vs. Cloud: Generative AI Total Cost of Ownership. Available online: https://lenovopress.lenovo.com/lp2225.pdf (accessed on 30 September 2025).
85. Nie, Z.; Dave, L.; Lewis, R. Privacy Considerations for LLMs and Other AI Models: An Input and Output Privacy Approach. Front. Commun. Netw. 2025, 6. [Google Scholar] [CrossRef]
86. Zan, D.; Chen, B.; Zhang, F.; Lu, D.; Wu, B.; Guan, B.; Wang, Y.; Lou, J.G. Large Language Models Meet NL2Code: A Survey. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 7443–7464. [Google Scholar] [CrossRef]
  87. Wang, C.; Yang, Y.; Li, R.; Sun, D.; Cai, R.; Zhang, Y.; Fu, C. Adapting LLMs for Efficient Context Processing through Soft Prompt Compression. In Proceedings of the International Conference on Modeling, Natural Language Processing and Machine Learning, CMNM 2024, Xi'an, China, 17–19 May 2024; pp. 91–97. [Google Scholar] [CrossRef]
  88. Da Silva Neo, G.; Beltrão Moura, J.A.; De Almeida, H.O.; Neo, A.V.B.D.S.; Freitas Júnior, O.D.G. Writing Better User Stories and Estimates Story Point with Machine Learning and Natural Language Processing. SN Comput. Sci. 2025, 6, 821. [Google Scholar] [CrossRef]
  89. Aramali, V.; Cho, N.; Pande, F.; Al-Mhdawi, M.; Ojiako, U.; Qazi, A. Generative AI in Project Management: Impacts on Corporate Values, Employee Perceptions, and Organizational Practices. Proj. Leadersh. Soc. 2025, 6, 100191. [Google Scholar] [CrossRef]
  90. Baek, S.; Park, C.Y.; Jung, W. Automated Safety Risk Management Guidance Enhanced by Retrieval-Augmented Large Language Model. Autom. Constr. 2025, 176, 106255. [Google Scholar] [CrossRef]
  91. Collier, Z.A.; Gruss, R.J.; Abrahams, A.S. How Good Are Large Language Models at Product Risk Assessment? Risk Anal. 2025, 45, 766–789. [Google Scholar] [CrossRef]
  92. Diemert, S.; Weber, J.H. Can Large Language Models Assist in Hazard Analysis? In Computer Safety, Reliability, and Security, Proceedings of the SAFECOMP 2023 Workshops, Toulouse, France, 19–22 September 2023; Guiochet, J., Tonetta, S., Schoitsch, E., Roy, M., Bitsch, F., Eds.; Springer: Berlin/Heidelberg, Germany, 2023; pp. 410–422. [Google Scholar]
  93. Dokas, I.M. From Hallucinations to Hazards: Benchmarking LLMs for Hazard Analysis in Safety-Critical Systems. Saf. Sci. 2026, 194, 107056. [Google Scholar] [CrossRef]
  94. Felicetti, A.M.; Cimino, A.; Mazzoleni, A.; Ammirato, S. Artificial Intelligence and Project Management: An Empirical Investigation on the Appropriation of Generative Chatbots by Project Managers. J. Innov. Knowl. 2024, 9, 100545. [Google Scholar] [CrossRef]
  95. Greco, M.; Corvello, V. Artificial Intelligence in Project Management: An Empirical Study of Project Managers’ Appropriation of Generative Chatbots. In Advanced Perspectives and Trends in Digital Transformation of Firms, Networks, and Society, Proceedings of the 2nd International Conference of the Digital Transformation Society, Naples, Italy, 23–24 May 2024; Schiavone, F., Omrani, N., Gabteni, H., Eds.; Springer: Berlin/Heidelberg, Germany, 2024; pp. 107–121. [Google Scholar]
  96. Hannousse, A.; Yahiouche, S. Securing Microservices and Microservice Architectures: A Systematic Mapping Study. Comput. Sci. Rev. 2021, 41, 100415. [Google Scholar] [CrossRef]
  97. Kitchenham, B.A.; Budgen, D.; Brereton, P. Evidence-Based Software Engineering and Systematic Reviews; CRC Press: Boca Raton, FL, USA, 2016. [Google Scholar]
98. Rožanc, I.; Mernik, M. Chapter Three: The Screening Phase in Systematic Reviews: Can We Speed Up the Process? In Advances in Computers; Elsevier: Amsterdam, The Netherlands, 2021; Volume 123, pp. 115–191. [Google Scholar] [CrossRef]
  99. Swacha, J.; Gracel, M. Supporting Serious Game Development with Generative Artificial Intelligence: Mapping Solutions to Lifecycle Stages. Appl. Sci. 2025, 15, 11606. [Google Scholar] [CrossRef]
Figure 1. Publication counts by source.
Figure 2. Publication counts per venue type and publication year. The values above the bars reflect the yearly totals; the totals per venue type are given at the end of the legend labels.
Figure 3. Publication counts per country.
Figure 4. TreeMap of publications based on the keywords.
Figure 5. Keyword co-occurrence map. The circle size is proportional to the keyword count, the circle colour reflects the major grouping (as in Figure 4), and the link thickness and font colour intensity are proportional to the co-occurrence count.
Figure 6. Keyword total citations: WoS vs. Scopus.
Figure 7. Keyword frequency vs. total citations.
Figure 8. Publication counts per type of study contribution.
Figure 9. Publication counts per estimation target and its unit of measurement. The values above the bars indicate publication counts for a given estimation target.
Figure 10. Trends per estimation target.
Figure 11. Publication counts per LLM.
Figure 12. Publication counts per reference technique used in comparisons.
Figure 13. Publication counts for LLM families and other techniques referenced in comparisons.
Figure 14. Relationships of publication counts between model family, prompt engineering, and fine-tuning. Colours denote the model-family categories.
Figure 15. Publication counts per artefact used as the main input for LLMs.
Figure 16. Publication counts per evaluation measure used.
Figure 17. Relationships of publication counts between study type, estimation target, and type of input for the LLM. The unlabelled Target category is Productivity; its label is omitted from the plot due to its low count. Colours denote the Target categories.
Figure 18. Relationships of publication counts between model family, estimation target, and evaluation measure. The unlabelled Target category is Productivity; its label is omitted from the plot due to its low count. Colours denote the Target categories.
Table 1. Overview of the recent literature reviews of ML/AI/DL/LLM applications in software engineering.

| Study | Year | Model Types | SE Area | Type of Study 1 | Time Frame | Number of Papers |
|-------|------|-------------|---------|-----------------|------------|------------------|
| [36] | 2022 | ML, DL | Various SE phases | SMS | 2016–2021 | 60 |
| [37] | 2022 | DL | Various SE tasks | SLR | 2014–2019 | 128 |
| [38] | 2022 | ML, DL | Various SE tasks | LS | 2015–2020 | 142 |
| [39] | 2023 | LLMs | Various SE tasks | LS | 2019–2023 | N/A |
| [40] | 2023 | ML, DL | Various SE tasks | SLR | 2009–2020 | 1428 |
| [23] | 2024 | LLMs | Various SE tasks | SLR | 2017–2024 | 395 |
| [41] | 2024 | LLMs | Requirements engineering | LR | 2022–2024 | 22 |
| [3] | 2024 | ML | Early-stage estimation | SMS | 2012–2023 | 18 |
| [42] | 2024 | LLMs | Software testing | LR | 2020–2023 | 102 |
| [43] | 2024 | ML, DL | Automated program repair | LR | 2016–2022 | 112 |
| [44] | 2025 | ML, DL, LLMs | Software traceability | SLR | 2014–2024 | 59 |
| [45] | 2025 | LLMs | Code completion | SLR | 2021–2024 | 23 |
| [35] | 2025 | ML, AI, LLMs | Project management | SLR | 2007–2025 | 27 |
| [46] | 2025 | LLMs | Various SE tasks | SLR | 2020–2025 | 42 |
| [47] | 2026 | ML, DL, LLMs | Software size | SLR | 2010–2025 | 39 |
| Our | 2025 | LLMs | Early-stage estimation | SMS | 2022–2025 2 | 30 |

1 As classified by the authors in the paper title or inside the full text: LS = literature survey, LR = literature review, SLR = systematic literature review, SMS = systematic mapping study. 2 See Section 3.4 for details.
Table 2. Search fields and raw result counts per database (after language filtering).

| Database | Search Fields | Count |
|----------|---------------|-------|
| ACM | Title, Abstract, Keywords | 41 |
| IEEE | | 177 |
| Scopus | Title, Abstract, Keywords | 189 |
| Springer | | 760 |
| WoS | | 156 |
| Total | | 1296 |
Table 3. Publication counts of search results after particular stages.

| Stage | Change | Count |
|-------|--------|-------|
| Raw search total | | 1296 |
| After deduplication | −195 | 1101 |
| After title, abstract, and keywords screening | −1026 | 75 |
| After full text screening | −53 | 22 |
| After snowballing and manual search | +8 | 30 |
Table 4. Publication counts per venue.

| Venue | Venue Type | Count |
|-------|------------|-------|
| Applied Sciences | Journal | 2 |
| IEEE Transactions on Software Engineering | Journal | 2 |
| Applied Intelligence | Journal | 1 |
| Electronics | Journal | 1 |
| IEEE Access | Journal | 1 |
| Information and Software Technology | Journal | 1 |
| International Journal of Advanced Computer Science and Applications | Journal | 1 |
| International Journal of System Assurance Engineering and Management | Journal | 1 |
| Journal of Intelligent Information Systems | Journal | 1 |
| Journal of Systems and Software | Journal | 1 |
| Journal of Theoretical and Applied Information Technology | Journal | 1 |
| Proceedings of the ACM on Software Engineering | Journal | 1 |
| Programming and Computer Software | Journal | 1 |
| SN Computer Science | Journal | 1 |
| Agile Processes in Software Engineering and Extreme Programming | Conference | 1 |
| Conference on Artificial Intelligence | Conference | 1 |
| Euromicro Conference on Software Engineering and Advanced Applications | Conference | 1 |
| International Conference of the Chilean Computer Science Society | Conference | 1 |
| International Conference on Automated Software Engineering | Conference | 1 |
| International Conference on Electrical Engineering and Computer Science | Conference | 1 |
| International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises | Conference | 1 |
| International Conference on Predictive Models and Data Analytics in Software Engineering | Conference | 1 |
| International Conference on Software Engineering | Conference | 1 |
| International Conference on Software Maintenance and Evolution | Conference | 1 |
| International Workshop on Software Measurement and International Conference on Software Process and Product Measurement | Conference | 1 |
| Search-Based Software Engineering | Conference | 1 |
| Workshop on Software Quality Analysis, Monitoring, Improvement, and Applications | Conference | 1 |
| Generative AI for Effective Software Development | Book chapter | 1 |
Table 5. Publication counts per institution.

| Institution | City | Country | Count |
|-------------|------|---------|-------|
| Beihang University | Beijing | China | 2 |
| Chalmers University of Technology | Gothenburg | Sweden | 2 |
| University of Glasgow | Glasgow | United Kingdom | 2 |
| University of Gothenburg | Gothenburg | Sweden | 2 |
| Univerza v Mariboru | Maribor | Slovenia | 2 |
| Applied Behaviour Systems Ltd (Hexis) | Worcester | United Kingdom | 1 |
| Applied Science Private University | Amman | Jordan | 1 |
| Arts et Métiers Campus de Rabat | Rabat | Morocco | 1 |
| Atilim University | Ankara | Turkey | 1 |
| Bahir Dar University | Bahir Dar | Ethiopia | 1 |
| Bilgi Grubu | Izmir | Turkey | 1 |
| Bina Nusantara University | Jakarta | Indonesia | 1 |
| COMSATS University | Islamabad | Pakistan | 1 |
| École de Technologie Supérieure | Montreal | Canada | 1 |
| École Nationale des Sciences Appliquées | Kenitra | Morocco | 1 |
| Federal Institute of Alagoas | Viçosa | Brazil | 1 |
| Federal Institute of Mato Grosso do Sul | Corumbá | Brazil | 1 |
| Federal University of Alagoas | Maceió | Brazil | 1 |
| Federal University of Campina Grande | Campina Grande | Brazil | 1 |
| George Mason University | Fairfax | USA | 1 |
| Grand Valley State University | Allendale | USA | 1 |
| Hacettepe University | Ankara | Turkey | 1 |
| Hong Kong Shue Yan University | Hong Kong | China | 1 |
| Hortiai Pty Ltd. | Canberra | Australia | 1 |
| Indian Institute of Technology (BHU) | Varanasi | India | 1 |
| Institute of Research in Applied Mathematics and Systems | Mexico City | Mexico | 1 |
| Izmir Institute of Technology | Izmir | Turkey | 1 |
| Izmir Yüksek Teknoloji Enstitüsü | Izmir | Turkey | 1 |
| Mohammed V University in Rabat | Rabat | Morocco | 1 |
| Monash University | Melbourne | Australia | 1 |
| Nanjing University | Nanjing | China | 1 |
| National Institute of Informatics | Tokyo | Japan | 1 |
| National University of Sciences and Technology | Islamabad | Pakistan | 1 |
| Okayama Prefectural University | Soja | Japan | 1 |
| Ostim Technical University | Ankara | Turkey | 1 |
| Philadelphia University | Amman | Jordan | 1 |
| Sapienza Università di Roma | Rome | Italy | 1 |
| School of Computer Engineering | Bhubaneswar | India | 1 |
| Siskon Software & Automation | Izmir | Turkey | 1 |
| Universidad Nacional Autónoma de México | Mexico City | Mexico | 1 |
| Universidad de Valparaiso | Valparaiso | Chile | 1 |
| Universidade Federal de Santa Catarina | Florianopolis | Brazil | 1 |
| Universitas Gadjah Mada | Yogyakarta | Indonesia | 1 |
| University of Maryland | College Park | USA | 1 |
| University of Ottawa | Ottawa | Canada | 1 |
| Università degli Studi di Napoli Federico II | Naples | Italy | 1 |
| Università degli Studi di Salerno | Salerno | Italy | 1 |
| Volvo Cars | Gothenburg | Sweden | 1 |
| Wollo University | Kombolcha | Ethiopia | 1 |
| ZAQC | Beijing | China | 1 |
| Non-affiliated | Curitiba | Brazil | 1 |
| Non-affiliated | Pato Branco | Brazil | 1 |
Table 6. Citation counts for primary studies reported in publication databases. For clarity, only studies with at least one citation were included.
StudyYearACMIEEEScopusSpringerWoS
[26]2023414532
[53]202412246
[54]20221209
[55]202210106
[56]202415
[25]2024528
[57]20241451
[51]2025431
[58]2024231
[59]202323
[60]202441
[50]20250211
[61]2023111
[52]20250111
[62]20241011
[63]202411
[64]202411
[65]2024100
[66]202410
[67]202410
Table 7. Reported supportive techniques.

| Fine-Tuning | Prompt Engineering: Yes | Prompt Engineering: No | Total |
|-------------|-------------------------|------------------------|-------|
| Yes | 1 | 8 | 9 |
| No | 14 | 7 | 21 |
| Total | 15 | 15 | 30 |
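For readers who wish to recompute such a cross-tabulation from coded study data, the short Python sketch below shows one way to do it with pandas. It is a minimal illustration of ours, not part of any primary study's tooling; the data frame contents and column names are hypothetical and simply hard-coded to mirror the counts reported in Table 7.

```python
# Minimal sketch (ours, for illustration only): reproducing the Table 7
# cross-tabulation of supportive techniques from per-study coding data.
# In practice the data would come from the extraction sheet of the mapping study.
import pandas as pd

studies = pd.DataFrame({
    # 9 of the 30 primary studies report fine-tuning, 21 do not.
    "fine_tuning": ["Yes"] * 9 + ["No"] * 21,
    # Of the 9 fine-tuning studies, 1 also uses prompt engineering;
    # of the remaining 21 studies, 14 use prompt engineering.
    "prompt_engineering": ["Yes"] * 1 + ["No"] * 8 + ["Yes"] * 14 + ["No"] * 7,
})

# margins=True adds the row totals (9, 21), column totals (15, 15),
# and the grand total (30), matching Table 7.
table7 = pd.crosstab(
    studies["fine_tuning"],
    studies["prompt_engineering"],
    margins=True,
    margins_name="Total",
)
print(table7)
```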
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
