1. Introduction
Artificial intelligence (AI) systems have become deeply embedded in scientific research, industry, and public-sector decision-making, driving advances across domains such as healthcare, finance, transportation, and climate science [1,2,3]. Recent progress has been fueled by the rapid scaling of machine learning models, increased availability of data, and the proliferation of specialized computational hardware [4]. However, this expansion has also raised growing concerns regarding the energy consumption and carbon footprint associated with the development, deployment, and large-scale use of AI systems [5,6].
Early studies of AI-related environmental impacts primarily focused on the substantial energy required to train large models, particularly in deep learning and natural language processing [7,8]. More recent work, however, has demonstrated that the environmental footprint of AI extends far beyond training alone. Inference-phase computation, continuous deployment in user-facing applications, and embodied emissions from hardware manufacturing and turnover often represent comparable or even dominant sources of impact [9,10]. As AI systems increasingly operate at a global scale and in real-time environments, their cumulative contribution to energy demand and greenhouse-gas emissions has become a non-negligible component of the broader digital carbon footprint.
In response, the emerging field of Green AI has promoted energy-aware model design, improved measurement and reporting practices, and optimization techniques aimed at reducing per-task energy consumption [11,12,13]. While these efforts have advanced awareness and methodological rigor, they largely remain fragmented. Existing studies tend to focus on isolated components of the AI lifecycle, rely on heterogeneous system boundaries and metrics, or emphasize voluntary disclosure rather than enforceable or adaptive mitigation. As a result, it remains difficult to translate empirical findings into coordinated, system-level strategies for reducing aggregate environmental impact.
This limitation is increasingly problematic given the evolving socio-technical context in which AI systems operate [14]. Energy systems are transitioning toward higher shares of renewable generation, introducing temporal and geographic variability in electricity carbon intensity. AI infrastructures are geographically distributed across heterogeneous hardware platforms, and efficiency gains are frequently offset by rebound effects, whereby reduced per-task energy costs lead to expanded deployment and higher total emissions [15,16,17]. These dynamics suggest that sustainability challenges in AI are not static optimization problems, but adaptive control problems unfolding under uncertainty.
Against this backdrop, there is a clear need for work that goes beyond cataloguing impacts or proposing isolated efficiency improvements [13,18]. Specifically, the literature lacks (i) a systematic synthesis that jointly examines operational energy use, carbon emissions, and embodied life-cycle impacts across the AI lifecycle, and (ii) conceptual frameworks that explicitly translate these empirical insights into coordinated, adaptive mitigation mechanisms. To date, no prior study integrates these dimensions while linking sustainability assessment to control-oriented system design.
The contribution of this work is therefore twofold. First, we conduct a systematic review that synthesizes empirical and analytical evidence on the energy consumption and carbon footprint of contemporary AI systems across their lifecycle, identifying key drivers, methodological limitations, and sources of variability. Second, building directly on the patterns identified in this review, we propose a conceptual, system-level framework that illustrates how sustainability constraints could be operationalized within AI systems through coordinated, carbon-aware control mechanisms under dynamic and uncertain conditions.
By explicitly connecting empirical evidence to a hypothesis-driven control framework, this study advances the literature beyond descriptive assessment toward an integrated research agenda for sustainable and climate-resilient AI development and deployment.
3. Materials and Methods
3.1. Research Question and PICOS Framework
This systematic review was registered in PROSPERO (registration number: CRD420251276494) and conducted in accordance with PRISMA 2020 guidelines [29] and AMSTAR-2 [30] methodological standards (Figure 1). A completed PRISMA checklist is provided as Supplementary File S1. The final literature search was completed on 19 December 2025. The research question was formulated using the PICOS framework to ensure methodological rigor and relevance to the environmental assessment of artificial intelligence (AI) systems. Specifically, we aimed to evaluate whether the development, training, deployment, and large-scale use of AI and machine learning models (Population/Exposure) are associated with increased energy consumption and carbon emissions (Outcomes), compared with alternative computational configurations, deployment strategies, or baseline scenarios reported in the literature (Comparator). Eligible studies assessed at least one dimension of the environmental impact of AI, including operational energy consumption, computational cost, carbon footprint, greenhouse-gas emissions, or life-cycle emissions associated with AI hardware, software pipelines, or deployment infrastructures. We considered empirical measurement studies, life-cycle assessments, modelling and simulation studies, comparative experimental benchmarks, and systematic or exploratory reviews with quantitative or analytical components (Study design). Studies focused exclusively on AI applications without assessment of energy use or environmental impact, as well as purely conceptual or narrative work lacking empirical or analytical evaluation, were excluded.
Subgroup analyses considered AI system characteristics (e.g., model type, scale, and application domain), computational context (e.g., cloud-based vs. on-device deployment, hardware configuration), methodological approach (e.g., direct energy measurement, modelling, life-cycle assessment), and geographic context of electricity generation as potential sources of heterogeneity. Through this comprehensive approach, the review aimed to synthesize current evidence on the energy and carbon implications of contemporary AI systems, identify key drivers of environmental impact across the AI lifecycle, and inform the development of energy-aware, transparent, and sustainable AI practices.
3.2. Eligibility Criteria
Studies were excluded if they met any of the following conditions: (i) commentary articles, policy notes, or purely conceptual papers without empirical data or analytical assessment; (ii) review articles (systematic or narrative), editorials, or conference abstracts lacking original quantitative results; (iii) duplicate publications derived from the same dataset, experimental setup, modelling framework, or computational pipeline; or (iv) studies judged to provide insufficient methodological detail, non-transparent assumptions, or a high risk of bias that precluded reliable interpretation of results.
Additional exclusions included studies that did not quantify the environmental impact of artificial intelligence systems, such as those reporting AI applications without measurable energy consumption, computational cost, or carbon footprint outcomes. Analyses focused exclusively on climate or sustainability domains where AI served solely as a predictive or analytical tool, without assessment of the energy or emissions associated with AI development, training, inference, or deployment, were also excluded. Furthermore, studies lacking comparative evaluation (e.g., absence of comparisons across model scales, hardware configurations, software pipelines, deployment contexts, or baseline versus optimized scenarios) were not considered eligible, as they did not allow meaningful synthesis of drivers influencing energy use or carbon emissions.
3.3. Information Sources
A comprehensive and systematic literature search was conducted using major scientific databases, including Web of Science, PubMed, and Scopus, without restrictions on publication date or language. These databases were selected to ensure broad coverage of the scientific literature related to artificial intelligence, machine learning, and deep learning, as well as studies assessing energy consumption, computational cost, carbon footprint, and greenhouse-gas emissions associated with AI systems.
To complement the database searches, the reference lists of all eligible articles were manually screened to identify additional relevant studies not captured by the initial search strategy, including foundational methodological papers, life-cycle assessment studies, and domain-specific applications evaluating the environmental impact of AI. Gray literature sources and non-peer-reviewed reports were not systematically searched and were included only when cited within eligible peer-reviewed articles and deemed necessary to contextualize methodological frameworks or reporting standards.
3.4. Search Methods for Identification of Studies
The search strategy combined controlled vocabulary and free-text terms related to artificial intelligence and its environmental impacts, with a specific focus on energy consumption and carbon emissions. Search terms covered AI systems and modelling approaches, energy and computational cost metrics, carbon footprint and greenhouse-gas emissions, and climate and sustainability contexts. Search strings were adapted to the syntax and field requirements of each database using Boolean operators and truncation to ensure comprehensive retrieval of relevant studies. The search query was constructed by combining the following terms using Boolean operators: ((“artificial intelligence” OR “machine learning” OR “deep learning” OR “large-scale AI” OR “foundation model” OR “large language model” OR LLM*) AND (“energy consumption” OR “energy footprint” OR “computational cost” OR “training cost” OR “inference cost” OR “electricity consumption”) AND (“carbon footprint” OR “carbon emission*” OR “greenhouse gas emission*” OR “CO2 emission*” OR “carbon cost”) AND (“climate change” OR “climate mitigation” OR “environmental sustainability”)). The complete search strategies for all databases are provided in Supplementary File S2.
Two reviewers independently assessed study eligibility at both the title and abstract screening stage and the full-text review stage. Disagreements were resolved through discussion and consensus, with consultation of a third reviewer when necessary. No language restrictions were applied.
3.5. Data Extraction and Data Items
Two authors independently extracted data from all eligible studies. For each included article, key characteristics were recorded, including first author, year of publication, study scope or application domain, type of artificial intelligence system assessed, and study design. Study designs included empirical measurement studies, life-cycle assessments, comparative experimental analyses, modelling and simulation studies, and systematic or exploratory reviews with quantitative or analytical components. Discrepancies in data extraction were resolved through discussion and consensus. Record management, including duplicate removal and eligibility tracking, was performed using Microsoft Excel®.
Primary variables extracted included indicators of computational and energy demand, such as energy consumption, electricity use, execution time, computational cost, and hardware utilization (Central Processing Unit (CPU), Graphics Processing Unit (GPU), or accelerator usage), as well as reported measures of energy efficiency. Carbon-related variables included carbon footprint, greenhouse-gas emissions, carbon dioxide equivalent (CO2eq) emissions, and life-cycle emissions where available. When reported, information on electricity grid characteristics or emission factors used to estimate carbon emissions was also recorded.
Additional data items captured methodological characteristics relevant to environmental assessment, including the type of measurement or estimation approach (direct energy logging, modelling, or life-cycle assessment), system boundaries considered (operational-only versus cradle-to-grave), and the computational context (cloud-based, on-device, or hybrid deployment). Where applicable, comparative dimensions such as alternative model configurations, software pipelines, hardware platforms, or deployment scenarios were extracted.
Further information included temporal scope, scale of analysis, uncertainty assessment methods, and author-reported limitations. Potential sources of bias (such as reliance on single experimental setups, lack of comparative baselines, incomplete reporting of assumptions, or undisclosed funding or conflicts of interest) were documented to support interpretation of heterogeneity and robustness across studies.
3.6. Quality Appraisal and Risk of Bias Assessment
The methodological quality and potential risk of bias of the included studies were assessed using a structured appraisal framework informed by AMSTAR-2 principles and established guidance for environmental and computational impact assessments. As AMSTAR-2 is designed to evaluate systematic reviews rather than primary studies, it was used to guide the rigor and transparency of the review process, while study-level quality was evaluated using operational criteria tailored to the objectives of this synthesis.
Specifically, included studies were assessed according to: (i) transparency of energy measurement or estimation methods; (ii) clarity and justification of carbon intensity factors or emission conversion assumptions; (iii) definition of system boundaries (operational-only versus life-cycle assessment); and (iv) sufficiency of methodological detail to enable reproducibility. Studies relying primarily on indirect modelling assumptions without validation, or lacking explicit boundary definitions, were considered at higher risk of bias.
4. Results
To improve analytical clarity and comparability across the heterogeneous literature, the results are organized using a standardized categorization framework that distinguishes between energy use and carbon emissions, as well as between operational and embodied sources of impact. Energy use refers to the direct electricity consumption associated with AI computation, typically reported in kilowatt-hours (kWh) or related units, whereas carbon emissions represent the climate-relevant impacts derived by converting energy use into CO2eq emissions using electricity grid–specific emission factors. Operational impacts encompass energy use and emissions arising from model training and inference during system operation, while embodied impacts capture life-cycle emissions associated with hardware manufacturing, transportation, maintenance, and end-of-life processes. All results reported below are interpreted within this four-category framework, enabling a consistent synthesis of methodological approaches and reported outcomes across studies.
This section is intentionally limited to a descriptive synthesis of the evidence reported in the included studies. No new conceptual models, prescriptive frameworks, or system-level design proposals are introduced here. Interpretative integration and the proposed conceptual framework are developed separately in Section 5 and Section 6.
4.1. Study Selection
A total of 679 records were identified through database searches, including Web of Science (n = 268), PubMed (n = 27), and Scopus (n = 384) (Figure 1). After removal of duplicates and title and abstract screening, 547 records were excluded for not addressing the energy consumption, carbon footprint, or environmental impacts of artificial intelligence systems; lacking quantitative, life-cycle, or comparative analysis; or being commentaries, conceptual papers, or narrative reviews. Subsequently, 132 full-text articles were assessed for eligibility. Of these, 123 full-text articles were excluded for the following main reasons: absence of quantitative assessment of energy consumption or carbon emissions (n = 46); focus on AI applications without explicit evaluation of environmental impact (n = 32); lack of comparative or life-cycle analysis (n = 25); insufficient methodological transparency or incomplete reporting of assumptions (n = 12); and high risk of bias or non-reproducible results (n = 8). One additional study was identified through manual screening of reference lists. In total, 10 studies met the inclusion criteria and were included in the qualitative synthesis [9,31,32,33,34,35,36,37,38,39].
4.2. Methodological Quality of Included Studies
Of the 10 included studies, 6 relied primarily on direct energy measurements using CPU/GPU monitoring or execution-time logging, while 4 estimated energy use through modelling approaches. Explicit electricity grid emission factors were reported in 7 studies, whereas 3 relied on generic or assumed conversion factors. Life-cycle assessment boundaries extending beyond operational energy use were clearly defined in 4 studies, while the remaining studies focused on operational emissions only. Overall, the included studies exhibited moderate methodological quality, with the most common limitations related to incomplete life-cycle boundary specification and heterogeneous carbon accounting approaches.
4.3. Study Characteristics
Table 1 summarizes the principal characteristics of the studies included in this synthesis, which examine the energy consumption and carbon footprint of artificial intelligence (AI) systems across diverse computational contexts and application domains. The included evidence spans methodological frameworks, empirical measurement studies, life-cycle assessments, comparative experimental analyses, modelling studies, and systematic or exploratory reviews. Together, these studies address a wide range of AI systems, including large-scale machine learning experiments, large language models, deep neural networks deployed in cloud and on-device environments, AI hardware accelerators, data-processing pipelines, and domain-specific applications such as healthcare and cybersecurity.
Methodologically, the body of evidence encompasses direct energy measurements obtained from CPU and GPU monitoring, execution-time analysis, and workload profiling, as well as modelling approaches that convert energy use into CO2eq emissions using region-specific electricity grid intensities. Several studies adopt a life-cycle perspective, extending the analysis beyond operational energy consumption to include embodied emissions associated with hardware manufacturing and infrastructure. Comparative studies evaluate alternative software frameworks, pipelines, and system configurations, demonstrating how technical choices can substantially influence energy efficiency and carbon outcomes. Review and framework-based studies synthesize algorithmic, hardware-level, and training-level strategies aimed at reducing the environmental impact of AI, while also highlighting governance mechanisms such as auditing, reporting, and sustainability budgeting.
Across the included studies, the primary objectives are to quantify the environmental costs associated with AI development and deployment, identify key drivers of energy demand and emissions, and explore mitigation strategies that balance computational performance with sustainability considerations. Collectively, the evidence indicates that modern AI systems, particularly large-scale and data-intensive models, can entail substantial energy use and carbon footprints. At the same time, the synthesis highlights considerable variability across systems and contexts, underscoring the potential for targeted technical and organizational interventions to reduce the environmental impact of AI.
As summarized in Table 1, the functional units used to report energy consumption and carbon emissions varied substantially across studies, reflecting differences in system boundaries, analytical scope, and deployment context. Reported units included energy or emissions per training run, per inference workflow, per operational period, or across defined life-cycle stages. Given this heterogeneity, results were interpreted within their respective functional contexts rather than normalized to a single unit, as forced normalization would require additional assumptions and could reduce interpretability. This variability highlights a current limitation in cross-study comparability and underscores the need for standardized functional units in future assessments of AI-related environmental impacts.
4.4. Outcomes
To synthesize the diverse empirical findings across the reviewed studies, Figure 2 provides a comparative breakdown of environmental impacts. Figure 2A highlights the shift from training-dominant footprints in early-stage development to inference and hardware-dominant (embodied) footprints in large-scale deployments. Figure 2B illustrates the critical role of geographic context, showing that identical energy demands result in vastly different carbon outcomes depending on the local electricity grid’s carbon intensity (CO2eq/kWh).
4.4.1. Energy Consumption and Computational Demand of AI Systems
This subsection synthesizes evidence on operational energy use, focusing on the electricity consumed during AI model training and inference, independent of location-specific carbon intensity or life-cycle conversion assumptions.
Across the included studies, energy consumption emerges as a central and consistently reported outcome, highlighting the substantial computational demand associated with contemporary AI systems. Empirical measurement frameworks demonstrate that both training and inference phases contribute meaningfully to overall energy use, although their relative importance varies according to model scale, deployment frequency, and system architecture. Henderson et al. [9] provide one of the most influential methodological contributions by introducing standardized approaches to log energy consumption during machine learning experiments, revealing large variability in GPU- and CPU-related energy use even for comparable tasks. Their findings underscore that experimental design choices, such as hyperparameter tuning strategies and repeated model retraining, can dramatically increase energy demand.
The disproportionate computational burden imposed by large-scale models, specifically deep neural networks and Large Language Models (LLMs), has led to a critical re-evaluation of the AI energy lifecycle. While the “Green AI” discourse historically focused on the massive energy spikes during model training, recent empirical evidence suggests a significant shift toward inference-phase dominance. Jiang et al. [32] demonstrate that LLM-powered systems incur cumulative energy costs across multiple interaction stages, highlighting that under high-frequency, real-world deployment, sustained inference demand can match or even exceed the initial training expenditure.
This transition from training-centric to inference-centric impact has ignited a rigorous scholarly debate concerning measurement methodologies and system boundaries. Currently, the literature exhibits a methodological dichotomy. On one hand, studies utilizing direct hardware-level logging, such as those employing Running Average Power Limit (RAPL) or NVIDIA Management Library (NVML) sensors, provide high-precision, real-time data (e.g., Henderson et al. [9]). While these methods offer granular accuracy, they are often restricted to local hardware environments and frequently fail to account for the complex energy overheads of shared cloud infrastructures. Conversely, software-based estimation frameworks offer scalability by utilizing Floating Point Operation (FLOP)-to-energy conversion models. However, these estimations often overlook the “energy tail” of inference—encompassing data movement, cooling requirements, and network latency—resulting in a systematic underestimation of the total environmental footprint in live production environments.
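To make the measurement distinction concrete, the following minimal Python sketch illustrates NVML-style power sampling of a single GPU via the pynvml bindings; the sampling interval, measurement window, and single-device assumption are illustrative choices rather than the instrumentation used in the reviewed studies.

```python
import time

import pynvml


def sample_gpu_energy_kwh(duration_s: float, interval_s: float = 1.0) -> float:
    """Estimate GPU energy use by integrating sampled power draw over a window."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first visible GPU
        joules = 0.0
        elapsed = 0.0
        while elapsed < duration_s:
            power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts
            joules += power_w * interval_s                             # rectangle-rule integration
            time.sleep(interval_s)
            elapsed += interval_s
        return joules / 3.6e6  # joules -> kWh
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    # Sample for one minute while a workload runs elsewhere on the same GPU.
    print(f"Estimated GPU energy: {sample_gpu_energy_kwh(60.0):.6f} kWh")
```

Sampling-based logging of this kind captures only device-level draw; the shared-cloud, cooling, and networking overheads discussed above fall outside its measurement boundary.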
Beyond measurement techniques, the debate centers on where the “computational boundary” of an AI system should be drawn. Proponents of operational boundaries argue that isolating per-inference energy provides the most actionable data for developers seeking immediate algorithmic optimizations. However, a growing body of scholarship led by proponents of life-cycle assessment (LCA), such as Schneider et al. [37], contends that such narrow boundaries are insufficient. They argue that the environmental shift toward inference is further exacerbated by the embodied emissions of specialized hardware (e.g., high-performance GPUs and Tensor Processing Units (TPUs)) required for low-latency deployment. These embodied impacts are rarely captured in standard “Green AI” training logs, suggesting that current metrics, which favor training-phase transparency, may be structurally incapable of governing the cumulative, distributed energy demand of AI at a global scale.
This indicates that while training is energy-intensive, the cumulative demand of inference in production environments is the primary driver of long-term energy consumption (see Figure 2A). The software-based estimation methods often used in the literature may still struggle to capture the full ‘energy tail’ of these distributed inference phases.
Comparative experimental analyses provide additional insight into how software and pipeline choices influence computational efficiency. Mekouar et al. [33] benchmark data-processing frameworks commonly used in Green AI workflows and show that execution time, CPU utilization, and total energy consumption differ substantially across libraries, even when performing identical tasks. Such results indicate that energy consumption is not solely determined by model architecture but is also shaped by upstream data handling and software engineering decisions. Similarly, Boumendil et al. [31], through a systematic survey, identify algorithmic design, batch size, precision settings, and parallelization strategies as key determinants of energy efficiency in deep learning systems.
On-device deployment scenarios further illustrate the importance of context. The survey on on-device deep learning highlights how resource-constrained environments necessitate energy-aware model design, emphasizing trade-offs between accuracy, latency, and power consumption. Collectively, the evidence demonstrates that energy consumption is a multi-dimensional outcome influenced by model characteristics, computational workflows, and deployment conditions. Rather than being an inherent property of AI, computational demand reflects a series of technical and design choices, many of which remain underexplored in mainstream performance-driven research.
It should be noted that only a subset of the included studies relies on continuous, real-world inference measurements, while others estimate inference-related energy use through scaling assumptions, which introduces uncertainty into cross-study comparisons.
Critically, the current Green AI paradigm is often limited by “carbon tunnel vision”. While carbon footprint (CO2eq) is the primary metric reported, it fails to capture the multi-dimensional environmental impact of AI. Scholars are increasingly calling for the inclusion of:
Water Footprint: The massive freshwater requirements for cooling the data centers that host large-scale model training and inference.
Electronic Waste (E-waste): The environmental cost of specialized AI hardware (e.g., GPUs and TPUs), which has a high turnover rate due to rapid technological shifts.
The contrast between these metrics reveals a fundamental trade-off: carbon-centric metrics are easier to standardize but may inadvertently encourage “efficiency” gains that lead to higher hardware turnover or increased water usage, thus shifting rather than reducing the total environmental burden.
4.4.2. Carbon Footprint and Greenhouse-Gas Emissions
Beyond raw energy consumption, the reviewed literature identifies carbon footprint as the primary metric linking AI systems to broader climate impacts. This subsection distinguishes between operational emissions, derived from energy use under region-specific electricity mixes, and embodied emissions, which arise from hardware manufacturing, infrastructure, and end-of-life processes. A central finding across the included studies is that the carbon footprint of an AI system is not a static attribute of its model architecture, but rather a dynamic function of the local energy infrastructure. As illustrated in Figure 2B, identical energy requirements result in vastly different CO2eq outcomes depending on the carbon intensity of the regional electricity grid. This variability highlights the inherent limitations of “energy-only” metrics for assessing environmental sustainability.
Methodologically, carbon emissions are quantified by converting energy demand into CO2eq using grid-specific emission factors; however, substantial heterogeneity remains in how these factors are applied. Henderson et al. [9] emphasize this contextual dependence, demonstrating that the geographic location of a data center can be as significant as the algorithmic efficiency itself in determining the final environmental cost. Consequently, achieving “Green AI” requires a shift in focus from purely computational optimization to the strategic, carbon-aware geographic placement of training and inference workloads.
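The conversion itself is straightforward; the sensitivity lies in the emission factor. The following short Python sketch illustrates the calculation, using placeholder grid intensities chosen only for illustration (they are not values reported by the included studies):

```python
def operational_emissions_kgco2eq(energy_kwh: float, grid_intensity_gco2_per_kwh: float) -> float:
    """Convert operational energy use into CO2-equivalent emissions (kg)."""
    return energy_kwh * grid_intensity_gco2_per_kwh / 1000.0  # grams -> kilograms


# Placeholder grid intensities in gCO2eq/kWh (illustrative, not study-reported values).
grids = {"low-carbon grid": 50.0, "mixed grid": 400.0, "coal-heavy grid": 800.0}

training_run_kwh = 1_000.0  # hypothetical training workload
for name, intensity in grids.items():
    emissions = operational_emissions_kgco2eq(training_run_kwh, intensity)
    print(f"{name}: {emissions:.0f} kgCO2eq")
```

Under these placeholder factors, the same 1,000 kWh workload spans more than an order of magnitude in estimated emissions, which is precisely the contextual dependence emphasized above.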
Life-cycle–oriented analyses substantially extend this perspective by accounting for emissions beyond operational energy use. Schneider et al. [37] provide a comprehensive LCA of AI hardware accelerators, showing that embodied emissions from manufacturing, transport, and end-of-life phases can equal or exceed operational emissions over the lifespan of the hardware. These findings challenge narrow operational assessments and underscore the importance of cradle-to-grave accounting when evaluating the true carbon cost of AI systems.
Jiang et al. [32] further reinforce this systemic view by examining the life-cycle energy and carbon implications of LLM-powered chatbots. Their analysis reveals that emissions accumulate across development, deployment, and user interaction phases, with scaling effects that are often underestimated in conventional assessments. Importantly, the study demonstrates that even modest per-interaction emissions can translate into substantial carbon costs when deployed at a global scale.
Domain-specific modelling studies illustrate how these impacts may manifest in applied settings. Vafaei Sadr et al. [39] estimate CO2eq emissions associated with deep learning inference in digital pathology workflows, showing that routine clinical deployment could generate non-negligible emissions at scale. Similarly, Sarkodie et al. [36], while focused on blockchain-related systems, provide relevant insights into the carbon intensity of energy-intensive digital infrastructures, offering a useful parallel for understanding AI-driven computational systems.
Across the corpus, a consistent signal emerges: AI-related carbon emissions are highly variable but potentially substantial, particularly for large-scale and continuously operating systems. The evidence highlights the need for standardized reporting practices and transparent assumptions to ensure comparability across studies. Without such harmonization, carbon footprint estimates risk underrepresenting the true climate impact of AI technologies.
4.4.3. Mitigation and Optimization Strategies
A third outcome domain concerns mitigation strategies aimed at reducing the energy and carbon intensity of AI systems without compromising performance. Several studies identify algorithmic, hardware-level, and software-based optimization techniques as promising avenues for impact reduction. Boumendil et al. [31] synthesize a wide range of strategies, including model pruning, quantization, knowledge distillation, and efficient training protocols, demonstrating their potential to significantly reduce computational demand.
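As one concrete instance of the quantization strategies surveyed in this literature, the following PyTorch sketch applies post-training dynamic quantization to a small stand-in model; the model, layer sizes, and expected savings are illustrative assumptions rather than results from the included studies.

```python
import torch
import torch.nn as nn

# Stand-in model; any network with nn.Linear layers is treated analogously.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Post-training dynamic quantization: weights are stored in int8 and
# activations are quantized on the fly, reducing memory traffic and,
# in many deployments, per-inference energy use.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    x = torch.randn(1, 512)
    print(quantized_model(x).shape)  # same interface, lower-precision arithmetic
```

Whether such per-task savings translate into net reductions depends on deployment scale, a point taken up below in relation to rebound effects.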
On-device deep learning studies further highlight optimization as a necessity rather than an optional enhancement. By emphasizing lightweight architectures and energy-aware inference, these approaches illustrate how deployment constraints can drive innovation in efficiency-oriented model design. Comparative benchmarking by Mekouar et al. [33] extends this logic to data pipelines, showing that software choices alone can yield meaningful reductions in energy use and associated emissions.
Hardware-related strategies are also emphasized. Schneider et al. [37] discuss the role of hardware specialization and improved manufacturing practices in reducing life-cycle emissions, while also cautioning that efficiency gains at the device level may be offset by rebound effects associated with increased deployment. Jiang et al. [32] similarly note that optimization at individual stages of the AI lifecycle must be considered in a system-wide context to avoid shifting emissions rather than reducing them.
Together, the literature suggests that mitigation is feasible but requires coordinated action across the AI development pipeline. Isolated technical optimizations, while beneficial, are unlikely to achieve meaningful reductions without complementary changes in deployment practices and evaluation criteria.
4.4.4. Governance, Reporting, and Ethical Frameworks
The final outcome domain addresses governance and reporting frameworks that seek to integrate sustainability into AI development and deployment. Across the included studies, a recurring theme is that technical optimization alone may be insufficient to adequately address the environmental impact of AI systems, particularly under conditions of large-scale or continuous deployment. As methodological approaches and boundary definitions vary substantially, several authors emphasize the importance of standardized reporting and institutional oversight to improve transparency and comparability. Henderson et al. [9], for instance, advocate for routine disclosure of energy and carbon metrics alongside performance results, framing transparency as a prerequisite for accountability rather than as a definitive impact mitigation measure.
This comparative analysis reveals a fundamental trade-off: measurement-focused metrics offer high precision but low systemic relevance, whereas lifecycle and governance frameworks offer high relevance but suffer from significant data-gathering hurdles and a lack of universal standardization.
Raper et al. [34] introduce the concept of sustainability budgets, proposing governance mechanisms that allocate explicit energy and carbon constraints to AI projects. Rather than emerging from a single causal estimate, this approach is motivated by the recognition of scaling effects, uncertainty in inference-related energy use, and the potential for rebound effects, reframing environmental impact as a design parameter within existing project management practices. Usman et al. [38] extend this perspective to cybersecurity and digital infrastructure, highlighting the cumulative carbon burden associated with continuously operating AI-driven systems and the corresponding need for policy-level guidance under conditions of limited empirical evidence.
Studies focused on healthcare and applied domains further emphasize ethical considerations. Richie et al. [35] argue that environmental impacts should be considered alongside clinical benefits when evaluating AI deployment in healthcare, particularly given the potential for widespread adoption and long-term operational use. Similarly, Green Cybersecurity frameworks position sustainability as an integral component of responsible AI governance, not as a consequence of isolated performance metrics but as part of broader accountability structures.
Taken together, the evidence does not support a single causal pathway from specific energy estimates to governance prescriptions. Instead, it indicates that the combination of heterogeneous methodologies, limited long-term inference measurements, and scaling uncertainty motivates a precautionary, system-level governance perspective. Embedding energy and carbon considerations into reporting standards, funding criteria, and institutional policies is therefore presented as a pragmatic response to current evidence gaps, rather than as a definitive solution to the environmental footprint of AI systems.
4.5. Limitations and Strengths of the Review
A potential limitation of this review is the inclusion of a relatively small number of studies (n = 10). This result is indicative of the current state of the field: while the “Green AI” discourse is expansive, there remains a significant lack of peer-reviewed, primary empirical research that provides granular, hardware-verified energy data across the full lifecycle. However, the strength of this review lies not in volume but in the rigorous exclusion of secondary data that often propagate unverified estimates. By focusing only on high-fidelity, quantitative studies, we provide a more accurate baseline for the carbon-aware governance framework proposed in this study. This limited sample size underscores the critical ‘transparency gap’ in AI reporting and highlights the urgent need for standardized benchmarks that include water and e-waste metrics alongside carbon.
5. Discussion
This systematic review synthesizes emerging evidence on the energy consumption and carbon footprint of contemporary artificial intelligence systems, revealing that environmental impact is not a marginal side effect of AI development but a structural and increasingly consequential dimension of digital infrastructure. Across heterogeneous methodologies and application domains, the literature converges on the conclusion that modern AI systems, particularly large-scale, continuously deployed models, can entail substantial and highly variable environmental costs [7].
Beyond the studies included in the PRISMA-guided synthesis, a broader empirical literature—spanning numerical modelling, AI systems research, and experimental measurement—addresses complementary dimensions of the environmental footprint of AI. However, these contributions differ fundamentally from the studies included in this review in terms of methodological assumptions, system boundaries, and analytical scope, and their conclusions must therefore be interpreted with caution.
Numerical and simulation-driven analyses provide valuable insights into the sensitivity of AI-related emissions to workload parameters, infrastructure design, and carbon-intensity assumptions, and they highlight the theoretical mitigation potential of carbon-aware workload shifting and control strategies under controlled or idealized conditions [40,41]. Nevertheless, these studies are predominantly scenario-based, rely on synthetic workloads or assumed energy–carbon mappings, and rarely capture deployment-scale dynamics, which limits their empirical grounding and cross-context comparability.
In parallel, AI-based systems and deployment-oriented research on large language model inference clusters and carbon-aware scheduling demonstrate that operational energy demand is strongly shaped by serving stack design, batching and quantization strategies, and cluster orchestration under latency and throughput constraints [42,43,44,45]. While these studies offer high-resolution, system-specific evidence, their findings are inherently context-dependent, optimized for particular infrastructures, and not designed to support lifecycle-wide or cross-study synthesis.
Experimental and measurement-focused studies further contribute hardware-grounded evidence through NVML- and RAPL-based telemetry and benchmarking protocols that directly measure energy and power during training and inference across specific hardware configurations [46,47,48,49]. However, such measurements are typically limited to isolated components of the AI pipeline, short temporal windows, or single hardware generations, and do not account for embodied emissions, deployment scale, or long-term system feedback.
Complementary life-cycle assessments of data-centre and computing-centre infrastructures underscore the importance of embodied and infrastructure-mediated impacts, including cooling systems and facility-level design [50,51]. Yet, these analyses generally operate at an infrastructural level that precludes direct linkage to model-level or workload-level decision-making.
Overall, this body of literature provides fragmented but complementary evidence: it elucidates specific mechanisms, sensitivities, and localized optimization opportunities, but does not offer an integrated, lifecycle-wide synthesis across heterogeneous empirical contexts. This fragmentation directly motivates the present systematic review, which consolidates evidence across operational and embodied dimensions using transparent inclusion criteria and comparative synthesis to enable robust inference on system-level environmental impacts of AI.
A first critical insight concerns the distribution of energy demand across the AI lifecycle. While early discourse emphasized the environmental burden of training large models, the reviewed evidence consistently challenges this training-dominant narrative. In real-world deployment contexts, inference-phase energy consumption emerges as a co-equal or dominant contributor, especially for systems that operate continuously or at a global scale, such as large language model–based services [10,52]. This finding has important implications for both research evaluation and policy, as inference energy use is rarely reported, regulated, or optimized with the same rigor as training.
A second major theme is the importance of system boundaries in environmental assessment. Studies adopting a life-cycle perspective demonstrate that embodied emissions associated with hardware manufacturing, transport, and disposal can rival or exceed operational emissions over the lifespan of AI systems [10,52]. These findings expose the limitations of operational-only accounting practices and highlight a systematic underestimation of AI’s true carbon footprint in much of the existing literature. Moreover, embodied emissions are tightly coupled to hardware refresh cycles and utilization efficiency, linking technical design decisions directly to long-term environmental outcomes.
The review further underscores the contextual dependence of AI-related emissions, particularly with respect to geographic variation in electricity generation. Identical computational workloads can result in markedly different carbon footprints depending on grid composition, temporal availability of renewable energy, and regional emission factors [53]. This geographic sensitivity complicates cross-study comparison and reinforces calls for standardized reporting practices that transparently document assumptions, system boundaries, and emission factors.
Importantly, the evidence also reveals that energy consumption and carbon emissions are not intrinsic properties of AI models, but emergent outcomes shaped by interacting choices across algorithms, software pipelines, hardware configurations, and deployment strategies. Comparative studies demonstrate that engineering decisions, such as data-processing frameworks, numerical precision, batch size, and workload scheduling, can yield order-of-magnitude differences in energy use [11,54,55]. This observation shifts responsibility from model architecture alone to the broader socio-technical systems in which AI is embedded.
Despite growing awareness of these issues, the review identifies a persistent gap between measurement and mitigation. While numerous studies quantify environmental impacts and propose efficiency-enhancing techniques, these interventions are typically evaluated in isolation and without accounting for system-level feedbacks. In particular, the rebound effect emerges as a central unresolved challenge: efficiency gains achieved through algorithmic or hardware optimization may be offset by expanded deployment, increased demand, or accelerated hardware turnover, leading to stable or rising net emissions [15]. As a result, technical optimization alone is unlikely to deliver absolute reductions in environmental impact.
These findings align closely with the broader Green AI literature, which critiques the performance-centric paradigm of “Red AI” and advocates for energy- and resource-aware evaluation [11]. However, the reviewed evidence suggests that even Green AI approaches risk being insufficient if they remain confined to voluntary reporting or localized efficiency improvements. Without enforceable constraints or coordinated decision-making across the AI lifecycle, sustainability remains an external consideration rather than an operational imperative.
Taken together, the results of this review point toward a fundamental conclusion: sustainable AI cannot be achieved through isolated optimizations or post hoc reporting alone. Instead, the environmental impact of AI must be addressed as a dynamic control problem, in which performance, energy use, and carbon emissions are jointly managed under uncertainty and evolving constraints. This insight directly motivates the conceptual framework introduced in Section 6. The empirical evidence reviewed here—particularly the dominance of inference-phase emissions [10], the significance of embodied carbon [52], the sensitivity to energy-system context, and the prevalence of rebound effects [15]—collectively indicates the need for coordinated, adaptive, and constraint-aware governance mechanisms. A multi-agent reinforcement learning paradigm provides a principled means of operationalizing these requirements by enabling distributed decision-makers across software, hardware, and energy systems to learn policies that internalize environmental limits while maintaining system utility.
By embedding carbon constraints directly into optimization objectives, such approaches shift sustainability from a descriptive metric to a design-time and run-time control variable.
Section 6 therefore extends the findings of this review from diagnosis to prescription, illustrating how environmental assessment can inform the development of AI systems aligned with long-term climate mitigation and adaptation goals.
6. Implications for Sustainable AI: A Carbon-Aware Multi-Agent Reinforcement Learning Framework
This section introduces a conceptual framework and research agenda for carbon-aware, climate-resilient AI systems. While grounded in empirical findings from the reviewed literature, the proposed multi-agent reinforcement learning (MARL) framework is not empirically validated within this study. Instead, it should be understood as a hypothesis-driven systems design that synthesizes existing evidence into a coherent control-oriented architecture, intended to guide future methodological development, empirical evaluation, and policy experimentation.
6.1. Motivation and Conceptual Contribution
The results of this systematic review highlight a critical limitation in current approaches to sustainable artificial intelligence: while environmental impacts are increasingly measured and reported, they are rarely operationalized as control variables within AI systems themselves. Most existing work remains descriptive, retrospective, or localized, focusing on individual models, experiments, or efficiency techniques rather than coordinated system-level intervention [9,11].
Three empirically grounded challenges identified in the reviewed literature motivate the need for a new paradigm. First, inference-phase energy consumption has emerged as a dominant contributor to total energy use in large-scale and continuously deployed AI systems, particularly in user-facing services [10,52]. Second, embodied emissions associated with hardware manufacturing and turnover can rival or exceed operational emissions, underscoring the inadequacy of operational-only optimization strategies [19,52]. Third, efficiency improvements are frequently undermined by the rebound effect, whereby reductions in per-task energy consumption are offset by increased deployment, usage, or scale [15].
In response, this section proposes a conceptual Carbon-Aware MARL framework that reframes AI deployment as a coordinated, adaptive decision problem across software, hardware, and energy infrastructures. Unlike existing Green AI approaches that emphasize voluntary reporting or isolated optimization, the framework hypothesizes that embedding environmental constraints directly into the decision-making logic of AI systems enables proactive mitigation under dynamic conditions.
As a simple illustrative example, consider a large-scale language model deployed as a cloud-based conversational service. Under the proposed carbon-aware MARL framework, a model optimization agent adapts inference precision and batch size, a hardware lifecycle agent manages accelerator utilization, and a grid-aware scheduling agent shifts non-urgent inference workloads toward periods of lower electricity carbon intensity. This coordinated control aims to reduce cumulative operational and embodied emissions while maintaining service-level performance.
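A minimal sketch of the temporal-shifting behaviour attributed to the grid-aware scheduling agent is given below; the forecast values, deferral window, and greedy selection rule are hypothetical simplifications of what a learned policy would provide.

```python
from typing import Sequence


def pick_lowest_carbon_hour(forecast_gco2_per_kwh: Sequence[float], deadline_hours: int) -> int:
    """Return the hour index, within the deadline, with the lowest forecast carbon intensity."""
    window = forecast_gco2_per_kwh[:deadline_hours]
    return min(range(len(window)), key=window.__getitem__)


# Hypothetical 12-hour carbon-intensity forecast (gCO2eq/kWh) for a deferrable batch-inference job.
forecast = [420, 410, 390, 300, 220, 180, 200, 260, 340, 400, 430, 450]
hour = pick_lowest_carbon_hour(forecast, deadline_hours=8)
print(f"Defer non-urgent inference to hour {hour} ({forecast[hour]} gCO2eq/kWh)")
```

In the full framework, this greedy rule would be replaced by a learned policy that also accounts for latency constraints, hardware utilization, and the shared carbon budget.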
Figure 3 provides a conceptual overview of the proposed carbon-aware multi-agent reinforcement learning framework, illustrating how coordinated agents operate across software, hardware, and energy-system layers under shared carbon constraints. The figure highlights the interaction between centralized training, decentralized execution, and governance mechanisms that jointly internalize operational and embodied emissions throughout the AI lifecycle.
6.2. System-Level Architecture and Coordinated Control
The proposed framework adopts a Centralized Training, Decentralized Execution (CTDE) paradigm, which has proven effective for coordination in complex, distributed systems [56,57]. As shown in Figure 3, centralized training allows agents to learn from a shared global state incorporating performance requirements, carbon constraints, and infrastructure context, while decentralized execution ensures scalability and robustness by allowing each agent to act autonomously at deployment time.
Within this architecture, three functionally distinct agents are hypothesized to operate at complementary layers of the AI lifecycle:
A Model Optimization Agent, which dynamically adjusts computational complexity using techniques such as adaptive precision, early exiting, and knowledge distillation, reflecting evidence that computational demand varies substantially across tasks and contexts [11];
A Hardware Lifecycle Agent, which manages accelerator utilization, consolidation, and replacement timing to reduce embodied emissions per unit of computation, directly responding to life-cycle assessment findings on hardware-related carbon impacts [19,52];
A Grid-Aware Scheduling Agent, which interfaces with real-time and forecasted electricity grid carbon intensity to temporally shift energy-intensive workloads toward periods of lower marginal emissions, operationalizing geographic and temporal variability identified in prior work [9,53].
Together, these agents define a hypothetical cross-layer control system capable of internalizing both operational and embodied carbon costs, rather than optimizing them in isolation.
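To illustrate how these roles could be expressed in software, the following Python skeleton sketches the three agents acting on a shared state; all field names, thresholds, and action encodings are hypothetical placeholders rather than a specification of the framework.

```python
from dataclasses import dataclass


@dataclass
class SharedState:
    """Simplified global state visible during centralized training."""
    latency_slo_ms: float            # service-level latency target
    grid_intensity_gco2_kwh: float   # current or forecast grid carbon intensity
    carbon_budget_kg: float          # remaining sustainability budget
    accelerator_utilization: float   # fleet utilization, 0..1


class ModelOptimizationAgent:
    def act(self, s: SharedState) -> dict:
        # Lower numerical precision when the carbon budget is nearly exhausted.
        return {"precision": "int8" if s.carbon_budget_kg < 10.0 else "fp16"}


class HardwareLifecycleAgent:
    def act(self, s: SharedState) -> dict:
        # Consolidate workloads onto fewer accelerators when utilization is low.
        return {"consolidate": s.accelerator_utilization < 0.4}


class GridAwareSchedulingAgent:
    def act(self, s: SharedState) -> dict:
        # Defer deferrable jobs when the grid is carbon-intensive.
        return {"defer_batch_jobs": s.grid_intensity_gco2_kwh > 400.0}
```

At execution time each agent would observe only its local slice of this state, consistent with the decentralized-execution side of the CTDE paradigm.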
6.3. Carbon-Constrained Learning and Rebound Effect Mitigation
A central innovation of the framework is the explicit treatment of carbon emissions as a first-class optimization objective. This is implemented through a multi-objective reward function (R_total) that balances system utility against operational and embodied emissions, with weighting coefficients (w_i) regulating the trade-off. Sustainability budgets are conceptualized as explicit upper bounds on allowable emissions and act as binding constraints that activate penalties when exceeded.
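As an illustrative formulation only (the specific utility and emission terms below are introduced for exposition and are not prescribed by the reviewed studies), such a reward can be written as:

```latex
R_{\mathrm{total}}
  = w_{1}\, U_{\mathrm{task}}
  - w_{2}\, E_{\mathrm{op}}
  - w_{3}\, C_{\mathrm{emb}}
  - w_{4}\, \max\bigl(0,\; C_{\mathrm{cum}} - B_{\mathrm{carbon}}\bigr)
```

Here U_task denotes task-level utility (e.g., accuracy or service quality), E_op the operational emissions attributable to a decision, C_emb its amortized embodied emissions, C_cum the cumulative system emissions, and B_carbon the sustainability budget; the final term is inactive until the budget is exceeded, acting as the binding constraint described above.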
While not empirically evaluated here, this reward structure hypothesizes a mechanism by which efficiency gains can be prevented from translating into unbounded deployment and associated rebound effects. By embedding environmental constraints directly into the learning objective, the framework conceptually shifts sustainability from an external compliance metric to an endogenous optimization target aligned with long-term environmental goals.
6.4. Uncertainty-Aware Coordination in Non-Stationary Systems
Climate and energy systems are inherently non-stationary, characterized by fluctuating renewable generation, evolving demand patterns, and increasing frequency of extreme events. Prior research has shown that reinforcement learning systems trained under static assumptions often fail under distributional shifts [58,59].
To address this, the framework proposes uncertainty-driven exploration and cooperative coordination as design principles rather than validated solutions. Agents are assumed to seek strategies that remain robust across a wide range of plausible future states, consistent with concepts from robust and constrained reinforcement learning [60].
Conceptually, uncertainty-aware coordination enables agents to negotiate workload prioritization under fluctuating energy availability and climate-driven system stress. During periods of scarcity, agents are assumed to prioritize high-societal-value workloads, such as healthcare, emergency response, or climate modeling, while deferring elective or carbon-intensive tasks. This mechanism illustrates how AI deployment could be aligned with broader climate adaptation objectives, pending empirical validation.
6.5. Explainability, Ethics, and Governance by Design
Delegating cross-layer control to autonomous agents raises significant ethical and governance concerns, particularly in contexts where AI systems influence access to critical infrastructure during climate stress events. Opaque decision-making processes risk undermining public trust and accountability [61,62].
Accordingly, the framework incorporates Explainable Reinforcement Learning (XRL) as a design requirement, enabling post-hoc interpretation and auditing of agent decisions [63,64]. These mechanisms are intended to support accountability, regulatory oversight, and fairness auditing, particularly where historically biased data could otherwise lead to inequitable outcomes. In this sense, explainability and governance are treated as structural components of the framework rather than downstream add-ons.
6.6. Implementation Roadmap and Policy Implications
To clarify the pathway from conceptual design to empirical evaluation, we outline a high-level implementation roadmap for future research and pilot deployments:
Data Requirements: Continuous, interoperable data streams spanning model performance, energy consumption, grid carbon intensity, hardware utilization, and social vulnerability indicators. Regulatory mandates for standardized, real-time data sharing are likely prerequisites.
Digital Twin Infrastructure: High-fidelity digital twins of urban, energy, or data center systems are required to train and stress-test MARL agents under diverse and extreme scenarios. These environments must support scenario engineering, including rare and compound climate events.
Metrics and Validation: Beyond accuracy and efficiency, evaluation metrics should include operational and embodied emissions, rebound-adjusted carbon impact, system resilience under distributional shift, and equity-aware performance indicators.
Governance and Oversight: Clear legal mandates must define the scope and limits of autonomous decision-making, particularly during climate emergencies. Human-in-the-loop oversight, inter-institutional agreements, and long-term funding mechanisms are essential for responsible deployment.
Taken together, these considerations reinforce that the proposed framework should be understood as a conceptual synthesis and research agenda, rather than a validated solution. Its primary contribution lies in demonstrating how insights from environmental assessment, reinforcement learning, and climate governance can be integrated into a unified control-oriented vision for sustainable and climate-resilient AI systems.
7. Conclusions
This systematic review demonstrates that the environmental footprint of artificial intelligence is a structural and increasingly consequential challenge, shaped by interdependent decisions across algorithms, software pipelines, hardware infrastructures, and deployment contexts. The synthesized evidence shows that energy consumption and carbon emissions associated with AI systems are highly variable, context-dependent, and often underestimated, particularly when inference-phase demand and embodied hardware emissions are omitted from analysis.
A central conclusion of this work is that sustainable AI cannot be achieved through isolated technical optimizations or voluntary reporting alone. While efficiency-enhancing techniques and standardized measurement frameworks are necessary, they are insufficient to deliver absolute reductions in environmental impact due to rebound effects, scaling dynamics, and non-stationary energy systems. Sustainability must therefore be embedded directly into the operational logic and governance of AI systems.
To move from principle to practice, this study proposes a multi-layered implementation pathway. At the technical level, sustainability objectives can be operationalized through carbon-aware, multi-agent control frameworks that integrate energy and emission constraints as endogenous system goals. At the organizational level, AI developers and operators should adopt life-cycle assessment protocols, incorporate sustainability metrics into performance evaluations, and align Research and Development (R&D) priorities with energy-efficient design principles. At the policy level, targeted incentives, regulatory standards, and sectoral reporting requirements can reinforce adoption, creating an environment in which sustainable AI is economically viable and institutionally supported. Together, these layers form a feasible pathway for embedding environmental considerations into AI’s operational logic within existing commercial and policy landscapes.
By synthesizing empirical evidence on AI’s environmental impacts and integrating it with a forward-looking conceptual framework, this study advances the field in two key ways. First, it consolidates fragmented evidence into a coherent synthesis that identifies common drivers, methodological gaps, and actionable insights for policymakers and practitioners. Second, it illustrates how sustainability can be treated as a core operational objective rather than an external afterthought, enabling AI systems to align technological innovation with climate mitigation goals.
Future research should empirically evaluate system-level approaches through high-fidelity simulations and real-world pilot deployments, develop standardized benchmarks for inference-phase and life-cycle emissions, and investigate governance mechanisms that incentivize sustainable AI across commercial and regulatory contexts. As AI systems become increasingly embedded in critical societal functions, integrating environmental sustainability into their design and deployment is not merely a technical challenge; it is an ethical and policy imperative.