A Deployment-Aware Framework for Carbon- and Water- Efficient LLM Serving
Abstract
1. Introduction
2. Optimizing the Environmental Footprint of LLM Inference: A Literature Review
2.1. From Training to Inference: Why the Burden Has Shifted
2.2. Full Stack per Prompt Accounting: Energy, Carbon, Water, and Embodied Impacts
2.3. Measurement Boundaries: Why Scope Transparency Matters
2.4. Real-Time Orchestration: Carbon- and Water-Aware Routing Under SLOs
2.5. Phase-Aware Hardware Scheduling (Prefill vs. Decode)
2.6. Semantic-Level Interventions
2.7. Lifecycle and Circular Economy Strategies
2.8. Toward a Unified, Deployment-Aware Framework
3. Methodology
3.1. Functional Unit and System Boundaries
3.2. Impact Accounting
3.3. Time Resolution, Traffic Mix, and SLOs
3.4. Decision Variables, Objective, and Constraints
3.4.1. Setup: Decisions, Replicas, and Parameters
3.4.2. Token Accounting (Profiles, Directives, Per-Phase Tokens)
3.4.3. Per-Assignment Coefficients
3.4.4. Objective (What the -Scale Solver Minimizes)
3.4.5. Feasibility Constraints
3.4.6. Embodiment and Daily Coupling (Once/Day per Hardware Class)
3.5. Framework and Algorithmic Handoff
| Algorithm 1 -Scale (daily-coupled MILP; SLOs, replicas, daily binaries). |
(assignments), (replicas), (activation), (daily activation). Optional: (usage, for gated SLO).
1: Per-prompt coefficients for each in Wh/prompt:
2: Minimize (global daily sum):
3: Subject to
4: (i) Conservation per and phase link:
5: (ii) Capacity with replicas (RHS scaled by n):
6: (iii) Replica bounds and activation:
7: (iv) Daily coupling (embodiment once/day):
8: (v) Link assignments to activation (tight big-M):
9: (vi) Latency SLO at (choose one form): Ungated: Optional gated: , with and .
10: Solve the MILP once over all ; return . |
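The structure of Algorithm 1 can be illustrated with a small sketch. The block below is not the paper's formulation: it is a brute-force enumeration standing in for a MILP solver, and every site name, coefficient, weight, and the `STEP` granularity is a hypothetical placeholder. It only mirrors the named constraint families (conservation, capacity scaled by replicas, replica bounds, activation, daily coupling of the embodied term, and an ungated latency filter) under those assumptions.

```python
import math
from itertools import product

# Hypothetical toy data (not from the paper). Units per site:
# wh = Wh/prompt, cif = kg CO2/kWh, water = L/kWh (site + source),
# cap = prompts per window per replica, n_max = replica bound,
# p95_s = p95 latency in seconds, embodied = g CO2 charged once per active day.
SITES = {
    "north": dict(wh=0.7, cif=0.10, water=4.0, cap=100, n_max=2, p95_s=1.8, embodied=5.0),
    "south": dict(wh=0.7, cif=0.40, water=1.0, cap=100, n_max=3, p95_s=1.2, embodied=5.0),
    "far":   dict(wh=0.6, cif=0.05, water=0.5, cap=100, n_max=3, p95_s=3.5, embodied=5.0),
}
DEMAND = [200, 300]            # prompts arriving in each time window
SLO_P95_S = 2.0                # latency SLO (seconds)
W_CARBON, W_WATER = 1.0, 0.02  # objective weights per g CO2 and per mL water
STEP = 100                     # assignment granularity in prompts


def splits(total, k):
    """Yield every split of `total` prompts over k sites in multiples of STEP."""
    if k == 1:
        yield (total,)
        return
    for x in range(0, total + 1, STEP):
        for rest in splits(total - x, k - 1):
            yield (x,) + rest


def evaluate(plan, names):
    """Return (cost, replicas per window) for a plan, or None if infeasible."""
    cost, replicas, active_today = 0.0, [], set()
    for split in plan:
        n_t = {}
        for name, x in zip(names, split):
            p = SITES[name]
            n = math.ceil(x / p["cap"])   # capacity: x <= cap * n
            if n > p["n_max"]:            # replica bound violated
                return None
            n_t[name] = n
            if x > 0:
                active_today.add(name)    # any assignment activates the site
            grams = p["wh"] * p["cif"]    # Wh * kg/kWh = g per prompt
            ml = p["wh"] * p["water"]     # Wh * L/kWh = mL per prompt
            cost += x * (W_CARBON * grams + W_WATER * ml)
        replicas.append(n_t)
    # Daily coupling: the embodied term is charged once per site used that day.
    cost += W_CARBON * sum(SITES[s]["embodied"] for s in active_today)
    return cost, replicas


def solve():
    # Latency SLO applied as a hard filter on candidate sites (ungated form).
    names = [s for s, p in SITES.items() if p["p95_s"] <= SLO_P95_S]
    best = None
    # Conservation holds by construction: each split sums to the window demand.
    for plan in product(*(list(splits(d, len(names))) for d in DEMAND)):
        result = evaluate(plan, names)
        if result is not None and (best is None or result[0] < best[0]):
            best = (result[0], plan, result[1])
    return names, best


names, (cost, plan, replicas) = solve()
print(names, plan, replicas, round(cost, 2))
```

In this toy instance the low-carbon site is capped at two replicas, so the second window spills 100 prompts to the other eligible site and both sites incur the once-per-day embodied charge. A real deployment would replace the enumeration with a MILP solver and the placeholder coefficients with measured ones.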
| Algorithm 2 Aggregation and scope-transparent reporting (post-solve). |
1: for all do
2:
3:
4:
5: Normalize to per-prompt: , , .
6: end for
7: For each profile p, take the median over of ; then mix by to get per-model medians.
8: Embodied and e-waste (once/day): , .
9: Scopes. Report comprehensive (default) via ; add accelerator-only whiskers using (or the observed narrow→comprehensive ratio when shares are undisclosed).
10: Emissions. Default LB via ; add MB sensitivity by substituting . |
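The median-then-mix step of Algorithm 2 can be sketched as follows. The per-window values are hypothetical placeholders, not paper data. The 70/25/5 shares are the mix named in the Appendix A title; mapping them to short/medium/long in that order is an assumption.

```python
from statistics import median

# Hypothetical per-window, per-prompt energy (Wh) for each profile; not paper data.
wh_by_profile = {
    "short":  [0.40, 0.42, 0.38, 0.41, 0.45],
    "medium": [1.10, 1.05, 1.20, 1.15, 1.08],
    "long":   [2.90, 3.10, 3.00, 2.95, 3.20],
}

# Traffic shares; the short/medium/long ordering of 70/25/5 is assumed.
shares = {"short": 0.70, "medium": 0.25, "long": 0.05}


def per_model_median(values_by_profile, shares):
    """Median over windows within each profile, then a share-weighted mix."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    medians = {p: median(v) for p, v in values_by_profile.items()}
    mixed = sum(shares[p] * medians[p] for p in shares)
    return medians, mixed


medians, mixed_wh = per_model_median(wh_by_profile, shares)
print(medians, round(mixed_wh, 3))
```

Taking the median per profile before mixing keeps a single heavy-tailed window from dominating the service-level figure. The same function applies unchanged to per-prompt water (mL) and CO2 (g) series.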
3.6. Parameterization from Public Sources
4. Results
4.1. Comprehensive Boundary Medians and Daily Totals
4.2. Scope Reconciliation: Accelerator-Only vs. Comprehensive
4.3. Carbon–Water Movement at Fixed QoS
4.4. Routing-Only Carbon–Water Pareto Under SLOs
4.5. Joint Frontiers from Site + Batch + Token Sweeps Under the SLO
4.6. Ablation: Where the Gains Come from (Fixed SLOs)
5. Discussion
6. Conclusions
- Report daily medians at the comprehensive serving boundary (accelerators + host CPU/DRAM + provisioned idle, lifted by PUE). Provide accelerator-only whiskers for reconciliation so that chip-only and full-stack studies remain comparable.
- Enable carbon- and water-aware geo-routing with explicit latency gates. Right-size batches in five-minute windows. Apply concise directives when SLO-safe to reduce decode work. Consider phase-aware placement (fast, compute-efficient cohorts for prefill; memory-efficient or second-life cohorts for decode).
- Use site rows per region: PUE, site WUE, EWIF, and CIF. Default to location-based carbon for Scope-2 reporting. Treat market-based factors as a sensitivity when disclosed.
- Track per-profile (short/medium/long) medians for energy (Wh), consumptive water (mL, site + source), and CO2 (g, LB by default). Mix these using your business-specific traffic shares to obtain service-level medians and daily totals, and use these to tune weights and capacity caps over time.
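As a minimal illustration of the geo-routing recommendation above, the sketch below applies a latency gate first and then ranks the remaining sites by a weighted carbon and water score. The `Site` fields, the max-normalization, the weights, and all numbers are assumptions made for this example rather than the paper's routing rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    cif: float            # kg CO2/kWh, location-based
    water: float          # L/kWh, site + source combined
    p95_latency_s: float  # observed p95 latency in seconds


def route(sites, slo_s, w_carbon=0.5, w_water=0.5):
    """Return the best-scoring site among those that pass the latency gate."""
    eligible = [s for s in sites if s.p95_latency_s <= slo_s]
    if not eligible:
        raise ValueError("no site satisfies the latency SLO")
    # Scale each factor by its maximum among eligible sites so weights are unitless.
    max_cif = max(s.cif for s in eligible) or 1.0
    max_water = max(s.water for s in eligible) or 1.0
    return min(
        eligible,
        key=lambda s: w_carbon * s.cif / max_cif + w_water * s.water / max_water,
    )


# Hypothetical sites; "far" is clean but fails a 2 s gate.
sites = [
    Site("north", cif=0.10, water=3.0, p95_latency_s=1.8),
    Site("south", cif=0.40, water=1.0, p95_latency_s=1.2),
    Site("far", cif=0.05, water=0.5, p95_latency_s=3.5),
]
balanced = route(sites, slo_s=2.0)
water_leaning = route(sites, slo_s=2.0, w_carbon=0.2, w_water=0.8)
print(balanced.name, water_leaning.name)
```

Gating on latency before scoring means a cleaner but slower site can never be chosen at the expense of the SLO, and shifting the weights moves the choice along the carbon-water trade-off.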
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| AI | Artificial Intelligence. |
| API | Application Programming Interface (used when referring to cross-model API benchmarks). |
| CIF | Carbon Intensity of the Grid (typically kg CO2 kWh−1); used for location-based emissions. |
| CO2 | Carbon dioxide; operational emissions are reported in grams per prompt and in tons per day. |
| CO2e | Carbon-dioxide equivalent (used when referring to greenhouse-gas accounting). |
| CPU | Central Processing Unit (host side of serving stack). |
| DRAM | Dynamic Random-Access Memory (host memory included in comprehensive boundary). |
| | Market-Based portfolio emission factor (kg CO2e kWh−1) used as a sensitivity to LB. |
| EWIF | Electricity–Water Intensity Factor (L kWh−1) capturing off-site, generation-mix water. |
| | “Source” component of water from electricity generation in the site + source accounting. |
| GHG | Greenhouse Gas. |
| GPU | Graphics Processing Unit (accelerator). |
| GWh | Gigawatt-hour (10⁹ Wh). |
| IT | Information Technology load (accelerators + host CPU/DRAM + provisioned idle). |
| kWh | Kilowatt-hour (10³ Wh). |
| KV cache | Key–Value cache (used in decode optimizations). |
| LB | Location-Based (grid-average, point-of-consumption reporting for emissions; the default in this work). |
| LBNL | Lawrence Berkeley National Laboratory (source for PUE/WUE context). |
| LLM | Large Language Model. |
| MB | Market-Based (portfolio accounting sensitivity for emissions). |
| MILP | Mixed-Integer Linear Program (optimization formulation). |
| mL | Milliliter (10⁻³ L). |
| ML | Megaliter (10⁶ L); in results tables, ML day−1 is used for daily totals. |
| s | Second. |
| PUE | Power Usage Effectiveness (facility/IT energy ratio). |
| QoS | Quality of Service (used when discussing interactive service constraints). |
| SLO | Service Level Objective (latency/throughput targets enforced in the optimizer). |
| -Scale | The time-resolved, SLO-aware bi-objective orchestration loop proposed in the paper. |
| TPOT | Time Per Output Token (latency metric for decode). |
| TPS | Tokens Per Second (throughput metric used in capacity constraints). |
| TPU | Tensor Processing Unit (accelerator). |
| TTFT | Time To First Token (latency metric for prefill). |
| Wh | Watt-hour (unit for per-prompt energy). |
| WUE | Water Usage Effectiveness (L kWh−1 at the facility; site cooling). |
| | Site-level WUE used in the site + source water formulation. |
| p95 | 95th-percentile statistic (used for latency and throughput SLO enforcement). |
Appendix A. Per-Model Ablation Under p95 SLOs (70/25/5 Mix, Comprehensive Serving Boundary)


















| Metric | GPT-4o | GPT-4o Mini | Claude 3.7 Sonnet | LLaMA 3 70B † |
|---|---|---|---|---|
| Baseline Wh/prompt | 0.6876 | 0.7545 | 1.55635 | 0.97145 |
| Optimized Wh/prompt | 0.289824 | 0.319896 | 0.664582 | 0.400321 |
| Energy % | | | | |
| Baseline mL/prompt | 2.391473 | 2.624151 | 5.412985 | 3.378703 |
| Optimized mL/prompt | 0.980349 | 1.081547 | 2.245570 | 1.356734 |
| Water % | | | | |
| Baseline g CO2/prompt (LB) | 0.242585 | 0.266188 | 0.549080 | 0.342728 |
| Optimized g CO2/prompt (LB) | 0.050739 | 0.055035 | 0.111840 | 0.074971 |
| CO2 % | | | | |
| Baseline Energy (GWh/d) | 0.3438 | 0.37725 | 0.778175 | 0.485725 |
| Optimized Energy (GWh/d) | 0.144912 | 0.159948 | 0.332291 | 0.200160 |
| Baseline Water (ML/d) | 1.196 | 1.312 | 2.706 | 1.689 |
| Optimized Water (ML/d) | 0.490 | 0.541 | 1.123 | 0.678 |
| Baseline CO2 (t/d, LB) | 121.293 | 133.094 | 274.540 | 171.364 |
| Optimized CO2 (t/d, LB) | 25.370 | 27.517 | 55.920 | 37.485 |
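
The relationship between the per-prompt rows and the daily totals in the table above can be checked with a few lines of arithmetic. The sketch below recomputes percent reductions from the baseline and optimized rows and rescales per-prompt values to daily totals. The daily volume of 500 million prompts is not stated in the table; it is inferred here from the first column (0.3438 GWh/d divided by 0.6876 Wh/prompt) and should be read as an assumption.

```python
MODELS = ["GPT-4o", "GPT-4o Mini", "Claude 3.7 Sonnet", "LLaMA 3 70B"]

# Per-prompt rows copied from the table: metric -> (baseline, optimized).
PER_PROMPT = {
    "Wh": ([0.6876, 0.7545, 1.55635, 0.97145],
           [0.289824, 0.319896, 0.664582, 0.400321]),
    "mL": ([2.391473, 2.624151, 5.412985, 3.378703],
           [0.980349, 1.081547, 2.245570, 1.356734]),
    "g CO2": ([0.242585, 0.266188, 0.549080, 0.342728],
              [0.050739, 0.055035, 0.111840, 0.074971]),
}

# Assumption: inferred from the table (0.3438 GWh/d / 0.6876 Wh per prompt).
PROMPTS_PER_DAY = 500_000_000


def pct_reduction(base, opt):
    """Percent reduction of the optimized value relative to the baseline."""
    return 100.0 * (base - opt) / base


reductions = {
    metric: [pct_reduction(b, o) for b, o in zip(base, opt)]
    for metric, (base, opt) in PER_PROMPT.items()
}


def daily_totals(wh, ml, g, prompts=PROMPTS_PER_DAY):
    """Scale per-prompt values to (GWh/d, ML/d, t CO2/d)."""
    # Wh -> GWh is /1e9; mL -> ML is /1e9; g -> t is /1e6.
    return wh * prompts / 1e9, ml * prompts / 1e9, g * prompts / 1e6


for i, model in enumerate(MODELS):
    row = ", ".join(f"{m}: {reductions[m][i]:.1f}% lower" for m in PER_PROMPT)
    print(f"{model} -> {row}")
print(daily_totals(0.6876, 2.391473, 0.242585))
```

Under the inferred volume, the GPT-4o baseline column reproduces the tabulated daily energy, water, and CO2 totals to within rounding.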
| Metric | GPT-4o | GPT-4o Mini | Claude 3.7 Sonnet | LLaMA 3 70B |
|---|---|---|---|---|
| Narrow Wh/prompt | 0.2865 | 0.314375 | 0.648479 | 0.404771 |
| Comprehensive Wh/prompt | 0.6876 | 0.7545 | 1.55635 | 0.97145 |
| Narrow/Comprehensive | 0.417 | 0.417 | 0.417 | 0.417 |
| Narrow mL/prompt | 0.996447 | 1.093396 | 2.255411 | 1.407793 |
| Comprehensive mL/prompt | 2.391473 | 2.624151 | 5.412985 | 3.378703 |
| Narrow g CO2/prompt (LB) | 0.101077 | 0.110911 | 0.228783 | 0.142803 |
| Comprehensive g CO2/prompt (LB) | 0.242585 | 0.266188 | 0.549080 | 0.342728 |
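
The scope table above implies a single narrow-to-comprehensive ratio across metrics and models. The sketch below recomputes that ratio from the tabulated rows and shows how an accelerator-only whisker could be derived from a comprehensive value when only the ratio is known. The helper name `accelerator_only` is hypothetical.

```python
# Rows copied from the scope table, in model order:
# GPT-4o, GPT-4o Mini, Claude 3.7 Sonnet, LLaMA 3 70B.
NARROW = {
    "Wh": [0.2865, 0.314375, 0.648479, 0.404771],
    "mL": [0.996447, 1.093396, 2.255411, 1.407793],
    "g CO2": [0.101077, 0.110911, 0.228783, 0.142803],
}
COMPREHENSIVE = {
    "Wh": [0.6876, 0.7545, 1.55635, 0.97145],
    "mL": [2.391473, 2.624151, 5.412985, 3.378703],
    "g CO2": [0.242585, 0.266188, 0.549080, 0.342728],
}

# Narrow/comprehensive ratio for every metric and model.
ratios = {
    metric: [n / c for n, c in zip(NARROW[metric], COMPREHENSIVE[metric])]
    for metric in NARROW
}


def accelerator_only(comprehensive_value, ratio):
    """Rescale a comprehensive-boundary value to the accelerator-only scope."""
    return comprehensive_value * ratio


for metric, values in ratios.items():
    print(metric, [round(r, 3) for r in values])
print(accelerator_only(0.6876, ratios["Wh"][0]))
```

Because every recomputed ratio agrees with the tabulated 0.417 to three decimals, the comprehensive figures in this table are about 2.4 times the accelerator-only ones.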
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Hoxha, J.; Thanasi-Boçe, M.; Khalifa, T. A Deployment-Aware Framework for Carbon- and Water- Efficient LLM Serving. Sustainability 2025, 17, 10473. https://doi.org/10.3390/su172310473

