Introducing LEAF: LLM Edge Assessment Framework for Generative AI on the Edge

Abdulkadhim, Mustafa; Repas, Sandor R.

doi:10.3390/make8020048

by Mustafa Abdulkadhim * and Sandor R. Repas

Department of Electrical Engineering and Info Communications, Széchenyi István University, 9026 Gyor, Hungary

* Author to whom correspondence should be addressed.

Mach. Learn. Knowl. Extr. 2026, 8(2), 48; https://doi.org/10.3390/make8020048

Submission received: 26 January 2026 / Revised: 10 February 2026 / Accepted: 13 February 2026 / Published: 18 February 2026

(This article belongs to the Section Data)


Abstract

The transition of Large Language Models (LLMs) from centralized clouds to edge environments is critical for addressing privacy concerns, latency bottlenecks, and operational costs. However, existing edge benchmarking frameworks remain tailored to discriminative Deep Learning tasks (e.g., object detection), failing to capture the multidimensional challenges of generative AI, specifically the trade-offs between token generation speed, semantic accuracy, and hardware sustainability. To address this gap, we introduce LEAF (LLM Edge Assessment Framework), a novel evaluation methodology that integrates Circular Economy principles directly into performance metrics. LEAF assesses edge deployments across five synergistic pillars: Circular Economy Score, Energy Efficiency (Joules/Token), Performance Speed (Tokens/Second), semantic accuracy (BERTScore), and End-to-End Latency. We validate LEAF through an extensive experimental analysis of five distinct hardware classes, ranging from embedded IoT devices (Raspberry Pi 4 and 5, NVIDIA Jetson Nano) to professional edge servers (NVIDIA T400) and repurposed legacy workstations (NVIDIA GTX 1050 Ti). Utilizing 4-bit quantized models via the Ollama runtime, our results reveal a counterintuitive insight: repurposed consumer hardware significantly outperforms modern purpose-built edge SoCs. The legacy GTX 1050 Ti achieved a 20× speedup over the Raspberry Pi 4 and maintained superior energy-per-task efficiency compared to low-power ARM architectures by minimizing active runtime. These findings challenge the prevailing narrative that newer silicon is essential for Edge AI, demonstrating that sustainable, high-performance inference can be achieved by extending the lifecycle of existing hardware. LEAF thus provides a blueprint for a “Green Edge” ecosystem that balances computational capability with environmental responsibility.

Keywords:

Edge AI; large language models; circular economy; benchmarking framework; sustainable computing; Raspberry Pi; Generative AI

1. Introduction

The proliferation of Large Language Models (LLMs) has reshaped Artificial Intelligence, shifting the focus from traditional predictive tasks to the generative capabilities that now power chatbots, code assistants, and reasoning agents. While data center deployment remains the standard for these massive models, it introduces significant challenges regarding customer data privacy, latency, and rising operational costs. For these reasons, there is an urgent need for a paradigm shift towards Edge AI, in which generative AI is deployed directly on edge devices close to the data source.

Comparison of Existing Standards: Although industry-standard benchmarks such as MLPerf Inference provide rigorous metrics for throughput (Queries Per Second) and latency on server-grade hardware, they focus mostly on performance maximization. MLPerf does not yet take hardware lifecycle metrics into account, such as the environmental cost of manufacturing (embodied carbon) or hardware reusability potential.

On the other hand, edge-oriented tools such as DeepEdgeBench are optimized for discriminative tasks (e.g., object detection using CNNs) and do not capture the autoregressive generation dynamics of LLMs. LEAF closes this divide by adopting the strict latency and accuracy measures of MLPerf and combining them with new sustainability measures (Circular Economy Score, Energy-per-Token). This broadens the assessment from the speed of a deployment alone to its sustainability and efficiency as well.

Moving Large Language Model deployment from the cloud to the edge presents a unique set of validation challenges that existing frameworks fail to address adequately. Traditional edge benchmarks, such as DeepEdgeBench, were designed for discriminative Deep Learning models (e.g., CNNs for object detection) and focus primarily on inference latency and accuracy on static datasets. These metrics are insufficient for generative AI, where performance is multidimensional, involving token generation speed (tokens/sec), semantic coherence (BERTScore), and energy consumption. Furthermore, the recent literature often treats “performance” and “sustainability” as separate domains. While recent surveys discuss hardware constraints such as energy consumption, no unified framework evaluates the Circular Economy impact of hardware choices, specifically the trade-off between manufacturing new edge SoCs like the Raspberry Pi 5 and repurposing existing older hardware like legacy GPUs, and how each option performs as an Edge AI computing device.

To bridge this gap, we introduce LEAF (LLM Edge Assessment Framework), a novel evaluation and benchmarking framework designed specifically for Edge AI. Unlike previous benchmarks that use a single metric, LEAF assesses edge deployment through five performance pillars:

Circular Economy Score: Quantifies the sustainability value of repurposing existing hardware to reduce e-waste.
Energy Efficiency: Calculates the energy cost per inference (Joules/Token).
Performance Speed: Evaluates the token generation throughput (Tokens/Second).
Model Accuracy: Uses automated semantic metrics (BERTScore F1) rather than manual human verification to measure semantic coherence.
End-to-End Latency (T_lat): Assesses the total wall-clock time from request submission to final token generation, representing the actual delay experienced by the user.

The proposed framework was validated through an extensive comparative analysis of five edge hardware devices, ranging from embedded single-board computers such as the Raspberry Pi 4 and 5 and NVIDIA’s Jetson Nano to a repurposed workstation (NVIDIA GTX 1050 Ti). Our experimental results reveal an important insight: the Circular Economy approach, which uses older, repurposed consumer GPUs, can outperform modern, purpose-built edge SoCs by a factor of 7× in speed while maintaining reasonable Energy Efficiency per task. These findings challenge the current mindset that newer hardware is always superior for Edge AI, and they substantiate the idea that sustainable, high-performance edge computing can be achieved through the careful reuse of legacy GPUs.

The remainder of this paper is organized as follows:

Section 2 provides background on generative AI in edge computing. Section 3 presents a detailed literature review and research gap analysis. Section 4 describes the methodology and framework design, Section 5 gives the implementation details of LEAF, Section 6 presents the Results and Discussion, and Section 7 concludes the paper and outlines future work.

2. Background

Recent deployments of Artificial Intelligence at the edge have evolved significantly, transitioning from simple predictive tasks to complex generative capabilities. This section of the paper reviews the recent trajectories of Edge AI, the limitations of current benchmarking methodologies, and the optimization techniques that make the deployment of LLMs on the edge feasible.

2.1. The Evolution: From Discriminative to Generative Edge AI

Traditionally, Edge AI research focused on “discriminative” models such as Convolutional Neural Networks (CNNs), typically used for object detection, or Recurrent Neural Networks (RNNs), typically used for simple time-series forecasting. These models are characterized by deterministic outputs and a relatively small memory footprint, which allowed them to run on microcontrollers and early edge accelerators like the Coral TPU.

That said, recent advancements in Large Language Models (LLMs) have introduced a paradigm shift towards “Generative Edge Intelligence.” Unlike their predecessors, GenAI models are memory- and compute-hungry: they require more GPU or CPU memory to perform the token generation on which an LLM depends. The recent literature also highlights that generative AI deployment in the cloud faces critical bottlenecks in latency and privacy, especially for applications in healthcare, smart cities, and industrial IoT, where data sovereignty is crucial. Researchers are now exploring new architectures to bring these Large Language Models to the edge, using techniques such as Fog Computing and distributed inference.

2.2. Challenges in Benchmarking Edge LLMs

Benchmarking is an essential tool for quantifying progress; however, existing benchmarking frameworks for Edge AI have significant limitations and are largely ill-suited for generative AI deployment. Established suites like DeepEdgeBench and MLPerf Mobile focus on metrics that are relevant to image recognition tasks, principally throughput (frames per second) and accuracy. These metrics do not capture the nuances of text generation, such as “Time-to-First Token” (TTFT) or semantic coherence.

Recent research on generative AI at the edge has begun to address this gap. For instance, Huang et al. (2025) [1] presented a performance evaluation of quantized LLMs using the Ollama framework, measuring tokens per second on edge-capable hardware. Similarly, Nezami et al. (2025) [2] proposed the BeDGED dataset specifically to stress-test generative AI on the edge. However, these frameworks often operate in isolation, focusing either purely on hardware performance or purely on model hallucination and correctness, so the two aspects cannot easily be combined into a unified scoring metric. Furthermore, system-level benchmarks for “Cloud-Edge” routing often treat the LLM as a black box, optimizing for network performance rather than the edge device’s internal efficiency.

2.3. Optimization Techniques Enabling Edge Deployment

Beyond static quantization, dynamic task planning is also necessary for successful deployment on heterogeneous edge nodes. Recent studies on streamlining task planning systems note the significance of adaptive algorithms for allocating resources in distributed IoT settings [3]. LEAF reflects these concepts by measuring the dynamic End-to-End Latency and System Processing Time rather than relying only on the fixed model size, and it treats these measurements as a proxy for how effectively the hardware-software stack handles computational loads.

2.4. The Sustainability Gap: Energy vs. Circular Economy

While the technical feasibility of Edge LLMs is well-documented, the sustainability dimension remains underexplored. Current optimization literature focuses heavily on operational Energy Efficiency—minimizing the Joules consumed per inference to extend battery life. However, this view neglects the embodied carbon cost of manufacturing new hardware.

Recent studies on industrial applications of generative AI emphasize the importance of cost-effective and ruggedized solutions, yet few frameworks explicitly evaluate the Circular Economy potential of repurposing older hardware (e.g., legacy GPUs) for modern AI workloads such as generative AI. While cost-optimization models for edge networks exist, they typically optimize for operational expenses (bandwidth, electricity) rather than hardware lifecycle sustainability. The proposed LEAF addresses this critical gap by treating hardware reusability as a primary metric alongside speed and accuracy, offering a holistic view of sustainable Edge AI.

3. Literature Survey

This section reviews the most relevant literature. To establish the need for a multidimensional assessment framework, we analyze several high-stakes application areas, such as medical diagnostics and social media integrity, to determine the operational constraints (e.g., privacy, latency, accuracy) within which an effective Edge AI benchmark should operate. The first ten references are discussed in detail, and a broader analysis is given in Table 1.

The integration of Large Language Models (LLMs) into diverse computational domains has catalyzed a significant body of research focusing on automation, security, and optimization. However, a critical analysis of recent literature reveals a predominant focus on cloud-centric deployment and semantic accuracy, often overlooking the hardware constraints of edge environments. This section categorizes existing works into three primary domains: Industrial and IoT Applications, Cybersecurity and Reliability Frameworks, and Edge Benchmarking Methodologies.

3.1. LLMs in Industrial and IoT Applications

Recent studies have demonstrated the efficacy of LLMs in streamlining complex industrial workflows. For instance, research on AI-assisted industrial programming investigated the use of models like GPT-4 for generating Structured Text (ST) for Programmable Logic Controllers (PLCs). While this study introduced a benchmarking framework for code syntax and logic, the inference relied entirely on OpenAI’s cloud API, leaving the feasibility of on-device industrial code generation and the exploration of hardware constraints unaddressed. Similarly, in the realm of IoT service orchestration, the SOLAR framework utilizes LLMs to rank APIs based on Quality of Service (QoS) constraints. Although SOLAR incorporates edge-centric metrics such as latency and bandwidth into its recommendation engine, the LLM itself functions as a server-side discovery tool rather than an edge-resident component. Additionally, comprehensive surveys on machine translation highlight the paradigm shift toward context-aware neural translation, yet these reviews remain focused on linguistic capabilities without addressing the computational overhead of deploying such models on portable translation devices.

3.2. Cybersecurity and Explainability Frameworks

The application of generative AI in cybersecurity has also gained traction, particularly for threat detection and explanation. Integrated frameworks combining CNN-LSTM networks with Explainable AI (XAI) and LLMs have been proposed to detect phishing attacks in IoT environments. While these hybrid models achieve high classification accuracy, the implementation details imply gateway-level processing, lacking specific profiling of power consumption on resource-constrained sensors. To address evolving threats, the “ZeroDay-LLM” framework employs Retrieval-Augmented Generation (RAG) and fine-tuned Llama models to identify zero-day vulnerabilities. Despite its innovative use of RAG to mitigate outdated training data, the study prioritizes detection metrics (precision, recall) over hardware performance. Furthermore, the reliability of such models has been scrutinized through systematic validation methodologies. The “Learnability Framework” and scalable cross-domain evaluation pipelines focus on detecting hallucinations and validating information extraction consistency using metrics like BERTScore. However, these studies are strictly software-centric, neglecting the impact of hardware quantization on model reliability.

3.3. Edge Optimization and Benchmarking

A distinct body of work focuses on the optimization of neural networks for edge deployment. The “DeepEdgeBench” framework established a robust standard for profiling Deep Learning models across heterogeneous accelerators (e.g., Coral TPU, Jetson Nano). While seminal for computer vision tasks, DeepEdgeBench does not account for the autoregressive generation dynamics unique to LLMs. Alternative approaches explore network-level optimizations, such as dynamic quality-latency-aware routing, which simulates the offloading of inference tasks in decentralized wireless networks. Additionally, LLMs have been utilized as Neural Architecture Search (NAS) agents to design lightweight networks for edge devices. While this demonstrates the utility of LLMs as design tools, it does not address the challenge of running the generator model itself on the edge.

3.4. Identification of the Research Gap

Collectively, the reviewed literature exhibits a clear bifurcation: studies either focus on the semantic capabilities of LLMs running on powerful cloud infrastructure, or they focus on benchmarking traditional discriminative models (CNNs) on edge hardware. There remains a significant scarcity of frameworks that specifically evaluate generative AI directly on edge hardware. Existing benchmarks typically overlook the “Circular Economy” aspect of hardware reuse and fail to correlate semantic degradation (accuracy) with hardware efficiency (Joules/Token) in a unified metric. The proposed LEAF addresses this gap by providing a multidimensional assessment specifically tailored for on-device LLM inference.

Recent research has taken a new direction toward granular profiling of quantized LLMs on constrained hardware. Dettmers et al. [4] showed that integer quantization (int8/int4) can significantly reduce memory bandwidth requirements without significant loss in accuracy, which is essential for edge deployment. Similarly, Lin et al. [5] pointed out that activation-aware quantization (AWQ) is critical for preserving reasoning in low-bit environments. In addition, hardware-aware profiling systems, such as those studied by Cai et al. [6], imply that static measures (such as TOPS) often fail to predict real inference latency because of thermal throttling and memory congestion on platforms such as the Raspberry Pi and Jetson Nano. These analyses highlight the importance of the telemetry-based energy measures used in LEAF.

3.5. Metric Selection Rationale

Equal-Weight Multidimensional Assessment Rationale: The selection and weighting of the five LEAF metrics follow directly from the conflicting demands identified during the literature review.

Semantic Accuracy: Because the hallucination thresholds reported in medical and legal LLM experiments are rigorous, we include BERTScore as an essential metric.
Latency and Speed: The QoS constraints of real-time IoT systems, such as SOLAR, motivate the inclusion of Time-to-First-Token and Tokens/Second.
Circular Economy: The striking lack of sustainability metrics in common benchmarks, such as DeepEdgeBench, motivated the addition of the ‘Circular Economy Score.’

LEAF therefore uses an equal-weighting strategy on the radar chart. This design choice reflects the observation that, in general-purpose Edge AI, a deployment that maximizes a single parameter (e.g., speed) at the cost of another (e.g., accuracy) will not be feasible.
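A minimal sketch of what equal weighting over the five pillars could look like, assuming each pillar has already been normalized to [0, 1]. The function name, the use of an arithmetic mean, the "weakest pillar" report, and the example values are all illustrative assumptions; the paper states only that the pillars are weighted equally on the radar chart.

```python
from statistics import mean

# Pillar names are illustrative labels for the five LEAF metrics.
PILLARS = ("circular_economy", "energy_efficiency", "speed", "accuracy", "latency")

def equal_weight_summary(scores: dict) -> dict:
    """Summarize five normalized pillar scores with equal weights.

    Returns the unweighted mean and the name of the weakest pillar.
    ASSUMPTION: the mean is one possible aggregate, not the paper's formula.
    """
    for name in PILLARS:
        value = scores[name]  # raises KeyError if a pillar is missing
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be normalized to [0, 1], got {value}")
    values = [scores[name] for name in PILLARS]
    weakest = min(PILLARS, key=lambda name: scores[name])
    return {"mean": mean(values), "weakest": weakest}

# Hypothetical values for illustration only (not measured results).
speed_only = {"circular_economy": 0.5, "energy_efficiency": 0.5,
              "speed": 1.0, "accuracy": 0.0, "latency": 0.5}
balanced = {name: 0.5 for name in PILLARS}
```

In this toy case both profiles share the same mean, yet the weakest-pillar field exposes the collapsed accuracy of the speed-only profile. This is the kind of imbalance that a radar view makes visible and a single averaged number can hide.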

Table 1 below presents a detailed research gap analysis, indicating for each work whether it covers benchmarking and LLM-on-edge implementation:

Table 1. Detailed gap analysis of the most closely related literature.

Paper	Main Focus	Benchmarking	LLM on Edge
Huang (2025) [1]	Ollama on Edge Benchmark	Yes	Yes
Nezami (2025) [2]	(BeDGED)	Yes (Dataset)	Yes
Dettmers (2022) [4]	8-bit Optimizers (Training)	No	Indirect
Adnyana (2026) [7]	PLC Code Gen Prompting	No	No
Al-Masri (2025) [8]	Edge API Discovery	No	No
Alasmari (2025) [9]	IoT Phishing Detection	No	Partial/Ambiguous
Alsuwaiket (2025) [10]	Zero-Day Threat Detection	No	No
Ataman (2025) [11]	Machine Translation Survey	No	No
Baller (2021) [12]	DNN Hardware Benchmarking	Yes (DNNs)	No (CNNs)
Bao (2025) [13]	Wireless Network Routing	Yes (Network)	Yes
Benmeziane (2024) [14]	NAS for Edge via LLM	Indirect	No
Çetinkaya (2025) [15]	Validation Framework	No	No
Chakraborty (2025) [16]	Hallucination Evaluation	No	No
Chen (2025) [17]	Length Control Fine-Tuning	No	No
Han (2016) [18]	Deep Compression (Pruning)	Yes (Mobile)	No (CNNs)
Hao (2023) [19]	Distributed Benchmarking	Yes (System)	No
Jain (2025) [20]	Chatbot Arch. Scaling	No (Simulation)	Yes (Theoretical)
Jebli (2025) [21]	Fog Computing LLM Survey	No	Yes
Jin (2025) [22]	Cloud-Edge Collaboration	Yes	Yes
Kim (2025) [23]	Mobile Korean LLM	Yes	Yes
Kohli (2025) [24]	Heterogeneous Edge Profiling	Yes	No (DNNs)
Krishnamurthy (2025) [25]	Fog Resource Provisioning	Yes (System)	No (LLM guides)
Lee (2025) [26]	Edge GPU Holistics (Thermal)	Yes	No (CNNs)
Li (2025) [27]	Efficient LLM Survey	No	Yes (Techniques)
Liu (2025) [28]	DeepSeek-R1 Fintech Eval	No	No
Liu (2025) [29]	Robotics and LLM Review	No	Yes (Concept)
Liu (2025) [30]	Smart Home Edge Routing	Yes (System)	Yes
Minott (2025) [31]	GenAI Edge Dataset	Yes	No (Coral/CNNs)
Nezami (2025) [32]	GenAI Edge Perf. Eval	Yes	Yes
Pozi (2025) [33]	Data-Augmented Routing	Yes (System)	Yes
Ranjan (2025) [34]	Vision transformers	Yes (Efficiency)	No (Vision)
Ray (2025) [35]	P2P CPU-Only LLM	Yes	Yes
Ren (2025) [36]	Edge Expert Deployment Cost	Yes (Simulation)	Yes
Saha (2025) [37]	Medical LLM Accuracy Eval	No	No
Shaikh (2025) [38]	Agriculture LLM Review	No	No
Sun (2025) [39]	Satellite Edge Optimization	Yes (System)	No (LLM guides)
Sun (2025) [40]	Trusted 6G LLM	Yes	Yes
Thapa (2025) [41]	Social Science LLM Review	No	No
Wang (2025) [42]	Federated Learning	Yes (Simulation)	No
Wang (2025) [43]	Edge LLM Survey	No	Yes (Concept)
Yang (2025) [44]	IoT + LLM + Privacy Review	No	Yes (Concept)
Yin (2024) [45]	Edge PM2.5 Forecasting	Yes (System)	Yes
Yuan (2025) [46]	Smart City Offloading (LLM)	Yes (System)	No (LLM guides)
Zhang (2025) [47]	5G Spec Contradiction Detect	No	No
Zhang (2025) [48]	Wireless Edge GenLLM	Yes	Yes
Zhang (2025) [49]	HLS Code Correction	No	No
Zhu (2024) [50]	UAV Task Offloading (MARL)	Yes (System)	No (LLM guides)
Surianarayanan (2023) [51]	Edge AI Optimization Survey	No	No (DL/CNNs)
Stadnicka (2022) [52]	Industrial AI Needs Survey	No	No
Rupanetti (2024) [53]	Edge IoT Security (Intrusion)	Yes (System)	No (ML)
Liang (2025) [54]	Math Methods for Edge AI	No	No
Lawal (2024) [55]	Railroad Bridge Monitoring	Yes (Sensor)	No (TinyML)
Gültekin (2022) [56]	Vehicle Fault Detection	Yes (System)	No (ML)
Chen (2024) [57]	Feasibility of Edge AI	Yes (Simulation)	No
Bourechak (2023) [58]	AI/Edge Convergence	No	No
Mustafa (2025) [59]	Automation of benchmarking	Yes	No

4. Methodology and Framework Design

To evaluate the feasibility and sustainability of deploying Large Language Models (LLMs) on edge infrastructure, we propose the LLM Edge Assessment Framework (LEAF). This section details the architectural design of LEAF, the mathematical formulation of its metrics, and the experimental setup used to validate the framework.

4.1. The LEAF Architecture

LEAF is designed as a modular, three-layered system that ingests heterogeneous hardware and model configurations and outputs a multi-dimensional performance score.

The LEAF system architecture consists of three primary layers, as shown in Figure 1:

The Input Layer (Edge Environment): This layer represents the physical hardware (e.g., Raspberry Pi, Jetson, Legacy GPU Servers) as well as the Edge AI software stack (quantized models via Ollama).
The Assessment Core (LEAF Engine): The central component, which monitors system telemetry (power, latency) and evaluates model output quality against a gold standard.
The Visualization Layer (Output): This layer generates a five-axis radar chart, which offers a compact way to visualize the trade-offs between sustainability, speed, and accuracy.

4.2. Evaluation Metrics Definition

Quantization Details: All models were deployed in GGUF (GPT-Generated Unified Format) for cross-platform compatibility between ARM CPUs (Arm Ltd., Cambridge, UK) and NVIDIA GPUs (NVIDIA Corporation, Santa Clara, CA, USA). We used the q4_k_m (4-bit Medium K-Quant) quantization scheme. This k-quant technique assigns more bits of resolution to sensitive attention blocks and fewer to less sensitive feed-forward blocks, thus mitigating the non-uniform perplexity loss of standard uniform 4-bit quantization.

LEAF evaluates edge performance through five distinct metrics, each normalized to a common scale for comparative analysis.

Circular Economy Score (S_CE): This metric quantifies the environmental benefit of repurposing hardware. It is a discrete score assigned based on the hardware’s lifecycle.
- Definition: $S_{CE} \in \{0.0, \dots, 1.0\}$, where 1.0 represents fully repurposed e-waste (e.g., 5+ year old GPU) and 0.0 represents newly manufactured silicon with high embodied carbon.
- Rationale: Incentivizes the extension of device lifespan, aligning with green computing principles.
Energy Efficiency (E_eff)
- Definition: This is the energy cost to generate a complete response.
  
  $E_{total} = P_{avg} \times T_{inference}$  (1)
  
  where $P_{avg}$ is the average power consumption (Watts) during load, and $T_{inference}$ is the total time taken.
- Normalization: Higher efficiency (lower Joules) results in a higher score.
Performance Speed (R_gen)
- Definition: The rate of text generation, measured in Tokens Per Second (TPS).

  $R_{gen} = \frac{N_{tokens}}{T_{gen}}$  (2)

  where $N_{tokens}$ is the number of generated tokens and $T_{gen}$ is the generation time in seconds.
- Relevance: Critical for user experience in interactive applications like chatbots.
Model Accuracy (F1_BERT)
- Definition: We utilize BERTScore to evaluate semantic similarity between the LLM-generated summary and a human-verified reference summary.
- Rationale: Unlike n-gram metrics (ROUGE/BLEU), BERTScore captures contextual meaning, which is essential for evaluating generative reasoning.
End-to-End Latency (T_lat)
- Definition: The total wall-clock time measured from the initial request submission to the completion of the task.
- Normalization: Inverted scale, where lower time yields a higher score.
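
The metric definitions above can be sketched in a few lines of Python. Equations (1) and (2) are taken from the text. The Joules/Token helper and the min-max normalization are assumptions added for illustration, since the section does not state the exact normalization formula, and all function names are hypothetical.

```python
def energy_joules(p_avg_watts: float, t_inference_s: float) -> float:
    """Equation (1): E_total = P_avg * T_inference."""
    return p_avg_watts * t_inference_s

def tokens_per_second(n_tokens: int, t_gen_s: float) -> float:
    """Equation (2): R_gen = N_tokens / T_gen."""
    if t_gen_s <= 0:
        raise ValueError("generation time must be positive")
    return n_tokens / t_gen_s

def joules_per_token(e_total_j: float, n_tokens: int) -> float:
    """Energy cost per generated token (ASSUMPTION: E_total / N_tokens)."""
    if n_tokens <= 0:
        raise ValueError("token count must be positive")
    return e_total_j / n_tokens

def min_max(values, higher_is_better=True):
    """Scale raw values to [0, 1] across devices.

    ASSUMPTION: min-max scaling is one plausible choice; the paper only says
    metrics are normalized. With higher_is_better=False the scale is inverted,
    so lower energy or latency yields a higher score.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]  # all devices tie
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]
```

For example, `min_max([10, 20, 30], higher_is_better=False)` maps the lowest latency to 1.0 and the highest to 0.0, matching the inverted scale described for energy and latency.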

4.3. Experimental Setup

To validate the proposed framework, we deployed a hardware testbed of edge devices representing different architectural paradigms:

Testbed Hardware:
- Low-Power Edge: Raspberry Pi 4 (4 GB) and Raspberry Pi 5 (8 GB) (Raspberry Pi Foundation, Cambridge, UK)—representing ARM-based CPU inference.
- Specialized Edge: NVIDIA Jetson Nano (NVIDIA Corporation, Santa Clara, CA, USA)—representing older, dedicated edge accelerators.
- Repurposed Workstation: AI Server with NVIDIA GTX 1050 Ti (4 GB) (NVIDIA Corporation, Santa Clara, CA, USA)—representing the “Circular Economy” candidate.
- Professional Edge: Physical Server with NVIDIA T400 (4 GB)—representing modern entry-level professional workstations.
Software Stack:
- Inference Engine: Ollama (v0.1.29) (Ollama, San Francisco, CA, USA) serving GGUF quantized models (q4_k_m).
- Models Evaluated: granite3.3:2b, llama3.2:3b, gemma:2b, tinyllama, qwen2:0.5b, and deepseek-r1:1.5b.
- Benchmarking Tool: A custom Python (Python Software Foundation, Wilmington, DE, USA) pipeline using BERTScore (v0.3.13) for accuracy and system timers for latency.
Procedure: Each device processed a standardized prompt (“Summarize the history of artificial intelligence”) across all models. Metrics were recorded over multiple runs to ensure statistical stability, capturing inference time, output text, and system telemetry.

5. Implementation

In order to validate the operation of the LEAF, we established a heterogeneous hardware testbed representing a spectrum of edge computing devices. These devices range from embedded IoT boards to a repurposed commercial GPU. This section details the hardware specifications, the software stack, and the automated testing pipeline that is used to generate performance metrics. The implementation testbed is shown in Figure 2 below.

5.1. Hardware Testbed Configuration

The experimental setup consisted of five edge nodes, selected to represent different “Edge Classes” as defined in the LEAF.

Node A: Specialized edge (NVIDIA Jetson Nano)

CPU: Quad-core ARM Cortex-A57 @ 1.43 GHz.
GPU: 128-core Maxwell.
RAM: 4 GB LPDDR4.
Role: Represents older, GPU-accelerated edge devices common in edge-industrial deployments.

Node B: Standard IoT Edge (Raspberry Pi 4 Model B)

CPU: Quad-core Broadcom BCM2711 (Cortex-A72) @ 1.5 GHz.
RAM: 4 GB LPDDR4.
Role: Represents the baseline for CPU-based edge inference.

Node C: Modern IoT Edge (Raspberry Pi 5)

CPU: Quad-core Broadcom BCM2712 (Cortex-A76) @ 2.4 GHz.
RAM: 4 GB LPDDR4X.
Role: Represents the new generation of high-performance CPU edge nodes.

Node D: Rack-Mounted Industrial Edge Server (Physical Server T400)

CPU: 16-Core Processor.
RAM: 128 GB.
GPU: NVIDIA T400 (Professional Low-Profile).
Role: Represents a professional-grade edge gateway.

Node E: Circular Economy Server (AI Server)

CPU: Intel Core i5 (4 Cores).
RAM: 32 GB.
GPU: NVIDIA GTX 1050 Ti (Consumer Legacy).
Role: Represents the “Circular Economy” approach, utilizing repurposed consumer hardware.

5.2. Software Stack and Orchestration

To ensure reproducibility and fair comparison, a unified software stack was deployed across all edge nodes. The specification of the software stack is as follows:

Operating System: Linux-based environments (Ubuntu 20.04/22.04 LTS for Servers/Jetson; Raspberry Pi OS Bookworm for the Raspberry Pis).
LLM Runtime Engine: Ollama (v0.1.x) was used for its lightweight footprint and efficient management of GGUF quantized models.
Benchmarking Agent: A custom Python script (score_llm.py) was deployed on each node. The script uses Python's subprocess module to invoke the LLM and a customized version of the BERTScore library (v0.3.13) for semantic evaluation on edge devices. The stock BERTScore library could not run on the edge nodes because of its large memory requirements. The code for the Python script will be provided by the author upon request.
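
For readers unfamiliar with the metric, the following sketch illustrates the standard greedy-matching formulation behind BERTScore using toy vectors in place of contextual token embeddings. It is not the authors' customized library: the function names and the two-dimensional toy embeddings are assumptions, and refinements of the published metric such as IDF weighting and baseline rescaling are omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def bertscore_from_embeddings(cand, ref):
    """Greedy-matching precision, recall, and F1 over token embeddings.

    cand, ref: lists of token embedding vectors for the candidate and the
    reference text. In real BERTScore these come from a pretrained model.
    """
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    total = precision + recall
    f1 = 0.0 if total == 0 else 2 * precision * recall / total
    return precision, recall, f1

# Toy example: the candidate has one token matching the reference and one unrelated token.
toy_candidate = [[1.0, 0.0], [0.0, 1.0]]
toy_reference = [[1.0, 0.0]]
```

With these toy inputs the unrelated candidate token lowers precision while recall stays perfect, which shows how the F1 value balances the two directions of matching.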

5.3. Automated Testing Pipeline

The Efficiency Paradox (75 W vs. 5 W)

A key challenge in edge benchmarking is the large gap between the power envelopes of desktop GPUs and embedded SoCs. Comparing a repurposed 75 W GTX 1050 Ti with a 5 W Raspberry Pi 4 on generative workloads, our analysis indicates that the GTX 1050 Ti delivers better performance per watt-hour than the Raspberry Pi 4.

Although the Raspberry Pi 4 draws less instantaneous power (around 5 W), its inference time is longer (around 11 s), so its overall energy consumption is around 55 Joules per task. By comparison, the GTX 1050 Ti, despite a higher system power draw (approximately 100 W), finishes the task in 0.29 s and uses only approximately 29 Joules.
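
Applying Equation (1) to the approximate figures quoted above reproduces the per-task energy of each device (the subscripts are labels introduced here for readability):

$E_{Pi4} \approx 5\,\text{W} \times 11\,\text{s} = 55\,\text{J}$

$E_{GTX} \approx 100\,\text{W} \times 0.29\,\text{s} = 29\,\text{J}$

$\frac{E_{Pi4}}{E_{GTX}} \approx \frac{55}{29} \approx 1.9$

Under these rounded inputs, the Raspberry Pi 4 spends roughly 1.9 times as much energy per task even though its instantaneous power draw is about twenty times lower.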

Thus, when Energy Efficiency (Metric 2) and the Circular Economy Score (Metric 1) are viewed from this perspective, the high-wattage legacy hardware is, counterintuitively, more efficient per task than the low-power SoC. This supports the LEAF strategy of focusing on task-completion energy rather than thermal design power (TDP). The detailed power envelopes and energy consumption are given in Table 2 below.

The testing process was automated to eliminate human timing errors. The steps below outline the workflow implemented on each device.

Model Pull: The target model (e.g., ibm/granite3.3:2b) is explicitly pulled to local storage so that download time is excluded from the inference metric.
Inference Execution: The script sends a standardized prompt (“Summarize the history of artificial intelligence”) to the Ollama API.
Telemetry Capture:
○ Latency: Captured via system timestamps (time.time()) immediately before sending the prompt and after receiving the final token (End-of-Sequence).
○ Output Text: The full generated string is captured from stdout.
Quality Assessment: The generated text is compared against a pre-defined “Gold Standard” summary using a customized BERTScore algorithm for edge devices to calculate precision, recall, and F1. The gold-standard summary was generated by Google’s Gemini 3 Pro model.
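The pull, execute, and capture steps above can be sketched as follows. This is a minimal illustration, not the authors' score_llm.py: the function names (`pull_model`, `run_ollama`, `timed_inference`) and the injectable `runner` parameter are assumptions made for the example, and it assumes the `ollama` command-line tool is installed and invoked as `ollama pull <model>` and `ollama run <model> <prompt>`. The BERTScore step is omitted because the customized scorer is not described in enough detail to reproduce.

```python
import subprocess
import time
from typing import Callable, Dict

PROMPT = "Summarize the history of artificial intelligence"


def pull_model(model: str) -> None:
    """Download the model ahead of time so the pull is excluded from timing."""
    subprocess.run(["ollama", "pull", model], check=True)


def run_ollama(model: str, prompt: str) -> str:
    """Invoke the Ollama CLI once and return the text written to stdout."""
    result = subprocess.run(
        ["ollama", "run", model, prompt],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def timed_inference(
    model: str,
    prompt: str = PROMPT,
    runner: Callable[[str, str], str] = run_ollama,
) -> Dict[str, object]:
    """Measure end-to-end latency around a single inference call."""
    start = time.time()             # timestamp immediately before sending the prompt
    output = runner(model, prompt)  # returns once the full response is available
    end = time.time()               # timestamp after the final token has arrived
    return {"model": model, "latency_s": end - start, "output": output}
```

On a node with Ollama installed, calling `pull_model("ibm/granite3.3:2b")` followed by `timed_inference("ibm/granite3.3:2b")` would return the latency and generated text for one run. The `runner` argument exists only so the timing logic can be exercised without a model present.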

5.4. Limitations and Organizational Adoption

Although LEAF offers a strong foundation for single-node evaluation, three critical weaknesses limit its scalability at the organizational level.
Scaling Sustainability Statistically: The existing Circular Economy Score is based on a discrete, manually assigned rubric. In a corporate setting, it should be replaced by active integration with Lifecycle Assessment (LCA) databases to quantify embodied carbon accurately.
Single-Node Isolation: LEAF currently tests isolated devices. It does not account for network overhead in distributed fog clusters, in which inference may be spread across several Raspberry Pis.
Quality Proxy: BERTScore measures semantic similarity, which is only a proxy for output quality, so domain-specific hallucinations (e.g., in legal or medical text) may go undetected.
Organizational Fixes: Organizations that aim to implement LEAF successfully should incorporate the framework into CI/CD pipelines, so that checks for Energy-per-Token and accuracy run whenever a model is changed. Moreover, the sustainability indicators should be aligned with corporate ESG (Environmental, Social, and Governance) reporting criteria to convert the subjective Circular Score into an auditable carbon indicator.
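One way the CI/CD check suggested above might look is sketched below. Everything here is hypothetical: the `RunMetrics` structure, the `leaf_gate` function, the tolerance values, and the sample numbers are illustrative assumptions, not thresholds or measurements from this study.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunMetrics:
    energy_per_token_j: float  # joules per generated token
    f1: float                  # BERTScore F1 against the reference summary


def leaf_gate(
    candidate: RunMetrics,
    baseline: RunMetrics,
    max_energy_increase: float = 0.10,  # hypothetical tolerance: +10% energy
    max_f1_drop: float = 0.02,          # hypothetical tolerance: -0.02 F1
) -> List[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    energy_limit = baseline.energy_per_token_j * (1.0 + max_energy_increase)
    if candidate.energy_per_token_j > energy_limit:
        failures.append(
            f"energy per token {candidate.energy_per_token_j:.3f} J "
            f"exceeds limit {energy_limit:.3f} J"
        )
    f1_floor = baseline.f1 - max_f1_drop
    if candidate.f1 < f1_floor:
        failures.append(f"F1 {candidate.f1:.3f} is below floor {f1_floor:.3f}")
    return failures


# Hypothetical numbers, for illustration only.
baseline = RunMetrics(energy_per_token_j=0.50, f1=0.76)
candidate = RunMetrics(energy_per_token_j=0.60, f1=0.70)
for problem in leaf_gate(candidate, baseline):
    print("FAIL:", problem)
```

A pipeline step could exit with a non-zero status whenever the returned list is non-empty, which would block a model change that regresses either metric.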

6. Results and Discussion

The results obtained from the edge devices were plotted separately for each metric and are discussed in turn in this section.

6.1. Hardware Speed and Latency Analysis

The inference latency (time-to-completion) for all six quantized LLMs across the five hardware nodes is visualized in Figure 3 below. Note that the Y-axis utilizes a logarithmic scale to accommodate the magnitude of the difference between CPU and GPU architectures.

Key Observations for the Hardware Speed Comparison.

1- Performance Disparity Across Architectures: A distinct performance chasm separates the dedicated GPU nodes from the CPU-based edge devices. The AI Server (GTX 1050 Ti) and the Physical Server (T400) consistently completed the injected prompt in under 0.5 s, a speedup of roughly 20× to 50× over the Raspberry Pi 4 edge node.

2- Evaluation of Reused Hardware Efficacy: Surprisingly, the AI Server (GTX 1050 Ti), a repurposed consumer GPU, outperformed the professional-grade T400 Server across all the quantized LLMs. For instance, on the quantized tinyllama model, the 1050 Ti achieved a latency of 0.17 s, compared to 0.36 s on the rack-mounted server with the T400 GPU. This result challenges the assumption that enterprise-grade hardware is strictly necessary for edge inference and indicates that high-clock-speed consumer silicon remains highly competitive for generative AI tasks.

3- Edge Hierarchy: Among the edge nodes in our testbed, the Raspberry Pi 5 emerged as the clear leader, processing the quantized model (qwen2:0.5b) in 1.01 s, a 4.5× improvement over the Raspberry Pi 4 (4.50 s). The Jetson Nano, despite having a dedicated GPU, averaged ~8 s and lagged behind the RPi 5. This illustrates the limitations of the older Maxwell GPU microarchitecture against modern high-frequency CPUs (Cortex-A76) for quantized transformer workloads, for which the Jetson Nano’s Maxwell GPU lacks specific tensor optimization.
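As a quick cross-check of the speedup range quoted in the first observation, the factors can be recomputed from the averaged inference times in Table 3. This sketch uses averages across models rather than the per-model latencies plotted in Figure 3, and the helper name `speedup_over` is an assumption made for the example.

```python
# Average inference times in seconds, copied from Table 3.
avg_time_s = {
    "Jetson Nano": 8.04,
    "RPi 4": 10.96,
    "RPi 5": 2.20,
    "Server (T400)": 0.43,
    "AI Server (1050 Ti)": 0.29,
}


def speedup_over(baseline: str, times: dict) -> dict:
    """Speedup of every device relative to the baseline (baseline time / device time)."""
    base = times[baseline]
    return {name: base / t for name, t in times.items()}


speedups = speedup_over("RPi 4", avg_time_s)
for name, factor in sorted(speedups.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {factor:.1f}x")
```

On these averages the two GPU nodes come out at roughly 38× (GTX 1050 Ti) and 25× (T400) relative to the Raspberry Pi 4, both inside the ~20× to 50× range.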

Impact of Thermal Throttling on Active Runtime

The differences in device processing time (Table 2) are closely linked to thermal stability. On the Raspberry Pi 4, each inference keeps the CPU under sustained load for 10.96 s. Over time, this prolonged active state leads to rapid thermal saturation of passively cooled systems, which in turn triggers Dynamic Voltage and Frequency Scaling (DVFS) throttling.

By contrast, the GTX 1050 Ti benefits from the race-to-idle effect. Its inference latency is only 0.29 s; hence, the graphics card finishes its workload before thermal soak causes a significant temperature rise. As a result, the GTX 1050 Ti showed better performance stability in repeated runs (standard deviation $\sigma \approx 0.02$ s), whereas the Raspberry Pi 4 exhibited higher variance ($\sigma \approx 0.8$ s) because it is thermally downclocked during its 11 s processing window. This supports the hypothesis that minimizing active runtime is a more effective thermal management approach than minimizing peak power.

6.2. Semantic Accuracy (F1 Score) Stability

Figure 4 below illustrates the F1 scores derived via BERTScore. In theory, the same quantized model should yield identical outputs across devices, but our results show minor variance, which is discussed below.

Key Observations for F1 Score:

1- Quantization Reliability: The F1 scores remained highly stable across all edge nodes, fluctuating by less than 5% for most models. This confirms that deploying a 4-bit quantized model (in the GGUF format) on low-power edge devices like the Raspberry Pi does not result in “lobotomized” performance.

2- The “Slow but Accurate” RPi 4: Counterintuitively, the Raspberry Pi 4 achieved the highest single F1 score (0.783 for the quantized llama3.2), surpassing the servers. This anomaly suggests that the specific quantization kernels used for ARM CPUs in Ollama may prioritize precision over speed, whereas the CUDA kernels prioritize throughput. This finding supports the use of ultra-low-cost hardware for offline, high-accuracy summarization tasks where latency is not critical.

6.3. Holistic Assessment: The LEAF Radar Chart

While speed and accuracy provide isolated data points, the LEAF Radar Chart shown in Figure 5 below synthesizes these metrics with sustainability and efficiency, revealing the true “personality” of each deployment strategy.

Prior to the analysis, the collected data are summarized in Table 3 below:

6.4. Metrics Calculation Methodology

Calculation Methodology per Metric.

Metric A: Circular Economy ($S_{CE}$)

Definition: A subjective score (0–10) representing the hardware’s sustainability status (Old/Refurbished = High, New = Low).
Formula:

$S_{CE} = \frac{C}{10}$

(3)
Example (AI Server): The GTX 1050 Ti is older hardware (high reuse), assigned a raw score of 9.

$S_{CE} = \frac{9}{10} = 0.9$

Metric B: Energy Efficiency ($E_{eff}$)

Step 1: Calculate Raw Energy (Joules).

$\text{Energy (J)} = \text{Time (s)} \times \text{Power (W)}$

(4)

○ AI Server: $0.29\ \text{s} \times 100\ \text{W} = 29\ \text{J}$.
○ T400 Server: $0.43\ \text{s} \times 40\ \text{W} = 17.2\ \text{J}$ (Lowest Energy).
○ Jetson Nano: $8.04\ \text{s} \times 10\ \text{W} = 80.4\ \text{J}$ (Highest Energy).
Step 2: Normalize (Inverted). Since lower energy is better, we invert the scale, so the lowest Joules gets 1.0.

$\text{Normalized} = \frac{\text{Max}_{J} - \text{Val}_{J}}{\text{Max}_{J} - \text{Min}_{J}}$

(5)

○ Max ($J_{max}$): 80.4 (Jetson).
○ Min ($J_{min}$): 17.2 (T400).
○ AI Server Score: $\frac{80.4 - 29}{80.4 - 17.2} = \frac{51.4}{63.2} \approx 0.81$

Metric C: Performance Speed ($R_{gen}$)

Step 1: Calculate Throughput Proxy.

$\text{Speed}\ (v) = \frac{1}{\text{Time}}$

(6)

AI Server: $1 / 0.29 \approx 3.45$ (Fastest).
RPi 4: $1 / 10.96 \approx 0.09$ (Slowest).

Step 2: Normalize (Standard). Higher speed is better.

$\text{Normalized} = \frac{\text{Val}_{v} - \text{Min}_{v}}{\text{Max}_{v} - \text{Min}_{v}}$

(7)

AI Server Score: $\frac{3.45 - 0.09}{3.45 - 0.09} = 1.0$

Metric D: Model Accuracy ($F1_{BERT}$)

Step 1: Identify Raw F1 Scores.
Best ($F_{max}$): 0.764 (RPi 4).
Worst ($F_{min}$): 0.745 (RPi 5).
Step 2: Normalize (Standard).

$\text{Normalized} = \frac{\text{Val}_{f} - \text{Min}_{f}}{\text{Max}_{f} - \text{Min}_{f}}$

(8)
RPi 4 score: $\frac{0.764 - 0.745}{0.764 - 0.745} = 1.0$

Note: Because the raw F1 scores are very close (0.74 vs. 0.76), normalization stretches these small differences to the full 0–1 range to make them visible on the chart.

Metric E: Processing Time Score ($T_{lat}$)

Step 1: Identify Raw Times.
Best ($T_{min}$): 0.29 s (AI Server).
Worst ($T_{max}$): 10.96 s (RPi 4).

Step 2: Normalize (Inverted). Lower time is better.

$\text{Normalized} = \frac{\text{Max}_{t} - \text{Val}_{t}}{\text{Max}_{t} - \text{Min}_{t}}$

(9)

AI Server Score: $\frac{10.96 - 0.29}{10.96 - 0.29} = 1.0$
RPi 4 score: $\frac{10.96 - 10.96}{10.96 - 0.29} = 0.0$
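The five calculations for Metrics A to E can be reproduced end to end from the raw values in Table 3. The sketch below is an illustrative reimplementation of Equations (3) to (9), not the authors' code; the names `RAW`, `min_max`, and `leaf_scores` are assumptions, as is the choice to return 0.0 when all devices share the same value.

```python
# Raw values copied from Table 3:
# (avg time in s, avg F1, est. power in W, assigned circular score 0-10).
RAW = {
    "Jetson": (8.04, 0.758, 10.0, 8),
    "RPi 4": (10.96, 0.764, 5.0, 9),
    "RPi 5": (2.20, 0.745, 8.0, 4),
    "T400": (0.43, 0.748, 40.0, 3),
    "AI Server": (0.29, 0.759, 100.0, 9),
}


def min_max(values: dict, invert: bool = False) -> dict:
    """Min-max normalize to [0, 1]; set invert=True when lower raw values are better."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    if span == 0:
        # Assumption for this sketch: identical values all map to 0.0.
        return {k: 0.0 for k in values}
    if invert:
        return {k: (hi - v) / span for k, v in values.items()}
    return {k: (v - lo) / span for k, v in values.items()}


def leaf_scores(raw: dict) -> dict:
    """Return (circular, energy, speed, accuracy, proc_time) per device."""
    circular = {k: c / 10 for k, (t, f, p, c) in raw.items()}                        # Eq. (3)
    energy = min_max({k: t * p for k, (t, f, p, c) in raw.items()}, invert=True)     # Eqs. (4)-(5)
    speed = min_max({k: 1 / t for k, (t, f, p, c) in raw.items()})                   # Eqs. (6)-(7)
    accuracy = min_max({k: f for k, (t, f, p, c) in raw.items()})                    # Eq. (8)
    proc_time = min_max({k: t for k, (t, f, p, c) in raw.items()}, invert=True)      # Eq. (9)
    return {
        k: (circular[k], energy[k], speed[k], accuracy[k], proc_time[k]) for k in raw
    }


scores = leaf_scores(RAW)
for device, row in scores.items():
    print(device, " ".join(f"{x:.2f}" for x in row))
```

Rounded to two decimals, the output matches the normalized values reported in Table 4, which suggests the table was derived from Table 3 using exactly these min-max formulas.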

Final Normalized Data Table

Table 4 below gives the exact matrix used to plot the pentagons in the LEAF assessment chart (Figure 5).

6.5. Key Observations for the LEAF Assessment Chart

1- The Balanced Polygon (AI Server): The GTX 1050 Ti (purple pentagon) forms a large and relatively regular pentagon, indicating good overall performance. It achieves perfect scores in speed (1.0) and End-to-End Latency (1.0), while also performing well in Circular Economy (0.9). The primary limitation observed here is Energy Efficiency, owing to its high power draw; however, because it generates high Tokens/Second, its overall energy-per-task efficiency remains high (0.81).

2- Specialist Spikes: The Raspberry Pi 4 (orange pentagon) spikes toward Circular Economy and accuracy but collapses on the speed axis. This reflects a system that is sustainable and accurate but not suited to real-time Edge AI applications.

3- Efficiency Triangle: The Raspberry Pi 5 (red pentagon) has a distinct triangular profile that emphasizes Energy Efficiency and Latency, with a comparatively low Circular Economy Score. This pattern identifies this edge node as a strong option for battery-powered, interactive edge applications, particularly when a server-class form factor is undesirable and a smaller physical footprint is preferred.

7. Conclusions and Future Work

The rapid development of generative AI at the edge requires a new evaluation paradigm that transcends the traditional binary of “Speed vs. Accuracy.” This paper introduced LEAF (LLM Edge Assessment Framework), a novel methodological framework that integrates Circular Economy principles with the technical assessment of Edge AI deployments.

Our evaluation of the five hardware edge nodes using the proposed LEAF leads to a pivotal conclusion: sustainability in Edge AI does not require sacrificing performance.

The “Circular Economy” approach of repurposing legacy consumer GPUs (in this case, the GTX 1050 Ti) proved to be the superior strategy, outperforming both modern embedded SoCs and professional edge servers in raw speed while diverting e-waste from landfills into modern edge generative AI applications.

For future work, we plan to extend LEAF to include “Data Privacy” as a quantifiable metric and to explore the orchestration of heterogeneous clusters, where “Circular” edge hardware and “Efficient” edge nodes work together in a tiered fog architecture.

Author Contributions

Conceptualization, M.A. and S.R.R.; methodology, M.A. and S.R.R.; software, M.A.; validation, M.A. and S.R.R.; formal analysis, M.A.; investigation, M.A. and S.R.R.; resources, S.R.R.; data curation, M.A.; writing—original draft preparation, M.A.; writing—review and editing, M.A.; visualization, M.A.; supervision, S.R.R.; project administration, S.R.R.; funding acquisition, not applicable. All authors have read and agreed to the published version of the manuscript.

Funding

The APC for this work was funded by Széchenyi István University’s Publication Support Program.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data underlying this article will be shared on reasonable request to the corresponding author.

Acknowledgments

The authors thank Sandor R. Repas for his supervision of this work, and Széchenyi István University for providing the required lab equipment. The authors reviewed and edited the output and take full responsibility for the content.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

AI	Artificial Intelligence
API	Application Programming Interface
ARM	Advanced RISC Machines (Processor Architecture)
BERT	Bidirectional Encoder Representations from Transformers
CNN	Convolutional Neural Network
CPU	Central Processing Unit
DL	Deep Learning
F1	F-Measure (Harmonic mean of precision and recall)
GenAI	Generative Artificial Intelligence
GGUF	GPT-Generated Unified Format
GPU	Graphics Processing Unit
IoT	Internet of Things
LEAF	LLM Edge Assessment Framework
LLM	Large Language Model
LPDDR	Low-Power Double Data Rate (Memory)
LTS	Long-Term Support (Software Version)
MEC	Mobile Edge Computing/Multi-Access Edge Computing
NAS	Neural Architecture Search
P2P	Peer-to-Peer
RAM	Random Access Memory
RLHF	Reinforcement Learning from Human Feedback
RNN	Recurrent Neural Network
SoC	System on a Chip
TDP	Thermal Design Power
TPS	Tokens Per Second
TPU	Tensor Processing Unit
TTFT	Time To First Token
ViT	Vision Transformer

References

Huang, D.; Wang, Z. LLMs at the Edge: Performance and Efficiency Evaluation with Ollama on Diverse Hardware. In Proceedings of the 2025 International Joint Conference on Neural Networks (IJCNN), Rome, Italy, 30 June–5 July 2025. [Google Scholar] [CrossRef]
Nezami, Z.; Hafeez, M.; Djemame, K.; Zaidi, S.A.R.; Xu, J. Descriptor: Benchmark Dataset for Generative AI on Edge Devices (BeDGED). IEEE Data Descr. 2025, 2. [Google Scholar] [CrossRef]
Menon, S.; Addula, S.R.; Parkavi, A.; Subbalakshmi, C.; Dhandayuthapani, V.B.; Pokkuluri, K.S.; Soni, A. Streamlining Task Planning Systems for Improved Enactment in Contemporary Computing Surroundings. SN Comput. Sci. 2024, 5, 993. [Google Scholar] [CrossRef]
Dettmers, T.; Lewis, M.; Shleifer, S.; Zettlemoyer, L. 8-bit Optimizers via Block-wise Quantization. arXiv 2022. [Google Scholar] [CrossRef]
Lin, J.; Tang, J.; Tang, H.; Yang, S.; Chen, W.-M.; Wang, W.-C.; Xiao, G.; Dang, X.; Gan, C.; Han, S. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv 2024. [Google Scholar] [CrossRef]
Cai, H.; Gan, C.; Wang, T.; Zhang, Z.; Han, S. Once-for-All: Train One Network and Specialize it for Efficient Deployment. arXiv 2020. [Google Scholar] [CrossRef]
Adnyana, K.; Schwung, A. Benchmarking and validation of prompting techniques for AI-assisted industrial PLC programming. Mach. Learn. Appl. 2026, 23, 100804. [Google Scholar] [CrossRef]
Al-Masri, E.; Subramanian, I.N. SOLAR: Illuminating LLM performance in API discovery and service ranking for edge AI and IoT. Internet Things 2025, 32, 101630. [Google Scholar] [CrossRef]
Alasmari, S.M.; Sakly, H.; Kraiem, N.; Algarni, A. Phishing detection in IoT: An integrated CNN-LSTM framework with explainable AI and LLM-enhanced analysis. Discov. Internet Things 2025, 5, 102. [Google Scholar] [CrossRef]
Alsuwaiket, M.A. ZeroDay-LLM: A Large Language Model Framework for Zero-Day Threat Detection in Cybersecurity. Information 2025, 16, 939. [Google Scholar] [CrossRef]
Ataman, D.; Birch, A.; Habash, N.; Federico, M.; Koehn, P.; Cho, K.; Ataman, D.; Birch, A.; Habash, N.; Federico, M.; et al. Machine Translation in the Era of Large Language Models: A Survey of Historical and Emerging Problems. Information 2025, 16, 723. [Google Scholar] [CrossRef]
Baller, S.P.; Jindal, A.; Chadha, M.; Gerndt, M. DeepEdgeBench: Benchmarking Deep Neural Networks on Edge Devices. arXiv 2021. [Google Scholar] [CrossRef]
Bao, R.; Xue, N.; Sun, Y.; Chen, Z. Dynamic Quality-Latency Aware Routing for LLM Inference in Wireless Edge-Device Networks. In Proceedings of the 2025 IEEE/CIC International Conference on Communications in China (ICCC Workshops), Shanghai, China, 10–13 August 2025; pp. 1–6. [Google Scholar] [CrossRef]
Benmeziane, H.; Maghraoui, K.E. Are Large Language Models Good Neural Architecture Generators for Edge? In Proceedings of the 2024 IEEE International Conference on Edge Computing and Communications (EDGE), Shenzhen, China, 7–13 July 2024; pp. 162–165. [Google Scholar] [CrossRef]
Çetinkaya, A. A Systems Approach to Validating Large Language Model Information Extraction: The Learnability Framework Applied to Historical Legal Texts. Information 2025, 16, 960. [Google Scholar] [CrossRef]
Chakraborty, S.; Chowdhury, R.; Shuvo, S.R.; Chatterjee, R.; Roy, S. A scalable framework for evaluating multiple language models through cross-domain generation and hallucination detection. Sci. Rep. 2025, 15, 29981. [Google Scholar] [CrossRef] [PubMed]
Chen, P.; Li, Z. Length Instruction Fine-Tuning with Chain-of-Thought (LIFT-COT): Enhancing Length Control and Reasoning in Edge-Deployed Large Language Models. Electronics 2025, 14, 1662. [Google Scholar] [CrossRef]
Han, S.; Mao, H.; Dally, W.J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv 2016. [Google Scholar] [CrossRef]
Hao, T.; Hwang, K.; Zhan, J.; Li, Y.; Cao, Y. Scenario-Based AI Benchmark Evaluation of Distributed Cloud/Edge Computing Systems. IEEE Trans. Comput. 2023, 72, 719–731. [Google Scholar] [CrossRef]
Jain, A.M.; Jain, A. Scaling LLM Inference Architectures: A Performance Analysis for Chatbot Applications. In Proceedings of the 2025 6th International Conference on Artificial Intelligence, Robotics and Control (AIRC), Savannah, GA, USA, 7–9 May 2025; pp. 8–16. [Google Scholar] [CrossRef]
Jebli, A.; Fourati, R.; Drira, F. Resource Management and Security Challenges for Deploying and Adapting Large Language Models in Fog Computing. In Proceedings of the 2025 IEEE 9th Forum on Research and Technologies for Society and Industry (RTSI), Tunis, Tunisia, 24–26 August 2025; pp. 174–179. [Google Scholar] [CrossRef]
Jin, X.; Katsis, C.; Sang, F.; Sun, J.; Kundu, A.; Kompella, R. Edge Security: Challenges and Issues. arXiv 2022. [Google Scholar] [CrossRef]
Kim, J.-H.; Choi, Y.-S. Lightweight Pre-Trained Korean Language Model Based on Knowledge Distillation and Low-Rank Factorization. Entropy 2025, 27, 379. [Google Scholar] [CrossRef]
Kohli, P.; Jayanth, R.; Gupta, N.; Fan, H.; Prasanna, V. Performance-Energy Characterization of ML Inference on Heterogeneous Edge AI Platforms. In Proceedings of the 2025 IEEE High Performance Extreme Computing Conference (HPEC), Virtual, 15–19 September 2025; pp. 1–7. [Google Scholar] [CrossRef]
Krishnamurthy, B.; Shiva, S.G.; Krishnamurthy, B.; Shiva, S.G. Scalable Resource Provisioning Framework for Fog Computing Using LLM-Guided Q-Learning Approach. Algorithms 2025, 18, 230. [Google Scholar] [CrossRef]
Lee, H.; Kang, P. Performance Evaluation of Modern GPU Accelerator-Based Edge Systems: A Holistic Approach. IEEE Internet Things J. 2025, 12, 51716–51729. [Google Scholar] [CrossRef]
Li, R.; Fu, D.; Shi, C.; Huang, Z.; Lu, G. Efficient LLMs Training and Inference: An Introduction. IEEE Access 2025, 13, 32944–32970. [Google Scholar] [CrossRef]
Liu, S.; Chen, L.; Yan, J.; Jiang, Y.; Wang, X.; Li, X.; Yang, Q. When DeepSeek-R1 meets financial applications: Benchmarking, opportunities, and limitations. Front. Inf. Technol. Electron. Eng. 2025, 26, 1862–1870. [Google Scholar] [CrossRef]
Liu, Y.; Sun, Q.; Kapadia, D.R.; Liu, Y.; Sun, Q.; Kapadia, D.R. Integrating Large Language Models into Robotic Autonomy: A Review of Motion, Voice, and Training Pipelines. AI 2025, 6, 158. [Google Scholar] [CrossRef]
Liu, Z.; Guo, P.; Wang, P. LLMSwitchBench: A New Edge-Cloud Routing Benchmark for Smart Home LLM Inference. IEEE Access 2025. [Google Scholar] [CrossRef]
Minott, D.; Siddiqui, S.; Haddad, R.J. Benchmarking Edge AI Platforms: Performance Analysis of NVIDIA Jetson and Raspberry Pi 5 with Coral TPU. In Proceedings of the SoutheastCon 2025, Concord, NC, USA, 22–30 March 2025; pp. 1384–1389. [Google Scholar] [CrossRef]
Nezami, Z.; Hafeez, M.; Djemame, K.; Zaidi, S.A.R. Generative AI on the Edge: Architecture and Performance Evaluation. In Proceedings of the ICC 2025-IEEE International Conference on Communications, Montreal, QC, Canada, 8–12 June 2025; pp. 4595–4602. [Google Scholar] [CrossRef]
Pozi, M.S.M.; Sato, Y. A data-augmented model routing framework for efficient LLM deployment in edge–cloud environments. J. Supercomput. 2025, 81, 1573. [Google Scholar] [CrossRef]
Ranjan, N.; Savakis, A. Mix-QViT: Mixed-Precision Vision Transformer Quantization Driven by Layer Importance and Quantization Sensitivity. arXiv 2025. [Google Scholar] [CrossRef]
Ray, P.P.; Pradhan, M.P. P2PLLMEdge: Peer-to-Peer Framework for Localized Large Language Models using CPU only Resource-Constrained Edge. EAI Endorsed Trans. AI Robot. 2025, 4, 1–27. [Google Scholar] [CrossRef]
Ren, J.; Wang, C.; Zhong, Y.; Cao, S.; Zheng, D.; Cao, X. Towards Expert Models Deployment Cost Optimization in Edge Computing Networks. In Proceedings of the ICC 2025-IEEE International Conference on Communications, Montreal, QC, Canada, 8–12 June 2025; pp. 838–843. [Google Scholar] [CrossRef]
Saha, H.; Bhattacharya, D.; Dutta, S.; Bera, A.; Basuray, S.; Changdar, S.; Banerjee, S.; Turdiev, J. Transforming Healthcare with State-of-the-Art Medical-LLMs: A Comprehensive Evaluation of Current Advances Using Benchmarking Framework. Comput. Mater. Contin. 2025, 86, 1–56. [Google Scholar] [CrossRef]
Shaikh, T.A.; Rasool, T.; Veningston, K.; Yaseen, S.M. The role of large language models in agriculture: Harvesting the future with LLM intelligence. Prog. Artif. Intell. 2025, 14, 117–164. [Google Scholar] [CrossRef]
Sun, M.; Hou, J.; Qiu, K.; Wang, K.; Chu, X.; Zhang, Z. LLM-based Task Offloading and Resource Allocation in Satellite Edge Computing Networks. IEEE Trans. Veh. Technol. 2025, 74, 1–6. [Google Scholar] [CrossRef]
Sun, Y.; Liu, J.; Xiong, G.; Song, Q.; Liu, J.; Wang, G.; Wang, R. Towards Trusted 6G Mobile Edge Computing: A Secure Batch Large Language Models Deployment Framework. IEEE Trans. Mob. Comput. 2025, 25, 3328–3346. [Google Scholar] [CrossRef]
Thapa, S.; Shiwakoti, S.; Shah, S.B.; Adhikari, S.; Veeramani, H.; Nasim, M.; Naseem, U. Large language models (LLM) in computational social science: Prospects, current state, and challenges. Soc. Netw. Anal. Min. 2025, 15, 4. [Google Scholar] [CrossRef]
Wang, J.; Wu, Y.; Xiong, X.; Zhang, Y.; Lyu, Z.; Ghoneim, A.; Zhao, H. FedLMA: A Federated Learning Framework Integrating LLM-Based Multi-Agent Reasoning With Knowledge Distillation. IEEE Trans. Consum. Electron. 2025, 71, 11339–11349. [Google Scholar] [CrossRef]
Wang, R.; Gao, Z.; Zhang, L.; Yue, S.; Gao, Z. Empowering large language models to edge intelligence: A survey of edge efficient LLMs and techniques. Comput. Sci. Rev. 2025, 57, 100755. [Google Scholar] [CrossRef]
Yang, H.; Liu, H.; Yuan, X.; Wu, K.; Ni, W.; Zhang, J.A.; Liu, R.P. Synergizing Intelligence and Privacy: A Review of Integrating Internet of Things, Large Language Models, and Federated Learning in Advanced Networked Systems. Appl. Sci. 2025, 15, 6587. [Google Scholar] [CrossRef]
Yin, C.; Mao, Y.; He, Z.; Chen, M.; He, X.; Rong, Y.; Yin, C.; Mao, Y.; He, Z.; Chen, M.; et al. Edge Computing-Enabled Secure Forecasting Nationwide Industry PM2.5 with LLM in the Heterogeneous Network. Electronics 2024, 13, 2581. [Google Scholar] [CrossRef]
Yuan, X.; Li, H.; Yuan, X.; Li, H. LLM-Driven Offloading Decisions for Edge Object Detection in Smart City Deployments. Smart Cities 2025, 8, 169. [Google Scholar] [CrossRef]
Zhang, W.; Wei, Q.; Chen, H.; Wang, Y. Automated detection of contradictions in 5G network specifications using reinforcement learning-trained small LLM. EURASIP J. Wirel. Commun. Netw. 2025, 2025, 85. [Google Scholar] [CrossRef]
Zhang, X.; Nie, J.; Huang, Y.; Xie, G.; Xiong, Z.; Liu, J.; Niyato, D.; Shen, X. Beyond the Cloud: Edge Inference for Generative Large Language Models in Wireless Networks. IEEE Trans. Wirel. Commun. 2025, 24, 643–658. [Google Scholar] [CrossRef]
Zhang, Z.; Fu, Y.; Li, J.; Ma, S.L.; Sham, C.-W. Enhancing Synthesis Efficiency in HLS through LLM-Based Automated Code Correction. In Proceedings of the 2025 IEEE 14th Global Conference on Consumer Electronics (GCCE), Osaka, Japan, 23–26 September 2025; pp. 382–384. [Google Scholar] [CrossRef]
Zhu, F.; Huang, F.; Yu, Y.; Liu, G.; Huang, T.; Zhu, F.; Huang, F.; Yu, Y.; Liu, G.; Huang, T. Task Offloading with LLM-Enhanced Multi-Agent Reinforcement Learning in UAV-Assisted Edge Computing. Sensors 2024, 25, 175. [Google Scholar] [CrossRef]
Surianarayanan, C.; Lawrence, J.J.; Chelliah, P.R.; Prakash, E.; Hewage, C.; Surianarayanan, C.; Lawrence, J.J.; Chelliah, P.R.; Prakash, E.; Hewage, C. A Survey on Optimization Techniques for Edge Artificial Intelligence (AI). Sensors 2023, 23, 1279. [Google Scholar] [CrossRef]
Stadnicka, D.; Sęp, J.; Amadio, R.; Mazzei, D.; Tyrovolas, M.; Stylios, C.; Carreras-Coch, A.; Merino, J.A.; Żabiński, T.; Navarro, J.; et al. Industrial Needs in the Fields of Artificial Intelligence, Internet of Things and Edge Computing. Sensors 2022, 22, 4501. [Google Scholar] [CrossRef]
Rupanetti, D.; Kaabouch, N.; Rupanetti, D.; Kaabouch, N. Combining Edge Computing-Assisted Internet of Things Security with Artificial Intelligence: Applications, Challenges, and Opportunities. Appl. Sci. 2024, 14, 7104. [Google Scholar] [CrossRef]
Liang, Y.; Bi, X.; Shen, R.; He, Z.; Wang, Y.; Xu, J.; Zhang, Y.; Fan, X.; Liang, Y.; Bi, X.; et al. When Mathematical Methods Meet Artificial Intelligence and Mobile Edge Computing. Mathematics 2025, 13, 1779. [Google Scholar] [CrossRef]
Lawal, O.; Shajihan, S.A.V.; Mechitov, K.; Billie, F.; Spencer, J.; Lawal, O.; Shajihan, S.A.V.; Mechitov, K.; Billie, F.; Spencer, J. Edge Integration of Artificial Intelligence into Wireless Smart Sensor Platforms for Railroad Bridge Impact Detection. Sensors 2024, 24, 5633. [Google Scholar] [CrossRef]
Gültekin, Ö.; Cinar, E.; Özkan, K.; Yazıcı, A.; Gültekin, Ö.; Cinar, E.; Özkan, K.; Yazıcı, A. Real-Time Fault Detection and Condition Monitoring for Industrial Autonomous Transfer Vehicles Utilizing Edge Artificial Intelligence. Sensors 2022, 22, 3208. [Google Scholar] [CrossRef]
Chen, Y.; Wu, C.; Sui, R.; Zhang, J.; Chen, Y.; Wu, C.; Sui, R.; Zhang, J. Feasibility Study of Edge Computing Empowered by Artificial Intelligence—A Quantitative Analysis Based on Large Models. Big Data Cogn. Comput. 2024, 8, 94. [Google Scholar] [CrossRef]
Bourechak, A.; Zedadra, O.; Kouahla, M.N.; Guerrieri, A.; Seridi, H.; Fortino, G.; Bourechak, A.; Zedadra, O.; Kouahla, M.N.; Guerrieri, A.; et al. At the Confluence of Artificial Intelligence and Edge Computing in IoT-Based Applications: A Review and New Perspectives. Sensors 2023, 23, 1639. [Google Scholar] [CrossRef]
Abdulkadhim, M.; Repas, S.R. SHEAB: A Novel Automated Benchmarking Framework for Edge AI. Technologies 2025, 13, 515. [Google Scholar] [CrossRef]

Figure 1. LEAF architecture.

Figure 2. The Edge AI implementation testbed for the LEAF.

Figure 3. Hardware speed comparison.

Figure 4. Edge nodes’ model accuracy “F1 score.”

Figure 5. LEAF assessment chart.

Table 2. Power envelope and energy consumption analysis.

Hardware Class	Device Node	Component TDP	Est. Total System Power (Load)	Inference Time (Avg)	Total Energy Cost (System)
Embedded	Raspberry Pi 4	~4 W (SoC)	~6 W	10.96 s	65.7 J
Embedded	Raspberry Pi 5	~7 W (SoC)	~9 W	2.20 s	19.8 J
Workstation	T400 Server	30 W (GPU)	~65 W	0.43 s	27.9 J
Circular	GTX 1050 Ti	75 W (GPU)	~110 W	0.29 s	31.9 J
Legacy	Jetson Nano	10 W (Module)	~12 W	8.04 s	96.4 J

Table 3. Collected data used in the LEAF calculations.

Device	Avg Time (t)	Avg F1 (f)	Est. Power (P)	Circular Eco (C) [Assigned]
Jetson Nano	8.04 s	0.758	10 W	8 (High Reuse)
RPi 4	10.96 s	0.764	5 W	9 (High Reuse)
RPi 5	2.20 s	0.745	8 W	4 (New HW)
Server (T400)	0.43 s	0.748	40 W	3 (New HW)
AI Server (1050 Ti)	0.29 s	0.759	100 W	9 (High Reuse)

Table 4. The normalized values of the LEAF’s chart.

Device	Circular Eco	Energy Eff.	Speed	Accuracy	Proc. Time
Jetson	0.8	0.00	0.01	0.68	0.27
RPi 4	0.9	0.41	0.00	1.00	0.00
RPi 5	0.4	0.99	0.11	0.00	0.82
T400	0.3	1.00	0.67	0.16	0.99
AI Server	0.9	0.81	1.00	0.74	1.00

