1. Introduction
Large language models (LLMs) and AI-powered chatbots have become powerful tools in software development, enabling automated code generation from natural-language descriptions [1]. These technologies use deep learning architectures trained on large repositories of source code to assist developers in writing, debugging, and optimizing software. The core ability of LLMs to understand programming requirements expressed in natural language and convert them into executable code has greatly reduced development time and lowered barriers to software creation [2]. AI-assisted code generation tools can automate routine programming tasks, offer code suggestions, and generate boilerplate code, boosting developer productivity across various fields, including web development, data science, and enterprise software [3].
The use of LLMs in code generation covers various areas of software engineering, such as code completion, test creation, bug fixing, and documentation. However, the trustworthiness of AI-generated code remains a significant concern because LLMs can produce code that is syntactically correct but functionally wrong or inconsistent with the specified requirements [4]. Beyond general functional errors, research has highlighted that LLMs frequently generate problematic outputs in specialized domains, ranging from security vulnerabilities [5] and SQL syntax inconsistencies [6] to poor quality in multilingual code commenting [7]. To tackle these issues, researchers have created standardized metrics and evaluation methods to assess code accuracy and performance. In traditional software engineering, code correctness is usually checked through automated testing frameworks that run unit tests, integration tests, and regression tests repeatedly without human input [8]. These automated methods help developers verify code functionality efficiently by running extensive test suites that compare expected results with actual results, measure code coverage, and identify edge cases. A widely used metric for evaluating LLM-generated code is Pass@k, which measures the probability that at least one of k generated code samples passes all the predefined test cases for a specific programming task [9].
Microcontroller and embedded systems development present unique challenges that distinguish them from general software development [10]. A key difference lies in the testing methods: while general software engineering allows for fully automated, repeatable code integrity testing cycles run entirely in software, embedded system development often involves physical hardware interactions that require human involvement for validation and final integration testing. Automated tests alone are not enough for developing microcontroller applications because the code interacts with physical components, sensors, actuators, and communication peripherals whose behavior depends on real-world conditions. Verification includes observing program output via serial or debug interfaces, assessing peripheral and sensor responses, simulating inputs, and ensuring correct timing and control during real-time operation. This hardware-dependent validation adds significant complexity, as developers must physically connect components, configure hardware, and interpret results that can vary due to environmental factors, component tolerances, and signal quality. Additionally, embedded systems often operate under strict constraints, including limited computational resources, real-time processing needs, and direct hardware register manipulation or precise timing control. The unique challenges of embedded development require that the generated code not only be syntactically correct but also deterministic, memory-efficient, and compatible with real-time operating system requirements.
Leaderboards and benchmarks for large language models offer a structured way to compare how well different models perform on the same tasks. They utilize standardized tests to ensure results are measured consistently, helping researchers track progress and identify strengths and weaknesses across models [11]. These evaluations are important because they highlight whether a model can reason, follow instructions, or produce reliable technical or domain-specific content. However, benchmarks also have limitations: they often oversimplify real-world situations, can reward models for memorizing patterns instead of understanding them, and may become outdated as new types of tasks emerge. Therefore, leaderboard rankings should be viewed as useful indicators rather than comprehensive measures of a model’s overall ability or practical quality.
Current evaluation frameworks do not fully cover the range of practical microcontroller development scenarios faced by developers in real-world projects. To fill this gap, our research systematically assesses and compares a wide variety of LLMs for microcontroller code generation across eight development scenarios, using expert-based evaluations and the Analytic Hierarchy Process (AHP) scoring method. By centering the evaluation on the ESP32 ecosystem, this study leverages a platform that transitions seamlessly from hobbyist prototyping to industrial IoT applications, providing a representative yet specific hardware context for the analysis. These scenarios represent common tasks in modern embedded systems development, such as environmental multi-output sensor reading with multiple displays, distance measurement with multiple displays, cloud-based Internet of Things (IoT) data platforms, cloud databases and data management systems, and IoT environmental monitoring stacks.
The assessment includes models across the performance spectrum, from advanced reasoning systems with extended thinking and general-purpose high-performance architectures to specialized regional variants, efficiency-optimized models, and emerging research implementations.
This comprehensive evaluation aims to give embedded systems developers empirical data on which LLMs are most effective for microcontroller code generation, helping guide tool selection and highlight areas for further model improvement.
The remainder of this paper is organized as follows: Section 2 reviews related work. Section 3 details the experimental setup, the definition of the evaluation scenarios, the LLMs, and the AHP used for scoring. Section 4 presents the experimental results and performance benchmarks of the models. The final sections provide a comprehensive discussion of the findings, offer concluding remarks, and outline directions for future research.
2. Related Work
Because LLMs are trained primarily on vast collections of web-scraped text, including extensive repositories of open-source software projects, they develop a rich semantic understanding of programming languages and syntactic structures. As a result, AI assistants and integrated development environments (IDEs) equipped with these models have demonstrated the ability to generate software that matches the quality of skilled human developers. While base models offer strong general-purpose coding skills, fine-tuning LLMs on domain-specific languages significantly improves their accuracy and usefulness compared to generic models. In the realm of hardware-related development, industry tools are beginning to emerge. For instance, a blog post by the Arduino team [12] compares a specialized “Arduino AI Assistant” to general-purpose models like ChatGPT, suggesting that tools integrated directly into the cloud environment with access to specific documentation might reduce library errors. However, such industrial comparisons often lack the thorough, systematic assessments needed to fully understand these tools’ limitations in complex engineering settings.
The landscape of LLM code generation evaluation has expanded significantly, with several specialized benchmarks addressing limitations of earlier frameworks. BigCodeBench [13] introduced rigorous evaluation of practical task automation, challenging LLMs to invoke multiple function calls from extensive library collections across diverse domains and achieving high branch coverage with comprehensive test cases. Evaluation of numerous LLMs revealed that even top-performing models substantially lag behind human performance, indicating considerable room for improvement. LiveCodeBench [14] addressed data contamination concerns by continuously collecting problems from competitive programming platforms (LeetCode, AtCoder, CodeForces), expanding evaluation beyond code generation to include self-repair, test output prediction, and code execution scenarios. Notably, the benchmark showed only moderate correlation with HumanEval+ performance [15], with significantly larger performance variations across models. For real-world software engineering tasks, SWE-bench [16] established a framework comprising thousands of authentic GitHub issues and pull requests from multiple Python repositories, requiring models to generate multi-file patches validated through execution-based testing. EvoCodeBench [17] introduced an evolving, repository-level benchmark aligned with real-world code and dependency distributions, releasing periodic updates to prevent data leakage. Evaluations showed that even state-of-the-art models achieve relatively low pass rates on these repository-level tasks.
Beyond generation, CodeEditorBench [18] specifically evaluates code editing capabilities, including debugging, translating, polishing, and requirement switching, emphasizing practical software development scenarios. Domain-specific benchmarks have also emerged, with SciCode [19] curating hundreds of subproblems from research-level scientific coding problems across numerous natural science subfields to evaluate numerical methods, system simulation, and scientific calculations, and McEval [20] providing the first massively multilingual evaluation across 40 programming languages to assess cross-linguistic capabilities, revealing that code LLMs perform better in object-oriented, high-resource languages while struggling with functional and procedural, low-resource languages. However, comprehensive benchmarks specifically targeting microcontroller code generation across various real-world scenarios are still scarce in the literature.
While software code generation is well-studied, applying LLMs to embedded systems—which require understanding both software logic and hardware constraints—presents unique challenges. Englhardt et al. [21] investigated how LLMs perform on embedded programming tasks using an automated testbench with 450 trials, revealing that although these models may fail to produce immediately working code, they generate valuable reasoning about embedded design tasks and provide specific debugging suggestions beneficial to both novice and expert developers. Quan et al. [22] introduced SensorBench, a comprehensive benchmark for evaluating LLMs on coding-based sensor processing tasks with diverse real-world sensor datasets. Their findings showed that while LLMs are quite proficient in simpler tasks, they face fundamental challenges in compositional tasks involving parameter selection when compared to domain experts, with self-verification prompting strategies outperforming other methods in 48% of the evaluated tasks. Beyond assessing functional correctness, researchers have also attempted to measure the “creativity” of LLM-generated hardware solutions. The CreativEval framework [23] evaluated models on fluency, flexibility, originality, and elaboration in Register Transfer Level (RTL) code generation. Their results suggest that models like GPT-3.5 can demonstrate measurable creativity in hardware design, surpassing other models in producing novel solutions.
Recent non-peer-reviewed pre-prints have introduced more autonomous frameworks for embedded development, indicating possible future directions for the field. The EmbedGenius platform [24] offers a fully automated approach for general-purpose embedded IoT systems. By using a component-aware library resolution method, this work claims to outperform human-in-the-loop benchmarks in task completion rates, aiming to address the complex hardware dependencies that often delay manual development [5]. Similarly, the EmbedAgent pre-print [25] presents a benchmark called “Embedbench” to simulate professional roles such as System Architect and Integrator. This work highlights a performance gap in cross-platform migration; while LLMs performed fairly well migrating code to MicroPython, they struggled with more complex environments like ESP-IDF. The authors suggest that general-purpose models often fail to retrieve relevant pre-trained domain knowledge effectively without Retrieval-Augmented Generation (RAG) strategies.
In summary, although current research emphasizes the increasing abilities of LLMs in general software tasks, there is a significant lack of systematic, peer-reviewed benchmarking specifically for microcontroller-related development. This area is often hampered by complex hardware-software dependencies, which can cause library hallucinations and implementation failures.
3. Materials and Methods
The experimental setup (Figure 1) used an ESP32 development board (Espressif Systems, Shanghai, China) as the main processing unit because of its built-in dual-core processor and Wi-Fi connectivity, both crucial for simultaneous sensing and network tasks. Environmental parameters, specifically barometric pressure (hPa), ambient temperature (°C), relative humidity (%), and calculated altitude (m), were measured with a BME280 sensor (Bosch Sensortec GmbH, Reutlingen, Germany). Non-contact distance measurements were made using an HC-SR04 ultrasonic sensor (ElecFreaks, Shenzhen, China). For local feedback and diagnostics, two display technologies were integrated: a pixel-based OLED display (Waveshare Electronics, Shenzhen, China) and a character LCD based on the HD44780 controller (Hitachi Ltd., Tokyo, Japan). The OLED displayed sensor readings and system status at high resolution, while the character LCD provided simple, low-power textual output. Standard Dupont jumper wires (generic OEM, Shenzhen, China) were used to interconnect all components during prototyping, as no custom PCB was available at this stage. During development and calibration, real-time data logging and debugging were performed by streaming data through the serial port. All testing and deployment required the ESP32 to connect to a standard local Wi-Fi network (with an average speed of 20 Mbit/s) with a predefined SSID. The network functionality had two parts: first, the ESP32 was set up as a local HTTP web server to stream real-time sensor data within HTML pages, making the data immediately available to any device on the network. Second, the system was configured to connect to multiple IoT platforms and databases, including ThingSpeak (https://thingspeak.com), Thinger.io (https://thinger.io), Beebotte (https://beebotte.com), Firebase Realtime Database (https://firebase.google.com), and InfluxDB (https://www.influxdata.com), all platforms accessed on 10 February 2026. Authentication data, such as API keys, tokens, and credentials, were prepared and set up before testing the LLM scenarios. The same credentials were used throughout all testing phases and all IoT platforms to keep data posting consistent and comparable.
To systematically assess the performance of LLMs in generating embedded IoT code for ESP32 microcontrollers, eight distinct scenarios were created with increasing complexity. These scenarios evaluate LLM capabilities across three main aspects: local multi-output implementations, cloud platform integrations, and complete end-to-end data visualization workflows.
1. Basic Multi-Output Sensor Reading: LLMs generate code for periodic BME280 readings with simultaneous output to the Serial Monitor, an OLED display, and a local web server.
2. Distance Measurement with Multi-Display: LLMs integrate the HC-SR04 sensor, measure obstacle distance, and show results on the Serial Monitor, an I2C LCD display, and a locally hosted webpage while managing Wi-Fi credentials correctly.
3. ThingSpeak Cloud Integration: LLMs create ESP32 code that reads BME280 values, displays them locally (OLED and Serial Monitor), and successfully uploads data to a ThingSpeak channel using provided API keys.
4. Thinger.io Platform Setup: LLMs prepare complete code and user instructions for connecting an ESP32 with a BME280 sensor to Thinger.io, covering device provisioning and dashboard configuration.
5. Beebotte Data Publishing: LLMs build ESP32 applications that send BME280 data to Beebotte via API or MQTT and include step-by-step platform setup instructions for users.
6. Firebase Realtime Database: LLMs implement secure, authenticated logging of BME280 data to Firebase, including timestamp synchronization and email/password authentication.
7. InfluxDB Time-Series Storage: LLMs generate valid code for periodically sending BME280 measurements to InfluxDB Cloud using the correct credentials, measurement names, and tags, along with instructions for dashboard creation.
8. ESP32–Grafana End-to-End Visualization: LLMs produce a full workflow for ingesting BME280 data, storing it in InfluxDB, and visualizing it in Grafana, including dashboard steps and valid JSON import configurations.
We focus on zero-shot prompting to establish a baseline of the LLMs’ inherent “initial” reasoning and domain-specific knowledge regarding the ESP32 ecosystem. While techniques such as RAG and few-shot prompting can enhance performance, they introduce external variables. The quality of the retrieved documentation or the bias of the provided examples can mask the model’s actual generative capabilities. By using zero-shot prompts, we simulate the most common and accessible developer workflow and evaluate the model’s ability to interpret complex hardware requirements without extensive prompt engineering or external data infrastructure. This approach ensures that the resulting benchmarks reflect the model’s fundamental understanding of embedded systems rather than the effectiveness of a specific retrieval system.
The code generation performance was tested on LMArena.ai, which offers direct chat access to a wide range of state-of-the-art LLMs through a unified interface (Table 1). This setup ensures a consistent and comparable testing environment. In all scenarios, the LLMs were used in direct chat mode, simulating a typical developer interaction where context and instructions are given conversationally.
We employed the AHP to systematically evaluate LLMs for generating microcontroller code, enabling us to structure evaluation criteria hierarchically. Through pairwise comparisons (Table 2), we determined the relative importance of the main categories—Functional, Instructions, Output, and Creativity—using Saaty’s 1–9 scale [26], where 1 indicates equal importance, 3 moderate, 5 strong, 7 very strong, and 9 extreme importance of one criterion over another. We derived criterion weights from the resulting pairwise comparison matrix, shown below, with Functional assigned the highest priority (2 over Instructions, 5 over Output, 9 over Creativity). The matrix yielded normalized priority weights of 0.544 for Functional, 0.292 for Instructions, 0.107 for Output, and 0.057 for Creativity, calculated via the principal eigenvector.
Matrix consistency was verified with λ_max = 4.006, Consistency Index (CI) = (λ_max − n)/(n − 1) = 0.002, and Consistency Ratio (CR) = CI/RI = 0.002 (RI = 0.90 for n = 4), confirming acceptable consistency (CR < 0.10).
Detailed sub-criteria (Table 3) for each category included: Functional (complete code provision, correct libraries, error-free compilation); Instructions (step-by-step setup guides with hardware connections and code comments explaining structure); Output (serial monitor clarity, display information on OLED/LCD, reliable Web/IoT/database connections); and Creativity (additional features and novel graphical outputs beyond requirements).
Two independent evaluators—a senior microcontroller expert (25 years of experience) and an advanced practitioner (15 years of experience)—conducted assessments separately before reaching consensus on final scores, ensuring reliability and minimizing individual bias in LLM performance rankings.
4. Results
This analysis uses a mixed-methods approach that combines qualitative and quantitative techniques to evaluate LLMs across various IoT-related tasks. We analyze performance metrics such as functionality, instruction clarity, output quality, and creativity of LLM-generated code and documentation in detail. Our findings highlight the strengths and challenges LLMs face across different platforms and integration complexities, demonstrating their growing ability to automate embedded IoT development and cloud visualization processes.
4.1. Environmental and Motion Tracking Sensors
The AHP results for both environmental sensing and distance measurement show several key similarities and performance factors for deployment in sensor integration applications. In both cases, models from the gemini-2.5 family showed optimal performance, with gemini-2.5-flash winning the environmental sensor reading task (Final score ≈ 1.00) and tying with gemini-2.5-pro in the distance measurement task (Final score ≈ 0.9857). Most candidates achieved the highest Functional score (≈0.5435). Still, a few models consistently failed this essential baseline (8 models in sensor reading, 4 models in distance measurement), highlighting a sharp functional threshold needed for entry into the high-competency group. For models that met the functional requirements, the main factor influencing ranking was compliance as captured by the Instructions score, which created the most significant gap between the top performers (≈0.292) and mid-tier models (≈0.2435). This consistent result across both tasks underscores that while LLM functional ability is generally accessible, success in high-stakes sensor and motion processing relies not just on output quality but on the model’s capacity to follow complex operational constraints and output formatting rules strictly (Figure 2).
In the environmental sensor task, 19 of 27 models produced compilable code on the first attempt, correctly configuring the I2C bus and including explanatory comments for exceptional cases. The remaining 8 models failed at this initial hurdle because they targeted incompatible OLED driver libraries or referenced obscure third-party display libraries, even though all models selected appropriate BME280 and Wi-Fi libraries. For the distance measurement task, simplifying the display subsystem led to 23 models compiling successfully on the first attempt, with the remaining models again hindered by unrelated or missing third-party display libraries rather than by sensor or connectivity misconfiguration.
In the environmental sensor scenario, models were required to deploy an ESP32-based web server that exposed measurements, but they received no guidance on layout, styling, or visualization strategy. As a result, the graphical interfaces emerged entirely from the models’ own design choices, with variations in layout structure, data grouping, and presentation style (Figure 3). Because the visual form was neither specified nor constrained by the task definition, these outcomes were assessed under the creativity dimension.
During the implementation of both the environmental sensor and distance measurement tasks, an additional qualitative distinction emerged in how models designed the web server component. Without explicit prompting, five models adopted an asynchronous web server implementation on ESP32, typically relying on specialized asynchronous libraries to support concurrent client connections and non-blocking request handling. This design reduced main-loop blocking, improved responsiveness under load, and aligned more closely with contemporary embedded networking practices. Models that chose this approach were awarded higher creativity scores in the evaluation. In contrast, most other models generated simpler polling-based solutions that updated sensor readings via periodic page refreshes (for example, using a 5 s meta refresh in the HTML header). These are straightforward to implement but generate unnecessary network traffic and limit interaction flexibility. The contrast between these two patterns—static, refresh-driven pages versus modern, event-driven asynchronous servers—illustrates that, under identical prompts and functional requirements, models can differ not only in correctness but also in architectural sophistication, with direct implications for robustness and scalability in real deployments.
4.2. IoT Platforms
The analysis of the three IoT platform integration scenarios (ThingSpeak, Thinger.io, and Beebotte) revealed significant shifts in top-tier performance compared to typical sensor tasks, with the Claude architecture and the gpt-5-high model consistently ranking highest (Figure 4). Notably, Beebotte integration led to a three-way tie for the top model (Final score ≈ 1.00) among claude-opus-4-1, claude-sonnet-4-5, and gpt-5-high. claude-sonnet-4-5 also secured the top position in both the ThingSpeak (Score ≈ 1.00) and Thinger.io (Score ≈ 0.9713) scenarios, demonstrating its robustness across different integration protocols. Analyzing the functional threshold showed varying difficulty levels across platforms: only 16 models achieved the high Functional score (≈0.5435) needed for Thinger.io integration, while 20 models reached this benchmark for ThingSpeak. This indicates differing barriers to entry, likely due to platform complexity or API documentation. Regardless of the platform, the key factor that distinguished high-performing models was the Instructions score (Max ≈ 0.292), which remained the most important performance metric after meeting the core functional requirement. This supports the conclusion that, for complex, platform-specific integration tasks, a large language model’s ability to carefully follow predefined operational constraints is the ultimate factor in determining suitability.
For the set of IoT service scenarios, using the correct libraries for each service was essential. When testing with ThingSpeak, 20 models compiled successfully initially, while 7 failed, mainly due to missing the ThingSpeak communication library. In the fourth scenario, 23 models correctly selected Thinger.io service libraries, compared with 3 that chose incorrectly. However, only 18 source codes compiled on the first try, versus 11 that did not, indicating that choosing the right library did not ensure that the LLMs would use its functions properly. In the Beebotte test, 24 models had suitable libraries, with 5 opting for MQTT communication and the remaining 19 choosing API access. Three models selected incorrect libraries, and overall, 8 models failed to compile initially. The communication method was determined by the LLM, as it was not specified in the input prompt. For example, glm-4.5 chose MQTT, while its successor, glm-4.6, chose API access.
4.3. Data Storage in Cloud Databases
The two cloud database integration scenarios—Firebase Realtime Database and InfluxDB Time-Series Storage—were evaluated to assess LLM performance in generating secure, authenticated ESP32 code for logging BME280 sensor data and providing clear instructions for database setup and dashboard creation. Firebase is a Google cloud platform that enables microcontrollers to store and retrieve data in real time over the Internet, while InfluxDB is an open-source database optimized for time-series data, making it ideal for storing large volumes of sensor data from IoT devices.
The evaluation revealed a consistently high barrier to entry for these complex, security-sensitive cloud integration tasks, significantly limiting the pool of viable models (Figure 5). The Functional score threshold, which assessed the security and correctness of the generated authenticated code, proved the most restrictive metric: only nine models achieved the highest Functional score (≈0.5435) for Firebase, and only 11 did so for InfluxDB. Models falling below the minimum required functional quality (≈0.4529) failed to produce secure, platform-specific authentication logic—the primary challenge for current LLMs. Among functionally capable models, the Instructions score (≈0.292) differentiated those that could generate correct code from those that could also provide clear, precise guidance on database setup and dashboard creation. Ultimately, claude-sonnet-4-5 emerged as the sole optimal performer in both scenarios, achieving a maximum Final score of approximately 1.00.
In scenario 6 (Firebase), nine models generated executable code on the first attempt, and eight produced fully functional applications. Seven additional models achieved functionality after minor adjustments to formal errors or with supplementary prompts, yielding 15 functional applications in total. In scenario 7 (InfluxDB), 11 compilations succeeded initially, but only six resulted in functional applications, with five failing in various areas. Five additional models achieved functionality after minor code adjustments and prompting, resulting in 11 functional applications overall. Notably, errors in both scenarios stemmed from flawed source code functionality, improper communication function calls, and inadequate cloud configuration rules—not from inappropriate library selection, as all libraries were suitable.
4.4. Cloud Database and Visualization Integration
The most complex scenario required a complete integration chain: compilable, functionally correct ESP32 code; successful data transmission to InfluxDB; a valid query usable in the InfluxDB/Grafana dashboard; and a syntactically correct Grafana JSON dashboard description. This setup extends the earlier InfluxDB-only task by adding the full visualization layer, and the AHP results show that this added complexity sharply amplifies differences between models. Only three models produced a completely error-free solution that satisfied all four requirements in a single pass: claude-opus-4-1-20250805, claude-sonnet-4-5-20250929-thinking-32k, and gpt-5-high. All three achieved the maximum Functional score (≈0.5435), the maximum Instructions score (≈0.2923), and the highest Output score tier (≈0.1069), indicating correct ESP32–InfluxDB communication, a valid dashboard query, and a valid Grafana JSON file. Their final scores clustered at or near 1.0, marking them as the clear top performers (Figure 6).
One additional model, command-a-03-2025, initially failed to compile due to a minor, easily correctable code issue. After this fix, it successfully sent data to InfluxDB, produced a correct dashboard query, and generated a valid Grafana JSON configuration. In the AHP scoring, this model exhibits a slightly reduced Functional score (≈0.4529) and a mid-range Instructions score (≈0.1948), but a solid Output score (≈0.0713) and a non-zero Creativity contribution (≈0.0287), yielding a respectable Final score of ≈0.748. This demonstrates that modest weaknesses in code quality and documentation can coexist with a fully usable end-to-end solution.
Across the full set of 27 models, the Output score quantifies how many of the three integration endpoints—InfluxDB ingestion, a correct dashboard query, and a valid Grafana JSON dashboard—were achieved. Models that failed to produce any working component have Output scores at or near 0.0, whereas Output scores around 0.077–0.095 (for models such as gemini-2.5-pro, grok-4-0709, claude-opus-4-20250514, and glm-4.5) indicate that two of the three elements worked. Only the top trio, together with command-a-03-2025 after its initial fix, achieved full or near-full correctness across all integration steps (Output scores of ≈0.071–0.107).
Creativity scores reflect the qualitative design sophistication of the Grafana dashboards. Models with zero Creativity scores (gpt-5-high, grok-4-0709, mistral-medium-2508, and several mid- to low-ranked systems) produced minimal or purely utilitarian dashboards, whereas models with non-zero Creativity scores (such as claude-opus-4-1-20250805, claude-sonnet-4-5-20250929-thinking-32k, gemini-2.5-pro, claude-opus-4-20250514, glm-4.5, qwen3-coder-480b-a35b-instruct, and command-a-03-2025) proposed richer layouts with multiple panels, clearer structure, or additional informative elements, thereby improving interpretability without sacrificing correctness (Figure 7).
4.5. The Overall Ranking of Models
The comparative evaluation of 27 large language models revealed substantial performance heterogeneity, with normalized total scores ranging from 0.539 to 0.984 (
Table 4). At the top of the ranking, claude-sonnet-4-5-20250929-thinking-32k achieves the highest overall AHP score (0.984), followed by claude-opus-4-1-20250805 (0.961) and gemini-2.5-pro (0.918), with several other high-capacity general-purpose or advanced reasoning models (claude-opus-4-20250514, glm-4.5, gpt-5-high) also clustered above 0.89. These models include both explicitly reasoning-oriented systems (e.g., claude-sonnet “thinking”) and general-purpose flagships, suggesting that strong chain-of-thought capabilities and broad instruction-following competence translate into superior functional code quality, better-structured setup instructions, and more mature output handling for ESP32 scenarios. The presence of o4-mini-2025-04-16, deepseek-v3.2-exp-thinking, and gpt-4.1-2025-04-14 in the upper segment (scores around 0.83) further indicates that compact or experimental “thinking” variants can reach near-flagship performance when judged on this practically oriented, multi-criteria coding task.
The middle of the distribution is populated by a mixture of high-performance general models, mixture-of-experts (MoE) systems, and fast/efficient variants, with total scores roughly between 0.74 and 0.88. qwen3-coder-480b-a35b-instruct (0.872) and minimax-m1 (0.820) illustrate that MoE architectures can achieve competitive overall quality, but in this evaluation, they do not systematically outperform monolithic flagship models. Fast- and efficiency-oriented systems such as gemini-2.5-flash (0.879), mistral-medium-2508 (0.851), glm-4.6 (0.804), longcat-flash-chat (0.779), grok-4-fast (0.756), and kimi-k2-web (0.744) generally occupy the upper-middle band, indicating that designs optimized for latency and throughput can still deliver functionally adequate and reasonably well-documented ESP32 solutions, albeit with modest deficits relative to the very best systems in at least one of the heavily weighted criteria (typically completeness of code or richness of instructional material).
5. Discussion
This study shows that a zero-shot prompt strategy can produce fully compilable ESP32/IoT applications for a substantial subset of LLMs, but performance degrades sharply as application complexity increases. While 19–23 models successfully built simple sensor applications, only 5–8 produced usable outputs for complex scenarios involving Grafana dashboards and cloud databases. This performance gap aligns with recent benchmarking on embedded systems, such as the EmbedBench study [
25], which reported significantly lower pass rates for complex ESP-IDF migration tasks (29.4%) than for simpler MicroPython implementations (73.8%). Similarly, the EmbedGenius framework [
24] found that while LLMs can handle modular IoT tasks, their efficacy drops when coordinating multiple hardware-software interfaces without RAG or compiler feedback loops. Our results confirm that for intricate integrations, zero-shot prompting alone—despite its labor-saving appeal—often lacks the necessary context to resolve multi-dependency constraints.
The most frequent cause of compilation failure was the hallucination of non-existent libraries or incorrect API usage. This reflects a broader trend identified in software engineering research: a study [
27] analyzing over 2 million code samples found that hallucinated package references are pervasive, with open-source models exhibiting hallucination rates up to four times higher (21.7%) than closed-source counterparts such as GPT-4 (5.2%). In our specific context, these hallucinations were most prevalent in the integration of IoT services (e.g., ThingSpeak, Thinger.io, and Beebotte), display drivers, and environmental sensor configurations. Rather than using official or currently supported libraries, the LLMs frequently referenced individual, deprecated repositories that are no longer available or functional. These technical shortcomings are directly quantified in our AHP framework under the Functional score; a lower score in this category typically indicates that the model failed to identify official library standards, resulting in non-compilable code.
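One inexpensive guard against this failure mode is to cross-check the `#include` directives of generated code against the set of libraries actually installed in the local toolchain before attempting compilation. The sketch below is illustrative only: the allowlist contents and the flagged header name are hypothetical, and a practical tool would populate the allowlist from the toolchain's library directories.

```python
import re

# Hypothetical allowlist of headers available in the local Arduino/ESP32 toolchain.
KNOWN_HEADERS = {"WiFi.h", "Wire.h", "HTTPClient.h", "Adafruit_BME280.h"}

def find_unknown_includes(source: str) -> list[str]:
    """Return #include'd headers not found in the local allowlist,
    flagging possible hallucinated library references before compilation."""
    headers = re.findall(r'#include\s*[<"]([^>"]+)[>"]', source)
    return sorted(set(headers) - KNOWN_HEADERS)

# A generated sketch referencing a hypothetical, non-existent library:
sketch = '#include <WiFi.h>\n#include "ThingSpeakUltra.h"\n'
print(find_unknown_includes(sketch))  # ['ThingSpeakUltra.h']
```

Such a pre-compilation check cannot catch incorrect API usage within a real library, but it filters out the most common class of failure observed here at negligible cost.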
Our observation that models often generate valid logic but invalid syntax for specific sensor libraries is consistent with findings that general-purpose LLMs struggle to use domain-specific pre-trained knowledge effectively in embedded contexts. Furthermore, the prevalence of invalid JSON configurations for Grafana dashboards suggests that while models excel at imperative code (C++/Python), they are less robust at generating strictly schema-compliant configuration files without few-shot examples.
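Because a dashboard export is just a JSON document, the most common failures noted above (malformed syntax, missing top-level keys) can be caught before import with the standard library alone. The required-key set below is an illustrative subset, not the full Grafana dashboard schema.

```python
import json

# Illustrative subset of top-level keys; not the full Grafana dashboard schema.
REQUIRED_KEYS = {"title", "panels"}

def check_dashboard(text: str) -> list[str]:
    """Return a list of problems found in a dashboard JSON export."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} (line {exc.lineno})"]
    missing = REQUIRED_KEYS - doc.keys()
    return [f"missing key: {k}" for k in sorted(missing)]

good = '{"title": "ESP32 sensors", "panels": []}'
bad = '{"title": "ESP32 sensors", "panels": [}'  # malformed array
print(check_dashboard(good))  # []
print(check_dashboard(bad))
```

In our evaluation workflow, a check of this kind would separate models that fail at basic JSON well-formedness from those whose dashboards are syntactically valid but semantically incomplete.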
LLMs generally produced instructions that were practical for implementation, but they varied in how they distributed this guidance between the conversational response and the source code. Three recurring strategies were observed: (i) a text-centric approach, in which models provide detailed, step-by-step setup guidance in the conversational response while leaving the source code with minimal or incomplete explanatory comments; (ii) a code-centric approach, in which models give only a brief conversational summary but embed substantial, context-aware usage guidance directly in the code through headers and inline comments; and (iii) a combination approach, in which models provide both a structured setup narrative and complementary code comments that clarify configuration choices and reduce ambiguity during deployment. The combination strategy was the most common outcome, suggesting that many models implicitly recognize that embedded workflows require both an external procedural description (to support initial assembly and configuration) and in-code annotation (to support later modification, troubleshooting, and reuse). However, performance across these two documentation channels was not always consistent: in some scenarios, models tended to deliver stronger step-by-step setup explanations than code comments, while in other scenarios the code comments were relatively richer than the accompanying conversational setup narrative. This pattern indicates that conversational setup guidance and in-code guidance capture different aspects of instructional quality and should be evaluated as complementary, rather than assuming that strength in one channel implies strength in the other.
Our findings suggest that while LLMs can serve as effective “teachable agents” for novices by generating initial boilerplate and wiring instructions, reliance on them for complex system integration requires caution. Literature in computing education, including the HypoCompass study [
28], indicates that involving novices in debugging LLM-generated code can be educationally beneficial, but only if the initial generation is close enough to correct to be “repairable”. The high variability in documentation quality we observed—where some models provided excellent step-by-step wiring guides while others offered none—mirrors the inconsistency noted in studies evaluating LLMs for introductory programming assistance. For less-experienced users, the “human-in-the-loop” strategy remains essential: our results show that even when the initial code fails, the accompanying comments and hardware instructions often supply enough scaffolding for a user to finalize the project, provided they can interpret compiler errors.
6. Conclusions
This comprehensive evaluation of 27 large language models across eight embedded systems scenarios reveals substantial performance heterogeneity in microcontroller code generation capabilities. The study demonstrates that while a majority of contemporary LLMs can successfully generate basic sensor integration code for ESP32 platforms, performance degrades precipitously as task complexity increases—from 19–23 models succeeding in simple applications down to only 3–5 achieving full functionality in complex cloud database and visualization integrations. The claude-sonnet-4-5-20250929-thinking-32k model emerged as the highest-performing system overall (0.984), followed closely by claude-opus-4-1-20250805 (0.961) and gemini-2.5-pro (0.918), indicating that both advanced reasoning architectures and general-purpose flagship models demonstrate superior capabilities for embedded systems code generation. Critically, functional correctness constituted the primary barrier to entry, with instruction-following quality serving as the principal differentiator among models that achieved baseline functional competence.
6.1. Theoretical and Practical Implications
The findings have significant implications for both embedded systems development practice and LLM evaluation methodology. For practitioners, the results indicate that zero-shot prompting strategies, while labor-efficient for simple modular tasks, prove insufficient for complex multi-dependency integrations requiring coordinated hardware-software interfaces. The high prevalence of hallucinated libraries and incorrect API usage—consistent with broader software engineering research showing hallucination rates up to 21.7% in open-source models—underscores the necessity of human-in-the-loop verification for production embedded systems. Furthermore, the observation that models employ heterogeneous documentation strategies (text-centric, code-centric, or combination approaches) suggests that conversational setup guidance and in-code annotation capture complementary aspects of instructional quality and should be evaluated independently. For novice developers, LLMs can serve as effective teachable agents by generating initial boilerplate and wiring instructions, provided the generated code is sufficiently close to functional to enable educational debugging.
6.2. Limitations
Several constraints delimit the scope and generalizability of these findings. The evaluation focused exclusively on ESP32 microcontrollers with specific sensor configurations, and results may differ across alternative embedded platforms, architectures, or peripheral ecosystems. The zero-shot prompting approach, while representative of typical developer interactions, does not capture the potential performance improvements achievable through RAG, few-shot examples, or iterative compiler feedback loops—techniques shown to enhance LLM efficacy in related embedded systems research. Furthermore, the qualitative assessment of creativity and instruction following was conducted by a single domain expert and a single practitioner, which introduces the possibility of subjective bias; however, this was mitigated by assigning these criteria lower priority weights within the AHP framework compared to objective functional metrics.
6.3. Future Research Directions
Several promising avenues emerge for extending this research. Comparative studies incorporating RAG frameworks, multi-turn conversational refinement, and automated compiler feedback could quantify the performance gains achievable beyond zero-shot prompting and identify optimal human-AI collaboration patterns for embedded development. Expansion of the evaluation framework to encompass additional microcontroller families (STM32, Arduino, Raspberry Pi Pico), real-time operating systems (FreeRTOS, Zephyr), and communication protocols (LoRaWAN, BLE, Zigbee) would enhance generalizability across the embedded systems landscape. Investigation of fine-tuning approaches using domain-specific embedded systems codebases could establish whether specialized training substantially improves hardware-aware code generation compared to general-purpose models. Finally, longitudinal studies examining how rapidly evolving LLM architectures close the performance gap on complex integration tasks would inform adoption timelines and investment strategies for organizations considering AI-assisted embedded development workflows.
This study establishes that contemporary LLMs have reached a threshold of practical utility for embedded IoT development, capable of substantially accelerating routine programming tasks while requiring domain expertise for verification, integration, and troubleshooting of complex multi-component systems.