Article

Automated UAV Object Detector Design Using Large Language Model-Guided Architecture Search

Fei Kong, Xiaohan Shan, Yanwei Hu and Jianmin Li
Qiyuan Lab, Beijing 100095, China
* Author to whom correspondence should be addressed.
Drones 2025, 9(11), 803; https://doi.org/10.3390/drones9110803
Submission received: 22 September 2025 / Revised: 5 November 2025 / Accepted: 11 November 2025 / Published: 18 November 2025

Highlights

What are the main findings?
  • PhaseNAS introduces a phase-aware, dynamic NAS framework driven by large language models for UAV perception tasks.
  • It achieves state-of-the-art accuracy and efficiency, generating superior YOLOv8 variants for object detection with lower computational cost.
What is the implication of the main findings?
  • PhaseNAS enables automated, resource-adaptive model design, making real-time, high-performance perception feasible for UAVs and edge devices.
  • Its structured template and adaptive LLM resource allocation strategies can be extended to broader AI applications beyond aerial object detection.

Abstract

Neural Architecture Search (NAS) is critical for developing efficient and robust perception models for UAV and drone-based applications, where real-time small object detection and computational constraints are major challenges. Existing NAS methods, including recent approaches leveraging large language models (LLMs), often suffer from static resource allocation and ambiguous architecture generation, limiting their effectiveness in dynamic aerial scenarios. In this study, we propose PhaseNAS, an adaptive LLM-driven NAS framework designed for drone perception tasks. PhaseNAS dynamically adjusts LLM capacity across exploration and refinement phases, and introduces a structured template language to bridge natural language prompts with executable model code. We also develop a zero-shot detection score for rapid screening of candidate YOLO-based architectures without full training. Experiments on NAS-Bench-Macro, CIFAR-10/100, COCO, and VisDrone2019 demonstrate that PhaseNAS consistently discovers superior architectures, reducing search time by up to 86% while improving accuracy and resource efficiency. On UAV detection benchmarks, PhaseNAS yields YOLOv8 variants with higher mAP and reduced computational cost, highlighting its suitability for real-time onboard deployment. These results indicate that PhaseNAS offers a practical and generalizable solution for autonomous AI model design in next-generation UAV systems.

1. Introduction

Object detection is a core component for autonomous drones and unmanned aerial vehicles (UAVs), enabling real-time perception in applications such as aerial surveillance, traffic monitoring, precision agriculture, and disaster response [1,2]. Compared with ground-view detection, drone-borne perception faces strict, measurable constraints from both the data side and the platform side. First, aerial targets are typically small and densely distributed. For example, on VisDrone2019 (a widely used UAV benchmark), a large fraction of instances fall into small object regimes (e.g., short side within 8–32 pixels), which makes mAP particularly sensitive to multi-scale feature quality and occlusion [3,4,5]. In particular, recent UAV-tailored YOLO variants explicitly introduce P2 small object heads, lightweight neck/backbones (e.g., Ghost/EMA), and deformable detection heads to boost VisDrone-style small object performance under edge constraints [6,7]. Second, onboard computing power and memory are limited [8]. In real deployments, embedded SoCs (e.g., Xavier NX/Orin Nano class) often operate within 5–15 W power envelopes, with tight model+buffer memory budgets (e.g., <512 MB) and latency constraints targeting 30–50 FPS (i.e., ≤20–33 ms per frame) for stable UAV control loops [9,10]. Empirically, lightweight UAV detectors report model sizes of only a few MB with 50–160+ FPS on commodity GPUs by adding P2 heads and pruning large-object heads, while adopting mobile backbones [6,7]. Third, scenes exhibit strong illumination changes, viewpoint/altitude variations, and frequent occlusions, which jointly challenge detector robustness and data association. Such occlusion/out-of-view phenomena are pervasive in UAV videos and often require re-identification or recovery logic in downstream trackers [11]. These quantified constraints motivate us to automate detector design toward high accuracy on small objects with strict latency/efficiency guarantees, rather than solely optimizing generic image classifiers.
Deep learning, particularly convolutional neural networks (CNNs), has powered significant advances in UAV object detection, inspiring a wave of domain-specific models such as YOLOv8, EdgeYOLO, and SL-YOLO [9,10,12]. However, manually designing and tuning optimal detection architectures for diverse and dynamic UAV scenarios remains labor-intensive and suboptimal, especially given the fast-evolving requirements for efficiency, accuracy, and real-time performance.
Neural Architecture Search (NAS) automates the discovery of high-performing model architectures, offering a promising path toward tailored, deployment-ready detection networks [13,14]. Yet, traditional NAS methods—whether based on evolutionary algorithms [15], reinforcement learning [16], or gradient-based optimization [17]—are often prohibitively expensive for UAV platforms, and are rarely adapted for small object detection or edge scenarios.
Recent works start to employ LLMs to (i) propose architecture blueprints from natural-language constraints, (ii) synthesize runnable code, and (iii) iterate with scoring feedback [18,19,20,21,22]. However, most studies still focus on classification spaces and use a fixed LLM capacity throughout search, which can be inefficient during early exploration and insufficiently expressive in late refinement. To address these critical gaps, we propose PhaseNAS, a dynamic, LLM-driven NAS framework designed to meet the stringent requirements of UAV-based object detection. PhaseNAS introduces phase-aware LLM resource allocation and template-based architecture generation, enabling efficient and robust search for high-quality detection networks under strict resource constraints. In UAV small object settings, PhaseNAS’s detection-aware ZS score prioritizes high-resolution heads and lightweight necks under tight latency–size budgets, enabling rapid pre-training screening. This preference is consistent with recent UAV detectors that add P2 heads, remove P5, and employ lightweight/mobile backbones for small objects [6,7], and with UAV-oriented NAS attempts balancing mAP, FPS, and model size [23]. Beyond model design, real UAV deployments must consider UTM/airspace safety assessment workflows, communication reliability, and cyber–physical threats [24]. Task-level coordination with ground assets and low-altitude economy constraints further shape perception requirements and real-time budgets [25]. At swarm scale, communication, occlusion, and synchronization amplify robustness demands on onboard detection [26].
While our NAS scoring and evaluation methodology is rooted in classification-based research, we extend and adapt it for multi-scale object detection in UAV scenarios. For completeness and to demonstrate generality, we also evaluate lightweight classification models (<1 M parameters) as benchmarks for extreme on-device constraints. Nevertheless, the core focus of this work remains advancing fully automated, high-performance object detection for UAVs. Table 1 summarizes representative LLM-NAS methods. Compared with them, our PhaseNAS introduces phase-aware resource scaling (small LLM for breadth-first exploration; large LLM for fine-grained refinement), a structured template-to-code interface to reduce invalid generations, and a detection-aware zero-shot score to screen YOLO-style candidates rapidly under UAV constraints.
The main contributions of this work are as follows.
1. Phase-Aware Dynamic LLM Allocation: We introduce a resource-adaptive NAS strategy that adjusts LLM capacity according to search phase, balancing broad exploration with precise refinement for UAV detection tasks.
2. Structured Detection Architecture Templates: We design a parameterized template system that reliably maps LLM prompts to executable YOLO-style detection architectures, reducing errors and code failures in the search loop.
3. Zero-Shot Detection Scoring: We propose a training-free detection scoring mechanism for rapid, accurate evaluation of candidate detectors on UAV datasets such as VisDrone2019, accelerating search without full model retraining.
4. Comprehensive UAV Evaluation: We demonstrate the effectiveness of PhaseNAS on both classic classification benchmarks (with lightweight models) and challenging UAV detection tasks, showing state-of-the-art mAP and efficiency on datasets such as COCO and VisDrone2019.
By combining LLM-driven reasoning, adaptive resource scheduling, and robust code synthesis, PhaseNAS advances the automation and intelligence of perception model design for next-generation UAV and drone systems, making real-time small object detection more effective and accessible for real-world deployment.

2. Materials and Methods

The PhaseNAS framework is fundamentally motivated by the need for automated neural architecture search tailored to UAV-based object detection, where accurate small object localization, real-time inference, and stringent platform constraints are essential. Our methodology builds upon zero-shot neural architecture scoring techniques originally developed for classification tasks, extending and adapting them to the unique multi-scale and resource-aware requirements of object detection in drone scenarios. While PhaseNAS is designed as a general NAS framework supporting both object detection and classification, its primary focus is on discovering efficient, high-performance detection architectures for UAV deployment. Lightweight classification tasks are included as auxiliary benchmarks to assess the NAS efficiency and generality, particularly for extremely compact models suitable for edge or onboard scenarios.
The fundamental insight driving PhaseNAS is that neural architecture search exhibits phase-dependent computational requirements. Early exploration benefits from rapid, broad sampling across the search space—a task that smaller language models can handle efficiently. Later refinement requires sophisticated reasoning about architectural trade-offs and subtle optimizations—demanding the advanced capabilities that only larger models possess.
This observation leads to our core design principle: dynamically match computational resources to search phase complexity. Rather than using a fixed LLM configuration throughout the search process, PhaseNAS adaptively transitions between appropriately-sized models based on real-time assessment of search progress.
As illustrated in Figure 1, the algorithm’s core innovation lies in its dynamic model scaling strategy, where the exploration phase utilizes a smaller, cost-efficient language model for broad and rapid architectural discovery, significantly reducing search cost. Once promising candidate architectures emerge, the refinement phase transitions to a larger, more capable language model, which further optimizes and diversifies these candidates with advanced reasoning and creativity. This two-stage design achieves an effective balance between search efficiency and the quality and novelty of the final architectures, governed by real-time score thresholds that control phase transitions and termination.
Additionally, to support more advanced tasks beyond standard classification, PhaseNAS incorporates object detection through the YOLOv8 framework, which plays a critical role in enabling real-time perception and situational awareness in autonomous intelligent systems. This extension highlights the flexibility of our approach in adapting the search process to meet the stringent requirements of detection tasks, such as low latency, constrained FLOPs, and compact model size.

2.1. Dynamic Search Process

The search process operates within a constrained space S defined by three fundamental design principles:
1. Modular Component Selection: All architectural components must be selected from predefined functional groups G_k (e.g., convolutional layers, residual blocks).
2. Dimensional Compatibility: Adjacent modules must preserve dimensional consistency through strict channel matching, ensuring seamless integration.
3. Hardware-Aware Constraints: The computational complexity of the generated architectures must remain within predefined limits, ensuring practical applicability across diverse deployment scenarios.
The methodology progresses through two distinct phases, each designed to maximize efficiency and performance by leveraging the strengths of different LLM capacities.

2.1.1. Exploration Phase

In the exploration phase, a smaller and cost-efficient language model is employed to efficiently generate diverse architectural variants. The model receives natural language prompts encoding dimensional constraints, complexity boundaries, and task-specific requirements. The generated candidates are then evaluated by a scoring function and validated against resource constraints. Valid candidates are added to a quality-ordered pool, maintaining the top K architectures.

2.1.2. Refinement Phase

Once any candidate in the pool achieves the transition threshold, the system transitions to the refinement phase. In this phase, a larger and more capable language model is used to further optimize and diversify the high-potential architectures based on performance feedback. The refinement process focuses on improving underperforming components while preserving the overall structure. The search terminates when the stopping threshold is reached, ensuring efficient progress through the search space.

2.2. Search Space Definition

The design of an effective NAS framework fundamentally depends on a well-defined search space for different tasks. In PhaseNAS, the search space is systematically tailored for both image classification and object detection to maximize efficiency, compatibility, and innovation.
Search Space for Classification Tasks: For classification, the search space S_cls is made explicit and highly structured to facilitate efficient search and reduce semantic ambiguity. The core constraints are as follows:
1. Available Building Blocks: Each network is constructed from a predefined set of residual and convolutional blocks with various kernel sizes and activation functions. The search is restricted to combinations of these standardized modules, ensuring compatibility and reproducibility across all candidate architectures.
2. Channel Compatibility: The output channel of any block must match the input channel of the subsequent block to ensure seamless tensor propagation.
3. Input/Output Constraints: The model input must be a three-channel image, and the output must be suitable for the classification head.
4. Resource Constraints: The parameter count and network FLOPs are constrained within practical deployment ranges to ensure efficiency.
This explicit, template-based design ensures that LLMs can reliably generate, interpret, and modify architectures with minimal risk of invalid structures or compilation errors.
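To make these checks concrete, the following minimal Python sketch validates a candidate against the constraints above; the helper name and the (name, in_channels, out_channels, params) summary format are illustrative assumptions, not the released implementation.
# Minimal sketch (not the authors' code): validate a candidate classification
# architecture against the search-space constraints listed above. Each block is
# summarized as a (name, in_channels, out_channels, param_count) tuple.

ALLOWED_BLOCKS = {
    "SuperResK3K3", "SuperResK5K5", "SuperResK7K7",
    "SuperResK1K3K1", "SuperResK1K5K1", "SuperResK1K7K1",
    "SuperConvK1BNRELU", "SuperConvK3BNRELU",
    "SuperConvK5BNRELU", "SuperConvK7BNRELU",
}

def is_valid_candidate(blocks, max_params=1_000_000):
    if not blocks or blocks[0][1] != 3:              # input must be a 3-channel image
        return False
    if any(name not in ALLOWED_BLOCKS for name, *_ in blocks):   # modular component selection
        return False
    if any(prev[2] != nxt[1] for prev, nxt in zip(blocks, blocks[1:])):  # channel compatibility
        return False
    return sum(b[3] for b in blocks) <= max_params   # resource constraint (<1 M parameters)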
Search Space for Object Detection Tasks: We build on Ultralytics YOLO, a fast, accurate, and easy-to-use family of models supporting detection, tracking, segmentation, classification, and pose estimation. For object detection, the search space S_det is designed to extend and enhance the YOLOv8 family of models while adhering to strict resource constraints. The main characteristics are as follows:
1. YOLOv8 as Baseline: The search is anchored on the official YOLOv8n and YOLOv8s backbones, allowing modifications and reordering of backbone, neck, and head blocks, while constraining overall parameters and FLOPs to remain similar to the base models.
2. Block-Level Modifications: Candidate architectures are generated by reconfiguring, replacing, or inserting blocks within the backbone and neck, provided input/output channels and tensor shapes remain compatible.
3. Innovative Block Extensions:
  • YOLOv8+: Models in this variant are generated by reorganizing existing YOLOv8 blocks and tuning their parameters, without introducing new block types.
  • YOLOv8*: In addition to the modifications allowed in YOLOv8+, this variant explicitly introduces two novel blocks:
    - SCDown [30]: A spatial–channel decoupled downsampling module, designed to enhance multi-scale feature extraction and improve information flow in the early backbone.
    - PSA [30]: A partial self-attention block, which applies efficient self-attention to part of the feature channels to boost target localization and feature selectivity.
    These custom blocks are only available in the YOLOv8* search space, enabling the framework to explore more expressive and powerful architectures beyond the original YOLOv8 design. These novel modules are particularly beneficial for UAV scenarios, where the ability to preserve fine-grained features and enhance small object localization under limited computational budgets is critical.
4. Multi-Scale and Detection Head Compatibility: All searched architectures must support multi-scale feature outputs and remain compatible with the YOLO detection heads.
5. Resource Constraints: Parameter count and FLOPs are kept within the original YOLOv8n/s budgets to ensure real-time performance and fair comparison.
Summary: The detection search space, while based on YOLOv8, is extended in YOLOv8* by the inclusion of two advanced modules (SCDown and PSA), providing greater architectural diversity and enabling higher mAP under resource constraints. As summarized in Table 2, we define a nested search space: YOLOv8+ permits block-level reconfiguration within YOLOv8, while YOLOv8* further unlocks SCDown/PSA toggles under identical n/s budgets. The classification search space is strictly limited to ten modular blocks with clear parameters and depth constraints.
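For illustration, the snippet below sketches how a YOLOv8*-style candidate could be expressed in the Ultralytics model-configuration structure (each backbone entry is [from, repeats, module, args]); the layer ordering, repeat counts, and channel widths are assumptions for exposition, not an architecture actually returned by the search.
# Illustrative YOLOv8*-style backbone fragment in the Ultralytics config structure.
# SCDown and PSA are the optional blocks unlocked only in the YOLOv8* search space.
candidate_backbone = [
    [-1, 1, "Conv",   [64, 3, 2]],    # P1/2
    [-1, 1, "Conv",   [128, 3, 2]],   # P2/4
    [-1, 3, "C2f",    [128, True]],
    [-1, 1, "SCDown", [256, 3, 2]],   # P3/8 downsampling via SCDown (YOLOv8*-only)
    [-1, 6, "C2f",    [256, True]],
    [-1, 1, "Conv",   [512, 3, 2]],   # P4/16
    [-1, 6, "C2f",    [512, True]],
    [-1, 1, "Conv",   [1024, 3, 2]],  # P5/32
    [-1, 3, "C2f",    [1024, True]],
    [-1, 1, "SPPF",   [1024, 5]],
    [-1, 1, "PSA",    [1024]],        # attention block after SPPF (YOLOv8*-only)
]
A candidate of this kind is retained only if its parameter count and FLOPs remain within the corresponding YOLOv8n/s budget.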

2.3. Algorithm Overview

Algorithm 1 outlines the full PhaseNAS search procedure, dynamically alternating between LLM-driven exploration and efficient refinement based on real-time score thresholds. To clarify cost management, PhaseNAS minimizes large-LLM usage by allocating small-LLM calls to early exploration and invoking the large LLM only after the transition criterion is met, which concentrates high-capacity reasoning in the final <50% of iterations in our runs.
Algorithm 1: PhaseNAS Architecture Search
Require: Initial architecture S_init, small LLM M_E (for Exploration), large LLM M_R (for Refinement), thresholds γ_trans, γ_stop, pool size K
Ensure: Optimal architecture S*
 1: Initialize: A_c ← {S_init}, φ ← Exploration Phase
 2: while max_{S∈A_c} E_z(S) < γ_stop do
 3:     if φ = Exploration then
 4:         Generate S_new ← M_E(exploration_prompt)  ▹ Small LLM for cost-efficient exploration
 5:         if V(S_new) and E_z(S_new) > min_{S∈A_c} E_z(S) then
 6:             Add S_new to A_c
 7:             if |A_c| > K then
 8:                 Remove architecture with lowest E_z from A_c
 9:             end if
10:         end if
11:         if ∃ S ∈ A_c such that E_z(S) ≥ γ_trans then
12:             φ ← Refinement Phase
13:             Set base architecture S_base ← argmax_{S∈A_c} E_z(S)
14:         end if
15:     else  ▹ Refinement Phase
16:         Generate refined architecture S_new ← M_R(S_base, feedback)  ▹ Large LLM for high-quality refinement
17:         if V(S_new) and E_z(S_new) > E_z(S_base) then
18:             Update base architecture: S_base ← S_new
19:             Add S_new to A_c
20:         end if
21:     end if
22: end while
23: return S* ← argmax_{S∈A_c} E_z(S)
In our implementation, we define M_E with multiple variants, including Qwen2.5-7B, Qwen2.5-14B, and Qwen2.5-32B (7–32 billion parameters) for cost-efficient exploration, and M_R as Qwen2.5-72B, Llama-3.3-70B, and Claude-3.5-Sonnet (70+ billion parameters) for high-quality refinement. The choice of model sizes can be adapted based on available computational resources, with the key principle being |M_E| < |M_R| to ensure progressive capability scaling.
We adopt task-appropriate phase thresholds. For classification, we set γ_trans to approximately 80% of γ_stop (i.e., γ_trans ≈ 0.8 × γ_stop), where γ_stop is a relatively high, fixed-target threshold for the NAS score. For object detection, γ_trans is progress-based: we transition when the candidate score shows no improvement for 3–5 consecutive iterations (patience = 3–5), while γ_stop remains a high, fixed expected value. In sensitivity checks (±5 percentage points around the 0.8 × γ_stop rule for classification; patience ∈ {3, 5} for detection; and high γ_stop equivalents spanning P93–P98), the top-5 ranking stays stable (Spearman ρ > 0.9); these thresholds primarily trade off exploration length and transition timing rather than altering the best candidate.
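The control flow of Algorithm 1, including the threshold- and patience-based transitions described above, can be summarized by the following Python sketch; generate, is_valid, and score_fn are placeholders for the LLM call, the structural validator, and the zero-shot scorer, and the loop is a simplification rather than the exact implementation.
def phase_nas(s_init, small_llm, large_llm, generate, is_valid, score_fn,
              gamma_stop, gamma_trans=None, patience=None, pool_size=5, max_iters=200):
    # Classification rule from above: transition at roughly 80% of the stop threshold.
    if gamma_trans is None:
        gamma_trans = 0.8 * gamma_stop
    pool = {s_init: score_fn(s_init)}          # quality-ordered candidate pool A_c
    phase, s_base = "exploration", None
    stall, prev_best = 0, max(pool.values())

    for _ in range(max_iters):
        if max(pool.values()) >= gamma_stop:   # stopping threshold reached
            break
        if phase == "exploration":
            cand = generate(small_llm, base=None, history=pool)   # small LLM: broad sampling
            if is_valid(cand):
                s = score_fn(cand)
                if s > min(pool.values()):
                    pool[cand] = s
                    if len(pool) > pool_size:
                        pool.pop(min(pool, key=pool.get))
            best = max(pool.values())
            stall = 0 if best > prev_best else stall + 1
            prev_best = best
            # Transition on score (classification) or on no-improvement patience (detection).
            if best >= gamma_trans or (patience is not None and stall >= patience):
                phase = "refinement"
                s_base = max(pool, key=pool.get)
        else:
            cand = generate(large_llm, base=s_base, history=pool)  # large LLM: refinement
            if is_valid(cand):
                s = score_fn(cand)
                if s > pool[s_base]:
                    pool[cand] = s
                    s_base = cand
    return max(pool, key=pool.get)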

2.4. LLM-Compatible Architecture Representation

A core challenge in LLM-based NAS is bridging natural-language descriptions and executable implementations [22,31]. PhaseNAS uses a parameterized, block-level template language that explicitly encodes kernel sizes, channels, strides, and residual patterns. For example, a convolutional block is denoted ConvK3BNRELU(3, 8, 1, 1) and a residual block ResK3K3(16, 32, 2, 1). This explicit interface markedly reduces ambiguity and decoding failures, increases compile success rates, and integrates cleanly with modular detectors like YOLOv8. We carefully design the LLM prompts to guide the architecture search process. The exact prompt templates and example user inputs are provided in Appendix A.
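As an example of how such template strings can be consumed programmatically, the short sketch below parses a structure string into block records; the regular expression and field layout follow the examples in this section and Appendix A, but the helper itself is illustrative rather than part of the released code.
import re

BLOCK_PATTERN = re.compile(r"([A-Za-z0-9]+)\(([^)]*)\)")

def parse_structure(structure_str):
    """Turn 'SuperConvK3BNRELU(3,8,1,1)SuperResK3K3(8,16,1,8,1)' into block records."""
    blocks = []
    for name, args in BLOCK_PATTERN.findall(structure_str):
        values = [int(v) for v in args.replace(" ", "").split(",") if v]
        # By convention in the examples, the first two arguments are input/output channels.
        blocks.append({"name": name, "in_ch": values[0], "out_ch": values[1], "args": values})
    return blocks

parse_structure("SuperConvK3BNRELU(3,8,1,1)SuperResK3K3(8,16,1,8,1)")
# -> [{'name': 'SuperConvK3BNRELU', 'in_ch': 3, 'out_ch': 8, ...},
#     {'name': 'SuperResK3K3', 'in_ch': 8, 'out_ch': 16, ...}]
Parsed records of this form feed directly into validity checks such as the channel-compatibility test sketched in Section 2.2.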

2.5. Task Adaptation: From Classification to Object Detection

To guide the architecture search process effectively, PhaseNAS employs task-specific NAS score computation methods for evaluating candidate architectures. These scores are designed to quantify the architecture’s response to input perturbations and its normalization stability, offering fast proxies for generalization without full training. Below, we present the scoring procedures for classification and object detection tasks, including the adaptations made for the YOLO framework.

2.5.1. NAS Score for Classification

We follow the scoring function proposed in Zen-NAS [32], which effectively captures the model’s sensitivity to perturbation and the stability of Batch Normalization layers. As these calculations remain the same as in Zen-NAS, we omit the detailed formula derivations here for brevity.

2.5.2. NAS Score for Object Detection

Object detection demands multi-scale feature quality and larger inputs. We therefore extend Zen-style zero-shot scoring to measure the consistency of multi-scale features under input perturbations, aggregating responses across pyramid levels and combining them with BatchNorm stability. This yields a detector-specific proxy that correlates with downstream mAP yet avoids full training.
  • Input Perturbation
x_mix = x_1 + γ · x_2.
  • Feature Map Extraction
Given model M(·), let F(·) extract the internal feature maps. We compute:
f_1 = F(M(x_1)), f_mix = F(M(x_mix)),
with f^(l) ∈ ℝ^{b × c_l × h_l × w_l} as the feature map at scale l.
  • Multi-Scale Difference
The total feature perturbation response is
Δ = Σ_{l=1}^{L} ‖f_1^(l) − f_mix^(l)‖_1,
where L is the number of feature scales.
  • BatchNorm Scaling
B = Σ_m log(running_var_m + ε).
  • Final NAS Score
The score for detection is defined as
NAS_det = log(Δ + ε) + B.
Intuition: Δ quantifies how strongly the model's pyramid features (P3–P5) react to a mild input perturbation; larger Δ suggests higher feature discriminability but may also reflect sensitivity. B summarizes BN running variances across layers; larger B indicates well-spread activation scales and stable normalization. The combined score log(Δ + ε) + B thus prefers architectures that produce informative, multi-scale features without sacrificing normalization stability. On YOLOv8{n, s, m, l, x}, NAS_det strongly correlates with trained mAP (high Spearman ρ and Pearson r); see Appendix B (Table A1 and Table A2) for the exact vectors and correlation results.
  • Repetition and Aggregation
Repeat the process R times and compute
μ = (1/R) Σ_{i=1}^{R} s_i,  σ = √( (1/R) Σ_{i=1}^{R} (s_i − μ)² ),
where μ is used for ranking and selection.
Summary: The NAS scores for classification and detection provide a unified yet adaptable framework for evaluating candidate architectures efficiently. By extending the scoring mechanism to handle multi-scale outputs and detector-specific challenges, PhaseNAS ensures robust and scalable neural architecture search across a wide range of AI tasks.
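A minimal PyTorch sketch of the detection score above follows; it assumes feature_fn(model, x) returns the P3–P5 pyramid feature maps as a list of tensors and that BatchNorm running statistics have been populated (e.g., by a few forward passes in training mode), and it averages each layer's running variance over channels as one possible aggregation. It is a simplification for exposition, not the exact implementation.
import torch

@torch.no_grad()
def nas_det_score(model, feature_fn, resolution=640, batch=8, gamma=1e-2, eps=1e-5, repeats=4):
    scores = []
    for _ in range(repeats):
        x1 = torch.randn(batch, 3, resolution, resolution)
        x2 = torch.randn(batch, 3, resolution, resolution)
        x_mix = x1 + gamma * x2                                     # input perturbation
        f1, fmix = feature_fn(model, x1), feature_fn(model, x_mix)
        delta = sum((a - b).abs().sum() for a, b in zip(f1, fmix))  # multi-scale L1 difference
        bn = sum(torch.log(m.running_var.mean() + eps)              # BatchNorm stability term
                 for m in model.modules() if isinstance(m, torch.nn.BatchNorm2d))
        scores.append(torch.log(delta + eps) + bn)
    return torch.stack(scores).mean().item()                        # mu used for ranking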

3. Results

Experimental Design Rationale. Our experimental protocol is designed to verify that PhaseNAS not only reduces search cost via dynamic LLM scaling and structured representation, but also discovers architectures that outperform state-of-the-art NAS baselines in both efficiency and accuracy. The primary focus is on UAV-based object detection, especially small object, real-time scenarios as reflected by the VisDrone2019 and COCO datasets. To validate the generality and portability of the framework, we additionally include classification tasks on compact models (<1 M parameters) as auxiliary benchmarks for extreme resource-constrained deployment.
We first evaluate the effectiveness of PhaseNAS by directly comparing it with the recent LLM-based NAS baseline GUNIUS on the NAS-Bench-Macro dataset. This comparison demonstrates the performance advantage of our dynamic, phase-aware approach and validates the benefit of adaptive model switching. We then extend our study to both image classification and object detection domains, applying PhaseNAS to multiple search spaces and tasks to showcase its scalability and general applicability.

3.1. Comparison with LLM-Based NAS Baselines

To demonstrate the core effectiveness of PhaseNAS, we first benchmark it against the recent LLM-based NAS method GUNIUS on the NAS-Bench-Macro dataset. GUNIUS utilizes Qwen2.5-32B and Qwen2.5-72B as independent architecture generators, while our PhaseNAS dynamically switches between these two models in different search phases.  
Experimental Setup. All methods are evaluated on the NAS-Bench-Macro dataset under identical search protocols. The original GUNIUS paper adopts GPT-4 as the LLM backbone; for resource and accessibility reasons, we instead use Qwen2.5-32B and Qwen2.5-72B as LLM generators for both GUNIUS and our PhaseNAS. For GUNIUS, each model is tested independently as the architecture generator. For PhaseNAS, the phase-aware controller adaptively alternates between Qwen2.5-32B and Qwen2.5-72B based on candidate performance distribution. Each search run is repeated for 10 iterations to ensure convergence.
Results and Analysis. As shown in Figure 2, all three approaches are evaluated under consistent search protocols. For GUNIUS with Qwen2.5-32B, the best architecture achieves an accuracy of 92.73% with a final rank of 119. When using Qwen2.5-72B, GUNIUS yields a slightly better result: the top architecture reaches 92.75% accuracy with a rank of 110. In contrast, our PhaseNAS approach achieves a significantly higher accuracy of 93.11% and a much lower (better) rank of 3, indicating that it not only discovers more accurate architectures but also identifies solutions that are closer to the global optimum in the search space.
Cost–benefit note. Relative to single-LLM baselines, PhaseNAS concentrates large-LLM calls in late-stage refinement while using a small LLM for breadth-first exploration. Empirically this reduces exploration-time LLM usage and wall-clock search time while improving best-found accuracy and rank (Figure 2). This targeted allocation underpins the observed efficiency gains later reproduced in classification and detection experiments.

3.2. Generalization to Classification and Detection Tasks

Building upon the LLM-based NAS benchmark results, we further evaluate PhaseNAS on both image classification and object detection tasks to verify its generalizability and scalability. The classification experiments are conducted on CIFAR-10 and CIFAR-100, while object detection experiments are performed on the COCO validation set. To ensure fair and meaningful architecture comparisons, we impose the following consistent constraints during the search process:
  • Model Size: Limit total parameters for edge device deployment.
  • FLOPs: Bound computational cost for real-time efficiency.
  • Latency: Ensure practical inference speed on target hardware.
  • Depth: Prevent over-deep, hard-to-optimize models.
All methods are evaluated using the same NAS scoring functions, and all training pipelines are aligned across baselines for fairness.
Hardware/Software: All searches ran on 1× NVIDIA GeForce RTX 4090 24GB, AMD EPYC 7642 CPU, 512 GB RAM, Ubuntu 22.04, CUDA 12.1, Python 3.10.13, PyTorch 2.2, Ultralytics 8.3.63. YOLO training used imgsz = 640, batch = 32, AdamW with cosine schedule, 600 epochs (COCO)/200 epochs (VisDrone), default Ultralytics data augmentation unless noted. Zero-shot scoring ran on GPU with mixed precision. Reported search times are end-to-end wall-clock including compile/validation.
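For reproducibility, the recipe above corresponds to an Ultralytics training call along the following lines; the dataset YAML path and the choice of a searched variant config are placeholders.
from ultralytics import YOLO

model = YOLO("yolov8n.yaml")            # or a searched PhaseNAS variant config
model.train(
    data="VisDrone.yaml",               # placeholder dataset config (coco.yaml for COCO)
    imgsz=640,
    batch=32,
    epochs=200,                         # 600 for COCO, 200 for VisDrone2019
    optimizer="AdamW",
    cos_lr=True,                        # cosine learning-rate schedule
)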

3.3. Classification Results and Analysis

We compare the efficiency and performance of PhaseNAS and the widely used Zen-NAS on CIFAR-10 and CIFAR-100, ensuring a fair comparison by adopting the same Zen-score evaluation. Table 3 reports the results for different search settings.
As shown in Table 3, PhaseNAS achieves up to an 86% reduction in search time compared to Zen-NAS while delivering comparable or improved classification accuracy (especially on CIFAR-100) and comparable Zen-scores, indicating equal or better architectural potential at a fraction of the search cost.
These results demonstrate that PhaseNAS can significantly accelerate the NAS process while maintaining or improving final accuracy. This efficiency gain is attributed to the dynamic use of large and small language models and the phase-aware search control, as verified earlier in the LLM-based NAS macro-benchmark.

3.4. Object Detection Results and Analysis

To further validate the versatility of PhaseNAS, we apply it to object detection on COCO and VisDrone2019 using the YOLOv8 framework. Architecture search is conducted under the same FLOPs and parameter constraints as YOLOv8n and YOLOv8s, with LLM-guided prompts driving both exploration and refinement.
Table 4 shows that PhaseNAS generates detection models that not only outperform the hand-designed YOLOv8 baselines in mAP, but also reduce model size and computational complexity on both generic and drone-based datasets:
1. COCO Results:
  • YOLOv8n series: YOLOv8n+ and YOLOv8n* improve mAP to 38.0 and 39.1, respectively, both with fewer parameters and lower FLOPs than the original (37.3, 3.2 M, 8.7 G).
  • YOLOv8s series: YOLOv8s+ and YOLOv8s* achieve mAPs of 45.4 and 46.1, respectively, while reducing parameter count and FLOPs compared to the baseline (44.9, 11.2 M, 28.6 G).
2. VisDrone2019 Results:
  • YOLOv8n series: On the challenging VisDrone2019 benchmark, PhaseNAS improves mAP@50:95 from 18.5 (YOLOv8n) to 18.8 (YOLOv8n+) and 19.1 (YOLOv8n*), while maintaining lower parameter and FLOPs budgets.
  • YOLOv8s series: Similarly, the mAP@50:95 increases from 22.5 (YOLOv8s) to 22.7 (YOLOv8s+) and 23.4 (YOLOv8s*), further verifying the effectiveness of PhaseNAS in drone-based detection scenarios.
Efficiency perspective and UAV relevance. Under YOLOv8n/s budgets, PhaseNAS variants deliver higher mAP with fewer parameters and lower FLOPs, improving accuracy per GFLOP and per-MParam efficiency. This translates into better real-time viability on resource-constrained onboard computing platforms commonly used in UAVs, where latency and memory are limited. The consistent gains on VisDrone2019—a challenging drone-based detection benchmark with many small objects—indicate that PhaseNAS is well-suited for real-world aerial perception, enabling more accurate and efficient onboard deployment.

4. Discussion

The experimental results demonstrate that PhaseNAS achieves superior performance and efficiency compared to both traditional and recent LLM-based NAS baselines, particularly in the context of UAV-based object detection. This advantage can be attributed to three key factors: (1) Adaptive resource allocation: By dynamically transitioning between small and large LLMs in different search phases, PhaseNAS enables rapid, cost-efficient exploration and high-quality refinement, substantially lowering the total search cost while maintaining or improving final model quality. (2) Structured representation: The use of template-based architecture encoding reduces semantic ambiguity and decoding failures, ensuring that LLMs generate valid, executable architectures even for complex detection models—an issue that often hinders prior LLM-NAS approaches relying on free-form prompts. (3) Task-specific scoring: The extended NAS scoring function for object detection effectively captures multi-scale feature quality and normalization stability, providing a robust, training-free proxy for downstream detection accuracy, which is especially important for small object and real-time UAV scenarios.
Compared to previous works such as GUNIUS and Zen-NAS, which either utilize fixed LLMs or focus primarily on classification tasks, PhaseNAS introduces a phase-aware, adaptive resource strategy and extends NAS scoring to the detection domain, thus bridging the gap between general NAS methodologies and the stringent requirements of UAV perception. PhaseNAS improves end-to-end cost–benefit by (i) reducing search-time LLM and compute usage via phase-aware allocation and (ii) yielding detectors with higher accuracy per FLOP and per parameter for real-time UAV deployment. Our results on VisDrone2019 and COCO further confirm that PhaseNAS not only generalizes across tasks but also consistently delivers improved mAP and reduced computational cost—key criteria for real-world drone deployment. These findings have broader implications for the design of automated AI systems in resource-constrained, real-time environments. The dynamic LLM allocation and structured representation strategies may be beneficial for other edge AI applications beyond UAVs, such as robotics, IoT, and smart surveillance.

Limitations and Future Directions

Our present design intentionally centers on YOLO-style pipelines because they couple well with LLM-driven edits via explicit, parsable architecture descriptions and a mature deployment toolchain for real-time UAVs. This pragmatic choice narrows architectural breadth and complicates immediate validation on detectors that lack standardized, script-level descriptors (e.g., SSD, Faster R-CNN). Template-based, parsable descriptions are essential to keep generations valid and compilable under basic input/output compatibility and resource constraints; yet they also induce a practical “template boundary” that can bias exploration toward existing components. We promote diversity within this boundary (decoding choices, near-duplicate filtering, macro-topology changes) and will broaden applicability by (i) introducing concise, parsable descriptors for alternative detector families and (ii) adopting an agent-style workflow that manages proposals beyond the current template. Finally, we will generalize the template design with reusable checks and example-driven guidance to ease transfer across related domains.
In conclusion, PhaseNAS demonstrates that LLM-driven, resource-adaptive NAS frameworks can effectively automate the design of high-quality, efficient perception models for UAVs and beyond. We envision that future advances in language model reasoning and program synthesis will further enhance the autonomy and generalizability of neural architecture search, paving the way for next-generation intelligent aerial systems.

5. Conclusions

We present PhaseNAS, a neural architecture search framework that addresses the core resource-allocation challenge in LLM-based NAS via phase-aware dynamic scaling. By matching LLM capacity to the needs of each search phase, PhaseNAS enables breadth-first exploration with small models and concentrates high-capacity reasoning in late-stage refinement.
Comprehensive experiments validate PhaseNAS across classification and detection: it discovers superior architectures on NAS-Bench-Macro (93.11% accuracy, rank 3), reduces wall-clock search time by up to 86% on classification tasks, and generalizes to object detection with automatically generated YOLOv8 variants that surpass baselines. Under equal parameter and FLOPs budgets, these variants deliver higher mAP with fewer resources, improving accuracy per GFLOP and per million parameters and supporting real-time feasibility on resource-constrained onboard computing platforms commonly used in UAVs.
Overall, PhaseNAS shows that resource-aware, adaptive search can improve both effectiveness and end-to-end cost–benefit—reducing search-time LLM/compute usage while yielding models with stronger accuracy–efficiency trade-offs at inference. Future work includes extending phase-aware allocation to additional modalities and tasks, and integrating richer hardware-aware constraints and latency targets.   

Author Contributions

Methodology, F.K.; validation, F.K. and Y.H.; formal analysis, Y.H.; resources, X.S.; data curation, F.K.; writing—original draft preparation, F.K.; writing—review and editing, X.S.; visualization, F.K.; supervision, J.L.; project administration, X.S.; funding acquisition, X.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Qiyuan Lab, grant number S20240201001.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

The authors thank all members of Qiyuan Lab for helpful discussions and support.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

NAS: Neural Architecture Search
LLM: Large Language Model
FLOPs: Floating Point Operations
mAP: mean Average Precision
UAV: Unmanned Aerial Vehicle
CNN: Convolutional Neural Network

Appendix A. Details of LLM Prompt Design

In this appendix, we provide the prompts used by PhaseNAS for the classification tasks. The prompts for YOLOv8 are similar, with the main difference being that the input raw strings are different and longer due to the complexity of the YOLOv8 architecture. For reference, the YOLOv8 configuration can be found at the following link: yolov8.yaml on GitHub (https://github.com/ultralytics/ultralytics/blob/main/ultralytics/cfg/models/v8/yolov8.yaml (accessed on 1 November 2025)).

Appendix A.1. System Content

We define the system content used in PhaseNAS:
system_content = (
  "You are a computer scientist and an artificial intelligence "
  "researcher who is widely regarded as one of the leading "
  "experts in deep learning and neural network architecture search. "
  "Your work in this area has focused on developing efficient "
  "algorithms for searching the space of possible neural network "
  "architectures with the goal of finding architectures that "
  "perform well on a given task while minimizing the computational "
  "cost of training and inference."
)

Appendix A.2. User Input

User input used to guide the LLM:
user_input = (
  "You are an expert in the field of neural architecture search. "
  "Your task is to assist me in selecting the best operations to "
  "design a neural network block using the available operations. "
  "The objective is to maximize the model's performance. "
  "The optional model blocks are divided into three groups, strictly "
  "selected from the following when generating the structure: "
  "1. [SuperResK3K3, SuperResK5K5, SuperResK7K7], "
  "2. [SuperResK1K3K1, SuperResK1K5K1, SuperResK1K7K1], "
  "3. [SuperConvK1BNRELU, SuperConvK3BNRELU, SuperConvK5BNRELU, "
  "SuperConvK7BNRELU]. "
  "You can only choose from the above 10 blocks at build time, and "
  "you are not allowed to change the names of these 10 blocks at will! "
  "Each block can be given several sublayers at build time. "
  "The selection rule of sub_layers is that the sum of sub_layers of "
  "structures does not exceed 18! "
  "Let's break this down step by step: First, analyze the structure "
  "of a block of type SuperRes. The structure is strictly selected "
  "from these ten available blocks when it is generated. In the final "
  "model structure, between adjacent blocks, the output channel of "
  "the previous block and the input channel of the next block must "
  "be consistent. Another very important point is that the input to "
  "the model structure is the image of the three channels. "
  "Next, you can generate a new model structure. A block can have "
  "several sub-layers when generated. Finally, propose a block "
  "design that prioritizes performance! After suggesting a design, "
  "I will test its performance and provide feedback. Based on "
  "the results of previous structures and scores, you can "
  "collaborate to iterate and improve the design. Please avoid "
  "suggesting the same design again during this iterative process."
)

Appendix A.3. Initial Structure and Prompt

The initial network structure and the corresponding prompt used in the experiments are provided below:
initial_structure_str = (
  "SuperConvK3BNRELU(3,8,1,1)SuperResK3K3(8,16,1,8,1)"
  "SuperResK3K3(16,32,2,16,1)SuperResK3K3(32,64,2,32,1)"
  "SuperResK3K3(64,64,2,32,1)SuperConvK1BNRELU(64,128,1,1)"
)
prompt = (
  f"Given the neural network structure: {initial_structure_str}, "
  "generate a new structure by replacing two parts of the network "
  "while keeping it optimized for a classification task."
)

Appendix A.4. Experimental Prompt

The experimental prompt used to guide the LLM is defined as follows:
experiments_prompt = lambda best_arch_list, max_score_list: (
  "Here are some structure's score results for reference: "
  f"{''.join([f'{best_structure} gives {max_score:.2f}% ' for best_structure, max_score in zip(best_arch_list, max_score_list)])}. "
  "Suggest a better structure to improve these scores. "
  "Gradually increase channels to avoid exceeding 1M parameters, "
  "and slowly increase the total number of layers to 18. "
  "If scores plateau, explore different configurations."
)
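For reference, the pieces above could be assembled into a single chat request as sketched below; the OpenAI-compatible endpoint, API key, and model name are assumptions about a local Qwen2.5 deployment, not the exact setup used in the experiments.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local endpoint

messages = [
    {"role": "system", "content": system_content},
    {"role": "user", "content": user_input + prompt},
]
reply = client.chat.completions.create(model="Qwen2.5-7B-Instruct", messages=messages)
candidate = reply.choices[0].message.content   # structure string to validate and score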

Appendix B. Correlation Between NAS det and mAP

Setup and data. We evaluate YOLOv8 {n, s, m, l, x}. For each model, we record: (i) YAML-derived NAS_det, (ii) PT-derived NAS_det computed on the official checkpoint, and (iii) trained mAP@50:95 under our unified recipe (imgsz = 640). The raw values and correlations are reported in Table A1 and Table A2.
Table A1. YOLOv8 family: YAML- and PT-based NAS_det scores and trained mAP (imgsz = 640).
Model | YAML NAS_det | PT NAS_det | mAP@50:95
YOLOv8n | 2.02 | 11.22 | 37.3
YOLOv8s | 3.07 | 12.53 | 44.9
YOLOv8m | 4.10 | 18.36 | 50.2
YOLOv8l | 4.99 | 25.27 | 52.9
YOLOv8x | 5.51 | 25.21 | 53.9
Table A2. Correlation between NAS_det and mAP across YOLOv8 {n, s, m, l, x}.
Pair | Spearman ρ | Pearson r
YAML NAS_det vs. mAP | 1.000 | 0.994
PT NAS_det vs. mAP | 1.000 | 0.996
Interpretation and use. Both YAML- and PT-based NAS_det show near-perfect alignment with trained mAP across the YOLOv8 family (perfect monotonicity and very high linear correlation). This supports using NAS_det as a reliable zero-shot proxy for downstream detection quality in our search loop. Minor differences between YAML and PT correlations reflect checkpoint-specific characteristics; in practice, both variants are sufficiently predictive for ranking.

References

  1. Ramachandran, A.; Sangaiah, A.K. A review on object detection in unmanned aerial vehicle surveillance. Int. J. Cogn. Comput. Eng. 2021, 2, 84–97. [Google Scholar] [CrossRef]
  2. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  3. Wang, P.; Zhao, J. SOD-YOLO: Enhancing YOLO-Based Detection of Small Objects in UAV Imagery. arXiv 2025, arXiv:2507.12727. [Google Scholar]
  4. Zheng, Y.; Jing, Y.; Zhao, J.; Cui, G. LAM-YOLO: Drones-based small object detection on lighting occlusion attention mechanism YOLO. Comput. Vis. Image Underst. 2025, 235, 104489. [Google Scholar] [CrossRef]
  5. Lu, L.; He, D.; Liu, C.; Deng, Z. MASF-YOLO: An Improved YOLOv11 Network for Small Object Detection on Drone View. arXiv 2025, arXiv:2504.18136. [Google Scholar] [CrossRef]
  6. Huang, M.; Mi, W.; Wang, Y. EDGS-YOLOv8: An Improved YOLOv8 Lightweight UAV Detection Model. Drones 2024, 8, 337. [Google Scholar] [CrossRef]
  7. Nguyen, P.T.; Nguyen, G.L.; Bui, D.D. LW-UAV–YOLOv10: A Lightweight Model for Small UAV Detection on Infrared Data Based on YOLOv10. Geomatica 2025, 77, 100049. [Google Scholar] [CrossRef]
  8. Gong, J.; Liu, H.; Zhao, L.; Maeda, T.; Cao, J. EEG-Powered UAV Control via Attention Mechanisms. Appl. Sci. 2025, 15, 10714. [Google Scholar] [CrossRef]
  9. Liu, S.; Zha, J.; Sun, J.; Li, Z.; Wang, G. EdgeYOLO: An edge-real-time object detector. In Proceedings of the 42nd Chinese Control Conference (CCC), Tianjin, China, 24–26 July 2023; pp. 9130–9135. [Google Scholar]
  10. Chen, D.; Zhang, L. SL-YOLO: A stronger and lighter drone target detection model. arXiv 2024, arXiv:2411.11477. [Google Scholar] [CrossRef]
  11. Micheal, A.A.; Micheal, A.; Gopinathan, A.; Barathi, B.U.A. Deep Learning-based Multi-class Object Tracking with Occlusion Handling Mechanism in UAV Videos. Res. Sq. 2024. [Google Scholar] [CrossRef]
  12. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8 (Version 8.0.0). GitHub Repository. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 1 November 2025).
  13. Elsken, T.; Metzen, J.H.; Hutter, F. Neural architecture search: A survey. J. Mach. Learn. Res. 2019, 20, 1–21. [Google Scholar]
  14. Real, E.; Aggarwal, A.; Huang, Y.; Le, Q.V. Regularized evolution for image classifier architecture search. Proc. Aaai Conf. Artif. Intell. 2019, 33, 4780–4789. [Google Scholar] [CrossRef]
  15. Real, E.; Moore, S.; Selle, A.; Saxena, S.; Suematsu, Y.L.; Tan, J.; Le, Q.V.; Kurakin, A. Large-Scale Evolution of Image Classifiers. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, NSW, Australia, 6–11 August 2017; Volume 70, pp. 2902–2911. Available online: https://proceedings.mlr.press/v70/real17a.html (accessed on 1 November 2025).
  16. Zoph, B.; Le, Q. Neural architecture search with reinforcement learning. arXiv 2016, arXiv:1611.01578. [Google Scholar]
  17. Liu, H.; Simonyan, K.; Yang, Y. DARTS: Differentiable architecture search. arXiv 2018, arXiv:1806.09055. [Google Scholar]
  18. Nasir, M.U.; Earle, S.; Togelius, J.; James, S.; Cleghorn, C. LLMatic: Neural architecture search via large language models and quality diversity optimization. In Proceedings of the Genetic and Evolutionary Computation Conference, Melbourne, Australia, 14–18 July 2024. [Google Scholar]
  19. Chen, A.; Dohan, D.; So, D.R. EvoPrompting: Language models for code-level neural architecture search. arXiv 2023, arXiv:2302.14838. [Google Scholar]
  20. Rahman, M.H.; Chakraborty, P. Lemo-nade: Multi-parameter neural architecture discovery with LLMs. arXiv 2024, arXiv:2402.18443. [Google Scholar]
  21. Zheng, M.; Su, X.; You, S.; Wang, F.; Qian, C.; Xu, C.; Albanie, S. Can GPT-4 perform neural architecture search? arXiv 2023, arXiv:2304.10970. [Google Scholar] [CrossRef]
  22. Wu, X.; Wu, S.H.; Wu, J.; Feng, L.; Tan, K.C. Evolutionary computation in the era of large language model: Survey and roadmap. arXiv 2024, arXiv:2401.10034. [Google Scholar] [CrossRef]
  23. Slimani, H.; El Mhamdi, J.; Jilbab, A. Deep Learning Structure for Real-time Crop Monitoring Based on Neural Architecture Search and UAV. Braz. Arch. Biol. Technol. 2024, 67, e24231141. [Google Scholar] [CrossRef]
  24. Asghari, O.; Ivaki, N.; Madeira, H. UAV Operations Safety Assessment: A Systematic Literature Review. ACM Comput. Surv. 2025, 57, 1–37. [Google Scholar] [CrossRef]
  25. Wang, Y.; Li, J.; Yang, X.; Peng, Q. UAV–Ground Vehicle Collaborative Delivery in Emergency Response: A Review of Key Technologies and Future Trends. Appl. Sci. 2025, 15, 9803. [Google Scholar] [CrossRef]
  26. Alqudsi, Y.; Makaraci, M. UAV Swarms: Research, Challenges, and Future Directions. J. Eng. Appl. Sci. 2025, 72, 12. [Google Scholar] [CrossRef]
  27. Yu, C.; Liu, X.; Wang, Y.; Liu, Y.; Feng, W.; Xiong, D.; Tang, C.; Lv, J. GPT-NAS: Evolutionary Neural Architecture Search with the Generative Pre-Trained Model. J. Latex Cl. Files 2025, 14, 1–11. [Google Scholar]
  28. Morris, C.; Jurado, M.; Zutty, J. LLM Guided Evolution—The Automation of Models Advancing Models. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO ’24), Melbourne, Australia, 14–18 July 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 377–384. [Google Scholar]
  29. Yu, Y.; Zutty, J. LLM-Guided Evolution: An Autonomous Model Optimization for Object Detection. In Proceedings of the Genetic and Evolutionary Computation Conference Companion (GECCO ’25 Companion), Málaga, Spain, 14–18 July 2025; Association for Computing Machinery: New York, NY, USA, 2025; pp. 2363–2370. [Google Scholar]
  30. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  31. Dong, X.; Yang, Y. NAS-Bench-201: Extending the scope of reproducible neural architecture search. arXiv 2020, arXiv:2001.00326. [Google Scholar] [CrossRef]
  32. Lin, M.; Wang, P.; Sun, Z.; Chen, H.; Sun, X.; Qian, Q.; Li, H.; Jin, R. Zen-NAS: A zero-shot NAS for high-performance image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 347–356. [Google Scholar]
Figure 1. Overview of PhaseNAS. Phase I employs a small LLM for cost-efficient exploration; Phase II switches to a larger LLM for high-quality refinement. Phase transitions are triggered by real-time zero-shot scores, aligning model capacity with phase complexity to balance efficiency and final quality.
Figure 2. Accuracy and rank curves on NAS-Bench-Macro for three approaches.
Table 1. Representative LLM-guided NAS approaches and comparison. “Det.” indicates explicit support for object detection; “Dyn.” indicates dynamic LLM capacity/resource adaptation; “Template” indicates explicit, parsable architecture templates; “ZS-Score” indicates training-free/zero-shot scoring.
Method | Core Paradigm | Tasks | Det. | Dyn. | Template | ZS-Score
LLMatic [18] | LLM + QD search | Cls. | No | No | No | No
EvoPrompting [19] | Evo-prompting | Cls. | No | No | No | No
GPT-NAS [27] | GPT + EA-guided NAS | Cls. | No | No | No | No
LLM-GE (GECCO’24) [28] | LLM-guided evolution + source-level code edits | Cls. | No | No | No | No
LeMo-NADe [20] | Code-level param NAS | Cls. | No | No | Yes | No
GUNIUS [21] | Agentic LLM NAS | Cls. | No | No | Yes | No
LLM-GE (GECCO’25) [29] | LLM-guided evolution + YOLO YAML | Det. | Yes | No | Yes | No
PhaseNAS (Ours) | Phase-aware LLM NAS | Cls. + Det. | Yes | Yes | Yes | Yes
Table 2. Compact comparison of YOLOv8 variants in the detection search space. Budget: adherence to YOLOv8n/s params/FLOPs. (+) denotes in-family reconfiguration; (*) adds SCDown/PSA under identical n/s budgets.
Variant | Scope | Key Changes | New Blocks
YOLOv8 (baseline) | Official n/s backbones; standard neck/head; multi-scale (P3–P5) | – | No
YOLOv8+ (reconfig) | Block-level reconfiguration within backbone/neck | Re-order/tune blocks; keep I/O shapes | No
YOLOv8* (extended) | YOLOv8+ with optional modules | SCDown; PSA (toggled per candidate) | Yes
Table 3. Comparison between PhaseNAS and Zen-NAS on classification benchmarks. For PhaseNAS, parentheses after Search Time show the relative reduction versus Zen-NAS in the same setting. Accuracies are reported as mean ± sd over 3 runs.
Method | Zen-Score | Search Time (min) | CIFAR-10 Acc. | CIFAR-100 Acc.
Zen-NAS | 99.20 | 7.43 | 96.72 ± 0.07 | 79.95 ± 0.08
PhaseNAS | 99.43 | 1.20 (−83.9%) | 96.80 ± 0.06 | 80.76 ± 0.07
Zen-NAS | 111.63 | 19.07 | 96.00 ± 0.08 | 80.78 ± 0.07
PhaseNAS | 111.33 | 7.06 (−63.0%) | 96.65 ± 0.05 | 81.22 ± 0.06
Zen-NAS | 121.38 | 67.46 | 96.94 ± 0.06 | 81.24 ± 0.09
PhaseNAS | 121.44 | 9.01 (−86.6%) | 97.33 ± 0.04 | 81.35 ± 0.07
Table 4. Architecture comparison on COCO and VisDrone2019. mAP@50:95 is reported as mean ± sd over 3 seeds. Δ mAP is the absolute gain of the mean vs. the family base (YOLOv8n or YOLOv8s); percentages in parentheses denote relative changes vs. the family base for mAP, parameters, and FLOPs (negative indicates reduction). (+) denotes in-family reconfiguration; (*) adds SCDown/PSA under identical n/s budgets.
COCO
Family | Model | mAP@50:95 (mean ± sd) | ΔmAP | MParams | GFLOPs
YOLOv8n | YOLOv8n | 37.30 ± 0.05 | – | 3.2 | 8.7
YOLOv8n | YOLOv8n+ | 38.02 ± 0.04 | +0.72 (+1.9%) | 3.0 (−6.3%) | 8.2 (−5.7%)
YOLOv8n | YOLOv8n* | 39.08 ± 0.07 | +1.78 (+4.8%) | 2.95 (−7.8%) | 8.5 (−2.3%)
YOLOv8s | YOLOv8s | 44.90 ± 0.08 | – | 11.2 | 28.6
YOLOv8s | YOLOv8s+ | 45.42 ± 0.05 | +0.52 (+1.1%) | 10.3 (−8.0%) | 25.0 (−12.6%)
YOLOv8s | YOLOv8s* | 46.08 ± 0.09 | +1.18 (+2.6%) | 9.9 (−11.6%) | 22.4 (−21.7%)
VisDrone2019
Family | Model | mAP@50:95 (mean ± sd) | ΔmAP | MParams | GFLOPs
YOLOv8n | YOLOv8n | 18.50 ± 0.04 | – | 3.2 | 8.7
YOLOv8n | YOLOv8n+ | 18.82 ± 0.05 | +0.32 (+1.7%) | 3.0 (−6.3%) | 8.2 (−5.7%)
YOLOv8n | YOLOv8n* | 19.10 ± 0.08 | +0.60 (+3.2%) | 2.95 (−7.8%) | 8.5 (−2.3%)
YOLOv8s | YOLOv8s | 22.50 ± 0.07 | – | 11.2 | 28.6
YOLOv8s | YOLOv8s+ | 22.68 ± 0.09 | +0.18 (+0.8%) | 10.3 (−8.0%) | 25.0 (−12.6%)
YOLOv8s | YOLOv8s* | 23.38 ± 0.10 | +0.88 (+3.9%) | 9.9 (−11.6%) | 22.4 (−21.7%)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
