Article

UAVEdit-NeRFDiff: Controllable Region Editing for Large-Scale UAV Scenes Using Neural Radiance Fields and Diffusion Models

1 School of Electrical Engineering, Guangxi University, Nanning 530004, China
2 School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(12), 2069; https://doi.org/10.3390/sym17122069
Submission received: 21 October 2025 / Revised: 23 November 2025 / Accepted: 1 December 2025 / Published: 3 December 2025
(This article belongs to the Section Engineering and Materials)

Abstract

The integration of Neural Radiance Field (NeRF)-based 3D reconstruction with text-guided diffusion models enables flexible editing of real-world scenes. However, for large-scale UAV-captured scenes, existing methods struggle to achieve strong semantic consistency in local editing and suffer from cross-view inconsistency, primarily because diffusion models generate freely at the global level and lack scene-continuity constraints. To address these issues, we propose the UAVEdit-NeRFDiff framework, which preserves overall scene symmetry by restricting editing operations to the target region. We first leverage visual priors and semantic masks to achieve semantically consistent editing of key views, and then design Optimal Editing Propagation (OEP) and Progressive Inheritance Propagation (PIP) methods to propagate edits with cross-view geometric consistency for Single-View-Dependent Regions (SVDRs) and Multi-View-Dependent Regions (MVDRs), respectively. Experiments on diverse editing tasks demonstrate our method's superiority in semantic alignment, cross-view consistency, and visual fidelity on UAV scenes, with promising applications in weather and disaster scenario simulation. On the proposed TDB metric, our approach delivers more than a 50% improvement over prior methods. To the best of our knowledge, this is the first text–visual bimodal-guided diffusion editing framework for NeRF-reconstructed UAV-captured scenes, offering a practical and effective route for related research.

1. Introduction

The 3D reconstruction of high-resolution UAV imagery has been deeply integrated into remote sensing and related fields, serving critical applications such as emergency response for natural disasters, dynamic monitoring, and damage assessment [1,2,3,4,5], as well as the digitization, preservation, restoration, and virtual experience of cultural heritage [6,7,8,9]. However, 3D models generated by existing techniques are predominantly direct replications of the physical world; once acquired, the scene content remains relatively static. Enabling post-acquisition editing would facilitate the efficient generation of low-cost, infinitely diverse training samples for model training in remote sensing and related domains, thereby satisfying the growing demands for simulation and training data. Notably, the academic community has begun exploring simulation and editing technologies to animate static 3D scenes. For instance, Kokosza et al. [10] achieved combustion simulation of 3D vegetation by constructing highly complex physical models, realistically reproducing wildfire spread in virtual environments. Nevertheless, such methods typically rely on precise physical modeling, posing challenges regarding efficiency and flexibility when applied to large-scale, low-cost, and diverse training data generation. Consequently, the research paradigm is shifting from physics-based simulation to data-driven generation. The core driving force behind this transition stems from breakthrough advancements in implicit 3D scene reconstruction and generative editing within the field of computer vision.
In the domain of 3D reconstruction, the advent of Neural Radiance Fields (NeRFs) marked a significant milestone. Debuting at ECCV 2020 [11], NeRF implicitly encodes the geometry and appearance of a scene via neural networks, enabling the generation of continuous, highly photorealistic novel views from sparse 2D images. Subsequently, research has rapidly expanded in multiple directions. Mip-NeRF [12] significantly improves image quality by efficiently rendering anti-aliased conical frustums instead of rays, thereby reducing aliasing artifacts. Instant-NGP [13] combines multi-resolution hash encoding with lightweight neural networks to reduce training time from days to seconds or minutes while maintaining high-quality rendering. The NeuS series [14,15] introduced the Signed Distance Function (SDF) into NeRF’s volume rendering framework, endowing implicit scene representations with geometric interpretability, which allows reconstruction results to stably converge to true surfaces and yield high-quality explicit geometry. RegNeRF [16] combines spatial and view regularization to achieve stable reconstruction and high-quality novel view synthesis even under extremely sparse view conditions. Furthermore, numerous studies have dedicated efforts to applying NeRF to large-scale scenes [17,18,19,20,21,22,23]. Through methods such as block-wise training, these works have achieved the rendering of city-scale or even hundred-meter-scale outdoor scenes. Collectively, these studies have greatly propelled the development of NeRF in terms of high-quality rendering, geometric interpretability, sparse view reconstruction, and large-scale scene applications.
In the field of generative editing, Diffusion Models marked a milestone with the introduction of Denoising Diffusion Probabilistic Models (DDPMs) [24], which learn data distributions through an iterative process of gradual noise addition and denoising. Subsequent research based on the Latent Diffusion Model (LDM) framework [25] further integrated CLIP text encoders [26], catalyzing a series of advancements in text-driven 2D image editing [27,28,29,30] and Text-to-3D generation [31,32,33,34,35,36,37]. Notably, InstructPix2Pix [28] pioneered the use of the GPT-3 language model to generate and refine natural-language editing instructions, employing Stable Diffusion to synthesize paired source–target images. By utilizing these automatically constructed triplets (source image, instruction, edited image), the method fine-tunes a Stable Diffusion model into an instruction-conditioned diffusion model capable of performing high-fidelity, instruction-guided image edits.
Driven by this trend, text-guided diffusion models have been introduced into the editing of NeRF-reconstructed real-world 3D scenes, demonstrating significant potential in editing tasks for small-scale natural scenes [38,39,40,41,42,43,44,45,46,47]. Concurrently, pioneering studies have begun to explore applications in large-scale scenarios [48,49,50]. For instance, Instruct-NeRF2NeRF [48] employs InstructPix2Pix [28] within an iterative 2D editing–3D updating loop, enabling text-driven object modification in small scenes and style transfer in large-scale natural environments. However, its strategy of iteratively updating the dataset struggles to maintain cross-view consistency, and increased editing iterations can lead to the gradual loss of original geometric and appearance information. In contrast, VICA-NeRF [49] establishes image correspondences across different views by leveraging depth information to propagate editing results throughout the scene; this strategy effectively improves cross-view consistency. Nevertheless, its performance remains constrained by the limited generalization capability of InstructPix2Pix on real-world images and the lack of a strict preservation mechanism for input image content. Consequently, its editing capability proves insufficient when applied to complex, large-scale scenes with multi-object interactions captured by UAVs. Although remarkable breakthroughs have been achieved in large-scale city-level 3D reconstruction, research on editing specifically for UAV scenes remains relatively lagging. The recently proposed UAV-ENeRF [51] supports text-driven online editing of UAV scenes (e.g., seasonal, meteorological, temporal, or catastrophic changes); however, it lacks substantive exploration into local editing within complex scenes involving multi-object interactions. Therefore, this paper aims to propose a text-driven editing method tailored for large-scale UAV scenes, expecting to achieve local editing in complex environments with multi-object interactions while preserving cross-view consistency and original content fidelity.
Through analysis of existing research methodologies, we identify three key challenges in UAV scene editing via diffusion models: first, the intrinsically global denoising process fundamentally conflicts with precise local editing requirements; second, geometric misalignment between edited and original regions disrupts scene continuity; and third, cross-view inconsistency arising from sparsely distributed Multi-View-Dependent Regions (MVDRs). Fundamentally, a UAV scene editing framework for large-scale, complex, multi-object environments must preserve the scene’s generalized symmetries across multiple dimensions—that is, the invariances and consistencies that core attributes must maintain under specified transformations. Specifically, this includes (1) semantic symmetry, meaning that while altering the target’s semantic attributes, the semantic identity of unedited regions remains unchanged; (2) geometric and topological symmetry, meaning that editing operations are strictly confined to the target region, and the projection mapping does not affect the non-target regions; and (3) cross-view symmetry, meaning that the edited outcomes remain consistent across all viewpoints. To address these goals, we propose UAVEdit-NeRFDiff, with the main contributions of this work summarized as follows:
  • We introduce visual-prior-guided local diffusion editing that leverages semantic masks to achieve object-level precision over edited regions while significantly enhancing semantic consistency.
  • We propose an Optimal Editing Propagation (OEP) method that uses the TDB metric to strengthen multi-view scene continuity for Single-View-Dependent Regions (SVDRs) in UAV scenes, followed by a Diffusion Refinement (DR) phase that further improves visual quality in target regions.
  • We develop a Progressive Inheritance Propagation (PIP) method that employs Adaptive Blending to predict pixel-wise mixing weights, thereby balancing cross-view propagated edits with structural scene details and enhancing consistency in sparsely distributed Multi-View-Dependent Regions (MVDRs).

2. Methods

2.1. Overview

Figure 1 presents the main workflow of UAVEdit-NeRFDiff. Because UAV-captured data involve arbitrary viewpoints, and to achieve robust multi-view consistent results across various editing tasks, we categorize each editing task into two types according to whether at least one single view fully covers the target editing region. Single-View-Dependent Regions (SVDRs) refer to targets that are spatially compact and can be completely covered by a single view (e.g., a small patch of grassland or a road segment). Multi-View-Dependent Regions (MVDRs) refer to targets that are widely distributed or exhibit complex 3D structures and thus cannot be fully covered by any single view (e.g., scattered trees or building facades). These two cases differ fundamentally during cross-view propagation, and the corresponding methods are discussed in Section 2.3. We note that this categorization is currently performed manually; we leave the development of an automated classifier for future work.
Prior to methodological discussions, we formalize the notation for our workflow. Let $I_i \in \mathbb{R}^{3 \times H \times W}$ denote the $i$-th original NeRF-rendered view, $I_v \in \mathbb{R}^{3 \times H \times W}$ the visual prior guiding edits, and $I_e^i \in \mathbb{R}^{3 \times H \times W}$ the edited $i$-th view. Projection mapping is indicated by superscripts: $I_e^{j \leftarrow i} \in \mathbb{R}^{3 \times H \times W}$ represents the view obtained by mapping $I_e^i$ to $I_j$. For SVDRs, the operator $[\,\cdot\,]_{\mathrm{DR}}$ denotes Diffusion Refinement (DR) in Optimal Editing Propagation (OEP). For MVDRs, the Progressive Inheritance Propagation (PIP) introduces stage-specific notation (e.g., $I_e^{j \leftarrow i, \ell}$ for the $\ell$-th stage projection of $I_e^{i,\ell}$ to $I_j$), with $[\,\cdot\,]_{\mathrm{AB}}$ indicating Adaptive Blending (AB).

2.2. Single-View Editing

To enhance semantic consistency in single-view editing, as shown in Equation (1), we jointly introduce the semantic mask and the visual prior $I_v$ to guide the diffusion editing process, producing an initial edited result $I_e'$:

$$I_e' = \mathrm{Diff}\!\left(\alpha \cdot \left(I_v \odot \Psi(M) + I \odot \Psi(1 - M)\right) + (1 - \alpha)\, I\right) \tag{1}$$

where $\alpha \in [0, 1]$ denotes the blending weight; $M \in \{0, 1\}^{H \times W}$ denotes the region semantic mask generated by SegFormer [52] for NeRF's original rendered view, with mask value 1 indicating target editing regions and 0 indicating non-target regions; $\odot$ denotes the Hadamard product; and $\Psi(\cdot)$ is a composite operator that applies morphological dilation and Gaussian blur to the region mask to smooth its edges. To eliminate edit leakage beyond the target region, we perform pixel-wise restoration of non-target regions using the semantic mask, obtaining the final single-view editing result $I_e$:

$$I_e = I_e' \odot \Psi(M) + I \odot \Psi(1 - M) \tag{2}$$
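For concreteness, the following minimal Python sketch illustrates how Equations (1) and (2) can be realized with a soft mask operator and an off-the-shelf editor. The dilation/blur parameters, the default blending weight, and the `diffusion_edit` callable (standing in for the text-conditioned editor, e.g., an InstructPix2Pix call) are illustrative assumptions rather than the exact implementation.

```python
import cv2
import numpy as np

def psi(mask, dilate_px=5, blur_ksize=21, blur_sigma=5.0):
    """Composite operator Psi: morphological dilation followed by Gaussian blur,
    producing a soft-edged mask in [0, 1] with shape H x W x 1 for broadcasting."""
    kernel = np.ones((dilate_px, dilate_px), np.uint8)
    m = cv2.dilate(mask.astype(np.uint8), kernel, iterations=1).astype(np.float32)
    m = cv2.GaussianBlur(m, (blur_ksize, blur_ksize), blur_sigma)
    return np.clip(m, 0.0, 1.0)[..., None]

def single_view_edit(I, I_v, M, diffusion_edit, alpha=0.7):
    """Sketch of Eqs. (1)-(2): paste the visual prior into the (softened) target
    region, run the text-conditioned diffusion editor, then restore non-target
    pixels from the original view. `diffusion_edit` is a hypothetical callable."""
    w_t, w_n = psi(M), psi(1 - M)                    # soft target / non-target weights
    composite = I_v * w_t + I * w_n                  # visual prior confined to target region
    I_e_init = diffusion_edit(alpha * composite + (1.0 - alpha) * I)   # Eq. (1)
    return I_e_init * w_t + I * w_n                  # Eq. (2): pixel-wise restoration
```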

2.3. Cross-View Propagation

2.3.1. Projection Mapping

Cross-view propagation fundamentally relies on projection mapping. Based on the single-view editing results (defined as the reference view), we formulate a geometrically consistent cross-view mapping mechanism through projection transformation, propagating edits to arbitrary scene views (defined as target views). This is achieved by constructing geometric consistency constraints between reference and target views using forward and backward projections.
For a pixel coordinate $(u_{\mathrm{ref}}, v_{\mathrm{ref}})$ in the reference view, we first compute its corresponding world coordinates $X_{\mathrm{world}}$ using the depth $D_{\mathrm{ref}}$ estimated by NeRF and the reference view's camera parameters (rotation matrix $R_{\mathrm{ref}}$, translation vector $t_{\mathrm{ref}}$, and intrinsic matrix $K_{\mathrm{ref}}$):

$$X_{\mathrm{world}} = R_{\mathrm{ref}}^{T} \left( D_{\mathrm{ref}}(u_{\mathrm{ref}}, v_{\mathrm{ref}}) \cdot K_{\mathrm{ref}}^{-1} \begin{bmatrix} u_{\mathrm{ref}} \\ v_{\mathrm{ref}} \\ 1 \end{bmatrix} - t_{\mathrm{ref}} \right) \tag{3}$$
The world coordinates $X_{\mathrm{world}}$ are then projected to the target view's pixel coordinate system $(u_{\mathrm{tar}}, v_{\mathrm{tar}})$ in two steps using the target camera parameters: the 3D projection of Equation (4) followed by the normalization of Equation (5):

$$[x, y, z]^{T} = K_{\mathrm{tar}} \left( R_{\mathrm{tar}} X_{\mathrm{world}} + t_{\mathrm{tar}} \right) \tag{4}$$

$$[u_{\mathrm{tar}}, v_{\mathrm{tar}}, 1]^{T} = [x/z,\; y/z,\; 1]^{T} \tag{5}$$
Since the projected target-view coordinates $(u_{\mathrm{tar}}, v_{\mathrm{tar}})$ are typically floating-point values while reference-view pixels lie on a discrete integer grid, we employ bilinear interpolation on the reference view image $I_{\mathrm{ref}}$ to obtain the preliminary mapped image $I_{\mathrm{map}}$ for the target view. Due to viewpoint differences between $I_{\mathrm{ref}}$ and $I_{\mathrm{tar}}$, certain mapped regions may become geometrically invisible, which can be identified through the cyclic error $e_c$ derived from the back-projected (target → reference → target) coordinates $(u'_{\mathrm{tar}}, v'_{\mathrm{tar}})$:

$$e_c(u_{\mathrm{tar}}, v_{\mathrm{tar}}) = \left\| (u_{\mathrm{tar}}, v_{\mathrm{tar}}) - (u'_{\mathrm{tar}}, v'_{\mathrm{tar}}) \right\|_2 \tag{6}$$
Regions with significant cyclic error ($e_c > e_{\mathrm{th}}$) are identified as geometrically invisible in the target view and require completion, as shown in Equation (7), to generate the final image:

$$I_{\mathrm{map}}(u_{\mathrm{tar}}, v_{\mathrm{tar}}) = \begin{cases} I_{\mathrm{map}}(u_{\mathrm{tar}}, v_{\mathrm{tar}}), & e_c(u_{\mathrm{tar}}, v_{\mathrm{tar}}) \le e_{\mathrm{th}} \\ I_{\mathrm{tar}}(u_{\mathrm{tar}}, v_{\mathrm{tar}}), & \text{otherwise} \end{cases} \tag{7}$$

where $e_{\mathrm{th}}$ is the cyclic error threshold, which enables robust projection mapping under unsupervised conditions.
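The projection-mapping step can be summarized by the following NumPy sketch, assuming the pinhole convention $x_{\mathrm{cam}} = R\,X_{\mathrm{world}} + t$ used in Equations (3)–(5). The helper names and the `depth_ref_lookup` sampler are hypothetical, and the occlusion completion of Equation (7) (falling back to the target view wherever $e_c > e_{\mathrm{th}}$) is left to the caller.

```python
import numpy as np

def project_pixels(uv, depth, K_src, R_src, t_src, K_dst, R_dst, t_dst):
    """Project pixel coordinates `uv` (N x 2) from a source view into a destination
    view, following Eqs. (3)-(5). `depth` holds the per-pixel source-view depth."""
    ones = np.ones((uv.shape[0], 1))
    pix_h = np.concatenate([uv, ones], axis=1)                   # N x 3 homogeneous pixels
    rays = (np.linalg.inv(K_src) @ pix_h.T) * depth[None, :]     # camera-frame points, Eq. (3)
    X_world = R_src.T @ (rays - t_src[:, None])                  # world coordinates
    cam_dst = K_dst @ (R_dst @ X_world + t_dst[:, None])         # 3D projection, Eq. (4)
    return (cam_dst[:2] / cam_dst[2:3]).T                        # normalization, Eq. (5)

def cyclic_error(uv_tar, depth_tar, cams_tar, cams_ref, depth_ref_lookup):
    """Round-trip check (target -> reference -> target) used to flag geometrically
    invisible pixels, cf. Eq. (6). `cams_*` are (K, R, t) tuples; `depth_ref_lookup`
    is a hypothetical sampler of the reference depth map at fractional coordinates."""
    uv_ref = project_pixels(uv_tar, depth_tar, *cams_tar, *cams_ref)
    d_ref = depth_ref_lookup(uv_ref)
    uv_back = project_pixels(uv_ref, d_ref, *cams_ref, *cams_tar)
    return np.linalg.norm(uv_back - uv_tar, axis=1)              # e_c per pixel
```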

2.3.2. Optimal Editing Propagation

For Single-View-Dependent Regions (SVDRs), Optimal Editing Propagation (OEP) avoids error accumulation from multiple propagations. As shown in Figure 2a, we first edit the original NeRF-rendered views $I_i$ using the method described in Section 2.2 to obtain $I_e^i$ $(i = 1, 2, \dots, m)$, then select the optimally edited view $I_e^k$ using the TDB metric (see Section 3.2.3 for details) and propagate its edits to all $I_j$ using the method in Section 2.3.1, yielding $I_e^{j \leftarrow k}$ $(j = 1, 2, \dots, m)$ so as to maximally preserve scene continuity across viewpoints. Subsequently, we restore non-target regions in $I_e^{j \leftarrow k}$ using $I_j$ and $M_j$. However, projection errors cause stretching artifacts in target regions that degrade visual quality, motivating our proposed DR module for visual quality improvement.
The DR module utilizes the projected mapped images as both input images (encoded into latent representations with added noise for progressive denoising) and conditioning images (encoded into latent representations to constrain the denoising process). This design ensures that the conditioning images can regulate the diffusion process, thereby preventing the target regions in the refined results from being over-modified. In other words, although target regions undergo diffusion refinement, their fundamental morphology remains consistent with the projection mapping results, thereby repairing stretching artifacts while maintaining multi-view consistency with other viewpoints. We employ the semantic mask $M$ to fine-tune the text guidance scale and image guidance scale, applying differentiated guidance strengths during denoising: reducing the image guidance scale for target regions while keeping non-target regions unchanged, and decreasing the text guidance scale for non-target regions while maintaining target regions constant. This approach enables refined target regions while preserving overall structure, producing semantically consistent multi-view edits $[I_e^{j \leftarrow k}]_{\mathrm{DR}}$ for SVDRs.
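As a hedged illustration of this differentiated guidance, per-pixel guidance-scale maps can be assembled from the semantic mask as below; the numeric values are assumptions for illustration only, and how a denoiser supporting spatially varying guidance consumes these maps is not shown here.

```python
import numpy as np

def guidance_scale_maps(mask, text_scale=7.5, image_scale=1.2,
                        text_scale_nontarget=3.0, image_scale_target=0.6):
    """Assemble per-pixel guidance-scale maps for Diffusion Refinement: inside the
    target region (mask == 1) the image guidance is lowered so the denoiser may
    repair projection-induced stretching, while outside it the text guidance is
    lowered so non-target content is not re-edited. All values are illustrative."""
    s_text = np.where(mask > 0.5, text_scale, text_scale_nontarget)
    s_image = np.where(mask > 0.5, image_scale_target, image_scale)
    return s_text, s_image
```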

2.3.3. Progressive Inheritance Propagation

The Progressive Inheritance Propagation (PIP) method is proposed to further enhance cross-view consistency for Multi-View-Dependent Regions (MVDRs). We calculate the target coverage ratio $\rho_i = \|M_i\|_1 / (H \times W)$ for each view $I_i$ $(i = 1, 2, \dots, m)$. We first retain the views with the highest coverage ratios and then randomly sample additional views from the remaining candidates to assemble the key view set $G = \{ I_{\omega(1)}, I_{\omega(2)}, I_{\omega(3)}, \dots, I_{\omega(n)} \}$, where $n$ denotes the number of key views and satisfies $n \ll m$. The bijective function $\omega: \{1, 2, \dots, m\} \to \{1, 2, \dots, m\}$ is constructed such that

$$\rho_{\omega(1)} \ge \dots \ge \rho_{\omega(n)}, \quad I_{\omega(i)} \in G \;\; (i = 1, \dots, n) \tag{8}$$

$$\rho_{\omega(n+1)} \ge \dots \ge \rho_{\omega(m)}, \quad I_{\omega(i)} \notin G \;\; (i = n+1, \dots, m) \tag{9}$$
This ensures that the key view set G is ranked first by descending target coverage ratio, followed by the remaining non-key views also in descending order, yielding a globally ordered set. Compared to completely random selection, this strategy improves the average coverage of effective editing regions.
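A small sketch of this key-view assembly is given below. The split between top-coverage and randomly sampled views (`n_top`, `n_rand`) is an assumption, since the exact proportions are not specified, and the returned ordering uses 0-based indices rather than the 1-based $\omega$ of Equations (8) and (9).

```python
import numpy as np

def select_key_views(masks, n_top, n_rand, rng=None):
    """Assemble the key view set G: keep the views with the highest target coverage
    ratio rho_i = ||M_i||_1 / (H*W), add a few randomly sampled extra views, and
    return a global ordering that ranks G first (descending rho), followed by the
    remaining views, also in descending rho."""
    if rng is None:
        rng = np.random.default_rng(0)
    rho = np.array([m.sum() / m.size for m in masks])
    order = np.argsort(-rho)                                   # all views, descending rho
    key = [int(i) for i in order[:n_top]]                      # highest-coverage views
    rest = [int(i) for i in order[n_top:]]
    key += [int(i) for i in rng.choice(rest, size=n_rand, replace=False)]
    key.sort(key=lambda i: -rho[i])                            # G, descending rho
    key_set = set(key)
    non_key = [int(i) for i in order if int(i) not in key_set] # already descending rho
    return key + non_key, rho                                  # global ordering, coverage ratios
```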
For clarity, Figure 2b shows the complete PIP workflow. Beginning with the first view $I_{\omega(1)}$ in $G$, we perform editing using the method described in Section 2.2 to obtain $I_e^{\omega(1),1}$ and then propagate these edits to all other views, $I_e^{\omega(j) \leftarrow \omega(1), 1}$ $(j = 1, 2, \dots, m)$, via the mapping method described in Section 2.3.1. By the definition of $G$, the initial view $I_{\omega(1)}$ possesses the highest target coverage ratio $\rho$, maximizing scene continuity of the edited results $I_e^{\omega(1),1}$ across viewpoints. To mitigate cumulative projection errors across propagation stages, we design the Adaptive Blending module shown in Figure 3.
Taking $I_e^{\omega(i),\ell}$ as an example, given the mapped view $I_e^{\omega(j) \leftarrow \omega(i), \ell} \in \mathbb{R}^{3 \times H \times W}$ and the semantic mask $M_{\omega(j)} \in [0,1]^{1 \times H \times W}$, we construct a 4-channel tensor.
For notational clarity and to improve readability, we introduce the alias $\varpi := (\omega(j) \leftarrow \omega(i), \ell)$, and accordingly denote its previous-stage counterpart as $\varpi' := (\omega(j) \leftarrow \omega(i-1), \ell-1)$. We require $\ell > 1$ in this definition; when $\ell = 1$, we set $[I_e^{\varpi'}]_{\mathrm{AB}} = I_{\omega(j)}$. Using this convention, we construct the 4-channel tensor:

$$X_1^{\varpi} = \mathrm{Concat}\!\left( I_e^{\varpi}, M_{\omega(j)} \right) \in \mathbb{R}^{4 \times H \times W} \tag{10}$$
Incorporating normalized coordinate grids $\phi_x^{\varpi}(W)$ and $\phi_y^{\varpi}(H)$ (linearly mapped to $[-1, 1]$) enhances spatial awareness:

$$X_2^{\varpi} = \mathrm{Concat}\!\left( X_1^{\varpi}, \phi_x^{\varpi}(W), \phi_y^{\varpi}(H) \right) \in \mathbb{R}^{6 \times H \times W} \tag{11}$$

Subsequently, a $3 \times 3$ convolution projects the 6-channel features to a 16-channel space:

$$F_1^{\varpi} = \mathrm{ReLU}\!\left( \mathrm{Conv}_{3 \times 3}^{6 \to 16}\!\left( X_2^{\varpi} \right) \right) \in \mathbb{R}^{16 \times H \times W} \tag{12}$$

Then, it employs Squeeze-and-Excitation-style attention. Specifically, the squeeze phase uses a $3 \times 3$ convolution for mixed spatial–channel compression to reduce computational redundancy, with ReLU activation enhancing nonlinear expressiveness. The excitation phase employs a $1 \times 1$ convolution to reconstruct channel importance, adjusting the weight distribution across channels. Finally, a sigmoid function generates spatial–channel adaptive attention weights $\mu^{\varpi} \in [0,1]$ (weights closer to 1 indicate regions requiring enhancement, while those near 0 need suppression) for dynamic feature response adjustment:

$$\mu^{\varpi} = \sigma\!\left( \mathrm{Conv}_{1 \times 1}^{8 \to 16}\!\left( \mathrm{ReLU}\!\left( \mathrm{Conv}_{3 \times 3}^{16 \to 8}\!\left( F_1^{\varpi} \right) \right) \right) \right) \in [0,1]^{16 \times H \times W} \tag{13}$$

$$F_2^{\varpi} = F_1^{\varpi} \odot \mu^{\varpi} \in \mathbb{R}^{16 \times H \times W} \tag{14}$$

where $\sigma$ denotes the sigmoid activation function. Through the element-wise multiplication in Equation (14), the network can adaptively enhance responses in spatially consistent regions. Furthermore, a $3 \times 3$ convolution followed by sigmoid activation generates the blending weights $\beta^{\varpi}$:

$$\beta^{\varpi} = \sigma\!\left( \mathrm{Conv}_{3 \times 3}^{16 \to 1}\!\left( F_2^{\varpi} \right) \right) \in [0,1]^{1 \times H \times W} \tag{15}$$
These weights $\beta^{\varpi}$ are used to obtain the adaptively blended version of $I_e^{\varpi}$, denoted as $[I_e^{\varpi}]_{\mathrm{AB}}$:

$$[I_e^{\varpi}]_{\mathrm{AB}} = \left( \beta^{\varpi} \odot I_e^{\varpi} + \left( 1 - \beta^{\varpi} \right) \odot [I_e^{\varpi'}]_{\mathrm{AB}} \right) \odot \Psi\!\left( M_{\omega(j)} \right) + [I_e^{\varpi'}]_{\mathrm{AB}} \odot \Psi\!\left( 1 - M_{\omega(j)} \right) \tag{16}$$
Following this approach, all views in G undergo n-stage processing through the “edit–map–blend” pipeline, progressively propagating edits to other views in the scene. This “adaptive blending” effectively balances editing effects with scene structure details.
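A compact PyTorch sketch of the Adaptive Blending module and the blending rule of Equation (16) is given below. The layer hyperparameters follow Equations (10)–(15), while the class and function names, and the use of a single softened mask in place of the separate $\Psi(M)$ and $\Psi(1-M)$ terms, are simplifying assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class AdaptiveBlending(nn.Module):
    """Sketch of Adaptive Blending (Eqs. (10)-(15)): a 6-channel input (mapped RGB,
    mask, normalized x/y coordinate grids) is lifted to 16 channels, reweighted by
    SE-style spatial-channel attention, and reduced to per-pixel weights beta."""
    def __init__(self):
        super().__init__()
        self.lift = nn.Sequential(nn.Conv2d(6, 16, 3, padding=1), nn.ReLU())      # Eq. (12)
        self.attn = nn.Sequential(nn.Conv2d(16, 8, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(8, 16, 1), nn.Sigmoid())              # Eq. (13)
        self.head = nn.Sequential(nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())   # Eq. (15)

    def forward(self, mapped_rgb, mask):
        # mapped_rgb: (B, 3, H, W); mask: (B, 1, H, W)
        b, _, h, w = mapped_rgb.shape
        ys = torch.linspace(-1, 1, h, device=mapped_rgb.device)
        xs = torch.linspace(-1, 1, w, device=mapped_rgb.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack([gx, gy]).expand(b, 2, h, w)              # coordinate channels
        x = torch.cat([mapped_rgb, mask, grid], dim=1)               # Eqs. (10)-(11)
        f1 = self.lift(x)
        f2 = f1 * self.attn(f1)                                      # Eq. (14)
        return self.head(f2)                                         # beta in [0, 1]

def adaptive_blend(beta, edit_cur, blend_prev, soft_mask):
    """Eq. (16), simplified: inside the softened target mask, mix the current
    propagated edit with the previous-stage blended result; outside, keep the
    previous-stage result unchanged."""
    inner = beta * edit_cur + (1 - beta) * blend_prev
    return inner * soft_mask + blend_prev * (1 - soft_mask)
```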

3. Experiments and Results

3.1. Experimental Setup

All experiments were performed on a workstation equipped with an NVIDIA GeForce RTX 4070 GPU (12 GB VRAM), an Intel Core i7-14700KF CPU, and 32 GB of DDR5 RAM. The software environment consisted of CUDA 11.8, Python 3.8.20, and PyTorch 2.0.0+cu118. We used Nerfstudio [53] to process real-world scene datasets. Nerfstudio is an advanced NeRF framework that improves training efficiency and rendering quality through its optimized pipeline and modular architecture. We used the Nerfacto model from Nerfstudio as the base NeRF renderer. The model was trained using the Adam optimizer with an initial learning rate set to 0.01. In our experiments, training was conducted for 50,000 iterations to achieve high-quality reconstructions of real-world scenes. The reconstruction is a one-time preprocessing step; the resulting model checkpoint is saved and reused for all subsequent edits on the same dataset, eliminating the need for repeated reconstruction and substantially reducing overall computation time.
The dataset used in this section was captured by a DJI Mini 3 drone flying at an altitude of 120 m over a university campus (3840 × 2160 resolution). The UAV followed a predefined circular (orbit) trajectory while maintaining a mild nadir angle (approximately 30°), which is critical for ensuring sufficient pairwise scene overlap between images. Such high overlap enables reliable projection-based propagation of single-view edits, thereby enhancing multi-view consistency.
In terms of performance, editing a set of 30 images at 1280 × 720 resolution takes approximately 45 min and reaches a sustained peak GPU memory usage of about 11.6 GB—very close to our 12 GB VRAM limit. By contrast, downsampling the images to 800 × 450 reduces the average editing time to about 8 min, with a peak GPU memory usage of around 5.8 GB. Considering both stability and turnaround time, we downsampled all images to 800 × 450 for our experiments. With this resolution, NeRF reconstruction and editing can be completed within one hour for datasets ranging from 30 to 200 images. We employed COLMAP [54] for camera pose estimation and InstructPix2Pix [28] as our diffusion baseline, using a fixed text guidance scale of 7.5 and an image guidance scale of either 1.0 or 1.2 for our method in all experiments. During the image editing diffusion process, the number of denoising steps was set to 10; the lower bound for diffusion timestep sampling was set to 0.5, and the upper bound was set to 0.98.

3.2. Evaluation Metrics

3.2.1. Single-View Editing Evaluation Metrics

The CLIP Text-Image Direction Similarity (CLIP-TIDS) metric quantifies the effectiveness of single-view editing by measuring the semantic alignment between text prompts and edited images in the CLIP embedding space. A higher similarity indicates a stronger semantic consistency between the text prompt and the image. Formally, given a text description $T$ and its corresponding edited image $I$, we extract the respective textual and visual features $\phi(T)$ and $\psi(I)$ using the CLIP model, and compute their cosine similarity as follows:

$$\mathrm{TIDS}(T, I) = \frac{\phi(T) \cdot \psi(I)}{\|\phi(T)\| \, \|\psi(I)\|}$$
The CLIP Direction Consistency (CLIP-DC) metric quantifies the directional alignment between the textual and visual changes induced by an editing prompt in the multimodal embedding space. Higher directional consistency implies that the edited image better retains the original visual characteristics while responding to the text prompt. Formally, given the original image $I_1$, edited image $I_2$, original description $T_1$, and editing prompt $T_2$, we compute the textual difference vector $\Delta T = \phi(T_2) - \phi(T_1)$ and the image difference vector $\Delta I = \psi(I_2) - \psi(I_1)$, with their directional consistency defined as

$$\mathrm{DC}(\Delta T, \Delta I) = \frac{\Delta T \cdot \Delta I}{\|\Delta T\| \, \|\Delta I\|}$$
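Both CLIP-based metrics reduce to cosine similarities over CLIP embeddings. A minimal sketch using the Hugging Face CLIP interface is shown below; the specific checkpoint is an assumption, since the CLIP variant used in the experiments is not stated.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Checkpoint choice is an assumption for illustration.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def _embed(texts=None, images=None):
    """Return CLIP text or image features for a list of texts or PIL images."""
    with torch.no_grad():
        if texts is not None:
            inp = processor(text=texts, return_tensors="pt", padding=True)
            return model.get_text_features(**inp)
        inp = processor(images=images, return_tensors="pt")
        return model.get_image_features(**inp)

def clip_tids(prompt, edited_img):
    """CLIP-TIDS: cosine similarity between the text prompt and the edited image."""
    t, v = _embed(texts=[prompt]), _embed(images=[edited_img])
    return torch.cosine_similarity(t, v).item()

def clip_dc(src_caption, edit_prompt, src_img, edited_img):
    """CLIP-DC: cosine similarity between text-change and image-change directions."""
    dt = _embed(texts=[edit_prompt]) - _embed(texts=[src_caption])
    di = _embed(images=[edited_img]) - _embed(images=[src_img])
    return torch.cosine_similarity(dt, di).item()
```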

3.2.2. Image Quality Assessment Metrics

Since edited images undergo significant changes compared to originals, we employ the no-reference image quality assessment (NR-IQA) method BRISQUE (Blind/Referenceless Image Spatial Quality Evaluator) to independently evaluate edited image quality. BRISQUE assesses image quality by computing the deviation between an image’s statistical features and those of natural images. Lower scores indicate better image quality, representing closer alignment with natural image distributions.

3.2.3. Comprehensive Evaluation Metrics

While CLIP-TIDS captures the semantic alignment between the edited image and the given text prompt, it does not account for whether the original image's structural integrity is preserved. Consequently, even if the image is entirely re-rendered, a high CLIP-TIDS score may still be assigned as long as the output aligns well with the text description, potentially resulting in outputs that are “semantically correct but structurally malformed”. Similarly, CLIP-DC assesses whether a meaningful semantic shift has occurred in the direction dictated by the text instruction, specifically whether the edited image reflects the intended transition from the source image toward the target text. However, this metric neglects the visual quality and structural coherence of the generated image, potentially leading to selections that are “directionally consistent but visually degraded”. As a referenceless image quality assessment method, BRISQUE evaluates only the perceptual quality and naturalness of the image, without considering semantic fidelity. Thus, even when the output deviates significantly from the text prompt, retaining low-level visual properties such as structural continuity and texture realism can still yield a favorable (low) BRISQUE score, producing results that are “visually plausible yet semantically inconsistent”.
Given that no single metric can comprehensively characterize the overall editing performance, combining multiple complementary metrics offers a more robust and holistic evaluation across semantic consistency, structural preservation, and perceptual quality. Motivated by the characteristics of diffusion-based editing, we propose a composite scoring function TDB:
$$\mathrm{TDB} = \frac{\mathrm{CLIP\text{-}TIDS} + \mathrm{CLIP\text{-}DC}}{\log\left( \mathrm{BRISQUE} + 1 \right)}$$
where BRISQUE’s logarithmic normalization ensures balanced weighting with semantic metrics (CLIP-TIDS and CLIP-DC). As demonstrated in Figure 2’s Optimal Editing Propagation (OEP) method, TDB-guided selection of optimal edits enables reliable global propagation.
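A small sketch of TDB scoring and the TDB-guided selection of the optimal key view used by OEP is shown below. It assumes per-view CLIP-TIDS, CLIP-DC, and BRISQUE values are precomputed (BRISQUE, for example, via an off-the-shelf implementation such as the piq package); the logarithm base is not specified in the text, so the natural logarithm is used here.

```python
import numpy as np

def tdb(clip_tids, clip_dc, brisque):
    """Composite TDB score: semantic alignment plus directional consistency,
    normalized by the log-scaled BRISQUE score (lower BRISQUE means better
    perceptual quality, so a smaller denominator raises the score)."""
    return (clip_tids + clip_dc) / np.log(brisque + 1.0)   # natural log assumed

def select_optimal_view(scores):
    """OEP: pick the edited key view with the highest TDB as the propagation source.
    `scores` is a list of (clip_tids, clip_dc, brisque) tuples, one per candidate."""
    return int(np.argmax([tdb(*s) for s in scores]))
```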

3.3. Comparative Experiments

To rigorously validate that our method outperforms established approaches such as CustomNeRF [42], Instruct-NeRF2NeRF [48], and VICA-NeRF [49] across multiple dimensions—including semantic consistency, structure preservation, and cross-view consistency—we design experiments that cover a variety of editing types, including local object insertion, attribute replacement, global style transformation, and complex semantic operations. In designing text prompts, with the goal of effectively differentiating each method’s ability to understand and execute complex instructions, we follow two principles: (i) Challenge, select prompts that can reveal the limitations of existing methods; for example, “Ignite the grass” is used to test a model’s ability to translate abstract semantics into plausible visual content; and (ii) Representativeness, adopt editing tasks that are common and practically meaningful in UAV scenes, such as time of day and seasonal conversions (“Turn into night”, “Turn into autumn”), to reflect practicality in typical application scenarios. Based on these considerations, we construct eight comparative experiment groups, and the results are shown in Figure 4. Scene 1’s full editing tasks and Scene 2’s first task are designed for Single-View-Dependent Regions (SVDRs), while Scene 2’s remaining tasks (2–4) address Multi-View-Dependent Regions (MVDRs). Qualitative analysis enables direct comparison of all methods’ semantic consistency and image quality across different editing tasks and viewpoints.
CustomNeRF proposes a local–global iterative editing framework for natural scenes, but exhibits significant limitations when applied to UAV-captured large-scale scenes. In global editing tasks (the last two groups), the generated results completely lose the original image information. For local object editing (the first six groups), despite achieving effective foreground region editing (e.g., group 5), the method fails to maintain geometrically consistent integration with UAV scenes characterized by complex multi-object interactions.
Although Instruct-NeRF2NeRF claims to be applicable to large-scale scene editing, such as temporal or seasonal modifications, its implementation relies on iteratively applying InstructPix2Pix [28] to update the scene via multi-view image editing. This approach has two critical flaws: First, InstructPix2Pix underperforms for large-scale editing. Second, the same text prompt generates diverse outputs, leading to inconsistent multi-view edits. Given the UAV-captured scenes containing multiple targets and complex structures, Instruct-NeRF2NeRF loses substantial original information during iterative editing, ultimately failing to achieve desired edits in both scenes.
Although VICA-NeRF can better preserve scene structure, it remains limited by InstructPix2Pix’s poor performance on UAV multi-object interactions. Specifically, in Scene 1, for the text prompts “Cars added to the grass” and “Ignite the grass”, VICA-NeRF fails to generate meaningful results. For “Add trees to the grassy area”, the results show significant instruction deviation (enlarging the original trees in the scene instead of adding trees in the grassy area) and edit leakage, with suboptimal multi-view consistency; while VICA-NeRF correctly handles “Grassland turns into flower fields” in both scenes, editing leakage still persists, especially in Scene 2. For MVDRs editing tasks like “Turn into night”, VICA-NeRF’s editing results cannot fully cover the scene: only the central region darkens, while the periphery remains unchanged, which clearly does not correspond to reality.
In contrast, UAVEdit-NeRFDiff achieves superior performance and precise control over the editing region through the joint utilization of visual priors and semantic masks. For example, in SVDRs editing tasks, the prompt “Grassland turns into flower fields” generates rich, natural floral details in both scenes, thereby significantly enhancing realism. In MVDRs editing tasks, the method produces visually compelling results with strong cross-view consistency, aligning both with real-world conditions and the provided instructions.
To quantitatively assess performance, we compute TDB, CLIP-TIDS, CLIP-DC, and BRISQUE metrics across three views, with results detailed in Table 1, Table 2, Table 3 and Table 4.
The limitations of the CLIP-TIDS, CLIP-DC, and BRISQUE metrics have been thoroughly discussed in Section 3.2.3. Relying on a single metric is insufficient for reliably evaluating editing performance. Under the CLIP-TIDS metric, Instruct-NeRF2NeRF achieves significantly higher semantic consistency scores than other methods for text instructions including “Leaves turn red” (View 1, 2), “Turn into autumn” (all views), and “Turn into night” (View 1, 3) in Scene 2. However, this method actually fails to properly respond to these instructions while producing low-quality, even meaningless edits. Similarly, when evaluated by the BRISQUE metric, VICA-NeRF scores higher than our method on image quality for the instruction “Add trees to the grassy area” (View 1) in Scene 1, yet these results suffer from edit leakage issues. The proposed TDB metric overcomes the aforementioned limitations by integrating all three evaluation dimensions. In the TDB evaluation, our method demonstrates significant improvements compared with CustomNeRF [42], Instruct-NeRF2NeRF [48], and VICA-NeRF [49]; this result is consistent with qualitative analysis.
To further verify that the observed performance gains are not due to chance, we conducted paired t-tests on the results of the composite evaluation metric TDB. The analysis is based on 24 paired observations per method (8 editing experiments × 3 viewpoints), yielding 23 degrees of freedom. M represents the mean value, SD denotes the standard deviation, $t(23)$ denotes the t statistic, and the p value quantifies the probability that the performance difference is attributable to random variation ($p < 0.0001$ indicates a probability below 0.01%). The percentage improvement is computed as

$$\frac{M_{\mathrm{ours}} - M_{\mathrm{comparison}}}{M_{\mathrm{comparison}}} \times 100\%$$

The statistical results indicate that our method achieves statistically significant improvements on TDB over existing methods. Specifically, our method (M = 0.5102, SD = 0.0879) exceeds CustomNeRF (M = 0.1882, SD = 0.0478) by 171.1%, $t(23) = 14.0359$, $p < 0.0001$; Instruct-NeRF2NeRF (M = 0.1866, SD = 0.0613) by 173.4%, $t(23) = 15.3206$, $p < 0.0001$; and VICA-NeRF (M = 0.3293, SD = 0.0947) by 54.9%, $t(23) = 6.0487$, $p < 0.0001$.
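The reported statistics can be reproduced from the per-view TDB values with a standard paired t-test; a minimal SciPy sketch (with an illustrative function name) is given below, taking as input the 24 matched observations per method.

```python
import numpy as np
from scipy import stats

def compare_tdb(ours, baseline):
    """Paired t-test over matched TDB observations (here, 8 edits x 3 views = 24),
    plus the relative improvement of the mean, as reported in Section 3.3."""
    ours, baseline = np.asarray(ours), np.asarray(baseline)
    t_stat, p_val = stats.ttest_rel(ours, baseline)
    improvement = (ours.mean() - baseline.mean()) / baseline.mean() * 100.0
    return t_stat, p_val, improvement
```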

3.4. Ablation Study

In this section, we verify the effectiveness of each key component in our method, including Restoring Non-Target Regions (RNTR), Visual Prior (VP), Diffusion Refinement (DR), and Adaptive Blending (AB). The experimental results are shown in Figure 5, where (a) and (c) correspond to Single-View-Dependent Regions (SVDRs) editing tasks, while (b) and (d) correspond to Multi-View-Dependent Regions (MVDRs) editing tasks. In addition, we compute CLIP-TIDS, CLIP-DC, BRISQUE, and TDB on images from three viewpoints to quantitatively assess each component’s contribution to improving overall editing quality, as shown in Table 5.
In Figure 5a, “w/o RNTR” denotes the scenario where no restoration is applied to the non-target regions either after editing or after projection mapping. As a result, editing leakage is evident in the non-target region (e.g., the roof). Moreover, when there is a large angular disparity between the reference and target views, the quality of the projected image deteriorates significantly, leading to substantial degradation in visual quality. This is corroborated by the quantitative results in Table 5: “w/ RNTR” performs better on CLIP-DC, BRISQUE, and TDB, providing evidence that the RNTR component plays an important role in improving overall editing quality. By contrast, because CLIP-TIDS lacks the ability to distinguish cases that are “semantically correct but structurally distorted”, “w/o RNTR” can obtain a higher score even when edit leakage occurs, which in turn highlights the limitations of CLIP-TIDS.
In Figure 5b, “w/o VP” denotes that no visual priors are introduced during single view editing. Under this setting, the outputs respond weakly to the text prompts, and a slightly better BRISQUE score is observed. By contrast, introducing VP (“w/ VP”) yields edits with improved semantic accuracy and better performance on CLIP-TIDS, CLIP-DC, and TDB.
In Figure 5c, “w/o DR” denotes using the raw projection-mapped results as the final output. Significant stretching artifacts are observed in the flowerbed region. In contrast, introducing the DR module (“w/ DR”) yields more natural generation with finer details and shows better performance on CLIP-TIDS, CLIP-DC, BRISQUE, and TDB.
In Figure 5d, “w/o AB” denotes that blending between key view projections is skipped. When the reference and target views have large angular disparities, projection-induced distortions accumulate across the PIP stages. As a result, subsequent edits yield semantically aligned but structurally inconsistent outputs. It is clearly observable from the figure that the “w/o AB” result has already lost the original structure and merely follows the text instruction to turn trees into red leaves; consequently, it achieves higher scores on CLIP-TIDS and CLIP-DC. This is consistent with our discussion of the limitations of these two metrics in Section 3.2.3 and further underscores the necessity and effectiveness of the proposed TDB metric for a comprehensive evaluation of editing performance.

4. Discussion

Diverse Editing Capabilities: To further assess the generalization capability of UAVEdit-NeRFDiff, we conducted experiments on the open-source OMMO dataset [55], which consists of complex scenes captured by unmanned aerial vehicles (UAVs). We selected four representative scenes (IDs 05, 10, 14, and 15) for evaluation. As shown in Figure 6, the experiments encompass a wide range of realistic scenarios, including both regular environmental transformations (e.g., seasonal transitions, ecological restoration, and day–night variations) and disaster-related simulations (e.g., wildfires, droughts, and sandstorms). The results demonstrate that UAVEdit-NeRFDiff achieves strong performance across semantic fidelity, multi-view consistency, and perceptual quality, highlighting its effectiveness in large-scale scene editing tasks.
Limitations: Despite promising results, we observed several limitations in our experiments. First, inserting novel objects whose depth structures differ substantially from the original scene at the insertion site (e.g., high-rise buildings) remains challenging. As the angular disparity between the reference and target views increases, the inserted object may exhibit pronounced projection-induced artifacts—such as tilting, floating, or deformation—that are difficult to fully correct through subsequent refinement. Second, the accuracy of the semantic masks is a critical factor affecting our method’s performance. When the target region in the mask is incomplete, the visual prior cannot be effectively applied to the intended areas, and relying solely on text prompts often fails to achieve the desired edits. Furthermore, misclassification at complex boundaries (e.g., between vegetation and roads) can lead to incorrect restoration of target regions during mask-based non-target region recovery. It is important to emphasize that our framework treats the semantic mask as a generic, replaceable input. When applying the method to new datasets, users can adopt more advanced segmentation models or high-quality manual annotations to improve the editing results. In future work, we will focus on developing more robust cross-view propagation strategies and reducing reliance on segmentation accuracy.
Practical Implications of This Work: For practical applications like environmental simulation, disaster monitoring, and urban planning, which are typical remote sensing tasks, 3D scene data with specific semantic features is often required to support model training, effect evaluation, or decision-making. However, such scene data is often difficult to acquire in real environments, especially for disaster scenarios, which are highly sporadic, uncontrollable, and difficult to capture directly due to safety constraints. Our UAVEdit-NeRFDiff provides a controllable, natural editing solution that maintains multi-view consistency while generating photorealistic semantic content in specified target regions. This approach facilitates the construction of high-quality, reliable remote sensing scene data, offering robust data support for multi-source data fusion, disaster prediction, and decision support.

5. Conclusions

In this work, we present UAVEdit-NeRFDiff, a novel editing framework for complex multi-object interactions in UAV-captured large-scale scenes. To address the limitations of existing text-guided diffusion methods in achieving strong semantic consistency and cross-view coherence for UAV-captured images, our framework integrates visual priors with semantic masks to enable semantically consistent editing of key views. Through the proposed Optimal Editing Propagation (OEP) and Progressive Inheritance Propagation (PIP) methods, we achieve geometrically consistent propagation for Single-View-Dependent Regions (SVDRs) and Multi-View-Dependent Regions (MVDRs), significantly improving the overall semantic consistency, multi-view coherence, and visual quality of the editing results.
Beyond these technical contributions, our method provides an efficient pathway for generating low-cost, diverse training samples for low-altitude remote sensing research. It holds substantial practical value in facilitating model training, performance evaluation, and decision support for applications such as disaster monitoring and object detection. However, we acknowledge certain limitations: the current framework faces challenges when inserting objects with significant depth discrepancies relative to the original scene, and the stability of editing results remains constrained by the inherent performance of the underlying diffusion models. In future work, we plan to explore more robust cross-view propagation strategies and further extend the framework to handle more complex multi-object interactions and highly dynamic scenarios.

Author Contributions

Conceptualization, C.Y.; methodology, C.Y. and X.C.; investigation, C.Y., Z.C., S.W. and W.D.; validation, C.Y. and Z.C.; software, C.Y.; visualization, C.Y. and Z.S.; writing—original draft preparation, C.Y.; writing—review and editing, C.Y., Z.S., Z.C., X.C., S.W. and W.D.; supervision, X.C.; funding acquisition, X.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (Grant No. 62061002) and the Guangxi Science and Technology Program (Grant No. AB23075105).

Data Availability Statement

The OMMO dataset used in this study is publicly available at https://github.com/luchongshan/OMMO?tab=readme-ov-file (accessed on 15 November 2024). The dataset created in this work is not publicly available but can be obtained from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gupta, S.K.; Shukla, D.P. Application of drone for landslide mapping, dimension estimation and its 3D reconstruction. J. Indian Soc. Remote Sens. 2018, 46, 903–914. [Google Scholar] [CrossRef]
  2. Eckert, G.; Cassidy, S.; Tian, N.; Shabana, M.E. Using Aerial Drone Photography to Construct 3D Models of Real World Objects in an Effort to Decrease Response Time and Repair Costs Following Natural Disasters. In Advances in Computer Vision, Proceedings of the 2019 Computer Vision Conference (CVC), Las Vegas, NV, USA, 25–26 April 2019; Springer: Cham, Switzerland, 2020; pp. 317–325. [Google Scholar] [CrossRef]
  3. Pattanaik, R.K.; Singh, Y.K. Study on characteristics and impact of Kalikhola landslide, Manipur, NE India, using UAV photogrammetry. Nat. Hazards 2024, 120, 6417–6435. [Google Scholar] [CrossRef]
  4. Chen, Y.; Liu, X.; Zhu, B.; Zhu, D.; Zuo, X.; Li, Q. UAV Image-Based 3D Reconstruction Technology in Landslide Disasters: A Review. Remote Sens. 2025, 17, 3117. [Google Scholar] [CrossRef]
  5. Le, N.; Karimi, E.; Rahnemoonfar, M. 3DAeroRelief: The first 3D Benchmark UAV Dataset for Post-Disaster Assessment. arXiv 2025, arXiv:2509.11097. [Google Scholar] [CrossRef]
  6. Xing, Y.; Yang, S.; Fahy, C.; Harwood, T.; Shell, J. Capturing the Past, Shaping the Future: A Scoping Review of Photogrammetry in Cultural Building Heritage. Electronics 2025, 14, 3666. [Google Scholar] [CrossRef]
  7. Themistocleous, K. The Use of UAVs for Cultural Heritage and Archaeology. In Remote Sensing for Archaeology and Cultural Landscapes: Best Practices and Perspectives Across Europe and the Middle East; Springer: Cham, Switzerland, 2020; pp. 241–269. [Google Scholar] [CrossRef]
  8. Xu, L.; Xu, Y.; Rao, Z.; Gao, W. Real-Time 3D Reconstruction for the Conservation of the Great Wall’s Cultural Heritage Using Depth Cameras. Sustainability 2024, 16, 7024. [Google Scholar] [CrossRef]
  9. Yan, Y.; Du, Q. From digital imagination to real-world exploration: A study on the influence factors of VR-based reconstruction of historical districts on tourists’ travel intention in the field. Virtual Real. 2025, 29, 85. [Google Scholar] [CrossRef]
  10. Kokosza, A.; Wrede, H.; Esparza, D.G.; Makowski, M.; Liu, D.; Michels, D.L.; Pirk, S.; Palubicki, W. Scintilla: Simulating Combustible Vegetation for Wildfires. ACM Trans. Graph. 2024, 43, 70. [Google Scholar] [CrossRef]
  11. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In Proceedings of the European Conference on Computer Vision (ECCV), Virtual Conference, 23–28 August 2020; pp. 405–421. [Google Scholar] [CrossRef]
  12. Barron, J.T.; Mildenhall, B.; Tancik, M.; Hedman, P.; Martin-Brualla, R.; Srinivasan, P.P. Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 5835–5844. [Google Scholar] [CrossRef]
  13. Müller, T.; Evans, A.; Schied, C.; Keller, A. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. 2022, 41, 102. [Google Scholar] [CrossRef]
  14. Wang, P.; Liu, L.; Liu, Y.; Theobalt, C.; Komura, T.; Wang, W. NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction. In Proceedings of the Advances in Neural Information Processing Systems, Virtual Conference, 6–14 December 2021; Volume 34, pp. 27171–27183. [Google Scholar]
  15. Wang, Y.; Han, Q.; Habermann, M.; Daniilidis, K.; Theobalt, C.; Liu, L. NeuS2: Fast Learning of Neural Implicit Surfaces for Multi-view Reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 3272–3283. [Google Scholar] [CrossRef]
  16. Niemeyer, M.; Barron, J.T.; Mildenhall, B.; Sajjadi, M.S.M.; Geiger, A.; Radwan, N. RegNeRF: Regularizing Neural Radiance Fields for View Synthesis from Sparse Inputs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 5470–5480. [Google Scholar] [CrossRef]
  17. Tancik, M.; Casser, V.; Yan, X.; Pradhan, S.; Mildenhall, B.P.; Srinivasan, P.; Barron, J.T.; Kretzschmar, H. Block-NeRF: Scalable Large Scene Neural View Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8238–8248. [Google Scholar] [CrossRef]
  18. Turki, H.; Ramanan, D.; Satyanarayanan, M. Mega-NeRF: Scalable Construction of Large-Scale NeRFs for Virtual Fly- Throughs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 12912–12921. [Google Scholar] [CrossRef]
  19. Xiangli, Y.; Xu, L.; Pan, X.; Zhao, N.; Rao, A.; Theobalt, C.; Dai, B.; Lin, D. BungeeNeRF: Progressive Neural Radiance Field for Extreme Multi-scale Scene Rendering. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Volume 13692, pp. 106–122. [Google Scholar] [CrossRef]
  20. Xu, L.; Xiangli, Y.; Peng, S.; Pan, X.; Zhao, N.; Theobalt, C.; Dai, B.; Lin, D. Grid-guided Neural Radiance Fields for Large Urban Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 8296–8306. [Google Scholar] [CrossRef]
  21. Zhang, G.; Xue, C.; Zhang, R. SuperNeRF: High-Precision 3-D Reconstruction for Large-Scale Scenes. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar] [CrossRef]
  22. Zhang, Y.; Gao, Z.; Sun, W.; Lu, Y.; Zhu, Y. MD-NeRF: Enhancing Large-Scale Scene Rendering and Synthesis with Hybrid Point Sampling and Adaptive Scene Decomposition. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  23. Zhang, Y.; Chen, G.; Cui, S. Efficient large-scale scene representation with a hybrid of high-resolution grid and plane features. Pattern Recognit. 2025, 158, 111001. [Google Scholar] [CrossRef]
  24. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the Advances in Neural Information Processing Systems, Virtual Conference, 6–12 December 2020; Volume 33, pp. 6840–6851. [Google Scholar]
  25. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 10674–10685. [Google Scholar] [CrossRef]
  26. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), Virtual Conference, 18–24 July 2021; Volume 139, pp. 8748–8763. [Google Scholar]
  27. Avrahami, O.; Lischinski, D.; Fried, O. Blended Diffusion for Text-driven Editing of Natural Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 18187–18197. [Google Scholar] [CrossRef]
  28. Brooks, T.; Holynski, A.; Efros, A.A. InstructPix2Pix: Learning to Follow Image Editing Instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 18392–18402. [Google Scholar] [CrossRef]
  29. Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 22500–22510. [Google Scholar] [CrossRef]
  30. Hertz, A.; Mokady, R.; Tenenbaum, J.; Aberman, K.; Pritch, Y.; Cohen-Or, D. Prompt-to-Prompt Image Editing with Cross-Attention Control. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  31. Chen, R.; Chen, Y.; Jiao, N.; Jia, K. Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 22189–22199. [Google Scholar] [CrossRef]
  32. Lin, C.; Gao, J.; Tang, L.; Takikawa, T.; Zeng, X.; Huang, X.; Kreis, K.; Fidler, S.; Liu, M.; Lin, T. Magic3D: High-Resolution Text-to-3D Content Creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 300–309. [Google Scholar] [CrossRef]
  33. Poole, B.; Jain, A.; Barron, J.T.; Mildenhall, B. DreamFusion: Text-to-3D using 2D Diffusion. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  34. Raj, A.; Kaza, S.; Poole, B.; Niemeyer, M.; Ruiz, N.; Mildenhall, B.; Zada, S.; Aberman, K.; Rubinstein, M.; Barron, J.T.; et al. DreamBooth3D: Subject-Driven Text-to-3D Generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 2349–2359. [Google Scholar] [CrossRef]
35. Zhang, J.; Li, X.; Wan, Z.; Wang, C.; Liao, J. Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields. IEEE Trans. Vis. Comput. Graph. 2024, 30, 7749–7762.
36. Wu, Z.; Li, Y.; Yan, H.; Shang, T.; Sun, W.; Wang, S.; Cui, R.; Liu, W.; Sato, H.; Li, H.; et al. BlockFusion: Expandable 3D Scene Generation using Latent Tri-plane Extrapolation. ACM Trans. Graph. 2024, 43, 1–17.
37. Yang, X.; Man, Y.; Chen, J.; Wang, Y. SceneCraft: Layout-Guided 3D Scene Generation. Adv. Neural Inf. Process. Syst. 2024, 37, 82060–82084.
38. Bao, C.; Zhang, Y.; Yang, B.; Fan, T.; Yang, Z.; Bao, H.; Zhang, G.; Cui, Z. SINE: Semantic-driven Image-based NeRF Editing with Prior-guided Editing Field. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 20919–20929.
39. Zhuang, J.; Wang, C.; Lin, L.; Liu, L.; Li, G. DreamEditor: Text-Driven 3D Scene Editing with Neural Fields. In Proceedings of the SIGGRAPH Asia 2023 Conference Papers, Sydney, Australia, 12–15 December 2023; pp. 1–10.
40. Chiang, P.; Tsai, M.; Tseng, H.; Lai, W.; Chiu, W. Stylizing 3D Scene via Implicit Representation and HyperNetwork. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 215–224.
41. Rojas, S.; Philip, J.; Zhang, K.; Bi, S.; Luan, F.; Ghanem, B.; Sunkavalli, K. DATENeRF: Depth-Aware Text-Based Editing of NeRFs. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; Volume 15069, pp. 267–284.
42. He, R.; Huang, S.; Nie, X.; Hui, T.; Liu, L.; Dai, J.; Han, J.; Li, G.; Liu, S. Customize your NeRF: Adaptive Source Driven 3D Scene Editing via Local-Global Iterative Training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 6966–6975.
43. Wang, C.; Chai, M.; He, M.; Chen, D.; Liao, J. CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 3825–3834.
44. Huang, Y.H.; He, Y.; Yuan, Y.J.; Lai, Y.K.; Gao, L. StylizedNeRF: Consistent 3D Scene Stylization as Stylized NeRF via 2D-3D Mutual Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 18321–18331.
45. Wang, C.; Jiang, R.; Chai, M.; He, M.; Chen, D.; Liao, J. NeRF-Art: Text-Driven Neural Radiance Fields Stylization. IEEE Trans. Vis. Comput. Graph. 2024, 30, 4983–4996.
46. Mirzaei, A.; Aumentado-Armstrong, T.; Derpanis, K.G.; Kelly, J.; Brubaker, M.A.; Gilitschenski, I.; Levinshtein, A. SPIn-NeRF: Multiview Segmentation and Perceptual Inpainting with Neural Radiance Fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 20669–20679.
47. Mirzaei, A.; Aumentado-Armstrong, T.; Brubaker, M.A.; Kelly, J.; Levinshtein, A.; Derpanis, K.G.; Gilitschenski, I. Watch Your Steps: Local Image and Scene Editing by Text Instructions. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; Volume 15096, pp. 111–129.
48. Haque, A.; Tancik, M.; Efros, A.A.; Holynski, A.; Kanazawa, A. Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 19683–19693.
49. Dong, J.; Wang, Y.X. ViCA-NeRF: View-Consistency-Aware 3D Editing of Neural Radiance Fields. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Volume 36, pp. 61466–61477.
50. Dihlmann, J.; Engelhardt, A.; Lensch, H.P.A. SIGNeRF: Scene Integrated Generation for Neural Radiance Fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 6679–6688.
51. Wang, Y.; Fang, S.; Zhang, H.; Li, H.; Zhang, Z.; Zeng, X.; Ding, W. UAV-ENeRF: Text-Driven UAV Scene Editing with Neural Radiance Fields. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14.
52. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Proceedings of the Advances in Neural Information Processing Systems, Virtual Conference, 6–14 December 2021; Volume 34, pp. 12077–12090.
53. Tancik, M.; Weber, E.; Ng, E.; Li, R.; Yi, B.; Wang, T.; Kristoffersen, A.; Austin, J.; Salahi, K.; Ahuja, A.; et al. Nerfstudio: A Modular Framework for Neural Radiance Field Development. In Proceedings of the ACM SIGGRAPH 2023 Conference Proceedings, Los Angeles, CA, USA, 6–10 August 2023; pp. 1–12.
54. Schönberger, J.L.; Frahm, J.M. Structure-from-Motion Revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4104–4113.
55. Lu, C.; Yin, F.; Chen, X.; Liu, W.; Chen, T.; Yu, G.; Fan, J. A Large-Scale Outdoor Multi-modal Dataset and Benchmark for Novel View Synthesis and Implicit Scene Reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 7523–7533.
Figure 1. Main workflow of UAVEdit-NeRFDiff.
Figure 2. The implementation processes of Optimal Editing Propagation and Progressive Inheritance Propagation.
Figure 3. Architecture of the Adaptive Blending module.
Figure 4. Qualitative comparison with CustomNeRF, Instruct-NeRF2NeRF, and ViCA-NeRF.
Figure 5. Ablation studies of each component in our method.
Figure 6. Editing experiments of UAVEdit-NeRFDiff on the open-source OMMO dataset: (a) OMMO-05 scene, (b) OMMO-10 scene, (c) OMMO-14 scene, and (d) OMMO-15 scene.
Table 1. Quantitative evaluation using the proposed composite scoring function TDB ↑ metric.
Each view column lists, in order: Custom [42] / N2N [48] / ViCA [49] / Ours.

| Scene | Editing task | View 1 (Left) | View 2 (Upper Right) | View 3 (Lower Right) |
|---|---|---|---|---|
| Scene 1 | Grassland turns into flower fields | 0.2064 / 0.1852 / 0.4015 / 0.5603 | 0.2064 / 0.1750 / 0.4368 / 0.5256 | 0.1998 / 0.1669 / 0.3211 / 0.3597 |
| Scene 1 | Ignite the grass | 0.1899 / 0.1398 / 0.1385 / 0.6148 | 0.2288 / 0.1564 / 0.1760 / 0.5799 | 0.2032 / 0.1625 / 0.1585 / 0.5704 |
| Scene 1 | Cars added to the grass | 0.0900 / 0.2756 / 0.2208 / 0.5800 | 0.1332 / 0.1468 / 0.2008 / 0.5359 | 0.1033 / 0.2121 / 0.2988 / 0.7208 |
| Scene 1 | Add trees to the grassy area | 0.1540 / 0.1357 / 0.3446 / 0.3772 | 0.1787 / 0.0716 / 0.3056 / 0.5440 | 0.2043 / 0.1444 / 0.3194 / 0.4520 |
| Scene 2 | Grassland turns into flower fields | 0.2155 / 0.1226 / 0.2943 / 0.4503 | 0.2604 / 0.1504 / 0.3845 / 0.4545 | 0.2443 / 0.1379 / 0.2809 / 0.5062 |
| Scene 2 | Leaves turn red | 0.2429 / 0.2587 / 0.3953 / 0.4890 | 0.2712 / 0.2420 / 0.3431 / 0.4313 | 0.2054 / 0.1812 / 0.4140 / 0.4578 |
| Scene 2 | Turn into autumn | 0.2092 / 0.2436 / 0.3805 / 0.4563 | 0.1682 / 0.3364 / 0.3637 / 0.3993 | 0.1918 / 0.2983 / 0.4075 / 0.6712 |
| Scene 2 | Turn into night | 0.1497 / 0.1741 / 0.4870 / 0.5490 | 0.1347 / 0.1799 / 0.3806 / 0.4839 | 0.1256 / 0.1802 / 0.4497 / 0.4761 |
Bold highlights the optimal value in each comparison group.
Table 2. Quantitative evaluation using the CLIP Text–Image Direction Similarity (CLIP-TIDS) ↑ metric.
Each view column lists, in order: Custom [42] / N2N [48] / ViCA [49] / Ours.

| Scene | Editing task | View 1 (Left) | View 2 (Upper Right) | View 3 (Lower Right) |
|---|---|---|---|---|
| Scene 1 | Grassland turns into flower fields | 0.2632 / 0.2781 / 0.2510 / 0.2847 | 0.2393 / 0.2539 / 0.2747 / 0.2939 | 0.2347 / 0.2566 / 0.2542 / 0.2905 |
| Scene 1 | Ignite the grass | 0.2310 / 0.2219 / 0.2385 / 0.2673 | 0.2371 / 0.2357 / 0.2338 / 0.2700 | 0.2322 / 0.2324 / 0.2495 / 0.2773 |
| Scene 1 | Cars added to the grass | 0.2244 / 0.2996 / 0.2471 / 0.3010 | 0.2294 / 0.2588 / 0.2478 / 0.2869 | 0.2301 / 0.2830 / 0.2590 / 0.3020 |
| Scene 1 | Add trees to the grassy area | 0.2494 / 0.2488 / 0.2715 / 0.2795 | 0.2542 / 0.2124 / 0.2700 / 0.2891 | 0.2537 / 0.2720 / 0.2695 / 0.2888 |
| Scene 2 | Grassland turns into flower fields | 0.2418 / 0.2374 / 0.2512 / 0.2764 | 0.2510 / 0.2473 / 0.2472 / 0.2690 | 0.2651 / 0.2363 / 0.2539 / 0.2864 |
| Scene 2 | Leaves turn red | 0.2527 / 0.2993 / 0.2600 / 0.2583 | 0.2546 / 0.2952 / 0.2316 / 0.2517 | 0.2612 / 0.2463 / 0.2793 / 0.2732 |
| Scene 2 | Turn into autumn | 0.2146 / 0.2460 / 0.2404 / 0.2325 | 0.2112 / 0.2703 / 0.2405 / 0.2288 | 0.2202 / 0.2825 / 0.2311 / 0.2405 |
| Scene 2 | Turn into night | 0.2207 / 0.2345 / 0.2159 / 0.2189 | 0.2209 / 0.2371 / 0.2141 / 0.2394 | 0.2162 / 0.2356 / 0.2078 / 0.2170 |
Bold highlights the optimal value in each comparison group.
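To make the CLIP Text–Image Direction Similarity numbers in Table 2 easier to interpret, the sketch below shows one common way to compute such a score with the publicly available OpenAI clip package: the cosine similarity between the CLIP-space change from the original to the edited rendering and the CLIP-space change from the source prompt to the editing prompt. The ViT-B/32 backbone, file names, and prompts are illustrative assumptions rather than details taken from this paper.

```python
# Hedged sketch of a CLIP Text–Image Direction Similarity (CLIP-TIDS) score.
# Assumes the OpenAI `clip` package (pip install git+https://github.com/openai/CLIP),
# PyTorch, and Pillow; paths and prompts below are placeholders.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def encode_image(path: str) -> torch.Tensor:
    """Return a unit-norm CLIP embedding of the image at `path`."""
    img = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        feat = model.encode_image(img)
    return feat / feat.norm(dim=-1, keepdim=True)

def encode_text(prompt: str) -> torch.Tensor:
    """Return a unit-norm CLIP embedding of a text prompt."""
    tokens = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        feat = model.encode_text(tokens)
    return feat / feat.norm(dim=-1, keepdim=True)

def clip_tids(src_img: str, edit_img: str, src_prompt: str, edit_prompt: str) -> float:
    """Cosine similarity between the image-space and text-space edit directions."""
    d_img = encode_image(edit_img) - encode_image(src_img)
    d_txt = encode_text(edit_prompt) - encode_text(src_prompt)
    return torch.nn.functional.cosine_similarity(d_img, d_txt).item()

# Hypothetical usage for one view of the "Ignite the grass" edit:
# score = clip_tids("view1_original.png", "view1_edited.png",
#                   "a photo of grassland", "a photo of burning grassland")
```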
Table 3. Quantitative evaluation using the CLIP Direction Consistency (CLIP-DC) ↑ metric.
Each view column lists, in order: Custom [42] / N2N [48] / ViCA [49] / Ours.

| Scene | Editing task | View 1 (Left) | View 2 (Upper Right) | View 3 (Lower Right) |
|---|---|---|---|---|
| Scene 1 | Grassland turns into flower fields | 0.0567 / 0.0360 / 0.2080 / 0.2153 | 0.0472 / 0.0481 / 0.2111 / 0.2380 | 0.0493 / 0.0336 / 0.1682 / 0.1746 |
| Scene 1 | Ignite the grass | −0.0037 / 0.0533 / −0.0346 / 0.3906 | 0.0249 / 0.0650 / −0.0243 / 0.3176 | 0.0107 / 0.0441 / −0.0530 / 0.3889 |
| Scene 1 | Cars added to the grass | −0.1095 / 0.1301 / −0.0216 / 0.2720 | −0.0506 / 0.0325 / −0.0178 / 0.1982 | −0.0925 / 0.0880 / 0.0481 / 0.2438 |
| Scene 1 | Add trees to the grassy area | −0.0753 / −0.0631 / 0.0749 / 0.1107 | −0.0569 / −0.1008 / 0.0536 / 0.1362 | −0.0172 / −0.0236 / 0.0745 / 0.1257 |
| Scene 2 | Grassland turns into flower fields | 0.0862 / −0.0123 / 0.1169 / 0.2252 | 0.1182 / 0.0204 / 0.2368 / 0.2456 | 0.0994 / 0.0171 / 0.0892 / 0.2495 |
| Scene 2 | Leaves turn red | 0.0801 / 0.1948 / 0.2191 / 0.2278 | 0.1010 / 0.1532 / 0.1991 / 0.2050 | 0.0395 / 0.1013 / 0.2251 / 0.2183 |
| Scene 2 | Turn into autumn | 0.0185 / 0.1393 / 0.2294 / 0.2418 | 0.0246 / 0.1964 / 0.2139 / 0.2209 | 0.0418 / 0.2108 / 0.2322 / 0.3064 |
| Scene 2 | Turn into night | 0.0486 / 0.1356 / 0.2668 / 0.2983 | 0.0193 / 0.1453 / 0.2224 / 0.2427 | 0.0084 / 0.1466 / 0.2040 / 0.1735 |
Bold highlights the optimal value in each comparison group.
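The CLIP Direction Consistency scores in Table 3 quantify how stable the edit remains as the camera moves. The metric itself is defined earlier in the paper; purely as an illustrative approximation, the sketch below averages the cosine similarity of CLIP-space edit directions (edited embedding minus original embedding) over adjacent rendered views along a trajectory, reusing the encode_image helper from the previous sketch. The frame lists and their ordering are hypothetical.

```python
# Hedged sketch of a CLIP Direction Consistency (CLIP-DC) style score.
# Reuses encode_image() from the CLIP-TIDS sketch above; this is an
# illustrative approximation, not this paper's exact formulation.
import torch

def clip_dc(orig_frames: list[str], edit_frames: list[str]) -> float:
    """Average cosine similarity of CLIP-space edit directions over adjacent views."""
    # Per-view edit direction: CLIP(edited view) - CLIP(original view).
    dirs = [encode_image(e) - encode_image(o) for o, e in zip(orig_frames, edit_frames)]
    # Compare each direction with the one from the next view along the camera path.
    sims = [torch.nn.functional.cosine_similarity(dirs[i], dirs[i + 1]).item()
            for i in range(len(dirs) - 1)]
    return sum(sims) / len(sims)

# Hypothetical usage with frames rendered along the same trajectory:
# score = clip_dc([f"orig_{i:03d}.png" for i in range(30)],
#                 [f"edit_{i:03d}.png" for i in range(30)])
```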
Table 4. Quantitative evaluation using the BRISQUE ↓ metric.
Each view column lists, in order: Custom [42] / N2N [48] / ViCA [49] / Ours.

| Scene | Editing task | View 1 (Left) | View 2 (Upper Right) | View 3 (Lower Right) |
|---|---|---|---|---|
| Scene 1 | Grassland turns into flower fields | 34.4920 / 48.6211 / 12.9090 / 6.8060 | 23.4556 / 52.1778 / 11.9449 / 9.2797 | 25.4060 / 53.8359 / 19.6751 / 18.6330 |
| Scene 1 | Ignite the grass | 14.7334 / 91.9134 / 28.6792 / 10.7525 | 12.9733 / 82.7078 / 14.4974 / 9.3116 | 14.6903 / 49.2999 / 16.3603 / 13.7242 |
| Scene 1 | Cars added to the grass | 17.9289 / 35.2548 / 9.4970 / 8.7258 | 20.9988 / 95.4384 / 12.9834 / 7.0378 | 20.4781 / 55.1608 / 9.6607 / 4.7171 |
| Scene 1 | Add trees to the grassy area | 12.5062 / 22.3538 / 9.1203 / 9.8225 | 11.7114 / 35.1213 / 10.4524 / 5.0509 | 13.3830 / 51.4680 / 10.9406 / 7.2628 |
| Scene 2 | Grassland turns into flower fields | 32.2636 / 67.6249 / 16.8117 / 11.9987 | 25.1654 / 59.1784 / 17.1397 / 12.5625 | 30.0600 / 67.6947 / 15.6444 / 10.4481 |
| Scene 2 | Leaves turn red | 22.4586 / 80.2274 / 15.2946 / 8.8632 | 19.4652 / 70.2552 / 17.0026 / 10.4532 | 28.1150 / 81.8723 / 15.5323 / 10.8470 |
| Scene 2 | Turn into autumn | 12.0145 / 37.1571 / 16.1654 / 9.9492 | 24.2251 / 23.3969 / 16.7596 / 12.3718 | 22.2113 / 44.0571 / 12.7033 / 5.5276 |
| Scene 2 | Turn into night | 62.0233 / 132.4212 / 8.7978 / 7.7500 | 59.6868 / 132.4908 / 13.0273 / 8.9125 | 60.3352 / 131.2543 / 7.2353 / 5.6094 |
Bold highlights the optimal value in each comparison group.
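BRISQUE in Table 4 is a no-reference quality score, so it is computed on the edited renderings alone (lower is better). The paper does not specify which implementation was used; the sketch below assumes the piq package's BRISQUE applied to an RGB tensor scaled to [0, 1], and the file name is hypothetical.

```python
# Hedged sketch of a BRISQUE (lower-is-better) evaluation of an edited view.
# Assumes the `piq` package (pip install piq), torchvision, and Pillow.
import piq
from PIL import Image
from torchvision.transforms.functional import to_tensor

def brisque_score(image_path: str) -> float:
    """No-reference BRISQUE score of a single rendered image."""
    img = to_tensor(Image.open(image_path).convert("RGB")).unsqueeze(0)  # 1x3xHxW in [0, 1]
    return piq.brisque(img, data_range=1.0).item()

# Hypothetical usage:
# print(brisque_score("view3_turn_into_night_ours.png"))
```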
Table 5. Quantitative evaluation using multiple metrics.
Each metric column lists the Left / Center / Right views.

| Variant | CLIP-TIDS ↑ | CLIP-DC ↑ | BRISQUE ↓ | TDB ↑ |
|---|---|---|---|---|
| w/o RNTR | 0.2583 / 0.2793 / 0.2734 | 0.3074 / 0.3682 / 0.2269 | 24.2272 / 26.8964 / 27.7686 | 0.4035 / 0.4479 / 0.3429 |
| w/ RNTR | 0.2642 / 0.2642 / 0.2377 | 0.3516 / 0.3806 / 0.2439 | 17.5011 / 19.3349 / 16.0689 | 0.4860 / 0.4929 / 0.3908 |
| w/o VP | 0.2122 / 0.2057 / 0.2058 | 0.0588 / 0.0658 / 0.0319 | 21.5104 / 16.9307 / 21.9204 | 0.2004 / 0.2166 / 0.1748 |
| w/ VP | 0.2264 / 0.2213 / 0.2240 | 0.0890 / 0.0911 / 0.0746 | 21.5408 / 16.8403 / 22.3957 | 0.2331 / 0.2496 / 0.2181 |
| w/o DR | 0.2651 / 0.2650 / 0.2598 | 0.2542 / 0.2571 / 0.2539 | 7.0661 / 9.8522 / 14.7605 | 0.5728 / 0.5042 / 0.4290 |
| w/ DR | 0.2759 / 0.2666 / 0.2686 | 0.2825 / 0.2644 / 0.2659 | 5.1188 / 5.2124 / 10.1656 | 0.7098 / 0.6694 / 0.5101 |
| w/o AB | 0.2440 / 0.2764 / 0.2861 | 0.1865 / 0.2070 / 0.2399 | 12.3056 / 16.0577 / 11.9057 | 0.3830 / 0.3924 / 0.4735 |
| w/ AB | 0.2507 / 0.2583 / 0.2839 | 0.1733 / 0.1737 / 0.2310 | 7.5930 / 10.0552 / 10.6847 | 0.4539 / 0.4140 / 0.4823 |
Bold highlights the optimal value in each comparison group.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
