Article

ConWave-LoRA: Concept Fusion in Customized Diffusion Models with Contrastive Learning and Wavelet Filtering

College of Science, North China University of Technology, Beijing 100144, China
* Author to whom correspondence should be addressed.
Computers 2026, 15(1), 5; https://doi.org/10.3390/computers15010005
Submission received: 6 November 2025 / Revised: 12 December 2025 / Accepted: 17 December 2025 / Published: 22 December 2025
(This article belongs to the Special Issue Advanced Image Processing and Computer Vision (2nd Edition))

Abstract

Customizing diffusion models via Low-Rank Adaptation (LoRA) has become a standard approach for customized concept injection. However, synthesizing multiple customized concepts within a single image remains challenging due to the parameter pollution problem, where naive fusion leads to gradient conflicts and severe quality degradation. In this paper, we introduce ConWave-LoRA, a novel framework designed to achieve hierarchical disentanglement of object and style concepts in LoRAs. Supported by our empirical validation regarding frequency distribution in the latent space, we identify that object identities are predominantly encoded in high-frequency structural perturbations, while artistic styles manifest through low-frequency global layouts. Leveraging this insight, we propose a Discrete Wavelet Transform (DWT) based filtering strategy that projects these concepts into orthogonal optimization subspaces during contrastive learning, thereby isolating structural details from stylistic attributes. Extensive experiments, including expanded ablation studies on LoRA rank sensitivity and style consistency, demonstrate that ConWave-LoRA consistently outperforms strong baselines, producing high-fidelity images that successfully integrate multiple distinct concepts without interference.

1. Introduction

The advent of latent diffusion [1] has significantly accelerated progress in the field of text-to-image generation. By introducing the latent space that balances generation quality, computational efficiency, and accessibility, latent diffusion has enabled both academic researchers and industry practitioners to explore a wide range of creative and practical applications. At its core, a diffusion model operates through a two-stage process: during training, a clean latent representation of an image is progressively perturbed with Gaussian noise over a sequence of time steps, producing increasingly degraded representations; the model then learns to reverse this process by iteratively denoising the noisy latents, conditioned on text prompts, to recover the original image content [2]. This bidirectional framework allows the model to capture a rich distribution over the data, enabling controllable and high-fidelity generation. More recently, the emergence of flow matching based diffusion models [3], such as SD3 [4], FLUX.1-dev and Qwen Image [5], has further advanced the state of the art, offering the capability to synthesize visually compelling and semantically coherent images with remarkable fidelity. These developments have not only broadened the accessibility of advanced generative techniques but also catalyzed innovations across domains including digital art, content creation, and visual design.
For professionals in content creation and visual design, a key requirement when applying diffusion models is the ability to generate diverse images of a specified concept while preserving its distinctive features, such as a particular product or model [6,7]. However, a pretrained diffusion model cannot accomplish this task, as it has not been exposed to images of the concept during its training phase. To meet this requirement, a common strategy is to further fine-tune the diffusion model, thereby obtaining a customized version tailored to the concept [8]. Specifically, we employ a dataset containing images of the concept along with their corresponding image captions. The objective of fine-tuning is to enable the diffusion model to generate images, conditioned on the given image captions, that closely resemble those in the dataset. Based on this approach, we can train diffusion models capable of depicting a wide variety of specific concepts. In fact, several platforms dedicated to sharing such customized models have emerged within the community, such as Civitai [9] and HuggingFace [10], where users can search for and utilize a diverse range of customized diffusion models.
The above method effectively addresses the problem of generating images containing a single specific concept. However, generating images that incorporate multiple specific concepts remains a challenging task [11,12]. On the one hand, due to the absence of a training dataset containing images with multiple specific concepts, it is not feasible to directly fine-tune a diffusion model for this purpose. On the other hand, directly employing multiple customized diffusion models leads to the issue of parameter interference, commonly referred to as the parameter pollution problem.
Figure 1 presents an example of the pollution problem as the number of customized diffusion models applied to FLUX.1-dev increases. The first row shows images generated by three different models separately, while the second row shows images generated when they are applied sequentially. The results indicate that applying each model separately yields high-quality images that conform to the concept on which that model was trained. In contrast, as the number of applied models increases, image quality degrades drastically due to parameter conflicts between the models. In addition, the distinct behaviors of object and style concepts in image generation further exacerbate the challenge. Specifically, object concepts typically require the generated objects to maintain consistent details and characteristics across different images and prompts. In contrast, style concepts do not demand such consistency in fine-grained features, but instead emphasize uniformity in overall texture and color.
In this paper, we propose a novel method, ConWave-LoRA, to effectively address the challenge of multi-concept fusion. Our approach diverges from existing unified optimization methods by adopting a hierarchical disentanglement strategy. This design is grounded in two key hypotheses that we empirically validate in this work: (1) Concepts in customization tasks can be structurally decoupled into object-oriented identities (requiring strict shape consistency) and style-oriented attributes (emphasizing global texture and palette); and (2) These distinct concept types occupy separable frequency bands within the diffusion latent space. Specifically, we demonstrate that object identities are predominantly encoded in high-frequency structural perturbations (e.g., edges and contours), while artistic styles are manifested through low-frequency global distributions (e.g., color tones and layouts). By mapping these concepts to orthogonal frequency subspaces via the Discrete Wavelet Transform (DWT), we can mitigate gradient conflicts during fusion, ensuring that style adaptation does not degrade object structure. Our main contributions are summarized as follows.
  • We formally characterize the parameter pollution problem in multi-LoRA concept fusion as a conflict between optimization directions in shared parameter spaces, and analyze its degradation effects on image generation quality.
  • We introduce ConWave-LoRA, a frequency-aware two-stage framework. Unlike prior works treating all parameters uniformly, we propose a DWT-based orthogonal filtering strategy to isolate high-frequency object details from low-frequency style attributes, effectively creating non-conflicting optimization subspaces for different concepts.
  • We provide empirical validation of the frequency distribution assumption within the latent space (Section 4.2), visually and statistically confirming the mapping between frequency bands and semantic concepts.
  • We conduct extensive experiments and expanded ablation studies (including LoRA Rank sensitivity and Style Consistency metrics). The results demonstrate that ConWave-LoRA achieves state-of-the-art performance in preserving both object fidelity and stylistic coherence compared to existing strong baselines.
The remainder of this paper is structured as follows. Section 2 reviews key topics related to our work. Section 3 presents the technical details and discussion of the proposed method ConWave-LoRA. Section 4 reports extensive experiments demonstrating its effectiveness, and Section 5 concludes the paper.

2. Related Work

2.1. Parameter-Efficient Fine-Tuning

Large-scale diffusion models and vision-language models have demonstrated remarkable performance across various generative tasks, but their deployment in specialized domains is often hindered by the computational and storage costs associated with full fine-tuning. To address this, Low-Rank Adaptation [13] has emerged as a widely adopted method for parameter-efficient fine-tuning. Instead of updating all model parameters, LoRA introduces trainable low-rank matrices into existing layers (e.g., attention and feed-forward blocks), significantly reducing the number of trainable parameters and the memory footprint. LoRA has since been extended and adapted to various domains, including image generation, language modeling, and multi-modal tasks [14]. Recent works such as LoRA-CLIP [15] and LoRA-DreamBooth [8] demonstrate that even when only a small subset of weights is adapted, models can still learn novel concepts or styles effectively. Moreover, LoRA enables fast adaptation and merging of multiple skills by supporting modular training and composition, which has led to a surge of community-driven repositories for sharing and recombining pre-trained LoRA modules, such as Civitai [9] and HuggingFace [10]. The proposed method builds on LoRAs trained with existing parameter-efficient fine-tuning methods and focuses on how to effectively merge multiple LoRAs so as to generate images with better quality and stronger alignment with their corresponding concepts.

2.2. Concept Fusion with LoRA

Although LoRA facilitates image generation for novel concepts, effectively generating images that incorporate multiple novel concepts remains a challenging task—commonly referred to as the concept fusion problem. Early attempts to address this issue primarily rely on training-free strategies, such as weighted averaging of LoRA parameters [11] or LoRA weight decomposition [16], to combine multiple LoRAs and fuse their associated concepts. Recognizing the redundancy in LoRA parameters, K-LoRA [17] introduces a more refined approach that selectively extracts parameters deemed more important during the merging process. However, these methods often suffer from limitations, including image quality degradation, loss of fine-grained details, and reduced generalization of the fused concepts across diverse contexts. Recent advancements have shifted towards training-based strategies for LoRA fusion, exemplified by methods such as LoRACLR [18] and ZipLoRA [19]. By designing task-specific loss functions tailored to the LoRA merging objective, these approaches achieve improved performance over training-free alternatives. Our proposed method falls within this training-based paradigm but distinguishes itself through a novel hierarchical concept disentanglement framework and a frequency-aware filtering strategy that enhances the merging of multiple LoRAs.

2.3. Contrastive Learning

While diffusion models have demonstrated impressive generative capabilities, integrating contrastive learning into diffusion frameworks has recently gained attention as a means to enhance representation quality [20], improve controllability [21], and stabilize training [22]. Some early approaches incorporate contrastive objectives into the denoising process to learn more semantically meaningful intermediate representations. For instance, contrastive loss can be applied between different noise levels or denoised latents to encourage temporal consistency across the diffusion trajectory [23]. Other works leverage contrastive learning to improve conditional generation, aligning the intermediate representations of noisy inputs with condition embeddings (e.g., prompts, class labels, or images) [18,19]. Additionally, contrastive regularization has been used in latent diffusion models to enforce structure and disentanglement in the latent space, improving downstream tasks such as editing [24], retrieval, or interpolation [25]. The proposed method also falls within the contrastive learning paradigm and employs contrastive objectives in two separate stages to improve model performance and the robustness of the multiple concepts carried by different LoRAs after fusion.

3. Proposed Method

In this section, we present the technical details of our proposed method, which is designed to mitigate the pollution problem illustrated in Figure 1. We first introduce the learning paradigm of flow-matching-based diffusion models under parameter-efficient fine-tuning frameworks, with a particular focus on LoRA [13]. A formal problem formulation is then provided. Subsequently, we give a comprehensive overview of the proposed method, elaborating beyond the preliminary intuitions and qualitative insights discussed earlier. The following subsections describe the two major stages of our approach in detail: the object fusion stage, which forms the first stage, and the style fusion stage, which forms the second. Finally, we conclude this section with an analysis of the computational cost and a discussion of the novelty of our method in comparison with existing approaches.

3.1. Learning Paradigm of LoRA and Problem Formulation

State-of-the-art diffusion models for text-to-image generation typically contain billions of parameters. For instance, FLUX.1-dev, released by Black Forest Labs, comprises 12 billion parameters and currently represents the leading performance within the open-source community, while SD3.5-Large from Stability AI contains 8.1 billion parameters and also demonstrates competitive performance. However, directly fine-tuning such large-scale models is prohibitively expensive, as it requires vast amounts of training data and substantial GPU resources. To mitigate these costs, a variety of parameter-efficient fine-tuning approaches have been proposed. Among these, LoRA [13] has emerged as the most widely adopted technique within the community.
Given a learnable weight update $\Delta W \in \mathbb{R}^{m \times n}$ of the model to be fine-tuned, and under the assumption that many parameters in $\Delta W$ are redundant, LoRA decomposes $\Delta W$ into two low-rank matrices $D \in \mathbb{R}^{m \times r}$ and $U \in \mathbb{R}^{r \times n}$ with $r \ll \min(m, n)$:
$$\Delta W \approx D \times U \tag{1}$$
The learnable parameters $D$ and $U$ are then iteratively updated under a pre-defined training loss, using gradient-descent methods such as Stochastic Gradient Descent (SGD) or Adam. Intuitively, the number of learnable parameters is reduced from $m \times n$ to $(m + n) \times r$. Taking FLUX.1-dev as an example, a common case is $m = 5072$, $n = 1024$, and $r = 32$, so the number of parameters is reduced from $5072 \times 1024 = 5{,}193{,}728$ to $(5072 + 1024) \times 32 = 195{,}072$, accounting for only about 4% of the original learnable parameters. As a result, the time and computational cost of training can be greatly reduced.
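As a quick sanity check of this parameter-count arithmetic, the short PyTorch sketch below compares the size of a full-rank update with its low-rank factorization, using the FLUX.1-dev layer dimensions quoted above (the tensor names are illustrative only):

```python
import torch
import torch.nn as nn

# Layer dimensions from the FLUX.1-dev example above.
m, n, r = 5072, 1024, 32

full_update = nn.Parameter(torch.zeros(m, n))      # full-rank update Delta_W
D = nn.Parameter(torch.zeros(m, r))                # low-rank factor D (m x r)
U = nn.Parameter(torch.randn(r, n) * 0.01)         # low-rank factor U (r x n)

full_params = full_update.numel()                  # 5072 * 1024 = 5,193,728
lora_params = D.numel() + U.numel()                # (5072 + 1024) * 32 = 195,072
print(f"full: {full_params:,}  lora: {lora_params:,}  "
      f"ratio: {lora_params / full_params:.1%}")   # roughly 3.8%
```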
Given a fine-tuning dataset $\mathcal{D} = \{x^i, c^i\}_{i=1}^{N}$, where $x^i$ and $c^i$ denote training images and their corresponding captions, respectively, the fine-tuning process proceeds as follows. First, the encoder of a variational autoencoder (VAE) projects the images $\{x^i\}_{i=1}^{N}$ into a latent space, producing representations $\{z^i\}_{i=1}^{N}$. Following standard notation, we denote by $z_0^i$ the clean latent representation of an image and by $z_t^i$ its perturbed version after injecting Gaussian noise $\epsilon_t$ at diffusion step $t$.
To optimize the model, we adopt a flow-matching loss built upon the Rectified Flow framework [26]. Rectified Flow reformulates flow matching by explicitly modeling the probability flow ODE in a rectified manner, thereby improving stability and convergence during training. Under this framework, the loss function for fine-tuning is defined as:
$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{(x^i, c^i) \sim \mathcal{D},\; t \sim \{1, \dots, T\}} \left\| (\epsilon_t - z_0^i) - \epsilon_\theta(z_t^i, c^i, t) \right\|_2^2 \tag{2}$$
where $\epsilon_\theta(z_t^i, c^i, t)$ denotes the velocity predicted by the model at timestep $t$, conditioned on the noisy latent $z_t^i$ and the text prompt $c^i$, while $T$ is the total number of diffusion timesteps (typically set to 1000). Conceptually, this objective enforces the model to align its predicted velocity with the ground-truth rectified flow field, thereby learning a more accurate transport map between noisy and clean latents across the diffusion trajectory.
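The following sketch illustrates how this objective can be computed for a batch of clean latents, assuming a generic velocity-prediction network `model(z_t, prompt_emb, t)` and linear (rectified-flow) interpolation between the clean latent and noise; names and shapes are placeholders rather than the FLUX.1-dev training code.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, z0, prompt_emb, num_timesteps=1000):
    """Rectified-flow / flow-matching loss of Equation (2), schematically.

    `model(z_t, prompt_emb, t)` is assumed to return a velocity tensor with
    the same shape as z0 (a batch of image latents, shape B x C x H x W).
    """
    b = z0.shape[0]
    t = torch.randint(1, num_timesteps + 1, (b,), device=z0.device)
    t_frac = t.float().view(b, 1, 1, 1) / num_timesteps     # normalize to (0, 1]

    eps = torch.randn_like(z0)                               # Gaussian noise eps_t
    z_t = (1.0 - t_frac) * z0 + t_frac * eps                 # linearly interpolated noisy latent
    target_velocity = eps - z0                               # ground-truth velocity

    pred_velocity = model(z_t, prompt_emb, t)
    return F.mse_loss(pred_velocity, target_velocity)        # squared-error objective
```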
Given the update $\Delta W \in \mathbb{R}^{m \times n}$ learned with the flow-matching loss on dataset $\mathcal{D}$, the final fine-tuned diffusion model is obtained by adding $\Delta W$ to the original parameters $W$; that is, for each adapted weight matrix in the model, we have:
$$W_{\mathrm{new}} = W + \alpha \times \Delta W = W + \alpha\,(D \times U) \tag{3}$$
where $\alpha$ is a pre-defined coefficient known as the LoRA scale, commonly set to a value between 0.2 and 1 in practice. During the inference stage of the fine-tuned model, given an image caption $c$ and random Gaussian noise $z_T$, the predicted velocity is applied iteratively to denoise the latent, and the final result $z_0$ is a clean image in the latent space. Finally, the VAE decoder transforms $z_0$ back into an image in the original pixel space.
Formally, given $N$ LoRAs $\{m_1, m_2, \dots, m_N\}$ trained on separate datasets, the proposed method aims to merge them into a new LoRA $m_{\mathrm{merged}}$ that better preserves the concepts in these datasets. Simultaneously merging multiple LoRAs for text-to-image generation remains a challenging problem, since each LoRA updates the same parameters of the original diffusion model according to Equation (3). Simple averaging (i.e., $m_{\mathrm{merged}} = \frac{1}{N}\sum_{n=1}^{N} m_n$) inevitably causes conflicts between the LoRAs and introduces mutual side effects. These effects lead to the degradation of image quality known in the community as the model pollution problem.
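To make the merging arithmetic concrete, the sketch below applies Equation (3) for a single LoRA and then forms the naive average that Figure 1 shows to be problematic; it assumes each LoRA is represented as a dictionary mapping layer names to (D, U) factor pairs, which is an illustrative convention rather than the authors' storage format.

```python
import torch

def apply_lora(W, D, U, alpha=0.8):
    """Equation (3): W_new = W + alpha * (D @ U), for one weight matrix."""
    return W + alpha * (D @ U)

def naive_average_merge(loras):
    """Simple averaging of N LoRAs, the baseline that triggers parameter pollution.

    `loras` is assumed to be a list of dicts {layer_name: (D, U)} trained on the
    same base model; the merged update for each layer is the mean of the
    individual full-rank updates D @ U.
    """
    merged = {}
    for name in loras[0]:
        deltas = [D @ U for (D, U) in (lora[name] for lora in loras)]
        merged[name] = torch.stack(deltas).mean(dim=0)
    return merged
```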

3.2. Overview

The proposed method addresses the aforementioned challenge through a hierarchical concept disentanglement framework, consisting of the object fusion stage and the style fusion stage. In the first stage, object fusion aims to effectively integrate multiple object LoRAs. We employ a contrastive training strategy to determine the parameters of the fused LoRA. Specifically, the fused LoRA is required to generate images that are semantically similar to those produced by each original object LoRA under its corresponding concept prompt, while simultaneously producing images that are dissimilar to those generated by other LoRAs when conditioned on the same prompt.
The second stage, style fusion, focuses on combining the fused object LoRA from the first stage with style LoRAs. Unlike object LoRAs, which are tailored to represent a single concept or entity, style LoRAs capture broader stylistic patterns that influence the global artistic characteristics of generated images. To better preserve style-specific information, we avoid parameter learning for the fused LoRA as in the first stage. Instead, we introduce a set of learnable coefficients that perform weighted averaging between the fused object LoRA and style LoRAs. These coefficients are optimized using the same contrastive training framework as in the object fusion stage.
Furthermore, motivated by the observation that object LoRAs primarily affect high-frequency structures (e.g., edges and contours), whereas style LoRAs predominantly influence low-frequency components (e.g., textures and colors), we incorporate a discrete wavelet transform (DWT) in both stages. DWT is used to decompose generated images into frequency components, enabling contrastive training to focus on the most informative representations for each LoRA in orthogonal optimization subspaces. An overview of this two-stage framework is illustrated in Figure 2.
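The operator used for this decomposition can be implemented with a single-level, channel-wise 2-D Haar transform; the sketch below is one minimal way to write it for a batch of latents (the sub-band labels follow the common LL/LH/HL/HH convention, and the implementation is our own illustration rather than the authors' code).

```python
import torch

def haar_dwt2d(x):
    """Single-level 2-D Haar wavelet transform, applied channel-wise to a
    (B, C, H, W) tensor with even H and W.

    Returns the low-frequency sub-band LL and the three high-frequency
    sub-bands LH, HL, HH, each of spatial size (H/2, W/2).
    """
    a = x[..., 0::2, 0::2]       # top-left sample of each 2x2 block
    b = x[..., 0::2, 1::2]       # top-right
    c = x[..., 1::2, 0::2]       # bottom-left
    d = x[..., 1::2, 1::2]       # bottom-right

    ll = (a + b + c + d) / 2.0   # low frequency: global layout, colors
    lh = (a - b + c - d) / 2.0   # detail along the width axis
    hl = (a + b - c - d) / 2.0   # detail along the height axis
    hh = (a - b - c + d) / 2.0   # diagonal detail
    return ll, lh, hl, hh
```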

3.3. Object Fusion

Given multiple object LoRAs, the object fusion stage focuses on effectively fusing the corresponding object concepts into a single merged LoRA, while eliminating the side effects and the model pollution problem that arise when applying all of them, as illustrated in Figure 1. Formally, given $N$ object LoRAs with different parameters $\{m_{\theta_1}, m_{\theta_2}, \dots, m_{\theta_n}, \dots, m_{\theta_N}\}$, each corresponding to an object concept characterized by a prompt from $\{c_1, c_2, \dots, c_n, \dots, c_N\}$, the object fusion stage optimizes the parameters of a merged LoRA $m_{\hat\theta}$ that is capable of generating images similar to those of each $m_{\theta_n}$ under prompt $c_n$. In addition, we are given a set of images generated independently by each LoRA $m_{\theta_n}$ using prompt $c_n$, i.e., $\{x_n^i\}_{i=1}^{p}$, where $x_n^i$ denotes the $i$-th image generated by LoRA $m_{\theta_n}$. Lastly, since the prediction of each LoRA is the velocity used for denoising in the latent space at a specific timestep, each image is first encoded into its latent representation with the VAE, yielding image latents $\{z_n^i\}_{i=1}^{p}$.
Prior to fine-tuning the parameters of the merged LoRA $m_{\hat\theta}$, its initial parameters are set to the average of all object LoRAs according to Equation (4). Compared with initializing the merged LoRA from a random uniform distribution, the averaged parameters serve as a better starting point for training-loss optimization, commonly leading to fewer training steps before convergence.
$$m_{\hat\theta} = \frac{1}{N}\sum_{n=1}^{N} m_{\theta_n} \tag{4}$$
During each training step, an object LoRA and its corresponding prompt are randomly sampled as the positive class $\{m_{\theta_n}, c_n\}$. In addition, the latent representation of an image generated by $m_{\theta_n}$ is randomly sampled as the positive target $z_n^i$. All LoRAs other than $m_{\theta_n}$ are then treated as the negative class, $\{m_{\theta_j}, c_j\}_{j \neq n}$. As for the positive class, an image latent is randomly sampled for each negative LoRA, $\{z_j^i\}_{j \neq n}$. After randomly choosing a diffusion timestep $t$, the overall training loss is defined as follows:
$$\mathcal{L}_{\text{object}} = s\!\left(m_{\hat\theta}(z_{n,t}^i, c_n, t),\, z_{n,0}^i\right) + \max\!\left(0,\ \tfrac{1}{2} - \min_{j \neq n}\, s\!\left(m_{\hat\theta}(z_{n,t}^i, c_j, t),\, z_{n,0}^i\right)\right) \tag{5}$$
where $m_{\hat\theta}(z_{n,t}^i, c_n, t)$ is the velocity predicted by the target merged LoRA $m_{\hat\theta}$ given prompt $c_n$ and latent $z_{n,t}^i$ at diffusion timestep $t$, and $z_{n,0}^i$ is the original latent image before noise is added. $s(\cdot, \cdot)$ is a distance measure in the latent space:
$$s(v, z_0) = \left\| H(\epsilon_t - z_0) - H(v) \right\|_2^2 \tag{6}$$
where $\epsilon_t$ is the noise added at the $t$-th diffusion timestep, $z_0$ is the original image encoded into the latent space, and $v$ is the predicted velocity. In the rectified flow framework, $\epsilon_t - z_0$ is the ground-truth velocity. Equation (6) follows the flow-matching loss of Equation (2), except for the operator $H$, which denotes the discrete wavelet transform (DWT). Using paired low-pass and high-pass filters, the input is decomposed into four sub-bands: the low-frequency sub-band $x_{LL}$ and the high-frequency sub-bands $x_{LH}$, $x_{HL}$, and $x_{HH}$. More specifically, the Haar wavelet transform is applied channel-wise over the spatial dimensions of the latent tensors $\epsilon_t - z_0$ and $v$ to obtain their sub-bands. In the object fusion stage, since objects and their contours mainly correspond to high-frequency content in the image, the high-frequency sub-bands of the predicted velocity $v$ and of the target $\epsilon_t - z_0$ are used for the distance measurement.
According to Equation (5), the training loss in the object fusion stage requires the merged LoRA to produce predictions similar to those of LoRA $m_{\theta_n}$ given prompt $c_n$, which serves as the positive loss. In addition, it requires the merged LoRA to produce predictions different from those of the other LoRAs $\{m_{\theta_j}\}_{j \neq n}$, which serves as the negative loss. More specifically, a min-max formulation is used: it first locates the most similar prediction among the negative LoRAs $\{m_{\theta_j}\}_{j \neq n}$ and then requires the merged LoRA's prediction to be dissimilar to it. Furthermore, the margin in Equation (5) is empirically set to 0.5. This value serves as an approximate upper bound for the flow-matching loss, ensuring that the discriminative term (i.e., the second term in Equation (5)) contributes meaningful gradients throughout the entire object fusion stage and preventing the inter-class separation loss from vanishing prematurely.
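Putting Equations (5) and (6) together, a schematic implementation of one object-fusion training step could look as follows. It reuses the `haar_dwt2d` helper sketched in Section 3.2, treats the LoRA-augmented model as a callable that returns a predicted velocity, and uses placeholder names throughout; it is a sketch of the loss structure, not the authors' training loop.

```python
import torch

def high_freq_distance(pred_velocity, z0, eps):
    """Distance s(v, z0) of Equation (6): squared error between the Haar
    high-frequency sub-bands of the ground-truth velocity (eps - z0) and
    of the predicted velocity. Uses haar_dwt2d from the earlier sketch."""
    target = eps - z0
    _, lh_t, hl_t, hh_t = haar_dwt2d(target)
    _, lh_p, hl_p, hh_p = haar_dwt2d(pred_velocity)
    return sum(((t_band - p_band) ** 2).mean()
               for t_band, p_band in ((lh_t, lh_p), (hl_t, hl_p), (hh_t, hh_p)))

def object_fusion_loss(merged_lora, z_t, z0, eps, t, pos_prompt, neg_prompts, margin=0.5):
    """Contrastive loss of Equation (5) for one sampled positive concept."""
    # Positive term: with the correct prompt, reproduce the positive LoRA's
    # high-frequency structure.
    pos = high_freq_distance(merged_lora(z_t, pos_prompt, t), z0, eps)
    # Negative term (min-max): with any other concept's prompt, stay at least
    # `margin` away from the positive target in the high-frequency subspace.
    neg_dists = torch.stack([high_freq_distance(merged_lora(z_t, c, t), z0, eps)
                             for c in neg_prompts])
    return pos + torch.clamp(margin - neg_dists.min(), min=0.0)
```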

3.4. Style Fusion

Given the merged object LoRA $m_{\hat\theta}$, which is capable of generating images with multiple concepts, the style fusion stage further fuses $m_{\hat\theta}$ with a style LoRA $m_\beta$. The final merged LoRA $m_{\text{final}}$ is able to generate images containing multiple concepts in the specific image style provided by $m_\beta$. In contrast to object LoRAs, style LoRAs can convert images of a broad range of objects into a specific style, such as an anime style or an oil-painting style. To fully preserve this style-conversion capability, we cannot use a contrastive training loss as in the object fusion stage, since there exist countless objects and it is infeasible to include all of them in the training loss of Equation (5). Therefore, instead of optimizing the parameters of the final merged LoRA toward generating images similar to those of the original LoRAs under different prompts, the style fusion stage focuses on effectively merging $m_{\hat\theta}$ and $m_\beta$ without changing their model parameters.
Formally, for the parameters in each block $i \in \{1, \dots, B\}$ of $m_{\hat\theta}$ and $m_\beta$, the style fusion stage aims to find a set of coefficients $\{w_i^o, w_i^s\}_{i=1}^{B}$ such that the parameters of the $i$-th block of the final LoRA, $\Delta W_i$, are determined by:
$$\Delta W_i = w_i^o \times (D_i^o \times U_i^o) + w_i^s \times (D_i^s \times U_i^s), \quad i = 1, \dots, B \tag{7}$$
where $\{D_i^o, U_i^o\}$ denotes the low-rank matrices of the $i$-th block of the object LoRA $m_{\hat\theta}$, and $\{D_i^s, U_i^s\}$ denotes the low-rank matrices of the $i$-th block of the style LoRA $m_\beta$. The right part of Figure 2 illustrates the weighted merge used in the style fusion stage. In addition, prior to training, all coefficients $\{w_i^o, w_i^s\}_{i=1}^{B}$ are initialized to 0.5, so that $m_{\text{final}}$ corresponds to the simple average of $m_{\hat\theta}$ and $m_\beta$ before training starts.
During each training step $i$, a VAE-encoded generated image and its corresponding prompt are randomly sampled for the merged object LoRA $m_{\hat\theta}$ and for the style LoRA $m_\beta$, denoted by $\{z_i^o, c_i^o\}$ and $\{z_i^s, c_i^s\}$, respectively. Given a randomly chosen timestep $t$, the training loss is defined as:
$$\mathcal{L}_{\text{style}} = s\!\left(m_{\text{final}}(z_{i,t}^o, c_i^o, t),\, z_{i,0}^o\right) + s\!\left(m_{\text{final}}(z_{i,t}^s, c_i^s, t),\, z_{i,0}^s\right) + \lambda \left\| w_i \right\|_2^2 \tag{8}$$
where $z_{i,t}^o$ denotes the latent after adding noise at timestep $t$, $z_{i,0}^o$ denotes the original image in the latent space, and $m_{\text{final}}(z_{i,t}^o, c_i^o, t)$ is the denoising velocity predicted by the final merged LoRA $m_{\text{final}}$. The last term is a regularization term on the learnable coefficients. $s(\cdot, \cdot)$ is a distance measure similar to Equation (6); however, instead of extracting the high-frequency sub-bands as in the object fusion stage, the low-frequency sub-band $x_{LL}$ is extracted here for measuring similarity, since image style is mostly carried by textures and colors. Since $m_{\text{final}}$ is a weighted average of $m_{\hat\theta}$ and $m_\beta$ with learnable coefficients $\{w_i^o, w_i^s\}_{i=1}^{B}$, optimizing the loss in Equation (8) requires the final merged LoRA to generate images similar to those of the merged object LoRA and the style LoRA under the corresponding prompts $c_i^o$ and $c_i^s$. In more detail, $c_i^o$ corresponds to the combination of all object prompts, e.g., <A>&<B>&<C> in Figure 2, and $c_i^s$ corresponds to the style prompt.
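The style fusion stage can be summarized by a small module that holds the frozen low-rank factors of both LoRAs and exposes only the per-block coefficients as trainable parameters; the sketch below illustrates Equation (7) with the 0.5 initialization described above (class and attribute names are our own, and the frozen factors are kept in plain lists for brevity).

```python
import torch
import torch.nn as nn

class WeightedLoRAMerge(nn.Module):
    """Per-block weighted combination of a frozen object LoRA and a frozen
    style LoRA (Equation (7)); only the scalars w_o, w_s are trainable."""

    def __init__(self, object_blocks, style_blocks):
        super().__init__()
        # object_blocks / style_blocks: lists of (D, U) low-rank factor pairs,
        # one pair per transformer block; the factors themselves stay frozen.
        self.object_blocks = object_blocks
        self.style_blocks = style_blocks
        num_blocks = len(object_blocks)
        self.w_o = nn.Parameter(torch.full((num_blocks,), 0.5))   # object coefficients
        self.w_s = nn.Parameter(torch.full((num_blocks,), 0.5))   # style coefficients

    def block_delta(self, i):
        """Merged weight update Delta_W_i for block i."""
        D_o, U_o = self.object_blocks[i]
        D_s, U_s = self.style_blocks[i]
        return self.w_o[i] * (D_o @ U_o) + self.w_s[i] * (D_s @ U_s)
```

During training, the loss of Equation (8) would be evaluated with these merged updates injected into the base model, so that only the coefficients (plus the $\lambda\|w\|_2^2$ regularizer) receive gradients.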

3.5. Computational Cost and Novelty of the Proposed Method

The computational cost of the object fusion stage is comparable to training a LoRA from scratch, whereas the cost of the style fusion stage is far smaller, because it only learns a small set of merge coefficients over the (frozen) object and style LoRAs rather than updating full LoRA weight tensors. As detailed in our ablation experiments, both stages converge within only a few hundred training steps.
While Contrastive Learning and DWT are established techniques individually, our core novelty lies in their structural integration to fundamentally solve the “parameter pollution” problem in multi-concept fusion. We distinguish our approach from existing works in two key aspects:
1. Orthogonal Optimization Subspaces: Existing methods optimize a unified objective where gradients from object identity and artistic style are mixed, often leading to destructive interference (e.g., style textures overriding object edges). In contrast, ConWave-LoRA introduces a frequency-aware filtering strategy. By applying DWT, we project the optimization targets into orthogonal subspaces: high-frequency bands for structural identity and low-frequency bands for global style. This ensures that the gradient updates for style adaptation do not pollute the structural integrity of the object, a mechanism validated by our analysis in Section 4.2.
2. Hierarchical Concept Disentanglement: Unlike prior "joint training" paradigms that treat all concepts uniformly, we propose a "Structure-First, Style-Second" hierarchical framework. This design is motivated by the inherent semantic hierarchy of image generation: geometric structure (object) forms the foundation upon which textural attributes (style) are rendered. Our method respects this hierarchy, effectively decoupling the learning process to avoid the image pollution observed in naive fusion.
The assumption characterizing “objects as high-frequency” and “styles as low-frequency” is a general heuristic that holds for the majority of customization tasks (e.g., photorealistic or painterly styles) but may face limitations in specific edge cases. For instance, “shape-altering” styles (such as abstract geometric deformations or detailed line art) inherently possess significant high-frequency components that overlap with object details. In such scenarios, a strict frequency separation might theoretically constrain the style’s ability to alter the object’s contours. However, in multi-concept fusion, the primary failure mode is often “parameter pollution”, where aggressive style adaptation destroys the subject’s recognizability (as shown in Figure 1). By solidifying the object’s high-frequency structural integrity in the first stage, our framework ensures that the subject remains identifiable, even if it means trading off a degree of geometric deformation in extreme shape-altering styles. This trade-off is essential for maintaining robustness in few-shot customization tasks.

4. Experiment

The experimental section is organized as follows. We first present the implementation details of the proposed method ConWave-LoRA and the baselines used for comparison, along with descriptions of the datasets and evaluation metrics employed during training and evaluation. Next, we present an in-depth analysis of the low- and high-frequency reconstruction results of training images in the latent space, demonstrating the validity of our intuition. We then detail the experiments related to image quality evaluation, which directly reflect the model's performance in image generation. This includes both quantitative results across multiple metrics and qualitative analyses under various scenarios. Finally, we conduct ablation studies on the contribution of the discrete wavelet transform (DWT) in the proposed method and on the effects of different LoRA ranks in the object fusion stage.

4.1. Experiment Settings

4.1.1. Implementation Details

All experiments are conducted using the open-source FLUX.1-dev as the base diffusion model, selected for its strong capability in generating high-fidelity images and adhering to textual prompts. Both the object and style LoRAs used for merging are independently trained on FLUX.1-dev, with the LoRA rank $r$ fixed at 16. We adopt the AdamW optimizer with a learning rate of $1 \times 10^{-4}$ to update model parameters during both the object and style fusion stages. The model and training pipeline are implemented in PyTorch 2.5, and all training and evaluation stages are performed on an NVIDIA L20 GPU.
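For reference, these settings amount to the following minimal training configuration (the dictionary keys are placeholders, not the authors' actual configuration schema):

```python
import torch

# Illustrative configuration matching the reported settings.
config = {
    "base_model": "FLUX.1-dev",     # base diffusion model
    "lora_rank": 16,                # rank of both object and style LoRAs
    "learning_rate": 1e-4,          # AdamW learning rate
    "framework": "PyTorch 2.5",
    "device": "cuda",               # experiments reported on an NVIDIA L20 GPU
}

def build_optimizer(trainable_params, cfg=config):
    # AdamW is used in both the object and style fusion stages.
    return torch.optim.AdamW(trainable_params, lr=cfg["learning_rate"])
```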

4.1.2. Baselines

We evaluate the proposed method ConWave-LoRA against a broad range of baselines. For training-free approaches, we include Model Average Merge [11], Block Interleaved Merge [11], K-LoRA [17], and TARA [27]. Model Average Merge directly integrates all LoRAs into the diffusion model, as illustrated in Figure 1. Building on the observation that many LoRA parameters are redundant, Block Interleaved Merge improves upon this by injecting the object LoRAs and the style LoRA into the odd- and even-numbered blocks of the diffusion model, respectively. K-LoRA achieves training-free fusion by dynamically selecting either the object or the style LoRA for each attention layer based on the aggregated magnitude of their Top-K weight elements. TARA achieves concept fusion by employing token-focus masking to restrict each LoRA module's influence to its specific class token. For training-based baselines, we include LoRACLR [18] and ZipLoRA [19]. K-LoRA, TARA, LoRACLR, and ZipLoRA are all competitive approaches for effectively merging multiple LoRAs.

4.1.3. Datasets and Evaluation Metrics

To comprehensively evaluate the effectiveness of the proposed method, we manually curated five datasets, each comprising 10 images depicting a distinct object. The selected objects span a diverse range of categories, including animals, humans, and inanimate objects. Detailed information about the collected datasets is provided in Figure 3. These datasets are employed to train object-specific LoRAs for use in the fusion process. Additionally, style LoRAs are sourced directly from Civitai, a public open-source platform.
For quantitative evaluation, we employ the CLIP score [28] to assess the semantic alignment between the input prompt and the generated image. In addition, the cosine similarity between the CLIP embeddings of the original and generated images is computed to measure image-level fidelity. The Human Preference Score (HPS) [29] is adopted to evaluate perceptual quality and overall visual plausibility from a human perspective [30]. For comparing image styles after the style fusion stage, the VGG style loss [31] is used to measure style similarity. All experiments are conducted over five independent trials, and results are reported as mean ± standard deviation to ensure statistical reliability.
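Both CLIP-based metrics reduce to cosine similarities in the CLIP embedding space. A minimal sketch, assuming the widely used openai/clip-vit-base-patch32 checkpoint from the transformers library (the paper does not state which CLIP variant was used):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_text_alignment(image, prompt):
    """Cosine similarity between the CLIP embeddings of a prompt and an image."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    return torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()

@torch.no_grad()
def clip_image_alignment(generated_image, reference_image):
    """Cosine similarity between CLIP embeddings of a generated and a training image."""
    inputs = processor(images=[generated_image, reference_image], return_tensors="pt")
    emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.nn.functional.cosine_similarity(emb[0:1], emb[1:2]).item()
```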

4.2. Frequency Analysis on Images in Latent Space

To investigate the distinct roles of high- and low-frequency components, we first randomly sampled an image from each dataset and encoded it into the latent space using the VAE of FLUX.1-dev. Subsequently, we employed a Haar wavelet transform to decompose the latent representation into the high-frequency sub-bands ($x_{LH}$, $x_{HL}$, $x_{HH}$) and the low-frequency sub-band ($x_{LL}$). These sub-bands were then independently decoded back into pixel space via the VAE decoder. As illustrated in Figure 4, the reconstructions derived from the high-frequency sub-bands predominantly preserve object contours and edges, whereas the low-frequency counterparts retain the global layout and textural information. These observations corroborate our hypothesis that high-frequency components in the latent space correspond to the structural outlines of objects in the original images.
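A schematic version of this procedure is given below. It reuses the `haar_dwt2d` helper sketched in Section 3.2, decodes each sub-band independently, and for brevity ignores the VAE's latent scaling and shift factors; the image path and resolution are placeholders.

```python
import torch
from diffusers import AutoencoderKL
from diffusers.utils import load_image
from torchvision import transforms

vae = AutoencoderKL.from_pretrained("black-forest-labs/FLUX.1-dev", subfolder="vae")
to_tensor = transforms.ToTensor()

@torch.no_grad()
def decode_frequency_bands(image_path):
    # Encode the image into the latent space (pixel values scaled to [-1, 1]).
    x = to_tensor(load_image(image_path).resize((512, 512))).unsqueeze(0) * 2 - 1
    z = vae.encode(x).latent_dist.sample()
    # Channel-wise Haar decomposition of the latent.
    ll, lh, hl, hh = haar_dwt2d(z)
    # Decode the low-frequency sub-band and each high-frequency sub-band separately.
    low = vae.decode(ll).sample
    high = [vae.decode(band).sample for band in (lh, hl, hh)]
    return low, high
```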

4.3. Image Generation Performance Comparison

Comparisons of image generation performance are organized into three components: (1) single-object generation, which evaluates the ability of the merged LoRA to generate images containing a single target object; (2) multi-object generation, which assesses the model’s capacity to compose multiple objects within a single image; and (3) qualitative analysis of various baseline methods in generating multi-object images under diverse artistic styles.
Table 1 presents the CLIP scores of different baseline methods in generating single-object images from the dataset described in Figure 3. For each object, every method generates five images using distinct random seeds, and the reported results are averaged across all object categories. The CLIP score quantifies the cosine similarity between the text prompt and the generated image in the shared CLIP embedding space, with higher values indicating better text-image semantic alignment. As shown in the table, the proposed method achieves superior performance across all object categories except for <woman>. Furthermore, all training-based approaches—LoRACLR, ZipLoRA, and our method—outperform the training-free baselines, including Model Averaging, Block Interleaving, K-LoRA, and TARA. However, the performance differences among methods remain relatively modest, as the CLIP score primarily reflects semantic relevance and does not account for perceptual image quality or fidelity to the original image content.
Table 2 further presents the image-level alignment performance of different baseline methods on each single object. For each method, five images are independently generated, and the cosine similarity between the CLIP embeddings of the generated image and the original training image (used to train the corresponding object-specific LoRA) is computed. A higher similarity score indicates greater visual fidelity to the source object. As shown in the table, the proposed method consistently achieves the highest scores across all object categories, demonstrating superior object preservation and feature consistency. Consistent with the text-alignment results, all training-based approaches (e.g., LoRACLR, ZipLoRA, and our method) outperform the training-free baselines, highlighting the effectiveness of fine-tuning in preserving object identity during fusion.
Table 3 shows the human preference scores of the different baselines on each object category. For each baseline, five images are independently generated, and the scores predicted by the HPSv2 model are averaged as the final result. A larger HPSv2 score indicates that the image is more visually appealing to the human eye. According to the results, the proposed method achieves the best performance among all baselines, demonstrating its effectiveness.
In addition to the quantitative evaluation results above, Figure 5 shows single-object images generated by the proposed method. Comparing these results with the training images illustrated in Figure 3, it can be clearly seen that the proposed method is capable of generating high-quality images while maintaining high similarity to the original object category.
Table 4 shows the text alignment and human preference scores of all baselines when generating images with the prompt <dog> & <cat> & <vase> & <woman> & <sofa>. Since the training dataset contains no ground-truth image with multiple objects, image alignment cannot be evaluated and is therefore omitted. Following the same setting as the single-object comparison, five images are independently generated for each baseline, and the averaged results are reported. Figure 6 further shows images with multiple objects generated by the proposed method. It can be seen that the proposed method accurately preserves the main characteristics of the original objects in the merged images.
Building on the method introduced in the style fusion stage, two style LoRAs, pixel art and Ghibli style, are each merged independently into the previously merged object LoRA. For training-free methods (Model Average, Block Interleaved Average, K-LoRA, and TARA), the style LoRA is directly merged into the object LoRA. In contrast, for training-based methods with a single training stage (e.g., LoRACLR and ZipLoRA), all object LoRAs and the style LoRA are jointly trained to obtain the final merged LoRA.
Qualitative results of the different baselines after style fusion are shown in Figure 7. It can be observed that Model Average fails to generate images consistent with the intended styles, whereas the other methods preserve the style information from the style LoRA after merging. While several methods can generate objects that resemble those in the training dataset to some extent, the proposed method achieves the most consistent and visually coherent results. In addition, Table 5 presents the VGG style loss of the different baselines on the pixel art and Ghibli styles. Specifically, five classic images of each style are selected, and the Gram-matrix distance between the VGG19 feature maps of these images and of the images generated by each baseline is reported as the style loss [31].
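The style metric in Table 5 is the classic Gram-matrix distance of Gatys et al. [31]. A minimal sketch, assuming torchvision's pretrained VGG19 and a common choice of style layers (the exact layers used in the paper are not stated); inputs are expected to be ImageNet-normalized image batches.

```python
import torch
import torchvision.models as models

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
STYLE_LAYERS = {1, 6, 11, 20, 29}   # ReLU layers after conv1_1 ... conv5_1

def gram_matrix(feat):
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

@torch.no_grad()
def vgg_style_loss(generated, reference):
    """Mean squared distance between Gram matrices of VGG19 feature maps."""
    loss, x, y = 0.0, generated, reference
    for i, layer in enumerate(vgg):
        x, y = layer(x), layer(y)
        if i in STYLE_LAYERS:
            loss = loss + torch.mean((gram_matrix(x) - gram_matrix(y)) ** 2)
    return loss
```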

4.4. Ablation Study

4.4.1. Necessity of Discrete Wavelet Transform

In this section, we investigate the impact of incorporating discrete wavelet transform (DWT) filtering within the proposed method, specifically its effect on reducing the difficulty of the LoRA merge task by introducing orthogonal optimization subspaces. As shown in Figure 8, the learning curves of the object fusion stage with and without DWT filtering are compared, and a similar comparison for the style fusion stage is presented in Figure 9. The results indicate that applying DWT filtering leads to a consistently lower training loss in both the object and style fusion stages, suggesting a reduction in task difficulty. Moreover, convergence speed is significantly improved: with DWT filtering, the object training loss reaches 0.3 after approximately 400 steps, whereas without it a comparable loss level requires about 600 steps. In addition, the style training loss with DWT filtering converges to 0.21, which is much smaller than the converged loss without DWT filtering.
These findings demonstrate that discrete wavelet transform filtering not only facilitates faster convergence but also contributes to enhanced performance after the LoRA merge. The training loss curves also show that the proposed method is computationally efficient and requires very few training steps. In terms of wall-clock time, the object fusion stage requires approximately 3.5 s per training step, whereas the style fusion stage is more efficient, averaging 1.2 s per step. Consequently, the complete fusion process typically converges in under one hour.

4.4.2. Effects of Different LoRA Ranks

In this section, we investigate the impact of the LoRA rank within the object fusion stage. The style fusion stage is excluded from this analysis, as it employs off-the-shelf style LoRAs with a fixed rank of 16. We train object LoRAs with ranks $\{4, 8, 16, 32, 64\}$. As shown in Table 6, excessively low ranks yield suboptimal performance due to the limited number of learnable parameters. Conversely, while higher ranks theoretically offer greater capacity, they require more training iterations to converge because of the increased parameter count. Consequently, under a fixed training budget, the rank-64 model exhibits performance degradation compared with rank 16, as it remains under-trained.

5. Conclusions

In this paper, we presented ConWave-LoRA, a theoretically grounded framework designed to resolve the parameter pollution problem in multi-concept fusion. Moving beyond heuristic combinations, our approach establishes a hierarchical disentanglement strategy that respects the semantic order of image synthesis, prioritizing structural identity before stylistic rendering. Crucially, we leveraged the Discrete Wavelet Transform (DWT) to project conflicting concepts into orthogonal optimization subspaces. This mechanism effectively isolates high-frequency object perturbations from low-frequency style distributions, thereby mitigating gradient interference during the contrastive learning process. Extensive experimental results and ablation studies validate the effectiveness of the proposed method in combining multiple LoRAs while preserving the semantic integrity of each concept. This work offers new insights into the roles of different types of LoRAs in model merging. In future work, we plan to explore end-to-end strategies for LoRA fusion to further enhance efficiency and generalizability.

Author Contributions

Conceptualization, X.L.; methodology, X.L.; software, X.L.; validation, X.L.; formal analysis, X.L.; investigation, X.L.; resources, X.L.; data curation, X.L.; writing—original draft preparation, X.L.; writing—review and editing, X.H.; visualization, X.H.; supervision, Z.Y.; project administration, Z.Y.; funding acquisition, Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Construction of Doctoral Programs in Mathematics (Grant No. 109051360025XN087).

Data Availability Statement

The datasets and models that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  2. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv 2020, arXiv:2011.13456. [Google Scholar]
  3. Lipman, Y.; Chen, R.T.; Ben-Hamu, H.; Nickel, M.; Le, M. Flow matching for generative modeling. arXiv 2022, arXiv:2210.02747. [Google Scholar]
  4. Esser, P.; Kulal, S.; Blattmann, A.; Entezari, R.; Müller, J.; Saini, H.; Levi, Y.; Lorenz, D.; Sauer, A.; Boesel, F.; et al. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the Forty-First International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024. [Google Scholar]
  5. Wu, C.; Li, J.; Zhou, J.; Lin, J.; Gao, K.; Yan, K.; Yin, S.; Bai, S.; Xu, X.; Chen, Y.; et al. Qwen-Image Technical Report. arXiv 2025, arXiv:2508.02324. [Google Scholar] [CrossRef]
  6. Elsharif, W.; Alzubaidi, M.; She, J.; Agus, M. Visualizing Ambiguity: Analyzing Linguistic Ambiguity Resolution in Text-to-Image Models. Computers 2025, 14, 19. [Google Scholar] [CrossRef]
  7. Martini, L.; Iacono, S.; Zolezzi, D.; Vercelli, G.V. Advancing Persistent Character Generation: Comparative Analysis of Fine-Tuning Techniques for Diffusion Models. AI 2024, 5, 1779–1792. [Google Scholar] [CrossRef]
  8. Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22500–22510. [Google Scholar]
  9. Civitai. Available online: https://civitai.com/ (accessed on 1 November 2025).
  10. HuggingFace. Available online: https://huggingface.co/ (accessed on 1 November 2025).
  11. Low-Rank Adaptation for Fast Text-to-Image Diffusion Fine-Tuning. Available online: https://github.com/cloneofsimo/lora (accessed on 1 November 2025).
  12. Antona, H.; Otero, B.; Tous, R. Low-Cost Training of Image-to-Image Diffusion Models with Incremental Learning and Task/Domain Adaptation. Electronics 2024, 13, 722. [Google Scholar] [CrossRef]
  13. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations (ICLR), 2022. [Google Scholar]
  14. Li, Z.; Li, H.; Meng, L. Model Compression for Deep Neural Networks: A Survey. Computers 2023, 12, 60. [Google Scholar] [CrossRef]
  15. Zanella, M.; Ben Ayed, I. Low-rank few-shot adaptation of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 1593–1603. [Google Scholar]
  16. Tewel, Y.; Kaduri, O.; Gal, R.; Kasten, Y.; Wolf, L.; Chechik, G.; Atzmon, Y. Training-free consistent text-to-image generation. ACM Trans. Graph. TOG 2024, 43, 1–18. [Google Scholar] [CrossRef]
  17. Ouyang, Z.; Li, Z.; Hou, Q. K-lora: Unlocking training-free fusion of any subject and style loras. arXiv 2025, arXiv:2502.18461. [Google Scholar]
  18. Simsar, E.; Hofmann, T.; Tombari, F.; Yanardag, P. LoRACLR: Contrastive Adaptation for Customization of Diffusion Models. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10 June 2025; pp. 13189–13198. [Google Scholar]
  19. Shah, V.; Ruiz, N.; Cole, F.; Lu, E.; Lazebnik, S.; Li, Y.; Jampani, V. Ziplora: Any subject in any style by effectively merging loras. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 422–438. [Google Scholar]
  20. Meral, T.H.S.; Simsar, E.; Tombari, F.; Yanardag, P. Conform: Contrast is all you need for high-fidelity text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 9005–9014. [Google Scholar]
  21. Kim, G.; Kwon, T.; Ye, J.C. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2426–2435. [Google Scholar]
  22. Zeng, D.; Wu, Y.; Hu, X.; Xu, X.; Shi, Y. Contrastive learning with synthetic positives. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 430–447. [Google Scholar]
  23. Mittal, S.; Abstreiter, K.; Bauer, S.; Schölkopf, B.; Mehrjou, A. Diffusion based representation learning. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 24963–24982. [Google Scholar]
  24. Nam, H.; Kwon, G.; Park, G.Y.; Ye, J.C. Contrastive denoising score for text-guided latent diffusion image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 9192–9201. [Google Scholar]
  25. Dalva, Y.; Yanardag, P. Noiseclr: A contrastive learning approach for unsupervised discovery of interpretable directions in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 24209–24218. [Google Scholar]
  26. Liu, X.; Gong, C.; Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv 2022, arXiv:2209.03003. [Google Scholar] [CrossRef]
  27. Peng, Y.; Zheng, L.; Yang, Y.; Huang, Y.; Yan, M.; Liu, J.; Chen, S. TARA: Token-Aware LoRA for Composable Personalization in Diffusion Models. arXiv 2025, arXiv:2508.08812. [Google Scholar]
  28. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  29. Wu, X.; Sun, K.; Zhu, F.; Zhao, R.; Li, H. Human preference score: Better aligning text-to-image models with human preference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 2096–2105. [Google Scholar]
  30. Ghiurău, D.; Popescu, D.E. Distinguishing Reality from AI: Approaches for Detecting Synthetic Content. Computers 2025, 14, 1. [Google Scholar] [CrossRef]
  31. Gatys, L.A.; Ecker, A.S.; Bethge, M. A neural algorithm of artistic style. arXiv 2015, arXiv:1508.06576. [Google Scholar] [CrossRef]
Figure 1. Demonstration of the parameter pollution problem when applying multiple diffusion models. (a) Image generated using single model A. (b) Image generated using single model B. (c) Image generated using single model C. (d) Image generated using single model A. (e) Image generated using two models A and B. (f) Image generated using three models A, B, and C.
Figure 2. Overview of the proposed LoRA fusion method ConWave-LoRA.
Figure 3. Overview of datasets used for training object LoRAs.
Figure 4. Visualizations of high- and low-frequency images reconstructed from the latent space of FLUX.1-dev. The blurred appearance of the low-frequency decode results highlights the absence of contours and lines that represent objects (Best viewed when zoomed in).
Figure 5. Single-object images generated by the merged LoRA using the proposed method. Each column shows three images generated with different random seeds.
Figure 6. Multiple object images generated by the merged LoRA using the proposed method with prompt <dog> & <cat> & <vase> & <woman> & <sofa>.
Figure 7. Images with different styles generated by each baseline, using prompt <image style> & <dog> & <cat> & <vase> & <woman> & <sofa>.
Figure 8. Training loss in the object fusion stage with and without DWT.
Figure 9. Training loss in the style fusion stage with and without DWT.
Table 1. Text alignment results (mean ± std) on single-object image after merging. Best result for each object category is bolded.
Method | Dog | Cat | Vase | Woman | Sofa
Model Average | 0.245 ± 0.006 | 0.261 ± 0.004 | 0.273 ± 0.005 | 0.209 ± 0.009 | 0.257 ± 0.009
Block Interleaved Average | 0.239 ± 0.006 | 0.260 ± 0.005 | 0.270 ± 0.009 | 0.201 ± 0.007 | 0.258 ± 0.005
K-LoRA | 0.221 ± 0.003 | 0.243 ± 0.002 | 0.261 ± 0.003 | 0.213 ± 0.005 | 0.243 ± 0.004
TARA | 0.264 ± 0.007 | 0.263 ± 0.004 | 0.268 ± 0.007 | 0.227 ± 0.002 | 0.273 ± 0.005
LoRACLR | 0.270 ± 0.005 | 0.265 ± 0.003 | 0.283 ± 0.005 | 0.235 ± 0.008 | 0.280 ± 0.007
ZipLoRA | 0.262 ± 0.010 | 0.249 ± 0.008 | 0.271 ± 0.010 | 0.230 ± 0.009 | 0.267 ± 0.009
ConWave-LoRA | 0.271 ± 0.004 | 0.272 ± 0.002 | 0.292 ± 0.007 | 0.234 ± 0.010 | 0.287 ± 0.005
Table 2. Image alignment results (mean ± std) on single-object image after merging. Best result for each object category is bolded.
Method | Dog | Cat | Vase | Woman | Sofa
Model Average | 0.584 ± 0.013 | 0.602 ± 0.011 | 0.632 ± 0.015 | 0.613 ± 0.020 | 0.627 ± 0.010
Block Interleaved Average | 0.592 ± 0.023 | 0.594 ± 0.018 | 0.633 ± 0.023 | 0.607 ± 0.028 | 0.601 ± 0.021
K-LoRA | 0.597 ± 0.007 | 0.583 ± 0.010 | 0.594 ± 0.007 | 0.580 ± 0.014 | 0.574 ± 0.005
TARA | 0.832 ± 0.024 | 0.899 ± 0.017 | 0.873 ± 0.012 | 0.840 ± 0.032 | 0.892 ± 0.014
LoRACLR | 0.893 ± 0.035 | 0.927 ± 0.018 | 0.913 ± 0.024 | 0.869 ± 0.049 | 0.938 ± 0.020
ZipLoRA | 0.854 ± 0.027 | 0.901 ± 0.015 | 0.894 ± 0.018 | 0.862 ± 0.037 | 0.910 ± 0.015
ConWave-LoRA | 0.912 ± 0.031 | 0.948 ± 0.020 | 0.928 ± 0.022 | 0.874 ± 0.051 | 0.956 ± 0.013
Table 3. Human preference score (mean ± std) on single-object image after merging. Best result for each object category is bolded.
Method | Dog | Cat | Vase | Woman | Sofa
Model Average | 0.238 ± 0.004 | 0.232 ± 0.001 | 0.247 ± 0.002 | 0.241 ± 0.005 | 0.243 ± 0.002
Block Interleaved Average | 0.240 ± 0.007 | 0.228 ± 0.003 | 0.257 ± 0.001 | 0.253 ± 0.009 | 0.250 ± 0.004
K-LoRA | 0.251 ± 0.006 | 0.262 ± 0.004 | 0.270 ± 0.002 | 0.267 ± 0.003 | 0.257 ± 0.002
TARA | 0.270 ± 0.008 | 0.281 ± 0.006 | 0.284 ± 0.008 | 0.273 ± 0.004 | 0.275 ± 0.007
LoRACLR | 0.264 ± 0.010 | 0.265 ± 0.007 | 0.283 ± 0.004 | 0.271 ± 0.005 | 0.270 ± 0.009
ZipLoRA | 0.280 ± 0.005 | 0.276 ± 0.005 | 0.284 ± 0.007 | 0.279 ± 0.007 | 0.273 ± 0.010
ConWave-LoRA | 0.286 ± 0.008 | 0.305 ± 0.002 | 0.302 ± 0.002 | 0.286 ± 0.011 | 0.286 ± 0.004
Table 4. Evaluation results (mean ± std) on multiple-object image after merging. Best result for each evaluation metric is bolded.
Method | Text Alignment | Aesthetic Score
Model Average | 0.257 ± 0.022 | 0.308 ± 0.009
Block Interleaved Average | 0.252 ± 0.024 | 0.307 ± 0.012
K-LoRA | 0.243 ± 0.011 | 0.331 ± 0.004
TARA | 0.284 ± 0.015 | 0.341 ± 0.010
LoRACLR | 0.289 ± 0.020 | 0.340 ± 0.007
ZipLoRA | 0.290 ± 0.026 | 0.346 ± 0.008
ConWave-LoRA | 0.301 ± 0.013 | 0.358 ± 0.006
Table 5. VGG style loss (mean ± std) of different baselines after style fusion, lower value indicates better image style alignment. Best result for each image style is bolded.
Method | Pixel Art Style | Ghibli Style
Model Average | 0.068 ± 0.007 | 0.073 ± 0.011
Block Interleaved Average | 0.044 ± 0.008 | 0.050 ± 0.005
K-LoRA | 0.041 ± 0.009 | 0.051 ± 0.004
TARA | 0.034 ± 0.012 | 0.040 ± 0.008
LoRACLR | 0.029 ± 0.014 | 0.032 ± 0.009
ZipLoRA | 0.030 ± 0.007 | 0.028 ± 0.006
ConWave-LoRA | 0.024 ± 0.008 | 0.025 ± 0.007
Table 6. Evaluation results (mean ± std) on multiple objects image after merging trained LoRAs with different LoRA ranks. Best result for each evaluation metric is bolded.
LoRA Rank | Text Alignment | Aesthetic Score
4 | 0.254 ± 0.007 | 0.338 ± 0.003
8 | 0.262 ± 0.012 | 0.355 ± 0.007
16 | 0.301 ± 0.013 | 0.358 ± 0.006
32 | 0.309 ± 0.015 | 0.356 ± 0.009
64 | 0.299 ± 0.019 | 0.353 ± 0.007
