Personalized Text-to-Image Model Enhancement Strategies: SOD Preprocessing and CNN Local Feature Integration

Abstract: Recent advancements in text-to-image models have been substantial, enabling the generation of new images from personalized datasets. However, even within a single category, such as furniture, where structures vary and patterns are not uniform, generated images still fail to preserve the detailed information of the input images satisfactorily. This study introduces a novel method to enhance the quality of the results produced by text-to-image models. The method applies mask preprocessing with an image pyramid-based salient object detection (SOD) model, incorporates visual information into input prompts using concept image embeddings and a CNN local feature extractor, and includes a filtering process based on similarity measures. With this approach, we observed both visual and quantitative improvements in CLIP text alignment and DINO metrics, suggesting that the generated images follow the text prompts more closely and reflect the input image's details more accurately. The significance of this research lies in addressing one of the prevailing challenges in personalized image generation: consistently and accurately representing the detailed characteristics of input images in the output. The method enables more realistic visualizations through textual prompts enhanced with visual information, additional local features, and the removal of unnecessary areas using an SOD mask; it can also be beneficial in fields that prioritize the accuracy of visual data.


Introduction
Recent advancements in text-to-image models [1][2][3][4][5][6] have considerably improved the quality of generated images. Models like Stable Diffusion [1], which rely on large-scale datasets, can create realistic and imaginative images by capturing the intricate relationship between images and text learned from extensive image-text pair data. Consequently, users can now perform various application tasks, such as stylization [7][8][9][10] and editing [11][12][13]. However, users' ability to create images aligned with their individual concepts is limited. The results often fall short of expectations, even when users provide prompts with specific descriptions tailored to individual concepts, because the vast image-text pair data used for training lack information about personal concepts. This becomes an obstacle in applications that require precise generation of a desired object. To address this, recent research on personalized text-to-image models [14][15][16][17][18] has involved learning additional concepts from user image sets. Textual Inversion [16] inverts the input image into the text embedding space and subsequently learns new pseudo-words. DreamBooth [14] fine-tunes a diffusion model using several user-provided images and a unique identifier. Custom Diffusion [15] enhances memory efficiency by fine-tuning only the parameters in specific layers, enabling users to create personalized images based on newly learned words from their prompts. Figure 1 illustrates examples of samples generated using these models and our method.
Figure 1. Images generated using the personalized text-to-image models. We generate a personalized image using the input image representing the personal concept and the text prompt. We fine-tune the text-to-image model to ensure that the identifier <V> embedded in the prompt can encapsulate information about the concept. Our approach can preserve concept details by directly infusing visual information into the identifier.
Despite these advancements, certain issues persist, as exemplified in Figure 2, especially in categories like furniture, where structures and shapes vary significantly within the same class and where attention to detail is essential for high fidelity. Additionally, as illustrated in Figure 3, when the model learns not only the desired concept of the object from the input images but also unintended concepts (such as the background), the quality of the generated image degrades. This leads to a mix of undesirable elements in the generated images, which lowers their fidelity or hinders the creation of images that align with the text prompts, thus decreasing text-image alignment. This is especially detrimental when creating images for catalogs or advertisements, where an accurate depiction of the product's condition is critical.
This study introduces several methods that address the above-mentioned challenges of detail preservation and quality degradation. We conducted experiments using the Custom Diffusion [15] fine-tuning technique. Custom Diffusion fine-tunes Stable Diffusion [1], which was pretrained on large-scale text-image paired data. During fine-tuning, Custom Diffusion updates the parameters of specific layers only, thereby enhancing memory efficiency and processing speed. Building on Custom Diffusion, we introduce the following strategies.
First, to prevent nontarget features from influencing the generated results, superfluous background information is removed during preprocessing. This ensures a representation focused solely on the specific object of interest, eliminating potential distractions or interference from the background. To achieve this, a mask for the desired object within the concept image is extracted using InSPyReNet [19], a salient object detection (SOD) model. InSPyReNet was chosen for its superior capability in salient object detection, especially for high-resolution images. Subsequently, the concept image data, with the extraneous background removed via the extracted mask, are augmented and employed as the training dataset.
Second, the concept image is mapped into the textual word embedding space for further learning. The method maintains a detailed representation of the object by feeding information about the concept image into the text prompt, which acts as a condition in the diffusion network. Unlike previous methods that relied on text-image attention mechanisms, our approach utilizes a pretrained CLIP image encoder to extract image embeddings from concept images. Additionally, we introduced a CNN, ResNet-50 [20], to extract local features from these images, providing additional information on shapes, forms, and patterns. By extracting local features from the intermediate layers of a pretrained ResNet and combining them with image embeddings, we achieved a comprehensive feature representation. This combined feature information was then injected into the text embeddings corresponding to the identifier prompts. This method preserves details and structural information better than previous methods that initialized identifiers with arbitrary words.
Third, in the postprocessing step, the similarity between the generated images and the training images was assessed using a Siamese network [21], and results that fell below a predetermined threshold were discarded.
Finally, quantitative and qualitative evaluations were conducted, comparing our approach with existing models. By applying our strategy, we observed improved detail retention, as evidenced by the increased text alignment scores and DINO image alignment scores. The comprehensive workflow is illustrated in Figure 4.
We recognize several concurrent studies with similar themes [17,18]. ELITE [17] develops a training network that maps visual inputs to multiple textual embeddings using multi-layer embeddings, fine-tuning the attention layers of a pretrained text-to-image model by projecting the foreground object into the textual feature space. InstantBooth [18], on the other hand, utilizes a learnable image encoder to convert input images into textual tokens, employing these as conditions for the cross-attention layers. It also learns visual features through separate adapter layers and encoders for fine details. Our approach, like ELITE and InstantBooth, employs image encoders to map acquired image tokens into the textual space. However, unlike these methods, we introduce a separate network for extracting local features and combine these extracted features with the tokens to create new embeddings. These new embeddings are then used to fine-tune the attention layers. Additionally, we employ the cosine similarity between the embeddings of the generated sample images and the concept images during training, guiding the generated samples to more closely resemble the input images.
Figure 4. (1) From the concept image, we remove the background using the SOD model, InSPyReNet [19], and then obtain image embeddings using the CLIP [22] image encoder. (2) We extract local features from the concept image using a CNN. (3) We concatenate the image embeddings with the extracted local features to form a new visual feature. Next, we replace the embeddings corresponding to the identifier in the original prompt, as produced by the text encoder, to obtain the modified embeddings. (4) The Siamese network [21] is then used to measure the similarity between the samples generated through the modified prompt embeddings and the concept images. Finally, we obtain images with high similarity as the final results.
In summary, the contributions of our study are as follows:

• We have introduced the use of a salient object detection (SOD) mask in the preprocessing phase to remove information other than the prominent object. This ensures that the image generation process is focused on the target object, thereby avoiding the degradation of image quality by irrelevant information.
• Instead of relying on text embeddings that represent identifiers, we have mapped the image embeddings obtained from concept images and provided additional local features. This has improved the detail preservation performance of our models.
• We have employed a Siamese network in the postprocessing phase to compare the similarity between the generated images and the concept images, which allows for quality control. This ensures that only the images with high fidelity are selected, enhancing the overall quality of the output.

Related Work

Text-to-Image Models
A text-to-image model generates images based on user-provided text, allowing users to influence the resulting image directly. Recently, deep learning-based text-to-image models have garnered significant attention. Current research on deep text-to-image models primarily centers around generative adversarial networks (GANs) [23][24][25][26][27], variational autoencoders (VAEs) [28,29], and diffusion-based models [1,2,14]. However, GAN- and VAE-based models exhibit limitations, particularly when precise objects or feature placements are required. These models also struggle when generating images with intricate patterns and structures, such as faces, eyes, noses, mouths, or complex decorations. Furthermore, even when these models produce reasonably plausible images, they fall short of closely aligning with the provided text prompts. In contrast, recent large-scale models leverage extensive training datasets containing text-image pairs to generate more realistic and intricate images. DALL-E [30] demonstrated impressive results by employing an autoregressive model, while DALL-E 2 [2], Imagen [3], Stable Diffusion [1], and others incorporate text encoders trained on large-scale data, enabling enhanced control during image synthesis. Moreover, researchers are increasingly harnessing the control capabilities of diffusion-based models pretrained on extensive image-text data for image editing and style transfer. SINE [31] employs a pretrained large-scale diffusion model for single-image editing and style transfer. GLIGEN [32] explores image inpainting by introducing additional layers and incorporating various conditions beyond text, such as bounding boxes and keypoints. ControlNet [33] is a neural network designed to add spatial condition control to large-scale pretrained text-to-image diffusion models. It connects its trainable layers to the pretrained model through "zero convolutions" so that the added parameters can be adjusted safely, and it demonstrates robust learning under various conditions and across large and small datasets.

Personalized Image Generation
While text-to-image models [31][32][33] have made significant strides in providing precise control with textual guidance, they are often limited to generating generic instances. In contrast, personalized image generation takes user-defined concepts as input, allowing for the precise editing and transformation of these concepts. Numerous studies [14][15][16][17][18] have delved into this domain, employing various techniques to achieve personalized image manipulation. In GAN-based models, the GAN-inversion method [10,26,27,[34][35][36]] has been commonly employed for image editing and personalized image creation. The method projects an image directly into the latent space, obtains an edited latent code, and subsequently generates the edited image through the generator. GAN-based approaches have primarily been used for tasks like overall image style transfer [10,26,27], facial expression changes [35,36], and age modifications [34]. More recently, methods for personalized image generation have emerged that leverage pretrained large-scale text-to-image models. One such approach, Textual Inversion [16], discovers new embeddings within the embedding space that represent the user-provided visual concept. Subsequently, a new image is generated using the pseudo-word associated with this embedding. Similar to Textual Inversion, DreamBooth [14] takes images representing the concept as input, together with the class of the instance; it then fine-tunes the model and binds the concept to a unique identifier. This method allows for learning new concepts with higher fidelity and addresses language drift. Custom Diffusion [15], an extension of this technique, has demonstrated satisfactory performance improvements with faster fine-tuning, achieved by updating only the parameters of the cross-attention layer. ELITE [17] employs local mapping and multi-layer global mapping networks to preserve the details when encoding visual concepts into textual embeddings. Similarly, InstantBooth [18] maps input images to the textual space and introduces adapter layers to inject identity information from the input images into the backbone model.

Salient Object Detection
SOD aims to identify and segment the most attention-grabbing object or region within an image. Hou et al. [37] incorporated short connections into the skip-layer structure of the holistically nested edge detection [38] framework. Each layer within this architecture yields rich multi-scale feature maps. Xie et al. [39] addressed the challenges posed by the shallow layers of the backbone network, which struggle to acquire global semantic information. Incorporating fully convolutional networks [40] and multi-path recurrent feedback mechanisms was instrumental in enhancing performance. Moreover, Pang et al. [41] proposed aggregate interaction modules to effectively integrate features from neighboring levels while mitigating noise. Additionally, the InSPyReNet [19] framework introduced a novel pyramid blending method, which systematically synthesizes two distinct pyramids derived from low- and high-resolution scales for high-resolution SOD.

Image Similarity Comparison
Traditional methods for comparing image similarity encompass pixel-based [42] and structural feature-based approaches [43][44][45]. Pixel-based methods that rely on direct pixel comparisons are susceptible to variations in lighting, scale, or viewing angle. Therefore, the structural similarity index [42] was introduced to capture perceptual changes in images rather than mere pixel-level differences. In contrast, feature-based methods, such as the scale-invariant feature transform [43] and speeded-up robust features [44], employ key points and descriptors for image comparison. These methods are robust to transformations and occlusions. In recent years, the field has witnessed the emergence of neural network-based methods [21,46] for image similarity comparison. One approach involves extracting feature maps from the intermediate layers of pretrained deep learning models, such as VGG [47] or ResNet [20]. Subsequently, metrics such as cosine similarity or Euclidean distance are computed, and if the resulting similarity surpasses a predefined threshold, the images are deemed to represent the same object. Another neural network-based approach is the Siamese network [21], comprising two subnetworks that share identical weights. It calculates the distance between the feature vectors extracted from two input images. Expanding on this, the triplet network [46] processes three input images, referred to as the anchor, positive, and negative samples. The aim is to ensure that the anchor image is closer in feature space to the positive image (same class) than to the negative image (different class). This study employs the Siamese network to determine whether a generated sample and a reference concept image depict the same object.

Method
We aim to generate images that faithfully represent the underlying concept by employing a pretrained text-to-image model [1]. First, we utilized an SOD network [19] to extract masks corresponding to the target salient objects while eliminating extraneous background information unrelated to the concept. To integrate visual features from concept images, we concatenated the embeddings derived from the CLIP [22] image encoder with the local features of the concept image extracted using the CNN and replaced segments of the textual prompt embeddings with this integrated information. Next, we evaluated the similarity between the samples generated from the adapted prompts and the reference concept images using a Siamese network. Finally, we filtered out any results below a predefined similarity threshold to obtain the final outcome. This section presents an overview of the proposed method, beginning with background information on the large-scale text-to-image model in Section 3.1. Section 3.2 outlines our proposed data preprocessing methods, Section 3.3 details the procedure for inserting image embeddings into the text embedding, and Section 3.4 elaborates on postprocessing techniques that leverage image similarity measurements.

Text-to-Image Diffusion Models
We employ Stable Diffusion [1], a text-to-image diffusion model comprising several components and trained on large-scale image-text pairs. Initially, the autoencoder's encoder, $\mathcal{E}$, is trained to map the input image $x$ to the spatial latent code $z = \mathcal{E}(x)$. The decoder, $D$, learns to map the latent code back to the image, such that $D(\mathcal{E}(x)) \approx x$. The diffusion model, like other generative models, models the conditional distribution $p(z \mid y)$. In the text-to-image task, image generation is controlled by the input text condition $y$. To preprocess $y$, the CLIP text encoder $c_\theta$ maps $y$ to an intermediate representation, and a cross-attention layer calculates the correlation between the text and image. The objective of this conditional latent diffusion model is as follows:
$$L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[ \left\lVert \epsilon - \epsilon_\theta\left(z_t, t, c_\theta(y)\right) \right\rVert_2^2 \right]$$
where $t$ is the time step, $z_t$ is the latent code noised to step $t$, $\epsilon$ is the sampled noise, and $\epsilon_\theta$ is the denoising network. In the cross-attention layer, the query is computed from the latent image features as $Q = W_Q \cdot \varphi(z_t)$, the key as $K = W_K \cdot c_\theta(y)$, and the value as $V = W_V \cdot c_\theta(y)$, where $W_Q$, $W_K$, and $W_V$ are the weight parameters of the query, key, and value projection layers, respectively. The attention mechanism is then executed as a weighted sum over the value features:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right) V$$
where $d$ represents the output dimension of the key and query. The latent image features are updated using the attention block output. During fine-tuning, we adjust the distribution mapping from images to text, drawing inspiration from Custom Diffusion and specifically updating the parameters $W_K$ and $W_V$ of the diffusion model.
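The cross-attention step described above can be illustrated with a minimal NumPy sketch. It is not the Stable Diffusion implementation: the feature sizes are toy values, the weights are random, the function name `cross_attention` is hypothetical, and features are stored as rows, so projections are written as `features @ W` rather than `W · features`.

```python
import numpy as np

def cross_attention(image_feats, text_feats, W_q, W_k, W_v):
    """Single-head cross-attention: queries from image latents, keys/values from text."""
    Q = image_feats @ W_q                          # (n_positions, d)
    K = text_feats @ W_k                           # (n_tokens, d)
    V = text_feats @ W_v                           # (n_tokens, d_v)
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # scaled dot-product scores
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over text tokens
    return weights @ V, weights                    # weighted sum over value features

# Toy sizes chosen only for illustration (assumptions, not the paper's dimensions).
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(16, 32))  # 16 latent positions, 32 channels
text_feats = rng.normal(size=(5, 24))    # 5 prompt tokens, 24-dim text features
W_q = rng.normal(size=(32, 8))
W_k = rng.normal(size=(24, 8))           # only W_k and W_v would be updated
W_v = rng.normal(size=(24, 8))           # during Custom Diffusion-style fine-tuning
out, weights = cross_attention(image_feats, text_feats, W_q, W_k, W_v)
```

Each row of `weights` is a distribution over prompt tokens, which is why changing only the key and value projections is enough to alter how text conditions each latent position.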

Data Preprocessing with SOD
If the object to be personalized is not clearly highlighted in the concept images used as input data, generation quality can suffer. For example, if multiple objects are captured or if the background has too much influence (as seen in Figure 3), unwanted objects may contribute to the generation process and impede accurate creation. Moreover, backgrounds that do not match the context of the input text may be reproduced as the background of the generated image. To address these issues, we propose a preprocessing method for concept images. The aim of our proposed method is to detect the object to be preserved in the concept image and filter out all other parts. Because identifying objects other than the target object is unnecessary, we propose using a salient object detection model that identifies the most visually salient object within the image. In this study, we used high-resolution images collected from Unsplash and the high-resolution image dataset BIG as our experimental data. We therefore employed InSPyReNet [19], which has demonstrated superior salient object detection capabilities on high-resolution images.
Given an original concept image set, $X_o$, containing $N$ images, i.e., $X_o = \{x_o^n\}_{n=1}^{N}$, we used InSPyReNet to detect only the most salient object in each concept image and obtain the mask $m_o^n$ for this object. Here, $m_o^n$ is the mask for the salient object of the $n$th concept image, in which object pixels have a value of 1 and all remaining pixels have a value of 0. Using the concept images and the masks for the salient objects, we obtained the final dataset for training, $X_m$, as follows:
$$X_m = \{x_o^n \odot m_o^n\}_{n=1}^{N}$$
where $\odot$ denotes element-wise multiplication. We used $X_m$ with random augmentation during training, and we did not augment it during inference.
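The masking step can be sketched as an element-wise product between each image and its binary mask. The snippet below is a minimal illustration that assumes the mask has already been produced by an SOD model; the function name `apply_sod_mask` and the tiny synthetic image are hypothetical, and no InSPyReNet call is shown.

```python
import numpy as np

def apply_sod_mask(image, mask):
    """Zero out every pixel outside the salient object.

    image: (H, W, 3) array; mask: (H, W) array with 1 on the object and 0 elsewhere.
    """
    if image.shape[:2] != mask.shape:
        raise ValueError("image and mask must share height and width")
    # Broadcast the mask over the channel axis and multiply element-wise.
    return image * mask[..., None].astype(image.dtype)

# Synthetic 4x4 example: a uniform gray image with a 2x2 "object" in the center.
image = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1
masked = apply_sod_mask(image, mask)
```

Applying the same function to every image in the concept set yields the masked training set; random augmentation would then be applied on top of these masked images during training only.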
Figure 5 illustrates the generation results of our method (using SOD preprocessing) compared to those of previous studies [14][15][16] that did not remove areas interfering with the preservation of the concept. The prior methods tend to reflect all features included in the concept image, regardless of the content of the prompt. That is, they attempt to replicate not only the bicycle, which is the object to be preserved, but also the blue wall, the beige floor, and even the composition of the wall and floor. In contrast, with our preprocessing method, the identity of the target object is preserved without being affected by unnecessary regions.
Figure 5. Comparison with previous methods that did not employ SOD mask preprocessing. We compared our method with previous approaches [14-16] that did not remove the backgrounds from the input images, wherein the bicycle is intended to be preserved as the concept. The earlier methods are affected by extraneous information beyond the concept, such as the colors and composition of walls and floors, regardless of the text prompts entered. In contrast, our preprocessing removes such information, resulting in the generation of images that better align with the given conditions.
InstantBooth [18] is similar to our method in that it uses mask-based image preprocessing. However, there is a difference in the process of detecting the concept object and generating masks. While InstantBooth employs entity segmentation models, our approach utilizes SOD models. SOD models may have lower detection capabilities for multiple objects compared to entity segmentation models, but they are more suited for identifying the most prominent object within an image and offer computational efficiency with faster processing speeds. These features are particularly advantageous for on-site applications requiring immediate image editing. For example, when taking product photographs in a store and needing to edit them on the spot to meet customer requirements, the quick preprocessing provided by SOD models presents a significant benefit.

Concept Embedding to Text Prompt
We inject the features of the concept image into the textual embedding to obtain a new text embedding. These new text embeddings are created by combining the image embedding of the concept image with the additional local features obtained from the concept image, replacing the embedding corresponding to the identifier. Our approach has demonstrated an improved capability to preserve the details of the concept over the previous methods [14][15][16] that employ random initialization of word embeddings corresponding to the identifier.
Initially, for a given target concept image for personalization, the text prompt must be modified accordingly. Drawing inspiration from DreamBooth [14], we inserted a unique identifier before the class noun to avoid the overhead caused by detailed descriptions of the concept image set. For instance, suppose the modified prompt takes the following form: 'A photo of [V] chair', where '[V]' serves as the unique identifier and 'chair' represents the class noun. We denote this modified text prompt as $p$ and utilize the CLIP text encoder to generate a text embedding, denoted as $f_p = \mathrm{CLIP}_{Text}(p)$. Subsequently, as depicted in Figure 6, by leveraging the pretrained CLIP image encoder, we obtain a feature vector, $f_k = \mathrm{CLIP}_{Image}(I_c)$, for the concept image $I_c$. Additionally, to utilize the local features of the concept image, which capture specific regions, patterns, and structures within the image, we extracted the local features $f_l$ of the concept using a CNN. Specifically, local features describing shapes and forms were extracted from the pretrained ResNet-50 [20]. The initial blocks of ResNet-50 output low-level features, such as edges and corners, whereas the intermediate blocks extract mid-level features representing patterns, structures, and forms. The later stages handle more complex and abstract high-level features. Therefore, we used the local features, $f_l$, of the concept extracted from the intermediate blocks of ResNet-50 to obtain additional information for preserving details, such as the structure, form, and patterns of the concept image. The image embedding, $f_k$, and local feature, $f_l$, are concatenated to form a new image embedding, $f_c = \mathrm{Concat}(f_k, f_l)$. To prevent the influence of $f_l$ from becoming dominant, normalization and rescaling were performed on $f_c$. Then, we found the embedding corresponding to the identifier, [V], within $f_p$ and replaced it with $f_c$. In this process, a trainable fully connected (FC) layer is introduced to match the dimension of the new image embedding to that of the text embeddings (768 dimensions). The resulting finalized text embedding, $T$, is used as a conditional input for the diffusion model, and fine-tuning is conducted through attention operations with the concept image at the cross-attention layer of the pretrained diffusion U-Net. Figure 7 visually demonstrates that the pattern of the concept image is maintained more accurately when applying the proposed image embedding technique that incorporates local features.
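The embedding replacement described in this section can be sketched as follows. This is a minimal NumPy illustration under several assumptions: the CLIP and ResNet-50 features are replaced by random vectors, the local feature size (1024), the prompt length (6 tokens), and the identifier position (index 4) are hypothetical, each part is L2-normalized before concatenation as one possible reading of the normalization step, and the FC layer is a random linear map rather than a trained one.

```python
import numpy as np

def l2_normalize(v, eps=1e-8):
    """Scale a vector to unit length so no feature source dominates by magnitude."""
    return v / (np.linalg.norm(v) + eps)

def build_concept_embedding(f_k, f_l, W_fc, b_fc):
    """Concatenate image embedding f_k and local feature f_l, then project with an FC layer."""
    f_c = np.concatenate([l2_normalize(f_k), l2_normalize(f_l)])
    return f_c @ W_fc + b_fc   # projected to the text embedding dimension

def inject_identifier(f_p, identifier_index, concept_token):
    """Return a copy of the prompt embedding with the identifier row replaced."""
    T = f_p.copy()
    T[identifier_index] = concept_token
    return T

rng = np.random.default_rng(0)
f_k = rng.normal(size=768)        # stand-in for the CLIP image embedding
f_l = rng.normal(size=1024)       # stand-in for pooled mid-level ResNet-50 features
W_fc = rng.normal(size=(768 + 1024, 768)) * 0.01   # FC weights (trainable in practice)
b_fc = np.zeros(768)
f_p = rng.normal(size=(6, 768))   # stand-in for the text embedding of the prompt

concept_token = build_concept_embedding(f_k, f_l, W_fc, b_fc)
T = inject_identifier(f_p, identifier_index=4, concept_token=concept_token)
```

Only the row at the identifier position changes, so the class noun and the rest of the prompt keep their original text embeddings while the identifier carries the visual information.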
Our approach shares similarities with ELITE [17] and InstantBooth [18] in that it uses an image encoder to map visual features into the textual space. However, the key distinction lies in how the visual feature that replaces the text embedding is obtained. In contrast to the approaches used by ELITE and InstantBooth, our method utilizes the intermediate blocks of a pretrained ResNet-50 as a local feature extractor, combining the extracted local features with image embeddings to create new image embeddings that incorporate these local features. These are then mapped to the word embeddings corresponding to the identifiers that serve as conditions for the diffusion process.

Image Similarity Measurement Using Siamese Networks
In postprocessing, we selected the final images by assessing the similarity between each generated image and the concept images. We employed a Siamese neural network [21] designed to quantify image similarity. This model is trained to distinguish between pairs of images and calculate a similarity score. The Siamese network has two identical subnetworks, each taking one input image and producing a corresponding feature vector. The final similarity score is computed based on the Euclidean distance between these feature vectors. Siamese network training uses a contrastive loss function tailored to determine whether two input samples are similar or dissimilar. This loss is defined as
$$L_{contrastive} = Y \cdot D^2 + (1 - Y) \cdot \max(0, M - D)^2$$
where $Y$ represents a binary class variable that signifies whether the image pair belongs to the same class (1) or different classes (0). Variable $D$ is the Euclidean distance between the feature vectors of the image pair produced by the Siamese network, and $M$ is the hyperparameter that serves as a margin to determine the desired separation between embeddings for different pairs. We trained the Siamese network by pairing concept images with images of different objects within the same class. Subsequently, we paired the images generated by diffusion with the concept images and ranked them by their similarity scores, as represented by Euclidean distance. We removed the bottom 40% of images to filter out those with insufficient similarity, resulting in the final samples. Figure 8 illustrates the similarity scores computed using the Siamese network.
We utilized two types of loss functions during training. First, $L_{LDM}$ trains the model to recover the latent representation of the input image from noise, enabling the effective reconstruction of complex textures and details in the image. It also incorporates text prompts as a condition to properly adjust the relationship between text and image. Here, instead of using $y$ for the conditional input, we used the text embedding $T$ newly obtained from the image embedding and local feature. The newly defined $L_{LDM}$ is as follows:
$$L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\, T,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[ \left\lVert \epsilon - \epsilon_\theta\left(z_t, t, T\right) \right\rVert_2^2 \right]$$
Second, we introduced a cosine similarity loss to train the FC layer appended to the image encoder and the ResNet used as a local feature extractor. While $L_{LDM}$ focuses on the accuracy of image reconstruction, the cosine similarity loss $L_{cos}$ is centered on enhancing the similarity between the embeddings of the generated images and the embeddings of the concept images. The cosine similarity loss is a well-established loss function that has been applied in various studies. Barz et al. [48] demonstrated the usefulness of the cosine loss in maximizing performance, especially on small datasets with a limited amount of training data. Our method, which utilizes only 4 to 8 input images during fine-tuning, introduces the cosine loss to maximize the similarity between the concept images and the generated images. Moreover, SimCLR [49] measured the cosine similarity between the image embeddings of two images to assess their similarity. SimCLR includes both similar and dissimilar image pairs during training, computes their cosine similarities, and defines its loss by applying a softmax over them. However, our study focuses solely on ensuring that the generated images are similar to the concept images; we therefore measure only the cosine similarity between the embeddings of the concept image and the generated image, and the cosine similarity loss is as follows:
$$L_{cos} = 1 - \frac{f_c \cdot f_g}{\lVert f_c \rVert \, \lVert f_g \rVert}$$
where $f_c$ represents the concept image embedding concatenated with the local feature, and $f_g$ is the image embedding of the generated sample. When the embeddings are highly similar, the loss approaches 0, indicating a closer match between the generated sample and the concept image. Conversely, the loss increases as the embeddings diverge. We defined the total loss function by combining these two loss functions and applied it during training. Through experimentation, we found that if the proportion of the cosine loss is excessively large, it yields good results in stylization but fails to properly reflect the text prompts in editing. Conversely, if the proportion is too small, the opposite occurs. To solve this issue, we introduced a learnable parameter, $\alpha$, that applies an appropriate ratio to the cosine loss, thereby determining the final cosine loss. The initial value of $\alpha$ is set to 0.5, and the overall loss is as follows:
$$L_{total} = L_{LDM} + \alpha \cdot L_{cos}$$
We updated the FC layer that was additionally connected for image embeddings, the ResNet for local feature extraction, and only the key and value weights of the cross-attention layer of the diffusion network, as Custom Diffusion [15] has shown that updating only the key and value weights of the cross-attention layer is sufficient to improve the model's understanding of text-image pairs. All other parameters are frozen.
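The Siamese postprocessing described in this section can be sketched in plain Python. The sketch assumes a common form of the contrastive loss, $Y \cdot D^2 + (1 - Y) \cdot \max(0, M - D)^2$, which may differ from the authors' formulation by constant factors; the function names and the example distances are hypothetical, and the distances stand in for the output of a trained Siamese network.

```python
def contrastive_loss(distance, same_class, margin=1.0):
    """Contrastive loss for one pair (assumed form, Y = 1 for same-class pairs).

    Same-class pairs are pulled together; different-class pairs are pushed
    apart until their distance exceeds the margin.
    """
    if same_class:
        return distance ** 2
    return max(0.0, margin - distance) ** 2

def filter_by_similarity(distances, drop_fraction=0.4):
    """Return indices of the samples to keep, most similar first.

    A smaller Euclidean distance to the concept image means higher similarity,
    so the samples with the largest distances are dropped.
    """
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    n_drop = int(len(distances) * drop_fraction)
    return order[:len(distances) - n_drop]

# Hypothetical distances between five generated samples and the concept image.
distances = [0.9, 0.2, 1.5, 0.4, 0.7]
kept = filter_by_similarity(distances)
```

With five samples and a drop fraction of 0.4, the two samples with the largest distances are discarded and the remaining three are returned in order of increasing distance.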
The training process is shown in Figure 9, and the effects of L cos are described in Section 4.3.2.
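The display equations for the losses did not survive in this copy of the text, so the following is only a minimal sketch of one form consistent with the description above. It assumes that the cosine similarity loss is 1 − cos(f_c, f_g), which is 0 for perfectly aligned embeddings and grows as they diverge, and that the total loss is L_LDM + α · L_cos. Both forms, the function names, and the lower clamp of α at 0 are assumptions for illustration; in the actual method α is a learnable parameter rather than a fixed float.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_loss(f_c, f_g):
    """Assumed form 1 - cos(f_c, f_g): 0 when the embeddings are parallel."""
    return 1.0 - cosine_similarity(f_c, f_g)


def total_loss(l_ldm, l_cos, alpha=0.5):
    """Assumed combination L_LDM + alpha * L_cos.

    alpha starts at 0.5 and is kept at or below 1, as stated in the text;
    the lower bound of 0 is an additional assumption of this sketch.
    """
    alpha = min(max(alpha, 0.0), 1.0)
    return l_ldm + alpha * l_cos


# Toy usage with hand-written 2-D "embeddings".
f_c = [1.0, 2.0]   # stands in for the concept embedding plus local feature
f_g = [2.0, 4.0]   # stands in for the generated-sample embedding
loss = total_loss(l_ldm=0.2, l_cos=cosine_loss(f_c, f_g), alpha=0.5)
```

Under these assumptions, a larger α pulls the generated embedding toward the concept embedding more strongly, which matches the reported trade-off between stylization fidelity and prompt adherence.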

Results Per Training Epochs
To verify the performance of our method, we examined the visual outcomes generated at each stage of the generation process and measured the image alignment, text alignment, and DINO scores epoch by epoch. As seen in Figure 10, the cup images, which have relatively simple patterns and structures in the input image, are well represented by both the baseline model [15] and our model. However, the baseline tends to over-imitate the input image, causing overfitting and an inability to properly follow the prompt. Conversely, our method followed the prompt without being affected by the background, demonstrating the effectiveness of the preprocessing method. Similarly, for the clock input images at the bottom, our method was less affected by background noise and better captured the color and structure of the clock than the baseline model. This improvement appears to originate from the additionally learned local features, which provide color and pattern details for the concept image. Moreover, applying the cosine similarity loss during training seems to have effectively preserved the concept by enhancing the similarity between the embeddings of the generated samples and the concept image. Figure 11 illustrates how the image alignment, text alignment, and DINO scores change over the epochs. As training progresses, the gradual increase in the image alignment and DINO scores indicates effective learning in preserving the concept image. Furthermore, the improvement in text alignment demonstrates that the model is learning to generate images that faithfully follow the text prompts.

Experiment
We present the datasets and evaluation metrics used in our experiments, and subsequently compare and analyze the results obtained using our method with those of existing approaches.

Datasets
We conducted experiments on 10 target datasets, encompassing various categories, including furniture items, such as chairs, tables, beds, and sofas, as well as animals, such as dogs and cats.
The images used for the experiments were sourced from the BIG [50] dataset, which includes high-resolution images ranging from 2048 × 1600 to 5000 × 3600, and from Unsplash [51], which is known for providing copyright-free high-resolution images. In addition, we extracted class-specific images from the large-scale text-image dataset LAION-400M [52], using them as the regularization dataset and as positive or negative data for training the Siamese network.
Furthermore, to enhance the reliability of our experiments, we utilized images that we had captured ourselves.This approach played a significant role in assessing the network's performance compared to existing datasets and in verifying the applicability of our research findings.
Figure 12 showcases one or more sample images for each subject. Figure 13 shows the results of the experiments with our own dataset. To evaluate general performance, we compared the generation results of our method with those of previous methods using self-captured data at resolutions below 1000 × 1000. For the cat toy (left), both Custom Diffusion and DreamBooth were influenced by background information and failed to generate accurate images corresponding to the prompt. Textual Inversion produced completely different images, whereas our method represented the cat's face and the context of the prompt relatively well. Similarly, for the flowerpot (right), the previous methods were affected by the background area, or the number and shape of the black labels differed from the input image. Our method, however, accurately depicted the location and number of the labels. Nevertheless, we observed that there are still limitations in depicting very fine details, such as the content of the text.

Evaluation Metrics
We employed several metrics to evaluate the fidelity of the generated images, the distributional similarity between the concept and generated images, and the alignment between the given prompt and the image. First, we evaluated the similarity between the generated and actual images using the CLIP [22] image alignment and DINO [53] evaluation metrics. CLIP image alignment measures the pairwise cosine similarity between the embeddings of the generated and real images, reflecting their semantic content alignment. In contrast, DINO assesses the cosine similarity between the DINO embeddings of the images, focusing on the fidelity and distinctiveness of the features and structures within the image. In other words, CLIP image alignment quantifies the 'similarity' of the content of two images, while DINO helps distinguish the detailed differences between images or objects within the same class. Additionally, we calculate the kernel inception distance (KID) [54] between the generated and concept images to measure distributional similarity. Furthermore, CLIP text alignment is computed to assess the alignment between the given prompt and image by measuring the average cosine similarity between the prompt and image embeddings. The results of these measurements are given in Table 1. Our method shows an increase in the CLIP-T and DINO metrics compared to the previous methods. The higher CLIP-T suggests that the images generated by our approach more closely follow the given prompts. We surmise that this results from the preprocessing technique we introduced, which reduces background interference. The increase in DINO also indicates an improved ability to preserve objects and patterns within the image. However, we observed a decline in performance on the CLIP-I and KID metrics compared to Custom Diffusion [15]. This can be attributed to the fact that the images generated by the previous methods retain more of the input image's background, which is also included in the concept images against which these metrics are measured.
Table 1. Quantitative evaluation comparison. CLIP image alignment and CLIP text alignment are denoted as CLIP-I and CLIP-T, respectively. Compared to existing models, we observed improvements in the CLIP-T and DINO [53] metrics. Our proposed approach more accurately represents images related to the prompt's description and discerns finer details within the images.
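As a minimal sketch of how the alignment scores described above could be aggregated, the snippet below computes a mean pairwise cosine similarity between two sets of image embeddings (as in CLIP-I or DINO) and a mean cosine similarity between one prompt embedding and each image embedding (as in CLIP-T). The function names and the toy 2-D vectors are hypothetical; in practice the embeddings would come from the CLIP or DINO encoders, which are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def image_alignment(generated, real):
    """Mean pairwise cosine similarity between generated and real image embeddings."""
    scores = [cosine(g, r) for g in generated for r in real]
    return sum(scores) / len(scores)


def text_alignment(prompt_embedding, generated):
    """Mean cosine similarity between a prompt embedding and each image embedding."""
    scores = [cosine(prompt_embedding, g) for g in generated]
    return sum(scores) / len(scores)


# Toy embeddings standing in for encoder outputs.
generated = [[1.0, 0.0], [0.0, 1.0]]
real = [[1.0, 0.0]]
prompt = [1.0, 1.0]

clip_i = image_alignment(generated, real)
clip_t = text_alignment(prompt, generated)
```

Because the image score averages over every generated-real pair, background content shared by the concept images and the generated images raises it, which is consistent with the explanation given for the CLIP-I gap.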

Implementation Details
We employed Stable Diffusion [1] v1-4 as a pretrained [55], large-scale text-to-image model for the experiment. For image embeddings, we employed the CLIP image encoder with an additional FC layer. To extract local features, we used the layers of a pretrained ResNet up to its intermediate blocks as our CNN local feature extractor. During training, all parameters were frozen except for the diffusion cross-attention layer, the FC layer of the image encoder, and the CNN local feature extractor. To optimize the image encoder and CNN, we incorporated a cosine similarity loss weighted by α, which was initialized at 0.5 and constrained not to exceed 1. During data preprocessing, we superimposed the salient object extracted using the SOD mask onto various monochromatic backgrounds. This strategy accentuated the object information against the plain backgrounds. We set the batch size to 4 and adapted the number of training steps to the category. Specifically, objects such as pieces of furniture (e.g., chairs and tables) that exhibit complex patterns or an inconsistent number of legs were trained for more than 500 steps, whereas concepts with more straightforward forms were trained for 250 steps. We trained the networks using a learning rate of $1 \times 10^{-5}$ on an Intel Core i7-10700 processor clocked at 2.9 GHz with two NVIDIA RTX 3090 GPUs, each equipped with 24 GB of GPU memory.
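The background replacement step described above can be sketched as follows. This is an illustrative example only: it assumes a binary mask (1 for the salient object, 0 elsewhere), whereas an SOD model typically outputs a soft saliency map that would first need thresholding, and the function names and the nested-list image format are hypothetical.

```python
def composite_on_background(image, mask, color):
    """Keep pixels where the mask is 1 and fill the rest with one solid color.

    image: H x W list of (r, g, b) tuples
    mask:  H x W list of 0/1 values (1 = salient object); binary by assumption
    color: (r, g, b) background color
    """
    return [
        [pixel if m else color for pixel, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


def augment_with_backgrounds(image, mask, colors):
    """Return one composited copy of the image per background color."""
    return [composite_on_background(image, mask, c) for c in colors]


# Toy 1 x 2 image: the first pixel belongs to the object, the second is background.
image = [[(10, 20, 30), (40, 50, 60)]]
mask = [[1, 0]]
variants = augment_with_backgrounds(image, mask, [(0, 0, 0), (255, 255, 255)])
```

Each variant keeps the object pixels unchanged and differs only in the fill color, so the only content shared across the training copies is the concept itself.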

Ablation Study
In this section, we conduct an ablation experiment to evaluate the effect of SOD mask preprocessing and cosine similarity loss.

Preprocessing Using SOD Mask
We conducted ablation experiments to compare the effects of preprocessing with and without the use of SOD masks, and we analyzed both the qualitative and quantitative results.
Figure 14 compares the visual results obtained using the masked and unmasked images. When SOD mask preprocessing is not applied, the generated images do not preserve the concept well. This is likely because features from objects other than the concept object blend into the target object.
Additionally, a quantitative evaluation was conducted with and without the SOD mask. In Table 2, CLIP image alignment and DINO exhibited the most significant improvements when the SOD mask was employed. Because the SOD mask filters unnecessary information from the concept image, the generated samples exhibit a higher fidelity to the concept image.
Figure 14. Generated images after removing unnecessary regions using the SOD model [19]. When using the concept image without preprocessing, the desired object could not be properly preserved due to unwanted features being present in the background, such as the yellow entity (first row). In contrast, when the SOD model is used to detect only the salient objects, the concept remains unaffected by the background (second row).

Cosine Similarity Loss Ablation
Next, we compare the model performance with respect to the cosine similarity loss, which was introduced to train the FC layer appended to the image encoder for text embedding conversion and the CNN local feature extractor. Figure 15 displays samples based on the cosine similarity loss, and Table 3 presents a quantitative comparison.
Table 3. Quantitative comparison using cosine similarity loss with the learnable parameter α. We conducted a quantitative comparison based on the influence ratio of $L_{cos}$. The optimal parameter value of 0.6 obtained through learning provided the best results in CLIP text and image alignment and KID [54]. When α is high, the DINO evaluation showed favorable results, but the text alignment was the lowest. As α increases, DINO captures details better, but overfitting results in a less accurate alignment with the textual description.

Qualitative Results
In this section, we compare the visual results of our approach with those of existing models.

Visual Comparison
In Figure 16, we use the same prompts to compare the proposed method with existing methods for image reconstruction, image editing, and style transfer across five categories (such as sofa, toys, and vase). Notably, Textual Inversion [16] distorts information, such as ratios and colors, and may not entirely adhere to the prompts. DreamBooth generates high-quality images but has substantial training time and storage requirements. In contrast, Custom Diffusion improves speed but cannot incorporate fine patterns and structural information. Our method runs at a speed similar to that of Custom Diffusion, yet it mitigates the loss of color and structural information while preserving detailed patterns.

Failure Cases
Our method occasionally over-emphasizes patterns or produces incorrect structures when faced with excessively intricate patterns, complex structures, or situations where parts of the concept object are occluded. Figure 17 illustrates failed generation results in which the complete form of the object is difficult to discern.

Discussion and Conclusions
We introduced a method to enhance the performance of personalized image generation models built on large-scale text-to-image models, aiming to improve detail preservation in personalized text-to-image tasks. Our experiments used high-resolution image datasets collected from sources such as BIG and Unsplash, as well as images that we captured ourselves. By providing clear information about the region of the image to be preserved through the SOD mask, we reduce the unnecessary background information that contaminates the generated output. Unlike similar studies that preprocess using entity segmentation, preprocessing with SOD aims to detect only the most prominent objects, allowing for faster and more efficient preprocessing. This yielded visual improvements over previous methods, which could not properly follow the input text prompts due to interference from unnecessary areas such as the background, and we also noted enhancements in text alignment. Additionally, mapping the concept image to the text embedding space allowed for the utilization of a wealth of visual information. We mapped the concept image to the text space using pretrained image and text encoders. In contrast to prior research, our method employs a CNN local feature extractor to supply local features in conjunction with text embeddings. The incorporated local features offer a wealth of information on the patterns, colors, and structures of the concept image, generating new image embeddings mapped to the text space. This helped ensure that the generated images better preserve the concept's details, as confirmed by the visual results and improvements in quantitative metrics such as DINO. Furthermore, the introduction of the cosine similarity loss guided the generation of images that were more similar to the input concept image. Although there was a slight decline in CLIP image alignment and KID performance compared to the baseline model, we attribute this to the tendency of previous methods to overfit to the input image, particularly when the input is ambiguous due to the background. Finally, during postprocessing, we employed a Siamese network to select only high-similarity images. Our strategies produced high-fidelity images that closely follow the prompts. However, we also recognized generative limitations in creating very fine patterns, text included in an image, and unusual structures and proportions. Although our experiments relied on established, traditional formulas, future work may benefit from incorporating more advanced and precise techniques to further enhance the outcomes. Moving forward, we plan to explore multimodal methods that address these issues by employing additional conditions beyond text prompts.

Figure 1. Images generated using the personalized text-to-image models. We generate a personalized image using the input image representing the personal concept and the text prompt. We fine-tune the text-to-image model to ensure that the identifier <V> embedded in the prompt can encapsulate information about the concept. Our approach can preserve concept details by directly infusing visual information into the identifier.

Figure 2. Samples that failed to preserve the concept of the input images.The generated samples in the second row fail to maintain the details in the input images (first row).Notable changes in color, shape, and pattern can be observed.

Figure 3. Poorly generated images due to unnecessary concepts. Because the image includes background information in addition to the target object, the details of the object were contaminated, or the image was generated in a way that did not fit the situation required by the prompt.

Figure 4. Pipeline of our proposed method. (1) From the concept image, we remove the background using the SOD model, InSPyReNet [19], and then obtain image embeddings using the CLIP [22] image encoder. (2) We extract local features from the concept image using CNN. (3) We concatenate the image embeddings with the extracted local features to form a new visual feature. Next, we replace the embeddings corresponding to the identifier in the original prompt using the text encoder to obtain the modified embeddings. (4) The Siamese network [21] is then used to measure the similarity between the samples generated through the modified prompt embeddings and concept images. Finally, we obtain images with high similarity as the final results.

Figure 5. Comparison with previous methods that did not employ SOD mask preprocessing. We compared our method with previous approaches [14-16] that did not remove the backgrounds from the input images, wherein the bicycle is intended to be preserved as the concept. The earlier methods are affected by extraneous information beyond the concept, such as the colors and composition of walls and floors, regardless of the text prompts entered. In contrast, our preprocessing removes such information, resulting in the generation of images that better align with the given conditions.

Figure 6.Our network structure to obtain new image embeddings, including local features.The concept image is used as input to obtain image embeddings through a pretrained CLIP image encoder.In addition, local features are extracted from the intermediate blocks of a pretrained ResNet-50 and are combined with the image embeddings.A trainable Fully Connected (FC) layer is introduced to align the dimensions with the text embeddings, from which new image embeddings are derived.These new image embeddings replace the identifier text embeddings and are utilized as the condition for fine-tuning in the diffusion process.

Figure 7. Visual comparison of the results for the new image embedding with the local feature.By additionally providing the local feature, it can be confirmed that the details of the pattern in the input image are better preserved.

Figure 8. Image samples based on similarity scores.When using the Siamese network, we measure the similarity score between the concept and generated images, selecting only those with high similarity.The similarity score is computed by calculating the Euclidean distance between the embeddings of the two images, where a low score indicates high similarity.Samples with scores in the bottom 40% were classified as negative images, and the remaining samples were designated as positive images.

Figure 9. The workflow for calculating our model's loss function.

Figure 10. Epoch-by-epoch visual results compared to the baseline.

Figure 11. Quantitative score evolution with training epochs.

Figure 13. Comparison between the generated results (using our own data) and those of the previous methods.

Figure 15. Results based on cosine similarity loss $L_{cos}$ and the trainable parameter α. We compared the result samples of Reconstruction (first row), Editing (second row), and Stylization. For the reconstruction samples, good results were obtained regardless of the $L_{cos}$ value. In addition, appropriate results for the text prompt were produced in Editing and Stylization for small and large values of $L_{cos}$, respectively. Moreover, a balanced result is observed at α = 0.6.

Figure 16. Visual comparison with existing methods. The concept images given as input are shown in the first column. We visually compared the results of image reconstruction (first row), image editing (rows 2, 3, and 4), and style transfer (fifth row) generated under the same prompts between the existing methods [14-16] (columns 2, 3, and 4) and our approach (fifth column). The proposed method better preserves the detailed patterns and structures of the cat's face (third row), the color and pattern of the penguin toy (fourth row), and the intricate pattern and structure of the vase (fifth row).

Figure 17. Failure cases. Limitations in the image quality can be observed when dealing with irregular structures (first row) or fine patterns (second row). Additionally, the results are inaccurate for objects with occlusion (third row).
Higher values for CLIP-T, CLIP-I, and DINO indicate better performance, while lower values for KID signify superior performance. Values highlighted in bold indicate the best performance.

Table 2. Quantitative comparison for SOD mask ablation.
Higher values for CLIP-T, CLIP-I, and DINO indicate better performance, while lower values for KID signify superior performance. Values highlighted in bold indicate the best performance.