Mathematics
  • Article
  • Open Access

27 October 2023

Advanced Deep Learning Techniques for High-Quality Synthetic Thermal Image Generation

Escuela de Ingeniería Eléctrica, Pontificia Universidad Católica de Valparaíso, Avenida Brasil 2147, Valparaíso 2362804, Chile
Author to whom correspondence should be addressed.
This article belongs to the Section E: Applied Mathematics

Abstract

In this paper, we introduce a cutting-edge system that leverages state-of-the-art deep learning methodologies to generate high-quality synthetic thermal face images. Our unique approach integrates a thermally fine-tuned Stable Diffusion Model with a Vision Transformer (ViT) classifier, augmented by a Prompt Designer and Prompt Database for precise image generation control. Through rigorous testing across various scenarios, the system demonstrates its capability in producing accurate and superior-quality thermal images. A key contribution of our work is the development of a synthetic thermal face image database, offering practical utility for training thermal detection models. The efficacy of our synthetic images was validated using a facial detection model, achieving results comparable to real thermal face images. Specifically, a detector fine-tuned with real thermal images achieved a 97% accuracy rate when tested with our synthetic images, while a detector trained exclusively on our synthetic data achieved an accuracy of 98%. This research marks a significant advancement in thermal image synthesis, paving the way for its broader application in diverse real-world scenarios.

1. Introduction

In recent years, artificial intelligence and deep learning models have achieved significant advancements, providing robust solutions in recognition, detection, generation, and classification tasks [,,]. These models predominantly rely on substantial datasets, captured using conventional visible cameras, to effectively generalize input distributions.
However, these visible cameras face inherent limitations; they cannot operate efficiently in complete darkness or discern temperature distributions. Thermal cameras, in contrast, can capture images in total darkness, detecting the infrared energy or ‘heat signatures’ emitted by objects. This unique capability of thermal cameras to record the apparent surface temperature of subjects under observation opens novel avenues in the domain of computer vision.
The evolving landscape of computer vision and deep learning has presented new challenges and opportunities. As the applications of thermal imaging grow, ranging from security surveillance to healthcare diagnostics, the need for rich and diverse datasets becomes increasingly apparent. However, the acquisition of genuine thermal images is resource-intensive, often constrained by privacy concerns, environmental conditions, and equipment costs. This gap between the potential of thermal imaging and the availability of adequate datasets to harness its full capabilities underlines the motivation for our work. By developing a system to autonomously generate high-quality synthetic thermal images, we aim to bridge this gap, providing the research community with tools and resources to push the boundaries of what is possible with thermal imaging in the realm of AI.
Merging the capabilities of thermal cameras with deep learning models can unlock impactful solutions, such as facial detection in complete darkness (see Figure 1), disease prediction based on body temperature, and machinery overheating prevention, to name but a few. Nevertheless, while deep learning models have excelled in generating and recognizing non-thermal 2D face images, applying them directly to thermal images is challenging. This is primarily because visible-light cameras capture reflected photons, whereas thermal cameras detect emitted infrared radiation, introducing unfamiliar patterns that can affect model performance.
Figure 1. Comparison of visible and thermal face detection: (top) visual detector with light, (middle) visual detector without light, and (bottom) thermal detector without light.
The dearth of extensive thermal datasets compared to visible images poses another challenge. Creating new thermal samples using thermal cameras and annotators is an option, but it is time-consuming and costly, and requires substantial human intervention. An innovative solution lies in employing deep learning-based generative models to produce these samples, circumventing these challenges.
In this context, our work introduces a novel approach: a model designed to autonomously generate high-quality thermal samples, negating the need for manual oversight or labeling. Built upon the foundation of Stable Diffusion [] and leveraging a thermal classifier based on Vision Transformers (ViTs) [], our model distinguishes between high- and low-quality samples. This differentiation is crucial, offering feedback that influences our prompt designer, which, in turn, modulates the input text to refine the generation process.
This model’s introduction has culminated in a comprehensive dataset, primed for training deep learning algorithms in tasks like face recognition and facial expression recognition. Importantly, each sample in this dataset is autonomously generated, ensuring a high-quality, unsupervised data creation process.
To substantiate the efficacy of our system, we utilize face detection models, gauging the quality and variability of the generated samples. The outcome is a facial detection model adept at recognizing identities in pitch-black conditions using thermal cameras, marking a pivotal contribution to computer vision.
Moreover, the adaptability of our proposed model, with its ability to generate samples that can be fine-tuned to diverse styles, signifies its versatility and its potential as a tool for various tasks, including image classification, face detection, and face recognition. In essence, our research presents an efficient, cost-effective automatic generation model, marking a significant stride in producing high-quality thermal data, while emphasizing the novelty of applying cutting-edge techniques to thermal image synthesis.

3. Proposed Method

The proposed method seeks to automate the generation of high-quality thermal face images while maintaining their quality throughout an unsupervised process. Our approach includes four main components: the Thermal Generator, the Thermal Classifier, the Prompt Database, and the Prompt Designer. Figure 3 illustrates the system’s scheme. The Thermal Generator is tasked with synthesizing thermal images, while the Thermal Classifier determines whether to provide model feedback or to save and deliver the final image as a high-quality thermal image. The Prompt Database acts as storage for prompts, from which the Prompt Designer suggests candidates; the objective is to provide a diverse set of prompts that enhance the generated thermal image. The Prompt Designer plays a pivotal role in controlling the generation process by specifying the desired output characteristics. The proposed system is implemented in Python 3.7 and deployed on a local server. The training processes are executed with a Tesla T4 GPU (16 GB VRAM) in Google Colab, which can access data stored on Google Drive.
Figure 3. Schematic diagram of the proposed thermal image generation system.
System Walkthrough:
  • Process Initiation: Everything begins with a text input to the system. This text acts as a guide or a descriptor of the kind of thermal image that is intended to be generated.
  • Preliminary Generation: The input text is processed by the ‘Thermal Generator’, which utilizes the Stable Diffusion model to attempt to synthesize a thermal image that matches the text specifications.
  • Quality Evaluation: Once a preliminary thermal image is crafted, it is passed onto the ‘Thermal Classifier’. This component, built upon the ViT architecture, assesses the quality of the generated image. It determines if the image looks akin to a true thermal image and if it aligns with the original text specifications.
  • Feedback and Adjustments: If the Thermal Classifier finds that the generated image does not meet quality standards or does not adequately match the text specifications, a feedback loop is initiated. This is where the ‘Prompt Designer’ comes into play. Utilizing the ‘Prompt Database’, the Prompt Designer suggests tweaks or variations to the original text, aiming to guide the Thermal Generator towards producing a better quality image in the subsequent attempt.
  • Iterations: This generation, evaluation, and feedback process is iteratively performed until a high-quality thermal image that matches the desired specifications is crafted.
  • Completion: Once a high-quality thermal image is synthesized, it is stored in a database. Over time, this automated process results in a vast database of high-quality thermal images, ready for use in various computer vision applications.
This comprehensive methodology enables the creation of expansive image databases filled with high-quality generated thermal images. Such databases are especially valuable in the fields of computer vision and machine learning, where large datasets are essential for training models. By automating the generation of high-quality thermal images, our approach offers a more efficient and effective alternative to traditional methods that rely on manual labeling or other time-consuming processes.
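The walkthrough above can be condensed into a simple control loop. The sketch below is a minimal, illustrative rendering of that loop; generate_thermal, classify_thermal, and design_prompt are hypothetical stand-ins for the Thermal Generator, Thermal Classifier, and Prompt Designer modules and are not part of any published code.

```python
def produce_high_quality_image(main_text, prompt_db, generate_thermal,
                               classify_thermal, design_prompt, max_attempts=10):
    """Conceptual control loop of the proposed system (hypothetical helpers)."""
    flag_text = design_prompt(prompt_db)               # initial flag text from the Prompt Database
    for _ in range(max_attempts):
        prompt = f"{main_text}, {flag_text}"           # T_out = main text + flag text
        image = generate_thermal(prompt, seed=42)      # thermally fine-tuned Stable Diffusion
        if classify_thermal(image) == "thermal face":  # ViT quality check
            return image, prompt                       # high quality: store and deliver
        flag_text = design_prompt(prompt_db)           # feedback: redesign the flag text
    return None, None                                  # no acceptable image within the budget
```

The individual modules that realize each step of this loop are detailed below.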
  • Generator Module: The Thermal Generator utilizes a thermally fine-tuned Stable Diffusion model, based on DreamBooth []. This generation model has been specifically designed to capture and emulate the distinct thermal style exhibited by thermal cameras, producing novel samples that genuinely reflect the thermal aesthetic. However, the generation process is not uncontrolled: the other modules work in concert with it, closely monitoring and guiding its output to ensure the creation of accurate, high-quality generative samples that adhere to the standards established by thermal imaging experts and practitioners. The system guarantees results that conform to the desired text specifications, maintaining the fidelity and reliability of the generated content. The Generator receives as inputs a text prompt and a seed, a predefined value that determines the initial state of the generation process and ensures the reproducibility of the generated thermal image. The output of the Generator is a thermal image of a subject’s face, heavily influenced by the input text, ensuring that the generated thermal image closely matches the text.
  • Classifier Module: The Thermal Classifier is built upon the ViT architecture and fine-tuned using the methodology proposed in reference []. This module acts as an image classifier, specifically trained with thermal images to discriminate between accurately generated samples and those that do not match the intended thermal style. Leveraging this classification process, the module generates a flag indicating whether the image should be stored among the resulting generated samples. The primary function of the Classifier Module is to assess the quality and fidelity of the generated thermal images. It serves as a critical component in determining whether the synthetic examples align with the desired thermal style. Once an image has been generated, the Classifier provides critical feedback to the Prompt Designer, advising its decision on whether a thermal image meets the quality standards to be delivered by the system as a high-quality thermal image. This feedback is given after each image generation, allowing for iterative improvements to the generated images. By incorporating this classifier into our system, we ensure that only high-quality and visually consistent thermal images are selected for further processing.
  • Prompt Designer and Prompt Database: To ensure the creation of the optimal text, we have designed a specialized text creator that allows for large variations and an automatic search for the best possible text. The output text ($T_{out}$) is designed from a main text ($T_m$) and the flag text ($T_f$). The main text contains the primary feature to be generated, such as names of celebrities, animals, or specific items. This text remains unchanged during the search process, as it contains our main topic, which must persist during the generation process. Conversely, the flag text is meticulously crafted by selecting a set of ‘n’ words through a comprehensive search process. During the refinement stage, this text undergoes variations in both word selection and quantity. The flag text encapsulates distinctive features and incorporates carefully chosen words to facilitate prompt engineering. If the quality of the generated image meets the required standards, it is added to the final database. However, if the image is of poor quality, the search process is repeated using the Prompt Designer until the desired category is achieved. This iterative process allows for the generation of high-quality images that match the desired category, thereby ensuring that only high-quality images are included in the final database (a minimal sketch of this prompt composition and seeded generation is given after this list).
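As referenced above, the following is a minimal sketch of how the prompt composition and seeded generation could look in code, written against the Hugging Face diffusers API. The checkpoint path of the thermally fine-tuned model and the contents of the Prompt Database are illustrative assumptions, not the authors' actual artifacts.

```python
import random
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical path to the thermally fine-tuned (DreamBooth) Stable Diffusion checkpoint.
pipe = StableDiffusionPipeline.from_pretrained(
    "./thermal-dreambooth-sd-v1-5", torch_dtype=torch.float16).to("cuda")

# Illustrative Prompt Database entries (flag-text vocabulary).
prompt_database = ["infrared style", "grayscale heat map", "frontal view",
                   "uniform background", "high detail", "studio framing"]

def design_prompt(main_text, n_words):
    """Compose T_out from the fixed main text T_m and a flag text T_f of n sampled words."""
    flag_text = ", ".join(random.sample(prompt_database, n_words))
    return f"{main_text}, {flag_text}"

prompt = design_prompt("a thermal photo of a smiling face", n_words=3)
generator = torch.Generator("cuda").manual_seed(1234)  # the seed fixes the initial state
image = pipe(prompt, generator=generator).images[0]
image.save("thermal_face.png")
```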

4. System Module Training

In this section, we provide a comprehensive analysis of the training processes and resulting outcomes for both the generator and classifier modules. The successful implementation and evaluation of these modules are vital for the overall efficacy and reliability of the proposed system. We aim to clarify the training methodologies employed and highlight the attained results to offer a thorough understanding of each module’s performance and capabilities.

4.1. Generator Training

The training stage for the Thermal Generator aims to fine-tune the Stable Diffusion model to generate thermal images accurately, effectively learning this style. We employ DreamBooth for this process, a method for personalizing text-to-image diffusion models. The fine-tuning process using DreamBooth proceeds in two steps:
  • Fine-tuning the low-resolution text-to-image model with input images paired with a text containing a unique identifier and the class name to which the subject belongs (for example, “A photo of a thermal face”). Simultaneously, a class-specific prior-preservation loss is applied, leveraging the semantic prior the model possesses on the class. It encourages the generation of diverse instances belonging to the subject’s class by injecting the class name into the text prompt (e.g., “A photo of a face”); a conceptual sketch of this loss follows the list.
  • Fine-tuning the super-resolution components with pairs of low- and high-resolution images sourced from our input image set, ensuring the model maintains fidelity to the subject’s minute details.
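As referenced in the first step, the sketch below illustrates the shape of the DreamBooth objective with the class-specific prior-preservation term, written in PyTorch against a diffusers-style UNet and noise scheduler. It is a conceptual sketch, not the exact training code used here; object names and the prior-loss weight are assumptions.

```python
import torch
import torch.nn.functional as F

def dreambooth_loss(unet, scheduler, instance_latents, class_latents,
                    instance_emb, class_emb, prior_weight=1.0):
    """Conceptual DreamBooth objective: subject reconstruction + prior preservation."""
    def noise_prediction_loss(latents, text_emb):
        noise = torch.randn_like(latents)
        t = torch.randint(0, scheduler.config.num_train_timesteps,
                          (latents.shape[0],), device=latents.device)
        noisy_latents = scheduler.add_noise(latents, noise, t)
        noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample
        return F.mse_loss(noise_pred, noise)

    # Term 1: thermal subject images, conditioned on the unique identifier + class name
    # (e.g., "A photo of a thermal face").
    instance_loss = noise_prediction_loss(instance_latents, instance_emb)
    # Term 2: generic class images, conditioned on the class prompt
    # (e.g., "A photo of a face"), preserving the model's semantic prior on the class.
    prior_loss = noise_prediction_loss(class_latents, class_emb)
    return instance_loss + prior_weight * prior_loss
```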
Importantly, the fine-tuning process using DreamBooth does not necessitate numerous thermal images. We aim to evaluate how the Fréchet inception distance (FID) metric [] varies with the number of training images (ranging from 5 to 40 images for fine-tuning). The FID metric (Equation (2)), widely used in the evaluation of image generators, provides a quantitative measure of the quality of the generated images. The FID measures the difference between two distributions in the high-dimensional feature space of an InceptionV3 classifier, comparing the activations of a previously trained classification network on real and generated images, using the following equation:
FID = \lVert m - m_w \rVert_2^2 + \mathrm{Tr}\left( C + C_w - 2 \left( C C_w \right)^{1/2} \right)
The parameters m and C represent the mean vectors and covariance matrices in the embedding space, respectively. The subscript ‘w’ pertains to the generated image, while the terms without subscripts refer to the real image. A low FID value implies superior generation of synthetic images.
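For reference, Equation (2) can be computed directly from the mean vectors and covariance matrices of the InceptionV3 activations. The sketch below assumes the activations have already been extracted elsewhere (e.g., the 2048-dimensional pooled features of a pretrained InceptionV3); function and variable names are illustrative.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(act_real, act_gen):
    """FID between two sets of InceptionV3 activations of shape (num_images, feature_dim)."""
    m, m_w = act_real.mean(axis=0), act_gen.mean(axis=0)
    C = np.cov(act_real, rowvar=False)
    C_w = np.cov(act_gen, rowvar=False)

    diff = m - m_w
    covmean, _ = linalg.sqrtm(C @ C_w, disp=False)  # matrix square root of C @ C_w
    if np.iscomplexobj(covmean):
        covmean = covmean.real                      # discard small imaginary parts from numerical error

    return diff @ diff + np.trace(C + C_w - 2.0 * covmean)
```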
We evaluated the FID using a total of 100 randomly generated images, after fine-tuning Stable Diffusion, introducing variations in the input text to generate a more representative and varied set of images. This ensures that the Generator can generalize correctly, an essential attribute for its incorporation into the proposed system. The results are presented in Table 1 and show the FID for two versions of Stable Diffusion (V1.5 and V2.0) with different numbers of images used for fine-tuning (5, 10, 20, and 40 images). The values in the table are the FID scores obtained after fine-tuning; lower scores indicate better synthetic image quality. For Stable Diffusion version 1.5, the FID decreases when the number of images is increased from 5 to 10. However, further increasing the number of images to 20 results in an increase in the FID for both versions. Increasing the number of images to 40 again reduces the FID for both versions, but it does not reach the lowest value obtained with 10 images. The increase in FID values when fine-tuning with 20 images may be due to overfitting, where the Generator becomes too adapted to the training images and generalizes poorly to new inputs. This is mitigated when using a larger dataset of 40 images, leading to a decrease in FID values. However, the exact reasons for these trends may be complex and depend on various factors, such as the specific images used for fine-tuning and the stochastic nature of the training process. The training process has an average duration of 28 min for both versions 1.5 and 2.0, trained in a Google Colab environment using a Tesla T4 GPU (16 GB VRAM). In general, version 1.5 of Stable Diffusion outperforms version 2.0, as it produces lower FID values across all image counts.
Table 1. Comparative FID results for fine-tuning Stable Diffusion.

4.2. Classifier Training

The Thermal Classifier utilized in this study is an implementation of the traditional ViT model []. It is trained using a database of 18,579 thermal images, which were sourced from reference []. The images are segregated into two main classes for training the ViT model: “thermal face” and “other”. The “other” class, which corresponds to poor-quality thermal face images, is determined by applying the FID on randomly generated images from the generator module. A low FID score indicates high quality of synthetic images. For our purpose, images with high FID values are classified as low quality and are included in the “other” class for training the Classifier.
To evaluate the Classifier training and the detection outcomes from Section 6, we used the accuracy (Equation (3)) and F1-score (Equation (4)) metrics. The first considers the number of correct predictions over the total, while the second combines precision (Equation (5)) and recall (Equation (6)), which are well suited to imbalanced data, as is the case for the data used in this study. Precision indicates how many of the predicted positive cases were true positives, while recall shows how many of the true-positive cases the model predicted correctly.
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\mathrm{Precision} = \frac{TP}{TP + FP}
\mathrm{Recall} = \frac{TP}{TP + FN}
\mathrm{F1\ Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
where TP—true positive; TN—true negative; FP—false positive; FN—false negative.
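These four metrics are readily computed with scikit-learn; the toy labels below (1 = “thermal face”, 0 = “other”) are only for illustration.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 0, 1, 0, 0, 1, 1]  # ground-truth labels (illustrative)
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]  # classifier predictions (illustrative)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
```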
Training the Thermal Classifier enables the creation of a model capable of automatically distinguishing between high- and low-quality thermal samples. In our study, we experimented with classic pre-trained architectures such as InceptionV3 [], VGG16 [], Xception [], and InceptionResNetV2 []. However, only the Vision Transformer (ViT) yielded results suitable for further deep training; we do not report the results from the other models, as they were significantly outperformed by ViT. The most efficient ViT Classifier was obtained after six epochs, using a batch size of 16 images and a learning rate of 0.00001, with a training duration of 6.2 h in a Python environment created with Google Colab using a Tesla T4 GPU (16 GB VRAM). The performance achieved by the thermal ViT is displayed in Table 2, with the Classifier attaining an accuracy and F1-score of 98%.
Table 2. Performance metrics of the ViT Thermal Classifier.
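A minimal sketch of such a fine-tuning setup using the Hugging Face transformers library is shown below, reusing the reported hyperparameters (six epochs, batch size 16, learning rate 0.00001). The base checkpoint and the dataset objects are assumptions; the paper only specifies a ViT classifier trained on two classes of thermal images.

```python
from transformers import ViTForImageClassification, TrainingArguments, Trainer

# Base checkpoint is an assumption; the paper does not name the exact ViT variant.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=2,
    id2label={0: "other", 1: "thermal face"},
    label2id={"other": 0, "thermal face": 1},
)

args = TrainingArguments(
    output_dir="vit-thermal-classifier",
    per_device_train_batch_size=16,  # batch size reported in the paper
    learning_rate=1e-5,              # learning rate reported in the paper
    num_train_epochs=6,              # number of epochs reported in the paper
)

# train_ds / eval_ds are assumed to be datasets yielding pixel_values and labels
# for the "thermal face" and "other" classes described above.
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```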

5. Results of the Proposed System

This section elucidates the results obtained from the proposed system, demonstrating its ability to generate high-quality synthetic thermal face images. The system’s efficacy is evaluated across various scenarios, including the generation of thermal images embodying different facial features and the creation of thermal images of renowned celebrities. Additionally, the influence of the input text on the resultant images is analyzed. This initiative led to the creation of a comprehensive dataset comprising a total of 11,828 images, called “PUCV-Synthetic Thermal Images (PUCV-STI)”. These images represent 2957 distinct individuals, each depicted with various degrees of kinship and different interpretations of the primary neutral individual.
  • Diverse facial features: The proposed system’s functionality was evaluated by generating synthetic thermal images, each portraying unique facial features. For this assessment, four categories—neutral, smiling, angry, and bald—were identified. The goal was to generate high-quality thermal images that accurately represent these classes. Figure 4 presents examples of three subjects, each with unique facial features as generated by the system. The high quality of the images and the successful capture of the desired thermal style are particularly noteworthy.
    Figure 4. Synthetic thermal images compared with real thermal images. (1) Synthetic thermal images generated by the proposed system. (2) Thermal PUCV database [].
  • Influence of text on the generated thermal image: A further examination of the system involved testing the impact of input text on the thermal image produced. For this test, famous individuals, such as celebrities, actors, and former presidents, were included with the intention of assessing the system’s capability to generate corresponding thermal images. The original textual content was maintained, and variations were introduced to the flag text, determined by the output of the Thermal Classifier. The purpose of this approach was to explore the model’s capacity to identify and represent the thermal patterns exhibited by well-known personalities. These personalities, recognizable to the model in the visible spectrum, were successfully incorporated with the desired thermal style.
The generation of celebrity faces (as shown in Figure 5) demonstrates that the impact of input text on the resulting images is a critical factor in forming the generated celebrity faces. By leveraging semantic information derived from the provided textual descriptions, the model infers and prioritizes the defining characteristics contributing to a given celebrity’s likeness. The input text guides the model’s attention towards salient attributes, allowing it to concentrate on relevant facial components and their respective configurations. The constraints imposed by the restricted variations in the target generation further underscore the role of common patterns in celebrity faces. Given that celebrities often exhibit distinctive yet recognizable facial traits, the model capitalizes on these patterns to narrow down the potential options and generate faces that align closely with the desired celebrity in the thermal spectrum.
Figure 5. Generation of celebrity faces in the thermal spectrum.

6. System Validation via Thermal Face Detection

To evaluate the efficacy of our synthetic thermal image generation system, we embarked on a two-pronged validation approach using a facial detection model. The approach involved training two separate detectors: one with real thermal images from the thermal PUCV database [], and the other with the PUCV-Synthetic Thermal Images generated by our system. By comparing the performance of these two detectors, we aim to gauge the utility of our synthetic images as a substitute for real thermal images in practical applications.
We opted for Detectron2 [] for this task, due to its excellent track record in object detection tasks. With its capabilities spanning object detection, instance segmentation, key point detection, and panoptic segmentation, Detectron2 provides a robust platform to assess the quality of our generated images.
In the training process, we employed the Faster_rcnn_R_50_FPN_3x architecture, which is specifically designed for bounding box detection. The training process, conducted separately for the real and synthetic datasets, spanned 100 epochs with a learning rate of 0.005 and a batch size of 16. Further details on the hyperparameters used can be found in Table 3.
Table 3. Hyperparameters used in training the detection model.
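A hedged sketch of the corresponding Detectron2 training configuration is given below, reusing the Faster_rcnn_R_50_FPN_3x architecture and the learning rate and batch size reported above. The dataset name is an assumption: the thermal face annotations would have to be registered with Detectron2’s DatasetCatalog beforehand.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("thermal_faces_train",)  # hypothetical registered dataset name
cfg.DATASETS.TEST = ()
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1            # single class: "face"
cfg.SOLVER.IMS_PER_BATCH = 16                  # batch size from Table 3
cfg.SOLVER.BASE_LR = 0.005                     # learning rate from Table 3
cfg.OUTPUT_DIR = "./output_thermal_face_detector"

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```

At inference time, the same configuration can be loaded into detectron2.engine.DefaultPredictor to run face detection on individual thermal images.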
Evaluation is a critical phase in the development of an object detection model, and in our case, we utilized a total of 200 automatically generated images to evaluate the model’s performance. Tests were conducted with both datasets for this facial detection application, the results of which are presented in Table 4.
Table 4. Evaluation results of face detection application.
The results suggest that our synthetic images are fit for training a thermal detection model, achieving an accuracy and F1-score of approximately 98%. Interestingly, these performance metrics remained high even when the target face was in different positions. This underscores the feasibility of using synthetically generated images in detection processes, demonstrating low error rates and high adaptability. The model’s ability to detect faces in various positions significantly enhances its applicability in real-life situations, permitting operation even in complete darkness. The results of thermal face detection are presented in Figure 6.
Figure 6. Results of thermal face detection.

7. Conclusions

In this study, we devised a methodological approach for generating synthetic thermal images using advanced deep learning techniques. The seamless integration of the thermally fine-tuned Stable Diffusion model and the Vision Transformer (ViT) classifier lies at the heart of our system, precisely tailoring the generation process to the unique challenges of thermal imaging.
The Stable Diffusion model, inspired by DreamBooth, expertly encapsulates the distinctive style of thermal imaging. In tandem, the ViT classifier ensures the generation of images that adhere to stringent quality standards. Our experiments showcased the pivotal role of textual prompts in shaping the image generation process, demonstrating the nuanced impact of varying levels of description specificity.
A seminal achievement of our research is the creation of a synthetic thermal face image database. This resource not only offers immense potential for training cutting-edge face detection models but also lays the groundwork for applications in face recognition and thermal pattern analysis. Such analysis might be instrumental in early disease detection or other health-related diagnostics.
In practical applications, our synthetic thermal images exhibited exceptional results in facial detection tasks. This demonstrates the real-world utility and effectiveness of our approach, further validating the quality and authenticity of our generated images.
Furthermore, while our work is rooted in thermal imaging, the methodologies and insights bear relevance to other imaging domains, such as X-ray imaging or various medical imaging techniques. Such adaptability underlines the broader applicability of our findings and methodologies.
To conclude, our research marks a pivotal advancement in the realm of synthetic thermal image generation using deep learning. It underscores the potential of these methodologies in producing high-fidelity thermal images, thereby catalyzing future explorations and potential applications in diverse imaging areas.

Author Contributions

Investigation, V.P. and G.H.; software, V.P.; supervision, G.H.; writing—original draft, V.P.; writing—review and editing, G.H., M.S. and G.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by FONDECYT under Grant 1191188.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv 2022, arXiv:2204.06125. [Google Scholar]
  2. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
  3. OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  4. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  5. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar]
  6. Koh, J.Y.; Fried, D.; Salakhutdinov, R. Generating Images with Multimodal Language Models. arXiv 2023, arXiv:2305.17216. [Google Scholar]
  7. Xu, X.; Guo, J.; Wang, Z.; Huang, G.; Essa, I.; Shi, H. Prompt-Free Diffusion: Taking “Text” out of Text-to-Image Diffusion Models. arXiv 2023, arXiv:2305.16223. [Google Scholar]
  8. Elata, N.; Kawar, B.; Michaeli, T.; Elad, M. Nested Diffusion Processes for Anytime Image Generation. arXiv 2023, arXiv:2305.19066. [Google Scholar]
  9. Li, D.; Li, J.; Hoi, S.C.H. BLIP-Diffusion: Pre-Trained Subject Representation for Controllable Text-to-Image Generation and Editing. arXiv 2023, arXiv:2305.14720. [Google Scholar]
  10. Kim, S.; Lee, J.; Hong, K.; Kim, D.; Ahn, N. DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models. arXiv 2023, arXiv:2305.15194. [Google Scholar]
  11. Brock, A.; Donahue, J.; Simonyan, K. Large Scale GAN Training for High Fidelity Natural Image Synthesis. arXiv 2019, arXiv:1809.11096. [Google Scholar]
  12. Dhariwal, P.; Nichol, A. Diffusion Models Beat GANs on Image Synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794. [Google Scholar]
  13. Cheng, W.; Cao, Y.-P.; Shan, Y. SparseGNV: Generating Novel Views of Indoor Scenes with Sparse Input Views. arXiv 2023, arXiv:2305.07024. [Google Scholar]
  14. Rangwani, H.; Bansal, L.; Sharma, K.; Karmali, T.; Jampani, V.; Babu, R.V. NoisyTwins: Class-Consistent and Diverse Image Generation through StyleGANs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  15. Singh, R.; Shukla, A.; Turaga, P. Polynomial Implicit Neural Representations for Large Diverse Datasets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  16. Hashemi, H.; Hartmann, N.; Sharifzadeh, S.; Kahn, J.; Kuhr, T. Ultra-High-Resolution Detector Simulation with Intra-Event Aware GAN and Self-Supervised Relational Reasoning. arXiv 2023, arXiv:2303.08046. [Google Scholar]
  17. Hashemi, H.; Hartmann, N.; Kuhr, T.; Ritter, M. PE-GAN: Prior Embedding GAN for PXD Images at Belle II. EPJ Web Conf. 2021, 251, 03031. [Google Scholar] [CrossRef]
  18. You, Z.; Zhong, Y.; Bao, F.; Sun, J.; Li, C.; Zhu, J. Diffusion Models and Semi-Supervised Learners Benefit Mutually with Few Labels. arXiv 2023, arXiv:2302.10586. [Google Scholar]
  19. Bashkirova, D.; Lezama, J.; Sohn, K.; Saenko, K.; Essa, I. MaskSketch: Unpaired Structure-Guided Masked Image Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  20. Deng, Y.; Hui, S.; Zhou, S.; Meng, D.; Wang, J. T-Former: An Efficient Transformer for Image Inpainting. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 6559–6568. [Google Scholar]
  21. Yildirim, A.B.; Baday, V.; Erdem, E.; Erdem, A.; Dundar, A. Inst-Inpaint: Instructing to Remove Objects with Diffusion Models. arXiv 2023, arXiv:2304.03246. [Google Scholar]
  22. Zhang, G.; Ji, J.; Zhang, Y.; Yu, M.; Jaakkola, T.; Chang, S. Towards Coherent Image Inpainting Using Denoising Diffusion Implicit Models. In Proceedings of the Fortieth International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
  23. Liu, W.; Cun, X.; Pun, C.-M.; Xia, M.; Zhang, Y.; Wang, J. CoordFill: Efficient High-Resolution Image Inpainting via Parameterized Coordinate Querying. arXiv 2023, arXiv:2303.08524. [Google Scholar] [CrossRef]
  24. Luo, Z.; Gustafsson, F.K.; Zhao, Z.; Sjölund, J.; Schön, T.B. Image Restoration with Mean-Reverting Stochastic Differential Equations. arXiv 2023, arXiv:2301.11699. [Google Scholar]
  25. Kim, B.; Kwon, G.; Kim, K.; Ye, J.C. Unpaired Image-to-Image Translation via Neural Schr\”odinger Bridge. arXiv 2023, arXiv:2305.15086. [Google Scholar]
  26. Torbunov, D.; Huang, Y.; Tseng, H.-H.; Yu, H.; Huang, J.; Yoo, S.; Lin, M.; Viren, B.; Ren, Y. Rethinking CycleGAN: Improving Quality of GANs for Unpaired Image-to-Image Translation. arXiv 2023, arXiv:2303.16280. [Google Scholar]
  27. Li, S.; van de Weijer, J.; Wang, Y.; Khan, F.S.; Liu, M.; Yang, J. 3D-Aware Multi-Class Image-to-Image Translation with NeRFs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  28. Zingman, I.; Frayle, S.; Tankoyeu, I.; Sukhanov, S.; Heinemann, F. A Comparative Evaluation of Image-to-Image Translation Methods for Stain Transfer in Histopathology. arXiv 2023, arXiv:2303.17009. [Google Scholar]
  29. Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and Improving the Image Quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  30. Karras, T.; Aittala, M.; Hellsten, J.; Laine, S.; Lehtinen, J.; Aila, T. Training Generative Adversarial Networks with Limited Data. Adv. Neural Inf. Process. Syst. 2020, 33, 12104–12114. [Google Scholar]
  31. Karras, T.; Aittala, M.; Laine, S.; Härkönen, E.; Hellsten, J.; Lehtinen, J.; Aila, T. Alias-Free Generative Adversarial Networks. Adv. Neural Inf. Process. Syst. 2021, 34, 852–863. [Google Scholar]
  32. Isola, P.; Zhu, J.-Y.; Zhou, T.; Efros, A.A. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  33. Zhou, Y.; Zhang, R.; Sun, T.; Xu, J. Enhancing Detail Preservation for Customized Text-to-Image Generation: A Regularization-Free Approach. arXiv 2023, arXiv:2305.13579. [Google Scholar]
  34. Yu, Q.; Li, J.; Ye, W.; Tang, S.; Zhuang, Y. Interactive Data Synthesis for Systematic Vision Adaptation via LLMs-AIGCs Collaboration. arXiv 2023, arXiv:2305.12799. [Google Scholar]
  35. Yariv, G.; Gat, I.; Wolf, L.; Adi, Y.; Schwartz, I. AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation. arXiv 2023, arXiv:2305.13050. [Google Scholar]
  36. Liu, C.; Liu, D. Late-Constraint Diffusion Guidance for Controllable Image Synthesis. arXiv 2023, arXiv:2305.11520. [Google Scholar]
  37. Chen, Y.; Liu, L.; Ding, C. X-IQE: EXplainable Image Quality Evaluation for Text-to-Image Generation with Visual Large Language Models. arXiv 2023, arXiv:2305.10843. [Google Scholar]
  38. Xiao, G.; Yin, T.; Freeman, W.T.; Durand, F.; Han, S. FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention. arXiv 2023, arXiv:2305.10431. [Google Scholar]
  39. Yarom, M.; Bitton, Y.; Changpinyo, S.; Aharoni, R.; Herzig, J.; Lang, O.; Ofek, E.; Szpektor, I. What You See Is What You Read? Improving Text-Image Alignment Evaluation. arXiv 2023, arXiv:2305.10400. [Google Scholar]
  40. Zhong, S.; Huang, Z.; Wen, W.; Qin, J.; Lin, L. SUR-Adapter: Enhancing Text-to-Image Pre-Trained Diffusion Models with Large Language Models. arXiv 2023, arXiv:2305.05189. [Google Scholar]
  41. Lu, Y.; Lu, P.; Chen, Z.; Zhu, W.; Wang, X.E.; Wang, W.Y. Multimodal Procedural Planning via Dual Text-Image Prompting. arXiv 2023, arXiv:2305.01795. [Google Scholar]
  42. Mansimov, E.; Parisotto, E.; Ba, J.L.; Salakhutdinov, R. Generating Images from Captions with Attention. arXiv 2016, arXiv:1511.02793. [Google Scholar]
  43. Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; Lee, H. Generative Adversarial Text to Image Synthesis. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016. [Google Scholar]
  44. Nichol, A.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; Chen, M. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv 2022, arXiv:2112.10741. [Google Scholar]
  45. Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.; Ghasemipour, S.K.S.; Ayan, B.K.; Mahdavi, S.S.; Lopes, R.G.; et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. Adv. Neural Inf. Process. Syst. 2022, 35, 36479–36494. [Google Scholar]
  46. Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  47. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  48. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597. [Google Scholar]
  49. Bank, D.; Koenigstein, N.; Giryes, R. Autoencoders. In Machine Learning for Data Science Handbook: Data Mining and Knowledge Discovery Handbook; Springer: Cham, Switzerland, 2021. [Google Scholar]
  50. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  51. Pavez, V.; Hermosilla, G.; Pizarro, F.; Fingerhuth, S.; Yunge, D. Thermal Image Generation for Robust Face Recognition. Appl. Sci. 2022, 12, 497. [Google Scholar] [CrossRef]
  52. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  53. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar]
  54. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  55. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; AAAI Press: San Francisco, CA, USA, 2017; pp. 4278–4284. [Google Scholar]
  56. Hermosilla, G.; Gallardo, F.; Farias, G.; San Martin, C. Fusion of Visible and Thermal Descriptors Using Genetic Algorithms for Face Recognition Systems. Sensors 2015, 15, 17944–17962. [Google Scholar] [CrossRef]
  57. Wu, Y.; Kirillov, A.; Massa, F.; Lo, W.-Y.; Girshick, R. Detectron2. 2019. Available online: https://ai.facebook.com/blog/-detectron2-a-pytorch-based-modular-object-detection-library-/ (accessed on 23 October 2023).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
