Article

High-Quality Text-to-Image Generation Using High-Detail Feature-Preserving Network

Wei-Yen Hsu 1,2,* and Jing-Wen Lin 1
1 Department of Information Management, National Chung Cheng University, Chiayi 62102, Taiwan
2 Advanced Institute of Manufacturing with High-tech Innovations & Center for Innovative Research on Aging Society (CIRAS), National Chung Cheng University, Chiayi 62102, Taiwan
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(2), 706; https://doi.org/10.3390/app15020706
Submission received: 29 September 2024 / Revised: 3 January 2025 / Accepted: 3 January 2025 / Published: 13 January 2025
(This article belongs to the Special Issue Advanced Image Analysis and Processing Technologies and Applications)

Abstract: Multistage text-to-image generation algorithms have shown remarkable success. However, the images they produce often lack detail and suffer from feature loss, because these methods focus mainly on extracting features from images and text and rely only on conventional residual blocks to process the extracted features. The resulting feature loss greatly reduces the quality of the generated images and demands additional resources for feature computation, which severely limits deployment in optical devices such as cameras and smartphones. To address these issues, the novel High-Detail Feature-Preserving Network (HDFpNet) is proposed to generate high-quality, near-realistic images from text descriptions. The initial text-to-image generation (iT2IG) module is used to generate initial feature maps and avoid feature loss. The fast excitation-and-squeeze feature extraction (FESFE) module is then proposed to recursively generate high-detail, feature-preserving images at lower computational cost through three steps: channel excitation (CE), fast feature extraction (FFE), and channel squeeze (CS). Finally, the channel attention (CA) mechanism further enriches the feature details. Experimental results on the CUB-Bird and MS-COCO datasets demonstrate that the proposed HDFpNet outperforms the state of the art in both quantitative performance and visual presentation, especially with regard to high-detail images and feature preservation.

1. Introduction

Generative adversarial networks (GANs) [1] and deep learning have recently been used increasingly for image and video generation and restoration [2,3,4,5,6,7,8], including applications in optical devices. These networks are also commonly used to transform text descriptions into realistic images, a problem known as the text-to-image task. A primary goal of artificial intelligence (AI) development is to enable machines to understand natural language and generate appropriate responses. Toward this goal, researchers have strived to improve the quality of AI-generated images produced from text prompts and to ensure that these images possess the detail required by users. Several methods have been proposed to improve text-based image quality [9,10,11,12,13,14]. Stacked methods [9,11] generate images through multiple stages to achieve higher resolutions. Xu et al. [10] used an attention mechanism to combine text and image features, ensuring that the generated images matched the corresponding textual descriptions. Text-to-image and image-to-text models [13] were combined to improve the consistency between image details and textual descriptions. A dynamic memory network [14] was used to compute and integrate word-level features with image features, enabling the network to generate images containing the details indicated by individual words.
Although these methods have generated higher-quality images compared to previous approaches, several issues remain unresolved. Multistage methods [9,10,11,14] improve image resolutions and extract image and text features. They first extract text and image features from the previous stage and then implement a residual network [15] to retain relevant image features. A residual network uses skip connections to resolve the gradient vanishing problem commonly encountered in deep neural networks. In general, residual networks use nonlinear transformations to integrate the data of the current and previous stages, thereby preventing data loss. However, the number of parameters processed by the residual network increases in deeper network layers, causing the network to lose essential data when these data propagate through the bottleneck layer. Thus, information regarding the contextual representation of text and image fusion is lost, resulting in the calculated contextual representation being unusable for efficient image generation. Hence, producing the contextual representation requires additional computing resources.
To address these issues, the novel High-Detail Feature-Preserving Network (HDFpNet) is proposed to attempt to generate images with high detail and feature preservation. First, the initial text-to-image generation (iT2IG) module is used to generate initial feature maps. The fast excitation and squeeze feature extraction (FESFE) module is then proposed to recursively generate high-detail and feature-preserving images through channel excitation (CE), fast feature extraction (FFE), and channel squeeze (CS). Finally, the channel attention (CA) mechanism [16] further enriches the feature details. The experimental results indicate that the proposed HDFpNet achieves better performance and visual representation on the CUB-Bird [17] and MS-COCO [18] datasets than the state-of-the-art methods, especially regarding high-detail images and feature preservation.
We present several practical scenarios for the proposed method, with schematic diagrams in Figure 1. By using devices such as mobile phones as sensors to capture text and analyzing it with the proposed HDFpNet model, both understanding and safety can be enhanced. As shown in Figure 1a, people who see a warning sign in a language they cannot read may face safety risks. By sensing the text with a mobile phone or other device and analyzing it with the proposed HDFpNet model, relevant images can be generated to convey the meaning of the text, thereby preventing dangerous behavior. Figure 1b shows another example in which a device senses operating instructions and produces a visual process that matches the text, allowing individuals who do not understand the language to clearly comprehend the operational steps. Additionally, as shown in Figure 1c, when people encounter unfamiliar bird descriptions while birdwatching outdoors, they can use the text description and our method to generate matching images, making it easier to picture the appearance of the described bird. These examples demonstrate how text descriptions combined with sensors can provide visual assistance for machine operation and warning signs and offer a valuable application for those who have not personally encountered birds and other organisms. The contributions of this study are summarized as follows.
  • The novel HDFpNet is proposed to fully exploit the contextual representation of text and image fusion and to efficiently generate high-quality, high-detail, feature-preserving images for use in optical devices.
  • A novel FESFE module is proposed to preserve the contextual representation of text and image fusion through CE and CS blocks to avoid information loss before feature extraction, enabling the FFE block to extract features quickly without information loss.
  • The experimental results indicate that the proposed HDFpNet achieves better performance and visual representation on the CUB-Bird and MS-COCO datasets than comparable state-of-the-art algorithms. The generated images are closer to the real ones and faithful to the text description, especially in terms of high image detail and feature preservation, making this method suitable for the use and application of optical devices.

2. Related Work

2.1. GANs

The effectiveness of GANs [1] and deep learning has been recently demonstrated by breakthroughs in various applications [2,3,4,5,6,7,8,9,10,11,12,13,14], including improving the accuracy of pedestrian detection by means of super-resolution [19], enhancing underwater images [20], and ensuring color constancy and color consistency [21]. Scholars have attempted to improve the efficiency of deep learning [22,23,24,25,26,27,28,29] and GANs [30,31,32], attaining favorable results; others have conducted image generation research based on these networks. Most GANs generate images using conditioning variables, including attributes or class labels [33,34,35]. GAN-based methods also use images as conditioning variables to generate or edit other images [36,37]. Text-to-image generators are a key focus in GAN research. Text-to-image generation aims to create near-realistic images based on a provided description in a natural language. To achieve this goal, the quality and content of the generated images must be in accordance with the provided descriptions.

2.2. Text-to-Image Generation Networks

Reed et al. [35] used GANs to convert text into images, developing a simple and effective GAN architecture and training strategy that can synthesize captivating bird and flower images from human-written descriptions. However, the resolution of these images is very low. To address this issue, Zhang et al. proposed StackGAN [9] and StackGAN++ [11], two multistage methods in which the image generated at each stage is used as input for the next stage, thereby increasing the image resolution. However, the generated images still fail to reflect object shapes and colors and often contain implausible bird structures. AttnGAN [10] introduced an attention mechanism into multistage image generation, allowing the model to perform word-level processing and enhance image details while effectively generating text-to-image content for complex scenes. Nevertheless, many generated images remain distorted, containing illogical artifacts such as birds with two heads.
DM-GAN [14] introduced a dynamic memory generation network (DM-GN) to adjust the weight of each word in a sentence and enhance the blurry parts of the image generated in the previous stage. By weighting words individually according to their importance, it improves the quality of the final image. However, it emphasizes the overall completeness and clarity of the generated image while ignoring illogical local details. MirrorGAN [13] reconstructs text descriptions during image generation to improve the consistency between text and images. Because it emphasizes the consistency between the generated images and text descriptions, it neglects fine details, including the restoration of eyes and feet. KT-GAN [12] used a knowledge transfer method from image-to-image models to support text-to-image models, thereby improving image quality; however, the generated images do not restore features well. Recently proposed Transformer-based text-to-image methods can generate more complex images by computing a large number of text parameters and combining them with an image encoder. CogView [30] and DALL-E [31] leveraged the advantages of Transformer-based methods to generate visually reasonable images for various types of scene descriptions. Although they can generate rough background outlines, the former produces relatively blurry backgrounds, while the latter restores details insufficiently. CSM-GAN [32] proposed a cross-modal semantic matching method that better encodes local key information to highlight detailed features in synthesized images and important local structural information in text descriptions. However, it ignores details in the picture, such as a bird's feet, which may be rendered illogically, for example, with only one foot or abnormally thin feet. DR-GAN [33] generates images from text descriptions through improved distribution learning, obtaining a more accurate image distribution from key information to produce better-quality images. Although its overall rendering of details is better, some problems remain, including artifacts and poorly presented details such as birds' feet.
The multistage methods [9,10,11,12,13,14] primarily aim to enhance the image resolution and improve the extraction of image and text features. The extracted features are processed using a residual network to retain relevant features. However, none of these methods optimize the current image features. As the depth of the network layer increases, the number of parameters processed by the residual network also increases; some data may be lost during propagation through the bottleneck layer. To address these issues, the HDFpNet model is proposed to retain the image features and improve the image quality in this study.

3. Method

The network architecture of the proposed HDFpNet is shown in Figure 2. It consists of three components: the iT2IG module, the FESFE module, and the CA mechanism.

3.1. iT2IG Module

The transformation of text descriptions to image feature maps is a vital topic in text-to-image generation because the generation of an adequate feature map is important for subsequent image enhancement. The first component of the proposed framework is the iT2IG module (Figure 2a), which performs initial feature generation. The module incorporates a text encoder, an initial image generator G0, and a dynamic memory generative network (DM-GN) [14].
1. Text Encoder: Following the methods adopted in [10,14], textual data are processed by a pre-trained bidirectional long short-term memory (LSTM) text encoder [10] that transforms descriptive sentences into sentence features $s$ and word features $W$.
2. Initial Image Generator G0: The conditioning augmentation $F_{ca}$ proposed in [11] is used to optimize the sentence features. Next, G0 is used to generate the initial image $I_0$ and the initial image features $H_0$:
$$I_0, H_0 = G_0(z, F_{ca}(s)),$$ (1)
where $z \sim N(0, 1)$ is a noise vector and G0 denotes the initial image generator, which is composed of four upsampling blocks. $F_{ca}$ denotes the conditioning augmentation block [11].
3. DM-GN: The DM-GN [14] is used to generate feature maps because its image enhancement method dynamically adjusts word weights to produce images with more word-level detail. As presented in Figure 2a, the inputs to the DM-GN are the image features $H_i$ and the word features $W$:
$$W = \{w_1, w_2, w_3, \ldots, w_T\}, \quad w_i \in \mathbb{R}^{N_w},$$ (2)
$$H_i = \{h_1, h_2, h_3, \ldots, h_N\}, \quad h_i \in \mathbb{R}^{N_r},$$ (3)
where $T$ denotes the number of words, $N_w$ the dimensionality of the word features, $N$ the number of pixels, and $N_r$ the dimensionality of the image features. In this study, $N_w = 256$ and $N_r = 64$. The image and word feature vectors are input into the DM-GN to perform feature recombination. To process the text features, $1 \times 1$ kernels select word features, which are then placed in a feature space:
$$m_i = \mathrm{Conv}(w_i), \quad m_i \in \mathbb{R}^{N_m},$$ (4)
where $\mathrm{Conv}(\cdot)$ denotes the selection of word features by the $1 \times 1$ kernels, $N_m$ denotes the dimensionality of the feature space ($N_m = 128$), and $m$ denotes the feature space. Equation (4) on its own is not used to select specific word vectors and extract image features. Instead, the word features $W$ are combined with the image features $H_i$ generated in the previous stage to determine the weight of each word. The weight is calculated as
$$d_i^w(H, w_i) = \delta\left(A w_i + B \cdot \frac{1}{N}\sum_{i=1}^{N} h_i\right),$$ (5)
where $\delta(\cdot)$ denotes the sigmoid activation function, $A$ denotes a $1 \times N_w$ matrix, and $B$ denotes a $1 \times N_r$ matrix. After the weight of each word is calculated, the word and image features are recombined according to the corresponding weights and placed in the feature space:
$$m_i = C_w(w_i)\, d_i^w + C_r\left(\frac{1}{N}\sum_{i=1}^{N} h_i\right)\left(1 - d_i^w\right), \quad m_i \in \mathbb{R}^{N_m},$$ (6)
where $C_w$ and $C_r$ denote the selection of word and image features, respectively, by $1 \times 1$ kernels, and $m$ denotes the feature space with dimensionality $N_m$. After the image and word features are integrated, the weight of the $j$th image feature with respect to the $i$th feature-space entry is calculated as
$$\alpha_{i,j} = \frac{\exp\left(K(m_i)^{T} h_j\right)}{\sum_{l=1}^{T} \exp\left(K(m_l)^{T} h_j\right)},$$ (7)
where $\alpha_{i,j}$ denotes the similarity probability between the $i$th feature-space entry and the $j$th image feature, and $K$ denotes feature selection by $1 \times 1$ kernels, which changes the dimensionality of the feature space to $N_r$ so that it matches the dimensionality of the image features. After the weight calculation, $\alpha_{i,j}$ and the feature space $m_i$ are used to compute
$$o_j = \sum_{i=1}^{T} \alpha_{i,j}\, V(m_i),$$ (8)
where $o$ denotes the weighted feature space and $V$ denotes feature selection by $1 \times 1$ kernels, which likewise maps the feature space to dimensionality $N_r$. Finally, $o$ is concatenated with the corresponding image features to create a set of fused image features:
$$h_i' = [o_i, h_i],$$ (9)
where $[\cdot, \cdot]$ denotes concatenation, which pairs each image feature with its corresponding textual representation.
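To make the fusion in Equations (4)–(9) concrete, the following PyTorch sketch implements the word-weighting and attention steps under simplifying assumptions of our own (the layer names `gate_w`, `gate_r`, `conv_w`, `conv_r`, `key`, and `value`, and the treatment of $1 \times 1$ convolutions as per-token linear layers, are illustrative). It is a minimal sketch of the DM-GAN-style memory writing [14], not the exact implementation used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryFusion(nn.Module):
    """Sketch of the DM-GN-style word/image fusion in Eqs. (4)-(9).

    W: word features, shape (B, T, N_w); H: image features, shape (B, N, N_r),
    where N is the number of pixels. Layer names are illustrative.
    """
    def __init__(self, n_w=256, n_r=64, n_m=128):
        super().__init__()
        self.gate_w = nn.Linear(n_w, 1)    # A in Eq. (5)
        self.gate_r = nn.Linear(n_r, 1)    # B in Eq. (5)
        self.conv_w = nn.Linear(n_w, n_m)  # C_w in Eq. (6) (1x1 conv == per-word linear)
        self.conv_r = nn.Linear(n_r, n_m)  # C_r in Eq. (6)
        self.key = nn.Linear(n_m, n_r)     # K in Eq. (7)
        self.value = nn.Linear(n_m, n_r)   # V in Eq. (8)

    def forward(self, W, H):
        h_mean = H.mean(dim=1)                                  # (B, N_r), average over pixels
        # Eq. (5): per-word writing gate d_i^w
        d = torch.sigmoid(self.gate_w(W) + self.gate_r(h_mean).unsqueeze(1))    # (B, T, 1)
        # Eq. (6): gated memory slots m_i
        m = self.conv_w(W) * d + self.conv_r(h_mean).unsqueeze(1) * (1.0 - d)   # (B, T, N_m)
        # Eq. (7): attention of each pixel j over memory slots i
        logits = torch.bmm(H, self.key(m).transpose(1, 2))      # (B, N, T) = h_j . K(m_i)
        alpha = F.softmax(logits, dim=2)
        # Eq. (8): weighted memory read o_j
        o = torch.bmm(alpha, self.value(m))                     # (B, N, N_r)
        # Eq. (9): concatenate the read-out with the original image features
        return torch.cat([o, H], dim=2)                         # (B, N, 2 * N_r)

# usage: 16 words, a 64x64 feature map flattened to N = 4096 pixels
fusion = MemoryFusion()
W = torch.randn(2, 16, 256)
H = torch.randn(2, 4096, 64)
print(fusion(W, H).shape)   # torch.Size([2, 4096, 128])
```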

3.2. FESFE Module

After the weight calculation, the image enhancement block must retain relevant image features in the next stage. In this study, a novel FESFE module is proposed to address the shortcomings of conventional residual blocks by retaining image features. This module contains three blocks: CE, FFE, and CS. These blocks are illustrated in Figure 2b–d, respectively.
1. CE Block: Based on previous studies on deep residual networks [35,42,43], this study introduces a receptive field to improve the relevance of the prediction models and increase the number of extracted image details. Previous studies used $3 \times 3$ filters without dilation to extract features [10,14]. The proposed CE block optimizes and excites the existing channels to increase the effectiveness of subsequent feature extraction (Figure 2b). Its purpose is to create a channel excitation effect that increases the number of channels, enabling subsequent feature extraction without feature loss. Batch normalization (BN) [37] is also applied during excitation to optimize the overall network: $BN(x) = \frac{x - \mathrm{mean}(x)}{\sqrt{\mathrm{Var}(x) + \epsilon}}$. Because each iteration of BN may introduce a small shift that causes considerable changes to the network output, the swish activation function [38], $f(x) = \frac{x}{1 + e^{-x}}$, is added after BN. This design enables the extraction of more image details than existing methods, thereby generating images that more closely resemble the target images. Mathematically, the process is described as
$$p_i = \rho\big(BN(C_{FE}(h_i))\big), \quad p \in \mathbb{R}^{2 N_m},$$ (10)
where $C_{FE}$ denotes a filter composed of $1 \times 1$ kernels that doubles the number of channels, $BN(\cdot)$ denotes the BN process [37], and $\rho(\cdot)$ denotes the swish activation function [38]. (A schematic implementation of the complete FESFE module is sketched after this list.)
2. FFE Block: Extending the channel dimension in the CE block (Figure 2b) increases the computational cost. To accelerate feature extraction without reducing the model's effectiveness, an FFE block (Figure 2c) is incorporated into the proposed model based on previous research [42,43]. The FFE block performs convolutions over the spatial and depth (i.e., channel) dimensions separately to reduce computation and extract features quickly. The block consists of a depth-wise convolution followed by a point-wise convolution:
$$p_i^{DW} = C_{DW}(p_i), \quad p \in \mathbb{R}^{2 N_m},$$ (11)
$$p_i^{PW} = C_{PW}(p_i^{DW}), \quad p \in \mathbb{R}^{2 N_m},$$ (12)
where $C_{DW}(\cdot)$ denotes a depth-wise convolution with $N$ $5 \times 5$ kernels, where $N$ is the number of channels, applied to calculate and stack the channels in each layer, and $C_{PW}(\cdot)$ denotes the point-wise convolution, in which $1 \times 1$ kernels generate a new feature map.
3. CS Block: After the new feature map is obtained, the CS block performs a channel squeeze to accelerate data propagation (Figure 2d) and reduce the number of channels to its original value, namely that obtained after the iT2IG module calculation:
$$p^{FC} = \rho\big(BN(C_{FC}(p_i^{PW}))\big), \quad p \in \mathbb{R}^{N_m},$$ (13)
where $C_{FC}$ denotes $1 \times 1$ filters that reduce the number of channels to the original value, $BN(\cdot)$ denotes the BN process [37], and $\rho(\cdot)$ denotes the swish activation function [38].
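As referenced above, the following PyTorch sketch summarizes one plausible composition of the three blocks, following Equations (10)–(13). The padding choice and the use of `nn.SiLU` (which implements the swish function) outside the equations' explicit terms are assumptions of ours; this is an illustrative sketch rather than the implementation used in the experiments.

```python
import torch
import torch.nn as nn

class FESFE(nn.Module):
    """Sketch of the FESFE module: channel excitation (CE), fast feature
    extraction (FFE), and channel squeeze (CS), following Eqs. (10)-(13)."""
    def __init__(self, n_m=128):
        super().__init__()
        c = 2 * n_m
        # CE: 1x1 conv doubles the channels, then BN + swish (Eq. (10))
        self.ce = nn.Sequential(nn.Conv2d(n_m, c, kernel_size=1),
                                nn.BatchNorm2d(c), nn.SiLU())
        # FFE: 5x5 depth-wise conv followed by 1x1 point-wise conv (Eqs. (11)-(12))
        self.ffe = nn.Sequential(
            nn.Conv2d(c, c, kernel_size=5, padding=2, groups=c),  # depth-wise
            nn.Conv2d(c, c, kernel_size=1),                       # point-wise
        )
        # CS: 1x1 conv back to the original channel count, then BN + swish (Eq. (13))
        self.cs = nn.Sequential(nn.Conv2d(c, n_m, kernel_size=1),
                                nn.BatchNorm2d(n_m), nn.SiLU())

    def forward(self, h):
        return self.cs(self.ffe(self.ce(h)))

# usage: a batch of fused feature maps with N_m = 128 channels
h = torch.randn(2, 128, 64, 64)
print(FESFE()(h).shape)   # torch.Size([2, 128, 64, 64])
```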

3.3. CA Mechanism

Finally, squeeze-and-excitation (SE) networks [16] are used to improve the quality of the feature map. These networks learn the relationships between channels to calculate a feature-adjustment weight for each channel. Figure 2e presents the architecture of the squeeze-and-excitation-based CA mechanism. This mechanism recalibrates the features of an input feature map through the “squeeze” process, in which global average pooling compresses the two-dimensional feature map of each channel into a single value, yielding a global descriptor over all channels. The output of global average pooling is calculated as follows:
$$z_c = F_{sq}(p^{FC}) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} p^{FC}(i, j),$$ (14)
where $p^{FC} \in \mathbb{R}^{H \times W}$ denotes the input feature map and $z_c$ denotes the compression result. After the channel information from the squeeze process is obtained, the “excitation” process predicts the importance of each channel. This is achieved using two fully connected layers, a rectified linear activation function, and a sigmoid function, whose learned weights are applied to the compressed channel information. The output of the excitation process is calculated as
$$s = F_{se}(z_c, W) = \delta\big(W_2\, \sigma(W_1 z_c)\big),$$ (15)
where $W_1$ and $W_2$ denote the parameters of the first and second fully connected layers, respectively, $\delta(\cdot)$ denotes the sigmoid function, and $\sigma(\cdot)$ denotes the rectified linear activation function. Finally, the output weights of the excitation operation are multiplied by the original features to further enhance the essential features, which serve as the final output of the SE module:
$$\tilde{x}_c = F_{scale}(p^{FC}, s_c) = s_c \cdot p^{FC}.$$ (16)
Finally, the output of the HDFpNet model is calculated as follows:
$$h_{NEW} = h + \lambda_h \cdot F_{HDFpNet}(h),$$ (17)
where $\lambda_h$ denotes a hyperparameter that is set to 0.1, and $F_{HDFpNet}(\cdot)$ denotes the combined calculation of the FESFE module and the CA mechanism. This process is repeated three times in each stage.
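The sketch below shows the SE-style channel attention of Equations (14)–(16) together with the residual-style update of Equation (17). The reduction ratio of the two fully connected layers is an assumption (the common value of 16 from [16]) and is not a value stated in this paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: squeeze (Eq. (14)), excitation (Eq. (15)),
    and channel-wise rescaling (Eq. (16)). The reduction ratio is an assumption."""
    def __init__(self, channels=128, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, p):
        z = p.mean(dim=(2, 3))                 # squeeze: global average pooling, (B, C)
        s = self.fc(z)                         # excitation: per-channel weights, (B, C)
        return p * s[:, :, None, None]         # scale: Eq. (16)

def refine(h, fesfe, ca, lambda_h=0.1, repeats=3):
    """Eq. (17): h_new = h + lambda_h * F_HDFpNet(h), applied three times per stage."""
    for _ in range(repeats):
        h = h + lambda_h * ca(fesfe(h))
    return h

# usage with any shape-preserving FESFE implementation (identity used as a stand-in)
fesfe = nn.Identity()
ca = ChannelAttention(128)
h = torch.randn(2, 128, 64, 64)
print(refine(h, fesfe, ca).shape)   # torch.Size([2, 128, 64, 64])
```

Keeping $\lambda_h$ small makes each refinement a gentle perturbation of the fused features, which is consistent with the ablation in Table 4, where $\lambda_h = 1.0$ degrades performance.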

3.4. Loss Functions

In this study, GAN loss functions are combined with other loss functions to measure the match between text and image features. The image generator is trained with the loss function $\mathcal{L}_{G_i}$, which is expressed as follows:
$$\mathcal{L}_{G_i} = -\frac{1}{2}\Big[\underbrace{\mathbb{E}_{x \sim P_{G_i}} \log D_i(x)}_{\text{unconditional loss}} + \underbrace{\mathbb{E}_{x \sim P_{G_i}} \log D_i(x, s)}_{\text{conditional loss}}\Big],$$ (18)
where the first term is the unconditional loss, which drives the generated images to closely resemble real images, and the second term is the conditional loss, which ensures that the content of the generated image matches the description of the input sentence.
The discriminator is trained with the loss function $\mathcal{L}_{D_i}$, which is expressed as follows:
$$\mathcal{L}_{D_i} = -\frac{1}{2}\Big[\underbrace{\mathbb{E}_{x \sim P_{data}} \log D_i(x) + \mathbb{E}_{x \sim P_{G_i}} \log\big(1 - D_i(x)\big)}_{\text{unconditional loss}} + \underbrace{\mathbb{E}_{x \sim P_{data}} \log D_i(x, s) + \mathbb{E}_{x \sim P_{G_i}} \log\big(1 - D_i(x, s)\big)}_{\text{conditional loss}}\Big],$$ (19)
where the unconditional loss distinguishes generated images from real images, and the conditional loss determines whether the image and the input sentence match. The conditioning augmentation (CA) loss for text processing was proposed in [9]. It resamples sentence vectors from an independent Gaussian distribution to prevent overfitting, thereby effectively increasing the volume of training data. The CA loss function $\mathcal{L}_{CA}$ is expressed as follows:
$$\mathcal{L}_{CA} = D_{KL}\big(\mathcal{N}(\mu(s), \Sigma(s))\,\|\,\mathcal{N}(0, I)\big),$$ (20)
where $\mu(s)$ denotes the mean of the sentence vector and $\Sigma(s)$ denotes its covariance matrix; both are computed by fully connected layers. The deep attentional multimodal similarity model (DAMSM) [10] measures the match between word and image features. Accordingly, the total loss function $\mathcal{L}$ is expressed as follows:
$$\mathcal{L} = \mathcal{L}_{G_i} + \lambda_1 \mathcal{L}_{CA} + \lambda_2 \mathcal{L}_{DAMSM},$$ (21)
where $\lambda_1$ and $\lambda_2$ are hyperparameters.
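The following sketch shows how the losses of Equations (18)–(21) could be written, assuming the discriminator heads output probabilities in (0, 1) and a diagonal covariance for the conditioning-augmentation distribution; the DAMSM term is passed in as a precomputed value because its calculation follows [10] and is not detailed here. The function names and the epsilon stabilizer are assumptions of ours, not the training code used in the experiments.

```python
import torch

def generator_loss(d_uncond_fake, d_cond_fake):
    """Eq. (18): -0.5 * (E[log D(x)] + E[log D(x, s)]) on generated images.
    Inputs are discriminator probabilities in (0, 1) for fake images."""
    eps = 1e-8
    return -0.5 * (torch.log(d_uncond_fake + eps).mean()
                   + torch.log(d_cond_fake + eps).mean())

def discriminator_loss(d_uncond_real, d_uncond_fake, d_cond_real, d_cond_fake):
    """Eq. (19): real images should score high and generated images low,
    for both the unconditional and the text-conditional heads."""
    eps = 1e-8
    uncond = (torch.log(d_uncond_real + eps).mean()
              + torch.log(1.0 - d_uncond_fake + eps).mean())
    cond = (torch.log(d_cond_real + eps).mean()
            + torch.log(1.0 - d_cond_fake + eps).mean())
    return -0.5 * (uncond + cond)

def ca_loss(mu, logvar):
    """Eq. (20): KL divergence between N(mu, Sigma) and N(0, I) for the
    conditioning-augmentation distribution (diagonal covariance assumed)."""
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

def total_generator_loss(d_uncond_fake, d_cond_fake, mu, logvar, damsm_loss,
                         lambda1=1.0, lambda2=5.0):
    """Eq. (21): L = L_G + lambda1 * L_CA + lambda2 * L_DAMSM (CUB-Bird settings)."""
    return (generator_loss(d_uncond_fake, d_cond_fake)
            + lambda1 * ca_loss(mu, logvar) + lambda2 * damsm_loss)

# usage with dummy discriminator probabilities and CA statistics
B = 4
d_fake = torch.rand(B) * 0.5 + 0.25
mu, logvar = torch.zeros(B, 100), torch.zeros(B, 100)
print(total_generator_loss(d_fake, d_fake, mu, logvar, damsm_loss=torch.tensor(0.0)))
```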

4. Experiments

4.1. Experimental Settings

  • Implementation Details: To generate an initial feature map, text processing was performed using a bidirectional LSTM text encoder as proposed in [10]. This encoder transformed each textual description into two hidden states representing the forward and backward sentences; each word also had two corresponding hidden states. The two hidden states of each sentence were concatenated and then input into the initial image generator to create the initial image features. Three image generation stages were conducted to produce images with resolutions of 64 × 64, 128 × 128, and 256 × 256 in sequence. Subsequently, the dynamic network generator proposed in [14] was employed to integrate the hidden states of each word with the initial image features. Following the training method described in [14], the CUB-Bird dataset parameters were configured as $\lambda_1 = 1$ and $\lambda_2 = 5$, while the MS-COCO dataset parameters were configured as $\lambda_1 = 1$ and $\lambda_2 = 50$. After the initial text–image feature map was generated, $1 \times 1$ filters doubled the number of feature map channels. Then, depth-wise and point-wise convolutions were performed using $5 \times 5$ and $1 \times 1$ filters, respectively. Finally, $1 \times 1$ filters reduced the number of feature map channels to the original value ($N_m = 128$). All networks were trained using the Adam optimizer [39] with a batch size of 16 and a learning rate of 0.0002. The CUB-Bird and MS-COCO datasets were used to train the HDFpNet model for 600 and 200 epochs, respectively. The model was trained on an NVIDIA RTX A5000 card.
  • Dataset: Two public datasets were utilized in this study—the CUB-Bird dataset and the MS-COCO dataset. The CUB-Bird dataset comprises 11,788 images of 200 bird species, with each image being accompanied by ten descriptive sentences. The MS-COCO dataset contains 80,000 training images and 40,000 test images, with each image being associated with five textual annotations.
  • Evaluation: The performance of the HDFpNet model was evaluated through a series of experiments. First, the effectiveness of each block within the model was assessed. The performance of the model was then compared with that of other state-of-the-art text-to-image generation models. For the task of generating 30,000 images from textual descriptions in the unseen test split, the Fréchet inception distance (FID) [42] and R-precision were employed to evaluate the model's performance, since direct comparisons against ground-truth images are not possible. The FID is based on features extracted from the Inception v3 network [43] and measures the Fréchet distance between the distributions of generated and real-world images; a low FID value indicates that the generated images closely resemble real-world images. R-precision evaluates the extent to which a generated image matches the conditions specified in a textual description. Specifically, it employs a query mechanism to determine the relevance of documents retrieved by a system. This study implemented R-precision following the procedures described in [10]: the cosine distance between the global image vector and the candidate sentence vectors was calculated, where the candidates comprised R ground-truth descriptions and 100 randomly selected mismatched descriptions. In each query, if r of the first R retrieved documents are relevant, then the R-precision equals r/R; in this study, R was set to 1. The generated images were divided into 10 folds for the queries, and the obtained scores are presented as means with standard deviations. (A sketch of the R-precision computation is given below.)
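As a concrete illustration of this protocol (with R = 1), the sketch below ranks the ground-truth sentence against 100 mismatched sentences by cosine similarity to the global image vector. The vector inputs stand in for the outputs of the pre-trained DAMSM-style image and text encoders; the function name and tensor shapes are assumptions of ours.

```python
import torch
import torch.nn.functional as F

def r_precision(image_vecs, true_sent_vecs, wrong_sent_vecs, R=1):
    """R-precision with R = 1: for each generated image, the ground-truth
    sentence plus 100 mismatched sentences are ranked by cosine similarity;
    the query counts as a hit if the ground-truth sentence is ranked in the top R.

    image_vecs:      (B, D)      global vectors of generated images
    true_sent_vecs:  (B, D)      vectors of the matching descriptions
    wrong_sent_vecs: (B, 100, D) vectors of randomly selected mismatched descriptions
    """
    img = F.normalize(image_vecs, dim=-1)
    cands = torch.cat([true_sent_vecs.unsqueeze(1), wrong_sent_vecs], dim=1)  # (B, 101, D)
    cands = F.normalize(cands, dim=-1)
    sims = torch.einsum("bd,bkd->bk", img, cands)        # cosine similarity per candidate
    top = sims.topk(R, dim=1).indices                    # indices of the R best candidates
    hits = (top == 0).any(dim=1).float()                 # index 0 is the matching sentence
    return hits.mean().item()

# usage with random vectors (expected score ~ R/101 for uninformative features)
B, D = 32, 256
score = r_precision(torch.randn(B, D), torch.randn(B, D), torch.randn(B, 100, D))
print(f"R-precision: {score:.3f}")
```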

4.2. Comparisons with the State-of-the-Art Algorithms

To validate the performance of the proposed method, HDFpNet was compared with state-of-the-art approaches on the CUB-Bird and MS-COCO datasets. The FID and R-precision results are presented in Table 1 and Table 2, respectively. As shown in Table 1, HDFpNet outperformed the compared methods in terms of the FID, with lower FID scores being indicative of better performance. For example, on the CUB-Bird dataset, HDFpNet and DM-GAN achieved FID scores of 13.89 and 16.09, respectively, while, on the MS-COCO dataset, their respective FID scores were 27.21 and 32.64. The lower FID scores achieved by HDFpNet demonstrate that it obtains a superior data distribution. Table 2 also reveals that the proposed HDFpNet model achieved R-precision scores of 84.33% and 92.44% on the CUB-Bird and MS-COCO datasets, respectively. These scores are notably higher than those achieved by the other methods. The high R-precision scores indicate that the proposed model generated images that closely matched their corresponding descriptions, thereby verifying its effectiveness. Figure 3 presents examples of images generated by HDFpNet and the state-of-the-art models. Overall, the HDFpNet model generated images with greater detail than the StackGAN [9], AttnGAN [10], and DM-GAN [14] models. This superior performance can be attributed to the fact that HDFpNet retained the image features selected by DM-GAN, enabling it to better understand the logic of the input description and present a clearer image structure. An analysis of the image samples generated from the CUB-Bird dataset (Figure 3) indicates that, at the word level, the AttnGAN [10] and DM-GAN [14] models performed well in capturing and presenting text features. However, HDFpNet was even more effective in displaying word-level features, generating images that accurately reproduced object features and structures in detail. These images closely matched the corresponding input text and contained fine details such as the feather textures and other anatomical characteristics (e.g., eyes and legs) of birds (Figure 3). As a result, the generated bird images closely resembled real images. These findings confirm that the FESFE module enables the HDFpNet model to generate more realistic images. Furthermore, the MS-COCO dataset was used to test the network’s ability to generate images containing multiple objects (Figure 4). When an input text described multiple themes or objects with complex structures, generating an accurate image was challenging. In each image, HDFpNet could identify the most essential themes and capture the main scene, with the remaining textual content being sorted in a logical sequence to arrange the objects in the image. For instance, as shown in the second row of Figure 4, HDFpNet could accurately generate an image of a room with wooden textural characteristics. These results further verify that the proposed model can enhance image details and follow textual logic.
Figure 5 presents comparisons of real images with those generated by the HDFpNet model at various image enhancement stages. The generated images were based on the feature maps created by the iT2IG module using the CUB-Bird dataset. The results reveal that, even before the final enhancement stage, the generated images already contained all of the required details. The final enhancement stage further improved these details to create more realistic images. The feature maps enhanced by the FFE block and CA mechanism facilitated the generation of objects with accurate structures and colors. The feature enhancement process at each stage involved optimizing the feature map generated in the previous stage. This process significantly improved the object structures and details, such as eyes and texture features, resulting in generated images that closely resembled real images. Figure 6 presents various types of images generated by HDFpNet using the same text descriptions from the CUB-Bird dataset. Despite being generated from the same input text, these images exhibit diverse styles and satisfactory quality in terms of object structures and textural details.

4.3. Ablation Study

(1) Effectiveness of the FESFE Module and CA Mechanism: The proposed HDFpNet model comprises the iT2IG module, which is based on DM-GAN [14]; the FESFE module; and the CA mechanism. An ablation experiment was conducted to evaluate the effectiveness of the CE and CS blocks, the FFE block in the FESFE module, and the CA mechanism. The results are presented in Table 3. The feature extraction capability of the proposed model was significantly improved by increasing the number of channels, while including the CA mechanism optimized the overall network structure. The results demonstrate that the combined use of these blocks enabled the outstanding performance of the proposed model for text-to-image generation, thereby verifying the effectiveness of the FESFE module and CA mechanism. Notably, the +CE +FFE +CA +CS configuration in the fourth row of Table 3 applies the CA mechanism immediately after feature extraction by the FFE block, before the channel squeeze; it was created to verify the effect of the CA mechanism's placement in the HDFpNet model. The results confirm that applying the CA mechanism before the channel squeeze reduced the model's performance, whereas applying it after the CS block yielded optimal model performance.
(2) Study of $\lambda_h$: The hyperparameter $\lambda_h$ was employed to improve the performance of the HDFpNet model. An experiment was conducted using two values, $\lambda_h = 0.1$ and $\lambda_h = 1$. The larger value ($\lambda_h = 1$) was found to have a negative impact on the model performance (Table 4). As a result, $\lambda_h = 0.1$ was used for subsequent analyses. Evaluations based on the FID and R-precision indicators also confirmed that this value yielded the optimal results.
(3) Study of $F_{HDFpNet}$: An experiment was conducted to determine the optimal number of $F_{HDFpNet}$ iterations required to generate the best feature map. Between one and five iterations were performed. Lower iteration numbers (≤3) resulted in higher performance, with the optimal performance achieved at three iterations (Table 5); iteration numbers greater than three reduced the model performance. As a result, the iteration number was set to three in the subsequent analyses. This configuration was verified to yield outstanding image generation quality.

5. Conclusions

The novel HDFpNet network has been proposed for high-detail and feature-preserving text-to-image generation tasks and requires fewer feature computation resources. The iT2IG module is first used to generate initial feature maps to avoid the loss of initial image details and reduce feature loss. The FESFE module is proposed to efficiently and recursively generate high-detail and feature-preserving images using CE, FFE, and CS blocks with a lower feature computation cost. It can significantly avoid feature loss and simultaneously enhance the image details to greatly improve the quality of the generated images. Finally, the CA mechanism is adopted to further enrich the feature details. The experimental results indicate that the proposed HDFpNet has more promising performance and better visual presentation than the state-of-the-art approaches on the CUB-Bird and MS-COCO datasets, especially in terms of high-detail images and feature preservation. In future work, studies can be expanded to various applications of optical devices, such as creating art using cameras and smartphones, translating text into images for web applications using smartphones and other optical devices, and creating data visualizations with smartphones.

Author Contributions

Conceptualization, W.-Y.H.; methodology, W.-Y.H.; software, J.-W.L.; validation, J.-W.L.; formal analysis, W.-Y.H. and J.-W.L.; investigation, W.-Y.H.; resources, W.-Y.H.; data curation, W.-Y.H.; writing—original draft preparation, W.-Y.H. and J.-W.L.; writing—review and editing, W.-Y.H. and J.-W.L.; visualization, J.-W.L.; supervision, W.-Y.H.; project administration, W.-Y.H.; funding acquisition, W.-Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science and Technology Council, Taiwan, grant numbers NSTC110-2221-E-194-027-MY3 and NSTC111-2410-H-194-038-MY3. The APC was funded by the National Science and Technology Council, Taiwan.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

We acknowledge Jia-Yan Yang and Tsung-Ju Li for their assistance with the experiments.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Systems. 2014, 27, 2672–2680. [Google Scholar]
  2. Xia, W.; Zhang, Y.; Yang, Y.; Xue, J.-H.; Zhou, B.; Yang, M.-H. GAN Inversion: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3121–3138. [Google Scholar] [CrossRef]
  3. Li, J.; Li, B.; Jiang, Y.; Tian, L.; Cai, W. MrFDDGAN: Multireceptive Field Feature Transfer and Dual Discriminator-Driven Generative Adversarial Network for Infrared and Color Visible Image Fusion. IEEE Trans. Instrum. Meas. 2023, 72, 5006228. [Google Scholar] [CrossRef]
  4. Huang, Z.; Zhang, J.; Zhang, Y.; Shan, H. DU-GAN: Generative adversarial networks with dual-domain U-Net-based discriminators for low-dose CT denoising. IEEE Trans. Instrum. Meas. 2021, 71, 4500512. [Google Scholar] [CrossRef]
  5. Hsu, W.-Y.; Chang, W.-C. Wavelet Approximation-Aware Residual Network for Single Image Deraining. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15979–15995. [Google Scholar] [CrossRef]
  6. Duman, B. A Real-Time Green and Lightweight Model for Detection of Liquefied Petroleum Gas Cylinder Surface Defects Based on YOLOv5. Appl. Sci. 2025, 15, 458. [Google Scholar] [CrossRef]
  7. Hsu, W.-Y.; Yang, P.-Y. Pedestrian Detection Using Multi-Scale Structure-Enhanced Super-Resolution. IEEE Trans. Intell. Transp. Syst. 2023, 24, 12312–12322. [Google Scholar] [CrossRef]
  8. Hsu, W.-Y.; Chung, C.-J. A Novel Eye Center Localization Method for Head Poses With Large Rotations. IEEE Trans. Image Process. 2020, 30, 1369–1381. [Google Scholar] [CrossRef]
  9. Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; Metaxas, D. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 27–29 October 2017; pp. 5907–5915. [Google Scholar]
  10. Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang, X.; He, X. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1316–1324. [Google Scholar]
  11. Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; Metaxas, D.N. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 1947–1962. [Google Scholar] [CrossRef]
  12. Tan, H.; Liu, X.; Liu, M.; Yin, B.; Li, X. KT-GAN: Knowledge-transfer generative adversarial network for text-to-image synthesis. IEEE Trans. Image Process. 2020, 30, 1275–1290. [Google Scholar] [CrossRef]
  13. Qiao, T.; Zhang, J.; Xu, D.; Tao, D. Mirrorgan: Learning text-to-image generation by redescription. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1505–1514. [Google Scholar]
  14. Zhu, M.; Pan, P.; Chen, W.; Yang, Y. Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5802–5810. [Google Scholar]
  15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  16. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  17. Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The caltech-ucsd birds-200-2011 dataset. 2011. Available online: https://authors.library.caltech.edu/records/cvm3y-5hh21 (accessed on 2 January 2025).
  18. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, L. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014, Proceedings, Part V 13; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  19. Hsu, W.-Y.; Wu, C.-H. Wavelet structure-texture-aware super-resolution for pedestrian detection. Inf. Sci. 2025, 691, 121612. [Google Scholar] [CrossRef]
  20. Hsu, W.-Y.; Hsu, Y.-Y. Multi-Scale and Multi-Layer Lattice Transformer for Underwater Image Enhancement. ACM Trans. Multimedia Comput. Commun. Appl. 2024, 20, 354. [Google Scholar] [CrossRef]
  21. Padovano, D.; Martinez-Rodrigo, A.; Pastor, J.M.; Rieta, J.J.; Alcaraz, R. Deep Learning and Recurrence Information Analysis for the Automatic Detection of Obstructive Sleep Apnea. Appl. Sci. 2025, 15, 433. [Google Scholar] [CrossRef]
  22. Hsu, W.Y.; Lin, H.W. Context-Detail-Aware United Network for Single Image Deraining. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 20, 1–18. [Google Scholar] [CrossRef]
  23. Hsu, W.Y.; Jian, P.W. Wavelet Pyramid Recurrent Structure-Preserving Attention Network for Single Image Super-Resolution. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 15772–15786. [Google Scholar] [CrossRef] [PubMed]
  24. Hsu, W.-Y.; Lin, W.-Y. Ratio-and-Scale-Aware YOLO for Pedestrian Detection. IEEE Trans. Image Process. 2020, 30, 934–947. [Google Scholar] [CrossRef] [PubMed]
  25. Hsu, W.-Y.; Jian, P.-W. Recurrent Multi-scale Approximation-Guided Network for Single Image Super-Resolution. ACM Trans. Multimedia Comput. Commun. Appl. 2023, 19, 1–21. [Google Scholar] [CrossRef]
  26. Hsu, W.-Y.; Chang, W.-C. Recurrent wavelet structure-preserving residual network for single image deraining. Pattern Recognit. 2023, 137, 109294. [Google Scholar] [CrossRef]
  27. Mouri Zadeh Khaki, A.; Choi, A. Optimizing Deep Learning Acceleration on FPGA for Real-Time and Resource-Efficient Image Classification. Appl. Sci. 2025, 15, 422. [Google Scholar] [CrossRef]
  28. Pico, N.; Montero, E.; Vanegas, M.; Erazo Ayon, J.M.; Auh, E.; Shin, J.; Doh, M.; Park, S.-H.; Moon, H. Integrating Radar-Based Obstacle Detection with Deep Reinforcement Learning for Robust Autonomous Navigation. Appl. Sci. 2024, 15, 295. [Google Scholar] [CrossRef]
  29. Hsu, W.-Y.; Chung, C.-J. A novel eye center localization method for multiview faces. Pattern Recognit. 2021, 119, 108078. [Google Scholar] [CrossRef]
  30. Ding, M.; Yang, Z.; Hong, W.; Zheng, W.; Zhou, C.; Yin, D.; Lin, J.; Zou, X.; Shao, Z.; Yang, H.; et al. Cogview: Mastering text-to-image generation via transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 19822–19835. [Google Scholar]
  31. Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-shot text-to-image generation. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR; pp. 8821–8831. [Google Scholar]
  32. Tan, H.; Liu, X.; Yin, B.; Li, X. Cross-Modal Semantic Matching Generative Adversarial Networks for Text-to-Image Synthesis. IEEE Trans. Multimed. 2022, 24, 832–845. [Google Scholar] [CrossRef]
  33. Tan, H.; Liu, X.; Yin, B.; Li, X. DR-GAN: Distribution Regularization for Text-to-Image Generation. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 10309–10323. [Google Scholar] [CrossRef] [PubMed]
  34. Vahdat, A.; Kautz, J. NVAE: A deep hierarchical variational autoencoder. Adv. Neural Inf. Process. Syst. 2020, 33, 19667–19679. [Google Scholar]
  35. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  36. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  37. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; PMLR; pp. 448–456. [Google Scholar]
  38. Zoph, B.; Le, Q. Searching for activation functions. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), Workshop Track, Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–13. [Google Scholar]
  39. Kingma, D.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the ICLR, Banff, AB, Canada, 14–16 April 2014; pp. 1–15. [Google Scholar]
  40. Li, B.; Qi, X.; Lukasiewicz, T.; Torr, P. Controllable text-to-image generation. Adv. Neural Inf. Process. Syst. 2019, 32, 2065–2075. [Google Scholar]
  41. Liu, B.; Song, K.; Zhu, Y.; de Melo, G.; Elgammal, A. Time: Text and image mutual-translation adversarial networks. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 2–9 February 2021; volume 35, pp. 2082–2090. [Google Scholar]
  42. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30, 6629–6640. [Google Scholar]
  43. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  44. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved techniques for training GANs. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, 5–10 December 2016; Volume 29.
Figure 1. Applications of the proposed HDFpNet model. (a) High-quality image generation from the warning signs on a construction site; (b) high-quality image generation from the descriptive content of machine operation steps; (c) high-quality image generation from information about birds in a tourist area.
Figure 2. Architecture of the proposed HDFpNet. It consists of the iT2IG module, FESFE module (including CE block, FFE block, and CS block), and CA mechanism.
Figure 3. Visual presentation and comparison of generated images by StackGAN [9], AttnGAN [10], DM-GAN [14], and HDFpNet (ours) conditioned on text descriptions from the CUB-Bird [17] test dataset.
Figure 4. Visual presentation and comparison of generated images by StackGAN [9], AttnGAN [10], DM-GAN [14], and HDFpNet (ours) conditioned on text descriptions from the MS-COCO [18] test dataset.
Figure 5. Visual presentation and comparison of generated images by HDFpNet in different stages and GT in the CUB-Bird dataset [17].
Figure 6. Different styles of images generated by HDFpNet from the same text description.
Table 1. Quantitative evaluation of different methods (StackGAN [9], AttnGAN [10], ControlGAN [40], MirrorGAN [13], KT-GAN [12], DM-GAN [14], TIME [41], and HDFpNet) in terms of FID↓ on the CUB-Bird and MS-COCO datasets. Lower is better for FID.

Dataset | StackGAN [9] | AttnGAN [10] | ControlGAN [40] | MirrorGAN [13] | KT-GAN [12] | DM-GAN [14] | TIME [41] | Ours
CUB | -- | 23.98 | -- | -- | 17.32 | 16.09 | 14.3 | 13.89
COCO | 74.05 | 81.59 | 35.49 | -- | 30.73 | 32.64 | 31.14 | 27.21
Table 2. Quantitative evaluation of different methods (StackGAN [9], AttnGAN [10], ControlGAN [40], MirrorGAN [13], KT-GAN [12], DM-GAN [14], TIME [41], and HDFpNet) in terms of R-precision↑ on the CUB-Bird and MS-COCO datasets. Higher is better for R-precision.

Dataset | StackGAN [9] | AttnGAN [10] | ControlGAN [40] | MirrorGAN [13] | KT-GAN [12] | DM-GAN [14] | TIME [41] | Ours
CUB | 10.37 | 67.82 | 69.33 | 69.58 | -- | 72.31 | 71.57 | 84.33
COCO | -- | 83.53 | 82.43 | 84.21 | -- | 91.87 | 89.57 | 92.44
Table 3. Performance evaluation of HDFpNet with different block combinations on the CUB-Bird dataset in terms of the FID and R-precision. Baseline represents the generation of feature maps using only the iT2IG module. CE, CS, FFE, and CA represent the CE block, CS block, FFE block, and CA mechanism, respectively.

Architecture | FID ↓ | R-Precision ↑
Baseline | 22.94 | 71.40%
+CE +CS | 21.45 | 79.12%
+CE +FFE +CS | 15.51 | 82.57%
+CE +FFE +CA +CS | 17.50 | 81.38%
+CE +FFE +CS +CA | 13.89 | 84.33%
Table 4. Performance evaluation of different $\lambda_h$ values on the CUB-Bird dataset in terms of FID and R-precision.

$\lambda_h$ | FID ↓ | R-Precision ↑
0.1 | 13.89 | 84.33%
1.0 | 15.42 | 82.91%
Table 5. Performance evaluation of HDFpNet with different numbers of iterations in terms of FID and R-precision.

Iteration (Times) | FID ↓ | R-Precision ↑
1 | 25.75 | 81.12%
2 | 14.301 | 82.72%
3 | 13.89 | 84.33%
4 | 14.91 | 83.15%
5 | 16.1153 | 83.25%
