Article

Exploration of Generative Neural Networks for Police Facial Sketches

by Nerea Sádaba-Campo 1 and Hilario Gómez-Moreno 1,2,*

1 Departamento de Teoría de la Señal y Comunicaciones, Universidad de Alcalá, 28871 Alcalá de Henares, Madrid, Spain
2 Instituto Universitario de Investigación en Ciencias Policiales (IUICP), Universidad de Alcalá, 28801 Alcalá de Henares, Madrid, Spain
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(2), 42; https://doi.org/10.3390/bdcc9020042
Submission received: 19 December 2024 / Revised: 30 January 2025 / Accepted: 12 February 2025 / Published: 14 February 2025

Abstract

This article addresses the impact of generative artificial intelligence on the creation of composite sketches for police investigations. The automation of this task, traditionally performed through artistic methods or image composition, has become a challenge that can be tackled with generative neural networks. In this context, technologies such as Generative Adversarial Networks, Variational Autoencoders, and Diffusion Models are analyzed. The study also focuses on the use of advanced tools like DALL-E, Midjourney, and primarily Stable Diffusion, which enable the generation of highly detailed and realistic facial images from textual descriptions or sketches and allow for rapid and precise morphofacial modifications. Additionally, the study explores the capacity of these tools to interpret user-provided facial feature descriptions and adjust the generated results accordingly. The article concludes that these technologies have the potential to automate the composite sketch creation process. Therefore, their integration could not only expedite this process but also enhance its accuracy and utility in the identification of suspects or missing persons, representing a groundbreaking advancement in the field of criminal investigation.

1. Introduction

In police investigations, the individualization and identification of individuals through somatometric characteristics are fundamental. Both concepts are closely related to identity, a value that uniquely distinguishes a person [1].
Currently, forensic identification aims to confirm or refute an identity beyond a reasonable doubt, relying on morphological analysis as a comparison method by examining the correspondences and discrepancies of each facial feature. This process follows international scientific protocols and recommendations, such as those issued by the Facial Identification Scientific Working Group (FISWG) [2]. Analysis ranges from examining global features, such as the ear, to specific marks like scars or alterations such as tattoos, including local features or specific areas within a global region, such as whether the left earlobe is attached or free.
The immense variability and uniqueness of facial features are due to the random combination of factors such as phenotype (the physical expression of an individual’s characteristics), genotype (genetic factors), and environmental influences throughout an individual’s life (diet, climate, diseases, accidents, etc.) [3]. Thus, in police investigations, obtaining the most accurate and detailed graphical representation of an individual, whether a missing person or a suspect, based on descriptions from witnesses or victims with visual contact with the subject, can be of great importance.
This representation, known as a police composite sketch, is a fundamental tool in guiding investigations. Although exact figurative representation may not always be achieved, it reduces the number of suspects or presents possible candidates. According to Chief Inspector José Carlos Beltrán Martín, head of the Forensic Anthropology Laboratory of the Scientific Police, Spanish National Police, the quality of the sketch largely depends on the specialist’s skill and the inherent subjectivity of witness statements, influenced by factors like stress and descriptive ability [4]. According to his testimony, criminal investigations have employed three evolving methods over time to achieve the most accurate representation of an individual’s facial features [5]:
  • Direct or Artistic Method: An artist’s drawing, created based on witness descriptions using traditional drawing techniques. This method is labor-intensive and time-consuming (Figure 1).
  • Composition Method: This approach is more straightforward and includes two types that have evolved sequentially:
    Identi-Kit: A semi-manual method involving paper-based drawings of different facial features that are inserted into a specially designed frame to assemble a complete face.
    Photo-Fit: Developed by SIRCHIE Laboratories, this method is similar to Identi-Kit but uses photographs of different facial segments instead of drawings.
  • Mixed Method: This method combines the direct/artistic approach (using artistic drawing techniques) with semi-mechanized techniques, such as the composition method. This approach is computerized, allowing the exchange of facial segments represented by drawings or photographs in a digital environment.
With technological advances, computer programs are now used to generate an individual’s face, incorporating software like Photoshop to refine image details. Software applications such as Faces (Figure 2) in Canada [6] or Facette in Germany [7] have been developed to create schematic facial configurations. The latter is used by the Spanish Scientific Police and is based on a database containing images of various features and accessories, enabling layer-by-layer modifications of each facial component: facial contour, eyes, mouth, hair, etc.
This evolution has led to the development of 3D software (e.g., FaceGen [8]), and, similarly, artificial intelligence could now be employed to facilitate this task further.

1.1. Artificial Intelligence (AI)

Artificial intelligence (AI) has developed the ability to create strikingly realistic images that challenge our perception of reality and prompt us to reflect on the meaning of identity in the digital age. Generative Neural Network-based systems, such as Diffusion Models and Generative Adversarial Networks (GANs), have become widespread, enabling high-quality image creation through sophisticated learning algorithms.
In facial image generation, it is essential to recognize that an image is an approximate representation of reality, containing inaccuracies from the capture process and disparities due to morphological changes, such as weight, expression, or pose shifts. This complexity can significantly hinder analysis or comparison. When configuring faces using composite sketch techniques, the process typically starts with general features and is refined toward specific details, unless a witness vividly recalls a particular trait. This approach will be applied in facial generation using artificial intelligence.
Building on these neural networks, generative tools currently in development have become widespread in a highly competitive field that evolves daily.
Generative Neural Networks can analyze large volumes of training data and, through machine learning, learn the probability distribution from which the data originated, allowing the generation of new examples based on this distribution. Neural networks are computational models comprising interconnected units, or neurons, designed to process information by capturing patterns or distributions in input data to generate outputs. In the field of image generation, facial synthesis often begins with a broad textual description that is iteratively refined to match a specific concept.
Beyond text inputs, additional guidance can be provided through images or targeted modifications using inpainting [9] or outpainting [10] techniques. These approaches allow for precise adjustments to align the generated image with user-defined preferences and specifications.
Tools discussed below, like DALL-E [11], Midjourney [12], and Stable Diffusion [13], are based on Diffusion Models, probabilistic models that gradually degrade data by injecting noise, then learn to reverse this process to generate samples [14]. There are also well-known Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), which are often combined to produce results more effectively.

1.1.1. Generative Adversarial Networks (GANs)

Generative Adversarial Networks (GANs) (Figure 3) are capable of generating images in a single step, providing speed but also some instability. They consist of two models:
  • Generative Model: Captures a data distribution and produces new output images.
  • Discriminative Model: Estimates the probability that a given sample originates from the training data rather than being created by the generator.
Both models train simultaneously, much like a dynamic between a forger and a police officer: one aims to improve the quality of its “forgeries” to appear real, while the other refines its ability to distinguish these from authentic samples. Consequently, the generated images become increasingly similar to real ones over time.
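To make this adversarial training loop concrete, the following minimal PyTorch sketch (with illustrative layer sizes and a small train_step function defined only for this example, not the architecture of any tool discussed later) trains a generator and a discriminator with the usual binary cross-entropy objectives:

import torch
import torch.nn as nn

latent_dim, img_dim = 64, 28 * 28   # illustrative sizes

generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, img_dim), nn.Tanh(),          # fake image with values in [-1, 1]
)
discriminator = nn.Sequential(
    nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),             # estimated probability that the sample is real
)

bce = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_batch):
    batch = real_batch.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # Discriminator: learn to separate real images from generated ones.
    z = torch.randn(batch, latent_dim)
    fake = generator(z).detach()
    loss_d = bce(discriminator(real_batch), real_labels) + bce(discriminator(fake), fake_labels)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: try to fool the discriminator into labelling fakes as real.
    z = torch.randn(batch, latent_dim)
    loss_g = bce(discriminator(generator(z)), real_labels)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

In practice, the two losses must be kept balanced, since an overly strong discriminator is one source of the training instability mentioned above.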

1.1.2. Variational Autoencoders (VAEs)

Variational Autoencoders (VAEs) accelerate image processing by using an encoder network that compresses input images into a lower-dimensional latent representation, which is then reconstructed by a decoder network. VAEs encode the input as a distribution in the latent space, aiming to minimize reconstruction error during decoding, thus reducing the loss function. Effective encoding–decoding reconstruction is essential, and regularization ensures that the latent space approximates a normal distribution, maintaining continuity (nearby points in the latent space produce similar content) and completeness (decoded results are meaningful) [15,16].
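A minimal PyTorch sketch of this encode–decode scheme (fully connected layers, illustrative dimensions, and inputs assumed to be scaled to [0, 1]) shows how the loss combines the reconstruction term with the Kullback–Leibler regularization that keeps the latent distribution close to a standard normal:

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, img_dim=28 * 28, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(img_dim, 256)
        self.mu = nn.Linear(256, latent_dim)       # mean of the latent distribution
        self.logvar = nn.Linear(256, latent_dim)   # log-variance of the latent distribution
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, img_dim), nn.Sigmoid())

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")     # reconstruction error
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # regularization toward N(0, I)
    return recon + kl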
However, due to the reduced latent space dimension, results can be blurry or low-quality. To improve this, models like Vector Quantized VAE (VQVAE) [17] discretize the latent space, unlike the continuous nature of traditional VAEs, stabilizing training and enabling powerful models, such as Transformers, to enhance relationships between latent codes, though at a slower processing rate. Another approach is VQGAN models, which add a discriminator to encourage the decoder to generate sharper, more realistic samples [18,19].

1.1.3. Transformers

Transformers have revolutionized sequential data processing, especially in Natural Language Processing (NLP) tasks like machine translation, text summarization, classification, and image captioning. Initially developed to enhance language translation, they excel at parallel data processing and handling variable-length sequences, enabling AI models to identify relationships and derive meaning from data fragments. In NLP, Transformers demonstrate high stability and efficiency, capturing complex relationships without relying on recurrent or convolutional networks [20,21].
Their architecture, depicted in Figure 4, is based on an encoder–decoder structure. The encoder maps input sequences to continuous representations, while the decoder generates output. The foundation of this architecture is the self-attention mechanism, allowing each token (meaningful text fragment) to relate its importance to others in a sequence, ensuring contextual relationships and preserving order.
Upon receiving input, the model transforms it into numerical vectors, or embeddings, that represent each token’s context. Similar words (e.g., “good” and “great”) receive closely positioned embeddings, while positional embeddings ensure sequence order, distinguishing phrases like “hot dog” from “dog hot”. The encoder leverages multi-head self-attention and feed-forward networks to enhance contextual understanding and learning.
In the final phase, the decoder processes dual input: its prior output and the encoder’s processed input, refining the output sequence. Here, self-attention is adjusted to prevent future position attention, preserving the model’s autoregressive nature.
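The scaled dot-product self-attention at the core of this architecture can be summarized in a few lines of Python (a single head with random, untrained projection matrices and illustrative shapes):

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (sequence_length, d_model); w_q, w_k, w_v: (d_model, d_head)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)   # how strongly each token attends to the others
    weights = F.softmax(scores, dim=-1)
    return weights @ v                        # context-aware token representations

seq_len, d_model, d_head = 6, 32, 8
x = torch.randn(seq_len, d_model)             # token embeddings plus positional embeddings
w_q, w_k, w_v = [torch.randn(d_model, d_head) for _ in range(3)]
out = self_attention(x, w_q, w_k, w_v)        # shape (6, 8)

In the decoder, the same computation is applied with a mask over the scores so that a token cannot attend to later positions, which is the autoregressive constraint described above.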
Despite their versatility, Transformers face challenges related to model complexity and bias detection, often due to model size. Efforts to improve these models include architectural adjustments and developments like Generative Adversarial Transformers, integrating adversarial networks [22] and U-Nets [23] with diffusion models for advanced generative tasks.

1.1.4. Diffusion Models

Diffusion Models rely on a training process where noise is iteratively added to images in a controlled manner by a scheduler until the image is completely distorted, leaving only noise (Figure 5). Then, a chain parameterized by neural networks, trained through variational inference, is used to restore the sample so that it aligns with the initial data distribution [24]. Diffusion models were first introduced in 2015 by Sohl-Dickstein et al. [25] and have since evolved through frameworks such as the Denoising Diffusion Probabilistic Model (DDPM) [26]. This framework was further refined in the paper Improved Denoising Diffusion Probabilistic Models [27], enabling the generation of realistic images.
Rather than directly predicting a less noisy image, the model aims to estimate the residual noise—that is, the discrepancy between a noisy image and its previous iteration (Figure 6)—so that it can remove the noise. Thus, with a sufficiently large dataset, typically of images, the model can be trained to reverse the noise process, generating clear new images.
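The following sketch, assuming a hypothetical trained noise-prediction network eps_model, illustrates the two halves of this process with a linear noise schedule in the style of DDPM [26]: the forward step corrupts a clean sample, and the reverse step removes the predicted residual noise.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)         # linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def add_noise(x0, t):
    # Forward process q(x_t | x_0): corrupt a clean image x0 with Gaussian noise at step t.
    noise = torch.randn_like(x0)
    x_t = alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * noise
    return x_t, noise                         # the network is trained to predict `noise`

@torch.no_grad()
def denoise_step(eps_model, x_t, t):
    # Reverse process p(x_{t-1} | x_t): subtract the predicted residual noise.
    eps = eps_model(x_t, t)
    x_prev = (x_t - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
    if t > 0:                                 # keep some stochasticity except at the last step
        x_prev = x_prev + betas[t].sqrt() * torch.randn_like(x_t)
    return x_prev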
The iterative nature of these models contributes to their diverse and high-quality outputs; however, they are resource-intensive in terms of time and memory. Consequently, these models are often combined with other neural networks to produce highly efficient models that yield higher-quality results. For example, diffusion can be applied in the latent space or a discriminator can be added to refine the results [24,28].
Image generation can be based solely on learned data patterns or guided by user input through a text prompt (known as text-to-image or txt2img) or can use an initial image (known as image-to-image or img2img), guiding the generation towards a specific set of characteristics.
In txt2img models, embeddings generated by the encoder of a pre-trained transformer neural network (e.g., CLIP: Contrastive Language-Image Pretraining [29]) are used. This network operates in a compressed representation, learning to generate embeddings where image and text are closely aligned. Thus, when generating an image conditioned on text, the text is encoded as a numeric vector, serving as an input that guides the noise removal process according to its relation to the learned embeddings. In the img2img scenario, an image can be added alongside the prompt, which is encoded into latent space and injected with noise. In this case, the noise predictor utilizes both inputs to complete the image generation process [11,24].
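As an illustration of how this conditioning is typically exposed to users, the sketch below relies on the Hugging Face diffusers library (not the interface used in this study; the model identifier and prompts are examples): a text prompt drives txt2img generation, while a companion pipeline accepts an initial image and a strength parameter for img2img.

import torch
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

model_id = "runwayml/stable-diffusion-v1-5"   # example checkpoint identifier

# txt2img: the prompt is encoded by CLIP and guides the denoising of pure noise.
txt2img = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
face = txt2img(
    prompt="realistic portrait photo of a woman, 25 years old, tanned skin, thin lips",
    negative_prompt="drawing, anime, low quality",
    num_inference_steps=30,
    guidance_scale=7.5,        # CFG scale: how strongly the prompt steers generation
).images[0]

# img2img: the initial image is encoded to latent space, noised, and denoised under the prompt.
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
variant = img2img(
    prompt="same woman, serious expression",
    image=face,
    strength=0.6,              # denoising strength: 0 keeps the input, values near 1 ignore it
    guidance_scale=7.5,
).images[0]
variant.save("variant.png")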
A specific case of img2img is outpainting [10], which allows the extension of the original boundaries. This is achieved by pre-extrapolation of the pixels, maintaining image coherence through the high spatial correlation in latent space. For instance, if only an image of a person’s eyes is available, several potential face options can be generated that match the original eyes. Spatial conditioning can also be applied using networks such as ControlNet [30], which accept an initial control map, together with edge detectors such as Canny [31], which maximizes the signal-to-noise ratio to locate edges precisely. This approach can be used to convert a detailed or simple sketch into a realistic, colored image within seconds.
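A possible realization of this sketch-to-photo conditioning, again sketched with the diffusers library and an example ControlNet checkpoint trained on Canny edges (file names and identifiers are illustrative), extracts an edge map from the drawing and passes it as the control image:

import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# The Canny edge map of the sketch becomes the spatial control signal.
sketch = np.array(Image.open("sketch.png").convert("RGB"))
edges = cv2.Canny(sketch, 100, 200)                      # low/high thresholds are illustrative
control = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

result = pipe(
    "realistic portrait photo of a young man, brown eyes, dark brown hair",
    image=control,                                       # edges constrain pose and face boundaries
    num_inference_steps=30,
).images[0]
result.save("from_sketch.png")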
Additionally, if an image meets a witness’s description but requires modification to a specific feature or area, image-editing techniques such as inpainting [24] can be employed. Inpainting involves restoring damaged or missing content in an image by masking the latent codes of the grid to be modified, producing as many variations as needed. The diffusion model regenerates the masked area while maintaining semantic coherence with surrounding pixels. This technique can also be used to fix a particular region or feature while generating multiple variations for other facial elements.
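Inpainting can be sketched in the same way (model identifier and file names are placeholders): a binary mask marks the region to regenerate, for example the mouth, while the unmasked pixels anchor the result and preserve coherence.

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

base = Image.open("portrait.png").convert("RGB")    # previously generated face
mask = Image.open("mouth_mask.png").convert("RGB")  # white = area to regenerate, black = keep

# Only the masked region is re-synthesized; the surrounding pixels keep the image coherent.
edited = pipe(
    prompt="thin lips, slight smile, closed mouth",
    negative_prompt="teeth, open mouth",
    image=base,
    mask_image=mask,
    num_inference_steps=30,
).images[0]
edited.save("portrait_thin_lips.png")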

2. Materials and Methods

The following section details the tools and methodologies identified as the most effective for generating realistic facial images from textual descriptions. It is divided into two parts: the first introduces and analyzes the image generation tools employed, while the second provides a clear and detailed explanation of how these tools are applied to refine the output and achieve accurate representations based on the provided textual input.

2.1. Image Generation Tools

The neural networks discussed earlier form the foundation of tools such as DALL-E, Midjourney, Stable Diffusion, and the recently introduced FLUX. The quality of results largely depends on the underlying model and the datasets used for training.
DALL-E 3 [11], by OpenAI, is an image generation tool leveraging diffusion models. It progressively refines images by removing noise, guided by natural language prompts. Trained using GPT-based Transformer parameters, it effectively interprets natural language and supports intuitive interaction, including conversational queries. Despite its capabilities, DALL-E exhibits limitations in spatial awareness, realism, text rendering, and detail accuracy in the generated images.
Midjourney [12] is a powerful closed-source tool accessed via Discord. Its prompt structure, built around commands like “/imagine”, offers extensive control over generation, although it is less user-friendly than DALL-E. The prompt field accepts inputs such as a reference image URL; textual descriptions; and customizable weights and parameters like quality, influence, appearance, or excluded elements (negative prompts).
Midjourney excels in producing highly realistic images and supports features like image upscaling, inpainting, outpainting, and generating variations of existing outputs. These capabilities make it a versatile tool for high-quality image synthesis.
FLUX [32], developed by Black Forest Labs (Black Forest Labs is a research team behind foundational AI models such as Stable Diffusion, Latent Diffusion, and Adversarial Diffusion Distillation), builds upon a hybrid architecture of Multimodal Diffusion Transformers (MM-DiT) with 12 billion parameters, incorporating techniques like flow matching [33], rotary positional embeddings [34], and parallel attention layers [35], achieving good performance. It integrates CLIP to align textual descriptions with visual outputs in a shared vector space, ensuring precise image generation [36].
FLUX is available in three versions: FLUX.1 [pro], optimized for professional use; FLUX.1 [dev], balancing high-quality output with reduced resource demands; and FLUX.1 [schnell], prioritizing speed and local usability. Although its open-source nature makes it a versatile tool for researchers and developers, FLUX demands substantial computing resources: even its smallest version needs 24 GB of VRAM. While it can run on lower-resource hardware (e.g., 12 GB GPUs) with configurations like ComfyUI’s lowram mode, these setups can limit its performance [37].
Stable Diffusion [13,38], developed by Stability AI and released in August 2022, is a powerful open-source system based on latent diffusion models [28]. These models iteratively remove Gaussian noise in the latent space to accelerate image generation, optimizing time and memory efficiency. The architecture comprises a text encoder (processing prompts into vectors), a U-Net (generating the image), and a VAE (compressing the image in the latent space and reconstructing it to the original size with the decoder). Initially, Stable Diffusion was trained on LAION-5B [39], a dataset of 5.85 billion high-quality image–text pairs filtered via CLIP; the model—as a deep-learning system—has evolved through community feedback, resulting in a wide range of customized variants.
This tool provides users with extensive control over output generation by adjusting parameters. It supports text prompts, images, and advanced editing techniques like inpainting and outpainting, enabling modification or expansion of images. While the workflow may appear more complex compared to DALL-E, and the initial realism of FLUX or Midjourney outputs may be higher, we focus our analysis on Stable Diffusion, through its local installation with the WebUI, due to its flexibility to adjust parameters and modify styles by incorporating different generative models. Parameters such as Denoising Strength (which controls how much the output deviates from the input image, with values at or near 1 producing the greatest change) and CFG Scale (controlling prompt influence) allow precise tuning of the results.

2.2. Methodology

This article aims to demonstrate the application and adaptation of various tools and models for generating facial images conditioned on specific descriptions. Unlike DALL-E, tools like Stable Diffusion and Midjourney, which deliver more realistic results, require structured and concise prompts where keywords play a critical role. Keywords placed at the beginning of a prompt carry greater weight. Moreover, numerical weighting (e.g., keyword:1.3 or (((keyword)))) can amplify influence, while lower values (e.g., keyword:0.73 or [[[keyword]]]) reduce it. Adjustments such as prompt reformulation or increasing inference steps can enhance outcomes, though additional steps demand more computational resources.

2.2.1. Comparison and Selection of Image Generation Tools

As an initial approach in this study, the objective is to select the tool that best aligns with the study’s goals. To achieve this, extensive testing was conducted, with selected examples presented below. In Figure 7, the outputs generated by different tools based on the following description are shown:
Prompt: “Realistic portrait face of a 25-year-old woman with tanned skin, prominent brown eyes, a straight nose, thin lips, wavy brown hair, and a serious expression. The lighting in the image must be natural. It is crucial that the lips are thin and the eyes are large and brown.”
As observed, the generated images of Figure 7 exhibit distinct styles due to the specific training of each tool. DALL-E (Figure 7a) generates more cartoon-like style outputs, which does not align with the objectives of this study. Conversely, the other tools produce more realistic images. For Midjourney (Figure 7b) and Stable Diffusion (Figure 7c), the prompts were refined as follows to enhance performance in image generation. Finally, it is worth noting that FLUX tends to produce overly perfect faces with highly pronounced features, resulting in a less realistic appearance and making it challenging to guide the tool toward generating more conventional facial characteristics.
  • Midjourney—Figure 7b: “/imagine: [prompt: realistic portrait female 25 years old, tanned skin, straight nose, serious face, open brown eyes::2 long brown wavy hair::1.5 very thin lips::2 –no malformed, draw, anime, deformed, bad quality, make-up –style raw]”
  • Stable Diffusion—Figure 7c: realistic portrait photo of a woman 25 years old, tanned skin, straight nose, serious face, (very thin lips:1.4), natural light //BREAK// big brown eyes //BREAK// dark brown wavy hair //BREAK// very thin lips:1.7, highly detailed. And negative prompt: lowres, drawing, sketch, malformed, bad quality.
Figure 8 once again highlights the results where Midjourney and Stable Diffusion excel in producing highly realistic images, whereas FLUX and DALL-E 3 exhibit lower fidelity to realism. These outputs were generated using the prompts detailed below. Specifically, DALL-E 3 demonstrates a tendency to produce visual artifacts in the output images, along with inconsistencies in their key features. As illustrated in Figure 9, these inconsistencies often manifest as disproportionate or misplaced elements, which detract from the overall quality and coherence of the images. Based on the conducted tests, 57% of the images generated by DALL-E 3 exhibited defects or artifacts introduced by the tool, compared to 37% for Stable Diffusion, which shows significantly higher consistency and reliability in its outputs. Such limitations highlight the challenges of using DALL-E 3 for applications requiring precise and realistic visual representations.
  • DALL-E—Figure 8a: generate an image of a portrait photo face man 70 years old short white hair, small eyes, big nose, fat, smile
  • Midjourney—Figure 8b: “/imagine: [prompt: portrait photo face man 70 years old short white hair, small eyes, big nose, fat, smile –no malformed, draw, anime, deformed, bad quality –stylize 75 –style raw]”
  • Stable Diffusion—Figure 8c: portrait photo face man 70 years old short white hair, small eyes, big nose, fat, smile. And negative prompt: lowres, drawing, sketch, malformed, bad quality.
  • FLUX—Figure 8d: portrait photo face man 70 years old short white hair, small eyes, big nose, fat, smile
Finally, it is worth analyzing an example of sketch-guided image generation. Figure 10 compares the results obtained from the same sketch (Figure 10a) using DALL-E 3 (Figure 10b), Midjourney (Figure 10c), and Stable Diffusion (Figure 10d). As shown, DALL-E 3 produces images with low realism and minimal resemblance to the original sketch’s pose and expression. Midjourney achieves a higher degree of realism but follows the sketch boundaries more flexibly. In contrast, Stable Diffusion remains faithful to the sketch’s position and details, aided by its integration with ControlNet and the Canny edge detection processor. The prompts used to generate the respective images were:
  • DALL-E—Figure 10b: turns my drawing into a photo-realistic style.
  • Midjourney—Figure 10c: “/image [prompt: https://s.mj.run/38me-KNonTY (Accessed on 12 December 2024) hyperrealistic portrait picture of an 18 year old male following the composition and borders of the attached image, brown eyes, dark brown hair, photographic style–style raw].”
  • Stable Diffusion—Figure 10d: “(((brown eyes))) //BREAK// (((brown hair))) //BREAK// realistic portrait photo of a boy 20 years old, friendly expression.” And negative prompt: draw, painting, sketch, noisy, blurred //BREAK// lowres, low quality, render, anime, bad photography.
As observed, all these tools provide the capability to generate images rapidly, which is a key advantage in many applications. However, for our specific objective, DALL-E and FLUX deviate from the realistic facial image style required for the study. DALL-E often produces outputs with an animated aesthetic and recurrent errors, undermining its reliability. Although FLUX demonstrates a high level of prompt adherence, excelling in the accurate rendering of textual elements and anatomically challenging features such as fingers, these strengths are not directly relevant to the study’s goals. Instead, critical factors such as variability, control, and realism in the generated results hold greater importance.
In the case of Midjourney, while it matches Stable Diffusion in terms of realism and prompt structure, it provides less control over generation parameters and lacks the ability to manipulate local images, a capability that Stable Diffusion offers. Stable Diffusion also supports the use of various generative models, allowing for greater variability and customization of results. Furthermore, it excels in sketch-guided image generation, and when combined with fine-tuning capabilities and support for local image integration, it stands out as the optimal choice for highly customizable facial image synthesis.
Table 1 provides a comprehensive overview of the strengths and limitations of these models concerning forensic sketch generation.

2.2.2. Procedure Development Methodology

In our study, we used Stable Diffusion Web UI (https://github.com/AUTOMATIC1111/stable-diffusion-webui. Accessed on 20 September 2024) installed locally via GitHub. It includes dedicated fields for positive prompts and negative prompts (where users can specify undesired elements). To effectively generate facial images, it is essential to provide a well-crafted description of the desired features. Details such as age, gender, facial contour, skin tone, hair characteristics, eye shape and size, nose, mouth, and expression should be included.
However, overly detailed prompts do not always yield the best results. As text processing relies on token decomposition, adjustments may be required for optimal outcomes. In Stable Diffusion, we should follow the prompt structure: the initial keywords dictate the global composition, while subsequent terms refine specific details. Therefore, starting with essential elements like the subject, medium (e.g., “portrait, ultra-realistic illustration”), and style (e.g., “hyperrealistic”) is recommended. Refinements can be applied iteratively once the initial image aligns with the desired concept.
It is also important to consider the model’s token limit: prompts are processed in chunks of 75 tokens. It is crucial to understand that tokens and words are not the same. When the model encounters a term it has not previously seen, it breaks it down into smaller components it can process. For example, the word “dreambeach” would be split into two tokens, “dream” and “beach”, which are processed independently. When this limit is exceeded, a new chunk begins, with the total limit reaching 150 tokens. The resulting representations are then concatenated before being fed into the U-Net. To optimize this process, a new chunk can be started before reaching 75 tokens in order to group related keywords together into separate chunks, which is achieved with the BREAK command.
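The tokenization behavior described above can be inspected with the CLIP tokenizer used by Stable Diffusion; the sketch below assumes the Hugging Face transformers package, and the exact sub-word split depends on the tokenizer vocabulary.

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

prompt = "portrait photo of a woman, dreambeach background"
ids = tokenizer(prompt)["input_ids"]
tokens = tokenizer.convert_ids_to_tokens(ids)
print(len(ids), tokens)   # unseen words such as "dreambeach" are split into sub-word tokens

# Each chunk holds up to 75 prompt tokens (plus start/end markers); longer prompts are
# split into consecutive chunks whose embeddings are concatenated before the U-Net.
MAX_PROMPT_TOKENS = 75
chunks = [ids[i:i + MAX_PROMPT_TOKENS] for i in range(0, len(ids), MAX_PROMPT_TOKENS)]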
Figure 11 shows two images generated with different prompt structures. The first, Figure 11a, was generated with the following prompt, with the “hyper-realistic” style added:
Prompt: “portrait photo of a 24-year-old Caucasian woman with a round face, fair skin, wide forehead, long brown straight hair, small ears, small silver hoop earrings and a tiny silver pendant necklace with a circular design, realistic low saturation blue grey eyes, small curved thin eyebrows, small pointed nose, marked philtrum. She is looking at the camera, smiling slightly with thin closed lips with a friendly expression. Natural makeup that includes winged eyeliner. Neutral background and midday natural light indoors. She is wearing a simple black tank top.”
Negative prompt: “open mouth, skinny complexion, deformed, low resolution.”
Figure 11b was generated following the prompt structure explained above:
Prompt: “portrait picture 24-years-old fat:0.3 Caucasian boy, large oval face, broad build, rounded //BREAK// black wavy hair that is longer on top and shaved:1.8 on the sides //BREAK// eyes centered and small, brown eyes, slight bags //BREAK// wide nose, left nostril wider //BREAK// big closed mouth, smile, lips thin and fading slightly at the corners //BREAK// beige striped button-up shirt.”
Negative prompt: “bread:1.8, open mouth.”
The next crucial step is selecting the appropriate model. For instance, realistic human faces require a different model than animated ones. The model’s training significantly influences the output. For example, if the training dataset predominantly features beardless men seated, the model is more likely to generate seated individuals when prompted for “a beardless man”, even without specifying posture. Figure 12 demonstrates the model’s impact on the tool’s outputs. Using the same or similar descriptions, images produced with the same model exhibit consistent patterns that differ significantly from those generated by another model, highlighting the influence of model-specific training and characteristics on the results. The images in Figure 12 were generated with the following prompt:
A highly detailed portrait photo of a 24-year-old Caucasian woman with a round face, fat girl, wide forehead, very fair skin //BREAK// large, expressive brown eyes, small curved thin eyebrows, natural makeup with winged eyeliner //BREAK// loose long straight black hair parted in the middle //BREAK// small pointed nose with the right nostril smaller //BREAK// small silver hoop earrings and a delicate silver pendant necklace with a circular design. Black, thin-strapped top, and her friendly expression is directed towards the camera. The high-resolution image highlights the texture of her skin //BREAK// thin lips:1.3, slight smile.
There are several types of model files [38] that can be downloaded from platforms such as CivitAi (Available at: https://civitai.com/models, Accessed on 12 December 2024) or Huggingface (Available at: https://huggingface.co/models, Accessed on 12 December 2024), designed for use with this tool:
  • Checkpoint Models: Comprehensive Stable Diffusion models containing all components required for image generation without additional files. They are large, typically ranging from 2 to 7 GB.
  • Textual Inversions (Embeddings): Small files (10–100 KB) that define new keywords to generate specific objects or styles. They must be used alongside a Checkpoint model.
  • LoRA Models (Low-Rank Adaptation): Compact patch files (10–200 MB) designed to modify styles without full model retraining. Also dependent on a Checkpoint model.
  • Hypernetworks: Additional network modules (5–300 MB) that enhance flexibility and style adaptation of Checkpoint models.
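As an illustration of how these file types are combined in practice, the following sketch uses the diffusers library with hypothetical file names (in the Web UI the equivalent steps are performed through its menus): the full checkpoint is loaded first and the lighter add-ons are attached to it.

import torch
from diffusers import StableDiffusionPipeline

# Checkpoint model: a complete, self-contained Stable Diffusion model.
pipe = StableDiffusionPipeline.from_single_file(
    "models/realistic_checkpoint.safetensors", torch_dtype=torch.float16
).to("cuda")

# Textual inversion (embedding): adds a new trigger keyword to the text encoder.
pipe.load_textual_inversion("embeddings/style_embedding.pt", token="style_embedding")

# LoRA: a small patch that shifts the style without retraining the full model.
pipe.load_lora_weights("loras/portrait_style.safetensors")

image = pipe("realistic portrait photo of a man, 40 years old", num_inference_steps=30).images[0]
image.save("portrait.png")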
Achieving the desired output may require iterative refinement of prompts and parameters, but the process delivers a visual representation of the input description within seconds.
Once a facial image closely resembling the initial concept is generated, further refinements to specific facial features can be performed. This is achieved by transferring the generated image to the inpainting tab for targeted adjustments while keeping the rest of the image unchanged. This approach allows for the iterative fine-tuning of specific morphofacial traits, ensuring the final result meets the desired aesthetic or functional requirements with high fidelity.
Therefore, a realistic portrait picture can be obtained in a relatively simple and fast way that requires neither artistic skill nor expertise. The process begins with a well-crafted written description that adheres to the prompt structure required by the tool to maximize performance. Moreover, selecting a model that best aligns with the intended outcome and adjusting the prompt and parameters as needed is essential. Once a satisfactory image is generated, it is transferred to the inpainting tab for specific morphofacial refinements to achieve the final result.
Table 2 further complements this by offering a visual comparison of the influence of the models and certain parameters employed, and a flowchart illustrating this procedure is provided in Figure 13.

3. Results and Discussion

Our approach aims to demonstrate how following the outlined procedure can yield high-quality, realistic images that can help in criminal investigation.
Once the importance of the description and models has been established, it is worth noting that the results were primarily obtained using the Juggernaut XL checkpoint model, as it generates outputs closely aligning with our vision and offers greater variability.
Figure 14 shows different outputs generated using the Juggernaut XL model, refining the initial description provided and varying the sampling method. Specifically, Figure 14a,c utilize the “Euler a” sampling method, while Figure 14b,d employ the “DPM++ 2M SDE”. The prompts used for the images in Figure 14 are presented below:
  • Figure 14a: A close-up portrait of a 24-year-old Caucasian woman with round face, very fair skin, long straight black hair, and large brown eyes. The woman is wearing silver hoop earrings and a tiny silver pendant necklace with a circular design and a thin-strapped top, her expression is friendly as she looks directly at the camera. Neutral background, natural midday light, realistic photographic style, high resolution //BREAK// round face, elongated face, thick complexion:1.8, fair skin, wide forehead //BREAK// thin lips:1.3, slight smile //BREAK// Very big eyes
  • Figure 14b: same prompt as image (a) but adding “fat”.
  • Figure 14c: the hyperrealism style was added; Prompt “A highly detailed, close-up portrait of a 24-year-old Caucasian woman with a round face, wide forehead, very fair skin, and large, expressive brown eyes. Small pointed nose with the right nostril smaller, loose long straight black hair parted in the middle, and is wearing small silver hoop earrings and a delicate silver pendant necklace with a circular design. Natural makeup that includes winged eyeliner. Thin-strapped top, friendly expression, slightly blurred neutral background with soft natural midday light. Realistic photographic style, high-resolution, lifelike representation //BREAK// thick complexion:1.8, fat:1.8, small curved thin eyebrows //BREAK// thin lips:1.3, slight smile ”; Negative prompt “hair up”.
  • Figure 14d: A highly detailed portrait photo of a 24-year-old Caucasian woman with a round face, fat girl, wide forehead, very fair skin //BREAK// large, expressive brown eyes, small curved thin eyebrows, natural makeup with winged eyeliner //BREAK// loose long straight black hair parted in the middle //BREAK// small pointed nose with the right nostril smaller //BREAK// small silver hoop earrings and a delicate silver pendant necklace with a circular design. Black, thin-strapped top, and her friendly expression is directed towards the camera. The high-resolution image highlights the texture of her skin //BREAK// thin lips:1.3, slight smile.
After selecting an output image, in this case Figure 14d, facial features can be modified efficiently using inpainting, without altering the rest of the image. We start by creating a mask over the specific area, as shown in Figure 15b. Adding the input prompt specified in that figure, we obtain the result shown in Figure 15c.
This approach allows for precise adjustments to align the output image with the intended concept. Such tools can produce realistic facial renderings resembling real individuals, offering potential applications in assisting forensic experts in cases of missing persons or suspect identification. The images in Figure 16 show the rapid generation of highly realistic artificial portraits, significantly streamlining tasks such as creating composite sketches for law enforcement. In this case, Figure 15c was used as a starting point, and prompts such as the following were used to refine specific features, including skin, eyes, mouth, nose, and expression, to achieve the desired result.
  • Prompt example: “small nosed thin:1.7 lips//smiley expression//thin lips//occluded lip at the end//Negative prompt: thick lips, teeth”
  • Prompt example: “black thin eyeliner//pointed eyebrows//huge realistic eyes//greenish:0.0001 brown:1.7 eyes//huge upper eyelid”
  • Prompt example: “few freckles:0.00001//white complexion//skin detail//fat girl fair complexion//smooth”
Additional examples are shown in Figure 17, where the prompt presented below was used to generate Figure 17a, which is compared with the real photograph in Figure 17b. Additionally, Figure 17c,d are presented: the former generated via inpainting based on Figure 11a, and the latter being a real photograph used for final comparison.
Prompt: “realistic portrait photo of a 24-year-old Caucasian girl, rounded face, fair skin, broad forehead //BREAK// straight black hair, long hair, ears close to the head, silver hoop earrings //BREAK// small thin curved eyebrows, very large eyes, light brown eyes, pointed nose //BREAK// small mouth and thin lips, closed mouth, wearing black tank top”
In order to conduct a quantitative evaluation, the DeepFace framework was utilized [40,41]. DeepFace is a hybrid face recognition package that integrates several state-of-the-art face recognition models. These models leverage Convolutional Neural Networks (CNNs) to represent facial features as vectors. This approach ensures that the vector representations of two images of the same individual are closer in the vector space compared to those of two images from different individuals. Due to the Facenet512 [42] model demonstrating superior performance [41], it was selected for this study.
Face verification in this framework is performed by calculating similarity using various distance metrics, such as Cosine Similarity, Euclidean Distance, or L2-normalized Euclidean Distance. According to experiments, no single distance metric consistently outperforms the others, though the default configuration employs Cosine Similarity due to its robustness and computational efficiency.
A modern face recognition pipeline typically involves five stages: detection, alignment, normalization, representation, and verification. Experiments [41] show that detection increases the face recognition accuracy by up to 42%, while alignment increases it by up to 6%. In this study, RetinaFace [43] was employed as the primary face detector to achieve both detection and alignment. While detectors like OpenCV and SSD (included in the DeepFace framework) offer faster processing speeds, RetinaFace and MTCNN deliver superior precision. In our study, accuracy is more important than processing speed, and RetinaFace excels in facial landmark detection.
Verification requires defining a threshold for similarity. In the range of 0 to 1, pairs of images with a distance close to 0 indicate high similarity, suggesting they likely represent the same individual. Conversely, distances closer to 1 reflect greater dissimilarity. Within the DeepFace framework, the exclusion threshold is configured at 0.3 for the Facenet512 model. Pairs of images with distances greater than 0.3 are classified as “false”, indicating they likely represent different individuals. However, for our analysis, this distance metric is interpreted as a measure of resemblance rather than strict discrimination. A distance near 0.3 between an original image and one generated by Stable Diffusion suggests a high degree of similarity, which aligns with the objectives of this study.
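A minimal example of this verification step with the DeepFace package (file names are placeholders) is shown below; the returned dictionary contains the computed distance, the model threshold, and the resulting decision.

from deepface import DeepFace

# Compare an original photograph with an AI-generated portrait.
result = DeepFace.verify(
    img1_path="original_photo.jpg",
    img2_path="generated_portrait.png",
    model_name="Facenet512",          # embedding model used in this study
    detector_backend="retinaface",    # handles detection and alignment
    distance_metric="cosine",
)

print(result["distance"])   # near 0: very similar; above the threshold: labelled "false"
print(result["threshold"])  # 0.3 for Facenet512 with cosine distance
print(result["verified"])   # interpreted here as a resemblance indicator, not a strict match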
The application of these facial recognition networks yields the following results:
  • The calculated distance between Figure 16a,b is 0.38, which, despite exceeding the threshold, indicates a high degree of similarity.
  • The calculated distance between Figure 17a,b is 0.62, which denotes a less accurate outcome with a lower degree of similarity.
  • The calculated distance between Figure 17c,d is 0.39, indicating a high degree of similarity.
The variability of the most prominent morphofacial features, such as the eyes or mouth, makes them the most challenging aspects to generate accurately. Artifacts or outputs unrelated to the intended prompt can often occur. Thus, iterative prompt reformulation and repeated attempts are crucial to achieving coherent results. Figure 18 illustrates this process using a 55-year-old male subject. Figure 18a represents the initial generation based on the description of the subject shown in Figure 18d, which is used for comparison. The initial description corresponds to the following prompt:
’Caucasian man 50 years old, oval elongated face, narrow bone structure //BREAK// few shaved dark gray hair, pronounced receding messy hairline //BREAK// curved eyebrows, sparse eyebrows, huge and pronounced nose, raised nostrils //BREAK// dark brown eyes, almond eyes //BREAK// very thin lips, closed smile, big mouth, very marked nasofacial lines //BREAK// Blue tank top with white borders’; and negative prompt: ’anime, draw, malformed, disformed, teeth, bread’.
Inpainting was then used to refine the image, producing Figure 18b, whose calculated distance to image (d) is 0.65. The refinement process utilized prompts such as:
  • “small:1.6 ear backward rounded, [[[[upper part of the ear back]]]] //BREAK// 55 man skin detail, straight huge nose //BREAK// very wide nose, huge nose, straight prominent nose”, “55 years old man, skin detail, elongated face //BREAK// soft features //BREAK// three vertical age line; with the negative prompt: “bread:1.9, malformed, bread, mustache, hair, wrinkles”
  • “serious expression, friendly expression, 50 years man //BREAK// very thin lips, closed smile, big mouth //BREAK// very thin lower lip, occluded lips; with the negative prompt: “teeth:1.7, thick lips, open mouth, teeth”
Incoherent outcomes are evident in Figure 18c, where additional prompts such as “huge opened eyes, dark brown eyes, old man, centered eyes //BREAK// dark brown eyes, almond eyes” are used, along with a negative prompt including “malformed, lowres, eyelashes, upper eyelid”.
To enhance the clarity and comprehensiveness of the results, facial recognition networks from the DeepFace framework have been employed to systematically compare three distinct types of facial representations. Specifically, original facial images sourced from the Color FERET Database [44] were analyzed. This comprehensive facial image database was compiled as part of the Facial Recognition Technology (FERET) program, which was established to advance techniques, technologies, and algorithms for automatic human face recognition (Available at: https://www.nist.gov/itl/products-and-services/color-feret-database, Accessed on 12 December 2024). These faces were analyzed alongside corresponding hand-drawn sketches created by professional forensic artists, obtained from the CUHK Face Sketch Database (CUFS) [45]. This database is intended for research on face sketch synthesis and face sketch recognition; it contains 606 faces in total, collected from other databases, and for each face there is a sketch drawn by an artist (Available at: https://www.kaggle.com/datasets/arbazkhan971/cuhk-face-sketch-database-cufs?resource=download, Accessed on 12 December 2024). Additionally, these were compared with AI-generated facial images synthesized using Stable Diffusion. The comparative results, quantified using the Cosine Similarity metric, are presented in Table 3.
As evidenced by the findings, these AI-based tools provide investigators with a more realistic, full-color representation of a subject’s appearance without requiring advanced artistic skills or manual drawing expertise. The objective is not to achieve a perfect reconstruction. Witness testimony is inherently limited to verbal descriptions of an individual’s physical attributes. Instead, this approach serves as a valuable aid in streamlining and improving the efficiency of suspect identification, thereby facilitating the investigative process.
It is important to note that younger and more conventionally attractive faces are easier to generate than those of older, less aesthetically typical individuals or those with higher body mass. To address this limitation, the tool can be used to generate images of aged faces, as illustrated in Figure 19, simply by adding an instruction such as “older man” to turn Figure 19a into Figure 19b, or “older woman, 70 years old” to turn Figure 19c into Figure 19d. It is important for the Denoising Strength to be around 0.62 in order to ensure consistent image variation and achieve optimal results.
Another approach to mitigating this limitation would involve training models on diverse datasets with a wide range of average individuals. This strategy could yield results more representative of common facial features observed in everyday life. The underlying bias originates from the training datasets, which are often dominated by images of celebrities or conventionally attractive subjects, prioritizing visually appealing outcomes over realistic diversity.

4. Conclusions and Future Lines

Police composite sketches are critical tools in criminal investigations, aiding in the facial identification of missing persons and narrowing suspect pools. Their creation relies on two key components: interviews with witnesses or victims, and the generation of the sketch. While human input is vital, this study explored the application of advanced AI algorithms to assist specialists in crafting these images.
Our conclusions after using different tools in the context of facial sketch generation are as follows:
  • DALL-E 3 offers intuitive natural language processing for detailed prompts, enabling users to describe desired images interactively. Its intuitive interface simplifies usage; however, it is limited in integrating external visual references and often produces artifacts or inaccurate details.
  • Midjourney provides high realism through concise, structured prompts but has a less user-friendly interface compared to DALL-E.
  • FLUX demonstrates the capability to generate high-resolution images with a strong interpretation of prompts, delivering detailed outputs. However, FLUX exhibits a notable limitation in the variability of facial features. Furthermore, it tends to favor an idealized aesthetic, prioritizing symmetry and perfection over realism. As a result, the generated images, while visually appealing, lack the nuanced imperfections and diversity of features that are characteristic of realistic human faces. This limitation poses challenges for applications requiring natural visual representations.
  • In terms of close-up portraits, Stable Diffusion stands out for its ability to generate greater variability, refined skin textures, and a higher degree of realism. The model’s flexibility in adjusting details allows for more lifelike and diverse facial representations, making it particularly effective for tasks requiring high accuracy in facial feature generation. Stable Diffusion is open-source and highly customizable. Despite a steeper learning curve than the other tools, as it requires more technical familiarity for optimal use, it shines in sketch-guided generation and allows users to select models tailored to detail and realism.
Selecting the appropriate model and crafting an initial detailed prompt aligned with the desired concept is essential. The prompt structure is crucial. If the generated image aligns with the intended visual concept, it can be refined in the inpainting tab to modify specific morphofacial features without altering the rest of the image. It is important to note that a coherent and representative result may not be achieved on the first attempt; initial outputs may include artifacts or deviations from the specified description, requiring iterative adjustments and prompt refinement.
In summary, these AI tools represent significant advancements in automated composite sketch generation, offering versatile solutions tailored to specific investigative needs. They significantly enhance the efficiency of generating police facial sketches by substantially reducing production time. Unlike traditional approaches, which require extensive manual effort and advanced artistic expertise, these sophisticated systems can generate highly detailed and realistic images within minutes. Moreover, the capability to produce facial images in full color offers a distinct advantage in criminal investigations, as it enhances the investigator’s ability to discern subtle details such as skin tone, eye color, and other unique facial characteristics critical for identification. Additionally, these tools remove the necessity for investigators to possess professional-level artistic skills or drafting proficiency. This democratization of image generation enables the creation of remarkably accurate and lifelike representations with minimal effort and technical expertise. By integrating speed, accessibility, and precision, AI tools represent a paradigm shift, optimizing the investigative process and delivering superior results while conserving both time and resources. While they expedite the creation of realistic facial representations, expert oversight remains critical to ensure the validity and reliability of the generated results.
Although this study has shown the potential of AI tools in generating composite sketches for investigative purposes, several areas remain open for further exploration. Future research could focus on the integration of more diverse and representative datasets for training these models, particularly datasets that include a wider range of facial features from various demographic groups. This would address the bias seen in current AI-generated images, which often favor idealized or stereotypical faces, ensuring more realistic and varied facial representations.
In addition, improving the relationship between textual descriptions and image generation presents an important area for future research. The effectiveness of composite sketch generation is heavily reliant on the quality and specificity of the prompts provided. Investigating advanced techniques in prompt engineering could enhance the accuracy of image generation by establishing clearer and more precise links between textual descriptions and visual outputs. Furthermore, refining training processes to include more detailed and context-specific descriptions could significantly enhance the AI’s ability to produce images that align with investigators’ needs.

Author Contributions

Conceptualization, N.S.-C. and H.G.-M.; methodology, N.S.-C. and H.G.-M.; software, H.G.-M.; investigation, N.S.-C.; resources, H.G.-M.; writing—original draft preparation, N.S.-C.; writing—review and editing, N.S.-C. and H.G.-M.; supervision, H.G.-M.; project administration, H.G.-M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Instituto Universitario de Investigación en Ciencias Policiales (IUICP) de la Universidad de Alcalá (Madrid, Spain) through grant number IUICP-2023/02.

Data Availability Statement

The original contributions presented in this study are included in the article material. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors have used different AI image generation tools as described in the text. Details of which tools and with which parameters they have been used can be found in the different examples shown.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Jain, A.K.; Bolle, R.; Pankanti, S. Biometrics: Personal Identification in Networked Society; Springer: Berlin/Heidelberg, Germany, 2006; Volume 479. [Google Scholar]
  2. Facial Identification Scientific Working Group. Facial Image Comparison Feature List for Morphological Analysis, Version 2.0. 2018. Available online: https://fiswg.org/FISWG_Morph_Analysis_Feature_List_v2.0_20180911.pdf (accessed on 20 May 2024).
  3. Richmond, S.; Howe, L.J.; Lewis, S.; Stergiakouli, E.; Zhurov, A. Facial genetics: A brief overview. Front. Genet. 2018, 9, 462. [Google Scholar] [CrossRef] [PubMed]
  4. Ouyang, S.; Hospedales, T.M.; Song, Y.Z.; Li, X. ForgetMeNot: Memory-Aware Forensic Facial Sketch Matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  5. Frowd, C.D.; Carson, D.; Ness, H.; McQuiston-Surrett, D.; Richardson, J.; Baldwin, H.; Hancock, P. Contemporary composite techniques: The impact of a forensically-relevant target delay. Leg. Criminol. Psychol. 2005, 10, 63–81. [Google Scholar] [CrossRef]
  6. Faces Software. Facial Composite Software. Available online: https://facialcomposites.com/ (accessed on 10 June 2024).
  7. FACETTE Face Design System—Phantombild-Programm—Facial Composites. Available online: http://www.facette.com/index.php?id=1&L=1 (accessed on 10 June 2024).
  8. Singular Inversions. 2024. FaceGen 3D. Available online: https://facegen.com/3dprint.htm (accessed on 10 June 2024).
  9. Lugmayr, A.; Danelljan, M.; Romero, A.; Yu, F.; Timofte, R.; Van Gool, L. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11461–11471. [Google Scholar]
  10. Cheng, Y.C.; Lin, C.H.; Lee, H.Y.; Ren, J.; Tulyakov, S.; Yang, M.H. Inout: Diverse image outpainting via GAN inversion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11431–11440. [Google Scholar]
  11. Betker, J.; Goh, G.; Jing, L.; Brooks, T.; Wang, J.; Li, L.; Ouyang, L.; Zhuang, J.; Lee, J.; Guo, Y.; et al. Improving image generation with better captions. Comput. Sci. 2023, 2, 8. [Google Scholar]
  12. Midjourney. Midjourney. Available online: http://www.midjourney.com (accessed on 20 November 2024).
  13. Stable Diffusion AI. Stable Diffusion. Available online: https://stablediffusionweb.com/es (accessed on 25 October 2024).
  14. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  15. Chen, Y.; Liu, J.; Peng, L.; Wu, Y.; Xu, Y.; Zhang, Z. Auto-Encoding Variational Bayes. Camb. Explor. Arts Sci. 2024, 2. [Google Scholar] [CrossRef]
  16. Rocca, J.; Rocca, B. Understanding Variational Autoencoders (VAEs). Towards Data Sci. 2019, 23. Available online: https://medium.com/towards-data-science/understanding-variational-autoencoders-vaes-f70510919f73 (accessed on 15 April 2024).
  17. Van Den Oord, A.; Vinyals, O. Neural discrete representation learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6309–6318. [Google Scholar]
  18. Esser, P.; Rombach, R.; Ommer, B. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12873–12883. [Google Scholar]
  19. Khan, S.H.; Hayat, M.; Barnes, N. Adversarial training of variational auto-encoders for high fidelity image generation. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; IEEE: New York, NY, USA, 2018; pp. 1312–1320. [Google Scholar]
  20. Williams, K. Transformers in Generative AI. 2024. Available online: https://www.pluralsight.com/resources/blog/ai-and-data/what-are-transformers-generative-ai (accessed on 6 November 2024).
  21. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30. [Google Scholar]
  22. Hudson, D.A.; Zitnick, L. Generative adversarial transformers. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 4487–4499. [Google Scholar]
  23. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  24. Yang, L.; Zhang, Z.; Song, Y.; Hong, S.; Xu, R.; Zhao, Y.; Zhang, W.; Cui, B.; Yang, M.H. Diffusion Models: A Comprehensive Survey of Methods and Applications. ACM Comput. Surv. 2023, 56, 1–39. [Google Scholar] [CrossRef]
  25. Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 2256–2265. [Google Scholar]
  26. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the NIPS ’20: 34th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 6–12 December 2020. [Google Scholar]
  27. Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8162–8171. [Google Scholar]
  28. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10674–10685. [Google Scholar] [CrossRef]
  29. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  30. Zhang, L.; Rao, A.; Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 3836–3847. [Google Scholar]
  31. Canny, J. A Computational Approach to Edge Detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, 6, 679–698. [Google Scholar] [CrossRef]
  32. Black Forest Labs. FLUX. Available online: https://blackforestlabs.ai/ (accessed on 25 November 2024).
  33. Lipman, Y.; Chen, R.T.Q.; Ben-Hamu, H.; Nickel, M.; Le, M. Flow Matching for Generative Modeling. arXiv 2023, arXiv:2210.02747. [Google Scholar]
  34. Su, J.; Ahmed, M.; Lu, Y.; Pan, S.; Bo, W.; Liu, Y. RoFormer: Enhanced transformer with Rotary Position Embedding. Neurocomputing 2024, 568, 127063. [Google Scholar] [CrossRef]
  35. Dehghani, M.; Djolonga, J.; Mustafa, B.; Padlewski, P.; Heek, J.; Gilmer, J.; Steiner, A.P.; Caron, M.; Geirhos, R.; Alabdulmohsin, I.; et al. Scaling vision transformers to 22 billion parameters. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 7480–7512. [Google Scholar]
  36. Research Graph via Medium. The Ultimate FLUX.1 Hands-On Guide. 2024. Available online: https://medium.com/@researchgraph/the-ultimate-flux-1-hands-on-guide-067fc053fedd (accessed on 20 November 2024).
  37. Emanuele via Medium. Flux: An Advanced (and Open Source) Text-to-Image Model Comparable to Midjourney. 2024. Available online: https://medium.com/diffusion-images/flux-an-advanced-and-open-source-text-to-image-model-comparable-to-midjourney-1b01cf5a7148 (accessed on 20 November 2024).
  38. Andrew via Sagio Development LLC. How to Use Stable Diffusion. 2024. Available online: https://stable-diffusion-art.com/beginners-guide/ (accessed on 21 November 2024).
  39. Beaumont, R. LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets. 2022. Available online: https://laion.ai/blog/laion-5b/ (accessed on 21 November 2024).
  40. Serengil, S.I.; Ozpinar, A. LightFace: A Hybrid Deep Face Recognition Framework. In Proceedings of the 2020 Innovations in Intelligent Systems and Applications Conference (ASYU), Istanbul, Turkey, 15–17 October 2020; IEEE: New York, NY, USA, 2020; pp. 23–27. [Google Scholar] [CrossRef]
  41. Serengil, S.; Ozpinar, A. A Benchmark of Facial Recognition Pipelines and Co-Usability Performances of Modules. J. Inf. Technol. 2024, 17, 95–107. [Google Scholar] [CrossRef]
  42. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar] [CrossRef]
  43. Deng, J.; Guo, J.; Ververas, E.; Kotsia, I.; Zafeiriou, S. RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5202–5211. [Google Scholar] [CrossRef]
  44. Phillips, P.; Wechsler, H.; Huang, J.; Rauss, P.J. The FERET database and evaluation procedure for face-recognition algorithms. Image Vis. Comput. 1998, 16, 295–306. [Google Scholar] [CrossRef]
  45. Zhang, W.; Wang, X.; Tang, X. Coupled information-theoretic encoding for face photo-sketch recognition. In Proceedings of the Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, USA, 20–25 June 2011; IEEE: New York, NY, USA, 2011; pp. 513–520. [Google Scholar]
Figure 1. Composite sketch created by the FBI of the unknown hijacker (“Dan Cooper”) of Northwest Orient Flight 305. FBI Sketch Artist Roy Rose, Public domain, via Wikimedia Commons. Source: https://commons.wikimedia.org/wiki/File:CompositeB-FBI-1973.jpg (Accessed on 12 December 2024).
Figure 2. A self-portrait of user “kevin586” produced by FACES 3.0 Composite Software. Kencaesi (talk) (Uploads), Public domain, via Wikimedia Commons. Source: https://commons.wikimedia.org/wiki/File:Caesius_facial_composite.jpg (Accessed 12 December 2024).
Figure 3. General architecture of GANs. Zhang, Aston and Lipton, Zachary C. and Li, Mu and Smola, Alexander J., CC BY-SA 4.0. License: https://creativecommons.org/licenses/by-sa/4.0 (Accessed on 12 December 2024). Source: https://commons.wikimedia.org/w/index.php?curid=152265649 (Accessed on 12 December 2024).
Figure 4. Transformer architecture. dvgodoy, CC BY 4.0. License: https://creativecommons.org/licenses/by/4.0 (Accessed on 12 December 2024). Source: https://commons.wikimedia.org/wiki/File:Transformer,_full_architecture.png (Accessed on 12 December 2024).
Figure 5. Stable Diffusion architecture. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer, CC BY 4.0. License: https://creativecommons.org/licenses/by/4.0 (Accessed on 12 December 2024). Source: https://en.m.wikipedia.org/wiki/File:Diffusion_Architecture.png (Accessed on 12 December 2024).
Figure 6. Plot of algorithmically-generated AI art of European-style castle in Japan demonstrating Denoising Diffusion Implicit Model (DDIM) diffusion steps. Benlisquare, CC BY-SA 4.0. License: https://creativecommons.org/licenses/by-sa/4.0 (Accessed on 12 December 2024). Source: https://commons.wikimedia.org/wiki/File:X-Y_plot_of_algorithmically-generated_AI_art_of_European-style_castle_in_Japan_demonstrating_DDIM_diffusion_steps.png (Accessed on 12 December 2024).
Figure 7. Images generated using DALL-E 3 for image (a), Midjourney for image (b), Stable Diffusion with realisticVisionV60B1v51VAE model for image (c), and FLUX for image (d).
Figure 8. Images generated using DALL-E 3 for image (a), Midjourney for image (b), Stable Diffusion using Juggernaut XL checkpoint model for image (c), and FLUX for image (d).
Figure 9. Images generated by DALL-E demonstrate its tendency to produce artifacts (a) and inconsistencies (b) in the output when refining specific user-requested details.
Figure 10. Sketch-guided image generation. Image (a) represents the initial sketch used to generate image (b) with DALL-E 3, image (c) with Midjourney, and image (d) with Stable Diffusion using the realisticVisionV60B1v51VAE model.
Figure 11. Stable Diffusion images generated with the Juggernaut XL model. (a) 24-year-old woman, (b) 24-year-old man.
Figure 12. Images generated using Stable Diffusion. Images (a,b) were generated using the Juggernaut XL model, while fullyREALXL was used to generate (c,d). In images (a,c), the "DPM++ 2M SDE" sampling method was used, whereas in (b,d), the "Euler a" method was applied.
Figure 13. Flowchart of the procedure followed for conditional image generation.
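
Figure 13 outlines the conditional generation procedure. As a minimal sketch only, one way to implement such conditioning is a ControlNet guided by Canny edges, following [30,31]; the model identifiers, file names, and parameters below are illustrative assumptions rather than the exact configuration used to produce the figures.

```python
# Hedged sketch of conditional (sketch-guided) generation with a Canny ControlNet.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Turn the witness sketch into an edge map that conditions the generation.
sketch = cv2.imread("initial_sketch.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(sketch, 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

result = pipe(
    prompt="realistic frontal photograph of a young woman, neutral expression, studio lighting",
    negative_prompt="deformed, blurry, cartoon",
    image=control_image,          # edge map constraining the facial layout
    guidance_scale=7.0,
    num_inference_steps=30,
).images[0]
result.save("conditioned_result.png")
```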
Figure 14. Images generated using the Juggernaut XL model with Stable Diffusion.
Figure 15. Images generated using Stable Diffusion: (a) represents the previously generated input image, (b) is the mask used to generate the output (c) using the positive prompt ‘realistic long straight brown hair //BREAK// dark brown hair, tiny silver hoop earrings’ and the negative prompt ‘malformed, deformed, white hair’.
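
The inpainting step illustrated in Figure 15 can be outlined with an inpainting pipeline. The following sketch assumes the diffusers StableDiffusionInpaintPipeline and illustrative file names, checkpoint, and parameters; it is not the exact web-interface workflow used in the article.

```python
# Hedged sketch: inpainting refinement of a previously generated face.
# Only the white region of the mask (hair and ears in Figure 15b) is regenerated.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("generated_face.png").convert("RGB")   # cf. Figure 15a
mask_image = Image.open("hair_mask.png").convert("RGB")        # cf. Figure 15b

result = pipe(
    prompt="realistic long straight dark brown hair, tiny silver hoop earrings",
    negative_prompt="malformed, deformed, white hair",
    image=init_image,
    mask_image=mask_image,
    strength=0.75,             # denoising strength applied inside the mask
    num_inference_steps=30,
).images[0]
result.save("refined_face.png")                                # cf. Figure 15c
```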
Figure 16. Image (a) was generated using Stable Diffusion with the Juggernaut XL model. Image (b) represents the original input (real picture) used to guide our initial textual description and to compare with the generated outcome in (a).
Figure 17. Images generated using Stable Diffusion: (a) represents the output image based on the textual description of the real image (b). Image (c) was generated via inpainting of Figure 11a and is compared with the real photograph in image (d).
Figure 18. Images generated using Stable Diffusion. Image (a) represents the output image based on the textual description of image (d), which is used as the actual final comparison image. Images (b,c) were generated via inpainting of image (a).
Figure 19. Images generated using Stable Diffusion. Images (a,c) represent the initial inputs, while (b,d) depict the results of facial aging.
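
The facial aging shown in Figure 19 can be approached, for example, as an image-to-image pass with an age-oriented prompt and a low denoising strength to preserve identity. The sketch below is only an assumption of such a workflow; the checkpoint, prompts, file names, and parameters are illustrative and may differ from those used for the figure.

```python
# Hedged sketch: facial aging via image-to-image generation over an existing portrait.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("young_face.png").convert("RGB")   # e.g., an input like Figure 19a

aged = pipe(
    prompt="the same person at 60 years old, wrinkles, grey hair, realistic photograph",
    negative_prompt="deformed, cartoon, different person",
    image=init_image,
    strength=0.45,            # low denoising strength to keep the identity recognizable
    guidance_scale=7.0,
).images[0]
aged.save("aged_face.png")
```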
Table 1. Comparative analysis of strengths and weaknesses of image generation models.
DALL-E 3
  Strengths:
    • Intuitive interaction via natural language prompts through a user-friendly interface, leveraging GPT-based transformer architecture.
  Weaknesses:
    • Limited realism and control over results.
    • Prone to visual artifacts and inaccuracies in fine details.
Midjourney
  Strengths:
    • Capable of producing photorealistic outputs with high fidelity.
    • Advanced control through structured prompts with enhanced flexibility for customizing style, quality, and exclusions.
  Weaknesses:
    • Less granular control over underlying generation parameters compared to open-source alternatives.
FLUX
  Strengths:
    • Faithfully adheres to prompts while delivering highly precise and reliable text rendering within images.
    • Open-source, offering modularity and adaptability for research and development.
  Weaknesses:
    • High computational requirements.
    • Overly idealized facial representations often compromise realism.
Stable Diffusion
  Strengths:
    • Open-source model with extensive customizability, supporting precise parameter adjustments (e.g., CFG Scale, Denoising Strength) and advanced techniques such as sketch-guided synthesis and inpainting.
    • Broad community-driven ecosystem fostering continuous improvements and model diversification.
  Weaknesses:
    • Steeper learning curve for novice users due to its intricate workflow.
    • Initial outputs may lack the immediate photorealism seen in FLUX or Midjourney without prompt optimization.
Table 2. Comparison of Stable Diffusion techniques for image generation and refinement.
Juggernaut XL
  Prompt structure: Structured prompts incorporating "BREAK" and detailed facial descriptions.
  Results: Realistic images with clear textures and strong coherence.
  Strengths: High fidelity in facial details.
  Weaknesses: Overly sensitive to excessively detailed prompts.
fullyREALXL
  Prompt structure: Similar structured prompts with reduced emphasis on fine details.
  Results: Smooth images with less intricate textures compared to Juggernaut XL.
  Strengths: Balanced detail and generation time.
  Weaknesses: Reduced fidelity in intricate details.
Textual Inversions
  Prompt structure: Embeddings added for specific stylistic elements (e.g., "hyperrealistic style").
  Results: Enhanced stylistic accuracy while maintaining overall realism.
  Strengths: Improved stylistic personalization.
  Weaknesses: Requires a compatible checkpoint model.
LoRA Models
  Prompt structure: Style and object adjustments integrated into prompts using LoRA files.
  Results: Fine-tuned results with enhanced stylistic or artistic control.
  Strengths: High flexibility for customization.
  Weaknesses: Depends on a base checkpoint model for operation.
Negative Prompting
  Prompt structure: Exclusion of undesirable elements (e.g., "low resolution", "distorted").
  Results: Reduced artifacts and improved result coherence.
  Strengths: Effectively filters undesirable elements.
  Weaknesses: Overuse may remove excessive details.
Sampling Methods
  Prompt structure: DPM++ 2M SDE produces smoother outputs; Euler a generates more dynamic and textured results.
  Results: Variations in sharpness and texture based on the sampling method.
  Strengths: Versatile methods tailored for specific needs.
  Weaknesses: Results vary depending on the applied method.
Inpainting Refinement
  Prompt structure: Iterative editing focusing on morphological adjustments while preserving the rest of the image.
  Results: Precise adjustments in specific features such as eyes, nose, and mouth.
  Strengths: High precision and control over outcomes.
  Weaknesses: Slower process compared to initial generation.
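
Several rows of Table 2, in particular negative prompting and the sampling methods, map directly onto pipeline settings. As a hedged illustration, the diffusers schedulers below approximate the "DPM++ 2M SDE" and "Euler a" samplers named in the table; the scheduler equivalences, checkpoint, and parameters are assumptions rather than a record of the exact configuration used.

```python
# Hedged sketch: switching sampling methods (Table 2) on the same pipeline.
import torch
from diffusers import (
    StableDiffusionXLPipeline,
    DPMSolverMultistepScheduler,
    EulerAncestralDiscreteScheduler,
)

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "realistic frontal portrait of a 24-year-old woman, natural lighting"
negative_prompt = "low resolution, distorted, cartoon"   # negative prompting (Table 2)

# Approximation of "DPM++ 2M SDE": multistep DPM-Solver++ with its SDE variant.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, algorithm_type="sde-dpmsolver++"
)
smooth = pipe(prompt=prompt, negative_prompt=negative_prompt,
              num_inference_steps=30).images[0]
smooth.save("sample_dpmpp_2m_sde.png")

# Approximation of "Euler a": ancestral Euler sampling, typically more textured output.
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
textured = pipe(prompt=prompt, negative_prompt=negative_prompt,
                num_inference_steps=30).images[0]
textured.save("sample_euler_a.png")
```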
Table 3. Comparative analysis of original images, professional sketches, and AI-generated outputs.
                                  Image 1   Image 2   Image 3   Image 4   Image 5
Original Image                    [photo]   [photo]   [photo]   [photo]   [photo]
Professional Sketch               [sketch]  [sketch]  [sketch]  [sketch]  [sketch]
Original-Sketch Comparison        0.65      0.46      0.45      0.42      0.26
Stable Diffusion Image Generated  [image]   [image]   [image]   [image]   [image]
Original-SD Comparison            0.49      0.34      0.48      0.40      0.34
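
The comparison values in Table 3 are facial-similarity scores between image pairs. As a minimal sketch only, assuming such scores are obtained with the DeepFace framework cited in [40,41], FaceNet embeddings [42], RetinaFace detection [43], and a cosine metric, one score could be computed as follows; the exact models, metric, and preprocessing behind Table 3 may differ.

```python
# Hedged sketch: similarity score between an original photograph and a generated image,
# using the DeepFace framework [40,41] with an assumed configuration.
from deepface import DeepFace

result = DeepFace.verify(
    img1_path="original_photo.png",
    img2_path="stable_diffusion_output.png",
    model_name="Facenet",            # FaceNet embeddings [42]
    detector_backend="retinaface",   # RetinaFace face detection [43]
    distance_metric="cosine",
)

# DeepFace reports a cosine distance; 1 - distance gives a similarity in [0, 1],
# comparable in spirit to the values reported in Table 3.
similarity = 1.0 - result["distance"]
print(f"similarity: {similarity:.2f}, verified: {result['verified']}")
```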
