Generating Optimized Guessing Candidates toward Better Password Cracking from Multi-Dictionaries Using Relativistic GAN

Abstract: Despite their well-known weaknesses, passwords are still the de-facto authentication method for most online systems. Due to its importance, password cracking has been vibrantly researched for both offensive and defensive purposes. Hashcat and John the Ripper are the most popular cracking tools, allowing users to crack millions of passwords in a short time. However, their rule-based cracking has an explicit limitation: it depends on password-cracking experts to come up with creative rules. To overcome this limitation, a recent trend has been to apply machine learning techniques to research on password cracking. For instance, state-of-the-art password guessing studies such as PassGAN and rPassGAN adopted a Generative Adversarial Network (GAN) and used it to generate high-quality password guesses without knowledge of password structures. However, compared with the probabilistic context-free grammar (PCFG), rPassGAN shows inferior password cracking performance in some cases. It was also observed that each password cracker has its own cracking space that does not overlap with those of other models. This observation led us to realize that an optimized candidate dictionary can be made by combining the password candidates generated by multiple password generation models. In this paper, we suggest a deep learning-based approach called REDPACK that addresses the weaknesses of the cutting-edge GAN-based cracking tools. To this end, REDPACK combines multiple password candidate generator models in an effective way. Our approach uses the discriminator of rPassGAN as the password selector. Then, by collecting passwords selectively, our model achieves a more realistic password candidate dictionary. Also, REDPACK improves password cracking performance by incorporating both the generator and the discriminator of the GAN. We evaluated our system on various datasets with password candidates composed of symbols, digits, and uppercase and lowercase letters.
The results clearly show that our approach outperforms all existing approaches, including rule-based Hashcat, GAN-based PassGAN, and probability-based PCFG. The proposed model was also able to reduce the number of password candidates by up to 65%, with only a 20% loss in cracking performance compared to the union set of passwords cracked by multiple generation models.


Introduction
The password is the de-facto authentication method. It is popular due to its simplicity of implementation and ease of use. Password authentication ultimately depends on human memory. Thus, as revealed by the leaked passwords of websites such as Rockyou, people tend to generate easy-to-remember passwords, primarily composed of common English words or names [1][2][3]. Password cracking utilities provide many functions for attacking weak passwords when password

Our Previous Approach: Recurrent PassGAN
In our previous paper, we proposed several ways to improve PassGAN [9], a deep learning-based password cracking model designed to make up for the limited password cracking space of both rule-based approaches and data-driven probabilistic models, such as Markov models, e.g., the Ordered Markov ENumerator (OMEN) [2]. PassGAN, proposed by Hitaj et al. [9], trains deep neural networks to determine the letter distribution of passwords autonomously and then applies this learned knowledge to generate password candidates that follow the distribution of real passwords. Basically, PassGAN exploits two properties of deep learning. First, deep neural networks are sufficiently expressive to sketch the various letter patterns and context structures used in most user-chosen passwords. Furthermore, they can be trained from data only, unlike legacy machine learning methods: a deep learning model does not require any prior knowledge (usually referred to as features) of the passwords' properties and structures. However, a deep learning model can learn features only from the training data. These properties distinguish deep neural networks from other contemporary methods such as Markov and rule-based models. In the case of Markov models such as OMEN [2], it is assumed that all the relevant characteristics of passwords can be expressed in terms of n-grams.
In contrast, with rule-based approaches, only password candidates derived from the available rules, which reflect an expert's knowledge and experience, can be guessed. Hitaj et al. used a GAN [10] as the deep learning model for password guessing [9]. Among the various GAN models, the Wasserstein GAN with gradient penalty (WGAN-GP) [11] was used in PassGAN; its base deep neural network was a Convolutional Neural Network (CNN)-based residual network. The Recurrent PassGAN (rPassGAN), developed in our previous work, improved the password cracking performance by modifying PassGAN's base neural network type and structure: rPassGAN uses a Recurrent Neural Network (RNN) as its primary deep neural network. Our previous study showed that the password cracking performance of rPassGAN [8,12] was better than that of PassGAN.
Furthermore, rPassD2CGAN, a dual discriminator version of rPassGAN, outperformed rPassGAN. However, the training of rPassD2CGAN sometimes becomes unstable; we overcame this weakness with another model, rPassD2SGAN. Through several experiments, we demonstrated the effectiveness of the password candidates generated by rPassGAN for enhancing password strength estimators.

Our New Approach: REDPACK
In our previous studies, the rPassGAN series outperformed PassGAN. Nevertheless, compared with Probabilistic Context-Free Grammar (PCFG) and Markov models such as OMEN [1], rPassGAN sometimes cracked fewer passwords. In this study, we propose a model that maximizes password cracking performance using a small number of guesses. With our model, we can selectively collect more realistic password candidates to improve the efficiency of password cracking and build an effective cracking dictionary from these collected candidates. If password candidate dictionaries from various models are available, no prior knowledge of password properties and structure is required for selecting the more realistic candidates. The discriminator of the relativistic average standard GAN [13] operates as an estimator that evaluates how realistic the various pre-generated candidate passwords are. Although typically only a GAN's generator is used, our model introduces a novel way of using the GAN's discriminator in password guessing. We refer to this model as a relativistic discriminator for password candidates effective pack (REDPACK). We demonstrate that our model competes favorably with the Hashcat transformation rules (rockyou-30000, dive), PCFG [14], and OMEN [2]. Our new approach's contributions are as follows:

• A GAN is a generative deep learning model. Generally, the generator of a GAN is used for producing fake samples, while the trained discriminator is not utilized for inference; it is used only in the training phase for training the generator. Our first contribution is a method to utilize the discriminator of the GAN not for training the generator but for enhancing password cracking performance. To the best of our knowledge, ours is the first such model proposed in the field of password cracking. The discriminator of the GAN estimates how realistic password candidates are. Then, we use these estimates as criteria to select the best candidates among the multiple dictionaries.

• If there are multiple pre-generated password candidate dictionaries, REDPACK enables us to make a more effective and efficient password candidate dictionary without any background knowledge about the pre-generated dictionaries. If we provide this efficient and effective password candidate dictionary to a password strength estimator like Zxcvbn, we are able to make the criteria of the password strength estimator stricter [7,8,12].

• Our last contribution is the building of a custom ruleset for REDPACK. This custom ruleset helps the password candidates from REDPACK maximize their cracking performance.

Organization
The rest of this paper is organized as follows. In Section 2, we provide an overview of the relevant password guessing models. In Section 3, we provide a brief overview of GANs and relativistic average GANs as background knowledge for REDPACK. In Section 4, we explain the concept of REDPACK and describe its architecture. In Section 5, we explain the training process and hyperparameter configuration, and then evaluate the password cracking performance of REDPACK, comparing its results with other advanced password guessing techniques. Finally, the conclusions are presented in Section 6.

Related Works
In this section, we discuss related work on password guessing in three categories: rule-based approaches, probability-based approaches, and deep learning-based approaches. The probability-based approaches are divided into whole-string methods and template-based methods; these terms were defined by Ma et al. [15].

Rule-based Approaches
Basically, in a password-guessing attack, the adversary attempts to match one or more users' passwords by repeatedly testing large numbers of password candidates. This attack can be conducted in offline or online mode. Password-guessing attacks might be as old as password authentication itself [16]. JtR [4] and Hashcat [5] are the two most popular modern open-source password guessing (or cracking) tools. They provide multiple password cracking strategies: exhaustive brute-force attacks; dictionary-based attacks; rule-based attacks (also called hybrid attacks), which generate password candidates by transforming the words in a dictionary according to their own grammar [6,17]; and Markov-model-based attacks [18,19], in which each letter of a password candidate is chosen via a statistical and probabilistic process that considers one or more preceding letters, with the model trained on dictionaries of plaintext passwords. In the practical password cracking field, JtR and Hashcat are promising and useful, and heterogeneous computing technology further enhances their cracking performance. There have been several instances in which well over 90% of the passwords leaked from online services have been successfully recovered (as of April 2020, 370 of the 1321 password hash files on Hashes.org had been recovered at a rate of over 90%) [20]. However, both Hashcat and JtR have an explicit limitation: if the target password has a complex form that their pre-defined rulesets do not cover, both tools are totally unable to crack it.
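To make the rule-based strategy concrete, the following is a minimal, hypothetical sketch of how transformation rules expand a dictionary into candidates. The rules shown are illustrative stand-ins written as Python functions, not actual Hashcat/JtR rule syntax (the comments note the rule each one loosely mimics).

```python
# Hypothetical sketch of a rule-based attack: each rule is a string
# transformation applied to every dictionary word, in the spirit of
# Hashcat/JtR rule files. The rules are illustrative, not real rule syntax.
RULES = [
    lambda w: w,                    # ':'   pass the word through unchanged
    lambda w: w.capitalize(),       # 'c'   capitalize the first letter
    lambda w: w + "1",              # '$1'  append a digit
    lambda w: w[::-1],              # 'r'   reverse the word
    lambda w: w.replace("a", "@"),  # 'sa@' leet-style substitution
]

def expand(dictionary):
    """Yield every candidate produced by applying each rule to each word."""
    seen = set()
    for word in dictionary:
        for rule in RULES:
            cand = rule(word)
            if cand not in seen:    # deduplicate across rules and words
                seen.add(cand)
                yield cand

candidates = list(expand(["password", "monkey"]))
```

A real ruleset chains many such transformations, which is why expert-crafted rules matter so much for coverage.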

Markov-Based Approaches & PCFG
In the stochastic approach, methods incorporating the Markov model have been proposed. Narayanan et al. proposed a method of generating password candidates using a Markov model for the alphanumeric characters and a Turing machine for special symbols [21]. The fundamental idea of this method is that easy-to-remember passwords are limited to frequently used words and patterns, and that the passwords in this space follow a specific probability distribution over alphanumeric combinations. Narayanan et al. used the Markov model as a filter to eliminate low-probability password candidates. Furthermore, they applied the rainbow table concept (a time-space tradeoff) to speed up password cracking. Finally, to handle passwords with special symbols, the Turing machine concept was adopted: based on a password generation rule, various letters and numbers were combined into regular expressions, and a probability was defined for each combination created. This pioneering work was subsequently extended by Ma et al. [15] and Dürmuth et al. [2].
Dürmuth et al. [2] proposed an efficient password guesser based on Narayanan's model, called OMEN. Basically, OMEN aimed to improve the cracking speed of Narayanan's model. It incorporated an improved enumeration algorithm called "enumPwd", which enables it to produce candidates in order of probability by maintaining multiple bins, each of which stores candidates of similar probability.
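The enumeration idea can be illustrated with a toy sketch, which is not the actual enumPwd implementation: a 2-gram model assigns each candidate a discretized log-probability "level", candidates are grouped into bins by level, and bins are emitted in order, most probable level first. The corpus, n-gram order, and floor probability below are illustrative assumptions.

```python
import math
from collections import defaultdict

def train_bigram(passwords):
    """Estimate P(next char | previous char) from a toy password corpus.
    '^' marks the start of a password."""
    counts = defaultdict(lambda: defaultdict(int))
    for pw in passwords:
        for a, b in zip("^" + pw, pw):
            counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

def level(model, pw, floor=1e-6):
    """Discretize -log P(pw) into an integer level; smaller = more probable."""
    logp = 0.0
    for a, b in zip("^" + pw, pw):
        logp += math.log(model.get(a, {}).get(b, floor))
    return round(-logp)

model = train_bigram(["abc", "abd", "abc"])
bins = defaultdict(list)                      # level -> candidates (OMEN-style bins)
for cand in ["abc", "abd", "zzz"]:
    bins[level(model, cand)].append(cand)
ordered = [pw for lvl in sorted(bins) for pw in bins[lvl]]
```

The real OMEN uses larger n-grams and a dedicated enumeration algorithm, but the bin-by-bin ordering shown here is the core idea.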
The most important development in these password guessing studies was incorporating the PCFG concept into the password-guessing method. PCFG was originally introduced as a password guesser by Weir et al. [22]. Current complex passwords have a grammatical structure: a combination of alphanumeric sequences, special characters, and keyboard walks. PCFG analyzes the grammatical structure of leaked passwords and estimates a probability distribution from them. For generating password candidates, PCFG instantiates the grammatical structures in order of probability. Recently, several studies have improved upon the performance of PCFG [23][24][25]. Based on common current password usage patterns and on government recommendations [26], password guessing must be able to produce grammatical structures that include not only simple alphabetical and numerical combinations but also complex combinations with special characters and keyboard walks. PCFG has enabled us to generate these complicated password patterns. In experiments, PCFG exhibited a higher cracking success rate than dictionary-based attacks using Hashcat's built-in rules; this method effectively expands the covered area of the password space. Houshmand et al. [24] focused on improving the cracking performance for keyboard-walk structures and employed a smoothing technique; in their experiments, they achieved good performance on cracking keyboard-walk-patterned passwords. Furthermore, Houshmand et al. proposed the use of PCFG as a target-oriented cracking approach [25].
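As an illustration of the first PCFG step, the sketch below (a deliberately simplified reading of Weir et al.'s method, with keyboard-walk detection omitted) abstracts each password into a base structure such as L8D3 (run lengths of letters, digits, and symbols) and estimates structure probabilities from a small, hypothetical corpus.

```python
from collections import Counter

def base_structure(pw):
    """Abstract a password into a PCFG-style base structure, e.g. 'L8D3S1'."""
    out, prev, run = [], None, 0
    for ch in pw:
        cls = "L" if ch.isalpha() else "D" if ch.isdigit() else "S"
        if cls == prev:
            run += 1
        else:
            if prev:
                out.append(f"{prev}{run}")   # close the previous run
            prev, run = cls, 1
    out.append(f"{prev}{run}")               # close the final run
    return "".join(out)

# Toy "leaked corpus"; a real PCFG is trained on millions of passwords.
corpus = ["password123", "letmein!", "monkey123", "qwerty!"]
counts = Counter(base_structure(pw) for pw in corpus)
probs = {s: c / len(corpus) for s, c in counts.items()}
```

Candidate generation then fills each structure's terminals (letter strings, digit strings, symbol strings) in descending order of their estimated probabilities.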
Finally, Ma et al. [15] attempted to maximize the performance of probability-based password cracking approaches through optimized configuration and usage. Additionally, they proposed a new way of measuring password cracking performance. Ma et al. categorized password guessing methods as whole-string or template-based: Narayanan et al.'s model and OMEN are whole-string models, while PCFG is a template-based model. Ma et al. attempted to derive each model's best configuration and introduced the n-gram model from statistical language modeling into password modeling. In their experiments, the whole-string Markov models outperformed PCFG in several cases; however, specific configurations, such as smoothing, enhanced the performance of PCFG to the point that it outperformed the whole-string approach.

Deep Learning-Based Approaches
The first password guessing method that employed deep learning was proposed by Melicher et al. [27], who incorporated an RNN [28] into their model. The purpose of Melicher et al.'s study was to enhance the password strength estimator with a deep-learning password guesser. RNNs, deep learning neural networks popularly adopted in the field of Natural Language Processing (NLP), usually exhibit good performance in various applications, such as chatbots, translation, and auto-completion. In Melicher et al.'s method [27], leaked passwords were used as the training data, and a guessing candidate was produced letter by letter: in the RNN model, each character of the password is predicted based on all the previously selected characters.
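This letter-by-letter scoring can be sketched minimally as follows; the hand-written lookup table is a stand-in for the trained RNN, and all conditional probabilities in it are hypothetical values.

```python
import math

# Stand-in for a trained next-character model: P(next char | full prefix).
# '$' denotes the end-of-password symbol. All values are hypothetical.
COND = {
    "": {"p": 0.3, "1": 0.1},
    "p": {"a": 0.5},
    "pa": {"s": 0.6},
    "pas": {"$": 0.4},
}

def log_prob(pw, floor=1e-4):
    """Sum the conditional log-probabilities of each character given its
    prefix, including the end-of-password symbol; unseen transitions get
    a small floor probability."""
    lp = 0.0
    for i, ch in enumerate(pw + "$"):
        lp += math.log(COND.get(pw[:i], {}).get(ch, floor))
    return lp
```

More guessable passwords receive a higher (less negative) log-probability, which is exactly the quantity a strength estimator built on such a model thresholds against.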
In addition to the method by Melicher et al. [27], Hitaj et al. [9] proposed PassGAN, a deep learning-based password guesser built on the IWGAN model [11]. Throughout their experiments, PassGAN competed favorably with state-of-the-art password generation tools. PassGAN was the first approach to apply a GAN to the password guessing problem. It used the original IWGAN model, with the WGAN-GP cost function, for training the generator of the GAN without any modifications. In the original IWGAN model, both the generator and discriminator used a Convolutional Neural Network (CNN) as their primary component. However, CNNs are usually used for processing images in deep learning studies, so in our previous studies, a Recurrent Neural Network (RNN)-based model was used to improve password cracking performance [8,12]. We named our models rPassGAN and rPassD2SGAN, the latter indicating a dual discriminator architecture. When the Jensen-Shannon Divergence (JSD) between the training dataset and the cracking target set is high, rPassGAN cracked more passwords than PCFG; that is, rPassGAN generalizes better across differing distributions. Otherwise, PCFG cracked more passwords than rPassGAN. Although PassGAN could not outperform other password guessing models, Hitaj et al. emphasized that their model was able to crack some passwords that could not be cracked by other stochastic models [9]. rPassGAN achieved similar results, and this trend can be observed among the RNN-based PassGAN models [8,12].

Background for REDPACK
In this section, we provide a brief overview of a standard GAN and a relativistic average GAN. First, we describe GAN's development history, following which we explain relativistic GANs, which are at the core of our model.

Generative Adversarial Networks
GANs have brought about remarkable advances in the field of deep learning. A GAN is composed of two neural networks. The first is a generative deep neural network G, which performs the main task of training. The other is a discriminative deep neural network D, which functions as the supervisor. Given an n-sized input batch I = {x_1, x_2, ..., x_n}, the goal of G is to generate fake samples that D confuses with the real ones, based on the responses of D; the goal of D is to learn to distinguish the fake samples from G from the real ones coming from I. The optimization problem for GANs is a minimax problem. Goodfellow et al. [10] showed that the GAN has a global optimum when the distribution of fake samples produced by G is identical to the distribution of the given real data. When z is a noise input from the uniform distribution, the minimax problem can be expressed as follows:

min_G max_D V(D, G) = E_{x ~ P_data}[log D(x)] + E_{z ~ P_z}[log(1 - D(G(z)))]   (1)

The learning of the generator can be regarded as optimized when D cannot distinguish between the fake samples generated by G and the real samples from I. Since Goodfellow et al. proposed the initial GAN, various GAN models with more stable training performance have been proposed. Among these GAN models, the PassGAN-related ones are the Wasserstein GAN (WGAN) [29] and the improved Wasserstein GAN (IWGAN) [11]. WGAN, introduced by Arjovsky et al., improves the training stability of a standard GAN by employing the Wasserstein distance as its loss. The benefits of this approach include reduced mode collapse and meaningful learning curves, which are helpful for identifying optimal hyperparameters. WGAN incorporates a new cost function; however, the experiments on WGAN focused on generating realistic images. Gulrajani et al. [11] proposed IWGAN to find the global optimum more effectively than WGAN, introducing a gradient penalty to replace WGAN's gradient clipping. Gulrajani et al. proposed the use of IWGAN to solve the text generation problem. In Gulrajani et al.'s IWGAN, both G and D consist of simple residual CNNs; the residual architecture makes the training of the GAN fast and stable [30,31]. G takes a latent noise vector as input, transforms it through its convolutional layers, and outputs a sequence of 32 one-hot character vectors. The output layer of G applies a softmax nonlinearity and forwards the result to D. Each output character of a fake sample is determined by the argmax function applied to each output vector generated by G. The IWGAN experiment motivated Hitaj et al. [9] to apply IWGAN to the password guessing problem; they referred to their model as PassGAN.
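The argmax decoding step can be sketched as follows; the charset and score matrix below are toy values rather than PassGAN's actual alphabet or a real softmax output.

```python
import numpy as np

# Toy charset standing in for the model's character vocabulary.
CHARSET = list("abc123")

def decode(softmax_out):
    """Decode a (seq_len, |charset|) matrix of per-position softmax scores
    into a string by taking the argmax character at each position."""
    return "".join(CHARSET[i] for i in np.argmax(softmax_out, axis=1))

# Build a fake softmax output whose dominant entry at each position
# spells out a known string.
scores = np.zeros((4, len(CHARSET)))
for pos, ch in enumerate("ab12"):
    scores[pos, CHARSET.index(ch)] = 0.9
```

In the real model the scores come from G's output layer; the decoding itself is just this position-wise argmax.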

Relativistic average GAN
The applications of GAN, a groundbreaking framework for learning generative models, are varied. However, the standard GAN (SGAN), originally proposed by Goodfellow et al. [10], is unstable in the learning phase, and optimizing the model is difficult. Many alternatives, including WGAN, have been proposed to mitigate this problem. GAN-based models can be broadly divided into integral probability metric (IPM)-based GANs and non-IPM-based GANs; generally, IPM constraints provide stability for training the GAN. Jolicoeur [13] analyzed the loss function of SGAN to identify the limitations of non-IPM-based GANs and proposed a relativistic GAN to address them. Goodfellow et al. [10] proved that GAN training attains the global optimum when the discriminator classifies real data with probability 0.5. However, in many cases, the discriminator classifies both real and fake data as real, which is disadvantageous for training a good generator. This is because the discriminator is not aware that half of the data in the learning process is fake; the IPM-based GAN is relatively stable during learning because it implicitly accounts for this fact. From the perspective of divergence minimization, the generator is trained to increase D(x_f), whereas D(x_r) does not decrease accordingly, where x_r and x_f denote the real and fake data, respectively. To address this issue, Jolicoeur [13] designed the output of the discriminator to depend on both the real and fake data:

D(x_r, x_f) = σ(C(x_r) - C(x_f))   (2)

where C(x) is the presumed critic (D(x) = σ(C(x))). Equation (2) can be interpreted as the discriminator's estimate of the probability that the given real data is more realistic than the fake data (D(x_f, x_r) can be interpreted in the opposite manner). If the discriminator is set according to Equation (2), then unlike the generator of SGAN, which relies solely on the fake data, the generator of the relativistic GAN will depend on both the real and fake data.
However, it has O(m²) complexity when calculating the loss (where m is the mini-batch size), because it computes the pair-wise critic differences between the real and fake data in the mini-batch. To solve this problem, Jolicoeur [13] proposed the relativistic average GAN (RaGAN), which compares each sample with the expectation over the opposite type of data. RaGAN uses the following loss functions to learn the discriminator and generator:

L_D = -E_{x_r}[log(σ(C(x_r) - E_{x_f}[C(x_f)]))] - E_{x_f}[log(1 - σ(C(x_f) - E_{x_r}[C(x_r)]))]   (3)
L_G = -E_{x_f}[log(σ(C(x_f) - E_{x_r}[C(x_r)]))] - E_{x_r}[log(1 - σ(C(x_r) - E_{x_f}[C(x_f)]))]

where L_D and L_G represent the loss for learning the discriminator and generator, respectively. Equation (3) has O(m) complexity. The discriminator of RaGAN estimates the probability that given data is, on average, more realistic than the opposite type of data. Using different datasets, Jolicoeur [13] showed that training RaGAN was faster and more reliable than training other GAN models, and that the generator of RaGAN generated high-quality fake data.
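A small numeric sketch of these relativistic average losses follows; the critic scores are toy values, and the code mirrors the RaSGAN form of Equation (3), where each score is compared against the mean critic score of the opposite type of data.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rasgan_losses(c_real, c_fake, eps=1e-12):
    """Relativistic average losses: each critic score is compared against
    the mean critic score of the opposite type (Equation (3))."""
    d_real = sigmoid(c_real - c_fake.mean())   # real vs. average fake
    d_fake = sigmoid(c_fake - c_real.mean())   # fake vs. average real
    loss_d = -np.log(d_real + eps).mean() - np.log(1 - d_fake + eps).mean()
    loss_g = -np.log(d_fake + eps).mean() - np.log(1 - d_real + eps).mean()
    return loss_d, loss_g

c_real = np.array([2.0, 1.5, 2.5])    # toy critic scores for real passwords
c_fake = np.array([-1.0, -0.5, 0.0])  # toy critic scores for fake passwords
loss_d, loss_g = rasgan_losses(c_real, c_fake)
```

With real scores dominating fake scores, the discriminator loss is small and the generator loss is large, which is the regime that pushes G to improve. Note that each mean is computed once per batch, giving the O(m) cost mentioned above.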

Proposed Model: REDPACK
In this section, we describe the structure of REDPACK. First, we give an overview of REDPACK; then, we describe in detail its discriminator training structure and its password candidate selection structure.

Overview
Generally, GAN models are trained to solve generative problems, so after training finishes, the generator of the GAN is used to achieve its goal. In our previous research [8,12], we used the generator of our GAN to produce more realistic password candidates. In REDPACK, however, it is the discriminator of the GAN that is used after training finishes. Figure 1 shows the training phase and the selection phase. In the training phase, the generator produces fake passwords, and the discriminator tries to distinguish them from real passwords, sending feedback to the generator. Using this feedback, the generator learns to make fake passwords that are almost identical to real passwords. As the generator improves, the discriminator must solve an increasingly difficult discrimination problem, which makes the discriminator stronger. The training process continues until the training parameters converge. In the selection phase, multiple password generators can be used to produce password candidates. The trained discriminator calculates the probability that each generator's candidates are realistic: the closer the probability is to one, the closer the candidate is to a realistic password. In the final step, the password candidates with the highest probability are supplied to a password cracking tool such as Hashcat (or John the Ripper). From an implementation perspective, there is no limit to the number of password candidate generators that can be used in the selection phase; in this paper, we use three or four. Figure 2 shows the process of optimizing the parameters of an RNN-based GAN, which is built upon a generator (G) and a discriminator (D).
We employ an RNN as the base model of the GAN because passwords are a type of sequential data. As with a standard GAN [10], G is trained to generate fake passwords that are very similar to real passwords, while D tries to distinguish real passwords from the fake ones. However, our RNN-based GAN adopts concepts from both the relativistic average GAN [13] and IWGAN [11] to achieve a more powerful discriminator. In Figure 2, G produces fake passwords from a given arbitrary noise distribution P_z(z), depicted by the green path. Following Equation (3), D judges a fake password as real if and only if its critic is larger than that of the real passwords (shown as blue storage in the figure); likewise, the real passwords on the blue path are judged real in the opposite case. D provides gradients as a penalty to encourage G to generate more authentic passwords, depicted by the red path. This learning framework forces D to develop stronger judgment criteria than a standard GAN.

The Discriminator Training Structure
Algorithm 1 summarizes the learning procedure for the RNN-based GAN described above. It mainly originates from IWGAN but uses the loss function of a relativistic average GAN; most notations follow Equation (3). The primary differences are the loss functions in lines 9 and 15, which depend on the relativistic discriminator, estimating the critic for one type of password against the average critic of the opposite type. This direct comparison allows G to converge quickly to an optimum point and produce high-quality fake passwords. Also, since our RNN-based GAN adopts IWGAN, we add the gradient penalty to the discriminator's loss function. This penalty forces the 2-norm of the gradients at x̂ to be less than 1, where x̂ is a random sample on the straight line between the pair of points (x_r, x_f); this greatly stabilizes the training of the GAN. Another important factor is the iteration for optimizing G in lines 13 to 16. Although most GANs have a loop for optimizing D on a given G, this is insufficient to maximize the GAN's performance, so we add a loop for training G to stabilize and enhance our GAN. According to our experiments, described in Section 5, this factor has a critical effect on the cracking performance of REDPACK. In general, once the model is optimized, the generator is used as a generative model; instead, we utilize the discriminator as a genuine-password estimator for REDPACK.

Algorithm 1 (outline). Inputs: critic C's parameters w; generator G's parameters θ; random samples x̂ on the straight line between real passwords x_r and fake passwords x_f.
1:  while θ has not converged do
2:    for t = 1, ..., n_critic do
3:      for i = 1, ..., m do
4:        Sample real data x_r ~ P_data, latent variable z ~ P_z, and a random number ε ~ U[0, 1]
5-9:      [generate x_f = G(z), form x̂, and compute the critic loss of Equation (3) with the gradient penalty (line 9)]
10:     end for
11:     end for
12:     [update w]
13:   for t = 1, ..., n_gen do
14:     Sample a batch of latent variables z ~ P_z
15:     [compute the generator loss of Equation (3) and update θ]
16:   end for
17: end while
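The gradient-penalty step of the critic update (line 9 of Algorithm 1) can be sketched numerically; a linear critic is assumed so that its gradient is known in closed form, and λ = 10 follows the common WGAN-GP choice. All vectors below are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear critic C(x) = w . x, so its gradient w.r.t. x is just w.
w = np.array([0.5, 2.0, -1.0])
x_real = np.array([1.0, 0.0, 1.0])   # stand-in for a real sample
x_fake = np.array([0.0, 1.0, 0.0])   # stand-in for a generated sample

eps = rng.uniform()                        # epsilon ~ U[0, 1]
x_hat = eps * x_real + (1 - eps) * x_fake  # point on the line between them
grad = w                                   # dC/dx at x_hat for a linear critic
penalty = 10.0 * (np.linalg.norm(grad) - 1.0) ** 2   # lambda = 10
```

In the actual model the gradient at x̂ is obtained by automatic differentiation through the RNN critic; the penalty term is then added to the relativistic loss of Equation (3) when updating w.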

Password Candidate Selection Structure
REDPACK consists of multiple password-guessing models: for example, Hashcat, PCFG, rPassGAN with WGAN-GP, and rPassGAN with RaGAN-GP. Two rPassGAN instances with different loss functions and different hyperparameter configurations can both serve as components of the multiple password candidate generators, because each RNN-based PassGAN has its own password cracking results that other models could not produce. In our previous research [8,12], both the single-discriminator rPassGAN and the dual-discriminator rPassGAN cracked passwords of their own. So, various deep learning-based password guessing models are used as password candidate sources for REDPACK, as shown in Figure 3. One billion candidates generated by each password generating model are transformed from strings to tensors. These tensor inputs are supplied to the discriminator (D), which consists of two fully connected neural networks and one RNN layer using GRU cells. The discriminator (D) estimates how realistic each password input is and outputs a probability for each tensor input. The MAX Probability Selector in Figure 3 chooses the password candidate with the highest probability and transforms its tensor back into a password string. These selected candidates are then transferred to Hashcat or saved in the refined password candidate dictionary. The final step is Hashcat password cracking, as shown in Figure 3; Hashcat runs a hybrid attack using transformation rules such as the best64, dive, and rockyou-30000 rules.
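The selection phase can be sketched as follows; the `score` function is a hand-written stand-in for REDPACK's trained discriminator (it simply favors longer, mixed-class strings), and consuming the generators' outputs in lockstep groups is an illustrative assumption about how candidates are compared.

```python
def score(pw):
    """Stand-in for the discriminator's realism probability: counts how
    many character classes appear and rewards length slightly."""
    classes = sum(any(f(c) for c in pw)
                  for f in (str.islower, str.isupper, str.isdigit))
    return classes + 0.1 * len(pw)

def select(batches):
    """batches: one candidate list per generator, consumed in lockstep.
    Keep only the highest-scoring candidate of each comparison group."""
    refined = []
    for group in zip(*batches):         # one candidate from each generator
        refined.append(max(group, key=score))
    return refined

# Toy outputs from three hypothetical candidate generators.
hashcat_out = ["password1", "abc"]
pcfg_out = ["Summer2020", "qwerty"]
rpassgan_out = ["dragon", "P4ss"]
refined = select([hashcat_out, pcfg_out, rpassgan_out])
```

The refined list is what gets handed to Hashcat, so the expensive hashing budget is spent only on the candidates the discriminator judges most realistic.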

Evaluation
In this section, we explain the configuration of our experiment in detail. Then, we comparatively evaluate the cracking performance and password estimation performance of our approach.

Experimental Data Preparation
In various studies [2,8,12,14,22,32], a significant number of leaked passwords have been analyzed, providing notable insight into the usage patterns of real-world users. For the experiments in this study, we used publicly available cracked plaintext passwords and leaked password dictionaries. Most previous studies on password cracking used the leaked Rockyou and LinkedIn passwords [2,8,9,12,22,24]. However, our team used additional password dictionaries containing long and 4class passwords for our cracking performance experiments. We use the password classification of Melicher et al. [27], who defined test datasets as 1class, 2class, 3class, and 4class: a 2class password must contain at least two character classes; a 3class password must contain at least three character classes; and a 4class password must contain all four classes (symbols, digits, lowercase and uppercase letters). Rockyou and LinkedIn contain few 4class passwords. We also used four cracked password dictionaries from Hashes.org, which provides several cracked and leaked plaintext password sets. In total, seven training and cracking datasets were used; with these datasets, password cracking performance could be observed in relation to password length. Information on these datasets is summarized in Table 1. Dataset 1 has been used in several password cracking studies; therefore, experiments with dataset 1 allow performance comparisons between our approach and previous studies. With datasets 2-7, we show the cracking performance of our approach in practical situations. Hashcat-rockyou30000, Hashcat-dive, OMEN, PCFG, and rPassGAN-WGAN were used as the password generators. Subsequently, we applied the best64 rule to all the models' password candidates to maximize cracking performance, as is typically done in practice. Each generator model produced one billion password candidates for repeated experiments.

REDPACK Training Configuration
To optimize the deep learning model's performance, it is essential to set appropriate values for the training hyper-parameters. Unlike other deep learning models, a GAN has the G/D training iteration ratio as a model-specific hyper-parameter, which affects both training stability and performance. The essential training hyper-parameters for our experiments are as follows.

•
The G/D iteration number represents the ratio of generator to discriminator training iterations. Although 1:1 and 1:10 (in favor of the discriminator) are typically used for RaSGAN, 40:10 is usually used for WGAN-GP.

•
The batch size represents the number of passwords from the training set that are passed to the model at each optimization step. Although larger values improve the generality of the model, they can result in unstable training, so choosing an appropriate value is important. A batch size of 128 is typical; we used 64 to counter training instability. All the parameters are summarized in Table 2. We used Tensorflow-gpu 1.10.1 with Python 3.5.4 for GPU computing. All experiments were conducted on the OpenHPC system developed by NMLab of Korea University. Each OpenHPC node runs CentOS 7 with 32 GB of memory, two Intel Xeon E5 2.20 GHz CPUs, and four Nvidia TitanXP 12 GB GPUs.
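The hyper-parameters above can be gathered into a single configuration object, which is how they would typically be passed to a training script. This is only an illustrative summary of the values stated in the text; the key names are hypothetical, not REDPACK's actual configuration.

```python
# Illustrative hyper-parameter summary (key names are hypothetical).
HPARAMS = {
    "gd_ratio_rasgan": (1, 10),   # generator:discriminator iterations for RaSGAN
    "gd_ratio_wgan_gp": (40, 10), # ratio usually used for WGAN-GP
    "batch_size": 64,             # reduced from the typical 128 to counter instability
}

print(HPARAMS["batch_size"])
```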

GAN Training and Testing
As inferred from our previous research [8,12], the major factors that determine cracking performance are the training epoch, the G/D ratio, and the recurrent neural network (RNN) cell type. Regarding the training epoch, models trained for too many epochs are prone to overfitting, as shown in Figure 4a; overfitting must be avoided to maximize the effectiveness of selecting realistic password candidates. The G/D ratio also strongly affects cracking performance. Ideally, a wide range of G/D ratios would be tested, but this is time-consuming. Therefore, we conducted experiments using G/D ratios of 1:1 and 1:10 for the RaSGAN-GP cost function, as shown in Figure 4b. Although neither may be the optimal value, they suffice to show the effectiveness of the model. The final factor is the cell type: we needed to determine whether Long Short-Term Memory (LSTM) [28] or the Gated Recurrent Unit (GRU) [33] is better suited to the RNN-based relativistic discriminator. Through experiments on these factors, 200k training epochs, a 1:10 G/D ratio, and the GRU cell type were settled on as the optimal settings for password cracking and generality, as shown in Figure 5. Based on the number of cracked passwords, a G/D ratio of 1:10 yielded better cracking performance than the 1:1 ratio proposed by Jolicoeur-Martineau [13], although we could not ascertain whether 1:10 is the best possible value.
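For reference, the relativistic average (RaSGAN) discriminator objective tuned above can be written down in a few lines. The following is a toy numpy sketch of that loss over raw critic scores, following Jolicoeur-Martineau's formulation; in REDPACK the scores would come from the GRU-based critic, which is omitted here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rasgan_d_loss(c_real, c_fake):
    """RaSGAN discriminator loss over raw critic scores (toy sketch).

    Real samples should score above the average fake score, and
    fake samples below the average real score.
    """
    c_real, c_fake = np.asarray(c_real, float), np.asarray(c_fake, float)
    real_term = -np.mean(np.log(sigmoid(c_real - c_fake.mean())))
    fake_term = -np.mean(np.log(sigmoid(-(c_fake - c_real.mean()))))
    return real_term + fake_term

# Well-separated scores give a small loss; indistinguishable scores
# give the maximum-uncertainty value 2*log(2).
loss_good = rasgan_d_loss([3.0, 2.5], [-2.0, -3.0])
loss_bad = rasgan_d_loss([0.0, 0.0], [0.0, 0.0])
```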

Password Cracking
In this section, we present the results of several experiments using the datasets listed in Table 1. Our model, REDPACK, outperformed all other models at password cracking across all the experiments. First, dataset 1 (a short-length Rockyou dataset) was used; this dataset has appeared in many previous password cracking studies [2,8,9,12,14,24]. Figure 6 compares the cracking performance of REDPACK and the other password candidate generators: the x-axis shows the number of password guesses (i.e., the password candidates used), and the y-axis shows the number of cracked target passwords. In the experiment with dataset 1, all the password-guessing models exhibited similar performance; REDPACK cracked 5% (87,946) more passwords than PCFG. Although the short Rockyou test suffices for a theoretical performance comparison, it does not reflect recent password usage trends. For a practical comparison, datasets 2-7 were created and used in repeated experiments. These experiments used only 4class passwords to compare the models on extremely difficult cracking tasks. Hashcat (best64, rockyou-30000, and dive), PCFG, and rPassGAN were used as the password generation units. Overall, REDPACK showed 5-20% better cracking performance than any single password-guessing model on datasets 2-7. On dataset 1 (short-length passwords), PCFG was strong in the early cracking stages, but REDPACK eventually overtook it in the second half. As password length grows (from dataset 3 to dataset 7), REDPACK clearly outperformed PCFG and the other candidate generators; that is, the effectiveness of REDPACK is most evident on complicated, long passwords. For datasets 6 and 7, we could not create one billion Hashcat candidates because of the small amount of training data.
For the rule-based generators, the number of candidates was defined as the number of rules multiplied by the number of base candidates.
In Table 3, the number of each model's candidates selected by the MAX Probability Selector (Figure 3) was proportional to that model's single-model cracking performance. This shows that REDPACK did not select candidates randomly but chose them selectively; in other words, the discriminator of REDPACK correctly estimates how realistic the generated passwords are.
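The selection step described above can be sketched as follows: each generator proposes a candidate, and the discriminator's realism score decides which one enters the final dictionary. This is only an illustrative sketch of a max-score selector; the function name is hypothetical, and `score` here is a stand-in for the trained relativistic discriminator (a toy length-based scorer is used below).

```python
def max_probability_select(candidate_batches, score):
    """From parallel candidate lists, keep the highest-scoring candidate
    at each position, tracking which generator it came from."""
    names = list(candidate_batches.keys())
    selected = []
    for group in zip(*candidate_batches.values()):
        best_name, best_pw = max(
            zip(names, group), key=lambda pair: score(pair[1])
        )
        selected.append((best_name, best_pw))
    return selected

# Toy scorer: pretend longer candidates look more "realistic".
batches = {
    "PCFG": ["Summer2019!", "abc"],
    "Hashcat": ["pass", "P@ssw0rd123"],
}
picked = max_probability_select(batches, score=len)
# -> [("PCFG", "Summer2019!"), ("Hashcat", "P@ssw0rd123")]
```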

Limitation of REDPACK
REDPACK showed better performance than any single guessing model, but its cracking performance was still limited. The relativistic discriminator apparently selected more realistic password candidates; however, more realistic candidate selection does not always guarantee effective password cracking. Although REDPACK compressed the number of password candidates by up to 66% in the three-generator case, it also missed some candidates that could have been important for cracking. Table 4 shows the resulting performance loss. In this experiment, we included OMEN as a password candidate generator component, which worsened the cracking performance of REDPACK's candidate dictionary. This degradation was caused by incorrect selections: OMEN hindered the selection of PCFG candidates that could potentially have cracked passwords. OMEN and PCFG have similar characteristics, as both generate password candidates in order of decreasing probability. To compensate for the loss caused by these two probability-based models, we applied a random shuffle to the password candidate sets from both OMEN and PCFG; however, this simple approach could not remove the loss of cracking performance completely.
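The shuffle mitigation above amounts to breaking the probability ordering of the two candidate streams before they reach the discriminator. A minimal sketch, assuming the candidate sets are simply concatenated and shuffled (the function name and fixed seed are illustrative, not from the original work):

```python
import random

def shuffled_stream(omen_candidates, pcfg_candidates, seed=0):
    """Merge two probability-ordered candidate lists and shuffle them,
    so neither generator's high-probability head crowds out the other."""
    merged = list(omen_candidates) + list(pcfg_candidates)
    random.Random(seed).shuffle(merged)
    return merged

stream = shuffled_stream(["123456", "password"], ["qwerty", "letmein"])
```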
REDPACK was at its most efficient in our experiments when models based on three different approaches (Hashcat: rule-based, PCFG: probability-based, rPassGAN: deep learning-based) were used as generators.
Table 4. Performance loss of REDPACK compared with the union set of the unit generators.
Table 5. Cracking performance of the custom rule set for REDPACK. This custom rule set was effective when applied to REDPACK; however, when applied to PCFG and OMEN, its cracking performance was lower than that of best64 in some cases. Numbers in parentheses show the percentage increase or decrease relative to best64.