Article

Cross-Domain Feature Enhancement-Based Password Guessing Method for Small Samples

1 School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
2 No. 30 Institute of CETC, Chengdu 610041, China
3 National Key Laboratory of Security Communication, Chengdu 610041, China
4 School of Computer and Software Engineering, Xihua University, Chengdu 610039, China
5 School of Big Data and Artificial Intelligence, Chengdu Technological University, Chengdu 611730, China
* Author to whom correspondence should be addressed.
Entropy 2025, 27(7), 752; https://doi.org/10.3390/e27070752
Submission received: 29 April 2025 / Revised: 4 July 2025 / Accepted: 12 July 2025 / Published: 15 July 2025
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

As a crucial component of account protection evaluation and intrusion detection, password guessing technology is constrained by its reliance on password data. Password guessing research faces a conflict between traditional models' need for large training samples and the restrictions on password data access imposed by privacy protection regulations; consequently, security researchers often have only a very limited password set from which to guess. This paper introduces a small-sample password guessing technique based on cross-domain feature enhancement. It analyzes the password set using probabilistic context-free grammar (PCFG) to create a password structure probability list and a password fragment probability dictionary, from which a structure vector for the set is generated. Using the structure vectors, the method calculates the cosine similarity between the small-sample password set B from the target domain and publicly leaked password sets A_i, identifying the set A_max with the highest similarity. A_max is then used as the training set, with its structure vectors modified to enhance the features of the small-sample password set, and the enhanced training set is employed for PCFG password generation. Using hit rate as the evaluation metric, Experiment I reveals that the similarity between B and A_i can be measured reliably once the size of B exceeds 150. Experiment II confirms the hypothesis that higher similarity between A_i and B leads to a higher hit rate of A_i on B's test set, with improvements of up to 32% compared to training with B alone. Experiment III demonstrates that after enhancing the features of A_max, the hit rate on the small-sample password set can increase by as much as 10.52% compared to the previous results. This method offers a viable solution for small-sample password guessing without requiring prior knowledge.

1. Introduction

As a fundamental mechanism for ensuring user account security, research on password guessing techniques is of great practical significance for evaluating the effectiveness of account protection measures and conducting system penetration testing. Based on the type of data required, password guessing techniques can generally be categorized into three types: training-free, intra-site, and cross-site [1]. Among these, the latter two belong to data-driven password guessing methods [2,3,4,5], which typically require large-scale training datasets and often rely on probabilistic or machine learning models [6] to generate guessed passwords. Studies show that due to legal constraints from data privacy regulations and technical challenges in data acquisition, security researchers often only have access to limited, unidentified publicly leaked password datasets or small-scale target password sets. This limitation poses a bottleneck for traditional data-driven password guessing methods that heavily depend on large sample sizes.
In recent years, academic research on password guessing has made significant progress, particularly in data-driven approaches. Among them, intra-site password guessing has attracted the most theoretical and practical attention [7], as it allows for the extraction of password features more closely aligned with the target system, potentially leading to higher hit rates. However, with the strengthening of cybersecurity measures and the enforcement of data privacy laws, acquiring sufficiently large password datasets in intra-site scenarios has become increasingly difficult. Consequently, researchers often face a dilemma: the size of the available training dataset for password guessing is extremely small, usually ranging from dozens to hundreds of samples, far below the typical requirement of over 10^4 samples needed by traditional data-driven approaches. Therefore, it is of considerable practical importance to conduct in-depth research on password guessing techniques applicable to small-sample scenarios, especially within the intra-site context, to address the challenges posed by the scarcity of password data in real-world applications. Currently, methods for small-sample password guessing can be primarily classified into two categories:
  • Enhancing the corpus: To address sample limitations, additional personal data, topics, regional corpora, password fragments, and other relevant information can be incorporated. In 2016, Wang et al. [8] introduced a targeted online password guessing framework based on user personal information, which combines seven probabilistic attack models to enhance guessing efficiency in scenarios with limited samples. In 2019, Bi et al. [9] developed a topic-based PCFG model that integrates user interests, cultural backgrounds, and other topic-related data to lessen dependence on leaked password information. In 2021, Gan et al. [10] introduced a corpus-based password guessing approach that leverages extensive corpora to make up for the scarcity of words in smaller training sets, thus enhancing password guessing efficiency in such contexts. In 2022, Chen [11] suggested generating guessing rules that can be adapted for longer passwords by examining the corpus transfer characteristics of shorter ones. That same year, Han et al. [12] tackled the challenge of traditional methods that depend on large datasets by proposing a parameterized hybrid guessing framework (hyPassGu). This framework merges the strengths of various techniques, including probabilistic context-free grammar (PCFG) and Markov models, along with model pruning and optimal guessing number allocation strategies to create effective guessing sets with minimal data. However, while these methods can produce relatively comprehensive password guessing dictionaries, they are heavily dependent on corpus information, and research shows that acquiring relevant corpus data can be difficult in real-world applications.
  • Feature extraction approach: This method involves using AI models to identify additional potential features to make up for a lack of samples. Significant advancements have been made in neural password models. For instance, PassGAN [13] was the first to apply generative adversarial networks to understand the underlying distribution of passwords. Following this, models like FLA [14] and Transformer-based approaches [15], including discussions around PassGPT, have been able to capture intricate long-range dependencies in passwords through more advanced architectures. In 2022, Geng [16] introduced a small-sample password guessing model that employs multi-task learning, which reduces the dependence on a single large sample set by sharing feature representations across different datasets. Additionally, transfer learning techniques have emerged as a crucial strategy for tackling the small-sample issue. Melicher et al. [17] were pioneers in showing that pre-training recurrent neural networks (RNNs) or gated recurrent units (GRUs) on extensive leaked datasets allows the general knowledge of password structures to be transferred and adapted to new, smaller contexts. In 2023, Wang et al. [18] introduced the Pass2Edit algorithm, which utilizes GRU to understand the effects of password modification actions. With just one previous password from the user, it can produce a high-hit-rate guessing sequence, achieving a 24.2% hit rate for regular users within 100 attempts. In the same year, Wei [19] employed a pre-trained model to extract general features from a large leaked dataset and apply them to a small-sample context, enhancing the model's adaptability to specific user demographics. While feature extraction methods can offer robust solutions for various complex issues, their high training costs and low efficiency present significant challenges that researchers and developers need to tackle.
In this research, we present a novel approach to password guessing that leverages cross-domain feature enhancement, specifically designed to address the low guessing accuracy associated with small-sample password sets, denoted as B, within the target domain. The experimental framework utilizes publicly available leaked password datasets, which include information from platforms such as Youku and Gmail, among others. We define the collection of publicly leaked passwords as A = {A_1, A_2, …, A_n}, where A represents the comprehensive set of all publicly leaked password datasets and A_i signifies the i-th independent dataset within this collection. To enhance the guessing accuracy, we compute the cosine similarity between the target password set B and each dataset A_i, then identify the dataset A_max that exhibits the highest similarity. This dataset is utilized for feature enhancement, resulting in a password set with an improved hit rate for the target domain. The key innovations of this methodology are as follows:
  • This study introduces a methodology for evaluating the similarity of password sets through the structural probability vectorization of these sets, enabling a quantitative analysis of the degree of similarity among various password collections. The findings indicate that this approach can reliably quantify the similarity between a limited sample of password sets within a specific domain and a publicly available compromised set, provided that the size of the set exceeds 150;
  • This study validates the hypothesis that a greater degree of similarity between A_i and B correlates with an increased hit rate of A_i on the test set associated with B. Experimental results indicate that the hit rate for password guessing is enhanced by as much as 32% compared to training conducted with B;
  • The impact on the guessing hit rate is particularly pronounced when there is a low degree of similarity between A_i and B. Empirical findings indicate that when the similarity index between A_i and B falls below 0.6, the application of this method can enhance the guessing hit rate by a maximum of 10.52%;
  • Our approach leverages password set similarity to transfer the most relevant features from publicly leaked datasets to the intra-site small-sample password set. This effectively addresses the issue of insufficient feature extraction in traditional intra-site methods under small-sample conditions, thereby significantly improving the password guessing hit rate in a simple yet efficient manner.

2. Background Knowledge

2.1. Probabilistic Context-Free Grammar

In this study, we employ the probabilistic context-free grammar (PCFG)-based password guessing model introduced by Weir et al. in 2009 to train on the password sets B and A_i. By examining a substantial dataset of actual password collections, this model generates a list of probabilities associated with password structures and a dictionary of probabilities for password fragments, thereby facilitating efficient password guessing. The methodology is divided into two distinct phases:
During the training phase, each password is systematically segmented into three component types: letter segments (L), digit segments (D), and special-character segments (S). Each segment is extracted and labeled with its type and length. For instance, the password "password12345!!!" decomposes into a letter segment of length 8, a digit segment of length 5, and a special-character segment of length 3, yielding the structure label L8D5S3. The probability of each structure is computed from its frequency of occurrence within the password dataset, producing the password structure probability list. Concurrently, for each password structure, the frequency of occurrence of each structural fragment within the dataset is counted and converted to a probability, producing the password fragment probability dictionary. This stepwise methodology facilitates the effective training of the model on the password dataset.
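The segmentation step above can be sketched in a few lines of Python (a minimal illustration: the regular expression assumes ASCII passwords, and the function name is ours rather than from the authors' implementation):

```python
import re

def pcfg_structure(password: str) -> str:
    """Tag each maximal run of letters (L), digits (D), or other
    characters (S) with its length, e.g. 'password12345!!!' -> 'L8D5S3'."""
    structure = ""
    for run in re.finditer(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9]+", password):
        seg = run.group()
        tag = "L" if seg[0].isalpha() else "D" if seg[0].isdigit() else "S"
        structure += f"{tag}{len(seg)}"
    return structure

print(pcfg_structure("password12345!!!"))  # -> L8D5S3
```

Counting how often each structure (and each fragment under its tag) occurs in the training set then yields the two probability tables described above.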
In the generation phase, the start symbol of the probabilistic context-free grammar (PCFG) is designated as the initial node of the search. New nodes, representing candidate password strings, are then generated through the iterative application of the PCFG production rules. Rules with higher probabilities are expanded first, so that the generated passwords more accurately reflect the actual distribution of existing passwords. The probability of a guess is the product of its structure and fragment probabilities: P("password12345!!!") = P(L8D5S3) × P(L8 → "password") × P(D5 → "12345") × P(S3 → "!!!") = p. The resulting probability p determines the password's priority in the guessing sequence. Furthermore, this study establishes a minimum probability threshold for password generation to prevent infinite loops in the generation process and to avoid the creation of excessively long password strings, which would waste computational resources and reduce generation efficiency.
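The probability computation can likewise be illustrated as follows (the trained probability values here are hypothetical, chosen only to mirror the worked example above):

```python
def password_probability(structure, expansions, struct_probs, frag_probs):
    """P(password) = P(structure) * product of P(tag -> fragment)."""
    p = struct_probs.get(structure, 0.0)
    for tag, fragment in expansions:
        p *= frag_probs.get(tag, {}).get(fragment, 0.0)
    return p

# Hypothetical trained probabilities, for illustration only.
struct_probs = {"L8D5S3": 0.01}
frag_probs = {"L8": {"password": 0.2},
              "D5": {"12345": 0.1},
              "S3": {"!!!": 0.05}}

p = password_probability("L8D5S3",
                         [("L8", "password"), ("D5", "12345"), ("S3", "!!!")],
                         struct_probs, frag_probs)
print(p)  # ~1e-05
```

Guesses are then emitted in decreasing order of p, stopping once p falls below the minimum probability threshold.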

2.2. Cosine Similarity

Cosine similarity is a quantitative metric utilized to assess the degree of similarity between two vectors. This measure is derived by calculating the cosine of the angle formed between the two vectors. The resulting value ranges from −1 to 1, where a value approaching 1 indicates a higher degree of similarity between the vectors.
The mathematical expression for cosine similarity is as follows:
CosineSimilarity(A, B) = (A · B) / (‖A‖ ‖B‖) = Σ_{i=1}^{n} A_i B_i / ( √(Σ_{i=1}^{n} A_i²) · √(Σ_{i=1}^{n} B_i²) )
where:
  • A and B are the two vectors whose similarity is to be computed;
  • A_i and B_i are the i-th components of A and B, respectively;
  • ‖A‖ and ‖B‖ are the Euclidean norms of A and B, respectively.
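As a minimal, self-contained illustration of the formula (plain Python, no external libraries):

```python
import math

def cosine_similarity(a, b):
    """Return (a . b) / (||a|| * ||b||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([0.5, 0.3, 0.2], [0.5, 0.3, 0.2]))  # identical vectors -> ~1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors -> ~0.0
```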

3. Algorithmic Progression

The limited availability of small-sample features within the target domain induces distributional bias, resulting in a generalization bottleneck in the generative model. This bottleneck is characterized by insufficient coverage of the password space and a low hit rate. To address this issue, we propose a small-sample password guessing methodology that leverages cross-domain feature enhancement, which is predicated on three key assumptions (refer to Figure 1 for the schematic representation). The efficacy of this approach has been validated. The three assumptions are as follows:
  • A password set A_i that resembles B can substantially enrich the password features associated with B;
  • The relevance of the password features to the target domain can be improved by enhancing B's features within A_i;
  • Assumption 1 expands the features and Assumption 2 strengthens them; combining the two improves the password guessing hit rate in the target domain.
This approach comprises four distinct stages:
(1) Password analysis: conduct a statistical analysis of the password dataset using the probabilistic context-free grammar (PCFG) methodology, yielding a probability list of the password structures and a probability dictionary of the fragments within the password set;
(2) Similarity computation: convert the password structure probability list of each password set into a corresponding structure vector, then compute the cosine similarity between the structure vectors as the measure of similarity between the sets, identifying the publicly leaked password set A_max with the highest similarity;
(3) Feature enhancement: enhance the structures of B in the password structure probability list of A_max and the fragments of B in the password fragment probability dictionary of A_max;
(4) Password generation: generate a guess password set based on the feature-enhanced password structure probability list and password fragment probability dictionary of A_max.

3.1. Password Analysis

This study conducts a comprehensive statistical analysis of the password set B utilizing the probabilistic context-free grammar (PCFG) methodology. Specifically, we parse and generalize the samples in the password dataset to extract the structural features of the passwords and the probability distributions of their components. Based on this analysis, we derive two significant statistical outcomes. The first is a probability list of password structures, which delineates the length and composition of each field within a password and encapsulates the primary feature information; this list documents the frequency of each password structure alongside its corresponding probability value. The second is a probability dictionary of password fragments, which covers common password segments, including combinations of letters, digits, and special characters. The structure of a password set refers to the structural characteristics inherent to all passwords within it; by examining it, we gain enhanced insight into the similarities and differences among various password sets.

3.2. Similarity Calculation

We assess the structural cosine similarity among the password sets to identify the publicly leaked password set A_max within A that exhibits the greatest similarity to B. This approach aims to ensure that A_max not only encompasses the characteristics of B as comprehensively as possible under the prevailing conditions but also compensates for the deficiencies of the password set B. The subsequent steps outline the detailed methodology for calculating the structural similarity.
  • First, the structural characteristics of B and of each password set A_i in the collection A are derived through model training, resulting in a structural probability list for each password set;
  • Then, the structural features of the password set A_i to be measured are concatenated with the structural features of B; this union determines the length of the joint vector and the meaning of each component;
  • The structural probabilities of A_i and B are subsequently filled into the vector according to the component ordering, yielding the structural feature vectors of A_i and B;
  • Finally, the cosine similarity of the resulting vectors is computed using the formula in Section 2.2 to assess the structural similarity between A_i and B. The set A_i exhibiting the highest similarity is designated A_max.
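The vector construction and similarity steps above can be sketched as follows (the structure labels and probability values are hypothetical, for illustration only):

```python
import math

def structure_vectors(probs_a, probs_b):
    """Align two structure-probability lists over the union of their structures."""
    keys = sorted(set(probs_a) | set(probs_b))
    return ([probs_a.get(k, 0.0) for k in keys],
            [probs_b.get(k, 0.0) for k in keys])

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) *
                  math.sqrt(sum(y * y for y in v)))

# Hypothetical structure-probability lists for some A_i and for B.
A1 = {"L6": 0.4, "D6": 0.4, "S2D4": 0.2}
B  = {"L6": 0.5, "L6D2": 0.3, "D6": 0.2}
u, v = structure_vectors(A1, B)
print(round(cosine(u, v), 3))  # ~0.757
```

Repeating this for every A_i in A and taking the argmax gives A_max.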

3.3. Feature Enhancement

During the course of our experiments, we observed that when the structural similarity value falls below a specified threshold (refer to Section 4 for the detailed methodology), the structural characteristics of A_max can be enhanced by incorporating the structural features of B. This integration compensates for the features of B that A_max lacks, thereby increasing the accuracy of predicting B. The augmentation technique raises the probabilities, within A_max, of the structures and fragments present in B. The full procedure is given in Algorithm 1:
Algorithm 1: Procedure for Enhancing Structural Features
  Input: P_a: probability distribution of A_max; P_b: probability distribution of B; α: enhancement strength parameter
  Output: P_E: enhanced probability distribution
  for each structure or fragment s in P_b do
    if P_a(s) < α · P_b(s) then P_E(s) ← α · P_b(s) else P_E(s) ← P_a(s)
  end for
  normalize P_E so that its probabilities sum to 1
In this context, P_a and P_b denote the probabilities of a given structure or fragment obtained by training the model on A_max and B, respectively. The parameter α is a tunable factor that modulates how strongly the features of B augment those of A_max. During the enhancement procedure, if P_a > α · P_b, the probability value P_a for A_max remains unchanged, as it already reflects the more prevalent usage pattern. Conversely, when P_a < α · P_b, the probability for A_max is raised to α · P_b, thereby aligning A_max more closely with the characteristics of B. Following the probability enhancement, a normalization step ensures that the probabilities of all structures and fragments sum to 1, preserving the validity and integrity of the probability distribution.
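Under the description above, the enhancement rule amounts to taking the element-wise maximum of P_a and α · P_b followed by renormalization; a minimal sketch (with hypothetical distributions):

```python
def enhance_distribution(p_a, p_b, alpha):
    """Raise each probability in p_a to alpha * p_b where B's weighted value
    is larger, then renormalize so the result sums to 1."""
    enhanced = dict(p_a)
    for key, pb in p_b.items():
        if enhanced.get(key, 0.0) < alpha * pb:
            enhanced[key] = alpha * pb
    total = sum(enhanced.values())
    return {k: v / total for k, v in enhanced.items()}

# Hypothetical distributions: A_max lacks B's structure "L8" entirely.
p_a = {"L6": 0.7, "D6": 0.3}
p_b = {"L6": 0.2, "L8": 0.8}
p_e = enhance_distribution(p_a, p_b, alpha=0.5)
print(p_e)  # "L8" now present with weight 0.4 before normalization
```

Note that structures absent from B keep their original weight, so frequent patterns in A_max are preserved while B's distinctive patterns are injected.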

3.4. Password Generation

After constructing the model for feature enhancement, we proceed to the password generation phase. Beginning with the initial symbols and utilizing the probabilistic dictionary of password structures, we prioritize the structures with the highest probabilities for expansion. Using the dictionary of password fragment probabilities, we populate the password structure by emphasizing the fragments with the highest likelihood. In this way, the initial symbol is gradually expanded into a complete string.
After a string is generated, the resulting password is evaluated against the predefined length requirement and complexity criteria; for instance, it may need to include a minimum number of distinct character types, such as uppercase letters, lowercase letters, digits, and special symbols. If the generated password fails to satisfy these criteria, the system backtracks to the previous generation step and creates a new segment by re-selecting alternative production rules according to their probabilities, until a password that fulfills all requirements is produced.
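The complexity check described above might be sketched as follows (the length and character-class thresholds are assumed for illustration, since the paper does not specify exact values):

```python
def meets_policy(password, min_length=8, min_classes=2):
    """Check length and the number of distinct character classes present.
    Thresholds are assumed defaults, not values from the paper."""
    classes = sum([
        any(c.islower() for c in password),
        any(c.isupper() for c in password),
        any(c.isdigit() for c in password),
        any(not c.isalnum() for c in password),
    ])
    return len(password) >= min_length and classes >= min_classes

print(meets_policy("password12345!!!"))  # True: 16 chars, 3 classes
print(meets_policy("abcdef"))            # False: too short, 1 class
```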

4. Experiments and Analysis of Results

In order to systematically evaluate the effectiveness of the method proposed in this paper, we have constructed a three-level validation system. This system assesses the method’s effectiveness from three dimensions: data constraints assessment, core assumptions, and effect analysis following feature enhancement.
  • Data Constraint Evaluation: Assess the data constraints for similarity metrics and quantitatively establish the minimum sample size threshold required for stable similarity metrics within a small-sample password set B;
  • Core Hypothesis Validation: Verify whether higher similarity between A_i and B corresponds to a higher guessing hit rate of the password set generated from training, thereby confirming the positive correlation between similarity and hit rate;
  • Analyzing the Impact of Feature Enhancement: This study examines the varying effects of feature enhancement on different sets of similar passwords.
Based on the research questions outlined above, we designed and conducted the following three experiments.

4.1. Data Sets

Since 2009, several large-scale data leakage incidents have come to light [6]. In 2011, a number of well-known domestic Internet companies, including Tianya, experienced user password leakage incidents that resulted in the exposure of personal information for approximately 40 million users [20]. In 2014, 12306, a prominent ticketing platform, was also affected, with leaked data involving users’ accounts, plaintext passwords, ID numbers, and private information such as email addresses [21]. In 2015, Twitch, a live gaming platform, suffered a massive data breach in which the password files of about 55 million users were compromised, potentially triggering the leakage of accounts related to Gmail and other services [22]. In 2016, security researchers uncovered a significant data breach involving approximately 9.86 GB of Twitter user datasets, which contained more than 200 million records linked to user account profiles, names, and email addresses [23]. In 2017, CashCrate experienced a data breach that compromised information including users’ email addresses, names, passwords, and physical addresses [24]. These ongoing data leakage events provide valuable data support for password guessing research and serve as a practical basis for optimizing guessing models.
Based on password databases derived from public leaks, the dataset utilized in this paper is divided into two parts: the small-sample password set B from the target domain and the publicly leaked password collection A = {A_1, A_2, …, A_n}. Here, A represents all publicly leaked password sets within the complete dataset, while A_i denotes the i-th independent publicly leaked password set. The set B is further divided into two subsets, B_CN and B_EN, represented by “12306” and “cashcrate,” respectively, reflecting the characteristics of the samples in two distinct environments. To investigate the effect of training set size on model performance, we sampled B_CN and B_EN hierarchically, gradually reducing the training set size from 500 to 10 entries; correspondingly, the test set size increases from 69,500 to 69,990 entries. The publicly leaked password collection comprises six datasets A_i, each with a sample size of 70,000, covering several common service platforms and tools such as Tianya, Gmail, Twitter, and 126, among others. This dataset provides rich diversity and reflects real-world password usage patterns for experimentation (see Table 1).

4.2. Evaluation Indicators

In this paper, the Hit Rate (HR) is utilized as the primary metric for assessing the performance of the model. The hit rate is defined as the proportion of generated password guessing dictionaries that include the test password. The formula for calculating the hit rate is as follows:
HitRate = N_hit / N_total
where:
  • N_hit is the number of test passwords contained in the generated password guessing dictionary;
  • N_total is the total number of test passwords.
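The metric can be computed directly (a minimal sketch; the sample passwords are illustrative):

```python
def hit_rate(guess_dict, test_passwords):
    """HitRate = N_hit / N_total over the test set."""
    guesses = set(guess_dict)
    n_hit = sum(1 for pwd in test_passwords if pwd in guesses)
    return n_hit / len(test_passwords)

print(hit_rate(["123456", "password", "qwerty"],
               ["password", "letmein", "123456", "dragon"]))  # 2/4 = 0.5
```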

4.3. Experiment 1: Evaluation of Data Constraints for Similarity Metrics

In the setting considered in this paper, the password set sample is small, meaning the number of passwords it contains is limited. However, when the sample of passwords to be guessed is too small, its similarity to other password sets becomes highly unstable. To ensure the robustness of the similarity metric, this experiment assesses the data constraints of the metric to identify the minimum effective sample size of B: a threshold on the minimum number of passwords needed to reliably reflect B's similarity with other password sets. Establishing this threshold provides a dependable benchmark for password set similarity metrics, which not only meets the preconditions of the subsequent experiments but also provides a foundation for future research.
To this end, we divided the experiments into Chinese and English environments, corresponding to the password sets B_CN and B_EN, respectively. For each of these two small-sample password sets, we conducted twelve sets of sampling experiments with training set sizes of 10,000, 7000, 5000, 3000, 1000, 500, 300, 200, 150, 50, 20, and 10, in that order. In each set of experiments, the similarity of B to each A_i is calculated separately. The specific steps are as follows: first, merge B and A_i pairwise to form a concatenated set and generate the corresponding vector files; then, perform cosine similarity computations on each pair of vectors to obtain the similarity between each of the twelve samples of B_CN and A_i in the Chinese environment, as well as between the twelve samples of B_EN and A_i in the English environment (see Figure 2 and Figure 3).
The experimental results show the similarity of the password sets B_CN and B_EN to each A_i within the Chinese and English environments, respectively, across varying sample sizes. The horizontal axis of the graphs denotes the categories of password sets, while the vertical axis indicates the similarity values. The twelve lines represent the different training set sizes.
Overall, the fluctuation in the similarity between B and A_i gradually increases as the size of the training set decreases, and it becomes particularly significant when the sample size falls below 50 entries. In the Chinese context, the similarity between the password set B_CN and the password sets “126”, “163”, and “tianya” in A is notably abnormal when the number of samples is 50. Similarly, in the English context, the password set B_EN exhibits anomalous fluctuations in its similarity to the password sets “000webhost”, “126”, “163”, and “tianya” at a sample size of 50. Given the premise of small sample sizes and the need for stable similarity, 150 is the most appropriate threshold for the small-sample training set in the next experiment: a sample size below this threshold may yield unstable results due to large similarity fluctuations, while a sample size above it would no longer satisfy the small-sample requirement.

4.4. Experiment 2: Verifying the Positive Correlation Between Similarity and Hit Rate

In order to verify the proposed positive correlation between similarity and hit rate, this experiment is conducted in both a Chinese environment (B_CN) and an English environment (B_EN). The goal is to identify the leaked password set A_max that is most similar to the small sample B (which includes both B_CN and B_EN) and to assess its effectiveness in enhancing password guessing accuracy within small samples (see Table 2).
The experiment builds upon the finding of Experiment 1 that the minimum stable number of entries for B (both B_CN and B_EN) is 150. Under these conditions, the training set B_CN_training consists of 150 entries, while the test set B_CN_test comprises 69,850 entries; similarly, B_EN_training contains 150 entries, and B_EN_test includes 69,850 entries.
First, we calculate the cosine similarity between B_CN_training and each of the password sets A_1 through A_6 to obtain the similarity values between B_CN_training and A_1 to A_6. The same method is applied to B_EN_training with A_1 to A_6 to compute the similarity results in the English context. By comparing these similarity values, we can rank the similarity between each A_i and B, allowing us to select the leaked password set A_max that is most similar to B.
Starting from the most similar leaked password set and proceeding in order of similarity, we train in the Chinese environment using the traditional probabilistic context-free grammar (PCFG) method: B_CN_training and all six sets A_i are trained with traditional PCFG, the corresponding dictionaries are generated in turn, and guesses are produced from these dictionaries according to breadth-first search (BFS) rules. To compare performance, we also introduce PassGPT and Markov models as control groups. We additionally impose a maximum depth limit, generating guess sets of sizes 10^4, 10^5, 10^6, and 10^7. The guesses generated by each group are then matched against B_CN_test to evaluate the hit rate. The same procedure is applied in the English environment; the results are shown in Figure 4 and Figure 5.
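The guess-generation step can be sketched with a toy PCFG as below. This is not the paper's implementation: it materializes every structure/fragment combination and sorts by probability with a heap, whereas production PCFG crackers (e.g. Weir et al.'s "next" function) enumerate guesses lazily. All names and the example grammar are illustrative:

```python
import heapq
import itertools
import re

def generate_guesses(structures, fragments, limit=5):
    """Toy PCFG guess generator: `structures` maps a base structure
    (e.g. 'L3D3') to its probability; `fragments` maps each segment
    label to (string, probability) pairs. Returns the `limit` most
    probable guesses in descending probability order."""
    heap, tie = [], itertools.count()
    for struct, sp in structures.items():
        labels = re.findall(r"[LDS]\d+", struct)
        for combo in itertools.product(*(fragments[l] for l in labels)):
            p = sp
            for _, fp in combo:
                p *= fp
            heapq.heappush(heap, (-p, next(tie), "".join(s for s, _ in combo)))
    pops = (heapq.heappop(heap) for _ in range(min(limit, len(heap))))
    return [(g, -negp) for negp, _, g in pops]

# Hypothetical probabilities learned from a training set:
structures = {"L3D3": 0.6, "D6": 0.4}
fragments = {"L3": [("abc", 0.7), ("pwd", 0.3)],
             "D3": [("123", 1.0)],
             "D6": [("123456", 1.0)]}
guesses = generate_guesses(structures, fragments)
# 'abc123' (0.6 * 0.7 = 0.42) outranks '123456' (0.40) and 'pwd123' (0.18)
```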
These two charts show the hit rates on B_CN_test in the Chinese environment and on B_EN_test in the English environment, respectively. The horizontal axis gives the number of guesses, in orders of 10^4, 10^5, 10^6, and 10^7; the vertical axis gives the hit rate as a percentage. The six lines represent the hit rates of the six password sets under varying numbers of guesses.
Close analysis of the experimental results reveals a clear pattern: in both the Chinese and English environments, the more similar a password set is to B, the higher its hit rate on B_CN_test or B_EN_test. For instance, in the Chinese-environment chart, the lines for the 126 and 163 password sets generally show higher hit rates than the other datasets at the same number of guesses; in the English-environment chart, the line for the Gmail dataset likewise stands out. This strongly supports our initial conjecture that the more similar A_i is to B, the higher the hit rate A_i achieves on the B test set.
Further analysis shows that the most similar password set excels in hit rate and can raise it substantially, typically by 5% to 32%. For instance, for the Gmail dataset in the English environment, the hit rate rises from approximately 15% to about 35% as the number of guesses increases from 10^4 to 10^7. This provides strong evidence for the validity of our method, demonstrating that the most similar password set, A_max, effectively improves the guessing hit rate for small samples.

4.5. Experiment 3: Feature Enhancement and Effect Analysis

To further improve the hit rate, we perform feature enhancement on A_i. We expect this combination of feature enhancement and probability adjustment to let A_i better capture the unique characteristics of B, so that password generation more accurately follows the patterns of B, thereby improving the hit rate and achieving optimal guessing results.
Following the conclusion of Experiment 1, we use the 150-sample dataset B from Experiment 1 and enhance the features of the training results in the Chinese environment (B_CN) and the English environment (B_EN). The strength with which the features of B enhance A_i is controlled by a custom parameter α (no less than 1), which modifies A_i's list of password structures and the probability dictionary of its password fragments accordingly. To evaluate the effect of enhancement strength systematically, we vary α from 1 to 10 in steps of 0.2 and quantitatively analyze its impact on model performance. Taking the 126 and Twitter password sets as examples in the Chinese environment, and the Gmail and 000webhost password sets in the English environment, the experimental results are shown in Figure 6.
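One plausible form of this enhancement step is sketched below. The α-weighted mixture in `enhance` is our assumption (the paper's exact update rule is given in its Section 3.3), while the α schedule mirrors the 1-to-10, step-0.2 sweep described above:

```python
def enhance(probs_A, probs_B, alpha=1.0):
    """Blend the structure (or fragment) probabilities of training set
    A_i with those of small sample B. Assumed form: alpha-weighted
    mixture followed by renormalisation; the paper's exact rule is
    specified in its Section 3.3."""
    keys = set(probs_A) | set(probs_B)
    mixed = {k: probs_A.get(k, 0.0) + alpha * probs_B.get(k, 0.0)
             for k in keys}
    total = sum(mixed.values())
    return {k: v / total for k, v in mixed.items()}

# Sweep alpha over 1.0, 1.2, ..., 10.0 as in the experiment.
alphas = [round(1.0 + 0.2 * i, 1) for i in range(46)]
probs_A = {"L6D3": 0.7, "D8": 0.3}   # hypothetical A_i structure list
probs_B = {"D8": 0.6, "L3D3": 0.4}   # hypothetical B structure list
sweep = {a: enhance(probs_A, probs_B, a) for a in alphas}
```

Larger α pushes the blended distribution toward B; the experiment finds that α = 1 already gives the best hit rate.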
The experimental results show that the hit rate decreases monotonically as α increases. Based on this empirical finding, we fix α at 1 in the subsequent feature-enhancement process to balance the enhancement effect against system performance.
Using the feature-enhanced probability dictionary, we construct a more targeted guessing dictionary and verify its effectiveness in improving the hit rate on the small-sample test set B. Comparing the original hit rate with the feature-enhanced hit rate yields the results presented in Figure 7 and Figure 8.
Figure 7 shows the hit rate of A_i on B_CN_test before and after feature enhancement in the Chinese environment, and Figure 8 shows the corresponding change on B_EN_test in the English environment. The horizontal axis gives the number of guesses (10^4, 10^5, 10^6, and 10^7); the vertical axis gives the hit rate as a percentage, where a negative value indicates a decrease and a positive value an increase. The twelve lines represent the six password sets, differentiated by color and label, under varying numbers of guesses: dashed lines show the original hit rates, and solid lines show the enhanced hit rates.
The experimental results show that, with the custom parameter α set to 1, the feature-enhancement algorithm significantly improves the hit rate in low-similarity scenarios, as identified by the similarity calculations in Experiment 2. Specifically, when the similarity between B and A_max falls below a threshold of approximately 0.6, feature enhancement can raise the hit rate by up to 10.52% over the unenhanced A_max. Based on these findings, we propose a dynamic training-strategy selection mechanism: for high-similarity scenarios (similarity greater than 0.6), the original A_max is used directly as the training set; for low-similarity scenarios (similarity less than 0.6), A_max enhanced with the features of B is preferred as the training set, following the formula detailed in Section 3.3. This strategy preserves optimal performance in high-similarity scenarios while effectively improving model performance in low-similarity ones.
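The selection mechanism can be sketched as follows. The function name and the inline α-weighted mixture are illustrative assumptions (the paper's enhancement formula is in its Section 3.3), but the 0.6 threshold and the branch logic follow the strategy described above:

```python
def choose_training_probs(probs_Amax, probs_B, similarity,
                          threshold=0.6, alpha=1.0):
    """Dynamic strategy: use A_max directly when it is similar enough
    to B; otherwise enhance it with B's features first (assumed
    alpha-weighted mixture, renormalised)."""
    if similarity > threshold:
        return dict(probs_Amax)  # high similarity: train on A_max as-is
    keys = set(probs_Amax) | set(probs_B)
    mixed = {k: probs_Amax.get(k, 0.0) + alpha * probs_B.get(k, 0.0)
             for k in keys}
    total = sum(mixed.values())
    return {k: v / total for k, v in mixed.items()}

# Per Table 2: Gmail vs B_EN (similarity 0.89) is used directly,
# while 000webhost vs B_EN (similarity 0.25) is enhanced first.
probs_A = {"L6D3": 0.7, "D8": 0.3}   # hypothetical structure lists
probs_B = {"D8": 0.5, "L3D3": 0.5}
high = choose_training_probs(probs_A, probs_B, similarity=0.89)
low = choose_training_probs(probs_A, probs_B, similarity=0.25)
```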

5. Conclusions

The performance of traditional probabilistic context-free grammar (PCFG) methods in small-sample scenarios is constrained by the size of the password set. To address this challenge, this paper proposes an approach that combines cross-domain similarity measures with a feature enhancement mechanism. This approach aims to improve the effectiveness of password guessing in data-scarce scenarios.
Specifically, the method trains on a password set with the probabilistic context-free grammar (PCFG) model to produce a probability dictionary, which is then converted into a structure-vector representation; this enables similarity between password sets to be measured at the structural level. In Experiment 1, we systematically evaluate the stability of the structure vectors across various sample sizes and find that the similarity computation demonstrates strong consistency and discriminative ability once the sample size reaches 150 or more. We then select multiple publicly leaked password sets from both Chinese and English environments, compare their similarity to the target small-sample password set, and rank them as candidate training sets. The results of Experiment 2 show that a password set with high similarity to the small-sample target can, after direct training, improve the hit rate by up to 32%. Furthermore, the feature enhancement strategy introduced in Experiment 3 increases the hit rate by as much as 10.52% across multiple experimental sets, validating the effectiveness of the enhancement strategy for small-sample password modeling.
In summary, the method proposed in this study effectively models and accurately predicts small-sample password sets by utilizing structural similarity and feature enhancement, all without requiring prior knowledge of the target domain. This approach demonstrates strong generalizability and practicality. Future work will focus on further developing the dynamic adjustment mechanism of the enhancement algorithm to better adapt to a wider range of diverse and complex practical application scenarios.
Finally, it is worth noting that these findings can help system security administrators configure more effective password policies to defend against increasingly sophisticated trawling guessing attacks. Penetration testers can likewise use these results to obtain higher-hit-rate password dictionaries, enabling more accurate and reliable security assessments of information systems.

Author Contributions

Conceptualization, C.L. and M.H.; Methodology, C.L. and B.L.; Software, J.L. and X.L.; Formal analysis, W.Y.; Investigation, W.Y.; Resources, C.L.; Writing—original draft, J.L. and X.L.; Writing—review and editing, M.H., Y.L. and W.L.; Supervision, C.L.; Project administration, B.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Sichuan Science and Technology Program (grant number 2025ZNSFSC1479) and by the Science and Technology Program of Sichuan Province, China (grant number 2023YFG0264).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. Schematic representation of a small-sample password guessing technique utilizing cross-domain feature enhancement.
Figure 2. Similarity between the password set B_CN and A_i.
Figure 3. Similarity between the password set B_EN and A_i.
Figure 4. Hit rate on B_CN_test in the Chinese environment.
Figure 5. Hit rate on B_EN_test in the English environment.
Figure 6. Comparison of hit rates when adjusting parameter α.
Figure 7. Hit rate on B_CN_test in the Chinese environment before and after enhancement.
Figure 8. Hit rate on B_EN_test in the English environment before and after enhancement.
Table 1. Data Set.

Name      Sample Content    Training Set    Test Set

Small-Sample Password Set B
B_CN      12306             500             69,500
B_CN      12306             300             69,700
B_CN      12306             200             69,800
B_CN      12306             150             69,850
B_CN      12306             50              69,950
B_CN      12306             20              69,980
B_CN      12306             10              69,990
B_EN      cashcraete        500             69,500
B_EN      cashcraete        300             69,700
B_EN      cashcraete        200             69,800
B_EN      cashcraete        150             69,850
B_EN      cashcraete        50              69,950
B_EN      cashcraete        20              69,980
B_EN      cashcraete        10              69,990

Large-Scale Password Set Cluster A
A_1       126               70,000          \
A_2       163               70,000          \
A_3       youku             70,000          \
A_4       Gmail             70,000          \
A_5       Twitter           70,000          \
A_6       000webhost        70,000          \
Table 2. The similarity between B_CN, B_EN, and A_i.

Similarity    B_CN    B_EN
126           0.74    0.29
163           0.72    0.22
youku         0.63    0.24
Gmail         0.52    0.89
Twitter       0.36    0.77
000webhost    0.35    0.25

Share and Cite

MDPI and ACS Style

Liu, C.; Li, J.; Liu, X.; Li, B.; Hou, M.; Yu, W.; Li, Y.; Liu, W. Cross-Domain Feature Enhancement-Based Password Guessing Method for Small Samples. Entropy 2025, 27, 752. https://doi.org/10.3390/e27070752
