# Study on Massive-Scale Slow-Hash Recovery Using Unified Probabilistic Context-Free Grammar and Symmetrical Collaborative Prioritization with Parallel Machines


## Abstract


## 1. Introduction

- Recovering slow hashes by reusing passwords from a large source corpus. To our knowledge, our corpus is the largest ever studied in the literature. Based on it, we find that cross-site passwords account for a large proportion of the accounts on the target sites, which can be exploited for hash recovery.
- Identifying a less-studied issue that degrades the efficiency of massive-scale slow-hash recovery: weak accounts are blocked by stronger accounts during expansion and guessing. We solve this by proposing concurrent global prioritization, and we overcome two key shortcomings introduced by the huge global heap that the method requires.
- Helping organizations better protect their data. Our algorithm models the behavior of real-world attackers, who try their best to maximize the cracking profit before it is finally exceeded by the cost. Based on it, organizations can better balance hashing costs against server load, and can proactively detect weak credentials before financial and reputational damage occurs.

## 2. Related Work

#### 2.1. Offline Password Guessing

**Brute-force and mask attacks.** The brute-force method simply enumerates all possible strings by trying all combinations of characters from given charsets. This is inefficient and is only used in practice when guessing short or randomly generated passwords. A mask attack reduces the candidate keyspace by configuring the attack to explore a more specific and popular structure, such as trying upper-case letters only in the first position. Mask attacks can be effective for short passwords, as they require less exhaustion.

**Dictionary and mangled wordlist attacks.** A dictionary attack, also known as a wordlist attack, simply tries all words in a list. Many modern cracking tools such as Hashcat [4] and John the Ripper [5] base their core attack modes on an improved version called the mangled wordlist attack or rule-based attack, in which the attacker applies transformation rules to the words from wordlists. There are a few variations, such as concatenating words from multiple wordlists (combinator attack) and combining wordlists with masks (hybrid attack). The number of available candidates increases while the explored keyspace remains restrained, which makes this category of attack very successful in practice.
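As a toy illustration of rule-based mangling, consider the sketch below; the rule names and the set of transformations are hypothetical choices, not Hashcat's actual rule syntax:

```python
# Toy rule engine in the spirit of mangled-wordlist attacks; the rule
# names and transformations are illustrative, not Hashcat rule syntax.
RULES = {
    "append_1":   lambda w: w + "1",
    "capitalize": lambda w: w.capitalize(),
    "leet_a":     lambda w: w.replace("a", "@"),
}

def mangle(wordlist, rules):
    # Yield each base word, then every distinct single-rule variant.
    for word in wordlist:
        yield word
        for rule in rules.values():
            guess = rule(word)
            if guess != word:
                yield guess

guesses = list(mangle(["password", "dragon"], RULES))
# contains e.g. 'password1', 'Password', 'p@ssword', 'dragon1', ...
```

Real tools chain many rules per word, which is exactly what keeps the candidate count high while the keyspace stays focused.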

**Markov models.** The Markov model was first introduced to password cracking by Narayanan et al. [11] as a template-based model, in which the Markov chain was only used to assign probabilities to letter-based segments. Later, Castelluccia et al. [12] proposed whole-string Markov models (i.e., n-gram models) for evaluating password strength, without dividing a password into segments. Ma et al. [13] studied many variations of Markov models under different configurations and found that Markov models tend to be more efficient than other existing methods on certain datasets.
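A minimal whole-string (bigram) Markov scorer in the spirit of [12,13] can be sketched as follows; the tiny training list and start/end markers are illustrative assumptions, and real models use large corpora, higher orders and smoothing:

```python
from collections import defaultdict

# Toy whole-string (bigram) Markov model over a tiny illustrative
# training list; real models use large corpora, higher order, smoothing.
train = ["password", "passw0rd", "dragon", "pass123"]

counts = defaultdict(lambda: defaultdict(int))
for pw in train:
    padded = "^" + pw + "$"                    # start/end markers
    for a, b in zip(padded, padded[1:]):
        counts[a][b] += 1

def prob(pw):
    # Product of conditional bigram probabilities; 0.0 for unseen pairs.
    p = 1.0
    padded = "^" + pw + "$"
    for a, b in zip(padded, padded[1:]):
        total = sum(counts[a].values())
        if total == 0 or counts[a][b] == 0:
            return 0.0
        p *= counts[a][b] / total
    return p
```

Strings resembling the training data score higher than strings whose transitions were never observed, which is what makes such models usable for both guessing and strength estimation.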

**Probabilistic context-free grammar.** To generate guesses in decreasing probability order, Weir et al. [10] proposed using a probabilistic context-free grammar (PCFG). They divided passwords into different templates according to character category. For example, ‘password123’ belongs to the 8-letter-and-3-digit template, symbolized as L8D3. The probabilities of the templates are trained from a large corpus. Additionally, Komanduri [14] proposed substantial improvements to Weir’s PCFG, such as intelligent skipping and pattern compaction, to make guessing more effective.
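The template-based generation can be sketched as below; every template and segment probability is a made-up illustrative value, chosen so that the top guess reproduces the 123456 probability of 0.016 that the paper itself uses as an example:

```python
from itertools import product

# Toy trawling PCFG; all probabilities are assumed illustrative values.
templates = {("L8", "D3"): 0.05, ("D6",): 0.04}   # base structures
segments = {
    "L8": {"password": 0.02, "iloveyou": 0.01},
    "D3": {"123": 0.30, "321": 0.05},
    "D6": {"123456": 0.40},
}

def guesses():
    # Enumerate every full derivation together with its probability:
    # Pr(template) * product of Pr(segment fill).
    for tmpl, p_tmpl in templates.items():
        for combo in product(*(segments[sym].items() for sym in tmpl)):
            word = "".join(seg for seg, _ in combo)
            p = p_tmpl
            for _, p_seg in combo:
                p *= p_seg
            yield word, p

# Emit guesses in decreasing probability, as a guess generator would.
ranked = sorted(guesses(), key=lambda t: -t[1])
# top guess: '123456' with probability 0.05*... vs 0.04*0.40 ≈ 0.016
```

A production implementation enumerates derivations lazily with a priority queue rather than materializing and sorting them all.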

**Neural Networks.** Melicher et al. [15] proposed using artificial neural networks to model text passwords’ resistance to guessing attacks. They showed that neural networks can often guess passwords more effectively than state-of-the-art approaches, particularly beyond 10^10 guesses and under non-traditional password policies.

#### 2.2. Password Reuse

**Methods leveraging password history of the same website.** Zhang et al. [18] provided an algorithmic framework for predicting future passwords from ones expired under a password-expiration policy. Their algorithm modeled users’ password-modification behavior when they were forced to change passwords, and their experimental results verified the conjecture that users tend to derive future passwords from old ones.

**Methods leveraging leaked passwords from other websites.** Das et al. [6] identified a few popular rules that users often apply to transform a basic password between sites, by analyzing several hundred thousand leaked passwords from eleven websites. Using a fixed order of these rules, they were the first to propose a cross-site guessing algorithm, which is able to guess 30% of the transformed passwords within 100 attempts. Wang et al. [19] proposed using a Bayesian model to generate a customized order of the rules, improving on the performance of [6]. Han et al. [20] examined state-of-the-art Intra-Site Password Reuse (ISPR) and Cross-Site Password Reuse (CSPR) based on the leaked passwords of 668 million members in China. By utilizing the patterns used by the same user, they achieved a major improvement in guessing success rate compared to John the Ripper.

**Methods leveraging leaked passwords and personal information.** Li et al. [21] extracted some of the most popular password structures that can be expressed with personal information (e.g., name, birthdate, phone number, national ID, email address and user name). Based on the findings, they proposed a semantics-rich algorithm, Personal-PCFG, which cracks passwords by generating personalized guesses. Wang et al. [7] proposed TarGuess, a framework that characterizes seven typical targeted guessing scenarios with sound mathematical models. They used customized PCFG models to address cross-site online guessing, given one of the victim’s sister passwords and some PII.

#### 2.3. Bcrypt Recovery

**The bcrypt password hash.** Used as the default password hash algorithm in many of today’s services, bcrypt [23] was designed to resist brute-force attacks and to remain secure despite hardware improvements. Its expensive key setup with a user-defined cost setting makes the hash algorithm very slow. Rapid random 32-bit lookups into Blowfish’s variable S-boxes typically require 4 KB of local memory per instance, making bcrypt unfriendly to CPU- or GPU-based parallelization.
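Since bcrypt is not in the Python standard library, the sketch below uses PBKDF2-HMAC as a stand-in to illustrate only the cost-parameter idea (an exponential work factor governing key setup); real bcrypt instead derives its slowness from the Blowfish key schedule, and this is not the authors’ implementation:

```python
import hashlib, os

# PBKDF2-HMAC stand-in for bcrypt: the work factor `cost` means
# 2**cost iterations, mimicking bcrypt's user-defined cost setting.
def slow_hash(password: bytes, salt: bytes, cost: int) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", password, salt, 2 ** cost)

salt = os.urandom(16)
stored = slow_hash(b"princess1", salt, cost=12)   # 2**12 = 4096 iterations

def try_candidate(candidate: bytes) -> bool:
    # One trial = one full slow hash; this dominates recovery time.
    return slow_hash(candidate, salt, 12) == stored
```

Raising `cost` by one doubles the work per trial, which is exactly why candidate ordering matters so much more for bcrypt than for fast hashes.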

**Famous bcrypt dumps and their difficulty.** We mainly study three famous public bcrypt dumps: Dropbox, Ashley Madison (AM for short) and Edmodo (whose passwords were hashed with MD5+bcrypt and then obfuscated). They have been extensively studied by researchers, and perhaps by attackers as well. The difficulty of recovering these data is well illustrated in the report presented by Steube [27], which summarized the performance of hashcat (v1.32, with a single GPU and the default settings for configurable algorithms) on various hash algorithms. The hashing speed is 1–10 BH/s for MD5 and SHA-1, while it sharply decreases to 10–100 KH/s for bcrypt, a slowdown of five orders of magnitude. In addition, the hashing speed does not necessarily correspond to the number of cracked accounts per unit time, which can be much lower. Indeed, it is reported that only a few passwords can be recovered from these bcrypt dumps unless implementation bugs of the algorithm itself, if any, are exploited [1,2,3].

**Special-purpose hardware based cracking.** Despite bcrypt’s resilience against general-purpose hardware improvements, much better performance is possible with bcrypt implementations on homogeneous and heterogeneous multiprocessing platforms. Malvoni et al. [28] proposed one such implementation, integrated into the John the Ripper password cracker, which improved energy efficiency by a factor of 35+ compared with heavily optimized implementations on modern CPUs. Our work, however, concentrates on general-purpose hardware, as it is cheap and readily available to a real-world attacker. Special-purpose hardware implementations are orthogonal and complementary to our approach.

#### 2.4. Recovery Metric

## 3. Our Solution

**Data and recovery model.** Here we briefly describe the VeriClouds corpus, which is closely related to our study. As of 1 March 2018, the VeriClouds corpus contained 9,118,017,411 credentials, collected from more than 90% [9] of the leaked databases on the deep/dark web and from paste sites such as PasteBin. The dump includes most of the recent famous breaches and is therefore of high quality; for example, the fraction of Dropbox accounts overlapping with the VeriClouds corpus is up to 59%. The dump contains a large number of email addresses with upper-case letters, likely caused by implementation errors, since no widely used mail system treats email addresses case-sensitively. These email addresses are lowercased to increase the chances of matching cross-site accounts. Table 1 gives some typical sources of the dump. As it is merely a combo list, it is hard to count exactly the credentials overlapping with the listed sources. This dataset is larger in magnitude than the ones studied in previous works [6,7,8]. VeriClouds has made the data private and revoked all personal and academic access since 2018, owing to the policy change after the Facebook–Cambridge Analytica data scandal. The fourth author of this paper is a co-founder of VeriClouds and exclusively maintains the dataset to guarantee data safety. Note that we do not directly exploit the data in this paper, and the other authors have no access to it, so as to avoid ethical issues.

**System goal.** Given fixed computational resources, we aim to maximize the speed of cracking. This implies maximizing the number of recovered accounts in a fixed time or, equivalently, minimizing the time needed to recover a fixed fraction of accounts. In particular, our solution should be:

- Effective: Recover a substantial proportion of the target dataset.
- Efficient: Attempt the more promising (account, password) pairs earlier.
- Parallelizable: Perform as many trials as possible per unit time.
- Scalable: Handle a large dataset with restricted memory resource.

**PCFG Expansion:** To meet the 1st criterion, we explore differentiated PCFG models for different accounts. Each heap element, once popped from the heap, is expanded (by the Selecter) according to the PCFG rules of the corresponding account. The derived results are pushed back into the heap to maintain the global priority order.

**Global Ranking:** Candidates of all accounts are dynamically inserted into or removed from a max-min heap [32], which ensures that a candidate with higher probability is always popped before any candidate with lower probability. This addresses our 2nd criterion. To settle the problem described in the 4th criterion, a skiplist [33] is used (by the Heapifier) to migrate some of the candidates when the heap grows extremely large.

**Bcrypt Trial:** During the recovery, a candidate is hashed with bcrypt and compared with the target hash. If both hashes are equal, the account is claimed cracked. In our design, the bcrypt-trial procedures (Bcrypters) are separated from PCFG expansion (the Selecter), so that each Bcrypter can be assigned to a single core. The system works asynchronously so that maximal parallelization is achieved, satisfying the 3rd criterion.
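The interplay of the components can be illustrated with a minimal global priority queue; this sketch uses Python’s `heapq` with negated probabilities as a stand-in for the max-min heap [32] (plain `heapq` cannot also evict the minimum, which the paper’s Heapifier needs):

```python
import heapq

# Candidates of ALL accounts share one priority queue. heapq is a
# min-heap, so probabilities are negated to pop the globally most
# promising (account, guess) pair first.
heap = []

def push(prob, account, guess):
    heapq.heappush(heap, (-prob, account, guess))

push(0.016, "usr0", "123456")
push(0.008, "usr0", "4abc$$4")
push(0.018, "usr1", "princess1")

order = [heapq.heappop(heap)[2] for _ in range(len(heap))]
# -> ['princess1', '123456', '4abc$$4']: the globally best candidate is
# tried first, so one account's strong guesses never block another
# account's weaker but still promising ones.
```

The probability values are taken from the derivation examples later in the paper; account names are illustrative.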

#### 3.1. PCFG Expansion

The trawling grammar G_t is a tuple <Σ, V, E, λ, s>, where:

- Σ is the set of 95 printable ASCII codes.
- V = {s; Li, Di, Si} (i ∈ {1, 2, …, 16}), where Li, Di, Si each stands for an alphabetic/digital/symbolic string of length i (called a segment).
- s ∈ V is the start symbol.
- E is a set of rules of the form v → v′, where v ∈ V and v′ ∈ (V ∪ Σ)*.
- λ: E → [0, 1] assigns each rule a probability and fulfills the constraint that, for any non-terminal (e.g., L8) of G_t, all the probabilities associated with its rules (e.g., L8 → password, L8 → aaaaaaaa, …) must sum to 1.

The reuse grammar G_r is a tuple <Σ_r, V_r, E_r, λ_r, s_r>. E_r, λ_r, s_r are defined in the same way as in G_t, while:

- Σ_r = Σ ∪ {src, C1, …, C4; R1, R2; L1, …, L5; Yes, No; No′}
- V_r = V ∪ {C, LT, R, SM; ti, td, hi, hd; ti′, td′, hi′, hd′}

We define a unification function f(G_r, G_t′) → G_r′ = <Σ_r′, V_r′, E_r′, λ_r′, s_r′> and a factor α to integrate another trawling model G_t′ = <Σ′, V′, E′, λ′, s′> (defined similarly to G_t) into G_r for cross-site accounts, where

- V_r′ = V_r ∪ {s_r′},
- E_r′ = E_r ∪ {s_r′ → s′, s_r′ → s_r},
- λ_r′(s_r′ → s′) = 1 − α, λ_r′(s_r′ → s_r) = α, λ_r′(e) = λ_r(e) (e ∈ E_r).

**Theorem 1.** *For any left-hand-side variable in the rules of G_r′, the probabilities of all of its productions sum to 1; i.e., G_r′ is a well-defined PCFG.*

**Proof.** By definition, the sum of the probabilities of all the rules sharing a common left-hand-side variable within either original grammar (G_t′ or G_r) equals 1. In addition, for all rules beginning with s_r′, we have Σ_{e starting with s_r′} Pr(e) = Pr(s_r′ → s′) + Pr(s_r′ → s_r) = (1 − α) + α = 1. Thus, for any specific left-hand-side variable in the rules of G_r′, the sum of the probabilities of all of its productions is 1. □

**Training.** To train G_r, we first collect password pairs (src_pw, train_pw) by matching the training site with the source site w.r.t. email addresses. After filtering out the 10,000 most common passwords of the training site, for each remaining pair, both the source and target passwords are split into segments. Then we go through two phases of training, following the procedure described in [7]. We sequentially apply cross-site transformation rules (capitalization (C1, …, C4), leet (L1, …, L5), reversal (R1, R2), sub-word movement (SM), structural manipulation (ti, td, hi, hd), segmental manipulation (ti′, td′, hi′, hd′)) to src_pw, and count an occurrence of a rule if it brings the resulting password src_pw′ closer to train_pw in terms of Levenshtein distance [34]. We here describe each rule in detail. C1 capitalizes all letters; C2 capitalizes the 1st letter; C3 lowercases all letters; C4 lowercases the 1st letter; L1, L2, L3, L4, L5 each performs one of the substitutions ‘a’ <-> ‘@’, ‘s’ <-> ‘$’, ‘o’ <-> ‘0’, ‘i’ <-> ‘1’, ‘e’ <-> ‘3’; R1 reverses all characters; R2 reverses each segment; SM moves sub-words within a password; ti inserts a segment at the tail; td deletes the last segment; hi inserts a segment at the head; hd deletes the first segment; ti′ inserts a character at the tail of a segment; td′ deletes the last character of a segment; hi′ inserts a character at the head of a segment; hd′ deletes the first character of a segment. Since most passwords are no longer than 16 characters, only passwords within 16 characters are considered. The counted occurrences are then normalized into the rule probabilities λ_r.
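The rule-counting step can be sketched as follows; the three rules shown and the sample password pairs are illustrative, and `lev` is a plain dynamic-programming edit distance [34]:

```python
# A rule is credited when it moves the source password closer (in
# Levenshtein distance) to the training password.
def lev(a, b):
    # Classic single-row dynamic-programming edit distance.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]

RULES = {                       # small illustrative subset of the rules
    "C1_upper_all": str.upper,
    "L1_a_to_@":    lambda s: s.replace("a", "@"),
    "ti_append_1":  lambda s: s + "1",
}

def count_rules(pairs):
    counts = dict.fromkeys(RULES, 0)
    for src, train in pairs:
        for name, rule in RULES.items():
            if lev(rule(src), train) < lev(src, train):
                counts[name] += 1
    return counts

counts = count_rules([("princess", "princess1"), ("passward", "p@ssward1")])
# appending '1' is credited for both pairs
```

Normalizing such counts per rule class yields the probabilities λ_r used during expansion.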

**Expansion.** Given G_r′, we can repeatedly apply transformation rules to derive lists of guesses. Candidates are generated in descending order of probability according to their corresponding grammar tree. For example, the guessing list of G_t can be 123456 (Pr(s → D6)·Pr(D6 → 123456) = 0.016), 4abc$$4 (Pr(s → D1L3S2D1)·Pr(D1 → 4)·Pr(L3 → abc)·Pr(S2 → $$)·Pr(D1 → 4) = 0.008), etc. The right-hand side of a rule in G_t is called a terminal if it contains only ASCII characters; otherwise, it is called a non-terminal. The parameterized model G_r′, when given the password princess, generates princess1 (Pr(s_r′ → s_r)·Pr(s_r → princess1)·Pr(s_r → L8)·Pr(C → No)·Pr(LT → No)·Pr(R → No)·Pr(SM → No)·Pr(L8 → ti)·Pr(ti → D1)·Pr(D1 → 1)·Pr(L8 → No′)·Pr(D1 → No′) = 0.018), Princess1 (Pr(s_r′ → s_r)·Pr(s_r → princess1)·Pr(C → C3)·Pr(LT → No)·Pr(R → No)·Pr(SM → No)·Pr(L8D1 → No)·Pr(L8 → No′)·Pr(D1 → No′) = 0.012), etc.

#### 3.2. Global Ranking

Algorithm 1. GlobalSort.

```
Input:  PQ: max-min heap, EQ: expansion queue, TQ: trial queue, IQ: insertion queue
Output: 4 updated queues
while True do
    if EQ != ∅ then
        cand <- EQ.Get()
    else
        with PQ.lock do
            cand <- PQ.PopMax()
    if isPseudoTerminal(cand) or isTerminal(cand) then
        TQ.Put(cand)
    else
        cands <- IncTrans(cand)
        foreach e in cands do
            IQ.Put(e)
```

In G_r, only after sequentially extending a given candidate using all six types of cross-site transformation rules in a round can we use the resulting new candidates (called pseudo-terminals, as they are still non-terminals) for guessing. After the candidates are tried by hashing them with the bcrypt algorithm, they are enqueued again for the next round’s expansion. Expanding all six classes of rules at once would cause thread blocking due to the exponential blow-up in the number of resulting candidates.

We therefore expand G_r incrementally. Once a non-terminal is dequeued for expansion, it takes only one of the six classes of transformation rules, i.e., the successor of the last rule class used along the sequence: C, LT, R, SM, structural rules, segmental rules, C, LT, R, SM, structural rules, segmental rules. The resulting non-terminals are then enqueued in one batch. The space of alphabetic segments can be huge, which, however, can be resolved by quantization [14]. We can prove that the order of dequeuing the fully transformed passwords remains correct.
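The round-robin choice of the next rule class can be sketched as:

```python
# Round-robin rule-class scheduling: a dequeued non-terminal is expanded
# with only the successor of the rule class applied to it last, instead
# of all six classes at once.
RULE_CLASSES = ["C", "LT", "R", "SM", "structural", "segmental"]

def next_class(last):
    # None means the candidate has not been transformed yet this round.
    if last is None:
        return RULE_CLASSES[0]
    return RULE_CLASSES[(RULE_CLASSES.index(last) + 1) % len(RULE_CLASSES)]
```

A candidate thus becomes a pseudo-terminal only after a full cycle through all six classes, keeping each expansion step’s output small.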

**Theorem 2.** *Under incremental transformation, fully transformed passwords are still dequeued in descending order of probability.*

**Proof.**

The trawling models (G_t for cross-site users and G_t′ for the others) are used to generate common candidates. Each common candidate is stored only once for all accounts in the queue, which saves a large amount of space. In contrast, given the distinct source passwords of different accounts, G_r has to generate different candidates accordingly.

Unlike with G_t, there can be different instances of expansion trees for different cross-site accounts using G_r. To alleviate the situation, when the number of pseudo-terminals in the priority queue exceeds a certain threshold l, we remove the elements at the tail. These elements have lower probabilities and will not be tried earlier than the preceding l pseudo-terminals. In other words, if we can only run the recovery procedure for d = l/r days (r being the average number of accounts recovered per day), they are unlikely to be used. Therefore, removing these unpromising pairs makes room for other candidates with higher probabilities.

Algorithm 2. Materialization.

```
Input:  PQ: max-min heap, SL: skiplist, IQ: insertion queue
Output: 3 updated queues
while True do
    if IQ != ∅ then
        cand <- IQ.Get()                     /* candidate to insert */
        if cand.prob >= SL.ReadMax() then
            with PQ.lock do
                if Len(PQ) >= MAX_LEN then
                    if cand.prob <= PQ.ReadMin() then
                        SL.Push(cand)
                    else
                        pq_min <- PQ.PopMin()
                        SL.Push(pq_min)
                        PQ.Push(cand)
        else
            SL.Push(cand)
    with PQ.lock do
        if Len(PQ) < MAX_LEN then
            sl_max <- SL.PopMax()
            PQ.Push(sl_max)
```
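A minimal sketch of the materialization idea, with a sorted Python list standing in for the skiplist [33] and a bounded min-heap for PQ; Algorithm 2 additionally migrates the skiplist maximum back into PQ whenever PQ falls below MAX_LEN, which this sketch omits:

```python
import heapq, bisect

# Bounded in-memory heap (pq) plus a sorted overflow list standing in
# for the skiplist (sl); both hold (prob, guess) pairs.
MAX_LEN = 3
pq = []      # min-heap: pq[0] is the least probable of the kept top-3
sl = []      # ascending by prob; stands in for the skiplist

def insert(prob, guess):
    if len(pq) < MAX_LEN:
        heapq.heappush(pq, (prob, guess))
    elif prob <= pq[0][0]:
        # Below the heap minimum: goes straight to the overflow structure.
        bisect.insort(sl, (prob, guess))
    else:
        # Evict the heap minimum to make room for the better candidate.
        evicted = heapq.heapreplace(pq, (prob, guess))
        bisect.insort(sl, evicted)

for p, g in [(0.4, "a"), (0.1, "b"), (0.3, "c"), (0.2, "d"), (0.5, "e")]:
    insert(p, g)
# pq now holds the three most probable candidates; "b" and "d" wait in sl.
```

The probabilities and guess names are illustrative; the point is that the bounded heap always retains the globally most promising candidates while the rest overflow to cheaper storage.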

**Example 1.**

#### 3.3. Bcrypt Trial

Algorithm 3. Bcrypt Trial.

```
Input:  PQ: max-min heap, EQ: expansion queue, TQ: trial queue, sets of recovered accounts
Output: updated queues and sets
while True do
    if TQ != ∅ then
        cand <- TQ.Get()
    else
        with PQ.lock do
            cand <- PQ.PopMax()
        if not (isPseudoTerminal(cand) or isTerminal(cand)) then
            EQ.Put(cand)
            break
    isCracked <- try(cand)
    if isCracked then
        output <cand.usr : cand.password>
    if isTerminal(cand) and not isAllUsrTried(cand) then
        TQ.Put(cand)
    if not isCracked or not isTerminal(cand) then
        EQ.Put(cand)
```

**Example 2.** Suppose the terminal 123456 has only one untried account left (usr_0), so we can directly drop 123456 after trying it for usr_0. We can observe that only terminals and pseudo-terminals can be inserted into the Trial Queue.

## 4. Experimental Evaluation

#### 4.1. Experimental Setting

**Methods compared:**

- oclHashcat. We use the default configuration of oclHashcat and pipe the output of a trawling PCFG [10] into it for guessing. In the default configuration, oclHashcat does not try the next account until all candidates have been tried or a correct guess is made for the current account. The PCFG is trained on 1 million passwords randomly sampled from Linkedin.
- TarGuessII+Trawling PCFG. TarGuessII can be parallelized, naively but inefficiently, by segmenting the data and assigning each segment to a core. The password-reuse model is trained on cross-site password pairs from the source site Linkedin to the target site 000webhost, though it would be better to train separately for different target sites. We vary the number of trials per account (k) among 100, 1000, and 10,000. The trawling model is trained similarly to the one used for oclHashcat.
- BcryptRecover. This implements our method. The training configuration is the same as for TarGuessII.

#### 4.2. Comparison of Various Approaches

#### 4.3. Impact of Global Ranking

- LocalFullTrans. This is exactly TarGuessII + Trawling PCFG, with a typical value of k = 1000. All candidate passwords of a single round are generated and sorted at once before the bcrypt recovery begins, instead of being generated step by step according to the types of transformation rules.
- GlobalFullTrans. This replaces the incremental transformation strategy of BcryptRecover (alias GlobalIncTrans) with the expansion strategy of LocalFullTrans.

#### 4.4. Impact of Parallelization

## 5. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References

- Why You Shouldn’t Panic about Dropbox Leaking 68 Million Passwords. Available online: https://www.forbes.com/sites/thomasbrewster/2016/08/31/dropbox-hacked-but-its-not-thatbad/#675839355576 (accessed on 8 May 2017).
- Lessons Learned from Cracking 4000 Ashley Madison Passwords. Available online: https://arstechnica.com/informationtechnology/2015/08/cracking-all-hacked-ashleymadison-passwords-could-take-a-lifetime/ (accessed on 15 December 2017).
- Deep Dive into the Edmodo Data Breach. Available online: https://medium.com/4iqdelvedeep/deep-dive-into-theedmodo-data-breach-f1207c415ffb (accessed on 22 November 2017).
- Hashcat. Available online: https://hashcat.net/oclhashcat/ (accessed on 1 September 2017).
- John the Ripper. Available online: http://www.openwall.com/john/ (accessed on 1 December 2017).
- Das, A.; Bonneau, J.; Caesar, M.; Borisov, N.; Wang, X. The tangled web of password reuse. In Proceedings of the NDSS 2014, San Diego, CA, USA, 23–26 February 2014.
- Wang, D.; Zhang, Z.; Wang, P.; Yan, J.; Huang, X. Targeted online password guessing: An underestimated threat. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS 2016), Vienna, Austria, 24–28 October 2016; pp. 1242–1254.
- Han, W.; Li, Z.; Ni, M.; Gu, G.; Xu, W. Shadow attacks based on password reuses: A quantitative empirical view. IEEE Trans. Depend. Secur. Comput. 2016.
- VeriClouds Whitepapers & Resources. Available online: https://www.vericlouds.com/resources/ (accessed on 3 March 2018).
- Weir, M.; Aggarwal, S.; de Medeiros, B.; Glodek, B. Password cracking using probabilistic context-free grammars. In Proceedings of the 2009 30th IEEE Symposium on Security and Privacy, Berkeley, CA, USA, 17–20 May 2009; pp. 391–405.
- Narayanan, A.; Shmatikov, V. Fast dictionary attacks on passwords using time-space tradeoff. In Proceedings of the 12th ACM Conference on Computer and Communications Security (CCS 2005), Alexandria, VA, USA, 7–11 November 2005; pp. 364–372.
- Castelluccia, C.; Durmuth, M.; Perito, D. Adaptive password-strength meters from Markov models. In Proceedings of the 2012 Network and Distributed Systems Security Symposium, San Diego, CA, USA, 5–8 February 2012; pp. 23–26.
- Ma, J.; Yang, W.; Luo, M.; Li, N. A study of probabilistic password models. In Proceedings of the 2014 IEEE Symposium on Security and Privacy, San Jose, CA, USA, 18–21 May 2014; pp. 689–704.
- Komanduri, S. Modeling the Adversary to Evaluate Password Strength with Limited Samples. Ph.D. Thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 2016.
- Melicher, W.; Ur, B.; Segreti, S.; Komanduri, S.; Bauer, L.; Christin, N.; Cranor, L. Fast, lean and accurate: Modeling password guessability using neural networks. In Proceedings of the 25th USENIX Conference on Security Symposium, Austin, TX, USA, 10–12 August 2016; pp. 1–17.
- Ur, B.; Segreti, S.M.; Bauer, L.; Christin, N.; Cranor, L.F.; Komanduri, S.; Kurilova, D.; Mazurek, M.L.; Melicher, W.; Shay, R. Measuring real-world accuracies and biases in modeling password guessability. In Proceedings of the 24th USENIX Conference on Security Symposium (USENIX SEC 2015), Washington, DC, USA, 12–14 August 2015; pp. 463–481.
- Wang, D.; Wang, P. On the implications of Zipf’s law in passwords. In Proceedings of the European Symposium on Research in Computer Security, Heraklion, Greece, 26–30 September 2016; pp. 1–21.
- Zhang, Y.; Monrose, F.; Reiter, M. The security of modern password expiration: An algorithmic framework and empirical analysis. In Proceedings of the 17th ACM Conference on Computer and Communications Security (CCS 2010), Chicago, IL, USA, 4–8 October 2010; pp. 176–186.
- Wang, C.; Jan, S.T.K.; Hu, H.; Wang, G. Empirical analysis of password reuse and modification across online service. arXiv 2017, arXiv:1706.01939.
- Harsha, B.; Blocki, J. Just in Time Hashing. In Proceedings of the 2018 IEEE European Symposium on Security and Privacy (EuroS&P), London, UK, 24–26 April 2018; pp. 368–383.
- Li, Y.; Wang, H.; Sun, K. A study of personal information in human-chosen passwords and its security implications. In Proceedings of the 35th Annual IEEE International Conference on Computer Communications (INFOCOM 2016), San Francisco, CA, USA, 10–14 April 2016; pp. 1–9.
- Oechslin, P. Making a faster cryptanalytic time-memory trade-off. In Proceedings of the Annual International Cryptology Conference (CRYPTO 2003), Santa Barbara, CA, USA, 17–21 August 2003; pp. 617–630.
- Provos, N.; Mazières, D. A future-adaptable password scheme. In Proceedings of the USENIX Annual Technical Conference 1999 (FREENIX Track), Monterey, CA, USA, 6–11 June 1999; pp. 81–91.
- Percival, C. Stronger Key Derivation via Sequential Memory-Hard Functions. Presentation at BSDCan 2009. Available online: http://www.tarsnap.com/scrypt/scrypt.pdf (accessed on 10 May 2018).
- Kaliski, B. PKCS #5: Password-Based Cryptography Specification Version 2.0. RFC 2898. 2000. Available online: http://tools.ietf.org/html/rfc2898 (accessed on 6 April 2018).
- Biryukov, A.; Dinu, D.; Khovratovich, D. Argon2: New generation of memory-hard functions for password hashing and other applications. In Proceedings of the 2016 IEEE European Symposium on Security and Privacy (EuroS&P), Saarbrucken, Germany, 21–24 March 2016; pp. 292–302.
- Steube, J. PRINCE: Modern Password Guessing Algorithm. Presentation at Passwords 2014. Available online: https://hashcat.net/events/p14-trondheim/prince-attack.pdf (accessed on 16 December 2017).
- Malvoni, K.; Designer, S.; Knezovic, J. Are your passwords safe: Energy-efficient bcrypt cracking with low-cost parallel hardware. In Proceedings of the 8th USENIX Workshop on Offensive Technologies (WOOT 2014), San Diego, CA, USA, 19 August 2014.
- Blocki, J.; Harsha, B.; Zhou, S. On the economics of offline password cracking. In Proceedings of the 2018 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 20–24 May 2018.
- Blocki, J.; Harsha, B.; Kang, S.; Lee, S.; Xing, L.; Zhou, S. Data-Independent Memory Hard Functions: New Attacks and Stronger Constructions. 2018. Available online: https://eprint.iacr.org/2018/944 (accessed on 10 May 2018).
- Weir, M.; Aggarwal, S.; Collins, M.; Stern, H. Testing metrics for password creation policies by attacking large sets of revealed passwords. In Proceedings of the 17th ACM Conference on Computer and Communications Security (CCS 2010), Chicago, IL, USA, 4–8 October 2010; pp. 162–175.
- Min-Max Heap. Available online: https://en.wikipedia.org/wiki/Min-max_heap (accessed on 10 May 2018).
- Skip List. Available online: https://en.wikipedia.org/wiki/Skip_list (accessed on 3 April 2018).
- Levenshtein Distance. Available online: https://en.wikipedia.org/wiki/Levenshtein_distance (accessed on 10 May 2018).
- Heapsort. Available online: https://en.wikipedia.org/wiki/Heapsort (accessed on 3 April 2018).
- Once Seen as Bulletproof, 11 Million+ Ashley Madison Passwords Already Cracked. Available online: https://arstechnica.com/informationtechnology/2015/09/once-seen-as-bulletproof-11-million-ashley-madison-passwords-already-cracked/ (accessed on 11 December 2017).
- AWS EC2. Available online: https://aws.amazon.com/ec2/?ft=n (accessed on 22 August 2017).
- Amazon EC2 Instance Types. Available online: https://aws.amazon.com/ec2/instance-types/?nc1=h_ls (accessed on 10 May 2018).
- Ding, Z.; Jia, Y.; Zhou, B.; Han, Y. Mining topical influencers based on the multi-relational network in micro-blogging sites. China Commun. 2013.
- Wang, P.; Lu, K.; Li, G.; Zhou, X. DFTracker: Detecting double-fetch bugs by multi-taint parallel tracking. Front. Comput. Sci. 2018, 1–17.
- Wu, Z.; Lu, K.; Wang, X.; Zhou, X. Collaborative technique for concurrency bug detection. Int. J. Parallel Program. 2015, 43, 260–285.
- Wu, T.; Yang, Y. Detecting android inter-app data leakage via compositional concolic walking. J. Autosoft. 2019.
- Wu, Z.; Lu, K.; Wang, X.; Zhou, X.; Chen, C. Detecting harmful data races through parallel verification. J. Supercomput. 2015, 71, 2922–2943.

**Figure 4.** Overall recovery performance for non-identical passwords (subfigures (**a**), (**b**) and (**c**) show the recovery-speed graphs of the Dropbox-styled, AM-styled and Edmodo-styled lists, respectively).

| Source | #Accounts | Year |
|---|---|---|
| Exploit.in Combo List | 593,427,119 | 2016 |
| Anti Public Combo List | 457,962,538 | 2016 |
| MySpace | 359,420,698 | 2008 |
| NetEase | 234,842,089 | 2015 |
| | 164,611,595 | 2012 |
| Badoo | 112,005,531 | 2013 |

| Target | Cost | #Accounts | %Overlap | %Recovered |
|---|---|---|---|---|
| Dropbox-styled list | 8 | 31,862,436 | 59.1% | 20.1% |
| AshleyMadison (AM)-styled list | 12 | 30,653,460 | 46.7% | 19.4% |
| Edmodo-styled list | 12 | 43,488,310 | 21.4% | 8.1% |
| csdn.net (CSDN)-styled list (re-hashed) | 12 | 6,428,630 | 39.2% | 24.9% |


## Share and Cite

**MDPI and ACS Style**

Wu, T.; Yang, Y.; Wang, C.; Wang, R.
Study on Massive-Scale Slow-Hash Recovery Using Unified Probabilistic Context-Free Grammar and Symmetrical Collaborative Prioritization with Parallel Machines. *Symmetry* **2019**, *11*, 450.
https://doi.org/10.3390/sym11040450
