Article

PassRecover: A Multi-FPGA System for End-to-End Offline Password Recovery Acceleration

1 School of Computer Science, Fudan University, Shanghai 200433, China
2 Shanghai HONGZHEN Information Science & Technology Corporation, Shanghai 201112, China
3 Institute for Big Data, Fudan University, Shanghai 200433, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(7), 1415; https://doi.org/10.3390/electronics14071415
Submission received: 7 March 2025 / Revised: 29 March 2025 / Accepted: 29 March 2025 / Published: 31 March 2025

Abstract

In the domain of password recovery, deep learning has emerged as a pivotal technology for enhancing recovery efficiency. Despite its effectiveness, the inherent computational complexity of deep learning-based password generation algorithms poses substantial challenges, particularly in achieving synergistic acceleration between deep learning inference and the plaintext encryption process. In this paper, we introduce PassRecover, a multi-FPGA-based computing system that can simultaneously accelerate deep learning-driven password generation and plaintext encryption in an end-to-end manner. The system architecture incorporates a neural processing unit (NPU) and an encryption array configured to operate under a streaming dataflow paradigm for parallel processing. It is the first approach to explore the benefit of end-to-end offline password recovery. For comprehensive evaluation, PassRecover is benchmarked against PassGAN and five industry-standard encryption algorithms (Office2010, Office2013, PDF1.7, Winzip, and RAR5). Experimental results demonstrate excellent performance: compared to the latest work that accelerates only encryption algorithms, PassRecover achieves an average 101.5% speedup across all tested encryption algorithms. When compared to graphics processing unit (GPU)-based end-to-end implementations, this work delivers 93.01% faster processing speed and 3.73× higher energy efficiency. These results establish PassRecover as a promising solution for resource-constrained password recovery scenarios requiring high throughput and energy efficiency.

1. Introduction

Although security technology has been advancing, plaintext passwords remain the most widely used authentication method in most applications because they are easy to remember and implement. However, plaintext passwords typically exhibit strong biases or statistical properties in characters or structures due to human habits or preferences. For example, the most common pattern in leaked password datasets is 123456 [1]. Such problems make plaintext passwords vulnerable to guessing attacks. Password recovery, also called a password guessing attack, is a method of recovering the correct user password for authentication by testing possible password candidates online or offline. In offline password recovery, the attack target (e.g., digest and salt) is generally leaked through various methods such as spoofing attacks, penetration attacks, etc. [2]. The offline recovery process usually consists of four steps (as illustrated in Figure 1): ① split the leaked encryption digest and salts for the attack process; ② generate password candidates according to attacking methods such as traditional dictionary attacks, brute-force attacks (e.g., masks), mangled-wordlist attacks (e.g., rules), classical machine learning algorithms (e.g., PCFG [3]), or deep learning-based algorithms (e.g., PassGAN [4]); ③ execute the application encryption algorithm with the generated password candidates and salt as input to produce the attack digests; and ④ compare the attack digest with the target digest to determine whether they are equal; if they are equal, the password is recovered successfully. Generally, offline password recovery is a costly and time-consuming process, as the computing system is required to execute the application encryption algorithm for each input password candidate, while the number of password candidates generated in the producer stage is enormous.
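For concreteness, the following minimal Python sketch mirrors steps ②-④ of this loop; the salted SHA-256 digest is only a stand-in for a real application encryption algorithm, and all names are illustrative.

import hashlib

def recover(target_digest: bytes, salt: bytes, candidates):
    """Minimal offline recovery loop covering steps 2-4 of Figure 1.
    `candidates` is any iterable produced in step 2 (dictionary, mask
    enumeration, PCFG, PassGAN, ...); a salted SHA-256 stands in for
    the application encryption algorithm of step 3."""
    for pwd in candidates:                                             # step 2: producer stage
        attack_digest = hashlib.sha256(salt + pwd.encode()).digest()   # step 3: consumer stage
        if attack_digest == target_digest:                             # step 4: compare digests
            return pwd                                                 # password recovered
    return None                                                        # search space exhausted

# Toy usage: the digest/salt pair plays the role of the leaked target (step 1).
salt = b"\x01\x02"
target = hashlib.sha256(salt + b"123456").digest()
print(recover(target, salt, ["password", "qwerty", "123456"]))  # -> 123456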
Currently, two main approaches have been proposed to reduce the cost of password recovery. The first is the algorithmic approach, represented by PCFG [3], OMEN [5], PassGAN [4], PassGPT [6], etc. These probabilistic algorithms are mainly used to generate attacking password candidates/masks that are more closely aligned with human usage patterns, thereby reducing attack attempts by eliminating ultra-low-probability candidates. Under the wave of artificial intelligence (AI), deep learning-based password generation algorithms such as PassGAN and PassGPT are more advanced than classical machine learning algorithms, making them more suitable for narrowing the search space of characters and structures in password generation. The other way to reduce the cost of password recovery is the hardware approach, which involves designing high-throughput encryption architectures and energy-efficient computing systems to speed up the attack process and reduce power consumption. For example, Ref. [7] proposed the RUPA processor to accelerate rule-based password generation. In [8], the authors explored a way to balance flexibility and energy efficiency in password recovery through a CPU-FPGA hybrid architecture. In contrast, Refs. [9,10,11] applied an algorithm-specific customization paradigm to maximize the overall performance and energy efficiency of password recovery systems. For instance, Ref. [9] proposed a dual-granularity data path adjustment strategy to design a specific architecture for the RAR3 encryption algorithm. In [10,11], the authors achieved state-of-the-art energy efficiency by designing a deep pipeline architecture for each evaluated algorithm from Hashcat [12] (an open-source password recovery framework). In addition, Ref. [13] accelerated multiple encryption algorithms using an advanced heterogeneous multi-zone processor called MT-3000, which is designed specifically for high-performance computing (HPC).
Nevertheless, existing research on hardware acceleration primarily focuses on accelerating encryption algorithms (the consumer stage in Figure 1). No prior work has explored the potential benefit of co-accelerating a deep learning-based password generator and encryption, which leaves an opportunity to investigate synergistic acceleration between these two stages to further reduce the cost of password recovery. To illustrate the benefits, consider digital forensics services (one application of password recovery), where hourly fees typically range from 200 to 400 USD [14]. For password recovery workflows that run for days to weeks, even a 20-30% improvement in processing speed could yield direct cost savings of 1000-3000 USD per day. For organizations that provide the computing platform for password recovery, enhancing energy efficiency reduces electricity consumption for both the computational workload and the cooling system. Furthermore, as illustrated in Section 2, when leveraging deep learning-based password generators like PassGAN, the producer stage becomes a computational bottleneck in almost 68% of Hashcat recovery scenarios, and the situation is worse with PassGPT. These findings underscore the inadequacy of isolated consumer-stage optimizations and highlight the urgency of designing co-acceleration architectures that parallelize password generation and encryption algorithm computations.
To address this challenge, this study presents PassRecover, a multi-FPGA computing system comprising a Zynq MPSoC XCZU7EV (AMD, Santa Clara, CA, USA) and two Virtex Ultrascale+ XCVU9P FPGAs (AMD, Santa Clara, CA, USA). Within this system, an NPU is proposed to accelerate deep learning-based password generation algorithms, and algorithm-specific accelerators are implemented for industry-standard encryption algorithms. For systematic evaluation, we adopt PassGAN as the representative password generator while benchmarking against five encryption algorithms: Office2010, Office2013, PDF1.7, Winzip, and RAR5. By decoupling the NPU and the encryption algorithm accelerator, this study achieves end-to-end co-acceleration of the producer stage and the consumer stage. By employing these techniques, the proposed multi-FPGA system demonstrates significant advancements in terms of speedup and energy efficiency compared to previous works and GPU systems. The major differences between this work and previous studies can be summarized in Table 1, while the primary contributions of this work are as follows:
(1)
A multi-FPGA system containing one XCZU7EV and two XCVU9P FPGAs is proposed. The system provides the ability to recover passwords in an end-to-end manner. To the best of our knowledge, it is the first work to accelerate both the deep learning-based password generation algorithm and the encryption algorithm on a multi-FPGA system.
(2)
An NPU architecture for deep learning-based password generation is proposed. The PassGAN algorithm is taken as a case study to evaluate the NPU performance. Experimental results show an 82.16% and 155.22% improvement in speed and energy efficiency, respectively, compared to a Tesla V100 GPU.
(3)
Five representative encryption algorithms from Hashcat, including Office2010, Office2013, PDF1.7, Winzip, and RAR5, are implemented with algorithm-specific architectures to co-accelerate with PassGAN in an end-to-end manner. Results show that the speed and energy efficiency reach 1.93× and 4.73× those of the GPU solution, respectively. Compared to the latest work that accelerates only encryption algorithms, the proposed architecture achieves an average of 101.50% higher speed and 22.11% higher energy efficiency.

2. Prerequisite

2.1. Why PassGAN Is Selected as the Study Case

Based on Generative Adversarial Networks (GANs) and GPT-2, respectively, PassGAN and PassGPT are two representative deep learning-based password recovery algorithms developed in the past decade. Both have been shown to outperform classical machine learning algorithms in terms of password recovery rate on common password leakage datasets such as Rockyou [15], Linkedin [16], and Hotmail [17]. More importantly, PassGAN and PassGPT have the capability to recover more rare and super rare passwords [4,6,18]. For example, Figure 2a shows the comparison of the password recovery rate on the Rockyou dataset among PassGAN, OMEN, and PCFG [18]. The results show that PassGAN outperformed the other two classical machine learning algorithms across four password categories (common, uncommon, rare, and super rare). Moreover, the performance gap between PassGAN and OMEN/PCFG widened as the password pattern became more complicated (from the common to the super rare category), which indicates that deep learning-based algorithms can learn more hidden grammar information and have better generalization capability.
However, the improved recovery rate comes at the cost of high computational complexity. Ideally, the producer and consumer stages illustrated in Figure 1 should be computationally balanced. In practice, however, the producer frequently becomes a bottleneck, as evidenced by Figure 2b, which compares the average speed of PassGAN, PassGPT, and Hashcat benchmarks on the Tesla V100 GPU. The results demonstrate that 68% of the benchmarks exhibit superior performance to PassGAN, with their geometric mean (Geo-mean) speed reaching 79.18× that of PassGAN. For PassGPT, the situation is worse; even the slowest encryption algorithms, such as Bitcoin/Litecoin, are nearly four times faster than PassGPT. Although it is feasible to pre-generate password candidates as attack dictionaries before runtime, the ability to generate password candidates dynamically at runtime remains essential for the scalability requirements of password recovery systems. In this work, we explore ways to co-accelerate deep learning-based password generation algorithms and Hashcat encryption algorithms. To minimize the performance gap between the producer and consumer, we use PassGAN as a case study to demonstrate the proposed PassRecover system.

2.2. PassGAN Overview

PassGAN is an open-source framework that is available at [19]. Built on the high-performance and stable IWGAN [20], PassGAN consists of a generator network (G) and a discriminator network (D), as illustrated in Figure 3a,b. Both G and D contain five residual blocks, a 1D convolution (Conv1D) layer, and a linear layer. The architecture of the residual block is shown in Figure 3c. In the training phase, the adversarial procedure between G and D drives G to learn the information distribution of the training dataset. In the inference phase, only network G is used. Given an input random vector (of size $1 \times 128$) sampled from a Gaussian or uniform distribution, G generates a password that aligns with the distribution of the training dataset. Table 2 lists the detailed information of architecture G (the maximum password length is set to 10 in this example), where the output dimension $97 \times 10$ represents the probabilities of 97 characters at each of the 10 positions of the guessed password. In addition, the input channel ($IC$) and the output channel ($OC$) of the Conv1D layer in the residual block are both 128, while the size of the convolution weight kernel is $1 \times 5$ ($K = 5$). The pseudocode of the Conv1D layer is described in Algorithm 1 (we ignore the bias here), where $IF_{n-1}$, $W_n$, and $OF_n$ represent the input feature map, weight, and output feature map of the n-th Conv1D layer.
Algorithm 1: Pseudocode of the n-th Conv1D layer.
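Since the pseudocode itself is rendered as an image in the original article, the following Python sketch reconstructs the layer from the description above ($IC = OC = 128$, $K = 5$, bias ignored); the "same" zero padding is our assumption, made so that the residual blocks preserve the feature length.

import numpy as np

def conv1d(IF_prev: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Sketch of the n-th Conv1D layer (bias ignored, as in Algorithm 1).
    IF_prev: input feature map  IF_{n-1}, shape (IC, L),   e.g., (128, 10)
    W:       weight kernel      W_n,      shape (OC, IC, K) with K = 5
    Returns: output feature map OF_n,     shape (OC, L)."""
    OC, IC, K = W.shape
    _, L = IF_prev.shape
    x = np.pad(IF_prev, ((0, 0), (K // 2, K // 2)))  # assumed 'same' zero padding
    OF = np.zeros((OC, L))
    for co in range(OC):                 # output channels
        for col in range(L):             # output positions
            for ci in range(IC):         # input channels
                for k in range(K):       # kernel taps
                    OF[co, col] += x[ci, col + k] * W[co, ci, k]
    return OF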

2.3. Characteristics of Application Encryption Algorithms

In password recovery scenarios, the application encryption algorithms vary widely: the latest version of Hashcat provides up to 410 kinds of algorithms. It is difficult to design a uniform hardware architecture that accelerates those algorithms both flexibly and efficiently. To maximize attack performance, the proposed PassRecover applies a customization methodology to design a specific architecture for each encryption algorithm, which requires PassRecover to be able to reconfigure the FPGAs at runtime. In addition, password recovery is an embarrassingly parallel [21] process, which means that although the micro-architecture for each encryption algorithm differs, their top-level programming model can be the same, such as the data-parallel paradigm. In this work, the design of the encryption algorithm accelerators in PassRecover follows these characteristics.

3. Proposed Architecture

3.1. PassRecover: System Architecture with Multi-FPGA

A fabricated PassRecover prototype is shown in Figure 4a, which consists of a Zynq MPSoC XCZU7EV device (AMD, Santa Clara, CA, USA) and two Virtex Ultrascale+ XCVU9P FPGAs (AMD, Santa Clara, CA, USA). This heterogeneity in FPGA deployment is motivated by two key considerations: (1) PassRecover requires a runtime capable of handling control, management, and scheduling for the NPU and the encryption algorithm accelerators. The Zynq MPSoC's quad-core Cortex-A53 processor (ARM, Cambridge, UK) satisfies this requirement effectively. For example, as illustrated in Section 3.2, the Gaussian-distributed random numbers required for PassGAN acceleration are generated by the Cortex-A53 CPU within the Zynq MPSoC instead of the host CPU. This approach enables most of the management workload on the host-side CPU to be offloaded onto the Cortex-A53, thereby enhancing system scalability. (2) To support runtime reconfiguration and provide the capability to dynamically switch the computation mode or encryption algorithm, one FPGA device must implement the reconfiguration datapath for the others. In PassRecover, we select the XCZU7EV to realize this functionality. Based on these considerations, two working modes are proposed according to different combinations of the three FPGA devices. The details are introduced later.
Figure 4b shows the interconnection data paths among the three FPGA devices. The PCI-e and DDR4 physical interfaces are connected directly to the XCZU7EV device. The FPGAs are connected to each other through four-lane gigabit transceivers (the green arrows in the figure). Based on the Aurora chip-to-chip protocol [22] from AMD (Santa Clara, CA, USA), data movement between the FPGAs is achieved by the direct memory access (DMA) method. For example, user data from PCI-e or DDR4 can be moved to an XCVU9P device through a chip-to-chip channel and vice versa. To provide reconfiguration ability at runtime, the proposed PassRecover system uses the SelectMAP [23] interface (the red arrow in the figure) to program the XCVU9P devices. The master port (denoted M) of the SelectMAP interface resides on the XCZU7EV, while the slave port (denoted S) is located on the XCVU9P. This design scheme enables users to download bitstreams via the PCI-e channel to drive the master port of the SelectMAP, thereby achieving runtime reconfiguration.
Figure 4c,d present the two proposed acceleration modes in PassRecover for end-to-end password recovery. The first is referred to as mode A, where the NPU operates on the XCZU7EV FPGA to generate password candidates, which are then fed to the two XCVU9P FPGAs for the encryption threads to accelerate the target encryption algorithm. This mode is suitable for slow encryption algorithms. In contrast, the second mode, mode B, is designed for faster encryption algorithms: one XCVU9P FPGA generates password candidates, which are then fed to the other XCVU9P FPGA for the encryption threads.
In cases where the computation speed of the encryption algorithm significantly outpaces the password generation speed of the NPU, a hybrid attack approach is typically employed to enhance the speed of password generation. For example, for the plaintext password p@ssword generated by the NPU, we can append an attack mask such as ?d?d?d (representing three-digit numeric characters) to form a hybrid attack template p@ssword?d?d?d. Compared to the single password generated by the NPU, the password space of the hybrid attack template is increased by a factor of 1000. This method can alleviate or resolve the issue where the password generation speed of the NPU becomes a bottleneck for the fast encryption algorithms.
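The expansion is straightforward to express in code. The sketch below enumerates a hybrid-attack template in Python; the charset table is an abridged, illustrative version of Hashcat's built-in mask charsets.

from itertools import product
import string

# Abridged Hashcat-style built-in charsets for mask positions (?s is truncated here).
CHARSETS = {"d": string.digits, "l": string.ascii_lowercase,
            "u": string.ascii_uppercase, "s": "!@#$%^&*"}

def hybrid_candidates(base: str, mask: str):
    """Expand one NPU-generated password with an appended mask.
    'p@ssword' + '?d?d?d' yields the 1000 candidates p@ssword000..p@ssword999,
    multiplying the work handed to the encryption engines by 10^3."""
    sets = [CHARSETS[c] for c in mask.split("?") if c]
    for combo in product(*sets):
        yield base + "".join(combo)

candidates = list(hybrid_candidates("p@ssword", "?d?d?d"))
print(len(candidates), candidates[0], candidates[-1])  # 1000 p@ssword000 p@ssword999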

3.2. NPU

3.2.1. Overall Architecture

Like many NPUs presented in previous work [24,25], the NPU proposed in this paper also applies a systolic array architecture to accelerate neural networks. Although we only apply PassGAN as the password generation algorithm for evaluation, to maintain generality and prepare for the acceleration of PassGPT in the future, the proposed NPU also takes into account the computational characteristics of the Transformer architecture.
The proposed NPU architecture in mode A is shown in Figure 5 (the architecture is the same in mode B) and is introduced in five aspects: (1) The I/O of the NPU is organized by the PCI-e XDMA module, the ARM Cortex-A53 CPU system, and the chip-to-chip (C2C) module. During initialization, the weight data of PassGAN and the instruction data of the NPU are loaded into the corresponding addresses in DDR4 via PCI-e. During execution, the Cortex-A53 CPU continuously generates random numbers sampled from a normal distribution and writes them to the input data queue in DDR4. Concurrently, the results generated by the NPU are sent to the Decoder unit, interpreted as password candidates or password-mask hybrid representations, and then transmitted to the XCVU9P FPGA through the C2C module. (2) In terms of memory, the NPU includes the input feature map buffer (IF Buffer), the weight/activation buffer (Weight/Act. Buffer), and the output feature map buffer (OF Buffer). The IF Buffer consumes the most on-chip storage. In the INT-K quantization case (e.g., K = 4, 8, 16), multiple Ultra-RAMs (URAM, 288 Kb) cascade in depth (with D Ultra-RAMs, each 4096 deep) and concatenate in width to provide $M \times 2 \times K$ bit/clock-cycle bandwidth for the PE array and the off-chip memory, where M is the number of rows in the PE array. The Weight/Act. Buffer is mainly used to store weight/activation data in activation-weight and activation-activation matrix multiplications; its bandwidth is $M \times N \times K$, where N is the number of columns in the PE array. The multiplication results are queued in the OF Buffer, whose bandwidth requirement equals the output bandwidth of the PE array. In this design, for simplicity, we set the input data bandwidth of the OF Buffer to $32 \times 32$ bit/clock cycle. (3) The core computation engine of the NPU is the PE array, organized as N columns and M rows. In each column, M multipliers perform a dot product between the input feature and weight/activation data. As in the systolic array proposed in [24], the partial products are reduced through horizontal connections. Finally, an accumulator executes further accumulation over the outputs of the adder chain when necessary. In mathematical terms, the PE array can be described by the following equation:
$$OF_{1 \times N} = IF_{1 \times M} \times WA_{M \times N}, \qquad (1)$$
which indicates that the PE array computes a matrix multiplication between $IF_{1 \times M}$ (data from the IF Buffer) and $WA_{M \times N}$ (data from the Weight/Act. Buffer) in each clock cycle. (4) The pooling and special unit supports the pooling operation, upsampling, maximum, square root, and several activation functions, including ReLU, Swish, Softmax, and GELU. The hardware implementation of the activation functions can be found in [26]; the details will not be discussed in this section. (5) To maintain the computation flow, the proposed NPU employs a Very Long Instruction Word (VLIW) architecture. The instruction controller (INSTR CTR) is responsible for fetching instruction data from off-chip memory and decoding the input instructions for each processing unit. For instance, the address generator (ADDR GEN) generates addresses for all internal buffers and off-chip memory based on the configuration provided by the instruction controller.

3.2.2. Tiling

To map the Conv1D/linear layers of PassGAN onto the NPU, we tile each Conv1D/linear layer into multiple blocks so that the layer can be executed on the NPU block by block. Algorithm 2 shows the tiling strategy for the n-th Conv1D layer (bias is ignored in the pseudocode). The first and second nested loops indicate the tile number in the output-channel and input-channel directions, respectively. The loops between line 4 and line 6 represent the clock cycles required for each tiling block executed on the NPU. The executing sequence follows the order of the $t_o$ and $t_i$ indices; that is, we first compute the blocks in the input-channel direction and then switch to the output-channel direction. Similarly, the NPU address generation process follows the order of the loop indices $b$, $col$, and $k$ for each tile. Finally, the nested loops on lines 7 and 8 are offloaded onto the PE array in each clock cycle.
Algorithm 2: Pseudocode of tiling scheme in the n-th Conv1D layer.
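For reference, the following Python sketch emulates the tiling loops described above (Algorithm 2 itself is rendered as an image in the original article); the batch loop $b$ is omitted for brevity, "same" padding is assumed, and the innermost $M \times N$ multiply-accumulate models one clock cycle of the PE array.

import numpy as np

M, N = 32, 128            # PE array rows/columns (mode A configuration)

def conv1d_tiled(IF_prev: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Sketch of Algorithm 2: the n-th Conv1D layer tiled onto the PE array.
    t_o/t_i walk the output-/input-channel tiles; (col, k) are the per-tile
    clock cycles produced by the address generator; the innermost M x N
    multiply-accumulate is one PE-array cycle (emulated as a vector-matrix product)."""
    OC, IC, K = W.shape
    _, L = IF_prev.shape
    x = np.pad(IF_prev, ((0, 0), (K // 2, K // 2)))  # assumed 'same' zero padding
    OF = np.zeros((OC, L))
    for to in range(OC // N):                # tile over output channels
        for ti in range(IC // M):            # tile over input channels
            for col in range(L):             # one clock cycle per (col, k) pair
                for k in range(K):
                    IF_vec = x[ti*M:(ti+1)*M, col + k]           # 1 x M from IF Buffer
                    WA = W[to*N:(to+1)*N, ti*M:(ti+1)*M, k].T    # M x N from Weight/Act. Buffer
                    OF[to*N:(to+1)*N, col] += IF_vec @ WA        # PE array: OF_1xN += IF_1xM x WA_MxN
    return OF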

3.3. Encryption Algorithm Accelerator

In this section, we first introduce a unified hardware architecture template that is suitable for most common encryption algorithms. Then, the detailed architecture for Office2013 is taken as an example of this template to demonstrate the design methodology, which is also applicable to other encryption algorithms.

3.3.1. Unified Hardware Architecture Template

In PassRecover, we adopt a unified hardware architecture template to accelerate encryption algorithms, including, but not limited to, the Office2010, Office2013, PDF1.7, Winzip, and RAR5 algorithms that are evaluated in subsequent experiments. Figure 6 depicts the details of this unified hardware architecture template, which exploits the inherently parallelizable nature of encryption algorithms in password recovery. The proposed template comprises five collaborative modules: (1) The chip-to-chip interface module (Chip2Chip), which facilitates data exchange between the XCZU7EV and XCVU9P devices; for example, the input passwords/masks and hit passwords are transmitted between devices through this module at runtime. (2) The passwords/masks on-chip memory (Pwd/Mask On-chip Memory), which temporarily stores the attack password candidates or masks. (3) A round-robin allocator, which assigns attack password candidates or masks from the passwords/masks on-chip memory to each encryption engine (EE). (4) An EE array, the computational core, consisting of an $M \times N$ array of parallel processing units. It applies a data-parallel paradigm, in which each EE unit shares the same target hash digest but processes unique passwords/masks. Each EE unit consists of four sub-modules: a password/mask buffer, a password enumeration unit, an encryption thread responsible for accelerating the target encryption algorithm, and a comparator that validates hash matches. (5) The last module is the result detector, which is the destination of the result collection network. Any EE unit that hits the target digest generates a $hit_c$ signal to feed the recovered password forward to the result detector.
In the proposed architecture template, except for the EE unit, which needs to be customized according to the target algorithm, the other modules remain unchanged. We use an advanced extensible interface (AXI) for the password/mask allocation bus and the target digest configuration path. As a result, no matter how the internal architecture of the encryption thread in the EE unit changes, the overall architecture of the proposed template remains the same.
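A sequential behavioral model of this template, not cycle-accurate and with a salted SHA-256 standing in for the customized encryption thread, may help clarify the dispatch-and-compare dataflow:

import hashlib
from itertools import cycle

class EncryptionEngine:
    """Behavioral model of one EE unit: buffer, encryption thread, comparator."""
    def __init__(self, target_digest: bytes, salt: bytes):
        self.target, self.salt = target_digest, salt

    def try_password(self, pwd: str):
        digest = hashlib.sha256(self.salt + pwd.encode()).digest()  # encryption-thread stand-in
        return pwd if digest == self.target else None               # comparator

def ee_array(candidates, target: bytes, salt: bytes, m: int = 4, n: int = 4):
    """Round-robin allocator feeding an m x n EE array that shares one target digest."""
    engines = [EncryptionEngine(target, salt) for _ in range(m * n)]
    for ee, pwd in zip(cycle(engines), candidates):  # round-robin assignment
        hit = ee.try_password(pwd)
        if hit is not None:
            return hit                               # hit signal -> result detector
    return None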

3.3.2. Customization Methodology for Encryption Thread

The remaining problem is how to customize the detailed architecture of the encryption thread for each encryption algorithm. In this work, we propose a heuristic method that helps make design decisions on architecture customization; its flow is shown in Figure 7. Generally, an encryption algorithm consists of one or more kernels, such as SHA-512 and AES-256 in the Office2013 encryption algorithm. The computational overhead of each kernel in an algorithm determines the computing architecture it employs (e.g., a pipeline architecture or an iteration architecture). Based on this observation, we first apply a high-level synthesis (HLS) approach to explore the area and latency of each kernel across different design spaces. For example, by unrolling the loop body of the SHA-512 kernel (as shown in Algorithm 3) with different factors N (line 7 in the pseudocode), we can trade off the performance and area of the SHA-512 hardware architecture. Figure 8 shows the results for some example kernels under different unroll factors, where different criteria can be employed to select the design point. For example, according to [13], 100,000 SHA-512 iterations account for 99.24% of the computation time of the Office2013 encryption algorithm, while the remaining part accounts for only 0.76%. Based on this computational overhead analysis, we adopt the minimal area-latency product to determine the design point for SHA-512 while adopting the minimal-area criterion for the others. With this design decision made, we then write Verilog RTL rather than HLS to customize the encryption thread architecture for Office2013 in the implementation stage to maximize performance.
Algorithm 3: Pseudocode of SHA_512.
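The design-point selection itself is mechanical once the HLS exploration data are available. The sketch below encodes the two criteria from Figure 7; the exploration numbers are hypothetical and for illustration only.

def pick_design_points(kernels: dict, dominant: str = "SHA-512") -> dict:
    """`kernels` maps a kernel name to (unroll_factor, area, latency) candidates
    from HLS design-space exploration. The dominant kernel minimizes the
    area-latency product; every other kernel minimizes area alone."""
    choice = {}
    for name, points in kernels.items():
        if name == dominant:
            choice[name] = min(points, key=lambda p: p[1] * p[2])  # area x latency
        else:
            choice[name] = min(points, key=lambda p: p[1])         # area only
    return choice

# Hypothetical exploration results: (unroll factor N, area in LUTs, latency in cycles).
kernels = {"SHA-512": [(1, 8000, 80), (8, 20000, 10), (80, 60000, 1)],
           "AES-256": [(1, 5000, 14), (14, 30000, 1)]}
print(pick_design_points(kernels))  # SHA-512 fully unrolled; AES-256 rolled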

3.3.3. Architecture of Encryption Thread

As we customize the encryption thread architecture for each encryption algorithm, it is impractical to provide an exhaustive introduction to the architecture for each algorithm. Consequently, this section illustrates the specific architecture of the encryption thread using Office2013 as a representative example.
Algorithm 4 presents the pseudocode of the Office2013 encryption algorithm, which generates the final authentication digest through iterative hashing and cryptographic transformations. The encryption process operates in two stages:
  • Stage 1 involves 100,000 iterations of SHA-512 computations. During each iteration n, the input password is dynamically padded with the iteration index n and the intermediate $hash$ value to construct $mesg_0$, the initial message input for SHA-512 processing.
  • Stage 2 executes post-processing transformations using the Stage 1 output. Through specialized message padding routines, two derived messages $mesg_1$ and $mesg_2$ are generated for subsequent SHA-512 computations. The resulting hash values $key_1$ and $key_2$ undergo one pass of AES-256 CBC decryption and encryption using the salt value and verification hash as encryption parameters, ultimately producing the final authentication digest.
Algorithm 4: Pseudocode of Office2013 encryption algorithm.
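As Algorithm 4 is rendered as an image in the original article, the following Python sketch reconstructs the two-stage flow from the description above. The exact message layouts and block-key constants follow our reading of the MS-OFFCRYPTO agile encryption specification and should be treated as assumptions; pycryptodome is assumed for AES.

import hashlib
from Crypto.Cipher import AES  # pycryptodome is assumed to be installed

ITERS = 100_000
# Block-key constants per our reading of MS-OFFCRYPTO; treat as assumptions.
BK_VERIFIER_INPUT = bytes.fromhex("fea7d2763b4b9e79")
BK_VERIFIER_VALUE = bytes.fromhex("d7aa0f6d3061344e")

def office2013_check(password: str, salt: bytes,
                     enc_verifier_in: bytes, enc_verifier_val: bytes) -> bool:
    # Stage 1: 100,000 SHA-512 iterations (99.24% of the runtime).
    h = hashlib.sha512(salt + password.encode("utf-16-le")).digest()
    for n in range(ITERS):
        # mesg0: iteration index n (32-bit little-endian) plus the running hash.
        h = hashlib.sha512(n.to_bytes(4, "little") + h).digest()
    # Stage 2: derive key1/key2 from the two padded messages (mesg1, mesg2), ...
    key1 = hashlib.sha512(h + BK_VERIFIER_INPUT).digest()[:32]
    key2 = hashlib.sha512(h + BK_VERIFIER_VALUE).digest()[:32]
    # ... then one pass of AES-256 CBC with the salt as IV: decrypt the verifier,
    verifier = AES.new(key1, AES.MODE_CBC, iv=salt).decrypt(enc_verifier_in)
    # hash it, re-encrypt, and compare against the stored verification hash.
    expected = hashlib.sha512(verifier).digest()  # 64 bytes, AES-block aligned
    return AES.new(key2, AES.MODE_CBC, iv=salt).encrypt(expected) == enc_verifier_val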
According to Algorithm 4, Figure 9a presents the detailed architecture of the encryption thread for Office2013. For the first stage of the algorithm, which constitutes the majority of the computational workload, we followed the methodology described in the previous subsection and employed a pipelined architecture to implement SHA-512 (the minimal area-latency product is obtained at the design point where the loop is fully unrolled). Specifically, each round of the SHA-512 computation corresponds to a dedicated round unit, with a total of 80 identical round units interconnected in a cascaded manner. This configuration enables the seamless execution of the 80-round SHA-512 calculation. The detailed architectures of the round unit and the message schedule unit are shown in Figure 9b and Figure 9c, respectively. The logic functions include $\Sigma_0^{\{512\}}$, $\Sigma_1^{\{512\}}$, $Maj(a,b,c)$, $Ch(a,b,c)$, $\sigma_0^{\{512\}}$, and $\sigma_1^{\{512\}}$, whose definitions can be found in [27]. It should be noted that the first part of the algorithm features a deeply pipelined architecture. Assuming the number of pipeline stages is N, batch processing is typically employed to fully leverage the hardware resources during computation: the batch size is set to match the number of pipeline stages (N), thereby eliminating pipeline bubbles and ensuring efficient computation. For the second part of the algorithm, which constitutes a relatively small proportion of the overall computational workload, we implemented the SHA-512 and AES-256 algorithms using an iterative computation method (the minimal area is obtained at the design point where the loop body is rolled). Taking the iterative implementation of SHA-512 as an example, we designed a single round unit and a single message schedule unit; under the control of a finite state machine, the SHA-512 computation is completed through 80 iterative rounds. We employed the same approach to implement both the decryption and encryption processes of the AES-256 algorithm. In the overall architecture, the Office2013 post-processing acceleration module comprises a SHA-512 unit, an AES-256 CBC decryption unit, and an AES-256 CBC encryption unit. These three computational units, coordinated by a finite state machine, collectively complete the entire Office2013 post-processing workflow and generate the final digest.

3.4. Putting It All Together

In practical applications, we employ the Dask framework [28] to achieve distributed parallel computing, in which each worker manages one PassRecover computing system. When PassRecover implements end-to-end acceleration between the NPU and the encryption algorithm accelerators (using mode A as an example, as illustrated in Figure 10), the interaction procedure is as follows. At system power-on time $t_0$, the XCZU7EV FPGA boots from flash memory. Subsequently, the server initiates a health check procedure for PassRecover. Upon successful verification, the system enters a standby state until receiving a task request at $t_1$. At this time, the server transmits the bitstream, which will be programmed into the two XCVU9P FPGAs to accelerate the target encryption algorithm, from the host to the DDR4 memory on the XCZU7EV. The two XCVU9P FPGAs are then programmed under the control of the XCZU7EV. By $t_2$, the server starts to initialize the NPU on the XCZU7EV; this initialization includes loading weight data and instruction data and configuring the relevant registers. Following NPU configuration, the server configures the XCZU7EV and XCVU9P FPGAs for the password recovery task, such as setting the target digest and the masks for the hybrid attack with NPU-generated passwords. At $t_4$, the server triggers the end-to-end computation across the PassRecover system. During computation, each block of hybrid attacks (combining passwords and masks, denoted as $b_0, b_1, b_2, \ldots$) is transmitted to the two XCVU9P FPGAs via the chip-to-chip interconnect for encryption processing. To prevent PassGAN from becoming a bottleneck, the server adjusts the length of the input mask to ensure that the computation time of the encryption process exceeds the time cost of password generation. Additionally, we employ a double-buffer strategy in the encryption accelerator to hide the time cost of hybrid-attack data block transmission. This process continues until the password is successfully recovered or the preset operational timeout is reached. If the password is recovered, the XCVU9P FPGA sends an interrupt signal to terminate the NPU execution and trigger the server to retrieve the recovered password.
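At the cluster level, this per-node procedure maps naturally onto Dask. The sketch below distributes mask blocks across 16 workers using the real dask.distributed API; the PassRecoverNode driver class and the task fields are hypothetical placeholders for the board-specific control code.

from dask.distributed import Client

def recover_on_node(task: dict):
    """Runs on one Dask worker, which manages one PassRecover prototype.
    PassRecoverNode is a hypothetical driver handle for the board-specific
    control path (bitstream download, NPU setup, attack launch)."""
    node = PassRecoverNode()                          # hypothetical driver
    node.program_bitstream(task["bitstream"])         # reconfigure the two XCVU9Ps
    node.init_npu(task["weights"], task["instrs"])    # load PassGAN weights/instructions
    node.set_target(task["digest"], task["masks"])    # target digest + hybrid-attack masks
    return node.run_until_hit_or_timeout()            # None, or the recovered password

def split_masks(job: dict, n: int):
    """Shard the hybrid-attack masks so each node works independently."""
    return [{**job, "masks": job["masks"][i::n]} for i in range(n)]

job = {"bitstream": "winzip.bit", "weights": "passgan_int8.bin",
       "instrs": "npu.inst", "digest": b"...", "masks": ["?d?d?d", "?l?d?d"]}
client = Client("tcp://scheduler:8786")     # 4 servers x 4 prototypes = 16 workers
futures = client.map(recover_on_node, split_masks(job, 16))
results = client.gather(futures)            # any non-None entry is the password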

4. Experimental Results

4.1. Experimental Setting

To comprehensively evaluate PassRecover, we selected PassGAN and several representative encryption algorithms from Hashcat, including Office2010, Office2013, PDF1.7, Winzip, and RAR5. The proposed architectures of the NPU and the encryption algorithm accelerators in Section 3 were realized in Verilog HDL and implemented (including synthesis, place & route, and bitstream generation) with Vivado 2022.1. The resource utilization was reported after the design achieved routing closure. The power consumption of the hardware design was measured through on-board current and voltage sensors. For comparative purposes, we also evaluated the selected algorithms on both an Intel Xeon Gold 5218 CPU (Intel, Santa Clara, CA, USA; featuring 32 threads at 2.30 GHz) and a Tesla V100 GPU. The software environments, as well as the tools used to measure the power consumption of PassGAN and Hashcat, are detailed in Table 3.

4.2. NPU Performance

Consistent with most prior studies [29,30,31], the proposed NPU operates at INT8 precision. The evaluated algorithm, PassGAN, was also quantized to INT8 without suffering any accuracy loss when applied to the Rockyou dataset. To efficiently implement the NPU on an FPGA, we utilized the DSP48E2 slice's capability to perform two multiply-accumulate operations (MACs) within a single clock cycle while operating at twice the frequency of the logic fabric [32]. Given the constraints of the DSP48E2 and URAM resources on the XCZU7EV and XCVU9P devices, the PE array within the NPU was configured as $M \times N = 32 \times 128$ for the XCZU7EV under mode A and $M \times N = 32 \times 64$ for the XCVU9P under mode B. The depth of the IF Buffer was set to $4096 \times 11$.
For the XCZU7EV device, only one NPU core was implemented, with the logic fabrics and DSP48E2 slices clocked at 250 MHz and 500 MHz, respectively. For the XCVU9P device, we successfully routed eight NPUs onto a single chip and achieved timing closure with the logic fabric operating at 300 MHz and the DSP48E2 data paths at 600 MHz. The routing results of the NPU on these two devices are depicted in Figure 11a,b, where the PE array and the Weight/Act. Buffer occupied the most area on the FPGA. Table 4 presents the corresponding resource utilization data under mode A and mode B. The results show that the usage of the DSP48E2 reached around 74% of the total FPGA resources, while the usage of the LUTs fell between 60% and 62%.
To evaluate the performance of the proposed NPU architecture, we executed PassGAN on the CPU and GPU, as well as on the NPU implemented on the XCZU7EV and XCVU9P devices (referred to as FPGA-7EV and FPGA-9P, respectively), across different batch sizes (ranging from 16 to 16,384). The speed metric, which defines the number of generated passwords per second, is given by the following equation:
$$\mathrm{Speed} = \frac{\text{Number of generated passwords}}{\text{Time cost}} \qquad (2)$$
Figure 12 illustrates the variation in processing speed with batch size: the speed of PassGAN on each device gradually increased with batch size and then remained nearly constant beyond a certain point. The peak performance of PassGAN on the CPU, GPU, FPGA-7EV, and FPGA-9P was achieved at batch sizes of 1024, 8192, 128, and 1024, respectively. Compared to the peak performance of PassGAN on the GPU, the implementation of PassGAN on FPGA-9P achieved an 82.16% increase in speed. More detailed comparison results are presented in Table 5, where the power was measured at runtime by the on-board current and voltage sensors. The energy efficiency is defined by Equation (3).
$$\mathrm{Energy\ efficiency} = \frac{\mathrm{Speed}}{\mathrm{Power}} \qquad (3)$$
For comparison, we calculated the speedup and energy efficiency ratios relative to the CPU. From the table, we can see that FPGA-9P and FPGA-7EV achieved the fastest speed and the highest energy efficiency among the selected devices. Compared to the Tesla V100 GPU, PassGAN on FPGA-9P improved the energy efficiency by 155.22%. Although the speed of PassGAN on the XCZU7EV was only 37.96% of the GPU's, its energy efficiency was 3.09 times and 1.21 times that of the GPU and the XCVU9P, respectively.

4.3. Implementation of Encryption Algorithm Accelerator

In the proposed PassRecover computing system, the encryption algorithm accelerator was implemented on the XCVU9P device for both acceleration modes A and B. Following the unified hardware architecture template proposed in Section 3, we customized the encryption thread architecture for each evaluated encryption algorithm. Table 6 presents the resource consumption of Office2010, Office2013, PDF1.7, Winzip, and RAR5 on a single XCVU9P device. Except for the Office2013 algorithm, which had lower LUT utilization due to routing congestion, the other four algorithms showed LUT utilization between 62% and 67%. For Block RAM (BRAM) usage, the Winzip algorithm consumed the most memory storage, as it needs to store up to 8 MB of compressed data to generate the attack digest. In contrast, the other algorithms had relatively low BRAM utilization, as they have no requirement to store additional data.
When running the encryption accelerators only (abbreviated as EO), without the NPU as the password candidate producer, the performance of these accelerators in PassRecover and their comparison with the CPU, GPU, and related FPGA works are presented in Table 7, where the speed and energy efficiency are defined in Equations (4) and (3), respectively.
$$\mathrm{Speed} = \frac{\text{Number of generated hash values}}{\text{Time cost}} \qquad (4)$$
It is worth noting that the FPGA power consumption of this work reported in Table 7 was measured during accelerator runtime through the on-board current and voltage sensors; it is the real power consumption rather than the static/dynamic power estimated by Vivado. For better comparison, we calculated the speedup and energy efficiency ratios relative to the CPU in Table 7. The results indicate that the password attack speed and energy efficiency of the five accelerators in PassRecover surpassed those of the Xeon Gold 5218 CPU (Intel, Santa Clara, CA, USA), the Tesla V100 GPU, and the FPGA solutions from the latest related works [8,11]. For instance, compared to the Tesla V100, PassRecover achieved speed improvements ranging from 1.42× (Winzip) to 1.62× (PDF1.7) and improved the energy efficiency by 147.43% (Office2010) to 289.59% (RAR5). Compared to the FPGA solution presented in TC2019 [8], which uses a low-power FPGA (XC7Z030) for embedded applications, our solution achieved 1.59×, 2.18×, and 2.66× higher energy efficiency for Office2010, Office2013, and RAR5, respectively. Compared to the C&S [11] solution, which employs four XCKU060 FPGAs as the attack system, PassRecover demonstrated similar energy efficiency but achieved performance improvements of 57.14% (Office2010), 49.95% (Office2013), and 90.11% (Winzip) in computation speed.

4.4. End-to-End Password Recovery

We took acceleration mode A, illustrated in Figure 4, as a case study to construct an end-to-end (E2E) password recovery acceleration system. The proposed NPU was deployed on the XCZU7EV device, while the encryption algorithm accelerators were placed on the two XCVU9P FPGAs (e.g., Figure 13 shows the routing result of the Winzip algorithm on an XCVU9P). For comparison, we also set up two experimental configurations. One was the previously introduced EO, which ran only encryption algorithm acceleration on the two XCVU9P devices using static passwords/masks in off-chip memory. The other was H2E, where passwords and masks were dynamically generated by PassGAN on the host GPU and fed to the two XCVU9P devices for simultaneous encryption acceleration. Additionally, two Tesla V100 GPUs were used for the E2E experiments (the two GPUs work in a pipeline, with one for PassGAN acceleration and the other for Hashcat-based encryption). Figure 14a,b show the comparisons of normalized speed and energy efficiency against previous works, the CPU, and the GPUs, respectively.
Both TC2019 and C&S focused solely on encryption algorithm acceleration (the EO scenario). TC2019 reported speed and energy efficiency results for Office2010, Office2013, and RAR5, while C&S excluded PDF1.7. In the EO scenario, our solution was on average 101.5% faster and 22.11% more energy-efficient than C&S. In the E2E scenario, although the speed of the proposed solution is comparable to the H2E case, it is significantly more energy-efficient, with an average 1.14× improvement. This highlights one of the advantages of E2E in password recovery, even before accounting for H2E's poor scalability, which is not considered in this comparison. When compared to the two-GPU E2E setup, our solution under E2E showed 93.01% higher speed and 3.73× better energy efficiency.

4.5. Scalability

Here, we focused solely on scaling-out scenarios. Given that we have currently manufactured only 16 PassRecover prototypes, the scalability analysis presented here was constrained to this 16-node scale. For evaluation, a single server hosts four PassRecover prototypes, resulting in a total deployment of four servers. The servers are interconnected through 10 Gigabit Ethernet, and the entire computing system utilizes the Dask framework to support distributed parallel computing. In the Dask framework, each worker node corresponds to one PassRecover prototype.

4.5.1. The Scalability of PassRecover

For end-to-end password recovery acceleration, we compared the average computational performance per PassRecover instance across three experimental configurations: 1, 4, and 16 PassRecover prototypes. As shown in Figure 15a, while the average speed of each algorithm exhibited a moderate decline as the PassRecover count increased, the degradation remained minimal. For example, the PDF1.7 encryption algorithm achieved 99.32% of its single-PassRecover performance under the 16-node configuration, demonstrating excellent scalability. This outcome aligns with our expectations because of the embarrassingly parallel nature of password recovery, where individual PassRecover nodes operate independently on their assigned masks to execute hybrid attacks without requiring inter-node data exchange. The system's computational performance can therefore be expected to continue to scale near-linearly with additional PassRecover prototypes.

4.5.2. The Scalability of Reconfiguration

Since PassRecover dynamically reconfigures the XCVU9P FPGAs at runtime based on the encryption algorithm of the input task, the associated reconfiguration overhead must remain negligible compared to the password recovery process itself. Figure 15b illustrates the average time for host-side reconfiguration of the dual XCVU9P FPGAs across three experimental setups: 1, 4, and 16 PassRecover prototypes. The data reveal that both the single and four-prototype configurations exhibited average reconfiguration times of 1.13 s. Notably, the 16-instance configuration showed a marginal increase to 1.15 s, which remains orders of magnitude lower than typical password recovery tasks that demand hours or even days. Consequently, the reconfiguration overhead of PassRecover does not materially impact overall recovery performance.

5. Discussion

While this study demonstrates that the proposed PassRecover achieves end-to-end acceleration from password generation to encryption algorithm execution, delivering a 93.01% speedup and a 3.73× energy efficiency improvement over GPU-based end-to-end systems, the PassGAN-generated passwords still require hybrid-attack composition with masks to prevent the NPU from becoming a bottleneck during end-to-end computation. This approach is inevitable given the extreme diversity in the speeds of encryption algorithms (ranging from $10^4$ to $10^{11}$ H/s). As a result, instead of attempting to optimize the FPGA resource partitioning between NPUs and encryption algorithm accelerators for balanced computation, we simply apply mask length adjustment to ensure that the NPU does not become a bottleneck in the co-acceleration process.
While discussing the advantages of the FPGA solution over the GPU, it is also necessary to address its limitations. This study employs a pure FPGA-based implementation for password recovery acceleration and therefore inherits limitations common to all FPGA acceleration solutions, including slow deployment, mandatory algorithm customization, and limited flexibility. These challenges stem from the fundamental nature of FPGA acceleration rather than systemic flaws in our design. In this work, to mitigate these constraints, we propose a unified hardware architecture template that helps reduce algorithm customization efforts. This innovation specifically addresses FPGA's challenges in rapid deployment and lengthy development cycles. We posit that with continuous advancements in high-level synthesis (HLS) tools, the deployment gap between FPGAs and GPUs will progressively narrow, ultimately enhancing the FPGA's competitiveness in practical applications.
In addition, with the rapid development of AI technology, it is foreseeable that an increasing number of advanced models will be applied to the password recovery scenario. As model complexity increases, the co-acceleration of AI and encryption algorithms becomes more significant. In this paper, we have demonstrated the advantage of the end-to-end co-acceleration approach between PassGAN and encryption algorithms. As a continuation of this work, we will focus on mapping PassGPT onto the proposed NPU in the near future and prepare to handle future advanced large language models (LLMs) for password recovery.
It is worth noting that all the techniques discussed in this work are intended mainly for digital forensics services, in order to reduce the cost in terms of time and energy consumption. Enabling hostile attacks is not the motivation of this work.

6. Conclusions

In this paper, we present PassRecover, a multi-FPGA computing system designed for the end-to-end acceleration of offline password recovery. Our architecture includes an NPU for accelerating deep learning-based password generation algorithms (e.g., PassGAN) and a unified hardware architecture template for encryption algorithms, in which the encryption thread is customized according to the target encryption algorithm. For evaluation, we selected five algorithms: Office2010, Office2013, PDF1.7, Winzip, and RAR5. When combining the NPU and encryption algorithm accelerators for end-to-end password recovery acceleration, PassRecover (using mode A as an example) is 93.01% faster and 3.73× more energy-efficient than a GPU-based end-to-end system. Compared to the latest research focusing only on the acceleration of encryption algorithms, PassRecover achieves an average of 101.50% higher speed and 22.11% higher energy efficiency across these five algorithms. These results lay a solid foundation for our future work on mapping PassGPT onto the proposed architecture.

Author Contributions

Conceptualization, G.X. and X.F.; methodology, G.X.; software, G.X.; hardware, G.X. and X.F.; validation, G.X. and X.F.; formal analysis, G.X.; investigation, G.X.; resources, G.X. and Z.H.; data curation, G.X.; writing—original draft preparation, G.X.; writing—review and editing, G.X., X.F., Z.H., W.C. and F.Z.; visualization, G.X.; supervision, G.X.; project administration, X.F.; funding acquisition, W.C. and F.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China under grant 2022YFB4500900.

Data Availability Statement

The data are unavailable due to privacy restrictions.

Conflicts of Interest

Authors Xitian Fan and Zhongchen Huang were employed by the company Shanghai HONGZHEN Information Science & Technology. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
NPU   Neural Processing Unit
FPGA  Field-Programmable Gate Array
GPU   Graphics Processing Unit
CPU   Central Processing Unit
HPC   High-Performance Computing
GAN   Generative Adversarial Network
VLIW  Very Long Instruction Word
PE    Processing Element
EE    Encryption Engine
AXI   Advanced Extensible Interface
N/A   Not Applicable
LUT   Lookup Table
RAM   Random Access Memory
E2E   End-to-End
EO    Encryption Only
H2E   Host-to-Encryption
LLM   Large Language Model

References

  1. Wang, D.; Zhang, Z.; Wang, P.; Yan, J.; Huang, X. Targeted Online Password Guessing: An Underestimated Threat. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016; pp. 1242–1254. [Google Scholar] [CrossRef]
  2. Yang, X.; Yue, C.; Zhang, W.; Liu, Y.; Ooi, B.C.; Chen, J. SecuDB: An In-enclave Privacy-preserving and Tamper-resistant Relational Database. Proc. VLDB Endow. 2024, 17, 3906–3919. [Google Scholar] [CrossRef]
  3. Houshmand, S.; Aggarwal, S.; Flood, R. Next Gen PCFG Password Cracking. IEEE Trans. Inf. Forensics Secur. 2015, 10, 1776–1791. [Google Scholar] [CrossRef]
  4. Garg, V.; Ahuja, L. Password Guessing Using Deep Learning. In Proceedings of the 2019 2nd International Conference on Power Energy, Environment and Intelligent Control (PEEIC), Greater Noida, India, 18–19 October 2019; pp. 38–40. [Google Scholar] [CrossRef]
  5. Dürmuth, M.; Angelstorf, F.; Castelluccia, C.; Perito, D.; Abdelberi, C. OMEN: Faster Password Guessing Using an Ordered Markov Enumerator. In Proceedings of the Engineering Secure Software and Systems - 7th International Symposium, ESSoS 2015, Milan, Italy, 4–6 March 2015; Lecture Notes in Computer Science; Proceedings. Springer: Berlin/Heidelberg, Germany, 2015; Volume 8978, pp. 119–132. [Google Scholar] [CrossRef]
  6. Rando, J.; Pérez-Cruz, F.; Hitaj, B. PassGPT: Password Modeling and (Guided) Generation with Large Language Models. In Proceedings of the Computer Security - ESORICS 2023 - 28th European Symposium on Research in Computer Security, The Hague, The Netherlands, 25–29 September 2023; Lecture Notes in Computer Science; Proceedings, Part IV. Springer: Berlin/Heidelberg, Germany, 2023; Volume 14347, pp. 164–183. [Google Scholar] [CrossRef]
  7. Zhang, Z.; Liu, P.; Wang, W.; Jiang, Y. RUPA: A High Performance, Energy Efficient Accelerator for Rule-Based Password Generation in Heterogenous Password Recovery System. IEEE Trans. Comput. 2023, 72, 900–913. [Google Scholar] [CrossRef]
  8. Liu, P.; Li, S.; Ding, Q. An Energy-Efficient Accelerator Based on Hybrid CPU-FPGA Devices for Password Recovery. IEEE Trans. Comput. 2019, 68, 170–181. [Google Scholar] [CrossRef]
  9. Ding, Q.; Zhang, Z.; Li, S.; Liu, P. Energy-Efficient RAR3 Password Recovery with Dual-Granularity Data Path Strategy. In Proceedings of the 2019 IEEE International Symposium on Circuits and Systems (ISCAS), Sapporo, Japan, 26–29 May 2019; pp. 1–5. [Google Scholar] [CrossRef]
  10. Li, B.; Feng, F.; Chen, X.; Cao, Y. Reconfigurable and High-Efficiency Password Recovery Algorithms Based on HRCA. IEEE Access 2021, 9, 18085–18111. [Google Scholar] [CrossRef]
  11. Li, B.; Zhou, Q.; Cao, Y.; Si, X. Cognitively Reconfigurable Mimic-based Heterogeneous Password Recovery System. Comput. Secur. 2022, 116, 102667. [Google Scholar]
  12. Hashcat. Mask Attack. Available online: https://hashcat.net/wiki/doku.php?id=mask_attack (accessed on 1 February 2025).
  13. Luo, Y.; Liu, J.; Gong, C.; Li, T. An Efficient Heterogeneous Parallel Password Recovery System on MT-3000. J. Supercomput. 2025, 81, 38. [Google Scholar] [CrossRef]
  14. Guardian. Standard Forensics & Data Recovery Rates. Available online: https://guardian-forensics.com/digital-forensics-rates/ (accessed on 1 February 2025).
  15. RockYou. RockYou. 2010. Available online: http://downloads.skullsecurity.org/passwords/rockyou.txt.bz2 (accessed on 1 February 2025).
  16. Hashes.org. Linkedin Leak. Available online: https://github.com/brannondorsey/PassGAN/releases/download/data/68_linkedin_found_hash_plain.txt.zip (accessed on 1 February 2025).
  17. SkullSecurity. Wiki: Passwords. Available online: http://downloads.skullsecurity.org/passwords/hotmail.txt.bz2 (accessed on 1 February 2025).
  18. Pasquini, D.; Gangwal, A.; Ateniese, G.; Bernaschi, M.; Conti, M. Improving Password Guessing via Representation Learning. In Proceedings of the 42nd IEEE Symposium on Security and Privacy (SP 2021), San Francisco, CA, USA, 24–27 May 2021; pp. 1382–1399. [Google Scholar] [CrossRef]
  19. Brannondorsey. PassGAN. 2017. Available online: https://github.com/brannondorsey/PassGAN (accessed on 1 February 2025).
  20. Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A.C. Improved Training of Wasserstein GANs. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 5767–5777. [Google Scholar]
  21. Wiki. Embarrassingly Parallel. Available online: https://en.wikipedia.org/wiki/Embarrassingly_parallel (accessed on 1 February 2025).
  22. AMD. Aurora 64B/66B Protocol Specification, SP011 (v3). 2013. Available online: https://docs.amd.com/v/u/en-US/aurora_64b66b_protocol_spec_sp011 (accessed on 1 February 2025).
  23. AMD. UltraScale Architecture Configuration User Guide (UG570). Available online: https://docs.amd.com/r/en-US/ug570-ultrascale-configuration/Introduction?tocId=__uoAGvOd16yWRXoFejPDQ (accessed on 1 February 2025).
  24. Jouppi, N.P.; Young, C.; Patil, N.; Patterson, D.A.; Agrawal, G.; Bajwa, R.; Bates, S.; Bhatia, S.; Boden, N.; Borchers, A.; et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA 2017, Toronto, ON, Canada, 24–28 June 2017; pp. 1–12. [Google Scholar] [CrossRef]
  25. Chen, Y.; Emer, J.S.; Sze, V. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. In Proceedings of the 43rd ACM/IEEE Annual International Symposium on Computer Architecture (ISCA 2016), Seoul, Republic of Korea, 18–22 June 2016; IEEE Computer Society: Washington, DC, USA, 2016; pp. 367–379. [Google Scholar] [CrossRef]
  26. Li, T.; Zhang, F.; Xie, G.; Fan, X.; Gao, Y.; Sun, M. A High Speed Reconfigurable Architecture for Softmax and GELU in Vision Transformer. Electron. Lett. 2023, 59, e12751. [Google Scholar] [CrossRef]
  27. NIST. Secure Hash Standard (SHS); FIPS PUB 180-4; NIST: Gaithersburg, MD, USA, 2012; pp. 10–23. [Google Scholar]
  28. Dask. Scale the Python Tools You Love. Available online: https://www.dask.org/ (accessed on 1 February 2025).
  29. Anders, M.A.; Kaul, H.; Mathew, S.; Suresh, V.B.; Satpathy, S.; Agarwal, A.; Hsu, S.; Krishnamurthy, R. 2.9TOPS/W Reconfigurable Dense/Sparse Matrix-Multiply Accelerator with Unified INT8/INT16/FP16 Datapath in 14nm Tri-Gate CMOS. In Proceedings of the 2018 IEEE Symposium on VLSI Circuits, Honolulu, HI, USA, 18–22 June 2018; pp. 39–40. [Google Scholar] [CrossRef]
  30. Liu, C.; Yang, Z.; Zhang, X.; Zhu, Z.; Chu, H.; Huan, Y.; Zheng, L.; Zou, Z. A Low-Power Hybrid-Precision Neuromorphic Processor With INT8 Inference and INT16 Online Learning in 40-nm CMOS. IEEE Trans. Circuits Syst. I Regul. Pap. 2023, 70, 4028–4039. [Google Scholar] [CrossRef]
  31. Fan, X.; Xie, G.; Huang, Z.; Cao, W.; Wang, L. Acceleration of Rotated Object Detection on FPGA. IEEE Trans. Circuits Syst. II Express Briefs 2022, 69, 2296–2300. [Google Scholar] [CrossRef]
  32. Fu, Y.; Wu, E.; Sirasao, A.; Attia, S.; Khan, K.; Wittig, R. Deep Learning with INT8 Optimization on Xilinx Devices. 2017. Available online: https://www.origin.xilinx.com/content/dam/xilinx/support/documents/white_papers/wp486-deep-learning-int8.pdf (accessed on 1 February 2025).
Figure 1. Illustration of the process of offline password recovery.
Figure 2. Recovery rate comparison and speed comparison: (a) normalized recovery rate comparison of PassGAN and traditional password generation algorithms on the RockYou dataset [18]; (b) average speed comparison of PassGAN, PassGPT, and the Hashcat benchmark on a Tesla V100 GPU (NVIDIA, Santa Clara, CA, USA).
Figure 3. The neural network architecture of PassGAN [4]: (a) the neural network architecture of the generator (G); (b) the neural network architecture of the discriminator (D); (c) the residual block architecture in PassGAN.
Figure 4. The architecture, interconnection, and two acceleration modes of the proposed PassRecover system: (a) a photo of the PassRecover prototype; (b) the interconnections in PassRecover; (c) acceleration mode A of PassRecover; (d) acceleration mode B of PassRecover.
Figure 5. The NPU architecture under acceleration mode A.
Figure 6. A unified hardware architecture template for encryption algorithms in PassRecover.
Figure 7. A customization methodology for the encryption thread.
Figure 8. The area and latency relationship of MD5, SHA-1, SHA-256, and SHA-512 under different unroll factors.
Figure 9. The overall architecture of the encryption thread for Office2013: (a) the pipelined architecture for the 100,000 iterated SHA-512 computations and the architecture for Office2013 post-processing; (b) the architecture of the round unit for SHA-512; (c) the architecture of the message schedule unit for SHA-512.
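The iterated SHA-512 computation referenced in the Figure 9 caption is the part of Office2013 key derivation that the pipeline accelerates. As a software point of reference only, the following is a minimal Python sketch of that spin loop, assuming the standard agile-encryption convention (a 4-byte little-endian round counter prepended to the previous digest); the function name and the handling of the initial digest are illustrative and not taken from the paper.

```python
import hashlib
import struct

def office2013_spin(h0: bytes, spin_count: int = 100_000) -> bytes:
    """Iterated SHA-512 'spin' loop (illustrative sketch).

    h0 is the initial digest (e.g., SHA-512 over salt plus the
    UTF-16LE password in the agile encryption scheme). Each round
    hashes a 4-byte little-endian counter followed by the previous
    digest, so the loop is strictly sequential -- which is why a
    deep hardware pipeline over many candidates pays off.
    """
    h = h0
    for i in range(spin_count):
        h = hashlib.sha512(struct.pack('<I', i) + h).digest()
    return h
```

Because each iteration depends on the previous digest, a single candidate cannot be parallelized internally; the hardware instead keeps the pipeline full with many independent password candidates in flight.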
Figure 10. Illustration of the execution flow in the PassRecover computing system (using mode A as an example).
Figure 11. The routing results of the NPU under two acceleration modes: (a) routing results of a single NPU on the XCZU7EV device under acceleration mode A; (b) routing results of eight NPUs on the XCVU9P device under acceleration mode B (the view is rotated 90 degrees counterclockwise).
Figure 12. The speed of PassGAN on different devices under different batch sizes.
Figure 13. Routing result of the Winzip accelerator on the XCVU9P device, with each Super Logic Region (SLR) containing 11 EE units.
Figure 14. The comparison of normalized speedup and energy efficiency on five encryption algorithms: (a) normalized speedup comparison; (b) normalized energy efficiency comparison.
Figure 15. The scalability of PassRecover and reconfiguration: (a) the normalized average speed of PassRecover across three experimental configurations; (b) the average reconfiguration time across three experimental configurations.
Table 1. Summary of the differences in password recovery solutions.

| Related Work | Hardware Platform | Attacking Method | Deep Learning Password Generator Acceleration | Encryption Algorithm Acceleration |
|---|---|---|---|---|
| [7] | FPGA | Rule | ✗ | ✓ |
| [8] | CPU-FPGA hybrid architecture | Mask/Dictionary | ✗ | ✓ |
| [9] | FPGA | Mask/Dictionary | ✗ | ✓ |
| [10] | FPGA | Mask/Dictionary | ✗ | ✓ |
| [11] | FPGA | Mask/Dictionary | ✗ | ✓ |
| [13] | MT-3000 (HPC processor) | Rule/Mask/Dictionary | ✗ | ✓ |
| This work | FPGA | PassGAN/Mask/Dictionary | ✓ | ✓ |
Table 2. Dimension information of G in PassGAN.

| Layer Name | Data Dimension (Batch × Channel × Width) | Weight Dimension |
|---|---|---|
| Input | batch × 128 × 1 | 0 |
| Linear | batch × 1280 × 1 | 128 × 1280 |
| Reshape | batch × 128 × 10 | 0 |
| ResBlock1 | batch × 128 × 10 | 5 × 128 × 128; 5 × 128 × 128 |
| ResBlock2 | batch × 128 × 10 | 5 × 128 × 128; 5 × 128 × 128 |
| ResBlock3 | batch × 128 × 10 | 5 × 128 × 128; 5 × 128 × 128 |
| ResBlock4 | batch × 128 × 10 | 5 × 128 × 128; 5 × 128 × 128 |
| ResBlock5 | batch × 128 × 10 | 5 × 128 × 128; 5 × 128 × 128 |
| Conv1D | batch × 97 × 10 | 5 × 128 × 97 |
| Softmax | batch × 97 × 10 | 0 |
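To make the layer shapes in Table 2 concrete, the following is a minimal PyTorch sketch of a generator with matching dimensions. It is an illustrative reconstruction rather than the authors' implementation: the 0.3-scaled residual shortcut follows the public PassGAN reference code [19], and a padding of 2 is assumed so the kernel-5 convolutions preserve the sequence width of 10.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    # One residual block from Table 2: two 1-D convolutions,
    # kernel size 5, 128 channels, width kept at 10 via padding.
    def __init__(self, dim=128):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2),
        )

    def forward(self, x):
        # 0.3-scaled shortcut, as in the PassGAN reference code [19]
        return x + 0.3 * self.block(x)

class Generator(nn.Module):
    def __init__(self, latent=128, dim=128, seq_len=10, charset=97):
        super().__init__()
        self.dim, self.seq_len = dim, seq_len
        self.linear = nn.Linear(latent, dim * seq_len)      # 128 -> 1280
        self.blocks = nn.Sequential(*[ResBlock(dim) for _ in range(5)])
        self.conv_out = nn.Conv1d(dim, charset, kernel_size=5, padding=2)

    def forward(self, z):                                   # z: batch x 128
        x = self.linear(z).view(-1, self.dim, self.seq_len) # batch x 128 x 10
        x = self.blocks(x)
        x = self.conv_out(x)                                # batch x 97 x 10
        return torch.softmax(x, dim=1)  # per-position character distribution

g = Generator()
print(g(torch.randn(4, 128)).shape)  # torch.Size([4, 97, 10])
```

Each of the 10 output positions is a softmax distribution over the 97-symbol character set, from which one password candidate is decoded per batch element.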
Table 3. Software evaluation environment.

| | Intel Xeon Gold 5218 CPU | Tesla V100 GPU |
|---|---|---|
| PassGAN | TensorFlow 1.15, oneDNN 2.2.4 | TensorFlow 1.15, CUDA 10.0, cuDNN 7.6 |
| Hashcat 6.2.6 | Intel OpenCL 2.1 | CUDA 12.4 |
| Power detection tool | Intel-pcm 202409 | nvidia-smi |
Table 4. NPU resource utilization on the XCZU7EV and XCVU9P devices.

| Resource | XCZU7EV: Used / Available (Pct.) | XCVU9P: Used / Available (Pct.) |
|---|---|---|
| LUTs | 142,374 / 230,400 (61.79%) | 712,747 / 1,182,240 (60.29%) |
| FFs | 234,986 / 460,800 (51.00%) | 1,164,242 / 2,364,480 (49.24%) |
| BRAMs | 68 / 312 (21.79%) | 1105 / 2160 (51.16%) |
| URAMs | 88 / 96 (91.67%) | 704 / 960 (73.33%) |
| DSPs | 1283 / 1728 (74.25%) | 5129 / 6840 (74.99%) |
Table 5. PassGAN performance on different devices.

| | CPU | GPU | FPGA-9P | FPGA-7EV |
|---|---|---|---|---|
| Platform | Xeon Gold 5218 | Tesla V100 | UltraScale+ XCVU9P | UltraScale+ XCZU7EV |
| Process | 14 nm | 12 nm | 16 nm | 16 nm |
| Device number | 2 | 1 | 1 | 1 |
| Frequency (GHz) | 2.3 | 1.3 | 0.3/0.6 | 0.25/0.5 |
| Batch size | 1024 | 8192 | 1024 | 128 |
| Speed (pwd/s) | 22,547 | 316,656 | 576,820 | 120,192 |
| Power (W) | 278.62 | 189 | 134.92 | 23.25 |
| Speedup | 1× | 28.08× | 51.16× | 10.66× |
| Energy efficiency (pwd/J) | 80.92 | 1675.43 | 4275.27 | 5169.55 |
| Energy efficiency ratio | 1× | 20.70× | 52.83× | 63.88× |
Table 6. The implementation results of five encryption algorithms on the XCVU9P device.

| Algorithm | LUTs as Logic | LUTs as Memory | Total LUTs (%) | FFs (%) | BRAMs (%) |
|---|---|---|---|---|---|
| Office2010 | 618,512 | 122,971 | 741,483 (62.72%) | 1,242,526 (52.55%) | 774 (35.83%) |
| Office2013 | 510,132 | 37,321 | 547,453 (46.31%) | 698,230 (29.53%) | 120 (5.56%) |
| PDF1.7 | 719,770 | 49,732 | 769,502 (65.09%) | 925,181 (39.13%) | 106.5 (4.93%) |
| Winzip | 628,879 | 154,342 | 783,221 (66.25%) | 1,171,776 (49.56%) | 1525.5 (70.63%) |
| RAR5 | 669,664 | 122,548 | 792,212 (67.01%) | 873,070 (36.92%) | 97.5 (4.51%) |
Table 7. Performance comparison with CPU, GPU, and previous works.

| Algorithm | Metric | Hashcat (CPU) | Hashcat (GPU) | TC2019 [8] | C&S [11] | This Work |
|---|---|---|---|---|---|---|
| — | Platform | CPU | GPU | FPGA | FPGA | FPGA |
| — | Chip type | Xeon Gold 5218 | Tesla V100 | XC7Z030 | XCKU060 | XCVU9P |
| — | Process | 14 nm | 12 nm | 28 nm | 20 nm | 16 nm |
| — | Chip number | 2 | 1 | 1 | 4 | 2 |
| Office2010 | Frequency (MHz) | 3000 | 1530 | 190 | 250 | 250 |
| | Speed (KH/s) | 12.74 | 144.70 | 9.84 | 140 | 220 |
| | Power (W) | 231.56 | 288 | 12.55 | 117.5 | 177.0 |
| | Energy efficiency (H/J) | 55.01 | 502.43 | 783.82 | 1191.49 | 1242.94 |
| | Speedup | 1× | 11.36× | 0.77× | 10.99× | 17.27× |
| | Energy efficiency ratio | 1× | 9.13× | 14.25× | 21.66× | 22.59× |
| Office2013 | Frequency (MHz) | 3000 | 1530 | 190 | 150 | 150 |
| | Speed (KH/s) | 1.897 | 17.6 | 0.979 | 18 | 27 |
| | Power (W) | 225.77 | 250 | 10.65 | 104.33 | 134.8 |
| | Energy efficiency (H/J) | 8.4 | 70.4 | 91.92 | 172.53 | 200.3 |
| | Speedup | 1× | 9.28× | 0.52× | 9.49× | 14.23× |
| | Energy efficiency ratio | 1× | 8.38× | 10.94× | 20.54× | 23.85× |
| PDF1.7 (Acrobat 9) | Frequency (MHz) | 3000 | 1530 | N/A | N/A | 200 |
| | Speed (MH/s) | 359.8 | 5029.5 | N/A | N/A | 13,200 |
| | Power (W) | 201.53 | 289 | N/A | N/A | 196.48 |
| | Energy efficiency (MH/J) | 1.79 | 17.4 | N/A | N/A | 67.18 |
| | Speedup | 1× | 13.98× | N/A | N/A | 36.69× |
| | Energy efficiency ratio | 1× | 9.72× | N/A | N/A | 37.53× |
| Winzip | Frequency (MHz) | 3000 | 1530 | N/A | 200 | 250 |
| | Speed (KH/s) | 536.4 | 5810 | N/A | 4341.22 | 8250 |
| | Power (W) | 222.7 | 272 | N/A | 105.71 | 152.36 |
| | Energy efficiency (KH/J) | 2.41 | 21.36 | N/A | 41.07 | 54.15 |
| | Speedup | 1× | 10.83× | N/A | 8.09× | 15.38× |
| | Energy efficiency ratio | 1× | 8.86× | N/A | 17.04× | 22.47× |
| RAR5 | Frequency (MHz) | 3000 | 1530 | 190 | 200 | 200 |
| | Speed (KH/s) | 8.591 | 85.109 | 5.711 | 121.96 | 207.33 |
| | Power (W) | 238.75 | 283 | 12.95 | 106.59 | 176.94 |
| | Energy efficiency (H/J) | 35.98 | 300.74 | 441 | 1144.15 | 1171.75 |
| | Speedup | 1× | 9.91× | 0.66× | 14.20× | 24.13× |
| | Energy efficiency ratio | 1× | 8.36× | 12.26× | 31.80× | 32.57× |
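The derived columns in Tables 5 and 7 follow directly from the measured speed and power figures: energy efficiency is speed divided by power, and the speedup and efficiency ratios are normalized to the Hashcat-on-CPU baseline. The short Python sketch below (with hypothetical helper names, not from the paper) reproduces the RAR5 "This Work" column up to rounding.

```python
def derived_metrics(speed, power, base_speed, base_power):
    # speed in hashes/s, power in watts; baseline is Hashcat on CPU
    eff = speed / power                          # energy efficiency, H/J
    speedup = speed / base_speed                 # normalized speed
    eff_ratio = eff / (base_speed / base_power)  # normalized efficiency
    return eff, speedup, eff_ratio

# RAR5 worked example: CPU baseline 8.591 KH/s at 238.75 W;
# "This Work" 207.33 KH/s at 176.94 W.
eff, speedup, ratio = derived_metrics(207_330, 176.94, 8_591, 238.75)
print(f"{eff:.2f} H/J, {speedup:.2f}x, {ratio:.2f}x")
# -> ~1171.75 H/J, ~24.13x, ~32.56x
# (Table 7 lists 32.57x, computed from the rounded 35.98 H/J baseline.)
```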
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
