Article

A Dynamic and Static Binary Translation Method Based on Branch Prediction

1
School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi’an 710021, China
2
Shaanxi Joint Laboratory of Artificial Intelligence, Shaanxi University of Science and Technology, Xi’an 710021, China
*
Author to whom correspondence should be addressed.
Electronics 2023, 12(14), 3025; https://doi.org/10.3390/electronics12143025
Submission received: 28 May 2023 / Revised: 6 July 2023 / Accepted: 7 July 2023 / Published: 10 July 2023

Abstract

Binary translation is an important technique for achieving cross-architecture software migration. However, mainstream dynamic binary translation frameworks, such as QEMU, often generate a large amount of redundant code, which degrades the efficiency of the target code. To this end, we propose a dynamic–static binary translation method based on branch prediction. It first identifies parts of translation blocks following static branch prediction techniques. Then it translates these translation blocks into less-redundant native code blocks by canonical static translation algorithms. Finally, it executes all code blocks that are translated either statically or dynamically by correctly maintaining and switching their running contexts. In order to correctly weave the two types of translation activities, the proposed method only translates the next translation block that is data-independent from the current one by the active variable analysis algorithm, and records and shares the intermediate states of the dynamic and static translation activities via a carefully designed data structure. In particular, a shadow register-based context recovery mechanism is proposed to correctly record the running context of static translation blocks, and to correctly recover the context for dynamically translating and running blocks that were not statically translated. We also designed an adaptive memory optimization mechanism to dynamically release the memory of the mispredicted translation blocks. We implemented a dynamic–static binary translation framework by extending QEMU, called BP-QEMU (QEMU with branch prediction). We evaluated the translation correctness of BP-QEMU using the testing programs for the ARM and PPC instruction sets from QEMU, and evaluated the performance of BP-QEMU using the CoreMark benchmark code. 
The experimental results show that BP-QEMU can translate the instructions from the ARM and PPC architectures correctly; moreover, the average execution efficiency of the CoreMark code on BP-QEMU improves by 13.3% compared to that of QEMU.

1. Overview

Binary translation is a technique that automatically translates code from a target architecture into functionally equivalent code for a host architecture [1]. Binary translation techniques have a wide range of applications in legacy software system migration [2], program behavior analysis [3], and system virtualization.
Binary translation methods mainly include static translation methods and dynamic translation methods [4]. Static translation methods conduct translation before the target code is executed, so the translation activity does not occupy the program's execution time. Code generated by static translation methods usually exhibits high quality and high execution efficiency. However, static translation methods require an independent interpreter to execute the statically translated code blocks and cannot address the issues of self-modifying code, code mining, and precise interrupts [5,6,7,8,9,10]. Dynamic translation methods translate and execute the target code at the same time, so the translation activity occupies part of the program's execution time. In return, dynamic translation methods can effectively address the issues faced by static translation methods, such as self-modifying code.
Many research studies on dynamic translation methods have been published [11,12,13,14,15,16,17]. Dynamic translation frameworks, such as QEMU [17], are widely used in simulators, virtual machines, and disassemblers, effectively meeting the requirements of cross-architecture software migration. However, dynamic translation methods are difficult to optimize thoroughly because they lack global information about the program. The dynamically generated code often includes a large amount of redundant code, so practical applications still face performance degradation and increased memory usage. Several optimization methods for dynamic binary translation have been proposed to address these issues, such as hot path optimization [18], register mapping optimization [19,20], multi-threaded parallel optimization [17], and memory access optimization [21]. Díaz et al. proposed a joint hardware/software simulation virtual platform scheme to address the difficulty of system verification for heterogeneous and homogeneous architectures and implemented a parallelized QEMU prototype [22]. Song et al. used code activity analysis methods to reduce code bloat by removing redundant instructions in QEMU [23]; this improved execution efficiency, but increased the management overhead for dynamic translation.
Given that static binary translators generate codes with better efficiency and dynamic binary translations maintain better integrity, the combination of dynamic and static binary translation techniques has attracted the attention of researchers. Wang et al. proposed a dynamic binary translation optimization method with static pre-translation [24]. This approach involves the pre-translation of the entire source program; saving the pre-translation results in memory can reduce the time required for dynamic translation and, thus, improve the execution efficiency of the program. However, the memory space occupied during execution is large.
In order to improve the quality and efficiency of dynamic binary translation, inspired by the branch prediction technique for speeding up the execution in a pipelined processor, this paper proposes a branch prediction based dynamic–static binary translation method and extends the QEMU framework to implement a dynamic–static translation framework called BP-QEMU (branch prediction-enabled QEMU). BP-QEMU uses branch prediction technology to statically translate part of the target code before the code runs, which improves the translation quality and efficiency of dynamic translation; BP-QEMU uses just-in-time dynamic translation (JDT) technology when the target code is executed, which breaks the limitations of static translation methods, such as the inability to handle indirect jumps and resolve self-modifying code. BP-QEMU achieves more efficient binary translation by combining dynamic and static translation techniques.
In summary, the contributions of this paper are as follows:
  • We propose a branch prediction-based dynamic–static binary translation method.
  • We implemented a BP-QEMU prototype framework by extending the QEMU binary translation framework to facilitate the proposed method.
  • We designed and conducted experiments to validate the correctness and the performance of the BP-QEMU.

2. Preliminaries

This section briefly introduces the translation principles of QEMU and the main idea of branch prediction techniques for understanding the rest of this paper.

2.1. Dynamic Translation Principles of QEMU

QEMU, a widely-used dynamic binary translator, automatically converts target code designed for various processors into native code compatible with the host processor, by utilizing the TCG (Tiny Code Generator) middleware [25]. TCG masks semantic differences between processors and enables software emulation of processors of different architectures. TCG uses a streamlined instruction set, which includes instructions for data transfer, arithmetic operations, logic operations, program control, and other types of instructions [26,27,28]. QEMU first translates the target code into TCG instructions and then uses TCG middleware to translate these TCG instructions into the native code, which can be executed by the host machine. TCG middleware effectively separates the target machine from the host machine, greatly improving the portability of the QEMU platform.
QEMU treats a translation block as its basic unit of translation [27]. Each translation block consists of sequential binary instructions loaded at runtime from disk or memory, which are translated, interpreted, and executed in the QEMU virtual machine. QEMU translates the target instructions into TCG instructions sequentially, starting from the entry of each translation block, until it encounters a jump instruction, a system call, or the edge of a page, which indicates the end of translation for that translation block [29]. Multiple translation blocks are connected by inter-block jump instructions. In a typical program code containing basic structures of sequences, branches, and loops, there may be both unconditional and conditional jumps between translation blocks. As shown in Figure 1, during the execution of translation block tb1, if the instruction (jz tb2) is satisfied, it will jump to translation block tb2, and if the instruction (jmp tb3) is satisfied, it will jump to translation block tb3; during the execution of code block tb2, there is no branch jump instruction in the block, and it will jump directly to tb3; during the execution of translation block tb3, if the instruction (jmp tb1) is satisfied, it will jump back to translation block tb1; otherwise, it will exit.
QEMU processes translation blocks in three stages: query, translation, and execution. As shown in Figure 2, QEMU first queries the translation block to be translated. If the block has already been translated, QEMU skips the translation stage and directly executes the block; if the block has not been translated, QEMU translates it, stores the translation result in memory, and finally executes the block in its virtual machine.
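The query, translation, and execution stages can be illustrated with a minimal Python sketch. It is not QEMU code: the names `run`, `translate`, and `make_toy_translator` are hypothetical, and a "translated block" is modeled as a Python callable that returns the guest PC of the next block (or `None` to exit).

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-in: a translated block is a callable that executes the
# block and returns the guest PC of the next block, or None to exit.
TranslatedBlock = Callable[[], Optional[int]]


def run(entry_pc: int,
        translate: Callable[[int], TranslatedBlock],
        cache: Optional[Dict[int, TranslatedBlock]] = None) -> int:
    """Query -> translate (on a miss) -> execute, until a block returns None.

    Returns the number of blocks that had to be translated.
    """
    cache = {} if cache is None else cache
    translations = 0
    pc: Optional[int] = entry_pc
    while pc is not None:
        block = cache.get(pc)        # query stage
        if block is None:            # miss: translation stage
            block = translate(pc)
            cache[pc] = block
            translations += 1
        pc = block()                 # execution stage yields the next guest PC
    return translations


def make_toy_translator(loop_count: int) -> Callable[[int], TranslatedBlock]:
    """Toy three-block program: 1 -> 2 -> 3, and 3 jumps back to 1 loop_count times."""
    state = {"remaining": loop_count}

    def translate(pc: int) -> TranslatedBlock:
        if pc == 1:
            return lambda: 2
        if pc == 2:
            return lambda: 3

        def tb3() -> Optional[int]:
            if state["remaining"] > 0:
                state["remaining"] -= 1
                return 1
            return None
        return tb3

    return translate
```

In this toy program the loop body is executed several times, but each block is translated only once because later visits hit the cache in the query stage.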

2.2. Branch Prediction

Branch prediction is the main element of modern processor dynamic execution techniques; it is responsible for predicting the branch outcomes and branch target addresses that may be executed after conditional instructions, allowing for fetching and executing of predicted instructions in advance [30,31]. Branch prediction helps reduce stalls in pipelined processors caused by prefetching incorrect instructions and improves the efficiency of the processor. Similarly, if the next translation block after conditional jumps can be predicted, it can be predictively translated to improve the quality and efficiency of dynamic binary translation.
There are two main types of branch predictors: static predictors [32] and dynamic predictors [33]. Static predictors predict the branch outcome and branch target address at the compilation stage and conduct appropriate jumps according to the predicted target address at runtime. Dynamic predictors predict the branch outcome and branch target address at runtime. Static predictors are cheaper to evaluate but less accurate than dynamic predictors. Commonly used static predictors include (1) always-not-taken and (2) always-taken. Always-not-taken predictors always predict the result of the conditional instruction as false, and the prediction accuracy rate is about 30–40% in practice. Always-taken predictors always predict the result of the conditional instruction as true, and the prediction accuracy rate is about 60–70% in practice. In this paper, the always-taken static predictor is introduced into the dynamic binary translation process due to its satisfactory accuracy, better performance, and simpler implementation. Note that the main purpose of this paper is to explore the feasibility of improving the quality and efficiency of dynamic binary translation by branch prediction technology. Further study on integrating more sophisticated static branch predictors for binary translation, such as those introduced in the literature [34,35], falls outside the scope of this paper.
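The two static predictors can be sketched in a few lines of Python. The `CondBranch` type, the function names, and the loop trace below are illustrative assumptions, not part of BP-QEMU; the trace is invented only to show how accuracy would be measured and is not experimental data.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class CondBranch:
    taken_target: int   # address reached when the condition holds
    fallthrough: int    # address of the next sequential block


def static_always_taken(branch: CondBranch) -> int:
    """Always predict that the condition is true."""
    return branch.taken_target


def static_always_not_taken(branch: CondBranch) -> int:
    """Always predict that the condition is false."""
    return branch.fallthrough


def accuracy(predict: Callable[[CondBranch], int],
             trace: List[Tuple[CondBranch, int]]) -> float:
    """Fraction of (branch, actual next address) pairs predicted correctly."""
    hits = sum(1 for branch, actual in trace if predict(branch) == actual)
    return hits / len(trace)


# Invented trace: a loop back-edge that is taken 9 times and then falls through.
back_edge = CondBranch(taken_target=0x100, fallthrough=0x140)
loop_trace = [(back_edge, 0x100)] * 9 + [(back_edge, 0x140)]
```

For a loop-dominated trace like this one, the always-taken predictor is correct on every iteration except the exit, which is the intuition behind its higher accuracy on typical code.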

3. Branch Prediction-Based Dynamic–Static Binary Translation Method

This section introduces the overall idea of the branch prediction-based dynamic–static binary translation method, the architecture of BP-QEMU, and three novel designs for correctly integrating static and dynamic translation activities.

3.1. Methodology Overview

As shown in Figure 3, the dynamic–static binary translation method based on branch prediction includes two phases: phase I for static predictive translation and phase II for dynamic translation and dynamic–static integration.
In phase I, the translation engine first traverses the target code file, finds the jump instructions, calibrates the entry and boundary of translation blocks, predicts the jump translation blocks, filters the jump translation blocks that are data-independent of the preceding translation blocks, performs static binary translation, and saves the translation results in memory.
In phase II, the translation engine must fully traverse translation blocks in the target binary code file and use the dynamic translation technique to process the translation blocks not processed in the static translation stage. After completing the execution of a code block that is translated statically or dynamically, the translation engine extracts the actual PC pointer of the next translation block to be translated and compares it with the predicted jump address of the translation block to determine whether the next translation block has undergone predictive static translation. If so, the prediction is considered correct and the static translation result of the next translation block can be loaded directly into memory and executed. Otherwise, it is considered to be a prediction error. When the current translation block is translated statically and then a prediction error emerges, it is necessary to recover from the context of executing the current statically translated block to the context required for translating the next translation block using dynamic translation technology. During this process, the dynamic translation results and the static translation results can be optionally serialized to generate a binary file that can be executed on the host machine.

3.2. BP-QEMU System Architecture

QEMU is an open-source dynamic binary translation framework that is widely used in the field of dynamic binary translation. In order to facilitate the adoption of the proposed method in practice, this paper designs and implements a dynamic–static binary translation framework called BP-QEMU by extending QEMU. This section presents the architecture of BP-QEMU, which extends the original architecture of QEMU, by adding two modules: a static translation module and a dynamic–static integration module. BP-QEMU comprehensively uses static binary translation technology and dynamic binary translation technology to jointly optimize the translation of target code.
As shown in Figure 4, BP-QEMU adds a static branch prediction module before the front-end decoding module of QEMU. This module performs predictive translation of the pre-processed code, finds the branch instruction in the binary instruction stream, determines the jump translation block after the branch instruction by static branch prediction, performs predictive static translation, and stores the translated code in memory. BP-QEMU modifies the native QEMU front-end module (BP-QEMU front-end decoder) to filter out the translation blocks that need to be dynamically translated, translates each translation block into a set of TCG instructions, and feeds the intermediate translation results into the TCG middleware.
BP-QEMU inserts a new dynamic and static integrator module after the native QEMU backend dynamic translator module. The original QEMU dynamic translation module dynamically translates the TCG instruction set into code that can be executed on the host computer, and then executes the dynamically translated code. The dynamic and static integrator module correctly orchestrates the execution of the dynamic and static translation results, maintains the contextual environment in which the dynamic and static code blocks are executed, records the dynamic and static translation results, and generates the binary files that can be executed on the host computer.
In addition, BP-QEMU uses a shared memory structure to enable inter-module communication, which allows different modules to perform static and dynamic translations independently, ensuring data consistency across translation blocks that are translated in different methods, and orchestrating the execution of these translation blocks in the correct order. The shared memory structure, the static translation module, and the dynamic translation module are described in detail below.

3.3. Shared Memory Structure

In order to record the intermediate results of both static and dynamic translation activities, which enables the correct integration of dynamic and static binary translation results, this paper designs a storage structure, called SbpStruct, shared by static and dynamic modules. Each instance of SbpStruct for each translation block records necessary information for ensuring correct processing during the dynamic and static integration phase.
As shown in Table 1, the member variable tb_size records the size of the current translation block and the member variable tb_predict_pc records the predicted jump address, i.e., the address of the next translation block after the execution of the current translation block; the member variable tb_translated records whether the current translation block is translated; the member variable tb_continuous_failures records the number of consecutive prediction failures, which is used to dynamically adjust the behavior of the static and dynamic integration module; the member variable tb_jump_valid records the validity of the jump address, and is used in conjunction with tb_continuous_failures to achieve adaptive adjustment of the behavior of the static and dynamic integration module; tb_independence indicates whether the data between the current translation block and the next translation block are independent or not. This information is used for identifying translation blocks that can be safely translated using static translation methods.
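As a rough illustration, the fields described above could be grouped as follows. This is a sketch only: the field names come from the text, but the types, default values, and the helper `prediction_hit` are assumptions, since Table 1 is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class SbpStruct:
    # Field names follow the text; types and defaults are assumed.
    tb_size: int = 0                  # size of the current translation block
    tb_predict_pc: int = 0            # predicted address of the next block
    tb_translated: bool = False       # whether the block has been translated
    tb_continuous_failures: int = 0   # consecutive prediction failures
    tb_jump_valid: bool = True        # whether the jump address is valid
    tb_independence: bool = False     # data-independent of the next block?

    def prediction_hit(self, real_pc: int) -> bool:
        """Hypothetical helper: did execution actually reach the predicted block?"""
        return self.tb_jump_valid and real_pc == self.tb_predict_pc
```

Keeping one such record per translation block lets the static and dynamic modules exchange state without calling each other directly.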

3.4. Static Translation Module

The static translation module consists of two submodules: the static branch prediction module and the static binary translation module. In BP-QEMU, the static binary translation module adopts the classical peephole optimization and variable activity analysis algorithms, which have been thoroughly introduced in the literature [36,37]. It is not necessary to introduce them here again. This section will only introduce the branch prediction module in detail.
In the process of static predictive translation, there is a situation where the jumped translation block refers to the variables defined or modified in the previous translation block; that is, there may be data dependence between the two adjacent translation blocks. If the jumped translation block is directly translated statically, some of its variables may be inconsistent with those in the preceding block, particularly when the preceding block is updated during dynamic execution and, therefore, the dynamic and static translation results cannot be correctly integrated. To this end, this paper designs a data-independent judgment algorithm for adjacent translation blocks, which filters data-independent translation blocks for static translation to avoid data inconsistency when integrating dynamic and static translation results. In this paper, the accuracies of branch prediction, current running state, jump address, and other information are recorded in the shared storage structure. The active variable analysis algorithm is used to determine the data independence of adjacent translation blocks and determine the translation blocks that can perform static binary translation.
To illustrate the active variable analysis method, the following concepts are first introduced. If the variable $x$ is referenced at one of the input paths of a translation block $B$, $x$ is said to be active at the input of $B$, and the set of all variables active at the input of $B$ is denoted as $\mathit{IN}_B$; if the variable $x$ is referenced on one of the output paths of $B$, $x$ is said to be active at the output of $B$, and the set of all variables active at the output of $B$ is denoted as $\mathit{OUT}_B$; if the variable $x$ does not belong to either $\mathit{IN}_B$ or $\mathit{OUT}_B$, variable $x$ is said to be inactive in $B$. The data flow equation for the active variables is as follows:
$$\mathit{IN}[\mathit{EXIT}] = \emptyset \tag{1}$$
$$\mathit{IN}_B = f_B(\mathit{OUT}_B) \tag{2}$$
$$f_B(x) = \mathit{use}_B \cup (x - \mathit{def}_B) \tag{3}$$
From Equations (2) and (3), it follows that:
$$\mathit{IN}_B = \mathit{use}_B \cup (\mathit{OUT}_B - \mathit{def}_B) \tag{4}$$
where $\mathit{def}_B$ denotes the set of variables that are assigned values in translation block $B$, which are not referenced in $B$ before being assigned; $\mathit{use}_B$ denotes the set of variables that are referenced in translation block $B$ but are not assigned values in $B$ before being referenced.
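A small worked example may help to make the transfer function concrete. The Python sketch below derives the use and def sets of a single block according to the definitions above and then computes the input set as use ∪ (OUT − def). The statement encoding (a pair of the assigned variable and the list of variables read) and the function names are assumptions made for illustration.

```python
from typing import Iterable, List, Set, Tuple

Stmt = Tuple[str, List[str]]  # (assigned variable, variables read)


def use_def(block: Iterable[Stmt]) -> Tuple[Set[str], Set[str]]:
    """Compute (use, def) for one block, following the definitions in the text."""
    use: Set[str] = set()
    defs: Set[str] = set()
    for dest, srcs in block:
        # Referenced before being assigned in this block -> belongs to use.
        use |= {v for v in srcs if v not in defs}
        # Assigned without an earlier reference in this block -> belongs to def.
        if dest not in use:
            defs.add(dest)
    return use, defs


def live_in(use: Set[str], defs: Set[str], out: Set[str]) -> Set[str]:
    """IN = use ∪ (OUT − def)."""
    return use | (out - defs)


# Block B:  t = a + b ;  a = t * c
block_b: List[Stmt] = [("t", ["a", "b"]), ("a", ["t", "c"])]
use_b, def_b = use_def(block_b)
in_b = live_in(use_b, def_b, out={"a", "t", "d"})
```

Here `t` is assigned before it is read, so it is in the def set and is removed from the output set, while `d` passes through the block untouched and therefore remains active at its input.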
By analyzing the active variables of two adjacent translation blocks, we determine whether the data between them are dependent or not. Suppose that tb2 is the predicted translation block pointed to by the jump pointer of tb1 (tb_predict_pc). $\mathit{OUT}_{tb1}$ and $\mathit{IN}_{tb2}$ are derived from the active variable analysis algorithm; if the intersection of $\mathit{OUT}_{tb1}$ and $\mathit{IN}_{tb2}$ is not empty, the variables in tb2 depend on the output result of tb1, and the early translation of tb2 may cause a data inconsistency problem. Therefore, tb2 will be skipped in BP-QEMU, and its translated flag is set to false (tb_translated = false), indicating that it is waiting for dynamic binary translation. Otherwise, BP-QEMU sets the translated flag to true (tb_translated = true), performs the static binary translation of tb2 in advance, and stores the static translation results in the virtual memory, where they are later called by the dynamic and static integration module. The detailed procedure is shown in Algorithm 1. In Algorithm 1, tb.jump.pc indicates the address of the next translation block after the current tb block; After(tb) denotes all possible succeeding translation blocks of the tb block; static_always_taken(tb) uses the always_taken strategy of the static branch prediction to predict the next translation block of the tb block; static_binary_translation(tb) translates the given tb block statically.
This paper currently only uses the simple static branch prediction algorithm to predict the next translation block, with the purpose of exploring the feasibility of improving dynamic binary translation methods via static branch prediction techniques. More efficient static branch prediction algorithms and data-independent judgment algorithms can be explored in future work to further improve the overall translation performance of BP-QEMU.
Algorithm 1 Judge_Independence
Require: Translation block collection $TB = \{tb_1, \ldots, tb_n\}$
Begin
for each tb in TB && tb.jump.pc != null do
    $\mathit{IN}_{tb} = \emptyset$
end for
while $\mathit{IN}_{tb}$ changes for some tb ∈ TB do
    // Determine if the data between adjacent translation blocks are dependent
    for each tb in TB && tb.jump.pc != null do
        $\mathit{OUT}_{tb} = \bigcup_{s \in After(tb)} \mathit{IN}_s$
        $\mathit{IN}_{tb} = \mathit{use}_{tb} \cup (\mathit{OUT}_{tb} - \mathit{def}_{tb})$
    end for
end while
// Loop through the set TB again to see which predicted block can be statically translated
for each tb1 in TB && tb1.jump.pc != null do
    tb2 = static_always_taken(tb1)
    if $\mathit{OUT}_{tb1} \cap \mathit{IN}_{tb2} \neq \emptyset$ then
        // tb2 depends on the output of tb1: leave it to dynamic translation
        tb1.SbpStruct.tb_independence = False
    else
        tb1.SbpStruct.tb_independence = True
        tb1.SbpStruct.tb_predict_pc = tb2
        // Start the static binary translation for the predicted block tb2
        static_binary_translation(tb2)
    end if
end for
return
End
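The independence test can be sketched in Python as follows. The sketch follows the prose description (a predicted block is independent when the intersection of the two sets is empty); the `Block` class, its fields, and the toy control-flow graph are illustrative assumptions rather than BP-QEMU data structures.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


class Block:
    """Hypothetical summary of one translation block."""

    def __init__(self, use: Iterable[str], defs: Iterable[str],
                 succs: Iterable[str], taken: Optional[str] = None) -> None:
        self.use: Set[str] = set(use)
        self.defs: Set[str] = set(defs)
        self.succs = list(succs)   # After(tb): all possible successors
        self.taken = taken         # successor chosen by always-taken, if any


def liveness(blocks: Dict[str, Block]) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Iterate OUT = union of successor INs, IN = use ∪ (OUT − def) to a fixed point."""
    live_in: Dict[str, Set[str]] = {name: set() for name in blocks}
    live_out: Dict[str, Set[str]] = {name: set() for name in blocks}
    changed = True
    while changed:
        changed = False
        for name, blk in blocks.items():
            out: Set[str] = set().union(*(live_in[s] for s in blk.succs))
            new_in = blk.use | (out - blk.defs)
            live_out[name] = out
            if new_in != live_in[name]:
                live_in[name] = new_in
                changed = True
    return live_in, live_out


def judge_independence(blocks: Dict[str, Block]) -> Dict[str, bool]:
    """Map each block with a predicted successor to True if that successor is independent."""
    live_in, live_out = liveness(blocks)
    return {
        name: not (live_out[name] & live_in[blk.taken])
        for name, blk in blocks.items()
        if blk.taken is not None
    }


toy_cfg = {
    "tb1": Block(use={"a"}, defs={"b"}, succs=["tb2", "tb3"], taken="tb2"),
    "tb2": Block(use=set(), defs={"c"}, succs=["tb3"], taken="tb3"),
    "tb3": Block(use={"c"}, defs=set(), succs=[]),
}
independence = judge_independence(toy_cfg)
```

In the toy graph, tb2 reads nothing that is active at its input, so it is judged independent of tb1, whereas tb3 reads `c` and is left to dynamic translation. Note that because the predicted block is itself a successor of tb1, its input set is contained in the output set of tb1 under these equations, so in this sketch the test succeeds only when the predicted block has an empty input set.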

3.5. Dynamic and Static Integration Module

In QEMU, all translation blocks must be dynamically translated before they can be executed on the host. The QEMU translation engine contains a main loop to process each translation block one at a time. First, it obtains the dynamic translation context (including the current CPU status, etc.), then it dynamically translates the translation block and executes it on the host. Finally, it automatically updates the dynamic translation context according to the execution of the translation block, and returns to the main loop to process the next translation block, ensuring the correctness and continuity of program execution.
In BP-QEMU, some translation blocks are statically translated and can be executed on the host without dynamic translation. When executing statically translated translation blocks, BP-QEMU can adopt an offline execution mode where multiple statically translated translation blocks are executed back to back, avoiding the overhead of repeatedly returning to the dynamic translation main loop. However, executing static translation blocks in this way means that the dynamic translation context is not updated in a timely manner, which would disrupt the subsequent dynamic translation process. Therefore, this section introduces a dynamic translation context recovery mechanism for correctly switching between the efficient offline execution mode for the statically translated blocks and the online translation and execution mode for the dynamically translated blocks. In addition, statically translating and saving some translation blocks in memory increases the memory overhead of BP-QEMU. Therefore, this section also introduces a memory optimization mechanism to dynamically clean the mispredicted static translation results that are rarely integrated with dynamically translated translation blocks.

3.5.1. Context Recovery Mechanism

The BP-QEMU framework executes the statically translated code blocks in a so-called offline mode in order to improve the overall efficiency of the simulation. There is no need to update the contextual information of the QEMU simulation architecture periodically during offline execution. However, the latest CPU information has to be obtained when moving from the execution of static binary translation code to the execution of dynamic binary translation code. In particular, when the current static translation block tb contains multiple jump branches, the static branch prediction mechanism cannot completely avoid the prediction failure, and when the prediction fails, the dynamic translation context must be restored, and the dynamic translation main loop must be restarted to determine the next block for dynamic translation. To this end, this paper designs a shadow register-based dynamic translation context recovery mechanism to record the CPU states of offline execution and restore the dynamic translation context when necessary. This ensures that all translation blocks can be translated and executed in their appropriate context.
BP-QEMU defines the shadow registers as an array of 64-bit elements, which satisfies the needs of both 64-bit and 32-bit processor architectures. For a 32-bit processor architecture, only the lower 32 bits of each element are used as the shadow register. This software-defined register approach provides a high degree of flexibility: the length and the number of shadow registers vary automatically depending on the selected architecture. For example, if the target is an X86 architecture, the number of shadow registers can be defined as 9, and if the target is an ARMv7 architecture, the number of shadow registers can be defined as 17. BP-QEMU supports up to 32 64-bit shadow registers, which meets the needs of most architectures on the market.
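A software-defined shadow register file of this kind can be sketched as follows. The class name `ShadowRegs` and its interface are hypothetical; only the limits (up to 32 registers, 64-bit storage, the lower 32 bits used for 32-bit targets) are taken from the text.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1
MAX_SHADOW_REGS = 32  # upper bound stated in the text


class ShadowRegs:
    """Hypothetical shadow register file backed by 64-bit slots."""

    def __init__(self, count: int, width_bits: int) -> None:
        if not 0 < count <= MAX_SHADOW_REGS:
            raise ValueError("unsupported number of shadow registers")
        if width_bits not in (32, 64):
            raise ValueError("width must be 32 or 64 bits")
        # For 32-bit targets only the lower 32 bits of each slot are used.
        self._mask = MASK32 if width_bits == 32 else MASK64
        self._regs = [0] * count

    def write(self, index: int, value: int) -> None:
        self._regs[index] = value & self._mask

    def read(self, index: int) -> int:
        return self._regs[index]

    def __len__(self) -> int:
        return len(self._regs)


# 17 registers for an ARMv7 target, as in the example in the text.
armv7 = ShadowRegs(count=17, width_bits=32)
armv7.write(15, 0x1_0000_8000)  # upper bits are discarded on a 32-bit target
```

The same class covers a 64-bit target by passing `width_bits=64`, which is the flexibility the array-based design aims for.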
Figure 5 shows the mapping between the shadow registers and the processor status registers of the ARMv7 architecture. This set of shadow registers maintains the current state of the processor during offline execution. For example, one of them, ShadowReg[15], holds the actual PC pointer of the current program. After executing a translation block, the actual PC pointer stored in ShadowReg[15] can be compared with the predicted PC pointer tb_predict_pc of the current translation block. If they are equivalent to each other, it indicates that the next translation block has been correctly predicted and translated, and the host machine will continue to execute the next translation block without jumping back to the main loop of the dynamic translation process. If they are not equivalent to each other, it indicates that the static branch prediction is wrong and the next translation block pointed out by ShadowReg[15] was not statically translated in advance. In this case, the current shadow registers must be restored to the dynamic translation context for the translation engine to correctly restart a dynamic translation process for the next translation block. The details of this process are shown in Algorithm 2.
Algorithm 2 ExecStaticBlock_RestoreContext
Require: tb: a statically translated block; ShadowReg[0…16]: the array of shadow registers
Ensure: env: dynamic translation context
Begin
// Execute a static translation block on the host machine in an offline mode
cpu_tb_exec(tb)
Update ShadowReg[0…16]
tb1 = tb.SbpStruct.tb_predict_pc
if ShadowReg[15] == tb1 then
    // Correct prediction: continue with the next statically translated block
    ExecStaticBlock_RestoreContext(tb1, ShadowReg[0…16])
else
    // Restore the dynamic translation context using shadow registers
    env->reg[] = ShadowReg[0…16]
    // Return to the main loop of the dynamic translation to process the block at ShadowReg[15]
    Dynamic_translation(ShadowReg[15], env)
end if
End
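The control flow of the context recovery mechanism can be sketched iteratively in Python. Everything here is a simplified assumption: static blocks are modeled as callables that update a shadow register list, `env` is a plain dictionary, and the extra check that the predicted block actually exists in the static cache is added for safety and is not stated in the text.

```python
from typing import Callable, Dict, List

PC_REG = 15  # ShadowReg[15] holds the actual PC in the ARMv7 mapping

StaticBlock = Callable[[List[int]], None]


def exec_static_chain(entry_pc: int,
                      static_blocks: Dict[int, StaticBlock],
                      predict_pc: Dict[int, int],
                      shadow: List[int],
                      env: Dict[str, List[int]]) -> int:
    """Run static blocks back to back while predictions hold.

    Returns the guest PC at which the dynamic translation main loop must resume.
    """
    pc = entry_pc
    while True:
        static_blocks[pc](shadow)   # offline execution updates the shadow registers
        real_pc = shadow[PC_REG]
        # Assumed guard: only chain if the predicted block was statically translated.
        if real_pc == predict_pc.get(pc) and real_pc in static_blocks:
            pc = real_pc            # correct prediction: stay in offline mode
            continue
        env["regs"] = list(shadow)  # misprediction: restore the dynamic context
        return real_pc


def tb_100(regs: List[int]) -> None:
    regs[0] = 1
    regs[PC_REG] = 0x200


def tb_200(regs: List[int]) -> None:
    regs[0] += 1
    regs[PC_REG] = 0x400            # actual target differs from the prediction


shadow_regs = [0] * 17
dyn_env: Dict[str, List[int]] = {"regs": [0] * 17}
resume_pc = exec_static_chain(
    0x100,
    static_blocks={0x100: tb_100, 0x200: tb_200},
    predict_pc={0x100: 0x200, 0x200: 0x300},
    shadow=shadow_regs,
    env=dyn_env,
)
```

In this run the first prediction holds, so the second block executes without returning to the main loop; the second prediction fails, so the shadow registers are copied into the dynamic context and execution resumes dynamically at the actual PC.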

3.5.2. Memory Optimization Mechanism

In the actual translation process, QEMU stores the dynamic translation results of translation blocks in a hash table in memory and uses a FIFO mechanism to maintain the hash table for better memory utilization. In contrast, BP-QEMU statically translates some translation blocks and stores the results in memory, which inevitably increases the consumption of limited memory resources. Therefore, BP-QEMU includes a memory optimization mechanism that dynamically cleans cached static translation results that are mispredicted and rarely used, in order to mitigate memory consumption.
The basic idea of the proposed memory optimization mechanism is as follows. After each execution of a translation block, the prediction success or failure of the next translation block can be judged by comparing the actual jumping pointer real_pc with the predicted jumping pointer tb_predict_pc. If the prediction is successful (real_pc == tb_predict_pc), tb_continuous_failures is set to 0 as the reward for the successful prediction. If the prediction fails, tb_continuous_failures is incremented by 1 as the penalty for the failed prediction. If the number of consecutive prediction failures reaches the set threshold N, the value of tb_translated is set to false and the memory occupied by this static translation block is released, as shown in Algorithm 3.
The memory optimization mechanism for static translation results can reduce the memory consumption by mispredicted translation results. However, if several continuous prediction failures are followed by a successful prediction, the translation engine has to dynamically translate some translation blocks that have been statically translated, which will decrease the overall performance of BP-QEMU.
Algorithm 3 StaticCache_Optimization
Require: tb: statically translated block; real_pc: actual jumping pointer
Begin
   tb1 = tb.SbpStruct.tb_predict_pc
   // Correct prediction
   if real_pc == tb1 then
      tb.SbpStruct.tb_continuous_failures = 0
   else
      // Continuous prediction error
      tb.SbpStruct.tb_continuous_failures++
      if tb.SbpStruct.tb_continuous_failures >= N then
         tb1.SbpStruct.tb_translated = False
         // Release the memory footprint of the current block of code
         release_memory(tb1)
      end if
   end if
end
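The bookkeeping of Algorithm 3 can be sketched in a few lines of Python. This is only an illustrative model, not the BP-QEMU implementation: the `TranslationBlock` class, the `blocks` lookup table, the `static_code` field, and the threshold value 3 are assumptions introduced here, and the sketch assumes that tb1 denotes the predicted next block found through tb_predict_pc.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

N = 3  # hypothetical threshold; no concrete value is given in the text


@dataclass
class SbpStruct:
    # Subset of the member variables listed in Table 1.
    tb_predict_pc: Optional[int] = None
    tb_continuous_failures: int = 0
    tb_translated: bool = False


@dataclass
class TranslationBlock:
    # Hypothetical stand-in for a translation block.
    pc: int
    sbp: SbpStruct = field(default_factory=SbpStruct)
    static_code: Optional[bytes] = None  # stand-in for cached static code


def static_cache_optimization(
    tb: TranslationBlock,
    real_pc: int,
    blocks: Dict[int, TranslationBlock],
    threshold: int = N,
) -> bool:
    """Update the failure counter of tb; return True if memory was released."""
    predicted_pc = tb.sbp.tb_predict_pc
    if real_pc == predicted_pc:
        # Correct prediction: reset the counter as a reward.
        tb.sbp.tb_continuous_failures = 0
        return False
    # Failed prediction: increase the counter as a penalty.
    tb.sbp.tb_continuous_failures += 1
    if tb.sbp.tb_continuous_failures >= threshold:
        predicted = blocks.get(predicted_pc)
        if predicted is not None and predicted.sbp.tb_translated:
            # Drop the static translation of the mispredicted block.
            predicted.sbp.tb_translated = False
            predicted.static_code = None
            return True
    return False
```

In this model a single correct prediction resets the counter, so only an uninterrupted run of mispredictions triggers the release. A block released this way would have to be translated dynamically if it is reached later.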

4. Experiment Analysis

In order to validate the functionality and performance of BP-QEMU, this section first introduces the overall experimental settings, and then presents experiments for validating the correctness of translating typical instruction sets, as well as the efficiency and memory consumption of conducting integer computing by simulating processors using BP-QEMU and QEMU.

4.1. Experimental Settings

In this paper, BP-QEMU is built on the open-source dynamic binary translation framework QEMU 6.2.0. To verify the binary translation capability of BP-QEMU, we designed experiments that simulate the Cortex-A9 processor with BP-QEMU and with QEMU on a Windows platform; the detailed experimental configurations are shown in Table 2. We chose these configurations because they have been employed by the CoreMark website. In theory, BP-QEMU can simulate any processor that QEMU can simulate. In addition, the execution speed of QEMU is affected by the CPU usage of the host. To ensure that the experimental results for QEMU and BP-QEMU are comparable, all experiments were conducted under the condition of 5% CPU usage.

4.2. Correctness and Efficiency Experiments

The results of binary translation must be correct to fulfill the designated functionalities of the target programs. To this end, we employed the test programs of QEMU for the ARM and PowerPC architectures to validate the correctness of the binary translation conducted by BP-QEMU. These test instructions included binary conversions, floating point operations, bit shifts, PC misalignment exceptions, etc. The test results show that BP-QEMU passed all the ARM and PowerPC instruction tests, which shows that BP-QEMU can accurately translate the instructions of the ARM and PowerPC architectures.
QEMU is widely used and notorious for its low efficiency. BP-QEMU achieves higher efficiency at the cost of using more memory at runtime when compared to QEMU. This section introduces three experiments for validating the efficiency of BP-QEMU. These experiments mainly involved executing CoreMark test programs on the simulated Cortex A9 processor. The configurations used for the experiments are outlined in Table 2, including the one used for measuring the volume of statically translated codes, the one for gauging the time of static translation, and the one for dynamic translation and execution.
CoreMark software consists of a set of open-source test programs that are widely used to measure the performances of microcontrollers (MCUs) and CPUs in embedded systems. We ran CoreMark test programs on Cortex A9 processors, simulated by QEMU and BP-QEMU, respectively, to gauge the performance of binary translation. As shown in Table 3, CoreMark includes test programs consisting of list processing, matrix operations, state machines, and cyclic redundancy checks. Note that CoreMark test programs do not contain system library calls, which can mitigate the performance fluctuations caused by different ways of implementing library functions.
Efficiency experiment I: Initially we conducted experiments to evaluate the overall efficiency improvements of BP-QEMU over QEMU. We only compare BP-QEMU with QEMU for two reasons. First, similar dynamic–static binary translation solutions are hardly reported in the literature. An exception is reference [24], which adopted a similar idea to ours. However, we cannot obtain or reproduce its solution according to the published materials. Second, it is not necessary to compare our method with the works that focus solely on the optimization of dynamic binary translation [17,18,19,20,21,22,23], which are actually complementary to our method.
CoreMark is compiled using default parameters and executed on the simulated Cortex A9 processor. The final output is a score that indicates the number of times per second the CoreMark test set is executed (unit: CoreMark/MHz). Note that the compiler optimization level is set to O3, and the always-taken and always-not-taken branch predictors are used, respectively. We ran the CoreMark test set 40,000 times to obtain a run score. We repeated this process five times to obtain five run scores in order to eliminate the influence of the host machine load. The value of the boost ratio is obtained from Equation (5).
$$\mathrm{BoostRatio} = \frac{T_{\text{BP-QEMU}} - T_{\text{QEMU}}}{T_{\text{QEMU}}} \tag{5}$$
where $T_{\text{BP-QEMU}}$ is the CoreMark run score of BP-QEMU with the always-taken predictor or the always-not-taken predictor, and $T_{\text{QEMU}}$ is the CoreMark run score of QEMU.
According to the final experimental results shown in Table 4, BP-QEMU improves performance, on average, by 13.3% over QEMU for binary translation with the always-taken predictor, and by 6% over QEMU with the always-not-taken predictor.
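As a quick check of Equation (5), the following Python sketch recomputes the boost ratios from the run scores reported in Table 4. The function names are illustrative, and averaging the per-run ratios is an assumption about how the average was obtained; the paper may instead compute the ratio from the averaged scores.

```python
from statistics import mean

# Run scores copied from Table 4 (five runs each).
QEMU = [1954, 1912, 1949, 1950, 1991]
BP_ALWAYS_TAKEN = [2191, 2253, 2175, 2196, 2231]
BP_ALWAYS_NOT_TAKEN = [2036, 2064, 2027, 2103, 2114]


def boost_ratio(t_bp_qemu: float, t_qemu: float) -> float:
    """Equation (5): relative gain of BP-QEMU over QEMU."""
    return (t_bp_qemu - t_qemu) / t_qemu


def mean_boost(bp_scores, qemu_scores) -> float:
    """Average of the per-run boost ratios (assumed averaging scheme)."""
    return mean([boost_ratio(b, q) for b, q in zip(bp_scores, qemu_scores)])


if __name__ == "__main__":
    print(f"always-taken:     {mean_boost(BP_ALWAYS_TAKEN, QEMU):.1%}")
    print(f"always-not-taken: {mean_boost(BP_ALWAYS_NOT_TAKEN, QEMU):.1%}")
```

Under this assumption the averages come out at roughly 13% and 6%, in line with the values reported above.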
We conducted two more experiments, i.e., we investigated the influences of different static branch predictors and the performance overheads of static translation activities, in order to determine the reason for the efficiency improvement of the BP-QEMU framework.
Efficiency experiment II: BP-QEMU combines static and dynamic approaches for binary translation. In BP-QEMU, the overall efficiency of executing a target program on a simulated processor is closely related to the number of code blocks that can be statically translated. We built two variants of BP-QEMU, one using the always-taken predictor and one using the always-not-taken predictor, and executed the CoreMark test programs on the processors simulated by the two variants. As shown in Figure 6, the statically translated code blocks account for 66.3% of the total code volume when the always-taken predictor is adopted. Consequently, BP-QEMU adopts the always-taken predictor. However, not all statically translated code blocks are correctly predicted. The experiments show that, on average, 60–70% of the code blocks are correctly predicted, depending on the content of the test programs. Intuitively, the more code blocks are predicted correctly, the more efficiently the target program executes. This means that the final percentage of the statically translated code that can be executed is 30–40%. In future work, the ratio of the prediction strategy to static translation needs to be adjusted so that as many code blocks as possible are translated statically.
Efficiency experiment III: BP-QEMU achieves higher efficiency than QEMU by statically translating some code blocks in advance. However, the stage of static translation itself introduces performance overhead. To quantitatively study the overhead of static translation, we conducted an experiment for gauging the time for static binary translation in BP-QEMU, as well as the time for dynamic translation and execution in BP-QEMU. We set timing flags in the BP-QEMU framework and executed two test cases on the simulated Cortex A9 processor. As shown in Table 5, the time for static binary translation is generally a fraction of the whole time for translating and executing the target program. We also noticed that the time for static translation is linearly dependent on the size of the target program, and that the times for dynamic translation and execution are not only affected by the code size but also by some other factors, particularly the number of loops in the code. A very short code with a large number of loops could run for a long time.
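The share of static translation in the total time can be recomputed from the timings in Table 5. The sketch below assumes that the total time is the sum of the static translation time and the time for dynamic translation and execution; this assumption reproduces the percentages in the table. The function name is illustrative.

```python
# Timings in seconds, copied from Table 5:
# (static translation, dynamic translation and execution).
TIMINGS = {
    "case_A": (0.4, 7.5),
    "case_B": (0.6, 45.6),
}


def static_share(static_s: float, dynamic_s: float) -> float:
    """Fraction of the total time spent in static translation."""
    return static_s / (static_s + dynamic_s)


if __name__ == "__main__":
    for name, (static_s, dynamic_s) in TIMINGS.items():
        print(f"{name}: {static_share(static_s, dynamic_s):.1%}")
```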

4.3. Memory Consumption Experiments

BP-QEMU performs static binary translation in advance and stores the static translation results in memory, and Section 3.5.2 describes the memory optimization mechanism designed to limit this cost. To explore the additional memory consumption of BP-QEMU over QEMU and the effect of the proposed memory optimization mechanism, we executed the test cases listed in Table 5 on the Cortex-A9 processors simulated by QEMU, BP-QEMU with memory optimization, and BP-QEMU without memory optimization, respectively. As shown in Table 6, the memory footprints of the two BP-QEMU variants are slightly higher than those of QEMU, and the memory footprint of BP-QEMU with memory optimization is lower than that of BP-QEMU without memory optimization. The results show that BP-QEMU consumes slightly more memory than QEMU and that the memory optimization mechanism reduces this additional memory consumption.
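To put the phrase "slightly higher" into numbers, the following sketch derives the relative memory overhead from the footprints in Table 6. The percentages are computed here from the table values and are not reported by the authors; the dictionary keys and the function name are illustrative.

```python
# Memory footprints in MB, copied from Table 6.
FOOTPRINTS_MB = {
    "case_A": {"qemu": 15.2, "bp_no_opt": 16.5, "bp_opt": 16.0},
    "case_B": {"qemu": 18.8, "bp_no_opt": 20.1, "bp_opt": 19.5},
}


def relative_overhead(variant_mb: float, baseline_mb: float) -> float:
    """Extra memory of a variant relative to the QEMU baseline."""
    return (variant_mb - baseline_mb) / baseline_mb


if __name__ == "__main__":
    for case, mb in FOOTPRINTS_MB.items():
        no_opt = relative_overhead(mb["bp_no_opt"], mb["qemu"])
        opt = relative_overhead(mb["bp_opt"], mb["qemu"])
        print(f"{case}: without optimization {no_opt:.1%}, with optimization {opt:.1%}")
```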

5. Conclusions

Dynamic binary translation techniques are widely used in cross-architecture software migration, but they still suffer from low performance and high memory overhead originating from code redundancy. This paper proposes a branch prediction-based dynamic–static binary translation method and builds a framework called BP-QEMU to improve the quality and efficiency of binary translation. It translates some translation blocks statically in advance, reduces the code redundancy introduced by dynamic translation, and improves program execution efficiency by saving the time required for dynamic translation. The CoreMark experiments show that, with the always-taken predictor, the execution efficiency of BP-QEMU is on average 13.3% higher than that of QEMU. In the future, we will explore more accurate and efficient branch prediction algorithms that are suitable for binary translation.

Author Contributions

Conceptualization, Y.W.; methodology, Y.W. and L.S.; software, Y.W., L.L., C.Z. and J.T.; validation, Y.W. and C.Z.; formal analysis, Y.W. and L.S.; investigation, Y.W.; resources, Y.W.; data curation, Y.W.; writing—original draft preparation, Y.W.; writing—review and editing, L.S.; visualization, Y.W.; supervision, L.S.; project administration, L.S.; funding acquisition, L.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Natural Science Basic Research Program of Shaanxi, P.R. China grant number [2023-JC-YB-581] and the APC was funded by [2023-JC-YB-581].

Data Availability Statement

CoreMark testing programs: https://github.com/eembc/coremark (accessed on 27 May 2023).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

  1. Altman, E.R.; Kaeli, D. Welcome to the opportunities of binary translation. Computer 2000, 33, 40–45.
  2. Yarza, I.; Azkarate-Askatsua, M.; Onaindia, P.; Gruettner, K.; Ittershagen, P.; Nebel, W. Legacy software migration based on timing contract aware real-time execution environments. J. Syst. Softw. 2021, 172, 110849.
  3. Chipounov, V.; Kuznetsov, V.; Candea, G. S2E: A platform for in-vivo multi-path analysis of software systems. ACM SIGPLAN Not. 2011, 46, 265–278.
  4. Ebcioglu, K.; Altman, E.; Gschwind, M.; Sathaye, S. Dynamic binary translation and optimization. IEEE Trans. Comput. 2001, 50, 529–548.
  5. Rocha, R.C.O.; Sprokholt, D.; Fink, M.; Gouicem, R.; Spink, T.; Chakraborty, S.; Bhatotia, P. Lasagne: A static binary translator for weak memory model architectures. In Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation, San Diego, CA, USA, 13–17 June 2022; ACM: New York, NY, USA, 2022; pp. 888–902.
  6. Wenzl, M.; Merzdovnik, G.; Ullrich, J.; Weippl, E. From hack to elaborate technique—A survey on binary rewriting. ACM Comput. Surv. 2019, 52, 1–37.
  7. Di Federico, A.; Agosta, G. A jump-target identification method for multi-architecture static binary translation. In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, Pittsburgh, PA, USA, 2–7 October 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1–10.
  8. Hawkins, W.H.; Hiser, J.D.; Co, M.; Nguyen-Tuong, A.; Davidson, J.W. Zipr: Efficient static binary rewriting for security. In Proceedings of the 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Denver, CO, USA, 26–29 June 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 559–566.
  9. Knorst, T.; Vicenzi, J.; Jordan, M.G.; Korol, G.; Beck, A.C.S.; Rutzig, M.B. An energy efficient multi-target binary translator for instruction and data level parallelism exploitation. Des. Autom. Embed. Syst. 2022, 26, 55–82.
  10. Zhang, H.; Ren, M.; Lei, Y.; Ming, J. One size does not fit all: Security hardening of mips embedded systems via static binary debloating for shared libraries. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 28 February–4 March 2022; ACM: New York, NY, USA, 2022; pp. 255–270.
  11. Chen, I.H.; King, C.T.; Chen, Y.H.; Lu, J.-M. Full System Emulation of Embedded Heterogeneous Multicores Based on QEMU. In Proceedings of the 2018 IEEE 24th International Conference on Parallel and Distributed Systems, Singapore, 11–13 December 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 771–778.
  12. Gouicem, R.; Sprokholt, D.; Ruehl, J.; Rocha, R.C.O.; Spink, T.; Chakraborty, S.; Bhatotia, P. Risotto: A Dynamic Binary Translator for Weak Memory Model Architectures. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vancouver, BC, Canada, 25–29 March 2023; ACM: New York, NY, USA, 2023; Volume 1, pp. 107–122.
  13. NiWu, J.; Dong, J.; Fang, R.; Zhang, W. FADATest: Fast and adaptive performance regression testing of dynamic binary translation systems. In Proceedings of the 44th International Conference on Software Engineering, Pittsburgh, PA, USA, 25–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 896–908.
  14. Kyle, S.; Böhm, I.; Franke, B.; Leather, N. Efficiently parallelizing instruction set simulation of embedded multi-core processors using region-based just-in-time dynamic binary translation. ACM SIGPLAN Not. 2012, 47, 21–30.
  15. Fan, X.; Li, S.; Zhiying, W. Dual-Core Architecture for Dynamic Binary Translation System: Tradeoff between Frequency and Bandwidth. In Proceedings of the 2012 Fourth International Conference on Computational and Information Sciences, Chongqing, China, 17–19 August 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 989–992.
  16. Altinay, A.; Nash, J.; Kroes, T.; Rajasekaran, P.; Zhou, D.; Dabrowski, A.; Gens, D.; Na, Y.; Volckaert, S.; Giuffrida, C.; et al. BinRec: Dynamic binary lifting and recompilation. In Proceedings of the Fifteenth European Conference on Computer Systems, Heraklion, Greece, 27–30 April 2020; ACM: New York, NY, USA, 2020; pp. 1–16.
  17. Yin, L. Dynamic Binary Translation Modeling and Parallelization Research; University of Science and Technology of China: Hefei, China, 2013; Available online: https://kns.cnki.net/kcms2/article/abstract?v=3uoqIhG8C447WN1SO36whHG-SvTYjkCc7dJWN_daf9c2-IbmsiYfKmTpuhyNiwqGSQeMLSmFtTcRJ8SJu7cevoHvOwu2q71d&uniplatform=NZKPT (accessed on 5 November 2022).
  18. Sun, T.; Yang, Y.; Yang, H.; Haibing, G. Return Instruction Analysis and Optimization in Dynamic Binary Translation. In Proceedings of the 2009 Fourth International Conference on Frontier of Computer Science and Technology, Shanghai, China, 17–19 December 2009; IEEE: Piscataway, NJ, USA, 2010; pp. 435–440.
  19. Liao, Y.; Sun, G.; Jiang, H.; Jin, G.; Chen, G. All registers direct mapping method in dynamic binary translation. Comput. Appl. Softw. 2011, 28, 21–24.
  20. Wang, J.; Pang, J.; Fu, L.; Yue, F.; Zhang, J. A binary translation backend registers allocation algorithm based on priority. In Proceedings of the Geo-Spatial Knowledge and Intelligence: 5th International Conference, GSKI 2017, Chiang Mai, Thailand, 8–10 December 2017; Revised Selected Papers, Part II 5; Springer: Singapore, 2018; pp. 414–425.
  21. Faravelon, A.; Gruber, O.; Pétrot, F. Optimizing memory access performance using hardware assisted virtualization in retargetable dynamic binary translation. In Proceedings of the 2017 Euromicro Conference on Digital System Design (DSD), Vienna, Austria, 30 August–1 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 40–46.
  22. Díaz, E.; Mateos, R.; Bueno, E.J.; Nieto, R. Enabling parallelized-QEMU for hardware/software co-simulation virtual platforms. Electronics 2021, 10, 759.
  23. Qiang, S.; Xianglan, C.; Huaping, C. Optimization technique of redundant instructions elimination in dynamic binary translator QEMU. Comput. Appl. Softw. 2012, 29, 67–69.
  24. Wang, J.; Pang, J.; Liu, X.; Yue, F.; Tan, J.; Fu, L. Dynamic translation optimization method based on static pre-translation. IEEE Access 2019, 7, 21491–21501.
  25. QEMU Sources and Documentations. Available online: https://www.qemu.org/ (accessed on 5 November 2022).
  26. Carvalho, H.; Nelissen, G.; Zaykov, P. mcQEMU: Time-Accurate Simulation of Multi-core platforms using QEMU. In Proceedings of the 2020 23rd Euromicro Conference on Digital System Design (DSD), Kranj, Slovenia, 26–28 August 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 81–88.
  27. Bartholomew, D. Qemu: A multihost, multitarget emulator. Linux J. 2006, 2006, 3. Available online: https://www.ecb.torontomu.ca/~courses/coe518/LinuxJournal/elj2006-145-QEMU.pdf (accessed on 5 November 2022).
  28. Kersey, C.D. QEMU Internals. The Linux Users Group at Georgia Tech Meeting. 2009. Available online: https://lugatgt.org/content/qemu_internals/downloads/slides.pdf (accessed on 5 November 2022).
  29. Bellard, F. QEMU, a fast and portable dynamic translator. In Proceedings of the Annual Conference on USENIX Annual Technical Conference, Berkeley, CA, USA, 10–15 April 2005; USENIX Association: Berkeley, CA, USA, 2005; p. 41.
  30. Smith, J.E. A study of branch prediction strategies. In Proceedings of the 25 Years of the International Symposia on Computer Architecture (Selected Papers), Washington, DC, USA, 12–14 May 1998; pp. 202–215.
  31. Lin, C.K.; Tarsa, S.J. Branch Prediction Is Not a Solved Problem: Measurements, Opportunities, and Future Directions. In Proceedings of the 2019 IEEE International Symposium on Workload Characterization (IISWC), Orlando, FL, USA, 3–5 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 228–238.
  32. Chaudhary, P. Implemented static branch prediction schemes for the parallelism processors. In Proceedings of the 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon), Faridabad, India, 14–16 February 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 79–83.
  33. Mittal, S. A survey of techniques for dynamic branch prediction. Concurr. Comput. Pract. Exp. 2019, 31, e4666.
  34. Ball, T.; Larus, J.R. Branch Prediction for Free. ACM SIGPLAN Not. 1993, 28, 300–313.
  35. Wagner, T.A.; Maverick, V.; Graham, S.L.; Harrison, M.A. Accurate static estimators for program optimization. ACM SIGPLAN Not. 1994, 29, 85–96.
  36. Bansal, S.; Aiken, A. Binary Translation Using Peephole Superoptimizers. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation, San Diego, CA, USA, 8–10 December 2008; USENIX Association: Berkeley, CA, USA, 2008; pp. 177–192.
  37. Lu, S.B.; Pang, J.M.; Shan, Z.; Yue, F. Retargetable static binary translator based on QEMU. J. Zhejiang Univ. Sci. 2016, 50, 158–165.
Figure 1. Examples of jumping between translation blocks.
Figure 2. QEMU working framework.
Figure 3. Dynamic–static binary translation method.
Figure 4. System architecture of BP-QEMU.
Figure 5. Mapping relationships for shadow registers in ARMv7.
Figure 6. Static binary translation by different branch predictors.
Table 1. A list of the main member variables of SbpStruct.
| Variable | Notes |
| --- | --- |
| tb_predict_pc | PC pointer to the predicted next TB block |
| tb_size | The TB block size |
| tb_continuous_failures | Count of consecutive failures of tb_predict_pc |
| tb_jump_valid | The validity of the jump address tb_predict_pc |
| tb_jumps | Possible jump addresses after the TB block |
| tb_translated | Whether the TB block is translated or not |
| tb_independence | Data independence between the TB block and the next one |
Table 2. Experimental settings.
| Configuration | Host Platform | Target Platform |
| --- | --- | --- |
| OS | Windows 10 | No operating system |
| CPU | Intel(R) Core(TM) i7-7700HQ CPU @ 2.80 GHz | Cortex-A9 |
| Compiler | GCC 5.4.1 | GCC 5.4.1 |
| Frequency | 2.80 GHz | 1.4 GHz |
| Memory | 16.0 GB | 512 MB |
Table 3. CoreMark test cases.
| Test Case | Function |
| --- | --- |
| core_list_join.c | List search, sorting, adding, and deleting operations |
| core_matrix.c | Matrix operator program |
| core_state.c | State machine console application |
| core_util.c | CRC calculation program |
| core_main.c | The main function |
Table 4. CoreMark test results.
| Test Number | Run Score of QEMU | Run Score of BP-QEMU with Always-Taken Predictor | Boost Ratio | Run Score of BP-QEMU with Always-Not-Taken Predictor | Boost Ratio |
| --- | --- | --- | --- | --- | --- |
| 1 | 1954 | 2191 | 12% | 2036 | 4.2% |
| 2 | 1912 | 2253 | 17.8% | 2064 | 7.9% |
| 3 | 1949 | 2175 | 11.6% | 2027 | 4% |
| 4 | 1950 | 2196 | 12.6% | 2103 | 7.8% |
| 5 | 1991 | 2231 | 12.1% | 2114 | 6.2% |
| Average value | 1951 | 2211 | 13.3% | 2069 | 6% |
Table 5. Overhead of static binary translation.
| Test Case | Code Size | Code Description | Time for Static Translation | Time for Dynamic Translation and Execution | Static Translation Time as a Percentage of Total Time |
| --- | --- | --- | --- | --- | --- |
| case_A | 122 KB | The core_list_join test item in CoreMark | 0.4 s | 7.5 s | 5.1% |
| case_B | 1327 KB | All test items in CoreMark | 0.6 s | 45.6 s | 1.3% |
Table 6. Memory consumption of BP-QEMU.
| Test Case | Code Size | QEMU | BP-QEMU without Memory Optimization | BP-QEMU with Memory Optimization |
| --- | --- | --- | --- | --- |
| case_A | 122 KB | 15.2 MB | 16.5 MB | 16 MB |
| case_B | 1327 KB | 18.8 MB | 20.1 MB | 19.5 MB |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sun, L.; Wu, Y.; Li, L.; Zhang, C.; Tang, J. A Dynamic and Static Binary Translation Method Based on Branch Prediction. Electronics 2023, 12, 3025. https://doi.org/10.3390/electronics12143025
