Electronics
  • Article
  • Open Access

9 July 2023

A Hardware Non-Invasive Mapping Method for Condition Bits in Binary Translation

1 Institute of VLSI Design, Zhejiang University, Hangzhou 310000, China
2 Alibaba Group, Hangzhou 310000, China
* Authors to whom correspondence should be addressed.
This article belongs to the Special Issue Emerging and New Technologies in Embedded Systems

Abstract

Binary translation, as an important bridge for application compatibility between different instruction set architectures (ISAs), has attracted much attention in the industry. However, due to hardware resource limitations of the target ISA, translation efficiency and practicability are often poor. Recently, Apple has made it possible to run x86 programs on ARM through a translation technology called Rosetta, which is based on software-hardware collaboration. In this paper, we propose a hardware non-invasive mapping method for condition bits (HNIMCB) in binary translation, which implements the setting and referencing operations of the condition bits without changing the original instruction encoding and function of the target processor. This method is applicable to binary translation from source architectures with condition bit operations to target architectures without them. It eliminates the difference in condition bit resources between the source and target ISAs, reduces the computational instructions and memory access operations produced by translation, and substantially improves translation efficiency. We evaluated the method at the functional simulation level using the QEMU binary translator from ARM to RISC-V. A series of benchmark tests showed that, with the HNIMCB applied, the total number of instructions after translation decreased by up to 41%, and the number of memory access instructions by up to 37%.

1. Introduction

The development of instruction set architectures (ISAs) has never stopped, with x86 dominating the desktop and server markets and ARM dominating the mobile computing market. The designs of these ISAs usually differ, and their applications are often incompatible with each other, which presents a large obstacle to ISA development. To address this problem, there has been much research on virtualization [1,2,3] and on binary translation between ISAs.
In recent years, RISC-V has developed rapidly with widespread attention from ecosystem developers for its open ecosystem and scalable, customizable ISA features. In the years after its creation in 2010, RISC-V has been mainly used in specialized chips, such as power management and the RF protocol. With the gradual improvement in the basic RV32I and RV32E standard instruction sets, more and more SoC chips of the IoT and MCU are reportedly using RISC-V. In 2022, the shipment of RISC-V processors exceeded 10 billion, and the main work of the RISC-V International Foundation is moving from technology improvement to other key related areas, such as cloud computing, edge computing, and automotive. The RISC-V software ecosystem has also evolved rapidly from the bare metal programs for specialized chips to RTOS for MCUs. Especially in recent years, strong UI interactions and high-performance computing OS on AP chips, such as Android, Ubuntu, Fedora, Anolis, and Kirin already support RISC-V [4,5]. However, in the Android and Linux systems, their applications are still dominated by x86 and ARM, which greatly restricts the further development of the RISC-V ecosystem.
To further accelerate the integration of RISC-V with high-performance application ecosystems, a binary translator targeting RISC-V is regarded as a viable path for rapid adaptation to the existing ecosystems. Binary translation techniques have been widely implemented in computer architecture research and commercial applications. Early implementations include FX!32 [6], which combines dynamic and static translation; IA-32 EL [7], developed by Intel to run IA-32 applications on IA-64 systems; and UQBT [8], developed by the University of Queensland to support multiple source and target ISAs. Transmeta directly implemented x86-like condition bit registers on its own VLIW processor, Crusoe, so that its binary translator CodeMorphing [9] can directly map condition bit instructions from x86 to Crusoe and reduce the computational complexity caused by the setting and referencing of the condition bits. Currently, there are many open-source binary translation implementations, the most typical of which are QEMU [10], Box64 [11], FEX-EMU [12], and Instrew [13]. In recent years, several binary translators, such as Apple’s Rosetta and Intel’s Houdini, have been used commercially owing to their excellent translation efficiency.
Binary translation between these ISAs can be regarded as a form of general compilation technology. It includes a front end, a middle end, and a back end, and implements a variety of optimization algorithms at each stage. Its architecture covers an interpretive execution mode, a static translation mode, and a dynamic just-in-time translation mode. In a particular implementation, several modes are often combined to improve the execution performance of the translated target code. Because information such as indirect jump targets, variable life cycles, and register allocation [14,15,16,17,18] is lost in the source binary code, the optimization effect of pure software binary translation is quite limited. Mainstream optimization techniques include register allocation and condition bit handling during translation, as well as native library calls at runtime. Among them, pure software optimization of condition bit operations has been studied extensively, but few studies have focused on improving its efficiency. In particular, target ISAs such as Alpha, MIPS, and RISC-V lack condition bit registers, as well as the resources for setting and referencing condition bits. Thus, when translating the condition bit instructions of source ISAs with multiple condition bits, such as x86 and ARM, additional computation instructions and memory accesses are required to realize the mapping between the source and target instructions.
This paper takes the translation of ARM to RISC-V as an example. The ARM ISA contains the N/Z/C/V condition bits, which are set or referenced by various arithmetic and comparison instructions, and are even referenced by condition codes in the instruction encoding. Since the RISC-V ISA is designed for simplicity [4,5], it has no condition bit register and no instructions that set or reference condition bits; comparison and branching are usually completed directly in one instruction. As shown in Figure 1, this difference means that translating ARM instructions that set and reference the condition bits into RISC-V requires many computation instructions and memory access operations.
Figure 1. Translation overhead for condition bit instructions.
The translated code fragment in Figure 1 was obtained from the QEMU emulator. The two ARM condition bit instructions shown require 16 RISC-V instructions to emulate, which highlights the significant cost of translating condition bit instructions. In contrast, for the same source code, a native RISC-V compiler requires only one RISC-V instruction, which is much more efficient than binary translation. This indicates that binary translators such as QEMU are not competitive with native compilers, and that a more efficient method for translating condition bit instructions is needed.
In order to analyze the impact of translation optimization for condition bit instructions on binary translation efficiency, we have recorded the number and proportions of condition bit instructions during the dynamic execution of the SPEC 2006 benchmark on ARM A64, as shown in Table 1.
Table 1. ARM SPEC 2006 condition bit instructions ratio.
According to Table 1, the proportion of condition bit instructions in the ARM SPEC 2006 programs peaks at 34.8%, with an average of 13%. As shown in Figure 1, because of the differences between the ARM and RISC-V ISAs, a series of RISC-V arithmetic and memory access instructions is required to translate a single ARM condition bit instruction, which severely affects translation performance. Together, these observations show that optimizing the translation of condition bit instructions is important for ARM to RISC-V binary translation.
The difference in condition bit resources between the source and target ISAs leads to a large number of computational instructions and memory access operations when translating the setting and referencing of the condition bits. To solve this problem, this paper proposes a hardware non-invasive mapping method for condition bits (HNIMCB) in binary translation. By adding a new execution mode to the target ISA, this method adds a condition bit register that maps one to one onto the condition bits of the source ISA, and realizes the setting and referencing of the condition bits on the original target ISA without breaking the standard instruction set or programming model. The method thus achieves efficient translation of condition bit instructions from the source ISA to the target ISA, reduces the translation complexity, and reduces the number of computation instructions and memory access operations after mapping.
In Section 1, we analyze the problem of the translation efficiency of the setting and referencing of the condition bits and put forward the corresponding solution. In Section 2, we analyze the current mainstream binary translators, especially their translation and optimization of the setting and referencing of the condition bits. Section 3 introduces the general framework and the implementation details of applying the HNIMCB in binary translation. The experimental data at the functional simulation level are presented and analyzed in Section 4. Section 5 evaluates the impact of the microarchitecture and the cost of extending the technique to x86 translation. In Section 6, the overall effects and significance of the proposed method are summarized.

3. Design

3.1. Implementation Framework

In this paper, we propose a hardware non-invasive mapping method for condition bits in binary translation. Its purpose is to provide hardware resources that enable an efficient mapping between the source and target ISAs without modifying the target ISA or its programming model. In view of the problems summarized in Section 2, we need to reduce the complexity of translation, as well as the condition bit calculation instructions and memory access operations, through a one-to-one mapping of the condition bit registers and of the setting and referencing operations between the source and target architectures. To this end, this paper adds a dynamic execution mode to the target ISA and realizes a switching mechanism between the translation mode and the dynamic execution mode on the existing ISA. Figure 2 illustrates the design framework of the method.
Figure 2. The HNIMCB in binary translation.
The mode switching mechanism is responsible for switching between the translation mode and the dynamic execution mode. In the translation mode, the running resources of the target processor are fully compatible with the original target ISA, and ordinary native applications and the dynamic translator run in this mode. In the dynamic execution mode, the target processor is extended with the condition bit register CFLAGS and with an identifier instruction for the condition bits (the CCR instruction). The arithmetic instruction that follows a CCR instruction keeps its original semantics and is additionally extended to set or reference the corresponding condition bits in the CFLAGS register.
Taking the translation from ARM to RISC-V as an example, the dynamic execution mode on RISC-V provides the hardware running environment for the translated target binary program. First, the condition bit register CFLAGS maps the ARM condition bits one to one, so the target binary program needs no extra storage to emulate the condition bits at run time, which reduces memory access operations. Second, the CCR instruction indicates whether the instruction following it sets or references the CFLAGS register. Third, for ARM arithmetic instructions that set or reference the condition bits, we extended the corresponding RISC-V arithmetic instructions with the setting or referencing of the condition bits, so the translated instructions need no additional calculation operations, which improves the efficiency of dynamic execution. Finally, the IR in the dynamic translator was designed according to the source ISA, so that it preserves as much information as possible about the condition bit operations in the source instruction sequence. Thus, the translation does not need complex data flow graph analysis or pattern matching algorithms within and between basic blocks; direct instruction mapping is adopted instead, which greatly reduces the translation complexity. Condition codes that implicitly reference the condition bits in the ARMv8 A32 instruction encoding can also be identified by CCR instructions during translation, thus achieving rapid translation to RISC-V instructions and greatly simplifying both the translation and the translated instruction sequence.
As shown in Figure 2, the proposed method comprises six parts: an execution mode switching mechanism, the condition bit register CFLAGS, the identifier instruction for the condition bits, extended instructions for setting the condition bits, extended instructions for referencing the condition bits, and the dynamic translator.

3.2. Execution Mode Switching Mechanism

The execution mode switching mechanism is responsible for switching back and forth between the translation mode and the dynamic execution mode. As shown in Figure 3, in order to remain compatible with the original programming model and leave the original instruction encoding unmodified, we use the least significant bit of the target address of the indirect jump instruction in the target ISA to identify mode switching. When this bit is ‘1’, the target processor enters or stays in the dynamic execution mode and can access the condition bit register, the extended instructions, and so on. When this bit is ‘0’, the target processor enters or remains in the translation mode.
Figure 3. The HNIMCB execution mode.
After the dynamic translator finishes translating a section of code, the system needs to jump from the translator to the translated target binary program block. At this point, the least significant bit of the target address of the indirect jump instruction is set to ‘1’, which switches the processor from the translation mode to the dynamic execution mode. When the target processor encounters untranslated code while executing the target binary program block, it needs to switch back to the translator to perform the translation. At this point, the least significant bit of the target address of the indirect jump instruction is cleared to ‘0’, which switches the processor from the dynamic execution mode to the translation mode. When the target processor starts, it initially runs in the translation mode.
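The switching rule described above can be sketched in a few lines of Python. This is only an illustrative model, not the authors' hardware design: the names `Mode` and `indirect_jump` and the example address are hypothetical. The sketch relies on the fact that the standard RISC-V JALR instruction already clears bit 0 of its computed target, which is why this bit is free to carry mode information; how the extended processor latches the bit is an assumption here.

```python
from enum import Enum


class Mode(Enum):
    TRANSLATION = 0        # standard target ISA only
    DYNAMIC_EXECUTION = 1  # CFLAGS and extended instructions available


def indirect_jump(target: int) -> tuple[Mode, int]:
    """Return the mode selected by an indirect jump and the fetch address.

    Assumption: bit 0 of the target selects the mode and is cleared
    before the instruction fetch, as standard JALR does.
    """
    mode = Mode(target & 1)
    return mode, target & ~1


# The translator enters a translated block by setting bit 0 of the target.
block_addr = 0x8000_2000  # hypothetical address of a translated block
mode, pc = indirect_jump(block_addr | 1)
print(mode.name, hex(pc))
```

Jumping back to the translator with an even target address would select `Mode.TRANSLATION` in the same way.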

3.3. Condition Bit Register (CFLAGS Register)

This paper takes ARMv8 and RISC-V as an example to design the condition bit mapping between the source and target ISAs. In the ARM ISA, the NZCV condition bits reside in the PSTATE register, and arithmetic, comparison, or MSR instructions can potentially set or reference them. However, there are no condition bits in RISC-V; the equivalent of setting and referencing the condition bits is implemented directly within a single branch instruction. This difference means that translating ARM instructions that set and reference the condition bits into RISC-V requires many computation instructions and memory access operations.
As shown in Table 2, this paper designs and implements the condition bit register (CFLAGS) in the dynamic execution mode for RISC-V. The condition bits in the register map directly to the condition bits of ARM. C_S is the negative (sign) status flag corresponding to the ARM N bit, C_Z is the zero status flag corresponding to the ARM Z bit, C_C is the carry status flag corresponding to the ARM C bit, and C_O is the overflow status flag corresponding to the ARM V bit.
Table 2. Mapping NZCV to the CFLAGS.
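As a rough illustration of this one-to-one mapping, the following Python sketch models CFLAGS as four single-bit fields. The class name `CFlags` and the 4-bit packing (N as the highest bit, then Z, C, V) are assumptions made for the example; the actual bit positions inside the CFLAGS register are defined by Table 2 and are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class CFlags:
    c_s: int = 0  # sign flag, mirrors ARM N
    c_z: int = 0  # zero flag, mirrors ARM Z
    c_c: int = 0  # carry flag, mirrors ARM C
    c_o: int = 0  # overflow flag, mirrors ARM V

    @classmethod
    def from_nzcv(cls, nzcv: int) -> "CFlags":
        """Build CFLAGS from a 4-bit value ordered N, Z, C, V (N is bit 3)."""
        return cls(
            c_s=(nzcv >> 3) & 1,
            c_z=(nzcv >> 2) & 1,
            c_c=(nzcv >> 1) & 1,
            c_o=nzcv & 1,
        )

    def to_nzcv(self) -> int:
        """Pack the four flags back into the same N, Z, C, V order."""
        return (self.c_s << 3) | (self.c_z << 2) | (self.c_c << 1) | self.c_o


flags = CFlags.from_nzcv(0b0110)  # Z and C set
print(flags)
```

Because every ARM flag has exactly one counterpart, converting in either direction is a pure bit copy and needs no computation or memory slot.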

3.4. Identifier Instruction for the Condition Bits (CCR Instruction)

As there are no condition bits in RISC-V, the setting and referencing of the condition bits are directly implemented within a single branch instruction [4,5], and arithmetic instructions do not set or reference the condition bits. As shown in Table 3, in order to be compatible with the original programming model of RISC-V, this paper proposes the extension of the identifier instruction for condition bit (CCR Instruction), which can be combined with the succeeding instruction to set or reference the condition bits based on the identification specified by the CCR instruction.
Table 3. The encoding fields of CCR instruction.
When the extended RISC-V processor executes the CCR instruction in the dynamic execution mode, the hardware first decodes the CO [0:2], COND [3:6], and CC [7:10] operand fields encoded in the CCR instruction. When the succeeding instruction executes, the target processor hardware determines how to set or reference the condition bits based on the CCR operand information and the values of the condition bits in the CFLAGS register.
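The field extraction can be sketched as follows. This is a hedged illustration only: it assumes the three fields form an 11-bit operand with bit 0 as the least significant bit, and it does not model where that operand sits inside the 32-bit instruction word or what the individual field values mean (both are defined by Table 3). The names `CcrFields`, `decode_ccr_operand`, and `encode_ccr_operand` are hypothetical.

```python
from typing import NamedTuple


class CcrFields(NamedTuple):
    co: int    # bits [0:2], 3 bits wide
    cond: int  # bits [3:6], 4 bits wide
    cc: int    # bits [7:10], 4 bits wide


def decode_ccr_operand(operand: int) -> CcrFields:
    """Split an 11-bit CCR operand into its three fields (assumed LSB-first)."""
    return CcrFields(
        co=operand & 0x7,
        cond=(operand >> 3) & 0xF,
        cc=(operand >> 7) & 0xF,
    )


def encode_ccr_operand(fields: CcrFields) -> int:
    """Inverse of decode_ccr_operand, useful for a translator back end."""
    return (fields.co & 0x7) | ((fields.cond & 0xF) << 3) | ((fields.cc & 0xF) << 7)


raw = encode_ccr_operand(CcrFields(co=5, cond=0xA, cc=0x3))
print(decode_ccr_operand(raw))
```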

3.5. Translation of the Setting of the Condition Bits

In the ARM ISA, arithmetic, comparison, or MSR instructions can potentially set or reference the condition bits. In the RISC-V ISA, by contrast, there are no condition bits, and the equivalent of setting and referencing the condition bits is implemented directly within a single instruction [4,5]. This can lead to inefficient translation from the source ISA to the target ISA or require redundant computation. A more detailed list of instructions and the method for setting the condition bits are given in Table 4.
Table 4. The mapping table for the setting of the condition bits.
As shown in Table 4, in the dynamic execution mode, this paper extends the functionality of the RISC-V arithmetic instructions. A CCR instruction paired with a RISC-V arithmetic instruction completes the translation of an ARM instruction that sets the condition bits.
As shown in Figure 4, taking ARM ADDS instruction as an example, in order to make the function of the RISC-V ADD instruction consistent with the ARM ADDS instruction, the setting of the condition bits in the CFLAGS register is added through the CCR instruction.
Figure 4. Condition bits setting in the HNIMCB.
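To make the extended ADD semantics concrete, the sketch below computes the 64-bit sum together with the four flags as ARM ADDS defines them. It models behavior only, under the assumption of 64-bit registers; the function name `add_set_flags` and the dictionary representation of CFLAGS are illustrative and not taken from the paper.

```python
MASK64 = (1 << 64) - 1


def add_set_flags(rs1: int, rs2: int) -> tuple[int, dict[str, int]]:
    """64-bit ADD that also reports the flags ARM ADDS would produce."""
    a, b = rs1 & MASK64, rs2 & MASK64
    full = a + b
    rd = full & MASK64
    flags = {
        "C_S": rd >> 63,       # result is negative (ARM N)
        "C_Z": int(rd == 0),   # result is zero (ARM Z)
        "C_C": full >> 64,     # unsigned carry out (ARM C)
        # Signed overflow (ARM V): operands share a sign that the result lacks.
        "C_O": ((~(a ^ b) & (a ^ rd)) >> 63) & 1,
    }
    return rd, flags


rd, flags = add_set_flags(2**63 - 1, 1)  # largest positive value plus one
print(hex(rd), flags)
```

In a software-only translation each of these four flag expressions costs separate target instructions, whereas the extended ADD produces them as a side effect of the addition.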

3.6. Translation of the Referencing of the Condition Bits

In the ARM ISA, some arithmetic instructions, branch instructions, or MRS may reference the condition bits, for example add-with-carry and subtract-with-borrow arithmetic instructions, as well as conditional jump instructions. However, RISC-V does not have condition bits, and the equivalent of setting and referencing the condition bits is implemented directly in one instruction [4,5]. This can lead to inefficient translation from ARM to RISC-V or require redundant calculations. A more detailed list of instructions and the method for referencing the condition bits are given in Table 5.
Table 5. The mapping table for the referencing of the condition bits.
As shown in Table 5, in the dynamic execution mode, this paper extended the corresponding arithmetic instructions in RISC-V for the referencing instructions of the condition bit in the ARM ISA. By combining the identifier instruction of the condition bit, a one-to-one mapping relationship is thereby achieved from the ARM to the RISC-V.
As shown in Figure 5, taking the ARM ADC instruction as an example, the RISC-V ADD instruction corresponds to its functionality. In order to make the RISC-V ADD instruction function consistent with that of the ARM ADC instruction, on the basis of the original rd = rs1 + rs2 function of ADD, the referencing of the condition bits in the CFLAGS register is added through the CCR instruction, so that the function is directly mapped with the ARM ADC instruction.
Figure 5. Condition bits referencing in the HNIMCB.
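The ADC case can be sketched in the same style: the extended ADD computes rd = rs1 + rs2 plus the carry bit read from CFLAGS. The helper names `add_with_carry` and `add128` are hypothetical, and the 128-bit addition is included only as a typical use of an ADDS/ADC pair, not as an example from the paper.

```python
MASK64 = (1 << 64) - 1


def add_with_carry(rs1: int, rs2: int, c_c: int) -> int:
    """ADD extended to reference the carry flag: rd = rs1 + rs2 + C_C."""
    return (rs1 + rs2 + (c_c & 1)) & MASK64


def add128(a: int, b: int) -> int:
    """128-bit addition built from a flag-setting add and a carry-using add."""
    lo_full = (a & MASK64) + (b & MASK64)
    carry = lo_full >> 64           # what the flag-setting ADD leaves in C_C
    lo = lo_full & MASK64
    hi = add_with_carry(a >> 64, b >> 64, carry)
    return (hi << 64) | lo


print(hex(add128(2**64 - 1, 1)))
```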
Figure 6 shows an example of the translation from the ARM to RISC-V with an extension for the CCR instruction and arithmetic instruction extended for the condition bit setting or referencing. The subsequent target instruction sequence of the translation mapping is much simpler after the extension compared to before.
Figure 6. The HNIMCB vs. the standard RISC-V in translation.

3.7. Dynamic Translator

Binary translation technology includes static translation and dynamic translation. Due to the limitations of information in source binary programs, static translation technology cannot solve all the problems, such as indirect jumps, code and data mixing, self-generated codes, and so on. The translator implemented in this paper adopted the dynamic translation mode.
This translator has its own intermediate representation (IR) and consists of a front end and a back end. The front end converts the binary code of the source ISA to IR, and the back end implements the mapping of the IR to the target ISA. The design of the IR is crucial in compilation technology, as it determines how much information is provided for the back-end translation and optimization. As shown in Figure 7, taking the ARM ADDS instruction as an example, if the design of the IR refers to the target ISA rather than the source ISA, it is possible that ADDS can be transformed into multiple independent IR representations during front-end conversion. In this case, the back end of the translator needs to implement a complex pattern-matching algorithm to reintegrate these multiple IR representations into a single back-end instruction description, greatly increasing the complexity of translation.
Figure 7. Mapping from the source instruction to IR.
To reduce the complexity of data flow analysis and pattern-matching algorithms, the dynamic translator employed in this paper defined many new IRs that are close to the source ISA. If the source instruction includes a condition code, such as the CCMP instruction, the condition code will also be encoded as an immediate parameter to the IR. This allows the front end of the translator to preserve the information of the original instructions as much as possible when translating them into IR. When converting the IR into the target ISA in the back end of a translator, the original information of the source instruction can be obtained as much as possible, thereby allowing for a one-to-one mapping with extended resources or instructions in dynamic execution mode. This simplifies the analysis of data flow and control flow, reduces the fusion of multiple IRs, and thus reduces the translation complexity.
In the conversion of the IRs to the back-end RISC-V instructions, usually a CCR instruction will be emitted first according to the IR semantics. As shown in Figure 7, the ‘adds_i64’ IR mapping to the ARM ADDS instruction will emit a ‘ccr 0, cc_al, co_set’ instruction, which combines with the RISC-V ADD instruction to achieve the function of setting the CFLAGS register like the ADDS in the ARM ISA.
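A back end that follows this scheme can be a simple table lookup, as the hedged sketch below suggests. Only the `adds_i64` entry and its `ccr 0, cc_al, co_set` prefix come from the text above; the function name `emit`, the table layout, and the plain `add_i64` entry are assumptions for illustration, and register names are arbitrary.

```python
from typing import Optional

# IR opcode -> (optional CCR prefix, RISC-V mnemonic).
# Only the adds_i64 row is taken from the paper; add_i64 is shown for contrast.
EMIT_TABLE: dict[str, tuple[Optional[str], str]] = {
    "add_i64": (None, "add"),
    "adds_i64": ("ccr 0, cc_al, co_set", "add"),
}


def emit(ir_op: str, rd: str, rs1: str, rs2: str) -> list[str]:
    """Map one source-oriented IR op directly to target instructions."""
    prefix, mnemonic = EMIT_TABLE[ir_op]
    out = [] if prefix is None else [prefix]
    out.append(f"{mnemonic} {rd}, {rs1}, {rs2}")
    return out


print(emit("adds_i64", "a0", "a1", "a2"))
```

Because each IR op carries the full meaning of the source instruction, the lookup needs no pattern matching across several IR ops and no inspection of neighboring basic blocks.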

3.8. Optimization of Translation Efficiency

The efficiency of binary translation is measured by the additional time consumption $T_{total}$ during the process of translation and execution of the program, including the time consumption of the translation process itself, $T_{translate}$, as well as the difference between the execution time of the translated program, $T_{run}$, and the execution time of the same program natively, $T_{run\_in\_native}$, as shown in Formula (1):
$$T_{total} = T_{translate} + (T_{run} - T_{run\_in\_native}) \tag{1}$$
Binary translators pay more attention to $(T_{run} - T_{run\_in\_native})$, ignoring $T_{translate}$, resulting in increasingly complex translation algorithms. For example, pattern matching is used to reduce the number of setting and referencing instructions for the condition bits, and data flow analysis is used to achieve optimization methods, such as deleting redundant condition bit calculations. The complexity of these algorithms is directly related to the number of basic blocks. Assuming that each basic block has 2 branches and there are n layers of successor basic blocks, the calculation complexity is $O(2^n)$ [22]. By extending the hardware functions for the condition bit of the arithmetic instructions in the target ISA and the corresponding IR descriptions in the translator, this paper thereby achieved the efficient mapping from the source instructions to the target instructions, which eliminates the data flow analysis process, simplifies instruction pattern matching, and greatly reduces the complexity of dynamic translation. Additionally, there is no need for additional information storage when executing the translated code.
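The two quantities in this paragraph can be illustrated numerically. The sketch below evaluates Formula (1) and counts the paths a data flow analysis would have to follow through n layers of two-way branches; the function names and the sample timings are hypothetical and are not measurements from the paper.

```python
def translation_overhead(t_translate: float, t_run: float, t_native: float) -> float:
    """Formula (1): translation time plus the slowdown relative to native."""
    return t_translate + (t_run - t_native)


def successor_paths(layers: int, branches: int = 2) -> int:
    """Count distinct paths through `layers` levels of successor blocks."""
    if layers == 0:
        return 1
    return branches * successor_paths(layers - 1, branches)


# Placeholder timings in seconds, chosen only to show the arithmetic.
print(translation_overhead(t_translate=2.0, t_run=15.0, t_native=10.0))

# The path count doubles with every extra layer, i.e. it grows as 2**n.
print([successor_paths(n) for n in range(6)])
```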
Currently, binary translators mostly use delayed calculation to reduce the calculation of the condition bits during translation optimization. However, delayed calculation increases memory access consumption during dynamic execution, as it needs to save instruction codes, operands, and other information [21]. This paper adds the condition bit register CFLAGS and extends the instructions that set and reference the condition bits in the target ISA, achieving a one-to-one mapping of the condition bit instructions between the source and target ISAs. As a result, there is no need for delayed calculation to obtain and save the values of the condition bits, which eliminates the memory access operations it causes. As shown in Figure 6, this one-to-one mapping removes most of the additional calculations for the condition bits and the memory access operations used to emulate them.
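For readers unfamiliar with delayed (lazy) calculation, the following sketch shows the general idea in Python. It is a simplified model in the spirit of such schemes, not the data structure of QEMU or of any specific translator; `LazyFlagState`, `record_add`, and `carry_flag` are hypothetical names. Each field write in `record_add` stands for a store to the emulated CPU state in memory, which is the overhead the CFLAGS register avoids.

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1


@dataclass
class LazyFlagState:
    """Software emulation: remember the last flag-setting operation."""
    op: str = "none"
    src1: int = 0
    src2: int = 0
    result: int = 0


def record_add(state: LazyFlagState, a: int, b: int) -> int:
    """Perform the add and save what is needed to derive flags later."""
    a, b = a & MASK64, b & MASK64
    state.op, state.src1, state.src2 = "add", a, b
    state.result = (a + b) & MASK64
    return state.result


def carry_flag(state: LazyFlagState) -> int:
    """Materialize the carry only when a later instruction references it."""
    if state.op == "add":
        # An unsigned add wrapped around iff the result is below an operand.
        return int(state.result < state.src1)
    raise ValueError(f"no flag rule for op {state.op!r}")


state = LazyFlagState()
record_add(state, MASK64, 1)
print(carry_flag(state))
```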
In summary, the hardware non-invasive mapping method for the condition bits in this paper eliminates the process of pattern matching and data flow analysis, eliminates the storage of the instructions/operands for delayed calculations, and further eliminates the additional operations, such as the calculations of the condition bit and memory simulation for the condition bit, so as to significantly improve the efficiency of binary translation.

4. Results

4.1. Experiment Platform

As shown in Figure 8, this paper adopts the QEMU system mode implemented on an x86 PC to simulate the hardware experimental platform of RISC-V processors, which we refer to as the QEMU-System-RV64. Thus, we can conveniently expand the capabilities of the condition bit register CFLAGS, the setting, and the referencing of the condition bits for RISC-V processors on the QEMU-System-RV64. The experiments were conducted by running the QEMU user mode on the QEMU-System-RV64, serving as the architecture basis for the dynamic translation from the ARM to RISC-V. We inherited the QEMU’s framework, including the IR definition, translation optimization, and other capabilities, while implementing the front end for the ARM ISA and the back end for the RISC-V standard ISA. This forms the binary translator from ARM to RISC-V called the QEMU-User-ARM running on the QEMU-System-RV64.
Figure 8. The standard experimental platform.
As shown in Figure 9, according to the description in Section 3, this experiment extended the execution mode switching mechanism, CCR instruction, CFLAGS register, and the setting or referencing instructions of the condition bits for RISC-V simulated in the QEMU-System-RV64 to obtain the extended RISC-V simulator named the QEMU-System-RV64-ext. Following the description in Section 3, this experiment extended the IR definition and the translation back-end for CFLAGS, CCR instruction, and the setting or referencing instructions of the condition bits based on the QEMU-User-ARM. This forms a binary translator from ARM to an extended RISC-V processor termed the QEMU-User-ARM-ext.
Figure 9. The HNIMCB experimental platform.

4.2. Experiment Steps

The experiment used SPEC 2006 as the test benchmark, compiled with GCC at the O3 optimization level, and the test process was divided into three steps. First, a standard test platform was built using the QEMU-System-RV64 and QEMU-User-ARM. When the ARM binary programs were translated and executed, the QEMU-System-RV64 recorded the number of translated and executed RISC-V instructions and memory accesses, while the QEMU-User-ARM recorded the number of condition bit instructions in the ARM binary program. Second, an extended test platform was built using the QEMU-System-RV64-ext and QEMU-User-ARM-ext. When the ARM binary programs were translated and executed, the QEMU-System-RV64-ext recorded the number of translated and executed RISC-V instructions and memory accesses, while the QEMU-User-ARM-ext recorded the number of condition bit instructions in the ARM binary program. Finally, we compared and analyzed the data obtained in the first two steps, such as the number of dynamically executed instructions, the number of memory access instructions, and the number of condition bit instructions.

4.3. Total Instruction Statistics

Dynamic instruction count is one of the important performance indicators in binary translation. This paper conducted a statistical analysis of this indicator on the SPEC 2006 benchmark. Each bar in the chart displayed in Figure 10 represents the percentage decrease in the total dynamic instruction count of one application after applying the proposed hardware non-invasive mapping method for condition bits.
Figure 10. Total instruction reduction.
The data indicate that after optimization with the hardware non-invasive mapping method for condition bits proposed in this paper, the total dynamic instruction count decreased by up to 41%, and by 19% on average. The data also indicate that most programs benefit significantly, mostly because the ARM condition bit instructions are mapped to significantly fewer RISC-V instructions when translated using the proposed method. More details can be found in Figure 6. In summary, the experimental data show that the proposed method effectively decreases the total dynamic instruction count of the translated programs.

4.4. Memory Access Statistics

Optimization of memory access instruction count is a topic that has been widely discussed in binary translation, specifically in relation to register allocation or condition bit mapping. This paper presents the experimental results on the number of memory access instructions in the SPEC 2006 program after binary translation from ARM to RISC-V. Each data point in the bar chart shown in Figure 11 represents the percentage decrease in the memory access instructions before and after applying the hardware non-invasive mapping method for condition bits proposed for binary translation.
Figure 11. Memory access reduction.
The data show that, with the proposed optimization method, the number of memory access instructions executed at runtime decreased by up to 37% and by 19% on average. Most of the programs were significantly optimized, mainly because the ARM NZCV condition bits are mapped directly to the CFLAGS register instead of to a memory location. In summary, the experimental data show that the proposed method effectively decreases the number of memory accesses in the translated programs.
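For reference, the four NZCV bits that a flag-setting ARM addition produces can be modelled in a few lines. The sketch below packs them into a single 4-bit value, as would fit in one register; the bit layout (N, Z, C, V from most to least significant) and the function name `add_with_flags` are assumptions for illustration, not the CFLAGS format defined in this paper.

```python
MASK32 = 0xFFFFFFFF


def add_with_flags(a: int, b: int) -> tuple:
    """32-bit ADD returning (result, flags) with ARM-style NZCV semantics.

    flags is a 4-bit value: N=bit 3, Z=bit 2, C=bit 1, V=bit 0.
    This packing is an assumption made for the sketch.
    """
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32

    n = result >> 31              # N: sign bit of the 32-bit result
    z = int(result == 0)          # Z: result is zero
    c = int(full > MASK32)        # C: unsigned carry out of bit 31
    # V: signed overflow, i.e. both operands share a sign bit that
    # differs from the sign bit of the result.
    v = ((~(a ^ b) & (a ^ result)) >> 31) & 1

    return result, (n << 3) | (z << 2) | (c << 1) | v
```

When these four bits live in memory-backed state, every setting and referencing operation costs loads and stores; holding them in a register removes those accesses.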

4.5. Regression Analysis

To further analyze the effectiveness of the proposed method, we used linear regression to examine the relationship between the memory access reduction and the proportion of condition bit instructions. Figure 12 shows a scatter plot and regression analysis of the percentage reduction in memory access instructions in the SPEC 2006 benchmark against the proportion of condition bit instructions in the source binary programs.
Figure 12. Regression analysis of memory reduction.
According to Figure 12, after optimization with the proposed hardware non-invasive mapping method for condition bits, the measured reduction in memory access instructions was close to the value predicted by the linear regression equation, and the obtained R² was 0.93, which indicates a good fit. This means that the proportion of condition bit instructions has a positive linear correlation with the reduction in the memory access instructions. In other words, the higher the proportion of condition bit instructions, the greater the reduction in the memory access instructions. Using the same analysis method, it can be concluded that there is a similar linear relationship between the reduction in total instructions and the proportion of condition bit instructions in the source program.
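The regression and its R² value can be reproduced with an ordinary least-squares fit. The following is a minimal sketch in plain Python; the data points are invented placeholders rather than the measurements behind Figure 12, and `linear_fit` is a hypothetical helper name.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, plus R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical points, NOT the paper's data:
# x = proportion of condition bit instructions in the source program (%),
# y = reduction in memory access instructions after translation (%).
cond_bit_share = [2.0, 5.0, 8.0, 11.0, 14.0]
mem_reduction = [4.0, 11.0, 15.0, 24.0, 29.0]

slope, intercept, r2 = linear_fit(cond_bit_share, mem_reduction)
```

A positive slope together with an R² close to 1 corresponds to the positive linear correlation described above.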

5. Discussion

5.1. Microarchitecture Impact

Most binary translation techniques aim to execute code on an unmodified target architecture, but this leaves a performance bottleneck that needs to be addressed. This paper specifically addresses the translation of condition bit operations. Firstly, RISC-V is an open-source and open architecture that allows specific issues to be addressed by defining sub-extensions. Secondly, existing open-source binary translators generally have a low translation efficiency, while binary translators with a higher efficiency are developed through architecture customization by commercial companies, such as Apple’s Rosetta 2. When translating X86 programs to ARM, Rosetta 2 relies on hardware resources that ensure TSO memory ordering [25,26], among others.
Our extensions enhance the semantics of some instructions, but overall, the instructions remain relatively simple, and simpler than the corresponding ARM instructions. For instance, ADD instructions with the HNIMCB do not require complicated operations, such as the handling of source register extensions and shifts [27] in ARM.
The operands of the CCR instruction are all immediate values, meaning that the following instructions know that they need to operate on the CFLAGS without waiting for the CCR instruction to execute. In this manner, the CCR instruction does not need to establish any dependency with the subsequent ALU instructions. After the decoding stage, the CCR instruction can be executed as a NOP.
A similar case to CCR is the VSETVL instruction in the RISC-V vector extension [28], which has been ratified and is implemented by many CPU cores (such as the C908 [29] from T-HEAD and the P270 [30] from SiFive). In the vector extension, the vsetvl instruction sets the VL and VTYPE CSRs. It is used very frequently in many workloads. For example, in some OpenCV test cases, vsetvl accounts for more than 20% of the total vector instructions. With pipeline forwarding and register renaming, its latency is not expected to be a major challenge [31].
Thus, we expect that our approach will not have a significant impact on the pipeline performance. As our experiment platform is QEMU, a functional-level simulator, we cannot provide a quantitative analysis of this question at this time, and the issue requires further research.

5.2. X86 Translation

This article proposed a method of establishing general condition bit semantics on RISC-V. It does not simulate concrete condition bit instructions. Instead, it focuses on the abstract condition bit operation mode, such as the condition bit setting, referencing, and the conditional execution codes.
Although ARM has been used throughout the article and the experiments, this mechanism can also be applied to translation from other architectures, such as X86 to RISC-V. X86 has six status flags (CF, PF, AF, ZF, SF, and OF) in the EFLAGS register [32], which can be easily mapped by the proposed CFLAGS register. These status flags are set by addition, subtraction, comparison, and other instructions. The CMOVcc, FCMOVcc, Jcc, and SETcc instructions test condition codes, each of which is encoded by one or more status flags. The proposed CCR instruction encodes both the setting and referencing of the condition bits and the condition codes. With the CCR instruction, various condition bit operations from other architectures can be mapped to the most basic RISC-V instructions, such as addition, subtraction, and logical instructions.
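To illustrate how condition codes are encoded by one or more status flags, the sketch below evaluates a subset of the X86 condition codes used by Jcc, SETcc, and CMOVcc from individual flag values. The function name and the one-argument-per-flag interface are hypothetical and chosen for readability; they do not represent the CCR encoding or the CFLAGS layout.

```python
def x86_condition(cc: str, cf: bool, zf: bool, sf: bool, of: bool) -> bool:
    """Evaluate a subset of X86 condition codes from the status flags."""
    table = {
        "e": zf,                        # equal / zero
        "ne": not zf,                   # not equal / not zero
        "b": cf,                        # unsigned below
        "ae": not cf,                   # unsigned above or equal
        "be": cf or zf,                 # unsigned below or equal
        "a": (not cf) and (not zf),     # unsigned above
        "l": sf != of,                  # signed less
        "ge": sf == of,                 # signed greater or equal
        "le": zf or (sf != of),         # signed less or equal
        "g": (not zf) and (sf == of),   # signed greater
    }
    return table[cc]


# Flags after comparing 1 with 2 (computing 1 - 2): a borrow occurs and
# the result is negative without signed overflow.
flags_1_vs_2 = dict(cf=True, zf=False, sf=True, of=False)
is_less = x86_condition("l", **flags_1_vs_2)
is_below = x86_condition("b", **flags_1_vs_2)
```

Each condition is a small Boolean function of the flags, which is why a register holding the flags is sufficient to serve both signed and unsigned comparisons.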

6. Conclusions

This paper presents a statistical analysis of the percentage of condition bit instructions in the SPEC 2006 benchmark and analyzes the condition bit instruction translation techniques of the existing mainstream binary translation tools. Based on previous research, this paper proposes a hardware non-invasive mapping method for the translation of condition bit instructions in the RISC-V open-source architecture. This method extends the RISC-V dynamic execution mode to implement hardware functions and resources, such as the condition bit register and the setting or referencing of the condition bits, enabling the efficient translation of the condition bit instructions from the source instruction set to RISC-V, while ensuring RISC-V compatibility with the existing software ecosystem. The experimental data show that the proposed method reduces the total number of instructions by up to 41% and the number of memory access instructions by up to 37%, effectively reducing the translation complexity and improving the translation performance.
RISC-V has achieved significant technological and commercial development owing to its open-source and open ecosystem. However, its software and application ecosystem in high-performance areas, such as mobile devices, desktops, and data centers, still falls short. The proposed hardware non-invasive mapping method for condition bits can effectively optimize the binary translation performance from ARM or x86 to RISC-V, and can thereby help promote the commercialization of RISC-V in high-performance fields.

Author Contributions

Conceptualization, C.L.; methodology, C.L. and Z.L.; software, Z.L. and Y.S.; validation, Z.L., L.H., Y.S. and X.Y.; project administration, C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The authors confirm that the data supporting the findings of this study are available within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Barham, P.; Dragovic, B.; Fraser, K.; Hand, S.; Harris, T.; Ho, A.; Neugebauer, R.; Pratt, I.; Warfield, A. Xen and the Art of Virtualization. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, New York, NY, USA, 19 October 2003; ACM: New York, NY, USA; pp. 164–177.
  2. Li, H.; Xu, X.; Ren, J.; Dong, Y. ACRN: A Big Little Hypervisor for IoT Development. In Proceedings of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, Providence, RI, USA, 14 April 2019; ACM: Providence, RI, USA; pp. 31–44.
  3. Li, C.; Guo, R.; Tian, X.; Wang, H. KHV: KVM-Based Heterogeneous Virtualization. Electronics 2022, 11, 2631.
  4. Waterman, A.; Asanovic, K. The RISC-V Instruction Set Manual, Volume I: User-Level ISA; Document Version 20191213; EECS Department, University of California: Los Angeles, CA, USA, 2019.
  5. Waterman, A.; Asanovic, K. The RISC-V Instruction Set Manual, Volume II: Privileged Architecture; Document Version 20190608-Priv-MSU-Ratified; EECS Department, University of California: Los Angeles, CA, USA, 2019.
  6. Chernoff, A.; Herdeg, M.; Hookway, R.; Reeve, C.; Rubin, N.; Tye, T.; Bharadwaj Yadavalli, S.; Yates, J. FX!32 a Profile-Directed Binary Translator. IEEE Micro 1998, 18, 56–64.
  7. Baraz, L.; Devor, T.; Etzion, O.; Goldenberg, S.; Skaletsky, A.; Wang, Y.; Zemach, Y. IA-32 Execution Layer: A Two-Phase Dynamic Translator Designed to Support IA-32 Applications on Itanium®-Based Systems. In Proceedings of the 22nd Digital Avionics Systems Conference, Proceedings (Cat. No.03CH37449), San Diego, CA, USA, 5 December 2003; IEEE Comput. Soc.: San Diego, CA, USA, 2003; pp. 191–201.
  8. Cifuentes, C.; Van Emmerik, M. UQBT: Adaptable Binary Translation at Low Cost. Computer 2000, 33, 60–66.
  9. Dehnert, J.C.; Grant, B.K.; Banning, J.P.; Johnson, R.; Kistler, T.; Klaiber, A.; Mattson, J. The Transmeta Code Morphing™ Software: Using Speculation, Recovery, and Adaptive Retranslation to Address Real-Life Challenges. In Proceedings of the International Symposium on Code Generation and Optimization, CGO 2003, San Francisco, CA, USA, 23–26 March 2003; IEEE Comput. Soc.: San Francisco, CA, USA, 2003; pp. 15–24.
  10. Bellard, F. QEMU, a Fast and Portable Dynamic Translator; USENIX Association: Anaheim, CA, USA, 2005; p. 41.
  11. Box86. Available online: https://github.com/ptitSeb/box86 (accessed on 26 May 2023).
  12. Houdek, R. FEX-Emu: Fast(-Er) X86 Emulation for AArch64. In Proceedings of the Free and Open source Software Developers’ European Meeting (FOSDEM), online, 5–6 February 2022.
  13. Engelke, A.; Schulz, M. Instrew: Leveraging LLVM for High Performance Dynamic Binary Instrumentation. In Proceedings of the 16th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, Lausanne, Switzerland, 17 March 2020; ACM: Lausanne, Switzerland; pp. 172–184.
  14. Liao, Y.; Sun, G.; Jiang, H.; Jin, G.; Chen, G. All Registers Direct Mapping Method in Dynamic Binary Translation. Comput. Appl. Softw. 2011, 28, 21–24+48.
  15. Wang, J.; Pang, J.; Fu, L.; Yue, F.; Shan, Z.; Zhang, J. A Dynamic and Static Combined Register Mapping Method in Binary Translation. J. Comput. Res. Dev. 2019, 56, 708–718.
  16. Smelyanskiy, M.; Tyson, G.S.; Davidson, E.S. Register Queues: A New Hardware/Software Approach to Efficient Software Pipelining. In Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00622), Philadelphia, PA, USA, 15–19 October 2000; IEEE Comput. Soc.: Philadelphia, PA, USA, 2000; pp. 3–12.
  17. Wang, C.; Wu, Y.; Rong, H.; Park, H. SMARQ: Software-Managed Alias Register Queue for Dynamic Optimizations. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, Vancouver, BC, Canada, 1–5 December 2012; IEEE: Vancouver, BC, Canada, 2012; pp. 425–436.
  18. Wen, Y.; Tang, D.; Qi, F. Register Mapping and Register Function Cutting out Implementation in Binary Translation. J. Softw. 2009, 20, 1–7.
  19. Lattner, C.; Adve, V. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the International Symposium on Code Generation and Optimization, CGO 2004, San Jose, CA, USA, 20–24 March 2004; IEEE: San Jose, CA, USA, 2004; pp. 75–86.
  20. Mac Benchmarks. Available online: https://browser.geekbench.com/mac-benchmarks (accessed on 26 May 2023).
  21. Ma, X.; Wu, C.; Tang, F.; Feng, X.; Zhang, Z. Two Condition Code Optimization Approaches in Binary Translation. J. Comput. Res. Dev. 2005, 42, 329–337.
  22. Tang, F.; Wu, C.; Feng, X.; Zhang, Z. EfLA Algorithm Based on Dynamic Feedback. J. Softw. 2007, 18, 1603–1611.
  23. Wang, W.; Wu, C.; Bai, T.; Wang, Z.; Yuan, X.; Cui, H. A Pattern Translation Method for Flags in Binary Translation. Jisuanji Yanjiu Yu Fazhan/Comput. Res. Dev. 2014, 51, 2336–2347.
  24. Wang, R.; Meng, J.; Chen, Z.; Yan, X. Condition Code Optimization in Dynamic Binary Translation. J. Zhejiang Univ. (Eng. Sci.) 2014, 48, 124–129.
  25. Saagarjha. TSOEnabler. Available online: https://github.com/saagarjha/TSOEnabler (accessed on 16 June 2023).
  26. Dougallj. Why Is Rosetta 2 Fast. Available online: https://dougallj.wordpress.com/2022/11/09/why-is-rosetta-2-fast/ (accessed on 26 May 2023).
  27. A64–Base Instructions (Alphabetic Order). Available online: http://hehezhou.cn/isa/ (accessed on 16 June 2023).
  28. Asanovic, K. RISC-V “V” Vector Extension. Version 1.0. 2021. Available online: https://github.com/riscv/riscv-v-spec/releases/tag/v1.0 (accessed on 16 June 2023).
  29. XuanTie C908. Available online: https://xrvm.com/cpu-details?id=4107904466789928960 (accessed on 16 June 2023).
  30. Frame, A. Introduction To SiFive Vector Processors. Available online: https://www.sifive.com/blog/introduction-to-sifive-vector-processors (accessed on 16 June 2023).
  31. Asanovic, K. Cost of Vsetvl Instructions #642. Available online: https://github.com/riscv/riscv-v-spec/issues/642 (accessed on 16 June 2023).
  32. EFLAGS Cross-Reference and Condition Codes. Available online: https://www.cs.utexas.edu/~byoung/cs429/condition-codes.pdf (accessed on 16 June 2023).