1. Introduction
In most architectures, memory locations are eight bits wide, but a load/store instruction can read or write a wider word. Such a multi-byte word is stored using memory positions with consecutive addresses. In this paper, we will call the lowest of those consecutive addresses the
base address. Usually, the stored word is referenced using this address [
1], and its size in bytes is a power of two. The word is said to be
aligned if its base address is a multiple of this size in bytes; otherwise, the word is said to be
misaligned or
unaligned. Implementation requirements usually force any memory access to a misaligned word to be split into at least two sub-accesses [
2]. For this reason, unaligned accesses are forbidden in many architectures. Another issue to take into account when dealing with multi-byte accesses is how to map the memory locations involved in such accesses to each byte of the accessed words. Such mapping is called
endianness or
byte order. Save for rare exceptions, the most widely used byte orders are
little-endian (LE) and
big-endian (BE). The first maps less significant bytes of the accessed word to the lowest memory positions (i.e., less significant bytes are stored first), while the second maps them to the highest memory positions (i.e., more significant bytes are stored first). Problems arise when systems with different endianness communicate [
1]. Depending on the coupling level of such systems, endianness mismatch scenarios can be classified as follows:
Detached systems: These systems do not have a communication line, so they can only communicate by sharing files stored on some kind of medium. Byte order conversions may be necessary, since the byte order of the data within a file depends on the file format [
1,
2,
3]. Furthermore, the byte order of the data structures of the file system itself must be taken into account.
Networked systems: These systems communicate through a network. Byte order mismatch is also an issue in this scenario, even if the native byte order of each system is the same, since network stacks and communication protocols define their own endianness [
1].
Highly coupled systems: A heterogeneous Multiprocessor System-on-Chip (MPSoC) can include several processors with different endianness [
1,
2,
4,
5,
6].
Software-emulated systems: A software emulator is a program that simulates the behavior of a computer system. The byte order of the simulated system (called the guest) and the endianness of the system running the emulator (called the host) can differ [
7,
8], so the latter may need to make conversions in order to emulate memory accesses.
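As a point of reference, the native byte order of a host can be probed with a few lines of C. This is a standard idiom shown for illustration, not code from the paper; the function name is ours:

```c
#include <stdint.h>
#include <string.h>

/* Probes the native byte order of the host: the value 1 stored in a
   32-bit integer has its least significant byte at the lowest address
   on a little-endian system. */
static int is_little_endian(void) {
    uint32_t v = 1;
    uint8_t first;
    memcpy(&first, &v, 1);   /* read the byte at the lowest address */
    return first == 1;       /* LSB stored first => little-endian   */
}
```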
Byte order conversions cannot be avoided. For example, most modern computer architectures use little-endian byte order, while big-endian data is extensively used in many application areas, including the following:
Network protocols, such as TCP/IP [
1].
File and multimedia formats, such as JPEG, TIFF and PDF [
1].
Industrial communication protocols, such as CAN [
9] and Modbus [
10].
Data produced by big-endian legacy systems such as IBM mainframes (IBM, North Castle, United States) and Motorola 68000 (Motorola, Schaumburg, United States) [
1].
Performing byte order conversions in software may introduce significant overhead [
1]. The relevance of this penalty depends on the frequency of the conversions. For example, the little-endian systems described in [
11] make intensive accesses to big-endian data, so their execution time could be reduced by 20% to 40% through careful selection of how and when to make byte order conversions. Software overhead can be reduced by including specific hardware for byte order conversion. For example, several instructions have been introduced into x86 architectures for this purpose [
1,
11,
12]. Byte order conversion overhead affects all kinds of processors, but it is especially important in embedded processors and microcontrollers, since this overhead can significantly reduce performance and system autonomy in battery-operated devices.
This paper focuses on improving the performance of RISC-V software involving intensive byte order conversions. RISC-V is a family of open standard Instruction Set Architectures (ISAs) developed, ratified and maintained by the RISC-V International foundation [
13]. As the name suggests, RISC-V follows most Reduced Instruction Set Computer (RISC) principles. RISC-V standards have been embraced by a fast-growing number of industrial and academic actors in 70 countries [
13]. Some of the reasons behind its success are as follows:
It has a very permissive license. Designers are allowed to develop open or closed compliant implementations for commercial or non-commercial purposes without royalties.
It allows for multiple implementation goals. When designing an implementation, developers can optimize the power consumption, performance, transistor count or any trade-off of these.
It is designed to suit many types of systems, including microcontrollers, personal computers, servers and supercomputers. To this end, RISC-V defines optional extensions and system profiles.
Since the embedded systems market is currently one of the fastest-growing areas for the application of RISC-V, improving RISC-V’s byte order conversion performance through a specific extension may be valuable for embedded RISC-V applications that need to perform such conversions frequently.
The main objective of this paper is to propose and develop a standard extension to remove endianness conversion overhead in RISC-V processors. The proposed extension uses a mechanism called Address-Encoded Byte Order (AEBO). Previous work has introduced the AEBO mechanism and shown that it can reduce the user execution time of software running on an embedded OpenRISC processor by 60% [
2]. The main contribution of this study is its application of the AEBO technique to the RISC-V architecture in such a way that it can be integrated as an optional extension in the RISC-V ecosystem. This paper details the implementation of the AEBO technique, the required additions to RISC-V’s configuration infrastructure, and the associated constraints.
The rest of the paper is organized as follows:
Section 2 includes some background on RISC-V architectures, including their alignment restrictions.
Section 3 describes possible ways to deal with byte order conversions. The proposed extension is presented in
Section 4, and
Section 5 describes how programmers can take advantage of it. Implementation details are described in
Section 6. The results are presented in
Section 7, and are discussed in the last section.
3. RISC-V Endianness Conversions
Since RISC-V is intended to suit virtually any scenario, applications that include intensive accesses and processing of both little-endian and big-endian data are likely. This section explores some methods for handling byte order conversions in RISC-V.
3.1. Using Generic Base ISA Instructions
Endianness conversion implies changing the order of the bytes within a word. The base ISA of RISC-V does not include any specific instruction to manipulate the bytes within a word, so any byte order change has to be performed using standard logic and shift instructions. The assembly code in
Figure 1 is an example of this: a BE signed half-word is read from memory in a GPR
a0. The native endianness is LE, so bytes
X0 and
X1 in the register have to be swapped and the sign extended before any useful calculation can be performed in the native endianness. The result has to be converted to BE format before it is written back to memory.
In this example, the manipulation of one word of data in the foreign endianness requires the execution of nine additional instructions. The relative overhead will depend on the algorithm being executed, but it is likely to be significant if long lists of BE data have to be processed and the computation performed with each word takes only a few assembly instructions.
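The kind of shift-and-mask sequence that the base ISA forces can be modeled in C. The following is an illustrative sketch of reading a big-endian signed half-word with only logic and shift operations; it is not the exact instruction sequence of Figure 1, and the function name is ours:

```c
#include <stdint.h>

/* Reads a big-endian signed 16-bit value using only logic and shift
   operations, mirroring the work the RISC-V base ISA must perform on a
   little-endian-native system. Illustrative sketch only. */
static int32_t load_be16_signed(const uint8_t *p) {
    uint32_t hi = p[0];                 /* most significant byte is stored first */
    uint32_t lo = p[1];
    uint32_t v  = (hi << 8) | lo;       /* swap into native (LE) order */
    return (int32_t)(v << 16) >> 16;    /* sign-extend the half-word   */
}
```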
3.2. Using Specific Byte Order Instructions
One way to reduce endianness conversion overhead could be to use specific instructions for dealing with byte ordering. These instructions may be of two types: instructions to reorder the bytes within a register, and instructions that load or store words in a byte order that is different from the native order of the process. The Zbb RISC-V extension [
18] includes an instruction of the first type, the
rev8 instruction, which reverses the order of the bytes in a GPR. This instruction makes it possible to reduce the endianness conversion penalty of RISC-V programs, but does not remove this penalty completely, since the execution of the instruction itself implies some overhead and more overhead is introduced when dealing with words shorter than
XLEN, since their conversion requires the execution of an additional
srai or
srli instruction [
18]. The code in
Figure 2 rewrites the example in
Figure 1 using the
rev8 instruction. Additional instructions for byte order conversions are still required, but their number is reduced from nine to four.
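The rev8-based pattern can be modeled in C as follows. This is an illustrative sketch with our function names, assuming RV32 (XLEN = 32): the half-word is loaded little-endian and sign-extended (lh), the register bytes are reversed (rev8), and an arithmetic right shift by 16 (srai) leaves the sign-extended big-endian value:

```c
#include <stdint.h>

/* Software model of the RV32 Zbb rev8 instruction: reverses the four
   bytes of a 32-bit register value. */
static uint32_t rev8_32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* lh + rev8 + srai pattern for reading a big-endian signed half-word. */
static int32_t load_be16_signed_rev8(const uint8_t *p) {
    int32_t r = (int16_t)(uint16_t)(p[0] | ((uint16_t)p[1] << 8)); /* lh   */
    r = (int32_t)rev8_32((uint32_t)r);                             /* rev8 */
    return r >> 16;                                                /* srai */
}
```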
The presence of instructions like rev8 is an improvement, but this instruction cannot be used with floating-point registers. Introducing variants of the rev8 instruction to deal with single-precision, double-precision and quad-precision floating-point registers would require additional operation codes (opcodes) and, again, would not completely remove the conversion overhead. To date, there have been no proposals of an RISC-V extension to address this.
Regarding instructions of the second type, i.e., load/store variants that access words in the non-native byte order: these would remove the byte order conversion overhead, but they would also consume opcodes. Considering that the RISC-V ISA includes more than 30 multi-byte memory access instructions (see
Table 1) and that more may be added in the future [
19], it is hardly an option to include non-native byte order variants of all of these instructions. To date, no RISC-V extension proposes any such instructions.
3.3. Using an RISC-V Bi-Endian Implementation
At first glance, it seems that an RISC-V bi-endian implementation could change the CSR bits that determine the endianness during runtime, so that if a process needs to make an access in a foreign byte order, it could write the corresponding CSR bit field, make the access and restore the CSR bit field to the previous value. Unfortunately, this approach has many drawbacks:
The execution of the instructions necessary to change and restore the CSR bit field would introduce an overhead. Moreover, since software running at privilege level U cannot change the UBE CSR bit field itself, it would have to make a system call to the operating system to change it. The large overhead of a system call, compared to that of reordering the bytes in software, would not be justified unless a large number of foreign-endian accesses had to be executed.
The standard RISC-V Application Binary Interfaces (ABIs) are expected to be purely little-endian-only or big-endian-only [
17]. Hence, many library functions will expect the process to remain at the endianness at which it was executed, making a byte order change practical only for executing complete level U programs in the opposite endianness.
Supervisor software (i.e., software running at privilege levels S or VS) cannot change the CSR bit field which controls its own data byte order. The rationale for this is that SBE and VSBE also control the endianness of implicit data accesses to supervisor-level memory management data structures, such as page tables, at the respective privilege levels, and changing these bit fields would alter the implementation’s interpretation of these data structures. Therefore, in practice, level S/VS software will hardly benefit from a bi-endian implementation to accelerate foreign-endian data access.
In summary, a bi-endian implementation can execute each process in one arbitrary endianness, but it cannot efficiently manipulate data in different byte orders within the same process.
4. RISC-V AEBO Extension
In order to overcome the barriers stated in the previous section to support endianness conversion, the extension proposed in this paper uses the Address-Encoded Byte Order (AEBO) technique introduced in [
2]. When using AEBO, there is no need to introduce new instructions to deal with bi-endian data, since the byte order of every data memory access is encoded in the address used to reference the accessed word. The AEBO technique is described in the following subsection.
4.1. The AEBO Technique
When AEBO is enabled for a privilege level, any explicit N-byte data access on that level is affected in the following ways:
The word to be read from or written to memory can be referenced using not only its base address, but also the address of any of its N bytes.
The byte order to be used depends on the address used to reference the accessed word.
The base address is the highest multiple of N that is not greater than the address used to reference the accessed word. This implies that the access is always aligned.
We will represent the address used to reference the accessed word as A, and the base address as A_b. Since N is a power of two, i.e., N is equal to 2^t for some integer t, A_b can be obtained just by clearing the t least significant bits of A. These t bits of A are used to select the byte order of the access. For example, if a process accesses a word of size N = 8 bytes (i.e., t = 3), the AEBO technique operates as follows:
If the t least significant bits of A are all 0 (i.e., A_2 A_1 A_0 = 000), the access is made using the native byte order of the process.
If A_0 is 1, the bytes within each consecutive pair of bytes in the word get swapped.
If A_1 is 1, 16-bit sub-words within each consecutive pair of 16-bit sub-words in the word get swapped.
If A_2 is 1, both halves of the word get swapped.
The same applies to the other values of
N. As an example,
Table 2 represents the four possible byte orders for a 4-byte word memory write in a process whose native byte order is little-endian, depending on the value of the two least significant bits of
A. Note that, in general, if all the bits are equal to 0, the access will use the native byte order of the process, whereas if they are all equal to 1, the reverse byte order will be used. Any other case will result in different mixed-endian configurations. A reading operation carries out exactly the same byte/word swaps while transferring the data to the destination.
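The 4-byte little-endian-native case can be modeled in software. The sketch below (the helper name is ours and is not part of the extension) applies the AEBO byte swaps when storing a word:

```c
#include <stdint.h>

/* Models an AEBO 4-byte store on a little-endian-native hart: byte n of
   the word W lands at address A_b + (n XOR (A mod 4)). Software sketch
   of the hardware behavior, not part of the extension itself. */
static void aebo_store32_le(uint8_t *mem, uint32_t A, uint32_t W) {
    uint32_t base = A & ~3u;   /* A_b: implicit alignment         */
    uint32_t sel  = A & 3u;    /* two LSBs select the byte order  */
    for (uint32_t n = 0; n < 4; n++)
        mem[base + (n ^ sel)] = (uint8_t)(W >> (8 * n));
}
```

With sel = 0 the store is little-endian, with sel = 3 it is big-endian, and sel = 1 or sel = 2 produce the two mixed-endian orderings.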
In summary, with the AEBO technique, all memory accesses are aligned to the base address A_b, and the t least significant bits of A are used to select the byte order of the access. A straightforward consequence is that an explicit data access cannot raise a misaligned address exception when AEBO is enabled, since such access is implicitly aligned.
4.2. RISC-V AEBO Extension Description
As we highlighted in
Section 2, CSR bit fields control the native byte order of explicit data accesses for each privilege level in RISC-V. From these,
SBE and
VSBE also control the byte order of implicit data accesses to structures such as page tables, so OS-level software cannot change these structures. The proposed AEBO extension only affects explicit data accesses, and it can be enabled separately for each privilege level. In particular, if the effective privilege mode of data accesses is not modified (i.e., the bit field
mstatus.MPRV = 0 [
17]), AEBO is enabled in privilege mode
x by setting the bit field of a CSR as defined in the extension. In this paper, we will denote this bit field as
xAE, and the value of the bit field determining the native byte order of the explicit data accesses in mode
x will be denoted as
xBE. As long as
mstatus.MPRV = 0,
xAE is as follows:
mstatus.MAE if x is M;
sstatus.SAE if x is S;
sstatus.UAE if x is U;
vsstatus.SAE if x is VS;
vsstatus.UAE if x is VU.
These bit fields are 0 right after reset. If AEBO is not implemented, they are read-only. According to the RISC-V documentation terminology, these fields are
Write Any values, Read Legal values (WARL). This means that system-level software can check whether the extension has been implemented by reading one of these bit fields right after trying to set it. Note that since
sstatus is a subset of
mstatus, the bit fields
sstatus.SAE and
sstatus.UAE are aliases of
mstatus.SAE and
mstatus.UAE, respectively. When AEBO is enabled for a privilege level
x, any explicit multi-byte data access in that mode is carried out using the AEBO technique described in
Section 4.1.
In order to formally define the byte order used during an explicit data access when AEBO is enabled, we use the following notation:
W: The word to be read or written.
N: The size of W in bytes.
t: The binary logarithm of N. Hence, N = 2^t.
n: An integer such that 0 ≤ n < N.
n_i: The i-th bit of the binary (base 2) representation of n. Hence, n = n_(t-1) … n_1 n_0 in binary.
W_i: The i-th bit of W, whereby the concatenation of the bits W_(8N-1), …, W_1 and W_0 is W.
B_n: The n-th byte of W, that is, the concatenation of the bits W_(8n+7), …, W_(8n); hence, the concatenation of the bytes B_(N-1), …, B_0 is W.
A: The address of the memory location used to reference W.
A_i: The i-th bit of the binary (base 2) representation of A.
xBE: The bit field determining the native byte order of the explicit data accesses. Depending on the effective privilege mode, it can be the bit fields MBE, SBE or UBE of mstatus, sstatus or vsstatus.
⊕: The XOR operator.
A_b: The base address of an AEBO access, defined as A with its t least significant bits cleared.
According to the AEBO technique description in
Section 4.1, and generalizing [
2], when the AEBO RISC-V extension is enabled, each byte B_n of the word W is mapped to the memory position whose address is as follows:

address(B_n) = A_b + (n ⊕ (A mod N) ⊕ (N - 1) · xBE)    (1)

Equivalently, bit i of this address is A_i ⊕ n_i ⊕ xBE for i < t, and A_i for i ≥ t.
This has the following implications:
If a word is referenced in the normal way, i.e., using the memory address of its first byte, then the native endianness is applied, i.e., big-endian if xBE = 1 and little-endian otherwise. Because of this, this address is called the native-endian address of the word.
If a word is referenced using the memory address of its last byte, then the reverse of the native endianness is applied, i.e., big-endian if xBE = 0 and little-endian otherwise. Because of this, this address is called the reverse endian address of the word.
If a word is referenced using the memory address of any of its other bytes, then a mixed-endian byte order is used. Any mixed-endian order can be selected by choosing the appropriate address.
For example, suppose
UAE = 1 (AEBO is enabled on level U),
UBE = 0 (the native byte order on level U is little-endian) and the following instruction is executed on level U:
lw x3, 0x1003(x0)
This instruction would read the 4-byte word stored in the memory positions 0x1000, 0x1001, 0x1002 and 0x1003, since the base address would be A_b = 0x1000. Also, since the word is referenced using the address of its last byte (0x1003), the loading is performed using a big-endian byte order according to the displacement for each byte, computed using Equation (1) and shown in Table 3.
The specific byte order for other word sizes and endianness configurations can be easily derived from Equation (
1).
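The mapping of Equation (1) can be checked with a small C model. The helper name is ours; it follows the AEBO rule that byte B_n of an N-byte word referenced through address A is stored at the base address plus n XOR (A mod N), with all displacement bits additionally inverted when the native order is big-endian:

```c
#include <stdint.h>

/* Memory address of byte B_n of an N-byte word referenced through
   address A, with native byte order xBE (0 = LE, 1 = BE).
   Software model of the AEBO mapping; N must be a power of two. */
static uint32_t aebo_byte_addr(uint32_t A, uint32_t N, uint32_t n, int xBE) {
    uint32_t base = A & ~(N - 1);                      /* A_b        */
    uint32_t sel  = (A & (N - 1)) ^ (xBE ? N - 1 : 0); /* byte order */
    return base + (n ^ sel);
}
```

For the lw x3, 0x1003(x0) example (N = 4, xBE = 0), the least significant byte B_0 maps to 0x1003 and the most significant byte B_3 to 0x1000, i.e., a big-endian load.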
Note that if AEBO is implemented, any load/store instruction (integer, floating, read, write, read–write, etc.) can make use of it.
5. The Use of AEBO in Software
It can be easily deduced from the previous section that using the AEBO extension in software does not require the introduction of new instructions into the ISA; it just requires the use of the right address. In this section, possible ways to use the AEBO extension from assembly and C code are explored.
5.1. Using the AEBO Extension from Assembly Code
Let A_b be the base address of an N-byte data word. If the AEBO extension is enabled, an N-byte access to address A_b will read or write the word using the native endianness of the process, while an access to address A_b + N - 1 will use the reverse endianness. Accesses in any mixed-endianness are possible, but these will not be discussed here because they have very limited applications. For example, the code in
Figure 3 is the AEBO equivalent to the code in
Figure 1 and
Figure 2. As can be easily observed, no byte order manipulation instructions are necessary with the AEBO extension, and the processing overhead of using foreign-endian data is completely eliminated.
5.2. Using the AEBO Extension from C Code
Using the AEBO extension from C is trickier because AEBO requires precise control over the values of pointers. The
macros defined in
aebo.h, shown in
Figure 4, may help. The
le_ptr and
be_ptr macros will modify a pointer to produce a little- or big-endian access, respectively, using the AEBO technique. When the selected endianness is the native one, the macro will leave the pointer unchanged; otherwise, the macro will expand to the macro
__foreign_endian_ptr, which carries out the actual pointer conversion.
The macros le_var and be_var can be used directly on variables of basic types (no arrays or structs). They use the __foreign_endian_var macro, which, in turn, is built around the __foreign_endian_ptr macro, by referencing the variable before changing the pointer and de-referencing the resulting pointer afterwards.
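A minimal sketch of how such macros could be written follows, under the assumption that the foreign-endian address of an N-byte object is its base address plus N - 1. The macro bodies are our assumption, not the paper's aebo.h, and on hardware without AEBO the adjusted pointer must not be dereferenced:

```c
#include <stdint.h>

/* Points to the last byte of *p, which AEBO interprets as a
   reverse-endian reference to the whole object. Sketch only;
   relies on the GCC/Clang __typeof__ extension. */
#define __foreign_endian_ptr(p) \
    ((__typeof__(p))(void *)((uintptr_t)(p) + sizeof(*(p)) - 1))

#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
#define le_ptr(p) (p)                      /* native: leave unchanged */
#define be_ptr(p) __foreign_endian_ptr(p)
#else
#define be_ptr(p) (p)
#define le_ptr(p) __foreign_endian_ptr(p)
#endif
```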
The code in
Figure 5 is a sample test program that uses these macros. Variables with names ending in
_be are intended to store data in big-endian byte order. The code should equally work in a native little-endian or big-endian system. In the following discussion, a native little-endian system is assumed. Line 14 defines a pointer to access the list elements in big-endian byte order, so the resulting address assigned to pointer
ptr is
&list_be[0]+1. Lines 16 and 17 populate the list with data. As the pointer is incremented using pointer arithmetic, the right address that triggers big-endian access is preserved, so all the data in the list will have big-endian byte order. Lines 21 and 22 add all the elements of the list. Conversion from BE to LE is performed on the fly as the data is read, and the result of the addition is stored in the variable
sum in the native byte order. Line 24 adds together the LE variable
sum and the BE variable
offset_be, and stores the result in the BE variable
sum_be. Note that the
be_var macro can be used on both the right-hand and the left-hand sides of the assignment.
Although the macros above may simplify the use of AEBO in C programs, the AEBO extension needs the compiler to generate code that does not alter the byte width of the data being transferred. This means, for example, that any access to an
int16_t type (2 bytes) should be translated to half-word transfer instructions like
lh or
sh, and the access should not be divided into multiple byte transfer instructions (like
lb or
sb). However, the compiler is likely to alter the access width as a result of applying typical code optimization techniques. For example, the GNU compiler (GCC) [
20] version 14.2.0 was used to compile the code in
Figure 5 with different code optimization levels. With any optimization level (default is level 2, compilation option
-O2), the code compiled from the expansion of macro
__foreign_endian_ptr keeps the transfer size, but the code compiled from the expansion of macro
__foreign_endian_var will use instructions
lb and
sb, instead of
lh and
sh. Only if optimization is completely disabled (option
-O0) will code be generated that does not alter the data transfer width; however, in general, disabling code optimization is not a realistic option.
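In practice, one common way to discourage the compiler from changing the access width is to perform the access through a volatile-qualified pointer, as is routinely done for device registers. This is a widespread convention rather than a C-standard guarantee, and it inhibits other optimizations, so AEBO-aware compiler support remains the robust solution. A sketch:

```c
#include <stdint.h>

/* volatile accesses are generally compiled as single loads/stores of
   the declared width (lh/sh for int16_t on RISC-V), even under -O2.
   Common practice for memory-mapped I/O; not a formal guarantee. */
static int16_t read_half(const volatile int16_t *p)       { return *p; }
static void    write_half(volatile int16_t *p, int16_t v) { *p = v;    }
```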
This means that, for the practical use of the AEBO extension from C, the compiler should be aware that the extension is available and should generate code accordingly, as is the case for the many other RISC-V extensions that are available. For AEBO, this basically means that the width of data memory accesses should be preserved in the compiled code. It is important to note that the C macros and code samples introduced in this section are examples of the kind of support that is necessary from the compiler/library side. Although complete AEBO compiler/library support is beyond the scope of this paper, it should not be a technical challenge for system software developers, and is recommended as a direction for future work.
5.3. Using the AEBO Extension with Existing Software
The standard C library includes several macros for dealing with byte order conversion from BE and LE to the host byte order, and the other way around. These macros have the form beNNtoh, leNNtoh, htobeNN and htoleNN, where NN is 16, 32 or 64. They are defined in the file endian.h, and the names are self-explanatory. These macros rely on the lower-level macros __bswap_16, __bswap_32 and __bswap_64, which are typically defined in the file byteswap.h. The lower-level macros will, ultimately, be mapped by the compiler to machine code that is optimized for byte swapping depending on the hardware’s capabilities, including the available extensions. These low-level macros could be rewritten to support the AEBO extension and generate code that uses it, greatly improving the performance of existing software that uses the interface defined in endian.h just by re-compiling the source code.
Note that an existing software function executed in an AEBO-enabled system that takes a pointer to a number type (or a list of them) as a parameter, and is designed to work in the native byte order, may also work with data in a foreign byte order by passing the pointer to data incremented by N - 1, with
N being the byte width of the data type, provided that the function uses the type’s byte width in all memory accesses. In this way, a large collection of legacy software may support data processing in the foreign byte order without even recompiling it, although this support is not guaranteed. For example, let
mean32 be a legacy function that takes a pointer to a list of 32-bit integers and a number of elements, and calculates the mean value of the elements in the list in the native byte order. Using the AEBO extension, the same function could be called to process a list in either LE or BE byte order, as shown in the code template in
Figure 6. More generally, new software can easily support both LE and BE byte orders using AEBO, as long as it fulfills the above requirement and the code is compiled with an AEBO-aware compiler.
6. Sample RISC-V AEBO Implementation
The AEBO extension is very easy to implement, and it has a negligible impact on the critical path delay and the transistor count. To illustrate this,
Figure 7 depicts a simplified schematic of the interconnection of the load/store unit of an RISC-V implementation with the first memory level. Although it is not designed to support AEBO, there is no need to modify it to introduce the extension, as we will see below. The base ISA of this implementation is RV32I, so
XLEN = 32. The physical addresses of the system in the picture have
p bits. When a word is read or written, the load/store unit provides the memory system with all the physical address bits except for the two least significant ones, i.e., A_(p-1), …, A_2. The memory system can provide simultaneous access to the memory locations at the consecutive addresses A_b, A_b + 1, A_b + 2 and A_b + 3. The load/store unit generates the respective control signals bs_0, bs_1, bs_2 and bs_3 to tell the memory system which of these four memory locations will be accessed. Two levels of swappers, such as the one in
Figure 8, are used to map the accessed memory locations to the bytes of the register word W to be read or written. These levels are numbered starting from 0. If AEBO is not implemented, the signal s_i controlling the set of swappers at level i depends on the address bit A_i, the size of the accessed word (i.e., N = 2^t) and the native endianness of the process executing the access (i.e., xBE), in the following way: s_i is equal to xBE if i < t; otherwise, s_i is equal to A_i.
For example, if a process whose native byte order is big-endian (xBE = 1) accesses a single byte (N = 1, t = 0) at an address A such that A_0 = 0 and A_1 = 1, then the data lines corresponding to the memory location at address A will be connected to W_7, …, W_0, i.e., the least significant bits of W, and only the control signal bs_2 will be activated. If the process accesses a word of two bytes (N = 2, t = 1) at the same address, then those data lines will be connected to W_15, …, W_8, i.e., the most significant bits of W; the data lines corresponding to the memory location at address A + 1 will be connected to W_7, …, W_0; and the control signals bs_2 and bs_3 will be activated.
In order to implement AEBO, only two modifications are required: First, s_i will be equal to xBE ⊕ A_i when i < t. Second, the control logic will depend on the xAE bit fields, so that data address-misaligned exceptions will not be raised when AEBO is enabled. Note that the delay and hardware cost should barely be affected by the implementation of AEBO, since only the logic generating these control signals is modified. If, for example, the logic that generates s_i is in the critical path, the delay of only one XOR gate is introduced. In addition, since the AEBO technique works at the first level of the memory interface and does not alter which bytes or words have to be sent to/retrieved from memory, the technique should be transparent to the use of cache memory or virtual addressing techniques.
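The swapper control rule can be modeled in C. The function below (the naming is ours) computes the signal for swapper level i from the address bit at that level, the word-size exponent t, the native-endianness bit xBE and an AEBO-enable flag, following the rules described above:

```c
/* Control signal for the swappers at level i of the memory interface:
   without AEBO: s_i = xBE          for i < t, otherwise s_i = A_i;
   with AEBO:    s_i = xBE XOR A_i  for i < t, otherwise s_i = A_i.
   Software sketch of the control logic; signal naming is ours. */
static int swap_ctrl(int i, unsigned A, int t, int xBE, int aebo) {
    int Ai = (int)(A >> i) & 1;     /* address bit at this level */
    if (i >= t) return Ai;          /* level outside the accessed word */
    return aebo ? (xBE ^ Ai) : xBE;
}
```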
In order to experimentally estimate its required engineering effort, the proposed extension was implemented in a fork of SCR1 [
21,
22], a high-quality, industry-grade and silicon-proven open-source RISC-V RV32I/RV32E MCU core designed and maintained by Syntacore [
23]. The chosen fork (
gambaman/scr1) is available at [
24]. At the time of writing, the difference between the main branches of SCR1 and
gambaman/scr1 is that the former can only be little-endian, while the latter can be configured to be little-endian, big-endian or bi-endian by editing the file
scr1_arch_description.svh. The default
gambaman/scr1 configuration is bi-endian, i.e.,
mstatus.MBE is writable. The following line can be uncommented to force a little-endian-only implementation:
//`define SCR1_IMMUTABLE_ENDIANNES LITTLE_ENDIAN
Alternatively, the following line can be uncommented to force a big-endian-only implementation:
//`define SCR1_IMMUTABLE_ENDIANNES BIG_ENDIAN
The SCR1 package comes with extensive documentation. Simulation instructions are available in [
21,
22]. The
gambaman/scr1 package includes a new test bench to check bi-endian behavior. The execution of this test bench can be simulated with the command
make TARGETS="biendian_sample"
An experimental branch supporting the proposed extension was added to
gambaman/scr1. This branch was called AEBO, and is available at [
24]. Although AEBO is supported by default in the experimental branch, it can be synthesized without the proposed extension by uncommenting the following line of the file
scr1_arch_description.svh:
//`define SCR1_NO_AEBO
The AEBO branch includes another test bench to check the functionality of the proposed extension in a bi-endian implementation of the core. As long as the endianness setting is not modified, the execution of this test bench can be simulated with the command
make TARGETS="AEBO_sample"
Results were obtained from implementations of the original SCR1 core and three modified versions of it, including the AEBO extension. The modified versions are little-endian, big-endian and bi-endian, and can be found at the AEBO branch of the
gambaman/scr1 fork [
24].
The hardware implementation platform used was a Digilent Nexys A7 Board [
25] featuring a Xilinx Artix XC7A100T-CSG324 Field-Programmable Gate Array (FPGA) chip [
26]. Synthesis was carried out using the Vivado tool [
27] version 2023.2, following the instructions available at [
28] (note that the Nexys A7 Board is referred to as Nexys 4 DDR in these instructions). No synthesis or implementation options were modified.
As discussed in the next section, instructions for building binary images of the test bench, which were used to measure the software’s performance, can be found in the file
README.md of the AEBO branch at [
24].
7. Results
In this section, the costs of implementing the proposed extension are presented in terms of hardware resources, critical path delay, power consumption and software performance. Regarding the engineering effort, although the memory interconnection of
gambaman/scr1 is slightly different from the one depicted in
Figure 7, implementing the proposed extension was very straightforward. It was only necessary to apply the following modifications to the main branch:
The MAE bit field was added to the mstatus CSR. This bit was added to the interfaces of the Control–Status Register File (CSRF), the load/store unit (LSU) and the modules containing them.
The control logic was modified to disable misaligned address exceptions caused by explicit data accesses when AEBO is enabled.
The LSU was modified so that the signals controlling the multiplexers that permute the bytes read from or written to the register file always depend on the least significant bits of the address. In addition, the least significant bits of the LSU output address were masked so that the address issued to memory is always aligned.
A diff comparison of the main and the AEBO branches can easily be performed using the software repository in [
24] in order to analyze the changes to the base code introduced by the AEBO extension.
7.1. Hardware Resources
Regarding resource utilization, only the number of FPGA Look-Up Tables (LUTs) and flip-flops was affected. Results were obtained for the original SCR1 design and three variants supporting the AEBO extensions: bi-endian, little-endian and big-endian. Whole-core and specific LSU figures are shown in
Table 4. For the whole core, the results are very similar in all cases: flip-flop utilization is slightly reduced in the versions supporting AEBO, and LUT utilization varies by less than 1%. In addition, the changes introduced in the LSU for AEBO support are minimal, as discussed in
Section 6, so its resource usage remains close to that of the original design: the AEBO variants use only two additional flip-flops and a slightly smaller number of LUTs.
7.2. Delay and Power
The maximum path delays in the CPU clock domain and estimates of the power consumption of the original core and the AEBO variants are shown in
Table 5. All implementations were carried out at the target frequency of the original core, 30 MHz. The maximum delay is very similar in all cases, as expected from the implementation analysis in
Section 6, and no changes to the AEBO variants were needed in order to meet the target frequency.
The introduction of the AEBO extension did not significantly affect power consumption in any configuration; only the dynamic power changed slightly. The total extra power consumption of the AEBO variants compared to the original design remained consistently below 0.5%.
7.3. Software Performance
In this subsection, the performance gain that can be obtained by using the AEBO extension with a sample data processing algorithm is analyzed. The selected algorithm is a DC blocker filter subroutine that subtracts the mean value from a set of 4096 random samples. Each sample is a 16-bit signed integer. It is assumed that the algorithms are executed on a platform whose native byte order is little-endian. Four cases are compared:
Case LE: The set of samples is in little-endian format, which is the native byte order of the architecture. No format conversion is required. This case is taken as the performance reference when the data to be processed is in the native endianness supported by the test platform.
Case BE: The list of samples is in big-endian format, which is a foreign byte order to the architecture. Format conversions are performed in the software using the base RV32I instruction set supported by the test platform.
Case BE-Zbb: This is similar to the previous case, but it is assumed that the test platform includes the Zbb RISC-V standard extension for bit manipulation, so specific byte swapping and sign extension instructions can be used to improve performance.
Case BE-AEBO: The data is in BE format. The test platform supports the AEBO extension, and it is used to access data in the desired byte order. Therefore, each word is referenced using the address of its last byte.
To compare the four cases above using an optimized algorithm in each case, three versions of the filter subroutine were coded in the RISC-V assembly language:
offset_filter: The native version of the subroutine, used to process data in LE format and also in BE format with the AEBO extension.
offset_filter_be: A BE version of the subroutine using the standard RV32I instruction set.
offset_filter_rev: A BE version of the subroutine using the RV32I instruction set plus bit manipulation instructions from the Zbb RISC-V extension.
In the native (LE) version of the algorithm, each value in the list of samples has to be read twice and written once: the first reading happens in the first loop to accumulate all the values and calculate the offset, and the second reading and the writing happen in the second loop, which updates the values in the list. The two BE versions of the algorithm need to carry out two readings and two writings for each sample: a first reading to calculate the offset and a first writing to store the values converted to the native LE format in order to avoid a new endianness conversion later, followed by another reading and writing in the second loop to update the list, as before.
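The two-loop structure described above can be summarized in C. This is a sketch of the native (LE) version only; the subroutines actually benchmarked are hand-written RISC-V assembly, and the function name `offset_filter_c` is illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* C sketch of the native two-loop DC blocker: the first loop reads every
 * sample once to accumulate the sum and derive the offset (mean), and the
 * second loop reads each sample again and writes it back with the offset
 * subtracted — two reads and one write per sample in total. */
static void offset_filter_c(int16_t *samples, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++)          /* first loop: one read/sample */
        sum += samples[i];
    int16_t offset = (int16_t)(sum / (int32_t)n);
    for (size_t i = 0; i < n; i++)          /* second loop: read + write */
        samples[i] = (int16_t)(samples[i] - offset);
}
```

The BE variants add a per-sample byte swap in the first loop and an extra write to store the converted values, which is where their additional memory traffic comes from.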
A test bench program written in C language creates a random list of numbers and invokes the right subroutine depending on the case under study. The test bench program also computes the execution performance of the subroutine by reading the performance counters from the microprocessor. The complete code for each subroutine version and the test bench program can be found in the folder
sw/AEBO_DC_blocker_filter of the AEBO branch at [
24]. As previously mentioned, the instructions for building binary images of the test bench using command-line tools can be found in the file
README.md.
Table 6 shows the values read from the performance counters for cases LE, BE and BE-AEBO when executed on the SCR1 test platform. Case BE-Zbb cannot be executed on the test platform because SCR1 does not support the Zbb extension. The time readings exactly match the expected values calculated from the number of cycles executed at the platform’s clock frequency of 30 MHz, making them redundant. Therefore, in this paper, only the number of cycles will be used as a performance reference.
Table 7 presents the number of instructions per sample processed by the algorithm in the four cases under study,
both measured from the readings of the hardware performance counters and calculated from the code by counting the instructions executed in each loop of the algorithm. The two values agree almost exactly; the negligible difference is due to the few setup instructions executed outside the loops, which are included in the measured values. The number of instructions per sample was also calculated for the BE-Zbb case, even though it could not be executed on the test platform. The relative increment with respect to the LE reference case is also included in the table.
Similarly,
Table 8 shows the measured and calculated number of cycles per sample for each case. The calculated values were obtained by taking into account the throughput of the implementation for different types of instructions: one cycle for integer instructions, two cycles for instructions accessing data memory and three cycles for conditional branching instructions. In this way, it was also possible to calculate the cycle count that would be obtained for the BE-Zbb case if the Zbb extension was included in the SCR1 core.
From the results in
Table 7 and
Table 8, it can be seen that processing non-native BE data with the basic RV32I instruction set requires twice the number of instructions and increases the processing time by over 60% compared to processing native LE data. Even on a more capable processor that implements the Zbb extension, 55% more instructions are needed and the performance penalty is close to 40% when processing foreign-endian data. In contrast, when using the AEBO extension, the processing performance for foreign-endian data is the same as for native-endian data, and there is no need to modify or even reassemble/recompile the native algorithm.