1. Introduction
In most architectures, memory locations are eight bits wide, but a load/store instruction can read or write a wider word. Such a multi-byte word is stored using memory positions with consecutive addresses. In this paper, we will call the lowest of those consecutive addresses the
base address. Usually, the stored word is referenced using this address [
1], and its size in bytes is a power of two. The word is said to be
aligned if its base address is a multiple of this size in bytes; otherwise, the word is said to be
misaligned or
unaligned. Implementation requirements usually force any memory access to a misaligned word to be split into at least two sub-accesses [
2]. For this reason, unaligned accesses are forbidden in many architectures. Another issue to take into account when dealing with multi-byte accesses is how to map the memory locations involved in such accesses to each byte of the accessed words. Such mapping is called
endianness or
byte order. Save for rare exceptions, the most widely used byte orders are
little-endian (LE) and
big-endian (BE). The first maps less significant bytes of the accessed word to the lowest memory positions (i.e., less significant bytes are stored first), while the second maps them to the highest memory positions (i.e., more significant bytes are stored first). Problems arise when systems with different endianness communicate [
1]. Depending on the coupling level of such systems, endianness mismatch scenarios can be classified as follows:
Detached systems: These systems do not have a communication line, so they can only communicate by sharing files stored on some kind of medium. Byte order conversions may be necessary, since the byte order of the data within a file depends on the file format [
1,
2,
3]. Furthermore, the byte order of the data structures of the file system itself must be taken into account.
Networked systems: These systems communicate through a network. Byte order mismatch is also an issue in this scenario, even if the native byte order of each system is the same, since network stacks and communication protocols define their own endianness [
1].
Highly coupled systems: A heterogeneous Multiprocessor System-on-Chip (MPSoC) can include several processors with different endianness [
1,
2,
4,
5,
6].
Software-emulated systems: A software emulator is a program that simulates the behavior of a computer system. The byte order of the simulated system (called the guest) and the endianness of the system running the emulator (called the host) can differ [
7,
8], so the latter may need to make conversions in order to emulate memory accesses.
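As a point of reference, the native byte order of a host can be probed with a few lines of C. This is a standard idiom shown for illustration, not code from the paper; the function name is ours:

```c
#include <stdint.h>
#include <string.h>

/* Probes the native byte order of the host: the value 1 stored in a
   32-bit integer has its least significant byte at the lowest address
   on a little-endian system. */
static int is_little_endian(void) {
    uint32_t v = 1;
    uint8_t first;
    memcpy(&first, &v, 1);   /* read the byte at the lowest address */
    return first == 1;       /* LSB stored first => little-endian   */
}
```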
Byte order conversions cannot be avoided. For example, most modern computer architectures use little-endian byte order, while big-endian data is extensively used in many application areas, including the following:
Network protocols, such as TCP/IP [
1].
File and multimedia formats, such as JPEG, TIFF and PDF [
1].
Industrial communication protocols, such as CAN [
9] and Modbus [
10].
Data produced by big-endian legacy systems such as IBM mainframes (IBM, North Castle, United States) and Motorola 68000 (Motorola, Schaumburg, United States) [
1].
Performing byte order conversions in software may introduce significant overhead [
1]. The relevance of this penalty depends on the frequency of the conversions. For example, the little-endian systems described in [
11] make intensive accesses to big-endian data, so their execution time could be reduced by 20% to 40% through careful selection of how and when to make byte order conversions. Software overhead can be reduced by including specific hardware for byte order conversion. For example, several instructions have been introduced into x86 architectures for this purpose [
1,
11,
12]. Byte order conversion overhead affects all kinds of processors, but it is especially important in embedded processors and microcontrollers, since this overhead can significantly reduce performance and system autonomy in battery-operated devices.
This paper focuses on improving the performance of RISC-V software involving intensive byte order conversions. RISC-V is a family of open standard Instruction Set Architectures (ISAs) developed, ratified and maintained by the RISC-V International foundation [
13]. As the name suggests, RISC-V follows most Reduced Instruction Set Computer (RISC) principles. RISC-V standards have been embraced by a fast-growing number of industrial and academic actors in 70 countries [
13]. Some of the reasons behind its success are as follows:
It has a very permissive license. Designers are allowed to develop open or closed compliant implementations for commercial or non-commercial purposes without royalties.
It allows for multiple implementation goals. When designing an implementation, developers can optimize the power consumption, performance, transistor count or any trade-off of these.
It is designed to suit many types of systems, including microcontrollers, personal computers, servers and supercomputers. To this end, RISC-V defines optional extensions and system profiles.
Since the embedded systems market is currently one of the fastest-growing areas for the application of RISC-V, improving RISC-V’s byte order conversion performance through a specific extension may be valuable for embedded RISC-V applications that need to perform such conversions frequently.
The main objective of this paper is to propose and develop a standard extension to remove endianness conversion overhead in RISC-V processors. The proposed extension uses a mechanism called Address-Encoded Byte Order (AEBO). Previous work has introduced the AEBO mechanism and shown that it can reduce the user execution time of software running on an embedded OpenRISC processor by 60% [
2]. The main contribution of this study is its application of the AEBO technique to the RISC-V architecture in such a way that it can be integrated as an optional extension in the RISC-V ecosystem. This paper details the implementation of the AEBO technique, the required additions to RISC-V’s configuration infrastructure, and the associated constraints.
The rest of the paper is organized as follows:
Section 2 includes some background on RISC-V architectures, including their alignment restrictions.
Section 3 describes possible ways to deal with byte order conversions. The proposed extension is presented in
Section 4, and
Section 5 describes how programmers can take advantage of it. Implementation details are described in
Section 6. The results are presented in
Section 7, and are discussed in the last section.
3. RISC-V Endianness Conversions
Since RISC-V is intended to suit virtually any scenario, applications that include intensive accesses and processing of both little-endian and big-endian data are likely. This section explores some methods for handling byte order conversions in RISC-V.
3.1. Using Generic Base ISA Instructions
Endianness conversion implies changing the order of the bytes within a word. The base ISA of RISC-V does not include any specific instruction to manipulate the bytes within a word, so any byte order change has to be performed using standard logic and shift instructions. The assembly code in
Figure 1 is an example of this: a BE signed half-word is read from memory in a GPR
a0. The native endianness is LE, so bytes
X0 and
X1 in the register have to be swapped and the sign extended before any useful calculation can be performed in the native endianness. The result has to be converted to BE format before it is written back to memory.
In this example, the manipulation of one word of data in the foreign endianness requires the execution of nine additional instructions. The relative overhead will depend on the algorithm being executed, but it is likely to be significant if long lists of BE data have to be processed and the computation performed with each word takes only a few assembly instructions.
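The kind of shift-and-mask sequence that the base ISA forces can be modeled in C. The following is an illustrative sketch of reading a big-endian signed half-word with only logic and shift operations; it is not the exact instruction sequence of Figure 1, and the function name is ours:

```c
#include <stdint.h>

/* Reads a big-endian signed 16-bit value using only logic and shift
   operations, mirroring the work the RISC-V base ISA must perform on a
   little-endian-native system. Illustrative sketch only. */
static int32_t load_be16_signed(const uint8_t *p) {
    uint32_t hi = p[0];                 /* most significant byte is stored first */
    uint32_t lo = p[1];
    uint32_t v  = (hi << 8) | lo;       /* swap into native (LE) order */
    return (int32_t)(v << 16) >> 16;    /* sign-extend the half-word   */
}
```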
3.2. Using Specific Byte Order Instructions
One way to reduce endianness conversion overhead could be to use specific instructions for dealing with byte ordering. These instructions may be of two types: instructions to reorder the bytes within a register, and instructions that load or store words in a byte order that is different from the native order of the process. The Zbb RISC-V extension [
18] includes an instruction of the first type, the
rev8 instruction, which reverses the order of the bytes in a GPR. This instruction makes it possible to reduce the endianness conversion penalty of RISC-V programs, but does not remove this penalty completely, since the execution of the instruction itself implies some overhead and more overhead is introduced when dealing with words shorter than
XLEN, since their conversion requires the execution of an additional
srai or
srli instruction [
18]. The code in
Figure 2 rewrites the example in
Figure 1 using the
rev8 instruction. Additional instructions for byte order conversions are still required, but their number is reduced from nine to four.
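The rev8-based pattern can be modeled in C as follows. This is an illustrative sketch with our function names, assuming RV32 (XLEN = 32): the half-word is loaded little-endian and sign-extended (lh), the register bytes are reversed (rev8), and an arithmetic right shift by 16 (srai) leaves the sign-extended big-endian value:

```c
#include <stdint.h>

/* Software model of the RV32 Zbb rev8 instruction: reverses the four
   bytes of a 32-bit register value. */
static uint32_t rev8_32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* lh + rev8 + srai pattern for reading a big-endian signed half-word. */
static int32_t load_be16_signed_rev8(const uint8_t *p) {
    int32_t r = (int16_t)(uint16_t)(p[0] | ((uint16_t)p[1] << 8)); /* lh   */
    r = (int32_t)rev8_32((uint32_t)r);                             /* rev8 */
    return r >> 16;                                                /* srai */
}
```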
The presence of instructions like rev8 is an improvement, but this instruction cannot be used with floating-point registers. Introducing variants of the rev8 instruction to deal with single-precision, double-precision and quad-precision floating-point registers would require additional operation codes (opcodes) and, again, would not completely remove the conversion overhead. To date, there have been no proposals of an RISC-V extension to address this.
Regarding instructions of the second type, i.e., load/store variants that access words in the non-native byte order: these would remove the byte order conversion overhead, but they would also consume opcodes. Considering that the RISC-V ISA includes more than 30 multi-byte memory access instructions (see
Table 1) and that more may be added in the future [
19], it is hardly an option to include non-native byte order variants of all of these instructions. To date, no RISC-V extension proposes any such instructions.
3.3. Using an RISC-V Bi-Endian Implementation
At first glance, it seems that an RISC-V bi-endian implementation could change the CSR bits that determine the endianness during runtime, so that if a process needs to make an access in a foreign byte order, it could write the corresponding CSR bit field, make the access and restore the CSR bit field to the previous value. Unfortunately, this approach has many drawbacks:
The execution of the instructions necessary to change and restore the CSR bit field would introduce an overhead. Moreover, since software running at privilege level U cannot change the UBE CSR bit field itself, it would have to make a system call to the operating system to change it. The large overhead of a system call, compared to that of reordering the bytes in software, would not be justified unless a large number of foreign-endian accesses had to be executed.
The standard RISC-V Application Binary Interfaces (ABIs) are expected to be purely little-endian-only or big-endian-only [
17]. Hence, many library functions will expect the process to remain at the endianness at which it was executed, making a byte order change practical only for executing complete level U programs in the opposite endianness.
Supervisor software (i.e., software running at privilege levels S or VS) cannot change the CSR bit field which controls its own data byte order. The rationale for this is that SBE and VSBE also control the endianness of implicit data accesses to supervisor-level memory management data structures, such as page tables, at the respective privilege levels, and changing these bit fields would alter the implementation’s interpretation of these data structures. Therefore, in practice, level S/VS software will hardly benefit from a bi-endian implementation to accelerate foreign-endian data access.
In summary, a bi-endian implementation can execute each process in one arbitrary endianness, but it cannot efficiently manipulate data in different byte orders within the same process.
4. RISC-V AEBO Extension
In order to overcome the barriers stated in the previous section to support endianness conversion, the extension proposed in this paper uses the Address-Encoded Byte Order (AEBO) technique introduced in [
2]. When using AEBO, there is no need to introduce new instructions to deal with bi-endian data, since the byte order of every data memory access is encoded in the address used to reference the accessed word. The AEBO technique is described in the following subsection.
4.1. The AEBO Technique
When AEBO is enabled for a privilege level, any explicit N-byte data access on that level is affected in the following ways:
The word to be read from or written to memory can be referenced using not only its base address, but also the address of any of its N bytes.
The byte order to be used depends on the address used to reference the accessed word.
The base address is the highest multiple of N that is not greater than the address used to reference the accessed word. This implies that the access is always aligned.
We will represent the address used to reference the accessed word as A, and the base address as A_b. Since N is a power of two, i.e., N is equal to 2^t for some integer t, A_b can be obtained just by clearing the t least significant bits of A. These t bits of A are used to select the byte order of the access. For example, if a process accesses a word of size N = 8 bytes (i.e., t = 3), the AEBO technique operates as follows:
If the t least significant bits of A are all 0 (i.e., A_2 A_1 A_0 = 000), the access is made using the native byte order of the process.
If A_0 is 1, the bytes within each consecutive pair of bytes in the word get swapped.
If A_1 is 1, 16-bit sub-words within each consecutive pair of 16-bit sub-words in the word get swapped.
If A_2 is 1, both halves of the word get swapped.
The same applies to the other values of
N. As an example,
Table 2 represents the four possible byte orders for a 4-byte word memory write in a process whose native byte order is little-endian, depending on the value of the two least significant bits of
A. Note that, in general, if all the bits are equal to 0, the access will use the native byte order of the process, whereas if they are all equal to 1, the reverse byte order will be used. Any other case will result in different mixed-endian configurations. A reading operation carries out exactly the same byte/word swaps while transferring the data to the destination.
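The 4-byte little-endian-native case can be modeled in software. The sketch below (the helper name is ours and is not part of the extension) applies the AEBO byte swaps when storing a word:

```c
#include <stdint.h>

/* Models an AEBO 4-byte store on a little-endian-native hart: byte n of
   the word W lands at address A_b + (n XOR (A mod 4)). Software sketch
   of the hardware behavior, not part of the extension itself. */
static void aebo_store32_le(uint8_t *mem, uint32_t A, uint32_t W) {
    uint32_t base = A & ~3u;   /* A_b: implicit alignment         */
    uint32_t sel  = A & 3u;    /* two LSBs select the byte order  */
    for (uint32_t n = 0; n < 4; n++)
        mem[base + (n ^ sel)] = (uint8_t)(W >> (8 * n));
}
```

With sel = 0 the store is little-endian, with sel = 3 it is big-endian, and sel = 1 or sel = 2 produce the two mixed-endian orderings.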
In summary, with the AEBO technique, all memory accesses are aligned to the base address A_b, and the t least significant bits of A are used to select the byte order of the access. A straightforward consequence is that an explicit data access cannot raise a misaligned address exception when AEBO is enabled, since such access is implicitly aligned.
4.2. RISC-V AEBO Extension Description
As we highlighted in
Section 2, CSR bit fields control the native byte order of explicit data accesses for each privilege level in RISC-V. From these,
SBE and
VSBE also control the byte order of implicit data accesses to structures such as page tables, so OS-level software cannot change these structures. The proposed AEBO extension only affects explicit data accesses, and it can be enabled separately for each privilege level. In particular, if the effective privilege mode of data accesses is not modified (i.e., the bit field
mstatus.MPRV = 0 [
17]), AEBO is enabled in privilege mode
x by setting the bit field of a CSR as defined in the extension. In this paper, we will denote this bit field as
xAE, and the value of the bit field determining the native byte order of the explicit data accesses in mode
x will be denoted as
xBE. As long as
mstatus.MPRV = 0,
xAE is as follows:
mstatus.MAE if x is M;
sstatus.SAE if x is S;
sstatus.UAE if x is U;
vsstatus.SAE if x is VS;
vsstatus.UAE if x is VU.
These bit fields are 0 right after reset. If AEBO is not implemented, they are read-only. According to the RISC-V documentation terminology, these fields are
Write Any values, Read Legal values (WARL). This means that system-level software can check whether the extension has been implemented by reading one of these bit fields right after trying to set it. Note that since
sstatus is a subset of
mstatus, the bit fields
sstatus.SAE and
sstatus.UAE are aliases of
mstatus.SAE and
mstatus.UAE, respectively. When AEBO is enabled for a privilege level
x, any explicit multi-byte data access in that mode is carried out using the AEBO technique described in
Section 4.1.
In order to formally define the byte order used during an explicit data access when AEBO is enabled, we use the following notation:
W: The word to be read or written.
N: The size of W in bytes.
t: The binary logarithm of N. Hence, N = 2^t.
n: An integer such that 0 ≤ n < N.
n_i: The i-th bit of the binary (base 2) representation of n. Hence, n = n_(t-1) … n_1 n_0 in binary.
W_i: The i-th bit of W, whereby the concatenation of the bits W_(8N-1), …, W_1 and W_0 is W.
B_n: The n-th byte of W, that is, the concatenation of the bits W_(8n+7), …, W_(8n); hence, the concatenation of the bytes B_(N-1), …, B_0 is W.
A: The address of the memory location used to reference W.
A_i: The i-th bit of the binary (base 2) representation of A.
xBE: The bit field determining the native byte order of the explicit data accesses. Depending on the effective privilege mode, it can be the bit fields MBE, SBE or UBE of mstatus, sstatus or vsstatus.
⊕: The XOR operator.
A_b: The base address of an AEBO access, defined as A with its t least significant bits cleared.
According to the AEBO technique description in
Section 4.1, and generalizing [
2], when the AEBO RISC-V extension is enabled, each byte B_n of the word W is mapped to the memory position whose address is as follows:

address(B_n) = A_b + (n ⊕ (A mod N) ⊕ (N - 1) · xBE)    (1)

Equivalently, bit i of this address is A_i ⊕ n_i ⊕ xBE for i < t, and A_i for i ≥ t.
This has the following implications:
If a word is referenced in the normal way, i.e., using the memory address of its first byte, then the native endianness is applied, i.e., big-endian if xBE = 1 and little-endian otherwise. Because of this, this address is called the native-endian address of the word.
If a word is referenced using the memory address of its last byte, then the reverse of the native endianness is applied, i.e., big-endian if xBE = 0 and little-endian otherwise. Because of this, this address is called the reverse endian address of the word.
If a word is referenced using the memory address of any of its other bytes, then a mixed-endian byte order is used. Any mixed-endian order can be selected by choosing the appropriate address.
For example, suppose
UAE = 1 (AEBO is enabled on level U),
UBE = 0 (the native byte order on level U is little-endian) and the following instruction is executed on level U:
lw x3, 0x1003(x0)
This instruction would read the 4-byte word stored in the memory positions 0x1000, 0x1001, 0x1002 and 0x1003, since the base address would be A_b = 0x1000. Also, since the word is referenced using the address of its last byte (0x1003), the loading is performed using a big-endian byte order according to the displacement for each byte, computed using Equation (1) and shown in Table 3.
The specific byte order for other word sizes and endianness configurations can be easily derived from Equation (
1).
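The mapping of Equation (1) can be checked with a small C model. The helper name is ours; it follows the AEBO rule that byte B_n of an N-byte word referenced through address A is stored at the base address plus n XOR (A mod N), with all displacement bits additionally inverted when the native order is big-endian:

```c
#include <stdint.h>

/* Memory address of byte B_n of an N-byte word referenced through
   address A, with native byte order xBE (0 = LE, 1 = BE).
   Software model of the AEBO mapping; N must be a power of two. */
static uint32_t aebo_byte_addr(uint32_t A, uint32_t N, uint32_t n, int xBE) {
    uint32_t base = A & ~(N - 1);                      /* A_b        */
    uint32_t sel  = (A & (N - 1)) ^ (xBE ? N - 1 : 0); /* byte order */
    return base + (n ^ sel);
}
```

For the lw x3, 0x1003(x0) example (N = 4, xBE = 0), the least significant byte B_0 maps to 0x1003 and the most significant byte B_3 to 0x1000, i.e., a big-endian load.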
Note that if AEBO is implemented, any load/store instruction (integer, floating, read, write, read–write, etc.) can make use of it.
5. The Use of AEBO in Software
It can be easily deduced from the previous section that using the AEBO extension in software does not require the introduction of new instructions into the ISA; it just requires the use of the right address. In this section, possible ways to use the AEBO extension from assembly and C code are explored.
5.1. Using the AEBO Extension from Assembly Code
Let A_b be the base address of an N-byte data word. If the AEBO extension is enabled, an N-byte access to address A_b will read or write the word using the native endianness of the process, while an access to address A_b + N - 1 will use the reverse endianness. Accesses in any mixed-endianness are possible, but these will not be discussed here because they have very limited applications. For example, the code in
Figure 3 is the AEBO equivalent to the code in
Figure 1 and
Figure 2. As can be easily observed, no byte order manipulation instructions are necessary with the AEBO extension, and the processing overhead of using foreign-endian data is completely eliminated.
5.2. Using the AEBO Extension from C Code
Using the AEBO extension from C is trickier because AEBO requires precise control over the values of pointers. The
macros defined in
aebo.h, shown in
Figure 4, may help. The
le_ptr and
be_ptr macros will modify a pointer to produce a little- or big-endian access, respectively, using the AEBO technique. When the selected endianness is the native one, the macro will leave the pointer unchanged; otherwise, the macro will expand to the macro
__foreign_endian_ptr, which carries out the actual pointer conversion.
The macros le_var and be_var can be used directly on variables of basic types (no arrays or structs). They use the __foreign_endian_var macro, which, in turn, is built around the __foreign_endian_ptr macro, by referencing the variable before changing the pointer and de-referencing the resulting pointer afterwards.
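A minimal sketch of how such macros could be written follows, under the assumption that the foreign-endian address of an N-byte object is its base address plus N - 1. The macro bodies are our assumption, not the paper's aebo.h, and on hardware without AEBO the adjusted pointer must not be dereferenced:

```c
#include <stdint.h>

/* Points to the last byte of *p, which AEBO interprets as a
   reverse-endian reference to the whole object. Sketch only;
   relies on the GCC/Clang __typeof__ extension. */
#define __foreign_endian_ptr(p) \
    ((__typeof__(p))(void *)((uintptr_t)(p) + sizeof(*(p)) - 1))

#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
#define le_ptr(p) (p)                      /* native: leave unchanged */
#define be_ptr(p) __foreign_endian_ptr(p)
#else
#define be_ptr(p) (p)
#define le_ptr(p) __foreign_endian_ptr(p)
#endif
```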
The code in
Figure 5 is a sample test program that uses these macros. Variables with names ending in
_be are intended to store data in big-endian byte order. The code should equally work in a native little-endian or big-endian system. In the following discussion, a native little-endian system is assumed. Line 14 defines a pointer to access the list elements in big-endian byte order, so the resulting address assigned to pointer
ptr is
&list_be[0]+1. Lines 16 and 17 populate the list with data. As the pointer is incremented using pointer arithmetic, the right address that triggers big-endian access is preserved, so all the data in the list will have big-endian byte order. Lines 21 and 22 add all the elements of the list. Conversion from BE to LE is performed on the fly as the data is read, and the result of the addition is stored in the variable
sum in the native byte order. Line 24 adds together the LE variable
sum and the BE variable
offset_be, and stores the result in the BE variable
sum_be. Note that the
be_var macro can be used on both the right-hand and the left-hand sides of the assignment.
Although the macros above may simplify the use of AEBO in C programs, the AEBO extension needs the compiler to generate code that does not alter the byte width of the data being transferred. This means, for example, that any access to an
int16_t type (2 bytes) should be translated to half-word transfer instructions like
lh or
sh, and the access should not be divided into multiple byte transfer instructions (like
lb or
sb). However, the compiler is likely to alter the access width as a result of applying typical code optimization techniques. For example, the GNU compiler (GCC) [
20] version 14.2.0 was used to compile the code in
Figure 5 with different code optimization levels. With any optimization level (default is level 2, compilation option
-O2), the code compiled from the expansion of macro
__foreign_endian_ptr keeps the transfer size, but the code compiled from the expansion of macro
__foreign_endian_var will use instructions
lb and
sb, instead of
lh and
sh. Only if optimization is completely disabled (option
-O0) will code be generated that does not alter the data transfer width; however, in general, disabling code optimization is not a realistic option.
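In practice, one common way to discourage the compiler from changing the access width is to perform the access through a volatile-qualified pointer, as is routinely done for device registers. This is a widespread convention rather than a C-standard guarantee, and it inhibits other optimizations, so AEBO-aware compiler support remains the robust solution. A sketch:

```c
#include <stdint.h>

/* volatile accesses are generally compiled as single loads/stores of
   the declared width (lh/sh for int16_t on RISC-V), even under -O2.
   Common practice for memory-mapped I/O; not a formal guarantee. */
static int16_t read_half(const volatile int16_t *p)       { return *p; }
static void    write_half(volatile int16_t *p, int16_t v) { *p = v;    }
```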
This means that, for the practical use of the AEBO extension from C, the compiler should be aware that the extension is available and should generate code accordingly, as is the case for the many other RISC-V extensions that are available. For AEBO, this basically means that the width of data memory accesses should be preserved in the compiled code. It is important to note that the C macros and code samples introduced in this section are examples of the kind of support that is necessary from the compiler/library side. Although complete AEBO compiler/library support is beyond the scope of this paper, it should not be a technical challenge for system software developers, and is recommended as a direction for future work.
5.3. Using the AEBO Extension with Existing Software
The standard C library includes several macros for dealing with byte order conversion from BE and LE to the host byte order, and the other way around. These macros have the form beNNtoh, leNNtoh, htobeNN and htoleNN, where NN is 16, 32 or 64. They are defined in the file endian.h, and the names are self-explanatory. These macros rely on the lower-level macros __bswap_16, __bswap_32 and __bswap_64, which are typically defined in the file byteswap.h. The lower-level macros will, ultimately, be mapped by the compiler to machine code that is optimized for byte swapping depending on the hardware’s capabilities, including the available extensions. These low-level macros could be rewritten to support the AEBO extension and generate code that uses it, greatly improving the performance of existing software that uses the interface defined in endian.h just by re-compiling the source code.
Note that an existing software function executed in an AEBO-enabled system that takes a pointer to a number type (or a list of them) as a parameter, and is designed to work in the native byte order, may also work with data in a foreign byte order by passing the pointer to data incremented by N - 1, with
N being the byte width of the data type, provided that the function uses the type’s byte width in all memory accesses. In this way, a large collection of legacy software may support data processing in the foreign byte order without even recompiling it, although this support is not guaranteed. For example, let
mean32 be a legacy function that takes a pointer to a list of 32-bit integers and a number of elements, and calculates the mean value of the elements in the list in the native byte order. Using the AEBO extension, the same function could be called to process a list in either LE or BE byte order, as shown in the code template in
Figure 6. More generally, new software can easily support both LE and BE byte orders using AEBO, as long as it fulfills the above requirement and the code is compiled with an AEBO-aware compiler.
6. Sample RISC-V AEBO Implementation
The AEBO extension is very easy to implement, and it has a negligible impact on the critical path delay and the transistor count. To illustrate this,
Figure 7 depicts a simplified schematic of the interconnection of the load/store unit of an RISC-V implementation with the first memory level. Although it is not designed to support AEBO, there is no need to modify it to introduce the extension, as we will see below. The base ISA of this implementation is RV32I, so
XLEN = 32. The physical addresses of the system in the picture have
p bits. When a word is read or written, the load/store unit provides the memory system with all the physical address bits except for the two least significant ones, i.e., A_(p-1), …, A_2. The memory system can provide simultaneous access to the memory locations at the consecutive addresses A_b, A_b + 1, A_b + 2 and A_b + 3. The load/store unit generates the respective control signals bs_0, bs_1, bs_2 and bs_3 to tell the memory system which of these four memory locations will be accessed. Two levels of swappers, such as the one in
Figure 8, are used to map the accessed memory locations to the bytes of the register word W to be read or written. These levels are numbered starting from 0. If AEBO is not implemented, the signal s_i controlling the set of swappers at level i depends on the address bit A_i, the size of the accessed word (i.e., N = 2^t) and the native endianness of the process executing the access (i.e., xBE), in the following way: s_i is equal to xBE if i < t; otherwise, s_i is equal to A_i.
For example, if a process whose native byte order is big-endian (xBE = 1) accesses a single byte (N = 1, t = 0) at an address A such that A_0 = 0 and A_1 = 1, then the data lines corresponding to the memory location at address A will be connected to W_7, …, W_0, i.e., the least significant bits of W, and only the control signal bs_2 will be activated. If the process accesses a word of two bytes (N = 2, t = 1) at the same address, then those data lines will be connected to W_15, …, W_8, i.e., the most significant bits of W; the data lines corresponding to the memory location at address A + 1 will be connected to W_7, …, W_0; and the control signals bs_2 and bs_3 will be activated.
In order to implement AEBO, only two modifications are required: First, s_i will be equal to xBE ⊕ A_i when i < t. Second, the control logic will depend on the xAE bit fields, so that data address-misaligned exceptions will not be raised when AEBO is enabled. Note that the delay and hardware cost should barely be affected by the implementation of AEBO, since only the logic generating these control signals is modified. If, for example, the logic that generates s_i is in the critical path, the delay of only one XOR gate is introduced. In addition, since the AEBO technique works at the first level of the memory interface and does not alter which bytes or words have to be sent to/retrieved from memory, the technique should be transparent to the use of cache memory or virtual addressing techniques.
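The swapper control rule can be modeled in C. The function below (the naming is ours) computes the signal for swapper level i from the address bit at that level, the word-size exponent t, the native-endianness bit xBE and an AEBO-enable flag, following the rules described above:

```c
/* Control signal for the swappers at level i of the memory interface:
   without AEBO: s_i = xBE          for i < t, otherwise s_i = A_i;
   with AEBO:    s_i = xBE XOR A_i  for i < t, otherwise s_i = A_i.
   Software sketch of the control logic; signal naming is ours. */
static int swap_ctrl(int i, unsigned A, int t, int xBE, int aebo) {
    int Ai = (int)(A >> i) & 1;     /* address bit at this level */
    if (i >= t) return Ai;          /* level outside the accessed word */
    return aebo ? (xBE ^ Ai) : xBE;
}
```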
In order to experimentally estimate its required engineering effort, the proposed extension was implemented in a fork of SCR1 [
21,
22], a high-quality, industry-grade and silicon-proven open-source RISC-V RV32I/RV32E MCU core designed and maintained by Syntacore [
23]. The chosen fork (
gambaman/scr1) is available at [
24]. At the time of writing, the difference between the main branches of SCR1 and
gambaman/scr1 is that the former can only be little-endian, while the latter can be configured to be little-endian, big-endian or bi-endian by editing the file
scr1_arch_description.svh. The default
gambaman/scr1 configuration is bi-endian, i.e.,
mstatus.MBE is writable. The following line can be uncommented to force a little-endian-only implementation:
//`define SCR1_IMMUTABLE_ENDIANNES LITTLE_ENDIAN
Alternatively, the following line can be uncommented to force a big-endian-only implementation:
//`define SCR1_IMMUTABLE_ENDIANNES BIG_ENDIAN
The SCR1 package comes with extensive documentation. Simulation instructions are available in [
21,
22]. The
gambaman/scr1 package includes a new test bench to check bi-endian behavior. The execution of this test bench can be simulated with the command
make TARGETS="biendian_sample"
An experimental branch supporting the proposed extension was added to
gambaman/scr1. This branch was called AEBO, and is available at [
24]. Although AEBO is supported by default in the experimental branch, it can be synthesized without the proposed extension by uncommenting the following line of the file
scr1_arch_description.svh:
//`define SCR1_NO_AEBO
The AEBO branch includes another test bench to check the functionality of the proposed extension in a bi-endian implementation of the core. As long as the endianness setting is not modified, the execution of this test bench can be simulated with the command
make TARGETS="AEBO_sample"
Results were obtained from implementations of the original SCR1 core and three modified versions of it, including the AEBO extension. The modified versions are little-endian, big-endian and bi-endian, and can be found at the AEBO branch of the
gambaman/scr1 fork [
24].
The hardware implementation platform used was a Digilent Nexys A7 Board [
25] featuring a Xilinx Artix XC7A100T-CSG324 Field-Programmable Gate Array (FPGA) chip [
26]. Synthesis was carried out using the Vivado tool [
27] version 2023.2, following the instructions available at [
28] (note that the Nexys A7 Board is referred to as Nexys 4 DDR in these instructions). No synthesis or implementation options were modified.
As discussed in the next section, instructions for building binary images of the test bench, which were used to measure the software’s performance, can be found in the file
README.md of the AEBO branch at [
24].
7. Results
In this section, the costs of implementing the proposed extension are presented in terms of hardware resources, critical path delay, power consumption and software performance. Regarding the engineering effort, although the memory interconnection of
gambaman/scr1 is slightly different from the one depicted in
Figure 7, implementing the proposed extension was very straightforward. It was only necessary to apply the following modifications to the main branch:
The MAE bit field was added to the mstatus CSR. This bit was added to the interfaces of the Control–Status Register File (CSRF), the load/store unit (LSU) and the modules containing them.
The control logic was modified to disable misaligned address exceptions caused by explicit data accesses when AEBO is enabled.
The LSU was modified so that the signals controlling the multiplexers that permute the bytes read from or written to the register file always depend on the least significant bits of the address. In addition, the least significant bits of the LSU output address were masked so that the address issued to memory is always aligned.
A diff comparison of the main and the AEBO branches can easily be performed using the software repository in [
24] in order to analyze the changes to the base code introduced by the AEBO extension.
7.1. Hardware Resources
Regarding resource utilization, only the number of FPGA Look-Up Tables (LUTs) and flip-flops was affected. Results were obtained for the original SCR1 design and three variants supporting the AEBO extensions: bi-endian, little-endian and big-endian. Whole-core and specific LSU figures are shown in
Table 4. For the whole core, the results are very similar in all cases: flip-flop utilization is slightly reduced in the versions supporting AEBO, and LUT utilization varies by less than 1%. In addition, the changes introduced in the LSU for AEBO support are minimal, as discussed in
Section 6, so its resource usage remains close to that of the original design: the AEBO variants use only two additional flip-flops and a slightly smaller number of LUTs.
7.2. Delay and Power
The maximum path delays in the CPU clock domain and estimates of the power consumption of the original core and the AEBO variants are shown in
Table 5. All implementations were carried out at the target frequency of the original core, 30 MHz. The maximum delay is very similar in all cases, as expected from the implementation analysis in
Section 6, and no changes to the AEBO variants were needed in order to meet the target frequency.
The introduction of the AEBO extension did not significantly affect power consumption in any configuration; only the dynamic power changed slightly. The total extra power consumption of the AEBO variants compared to the original design remained consistently below 0.5%.
7.3. Software Performance
In this subsection, the performance gain that can be obtained by using the AEBO extension with a sample data processing algorithm is analyzed. The selected algorithm is a DC blocker filter subroutine that subtracts the mean value from a set of 4096 random samples. Each sample is a 16-bit signed integer. It is assumed that the algorithms are executed on a platform whose native byte order is little-endian. Four cases are compared:
Case LE: The set of samples is in little-endian format, which is the native byte order of the architecture. No format conversion is required. This case is taken as the performance reference when the data to be processed is in the native endianness supported by the test platform.
Case BE: The list of samples is in big-endian format, which is a foreign byte order to the architecture. Format conversions are performed in the software using the base RV32I instruction set supported by the test platform.
Case BE-Zbb: This is similar to the previous case, but it is assumed that the test platform includes the Zbb RISC-V standard extension for bit manipulation, so specific byte swapping and sign extension instructions can be used to improve performance.
Case BE-AEBO: The data is in BE format. The test platform supports the AEBO extension, and it is used to access data in the desired byte order. Therefore, each word is referenced using the address of its last byte.
To compare the four cases above using an optimized algorithm in each case, three versions of the filter subroutine were coded in the RISC-V assembly language:
offset_filter: The native version of the subroutine, used to process data in LE format and also in BE format with the AEBO extension.
offset_filter_be: A BE version of the subroutine using the standard RV32I instruction set.
offset_filter_rev: A BE version of the subroutine using the RV32I instruction set plus bit manipulation instructions from the Zbb RISC-V extension.
In the native (LE) version of the algorithm, each value in the list of samples has to be read twice and written once: the first reading happens in the first loop to accumulate all the values and calculate the offset, and the second reading and the writing happen in the second loop, which updates the values in the list. The two BE versions of the algorithm need to carry out two readings and two writings for each sample: a first reading to calculate the offset and a first writing to store the values converted to the native LE format in order to avoid a new endianness conversion later, followed by another reading and writing in the second loop to update the list, as before.
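The two-loop structure described above can be summarized in C. This is a sketch of the native (LE) version only; the subroutines actually benchmarked are hand-written RISC-V assembly, and the function name `offset_filter_c` is illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* C sketch of the native two-loop DC blocker: the first loop reads every
 * sample once to accumulate the sum and derive the offset (mean), and the
 * second loop reads each sample again and writes it back with the offset
 * subtracted — two reads and one write per sample in total. */
static void offset_filter_c(int16_t *samples, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++)          /* first loop: one read/sample */
        sum += samples[i];
    int16_t offset = (int16_t)(sum / (int32_t)n);
    for (size_t i = 0; i < n; i++)          /* second loop: read + write */
        samples[i] = (int16_t)(samples[i] - offset);
}
```

The BE variants add a per-sample byte swap in the first loop and an extra write to store the converted values, which is where their additional memory traffic comes from.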
A test bench program written in C language creates a random list of numbers and invokes the right subroutine depending on the case under study. The test bench program also computes the execution performance of the subroutine by reading the performance counters from the microprocessor. The complete code for each subroutine version and the test bench program can be found in the folder
sw/AEBO_DC_blocker_filter of the AEBO branch at [
24]. As previously mentioned, the instructions for building binary images of the test bench using command-line tools can be found in the file
README.md.
Table 6 shows the values read from the performance counters for cases LE, BE and BE-AEBO when executed on the SCR1 test platform. Case BE-Zbb cannot be executed on the test platform because SCR1 does not support the Zbb extension. The time readings exactly match the expected values calculated from the number of cycles executed at the platform’s clock frequency of 30 MHz, making them redundant. Therefore, in this paper, only the number of cycles will be used as a performance reference.
Table 7 presents the number of instructions per sample processed by the algorithm in the four cases under study,
both measured from the readings of the hardware performance counters and calculated from the code by counting the instructions executed in each loop of the algorithm. The two values agree almost exactly; the negligible difference is due to the few setup instructions executed outside the loops, which are included in the measured values. The number of instructions per sample was also calculated for the BE-Zbb case, even though it could not be executed on the test platform. The relative increment with respect to the LE reference case is also included in the table.
Similarly,
Table 8 shows the measured and calculated number of cycles per sample for each case. The calculated values were obtained by taking into account the throughput of the implementation for different types of instructions: one cycle for integer instructions, two cycles for instructions accessing data memory and three cycles for conditional branching instructions. In this way, it was also possible to calculate the cycle count that would be obtained for the BE-Zbb case if the Zbb extension was included in the SCR1 core.
From the results in
Table 7 and
Table 8, it can be seen that processing non-native BE data with the basic RV32I instruction set requires twice the number of instructions and increases the processing time by over 60% compared to processing native LE data. Even on a more capable processor that implements the Zbb extension, 55% more instructions are needed and the performance penalty is close to 40% when processing foreign-endian data. In contrast, when using the AEBO extension, the processing performance for foreign-endian data is the same as for native-endian data, and there is no need to modify or even reassemble/recompile the native algorithm.