Abstract
Accurate branch prediction is crucial for achieving high instruction throughput and minimizing control hazards in modern pipelines. This paper presents a novel LXOR (Local eXclusive-OR) branch predictor, which enhances prediction accuracy while reducing hardware complexity and memory usage. Unlike traditional predictors (GAg, GAp, PAg, PAp, Gshare, Gselect) that rely on large Pattern History Tables (PHTs) or intricate global/local history combinations, the LXOR predictor employs complemented local history and XOR-based indexing, optimizing table access and reducing aliasing. Implemented and evaluated using the MARSS-RISCV simulator on a 64-bit in-order RISC-V core, the LXOR’s performance was compared against traditional predictors using Coremark and SPEC CPU2017 benchmarks. The LXOR consistently achieved competitive results, with a prediction accuracy of up to 83.92%, lower misprediction rates, and instruction flushes as low as 5.83%. It also attained an IPC rate of up to 0.83, all while maintaining a compact memory footprint of approximately 2 KB, significantly smaller than current alternatives. These findings demonstrate that the LXOR predictor not only matches the performance of more complex predictors but does so with less memory and logic overhead, making it ideal for embedded systems, low-power RISC-V processors, and resource-constrained IoT and edge devices. By balancing prediction accuracy with simplicity, the LXOR offers a scalable and cost-effective solution for next-generation microprocessors.
1. Introduction
Computer architecture is continually advancing with the primary goal of enhancing processor performance, efficiency, and reliability. Within this landscape, branch prediction has emerged as a vital microarchitectural technique for achieving high instruction-level parallelism (ILP) and throughput in contemporary processors. Accurate branch prediction enables speculative instruction fetching and execution without waiting for the resolution of branch outcomes, effectively minimizing stalls and the need for pipeline flushes. Conversely, frequent mispredictions can severely impact performance by instigating instruction rollbacks and elevating latency. Despite advancements in classical prediction schemes such as GAg, GAp, PAg, PAp, Gshare, and Gselect, these methods frequently grapple with the trade-off between prediction accuracy and storage overhead. While larger predictor tables tend to enhance accuracy, they also significantly increase memory requirements, posing challenges particularly in embedded systems and environments where energy constraints are paramount. This ongoing tension between achieving high accuracy and maintaining a low memory footprint represents a critical challenge in the realm of branch predictor design. The motivation behind this work is to tackle this challenge by developing a branch predictor that combines high accuracy with memory efficiency. We introduce the LXOR branch predictor, a novel lightweight design that utilizes XOR-based history encoding to mitigate aliasing effects while ensuring compact table sizes. Our evaluation of the proposed predictor against traditional methods on RISC-V processors, utilising standard benchmarking suites, demonstrates a superior balance of prediction accuracy, instructions per cycle (IPC), and reduced memory footprint.
Branch prediction [1] is a fundamental mechanism in high-performance microprocessors, particularly in architectures featuring deep pipelines and wide instruction windows. Upon encountering a branch instruction, the processor anticipates the outcome to determine the execution path. When predictions are accurate, execution proceeds without disruption; however, a misprediction necessitates pipeline flushing, leading to stalls that significantly impair throughput. Amdahl’s Law [2] emphasises that overall speedup is bounded by the portion of execution that cannot be improved. Even if other instructions are executed rapidly, the delays introduced by mispredicted branches create bottlenecks in the pipeline. This scenario highlights the importance of precise branch prediction algorithms in mitigating stalls and enhancing overall processor efficiency.
As illustrated in Figure 1 [3,4], a processor’s operation can be broadly categorized into five distinct stages.
Figure 1.
Five stages in a processor.
During the fetch stage, the processor retrieves instructions from memory while updating the program counter (PC), which includes calculating branch targets when necessary. In the decode stage, the instruction is analyzed, operands are fetched, and control signals for execution are generated. In the execute stage, the Arithmetic Logic Unit (ALU) carries out arithmetic, logical, and branching operations. This is followed by memory access for load/store operations, and the final stage involves writing results back to the appropriate registers.
Pipelining [3,4] is a technique used in processor design that segments the instruction execution process into a series of sequential stages, as shown in Figure 2. Each stage is dedicated to handling a specific portion of the instruction set simultaneously, with one instruction being concurrently assigned to each stage. This method operates under the premise that all stages require a uniform amount of time to complete their tasks. As each time interval elapses, the instructions advance to the next stage in the pipeline. The arrangement of these sequential stages constitutes the pipeline, and processors that implement this architecture are classified as pipelined processors. This design enhances throughput by allowing multiple instructions to be in different stages of execution concurrently.
Figure 2.
Order of operations in a pipeline.
A hazard [3,4] in a processor is a situation where the normal instruction execution flow is disrupted, leading to delays or incorrect results. Hazards commonly occur in pipelined processors, where multiple instructions are processed simultaneously at different stages of execution.
There are three primary categories of hazards to consider:
(a) Structural Hazards: A structural hazard arises when two or more instructions require the same hardware resource simultaneously, resulting in pipeline stalls.
(b) Data Hazards: A data hazard arises when an instruction depends on the outcome of a prior instruction that has not yet finished executing. This dependency can lead to potential stalls or delays in the pipeline, as the subsequent instruction must wait for the necessary data to be available before it can proceed.
(c) Control Hazards [5,6] (Branch Hazards): A control hazard arises from the inclusion of branch instructions within the program flow (e.g., if-else statements, loops, and jumps): the processor is uncertain about which instruction to fetch next until the branch is resolved.
Approximately 15% to 25% of all instructions within a program are estimated to be branch instructions [7,8]. Branch instructions can be categorized into three primary types: jump, call, and return, with each type further classified as either conditional or unconditional. These instructions inherently disrupt the linear flow of program execution; rather than simply incrementing the PC to proceed to the next instruction, the counter is redirected to point to a specific memory address. This deviation can introduce control hazards, leading to potential delays as the pipeline fills with instructions that may ultimately need to be discarded due to mispredictions [1,9,10,11,12]. To mitigate the performance impacts of branch instructions, modern processor architectures frequently employ Branch Prediction (BP) techniques. BP aims to enhance instruction throughput by reducing the penalties typically associated with pipeline stalls. It entails forecasting the outcome of a branch, whether it will be taken or not, prior to its actual execution. Preemptively, the processor initiates the execution of instructions along the predicted path. When the prediction aligns with the actual outcome, it results in a branch hit, thereby optimising the utilisation of the instruction pipeline and improving overall execution efficiency. However, if the prediction is incorrect (known as a branch miss), the instructions executed from the wrong path must be discarded, and the correct instructions from the right path must be fetched into the pipeline. This incorrect prediction can lead to branch overhead, which ultimately slows down processing.
Section 2 provides a comprehensive examination of the relevant literature on both static and dynamic branch prediction methodologies. Section 3 elaborates on the design of the proposed LXOR (Local eXclusive-OR) branch predictor, emphasising its innovative indexing scheme, which aims to reduce aliasing and improve prediction accuracy while ensuring minimal hardware overhead. Section 4 outlines the evaluation methodology and experimental framework, presenting performance results that benchmark LXOR against established predictors, including GAg, GAp, PAg, PAp, Gshare, and Gselect, utilizing workloads from Coremark and SPEC CPU2017. Section 5 delivers an extensive discussion and comparative analysis of LXOR’s performance across various scenarios. Finally, Section 6 concludes the paper by summarizing key findings, implications, and insights.
2. Related Work
Branch predictions are classified broadly into two categories [10,11,13]:
2.1. Static Branch Prediction
This technique is very simple: every branch is assumed to be either always taken or always not taken. The prediction does not vary based on runtime information. This approach operates on a static assumption for the duration of program execution, without leveraging historical data from prior outcomes. As a result, it exhibits lower energy consumption when executing instructions compared to alternative branch prediction methods, which often require dynamic analysis and tracking of previous execution patterns.
Static branch prediction operates in the following manner [10]:
- Single-direction prediction: Every branch is predicted in the same direction, either always taken or always not taken.
- Backward taken forward not taken (BTFNT): This method assumes that backward branches are taken and forward branches are not taken.
2.2. Dynamic Branch Prediction
Dynamic branch predictors [5] depend on the previous outcomes of conditional branch operations to predict whether a branch will be taken or not.
The basic classification of dynamic branch predictors is as follows [3,4,14]:
- One-bit Branch Predictor.
- Two-bit Branch Predictor.
2.2.1. One-Bit Branch Predictor
This predictor depends on the recent outcomes of the branch instruction to predict whether the branch is taken or not taken. It utilizes a single bit stored in the Branch History Table (BHT) for each branch instruction, reflecting the outcome of the last execution, specifically, whether the branch was taken or not taken. When the bit is set to 1, the predictor forecasts that the branch will be taken in the subsequent execution; conversely, a cleared bit (0) indicates a prediction that the branch will not be taken. This simplistic approach, while quite efficient, has limitations in accuracy due to its reliance on only the most recent branch outcome.
2.2.2. Two-Bit Predictor
Another class of dynamic branch prediction is the 2-bit predictor, which considers the previous outcomes of a conditional branch instruction and changes its prediction only after it has been wrong twice in a row. The 2-bit predictor is a four-state finite automaton, as shown in Figure 3, whereas the 1-bit dynamic branch predictor has 2 states.
Figure 3.
Four-state finite automata.
The four states are as follows: strongly taken (ST), weakly taken (WT), weakly not taken (WNT), and strongly not taken (SNT) [14]. In the first two states, SNT and WNT, the branch is predicted not taken, and in the last two states, WT and ST, it is predicted taken. If the state is SNT and the branch is taken, the state moves one step from left to right, in this case to WNT. Once the state reaches ST, it remains there as long as the branch continues to be taken; if the branch is not taken, the state moves one step from right to left, to WT. Similarly, if the state is SNT and the branch is not taken, it remains in SNT.
Bimodal Branch Predictor
Bimodal branch prediction [1,10] is one of the simplest and most commonly used branch prediction techniques in modern processors. It aims to forecast the result of conditional branch instructions (i.e., whether the branch will be taken or not) to enhance instruction pipeline efficiency and reduce stalls.
- Components of the bimodal branch prediction unit.
- Branch History Table (BHT): The core component of a Bimodal Branch Prediction unit is the Branch History Table (BHT). It is essentially an array of counters, with each counter representing a particular branch instruction [15]. For a 2-bit counter, each entry takes one of four states (values): strongly taken (11), weakly taken (10), weakly not taken (01), or strongly not taken (00), as shown in Table 1.
Table 1.
State table for a 2-bit counter.
| State | Value |
|---|---|
| Strongly taken | 11 |
| Weakly taken | 10 |
| Weakly not taken | 01 |
| Strongly not taken | 00 |
Table 1 above shows the states in 2 bits. These bits help us predict the branch. The most significant bit (left bit) is used to predict whether the branch will be taken or not. If the bit is 1, it indicates that the branch will be taken; else, it will not.
The basic idea for counter-based branch prediction is to use an N-bit up/down counter for prediction [16]. In the ideal case, an N-bit counter (with some initial value) is assigned to each static branch (branches with distinct addresses). When a branch is about to be executed, the counter value C, associated with that branch, is used for prediction. If C is greater than or equal to a predetermined threshold value L, the branch is predicted to be taken. Otherwise, it is predicted not to be taken. A typical value for L is 2^(N−1). The counter value C is updated whenever that branch is resolved. If the branch is taken, C is incremented by one; otherwise, it is decremented by one. If C is 2^N − 1, it remains at that value as long as the branch is taken. If C is 0, it remains at zero as long as the branch is not taken. The N-bit counter scheme corresponds to a finite state machine (FSM) with 2^N states. Figure 4 illustrates the FSM with N = 2 and L = 2. Smith [10] reported that a counter of 2 bits is usually as good or better than other strategies, and a larger counter size does not necessarily give better results.
Figure 4.
State diagram of a saturating counter.
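As a concrete illustration, the counter scheme above can be sketched in C as follows. This is a minimal sketch assuming N = 2 and L = 2^(N−1) = 2, as in Figure 4; the identifiers are ours, not those of the simulator.

```c
#include <stdbool.h>
#include <stdint.h>

#define N_BITS   2
#define CTR_MAX  ((1u << N_BITS) - 1)    /* 2^N - 1 = 3 (strongly taken)     */
#define THRESH   (1u << (N_BITS - 1))    /* L = 2^(N-1) = 2 (taken region)   */

/* Predict taken when the counter is at or above the threshold L. */
static bool predict(uint8_t c) {
    return c >= THRESH;
}

/* Update the saturating counter with the resolved branch outcome. */
static uint8_t update(uint8_t c, bool taken) {
    if (taken)
        return (c < CTR_MAX) ? (uint8_t)(c + 1) : c;  /* saturate at 2^N - 1 */
    else
        return (c > 0) ? (uint8_t)(c - 1) : c;        /* saturate at 0       */
}
```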
- Branch Target Buffer (BTB): As shown in Figure 5, the BTB [1] is a specialized cache that stores information regarding the target addresses of recently executed branch instructions. It establishes a connection between each branch instruction and its corresponding target address, thereby facilitating expedited branch resolution and execution. To access a BTB with 2^n entries, the least significant n bits of the PC are utilized. Each entry within the BTB not only contains the target address but also categorizes the type of branch, which may be conditional, unconditional, call, or return. The BTB is crucial in determining whether an instruction constitutes a branch and in identifying its type. Furthermore, it is capable of forecasting the target address associated with a specific branch instruction.
Figure 5.
The branch target buffer.
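A simplified, direct-mapped C sketch of the kind of lookup described above is given below. The field names, tagging scheme, and 2^n sizing are illustrative assumptions, not the exact MARSS-RISCV structures; the core evaluated in Section 4 actually uses a 2-way BTB.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BTB_INDEX_BITS 5                         /* 2^5 = 32 entries, as in Section 4 */
#define BTB_ENTRIES    (1u << BTB_INDEX_BITS)

typedef enum { BR_COND, BR_UNCOND, BR_CALL, BR_RETURN } branch_type_t;

typedef struct {
    bool          valid;
    uint64_t      tag;      /* remaining PC bits, to detect a mismatch */
    uint64_t      target;   /* predicted branch target address         */
    branch_type_t type;
} btb_entry_t;

static btb_entry_t btb[BTB_ENTRIES];

/* Look up the BTB using the least significant index bits of the PC. */
static btb_entry_t *btb_lookup(uint64_t pc) {
    uint64_t index = pc & (BTB_ENTRIES - 1);
    btb_entry_t *e = &btb[index];
    return (e->valid && e->tag == (pc >> BTB_INDEX_BITS)) ? e : NULL;
}
```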
- The Bimodal Branch Prediction unit operates as follows:
When a branch instruction is fetched, the corresponding entry in the Branch History Table (BHT) is accessed using the PC, as shown in Figure 6.
Figure 6.
Bimodal branch predictor.
The BHT holds two-bit saturating counters (CTRs). The CTR associated with the indexed entry is a 2-bit counter designed to monitor the likelihood of the branch at that index being taken or not taken. This mechanism offers valuable insight into branch behaviour.
The counter’s value within BHT is utilized to predict the outcome of the branch. A counter value of strongly taken or weakly taken predicts that the branch will be taken, and a value of strongly not taken or weakly not taken predicts that the branch will not be taken. When a new branch instruction is encountered, it is added to the BHT with a default counter value of 01 (weakly not taken).
The counter facilitates two primary operations [4]:
- Increment: When the counter is incremented, it increases by one, ensuring that it does not exceed the predetermined upper threshold (U). The operation can be expressed as follows:
C = min(C + 1, U)
- Decrement: When the counter is decremented, it decreases by one, ensuring it does not fall below the designated lower threshold (L). This can be articulated as:
C = max(C − 1, L)
These operations allow for precise control of the counter’s value within specified limits. Upon determining the actual outcome of a branch instruction, the Branch History Table (BHT) is updated accordingly. When the branch is taken, the corresponding counter is incremented; conversely, when the branch is not taken, the counter is decremented. Given that a two-bit counter has a range of values from 00 to 11, specific behaviors are observed at the limits of this range. If the counter reaches a value of 11 and the associated branch is taken, the counter remains at 11 and is not further incremented. Similarly, if the counter is at 00 and the branch is not taken, it remains at 00 and is not further decremented. This type of counter is referred to as a saturating counter. The predicted outcome is utilized to execute instructions speculatively.
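Combining the table lookup and the saturating update, a minimal bimodal sketch in C might look as follows. The BHT size is an illustrative assumption; the default 01 state and the MSB-based prediction rule follow the description above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BHT_INDEX_BITS 8
#define BHT_ENTRIES    (1u << BHT_INDEX_BITS)
#define WEAK_NOT_TAKEN 1u                       /* default state 01 */

static uint8_t bht[BHT_ENTRIES];                /* 2-bit saturating counters */

static void bimodal_init(void) {
    for (size_t i = 0; i < BHT_ENTRIES; i++)
        bht[i] = WEAK_NOT_TAKEN;                /* new branches start weakly not taken */
}

static uint64_t bht_index(uint64_t pc) {
    return pc & (BHT_ENTRIES - 1);              /* low PC bits select the entry */
}

/* Prediction comes from the most significant bit of the 2-bit counter. */
static bool bimodal_predict(uint64_t pc) {
    return (bht[bht_index(pc)] >> 1) & 1u;
}

/* Update on resolution: move toward 11 when taken, toward 00 when not taken. */
static void bimodal_update(uint64_t pc, bool taken) {
    uint8_t *c = &bht[bht_index(pc)];
    if (taken)  { if (*c < 3) (*c)++; }
    else        { if (*c > 0) (*c)--; }
}
```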
2.3. Adaptive Branch Predictors
These predictors are more sophisticated than bimodal branch predictors: they incorporate recent historical data into their predictions. The idea is that whether a branch is taken or not depends not only on its address but also on its history (past behavior). Such a predictor utilises two tables, the GHT (Global History Table) and the PHT (Pattern History Table), where the GHT stores branch history. There are two approaches to storing history: in the first, the GHT stores the history of the last few branches of the program (global history), regardless of whether they share the same PC; in the second, the GHT stores local history (the history of one particular branch only), indexed by the lower bits of the address (PC). Based on the type of entries in the GHT, the technique is referred to as either a local branch predictor or a global branch predictor [11,17]. The global branch predictor uses a single Branch History Register (BHR) shared by all branches, and this BHR indexes into the PHT. The local branch predictor tracks history per branch (each branch has its own history register), and this local history is then used to index into the PHT. The PHT stores 2-bit counters, as in bimodal predictors, indexed by the history found in the corresponding BHR, in some designs combined with the PC. These counters are updated based on whether the branch is taken or not.
The branch history register (BHR) [18], which is a crucial component of the global history table (GHT), functions as an n-bit shift register. Its role is to systematically record the outcomes of the most recent n branch instructions in a bitwise format. This register stores local history or global history, depending on the design. Upon the resolution of a branch outcome, the contents of the BHR are shifted right by one position, with the latest branch outcome loaded into the most significant bit (MSB).
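A one-function C sketch of this BHR update, assuming a k-bit register held in an integer (the value of k here is illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

#define HIST_BITS 8                              /* k, the history length */

/* Shift the k-bit history right by one and place the newest outcome in the MSB. */
static uint8_t bhr_update(uint8_t bhr, bool taken) {
    return (uint8_t)((bhr >> 1) | ((taken ? 1u : 0u) << (HIST_BITS - 1)));
}
```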
2.3.1. GAg Branch Predictor
This is a straightforward branch predictor that does not utilize any bits from the PC [13,19]. As shown in Figure 7, the first level of the predictor comprises a Branch History Register (BHR) present in GHT, while the second level features a table that is indexed according to the bits stored in the BHR.
Figure 7.
GAg predictor.
The Pattern History Table (PHT) is accessed so that the BHR reflects the behavior of the last k branches. By employing the contents of the BHR (k bits), the PHT, which contains 2-bit saturating counters, can be indexed. While this design offers the advantage of simplicity, it also presents notable disadvantages. The predictor relies heavily on the most recent k branches without adequately accounting for their historical behaviors. For example, if it is assumed that a branch is perpetually taken, there is no mechanism in place to accurately capture this behavior. The accuracy of the predictions is exclusively contingent upon the last k branches, which may not effectively represent the behaviors of individual branches. In extensive programs, multiple occurrences of the last k branches may display identical patterns, even though their actual behavior may differ. This situation can lead to erroneous predictions for branches that demonstrate consistent behaviors, such as those that are always taken or consistently not taken. Additionally, because the GAg predictor uses a global history, different branches may map to the same entry in the PHT, leading to aliasing. This can result in incorrect predictions when the branches exhibit different behaviors. Aliasing in branch predictors occurs when multiple branches contend for the same predictor entry due to the limited capacity of the predictor table. At smaller predictor sizes, the number of available entries fails to adequately capture the diverse set of branch addresses and their corresponding histories typically found in workload patterns. This results in distinct branches mapping to identical table indices, leading to detrimental interference characterized by the overwriting of counters or history states. Such destructive aliasing obfuscates the individual behaviors of branches, hampers the predictor’s capacity to discern stable learning patterns, and considerably undermines prediction accuracy. As the predictor table size increases, the likelihood of these collisions diminishes, subsequently mitigating aliasing effects and enhancing overall predictive performance.
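The GAg index computation itself is trivial, which is precisely why aliasing arises: branches with the same recent global history share an entry. A minimal sketch (names and sizes are illustrative):

```c
#include <stdint.h>

#define HIST_BITS 8
#define PHT_SIZE  (1u << HIST_BITS)              /* one 2-bit counter per history pattern */

/* GAg: the PHT index is the global history itself; the PC is not used,
 * so different branches with identical histories collide in the PHT. */
static uint32_t gag_index(uint32_t ghr) {
    return ghr & (PHT_SIZE - 1);
}
```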
2.3.2. GAp Branch Predictor
The GAg predictor presents a significant limitation due to its reliance on a single PHT for all branches, which often results in inaccurate predictions. This issue arises because different branches with identical branch histories access the same entry within the table, leading to confusion and interference. In contrast, the GAp predictor [13,19] enhances prediction accuracy by utilising both the branch history and specific bits from the PC to index the table. This approach allows distinct branches to access different segments of the table, thereby reducing interference and improving predictive accuracy. Consequently, GAp provides superior predictions by acquiring a more detailed understanding of the behavior of each branch.
The GAp predictor is a sophisticated two-level branch predictor that utilises a Branch History Register (BHR) present in the GHT, in conjunction with PC bits, to enhance branch prediction accuracy, as shown in Figure 8.
Figure 8.
GAp predictor.
This branch predictor operates on two distinct levels: the Branch History Register (BHR) and the PHT. The BHR, functioning globally, captures the recent outcomes of all branch instructions, storing them as taken (1) or not taken (0). Meanwhile, the PHT is address-specific, maintaining an array of 2-bit saturating counters that are indexed using a combination of the BHR’s k bits and the n least significant bits from the PC. This combination creates an (n + k)-bit composite index, addressing a PHT of 2^(n+k) entries and allowing for more nuanced predictions based on historical branch behaviors. The predictor enhances prediction accuracy by leveraging both the outcomes of past branches and the specific instruction address information. The methodologies for prediction and training mirror those found in bimodal predictors, with the key distinction being the integrated use of the branch history and PC bits during index access.
The GAp predictor enhances prediction capabilities by leveraging both the branch history register (BHR) and specific bits from the PC to identify and learn recurring patterns effectively.
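A minimal sketch of the GAp index formation described above (the values of n and k are illustrative):

```c
#include <stdint.h>

#define HIST_BITS 4                              /* k global history bits */
#define PC_BITS   4                              /* n low PC bits         */

/* GAp: concatenate n PC bits with the k-bit global history,
 * giving an (n + k)-bit index into a PHT of 2^(n+k) counters. */
static uint32_t gap_index(uint64_t pc, uint32_t ghr) {
    uint32_t pc_part   = (uint32_t)(pc & ((1u << PC_BITS) - 1));
    uint32_t hist_part = ghr & ((1u << HIST_BITS) - 1);
    return (pc_part << HIST_BITS) | hist_part;
}
```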
2.3.3. PAg Branch Predictor
PAg is a simple predictor that uses n bits of the PC, as shown in Figure 9. In this context, the first level utilizes n bits from the PC, resulting in a total of 2^n possible combinations. Each combination is linked to its respective set of program counters. Instead of using a single GHT, this method incorporates 2^n distinct GHTs, with one assigned to each set of PCs. The n bits extracted from the PC address are used to access the corresponding GHT. Once the appropriate GHT is retrieved, a k-bit pattern of its BHR is utilized to access the PHT [13,19].
Figure 9.
PAg predictor.
The PHT does not use PC bits to access the table; instead, it stores information about past branch behaviour. It consists of 2-bit saturating counters that predict whether a branch will be taken or not. In an ideal predictor, each branch should be predicted based on its own past behavior. However, GAp and GAg predictors do not adhere to this principle. Instead, they amalgamate information from multiple branches, which leads to aliasing, a phenomenon where different branches interfere with one another.
Compared to GAg, this predictor has some advantages, as the PAg predictor uses multiple branch history registers (BHRs). If a single BHR were used, branches in unrelated regions of code could interfere and reduce prediction accuracy. Assigning a BHR to each PC address neighborhood keeps the behavior of neighboring addresses mostly localized and minimizes destructive interference. When execution moves to another region, the existing BHR contents are gradually replaced. This is not a serious problem, because a program tends to stay within a given region for a significant period of time; as execution stabilizes in the new region, the BHRs are repopulated with pertinent branch patterns, allowing the predictor to adapt effectively.
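A minimal sketch of PAg indexing as described above (sizes are illustrative):

```c
#include <stdint.h>

#define PC_BITS   5                              /* n bits select one of 2^n local BHRs */
#define HIST_BITS 4                              /* k history bits per BHR              */

static uint8_t bhr_table[1u << PC_BITS];         /* first level: per-address-set BHRs   */

/* PAg: the PC chooses a BHR, and that BHR alone indexes the single shared PHT. */
static uint32_t pag_index(uint64_t pc) {
    uint8_t bhr = bhr_table[pc & ((1u << PC_BITS) - 1)];
    return bhr & ((1u << HIST_BITS) - 1);
}
```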
2.3.4. PAp Branch Predictor
The PAp predictor, as shown in Figure 10, is a branch predictor that utilizes local history and maintains an individual k-bit branch history register (BHR) for every static branch, where the BHR stores the last k outcomes (taken or not taken) of that particular branch [13,19].
Figure 10.
PAp Predictor.
It improves upon a simple two-bit predictor by using a history table per branch address instead of global history. It utilizes a two-level adaptive branch prediction technique, employing branch history and prediction tables to enhance accuracy based on previous outcomes.
The first level functions similarly to the PAg predictor, where the corresponding GHT is selected by the n bits of the PC. In the second level, k bits from the relevant Branch History Register are concatenated with m bits from the PC to access the table of saturating counters (the PHT). Each entry in the PHT is a 2-bit counter, which determines whether the branch is predicted as taken or not taken. Once the actual outcome of the branch is known, the PHT entry is updated, incremented when the branch is taken, and decremented when it is not taken. Similarly, the BHR is also updated by shifting in the latest branch outcome.
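A minimal sketch of PAp index formation under this description (the values of n, k, and m are illustrative):

```c
#include <stdint.h>

#define N_BITS 5                                 /* n: selects one of 2^n BHRs          */
#define K_BITS 4                                 /* k: history bits per BHR             */
#define M_BITS 4                                 /* m: PC bits concatenated in the index */

static uint8_t bhr_table[1u << N_BITS];

/* PAp: per-branch history (k bits) concatenated with m PC bits
 * indexes a PHT of 2^(m+k) two-bit counters. */
static uint32_t pap_index(uint64_t pc) {
    uint32_t hist = bhr_table[pc & ((1u << N_BITS) - 1)] & ((1u << K_BITS) - 1);
    uint32_t pcm  = (uint32_t)(pc & ((1u << M_BITS) - 1));
    return (pcm << K_BITS) | hist;
}
```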
The PAp predictor presents notable benefits in branch prediction, particularly when addressing branches with repetitive and locally correlated behaviors. Its primary strength is its high accuracy for branches exhibiting consistent patterns. By utilizing dedicated branch history registers for each branch instruction, the PAp effectively learns and anticipates branch-specific behaviors, avoiding the interferences commonly seen with global predictors. Such global predictors aggregate history data from multiple branches, leading to aliasing, where the behavior of one branch inadvertently impacts another, reducing prediction accuracy. By maintaining distinct history records for each individual branch, the PAp ensures that independent branches do not compromise each other’s predictions. This feature is especially beneficial in systems with multiple independent branches, where the difficulties of global predictors in differentiating correlated execution patterns are magnified. However, it’s important to note that the PAp predictor has considerable overhead; due to its architecture, it implements separate history tables for each branch instruction. Consequently, it is larger, slower, and more power-hungry than simpler branch predictors.
2.3.5. GSHARE Branch Predictor
The GSHARE predictor [17] presents a more streamlined approach. It employs a single GHT at the primary level, which captures the behavior of the last n branches via a bit vector in its corresponding BHR. The method entails extracting a k-bit pattern from the BHR and performing an XOR operation with n bits from the PC, as shown in Figure 11.
Figure 11.
GSHARE Branch Predictor.
At the secondary level, the resulting XOR output is utilized as the index to access the PHT. Notably, if k does not equal n, the shorter operand is zero-padded to the length of the longer one before the XOR operation to ensure consistency.
The underlying premise is that the computation of the XOR operation is likely to yield a unique combination of the PC bits and the BHR bits. The selection of sufficiently large values for k (branch history register size) and n (number of PC bits utilized) serves to minimize conflicts in the PHT, despite the potential for aliasing. Specifically, a larger value of k enables the capture of a more extensive branch history, thereby providing enhanced context for predictive accuracy, while an increased value of n enhances the distinctiveness of the program-counter bit patterns. Once the index is established, it is employed to access a 2-bit saturating counter within the PHT. This counter serves to monitor the probability of a branch being taken. When a branch is taken, the counter is incremented; conversely, it is decremented if the branch is not taken. Following the execution of the actual branch, the predictor is updated. The entry in the PHT is modified based on the actual outcome, and the BHR is shifted to incorporate the result of the new branch, thereby ensuring that subsequent predictions remain responsive to recent branching patterns. By employing an XOR operation instead of concatenation, the GSHARE mechanism markedly reduces the incidence of aliasing and promotes a more uniform distribution of indices across the PHT. Furthermore, it necessitates fewer entries, with a table size of 2^max(n,k) as opposed to 2^(n+k), thereby enhancing overall hardware efficiency. Unlike the PAp predictor, GSHARE obviates the requirement for per-branch storage, leading to diminished memory demands and elevated performance. GSHARE is the preferred choice in modern branch prediction methodologies, effectively balancing accuracy, efficiency, and power consumption.
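A minimal sketch of the GSHARE index computation described above (sizes are illustrative; masking both operands makes the zero-padding explicit):

```c
#include <stdint.h>

#define HIST_BITS 8                              /* k */
#define PC_BITS   8                              /* n */
#define IDX_BITS  (HIST_BITS > PC_BITS ? HIST_BITS : PC_BITS)

/* Gshare: XOR the k-bit global history with n PC bits; the shorter operand is
 * effectively zero-padded, so the PHT needs 2^max(n,k) entries. */
static uint32_t gshare_index(uint64_t pc, uint32_t ghr) {
    uint32_t pc_part   = (uint32_t)(pc & ((1u << PC_BITS) - 1));
    uint32_t hist_part = ghr & ((1u << HIST_BITS) - 1);
    return (pc_part ^ hist_part) & ((1u << IDX_BITS) - 1);
}
```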
2.3.6. GSELECT Branch Predictor
The GSELECT branch predictor [17] is a two-level dynamic branch prediction technique developed to improve prediction accuracy by utilizing both the global behavior of recently executed branches and specific data extracted from the PC. By integrating the Branch History Register (BHR) with selected bits from the PC, the GSELECT predictor establishes a more unique index into the PHT, as shown in Figure 12.
Figure 12.
GSELECT Branch Predictor.
In the traditional architecture of the GSELECT branch predictor, the k bits of the BHR present in the GHT are combined with the n least significant bits of the PC to generate an index for the PHT. The PHT itself comprises 2-bit saturating counters, which facilitate precise branching predictions. Conventional designs typically concatenate the BHR and the PC bits to form this index. In this variant, each bit of the BHR is ANDed with the corresponding bit from the selected segment of the PC, yielding a fixed-length composite index that captures the interaction between the dynamic execution history and the current state of the program. This method is advantageous for low-power and embedded systems, relying solely on simple, area-efficient, and low-power AND gates. The GSELECT predictor strikes a balance between prediction accuracy and hardware cost. Employing an AND gate to combine the PC with the BHR offers an efficient indexing mechanism that mitigates interference from unrelated branches. While it may lack the complexity of more advanced predictive models, GSELECT remains a reliable choice for processors seeking moderate prediction capabilities without incurring substantial performance penalties. Nonetheless, the GSELECT branch predictor has limitations, which can impact its effectiveness in specific scenarios.
Integrating the global branch history register (BHR) with bits from the PC via a bitwise AND operation may limit the ability to capture intricate correlations between branch behaviors and instruction addresses. This, in turn, can result in reduced prediction accuracy relative to more sophisticated prediction techniques.
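A minimal sketch of the AND-based GSELECT variant described above (a conventional GSELECT would concatenate instead; sizes and names are illustrative):

```c
#include <stdint.h>

#define HIST_BITS 8                              /* k history bits */
#define PC_BITS   8                              /* n low PC bits  */

/* Gselect (AND variant): each BHR bit is ANDed with the corresponding PC bit,
 * yielding a fixed-length composite index into the PHT. */
static uint32_t gselect_index(uint64_t pc, uint32_t ghr) {
    uint32_t pc_part   = (uint32_t)(pc & ((1u << PC_BITS) - 1));
    uint32_t hist_part = ghr & ((1u << HIST_BITS) - 1);
    return pc_part & hist_part;
}
```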
Branch prediction has seen substantial advancements, transitioning from early two-level adaptive mechanisms to sophisticated hybrid and neural-inspired models. A pivotal development in this domain was the Perceptron predictor, which supplanted saturating counters with a neural architecture adept at learning correlations across extensive global histories. Jiménez and Lin [20] demonstrated that perceptrons not only outperformed Gshare in terms of misprediction rates but also exhibited favourable scaling with history length, thereby ensuring practicality for hardware implementations.
Building upon history-based prediction, the GEometric History Length (GEHL) [21] predictor and its optimized variant, O-GEHL, utilize multiple predictor tables indexed by geometric history lengths. This approach effectively balances short and long correlations, significantly mitigating aliasing issues and enhancing prediction robustness across a variety of workloads.
The TAGE (Tagged GEometric) family [22] further refines this concept by employing multiple tagged tables, each tailored to specific history lengths, thereby capturing both short- and long-range correlations more comprehensively. The TAGE-SC-L [23] predictor represents the forefront of this research, integrating statistical correction mechanisms along with loop predictors to achieve superior accuracy on modern benchmark suites.
The Last-Level Branch Predictor (LLBP) [24] enhances the TAGE architecture by incorporating high-capacity backing storage and context-aware metadata prefetching. This integration significantly improves the accuracy of predictions for challenging branches characterized by lengthy history dependencies.
Collectively, the advancements represented by the Perceptron, GEHL/O-GEHL, TAGE/TAGE-SC-L, and LLBP frameworks chart the course of contemporary branch prediction research. Our proposed LXOR predictor builds upon this established foundation, introducing an efficient XOR-based indexing scheme that harmonizes prediction accuracy with hardware efficiency within the RISC-V architecture.
3. Proposed Methodology
LXOR (Local eXclusive-OR) Branch Predictor
The LXOR technique represents an adaptive approach to branch prediction that leverages historical data for its predictions. This methodology utilizes local history, specifically the history associated with each individual branch, which provides the advantage of accurately capturing the distinct behavior of individual branch instructions by maintaining a separate history for each static branch, instead of relying on global history. This approach enables the predictor to identify and leverage repetitive patterns that are specific to particular branches, notably those found in loops or conditional statements. In contrast to global history, which often encounters difficulties due to interference arising from shared histories among all branches, local history minimizes aliasing and enhances prediction accuracy for branches that operate independently.
As shown in Figure 13, branch history is determined by indexing the PC bits in the GHT.
Figure 13.
LXOR Branch Predictor.
A specific BHR is selected, and its bits are concatenated with the complement of the BHR bits, which is generated through an Exclusive-OR (XOR) function. The inputs to the XOR gate consist of the k BHR bits combined with k binary bits, each possessing a value of one (the masking register). The execution of the XOR operation on these inputs yields the complement of the BHR bits. Subsequently, the entire index is mapped into the PHT, where saturating counters are incremented or decremented in accordance with the actual outcomes of the branch instructions. The LXOR predictor is similar to the PAg approach in that both use a PHT size of 1, and the PHT is not indexed directly by bits from the PC.
The operation of the predictor is delineated as follows: Initially, the GHT is accessed using designated bits from the PC. A total of 2^n GHT entries, each containing a single local-history register, are generated from the selected n bits of the PC. Subsequently, the GHT entry corresponding to the chosen PC bits is selected based on the matching index. The bits of the selected BHR from the corresponding GHT entry are employed for further processing. In this predictor configuration, the size of the PHT is set to one, resulting in a single PHT. Within the PHT, a specific number of Next History Tables (NHTs) are generated, contingent upon the history bits selected from the BHR. Consequently, 2^k NHTs within a single PHT are created from the k selected bits of the BHR. The selection of the NHT is determined by the index bits present in the BHR.
To access the prediction, the two-bit saturating counter is utilized as the final step. This process involves using the latter portion of the index, specifically the complement of the BHR bits, as illustrated in Figure 13. The number of bits utilized corresponds to the number of bits present in the BHR, thereby determining the quantity of two-bit saturating counters residing within each Next History Table (NHT). These counters are responsible for predicting whether a branch will be taken. Based on the actual outcome of the branch, the counter is subsequently either incremented or decremented.
Conventional predictors such as PAg or Gshare directly use BHR bits (or their XOR with PC bits) to index the Pattern History Table (PHT). However, using raw BHR values can lead to high correlation among similar history patterns, which increases the probability of aliasing (different branches mapping to the same PHT entry). By complementing the BHR bits and concatenating them with the original BHR, the LXOR predictor effectively doubles the variability of index patterns. This widens the spread of indexing across the Next History Tables (NHTs), thereby reducing destructive interference among branches.
The LXOR predictor employs a local history-based indexing mechanism in conjunction with an XOR transformation to generate indices for the PHT that are more orthogonal and less susceptible to aliasing. This approach effectively isolates the history of each individual branch and enhances the dispersion of indexing through logical inversion via XOR. As a result, it mitigates the occurrence of destructive aliasing, thereby improving the reliability of prediction outcomes.
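To make the indexing concrete, the following is a minimal C sketch consistent with the description of Figure 13. The table sizes, the flattened NHT layout, and the function names are our illustrative assumptions rather than the exact structures in LXOR.c.

```c
#include <stdbool.h>
#include <stdint.h>

#define N_BITS 8                                 /* n: 2^n local-history entries (GHT)     */
#define K_BITS 1                                 /* k: history bits per BHR (1 in Section 4) */

static uint8_t ght[1u << N_BITS];                /* one local BHR per PC neighbourhood       */
static uint8_t pht[1u << (2 * K_BITS)];          /* 2^k NHTs x 2^k counters, flattened       */

/* LXOR index: k BHR bits concatenated with their complement, obtained by
 * XORing the BHR with a k-bit all-ones masking register. */
static uint32_t lxor_index(uint64_t pc) {
    uint32_t bhr  = ght[pc & ((1u << N_BITS) - 1)] & ((1u << K_BITS) - 1);
    uint32_t mask = (1u << K_BITS) - 1;          /* masking register of k ones   */
    uint32_t comp = bhr ^ mask;                  /* complement of the BHR bits   */
    return (bhr << K_BITS) | comp;               /* NHT select : counter select  */
}

static bool lxor_predict(uint64_t pc) {
    return (pht[lxor_index(pc)] >> 1) & 1u;      /* MSB of the 2-bit counter */
}

static void lxor_update(uint64_t pc, bool taken) {
    uint8_t *c = &pht[lxor_index(pc)];
    if (taken)  { if (*c < 3) (*c)++; }          /* saturate toward strongly taken     */
    else        { if (*c > 0) (*c)--; }          /* saturate toward strongly not taken */
    uint8_t *h = &ght[pc & ((1u << N_BITS) - 1)];
    *h = (uint8_t)(((*h >> 1) | ((taken ? 1u : 0u) << (K_BITS - 1))) & ((1u << K_BITS) - 1));
}
```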
4. Results
RISC-V refers to the fifth version of Reduced Instruction Set Computing (RISC) architecture [25]. This instruction set architecture (ISA) is free and open, designed for simplicity, efficiency, and adaptability. Its open-source characteristics permit anyone to create, alter, and produce processors using this architecture without incurring licensing fees. RISC-V supports various standard versions, including 32-bit and 64-bit options, and offers extensions for functionalities such as floating-point operations, cryptography, and more. These extensions facilitate customization to meet the specific requirements of diverse applications and industries. A key benefit of RISC-V is its modular design, which allows implementations to include only essential components. This modularity grants flexibility and scalability, making it ideal for a broad spectrum of devices, from compact embedded systems like IoT devices to extensive servers and supercomputers.
A generic block diagram of the 64-bit 5-stage RISC-V processor used in our experimental setup with the MARSS-RISCV emulator tool is shown in Figure 14.
Figure 14.
Generic block diagram of the 64-bit 5-stage RISC-V processor.
The RISC-V processor configuration used for our experimental setup, as shown in Figure 14, is as follows:
- 64-bit in-order core with a 5-stage pipeline, 1 GHz clock.
- 32-entry instruction and data TLBs.
- 32-entry 2-way branch target buffer with a simple bimodal predictor, with a 256-entry history table.
- 4-entry return address stack.
- Single-stage integer ALU with one cycle delay.
- 2-stage pipelined integer multiplier with one-cycle delay per stage.
- Single-stage integer divider with an eight-cycle delay.
- FPU ALU with a latency of 2 cycles for all instructions.
- 3-stage pipelined floating-point fused multiply-add unit with one-cycle delay per stage.
- 32 KB 8-way L1-instruction and L1-data caches with one cycle latency and LRU eviction.
- 2 MB 16-way L2-shared cache with 12-cycle latency and LRU eviction.
- 64-byte cache line size with write-back and write-allocate caches.
- 1024 MB DRAM with a base DRAM model with 75 cycles for main memory access.
In this work, all simulations were conducted on a 64-bit in-order RISC-V core with a 5-stage pipeline. We deliberately selected an in-order configuration to provide a clean and controlled environment for evaluating branch predictors, ensuring that the results accurately reflect the predictor’s behaviour without interference from out-of-order execution or other microarchitectural optimisations. While modern high-performance RISC-V implementations are typically out-of-order and superscalar, extending our study to such configurations is planned as future work.
The entire simulation was executed using the Micro-Architectural and System Simulator (MARSS-RISCV) [26]. MARSS is a sophisticated cycle-accurate simulator engineered for in-depth modeling of computer processor internals and complete systems. This tool empowers researchers and developers to dissect the instruction processing capabilities of a processor at an exceedingly granular level, simulating each execution clock cycle. MARSS consists of two core components: a functional simulator, often QEMU, which can execute actual programs and operating systems, and a timing model that provides precise simulations of the processor’s hardware behavior. This synergy allows for thorough analyses of critical aspects such as instruction fetching, decoding, execution, memory access, and the write-back stages. Furthermore, MARSS supports full-system simulation, effectively replicating the entire computing environment, including the processor, memory hierarchy, and operating systems. The MARSS-RISCV variant is specifically designed for simulating RISC-V processor architectures, enabling exploration of diverse RISC-V core designs, memory configurations, instruction sets, and pipeline architectures within a virtualized framework before physical hardware implementation. The tool significantly enhances the ability to investigate complex architectural features like out-of-order execution, branch prediction, and cache behavior. It also accommodates multi-core simulations and supports comprehensive system testing, establishing itself as an essential asset for processor development in academia and industry. The open-source and highly configurable nature of MARSS-RISCV encourages collaborative efforts to innovate and optimize RISC-V-based systems across various applications, including embedded systems, servers, and Internet of Things (IoT) devices.
- Benchmarks Used to Evaluate Performance:
Benchmarks are standardized assessments used to quantify and compare the performance characteristics of hardware, software, or entire systems. They play a critical role in evaluating the efficiency and capability of components, including CPUs, memory subsystems, storage solutions, and applications, across a range of operational conditions. Notable benchmarks utilized to gauge processor performance include CoreMark, the Standard Performance Evaluation Corporation (SPEC) suite, and Whetstone, each designed to provide insights into distinct aspects of processing capabilities.
CoreMark [27] is a well-established benchmark developed by the Embedded Microprocessor Benchmark Consortium (EEMBC) that provides a straightforward yet powerful tool for assessing processor core performance. It is particularly valuable for evaluating CPUs and microcontrollers (MCUs) within embedded systems. CoreMark incorporates a suite of key algorithms designed to measure various aspects of computational performance, including list processing for sorting and searching, matrix manipulation for critical operations, state machine logic to validate input streams, and cyclic redundancy checking (CRC) for data integrity. The benchmark’s design allows for compatibility across a broad spectrum of architectures, from 8-bit microcontrollers to 64-bit microprocessors, making it versatile for diverse embedded applications.
The SPEC CPU® 2017 benchmark [28,29,30] package represents the latest iteration of SPEC’s standardized suites for assessing compute-intensive performance metrics. This benchmark suite is engineered to rigorously evaluate a system’s processor capabilities, memory subsystem performance, and compiler efficacy. SPEC has meticulously crafted these suites to facilitate comparative analysis across a diverse array of hardware platforms, employing workloads that reflect real-world application demands. The benchmarks are available in source code form, requiring compiler commands as well as supplementary commands executed via a shell or command prompt. Additionally, the benchmark suite includes an optional metric for energy consumption assessment. The SPEC CPU® 2017 benchmark package comprises 43 benchmarks, which are systematically categorized into four distinct suites:
- The SPECspeed® 2017 Integer and SPECspeed® 2017 Floating Point suites focus on measuring the execution time for individual tasks, providing insight into the latency performance of the system.
- The SPECrate® 2017 Integer and SPECrate® 2017 Floating Point suites evaluate throughput, quantifying the number of tasks completed per unit of time, thus delivering a holistic view of the system’s performance under varying workload conditions.
This study selects eight benchmarks from the SPEC suite, specifically 500.perlbench_r, 505.mcf_r, 520.omnetpp_r, 523.xalancbmk_r, 605.mcf_s, 620.omnetpp_s, 623.xalancbmk_s, and 641.leela_s, which exhibit a higher number of branches per thousand retired instructions (PKI) [30]. Specifically 500.perlbench_r has 202.8 branches (PKI) which means about 20.3% of all executed instructions in 500.perlbench_r are branch instructions, 505.mcf_r has 226.6 branches (PKI), 520.omnetpp_r has 220.4 branches (PKI), 523.xalancbmk_r has 238.5 branches (PKI), 605.mcf_s has 242.0 branches (PKI), 620.omnetpp_s has 220.4 branches (PKI), 623.xalancbmk_s has 238.6 branches (PKI), and 641.leela_s has 154.6 branches (PKI) [30]. The rationale for this selection is that branch predictors are designed to anticipate the outcomes of branch instructions; consequently, benchmarks with a greater number of branches activate the predictor more frequently, thereby enhancing its influence on overall performance. This setup allows for an evaluation of the predictor’s accuracy, robustness, and efficiency under load. Furthermore, many real-world applications, such as artificial intelligence, compilers, and control systems, involve substantial branching. The use of high-PKI benchmarks ensures that the predictor is adequately prepared for these practical workloads.
- Factors on which processor performance is calculated:
Instruction per cycle (IPC): Instructions per cycle (IPC) serves as a crucial performance metric that quantifies the number of instructions a processor can execute within a single clock cycle. A higher IPC indicates improved performance and is influenced by a variety of factors, including pipeline efficiency, cache performance, and instruction-level parallelism. An ideal value of IPC is 1, which signifies that the processor, on average, completes one instruction per clock cycle, effectively utilizing its instruction pipeline. This scenario indicates minimal stalling, with instructions progressing smoothly and experiencing infrequent delays. Such performance is generally regarded as a robust baseline, particularly for scalar processors, which are designed to execute, at most, one instruction per cycle.
Accuracy (ACC): In branch prediction, accuracy is defined as the frequency with which a processor successfully anticipates the outcome of a branch instruction. It is quantified as the ratio of correct predictions to the total number of branch instructions, expressed as a percentage. Enhanced accuracy results in a reduction in instruction flushes, contributing to improved overall performance.
Misprediction (MIS): A misprediction occurs when the branch predictor of a processor inaccurately forecasts the outcome of a branch instruction. Due to this incorrect prediction, the central processing unit (CPU) speculatively fetches and begins executing instructions along the wrong path. Upon detecting the misprediction, these speculative instructions are subsequently discarded (flushed), and the appropriate path is executed instead. Although the erroneous instructions are not formally committed, the time expended on their execution results in a performance penalty. In the five-stage in-order pipeline used for evaluation, each branch misprediction incurs a fixed penalty of 3 cycles due to pipeline flushing and refetching from the correct target.
Instructions Flushed (IF): This term refers to the quantity of instructions removed from the processing pipeline before they are completed, typically due to events such as branch mispredictions or exceptions. The flushing of instructions leads to a wastage of processing time and diminishes overall performance, as the central processing unit (CPU) is required to discard these instructions and subsequently re-fetch the correct ones.
Memory footprint: The memory footprint of a branch predictor is a critical factor, as it has direct implications for the area occupied by the processor, power consumption, and overall operational efficiency. Predictors that are smaller in size require less silicon space, which simplifies design complexity and reduces manufacturing costs. Furthermore, these compact predictors facilitate faster access times, thereby enhancing instruction throughput. In power-sensitive environments, such as embedded and mobile processors, the reduction in memory usage contributes to lower energy consumption and diminished heat generation. Moreover, streamlined predictors are more amenable to scaling across multiple cores, making them particularly well-suited for contemporary multicore architectures. Consequently, optimizing the memory footprint of branch predictors is essential for achieving superior performance while adhering to system constraints.
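For reference, the metrics above can be summarized as follows, where the 3-cycle figure is the flush penalty of the evaluated pipeline noted earlier:

```latex
\begin{align*}
\mathrm{IPC} &= \frac{\text{retired instructions}}{\text{total cycles}}\\[2pt]
\mathrm{ACC} &= \frac{\text{correctly predicted branches}}{\text{total branch instructions}} \times 100\%\\[2pt]
\mathrm{MIS} &= 100\% - \mathrm{ACC}\\[2pt]
\text{misprediction penalty} &\approx \text{mispredicted branches} \times 3~\text{cycles}
\end{align*}
```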
Simulation Model: The branch prediction technique was integrated into the MARSS-RISCV simulator framework through modifications to both the predictor’s source code and its configuration settings. The implementation process involved the following steps:
Modification of Predictor Source Files: The adaptive predictor module was enhanced to integrate the new branch prediction algorithm. This involved updating the source file adaptive_predictor.c and its associated header file adaptive_predictor.h (see Supplementary Materials [adaptive_predictor.c, adaptive_predictor.h, LXOR.c and LXOR.h]) to accommodate the additional functionality required for the novel prediction scheme.
Configuration of the Simulation Environment: MARSS-RISCV utilises a configuration script to define both architectural and microarchitectural parameters for the simulated RISC-V processor, with a particular focus on the branch prediction unit (BPU). Recent updates to this script have enabled the integration of a new branch predictor. By modifying the parameters in the config64.cfg file (see Supplementary Materials [config64.cfg]), various branch prediction strategies can be instantiated within the simulator. The options include GAg, GAp, PAg, PAp, GSHARE, GSELECT, and the newly proposed LXOR, as detailed in Table 2.
Table 2.
Parameters to select the respective predictor.
Parameter Configuration for the Proposed Predictor: To activate the novel LXOR predictor, the following parameters were defined within the configuration script:
- Branch Predictor Unit (BPU) type: Adaptive
- Pattern History Table (PHT) size: 1 entry
- Global History Table (GHT) size: adjustable, ranging from 8 to 2048 entries
- History bits: 1 bit
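For illustration only, a pseudo-configuration fragment of the kind involved is sketched below; the key names and syntax are hypothetical placeholders, and the actual parameter names are those listed in Table 2 and defined in config64.cfg.

```
# Hypothetical illustration; actual keys and syntax follow Table 2 / config64.cfg
bpu_type          = adaptive    # adaptive two-level predictor unit
bpu_ght_size      = 2048        # GHT entries, varied from 8 to 2048
bpu_pht_size      = 1           # single PHT, as required by LXOR
bpu_history_bits  = 1           # k history bits per BHR
```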
Rebuilding the Simulator: Following the updates to the predictor code and the configuration script, the simulator was recompiled using the Makefile included in the MARSS-RISCV framework. This ensured that all changes were correctly compiled and linked, maintaining integrity in the build process.
Benchmark Execution: To assess the efficacy of the proposed technique, a comprehensive suite of benchmarks was conducted on the modified simulator. The workloads utilized included SPEC CPU2017 (64-bit) and CoreMark (64-bit). The execution of these benchmarks yielded essential performance metrics for analysis, encompassing instructions per cycle (IPC), prediction accuracy, and misprediction rate.
4.1. Calculated Memory Footprints of All Predictors
To quantify the storage cost of each predictor, the calculated memory footprints are presented in Table 3. The table reports the memory requirements for the Global History Table (GHT), Pattern History Table (PHT), and any additional structures (e.g., the Next History Table in LXOR), followed by the total storage size in bytes.
Table 3.
Calculated memory footprints of all predictors.
As shown in Table 3, predictors such as GAg, Gshare, and Gselect incur relatively small footprints of about 2 KB, reflecting their simple organization. In contrast, GAp and PAp demand substantially higher storage, with PAp reaching nearly 82 KB, owing to their reliance on multiple large PHTs indexed per branch address. The proposed LXOR predictor demonstrates a balanced design, requiring only 2096 bytes—just slightly higher than GAg and Gshare—while integrating its Next History Table within the PHT. This indicates that LXOR maintains a compact memory footprint while offering enhanced prediction accuracy compared to conventional designs.
4.2. Performance of GAg Branch Predictor Using SPEC CPU2017 and Coremark Benchmarks
Table 4 summarizes the performance metrics for the GAg branch predictor evaluated using the SPEC CPU2017 and Coremark benchmarks on a 64-bit RISC-V architecture. In this experimental setup, the sizes of the Global History Table (GHT) and the Pattern History Table (PHT) were both set to 1, while the number of history bits was systematically varied from 1 to 9 to analyze its effect on prediction accuracy, as described in Table 2. The simulator was run for each configuration, and the results reflecting the optimal performance for each benchmark are presented in the table.
Table 4.
Readings of GAg predictor using SPEC CPU2017 and Coremark benchmark for 64-bit RISC-V processor, keeping GHT = 1, PHT = 1, and varying History bits from 1 to 9.
4.3. Performance of GSHARE Branch Predictor Using SPEC CPU2017 and Coremark Benchmarks
Table 5 summarizes the performance metrics for the GSHARE branch predictor evaluated using the SPEC CPU2017 and Coremark benchmarks on a 64-bit RISC-V architecture. In this experimental setup, the sizes of the Global History Table (GHT) and the Pattern History Table (PHT) were both set to 1 and the aliasing function was set to XOR, while the number of history bits was systematically varied from 1 to 9 to analyze its effect on prediction accuracy, as described in Table 2. The simulator was run for each configuration, and the results reflecting the optimal performance for each benchmark are presented in the table.
Table 5.
Readings of GSHARE predictor using SPEC CPU2017 and Coremark benchmark for 64-bit RISC-V processor, keeping GHT = 1, PHT = 1, and varying History bits from 1 to 9.
4.4. Performance of GSELECT Branch Predictor Using SPEC CPU2017 and Coremark Benchmarks
Table 6 summarizes the performance metrics for the GSELECT branch predictor evaluated using the SPEC CPU2017 and Coremark benchmarks on a 64-bit RISC-V architecture. In this experimental setup, the sizes of the Global History Table (GHT) and the Pattern History Table (PHT) were both set to 1 and the aliasing function was set to AND, while the number of history bits was systematically varied from 1 to 9 to analyze its effect on prediction accuracy, as described in Table 2. The simulator was run for each configuration, and the results reflecting the optimal performance for each benchmark are presented in the table.
Table 6.
Readings of GSELECT predictor using SPEC CPU2017 and Coremark benchmark for 64-bit RISC-V processor, keeping GHT = 1, PHT = 1, and varying History bits from 1 to 9.
4.5. Performance of GAp Branch Predictor Using SPEC CPU2017 and Coremark Benchmarks
Table 7 summarizes the performance metrics for the GAp branch predictor evaluated using the SPEC CPU2017 and Coremark benchmarks on a 64-bit RISC-V architecture. In this experimental setup, the size of the Global History Table (GHT) is set to 1, the number of History bits is set to 1, and the Pattern History Table (PHT) is varied from 8 to 2048 to analyse its effect on prediction accuracy, as described in Table 2. The simulator was run for each configuration, and the results reflecting the optimal performance for each benchmark are presented in the table.
Table 7.
Readings of GAp predictor using SPEC CPU2017 and Coremark benchmark for 64-bit RISC-V processor, keeping History bits = 1, GHT = 1, and varying PHT from 8 to 2048.
4.6. Performance of PAg Branch Predictor Using SPEC CPU2017 and Coremark Benchmarks
Table 8 summarizes the performance metrics for the PAg branch predictor evaluated using the SPEC CPU2017 and Coremark benchmarks on a 64-bit RISC-V architecture. In this experimental setup, the size of the Global History Table (GHT) is varied from 8 to 2048, the Number of History bits is set to 1, and the Pattern History Table (PHT) is set to 1, to analyse its effect on prediction accuracy, as described in Table 2. The simulator was run for each configuration, and the results reflecting the optimal performance for each benchmark are presented in the table.
Table 8.
Readings of PAg predictor using SPEC CPU2017 and Coremark benchmark for 64-bit RISC-V processor, keeping History bits = 1, PHT = 1, and varying GHT from 8 to 2048.
4.7. Performance of PAp Branch Predictor Using SPEC CPU2017 and Coremark Benchmarks
Table 9 summarizes the performance metrics for the PAp branch predictor evaluated using the SPEC CPU2017 and Coremark benchmarks on a 64-bit RISC-V architecture. In this experimental setup, the size of the Global History Table (GHT) is set to 2048, the number of History bits is set to 1, and the Pattern History Table (PHT) is varied from 8 to 2048 to analyse its effect on prediction accuracy, as described in Table 2. The simulator was run for each configuration, and the results reflecting the optimal performance for each benchmark are presented in the table.
Table 9.
Readings of PAp predictor using SPEC CPU2017 and Coremark benchmark for 64-bit RISC-V processor, keeping History bits = 1, GHT = 2048, and varying PHT from 8 to 2048.
4.8. Performance of LXOR Branch Predictor Using SPEC CPU2017 and Coremark Benchmarks
Table 10 summarizes the performance metrics for the LXOR branch predictor evaluated using the SPEC CPU2017 and Coremark benchmarks on a 64-bit RISC-V architecture. In this experimental setup, the size of the Global History Table (GHT) is varied from 8 to 2048, the Number of History bits is set to 1, and the Pattern History Table (PHT) is set to 1, to analyse its effect on prediction accuracy, as described in Table 2. The simulator was run for each configuration, and the results reflecting the optimal performance for each benchmark are presented in the table.
Table 10.
Readings of LXOR predictor using SPEC CPU2017 and Coremark benchmark for 64-bit RISC-V processor, keeping History bits = 1, PHT = 1, and varying GHT from 8 to 2048.
5. Discussion
5.1. Performance Analysis of LXOR Branch Predictor Against GAg Branch Predictor
Branch prediction plays a crucial role in achieving optimal performance in modern superscalar processors, particularly in deeply pipelined architectures, where inaccuracies can lead to significant pipeline stalls and inefficient use of instruction fetch cycles. This study aims to evaluate and compare the performance of the conventional GAg (Global Adaptive prediction with a single global Pattern History Table) branch predictor against the newly proposed LXOR branch predictor within a 64-bit RISC-V architecture, as shown in Figure 15.
Figure 15.
Analysis of LXOR branch predictor against GAg Branch predictor.
The evaluation methodology uses a combination of synthetic benchmarks and real-world applications. Performance metrics include Instructions Per Cycle (IPC), prediction accuracy, misprediction rate, instructions flushed (IF), and memory footprint. The benchmarks consist of Coremark (64-bit) and key integer and floating-point workloads from the SPEC CPU2017 suite (namely perlbench_r, mcf_r, omnetpp_r, xalancbmk_r, mcf_s, omnetpp_s, xalancbmk_s, and leela_s).
The graph in Figure 15 was constructed using the data presented in Table 4 and Table 10. Throughout the examined benchmarks, the LXOR predictor consistently exhibited a lower frequency of instruction flushes when compared to the GAg predictor. For instance, in the perlbench_r benchmark, the GAg predictor recorded 11.67% instruction flushes, whereas the LXOR predictor achieved a reduction to 10.83%. Similarly, in the mcf_r, omnetpp_r, xalancbmk_r, mcf_s, and omnetpp_s benchmarks, flush counts declined from 11.21% (GAg) to 10.11% (LXOR), 9.80% to 9.50%, 10.85% to 9.87%, 11.56% to 10.50%, and 9.81% to 9.49%, respectively. Notably, the Coremark results further underscore the practical efficiency of the LXOR predictor in embedded scenarios, where the flush count diminished from 7.66% (GAg) to 5.83% (LXOR). This decrease not only indicates fewer wasted cycles but also reflects a more stable execution pipeline. Such a reduction in instruction flushes has a direct impact on the processor’s ability to sustain high instruction throughput and is attributed to an improvement in prediction accuracy.
Prediction accuracy serves as a critical performance metric for assessing the effectiveness of branch predictors. It is directly linked to the frequency of pipeline flushes and indirectly correlates with IPC. The LXOR predictor consistently surpasses the GAg predictor in terms of prediction accuracy across nearly all benchmarks. For example, in the perlbench_r benchmark, the LXOR predictor achieved an accuracy of 66.62%, compared to 63.64% for the GAg predictor. In the mcf_r benchmark, accuracy improved from 63.92% (GAg) to 67.74% (LXOR). In the xalancbmk_s, mcf_s, and omnetpp_s benchmarks, accuracy improved from 72.67% (GAg) to 73.83% (LXOR), from 65.46% to 69.00%, and from 71.34% to 72.65%, respectively. In the Coremark benchmark, which is indicative of performance in real-time embedded systems, the LXOR achieved a significantly higher accuracy of 83.92%, compared to 76.57% for the GAg predictor. This enhancement in prediction accuracy underscores the architectural advantages of the LXOR predictor, particularly its utilization of local history through the XOR transformation of Branch History Register (BHR) bits. By employing XOR gates to transform the branch history and indexing the PHT more effectively, the LXOR predictor minimizes aliasing and enhances prediction granularity (a simplified sketch of this indexing idea follows).
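To make the indexing idea concrete, the following is a minimal sketch of one plausible reading of the scheme described above: a per-branch local history register is complemented and XORed with low-order PC bits to select a 2-bit saturating counter in the PHT. The table sizes, bit widths, placement of the complement, and function names are assumptions made for illustration and do not reproduce the exact LXOR microarchitecture.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch only: sizes and the exact transformation are assumptions. */

#define LHT_ENTRIES 2048u   /* per-branch local history registers       */
#define PHT_ENTRIES 1024u   /* 2-bit saturating counters (values 0..3)  */

static uint16_t lht[LHT_ENTRIES];   /* local history, one register per branch   */
static uint8_t  pht[PHT_ENTRIES];   /* prediction counters, initially not-taken */

static uint32_t lxor_index(uint64_t pc)
{
    uint16_t hist = lht[(pc >> 2) % LHT_ENTRIES];   /* select local history   */
    uint32_t comp = hist ^ 0xFFFFu;                 /* complement the history */
    /* XOR the complemented history with low-order PC bits so that branches
     * sharing the same history pattern map to different PHT counters,
     * which is what reduces destructive aliasing. */
    return (comp ^ (uint32_t)(pc >> 2)) % PHT_ENTRIES;
}

bool lxor_predict(uint64_t pc)
{
    return pht[lxor_index(pc)] >= 2;    /* predict taken if counter is 2 or 3 */
}

void lxor_update(uint64_t pc, bool taken)
{
    uint32_t idx = lxor_index(pc);
    if (taken  && pht[idx] < 3) pht[idx]++;   /* saturate upward   */
    if (!taken && pht[idx] > 0) pht[idx]--;   /* saturate downward */

    uint32_t l = (uint32_t)((pc >> 2) % LHT_ENTRIES);
    lht[l] = (uint16_t)((lht[l] << 1) | (taken ? 1u : 0u));  /* shift in outcome */
}
```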
The misprediction rate is inversely related to prediction accuracy and significantly impacts performance-critical workloads characterized by high control flow divergence. The LXOR predictor consistently demonstrates a lower misprediction rate across various benchmarks. In the xalancbmk_r benchmark, the LXOR predictor exhibits a misprediction rate of 30.68%, in contrast to GAg’s rate of 34.43%. For the omnetpp_s, xalancbmk_s, and mcf_s benchmarks, the LXOR misprediction rates decrease to 27.35%, 26.17%, and 31%, respectively, compared to GAg’s rates of 28.66%, 27.33%, and 34.54%. In the Coremark benchmark, LXOR achieves a substantial reduction in mispredictions from 23.43% to 16.08%, an absolute improvement of 7.35 percentage points. These improvements directly contribute to fewer pipeline stalls and enable the processor to maintain higher throughput levels. The effectiveness of the LXOR predictor in minimizing mispredictions is primarily attributed to its utilization of fine-grained local histories paired with adaptive XOR indexing.
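The misprediction rates quoted here are simply the complements of the corresponding accuracies; for Coremark, for instance:

\[
100\% - 76.57\% = 23.43\% \;(\text{GAg}), \qquad
100\% - 83.92\% = 16.08\% \;(\text{LXOR}), \qquad
23.43\% - 16.08\% = 7.35 \text{ percentage points}.
\]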
Instructions Per Cycle (IPC) serves as a macro-level performance indicator, encapsulating the overall efficiency of instruction execution. This metric is directly influenced by the accuracy of branch prediction and the frequency of instruction flushes. In the xalancbmk_s benchmark, the LXOR configuration achieves an IPC of 0.62, surpassing the GAg configuration’s IPC of 0.61, a modest but consistent improvement. In the omnetpp_s benchmark, LXOR similarly records an IPC of 0.62, slightly exceeding GAg’s IPC of 0.61. The most pronounced enhancement is observed in the Coremark benchmark, where the IPC improved from 0.77 with GAg to 0.79 with LXOR. This illustrates superior instruction utilization and improved pipeline flow. While the IPC enhancements may be modest in certain instances, they are consistent with the observed improvements in prediction accuracy and reductions in instruction flushes. Furthermore, the stability and consistency of LXOR across a diverse range of workloads underscore its scalability and suitability for both general-purpose and embedded processors.
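Expressed as a relative gain, the Coremark IPC improvement corresponds to roughly

\[
\frac{0.79 - 0.77}{0.77} \times 100\% \approx 2.6\%.
\]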
The memory footprint represents a critical consideration in the design of hardware predictors, particularly within embedded and low-power systems. Although the LXOR method employs a larger GHT of 2048 bytes, compared to only 16 bytes utilized by GAg, it effectively minimizes the size of the PHT through a more efficient indexing technique. Consequently, the overall memory footprint remains comparable: 2096 bytes for LXOR in contrast to 2080 bytes for GAg, reflecting a marginal increase of merely 0.76%. This trade-off is particularly advantageous, given the substantial improvements in prediction accuracy and IPC. Moreover, the additional hardware requirements for implementing XOR logic are negligible in contemporary silicon designs and are well justified by the resulting improvements in execution throughput and prediction reliability. Furthermore, LXOR successfully circumvents the significant memory overhead associated with more intricate predictors, such as GAp or PAp, which tend to exhibit poor scalability when employed with extensive predictor tables.
Hence, the proposed LXOR branch predictor demonstrates a marked improvement over the traditional GAg predictor when evaluated against several critical benchmarks and performance metrics. With only a minimal increase in hardware cost, it achieves significantly higher accuracy, reduced misprediction rates, and enhanced IPC across both synthetic (Coremark) and real-world (SPEC CPU2017) workloads. The LXOR design capitalizes on the advantages of utilizing local history patterns combined with XOR-based indexing to minimize aliasing and refine the modeling of branch behavior. These advancements position LXOR as an attractive option for future RISC-V processors, particularly those aimed at optimizing performance per watt in mobile, Internet of Things (IoT), and edge computing environments.
5.2. Performance Analysis of LXOR Branch Predictor Against GSHARE Branch Predictor
The LXOR predictor consistently demonstrates marginally superior IPC across nearly all benchmarks when compared to the GSHARE predictor, as shown in Figure 16.
Figure 16.
Analysis of LXOR branch predictor against GSHARE Branch predictor.
The graph in Figure 16 was constructed using the data presented in Table 5 and Table 10. In the Coremark benchmark, LXOR attains an IPC of 0.79, slightly surpassing Gshare’s IPC of 0.78. The distinction in performance becomes increasingly evident across several SPEC benchmarks. In the mcf_s benchmark, for example, the IPC rises from 0.57 with GSHARE to 0.60 with LXOR. Similarly, in leela_s, LXOR achieves an IPC of 0.83, in contrast to Gshare’s 0.82. Notably, both omnetpp_s and xalancbmk_s benchmarks reveal an IPC of 0.62 for LXOR, while GSHARE maintains an IPC of 0.61. These improvements, while modest in numerical value, indicate enhanced pipeline utilization and a reduction in stalling, attributed to the more effective branch resolution offered by the LXOR scheme.
The proposed LXOR predictor exhibits consistently superior prediction accuracy across all tested benchmarks. In the Coremark benchmark, LXOR achieves an impressive accuracy of 83.92%, while GSHARE achieves only 81.48%, representing a significant improvement of over 3%. Similarly, in the xalancbmk_r benchmark, the accuracy improves from 65.76% with Gshare to 69.32% with LXOR, and in perlbench_r, the accuracy increases from 64.30% to 66.62%. The most significant enhancements are observed in the mcf_r and mcf_s benchmarks, where LXOR’s accuracy rises to 67.74% and 69.00%, respectively, compared to Gshare’s 65.44% and 65.09%. These improvements underscore LXOR’s capability in capturing local branch behavior through its innovative combination of local history tracking and XOR-based pattern generalization.
Comparatively, the LXOR approach demonstrates a reduction in mispredictions relative to GSHARE across most benchmarks. In the case of Coremark, the misprediction rate under LXOR is recorded at 16.08%, in contrast to 18.52% with GSHARE. Similarly, in the omnetpp_r benchmark, LXOR reduces mispredictions to 27.4%, compared to Gshare’s rate of 28.14%. The differences are small in the xalancbmk_s and leela_s benchmarks, where LXOR records misprediction rates of 26.17% and 25.47%, respectively, against Gshare’s figures of 26.00% and 25.62%, essentially matching Gshare in the former and edging it out in the latter. Although these differences may appear incremental in certain benchmarks, they are of significant importance for high-performance systems, as even marginal reductions in mispredictions can dramatically influence overall execution time and power consumption.
The LXOR predictor exhibits reduced instruction flush counts across the majority of benchmarks, thereby contributing to enhanced IPC and diminished performance penalties. For example, in the Coremark benchmark, the instruction flush count is observed to decline from 6.59% (using GSHARE) to 5.83% (using LXOR). Similarly, in the mcf_r benchmark, the count decreases from 11.25% to 10.11%. Additional reductions are noted in xalancbmk_r, where the count falls from 10.99% to 9.87%, and in mcf_s, where it reduces from 11.66% to 10.50%. These reductions are consistent with the increased accuracy of LXOR and underscore its effectiveness in minimizing both the frequency and impact of control hazards within the pipeline.
A critical consideration in embedded systems is the memory overhead associated with the predictor. Both the GSHARE and LXOR predictors are engineered with efficiency as a primary objective. The GSHARE predictor utilizes a total memory of 2080 bytes, which consists of a 16-byte GHT and a 2064-byte PHT. Conversely, the LXOR predictor employs a slightly higher memory capacity of 2096 bytes. This predictor incorporates a 2048-byte GHT and a 48-byte PHT, which includes the integrated Next History Table (NHT). Despite the marginal increase in memory usage of approximately 0.76%, the significant improvements in prediction accuracy, IPC, and the reduction in flushes substantiate this trade-off, particularly in workloads that are sensitive to performance. Hence, the LXOR predictor demonstrates superior performance compared to GSHARE across all critical metrics, including IPC, accuracy, misprediction rate, and instruction flush count, while maintaining a nearly identical memory footprint. The primary advantage of the LXOR predictor is its adaptive utilization of local history combined with effective generalization through XOR-based indexing. This approach enables it to manage diverse and irregular branch patterns more efficiently than GSHARE, which depends solely on global history. The enhancements are particularly evident in complex benchmarks such as xalancbmk_r and mcf_r, which are characterized by their significant control-flow dependencies. Moreover, the commendable performance of LXOR in Coremark, a benchmark representative of embedded workloads, highlights its practical relevance for Internet of Things (IoT) devices, edge computing systems, and real-time embedded controllers.
5.3. Performance Analysis of LXOR Branch Predictor Against GSELECT Branch Predictor
A comparative analysis of two predictors, namely the LXOR and GSELECT, reveals that the LXOR Predictor consistently demonstrates superior performance across all benchmarks analysed, as shown in Figure 17.
Figure 17.
Analysis of LXOR branch predictor against GSELECT branch predictor.
The graph in Figure 17 was constructed using the data presented in Table 6 and Table 10. In the Coremark benchmark, the LXOR model flushed only 5.83% of instructions, in contrast to GSELECT, which flushed 7.54% of instructions, yielding a significant relative reduction of 22.7%. In the SPEC CPU2017 workloads, specifically omnetpp_s and xalancbmk_s, LXOR achieved flushed-instruction rates of 9.49% and 9.25%, respectively, compared with GSELECT’s 11.05% and 10.23%, reflecting a consistent reduction in the range of 10% to 15%. Notable improvements were also observed in the mcf_r, omnetpp_r, xalancbmk_r, and mcf_s benchmarks, where LXOR reduced the proportion of instructions flushed to 10.11%, 9.5%, 9.87%, and 10.5%, respectively, as compared to GSELECT’s 11.29%, 11.04%, 10.9%, and 11.3%. Although these differences appear marginal, they indicate a stable performance across the various benchmarks. This reduction in flushed instructions suggests fewer pipeline stalls and an enhancement in processor throughput, which is particularly advantageous for workloads characterized by frequent branching patterns.
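The 22.7% figure quoted for Coremark is the relative reduction in the flushed-instruction rate:

\[
\frac{7.54\% - 5.83\%}{7.54\%} \times 100\% \approx 22.7\%.
\]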
The LXOR approach shows better prediction accuracy compared to GSELECT across all tested benchmarks: In the Coremark benchmark, LXOR achieved an impressive accuracy of 83.92%, well above GSELECT’s 76.92%. For SPEC workloads like xalancbmk_s and omnetpp_s, LXOR recorded accuracies of 73.83% and 72.65%, respectively, exceeding GSELECT’s 70.47% and 67.76%. Additionally, in challenging integer benchmarks such as mcf_r and perlbench_r, LXOR demonstrated improved prediction consistency, with accuracies of 67.74% and 66.62%, compared to GSELECT’s 64.05% and 63.41%. This steady increase in prediction accuracy reduces pipeline disruptions and is linked to better IPC. The utilization of the LXOR predictor, which incorporates local history and XOR-complement indexing, demonstrates a significant advantage in minimizing mispredictions. In the Coremark benchmark, LXOR achieved a misprediction rate of merely 16.08%, markedly lower than GSELECT, which recorded a misprediction rate of 23.08%. Within the SPEC benchmarks, LXOR consistently maintained a lower misprediction rate across various tests. For example, in the omnetpp_s and xalancbmk_s benchmarks, LXOR displayed misprediction rates of 27.35% and 26.17%, respectively, in contrast to GSELECT’s rates of 32.24% and 29.53%. Furthermore, in integer-intensive workloads, such as mcf_r, LXOR produced a misprediction rate of 32.26%, demonstrating an improvement over GSELECT’s rate of 35.95%. A reduced misprediction rate directly correlates with fewer pipeline flushes and enhanced instruction-level parallelism, thereby contributing to improved overall CPU performance.
The LXOR predictor demonstrates consistent improvements in IPC. LXOR achieves an IPC of 0.79 in the Coremark benchmark, surpassing GSELECT’s 0.77. In the omnetpp_s, xalancbmk_s, and leela_s benchmarks, LXOR attains IPC values of 0.62, 0.62, and 0.83, respectively, all of which exceed the corresponding GSELECT values of 0.60, 0.60, and 0.81. Notably, in a compute-intensive benchmark such as omnetpp_r, LXOR produces an IPC of 0.61, outperforming GSELECT, which registers an IPC of 0.60.
The GSELECT and LXOR predictors exhibit nearly comparable memory requirements, with LXOR demonstrating a marginally higher efficiency. Specifically, GSELECT allocates a total of 2080 bytes, comprising 16 bytes for the Global History Table (GHT) and 2064 bytes for the PHT. In contrast, LXOR employs a larger GHT of 2048 bytes but only requires 48 bytes for the PHT, culminating in a total footprint of 2096 bytes. Moreover, LXOR incorporates a compact Next History Table (NHT) directly within the PHT, thereby eliminating the necessity for extensive auxiliary structures. The minimal difference in memory consumption, approximately 0.76%, is compensated by LXOR’s considerable enhancements in prediction accuracy and IPC. Consequently, LXOR achieves a superior performance-to-memory ratio, which is particularly advantageous in environments where memory resources are constrained.
The proposed LXOR branch predictor presents a viable alternative to conventional global-history-based predictors, such as GSELECT. Its effectiveness is demonstrated through consistent enhancements in IPC, prediction accuracy, and pipeline efficiency across both embedded and general-purpose benchmark suites. Considering its minimal memory overhead alongside high performance, LXOR is particularly well-suited for implementation in contemporary RISC-V processors. This is especially relevant for systems aimed at performance-sensitive and power-constrained applications, including mobile system-on-chips (SoCs), embedded control systems, and real-time applications.
5.4. Performance Analysis of LXOR Branch Predictor Against GAp Branch Predictor
The benchmark suite analysis reveals that both GAp and LXOR exhibit competitive IPC values; however, LXOR demonstrates a slight advantage over GAp in memory-intensive or control-intensive workloads, such as mcf_s, xalancbmk_s, and omnetpp_s, as shown in Figure 18.
Figure 18.
Analysis of LXOR branch predictor against GAp branch predictor.
The graph in Figure 18 was constructed using the data presented in Table 7 and Table 10. For example, in the mcf_s workload, LXOR achieves an IPC of 0.60, compared to GAp’s 0.56, indicating enhanced throughput for LXOR in memory-intensive scenarios. Similarly, in the xalancbmk_s assessment, LXOR records an IPC of 0.62, surpassing GAp’s IPC of 0.61. It is important to note that in the Coremark benchmark, LXOR registers a minor decrease with an IPC of 0.79, while GAp attains 0.80. This reduction is minimal and falls within acceptable margins, thereby affirming LXOR’s competitiveness even in simpler workload contexts.
LXOR consistently demonstrates improved or comparable accuracy when evaluated against GAp. In the Coremark benchmark, GAp exhibits a marginally higher accuracy of 86.63% compared to LXOR’s 83.92%. However, in more intricate SPEC workloads, LXOR matches or slightly surpasses GAp in most cases. For instance, in the xalancbmk_r workload, LXOR achieves 69.32%, surpassing GAp’s 69.01%. In perlbench_r, LXOR records 66.62% while GAp tallies 66.01%. For mcf_r, LXOR registers 67.74%, compared to GAp’s 67.49%. In mcf_s, LXOR reaches 69%, whereas GAp records 66.52%. Lastly, in leela_s, GAp attains 76.91%, while LXOR achieves a comparable 74.53%. The emphasis on local history within LXOR facilitates a more nuanced modeling of branch behavior, particularly advantageous in multi-path execution environments.
In the mcf_s benchmark, LXOR exhibits a lower misprediction rate of 31%, compared to GAp’s 33.48%. The rates are essentially comparable in the xalancbmk_s benchmark, where LXOR records 26.17% against GAp’s 26.12%, and in the omnetpp_s benchmark, where LXOR stands at 27.35% versus GAp’s 27.40%. There is a marginal increase in mispredictions for LXOR in the Coremark benchmark, registering at 16.08% as opposed to GAp’s 13.37%. In summary, LXOR demonstrates reduced volatility in misprediction rates across diverse workloads, indicating a greater degree of generalizability.
In three major benchmarks (perlbench_r, xalancbmk_r, and mcf_s), the LXOR algorithm produces fewer flushed instructions than the GAp algorithm. This observation serves as a compelling indication of LXOR’s enhanced branch prediction capabilities within complex, data-intensive codebases. In benchmarks such as mcf_r, omnetpp_r, xalancbmk_s, and omnetpp_s, the disparity between GAp and LXOR is minimal, typically ranging from 0.05 to 0.15 percentage points of instructions flushed. This difference lies within the margins of statistical noise, suggesting that LXOR exhibits nearly equivalent efficiency in those contexts. Notably, despite its reduced memory footprint, LXOR achieves instruction flush performance that is comparable to or better than that of GAp. In branch-heavy workloads, including mcf_s and xalancbmk_r, which closely emulate real-world software behavior, LXOR effectively diminishes the frequency of instruction flushes. This reduction leads to improved pipeline utilization and a decrease in operational stalls.
The GAp predictor features a substantial PHT of 49,152 bytes and a GHT of 16 bytes, resulting in an overall memory footprint of 49,168 bytes. In contrast, the LXOR predictor employs a smaller Global History Table of 2048 bytes and a compact PHT of 48 bytes, alongside a nested history structure within the PHT, thereby culminating in a total footprint of merely 2096 bytes. The LXOR predictor achieves an impressive approximate 96% reduction in memory footprint, making it particularly advantageous for deployment in areas with constraints on area and power. This characteristic positions the LXOR predictor as an optimal solution for environments such as the Internet of Things (IoT), edge computing, and mobile platforms. Importantly, the compact design of the LXOR predictor does not compromise its accuracy or IPC, establishing it as an efficient alternative to traditional predictors such as GAp.
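The quoted reduction follows directly from the two totals:

\[
\left(1 - \frac{2096}{49{,}168}\right) \times 100\% \approx 95.7\%,
\]

i.e., approximately 96%.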
The proposed LXOR branch predictor demonstrates a considerable advantage over the traditional GAp predictor across various evaluation metrics. Although both predictors exhibit comparable performance in straightforward benchmarks, LXOR consistently surpasses GAp in complex SPEC workloads concerning IPC, misprediction rate, and overall prediction accuracy. Notably, LXOR achieves these enhancements while maintaining a significantly smaller memory footprint, utilizing approximately 2 KB compared to GAp’s 49 KB. This characteristic makes LXOR particularly suitable for low-power and resource-constrained systems. Furthermore, LXOR minimizes instruction flushes in several critical benchmarks, which reflects improved control-flow prediction and reduced pipeline disruptions. In summary, LXOR presents a balanced, memory-efficient, and performance-effective solution for next-generation RISC-V processors.
5.5. Performance Analysis of LXOR Branch Predictor Against PAg Branch Predictor
In this evaluation, both predictors demonstrate nearly identical IPC values across the majority of benchmarks, with the LXOR predictor exhibiting a slight advantage over the PAg predictor in several cases. The graph in Figure 19 was constructed using the data presented in Table 8 and Table 10. For instance, in the mcf_s benchmark, LXOR achieves an IPC of 0.60, which is significantly higher than PAg’s IPC of 0.47, thus indicating enhanced pipeline efficiency as shown in Figure 19.
Figure 19.
Analysis of LXOR branch predictor against PAg branch predictor.
Likewise, LXOR matches the performance of PAg in the leela_s, perlbench_r, xalancbmk_s, and Coremark benchmarks, achieving IPC values of 0.83, 0.51, 0.62, and 0.79, respectively. Conversely, in the xalancbmk_r benchmark, the PAg predictor demonstrates slightly superior performance, with an IPC of 0.48 compared to LXOR’s 0.45. Nevertheless, these differences are small and are outweighed by LXOR’s efficiency across the other performance metrics.
Across all benchmarks, LXOR consistently demonstrates prediction accuracies that are comparable to or slightly superior to those of PAg. For example, LXOR achieves a peak accuracy of 83.92% on Coremark, which is marginally lower than PAg’s 84.53%, reflecting a difference of merely 0.61% and indicating comparable efficacy between the predictors. In the context of SPEC workloads, LXOR marginally outperforms PAg in xalancbmk_s (73.83% versus 73.79%), xalancbmk_r (69.32% versus 68.72%), omnetpp_s (72.65% versus 72.59%), perlbench_r (66.62% versus 66.51%), and omnetpp_r (72.6% versus 72.58%), thereby illustrating consistent performance across varying workload behaviors. Notably, in mcf_r, PAg exhibits a slight advantage with an accuracy of 68.25%, surpassing LXOR’s 67.74%. However, this advantage is offset by LXOR’s superior IPC and a reduced rate of instruction flushes. When evaluating mispredictions, which represent the inverse of accuracy, LXOR generally records fewer mispredictions or performs on par with PAg. For instance, in perlbench_r, omnetpp_r, xalancbmk_r, mcf_s, and omnetpp_s, LXOR has exhibited a lower misprediction rate. In the case of leela_s, LXOR presents a misprediction rate of 25.47%, which is slightly higher than PAg’s 24.08%. Nonetheless, both predictors yield the same IPC of 0.83.
As detailed in the memory footprint table, LXOR utilizes only 2096 bytes, whereas the PAg predictor requires 1048 bytes. Although PAg seemingly consumes less memory, this discrepancy can be attributed to its simpler architecture and restricted flexibility. The justification for LXOR’s memory usage lies in the incorporation of additional hardware logic, specifically the XOR-based transformation and localized history tracking. This design facilitates higher IPC and improved accuracy without necessitating extensive predictor tables as required by GAp or PAp. Moreover, LXOR’s memory requirements are significantly lower than those of GAp (49,168 bytes) and PAp (81,920 bytes), further underscoring its exceptional scalability.
5.6. Performance Analysis of LXOR Branch Predictor Against PAp Branch Predictor
The prediction accuracy of the PAp predictor is generally superior across most benchmarks, achieving a peak accuracy of 87.03% on the Coremark benchmark, with a corresponding misprediction rate of 12.97%. In contrast, the proposed LXOR predictor yields a slightly lower accuracy of 83.92% on Coremark, resulting in a misprediction rate of 16.08% as shown in Figure 20.
Figure 20.
Analysis of LXOR branch predictor against PAp branch predictor.
Nevertheless, the LXOR predictor demonstrates competitive performance in several SPEC benchmarks; for instance, in mcf_r and omnetpp_s, its accuracy either closely aligns with or slightly surpasses that of the PAp predictor. Although LXOR presents marginally elevated misprediction rates overall, the differences are within acceptable thresholds, particularly when considering its significantly reduced hardware footprint.
Regarding instruction flushes resulting from branch mispredictions, the PAp predictor generally exhibits slightly superior performance, particularly in benchmarks such as CoreMark and leela_s, where it experiences flush rates of 4.78% and 5.34% of instructions, respectively. In contrast, the LXOR predictor incurs flush rates of 5.83% and 6.01% of instructions for the same benchmarks. Despite this slight increase, the LXOR predictor maintains comparable flush rates across the majority of SPEC CPU2017 workloads, with only small differences. This observation suggests that, while the PAp predictor demonstrates a slight advantage in reducing control hazards, the LXOR predictor presents a favorable trade-off when accounting for its significantly lower hardware cost.
The performance of the predictors demonstrates benchmark-dependent behavior in terms of IPC. In the case of leela_s, both predictors yield an identical IPC of 0.83. For the CoreMark benchmark, PAp achieves a marginally superior IPC of 0.80 compared to LXOR’s 0.79, which corresponds with its enhanced accuracy and reduced flush count. Regarding the omnetpp_r benchmark, both predictors exhibit the same IPC value of 0.61. Notably, in the xalancbmk_r, mcf_s, and omnetpp_s benchmarks, LXOR achieves slightly higher IPC values of 0.45, 0.60, and 0.62, in contrast to PAp’s 0.43, 0.56, and 0.61.
The primary advantage of the proposed LXOR predictor is its remarkably low memory footprint, utilizing only 2096 bytes, in stark contrast to the 81,920 bytes required by the PAp predictor. Although the PAp predictor offers slightly improved prediction accuracy and minimizes instruction flushes in certain benchmarks, these enhancements are marginal and come at the cost of a memory footprint roughly 39 times larger. Despite its compact design, the LXOR predictor achieves comparable IPC and accuracy, with only a minor increase in misprediction rates. For example, LXOR sustains an IPC of 0.79 on the CoreMark benchmark, compared to 0.80 for the PAp predictor, while demonstrating similar accuracy levels in complex SPEC workloads such as mcf_r and omnetpp_s. This equilibrium of efficiency and performance positions LXOR as an ideal solution for resource-constrained environments, including embedded and low-power RISC-V systems, where it is essential to minimize silicon area and energy consumption without significantly compromising execution performance.
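In relative terms, LXOR’s footprint amounts to

\[
\frac{2096}{81{,}920} \times 100\% \approx 2.6\%
\]

of PAp’s, i.e., a reduction of more than 97%.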
The findings reveal that, although the PAp predictor achieves slightly higher prediction accuracy and IPC in certain benchmarks, these advantages are accompanied by a significantly larger memory footprint, exceeding 81 KB, which may prove impractical for memory-constrained systems. Conversely, the LXOR predictor demonstrates comparable IPC and prediction accuracy across a variety of complex workloads, while maintaining a substantially smaller hardware footprint of only about 2 KB, a reduction of more than 97% in memory usage relative to the PAp predictor. Additionally, the LXOR predictor exhibits strong performance in benchmarks characterized by high control flow complexity, such as omnetpp and xalancbmk, and matches or surpasses the PAp predictor in IPC in select instances, despite a marginally higher misprediction rate. Its design, which utilizes XOR-based indexing of local history, facilitates effective tracking of branch behavior with minimal aliasing and reduced lookup complexity. The LXOR branch predictor therefore presents a compelling trade-off between prediction performance and hardware efficiency. Its lightweight architecture renders it particularly advantageous for embedded systems, real-time applications, and RISC-V-based processors, where area, power, and cost are critical design considerations.
To synthesise the discourse on various branch prediction techniques, Table 11 presents a comparative analysis of several established predictors, focusing on their storage overhead and qualitative performance metrics. This overview elucidates the balance between simplicity, memory efficiency, and prediction precision, offering a framework for assessing the proposed LXOR predictor in relation to traditional architectures.
Table 11.
Comparative Summary of Branch Prediction Techniques.
The analysis presented in Table 11 indicates that traditional predictors excel in areas such as simplicity, correlation effectiveness, or mitigation of aliasing issues; however, they tend to be plagued by heightened storage demands and constrained scalability. Conversely, the proposed LXOR predictor strikes an advantageous equilibrium, providing competitive accuracy while maintaining moderate storage utilization. This positions the LXOR predictor as a viable and memory-efficient alternative for both embedded systems and general-purpose RISC-V processors.
6. Conclusions
The LXOR branch predictor offers a robust, balanced trade-off between compactness and accuracy and serves as an alternative to conventional dynamic branch prediction methodologies, including GAg, GAp, PAg, PAp, Gshare, and Gselect. Comprehensive simulations conducted on the MARSS-RISCV platform, utilising two distinct benchmark suites, Coremark (64-bit) and SPEC CPU2017, indicate that the LXOR predictor consistently exhibits competitive or superior performance across several critical architectural metrics. These include Instructions Per Cycle, prediction accuracy, misprediction rate, and instruction flush percentage, all while preserving a remarkably compact memory footprint.
Unlike traditional predictors that rely significantly on large Pattern History Tables (PHTs) or extensive tracking of global and local histories, the LXOR mechanism introduces a novel XOR-based indexing method combined with complemented local history. This innovative approach facilitates efficient prediction while minimizing hardware overhead. The LXOR predictor achieves a prediction accuracy of up to 83.92%, maintains instruction flush rates as low as 5.83%, and supports an IPC rate between 0.79 and 0.83. Notably, it accomplishes these results with an approximate memory requirement of only 2 KB, which is a fraction of the memory utilized by more complex predictors such as PAp and GAp, which may exceed 49 KB.
In various SPEC workloads, LXOR has shown performance that either matches or surpasses that of traditional predictors, especially in environments with heavy control flow and mixed instructions. Notably, in benchmarks such as omnetpp_r, leela_s, and xalancbmk_s, LXOR achieved equal or higher IPC while also resulting in fewer instruction flushes. This highlights its robustness amid significant branch variability. Even when some predictors, like PAp, achieved slightly higher raw accuracy, LXOR demonstrated a considerably better performance-to-memory ratio. This makes LXOR a practical option for real-world processor design.
The LXOR predictor, characterized by its minimal hardware complexity and reliable predictive stability, is exceptionally well-suited for embedded systems, low-power Internet of Things (IoT) devices, real-time applications, and resource-constrained RISC-V cores. In these contexts, maximizing performance relative to power and memory utilization is of paramount importance. Additionally, the scalable and predictable nature of the LXOR predictor positions it as a strong candidate for edge computing nodes, mission-critical systems, and next-generation general-purpose processors, particularly within the rapidly expanding RISC-V ecosystem. The LXOR predictor offers a compelling equilibrium between performance, simplicity, and scalability, thereby rendering it an ideal solution for contemporary processor architectures that seek to optimize branch prediction under stringent energy, area, and timing constraints.
Supplementary Materials
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jlpea15040064/s1
Author Contributions
D.G.S. and N.B.G. contributed equally to the work related to this manuscript. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Department of Science and Technology & Waste Management, Government of Goa, India, Grant number 6-383-2018/S&T-DIR/236, dated 07/06/2022.
Data Availability Statement
The original contributions presented in this study are included in the article/Supplementary Materials. Further inquiries can be directed to the corresponding author.
Acknowledgments
The authors are grateful to the Goa College of Engineering, affiliated with Goa University, and the Department of Science and Technology & Waste Management for their support of the work carried out in this study.
Conflicts of Interest
The authors declare no conflicts of interest.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).