3.1. Supported FP8 Formats
The proposed extension to the SoftFloat library supports two FP8 formats, namely E5M2—featuring a 5-bit exponent, 2-bit mantissa, and 1 sign bit—and E4M3—featuring a 4-bit exponent, 3-bit mantissa, and 1 sign bit—both in the OCP and in the IEEE-like standard. Both formats include subnormal number representation. To maintain consistency with the standards, the bias is computed as 2^(n−1) − 1 for both types, where n is the number of exponent bits. Therefore, the bias is 7 for E4M3, which has a four-bit exponent, and 15 for E5M2, which has a five-bit exponent.
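The bias rule can be sketched in C (a minimal illustration; `fp8_bias` is not a SoftFloat function):

```c
/* Exponent bias for an n-bit exponent field: bias = 2^(n-1) - 1.
 * Illustrative helper, not part of the SoftFloat API. */
static int fp8_bias(int exp_bits) {
    return (1 << (exp_bits - 1)) - 1;
}
```

so `fp8_bias(4)` yields 7 for E4M3 and `fp8_bias(5)` yields 15 for E5M2.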
While the IEEE-like FP8 formats adhere strictly to IEEE 754 for encoding infinities and NaNs, raising flags, handling subnormal numbers and rounding modes, the OCP E4M3 variant omits infinities entirely and provides only a single NaN mantissa encoding. Consequently, E4M3 supports just two NaN bit patterns—one positive and one negative—and reclaims the infinity and the other NaN bit patterns to extend its maximum exponent from seven to eight, thereby increasing its dynamic range by one binade (a binade is the set of floating-point values sharing the same exponent).
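The encoding difference can be made concrete with two small predicates (hypothetical helper names, not SoftFloat's): in OCP E4M3, the all-ones exponent no longer signals Inf/NaN except for the single all-ones mantissa pattern.

```c
#include <stdint.h>
#include <stdbool.h>

/* OCP E4M3: no infinities; the only NaN encodings are S.1111.111
 * (bytes 0x7F and 0xFF).  Any other all-ones-exponent pattern is an
 * ordinary finite value, which extends the range by one binade. */
static bool ocp_e4m3_is_nan(uint8_t v) {
    return (v & 0x7F) == 0x7F;
}

/* IEEE-like E4M3: exponent all ones marks Inf (mantissa 0) or NaN. */
static bool ieee_e4m3_is_nan(uint8_t v) {
    return ((v >> 3) & 0x0F) == 0x0F && (v & 0x07) != 0;
}
```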
Another key distinction is the handling of saturation. The OCP standard mandates an optional saturation mode for conversions from wider formats into FP8: on overflow, values clamp to the largest finite representable number rather than producing infinities (or NaNs, in the case of E4M3). However, this requirement applies only to conversions—saturation of arithmetic results is left unspecified. In our implementation, we support both behaviors, allowing users to choose whether saturation applies exclusively to conversions or also to the results of arithmetic operations. Notably, the OCP standard does not require exception handling during these conversions, nor does it require the use of status flags to indicate exceptions, unlike the IEEE format, which mandates exception handling and flag management.
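The conversion-time saturation behavior can be pictured as a pre-clamp before encoding (a simplified sketch: 448 is OCP E4M3's largest finite magnitude, and the actual FP8 rounding step is omitted):

```c
#include <math.h>

#define OCP_E4M3_MAX 448.0f  /* largest finite OCP E4M3 magnitude */

/* Saturating wider-to-FP8 conversion semantics: on overflow the value
 * clamps to the largest finite number instead of becoming Inf (or NaN,
 * in the E4M3 case).  NaN inputs still propagate. */
static float clamp_for_e4m3(float x) {
    if (isnan(x)) return x;
    if (x >  OCP_E4M3_MAX) return  OCP_E4M3_MAX;
    if (x < -OCP_E4M3_MAX) return -OCP_E4M3_MAX;
    return x;
}
```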
Furthermore, the OCP standard specifies that rounding must be performed using the “round to nearest even” mode. While this is a required implementation for OCP, the specification also allows for the possibility of implementing additional rounding modes, though they are not required by the standard. In our model, we support the “round to nearest even” mode for the OCP formats, while for the IEEE-like formats, we implement all IEEE 754 rounding modes to ensure full compatibility with the standard.
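Round-to-nearest-even over discarded low-order bits can be sketched as follows (a simplified version of the guard/sticky logic used by rounding routines; not SoftFloat's actual code):

```c
#include <stdint.h>

/* Round sig to nearest while dropping `shift` low bits; ties to even. */
static uint32_t round_rne(uint32_t sig, int shift) {
    uint32_t keep = sig >> shift;
    uint32_t rem  = sig & ((1u << shift) - 1u);
    uint32_t half = 1u << (shift - 1);
    if (rem > half || (rem == half && (keep & 1u)))
        keep += 1u;       /* round up; exact ties pick the even result */
    return keep;
}
```

For example, dropping two bits, 2.75 rounds to 3, while the exact ties 2.5 and 3.5 both round to the even neighbors 2 and 4.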
Finally, because the sole NaN encoding in OCP E4M3 (S.1111.111) has the mantissa’s most significant bit set, it is inherently a quiet NaN. To enable early error detection and prevent silent propagation, we also provide an option to treat this single NaN as a signaling NaN.
The relevant arithmetic properties of the two formats are summarized in
Table 2 and
Table 3.
3.2. Software Implementation
We extend Spike with FP8 support by introducing a new SoftFloat back-end module (Softfloat_8) and making minimal frontend changes, including a format-selection parameter (E4M3/E5M2), small updates to the RVV FP decode/instruction files to dispatch FP8 elements, and a new instruction (vfcvt.f.f).
Figure 1 summarizes the integration.
The implementation of the back-end is organized into three parts:
Headers;
Arithmetic, conversion, and auxiliary functions;
Spike integration functions for vector instructions.
Following the template from the SoftFloat library, the header files have been expanded to add all the necessary definitions and function declarations needed for the correct functioning of the library. The headers are organized as follows:
The softfloat.h header contains all the non-internal arithmetic function declarations.
The internals.h header contains internal definitions such as the union type used to map 8-bit variables, together with the definitions of some basic functions for the implementation of the arithmetic functions;
The types.h header contains the SoftFloat type declarations, to which the new types float8_1_t for E4M3 and float8_2_t for E5M2 have been added.
The specialize.h header contains both the definition of the default NaN values for the newly implemented data types and the declarations of the functions dedicated to handling NaN values.
The softfloat_types.h header contains the macros needed to select the options for the OCP standard formats, as described in
Section 3.1. In particular, the macro OFP8_saturate_arith saturates the result of FP8 arithmetic operations and the macro E4M3_isSigNaN transforms the only OCP E4M3 NaN into a signaling one.
Lastly, the primitives.h header defines constants and utility functions that are essential for implementing floating-point operations.
The core functions that use the above headers fall into two categories, as illustrated in
Figure 1. The first category comprises the top-level functions invoked by the Spike instruction decoder, encompassing both arithmetic and conversion SoftFloat functions. For both proposed 8-bit floating-point types, E5M2 and E4M3, all the functions required for compliance with the IEEE 754 standard have been implemented, along with an additional function for converting between the two types. Conversion between floating-point types of different precision is supported via narrowing and widening instructions in the RISC-V instruction set, as well as in the SoftFloat library. In addition to conversions from the 8-bit formats to bfloat16 (Brain Floating Point, 16-bit), the newly implemented 8-bit types introduce a new conversion operation from E5M2 to E4M3 and vice versa. The existing SoftFloat conversion function templates did not support conversions between formats of the same length; however, by adapting the existing narrowing and widening functions, we were able to develop the necessary conversion functions while closely adhering to the original template. This new instruction is designated “vfcvt.f.f”.
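To illustrate, widening an FP8 value to a wider format is exact, so an E5M2→E4M3 conversion can be pictured as a decode-to-float step followed by re-encoding with rounding. A sketch of the decode step for IEEE-like E5M2 (the function name and structure are illustrative, not SoftFloat's):

```c
#include <stdint.h>
#include <math.h>

/* Decode an IEEE-like E5M2 byte (1 sign, 5 exponent, 2 mantissa bits;
 * bias 15) into a float.  Widening is exact: no information is lost. */
static float e5m2_to_float(uint8_t v) {
    int sign = (v >> 7) & 1;
    int exp  = (v >> 2) & 0x1F;
    int frac = v & 0x03;
    float mag;
    if (exp == 0x1F)                       /* Inf (frac 0) or NaN */
        mag = frac ? NAN : INFINITY;
    else if (exp == 0)                     /* subnormal: frac * 2^(1-15-2) */
        mag = ldexpf((float)frac, 1 - 15 - 2);
    else                                   /* normal: (4+frac) * 2^(exp-15-2) */
        mag = ldexpf((float)(4 + frac), exp - 15 - 2);
    return sign ? -mag : mag;
}
```

For example, the byte 0x3C (0 01111 00) decodes to 1.0, and 0x7B (0 11110 11) decodes to 57344, E5M2's largest finite value.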
The second category of newly defined functions includes common auxiliary routines for floating-point operation support, such as packing and unpacking the FP8 fields, rounding inexact results according to the selected rounding mode, normalization, NaN handling, and flag raising. In the SoftFloat library, a default NaN value is defined for every floating-point data type. To follow the template faithfully, a default NaN value has been implemented for the two FP8 types. The default canonical quiet NaN encoding requires setting the sign bit to `0’, the exponent bits all to `1’, the most significant bit of the mantissa to `1’, and all other bits to `0’. For the E5M2 type the default NaN is therefore 0x7E (0 11111 10), and for the E4M3 data type it is 0x7C (0 1111 100) for the IEEE-like version and 0x7F (0 1111 111) for the OCP version.
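The canonical quiet-NaN rule just described can be captured by a small helper (illustrative name, not SoftFloat's API):

```c
#include <stdint.h>

/* Default quiet NaN per the canonical rule: sign 0, exponent all ones,
 * mantissa MSB 1, remaining bits 0. */
static uint8_t fp8_default_nan(int exp_bits, int frac_bits) {
    uint8_t exp_ones = (uint8_t)(((1u << exp_bits) - 1u) << frac_bits);
    uint8_t frac_msb = (uint8_t)(1u << (frac_bits - 1));
    return exp_ones | frac_msb;
}
```

Here `fp8_default_nan(5, 2)` gives the E5M2 default NaN and `fp8_default_nan(4, 3)` the IEEE-like E4M3 one; OCP E4M3 instead uses its single all-ones NaN pattern.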
As per the IEEE 754 standard for higher precision formats, the arithmetic functions of the SoftFloat extension support the detection of five exceptional conditions (flags), namely inexact, overflow, underflow, divide by zero, and invalid. Flag support is particularly critical in checking the compliance of hardware FP8 unit implementations with existing numerical software applications.
The full list of arithmetic core functions, along with the corresponding RISC-V instructions, is shown in
Table 4.
In order to integrate the new SoftFloat extension into the Spike source files, several updates have been implemented.
The headers and the core function files have been added in the softfloat_8 directory, following a naming convention consistent with their SoftFloat counterparts.
A template Makefile has been added in the softfloat_8 directory, to specify which files to compile and which headers to include in the software build.
The floating-point vector instruction files have been updated to accommodate the new formats.
The macros in the instruction decode header file have been updated. This header contains macros for instruction decoding and execution, which implement the vector loops for single-operand, vector–vector, and vector–scalar instructions.
A command-line argument has been added that selects the default FP8 type based on the vector standard element width (vsew), together with a bit field “valtfp” that distinguishes between the two FP8 types. This bit field was placed in a custom CSR rather than in the reserved encoding spaces or bit fields of existing CSRs such as vtype, avoiding potential conflicts with future ISA spec releases.
3.3. Numerical Evaluation of FP8 Formats and Standards
This section quantifies the numerical behavior of the two FP8 encodings implemented in the simulator (E4M3 with higher mantissa precision, and E5M2 with a wider exponent range) under four standard variants for exceptional values: IEEE-like, OCP without saturation, OCP with saturation on conversion only, and OCP with saturation on both conversion and arithmetic, with the goal of assessing whether FP8 arithmetic is accurate enough for typical element-wise vector kernels. The same input seeds are used for all runs so that the different standards are directly comparable.
The tests of this section are driven by a Python harness that calls the C simulator through a wrapper. The FP32 result serves as ground truth; relative error and ULP distances (i.e., how many representable numbers apart two values are) are measured in FP32 units. For each operation (add, sub, fma) we draw independent operands from three input regimes: (i) Normal, (ii) Uniform, and (iii) signed log-uniform magnitudes spanning many decades. The unit-scale cases (Normal and Uniform) probe precision around 1 without activating range effects, which isolates mantissa behavior. The log-uniform case stresses dynamic range: we sample the exponent e uniformly over a symmetric interval, so |x| = 10^e has a flat density in log10|x|. This produces both very small and very large operands, exercising underflow/overflow policies.
We use a large number of samples per operation for Normal and Uniform, and a larger count still for log-uniform. The counts are chosen so that reported percentiles are statistically stable and the non-finite rates (NaN/Inf) are well estimated. For an empirical p-quantile, the uncertainty in percentile rank can be bounded by the binomial approximation σ_p ≈ sqrt(p(1 − p)/N); with the sample sizes used, this bound is well below one percentage point at both p = 0.95 and p = 0.99. For non-finite rates estimated as r̂ = k/N, the standard error is sqrt(r̂(1 − r̂)/N), so even moderate incidences are estimated with tight 95% intervals. The heavier-tailed log-uniform case therefore uses the larger sample count to stabilize both the P95/P99 on finite samples and the NaN/infinity value (Inf) rates. For FP16/FP32 baselines, we rely on NumPy; for FMA, the baseline is non-fused (multiplication and addition round separately).
For each output ŷ and truth y, we compute the absolute relative error |ŷ − y|/|y|, the ULP distance in FP32 units (obtained by ordering FP32 bit patterns so that integer distance equals ULP distance), and the rates of non-finite results (NaN and Inf). Percentiles (median, P95, P99) are computed over finite samples only; non-finite rates are reported separately.
On unit-scale inputs (Normal and Uniform), E4M3 shows about half the median relative error of E5M2 across add, sub, and fma (e.g., ≈0.033 vs. 0.066), with similarly smaller P95 values. In ULP units the medians are likewise about 2× lower for E4M3 (∼4·10^5 vs. ∼8·10^5). The tails are bounded: P99 stays below 1 in all cases except fma with E5M2 (1.419). FP16 closely tracks FP32 and serves as a reference. Because these unit-scale tests primarily probe precision rather than dynamic range, results are identical across FP8 standards.
Table 5 reports the Normal case; Uniform follows the same pattern.
For the range-stress results (log-uniform inputs spanning many decades), the standard used dominates the behavior.
Figure 2 shows the error distributions for fma using the same random seed. A prominent spike at 0 on the horizontal axis (i.e., relative error = 1) is expected for the E4M3 curve. It arises from trials in which FP8 quantization—either at operand conversion or due to underflow after the operation—produces an output equal to zero while the FP32 reference is non-zero. By definition this yields a relative error of exactly 1, which appears at 0 in the logarithmic plot. The effect is more pronounced for E4M3 because its exponent range is narrower than E5M2’s. Consistent with this,
Table 6 shows that under IEEE-like, E4M3 exhibits substantial NaN and Inf rates, whereas E5M2 has negligible NaN and 17.4% Inf (median 0.053); OCP (no saturation) removes Inf for E4M3 but yields NaN instead (median 0.036); adding saturation on conversion reduces the non-finites for E4M3 to 0% Inf at the cost of higher finite error; and saturation on both operations and conversion eliminates non-finites entirely (0%/0%) but further increases bias, with higher medians for both formats and P95 near 1.
From these tests, we can see that for data already normalized near unit scale, element-wise FP8 is usable and E4M3 provides lower typical error. For wide-range data, E5M2 is preferable and the choice of the standard is critical: IEEE-like makes overflows explicit through Inf (useful for diagnostics), OCP without saturation suppresses Inf at the cost of NaN propagation for E4M3, and saturation removes non-finite values while introducing bias. If saturation is required, applying it only at conversion reduces bias compared to saturating arithmetic. Overall, this controlled numerical analysis demonstrates that FP8 arithmetic is a viable option for element-wise vector kernels when data are appropriately scaled: typical errors are modest near unit magnitude, and behavior under wide dynamic range and exceptional cases is governed by the format (E4M3 vs. E5M2) and the standard (IEEE-like vs. OCP variants). In addition to numerical viability, FP8’s compact 8-bit representation (4× smaller than FP32 and 2× smaller than FP16) reduces memory footprint and data movement and can increase arithmetic density and effective bandwidth on bandwidth-bound workloads. These practical advantages motivate FP8 as a compelling choice.