xjb: Fast Float to String Algorithm
Abstract
1. Introduction
1.1. Background and Motivation
- Information Preservation: The printed result must be parsable back to the original floating-point number without loss of precision.
- Minimum Length: The output string should be as short as possible while maintaining information preservation.
- Correct Rounding: When multiple representations satisfy the first two criteria, the algorithm must correctly round to the nearest value, with ties broken by selecting the even value.
- Left-to-Right Generation: The output digits should be generated sequentially from the most significant to the least significant digit.
- Branch Prediction Penalties: Many algorithms rely heavily on conditional branches to handle different cases, leading to frequent branch mispredictions on modern pipelined processors.
- High-Precision Multiplication Overhead: The conversion process requires high-precision arithmetic operations, particularly multiplications involving large precomputed constants, which can be expensive on standard hardware.
- Instruction Dependency Chains: Sequential dependencies between operations limit instruction-level parallelism and prevent efficient utilization of modern superscalar processors.
- Limited SIMD Utilization: Most existing algorithms do not exploit vector instruction sets (SIMD) that are now ubiquitous in contemporary processors.
- Schubfach [8] offers an elegant approach but suffers from suboptimal performance due to unoptimized computation flow.
- Dragonbox [10] reduces multiplications at the cost of increased branches, trading off one bottleneck for another.
- zmij [16] provides competitive performance but still leaves room for improvement in instruction dependency reduction and branch optimization.
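The information-preservation and minimum-length criteria listed earlier in this section can be illustrated with a small sketch. It relies on CPython's built-in `repr`, which is documented to return the shortest string that round-trips; it is used here only as a stand-in and is not the xjb implementation. The helper name `round_trips` is hypothetical.

```python
def round_trips(text: str, value: float) -> bool:
    """Information preservation: parsing the text must give back the same double."""
    return float(text) == value


x = 0.1 + 0.2

# Shortest round-trip output (CPython's repr, used as a stand-in formatter).
shortest = repr(x)

# A fixed 17-significant-digit output also round-trips but is not always minimal.
fixed17 = format(x, ".17g")

# A 5-digit output is shorter but loses information for this value.
too_short = format(x, ".5g")

print(shortest, round_trips(shortest, x))
print(fixed17, round_trips(fixed17, x))
print(too_short, round_trips(too_short, x))
```

The example shows why both criteria are needed together: truncating digits violates information preservation, while always printing 17 digits violates minimum length for values such as 0.3.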
1.2. Contributions
- Reduced Instruction Dependencies: Unlike other algorithms that suffer from sequential dependencies, xjb restructures the computation by decomposing d (introduced in Section 3.2) into an integer part and a fractional part (Section 3.3.2) instead of computing d directly. This minimizes data dependencies between operations, enabling better instruction-level parallelism and improved pipeline utilization on modern superscalar processors.
- Minimized Multiplication Operations: Building on insights from yy_double, but without the trade-off of increased branches, xjb reduces the number of expensive high-precision multiplications required during conversion, significantly decreasing the computational overhead while maintaining branch efficiency. For IEEE 754 binary64, only one 64-bit by 128-bit multiplication is required, and for IEEE 754 binary32, only one 64-bit by 64-bit multiplication is needed.
- Mitigated Branch Prediction Penalties: Through branchless programming techniques and careful case analysis, xjb addresses the branch prediction problem that plagues algorithms like Dragonbox. All branches in xjb are designed as unlikely branches, and the core conversion of normal floating-point numbers is completely branch-free, minimizing conditional branches that could lead to prediction failures.
- SIMD Instruction Utilization: Unlike most existing algorithms that neglect SIMD potential, xjb is designed from the ground up to leverage SIMD instructions (NEON for ARM64, AVX512/SSE4.1/SSE2 for x86-64) for efficient decimal-to-ASCII conversion, fully exploiting the vector processing capabilities of contemporary processors.
- Concise Core Implementation: Despite its sophisticated optimizations, xjb maintains a compact and readable core implementation, facilitating adoption and maintenance.
1.3. Evaluation Overview
1.4. Explanation of Special Symbols in This Article
2. IEEE 754 Floating-Point Number Representation
2.1. Scope and Assumptions
- We consider only positive floating-point numbers, as negative numbers differ only by a leading minus sign.
- We exclude special values (zero, NaN, and infinity) from our analysis, since these are handled separately in practice.
2.2. Binary Representation
Double (binary64):
- One sign bit (s): indicates positive ($s = 0$) or negative ($s = 1$).
- Eleven exponent bits (e): biased exponent in the range $0 \le e \le 2047$.
- Fifty-two fraction bits (f): fraction of the significand in the range $0 \le f \le 2^{52} - 1$.

Float (binary32):
- One sign bit (s): indicates positive ($s = 0$) or negative ($s = 1$).
- Eight exponent bits (e): biased exponent in the range $0 \le e \le 255$.
- Twenty-three fraction bits (f): fraction of the significand in the range $0 \le f \le 2^{23} - 1$.
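The field layout can be checked with a short sketch that extracts s, e, and f from the raw bit patterns. The function names `fields64` and `fields32` are hypothetical and only illustrate the standard IEEE 754 layout; they are not part of the xjb source.

```python
import struct


def fields64(x: float) -> tuple[int, int, int]:
    """Return (s, e, f) of a binary64 value: 1 sign, 11 exponent, 52 fraction bits."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)


def fields32(x: float) -> tuple[int, int, int]:
    """Return (s, e, f) of a binary32 value: 1 sign, 8 exponent, 23 fraction bits.

    The argument is first rounded to binary32 by struct.pack("<f", ...).
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & ((1 << 23) - 1)


print(fields64(1.0))   # sign, biased exponent, fraction of 1.0 as a double
print(fields32(-2.5))  # the same fields for -2.5 as a float
```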
2.3. Classification of Floating-Point Numbers
- Subnormal Numbers ( and ): These represent very small values close to zero, where the implicit leading bit of the significand is 0 instead of 1.
- Normal Numbers ( and ): The standard case, where the implicit leading bit of the significand is 1.
- Irregular Numbers ( and ): Numbers with zero fraction field, representing powers of two.
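As a rough illustration of the three categories, the sketch below classifies a positive finite double and returns an integer significand c and binary exponent q such that the value equals c · 2^q. Treating every value with a zero fraction field and a nonzero exponent field as irregular is an assumption based on the description above; the function name `decode64` is hypothetical.

```python
import struct
from fractions import Fraction


def decode64(x: float) -> tuple[str, int, int]:
    """Classify a positive finite nonzero double and return (category, c, q), value = c * 2**q."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    e = (bits >> 52) & 0x7FF
    f = bits & ((1 << 52) - 1)
    if bits >> 63 or e == 0x7FF or (e == 0 and f == 0):
        raise ValueError("negative numbers, zero, infinity and NaN are out of scope")
    if e == 0:
        # Subnormal: implicit leading bit is 0.
        return "subnormal", f, -1074
    # Normal or irregular: implicit leading bit is 1.
    category = "irregular" if f == 0 else "normal"
    return category, (1 << 52) | f, e - 1075


for value in (1.0, 1.5, 5e-324):
    category, c, q = decode64(value)
    # Exact check with rational arithmetic that c * 2**q reproduces the value.
    assert Fraction(c) * Fraction(2) ** q == Fraction(value)
    print(value, category, c, q)
```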
2.4. Value Representation
2.5. Rounding Interval
3. Algorithm Principles
3.1. Design Overview
- Float-to-Decimal Conversion: Converting binary floating-point values to decimal significand–exponent pairs .
- Decimal-to-String Conversion: Formatting into human-readable strings.
3.2. Mathematical Foundation
3.3. Overview of the Schubfach Algorithm and Derivation of Our Method
3.3.1. Candidate Values for the Significand d
3.3.2. Decomposition into Integer and Fractional Parts
3.3.3. Selection Criteria for
- Case (i.e., ): This case applies when falls inside the rounding interval . The condition is derived as follows: The lower bound of the rounding interval must be less than : When equality holds (, or equal to ), we apply the round-to-even rule, requiring c to be even:
- Case (i.e., ): This case applies when falls inside the rounding interval . The condition is derived similarly: The upper bound of the rounding interval must be greater than : When equality holds (, or equal to ), we again apply round-to-even:
- Case : When neither boundary condition applies, the optimal value lies between and . We determine by rounding to the nearest integer:
  - If the fractional part is less than 1/2: round down;
  - If the fractional part is greater than 1/2: round up;
  - If the fractional part equals 1/2: apply round-to-even.
For irregular floating-point numbers (powers of two), additional verification is required to ensure that the selected value lies within the rounding interval , as the interval boundaries differ for these special cases.
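The difference between regular and irregular rounding intervals can be sketched with exact rational arithmetic. The code below assumes the usual convention that a double with value c · 2^q has the interval bounded by the midpoints to its two neighbours, and that for a power of two above the smallest normal number the lower neighbour is only half as far away. The function name `rounding_interval` is hypothetical, and this is an illustration of the interval itself, not of the xjb selection logic.

```python
import struct
from fractions import Fraction


def rounding_interval(x: float) -> tuple[Fraction, Fraction]:
    """Return the exact (lower, upper) rounding-interval bounds of a positive finite double."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    e = (bits >> 52) & 0x7FF
    f = bits & ((1 << 52) - 1)
    if e == 0:
        c, q = f, -1074                  # subnormal
    else:
        c, q = (1 << 52) | f, e - 1075   # normal or irregular
    upper = Fraction(2 * c + 1) * Fraction(2) ** (q - 1)
    if f == 0 and e > 1:
        # Irregular: the gap below a power of two is half the gap above it.
        lower = Fraction(4 * c - 1) * Fraction(2) ** (q - 2)
    else:
        lower = Fraction(2 * c - 1) * Fraction(2) ** (q - 1)
    return lower, upper


lo, hi = rounding_interval(1.0)
print(Fraction(1) - lo, hi - Fraction(1))  # the lower gap is half the upper gap

lo, hi = rounding_interval(1.5)
print(Fraction(3, 2) - lo, hi - Fraction(3, 2))  # symmetric interval
```

Any decimal chosen as output must fall inside this interval; for irregular inputs the narrower lower side is what makes the extra verification necessary.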
3.3.4. Algorithm Overview
- Lookup table precomputation;
- Efficient computation of m;
- Fast boundary condition testing for ;
- Efficient computation of and rounding;
- Handling of irregular floating-point numbers;
- Implementation of pseudocode.
3.4. Lookup Table Precomputation
3.4.1. Fundamental Calculation
Algorithm 1: The xjb Algorithm for Float-to-Decimal Conversion
3.4.2. Detailed Calculation Process
- Float: The range of is calculated to be [−32, 44] through the q value range in Equation (6), so the lookup table contains values from $10^{-32}$ to $10^{44}$. The calculation process is as follows: When , the lookup table variable indicates that the values and are equal. In other cases, the relative error is less than , expressed as follows:
- Double: The range of is calculated to be [−293, 323] through the q value range in Equation (6), so the lookup table contains values from $10^{-293}$ to $10^{323}$. The calculation process is as follows: When , the lookup table variable indicates that the values and are equal. In other cases, the relative error is less than , expressed as follows:
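A table of this kind can be generated with arbitrary-precision integers. The sketch below is written under assumptions that are not confirmed by the text above: entries are normalized so that the top bit is set, they are truncated (rounded down), and they are 64 bits wide for float and 128 bits wide for double. The names `pow10_entry` and `build_table` are hypothetical.

```python
def pow10_entry(exp10: int, width: int) -> tuple[int, bool]:
    """Return (g, exact): the leading `width` bits of 10**exp10, truncated,
    and whether g represents the power of ten without any error."""
    if exp10 >= 0:
        n = 10 ** exp10
        shift = n.bit_length() - width
        if shift <= 0:
            return n << -shift, True          # fits entirely: exact
        dropped = n & ((1 << shift) - 1)
        return n >> shift, dropped == 0       # exact only if no nonzero bits are dropped
    d = 10 ** -exp10
    # Scale so that the quotient lands in [2**(width-1), 2**width).
    s = width - 1 + d.bit_length()
    return (1 << s) // d, False               # 1/10**n is never a dyadic rational


def build_table(lo: int, hi: int, width: int) -> dict[int, tuple[int, bool]]:
    """Build entries for every power of ten from 10**lo to 10**hi inclusive."""
    return {e: pow10_entry(e, width) for e in range(lo, hi + 1)}


float_table = build_table(-32, 44, 64)
double_table = build_table(-293, 323, 128)
print(len(float_table), len(double_table))
print(hex(double_table[-1][0]))  # leading 128 bits of 1/10
```

With these assumptions, the exact entries are those powers of ten whose odd factor $5^n$ fits in the entry width; all other entries carry a small relative truncation error.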
3.4.3. Storage Requirements
3.4.4. Implementation Notes
3.5. Efficient Computation of m
3.5.1. Key Proof
3.5.2. Bit Width Calculation
3.5.3. Results
3.6. Fast Boundary Condition Testing for and
3.6.1. Equivalent Conditions for Boundary Testing
- Case 1: Testing . When , this is equivalent to
- Case 2: Testing . When , this is equivalent to
3.6.2. Integer Testing Analysis
- Analysis for : For the case, we can derive . This implies that when is an integer, it must equal .
- Analysis for : Similarly, for the case, . This implies that when is an integer, it must equal .
3.6.3. Key Insight: Integer Divisibility Test
- Case : From , we get . The expression simplifies to checking whether is divisible by . Since 2 and 5 are coprime, this reduces to checking whether is divisible by : Let t be a positive integer such that . Since is odd, t must also be odd. Considering the ranges of c for float and double, this gives us the range for t: The maximum values of k for which t can be at least one odd integer are
- Case : The denominator is even, while the numerator is odd, so no solution exists.
- Case : The denominator is even, while the numerator is odd, so no solution exists.
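A divisibility check by a power of five can be done without a division instruction. The sketch below shows one well-known technique, multiplication by the modular inverse followed by a single comparison; it is offered as an illustration of how such a test can be made cheap and is not claimed to be the test used in xjb. The names `pow5_test_constants` and `divisible_by_pow5` are hypothetical.

```python
MASK64 = (1 << 64) - 1


def pow5_test_constants(k: int) -> tuple[int, int]:
    """Precompute (inverse, limit) for testing divisibility by 5**k on 64-bit values."""
    d = 5 ** k
    assert d <= MASK64, "5**k must fit in 64 bits (k <= 27)"
    inverse = pow(d, -1, 1 << 64)   # exists because 5**k is odd
    limit = MASK64 // d             # largest quotient of a 64-bit multiple of d
    return inverse, limit


def divisible_by_pow5(n: int, k: int) -> bool:
    """Return True if the 64-bit unsigned integer n is divisible by 5**k."""
    inverse, limit = pow5_test_constants(k)
    # Multiplication by the inverse maps multiples of d onto 0..limit
    # and every other value onto something larger.
    return ((n * inverse) & MASK64) <= limit


print(divisible_by_pow5(5 ** 10 * 3, 10))
print(divisible_by_pow5(5 ** 10 * 3 + 1, 10))
```

In a compiled implementation the two constants would be stored per k, so that each test costs one 64-bit multiplication and one comparison.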
3.6.4. Summary of Boundary Conditions
3.6.5. Efficient Implementation
- When , for the expression in Equation (73), the following holds: Therefore, when , Equation (70) does not hold.
- Call the function in Appendix A.5 to calculate the approximation results and of all possible upper- and lower-limit rational numbers: Therefore, for , the following conclusion can be drawn from Appendix A.4. By exhausting all possibilities, we have the following (the test code is test3.py, https://github.com/xjb714/xjb/blob/main/py_test/test3.py, accessed on 20 April 2026): Therefore, when , Equation (70) does not hold.
- When , for the expression in Equation (86), the following holds: Therefore, when , Equation (83) does not hold.
- Call the function in Appendix A.5 to calculate the approximation results and of all possible upper- and lower-limit rational numbers: Therefore, for , the following conclusion can be drawn from Appendix A.4. By exhausting all possibilities, we have the following (the test code is test7.py, https://github.com/xjb714/xjb/blob/main/py_test/test7.py, accessed on 20 April 2026): Therefore, when , Equation (83) does not hold.
- (1) When , there must exist , and there is
- (2) When , there must exist , and there is
- (3) When , there must exist , and there is
- (4) When , there must exist , and there is
3.7. Efficient Computation of and Rounding
3.7.1.
- When , it can be concluded that , the numerator is even, and the denominator is odd, which does not meet the condition.
- When , it can be concluded that is even, which does not meet the condition.
- is an odd number, and c is an odd multiple of , so . Therefore, when q meets the above conditions, c must be an odd multiple of . Consequently, when the following conditions are met, the expression in Equation (135) is an odd number: The following equation holds: Since , is a multiple of 5 and is an odd number. Since and are both odd numbers, is an even number, and is a multiple of 5 and is an odd number. Therefore, we have . The result of is an even number between and . Therefore,
3.7.2.
3.7.3. Efficient Implementation of for Double
3.7.4. Efficient Calculation of for Double
- When , ;
- When , Equation (160) is equivalent to the following:
3.8. Irregular Number
3.9. Implementation of Pseudocode
3.9.1. Single-Precision Floating-Point Numbers
3.9.2. Double-Precision Floating-Point Numbers
3.10. Decimal-to-String Conversion
- Scalar Implementation
- Handling Undefined Behavior
Algorithm 2: Convert an 8-digit decimal number to ASCII: dec_to_ascii8(x)
Algorithm 3: Convert a 16-digit decimal number to ASCII: dec_to_ascii16(x)
- SIMD Implementation
Algorithm 4: Floating-point number printing algorithm
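The scalar conversion of an 8-digit block to ASCII can be sketched with the SWAR (SIMD within a register) approach described by Khuong [19], in which several digit groups are packed into one 64-bit word and divided in parallel by multiply-and-shift. The code below is an illustration of that general technique and is not a transcription of Algorithm 2; the name `dec_to_ascii8_sketch` is hypothetical.

```python
def dec_to_ascii8_sketch(x: int) -> bytes:
    """Convert 0 <= x < 10**8 to exactly 8 ASCII digits, most significant first."""
    assert 0 <= x < 10 ** 8
    hi, lo = divmod(x, 10000)
    # Two 4-digit groups in 32-bit lanes (high group in the low lane).
    packed = hi | (lo << 32)
    # Divide both lanes by 100 at once: floor(v * 10486 / 2**20) == v // 100 for v < 10000.
    top = ((packed * 10486) >> 20) & 0x0000007F0000007F
    bottom = packed - top * 100
    # Four 2-digit groups in 16-bit lanes.
    pairs = top | (bottom << 16)
    # Divide all four lanes by 10: floor(v * 103 / 2**10) == v // 10 for v < 100.
    tens = ((pairs * 103) >> 10) & 0x000F000F000F000F
    ones = pairs - tens * 10
    # Interleave into eight byte lanes and add ASCII '0' to every byte.
    digits = tens | (ones << 8)
    return (digits + 0x3030303030303030).to_bytes(8, "little")


print(dec_to_ascii8_sketch(12345678))
print(dec_to_ascii8_sketch(42))
```

The result is stored little-endian so that the most significant digit ends up at the lowest address, which is the order needed when writing into an output buffer. A 16-digit variant can apply the same steps to two 8-digit halves, and the SIMD variants perform the equivalent lane-wise divisions in vector registers.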
3.11. Summary
- Section 3.5 introduces the method for quickly calculating m.
- Section 3.6 and Section 3.7 introduce the methods for quick calculation of .
- Section 3.9 provides a detailed description of the pseudocode implementation.
- Section 3.10 discusses print optimization.
4. Experimental Evaluation
4.1. Correctness Verification
- Single-Precision (binary32): Given the manageable size of the binary32 search space ($2^{32}$ possible values), we performed exhaustive testing across the entire range. Each output was compared against the reference Schubfach algorithm to ensure identical results, guaranteeing complete correctness for the binary32 format.
- Double-Precision (binary64): Exhaustive testing of all binary64 values is computationally infeasible. Instead, we employed a comprehensive testing strategy that included the following:
  - Large-scale random testing with statistically significant sample sizes.
  - Targeted testing of edge cases including subnormal numbers, extreme exponents, and near-power-of-two values.
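The binary64 testing strategy can be sketched as a small harness that draws random bit patterns, adds a few edge cases, and checks the round-trip property bit for bit. The harness below uses CPython's `repr` as a stand-in for the formatter under test; the function names and the particular edge-case list are assumptions made for illustration and do not reproduce the actual test suite.

```python
import math
import random
import struct


def bits_of(x: float) -> int:
    """Raw 64-bit pattern of a double."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def random_finite_double(rng: random.Random) -> float:
    """Draw a uniformly random bit pattern, rejecting NaN and infinity."""
    while True:
        x = struct.unpack("<d", struct.pack("<Q", rng.getrandbits(64)))[0]
        if math.isfinite(x):
            return x


def round_trip_failures(to_string, samples: int, seed: int = 0) -> list[float]:
    """Return every tested value whose printed form does not parse back exactly."""
    rng = random.Random(seed)
    edge_cases = [5e-324, 2.0 ** -1022, 1.0, 2.0 ** 53, 1.7976931348623157e308]
    values = edge_cases + [random_finite_double(rng) for _ in range(samples)]
    return [v for v in values if bits_of(float(to_string(v))) != bits_of(v)]


print(len(round_trip_failures(repr, 10_000)))
print(len(round_trip_failures(lambda v: format(v, ".5g"), 1_000)) > 0)
```

A stricter variant would also compare each output string against a reference implementation, as was done for binary32, since round-tripping alone does not prove that the output is the shortest one.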
4.2. Experimental Setup
4.2.1. Hardware Platforms
- AMD R7-7840H: A modern high-performance x86-64 processor with support for AVX2 and AVX-512 instruction sets, running Ubuntu 26.04. This platform represents state-of-the-art x86-64 computing (max frequency: 5.1 GHz).
- Apple M1: A first-generation Apple Silicon ARM64 processor with NEON SIMD support, running macOS 26.4. This platform serves as a baseline for ARM64 performance (max frequency: 3.2 GHz).
- Apple M5: A recent-generation Apple Silicon ARM64 processor with NEON SIMD support, running macOS 26.4. This platform represents the latest ARM64 technology (max frequency: 4.46 GHz).
4.2.2. Compilers and Compilation Flags
- AMD R7-7840H: Intel C++ Compiler (icpx) version 2025.0.4.
- Apple M1/M5: Apple Clang version 21.0.0.
4.2.3. Benchmark Methodology
- Input Generation: Generate $2^{24}$ (16,777,216) random floating-point numbers, excluding special values (NaN and infinity) to focus on the core conversion logic.
- Warm-Up Phase: Execute the benchmark multiple times before measurement to eliminate cold-start effects and ensure consistent cache behavior.
- Measurement: Measure the total wall-clock time required to convert all numbers through multiple iterations.
- Analysis: Calculate the average conversion time per floating-point number, discarding outliers to ensure robust results.
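The four steps above can be sketched as a minimal timing harness. This is an illustration of the methodology only: it uses Python's `time.perf_counter_ns`, takes the median of several repetitions as one simple way of discarding outliers, and uses a much smaller input set than the benchmark described above. The function names are hypothetical.

```python
import math
import random
import struct
import time


def generate_inputs(count: int, seed: int = 0) -> list[float]:
    """Random doubles drawn from random bit patterns, excluding NaN and infinity."""
    rng = random.Random(seed)
    values = []
    while len(values) < count:
        x = struct.unpack("<d", struct.pack("<Q", rng.getrandbits(64)))[0]
        if math.isfinite(x):
            values.append(x)
    return values


def ns_per_value(convert, values, warmup: int = 2, repeats: int = 5) -> float:
    """Median wall-clock time per converted value, in nanoseconds."""
    for _ in range(warmup):              # warm-up phase, not measured
        for v in values:
            convert(v)
    timings = []
    for _ in range(repeats):             # measured iterations
        start = time.perf_counter_ns()
        for v in values:
            convert(v)
        timings.append(time.perf_counter_ns() - start)
    timings.sort()
    return timings[len(timings) // 2] / len(values)


inputs = generate_inputs(1 << 12)
print(f"repr: {ns_per_value(repr, inputs):.1f} ns per value")
print(f"null: {ns_per_value(lambda v: None, inputs):.1f} ns per value")
```

Timing an empty function alongside the real converter gives an estimate of the loop and call overhead that is included in every measurement.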
4.3. Algorithms Compared
- teju_jagua: Only implements float/double-to-decimal conversion.
- jnum: Only implements double-to-string conversion. When comparing float to string, we convert the double value to a float value. Strictly speaking, jnum does not satisfy the SW principle; however, it performs well, so we still include it in the benchmark.
- yy_double, uscalec: Only the double data type is supported.
4.4. Performance Results
- Float/Double-to-Decimal Conversion: Table 8 summarizes the benchmark results for float-to-decimal and double-to-decimal conversions across the AMD R7-7840H and Apple M1/M5 platforms. All benchmarks use random values, excluding NaN and infinity, to focus on core conversion performance.
- Float/Double-to-String Conversion: Figure 1 and Figure 2 present comprehensive benchmark results for double-to-string and float-to-string conversions on the three processor platforms.Specifically,
  - _comp (e.g., fmt_comp, dragonbox_comp, xjb32_comp, xjb64_comp): Versions using compressed constant tables for reduced memory footprint.
  - _full (e.g., fmt_full, dragonbox_full): Versions using uncompressed constant tables for potentially faster access.
  - null: An empty function used to isolate and measure the overhead of function calls.
4.5. Analysis and Discussion
4.5.1. Performance Comparison
On the AMD R7-7840H:
- Float-to-decimal: 2.24 ns, 5.44× faster than Schubfach (12.2 ns);
- Double-to-decimal: 3.76 ns, 3.06× faster than Schubfach (11.51 ns);
- Float-to-string: 33.72 cycles, 70% faster than zmij (57.06 cycles);
- Double-to-string: 43.41 cycles, 13% faster than zmij (49.28 cycles).
On the Apple M1:
- Float-to-decimal: 2.15 ns, 5.41× faster than Schubfach (11.64 ns);
- Double-to-decimal: 2.58 ns, 5.08× faster than Schubfach (13.12 ns);
- Float-to-string: 16.98 cycles, 117% faster than zmij (36.87 cycles);
- Double-to-string: 20.77 cycles, 25% faster than zmij (27.74 cycles).
On the Apple M5:
- Float-to-decimal: 1.44 ns, 5.27× faster than Schubfach (7.59 ns);
- Double-to-decimal: 1.55 ns, 4.97× faster than Schubfach (7.71 ns);
- Float-to-string: 13.87 cycles, 136% faster than zmij (32.77 cycles);
- Double-to-string: 17.09 cycles, 20% faster than zmij (20.74 cycles).
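The float/double-to-decimal speedup factors quoted above follow directly from the per-conversion times in Table 8, as the short calculation below shows. The dictionary simply copies the Schubfach and xjb rows of that table; the variable and function names are hypothetical.

```python
# (Schubfach ns, xjb ns) per conversion, copied from Table 8.
TIMES_NS = {
    ("AMD R7-7840H", "float"): (12.20, 2.24),
    ("AMD R7-7840H", "double"): (11.51, 3.76),
    ("Apple M1", "float"): (11.64, 2.15),
    ("Apple M1", "double"): (13.12, 2.58),
    ("Apple M5", "float"): (7.59, 1.44),
    ("Apple M5", "double"): (7.71, 1.55),
}


def speedup(baseline_ns: float, new_ns: float) -> float:
    """Speedup factor: how many times faster the new time is than the baseline."""
    return baseline_ns / new_ns


for (platform, kind), (schubfach_ns, xjb_ns) in TIMES_NS.items():
    print(f"{platform:13s} {kind:6s} {speedup(schubfach_ns, xjb_ns):.3f}x")
```

The factors range from roughly 3× to roughly 5.4×, which is the basis for the "3–5×" summary used elsewhere in the paper.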
4.5.2. Performance Consistency
4.5.3. Cross-Platform Performance
- x86-64 (AMD R7-7840H): The algorithm benefits from the compiler’s ability to generate highly optimized code for arithmetic operations and SIMD instructions.
- ARM64 (Apple M1/M5): xjb maintains consistent performance advantages across both generations of Apple Silicon, demonstrating the algorithm’s robustness and effectiveness across different instruction set architectures.
4.5.4. Comparison with Some Related Algorithms
- vs. Schubfach: With 3–5× speedup over the baseline, xjb validates that Schubfach’s elegant mathematical framework can be substantially optimized through computational restructuring, branch optimization, and SIMD utilization.
- vs. yy_json/yy_double: While these algorithms represent excellent engineering for JSON serialization, xjb outperforms them by 2–3×, demonstrating that SIMD instruction set utilization unlocks significant additional optimization potential.
- vs. zmij: Achieving 1.2–2.3× speedup over zmij highlights the benefits of xjb’s approach to instruction dependency reduction, which enables better instruction-level parallelism on modern superscalar processors.
- vs. Ryū and Dragonbox: xjb outperforms these established algorithms by 2.5–6×, demonstrating that systematic optimization of multiple bottlenecks simultaneously yields substantial performance gains over approaches that focus on single aspects of the problem.
4.5.5. Fixed-Length Performance Analysis
- Consistent Performance: All xjb variants (xjb32, xjb32_comp, xjb64, xjb64_comp) maintain consistent performance across all digit lengths (1–17 digits). In contrast, some competing algorithms show significant performance variations depending on the number of output digits.
- Predictable Latency: The branch-free core design ensures that the conversion time remains relatively constant regardless of output length. This predictability is a valuable property for real-time systems and high-throughput applications, where consistent latency is as important as raw performance.
- Compression Trade-Off: The compressed table variant (_comp) maintains strong competitiveness in performance compared to other algorithms, while also reducing memory usage. This demonstrates the efficiency of xjb in terms of memory utilization.
4.6. Summary
- Superior Performance: xjb consistently outperforms all competing algorithms across both x86-64 (AMD R7-7840H) and ARM64 (Apple M1/M5) architectures.
- Significant Speedups: Achieves 3–5× improvement over the baseline Schubfach algorithm.
- Performance Consistency: Maintains stable performance across diverse input distributions and output lengths due to effective branch prediction optimization and branch-free core design.
5. Conclusions
5.1. Improvements to the Schubfach Algorithm
- Restructured Computation Flow: By decomposing the significand calculation into integer and fractional parts, xjb minimizes instruction dependencies, enabling better instruction-level parallelism and improved pipeline utilization on modern superscalar processors.
- Minimized Multiplication Operations: xjb reduces the number of expensive high-precision multiplications required during conversion. For IEEE 754 binary64, only one 64-bit by 128-bit multiplication is needed, and for binary32, only one 64-bit by 64-bit multiplication is required, significantly decreasing computational overhead.
- Branch Optimization: The algorithm employs branchless programming techniques for core conversion logic and structures remaining branches as unlikely paths, enabling efficient branch prediction on modern processors and resulting in consistent performance across diverse input distributions.
- SIMD Instruction Utilization: Unlike Schubfach and many other existing algorithms, xjb leverages SIMD instructions (NEON for ARM64, AVX512/SSE4.1/SSE2 for x86-64) for efficient decimal-to-ASCII conversion, fully exploiting the vector processing capabilities of contemporary processors.
5.2. Key Findings
- Significant Performance Improvement: xjb achieves a 3–5× speedup over the baseline Schubfach algorithm. On Apple M5, xjb takes 1.44 ns for float-to-decimal conversion and 1.55 ns for double-to-decimal conversion, the fastest results among the algorithms compared in Table 8.
- Superior to State-of-the-Art Algorithms: xjb consistently outperforms other high-performance algorithms, including yy_json, yy_double, and zmij, by margins of 1.2–3×. This indicates the performance improvement achieved by using the SIMD instruction set and reducing instruction dependencies. On the Apple M5 processor, compared with zmij, our algorithm achieves approximately 20% and 136% speedups for double-to-string and float-to-string conversion, respectively.
- Consistent Performance across Platforms: Unlike prior work, which often shows significant performance variations between architectures, xjb maintains its performance advantage across both x86-64 and ARM64. This portability is achieved through careful algorithm design that works well with compiler optimizations on different platforms.
- Stable Performance across Input Distributions: The algorithm maintains consistent performance regardless of input patterns and output digit lengths. This stability can be attributed to our branch-free core design and effective branch prediction optimization for all conditional branches, making xjb ideal for applications requiring predictable performance.
- Synergistic Optimization Effects: The combination of instruction dependency reduction, multiplication minimization, branch optimization, and SIMD utilization works synergistically to deliver performance gains that exceed what any single optimization could achieve in isolation.
- Concise Core Implementation: The core conversion logic of xjb is implemented in a concise manner, with minimal code lines and clear logic flow. This design simplifies maintenance and allows for easy integration into larger software systems.
5.3. Practical Implications
- Data Serialization: JSON and other text-based data formats require efficient floating-point–string conversion for serialization operations.
- Scientific Computing: Applications that output numerical results in a human-readable format benefit from faster conversion without sacrificing accuracy.
- Database Systems: Export operations and query result formatting can leverage xjb for improved throughput.
- Web Services: RESTful APIs and web applications that return numerical data can achieve lower latency with efficient conversion.
5.4. Limitations and Future Work
- Extended Precision Support: Future work could extend xjb to support additional floating-point formats (e.g., 16-bit, 80-bit, 128-bit, and 256-bit floating-point numbers) for applications requiring other precisions.
- SIMD Vectorization: Although xjb is designed to be SIMD-friendly, explicit vectorization using AVX-512 or NEON could yield additional performance gains for batch conversion workloads.
- Compiler Compatibility: Further optimization for different compilers (particularly MSVC) would improve portability across development environments.
- Memory-Constrained Environments: Investigating memory-efficient variants of xjb could benefit embedded systems and other resource-constrained platforms.
5.5. Availability
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Mathematical Foundations of Fractional Part Boundary
Appendix A.1. Notation and Assumptions
- P and Q are coprime and ;
- ;
- .
Appendix A.2. Basic Identities
Appendix A.3. A Useful Equivalence
Appendix A.4. Range of the Fractional Parts
- Proof of the Fractional Part Range:
- (1) Notation
- (2) Integer Linear Representation
- (3) Minimum of the Fractional Parts
- (4) Maximum of the Fractional Parts
- (5) Conclusion
Appendix A.5. Computation via Farey Sequences
References
- Steele, G.L., Jr.; White, J.L. How to Print Floating-Point Numbers Accurately. In Proceedings of the ACM SIGPLAN 1990 Conference on Programming Language Design and Implementation, PLDI 1990; ACM: New York, NY, USA, 1990; pp. 112–126.
- Steele, G.L., Jr.; White, J.L. How to Print Floating-Point Numbers Accurately (Retrospective). ACM SIGPLAN Notices 39(4), April 2004 (Best of PLDI, 1979–1999). Available online: https://dl.acm.org/doi/10.1145/989393.989431 (accessed on 1 April 2004).
- Burger, R.G.; Dybvig, R.K. Printing Floating-Point Numbers Quickly and Accurately. In Proceedings of the ACM SIGPLAN 1996 Conference on Programming Language Design and Implementation (PLDI ’96); ACM: New York, NY, USA, 1996; pp. 108–116.
- Loitsch, F. Printing Floating-Point Numbers Quickly and Accurately with Integers. In Proceedings of the ACM SIGPLAN 2010 Conference on Programming Language Design and Implementation, PLDI 2010; ACM: New York, NY, USA, 2010; pp. 233–243.
- Andrysco, M.; Jhala, R.; Lerner, S. Printing Floating-Point Numbers: A Faster, Always Correct Method. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016; ACM: New York, NY, USA, 2016; pp. 555–567.
- Adams, U. Ryū: Fast Float-to-String Conversion. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’18); ACM: New York, NY, USA, 2018; pp. 270–282.
- Adams, U. Ryū Revisited: Printf Floating Point Conversion. Proc. ACM Program. Lang. 2019, 3, 169.
- Giulietti, R. The Schubfach Way to Render Doubles. 2020. Available online: https://drive.google.com/file/d/1KLtG_LaIbK9ETXI290zqCxvBW94dj058/view (accessed on 1 September 2020).
- Jeon, J. Grisu-Exact: A Fast and Exact Floating-Point Printing Algorithm. 2020. Available online: https://github.com/jk-jeon/Grisu-Exact/blob/master/other_files/Grisu-Exact.pdf (accessed on 1 September 2020).
- Jeon, J. Dragonbox: A New Floating-Point Binary-to-Decimal Conversion Algorithm. 2024. Available online: https://github.com/jk-jeon/Dragonbox (accessed on 1 July 2024).
- Guo, Y.Y. Available online: https://github.com/ibireme/c_numconv_benchmark/blob/master/vendor/yy_double/yy_double.c (accessed on 1 January 2025).
- Guo, Y.Y. Available online: https://github.com/ibireme/yyjson (accessed on 1 August 2025).
- Cox, R. Available online: https://github.com/rsc/fpfmt (accessed on 1 January 2026).
- Cox, R. Floating-Point Printing and Parsing Can Be Simple And Fast. Available online: https://research.swtch.com/fp (accessed on 1 January 2026).
- Cox, R. Fast Unrounded Scaling: Proof by Ivy. Available online: https://research.swtch.com/fp-proof (accessed on 1 January 2026).
- Zverovich, V. Available online: https://github.com/vitaut/zmij (accessed on 1 March 2026).
- ANSI/IEEE Std 754-1985; IEEE Standard for Binary Floating-Point Arithmetic. IEEE: New York, NY, USA, 1985; pp. 1–20.
- IEEE Std 754-2019; (Revision of IEEE 754-2008) IEEE Standard for Floating-Point Arithmetic. IEEE: New York, NY, USA, 2019; pp. 1–84.
- Khuong, P. How to Print Integers Really Fast (with Open Source AppNexus Code!). Available online: https://pvk.ca/Blog/2017/12/22/appnexus-common-framework-its-out-also-how-to-print-integers-faster/ (accessed on 1 December 2017).
- Johnson, D. Converting Integers to Fixed-Width Strings Faster with Neon SIMD on the Apple M1. Available online: https://dougallj.wordpress.com/2022/04/01/converting-integers-to-fixed-width-strings-faster-with-neon-simd-on-the-apple-m1/ (accessed on 1 April 2022).
- Muła, W. SSE: Conversion Integers to Decimal Representation. Available online: http://0x80.pl/notesen/2011-10-21-sse-itoa.html (accessed on 1 October 2011).
- Lemire, D. Converting Integers to Decimal Strings Faster with AVX-512. Available online: https://lemire.me/blog/2022/03/28/converting-integers-to-decimal-strings-faster-with-avx-512/ (accessed on 1 March 2022).
- Xiang, J. Available online: https://github.com/xjb714/xjb/tree/main/bench/schubfach_xjb (accessed on 1 April 2026).
- Zverovich, V. Available online: https://github.com/fmtlib/fmt (accessed on 1 October 2025).
- Neri, C. Available online: https://github.com/cassioneri/teju_jagua (accessed on 1 November 2025).
- Leng, J. Available online: https://github.com/lengjingzju/json/jnum.c (accessed on 1 November 2025).


| Symbol | Brief Explanation | Example |
|---|---|---|
| % | Integer modulus operation | 2 = 8%3 |
| // | Integer division operation | 1 = 5//3 |
| << or >> | Left or right shift of binary values | 8 = 1 << 3 |
| ? : | Similar to the ternary operator in C syntax | a = 1?a:b |

| Category | Float (Binary32) | Double (Binary64) |
|---|---|---|
| Subnormal | , | , |
| Normal | ||
| Irregular | , | , |

| Identified Limitation | Corresponding Solution in xjb |
|---|---|
| Frequent branch mispredictions | Branchless programming for core decision logic (Section 3.6, Section 3.7 and Section 3.10) |
| High-precision multiplication overhead | Minimized multiplication count via lookup-table restructuring (Section 3.4, Section 3.5 and Section 3.6) |
| Long instruction dependency chains | Restructured computation flow to expose instruction-level parallelism (Section 3.10) |
| Limited SIMD utilization | SIMD-optimized ASCII generation for decimal-to-string stage (Section 3.10, Table 5) |

| Float Number | Fixed-Point | Scientific |
|---|---|---|
| 2.34 | “2.34” | “2.34” |
| 12 | “12.0” | “1.2” |
| 120 | “120.0” | “1.2 × 10^2” |
| 0.012 | “0.012” | “1.2 × 10^−2” |

| SIMD Implementation | Description |
|---|---|
| NEON [20] | Original author: Dougall Johnson. Runs on ARM processors with NEON instruction set. |
| SSE2 [21] | Based on scalar version; requires only SSE2 instruction set. |
| SSE4.1 | Nearly identical to SSE2 implementation; requires SSE4.1 instruction set. |
| AVX512 [22] | Original author: Daniel Lemire. Requires AVX512IFMA and AVX512VBMI instruction sets. |

| Type | Fixed-Point | Scientific |
|---|---|---|
| Float | Other ranges | |
| Double | Other ranges |

| Algorithm | Float | Double | Description: Author and Source Code |
|---|---|---|---|
| Schubfach [8] | Schubfach32 | Schubfach64 | Raffaello Giulietti, https://github.com/abolz/Drachennest/tree/master/src (accessed on 4 December 2025). |
| Schubfach_xjb [23] | Schubfach32_xjb | Schubfach64_xjb | The computation flow of the Schubfach source code was modified by the authors without altering the original output results, https://github.com/xjb714/xjb/tree/main/bench/schubfach_xjb (accessed on 4 December 2025). |
| Ryū [6,7] | Ryū32 | Ryū64 | Ulf Adams, https://github.com/ulfjack/ryu (accessed on 4 December 2025). |
| Dragonbox [10] | Dragonbox32 | Dragonbox64 | Junekey Jeon, https://github.com/jk-jeon/Dragonbox (accessed on 4 December 2025). |
| fmt [24] | fmt32 | fmt64 | Victor Zverovich, https://github.com/fmtlib/fmt version 12.1.0 (accessed on 4 December 2025). |
| yy_double [11] | - | yy_double | Guo YaoYuan, https://github.com/ibireme/c_numconv_benchmark/blob/master/vendor/yy_double/yy_double.c (accessed on 4 December 2025). |
| yy_json [12] | yy_json32 | yy_json64 | Guo YaoYuan, https://github.com/ibireme/yyjson version 0.12.0 (accessed on 4 December 2025). |
| teju_jagua [25] | teju32 | teju64 | Cassio Neri, https://github.com/cassioneri/teju_jagua (accessed on 4 December 2025). |
| xjb | xjb32 | xjb64 | This paper, https://github.com/xjb714/xjb (accessed on 4 December 2025). |
| zmij [16] | zmij32 | zmij64 | Victor Zverovich, https://github.com/vitaut/zmij (accessed on 8 April 2026). |
| jnum [26] | jnum32 | jnum64 | Jing Leng, https://github.com/lengjingzju/json/jnum.c (accessed on 4 December 2025). |
| uscalec [13] | - | uscalec | Russ Cox, https://github.com/rsc/fpfmt commit 6255750 (accessed on 19 January 2026). |
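
| Algorithm | AMD R7-7840H (icpx 2025.0.4), Float (ns) | AMD R7-7840H (icpx 2025.0.4), Double (ns) | Apple M1 (Apple Clang 21.0.0), Float (ns) | Apple M1 (Apple Clang 21.0.0), Double (ns) | Apple M5 (Apple Clang 21.0.0), Float (ns) | Apple M5 (Apple Clang 21.0.0), Double (ns) |
|---|---|---|---|---|---|---|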
| Schubfach | 12.20 | 11.51 | 11.64 | 13.12 | 7.59 | 7.71 |
| Schubfach_xjb | 4.44 | 6.33 | 5.16 | 6.58 | 3.15 | 3.75 |
| Ryū | 14.02 | 13.08 | 15.75 | 14.16 | 10.23 | 9.50 |
| Dragonbox | 10.19 | 10.05 | 11.78 | 12.03 | 7.56 | 7.39 |
| yy_json | 4.67 | 5.72 | 3.97 | 4.46 | 2.40 | 2.72 |
| yy_double | – | 5.24 | – | 4.08 | – | 2.71 |
| teju_jagua | 14.99 | 14.37 | 20.25 | 18.66 | 13.49 | 12.71 |
| zmij | 4.76 | 4.78 | 4.11 | 3.83 | 2.82 | 2.14 |
| uscalec | – | 11.27 | – | 15.26 | – | 9.61 |
| xjb | 2.24 | 3.76 | 2.15 | 2.58 | 1.44 | 1.55 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Xiang, J.; Wang, T. xjb: Fast Float to String Algorithm. Computers 2026, 15, 280. https://doi.org/10.3390/computers15050280

