Article

Enabling Deep Recursion in C++

by Saša N. Malkov 1,*, Ivan Lj. Čukić 2 and Petar Ž. Đorđević 1
1 Department of Computer Science, Faculty of Mathematics, University of Belgrade, Studentski Trg 16, 11158 Belgrade, Serbia
2 Independent Researcher, 11000 Belgrade, Serbia
* Author to whom correspondence should be addressed.
Computers 2026, 15(1), 15; https://doi.org/10.3390/computers15010015
Submission received: 10 December 2025 / Accepted: 26 December 2025 / Published: 1 January 2026

Abstract

Recursion is often presented as a nice and illustrative technique, only to later conclude that it should (almost) never be used due to potential problems with call stack overflow. However, recursion can often be the technique of choice during algorithm development and testing, and even in final solutions. Therefore, a simple but effective technique is needed to overcome call stack limitations. We designed and implemented the Extendable Stack Library (ESL), which provides a simple and effective interface that enables deep recursion in C++. Its flexible usage model allows deep recursion to be used where needed, without requiring major project modifications or customization of development tools. The performance overhead is moderate and localized only to deep recursive functions using ESL. The library is designed to be flexible and cross-platform. It supports Linux on AMD64 and AArch64 processors and Windows on AMD64. It can be adapted to more platforms with relative ease. ESL has been tested through a series of unit tests, experiments, and practical applications. It has proven to be an effective solution for deep recursion. ESL has been successfully used in the implementation of the Wafl programming language interpreter.

1. Introduction

Recursion is a fundamental and widely applicable programming technique. It decomposes a problem into smaller subproblems, which are conceptually equivalent to the original problem, but have a lower complexity [1]. In C++, as in most modern programming languages, recursive functions are a natural way to implement algorithms such as binary search, quicksort, tree or graph traversal, or more general divide-and-conquer strategies and combinatorial computations [2,3,4,5].
A significant limitation in the use of recursion in practice is the lack of support for deep recursion. Although C++ itself does not impose any predefined limit on the depth of recursion, practical limitations come from the operating system, compilers, available computer system memory, and even processor architecture. Deep recursion is essentially limited by the available size of the call stack. If the imposed limit is exceeded, stack overflow errors and program crashes can occur [6]. Moreover, the practical limits are often not precisely known in advance, since the stack size and the speed of its exhaustion may depend on varying factors.
In an ideal situation, a safe deep recursion would be available to programmers in a similar way to the exception handling mechanism in C++ or the seatbelt in a car—not as a primary tool, but as a support for exceptional situations. When writing code that can lead to deep recursion, such a tool would help eliminate or at least reduce the risk of failure. This paper is a contribution to the research and design of techniques and strategies for enabling safe and efficient deep recursion in programs written in C++, as well as in the languages that translate to C++ or use virtual machines (or interpreters) written in C++.
The practical goal of our research was to answer the following research questions:
RQ1—Can the existing solution from the Wafl interpreter be generalized into a cross-platform library that allows deep recursion? The authors had at their disposal a solution that had been used for years in the implementation of the interpreter for the functional programming language Wafl [7]. It was created within the Wafl project [8], and its applicability in other projects was relatively narrow. The existing solution has now been generalized by separating it from its original context, adding a generalized interface, and defining the operation protocols more precisely.
RQ2—To what extent can this library be independent of specific compilers and linkers? The call stack is tightly coupled to the underlying computer system. The way a call stack operates depends heavily on the processor architecture and the operating system [9,10]. Some specifics also often depend on the compiler used and the options used during compilation. Consequently, any attempt to overcome the stack size limit must comply with all of these constraints and rules; therefore, the library must also be coupled to the system to an extent. On the other hand, the ideal solution should be cross-platform. While some components must be adapted to specific environments, those adaptations should be minimal and reside in the library code.
RQ3—What are the limitations or constraints that such a library imposes on developers? Such a library must operate at a very low level and adapt to the way compilers work in particular environments. As a result, certain operational limitations are almost inevitable. Recognizing them is important for understanding how they affect the library’s overall usability.
RQ4—How does using a library impact the performance of recursive functions? The library must do extra work to enable deep recursion, which inevitably impacts the performance of the program in which it is used. This impact must be understood and measured in order to assess the usability of the library.
In this paper, we present answers to these research questions. Section 2 presents the problem of using deep recursion and some existing solutions. Section 3 presents the library designed and developed to support deep recursion; we focus on its functionality and usage methods, and briefly present some parts of the implementation. Section 4 describes how the library was tested and benchmarked. Section 5 reports the results of the validation and performance benchmarking. In Section 6, we analyze the results and discuss and evaluate the usability of the library.

2. Background

2.1. Recursion

Recursion is a fundamental concept used to describe, explain, define, and solve structures, processes, and problems across many disciplines, including mathematics, computer science, biology, and others. It is based on the understanding that many complex problems and structures are composed of simpler elements that are conceptually equivalent to them.
In mathematics, it forms the basis of the principle of mathematical induction, which states (in a simplified form) the following: if a statement is proved true for n = 0 (base case), and it is proved that whenever the statement is true for n it is also true for n + 1 (recursive case), then the statement is true for all natural numbers [11].
In computer programming, recursion is used to decompose a complex problem into one or more self-similar problems of lower complexity [1]. Each recursive implementation consists of one or more base cases and one or more recursive cases. The base cases define the scenarios in which the solution can be obtained without using recursion, often in a direct and trivial way. The recursive cases describe how the problem is defined in terms of the equivalent problems of lower complexity.
For example, the function fac(n), which computes the factorial of a non-negative integer n, is usually defined in mathematics as follows:
  • fac(n) = 1 · 2 · ⋯ · n, for n > 0,
  • fac(0) = 1.
In computer programming, the function factorial(n) can be implemented recursively with the base case n == 0, for which the result is 1, and the recursive case n > 0, for which the problem is decomposed into the smaller problem factorial(n − 1), whose result is multiplied by n, as shown in Listing 1. Using the principle of mathematical induction, it is easy to prove that this implementation of the factorial function is equivalent to the mathematical definition of fac.
Listing 1. Recursive implementation of the factorial function in C/C++.
int factorial( int n ) {
      if( n == 0 ) return 1;
      else return factorial( n - 1 ) * n;
}
Many imperative programming courses (and books) introduce recursion as a nice and illustrative technique, only to later teach that it should (almost) never be used due to numerous drawbacks [12]. Still, recursion is the most natural way to describe and solve a wide set of problems. It is much easier to prove the correctness of recursive algorithms and functions than of equivalent iterative implementations [2,3]. Moreover, many iterative algorithms are developed from equivalent recursive solutions [13].
In many cases, recursive solutions are less efficient than their iterative counterparts, due to the additional overhead of recursive function calls and the increased memory usage. As a result, it is often recommended to convert recursion into iteration [14]. However, this is not a universal rule. When a single step (either recursive or iterative) is complex, the performance differences between recursive and iterative approaches may be less significant. Furthermore, recursive solutions are often translated into iterative ones that use explicitly defined stack-based data structures instead of the implicit call stack [4]. Such implementations can be more error-prone and may be equally or even less efficient than the original recursive solutions.
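The explicit-stack translation mentioned above can be illustrated with a short sketch (the tree type and function names are ours, not from the paper): a recursive subtree sum and an iterative counterpart that replaces the implicit call stack with a heap-allocated std::stack, so the traversal depth is bounded by heap size rather than stack size.

```cpp
#include <cassert>
#include <stack>

struct Node {
    int value;
    Node* left = nullptr;
    Node* right = nullptr;
};

// Recursive version: the call stack implicitly holds the pending subtrees.
int sumRecursive(const Node* t) {
    if (!t) return 0;
    return t->value + sumRecursive(t->left) + sumRecursive(t->right);
}

// Iterative version: an explicit std::stack replaces the call stack.
int sumIterative(const Node* t) {
    int total = 0;
    std::stack<const Node*> pending;
    if (t) pending.push(t);
    while (!pending.empty()) {
        const Node* n = pending.top();
        pending.pop();
        total += n->value;
        if (n->left)  pending.push(n->left);
        if (n->right) pending.push(n->right);
    }
    return total;
}
```

Note that the iterative variant is longer and arguably harder to verify than the recursive one, which is exactly the trade-off discussed above.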
The complexity of recursive functions can be measured in the same way as the complexity of general algorithms, as a total count of operations required to compute the result. However, for recursive functions, it is often important to know another measure of complexity—the maximum depth of recursion, i.e., the maximum length of a sequence of recursive function calls [13].

2.2. Call Stack

During the late 1950s and early 1960s, Samelson, Bauer, and Dijkstra were among the most prominent computer scientists who developed the call stack concept and advocated its usage in the definition and implementation of ALGOL 60 [15,16]. The ability to support recursion was one of the major arguments for implementing stack-based procedure calls in the then state-of-the-art programming language ALGOL 60 [17]. In the following years, the call stack became one of the main computing concepts. It is often called the run-time stack, thread stack, or just the stack.
The call stack is a Last In First Out (LIFO) structured memory. It is used to store and handle data during program execution. Its primary use is as a storage area for local functions’ data and as a buffer for passing arguments and results between function calls. Each execution thread needs its own stack. Heavy multithreading across many active programs brings with it a large number of call stacks in use. Thus, the stacks should not be too large, as having many large stacks would allocate too much memory and limit the number of active processes and threads. On the other hand, if the call stack is insufficiently large, then a program may fail or even be prevented from using some programming techniques, like recursion.
Some of the techniques that put the most stress on the call stack are large local data usage and deep recursion. Local function data is usually allocated on the call stack, as automatic data [18]. So, if a function requires more local data than available on the stack, the program will fail. Similarly, deep recursion adds more data to the call stack with each new recursive call. If the recursion is too deep, it will cumulatively request more stack space than available and, again, the program will fail.
Deep recursion requires a proportionally large call stack to run properly. If a recursive function can be invoked with some arguments that can cause a recursion that is too deep, then such a function is not stack-safe. The maximal supported recursion depth is related to call stack usage. Recursive functions that store more local data on the call stack will have a lower maximum recursion depth than functions that store less local data on the stack.

2.2.1. Call Stack Implementation

Call stack implementation is usually based on preallocated contiguous memory. Modern CPUs (since the 1970s) incorporate some special support for efficient call stack implementation, including a dedicated stack-pointer register, stack-handling operations, and support for efficient indirect addressing of data stored on the call stack.
In general, a contiguous memory block is allocated and dedicated for use as a thread's call stack. One end of the block is called the stack bottom, and the other the stack limit. The stack grows from the bottom towards the limit, and the last used location is pointed to by a dedicated CPU register called the Stack Pointer (SP). In most contemporary CPUs, the call stack grows downwards, so the stack bottom is at the end (the highest address) and the stack limit at the beginning (the lowest address) of the stack memory block [19]. The call stack storage is organized as a sequence of words, whose size in bytes depends on the CPU architecture.
Basic stack operations are push and pop. The push operation moves the SP towards the stack limit (if growing downwards, then it decreases the SP register for a word size) and stores the pushed value at the location pointed to by the SP. The pop operation does the opposite—it reads the value stored at the location pointed to by the SP and then increases the SP by a word size. In addition to push and pop, it is usually possible to access the stack content using stack-relative addressing, i.e., to access a word on an address computed as SP+N.
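As an illustration only (not real stack machinery; all names here are hypothetical), the downward-growing push/pop/SP behavior described above can be modeled with a toy structure:

```cpp
#include <cassert>
#include <cstdint>

// Toy model of a downward-growing call stack. The "SP" starts at the
// stack bottom (the highest index) and push moves it towards the limit
// (index 0), mirroring the description in the text.
struct ToyStack {
    static constexpr int kWords = 16;
    std::uint64_t mem[kWords] = {};
    int sp = kWords;                     // one past the last used slot

    void push(std::uint64_t v) { mem[--sp] = v; }    // decrease SP, store value
    std::uint64_t pop()        { return mem[sp++]; } // read value, increase SP
    // Stack-relative addressing: access the word at address SP + n.
    std::uint64_t at(int n) const { return mem[sp + n]; }
};
```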
On a higher level, the most important concepts are the stack frame and the stack frame protocol. All data needed to perform a single function call constitutes a stack frame that corresponds to that call. A stack frame includes at least the address where the execution will continue after the called function returns, the arguments for the function, its result and the local function’s data [19]. On a physical level, the stack is a sequence of bytes or words, but on a functional level, it is a sequence of stack frames. Each word in the used part of the stack belongs to a stack frame. At any moment during the program execution, each thread’s call stack contains all automatically allocated data that is used in that thread.
The stack frame protocol precisely and thoroughly defines the contents of the stack frames, the order of operations, and the responsibilities of the caller and the callee. It is important that callers and callees use the same stack frame protocol in order to exchange data properly. This is why stack frames have a strict form for each specific computer architecture and operating system, as defined in the corresponding Application Binary Interfaces (ABI) [20,21,22].
Some compilers may use different stack frame protocols internally when both the caller and callee are compiled by the same compiler. This is often used in debug-mode compilation. It is a common practice in debug mode to use one of the CPU registers as the stack frame pointer and to add a pointer to the previous stack frame to the next stack frame. This allows the contents of the stack to be analyzed in a relatively simple and efficient manner, which is useful during debugging.

2.2.2. Stack Allocation

Each thread has its own stack that is allocated when the thread is created. The initial stack allocation for a created process or thread is handled according to operating system policies. On Linux, the size of the main thread’s stack is set by the system resource limit RLIMIT_STACK, which can be modified using the ulimit command or the setrlimit function before the process starts. For other threads, a system default size is used, usually 2 MB or 8 MB [23]. On Windows, the initial stack size is specified during program building as a linkage parameter and is bound in the program binary code. This same stack size is used by default for all threads [24]. Some of the thread-related APIs allow us to explicitly specify the required stack size when threads are created, but this is not possible in other libraries, including the C++ standard library’s std::thread.
Both Linux and Windows use virtual memory [25] for stacks, although not in the same way. The stack address space is allocated when a thread is created, but it is not committed until used. The actual physical memory usage follows the used stack size. Linux uses regular VM page faults to commit one additional stack page at a time. On Windows, a committed part of the stack is followed by a guard page. If the guard page is accessed, the memory manager handles the exception and commits more pages. While this mechanism is more complex than on Linux, it allows for the commitment of many pages in a single step, thus reducing the number of exceptions [24].
The delayed commit of stack pages makes it easier to create many processes and threads with relatively large allocated stacks. However, after a thread commits a large portion of its stack, it may never need that much stack space again. On both Windows and Linux, committed stack memory never shrinks. The only way to free the unused stack space is to end the thread and create a new one; in the case of the main thread of a process, the entire process must be terminated and restarted.
Even with the use of virtual memory, it is not easy to determine the stack size in advance. Whatever stack size is chosen, it may not be large enough for certain purposes or specific cases. Yet, we should not be too generous, because even the virtual address space is limited [26].

2.3. Deep Recursion Limitations

Each time a function is called during program execution, a new stack frame is created on the call stack. If a function is used recursively, then a sequence of recursive calls adds a sequence of corresponding stack frames to the stack. For a large enough recursion depth, this can lead to stack exhaustion and program failure.
Unpredictably deep recursion represents a significant risk in programs, which is one of the main reasons for recommending avoiding recursive solutions in critical systems [27]. This is why recursive solutions are often translated into iterative ones, even in cases when iterative code is more complex and brings no efficiency benefits.
The risk of deep recursion is quite often present in recursive processing of data structures with unpredictable complexity, like graphs and trees. For example, if a binary tree with N nodes is well balanced, then it often can be processed recursively with a relatively shallow recursion—the expected recursion depth for most algorithms on balanced trees is comparable to log(N), but if the tree is unbalanced and most nodes have a single branch, then such a tree is more like a collection of linked lists and the recursion depth can be close to N. Recursive processing of large unbalanced trees can exhaust the stack.
To make matters worse, there are virtually no standardized methods for detecting and resolving such failures in an acceptable manner. On the contrary, in most cases, stack exhaustion causes a violent program termination.

2.3.1. Tail Call Optimization

Tail calls are function calls that represent the last step of function execution. Tail Call Optimization (TCO) replaces the last function call by updating the contents of the current stack frame and directly jumping to the called function. This saves memory for an extra stack frame and time for call/return and stack handling.
TCO is a general type of optimization, but when the tail call is recursive (so-called tail recursion), it can be considered a recursion optimization [28]. It can effectively replace a recursive call with an iteration, thereby turning a recursive function (or at least part of it) into an iterative one. Not only does it make recursive functions more efficient, but it also eliminates the need for a large stack or dynamic stack manipulation.
Enabling deep recursion is not the primary goal of TCO, but it is an important consequence. The limitation is that TCO can only optimize a recursive call if it is a tail call. Therefore, many recursive calls remain beyond the reach of this method.

2.3.2. Dynamic Stack

There are many ways to overcome the limits of fixed-size stacks. The more radical ones attempt to abandon the call stack and use solutions such as heap-allocated continuations [29]. Another approach is to add a dynamic behavior to the otherwise static call stack.
The usual stack overflow handling assumes that some protected guard pages are placed after the stack limit, to cause faults on overflow. Another method is to use explicit limit checks in the functions, or their compiler-generated prologues. If the limit is met, the stack can be expanded or extended.
Stack expansion involves allocating a new, larger contiguous stack space, moving all stack frames to the new location, deallocating the old stack, and updating the SP. Stack extension, on the other hand, allocates only a new stack block (segment), copies only a few stack frames into it, links it to the previous block, and updates the SP. An expandable stack is often called a resizable stack, and an extendable stack a segmented stack. A hybrid strategy is also possible, which extends the call stack in some cases and expands it in others.

2.3.3. Split Stack

The GNU C++ compiler (GCC) and Clang C/C++ compiler (Clang) support a split-stack feature that implements a segmented call stack, which can be discontinuous and grow automatically as needed. The split stack is enabled via the -fsplit-stack compiler option. If it is enabled, then each function entry and exit code (prologue and epilogue) is modified to manage the segmented stack [30]:
  • The function prologue checks whether the stack contains enough free space for the function. If there is enough space, the function works as a regular one, in the same way as without a split stack. But, if there is not enough space, then a new stack segment is allocated, linked to the previous one, and the SP is updated to point into the new segment.
  • The function epilogue checks whether the prologue added a new stack segment, and if so, deallocates this segment and restores the SP to the previous segment.
This mechanism is quite simple and is completely handled by the compiler and linker. There is no need to modify the source code to use it. However, there are some significant drawbacks [31,32]. Most of the problems come from the mixing of code compiled with and without split stack support [30]. It is thus usually suggested to compile the entire project along with its dependencies, with the split-stack feature enabled, which is not always possible.
This solution is platform-specific. It is only supported by GCC and Clang for IA-32 and AMD64 architectures. To adapt it to another platform, support from the compiler vendors is required, which is a major drawback, even if it is open-source. Moreover, split stack currently requires the gold linker, which was recently deprecated [33], so its applicability is now questionable.

2.4. Fibers

A fiber is a unit of sequential execution that runs within a single thread. Fibers are not implicitly scheduled, but instead use cooperative multitasking, which means they choose when to yield the execution to other fibers and how to transfer their state. Fibers allow for finer control over how and when the execution context is switched compared to threads.
An important part of the fiber concept is that each fiber has its own stack. This is very important in the context of deep recursion, because it turns out that instead of modifying an existing stack and then continuing deep recursion using it, it is possible to delegate the remaining recursive execution to a new fiber. As we will show later, this can be used as a technique for implementing deep recursion.
A representative implementation of fibers is the Boost.Context library [34]. Fibers are currently being considered for inclusion in future versions of the C++ standard library [35].
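The idea of delegating execution to a separate stack can be illustrated with the POSIX ucontext API rather than Boost.Context (a minimal Linux/glibc sketch, not ESL's implementation; makecontext and swapcontext are marked obsolescent in POSIX but remain available in glibc, and all names below are ours):

```cpp
#include <cassert>
#include <ucontext.h>
#include <vector>

static ucontext_t mainCtx, fiberCtx;
static long result;

// Runs on the fiber's private heap-allocated stack, not the thread
// stack; deep recursion started here consumes the private stack.
static void deepWork() {
    result = 42;
    // Returning transfers control back to mainCtx via uc_link.
}

long runOnPrivateStack() {
    std::vector<char> stack(1 << 20);    // 1 MB fiber stack on the heap
    getcontext(&fiberCtx);
    fiberCtx.uc_stack.ss_sp = stack.data();
    fiberCtx.uc_stack.ss_size = stack.size();
    fiberCtx.uc_link = &mainCtx;         // where to continue afterwards
    makecontext(&fiberCtx, deepWork, 0);
    swapcontext(&mainCtx, &fiberCtx);    // switch to the fiber and back
    return result;
}
```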

3. Extendable Stack Library

3.1. The Goals

The precursor to the Extendable Stack Library was developed and used as a part of the implementation of an interpreter for the functional programming language Wafl. The interpreter implementation was based on graph reduction and relied on recursion. Since any recursive code in Wafl was interpreted using recursion in the interpreter, it was necessary to provide a means for performing deep recursion.
Extendable Stack Library (ESL) was developed based on this implementation of deep recursion in the Wafl interpreter. The library is designed to be more flexible and widely applicable, with some additional features. However, the main goals essentially remained the same as for the precursor:
  • Deep recursion should be supported on demand. A major issue with split-stack is its impact on building, as all program parts and libraries must use it. Using stack handling only on demand has no such limitations and limits the additional processing overhead to only the affected code, in accordance with the zero-overhead principle [36].
  • The performance penalty should be as low as possible. Dynamic stack manipulation inevitably leads to performance loss, so minimizing the overhead is crucial. Since stack status checks occur far more frequently than manipulations, optimizing their speed greatly affects overall performance.
  • ESL should be conceptually cross-platform. A cross-platform solution must rely on OS, CPU ABI, and compiler specifics, but in a way that keeps most of the code universal and platform-specific parts minimal. Support for at least Linux and Windows, with easy portability, is essential.

3.2. The Library Concepts

The ESL is based on the concept of an extendable dynamic stack. The basic idea is similar to that of split stack. The main difference is that the solution does not rely on a compiler, but on the library and its explicit usage. A stack block is a single contiguous part of the stack. Stack blocks are the units of allocation and activation. At any moment, exactly one stack block is active for a thread. An active stack block has to contain the active stack frame and all data used by the active function. As a part of stack block activation, all the required data has to be copied to this stack block from the previously active one.
The initial stack is regarded as the first stack block. It is different from the other blocks in that its size is determined by the operating system, based on the settings (for Linux) or the program file (for Windows). It is never discarded nor freed by the ESL. All the other stack blocks are managed and used on demand. Discarded stack blocks can be stored in a pool for future use. This pool can be shared between threads.
The general algorithm is to check the stack status on each entry into a deep recursive function. If the stack is found to be near exhaustion, then it has to be extended. If the stack has been extended, then it later has to be condensed before the exit from the function.
This concept has two immediate unpleasant consequences that affect the implementation of ESL. The main potential problem is that stack bounds have to be checked each time a recursive function is entered, which can be slow or inaccurate. Another problem comes from the need to copy some data from the old stack block to the new one when extending the stack. This takes some time, but more importantly, having two copies of the same data can cause other problems, which will be discussed later in the text.

3.3. ESL API

The library interface is defined within the namespace extendableStack. The core part of the library interface (API) is designed as a set of functions in this namespace. Its main part consists of seven functions [37]:
  • bool shouldExtendStack() checks whether the stack is almost exhausted;
  • void extendStack() extends the stack by adding a new stack block;
  • void condenseStack() condenses the stack by removing the current stack block and activating the previous one;
  • runInExtended(fn, args…) is a function template that extends the stack, calls the given function fn with the given arguments, condenses the stack, and returns the result;
  • runInExtendedExc(fn, args…) is similar to the previous, but it also handles exceptions thrown by fn;
  • void initForThread() initializes the stack data for the current thread;
  • bool initGlobal(…) sets the global ESL configuration for the current process.
The most general way to use ESL is by directly calling the API functions. This usage pattern resembles the general algorithm shown in Listing 2. Another usage method is to delegate recursive calls and stack handling to the runInExtended API function template, as shown in Listing 3. This approach hides most of the complexity of the stack operations from the user. Also, because stack handling is delegated to another function, the recursive function does not have to add new variables, which often makes the stack frame smaller and the recursion more efficient.
Listing 2. The general ESL usage pattern. The stack is handled explicitly, as needed.
type functionName( args ) {
  bool allocated_new_stack = false;
  if( extendableStack::shouldExtendStack() ) {
    allocated_new_stack = true;
    extendableStack::extendStack();
  }
 … function body
  if( allocated_new_stack )
    extendableStack::condenseStack();
  return result;
}
Listing 3. Delegation usage pattern. The same call is performed recursively with an extended stack.
type functionName( args ) {
  if( extendableStack::shouldExtendStack() )
    return extendableStack::runInExtended( functionName, args );
  … function body
  return result;
}
To make ESL easier to use, a set of macros has also been defined, as well as some helper classes. We will not explicitly list and explain them here; instead, we will present some typical usage patterns. The definitions and documentation are available in the source code file ExtendableStack_common.h, and more examples are available in the benchmarks [37].

3.3.1. Usage Methods

The ESL API supports several usage methods, including the following:
  • Explicit usage—The basic usage method where ESL API functions are directly used (see sumNM_safe in Listing 4). This method allows the greatest control.
  • Guard macros—A pair of macros is used to simplify the ESL usage (see sumNM_macroGuard in Listing 4).
  • Block macros—A simple way to handle the case when stack extension is required is to recursively repeat exactly the same call (see sumNM_macroBlock in Listing 4). This adds one more recursive call but simplifies the implementation.
  • Duplicate macros—Similar to block macros, this method separates two cases but repeats the same code segment twice instead of using a recursive call (see sumNM_macroDuplicate in Listing 4).
  • Delegation—The stack handling and the recursive call are delegated to the runInExtended template function (see sumNM_delegate in Listing 4). The same function with the same arguments can be called in a similar way to the block macro usage method.
With block macros and delegation, the same function can be safely called with the same arguments, because infinite recursion cannot occur. The next call will take place in a new stack block, and it is certain that the stack will not need to be extended again immediately.
Listing 4. Examples of several methods of using ESL. Recursive functions for calculating the sum of integers between n and m are implemented using various ESL usage methods. Function names are bolded to highlight recursive calls. For more methods, see Listing S1 in Supplementary Materials or [37].
// Without ESL. Deep recursion not supported.
long sumNM_native( long n, long m ) {
  if( n > m ) return 0;
  return sumNM_native( n + 1, m ) + n;
}
 
// Explicit coding.
long sumNM_safe( long n, long m ) {
  if( n > m ) return 0;
  bool stackHasBeenExtended = false;
  if( extendableStack::shouldExtendStack() ) {
    stackHasBeenExtended = true;
    extendableStack::extendStack();
  }
  long result = n + sumNM_safe( n + 1, m );
  if( stackHasBeenExtended )
    extendableStack::condenseStack();
  return result;
}
 
// Using guard macro.
long sumNM_macroGuard( long n, long m )
{
  if( n > m ) return 0;
  EXTSTACK_GUARD_BEGIN;
  long result = n + sumNM_macroGuard( n + 1, m );
  EXTSTACK_GUARD_END;
  return result;
}
 
// Using block macros.
long sumNM_macroBlock( long n, long m ) {
  if( n > m ) return 0;
  else EXTSTACK_IF_EXTENDING_RETURN(
    sumNM_macroBlock( n, m )
  )
  else return n + sumNM_macroBlock( n + 1, m );
}
 
// Using delegation.
long sumNM_delegate( long n, long m )
{
  if( n > m ) return 0;
  else if( extendableStack::shouldExtendStack() )
    return extendableStack::runInExtended( sumNM_delegate, n, m );
  else
    return n + sumNM_delegate( n + 1, m );
}

3.3.2. Initialization

Each thread must call extendableStack::initForThread() before using ESL. When a new thread that is expected to use ESL is created, it should call initForThread at the very start of its thread function.
Global initialization is not mandatory. The initGlobal function should be called only if the default library configuration is not appropriate. It must be called before the first stack extension; otherwise, the call has no effect. It allows the programmer to set the stack block size, optional limits on the number of stack blocks used per thread or per process, and other fine-tuning parameters.

3.4. Some Implementation Details

Most of the library uses standard C++ and is fully portable. Most platform-specific parts are localized in separate files for easier maintenance. Implementations are provided for Linux on AMD64 and AArch64 architectures, with GCC and Clang, and for Windows on AMD64 and the Microsoft Visual C++ compiler (MSVC).

3.4.1. Stack Block Activation

Stack extension requires (1) allocating a new stack block, (2) preparing the block, and (3) activating the new stack block. Similarly, stack condensation requires (1) preparing, (2) activating the previous stack block, and (3) deallocating the unused block.
After a new stack block is allocated, the active stack frame has to be copied to the new stack block. If the stack extension is implemented as a function, then the caller’s stack frame must be copied. The same memory range must be copied back later during the corresponding stack condensation.
Implementations of the extendStack and condenseStack functions are platform-specific. The algorithm is universal, but access to system data and updating the SP register differ across operating systems and compilers. The main segment of code for extendStack on Linux (AMD64) is shown in Listing 5. The code for condenseStack and the variants for MSVC are similar. The complete ESL code is available at [37].
Listing 5. Stack extension. The main part of the implementation of the extendStack operation on Linux for AMD64, for GCC and Clang compilers in release builds.
size_t copyLen = EstimateStackFramesSize( 2 );
char* stackBottom = LastStackBlock_
  ? LastStackBlock_->TopOfUsableBlockSpace_
  : (char*) compiler_Linux::getStackBottom();
char* reg_RSP;
__asm__ volatile ( "movq %%rsp, %0" : "=r" ( reg_RSP )); // reg_RSP = RSP
size_t maxLen = stackBottom - reg_RSP;
if( copyLen > maxLen )
  copyLen = maxLen;
 
StackBlock* newBlock = AllocateAndLinkNewStackBlock( reg_RSP, 0 );
newBlock->OldTopOfUsableBlockSpace_ = reg_RSP + copyLen;
char* new_RSP = newBlock->TopOfUsableBlockSpace_ - copyLen;
memcpy( new_RSP-redZoneSize, reg_RSP-redZoneSize, copyLen+redZoneSize );
__asm__ volatile ( "movq %0, %%rsp" : : "r" ( new_RSP )); // RSP = new_RSP

3.4.2. Determining the Stack Frame Size

There is no universal way to get the size of a given stack frame. Some compilers, and only in some build modes, store a pointer to the previous stack frame in the call stack, as this can make debugging easier. In release build mode, this is omitted, to make the function entry code simpler and more efficient. Some compilers provide alternative ways to find the stack frame position, but this is not portable and often not sufficient to get the pointer to the previous (caller’s) stack frame (e.g., GCC has the __builtin_frame_address function which is supposed to return a pointer to the stack frame, but it is often unreliable).
By default, the ESL estimates a stack frame size to be 1 KB and copies twice that amount. This estimate can be reconfigured (using the initGlobal function) to be larger if there is a deep recursive function that requires a larger stack frame.

3.4.3. Estimating Stack Limit

The shouldExtend function estimates whether there is enough free space to make one more recursive call. To make it as efficient as possible, all calculations are completed in advance, so that the check only compares the value of the current SP with the prepared value.
When a stack block is created, a stack limit for that block is computed as follows:
NewStackLimit_ = AllocatedStackBlock_ + reservedSize,
where the reservedSize is fetched from the global ESL configuration. When a stack block is activated, its NewStackLimit_ is stored in a thread local variable, and the shouldExtend method later compares the SP to this limit and returns true if SP is below it.
The default reserved size depends on the build configuration and can be from 16 to 64 KB. It can be modified using the initGlobal function.

3.4.4. Exception Handling

If an exception is thrown while an extended stack block is active, the exception handler will expect a continuous call stack, which can lead to a fatal error. Proper exception handling in an extendable stack requires (1) catching the exception within the stack block in which it occurred (i.e., before stack condensation), (2) capturing the exception, (3) condensing the stack, and (4) rethrowing the captured exception. To accomplish this, the code between the extension and condensation operations must be enclosed in a try block.
The ESL provides the runInExtendedExc template function that is similar to runInExtended but handles exceptions properly. There are also the macros that handle exceptions properly in the same usage manner as block macros and duplicate macros [37].

3.4.5. Platform-Specific Elements

Adding support for a new platform requires adapting the library by implementing several methods of the ExtendableStack class for the new platform. These methods are platform-specific in how stack data are retrieved or updated, and how the stack pointer (and frame pointer, if available) is handled. To achieve a high level of efficiency, some elements should be implemented in assembly. All in all, it is a relatively sensitive job but not too complex.

3.5. Using Fibers

Instead of manually manipulating the stack extension and condensation, it is possible to use fibers. Since each fiber has its own stack, it can be used as a closed context for executing a part of a deep recursion. If stack exhaustion is detected, instead of extending the existing stack, a new fiber can be created with its own stack, and the execution can be delegated to it. If the stack dedicated to the new fiber is not large enough and the fiber’s execution again approaches stack exhaustion, a new fiber can be created, and so on.
Since std::fiber_context is not yet a part of the standard C++ library, we decided to use the Boost.Context library [34]. When a fiber is created in Boost.Context, it is assigned a function to execute (usually a lambda) and a stack allocator object that will provide a newly allocated stack for the fiber. All the elements specific to the fiber implementation are wrapped in the runInFiberWithNewStack function, as presented in Listing 6. The fn is a void function without arguments that is assigned to the fiber, and the Fiber::StackAllocator is an adapter for the ESL stack block allocator that implements the stack allocation interface.
The runInFiber function template is used to adapt any non-void function call with arguments into a function call without arguments, as expected by runInFiberWithNewStack. It has the same interface as the runInExtended function, which extends the stack and runs the given function (Listing 6).
Listing 6. Fiber-based implementation. Implementation of the runInFiberWithNewStack function and the runInFiber function template using the Boost.Context library. The runInFiber function assumes that the given function fn returns a result. The version for void functions fn is a little simpler.
void runInFiberWithNewStack( const std::function<void()>& fn )
{
  Fibers::StackAllocator stack_allocator;
  boost::context::fiber f( std::allocator_arg, stack_allocator,
    [&stack_allocator,&fn]( boost::context::fiber&& caller ) {
      Fibers::ContextData fcd;
      Fibers::InitStackLimits( fcd, stack_allocator );
      fn();
      Fibers::DeinitStackLimits( fcd );
      return std::move( caller );
    });
  f = std::move( f ).resume();
}
 
template<typename FnType, typename... Args>
auto runInFiber( FnType&& fn, Args... args )
{
  decltype( std::forward<FnType>( fn )( args... ) ) result;
  runInFiberWithNewStack( [&]() {
      result = std::forward<FnType>( fn )( args... );
  });
  return result;
}
Note that runInFiber does not use perfect forwarding for Args. If perfect forwarding is used (i.e., Args&&… args arguments), performance drops significantly with both GCC and MSVC, so we decided not to use it, even though it introduces some constraints. Namely, native pointers should be used instead of references as arguments of recursive functions. This implementation allows the use of explicit references to non-local objects, such as std::ref(x).

4. Usability Assessment

4.1. Validation

The ESL is validated in several ways. There are unit tests, but due to the complex nature of stack handling, it is not sufficient to test the library using only unit tests. Unit tests can often detect when deep recursion is malfunctioning, but it is not always possible to catch the problem where it actually occurs. If a deep recursion (or the extension/condensation operations themselves) corrupts the stack, it is possible for the tests to produce correct results, but for the incorrect stack state to cause errors in some subsequent operations. The ESL modifies the stack in a way that the operating system, compiler, and testing libraries do not expect; therefore, unit testing libraries cannot safely rely on regular condition checking and exception handling to report all errors correctly.
Therefore, in addition to unit tests, a separate test program has been developed [37]. It tests some usage scenarios and reports extensively on the library and the stack state. It is also used to run the benchmarks and measure performance.
The validation process helped to identify some potential problems and constraints. The identified constraints are presented in the following section.

4.2. Performance Analysis

The efficiency of the ESL was measured at three levels using: (1) low-level benchmarks to measure the performance of basic ESL operations, (2) recursive C++ benchmarks to measure the performance of deep recursion in C++ functions, and (3) Wafl benchmarks to measure the impact of ESL on the performance of the Wafl interpreter.

4.2.1. Low-Level Benchmarks

The costs of the main ESL operations were estimated by measuring the execution time of simulated workflows. The goal was to measure the performance cost of the stack status checks and pairs of stack extension and condensation operations.
When measuring stack status checks, the checks and their corresponding branching were measured together using 10^9 iterations. The stack state at check was always far from exhaustion, and the result of the shouldExtendStack function was always negative.
The cost of the extend/condense cycle was measured by repeating a sequence of extensions followed by a sequence of condensations. In each test, a total of 1,000,000 extend/condense cycles were performed, in sequences of different lengths (i.e., depths) from 1 to 100,000 extensions. During the extend/condense operations, only the edges of the blocks (the parts affected by copying the current stack frames) were touched. The stack blocks were taken from a pool of preallocated blocks.

4.2.2. Recursive C++ Benchmarks

Recursion benchmarks are chosen to cover cases with varying depths of recursion and computational complexity. The following benchmarks were used: (1) Tower of Hanoi (HAN-M)—computes a sequence of moves for the Towers of Hanoi problem; (2) Tower of Hanoi Count (HAN-C)—similar to Towers of Hanoi, but only counts moves, rather than recording a sequence of moves; (3) Fibonacci (FIB)—computes the elements of the Fibonacci sequence; (4) Catalan (CAT)—computes a Catalan number; (5) Count (COUNT)—counts the numbers between given integers N and M, this is a simple problem with linear size and recursion depth; (6) Sum (SUM)—sums the integers between N and M, it is just a little more complex than Count; (7) Ackermann (ACK)—computes an Ackermann number; (8) Binary search (BSRCH)—performs a number of binary searches over a sorted integer array; (9) Permutations (PERM)—generates all permutations of an integer array using a recursive backtracking algorithm; (10) Sort (SORT)—sorts an integer array using the quicksort algorithm.
In real-world cases, recursive calls should not address overlapping subproblems, as this repeats the processing of the same subproblems many times [4,5]. However, to prevent benchmarks from being reduced to simpler problems, we intentionally used overlapping recursive calls in some benchmarks (HAN-C, FIB, CAT, ACK).
The main characteristics of the benchmarks are shown in Table 1. Three benchmarks (COUNT, SUM, ACK) use deep recursion and perform both a large number of stack state checks and a large number of stack extension and condensation operations. COUNT and SUM make 100,000,000 recursive calls with the same maximum depth. They run 9 tests each with recursion depth from 1 to 100,000,000, and the computation is repeated “100,000,000/recursion depth” times. The ACK benchmark is more complex and makes more recursive calls overall (2,862,983,902), as well as more stack extensions and condensations (54,832), but with a smaller maximum recursion depth (65,535) and stack extension depth (3 blocks, in the case of a 1 MB block size).
Four of the benchmarks (HAN-C, HAN-M, FIB, CAT) make many recursive calls, but without deep recursion. They are used only to measure the impact of stack status checking on the performance.
Three benchmarks (BSRCH, PERM, SORT) are chosen to more closely resemble real-world cases. They make relatively fewer recursive calls, without deep recursion, but are computationally more complex than the others.
For each of the benchmarks, variants with different methods of using ESL were implemented and measured: (1) Native—no ESL usage and no deep recursion support; (2) Safe—the basic explicit ESL API usage; (3) M-Grd—ESL guard macros; (4) M-Blk—ESL block macros; (5) M-Dup—ESL duplicate macros; (6) Del—ESL delegation API; (7) Fibers—ESL fibers-based delegation without exception handling; (8) Del-Exc—ESL delegation API with exception handling; (9) Fib-Exc—ESL fibers-based delegation with exception handling.
According to the code structure, the ESL usage methods can be divided into those with two branch points and explicit stack handling (Safe and M-Grd have one branching point before and another after the function body, as well as explicitly performing stack extension and condensation), those with one branch point and explicit stack handling (M-Blk, M-Dup), and those with one branch point and delegating stack handling to a template function (Del, Fibers, Del-Exc, Fib-Exc).

4.2.3. Wafl Benchmarks

As an example of a real-world application of ESL, we used the Wafl interpreter. It implements recursion as recursive graph evaluation, using ESL to make deep recursion work. The usage pattern is similar to the ESL delegation method. ESL is used by default, but there is an option to disable ESL and use only the regular call stack, which can be useful for debugging or performance analysis.
The same set of recursive benchmarks was written in Wafl and executed with ESL enabled and with ESL disabled, and with a sufficiently large stack.

5. Results

5.1. Identified Limitations and Constraints

By analyzing the implementation and testing specific cases, some potential problems with ESL have been identified and understood. Most of the identified problems are due to the use of a non-contiguous stack and occur in close proximity to stack block boundaries. The remaining problems are due to the implementation method and represent trade-offs for achieving efficiency.

5.1.1. Lost Updates

When a stack is extended, a few stack frames are copied into a new block, creating two independent copies. This creates a risk of lost updates if data is updated in the inactive old stack, as it will be overwritten during condensation. Such incorrect updates can happen if (1) a local variable is passed by address as an updateable argument, or (2) a variable’s address is cached in a register before the stack is extended or condensed, and then this variable is accessed at that old cached address.
The first condition is under the programmer’s control, so in recursive functions using ESL, local variables should only be used by value. But, caching the address of a variable in a register is beyond the programmer’s control. Therefore, one should avoid using the same variable of a non-primitive type before and after stack extension or before and after stack condensation.

5.1.2. Incomplete Copies

When a stack frame is copied to another stack block, its contents are copied bitwise, without calling the copy or move constructor for local variables or for class-type function arguments. This can lead to incomplete copies, causing lost updates or more serious problems, such as dangling pointers or references.
This problem can be avoided by not using the same class type variables both before and after stack extension or condensation.

5.1.3. Undesirable Optimization

In some cases, compiler optimization can cause some variants of lost updates or incomplete copies. Such optimization is a problem rather than a benefit, so we call it undesirable. The case where a register is used to optimize access to a local variable was already discussed under Lost Updates. Another type of undesirable optimization occurs when class-type objects are passed by value in C++ code, but the compiler optimizes this into passing by address. Furthermore, the Linux ABI requires that class-type objects are always passed by address [21].
The only way to completely overcome this problem is to avoid using the same class type objects both before and after stack extension or condensation, and never pass class type objects as arguments or function results by value. Also, as already noted for lost updates, local class type objects must not be passed by reference.
Most methods that use ESL first prepare the result, then condense the stack, and return the prepared result. The variable in which the result is stored is used both before and after stack condensation. If the result is a primitive value, then everything should be fine. But, if the result is a class type object, then it is risky, because the outcome depends on the applied optimizations. Even if such class type variables are used by value in C++ code, the compiler will often have to use them by address. This makes local variables of class types vulnerable to deep recursion.
The unit tests show that almost all of the presented ESL usage methods work well with class-type results for Windows with MSVC, except for the guard object method, but for Linux with g++, only explicit coding and guard macros work well with them. Even with these methods, there is a risk that some new version of the compiler will add some optimizations or stricter adherence to the said Linux x64 ABI rule, which might lead to failure.
An alternative is to use native pointers. Native pointers are primitive types, and returning results that way works with the tested compilers with all the presented ESL usage methods. It is safer to return pointers to dynamically created objects than to return automatically created objects.

5.1.4. Thrashing

If a leaf call (a call to a function that makes no further calls) requires stack extension, then condensation immediately follows. If, during the execution of a program, many leaf calls require extensions, then excessive stack handling can impact performance. This can occur when there is a large number of recursive calls of the same depth, which is just enough to require a new block activation. This problem is usually called thrashing (or hot-splitting). Thrashing is not easy to resolve, and it can cause people to avoid extendable stacks [38].
If an extendable stack is used for general purposes (i.e., in all functions) and with relatively small stack blocks, then thrashing can be a significant problem. On the other hand, with ESL and a reasonable stack block size, thrashing can occur only in deep recursive functions and only in the recursion leaf calls. Thus, with ESL and relatively larger stack blocks, thrashing is much less of a burden, as it is not easy to reproduce. This is in accordance with some other results [32]. For example, a recursive solution to the Tower of Hanoi problem with N disks has a maximum recursion depth of N and makes about 2^N recursive calls, half of which are of maximum depth N. If the size of stack blocks causes the stack to extend exactly on the N-th recursive call, then half of the calls will extend (and condense) the stack, and the thrashing will probably take more time than the computation. However, the program’s real limitation is not stack size or thrashing, but exponential problem complexity. A single stack block can easily handle a recursion depth of 1000, yet no processor can execute 2^1000 calls in a reasonable time.
When computing Ackermann(3, 13), which similarly has a large number of leaf calls (Table 1), there are 2863 million recursive calls, with 1432 million stack status checks and up to 54,786 stack extensions. That is more than 52,000 recursive calls and 26,000 stack checks on average during the lifetime of a stack block. As our benchmarks show later in this paper, at this rate, the cost of the extend/condense cycles is far less than the cost of the stack status checks.

5.1.5. Inaccurate Estimates

Estimates of remaining free space on the stack and the size of stack frames copied between stack blocks are relatively imprecise. In both cases, it is crucial not to underestimate the required space, and the practical decision was to use some conservative estimates. This may waste some space, so it would be better to have more precise estimates.
The significance of this problem decreases as the block size increases: with larger stack blocks, extend/condense operations are performed less frequently, and the relative amount of wasted space is lower.

5.1.6. Non-Leaf Recursive Functions

If a recursive function calls some non-recursive functions that use a significant amount of stack space, they will not check the free stack space and an overflow might occur. It is possible to modify these functions to use ESL, but this would contradict the goal of using ESL only in recursive functions.
The ESL spares a part of stack space for such functions. The default is 16 KB, but this can be reconfigured using the initGlobal function. Since programmers have complete control over what and how their deep recursive functions are used and how much stack space they actually need, this adds some extra work but is not a major limitation.
As an alternative, the ESL also provides a static method ThreadStackLimit::HasEnoughFreeSpace(nBytes) that checks if there is at least nBytes of free space on the stack. It can be used in recursive functions instead of shouldExtendStack().

5.1.7. Class-Type Arguments and Results

The implementation of delegation-based methods introduces the restriction that reference-type arguments may not be used in recursive functions, as already pointed out in the description of the implementation of fiber-based delegation. Given the other class-type related restrictions (incomplete copies and undesirable optimization), this is a relatively similar restriction. For fiber-based delegation, this is the only limitation.

5.1.8. Stack Size Limit

The ESL imposes no limit on the number of stack extensions. The stack can be extended as long as there is enough memory available to allocate a new stack block. The good thing is that no artificial limits are imposed, i.e., a recursive function can go as deep into the recursion as the memory size of the computer system allows. On the other hand, if the recursion goes deeper than that, then it will eventually fail, regardless of whether ESL is used or not. But, if ESL is used, then it will take longer to fail, and more processor time and other system resources will be wasted. Therefore, ESL allows the programmer to set two types of soft limits—the maximum number of stack blocks that a thread can use and the maximum total number of stack blocks that can be allocated to a process. Both limits are unset by default.

5.2. Performance Benchmarks

Benchmark measurements were performed on Linux using the GCC (GNU C++ compiler, version 14.2.0), and on Windows using the MSVC (MS Visual C++ compiler, MS Visual Studio 2022, 17.14.0, compiler version 19.44.35207). A workstation with an AMD Ryzen 5 3600X processor with a fixed operating frequency of 3.7 GHz and 64 GB RAM was used. Benchmarks were run 30 to 80 times.

5.2.1. Low Level Benchmarks

Measurements show that checking the stack status, along with the corresponding branching, takes an average of 0.271 ns on Linux and 0.297 ns on Windows.
Extend/condense cycle time measurements were performed with stack block sizes of 16 KB and 1 MB. The results were similar. A single stack extend/condense cycle takes between 75 and 900 ns, depending on the ESL usage method, as well as the operating system and compiler, as shown in Figure 1. In cases with moderate stack depths (up to 1000 blocks), when the used parts of all stack blocks are in the processor cache, cycles can take only 75 to 200 ns. Such cases are indeed possible (e.g., the ACK benchmark), but not very likely in real-world use, as stack usage will often be higher and there will be other threads and processes that will also use the cache. Therefore, we should rely on higher measured values of 400 to 900 ns.
On both Windows and Linux, the fastest cycles are obtained when using the fiber-based methods. The difference mainly comes from the fact that fiber-based methods do not copy stack frames during stack extend/condense operations.

5.2.2. Recursive Function Benchmarks

To get an objective idea of the cost of using ESL, performance was measured using code built and optimized for release. Compiler optimizations are complex and often unpredictable, so even small code changes can affect performance in unexpected ways [39]. Furthermore, the impact is not the same for different compilers. This is further emphasized by the relative simplicity of the benchmark functions. As a result, the performance of different methods of using ESL can sometimes differ significantly, and it is not entirely predictable which usage method will be the most efficient. For example, modern compilers implement C++ exceptions in a way that the cost is paid only when the exceptions occur, and generally not in the meantime, so adding exception handling is expected to slow down the program only slightly, but we have observed both more significant slowdowns and unexpected speedups.
Optimization of sibling calls was disabled for GCC for the SUM and COUNT benchmarks, as it completely turns recursion into iteration and makes them useless for benchmarking. All other optimizations were enabled, as in the usual release build.
Benchmarks were run multiple times (10–20) to obtain more reliable data. In each run, the tests were repeated multiple times (3–8), with the first execution time discarded to reduce the impact of memory allocation. Each benchmark, for each usage method, was repeated for several different arguments to observe how increasing the recursion depth or problem size affects performance. In most cases, only the results for the most complex cases are presented.
The results for the SUM benchmark on Linux and Windows are presented in Figure 2. For all tested ESL usage methods, stack extensions start at test S-6 (recursion depth is 100,000, requiring from 1 to 4 extensions, depending on the method) and increase with later tests. In test S-9 with a recursion depth of 100,000,000, the methods require stack extensions as follows: on Linux, the Safe method uses 1834 extensions, M-Grd 1868, M-Blk 1528, M-Dup 1528, Del 1019, Fibers 1017, Del-Exc 1019, and Fib-Exc uses 1017 extensions; on Windows, most of the methods use 4586 extensions, and Fibers and Fib-Exc use 4578 extensions.
The performance differences between the different methods increase with recursion depth. All methods slow down as the recursion depth increases, but some methods are more sensitive than others. For example, on Linux, the Safe, M-Grd, M-Blk, and M-Dup methods start to slow down at test S-6, while the Del, Fibers, Del-Exc, and Fib-Exc methods only slow down at test S-7. There are no major differences between tests S-8 and S-9.
Figures S1–S9 in the Supplementary Materials present the corresponding data for other benchmarks on both Linux and Windows.
Different implementations of deep recursion benchmarks (SUM, COUNT, ACK) were run with different stack block sizes from 64 KB to 8000 MB to test how block size affects performance. The impact of block size on performance increases with the recursion depth. It is not the same for all benchmarks and ESL usage methods.
Figure 3 shows how block size affects the performance of ESL usage methods for the SUM benchmark. This benchmark uses over 1 GB of stack space on Linux and over 4 GB on Windows. In the case of 64 KB blocks, there are from 16,371 to 30,693 extensions on Linux and from 73,736 to 75,528 extensions on Windows. For most block sizes, the differences on Linux with GCC are moderate—mostly under 2%, with only the 64 KB, 1000 MB, and 8000 MB block sizes making a difference of 4 to 8.5%. Differences are larger on Windows with MSVC (up to 30%).
Figures S10 and S11 in the Supplementary Materials show the corresponding data for the COUNT and ACK benchmarks on Linux and Windows.
To compare the performance of ESL usage methods against each other, it is most relevant to use the deepest recursion tested for each method and benchmark, as presented in Figure 4.
For the ACK, COUNT, and SUM benchmarks, native implementations support lower maximum recursion depths than ESL-based implementations. Figure S12 compares performance at the highest recursion depth achievable by native implementations on Linux and Windows. In these cases, the difference between ESL-based implementations is smaller, while the slowdown compared to the native version becomes more noticeable.

5.2.3. Wafl Benchmarks

The benchmarks performed show that the differences between using ESL in the Wafl interpreter and not using it are, in most cases, below 1%. As Figure 5 shows, in some cases, ESL takes additional time, but in other cases, it even improves the performance of the interpreter. The largest difference was measured for the CAT benchmark on Windows, where ESL slowed it down by 3%. On average, using ESL in Wafl had no measurable cost on Linux, and the cost was below 1% on Windows.

6. Discussion

6.1. Constraints

Most of the constraints discovered are related to stack frame copying between stack blocks and the fact that it can result in two independent instances of some arguments or local variables. To safely avoid these problems, (1) local variables must not be passed by address as updatable arguments, and (2) class type objects and variables whose addresses are stored in registers must not be used both before and after stack extension or condensation.
The first rule is clear. Passing local variables by value is safe, as is passing them by address as read-only arguments (pointer to const). Passing heap objects by address is safe regardless of whether they are read or written.
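For illustration, the safe pattern for an updatable result can be sketched as follows (the function names are ours and the ESL calls are omitted; only the parameter-passing discipline is shown). The accumulator lives on the heap, so the address passed down the recursion stays valid even if the surrounding stack frames are later copied to another stack block:

```cpp
#include <cassert>
#include <memory>

// Illustrative sketch of rule (1): the updatable accumulator is heap-allocated,
// so the pointer passed down the recursion remains valid even if the stack
// frames around it are moved to a different stack block. Passing the address
// of a caller's local variable here instead would be unsafe under ESL.
void sumInto(long long n, long long* heapAcc) {
    if (n == 0) return;
    *heapAcc += n;              // write through a stable heap address
    sumInto(n - 1, heapAcc);
}

long long sum(long long n) {
    auto acc = std::make_unique<long long>(0);  // heap object, not a stack local
    sumInto(n, acc.get());
    return *acc;
}
```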
The second rule has already been discussed in the Lost Updates and Undesirable Optimization sections in Results. To summarize, if the same variable must be used in two stack blocks (i.e., both before and after extension or condensation), then the following safety rules must be followed:
  • Never define a variable before stack extension, modify it between extension and condensation, and then access it again (e.g., return it) after condensation. Doing so can cause incorrect results even for primitive types.
  • If the result must be of a class type, then create it dynamically and return it using a raw pointer. Never return a smart pointer; it is itself a class-type object and is prone to the same problems.
  • If a try-catch block is used, it must contain both extendStack and condenseStack or neither, and it is best to end the try block with a return statement.
It is strongly recommended not to use automatic class-type objects as results or non-const arguments of recursive functions at all. If class-type objects must be transferred between recursive calls, it is recommended to create them dynamically and to access and transfer them by pointers.
Although this may look like a major limitation, it is not. Local variables in recursive functions should not be of any complex type, including class types, in the first place, because their size would exhaust stack blocks faster and cause more frequent stack extensions. This limitation of ESL therefore only emphasizes an already existing, softer one.
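The root cause of these constraints can be illustrated even without ESL. Many class types hold pointers into their own storage (e.g., strings with small-string optimization), and a plain binary copy of such an object, which is how a stack frame is moved, leaves the copy's internal pointer aimed at the original. A minimal standalone sketch (the SelfRef type is our own illustration):

```cpp
#include <cassert>
#include <cstring>

// A trivially copyable struct that, like many real class types, keeps a
// pointer into its own storage.
struct SelfRef {
    char buf[16];
    char* p;   // intended to point into this instance's buf
};

bool copyDangles() {
    SelfRef a{};
    std::strcpy(a.buf, "hello");
    a.p = a.buf;
    SelfRef b{};
    std::memcpy(&b, &a, sizeof a);   // binary copy, as when a stack frame is moved
    return b.p != b.buf;             // b.p still points into a.buf, not b.buf
}
```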

6.2. Performance

The sources of performance differences between different implementations of the same benchmark are (1) additional stack state checks, (2) stack extension and condensation operations, (3) some implementation specifics, including (4) the size of the stack frame, and (5) code optimizations applied by the compiler.
For each benchmark, all implementations using ESL perform the same number of stack checks. Therefore, stack checks affect only the difference between implementations that use ESL and native implementations. Stack extend/condense cycles only affect performance in cases of deep recursion, where stack usage exceeds the size of a single stack block. The impact of stack frame size is also significant in these cases, as larger stack frames cause higher stack usage and a higher number of extend/condense cycles.
Implementation specifics include the number of additional conditional statements and the way in which extend/condense operations are performed (explicitly or by delegation). Code optimization can depend on every implementation detail and can strongly influence benchmark results. This is especially evident in the case of relatively simple benchmarks with multiple recursion but no deep recursion (CAT, FIB, HAN-C, HAN-M), where the influence of other factors is smaller.
In delegation-based methods, the processing is moved out of the recursive function. Therefore, code optimizations have a potentially greater impact when ESL operations are performed directly rather than by delegation. If a recursive function using a direct ESL method is well optimized, it can be more efficient than a delegation-based implementation; if not, it is slower. For example, in the CAT benchmark, where each recursive step makes a number of recursive calls in a loop, the M-Blk method and some similar methods are well optimized and perform better than the delegation-based methods for both compilers tested.
The benchmark results show that GCC optimizes more effectively than MSVC, so the performance difference between the ESL usage methods is more noticeable on Linux with GCC than on Windows with MSVC. Because the native implementations of the benchmarks are the simplest and most aggressively optimized, some functions may appear very slow on Linux, being up to 10 times slower than the native variants, while still being much faster than the corresponding functions on Windows.
The effect of the optimizations is clearly visible in Figure S12, where the native implementation of the ACK benchmark in the case without stack extension (Test A-4) on Linux runs almost 3 times faster than any other implementation, while on Windows it is only 0.1% faster than the fastest ESL-based implementation.

6.2.1. ESL Performance Cost Estimation

The measured performance cost of ESL operations motivated us to check whether the impact of ESL operations on a given recursive function can be assessed in advance. It turns out that the basic estimate is relatively simple, but real-world results rarely match it, primarily due to the effects of optimizations. Since MSVC for Windows applies less aggressive optimizations, the Windows results are closer to the estimates. GCC's optimizations are so strong that a simple change in the recursive function code often has a stronger effect than the entire estimated cost of ESL operations.
For benchmarks without deep recursion, the estimated cost of the stack-checking operation is comparable to the total cost of using ESL, as shown in Table 2, even though that total cost includes some additional operations. The total cost of using ESL is equal to or lower than the estimated stack-checking cost in 28 of 56 cases on Linux and in 45 of 56 cases on Windows. All ESL implementations of HAN-C and HAN-M on Linux (16 of 16) are slower than expected, primarily due to the aggressive optimization of the native implementations; nevertheless, all ESL-based Linux implementations are faster than all Windows implementations, including the native one. Also, the relative differences for SORT are quite low: under 3%. The only cases in which delegation methods are slower than expected are SORT on Linux and CAT on Windows.
We can conclude that the estimate of the cost of stack status checking is relatively useful, as it can help define expectations about ESL usage, but other, less predictable factors (primarily optimization) can make the estimate imprecise.
To assess the cost of the extend/condense cycles in real-world implementations, we analyzed the results of the deeply recursive ACK benchmark. It performs a large number of stack extend/condense cycles, although the extensions themselves are not very deep. When it is executed with different stack block sizes, the only differences are the number of extend/condense cycles and the size of the allocated blocks.
If we compute the stack extend/condense cycle cost based on these benchmark results with different stack block sizes, the results are consistent with the previously estimated low-level operations cost, as shown in Table 3. The results for 64 KB blocks are closer to the lower estimated cost values, due to better cache utilization. For larger blocks of 1 MB, the cycle cost increases. This is more pronounced on Windows.
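As an illustration of such a basic estimate (the cost model below is our own sketch, not an ESL API; per-operation costs such as those estimated in Tables 2 and 3 are passed in as parameters): every call pays one stack check, and, for a linear recursion that spans several blocks, one extend/condense cycle is paid roughly every blockBytes/frameBytes calls.

```cpp
#include <cassert>

// Illustrative cost model (assumed, not part of ESL): total overhead is
//   depth * checkNs + cycles * cycleNs,
// where, for linear recursion, the number of extend/condense cycles in one
// traversal is about depth / (blockBytes / frameBytes).
double eslOverheadNs(long long depth, long long frameBytes,
                     long long blockBytes, double checkNs, double cycleNs) {
    long long framesPerBlock = blockBytes / frameBytes;
    long long cycles = depth / framesPerBlock;  // 0 if recursion fits one block
    return depth * checkNs + cycles * cycleNs;
}
```

For example, a recursion of depth 2^20 with 64-byte frames and 1 MB blocks pays 64 extend/condense cycles, so with an assumed 1 ns check and 300 ns cycle the cycle cost is under 2% of the check cost.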

6.2.2. Performance Implications of Stack Size

In Section 5, we noted that the SUM benchmark slows down as the recursion depth increases, even though the total processing complexity stays the same (Figure 2). If stack extend/condense cycles were the main reason for the slowdown, then tests S-6 to S-9 should have similar times, because their total numbers of cycles are similar. The number of stack checks is also the same across all tests, so it cannot be the cause of the differences either. The situation is similar for the COUNT benchmark (Figure S5).
The main reason for the slowdown is the increased overall stack size: a large stack, contiguous or not, cannot fit in the processor cache and requires more accesses to main memory, which is significantly slower than the cache. On Linux, the S-7 test requires 10 to 18 extension blocks of 1 MB each (i.e., a total stack size of 11 to 19 MB), so the stack occupies a significant portion of the 32 MB processor cache, and the slowdown is visible for the methods that use more blocks. For tests S-8 and S-9, the stack size far exceeds the cache size and main memory is heavily used, so slowdowns are inevitable for all methods. On Windows, the S-7 test requires 45 stack blocks across all implementation methods, so all methods slow down.
Linux implementations of the SUM benchmark vary in their overall stack size. Since the recursion depth is the same across all implementations, the overall stack size directly depends on the stack frame size. Delegation-based methods perform less processing locally and can use smaller stack frames, so they are significantly more efficient in this case.

6.2.3. Performance Implications of Stack Block Size

To observe the impact of stack block size, the same benchmarks were run multiple times with block sizes ranging from 64 KB to 8 GB (Figure 3, Figures S10 and S11). The measurements did not involve the allocation of new stack blocks; rather, blocks were pulled from the pool. This was done to reduce the impact of memory management on the measurements, as the efficiency of allocation depends on many external factors.
If the recursion is linear, as in the SUM and COUNT benchmarks, the number of extend/condense cycles grows approximately in inverse proportion to the block size.
For the SUM benchmark on Linux (Figure 3a), the performance differences are below 8%. The smallest block size of 64 KB reduces performance (by 2–8%) for all usage methods, and the next block size of 128 KB reduces performance only in a few cases. For M-Blk, M-Dup, and Fib-Exc, all other block sizes provide almost the same performance. For the other methods, block sizes of 100 MB and above cause performance degradation. On Windows (Figure 3b), for the M-Blk and M-Dup methods (the slowest methods for this benchmark) the differences are relatively small (mostly under 2%), but for all the other methods there is a moderate performance drop for block sizes of 128 KB and smaller (up to 5%) and a significant drop for sizes of 5 MB and above (up to 30%). The differences for block sizes of 250 KB to 2 MB are relatively small.
For the COUNT benchmark, the Windows results (Figure S10b) are almost the same as for SUM. On Linux (Figure S10a), the performance degradation for small and large stack blocks is even lower than for SUM.
In the case of the ACK benchmark, reducing the stack block size causes a more than linear increase in the number of extend/condense cycles. In such cases, the benchmarks run faster for larger stack blocks, with relatively uniform execution times for stack block sizes above some limit. On Linux (Figure S11a), in most cases, performance decreases when the stack block size is below 500 KB. On Windows (Figure S11b), in most cases, performance decreases when the stack block size is smaller than 1 MB, with M-Grd decreasing for 2 MB and smaller. In most cases, the differences are relatively small (up to 2%), except for the smallest block sizes of 64–128 KB on Linux and 64–250 KB on Windows.
Knowing that the total stack utilization is almost the same for all stack block sizes, we conclude that moderate stack block sizes (250 KB–2 MB) generally have little impact on performance. On the other hand, very small (below 200 KB) and very large (above 5 MB) stack blocks can degrade performance. Smaller stack blocks make better use of allocated memory, but they increase the number of stack extend/condense cycles and can increase the probability of thrashing (e.g., in the ACK benchmark). With smaller blocks, the total memory consumption can also increase due to the spare space at the end of each stack block (16 KB by default).
In most cases, the best choice is a block size of 250 KB to 2 MB on Linux and 500 KB to 1 MB on Windows; 1 MB is a good initial choice. When ESL is used by a large number of threads, a smaller block size should be chosen.

6.2.4. Comparing Performance of Different Usage Methods

The differences in performance of the ESL usage methods are more pronounced for relatively simple benchmarks. For more complex benchmarks (BSRCH, PERM, SORT), the differences are relatively low. Also, the differences are more pronounced on Linux with GCC, owing to its more aggressive optimization techniques.
It is not possible to identify the generally fastest method. Similarly, there is no strict rule about when methods that do not handle exceptions are faster than methods that do, but the differences here are relatively small in most cases. In some cases, some implementations that use ESL are even faster than the native implementations (e.g., BSRCH on Linux).
In most cases (HAN-C, FIB, COUNT, SUM, ACK, BSRCH, PERM, SORT), delegation-based methods (Del, Fibers, Del-Exc, and Fib-Exc), or at least some of them, are among the fastest. On the other hand, in the CAT benchmark, they are among the slowest methods on both Linux and Windows. They also lag behind in HAN-M and in FIB (Windows only), but not by much.
The M-Blk method is among the best in some cases (HAN-C, HAN-M (Windows), FIB (Windows), ACK, CAT, PERM, SORT), but is very slow in some other cases (FIB (Linux), COUNT, SUM). The other methods are among the slowest in most cases.
When implementing a function that uses deep recursion, the first choice should be one of the delegation methods, as they show relatively stable performance. We suggest using the Fib-Exc method as an initial choice. It is based on delegation and includes exception handling with relatively low overhead (one condition is added, no new variables, the stack frame is not enlarged), so it has a relatively small impact on optimization. It also has fewer limitations than other methods.
If the function code is such that exceptions will certainly not occur, then it may be more efficient to use a variant that does not handle exceptions. If the cost of the chosen method turns out to be high, then use of another method should be considered, as well as the acceptability of the limitations it introduces.
This is consistent with the choice of method used in the Wafl interpreter. The Wafl function interpretation mechanism is essentially the same regardless of whether the functions use deep recursion or not. Tests have shown that on both Linux with GCC and Windows with MSVC, the delegation methods are usually more efficient than the others, with a slight advantage for Fib-Exc on Linux and Del-Exc on Windows.

6.3. Future Work

C++ compilers provide specific code metadata, called unwind tables or exception handling tables, that are used to properly handle exceptions, including destroying the automatic objects created in the corresponding code blocks. This metadata can be used to detect stack frame boundaries [40] and to determine the total size of the last two (or more) stack frames. This should be one of the primary future improvements to ESL, as stack frame size estimation is only rough in the current version.
Thrashing can be automatically detected by counting how many stack checks were made in each extend/condense cycle. If a certain number of short cycles is detected, the stack block size can be increased to reduce further thrashing. Additional benchmarking would be required to determine whether the performance cost of this approach is justified.
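Such a heuristic could be sketched as follows (our own illustration, not part of ESL; the class, its interface, and the tuning constants are all hypothetical):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of the thrashing heuristic described above (not part of
// ESL): count the stack checks performed in each extend/condense cycle, and
// after a few consecutive short cycles double the block size used for new
// extensions.
class BlockSizer {
    std::size_t blockSize_;
    int shortCycles_ = 0;
    static constexpr int kMinChecksPerCycle = 8;  // assumed tuning constants
    static constexpr int kShortCycleLimit = 4;
public:
    explicit BlockSizer(std::size_t initial) : blockSize_(initial) {}
    std::size_t blockSize() const { return blockSize_; }

    // To be called when a cycle ends, with the number of stack checks it saw.
    void onCycleEnd(int checksInCycle) {
        if (checksInCycle < kMinChecksPerCycle) {
            if (++shortCycles_ >= kShortCycleLimit) {
                blockSize_ *= 2;   // thrashing detected: grow the blocks
                shortCycles_ = 0;
            }
        } else {
            shortCycles_ = 0;      // a long cycle resets the detector
        }
    }
};
```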
A static code analyzer (e.g., as a clang-tidy plugin) could check whether the use of ESL complies with the constraints. This would make it easier for the programmer to use ESL.
The ESL currently supports Windows with MSVC and Linux with GCC for AMD64 and AArch64 processors. Additional work is needed to support more platforms.

6.4. Alternatives

There are many solutions that aim to expand the call stack when required. As with split stacks, the function prologue has to check the stack status. However, if there is not enough space, instead of linking a new stack segment, such solutions allocate a larger contiguous stack space, copy the entire old stack to the new location, activate it, and deallocate the old stack. Some projects that previously used split stacks switched at some point to such expandable stack solutions, for example the implementation of the Go language [32].
Such an expansion is more expensive than linking a new stack segment, because copying the existing stack contents requires additional CPU time. This can usually be compensated for by allocating enough stack space that copying occurs infrequently. The number of stack expansions is then usually proportional to the logarithm of the stack size (instead of linear in it), and the total number of extend operations is thus lower than the number of extend/condense cycles. On the other hand, if there are many threads and their stacks grow in this way, a lot of memory can be allocated in a short time. With the expandable stack, stack condensation (i.e., moving back from a larger to a smaller stack) is also more demanding, but also more important, because the larger unused space has to be released.
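The logarithmic-versus-linear difference can be quantified with two small helper functions (our own sketch, not taken from any of the cited implementations):

```cpp
#include <cassert>

// Number of copy-and-grow expansions needed to reach stack size s when a
// contiguous stack doubles on each overflow (logarithmic in s).
int expansionsDoubling(long long initial, long long s) {
    int n = 0;
    for (long long cap = initial; cap < s; cap *= 2) ++n;
    return n;
}

// Number of extra fixed-size segments a segmented stack links to reach the
// same size (linear in s).
long long extensionsSegmented(long long blockBytes, long long s) {
    return (s + blockBytes - 1) / blockBytes - 1;
}
```

Starting from a 1 MB stack and growing to 1 GB, doubling needs only 10 expansions, each of which copies the live stack contents, while 1 MB segments require 1023 link operations, none of which copy anything.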
Expandable stack solutions provide a contiguous call stack, which is an advantage over segmented stacks. However, they share many problems with ESL, in some respects even more seriously. While there is no problem with lost updates, because there is never more than one live copy of a stack frame, the other problems mostly remain, as stack frames are moved from one location to another. We discussed the ESL issue of moving class-type objects by making binary copies. While this is a problem for ESL, it is even more significant for expandable stacks, because they move the entire call stack, including all stack frames preceding the recursive calls. This threatens to effectively make local class-type variables unusable throughout the program, or requires more complex stack-moving procedures.

6.5. Summary

The Extendable Stack Library is designed to enable deep recursion in C++ programs. The primary aims were to support deep recursion on demand, to be as efficient as possible, and to be a conceptually cross-platform solution. The library has been implemented in standard C++. Platform-specific library elements were implemented for Linux on AMD64 and AArch64 processors with the GCC and Clang compilers, and for Windows on AMD64 with the MSVC compiler.
The most similar publicly available solution is the split-stacks technique. The main conceptual difference is the level of implementation. Split stacks are implemented and used at the compiler level, which is good from the perspectives of safety, performance, and ease of use, but ties the technique to a single platform (AMD64 processors, the GCC and Clang compilers, and the gold linker).
An important feature of ESL is that it is conceptually cross-platform. The library must be adapted for use with a new compiler or a new processor architecture, but there is no need to modify the compiler itself or other tools. The adaptation is not trivial, but neither is it overly demanding. ESL is implemented as a library and is used explicitly at the source code level. Explicit usage makes it more flexible in complex projects, as it does not require rebuilding entire projects and all subprojects, as split stacks do. On the other hand, it makes ESL less safe and less efficient.
To summarize the discussion, we answer the research questions opened in the Introduction:
RQ1—Can the existing solution from the Wafl interpreter be generalized into a cross-platform library that allows deep recursion? In short, the answer is Yes.
RQ2—To what extent can this library be independent of specific compilers and linkers? The ESL implementation is independent in the sense that no compiler or linker modifications are required to make it work. However, the library itself needs to be adapted to a specific architecture and appropriate building tools in order to be used.
RQ3—What are the limitations or constraints that such a library imposes on developers? We have discussed the identified constraints. Although some of them reduce the freedom in writing recursive functions, they are not too restrictive, and ESL is usable in the real world.
RQ4—How does using a library impact the performance of recursive functions? Some performance costs cannot be avoided, but we consider them acceptable because they are moderate and only paid for functions that explicitly use ESL. Using ESL can make the difference between working and failing code—if deep recursion is required, a moderate performance loss is usually more acceptable than failure.

7. Conclusions

In this paper, we discuss the problem of deep recursion in C++ and present the Extendable Stack Library (ESL), which enables the safe use of deep recursion in C++. ESL is based on an existing solution used in the Wafl interpreter. It has been generalized and modified to be applicable in other contexts.
The limitations that ESL imposes on the programmer, as well as the resulting performance, are analyzed. The results show that ESL is relatively easy to use and has acceptable performance. Existing recursive functions can be easily modified to use ESL to support deep recursion without significant impact on the software build process.
The library is conceptually independent of architecture and build tools, but it has to be adapted for new platforms. It supports Linux on AMD64 and AArch64 processors with GCC, and Windows on AMD64 with MSVC.
The ESL is available as an open-source project [37].

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/computers15010015/s1, Listing S1: Examples of different methods of using ESL, Figure S1: Comparison of execution times for different variants of the HAN-C benchmark, Figure S2: Comparison of execution times for different variants of the HAN-M benchmark, Figure S3: Comparison of execution times for different variants of the FIB benchmark, Figure S4: Comparison of execution times for different variants of the CAT benchmark, Figure S5: Comparison of execution times for different variants of the COUNT benchmark, Figure S6: Comparison of execution times for different variants of the ACK benchmark, Figure S7: Comparison of execution times for different variants of the BSRCH benchmark, Figure S8: Comparison of execution times for different variants of the PERM benchmark, Figure S9: Comparison of execution times for different variants of the SORT benchmark, Figure S10: The COUNT benchmark performance comparison with different stack block sizes, for the most complex test (C-9), Figure S11: The ACK benchmark performance comparison with different stack block sizes, for the most complex test (A-6), Figure S12: Relative comparison of execution times for different variants of benchmark functions, with the deepest recursion tested working for the native variants, for a block size of 1 MB. File S1: benchmarks.csv. File S2: wafl_benchmarks.csv.

Author Contributions

Conceptualization, S.N.M.; Methodology, S.N.M., I.L.Č. and P.Ž.Đ.; Software, S.N.M.; Validation, I.L.Č. and P.Ž.Đ.; Formal Analysis, S.N.M.; Data Curation, S.N.M., I.L.Č. and P.Ž.Đ.; Visualization, S.N.M.; Original Draft Preparation, S.N.M.; Review and Editing, S.N.M., I.L.Č. and P.Ž.Đ. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by the Ministry of Education, Science and Technological Development of the Republic of Serbia, through project 451-03-136/2025-03/200104.

Data Availability Statement

Data are contained within the article or Supplementary Materials. The Extendable Stack Library (ESL) is available online as an open-source project [37]. All measurement results are generated by programs included in ESL and are provided as files benchmarks.csv and wafl_benchmarks.csv in the Supplementary Materials.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ABI: Application binary interface
API: Application programming interface
CPU: Central processing unit
ESL: Extendable Stack Library
GCC: GNU C++ compiler
MSVC: Microsoft Visual C++ compiler
SP: Stack pointer
TCO: Tail call optimization

References

  1. Graham, R.L.; Knuth, D.E.; Patashnik, O. Concrete Mathematics; Addison-Wesley: Boston, MA, USA, 1988; pp. 1–20. ISBN 0-201-55802-5.
  2. Knuth, D.E. The Art of Computer Programming, Volume 1: Fundamental Algorithms, 3rd ed.; Addison-Wesley: Boston, MA, USA, 1997.
  3. Sedgewick, R.; Wayne, K. Algorithms, 4th ed.; Addison-Wesley: Boston, MA, USA, 2011; ISBN 978-0-321-57351-3.
  4. Drozdek, A. Data Structures and Algorithms in C++, 4th ed.; Cengage Learning: Boston, MA, USA, 2012; ISBN 9781285415017.
  5. Weiss, M.A. Data Structures and Algorithm Analysis in C++, 4th ed.; Pearson Education: Upper Saddle River, NJ, USA, 2013; ISBN 978-0132847377.
  6. Patterson, D.A.; Hennessy, J.L. Computer Organization and Design: The Hardware/Software Interface, 4th ed.; Morgan Kaufmann: San Francisco, CA, USA, 2011; ISBN 978-0123747501.
  7. Wafl Programming Language. Available online: https://poincare.matf.bg.ac.rs/~sasa.malkov/wafl/index.html (accessed on 2 June 2025).
  8. Malkov, S. Customizing a Functional Programming Language for Web Development. Comput. Lang. Syst. Struct. 2010, 36, 345–351.
  9. Grune, D.; Reeuwijk, K.; Bal, H.E.; Jacobs, C.J.H.; Langendoen, K. Modern Compiler Design, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2012; ISBN 978-1461446989.
  10. Levine, J.R. Linkers and Loaders; Morgan Kaufmann: San Francisco, CA, USA, 1999; ISBN 978-1558604964.
  11. O’Regan, G. Mathematical Induction and Recursion. In Guide to Discrete Mathematics; Springer: Cham, Switzerland, 2016.
  12. McCauley, R.; Grissom, S.; Fitzgerald, S.; Murphy, L. Teaching and learning recursive programming: A review of the research literature. Comput. Sci. Educ. 2015, 25, 37–66.
  13. Rubio-Sanchez, M. Introduction to Recursive Programming; CRC Press: Boca Raton, FL, USA, 2018.
  14. Liu, Y.A.; Stoller, S.D. From recursion to iteration: What are the optimizations? SIGPLAN Not. 1999, 34, 73–82.
  15. Samelson, K.; Bauer, F.L. Sequential formula translation. Comm. ACM 1960, 3, 76–83.
  16. Dijkstra, E.W. Recursive Programming. Num. Mathematik 1960, 2, 312–318.
  17. Daylight, E.G. Dijkstra’s rallying cry for generalization: The advent of the recursive procedure, late 1950s–early 1960s. Comput. J. 2011, 54, 1756–1772.
  18. Aho, A.V.; Lam, M.S.; Sethi, R.; Ullman, J.D. Compilers: Principles, Techniques, and Tools, 2nd ed.; Pearson Education: Upper Saddle River, NJ, USA, 2006; pp. 427–503.
  19. Bryant, R.E.; O’Hallaron, D.R. Computer Systems: A Programmer’s Perspective, 3rd ed.; Pearson Education: Upper Saddle River, NJ, USA, 2016; ISBN 978-1292101767.
  20. Overview of x64 ABI Conventions. In Microsoft Learn. Available online: https://learn.microsoft.com/en-us/cpp/build/x64-software-conventions (accessed on 6 February 2025).
  21. System V Application Binary Interface, AMD64 Architecture Processor Supplement. Available online: https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf (accessed on 6 February 2025).
  22. C++ ABI for the Arm® 64-bit Architecture (AArch64). In Application Binary Interface for the Arm® Architecture. Available online: https://github.com/ARM-software/abi-aa/tree/main/cppabi64 (accessed on 6 February 2025).
  23. Love, R. Linux System Programming, 2nd ed.; O’Reilly Media: Sebastopol, CA, USA, 2013; ISBN 978-1449339531.
  24. Yosifovich, P.; Ionescu, A.; Russinovich, M.E.; Solomon, D.A. Windows Internals, Part 1: System Architecture, Processes, Threads, Memory Management, and More, 7th ed.; Microsoft Press: Redmond, WA, USA, 2017; ISBN 978-0735684188.
  25. Denning, P.J. Virtual memory. ACM Comput. Surv. 1970, 2, 153–189.
  26. Memory Limits for Windows and Windows Server Releases. In Microsoft Learn. Available online: https://learn.microsoft.com/en-us/windows/win32/memory/memory-limits-for-windows-releases (accessed on 6 February 2025).
  27. MISRA C++:2008: Guidelines for the Use of C++ Language in Critical Systems; MIRA Limited: Nuneaton, UK, 2008; p. 110. ISBN 978-1906400040.
  28. Options That Control Optimization. In A GNU Manual. Available online: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html (accessed on 2 February 2025).
  29. Appel, A.W. Compiling with Continuations; Cambridge University Press: Cambridge, UK, 1992.
  30. Taylor, I.L. Split Stacks in GCC. In GCC Wiki. Available online: https://gcc.gnu.org/wiki/SplitStacks (accessed on 2 February 2025).
  31. Anderson, B. Abandoning Segmented Stacks in Rust. 2013. Available online: https://web.archive.org/web/20230605154530/https://mail.mozilla.org/pipermail/rust-dev/2013-November/006314.html (accessed on 6 May 2023).
  32. Ma, Z.; Zhong, L. Bringing Segmented Stacks to Embedded Systems. In Proceedings of the 24th International Workshop on Mobile Computing Systems and Applications, HotMobile 2023, Newport Beach, CA, USA, 22–23 February 2023; ACM: New York, NY, USA, 2023; pp. 117–123.
  33. GNU Binutils 2.44 Released. Info-Gnu Discussion. Available online: https://lists.gnu.org/archive/html/info-gnu/2025-02/msg00001.html (accessed on 6 February 2025).
  34. Boost C++ Libraries. Boost.Context Library. Version 1.88. Available online: https://www.boost.org/libs/context/ (accessed on 6 February 2025).
  35. Kowalke, O.; Goodspeed, N. Fiber_Context—Fibers Without Scheduler, C++ Standard Proposal P0876R20. Available online: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p0876r20.pdf (accessed on 6 February 2025).
  36. Stroustrup, B. C++ Exceptions and Alternatives. In JTC1/SC22/WG21—Papers. 2019. Available online: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1947r0.pdf (accessed on 23 June 2025).
  37. Extendable Stack Library. Available online: https://gitlab.com/smalkov_/libextstack (accessed on 12 January 2025).
  38. Farvardin, K.; Reppy, J. From folklore to fact: Comparing implementations of stacks and continuations. In Proceedings of PLDI 2020, the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, London, UK, 15–20 June 2020; ACM: New York, NY, USA, 2020; pp. 75–90.
  39. Theodoridis, T.; Su, Z. Refined Input, Degraded Output: The Counterintuitive World of Compiler Behavior. In Proceedings of the ACM on Programming Languages, Copenhagen, Denmark, 24–28 June 2024; pp. 671–691.
  40. Otsuki, Y.; Kawakoya, Y.; Iwamura, M.; Miyoshi, J.; Ohkubo, K. Building stack traces from memory dump of windows x64. Digit. Investig. 2018, 24, S101–S110.
Figure 1. Extend/condense cycle duration with a stack block size of 16 KB. The x-axis is the stack extension’s depth n. Each test was repeated 1,000,000/n times, so the y-axis represents the test duration in ms, as well as the duration of one extend/condense cycle in ns. The direct case performs extend/condense cycles in a loop; the loop case does the same in a separate function; the recursive case uses a recursive function, like the Safe method of using ESL; the other cases are similar to using the ESL methods with the same names. For more details, see benchmark_lowlevel.hpp in [37]. (a) Linux with GCC. (b) Windows with MSVC.
Figure 2. Comparison of execution times for different variants of the SUM benchmark. For test S-n, the recursion depth is 10^(n−1), and the test is repeated 10^(9−n) times, so in each test the total number of calls is 100,000,000. The y-axis represents time in seconds. Tests from S-6 to S-9 use recursion depths of 100,000 and above and do not work for the native method with the default stack size of 8 MB. (a) Linux with GCC. (b) Windows with MSVC.
Figure 3. The SUM benchmark performance comparison with different stack block sizes, for the most complex test (S-9). The x-axis shows different block sizes for each of the ESL usage methods. The y-axis shows the ratio of execution time of a method with a particular block size, relative to the same method with a block size of 1 MB. (a) Linux with GCC. (b) Windows with MSVC.
Figure 4. Relative comparison of execution times for different variants of the benchmark functions with the deepest recursion tested, for a block size of 1 MB. Execution times are shown relative to the slowest variant for the same benchmark (e.g., a value of 0.7 means “30% faster than the slowest variant”). (a) Linux with GCC. (b) Windows with MSVC.
Figure 5. Comparison of relative execution times for the benchmark functions written in Wafl. The y-axis shows the ratio of execution times with ESL enabled to those with ESL disabled and with a sufficiently large stack (e.g., a value of 1.01 means that execution with ESL enabled takes 1% longer than when it is disabled).
Table 1. Characteristics of the performed benchmarks. All data is given for the most complex measured cases. The number of recursive calls includes all repetitions. The total number and maximum depth of stack extensions depend on the stack block size and how the ESL is used; the table shows only the highest numbers for stack blocks of 1 MB.
Benchmark | Arguments | Iterations | Recursion Depth | Total # of Rec. Calls | Total # of Stack Extensions | Max. Depth of Stack Extensions
Tower of Hanoi Count | 30 | 1 | 30 | 1074 M | – | –
Tower of Hanoi | 27 | 1 | 27 | 134 M | – | –
Fibonacci | 40 | 1 | 40 | 331 M | – | –
Catalan | 18 | 1 | 18 | 387 M | – | –
Count | 100,000,000 | 1 | 100 M | 100 M | 4586 | 4586
Sum | 100,000,000 | 1 | 100 M | 100 M | 4586 | 4586
Ackermann | A(3, 13) | 1 | 65,535 | 2863 M | 54,786 | 3
Binary Search | int[100,000,000] | 12.5 M | ≤27 | 320 M | – | –
Permutations | int[11] | 8 | 11 | 320 M | – | –
Sort | int[1,000,000] | 20 | 20 | 40 M | – | –
Table 2. Comparison of estimated stack checking cost and measured benchmark execution times. The Native time is the duration of the native benchmark variant, without using ESL. The ESL checking cost is estimated by multiplying the number of recursive calls by the estimated stack checking cost of 0.27 ns for Linux and 0.30 ns for Windows. The Total cost represents the range of differences from the native time across the ESL usage methods. Slower methods list the ESL usage methods for which the total cost is higher than the estimated stack checking cost. Measurements were performed with a stack block size of 1 MB, and only for the most complex test.
Bench. | # of Rec. Calls (M) | Linux Native Time (s) | Linux Est. Chk. Cost (s) | Linux Total Cost (s) | Linux Slower Methods | Windows Native Time (s) | Windows Est. Chk. Cost (s) | Windows Total Cost (s) | Windows Slower Methods
HAN-C | 1074 | 0.21 | 0.29 | 0.77–0.86 | All (8) | 1.04 | 0.32 | −0.028–0.001 | /
HAN-M | 134 | 0.81 | 0.036 | 0.11–0.14 | All (8) | 0.54 | 0.04 | 0.01–0.11 | Safe, M-Grd (2)
FIB | 331 | 0.17 | 0.09 | 0.03–0.29 | Safe, M-Grd, M-Blk, M-Dup (4) | 0.51 | 0.10 | 0.03–0.11 | /
CAT | 387 | 0.19 | 0.10 | 0.004–0.089 | / | 0.62 | 0.12 | 0–0.25 | Del, Fibers, Del-Exc, Fib-Exc (4)
BSRCH | 320 | 1.00 | 0.087 | −0.026–0.065 | / | 1.02 | 0.096 | −0.02–0.35 | Safe, M-Grd (2)
PERM | 320 | 1.07 | 0.087 | 0.024–0.083 | / | 1.04 | 0.096 | −0.01–0.05 | /
SORT | 40 | 1.37 | 0.011 | 0.027–0.040 | All (8) | 1.35 | 0.012 | −0.004–0.04 | Safe, M-Grd, M-Blk (3)
Table 3. Stack extend/condense operations cost, computed based on the ACK benchmark results. The columns present the total number of extend/condense cycles and the estimated duration of a single extend/condense cycle in ns, for Linux and Windows, and for stack block sizes of 64 K and 1 M. The cost of an extend/condense cycle is estimated as the difference between the benchmark duration for the selected stack block size and the average duration for block sizes sufficient to run without stack extensions, divided by the number of cycles.
Method | Linux 64 K #Ext | Linux 64 K Ext. Cost (ns) | Linux 1 M #Ext | Linux 1 M Ext. Cost (ns) | Windows 64 K #Ext | Windows 64 K Ext. Cost (ns) | Windows 1 M #Ext | Windows 1 M Ext. Cost (ns)
Safe | 2,161,716 | 239.6 | 129,461 | 733.9 | 1,079,415 | 294.7 | 57,382 | 415.1
M-Grd | 2,161,716 | 233.7 | 105,219 | 313.8 | 1,079,415 | 328.1 | 57,382 | 497.7
M-Blk | 719,865 | 244.9 | 57,385 | 283.0 | 1,080,179 | 348.8 | 57,380 | 890.9
M-Dup | 1,080,257 | 247.6 | 57,382 | 327.5 | 1,079,363 | 345.9 | 57,377 | 990.3
Del | 358,882 | 250.1 | 34,432 | 162.5 | 1,080,179 | 428.6 | 57,380 | 1051.9
Fibers | 350,290 | 253.2 | 34,383 | 383.9 | 1,054,998 | 277.5 | 57,281 | 723.8
Del-Exc | 358,882 | 274.7 | 34,432 | 357.6 | 1,080,179 | 433.2 | 57,380 | 1116.8
Fib-Exc | 350,287 | 247.0 | 34,382 | 348.9 | 1,054,998 | 290.6 | 57,281 | 794.0
AVG | – | 248.9 | – | 363.9 | – | 343.4 | – | 810.1

Share and Cite

MDPI and ACS Style

Malkov, S.N.; Čukić, I.L.; Đorđević, P.Ž. Enabling Deep Recursion in C++. Computers 2026, 15, 15. https://doi.org/10.3390/computers15010015
