An ADMM Based Parallel Approach for Fund of Fund Construction

: In this paper, we propose a parallel algorithm for a fund of fund (FOF) optimization model. Based on the structure of objective function, we create an augmented Lagrangian function and separate the quadratic term from the nonlinear term by the alternate direction multiplier method (ADMM), which creates two new subproblems that are much easier to be computed. To accelerate the convergence speed of the proposed algorithm, we use an adaptive step size method to adjust the step parameter according to the residual of the dual problem at every iterate. We show the parallelization of the proposed algorithm and implement it on CUDA with block storage for the structured matrix, which is shown to be up to two orders of magnitude faster than the CPU implementation on large-scale problems.


Introduction
Fund of funds (FOF) has become a hot topic during the past several years. As a mutual fund scheme, FOF uses other funds as investment targets and achieve the purpose of indirectly holding securities such as stocks and bonds [1]. The first target of FOF construction is to obtain optimal portfolio among different funds based on the tradeoff between return and risk. To meet this goal, one of the main research activities for the past few years has been FOF modeling. It is worth mentioning that many existing FOF optimization models are based on the Mean-Variance (MV) framework. In 1952, Markowitz introduced the seminal work, the MV model [2], which is regarded as the fundamental finding of modern portfolio theory (MPT). Since then, it has become the most essential theory to study the FOF asset allocation problem. Despite huge attention, the MV model still has some obvious shortcomings. One drawback which many studies emphasize is that the optimal portfolio can be extremely sensitive to input parameters [3,4]. Another drawback of the MV model is that this approach sometimes provides over-concentrated portfolios, which would probably cause huge losses if a financial crisis were to happen. Moreover, the MV model ignores the subjective views of investors. These shortcomings limit the MV models' application in the FOF construction. As a result, it is not surprising that many research efforts were placed on extending the MV model [5][6][7][8][9][10][11].
From a practical perspective, there exist many trading restrictions in the real-world financial market. Transaction cost is one of which that should be paid attention. In fact, transaction costs in the financial market is usually treated as non-linear functions, which makes the FOF optimization model a nonlinear programming problem. The interior point method (IPM) [12,13], sequence quadratic programming (SQP) method [14], parallel variable distribution (PVD) [15] and line search algorithm [16] are all popular methods for nonlinear programming problems. However, the aforementioned methods can be inefficient since they does not explore the structure of the objective function. For example, the general SQP procedure uses the derivative vector and the Hessian matrix of the Lagrangian function to construct quadratic approximation at each iteration [17]. The Hessian matrix of FOF optimization models may not necessarily be positive semi-definite, so an approximation of the Hessian matrix is needed. Constructing a Hessian approximation by quasi-Newton methods may be poorly performed and time-consuming. The IPM maintains the feasibility during iterations but it is a centralized and computationally expensive method. Moreover, it is hard to parallelize. The PVD method is suitable for parallelization but needs to solve a difficult convex subproblem when applied to FOF optimization model, which makes it hard to be extended to large-scale problems.
Most of algorithms that developed in recent years have concentrated on structured nonlinear programming, which aims to characterize a range of nonlinear problems. In 1977, Glowinski and Marrocco [18] proposed the alternating direction method of multiplies (ADMM) based on the decomposition. It is widely used in large-scale nonlinear optimization problems thanks to the advantage of being easily extended to parallel and distributed systems. However, it still faces the challenge of slow convergence when applied in the FOF optimization model. Chaves et al. [19] first proposed a Newton-based efficient algorithm; however, it only works when facing a non-constrained FOF model, which is unrealistic in practice. A least-squares method and an alternating direction method were developed by Bai et al. [20] and Costa et al. [21], respectively. However, they still consume a lot of time as the scale of the problem grows.
The main contribution of this paper is to propose a parallel algorithm general enough to characterize most of the existing specific FOF optimization models. Moreover, our approach is able to solve a large number of FOF optimization models that have the same structure as this paper studied. The main contributions of this paper could be summarized as follows: • We develop a new FOF optimization model which integrates complicated and diversified constraints into the Mean-Variance framework accompanied by a Black-Litterman based-asset expected return and covariance. • We propose an approach to handle the FOF optimization model mentioned above based on elevating the original problem into a higher-dimension convex problem which is solved by a modified ADMM algorithm. Moreover, we compare the modified ADMM algorithm efficiency to two of the best performing methods, SQP and IPM. • To explore a larger search space for better solutions, we parallelize the proposed approach on GPU using CUDA and study its speedup on different problem scales.
The remainder of the paper is organized as follows. In Section 2, we give the step-bystep formulation of our new FOF optimization model, especially focusing on the Black-Litterman based-asset expected return and covariance. Section 3 contains an approach to handle the FOF optimization model mentioned above. More specifically, we introduce new variables to transfer the inequality constraints in the model to the equality constraints and use a modified ADMM algorithm to construct the optimal portfolio. In Section 4, the GPU-based parallel approach using CUDA is introduced to improve computational efficiency and cope with large scale problems. This is followed by implementation details of the proposed parallel approach in Section 5,where we present the efficiency of the proposed approach compared with some of the best performing methods, as well as the acceleration effect of the parallel approach. In Section 6, we conclude the paper.

Problem Formulation
The core of FOF funds is to diversify investment, so it is no surprise that asset allocation has become an important part of FOF construction. Based on the asset allocation, we can improve the return-risk feature of a fund portfolio. For simplicity, we limit the fund type to equity funds, so asset allocation is beyond the scope of this paper. Furthermore, another determining factor for the performance of an FOF strategy is the quality of the fund pool. It is naturally an extremely important thing for fund managers to screen out their own fund pools from a huge number of funds and build fund portfolios accordingly. There are many fund selection criteria such as risk/return parameters and the professionalism of the management team. In this paper, we take return, standard deviation and fund size as the key factors of focus to screen out funds.
It is assumed that investors with initial capital C will assign their wealth to n equity funds in the following T periods. Let x 0 i be the proportion on fund i included in the portfolio at the previous period. S r , S s , S m , S b and S c are the set of the funds with high risk and high return, with medium to high risk and medium to high return, with medium risk and medium return, with low to medium risk and low to medium return, and with low risk and low return, included in the portfolio, respectively. A new FOF optimization model is proposed as follows: where the transaction cost f i (x) = exp(−( , a i ∈ R and b i ∈ R are the given parameters, which are various from different funds. P denotes the covariance matrix of the fund returns, and q ∈ R n refers to the fund returns vector. l i and u i (i = 1, . . . , n) are the lower and upper bound of investment proportion on fund i. Without loss of generality, we can abstract the new FOF optimization model into a general form.
Notation. Let R denote the set of real numbers, R n the n-dimensional real space, and S n ++ (S n + ) the set of real n-by-n symmetric positive (semi)definite matrices. We denote by R m×n the set of real m-by-n matrices, I n ∈ R n×n is the identity matrix and 0 ∈ R n×n is the zero matrix, and 1 n is the n-dimensional vector with all the entries being 1. I C is the indicator function over the affine constraints of (5), i.e., For a, b ∈ R, the projection of z onto [a, b] is given by Π[a, b](z) = min(max(z, a), b). The FOF optimization model that considered transaction cost is given by: where x ∈ R n is the optimization variable. P ∈ S n + is a symmetric semidefinite covariance matrix and q ∈ R n . A ∈ R m×n and b ∈ R ∈m . l i and u i (i = 1, . . . , n) are the lower and upper bound. f i (x), x ∈ R, i = 1, . . . , n is the transaction cost function. The function f i (x) is the subscription and redemption cost for adjusting the ith fund. If the transaction cost f i (x) is given by linear function or quadratic function, i.e., f i (x) = α i x, then the optimization problem (3) is quadratic programming. Throughout this paper, we consider the transaction cost function f i (x) to be convex, so that the objective function is nonlinear and convex. The linear constraints imply the bound of the portfolio weight and the investment requirement of the portfolio weight.

Transforming the Inequality to Equality
To solve the nonlinear problem (1), we consider the general formulation of the nonlinear problem (3). We start by transforming the convex problem (3) to a new problem without inequality constraints by introducing the slack variables to transfer the inequality constraints into the equality constraints. Letx ∈ R m , the inequality constraints Ax ≤ b and the equality constraint 1 T x = 1, which could be rewritten as Letx = x x , then the problem (3) could be rewritten as min x∈R m+n 1 2x By introducing the slack variable z ∈ R m+n , the problem (5) is equivalent to min x∈R m+n z∈R m+n 1 2x with variables x ∈ R m+n and z ∈ R m+n .
. . , n, z i ≥ 0, i = n + 1, . . . , n + m}. One can easily figure that the original problem (3) is transformed to problem (6) by two steps. The first step is the slack variablex, which changes the inequality Ax = b to the equalityÃx =b. The second step, with the help of the slack variable z, transforms the problem (5) to problem (6) without any inequality constraints remaining.

ADMM Steps
It is costly to solve the optimization problem (5) or (6) by calling the default solvers such as the fmincon(MATLAB). One key to solve the problem (3) is to separate the quadratic term and the nonlinear term, since there are effective solvers for quadratic problems. The augmented Lagrangian of (6) is Then, we develop the ADMM used in every iteration with regard to the variables x k and z k as follows.
x k+1 = arg miñ The update of the variables x k is z k is obvious. We minimize the augmented Lagrangian function. In practice, we find that it requires many more iterations on large problems. The convergence of the ADMM algorithm is guaranteed under fairly general assumptions [22,23]. Since the objective function is convex and the global solution exists, we have x k → x * , z k → z * and u k → u * . For faster convergence, we suggest to perform the relaxed ADMM algorithm. The relaxed ADMM iterates x k , z k and u k , for k = 0, 1, . . . , by Here, u k ∈ R m+n denotes the dual variables on iteration k. (τ k , γ k ) are sequences of penalty and relaxation parameters. If the problem (3) is solvable, then the sequence (x k , z k , u k ) converges to its primal-dual solution. On the other hand, if the problem is infeasible, then the sequence (x k , z k , u k ) does not converge, but the sequence (x k − x k−1 , z k − z k−1 , u k − u k−1 ) always converges and can be used to certify the infeasibility of the problem [24].
Since the nonlinear function f i (x) is separatable, we develop the ADMM-based algorithm and parallelize it on CUDA. The key point of ADMM iteration presented in (6) contains three steps: the solving of the linear equation, the minimum of the nonlinear problem (27) and the update of the dual variables λ.

Solving the Reduced KKT System
We describe how x k is updated with (11). Minimizing the augmented Lagrangian (11) involves solving the following equality-constrained least-squares problem: Ax =b. Let , the minimizingx can be found by solving the following linear equation: P is a (m + n) × (m + n) matrix andÃ is a m × (m + n) matrix, we need to solve the 2m + n linear equation. As τ k goes to change, we have to solve the linear system at each iteration, which will prevent us from solving the target problem effectively. Since τ k > 0, the matrixP + τ k I is positive defined. There exist orthogonal matrixQ such that We have Then, x k+1 could be updated by where

Adaptive Step Size
The ADMM algorithm is the first-order method with a linear convergence rate. For large-scale problems, achieving high accuracy requires more iterations. The adaptive step has shown a successful application in ADMM. Let k be the current iteration and k 0 be an older iteration, such that the optimal stepsize choice is then written as The spectral stepsizeα k is updated bŷ The spectral stepsizeβ k is similarly estimated by ∆G k = z k − z k 0 and ∆u k = u k − u k 0 [25]. To guarantee convergence, τ k , γ k are bounded by where C cg is a (large) constant. Algorithm 1 summarizes all the steps.

Termination Criteria
For the given iterates (x k , z k , u k ), the primal and dual residuals are defined as It has been observed that these residuals approach zero as the algorithm approaches a true solution. The authors in [26] show that the pair (u k , z k ) satisfies optimality conditions for all k > 0. If the problem is also solvable, the residuals r k prim and r k dual will converge to zero. A termination criterion for detecting optimality is thus implemented by checking that r k prim and r k dual are small enough, i.e., r k prim < ε prim , r k dual = ε dual .

Acceleration Approaches
Another contribution of this paper is the GPU implementation of the proposed algorithms based on CUDA. In this section, we show how we parallized the proposed algorithm on CUDA and expose step by step the strategies that we used to achieve an optimal performance in our CUDA implementation of the proposed algorithm.

Parallelization
In the k-th iteration for updating z k+1 , we solve the nonlinear optimization problems (13). The procedure for solving each variable z k+1 i (i = 1, . . . , n + m) is independent. We parallelize the steps for updating z k+1 by To avoid solving the constrained problems, we solve the unconstrained problems since the objective function is convex.z k+1 i = arg miñ Listing 1 shows an overview of how we designed the kernel in the device on GPU. Each thread computes the new variables z k+1 i by (27) and (28) in parallel. To make the program more extendable, we created a base class named Cost to implement the different types of the cost function by the subclass in the device and pass the point to launch the kernel.

x + t h r e a d I d x . x ; i f ( t i d >= n + m) r e t u r n ; d f p _ s o l v e r (& c o s t [ t i d ] , &z [ t i d ] ) ; z [ t i d ] = min ( ub [ t i d ] , max ( z [ t i d ] , l b [ t i d ] )
; r e t u r n ; } In our implementation, we apply the quasi-Newton method for minimizing (27), w hich is a second-order method that helps to quickly find the optimal value (see Listing 2).

Remark 1.
When the function f i (x), i = 1, . . . , n is a linear or quadratic function, it is achievable to perform the matrix equilibration to transfer the matrixP into diagonal matrix. After that, we do not have any term likex ixj , which will reduce the number of iterations. However, for more complicated functions, such preconditioning will make updates of z k+1 difficult to be parallelized.

Do as Much as You Can on CUDA
CUDA is an extension of the C programming language created by NVIDIA. Its main idea is to have a large number of threads that solve a problem in parallel. To execute the program on GPU, we launch the kernels which are defined by the global keywords.
Before calling the kernel, CUDA needs the input data to be transferred from CPU to GPU through the PCI Express bus [27]. This stage of data transfer is a necessary part of any GPU code and can be the bottleneck affecting the program's performance. So once the data has been transferred to the device memory, it should not return to the CPU until all operations are finished. In our first CUDA implementation of the algorithm, this fact was not taken into account since the data is transferred from GPU to CPU during iterations. Therefore, the results of this first implementation are not outperform. This shows that direct implementations of not trivially parallelizable algorithms may initially disappoint the programmers expectations regarding GPU programming. This occurs regardless of the GPU used, which means that optimizations are necessary for this type of algorithms even when running on the latest CUDA architecture. We suggest finishing all the computing steps on GPU and avoid data copy as much as possible. First, we copy the matrixP,q,Ã andb to GPU and start the iteration. It is admirable to conduct the upgrade steps (11)− (14) on GPU at each iteration; however, when it comes to the steps for judging the stopping criteria, we have to copy the data from GPU to CPU, which will increase unnecessary time on data transmission. Usually, the stopping criteria is calculated by comparing some matrix or vector norms, so we suggest avoiding data transfer by calling the cuBLAS to compute the norms and output the result on a CPU, which will significantly reduce the time spending on data transmission (see Listing 1).

Use CUDA Libraries
There exist excellent libraries shipped with the CUDA Toolkit that implement various functions on the GPU. We summarize the NVIDIA libraries used in our paper.
cuBLAS is a CUDA implementation of BLAS, which enables easy GPU acceleration of code that uses BLAS functions. We use the level-1 API function for the computation of norms. Level-2 and level-3 API functions are used for the matrix-vector product and matrix-matrix product (see Listing 3). i f ( * r e s u l t < eps ) break ; } cuSolver is a high-level package based on the cuBLAS and cuSPARSE libraries. It is a GPU accelerated library for decompositions and linear equation for both dense and sparse matrices. We use the syevj to compute the spectrum of a dense symmetric system.

Constant Memory and Page-Locked Memory Usage
Constant memory is a read-only memory, so it cannot be written from the kernels. Therefore, constant memory is ideal for storing data items that remain unchanged along with the whole algorithm execution and are accessed many times from the kernels. In our implementation of the proposed algorithm, we store the constant parameters we need during iteration. These values do not depend on the FOF model and do not change along with the thread execution, so they are ideal for constant memory. Our tests indicate that the algorithm is 5% to 15% faster, depending on the model size when using constant memory.
Page-locked (or "pinned") memory is used as a staging area to transfer the device to the host. We can avoid the cost of the transfer between pageable and pinned host arrays by directly allocating our host arrays in pinned memory. Pinned memory is beneficial because it avoids copying data directly between the CPU and GPU. Listing 4 shows how we apply fixed memory and compute residual norm in conjunction with the CuBLAS library. Since the residual is stored on the GPU, we need to calculate its norm to determine the stopping criteria on the CPU.

GPU Occupancy
Occupancy is the ratio of active warps per multiprocessor to the maximum number of possible active warps. The highest occupancy is no guarantee for obtaining the best overall performance, but the low occupancy always reduces the ability to hide latencies, resulting in general performance degradation. Therefore, we perform an experimental test on the device to determine exactly the best number of threads per block for our algorithm.
The best values of serialized warps appear with a size of 128 threads and it achieves a maximum speedup. (See Listing 5) Figure 1 shows the roofline of the kernel for computing the variables z k+1 i . The roofline provides a visually intuitive way for users to identify performance bottlenecks and motivate code optimization strategies. Performance (GFLOP/s) is bound by GFLOP/s ≤ min Peak GFLOP/s Peak GB/s × Arithmetic Intensity (29) which produces the traditional Roofline formulation when plotted on a log-log plot. As can be seen from Figure 1, the performance of the kernel increases about 4.5× after optimization when solving an FOF model (N = 2000). As the scale of the model increases, the utilization for compute and memory resource of the kernel approximates to the theoretical maximum performance bottleneck.

Storage Block Matrix on Device
In practice, we do not have to store the whole matrixP,Ã,b andQ on CUDA. It is feasible to store the matrix in blocks since those matrices have many zero blocks. All of the matrix-vector or matrix-matrix operations presented in our algorithm could be done via block operation. For example,Q(D + τ k I) −1Q (τ k v k −q) could be calculated bỹ . For the diagonal matrix, it is stored as an array in GPU. The computation steps of D + τ k I could be done via vectorvector operation.

Experiment
We implemented our algorithms on CPU and parallelized them on GPU. All the programs run on a server with i9-10940X with 128 M RAM and one NVIDIA RTX2080ti, 11 GB memory. The server's CPU is i9-10940X, with 128 M memory and an NVIDIA RTX 2080ti GPU, 11 GB memory. The performance of the proposed methods is reported in comparison with the sequential quadratic programming (SQP) and the interior point method (IPM). We simulated the net value of N = 5n different funds with N ranging from 50 to 5000, which could be the maximum portfolio capacity for the funds. All the fund data is simulated by the Geometric Brownian motion (GBM), with different risk level σ with regard to the different type of funds. The following table shows the running time for solving the problems (1) of SQP and the CPU implementation of our proposed method. The relative error is calculated by the |x * − x * SQP |/x * fmincon , where x * and x * SQP represent the optimal value of the proposed method and the SQP, respectively. For smaller-scale problems, we ran numerical experiments 100 times and then averaged them, and for larger-scale problems, we ran the experiments 20 times.  Table 1 shows the running time and relative error for solving the portfolio proble with different methods. As N increases to more than 2 × 10 3 , it will be challenge for the commercial solver to solve the problems in an acceptable time (3 × 10 3 s). The proposed algorithm does not show an outstanding performance in the CPU version. Table 2 shows the GPU implementation of the proposed algorithm. By conducting the iteration in parallel, we reach a respectable acceleration.  Figure 2 shows the calculation time and relative error curves of the FOF model (2.1) solved by different methods. We implement this method in CPU(ADMM) and GPU(cuADMM). We can see that the CPU implementation of the proposed method is better than SQP and IPM. When solving problems of different scales, the relative error of this method decreases faster, and the solving speed of the GPU version accelerates significantly.  We also implemented a fixed stepsize (τ k ≡ 1) for comparison. For N = 200, the relative error of the fixed stepsize decreases very slowly . As the relative error decreases, the convergence speed becomes slower. The adaptive stepsize method converges and reaches the stopping criteria in only a few iteration.  When N is small (N < 200), the acceleration of the GPU's implementation is not obvious as the copying data step takes up most of the time to execute the code. As N increases, the proposed algorithm not only maintains a fast convergence ratio but also significantly outperforms the existing methods.

Conclusions
This paper solved the convex FOF optimization model. The ADMM algorithm helped us separate the quadratic terms and the nonlinear terms of the objective function. We solved the KKT linear system and the nonlinear optimization problem at each iteration and parallelized the proposed algorithm on CUDA to solve nonlinear optimization problems. The number of iterations could be significantly reduced by the adaptive stepsize, which enables us to apply the proposed method to solve large-scale problems faster. We also implemented the proposed methods on CUDA and reported the optimization techniques to maximize the number of utilized kernels and use the device's memory architecture. The GPU version showed a very satisfying speedup for large-scale problems. Our numerical experiment raised the dimension of the FOF optimization model into a new scale, enabling the investors to allocate assets in the broader range of funds. However, there still exist certain limitations. For example, if there is a correlation between the nonlinear transaction cost functions when constructing the FOF model, the model could be more challenging to solve and parallelize. Therefore, the research on how to solve the model in parallel and reach the maximum performance of the algorithm on CUDA remains to be studied.