l 1 -Regularization in Portfolio Selection with Machine Learning

: In this work, we investigate the application of Deep Learning in Portfolio selection in a Markowitz mean-variance framework. We refer to a l 1 regularized multi-period model; the choice of the l 1 norm aims at producing sparse solutions. A crucial issue is the choice of the regularization parameter, which must realize a trade-off between ﬁdelity to data and regularization. We propose an algorithm based on neural networks for the automatic selection of the regularization parameter. Once the neural network training is completed, an estimate of the regularization parameter can be computed via forward propagation. Numerical experiments and comparisons performed on real data validate the approach.


Introduction
Regularization parameter selection is one of the most essential tasks in solving largescale ill-posed problems. Problems of this kind arise in various applications, and generally, their solution requires some regularization; that is, the problem is substituted by a related one with better numerical properties. A common approach is to add a penalty term that enforces uniqueness and stability. The penalty term is tuned using a regularization parameter. It must realize a trade-off between fidelity to data and regularization. If the regularization parameter is too small, the model has numerical features similar to the unregularized one; on the other hand, the solution does not fit the original model if it is too big. In the context of Portfolio optimization, different regularization techniques have been suggested for the mean-variance Markowitz model [1]. Among these, l 1 penalization has been considered. This is an effective technique to obtain sparse portfolios that allow the investor to reduce both the number of positions to be monitored and the holding costs [2][3][4][5]. While the literature offers many methods for Tikhonov regularization, l 1 regularization parameter selection is often based on problem-dependent criteria and related to iterative empirical estimates, requiring a high computational cost. In [2], a least-angle regression algorithm is presented, which starts from large values and proceeds by decreasing them. In [3,5], the authors propose a modified version of the Bregman iteration procedure, which includes an adaptive rule for the selection of the regularization parameter, respectively, in the singleperiod and multi-period framework. In that algorithm, the starting value is very small and is increased within an iterative procedure. In this work, we explore the use of supervised Machine Learning (ML) for the automatic selection of the regularization parameter in the multi-period portfolio selection model presented in [5]. ML provide methods which are data-driven; it is actually extensively applied to Finance [6][7][8][9], in particular to the portfolio selection problem [10]. Several contributions concern the use of ML models to predict stock returns [11][12][13]. In [14], the authors adopt deep learning models to optimize the portfolio Sharpe Ratio directly. The idea of using ML for tuning parameters is investigated in [15,16].
In [15], the authors use deep learning networks to compute the regularization parameter for solving inverse problems, while in [16], ML techniques are applied to adjust the risk aversion coefficient in the portfolio optimization context. The aim of ML is to develop algorithms which can learn and progress over time and can be used for predictions. In particular, the goal of supervised learning is to predict the value of one or more outputs for a set of inputs. Thus, we aim at approximating f : X ⊆ R q −→ Y ⊆ R t with a function where f θ : X −→ Y is usually nonlinear and θ ∈ R q is a large set of unknown parameters. The learning phase uses a training set to produce a set of parameters θ that minimize a loss (or cost) function L that measures the accuracy of the predicted f θ (x) with respect to f (x). There are many kinds of loss functions in supervised learning, such as the square Euclidean distance, crossentropy, contrast loss, hinge loss, information gain, Poisson deviance, and so on [17].
In this paper, we consider the Neural Networks (NN) that have become particularly popular among Machine Learning methods [18,19]. NNs were originally conceived as models that would imitate the function of the human brain, that is, a set of neurons joined together by a bunch of connections. Neurons, in this context, are a weighted sum of inputs; they are intended as derived features. A nonlinear activation function is applied to neurons to compute the output (the target). These ideas have existed since the 1960s, but their power has been fully exploited with the advent of modern computing architectures. They have shown their ability to perform well on supervised learning tasks, mainly when training data are abundant. There are typically three parts in a NN: an input layer, one or more hidden layers, and an output layer. Each layer is a collection of neurons (units) connected with the previous layer with synapses (or weights) optimized during the training process. The network learns by examining data examples, generates a prediction for each unit, and makes adjustments to the weights whenever it makes an incorrect prediction. This process is repeated many times, and the network continues to improve its predictions until one or more stopping criteria have been met. The simplest networks contain no hidden layers and are equivalent to linear regressions.
The paper is organized as follows. In Section 2, we recall the multi-period l 1 -regularized model and the algorithm based on Bregman iteration used to solve it. In Section 3, we describe how NN is used for the selection of the regularization parameter. In Section 4, we show numerical experiments that validate our approach. Finally, in Section 5, some conclusions are given.

Multi-Period l 1 -Regularized Mean-Variance Markowitz Model
In this section, we recall the l 1 -regularized model for multi-period portfolio selection introduced in [5]. The investment strategy is either a medium or a long term one; thus, decisions can change according to the time evolution of available information.
In a multi-period setting, time is assumed to evolve continuously, but the investment period is split into m sub-periods, delimited by the so-called rebalancing dates j, j = 1, . . . , m; decisions are taken at rebalancing dates and kept within sub-periods. Let n be the number of assets and u T j ∈ n the portfolio held at the rebalancing date j. The optimal portfolio referred to the overall investment period is thus defined by the vector u = (u T 1 , u T 2 , . . . , u T m ) T ∈ N , where N = m · n; u j is the portfolio at time j; thus, the element u i+(j−1)·n is the amount invested on asset i at time j. We develop our model in a mean-variance Markowitz framework; thus, we aim at minimizing the risk of the strategy, imposing constraints on expected wealth. The multi-period portfolio modelling in a Markowitz framework has been extensively analyzed; we address to [20,21] and references therein for the theoretical aspects of the formulation of the Markowitz model in the multiperiod case.
We represent the risk measure as a sum of terms. Each one provides a risk estimate in one investment period, using information available at the beginning of the period. In particular, we consider the variance as a single-period risk measure. The risk of the strategy is then given by: where C j ∈ R n×n is the covariance matrix at time j, assumed to be positive definite. The choice of functional F(u) allows, on one hand, to deal with quadratic programming, on the other, to use a time consistent risk measure, as discussed in [22]. l 1 -regularization is applied to improve conditioning, to produce sparse portfolios [3][4][5]23,24]. The penalty term is weighted by means of the regularization parameter, denoted with τ in the following. The problem is a non-smooth optimization one, with equality constraints: ξ init is the invested amount, that is, constraint (1b) fixes the initial wealth. r j is the expected return vector estimated at time j, so equation (1c) imposes that the strategy is self-financing: the portfolio at time j is the revaluation of its value at time j − 1. ξ term is the target wealth that the investor gains when the investment ends, as stated in (1d). Since the target wealth is fixed only at the final date, it is assumed that the investor does not exit the investment before the end. Both the objective function and the constraints can be reformulated in compact form, leading to the following equivalent problem: where C = diag(C 1 , C 2 , . . . , C m ) ∈ R N×N is a m × m diagonal block matrix, A is a m × m lower bidiagonal block matrix, with blocks of dimension 1 × n: diag(A) = (1 T n , 1 T n , ..., 1 T n ) subdiag(A) = (−(1 n + r 1 ) T , ..., −(1 n + r m−1 ) T ) and b = (ξ init , 0, . . . , ξ term ) T ∈ R m . The linear system in (1) incorporates the temporal constraints (1a-c). In particular, the first and the last rows of the system state the requirements on the initial and the terminal wealth, respectively. Intermediate equations state the selffinancing property of the strategy. We assume that A is full rank to guarantee the existence of solutions [25]. In this work, we present a procedure based on Bregman iteration [26]. It is a well-established method for the solution of l 1 -regularized optimization problems, successfully applied in different fields, including finance [27]. Methods based on Bregman iteration have proved to be efficient for the solution of problem (1) as well [3,5,23,24]. Bregman iterative scheme requires at each step the solution of a l 1 -regularized unconstrained optimization subproblem, in which the functional J is replaced by its Bregman distance at the current iterate u D where p is a subgradient in the subdifferential of J at point u. Bregman iteration applied to problem (1) produces the Algorithm 1.

Algorithm 1 Bregman Iteration for Portfolio Selection (BIPS).
Given τ, λ k := 0 u 0 := 0, p 0 := 0 while not convergence do % solve the inner minimization problem Note that a cheaper version of the Bregman iteration can be obtained. Following [28], thanks to the linearity of the equality constraints, a simplified formulation can be derived; Equations (2) and (3) can be replaced by: In this version, the Bregman vectors b u k inside the quadratic penalty function enforce the equality constraints and allow the use of functional J in place of its Bregman distance.
A crucial issue in the solution of (1) is the choice of a suitable value for the regularization parameter τ that realizes a trade-off between sparsity (requiring sufficiently large values) and fidelity to data (requiring small values). Several approaches have been proposed, both in the single and multi-period case [2,3,5], all based on an adaptive rule to select the regularization parameter. The idea is to change the value of τ during the optimization process until a financial target is satisfied. In [5] a modified version of Algorithm 1, which we here denote with A-BIPS (Adaptive BIPS) is presented, developed in a multi-period setting. The algorithm A-BIPS produces an increasing sequence of values with µ > 1 and 0 < i ≤ k.
To make the Bregman distance associated with J k (u) = u T Cu + τ k u 1 at point u k well defined if τ k = τ k−1 , the subgradient p k must be updated as follows: before computations in (2) and (3). So, the method presented in [5] produces the optimal portfolio as well as the optimal parameter: where h is not greater than the total number of Bregman iterations. Numerical results show that the τ opt computed by procedure A-BIPS generally raise portfolios that exhibit a percentage of sparsity significantly greater than the required one; this could affect diversification. For this reason, in the next section, we propose a NN approach to approximate τ opt using a grid method based on Algorithm 1.

Neural Networks for Regularization Parameter Selection
The value h in (5) is unknown a priori, and it is strongly dependent on data and the financial target. In this work, we assume that the financial target is defined in terms of the maximum number of active positions in the optimal portfolio; that is, we require a minimum number N sparse of zeros in the solution. It is then reasonable to assume that h is a function of certain features of data, that is, we assume that a nonlinear target function f : X −→ Y exists, such that h = f (x), where X ⊆ R q , Y ⊆ N 0 and q is the number of features. The function f is unknown and could be complex; we employ ML techniques to derive its approximation. In this framework, deep NNs are a natural candidate to perform this task since they are known as universal function approximators [29].
It is mandatory to select features containing significant statistical information on data to train the network effectively. Let n be the number of assets and M the length of the return time series. The dataset is represented by the matrix R = (R * 1 , R * 2 , . . . , R * n ) ∈ R M×n , which has the complete return time series of each asset along the columns. We denote with r ∈ R n the vector of asset average returns. In this work, we select q = 6 features that are significant in a mean-variance framework, where only the first and second-order moments are considered, and take into account the financial target. The feature vector is defined component-wise as: where E is the expectation operator. More in details, x 1 and x 2 are respectively the maximum and the mean of the asset average returns; x 3 and x 4 are respectively the maximum and the mean standard deviations of asset returns; x 5 is the maximum covariance of asset returns, and finally x 6 is the desired sparsity degree in the solution.
The NN requires a suitable set, containing L samples {x l , h l }, l = 1, . . . , L, which is split into the training and the testing sets. To build it, we generate L equally sized subsets of asset returns and compute the related features x. For each subset we iterate Algorithm 1 to compute the corresponding value h opt . In more detail, starting from τ = τ 0 we compute the solution to problem (1); if the financial target is not met, we increase τ according to (4) and solve the problem again. The procedure is summarized in Algorithm 2.

Algorithm 2 Iterative BIPS (I-BIPS).
Once the training set has been obtained, we use NNs to approximate the functional relationship between x and h. The procedure is described in Algorithm 3.
In particular, we use a deep NN, described below. In this section, we introduce deep NNs in a general setting. The deep NNs provide a sequence of stacked layers where each layer processes as input the output of the previous one. Let D ∈ N be the number of hidden layers and q k ∈ N for k = 1, . . . , D be a sequence of integers indicating the number of units in each layer; let q 0 = q and q D = 1, the mechanism behind a deep learning network can be formalized as follows: where w (k) 0 ∈ R q k , W (k) ∈ R q k ×q k−1 are the weights and φ k (·) for k = 1, . . . , D denote the activation functions. Possible choices for the activation function are: The performance of the network depends on the value of the weights w (k) l,j , which must be properly calibrated. This training process involves an unconstrained optimization problem where, chosen a suitable loss function L(w (k) l,j , ·), its minimum is sought. The backpropagation algorithm is the most used for the training of feed-forward NNs. The algorithm compares the predicted values against the desired ones (objective) and modifies the weights by back-propagating the gradient of the loss function. A first-order stochastic gradientbased optimization algorithm, based on adaptive estimates of lower-order moments, was used [30]. However, when D is very large, the gradients often get smaller and smaller and approach zero, which eventually leaves the weights of the first layers unchanged. This issue is known as the vanishing gradients problem. To fight the vanishing gradient problem that affects deeper learning networks, skip connections were introduced [31]. They are additional connections that directly connect the input layer to the output layer and concatenate the input data with the output of the last hidden layer. In case of skip connections, the output layer reads: where w (skip) ∈ R q 0 are the weights associated to the input in the output layer and ·, · denotes the scalar product. Several possible choices for the loss function L are possible.
Since we desire to model the number of iterations h, the response variable is an integer. For these kinds of problems, some authors have successfully applied the Poisson loss function to train the network [32,33]. Following [34], the Poisson loss of the model can be written as follows:

Numerical Experiments
In this section, we present numerical experiments to validate the NN-based approach. We run all the tests on a desktop computer with an Intel(R) Core(TM) i9-8950HK CPU @2.90 GHz 16.00 GB RAM. We consider the data provided in [35,36] that contain return time series of assets belonging to several major stock markets across the world, cleaned from errors.
In particular, we focus on the assets that make up the S&P 500 index. We simulate 10 years investment strategies where the investor revises decisions once a year. In our tests, the target sparsity percentage varies in {30%, 35%, 40%, 45%, 50%, 55%, 60%}. For each value of target sparsity, we generate a learning sample of 1000 portfolios by applying the procedure described in Section 3 on the full set of S&P 500 assets, with n = 100. Then, the overall learning sample contains 7000 portfolios, among which the 75% makes the training set (L train = 5250) and the remaining 25% is used as testing set (L test = 1750).
Tests have been performed in MatlabR2020 and R environments. To solve the minimization problem (4) we use the Fast Proximal Gradient method implemented in FOM, a Matlab toolbox containing a collection of first-order methods for solving mainly convex optimization problems [37]. The implementation of the NNs has been carried out by using the Keras package [38].

Neural Network Calibration
When NNs are employed to approximate functions, the values to be assigned to hyper-parameters such as number of layers, number of units per layer, activation function, and others must be chosen carefully. For example, too many layers and/or units per layer can lead to over-fitting, while few layers and/or units can lead to under-fitting. For this reason, these parameters are typically calibrated on data. We carried out a preliminary analysis to explore how these hyper-parameters affect the performance of the NN models. In particular, we analyzed the role of The activation function φ ∈ P = { relu , sigmoid , tanh } (where φ = φ k for all the hidden layers k = 1 . . . , D). In summary, we consider 36 = |D × Q × P | different NN architectures. Furthermore, the analyses of a single NN training could not be sufficient to make a judgment since it could vary among different training attempts (depending on the initial value of the optimization algorithm and on the random selection of batches of training data to calculate the gradients used in back-propagation). So, we fit each network 10 times and compare the results of the different architectures. In each run, the network weights are recursively adjusted to minimize the Poisson loss between the predicted and the reference values. The models were each fit for 500 epochs, and the model with the best performance during these epochs, measured on the validation set, was used. The validation set consists of the last 5% of the sample, which means that the networks fit 95% of the training examples, and performance was assessed on the remaining 5%.
We first compare the NN architectures analyzing the loss function values on the validation set in the 10 different training attempts. Figure 1 shows the boxplot of the Poisson loss in the validation set for the other architectures.  We observe that the sigmoid-based architectures present the largest validation loss. This evidence suggests that the sigmoid activation function is less suitable for describing the phenomenon under investigation with respect to the others. Moreover, we observe that considering the relu and tanh architectures, the performance on the validation set improves when the number of layers and units per layer increases. This result indicates that the additional parameters induced by the additional layers and units give more flexibility to the model and improve its out-sample performance. In the experiments shown in the following, we use the network that produces the best performance on the validation set and, therefore, this is expected to be the network with the best out-sample accuracy. In particular, we observe that the best performance validation set can be obtained using a four-layer architecture with 2 6 units per layer and the relu activation function.

Neural Network-Based Strategy Performance
In this section, we compare the solutions of the portfolio selection problem obtained with Algorithms 2 and 3. This comparison is carried out by considering the values of τ, the percentage of sparsity in the solution, and the Sharpe Ratio (SR) [39]. The SR is the ratio between the average of the portfolio's expected return and its standard deviation, which measures the risk. Since one would maximize return and minimize risk, great values of Sharpe Ratio are desirable. In the multi-period framework, we compute it in the following way: where p = (p 1 , ..., p m ), and We analyze the out-sample performance of the NN model selected on the test set. First, we investigate the differences between the τ obtained through the I-BIPS strategy and those calculated via NNPS. In Figure 2, the densities of the residual distributions of τ for the testing and training sets are illustrated. The figure shows that both the distributions are centred in zero, thus suggesting that there is not a systematic component in the errors. This behaviour occurs for both the testing and training set. A graphical comparison of the values of τ is depicted in Figure 3 which compares the value of τ produced by Algorithms 2 and 3 on the testing set. Again, the comparison is carried out for different levels of the target sparsity. According to the theoretical results, the obtained τ increase with the required sparsity for both methods. Moreover, we observe that the distributions have similar positions for all the considered sparsity levels. In this regard, we performed some statistical tests to compare the mean values of the two distributions. We use the paired samples t-test, and Table 1 reports the results of the statistics and the respective p-values for the different sparsity levels. The null hypothesis is that the true difference between the means of the two distributions is zero, while the alternate hypothesis is that it is different from zero. The p-values are rather high in most cases, and we can conclude that the average τ are not significantly different. The only case where the p-value appears low is for target sparsity equal to 40%. However, the graphical inspection of the whole τ distributions in Figure 3 and the distributions of the model residuals in Figure 2 do not reveal any particular criticisms also for this level of sparsity.  In Figure 4, we analyze the percentage of sparsity realized by Algorithm 2 versus the one realized by Algorithm 3. We observe that the boxplots of the two methods have a similar median; however, in some cases, the NNPS method produces a lower level of sparsity with respect to the target sparsity. The boxplots of the two methods show a similar median; however, the variability of the NNPS appears greater than the variability of I-BIPS. We also note that, in some cases, the NNPS method produce sparsities marginally lower than the target one; this happens when the NN underestimates τ opt . Finally, in Figure 5, we compare the Sharpe Ratio of the optimal portfolios produced by the two methods. We observe similar distributions of the Sharpe Ratio for all the values of the target sparsity.

Comparison with the Adaptive Strategy
In this section, we compare the results of the NNPS procedure with those obtained with the A-BIPS one in terms of values of τ, sparsity, Sharpe Ratio and computational cost. Figure 6 shows the boxplots of the τ produced by the two methods for different target sparsity levels in the testing set. The A-BIPS produces some outliers, so we use the logarithmic scale to improve the readability of the Figure  Due to the changes of the regularization parameter within the optimization procedure, A-BIPS produces greater values of τ with respect to those obtained by NNPS for the same sparsity target. The difference between the obtained τ seems more significant when required sparsity levels are higher. Larger values of the regularization parameter also induce effects on the realized sparsity. Figure 7 compares the distributions of realized sparsity for different target sparsity levels.
As expected, we observe that the A-BIPS method generally produces more sparse solutions than the NNPS. On the one hand, this effect guarantees the achievement of the required financial target, but, on the other hand, it could penalize the risk diversification. Figure 8, which compares the SR of the two methods, confirms this argument. In particular, we observe that the median SR of the NNPS is always larger than the one resulting from the A-BIPS method for all sparsity degrees. Furthermore, this difference increases when we consider higher target sparsity levels.
This result is further confirmed by Table 2. It lists, for each level of sparsity, the average SR obtained with NNPS (second column) and A-BIPS (third column), and the percentage of cases in which the SR produced by NNPS is greater than the one resulting from A-BIPS. We observe that the average SR of NNPS is always greater than that of A-BIPS for all levels of sparsity. Furthermore, the success rate is above 78%. Table 2 also lists the computational times required by the two methods to solve the portfolio selection problem. We observe that the NNPS outperforms the A-BIPS method from a computational cost point of view. This result is quite reasonable since NNPS solves the problem by using the τ produced by the neural network, while the A-BIPS approach runs an adaptive approach to identify the optimal τ.  We finally test Algorithm 3 on a different, more recent dataset available on the Kaggle repository (https://www.kaggle.com/ (accessed on 16 January 2022)), which provides the monthly asset returns of the SP500 assets from January 2017 to December 2021. In this case, m = 5 rebalancing dates are considered. We perform the same comparison analysis on Algorithms 2 and 3. The results are reported in Table 3. The two methods are compared in terms of (average) τ, realized sparsity and SR. The results confirm the behaviour observed in the previous tests.

Conclusions
In this work, we discussed the application of Deep Learning to select the regularization parameter in l 1 regularized portfolio optimization. Preliminary results show the effectiveness of our approach. The NN-based approximation seems to accurately capture the relation between the selected features and the optimal regularization parameter. Optimal portfolios exhibit satisfying financial properties; however, we observed in some cases the underestimate of the optimal regularization parameter, which leads to a sparsity level that is slightly lower than the target one. This issue suggests to explore the use of asymmetric loss function that penalizes more underestimates than overestimates. Moreover, results show that the proposed algorithm often outperforms an existing method for the same problem.  [35,36].