Abstract
Automatic differentiation (AD) is a general method for computing exact derivatives in complex sensitivity analyses and optimisation tasks, particularly when closed-form solutions are unavailable and traditional analytical or numerical methods fall short. This paper introduces a vectorised formulation of AD grounded in matrix calculus. It aligns naturally with the matrix-oriented style prevalent in statistics, supports convenient implementations, and takes advantage of sparse matrix representations and other high-level optimisation techniques that are not available to its scalar counterpart. Our formulation is well-suited to high-dimensional statistical applications, where finite differences (FD) scale poorly because the computation must be repeated for each input dimension, resulting in significant overhead. It is also advantageous in simulation-intensive settings, such as Markov chain Monte Carlo (MCMC)-based inference, where FD requires repeated sampling and multiple function evaluations, whereas AD computes exact derivatives in a single pass, substantially reducing computational cost. Numerical studies are presented to demonstrate the efficacy and speed of the proposed AD method compared with FD schemes.
Keywords: automatic differentiation; derivative computation; matrix calculus; MCMC; MLE; optimisation; simulation-based inference
JEL Classification: C11; C53; E37
1. Introduction
Automatic differentiation (AD) has become a foundational tool in modern statistical computing, enabling efficient and exact gradient computation in a wide range of applications—from parameter estimation [1,2] and sensitivity analysis [3,4,5] to simulation-based methods such as variational inference and Markov-Chain Monte Carlo (MCMC) inference [6,7,8,9,10,11]. Its widespread adoption is evident in major software ecosystems such as PyTorch [12], TensorFlow [13], Stan [14], and Julia [7], where AD powers both machine learning workflows and traditional statistical methods.
AD works by transforming a program that computes the value of a function into one that also computes its derivatives by systematically applying the chain rule to elementary operations. This allows AD to compute derivatives with machine-level precision and minimal overhead, avoiding truncation and round-off errors and eliminating the need for repeated function evaluations, a known bottleneck in numerical differentiation, especially in high-dimensional problems [8,15,16]. While symbolic differentiation can provide exact derivatives, it requires closed-form expressions and cannot handle procedural logic (e.g., if-else statements and for-loops) or stochastic elements such as random number generation or Monte Carlo simulations. AD bridges this gap, offering a robust and general-purpose solution for derivative computation.
Early implementations of AD relied on operator overloading [17] and source code translation [18] techniques that, while powerful, had notable limitations. Operator overloading incurs significant runtime overhead and is inherently local, recording operations as they occur without access to global program structure or opportunities for optimisation. In contrast, compiler-based systems transform entire programs automatically but often make it difficult to selectively extract intermediate values for debugging or inspection. Modern AD frameworks improve on these predecessors by adopting either eager execution (as in PyTorch 1 and 2) or static computational graphs (as in TensorFlow 1). These approaches offer greater flexibility, traceability, and support for modular development and deep introspection. However, the eager mode requires users to structure their code in specific ways; for example, making explicit calls such as backward() and zero_grad() in PyTorch can appear rigid and error-prone. Moreover, because it operates step-by-step, the eager approach often fails to exploit the broader structure of computations, such as block matrix operations, thereby missing optimisation opportunities. Conversely, while static graph systems are more declarative and amenable to global analysis, they can struggle with dynamic control flow and runtime-dependent logic. In both paradigms, the need to conform to AD-specific programming idioms often shifts user attention away from the statistical problem itself and toward the mechanics of the AD system.
In this work, we present a vectorised formulation of AD grounded in the matrix calculus of [19], designed to align more naturally with the matrix-oriented style prevalent in the field of statistics and the statistical programming language R. Our approach mirrors the derivation style of analytical work, enabling clearer and more intuitive implementations. It also exposes opportunities for high-level optimisation, including the use of sparse matrix representations and block-wise computations, features often inaccessible in bottom-up, scalar-based AD systems. This formulation supports transparent complexity analysis and efficient implementations, particularly in settings involving Kronecker products. It also enables fully automatic workflows, akin to source code translation techniques, while preserving the ability to inspect intermediate variables as in eager execution and graph-based systems.
We introduce a complete set of matrix calculus rules for building an AD system tailored to statistical applications, including operations for random variable simulations and structural transformations, many of which are undocumented in the existing literature. We also introduce the sparse representation of transformation matrices and discuss a range of optimisation techniques applied to the AD system to achieve significant performance gains in practice, which we demonstrate through comparisons with finite differences (FD). As an illustration, we apply the proposed methods using a real data example: a factor model estimated using simulated maximum likelihood, a setting commonly encountered when modelling dependence structures in complex data. The numerical results confirm the computational advantage of the proposed vectorised AD, particularly for simulation-intensive functions where FD incurs unnecessary repeated calculations.
The remainder of the paper is organised as follows: Section 2 introduces the core mechanism of the AD system and presents the full set of matrix calculus rules. Section 3 discusses optimisation strategies and evaluates computational performance, while Section 4 details the application. All code listings are provided in the Appendix B.
2. Materials and Methods
2.1. AD via Vectorisation
2.1.1. From Vector Calculus to Matrix Calculus via Vectorisation
Our AD formulation builds on a set of vector calculus rules rather than elementary scalar calculus [19]. Before presenting the full framework, we introduce three key definitions: Definition 1 defines the derivative of a vector-valued function; Definition 2 introduces the vectorisation operator; and Definition 3 combines the first two to define the derivative of a matrix-valued function.
Definition 1.
Suppose $f: \mathbb{R}^n \to \mathbb{R}^m$ and $x \in \mathbb{R}^n$; then the Jacobian matrix J of $f$ at $x$ is the $m \times n$ matrix with $(i, j)$ entry given by
$$J_{ij} = \frac{\partial f_i(x)}{\partial x_j},$$
where $f_i$ and $x_j$ are the components of $f$ and $x$.
Definition 2.
Let A be an $m \times n$ matrix and $a_j$ its j-th column. Then $\mathrm{vec}(A)$ is the $mn \times 1$ column vector (i.e., $\mathrm{vec}$ stacks the columns of A):
$$\mathrm{vec}(A) = (a_1^\top, a_2^\top, \ldots, a_n^\top)^\top.$$
Note that $\mathrm{vec}(ABC) = (C^\top \otimes A)\,\mathrm{vec}(B)$, where A, B, and C are three matrices with appropriate dimensions such that the matrix product is well-defined; $C^\top$ denotes the transpose of C, and $C^\top \otimes A$ denotes the Kronecker product of $C^\top$ and A.
Definition 3.
Let $F: \mathbb{R}^{p \times q} \to \mathbb{R}^{m \times n}$ be a real matrix function; the Jacobian matrix of F at X is defined to be the $mn \times pq$ matrix:
$$\mathsf{D}F(X) = \frac{\partial\, \mathrm{vec}\, F(X)}{\partial\, (\mathrm{vec}\, X)^\top}.$$
For notational convenience, we write $\frac{\partial\, \mathrm{vec}\, F(X)}{\partial\, (\mathrm{vec}\, X)^\top}$ as $\frac{\partial F(X)}{\partial X}$. In the definition above, the numerator is always treated as a column vector and the denominator as a row vector. This allows us to write the derivative as $\frac{\partial F(X)}{\partial X}$ without ambiguity, rather than the more cumbersome $\frac{\partial\, \mathrm{vec}\, F(X)}{\partial\, (\mathrm{vec}\, X)^\top}$. Indeed, since $\frac{\partial\, \mathrm{vec}\, X}{\partial\, (\mathrm{vec}\, X)^\top} = I_{pq}$, the identity matrix of dimension $pq$, accepting X as $\mathrm{vec}\, X$ in the denominator amounts to a notational simplification:
$$\frac{\partial F(X)}{\partial X} = \frac{\partial\, \mathrm{vec}\, F(X)}{\partial\, (\mathrm{vec}\, X)^\top}\,\frac{\partial\, \mathrm{vec}\, X}{\partial\, (\mathrm{vec}\, X)^\top} = \frac{\partial\, \mathrm{vec}\, F(X)}{\partial\, (\mathrm{vec}\, X)^\top},$$
aligning with the definition above.
A key advantage of Definition 3 is that it allows higher-order matrix derivatives to remain within the familiar matrix framework, rather than escalating into high-order tensors, e.g., the Hessian of F at X is also a matrix. This simplifies notation and facilitates the use of matrix algebra to exploit structure in Jacobian and Hessian matrices, making the formulation more efficient. For further discussion and critique of alternative matrix derivative conventions—including the numerator and denominator layouts—see [19].
Once derivatives are defined for the basic operations, they can be propagated through a computation using the matrix-based chain rule. Suppose that at a certain stage, we have already computed matrices A and B, along with their derivatives $\frac{\partial A}{\partial X}$ and $\frac{\partial B}{\partial X}$ with respect to some input X. The next step of the computation involves evaluating a new matrix $C = F(A, B)$, where F is differentiable in all parameters. Using the chain rule in vectorised form, the derivative of C with respect to X is given by:
$$\frac{\partial C}{\partial X} = \frac{\partial\, \mathrm{vec}\, F}{\partial\, (\mathrm{vec}\, A)^\top}\,\frac{\partial A}{\partial X} + \frac{\partial\, \mathrm{vec}\, F}{\partial\, (\mathrm{vec}\, B)^\top}\,\frac{\partial B}{\partial X}. \tag{1}$$
In our formulation, any computation that can be decomposed into a sequence of basic matrix operations—each admitting a tractable and well-defined derivative—can be differentiated efficiently using the chain rule. These operations include (i) basic matrix arithmetic such as addition, subtraction, product, inverse, and Kronecker product; (ii) element-wise arithmetic such as Hadamard product/division and element-wise univariate differentiable transformations; (iii) scalar-matrix arithmetic such as scalar-matrix addition, subtraction, multiplication, and division; (iv) structural transformations such as extracting elements and rearranging or combining matrices; and (v) operations on matrices such as Cholesky decomposition, column/row sum, cross-products, transposition of cross-products, determinants, and traces. They will be presented in Section 2.1.3.
2.1.2. Dual Construction
We present an implementation of an AD system that can differentiate any multivariate matrix polynomial to illustrate the underlying logic of our AD formulation. Let A and B be $m \times n$ and $n \times q$ matrices, let $I_k$ denote the $k \times k$ diagonal (identity) matrix, and consider the following two matrix calculus rules:
$$\frac{\partial (A + B)}{\partial X} = \frac{\partial A}{\partial X} + \frac{\partial B}{\partial X}, \tag{2}$$
$$\frac{\partial (AB)}{\partial X} = (B^\top \otimes I_m)\,\frac{\partial A}{\partial X} + (I_q \otimes A)\,\frac{\partial B}{\partial X}. \tag{3}$$
To implement these rules, we first attach to each matrix a dual component that stores its derivative with respect to some chosen parameters (i.e., we let $\tilde{A} = (A, \partial A / \partial X)$) and refer to these pairs as dual matrices. For example, if the parameters are the entries of A and B, i.e., $X = (\mathrm{vec}(A)^\top, \mathrm{vec}(B)^\top)^\top$, then $\partial A / \partial X = [I_{mn}, 0]$; similarly, $\partial B / \partial X = [0, I_{nq}]$. We can then define the arithmetic for dual matrices using (2) and (3):
$$(A, \partial A) \oplus (B, \partial B) = \big(A + B,\ \partial A + \partial B\big),$$
$$(A, \partial A) \otimes_{\mathrm{d}} (B, \partial B) = \big(AB,\ (B^\top \otimes I_m)\,\partial A + (I_q \otimes A)\,\partial B\big),$$
and program them as shown in Listing 1.
These 16 lines of code define an AD system that can handle the class of multivariate matrix polynomials formed by addition and multiplication. For example, the derivative of the function $f(A, B) = A(AB + B^2) + B$ is simply the one-line

df <- function(A, B) A %times% ((A %times% B) %plus% (B %times% B)) %plus% B

rather than a tedious program associated with the analytical derivative
$$\frac{\partial f}{\partial X} = \big[(AB + B^2)^\top \otimes I_n + (I_n \otimes A)(B^\top \otimes I_n)\big]\frac{\partial A}{\partial X} + \big[(I_n \otimes A)^2 + (I_n \otimes A)(B^\top \otimes I_n + I_n \otimes B) + I_{n^2}\big]\frac{\partial B}{\partial X}$$
(implemented as df_AF in Listing A2 of Appendix B).
Readers encountering AD for the first time may be surprised that the program df above appears to compute f itself, rather than its derivative—which is precisely what makes AD so appealing. Given the addition and multiplication operators defined for dual matrices, any function constructed using these operators will automatically have its derivative computed. Specifically, derivatives are evaluated on the fly each time %times% or %plus% is called. The final output of df is a dual matrix, where the first component is the result f(A, B), and the second component is the derivative of f evaluated at (A, B). This approach abstracts away the derivative calculation, allowing users to obtain derivatives automatically once the function f is implemented. To complete the system, subtraction and inverse operations are also required (also 16 lines). To maintain the flow, we list them in Appendix B, along with a complete working example.
Listing 1. Implementation of the sum and product matrix calculus rules in R.

`%plus%` <- function(A_dual, B_dual) {
  A <- A_dual$X; dA <- A_dual$dX
  B <- B_dual$X; dB <- B_dual$dX
  list(X = A + B, dX = dA + dB)
}

`%times%` <- function(A_dual, B_dual) {
  A <- A_dual$X; dA <- A_dual$dX
  B <- B_dual$X; dB <- B_dual$dX
  list(
    X = A %*% B,
    dX = (t(B) %x% I(nrow(A))) %*% dA + (I(ncol(B)) %x% A) %*% dB
  )
}

I <- diag   # function to create diagonal matrices
2.1.3. A Layered Approach to Construction
In the previous section, we showed that formulating AD with dual matrices closely mirrors the underlying analytic derivation. Once the calculus rules for dual matrices are established, building an AD system becomes relatively straightforward. In this section, we present the full set of matrix calculus rules. The rules are grouped by type and presented in the order they would typically be implemented in practice. Illustrative statistical applications are provided in the Appendix.
In the following discussion, the derivative of any matrix is assumed to be taken w.r.t. some input z with d parameters. Hence, if A is an $m \times n$ matrix, then $\partial\, \mathrm{vec}(A) / \partial z^\top$ is an $mn \times d$ Jacobian matrix, and this shall be written as $\partial A$ for the sake of convenience.
2.1.4. Notation
The following symbols are reserved for special matrices:
- $I_n$ is the $n \times n$ identity matrix.
- $I_{m \times n}$ is the $m \times n$ matrix where the entries on the diagonal are all ones, and the entries off the diagonal are all zeros.
- $K_{mn}$ is the $mn \times mn$ commutation matrix. We also define $K_n := K_{nn}$.
- $E_n$ is the elimination matrix.
- $\mathbf{1}_{m \times n}$ is the $m \times n$ matrix of ones.
(Definitions of the commutation and elimination matrices can be found in Section 2.2.4 and Section 2.2.5, respectively.)
Let A be an $m \times n$ matrix. We denote
- the $(i, j)$-entry of A by $A_{ij}$ or $A[i, j]$,
- the i-th row of A by $A_{i\cdot}$ or $A[i, ]$,
- the j-th column of A by $A_{\cdot j}$ or $A[, j]$.
We define $u(i, j)$ and $(i(k), j(k))$ such that they satisfy the relations
$$(\mathrm{vec}\, A)_{u(i, j)} = A_{ij}, \qquad (\mathrm{vec}\, A)_k = A_{i(k), j(k)},$$
and they are given by the formulas
$$u(i, j) = (j - 1)m + i, \qquad i(k) = ((k - 1) \bmod m) + 1, \qquad j(k) = \lceil k / m \rceil.$$
This index-conversion function is needed because the derivative of the $(i, j)$-entry of A is stored in the $u(i, j)$-th row of $\partial A$, and vice versa.
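These conversions translate directly into code. The following is a minimal sketch in base R; the function names u_index and inv_index are ours and are used only for illustration.

# Position of A[i, j] in vec(A), and the inverse mapping (a sketch)
u_index <- function(i, j, m) (j - 1) * m + i
inv_index <- function(k, m) c(i = (k - 1) %% m + 1, j = (k - 1) %/% m + 1)

A <- matrix(1:12, nrow = 3)                       # m = 3, n = 4
A[2, 3] == as.vector(A)[u_index(2, 3, m = 3)]     # TRUE
inv_index(8, m = 3)                               # i = 2, j = 3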
2.1.5. Matrix Arithmetic
We present the matrix calculus rules associated with basic matrix arithmetic:
- Addition: Let A and B be $m \times n$ matrices; then $\partial(A + B) = \partial A + \partial B$.
- Subtraction: Let A and B be $m \times n$ matrices; then $\partial(A - B) = \partial A - \partial B$.
- Product: Let A and B be $m \times n$ and $n \times q$ matrices; then
$$\partial(AB) = (B^\top \otimes I_m)\,\partial A + (I_q \otimes A)\,\partial B.$$
- Inverse: Let A be an invertible $n \times n$ matrix; then $\partial(A^{-1}) = -(A^{-\top} \otimes A^{-1})\,\partial A$.
- Kronecker product: Let A and B be $m \times n$ and $p \times q$ matrices; then
$$\partial(A \otimes B) = (I_n \otimes K_{qm} \otimes I_p)\big[(I_{mn} \otimes \mathrm{vec}\, B)\,\partial A + (\mathrm{vec}\, A \otimes I_{pq})\,\partial B\big].$$
(A numerical check of the underlying vec identity is sketched after this list.)
- Transpose: Let A be an $m \times n$ matrix; then $\partial(A^\top) = K_{mn}\,\partial A$.
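The Kronecker-product rule rests on the identity $\mathrm{vec}(A \otimes B) = (I_n \otimes K_{qm} \otimes I_p)(\mathrm{vec}\, A \otimes \mathrm{vec}\, B)$. The following base-R check is a sketch only; the helper `commutation` is our own name (ADtools has its own sparse constructor).

# Check vec(A %x% B) == (I_n %x% K_{q,m} %x% I_p) (vec A %x% vec B)
commutation <- function(m, n) {                  # K s.t. K %*% vec(A) == vec(t(A)) for m x n A
  K <- matrix(0, m * n, m * n)
  for (i in 1:m) for (j in 1:n) K[(i - 1) * n + j, (j - 1) * m + i] <- 1
  K
}
set.seed(1)
m <- 3; n <- 2; p <- 4; q <- 3
A <- matrix(rnorm(m * n), m, n)
B <- matrix(rnorm(p * q), p, q)
lhs <- as.vector(A %x% B)
rhs <- (diag(n) %x% commutation(q, m) %x% diag(p)) %*% (as.vector(A) %x% as.vector(B))
max(abs(lhs - rhs))                              # ~ 1e-16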
We now examine the computational advantages of AD relative to FD through a complexity analysis of basic matrix operations. Our discussion is restricted to central (finite) differencing instead of forward differencing, since the former has superior accuracy ([20], pp. 378–379) and is generally preferred over the latter. Applying central differencing to a function f incurs a cost of evaluating f multiplied by twice the dimension of the inputs. (For ease of discussion, we assume that conditional branches, if they exist, have the same order of computational complexity.) Applying AD incurs a cost of evaluating f and the derivative as given by the calculus rules.
For matrix arithmetic, suppose A and B are $n \times n$ matrices, so the dimension of the inputs is $2n^2$ for matrix additions, subtractions, products, and Kronecker products, and $n^2$ for matrix inversions. The number of operations associated with applying central differencing to an operation is therefore twice the input dimension multiplied by the cost of evaluating the operation.
Assuming standard matrix multiplication, the computational costs of addition, subtraction, multiplication, and the Kronecker product are of leading order $n^2$, $n^2$, $n^3$, and $n^4$, respectively. The number of operations for matrix inversion using the LU decomposition followed by the inversion of triangular matrices is of order $n^3$ [21]. It then follows that applying finite differencing requires (in the leading order) $O(n^4)$ operations for addition and subtraction, $O(n^5)$ operations for the product, $O(n^5)$ operations for matrix inversion, and $O(n^6)$ operations for the Kronecker product.
For AD, the number of operations works out to be of the same orders: $O(n^4)$ for addition and subtraction, $O(n^5)$ for multiplication, $O(n^5)$ for matrix inversion, and $O(n^6)$ for the Kronecker product. The results are summarised in Table 1. Note that when a Kronecker product is post-multiplied by a matrix, there is a shortcut that avoids the explicit computation of the Kronecker product; the details are given in Section 2.2.7. From Table 1, we observe that finite differencing and AD have the same complexity for the operations listed in the table, but AD generally has equal or better leading coefficients, except for the matrix inversion.
Table 1.
Comparison of the number of operations (in terms of the leading order) needed in finite differencing and AD.
2.1.6. Element-Wise Arithmetic
We now present matrix calculus rules for element-wise operations. These rules follow directly from applying scalar calculus to each entry independently. The cases of addition and subtraction are identical to those covered in the previous section on standard matrix arithmetic. Let A and B be $m \times n$ matrices, and let $\mathrm{diag}(v)$ be the square matrix in which the vector $v$ is placed on the diagonal.
- Hadamard product: $\partial(A \circ B) = \mathrm{diag}(\mathrm{vec}\, B)\,\partial A + \mathrm{diag}(\mathrm{vec}\, A)\,\partial B$. Alternatively, the diagonal matrices need not be formed explicitly; the same result is obtained by scaling the rows of $\partial A$ and $\partial B$ element-wise.
- Hadamard division: $\partial(A \oslash B) = \mathrm{diag}\big(\mathrm{vec}(1 \oslash B)\big)\,\partial A - \mathrm{diag}\big(\mathrm{vec}(A \oslash (B \circ B))\big)\,\partial B$. Alternatively, the same row-scaling implementation applies.
- Univariate differentiable function f: $\partial f(A) = \mathrm{diag}\big(\mathrm{vec}\, f'(A)\big)\,\partial A$, where $f(A)$ denotes applying f element-wise to A. Note that f may also be a function that is differentiable almost everywhere, e.g., the ReLU function $f(x) = \max(x, 0)$ is differentiable everywhere except at $x = 0$. When the derivative is evaluated at the non-differentiable locations, it is common to use a subgradient [8] (in this case, any value in the interval $[0, 1]$) or simply to assume a value such as 0, effectively treating these points as having no impact on the result.
2.1.7. Scalar-Matrix Arithmetic
Let A be an $m \times n$ matrix and c be a scalar. Then the derivatives $\partial(A \odot c)$ and $\partial(c \odot A)$, where $\odot \in \{+, -, \times, \div\}$, can be computed by lifting the scalar c to a matrix of the same dimension as A via multiplication with a matrix of ones, $c\,\mathbf{1}_{m \times n}$. This allows the scalar-matrix operation to be treated as an element-wise operation. Importantly, this lifting is a conceptual construct, and the implementation need not construct the matrix $c\,\mathbf{1}_{m \times n}$; the operation can be performed element-wise directly. For instance, the product rule becomes
$$\partial(cA) = \mathrm{vec}(A)\,\partial c + c\,\partial A,$$
where $\partial c$ is the $1 \times d$ derivative of the scalar c.
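Written in the dual style of Listing 1, the scalar-matrix product rule becomes a short operator. This is a sketch under our own naming; it is not the ADtools implementation.

# Dual rule for scalar-matrix multiplication: d vec(cA) = vec(A) dc + c dA
`%scale%` <- function(c_dual, A_dual) {
  s <- c_dual$X; ds <- c_dual$dX     # scalar value and its 1 x d derivative
  A <- A_dual$X; dA <- A_dual$dX     # matrix value and its (mn) x d derivative
  list(X = s * A, dX = matrix(A, ncol = 1) %*% ds + s * dA)
}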
2.1.8. Structural Transformation
We now present calculus rules for structural transformations—operations that extract, rearrange, or combine matrix entries without performing any arithmetic computations. Because these transformations are primarily operational rather than analytical, they seldom appear in formal derivations and are often left undocumented in standard references.
- Transpose: Let A be an $m \times n$ matrix; then $\partial(A^\top) = K_{mn}\,\partial A$.
- Row binding: Let $A_1, \ldots, A_k$ be matrices with the same number of columns, and let $A = \mathrm{rbind}(A_1, \ldots, A_k)$. Then $\partial A$ is obtained by interleaving the rows of $\partial A_1, \ldots, \partial A_k$ so that they follow the column-major order of A; that is, row $u(i, j)$ of $\partial A$ is the row of $\partial A_l$ that holds the derivative of the entry of $A_l$ placed at position $(i, j)$ of A.
- Column binding: Let $A_1, \ldots, A_k$ be matrices with the same number of rows; then (see the sketch after this list)
$$\partial\,\mathrm{cbind}(A_1, \ldots, A_k) = \mathrm{rbind}(\partial A_1, \ldots, \partial A_k).$$
- Subsetting: Let A be an $m \times n$ matrix. 1. Index extraction: for fixed $(i, j)$, $\partial A_{ij}$ is row $u(i, j)$ of $\partial A$. 2. Row extraction: for fixed i, $\partial A_{i\cdot}$ consists of the rows $u(i, j)$, $j = 1, \ldots, n$, of $\partial A$. 3. Column extraction: for fixed j, $\partial A_{\cdot j}$ consists of the rows $u(i, j)$, $i = 1, \ldots, m$, of $\partial A$, i.e., rows $(j - 1)m + 1$ to $jm$. 4. Diagonal extraction: $\partial\,\mathrm{diag}(A)$ (as a column vector) consists of the rows $u(i, i)$, $i = 1, \ldots, \min(m, n)$, of $\partial A$.
- Vectorisation: Let A be an $m \times n$ matrix; then $\partial\,\mathrm{vec}(A) = \partial A$.
- Half-vectorisation: Let A be an $n \times n$ matrix; then $\partial\,\mathrm{vech}(A)$ consists of the rows of $\partial A$ indexed by $S = \{u(i, j) : 1 \le j \le i \le n\}$. Note that S follows the same order as the column-major order of A, i.e., $\partial\,\mathrm{vech}(A) = E_n\,\partial A$.
- Diagonal expansion: Let $v$ be a vector of length n; then $\mathrm{diag}(v)$ is defined to be the $n \times n$ matrix with $v$ on the diagonal. If $D = \mathrm{diag}(v)$, then for $i = 1, \ldots, n$, row $u(i, i)$ of $\partial D$ equals row i of $\partial v$, and all remaining rows of $\partial D$ are zero.
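The column-binding rule, for instance, is a one-liner in the dual style of Listing 1 (a sketch; the operator name is ours):

# Dual rule for column binding: vec(cbind(A, B)) stacks vec(A) over vec(B)
`%cbind%` <- function(A_dual, B_dual) {
  list(X = cbind(A_dual$X, B_dual$X),
       dX = rbind(A_dual$dX, B_dual$dX))
}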
2.1.9. Operations on Matrices
- Cholesky decomposition: Let A be an $n \times n$ symmetric positive-definite matrix, let $A = LL^\top$ be its Cholesky decomposition with L lower triangular; then
$$\partial\,\mathrm{vech}(L) = \big[E_n\,(I_{n^2} + K_n)(L \otimes I_n)\,\widetilde{D}_n\big]^{-1} E_n\,\partial A,$$
where $\widetilde{D}_n$ is the duplication matrix for triangular matrices (see Section 2.2.5).
Let A be an $m \times n$ matrix.
- Column-sum: $\partial\,\mathrm{colSums}(A) = (I_n \otimes \mathbf{1}_{1 \times m})\,\partial A$.
- Row-sum: $\partial\,\mathrm{rowSums}(A) = (\mathbf{1}_{1 \times n} \otimes I_m)\,\partial A$.
- Sum: $\partial\,\mathrm{sum}(A) = \mathbf{1}_{1 \times mn}\,\partial A$.
- Cross-product: $\partial(A^\top A) = (I_n \otimes A^\top)\,\partial A + (A^\top \otimes I_n)\,K_{mn}\,\partial A$.
- Transpose of cross-product: $\partial(AA^\top) = (A \otimes I_m)\,\partial A + (I_m \otimes A)\,K_{mn}\,\partial A$. Alternatively, both ‘crossprod’ and ‘tcrossprod’ can be implemented directly as is, since they are composed of the multiplication and transpose operations defined previously.
Let A be an $n \times n$ matrix.
- Determinant: $\partial\,\det(A) = \det(A)\,\mathrm{vec}(A^{-\top})^\top\,\partial A$.
- Trace: $\partial\,\mathrm{tr}(A) = \mathrm{vec}(I_n)^\top\,\partial A$. Alternatively, it can be implemented by composing the sum and diagonal-extraction operations defined previously. (Dual-style implementations of the determinant and trace rules are sketched below.)
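The determinant and trace rules translate directly into dual operations in the style of Listing 1. The following is a sketch; the function names are ours.

det_dual <- function(A_dual) {
  A <- A_dual$X; dA <- A_dual$dX
  # d det(A) = det(A) * vec(A^{-T})' dA
  list(X = det(A), dX = det(A) * t(as.vector(solve(t(A)))) %*% dA)
}

trace_dual <- function(A_dual) {
  A <- A_dual$X; dA <- A_dual$dX
  # d tr(A) = vec(I_n)' dA, i.e., the sum of the rows of dA holding the diagonal entries
  list(X = sum(diag(A)), dX = t(as.vector(diag(nrow(A)))) %*% dA)
}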
2.1.10. Random Variables
Given a probability space $(\Omega, \mathcal{F}, P)$, a random variable X is an $\mathcal{F}$-measurable function mapping $\Omega$ to $\mathbb{R}$. The formalism suggests that in the process of simulating a random variate, the randomness can always be isolated, and it is possible to differentiate (in the pathwise sense) the random variables w.r.t. the parameters when the derivative exists. In the simplest case of normal random variables, the parameters and the randomness can be separated as follows:
$$Z = \mu + \sigma\,\varepsilon, \qquad \varepsilon \sim N(0, 1).$$
As Z depends smoothly on the parameters $\mu$ and $\sigma$, the derivatives w.r.t. these parameters are well defined. This is commonly referred to as the reparametrisation trick [22].
When explicit separation cannot be done, we utilise the inverse transform method. Suppose $Z \sim F_\theta$, where $F_\theta$ is the cumulative distribution function of Z, assumed to be invertible, and $\theta$ is the parameter. Then Z can be simulated using the inverse transform method $Z = F_\theta^{-1}(U)$, $U \sim \mathrm{Uniform}(0, 1)$. It then follows that if $F_\theta^{-1}$ is differentiable in $\theta$, then the derivative of a random sample is well defined. This applies to, for instance, the Exponential, Weibull, Rayleigh, log-Cauchy, and log-Logistic distributions. (A sketch for the Exponential case is given below.)
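As an illustration, for an Exponential(rate = $\lambda$) draw, $Z = F_\lambda^{-1}(U) = -\log(1 - U)/\lambda$ and hence $\partial Z / \partial \lambda = \log(1 - U)/\lambda^2 = -Z/\lambda$. The following sketch checks this pathwise derivative against central finite differences under common random numbers; the variable names are ours.

set.seed(42)
lambda <- 2
u <- runif(1e5)
z <- -log(1 - u) / lambda            # Z = F^{-1}(U)
dz <- log(1 - u) / lambda^2          # pathwise derivative dZ/dlambda = -Z/lambda
h <- 1e-5
dz_fd <- (-log(1 - u) / (lambda + h) + log(1 - u) / (lambda - h)) / (2 * h)
max(abs(dz - dz_fd))                 # agrees up to truncation error (~ 1e-9)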
In the most general case, where Z is high-dimensional and the inverse transform may not be known, we rely on the class of isoprobabilistic transformations $T_\theta$, which transform an absolutely continuous k-variate distribution into the uniform distribution on the k-dimensional hypercube [23]. This gives an explicit formula for the derivative of a random vector:
$$\frac{\partial Z}{\partial \theta} = -\left(\frac{\partial T_\theta(Z)}{\partial Z}\right)^{-1} \frac{\partial T_\theta(Z)}{\partial \theta},$$
assuming $\partial T_\theta(Z)/\partial Z$ is non-singular so that the inverse exists.
For clarity, let us consider a 1-dimensional example. Suppose we have a random variable $Z \sim F_\theta$, where $F_\theta$ is invertible; then an isoprobabilistic transformation would simply be $T_\theta = F_\theta$, as $F_\theta(Z)$ is distributed uniformly. Hence, it follows that
$$\frac{\partial Z}{\partial \theta} = -\frac{1}{f_\theta(Z)}\,\frac{\partial F_\theta(Z)}{\partial \theta},$$
where $f_\theta$ denotes the density of Z.
It is easy to check via elementary means that this is indeed correct. Starting with the identity $F_\theta(Z) = U$ and applying implicit differentiation, we have
$$f_\theta(Z)\,\frac{\partial Z}{\partial \theta} + \frac{\partial F_\theta}{\partial \theta}(Z) = 0,$$
which rearranges to the expression above.
Some specific cases that can be handled this way include the gamma, inverse-gamma, chi-squared, Dirichlet, Wishart, and inverse-Wishart distributions.
2.2. Optimising AD Implementation
In the previous section, we introduced the vectorised AD formulation along with the full set of matrix calculus rules to support the dual construction. In this section, we explore several implementation strategies aimed at optimising execution. Benchmarking is carried out in R using the ADtools package available on CRAN ([24]) and GitHub ([25]). While the exact performance gains presented here (and in Section 3) are environment-specific, the optimisation principles are broadly applicable, and improvements can be expected in other environments.
2.2.1. Memoisation
Memoisation (or tabulation) is a technique for non-invasively attaching a cache to a function, allowing it to store and reuse previously computed results for repeated inputs [26]. It can greatly accelerate tasks such as constructing large structured matrices and provides a convenient way to organise computations.
The technique works by checking whether a given input has already been evaluated. If so, the cached result is returned; if not, the computation is performed, and the result is stored for future use. Table 2 shows a speed comparison of the built-in R function diag, with and without memoisation. An illustrative 12-line implementation is included in the Appendix B.
Table 2.
Speed comparison of the diagonal matrix function with and without memoisation. mem_diag is the memoised version. The best results in each column are highlighted in bold.
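For reference, the usage is as simple as wrapping the target function with the one-argument memoise helper from Listing A3 (Appendix B); the sketch below assumes that helper is in scope.

mem_diag <- memoise(diag)
system.time(for (i in 1:100) diag(2000))       # recomputes the matrix every time
system.time(for (i in 1:100) mem_diag(2000))   # computes once, then reuses the cached result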
2.2.2. Sparse Matrix Representation
For efficient implementation, all special matrices are constructed and stored using sparse representations, which improve both computational and memory efficiency. A sparse matrix is typically represented as a list of triples, where each triple records the value v at position (i, j) in the matrix.
2.2.3. The Diagonal Matrix
An $n \times n$ diagonal matrix is represented as $\{(k, k, v_k) : k = 1, \ldots, n\}$, where $v_k$ is the kth diagonal entry of the matrix. This representation takes $O(n)$ storage space and incurs an $O(n^2)$ computation cost when multiplied by a dense $n \times n$ matrix. A speed comparison of the diagonal matrix function with dense and sparse representations is provided in Table 3. In the table, “Dense” uses the R function diag, and “Sparse” uses the R function ADtools::diagonal. Using the sparse representation, a substantial increase in speed was observed for large matrices.
Table 3.
Speed comparison of the diagonal matrix function with dense and sparse representations. Time is averaged over 100 executions. The best results in each column are highlighted in bold.
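The triplet representation is easy to reproduce with the Matrix package; the sketch below is ours and is not the ADtools::diagonal implementation.

library(Matrix)
sparse_diag <- function(v) {
  n <- length(v)
  sparseMatrix(i = 1:n, j = 1:n, x = v)   # stores n triplets instead of n^2 entries
}
A <- matrix(rnorm(9), 3, 3)
sparse_diag(c(1, 2, 3)) %*% A             # scales the rows of A with O(n^2) work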
2.2.4. The Commutation Matrix
The $mn \times mn$ matrix $K_{mn}$ is a commutation matrix if $K_{mn}\,\mathrm{vec}(A) = \mathrm{vec}(A^\top)$ for any $m \times n$ matrix A [27]. Applying the index-conversion functions of Section 2.1.4 to the $n \times m$ matrix $A^\top$, we have
$$(\mathrm{vec}\,A^\top)_k = A^\top_{i'(k),\,j'(k)} = A_{j'(k),\,i'(k)} = (\mathrm{vec}\,A)_{(i'(k) - 1)m + j'(k)},$$
where $i'(k) = ((k - 1) \bmod n) + 1$ and $j'(k) = \lceil k / n \rceil$. As $K_{mn}$ is the matrix that maps $\mathrm{vec}(A)$ to $\mathrm{vec}(A^\top)$, and we have derived that the kth entry of $\mathrm{vec}(A^\top)$ needs to map to the $\big((i'(k) - 1)m + j'(k)\big)$-th entry of $\mathrm{vec}(A)$, it follows that $K_{mn}$ is a matrix having value 1 at position $(k,\ (i'(k) - 1)m + j'(k))$ and zeros elsewhere. Therefore, in the sparse representation, we have
$$K_{mn} = \big\{\big(k,\ (i'(k) - 1)m + j'(k),\ 1\big) : k = 1, \ldots, mn\big\}.$$
It is worth noting that, since the commutation matrix simply reorders the entries of $\mathrm{vec}(A)$, one can implement a function that directly remaps the indices as specified above, rather than explicitly constructing the matrix and performing the associated matrix multiplication (see the sketch following Table 4). Table 4 compares the performance of commutation matrix functions implemented using dense and sparse representations. The “Dense” implementation relies on the R function matrixcalc::commutation.matrix, while the “Sparse” version uses ADtools::commutation_matrix. A substantial improvement in speed is observed with the sparse approach.
Table 4.
Speed comparison of the commutation matrix function with dense and sparse representations. Time is averaged over 100 executions. The best results in each column are highlighted in bold.
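The index-remapping alternative mentioned above can be sketched as follows (the function name is ours): multiplying by $K_{mn}$ only permutes rows.

# K_{mn} %*% dA without forming K_{mn}: reorder the rows of dA
commute_rows <- function(dA, m, n) {
  perm <- as.vector(t(matrix(seq_len(m * n), nrow = m, ncol = n)))
  dA[perm, , drop = FALSE]
}
# check against the definition for a small case
m <- 3; n <- 2
A <- matrix(rnorm(m * n), m, n)
all.equal(as.vector(t(A)), as.vector(commute_rows(matrix(as.vector(A)), m, n)))  # TRUE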
2.2.5. The Elimination Matrix
Let E (also written $E_n$) denote the elimination matrix and D the duplication matrix. These matrices are defined by the identities they satisfy:
$$E\,\mathrm{vec}(A) = \mathrm{vech}(A), \qquad D\,\mathrm{vech}(A) = \mathrm{vec}(A) \ \text{for symmetric } A,$$
where $\mathrm{vech}$ is the half-vectorisation operator (vectorising the lower-triangular part of a square matrix, including the diagonal). The names of the special matrices come from the fact that D duplicates entries to turn a half vector into a full vector, and E eliminates entries to turn a full vector into a half vector. Note that if A is an $n \times n$ matrix, then $\mathrm{vec}(A)$ has a length of $n^2$ and $\mathrm{vech}(A)$ has a length of $n(n + 1)/2$. Hence, D has a dimension of $n^2 \times n(n + 1)/2$, and E has a dimension of $n(n + 1)/2 \times n^2$.
Now we derive the sparse representation of the elimination matrix. First, for an $n \times n$ matrix A, we define $\tilde{u}(i, j)$ and $(\tilde{i}(k), \tilde{j}(k))$ such that they satisfy the relations
$$(\mathrm{vech}\,A)_{\tilde{u}(i, j)} = A_{ij}, \qquad (\mathrm{vech}\,A)_k = A_{\tilde{i}(k), \tilde{j}(k)},$$
where $(i, j)$, $i \ge j$, must be in the lower-triangular part of A. The function $\tilde{u}$ is given by the formula
$$\tilde{u}(i, j) = b_j + (i - j + 1), \qquad b_j = \sum_{c = 1}^{j - 1}(n - c + 1),$$
where $b_j$ counts the entries of $\mathrm{vech}(A)$ coming from the columns before column j, and $(\tilde{i}(k), \tilde{j}(k))$ is the corresponding inverse mapping. These functions are needed to convert back and forth between the matrix and the half-vector representations. It then follows that:
$$(\mathrm{vech}\,A)_{\tilde{u}(i, j)} = A_{ij} = (\mathrm{vec}\,A)_{u(i, j)}, \qquad u(i, j) = (j - 1)n + i.$$
Since by definition E maps $\mathrm{vec}(A)$ to $\mathrm{vech}(A)$, and we have shown that the $u(i, j)$-th entry of $\mathrm{vec}(A)$ is mapped to the $\tilde{u}(i, j)$-th entry of $\mathrm{vech}(A)$, E is a matrix having a value of 1 at the position $(\tilde{u}(i, j),\ u(i, j))$ for every $i \ge j$ and zeros elsewhere. Hence, the sparse representation of the elimination matrix is given by:
$$E = \big\{\big(\tilde{u}(i, j),\ u(i, j),\ 1\big) : 1 \le j \le i \le n\big\}.$$
In the actual implementation, $b_j$ does not need to be computed recursively; it can be obtained directly using the closed-form expression $b_j = (j - 1)(n + 1) - j(j - 1)/2$. A performance comparison of the elimination matrix function using dense and sparse representations is shown in Table 5. In the table, “Dense” uses the R function matrixcalc::elimination.matrix, and “Sparse” uses the R function ADtools::elimination_matrix. The results demonstrate a clear speed advantage for the sparse implementation, with performance gains increasing as matrix size grows.
Table 5.
Speed comparison of the elimination matrix function with dense and sparse representations. Time is averaged over 100 executions. The best results in each column are highlighted in bold.
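The triplet construction above can be reproduced with the Matrix package; the following is a sketch of ours, not the ADtools::elimination_matrix implementation.

library(Matrix)
elimination <- function(n) {
  ij <- which(lower.tri(diag(n), diag = TRUE), arr.ind = TRUE)  # (i, j) with i >= j, column-major
  cols <- (ij[, "col"] - 1) * n + ij[, "row"]                   # u(i, j) = (j - 1) n + i
  sparseMatrix(i = seq_len(nrow(ij)), j = cols, x = 1,
               dims = c(nrow(ij), n^2))
}
A <- matrix(1:9, 3, 3); A <- A + t(A)
as.vector(elimination(3) %*% as.vector(A))   # vech(A): lower-triangular entries in column-major order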
We also define the half-duplication matrix $\widetilde{D}_n$ for lower-triangular matrices to be the matrix that satisfies $\widetilde{D}_n\,\mathrm{vech}(A) = \mathrm{vec}(A)$, where A is a lower-triangular matrix. For any lower-triangular matrix A, we have $\mathrm{vec}(A) = E^\top\,\mathrm{vech}(A)$, so that $\widetilde{D}_n = E^\top$.
2.2.6. Matrix Chain Multiplication
In the implementation of vectorised AD, the derivative computation frequently involves sequences of matrix multiplications, as shown in Equation (1). While mathematically straightforward, these operations can become computational bottlenecks in high-dimensional settings. A key yet often overlooked aspect is that matrix multiplication, although associative in output, i.e., $(AB)C = A(BC)$, is not associative in computational cost. To illustrate, consider $n \times n$ matrices A and B and a vector x of length n. Evaluating $(AB)x$ requires explicitly forming the intermediate matrix AB, resulting in a complexity of $O(n^3)$. In contrast, computing $A(Bx)$ avoids this and reduces the cost to $O(n^2)$. This simple example highlights a crucial insight: although the result is invariant to the order of operations, the efficiency is not. In large-scale computations, suboptimal ordering, such as naive left-to-right evaluation, can be unnecessarily costly. In simple cases where the dimensions of the matrices are known in advance, an optimal order can be enforced manually. However, in many applications, matrix dimensions are unknown until runtime, making it impossible to prespecify the optimal multiplication order. This leads to the matrix chain multiplication problem.
Matrix chain multiplication is an optimisation problem concerned with multiplying a chain of matrices using the least number of arithmetic operations. The naive recursive formulation has exponential complexity, which reduces to $O(n^3)$ (for a chain of n matrices) when the memoisation technique, i.e., dynamic programming, is employed. Ref. [28] provides an algorithm that solves the problem with $O(n \log n)$ complexity. However, given that in many applications the length of the matrix chain rarely goes beyond a handful of matrices, it is usually sufficient to consider the simpler dynamic programming solution as follows:
Let $m[i, j]$ be the minimal number of arithmetic operations needed to multiply out a chain of matrices $A_i A_{i+1} \cdots A_j$, and suppose for any i, $A_i$ has dimension $p_{i-1} \times p_i$. Our goal is to find $m[1, n]$. The recursive formula is given by [29]:
$$m[i, j] = \begin{cases} 0, & i = j, \\ \min_{i \le k < j} \big\{ m[i, k] + m[k + 1, j] + p_{i-1}\,p_k\,p_j \big\}, & i < j. \end{cases}$$
The above only provides the optimal number of arithmetic operations. To obtain the order of multiplication, we define the split point of the matrix chain as:
$$s[i, j] = \operatorname*{arg\,min}_{i \le k < j} \big\{ m[i, k] + m[k + 1, j] + p_{i-1}\,p_k\,p_j \big\}.$$
For example, if $s[1, n] = 2$, then the matrix chain should be split after index 2, i.e., in the order $(A_1 A_2)(A_3 \cdots A_n)$, whereas if $s[1, n] = n - 1$, then the matrix chain is ordered as $(A_1 \cdots A_{n-1})(A_n)$, after which one inspects $s[1, n - 1]$ to decide the full order. (A short R implementation of this recursion is given below.)
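The recursion above translates into a few lines of R. This sketch assumes the dimension vector dims has length n + 1, with matrix i being dims[i] x dims[i + 1]; the function name is ours.

chain_order <- function(dims) {
  n <- length(dims) - 1
  m <- matrix(0, n, n)     # m[i, j]: minimal cost of multiplying A_i ... A_j
  s <- matrix(0L, n, n)    # s[i, j]: optimal split point
  for (len in 2:n) {
    for (i in 1:(n - len + 1)) {
      j <- i + len - 1
      m[i, j] <- Inf
      for (k in i:(j - 1)) {
        cost <- m[i, k] + m[k + 1, j] + dims[i] * dims[k + 1] * dims[j + 1]
        if (cost < m[i, j]) { m[i, j] <- cost; s[i, j] <- k }
      }
    }
  }
  list(cost = m[1, n], split = s)
}
chain_order(c(10, 100, 5, 50))$cost   # 7500: (A1 A2) A3 beats A1 (A2 A3), which costs 75000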
Table 6 shows the increase in speed gained by switching from a naive (left-to-right) order to an optimal order. The comparison was conducted using 1000 simulations. The length of the chain was sampled from the set {2, 3, 4, 5}, and the matrix sizes were sampled from a discrete uniform distribution. Note that there is no speed-up from multiplying two matrices because there is only one possible order (the extra 0.04 is merely statistical noise). For a matrix chain with a length of three to five, the average speed-up was about 1.5 times. The speed-up distributions conditioned on the length of the chain were all positively skewed and had positive excess kurtosis (i.e., fat tails).
Table 6.
Comparing multiplications of a chain of matrices in the naive and optimal orders. Figures represent the mean and standard deviation (in brackets) of the speed-up over 1000 simulations.
2.2.7. Kronecker Products
Among the basic matrix operations, the Kronecker product is one of the most computationally expensive. In general, computing the Kronecker product of an $m \times n$ matrix and a $p \times q$ matrix has a complexity of $O(mnpq)$. For simplicity, assume all dimensions are equal to n; then the complexity becomes $O(n^4)$. In the context of Jacobian matrix computations, it is rare to compute a standalone Kronecker product. Instead, it typically appears as part of a larger expression, for example in forms such as $(A \otimes B)Z$ and $X(A \otimes B)$, where A and B are of size $n \times n$ and Z and X are of size $n^2 \times d$ and $d \times n^2$, respectively.
If one first computes the Kronecker product explicitly and then multiplies it by the remaining matrix, the total complexity is $O(n^4 d)$. However, by exploiting structural properties and avoiding explicit computation of the Kronecker product, the same result can be obtained in $O(n^3 d)$ operations. We now show that this reduced complexity holds in the general case as well:
Proposition 1.
Suppose $A_1, \ldots, A_K$ are $n \times n$ matrices ($K \ge 2$), and Z and X are $n^K \times d$ and $d \times n^K$ matrices, respectively. Then $(A_1 \otimes A_2 \otimes \cdots \otimes A_K)Z$ and $X(A_1 \otimes A_2 \otimes \cdots \otimes A_K)$ can be computed in $O(K\,n^{K+1}\,d)$ operations instead of the $O(n^{2K}\,d)$ operations required by the naive order.
The proposition suggests that unless the Kronecker product itself is of interest, one should never compute it explicitly when it comes to multiplication, because the algorithm (presented after the proof) always performs orders of magnitude faster. The proposition also holds for cases in which $A_1, \ldots, A_K$ have arbitrary sizes, but we do not state the proposition in that form because it obscures the complexity improvement and the logic of the proof. Nevertheless, the algorithm provided later does support the most general case.
Proof.
We begin with the base case $K = 2$ and consider $(A \otimes B)Z$, where Z is an $n^2 \times d$ matrix. Let $a_{ik}$ be the $(i, k)$ element of A and $Z_k$ be the kth block-row of Z (which is of size $n \times d$); then
$$(A \otimes B)Z = \begin{pmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & & \vdots \\ a_{n1}B & \cdots & a_{nn}B \end{pmatrix}\begin{pmatrix} Z_1 \\ \vdots \\ Z_n \end{pmatrix}.$$
For the ith block-row, we have $\sum_{k} a_{ik} B Z_k = B\big(\sum_{k} a_{ik} Z_k\big)$. The sum can be computed in $O(n^2 d)$ operations; together with the multiplication by B, each block-row costs $O(n^2 d)$, and because there are n block-rows, the overall complexity is $O(n^3 d)$.
We now proceed to the general case by abstracting the component that makes the above work and applying it recursively to the chain of Kronecker products. If we define two binary operations ⊡ and ⊛ such that
$$B \boxdot V = \begin{pmatrix} B V_1 \\ \vdots \\ B V_n \end{pmatrix}, \qquad A \circledast Z = \begin{pmatrix} \sum_k a_{1k} Z_k \\ \vdots \\ \sum_k a_{nk} Z_k \end{pmatrix},$$
then $(A \otimes B)Z = B \boxdot (A \circledast Z)$. The key idea behind the two new binary operations is that they define block-wise matrix multiplication. Suppose both V and Z can be split into n by 1 blocks. The first binary operation ⊡ defines the block-wise (pre-)multiplication, where each block of V is pre-multiplied by B, with B having the number of columns matching the number of rows of a block. The second binary operation ⊛ defines the block-wise linear combination: it produces a matrix of n blocks, where the ith block is given by $\sum_k a_{ik} Z_k$, naturally extending the usual matrix-vector multiplication $Ac$, whose ith entry is $\sum_k a_{ik} c_k$, where $c_k$ is the kth entry of a column vector c. We also write $(A \circledast Z)_k$ for the kth block-row of $A \circledast Z$.
The new binary operations allow us to evaluate the expression in a different order and avoid forming the Kronecker product in the process. As a result, it takes fewer arithmetic operations to evaluate the expression, as we have seen in the base case. Next, it follows that
$$(A_1 \otimes A_2 \otimes A_3)Z = (A_2 \otimes A_3) \boxdot (A_1 \circledast Z),$$
and the general case is given by
$$(A_1 \otimes A_2 \otimes \cdots \otimes A_K)Z = (A_2 \otimes \cdots \otimes A_K) \boxdot (A_1 \circledast Z),$$
where the pre-multiplication by $A_2 \otimes \cdots \otimes A_K$ inside ⊡ is itself evaluated recursively in the same manner.
Intuitively, every time we use the new binary operations to avoid a Kronecker product, the complexity is reduced by one order of n, and given that there are $K - 1$ Kronecker products, we expect the total complexity to be $O(K\,n^{K+1}\,d)$. We now present the formal proof by induction.
Let $P(K)$ be the statement that $(A_1 \otimes \cdots \otimes A_K)Z$ has complexity $O(K\,n^{K+1}\,d)$, where $A_1, \ldots, A_K$ are $n \times n$ matrices and Z denotes an $n^K \times d$ matrix. We have shown above that $P(2)$ is true. Now suppose $P(K)$ is true and consider $P(K + 1)$:
$$(A_1 \otimes A_2 \otimes \cdots \otimes A_{K+1})Z = (A_2 \otimes \cdots \otimes A_{K+1}) \boxdot (A_1 \circledast Z).$$
Computing $A_1 \circledast Z$ requires $O(n^{K+2} d)$ operations, resulting in a matrix of size $n^{K+1} \times d$. Next, the block-wise pre-multiplication by $A_2 \otimes \cdots \otimes A_{K+1}$, by the induction hypothesis, has a complexity of $O(K\,n^{K+1}\,d)$ per block, and, given that there are n block-rows, this step costs $O(K\,n^{K+2}\,d)$. The overall complexity is therefore $O((K + 1)\,n^{K+2}\,d)$. This proves the inductive step, and by induction, $P(K)$ is true for all $K \ge 2$. This completes the proof for the case $(A_1 \otimes \cdots \otimes A_K)Z$ in Proposition 1.
For the other case, $X(A_1 \otimes \cdots \otimes A_K)$, because we are pre-multiplying the chain of Kronecker products, we work with blocks of columns instead of blocks of rows. Denote the kth column-block of X by $X^{(k)}$ (where the superscript indexes blocks of columns). Then the two corresponding binary operators ⊡_c and ⊛_c are defined as:
$$V \boxdot_c B = (V^{(1)}B, \ldots, V^{(n)}B), \qquad X \circledast_c A = \Big(\textstyle\sum_k a_{k1} X^{(k)}, \ldots, \sum_k a_{kn} X^{(k)}\Big).$$
Now, it follows that $X(A \otimes B) = (X \circledast_c A) \boxdot_c B$, and the remainder of the proof proceeds the same as in the other case. □
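For the base case, the block-wise evaluation of $(A \otimes B)Z$ is equivalent to applying the identity $(A \otimes B)\,\mathrm{vec}(V) = \mathrm{vec}(B V A^\top)$ column by column. The sketch below (our own function name, assuming A is $p \times q$, B is $r \times s$, and Z is $qs \times d$) verifies this numerically.

kron_mult <- function(A, B, Z) {
  q <- ncol(A); s <- ncol(B)
  apply(Z, 2, function(z) as.vector(B %*% matrix(z, nrow = s, ncol = q) %*% t(A)))
}
A <- matrix(rnorm(6), 2, 3); B <- matrix(rnorm(12), 4, 3); Z <- matrix(rnorm(9 * 5), 9, 5)
max(abs(kron_mult(A, B, Z) - (A %x% B) %*% Z))   # ~ 1e-15, without ever forming A %x% B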
In Table 7, we compare the performance of evaluating $(A \otimes B)Z$ and $X(A \otimes B)$ with and without explicitly computing the Kronecker product. We conducted 1000 simulations, and in each simulation, the numbers of rows and columns of A, B, X, and Z were sampled from a discrete uniform distribution. Obviously, the number of columns of X needs to match the number of rows of $A \otimes B$ (= the number of rows of B × the number of rows of A), and likewise for the number of rows of Z. The speed-up was computed using $t_{\text{explicit}} / t_{\text{implicit}}$, where the two quantities denote the time needed to evaluate the full expression using the explicit and the implicit Kronecker product, respectively. We note that in one of the two cases, there were two speed-up “outliers”: one at 0.42 and the other at 0.94. All the rest were above 1. The median speed-up was about 15×, and the mean speed-up was about 16×, favouring the implicit evaluation.
Table 7.
Speed-up achieved by evaluating $(A \otimes B)Z$ and $X(A \otimes B)$ without explicitly calculating the Kronecker product. The speed-up is computed using $t_{\text{explicit}} / t_{\text{implicit}}$. The number of simulations is 1000.
2.2.8. Kronecker Product: More Special Cases
We identified common special cases of the Kronecker product and represented those using the new binary operators to reduce the computation cost further. In particular, we examine the four cases $A(B \otimes I)$, $A(I \otimes B)$, $(B \otimes I)D$, and $(I \otimes B)D$. These are chosen because they arise naturally in common operations such as the product rule, $\partial(AB) = (B^\top \otimes I_m)\,\partial A + (I_q \otimes A)\,\partial B$, and the transpose-of-cross-product rule, $\partial(AA^\top) = (A \otimes I_m)\,\partial A + (I_m \otimes A)K_{mn}\,\partial A$. Moreover, computing a Kronecker product (explicitly) with an identity matrix merely makes copies of the original matrix and arranges them in a particular way, hence yielding potential savings in time and memory use (despite the complexity order remaining the same).
Note that $B \circledast D = (B \otimes I)D$ and $B \boxdot D = (I \otimes B)D$, so both post-multiplied cases can be evaluated block-wise without forming the Kronecker product.
- Similarly, $A \circledast_c B = A(B \otimes I)$ and $A \boxdot_c B = A(I \otimes B)$, so the pre-multiplied cases can be handled in the same way. (A sketch of the block-wise evaluation of $(I \otimes B)D$ follows.)
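The following sketch (our own function name) shows the block-wise evaluation of $(I_n \otimes B)D$: block-row i of the result is simply $B D_i$, so the Kronecker product is never formed.

id_kron_mult <- function(B, D, n) {
  p <- nrow(D) / n                     # each block-row of D has ncol(B) rows
  do.call(rbind, lapply(seq_len(n), function(i)
    B %*% D[(i - 1) * p + seq_len(p), , drop = FALSE]))
}
B <- matrix(rnorm(6), 2, 3); D <- matrix(rnorm(12 * 4), 12, 4); n <- 4
max(abs(id_kron_mult(B, D, n) - (diag(n) %x% B) %*% D))   # ~ 1e-16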
In Table 8, we present the speed-up of the optimised implementation over the naive implementation. We conducted 1000 simulations, and in each simulation, the dimensions of B and D were sampled from a discrete uniform distribution; the number of columns of A and the number of rows of D were specified such that the multiplication makes sense (and the choice is unique). The speed-up was computed using $t_{\text{naive}} / t_{\text{optimised}}$, which denote the evaluation times corresponding to the naive and the optimised implementations, respectively. We observe that the median speed-up is about 8× and the mean speed-up is about 10×, which aligns with our theoretical result that the Kronecker product should not be evaluated explicitly unless the product itself is of interest.
Table 8.
Speed-up achieved by evaluating $A(B \otimes I)$, $A(I \otimes B)$, $(B \otimes I)D$, and $(I \otimes B)D$ without explicitly computing the Kronecker product. The speed-up is computed using $t_{\text{naive}} / t_{\text{optimised}}$. The number of simulations is 1000.
3. Results
This section provides some computational examples to demonstrate the speed and efficacy of our proposed methods. We first benchmark our method against the traditional numerical derivative under basic matrix operations and the computation of a covariance matrix’s log determinant. We then demonstrate our derivative computation’s effectiveness within a large stochastic optimisation scheme, i.e., simulated maximum likelihood estimation (SMLE) of a stochastic factor model.
3.1. Basic Operations
We benchmarked the performance of AD against that of FD using the basic arithmetic operations: addition, subtraction, multiplication, matrix inversion, and the Kronecker product. The results are presented in Table 9. The time figures in the table represent averages over 100 executions. Overall, AD performed much faster than FD.
Table 9.
Benchmarking of AD against central FD in terms of basic arithmetic operations. Faster times are in bold.
3.2. Dynamic Factor Model Inference
Numerical assessments are often required in both classical and Bayesian statistical inference. Below, we illustrate the benefits of applying AD to derivative computation for the maximum likelihood estimation (MLE) of factor models when the analytical expression of the likelihood is not available and numerical assessment of derivatives is required. We show substantial computational gains using both simulated and real data. Readers interested in the use of AD in the context of Bayesian sensitivity analysis are referred to [30].
Factor models have been widely used in many areas, including psychology, bioinformatics, economics, and finance, to model the dependence structure of high-dimensional data. Different specifications of the factor model have been widely discussed in the literature ([31,32]). We follow [32] and consider a variation of the factor model in which the analytic expression of the derivative of the log-likelihood is intractable. Let $y_t$ denote the $N \times 1$ vector of observations at time t, where $t = 1, \ldots, T$, and let $f_t$ represent a $k \times 1$ column vector of latent factors. Then, the k-factor model with Student-t noise is specified as:
$$y_t = \mu + B f_t + \epsilon_t,$$
where $\mu$ is the $N \times 1$ column vector of intercepts and B is the $N \times k$ loading matrix. The factors are assumed to be normally distributed, $f_t \sim N(0, \Sigma_x)$, where $\Sigma_x$ is $k \times k$, and they are independent from the innovations, which are multivariate-t distributed, $\epsilon_t \sim t_\nu(0, \Sigma_\epsilon)$, where $\Sigma_\epsilon$ is $N \times N$. For the purpose of identification, we require $k < N$ and assume that B is lower triangular with diagonal entries all equal to 1 ([32]). This particular specification is commonly used in financial econometrics.
For maximum likelihood inference of such a model, the likelihood function can be maximised directly via Monte Carlo simulation. Let $y = (y_1, \ldots, y_T)$ be the observations and $\theta$ be the collection of model parameters. Under our setting, the likelihood function does not have an analytical expression, and it needs to be evaluated using numerical methods. Specifically, we can write $f_t = \Sigma_x^{1/2} z_t$, where $z_t \sim N(0, I_k)$. Therefore, it follows that:
$$L(\theta; y) = \prod_{t=1}^{T} \int p_t\big(y_t - \mu - B f_t;\ \Sigma_\epsilon, \nu\big)\, p(f_t)\, df_t \approx \prod_{t=1}^{T} \frac{1}{S} \sum_{s=1}^{S} p_t\big(y_t - \mu - B\,\Sigma_x^{1/2} z_t^{(s)};\ \Sigma_\epsilon, \nu\big), \qquad z_t^{(s)} \sim N(0, I_k),$$
where $p_t(\cdot;\ \Sigma_\epsilon, \nu)$ denotes the probability density function of the multivariate-t distribution. In the rest of this section, we focus on a case in which both $\Sigma_x$ and $\Sigma_\epsilon$ are diagonal matrices: $\Sigma_x = \mathrm{diag}(\sigma_{x,1}^2, \ldots, \sigma_{x,k}^2)$ and $\Sigma_\epsilon = \mathrm{diag}(\sigma_{\epsilon,1}^2, \ldots, \sigma_{\epsilon,N}^2)$.
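For concreteness, the simulated log-likelihood can be written in a handful of lines of R. The sketch below is ours (it is not the ADtools implementation); it assumes the mvtnorm package and uses the reparametrised factor draws so that each draw depends smoothly on the parameters and the function can be differentiated pathwise.

sim_loglik <- function(y, mu, B, sigma_x, sigma_e, nu, S = 100) {
  # y: N x T observations; B: N x k loadings; sigma_x: length-k factor sd
  # sigma_e: length-N noise scale; nu: t degrees of freedom; S: number of draws
  k <- length(sigma_x)
  eps <- matrix(rnorm(k * S), k, S)     # randomness, isolated from the parameters
  f_draws <- sigma_x * eps              # reparametrised factor draws, k x S
  ll <- 0
  for (t in seq_len(ncol(y))) {
    dens_s <- sapply(seq_len(S), function(s)
      mvtnorm::dmvt(as.vector(y[, t] - mu - B %*% f_draws[, s]),
                    sigma = diag(sigma_e^2), df = nu, log = FALSE))
    ll <- ll + log(mean(dens_s))
  }
  ll
}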
For the implementation of the simulated maximum likelihood approach, we use the Ada-Delta variation of the stochastic gradient descent algorithm [33] for the estimation (with its usual default hyperparameters). We first considered a simulated dataset of 1000 observations, each with 10 measurements, where the dimension of the hidden factors was 3 and the entries in the factor loading matrix were sampled from the standard normal distribution. The entries of the diagonal covariance matrices of the factors and the innovations were also randomly generated. The estimates converged at around 2000 iterations.
The first column of results in Table 10 reports the total run times with derivatives computed using AD and, for comparison, FD, for the simulated data example.
Table 10.
Run-time comparison between SMLE analysis of the factor model using either AD or (central) FD in the stochastic gradient computations. The per-iteration summaries are based on 100 evaluations under the simulated data example. The best performance in each column is highlighted in bold.
Under AD, the estimation took 4.36 h, compared with over 12 h using FD for the required derivative calculation, a reduction in run-time of almost 66%. The remaining results in Table 10 provide a more detailed run-time comparison between the two implementations and confirm this pattern: the improvement was consistent for both the per-iteration and total run-times on the simulated data and for the total run-time on the real data.
The real data set contained data on currency exchange rates. The sample included 1045 observations of daily returns of nine international currency exchange rates relative to the United States dollar from January 2007 to December 2010. We applied the factor model to the log returns of the exchange rates (i.e., $y_{it} = \log(P_{it}/P_{i,t-1})$, where $P_{it}$ denotes the daily closing spot rate for currency i at time t). The nine selected currencies were the Australian dollar (AUD), Canadian dollar (CAD), Euro (EUR), Japanese yen (JPY), Swiss franc (CHF), British pound (GBP), South Korean won (KRW), New Zealand dollar (NZD), and New Taiwan dollar (TWD), representing the most heavily traded currencies over the period. The estimates converged at around 1000 iterations. As reported in the second column of results in Table 10, we again observed a substantial reduction in estimation time, with the run time under the AD-based derivative computation being a third of the time required by the FD-based computation.
4. Conclusions
This paper presents a vectorised formulation of AD grounded in matrix calculus, tailored for statistical applications that involve high-dimensional inputs and simulation-intensive computations. The proposed approach uses a compact set of matrix calculus rules to enable efficient and automatic derivative computation and introduces optimisation techniques, such as memoisation, sparse matrix representation, matrix chain multiplication, and implicit Kronecker product, to improve the efficiency of the AD implementation.
Compared to other AD approaches, our formulation aligns more naturally with the matrix-oriented notation commonly used in statistics and econometrics. It supports fully automatic workflows similar to source code transformation methods while also providing direct access to intermediate variables for inspection and analysis. Unlike imperative AD frameworks, it does not require users to adopt AD-specific programming idioms. In addition, the approach enables high-level optimisation by explicitly making use of matrix structure and the order of matrix multiplications, which are typically inaccessible in scalar-based or imperative implementations.
Despite its advantages, our formulation of AD introduces computational overhead, which can make it less efficient than FD for small-scale or low-dimensional problems. In such cases, the simplicity of FD often results in faster performance (but at the cost of lower accuracy). The benefits of our approach become more apparent in high-dimensional settings, where its scalability and accuracy outweigh the initial overhead, as we showed in the numerical study. A further limitation arises when computing second-order derivatives such as the Hessian. Under our vectorised approach, this requires dual numbers to carry second-order matrix derivatives, which can quickly exceed the memory capacity of typical personal computers. For a function mapping $\mathbb{R}^n$ to $\mathbb{R}^m$, the Jacobian has dimension $m \times n$, while the Hessian grows to $mn \times n$. In contrast, FD computes these matrices entry by entry by perturbing the function input one coordinate at a time; although slower, this method avoids out-of-memory issues. Similar memory constraints may also occur when handling extremely large Jacobian matrices.
In view of the modern trend toward increasingly large and high-dimensional datasets, the performance overhead of AD in small-scale settings is becoming less of a practical concern. As data and models continue to grow in complexity, the scalability advantages of AD are expected to outweigh its initial costs in a broader range of applications. For the memory demands associated with higher-order derivatives, potential solutions include distributed computing, on-disk array storage, and blocked algorithms that process derivative computations in smaller, memory-efficient segments. These strategies offer promising avenues for extending the applicability of our vectorised AD approach to even larger and more demanding statistical problems.
Author Contributions
Conceptualisation, C.F.K. and D.Z.; methodology, C.F.K. and D.Z.; software, C.F.K.; validation, C.F.K. and D.Z.; formal analysis, C.F.K.; investigation, C.F.K.; resources, L.J. and D.Z.; data curation, C.F.K. and D.Z.; writing—original draft preparation, C.F.K.; writing—review and editing, C.F.K., L.J. and D.Z.; visualisation, C.F.K.; supervision, L.J. and D.Z.; project administration, L.J. and D.Z.; funding acquisition, L.J. and D.Z. All authors have read and agreed to the published version of the manuscript.
Funding
The authors acknowledge support from the Australian Research Council through funding from DP180102538 and FT170100124.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The exchange rate data used in this study is publicly available from the Federal Reserve Economic Data (FRED) at https://fred.stlouisfed.org and is freely accessible without restrictions.
Acknowledgments
We thank the anonymous reviewers for their valuable comments and constructive suggestions.
Conflicts of Interest
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Appendix A. Illustrative Examples
In this section, we present some statistical applications that use the matrix operations above and could benefit significantly from the use of AD.
Example A1.
Local sensitivity of the Seemingly Unrelated Regression (SUR) model [34]. Consider the SUR model,
$$y_i = X_i \beta_i + u_i, \qquad i = 1, \ldots, M,$$
- where $y_i$ and $u_i$ are $T \times 1$ vectors, $X_i$ is a $T \times l$ matrix of predictors, and $\beta_i$ is an $l \times 1$ vector of coefficients,
- and $\mathrm{Cov}(u) = \Sigma_c \otimes I$, where $u = (u_1^\top, \ldots, u_M^\top)^\top$ and I is the $T \times T$ identity matrix.
In a more compact form,
$$y = X\beta + u, \qquad u \sim N(0, \Sigma_c \otimes I),$$
and we write it as $y = X\beta + u$, where $X = \mathrm{diag}(X_1, \ldots, X_M)$ and $\beta = (\beta_1^\top, \ldots, \beta_M^\top)^\top$. The Generalised-Least-Squares (GLS) estimator is given by
$$\hat{\beta} = \big(X^\top (\Sigma_c \otimes I)^{-1} X\big)^{-1} X^\top (\Sigma_c \otimes I)^{-1} y.$$
Given the matrix multiplications, inversions, and Kronecker products involved, it is tedious to find the analytical expression of the local sensitivity of $\hat{\beta}$ with respect to the noise parameters $\Sigma_c$, i.e., $\partial \hat{\beta} / \partial \Sigma_c$. In contrast, AD only requires implementing the original expression, and then the derivative will be available "for free" (see Listing A5).
Example A2.
Local sensitivity of the Bayesian normal regression model. Consider the model $y = X\beta + \epsilon$, $\epsilon \sim N(0, \sigma^2 I)$, with the normal prior $\beta \sim N(b_0, V_0)$ on the parameter and with $(b_0, V_0, \sigma^2)$ being the hyperparameters. The posterior mean b of β is given by
$$b = \big(V_0^{-1} + \sigma^{-2} X^\top X\big)^{-1}\big(V_0^{-1} b_0 + \sigma^{-2} X^\top y\big). \tag{A1}$$
The local sensitivity of the posterior mean is concerned with the effect of a small change in the hyperparameters and the data X on the posterior mean b.
Even in such a simple case, it is clear that applying AD directly to (A1) is less error-prone than deriving and implementing the analytic derivative by hand, which involves the standard differential operator and duplication matrices of appropriate dimensions for the symmetric hyperparameter $V_0$.
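The AD route can be sketched in a few lines with ADtools::auto_diff, following the pattern of Listing A4 (the data X, y and the noise variance sigma2 are captured from the environment). The variable names below are ours, and the sketch assumes auto_diff handles the matrix inverse, product, addition, and scalar division involved, as it does in Listings A4 and A5.

library(ADtools)
set.seed(1)
X <- matrix(rnorm(200), 50, 4)
y <- X %*% rnorm(4) + rnorm(50)
sigma2 <- 1

posterior_mean <- function(b0, V0) {
  precision <- solve(V0) + crossprod(X) / sigma2
  solve(precision) %*% (solve(V0) %*% b0 + crossprod(X, y) / sigma2)
}
auto_diff(posterior_mean, at = list(b0 = matrix(0, 4, 1), V0 = diag(4)))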
Example A3.
(Mixed-regressive) Spatial-Auto-Regressive (SAR) model [35] (pp. 8, 16).
$$y = \rho W_1 y + X\beta + u, \qquad u = \lambda W_2 u + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n),$$
where y and $\varepsilon$ are $n \times 1$ vectors, X is an $n \times k$ matrix, β is a $k \times 1$ vector, $\rho$ and $\lambda$ are scalars, and $W_1$ and $W_2$ are $n \times n$ spatial weight matrices. The log-likelihood function is given by:
$$\ell(\rho, \lambda, \beta, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) + \ln|I_n - \rho W_1| + \ln|I_n - \lambda W_2| - \frac{1}{2\sigma^2}\, e^\top e, \qquad e = (I_n - \lambda W_2)\big((I_n - \rho W_1)y - X\beta\big),$$
and the derivative is needed to perform MLE using gradient-based methods. In addition to the convenience of not needing to implement the derivative manually, AD often has the advantage of less duplicate computation compared with the case in which the log-likelihood function and its derivative are implemented separately.
Example A4.
MLE with simultaneous equations [19] (p. 371). The simultaneous equations model is a generalisation of the multivariate linear regression model
$$y_t = B x_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \Sigma), \qquad t = 1, \ldots, n,$$
where $y_t$ and $\varepsilon_t$ are $m \times 1$ column vectors, $x_t$ is a $k \times 1$ column vector, and B is an $m \times k$ matrix, and it has the form of
$$\Gamma y_t = B x_t + \varepsilon_t,$$
where $\Gamma$ is an $m \times m$ matrix. Assuming that $\Gamma$ is non-singular and the data matrix $X = (x_1, \ldots, x_n)^\top$ has full rank k, the log-likelihood consists of the cross-product, determinant, and trace operations as follows:
$$\ell(\theta) = -\frac{mn}{2}\ln(2\pi) + n \ln|\det \Gamma| - \frac{n}{2}\ln\det\Sigma - \frac{1}{2}\,\mathrm{tr}\big[\Sigma^{-1}(Y\Gamma^\top - XB^\top)^\top(Y\Gamma^\top - XB^\top)\big].$$
In the above, the parameters $(\Gamma, B, \Sigma)$ are collected into θ, and the observations Y and X are stacked by rows.
For the multilevel generalisation [36], where the observations are clustered into l independent groups of the same size, the log-likelihood becomes a sum of l group-level terms of the same form as above. Note that the residual matrix of each group is stacked by rows, and it follows a matrix normal distribution.
In the example above, AD offers an easy way to extend an existing model to incorporate structural assumptions, which often lead to a more complicated derivative expression. It enables researchers to readily experiment with different ways of generalising the working model.
Example A5.
Infinite Gaussian mixture model. Consider the model $y_i \mid \sigma_i^2 \sim N(\mu, \sigma_i^2)$, $\sigma_i^2 \sim g(\cdot; \alpha, \beta)$, where $y_1, \ldots, y_n$ are the data and $(\mu, \alpha, \beta)$ are the parameters. Let $\phi(\cdot; \mu, \sigma^2)$ be the normal density and $g(\cdot; \alpha, \beta)$ be the gamma density; the log-likelihood is given by:
$$\ell(\mu, \alpha, \beta) = \sum_{i=1}^{n} \log \int \phi(y_i; \mu, \sigma^2)\, g(\sigma^2; \alpha, \beta)\, d\sigma^2 \approx \sum_{i=1}^{n} \log\Big(\frac{1}{S}\sum_{s=1}^{S} \phi\big(y_i; \mu, \sigma_{(s)}^2\big)\Big), \qquad \sigma_{(s)}^2 \sim g(\cdot; \alpha, \beta),$$
where the second line is the Monte Carlo approximation. As the simulated log-likelihood depends on the parameters through the random sample, it is more convenient to use AD to compute the derivative, especially if one wants to explore different choices of the mixture distribution g. Moreover, it is also practical to use AD when the mixture models are multi-level, or when the marginalised parameters are high-dimensional.
Appendix B. Code Listings
Listing A1. Implementation of the subtraction and inverse matrix calculus rules in R.

`%minus%` <- function(A_dual, B_dual) {
  A <- A_dual$X; dA <- A_dual$dX
  B <- B_dual$X; dB <- B_dual$dX
  list(X = A - B, dX = dA - dB)
}

`%divide%` <- function(A_dual, B_dual) {
  A <- A_dual$X; dA <- A_dual$dX
  B <- B_dual$X; dB <- B_dual$dX
  B_inv <- solve(B)
  dB_inv <- -(t(B_inv) %x% B_inv) %*% dB
  B_inv_dual <- list(X = B_inv, dX = dB_inv)
  A_dual %times% B_inv_dual
}
Listing A2. An example using the simple AD system in Section 2.1.2.

f <- function(A, B) {
  A %*% (A %*% B + B %*% B) + B
}

# Derivative by Auto-Differentiation
df_AD <- function(A, B) {
  # A and B are dual matrices
  A %times% ((A %times% B) %plus% (B %times% B)) %plus% B
}

# Derivative by Analytic Formula
df_AF <- function(A, B, dA, dB) {
  # optimisation by hand to avoid repeated computation
  I_n <- I(nrow(A))
  I_n2 <- I(nrow(A)^2)
  In_x_A <- I_n %x% A
  In_x_B <- I_n %x% B
  tB_x_In <- t(B) %x% I_n
  # the analytic formula
  (t(A %*% B + B %*% B) %x% I_n + In_x_A %*% tB_x_In) %*% dA +
    (In_x_A %*% In_x_A + In_x_A %*% (tB_x_In + In_x_B) + I_n2) %*% dB
}

## -------------------------------------------------------------
# Helper functions
zeros <- function(nr, nc) matrix(0, nrow = nr, ncol = nc)
dual <- function(X, dX) list(X = X, dX = dX)

# Main code
n <- 10
set.seed(123)
A <- matrix(rnorm(n^2), nrow = n, ncol = n)
B <- matrix(rnorm(n^2), nrow = n, ncol = n)
res <- f(A, B)

dA <- cbind(I(n^2), zeros(n^2, n^2))
dB <- cbind(zeros(n^2, n^2), I(n^2))
res_DF <- df_AF(A, B, dA, dB)                # Analytic approach
res_AD <- df_AD(dual(A, dA), dual(B, dB))    # AD approach

# Compare accuracy
sum(abs(res_AD$X - res))       # 0
sum(abs(res_AD$dX - res_DF))   # 5.016126e-13
Listing A3. An illustrative implementation of one-argument memoisation in R.

memoise <- function(f) {            # takes a function `f` as input
  record <- list()                  # attach a table to `f` (using lexical scoping)
  hash <- as.character
  return(function(x) {              # returns a memoised `f` as output
    result <- record[[hash(x)]]     # retrieves result
    if (is.null(result)) {          # if the result does not exist
      result <- f(x)                # then evaluate it and
      record[[hash(x)]] <<- result  # save it for future use
    }
    return(result)
  })
}
Listing A4. R code to compare the speed and accuracy of AD and FD.

# remotes::install_github("kcf-jackson/ADtools")
library(ADtools)

# 1. Setup
set.seed(123)   # for reproducibility
X <- matrix(rnorm(10000), 100, 100)
Y <- matrix(rnorm(10000), 100, 100)
B <- matrix(rnorm(10000), 100, 100)
f <- function(B) { sum((Y - X %*% B)^2) }
# Deriving analytic derivative by hand
df <- function(B) { -2 * t(X) %*% (Y - X %*% B) }

# 2. Speed comparison
system.time({ AD_res <- auto_diff(f, at = list(B = B)) })
#   user  system elapsed
#  0.387   0.054   0.445
system.time({ FD_res <- finite_diff(f, at = list(B = B)) })
#   user  system elapsed
# 10.660   1.918  12.591
system.time({ truth <- df(B) })   # runs fastest when available
#   user  system elapsed
#  0.001   0.000   0.001

# 3. Accuracy comparison
AD_res <- as.vector(deriv_of(AD_res))
FD_res <- as.vector(FD_res)
truth <- as.vector(truth)
max(abs(AD_res - truth))   # [1] 0
max(abs(FD_res - truth))   # [1] 0.006982282
Listing A5. R code to illustrate that our vectorised formulation can produce derivatives automatically and seamlessly.

# Example 1: Seemingly Unrelated Regression
set.seed(123)
T0 <- 10
M <- 5
l <- 6

# Regression coefficients
beta <- do.call(c, lapply(1:M, \(id) rnorm(l, mean = 0, sd = 2)))

# Predictors
Xs <- lapply(1:M, \(id) matrix(rnorm(T0 * l), nrow = T0, ncol = l))
X <- diag(1, nrow = M * T0, ncol = M * l)
for (i in seq_along(Xs)) {
  X[1:T0 + (i - 1) * T0, 1:l + (i - 1) * l] <- Xs[[i]]
}
X

# Noise
Sigma_c <- crossprod(matrix(rnorm(M^2), nrow = M))
I <- diag(T0)
u <- mvtnorm::rmvnorm(1, mean = rep(0, T0 * M), sigma = kronecker(Sigma_c, I))

# Observation
y <- X %*% beta + t(u)

# Estimator
estimator <- function(Sigma_c, I, X, y) {
  inv_mat <- solve(kronecker(Sigma_c, I))
  beta_est <- solve(t(X) %*% inv_mat %*% X, t(X) %*% inv_mat %*% y)
}

# remotes::install_github("kcf-jackson/ADtools")
library(ADtools)
auto_diff(estimator, wrt = c("Sigma_c"),
          at = list(Sigma_c = Sigma_c, I = I, X = X, y = y))
References
- Gardner, J.; Pleiss, G.; Weinberger, K.Q.; Bindel, D.; Wilson, A.G. Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration. Adv. Neural Inf. Process. Syst. 2018, 31, 7587–7597. [Google Scholar]
- Abril-Pla, O.; Andreani, V.; Carroll, C.; Dong, L.; Fonnesbeck, C.J.; Kochurov, M.; Kumar, R.; Lao, J.; Luhmann, C.C.; Martin, O.A.; et al. PyMC: A modern, and comprehensive probabilistic programming framework in Python. PeerJ Comput. Sci. 2023, 9, e1516. [Google Scholar] [CrossRef] [PubMed]
- Joshi, M.; Yang, C. Algorithmic Hessians and the fast computation of cross-gamma risk. IIE Trans. 2011, 43, 878–892. [Google Scholar] [CrossRef]
- Allen, G.I.; Grosenick, L.; Taylor, J. A generalized least-square matrix decomposition. J. Am. Stat. Assoc. 2014, 109, 145–159. [Google Scholar] [CrossRef]
- Jacobi, L.; Joshi, M.S.; Zhu, D. Automated sensitivity analysis for Bayesian inference via Markov chain Monte Carlo: Applications to Gibbs sampling. SSRN 2018. [Google Scholar] [CrossRef]
- Ranganath, R.; Gerrish, S.; Blei, D. Black box variational inference. In Proceedings of the Artificial Intelligence and Statistics. PMLR, Reykjavik, Iceland, 22–25 April 2014; pp. 814–822. [Google Scholar]
- Revels, J.; Lubin, M.; Papamarkou, T. Forward-mode automatic differentiation in Julia. arXiv 2016, arXiv:1607.07892. [Google Scholar]
- Baydin, A.G.; Pearlmutter, B.A.; Radul, A.A.; Siskind, J.M. Automatic differentiation in machine learning: A survey. J. Mach. Learn. Res. 2018, 18, 1–43. [Google Scholar]
- Kucukelbir, A.; Tran, D.; Ranganath, R.; Gelman, A.; Blei, D.M. Automatic differentiation variational inference. J. Mach. Learn. Res. 2017, 18, 1–45. [Google Scholar]
- Chaudhuri, S.; Mondal, D.; Yin, T. Hamiltonian Monte Carlo sampling in Bayesian empirical likelihood computation. J. R. Stat. Soc. Ser. B Stat. Methodol. 2017, 79, 293–320. [Google Scholar] [CrossRef]
- Chan, J.C.; Jacobi, L.; Zhu, D. Efficient selection of hyperparameters in large Bayesian VARs using automatic differentiation. J. Forecast. 2020, 39, 934–943. [Google Scholar] [CrossRef]
- Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in PyTorch. In Proceedings of the NIPS 2017 Autodiff Workshop, Long Beach, CA, USA, 9 December 2017. [Google Scholar]
- Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. {TensorFlow}: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), Savannah, GA, USA, 2–4 November 2016; pp. 265–283. [Google Scholar]
- Kucukelbir, A.; Ranganath, R.; Gelman, A.; Blei, D. Automatic variational inference in Stan. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 568–576. [Google Scholar]
- Klein, W.; Griewank, A.; Walther, A. Differentiation methods for industrial strength problems. In Automatic Differentiation of Algorithms; Springer: New York, NY, USA, 2002; pp. 3–23. [Google Scholar]
- Griewank, A.; Walther, A. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation; Siam: Philadelphia, PA, USA, 2008; Volume 105. [Google Scholar]
- Griewank, A.; Juedes, D.; Utke, J. Algorithm 755: ADOL-C: A package for the automatic differentiation of algorithms written in C/C++. ACM Trans. Math. Softw. (TOMS) 1996, 22, 131–167. [Google Scholar] [CrossRef]
- Bischof, C.H.; Roh, L.; Mauer-Oats, A.J. ADIC: An extensible automatic differentiation tool for ANSI-C. Softw. Pract. Exp. 1997, 27, 1427–1456. [Google Scholar] [CrossRef]
- Magnus, J.R.; Neudecker, H. Matrix Differential Calculus with Applications in Statistics and Econometrics; Wiley: Hoboken, NJ, USA, 1999. [Google Scholar]
- Glasserman, P. Monte Carlo Methods in Financial Engineering; Springer Science & Business Media: New York, NY, USA, 2003; Volume 53. [Google Scholar]
- Intel. Matrix Inversion: LAPACK Computational Routines. 2020. Available online: https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2023-0/matrix-inversion-lapack-computational-routines.html (accessed on 15 March 2025).
- Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the International Conference on Learning Representations, Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
- Rosenblatt, M. Remarks on a multivariate transformation. Ann. Math. Stat. 1952, 23, 470–472. [Google Scholar] [CrossRef]
- Kwok, C.F.; Zhu, D.; Jacobi, L. ADtools: Automatic Differentiation Toolbox, R Package Version 0.5.4, CRAN Repository. 2020. Available online: https://cran.r-project.org/src/contrib/Archive/ADtools/ (accessed on 15 March 2025).
- Kwok, C.F.; Zhu, D.; Jacobi, L. ADtools: Automatic Differentiation Toolbox. GitHub Repository. 2020. Available online: https://github.com/kcf-jackson/ADtools (accessed on 15 March 2025).
- Abelson, H.; Sussman, G.J.; Sussman, J. Structure and Interpretation of Computer Programs; MIT Press: Cambridge, MA, USA, 1996. [Google Scholar]
- Lütkepohl, H. Handbook of Matrices; Wiley Chichester: Chichester, UK, 1996; Volume 1. [Google Scholar]
- Hu, T.; Shing, M. Computation of matrix chain products. Part II. SIAM J. Comput. 1984, 13, 228–251. [Google Scholar] [CrossRef]
- Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.; Stein, C. Introduction to Algorithms; MIT Press: Cambridge, MA, USA, 2009. [Google Scholar]
- Chan, J.C.; Jacobi, L.; Zhu, D. An automated prior robustness analysis in Bayesian model comparison. J. Appl. Econom. 2019, 37, 583–602. [Google Scholar] [CrossRef]
- Brennan, M.J.; Chordia, T.; Subrahmanyam, A. Alternative factor specifications, security characteristics, and the cross-section of expected stock returns. J. Financ. Econ. 1998, 49, 345–373. [Google Scholar] [CrossRef]
- Geweke, J.; Zhou, G. Measuring the pricing error of the arbitrage pricing theory. Rev. Financ. Stud. 1996, 9, 557–587. [Google Scholar] [CrossRef]
- Zeiler, M.D. Adadelta: An adaptive learning rate method. arXiv 2012, arXiv:1212.5701. [Google Scholar]
- Zellner, A. An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. J. Am. Stat. Assoc. 1962, 57, 348–368. [Google Scholar] [CrossRef]
- LeSage, J.; Pace, R.K. Introduction to Spatial Econometrics; CRC Press: Boca Raton, FL, USA, 2009. [Google Scholar]
- Hernández-Sanjaime, R.; González, M.; López-Espín, J.J. Multilevel simultaneous equation model: A novel specification and estimation approach. J. Comput. Appl. Math. 2020, 366, 112378. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).