Article

An Analysis of Vectorised Automatic Differentiation for Statistical Applications

by
Chun Fung Kwok
1,
Dan Zhu
2,* and
Liana Jacobi
3
1
St. Vincent’s Institute of Medical Research, Melbourne 3065, Australia
2
Department of Econometrics and Business Statistics, Monash University, Melbourne 3800, Australia
3
Department of Economics, University of Melbourne, Melbourne 3010, Australia
*
Author to whom correspondence should be addressed.
Stats 2025, 8(2), 40; https://doi.org/10.3390/stats8020040
Submission received: 20 March 2025 / Revised: 28 April 2025 / Accepted: 12 May 2025 / Published: 19 May 2025
(This article belongs to the Section Computational Statistics)

Abstract

Automatic differentiation (AD) is a general method for computing exact derivatives in complex sensitivity analyses and optimisation tasks, particularly when closed-form solutions are unavailable and traditional analytical or numerical methods fall short. This paper introduces a vectorised formulation of AD grounded in matrix calculus. It aligns naturally with the matrix-oriented style prevalent in statistics, supports convenient implementations, and takes advantage of sparse matrix representations and other high-level optimisation techniques that are not available in the scalar counterpart. Our formulation is well suited to high-dimensional statistical applications, where finite differences (FD) scale poorly because the computation must be repeated for each input dimension, incurring significant overhead. It is also advantageous in simulation-intensive settings, such as Markov chain Monte Carlo (MCMC)-based inference, where FD requires repeated sampling and multiple function evaluations, whereas AD computes exact derivatives in a single pass, substantially reducing computational cost. Numerical studies demonstrate the efficacy and speed of the proposed AD method compared with FD schemes.

1. Introduction

Automatic differentiation (AD) has become a foundational tool in modern statistical computing, enabling efficient and exact gradient computation in a wide range of applications—from parameter estimation [1,2] and sensitivity analysis [3,4,5] to simulation-based methods such as variational inference and Markov-Chain Monte Carlo (MCMC) inference [6,7,8,9,10,11]. Its widespread adoption is evident in major software ecosystems such as PyTorch [12], TensorFlow [13], Stan [14], and Julia [7], where AD powers both machine learning workflows and traditional statistical methods.
AD works by transforming a program that computes the value of a function into one that also computes its derivatives by systematically applying the chain rule to elementary operations. This allows AD to compute derivatives with machine-level precision and minimal overhead, avoiding truncation and round-off errors and eliminating the need for repeated function evaluations, a known bottleneck in numerical differentiation, especially in high-dimensional problems [8,15,16]. While symbolic differentiation can provide exact derivatives, it requires closed-form expressions and cannot handle procedural logic (e.g., if-else statements and for-loops) or stochastic elements such as random number generation or Monte Carlo simulations. AD bridges this gap, offering a robust and general-purpose solution for derivative computation.
Early implementations of AD relied on operator overloading [17] and source code translation [18] techniques that, while powerful, had notable limitations. Operator overloading incurs significant runtime overhead and is inherently local, recording operations as they occur without access to global program structure or opportunities for optimisation. In contrast, compiler-based systems transform entire programs automatically but often make it difficult to selectively extract intermediate values for debugging or inspection. Modern AD frameworks improve on these predecessors by adopting either eager execution (as in PyTorch 1 and 2) or static computational graphs (as in TensorFlow 1). These approaches offer greater flexibility, traceability, and support for modular development and deep introspection. However, the eager mode requires users to structure their code in specific ways; for example, making explicit calls such as backward() and zero_grad() in PyTorch can appear rigid and error-prone. Moreover, because it operates step-by-step, the eager approach often fails to exploit the broader structure of computations, such as block matrix operations, thereby missing optimisation opportunities. Conversely, while static graph systems are more declarative and amenable to global analysis, they can struggle with dynamic control flow and runtime-dependent logic. In both paradigms, the need to conform to AD-specific programming idioms often shifts user attention away from the statistical problem itself and toward the mechanics of the AD system.
In this work, we present a vectorised formulation of AD grounded in the matrix calculus of [19], designed to align more naturally with the matrix-oriented style prevalent in the field of statistics and the statistical programming language R. Our approach mirrors the derivation style of analytical work, enabling clearer and more intuitive implementations. It also exposes opportunities for high-level optimisation, including the use of sparse matrix representations and block-wise computations, features often inaccessible in bottom-up, scalar-based AD systems. This formulation supports transparent complexity analysis and efficient implementations, particularly in settings involving Kronecker products. It also enables fully automatic workflows, akin to source code translation techniques, while preserving the ability to inspect intermediate variables as in eager execution and graph-based systems.
We introduce a complete set of matrix calculus rules for building an AD system tailored to statistical applications, including operations for random variable simulations and structural transformations, many of which are undocumented in the existing literature. We also introduce the sparse representation of transformation matrices and discuss a range of optimisation techniques applied to the AD system to achieve significant performance gains in practice, which we demonstrate through comparisons with finite differences (FD). As an illustration, we apply the proposed methods using a real data example: a factor model estimated using simulated maximum likelihood, a setting commonly encountered when modelling dependence structures in complex data. The numerical results confirm the computational advantage of the proposed vectorised AD, particularly for simulation-intensive functions where FD incurs unnecessary repeated calculations.
The remainder of the paper is organised as follows: Section 2 introduces the core mechanism of the AD system, presents the full set of matrix calculus rules, and discusses optimisation strategies for the implementation. Section 3 evaluates computational performance and details the application, and Section 4 concludes. All code listings are provided in Appendix B.

2. Materials and Methods

2.1. AD via Vectorisation

2.1.1. From Vector Calculus to Matrix Calculus via Vectorisation

Our AD formulation builds on a set of vector calculus rules rather than elementary scalar calculus [19]. Before presenting the full framework, we introduce three key definitions: Definition 1 defines the derivative of a vector-valued function; Definition 2 introduces the vectorisation operator; and Definition 3 combines the first two to define the derivative of a matrix-valued function.
Definition 1.
Suppose $x \in \mathbb{R}^n$ and $f: \mathbb{R}^n \to \mathbb{R}^m$; then the Jacobian matrix $J$ of $f$ is the $m \times n$ matrix $\frac{\partial f}{\partial x}$ with $(i,j)$ entry given by
$$\frac{\partial f_i}{\partial x_j}, \quad i = 1, 2, \dots, m; \; j = 1, 2, \dots, n,$$
where $f_i$ and $x_j$ are the components of $f$ and $x$.
Definition 2.
Let A be an $m \times n$ matrix and $a_j$ its j-th column. Then $\mathrm{vec}\,A$ is the $mn \times 1$ column vector (i.e., $\mathrm{vec}\,A$ stacks the columns of A):
$$[\,a_{11}\; a_{21}\; \cdots\; a_{m1}\; a_{12}\; a_{22}\; \cdots\; a_{m2}\; \cdots\; a_{1n}\; a_{2n}\; \cdots\; a_{mn}\,]^T.$$
Note that $\mathrm{vec}(ABC) = (C^T \otimes A)\,\mathrm{vec}\,B$, where A, B, and C are three matrices with appropriate dimensions such that the matrix product $ABC$ is well defined; $C^T$ denotes the transpose of C, and $C^T \otimes A$ denotes the Kronecker product of $C^T$ and A.
Definition 3.
Let $F: \mathbb{R}^{n \times q} \to \mathbb{R}^{m \times p}$ be a real matrix function. The Jacobian matrix of F at X is defined to be the $mp \times nq$ matrix
$$D F(X) := \frac{\partial\,\mathrm{vec}\,F(X)}{\partial\,(\mathrm{vec}\,X)^T}.$$
For notational convenience, we write $d\,\mathrm{vec}\,X$ as $dX$. In the definition above, the numerator is always treated as a column vector and the denominator as a row vector. This allows us to write $DF(X)$ as $\frac{d F(X)}{d X}$ without ambiguity, rather than the more cumbersome $\frac{d F(X)}{(d X)^T}$. Indeed, since $\frac{\partial\,\mathrm{vec}\,X}{\partial\,(\mathrm{vec}\,X)^T} = I$, the identity matrix of dimension $nq$, accepting $dX$ as $(dX)^T$ in the denominator amounts to a notational simplification:
$$\frac{d F(X)}{d X} = \frac{\partial\,\mathrm{vec}\,F(X)}{\partial\,\mathrm{vec}\,X} \cdot I = \frac{\partial\,\mathrm{vec}\,F(X)}{\partial\,\mathrm{vec}\,X} \cdot \frac{\partial\,\mathrm{vec}\,X}{\partial\,(\mathrm{vec}\,X)^T} = \frac{\partial\,\mathrm{vec}\,F(X)}{\partial\,(\mathrm{vec}\,X)^T},$$
aligning with the definition above.
A key advantage of Definition 3 is that it allows higher-order matrix derivatives to remain within the familiar matrix framework, rather than escalating into high-order tensors, e.g., the Hessian of F at X is also a matrix. This simplifies notation and facilitates the use of matrix algebra to exploit structure in Jacobian and Hessian matrices, making the formulation more efficient. For further discussion and critique of alternative matrix derivative conventions—including the numerator and denominator layouts—see [19].
Once derivatives are defined for the basic operations, they can be propagated through a computation using the matrix-based chain rule. Suppose that at a certain stage, we have already computed matrices A and B, along with their derivatives D A ( X ) and D B ( X ) , with respect to some input X. The next step of the computation involves evaluating a new matrix C = F ( A , B , X ) , where F is differentiable in all parameters. Using the chain rule in vectorised form, the derivative of C with respect to X is given by:
$$D\,C(X) = \frac{\partial\,\mathrm{vec}\,F}{\partial\,(\mathrm{vec}\,X)^T}(A, B, X) + \frac{\partial\,\mathrm{vec}\,F}{\partial\,(\mathrm{vec}\,A)^T}(A, B, X)\; D\,A(X) + \frac{\partial\,\mathrm{vec}\,F}{\partial\,(\mathrm{vec}\,B)^T}(A, B, X)\; D\,B(X).$$
In our formulation, any computation that can be decomposed into a sequence of basic matrix operations, each admitting a tractable and well-defined derivative, can be differentiated efficiently using the chain rule. These operations include (i) basic matrix arithmetic such as addition, subtraction, product, inverse, and Kronecker product; (ii) element-wise arithmetic such as Hadamard product/division and element-wise univariate differentiable transformations; (iii) scalar-matrix arithmetic such as scalar-matrix addition, subtraction, multiplication, and division; (iv) structural transformations such as extracting elements and rearranging or combining matrices; and (v) operations on matrices such as Cholesky decomposition, column/row sum, cross-products, transposition of cross-products, determinants, and traces. They are presented in Section 2.1.3.

2.1.2. Dual Construction

We present an implementation of an AD system that can differentiate any multivariate matrix polynomial to illustrate the underlying logic of our AD formulation. Let A, B, and C be $n \times n$ matrices and $I_n$ the $n \times n$ identity matrix, and consider the following two matrix calculus rules:
$$C = A + B \;\Rightarrow\; dC = dA + dB,$$
$$C = AB \;\Rightarrow\; dC = (B^T \otimes I_n)\,dA + (I_n \otimes A)\,dB.$$
To implement these rules, we first attach to each matrix a dual component that stores the derivative with respect to some other parameters (i.e., let $A_{dual} = \langle A, dA_d \rangle$, $B_{dual} = \langle B, dB_d \rangle$) and refer to them as dual matrices. For example, if the parameters are the entries of A and B, then $dA_d = \frac{dA}{d[A \mid B]} = [\,I_{n^2},\; 0_{n^2}\,]$; similarly, $dB_d = \frac{dB}{d[A \mid B]} = [\,0_{n^2},\; I_{n^2}\,]$. We can then define the arithmetic for dual matrices using (2) and (3):
$$A_{dual} + B_{dual} = \langle A, dA_d \rangle + \langle B, dB_d \rangle = \langle A + B,\; dA_d + dB_d \rangle,$$
$$A_{dual} \cdot B_{dual} = \langle A, dA_d \rangle \cdot \langle B, dB_d \rangle = \langle AB,\; (B^T \otimes I_n)\,dA_d + (I_n \otimes A)\,dB_d \rangle,$$
and program them as shown in Listing 1.
These 16 lines of code define an AD system that can handle the class of multivariate matrix polynomials formed by addition and multiplication. For example, the derivative of the function $f(A, B) = A(AB + B^2) + B$ is simply the one-line
df <- function(A, B) A %times% ((A %times% B) %plus% (B %times% B)) %plus% B
rather than a tedious program associated with the analytical derivative
$$df(A, B) = \big[(AB + B^2)^T \otimes I_n + (I_n \otimes A)(B^T \otimes I_n)\big]\,dA + \big[(I_n \otimes A)^2 + (I_n \otimes A)(B^T \otimes I_n + I_n \otimes B) + I_{n^2}\big]\,dB.$$
Readers encountering AD for the first time may be surprised that the program df above appears to compute f itself rather than its derivative, which is precisely what makes AD so appealing. Given the addition and multiplication operators defined for dual matrices, any function constructed using these operators will automatically have its derivative computed. Specifically, derivatives are evaluated on the fly each time %times% or %plus% is called. The final output of df is a dual matrix, where the first component is the result f(A, B) and the second component is the derivative of f at (A, B). This approach abstracts away the derivative calculation, allowing users to obtain derivatives automatically once the function f is implemented. To complete the system, subtraction and inverse operations are also required (also 16 lines). To maintain the flow, we list them in Appendix B, along with a complete working example.
Listing 1. Implementation of the sum and product matrix calculus rules in R.
`%plus%` <- function(A_dual, B_dual) {
  A <- A_dual$X;  dA <- A_dual$dX;
  B <- B_dual$X;  dB <- B_dual$dX;
  list(X = A + B, dX = dA + dB)
}

`%times%` <- function(A_dual, B_dual) {
  A <- A_dual$X;  dA <- A_dual$dX;
  B <- B_dual$X;  dB <- B_dual$dX;
  list(
    X = A %*% B,
    dX = (t(B) %x% I(nrow(A))) %*% dA + (I(ncol(B)) %x% A) %*% dB
  )
}

I <- diag   # function to create identity/diagonal matrices
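For illustration, the following minimal usage sketch (ours, not one of the paper's listings) wraps the inputs as dual matrices whose dX components are seeded with the identity blocks described above; the returned dual matrix then carries both f(A, B) and its Jacobian.

# Minimal usage sketch for Listing 1 (illustrative only); assumes %plus%, %times%, I are defined
n <- 2
A <- matrix(rnorm(n^2), n, n);  B <- matrix(rnorm(n^2), n, n)
A_dual <- list(X = A, dX = cbind(diag(n^2), matrix(0, n^2, n^2)))  # dA / d[A | B]
B_dual <- list(X = B, dX = cbind(matrix(0, n^2, n^2), diag(n^2)))  # dB / d[A | B]
df <- function(A, B) A %times% ((A %times% B) %plus% (B %times% B)) %plus% B
out <- df(A_dual, B_dual)
out$X        # value of f(A, B) = A(AB + B^2) + B
dim(out$dX)  # 4 x 8 Jacobian of vec(f) w.r.t. [vec(A), vec(B)]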
          

2.1.3. A Layered Approach to Construction

In the previous section, we showed that formulating AD with dual matrices closely mirrors the underlying analytic derivation. Once the calculus rules for dual matrices are established, building an AD system becomes relatively straightforward. In this section, we present the full set of matrix calculus rules. The rules are grouped by type and presented in the order they would typically be implemented in practice. Illustrative statistical applications are provided in Appendix A.
In the following discussion, the derivative of any matrix is assumed to be taken w.r.t. some input z with d parameters. Hence, if A is an $m \times n$ matrix, then $\frac{dA}{dz}$ is an $mn \times d$ Jacobian matrix, which we write as $dA$ for convenience.

2.1.4. Notation

The symbols $I_\cdot$, $K_\cdot$, $E_\cdot$, $D_\cdot$, $1_\cdot$ are reserved for the following special matrices:
  • $I_n$ is the $n \times n$ identity matrix.
  • $I_{nq}$ is the $n \times q$ matrix whose diagonal entries are all ones and whose off-diagonal entries are all zeros.
  • $K_{nq}$ is the $nq \times nq$ commutation matrix. We also define $K_n := K_{nn}$.
  • $E_n$ is the $\frac{n(n+1)}{2} \times n^2$ elimination matrix.
  • $1_{nq}$ is the $n \times q$ matrix of ones.
(Definitions of the commutation and elimination matrices can be found in Section 2.2.4 and Section 2.2.5, respectively.)
Let A be an $m \times n$ matrix. We denote
  • the $(i,j)$-entry of A by $A_{ij}$ or $A_{i,j}$,
  • the i-th row of A by $A_{i\cdot}$ or $A_{i,\cdot}$,
  • the j-th column of A by $A_{\cdot j}$ or $A_{\cdot, j}$.
We define $v_A(i, j)$ and $v_A^{-1}(k)$ such that they satisfy the relations
$$(i, j)\text{-entry of } A \;\leftrightarrow\; v_A(i, j)\text{-th entry of } \mathrm{vec}(A), \qquad v_A^{-1}(k)\text{-entry of } A \;\leftrightarrow\; k\text{-th entry of } \mathrm{vec}(A),$$
and they are given by the formulas
$$v_A(i, j) = i + (j - 1)m \quad \text{and} \quad v_A^{-1}(k) = \big([(k - 1) \bmod m] + 1,\; \lceil k / m \rceil\big).$$
This index-conversion function is needed because the derivative of the $(i, j)$-entry of A is stored in the $v_A(i, j)$-th row of $dA$, and vice versa.
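As a small illustration (our own sketch, not part of the ADtools package), the two index-conversion functions can be implemented in R directly from the formulas above:

# Index conversion between matrix entries and vec positions (m = number of rows)
v_A     <- function(i, j, m) i + (j - 1) * m
v_A_inv <- function(k, m) c((k - 1) %% m + 1, ceiling(k / m))
v_A(2, 2, 3)    # entry (2, 2) of a 3 x n matrix sits in row 5 of dA
v_A_inv(5, 3)   # returns c(2, 2)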

2.1.5. Matrix Arithmetic

We present the matrix calculus rules associated with basic matrix arithmetic:
  • Addition: Let A and B be $m \times n$ matrices; then $d(A + B) = dA + dB$.
  • Subtraction: Let A and B be $m \times n$ matrices; then $d(A - B) = dA - dB$.
  • Product: Let A and B be $m \times n$ and $n \times k$ matrices; then
    $$d(AB) = (B^T \otimes I_m)\,dA + (I_k \otimes A)\,dB.$$
  • Inverse: Let A be an $n \times n$ invertible matrix; then $d(A^{-1}) = -\big(A^{-T} \otimes A^{-1}\big)\,dA$.
  • Kronecker product: Let A and B be $m \times n$ and $p \times q$ matrices; then
    $$d(A \otimes B) = (I_n \otimes K_{qm} \otimes I_p)\big[(I_{mn} \otimes \mathrm{vec}(B))\,dA + (\mathrm{vec}(A) \otimes I_{pq})\,dB\big].$$
  • Transpose: Let A be an $m \times n$ matrix; then $d(A^T) = K_{mn}\,dA$.
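As a quick sanity check (our own sketch, not taken from the paper's code), the product rule can be verified numerically against central differences by perturbing the entries of A while holding B fixed, so that dB = 0 and the rule reduces to $(B^T \otimes I_m)\,dA$:

# Verify d(AB) = (B^T %x% I_m) dA against central differences (B held fixed)
set.seed(1)
m <- 2; n <- 3; k <- 2
A <- matrix(rnorm(m * n), m, n);  B <- matrix(rnorm(n * k), n, k)
J_rule <- t(B) %x% diag(m)                      # Jacobian of vec(AB) w.r.t. vec(A)
h <- 1e-6
J_fd <- sapply(seq_len(m * n), function(i) {
  E <- matrix(0, m, n);  E[i] <- h
  (c((A + E) %*% B) - c((A - E) %*% B)) / (2 * h)
})
max(abs(J_rule - J_fd))                         # small (of the order of the FD error)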
We now examine the computational advantages of AD relative to FD through a complexity analysis of basic matrix operations. Our discussion is restricted to central (finite) differencing instead of forward differencing, since the former has superior accuracy ([20], pp. 378–379) and is generally preferred over the latter. Applying central differencing to a function f incurs a cost of evaluating f multiplied by twice the dimension of the inputs. (For ease of discussion, we assume that conditional branches, if they exist, have the same order of computational complexity.) Applying AD incurs a cost of evaluating f and the derivative d f as given by the calculus rules.
For matrix arithmetic, suppose A and B are $n \times n$ matrices, so the dimension of the inputs is $d = 2n^2$ for matrix additions, subtractions, products, and Kronecker products, and $d = n^2$ for matrix inversions. The number of operations associated with applying central differencing to $f(A, B): \mathbb{R}^d \to \mathbb{R}^k$ is:
$$2d \text{ perturbations (additions/subtractions) of the input} + 2d \text{ evaluations of } f \text{ with the perturbed input} + d \times 2k \text{ operations for finite-differencing the output (subtractions and divisions)} = 2d \cdot (1 + \mathrm{cost}(f) + k).$$
Assuming standard matrix multiplication, the computational costs of addition, subtraction, multiplication, and the Kronecker product are $n^2$, $n^2$, $2n^3 - n^2$, and $n^4$, respectively. Since we are only interested in the leading terms, these simplify to $n^2$, $n^2$, $2n^3$, and $n^4$. The number of operations for matrix inversion using the LU decomposition followed by the inversion of triangular matrices is around $2n^3$ [21]. It then follows that applying finite differencing would require (in the leading term) $8n^4$ operations for addition and subtraction, $8n^5$ operations for the product, $4n^5$ operations for matrix inversion, and $8n^6$ operations for the Kronecker product.
For AD, the number of operations works out to be $n^2 + 2n^4$ for addition and subtraction, $(2n^3 - n^2) + (8n^5 - 2n^4)$ for multiplication, $2n^3 + (8n^5 - 2n^4)$ for matrix inversion, and $n^4 + 6n^6$ for the Kronecker product. The results are summarised in Table 1. Note that when a Kronecker product is post-multiplied by a matrix, there is a shortcut that avoids the explicit computation of the Kronecker product; the details are given in Section 2.2.7. From Table 1, we observe that finite differencing and AD have the same complexity order for the operations listed, but AD generally has equal or better leading coefficients, except for matrix inversion.

2.1.6. Element-Wise Arithmetic

We now present matrix calculus rules for element-wise operations. These rules follow directly from applying scalar calculus to each entry independently. The cases of addition and subtraction are identical to those covered in the previous section on standard matrix arithmetic. Let A, B, and C be m × n matrices, and diag ( v ) be the square matrix in which the vector v is placed on the diagonal.
  • Hadamard product:
    $$C_{ij} = A_{ij} B_{ij}, \quad i = 1, \dots, m,\; j = 1, \dots, n \;\Rightarrow\; (dC)_{k,\cdot} = B_{v_B^{-1}(k)} (dA)_{k,\cdot} + A_{v_A^{-1}(k)} (dB)_{k,\cdot}, \quad k = 1, \dots, mn.$$
    Alternatively,
    $$C = A \circ B \;\Rightarrow\; dC = \mathrm{diag}(\mathrm{vec}(B))\,dA + \mathrm{diag}(\mathrm{vec}(A))\,dB.$$
  • Hadamard division:
    $$C_{ij} = A_{ij} / B_{ij}, \quad i = 1, \dots, m,\; j = 1, \dots, n \;\Rightarrow\; (dC)_{k,\cdot} = B_{v_B^{-1}(k)}^{-1} (dA)_{k,\cdot} - A_{v_A^{-1}(k)} B_{v_B^{-1}(k)}^{-2} (dB)_{k,\cdot}, \quad k = 1, \dots, mn.$$
    Alternatively,
    $$C = A \oslash B \;\Rightarrow\; dC = \mathrm{diag}(\mathrm{vec}(B^{\circ(-1)}))\,dA - \mathrm{diag}(\mathrm{vec}(A \circ B^{\circ(-2)}))\,dB,$$
    where the powers of B are taken element-wise.
  • Univariate differentiable function f:
    $$C_{ij} = f(A_{ij}), \quad i = 1, \dots, m,\; j = 1, \dots, n \;\Rightarrow\; (dC)_{k,\cdot} = f'(A_{v_A^{-1}(k)}) (dA)_{k,\cdot}, \quad k = 1, \dots, mn.$$
    Alternatively,
    $$C = f(A) \;\Rightarrow\; dC = \mathrm{diag}\big(\mathrm{vec}(f'(A))\big)\,dA,$$
    where $f(A)$ (and $f'(A)$) denotes applying f (and $f'$) element-wise to A. Note that f may also be a function that is differentiable almost everywhere; e.g., $f(x) = |x|$ is differentiable everywhere except at $x = 0$. When the derivative is evaluated at the non-differentiable locations, it is common to use a subgradient [8] (in this case, any value in the interval $[-1, 1]$) or simply to assume a value such as 0, effectively treating these points as having no impact on the result.

2.1.7. Scalar-Matrix Arithmetic

Let A be an $m \times n$ matrix and c be a scalar. Then the differentials $d(c \;\mathrm{Op}\; A)$ and $d(A \;\mathrm{Op}\; c)$, where $\mathrm{Op} \in \{+, -, \times, /\}$, can be computed by lifting the scalar c to a matrix of the same dimension as A, via multiplication with a matrix of ones, $1_{mn}$. This allows the scalar-matrix operation to be treated as an element-wise operation. Importantly, this lifting is a conceptual construct, and the implementation need not construct a new matrix $c \cdot 1_{mn}$; instead, the operation can be performed element-wise directly. For instance, the product rule $d(cA)$ becomes
$$B = cA \;\Rightarrow\; B_{ij} = c A_{ij}, \quad i = 1, \dots, m,\; j = 1, \dots, n; \qquad (dB)_{k,\cdot} = A_{v_A^{-1}(k)}\,dc + c\,(dA)_{k,\cdot}, \quad k = 1, 2, \dots, mn.$$

2.1.8. Structural Transformation

We now present calculus rules for structural transformations—operations that extract, rearrange, or combine matrix entries without performing any arithmetic computations. Because these transformations are primarily operational rather than analytical, they seldom appear in formal derivations and are often left undocumented in standard references.
  • Transpose: Let A be an $m \times n$ matrix; then $d(A^T) = K_{mn}\,dA$.
  • Row binding: Let A, B be $m \times n$ and $p \times n$ matrices and $\mathrm{rowBind}(A, B) := \begin{bmatrix} A \\ B \end{bmatrix}$; then
    $$C = \mathrm{rowBind}(A, B) \;\Rightarrow\; (dC)_{k,\cdot} = 1_{r \le m} \cdot (dA)_{v_A(r, c),\cdot} + 1_{r > m} \cdot (dB)_{v_B(r - m, c),\cdot},$$
    where $(r, c) = v_C^{-1}(k)$, $k = 1, 2, \dots, (m + p)n$.
  • Column binding: Let A, B be $m \times n$ and $m \times p$ matrices and $\mathrm{colBind}(A, B) := [\,A \;\; B\,]$; then
    $$C := \mathrm{colBind}(A, B) \;\Rightarrow\; dC = \mathrm{rowBind}(dA, dB).$$
  • Subsetting: Let A be an $m \times n$ matrix.
    1. Index extraction: $A_{ij}$ for fixed $i, j$ $\Rightarrow$ $d A_{ij} = (dA)_{v_A(i, j),\cdot}$.
    2. Row extraction: $A_{i\cdot}$ for fixed i $\Rightarrow$ $d A_{i\cdot} = (dA)_{S,\cdot}$, where $S = \{i, i + m, \dots, i + (n - 1)m\}$.
    3. Column extraction: $A_{\cdot j}$ for fixed j $\Rightarrow$ $d A_{\cdot j} = (dA)_{S,\cdot}$, where $S = \{(j - 1)m + 1, (j - 1)m + 2, \dots, jm\}$.
    4. Diagonal extraction: $[A_{ii}]_{i = 1, \dots, \min(m, n)}$ (column vector) $\Rightarrow$ $d[A_{ii}]_i = (dA)_{S,\cdot}$, where $S = \{1, 1 + (m + 1), 1 + 2(m + 1), \dots, 1 + (\min(m, n) - 1)(m + 1)\}$.
  • Vectorisation: Let A be an $m \times n$ matrix; then $d\,\mathrm{vec}(A) = dA$.
  • Half-vectorisation: Let A be an $n \times n$ matrix; then $d\,\mathrm{vech}(A) = (dA)_{S,\cdot}$, where $S = \{v_A(i, j),\; i \ge j\}$.
    Note that S follows the column-major order of A, i.e.,
    $$S = \{v_A(1,1), v_A(2,1), \dots, v_A(n,1), v_A(2,2), v_A(3,2), \dots, v_A(n,n)\}.$$
  • Diagonal expansion: Let v be an $n \times 1$ vector; then $\mathrm{diag}(v)$ is defined to be the $n \times n$ matrix with v on the diagonal. If $B = \mathrm{diag}(v)$, then for $k = 1, 2, \dots, n^2$,
    $$(dB)_{k,\cdot} = (dv)_{\lceil k/(n+1) \rceil,\cdot} \cdot 1_{k \in S} + 0 \cdot 1_{k \notin S},$$
    where $S = \{1, n + 2, 2n + 3, \dots, n^2\}$.

2.1.9. Operations on Matrices

  • Cholesky decomposition: Let A be an $n \times n$ positive-definite matrix, $A = LL^T$ its Cholesky decomposition, and $L = \mathrm{Chol}(A)$; then
    $$dL = D_n\big[E_n(I_{n^2} + K_n)(L \otimes I_n)D_n\big]^{-1}E_n\,dA,$$
    where $D_n = E_n^T$ is the duplication matrix for triangular matrices (see Section 2.2.5).
Let A be an $m \times n$ matrix.
  • Column-sum:
    $$\mathrm{colSum}(A) := \Big[\textstyle\sum_i A_{i,j}\Big]_{j=1}^{n} \;\Rightarrow\; (d\,\mathrm{colSum}(A))_{k,\cdot} = \sum_{i = (k-1)m + 1}^{km} (dA)_{i,\cdot}, \quad k = 1, \dots, n.$$
  • Row-sum:
    $$\mathrm{rowSum}(A) := \Big[\textstyle\sum_j A_{i,j}\Big]_{i=1}^{m} \;\Rightarrow\; (d\,\mathrm{rowSum}(A))_{k,\cdot} = \sum_{i \in \{k, k+m, \dots, k+(n-1)m\}} (dA)_{i,\cdot}, \quad k = 1, \dots, m.$$
  • Sum: $\mathrm{sum}(A) := \sum_{i,j} A_{ij} \;\Rightarrow\; d\,\mathrm{sum}(A) = \mathrm{colSum}(dA)$.
  • Cross-product:
    $$\mathrm{crossprod}(A) := A^T A \;\Rightarrow\; d\,\mathrm{crossprod}(A) = (I_{n^2} + K_{nn})(I_n \otimes A^T)\,dA.$$
  • Transpose of cross-product:
    $$\mathrm{tcrossprod}(A) := A A^T \;\Rightarrow\; d\,\mathrm{tcrossprod}(A) = (I_{m^2} + K_{mm})(A \otimes I_m)\,dA.$$
    Alternatively, both ‘crossprod’ and ‘tcrossprod’ can be implemented directly as is, since they are composed of the multiplication and transpose operations defined previously.
Let A be an $n \times n$ matrix.
  • Determinant: $d\det(A) = \det(A) \cdot \mathrm{vec}(A^{-T})^T \cdot dA$.
  • Trace: $d\,\mathrm{tr}(A) = \mathrm{vec}(I_n)^T\,dA$. Alternatively, it can be implemented by composing the sum and diagonal-extraction operations defined previously.

2.1.10. Random Variables

Given a probability space $(\Omega, \mathcal{F}, P)$, a random variable X is an $\mathcal{F}$-measurable function mapping $\Omega$ to $\mathbb{R}$. The formalism suggests that in the process of simulating a random variate, the randomness can always be isolated, and it is possible to differentiate (in the pathwise sense) a random variable $X \sim F_X(x; \alpha)$ w.r.t. the parameters $\alpha$ when the derivative exists. In the simplest case of normal random variables, the parameters and the randomness can be separated as follows:
$$Z \sim N(\mu, \sigma^2) \;\Leftrightarrow\; Z = \mu + \sigma Z_0, \quad Z_0 \sim N(0, 1) \;\Rightarrow\; dZ = d\mu + d\sigma \cdot Z_0.$$
As Z depends smoothly on the parameters $\mu$ and $\sigma$, the derivatives w.r.t. these parameters are well defined. This is commonly referred to as the reparametrisation trick [22].
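A minimal sketch of how this looks under the dual construction (the helper name rnorm_dual and its layout are ours, not the ADtools API): the randomness $Z_0$ is drawn once, and each draw's derivative row is assembled from the duals of $\mu$ and $\sigma$.

# Dual draws from N(mu, sigma^2) via the reparametrisation Z = mu + sigma * Z0 (illustrative sketch)
rnorm_dual <- function(mu_dual, sigma_dual, n_draws = 1) {
  z0 <- rnorm(n_draws)                               # isolated randomness
  list(X  = mu_dual$X + sigma_dual$X * z0,           # simulated draws
       dX = rep(1, n_draws) %*% mu_dual$dX +         # each row: d mu + z0 * d sigma
            z0 %*% sigma_dual$dX)
}
mu_dual    <- list(X = 0.5, dX = matrix(c(1, 0), 1, 2))   # parameters are (mu, sigma)
sigma_dual <- list(X = 2.0, dX = matrix(c(0, 1), 1, 2))
rnorm_dual(mu_dual, sigma_dual, n_draws = 3)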
When explicit separation cannot be done, we utilise the inverse transform method. Suppose $Z \sim F_Z(z; \theta)$, where $F_Z(z; \theta)$ is the cumulative distribution function of Z, assumed to be invertible, and $\theta$ is the parameter. Then Z can be simulated using the inverse transform method $Z = F_Z^{-1}(U; \theta)$, $U \sim U[0, 1]$. It then follows that if $F_Z^{-1}(\cdot\,; \theta)$ is differentiable in $\theta$, then the derivative of a random sample is well defined. This applies to, for instance, the Exponential, Weibull, Rayleigh, log-Cauchy, and log-Logistic distributions.
In the most general case, where Z is high-dimensional and $F_Z^{-1}$ may not be known, we rely on the class of isoprobabilistic transformations $T(x_1, \dots, x_k; \alpha)$, which transform an absolutely continuous k-variate distribution $F(x_1, \dots, x_k; \alpha)$ into the uniform distribution on the k-dimensional hypercube [23]. This gives an explicit formula for the derivative of a random vector:
$$\frac{\partial X}{\partial \alpha} = -\left[\frac{\partial T(X, \alpha)}{\partial X}\right]^{-1} \frac{\partial T(X, \alpha)}{\partial \alpha},$$
assuming $\det \frac{\partial T(X, \alpha)}{\partial X} \ne 0$ so that the inverse exists.
For clarity, let us consider a one-dimensional example. Suppose we have a random variable $X \sim F_X(x; \alpha)$, where $F_X$ is invertible; then an isoprobabilistic transformation $T(X, \alpha)$ is simply $F_X(X; \alpha)$, as $F_X(X; \alpha)$ is distributed uniformly. Hence, it follows that
$$\frac{\partial X}{\partial \alpha} = -f_X(X; \alpha)^{-1} \frac{\partial F_X(X; \alpha)}{\partial \alpha}.$$
It is easy to check via elementary means that this is indeed correct. Starting with the identity $F_X(F_X^{-1}(U; \alpha); \alpha) = U$ and applying implicit differentiation, we have
$$\frac{\partial}{\partial \alpha} F_X(F_X^{-1}(U; \alpha); \alpha) = 0 \;\Rightarrow\; \frac{\partial F_X}{\partial x}(F_X^{-1}(U; \alpha); \alpha) \cdot \frac{\partial F_X^{-1}(U; \alpha)}{\partial \alpha} + \frac{\partial F_X}{\partial \alpha}(F_X^{-1}(U; \alpha); \alpha) = 0$$
$$\Rightarrow\; \frac{\partial F_X^{-1}(U; \alpha)}{\partial \alpha} = -\frac{\frac{\partial F_X}{\partial \alpha}(F_X^{-1}(U; \alpha); \alpha)}{f_X(F_X^{-1}(U; \alpha); \alpha)} \;\Rightarrow\; \frac{\partial X}{\partial \alpha} = -f_X(X; \alpha)^{-1} \frac{\partial F_X(X; \alpha)}{\partial \alpha}.$$
Specific cases handled in this way include the gamma, inverse-gamma, and chi-squared distributions in one dimension, as well as the Dirichlet, Wishart, and inverse-Wishart distributions in the multivariate setting.
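For a concrete one-dimensional illustration (our own sketch), take the Exponential distribution with rate $\lambda$, where $F_X(x; \lambda) = 1 - e^{-\lambda x}$ and $X = F_X^{-1}(U; \lambda) = -\log(1 - U)/\lambda$; the pathwise derivative from the formula above coincides with differentiating the inverse transform directly:

# Pathwise derivative of Exponential(rate) draws w.r.t. the rate parameter (illustrative check)
set.seed(1)
U <- runif(5);  rate <- 2
X <- -log(1 - U) / rate                                      # inverse transform draws
dX_direct   <- log(1 - U) / rate^2                           # = -X / rate, differentiating F^{-1} in rate
dX_implicit <- -(1 / dexp(X, rate)) * (X * exp(-rate * X))   # -f_X(X)^{-1} * dF_X / d rate
all.equal(dX_direct, dX_implicit)                            # TRUE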

2.2. Optimising AD Implementation

In the previous section, we introduced the vectorised AD formulation along with the full set of matrix calculus rules to support the dual construction. In this section, we explore several implementation strategies aimed at optimising execution. Benchmarking is carried out in R using the ADtools package available on CRAN ([24]) and GitHub ([25]). While the exact performance gains presented here (and in Section 3) are environment-specific, the optimisation principles are broadly applicable, and improvements can be expected in other environments.

2.2.1. Memoisation

Memoisation (or tabulation) is a technique for non-invasively attaching a cache to a function, allowing it to store and reuse previously computed results for repeated inputs [26]. It can greatly accelerate tasks such as constructing large structured matrices and provides a convenient way to organise computations.
The technique works by checking whether a given input has already been evaluated. If so, the cached result is returned; if not, the computation is performed, and the result is stored for future use. Table 2 shows a speed comparison of the built-in R function diag, with and without memoisation. An illustrative 12-line implementation is included in Appendix B.
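The sketch below illustrates the idea in a few lines (it mirrors, but is not identical to, the Appendix implementation; the CRAN package memoise offers the same functionality):

# Attach a cache to a single-argument function (illustrative memoisation sketch)
memoise_fn <- function(f) {
  cache <- new.env(parent = emptyenv())
  function(x) {
    key <- paste(x, collapse = ",")
    if (!exists(key, envir = cache)) assign(key, f(x), envir = cache)
    get(key, envir = cache)
  }
}
diag_m <- memoise_fn(diag)
system.time(diag_m(5000))   # first call computes and caches the 5000 x 5000 matrix
system.time(diag_m(5000))   # second call returns the cached result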

2.2.2. Sparse Matrix Representation

For efficient implementation, all special matrices are constructed and stored using sparse representations, which improve both computational and memory efficiency. A sparse matrix is typically represented as a list of triples, where each triple ( i , j , v ) records the value v at position ( i , j ) in the matrix.

2.2.3. The Diagonal Matrix D n

An n × n diagonal matrix D n is represented as { ( k , k , v k ) , k = 1 , 2 , , n } , where v k is the kth diagonal entry of D n . This representation takes O ( n ) storage space and incurs an O ( n 2 ) computation cost when multiplied by a n × n dense matrix. A speed comparison of the diagonal matrix function with dense and sparse representations is provided in Table 3. In the table, “Dense” uses the R function diag, and “Sparse” uses the R function ADtools::diagonal. Using sparse representation, a substantial increase in speed was observed for large matrices.
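As an illustration of the triplet idea (our own sketch using the Matrix package rather than the ADtools internals):

# Triplet-based sparse diagonal matrix versus a dense one (illustrative sketch)
library(Matrix)
n <- 5000
v <- rnorm(n)
D_sparse <- sparseMatrix(i = 1:n, j = 1:n, x = v)   # stores only n triples (i, i, v_i)
D_dense  <- diag(v)                                  # stores n^2 entries
A <- matrix(rnorm(n * 10), n, 10)
system.time(D_dense  %*% A)
system.time(D_sparse %*% A)                          # avoids touching the zero entries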

2.2.4. The Commutation Matrix K n q

The $nq \times nq$ matrix $K_{nq}$ is a commutation matrix if $\mathrm{vec}(A^T) = K_{nq}\,\mathrm{vec}(A)$ for any $n \times q$ matrix A [27]. From (7), we have
$$k\text{-th entry of } \mathrm{vec}(A) = v_A^{-1}(k)\text{ entry of } A = (a, b)\text{ entry of } A = (b, a)\text{ entry of } A^T = v_{A^T}(b, a)\text{-th entry of } \mathrm{vec}(A^T) = [\,b + (a - 1)q\,]\text{-th entry of } \mathrm{vec}(A^T),$$
where $a = [(k - 1) \bmod n] + 1$ and $b = \lceil k/n \rceil$. As $K_{nq}$ is the matrix that maps $\mathrm{vec}(A)$ to $\mathrm{vec}(A^T)$, and we have derived that the k-th entry of $\mathrm{vec}(A)$ needs to map to the $\big(\lceil k/n \rceil + [(k - 1) \bmod n]\,q\big)$-th entry of $\mathrm{vec}(A^T)$, it follows that $K_{nq}$ is a matrix having value 1 at position $\big(\lceil k/n \rceil + [(k - 1) \bmod n]\,q,\; k\big)$, $k = 1, 2, \dots, nq$. Therefore, in the sparse representation, we have
$$K_{nq} = \Big\{\big(\lceil k/n \rceil + [(k - 1) \bmod n]\,q,\; k,\; 1\big),\; k = 1, 2, \dots, nq\Big\}.$$
It is worth noting that since the commutation matrix simply reorders the entries of $\mathrm{vec}(A)$, one can implement a function that directly remaps the indices as specified in (8), rather than explicitly constructing the matrix and performing the associated matrix multiplication. Table 4 compares the performance of commutation matrix functions implemented using dense and sparse representations. The “Dense” implementation relies on the R function matrixcalc::commutation.matrix, while the “Sparse” version uses ADtools::commutation_matrix. A substantial improvement in speed is observed with the sparse approach.
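A minimal sketch of this construction (ours, using the Matrix package; ADtools provides its own implementation) together with a check of the defining identity:

# Sparse commutation matrix K_{nq} from the index formula in (8); check vec(A^T) = K vec(A)
library(Matrix)
commutation_sparse <- function(n, q) {
  k    <- 1:(n * q)
  rows <- ceiling(k / n) + ((k - 1) %% n) * q
  sparseMatrix(i = rows, j = k, x = 1, dims = c(n * q, n * q))
}
A <- matrix(rnorm(12), 3, 4)                      # n = 3, q = 4
K <- commutation_sparse(3, 4)
all.equal(as.numeric(K %*% c(A)), c(t(A)))        # TRUE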

2.2.5. The Elimination Matrix E n

Let E (and $E_n$) denote the elimination matrix and D the duplication matrix. These matrices are defined by the identities they satisfy:
$$\mathrm{vech}(A) = E\,\mathrm{vec}(A),$$
$$D\,\mathrm{vech}(A) = \mathrm{vec}(A), \quad \text{for symmetric } A,$$
where $\mathrm{vech}(\cdot)$ is the half-vectorisation operator (vectorising the lower-triangular part of a square matrix). The names of these special matrices come from the fact that D duplicates entries to turn a half-vector into a full vector, and E eliminates entries to turn a full vector into a half-vector. Note that if A is an $n \times n$ matrix, then $\mathrm{vec}(A)$ has length $n^2$ and $\mathrm{vech}(A)$ has length $\frac{n(n+1)}{2}$. Hence, D has dimension $n^2 \times \frac{n(n+1)}{2}$, and E has dimension $\frac{n(n+1)}{2} \times n^2$.
Now we derive the sparse representation of the elimination matrix. First, for an $n \times n$ matrix A, we define $h_A(i, j)$ and $h_A^{-1}(k)$ such that they satisfy the relations
$$(i, j)\text{-entry of } A \;\leftrightarrow\; h_A(i, j)\text{-th entry of } \mathrm{vech}(A), \qquad h_A^{-1}(k)\text{-entry of } A \;\leftrightarrow\; k\text{-th entry of } \mathrm{vech}(A),$$
where $i \ge j$, so that $h_A^{-1}(k)$ must lie in the lower-triangular part of A. The functions $h_A(\cdot, \cdot)$ and $h_A^{-1}(\cdot)$ are given by the formulae
$$h_A(i, j) = i + (j - 1)n - \frac{j(j - 1)}{2} \quad \text{and} \quad h_A^{-1}(k) = (a, b),$$
where $a = k + \frac{b(b - 1)}{2} - (b - 1)n$, $b = f(k, n, 1)$, and
$$f(k, n, c) = \begin{cases} f(k - n, n - 1, c + 1) & \text{if } k > n, \\ c & \text{otherwise.} \end{cases}$$
$h_A(\cdot, \cdot)$ and $h_A^{-1}(\cdot)$ are needed to convert back and forth between the matrix and half-vector representations. It then follows that
$$k\text{-th entry of } \mathrm{vech}(A) = h_A^{-1}(k)\text{ entry of } A = v_A(h_A^{-1}(k))\text{-th entry of } \mathrm{vec}(A) = v_A\Big(k + \tfrac{b(b-1)}{2} - (b - 1)n,\; b\Big)\text{-th entry of } \mathrm{vec}(A) = \Big(k + \tfrac{b(b-1)}{2}\Big)\text{-th entry of } \mathrm{vec}(A),$$
where $b = f(k, n, 1)$. Since by definition $E_n$ maps $\mathrm{vec}(A)$ to $\mathrm{vech}(A)$, and we have shown that the $\big(k + \tfrac{b(b-1)}{2}\big)$-th entry of $\mathrm{vec}(A)$ is mapped to the k-th entry of $\mathrm{vech}(A)$, $E_n$ is a matrix having a value of 1 at position $\big(k,\; k + \tfrac{b(b-1)}{2}\big)$. Hence, the sparse representation of the elimination matrix is given by
$$E_n = \Big\{\big(k,\; k + \tfrac{b(b-1)}{2},\; 1\big),\; k = 1, 2, \dots, \tfrac{n(n+1)}{2}\Big\}.$$
In the actual implementation, b does not need to be computed recursively; it can be obtained directly using the closed-form expression $b = \big\lceil (n + 0.5) - \sqrt{(n + 0.5)^2 - 2k} \big\rceil$. A performance comparison of the elimination matrix function using dense and sparse representations is shown in Table 5. In the table, “Dense” uses the R function matrixcalc::elimination.matrix, and “Sparse” uses the R function ADtools::elimination_matrix. The results demonstrate a clear speed advantage for the sparse implementation, with performance gains increasing as matrix size grows.
We also define the half-duplication matrix $\bar D$ to be the matrix that satisfies $\bar D\,\mathrm{vech}(A) = \mathrm{vec}(A)$ for any lower-triangular matrix A; for such A, we have $\bar D\,\mathrm{vech}(A) = E^T\,\mathrm{vech}(A)$, i.e., $\bar D = E^T$.
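A minimal sketch of the construction (ours, using the Matrix package), with b obtained from the closed-form expression:

# Sparse elimination matrix E_n; E_n %*% vec(A) recovers vech(A)
library(Matrix)
elimination_sparse <- function(n) {
  k    <- 1:(n * (n + 1) / 2)
  b    <- ceiling((n + 0.5) - sqrt((n + 0.5)^2 - 2 * k))   # column index, closed form
  cols <- k + b * (b - 1) / 2
  sparseMatrix(i = k, j = cols, x = 1, dims = c(n * (n + 1) / 2, n^2))
}
A <- matrix(1:9, 3, 3)
as.numeric(elimination_sparse(3) %*% c(A))   # 1 2 3 5 6 9, i.e. vech(A)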

2.2.6. Matrix Chain Multiplication

In the implementation of vectorised AD, the derivative computation frequently involves sequences of matrix multiplications, as shown in Equation (1). While mathematically straightforward, these operations can become computational bottlenecks in high-dimensional settings. A key yet often overlooked aspect is that matrix multiplication, although associative in output—i.e., ( A B ) C = A ( B C ) —is not associative in computational cost. To illustrate, consider n × n matrices A , B and a n × 1 vector x. Evaluating ( A B ) x requires explicitly forming the intermediate matrix A B , resulting in a complexity O ( n 3 ) . In contrast, computing A ( B x ) avoids this and reduces the cost to O ( n 2 ) . This simple example highlights a crucial insight: although the result is invariant to the order of operations, the efficiency is not. In large-scale computations, suboptimal ordering—such as naive left-to-right evaluation—can be unnecessarily costly. In simple cases where the dimensions of the matrices are known in advance, an optimal order can be enforced manually. However, in many applications, matrix dimensions are unknown until runtime, making it impossible to prespecify the optimal multiplication order. This leads to the matrix chain multiplication problem.
Matrix chain multiplication is an optimisation problem concerned with multiplying a chain of matrices $A_1 \cdot A_2 \cdots A_m$ using the least number of arithmetic operations. Solved by naive recursion, the problem has complexity $O(2^m)$, which can be reduced to $O(m^3)$ when the memoisation technique is employed (i.e., dynamic programming). Ref. [28] provides an algorithm that solves the problem with $O(m \log m)$ complexity. However, given that in many applications the length of the matrix chain rarely goes beyond $m = 5$, it is usually sufficient to consider the simpler dynamic programming solution as follows:
Let $m(i, j)$ be the minimal number of arithmetic operations needed to multiply out a chain of matrices $A_i \cdot A_{i+1} \cdots A_j$, and suppose that, for any i, $A_i$ has dimension $d_i \times d_{i+1}$. Our goal is to find $m(1, m)$, the cost of the full chain. The recursive formula is given by [29]:
$$m(i, j) = \begin{cases} 0 & \text{if } i = j, \\ \min_{i \le k < j} \big\{ m(i, k) + m(k + 1, j) + d_i \cdot d_{k+1} \cdot d_{j+1} \big\} & \text{if } i < j. \end{cases}$$
The above only provides the optimal number of arithmetic operations. To obtain the order of multiplication, we define the split point of the matrix chain A i · A i + 1 · · A j as:
$$s(i, j) = \arg\min_{i \le k < j} \big\{ m(i, k) + m(k + 1, j) + d_i \cdot d_{k+1} \cdot d_{j+1} \big\}.$$
For example, if s ( 1 , 4 ) = 2 , then the matrix chain A 1 A 2 A 3 A 4 should be split after index 2 in the order of ( A 1 A 2 ) ( A 3 A 4 ) , whereas if s ( 1 , 4 ) = 3 , then the matrix chain is ordered as ( A 1 A 2 A 3 ) ( A 4 ) , after which one inspects s ( 1 , 3 ) to decide the full order.
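A minimal sketch of this dynamic-programming solution (ours; ADtools uses its own implementation) that returns both the optimal cost and the split points:

# Matrix chain ordering: dims = c(d_1, ..., d_{m+1}), so A_i is dims[i] x dims[i+1]
chain_order <- function(dims) {
  p <- length(dims) - 1
  stopifnot(p >= 2)
  m <- matrix(0, p, p);  s <- matrix(0L, p, p)
  for (len in 2:p) {
    for (i in 1:(p - len + 1)) {
      j <- i + len - 1
      costs <- sapply(i:(j - 1), function(k)
        m[i, k] + m[k + 1, j] + dims[i] * dims[k + 1] * dims[j + 1])
      m[i, j] <- min(costs)
      s[i, j] <- (i:(j - 1))[which.min(costs)]
    }
  }
  list(cost = m[1, p], split = s)
}
chain_order(c(10, 200, 10, 200))   # cost 40000; s(1, 3) = 2, i.e. (A1 A2) A3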
Table 6 shows the increase in speed gained by switching from a naive (left-to-right) order to an optimal order. The comparison was conducted using 1000 simulations. The length of the chain was sampled from the set $\{2, 3, 4, 5\}$, with probabilities proportional to $\{\tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{4}, \tfrac{1}{5}\}$, and matrix sizes were sampled from the discrete uniform distribution $U[10, 200]$. Note that there is no speed-up when multiplying two matrices because there is only one possible order (the extra 0.04 is merely statistical noise). For matrix chains of length three to five, the average speed-up was about 1.5 times. The speed-up distributions conditional on the length of the chain were all positively skewed and had positive excess kurtosis (i.e., fat tails).

2.2.7. Kronecker Products

Among the basic matrix operations, the Kronecker product is one of the most computationally expensive. In general, computing the Kronecker product of an $m \times n$ matrix and a $p \times q$ matrix has a complexity of $O(mnpq)$. For simplicity, assume $m = n = p = q$; then the complexity becomes $O(n^4)$. In the context of Jacobian matrix computations, it is rare to compute a standalone Kronecker product. Instead, it typically appears as part of a larger expression, for example, in forms such as $X(B \otimes A)$ and $(B \otimes A)Z$, where A, B are of size $n \times n$ and X, Z are of size $m \times n^2$ and $n^2 \times m$, respectively.
If one first computes the Kronecker product explicitly and then multiplies it by the remaining matrix, the total complexity is $O(n^4 + n^4 m) = O(n^4 m)$. However, by exploiting structural properties and avoiding explicit computation of the Kronecker product, the same result can be obtained in $O(n^3 m)$ operations. We now show that this reduced complexity holds in the general case as well:
Proposition 1.
Suppose $A_1, A_2, \dots, A_p$ are $n \times n$ matrices ($p \ge 2$), and X, Z are $m \times n^p$ and $n^p \times m$ matrices, respectively. Then $X(A_1 \otimes A_2 \otimes \cdots \otimes A_p)$ and $(A_1 \otimes A_2 \otimes \cdots \otimes A_p)Z$ can be computed in $O(n^{p+1} m)$ operations instead of the $O(n^{2p} m)$ operations required by the naive order.
The proposition suggests that unless the Kronecker product itself is of interest, one should never compute it explicitly when it appears in a multiplication, because the algorithm (presented after the proof) is faster by $(p - 1)$ orders of magnitude in n. The proposition also holds when $A_1, \dots, A_p$ have arbitrary sizes, but we do not state it in that form because it obscures the complexity improvement and the logic of the proof. Nevertheless, the algorithm provided later does support the most general case.
Proof. 
We begin with the base case $(B \otimes A)Z$. Let $b_{i,j}$ be the $(i, j)$ element of B and $Z_i$ the i-th block-row of Z (which is of size $n \times m$); then
$$(B \otimes A)Z = \begin{bmatrix} b_{1,1}A & b_{1,2}A & \cdots & b_{1,n}A \\ b_{2,1}A & b_{2,2}A & \cdots & b_{2,n}A \\ \vdots & \vdots & \ddots & \vdots \\ b_{n,1}A & b_{n,2}A & \cdots & b_{n,n}A \end{bmatrix} \begin{bmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_n \end{bmatrix} = \begin{bmatrix} \sum_{j=1}^n b_{1,j} A Z_j \\ \sum_{j=1}^n b_{2,j} A Z_j \\ \vdots \\ \sum_{j=1}^n b_{n,j} A Z_j \end{bmatrix} \overset{\text{denote}}{=:} \left[\sum_{j=1}^n b_{k,j} A Z_j\right]_{k=1}^{n}.$$
For the k-th block-row, we have $\sum_{j=1}^n b_{k,j} A Z_j = A \sum_{j=1}^n b_{k,j} Z_j$. Together with the multiplication by A, the sum can be computed in $O(n^2 m)$ operations, and because there are n block-rows, the overall complexity is $O(n^3 m)$.
We now proceed to the general case by abstracting the component that makes the above work and applying it recursively to the chain of Kronecker products. If we define two binary operations ⊡ and ⊛ such that
$$A ⊡ V = A ⊡ [V_k]_{k=1}^{n} \overset{\text{def}}{:=} [A V_k]_{k=1}^{n} \quad \text{and} \quad B ⊛ Z = B ⊛ [Z_k]_{k=1}^{n} \overset{\text{def}}{:=} \left[\sum_{j=1}^n b_{k,j} Z_j\right]_{k=1}^{n},$$
then $(B \otimes A)Z = A ⊡ (B ⊛ Z) = A ⊡ [(B ⊛ Z)_k]_{k=1}^{n}$. The key idea behind the two new binary operations is that they define block-wise matrix multiplication. Suppose both V and Z can be split into n-by-1 blocks. The first binary operation, $A ⊡ V$, defines block-wise (pre-)multiplication, where each block of V is pre-multiplied by A, with the number of columns of A matching the number of rows of a block. The second binary operation, ⊛, defines block-wise matrix multiplication: it produces a matrix of $n \times 1$ blocks, where the i-th block is given by $\sum_{j=1}^n b_{i,j} Z_j$, naturally extending the usual matrix multiplication $\sum_{j=1}^n b_{i,j} c_j$, where $c_j$ is the j-th entry of a column vector c. The last term above is also denoted by $[(B ⊛ Z)_k]_{k=1}^{n}$, so that $(B ⊛ Z)_k$ corresponds to the k-th block-row $\sum_{j=1}^n b_{k,j} Z_j$.
The new binary operations allow us to evaluate the expression in a different order and avoid the Kronecker product in the process. As a result, it takes fewer arithmetic operations to evaluate the expression, as we have seen in the base case. Next, it follows that
$$(C \otimes B \otimes A)Z = (C \otimes (B \otimes A))Z = (B \otimes A) ⊡ (C ⊛ Z) = (B \otimes A) ⊡ \big[(C ⊛ Z)_{k_C}\big]_{k_C=1}^{n} = \Big[A ⊡ \big(B ⊛ (C ⊛ Z)_{k_C}\big)\Big]_{k_C=1}^{n} = \bigg[\Big[A\,\big(B ⊛ (C ⊛ Z)_{k_C}\big)_{k_B}\Big]_{k_B=1}^{n}\bigg]_{k_C=1}^{n},$$
and the general case is given by
$$(A_1 \otimes A_2 \otimes \cdots \otimes A_p)Z = \underbrace{\bigg[\cdots\Big[\big[}_{p-1}\, A_p\,\big(A_{p-1} ⊛ (A_{p-2} ⊛ (\cdots(A_1 ⊛ Z)_{k_1}\cdots))_{k_{p-2}}\big)_{k_{p-1}} \big]_{k_{p-1}=1}^{n}\cdots\Big]_{k_2=1}^{n}\bigg]_{k_1=1}^{n}.$$
Intuitively, every time we use the new binary operations to avoid a Kronecker product, the complexity is reduced by one order, and given that there are $(p - 1)$ of them, we expect the total complexity to be $O(n^{2p - (p-1)} m) = O(n^{p+1} m)$. We now present the formal proof by induction.
Let S(p) be the statement that $(A_p \otimes A_{p-1} \otimes \cdots \otimes A_1) Z^{(p)}$ has complexity $O(n^{p+1} m)$, where $A_1, \dots, A_p$ are $n \times n$ matrices and $Z^{(p)}$ denotes an $n^p \times m$ matrix. We have shown in (12) and (13) that S(2) is true. Now suppose S(p) is true and consider S(p + 1):
$$(A_{p+1} \otimes A_p \otimes \cdots \otimes A_1)\, Z^{(p+1)}$$
$$= (A_p \otimes \cdots \otimes A_1) ⊡ (A_{p+1} ⊛ Z^{(p+1)})$$
$$= (A_p \otimes \cdots \otimes A_1) ⊡ \big[(A_{p+1} ⊛ Z^{(p+1)})_k\big]_{k=1}^{n}$$
$$= \big[(A_p \otimes \cdots \otimes A_1)\, Z^{(p)}_k\big]_{k=1}^{n}.$$
In the third line, computing $(A_{p+1} ⊛ Z^{(p+1)})_k$ requires $O(n^{p+1} m)$ operations and results in a matrix $Z^{(p)}_k$ of size $n^p \times m$. In the last line, the expression inside the square bracket has, by the induction hypothesis, a complexity of $O(n^{p+1} m)$, so the complexity accumulated for each block remains $O(n^{p+1} m)$. Finally, given that there are n block-rows, the overall complexity is $O(n^{p+2} m)$. This proves the inductive step, and by induction, S(p) is true for all $p \ge 2$. This completes the proof for the case $(A_1 \otimes A_2 \otimes \cdots \otimes A_p)Z$ in Proposition 1.
For the other case, $X(A_1 \otimes A_2 \otimes \cdots \otimes A_p)$, because we are pre-multiplying the chain of Kronecker products, we work with blocks of columns instead of blocks of rows. Denote
$$X = \begin{bmatrix} X_1 & X_2 & \cdots & X_n \end{bmatrix}$$
by $[X_k]_{k=1}^{n,c}$ (where c stands for columns). Then the two corresponding binary operators $⊡_c$ and $⊛_c$ are defined as
$$V ⊡_c A = [V_k]_{k=1}^{n,c} ⊡_c A \overset{\text{def}}{:=} [V_k A]_{k=1}^{n,c}, \quad \text{and} \quad Z ⊛_c B = [Z_k]_{k=1}^{n,c} ⊛_c B \overset{\text{def}}{:=} \left[\sum_{i=1}^n b_{i,k} Z_i\right]_{k=1}^{n,c}.$$
Now it follows that $X(B \otimes A) = (X ⊛_c B) ⊡_c A$, and the remainder of the proof proceeds in the same way as in the other case. □
In Table 7, we compare the performance of evaluating $(B \otimes A)Z$ and $X(B \otimes A)$ with and without explicitly computing the Kronecker product. We conducted 1000 simulations, and in each simulation, the number of rows of X, A, B and the number of columns of A, B, Z were sampled from $\zeta$, where $\zeta \sim U[10, 50]$. Naturally, the number of columns of X needs to match the number of rows of $B \otimes A$ (i.e., the number of rows of B times the number of rows of A), and likewise for the number of rows of Z. The speed-up was computed as $t_{\mathrm{explicit}} / t_{\mathrm{implicit}}$, where the two quantities denote the time needed to evaluate the full expression using the explicit and the implicit Kronecker product, respectively. We note that in the case of $(B \otimes A)Z$, there were two speed-up “outliers”, one at 0.42 and the other at 0.94; all the rest were above 1. The median speed-up was about 15×, and the mean speed-up about 16×, favouring the implicit evaluation.
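For illustration, a minimal sketch (ours, not the ADtools implementation) of the implicit evaluation of $(B \otimes A)Z$: it applies the identity $(B \otimes A)\,\mathrm{vec}(V) = \mathrm{vec}(A V B^T)$ from Definition 2 column by column, which avoids forming the Kronecker product and has the same order of cost as the block-row scheme used in the proof.

# Evaluate (B %x% A) %*% Z without forming the Kronecker product explicitly
kron_mult <- function(B, A, Z) {
  p <- ncol(B);  n <- ncol(A)                       # Z must have n * p rows
  apply(Z, 2, function(z) c(A %*% matrix(z, n, p) %*% t(B)))
}
A <- matrix(rnorm(9), 3, 3);  B <- matrix(rnorm(9), 3, 3)
Z <- matrix(rnorm(9 * 4), 9, 4)
max(abs(kron_mult(B, A, Z) - (B %x% A) %*% Z))      # ~ machine precision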

2.2.8. Kronecker Product: More Special Cases

We identified common special cases of the Kronecker product and represented them using the new binary operators to reduce the computation cost further. In particular, we examine the four cases $A(B \otimes I)$, $A(I \otimes C)$, $(B \otimes I)D$, and $(I \otimes C)D$. These are chosen because they arise naturally in common operations such as the product rule $d(AB) = (B^T \otimes I_n)\,dA + (I_n \otimes A)\,dB$ and the rule for $d(AA^T)$, which involves terms of the form $(I \otimes A)$ and $(A \otimes I)$. Moreover, computing a Kronecker product with an identity matrix explicitly merely makes copies of the original matrix and arranges them in a particular way; avoiding it therefore yields savings in time and memory use, even though the complexity order remains the same.
Note that $I ⊡ V = [I V_k]_{k=1}^{n} = [V_k]_{k=1}^{n} = V$ and $I ⊛ Z = [Z_k]_{k=1}^{n} = Z$, so
  • $(B \otimes I)D = I ⊡ (B ⊛ D) = B ⊛ D$,
  • $(I \otimes C)D = C ⊡ (I ⊛ D) = C ⊡ D$.
Similarly, $V ⊡_c I = [V_k I]_{k=1}^{n,c} = [V_k]_{k=1}^{n,c} = V$ and $Z ⊛_c I = [Z_k]_{k=1}^{n,c} = Z$, so
  • $A(B \otimes I) = (A ⊛_c B) ⊡_c I = A ⊛_c B$,
  • $A(I \otimes C) = (A ⊛_c I) ⊡_c C = A ⊡_c C$.
In Table 8, we present the speed-up of the optimised implementation over the naive implementation. We conducted 1000 simulations, and in each simulation, the number of rows of A, B, C, I and the number of columns of B, C, D, I were sampled from $\zeta$, where $\zeta \sim U[10, 50]$. The number of columns of A and the number of rows of D were specified such that the multiplication makes sense (and the choice is unique). The speed-up was computed as $t_{\mathrm{naive}} / t_{\mathrm{optimised}}$, where the two quantities denote the evaluation times of the naive and the optimised implementations, respectively. We observe that the median speed-up is about 8× and the mean speed-up about 10×, which aligns with our theoretical result that the Kronecker product should not be evaluated explicitly unless the product itself is of interest.

3. Results

This section provides some computational examples to demonstrate the speed and efficacy of our proposed methods. We first benchmark our method against the traditional numerical derivative under basic matrix operations and the computation of a covariance matrix’s log determinant. We then demonstrate our derivative computation’s effectiveness within a large stochastic optimisation scheme, i.e., simulated maximum likelihood estimation (SMLE) of a stochastic factor model.

3.1. Basic Operations

We benchmarked the performance of AD against that of FD using the basic arithmetic operations: addition, subtraction, multiplication, matrix inversion, and the Kronecker product. The results are presented in Table 9. The time figures in the table represent averages over 100 executions. Overall, AD performed much faster than FD.

3.2. Dynamic Factor Model Inference

Numerical assessments are often required in both classical and Bayesian statistical inference. Below, we illustrate the benefits of applying AD to derivative computation for the maximum likelihood estimation (MLE) of factor models when the analytical expression of the likelihood is not available and numerical assessment of derivatives is required. We show substantial computational gains using both simulated and real data. Readers interested in the use of AD in the context of Bayesian sensitivity analysis are referred to [30].
Factor models have been widely used in many areas, including psychology, bioinformatics, economics, and finance, to model the dependence structure of high-dimensional data. Different specifications of the factor model have been widely discussed in the literature ([31,32]). We follow [32] and consider a variation of the factor model in which the analytic expression of the derivative of the log-likelihood is intractable. Let $y_t$ denote the $n \times 1$ vector of observations at time t, where $t = 1, \dots, T$, and let $f_t$ represent a $k \times 1$ column vector of latent factors. Then, the k-factor model with Student-t noise is specified as
$$y_t = \beta + A f_t + \epsilon_t,$$
where $\beta$ is the $n \times 1$ column vector of intercepts and A is the $n \times k$ loading matrix. The factors are assumed to be normally distributed, $f_t \sim N(0, \Omega)$, where $\Omega$ is $k \times k$, and they are independent of the innovations, which are multivariate-t distributed, $\epsilon_t \sim t_\nu(0, \Sigma)$, where $\Sigma$ is $n \times n$. For the purpose of identification, we require $n \ge 2k + 1$ and assume that A is lower triangular with diagonal entries all equal to 1 ([32]). This particular specification is commonly used in financial econometrics.
For maximum likelihood inference of such a model, the likelihood function can be maximised directly via Monte Carlo simulation. Let $Y = (y_1, y_2, \dots, y_T)$ be the observations and $\theta = [\beta, \mathrm{vech}(A), \mathrm{vech}(\Omega), \mathrm{vech}(\Sigma)]$ be the parameters. Under our setting, the likelihood function $g(y_t; \theta)$ does not have an analytical expression and needs to be evaluated using numerical methods. Specifically, we can write $f_t = \Omega^{1/2} Z_t$, where $Z_t \sim N(0, I_k)$. It therefore follows that
$$g(y_t; \theta) = E_Z\big[\,t_\nu(y_t \mid \beta + A\Omega^{1/2}Z_t,\; \Sigma)\,\big],$$
where $t_\nu(\cdot \mid \mu, \Sigma)$ denotes the probability density function of the multivariate-t distribution with location $\mu$, scale matrix $\Sigma$, and $\nu$ degrees of freedom. In the rest of this section, we focus on a case in which both $\Sigma$ and $\Omega$ are diagonal matrices: $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_n^2)$ and $\Omega = \mathrm{diag}(\omega_1^2, \dots, \omega_k^2)$.
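For illustration, a minimal sketch (ours, not the authors' implementation) of the Monte Carlo estimate of the log-likelihood, using S standard-normal draws for $Z_t$ and the multivariate-t density from the mvtnorm package; in the actual AD-based estimation, the draws and all intermediate quantities are carried as dual matrices.

# Simulated log-likelihood for the k-factor model with Student-t noise (illustrative sketch)
sim_loglik <- function(Y, beta, A, omega2, sigma2, nu, S = 200) {
  # Y: n x T data; omega2, sigma2: diagonal entries of Omega and Sigma
  k <- ncol(A)
  sum(apply(Y, 2, function(y_t) {
    Z    <- matrix(rnorm(k * S), k, S)                # reparametrised draws, f_t = Omega^{1/2} Z_t
    mu   <- c(beta) + A %*% (sqrt(omega2) * Z)        # n x S conditional means
    dens <- vapply(seq_len(S), function(s)
      mvtnorm::dmvt(y_t, delta = mu[, s], sigma = diag(sigma2), df = nu, log = FALSE),
      numeric(1))
    log(mean(dens))
  }))
}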
For the implementation of the simulated maximum likelihood approach, we use the AdaDelta variant of the stochastic gradient descent algorithm [33] for the estimation, with hyperparameters $(\gamma, \eta, \epsilon) = (0.9, 0.01, 10^{-8})$. We first considered a simulated dataset of 1000 observations, each with 10 measurements, where the dimension of the hidden factors was 3 and the entries in the factor loading matrix A were sampled from the standard normal distribution. The entries of the diagonal covariance matrices of the factors and the innovations were sampled from $U(1, 5)$ and $U(0.5, 1)$, respectively. The estimates converged at around 2000 iterations.
The first column of results in Table 10 reports the total run times with derivatives computed using AD and, for comparison, FD, for the simulated data example.
Under AD, the estimation took 4.36 h, compared with over 12 h when FD was used for the required derivative calculations, a reduction in run-time of almost two-thirds. This is confirmed by the remaining results in Table 10, which provide a more detailed run-time comparison between the two implementations. The improvement was consistent across both per-iteration and total run-times for the simulated data, and carried over to the total run-time for the real data.
The real data set contained currency exchange rates. The sample included 1045 observations of daily returns of nine international currency exchange rates relative to the United States dollar from January 2007 to December 2010. We applied a factor model to the log returns of the exchange rates (i.e., $y_{it} = 100 \log(p_{i,t}/p_{i,t-1})$, where $p_{it}$ denotes the daily closing spot rate for currency i at time t). The nine selected currencies were the Australian dollar (AUD), Canadian dollar (CAD), Euro (EUR), Japanese yen (JPY), Swiss franc (CHF), British pound (GBP), South Korean won (KRW), New Zealand dollar (NZD), and New Taiwan dollar (TWD), representing the most heavily traded currencies over the period. The estimates converged at around 1000 iterations. As reported in the second column of results in Table 10, we again observed a substantial reduction in estimation time, with the run time under the AD-based derivative computation being a third of that required by the FD-based computation.

4. Conclusions

This paper presents a vectorised formulation of AD grounded in matrix calculus, tailored for statistical applications that involve high-dimensional inputs and simulation-intensive computations. The proposed approach uses a compact set of matrix calculus rules to enable efficient and automatic derivative computation and introduces optimisation techniques, such as memoisation, sparse matrix representation, matrix chain multiplication, and implicit Kronecker product, to improve the efficiency of the AD implementation.
Compared to other AD approaches, our formulation aligns more naturally with the matrix-oriented notation commonly used in statistics and econometrics. It supports fully automatic workflows similar to source code transformation methods while also providing direct access to intermediate variables for inspection and analysis. Unlike imperative AD frameworks, it does not require users to adopt AD-specific programming idioms. In addition, the approach enables high-level optimisation by explicitly making use of matrix structure and the order of matrix multiplications, which are typically inaccessible in scalar-based or imperative implementations.
Despite its advantages, our formulation of AD introduces computational overhead, which can make it less efficient than FD for small-scale or low-dimensional problems. In such cases, the simplicity of FD often results in faster performance (but at the cost of lower accuracy). The benefits of our approach become more apparent in high-dimensional settings, where its scalability and accuracy outweigh the initial overhead, as we showed in the numerical study. A further limitation arises when computing second-order derivatives such as the Hessian. Under our vectorised approach, this requires dual numbers to carry second-order matrix derivatives, which can quickly exceed the memory capacity of typical personal computers. For a function mapping R m to R n , the Jacobian has dimension n × m , while the Hessian grows to n m × m . In contrast, FD computes these matrices entry by entry by perturbing the function input one coordinate at a time; although slower, this method avoids out-of-memory issues. Similar memory constraints may also occur when handling extremely large Jacobian matrices.
In view of the modern trend toward increasingly large and high-dimensional datasets, the performance overhead of AD in small-scale settings is becoming less of a practical concern. As data and models continue to grow in complexity, the scalability advantages of AD are expected to outweigh its initial costs in a broader range of applications. For the memory demands associated with higher-order derivatives, potential solutions include distributed computing, on-disk array storage, and blocked algorithms that process derivative computations in smaller, memory-efficient segments. These strategies offer promising avenues for extending the applicability of our vectorised AD approach to even larger and more demanding statistical problems.

Author Contributions

Conceptualisation, C.F.K. and D.Z.; methodology, C.F.K. and D.Z.; software, C.F.K.; validation, C.F.K. and D.Z.; formal analysis, C.F.K.; investigation, C.F.K.; resources, L.J. and D.Z.; data curation, C.F.K. and D.Z.; writing—original draft preparation, C.F.K.; writing—review and editing, C.F.K., L.J. and D.Z.; visualisation, C.F.K.; supervision, L.J. and D.Z.; project administration, L.J. and D.Z.; funding acquisition, L.J. and D.Z. All authors have read and agreed to the published version of the manuscript.

Funding

The authors acknowledge support from the Australian Research Council through funding from DP180102538 and FT170100124.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The exchange rate data used in this study is publicly available from the Federal Reserve Economic Data (FRED) at https://fred.stlouisfed.org and is freely accessible without restrictions.

Acknowledgments

We thank the anonymous reviewers for their valuable comments and constructive suggestions.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. Illustrative Examples

In this section, we present some statistical applications involving matrix operations that could benefit significantly from the use of AD.
Example A1.
Local sensitivity of the Seemingly Unrelated Regression (SUR) model [34]. Consider the SUR model,
$$y_{\mu} = X_{\mu}\beta_{\mu} + u_{\mu}, \quad \mu = 1, \dots, M,$$
  • where $y_{\mu}, u_{\mu} \in \mathbb{R}^{T}$, $X_{\mu} \in \mathbb{R}^{T \times l_{\mu}}$, $\beta_{\mu} \in \mathbb{R}^{l_{\mu}}$, and
  • $E(u_{\mu}) = 0$, $V(u_{\mu}) = \sigma_{\mu\mu} I$ and $E(u_{i} u_{j}^{T}) = \sigma_{ij} I$, where $I$ is the identity matrix.
In a more compact form,
$$\begin{pmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{M} \end{pmatrix} = \begin{pmatrix} X_{1} & 0 & \cdots & 0 \\ 0 & X_{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_{M} \end{pmatrix} \begin{pmatrix} \beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{M} \end{pmatrix} + \begin{pmatrix} u_{1} \\ u_{2} \\ \vdots \\ u_{M} \end{pmatrix},$$
and we write it as $y = X\beta + u$, where $V(u) = \Sigma_{c} \otimes I$ and $\Sigma_{c} = [\sigma_{ij}]_{i,j = 1, \dots, M}$. The Generalised Least Squares (GLS) estimator is given by $\hat{\beta} = \left(X^{T}(\Sigma_{c} \otimes I)^{-1} X\right)^{-1} X^{T} (\Sigma_{c} \otimes I)^{-1} y$. Given the matrix multiplications, inversions and Kronecker products involved, it is tedious to derive the analytical expression for the local sensitivity of $\hat{\beta}$ with respect to the noise parameters $\sigma_{ij}$, i.e., $d\hat{\beta}/d\Sigma_{c}$. In contrast, AD only requires implementing the original expression, and the derivative is then available “for free” (see Listing A5 in Appendix B for this example).
Example A2.
Local sensitivity of the Bayesian normal regression model. Consider the model $y \sim N(X\beta, V)$ with the normal prior $\beta \sim N(b, H^{-1})$ on the parameter, where $b, H$ are the hyperparameters. The posterior mean $\bar{b}$ of $\beta$ is given by
$$\bar{b} = \bar{H}^{-1}\left(Hb + X^{T}V^{-1}y\right), \quad \text{where } \bar{H} = H + X^{T}V^{-1}X. \qquad \text{(A1)}$$
The local sensitivity of the posterior mean is concerned with the effect of a small change in the hyperparameters $V^{-1}, b, H^{-1}$ and the data $X$ on the posterior mean $\bar{b}$.
Even in such a simple case, it is clear that applying AD directly to (A1) is less error-prone than deriving and implementing the analytic derivative by hand:
$$d\bar{b} = \left[(\bar{b} - b)^{T}H \otimes \bar{H}^{-1}H\right] D_{k}\,\mathrm{dv}(H^{-1}) + \bar{H}^{-1}H\,db + \left(\bar{H}^{-1} \otimes e^{T}V^{-1} - \bar{b}^{T} \otimes \bar{H}^{-1}X^{T}V^{-1}\right) d\,\mathrm{vec}\,X + \left(e^{T} \otimes \bar{H}^{-1}X^{T}\right) D_{n}\,\mathrm{dv}(V^{-1}),$$
where $d$ denotes the differential operator, $e = y - X\bar{b}$, $\mathrm{dv}(\cdot)$ is the half-vectorisation of a symmetric matrix, and $D_{n}, D_{k}$ are duplication matrices of appropriate dimensions.
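As a concrete illustration, the sensitivity of the posterior mean can be obtained directly from the ADtools interface used in Appendix B. The sketch below is ours (the function posterior_mean and the simulated inputs are not part of the paper), and only the derivative with respect to the prior mean b is requested; other hyperparameters can be targeted analogously through the wrt argument.
# remotes::install_github("kcf-jackson/ADtools")
library(ADtools)

posterior_mean <- function(b, H, V, X, y) {
  H_bar <- H + t(X) %*% solve(V) %*% X             # H-bar = H + X' V^{-1} X
  solve(H_bar, H %*% b + t(X) %*% solve(V) %*% y)  # b-bar = H-bar^{-1} (H b + X' V^{-1} y)
}

set.seed(123)
n <- 50; k <- 3
X <- matrix(rnorm(n * k), n, k)
V <- diag(n)
b <- matrix(0, k, 1)
H <- diag(k)
y <- X %*% rnorm(k) + rnorm(n)

# Jacobian of the posterior mean with respect to the prior mean b
auto_diff(posterior_mean, wrt = c("b"),
          at = list(b = b, H = H, V = V, X = X, y = y))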
Example A3.
(Mixed-regressive) Spatial-Auto-Regressive (SAR) model [35] (pp. 8, 16).
$$y = X\beta + \sum_{j=1}^{p}\lambda_{j}W_{j}y + \epsilon, \quad \epsilon \sim N(0, \sigma^{2}I_{m}),$$
where $y, \epsilon$ are $m \times 1$ vectors, $X$ is an $m \times k$ matrix, $\beta$ is a $k \times 1$ vector, $\{\lambda_{j}, j = 1, \dots, p\}$ are scalars, and $\{W_{j}, j = 1, \dots, p\}$ are $m \times m$ spatial weight matrices. The log-likelihood function $l(\lambda_{1}, \dots, \lambda_{p}, \beta, \sigma^{2})$ is given by
$$-\frac{m}{2}\ln 2\pi\sigma^{2} + \ln\left|I_{m} - \sum_{j=1}^{p}\lambda_{j}W_{j}\right| - \frac{1}{2\sigma^{2}}\left[\left(I_{m} - \sum_{j=1}^{p}\lambda_{j}W_{j}\right)y - X\beta\right]^{T}\left[\left(I_{m} - \sum_{j=1}^{p}\lambda_{j}W_{j}\right)y - X\beta\right],$$
and the derivative is needed to perform MLE using gradient-based methods. Besides removing the need to implement the derivative manually, AD typically involves less duplicated computation than implementing the log-likelihood and its derivative separately. A plain-R sketch of this log-likelihood is given below.
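The following is a minimal sketch (ours, not from the paper's listings) of the log-likelihood for a single spatial lag (p = 1); differentiating it with auto_diff assumes the AD system provides rules for the log-determinant.
sar_loglik <- function(lambda, beta, sigma2, y, X, W) {
  m <- length(y)
  S <- diag(m) - lambda * W                  # I_m - lambda * W
  e <- S %*% y - X %*% beta                  # (I_m - lambda * W) y - X beta
  -m / 2 * log(2 * pi * sigma2) +
    as.numeric(determinant(S, logarithm = TRUE)$modulus) -   # ln|I_m - lambda * W|
    sum(e^2) / (2 * sigma2)
}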
Example A4.
MLE with simultaneous equations [19] (p. 371). The simultaneous equations model is a generalisation of the multivariate linear regression model
$$y_{i}^{T} = x_{i}^{T}B_{0} + u_{i}^{T}, \quad i = 1, \dots, n,$$
where $y_{i}, u_{i} \in \mathbb{R}^{m}$ and $x_{i} \in \mathbb{R}^{k}$ are column vectors and $B_{0}$ is a $k \times m$ matrix, and it takes the form
$$y_{i}^{T}\Gamma_{0} + x_{i}^{T}B_{0} = u_{i}^{T}, \quad i = 1, \dots, n,$$
where $\Gamma_{0}$ is an $m \times m$ matrix. Assuming that $u_{i} \sim N(0, \Sigma_{0})$, $i = 1, \dots, n$, and that the $n \times k$ data matrix has full rank $k$, the log-likelihood consists of cross-product, determinant and trace operations as follows:
$$l(\theta) = -\frac{mn}{2}\log 2\pi + \frac{n}{2}\log\left|\Gamma_{0}^{T}\Gamma_{0}\right| - \frac{n}{2}\log\left|\Sigma_{0}\right| - \frac{1}{2}\mathrm{tr}\left[\Sigma_{0}^{-1}\left(Y\Gamma_{0} + XB_{0}\right)^{T}\left(Y\Gamma_{0} + XB_{0}\right)\right].$$
In the above, the parameters $\Gamma_{0}, B_{0}$ are collected into $\theta$, and $X, Y$ are $\{x_{i}^{T}\}_{i=1,\dots,n}$ and $\{y_{i}^{T}\}_{i=1,\dots,n}$ stacked by rows.
For the multilevel generalisation [36], where the observations are clustered into $l$ independent groups (of the same size $n_{l} = n/l$), the log-likelihood is given by
$$l(\theta) = -\frac{nm}{2}\ln(2\pi) - \frac{ml}{2}\ln\left|V_{0}\right| - \frac{n}{2}\ln\left|\Gamma_{0}^{-T}\Sigma_{0}\Gamma_{0}^{-1}\right| - \frac{1}{2}\sum_{j=1}^{l}\mathrm{tr}\left[V_{0}^{-1}\left(Y_{j} - X_{j}B_{0}\Gamma_{0}^{-1}\right)\Gamma_{0}\Sigma_{0}^{-1}\Gamma_{0}^{T}\left(Y_{j} - X_{j}B_{0}\Gamma_{0}^{-1}\right)^{T}\right].$$
Note that $U_{j}$ is $\{u_{i}^{T}\}_{i=1,\dots,n_{l}}$ stacked by rows, and it follows a matrix normal distribution $N_{n_{l},m}(0, V_{0}, \Sigma_{0})$.
In this example, AD offers an easy way to extend an existing model to incorporate structural assumptions, which often lead to more complicated derivative expressions, and it enables researchers to readily experiment with different ways of generalising the working model. A sketch of the single-level log-likelihood in R is given below.
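A minimal plain-R sketch (ours) of the single-level log-likelihood; a full implementation would extract Gamma0, B0 and Sigma0 from the parameter vector theta, and the function name is illustrative only.
sim_eq_loglik <- function(Gamma0, B0, Sigma0, Y, X) {
  n <- nrow(Y)
  m <- ncol(Y)
  E <- Y %*% Gamma0 + X %*% B0                   # residual matrix U
  log_det <- function(M) as.numeric(determinant(M, logarithm = TRUE)$modulus)
  -m * n / 2 * log(2 * pi) +
    n / 2 * log_det(t(Gamma0) %*% Gamma0) -
    n / 2 * log_det(Sigma0) -
    sum(diag(solve(Sigma0) %*% crossprod(E))) / 2  # trace term
}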
Example A5.
Infinite Gaussian mixture model. Consider the model $y \sim N(x\beta, r^{-1})$, $r^{-1} \sim \Gamma(k, \theta)$, where $x, y$ are the data and $r, k, \theta$ are the parameters. Let $f(y; \mu, \sigma^{2})$ be the normal density and $g(r; k, \theta)$ be the gamma density; the log-likelihood is given by
$$l(\beta, k, \theta) = \log \int_{0}^{\infty} f(y; x\beta, r^{-1})\, g(r^{-1}; k, \theta)\, dr^{-1} \approx \log \frac{1}{N}\sum_{i=1}^{N} f(y; x\beta, r_{i}^{-1}), \qquad r_{i}^{-1} \sim g(r^{-1}; k, \theta),$$
where the second expression is the Monte Carlo approximation. As the simulated log-likelihood depends on the parameters through the random sample, it is more convenient to use AD to compute the derivative, especially if one wants to explore different choices of the mixing distribution $g$. Moreover, AD is also practical when the mixture model is multi-level or when the marginalised parameters are high-dimensional.
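A minimal sketch (ours) of the simulated log-likelihood for a single observation y with row vector x; the gamma parameterisation (scale = theta) is an assumption, and propagating derivatives with respect to k and theta through the gamma draws would additionally require a differentiable (reparameterised) sampler, which is not shown.
sim_loglik <- function(beta, k, theta, x, y, N = 1000) {
  # draws of the variance r^{-1} ~ Gamma(k, theta); scale parameterisation assumed
  r_inv <- rgamma(N, shape = k, scale = theta)
  # Monte Carlo average of the normal density, then take the log
  log(mean(dnorm(y, mean = as.numeric(x %*% beta), sd = sqrt(r_inv))))
}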

Appendix B. Code Listings

Listing A1. Implementation of the subtraction and inverse matrix calculus rules in R.
`%minus%` <- function(A_dual, B_dual) {
  A <- A_dual$X;  dA <- A_dual$dX;
  B <- B_dual$X;  dB <- B_dual$dX;
  list(X = A - B, dX = dA - dB)
}

`%divide%` <- function(A_dual, B_dual) {
  A <- A_dual$X;  dA <- A_dual$dX;
  B <- B_dual$X;  dB <- B_dual$dX;

  B_inv <- solve(B)
  dB_inv <- -(t(B_inv) %x% B_inv) %*% dB   # d vec(B^{-1}) = -(B^{-T} (x) B^{-1}) d vec(B)

  B_inv_dual <- list(X = B_inv, dX = dB_inv)
  A_dual %times% B_inv_dual
}
        
Listing A2. An example using the simple AD system in Section 2.1.2.
f <- function(A, B) {
    A %*% (A %*% B + B %*% B) + B
}

# Derivative by Auto-Differentiation
df_AD <- function(A, B) {  # A and B are dual matrices
    A %times% ((A %times% B) %plus% (B %times% B)) %plus% B
}

# Derivative by Analytic Formula
df_AF <- function(A, B, dA, dB) {
    # optimisation by hand to avoid repeated computation
    I_n <- I(nrow(A))
    I_n2 <- I(nrow(A)^2)
    In_x_A <- I_n %x% A
    In_x_B <- I_n %x% B
    tB_x_In <- t(B) %x% I_n
    # the analytic formula
    (t(A %*% B + B %*% B) %x% I_n + (In_x_A) %*% tB_x_In) %*% dA +
        (In_x_A %*% In_x_A + In_x_A %*% (tB_x_In + In_x_B) + I_n2) %*% dB
}

## -------------------------------------------------------------
# Helper functions
I <- function(n) diag(n)   # identity matrix of size n (helper assumed by the listing)
zeros <- function(nr, nc) matrix(0, nrow = nr, ncol = nc)
dual <- function(X, dX) list(X = X, dX = dX)

# Main code
n <- 10
set.seed(123)
A <- matrix(rnorm(n^2), nrow = n, ncol = n)
B <- matrix(rnorm(n^2), nrow = n, ncol = n)
res <- f(A, B)

dA <- cbind(I(n^2), zeros(n^2, n^2))
dB <- cbind(zeros(n^2, n^2), I(n^2))
res_DF <- df_AF(A, B, dA, dB)              # Analytic approach
res_AD <- df_AD(dual(A, dA), dual(B, dB))  # AD approach
             
# Compare accuracy 
sum(abs(res_AD$X - res))        # 0 
sum(abs(res_AD$dX - res_DF))    # 5.016126e-13
Listing A3. An illustrative implementation of one-argument memoisation in R.
memoise <- function(f) {     # takes a function ‘f’ as input
  record <- list()           # attach a table to ‘f’ (using lexical scoping)
  hash <- as.character
  return(function(x) {       # returns a memoised ‘f’ as output
  result <- record[[hash(x)]]     # retrieves result
  if (is.null(result)) {          # if the result does not exist
    result <- f(x)                # then evaluate it and
    record[[hash(x)]] <<- result  # save it for future
  }
  return(result)
  })
}
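For instance, the memoised diagonal-matrix constructor benchmarked in Table 2 can be obtained by wrapping diag; the sketch below is ours, with mem_diag matching the name used in Table 2.
# A minimal usage sketch: memoising the diagonal-matrix constructor
mem_diag <- memoise(diag)
system.time(mem_diag(5000))   # first call: evaluates diag(5000) and caches the result
system.time(mem_diag(5000))   # second call: returns the cached matrix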
        
Listing A4. R code to compare the speed and accuracy of AD and FD.
# remotes::install_github("kcf-jackson/ADtools") 
library(ADtools)
             
# 1. Setup
set.seed(123)    # for reproducibility
X <- matrix(rnorm(10000), 100, 100)
Y <- matrix(rnorm(10000), 100, 100)
B <- matrix(rnorm(10000), 100, 100)

f <- function(B) { sum((Y - X %*% B)^2) }
# Deriving the analytic derivative by hand
df <- function(B) { -2 * t(X) %*% (Y - X %*% B) }
             
             
# 2. Speed comparison 
system.time({
  AD_res <- auto_diff(f, at = list(B = B))
})
# user  system elapsed 
# 0.387   0.054   0.445
             
system.time({
  FD_res <- finite_diff(f, at = list(B = B))
})
# user  system elapsed 
# 10.660   1.918  12.591
             
system.time({
  truth <- df(B)      # runs fastest when available
})
# user  system elapsed 
# 0.001   0.000   0.001
             
             
# 3. Accuracy comparison
AD_res <- as.vector(deriv_of(AD_res))
FD_res <- as.vector(FD_res)
truth  <- as.vector(truth)
             
max(abs(AD_res - truth))
# [1] 0 
max(abs(FD_res - truth))
# [1] 0.006982282
        
Listing A5. R code to illustrate that our vectorised formulation can produce derivatives automatically and seamlessly.
# Example 1: Seemingly Unrelated Regression
set.seed(123)
T0 <- 10
M <- 5
l <- 6
             
# Regression coefficients 
beta <- do.call(c, lapply(1:M, \(id) rnorm(l, mean = 0, sd = 2)))

# Predictors
Xs <- lapply(1:M, \(id) matrix(rnorm(T0 * l), nrow = T0, ncol = l))
X <- matrix(0, nrow = M * T0, ncol = M * l)   # block-diagonal design matrix
for (i in seq_along(Xs)) {
  X[1:T0 + (i-1) * T0, 1:l + (i-1) * l] <- Xs[[i]]
}
X

# Noise
Sigma_c <- crossprod(matrix(rnorm(M^2), nrow = M))
I <- diag(T0)
u <- mvtnorm::rmvnorm(1, mean = rep(0, T0 * M),
                      sigma = kronecker(Sigma_c, I))

# Observations
y <- X %*% beta + t(u)

# Estimator
estimator <- function(Sigma_c, I, X, y) {
  inv_mat <- solve(kronecker(Sigma_c, I))
  beta_est <- solve(t(X) %*% inv_mat %*% X, t(X) %*% inv_mat %*% y)
}
}
             
# remotes::install_github("kcf-jackson/ADtools") 
library(ADtools)
auto_diff(estimator,
          wrt = c("Sigma_c"),
          at = list(Sigma_c = Sigma_c, I = I, X = X, y = y))
       

References

  1. Gardner, J.; Pleiss, G.; Weinberger, K.Q.; Bindel, D.; Wilson, A.G. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. Adv. Neural Inf. Process. Syst. 2018, 31, 7587–7597. [Google Scholar]
  2. Abril-Pla, O.; Andreani, V.; Carroll, C.; Dong, L.; Fonnesbeck, C.J.; Kochurov, M.; Kumar, R.; Lao, J.; Luhmann, C.C.; Martin, O.A.; et al. PyMC: A modern, and comprehensive probabilistic programming framework in Python. PeerJ Comput. Sci. 2023, 9, e1516. [Google Scholar] [CrossRef] [PubMed]
  3. Joshi, M.; Yang, C. Algorithmic Hessians and the fast computation of cross-gamma risk. IIE Trans. 2011, 43, 878–892. [Google Scholar] [CrossRef]
  4. Allen, G.I.; Grosenick, L.; Taylor, J. A generalized least-square matrix decomposition. J. Am. Stat. Assoc. 2014, 109, 145–159. [Google Scholar] [CrossRef]
  5. Jacobi, L.; Joshi, M.S.; Zhu, D. Automated sensitivity analysis for Bayesian inference via Markov chain Monte Carlo: Applications to Gibbs sampling. SSRN 2018. [Google Scholar] [CrossRef]
  6. Ranganath, R.; Gerrish, S.; Blei, D. Black box variational inference. In Proceedings of the Artificial Intelligence and Statistics. PMLR, Reykjavik, Iceland, 22–25 April 2014; pp. 814–822. [Google Scholar]
  7. Revels, J.; Lubin, M.; Papamarkou, T. Forward-mode automatic differentiation in Julia. arXiv 2016, arXiv:1607.07892. [Google Scholar]
  8. Baydin, A.G.; Pearlmutter, B.A.; Radul, A.A.; Siskind, J.M. Automatic differentiation in machine learning: A survey. J. Mach. Learn. Res. 2018, 18, 1–43. [Google Scholar]
  9. Kucukelbir, A.; Tran, D.; Ranganath, R.; Gelman, A.; Blei, D.M. Automatic differentiation variational inference. J. Mach. Learn. Res. 2017, 18, 1–45. [Google Scholar]
  10. Chaudhuri, S.; Mondal, D.; Yin, T. Hamiltonian Monte Carlo sampling in Bayesian empirical likelihood computation. J. R. Stat. Soc. Ser. B Stat. Methodol. 2017, 79, 293–320. [Google Scholar] [CrossRef]
  11. Chan, J.C.; Jacobi, L.; Zhu, D. Efficient selection of hyperparameters in large Bayesian VARs using automatic differentiation. J. Forecast. 2020, 39, 934–943. [Google Scholar] [CrossRef]
  12. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in PyTorch. In Proceedings of the NIPS 2017 Autodiff Workshop, Long Beach, CA, USA, 9 December 2017. [Google Scholar]
  13. Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), Savannah, GA, USA, 2–4 November 2016; pp. 265–283. [Google Scholar]
  14. Kucukelbir, A.; Ranganath, R.; Gelman, A.; Blei, D. Automatic variational inference in Stan. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 568–576. [Google Scholar]
  15. Klein, W.; Griewank, A.; Walther, A. Differentiation methods for industrial strength problems. In Automatic Differentiation of Algorithms; Springer: New York, NY, USA, 2002; pp. 3–23. [Google Scholar]
  16. Griewank, A.; Walther, A. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation; Siam: Philadelphia, PA, USA, 2008; Volume 105. [Google Scholar]
  17. Griewank, A.; Juedes, D.; Utke, J. Algorithm 755: ADOL-C: A package for the automatic differentiation of algorithms written in C/C++. ACM Trans. Math. Softw. (TOMS) 1996, 22, 131–167. [Google Scholar] [CrossRef]
  18. Bischof, C.H.; Roh, L.; Mauer-Oats, A.J. ADIC: An extensible automatic differentiation tool for ANSI-C. Softw. Pract. Exp. 1997, 27, 1427–1456. [Google Scholar] [CrossRef]
  19. Magnus, J.R.; Neudecker, H. Matrix Differential Calculus with Applications in Statistics and Econometrics; Wiley: Hoboken, NJ, USA, 1999. [Google Scholar]
  20. Glasserman, P. Monte Carlo Methods in Financial Engineering; Springer Science & Business Media: New York, NY, USA, 2003; Volume 53. [Google Scholar]
  21. Intel. Matrix Inversion: LAPACK Computational Routines. 2020. Available online: https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2023-0/matrix-inversion-lapack-computational-routines.html (accessed on 15 March 2025).
  22. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the International Conference on Learning Representations, Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
  23. Rosenblatt, M. Remarks on a multivariate transformation. Ann. Math. Stat. 1952, 23, 470–472. [Google Scholar] [CrossRef]
  24. Kwok, C.F.; Zhu, D.; Jacobi, L. ADtools: Automatic Differentiation Toolbox, R Package Version 0.5.4, CRAN Repository. 2020. Available online: https://cran.r-project.org/src/contrib/Archive/ADtools/ (accessed on 15 March 2025).
  25. Kwok, C.F.; Zhu, D.; Jacobi, L. ADtools: Automatic Differentiation Toolbox. GitHub Repository. 2020. Available online: https://github.com/kcf-jackson/ADtools (accessed on 15 March 2025).
  26. Abelson, H.; Sussman, G.J.; Sussman, J. Structure and Interpretation of Computer Programs; MIT Press: Cambridge, MA, USA, 1996. [Google Scholar]
  27. Lütkepohl, H. Handbook of Matrices; Wiley Chichester: Chichester, UK, 1996; Volume 1. [Google Scholar]
  28. Hu, T.; Shing, M. Computation of matrix chain products. Part II. SIAM J. Comput. 1984, 13, 228–251. [Google Scholar] [CrossRef]
  29. Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.; Stein, C. Introduction to Algorithms; MIT Press: Cambridge, MA, USA, 2009. [Google Scholar]
  30. Chan, J.C.; Jacobi, L.; Zhu, D. An automated prior robustness analysis in Bayesian model comparison. J. Appl. Econom. 2019, 37, 583–602. [Google Scholar] [CrossRef]
  31. Brennan, M.J.; Chordia, T.; Subrahmanyam, A. Alternative factor specifications, security characteristics, and the cross-section of expected stock returns. J. Financ. Econ. 1998, 49, 345–373. [Google Scholar] [CrossRef]
  32. Geweke, J.; Zhou, G. Measuring the pricing error of the arbitrage pricing theory. Rev. Financ. Stud. 1996, 9, 557–587. [Google Scholar] [CrossRef]
  33. Zeiler, M.D. Adadelta: An adaptive learning rate method. arXiv 2012, arXiv:1212.5701. [Google Scholar]
  34. Zellner, A. An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. J. Am. Stat. Assoc. 1962, 57, 348–368. [Google Scholar] [CrossRef]
  35. LeSage, J.; Pace, R.K. Introduction to Spatial Econometrics; CRC Press: Boca Raton, FL, USA, 2009. [Google Scholar]
  36. Hernández-Sanjaime, R.; González, M.; López-Espín, J.J. Multilevel simultaneous equation model: A novel specification and estimation approach. J. Comput. Appl. Math. 2020, 366, 112378. [Google Scholar] [CrossRef]
Table 1. Comparison of the number of operations (in terms of the leading order) needed in finite differencing and AD.

Operations          Central Differencing   AD
Addition            8n^4                   2n^4
Subtraction         8n^4                   2n^4
Multiplication      8n^5                   8n^5
Inversion           4n^5                   8n^5
Kronecker product   8n^6                   6n^6
Table 2. Speed comparison of the diagonal matrix function with and without memoisation. mem_diag is the memoised version. The best results in each column are highlighted in bold.

Function         First Time   Second Time   Average over 100 Executions
diag(5000)       97.87 ms     127.7 ms      117.3 ms
mem_diag(5000)   101.8 ms     0.132 ms      0.959 ms
Table 3. Speed comparison of the diagonal matrix function with dense and sparse representations. Time is averaged over 100 executions. The best results in each column are highlighted in bold.

Tasks                   n = 1000   n = 2000   n = 3000   n = 4000   n = 5000
Creation—Dense          2.436 ms   17.39 ms   50.23 ms   63.86 ms   303.7 ms
Creation—Sparse         0.080 ms   0.091 ms   0.110 ms   0.077 ms   0.089 ms
Multiplication—Dense    19.40 ms   111.6 ms   404.4 ms   1.041 s    1.576 s
Multiplication—Sparse   17.42 ms   66.50 ms   236.6 ms   432.3 ms   668.2 ms
Table 4. Speed comparison of the commutation matrix function with dense and sparse representations. Time is averaged over 100 executions. The best results in each column are highlighted in bold.

Tasks                   n = q = 10   n = q = 20   n = q = 30   n = q = 40
Creation—Dense          10.77 ms     708.4 ms     18.09 s      120.4 s
Creation—Sparse         0.646 ms     0.667 ms     1.553 ms     1.595 ms
Multiplication—Dense    0.727 ms     41.54 ms     477.3 ms     2.821 s
Multiplication—Sparse   0.189 ms     2.388 ms     11.48 ms     71.69 ms
Table 5. Speed comparison of the elimination matrix function with dense and sparse representations. Time is averaged over 100 executions. The best results in each column are highlighted in bold.

Tasks                   n = 10     n = 20     n = 30     n = 40     n = 50
Creation—Dense          4.981 ms   113.1 ms   939.9 ms   4.735 s    20.97 s
Creation—Sparse         0.823 ms   0.816 ms   0.823 ms   0.816 ms   0.891 ms
Multiplication—Dense    0.412 ms   21.98 ms   237.0 ms   1.490 s    5.715 s
Multiplication—Sparse   0.115 ms   1.723 ms   6.720 ms   22.45 ms   49.16 ms
Table 6. Comparing multiplications of a chain of matrices in the naive and optimal orders. Figures represent the mean and standard deviation (in brackets) of the speed-up over 1000 simulations.

Length of Chain                   2              3              4              5
Speed-up, t_naive / t_optimal     1.04 (0.322)   1.47 (0.564)   1.64 (0.688)   1.56 (0.637)
Table 7. Speed-up achieved by evaluating (B ⊗ A)Z and X(B ⊗ A) without explicitly calculating the Kronecker product. The speed-up is computed using t_explicit / t_implicit. The number of simulations is 1000.

Tasks / Percentiles   0%     25%     50%     75%     100%    Mean
(B ⊗ A)Z              0.42   9.43    14.02   19.99   57.48   15.82
X(B ⊗ A)              2.65   10.82   15.44   21.34   54.30   16.81
Table 8. Speed-up achieved by evaluating (B ⊗ I)D, (I ⊗ C)D, A(B ⊗ I) and A(I ⊗ C) without explicitly computing the Kronecker product. The speed-up is computed using t_naive / t_optimised. The number of simulations is 1000.

Tasks / Percentiles   0%     25%    50%    75%     100%    Mean
(B ⊗ I)D              1.51   5.02   7.79   12.23   51.33   9.48
(I ⊗ C)D              1.95   5.06   7.83   11.62   79.33   9.48
A(B ⊗ I)              2.3    5.88   8.54   12.9    68.36   10.25
A(I ⊗ C)              2.22   5.13   8.25   12.71   61.62   9.98
Table 9. Benchmarking of AD against central FD in terms of basic arithmetic operations. Faster times are in bold.

Estimation time (in ms) for standard tasks under AD and FD by matrix size (entries are AD / FD):

Tasks            n = 10          n = 20           n = 30           n = 40             n = 50
Addition         4.61 / 22.98    5.64 / 138.17    16.15 / 435.70   73.76 / 1136.07    146.96 / 2746.58
Subtraction      3.92 / 21.91    5.42 / 107.32    19.33 / 427.66   77.41 / 1236.07    167.06 / 2970.38
Multiplication   13.60 / 23.40   20.48 / 141.94   38.36 / 580.23   139.87 / 1732.45   245.83 / 4251.67
Inverse          2.46 / 20.94    8.09 / 93.98     29.20 / 334.35   156.82 / 945.50    444.41 / 1839.47

Tasks            n = 5           n = 10           n = 15           n = 20             n = 25
Kronecker        15.1 / 9.3      54.2 / 181.7     351.9 / 1756.4   1478.6 / 7492.1    5530.5 / 31,950.7
Table 10. Run-time comparison between SMLE analysis of the factor model using either AD or (central) FD in the stochastic gradient computations. The per-iteration summaries are based on 100 evaluations under the simulated data example. The best performance in each column is highlighted in bold.

Run-time comparison of simulated MLE:

         Total Run-Time               Per-Iteration Run-Time (Simulated Data)
Method   Simulated Data   Real Data   min       lq        Mean      Median    uq        max
AD       4.36 h           2.08 h      7.23 s    7.41 s    7.52 s    7.47 s    7.56 s    8.37 s
FD       12.04 h          6.10 h      20.35 s   20.51 s   20.76 s   20.60 s   20.69 s   26.82 s
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
