An Analysis of Vectorised Automatic Differentiation for Statistical Applications
Abstract
1. Introduction
2. Materials and Methods
2.1. AD via Vectorisation
2.1.1. From Vector Calculus to Matrix Calculus via Vectorisation
2.1.2. Dual Construction
Listing 1. Implementation of the sum and product matrix calculus rules in R.

```r
I <- diag  # function to create identity/diagonal matrices, so I(n) is I_n

`%plus%` <- function(A_dual, B_dual) {
  A <- A_dual$X; dA <- A_dual$dX
  B <- B_dual$X; dB <- B_dual$dX
  list(X = A + B, dX = dA + dB)
}

`%times%` <- function(A_dual, B_dual) {
  A <- A_dual$X; dA <- A_dual$dX
  B <- B_dual$X; dB <- B_dual$dX
  list(
    X = A %*% B,
    dX = (t(B) %x% I(nrow(A))) %*% dA + (I(ncol(B)) %x% A) %*% dB
  )
}
```
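For concreteness, a minimal usage sketch of these rules (the `dual` helper and the example are illustrative, not part of the listing): the dual component of the differentiated argument is seeded with an identity matrix, since $d\,\mathrm{vec}(A)/d\,\mathrm{vec}(A) = I$, while constants are seeded with zeros. The `I` alias comes from Listing 1.

```r
dual <- function(X, dX) list(X = X, dX = dX)

set.seed(1)
A <- matrix(rnorm(4), 2, 2)
B <- matrix(rnorm(4), 2, 2)
A_dual <- dual(A, I(4))             # seed: d vec(A) / d vec(A) = I_4
B_dual <- dual(B, matrix(0, 4, 4))  # B is treated as a constant

res <- (A_dual %times% B_dual) %plus% A_dual
res$X   # equals A %*% B + A
res$dX  # 4 x 4 Jacobian: t(B) %x% diag(2) + diag(4)
```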
2.1.3. A Layered Approach to Construction
2.1.4. Notation
- $I_n$ is the $n \times n$ identity matrix.
- $I_{m \times n}$ is the $m \times n$ matrix where the entries on the diagonal are all ones, and the entries off the diagonal are all zeros.
- $K_{mn}$ is the commutation matrix, i.e., $K_{mn}\,\mathrm{vec}(A) = \mathrm{vec}(A^\top)$ for any $m \times n$ matrix $A$. We also define $K_n := K_{nn}$.
- $L_n$ is the elimination matrix, i.e., $L_n\,\mathrm{vec}(A) = \mathrm{vech}(A)$ for any $n \times n$ matrix $A$.
- $\mathbf{1}_{m \times n}$ is the $m \times n$ matrix of ones.
- the $(i,j)$-entry of $A$ is denoted by $A_{ij}$ or $A[i,j]$,
- the $i$-th row of $A$ is denoted by $A_{i \cdot}$ or $A[i,]$,
- the $j$-th column of $A$ is denoted by $A_{\cdot j}$ or $A[,j]$.
2.1.5. Matrix Arithmetic
- Addition: Let $A$ and $B$ be $m \times n$ matrices, then $d\,\mathrm{vec}(A + B) = d\,\mathrm{vec}(A) + d\,\mathrm{vec}(B)$.
- Subtraction: Let $A$ and $B$ be $m \times n$ matrices, then $d\,\mathrm{vec}(A - B) = d\,\mathrm{vec}(A) - d\,\mathrm{vec}(B)$.
- Product: Let $A$ and $B$ be $m \times n$ and $n \times q$ matrices, then $d\,\mathrm{vec}(AB) = (B^\top \otimes I_m)\, d\,\mathrm{vec}(A) + (I_q \otimes A)\, d\,\mathrm{vec}(B)$.
- Inverse: Let $A$ be an invertible $n \times n$ matrix, then $d\,\mathrm{vec}(A^{-1}) = -\left((A^{-1})^\top \otimes A^{-1}\right) d\,\mathrm{vec}(A)$.
- Kronecker-product: Let $A$ and $B$ be $m \times n$ and $p \times q$ matrices, then $d\,\mathrm{vec}(A \otimes B) = (I_n \otimes K_{qm} \otimes I_p)\left[(I_{mn} \otimes \mathrm{vec}(B))\, d\,\mathrm{vec}(A) + (\mathrm{vec}(A) \otimes I_{pq})\, d\,\mathrm{vec}(B)\right]$ (see the sketch after this list).
- Transpose: Let $A$ be an $m \times n$ matrix, then $d\,\mathrm{vec}(A^\top) = K_{mn}\, d\,\mathrm{vec}(A)$.
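Of the rules above, only the Kronecker product requires the commutation matrix explicitly. A minimal sketch in the style of Listing 1, assuming dual matrices are lists with components `X` and `dX` (the `commutation` and `%kronecker%` helpers are ours, not the paper's code):

```r
commutation <- function(m, n) {
  # K_mn satisfies K_mn %*% vec(A) == vec(t(A)) for any m x n matrix A
  K <- matrix(0, m * n, m * n)
  for (i in 1:m) for (j in 1:n) K[(i - 1) * n + j, (j - 1) * m + i] <- 1
  K
}

`%kronecker%` <- function(A_dual, B_dual) {
  A <- A_dual$X; dA <- A_dual$dX
  B <- B_dual$X; dB <- B_dual$dX
  m <- nrow(A); n <- ncol(A); p <- nrow(B); q <- ncol(B)
  perm <- diag(n) %x% commutation(q, m) %x% diag(p)  # I_n %x% K_qm %x% I_p
  list(X = A %x% B,
       dX = perm %*% ((diag(m * n) %x% c(B)) %*% dA +
                      (c(A) %x% diag(p * q)) %*% dB))
}
```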
2.1.6. Element-Wise Arithmetic
- Hadamard product: $d\,\mathrm{vec}(A \odot B) = \mathrm{Diag}(\mathrm{vec}(B))\, d\,\mathrm{vec}(A) + \mathrm{Diag}(\mathrm{vec}(A))\, d\,\mathrm{vec}(B)$ (a sketch follows this list).
- Hadamard division: $d\,\mathrm{vec}(A \oslash B) = \mathrm{Diag}(\mathrm{vec}(B))^{-1}\, d\,\mathrm{vec}(A) - \mathrm{Diag}\!\left(\mathrm{vec}(A \oslash (B \odot B))\right) d\,\mathrm{vec}(B)$.
- Univariate differentiable function $f$ applied element-wise: $d\,\mathrm{vec}(f(A)) = \mathrm{Diag}\!\left(\mathrm{vec}(f'(A))\right) d\,\mathrm{vec}(A)$.
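These rules all reduce to diagonal scalings of the dual component, so they are cheap to implement. A minimal sketch in the style of Listing 1 (`%hadamard%` and `lift` are illustrative names):

```r
`%hadamard%` <- function(A_dual, B_dual) {
  A <- A_dual$X; dA <- A_dual$dX
  B <- B_dual$X; dB <- B_dual$dX
  # d vec(A * B) = Diag(vec(B)) dA + Diag(vec(A)) dB
  list(X = A * B,
       dX = diag(c(B), nrow = length(B)) %*% dA +
            diag(c(A), nrow = length(A)) %*% dB)
}

# Lift a univariate f (with derivative fprime) to dual matrices:
lift <- function(f, fprime) {
  function(A_dual) {
    A <- A_dual$X
    # d vec(f(A)) = Diag(vec(f'(A))) d vec(A)
    list(X = f(A),
         dX = diag(c(fprime(A)), nrow = length(A)) %*% A_dual$dX)
  }
}
exp_dual <- lift(exp, exp)  # element-wise exponential on dual matrices
```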
2.1.7. Scalar-Matrix Arithmetic
2.1.8. Structural Transformation
- Transpose: Let $A$ be an $m \times n$ matrix, then $d\,\mathrm{vec}(A^\top) = K_{mn}\, d\,\mathrm{vec}(A)$.
- Row binding: Let $A$ and $B$ be $m \times n$ and $p \times n$ matrices, then $d\,\mathrm{vec}\!\begin{pmatrix} A \\ B \end{pmatrix} = K_{n,\,m+p} \begin{pmatrix} K_{mn}\, d\,\mathrm{vec}(A) \\ K_{pn}\, d\,\mathrm{vec}(B) \end{pmatrix}$, i.e., row binding is a fixed permutation of the stacked dual components.
- Column binding: Let $A$ and $B$ be $m \times n$ and $m \times q$ matrices, then $d\,\mathrm{vec}\!\left[A \;\; B\right] = \begin{pmatrix} d\,\mathrm{vec}(A) \\ d\,\mathrm{vec}(B) \end{pmatrix}$.
- Subsetting: Let $A$ be an $m \times n$ matrix.
  1. Index extraction: $dA_{ij} = (e_j \otimes e_i)^\top\, d\,\mathrm{vec}(A)$ for fixed $(i,j)$, where $e_i \in \mathbb{R}^m$ and $e_j \in \mathbb{R}^n$ are standard basis vectors.
  2. Row extraction: $d\,(A_{i \cdot})^\top = (I_n \otimes e_i^\top)\, d\,\mathrm{vec}(A)$ for fixed $i$, where $e_i \in \mathbb{R}^m$.
  3. Column extraction: $d\,A_{\cdot j} = (e_j^\top \otimes I_m)\, d\,\mathrm{vec}(A)$ for fixed $j$, where $e_j \in \mathbb{R}^n$.
  4. Diagonal extraction: $d\,\mathrm{diag}(A) = S\, d\,\mathrm{vec}(A)$ (column vector), where $S$ is the $n \times n^2$ selection matrix whose $i$-th row is $(e_i \otimes e_i)^\top$, for square $A$.
- Vectorisation: Let $A$ be an $m \times n$ matrix, then $d\,\mathrm{vec}(\mathrm{vec}(A)) = d\,\mathrm{vec}(A)$.
- Half-vectorisation: Let $A$ be an $n \times n$ matrix, then $d\,\mathrm{vech}(A) = L_n\, d\,\mathrm{vec}(A)$. Note that the selection follows the column-major order of $A$, i.e., $\mathrm{vech}(A)$ keeps the on- and below-diagonal entries of $\mathrm{vec}(A)$ in their original order.
- Diagonal expansion: Let $v$ be a vector of length $n$, then $\mathrm{Diag}(v)$ is defined to be the $n \times n$ matrix with $v$ on the diagonal. If $D = \mathrm{Diag}(v)$, then $d\,\mathrm{vec}(D) = S^\top\, dv$ for the same selection matrix $S$ as in diagonal extraction (a sketch of the transpose and diagonal-extraction rules follows this list).
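A minimal sketch of the transpose and diagonal-extraction rules above, reusing the `commutation` helper from the Kronecker sketch (the function names are ours):

```r
transpose_dual <- function(A_dual) {
  A <- A_dual$X
  # d vec(t(A)) = K_mn d vec(A)
  list(X = t(A), dX = commutation(nrow(A), ncol(A)) %*% A_dual$dX)
}

diag_extract_dual <- function(A_dual) {
  A <- A_dual$X
  n <- nrow(A)
  S <- matrix(0, n, n^2)
  for (i in 1:n) S[i, (i - 1) * n + i] <- 1  # row i picks A[i, i] out of vec(A)
  list(X = diag(A), dX = S %*% A_dual$dX)
}
```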
2.1.9. Operations on Matrices
- Cholesky Decomposition: Let $A$ be an $n \times n$ positive-definite matrix and $A = LL^\top$ be the Cholesky decomposition with $L$ lower triangular, then $d\,\mathrm{vech}(L) = \left[L_n (I_{n^2} + K_n)(L \otimes I_n) L_n^\top\right]^{-1} L_n\, d\,\mathrm{vec}(A)$.
- Column-sum: $d\,\mathrm{colSums}(A) = (I_n \otimes \mathbf{1}_m^\top)\, d\,\mathrm{vec}(A)$ for an $m \times n$ matrix $A$.
- Row-sum: $d\,\mathrm{rowSums}(A) = (\mathbf{1}_n^\top \otimes I_m)\, d\,\mathrm{vec}(A)$.
- Sum: $d\,\mathrm{sum}(A) = \mathbf{1}_{mn}^\top\, d\,\mathrm{vec}(A)$.
- Cross-product: $d\,\mathrm{vec}(A^\top A) = (I_{n^2} + K_n)(I_n \otimes A^\top)\, d\,\mathrm{vec}(A)$.
- Transpose of cross-product: $d\,\mathrm{vec}(AA^\top) = (I_{m^2} + K_m)(A \otimes I_m)\, d\,\mathrm{vec}(A)$. Alternatively, both ‘crossprod’ and ‘tcrossprod’ can be implemented directly as is, since they are composed of the multiplication and transpose operations defined previously.
- Determinant: $d\,\det(A) = \det(A)\,\mathrm{vec}\!\left((A^{-1})^\top\right)^{\top} d\,\mathrm{vec}(A)$.
- Trace: $d\,\mathrm{tr}(A) = \mathrm{vec}(I_n)^\top\, d\,\mathrm{vec}(A)$. Alternatively, it can be implemented by composing sum and diagonal extraction defined previously (a sketch of the determinant and trace rules follows this list).
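A minimal sketch of the two scalar-valued rules at the end of this list (the function names are ours):

```r
det_dual <- function(A_dual) {
  A <- A_dual$X
  # d det(A) = det(A) * vec(t(A^-1))' d vec(A)
  list(X = det(A),
       dX = det(A) * t(c(t(solve(A)))) %*% A_dual$dX)
}

trace_dual <- function(A_dual) {
  A <- A_dual$X
  # d tr(A) = vec(I_n)' d vec(A)
  list(X = sum(diag(A)), dX = t(c(diag(nrow(A)))) %*% A_dual$dX)
}
```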
2.1.10. Random Variables
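One standard way to propagate dual numbers through a Gaussian draw is the reparameterisation device (we sketch it here as an assumption about the mechanism, in the spirit of the Kingma and Welling reference): a draw $X \sim N(\mu, LL^\top)$ is rewritten as $X = \mu + L\varepsilon$ with $\varepsilon \sim N(0, I)$ drawn once and held fixed, so $X$ becomes a deterministic, differentiable function of $(\mu, L)$ and the rules of Listing 1 apply directly. `rnorm_dual` is an illustrative name:

```r
rnorm_dual <- function(mu_dual, L_dual) {
  n <- nrow(L_dual$X)
  eps <- matrix(rnorm(n), n, 1)  # fixed noise with zero derivative
  eps_dual <- list(X = eps, dX = matrix(0, n, ncol(mu_dual$dX)))
  (L_dual %times% eps_dual) %plus% mu_dual  # X = mu + L eps
}
```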
2.2. Optimising AD Implementation
2.2.1. Memoisation
2.2.2. Sparse Matrix Representation
2.2.3. The Diagonal Matrix
2.2.4. The Commutation Matrix
2.2.5. The Elimination Matrix
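The diagonal, commutation, and elimination matrices of Sections 2.2.3–2.2.5 are 0–1 matrices with at most one nonzero entry per row, so a sparse representation is natural. A sketch using the Matrix package (an assumption on our part; the paper's own sparse representation may differ):

```r
library(Matrix)

sparse_commutation <- function(m, n) {
  # Permutation sending vec(A) to vec(t(A)): exactly one nonzero per row.
  cols <- as.vector(t(matrix(1:(m * n), m, n)))
  sparseMatrix(i = 1:(m * n), j = cols, x = 1)
}

K <- sparse_commutation(3, 2)
A <- matrix(1:6, 3, 2)
all(K %*% c(A) == c(t(A)))  # TRUE: K_mn vec(A) = vec(t(A))
```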
2.2.6. Matrix Chain Multiplication
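Choosing the cheapest parenthesisation of a product $A_1 A_2 \cdots A_k$ is the classic matrix-chain dynamic programme (Hu and Shing; Cormen et al.). A minimal sketch of the cost recursion (`chain_cost` is our illustrative name):

```r
chain_cost <- function(dims) {
  # dims has length k + 1; the i-th matrix in the chain is dims[i] x dims[i+1]
  k <- length(dims) - 1
  cost <- matrix(0, k, k)
  if (k > 1) for (len in 2:k) {
    for (i in 1:(k - len + 1)) {
      j <- i + len - 1
      cost[i, j] <- min(sapply(i:(j - 1), function(s) {
        cost[i, s] + cost[s + 1, j] + dims[i] * dims[s + 1] * dims[j + 1]
      }))
    }
  }
  cost[1, k]  # minimal number of scalar multiplications
}
chain_cost(c(10, 100, 5, 50))  # 7500: (A1 A2) A3 beats A1 (A2 A3) at 75,000
```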
2.2.7. Kronecker Products
2.2.8. Kronecker Product: More Special Cases
- Similarly, $(I_q \otimes A)\,\mathrm{vec}(X) = \mathrm{vec}(AX)$ and $(B^\top \otimes I_n)\,\mathrm{vec}(X) = \mathrm{vec}(XB)$, so products of the form $(B^\top \otimes A)\,\mathrm{vec}(X) = \mathrm{vec}(AXB)$ can be evaluated without explicitly forming the Kronecker product (see the sketch below).
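A sketch of this identity in use (our own illustration; `kron_mult` is a hypothetical helper name): the reshape-multiply-flatten route avoids materialising the $qm \times pn$ Kronecker factor altogether.

```r
kron_mult <- function(A, B, vecX) {
  # Computes (t(B) %x% A) %*% vecX as vec(A %*% X %*% B) without forming t(B) %x% A
  X <- matrix(vecX, nrow = ncol(A))
  c(A %*% X %*% B)
}

set.seed(1)
A <- matrix(rnorm(6), 2, 3)
B <- matrix(rnorm(12), 4, 3)
X <- matrix(rnorm(12), 3, 4)
max(abs((t(B) %x% A) %*% c(X) - kron_mult(A, B, c(X))))  # ~ 0
```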
3. Results
3.1. Basic Operations
3.2. Dynamic Factor Model Inference
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Illustrative Examples
- where , and
- and , I is the identity matrix.
Appendix B. Code Listings
Listing A1. Implementation of the subtraction and inverse matrix calculus rules in R.

```r
`%minus%` <- function(A_dual, B_dual) {
  A <- A_dual$X; dA <- A_dual$dX
  B <- B_dual$X; dB <- B_dual$dX
  list(X = A - B, dX = dA - dB)
}

`%divide%` <- function(A_dual, B_dual) {
  # A %*% solve(B), via d vec(B^-1) = -(t(B^-1) %x% B^-1) d vec(B)
  A <- A_dual$X; dA <- A_dual$dX
  B <- B_dual$X; dB <- B_dual$dX
  B_inv <- solve(B)
  dB_inv <- -(t(B_inv) %x% B_inv) %*% dB
  B_inv_dual <- list(X = B_inv, dX = dB_inv)
  A_dual %times% B_inv_dual
}
```
Listing A2. An example using the simple AD system in Section 2.1.2.

```r
f <- function(A, B) {
  A %*% (A %*% B + B %*% B) + B
}

# Derivative by Auto-Differentiation
df_AD <- function(A, B) {  # A and B are dual matrices
  A %times% ((A %times% B) %plus% (B %times% B)) %plus% B
}

# Derivative by Analytic Formula
df_AF <- function(A, B, dA, dB) {
  # optimisation by hand to avoid repeated computation
  I_n <- I(nrow(A))
  I_n2 <- I(nrow(A)^2)
  In_x_A <- I_n %x% A
  In_x_B <- I_n %x% B
  tB_x_In <- t(B) %x% I_n
  # the analytic formula
  (t(A %*% B + B %*% B) %x% I_n + In_x_A %*% tB_x_In) %*% dA +
    (In_x_A %*% In_x_A + In_x_A %*% (tB_x_In + In_x_B) + I_n2) %*% dB
}

## -------------------------------------------------------------
# Helper functions
zeros <- function(nr, nc) matrix(0, nrow = nr, ncol = nc)
dual <- function(X, dX) list(X = X, dX = dX)

# Main code
n <- 10
set.seed(123)
A <- matrix(rnorm(n^2), nrow = n, ncol = n)
B <- matrix(rnorm(n^2), nrow = n, ncol = n)

res <- f(A, B)
dA <- cbind(I(n^2), zeros(n^2, n^2))
dB <- cbind(zeros(n^2, n^2), I(n^2))
res_DF <- df_AF(A, B, dA, dB)              # Analytic approach
res_AD <- df_AD(dual(A, dA), dual(B, dB))  # AD approach

# Compare accuracy
sum(abs(res_AD$X - res))      # 0
sum(abs(res_AD$dX - res_DF))  # 5.016126e-13
```
Listing A3. An illustrative implementation of one-argument memoisation in R.

```r
memoise <- function(f) {           # takes a function 'f' as input
  record <- list()                 # attach a table to 'f' (using lexical scoping)
  hash <- as.character
  return(function(x) {             # returns a memoised 'f' as output
    result <- record[[hash(x)]]    # retrieve the result
    if (is.null(result)) {         # if the result does not exist,
      result <- f(x)               # then evaluate it and
      record[[hash(x)]] <<- result # save it for the future
    }
    return(result)
  })
}
```
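A usage sketch (ours) matching the `mem_diag` timings reported in the tables:

```r
mem_diag <- memoise(diag)
mem_diag(5000)  # first call: evaluates diag(5000) and caches the result
mem_diag(5000)  # second call: served from the cache, orders of magnitude faster
```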
Listing A4. R code to compare the speed and accuracy of AD and FD.

```r
# remotes::install_github("kcf-jackson/ADtools")
library(ADtools)

# 1. Setup
set.seed(123)  # for reproducibility
X <- matrix(rnorm(10000), 100, 100)
Y <- matrix(rnorm(10000), 100, 100)
B <- matrix(rnorm(10000), 100, 100)
f <- function(B) { sum((Y - X %*% B)^2) }

# Deriving analytic derivative by hand
df <- function(B) { -2 * t(X) %*% (Y - X %*% B) }

# 2. Speed comparison
system.time({ AD_res <- auto_diff(f, at = list(B = B)) })
#   user  system elapsed
#  0.387   0.054   0.445
system.time({ FD_res <- finite_diff(f, at = list(B = B)) })
#   user  system elapsed
# 10.660   1.918  12.591
system.time({ truth <- df(B) })  # runs fastest when available
#   user  system elapsed
#  0.001   0.000   0.001

# 3. Accuracy comparison
AD_res <- as.vector(deriv_of(AD_res))
FD_res <- as.vector(FD_res)
truth <- as.vector(truth)
max(abs(AD_res - truth))  # [1] 0
max(abs(FD_res - truth))  # [1] 0.006982282
```
Listing A5. R code to illustrate that our vectorised formulation produces derivatives automatically and seamlessly.

```r
# Example 1: Seemingly Unrelated Regression
set.seed(123)
T0 <- 10
M <- 5
l <- 6

# Regression coefficients
beta <- do.call(c, lapply(1:M, \(id) rnorm(l, mean = 0, sd = 2)))

# Predictors (block-diagonal design matrix)
Xs <- lapply(1:M, \(id) matrix(rnorm(T0 * l), nrow = T0, ncol = l))
X <- diag(1, nrow = M * T0, ncol = M * l)
for (i in seq_along(Xs)) {
  X[1:T0 + (i - 1) * T0, 1:l + (i - 1) * l] <- Xs[[i]]
}

# Noise
Sigma_c <- crossprod(matrix(rnorm(M^2), nrow = M))
I <- diag(T0)
u <- mvtnorm::rmvnorm(1, mean = rep(0, T0 * M), sigma = kronecker(Sigma_c, I))

# Observation
y <- X %*% beta + t(u)

# Estimator
estimator <- function(Sigma_c, I, X, y) {
  inv_mat <- solve(kronecker(Sigma_c, I))
  beta_est <- solve(t(X) %*% inv_mat %*% X, t(X) %*% inv_mat %*% y)
}

# remotes::install_github("kcf-jackson/ADtools")
library(ADtools)
auto_diff(estimator, wrt = c("Sigma_c"),
          at = list(Sigma_c = Sigma_c, I = I, X = X, y = y))
```
References
- Gardner, J.; Pleiss, G.; Weinberger, K.Q.; Bindel, D.; Wilson, A.G. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. Adv. Neural Inf. Process. Syst. 2018, 31, 7587–7597. [Google Scholar]
- Abril-Pla, O.; Andreani, V.; Carroll, C.; Dong, L.; Fonnesbeck, C.J.; Kochurov, M.; Kumar, R.; Lao, J.; Luhmann, C.C.; Martin, O.A.; et al. PyMC: A modern, and comprehensive probabilistic programming framework in Python. PeerJ Comput. Sci. 2023, 9, e1516. [Google Scholar] [CrossRef] [PubMed]
- Joshi, M.; Yang, C. Algorithmic Hessians and the fast computation of cross-gamma risk. IIE Trans. 2011, 43, 878–892. [Google Scholar] [CrossRef]
- Allen, G.I.; Grosenick, L.; Taylor, J. A generalized least-square matrix decomposition. J. Am. Stat. Assoc. 2014, 109, 145–159. [Google Scholar] [CrossRef]
- Jacobi, L.; Joshi, M.S.; Zhu, D. Automated sensitivity analysis for Bayesian inference via Markov chain Monte Carlo: Applications to Gibbs sampling. SSRN 2018. [Google Scholar] [CrossRef]
- Ranganath, R.; Gerrish, S.; Blei, D. Black box variational inference. In Proceedings of the Artificial Intelligence and Statistics. PMLR, Reykjavik, Iceland, 22–25 April 2014; pp. 814–822. [Google Scholar]
- Revels, J.; Lubin, M.; Papamarkou, T. Forward-mode automatic differentiation in Julia. arXiv 2016, arXiv:1607.07892. [Google Scholar]
- Baydin, A.G.; Pearlmutter, B.A.; Radul, A.A.; Siskind, J.M. Automatic differentiation in machine learning: A survey. J. Mach. Learn. Res. 2018, 18, 1–43. [Google Scholar]
- Kucukelbir, A.; Tran, D.; Ranganath, R.; Gelman, A.; Blei, D.M. Automatic differentiation variational inference. J. Mach. Learn. Res. 2017, 18, 1–45. [Google Scholar]
- Chaudhuri, S.; Mondal, D.; Yin, T. Hamiltonian Monte Carlo sampling in Bayesian empirical likelihood computation. J. R. Stat. Soc. Ser. B Stat. Methodol. 2017, 79, 293–320. [Google Scholar] [CrossRef]
- Chan, J.C.; Jacobi, L.; Zhu, D. Efficient selection of hyperparameters in large Bayesian VARs using automatic differentiation. J. Forecast. 2020, 39, 934–943. [Google Scholar] [CrossRef]
- Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in PyTorch. In Proceedings of the NIPS 2017 Autodiff Workshop, Long Beach, CA, USA, 9 December 2017. [Google Scholar]
- Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), Savannah, GA, USA, 2–4 November 2016; pp. 265–283. [Google Scholar]
- Kucukelbir, A.; Ranganath, R.; Gelman, A.; Blei, D. Automatic variational inference in Stan. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 568–576. [Google Scholar]
- Klein, W.; Griewank, A.; Walther, A. Differentiation methods for industrial strength problems. In Automatic Differentiation of Algorithms; Springer: New York, NY, USA, 2002; pp. 3–23. [Google Scholar]
- Griewank, A.; Walther, A. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation; SIAM: Philadelphia, PA, USA, 2008; Volume 105. [Google Scholar]
- Griewank, A.; Juedes, D.; Utke, J. Algorithm 755: ADOL-C: A package for the automatic differentiation of algorithms written in C/C++. ACM Trans. Math. Softw. (TOMS) 1996, 22, 131–167. [Google Scholar] [CrossRef]
- Bischof, C.H.; Roh, L.; Mauer-Oats, A.J. ADIC: An extensible automatic differentiation tool for ANSI-C. Softw. Pract. Exp. 1997, 27, 1427–1456. [Google Scholar] [CrossRef]
- Magnus, J.R.; Neudecker, H. Matrix Differential Calculus with Applications in Statistics and Econometrics; Wiley: Hoboken, NJ, USA, 1999. [Google Scholar]
- Glasserman, P. Monte Carlo Methods in Financial Engineering; Springer Science & Business Media: New York, NY, USA, 2003; Volume 53. [Google Scholar]
- Intel. Matrix Inversion: LAPACK Computational Routines. 2020. Available online: https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2023-0/matrix-inversion-lapack-computational-routines.html (accessed on 15 March 2025).
- Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the International Conference on Learning Representations, Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
- Rosenblatt, M. Remarks on a multivariate transformation. Ann. Math. Stat. 1952, 23, 470–472. [Google Scholar] [CrossRef]
- Kwok, C.F.; Zhu, D.; Jacobi, L. ADtools: Automatic Differentiation Toolbox, R Package Version 0.5.4, CRAN Repository. 2020. Available online: https://cran.r-project.org/src/contrib/Archive/ADtools/ (accessed on 15 March 2025).
- Kwok, C.F.; Zhu, D.; Jacobi, L. ADtools: Automatic Differentiation Toolbox. GitHub Repository. 2020. Available online: https://github.com/kcf-jackson/ADtools (accessed on 15 March 2025).
- Abelson, H.; Sussman, G.J.; Sussman, J. Structure and Interpretation of Computer Programs; MIT Press: Cambridge, MA, USA, 1996. [Google Scholar]
- Lütkepohl, H. Handbook of Matrices; Wiley Chichester: Chichester, UK, 1996; Volume 1. [Google Scholar]
- Hu, T.; Shing, M. Computation of matrix chain products. Part II. SIAM J. Comput. 1984, 13, 228–251. [Google Scholar] [CrossRef]
- Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.; Stein, C. Introduction to Algorithms; MIT Press: Cambridge, MA, USA, 2009. [Google Scholar]
- Chan, J.C.; Jacobi, L.; Zhu, D. An automated prior robustness analysis in Bayesian model comparison. J. Appl. Econom. 2019, 37, 583–602. [Google Scholar] [CrossRef]
- Brennan, M.J.; Chordia, T.; Subrahmanyam, A. Alternative factor specifications, security characteristics, and the cross-section of expected stock returns. J. Financ. Econ. 1998, 49, 345–373. [Google Scholar] [CrossRef]
- Geweke, J.; Zhou, G. Measuring the pricing error of the arbitrage pricing theory. Rev. Financ. Stud. 1996, 9, 557–587. [Google Scholar] [CrossRef]
- Zeiler, M.D. Adadelta: An adaptive learning rate method. arXiv 2012, arXiv:1212.5701. [Google Scholar]
- Zellner, A. An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. J. Am. Stat. Assoc. 1962, 57, 348–368. [Google Scholar] [CrossRef]
- LeSage, J.; Pace, R.K. Introduction to Spatial Econometrics; CRC Press: Boca Raton, FL, USA, 2009. [Google Scholar]
- Hernández-Sanjaime, R.; González, M.; López-Espín, J.J. Multilevel simultaneous equation model: A novel specification and estimation approach. J. Comput. Appl. Math. 2020, 366, 112378. [Google Scholar] [CrossRef]
| Operations | Central Differencing | AD |
|---|---|---|
| Addition | | |
| Subtraction | | |
| Multiplication | | |
| Inversion | | |
| Kronecker product | | |
Function | First Time | Second Time | Average over 100 Executions |
---|---|---|---|
diag(5000) | 97.87 ms | 127.7 ms | 117.3 ms |
mem_diag(5000) | 101.8 ms | 0.132 ms | 0.959 ms |
Tasks | n = 1000 | n = 2000 | n = 3000 | n = 4000 | n = 5000 |
---|---|---|---|---|---|
Creation—Dense | 2.436 ms | 17.39 ms | 50.23 ms | 63.86 ms | 303.7 ms |
Creation—Sparse | 0.080 ms | 0.091 ms | 0.110 ms | 0.077 ms | 0.089 ms |
Multiplication—Dense | 19.40 ms | 111.6 ms | 404.4 ms | 1.041 s | 1.576 s |
Multiplication—Sparse | 17.42 ms | 66.50 ms | 236.6 ms | 432.3 ms | 668.2 ms |
Tasks | n = q = 10 | n = q = 20 | n = q = 30 | n = q = 40 |
---|---|---|---|---|
Creation—Dense | 10.77 ms | 708.4 ms | 18.09 s | 120.4 s |
Creation—Sparse | 0.646 ms | 0.667 ms | 1.553 ms | 1.595 ms |
Multiplication—Dense | 0.727 ms | 41.54 ms | 477.3 ms | 2.821 s |
Multiplication—Sparse | 0.189 ms | 2.388 ms | 11.48 ms | 71.69 ms |
Tasks | n = 10 | n = 20 | n = 30 | n = 40 | n = 50 |
---|---|---|---|---|---|
Creation—Dense | 4.981 ms | 113.1 ms | 939.9 ms | 4.735 s | 20.97 s |
Creation—Sparse | 0.823 ms | 0.816 ms | 0.823 ms | 0.816 ms | 0.891 ms |
Multiplication—Dense | 0.412 ms | 21.98 ms | 237.0 ms | 1.490 s | 5.715 s |
Multiplication—Sparse | 0.115 ms | 1.723 ms | 6.720 ms | 22.45 ms | 49.16 ms |
| Length of Chain | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| Speed-up, mean (SD) | 1.04 (0.322) | 1.47 (0.564) | 1.64 (0.688) | 1.56 (0.637) |
| Tasks/Percentiles | 0% | 25% | 50% | 75% | 100% | Mean |
|---|---|---|---|---|---|---|
| | 0.42 | 9.43 | 14.02 | 19.99 | 57.48 | 15.82 |
| | 2.65 | 10.82 | 15.44 | 21.34 | 54.30 | 16.81 |
| Tasks/Percentiles | 0% | 25% | 50% | 75% | 100% | Mean |
|---|---|---|---|---|---|---|
| | 1.51 | 5.02 | 7.79 | 12.23 | 51.33 | 9.48 |
| | 1.95 | 5.06 | 7.83 | 11.62 | 79.33 | 9.48 |
| | 2.3 | 5.88 | 8.54 | 12.9 | 68.36 | 10.25 |
| | 2.22 | 5.13 | 8.25 | 12.71 | 61.62 | 9.98 |
Estimation time (in ms) for standard tasks under AD and FD by matrix size.

| Tasks | AD (n = 10) | FD (n = 10) | AD (n = 20) | FD (n = 20) | AD (n = 30) | FD (n = 30) | AD (n = 40) | FD (n = 40) | AD (n = 50) | FD (n = 50) |
|---|---|---|---|---|---|---|---|---|---|---|
| Addition | 4.61 | 22.98 | 5.64 | 138.17 | 16.15 | 435.70 | 73.76 | 1136.07 | 146.96 | 2746.58 |
| Subtraction | 3.92 | 21.91 | 5.42 | 107.32 | 19.33 | 427.66 | 77.41 | 1236.07 | 167.06 | 2970.38 |
| Multiplication | 13.60 | 23.40 | 20.48 | 141.94 | 38.36 | 580.23 | 139.87 | 1732.45 | 245.83 | 4251.67 |
| Inverse | 2.46 | 20.94 | 8.09 | 93.98 | 29.20 | 334.35 | 156.82 | 945.50 | 444.41 | 1839.47 |

| Tasks | AD (n = 5) | FD (n = 5) | AD (n = 10) | FD (n = 10) | AD (n = 15) | FD (n = 15) | AD (n = 20) | FD (n = 20) | AD (n = 25) | FD (n = 25) |
|---|---|---|---|---|---|---|---|---|---|---|
| Kronecker | 15.1 | 9.3 | 54.2 | 181.7 | 351.9 | 1756.4 | 1478.6 | 7492.1 | 5530.5 | 31,950.7 |
Run-time comparison of simulated MLE (per-iteration statistics are for the simulated data).

| Method | Total Run-Time (Simulated Data) | Total Run-Time (Real Data) | min | lq | Mean | Median | uq | max |
|---|---|---|---|---|---|---|---|---|
| AD | 4.36 h | 2.08 h | 7.23 s | 7.41 s | 7.52 s | 7.47 s | 7.56 s | 8.37 s |
| FD | 12.04 h | 6.10 h | 20.35 s | 20.51 s | 20.76 s | 20.60 s | 20.69 s | 26.82 s |