JAMPI: efficient matrix multiplication in Spark using Barrier Execution Mode

The new barrier mode in Apache Spark allows embedding distributed deep learning training as a Spark stage to simplify the distributed training workflow. In Spark, a task in a stage does not depend on any other tasks in the same stage, and hence it can be scheduled independently. However, several algorithms require more sophisticated inter-task communications, similar to the MPI paradigm. By combining distributed message passing (using asynchronous network IO), OpenJDK's new auto-vectorization and Spark's barrier execution mode, we can add non-map/reduce based algorithms, such as Cannon's distributed matrix multiplication to Spark. We document an efficient distributed matrix multiplication using Cannon's algorithm, which improves significantly on the performance of the existing MLlib implementation. Used within a barrier task, the algorithm described herein results in an up to 24 percent performance increase on a 10,000x10,000 square matrix with a significantly lower memory footprint. Applications of efficient matrix multiplication include, among others, accelerating the training and implementation of deep convolutional neural network based workloads, and thus such efficient algorithms can play a ground-breaking role in faster, more efficient execution of even the most complicated machine learning tasks.


Introduction
The recent decade has seen the emergence of two immensely powerful processes in tandem: the rise of big data handling solutions like Apache Spark on one hand and the apotheosis of deep learning as the tool of choice for demanding computational solutions for machine learning problems. Yet at its essence, big data and deep learning remain not only separate communities but also significantly separate domains of software. Despite deep learning over big data becoming a crucial tool in a range of applications, including in computer vision, [1,2] bioinformatics, [3][4][5][6] natural language processing (NLP), [7][8][9][10] clinical medicine, [11][12][13][14][15][16] anomaly detection in cybersecurity and fraud detection, [17][18][19] and collaborative intelligence/recommender systems, [20][21][22][23] its full potential remains to be harnessed. The primary impediment in this respect is largely a divergence of attitudes and concerns, leading to two divergent paradigms of development: • The big data paradigm, primarily designed around RDDs and the the DataFrame-based API. This outlook has dominated the development of Apache Spark.
• The DL/ML paradigm, which is primarily focused on efficient linear algebra operations to facilitate machine learning approaches, especially matrix algebra for deep neural networks.
The future of deep learning over big data depends greatly on facilitating the convergence of these two worlds into a single, unified paradigm: the use of well-designed big data management tools, such as Apache Spark, to interoperate with the demands of deep learning. The road towards this convergence depends on the development of efficient matrix primitives that facilitate rapid calculations over distributed networks and large data sets. The current execution model of Apache Spark is principally focused on independent, embarrassingly parallel tasks that are run and scaled, but the needs of deep learning are primarily focused on distributed training: the performance of completely communicating and coordinating tasks, optimized for interconnectivity rather than independent parallel running, while also maintaining scalability and efficiency. With the recent introduction of the barrier execution mode in Apache Spark, it has finally become possible to construct a computational approach that allows for such networked execution to take place, facilitating distributed training of deep neural networks (see Figure 1). JAMPI (Java Assisted Matrix Product with Inter-task communication), the framework described in this paper, is an efficient and rapid solution to an aspect of efficient matrix primitives, namely matrix multiplication. By integrating JDK's new Vector API, asynchronous network IO (nio) for distributed message passing and Spark's barrier mode, a pure Scala implementation of Cannon's 2.5D matrix multiplication algorithm can be devised that is significantly more ef- ficient than MLlib's BlockMatrix.multiply function. JAMPI thus avoids reliance on foreign, low level or native code in combination with JNI on one hand, being a pure Scala implementation. On the other hand, it provides a prewritten framework that integrates with Spark as a native task rather than an external MPI procedure call, and handles intertask communication directly, yielding performance benefits that would otherwise be associated with a low-level MPI implemented resource negotiation framework.

Cannon's algorithm
Matrix multiplication plays a significant role in a range of practical applications, including (but not limited to) scientific computing, non-linear modeling, agent-based models and the training of deep convolutional neural networks (deep learning). The proliferation of deep learning as the cognitive technology of choice for problems with large source data sets and high-dimensional or high-order multivariate data means that efficiency gains in the underlying linear algebra primitives has the potential to enable significant performance benefits in a wide range of use cases. In particular, constructing primitives that leverage computational capacity through rapid parallel computation and efficient interchange lends itself as an avenue towards these performance gains. While packages comprising efficient matrix primitives already exist, [24] these often operate at a low level and do not integrate well with existing and proven solutions to manage large computational loads.
The matrix multiplication operation for an p × q matrix A and an q × r matrix B is defined so that for the resultant matrix C = A B, each element c i, j is the dot product of the i-th row of A and the j-th column of B, i.e.
The multiplication of square matrices constitutes a special case. For a square matrix of order n, i.e. an n × n matrix, a special case obtains, which can be resolved efficiently using Cannon's algorithm. [25] For a square matrix of order n, i.e. n × n, Cannon's algorithm uses a toroidally connected mesh P n×n of n 2 processes. Rendered in pseudocode, the algorithm can be expressed as follows for p processors: for all i = 0 :  Cannon's algorithm is designed to be performed on a virtual square grid P of p processors (i.e. a √ p × √ p matrix).
The multiplicand and multiplier matrices A and B are laid out on P, after which the i-th row of A is circularly shifted by i to the left and the j-th column of B circularly shifted by j elements up. Then, n times, the two entries mapped onto p i, j are multiplied and added onto the running value of p i, j , after which each row of A is shifted left by one element and each column of B is shifted up by one element. Standard methods of multiplying dense matrices require O(n 3 ) floating operations for an n × n matrix. Cannon's algorithm improves on this by reducing it to O( n 3 p ). In particular, because of the fact that memory is not dependent on the number of processors, it scales dynamically with the number of processors. This makes it an attractive candidate for implementation as a high-performance distributed matrix multiplication primitive.

Spark's barrier mode
Spark's barrier mode is a new mode of execution introduced to Apache Spark as part of Project Hydrogen. [26] Barrier execution features gang scheduling on top of the MapReduce execution model to support distributed deep learning tasks that are executed or embedded as Spark steps. The current implementation ensures that all tasks (limited to mapPartitions) are executed at the same time, and collectively cancels and restarts all tasks in case of failure events. In addition to true parallel execution, the workers' host names and partition identifiers are accessible inside the tasks, alongside a barrier call, similar to MPI's MPI Barrier function. [27] While this functionality is sufficient to support the primary use case of Spark's barrier mode -namely, executing embedded MPI or other foreign, i.e. non-Spark and non-JVM, steps within a Spark application -, it does not provide any inter-task communication primitive to implement the same algorithms within JVM/Spark native steps. In fact, the design documentation for Spark's barrier mode clearly defines this as outside the scope of the project, stating that beyond a simple Barrier-TaskContext.barrier() call, no intra-communication functionality will be part of the implementation. It is assumed that such functionality would be handled by the user program. It is our view based on our extensive experience with implementing deep learning solutions on distributed systems that this is a clear show-stopper: if Spark is to be a force to be reckoned with as the data layer for deep learning applications over big data, it should not force execution outside Spark's boundaries.

Cannon's algorithm on MPI
The MPI version of the algorithm described in Subsection 1.1 relies on MPI's Cartesian topology. After setting up a 2D communication grid of processors with MPI Cart create, processors exchange data with their neighbors by calling MPI -Sendrecv replace. In the main loop, each processor executes a local dot product calculation, then shifts the results horizontally for matrix a and vertically for matrix b. In our benchmarks, we used MPICH version 3.3.2 as the underlying MPI implementation.
To speed up matrix multiplication, we applied -O4ftree-vectorize -march=native GNU C compiler flags to ensure vectorized code execution. By vectorization, we refer using SIMD (Single Instruction, Multiple Data) CPU features, more precisely Advanced Vector Extensions (AVX-512F) that allows faster execution of fused multiply-add (FMAC) operations in local/partial matrix dot product steps. After compiling our code with GCC 7.3.1 we ensured that the disassembled code contains vfmadd231sd instruction for vectorized FMAC.

JAMPI
JAMPI is a de novo native Scala implementation of Cannon's algorithm as described in described in Subsection 1.1. For message passing, we built an nio based asynchronous message passing library that mimics MPI's Cartesian topology and send-receive-replace functionality. To avoid unnecessary memory copies and to optimize performance for both throughput and latency, our PeerMessage object allocates fixed 8MB off-heap buffers for both sending and receiving data. Send and receive network operations are executed asynchronously and in parallel.
The matrix multiplication is embedded into a barrier execution task, which is parametrized by the the number of par-titions, the local partition ID, the hostnames for the other partitions (address from BarrierTaskContext.getTaskInfos()), as well as the the local matrix pairs from the RDD. JAMPI supports double, float and int Java primitive data types passed as Java Arrays.

Vectorization using Panama OpenJDK
In order to achieve performance on par with the optimized MPI implementation for local dot product steps, we used JVM's native vector intrinsics and super-word optimization capabilities for both JAMPI and MLlib Spark application benchmarks. The most recent and most comprehensive vectorization support in JVM is found in the Vector API module, part of OpenJDK's Project Panama. While the Vector API module is currently in incubation status, we consider it stable enough to use for both the Spark platform and application code.
For fair benchmarking, we avoided using Vector<> objects or advanced methods such as manual unrolling. While these techniques could potentially further improve performance, our goals were to compare the distributed algorithms' performance with the same CPU opcodes used in local matrix multiplications. From the JIT compiler outputs, we confirmed that both Spark applications were using vfmadd231sd, just as in the GCC compiled MPI version.
To use the new vector intrinsics' features, we built a custom OpenJDK package from the tip of the panama/dev branch (dev-442a69af7bad). The applied JVM flags were --add-modules jdk.incubator.vector and -XX:TypeProfileLevel=121 for both JAMPI and MLlib applications.

Apache Spark MLlib
We used Apache Spark MLlib's built-in BlockMatrix.multiply() as a baseline to compare with JAMPI's speed and resource usage. It is known that MLlib's implementation is often faster if the number of partitions exceeds that of worker cores (typically by a factor of 2-4 at least), a scenario known as overpartitioning. To ensure that this is adequately reflected, we performed two test runs -a 'normal' test run, where partitions are set to equal the number of worker cores, and an 'overpartitioned' test run, where partitions equal four times the number of worker cores.

Test protocols
All tests were performed on Amazon Web Services EC2 instances using m5 instance types with Intel R Xeon R Platinum 8175M CPUs and 4GB RAM per core. Tests were conducted on Apache Spark 3.0.0-preview2 with a separate master node. The driver process was initiated from the master node, and its JAMPI: efficient matrix multiplication in Spark using Barrier Execution Mode -4/8 resource consumption is not included in the results. For single core tests, 2-core CPUs were used, with the second CPU core having been manually disabled in the VM. Applications reported only the dot product execution time. A single one-value reducer (avg) was included to trigger RDD reduce/collection on Spark without moving substantial amount of data to the driver process. Timings thus exclude the MPI and Spark application startup times, but included the time required to establish a barrier task step during the RDD reduce step. For testing, random matrices composed of 64-bit floating point elements were used. Test scenarios were performed ten times, capturing execution time, CPU and memory consumption. Test scenarios, as well as the original JAMPI source code, are available online on Github under https://github.com/starschema/jampi-sparkdotmatrix.

Scalability analysis
An important aspect of any distributed algorithm is its ability to scale up as the problem size increases. This is crucial for proving the value of an algorithmic solution, since it demonstrates its ability to solve increasingly complex instances of the same fundamental problem effectively. There are intrinsic issues when scaling distributed multi-processor algorithms. It is known, for instance, that the memory requirement for each processor increases as we add processors to a computation. Therefore, we must analyze the effect of problem size on the memory requirements per processor.
For Cannon's algorithm multiplying two square matrices of size n × n, the problem size W is on the order of n 2 , i.e., The sequential time, that is when p = 1, is For p processors, the execution time for a matrix of size n × n is given as T p (n). It follows that parallelization of the problem yields a speed-up calculated as W T p (n) . In addition, parallel execution of an n × n problem size over p processors will incur a performance overhead of T o (n, p), including all communication costs.
It is known that the communication cost D, which is how much data is being shifted across the p processors, can be calculated as Using the following iso-efficiency relationship of parallel systems, Substituting Equation (3) in Equation (5), it follows that It follows thus from Equation (6) and the definition of W in Equation (2) that More generally, it holds that for a problem size W and p processors, Cannon's Algorithm memory requirements increase by a constant factor c 2 that is independent of the number of processors p involved in the computation. Since the memory requirements per processor increase linearly, without direct relationship to p, it can be said that Cannon's algorithm is extremely scalable. Figure 3 illustrates this scaling behavior comparatively between JAMPI, a pure MPI implementation and MLlib. JAMPI, as well as the MPI algorithm test case, are both direct implementations of Cannon's algorithm, thus having the same scalability behavior.
It is evident from Figure 3 that MLLib's memory requirement increases quite fast, suggesting that its scalability factor is larger than that of Cannon's algorithm (i.e. it is less scalable). This is a key limitation of MLlib and Spark when compared to MPI and JAMPI alike, which scale better. Indeed, in some test scenarios, we have been unable to scale MLlib beyond a certain problem size, indicating that in addition to its poor performance compared to MPI and JAMPI, it is also limited in the maximum problem size it can accommodate with a set level of resources. Neither JAMPI nor the native MPI implementation is so limited.

Results
Comparative analysis of runtimes over a range of matrix sizes reveals that JAMPI is significantly superior to MLlib, even when over-partitioned (see Figure 4, over-partitioning is denoted by op). When normalized against JAMPI's execution times over 16 and 64 cores, execution time is slower for smaller matrices (under 4096 × 4096 elements) due to the need to establish and run the barrier execution task. However, beyond a trivial problem size, JAMPI and the MPI implementation rapidly become significantly more efficient, regardless of the number of cores. Notably, plain MLlib (i.e. without over-partitioning) was unable to accommodate a problem size beyond 10240 × 10240 (for 16 cores) or 20480 × 20480 (for 64 cores).

Memory usage
Memory usage has been a documented limiting factor, with pure MLlib reaching execution limits at relatively trivial matrix dimensions per processor (Table 1). While over-partitioning slightly increases the maximum matrix size, MLlib suffers from not only lower performance but also a memory consumption upper bound that limits its ability to scale to larger problem sizes.
Our research indicates that for a 10,240 × 10,240 element standard matrix, JAMPI and MPI perform approximately equally (4,889 MB vs 5,108 MB, respectively, for 256 cores), while both over-partitioned and regular MLlib execution creates a marginally larger memory footprint (6,049 MB and 6,423 MB, respectively, for 256 cores). However, with increasing problem size, differences become vastly apparent: for a 30,720 × 30,720 element matrix, MPI and JAMPI continue to require a constant memory footprint (5,572 MB and 6,084 MB, respectively), while the same problem size requires 24,525 MB with over-partitioning and 29,445 MB without. In other words, JAMPI and MPI memory burden increases constantly regardless of the number of cores, while MLLib's memory consumption increases rapidly, as Figure 3 indicates. For instance, when processing 30,720 × 30,720 matrix size, MLlib requires a 4.03 (with over-partitioning) to 4.84 (without over-partitioning) times larger memory allocation.
Comparative analysis of memory usage (see Figure 3) shows that JAMPI is generally on par (within 30%) of the pure MPI implementation, while MLlib typically requires approximately four times the amount of memory allocation that the MPI based approaches demand, with regular MLlib requiring typically 15% to 50% more memory than over-partitioned implementations.
However, MLlib execution times rapidly increases, yielding an average of 29.52% (with over-partitioning) to 54.54% (without over-partitioning) slower execution time at the largest feasible matrix for the given number of cores. On the other hand, for large matrices, as Figure 4 shows, a pure MPI implementation is typically 3.38% to 19.59% faster than JAMPI.
The comparative analysis of performance indicators shows that while a pure MPI implementation is somewhat faster than JAMPI, this difference is significantly smaller than the difference between the MLlib implementation and JAMPI, proving that JAMPI is an efficient and fast alternative to pure MPI applications without a significant performance overhead.

Conclusion
Cannon's algorithm can be implemented quite conveniently using a barrier task within Spark, providing a native interpretation of this highly efficient distributed linear algebra primitive. By using barrier tasks to reimplement matrix primitives with Panama's built-in efficient vectorization and asynchronous communication (as provided by nio in this case), very significant performance gains can be effected on frequently used tasks. The proposed implementation of Cannon's algorithm, for instance, has yielded an almost 25% decrease in execution time, and has been superior to the MLlib implementation on all core sizes above trivial matrix sizes. While this algorithm is limited to square matrices, the general effectiveness gains are indicative of a strong theoretical and practical benefit of further research in ways efficient matrix primitives can be integrated with big data solutions like Apache Spark. Further research in this field is required to create a coherent stack of matrix primitives in order to allow modern deep learning applications, relying greatly on such building blocks, to leverage the performance benefits of big data solutions in storing and managing data as a layer of an integrated framework of large-scale machine learning.