2.1. Board Comparison
Before designing our cluster, we evaluated the power and performance tradeoffs found in seventeen different commodity 32-bit and 64-bit ARM boards, as listed in Table 1. The boards, all running Linux, span a wide variety of speeds, costs and processor types. More complete information on the hardware capabilities of the boards can be found in Appendix A.
2.1.1. Experimental Setup
During the experiments, we configure each machine as if it were a node in a larger cluster. No extraneous devices (keyboards, mice, monitors, external drives) are attached during testing; the only connections are the power supply and network cable (with the exception of the Chromebook, which has a wireless network connection and a laptop screen). Machines that did not have native Ethernet were provided with a USB Ethernet adapter.
2.1.2. Benchmarking Programs
Choosing a representative set of High Performance Computing (HPC) benchmarks remains difficult, as cluster performance is tightly tied to the underlying workload. We chose two HPC benchmarks that are widely used in cluster benchmarking: Linpack and STREAM.
High-performance Linpack (HPL) [5] is a portable version of the Linpack linear algebra benchmark for distributed-memory computers. It is commonly used to measure the performance of supercomputers worldwide, including for the twice-a-year Top500 Supercomputer list [4]. The program tests the performance of a machine by solving large, dense systems of linear equations using basic linear algebra subprograms (BLAS) and the message-passing interface (MPI).
For our experiments, MPICH2 [6] was installed on each machine to provide the MPI implementation, and the OpenBLAS [7] library was installed on each machine to serve as the BLAS.
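As a quick sanity check of the BLAS installation, a dense matrix multiply can be timed from Python. The sketch below is only illustrative; it assumes numpy is available and linked against the installed OpenBLAS, which is an assumption rather than part of our setup.

# Rough single-node matrix-multiply throughput check (illustrative sketch, not HPL).
# Assumes numpy is linked against the installed OpenBLAS.
import time
import numpy as np

n = 2048                        # matrix dimension; adjust to fit in RAM
a = np.random.rand(n, n)
b = np.random.rand(n, n)

start = time.time()
c = np.dot(a, b)                # dispatched to the BLAS dgemm routine
elapsed = time.time() - start

flops = 2.0 * n ** 3            # multiply-add count for a dense matrix multiply
print("approx. %.2f GFLOPS" % (flops / elapsed / 1e9))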
The second benchmark we use is STREAM [8], which tests a machine’s memory performance. STREAM performs simple operations on large arrays, such as copying bytes in memory, adding values together and scaling values by a constant, and reports the time taken and the achieved bandwidth for each operation.
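For illustration, the kinds of kernels STREAM times (copy, scale, add and triad) can be approximated in a few lines of numpy. The sketch below is only a rough stand-in for the real C benchmark; the byte counts follow STREAM's usual accounting, and the array size matches the default we used.

# Illustrative numpy versions of the STREAM kernels (the real benchmark is written in C).
import time
import numpy as np

n = 10000000                     # default STREAM array size used in our runs
a = np.random.rand(n)
b = np.random.rand(n)
c = np.zeros(n)
scalar = 3.0

def report(label, func, bytes_moved):
    start = time.time()
    func()
    print("%-6s %9.1f MB/s" % (label, bytes_moved / (time.time() - start) / 1e6))

report("Copy",  lambda: np.copyto(c, a),            2 * 8 * n)   # c = a
report("Scale", lambda: np.multiply(c, scalar, b),  2 * 8 * n)   # b = scalar * c
report("Add",   lambda: np.add(a, b, c),            3 * 8 * n)   # c = a + b
report("Triad", lambda: np.add(b, scalar * c, a),   3 * 8 * n)   # a = b + scalar * c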
We compiled our benchmarks with the version of gcc that was installed on the various machines (typically it was gcc 4.9, as most machines were running the Raspbian Jessie Linux distribution). We used the default compiler options when compiling.
We did not use any digital signal processing (DSP) or graphics processing unit (GPU) acceleration, even though many of the boards support this. In general, the boards do not support Open Computing Language (OpenCL) or any other abstraction layer on top of the accelerators. Gaining access to the DSP or GPU would require extensive custom coding for each individual board, and often, these interfaces are not well documented. The Jetson TX-1 board does support NVIDIA CUDA, so we tried running HPL_cuda on the board. This consistently crashed the system, so we were unable to obtain results. In addition, the TX-1 GPU is optimized for single-precision floating point, so direct comparisons against the CPU results (which use double-precision) would not be possible.
2.1.3. Power Measurement
The power consumed by each machine was measured and logged using a WattsUpPro [9] power meter. The meter was configured to log the power at its maximum sampling speed of once per second.
The power readings were gathered on a separate machine from the one running the benchmarks. For proper analysis, the timestamps of the power readings need to match the start and stop times of the benchmarks. We aligned them by synchronizing the clocks of the two machines to the same network time protocol (NTP) time server before starting the runs. There is some potential for drift, but since the power meter only provides one-second resolution, this approach was deemed sufficient.
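Once the clocks are synchronized, combining the two logs is straightforward. The sketch below assumes the meter's log was saved as comma-separated "timestamp,watts" lines and that the benchmark start and stop times were recorded as Unix timestamps; the file name and example times are hypothetical.

# Average power and energy over a benchmark window (sketch; log format is assumed).
def average_power(logfile, t_start, t_stop):
    samples = []
    with open(logfile) as f:
        for line in f:
            ts, watts = line.strip().split(",")
            if t_start <= float(ts) <= t_stop:
                samples.append(float(watts))
    avg_w = sum(samples) / len(samples)
    energy_j = avg_w * (t_stop - t_start)   # reasonable estimate given 1-Hz samples
    return avg_w, energy_j

# Hypothetical benchmark window recorded on the machine under test.
avg, joules = average_power("wattsup.log", 1467000000.0, 1467000600.0)
print("average %.1f W, total %.0f J" % (avg, joules))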
2.1.4. HPL FLOPS Results
Table 2 summarizes the floating point operations per second (FLOPS) results when running HPL. We gathered many results for each board, varying the problem size N to find the maximum performance. Larger values of N generally perform better, but at some point performance starts declining as the available memory is exhausted.
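A common rule of thumb is to choose N so that the N × N double-precision matrix fills a large fraction of available memory. The sketch below applies an assumed 80% fraction and rounds down to a multiple of the HPL block size; both values are illustrative choices rather than settings taken from our runs.

# Estimate an HPL problem size N from available RAM (rule-of-thumb sketch).
import math

def suggest_n(ram_bytes, fraction=0.8, nb=128):
    n = int(math.sqrt(fraction * ram_bytes / 8))   # 8 bytes per double-precision element
    return (n // nb) * nb                          # round down to a multiple of the block size

print(suggest_n(1 * 1024 ** 3))                    # a 1 GB board gives roughly N = 10240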
The FLOPS value is unexpectedly low on the Cortex-A8 machines; the much less advanced ARM1176 Raspberry Pi obtains better results. This is most likely due to the “VFP-lite” floating point unit found in the Cortex-A8, which takes 10 cycles per operation rather than just one. It may be possible to improve these results by changing the gcc compiler options; by default, strict IEEE-FP correctness is chosen over raw speed.
The ARM1176-based systems (low-end Raspberry Pis) all cluster together with similar performance, differing mostly by the CPU clock frequency.
The more advanced Cortex-A9 and Cortex-A7 systems have a noticeable improvement in floating point performance. This includes the Pandaboard, Cubieboard2 and Raspberry Pi Model 2B. Some of this is due to these machines having multiple cores. We do not have numbers for the Trimslice: building a BLAS for it proved difficult, as it lacks NEON support (NEON is optional on Cortex-A9), and a later hardware failure prevented further testing.
The Cortex-A15 machines (Chromebook and Odroid-xU) have an even greater boost in FLOPS, with the highest performance of the 32-bit systems.
The 64-bit systems also perform well: the high-end Cortex-A57 (Jetson TX-1) has the best performance, with the lower-end Cortex-A53 systems (Dragonboard, Raspberry Pi Model 3B) not far behind.
The Raspberry Pi Model 3B posed some interesting challenges. Unlike previous models, when running HPL, the chip can overheat and produce wrong results or even crash [10]. With an adequate heat sink, cooling and boot loader over-volt settings, an impressive 6.4 GFLOPS can be obtained, but on stock systems, the CPU overheats and/or clocks down, and much lower results are obtained.
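Throttling of this kind can be observed from userspace while a benchmark runs. The sketch below polls the standard Raspbian sysfs files for the SoC temperature and the current CPU clock; the one-second polling interval is an arbitrary choice.

# Poll SoC temperature and current CPU clock to spot thermal throttling (Raspbian sysfs).
import time

def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

while True:
    temp_c = read_int("/sys/class/thermal/thermal_zone0/temp") / 1000.0
    freq_mhz = read_int("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq") / 1000.0
    print("%5.1f C  %6.0f MHz" % (temp_c, freq_mhz))
    time.sleep(1)                # once per second is enough to see sustained throttling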
For comparison, we show results from a few x86 machines. We find that while the high-end ARM systems can outperform a low-end atom-based x86 server, recent high-end AMD and Intel servers have at least an order of magnitude more FLOPS than any of the ARM systems.
2.1.5. HPL FLOPS per Watt Results
Table 2 also shows the GFLOPS per average power results (GFLOPS/W). This is shown graphically in Figure 1, where an ideal system optimizing both metrics would have points in the upper left. By this metric, the 64-bit machines perform best by a large margin. The Jetson TX-1 (and a properly-cooled Raspberry Pi 3B) break the 1 GFLOPS/W barrier. The Chromebook is at a disadvantage compared to the other boards, as it is a laptop and its display was operating while the test was running. This is most noticeable in its idle power being higher than that of all the other boards.
While the 64-bit machines have much better efficiency than earlier processors, a high-end x86 server can still obtain roughly twice the FLOPS per Watt of even the best ARM system to which we have access.
2.1.6. HPL FLOPS per Cost Results
Table 2 also shows the FLOPS per dollar cost (purchase price) of the system (higher is better); this is also shown in Figure 2, where an ideal system optimizing both would have points in the upper left. The Raspberry Pi 3 performs impressively on the MFLOPS/US$ metric, matching a high-end x86 server. The Raspberry Pi Zero is a surprise contender, due mostly to its extremely low cost.
2.1.7. STREAM Results
We ran Version 5.10 of the STREAM benchmark on all of the machines. We used the default array size of 10 million (except for the Gumstix Overo, which only has 256 MB of RAM, so a problem size of 9 million was used).
Figure 3 shows a graph of the performance of each machine. The more advanced Cortex-A15 chips have much better memory performance than the earlier boards, most likely due to the use of dual-channel low-power DDR3 (LPDDR3) memory. The Jetson TX-1 has extremely high memory performance, although not quite as high as a full x86 server system.
To fully understand the results, some knowledge of modern memory infrastructure is needed. On desktop and server machines, synchronous dynamic random access memory (SDRAM) is used, and the interface has gradually improved over the years from DDR (double data rate) to DDR2, DDR3 and now DDR4. Each new generation improves the bandwidth by increasing how much data can be sent per clock cycle, as well as by increasing the frequency. Power consumption is also important, and the newer generations reduce the bus voltage to save energy (2.5 V in DDR, 1.8 V in DDR2, 1.5 V in DDR3 and down to 1.2 V in DDR4). Embedded systems can use standard memory, but they often use low-power mobile SDRAM, such as LPDDR2, LPDDR3 or LPDDR4. This memory is designed with embedded systems in mind, so it often trades off performance for lower voltages, extra sleep states and other features that reduce power consumption.
One factor affecting performance is the number of DRAM (dynamic random access memory) channels the device has: despite having similar CPUs, the Trimslice only has a single channel to memory, while the Pandaboard has two, and the memory performance is correspondingly better.
The bus frequency can also make a difference. Note that the Odroid-xU and the Dragonboard both have LPDDR3 memory; however, the Odroid runs the memory bus at 800 MHz versus the Dragonboard’s 533 MHz, and the STREAM results for Odroid are correspondingly better.
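As a rough guide, theoretical peak memory bandwidth scales as channels × bus width × effective transfer rate. The sketch below illustrates the arithmetic for two hypothetical LPDDR3 configurations; the 32-bit channel width and the doubling of the bus clock are assumptions for illustration, not specifications of these particular boards.

# Theoretical peak DRAM bandwidth: channels x bus width x effective transfer rate.
def peak_bw_gb_s(channels, bus_bits, clock_mhz):
    transfers_per_s = 2 * clock_mhz * 1e6          # double data rate: two transfers per clock
    return channels * (bus_bits / 8.0) * transfers_per_s / 1e9

print(peak_bw_gb_s(channels=2, bus_bits=32, clock_mhz=800))   # e.g., dual-channel LPDDR3-1600
print(peak_bw_gb_s(channels=1, bus_bits=32, clock_mhz=533))   # e.g., single-channel LPDDR3-1066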
2.1.8. Summary
Results of both the HPL and STREAM benchmarks show the Jetson TX-1 machine as the clear winner on raw performance, memory bandwidth and performance per Watt. If cost is factored in, the Raspberry Pi 3B makes a strong case, although it has the aforementioned problems with overheating.
The Pi 3B and Jetson TX-1 were not yet released when we started building our cluster, so our design choice was made without those as options. At the time, the best performing options other than the Pi were the Chromebook (which has a laptop form factor and no wired Ethernet) and the Odroid-xU (which was hard to purchase through our university’s procurement system). We chose to use Raspberry Pi B boards for various practical reasons. A primary one was cost, along with the ease of ordering large numbers at once. Another important concern was the long-term availability of operating system support and updates; the Raspberry Pi Foundation has a much stronger history of this than the manufacturers of other embedded boards.
Our cluster originally used Model B boards, and we have since updated to B+ and then 2B. The compatible design of the Raspberry Pi form factor means it should be easy to further upgrade the system to use the newer and better performing Model 3B boards.
2.2. Cluster Design
Based on the analysis in Section 2.1, we chose Raspberry Pi Model 2B boards as the basis of our cluster. The Raspberry Pi boards provide many positive features, including small size, low cost, low power consumption, a well-supported operating system and easy access to general-purpose input/output (GPIO) pins for external devices.
Figure 4 shows the cluster in action. The compute part of the cluster (compute nodes plus network switch) costs roughly US$2200; power measurement adds roughly $200; and the visualization display costs an additional $700.
2.2.1. Node Installation and Software
Each node in the cluster consists of a Raspberry Pi Model 2B with its own 4 GB SD card. Each node has an installation of the Raspbian operating system, which is based on Debian Linux and designed specifically for the Raspberry Pi.
One node is designated as the head node and acts as a job submission server, central file server and network gateway. The file system is shared via Network File System (NFS) and subsequently mounted by the sub-nodes. Using NFS allows programs, packages and features to be installed on a single file system and then shared throughout the network, which is faster and easier to maintain than manually copying files and programs around. Passwordless SSH (Secure Shell) allows easily running commands on the sub-nodes. MPI (message passing interface) is installed to allow cluster-wide parallel jobs. The MPI implementation used for this cluster is MPICH2, a free MPI distribution written for UNIX-like operating systems. For job submission, the Slurm [11] batch scheduler is used.
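As an illustration of how a cluster-wide job can be verified, the sketch below uses the mpi4py Python bindings to report how many distinct nodes an MPI job actually reached; mpi4py is not part of the cluster software described above and is used here purely for illustration.

# hello_cluster.py -- verify that an MPI job spans the expected nodes (mpi4py sketch).
# Launched with something like: mpiexec -n 24 python hello_cluster.py
from mpi4py import MPI
import socket

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
host = socket.gethostname()

# Gather every rank's hostname at rank 0 and print a summary.
hosts = comm.gather(host, root=0)
if rank == 0:
    print("%d ranks on %d distinct nodes" % (comm.Get_size(), len(set(hosts))))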
The nodes are connected by 100 Mbit/s Ethernet through a 48-port 10/100 network switch, which draws approximately 20 Watts of power.
2.2.2. Node Arrangement and Construction
The initial cluster has 24 compute nodes plus one head node. It is designed so expansion to 48 nodes is possible.
A Corsair CX430 ATX power supply powers the cluster. The Pi boards are powered by the supply’s 5-V lines, as well as by its 12-V lines through a DC-DC converter that reduces them to 5 V. We found it necessary to draw power from both the 5-V and 12-V lines of the power supply; otherwise, the voltages provided would become unstable. This is typical behavior for desktop power supplies, which expect a minimum load on the 12-V lines; if this load is not present, the output voltages can drift outside of specifications.
The head node is powered by the supply’s standby voltage, which allows the node to be powered up even when the rest of the cluster is off. The head node can power on and off the rest of the cluster by toggling the ATX power enable line via GPIO.
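A sketch of how such GPIO-driven power control might look in Python is shown below, using the RPi.GPIO library. The pin number is hypothetical, and the exact interface between the GPIO and the PS_ON line depends on the wiring; PS_ON is active-low on ATX supplies.

# Toggle the ATX power-enable (PS_ON) line from the head node's GPIO (sketch).
import RPi.GPIO as GPIO

PS_ON_PIN = 17                                        # hypothetical BCM pin wired to PS_ON

GPIO.setmode(GPIO.BCM)
GPIO.setup(PS_ON_PIN, GPIO.OUT, initial=GPIO.HIGH)    # HIGH = main supply off (PS_ON is active-low)

def cluster_power(on):
    # Drive PS_ON low to turn the main supply on, high to turn it off.
    GPIO.output(PS_ON_PIN, GPIO.LOW if on else GPIO.HIGH)

cluster_power(True)                                   # power up the compute nodes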
Power can be supplied to a Raspberry Pi in two ways: through the micro USB connector or through the GPIO header. The power pins on the GPIO header connect directly to the main power planes and have no protection circuitry. We use the micro USB power connector to take advantage of the fuses and smoothing capacitors that add an extra layer of protection. This did complicate construction, as we had to crimp custom micro-USB power cords to connect the Pis to the power measurement and distribution boards.
The boards are attached via aluminum standoffs in stacks of four and are placed in a large server case that has had wooden shelving added.
2.2.3. Visualization Displays
Two main external displays are used to visualize the cluster activity.
The first is a series of 1.2 inch bi-color 8×8 LED matrix displays (Adafruit, New York, NY, USA) attached to each node’s GPIO ribbon cable. These LED displays can be individually programmed via the nodes’ i2c interface and controlled in parallel via MPI programs. This not only allows interesting visualizations and per-node system information, but also gives students a plainly visible representation of their underlying MPI programs.
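These Adafruit matrices are typically driven by an HT16K33 I2C backpack; a minimal sketch using the Python smbus module is shown below. The 0x70 address and register values follow the HT16K33 datasheet, and the pattern drawn is arbitrary.

# Light one node's 8x8 bi-color matrix over I2C (HT16K33 backpack, sketch).
import smbus

bus = smbus.SMBus(1)             # I2C bus 1 on the Raspberry Pi GPIO header
ADDR = 0x70                      # default HT16K33 address

bus.write_byte(ADDR, 0x21)       # turn on the internal oscillator
bus.write_byte(ADDR, 0x81)       # display on, no blinking
bus.write_byte(ADDR, 0xE0 | 8)   # mid-level brightness

# Each row uses two bytes (one per color); draw a diagonal in a single color.
buffer = []
for row in range(8):
    buffer += [1 << row, 0x00]
bus.write_i2c_block_data(ADDR, 0x00, buffer)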
The second piece of the front end is an LCD-PI32 3.2 inch LCD touchscreen (Adafruit, New York, NY, USA) that is programmed and controlled using the head node’s SPI interface. This screen allows a user to view overall power and performance information, as well as check the status of jobs running on all nodes of the cluster.
2.2.4. Power Measurement
Each node has detailed power measurement provided by the circuit shown in Figure 5. The current consumed is calculated from the voltage drop across a 0.1-Ohm sense resistor, which is amplified by a factor of 20 with an MCP6044 op-amp and then measured with an MCP3008 SPI A/D converter. Multiplying the supply voltage by the calculated current gives the instantaneous power being consumed. There is one power measurement board for each group of four nodes; the first node in each group is responsible for reading the power via the SPI interface.
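A sketch of the per-node readout over the Linux spidev interface is shown below. The sense resistor value and op-amp gain come from the circuit described above, while the ADC reference voltage, the SPI bus and channel assignment, and the nominal 5-V supply value are assumptions.

# Read one node's current from the MCP3008 and convert it to power (sketch).
import spidev

VREF = 3.3            # ADC reference voltage (assumed)
R_SENSE = 0.1         # Ohm, sense resistor from the measurement circuit
GAIN = 20.0           # op-amp gain from the measurement circuit
V_SUPPLY = 5.0        # nominal node supply voltage

spi = spidev.SpiDev()
spi.open(0, 0)                            # SPI bus 0, chip-select 0 (wiring-dependent)
spi.max_speed_hz = 1000000

def read_power(channel):
    # Standard MCP3008 single-ended read: start bit, channel bits, then a 10-bit result.
    reply = spi.xfer2([1, (8 + channel) << 4, 0])
    raw = ((reply[1] & 0x03) << 8) | reply[2]
    v_adc = raw * VREF / 1023.0
    current = (v_adc / GAIN) / R_SENSE    # undo the amplification, then apply Ohm's law
    return V_SUPPLY * current             # instantaneous power in Watts

print("%.2f W" % read_power(0))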
An example power measurement for the full cluster is shown in Figure 6. The workload is a 12-node HPL run with a problem size of 10 k, with 5 s of sleep on either side. While sampling frequencies up to at least 1 kHz are possible, in this run the power was sampled at 4 Hz to avoid cluttering the graph.
One of the nodes (node05–2) is currently down, and the power measurements of three of the nodes (node03–0, node03–3 and node01–0) are currently malfunctioning. The remaining nodes measure correctly, and detailed behavior can be seen across the cluster: half of the nodes are idle, and the rest show periodic matching peaks and troughs as the workload calculates and then transmits results.
One advantage of our custom power measurement circuits is that we can sample at a high granularity. Alternatively, system-wide power can be measured with a WattsUpPro [9] power meter; however, the WattsUpPro can only sample power at 1-Hz resolution, which can miss fine-grained behavior. Figure 7 shows the loss of detail found if the sampling frequency is limited to 1 Hz.
2.2.5. Temperature Measurement
In addition to power usage, it is often useful to track per-node temperature. The Raspberry Pi has an on-chip thermometer that can be used to gather per-board temperature readings. Our power measurement boards also support the later addition of 1-wire protocol temperature probes if additional sensors are needed.
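On Raspbian, such a 1-wire probe appears as a sysfs file once the w1-gpio and w1-therm overlays are enabled. The minimal read sketch below assumes a single DS18B20 probe; the probe and its device path are hypothetical additions, not part of the current cluster.

# Read a DS18B20 1-wire temperature probe via the Raspbian w1-therm sysfs interface (sketch).
import glob

def ds18b20_temp_c():
    # Assumes the w1-gpio/w1-therm overlays are enabled and exactly one probe is attached.
    path = glob.glob("/sys/bus/w1/devices/28-*/w1_slave")[0]
    with open(path) as f:
        data = f.read()
    return int(data.rsplit("t=", 1)[1]) / 1000.0

print("probe: %.1f C" % ds18b20_temp_c())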