Power and Performance Evaluation of Memory-Intensive Applications

Abstract: In terms of power and energy consumption, DRAM plays as key a role in a modern server system as the processors do. Although power-aware scheduling usually assumes that DRAM is as energy proportional as other components, when running memory-intensive applications, the energy consumption of the whole server system is significantly affected by the non-energy-proportionality of DRAM. Furthermore, modern servers usually use the NUMA architecture in place of the original SMP architecture to increase memory bandwidth, so it is of great significance to study the energy efficiency of these two memory architectures. Therefore, in order to explore the power consumption characteristics of servers under memory-intensive workloads, this paper evaluates the power consumption and performance of memory-intensive applications on different generations of real rack servers. Through analysis, we find that: (1) Workload intensity and the number of concurrent execution threads affect server power consumption, but a fully utilized memory system does not necessarily bring good energy efficiency. (2) Even if the memory system is not fully utilized, the memory capacity per processor core has a significant impact on application performance and server power consumption. (3) When running memory-intensive applications, memory utilization is not always a good indicator of server power consumption. (4) Reasonable use of the NUMA architecture improves memory energy efficiency significantly. The experimental results show that reasonable use of the NUMA architecture can improve memory efficiency by 16% compared with the SMP architecture, while unreasonable use of the NUMA architecture reduces memory efficiency by 13%. The findings we present in this paper provide useful insights and guidance for system designers and data center operators to help them in energy-efficiency-aware job scheduling and energy conservation.


Introduction
In recent years, memory-based computing has become one of the alternative approaches for many emerging workloads that are constrained by high data-access costs. Efficient data storage and analysis is one of the critical challenges in the big data paradigm [1][2][3][4][5][6][7]. Therefore, processor-centric computing is transforming into memory-centric computing. Although single servers with 12 TB of memory have appeared on the IDC market, in many cases the data to be processed exceeds the memory capacity of the server [8]. In addition, application-level scalability is limited by memory capacity and communication latency. A common solution is data parallelization, which divides the dataset into smaller subsets that fit the memory capacity and can be processed in parallel. More specifically, data flows into and out of the processor in parallel like sparks for rapid analysis [2,4,9,10]. It is also possible to introduce new memory hierarchies, such as 3-D memory stacking, to improve bandwidth, energy efficiency, and scalability [11][12][13][14][15][16][17][18][19]. In large-scale storage systems, reducing data movement and energy consumption as much as possible also depends on these new storage technologies. In addition, refs. [3,20] proposed memory-based data computation, which reduces the energy consumption of the server system by performing computations in the memory module. The authors of [21] argue that there is no universally accepted solution for providing efficient distributed shared memory.
For compute-intensive applications, data processing can be easily accelerated using many-core processors or GPUs. However, for memory-intensive applications, performance is highly correlated with memory capacity and bandwidth. DRAM is an important source of server power consumption, especially when the server is running memory-intensive applications. Although power capping and thermal throttling of processors are well investigated and commonly used in data centers, fine-grained and scalable power-aware adaptation of memory systems is still an open problem. Current energy-efficiency-aware scheduling assumes that DRAM is energy proportional, like processors. However, the non-energy-proportionality of DRAM prevents such power consumption models from accurately representing the energy consumption of the server, and it significantly affects the energy consumption of the whole server system, especially for memory-intensive applications. Therefore, a full understanding of server energy proportionality under memory-intensive workloads can help place workloads better and increase the energy savings of hybrid resource scheduling in data centers [22,23]. For example, the memory per core also changes as the scale of the system increases, which affects the performance of the application and the cost of the overall system.
In response to the continuous increase in energy consumption caused by the continuous expansion of data centers, the industry standards organization SPEC developed the SPECpower_ssj2008 benchmark [24] (referred to as SPECpower in the rest of this article) to evaluate the energy efficiency of servers. SPECpower is widely used to characterize the energy efficiency of a system at different utilization levels. Mainstream server vendors submit their SPECpower test results to SPEC, which publishes them online after they pass review.
However, the SPECpower benchmark does not stress the memory system well, because it is a server-side Java benchmark. In Table 1, we list the memory per core statistics (MPC, the ratio of installed memory capacity to installed processor cores) for 658 servers released before 2020. Among all 658 published SPECpower results, only 13 servers have a memory per core greater than or equal to 8 GB/core, and only 2 servers reach 16 GB/core. Most servers have less than 4 GB/core.
Assume a single-node server equipped with 8 Xeon 8260 CPUs (one of the most common CPUs on the market in 2019Q4) and 8 TB of memory (the maximum memory capacity supported by the processors); its memory per core is 42.6 GB/core. If the server has fewer cores, as in the most usual configurations of 2 or 4 sockets per node, the memory per core will be significantly greater than 42.6 GB/core. Therefore, from this perspective, the SPECpower results cannot be an ideal and reliable source for an energy efficiency study of large memory systems. This motivates us to investigate the energy efficiency of servers with large memory installations.
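The memory per core (MPC) figure quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming 24 physical cores per Xeon 8260 socket and 1 TB = 1024 GB; neither value is stated in the text, and the function name is hypothetical.

```python
def memory_per_core_gb(sockets: int, cores_per_socket: int, memory_gb: float) -> float:
    """Return installed memory (GB) divided by installed physical cores."""
    total_cores = sockets * cores_per_socket
    if total_cores <= 0:
        raise ValueError("the server must have at least one core")
    return memory_gb / total_cores


# Assumptions (not from the paper): 24 cores per socket, 1 TB = 1024 GB.
CORES_PER_SOCKET = 24
mpc = memory_per_core_gb(sockets=8, cores_per_socket=CORES_PER_SOCKET,
                         memory_gb=8 * 1024)
print(f"MPC = {mpc:.2f} GB/core")
```

Under these assumptions the result is about 42.67 GB/core, which matches the 42.6 GB/core in the text up to rounding.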
In this paper, we use the STREAM benchmark to test three rack servers at different workload intensities and study the energy efficiency of large memory servers running memory-intensive applications. Because SPECpower cannot stress the memory system well, we use STREAM to check how performance and power change with different memory stress levels (numbers of concurrent threads), racks, or modules. If the memory system shows energy efficiency (or proportionality patterns) different from that of the CPU, at least under different workload types, we can provide some insights for workload placement in clouds or big data analytics scenarios. Our experiments show that hardware configuration can significantly affect server energy efficiency for memory-intensive applications. The findings we present in this paper provide useful insights and guidance for system designers and data center operators to achieve energy-efficiency-aware job scheduling and energy conservation.
The remainder of this paper is organized as follows. In Section 2, we summarize related work. In Section 3, we first describe server energy efficiency and energy proportionality based on the published SPECpower benchmark results and introduce the energy efficiency metric for servers with large memory installations. In Section 4, we provide experimental results and observations on the energy efficiency of servers with large memory under typical memory-intensive workloads. In Section 5, we characterize the energy efficiency of memory systems, including the economies of scale in memory utilization and a comparison of energy efficiency between the SMP and NUMA architectures, and we derive insights on the energy efficiency of memory-intensive applications. We conclude the paper and remark on future work in Section 6.

Related Work
Nowadays, owing to rapid advancement in hardware technology, the ever-increasing main memory capacity has promoted the development of in-memory big data management and processing [5]. Although in-memory processing moves data into memory and eliminates disk I/O bottlenecks to support interactive data analysis, in-memory systems are more sensitive than disk I/O-based systems to the utilization, time/space efficiency, parallelism, and concurrency control of modern CPU and memory hierarchies [25][26][27][28][29].
In many cases, memory bandwidth restricts the performance of computer systems. Moreover, because of the pin and power constraints of CPU packages, increasing the bandwidth is a challenge. To increase performance under these restrictions, researchers have proposed near-DRAM computing (NDC), near-DRAM acceleration (NDA), processing-in-memory (PIM), near data processing (NDP), and memory-driven computing architectures [30][31][32][33][34]. In [30], the authors proposed Chameleon, an NDA architecture that can be realized without depending on 3D/2.5D-stacking technology and integrated seamlessly with large memory systems for servers. Experiments show that a Chameleon-based system can provide 2.13× higher geo-mean performance while consuming 34% lower geo-mean data transfer energy than a system that integrates the same accelerator logic within the processor.
Recently, new memory hierarchies such as 3-D memory stacking have been proposed to improve energy efficiency, bandwidth, and scalability. For example, the hybrid memory cube (HMC) [35] promises to enhance bandwidth and density and to decrease power consumption for next-generation main memory systems. Moreover, to fill the gap between processors and memories, 3-D integration offers a second chance to revisit near-memory computation. The recently proposed Active Memory Cube (AMC) [36] contains general-purpose host processors and specially designed in-memory processors (processing lanes), which would be integrated in a logic layer within 3D DRAM memory. DRAM contains multiple resources called banks that can be accessed in parallel and maintain state information independently. In Commercial Off-The-Shelf (COTS) multicore platforms, banks are shared among all cores, even if the programs running on the cores do not share memory space. In this situation, memory performance is hard to predict because of contention in the shared banks [37]. To test different forms of memory, researchers have implemented various benchmarks and systems [38][39][40][41].
In multi-core platforms, memory is shared among all processor cores. However, the compute gains offered by multiple cores are often counteracted by performance degradation owing to shared resources, such as main memory [42][43][44][45][46]. To use multi-core platforms efficiently, the interference incurred when accessing shared resources must be tightly bounded. For example, due to interference in the shared DRAM main memory, a task running on one core can be delayed by other tasks running simultaneously on other cores. In some cases, such memory interference delay can be large and highly variable. Kim [47] proposed a tight upper bound on the worst-case memory interference in a COTS-based multi-core system and explicitly modeled the major resources in the DRAM system, including banks, buses, and the memory controller. Dirigent [48] was proposed to balance the performance of latency-critical jobs that finish sooner than required against higher system throughput. Min et al. have also tackled fine-time-granularity QoS problems for GPUs in heterogeneous platforms [49]. However, it is not common to use progress heuristics for the GPU, and the proposed mechanism is restricted to managing main memory bandwidth contention between the CPU and GPU.
In big data analytics, Computation-in-Memory (CIM)-based architectures have been proposed to address the problems of limited bandwidth, energy inefficiency, and limited scalability by enabling in-memory computation using non-volatile memristor technology [50]. The authors found that in large-scale data analytics frameworks, the CPU is the main performance bottleneck of these systems. For example, the Tachyon [51] file system outperforms in-memory HDFS by 110× for writes. In [52], the authors propose FusionFS, a distributed file system and distributed storage layer local to the compute nodes, which saves an extreme amount of data movement between compute and storage resources and is in charge of most of the I/O operations. FusionFS is reported to perform better than popular file systems such as GPFS, PVFS, and HDFS. epiC [53] is a big data processing framework designed to tackle the data variety challenge of big data. epiC introduces a general actor-like concurrent programming model for specifying parallel computations, which is independent of the data-processing models. In [54], the authors propose DigitalPIM, a digital-based processing in-memory platform that can accelerate fundamental big data algorithms in real time with orders of magnitude more energy-efficient operation. In [55], the authors proposed power management schemes that raise the speedup of prior RRAM-based PIM from 69× to 273×, pushing the power usage from about 1 W to 10 W.
However, in [56], the authors found that naive adoption of hardware solutions does not guarantee superior performance over software solutions, and they point out problems that limit the performance of such hardware solutions, even though hardware solutions offer promising alternatives for realizing the full potential of in-memory systems. In [57], the authors found that increasing the number of DRAM channels decreases DRAM power and at the same time improves energy efficiency across all applications.
In heterogeneous platforms, the CPU and the GPU are integrated into a single chip for higher throughput and energy efficiency. Memory bandwidth is the most critical shared resource in such a single-chip heterogeneous processor (SCHP) and requires careful management to maximize throughput. Based on an analysis of memory access characteristics, Wang et al. [58] proposed various optimization techniques and improved the overall throughput by up to 8% compared to FR-FCFS.
On modern multi-core platforms, memory bandwidth is highly variable for more memory-intensive applications. Jiang et al. [59] found that the memory configuration on a virtualized platform also influences the server's power and performance. In [60], the authors proposed MemGuard, an efficient memory bandwidth reservation system that provides bandwidth reservation to guarantee bandwidth for temporal isolation and efficient reclaiming to make maximal use of the reserved bandwidth. It improves performance by sharing the best-effort bandwidth after satisfying each core's reserved bandwidth. In [61], the authors proposed moving computation closer to main memory, which offers an opportunity to reduce the overheads associated with data movement, and they explored the potential of using 3D die stacking to move memory-intensive computations closer to memory. In [62], the authors propose OffDIMM, which maps a memory block in the address space of the OS to one or more subarray groups of DRAMs and sets a deep power-down state for the subarray group when the block is offlined. OffDIMM decreases background power by 24 percent on average without notable performance overheads.
All in all, processor frequency scaling and power optimization are well investigated, but research on memory-related power and performance optimization is still insufficient. Furthermore, when in-memory computing becomes the mainstream paradigm for big data analytics, the power consumption of large memory dominates. This drives us to investigate the power characteristics of servers running memory-intensive applications.

Notations of Server Energy Efficiency Evaluation
To drive server energy efficiency improvements, SPEC established SPECpower, which is the authoritative industry benchmark for measuring the power and performance of computer systems. SPECpower's workload is designed to evaluate the energy efficiency and performance of small and medium-sized servers running server-side Java applications at different utilization levels. That is why SPECpower results are not an ideal and reliable source for an energy efficiency study of large memory systems.
However, the energy efficiency measurement and evaluation methodology of SPECpower is still worthy of reference. Although SPECpower does not put pressure on storage components, it tests the CPU, memory, cache, JVM, and other operating system components. Its detailed workload characteristics can be found in [63]. Specifically, SPECpower reports the server power consumption at different utilization levels within a set time period. SPEC has formulated very strict evaluation rules for SPECpower and requires the information of the tested system to be fully disclosed in the report. That is why 40 results published by SPEC are marked as non-compliant and not accepted by SPEC.
We give a sample result of a server in Table 2, taken from the SPECpower_ssj2008 results released in 2016; the server's memory per core is 16 GB/core. To make this paper easier to follow, we list some notations and terms of SPECpower benchmark results:
(1) Utilization. The target load column of the SPECpower results in Table 2, which assumes that the benchmark exercises all hardware components concertedly, contains 10 utilization levels, ranging from 10% to 100%.
(3) Energy efficiency (EE). It is defined as the performance to power ratio, with the unit of ssj_ops per watt:
$$\mathrm{EE} = \frac{\text{ssj\_ops}}{\text{power}} \tag{1}$$
In Table 2, the energy efficiency values are in the last column, named "performance to power ratio", and we abbreviate energy efficiency as EE. However, for memory systems, we use bandwidth per watt (BpW) to measure memory energy efficiency (MEE):
$$\mathrm{MEE} = \mathrm{BpW} = \frac{\text{bandwidth}}{\text{power}} \tag{2}$$
(4) Server overall energy efficiency. The server's overall performance to power ratio, that is, the ratio of the sum of ssj_ops over the 10 utilization levels (from 10% to 100%) to the sum of the power at these 10 levels plus the active idle power (∑ssj_ops/∑power). The server overall energy efficiency is also used as its SPECpower score. For example, in Table 2, the overall energy efficiency of the server (total score) is 5316.
(5) Peak energy efficiency. It is defined as the highest energy efficiency of a server among all utilization levels. For example, in Table 2, the server peak energy efficiency is 6619 (at 70% utilization).
(6) Energy proportionality (EP). In this paper, we use the energy proportionality (EP) metric proposed in [64]. Taking the server in Table 2 as an example, we draw its normalized utilization-power curve in Figure 1. The solid line in Figure 1 is the EP curve of the server in Table 2, the dotted line is the ideal energy proportional server, and the dashed line is our tested and untuned server with 16 GB of memory per core running the SPECpower benchmark. Based on this, we can compute the EP of a real server by the following formula [64]:
$$\mathrm{EP} = 1 - \frac{\mathrm{Area}_{\mathrm{real}} - \mathrm{Area}_{\mathrm{ideal}}}{\mathrm{Area}_{\mathrm{ideal}}} \tag{3}$$
where $\mathrm{Area}_{\mathrm{real}}$ and $\mathrm{Area}_{\mathrm{ideal}}$ are the areas under the normalized utilization-power curves of the real server and the ideal energy proportional server, respectively.
Figure 1. Utilization-power curves of the server in Table 2 and an ideal energy proportional server (power normalized to power at 100% utilization).
From Equation (3), we can see that EP is greater than or equal to zero but less than 2.0. For the ideal energy proportional server, its EP value is 1.0, which is of great reference significance for the study of energy efficiency characteristics of servers.
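The metrics in this section can be computed directly from a table of utilization levels, throughput, and power. The sketch below is illustrative only: it assumes the area-based form EP = 1 − (Area_real − Area_ideal)/Area_ideal with power normalized to the power at 100% utilization, approximates the area with the trapezoidal rule, and uses made-up power curves rather than the values of Table 2. All function names are hypothetical.

```python
from typing import Sequence


def energy_efficiency(ssj_ops: float, power_watts: float) -> float:
    """Performance-to-power ratio at one utilization level (ssj_ops per watt)."""
    return ssj_ops / power_watts


def overall_efficiency(ssj_ops: Sequence[float], power_watts: Sequence[float],
                       idle_power_watts: float) -> float:
    """Sum of ssj_ops over all load levels divided by the sum of power
    at those levels plus the active idle power."""
    return sum(ssj_ops) / (sum(power_watts) + idle_power_watts)


def energy_proportionality(utilization: Sequence[float],
                           power_watts: Sequence[float]) -> float:
    """EP from a utilization-power curve.

    `utilization` must be ascending from 0.0 (active idle) to 1.0, and
    `power_watts[-1]` is the power at 100% utilization.
    """
    peak = power_watts[-1]
    normalized = [p / peak for p in power_watts]
    area_real = 0.0
    for i in range(1, len(utilization)):
        width = utilization[i] - utilization[i - 1]
        area_real += width * (normalized[i] + normalized[i - 1]) / 2.0
    area_ideal = 0.5  # area under the ideal line from (0, 0) to (1, 1)
    return 1.0 - (area_real - area_ideal) / area_ideal


# Hypothetical curves, not measured data.
levels = [i / 10 for i in range(11)]
ideal_power = [200.0 * u for u in levels]   # perfectly proportional server
flat_power = [200.0 for _ in levels]        # power independent of load
ep_ideal = energy_proportionality(levels, ideal_power)
ep_flat = energy_proportionality(levels, flat_power)
print(ep_ideal, ep_flat)
```

Under these assumptions, the perfectly proportional curve yields an EP of 1.0 and the constant-power curve yields an EP of 0, the two reference points of the range discussed above.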

Experiment Setup
Similarly, we can also derive a server's energy proportionality curve using a memory-intensive benchmark. We used STREAM [65], NAMD [66], and CloudSuite [67] as the memory benchmarks. We modified the STREAM code to change the array size so as to vary the workload from idle to 100% utilization. We ran the benchmarks on three different 2U rack servers, all running the same x64 version of CentOS 7 with Linux kernel 3.10. All the power data were measured by a WattsUP.Net power meter. The base configuration of these servers is listed in Table 3.

Results of STREAM Workload
In order to stress the memory system, we ran different numbers of concurrent STREAM threads with array sizes varying from 4 GB to 16 GB. Due to space limitations, we only provide the results for 4 GB. We present the power consumption of the tested servers in Figures 2 and 3. When the CPU or server temperature is too high, the processor's power capping and thermal throttling technology reduces the CPU frequency to protect the CPU or server, so the power consumption with 48 threads is lower than with 36 threads on server #2. Overall, however, our experiments show that as the number of concurrent threads, and therefore memory utilization, increases, the power consumption of the server also increases. We present the perceived bandwidth of a single thread on the tested servers in Figures 4 and 5. The memory energy efficiency is shown in Figure 6. We make the following observations:
(1) When the number of concurrent threads is 36, the power consumption and CPU utilization are the highest at the CPU frequencies of 1.2 GHz, 1.8 GHz, and 2.4 GHz and with the ondemand governor. Generally, power consumption grows with the CPU frequency.
(2) As the number of threads increases, the perceived bandwidth of the triad computation in a single STREAM thread decreases at first and reaches its lowest value at 36 threads. It then rebounds slightly at 48 threads because of the contention and starvation of execution threads.
(3) The perceived bandwidth increases as the CPU frequency increases, while the bandwidth growth rate decreases. Moreover, the bandwidth at different CPU frequencies is almost the same at 24 threads, because the server has 12 physical cores and 24 execution threads in total.
(4) Both the memory energy efficiency and its rate of change decrease as the number of concurrent threads increases. It can also be inferred that frequency scaling cannot improve memory energy efficiency much under highly contended conditions.
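To make the bandwidth and bandwidth-per-watt quantities concrete, the sketch below converts one triad measurement into MB/s and BpW. It assumes the usual STREAM accounting for the triad kernel a[i] = b[i] + q * c[i] (two reads and one write of 8-byte doubles per element, MB counted as 10^6 bytes); the element count, run time, and power reading are hypothetical placeholders, not values from our experiments.

```python
BYTES_PER_DOUBLE = 8


def triad_bytes(n_elements: int) -> int:
    """Bytes counted for one triad pass: read b and c, write a."""
    return 3 * BYTES_PER_DOUBLE * n_elements


def bandwidth_mb_s(n_elements: int, seconds: float) -> float:
    """Perceived triad bandwidth in MB/s (1 MB = 10**6 bytes)."""
    return triad_bytes(n_elements) / seconds / 1e6


def bandwidth_per_watt(bandwidth: float, power_watts: float) -> float:
    """Memory energy efficiency as bandwidth per watt (BpW)."""
    return bandwidth / power_watts


# Hypothetical measurement: 10 million elements in 20 ms at 150 W.
bw = bandwidth_mb_s(n_elements=10_000_000, seconds=0.02)
bpw = bandwidth_per_watt(bw, power_watts=150.0)
print(f"{bw:.0f} MB/s, {bpw:.1f} MB/s per watt")
```

The helper only post-processes measurements; it does not itself generate memory traffic, so the timing and power values must come from an actual STREAM run and a power meter.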

Results of NAMD Workload
We used the NAMD (Nanoscale Molecular Dynamics) simulator to simulate large systems (millions of atoms). We ran NAMD simulations of two virus structures, one containing 8 million atoms and the other containing 28 million atoms, namely the stmv.8M.namd and stmv.28M.namd configurations. The results are shown in Figures 7 and 8. For all NAMD experiments, the system power is significantly correlated with the memory and CPU utilization on machines with different hardware configurations and CPU generations. However, the correlation coefficient on server #2 is lower than that on server #1. One possible reason is that server #2 has a newer CPU than server #1 and the CPU difference matters when running NAMD. We list the Pearson correlation coefficients between power and memory and CPU utilization on servers #1 and #2 in Table 4. We can see that for the NAMD benchmark, both memory and CPU utilization are good indicators of system power consumption on both servers.
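For reference, the Pearson correlation coefficient used in Table 4 can be computed from paired samples of power and utilization as sketched below. The sample values are invented for illustration and are not our measurements; the function name is hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sample series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length with at least 2 samples")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0.0 or dev_y == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (dev_x * dev_y)


# Hypothetical samples taken at the same instants.
power_watts = [200.0, 220.0, 240.0, 260.0]
memory_util = [10.0, 20.0, 30.0, 40.0]   # rises linearly with power
cpu_util = [40.0, 30.0, 20.0, 10.0]      # falls linearly with power
r_mem = pearson(power_watts, memory_util)
r_cpu = pearson(power_watts, cpu_util)
print(r_mem, r_cpu)
```

A coefficient close to +1 or −1 means the utilization series tracks power closely, whereas a value near 0 means it carries little information about power.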

Results of CloudSuite Workload
In order to confirm whether memory utilization is a good indicator of system power under other memory-intensive applications, we ran the CloudSuite In-Memory Analytics benchmark on server #2. The results show that neither memory nor CPU utilization is a good indicator of system power consumption. For example, in Table 5, the correlation coefficients of power and CPU utilization are 0.053 and −0.09 when we run the CloudSuite In-Memory Analytics benchmark alone and the In-Memory Analytics and Data Serving benchmarks together, respectively. We plot the real-time power and system utilization data in Figures 9 and 10. We also conducted experiments in which the memory utilization ranged from 30% to 98% and, again, neither memory nor CPU utilization was a good indicator of system power consumption. This implies that we should not implement power-aware scheduling according to only a single parameter such as memory or CPU utilization, even for large memory nodes running memory-intensive applications.

Economies of Scale in Memory Utilization
In order to investigate the power consumption of each server at varying memory utilization, we conducted experiments with different numbers of concurrently running STREAM threads. We then computed the power consumption per percentage of memory utilization, shown in Figures 11 and 12. We obtain similar results for the other servers and array sizes. Figures 11 and 12 suggest that when the number of threads increases, the power per percentage of memory utilization decreases. We also present the power per percent utilization of server #2 running the SPECpower benchmark in Figure 13 and compare the power per percent utilization of SPECpower and STREAM on server #2 in Figure 14. From Figures 13 and 14, we observe that the power consumption per percent utilization of both the SPECpower and STREAM benchmarks decreases when system utilization increases. However, SPECpower has lower power per percent utilization than STREAM at all utilization levels.
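The power per percent utilization metric is simply the measured power divided by the utilization percentage. The sketch below illustrates why this ratio tends to fall as utilization rises whenever a server has a non-zero idle power floor. The linear power model and its 100 W idle and 200 W peak values are assumptions made for illustration only, not a model fitted to our servers.

```python
def power_per_percent(power_watts: float, utilization_percent: float) -> float:
    """Power consumed per percentage point of utilization."""
    if utilization_percent <= 0:
        raise ValueError("utilization must be positive")
    return power_watts / utilization_percent


def linear_power(utilization_percent: float, idle_watts: float = 100.0,
                 peak_watts: float = 200.0) -> float:
    """Hypothetical linear power model between idle and peak power."""
    return idle_watts + (peak_watts - idle_watts) * utilization_percent / 100.0


levels = [10, 20, 50, 100]
ratios = [power_per_percent(linear_power(u), u) for u in levels]
print(ratios)
```

With this model the idle power is amortized over more utilization as the load grows, so the ratio decreases monotonically, which is the economies-of-scale effect discussed in this section.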

SMP and NUMA Energy Efficiency Comparison
Server memory access architectures are divided into SMP and NUMA. In the SMP architecture, all CPUs share all resources (such as the bus, memory, and I/O system). Since data exchange between the CPU and the memory goes through the bus, the bus bandwidth can easily become the bottleneck of SMP data transmission. The most intuitive way to overcome the limitation of the SMP bus bandwidth is to increase the number of buses, which led to the non-uniform memory access (NUMA) architecture. NUMA is a memory architecture designed for multiprocessor systems. In this architecture, the memory access time depends on the location of the memory relative to the processor, and processors access memory in the local node faster than memory in other nodes. This is because there are multiple NUMA nodes, each with its own CPUs and memory. In the NUMA architecture, the CPUs of each node share a memory controller, so the memory access time is the same for all CPUs in the same NUMA node. However, access to the memory of another NUMA node needs to pass through the router, and data consistency needs to be guaranteed by the cache coherence protocol, so access to the memory of a remote node is slower. It follows that application performance under the NUMA architecture depends on the memory allocation strategy: access to remote memory increases the access latency and thereby degrades application performance. The experimental platform in this section is a two-way server, so under the NUMA architecture memory is divided only into a local node and a remote node. The experiments in this section are based on server #3 in Table 3.
In order to study how the memory energy efficiency of the server changes under NUMA, we rewrote the STREAM benchmark so that it can set the proportion of memory allocated to the near end and the far end, for example, 10% of the memory allocated to near-end memory and 90% to far-end memory. With different memory allocation methods, we can experiment on and analyze the memory energy efficiency of the server under the NUMA architecture. We use STREAM array sizes of 16 GB, 32 GB, 64 GB, and 128 GB, and divide the memory between the near end and the far end according to near-end proportions such as 0%, 20%, 50%, 80%, and 100%. Zero percent means that all of the memory is allocated to the far end, and 100% means that all of the memory is allocated to the near end. In these experiments, the CPU frequency governor is set to the ondemand mode. For the memory system, we use the bandwidth per watt (BpW) as shown in Equation (4):
$$\mathrm{BpW} = \frac{\text{bandwidth}}{\text{power}} \tag{4}$$
First, the memory energy efficiency data for different STREAM array sizes are obtained under the SMP architecture of the same experimental platform. As shown in Table 6, different array sizes represent different memory utilization rates, but the power consumption and energy efficiency of the memory system do not change much. Next, we enable NUMA in the BIOS of the experimental platform and turn on the NUMA switch in the kernel to activate the NUMA architecture. Figures 15 and 16 show the power consumption and energy efficiency of the memory system with different near-end and far-end memory allocation strategies when the STREAM array size is 16 GB and 128 GB, respectively. As the allocation ratio between the near end and the far end changes, the power consumption of the memory system in the two NUMA nodes also changes.
Furthermore, as the percentage of memory allocated to the near end increases, memory operations become faster and the average bandwidth of the STREAM load rises, so the energy efficiency of the memory system gradually increases. Although the power consumption of the near-end memory and of the far-end memory changes with the allocation ratio, the power consumption of the entire memory system (near-end memory + far-end memory) does not change significantly, and the total memory power consumption is always 30 W. Further, under the same memory allocation method, the memory energy efficiency with an array size of 128 GB is slightly higher than with 16 GB. This shows that memory power consumption is not strongly related to memory utilization: higher memory utilization does not necessarily mean higher memory power consumption. Compared with the experimental results of the SMP architecture in Table 6, under the same array size of 128 GB, when all of the memory is allocated to the near end, the average energy efficiency of the NUMA architecture is 16% higher than that of the SMP architecture. However, if all of the memory is allocated to the far end, the energy efficiency is 13% lower than that of the SMP architecture. Therefore, using the NUMA architecture helps improve the energy efficiency of the memory system, but a poor memory allocation method significantly reduces the memory energy efficiency of the system.
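The allocation ratios and the comparison against SMP can be expressed as two small helpers. This is a sketch under stated assumptions: `split_elements` only computes how many array elements would go to each node and does not perform any NUMA-aware allocation, and the BpW values below are hypothetical numbers chosen to reproduce the +16% and −13% figures, not the measured values of Table 6.

```python
from typing import Tuple


def split_elements(total_elements: int, near_percent: int) -> Tuple[int, int]:
    """Split an array into (near-end, far-end) element counts."""
    if not 0 <= near_percent <= 100:
        raise ValueError("near_percent must be between 0 and 100")
    near = total_elements * near_percent // 100
    return near, total_elements - near


def relative_change_percent(value: float, baseline: float) -> float:
    """Change of `value` relative to `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Near-end proportions listed in the text, applied to a hypothetical array.
NEAR_PERCENTAGES = (0, 20, 50, 80, 100)
plan = {p: split_elements(1_000_000, p) for p in NEAR_PERCENTAGES}

# Hypothetical BpW values (arbitrary units), chosen for illustration.
bpw_smp, bpw_numa_near, bpw_numa_far = 100.0, 116.0, 87.0
gain_near = relative_change_percent(bpw_numa_near, bpw_smp)
loss_far = relative_change_percent(bpw_numa_far, bpw_smp)
print(plan[20], gain_near, loss_far)
```

Expressing both cases relative to the same SMP baseline makes the asymmetry explicit: the same hardware can end up above or below the SMP result depending only on where the memory is placed.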
In order to study the relationship between memory power consumption, energy efficiency, and the number of memory operations, we recompiled STREAM with OpenMP and set the STREAM array size to 128 GB under the NUMA architecture. All of the memory is allocated to the near-end node, and the number of parallel threads is set to 2, 4, 8, 16, 32, 40, 60, and 80, respectively. Note that since a single CPU of the experimental platform has 40 logical cores, the threads run on the logical cores of CPU0 if the number of parallel threads is less than 40. CPU0 and its near-end memory can be regarded as a near-end node. The experimental results of the NUMA architecture are shown in Figure 17. The memory energy efficiency improves significantly as the number of parallel threads increases, from 200 with a single thread to 1700 with multiple threads, and the power consumption of the memory system also increases significantly. In previous studies of server energy efficiency, the memory system is usually modeled with a fixed energy efficiency and a constant power consumption. However, compared with the single-threaded experiment, we can see that the memory power consumption is not static: as the number of parallel threads increases, the memory power consumption also increases. Nevertheless, a higher number of threads does not always mean a higher energy efficiency of the memory system. When the number of parallel threads is greater than 16, the memory energy efficiency decreases slightly and tends to be stable. If we increase the number of threads further, the energy efficiency of the memory does not increase and instead degrades. Figure 18 shows the experimental results of multi-threaded parallel STREAM with an array size of 128 GB under the SMP architecture.
As the number of threads increases, the energy efficiency of the memory system also increases, and it decreases slightly and tends to be stable when the number of parallel threads is greater than 16. When the number of parallel threads is less than 16, the memory energy efficiency of the SMP architecture is 40% lower than that of the NUMA architecture. In contrast, when the number of parallel threads is greater than 16, the memory energy efficiency of the SMP architecture is 5% higher than that of the NUMA architecture. A possible reason is that as the number of parallel threads increases, some threads are assigned to the remote CPU in the NUMA system, which results in remote memory access and greatly reduces memory access speed and memory energy efficiency. Therefore, the NUMA architecture is not suitable if the number of parallel threads of the application is too high. All in all, not every application benefits from the NUMA architecture in terms of memory energy efficiency, and the decision whether to use it should depend on the number of threads, the distribution of threads, and the memory allocation strategy. Data center operators should consider application types, task planning, and task placement to determine how to use the NUMA architecture reasonably to improve memory efficiency.
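The thread sweep described above could be scripted roughly as follows. This is a sketch under stated assumptions, not the procedure used in the experiments: it assumes the Linux numactl utility with its `--membind` and `--cpunodebind` options and the standard OpenMP variable `OMP_NUM_THREADS`. The binary path `./stream` and the helper name `stream_command` are hypothetical. The sketch also assumes that a run whose threads exactly fill CPU0 stays on CPU0.

```python
# Thread counts used in the sweep described in the text.
THREAD_COUNTS = [2, 4, 8, 16, 32, 40, 60, 80]
LOGICAL_CORES_PER_CPU = 40  # from the text: one CPU exposes 40 logical cores


def stream_command(threads, stream_binary="./stream", near_node=0):
    """Build (argv, env_overrides) for one STREAM run; nothing is executed here."""
    if threads < 1:
        raise ValueError("threads must be positive")
    # Memory is always bound to the near-end node, as in the experiment.
    argv = ["numactl", f"--membind={near_node}"]
    # Assumption: pin threads to the near-end CPU only while they fit on it;
    # larger runs are left unpinned so that they can spill to the remote CPU.
    if threads <= LOGICAL_CORES_PER_CPU:
        argv.append(f"--cpunodebind={near_node}")
    argv.append(stream_binary)
    return argv, {"OMP_NUM_THREADS": str(threads)}


for n in THREAD_COUNTS:
    argv, env = stream_command(n)
    print(env["OMP_NUM_THREADS"], " ".join(argv))
```

Each pair can be passed to `subprocess.run(argv, env={**os.environ, **env})` on a machine where numactl and a compiled STREAM binary are available. Runs above 40 threads necessarily place some threads on the remote CPU while the data stays on node 0, which is the remote-access case discussed above.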

Insights on Energy Efficiency of Memory-Intensive Applications
From the above observations, we derive some insights for memory-intensive applications in data centers in terms of power and energy consumption.
Insight #1: The power consumption per percentage point of server utilization decreases as the array size increases, because the number of concurrent STREAM threads is reduced, and vice versa. This indicates that multi-threaded applications may increase the power consumption of the server.
Insight #2: Neither memory nor CPU utilization is suitable for evaluating system power consumption when it comes to memory-intensive applications. Thus, for large-memory nodes running memory-intensive applications, indicators beyond memory and CPU utilization should be considered when implementing power-aware scheduling.
Insight #3: The reasonable use of the NUMA architecture will improve the memory energy efficiency significantly. It is necessary to ensure that most of the memory is allocated to the near-end nodes when using the NUMA architecture. Otherwise, the memory energy efficiency will be lower than that of the SMP architecture. The experimental results show that a reasonable use of the NUMA architecture increases the memory energy efficiency by 16% compared with the SMP architecture, whereas an unreasonable use decreases it by 13%.
Insight #4: Attention should be paid to the distribution of tasks across CPUs, and data center operators should allocate the threads of a task to a single CPU if possible. In addition, they should prefer near-end memory when allocating memory.
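Insights #3 and #4 suggest a simple placement rule, sketched below. This is a hypothetical heuristic for illustration only, not a scheduler proposed or evaluated in this paper: each job is kept on a single CPU socket whenever its thread count fits, and it spans sockets only as a last resort. The names `Socket` and `place_jobs` are invented for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Socket:
    node_id: int      # NUMA node id of this CPU socket
    free_cores: int   # logical cores still unassigned


def place_jobs(jobs: dict[str, int], sockets: list[Socket]) -> dict[str, list[int]]:
    """Map each job to the NUMA node(s) its threads run on.

    A one-element list means the job fits on a single socket, so its memory
    can be bound to that same node (near-end allocation).
    """
    if sum(jobs.values()) > sum(s.free_cores for s in sockets):
        raise ValueError("not enough free logical cores for all jobs")
    placement: dict[str, list[int]] = {}
    # Largest jobs first, so that small jobs do not fragment a socket early.
    for name, threads in sorted(jobs.items(), key=lambda kv: -kv[1]):
        home = next((s for s in sockets if s.free_cores >= threads), None)
        if home is not None:
            home.free_cores -= threads
            placement[name] = [home.node_id]
            continue
        # Last resort: spread the job over several sockets (remote access likely).
        nodes, remaining = [], threads
        for s in sockets:
            take = min(s.free_cores, remaining)
            if take > 0:
                s.free_cores -= take
                remaining -= take
                nodes.append(s.node_id)
            if remaining == 0:
                break
        placement[name] = nodes
    return placement


plan = place_jobs({"a": 16, "b": 32, "c": 8}, [Socket(0, 40), Socket(1, 40)])
print(plan)
```

In this example all three jobs end up on a single node each. A job with more threads than one socket offers, such as 60 threads on 40-core sockets, is reported as spanning both nodes, which flags it as a candidate for the efficiency loss described in Insight #3.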

Conclusions
Evaluating large memory systems running memory-intensive applications reveals the energy efficiency characteristics of the memory system, which can help data center managers and system operators in many ways, including system capacity planning, power shifting, job placement, and scheduling. In this paper, we conducted extensive experiments and measurements to investigate the power and energy characteristics of three 2U servers running various memory-intensive benchmarks. Experimental results show that server power consumption and performance change with hardware configuration, workload intensity, and the number of concurrently running threads. However, fully utilized memory systems are not the most energy efficient. In addition, even when the memory system is not fully utilized, application performance and server power consumption can be strongly affected by the installed memory capacity (the memory capacity per processor core). These implications can inspire the design of reconfigurable systems and real-time power-aware adaptation. We also verified the effect of different memory allocation and thread count strategies on the memory efficiency of the NUMA and SMP architectures. The experimental results show that proper task placement and memory allocation are required to make full use of NUMA architectures; otherwise, the system energy efficiency is reduced.
The findings presented in this paper provide useful insights and guidance to system designers, as well as data center operators, for energy-efficiency-aware job scheduling and energy savings. To ensure that the NUMA architecture improves the energy efficiency of the memory system, data center operators should pay attention to application types, task planning, and placement. As future work, we plan to investigate further the energy efficiency of large memory systems running more diverse memory-intensive applications, such as in-memory databases, Hadoop, and Spark jobs, so that specific configurations can be selected to improve the server's energy efficiency based on the characteristics of the upper-layer applications.