This section presents and discusses the results of our study. The evaluation consists of an improved result using oversampling (Section 5.1); an analysis of the energy and performance characterization (Section 5.2), followed by an input-dependent analysis (Section 5.3); and an error analysis of our prediction models for speedup and energy efficiency (Section 5.4). It concludes with the evaluation of the predicted set of Pareto solutions (Section 5.5).
5.1. Experimental Evaluation of Oversampling
This section evaluates the accuracy of our speedup and normalized energy predictions when using the SMOTE oversampling technique. The modeling approach used for this evaluation is the one described in Section 3.4, based on linear and RBF SVR and trained on micro-benchmarks. For each application, we trained the speedup and normalized energy models with all the oversampled frequency configurations, predicted the values, and then calculated the error after actually running the corresponding configuration.
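As a rough illustration of this evaluation loop, the sketch below combines SMOTE oversampling with an RBF SVR and reports the error statistics plotted in Figures 7 and 8. The feature layout, the way the regression target is carried through SMOTE, and the SVR hyperparameters are assumptions for illustration only, not the exact pipeline of Section 3.4.

```python
# Minimal sketch (not the exact pipeline of Section 3.4): oversample the
# minority mem-L class with SMOTE, train an RBF SVR on micro-benchmark data,
# and report the per-configuration relative error statistics of Figures 7/8.
# Feature layout, hyperparameters and the target-as-column trick are
# illustrative assumptions.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVR

def train_and_evaluate(X_train, y_train, mem_class, X_test, y_test):
    """X_*: code features + frequency settings; y_*: measured speedup (or
    normalized energy); mem_class: memory-frequency label per training row,
    used only to drive the oversampling of the minority class (mem-L)."""
    # Append the regression target as an extra column so that SMOTE's
    # interpolation also synthesizes target values for the new samples.
    Xy = np.column_stack([X_train, y_train])
    Xy_res, _ = SMOTE().fit_resample(Xy, mem_class)
    X_res, y_res = Xy_res[:, :-1], Xy_res[:, -1]

    model = SVR(kernel="rbf", C=10.0, epsilon=0.01)  # linear SVR is analogous
    model.fit(X_res, y_res)

    rel_err = np.abs(model.predict(X_test) - y_test) / np.abs(y_test) * 100.0
    # min, 25th, median, 75th percentile and max of the error (%)
    return np.percentile(rel_err, [0, 25, 50, 75, 100])
```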
Figure 7a and Figure 8a show the minimum, median and maximum error (%), as well as the 25th and 75th percentiles of the error distribution, for the speedup and normalized energy, respectively, for only one memory frequency (mem-L). We do not show the errors for the other three memory frequencies because the SMOTE technique mostly affects the minority class (the mem-L configurations). Therefore, we only show the prediction accuracy improvement on mem-L.
Figure 7b and Figure 8b show the prediction error using SMOTE. Compared to the results under the lowest memory frequency in Fan [3], the accuracy of the speedup prediction and the normalized energy prediction improved only modestly, by 0.1% and 1.0%, respectively.
5.2. Application Characterization Analysis
In Figure 9, we analyzed the behavior of twelve test benchmarks in terms of both speedup and (normalized) energy consumption. For each code, we show the speedup (x axis) and normalized energy (y axis) under different frequency configurations; the baseline for both is the performance and energy of the default frequency configuration. Generally, the applications show two main patterns (see the top codes vs. the other codes in Figure 9), i.e., memory- vs. compute-dominated kernels, which correspond to different sensitivities to core and memory frequency changes.
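For clarity, the quantities plotted in Figure 9 can be derived from the raw measurements roughly as follows (a minimal sketch; the dictionary layout and variable names are illustrative assumptions).

```python
# Sketch of how the quantities in Figure 9 are derived from raw measurements:
# both speedup and normalized energy are relative to the default frequency
# configuration (dictionary layout and names are illustrative).
def characterize(runtime_s, energy_j, default_cfg):
    """runtime_s, energy_j: dicts mapping (core_mhz, mem_mhz) -> measurement."""
    t_ref, e_ref = runtime_s[default_cfg], energy_j[default_cfg]
    return {
        cfg: (t_ref / runtime_s[cfg],   # speedup: >1 means faster than default
              energy_j[cfg] / e_ref)    # normalized energy: <1 means more efficient
        for cfg in runtime_s
    }
```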
Speedup In terms of speedup, k-NN shows a high variance with respect to the core frequency: for mem-H and mem-h, the speedup goes from 0.62 up to 1.12, which means that performance can be nearly doubled by changing only the core frequency; for mem-l, the difference is even larger. The limited data for mem-L suggest a similar behavior. At the other extreme, blackscholes and MT show very little speedup difference when increasing the core frequency: all configurations cluster at the same speedup for mem-L and mem-l, while for mem-h and mem-H the difference is minimal (from 0.89 to 1). The other applications fall between these two extremes.
Normalized energy As previously mentioned, the normalized energy often exhibits a parabolic shape with a minimum. With respect to the core frequency, it varies within smaller intervals: for the highest memory frequencies, it goes up to 1.4 for the first four codes, and up to 1.2 for the others. Again, the lowest memory configurations present very different behaviors: on k-NN, the energy per task may be as high as double the baseline or as low as 0.8; on blackscholes, on the other hand, mem-L shows the same normalized energy for all core frequencies.
High vs. low memory frequencies There is a big difference between the high (mem-H and mem-h) and low (mem-l and mem-L) frequency configurations. Mem-H and mem-h behave in a very similar way, with regard to both speedup and normalized energy. Both mem-l and mem-L are much harder to predict. Mem-l behaves like the highest memory frequencies, at a lower normalized energy, for the first four codes; on the other four codes, however, its configurations collapse onto a line. Mem-L is even more erratic: in some codes, all points collapse into a very small area, practically a point. This is a problem for modeling: the lowest memory configurations are much harder to model because their behavior is very erratic. In addition, because the supported configurations are not evenly distributed, we also have fewer points on which to base our analysis.
Pareto optimality In general, we can see two different patterns (this also extends to the other test benchmarks). In terms of Pareto optimality, most of the dominant points belong to mem-h and mem-H. However, lower memory settings may also contribute configurations to the Pareto set; in k-NN, for instance, mem-l has a configuration that is as fast as the highest ones, but with 20% less energy consumption. The default configuration is often a very good one; however, there are other dominant solutions that are missed if only the default configuration is used.
5.3. Input-Size Analysis
Our previous work [3] was built on static code features and did not take different input sizes into consideration. In fact, changing the problem size has a significant effect on performance [20]. In our case, it is more important to understand whether the Pareto-optimal solutions change with different input sizes. We analyze this question with a case study of two applications: Matrix Multiply and MT (Mersenne Twister). These two applications were chosen to represent very different behaviors, but the insights apply to all the tested applications. The applications were executed with different problem sizes, and the results are shown in Figure 10 and Figure 11.
Matrix Multiply Application We tested the four memory settings mentioned above, labeled for simplicity L, l, h and H, each with all supported core frequencies. The default setting (mem-H with the core at 1001 MHz) is at the intersection of the green lines. For Matrix Multiply (Figure 10), the speedup (left column) benefits greatly from core frequency scaling. On the other hand, the (normalized) energy consumption behaves differently. In the middle column of Figure 10, for three out of four memory configurations, the normalized energy resembles a parabola with a minimum: when increasing the core frequency, the energy first decreases as the computation time is reduced; at higher frequencies, however, the impact on energy is no longer compensated by the improvement in speedup. The lowest memory configuration (mem-L) seems to show a similar behavior; however, we do not have data at higher core frequencies to validate this (core frequencies larger than 405 MHz are not supported for mem-L; see Figure 5a).
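This trade-off can be made explicit with a simple first-order sketch (an illustration only, not the prediction model used in this work). Assuming the kernel time scales as $t(f) \approx W/f$ for a compute-dominated code, and that power consists of a static part plus a dynamic part growing roughly cubically with the core frequency $f$ (dynamic power $\propto f V^2$ with $V \propto f$), the energy per task is
\[
E(f) = P(f)\, t(f) \approx \left(P_{\mathrm{static}} + k f^{3}\right)\frac{W}{f} = \frac{P_{\mathrm{static}}\, W}{f} + k W f^{2},
\]
which decreases at low frequencies, where the static term dominates, and increases at high frequencies, where the dynamic term dominates, with a minimum at $f^{*} = \left(P_{\mathrm{static}}/(2k)\right)^{1/3}$. This is consistent with the parabola-like curves in the middle column of Figure 10.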
The speedup and normalized energy differ slightly between small and large input sizes. For the smallest problem size (top row of Figure 10), both the rate at which the speedup increases with core frequency scaling and the curvature of the normalized energy are lower than for the three larger input sizes. For those three sizes, the speedup and normalized energy change very little as the input size increases from 262144 to 1048576. The right column of Figure 10 shows both energy and performance together. As they behave differently, there is no single optimal configuration; in fact, this is a multi-objective optimization problem with a set of Pareto-optimal solutions. It is important to note that for the small size (8192), the default configuration (black cross) is Pareto-optimal, while it is not for the other three input sizes.
Mersenne Twister Application Mersenne Twister behaves differently from Matrix Multiply, not only regarding the speedup and normalized energy for the same input size, but also regarding the effect of different sizes. For the speedup, increasing the core frequency does not improve performance, while selecting the highest memory frequency (mem-H) does, as shown in the left column of Figure 11. This behavior is explained by the larger number of memory operations. The energy consumption of Mersenne Twister behaves similarly to that of Matrix Multiply, although the increase in energy consumption at higher core frequencies is larger.
In terms of input size, the speedup decreases with increasing input size, especially for mem-l and mem-L. The normalized energy consumption is better for the small input size (8192) than for the other sizes. Mapping these observations to the bi-objective problem (right column of Figure 11), we can see that the mem-l solutions are not Pareto-optimal for the large input sizes, which illustrates that, for a memory-bound application with a large input size, the Pareto-optimal solutions mainly come from the higher memory frequency configurations.
5.5. Accuracy of the Predicted Pareto Set
Once the two models have predicted the speedup and normalized energy for all frequency configurations, Algorithm 1 is used to compute the predicted Pareto set. The accuracy analysis of the Pareto set is not trivial because our predicted set may include points that, according to the actual measurements, are dominated by other solutions. In general, a better Pareto approximation is a set of solutions that, in terms of speedup and normalized energy, is as close as possible to the real Pareto-optimal set, which in our case has been evaluated on a subset of sampled configurations.
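Algorithm 1 itself is not reproduced in this section; as a minimal sketch of what such a step computes, a straightforward non-dominated filter over the two objectives (maximize speedup, minimize normalized energy) could look as follows.

```python
# Sketch of a non-dominated filter over the two objectives used in this work
# (maximize speedup, minimize normalized energy). Not necessarily identical
# to Algorithm 1; shown only to illustrate what the step computes.
def pareto_front(points):
    """points: list of (config, speedup, normalized_energy) tuples.
    Returns the subset of points not dominated by any other point."""
    front = []
    for cfg, s, e in points:
        dominated = any(
            s2 >= s and e2 <= e and (s2 > s or e2 < e)
            for _, s2, e2 in points
        )
        if not dominated:
            front.append((cfg, s, e))
    # Sorting by speedup allows the front to be drawn as a line (cf. Figure 13)
    return sorted(front, key=lambda p: p[1])
```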
Lowest memory configuration Because of technical limitations of NVML, the memory configuration mem-L supports only six core frequencies, up to 405 MHz; therefore, it covers only a limited part of the core-frequency domain. This leads to a lower accuracy of the normalized energy prediction (Figure 12b). In addition, the Pareto analysis shows that the last point usually dominates the others and contributes to the overall Pareto set in 11 out of 12 codes, as shown in Figure 13 (the six mem-L points are in green; the last point is blue when dominant). We used a simple heuristic to work around this issue: we applied the predictive modeling approach to the other three memory configurations, and added the last mem-L configuration to the Pareto set. This simple solution is accurate for all but one code: AES.
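In code, the heuristic amounts to appending the highest-core-frequency mem-L point to the model-based Pareto prediction; the tuple layout below (configuration as (core MHz, memory MHz), followed by speedup and normalized energy) is an illustrative assumption.

```python
# Sketch of the mem-L heuristic: the model-based Pareto prediction covers only
# mem-H, mem-h and mem-l; the mem-L point with the highest supported core
# frequency (405 MHz) is then appended. Tuple layout is an assumption:
# ((core_mhz, mem_mhz), speedup, normalized_energy).
def add_mem_l_extreme(predicted_pareto, mem_l_points):
    last_mem_l = max(mem_l_points, key=lambda p: p[0][0])  # highest core frequency
    return predicted_pareto + [last_mem_l]
```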
Pareto frontier accuracy Figure 13 provides an overview of the Pareto set predicted by our method and the real one, over the collection of twelve test benchmarks. The gray points represent the measured speedup and normalized energy of all the sampled frequency configurations (mem-H, mem-h and mem-l); the mem-L points are shown in green because they are not modeled with our predictive approach. The default configuration is marked with a black cross. The blue line represents the real Pareto front, while the red crosses represent our predicted Pareto set (we did not connect these points because they are not necessarily mutually non-dominated).
Coverage difference Table 2 shows different metrics that evaluate the accuracy of our predicted Pareto set. A measure that is frequently used in multi-objective optimization is the hypervolume (HV) indicator [37], which measures the volume dominated by an approximation set with respect to a reference point. In our case, we are interested in the coverage difference between two sets (e.g., the real Pareto set and the approximation set). Therefore, we use the binary hypervolume metric [38], which, for two sets A and B, is defined by
\[
D(A, B) := HV(A + B) - HV(B),
\]
i.e., the size of the objective space that is dominated by A but not by B.
Because we maximize speedup and minimize normalized energy consumption, we select (0.0, 2.0) as the reference point. In addition, we also report the cardinality of both the predicted and the optimal Pareto set.
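For two objectives, both the hypervolume and the resulting coverage difference can be computed directly by a sweep over the points. The sketch below uses the reference point (0.0, 2.0) mentioned above and represents points as (speedup, normalized energy) pairs; it is an illustration, not the evaluation code behind Table 2.

```python
# Sketch: 2D hypervolume with respect to a reference point (maximize speedup,
# minimize normalized energy) and the binary coverage difference
# D(A, B) = HV(A + B) - HV(B). Illustrative only, not the code behind Table 2.
REF_POINT = (0.0, 2.0)  # (speedup, normalized energy)

def hypervolume(points, ref=REF_POINT):
    """points: iterable of (speedup, normalized_energy) pairs. Returns the
    area dominated by the set and bounded by the reference point."""
    ref_s, ref_e = ref
    area, best_e = 0.0, ref_e
    # Sweep from the fastest point: each point contributes the rectangle
    # between its energy and the lowest energy seen so far.
    for s, e in sorted(points, key=lambda p: p[0], reverse=True):
        if e < best_e:
            area += (s - ref_s) * (best_e - e)
            best_e = e
    return area

def coverage_difference(a, b):
    """Size of the objective space dominated by set a but not by set b."""
    return hypervolume(list(a) + list(b)) - hypervolume(b)
```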
The twelve test benchmarks in Figure 13 are sorted by coverage difference. Perlin Noise is the code closest to the optimal Pareto set: the 12 predicted points are very close to the 10 optimal ones, and the overall coverage difference is minimal ( ). Overall, the Pareto predictions for the first six codes are very accurate ( ). Five more codes show some visible mispredictions, which, however, translate into a relatively small error ( ). k-NN is the worst code because of the low accuracy of its speedup prediction, shown in Figure 12a.
Accuracy on Extrema We additionally evaluated the accuracy of our predictive approach in finding the extreme configurations, i.e., the two dominant points that have, respectively, minimum energy consumption and maximum speedup. Again, we excluded from this analysis the mem-L configurations, whose accuracy was discussed above. The rationale behind this evaluation is that the accuracy of the Pareto predictions may not reflect the accuracy on these extreme points. As shown in Table 2, the point with maximum speedup is predicted exactly in 7 out of 12 cases, and the error is small. In the case of the point with minimum energy, we have larger mispredictions in general; in particular, two codes, AES and MT, have a very large error. This reflects the single-objective accuracy observed before, where the accuracy of the speedup prediction is generally higher than that of the energy prediction. The high error of the MT code across all our analyses is mainly due to the fact that its lower memory configurations collapse into a point (mem-L) and a line (mem-l), a behavior that is not shown by the other codes.
Predictive modeling in a multi-objective optimization scenario is challenging because a few mispredicted points may affect the whole prediction, as they may wrongly dominate solutions that are actually good approximations. Moreover, not all errors are equal: overestimating speedup, as well as underestimating energy, is much worse than the opposite, as it may introduce wrong dominant solutions. Despite that, our predictive approach is able to deliver good approximations for ten out of twelve test benchmarks.