# A Preliminary Empirical Study of the Power Efficiency of Matrix Multiplication

## Abstract

## 1. Introduction

- Definition-based matrix multiplication (will be referred to as D3.0 in this work).
- Basic divide-and-conquer matrix multiplication by Strassen (D2.8).
- Optimized divide-and-conquer multiplication (D2.4).
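For orientation, the first two methods in the list can be sketched as follows. This is a minimal illustration, not the implementation measured in the study: the `Matrix` alias, the function names, the `long long` element type, the `cutoff` parameter, and the restriction to power-of-two dimensions are all assumptions made for brevity. The labels appear to follow the asymptotic exponents (3 for the definition-based method, and log2 7 ≈ 2.81 for Strassen's seven-product recursion). D2.4 is not sketched because the text here does not specify which optimized algorithm it denotes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical matrix type for this sketch (square matrices only).
using Matrix = std::vector<std::vector<long long>>;

// Definition-based product (D3.0): c[i][j] = sum over k of a[i][k] * b[k][j].
Matrix multiply_definition(const Matrix& a, const Matrix& b) {
    const std::size_t n = a.size();
    Matrix c(n, std::vector<long long>(n, 0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Element-wise x + sign * y (sign = -1 gives subtraction).
Matrix combine(const Matrix& x, const Matrix& y, long long sign = 1) {
    const std::size_t n = x.size();
    Matrix r(n, std::vector<long long>(n));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            r[i][j] = x[i][j] + sign * y[i][j];
    return r;
}

// Copy the h x h block of m whose top-left corner is (row, col).
Matrix block(const Matrix& m, std::size_t row, std::size_t col, std::size_t h) {
    Matrix r(h, std::vector<long long>(h));
    for (std::size_t i = 0; i < h; ++i)
        for (std::size_t j = 0; j < h; ++j)
            r[i][j] = m[row + i][col + j];
    return r;
}

// Strassen's recursion (D2.8): seven half-size products instead of eight.
// Assumes the dimension is a power of two; falls back to the definition
// at or below `cutoff`.
Matrix multiply_strassen(const Matrix& a, const Matrix& b, std::size_t cutoff = 2) {
    const std::size_t n = a.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    if (n <= cutoff || n == 1) return multiply_definition(a, b);
    const std::size_t h = n / 2;
    const Matrix a11 = block(a, 0, 0, h), a12 = block(a, 0, h, h);
    const Matrix a21 = block(a, h, 0, h), a22 = block(a, h, h, h);
    const Matrix b11 = block(b, 0, 0, h), b12 = block(b, 0, h, h);
    const Matrix b21 = block(b, h, 0, h), b22 = block(b, h, h, h);

    const Matrix m1 = multiply_strassen(combine(a11, a22), combine(b11, b22), cutoff);
    const Matrix m2 = multiply_strassen(combine(a21, a22), b11, cutoff);
    const Matrix m3 = multiply_strassen(a11, combine(b12, b22, -1), cutoff);
    const Matrix m4 = multiply_strassen(a22, combine(b21, b11, -1), cutoff);
    const Matrix m5 = multiply_strassen(combine(a11, a12), b22, cutoff);
    const Matrix m6 = multiply_strassen(combine(a21, a11, -1), combine(b11, b12), cutoff);
    const Matrix m7 = multiply_strassen(combine(a12, a22, -1), combine(b21, b22), cutoff);

    Matrix c(n, std::vector<long long>(n));
    for (std::size_t i = 0; i < h; ++i) {
        for (std::size_t j = 0; j < h; ++j) {
            c[i][j] = m1[i][j] + m4[i][j] - m5[i][j] + m7[i][j];
            c[i][j + h] = m3[i][j] + m5[i][j];
            c[i + h][j] = m2[i][j] + m4[i][j];
            c[i + h][j + h] = m1[i][j] - m2[i][j] + m3[i][j] + m6[i][j];
        }
    }
    return c;
}
```

The temporary blocks allocated at every level of the recursion are one plausible source of the extra storage overhead that divide-and-conquer methods carry compared with the three nested loops.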

## 2. Literature Review

## 3. Methods and Procedures

#### 3.1. Experimental Environment

#### 3.2. Test Dataset Generation

`int` in C++). Each dataset was run three hundred times in each case. The first twenty runs were discarded so that measurement started from a consistent thermal state rather than the system's initial one. The remaining 280 runs were enough to obtain a reliable average.
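The averaging scheme described above can be sketched as a small helper. The function name `average_after_warmup` and the use of a `std::vector<double>` of per-run measurements are assumptions for illustration; how the study actually collected and stored its readings is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Mean of the per-run measurements after dropping the first `warmup` runs,
// e.g. 300 runs with warmup = 20 leaves 280 runs to average.
double average_after_warmup(const std::vector<double>& runs, std::size_t warmup) {
    assert(runs.size() > warmup);  // at least one run must remain
    const auto first = runs.begin() + static_cast<std::ptrdiff_t>(warmup);
    const double sum = std::accumulate(first, runs.end(), 0.0);
    return sum / static_cast<double>(runs.size() - warmup);
}
```

Calling `average_after_warmup(readings, 20)` on a vector of 300 readings would reproduce the 280-run average described in the text.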

#### 3.3. Executable Files

#### 3.4. Profiling Tools

## 4. Results and Discussion

#### 4.1. Miss Rate Analysis

#### 4.2. Main Memory Trends

#### 4.3. Algorithm Behavior

## 5. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

Abbreviation | Meaning |
---|---|
HPC | High Performance Computing |
D3.0 | The definition-based matrix multiplication |
D2.8 | Strassen’s divide-and-conquer matrix multiplication |
D2.4 | An optimized divide-and-conquer matrix multiplication |
CPU | Central Processing Unit |
GPU | Graphics Processing Unit |
EXE | Executable |


**Figure 2.** Power consumption in watts (W): estimated boundaries for L1, L2, and L3 caches marked for each case. Note the recursive divide-and-conquer methods spilled earlier due to increased stack storage overheads.

**Figure 3.** Total energy consumption in kJ, where the estimated points of spill out to main memory are marked.

**Figure 4.** Total energy consumption in joules (J) for small matrix dimensions, where computation is estimated to be within L1 cache.

**Figure 5.** Execution time in seconds. Note the time trend seems to closely follow the total energy consumption.

**Figure 6.** A detailed view of execution time (ms) for small matrix dimensions within estimated L1 cache boundary.
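One way to see why time and energy track each other is the relation energy = average power × time. The sketch below applies it to tabulated values; it is an inference from the reported numbers, not a calculation stated by the authors, and the function name `implied_time_ms` is hypothetical. For example, the Table 2 entry for D3.0 at dimension 50 (39.6 mJ at 13.2 W) implies roughly 3 ms, which is in the millisecond range shown for small dimensions.

```cpp
#include <cassert>

// Execution time in milliseconds implied by an energy reading (mJ) and an
// average power reading (W), assuming energy = average power x time.
double implied_time_ms(double energy_mj, double power_w) {
    assert(power_w > 0.0);
    return energy_mj / power_w;  // mJ / W = ms
}
```

Because power varies only by a small factor across the table while energy spans many orders of magnitude, execution time has to carry most of the growth in energy.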

Component | Specification |
---|---|
Processor | Intel Xeon E5-2680 v3, 2.50 GHz, 12 cores |
Cache | L1 data: 12 × 32 KB (8-way set associative) |
 | L1 instruction: 12 × 32 KB (8-way set associative) |
 | L2: 12 × 256 KB (8-way set associative) |
 | L3: 30 MB shared (20-way set associative) |
Memory | 8 GB |
Operating System | Linux Ubuntu 16.04 64-bit |
Compiler | GCC 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04) |
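The cache sizes above suggest a rough way to estimate where a computation leaves each cache level. The following sketch assumes 4-byte elements and a working set of three square matrices (two operands and a result), and it ignores stack frames, temporaries, and sharing between cores. These are simplifying assumptions, and the boundaries marked in the figures may have been estimated differently. The function name `largest_fitting_dimension` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Largest n such that `matrices` square n x n matrices with
// `element_bytes`-byte elements fit within `cache_bytes`.
std::size_t largest_fitting_dimension(std::size_t cache_bytes,
                                      std::size_t element_bytes = 4,
                                      std::size_t matrices = 3) {
    assert(element_bytes > 0 && matrices > 0);
    std::size_t n = 0;
    while ((n + 1) * (n + 1) * element_bytes * matrices <= cache_bytes) {
        ++n;
    }
    return n;
}
```

Under these assumptions the per-core 32 KB L1 data cache holds dimensions up to about 52, the 256 KB L2 up to about 147, and the 30 MB shared L3 up to about 1619. Recursive methods that allocate temporaries would be expected to spill at smaller dimensions than this estimate.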

**Table 2.** Average energy in millijoules (mJ), power in watts (W), and percentage difference of D2.4 relative to the other methods, where negative indicates better performance. The region of best power savings is marked.

Matrix Dimension | Energy D3.0 (mJ) | Energy D2.8 (mJ) | Energy D2.4 (mJ) | Power D3.0 (W) | Power D2.8 (W) | Power D2.4 (W) | D2.4 Energy vs. D3.0 (%) | D2.4 Energy vs. D2.8 (%) | D2.4 Power vs. D3.0 (%) | D2.4 Power vs. D2.8 (%) |
---|---|---|---|---|---|---|---|---|---|---|
50 | 39.6 | 47.2 | 53.5 | 13.2 | 11.8 | 10.7 | 35 | 13 | −19 | −9 |
100 | 132.3 | 136 | 141.9 | 14.7 | 13.6 | 12.9 | 7 | 4 | −12 | −5 |
150 | 385 | 402.3 | 394.8 | 15.4 | 14.9 | 14.1 | 3 | −2 | −8 | −5 |
200 | 1333.8 | 1312 | 1271.7 | 17.1 | 16.4 | 15.7 | −5 | −3 | −8 | −4 |
250 | 3860.6 | 3698.4 | 3633.7 | 19.4 | 18.4 | 17.9 | −6 | −2 | −8 | −3 |
300 | 14,552 | 10,341.2 | 7819.5 | 21.4 | 20.6 | 19.5 | −46 | −24 | −9 | −5 |
350 | 50,020 | 29,106 | 18,144 | 24.4 | 23.1 | 22.4 | −64 | −38 | −8 | −3 |
400 | 166,123 | 79,679.8 | 37,861.2 | 27.1 | 25.4 | 23.4 | −77 | −52 | −14 | −8 |
450 | 559,056 | 256,522 | 85,106.8 | 30.4 | 29.2 | 26.3 | −85 | −67 | −13 | −10 |
500 | 1,771,599 | 755,158.6 | 183,804.8 | 32.1 | 30.7 | 28.4 | −90 | −76 | −12 | −7 |
550 | 5,644,914 | 2,231,517.6 | 380,553.6 | 34.1 | 32.4 | 29.4 | −93 | −83 | −14 | −9 |
600 | 18,422,747 | 6,576,116.8 | 833,593.6 | 37.1 | 34.1 | 32.2 | −95 | −87 | −13 | −6 |
650 | 58,690,200.6 | 19,493,061.4 | 1,775,916.8 | 39.4 | 36.1 | 34.3 | −97 | −91 | −13 | −5 |
700 | 184,560,201 | 58,058,035.2 | 3,738,227.2 | 41.3 | 38.4 | 36.1 | −98 | −94 | −13 | −6 |
750 | 587,196,378 | 169,003,328.4 | 8,201,318.4 | 43.8 | 41.4 | 39.6 | −99 | −95 | −10 | −4 |
800 | 1,825,939,422 | 493,783,689.6 | 17,065,369.6 | 45.4 | 43.2 | 41.2 | −99 | −97 | −9 | −5 |
850 | 5,815,657,278 | 1,449,803,759 | 35,870,412.8 | 48.2 | 45.3 | 43.3 | −99 | −98 | −10 | −4 |
900 | 8,591,024,333 | 3,852,873,000 | 75,220,172.8 | 50.4 | 47.9 | 45.4 | −99 | −98 | −10 | −5 |
1000 | 10,566,721,109 | 5,256,980,789 | 156,404,940.8 | 52.4 | 50.7 | 47.2 | −99 | −97 | −10 | −7 |
1100 | 12,542,732,410 | 6,813,082,979 | 327,390,003.2 | 54.4 | 52.1 | 49.4 | −97 | −95 | −9 | −5 |
1200 | 14,483,051,326 | 8,602,704,239 | 677,312,921.6 | 56.8 | 54.8 | 51.1 | −95 | −92 | −10 | −7 |
1300 | 18,771,993,578 | 14,159,804,022 | 2,046,518,886 | 60.1 | 71.2 | 77.2 | −89 | −86 | 28 | 8 |
1400 | 21,728,744,360 | 15,771,824,266 | 4,214,980,608 | 64.2 | 73.4 | 79.5 | −81 | −73 | 24 | 8 |
1500 | 24,178,945,769 | 19,860,856,069 | 8,493,583,565 | 66.8 | 76.1 | 80.1 | −65 | −57 | 20 | 5 |
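The percentage columns of Table 2 appear to be the relative difference of the D2.4 value with respect to the other method's value. The formula below is inferred from the tabulated numbers rather than quoted from the text, and the function name `percent_difference` is hypothetical.

```cpp
#include <cassert>

// Relative difference of the D2.4 reading with respect to another method's
// reading, in percent. Negative means D2.4 used less energy or power.
double percent_difference(double d24, double other) {
    assert(other != 0.0);
    return (d24 - other) / other * 100.0;
}
```

For dimension 50, `percent_difference(53.5, 39.6)` is about 35 (D2.4 used more energy than D3.0), while `percent_difference(10.7, 13.2)` is about −19 (D2.4 drew less power), matching the first row of the table after rounding.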

Matrix Dimension | L1 Misses D3.0 | L1 Misses D2.8 | L1 Misses D2.4 | L2 Misses D3.0 | L2 Misses D2.8 | L2 Misses D2.4 | L3 Misses D3.0 | L3 Misses D2.8 | L3 Misses D2.4 |
---|---|---|---|---|---|---|---|---|---|
50 | 50,641 | 24,312 | 21,643 | 12,471 | 10,478 | 8741 | 24 | 17 | 15 |
100 | 48,531 | 24,781 | 22,314 | 12,781 | 10,241 | 8914 | 22 | 19 | 17 |
150 | 53,152 | 26,140 | 23,146 | 13,784 | 11,364 | 9246 | 24 | 21 | 19 |
200 | 53,941 | 27,140 | 24,691 | 17,425 | 12,634 | 10,656 | 25 | 23 | 20 |
250 | 54,631 | 28,631 | 25,631 | 18,421 | 13,847 | 11,634 | 27 | 27 | 23 |
300 | 55,981 | 29,140 | 26,147 | 22,641 | 14,852 | 12,647 | 29 | 28 | 24 |
350 | 56,910 | 30,147 | 27,931 | 30,145 | 15,362 | 14,654 | 31 | 30 | 26 |
400 | 57,931 | 31,651 | 28,146 | 33,652 | 16,324 | 15,698 | 33 | 31 | 27 |
450 | 59,713 | 32,950 | 30,147 | 41,320 | 17,422 | 16,874 | 35 | 33 | 28 |
500 | 60,235 | 33,165 | 31,460 | 50,361 | 21,632 | 21,698 | 38 | 34 | 30 |
550 | 75,321 | 33,714 | 32,785 | 55,617 | 23,547 | 22,948 | 41 | 36 | 33 |
600 | 79,310 | 34,601 | 33,147 | 70,142 | 29,841 | 25,478 | 43 | 38 | 34 |
650 | 85,312 | 33,631 | 33,910 | 82,156 | 35,261 | 33,695 | 45 | 39 | 39 |
700 | 87,932 | 35,489 | 34,942 | 83,149 | 39,475 | 35,954 | 46 | 41 | 40 |
750 | 91,324 | 36,631 | 35,147 | 90,145 | 44,361 | 41,658 | 49 | 43 | 41 |
800 | 95,312 | 37,326 | 36,147 | 95,961 | 55,641 | 50,647 | 48 | 46 | 43 |
850 | 98,123 | 38,971 | 37,120 | 97,447 | 62,145 | 59,841 | 50 | 49 | 45 |
900 | 99,145 | 39,361 | 28,147 | 99,147 | 70,456 | 63,587 | 52 | 50 | 47 |
1000 | 99,569 | 42,698 | 30,958 | 99,365 | 72,941 | 65,941 | 58 | 54 | 56 |
1100 | 914,320 | 714,327 | 678,910 | 916,347 | 578,912 | 469,820 | 60 | 59 | 58 |
1200 | 4,678,940 | 3,768,453 | 2,876,453 | 1,090,657 | 1,019,876 | 1,009,765 | 5698 | 4698 | 3548 |
1300 | 3,547,931 | 4,236,941 | 5,631,740 | 1,011,649 | 1,156,941 | 1,296,148 | 70,658 | 82,658 | 90,568 |
1400 | 7,890,147 | 8,316,740 | 12,321,945 | 1,260,478 | 1,340,658 | 1,345,964 | 150,968 | 192,689 | 210,658 |
1500 | 11,365,741 | 12,630,948 | 20,103,941 | 1,345,968 | 1,406,157 | 1,469,123 | 185,698 | 245,698 | 410,698 |

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Jammal, F.; Aljabri, N.; Al-Hashimi, M.; Saleh, M.; Abulnaja, O.
A Preliminary Empirical Study of the Power Efficiency of Matrix Multiplication. *Electronics* **2023**, *12*, 1599.
https://doi.org/10.3390/electronics12071599
