Article

Loop-Block-Level Automatic Parallelization in Compilers

School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(3), 1533; https://doi.org/10.3390/app16031533
Submission received: 5 January 2026 / Revised: 30 January 2026 / Accepted: 2 February 2026 / Published: 3 February 2026

Abstract

To address the issues of coarse-grained thread allocation and difficult load balancing in compiler automatic parallelization for processors, this paper proposes a loop-block-level automatic parallelization method for compilers based on an iterative compilation mode, using the SWGCC compiler on the Sunway platform. An automatic parallelization method that independently sets the number of threads for each loop block is designed within the SWGCC, which allocates threads to each parallelizable loop block in the program at a finer granularity. Meanwhile, iterative compilation is combined with a genetic algorithm to iteratively optimize the optimal thread group. Through operations such as the chromosome encoding of thread allocation schemes, weighted mutation operations based on loop execution proportions, and fitness function-guided population evolution, the optimal thread combination is efficiently searched for the loop block thread allocation algorithm. Experiments are validated on the Sunway processor using the SPEC2006 test suite. The results reveal that the loop-block-level compiler automatic parallelization algorithm combined with the evolutionary algorithm achieves a maximum performance score improvement of 19% and an average performance score improvement of 4% compared to the baseline automatic parallelization algorithm in the tests.

1. Introduction

With the development of processor manufacturing technology, processing capabilities are gradually approaching physical limits. Moore’s Law has slowed since 2000, and by 2018 the gap between projected and actual performance had grown to 15× and continued to widen [1,2]. Dennard scaling began to break down in 2007, and by 2012 processor energy consumption had reached a plateau [3]. Traditional instruction-level parallelism (ILP) is unsustainable due to energy constraints, pushing hardware architectures toward multi-core designs [4,5]. In the 2023 Top500 list, nine clusters had over 1 million cores, compared to only one in 2017 [6].
This trend in hardware design is intended to simplify control logic, allocate more silicon to computing engines, and shift the responsibility for identifying and exploiting parallelism to programmers and language systems. Existing high-level languages execute inefficiently, and although optimization can drastically improve efficiency [7], manual optimization is difficult: most programmers are better at serial programming and are unfamiliar with hardware, making automatic compiler optimization critical. Hennessy et al. noted that hardware–software co-design will usher in a golden age of architecture [8], and Backus foresaw, 40 years ago, the importance of compilers for scientific computing in the face of hardware complexity [9]. Current compilers already support vectorization, instruction scheduling, multi-core parallelism, and other optimizations, and have become the primary means of program optimization.
As optimization features multiply, modern compilers offer a wide range of optimization methods. Choosing applicable scenarios and parameters for different methods has become a complex problem. With the development of heuristic optimization searches and iterative compilation, intelligent compiler tuning frameworks combining compile-time tuning have emerged as key future solutions, such as the following:
  • Developing cost functions via heuristic experience to guide the search for optimal solutions—this relies on optimization experience and specific hardware, with long construction cycles and poor portability [10,11].
  • Combining adaptive iterative compilation with a heuristic search to converge the optimization space through repeated compilation—this depends on iterative cycles, leading to high time costs and poor reusability [12,13,14].
  • Predicting optimization methods using historical data—this heavily relies on the training set size and generalization ability [15,16,17].
While no universal solution exists for heuristic search or iterative compilation methods in compiler optimization, targeted use in specific scenarios already effectively improves performance.
This paper studies the optimization capabilities of the SWGCC (SunWay GNU Compiler Collection) on Sunway processors. All of the work in this paper is carried out based on SWGCC 8.3. To further exploit the parallelization potential, we design a fine-grained automatic parallelization method with better parallelism capacities, combining the iterative compilation from both internal and external compiler perspectives. Our main contributions are twofold:
  • Integrating with existing automatic parallelization frameworks, we develop a loop block thread allocation algorithm to address load imbalances during thread selection, focusing on balancing each loop block during parallelization.
  • Using iterative compilation to select thread groups for the loop block algorithm, we apply genetic algorithms to iterate thread combinations, with the iterative results serving as optimal thread groups.

2. Loop Block Thread Allocation Method

2.1. Limitations of Automatic Parallel Design

In the SWGCC’s automatic parallelization optimization process, the compiler uses an automatic parallelization option to set the number of parallel threads for all parallelizable loops in the program at once. While this method simplifies the compiler’s optimization logic, it struggles to ensure load balancing. As a coarse-grained parallelism approach, multi-threaded parallelism faces core challenges: balancing task allocation across threads while minimizing thread synchronization and communication overhead. When using the minimum number of threads (i.e., serial execution), there is no synchronization or communication overhead, but the load balancing is at its weakest. Conversely, using the maximum number of threads supported by the processor usually distributes tasks evenly across threads, maximizing load balancing benefits—yet the synchronization and communication overhead also peaks. In practice, at maximum thread counts, the reduced per-thread task volume often leads to other overheads exceeding the performance gain from the parallelization. Thus, parallelization optimization requires a trade-off between the synchronization/communication overhead and load balancing benefits. Ensuring that loads are balanced as much as possible is key to guaranteeing parallelization performance [18,19,20].
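The trade-off above can be illustrated with a toy cost model (our own illustration for exposition, not a model used by the SWGCC): per-thread work shrinks as 1/p while synchronization overhead grows with p, so total time bottoms out at an intermediate thread count.

```python
# Illustrative cost model (an assumption, not the paper's):
# time(p) = work/p + sync_cost*p, i.e. evenly split work plus a
# synchronization/communication overhead that grows with thread count.
def parallel_time(work, p, sync_cost=0.5):
    return work / p + sync_cost * p

# Sweep thread counts for 100 units of work: time falls, bottoms out,
# then rises again once overhead outgrows the shrinking per-thread task.
times = {p: parallel_time(100.0, p) for p in (1, 2, 4, 8, 16, 32)}
best_p = min(times, key=times.get)
```

Under this model the minimum is at 16 threads; 32 threads are already slower than 16, mirroring the behavior described above.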
Additionally, in the compiler’s baseline automatic parallelization, it is difficult for the compiler to perform a load analysis on loop blocks due to uncertainty regarding runtime states and execution environments. The baseline automatic parallelization follows a “maximize parallelization” principle: as long as a loop block meets the legality analysis conditions for automatic parallelization, the compiler deems it parallelizable. Moreover, since the compiler uses a static scheduling strategy to evenly distribute loop block tasks among threads, unsuitable loops may be parallelized, negatively impacting execution performance.
For the baseline automatic parallelization function, this paper tests SWGCC 8.3’s baseline automatic parallelization on the SW3231 processor. This processor includes 32 cores, with each core supporting up to 64 threads, a 2.4 GHz clock speed, the SW_core3 instruction set, and a maximum 64 MB L3 cache. It uses OpenMP 4.5 as the parallel library. Experiments are conducted on the openEuler operating system using the SPEC2006 benchmark suite. Performance is measured using the program’s SPEC2006 score, derived from the execution time via the suite’s internal algorithm—shorter execution times yield higher scores, reducing runtime disturbances and stabilizing performance tests. Tests use different thread counts on the SPEC2006 suite; Figure 1 presents speedup ratios (calculated as the ratio of the parallel execution score to the serial execution score). Due to the hardware architecture and program structure characteristics, increasing the thread counts reduces the per-thread task volume and increases the communication overhead, leading to performance degradation. Furthermore, programs often contain multiple parallelizable loops: a thread count suitable for some loops may be too large or small for others, causing performance losses.
Thus, in the automatic parallelization optimization process, determining the appropriate number of threads for a program is a critical issue. Existing methods for setting thread counts globally adopt a coarser-grained approach for generality, using a single user-specified thread count to determine the global parallelization mode. However, this thread allocation method has limitations: as seen from the test results in Figure 1, the parallelization performance speedup does not always improve with increasing thread counts. For some test cases, the performance reaches a maximum at a certain thread count, and subsequent increases in thread counts lead to performance degradation. To address this issue, this paper proposes a new thread allocation scheme, modifying the original macro-level thread setting scheme to assign a separate thread count to each loop. This reduces the implementation granularity of the parallelization to achieve a greater parallel performance.

2.2. Loop Block Thread Allocation Strategy

In the SWGCC automatic parallelization process, the program is treated as a whole. The automatic parallelization analysis does not consider the specific execution details and environment of parallelizable loops but rather only checks the parallel legality of loops. For thread settings, the SWGCC delegates the thread configuration to users via the -ftree-parallelize-loops option.
However, in the scenarios discussed earlier, the execution time T_p of the program’s parallelizable part is not derived from a single loop segment. In many programs, T_p may be a set {T_p1, T_p2, …} [21]. For example, suppose there are two parallelizable loops, L_1 and L_2, in a program:
Under 16 threads: L_1 runs for 10 s and L_2 runs for 15 s → T_p(16) = 25 s;
Under 24 threads: L_1 runs for 15 s and L_2 runs for 6 s → T_p(24) = 21 s.
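The arithmetic behind this example can be checked directly, and the per-loop optimum shows why a mixed allocation (hypothetical here) would beat either uniform setting:

```python
# Per-loop runtimes (seconds) from the example above, under two uniform
# thread counts applied to both loops.
runtimes = {
    16: {"L1": 10, "L2": 15},   # T_p(16) = 10 + 15 = 25 s
    24: {"L1": 15, "L2": 6},    # T_p(24) = 15 + 6 = 21 s
}
t_p = {p: sum(loops.values()) for p, loops in runtimes.items()}

# A per-loop allocation (16 threads for L1, 24 for L2) would take only
# 10 + 6 = 16 s, beating both uniform settings.
mixed = runtimes[16]["L1"] + runtimes[24]["L2"]
```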
Although increasing the number of threads improves the overall performance, this improvement is driven by local performance gains offsetting local losses. The resulting overall performance benefit is clearly not the optimal solution.
In actual program environments, the situation may be more complex: sometimes local performance losses exceed local gains, leading to negative impacts on parallelization.
When a compiler performs automatic parallelization on programs containing multiple parallelizable loops, if a unified thread count allocation strategy is adopted, it may achieve overall speedup under specific parameters. However, this speedup could result from “local gains offsetting local losses”: some loops speed up due to appropriate thread count allocation, while others suffer performance degradation due to thread load imbalances.
To address this issue, this paper introduces a loop block thread allocation algorithm in the thread allocation phase of automatic parallelization. The algorithm marks all parallelizable loop segments, collects information about them, and generates corresponding parallel thread groups based on the set of all parallelizable loops. For any program, let L_n = {L_p1, L_p2, …, L_pn} denote the set of parallelizable loops, where each element corresponds to a thread count p, forming a thread group p_n. In automatic parallelization with a unified thread allocation strategy, p_n = {p_usr, p_usr, …, p_usr}. In the loop block thread allocation algorithm, a new thread group p_new = {p_1, p_2, …, p_n} is initialized based on L_n, assigning a separate thread count to each parallelizable loop. Subsequently, during the parallel directive generation phase, the SWGCC generates parallel directives with different thread parameters for the corresponding loops based on p_new, completing the parallelization process.
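A minimal sketch of the two allocation schemes (the names below are illustrative, not the compiler's internal data structures):

```python
# Parallelizable loop blocks found by the legality analysis (illustrative).
loops = ["L_p1", "L_p2", "L_p3"]
p_usr = 16                                # user-specified global count

# Unified strategy: the same thread count for every loop block.
p_n = {loop: p_usr for loop in loops}

# Loop block thread allocation: an independent count per loop block;
# a count of 1 keeps an unprofitable loop serial.
p_new = {"L_p1": 16, "L_p2": 24, "L_p3": 1}
```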
The structure of the loop block thread allocation algorithm is shown in Figure 2. The thread allocation process of the baseline automatic parallelization is depicted in the dashed line section of Figure 2: after completing the automatic parallelization analysis, the user-set thread number is input into the compiler, and then, the corresponding parallel program is generated based on the set number of threads in the parallel program’s generation phase.
In the loop block thread allocation algorithm, the algorithm and the baseline automatic parallelization algorithm share a parallel legality analysis tool, which identifies parallelizable loops that pass the parallelism check as a loop block. After the automatic parallelization analysis is completed, as shown in the solid line section of Figure 2, a thread group is constructed based on the compilation information obtained from the analysis and the information of loop blocks. Then, parameter adjustments are performed on the thread group members according to the machine information and program structure. The adjusted thread group is input into the compiler, and the thread counts set in the thread group are assigned to corresponding loops by the thread processing module to generate the parallel program.

2.3. Experimental Analysis

A comparative test was conducted between the original automatic parallelization and the automatic parallelization with the addition of the loop block thread allocation algorithm. The 410.bwaves test case from the SPEC2006 benchmark suite was tested on the Sunway SW3231 platform. In the SWGCC automatic parallelization framework, this test case contains 13 parallelizable loops. The performance analysis tool gprof in Linux was used to analyze the hotspot code of the program. The largest loop is at line 172 in the block_solver.f file, as shown in Figure 3a, accounting for 68.49% of the program’s execution time, while the smallest parallelizable loop accounts for only 0.7% of the execution time—there is a huge difference in the task volume between different loop structures.
The test result scores are shown in Figure 3: as the number of threads increases, the program’s performance gradually decreases after reaching a peak. Based on the optimal thread selection of the original automatic parallelization, by independently adjusting the thread count of the core loop with the highest execution proportion in the 410.bwaves test case while keeping the thread counts of other loop segments unchanged, the core segment was set to 13 threads in an overall 8-thread parallel environment. The test results are shown in the “dyna” row of Figure 3b: after applying the new parallelization method, the program performance score increased by 10%.
The core loop of this test case is a multi-layer nested computation over a four-dimensional array, with 65 iterations per layer. Parallelization was applied at the third layer of the nested loops. With 13 threads, the 65 iterations divide evenly into chunks of five, so no thread has to handle extra iterations that cause long waits, achieving optimal load balancing.
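The balance can be checked with static-chunking arithmetic (a sketch of the reasoning, not the OpenMP runtime's actual scheduler code):

```python
# Static scheduling splits `iters` iterations over `threads` threads:
# every thread gets `base` iterations and `extra` threads get one more.
def static_split(iters, threads):
    base, extra = divmod(iters, threads)
    return base, extra

# 65 iterations over 13 threads divide evenly: 5 each, no stragglers.
assert static_split(65, 13) == (5, 0)
# Over 8 threads there is a remainder: one thread runs a 9th iteration
# while the other seven wait at the barrier.
assert static_split(65, 8) == (8, 1)
```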
The experimental results demonstrate that this automatic parallelization method can effectively improve programs’ automatic parallelization performance; however, the manner in which the number of threads is selected is a challenge faced by this method. The number selected in this experiment is derived from a program analysis and multiple adjustments—even a single member in the thread group requires extensive manual debugging. Without a familiarity with the program structure or architecture information, it is difficult to assign appropriate values to each member in the program’s thread group. To address this issue, this paper combines this method with iterative compilation to select the number of threads for loops.

3. Thread Selection Strategy Combined with Iterative Compilation

Although the loop block thread allocation algorithm can effectively improve the automatic parallelization performance, it still faces challenges in determining the number of threads. For the original multi-thread strategy, only a global thread count needs to be specified to complete the thread initialization for automatic parallelization; however, in the loop block thread allocation algorithm, each loop requires a separate thread count setting, which increases the initialization requirements for the thread group. If the program is a black box program, it will be difficult to select the optimal thread count through analysis. To address this issue, this paper combines the automatic parallelization algorithm with iterative compilation, using a genetic algorithm to iterate over the optimal thread set, thereby solving the problem of thread count selection for automatic parallel loops.

3.1. Selection of Iterative Algorithm

The loop block thread allocation algorithm is essentially a high-dimensional discrete combinatorial optimization problem. Thread group selection consists of two questions: whether the current thread group is the optimal one and how to determine the optimal thread count for each loop. The former can be judged by the quality of the speedup effect after optimization; for the process of finding the optimal thread count for a single loop, if thread counts are randomly selected without considering the architecture and program characteristics, the size of the search space for the thread group will be determined by the number of parallelizable loops and the number of processor cores, and the search space grows exponentially with the number of loops in the thread group, resulting in a huge search space.
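The size of that search space is easy to quantify: with K candidate thread counts per loop and n loops, exhaustive search must consider K^n thread groups. Using the figures that appear elsewhere in the paper (13 loops, a 32-core processor) purely for scale:

```python
# Exhaustive search space for a thread group: K choices per loop, n loops.
n_loops, cores = 13, 32   # figures from the 410.bwaves case, for scale
space = cores ** n_loops  # 32**13 = 2**65, roughly 3.7e19 combinations
```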
Therefore, for the thread selection problem of loop blocks in programs, it is necessary to design an efficient thread selection method. To this end, drawing on the idea of the iterative optimization of compilation parameters, we designed an external iterator to optimize thread allocation parameters. Its goal is to screen out the optimal thread group for the loop block thread allocation algorithm. In a fixed environment and program execution scenario, the optimal thread group is relatively stable. Therefore, the selection of the iterator algorithm must prioritize ensuring the quality of the results. In this scenario, using a genetic algorithm as the iterator has obvious advantages: as a type of stochastic optimization algorithm that simulates biological evolution, its core strengths lie in representing thread group schemes as chromosomes through population encoding, combining advantageous features of different thread groups via crossover operations, introducing new search directions through mutation operations, and finally retaining the best-performing offspring schemes through selection operations. This parallel group search mechanism can effectively balance global exploration and local exploitation, avoid falling into local optima, and meet the needs of thread group optimization, thus efficiently approaching the optimal thread allocation scheme in an exponential solution space.
Iterative compilation can effectively achieve better results, but it also has the drawback of long iteration time. In extreme cases, the time spent on result selection will be far longer than the program’s running time, and the algorithm does not take into account the cost of iteration. When selecting the algorithm, we prioritize result quality over solution efficiency, allowing longer iteration times in exchange for better optimization effects. Although other heuristic algorithms may converge faster as iterators, the genetic algorithm as an iterative optimization method is more suitable for the current scenario.

3.2. Algorithm Design

3.2.1. Chromosome Representation and Fitness Function

The primary task of the genetic algorithm (GA) in solving the optimal thread group problem for program parallelization is chromosome encoding and the fitness definition. For a program containing n parallelizable loops, with the loop block set L_n = {L_p1, L_p2, …, L_pn}, an n-bit chromosome X_Q is defined, where each bit corresponds to the thread count of one loop block. Specifically, X_Q = p_new = {p_1, p_2, …, p_n}, where each element represents the thread count for the corresponding loop block in the thread group.
Meanwhile, the speedup of the thread group relative to serial execution is used as the fitness f_q, calculated as shown in Formula (1). A higher fitness value indicates a better thread allocation scheme corresponding to the chromosome.
f_q = T_1 / T_Xq (1)
where T_1 is the serial execution time and T_Xq is the execution time of the program parallelized with thread group X_q.
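As a sketch (the timings below are invented for illustration), the fitness is simply the measured speedup:

```python
# Fitness of chromosome X_q per Formula (1): f_q = T_1 / T_Xq, the
# serial execution time divided by the time under thread group X_q.
def fitness(t_serial, t_xq):
    return t_serial / t_xq

f_q = fitness(120.0, 40.0)   # a thread group that runs 3x faster
```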

3.2.2. Initial Population Generation

The initial population needs to balance search efficiency and global coverage capability. The algorithm sets a dynamically adjusted population size and introduces special chromosomes to prevent the population from falling into local optima. The dynamic population size N_pop is determined jointly by the number of chromosome bits n and the CPU core count K.
As the chromosome length increases (i.e., the number of loops in the thread group increases), the population requires a larger size to ensure that key chromosomes are covered in the search.
The value of K is determined by the number of physical cores supported by the processor. A larger K means a wider optional range of thread counts, requiring more searches to find potential optimal solutions.
Therefore, when generating the population, two coefficients, c_1 and c_2, are set as the chromosome-bit coefficient and the CPU-core-count coefficient, respectively, to control the population size:
c_1 = 3: each chromosome bit contributes three candidate values.
c_2 = 1: the population size is expanded in proportion to the CPU core count.
To expand the search coverage and reduce the risk of local optima, two types of special chromosomes are introduced:
Single-thread chromosome: all thread group members are set to 1, simulating serial execution.
Max-thread chromosome: all thread group members are set to K, attempting maximum-thread execution.
Together these cover the boundary scenarios from serial to maximum parallelism, guarding against the negative speedup that occurs when parallel overhead exceeds its benefit.
The population size is given by Formula (2).
N_pop = c_1 · n + c_2 · K + 2 (2)
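A sketch of initial population generation under Formula (2), with uniformly random genes standing in for whatever seeding heuristics the implementation uses:

```python
import random

# Build N_pop = c1*n + c2*K + 2 chromosomes: random thread groups plus
# the two special chromosomes (all-1 serial, all-K max-thread).
def init_population(n, K, c1=3, c2=1, seed=0):
    rng = random.Random(seed)
    n_pop = c1 * n + c2 * K + 2
    pop = [[rng.randint(1, K) for _ in range(n)] for _ in range(n_pop - 2)]
    pop.append([1] * n)   # single-thread chromosome
    pop.append([K] * n)   # max-thread chromosome
    return pop

pop = init_population(n=13, K=32)   # 3*13 + 1*32 + 2 = 73 chromosomes
```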

3.2.3. Parent Selection and Crossover Operation

Parent selection uses a truncation selection algorithm: after each iteration, the current population and offspring population are merged and sorted in descending order of fitness, and the top N_pop individuals enter the next generation. This method eliminates the need for probability calculations and ensures the stable transmission of high-quality genes by retaining the better half of the mixed population.
The crossover operation adopts a multi-point crossover strategy: the n/2-th gene of each chromosome is selected as the crossover point, and the gene segments before and after the crossover point of the two chromosomes are recombined into new chromosomes. The crossover rate is set to 0.5, meaning 50% of the chromosomes in the population are randomly selected for crossover.
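Both operators are simple to sketch (function names are ours, not the implementation's):

```python
# Truncation selection: merge parents and offspring, sort by fitness
# descending, keep the top n_pop individuals.
def truncation_select(parents, offspring, fitness, n_pop):
    merged = sorted(parents + offspring, key=fitness, reverse=True)
    return merged[:n_pop]

# Crossover at the n//2-th gene: swap the tails of two chromosomes.
def crossover(a, b):
    cut = len(a) // 2
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

child1, child2 = crossover([8, 8, 8, 8], [16, 16, 16, 16])
survivors = truncation_select([[1], [2]], [[3]],
                              fitness=lambda c: c[0], n_pop=2)
```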

3.2.4. Mutation Operation and Genetic Factor Weighting

The mutation operation introduces new genes through random perturbation to prevent premature convergence of the population, directionally selecting mutation sites based on genetic factor weights. The position of the mutated gene is determined by the weight W_i, computed from the execution time proportion of the corresponding loop in the entire program: the probability of gene i being selected for mutation is W_i = R_i / Σ_{j=1}^{n} R_j, where R_j is the execution proportion of loop j. Weighting lets loops with high execution proportions be prioritized, ensuring sufficient exploration of thread counts for key loops. After a site is selected, a discrete random reset of the gene value is performed: the thread count is mutated to a random value within the range of physical threads supported by the hardware, excluding the thread counts the loop already has in the current population, to avoid redundant searches.
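A sketch of the weighted site selection and the non-repeating reset (the execution shares and the `tried` bookkeeping structure are illustrative assumptions):

```python
import random

# Mutation site i is drawn with probability W_i = R_i / sum(R), so the
# hottest loops are perturbed most often; the new thread count avoids
# values already tried for that loop in the current population.
def mutate(chrom, exec_shares, K, tried, rng):
    total = sum(exec_shares)
    weights = [r / total for r in exec_shares]
    i = rng.choices(range(len(chrom)), weights=weights, k=1)[0]
    candidates = [p for p in range(1, K + 1)
                  if p != chrom[i] and p not in tried.get(i, set())]
    child = chrom[:]
    if candidates:   # reset the gene to a fresh, untried thread count
        child[i] = rng.choice(candidates)
    return child

rng = random.Random(0)
child = mutate([8, 8, 8], exec_shares=[0.68, 0.20, 0.12], K=32,
               tried={}, rng=rng)
```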

3.2.5. Algorithm Convergence

After multiple iterations of the population, the algorithm determines convergence based on the fitness increase rate of chromosomes over consecutive rounds. When the fitness increase rate of the optimal chromosome in the population is less than 5% for five consecutive iterations, the population is considered to have converged to an approximate optimal solution, and iteration stops. The optimal chromosome in the population is output as the thread group that achieves the best performance.
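The stopping rule reduces to a window check over the best-fitness history:

```python
# Converged when the best fitness improves by less than 5% in each of
# the last five consecutive generations.
def converged(best_history, threshold=0.05, rounds=5):
    if len(best_history) < rounds + 1:
        return False
    window = best_history[-(rounds + 1):]
    gains = [(b - a) / a for a, b in zip(window, window[1:])]
    return all(g < threshold for g in gains)
```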

3.3. The Overall Structure of the Algorithm

The genetic algorithm iterator is implemented outside the compiler using Python 3.1 and Shell 5.1, which externally controls the entire iterative compilation process. Its overall structure is shown in Figure 4. The algorithm first executes the target program serially to collect program information and execution data. Based on the program information, parallelizable loops in the program are identified as loop blocks. Subsequently, chromosomes are constructed from the loop blocks, which are then input into the loop block thread allocation algorithm to generate parallel programs. Populations are generated based on the running results and hardware parameters. After initialization, chromosomes are iterated according to the GA settings, and the optimal chromosome is output as the thread group for the parallel execution of the program. Finally, the thread group is input into the loop block thread allocation algorithm to generate the corresponding parallel program, which serves as the optimal parallel execution parameter.
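Putting the pieces together, the external iterator's control flow can be sketched end to end. Here `compile_and_run` is a stand-in for invoking SWGCC with the loop-block option and timing the binary; everything in this sketch, including the toy cost model, is an assumption for illustration only:

```python
import random

# Stand-in for "compile with this thread group, run, and time it":
# pretend each loop's ideal count is 13 and penalize the distance.
def compile_and_run(chrom):
    return 100.0 + sum(abs(g - 13) for g in chrom)  # pseudo execution time

def evolve(n, K, generations=30, seed=1):
    rng = random.Random(seed)
    t_serial = compile_and_run([1] * n)              # serial baseline
    pop = [[rng.randint(1, K) for _ in range(n)] for _ in range(3 * n + K)]
    for _ in range(generations):
        children = []
        for _ in range(len(pop) // 2):               # 0.5 crossover rate
            a, b = rng.sample(pop, 2)
            child = a[:n // 2] + b[n // 2:]          # n//2 crossover point
            child[rng.randrange(n)] = rng.randint(1, K)   # mutation
            children.append(child)
        # truncation selection on fitness f_q = T_1 / T_Xq
        pop = sorted(pop + children,
                     key=lambda c: t_serial / compile_and_run(c),
                     reverse=True)[:len(pop)]
    return pop[0]                                    # best thread group

best = evolve(n=4, K=32)
```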
Each execution of the algorithm iterates an optimal thread group for the program. This thread group is optimal for program execution in the current environment and only applies to the program running in this environment; in other environments, performance deviations may occur due to hardware influences.

4. Experimental Evaluation

4.1. Test Environment and Results

This paper uses a Sunway architecture server as the experimental platform, with the Sunway SW3231 processor and SWGCC 8.3 compiler. Tests were conducted on automatically parallelizable benchmarks from the SPEC2006 suite. On the Sunway platform, the -O3 option was used as the base compilation flag for serial execution tests; the -O3 option was used along with enabling the -ftree-parallelize-loops option and setting thread counts to 8, 16, 24, and 32 respectively for basic automatic parallelization tests; the -O3 option and the loop block-level automatic parallelization option -ftree-parallelize-block were used to test the algorithm proposed in this paper. After multiple iterations, the execution results under the optimal thread count were obtained.
The benchmark scores are shown in Table 1, which compares the following:
  • Serial execution (thread count = 1);
  • Baseline automatic parallelization with four different thread counts (8, 16, 24, and 32);
  • The parallel thread allocation algorithm proposed in this paper.
The performance improvement ratio is calculated by comparing the score of our algorithm with the optimal score from other test results (including serial execution and parallel execution with four different thread counts). The results show that the loop block thread allocation algorithm combined with the evolutionary algorithm effectively ensures an automatic parallelization performance and achieves a better performance in some benchmarks.
Table 1. Program performance scores with different thread allocation strategies.
Benchmark | 1 | 8 | 16 | 24 | 32 | GA | Improvement Ratio
403.gcc | 8.31 | 8.47 | 8.21 | 8.37 | 8.35 | 8.50 | 1.00
410.bwaves | 5.35 | 11.50 | 10.04 | 8.45 | 7.00 | 12.70 | 1.10
429.mcf | 4.60 | 4.88 | 4.60 | 6.65 | 4.64 | 6.77 | 1.02
433.milc | 6.23 | 6.30 | 6.29 | 5.11 | 6.23 | 6.30 | 1.00
435.gromacs | 9.22 | 9.22 | 9.22 | 9.17 | 9.12 | 9.26 | 1.00
436.cactusADM | 10.30 | 31.50 | 34.50 | 38.40 | 48.40 | 57.40 | 1.19
459.GemsFDTD | 7.03 | 20.60 | 21.20 | 22.02 | 20.80 | 23.70 | 1.08
471.omnetpp | 10.60 | 9.27 | 9.76 | 9.23 | 9.65 | 10.60 | 1.00
473.astar | 7.67 | 8.78 | 8.67 | 8.53 | 8.76 | 8.92 | 1.02
481.wrf | 9.76 | 11.30 | 9.08 | 9.11 | 8.57 | 11.33 | 1.00
483.xalancbmk | 9.37 | 9.49 | 10.10 | 9.89 | 10.10 | 10.20 | 1.01

4.2. Analysis of Experimental Results

A speedup comparison with the baseline automatic parallelization algorithm is shown in Figure 5; the data represent the scores of the corresponding parallel and serial versions of each program. The proposed loop block thread allocation algorithm combined with the evolutionary algorithm has clear advantages over the compiler's built-in method: its automatic parallelization performance generally outperforms that of any single fixed thread count, and it avoids the performance degradation that automatic parallelization suffers under certain thread counts.
Among the benchmarks, some are unsuitable for parallelization—e.g., 471, which contains only one parallelizable loop for continuous data printing. The baseline automatic parallelization ignores parallelization benefits while focusing solely on legality, leading to performance losses. With the genetic algorithm, the thread group selects serial execution to preserve performance. Another group of benchmarks (403, 433, 435, 459, 473, 483) exhibits poor parallelization effects: although parallelization produces marginal improvements, the parallelizable loops are not core computations. For example, 433 has 15 parallelizable loops—7 for IO initialization (execution proportion < 0.1%) and 8 for computations (total < 0.5%)—making parallelization ineffective.
For benchmarks like 410, which has multiple parallelizable loops with high execution proportions, the loop block thread allocation algorithm combined with the evolutionary algorithm yields better results. Beyond the core loop analyzed in Section 2, other loops account for 16.6% of the execution time (max 10.33%, min 0.71%). These non-core loops involve simple single-step computations and assignments; high-thread parallelization leads to a synchronization overhead that exceeds parallel gains.
Benchmark 436 has 21 parallelizable loops, 7 of which are large tasks with parallel benefits. These loops satisfy both parallelization and instruction-level parallelism (ILP) conditions. After parallelization, the SWGCC performs the ILP optimization within threads, increasing memory access per iteration. When the thread granularity falls below a threshold, excessive vector instruction concurrency saturates the hardware memory bandwidth, causing over-optimization and suboptimal performances. Reducing thread counts appropriately improves results.
Compared to the SWGCC’s baseline algorithm, the proposed algorithm effectively balances parallel loads, eliminates invalid optimizations for non-parallelizable loops, and enhances the parallelization performance while raising its upper limit.

5. Conclusions

This paper addresses core issues of the SWGCC’s automatic parallelization:
  • To overcome limitations of traditional global thread allocation in load balancing and fine-grained adaptation, a loop block thread allocation algorithm is designed inside the compiler.
  • Inspired by iterative compilation, a genetic algorithm is integrated as an iterator to solve adaptive thread parameter selection issues.
  • The combination forms the loop-block-level automatic parallelization method, validated by experiments to achieve effective speedups.
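The chromosome encoding and weighted mutation named above can be sketched as follows. This is a hedged illustration rather than SWGCC code: the function name, the thread-count bound, and the example weights are assumptions; in the full method, fitness would be the measured performance score of the program compiled with each thread group.

```python
import random

MAX_THREADS = 8  # assumed per-block upper bound for this sketch

def mutate_weighted(genes, weights, rng, base_rate=1.0):
    """Mutate a thread-allocation chromosome in place.

    genes[i]   -- thread count assigned to loop block i (1 = serial)
    weights[i] -- loop i's share of total execution time; loops with
                  larger execution proportions are mutated more often,
                  mirroring the weighted mutation operation.
    """
    for i, w in enumerate(weights):
        if rng.random() < base_rate * w:
            # Re-draw this block's thread count within the allowed range.
            genes[i] = rng.randint(1, MAX_THREADS)
    return genes

# Illustrative chromosome: three parallelizable loop blocks, with the
# first dominating execution time and thus most likely to be perturbed.
rng = random.Random(42)
genes = mutate_weighted([4, 4, 4], [0.80, 0.15, 0.05], rng)
print(genes)
```

Selection and crossover over a population of such chromosomes, guided by the fitness function, then evolve the population toward the optimal thread combination.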
For single-core parallelism, iterative compilation aims to expand parallel boundaries and increase parallelization rates by mining more parallelizable blocks [22,23,24]. Traditional automatic parallelization prioritizes load balancing via scheduling optimizations [25,26]. Our work is based on SWGCC; since SWGCC is a modified version of the GCC compiler, the optimizations targeting it can also be ported to GCC, though some additional adaptation may be required. While sharing the core logic of load optimization, our method focuses on thread allocation rather than scheduling under a fixed thread count. This approach balances loads effectively but suffers from long iteration times, which remains a key limitation even though we prioritize result quality over search efficiency.

Author Contributions

M.C.: Conceptualization, Methodology, Software, Data Curation, Writing—Original Draft Preparation, Writing—Reviewing and Editing. K.N.: Visualization, Investigation; Q.Z.: Supervision; H.L.: Validation. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Munoz, R. Furthering Moore’s Law Integration Benefits in the Chiplet Era. IEEE Des. Test 2024, 41, 81–90.
  2. Menon, H.; Diffenderfer, J.; Georgakoudis, G.; Laguna, I.; Lam, M.O.; Osei-Kuffuor, D.; Parasyris, K.; Vanover, J.; Schordan, M. Approximate High-Performance Computing: A Fast and Energy-Efficient Computing Paradigm in the Post-Moore Era. IT Prof. 2023, 25, 7–15.
  3. Zhao, Y.; Du, Z.; Guo, Q.; Xu, Z.; Chen, Y. Rescue to the Curse of Universality. Sci. China Inf. Sci. 2023, 66, 192102.
  4. Kadosh, T.; Hasabnis, N.; Mattson, T.; Pinter, Y.; Oren, G. Quantifying OpenMP: Statistical Insights into Usage and Adoption. In Proceedings of the 2023 IEEE High Performance Extreme Computing Conference (HPEC 2023), Boston, MA, USA, 25–29 September 2023.
  5. Matsuoka, S.; Domke, J.; Wahib, M.; Drozd, A.; Hoefler, T. Myths and Legends in High-Performance Computing. Int. J. High Perform. Comput. Appl. 2023, 37, 245–259.
  6. TOP500 Team. Frontier Remains No. 1 in the TOP500 but Aurora with Intel’s Sapphire Rapids Chips Enters with a Half-Scale System at No. 2. 13 November 2023. Available online: https://www.top500.org/news/frontier-remains-no-1-in-the-top500-but-aurora-with-intels-sapphire-rapids-chips-enters-with-a-half-scale-system-at-no-2/ (accessed on 14 January 2024).
  7. Kung, H.; Leiserson, C. Systolic Arrays (for VLSI). In Sparse Matrix Proceedings; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 1979; Volume 1, pp. 256–282.
  8. Hennessy, J.L.; Patterson, D.A. A New Golden Age for Computer Architecture. Commun. ACM 2019, 62, 48–60.
  9. Backus, J.; Dai, M. Can Programming Be Liberated from the von Neumann Style? Functional Programming and Its Algebra of Programs. J. Comput. Sci. 1984, 3, 21–43.
  10. Kurra, S.; Singh, N.K.; Panda, P.R. The Impact of Loop Unrolling on Controller Delay in High Level Synthesis. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition; IEEE Computer Society: Piscataway, NJ, USA, 2007.
  11. Cooper, K.D.; Harvey, T.J.; Waterman, T. An Adaptive Strategy for Inline Substitution. In Proceedings of the International Conference on Compiler Construction; Springer: Berlin, Germany, 2008.
  12. Liu, H.; Xu, J.L.; Zhao, R.C.; Jinyang, Y. Compiler Optimization Sequence Selection Method Guided by Learning Model. Comput. Res. Dev. 2019, 56, 2012–2026.
  13. Sato, Y.; Yuki, T.; Endo, T. An Autotuning Framework for Scalable Execution of Tiled Code via Iterative Polyhedral Compilation. ACM Trans. Archit. Code Optim. 2019, 15, 67.
  14. Nobre, R.; Martins, L.G.A.; Cardoso, J.M.P. A Graph-Based Iterative Compiler Pass Selection and Phase Ordering Approach. ACM SIGPLAN Not. 2016, 51, 21–30.
  15. Liu, H.; Zhao, R.C.; Wang, Q. Parameter Selection Method for Function Level Compiler Optimization Guided by Supervised Learning Model. Comput. Eng. Sci. 2018, 40, 957–968.
  16. Fursin, G.; Miranda, C.; Temam, O. MILEPOST GCC: Machine Learning Based Research Compiler. In Proceedings of the GCC Developers’ Summit, Ottawa, ON, Canada, 17–19 June 2008.
  17. Park, E.; Cavazos, J.; Alvarez, M.A. Using Graph-Based Program Characterization for Predictive Modeling. In Proceedings of the Tenth International Symposium on Code Generation and Optimization; Association for Computing Machinery: New York, NY, USA, 2012; pp. 196–206.
  18. Li, Y.P. Load-Balanced OpenMP Static Scheduling Method Under Multi-Threading. Master’s Thesis, Zhengzhou University, Zhengzhou, China, 2022.
  19. Tzen, T.H.; Ni, L.M. Trapezoid Self-Scheduling: A Practical Scheduling Scheme for Parallel Compilers. IEEE Trans. Parallel Distrib. Syst. 1993, 4, 87–98.
  20. Gao, Y.C.; Zhao, R.C.; Han, L.; Li, L. Research on Loop Automatic Parallelization Technology. J. Inf. Eng. Univ. 2019, 20, 82–89.
  21. Schryen, G. Speedup and Efficiency of Computational Parallelization: A Unifying Approach and Asymptotic Analysis. J. Parallel Distrib. Comput. 2024, 187, 104835.
  22. Zhao, D.S. Research on the Thread-Level Speculation Execution Model for LLVM Compiler. Ph.D. Thesis, Northwest A&F University, Xianyang, China, 2021.
  23. Xu, M. Research on Key Technologies of Reconfigurable Manycore Stream Processor Architecture. Master’s Thesis, University of Science and Technology of China, Hefei, China, 2012.
  24. Liu, B.; Zhao, Y.L.; Han, B.; Li, Y.X.; Ji, S.; Feng, B.Q.; Wu, W.J. A Loop Selection Approach Based on Performance Prediction for Speculative Multithreading. J. Electron. Inf. Technol. 2014, 36, 2768–2774.
  25. Nie, K. Research on Multi-threaded Compilation Optimization Techniques for Master-Slave Hybrid Architecture of CPU. Ph.D. Thesis, Information Engineering University of the Strategic Support Force, Zhengzhou, China, 2021.
  26. Xu, J.; Wang, G.; Han, L.; Nie, K.; Li, H.; Chen, M.; Liu, H. Research on Parallel Scheduling Strategy Optimization Technology Based on Sunway Compiler. Comput. Sci. 2025, 52, 137–143.
Figure 1. Speedup comparison of SPEC2006 benchmarks compiled with SWGCC with different thread counts.
Figure 2. Diagram of loop-block-level thread allocation algorithm.
Figure 3. Performance scores of 410 benchmark with different thread counts.
Figure 4. Overall framework of iterative compilation.
Figure 5. Performance speedup comparison of genetic algorithm.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
