Qinling: A Parametric Model in Speculative Multithreading

Speculative multithreading (SpMT) is a thread-level automatic parallelization technique that can accelerate sequential programs, especially irregular applications that are hard to parallelize with conventional approaches. Thread partition plays a critical role in SpMT. Conventional machine learning-based thread partition approaches use machine learning to guide partition offline, but they cannot explicitly express the law relating partition to performance. In this paper, we build a parametric model (Qinling) with a multiple regression method to discover the inherent law between thread partition and performance. First, we extract the unpredictable parameters that determine the performance of thread partition in SpMT. Second, we build the parametric model Qinling from the extracted parameters and speedups, train it offline, and apply it to predict the theoretical speedups of unseen applications. Finally, we validate the model. Prophet, which consists of an automatic parallelization compiler and a multi-core simulator, is used to obtain the real speedups of the input programs. The Olden and SPEC2000 benchmarks are used to train and validate the parametric model. Experiments show that Qinling predicts the speedups of unseen programs well and provides feedback guidance for Prophet to obtain the optimal partition parameters.


Introduction
The emergence of the speculative multithreading (SpMT) model [1][2][3][4][5][6] in the last decade has provided a significant breakthrough for non-numeric applications. Exploring the partition law in SpMT, however, is challenging because of the complexity of the influencing parameters. Program parallelization primarily includes two methods: compiler-based automatic parallelization and machine learning-based parallelization. Compiler-based automatic parallelization is a widely studied area and can potentially deliver significant speedups for sequential programs. The studies in [7][8][9][10] focused on loops and decomposed them into multiple code segments to improve performance. Other works [2,11] partitioned the whole program into multiple threads to be executed in parallel.
Machine learning technology has been successfully introduced to SpMT to exploit program parallelism [5][12][13][14][15][16][17]. Wang et al. [15] developed an automatic compiler-based approach that uses machine learning to map a parallelized program to multi-core processors. Long et al. [13] presented a machine learning-based approach to parallel workload allocation in a cost-aware manner. Chen et al. [17] presented an adaptive open multiprocessing (OpenMP)-based mechanism that generates a reasonable number of representative multithreaded versions for a given loop and uses machine learning to select a suitable version at runtime for execution on a multicore architecture.
Li et al. [6] used an artificial immune algorithm to obtain the optimal thread partition scheme. Liu et al. [18] used virtual sample generation and a K-nearest neighbor (KNN) algorithm to realize thread partition. Machine learning has recently also been investigated by a number of researchers in the area of compiler optimization. Much of the prior work in machine learning-based compilation relied on program feature-based characterization. For instance, Monsifrot et al. [19], Stephenson et al. [20], and Agakov et al. [21] all used static loop nest features. Cavazos et al. [22] considered a reaction-based scheme that used the sequence of transformations applied to a program as an input to a learnt model. Wang et al. [5,23] combined dynamic features with a machine learning method to exploit potentially parallel legacy code.
Moreover, regression or statistics-based methods have also received much attention [24][25][26][27]. Lee et al. [26] applied regression modeling to derive simulation-free statistical inference models in order to reduce the number of required simulations. Cavazos et al. [28] developed a logistic regression technique that automatically selects the best set of optimizations for different sections of a program. Khan et al. and Luk et al. [24,25] used statistical machine learning and a fully automatic method, respectively, to map potential parallelism onto threads in the context of SpMT.
In both cases, the benefits of statistical regression are highlighted in two aspects. On the one hand, the selection of the best set of optimizations and the mapping of potential parallelism onto threads are completed automatically; on the other hand, statistical regression is effective at learning and applying the regularities of thread partition.
This paper develops a parametric model, namely a multiple regression model (Qinling), to explore the inherent law between the factors influencing thread partition and performance. We develop this model on Prophet [29,30] and automatically predict the speedups of unseen programs from their partition parameter values. This is achieved by training Qinling offline on a set of training data, from which it automatically learns the inherent law.

Definitions
In this section, we present several definitions so that the abbreviations used later can be well understood. Thread partition is the process in which a sequential program is divided into many segments that are mapped to many processing elements to run. SP is the abbreviation of spawning point, which is used to spawn a child thread. CQIP is the abbreviation of control quasi-independent point, which is used to validate the successor thread.
Spawning distance (SD) is the dynamic instruction count from the spawning point to the control quasi-independent point. SD represents the time difference between the execution of a predecessor thread and that of its successor thread.
P-slice is the abbreviation of pre-computation slice, which is a simplified version of the dependent instructions between predecessor and successor threads.
Thread granularity is the size of a thread generated by partitioning a sequential program.

Motivations
The purpose of the multiple regression model is to exploit the inherent regularity between the influencing factors and speedups. Thus, finding the primary influencing factors during thread partition is the first issue to be handled.

Motivation from Partition Algorithms
In this paper, we refer to two partition algorithms, Algorithm 1 and Algorithm 2 [31], and to the time overhead analysis graph (Figure 1) to describe the process of extracting influencing factors. Algorithm 1 describes loop partition. While partitioning a function, the loop regions are identified and partitioned first. The profiling information about the number of iterations and the loop body size is considered together to decide the partition of a loop region. The data dependence count between successive iterations of the loop is also checked. A thread is spawned for the next iteration only when doing so is profitable. In lines 6 to 10 of Algorithm 1, for a loop with proper granularity and a small inter-iteration data dependence count, each iteration is specified as a candidate thread. Small loops are not parallelized, because the overhead of spawning threads would offset the performance improvement from SpMT; instead, such loops are unrolled to increase parallelism. The pseudo code for non-loop partitioning is shown in Algorithm 2.
The function "partition_thread" partitions the sequential code segments between two basic blocks into multiple threads by calling itself recursively. In Algorithm 2, "curr_thread" represents the subgraph of the current candidate thread, and it consists of all the basic blocks between the exit node of the previous thread and "start_block". If "curr_thread" is NULL, either "start_block" is the exit node of the previous thread, or "curr_thread" cannot be a candidate thread because it is too small or has too many data dependences. "pdom_block", which acts as the control-independent point of the current basic block, is the postdominator of "start_block". The "likely_path" is the most likely path between "start_block" and "pdom_block". The function "find_optimal_dependence" keeps the optimal data dependence count between the current thread and the future thread below DEP_THRESHOLD. To get the best speedup at runtime, thread granularity is bounded by a lower and an upper limit. In lines 7 to 25 of Algorithm 2, a "curr_thread" whose granularity is within the limits and whose dependence on the successor thread is less than DEP_THRESHOLD can be partitioned off to generate a new thread. If the granularity of "curr_thread" is too large, the subgraph between "start_block" and its control-independent basic block is further partitioned for potential candidate threads. Furthermore, if "curr_thread" is too small even when it includes the basic blocks along the most likely path, no new candidate thread is created at the control-independent point, and the "future_thread" is simply added to "curr_thread".
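The three-way decision that Algorithm 2 applies to a candidate thread can be sketched as follows. This is only an illustration, not the Prophet implementation: the function name, the string labels, the inclusive treatment of the limits, and the handling of a candidate with too many dependences (labelled "defer" here) are assumptions made for the example.

```python
def classify_candidate(granularity, dependence_count,
                       lower_limit, upper_limit, dep_threshold):
    """Decide what to do with curr_thread (hypothetical helper).

    granularity      -- size of curr_thread in dynamic instructions
    dependence_count -- data dependence count with the successor thread
    """
    if granularity > upper_limit:
        # Too large: the subgraph up to the control-independent block
        # is partitioned further.
        return "split"
    if granularity < lower_limit:
        # Too small: future_thread is added to curr_thread and no new
        # candidate thread is created.
        return "merge"
    if dependence_count < dep_threshold:
        # Granularity within limits and few dependences: new thread.
        return "spawn"
    # Assumption: with too many dependences, curr_thread is not a
    # candidate at this point.
    return "defer"


# Example with made-up limits (10..100 instructions, threshold 3).
decision = classify_candidate(50, 1, 10, 100, 3)
```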
In Figure 1, the length of the precomputation slice (p-slice) is represented by the variable p-slice, the spawning distance from the predecessor thread to the successor thread is sp_dis, and the correlative instruction count along the spawning path is dep_cnt; formula (1) relates these quantities, where C represents the overhead of constructing the p-slice. Thus, the time saved by speculative execution (denoted time_ahead) is given by formula (2), where dep_cnt is determined by the dependence count and C is also affected by many other factors. Overall, time_ahead is determined by the spawning distance and the dependence count.

Determination of Influencing Factors
During loop partition and non-loop partition, the bold words in the above paragraphs and the bold statements in Algorithm 1 and Algorithm 2 suggest that three factors, namely thread granularity, data dependence count, and spawning distance, are the primary influencing factors during partition. In the time_ahead analysis, time_ahead is mainly influenced by the spawning distance and the dependence count. We therefore obtain a set of three influencing factors: spawning distance, dependence count, and thread granularity.
Based on Sections 2.2.1 and 2.2.2, we obtain the independent and dependent variables listed in Table 1.

SpMT Execution Model
The speculative multithreading technique [6,32] is an aggressive form of program execution in which multiple code segments are executed in parallel on a multi-core processor to improve the speedups of sequential programs. In the SpMT execution model, a sequential program is partitioned into multiple speculative threads, each of which executes a different part of the sequential program. Among the concurrently executed threads there is one special thread, called the non-speculative thread. It is the only thread allowed to commit its results to memory, while all the other threads are speculative. A speculative thread is marked by a spawning instruction pair. When a spawning instruction is encountered during program execution and the available processor resources allow spawning, the parent thread spawns a new speculative thread.
When the non-speculative thread completes its execution, it verifies its successor thread. If the validation succeeds, the non-speculative thread commits to memory all the values that the successor thread has generated, and the successor thread then becomes non-speculative. Otherwise, the non-speculative thread revokes all speculative child threads and re-executes its successor threads.
On Prophet [29,30], a spawning instruction pair is composed of a Spawning Point (SP) and a Control Quasi-Independent Point (CQIP). During program execution, the SP defined in the parent thread can spawn a new thread that speculatively executes the code segment behind the CQIP. The thread-level speculative model is shown in Figure 2. A sequential program is annotated with SP-CQIP pairs; if the pairs are ignored, the speculative multithreaded program executes as a sequential program, as shown in Figure 2(a).
When an SP is encountered during program execution, the parent thread spawns a new speculative thread, which speculatively executes the code segment behind the CQIP, as shown in Figure 2(b).
A validation failure or a Read-after-Write (RAW) violation causes speculation to fail. When validation fails, the predecessor thread executes the speculative thread in a sequential manner, as shown in Figure 2(c). When a RAW dependence is violated, as shown in Figure 2(d), the speculative thread restarts itself from the current state.

Pre-Computation Slices
In SpMT, the key issue is how to deal with inter-thread data dependences. Synchronization mechanisms and value prediction have been applied so far. The synchronization approach imposes a high overhead when dependences are frequent and seriously affects parallel performance. Value prediction has more potential if the values computed by one thread and consumed by another can be predicted. The consumer thread can then be executed in parallel with the producer thread, since the real values are only needed for validation at a later stage. In the Prophet compiler [29], speculative p-slices [1] are constructed and inserted at the beginning of each speculative thread in order to reduce inter-thread dependences. P-slices are used to calculate the live-ins (dependent variables that are generated by a predecessor thread and consumed by a successor thread) of the new speculative thread, but they do not need to guarantee correctness, since the underlying architecture can detect and recover from mis-speculations. The p-slices are extracted from the producer thread at compile time but triggered at run time to pre-fetch the live-ins. The steps to build the p-slices for a given spawning pair are: (1) identifying the live-ins produced on the speculative path; and (2) generating the optimal p-slices.

Data Dependence Calculation
Data dependence [32] includes the data dependence count (DDC) and the data dependence distance (DDD). DDC is the weighted count of the data dependence arcs coming into a basic block from other blocks, while the DDD between two basic blocks B1 and B2 models the maximum time that the instructions in block B2 will stall waiting for instructions in B1 to complete if B1 and B2 are executed in parallel. DDC and DDD are given by formula (3) and formula (4), respectively. Of the two, we select DDC as the dependence criterion. DDC models the extent of the data dependence that a block has on other blocks. Figure 3 illustrates the data dependence between blocks: the values of x and y in B3 rely on those produced in B1 and B2, and the dotted lines represent the dependences. If the dependence count is small, the block is more or less data independent of other blocks, and a thread can start at the beginning of that basic block. While counting the data dependence arcs, the compiler gives more weight to the arcs coming from blocks that belong to threads closer to the block under consideration. The motivation is that dependences on distant threads are likely to be resolved earlier, so the current thread is less likely to wait for data generated there. In formula (3), $A_n$ denotes the dependence edges from thread $T_n$ to thread $T$. Furthermore, the compiler gives less weight to the data dependence arcs coming from the less likely paths. The rationale for using the data dependence count is twofold: first, it is simple to compute; second, if the processing elements execute out of order, the data dependence distance model may not be very accurate, because it assumes serial execution within each thread. In practice, with out-of-order execution, instructions that are later in program order can be executed before earlier instructions inside a thread. Thus, the data dependence count tries to model the extent of data dependence in the presence of out-of-order execution.
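The weighting idea behind DDC can be illustrated with a small sketch. The specific weights below (path probability divided by thread distance) are an assumption chosen only to reproduce the two qualitative properties described above, namely that closer threads and more likely paths contribute more; they are not the weights of formula (3).

```python
def weighted_ddc(arcs):
    """Weighted data dependence count for one basic block (illustrative).

    arcs -- iterable of (thread_distance, path_probability) pairs, one per
            incoming dependence arc. thread_distance is 1 for the nearest
            predecessor thread, 2 for the next one, and so on.
    """
    total = 0.0
    for thread_distance, path_probability in arcs:
        if thread_distance < 1:
            raise ValueError("thread_distance must be at least 1")
        if not 0.0 <= path_probability <= 1.0:
            raise ValueError("path_probability must lie in [0, 1]")
        # Assumed weight: closer threads and likelier paths count more.
        total += path_probability / thread_distance
    return total


# One arc from the adjacent thread on a certain path, and one arc from a
# thread two steps away on a path taken half of the time.
example_ddc = weighted_ddc([(1, 1.0), (2, 0.5)])
```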

Deployment
We use a multiple regression method to build a parametric model. The specific steps are as follows.

Heuristic rules
An SP can be placed anywhere in a program, preferably behind a function call instruction. In a non-loop region, the CQIP is at the entrance of a basic block. In a loop region, the CQIP is located in front of the loop branch instruction in the last basic block of the loop. The SP and CQIP of a pair are located in the same function or in the same loop. The number of dynamic instructions between SP and CQIP must be greater than the lower limit of thread granularity (LLoTG) and less than the upper limit of thread granularity (ULoTG). The spawning distance between candidate threads must be greater than the lower limit of spawning distance (LLoSD) and less than the upper limit of spawning distance (ULoSD). The data dependence between two consecutive candidate threads must be less than the data dependence count (DDC) threshold. The number of function call instructions between SP and CQIP must be less than CALL_LOWER.
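The numeric rules above amount to a simple predicate over a candidate SP-CQIP pair. The following sketch shows one way to express it; the class and field names are hypothetical, and the profile values of a candidate are assumed to be already available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Partition thresholds, named after the abbreviations in the text."""
    llotg: int        # lower limit of thread granularity
    ulotg: int        # upper limit of thread granularity
    llosd: int        # lower limit of spawning distance
    ulosd: int        # upper limit of spawning distance
    ddc: float        # data dependence count threshold
    call_lower: int   # bound on call instructions between SP and CQIP


@dataclass(frozen=True)
class Candidate:
    """Hypothetical profile summary of one SP-CQIP pair."""
    granularity: int         # thread size in dynamic instructions
    spawning_distance: int   # spawning distance in dynamic instructions
    dependence_count: float  # dependence with the consecutive thread
    call_count: int          # call instructions between SP and CQIP
    same_region: bool        # SP and CQIP share a function or a loop


def satisfies_rules(c: Candidate, t: Thresholds) -> bool:
    """Return True if the candidate passes every numeric heuristic rule."""
    return (
        c.same_region
        and t.llotg < c.granularity < t.ulotg
        and t.llosd < c.spawning_distance < t.ulosd
        and c.dependence_count < t.ddc
        and c.call_count < t.call_lower
    )


# Made-up threshold values, for illustration only.
limits = Thresholds(llotg=10, ulotg=100, llosd=5, ulosd=50,
                    ddc=3.0, call_lower=2)
accepted = satisfies_rules(Candidate(50, 20, 1.0, 1, True), limits)
```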

Index Variable Setting
As shown in Section 2.2.3, the influencing factors are spawning distance, dependence count, and thread granularity. During thread partition, what matters specifically is the values of these factors. In accordance with the heuristic rules [6] (shown in Section 4.2), we extract the specific variables and determine the final index variables.
Usually, given the execution time $T_p$ of a parallelized program on $N$ cores and the execution time $T_s$ of the original sequential program, the absolute speedup is defined as in formula (5) [33]:
$$Sp = \frac{T_s}{T_p} \qquad (5)$$
Speedup (Sp) is the ratio between the time a task takes to run on a single processing unit and the time the same task takes on p processing units. According to the heuristic rules, five parameters affect the speedup: DDC, LLoTG, ULoTG, LLoSD, and ULoSD. These five parameters are the independent variables and the speedup (Sp) is the dependent variable; all of them are listed in Table 1.

Gathering of Statistical Data
To obtain a credible regression model, we build the model on statistical data. After determining the dependent and independent variables, we gather and organize data from a hybrid sample set. The conventional heuristic rules-based (HR-based) sample generation approach is efficient, but it is a one-size-fits-all approach and cannot generate the optimal samples for all applications. We therefore propose a hybrid sample generation approach. With this method, we first generate samples, which are MIPS code containing spawning points (SPs) and control quasi-independent points (CQIPs), by applying the heuristic rules on Prophet [30]; we then manually adjust the positions of the SPs and CQIPs and rebuild the precomputation slices (p-slices) to obtain the best sample set. During hybrid sample generation, three mechanisms are applied: bias weighting, preservation of optimal solutions, and summary of greedy rules. In this way, the hybrid samples have the optimal partition positions.
We obtain the values of the five independent variables and the dependent variable from the hybrid sample set manually.

Determination of Mathematical Form
We next consider an appropriate mathematical form to describe the relation among the variables. The conventional method mainly uses scatter diagrams to describe the relation between the independent and dependent variables and to guide the building of the regression model. We consider five independent variables, A (data dependence count), B (the lower limit of thread granularity), C (the upper limit of thread granularity), D (the lower limit of spawning distance), and E (the upper limit of spawning distance), as well as one dependent variable, Sp (speedup). To determine the relation between A, B, C, D, E and Sp, we adopt an "other-fixed-one-change" mechanism. For example, to establish the relation between Sp and A, we vary A and fix B, C, D, and E.
From Figures 4 to 8, we can conclude that the sample points of each factor have a linear relation with the dependent variable Sp. Because of the value precision of variable C, the speedup changes little within the selected data range, but its relation is still essentially linear. After examining the relation between each influencing factor and the speedup, we assume a multiple linear model that is consistent with these relations, and we confirm its correctness through model building and validation. We assume that the relation between the speedup and the five factors (described in Section 5.3) can be expressed by formula (6), where A, B, C, D, E are the five influencing factors and $\beta_1, \beta_2, \beta_3, \beta_4, \beta_5$ are the unknown coefficients of the linear representation:
$$Sp = \beta_1 A + \beta_2 B + \beta_3 C + \beta_4 D + \beta_5 E \qquad (6)$$

Model Parameter Estimation
The unknown parameters of a multiple regression model are usually estimated by the least squares method. The procedure for obtaining the estimated values of the parameter β is shown in formula (7). By minimizing the sum of squared residuals with respect to β, we obtain the least squares estimate of β from formula (8). In practice, we implement the least squares method on a parallel computer and thereby obtain the estimated values of the model parameters.
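As an illustration of the estimation step, the sketch below fits a linear model without an intercept, matching the linear form of formula (6), by solving the textbook normal equations $(X^T X)\beta = X^T y$ with Gaussian elimination. It is a plain sequential sketch with hypothetical function names and toy data, not the parallel implementation used in the paper.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("system is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_least_squares(rows, speedups):
    """Return beta minimizing sum((Sp - row . beta)^2), no intercept."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
           for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, speedups)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Toy data generated from Sp = 2*x1 - 1*x2 (two factors instead of five).
toy_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
toy_speedups = [2.0, -1.0, 1.0, 3.0]
beta = fit_least_squares(toy_rows, toy_speedups)
```

With five factors, each row would hold the values of (A, B, C, D, E) for one sample.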

Validation and Modification
After obtaining the estimated values of the unknown parameters, we set up the regression model. We then need to verify the model to demonstrate its accuracy, and to modify it to make it more accurate. Among the validation methods for regression functions, significance validation is one of the most commonly used.
The significance test of the regression function is stated in formula (9). When the null hypothesis $H_0$ holds, the test statistic is built from $SS_R$, the regression sum of squares, and $SS_E$, the sum of squares of the residual errors. Once the significance level α is given, the rejection region of the test is given by formula (10). The results of our program show that the obtained regression function is statistically significant.
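A significance test of this kind can be sketched with the textbook F statistic, $F = (SS_R / p) / (SS_E / (n - p - 1))$, for a model with p predictors and an intercept. This standard form is an assumption here: the degrees of freedom in formulas (9) and (10) may differ, for example for a model without an intercept. The critical value is passed in by the caller, since it comes from an F-distribution table for the chosen α.

```python
def f_statistic(observed, fitted, num_predictors):
    """Textbook F statistic for the overall significance of a regression."""
    n = len(observed)
    if n != len(fitted):
        raise ValueError("observed and fitted must have the same length")
    mean_y = sum(observed) / n
    ss_r = sum((f - mean_y) ** 2 for f in fitted)              # regression
    ss_e = sum((y - f) ** 2 for y, f in zip(observed, fitted))  # residual
    dof_e = n - num_predictors - 1
    if dof_e <= 0 or ss_e == 0:
        raise ValueError("not enough residual information for an F test")
    return (ss_r / num_predictors) / (ss_e / dof_e)


def rejects_null(f_value, critical_value):
    """Reject H0 (all coefficients zero) when F exceeds the critical value."""
    return f_value > critical_value


# Toy numbers: five observations, one predictor.
toy_f = f_statistic([1.0, 2.0, 3.0, 4.0, 6.0],
                    [1.0, 2.0, 3.0, 4.0, 5.0], 1)
```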

Application of Model
The whole model (shown in Figure 9) is divided into two stages: training and application. Once the training programs are input, we use heuristic rules-based thread partition approaches to partition them and to determine the optimal values of the partition parameters. We then assign the profiling values and the speedups obtained by the heuristic rules-based partition approach to the independent and dependent variables. Finally, we train the multiple regression model. Once the model is trained, we move to the application stage, in which we first compare the similarity between the tested program and the training programs, and then apply the trained regression model to predict the performance of the testing programs.

Experiment
In this section, we introduce our experimental setup and provide details of the Prophet simulator [29,30] and the benchmarks used throughout the evaluation. We then analyze and discuss our results.

Experiment Configuration
We have implemented the execution model and the machine learning (ML)-based thread partitioning algorithms on Prophet [30], which is developed on top of SUIF/MACHSUIF [34] and Weka [35]. All the compiler analysis is performed on the high-level intermediate representation (IR) of SUIF. A profiler is implemented to produce profiling information from the SUIF IR in the form of annotations. The profiler interprets and executes SUIF programs and provides information such as control flow, path prediction, data value prediction, and the number of dynamic instructions of loops and subroutines. The Prophet simulator [17] models a generic SpMT processor with sixteen pipelined MIPS R3000-based processing elements (PEs). The simulator is execution-driven and executes binaries generated by the Prophet compiler. Each PE has its own program counter, fetch unit, decode unit, and execution unit, and it can fetch and execute instructions from a thread. Each PE can issue up to four instructions per cycle in order. Each PE also has a private multiversioned L1 cache with an access latency of two cycles. The multiversioned L1 cache buffers the speculative results of each PE and performs cache communication, and the sixteen PEs share a write-back L2 cache via a snoopy bus. Table 2 shows the simulation parameters, which are similar to those listed in a recent publication on Hydra [36]. Figure 10 shows the Prophet framework; the Prophet simulator is the software abstraction of the MIPS processing element-based implementation scheme in the Prophet framework. In this section, we use the Olden benchmarks [37] to evaluate our approach. The Olden benchmarks are popular for the study of irregular programs, and they possess complex control flow and irregular, pointer-intensive data structures. These programs have dynamic structures such as trees, lists, and DAGs, so they are hard to parallelize with conventional approaches.

Experiment Assumption
Figure 9 describes how Qinling is trained and applied. A sequential program is first converted into a SUIF intermediate representation (SUIF IR). The IR program then passes through our profiler analysis module. The profiler collects execution statistics such as the number of dynamic instructions of each loop body and subroutine and the branch probability of each branch instruction. The annotated SUIF IR is partitioned into a multithreaded program by the heuristic-based thread partitioner. The MachSUIF [38] back end and the linker take the threaded SUIF IR as input and generate threaded MIPS programs. Then, the MIPS programs are evaluated on the simulator to obtain speedups.
Before we construct the parametric model Qinling, which extracts parameter values from the partitioner, we assume that some thread overheads can be ignored. Qinling is trained offline and applied to predict the speedups of unseen applications; the specific process is shown in Section 5.4. We use leave-one-out cross-validation to evaluate our approach. This means that we remove the program to be predicted from the training samples and then build a regression model (shown in Figures 9 and 11), also called the prediction model, based on the remaining programs. This guarantees that the regression model has not seen the target program before. The prediction model is then used to generate speedups for the removed program. We repeat this process for each program in turn. This is a standard evaluation methodology, and it provides an estimate of how well a regression model generalizes to unknown programs. We make several assumptions. First, emphasis is not placed on the process of heuristic-based partition. Second, the similarity comparison between training samples and testing samples is taken directly from other papers. Third, this paper focuses on the building and application of a multiple regression model.
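The leave-one-out procedure described above can be sketched as follows. The `fit` and `predict` callables stand for any regression routine; the mean-based stand-ins, the program names, and the sample values are placeholders invented for the example, not data from the paper.

```python
def leave_one_program_out(samples, fit, predict):
    """Hold out each program in turn and predict its speedups.

    samples -- dict mapping program name to a list of (features, speedup)
    fit     -- callable(feature_rows, speedups) -> model
    predict -- callable(model, features) -> predicted speedup
    Returns a dict mapping program name to (predicted, real) pairs.
    """
    results = {}
    for held_out in samples:
        # Train only on the programs other than the held-out one.
        train = [pair for name, pairs in samples.items()
                 if name != held_out for pair in pairs]
        model = fit([x for x, _ in train], [y for _, y in train])
        results[held_out] = [(predict(model, x), y)
                             for x, y in samples[held_out]]
    return results


def fit_mean(feature_rows, speedups):
    """Placeholder model: remembers only the mean training speedup."""
    return sum(speedups) / len(speedups)


def predict_mean(model, features):
    """Placeholder prediction: always returns the stored mean."""
    return model


toy_samples = {
    "prog_a": [((1,), 1.0)],
    "prog_b": [((2,), 2.0)],
    "prog_c": [((3,), 3.0)],
}
loo = leave_one_program_out(toy_samples, fit_mean, predict_mean)
```

Replacing the placeholders with a least squares fit and a linear prediction gives the evaluation loop for a regression model.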

Model Building
Table 3 presents an example of the data extracted from the Olden benchmarks. The first column lists the benchmarks, and the second and third columns show the values of (A, B, C, D, E) in formula (11) and the corresponding speedups. The total item count is 97, which is larger than $2^5 = 32$ (five is the number of independent variables).
Formula (11) can be expressed as formula (15). According to Cramer's rule [39], $\beta_i = |M_i| / |M|$, where $|M|$ is the determinant of matrix $M$ and $M_i$ is the matrix obtained from $M$ by replacing its $i$-th column with the right-hand-side vector of the system. From formula (5), we can deduce the values of $\beta_1, \beta_2, \beta_3, \beta_4, \beta_5$ and obtain the final parametric model shown in formula (18):
$$Sp = 0.445A - 0.212B + 0.582C - 1.209D - 0.060E \qquad (18)$$
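Cramer's rule itself is easy to sketch: each unknown is the ratio of two determinants. The code below is a generic illustration on a toy 2 × 2 system with hypothetical function names; the actual matrix $M$ and right-hand side of the paper are not reproduced here.

```python
def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [list(row) for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n):
                m[r][k] -= factor * m[col][k]
    return det


def cramer_solve(matrix, rhs):
    """Solve matrix * x = rhs with x_i = |M_i| / |M|."""
    det_m = determinant(matrix)
    if det_m == 0:
        raise ValueError("matrix is singular; Cramer's rule does not apply")
    solution = []
    for i in range(len(rhs)):
        # M_i: copy of the matrix whose i-th column is replaced by rhs.
        replaced = [list(row[:i]) + [rhs[r]] + list(row[i + 1:])
                    for r, row in enumerate(matrix)]
        solution.append(determinant(replaced) / det_m)
    return solution


# Toy system: 6x + 3y = 9 and 3x + 3y = 3.
toy_solution = cramer_solve([[6.0, 3.0], [3.0, 3.0]], [9.0, 3.0])
```

For a five-variable model the same routine applies to a 5 × 5 system, although elimination-based solvers are usually preferred for larger systems.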

Model Validation
Once we have the multiple linear regression Equation (18), we apply it to predict speedups. The model serves two purposes: speedup prediction and feedback guidance for Prophet.

Speedup Prediction
We randomly select 516 testing samples from the Olden benchmarks, use the trained model to predict their speedups, and compare the predicted speedups with the real speedups obtained from the simulator (shown in Figure 12). Table 4 shows forty sets of values of A-E together with the real speedups obtained from the simulator. With Qinling (formula (18)) and the values of A-E, we obtain the predicted values, which we then compare with the real speedups. Figure 12 shows the comparison between the predicted and the real results. Figure 12 reveals gaps between the predicted and the real results, for two reasons: first, the applied parametric model ignores the error term (ξ); second, no adequate similarity comparison between training samples and testing samples is performed. Figure 13 shows the predicted and real results for part of the Standard Performance Evaluation Corporation (SPEC) 2000 benchmarks on different numbers of cores. Figure 14 presents the speedup comparison for SPEC2000 between our predictive model and Mitosis [2]. In Figure 14, the red boxes denote the speedups of the predictive model. The next step is to classify input applications according to the similarities among them. Samples in the same class will share a fine-grained parametric model, while samples in different classes will use different models.
Once Qinling is built and trained, it can also provide feedback guidance for Prophet, which is used in practice to partition sequential programs and obtain speedups via the simulator. Among the primary influencing factors, A denotes DDC, which is determined by the program and cannot be changed, while B, C, D, and E are four variables that form a solution space. We regard formula (18) as an objective function. Taking lrand48 (in Table 4) as an example, we use the parametric model (formula (18)) to obtain the optimal parameter values. Table 5 shows a segment of Matlab code that is used to search for the optimal combination of <B,C,D,E>. During the search, S(p) = -0.212×B + 0.582×C - 1.209×D - 0.060×E is regarded as the objective function.
To provide feedback guidance for Prophet, we first build a solution space with four dimensions, in which every point is a possible solution. The size of the solution space is $30^4 = 8.1 \times 10^5$. While traversing all the combination points, we enforce a basic restriction, namely B < C and D < E. We then compute the objective value of every feasible combination point and find the maximum as well as its corresponding combination of <B,C,D,E>, which are shown in Table 6. Note that the speedup shown in Table 6 is not the final result: the objective function does not include the term 0.445 A, so it is only an intermediate result.
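The exhaustive search over the solution space can be sketched in Python as follows. The objective uses the coefficients of S(p) given in the text; the candidate values for B, C, D, and E are not specified there, so the small grid in the example is an assumption chosen only to keep the run short.

```python
from itertools import product


def objective(b, c, d, e):
    """S(p) from the text: formula (18) without the 0.445*A term."""
    return -0.212 * b + 0.582 * c - 1.209 * d - 0.060 * e


def search_best(values_b, values_c, values_d, values_e):
    """Traverse every combination with B < C and D < E; keep the maximum."""
    best_point, best_value = None, float("-inf")
    for b, c, d, e in product(values_b, values_c, values_d, values_e):
        if not (b < c and d < e):
            continue  # violates the basic restriction
        value = objective(b, c, d, e)
        if value > best_value:
            best_point, best_value = (b, c, d, e), value
    return best_point, best_value


# Assumed grid: five candidate values per dimension. Thirty values per
# dimension would give the 30**4 = 810,000 points mentioned in the text.
grid = range(1, 6)
best_point, best_value = search_best(grid, grid, grid, grid)
```

Because the objective is linear, the maximum lies on the boundary of the feasible region, so the exhaustive traversal mainly serves as a simple way to respect the restriction B < C and D < E.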

Conclusions
In this paper, we have presented and evaluated a parametric model, Qinling, in order to explicitly explore the inherent law between thread partition factors and performance. Qinling uses a multiple regression model to predict speedups and to provide feedback guidance for Prophet. It does so in three steps: first, it exploits the linear relation between each primary influencing factor of thread partition and the speedup; second, it builds and trains a multiple regression model offline and predicts the speedups of unseen applications; finally, by building a solution space and searching it exhaustively, it finds offline the optimal combination of thread partition parameters, so as to guide Prophet to achieve real speedups online.
The key characteristics of the parametric model can be summarized as follows: (1) the primary influencing factors of thread partition are correlated with performance (speedup) by the parametric model Qinling; (2) the inherent law between thread partition and speedup is explicitly expressed; and (3) both offline prediction of speedups and online guidance of thread partition are realized.
Two directions of future research will build on Qinling: (1) training and validation programs will be classified so that different classes of programs use finer-grained parametric models; and (2) Qinling will be enhanced to meet the requirements of classifying applications.

Figure 1. Time overhead analysis in speculative multithreading.

Figure 3. Data dependence arcs between basic blocks.

Figure 4. Statistical values between A and speedups.

Figure 5. Statistical values between B and speedups.

Figure 6. Statistical values between C and speedups.

Figure 7. Statistical values between D and speedups.

Figure 8. Statistical values between E and speedups.

Figure 9. Two stages of model: training and application.

Figure 11. Training and application flow of Qinling.

Figure 13. Predictive speedups and real speedups on different cores.

Table 1. Set of independent variables and dependent variables.

Table 3. Extracted data from Olden benchmarks.

Table 4. Testing samples (Sp is the real speedup from the simulator).