An Optimization Framework for Codes Classification and Performance Evaluation of RISC Microprocessors

Pipelines in Reduced Instruction Set Computer (RISC) microprocessors are expected to provide increased throughput in most cases. However, there are a few instructions, and therefore entire assembly language codes, that execute faster and hazard-free without pipelines. Compilers usually generate, from a high-level description, code that suits the underlying hardware so as to maintain symmetry with respect to performance; this, however, is not always guaranteed. Therefore, instead of trying to optimize the description to suit the processor design, we determine the more suitable processor variant for the given code at compile time, and dynamically reconfigure the system accordingly. To do so, we first need to classify each code according to the processor variant it suits best. This classification, in turn, gives us confidence in performance symmetry across various types of codes, and is the primary contribution of the proposed work. We first develop mathematical performance models of three conventional microprocessor designs, and then propose a symmetry-improving nonlinear optimization method to achieve code-to-design mapping. Our analysis is based on four different architectures and 324,000 different assembly language codes, each with between 10 and 1000 instructions and with different percentages of commonly seen instruction types. Our results suggest that in the sub-micron era, where the execution time of each instruction is merely a few nanoseconds, codes in which hazard-causing instructions account for as little as 5% (or more) execute more swiftly on processors without pipelines.


Introduction
Reduced Instruction Set Computer [1], or RISC for short, has seen tremendous advancement over the last four decades. Starting from the simple MIPS [2], RISC processors came to dominate smartphones and tablet computers [3], and have recently been used in the supercomputer Sunway TaihuLight [4], comprising ten million cores, making it the fastest supercomputer in the world (https://www.top500.org/lists/2017/11/). It has also been reported (https://www.theverge.com/2017/3/9/14867310/arm-servers-microsoft-intel-compute-conference) that Microsoft has recently unveiled its new Advanced RISC Machines (ARM) server designs, thereby beginning to challenge Intel's dominance of the industry.
Earlier versions of the RISC processor did not have pipeline stages; instead, each instruction was executed in exactly one clock cycle in a mutually exclusive manner, hence the name Single Cycle Processor (SCP) [5]. The performance limitations of the SCP were addressed in advanced versions, which were based on multicycle execution of each instruction, and later by the incorporation of pipelining, in which multiple instructions could be executed in parallel [6]. Since pipelining was supposed to guarantee massive throughput, it became the de facto architecture for most modern processors [7,8], video coders/decoders [9,10], and crypto systems [11,12], to name a few. Unfortunately, it was completely overlooked that there might be situations in which the older variants, the SCP and the Multicycle Processor (MCP), could outperform the Pipelined Processor (PiP). One such situation arises from the way an assembly language program is written. It is often the case that a particular instruction causes hazards that are not suitable for the PiP, and executes more swiftly on an SCP instead. The more hazard-causing instructions there are in the assembly language code, the more suitable the simpler variant will be. Therefore, in this work we analyze the given assembly language code first, before making a choice of processing architecture.
While continued miniaturization has allowed the integration of billions of transistors on a chip [13], and resources are readily available in abundance, excessive power consumption, rather than size, is becoming a much graver concern in digital circuits such as microprocessors. In our context, fabricating the three variants of the RISC processor on a single chip, and then choosing between them according to the assembly language code, would resolve the problem, however only at the expense of increased dynamic power consumption. Luckily, there exists a technique called Dynamic Partial Reconfiguration (DPR) [14,15], which allows real-time reconfiguration of one part of the circuit while another part is still in execution. It is therefore possible to generate partial bit files for each variant, keep them in the reconfigurable memory of the chip, and download one of them at a time onto the system according to the code being executed, without having to stop or restart the whole system. Importantly, this constrains the power consumption to the active processor type only. This, however, is only possible if we are first able to classify the assembly language codes according to their suitable processor variant, using some classification method [16].
In this work, we first develop a mathematical performance model for each of the three design paradigms (SCP, MCP, and PiP) using a set of commonly seen instructions. By subjecting these models to a symmetry-targeted monotone optimization technique, we determine which variant a given assembly language code, with a certain percentage of each instruction, suits best. We carry out our analysis on 8-bit, 16-bit, 32-bit, and 64-bit MIPS processors, where the number of instructions in each code varies between 10 and 1000. Our confidence stems from 324,000 assembly language codes, each comprising a different percentage of each instruction, per processor architecture. Please note that it is beyond the scope of this work to present the design and operation of the dynamically reconfigurable (DR) processor; instead, we evaluate the use of each design paradigm for the given assembly language program by comparing their execution times with each other, and advocate the DR processor that promises performance symmetry in every circumstance for the given code at run time. The major contributions of this work, therefore, are as follows: 1. Performance modeling of three conventional processor types for commonly seen instructions; 2.
Classification of assembly language codes for code-to-processor mapping using an optimization technique based on a symmetry-improving nonlinear transformation. We conclude that in the sub-micron era, where the execution time of each instruction is merely a few nanoseconds, codes in which hazard-causing instructions account for as little as 5% execute more swiftly on processors without pipelines. Our results shall be vital in the context of multi-processor systems-on-chip and chip multi-processors, where one more efficient function unit is replaced by multiple simpler variants in order to attain increased throughput by exploiting parallelism, while keeping the complexity of the system unaffected or only marginally increased [17]. To the best of our knowledge, there is no framework available in the literature that could be considered equivalent to the proposed one. The rest of the paper is organized as follows: In Section 2, we review some of the recent applications of DPR, and introduce our basic processors and the instructions that they support.
A mathematical performance model for each processor is presented in Section 3, which is subsequently used to define three optimization problems and their solutions in Section 4; this is the main contribution, which also comprises our proposed research methodology. Section 5 presents results and evaluation, along with a few sample assembly language codes that suit the SCP more than the PiP according to the proposed formulation. We conclude the paper in Section 6.

Dynamic Partial Reconfiguration
Dynamic Partial Reconfiguration (DPR) is a technique used to update logic blocks dynamically while a part of the system is still in execution. DPR allows designers to generate partial bit files, which can be implemented and downloaded into the system without the need for a system shutdown or restart. As a result, the system functionality is upgraded at runtime without any interruption.
Digital systems using the concept of DPR can be divided into two parts: a static, non-reconfigurable part of the design, and a runtime-reconfigurable part. The former uses the generated full bit stream of the design, downloaded into the system at boot time, whereas the runtime-reconfigurable part of the design may comprise several independent reconfiguration regions. These regions can be reconfigured at runtime by downloading the generated partial bit streams, without affecting the functionality of the static, non-reconfigurable part [18].
The system reconfiguration time for a specific reconfigurable region using DPR is proportional to the partial bit stream size. This timing constraint is a key factor in determining the worst-case execution time of the design, and is considered a time overhead each time the system is reconfigured [19].
The major advantage of DPR is that it enhances design flexibility and minimizes the design area. This promising feature can be used to implement numerous system applications in diverse engineering fields, such as signal, image, and video processing [20]. The concept of DPR is also used for database management: an energy-aware SQL query acceleration method using DPR on the Xilinx Zynq platform has been presented [21], showing a significant improvement in energy consumption compared to an x86-based system. Another diverse field recently using DPR is the evolution of artificial neural networks on FPGAs; a unique method to address fault problems in the synapses of spiking neural networks using astrocyte regulation, inspired by the brain's recovery process, has been demonstrated [22].
The DPR concept may also be used in applications and systems where latency is one of the prime factors determining the system's performance. Various processor design styles [23,24] may be implemented at runtime, which can have a significant impact on execution time, and thereby on the performance of a specific program, as discussed in this work.

Processor Design Styles
RISC architectures, especially when used in industrial embedded systems applications, generally follow one of three design paradigms: single cycle, multicycle, and pipelined. MIPS [2] is still considered the benchmark architecture lying at the core of most modern RISC processors. This is why, in this work, we restrict our analysis to MIPS, while keeping our assumptions and methodology as general as possible. In what follows, we briefly present the design and operation of each of the three design paradigms in turn.

SCP
As the name suggests, an SCP is guaranteed to execute each instruction in the instruction set architecture (ISA) in exactly one clock cycle, where each instruction is supposed to access the various function units constituting the processor. These units typically include the instruction/program memory, register file, arithmetic and logic unit (ALU), data memory, and control unit (CU), each with a different latency. Since each instruction may access a different set of units in a unique sequence, the execution time of each instruction will differ, and naturally, the clock cycle time must be long enough to accommodate the slowest instruction (the one with the largest execution time).

MCP
An MCP executes each instruction in more than one clock cycle, depending upon the number of function units it accesses. Therefore, a longer instruction will consume several clock cycles while executing, whereas shorter instructions will consume fewer. The clock cycle is just long enough to accommodate only one function unit; naturally, it is the slowest function unit that dictates the clock cycle time.
Since only one function unit works in each clock cycle, it has become a convention to name each clock cycle after the function unit in charge of that cycle: Instruction Fetch (read the program memory), Decode (read the operands in the register file, while the CU decodes the OP-Code), Execute (the ALU either performs the desired operation or computes the physical address to read/write the data memory), Memory Access (read/write the data memory, or register write for some instructions), and Write-back (data read from memory is written to a register).

PiP
The pipelining technique divides the datapath into n pipeline stages, named exactly like the clock cycles of the MCP, where each stage consists of exactly one function unit. These processors are supposed to achieve higher throughput than the previous ones by ensuring that no pipeline stage remains idle at any point in time. Instead, n instructions can form a queue in the datapath, each occupying a pipeline stage simultaneously, thereby exploiting parallelism. Each stage m needs to be synchronized with its neighboring stages m − 1 and m + 1, otherwise data from one stage may interfere with the operation of the next. Understandably, the clock cycle time is determined by the slowest pipeline stage, which ensures proper synchronization between the pipeline stages. Figure 1 presents the timing diagram of the three variants on random instructions. A drawback associated with the PiP is the existence of structural, data, and control hazards [25]. Without going into the details of each of these, it is essential to mention that a few data and control hazards require stalling the pipeline, inserting a bubble, or flushing a pipeline register for correct operation. In either case, each such hazard incurs a delay of one time slot on top of the normal execution time of the instruction causing the hazard. If the code to be executed comprises several hazard-causing instructions, the execution time may exceed that of the SCP or MCP, making the latter an appropriate choice specifically for this code. However, in integrated circuits that are not dynamically reconfigurable, one has to bear this undesired overhead.

Instruction Types
As there can be infinitely many types of instructions, each accessing the function units any number of times, it is usually reasonable to restrict the analysis to specific instruction types. There are five basic types of instructions supported by, more or less, every microprocessor. These are: 1. Register (R)-Format, in which the source as well as the destination operands belong to the register file. 2. Load Word (LW), in which a data item is fetched from data memory and loaded into a register. The physical address is formed by adding a base address, which comes from a register, to an offset encoded in the instruction. 3. Store Word (SW), in which a data item is read from a register and moved into a location in data memory, where the physical address is computed in the same manner as for LW. 4. Branch, in which the flow of the program changes based on a condition: instead of fetching the next sequential instruction, the instruction at the target address is fetched onto the processor. The condition is usually checked by the ALU or a comparator on operands from the register file. Please note that by the time the condition is checked (say, found true), at least one instruction, usually the one next in line sequentially, may have already been fetched into the pipeline, leading to a control hazard in the case of the PiP. It is called a hazard since the incorrectly fetched instruction needs to be flushed out of the pipeline before it carries out an erroneous activity, e.g., a memory read/write or a register write. 5. Jump, in which the flow of the program changes unconditionally. As with the branch instruction in a PiP, Jump will require flushing the pipeline at least once before the correct instruction is fetched.
Although there are several variants of these instructions, we will carry out the formulation only for the basic instructions enumerated above in the following section. Our objective is to estimate the execution times for different instruction mixes, i.e., for assembly language codes comprising varying percentages of the selected instructions, on each processor. The execution time of a program comprising I instructions is, in general, given by Equation (1):

E = I × CPI × CLK, (1)
where CPI is the number of clock cycles per instruction, and CLK is the clock cycle time. The formulation in the next section will enable us to classify codes according to their appropriateness for each type of processor. This classification requires optimization of the performance models, which can be achieved by various methods, generally categorized into deterministic and stochastic optimization methods, discussed next.
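As a quick numerical illustration of Equation (1), the short Python sketch below computes the CPU time of a code; the instruction count, average CPI, and clock period are hypothetical values chosen only for the example.

```python
def cpu_time(instr_count, cpi, clk_ns):
    """Equation (1): execution time = I x CPI x CLK (result in ns)."""
    return instr_count * cpi * clk_ns

# Hypothetical example: 200 instructions, average CPI of 1.4, 2 ns clock.
print(cpu_time(200, 1.4, 2.0))  # 560.0
```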

Optimization Methods
Stochastic optimization methods are used in situations where data, for some reason, are not known in advance, or at least not known with certainty [26]. Bio-inspired techniques have, over the years, gained attention in solving such optimization problems, which incorporate the uncertainty into the model; a few recent examples include [27-30]. The advantages of these techniques are best leveraged when the solution search space is not well structured or understood. Naturally, their latency, in terms of convergence rate, is substantial; yet they are known as best-effort techniques, since their objective is to obtain a near-optimal solution [31,32].
As shall be seen in Sections 3 and 4, the mathematical models that we have developed, and the optimization problem at hand, hardly involve any uncertainty, and the solution search space is well structured as well. This relieves us from employing the much more complex stochastic optimization, and lets us turn to much simpler solutions for deterministic optimization problems. The mathematical model yields a nonlinear optimization problem, for which we make use of a symmetry-targeted nonlinear transformation [33] (discussed later in Section 4), followed by the widely adopted method of linear programming. An optimization problem is a linear programming problem if its objective function, decision variables, and constraints are all linear. Such problems are typically handled using the simplex method, in which the decision variables are iteratively updated to yield the optimal feasible solution (optimal objective function) [34].
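To illustrate the property the simplex method exploits, the toy linear program below (a made-up objective and constraints, not the paper's OP formulation) shows that a linear objective over a convex polytope attains its optimum at a vertex; simplex simply walks from vertex to vertex until no neighbor improves the objective.

```python
# Toy LP: maximize 3x + 2y subject to x + y <= 4, 0 <= x <= 2, y >= 0.
# The feasible region is a quadrilateral; we enumerate its four vertices
# and evaluate the objective at each, since a linear optimum must lie
# on a vertex of the polytope.
vertices = [(0, 0), (2, 0), (2, 2), (0, 4)]
best = max(vertices, key=lambda v: 3 * v[0] + 2 * v[1])
print(best, 3 * best[0] + 2 * best[1])  # (2, 2) 10
```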
While the details of the proposed optimization method will be presented in Section 4, in what follows we first present the modeling of the three processor types.

Preliminary Assumptions
Let L = {α 1 , α 2 , . . . , α 5 } be the set of latencies of the major function units involved in the execution of a typical instruction on a processor implementing the Harvard architecture. Here α i , ∀i ∈ N and i < 6, represents the IM access, RF access, ALU operation, DM access, and CU operation latencies, respectively. For simplicity, we assume that α RFread = α RFwrite and, similarly, α DMread = α DMwrite . Also, it is realistic to assume that α CU < α RFread . Without loss of generality, let us assume the instruction mix is as follows: Branch = x 1 %, Jump = x 2 %, R-Format = x 3 %, Load = x 4 %, and Store = x 5 %. Considering the fact that each variant of the processor will suffer the same penalty, we also assume the probability of a read/write miss to be zero.

Formulation for SCP
The execution time for each type of instruction on the Single Cycle Processor is formulated in Table 1.

Table 1 columns: Instruction, Expression.
The clock cycle time for this type of processor (CLK S ) is given by Equation (2):

CLK S = max(E i ), (2)

where E i denotes the execution time of instruction type i from Table 1.
The execution time (E S ) of the given code, also termed the User CPU Time, is given by Equation (3):

E S = I × CLK S , (3)

where I refers to the number of instructions in the given code; note that the CPI for the SCP is 1. It is widely understood that by employing pausible clocks [25], each instruction may be executed with a different clock cycle, and therefore the performance of such processors may be significantly improved. In that case, Equation (2) will not hold; we will have to compute the average clock cycle time, considering the different E i ; let us denote it CLK SV , given by Equation (4):

CLK SV = Σ i 0.01 x i E i . (4)
Similarly, Equation (3) for the execution time of this variant must be modified accordingly, giving Equation (5):

E SV = I × CLK SV . (5)
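To make the SCP formulation concrete, the following Python sketch contrasts the fixed-clock time E S , Equation (3), with the pausible-clock time E SV , Equation (5). The unit latencies, the per-instruction unit paths, and the instruction mix are all illustrative assumptions; the paper's Table 1 gives the exact expressions.

```python
# Hypothetical latencies (ns): IM access, RF access, ALU, DM access, CU.
alpha = {"IM": 4, "RF": 2, "ALU": 3, "DM": 4, "CU": 1}

# Which units each instruction type traverses (illustrative assumption).
path = {
    "R":      ["IM", "RF", "ALU", "RF"],
    "LW":     ["IM", "RF", "ALU", "DM", "RF"],
    "SW":     ["IM", "RF", "ALU", "DM"],
    "Branch": ["IM", "RF", "ALU"],
    "Jump":   ["IM"],
}
exec_time = {k: sum(alpha[u] for u in units) for k, units in path.items()}

I = 100
clk_s = max(exec_time.values())   # fixed clock fits the slowest instruction
E_S = I * clk_s                   # Equation (3)

mix = {"R": 0.4, "LW": 0.2, "SW": 0.2, "Branch": 0.1, "Jump": 0.1}
clk_sv = sum(mix[k] * exec_time[k] for k in mix)  # Equation (4)
E_SV = I * clk_sv                 # Equation (5)
print(E_S, E_SV)
```

With these numbers the pausible clock wins comfortably, because the short Jump and Branch instructions no longer pay for the long Load Word path.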

Formulation for MCP
In multicycle processors, the clock cycle time CLK M is determined by the slowest function unit, and each instruction may consume multiple clock cycles to execute, as shown in Table 2. The execution time, E M , is given by Equation (6):

E M = I × CPI av × CLK M , (6)
where CPI av and CLK M are given by Equations (7) and (8), respectively:

CPI av = Σ i 0.01 x i CPI i , (7)
CLK M = max(α 1 , . . . , α 5 ), (8)

with CPI i taken from Table 2.

Table 2 columns: Instruction, Number of Clock Cycles.
Once again, pausible clocking may be employed to reduce CLK M , but this time the improvement will not be notable. While clock cycles 1 and 4, requiring memory accesses, should follow Equation (8), clock cycles 3 and 5 should be dictated by α 3 and α 2 respectively, and clock cycle 2 should depend upon max(α 2 , α 3 ) due to the overlap of register file access and ALU operation. The average CLK MV in this case is computed as in Equation (9).
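A corresponding sketch for the MCP, Equations (6)-(8), is given below. The per-instruction cycle counts are the standard MIPS multicycle values, assumed here for illustration since the paper's exact counts are in Table 2; latencies and mix are the same hypothetical numbers as before.

```python
# Clock cycles per instruction on the multicycle design (standard MIPS
# multicycle counts, assumed for illustration).
cpi = {"R": 4, "LW": 5, "SW": 4, "Branch": 3, "Jump": 3}
mix = {"R": 0.4, "LW": 0.2, "SW": 0.2, "Branch": 0.1, "Jump": 0.1}

alpha = [4, 2, 3, 4, 1]          # hypothetical unit latencies (ns)
clk_m = max(alpha)               # Equation (8): slowest function unit

cpi_av = sum(mix[k] * cpi[k] for k in mix)  # Equation (7)
I = 100
E_M = I * cpi_av * clk_m         # Equation (6)
print(cpi_av, E_M)
```

Note how the CPI of 3 for the short Jump and Branch instructions penalizes the MCP relative to the SCP, exactly as observed in the Discussion.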

Formulation for PiP
Unfortunately, the formulation for the PiP is not that simple. Data hazards that require stalling the pipeline add an extra time slot; the more the hazards, the more cycles are wasted. Similarly, control hazards also increase the latency by one time slot: the Jump instruction unconditionally costs an extra time slot, and Branch instructions, when taken, do the same. All this needs to be accounted for while computing the exact CPU time for the given code.
Equation (8) holds for this processor too, so CLK P = CLK M . Apart from the time slots wasted due to the conditions discussed above, the execution time is computed as follows: the first instruction consumes n time slots for an n-stage-deep pipeline, while each of the following instructions is executed in one time slot. This is given by Equation (10):

E P = (n + I − 1) × CLK P . (10)
As far as additional time slots due to hazards are concerned, each case needs to be addressed independently. We have already mentioned that each Jump instruction unconditionally costs an extra time slot. Therefore, x 2 % Jump instructions will add an overhead of 0.01 × x 2 × I clock cycles to the overall execution time. Likewise, for Branch instructions that turn out to be taken, 0.01 × x 1 × I clock cycles will be added. However, in this case, the probability of a Branch being taken must be considered as well. Since this is nondeterministic, we assume a fair decision, i.e., 50% of branches are taken.
The last case that remains is that of a data hazard forcing a stall in the pipeline, i.e., a Load instruction followed by a dependent instruction. Recall that any instruction, other than Jump, may cause a data hazard with the preceding Load instruction with some probability deduced from the total number of registers in the register file. Furthermore, the dependency may exist between the target register ($Rd) of the Load and either of the two source operands ($Rs) or ($Rt) of the following instruction. The probability of a hazard due to matching of ($Rd) with any one of ($Rs) and ($Rt) is given by Equation (11):

P match = 1 − ((R max − 1)/R max ) 2 , (11)

where R max is the total number of registers in the register file. Since a hazard cannot be caused by a Jump instruction, the probability of the following instruction being a hazardous one is important, and is given by Equation (12):

P hz = P match × (1 − 0.01 x 2 ). (12)
Taking into account the cases of control and data hazards discussed above, the modified execution time for the PiP is given by Equation (13).
Substituting Equation (10) into Equation (13) and simplifying for n = 5, the simplified E P is given by Equation (14).
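The hazard accounting above can be sketched in a few lines of Python. This is a sketch of Equation (13) under the stated assumptions (one lost slot per jump, 50% taken branches, one stall per load-use hazard, and the matching probabilities of Equations (11) and (12) modeled as uniform, independent register choices); the latencies and mix in the usage line are hypothetical.

```python
def pip_time(I, x, alpha, n=5, R_max=32):
    """Sketch of the pipelined execution time with hazard overheads.
    x = (branch%, jump%, rformat%, load%, store%); alpha = unit latencies (ns)."""
    clk_p = max(alpha)                        # CLK_P = CLK_M, Equation (8)
    cycles = n + I - 1                        # Equation (10): fill + drain
    cycles += 0.01 * x[1] * I                 # every Jump flushes one slot
    cycles += 0.5 * 0.01 * x[0] * I           # taken branches (fair coin)
    p_match = 1 - ((R_max - 1) / R_max) ** 2  # $Rd hits $Rs or $Rt, Eq. (11)
    p_hazard = p_match * (1 - 0.01 * x[1])    # successor not a Jump, Eq. (12)
    cycles += 0.01 * x[3] * I * p_hazard      # load-use stalls
    return cycles * clk_p

print(pip_time(100, (10, 10, 40, 20, 20), [4, 2, 3, 4, 1]))
```

Raising the branch and jump percentages in the mix quickly inflates the result, which is the regime where the SCP starts to win.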

Estimating Worst and Best Case Performance
Our objective is to maximize and minimize E ∗ for estimating the worst- and best-case performance of each processor type, where ∗ ∈ {S, SV, M, P_simp}, corresponding to Equations (3), (5), (6), and (14) respectively. For a given I, E S , Equation (3), is a constant value; therefore, E Smax = E Smin , making this type of processor suitable only for codes comprising very few instructions.
For the case of the PiP, the instruction mix dictates the best- and worst-case performance. E P_simp , Equation (14), is modified as in Equations (17) and (18) to obtain E P_simp_min and E P_simp_max accordingly.

Discussion
Based on the formulation presented above, a few observations may be conveniently made: 1. The second variant of the SCP performs much better for shorter instructions, such as Jump and Branch. So, the more shorter instructions in the code, the more suitable the SCP will be. 2. The performance of the PiP depends entirely on the instruction mix: if there is no hazardous instruction, this type stands out as the best. However, the more control hazards in the code, the larger the execution time will be. Furthermore, CLK P is dictated by the slowest pipeline stage, which means that the larger the difference between the latencies of the function units, the larger CLK P will be in comparison to CLK SV . 3. In terms of performance, it is difficult for the MCP to beat the other two. The reason is its CPI of 3 for shorter instructions, which suit the SCP more. On the other hand, the PiP will outclass it for longer instructions.
So, we have to optimize the instruction mix and the latencies of the function units to determine the regions where the SCP outperforms the PiP. Based on our results, compilers will be able to determine the better processing platform for the given application at run time.

Problem Statement & System Model
The first objective, optimization problem (OP 1 ), of the proposed work is given by Equation (19), subject to the constraints C 11 to C 31 that the optimized solution must satisfy, where E P_simp and E SV are given by Equations (14) and (5) respectively, in terms of the decision variables x = {x i }, ∀i ∈ {1 . . . 5}. The inputs to our system include two types of microprocessors that have five types of instructions in their ISA, and we assume 8-, 16-, 32-, and 64-bit architectures, which usually support eight, sixteen, thirty-two, and sixty-four general purpose registers respectively. Another input is the latency of the function units, α i , ∀i ∈ {1 . . . 5}. Please note that α 1 and α 4 are technology dependent, for DDR(2-4) SDRAM. The number of instructions in the given code, I, is the final input.
Using Equations (14) and (5), and following some trivial simplification and rearrangement, E P_simp − E SV may be written as Equation (20), which may be further simplified to Equation (21). Clearly, the product x 2 x 4 makes our objective function a nonlinear programming (NLP) problem [35]. Solving NLP problems requires nonconvex-to-convex relaxation [36], i.e., a symmetry-improving nonlinear transformation that relaxes the bounds in order to eliminate the possibility of multiple local minima. Shachar et al. demonstrated that variable transformation with a symmetric distribution (close to Gaussian) helps in achieving linearity in the inter-variable relationship. Although it is often not expedient to achieve symmetry due to the irregular structure of the variable, it proves advantageous. We therefore introduce a variable, Z, as a first step to linearize Equation (21), leading to OP 2 , given by Equation (22),
subject to the corresponding constraints.

Convex Relaxation using McCormick's Envelopes
Although the nonlinear-to-linear transformation reduces computational complexity, the obtained solution will only be optimal for the relaxed objective function, rather than for the original one. Therefore, the relaxation only provides a lower bound close to the actual optimal solution, while the upper bound may then be obtained by solving the original nonconvex problem using the solutions acquired from the relaxed optimization [38]. Please note that a tighter relaxation of the bounds will yield solutions closer to the optimal one. McCormick's envelopes provide such a relaxation, i.e., they retain convexity while keeping tight bounds [39]; Figure 2 illustrates the concept. For solving OP 2 using McCormick's envelopes, recall that x 2 x 4 = (Z − U(x))/6I from C 22 . The under- and over-estimators are then given by the four McCormick inequalities, with the lower and upper bounds given above by C 31 and C 42 . With these new linear constraints, we are able to transform the nonlinear problem, Equation (22), into a linear optimization problem, OP 3 , subject to linear constraints (including the bound ≤ 100x 4 ), which may be solved using the widely used interior point [40] or simplex method of linear programming [34].
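The envelope property can be checked numerically. The sketch below writes the four standard McCormick estimators for the bilinear term w = x 2 x 4 on the box [0, 100] × [0, 100] (percentages, as in OP 2 ) and verifies that they bracket the true product everywhere on a grid; this convex/concave pair is what lets the relaxed problem be posed as a linear program.

```python
# McCormick's envelopes for w = x2*x4 with x2, x4 in [lo, hi].
lo, hi = 0.0, 100.0

def mccormick(x2, x4):
    under = max(lo * x4 + x2 * lo - lo * lo,   # w >= x2L*x4 + x2*x4L - x2L*x4L
                hi * x4 + x2 * hi - hi * hi)   # w >= x2U*x4 + x2*x4U - x2U*x4U
    over = min(hi * x4 + x2 * lo - hi * lo,    # w <= x2U*x4 + x2*x4L - x2U*x4L
               lo * x4 + x2 * hi - lo * hi)    # w <= x2L*x4 + x2*x4U - x2L*x4U
    return under, over

# Every grid point must satisfy under <= x2*x4 <= over.
valid = all(u <= x2 * x4 <= o
            for x2 in range(0, 101, 5)
            for x4 in range(0, 101, 5)
            for u, o in [mccormick(float(x2), float(x4))])
print(valid)  # True
```

Tightening `lo`/`hi` (e.g., from constraint bounds like C 31 and C 42 ) shrinks the gap between the estimators, which is why tighter bounds yield solutions closer to the true optimum.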

Proposed Methodology and Algorithm
Figure 3 summarizes our research methodology. We initialize the system by randomly choosing various architectures (Arch., Γ), various codes of different lengths (C), and a large set of α i arranged in pages (F), as shown. Each round of execution makes a selection from each of these inputs, and constructs a vector of lower bounds (lb) on each type of instruction, except for Jump. To determine confidence intervals for feasible solutions, at each iteration we increment the lb on each type of instruction by a certain number; details will be given in the evaluation section. By doing so, the optimizer (LP) is forced to find solutions with higher percentages of Jump instructions, providing a wide coverage of feasible solutions. The proposed Algorithm 1 continues to run until the number of pages (P max ), codes (C max ), and architectures (R max ) has been exhausted. During each round of execution, a page of feasible solutions, comprising the percentage of each instruction, is recorded. Upon termination, the average percentage of each instruction (A) across the pages is computed via a 3D-to-2D transformation (T mat ), after which an in-depth analysis of the results is carried out. This analysis mainly involves observing the ratios of Jump and Branch to the rest of the instructions in the feasible solutions (B R 2 , J R 2 ). The objective of this analysis is to determine the contribution of the former two instructions in a code that satisfies OP 3 , i.e., the cases where the SCP performs better than the PiP.
Input: Γ ← set of architectures, C ← set of codes, α ← latency of function units, F ← 3D matrix of α, LP ← linear approximation, A ← average vector, B R 2 ← branch-to-rest ratio, J R 2 ← jump-to-rest ratio, P max ← maximum number of pages, k ← number of rows, l ← number of columns, R max ← maximum number of architectures, C max ← maximum number of codes, B ← Branch, J ← Jump, RF ← R-Format, LW ← Load Word, SW ← Store Word, LB ∈ R (C V ×V) ← vector of lower bounds, T mat ← transformed 2D matrix, V ← maximum variables of search space, C ← possible combinations in search space.
Initialization:

Evaluation and Sample Codes
To verify the proposed approach, we have modeled the linear optimization problem given by Equation (23), using the methodology described in Section 4.3, in MATLAB. The linear programming solver that we have used is linprog, which uses the dual-simplex method to generate an optimal solution. Our samples and datasets are initialized as discussed in the following subsection.

Data Initialization
For evaluating and comparing the performances of the SCP and PiP with each other, we have chosen the following vectors: • α = {α 1 , . . . , α 5 }, with each α i ∈ {1, . . . , 8} ns, representing the propagation delays of the function modules. The vector is chosen as such for simplicity, since α i , ∀i ∈ {1, 4}, is technology dependent, and may lie in the range 1-8 ns for recent technology nodes. Here, α 2 and α 3 will always be smaller than the other two, and are randomly selected. • I = {10, 100, 500, 1000}, representing four different assembly language code lengths. These values will give us a confidence interval for the performance of each processor variant.
• lb = {x 1_min , x 2_min , x 3_min , x 4_min , x 5_min }, representing the lower bound on each type of instruction, x i , in percentage. Since we already know that Jump is the shortest instruction, and will matter the most in yielding feasible solutions to the optimization problems, we do not constrain its lower bound, and rather treat it as an output; therefore, x 2_min = 0. We iteratively vary the rest between {0, 10, 20}, resulting in 3 4 (= 81) assembly language codes with different instruction mixes.
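The lower-bound sweep described above can be reproduced in a few lines of Python (the tuple layout (x 1 , x 2 , x 3 , x 4 , x 5 ) is our own convention for illustration):

```python
from itertools import product

# x2 (Jump) is left unconstrained (lower bound 0), while the lower bounds
# on Branch (x1), R-format (x3), Load (x4) and Store (x5) each take values
# in {0, 10, 20}%, giving 3**4 = 81 distinct constraint vectors.
lower_bounds = [(x1, 0, x3, x4, x5)
                for x1, x3, x4, x5 in product((0, 10, 20), repeat=4)]
print(len(lower_bounds))  # 81
```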
To have acceptable confidence in our results, we had to exploit a larger sample space; we therefore randomly generated 1000 values for each α_i, resulting in a total of (1000 × 81) permutations per assembly language code length per architecture. The results given and discussed below are average values of these 81,000 × 4 = 324,000 iterations for each architecture.
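The sample-space construction above can be sketched in a few lines. The sampling helper below is our own illustrative reading of the constraints (α_2 and α_3 always smaller than α_1 and α_4, all within 1–8 ns); the exact distribution used in the experiments is not prescribed here.

```python
import itertools
import random

# Sketch of the data initialization: x2 (Jump) keeps a lower bound of 0;
# the other four lower bounds each vary over {0, 10, 20} percent.
lb_vectors = [(x1, 0, x3, x4, x5)
              for x1, x3, x4, x5 in itertools.product((0, 10, 20), repeat=4)]
assert len(lb_vectors) == 3 ** 4       # 81 instruction-mix configurations

def sample_alpha(rng=random):
    # alpha_1 and alpha_4 in 2..8 ns; alpha_2 and alpha_3 drawn to be
    # strictly smaller than both (illustrative assumption).
    a1, a4 = rng.randint(2, 8), rng.randint(2, 8)
    hi = min(a1, a4)
    a2, a3 = rng.randint(1, hi - 1), rng.randint(1, hi - 1)
    return (a1, a2, a3, a4)

alphas = [sample_alpha() for _ in range(1000)]
print(len(lb_vectors) * len(alphas))   # 81,000 permutations per code length
```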

Simulation Results
Some simulation results for an 8-bit architecture, with one hundred instructions in the assembly language code and varying lb, are summarized in Figure 4. Each plot shows instruction mix against execution time for a different lb. Please note that advancement in technology leads to smaller execution times; therefore, the x-axis on each plot may be read as spanning newer to older processors (from left to right). Before commenting on each result, recall that the optimizer is supposed to find the number of shorter (Jump, Branch) instructions in a given code such that the SCP performs better than the PiP.
In the case of unconstrained lb, i.e., when lb = [0 0 0 0 0], observe how conveniently the optimizer is able to find feasible solutions. Especially at smaller execution times, feasible solutions exist without any significant contribution by the shorter instructions. At greater execution times (>20 ns), however, a much larger accumulated contribution (>50%) is needed from the shorter instructions for feasible solutions to exist. With increasing lb on x_3, x_4, and x_5, observe how the number of feasible solutions continues to decrease. For example, consider the case where lb = [0 0 20 20 10], i.e., where x_3, x_4, and x_5 together constitute more than half of the instructions in the given code: the optimizer fails to find feasible solutions beyond an execution time of 16 ns. Furthermore, the obtained feasible solutions comprise a larger percentage of Jump instructions. By continuously increasing lb, one may easily observe that the number of feasible solutions continues to drop. For example, when lb = [20 0 20 20 20], there exists no feasible solution beyond an execution time of 5 ns. These results are interpreted as follows: for ∑ α_i > 20 ns, i.e., relatively older processors, an assembly language code in which Jump and Branch account for fewer than 50% of the instructions suits the PiP more than the SCP. On the other hand, for recent technology nodes with ∑ α_i ≤ 5 ns, codes with merely 20% contribution by the shorter instructions will suit the SCP more than the PiP. To observe how the number of Branch instructions in a code should vary to yield feasible results against increasing execution times, we have plotted the ratio between Branch (x_1) and the rest of the instructions (x_3, x_4, and x_5), as shown in Figure 5. These results were also generated over the same number of iterations as before, and their average was then computed. For ease of understanding, we have plotted them only for a few specific cases; the trend, however, remains the same across all iterations, as depicted. While the vertical axis here corresponds to the obtained ratio between Branch and the other instructions, the horizontal axis shows the size of each page (we randomly selected 20 samples of α_i per page, and the number of pages was 1000). The α samples were initially sorted in ascending order, i.e., the last sample leads to the highest execution time (the oldest processor, in other words). It may be conveniently observed that with increasing execution times, the number of Branch instructions must continue to increase with respect to the rest of the instructions, except Jump.
Similarly, Figure 6 presents the ratio of Jump to the rest of the instructions. Being the shortest instruction in the ISA, Jump must contribute significantly more to the given code than the rest. Once again, the trend suggests that for larger execution times, the contribution of the shorter instruction, Jump in this case, must be enormous, mostly ≥ 50%.
It is important to note in these results that the reduction in feature sizes and voltage swings, leading to faster circuits and therefore faster processors, is giving the SCP an opportunity to outperform the more modern PiP on average assembly language programs. Therefore, modern processing platforms are recommended to offer more flexibility, so as to be able to switch between multiple architectures and design styles, say by means of dynamic partial reconfiguration.

Sample Codes and Mapping
The following three assembly language codes have been adopted (as they were) from two textbooks: one on MIPS32 and the other on 8051 microprocessors. The purpose of presenting them here is to map each of them to one of the two design paradigms we have discussed, SCP and PiP, with respect to technology, i.e., execution times. Please note that the first two are high-level descriptions, whose assembly codes may be found in the reference book, or with the Supplementary Material.

Code 3 (8051):
                         ;start timer
LOOP:   JNB TF0, LOOP    ;wait for overflow
        CLR TF0          ;clear timer overflow flag
        CPL P1.0         ;toggle port bit
        SJMP LOOP        ;repeat
        END

It may be seen in the above codes that the use of jumps and branches increases in scenarios where some operations are to be performed repeatedly, i.e., in loops. In Code 2, the total number of iterations is determined based on n, and the complexity is n(n−1)/2 in the average case. In Code 3, the delays are implemented to generate a square wave with a 50 percent duty cycle and a period of 100 microseconds. More than 90 percent of the time, the processor will be busy processing jumps and branches. Based on this knowledge, the following code-to-processor mapping, i.e., suitability, may be concluded.
The approximate contribution by Jump and Branch instructions in each of the three codes is 30%, 20%, and 90%, respectively. For these statistics, the proposed framework suggests that Code 1 will map much better onto the SCP than the PiP only if the execution time satisfies ∑ α_i < 5 ns. This may be observed in the top four plots in Figure 4. Similarly, if ∑ α_i < 3.5 ns, the SCP will execute Code 2 better than the PiP; this may be observed in the plot corresponding to lb = [0 0 20 20 10]. Finally, for Code 3, the SCP will conveniently outperform the PiP for ∑ α_i ≤ 45 ns. Please note that these sample codes were chosen because they provide a diverse contribution by the instructions favoring the SCP. This has given us significant confidence in the obtained results.
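The mapping decision just described can be expressed as a small lookup. The three (share, cutoff) points come directly from the thresholds reported above for Codes 1–3; treating them as a step-function lookup, and the function name prefer_scp, are our own simplifications for illustration, not part of the framework itself.

```python
# Illustrative code-to-processor mapping from the thresholds reported
# above: (shorter-instruction share, Sigma alpha cutoff in ns).
# Data points correspond to Codes 2, 1, and 3, respectively.
THRESHOLDS = [(0.20, 3.5), (0.30, 5.0), (0.90, 45.0)]

def prefer_scp(short_share, sigma_alpha_ns):
    """Return True if the SCP is expected to beat the PiP."""
    cutoff = 0.0
    for share, ns in THRESHOLDS:       # pick the largest share we qualify for
        if short_share >= share:
            cutoff = ns
    return sigma_alpha_ns <= cutoff

print(prefer_scp(0.30, 4.0))   # Code 1 on a 4 ns processor -> True
print(prefer_scp(0.20, 4.0))   # Code 2 needs Sigma alpha < 3.5 ns -> False
```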

Conclusions
The mathematical models that we have developed suggest that there are situations, specifically assembly language codes, in which simpler processors may perform better than the more advanced pipelined processors. For a system to yield optimal performance in every situation, post fabrication, it should be possible to switch between the simpler and advanced variants, whichever is more suitable, at compile time. To do so, however, one should be able to analyze the given code and determine which variant suits it more. For this purpose, (1) we have presented performance models for three types of processors, (2) proposed a framework based on a symmetry-targeted nonlinear optimization method for code classification, and (3) advocated using dynamic partial reconfiguration to keep the area and power overhead of the system to a minimum, besides making it flexible for swift switching. Our analysis is thorough, and it leads to the conclusion that for recent technology nodes, in the submicron era, it is even more common for simpler processors to outperform pipelined processors. Therefore, modern systems should flexibly adapt to the given situation by means of dynamic partial reconfiguration.
As a prospective step following this theoretical framework, we aim to (1) design the three types of processors on an FPGA supporting dynamic partial reconfiguration, (2) execute multiple benchmark codes available in the literature, (3) estimate the performance of each type, and (4) carry out a detailed quantitative comparison between them. This will help us practically validate our mathematical models, and will give us confidence in our claim.

Figure 1. Timing in each processor variant.

Figure 2. Under- and over-estimators in McCormick's envelopes for a nonlinear function w = xy, where U and L stand for the upper and lower bounds, respectively.

Figure 5. Ratio of branch instructions to rest for feasible solutions.

Figure 6. Ratio of jump instructions to rest for feasible solutions.

Table 2. Clock cycles (C_i) to execute each type of instruction on the multicycle processor.