Programming Protocol-Independent Packet Processors High-Level Programming (P4HLP): Towards Unified High-Level Programming for a Commodity Programmable Switch

Network algorithms are building blocks of network applications. They are inspired by emerging commodity programmable switches and the Programming Protocol-Independent Packet Processors (P4) language. P4 aims to provide target-independent programming regardless of the architecture of the underlying infrastructure. However, commodity programmable switches impose tight programming restrictions due to limited resources and latency. In addition, manufacturers tailor P4 to their architectures, adding further restrictions. These intrinsic and extrinsic restrictions dilute the goal of P4. This paper proposes the P4 high-level programming (P4HLP) framework, a suite of toolchains that simplifies P4 programming. The paper highlights three aspects: (i) E-Domino, a high-level programming language that defines both stateless and stateful processing of the data plane in C-style code; (ii) P4HLPc, a compiler that automatically generates P4 programs from E-Domino programs, which removes the barrier between high-level programming and low-level P4 primitives; (iii) modular programming that organizes programs into reusable modules to enable fast reconfiguration of commodity switches. Results show that P4HLPc is efficient and robust, and thus suitable for high-level data plane programming. Expressing a data plane algorithm takes at least 5.5× more code in P4 than in E-Domino. P4HLPc is robust to policy and topology changes. The generated P4 programs achieve line-rate processing.


Introduction
A software-defined network (SDN) [1] stimulates the innovation of domain-specific languages (DSLs) [2][3][4] and switch architectures [5][6][7][8]. Programming Protocol-Independent Packet Processors (P4) is a DSL to program the data plane of programmable switches. P4 defines the data plane behavior of the programmable switches by primitives, regardless of the underlying architecture.
Although P4 [9] provides abstractions for the data plane, switch chips put tight restrictions on P4 programming [10,11]. These restrictions concern processing resources, storage resources, and latency: programmers cannot declare arbitrary numbers and combinations of the P4 hardware primitives; they must handle read/write dependencies when using storage resources; and they must pay attention to latency restrictions to guarantee line-rate processing. In addition, manufacturers tailor P4 specifications for their architectures, putting even more restrictions on P4, so programmers must also tailor their programs for different architectures. These intrinsic and extrinsic restrictions dilute the goal of P4.
Network algorithms are the fundamental blocks of network applications. They provide a network-wide view by monitoring massive network traffic. Some algorithms were implemented in software, which is weaker in speed and accuracy than programmable switches. Moving them to the data plane of the programmable switch is a better choice. In certain cases, we want to reproduce state-of-the-art hardware algorithms. Most importantly, we want to implement a new algorithm with ease. In this paper, we choose three types of network algorithms: (i) The basic algorithms, which implement a single task and can be used as modules of other algorithms, e.g., the Bloom filter [12] and Flowlet [13]. (ii) The Heavy Hitter (HH) detection algorithms, e.g., SpaceSaving [14], HashPipe [15], Randomized Admission Policy (RAP) [16], and Probabilistic RECirculation admisSION (PRECISION) [10]. (iii) The network telemetry systems, which are general-purpose algorithms, e.g., ElasticSketch [17] and SketchLearn [18].
However, programming these in P4 can be troublesome. Some algorithms are so complex that they increase the complexity of P4 programs. What is more, we have to write different P4 programs for different targets, as they impose different architectural restrictions. Conventional wisdom concentrates on target-independent programming for control plane Application Programming Interfaces (APIs), i.e., P4Runtime APIs [19], which do not change with the P4 program. There is a gap between P4 data plane programming and the state-of-the-art commodity programmable switch. This paper aims to hide the restrictions on P4 data plane targets in order to provide unified high-level programming for commodity programmable switches, as follows.
• Inspired by high-level synthesis [20] in Verilog programming and prior works [3,4,21], this paper proposes a domain-specific language (DSL), E-Domino, an enhanced version of Domino [4] that extends the grammar of Domino with loops, conditional statements, stateful atoms, and external functions. E-Domino is also C-like; thus, it is easy to use and lowers the learning curve.
• This paper devises P4HLPc (Programming Protocol-Independent Packet Processors High-Level Programming compiler), which automatically generates P4 programs from E-Domino programs. The generated P4 programs are eligible for programmable switches. P4HLPc removes the barrier between high-level programming and low-level primitives. P4HLPc comes with four fundamental techniques: loop unrolling, BTA (branch transform algorithm), atom template, and joint compiling.
• This paper proposes the idea of modular programming for commodity switches, which provides reusability for standard modules. Thus, P4 high-level programming (P4HLP) enables fast reconfiguration for commodity switches, while prior works [3,4,21] treat the programs as a whole.
Currently, P4HLP targets the Barefoot Tofino ASIC (application-specific integrated circuit) [5], which follows the RMT (reconfigurable match-action tables) structure. We hope to support more hardware structures in future work. P4HLP explores extending the functionality of the RMT pipeline to implement network algorithms. This paper uses E-Domino to implement prior algorithms on a Barefoot Tofino programmable switch [5], evaluates the availability and functionality of the P4 code automatically generated by P4HLPc, compares E-Domino code with P4 code, and evaluates P4HLPc under different conditions.
This paper is organized as follows: Section 2 presents background and some related works. Section 3 demonstrates our DSL, E-Domino, an enhanced version of the prior Domino. Section 4 illustrates the P4HLPc design. Section 5 evaluates our design with existing algorithms. Section 6 presents some related works. Section 7 discusses our design and suggests some future works. Section 8 presents our materials and methods, and Section 9 concludes this paper.

Programmable Switches
A typical PISA switch (Figure 1) includes a programmable parser, ingress, queue, egress, and deparser. The parser and deparser are reconfigurable to support user-defined packet header formats. The ingress and egress pipelines process packets through match-action tables that are arranged in stages. Match-action tables match header fields against pre-defined rules and perform the corresponding action on the packet. Actions use primitives to modify the non-persistent resources (headers or metadata) of each packet. In this paper, the term programmable switch denotes a PISA switch. Programmers configure switches with DSL [2-4] programs. The P4 language describes the data plane of programmable switches and the interface by which the control plane and the data plane communicate [25].
The P4 compiler [26] generates target-specific data-plane configurations and P4Runtime APIs from P4 programs [27]. P4Runtime APIs are target-independent compiler outputs which define interfaces for controlling the data plane elements, to enable control plane functions. The control plane executes the PTF (packet test framework) [28] script to invoke APIs that are exposed over Thrift-RPC (Thrift Remote Procedure Call) [29], including port management, table entry modification, and register operation.

Restrictions
Processing packets at line rate imposes rigid restrictions on data plane programming, as summarized in our previous work [30]:
1. Simple operations. Each stage can only perform simple operations. Only primitive arithmetic is allowed in each stage, e.g., addition and subtraction; branching on conditions of other registers is too costly to implement; and there is no loop in the P4 language.
2. Limited concurrent memory access. Each stage can access a few memory locations of different register arrays but only one location of any given register array. For the SpaceSaving algorithm [14], finding the minimum value within an array is therefore impossible. Also, two or more stages cannot access the same memory location. Because packets are processed in different stages, read and write operations run in parallel; this restriction avoids read-after-write and write-after-write hazards across stages.
3. Limited stages. As the number of pipeline stages increases, latency scales linearly and degrades performance for real-time traffic. Thus, the number of stages in a pipeline is finite.
Programmers cannot use arbitrary numbers and combinations of the P4 hardware primitives. Also, programmers must handle read/write dependencies when using storage resources. Programmers must pay attention to latency restrictions to guarantee line-rate processing. In addition, manufacturers tailor P4 specifications for their architectures. Programmers have to tailor their programs for different architectures.
For example, RMT uses a black box (Figure 2) to operate persistent resources on switch chips. A black box has two slices, the high slice and the low slice. Each slice includes two SALUs (stateful arithmetic logic units) that perform different arithmetic operations, a conditional module that performs conditional judgments, and a three-state gate blocked by both condition_hi and condition_lo. FlexPipe, by contrast, follows P4 specifications that use read and write primitives to operate the persistent resources. RMT can match on at most eight tables and modify at most eight fields in one stage, while FlexPipe has no analogous constraints.
P4HLP aims to hide the restrictions on P4 data plane architectures in order to provide unified high-level programming for commodity programmable switches. Currently, we take a small step towards this goal on the RMT architecture. P4HLP and P4Runtime both aim to hide the target-specific properties in P4 programming. However, they are different: P4HLP targets data plane programming, whereas P4Runtime targets control plane APIs. P4HLP aims to be the front end of the P4 compiler by transforming a high-level language into P4 programs, which the P4 compiler further compiles into data plane configurations and P4Runtime APIs.

E-Domino by Example
We present the design of E-Domino, taking the PRECISION [10] heavy hitter detection algorithm as an example, as shown in Algorithm 1. PRECISION measures the size of flows with the key-value pair registers $Key_i$ and $Val_i$. PRECISION consists of two paths: the regular path for measuring every packet (lines 1-14), and the recirculation path for updating statistics (lines 15-20). For each incoming packet whose flow key is iKey, PRECISION calculates the index $l_i$ using the hash function $h_i$ for stage $i$ ($1 \le i \le d$) (line 3). If iKey matches $Key_i[l_i]$, PRECISION sets the flag matched to true and adds one to the flow size $Val_i[l_i]$ (lines 4-6). Otherwise, PRECISION updates the minimum value carry_min and the stage number min_stage (lines 7-9). If the packet has no matching key in any stage, PRECISION rounds carry_min to the form $2^k$, i.e., new_val $= 2^{\lceil \log_2(\text{carry\_min}) \rceil}$ (line 11). Last, PRECISION recirculates the packet with new_val and min_stage at a probability of $1/\text{new\_val}$ (lines 12-14). The recirculation path updates the minimum key and value according to new_val and min_stage in the recirculated packet (lines 15-19) and drops the cloned packet (line 20).
We present E-Domino code in Figure 3. Combining the forwarding module and other transactions with the PRECISION algorithm into the ingress pipeline, we build a prototype of a switch.
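The two paths described above can be sketched in software as follows. This is a minimal Python simulation for illustration only, not the E-Domino code of Figure 3. The class name, the CRC32-based per-stage hash, the table width, and the handling of empty slots (a carry_min of 0 or 1 is rounded to 1) are all assumptions of this sketch.

```python
import random
import zlib


class Precision:
    """Software sketch of PRECISION with d stages of key/value registers."""

    def __init__(self, d=2, width=16, seed=0):
        self.d = d
        self.width = width
        self.keys = [[None] * width for _ in range(d)]
        self.vals = [[0] * width for _ in range(d)]
        self.rng = random.Random(seed)

    def _index(self, stage, key):
        # Hypothetical per-stage hash h_i: CRC32 salted with the stage number.
        return zlib.crc32(f"{stage}:{key}".encode()) % self.width

    @staticmethod
    def round_up_pow2(value):
        # Smallest power of two >= value; values <= 1 map to 1 (assumption).
        return 1 if value <= 1 else 1 << (value - 1).bit_length()

    def process(self, key):
        """Regular path: count on a match, otherwise track the minimum bucket."""
        matched = False
        carry_min, min_stage = None, None
        for stage in range(self.d):
            slot = self._index(stage, key)
            if self.keys[stage][slot] == key:
                self.vals[stage][slot] += 1
                matched = True
            elif carry_min is None or self.vals[stage][slot] < carry_min:
                carry_min, min_stage = self.vals[stage][slot], stage
        if not matched:
            new_val = self.round_up_pow2(carry_min)
            # Recirculate with probability 1 / new_val.
            if self.rng.random() < 1.0 / new_val:
                self._recirculate(key, new_val, min_stage)

    def _recirculate(self, key, new_val, min_stage):
        """Recirculation path: overwrite the minimum bucket with the new flow."""
        slot = self._index(min_stage, key)
        self.keys[min_stage][slot] = key
        self.vals[min_stage][slot] = new_val

    def estimate(self, key):
        for stage in range(self.d):
            slot = self._index(stage, key)
            if self.keys[stage][slot] == key:
                return self.vals[stage][slot]
        return 0
```

In this sketch, the first packet of a flow that lands on empty slots is admitted with probability 1, and every later packet of that flow increments its counter on the regular path.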

Algorithm 1: PRECISION heavy hitter algorithm.
Input: Packet with iKey. Output: Packet.
Count for normal packets, find the minimum bucket.

E-Domino Grammar
Table 1 presents the grammar of E-Domino. An E-Domino program p consists of header declarations hdr, parser definitions parser, an ingress pipeline i, and an egress pipeline e. We follow the P4 header and parser definitions and omit them for simplicity.
Modules. The ingress and egress pipelines consist of modules m, which implement the data plane functions. We propose the notion of modular programming, which writes transactions in separate modules and organizes the modules into integrated data plane functions. The modules are reusable, e.g., we can write an L2 forwarding module and reuse it in any program. Modular programming reduces the workload of repeatedly writing the same P4 programs.
P4 programs also contain several modules, i.e., parser, ingress, egress, and deparser, etc. However, P4 is coarse-grained in modularization. The ingress and egress pipeline includes several match-action tables that execute different tasks. Modular programming is fine-grained in that each module carries out a specific function. Thus, we can reuse the module to enable fast programming.
Statements. E-Domino has three types of statements: assignment statements, conditional statements, and loop statements. The assignment statements modify header fields, metadata fields, or register values. The conditional statements make up control flows in the ingress or egress pipeline, or register update conditions in register operation actions. The loop statements are the key to quickly writing programs in which different stages perform similar actions, e.g., finding the minimum value across all stages in the PRECISION algorithm. In Domino, by contrast, we have to write the code of the d stages repeatedly.
Expressions. E-Domino supports unary operations, binary operations, and relational operations in expressions. The switch does not directly support these operations. Thus, we transform them into hardware primitives.
We also need complex mathematical operations that are not provided by the data plane. We implement these complex mathematical operations in external functions. External does not mean that the functions rely on external processing and work through a slow path of a switch. Instead, external functions rely on match-action tables and work with other transactions in the pipeline of a programmable switch. For example, the ceil function (line 11 of Algorithm 1) approximates an integer to $2^i$ ($i \in \mathbb{N}^+$) due to the restriction on the upper-bound parameter of random primitives. The ceil function is organized as a match-action table with $n$ entries. The largest possible value is $2^n$. The $i$th entry ($i \in \mathbb{N}^+$, $i \le n$) is responsible for matching the number carry_min $\in [2^{i-1} + 1, 2^i]$ and outputs the approximate value $2^i$.
Assuming $n = 8$ and the largest possible value is $2^8 = 256$, we will need eight entries, as Table 2 shows. The entries are numbered from 1 to 8, where the priority decreases as the entry number increases. We use a ternary match for each entry, which is in the form of value & mask. The mask bits that are zero indicate a wildcard, where the corresponding bits in value can be either 1 or 0, while the non-zero mask bits indicate an exact match, where the corresponding bits in the value must be 0. Assuming carry_min = 6, the third entry matches with the highest priority. Entries 4-8 also match, but they have lower priorities than entry 3. Thus, the input number 6 is approximated to 8. The ceil function causes one stage of latency due to the match-action table.
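The priority lookup above can be sketched in Python as follows. The concrete value/mask encoding of Table 2 is not reproduced here, so this sketch assumes one possible encoding: the lookup key is carry_min - 1, every entry has value 0, and the mask of entry i covers bits i and above. The function names are hypothetical.

```python
def build_ceil_table(n=8):
    """Build n ternary entries (value, mask, output), highest priority first."""
    width_mask = (1 << n) - 1
    table = []
    for i in range(1, n + 1):
        # Non-zero mask bits (bit i and above) must equal the value bits (all 0);
        # the low i bits are wildcards.
        mask = width_mask & ~((1 << i) - 1)
        table.append((0, mask, 1 << i))
    return table


def ceil_lookup(table, carry_min):
    """Return the output of the first (highest-priority) matching entry."""
    n = len(table)
    if not 1 <= carry_min <= (1 << n):
        raise ValueError("carry_min outside the range covered by the table")
    key = carry_min - 1  # assumption: shift so that exact powers of two map to themselves
    for value, mask, output in table:
        if (key & mask) == (value & mask):
            return output
    raise LookupError("no entry matched")


table = build_ceil_table(8)
print(ceil_lookup(table, 6))
```

Under this encoding, carry_min = 6 matches entries 3 through 8, and the lookup returns the output of entry 3, which agrees with the example in the text.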
In summary, E-Domino is a modular, top-down design language. Programmers write the high-level language in modules regardless of the underlying hardware restrictions, and the compiler is responsible for outputting the P4 programs for the programmable switch.

P4HLPc Design
This section presents P4HLPc design. The compiler handles loops, control flows, stateful processing, stateless processing, and external functions.

Unroll the Loops
The first step is to unroll all loops in the program. Loops have different varieties: One is for operations in different stages (e.g., the PRECISION algorithm module), and another is for different conditions (e.g., the forwarding module). It is easy to identify the two kinds of loops. The loops that operate register arrays belong to the former variety and must be arranged in sequential stages to obey restrictions. The loops that manipulate stateless fields belong to the latter variety, and they can be arranged in one single-stage match-action table. P4HLPc tags the type of loops with either stateful or stateless.
Note that the number of iterations must be fixed at compile time for the former variety because the hardware pipeline must be fixed. We cannot modify the pipeline stages unless we halt the pipeline and apply a new P4 program. For the latter variety, by contrast, we can assume an upper limit on the number of iterations and set the match-action table size to that limit. At runtime, we can add, delete, and modify the table entries that decide stateless processing.
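The unrolling and tagging steps can be illustrated with a short Python sketch. It treats loop bodies as plain strings, which is a simplification; the statement syntax, function names, and register list below are hypothetical and do not reflect the actual P4HLPc implementation.

```python
import re


def unroll(loop_var, count, body):
    """Return `count` copies of `body`, with `loop_var` replaced by 0..count-1."""
    pattern = re.compile(rf"\b{re.escape(loop_var)}\b")
    return [[pattern.sub(str(i), stmt) for stmt in body] for i in range(count)]


def tag_loop(body, registers):
    """Tag a loop as stateful if any statement touches a register array."""
    for stmt in body:
        for reg in registers:
            if re.search(rf"\b{re.escape(reg)}\b", stmt):
                return "stateful"
    return "stateless"


# A stateful loop over d = 2 stages (iteration count fixed at compile time).
body = ["idx = hash(i, pkt.key)", "val[i][idx] = val[i][idx] + 1"]
for stage, stmts in enumerate(unroll("i", 2, body)):
    print(stage, stmts)
print(tag_loop(body, registers=["key", "val"]))
```

Each unrolled copy of a stateful loop would then be placed in its own stage, while a stateless loop would become entries of a single match-action table.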

Control Flows
After the unrolling, P4HLPc must identify all the possible paths and generate corresponding P4 control flows. The branch happens at each conditional statement. We use eFDDs (extended forwarding decision diagrams) for the intermediate representation of the E-Domino program.
The eFDDs are composed of nodes and directed edges. Each node denotes a set of expressions, and each edge indicates the dependency of two nodes. An eFDD acts like a binary decision diagram: Each intermediate node is a conditional expression, and each leaf node denotes actions. Each intermediate node has two successors: The left for the true condition, and the right for the false condition.
The conditional statement if (t) {e_1} else {e_2} results in dependencies from t to e_1 and e_2. P4HLPc connects eFDDs by these dependencies.
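A minimal Python sketch of such a diagram is given below, assuming a simple node class with a left (true) and a right (false) successor. The class and function names are illustrative and not part of P4HLPc.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: str                      # conditional expression or action
    left: Optional["Node"] = None   # successor when the condition is true
    right: Optional["Node"] = None  # successor when the condition is false

    def is_leaf(self):
        return self.left is None and self.right is None


def paths(node, prefix=()):
    """Enumerate every (conditions, action) path from the root to a leaf."""
    if node.is_leaf():
        return [(prefix, node.label)]
    found = []
    if node.left is not None:
        found += paths(node.left, prefix + ((node.label, True),))
    if node.right is not None:
        found += paths(node.right, prefix + ((node.label, False),))
    return found


# if (t) {e_1} else {e_2}
root = Node("t", left=Node("e_1"), right=Node("e_2"))
for conditions, action in paths(root):
    print(conditions, "->", action)
```

Enumerating the root-to-leaf paths corresponds to identifying all possible control-flow paths of the program.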

Table 3 (columns: Type, Description, Resources).
Read/Write: read a single register to a field, or write a field to a single register.
The field denotes a packet header or metadata field, the register denotes register_lo or register_hi, and the constant denotes an immediate value. The operations must be one of the atoms in Table 3. Besides altering register values, we can output either register_lo or register_hi to a header field. The SALU can operate on a pair of registers but can only output one of them. P4HLPc transforms stateful processing into atoms by filling in predefined templates. Each atom forms an eFDD. If the transformation of any condition, operation, or output fails, the operation is not supported by the SALU, and P4HLPc reports an error.
The register operations also cause dependencies. Assuming a and b are two register slots, b depends on a if the program reads a and writes the value to b. We organize the eFDDs into strongly connected components (SCCs), according to the dependencies between eFDDs. We add an edge from one SCC to another, where the second one depends on the first one. Figure 4 presents the eFDDs and SCCs of the PRECISION module.
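Grouping eFDDs into SCCs can be sketched with Tarjan's algorithm, a standard method for this task; the source does not state which SCC algorithm P4HLPc uses. In the Python sketch below, an edge u -> v means that v depends on u, and the node names are placeholders.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. `graph` maps a node to the nodes that depend on it."""
    index, low = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the root of a component: pop the stack down to v.
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            components.append(frozenset(component))

    for v in list(graph):
        if v not in index:
            visit(v)
    return components


def component_edges(graph, components):
    """Edges between distinct SCCs: the second component depends on the first."""
    owner = {v: c for c in components for v in c}
    return {(owner[u], owner[v])
            for u, targets in graph.items() for v in targets
            if owner[u] != owner[v]}


deps = {"a": ["b"], "b": ["a", "c"], "c": []}
sccs = strongly_connected_components(deps)
print(sccs)
print(component_edges(deps, sccs))
```

Here a and b read and write each other and therefore form one SCC, while c forms its own SCC that depends on the first.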

Stateless Processing
E-Domino uses mathematical symbols to represent stateless processing. The symbols exist in both statements and conditional expressions. For those in statements, we need to transform the mathematical symbols into the corresponding low-level primitives. For those in conditional expressions, we must further connect them by boolean operations. Since the latter case subsumes the former, we only elaborate on the latter. Table 4 lists the detailed design. For example, we adopt the ternary match to implement the boolean operation "or". The ternary match only cares about the bits whose corresponding mask is 1. For example, the expression if (ethernet.dst==0xC0A80164 or ethernet.src==0xC0A80165) is true if the key (ethernet.dst, ethernet.src) matches entry (0xC0A80164, *) or (*, 0xC0A80165). The "*" denotes a wildcard character. The table methods have no delay but consume some match-action tables. The primitive methods have a delay of 1-2 stages but consume no match-action tables. We use primitives instead of match-action tables to implement relational operations in order to trade off resource consumption against latency.
We propose a branch transform algorithm (BTA, Algorithm 2) that transforms stateless conditional expressions. BTA consists of preprocessing, transforming, and reassembling.
Preprocessing. Extract all conditional expressions in the control blocks. For each conditional expression, replace "and" with a conjunction, replace "or" with a disjunction, and transform it into a disjunctive normal form (DNF) [31] by calling the function DNF_ALGORITHM(C_i). We use DNF instead of CNF (conjunctive normal form) [32] for simplicity (see Section 5.1). Split the DNF by disjunction symbols into conjunctive clauses (CCs). For each CC in the DNF, split it by conjunction, i.e., "and" in the conditional expression, into boolean expressions.
Transform. For each boolean expression, transform it into reverse Polish notation (RPN) [33]. Then transform the RPNs into primitives according to the rules in Table 4.

Reassemble. Reassemble boolean expressions according to conjunction and disjunction.
Reassemble CCs first: For primitives in one CC, reassemble their results into a nested if. Then reassemble DNF: For CCs, reassemble their results into a match-action table.
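The preprocessing step of BTA can be illustrated with a small Python sketch that converts a negation-free conditional expression into DNF and splits it into CCs. The tuple-based expression format and the function name are assumptions of this sketch; it is not the DNF_ALGORITHM of Algorithm 2.

```python
from itertools import product


def to_dnf(expr):
    """Return the DNF of `expr` as a list of CCs, each a list of boolean expressions.

    `expr` is either a string (an atomic boolean expression such as "a == 1")
    or a tuple (op, lhs, rhs) with op in {"and", "or"}. Negation is not handled.
    """
    if isinstance(expr, str):
        return [[expr]]
    op, lhs, rhs = expr
    left, right = to_dnf(lhs), to_dnf(rhs)
    if op == "or":
        # Disjunction: the union of the clauses of both sides.
        return left + right
    if op == "and":
        # Conjunction: distribute over every pair of clauses.
        return [l + r for l, r in product(left, right)]
    raise ValueError(f"unsupported operator: {op}")


# if ((a == 1 and b < 10) or c >= 0)
condition = ("or", ("and", "a == 1", "b < 10"), "c >= 0")
for clause in to_dnf(condition):
    print(clause)
```

In the reassembling step described above, each inner list (one CC) would become a nested if, and the outer list (the DNF) would become a match-action table.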

External Functions
For the external functions, we adopt joint compiling to generate the match-action tables along with thrift codes to install table entries. So far we support three kinds of external functions: The math functions, such as the ceil function, that look up results in tables, the hash functions that calculate user-defined hash values, and the digest functions which generate a digest of packets for INT (in-band network telemetry).

Outputs
The programs may not be feasible on real machines due to the atom and primitive restrictions. If any of the above steps fails, the compilation terminates, and P4HLPc throws compile errors. Otherwise, P4HLPc generates the P4 programs from the E-Domino code. Algorithm 3 summarizes the compiling steps.

Results
We implement algorithms in E-Domino, then compile the E-Domino programs into P4 code using P4HLPc. We evaluate the programs automatically generated by P4HLPc and compare the E-Domino code with the P4 code. We also evaluate P4HLPc under different conditions. Table 5 presents the E-Domino experiment results. E-Domino can implement various data plane algorithms. Except for the SpaceSaving algorithm, which is not feasible on state-of-the-art hardware, we implement the other algorithms (Bloom filter [12], Flowlet [13], HashPipe [15], RAP [16], PRECISION [10], ElasticSketch [17], and SketchLearn [18]) in E-Domino and then transform them into P4 programs using P4HLPc. The P4 programs use different atoms. Flowlet and HashPipe use the PU (Paired Update) and PO (Paired Output) atoms, respectively, to operate the key-value map. The new atoms (PU and PO) extend the register operations of Domino, thus enabling more algorithms than Domino. Also, the P4 programs consume varying numbers of SALUs and stages. They use resources as defined in E-Domino. No resources are wasted.

E-Domino
E-Domino is expressive for data plane functions. P4 uses 5.5× to 47.9× more code than E-Domino. We only count the lines of code of the core module. Most importantly, E-Domino excels at writing programs that repeatedly carry out the same routine in different stages, e.g., the SketchLearn algorithm takes only 16 lines in E-Domino but 767 lines in P4, owing to the repeated register declarations and write operations in P4.

P4HLPc
For a new program, P4HLPc goes through the steps in Algorithm 3, shown as the cold start phases in Table 6. The cold start happens only once for each program, when it is first deployed. The cold start completes in 1-3 s, and there is no need to halt the switch because all compiling work is done offline on other machines. P4HLPc finishes the compiling steps for all algorithms in less than 0.72 s (Table 5). Once compiling has finished, we can incrementally add new modules to the prior programs without halting. A policy change needs steps 1-5 in Table 6 because it changes the control flows and dependencies of the program. The external functions and routing stay unchanged; thus, we can reuse them. When the network topology changes, we only need to reconfigure the forwarding table entries, which takes milliseconds. P4HLPc does not halt the data plane when compiling programs.
If any of the compiling steps fail, P4HLPc reports error(s), which is faster than the P4 compiler which transforms the P4 codes to RMT configurations. P4HLP enables programmers to reuse some standard modules from prior programs. P4HLP also supports fast reproducing prior works which are implemented in P4 simulators or other platforms.
It is reasonable to use primitives instead of match-action tables to implement relational operations in Section 4.4. Although match-action tables are faster than primitives, they need a vast number of entries to match the keys, e.g., for the expression $a \ge b$ ($a, b \in \mathbb{N}^+$), assuming the maximum value of $a$ is $A$, we will need $1 + 2 + \dots + (A + 1) = \frac{(A+1)(A+2)}{2}$ entries for match-action tables. Using primitives is a better trade-off because primitives consume fewer resources than table entries, while the added latency is insignificant relative to the overall 12 stages of the pipeline.
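The entry count above can be checked by brute force. The Python sketch below assumes that a and b range over the integers 0..A, which is the range under which the sum 1 + 2 + ... + (A + 1) arises, and that one exact-match entry is needed per pair (a, b) with a >= b. The function names are hypothetical.

```python
def entries_by_enumeration(max_value):
    """Count the pairs (a, b) with 0 <= b <= a <= max_value."""
    return sum(1 for a in range(max_value + 1) for b in range(a + 1))


def entries_by_formula(max_value):
    """Closed form (A + 1)(A + 2) / 2."""
    return (max_value + 1) * (max_value + 2) // 2


for limit in (1, 3, 15, 255):
    print(limit, entries_by_enumeration(limit), entries_by_formula(limit))
```

Even for an 8-bit operand (A = 255), the table would need tens of thousands of entries, which illustrates why primitives are the cheaper choice for relational operations.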
DNF is more space-efficient than CNF. Each disjunction clause (DC) will result in a match-action table, and each boolean expression in a DC needs an entry in the table. By contrast, each CC only needs a few gates to implement nested ifs. Consider the expression if ((a == 1 and b < 10) or c >= 0): CNF costs one gate and two match-action tables with six entries, double the cost of DNF, with no decrease in latency.
Table 7 shows the additional resource consumption of different algorithms on Tofino, normalized by the usage of the baseline program switch.p4, a P4 program that implements most networking features (L2/L3 forwarding, VLAN, QoS, etc.) for a typical switch. Bloom filter and Flowlet are lightweight algorithms that consume, on average, 2.83% of all resources. HashPipe, RAP, and PRECISION are storage-intensive algorithms that consume 34.68%-46.67% of the SRAM (Static Random Access Memory) to store flow keys and values. They use 16.67% of the SALUs, i.e., eight out of 48, to update the SRAM. Moreover, RAP and PRECISION consume much more TCAM (Ternary Content Addressable Memory) than HashPipe to implement the lookup table for the ceil function. Joint compiling automatically generates the table declarations and entries for the ceil function. We also test the controller's Python code that dumps the entries. The controller communicates with the data plane through Thrift [29]. ElasticSketch and SketchLearn consume 75% and 66.67% of the SALUs, respectively. A SALU handles a register with one atom. We set the key length to k = 32 bits. SketchLearn consumes one SALU for each bit of the key, i.e., 32 out of 48 SALUs.
We stress-test the throughput of the generated P4 programs on Tofino. Figure 5 shows the throughput normalized to the line-rate speed without measurement. The processing speed is preserved, and the variance is minimal. The external functions and branch transformation do not influence the throughput. The high performance comes from the nature of pipeline stages, which process multiple packets concurrently.

Related Works
Many DSLs explore how to program the data plane. Click [34] is a configurable software switch architecture. A Click router consists of several packet processing modules, e.g., packet classification, queueing, and controlling modules. Click programs these modules in C++. Software switches are 10×-100× slower than hardware switches. As network traffic volume grows, software cannot process packets at line rate. We must implement the data plane functions in hardware.
Jose et al. [35] designed a P4 compiler to map P4 logical match-action tables to physical switching chips while meeting the data and control dependencies in the program. They adopt integer linear programming (ILP) and greedy algorithms to optimize latency, resource occupancy, and power consumption. They compile benchmarks to two commodity switches, RMT and FlexPipe. The compiler uses abstractions to hide hardware details while capturing the essence required for mapping. The compiler focuses on stateless processing, while P4HLPc also supports stateful processing.
Arashloo et al. [3] concentrated on sophisticated stateful processing in the data-plane and suggested SNAP, a more straightforward "centralized" stateful programming model. SNAP views the distributed stateful elements as centralized, relieving programmers of placing and optimizing access to these stateful arrays.
Instead of targeting specific targets, NetAsm [21] uses an intermediate representation (IR) to program various devices, such as FPGAs (Field Programmable Gate Arrays) and programmable switches. These IRs remove target restrictions from consideration, much like P4. P4Visor [36] provides new primitives to replace the state-of-the-art data-plane primitives, compiler optimizations, and program analysis-based algorithms, which reduce the resource overheads. P4Visor supports rapid testing and deployment life-cycles.
Sivaraman et al. [4] proposed a new DSL, Domino, to program data planes in a C-like manner. They also devised a hardware machine model, Banzai, for programmable line-rate switches. The Domino compiler generates Banzai primitives from the C-like Domino programs. Domino concentrates on stateful processing, under different types of register operations in the Banzai model, neglecting the stateless processing part in control blocks. What is more, commodity switches are more complicated in stateful processing than Banzai, e.g., Tofino can branch on conditions over register values, while Banzai only depends on header or metadata fields. Also, Domino does not support loops, which is unfriendly to algorithms that repeatedly perform the same routine. The proposed E-Domino language and P4HLPc support stateless processing and loops.

Discussion
P4HLP provides the ability to migrate software algorithms to hardware devices, reproduce prior works, and develop novel algorithms efficiently. However, some work must be done before we apply P4HLP to line-rate switches.
1. P4HLP defines programs in the high-level language E-Domino, which hides the hardware details of programmable switches. This abstraction relieves programmers from concerns about hardware restrictions but may lead to programs that are infeasible on real machines, or that are feasible but inefficient. For example, P4HLP supports the hash parallel algorithm, which needs to recirculate every packet in order to update the smallest register slot. Some algorithms may require too many specific hardware resources in one stage, making the other resources in the stage unavailable for other modules. This issue may also arise for external functions that consume match resources. P4HLP warns the programmer when an algorithm halves the throughput or consumes too many resources.
2. P4HLP may fail to generate some programs that inherently violate the RMT restrictions and thus cannot be mapped to the RMT architecture through optimizing techniques. We may identify these algorithms and explore possible manual optimizing methods.
3. P4HLPc does not optimize P4 programs. Currently, we leave optimization to a P4 compiler. Future work can be done in either P4HLPc or a P4 compiler to optimize the pipeline implementation. We may combine P4HLPc with a P4 compiler to generate the optimized hardware configuration directly from the E-Domino language, in contrast to the current two-step workflow: first, compile the E-Domino program to a P4 program, and then compile the P4 program to the target.
4. With minor modifications to the E-Domino grammar, P4HLP may support more emerging ASICs, such as XPA, that differ only slightly in specific primitives.

Materials and Methods
All materials in this paper are available on GitHub [37]. We developed and tested P4HLPc on Ubuntu 16.04. We tested all P4 programs on a Wedge 100BF-32X switch, which has a Tofino 3.2 Tbps chip. We connected the switch to two end hosts using 40 Gbps QSFP links. Each end host was a DELL PowerEdge R820 server with an Intel XL710-QDA2 40 Gbps network interface card, two Intel Xeon E5-4603 CPUs, and 256 GB of memory. To send and receive packets on the hosts, we installed MoonGen [38], a scriptable high-speed packet generator built on libmoon [39] and the packet processing library DPDK (Data Plane Development Kit) [40], achieving a stable speed of 38.04 Gbps.
We tested the cold start of P4HLPc in four steps. First, we implemented the algorithm in the E-Domino language. Second, P4HLPc transformed the E-Domino programs into Tofino P4 programs. Third, the P4 compiler compiled the P4 programs into Tofino configurations. Last, we configured the Tofino switch and dumped the table entries. We collected the resource utilization summary to analyze the quality of the P4 programs. We also changed the policy and topology to test scalability and flexibility.

Conclusions
This paper proposes the E-Domino language and P4HLPc to support unified high-level programming for programmable switches. The P4HLP framework aims to hide restrictions on processing resources, storage resources, and latency. E-Domino is a modular, top-down design language that is independent of the underlying hardware restrictions. We highlight four aspects of the high-level language E-Domino: (i) E-Domino extends Domino with loops, improving programming efficiency. (ii) E-Domino supports arithmetic operations in the conditional statements of control flow. (iii) E-Domino extends stateful atoms with a Paired Update and a Paired Output, which are widely used in key-value mapping programs. (iv) E-Domino supports external functions to extend the Tofino ASIC functions.
P4HLPc transforms E-Domino programs into P4 programs with four fundamental techniques: loop unrolling, TBA, atom templates, and joint compiling. (i) P4HLPc recognizes loops over stages and table entries and unrolls them. (ii) The TBA handles conditional statements in control flows by generating a combination of primitives and match-action tables. (iii) P4HLPc generates SALU code from an atom template. (iv) The joint compiling technique generates a match-table declaration along with entries to implement external functions. P4HLPc not only automatically transforms E-Domino programs into P4 implementations, but also equips switches with more functions through TBA and joint compiling. This paper also proposes the idea of modular programming, which organizes E-Domino transactions into different modules. Programmers can reuse the modules from prior works.
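As an illustration of technique (i), the short Python sketch below shows the general idea of loop unrolling: because P4 control blocks have no loop construct, each iteration of a loop has to be emitted as its own statement. This is a minimal sketch, not P4HLPc's implementation; the function `unroll`, the `{i}` placeholder syntax, and the table name `count_stage_{i}` are hypothetical and chosen only for the example.

```python
from typing import List


def unroll(template: str, var: str, start: int, stop: int) -> List[str]:
    """Expand a loop body once per iteration, substituting the loop index.

    `template` is the loop body with the index written as `{var}`.
    The iteration range is [start, stop), as in a C-style for loop.
    """
    placeholder = "{" + var + "}"
    return [template.replace(placeholder, str(i)) for i in range(start, stop)]


# Hypothetical three-iteration loop that applies one table per stage.
unrolled = unroll("apply(count_stage_{i});", "i", 0, 3)
for statement in unrolled:
    print(statement)
```

The loop bounds must be known at compile time for this expansion to work, which is why a sketch like this takes `start` and `stop` as constants rather than runtime values.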
Results show that, compared with P4, E-Domino is precise and expressive, and thus is suitable for programming the data planes of programmable switches. P4HLPc is robust to policy changes and topology changes, and thus is suitable for compiling E-Domino programs. Modular programming is fine-grained and enables fast reconfiguration of commodity switches. The generated P4 programs maintain line-rate speed on Tofino switches.
In all, our work takes a step towards unified high-level programming for commodity programmable switches.

Acknowledgments:
First of all, I am very grateful to my mentor, Chunyuan Zhang, for his careful guidance of my graduation thesis over the past three months, which greatly improved my understanding of academic writing and taught me many specific research skills; knowledge is a vast ocean, and I am only one of the flat boats. I am also grateful to the teachers who have given me selfless help during my two years of development.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: P4HLP Programming