A Scalable Parallel Architecture Based on Many-Core Processors for Generating HTTP Traffic

Abstract: The past years have witnessed significant development of the Internet. Numerous emerging network architectures and protocols have triggered the demand for traffic generators that differ sharply from previous schemes. Namely, fixed test content is inefficient in the presence of such dynamic and realistic demand. Moreover, the requirement of high performance has raised the stakes on developing a new concurrent system. In this paper, we present a hierarchical parallel design for a Web traffic generator on a TILERAGX36 processor, called TGMP. We discuss the challenges in developing its hierarchical architectural design, and elaborate on its implementation details. Specifically, in order to generate a realistic network workload over a long and large time scale, we propose a user-control scheme based on cubic spline interpolation. To improve the scalability of the system and satisfy the required flow rate, we adopt several techniques, including optimization of Linux kernel parameters, event-driven concurrency, and the parallel architecture of the TILERAGX36 processor. The experimental results demonstrate that TGMP is able to create real traffic and simulate 50,000 users accessing the Web server simultaneously.


Introduction
A traffic generator is the main tool for providing background network traffic to evaluate and verify the performance, protocols, and security of an experimental network. With the continuing adoption of emerging network architectures, such as SDN (Software Defined Network) and CCN (Content-Centric Networking), there is an urgent and ever-increasing demand for traffic generators that offer high performance, flexibility, low cost and realism. Unfortunately, such a tool is not easily available, and is usually of much smaller scale than needed. Recent research suggests that the Web still accounts for 15% to 18% of traffic [1]. Therefore, a high-performance and representative Web traffic generator is crucial for evaluating emerging network architectures under different network conditions. At present, commonly used Web traffic generators fall into two typical types: special-purpose equipment based on hardware, and software tools running on general hardware. The former, such as Spirent Avalanche [2] and Ixia IxLoad [3], is generally designed for a particular test scenario and lacks flexibility and extensibility. Spirent Avalanche is an application testing tool which creates real-world dynamic user behavior with advanced browser-to-application interaction capabilities. However, it is hard for such equipment to cope with highly changeable scenarios. Owing to its flexibility and simplicity of implementation, the latter provides low-cost and quick deployment. Some well-known tools are Webstone [4], ProWGen [5] and GlobeTraff [6]. Compared to the hardware platform, the software is easier to promote and provides a cross-platform API (Application Programming Interface).
The main contributions of this paper are as follows:
1. The software architecture of a Web traffic generation system is proposed by using the hierarchical design methodology, including the control layer, virtual user layer and traffic generation layer, to provide high scalability.
2. We present a user-control method using cubic spline interpolation, based on an analysis of methods for simulating the behavior of Web users accessing a real server. The method enables the system to generate background traffic with the characteristics of a real network over a long time scale, according to Internet users' access times in different scenarios.
3. In order to meet the requirements of concurrency, we have implemented a TGMP prototype. Three high concurrency strategies are employed, which enable the system to simulate a large number of virtual users at the same time and generate more Web traffic.
4. TGMP has been implemented and deployed in a real network on the third floor of the YiFu building on the CQUPT campus. The experiments show that, compared with other systems, TGMP yields more satisfactory performance, with 50,000 users accessing the Web server simultaneously.
The structure of this paper is as follows. The related Web traffic generation systems and the challenges in developing TGMP are introduced in Section 2. Section 3 presents a layered architecture of TGMP. Section 4 describes the cubic spline interpolation algorithm. Section 5 focuses on the high concurrency strategy. Section 6 presents the experimental results. Finally, we conclude this work in Section 7.

Background and Challenges
In this section, we discuss the advantages and disadvantages of existing traffic generation approaches. The purpose of the discussion is to inform the design decisions of TGMP.

Overview of the TILEGX36 Architecture
Many-core processors, which are usually equipped with a larger number of cores per processor (tens to hundreds of cores per processor compared to up to six or eight in existing commodity multi-core processors), have emerged in the past years. In particular, each core in these processors can run a full operating system, which provides high flexibility and simplicity for software development.
The TILEGX36 processor is a typical many-core processor with 36 homogeneous cores (tiles), which are organized in a 6 × 6 grid and interconnected by an on-chip network, as shown in Figure 1. Each tile is a full-featured computing system that can run independently. The UDN (User Dynamic Network) can be used directly by processes or threads to exchange data between tiles, which effectively reduces the power consumption and communication delay.

The Existing Approaches of Traffic Generation
Accurate modeling and generation of a realistic network workload are difficult and challenging tasks because of the heterogeneity, scale and complexity of the current Internet. In the literature, a substantial number of works focus on the modeling and simulation of traffic generation. At present, traffic is generated with the following methods:
1. Traffic Replay [15,16]. Traffic replay tools are used to repeat real traffic scenarios. Traditionally, such a tool relies on software that captures the whole traffic trace and sends the trace to the test network. Some approaches also require special hardware devices. For instance, Qiao [15] realizes a traffic capture and replay system based on an add-on card, and describes the detailed logic design of the UDP pipeline on an FPGA (Field-Programmable Gate Array). Although a Web traffic generator based on a hardware implementation can produce high-volume test traffic, its poor scalability makes it difficult to integrate with the experimental network. GoReplay offers a similar idea of reusing existing traffic for testing, which makes it very powerful. Nonetheless, these kinds of approaches can only reflect the captured period of time rather than a long and large time scale.
2. Traffic Model [17]. Some existing tools, such as WebStone [4], Web Polygraph [18], and httperf [19], are based on synthetic models of Web traffic. The network traffic model is built by studying the characteristics of the traffic to be generated. For example, Xu [17] gives each virtual user a corresponding configuration file, and these files determine the visiting paths, visiting moments and stay times of the virtual users based on a Continuous Time Markov Chain. These models are developed analytically and then validated experimentally with measurement studies. Although this approach allows fine-grained control over behavioral aspects, it has some serious drawbacks. For instance, it can only reflect the macroscopic characteristics of the network traffic. In particular, such tools can successfully create realistic traffic mixes as a function of overall load, but they typically cannot provide good performance.
3. User Behavior [20,21]. The process of visiting a Web site can be roughly divided into the following three stages. First, a user starts a session by selecting a link that they are interested in. Second, after browsing, they often click another link of interest within a relatively short time. Third, once they have obtained the required information, they stay in an inactive state for a long time. Figure 2 shows the important traffic parameters, including think-time, session-time, etc. In such a case, traffic characterization must be in line with the behavior of an individual user. In this paper, we first analyze Web workload characteristics such as file sizes, mean think time, and the number of requests made to an individual file. We utilize the typical ON/OFF [20] Internet access statistical model, whose key parameters are obtained from real traffic analysis. Thus, it can generate realistic network traffic.

Challenges
Generating high-volume, stable, realistic test traffic is crucial for assessing the performance of network devices in a reliable way and under different stress conditions. In this paper, we borrow from the hierarchical architecture design [22], which separates the control plane from the execution plane and exposes function interfaces to higher layers in order to deal with the dynamics of the network.
Two main technical challenges should be considered. First, realistic replication of traffic over a large time scale is a challenging task. The multidimensional heterogeneity of the current Internet exacerbates the problem. However, the existing schemes can only reflect a limited period of time or the past characteristics of the network. To address this issue, we need to generate background traffic with the characteristics of a real network over a long and large time scale, according to Internet users' access time data in different scenarios. Therefore, we present a user-control method based on cubic spline interpolation to simulate the characteristics of a real network. Meanwhile, we generate a realistic network workload by accessing a real server.
Second, existing schemes have mostly been inefficient due to limited hardware resources and inefficient processing. To improve the scalability of the system and satisfy the required flow rate, support for high-performance and configurable experiments is greatly needed in this context. Furthermore, in large-scale networks, tests have to be performed automatically, because the size of the system under test may prevent manual operation on each and every host involved in the experiment. Although existing traffic generators are quite useful, most of them suffer from the following: (1) They may need special equipment which is expensive or not commonly used. For example, MoonGen [23] is a flexible high-speed packet generator whose key feature is the measurement of latency with sub-microsecond precision and accuracy by using the hardware timestamping capabilities of modern commodity NICs (Network Interface Cards); (2) It is not trivial to generate traffic data in arbitrary spatial regions using existing traffic generators. For example, DPDK (Data Plane Development Kit) is a set of libraries and drivers for fast packet processing, which is widely used in traffic generators such as MoonGen [23] and TRex [24]. However, it only handles packets at the data link layer instead of dealing with them at the application layer. In order to generate real Web traffic, packets must pass through the kernel protocol stack. The availability of generic many-core architectures with tens to hundreds of cores per processor, which are low-cost and feature-rich, offers new opportunities for parallelization and extensibility. Taking advantage of the efficient network architecture, a scheme should be designed that addresses the challenge of high performance and efficiently leverages many-core processors.

Traffic Generator Systems Architecture
The development of a control system has attracted significant attention. Supporting software, which needs to be highly adaptable to changeable network scenarios, remains one of the greatest obstacles to a Web traffic generator. In this section, we present the layered architecture of our traffic generator and its main components. Figure 3 gives an overview of the TGMP architecture. The layered system architecture mainly includes an application layer, a control layer, a virtual user layer, and a traffic generation layer. Traffic generation experiments are carried out through the cooperation of these layers: whenever a traffic experiment is initiated, the experiment parameters and system setup are forwarded, by calling northbound APIs, to the lower layers, namely the control layer, virtual user layer and traffic generation layer. In the other direction, real-time feedback on the traffic generation tasks is transmitted through southbound APIs from the lower layers, namely the virtual user layer and traffic generation layer. Each of the components is presented in detail below.
(1) System Components
The application layer contains user-customized traffic generation applications, which are designed as scheduled traffic generation experiments. The northbound application programming interfaces (APIs) provided by the control layer are called by the application layer for message exchange and system state monitoring.
The control layer consists of the process management module, configuration module and remote management module. The process management module is responsible for process creation, multi-process parallelization settings, event-driven initialization and signal event registration. The configuration module sets the prerequisite parameters that the system needs in order to start correctly. The remote management module monitors and handles the requests and messages from the application layer.
The virtual user layer provides the user management module, user behavior module and load balancing module. The user management module is responsible for the scheduling, allocation and isolation of virtual users' resources in each defined traffic generation task or experiment. The user behavior module is in charge of the complete implementation of the Web user behavior model, controlling each virtual user's Web browsing actions. The load balancing module distributes the traffic generation work across two or more processes, so that more work can be done in the same amount of time.
The traffic generation layer is divided into the request management module, HTTP processing module and log processing module. The request management module is responsible for creating request objects; managing the resource pool; acquiring the target URL and constructing the HTTP request message; establishing the TCP connection; and registering network I/O events and processing their callbacks. The HTTP processing module, as the data processing module, mainly parses HTTP response messages asynchronously and discards the HTTP response entity. The log processing module records user access logs and periodically pushes them to the log database.
(2) Use Cases
A use case is presented to give an overview of how the structure works. As the centralized manager that maintains multiple connections for each experimenter, the Web server handles deployment and configuration, which allows many operators to use the Web site at the same time. The simulation process for a Web traffic generator, as shown in Figure 3, mainly contains the following steps. In the first phase, the experimenter sends the simulated service message to the control layer. The message includes the number of simulated users, the number of HTTP sessions, the interval time between sessions, and the number and frequency of Web clicks.
In the second phase, the control layer receives messages from the application layer, encapsulates them into a specific message format, and sends them to the virtual user layer. It then remains active until the processing result arrives from the virtual user layer.
In the third phase, the virtual user layer parses the messages sent by the control layer and assigns them to the corresponding resources (such as the virtual user resource). Each virtual user is activated and the corresponding timed events are registered. When a virtual user's timer is triggered, its callback function is called to send a request message to the traffic generation layer.
Last, the traffic generation layer obtains the specified URL according to the request message, and then establishes a connection with the target Web server. When the connection becomes readable or writable, the corresponding I/O event is triggered, so the data exchange is handled entirely through the event mechanism. Finally, a record of the request is pushed into the database in a defined format.

User Control Method by Cubic Spline Interpolation
This system must generate realistic Internet traffic from the perspective of a test scenario, without having to emulate network components or protocols. The method based on user behavior supports self-similarity by modeling each user as an ON/OFF process. However, on larger time scales (e.g., one hour or one day), the change is relatively stable. We use the statistics of online time distribution published by the Baidu statistical traffic institute [25], which cover more than 1.5 million sites. Figure 4 shows the number of Internet users over a single day, which demonstrates a clear time-of-day (diurnal) pattern. The main implementation problem here is how to reproduce a clear diurnal pattern based on the characteristics of the aggregate traffic, without abrupt changes over the course of the day. In this section, we discuss the cubic spline interpolation algorithm in Table 1, which refines the user time curve while preserving the trend in the number of users. We first perform data pre-processing, then construct the cubic spline interpolation model, then refine the samples, and lastly carry out user control. More specifically, we assume the number of virtual users is Nv. According to the actual online user percentage in Figure 4, we obtain the virtual online user percentage V(x) in 24 time slots. In step 3, the discrete V(x) is interpolated in chronological order to obtain a continuous function S(x) by the cubic spline interpolation algorithm. In steps 4-10, we sample S(x) more finely and obtain the virtual user reference set B. In steps 11-18, we continuously adjust the number of active virtual users in each time slot based on B. In this way, the algorithm refines the user time curve, maintains the trend in the number of users, and yields a more reliable load for Web traffic generation.
We have built a prototype system to verify how our algorithm can generate traffic effectively in a field test.

Algorithm 1. Cubic Spline Interpolation Algorithm.
Input: the number of actual users N, actual online user percentage T(x).
Output: the virtual user reference set B, the number of active virtual users A.
1. Compute the number of virtual users …
…
3. Use the cubic spline interpolation algorithm to compute the continuous time function S(x) by interpolating V(x);
4. Identify the time slot j, which divides 24 h into 30 s intervals;
5. Initialize …
…
9. end while
10. Obtain the virtual user reference set B in chronological order;
11. Assume the number of active virtual users A_j in time slot j;
12. while …
…
end while
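The interpolation and sampling steps of Algorithm 1 can be illustrated with the following minimal C sketch. It is not the TGMP implementation: it assumes a natural cubic spline (zero curvature at both ends), since the boundary conditions are not stated here, and the function names spline_prepare, spline_eval and build_reference_set are hypothetical. Given hourly percentages V(x) and the number of virtual users Nv, it samples S(x) at a fixed step (30 s in the paper) and stores the rounded user counts in the reference set B.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Second derivatives y2[] of a natural cubic spline through (x[i], y[i]),
 * i = 0..n-1, with x strictly increasing. The natural boundary condition
 * is an assumption of this sketch. Returns 0 on success, -1 on failure. */
int spline_prepare(const double *x, const double *y, size_t n, double *y2)
{
    assert(n >= 2);
    double *u = malloc(n * sizeof *u);
    if (u == NULL)
        return -1;
    y2[0] = 0.0;
    u[0] = 0.0;
    for (size_t i = 1; i + 1 < n; i++) {
        /* Forward sweep of the tridiagonal system. */
        double sig = (x[i] - x[i - 1]) / (x[i + 1] - x[i - 1]);
        double p = sig * y2[i - 1] + 2.0;
        double slope_diff = (y[i + 1] - y[i]) / (x[i + 1] - x[i])
                          - (y[i] - y[i - 1]) / (x[i] - x[i - 1]);
        y2[i] = (sig - 1.0) / p;
        u[i] = (6.0 * slope_diff / (x[i + 1] - x[i - 1]) - sig * u[i - 1]) / p;
    }
    y2[n - 1] = 0.0;
    for (size_t k = n - 1; k-- > 0;)   /* back-substitution */
        y2[k] = y2[k] * y2[k + 1] + u[k];
    free(u);
    return 0;
}

/* Evaluate the spline S(t) for x[0] <= t <= x[n-1]. */
double spline_eval(const double *x, const double *y, const double *y2,
                   size_t n, double t)
{
    size_t lo = 0, hi = n - 1;
    while (hi - lo > 1) {              /* bisection for the enclosing interval */
        size_t mid = (lo + hi) / 2;
        if (x[mid] > t) hi = mid; else lo = mid;
    }
    double h = x[hi] - x[lo];
    double a = (x[hi] - t) / h, b = (t - x[lo]) / h;
    return a * y[lo] + b * y[hi]
         + ((a * a * a - a) * y2[lo] + (b * b * b - b) * y2[hi]) * h * h / 6.0;
}

/* Sample S(x) every step_s seconds between the first and last knot and
 * store round(nv * S) in b[], clamped to [0, nv]. Returns the number of
 * samples written (at most cap), or 0 on failure. */
size_t build_reference_set(const double *hours, const double *pct, size_t n,
                           long nv, double step_s, long *b, size_t cap)
{
    double *y2 = malloc(n * sizeof *y2);
    if (y2 == NULL || spline_prepare(hours, pct, n, y2) != 0) {
        free(y2);
        return 0;
    }
    double step_h = step_s / 3600.0;
    size_t count = (size_t)((hours[n - 1] - hours[0]) / step_h + 1e-9) + 1;
    if (count > cap)
        count = cap;
    for (size_t j = 0; j < count; j++) {
        double t = hours[0] + (double)j * step_h;
        double v = (double)nv * spline_eval(hours, pct, y2, n, t);
        if (v < 0.0) v = 0.0;          /* splines can overshoot slightly */
        if (v > (double)nv) v = (double)nv;
        b[j] = (long)(v + 0.5);
    }
    free(y2);
    return count;
}
```

With 24 hourly knots and step_s = 30, the call fills B with one target per 30 s slot between the first and last knot; the clamping guards against the small overshoot that a cubic spline can produce near sharp changes.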

The High Concurrency Strategy
In this section, we explore the concurrency strategy of the TGMP system on many-core architectures [26]. The plan includes three strategies. We first modify the Linux kernel parameters to provide the basic conditions for high concurrency. To sustain high traffic volumes, we integrate event-driven programming [27] into our proposed design. Furthermore, to take advantage of the particular features of the TILERAGX36 hardware, we remove bottlenecks by using task decomposition and task mapping on the TILERAGX36 platform to improve parallel processing.

Optimization of Parameters
The default Linux kernel parameters are tuned for the most common scenarios, and do not support the highly concurrent processing required by a traffic generation system. Therefore, some parameters should be reconsidered in terms of their size and usage. Three aspects are mainly considered: file descriptors, port numbers, and TCP parameters. Based on the capabilities of the operating system, suitable values can be set manually.

Event-Driven
Currently, high-concurrency designs follow two different concepts: multi-threaded and event-driven [28]. In a multi-threaded application, multiple threads run simultaneously within a single process and share the process resources, which makes communication between them easy. The multi-threaded concept aims to increase the utilization of a single core by using thread-level as well as instruction-level parallelism. In the event-driven concept, the flow of the program is not fixed in advance but is determined by events that may arrive in any order; it is employed in many high-performance open-source systems, such as Nginx [29] and Memcached [30]. In an event-driven application, there is generally a main loop that listens for events and triggers a callback function when one of those events is detected. To satisfy the requirement of high concurrency and avoid redundant processing, a choice between the two concepts must be made. In the following paragraph, an abstract application scenario is described to compare their performance on the TILERAGX36 processor.
First, a process is bound to a designated tile on the Tile-Gx36 processor. Second, we define a task that computes the cumulative sum from 1 to n, where n is selected from a group of random numbers ranging from 10,000 to 20,000. Third, the event-driven and multi-threaded applications are each run 10 times. To characterize the processing performance of the two strategies, the task execution time and the total number of CPU clock cycles are measured. Figure 5a shows that the task execution times of the two approaches stay consistent when the number of tasks is less than 5000. Above 5000, the execution time of the multi-threaded approach increases significantly. As for CPU utilization, Figure 5b shows that it grows faster for the multi-threaded approach. The event-driven application processes tasks in order, so expensive context switching between tasks is not necessary. Motivated by these observations, we adopt the event-driven approach.
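The main-loop-and-callback structure described in this section can be sketched with the Linux epoll interface. This is only an illustration of the pattern, not TGMP's event engine; the names struct handler, loop_add_reader, loop_run_once and count_bytes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <unistd.h>

typedef void (*event_cb)(int fd, void *arg);

/* One registered event source: a descriptor plus the callback to run
 * when it becomes readable. */
struct handler {
    int fd;
    event_cb cb;
    void *arg;
};

int loop_create(void)
{
    return epoll_create1(0);
}

/* Register h->fd for read events; h must outlive the registration. */
int loop_add_reader(int epfd, struct handler *h)
{
    struct epoll_event ev = {0};

    assert(h != NULL && h->cb != NULL);
    ev.events = EPOLLIN;
    ev.data.ptr = h;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, h->fd, &ev);
}

/* One iteration of the main loop: wait up to timeout_ms for events and
 * run the callback of each ready descriptor in turn. Returns the number
 * of callbacks dispatched, or -1 on error. */
int loop_run_once(int epfd, int timeout_ms)
{
    struct epoll_event evs[64];
    int n = epoll_wait(epfd, evs, 64, timeout_ms);

    for (int i = 0; i < n; i++) {
        struct handler *h = evs[i].data.ptr;
        h->cb(h->fd, h->arg);
    }
    return n;
}

/* Example callback: read what is available and add its size to a counter. */
void count_bytes(int fd, void *arg)
{
    char buf[256];
    ssize_t n = read(fd, buf, sizeof buf);

    if (n > 0)
        *(size_t *)arg += (size_t)n;
}
```

Because every callback runs to completion on the loop's own thread, no locking or context switch is needed between tasks, which is the property the comparison above relies on.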

Parallel Architectures on TILERAGX36
At a high throughput level, more processes or threads may be required than on a traditional platform in order to use the platform efficiently. This section describes how to design the parallel architecture on the TILERAGX36 platform. To reduce contention for hardware resources, tasks should be allocated to different cores in a reasonable way. Since our goal is to reduce the energy consumption of processes, which stems mostly from inter-process communication (IPC), the IPC schemes are also discussed. Meanwhile, dynamic load balancing is a particularly important method to control the performance.

Task Decomposition
As described in Section 3, the system architecture is layered and centralized. Task decomposition can be considered the basic strategy of parallel processing. It solves complex computational problems by splitting them into sub-tasks that run simultaneously on a multiprocessor. To demonstrate the processes of every layer in real-world settings, the implementation is shown in Figure 6. Blue, green and purple denote the control layer, virtual user layer and traffic generation layer, respectively. Taking full advantage of the TILERAGX's parallel processing ability, we use one control process as the parent process, which forks its child processes. The parallel processing in this system is based on multiple processes, which are divided into the control process, virtual user process, log processing process and traffic generation processes. It has been shown that a multi-threaded application has comparatively better event processing capabilities, in terms of meeting processing deadlines, than a multi-process application [28]. Task processing, user scheduling and load balancing all belong to the virtual users layer, so this layer is implemented as a multi-threaded application, while the other modules run as separate processes. Since a single engine can hardly complete the workload, the traffic generation task must be distributed over multiple engines.
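The parent/child structure described above can be sketched as follows: a control process forks one child per engine and later collects their exit status. This is a simplified illustration under stated assumptions, not TGMP's process manager; spawn_engines, reap_engines and idle_engine are hypothetical names, and a real engine would run its event loop instead of returning immediately.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef int (*engine_main)(int rank);

/* Fork n child processes from the control (parent) process. Each child
 * runs run(rank) and exits with its return value. The child PIDs are
 * stored in pids[]. Returns the number of children actually started. */
int spawn_engines(int n, engine_main run, pid_t *pids)
{
    assert(n > 0 && run != NULL && pids != NULL);
    for (int rank = 0; rank < n; rank++) {
        pid_t pid = fork();
        if (pid < 0)
            return rank;               /* fork failed: report partial start */
        if (pid == 0)
            _exit(run(rank));          /* child: never returns to the caller */
        pids[rank] = pid;
    }
    return n;
}

/* Wait for the given children; returns how many exited with status 0. */
int reap_engines(const pid_t *pids, int n)
{
    int ok = 0;

    for (int i = 0; i < n; i++) {
        int status;
        if (waitpid(pids[i], &status, 0) == pids[i]
            && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    return ok;
}

/* Placeholder engine body used for illustration only. */
int idle_engine(int rank)
{
    (void)rank;
    return 0;
}
```

The rank passed to each child is the natural index to use when a process is later bound to a core or identified in IPC messages.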
According to the use cases described in Section 3, the first and second phases are relatively simple; we therefore describe the virtual users layer and the traffic generation layer in detail and explain the interaction between them.

A. Virtual users layer
As the middle level, the virtual users layer is mainly responsible for the management of virtual users: it must both handle the messages sent by the control layer and deliver requests to the traffic generation layer. Flexible user management and resource allocation are necessary to make the system available to multiple experimenters at the same time. According to the demands of the experimenters, virtual users are divided into different user groups.
The user management module is responsible for carrying out the experimenter's experiments, which involves resource allocation, user behavior parameter configuration and scheduling for a large number of virtual users. In each experiment, the experimenter usually needs to create a large number of virtual users and the corresponding user groups, and these resources must be freed at the end of the operation. Such frequent memory operations incur considerable overhead and easily cause memory leaks and other issues. By establishing a resource pool and a set of connections, the user management strategy avoids the overhead of frequent resource creation and release. This module implements user_pool, group_pool and epm_pool based on the object pool technique.
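The object pool technique mentioned above can be sketched as a preallocated slab with a free list. Only the name user_pool comes from the text; the fields of struct vuser and the functions pool_init, pool_get, pool_put and pool_destroy are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A virtual user object; the fields are illustrative only. */
struct vuser {
    int id;
    int group;
    struct vuser *next;      /* link in the pool's free list */
};

struct user_pool {
    struct vuser *slab;      /* one allocation for all objects */
    struct vuser *free_list; /* objects currently available */
    size_t capacity;
    size_t in_use;
};

/* Allocate all objects once, so that later get/put never call malloc. */
int pool_init(struct user_pool *p, size_t capacity)
{
    assert(p != NULL && capacity > 0);
    p->slab = calloc(capacity, sizeof *p->slab);
    if (p->slab == NULL)
        return -1;
    p->free_list = NULL;
    for (size_t i = capacity; i-- > 0;) {   /* push in reverse: head is slab[0] */
        p->slab[i].id = (int)i;
        p->slab[i].next = p->free_list;
        p->free_list = &p->slab[i];
    }
    p->capacity = capacity;
    p->in_use = 0;
    return 0;
}

/* Take an object from the pool, or NULL if the pool is exhausted. */
struct vuser *pool_get(struct user_pool *p)
{
    struct vuser *u = p->free_list;

    if (u == NULL)
        return NULL;
    p->free_list = u->next;
    u->next = NULL;
    p->in_use++;
    return u;
}

/* Return an object to the pool for reuse. */
void pool_put(struct user_pool *p, struct vuser *u)
{
    assert(u != NULL && p->in_use > 0);
    u->next = p->free_list;
    p->free_list = u;
    p->in_use--;
}

void pool_destroy(struct user_pool *p)
{
    free(p->slab);
    p->slab = NULL;
    p->free_list = NULL;
    p->capacity = 0;
    p->in_use = 0;
}
```

Because the slab is released in a single free() call, objects that an experiment forgets to return cannot leak beyond the lifetime of the pool.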
Task processing: While a simulation task message is processed, some modules must be initialized according to the contents of the message. First, we initialize the free queue (free_q) and the active queue (live_q), and then obtain the specified number of virtual users from user_pool and insert them into free_q. If the message contains scene parameters, they are processed using the cubic spline interpolation algorithm described in Section 4.
User scheduling: User scheduling is responsible for scheduling a large number of users so that they generate request messages and add them to the request queue. It can be divided into two sub-processes: user activation and user number control. If the task queue is not empty, the scheduler scans the task queue periodically and activates the users in each group at a certain frequency, adding them to live_q. If the user group contains scene parameters, it is handed to the user number control process once the number of active users reaches the starting point of the scene; this process then controls the number of active users based on the user profile. Otherwise, all users in the group are activated, and the user behavior module controls the behavior of each user.
Load balancing: Deploying multiple traffic generation engines offers the opportunity to improve concurrent performance through parallel access. The load balancing module assigns the messages in the request queue to the traffic generation engines.
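The user number control step of user scheduling can be sketched as moving users between free_q and live_q until the number of active users matches the target for the current time slot (an entry of the reference set B). The queue names come from the text; struct sched_user, struct user_queue and control_active_users are hypothetical, and the sketch ignores timers and user behavior.

```c
#include <assert.h>
#include <stddef.h>

struct sched_user {
    int id;
    struct sched_user *next;
};

/* A simple FIFO of users, standing in for free_q and live_q. */
struct user_queue {
    struct sched_user *head;
    struct sched_user *tail;
    size_t len;
};

void queue_push(struct user_queue *q, struct sched_user *u)
{
    u->next = NULL;
    if (q->tail != NULL)
        q->tail->next = u;
    else
        q->head = u;
    q->tail = u;
    q->len++;
}

struct sched_user *queue_pop(struct user_queue *q)
{
    struct sched_user *u = q->head;

    if (u == NULL)
        return NULL;
    q->head = u->next;
    if (q->head == NULL)
        q->tail = NULL;
    u->next = NULL;
    q->len--;
    return u;
}

/* Activate or deactivate users until live_q holds `target` users or
 * free_q is empty. Returns the signed change in the number of active
 * users. */
long control_active_users(struct user_queue *free_q,
                          struct user_queue *live_q, size_t target)
{
    long delta = 0;

    assert(free_q != NULL && live_q != NULL);
    while (live_q->len < target && free_q->len > 0) {
        queue_push(live_q, queue_pop(free_q));   /* activate */
        delta++;
    }
    while (live_q->len > target) {
        queue_push(free_q, queue_pop(live_q));   /* deactivate */
        delta--;
    }
    return delta;
}
```

Calling control_active_users once per time slot with the corresponding target makes the active population follow the interpolated curve.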

B. Traffic generation layer
Request management module: In order to ensure concurrency, asynchronous programming has to be used. In this module, the theory of finite-state machines is introduced. The global state machine at this level is shown in Figure 7. The blue line represents the general state transition process, in which a state may take several asynchronous steps to complete; the green line indicates that there is a pending URL transfer process; the other lines denote transitions caused by error conditions during state processing. When a request message is received, the machine enters the start state: a request object is fetched from the request pool, and the page URL (a URL containing a plurality of embedded resources) accessed by this request is acquired. Finally, the result is sent to the log processing module and the request object is reclaimed.
HTTP processing module: We do not care about the specific content of the HTTP response, and there is no need to parse the content of the Web document. The HTTP processing module only parses the HTTP header to determine the length of the response entity, the coding and the server information.
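Determining the entity length from the header alone can be sketched as a scan for the Content-Length field. The function below is an illustrative assumption, not TGMP's parser: it expects a complete, NUL-terminated header block with CRLF line endings and ignores chunked transfer encoding.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

/* Scan a response header block (status line and header lines separated
 * by CRLF, terminated by an empty line) for Content-Length. Returns the
 * length, or -1 if the field is absent or malformed. */
long parse_content_length(const char *hdr)
{
    static const char name[] = "content-length:";
    const size_t name_len = sizeof name - 1;
    const char *line = hdr;

    assert(hdr != NULL);
    while (*line != '\0') {
        const char *eol = strstr(line, "\r\n");
        if (eol == NULL || eol == line)
            break;                     /* truncated input or end of headers */
        if ((size_t)(eol - line) > name_len
            && strncasecmp(line, name, name_len) == 0) {
            char *end;
            long v = strtol(line + name_len, &end, 10);
            /* Reject empty values, negatives, and digits taken from a
             * following line (strtol skips CR and LF as whitespace). */
            if (end == line + name_len || end > eol || v < 0)
                return -1;
            return v;
        }
        line = eol + 2;                /* advance past CRLF */
    }
    return -1;
}
```

Once the length is known, the engine can count and discard that many bytes of the entity without inspecting them.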
Log processing module: This module pushes logs into a database. When new log entries arrive, bulk inserts are a convenient way to improve efficiency. In addition, timed events can be used to flush the logs periodically, which enhances concurrency.

Task Mapping
On the TILERAGX36 platform, each core can run a full operating system, which provides high flexibility and simplicity for software development. Task mapping is implemented by binding a task or process to a designated CPU rather than allowing it to run on any CPU. Under these circumstances, scheduling the process to execute on the same processor improves its performance by reducing performance-degrading events. Additionally, it effectively improves the cache hit ratio and reduces the number of memory accesses.
Furthermore, when a process or thread is bound to one CPU, the Linux kernel scheduler no longer migrates it between CPUs. As a result, the execution overhead of the program is largely reduced. For the Tile-Gx36 platform, the CPU affinity [31] is set by the following steps. First, the affinity set of the program is obtained; second, the task process is bound to a specific CPU according to its unique index in the CPU affinity set. The details are given in Table 2.

    cpu_set_t cpus;
    //get the set of CPUs available to this program
    if (tmc_cpus_get_my_affinity(&cpus) != 0)
        tmc_task_die("tmc_cpus_get_my_affinity() failed.");
    //bind to the allocated CPU
    if (tmc_cpus_set_my_cpu(tmc_cpus_find_nth_cpu(&cpus, rank)) < 0)
        tmc_task_die("tmc_cpus_set_my_cpu() failed.");

Using the method above, we can bind the decomposed tasks to multiple cores of the Tile-Gx36. The control engine, virtual user engine and log engine are each assigned their own core for processing. Each traffic generation engine is bound to one core, so the number of traffic generation engines should be adjusted dynamically according to the total number of cores. Moreover, Figure 8 shows a schematic diagram of the CPU affinity setting for each process.

IPC Scheme
In addition to task decomposition and task mapping, an IPC scheme must be chosen for many-core processors. Pipes, message queues, Unix sockets, signals and shared memory are currently the most widely used kinds of IPC. The Tile-Gx36 processor provides a new method, the UDN, which is used to improve data transfers among tiles. However, the UDN is intended for small packets of no more than 128 bytes. If a packet is too large or the receive buffer overflows, the system can deadlock.
In our real-time processing tasks, each layer requires good two-way communication to ensure better supervision and process scheduling. The Unix socket exhibits very low transmission latency and supports full-duplex mode, which provides much better bidirectional communication. Figure 9 shows the communication between the engines, which consists of two parts. One-way communication is adopted between the traffic generation module, control module and log processing module, while the others use two-way communication.
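Full-duplex communication over a Unix socket can be sketched with socketpair and a simple length-prefixed message format. The framing and the names ipc_open, ipc_send and ipc_recv are assumptions made for illustration; TGMP's actual message format is not described here. Interrupted system calls are not retried in this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a connected pair of Unix stream sockets. After fork(), the
 * parent keeps fds[0] and the child keeps fds[1]; both ends can send
 * and receive. */
int ipc_open(int fds[2])
{
    return socketpair(AF_UNIX, SOCK_STREAM, 0, fds);
}

static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

static int read_all(int fd, void *buf, size_t len)
{
    char *p = buf;

    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0)
            return -1;                 /* error or peer closed */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Send one message: a 4-byte length in host byte order (both ends run
 * on the same machine), followed by the payload. */
int ipc_send(int fd, const void *msg, uint32_t len)
{
    assert(fd >= 0 && msg != NULL);
    if (write_all(fd, &len, sizeof len) != 0)
        return -1;
    return write_all(fd, msg, len);
}

/* Receive one message into buf. Returns its length, or -1 on error or
 * if the message does not fit into cap bytes. */
long ipc_recv(int fd, void *buf, uint32_t cap)
{
    uint32_t len;

    if (read_all(fd, &len, sizeof len) != 0 || len > cap)
        return -1;
    if (read_all(fd, buf, len) != 0)
        return -1;
    return (long)len;
}
```

The same descriptor carries requests in one direction and status reports in the other, which is what makes a single socket pair sufficient for each pair of communicating engines.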

Dynamic Load-Balancing Based on the Minimum Number of Requests (DMR)
Because there are multiple traffic generation engines, dynamic load balancing is an important factor in parallel performance [32]. When the virtual user engine produces a large number of requests, it must decide which traffic generation engine should handle each of them. Common approaches to this load-balancing problem are based on the polling algorithm, the weighting algorithm and the hash algorithm [33].
When the traffic generation engines have basically the same node performance, a simple polling algorithm can be used for load balancing. However, the traffic generation engine uses the request object as its basic management unit, and the life cycles of request objects differ, so the evolution of the load is unpredictable. Furthermore, a simple approach can be very slow to restore load balance, especially after a sudden increase in traffic or an abnormal process. In this paper, we present a dynamic load-balancing algorithm based on the minimum number of requests (DMR), which adapts to dynamic changes in the traffic generation engines. Each traffic generation process sends its own number of active requests to the virtual user layer in real time. The load-balancing module then updates its history record and selects the traffic generation process with the smallest number of active requests to receive the request messages. DMR can be classified as a weight-class algorithm. The weight factor is the number of active requests in the process: the lower the weight, the more likely the process is to be selected.
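The selection rule described above can be sketched as follows. This is a minimal illustration, not the TGMP implementation: the table layout, the names dmr_report, dmr_select and dmr_dispatch, the 64-engine bound, and the choice to count a dispatched request immediately (before the next report arrives) are all assumptions. A round-robin function is included as the polling baseline.

```c
#include <stddef.h>

#define DMR_MAX_ENGINES 64  /* hypothetical upper bound for this sketch */

typedef struct {
    unsigned active[DMR_MAX_ENGINES]; /* last known active requests per engine */
    size_t count;                     /* number of engines currently running */
} dmr_table;

/* Record the active-request count reported by one traffic generation engine. */
void dmr_report(dmr_table *t, size_t engine, unsigned active_requests)
{
    if (engine < t->count)
        t->active[engine] = active_requests;
}

/* Return the engine with the fewest active requests (lowest index on ties),
 * or (size_t)-1 when no engine is running. */
size_t dmr_select(const dmr_table *t)
{
    size_t best = (size_t)-1;
    for (size_t i = 0; i < t->count; i++)
        if (best == (size_t)-1 || t->active[i] < t->active[best])
            best = i;
    return best;
}

/* Pick a target and count the new request against it right away, so that
 * a burst of requests is not sent to one engine between two reports. */
size_t dmr_dispatch(dmr_table *t)
{
    size_t target = dmr_select(t);
    if (target != (size_t)-1)
        t->active[target]++;
    return target;
}

/* Polling (round-robin) baseline: ignores the load entirely. */
size_t polling_select(size_t *cursor, size_t count)
{
    if (count == 0)
        return (size_t)-1;
    size_t target = *cursor % count;
    *cursor = (*cursor + 1) % count;
    return target;
}
```

When a new, empty engine joins the table, dmr_dispatch keeps choosing it until its count reaches that of the least-loaded existing engine, whereas polling_select continues to hand every engine the same share regardless of load.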

Evaluation
In this section, we implement and deploy TGMP in a real network and evaluate the performance of our system through four groups of experiments. We also discuss instances of its usage and directions for our future work.

Experimental Setup
TGMP has been implemented and deployed in a real network on the third floor of the YiFu building on the CQUPT campus. As shown in Figure 10, the deployment consists of Tile-Gx36, Nginx Web Service, an LNMP (Linux + Nginx + MySQL + PHP) Web Server, etc. Nginx Web Service, which has high concurrency performance, was set up on several computers to conduct a comprehensive test of Web traffic. The Web Manage server provides a visual interface for the experiments. When the experimenter sends a simulated service message to Tile-Gx36, TGMP processes the task and then sends requests to Nginx Web Service to generate real traffic. Furthermore, all equipment is deployed within a local area network (LAN) because of the limited bandwidth at the experimental site. We also ran many randomized test experiments that perform different tasks. TGMP is implemented in C and the Web management interface is realized in PHP. Note that the design concept of the whole system is based on Nginx, which focuses on high performance, high concurrency and low memory usage. Additional features on top of the web server functionality, such as load balancing, caching, access and bandwidth control, and the ability to integrate efficiently with a variety of applications, have helped to make Nginx a good choice for modern architectures.

The Traffic Self-Similarity
Many studies [34,35] have reported that network traffic tends to be self-similar. This section examines whether traffic generated under the user-control method based on cubic spline interpolation is self-similar at large scale. Similar to [36], we adopt the Hurst index, the key indicator of the self-similarity of network traffic, whose value should lie in the range 0.71∼0.89 for the traffic to be judged self-similar. We select the stress test software http_load for comparison with the network traffic based on user behavior. We construct the background traffic, capture the packets, and then compute the Hurst parameter to check the degree of self-similarity of the network traffic using the variance-time method [37] and the R/S method [38]. Figure 11 shows that the Hurst index of http_load is far from the theoretical value (0.71∼0.89). The green line shows the self-similarity of the generated traffic and the red line shows the theoretical traffic self-similarity. Thus, the gap between these two lines can be used to measure how realistic the generated traffic is: the smaller the gap, the more realistic the generated traffic and the better the traffic generation system. By comparison, in Figure 12 the Hurst index is 0.87 and 0.83 respectively, which indicates that the self-similarity improves. In this paper, the ON/OFF-based user behavior model is adopted, and the overlay of multiple ON/OFF sources can generate traffic that is more consistent with the actual network [39]. The weak self-similarity of http_load traffic is mainly caused by the smaller size of the requested resources and the lower frequency of requests. To evaluate the effect of the cubic spline interpolation algorithm on self-similarity, we further study a realistic network workload over a long and large time scale. The experimental scene is designed to demonstrate large-scale flow simulation.
To be specific, we assume a network of 100,000 users, and user control follows the Baidu statistics of February 2016 [25]. A shell script is executed once per minute over 24 h to collect the network traffic data. In Figure 13a, we observe a strong correlation with the cubic spline interpolation algorithm. As expected, the system can control the number of virtual users effectively to achieve large-scale traffic generation that is more similar to, and in line with, the actual network.
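For reference, the two Hurst estimators named above are commonly stated as follows. This is the standard textbook formulation, given here as a reading aid; the symbols $X_i$, $m$, $\beta$, $R(n)$ and $S(n)$ are introduced for this note and are not taken from [37,38].

```latex
% Variance-time method: aggregate the series X_1, X_2, ... into blocks of size m
X^{(m)}_k = \frac{1}{m} \sum_{i=(k-1)m+1}^{km} X_i ,
\qquad
\operatorname{Var}\!\left(X^{(m)}\right) \sim \sigma^2 m^{-\beta},
\quad 0 < \beta < 1 .
% The slope of log Var(X^{(m)}) against log m is -beta, and
H = 1 - \frac{\beta}{2} .

% R/S method: R(n) is the range of the cumulative deviations from the mean
% over n samples, S(n) is the sample standard deviation
\mathbb{E}\!\left[\frac{R(n)}{S(n)}\right] \sim c \, n^{H}
\quad (n \to \infty),
% so H is the slope of log(R/S) against log n.
```

In both cases a value of $H$ between 0.5 and 1 indicates long-range dependence, which is why a Hurst index inside the 0.71∼0.89 band is read as self-similar traffic.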
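One way to turn coarse user statistics into a smooth per-minute target is a natural cubic spline, sketched below. This is an illustration under assumptions: the natural boundary condition (zero second derivative at both ends), the function names, and the caller-supplied work array are choices made for this sketch, and the paper's own boundary conditions may differ.

```c
#include <stddef.h>

/* Compute the second derivatives y2[] of the natural cubic spline through
 * the knots (x[i], y[i]), i = 0..n-1, with x strictly increasing.
 * work[] is scratch space of length n. Returns 0 on success, -1 if n < 2. */
int spline_fit(const double *x, const double *y, size_t n,
               double *y2, double *work)
{
    if (n < 2)
        return -1;
    y2[0] = 0.0;      /* natural boundary: zero curvature at the first knot */
    work[0] = 0.0;
    /* Forward sweep of the tridiagonal solve. */
    for (size_t i = 1; i + 1 < n; i++) {
        double sig = (x[i] - x[i - 1]) / (x[i + 1] - x[i - 1]);
        double p = sig * y2[i - 1] + 2.0;
        double d = (y[i + 1] - y[i]) / (x[i + 1] - x[i])
                 - (y[i] - y[i - 1]) / (x[i] - x[i - 1]);
        y2[i] = (sig - 1.0) / p;
        work[i] = (6.0 * d / (x[i + 1] - x[i - 1]) - sig * work[i - 1]) / p;
    }
    y2[n - 1] = 0.0;  /* natural boundary at the last knot */
    /* Back substitution. */
    for (size_t k = n - 1; k-- > 0; )
        y2[k] = y2[k] * y2[k + 1] + work[k];
    return 0;
}

/* Evaluate the fitted spline at t, with x[0] <= t <= x[n-1] and n >= 2. */
double spline_eval(const double *x, const double *y, const double *y2,
                   size_t n, double t)
{
    size_t lo = 0, hi = n - 1;
    while (hi - lo > 1) {             /* binary search for the interval */
        size_t mid = (lo + hi) / 2;
        if (x[mid] > t)
            hi = mid;
        else
            lo = mid;
    }
    double h = x[hi] - x[lo];
    double a = (x[hi] - t) / h;
    double b = (t - x[lo]) / h;
    return a * y[lo] + b * y[hi]
         + ((a * a * a - a) * y2[lo] + (b * b * b - b) * y2[hi]) * h * h / 6.0;
}
```

For example, with knots placed every 60 minutes, calling spline_eval once per minute yields a smooth target for the number of virtual users. A spline can overshoot between knots, so a real controller would likely clamp the result to the range from zero to the total user population.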

Load Balancing
The main goal of load-balancing schemes is to improve resource utilization and provide a system with high concurrency performance. We start a traffic generation process and bind it to tile 3, keeping the CPU usage at about 80%. When the system is stable, we add another traffic generation process bound to tile 4, and then sample the CPU utilization of both tiles every 0.001 s. To show the change in CPU utilization more clearly, the full data set is split into 1500 segments of 100 data points each, and the average utilization of each segment is calculated.
We test two scenarios: DMR and Polling.
As can be seen from Figure 14a,b, when we add a new traffic generation process, the DMR algorithm achieves load synchronization between the two processes in one cycle, while the Polling algorithm takes two cycles. This is mainly because DMR dynamically feeds back the real-time load of each process, whereas Polling can only rebalance as existing requests complete. In addition, Polling shows greater variability than DMR, which is mainly caused by the lifecycle of each request object. When a process is assigned request messages with longer lifecycles, it takes up more CPU resources, resulting in greater volatility. DMR, however, reduces this volatility through its dynamic feedback function. From these two aspects of the comparison, we can see that the proposed DMR performs better.
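The segment averaging described above can be sketched as follows. The function name and the choice to ignore a trailing partial segment are assumptions of this sketch. Read as 1500 segments of 100 samples each at one sample per 0.001 s, the data set would cover 150 s.

```c
#include <stddef.h>

/* Average consecutive, non-overlapping segments of seg_len samples.
 * Writes one mean per full segment into out[] (at most out_cap values)
 * and returns the number of means written. A trailing partial segment
 * is ignored. */
size_t segment_means(const double *samples, size_t n, size_t seg_len,
                     double *out, size_t out_cap)
{
    size_t written = 0;
    if (seg_len == 0)
        return 0;
    for (size_t start = 0; start + seg_len <= n && written < out_cap;
         start += seg_len) {
        double sum = 0.0;
        for (size_t i = 0; i < seg_len; i++)
            sum += samples[start + i];
        out[written++] = sum / (double)seg_len;
    }
    return written;
}
```

Averaging over 100 samples smooths out millisecond-level jitter, so that the step change caused by adding the second process stands out in the plot.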

Concurrent Performance
In this experiment, we evaluate the most important index, concurrent performance, which reflects the efficiency of the system. The test covers two aspects. First, we run a single traffic generation process on Tile-Gx36 and compare it, under the same test scenarios, with Nginx, which has a high-level process architecture and can handle multiple connections within a single process. Second, we vary the number of traffic generation processes and observe the effect.
To keep the platform consistent, we install Nginx on another Tile-Gx36 platform and run it in single worker process mode. Figure 15a shows the comparison of CPU utilization, which remains basically the same. Thanks to the adoption of HTTP long connections, the traffic and the number of connections change linearly with the number of virtual users, as reflected in Figure 15b,c. We observe that the change in memory is small in Figure 15c. Furthermore, when the CPU utilization is around 80%, a single traffic generation process can support more than 7000 active virtual users, resulting in a traffic rate of approximately 80 MB/s (640 Mb/s) and a total of 2.3 GB. Figure 16 shows how the number of virtual users changes when many traffic generation processes are used. As in the previous experiment, all of these metrics increase with the number of traffic generation processes. Meanwhile, in Figure 16a, the number of virtual users has a strong but not perfectly linear relationship with the number of processes, which may be caused by the response time. To be more specific, as the number of connections increases, the mass of traffic lengthens the response time of the server, as shown in Figure 16d. In particular, when the number of traffic generation processes exceeds eight, the response time of the server increases rapidly. As the number of accessing users increases, the Web server processing time increases dramatically. However, the CPU utilization is lower than expected. On the other hand, with a higher-performance Web server, the bottleneck would be greatly reduced. Overall, the traffic generation system has good concurrency performance: it can simulate more than 50,000 users at the same time, and the flow rate is as high as 4 Gbps.
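The figures quoted above can be cross-checked with a short back-of-the-envelope calculation. It treats the reported values (7000 users and 80 MB/s per process, 50,000 users and 4 Gbps in aggregate) as exact, which the text only states approximately, so the results are indicative rather than measured.

```latex
% Unit conversion for the single-process rate
80\ \text{MB/s} \times 8\ \text{bit/B} = 640\ \text{Mb/s}

% Processes needed for 50,000 users under ideal linear scaling
\left\lceil \frac{50{,}000}{7000} \right\rceil = \lceil 7.14 \rceil = 8

% Per-user rate: single process versus aggregate
\frac{640\ \text{Mb/s}}{7000} \approx 91.4\ \text{kb/s},
\qquad
\frac{4\ \text{Gb/s}}{50{,}000} = 80\ \text{kb/s}
```

The slightly lower per-user rate in the aggregate case is consistent with the longer server response time reported for larger numbers of traffic generation processes.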

System Stability
The stability test mainly verifies whether the system can run stably under a large load. The test proceeds as follows: the system starts 20 traffic generation processes and the number of simulated users is set to 5 million. The test task then runs continuously for 24 h, with all virtual users kept active. As shown in Figure 17a, the network traffic remains relatively stable. Figure 17b shows that the system's CPU utilization fluctuates but does not increase over time. Memory usage fluctuates only slightly, which indicates that the system avoids memory leaks, as shown in Figure 17c. In summary, the system is highly stable and can produce long-term, steady flow.

Comparison with Existing System
To understand the performance of our traffic generator, we compare it with some existing well-known tools in Table 3. Compared with currently proposed traffic generator systems, our parallel design on TILERAGX36 achieves good performance, simulating 50,000 users accessing the Web server simultaneously. The obtained performance looks very promising. As it is possible to have up to two TILERAGX36 many-core processors on the same board, we can expect to almost double the attained performance in practice. Note that the main difference is the platform: MoonGen [23] and TRex [24] must rely on a special configuration. Moreover, the power consumption of the TILERA processor is 50 W at 1.2 GHz, which is much lower than that of the others [14]. Another promising aspect of our design is that it adopts task decomposition and task mapping by binding tasks to designated CPUs, which allows the design to be adapted and extended easily, especially as future processors will have an increasing number of cores.

Conclusions
A traffic generator plays an indispensable role in research on network architectures, new network protocols, network services, etc. The lack of scalability and efficiency in existing schemes motivates us to design an efficient general framework. Emerging many-core processors, which offer a revolutionary operating mechanism, can be leveraged to enhance performance. In this paper, we have presented the TGMP system, which is the first practical Web traffic generator operating on the TILERAGX36 processor. More specifically, a scalable, flexible and extensible layered system architecture is designed to cope with highly changeable scenarios.
We adopt cubic spline interpolation to generate a realistic network workload. To address the problem of producing a representative workload, three high-concurrency strategies are designed, and TGMP is implemented in a real network to evaluate its efficiency. The experiments show that TGMP yields efficiency comparable to existing tools, but with much lower cost and maintenance effort. Due to limited space, this paper only considers Web traffic; we will therefore extend the system to generate other types of traffic, such as video traffic and P2P traffic. Furthermore, the design ideas proposed in this paper may also be instructive for other systems.