To sustain high throughput, the platform may require more processes or threads than a traditional platform would. This section describes how we design the parallel architecture on the TILE-Gx36 platform. To reduce contention for hardware resources, tasks must be allocated sensibly across different cores. Since our goal is to reduce the energy consumption of processes, which stems mostly from inter-process communication (IPC), the IPC scheme is also discussed. Finally, dynamic load balancing is a particularly important method for controlling performance.
5.3.1. Task Decomposition
As described in Section 3, the system architecture is layered and centralized. Task decomposition is the basic strategy of parallel processing: a complex computational problem is split into sub-tasks that run simultaneously on a multiprocessor. To demonstrate the processes of every layer in a real-world setting, the implementation is shown in Figure 6. Blue, green and purple denote the control layer, virtual user layer and traffic generation layer, respectively.
Taking full advantage of the TILE-Gx36's parallel processing ability, we use one control process as the parent process, which forks its child processes. The multi-process implementation of this system is divided into the control process, virtual user processes, log processing and traffic generation processes. It has been shown that a multi-threaded application has better event-processing capabilities, in terms of meeting processing deadlines, than a multi-process application [28]. Task processing, user scheduling and load balancing all belong to the virtual user layer, so that layer is implemented as a multi-threaded application, while the other modules run as separate processes. Since it is difficult to complete the workload with a single engine, the traffic generation task must be distributed across multiple engines.
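The fork-based decomposition described above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the control (parent) process forks one child per engine, and each child would run its engine loop where the sketch simply exits.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical sketch: a control (parent) process forks one child per
 * decomposed task. Each child would enter its engine's event loop;
 * here it exits immediately so the parent can reap it. */
static int spawn_workers(int n)
{
    int reaped = 0;
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* child: a real engine would run its event loop here */
            _exit(0);
        } else if (pid < 0) {
            return -1;           /* fork failed */
        }
    }
    /* control process: wait for every child engine to terminate */
    while (wait(NULL) > 0)
        reaped++;
    return reaped;
}
```

In the real system the children would be the virtual user, log and traffic generation engines rather than trivial processes, but the parent/child ownership structure is the same.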
According to the descriptions of the use cases in Section 3, the first and second phases are relatively simple; we therefore describe the virtual user layer and the traffic generation layer in detail and explain the interaction between them.
A. Virtual users layer
As the middle level, the virtual user layer is mainly responsible for the management of virtual users: it must both handle messages sent by the control layer and deliver requests to the traffic generation layer. Flexible user management and resource allocation are necessary so that the platform is available to multiple experimenters at the same time. According to the experimenters' demands, virtual users are divided into different user groups.
The user management module implements the experimenter's experiments and, at the same time, handles resource allocation, user-behavior parameter configuration and scheduling for the virtual users. In each experiment, the experimenter usually needs to create a large number of virtual users and the corresponding user groups, and these resources must be freed at the end of the run. Such frequent memory operations incur considerable overhead and easily cause memory leaks and other issues. By establishing a resource pool and a set of connections, the user management strategy effectively avoids the overhead of frequent resource creation and release. This module implements user_pool, group_pool and epm_pool based on object pool technology.
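The object-pool idea behind user_pool can be sketched as below. The structure and field names are illustrative assumptions, not taken from the implementation: objects are pre-allocated once and recycled through a free list, avoiding the per-experiment allocation overhead and leak risk described above.

```c
#include <stddef.h>

#define POOL_SIZE 1024           /* illustrative pool capacity */

/* Hypothetical sketch of user_pool: virtual-user objects are
 * pre-allocated and recycled through an intrusive free list. */
typedef struct user {
    int id;
    struct user *next;           /* free-list link */
} user_t;

static user_t pool[POOL_SIZE];
static user_t *free_list;

static void pool_init(void)
{
    free_list = NULL;
    for (int i = POOL_SIZE - 1; i >= 0; i--) {
        pool[i].id = i;
        pool[i].next = free_list;
        free_list = &pool[i];
    }
}

static user_t *user_acquire(void)
{
    user_t *u = free_list;
    if (u)
        free_list = u->next;
    return u;                    /* NULL when the pool is exhausted */
}

static void user_release(user_t *u)
{
    u->next = free_list;         /* return the object for reuse */
    free_list = u;
}
```

group_pool and epm_pool would follow the same acquire/release pattern over their own object types.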
Task processing: During simulation-task message processing, some modules must be initialized according to the contents of the message. First, we initialize the free queue (free_q) and the active queue (live_q), then obtain the specified number of virtual users from user_pool and insert them into free_q. If the message contains scene parameters, they are processed using the cubic spline interpolation algorithm described in Section 4.
User scheduling: User scheduling, which activates large numbers of users and joins the generated request messages to the request queue, can be divided into two sub-processes: user activation and user-number control. If the task queue is not empty, the scheduler scans it periodically and activates the users in each group at a certain frequency, moving them into live_q. If a user group contains scene parameters, the user-number control process takes over once the number of active users reaches the scene's starting point, controlling the number of active users according to the user profile. Otherwise, all users in the group are activated, and the user behavior module controls the behavior of each user.
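A single activation tick can be sketched as follows, under the assumption (not stated in the text) that queues track only a count here: each tick moves at most `rate` users from free_q to live_q, with a cap on concurrently active users derived from the scene profile.

```c
/* Hypothetical sketch of user activation: each scheduling tick moves
 * at most `rate` users from the free queue into the live queue, and a
 * scene-derived cap limits the number of concurrently active users. */
typedef struct {
    int count;                   /* users currently in this queue */
} queue_t;

static int activate_users(queue_t *free_q, queue_t *live_q,
                          int rate, int active_cap)
{
    int moved = 0;
    while (moved < rate && free_q->count > 0 &&
           live_q->count < active_cap) {
        free_q->count--;         /* dequeue an idle virtual user */
        live_q->count++;         /* enqueue it as an active user */
        moved++;
    }
    return moved;                /* users activated this tick */
}
```

Varying `rate` per tick is what lets the user-number control process follow the interpolated scene curve.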
Load balancing: Deploying numerous traffic generation engines offers the opportunity to exploit multiple engines to improve concurrent performance. The load balancing module assigns the messages in the request queue to each traffic generation engine.
B. Traffic generation layer
Request management module: To ensure concurrency, asynchronous programming is used; the module is built around a finite-state machine. The global state machine of this layer is shown in Figure 7. The blue lines represent the general state-transition process, where a state may take several asynchronous steps to complete; the green line indicates a pending URL transfer; the other lines denote transitions under error conditions during state processing. When a request message is received, indicating the start state, a request object is fetched from the request pool and the page URL (a URL containing several embedded resources) accessed by this request is acquired. Finally, the result is sent to the log processing module and the request object is reclaimed.
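The shape of such a state machine can be sketched as below. The state and event names are illustrative assumptions (Figure 7's exact states are not reproduced here); the key property shown is that error events from any state fall through to the terminal state so the request object can be reclaimed.

```c
/* Hypothetical sketch of a request state machine: coarse states with
 * error edges from every state leading to reclamation. Names are
 * illustrative, not taken from the implementation. */
typedef enum {
    ST_START,        /* request message received                 */
    ST_FETCH_PAGE,   /* page URL being fetched                   */
    ST_FETCH_EMBED,  /* embedded-resource URLs pending           */
    ST_DONE          /* result logged, request object reclaimed  */
} req_state_t;

typedef enum { EV_OK, EV_MORE_URLS, EV_ERROR } req_event_t;

static req_state_t req_step(req_state_t s, req_event_t e)
{
    if (e == EV_ERROR)
        return ST_DONE;          /* error edges: reclaim immediately */
    switch (s) {
    case ST_START:       return ST_FETCH_PAGE;
    case ST_FETCH_PAGE:  return (e == EV_MORE_URLS) ? ST_FETCH_EMBED
                                                    : ST_DONE;
    case ST_FETCH_EMBED: return (e == EV_MORE_URLS) ? ST_FETCH_EMBED
                                                    : ST_DONE;
    default:             return ST_DONE;
    }
}
```

Because each transition is a pure function of (state, event), the asynchronous event loop can drive many request objects through this machine concurrently.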
HTTP processing module: We do not care about the specific content of the HTTP response, and there is no need to parse the Web document itself. The HTTP processing module only parses the HTTP header to determine the length of the response entity, the encoding and the server information.
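Header-only parsing of this kind can be sketched as below. This is a simplified illustration, not the module's parser: it extracts only the Content-Length field from a raw response header, without touching the body.

```c
#include <string.h>
#include <stdlib.h>

/* Hypothetical sketch of header-only parsing: find Content-Length in
 * a raw HTTP response header and return the entity length, without
 * parsing the document body. Returns -1 when the field is absent
 * (e.g. chunked transfer encoding). */
static long parse_content_length(const char *hdr)
{
    const char *p = strstr(hdr, "Content-Length:");
    if (!p)
        return -1;
    /* strtol skips the leading space after the colon */
    return strtol(p + strlen("Content-Length:"), NULL, 10);
}
```

The encoding and Server fields would be located the same way; knowing the entity length is enough for the engine to drain the response without interpreting it.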
Log processing module: This module pushes logs into a database. When new logs arrive, bulk inserts are a convenient way to improve efficiency. In addition, timing events can be used to effectively enhance concurrency.
5.3.2. Task Mapping
On the TILE-Gx36 platform, task mapping is implemented by binding a given task or process to a designated CPU rather than letting it run on any CPU. Each core in the processor can run a full operating system, which provides high flexibility and simplicity for software development. Under these circumstances, always scheduling a process on the same core improves its performance by reducing performance-degrading events: it effectively improves the cache hit ratio and reduces the number of memory accesses.
Furthermore, when a process or thread is bound to one CPU, the Linux kernel no longer considers it in CPU scheduling, which largely reduces the scheduling expense of the program. For the TILE-Gx36 platform, the CPU affinity [31] is set in two steps: first, procure the affinity set of the program; second, bind the task process to a specific CPU according to its unique index in the affinity set. The details are described in Table 2.
Using the method above, we bind the decomposed tasks to the cores of the TILE-Gx36. The control engine, virtual user engine and log engine are each assigned an individual core. Each traffic generation engine is likewise bound to one core, so the number of traffic generation engines must be adjusted dynamically within the limit of the total number of cores. Figure 8 shows the schematic diagram of the CPU affinity setting for each process.
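The two steps can be sketched with the generic Linux affinity API. This is an assumption-laden illustration: the Tilera toolchain offers its own affinity wrappers, and `sched_setaffinity` is used here as the portable equivalent of the procedure in Table 2.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Hypothetical sketch of the two-step binding: read the current
 * affinity set of the calling process, then restrict it to a single
 * core identified by its index. Returns 0 on success, -1 otherwise. */
static int bind_to_cpu(int cpu)
{
    cpu_set_t set;

    /* step 1: procure the affinity set of the program */
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    if (!CPU_ISSET(cpu, &set))
        return -1;               /* requested core not available */

    /* step 2: bind the process to the single requested core */
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}
```

Each engine process would call this once at startup with its assigned core index, after which the kernel keeps it on that tile.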
5.3.3. IPC Scheme
Besides task decomposition and mapping, the IPC scheme must also be chosen for the many-core processor. Pipes, message queues, Unix sockets, signals and shared memory are currently the most widely used kinds of IPC. The TILE-Gx36 processor provides an additional mechanism, the UDN, which accelerates data transfers among tiles. However, the UDN is intended for small packets of no more than 128 bytes; if a packet is too large or the receive buffer overflows, the system can deadlock.
In our real-time processing tasks, each layer requires good two-way communication to ensure proper supervision and process scheduling. The Unix socket exhibits very low transmission latency and supports full-duplex mode, providing much better bi-directional communication. Figure 9 shows the communication between the engines, which consists of two parts: one-way communication is adopted between the traffic generation module, control module and log processing module, while the others use two-way communication.
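The full-duplex property can be demonstrated with a Unix-domain socketpair, as sketched below. This is an illustration rather than the platform's channel setup: both endpoints of the pair can send and receive on the same descriptor, which is what lets an engine receive commands and report status over one connection.

```c
#include <sys/socket.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical sketch of the bi-directional channel: a Unix-domain
 * socketpair yields two connected, full-duplex endpoints. Traffic is
 * sent in one direction, then back in the other on the same pair.
 * Returns the byte count of the reverse-path message. */
static int duplex_echo(const char *msg, char *reply, size_t len)
{
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
        return -1;

    /* endpoint 0 -> endpoint 1 (e.g. control layer -> engine) */
    write(fds[0], msg, strlen(msg));
    ssize_t n = read(fds[1], reply, len - 1);
    reply[n > 0 ? n : 0] = '\0';

    /* endpoint 1 -> endpoint 0 on the same pair (engine -> control) */
    write(fds[1], "ack", 3);
    char back[8];
    n = read(fds[0], back, sizeof(back) - 1);

    close(fds[0]);
    close(fds[1]);
    return (int)n;
}
```

In the real system the two endpoints would live in different processes (after a fork), with the event loop multiplexing reads and writes on each descriptor.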
5.3.4. Dynamic Load-Balancing Based on the Minimum Number of Requests (DMR)
Because there are multiple traffic generation engines, dynamic load balancing is an important step in conditioning the parallel performance [32]. When the virtual user engine produces a mass of requests, it must decide which traffic generation engine should respond. Common approaches to the load-balancing problem are based on polling, weighting and hashing algorithms [33].
When the node performance of the traffic generation engines is basically the same, a simple polling algorithm can be used for load balancing. However, the traffic generation engine manages request objects as its basic unit, and the life cycle of each request object differs, making the evolution of the load unpredictable. Furthermore, recovering load balance can be very inefficient with a simple approach, especially under a sudden increase in traffic or an abnormal process. In this paper, we present a dynamic load-balancing algorithm based on the minimum number of requests (DMR), which adapts to dynamic changes in the traffic generation engines. Each traffic generation process reports its own number of active requests to the virtual user layer in real time. The load-balancing module then updates the history record and selects the traffic generation process with the smallest number of requests to receive request messages. DMR can be classified as a weighted algorithm: the weight factor is the number of active requests in a process, and the lower the weight, the higher the probability of selection.
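The selection step of DMR can be sketched as below. This is a minimal illustration of the minimum-requests rule, with the reported counts assumed to be already gathered into an array: the balancer picks the engine with the fewest active requests and immediately charges it for the dispatched request so consecutive selections stay balanced.

```c
/* Hypothetical sketch of DMR selection: `active[i]` holds the number
 * of active requests most recently reported by engine i. The engine
 * with the minimum count receives the next request message, and its
 * count is incremented to account for the dispatch. */
static int dmr_select(int *active, int n_engines)
{
    int best = 0;
    for (int i = 1; i < n_engines; i++)
        if (active[i] < active[best])
            best = i;            /* engine with the minimum load */
    active[best]++;              /* charge the dispatched request */
    return best;
}
```

Because engines report their counts in real time, an engine hit by a traffic surge or an abnormal condition stops being selected as soon as its report arrives, which is the fast-recovery property the polling algorithm lacks.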