1. Introduction
Today, video equipment such as CCTV cameras, mobile phones, and drones is highly advanced, and its usage has increased tremendously. As a result, the number of images generated in real time is rapidly increasing. In recent years, this increase has driven up the demand for deep learning image processing, since deep learning models can now process images more accurately and more quickly in real time [1,2,3,4,5].
Real-time image processing is the processing of continuously arriving images within a limited time. What counts as a limited time differs between the fields that use such systems; nevertheless, in every field, processing these images within the valid time window is defined as real-time processing.
There are two ways to process real-time data: micro-batch processing [6] and stream processing [7]. The micro-batch method is a type of batch processing. In the batch method, data are processed in one go once the user-defined batch size has accumulated. The micro-batch method keeps the batch size very small so that each batch is processed in a short time; processing many such small batches in rapid succession approximates real-time processing. The micro-batch method can therefore support real-time processing by modifying existing batch processing platforms such as Hadoop [8] and Spark [9]. With micro-batch processing, both real-time processing and batch processing can be offered on a single platform.
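The micro-batch idea described above can be sketched in a few lines. This is only an illustration of the concept, not part of any of the cited platforms, and the function name is hypothetical:

```python
from typing import Callable, Iterable, List


def micro_batch_process(stream: Iterable, batch_size: int,
                        process: Callable[[List], None]) -> None:
    """Accumulate items until batch_size is reached, then process the
    whole batch in one go; a very small batch_size, processed in rapid
    succession, approximates real-time behaviour."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            process(batch)  # one batch handled "in one go"
            batch = []
    if batch:               # flush the remaining partial batch
        process(batch)


batches = []
micro_batch_process(range(5), batch_size=2, process=batches.append)
# batches is now [[0, 1], [2, 3], [4]]
```

Shrinking `batch_size` trades throughput for latency: the smaller the batch, the closer the behaviour gets to stream processing.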
In stream processing, there is no waiting for a batch of data to accumulate, so no intentional delay occurs [10,11]. The system processes each data item as soon as it arrives. Hence, the stream processing method is mostly used in systems in which the critical time is very important. However, a platform that adopts the stream processing method can only process real-time data.
In order to process large-scale streaming data, it is necessary to increase the processing speed by splitting the work and distributing the tasks across a distributed environment [12]. Large-scale distributed processing requires a technology that gathers many nodes into a cluster and operates them together. For this purpose, the Hadoop distributed file system (HDFS) [13] was developed to split large input files across distributed nodes, and input data in this environment is handled with batch processing. Hadoop uses a two-step programming model called MapReduce [14], consisting of a Map phase and a Reduce phase. In the MapReduce model, the Reduce phase can start only after the Map phase has been completed. For this reason, MapReduce is not suitable for real-time processing, in which input data must flow into the Map phase continuously. To enable real-time processing, a pipeline method was applied to the map and reduce stages on distributed nodes so that incoming data could be processed continuously [15]. However, this method increases the cost of data processing, because data that has already passed the map operation must be reprocessed from the beginning if it fails in the reduce step. For this reason, real-time processing built on the batch processing method is not suitable in situations where critical real-time processing is required.
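As a reminder of the two-phase model, a word count in the classic MapReduce style can be sketched as below. This is a plain-Python illustration only; a real MapReduce job runs the phases on distributed nodes, and the reduce phase can begin only after every map task has finished:

```python
from collections import defaultdict


def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_phase(pairs):
    # Shuffle + Reduce: group values by key, then sum each group.
    # In real MapReduce this step cannot start until the map phase
    # has fully completed, which is what blocks real-time use.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}


counts = reduce_phase(map_phase(["a b a", "b c"]))
# counts == {"a": 2, "b": 2, "c": 1}
```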
A stream processing method has been developed to implement real-time processing in a distributed environment [16]. This model adopts a workflow abstraction, enabling it to handle streaming data in the distributed environment once the user defines the application workflow.
The aforementioned distributed processing systems provide coordination of the distributed environment, task placement, fault tolerance, and node scalability across several distributed nodes. The technology that provides these services is called orchestration. However, if the orchestration is configured on physical nodes, the burden on the user of setting up the distributed environment increases, and the more nodes are used, the harder that setup becomes. In recent years, in order to minimize this burden, systems that configure distributed processing environments on top of virtual environments have been increasing [17,18]. The advantage of configuring distributed nodes as virtual machines is that the number of nodes can be operated flexibly.
Until now, there have not been enough frameworks that distribute a deep learning model across nodes and process real-time streaming data. In addition, it is not easy to deploy the operating environment, such as the operating system, the runtime virtual environment, and the programming model, needed to run deep learning model inference on multiple distributed nodes.
In this paper, we propose a new system called DiPLIP (Distributed Parallel Processing Platform for Stream Image Processing Based on Deep Learning Model Inference), which processes real-time streaming data by deploying, on virtual machines, both a distributed processing environment and the virtual environment needed to run a distributed deep learning model on the distributed nodes. It provides distributed VM (Virtual Machine) containers so that users can easily run trained deep learning models as applications that process real-time images at high speed in a distributed parallel processing environment. DiPLIP provides orchestration techniques such as resource allocation, resource extension, virtual programming environment deployment, trained deep learning model application deployment, and the provision of an automated real-time processing environment. It is an extended and modified system for deep learning model inference based on our previous work [19]. In the previous study, the user environment was deployed with Linux scripts; in this paper, it is deployed and managed with Docker. The purpose of the system is to let the user submit a trained model as the user program for real-time deep learning model inference.
DiPLIP can efficiently process massive streaming images in a distributed parallel environment by providing a multilayered system architecture that supports coarse-grained and fine-grained parallelism simultaneously, minimizing the communication overhead between tasks on distributed nodes. Coarse-grained parallelism is achieved by automatically allocating input streams into partitions, each processed by its corresponding worker node, and is maximized by adaptive resource management, which adjusts the number of worker nodes in a group according to the frame rate in real time. Fine-grained parallelism is achieved by parallel processing of tasks on each worker node and is maximized by appropriately allocating heterogeneous resources such as GPUs and embedded machines. DiPLIP provides a user-friendly programming environment: the system handles coarse-grained parallelism automatically, while users only need to consider fine-grained parallelism by carefully applying parallel programming on the multicore GPU. For real-time massive streaming image processing, we design a distributed buffer system based on Kafka [20], which enables distributed nodes to access and process the buffered images in parallel, greatly improving overall performance. In addition, it supports the dynamic allocation of partitions to worker nodes, which maximizes throughput by preventing worker nodes from being idle.
The rest of the paper is organized as follows: in Section 2, we provide background information and related studies. In Section 3, we describe the system architecture of DiPLIP, and we explain its implementation in Section 4. The performance evaluation is described in Section 5. Finally, Section 6 summarizes the conclusions of our research.
3. DiPLIP Architecture
In this section, we describe the system architecture of DiPLIP in detail. In general, deep learning model serving systems create an endpoint after the model is served, and the user then transmits the data for inference to that endpoint. In existing model serving systems [24,25], real-time model inference is practically difficult because there is no space to temporarily store the stream data generated in real time. Moreover, as the scale of the input data generated in real time increases, the storage space and the processing nodes must be expanded. In the existing model serving method, as the amount of data entering the endpoint grows, it can no longer be accommodated; accordingly, the processing nodes become jammed or data is lost. To solve this problem, our system does not transfer the input data directly to the processing logic after it reaches the endpoint, but passes it to the processing logic through a buffer layer composed of Kafka broker nodes.
Figure 1 shows that the input data generated in real-time is delivered to the endpoint, then distributed and stored in several partitions on the buffer layer, and then delivered to the processing group.
Although there is only one endpoint for model inference, the incoming data is uniformly delivered to the multiple partitions configured by the user in a round-robin manner. The processing group is configured by automatically deploying an environment capable of parallel processing, such as GPU, Xeon Phi, and SIMD. When a worker is ready to process data, it accesses the buffer layer, takes the data, and processes it. Following Kafka's approach, an ACK (acknowledgement) is recorded when a worker node takes data, and a second ACK is recorded after the data has been completely processed, so that a large amount of data can be processed in real time without loss. The size of the buffer layer and the processing layer can be flexibly expanded and reduced according to the volume of the real-time data.
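The round-robin spreading of incoming frames over partitions can be illustrated with a short sketch. This is a deliberate simplification (Kafka performs the distribution inside the broker, and the class and method names here are hypothetical):

```python
from collections import deque


class BufferLayer:
    """Toy stand-in for the Kafka-based buffer: frames arriving at the
    single endpoint are spread over the partitions in round-robin order."""

    def __init__(self, num_partitions):
        self.partitions = [deque() for _ in range(num_partitions)]
        self._next = 0

    def put(self, frame):
        # Round-robin: each new frame goes to the next partition in turn.
        self.partitions[self._next].append(frame)
        self._next = (self._next + 1) % len(self.partitions)

    def take(self, partition_id):
        # A worker takes the oldest frame from its partition. In the real
        # system this is where the first ACK is recorded; a second ACK
        # follows once processing completes, so frames are not lost.
        return self.partitions[partition_id].popleft()


buf = BufferLayer(3)
for frame in range(6):
    buf.put(frame)
# partition 0 holds frames 0 and 3, partition 1 holds 1 and 4,
# partition 2 holds 2 and 5.
```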
Our system consists of four major layers: the user interface layer, the master layer, the buffer layer, and the worker layer. The user interface layer takes the user's requirements and delivers them to the master layer; the user passes the number of nodes with which to configure the buffer layer and the worker layer. Once the buffer layer and worker layer have been successfully created, the system is ready to run the deep learning model. The user passes the trained model to the user interface, and when it reaches the master layer, the model submitter there packages it as a Docker image. The packaged Docker image is stored in the Docker registry on the master layer, and the worker layer pulls the trained image from the Docker registry. The master node allocates the buffer layer and the distributed worker nodes according to the user's request. The buffer layer stores the input data arriving in real time so that the worker nodes can take it. The worker nodes on the worker layer take the input data stored in the buffer layer and process it by running the trained deep learning model submitted by the user. The trained deep learning model on the worker node runs as a Docker image, so the OS and programming environment of the worker node can be easily deployed.
The user interface layer, master layer, buffer layer, and worker layer are shown in the overall architecture of DiPLIP in Figure 2.
3.1. User Interface Layer
Through the resource management service on the user interface layer, the user inputs the amount of resources needed for the buffer layer and the worker layer. The resource management service then notifies the user of the current resource usage. Lastly, the model submitter is responsible for receiving the trained model from the user and delivering it to the master layer.
3.2. Master Layer
The master layer is responsible for resource management and for deploying the trained deep learning model as a VM image. The resource requester on the master layer asks the resource agent manager to allocate the specific resources for broker nodes and worker nodes. The resource agent manager creates resource controllers in the master layer, each of which is connected to one of the broker and worker nodes and is in charge of its life cycle and status monitoring through the resource agent and the resource status monitor, respectively.
The task manager creates and manages the task controllers in the master layer, each of which is connected to one of the broker and worker nodes and is in charge of deploying and executing the task through the task agent and the task executor, respectively. The topic manager creates topic controllers in the master layer, each of which is connected to one of the broker nodes and controls the life cycle of topics and the configuration of partitions in the buffer layer.
Meanwhile, the model submitter stores the submitted trained model in the Docker registry. It then delivers the address of the trained model in the Docker registry to each worker node through the task controller. The resource monitor collects information about the status of the nodes through the resource controllers, which interact with the resource status monitors, and transfers the current status to users via the resource monitoring service in the user interface layer.
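The packaging step performed by the model submitter might look like the following Dockerfile. This is a hypothetical sketch: the base image, file paths, and entry point are assumptions for illustration, not details taken from the DiPLIP implementation:

```dockerfile
# Hypothetical packaging of a user's trained model as a worker image.
# Base image, paths, and entry point are illustrative assumptions.
FROM tensorflow/tensorflow:2.4.1-gpu

# Copy the user's trained model and the inference entry point.
COPY trained_model/ /app/trained_model/
COPY inference.py /app/

WORKDIR /app

# On start-up the container connects to the buffer layer and begins
# consuming frames; the broker address would be injected by the
# task controller at deployment time.
CMD ["python", "inference.py"]
```

An image built this way is pushed to the Docker registry on the master layer, from which each worker node pulls and runs it.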
3.3. Buffer Layer
The buffer layer temporarily stores the stream data generated in real time. One topic is a processing unit that serves a single deep learning model. Multiple broker nodes are placed under this topic to distribute the stream data accumulated in it. In addition, multiple partitions within a single broker node provide logical storage; having multiple partitions spreads the data evenly within one node. In the DiPLIP system, stream data is stored across the nodes and partitions in units of stream image frames using the round-robin method. The load can be distributed over the broker nodes in many ways by adjusting the number of partitions. Figure 3 shows an example of a distributed environment consisting of three broker nodes and three VMs.
In Figure 3, one broker has three partitions. VM 1 connects to broker node 1, VM 2 connects to broker node 2, and VM 3 connects to broker node 3. Each VM has ownership of the three partitions of its broker and cycles over partitions 1, 2, and 3.
As another example case, the number of VM nodes increases to nine in Figure 4. In the figure, three VM nodes are allocated to one broker node. Figure 4 shows the example case of a distributed environment consisting of 3 broker nodes and 9 VMs, where each VM node has ownership of one partition. Since one VM node is mapped to one partition, data distribution is more efficient than in the previous example case.
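The two assignment cases in Figures 3 and 4 amount to a simple ownership mapping from partitions to VM consumers. The sketch below (with hypothetical names, in the spirit of Kafka's round-robin partition assignment) makes the difference concrete:

```python
def assign_partitions(num_partitions, vm_ids):
    """Distribute ownership of a broker's partitions over its VM
    consumers as evenly as possible, round-robin style."""
    ownership = {vm: [] for vm in vm_ids}
    for p in range(num_partitions):
        ownership[vm_ids[p % len(vm_ids)]].append(p)
    return ownership


# Figure 3 case: one VM per broker, three partitions per broker
# -> the single VM cycles over all three partitions.
one_vm = assign_partitions(3, ["vm1"])
# one_vm == {"vm1": [0, 1, 2]}

# Figure 4 case: three VMs per broker
# -> each VM owns exactly one partition, so distribution is more even.
three_vms = assign_partitions(3, ["vm1", "vm2", "vm3"])
# three_vms == {"vm1": [0], "vm2": [1], "vm3": [2]}
```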
3.4. Worker Layer
The worker layer consists of a resource agent and a task executor. The resource agent transfers the available CPU, memory, and GPU state of the current physical node to the resource monitor of the master layer. The resource agent also receives the registry location of the Docker image from the task controller of the master layer and executes a Docker container as a VM from that image. The VM includes the trained deep learning model and a programming virtual environment. As soon as the Docker container starts, it immediately accesses the buffer layer and fetches frames to process the images.
5. Performance Evaluation
In order to test the performance of DiPLIP, we constructed a clustered DiPLIP system with master, worker, broker, and streaming nodes. A streaming source node is used to deliver images to DiPLIP in real time; it continuously generates images of 640 × 480 resolution at 30 fps and transfers them to the buffer layer of DiPLIP. Since our system is designed around Docker containers, it only works on Linux. We evaluate the real-time distributed processing performance by submitting various trained object detection models to our system as the application. On a physical node, several VM worker nodes can be created using Docker; in this experiment, up to two VM workers were created on one physical node. The experiment used one master node, three worker nodes, three broker nodes, and one streaming node. The master node and the broker nodes each had dual cores with 4 GB of RAM, and the worker nodes had quad cores with 16 GB of RAM. Ubuntu 16.04 was used as the OS for all nodes constituting DiPLIP. In addition, we compare the distributed processing speed of the system under various computational loads by using deep learning models [2,5] for object detection as the application. All models share the goal of object detection, but the layers that make up each model differ, so the accuracy and processing speed of the results differ as well. All models were trained on the COCO [27] dataset. The accuracy of each model was measured with mAP [28]; the higher the accuracy, the slower the processing speed. A summary of each model is given in [29].
Figure 11 shows the time taken to process the first 300 frames when the object detection model is inferred in the experimental environment.
Although the time taken for inference varies by model, it is evident that as the number of worker nodes increases, the time to process the input stream decreases. The case of 2 VMs on each of 2 physical worker nodes has a larger total number of VM workers than the case of 3 VMs on 1 physical worker node; given that the latter nevertheless shows a faster processing time, we assume this is due to the effect of network bandwidth. The results of this experiment show that real-time deep learning model inference is processed faster as the number of worker nodes is elastically increased.
Figure 12 shows the differential value of the unprocessed data over time for the SSD MobileNet model.
Data accumulates unprocessed for about 44 s, and after 45 s the data starts to be processed. Once processing starts, the amount of unprocessed data decreases, as seen in the derivative value turning negative. The fact that the differential value of the unprocessed data remains negative after some time means that the amount being processed exceeds the amount of stream data being generated. Conversely, if the differential value of the unprocessed data remained positive, the backlog would grow steadily, implying that it is time to further increase the number of worker nodes.
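The scaling signal described here, the sign of the change in the unprocessed-frame count, can be computed as in the sketch below. This is illustrative only; the function names and the fixed window are hypothetical choices, not DiPLIP internals:

```python
def backlog_derivative(backlog):
    """Differences between consecutive samples of the unprocessed-frame
    counter (one sample per time step); the 'differential value'."""
    return [b - a for a, b in zip(backlog, backlog[1:])]


def should_scale_out(backlog, window=3):
    """Suggest adding worker nodes when the backlog keeps growing,
    i.e. every derivative sample in the recent window is positive."""
    recent = backlog_derivative(backlog)[-window:]
    return len(recent) == window and all(d > 0 for d in recent)


# Backlog grows while workers warm up, then shrinks once processing
# overtakes the input rate.
growing = should_scale_out([10, 20, 30, 40])    # True: scale out
shrinking = should_scale_out([40, 35, 30, 25])  # False: capacity is enough
```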
From the results of this experiment, it can be seen that, when inferring a variety of trained deep learning models, the generated stream images can be processed at a faster rate in a distributed environment. Although the processing speed differs for each model, as the number of worker nodes increases, the number of frames allocated to each worker node decreases, and the overall speed increases accordingly. Moreover, the derivative of the number of unprocessed frames shows that the number of processed frames increases rapidly when more worker nodes are processing. When the differential value of the number of unprocessed frames remains positive, it indicates that it is time to expand the number of worker nodes.