Development of a Server for the Implementation of Data Processing Pipelines and ANN Training †

Abstract: Data processing and machine learning techniques make it possible to solve a wide variety of problems. Their main disadvantage is the large amount of computation involved. For this reason, we set out to develop an architecture that makes the best possible use of the resources available on each machine. The growth of cloud computing and the rise of virtualization techniques allow these tasks to be carried out in a more optimized way.


Introduction
Data processing techniques [1] and machine learning [2] are based on detecting patterns in a data set in order to make estimates about the data. This technology is experiencing a great boom due to the optimization of the different algorithms and the notable increase in the computational capacity of current systems.
Both database processing and the training of machine learning models are complex and computationally expensive tasks. When they involve very large databases or complex models, specific hardware is commonly used to speed them up. Otherwise, these tasks would take a long time on a conventional computer and could even damage some of its components because of the sustained computational load.
In addition, defining different processing pipelines or different models can be very complex for people who are not experts in the field. To alleviate this, there are tools that allow these tasks to be carried out visually. One such tool is Weka [3], which makes these tasks simple to perform. However, this application cannot distribute its execution across different machines.
With these points in mind, namely ease of use and scalability, the architecture of a distributed system for database processing and training of machine learning models is proposed. In this way, the resources of the machine on which the different processes are executed will be specifically dedicated to this task.

Materials and Methods
The boom in virtualization technologies [4] makes them ideal for building a system of this kind. They make the developed system independent of the machine on which it is executed, which provides great versatility and flexibility. In addition, these technologies allow resources to be used exclusively, since each virtualized unit contains only the modules it needs. One of the most powerful and versatile technologies in this field is Docker [5], which makes it easy to define systems together with the characteristics they require. Besides improving efficiency when carrying out the tasks, this provides greater security, since each machine contains only the services necessary to perform the task entrusted to it.
The architecture must also allow the management of different users and databases. This is not a costly task, so both functions are grouped in a single module to optimize the resources of the architecture.

Results
Thanks to this architecture, load balancing [6] can be applied exclusively to the training and data processing processes. This implies that the nodes are activated on demand and have all their resources dedicated to the work assigned to them, without having to handle other functionalities, such as user authentication or file management, that would consume resources unnecessarily. The architecture of the system is shown in Figure 1.
The current development is based on artificial neural networks (ANNs), which allow the implementation of deep learning models [7].
This architecture can be divided into a front-end part based on an MVC pattern [8] and a back-end part composed of three large modules. These modules are divided according to their expected workload. Firstly, there is a Data Processing module [1], whose objective is to perform the operations indicated by the user on the data. Secondly, there is a Model Training module [2], which is in charge of generating the models indicated by the user and training them with the required database. Finally, a Facade module [9] acts as the single entry point to the back end and performs the less expensive operations, such as user management and management of the different files on the server.
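The division of the back end into three modules can be illustrated with a minimal sketch. It is not the implementation described in this work: the class `Facade`, the request fields (`kind`, `name`, `role`, `values`, `model`), and the two stand-in functions are hypothetical names chosen only to show how inexpensive operations might be served directly while expensive ones are forwarded to dedicated modules.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Facade:
    """Hypothetical facade: serves cheap requests itself and forwards
    expensive ones to the dedicated back-end modules."""
    processing_backend: Callable[[dict], dict]
    training_backend: Callable[[dict], dict]
    users: Dict[str, str] = field(default_factory=dict)

    def handle(self, request: dict) -> dict:
        kind = request["kind"]
        if kind == "register_user":
            # Inexpensive operation: handled directly by the facade.
            self.users[request["name"]] = request["role"]
            return {"status": "ok", "handled_by": "facade"}
        if kind == "process_data":
            # Expensive operation: delegated to the data processing module.
            return self.processing_backend(request)
        if kind == "train_model":
            # Expensive operation: delegated to the model training module.
            return self.training_backend(request)
        return {"status": "error", "reason": f"unknown request kind: {kind}"}


def fake_processing(request: dict) -> dict:
    """Stand-in for the data processing module: min-max scales one column."""
    values = request["values"]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    scaled = [(v - lo) / span for v in values]
    return {"status": "ok", "handled_by": "processing", "values": scaled}


def fake_training(request: dict) -> dict:
    """Stand-in for the model training module: only echoes the request."""
    return {"status": "ok", "handled_by": "training", "model": request["model"]}


if __name__ == "__main__":
    facade = Facade(fake_processing, fake_training)
    print(facade.handle({"kind": "register_user", "name": "ana", "role": "admin"}))
    print(facade.handle({"kind": "process_data", "values": [0, 5, 10]}))
    print(facade.handle({"kind": "train_model", "model": "ann"}))
```

In a deployed system the two back-end callables would presumably be remote calls to separate nodes rather than local functions; they are local here only to keep the sketch self-contained.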

Discussion
A scheme has been defined for a server capable of performing data processing and model training tasks in a distributed and on-demand manner. This offers a number of advantages over other systems such as Weka. The latter performs these tasks in a single instance, so the resources of the machine on which it runs are depleted because it must manage every functionality present in the system. The proposed approach can also run on cloud services such as AWS [10], Azure [11], or Google Cloud [12]. The architecture enables nodes to be replicated as needed, each dedicated exclusively to data processing or model training, which offers a great advantage over desktop applications whose only source of computational power is the computer itself.
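The idea of activating and replicating nodes on demand can be sketched as follows. This is an illustrative assumption, not the load balancing strategy used in this work: the names `Node` and `NodePool`, the per-node capacity, the node limit, and the "most free slots first" policy are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List


@dataclass
class Node:
    """Hypothetical worker node dedicated to a single kind of task."""
    name: str
    capacity: int
    jobs: List[str] = field(default_factory=list)

    @property
    def free_slots(self) -> int:
        return self.capacity - len(self.jobs)


class NodePool:
    """Hypothetical pool that starts nodes on demand and releases idle ones."""

    def __init__(self, kind: str, capacity_per_node: int = 2, max_nodes: int = 4):
        self.kind = kind
        self.capacity_per_node = capacity_per_node
        self.max_nodes = max_nodes
        self.nodes: List[Node] = []
        self._ids = count()  # ensures node names are never reused

    def submit(self, job_id: str) -> str:
        """Assign a job to the node with the most free slots, starting a
        new node only when every running node is full."""
        candidates = [n for n in self.nodes if n.free_slots > 0]
        if candidates:
            node = max(candidates, key=lambda n: n.free_slots)
        elif len(self.nodes) < self.max_nodes:
            node = Node(f"{self.kind}-{next(self._ids)}", self.capacity_per_node)
            self.nodes.append(node)
        else:
            raise RuntimeError("no capacity left in the pool")
        node.jobs.append(job_id)
        return node.name

    def finish(self, job_id: str) -> None:
        """Mark a job as done and shut down nodes that became idle."""
        for node in self.nodes:
            if job_id in node.jobs:
                node.jobs.remove(job_id)
        self.nodes = [n for n in self.nodes if n.jobs]


if __name__ == "__main__":
    pool = NodePool("training", capacity_per_node=2, max_nodes=2)
    for job in ["a", "b", "c"]:
        print(job, "->", pool.submit(job))
    pool.finish("a")
    pool.finish("b")
    print("running nodes:", [n.name for n in pool.nodes])
```

On a cloud provider, starting and releasing a `Node` would correspond to launching and stopping a container or virtual machine; the sketch only models the bookkeeping.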

Future Work
This project presents numerous avenues for future work. One possible development is extending the types of machine learning models supported. It would be straightforward to extend the current set, composed only of ANNs, with other algorithms such as SVM, KNN, RF, or LDA.
Another possibility is interactive data processing, in which the user can visualize at each step how the variables they have defined behave.
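One way such an extension could be kept simple is a registry that maps a model name to a training function, so that adding an algorithm does not require changing the code that dispatches training requests. The sketch below is an assumption for illustration only: `MODEL_REGISTRY`, `register_model`, `train`, and the toy nearest-neighbour trainer are hypothetical and are not part of the system described here.

```python
from typing import Callable, Dict, Sequence

Sample = Sequence[float]
Predictor = Callable[[Sample], int]
Trainer = Callable[[Sequence[Sample], Sequence[int]], Predictor]

# Hypothetical registry: model name -> function that trains that model.
MODEL_REGISTRY: Dict[str, Trainer] = {}


def register_model(name: str) -> Callable[[Trainer], Trainer]:
    """Decorator that makes a trainer available under the given name."""
    def decorator(trainer: Trainer) -> Trainer:
        MODEL_REGISTRY[name] = trainer
        return trainer
    return decorator


@register_model("knn")
def train_knn(X: Sequence[Sample], y: Sequence[int]) -> Predictor:
    """Toy 1-nearest-neighbour classifier (k fixed to 1 for brevity)."""
    points = [tuple(row) for row in X]
    labels = list(y)

    def predict(sample: Sample) -> int:
        # Squared Euclidean distance to every stored training point.
        distances = [
            sum((a - b) ** 2 for a, b in zip(point, sample))
            for point in points
        ]
        return labels[distances.index(min(distances))]

    return predict


def train(model_name: str, X: Sequence[Sample], y: Sequence[int]) -> Predictor:
    """Dispatch a training request to the registered trainer."""
    if model_name not in MODEL_REGISTRY:
        raise ValueError(f"unsupported model: {model_name}")
    return MODEL_REGISTRY[model_name](X, y)


if __name__ == "__main__":
    predictor = train("knn", [[0.0, 0.0], [1.0, 1.0]], [0, 1])
    print(predictor([0.9, 0.8]))
```

Under this assumption, the existing ANN trainer and future SVM, RF, or LDA trainers would each be registered in the same way under their own names.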