In this section, the requirements for CMS are first analyzed, especially for long-term analytics and model serving. Subsequently, a container-based architecture is designed and its key components are illustrated.
3.1. Requirement Analysis
One platform for learning and deployment. For an industrial machine-learning platform, it is important to support both an off-line trainer for model generation from historical data and online model serving for real-time prediction. For the off-line trainer, several machine-learning suites, e.g., Spark ML, TensorFlow® and Keras®, should be supported, as they are widely used by different developers. These suites adopt different model formats, e.g., the HDF5 format for Keras models and the TensorFlow-specific model format. Supporting multiple suites means that the model-serving implementation should be designed independently of any particular model format. Thus, the current mainstream approaches should be seamlessly integrated. Compatibility with existing modeling efforts is particularly important in industrial environments, where model design normally requires considerable work from both data scientists and industrial domain experts;
Autonomous model re-training. Due to the strong time-series nature of industrial data, the prediction error of a deployed model may grow as time goes on, eventually rendering its outputs unreliable. For an industrial model, the machine-learning pipelines are carefully pre-defined by machine-learning experts. The training processes execute feature selection, feature construction and training in a defined sequence. When newer data arrive, it is important to update the model by executing the same pipeline again on data within a certain time range. By doing so, an up-to-date model with a newer version number is generated and needs to be deployed into the model-serving process. As the machine-learning platform is deployed close to the data source and is mainly maintained by non-machine-learning experts, it is important to provide a solution that enables the whole process to be executed without human intervention;
Model validation. A newly generated model has to be validated before it is deployed into the practical industrial environment. This means that the generated model should be verified, especially against on-line data, to check its accuracy. It is therefore vital for the reliability and robustness of the platform to verify a generated model's performance before pushing it into the production environment. In addition, transmission errors can corrupt the incoming data; model validation should thus be coupled with data validation to address this problem.
Seamless model updating. When a set of new models is generated and validated, the new models must replace their corresponding models in the model-serving service. In industrial environments, many serving processes cannot be interrupted. Thus, it is important to design a seamless model-updating mechanism.
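The validation requirement above can be sketched in a few lines: online samples are first range-checked to filter out values corrupted in transmission, and the candidate model is scored only on the surviving samples. All function names, sample values and thresholds below are illustrative assumptions, not part of the platform.

```python
# Sketch of model validation coupled with data validation (illustrative;
# the value range and threshold are invented for this example).

def valid_sample(x, lo=-100.0, hi=100.0):
    """Data validation: reject out-of-range values caused by transmission errors."""
    return lo <= x <= hi

def validate_model(model, samples, threshold):
    """Return True if the mean absolute error on clean samples stays under threshold."""
    clean = [(x, y) for x, y in samples if valid_sample(x) and valid_sample(y)]
    if not clean:
        return False  # nothing trustworthy to validate on
    mae = sum(abs(model(x) - y) for x, y in clean) / len(clean)
    return mae <= threshold

# A candidate model that doubles its input; the corrupted sample (-999)
# is filtered out by data validation before the model is scored.
candidate = lambda x: 2 * x
samples = [(1, 2), (2, 4), (-999, 0), (3, 6.3)]
ok = validate_model(candidate, samples, threshold=0.2)
```

Only if `ok` holds would the orchestration logic push the candidate toward production; otherwise the currently served model remains in place.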
In addition to the aforementioned requirements, the computing platform needs to process the data near the data source, because enterprises do not want data to be uploaded to a cloud platform for security and privacy reasons. A platform near the data source can avoid data leakage and reduce transmission delay. However, such a platform generally has limited resources, so it should also provide functions related to general computing demands, e.g., computing-resource allocation and isolation. Due to model complexity and the large scale of the data, trainers' executions consume considerable CPU, GPU, disk and memory resources. Thus, it is important to orchestrate the execution of trainers to avoid excessive resource consumption. To achieve maximal flexibility, the existing cloud, fog and edge computing platforms mainly adopt virtualization techniques, e.g., hypervisors [30] (system-level virtualization) and container technologies [31] (operating-system-level virtualization). With the maturity of lightweight container technology, platform architectures [24] prefer container techniques such as Docker instead of traditional VMs. In the following subsection, a Docker-based service architecture is introduced.
3.2. System Architecture
As shown in Figure 1, the computing architecture for CMS is divided into three layers: the infrastructure layer, the scheduling layer and the model management layer.
In the infrastructure layer, Docker technology is used for resource allocation and application isolation. Underlying resources are divided into dynamically allocable units to be used by various applications. The resource isolation feature of the Linux container enables multiple containers to be executed on a single host without starving one another. The container orchestration layer focuses on scheduling different types of resources across multiple computers or domains. It can be implemented by reusing an existing container management engine, e.g., Apache® Mesos, Google Kubernetes®, Docker® Swarm or Rancher® Cattle. These two layers are implemented mainly by reusing existing platforms with industrial big-data-related enhancements, e.g., data access and data conversion.
On top of the container orchestration layer is the model management layer: this layer consists of four core sets of components:
Data accessors provide the data access, translation and mapping service for industrial cyber-physical systems. A data accessor is also responsible for receiving real-time data from the gateway, translating network protocols, issuing instructions, and forwarding the processed data to a historical database and a real-time database, respectively. Its role also includes isolating the traffic between the Distributed Control System (DCS) and the big-data platform. It can be implemented with an existing data-stream-processing framework, e.g., Kafka, RabbitMQ or RocketMQ, together with an existing industrial network gateway.
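The fan-out role of a data accessor might look as follows. The stdlib deques stand in for the historical and real-time databases, and `translate` is a hypothetical protocol-translation step; a real deployment would place a framework such as Kafka behind these sinks.

```python
# Sketch of a data accessor: incoming raw gateway records are translated
# once, then forwarded to both a historical store (for trainers) and a
# real-time store (for modelets). The deques are stand-ins for databases.
from collections import deque

historical_db = deque()   # stand-in for the historical database
realtime_db = deque()     # stand-in for the real-time database

def translate(raw):
    """Hypothetical protocol translation: map a raw record to a dict."""
    tag, value = raw.split("=")
    return {"tag": tag, "value": float(value)}

def accept(raw):
    record = translate(raw)
    historical_db.append(record)   # full history, consumed by trainers
    realtime_db.append(record)     # latest values, consumed by modelets

accept("temp=21.5")
```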
Trainers generate models from historical data. Due to model evolution, trainers have to be executed periodically or triggered by certain events. Taking historical data together with recently updated data as input, a trainer emits a newly learned model. Trainers generally need support from existing machine-learning platforms and normally demand a large amount of computing resources. Thus, it is of crucial importance to effectively orchestrate trainers' executions.
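A minimal sketch of a trainer's re-execution, assuming the pre-defined pipeline is exposed as a single callable and versions are plain counters (both assumptions made for illustration):

```python
# Sketch of autonomous re-training: the same pre-defined pipeline is
# re-executed over a sliding window of recent samples, yielding a model
# artifact with a new version number. The "pipeline" below is a placeholder
# for the real feature-selection / construction / training sequence.
import time

class Trainer:
    def __init__(self, pipeline, window):
        self.pipeline = pipeline   # pre-defined training function
        self.window = window       # how many recent samples to use
        self.version = 0

    def retrain(self, history):
        """Re-run the same pipeline on the most recent data."""
        recent = history[-self.window:]
        model = self.pipeline(recent)
        self.version += 1
        return {"version": self.version, "model": model,
                "trained_at": time.time()}

# Example: the placeholder "pipeline" just averages the window.
trainer = Trainer(pipeline=lambda data: sum(data) / len(data), window=3)
artifact = trainer.retrain([1.0, 2.0, 3.0, 4.0, 5.0])
```

In the architecture, such a `retrain` call would be scheduled periodically or fired by an event, and the versioned artifact handed to the orchestrator.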
Modelets provide the model-serving service in production environments. Each time online data are received, a modelet invokes its enclosed model to make predictions or deduce adaptation actions. In addition to the model-invocation task, a modelet is also responsible for loading, starting, stopping and unloading the model. Major design concerns for modelets are the strict response-time requirements and support for scalability.
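The seamless-updating concern can be illustrated by a modelet that swaps its model reference under a lock: in-flight requests finish on the old model and no request ever observes a half-loaded one. Class and method names are invented for illustration.

```python
# Sketch of a modelet supporting uninterrupted model replacement: the
# served model is held behind a lock-protected reference that can be
# swapped atomically while prediction requests keep flowing.
import threading

class Modelet:
    def __init__(self, model):
        self._model = model
        self._lock = threading.Lock()

    def predict(self, x):
        with self._lock:
            model = self._model   # take a stable reference
        return model(x)           # invoke outside the lock, keeping latency low

    def swap(self, new_model):
        """Replace the served model without stopping the service."""
        with self._lock:
            self._model = new_model

m = Modelet(lambda x: x + 1)
before = m.predict(1)      # answered by the old model
m.swap(lambda x: x * 10)   # seamless update to a new version
after = m.predict(1)       # answered by the new model
```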
Orchestrators manage the execution of different trainers to update their generated models. As can be seen from Figure 1, the orchestrator plays a core role in bridging the gap between trainers and modelets. It triggers the execution of trainers to generate up-to-date models. Another major function of the orchestrator is to determine the quality of generated models and to select an appropriate version to switch to. The orchestrator needs to seamlessly deploy the selected model to its corresponding modelet without interrupting normal operation.
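Under these responsibilities, the orchestrator's control loop reduces to: trigger a trainer, score the candidate model, and deploy it only if it beats the currently served version. The `train`/`score`/`deploy` interfaces below are hypothetical simplifications of the components described above.

```python
# Sketch of the orchestrator's control loop (all interfaces invented
# for illustration): train a candidate, judge its quality, and switch
# the modelet to it only when it improves on the served version.

class Orchestrator:
    def __init__(self, train, score, deploy):
        self.train = train     # produces a candidate model
        self.score = score     # quality metric (higher is better)
        self.deploy = deploy   # pushes a model into the modelet
        self.best_score = float("-inf")

    def step(self, data):
        candidate = self.train(data)
        quality = self.score(candidate)
        if quality > self.best_score:   # select the better version
            self.best_score = quality
            self.deploy(candidate)
            return True                 # model was switched
        return False                    # keep the current model

deployed = []
orch = Orchestrator(train=lambda d: sum(d) / len(d),
                    score=lambda m: -abs(m - 3.0),  # closeness to a target
                    deploy=deployed.append)
first = orch.step([2.0, 4.0])    # candidate 3.0 is better, so it is deployed
second = orch.step([0.0, 1.0])   # candidate 0.5 scores worse and is rejected
```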
As one platform might support a dynamic set of modelets, it is important to support the dynamicity and scalability of those components. The micro-service pattern is therefore adopted for the architecture implementation. A microservice architecture decomposes a large application system into a set of independent services; the services can be developed and deployed independently and realize loosely coupled applications through modular composition [33]. The other important feature is support for multitenancy, a term used by Google [12] to describe the situation in which multiple model servers and trainers are deployed in a single server instance due to regulation and cost limitations. The CPU- and memory-intensive trainers must work alongside the modelets, which have stringent time limits; the resulting cross-interference is challenging to avoid. To enhance isolation between operations, the platform provides a set of configuration features for resource isolation based on Docker's resource isolation features. The architecture directly reuses the Docker infrastructure services for load balancing and health checking. Dynamic resource allocation is implemented with the resource scheduling mechanism provided by Rancher.
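As one possible concretization of such a resource-isolation configuration, a Docker Compose-style file could cap the trainer's resources so that it cannot starve a latency-sensitive modelet; the service and image names below are invented, and the limits are illustrative only.

```yaml
# Illustrative Compose fragment: bound the resource-hungry trainer so
# the latency-sensitive modelet keeps its share (names are hypothetical).
services:
  trainer:
    image: cms/trainer      # hypothetical trainer image
    cpus: "2.0"             # at most two CPU cores
    mem_limit: 4g           # hard memory ceiling
  modelet:
    image: cms/modelet      # hypothetical modelet image
    cpus: "1.0"
    mem_limit: 1g
```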