Data-Driven Digital Twins for Technical Building Services Operation in Factories: A Cooling Tower Case Study

Cyber-physical production systems (CPPS) and digital twins (DT) with a data-driven core enable retrospective analyses of acquired data to achieve a pervasive system understanding and can further support prospective operational management in production systems. Cost pressure and environmental compliance requirements sensitize facility operators to energy and resource efficiency over the whole life cycle while reliability requirements must still be met. In manufacturing systems, technical building services (TBS) such as cooling towers (CT) are drivers of resource demands while they fulfil a vital mission to keep production running. Data-driven approaches, such as data mining (DM), help to support operators in their daily business. Within this paper, the development of a data-driven DT for TBS operation is presented and applied to an industrial CT case study located in Germany. It aims to improve system understanding and performance prediction as essentials for successful operational management. The approach comprises seven consecutive steps in a broadly applicable workflow based on the CRISP-DM paradigm. Step by step, the workflow is explained, including a tailored data pre-processing, transformation and aggregation as well as feature selection procedure. Interim results are presented graphically in portfolio diagrams, heat maps and Sankey diagrams, amongst others, to enhance the intuitive understanding of the procedure. The comparative evaluation of selected DM algorithms confirms a high prediction accuracy for cooling capacity (R² = 0.96) using polynomial regression and for electric power demand (R² = 0.99) using linear regression. The results are evaluated graphically, and the transfer into industrial practice is discussed conclusively.


Introduction
The digital factory and cyber-physical production systems (CPPS) have become synonyms for future production systems, where virtual depictions of the factory, better known as digital twins (DT), are used to predict and continuously improve the production performance [1]. An innovation push has tremendously reduced the costs for sensors and measurement equipment. Continuous data acquisition and high-performance computational hardware have become affordable for operational management, helping to process data in near real time and to achieve energy and resource transparency in factories [2,3]. Consequently, goal-oriented data processing and the extraction of knowledge from data to support decision makers are growing tasks for current and future engineers. In that regard, data mining (DM) is developing into a mainstream of interdisciplinary data-driven research.
First described in [4,5], DM-related approaches have been applied numerous times in research and practice. These include (but are not limited to) personalized product recommendations and shopping cart analyses in e-commerce and retail, as well as expertise finding systems and diagnostic tools for service providers. Regarding CPPS, DM is an urgent field of interest for both data scientists and operators. DM approaches can help to anticipate when maintenance services should be performed on machines [6][7][8][9], improve the modeling of complex production systems and enable accurate forecasts of energy consumption [10][11][12]. Moreover, data-driven approaches can support applied remanufacturing activities in circular economies [13].
Within the last decades, energy and resource efficiency has become an important topic for manufacturing companies all over the world aiming to reduce environmental pollution and carbon emissions [14,15]. In manufacturing systems, significant shares of energy and resource demands are usually related to production machines and technical building services (TBS) that are interconnected by physical flows as well as data flows [16]. In particular, TBS exhibit crucial improvement potentials due to their cross-linking within the manufacturing system. Their main purpose is the conversion of final energy such as electricity or natural gas into useful energy forms like compressed air, heat or cooling water as well as the supply of connected machines and processes in the factory building [17]. As energy conversion is related to dissipations such as waste heat or noise emissions, TBS systems are identified as typical main energy consumers in manufacturing systems [18,19].
As a vital element of industrial TBS, cooling towers (CTs) are the prevalent technology to deal with occurring cooling demands from machines, processes and control units in the manufacturing system. Operators of CT systems aim to provide a reliable and economically feasible supply of cooling water. Thereby, they must consider several requirements such as local environmental compliances, production scheduling and local climate conditions [20][21][22][23]. The control of such a complex system requires a high degree of automation and a multi-sensorial network distributed throughout the CT system. Consequently, the extensive acquisition and storage of operational data is already state of the art. However, this data is often used for monitoring purposes only. Transforming it with adequate methods and tools could help to support operators and decision makers in their challenging daily business. Statistical analyses of historical data can also be used to assess operation strategies regarding improvement potentials based on long-term experiences. Seasonal effects and unique events affecting CT operation can be identified and, consequently, improvement measures can be derived. Furthermore, additional information about the current and future system status is the basis for predictive maintenance and a proactive operation.
The state of research regarding data-driven approaches for CT design and operation proposes artificial neural networks (ANN) and clustering as favored algorithms in this field (compare Section 2.3). However, as most approaches are application-specific, general recommendations to improve CT operation have hardly been formulated so far. This leads to an urgent need for holistic approaches addressing both pervasive system analyses and prediction of relevant aspects for CT operation. Moreover, the beneficial deployment of DT should be clearly described to enhance the transfer from concept into industrial practice.
Within this paper, the development of a data-driven DT for TBS operation applied to an industrial CT system is presented. The approach was developed on an industrial CT system in a manufacturing company located in Germany and implemented as an integrated approach with an automated workflow to increase the usability in practice. It aims to uncover interrelations of operational business and the technical system and allows different operational strategies to be assessed. Furthermore, it helps to forecast the CT system performance by predicting key performance indicators (KPIs) like electric power demand and cooling capacity. The approach comprises seven consecutive steps in a broadly applicable workflow that is based on the CRISP-DM paradigm. Initially, background on industrial cooling towers and data-driven approaches for DTs in production systems is presented in Section 2. Subsequently, the underlying case study is introduced and business issues are discussed in Section 3.1. A custom data processing procedure featuring data aggregation, outlier filtering and data transformation is explained stepwise in Section 3.2. A correlation analysis is further used to identify systematic interrelations within the dataset. Subsequently in Section 3.3, several DM algorithms are selected and examined for the DM task to predict performance-related KPIs. All DM algorithms are comparatively evaluated in terms of needed computational time and prediction accuracy. Finally, a conclusion and outlook are presented in Section 4.

Industrial Cooling Towers
Industrial CTs in production systems are part of the TBS that deal with occurring cooling demands from production machines by disposing waste heat to the environment. Figure 1 illustrates the main components and functions of a common industrial CT. The cooling water circulates between the production machines and the CT. Starting from the production machines, the heated water is supplied to the CT. Here, the water is sprayed as fine droplets into the CT, rinsing down along fillers while reducing its temperature. In counter flow direction, the ambient air flows into the CT. For industrial applications, fans are installed to enhance the air flow. The air saturates with evaporating water and exits on top; hence, the water circuit needs to be refilled with fresh water regularly. Finally, the cooled water is pumped back to the production machines. The mathematical relations between the mass flows and temperatures of water and air can be described with Merkel's theorem [24]:

ṁ_air · (h_air,out(T_air,out, φ_air,out) − h_air,in(T_air,in, φ_air,in)) = ṁ_water · c_water · (T_water,out − T_water,in)    (1)

The cooling demand of the production, i.e., the right side of the equation, depends on the temperature difference of inlet water (T_water,in) and outlet water (T_water,out), its mass flow (ṁ_water) as well as its heat capacity (c_water). The left side of the equation characterizes the cooling capacity of the ambient air. It features the absorbency for thermal energy and evaporating water based on ambient temperature and humidity. The equation comprises the air mass flow (ṁ_air) as well as the specific enthalpies of inflowing air (h_air,in) and outflowing air (h_air,out), which depend on air temperature (T_air) and relative humidity (φ_air), respectively. Consequently, the operation of CTs is highly impacted by the environmental conditions of the location.
Warm and humid climate impairs the energy and mass transfer leading to increased air demand and fan operation followed by increased energy demand [25].

CTs are tailored constructions with individual specification and size. From small roof-top units for buildings over compact industrial forced-draft CTs up to immense natural-draft CTs in power plants, CTs are applicable across sectors and numerous case studies [20,21,26]. The individual purpose determines design, size and, basically, the required cooling capacity provided by the CT. Achieving the currently required cooling capacity is one main objective of operational CT management. The main operational control levers are the installed fans and pumps, which can immediately adjust air and water flows. As these electric components are also the main energy consumers of the CT system, they should be considered for energy efficiency issues [27]. A further important KPI for designing and monitoring a CT is the energy efficiency ratio (EER), an equivalent to the coefficient of performance (COP) for heating units [28,29]. Equation (2) describes it as the ratio of the cooling capacity (Q_CT) as the desired output of the CT and the electric power demand (P_CT,electric):

EER = Q_CT / P_CT,electric    (2)
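The two balances above can be put into a few lines of code. The following is a minimal illustrative sketch, not the study's implementation; all numeric values are hypothetical, and the air-side enthalpies would in practice be derived from measured air temperature and relative humidity:

```python
# Sketch of Equations (1) and (2) with hypothetical example values.

C_WATER = 4.186  # specific heat capacity of water, kJ/(kg K)

def cooling_capacity_kw(m_dot_water, t_in, t_out):
    """Water side of Eq. (1): Q = m_water * c_water * dT.

    t_in is the heated water entering the CT, t_out the cooled return,
    so the result is positive when heat is rejected.
    """
    return m_dot_water * C_WATER * (t_in - t_out)

def air_side_kw(m_dot_air, h_in, h_out):
    """Air side of Eq. (1): enthalpy gain of the air stream in kW."""
    return m_dot_air * (h_out - h_in)

def eer(q_ct_kw, p_electric_kw):
    """Equation (2): EER = Q_CT / P_CT,electric."""
    return q_ct_kw / p_electric_kw

# hypothetical operating point: 10 kg/s of water cooled from 30 °C to 25 °C
q_ct = cooling_capacity_kw(10.0, 30.0, 25.0)  # about 209.3 kW
print(round(q_ct, 1), round(eer(q_ct, 30.0), 2))
```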

Data-Driven Approaches to Create Digital Twins in Factories
A transformation towards digitalization, internet of things (IOT) and industry 4.0 can be observed in most sectors of industry. This includes the establishment of extensive data acquisition systems by installing sensor networks which provide information about machine conditions, progress of production, individual qualities of produced goods etc. [30,31]. Data-driven approaches such as big data, data mining (DM) and visual analytics build upon this data to reveal hidden interrelations within the production system and to forecast vital performance indicators [32]. Novel approaches comprising IOT, DT and CPPS have been introduced for almost every aspect of factories [33,34]. The paradigm of DT comprises detailed virtual depictions of physical systems, their structures and dynamic interaction mechanisms to provide accurate information for prognostics and health management [35]. Prevalent objectives are, amongst others, the improvement of machine tool life cycles [36] and production performance evaluations [37].
The general concept to create a DT of a physical system comprises the definition of requirements, the model creation process and its deployment, as illustrated in Figure 2. In particular, for the creation phase, various data-driven approaches are available, including statistics, DM and machine learning (ML). To define requirements for a DT, an in-depth inventory analysis of the physical system should be applied. The deployment of the DT can then encompass numerous tools and methods such as visual analytics, forecasts and predictive maintenance applications. One of the most comprehensive data-driven approaches in industrial practice is the Cross Industry Standard Process for Data Mining (CRISP-DM), which was first introduced by Wirth and Hipp [38] and further detailed in [39,40]. It comprises six consecutive steps: The initial step, business understanding, focuses on understanding the project objectives as well as requirements, assumptions and constraints. Data understanding starts with an initial data acquisition and proceeds with its exploration to gain first insights and to detect data quality shortages. Data preparation encompasses all activities to build a final dataset from raw data. It includes tasks to clean, format and merge data in order to derive the desired attributes for modeling tools. For the modeling step, several DM and ML algorithms are available: supervised, predictive, unsupervised or descriptive algorithms [41][42][43][44]. Supervised ML algorithms include regression approaches (e.g., linear, polynomial regression), classification approaches (e.g., decision trees, support vector machines) or probabilistic algorithms (e.g., Naive Bayes, ANN). Prediction models are derived from existing data and applied to new data, e.g., to derive expectation values for the electric power demand of a technical system.
In contrast, descriptive models are developed with algorithms of unsupervised learning such as clustering and association rules, e.g., for pattern recognition in electric load profiles [45,46]. As some algorithms have specific requirements regarding the input data format, an iteration with previous steps is often necessary. Modeling results are thoroughly evaluated to make sure the model properly fulfils the business objectives. In the deployment step, knowledge gained from the DT needs to be organized and presented to relevant stakeholders in a valuable form.

Data-Driven Approaches for Cooling Tower Systems
In recently published studies related to DM and ML for CT systems, two main application fields can be identified: the first is related to buildings such as office buildings and urban spaces [41,47], and the second focuses on industrial CT systems located in factories. In both, DM and ML are applied to forecast energy demand and cooling capacity, in some cases accompanied by an assessment of environmental conditions. Across the studies, various DM and ML algorithms as well as statistical approaches have been applied. Amongst others, ANN is identified as one of the most commonly applied algorithms in the field of CT management [48][49][50][51][52]. One main advantage of ANN is the ability to represent systematic and non-linear interrelationships, which could otherwise only be determined in complex experiments [53][54][55][56]. Furthermore, clustering is used to detect patterns and recurring sequences in data from CT systems and TBS, such as typical power demand profiles and efficient operating states [57][58][59]. For example, Li et al. identified efficient operating states and control strategies for up to four connected CTs using clustering [60]. Wang et al. investigated the influence of fan speed and ambient air conditions on energy demand with a clustering approach [61]. However, as individual DM algorithms have both strengths and limitations, the combined application of two or more algorithms in an ensemble model is recommended in order to achieve optimal results and reduce the influence of missing values [51,62,63]. Table 1 summarizes recent studies categorized by the data-driven algorithms used, the applied case study and the analyzed target KPIs. It further gives a brief insight into the specific objectives and used data sets.

Based on the state of research, it can be concluded that several data-driven algorithms have been successfully applied to CT design and operation. In particular, ANN and clustering are the preferred algorithms in this field. However, as most approaches are application-specific, general recommendations to improve CT operation have hardly been formulated so far, and the most promising approach remains unclear. Furthermore, the transfer of valuable findings into a DT that is deployable in industrial practice is a largely unexplored field. Addressing these research demands, the presented approach aims to describe the development of a data-driven DT of an industrial CT. Thereby, the overall procedure tries to preserve a generic nature in order to foster a transfer to other types of industrial TBS. The development process will be described step by step, beginning with the gathered data and closing with a final evaluation of the best-fitting DM algorithm. Thereby, occurring challenges in data understanding and processing are discussed.

A Workflow to Create Digital Twins for Technical Building Services Operation
In the following, the approach to establish a data-driven DT is presented. Its fundamental structure is based on the CRISP-DM procedure detailed in [39]. Figure 3 illustrates the proposed workflow and its main elements. It starts with a brief technical analysis of the CT system and a business analysis in the first phase, followed by the DT creation phase that contains the tasks data understanding, data preparation and modeling. In this phase, seven consecutive steps are conducted, starting with data selection (I) and outlier filtering (II), followed by data aggregation (III) and transformation (IV). In feature selection (V), hyperparameter assessment (VI) and data mining (VII), several DM algorithms are applied within the procedure. Here, requirements of specific algorithms are taken into account and emerging characteristics are highlighted. Finally, in the third phase, DM results are comparatively evaluated and options for deployment in the daily practice of CT management are discussed.

The initial phase is related to business understanding (1) of the considered CT system. An inventory analysis is carried out, comprising the given structure, measurands and control logics. Subsequently, the CT KPIs electric power demand and cooling capacity are analyzed regarding related influences from the production system and the environment. Characteristics of the CT system are identified and assumptions for the DM procedure are derived. The second phase encompasses the three CRISP-DM steps data understanding, data preparation and modeling (2) and extends them to a seven-step workflow. Since data must be in an appropriate form to apply DM algorithms, the first four work steps are used for general data processing. The subsequent steps are then applied individually for every single DM algorithm.
First, in the step of data selection (I), relevant measurands of the CT system, i.e., variables and measured data, are chosen and analyzed regarding potential interdependencies (e.g., by correlation analyses). Within outlier filtering (II), the selected variables are processed by filter techniques. Based on given thresholds and requirements from the physical system, outliers in the dataset are identified and cleared. Subsequently, a data aggregation (III) is performed to compress large data amounts while preserving valuable information and data characteristics. In the step of data transformation (IV), variables are transformed into their final form. The target KPIs (cooling capacity, electric power demand) are calculated based on variables and system-specific constants. The cooling capacity of the CT system is calculated according to Equation (1). To consider both regressive and classifying algorithms, continuous values are discretized and assigned to classes. Equation (3) exemplifies this procedure for the electric power demand, defining intervals with a range of 10 kW:

class_i : P_CT,electric ∈ [10·i kW, 10·(i + 1) kW), i = 0, 1, 2, …    (3)

As DM models should provide accurate predictions within appropriate computational times, the number of variables in the database is assessed in the next step. In an automated procedure, the feature analysis (V) aims to identify the most relevant variables for each algorithm. The impact of each variable is evaluated in terms of the resulting prediction accuracy by calculating mean squared errors (MSE). For this purpose, the backward feature elimination method was chosen, where the used variables are reduced in an iterating program and prediction errors are calculated in every loop. The variable with the least impact on reducing the forecast error is removed in every iteration, i.e., the process starts with all variables and ends with one variable.
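The 10 kW discretization and the backward feature elimination loop can be sketched in a few lines. The following is a simplified stand-in, not the study's implementation: the feature names are invented, and `mse_of` is a placeholder for training a DM algorithm on a feature subset and returning its MSE:

```python
# Sketch of two workflow steps: 10 kW class assignment (step IV) and
# backward feature elimination (step V). `mse_of` stands in for a full
# train-and-score run of a DM algorithm on one feature subset.

def power_class(p_kw, width=10.0):
    """Assign a continuous power value to a discrete 10 kW interval."""
    lo = int(p_kw // width) * int(width)
    return f"[{lo}, {lo + int(width)}) kW"

def backward_elimination(features, mse_of):
    """Iteratively drop the feature whose removal hurts accuracy least.

    Returns the feature subsets and their MSE after each elimination,
    starting with all features and ending with a single feature.
    """
    remaining = list(features)
    history = [(tuple(remaining), mse_of(remaining))]
    while len(remaining) > 1:
        # candidate subsets, each with one feature removed
        candidates = [remaining[:i] + remaining[i + 1:]
                      for i in range(len(remaining))]
        remaining = min(candidates, key=mse_of)  # least accuracy loss
        history.append((tuple(remaining), mse_of(remaining)))
    return history
```

A toy error function (e.g., a fixed "importance" per invented feature) is enough to exercise the loop; in the study, the error comes from the respective DM algorithm's predictions.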
This dimension reduction approach analyses which variables are necessary for an accurate prediction and how each variable impacts the prediction result. Further, a hyperparameter assessment (VI) is performed for each DM algorithm. Hyperparameters are specific model parameters of DM algorithms that need to be set before the learning process begins, e.g., the tree depth for decision trees or the number of neurons for ANN. Several studies recommend experimental or rule-based methods to determine adequate hyperparameters [64,65]. In this study, a rule-based method is applied, including several sub-steps like data normalization, partitioning and algorithm training. The model is trained within a loop for each possible hyperparameter combination, followed by an evaluation of the prediction accuracy. To achieve a high reliability of results, a cross validation is integrated into the loop. Results are then mapped for a graphical evaluation. Subsequently, data mining (VII) is processed with the selected DM algorithms to predict cooling capacity and electric power demand. As various algorithms are basically suitable, an assessment of five algorithms predicting cooling capacity and nine algorithms predicting electric power demand is carried out (see Figure 4). To cope with the weaknesses of single algorithm characteristics, several existing studies propose the combination of two or more algorithms in an ensemble model [51,62,63]. Therefore, a gradient boosted trees (GBT) algorithm was coupled with a multilayer perceptron neural network (MLP) into an ensemble model. Finally, the evaluation phase (3) is based on statistical evaluations regarding the coefficient of determination (R²) and the mean absolute error (MAE). By means of graphical analyses, results are related to the computational time, which is an important criterion for the applicability in daily practice. Finally, the possible deployment in industrial CT management is discussed.
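The hyperparameter loop with integrated cross validation can be sketched as follows. This is a simplified illustration, not the study's rule-based procedure: the parameter names, grid values and scoring function are invented stand-ins for a real train/validate run of a DM algorithm:

```python
# Simplified hyperparameter assessment: every parameter combination is
# scored inside a k-fold cross-validation loop; the combination with
# the lowest mean validation error wins.
from itertools import product
from statistics import mean

def k_fold_indices(n, k):
    """Yield (train, validation) index lists for k folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

def grid_search(param_grid, score_fold, n_samples=20, k=5):
    """Return (best params, mean CV error) over all combinations.

    `score_fold(params, train, val)` must train on the train indices
    and return a validation error for the given hyperparameters.
    """
    results = []
    for combo in product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), combo))
        errs = [score_fold(params, tr, va)
                for tr, va in k_fold_indices(n_samples, k)]
        results.append((params, mean(errs)))
    return min(results, key=lambda r: r[1])
```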
The presented workflow was successfully applied to an industrial CT system located in a German automotive plant. In the following, the application of each process phase is described and exemplary results are presented. The developed methods are prototypically implemented in the software tools KNIME® and Microsoft Excel©, which are, amongst others, typical tools to apply DM approaches [41,66].
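The two evaluation metrics used in phase 3 follow their standard definitions and can be stated compactly; the data values in the test below are made up for the demonstration:

```python
# Standard definitions of the two evaluation metrics from phase (3):
# mean absolute error (MAE) and coefficient of determination (R²).
from statistics import mean

def mae(actual, predicted):
    """Mean absolute error between observations and predictions."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mu = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mu) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```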

Business Understanding (Phase 1)
Starting with an analysis of the system requirements and constraints from a business perspective, two main aspects should be taken into account: On the one hand, the technical perspective defines the basis for data analysis. It is defined by the overall structure of the CT system with its technical properties such as installed technology types and number of devices as well as the available measurands and control logics. On the other hand, a systematic analysis of periodic and unique events during the CT operation is a vital part of the business understanding. It helps to identify typical operational characteristics of the CT system and determines requirements for the DT approach.

Technical Analysis of the Cooling Tower System
The considered industrial CT system is part of the TBS in a manufacturing company located in Germany. The CT system is used to dissipate heat from four nearby heat exchangers. It comprises three open-circuit CTs (CT 1, CT 2, CT 3) illustrated in Figure 5. All CTs operate with water as coolant and follow a forced-draft air flow design, where the natural draft is supported by fans. While CT 1 and CT 2 have fans with static speed (i.e., without speed control), the fan of CT 3 supports a controllable speed range. Forward flow and backward flow pumps provide the circulation of water in the CT system. Each pump group comprises a static pump, a redundant standby pump as backup, and one speed-controlled pump. The flow and return circuits each have a tank to maintain the required amount of water and the specified pressure level. CT fans are switched on and off following a hysteresis based on water flow temperatures. Three lower and three higher thresholds thereby define the fan operation. The speed-controlled fan additionally regulates its speed in a given range proportional to the flow temperatures.

J. Manuf. Mater. Process. 2020, 4, x FOR PEER REVIEW

For data acquisition purposes, an existing SCADA (Supervisory Control and Data Acquisition) system of the plant is used. It captures valuable measurands for live visualization and control, like water temperatures, electrical conductivity, water flows and pressure levels. The continuously collected data is stored within a MySQL database. A constant frequency of one full record (consisting of 32 values) every 10 s was chosen. More information about the data acquisition concept can be found in [20].
A constant frequency of one full record (consisting of 32 values) each 10 s was chosen. More information about the data acquisition concept can be found in [20].

System and Business Analysis
With focus on the most relevant KPIs for CT operation, a detailed system and business analysis considering the electric power demand, cooling capacity and energy efficiency ratio (EER) of the CT system is introduced. Thereby, the impact of external influences, such as seasonal weather conditions and production capacity, on the CT system performance and scheduling is analyzed.
As mentioned before, the cooling demand from the production system is a main parameter for CT operation and a driver for energy demand. Focusing on this aspect, the weekly electric power demand of the CT system for one year is illustrated as a heatmap in Figure 6, classified by weekdays. The color indicates the amount of demanded energy from low (bright blue) to high (dark blue). In general, the power demand during weekdays (Monday to Friday) is higher compared to weekends. Within a single week, no recurring specific peak load can be identified. However, comparing all weeks of the year, certain periods of high and low electric power demand can be identified. High power demand particularly occurs between weeks 25 and 35 as well as between weeks 45 and 50. Typically, these periods fall within high production seasons of the manufacturing system, which induce higher cooling demands. Low energy demand periods between weeks 35 and 45 overlap with the typical holiday season during Mid-Europe's summer time, which is related to reduced production capacities. As a result, it can be concluded that the scheduling of the production system influences operation states and thus electric power demands of the CT system.
As a further aspect, the EER of the CT system and its dynamics during the year is of special interest. Originally, the EER is primarily used for design purposes, considering only a small number of defined typical temperature examples from the location [29]. However, understanding yearly EER dynamics could help to continuously adjust operational tasks and to counteract performance gaps, if necessary. To provide an overview, Figure 7a depicts a boxplot of the monthly EER range for one year with an aggregated daily average. From October to May, the EER ranges between 5.5 and 7.5, while the lower and upper whiskers reach an EER of 2.5 and 10, respectively. During the summer months June to September, the EER is significantly lower, at approximately 3.5 to 6.5. With a minimum of 1.5 and a maximum of 7.5, the whisker range is comparably low. On the one hand, the collapse of the EER could be explained by the previously discussed holiday season during summer. On the other hand, ambient temperature and humidity impact the CT performance (compare Equation (1)). This issue is further analyzed in Figure 7b, which puts the EER in relation to the ambient temperature with aggregated hourly averages. The respective months are identifiable by coloring. Typically, the CT operates in a temperature range between 2 and 20 °C, which corresponds to the average temperature profile in Mid-Europe. During late autumn and winter (November until March), the EER is significantly higher compared to the summer months (May until July). Generally, it can be stated that the EER decreases with rising ambient temperatures. This is in line with the relations expressed in Equation (1) and the findings of [25], indicating that higher ambient temperatures negatively impact the energy and mass transfer in the CT, resulting in a lower EER. Additionally, the illustrations show the magnitude and the range of seasonal impacts on the EER dynamics.
As a first conclusion, it can be stated that particularly two main aspects impact CT performance and EER: the workload resulting from the cooling demand of the production system and seasonally changing environmental conditions. However, these influences could superimpose each other and distort conclusions. In order to decouple these effects, cooling capacity and electric power demand are compared using a portfolio analysis. Figure 8a illustrates the general method to perform a portfolio analysis, inspired by the energy portfolio from Thiede [16], to evaluate the energy efficiency ratio (EER). Figure 8b illustrates the extracted data for one operation year (hourly aggregation). To integrate the time perspective, a color code indicates the respective month of the year. The average values of electric power demand (57.7 kW) and cooling capacity (371.5 kW) define the four portfolio categories:

• High electric power demand, low cooling capacity (category I): The EER during these times is low. For the presented use case, such inefficiencies occur intermittently in almost every month of the year, but particularly frequently during May, June and July.
• Low electric power demand, low cooling capacity (category II): The EER is in an acceptable range, whereas the workload of the CT system is comparatively low. On the one hand, these stages are mainly detected during the winter season, when low ambient air temperatures increase the natural cooling effect (compare Equation (1)). This means the CT system already achieves a sufficient cooling capacity with relatively low additional power demand. On the other hand, this portfolio category includes days in August and May, which are typically related to the holiday season and thus reduced cooling demand from the production system.
• High electric power demand, high cooling capacity (category III): High workload is linked to high power demands, yet acceptable EER ranges. High workload occurs particularly during the warm summer season, e.g., June and July. Furthermore, October and November show the overall highest workload of the year, which could indicate high production capacities.
• Low electric power demand, high cooling capacity (category IV): With high EER, these states are the most desirable for CT system operation. However, there are only few samples in April and May in this category.
Figure 8. (a) Portfolio analysis to characterize the energy efficiency ratio (EER) of the CT, inspired by the energy portfolio in [16]; (b) application of the portfolio analysis (hourly data, coloring indicates related operation month).
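The portfolio classification described above can be reproduced with a few lines of code. The following is a minimal sketch (not the authors' implementation), assuming pandas, hypothetical column names `power_kw` and `cooling_kw`, and the reported yearly averages as category thresholds:

```python
import pandas as pd

def portfolio_category(power_kw, cooling_kw, power_mean, cooling_mean):
    """Assign the four portfolio categories from Figure 8:
    I: high power, low cooling (inefficient); II: low power, low cooling;
    III: high power, high cooling; IV: low power, high cooling (desirable)."""
    high_p = power_kw >= power_mean
    high_c = cooling_kw >= cooling_mean
    if high_p and not high_c:
        return "I"
    if not high_p and not high_c:
        return "II"
    if high_p and high_c:
        return "III"
    return "IV"

# Yearly averages reported for the case study
POWER_MEAN, COOLING_MEAN = 57.7, 371.5  # kW

# Illustrative hourly aggregates
hourly = pd.DataFrame({
    "power_kw":   [80.0, 40.0, 90.0, 30.0],
    "cooling_kw": [200.0, 150.0, 500.0, 450.0],
})
hourly["category"] = [
    portfolio_category(p, c, POWER_MEAN, COOLING_MEAN)
    for p, c in zip(hourly["power_kw"], hourly["cooling_kw"])
]
print(hourly["category"].tolist())  # ['I', 'II', 'III', 'IV']
```

Applied to a full year of hourly data, the resulting category column directly yields the monthly occupancy of each portfolio quadrant.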

Data Selection and Outlier Filtering
For this case study, operational data of one full year (August 2016 to July 2017) is taken into account, gathered in ten-second intervals. If all 32 measurands of the CT system are considered (compare Figure 5), the resulting database comprises over 2.8 billion rows. The first crucial step of DM is to get a general understanding of the database and to identify interdependencies [67]. Statistical and visualization techniques such as correlation matrices, box plots and time series diagrams provide important insights into data characteristics like trends and seasonality, and they allow outliers to be detected. In order to filter outliers from the data set, a ruleset is derived exploratively here based on the electric power demand, cooling capacity and water volume flow. Based on these three variables, the operational system status of the CT can be identified, i.e., normal operation mode can be distinguished from single events such as shutdown or maintenance. If single data points significantly deviate from the median value, they are removed as outliers (compare [68]). For example, if a value is more than 40% above the median of the last three hours, it is removed. Furthermore, zero values are excluded from the dataset as they indicate shutdowns. Figure 9 illustrates the average weekly cooling capacities and electric power demands for every month over the year before outlier filtering (Figure 9a) and after outlier filtering (Figure 9b). After data filtering, the variance is significantly lower and the data range is as expected according to the CT system design.
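The described ruleset, dropping zero values and samples more than 40% above the median of the preceding three hours, can be sketched with a pandas rolling median. The window handling and series name are illustrative assumptions, not the paper's exact code:

```python
import pandas as pd

def filter_outliers(series: pd.Series, window: str = "3h",
                    threshold: float = 0.40) -> pd.Series:
    """Drop zero values (shutdowns) and samples deviating more than
    `threshold` above the rolling median of the preceding window."""
    s = series[series > 0]                      # zero values indicate shutdowns
    med = s.rolling(window, closed="left").median()  # median of last 3 h, excl. current
    keep = s <= med * (1 + threshold)
    keep = keep | med.isna()                    # keep samples until the window fills
    return s[keep]

# Illustrative hourly power readings with one shutdown and one spike
idx = pd.date_range("2016-08-01", periods=6, freq="h")
power = pd.Series([50.0, 52.0, 51.0, 0.0, 120.0, 53.0], index=idx)
clean = filter_outliers(power)
print(clean.tolist())  # [50.0, 52.0, 51.0, 53.0]
```

The zero at 03:00 is removed as a shutdown, and the 120 kW spike is rejected because it exceeds 1.4 times the rolling median of the preceding three hours.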
Figure 9. Box plots of cooling capacity and electric power demand: (a) before outlier filtering; (b) after outlier filtering.
Subsequently, an analysis of the linear correlations provides valuable insights into data interdependencies. The resulting matrix of Pearson correlation coefficients (PCC) in Figure 10 indicates negative correlations in red and positive correlations in green. The PCC ranges from −1 to 1. A value of 1 implies a linear positive relationship between X and Y, while a value of −1 implies a linear negative relationship. A value of 0 implies that there is no linear correlation between the variables [69]. As highly intensive colors relate to high PCC values and thus a high linear correlation between variables, the most relevant variables can easily be identified visually. These include environmental conditions, i.e., ambient air temperature and relative humidity, the temperatures of the warm and cold water storages, seasonal impacts such as the activity of heat sources and connected pumping stations, as well as time indicators such as weekdays and hours of the day. In order to improve information density, available variables are consolidated and aggregated where necessary. This particularly affects variables representing technical devices with similar behavior or purpose such as pumps, fans or heat sources. Additionally, new parameters can be constructed towards a tailored parameter set achieving the aspired decision support, such as EER and cooling capacity.
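A PCC matrix like the one in Figure 10 can be computed directly with `pandas.DataFrame.corr`. The variables below are synthetic stand-ins for the measurands named above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
t_ambient = rng.uniform(2, 20, 500)     # ambient temperature, °C
humidity = rng.uniform(40, 95, 500)     # relative humidity, %
# Synthetic stand-ins: cooling capacity loosely follows ambient temperature,
# electric power follows cooling capacity
cooling_kw = 300 + 12 * t_ambient + rng.normal(0, 20, 500)
power_kw = 20 + 0.1 * cooling_kw + rng.normal(0, 3, 500)

df = pd.DataFrame({
    "t_ambient": t_ambient,
    "humidity": humidity,
    "cooling_kw": cooling_kw,
    "power_kw": power_kw,
})
pcc = df.corr(method="pearson")         # symmetric matrix with values in [-1, 1]
print(pcc.round(2))
```

A color-coded matrix plot in the style of Figure 10 can then be rendered from `pcc.values`, e.g., with `matplotlib.pyplot.imshow`.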

Data Aggregation and Transformation
In order to improve data management and the efficiency of the DM process, the database is aggregated from the original ten-second intervals to hourly intervals. Furthermore, the reduction of used variables is examined. Combining variables of similar system components entails only a small loss of information, whereas the information content of each variable increases. The combination and transformation of variables is exemplified in Equation (4) for the active heat sources in the CT system. As explained in Figure 5, the considered CT system includes four heat exchangers representing the heat sources. If a heat source is active, it emits waste heat in the form of warm water to the CT. The activity is described as a binary value. However, the respective share of waste heat in the warm water flow cannot be allocated to the individual heat sources. Thus, an even distribution of the waste heat among the sources is assumed and the current number of active heat sources is derived.
Number of active heat sources = Σ (activity heat source i), for i = 1, …, 4, (4)
with activity heat source i = 0, if heat source i is not active, and 1, if heat source i is active.
The same procedure is used for the number of active CT fans, forward flow pumps and backward flow pumps. Thereby, all binary values are transformed into continuous values, indicating how long system components are proportionately active within the interval. Moreover, according to [51], it might be helpful to use additional historical values of the variables, such as values of the day before. Therefore, several new parameters are introduced based on the previous day (marked with (−1) (variable name)), including average and extreme values (minimum, maximum). Furthermore, variables specific to CT operation and decision making are established in the final database, such as EER (compare Equation (2)), electric power demand and cooling capacity. Figure 11 illustrates the development of data quantities in each processing step in a Sankey diagram. Data aggregation steps 1 and 2 reduce data quantities by approximately 99%. Subsequently, a further outlier filtering follows and the data is then merged with the newly established variables to form the final database. After all processing steps, the data quantity is reduced from the former 2.8 billion rows and 32 variables to approximately 7 thousand rows and 23 variables. Utilizing this data with higher information density is assumed to improve computational times and prediction accuracy.
Figure 11. Sankey diagram of data quantity development through data aggregation and data transformation; the unit is the total number of data points.
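The aggregation from binary 10-second activity flags to continuous hourly activity shares (Equation (4)) can be sketched with a pandas resample; the flag column names below are hypothetical:

```python
import numpy as np
import pandas as pd

# Synthetic 10-second binary activity flags for the four heat sources
idx = pd.date_range("2016-08-01 00:00", periods=720, freq="10s")  # 2 hours
rng = np.random.default_rng(0)
flags = pd.DataFrame(
    rng.integers(0, 2, size=(720, 4)),
    index=idx,
    columns=[f"hs_{i}_active" for i in range(1, 5)],
)

# Hourly mean of each flag = fraction of the hour the source was active;
# summing across sources implements Equation (4) as a fractional count
hourly_share = flags.resample("1h").mean()
hourly_share["n_heat_sources_active"] = hourly_share.sum(axis=1)
print(hourly_share["n_heat_sources_active"].round(2).tolist())
```

The resulting column takes continuous values between 0 and 4, indicating how many heat sources were proportionately active in each hour.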


Feature Selection
A subsequent backward feature elimination procedure aims to assess the significance of variables for the applied DM algorithms. This crucial procedure is applied to every selected algorithm individually. For the use case, results are demonstrated exemplarily for linear regression (LR), simple regression tree (SRT) and ANN multilayer perceptron (MLP). Figure 12 depicts the resulting mean squared errors (MSE), indicating the errors in predicting the electric power demand in every step of the backward feature elimination. Variables with the greatest contribution to reducing prediction errors are removed last, i.e., the later a variable is removed, the more important it is for the model. As expected, MSE values generally increase with a decreasing number of variables used for prediction. However, the highest number of variables does not necessarily lead to minimal prediction errors. Throughout the backward feature elimination, the order in which variables are removed differs significantly between the applied DM algorithms. However, variables representing activities of system components (CT fans, forward flow pumps, return flow pumps) are ranked highly for all algorithms. Thus, they can generally be assumed to be very important features. Figure 13 illustrates the relevance of the variables for the different algorithms in the form of a heat map, where the relevance increases with darkening grey color.
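Backward feature elimination as described, repeatedly dropping the variable whose removal degrades the model least, can be sketched with scikit-learn. This greedy loop is an illustrative reimplementation on synthetic data, not the software used in the study:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def backward_elimination(X, y, model, min_features=1):
    """Greedy backward feature elimination: repeatedly drop the feature
    whose removal hurts the CV score the least; return removal order."""
    features = list(range(X.shape[1]))
    removal_order = []
    while len(features) > min_features:
        scores = {}
        for f in features:
            remaining = [g for g in features if g != f]
            scores[f] = cross_val_score(
                model, X[:, remaining], y, cv=3,
                scoring="neg_mean_squared_error").mean()
        worst = max(scores, key=scores.get)   # least useful feature
        features.remove(worst)
        removal_order.append(worst)
    return removal_order, features

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + 2 * X[:, 3] + rng.normal(0, 0.1, 200)  # only 0 and 3 matter
order, kept = backward_elimination(X, y, LinearRegression(), min_features=2)
print(order, kept)  # irrelevant features are removed first
```

Recording the CV error after each removal reproduces curves in the style of Figure 12, and the removal order per algorithm yields a relevance ranking as in Figure 13.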


Figure 13. Heatmap from feature selection indicating the relevance of variables for the selected algorithms, assessed for electric power demand (dark grey color indicates high relevance).

Hyperparameter Assessment
In the following, the results of the hyperparameter assessment applied to the SRT and MLP algorithms for the target variable electric power demand are described exemplarily. Surface plots help to illustrate the importance of a thorough hyperparameter assessment, as coefficients of determination (R 2) can be significantly improved when hyperparameters are set optimally. Figure 14a illustrates the results for the SRT algorithm, where the hyperparameters are the limit number of levels (i.e., the maximum depth of the decision tree) and the minimum split node size (i.e., the minimum number of records per branch in the decision tree). High R 2 values are reached if the limit number of levels is increased to 10 while choosing a minimum split node size of more than 31. Beyond that, no significant further improvements can be observed. The hyperparameter assessment for the MLP algorithm considers the number of hidden layers and the number of neurons per hidden layer (hidden neurons) as hyperparameters. As Figure 14b illustrates, no clear correlation between R 2 and the hyperparameter values could be detected. Consequently, an individual, software-supported automated hyperparameter assessment is recommendable for MLP instead of relying on experience values. In this case study, 3 hidden layers and 30 neurons per hidden layer are identified as optimal hyperparameter values.
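The grid-style hyperparameter assessment for the SRT can be reproduced with scikit-learn's `GridSearchCV`, mapping the "limit number of levels" to `max_depth` and the "minimum split node size" to `min_samples_split`. The data and grid values here are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in features (e.g., ambient temperature, humidity, water flow)
rng = np.random.default_rng(7)
X = rng.uniform(0, 20, size=(600, 3))
y = 50 + 4 * X[:, 0] + 10 * np.sin(X[:, 1]) + rng.normal(0, 2, 600)

param_grid = {
    "max_depth": [2, 4, 6, 8, 10, 12],          # "limit number of levels"
    "min_samples_split": [2, 8, 16, 32, 64],    # "minimum split node size"
}
search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid, cv=5, scoring="r2")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Reshaping `search.cv_results_["mean_test_score"]` over the two grid axes yields exactly the kind of R 2 surface shown in Figure 14a.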
Figure 14. Surface plots of the hyperparameter assessment: (a) R 2 for the simple regression tree algorithm; (b) R 2 for the multilayer perceptron algorithm.

Evaluation and Deployment of Data Mining Results (Phase 3)
Finally, the DM process is applied. Considering the results of the business understanding step, the target variables for prediction are cooling capacity and electric power demand. As various algorithms are basically suitable for this task, an assessment of five algorithms predicting cooling capacity and nine algorithms predicting electric power demand is conducted. The results are comparatively evaluated regarding their coefficient of determination (R 2) and mean absolute error (MAE) in relation to the computational time. Detailed results, including MAE and mean absolute percentage errors (MAPE), can be found in Appendix A.
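The reported evaluation metrics (R 2, MAE, MAPE) correspond to their standard definitions, e.g., as implemented in scikit-learn; a small worked example on illustrative values:

```python
import numpy as np
from sklearn.metrics import (r2_score, mean_absolute_error,
                             mean_absolute_percentage_error)

# Illustrative measured vs. predicted cooling capacities in kW
y_true = np.array([350.0, 400.0, 420.0, 380.0])
y_pred = np.array([340.0, 410.0, 415.0, 385.0])

r2 = r2_score(y_true, y_pred)                       # 1 - SS_res / SS_tot
mae = mean_absolute_error(y_true, y_pred)           # mean of |error|, in kW
mape = mean_absolute_percentage_error(y_true, y_pred)  # unitless fraction
print(round(r2, 3), round(mae, 2), round(mape, 4))  # 0.907 7.5 0.0197
```

Note that MAE carries the unit of the target (kW), while MAPE is relative and therefore comparable across the two differently scaled target variables.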

Prediction of Cooling Capacity
As the main purpose of a CT is to cool the production processes and machines, an accurate prediction of the cooling capacity is important for CT operation. In order to identify the best fitting algorithm, several alternatives are applied (compare Figure 4). Figure 15 shows the differences between the selected DM algorithms. In general, all algorithms can predict the cooling capacity with a high accuracy, indicated by resulting R 2 values between 0.91 (MLP reg.) and 0.96 (PR). Computational times range from 2 to 7 min, which seems acceptable in terms of practical application.
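A polynomial regression of the kind that achieved the best cooling capacity prediction can be sketched as a scikit-learn pipeline. The data below is synthetic and quadratic by construction, so the high R 2 is expected and does not reproduce the paper's result:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(3)
t_amb = rng.uniform(2, 20, 1000)        # ambient temperature, °C
humidity = rng.uniform(40, 95, 1000)    # relative humidity, %
# Synthetic cooling capacity with a mild nonlinearity in ambient temperature
q_cool = 250 + 15 * t_amb - 0.3 * t_amb**2 + 0.5 * humidity \
         + rng.normal(0, 5, 1000)

X = np.column_stack([t_amb, humidity])
X_train, X_test, y_train, y_test = train_test_split(X, q_cool, random_state=0)

# Degree-2 polynomial features followed by ordinary least squares
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_train, y_train)
score = r2_score(y_test, model.predict(X_test))
print(round(score, 3))
```

The held-out test split guards against the polynomial terms simply memorizing the training data.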


Figure 15. Evaluation of data mining results predicting the cooling capacity.
As discussed in Section 3.1.2, the cooling capacity mainly depends on local environmental conditions. In order to cover seasonal trends, time series for one year are taken into account. Figure 16 provides a prediction of the cooling capacity from polynomial regression, the algorithm with the highest prediction accuracy (R 2 = 0.96), highlighting the absolute errors (red color) compared with the original data. Apparently, local trends and fluctuations are predicted with high accuracy. From May until November, higher fluctuations in the cooling capacity, accompanied by increased prediction errors, are observable. This could be attributed to local weather conditions and production scheduling, as discussed before.

Prediction of Electric Power Demand
The electric power demand is a main lever for CT efficiency and determines economic as well as environmental improvement potentials. Thus, a detailed analysis and forecast of future electric power demands can be an enabler for improving CT operation. In order to identify the best fitting algorithm for this task, several alternatives of classification and regression algorithms are applied (compare Figure 4), and the prediction results are depicted in Figure 17. The resulting R 2 values range between 0.84 (NB) and 0.99 (LR), with related computational times between 2 and 9 min.

The electric power demand depends on the performance of electric components, such as pumps and fans with individual operation controls. Figure 18 illustrates the time series of the original data, absolute error and predicted data resulting from linear regression (R 2 = 0.99). Apparently, local trends in the electric power demand, such as fluctuating cooling demand caused by weekly production schedules, can be predicted. In general, as discussed in Section 3.1.2, high and low seasons are clearly visible in the electric power demand during the year. Periodical peak consumption during the week can be explained by regular maintenance activities.


Discussion
A limiting aspect of data-driven approaches in general that cannot be ignored is the availability of measurement data with sufficient scope and detail. The initial effort to develop a measurement infrastructure and database for feasible data-driven approaches, such as the presented DM approach, is typically high and cost intensive. The benefit of newly establishing such an infrastructure should therefore be assessed economically beforehand. Furthermore, the actual support for decision makers and TBS operators should be assessed against prior expectations. This will be part of future work. The DM algorithms show highly accurate predictions of KPIs that are relevant to operate and control an industrial CT system. However, a validation based on long-term operation experience is still outstanding. Such a validation could provide valuable insights into improvement potentials for the methodology as well as its applicability in daily practice. As the defined target variables are of continuous character, regression algorithms result in higher prediction qualities compared to classification algorithms. The strength of classification algorithms lies in the prediction of discrete values such as color, shape or production state. Hence, they could be used for future applications such as energy flexibility strategies by predicting future operation states and resulting energy demands.
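The classification use case mentioned above can be sketched as follows. The discrete operation states, the features and the synthetic labeling rule are assumptions for illustration; in practice the labels would come from the CT control system.

```python
# Hedged sketch: classifying a discrete CT operation state
# (0 = off, 1 = part load, 2 = full load) from assumed operating features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(2)
# assumed features: cooling demand [% of nominal] and ambient temperature [degC]
X = rng.uniform(0, 100, size=(1500, 2))
# synthetic labeling rule: the state follows cooling demand bands
y = np.digitize(X[:, 0], bins=[20, 70])  # <20 -> 0, 20-70 -> 1, >=70 -> 2

clf = DecisionTreeClassifier(max_depth=4).fit(X[:1000], y[:1000])
acc = accuracy_score(y[1000:], clf.predict(X[1000:]))
print(f"operation state prediction accuracy: {acc:.2f}")
```

Predicted future states could then be mapped to expected energy demands per state, which is the link to energy flexibility strategies discussed above.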

Conclusions and Outlook
The authors present an approach to create a data-driven DT for TBS operation, applied to an industrial CT system. It aims to uncover interrelations between operational business and the technical system in order to improve operational strategies. The DM approach is based on the well-known CRISP-DM procedure, featuring a high integration capability into daily business and continuous improvement cycles. As CRISP-DM is a generic procedure for DM in industry, the approach is transferable to other TBS technologies apart from CT. Yet, to transfer the approach, it should be discussed whether to omit single steps or extend the workflow depending on the individual requirements of the use case. Based on a consistent and continuous database, the DM approach features three general phases for thorough system analysis and performance prediction. In the first phase, business understanding, a general system understanding is gained and KPIs to express system conditions and to identify hotspots are defined. A focus is then put on the second phase of creating the DT, in which each of the seven working steps, such as outlier filtering, data aggregation and transformation, is explained and illustrated by means of the case study. The use of intuitive diagrams such as heat maps and Sankey diagrams is proposed to make the workflow and results comprehensible. The comparison of several DM algorithms revealed their general aptitude to predict crucial operational KPIs like cooling capacity and electric power demand with high accuracy and acceptable computational times. The best predictions were achieved by polynomial regression (R² = 0.96) for cooling capacity and linear regression (R² = 0.99) for electric power demand. The accurate prediction of cooling capacities provides valuable insights into the overall system performance and operation reliability, which is crucial for the whole production system.
Looking ahead, the approach offers numerous opportunities. The forecast of energy demands enables a proactive and energy-oriented CT system operation and paves the way for future business models such as energy flexibility. A combination with additional forecast models, e.g., local weather or energy market prices, could increase the economic relevance. Moreover, the data-driven DT could serve as a basis for further simulation as an alternative to formula-based simulation models. This would enable the user to run what-if scenarios and evaluate possible future operation strategies in the safe virtual DT environment without affecting the physical system. For further work, the authors also plan to improve the applicability and transferability of the approach in daily practice. One contribution could be a higher degree of automation for the presented workflow, significantly reducing efforts for data processing. Furthermore, direct feedback to the physical control system could be a possible extension, reducing the need for human interventions. A full implementation of the DT approach into a real-world CT or any other TBS system, with fully automatic response and control of the system, can be regarded as the aspired vision based on the proposed DT approach.
Author Contributions: C.B., concept, methodology, formal analysis, visualization, writing-original draft preparation; S.B., visualization, validation, writing-review and editing; S.T., concept, supervision, writing-review and editing; C.H., supervision, writing-review and editing. All authors have read and agreed to the published version of the manuscript.

Funding:
The authors gratefully acknowledge the financial support of the Kopernikus-Project SynErgie (Grant 03SFK3N1-2) by the Federal Ministry of Education and Research (BMBF) and the project supervision by the project management organization Projektträger Jülich (PtJ).

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Figure A2. Summary of data mining results predicting the electric power demand.