Article

Evaluation of the Flourish Dashboard for Context-Aware Fault Diagnosis in Industry 4.0 Smart Factories

1 Faculty of Computer Science, Darmstadt University of Applied Sciences, Haardtring 100, 64295 Darmstadt, Germany
2 Research Group Human-Computer Interaction and Visual Analytics, Faculty of Media, Darmstadt University of Applied Sciences, Haardtring 100, 64295 Darmstadt, Germany
3 Hessian Center for Artificial Intelligence, Mornewegstraße 30, 64293 Darmstadt, Germany
* Author to whom correspondence should be addressed.
Electronics 2022, 11(23), 3942; https://doi.org/10.3390/electronics11233942
Submission received: 1 November 2022 / Revised: 23 November 2022 / Accepted: 24 November 2022 / Published: 28 November 2022
(This article belongs to the Special Issue Visual Analytics, Simulation, and Decision-Making Technologies)

Abstract

As cyber-physical systems grow more complex, so do the production lines of the smart factory. Every employed system produces large amounts of data with unknown dependencies and relationships, making incident reasoning difficult. Context-aware fault diagnosis can unveil such relationships on different levels. A fault diagnosis application becomes context-aware when the current production situation is used in the reasoning process. We have already published TAOISM, a visual analytics model defining the context-aware fault diagnosis process for the Industry 4.0 domain. In this article, we propose the Flourish dashboard for context-aware fault diagnosis. Its eponymous visualization, Flourish, is a first implementation of a context-displaying visualization for context-aware fault diagnosis in an Industry 4.0 setting. We conducted a questionnaire- and interview-based bilingual evaluation with two user groups based on contextual faults recorded in a production-equal smart factory. Both groups provided qualitative feedback after using the Flourish dashboard. We positively evaluate the Flourish dashboard as an essential part of context-aware fault diagnosis and discuss our findings, open gaps, and future research directions.

1. Introduction

Industry 4.0 accelerates the development of novel interconnected cyber-physical systems (CPSs). The rise in complexity of CPSs makes assessment and reasoning in an already complex environment laborious. Each CPS consists of cyber and physical components (software and hardware). Each component, e.g., sensors and software modules, produces large amounts of data in various forms and types. Reasoning within one CPS is already complex, but reasoning within a smart factory (SF) production line, where multiple CPSs are tightly chained together, is challenging. Maintenance and fault handling are labor-intensive in such a complex environment. Currently, professionals are confronted with this increased complexity, which will rise further in the future. To counteract it, we have developed the concept of context-aware diagnosis, which is based on the TAOISM visual analytics (VA) model [1,2]. A VA model combines data, models, visualizations, and the knowledge of professionals in a defined process. A VA model weighs the human and their knowledge equally with the data, the models, and the visualizations in the process, which is a strength and a reason why we framed our context-aware diagnosis as a VA model.
The context is “the situation within which something exists or happens, and that can help explain it” (https://dictionary.cambridge.org/dictionary/english/context, accessed on 2 August 2022). Here, the context and the situation within the production line are considered in the reasoning. We provide a mathematical definition of context in the smart factory (SF) domain [1] that builds a boundary around data at a given time. These boundaries can be utilized to divide data from the SF into reasonable chunks, forming context hierarchies [2] that encompass software and hardware components. Each context then contains only a fraction of the information and complexity, which can be surveyed manually or used in algorithms and models to automatically survey and reason about the SF production line. Moreover, the context grants the possibility to observe relationships and dependencies within the SF. Relationships can be observed within a context, between contexts (inter-context), between CPSs (inter-machinery), or in the entire production line. Following the principles of the TAOISM VA model, each context can be used as a foundation for a detection model that reflects an error model. Besides a prototypical end-to-end implementation of the context-aware fault diagnosis, each aspect of the TAOISM VA model has undergone distinct research in different proof-of-principles [1,2,3,4,5,6].
One gap that still exists is the missing end-to-end implementation and, thus, the integration of professionals. To close this gap, we propose a prototypical end-to-end implementation of the context-aware fault diagnosis in this article. The presented Flourish dashboard is part of our end-to-end implementation, which reflects the context-aware fault diagnosis. The implementation integrates contextual faults [6] with new recordings, a live video feed, context outlier detection (OD) models, the context-displaying Flourish visualization, and novel artificial intelligence (AI) performance metrics into a dashboard to provide guidance and transparency for professionals in a fault case. We evaluate our prototype in an experiment with two distinct groups of 15 professionals: junior professionals (10) and domain experts (5). In our case, the domain experts have exceptional knowledge about the SF in which the contextual faults occur. Therefore, the domain experts are able to reason about the presented faults in depth, whereas the junior professionals have only a background in the specific domain and superficial knowledge about the SF. The evaluation is split into a training phase, a guided fault case, a random fault case, and an overall evaluation. Both groups underwent the same process with the same questions, but only the domain experts were interviewed, whereas the junior professionals filled in a questionnaire. The qualitative results are promising with regard to the implementation and the context-aware fault diagnosis in general, and they pinpoint future research directions.
Our contribution is four-fold: (1) we provide an evaluated prototypical end-to-end implementation of the context-aware fault diagnosis, (2) present Flourish as a context-displaying visualization for smart manufacturing, (3) state a set of novel AI performance metrics for unsupervised ensembles, and (4) give an example of guidance and transparency when dealing with applied AI in the work process through the dashboard.
The remainder of the article is structured as follows. Section 2 describes related work in the fields of context awareness in smart manufacturing and in visualizations, as well as the current state of the art in evaluating industry-related systems. Section 3 presents the Flourish dashboard with background information. Section 4 explains the evaluation in depth, including the environment, the data, the user groups, the employed questionnaire, and the results. The questionnaire can be found in Appendix A. We conclude our work in Section 5 and pinpoint future research directions.

2. Related Work

In this section, we highlight the advances in the research fields of context-aware smart manufacturing and context-awareness in visualizations and provide a summary of the current evaluation techniques for industry visualizations. Context awareness is a subtle field, and various techniques may be used to facilitate the information in models, algorithms, or visualizations, e.g., to highlight contextual information. Consequently, we split the related work among context-awareness in smart manufacturing, visualizations, and the evaluation of industry visualizations. Parts of the related work section were taken from [2], but the sections were extended and updated.

2.1. Context Awareness in Smart Manufacturing

The first step toward context awareness was made by Emmanouilidis et al. [7]. Their conceptual model integrates the knowledge of domain experts as a single entity. Their ideas to contextualize machinery symptoms by integrating human perception are an inspiration and can be seen as an early predecessor of our TAOISM VA model [1,2]. Another approach to context awareness was given by Zhou et al. [8] in the definition of a situational awareness model that utilizes qualitative (temperature zones with boundaries) and quantitative (temperature sensor data) measurements. Their situational awareness model (index part/computation model) deduces the state of the production line (low, guarded, elevated, high, and severe). Additionally, to the best of our knowledge, this is the first model that formalizes a context in an SM process and calculates the severity of a context. In contrast to Zhou et al., we do not survey one context, but our approach allows artificial contexts spanning various parts of the SF. Moreover, we employ the concept of context hierarchies to be able to survey not only the production line but also distinct CPSs, dependencies, and relationships between hardware and software. Zhou et al. also used surrounding variables but stuck to sensor data and fell short of our concept of surveyable sub-contexts. Additionally, with the techniques we employed, we are not restricted to numerical data only but are able to mirror the wide variety of multivariate data in more complex cases. Another attempt in the direction of context awareness was made by Filz et al. [9], who published a product state propagation concept, which has some similarities to our context concept. Their concept can be referred to as a product-centered context (spanning around a product) to identify the malicious process which leads to a faulty product.
Unlike their product state propagation, our context is used to identify faults, dependencies, and event chains in the production line, the CPSs, and their communication. Therefore, with our approach, we want to identify faults in the production line and its involved sub-processes rather than identify the processes leading to a faulty product. Nevertheless, we both encourage the analysis of the whole SF instead of only single and isolated processes [2,9] to counter the increased complexity under Industry 4.0.

2.2. Visualization for Maintenance and Production

In this section, we focus on the related work in maintenance/production visualization and the foundations of our novel Flourish visualization. The foundation of the Flourish visualization is a sunburst chart that visualizes the context hierarchy, the contexts, and contextual information. Here, we identify the visualization techniques that share similarities with our Flourish visualization. The Flourish dashboard is, at the same time, a monitoring dashboard. For this reason, we relate our work to other SF monitoring dashboards.

2.2.1. Smart Factory Monitoring Dashboards

Zhou et al. published a thorough survey of current visualizations in the SF domain [10]. Visualizations are centered around the concepts of replacement and creation. Replacement frees people from tedious work through the deployment of intelligent devices or the virtualization of dangerous work environments. In contrast, creation consists of the design phase (creation of products), the production phase (ideology to physical forms), the testing phase (guaranteeing established standards), and the service phase (insights from usage). Xu et al. [11] published a dashboard with an extended Marey graph and a station graph that exploits production flow and spatial awareness to provide insights and uncover anomalies. Jo et al. [12] contributed a novel visualization for ongoing tasks in the production line with an extended Gantt chart. Post et al. [13] facilitated user-guided visual analysis of flow, workload, and stacked graphs in a production line. In [14], Arbesser et al. developed a visual data quality assessment for time series data. Part of the work is the plausibility checks, which are simple rules applied to meta-data (e.g., sensor type and position). In their unique concept, they conducted a staged visualization from overview to detail and employed a color scheme for different severity levels of the found anomalies.

2.2.2. Visualization Foundations

The Flourish visualization has a radial design that originates from a sunburst chart. Besides that, other work exists that shares similarities with the Flourish visualization. For instance, Keim et al. [15] developed CircleView to visualize multivariate streaming data. Here, variables are placed on segmented circles reflecting periods of time. While sharing the segmentation, Flourish visualizes the current situation without displaying values over time. Krzywinski et al. [16] published Circos, a multivariate display to differentiate genomes. Here, the center holds information, which is detailed in circles around the center. The Flourish visualization adapts this style of information aggregation in the center but uses a blossom to provide the status of the production line context and the CPS contexts. Bale et al. [17] proposed Kaleidomaps. Like Kaleidomaps, the Flourish visualization highlights certain areas of interest, points the professional in a certain direction, and reveals remarkable patterns when the professional scrolls through time. Tominski et al. [18], with the enhanced spiral view, and Weber et al., with the spiral graph [19], gave examples of a radial information flow from the outer to the inner. Likewise, the Flourish visualization adapts that kind of information flow to achieve similar attention highlighting with a focus on the current situation rather than giving a comparison between time spans. KronoMiner, published by Zhao et al. [20], is also a great example of a radial layout, where information is more aggregated toward the center. In the center, aggregated information can be selected in a circle, which expands into another layer, where additional information is shown because of the space available. The Flourish visualization behaves similarly, where higher-level contexts also aggregate the information of the underlying contexts. The contexts also have sub-contexts and go from overview to more detailed contexts.
While sharing the hierarchical order with KronoMiner, the Flourish visualization, in contrast, displays the results of distinct models and aggregates the current situation toward the center. A similar example is the SolarPlot published by Chuah et al. [21]. The SolarPlot with the aggregate tree map is used to display the hierarchy of a company from the top layer of management to the employees (outer layers). The chosen segments are the departments of the company. The Flourish visualization is similar; toward the center are the more high-level contexts, whereas on the outer layers are the more fine-grained contexts. Additionally, the segmentation is comparable between the departments in the SolarPlot and the stations employed in a production line. Additionally, the Flourish visualization is inspired by nature. To the best of our knowledge, a similar visualization with similar functionalities for a comparable purpose does not exist at the moment.

2.3. Evaluation of Industry 4.0 Visualizations

There are only a few Industry 4.0 visualizations, and likewise, there are few evaluations. To overcome this scarcity, we also include Industry 4.0-related visualization evaluations in this section. Besides standard questionnaires, custom questionnaires are used that split the questions into categories to cover different aspects of the visualization. Some publications join standard questionnaires with their own categorization or use different standard questionnaires for different categories. Additionally, the involved group sizes differ, as does the evaluation format used. For quantitative measurements, mainly questionnaires with a reasonable group size are used, whereas specialized visualizations are evaluated with small groups of domain experts (up to 5) and interviews. Nielsen and Landauer [22] recommended 3–5 users as a minimum group size for usability testing with the best cost-benefit ratio. As a result of their recognized and well-established work, most evaluations with a small pool of potential testers (due to the specificity of the visualization/domain) use measurements (questionnaires or interviews) with groups of up to five. To prove the effectiveness of Flourish, we performed a qualitative evaluation with ten junior professionals and five domain experts.
Merino et al. [23] performed a systematic literature review of software visualization evaluation to provide a comprehensive survey of the techniques currently employed in the evaluation domain. Rasmussen et al. [24] presented different techniques to evaluate industry-related human-machine interfaces (HMIs). Forsell et al. [25] provided a guideline for the development of information visualization questionnaires. They suggested 10–20 questions that were validated by domain experts upfront and were chosen from home-grown to standard questionnaires. We follow this advice and combine questions originating from the Flourish dashboard and its techniques with the ISO 9240/10 and the system usability scale (SUS) questionnaire. Malhotra et al. [26] developed a matrix score algorithm for the evaluation of a questionnaire about barriers affecting reconfigurable manufacturing systems. Novais et al. [27] evaluated a software system through the facilitation of goals that are combined with tasks and questions that should be solved during the evaluation. We adapt the concept of a goal-centric evaluation approach and center our evaluation around a workflow model. Aranburu et al. [28] used the standard PANAS-X questionnaire and eye tracking to capture the Industry 4.0 app experience. Moreover, the authors performed a task-centered evaluation and measured execution time, errors, and emotions. The evaluation was performed in rounds with 16 and 5 participants. Bassil et al. [29] performed an online evaluation with 107 participants and employed two custom questionnaires centered around functional and practical parts of software visualization tools. Bertoni et al. [30] used eight groups of 26 engineering students to evaluate a novel CAD tool. Here, an explanation was given beforehand, and then different tasks were given and evaluated through observations. In the end, a questionnaire was used to evaluate the software.
We follow this segmentation of the evaluation and introduce the dashboard through a training session before asking the participants to fulfill various tasks. Ciolkowski et al. [31] evaluated software control centers in four companies with 11 participants with different knowledge backgrounds. Their questionnaire-based evaluation encompasses questions on the ease of use, usefulness, and improvement potential of the software control centers. The authors provided an anonymous abstraction of the users in their evaluation, which we take as a role model for our evaluation by providing a similar description of our users. Fuertes et al. [32] evaluated an augmented reality digital twin with 20 graduate students. Here, students were given different scenarios to solve. Afterward, the course was evaluated using the SUS questionnaire. Lohfink et al. [33] evaluated a dashboard for visually supported triage analysis with 15 participants. The participants also had different technical backgrounds, as in our evaluation. The authors evaluated their dashboard with ten questions from the ISO 9240/10 questionnaire. We follow their path and use the same ten questions to evaluate the Flourish dashboard in the end. In contrast to Lohfink et al., who validated their dashboard with one domain expert, we conducted five domain expert interviews. Reh et al. [34] evaluated their mobject visualization with 12 participants in 2 groups utilizing a questionnaire centered around three tasks. Richardson et al. [35] performed an HMI study with 5 participants using the SUS and ISONORM 9241 questionnaires. Shamim et al. [36] conducted an evaluation of opinion visualization with 146 participants recruited in a seminar and via an online questionnaire. The authors described the user groups by their gender and age. Shin et al. [37] employed 16 graduate students to evaluate ARcam with the NASA-TLX questionnaire and multiple experiments per task. In the end, there was a post-session questionnaire.
Stelmaszewska et al. [38] performed their evaluation with 34 participants, consisting of graduate students and domain experts. The authors employed pseudonyms for the participants and split the evaluation into multiple tasks, where one part was a think-aloud evaluation and the other an in-depth interview. Strandberg et al. [39] used five domain experts from the industry to evaluate their tool against the usage of only log data in a questionnaire. Väätäjä et al. [40] proposed information visualization heuristics in a practical expert evaluation with a custom questionnaire that employs open-ended questions, sentence completion, an assessment of the usefulness of the heuristics, and statements with a ten-point Likert scale. Villani et al. [41] published a user study for the evaluation of adaptive interaction systems for inclusive industrial workplaces. In the evaluation, 18 participants evaluated an adaptive HMI in different tasks. Here, the time for the tasks and the mistakes were recorded. In the end, the participants were asked to answer the SUS and a worker satisfaction questionnaire after the evaluation. Zhao et al. [42] defined tasks, grouped into cluster-oriented tasks and data-driven tasks, to evaluate multi-dimensional visualizations for fuzzy clusters. Each task has assigned questions that the participant has to answer. Fifteen participants volunteered in the evaluation.
To summarize, evaluations of Industry 4.0 visualizations or applications typically involve a small number of participants (<20) [28,31,32,33,34,35,37,39,41,42]. Domain experts in the field are hard to obtain because of their specificity and the accompanying small pool size. Therefore, they are used to verify the findings and are employed in small numbers [31,33,34,35,38]. The employment of students and graduates with different knowledge backgrounds in evaluations is common [30,32,36,37,38]. All evaluations are backed by a questionnaire [27,28,29,31,32,33,34,36,37,38,39,41,42]. The use of custom questions together with standard questionnaires is common [28,32,37,38,41]. A common combination is a task-centered design with post-session questionnaires [28,32,33,35,37,38,41]. Moreover, a few combine questionnaires with expert interviews [31,33,34,35,38]. Most evaluations are performed qualitatively because of the specificity and the lack of a broad audience [28,30,32,33,34,38,39,41,42]. To conclude, our presented qualitative evaluation is in line with the current scientific standards in the Industry 4.0 domain, with two user groups with different knowledge backgrounds and a combination of task-centered questionnaires and guided domain-expert interviews.

3. The Flourish Dashboard

Context-aware fault diagnosis aims at mitigating the complexity of an evolving environment, which will tend to become even more complex in the future. Maintenance and fault diagnosis are laborious and will become even more tedious. A significant part of context-aware fault diagnosis is splitting the massive amounts of data into reasonable chunks that can be surveyed. For this reason, we established the context as a virtual boundary around sensors and software modules that describes the current situation in time, where dependencies and relationships may be seen. A context can span from simple units (e.g., one sensor) over multiple parts of a CPS (e.g., software modules and sensors) or various CPSs, to the whole production line. Therefore, contexts are inherently hierarchical. The Flourish dashboard is a proof of concept (PoC) of context-aware fault diagnosis. The PoC should be seen as a first approach to visualizing the current situation in contexts and a context hierarchy to be used for fault diagnosis in a smart manufacturing environment. Additionally, the Flourish visualization is an attempt to visualize the contexts and the context hierarchy. Both are subject to future research, and contributions to the field of context-aware fault diagnosis are welcome.
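To make the hierarchical nature of contexts concrete, a context can be modeled as a boundary around surveyed variables with nested sub-contexts. The following minimal Python sketch illustrates this idea; the class design and all context and variable names are hypothetical illustrations, not part of the actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    """A virtual boundary around surveyed variables (sensors/software modules)."""
    name: str
    variables: list[str] = field(default_factory=list)   # signals inside the boundary
    children: list["Context"] = field(default_factory=list)  # sub-contexts

    def all_variables(self) -> list[str]:
        """Aggregate the variables of this context and all its sub-contexts."""
        result = list(self.variables)
        for child in self.children:
            result.extend(child.all_variables())
        return result

# Hypothetical hierarchy: production line -> CPSs -> component-level contexts
line = Context("production_line", children=[
    Context("cps_press", variables=["pressure_sensor"], children=[
        Context("valve_module", variables=["valve_state"]),
    ]),
    Context("cps_conveyor", variables=["motor_current"]),
])
```

A higher-level context thus aggregates everything its sub-contexts survey, which mirrors how a production-line context spans the variables of all CPS contexts below it.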

3.1. Structure and Behavior

A significant benefit of context-aware fault diagnosis is that the employed algorithms do not depend on domain knowledge. Each employed technique is used in an unsupervised fashion. The goal of the Flourish dashboard is to provide guidance throughout the context-aware fault diagnosis. The Flourish dashboard follows the principle of highlighting and presenting valuable information to the professional to minimize search and evaluation time in the large amounts of available data. At the same time, it enables classifying detailed findings as errors.
The Flourish dashboard is sectioned into three areas: the Flourish visualization, the context OD model view, and the log view. Figure 1 depicts the whole dashboard. The Flourish visualization is the first interaction point for fault diagnosis. Figure 1 (left, first section) and, in more detail, Figure 6 show all elements that are displayed at the start. Once a specific context is selected, the context view becomes visible and moves the centered Flourish visualization to the left. The same happens if a surveyed context value is selected. After that, both views move to the left to open the log view on the right.
The initial view shows the live video feed checkbox and the recording select box (1, Figure 1). Here, different experiment recordings are selectable. The experiments were centered around the contextual faults described in [6]. The recordings consist of the ground truth and the contextual faults of missing parts, missing pressure, and shuttle dropout. Each case was recorded twice and was additionally filmed. Consequently, eight datasets can be selected. The hierarchy-depth slider (2) allows adjustments to the visible context hierarchy depth displayed by the Flourish visualization (5), where one reflects the first hierarchy level, and the right-hand side reflects the last hierarchy layer or the leaves. Next to the slider is a spider diagram (3) that displays the AI performance metrics for unsupervised-trained ensembles. In the spider diagram, the metrics are shown in historical order (shift, day, and historical performance). The system performance (SP) computes the metrics over all datasets, which is comparable to all data in the system since deployment, i.e., all historical data. Additionally, the dataset performance displays the computed metrics over one dataset, which translates to the daily performance of the system. Lastly, the dataset performance until the current time t is computed every time the time slider (6) is moved. These metrics are comparable to those computed during an eight-hour shift. The visualization in a spider diagram enables the professional to see the difference in performance compared to historical values and to determine whether the system is operating within acceptable limits. The shown metrics should increase trust in the displayed information. The live video feed (4) can be opened in picture-in-picture (PIP) mode and is utilized to verify possible faulty situations. Together with the metrics, the video forms another information channel to validate the information displayed in the Flourish visualization. Additionally, the video in PIP mode floats over the Flourish dashboard and can be resized and moved according to the professional's needs.
The center of the dashboard is the Flourish visualization (5), which expands on hover, as shown in Figure 3. Additionally, the threshold slider (7) is used to adjust the thresholds by percentage. All thresholds of all context OD models can simultaneously be increased by up to 300% or lowered by 100% to zero. Therefore, the adjustment impacts the Flourish visualization, the displayed context status information, the production context error percentage, and the AI metrics. Meanwhile, an adjustment of the time slider (6) impacts the Flourish visualization, the live video feed, and the context view. All displayed information stays synchronized. If the time slider is adjusted, the video seeks the same point in time, and the context view is updated accordingly (the model performance (9) and the context window (10)). An adjustment in the live video feed will also update the time slider and the rest of the dashboard to the same time. A similar procedure applies if the professional pauses the video to inspect a specific point of interest in time. The context OD model performance (9) visualizes the performance of the underlying model, e.g., the reconstruction error (blue line) of the underlying context OD model (an autoencoder neural network). The thresholds are set after training by processing an additional ground truth dataset. Each unsupervised model has its own threshold. The threshold is the first barrier at which the system decides a medium error occurred (yellow line exceeded). The red threshold is 20% over the yellow threshold, and if it is exceeded, the context is flagged as a serious error (red). The displayed context window (10) consists of the various time series fed to the neural network up to the given point in time (vertical green line in the model performance), in this case, a window 10 s long. All surveyed variables (hardware and software modules) in the given context are fed to the context OD model. Only diagrams (i.e., variables) that exhibited one or more changes during the 10 s window period are visible in the context window area. In order to observe longer periods of behavior, each context window also includes a one-minute rolling window of each variable. Here, the last 10 s are also considered and fed as one context window to the context OD model, e.g., to detect a slowly evolving shift.
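The two-stage threshold logic described above can be summarized in a short Python sketch. The function and variable names are illustrative, and the interpretation of the threshold slider as a scale factor between 0.0 and 3.0 is our assumption; only the 20% margin between the yellow and red barriers follows directly from the description:

```python
def classify_context(reconstruction_error: float,
                     base_threshold: float,
                     slider_scale: float = 1.0) -> str:
    """Classify a context based on its OD model's reconstruction error.

    slider_scale mirrors the dashboard's threshold slider (assumption:
    1.0 keeps the trained threshold, 0.0 lowers it to zero, and 3.0
    raises it by 300%). The red threshold sits 20% above the yellow one.
    """
    yellow = base_threshold * slider_scale  # medium-error barrier
    red = yellow * 1.2                      # serious-error barrier (20% above yellow)
    if reconstruction_error > red:
        return "serious"
    if reconstruction_error > yellow:
        return "medium"
    return "normal"
```

Raising the slider therefore makes all contexts less sensitive at once, while lowering it surfaces even small reconstruction errors as medium or serious findings.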
If the model performance threshold is exceeded, as shown in Figure 1, the professional can use their domain knowledge to judge the displayed graphs in the context window. The graphs are the numerical representation of the context data fed to the context OD model. Transformations are in place to translate the various data types within the SF to numerical data. Various transformations have been published [1,2,5,6]. If the professional spots something unusual, they can select a given graph (11), and the log view with the original (back-translated) log values is shown. Together with the plotted model performance, the context information, the back-translated log values, and the video, the professional can validate the contextual fault and may uncover inherent dependencies and relationships explaining the fault's origin.

3.2. Background

The Flourish dashboard is based on the TAOISM VA model [1,2]. The Flourish visualization (5) depicts the context hierarchy and the inherent contexts. Each context is backed by an OD model that represents the normal behavior of the SF. Various algorithms, e.g., support vector machines, decision trees, Markov models, and cluster-based algorithms (k-nearest neighbors), are able to learn the standard behavior (the list is not exhaustive). Algorithms that are trainable in an unsupervised fashion are preferred because labeling data in such a complex environment is highly labor-intensive and cost-inefficient. For our prototype, we use unsupervised convolutional autoencoder neural networks. The neural networks are trained in an offline step before they are used for live prediction.
An autoencoder learns to reconstruct a given input. The general idea behind outlier detection with autoencoder neural networks is that outliers are only a small portion of the data. Therefore, the presented less-common or outlying data produce a higher reconstruction error, which can be caught using a general threshold found through the reconstruction on multiple fault-free production cycles. A limitation of the method is that if the training data are impure, the threshold cannot be properly set, and the distinction of outliers is impossible. As with all data-driven OD methods, the outlier identification quality depends on the data quality.
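A minimal sketch of this thresholding idea, with hypothetical helper names; `model` stands for any callable that reconstructs its input, such as a trained autoencoder:

```python
import numpy as np

def fit_threshold(model, fault_free_windows, safety=1.0):
    """Derive the outlier threshold from reconstruction errors on
    fault-free production cycles (hypothetical helper)."""
    errors = [float(np.mean(np.abs(w - model(w)))) for w in fault_free_windows]
    # Simple choice: the largest error seen on clean data, optionally
    # widened by a safety factor.
    return safety * max(errors)

def is_outlier(model, window, threshold):
    """Flag a context window whose reconstruction error exceeds the
    threshold learned on clean data."""
    return float(np.mean(np.abs(window - model(window)))) > threshold
```

As noted in the text, this only works if the fault-free cycles are actually clean; impure training data shifts the threshold and blurs the distinction between normal and outlying windows.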
The data for the contexts in the context hierarchy, as shown in [1,2], come from various data sources. Currently, the Flourish visualization is backed by 279 independent context OD models. Each context OD model consists of 11 layers (Figure 2). The 1D autoencoder consists of an encoder with six layers and a decoder with five layers. The encoder's first convolutional layer enlarges the feature size/dimension d by a factor of 4 using filters. The remaining layers shrink the dimension by a factor of 2, down to d/2. After the first two convolutional layers, average pooling takes place, where extreme values are averaged over a pool size of 2. The average pooling operation yields two benefits, which were surveyed in field trials. First, the neural network does not over-fit as quickly. Second, the neural network does not focus on extreme values/features too early; as a result, features are weighted more equally, leading to a better reconstruction of previously seen ground truth data. After that operation, the remaining values run through another convolutional layer, and those values, in turn, undergo a max pooling operation. Only the maximum values within the pool size of 2 are used in later operations. Max pooling at this stage has the advantage that the neural network focuses on the maximum values of the previous convolutional operation after the average pooling of a balanced selection of values, which best describes the current situation. Another convolutional layer is fed with the maximum values at only half of the input dimension d. Next, the decoder reverses the convolutional operations using five transpose layers. Normally, an autoencoder is symmetric, but, as average pooling and max pooling cannot be reversed, these layers appear only in the encoder and are omitted from the decoder. Each convolutional layer uses the hyperbolic tangent (tanh) as its activation function.
In field trials, the neural network is able to learn standardized features with tanh because tanh has a value range of −1 to 1 and can therefore also predict negative values. In contrast, rectified linear units (ReLU), with a value range from 0 upward, wipe out negative values and only predict positive values. We employed the standard scaler for scaling the data because, in a streaming environment, it is hard to determine a min/max value, and later scaling operations would exceed the lower/upper set boundary. For training, the data of the contexts are recorded at once and then split into the respective contexts. Each context datum is then transformed into numerical data and scaled. After the scaling process, the respective context OD model is trained on the transformed data, minimizing the mean absolute error (MAE) between input and output for 50 epochs. If the MAE reaches a plateau, the learning rate is reduced by a factor of 0.5 with a patience of three epochs. Early stopping is also in place to reduce training time: after ten epochs in which the MAE does not improve by a minimum delta of 0.01, the training process stops. A second ground truth dataset is fed to the already-trained neural networks to set a custom threshold for each context OD model individually; that threshold is the yellow border in the model performance (9). Additionally, we trained with mixed Float16 precision to reduce the memory footprint by a factor of 4. In order to avoid unstable numerical operations, we followed best practice, and the last layer, with a linear activation function, converts everything back to Float32. We were able to train each of the 279 neural networks with the same dimensionality-based configuration. Furthermore, another benefit of autoencoders and the preprocessing steps used is that the transformed data can be back-translated to the original log values.
The Flourish dashboard uses this advantage to display the original log values to the professional, while at the same time, the employed context OD models are used to classify the current situation.
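The standard scaling and its back-translation can be sketched as follows. The class mirrors the usual fit/transform/inverse_transform interface; it is an illustrative re-implementation, not the authors' code:

```python
import numpy as np

class StandardScaler:
    """Zero-mean/unit-variance scaling fitted once on recorded training
    data and reused unchanged on the live stream (minimal sketch)."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        self.scale_[self.scale_ == 0.0] = 1.0  # guard constant variables
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.mean_) / self.scale_

    def inverse_transform(self, Z):
        """Back-translate scaled values to the original log values,
        as done for the log view."""
        return np.asarray(Z, dtype=float) * self.scale_ + self.mean_
```

Because the mean and scale are frozen after the offline fit, the scaler never depends on future stream values, unlike min/max scaling.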

3.3. Fault Diagnosis Procedure

All techniques work unsupervised, from type-based transformations to dimensionality-based convolutional autoencoders with automatically set thresholds based on a second ground truth dataset. The Flourish dashboard should enable the professional to judge the current situation but also to evaluate the current system behavior and the classification. As a side effect of the unsupervised process, there are always models whose training is not sufficient. Therefore, an error baseline within the autonomous system exists because the insufficiently trained neural networks will report false positives. Consequently, the professional who works with the system has to determine the error baseline. In the prototypical implementation, the professional can determine the error baseline during a time-based walk through the dataset. The production line context error rate will remain within a certain range, and exceeding it hints toward a fault worth exploring (Figure 3). The error baseline depends heavily on the surveyed production line and the employed models. After a suspicious situation is spotted, the professional can dive into the highlighted situation. After a click on one of the context models, the context OD model view is shown (Figure 1, No. 9–10). After the selection, the professional has to evaluate the model's performance (9). Figure 4 presents the three types of model performance: good, bad, and non-functional. The model performance is rated as good if the model remains under the set threshold most of the time and only exceeds it if something unusual happens (1). A sign of a working model is that the values fall under the threshold again after a short period. In order to enable the professional to judge the model, a longer period is displayed in the model performance (in the shown case, around 10 min).
Insufficiently trained models can be distinguished, as their reconstruction error is mostly over the set threshold and shows some spikes (2). Here, the threshold slider can be used to adjust the threshold to see if the spikes fall within the range. Non-functional models return values that, regardless of what is fed to the neural network, highly exceed the threshold (3). Consequently, no reasonable judgment can be made as to whether the values relate to a fault. If a good model is identified, the professional can use the context OD model view (10) to detect something suspicious. In Figure 1, the professional selected the suspicious graph (Figure 1, 11) that describes the press arm position. Here, the press arm hits multiple times, hinting toward a missing relay. In the log view (Figure 1, 12), the real log values confirm the suspicion. The professional may use the live video feed to detect suspicious behavior and see the missing relay for extra validation.

3.4. AI Performance Metrics for Unsupervised Ensembles

An additional information channel for the professional is the AI performance metrics for unsupervised ensembles (Figure 5). In a highly complex Industry 4.0 setting, labeling data is a tedious and time-consuming task. For this reason, the Flourish dashboard and back-end are fully unsupervised. Consequently, no labeled data exist that could be used to calculate standard measures, such as accuracy, precision, and F1-score, to measure the performance of the employed neural network models. Furthermore, the factory environment forms a streaming environment where data are generated constantly. The AI performance metrics for unsupervised ensembles target both challenges. The unsupervised system with all its context OD models can be seen as a joint ensemble with multiple models, where the professional evaluates the ensemble rather than a single function. In this publication, we propose a novel set of metrics that enable comparability of the employed system rather than calculating its correctness. The general idea is that non-functional models are distinguishable. We define a non-functional model as a model in which the threshold t is exceeded by an unreasonable factor n. In our prototype, we set this barrier to n = 100. Therefore, every model that exceeds 100 times the threshold (>100 · t) is considered non-functional. This border can be used to classify a model as non-functional, but it can also be applied on an observation basis. Each context window that is fed to the neural network for classification is an observation. Each situation at a time T can thus be fed to x context OD models, which evaluate x context windows, i.e., x observations. This can be performed for all points in time T in a dataset to obtain the dataset performance DP and over all datasets to retrieve the system performance SP. Figure 5 shows the visualization of the AI performance metrics for unsupervised ensembles. The metrics are presented in a spider diagram (1).
An explanation of the metrics is shown on mouse over (2).
Every metric is developed for use in a streaming environment. Consequently, every metric is a measure over time.
The first metric, model failure over time (3, MoFoT), sets the distinct models Z that produced a non-functional prediction at least once over time in relation to all models N, Equation (1):
MoFoT = |Z| / |N|
Next, the metric combined failure error over time (CFEoT, 4) combines all failure observations F with all error classifications E (exceeding the yellow and red thresholds) and sets the errors in relation to all observations O, Equation (2):
CFEoT = |F + E| / |O|
The medium error over time (MEoT, 5) divides the number of all medium error classifications M by all observations, Equation (3):
MEoT = |M| / |O|,  M ⊆ E
The serious error over time (SEoT, 6), as its counterpart, relates the number of all serious error classifications S to all observations, Equation (4):
SEoT = |S| / |O|,  S ⊆ E
The failure error rate over time (FERoT, 7), in contrast, relates the failure observations to all error classifications, Equation (5), to determine the share of failure classifications among all error classifications:
FERoT = |F| / |F + M + S|,  F, M, S ⊆ E
The observation failure over time (OFoT, 8) relates all failure classifications to all observations, Equation (6), to clarify the share of non-functional classifications among all classified observations:
OFoT = |F| / |O|,  F ⊆ E
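Under the definitions of Equations (1)-(6), the six metrics reduce to simple ratios of counts collected over time. A sketch with illustrative argument names:

```python
def ai_performance_metrics(failed_models, total_models, F, M, S, O):
    """AI performance metrics for unsupervised ensembles, following
    Equations (1)-(6). F = failure (non-functional) observations,
    M = medium and S = serious error classifications, O = all
    observations; failed_models = distinct models with at least one
    non-functional prediction. All arguments are plain counts."""
    E = M + S  # all error classifications (yellow and red thresholds)
    return {
        "MoFoT": failed_models / total_models,  # model failure over time
        "CFEoT": (F + E) / O,                   # combined failure error over time
        "MEoT":  M / O,                         # medium error over time
        "SEoT":  S / O,                         # serious error over time
        "FERoT": F / (F + M + S),               # failure error rate over time
        "OFoT":  F / O,                         # observation failure over time
    }
```

In a streaming deployment, the counts would simply be accumulated per dataset (DP) or over all datasets (SP) before the ratios are taken.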
All AI performance metrics intend to increase the professional's trust in the information provided and, at the same time, ensure comparability. Since the Flourish dashboard is an expert system, professionals constantly evaluate it and the back-end. The professionals observe and evaluate the correctness of the system, where the metrics ensure the comparability needed to notice changes early. For this reason, we compute the metrics in the prototype historically over all datasets, reflecting the system performance SP; on the selected dataset, as the dataset performance DP; and up to the selected point in time T on the dataset, DP≤T. The historical observation of the metrics should unveil the first signs of a worse-performing system as well as problems with the data connection or the training of the models. Unlike the prototype, a production system would base the metrics on all historical data, the day performance, and the shift performance, which respectively translate to SP, DP, and DP≤T. The spider visualization was chosen to help the professional memorize the shape of the historical performance. A day or shift performance that exceeds the shape would be recognized quickly, and countermeasures can be taken accordingly. In the example shown in Figure 5, the share of non-functional classifications among all classified observations over all datasets is under 1% (OFoT · 100), which is an indication of a working system.

3.5. Flourish Visualization

The Flourish visualization (Figure 6) is the central part of the dashboard and the anchor in the context-aware fault diagnosis. The Flourish visualization is the first approach to implement a context-displaying visualization, apart from the first drafts [1,2], and is the namesake of the Flourish dashboard. The visualization is based on the foundations that we previously published [43]. Bellis perennis, the daisy flower, inspired the visualization; examples are given in the top and bottom right of Figure 6. The foundation of Flourish is a radial sunburst chart with custom logic. The context hierarchy [2] (1) splits the SF into smaller surveyable chunks, the contexts [1,2]. Leaves in the hierarchy are displayed in the outer bounds, while the superior levels are drawn toward the center of the visualization. The segmentation of the visualization is oriented toward a daisy flower. The blossom (2) is divided into an outer ring (3) and a center (4), reflecting the CPS context and the production line context. The blossom functions as a traffic light and aggregates the accumulated information about the current situation. The production line context displays the global aggregated error percentage. Both areas, the outer ring and the center, utilize a linearly scaled color scheme from green over yellow to red at the values 0, 0.3, and 1. In contrast, the rest of the visualization is colored binarily. A wider border separates the central information aggregation from the rest of the visualization to reflect the different behavior. Each petal (5) displays the aggregated information about the CPS context (6) and the sub-contexts (7). Each context with a trained OD model is colored. The color reflects the reported status of the underlying OD model. If the first threshold is exceeded, the color is yellow; if the second threshold is exceeded, the context is colored red.
A special case is present if the model is reported as non-functional. In this case, the context is colored purple. If a context OD model is not trainable because insufficient data were recorded, the context stays gray, as the current context cannot be evaluated. The information flow within the visualization is towards the center. The following formulas compute the severity of each segment in the central information aggregation:
ProductionLineContext = (|M| + x · |S|) / |N|,  M, S ⊆ N
CPSContext = (|BM| + x · |BS|) / |B|,  BM, BS ⊆ B ⊆ N
Here, N depicts the set of all OD models of the context hierarchy, M reflects all models that reported a medium error, and S reflects all models that reported serious errors in Equation (7). x is a variable factor to reflect the severity of the serious errors. After field trials, we chose a factor of four, which weights the serious errors more heavily and leads to a better reflection of severity. The same formula applies to the severity of the CPS context (Equation (8)) but is only computed on the basis of the current CPS branch B with medium errors in BM and serious errors in BS (8). Both formulas do not consider non-functional models in the severity computation. The displayed production line error rate percentage is computed similarly (Equation (9)):
ProductionLineErrorRate = (|M| + |S|) / |N| · y,  M, S ⊆ N
The production line context and error rate consider the whole SF. The variable factor y can be adjusted so that the displayed severity suits the analyses of a specific production line; we chose the factor 100 after field trials. Consequently, the production line context also aggregates all CPS contexts (9). The segmentation of the information aggregation allows the professional not only to detect points in time with a high number of fault detections but also to narrow down the direction from which the fault originates. Another benefit of contexts and the context hierarchy is that the artificial boundaries consider hardware and software modules equally. Therefore, the Flourish visualization can pinpoint not only the direction but also the hardware or software module affected by the fault. Consequently, the professional can focus the analysis and their time on certain parts of the SF and need not check the whole SF for the fault's origin.
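Equations (7)-(9) can be sketched directly; the default factors x = 4 and y = 100 follow the field-trial choices above, and the argument names are illustrative counts:

```python
def production_line_context(M, S, N, x=4):
    """Severity of the production line context, Equation (7): medium
    errors M and serious errors S (weighted by x) over all OD models N."""
    return (M + x * S) / N

def cps_context(BM, BS, B, x=4):
    """Severity of one CPS context, Equation (8), computed only on the
    current CPS branch B with its medium (BM) and serious (BS) errors."""
    return (BM + x * BS) / B

def production_line_error_rate(M, S, N, y=100):
    """Displayed error rate, Equation (9); y = 100 maps it to percent."""
    return (M + S) / N * y
```

Non-functional models are excluded beforehand, so they appear in none of the counts passed to these functions.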

4. Evaluation

The evaluation of the Flourish dashboard prototype is based on a workflow model developed in our previous publications [1,2,44]. We described the tasks an analyst has to perform for diagnosis within the SF [1,2] (SF shown in Figure 7) and stated four tasks: exploration, knowledge acquisition, analysis, and reasoning. Later, we defined the fault-handling process in Industry 4.0 [44]. Figure 8 depicts the relation between both publications, the workflow model, and the evaluation. We employ our workflow model to align the evaluation, the tasks, and the sub-sections of the questionnaire (Figure 9). The study is oriented around the presented workflow model. Each fault diagnosis process begins with knowledge acquisition or exploration, followed by fault detection, fault classification, fault prioritization, and finally, fault amendment. The evaluation starts with fault detection, proceeds over the classification, and ends with fault chain detection to narrow down the fault's origin. Each section is segmented into multiple tasks. After finishing a task, a questionnaire section must be answered. In this section, we present the environment (Section 4.1), the user groups (Section 4.2), the contextual faults and the recorded data (Section 4.3), and the structure of the study (Section 4.4), and state the results (Section 4.5).

4.1. Environment

The evaluation of the Flourish dashboard takes place in the smart factory at Darmstadt University of Applied Sciences (DUAS-SF), shown in Figure 7. The DUAS-SF produces electrical relays similar to those used in wind turbines. The high-bay storage holds pallets with the relay parts. The 3-axis robot transfers each pallet onto one of two shuttles. The shuttle system interconnects all stations in a circle. After the placement, the shuttle moves to the assembly station, where the 6-axis assembly robot assembles the relay parts. One station ahead, the pneumatic press ensures the connectivity of all parts after the assembly. Next, the optical and weight inspection checks the relay, and the electrical inspection afterwards tests the relay electronically. If the relay passes, it is stored back in the high-bay storage; otherwise, it is sorted out for manual inspection in the manual inspection bay. Finally, the shuttle is free for a new job. Everything is fully automated to the latest Industry 4.0 standards. An open platform communications unified architecture (OPC UA) model is available for each station. Explanations of OPC UA can be found in [45]. OPC UA is a communication protocol within Industry 4.0; the CPS manufacturer can deliver a model of communication endpoints that can be queried to gather information about the CPS in a standardized format. In addition to the OPC UA models, we employ sensing units to gather additional insights into the smart manufacturing process. A sensing unit consists of up to three sensor groups. Each sensor group consists of multiple sensors: a motion processing unit (gyroscope and accelerometer), a pressure sensor, and a light sensor. Additional information is given in [6]. The sensing units and the OPC UA models are the foundation of the context hierarchy seen in Figure 1 and Figure 6.

4.2. Participants and User Groups

We evaluate the Flourish dashboard with two groups of professionals: junior professionals and domain experts. Statistics of both groups are shown in Table 1. Junior professionals are trained in the automation domain and have initial work experience; they do not yet know the internals of the SF. Domain experts have several years of work experience in the automation domain and have knowledge about the internals of the SF. The group of domain experts consists of five professionals, which is considered sufficient by Nielsen et al. [22]. The group of junior professionals consists of 10 professionals. The groups were chosen to evaluate the Flourish dashboard qualitatively from various perspectives with different technical backgrounds. Both groups are familiar with the DUAS-SF. The domain experts are all employees of the DUAS who work on the development of the SF. The junior professionals' group consists of graduates (6 participants) and undergraduate students (4 participants) with different backgrounds in the automation domain. The students already have work experience in their field, as is common at a university of applied sciences; most study and work in parallel in the smart manufacturing domain. Additionally, 6 foreign students from the international master's program took part in the evaluation. As a consequence, the questionnaire is bilingual: German and English. Both groups answered the same questions after the experiment, but the domain experts were interviewed, whereas the junior professionals answered the questionnaire.

4.3. Data

In order to retrieve valid evaluation results, we re-recorded the contextual faults previously published in [6]. Each dataset was recorded twice. We recorded the data of the OPC UA models and the sensing units: the ground truth (D0, D1) for the evaluation and three fault cases (D2-D7). In total, eight datasets were recorded for the evaluation. At the time of recording, the optical and weight inspection station was out of order due to development on the CPS. Nevertheless, the station's OPC UA model is part of the recording and the context hierarchy. The ground truth encompasses 16 relay assemblies with 8 well-tested and 8 faulty-tested relays. Each fault case mirrors the behavior of the ground truth with eight well-tested and eight faulty-tested relays but, at the same time, introduces the faults. The fault cases are missing parts (D2, D3), missing pressure (D4, D5), and a shuttle drop-out (D6, D7). In the missing parts fault case, a part on the pallet is missing. Consequently, the pallet is tested as faulty and sorted out. In regular production, the remaining parts on the pallet would either be discarded (environmental impact) or manually retested, which employs additional machinery or manual labor. If the fault is recognized, the unaffected parts on the pallet could be reused without additional effort, minimizing environmental impact and costs. In the next fault case (missing pressure), the pressure is regulated down to a critical point and afterwards restored to prevent any damage to the systems. Here, the slow reduction in pressure has to be detected to prevent a production outage. In the third case, the shuttle drop-out has to be detected. If only one of two shuttles operates, the production output is cut in half. If the shuttle drop-out is detected, production efficiency can be restored.
Recordings consist of all OPC UA models, encompassing 33,089 OPC UA variables. We automatically chose 1063 production-relevant OPC UA variables for our recordings (method published in [46]). Table 2 shows the various data types and the respective node count. The majority of nodes are Boolean, followed by byte and integer. It is important to note that some nodes deliver data that contradict their type. For example, Boolean nodes can also cover Boolean arrays, and the string data type can also hold numerical arrays. In contrast, the sensing units comprise only numerical values; see Table 3.
We recorded only data-changed events so as not to overload the bus systems employed in the production line and introduce new faults [6]. We split the pre-processing into an offline and an online procedure. The offline procedure was used to train the neural networks:
  • Split the ground truth dataset D0 according to context hierarchy.
  • Perform pre-processing for each context, e.g., [2,5,6]. One requirement in an SF environment is that each pre-processing step has to be available in a streaming environment.
  • Train the neural networks (Section 3.2), e.g., [2,5].
  • Use the second ground truth dataset D1 to determine the thresholds for each context.
  • Save all models of the offline procedure (pre-processing and neural networks).
In the online procedure, the methodology is now as follows:
  • Add new events to their respective contexts.
  • Apply the scaling with previously saved pre-processing models, e.g., the standard scaler.
  • Employ the saved neural networks to classify the context.
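The online steps above can be sketched as one classification function. The scaler/model interfaces, the threshold names, and the status labels are assumptions; the non-functional boundary n = 100 follows Section 3.4:

```python
import numpy as np

def classify_context(window, model, t_medium, t_serious, scaler=None, n=100):
    """Online classification of one context window: scale with the saved
    pre-processing model, reconstruct with the saved network, and map the
    reconstruction error to a status (illustrative sketch)."""
    if scaler is not None:
        x = scaler.transform(window)  # previously saved standard scaler
    else:
        x = np.asarray(window, dtype=float)
    error = float(np.mean(np.abs(x - model(x))))
    if error > n * t_serious:
        return "non-functional"  # purple in the Flourish visualization
    if error > t_serious:
        return "serious"         # red
    if error > t_medium:
        return "medium"          # yellow
    return "normal"              # no error reported
```

In a live system, this function would run every time a new event is added to a context, whereas for the evaluation the results were computed ahead of time on D2-D7.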
For the evaluation, the computation was completed ahead of time based on the recorded fault cases (D2-D7). In a live system, the contexts would be evaluated after an event is added to the context. Table 4 provides an overview of the scores of the applied AI performance metrics on the datasets (DP) and the system (SP) in percent (Section 3.4). The fact that MEoT is always lower than SEoT hints that the model thresholds should be increased in a production scenario so that MEoT <= SEoT is assured. Sometimes, even in the same fault cases recorded twice, the MoFoT spikes, and so does FERoT at the same time, which hints that the training procedure of the context OD models could be better and may be adjusted in the future. Nevertheless, OFoT is always under 1%, which hints toward a working system, even in edge cases. An additional marker toward a working system is that most datasets operate within the system performance (SP). The SP is a union of all events per metric (e.g., the model failures in MoFoT or the non-functional predictions in OFoT). Therefore, the values for SP can be lower or higher than the single score of a DP.
Each procedure in the Flourish backend is compliant with the streaming environment and can be used on a data stream. For instance, scaling operations that need to determine the max/min values are not applicable because the data are generated constantly, and no assurance can be given that the max value will not change unpredictably in the future. Thus, the choice of scaling operations in an Industry 4.0 SF environment is rather limited.

4.4. Structure, Pre-Study and Limitations

We centered the evaluation around the workflow model (Figure 8). Moreover, the evaluation is a guided user test. The Flourish dashboard is explained, and the participant solves different tasks that originate in the workflow model. All tasks are solvable with the existing background in smart manufacturing. To assure the solvability, a pre-study took place to validate the structure and questions of the study. In the pre-study, a junior professional and a domain expert went through the planned procedure and gave feedback on the questions asked and the structure. We have to limit our work in this case because of the small pool of domain experts in the field. One domain expert took part in the pre-study and the main study. Another limitation is that two domain experts co-authored a publication about the fault cases. Nevertheless, all faults in the evaluation are re-recorded, and the domain experts never saw the data, the Flourish dashboard, or parts of the functionality before the evaluation.
The study starts with an explanation of the context-aware fault diagnosis and details the goals. Then the training session starts (Figure 9, Familiarize Dashboard), where the user interface (UI) and the technique behind it are explained. In Section 3, we explain the UI in the same order as the explanation given to the participants. The explanation is given on the ground truth dataset D0, where thresholds are not set (thresholds are set to −1). Next, the participants are confronted with a normally working system that contains a fault (D7). Now, the participant has the task of determining the error baseline, as models will always report false positives in a fully unsupervised system. After a first estimate of the error baseline is determined (Figure 9, Determine Error Base Line), the participant is asked to find the different model performance types (as shown in Figure 4) (Figure 9, Distinguish Models). The training closes with the question of what the participant would do to determine a fault case and how they would search for a fault (Figure 9, Hypothetical Fault Diagnosis).
After the training, the next phase starts, where the participant obtains a known fault case (D2) with the task of finding one of the four employed contextual faults (Figure 8, Fault Detection) (Figure 9, Find Contextual Fault). Here, the participant has to use the formerly learned capabilities to find the fault in the dataset. Next, the participant has to validate the found fault (Figure 9, Validate Contextual Fault) and perform the classification (Figure 8, Fault Classification). Afterward, the participants receive the task of judging whether the available information is enough to prioritize the fault based on their domain knowledge (Figure 8, Fault Prioritization) (Figure 9, Prioritize fault). Now, the evaluation procedure differs between the two user groups (Figure 9, Find a Fault Chain). The junior professionals are given the task of judging whether the available information is enough to find fault chains with the software, while the domain experts are given the task explicitly (Figure 8, Fault Classification and Fault Amendment). The next task is the same for both groups; however, beforehand, both groups were told to deactivate the live video feed. Each group is told that they are given a random fault case; for better comparability of the qualitative results, the groups receive two different recordings of the same fault (D4, D5) at random (Figure 9, Random Fault Case, and Classify Dataset). Now, the task is to determine the fault based on the data and the available functionality only, without any validation by the live video feed (Figure 8, complete Workflow Model). The participants are given up to 20 min before they have to guess if and what fault is present. At that point in the evaluation, the active part of the study is complete, and the participants are asked to perform a background evaluation of the presented technique (Figure 8 and Figure 9, Additionals).
Finally, the participants are asked to complete the standard questionnaires ISO 9241/10 and the system usability scale (SUS). The questions of the ISO 9241/10 are only an excerpt of all questions and have already been used by Lohfink et al. [33] to evaluate an anomaly detection dashboard. For this reason, we employ the same questions but, contrary to Lohfink et al., align the answers on the 1-5 Likert scale. As a result, one always refers to the negative, while five always refers to the positive. Only the SUS is not altered because of its scientifically validated scoring algorithm; an alteration of the answers or questions would falsify the results. The ISO 9241/10 questionnaire encompasses nine questions and the SUS ten. The junior professionals answered 76 questions, while the domain experts answered 79. The questions can be found in Appendix A.
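Because the SUS questions and answers are not altered, the standard SUS scoring algorithm applies unchanged. A sketch of that well-known computation (odd items are positively worded, even items negatively worded):

```python
def sus_score(answers):
    """Standard System Usability Scale score: ten answers on a 1-5
    Likert scale; odd-numbered items contribute (answer - 1),
    even-numbered items contribute (5 - answer); the sum is scaled
    by 2.5 to a 0-100 range."""
    if len(answers) != 10:
        raise ValueError("SUS requires exactly ten answers")
    total = sum((a - 1) if i % 2 == 1 else (5 - a)
                for i, a in enumerate(answers, start=1))
    return total * 2.5
```

Averaging this score over all participants of a group yields the group scores reported in Section 4.5 (e.g., A: 75.31, E: 84.50).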

4.5. Results

This section analyzes the answers of the junior professionals and domain experts to our custom qualitative questionnaire. The questionnaire encompasses custom questions and questions from the standard questionnaires (ISO 9241/10 and SUS). The domain experts answered the same questions in a guided interview, while the junior professionals filled in the questionnaire. We align this section with the structure of the evaluation (Figure 9). First, we describe the custom questions, the answers, and the possible takeaways backed by the qualitative answers of the participants; here, we depict and summarize only the most common qualitative answers. Next, we present the answers to the standard questionnaires (ISO 9241/10 and SUS). In the end, we depict future improvements and possible research directions.
Before we go into detail, we describe the key findings of the qualitative evaluation. Both groups rate the dashboard equally high, awarding 84% (A) and 83% (E) of the total quantifiable points. At the same time, the domain experts tend to give higher ratings (18 out of 33 questions), which hints that they are more comfortable handling the software. This finding is also backed by the average usability scores (A: 75.31, E: 84.50), which map to the adjective ratings good and excellent [47], the third- and second-best usability ratings possible on the seven-point adjective scale (worst imaginable, ..., best imaginable). The answers to the ISO 9241/10 underpin the usefulness of the software for the task of fault diagnosis. In summary, context-aware fault diagnosis and its current implementation in the Flourish dashboard are positively rated and appropriate for the task of fault diagnosis in a SF. This is further supported by the fault detection and classification success rates in the random fault case (A: 88–90%, E: 100%).

4.5.1. Custom Questionnaire

The questionnaire consists of open questions and qualitative 5-point Likert-scale questions (Appendix A). All quantifiable qualitative answers are summarized in Table 5. The first section (t) was answered after the training took place (Figure 9). Question AQ1 t refers to the ability to determine the error baseline within the system. The error baseline describes the production line context error percentage of the system, which is always present due to misclassifications during run time. Misclassifications may happen due to unsound training, caused, for instance, by insufficient data, an unreliable architecture, or inaccurate parameters. The error baseline forms a threshold: once it is exceeded, the professional starts analyzing the situation. Both groups rate the question equally high (A: 4.2, E: 4.4), with an S and V below 1. The answers add to the point of distinguishability between a faulty and a normal situation. The next question, AQ2 t , asks about the decidability of good, bad, and non-functional OD models. Here, the junior professionals tend to give higher ratings, whereas the domain experts are more concerned (A: 4.3, E: 3.2), with S and V below 1. Nevertheless, the dashboard allows the professional to rate the quality of the current OD models. AQ3 t asks about the perceived complexity of the dashboard, where 1 indicates high complexity and 5 indicates no complexity. Group A tends to rate the dashboard as more complex (A: 3.7), while the domain experts rate the perceived complexity as lower (E: 4.2). This may be explained by the different technical background knowledge about common tooling in the field. Nevertheless, the participants gave the Flourish dashboard a good rating. Asked about the complexity rating (V3), the participants highlighted that the colors and the visualization are well understood, but the navigation and the selection in higher hierarchy levels may be cumbersome.
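The role of the error baseline as an analysis trigger can be illustrated with a minimal sketch. The names and the simple mean-plus-k-standard-deviations rule are our own choices for illustration; the paper does not prescribe this particular threshold rule:

```python
def exceeds_baseline(error_history, current_error, k=3.0):
    """Decide whether the current production line context error warrants analysis.

    The baseline is estimated from historical context error percentages
    (always nonzero due to run-time misclassifications); the situation is
    flagged when the current error exceeds the baseline mean by more than
    k standard deviations.
    """
    n = len(error_history)
    mean = sum(error_history) / n
    variance = sum((e - mean) ** 2 for e in error_history) / n
    std = variance ** 0.5
    return current_error > mean + k * std
```

In the dashboard, the professional performs this judgment visually; the sketch only makes explicit that a persistent baseline exists and that analysis starts when it is clearly exceeded.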
Next, for V4, the question about which part is hard to understand, the model performance was commonly named. V5 refers to the question of what was well understood; the participants point to the Flourish visualization's hierarchy and the coloring connected to the model performance thresholds. Asked about the connection between SF and visualization (V86), the participants highlighted the assertion of involved stations, sections, contexts, and sub-contexts. This is also reflected in AQ4 t , which refers to the understandability of the visual structure of the Flourish visualization. Both groups rate the understandability as high (A: 4.5, E: 4.6). After the training, the rating for navigating the Flourish dashboard (AQ5 t ) is high (A: 4.7, E: 4.6). Moreover, the ratings hint toward working guidance. Finally, the participants are asked (AQ6 t ) if they would find faults with the system. Here, the domain experts are more skeptical (E: 3.8) than the junior professionals (A: 4.2).
The questions were asked twice, once after the training and once after the active part of the evaluation (the random fault case), to obtain a usage delta (d). For better comparability of the delta, the questions are placed in the following line of the table (B, Table 5), despite their ordering in the evaluation (Figure 9). Except for the domain experts in BQ3 d , every question is rated higher. This hints toward learnability and emphasizes the usefulness of the current implementation and the context-aware diagnosis. Additionally, the decreased S and V over all questions support this statement. Significantly, the rating of the domain experts improved in BQ2 d (E: 4.6), the ability to distinguish good, bad, and non-functional OD models, as did the overall potential of finding faults utilizing the Flourish dashboard (BQ6 d , E: 4.6). The junior professionals' ratings are also higher except for BQ3 d (A: 4). The Flourish dashboard and the context-aware fault diagnosis form a tool for a complex task. For group A, the initially perceived complexity was lower than after the evaluation because the minimalist dashboard may hide some complexity from the user. The missing complexity might initially put some domain experts off, but the experience after the fault cases turned the values around.
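The usage delta described above amounts to a per-question difference of mean ratings between the two measurement points. A minimal sketch (illustrative; names are our own):

```python
def rating_delta(after_training, after_evaluation):
    """Per-question delta of mean Likert ratings before/after the active part.

    Both arguments map question IDs (e.g. "Q2") to lists of 1-5 ratings;
    a positive delta indicates an improved rating after using the dashboard.
    """
    deltas = {}
    for qid, before in after_training.items():
        after = after_evaluation[qid]
        deltas[qid] = sum(after) / len(after) - sum(before) / len(before)
    return deltas
```

Applied to Table 5, positive deltas on almost all questions (together with decreased S and V) are what supports the learnability argument.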
The next eight questions refer to finding a fault (f); this questionnaire section was answered after the participants found the contextual fault. After the first completed experience with the Flourish visualization, the next three questions refer to the capability to see the fault dependencies: throughout a CPS part (CQ1 f ; Figure 6, 5), between CPSs (CQ2 f ), and within the whole production line (CQ3 f ). Besides the positive ratings (>3), the values show a decreasing trend toward the production line (CQ1 f [A: 4.3, E: 4.6], CQ2 f [A: 4.1, E: 3.2], CQ3 f [A: 4, E: 3.2]). This is explainable insofar as a dependency may be recognizable in the Flourish visualization only at a specific point in time, whereas some dependencies may unfold over time. In comparison, question CQ4 f asks the participants to rate the understandability of the Flourish visualization. Both groups vote equally high (A: 4.5, E: 5), which indicates that the Flourish visualization is well suited for fault diagnosis; the domain experts unanimously agree on this question and give the highest possible rating. Moreover, question CQ5 f specifically asks how well Flourish reflects the underlying OD model performance with its different thresholds. Both groups have a similar mean (A: 4.5, E: 4.4) and certify a good connection between these two dashboard components. One of the outliers among the questions is CQ6 f (A: 2.8, E: 2.2), which targets the explainability of the model performance through the subsequent context window values (Figure 1, 9 and 10). Currently, only graphs that change more than once in the context window are shown to the user. The higher context levels encompass the information of the subsequent layers; consequently, even the partial presentation may consist of multiple graphs. Here, some professionals wished for additional features to highlight the most anomalous graph compared to its history, which may help to explain the model performance.
The next questions relate to the blossom (DQ1 f ) and the petals (DQ2 f ) of the Flourish visualization. Both questions ask whether the central information aggregation (CPS contexts and production line contexts, DQ1 f ) or the context hierarchy (contexts and subcontexts, DQ2 f ) helps in finding faults. Again, both groups give high ratings (DQ1 f [A: 4.8, E: 4.8], DQ2 f [A: 4.9, E: 4.4]), which underpins the usefulness in finding contextual faults during production. Both groups agree that the production line error percentage, the coloring, and the model performance were quite useful in determining the fault (V16).
After a contextual fault was found, the participants were asked to validate the found fault (v). DQ3 v refers to the difficulty of validating the fault using one or more contexts. Here, the junior professionals (A: 3.9) had more difficulties validating the fault than the domain experts (E: 4.4), which is within expectations; nonetheless, both values are high. Asked what helped to find a valid functional model (V25), both groups agreed on the model performance and the behavior of the reconstruction error exceeding the thresholds. Some mentioned the video feed as an additional information channel to verify the found outlier in the data. DQ4 v explicitly asked about the usefulness of the combination of the model performance, the context window, and the real log values in the validation. The answer 1 indicates that none of them are useful; the answer 5 indicates that all of them are equally useful. In this case, the professionals voted around 4 (A: 4, E: 3.8), which reflects the fact that the aforementioned parts of the dashboard were not used every time during the analysis. For clarification, the subsequent qualitative question (V26) asked about the most helpful part. Both groups agreed on the context hierarchy visualization and the model performance, but for the domain experts, the context window and the real log values had equal value. Some also mentioned that the video feed was beneficial to finally validate the found fault. Neural networks can be considered black boxes, which do not explain the reasoning behind classifications. The Flourish dashboard tries to counteract this fact with transparency, achieved through the visualization of the model performance with certain thresholds, the context window that is used to analyze the current situation, and the back-translation to the original log values. The next three questions target this transparency aspect of the Flourish dashboard.
DQ5 v asks the participants if the transparency helped them trust the provided system result. Both groups agree (A: 4.4, E: 4.6). Additionally, DQ6 v asks if the provided transparency helped in finding a fault; the junior professionals agree (A: 4.3), and the domain experts tend to agree (E: 3.8). The diversity in S and V stems from one domain expert who has no clear tendency in this question. Even the domain experts may not know the entire production line and all stations in depth; information that cannot be properly rated during validation therefore explains the lowest value (1). EQ1 v asks if the transparency helped in validating the fault. Here, both groups agree again (A: 4.2, E: 4.4). A dedicated question (EQ2 v ) targets the trust built by the performance metrics (Figure 5) in the displayed information. Both groups (A: 3.7, E: 3.2) agree that the performance metrics had an impact, but a smaller one than expected. Except for one, the domain experts stated during the interviews that they did not always use the performance metrics in the analysis. Here, some experts suggest an alternative visualization or more interaction possibilities. In contrast, one domain expert states that the performance metrics are interesting, have potential, and should be developed further.
After the validation, the participants are asked whether a fault prioritization would be possible with the displayed information (EQ3 p ). Here, the junior professionals (A: 4.2) agree more than the domain experts (E: 3.8). Through additional tooling, the domain experts would be able to gather more information than the current prototype has to offer, which explains the gap in voting. Now the sections are split per group. The junior professionals are asked if they would be able to detect fault chains (EQ4 c ), while the domain experts have to unveil a fault chain and are asked about the difficulty (EQ5 c ). Consequently, these two questions are not comparable and are marked with (-) in the table (Table 5). The confidence is lower within the group of junior professionals (A: 3.9) than among the domain experts (E: 4.2), which, given the technical backgrounds, is reasonable. Notably, four of the five domain experts could find and validate the fault origin through the found fault chain. For instance, the measured voltage from the electrical inspection was still around zero, even after the electrical inspection arm was in position; the only possible reason is that the relay part is missing from the socket. This circumstance was found through highlighting and navigation to the right context in the Flourish dashboard. Asked about the most helpful parts of the dashboard (V29/V51), the answers are split: the junior professionals focus on the highlighting and the video feedback, while the domain experts weight the live video feed less and focus on the model performance, the context window, and the real log values.
After the task of finding a fault chain, the groups were asked to disable the live video feed and to select another recorded dataset, which was chosen randomly. For better comparison, both groups received the same fault case (D4/D5) of missing pressure, but the datasets alternated among participants. A separate table shows the success rates in the classification task (Table 6); it is similar to Table 5 with one minor change, displaying the success rate instead of the mean. RQ1 r asks the professional if a fault is present in the dataset. Both groups detected a fault equally well (A: 0.9, E: 1). The domain experts found the fault every time and answered accordingly, while 1 out of 10 junior professionals could not. RQ2 r was asked conditionally if RQ1 r was answered with yes. One of the nine remaining junior professionals chose the answer that the dataset encompassed the same fault as before, which was wrong. The others, including the domain experts, chose that the fault must be something new, which was right in this case. The success rate for group A is 88%, while for E it is 100%. Asked what the contextual fault could be in this case, the domain experts all unveiled the missing pressure in the system as the fault's origin, equaling a 100% success rate. The junior professionals, with no detailed background knowledge about the SF, were able to find the right station where the fault could be seen (5 out of 10, 50%). Two of those five found the origin of the fault, the missing pressure (2 out of 10, 20%). Asked about their confidence in their classification (EQ6 r ), both groups were somewhat confident (A: 3.1, E: 3.6), even when the fault origin was found, which may be explained by the complexity of the task and the lack of a second incident to validate the classification further.
A domain expert stated that to achieve higher confidence, the system would have to be used on a daily basis. Asked about what had helped to build trust in the classification, most answered the Flourish visualization with the chosen coloring, the model performance, and the context window.
The final question after the random fault case (FQ1 r ) asked about the impact of the missing information channel (video feed); here, 1 stands for a high impact and 5 for no impact at all. The domain experts voted for a minor impact (E: 3.2), while the junior professionals voted for a higher impact (A: 2.7). Furthermore, the S and V underline the position of the junior professionals: in the known fault case, most of them validated their findings in the data with the video feed, while the domain experts knew the variables in the context window and their meaning.
After the active part, the participants answered the context-aware diagnosis (CAD) section and the two standard questionnaires (ISO 9241/10, SUS). The CAD questions target the current prototype and the concepts of context-aware diagnosis. FQ2 C A D asked for a rating of whether the visualization helped in finding the fault. Both groups agreed on that question (A: 4.7, E: 5.0); the domain experts agreed unanimously. When requested to rate whether the contexts and the context hierarchy helped in finding faults (FQ3 C A D ), both groups also agreed (A: 4.6, E: 5), the domain experts again unanimously. Moreover, asked about the potential of the context-aware diagnosis to uncover the origin of a fault (FQ4 C A D ), both groups agreed highly (A: 4.4, E: 4.6). Finally, the junior professionals and experts were asked to rate the usefulness of the current prototype (FQ5 C A D ). Both groups agreed (A: 4.4, E: 4.2); both values are high, but both groups suggested enhancements during the evaluation, which makes the scores reasonable.
In total, the Flourish dashboard received good to excellent ratings from both groups. The technical background had an impact on certain areas during the evaluation. Nevertheless, both groups could find contextual faults and accomplish each given task; thus, task adequacy was given. Both user groups were able to handle the Flourish dashboard from different perspectives and with various technical backgrounds, and both succeeded in the task. The junior professionals trusted the system classification more and tried to search around a highlighted area, while the domain experts searched for additional indicators in the context window and the back-translated log values. Asked how they would alter the current implementation, the junior professionals favored a more fine-grained time selection and various highlight options to accentuate different areas in the Flourish dashboard, e.g., important graphs in the context window. In contrast, the domain experts tended toward more automatism and novel, even more complex algorithms based on the current features of the prototype. In summary, both groups saw the added value of the context-aware fault diagnosis.

4.5.2. Standard Questionnaires

We included both standard questionnaires (ISO 9241/10, SUS) to prove the task adequacy (ISO 9241/10) and usability (SUS). Both questionnaires have been used in a wide range of Industry 4.0 evaluations (Section 2), enabling comparability between research results. The statements of the ISO 9241/10 display how well the Flourish dashboard performs in fault diagnosis and context-aware fault diagnosis, while the SUS questionnaire evaluates the usability of the current prototypical implementation. A key finding is that the Flourish dashboard is rated well in both questionnaires. Despite the small group sizes, both results hint toward the usefulness of the Flourish dashboard and the context-aware fault diagnosis. Nevertheless, both questionnaires rate the prototype in its current form.
We relied on the same statements as Lohfink et al. [33] to evaluate the prototype. The statements are assigned to the following categories: suitability for the task (V102–V106), controllability (V107), conformity with user expectations (V108), and suitability for learning (V109–V110). In contrast to Lohfink et al. [33], we reverse-coded the answers so that a five is always positive and a one always negative. A benefit is that the attention of the professionals is always kept high; therefore, the statements are read carefully before an answer is given. The answers are shown in Figure 10. Asked whether the implemented functions support the work (V102), both groups answer between 4 and 5 with a median of 5. The next statement (V103) has a wider range of answers, between 3 and 5, and states that too many steps are required to deal with the given task. Here, 5 indicates that the required steps are totally adequate, while 1 indicates far too many steps. The median is higher for the junior professionals (A: 4.5) than for the domain experts (E: 4). The shape of the violin plot shows that the majority of the junior professionals agree that the steps are totally adequate, while the domain experts only tend in this direction. Some domain experts suggested further guidance techniques that should be implemented, which may explain the gap. Statement V104 declares that the software suits all requirements of a fault diagnosis. Both groups agree with a median of 5. However, the domain experts have a wider range of answers, visible in the shape of the plot, because they knew that additional tooling would provide further information around the suspicious variable found through the Flourish dashboard. In contrast, all domain experts agreed (5) that the required commands to perform the work were easy to find (V105).
While the majority of the junior professionals agreed (median of 5), some had issues (4–5, minimum 3). V106 states that the presented information supports the professionals in their work. Both groups agreed on this point, but the domain experts saw more room for advancement (median 4) compared to the junior professionals (median 5). Asked to vote on whether the navigation of the software is adequate (V107), both groups had a wider range (3–5) of answers. Both groups have a median of 4, indicating that the optimum is not reached in the current prototype. The professionals suggested different expansions of the various navigation capabilities, which would explain the gap; the suggestions are discussed in Section 4.5.3. V108 asks whether the results of executing a function are predictable and consistent with expectations. Here, the junior professionals are less certain, with a wider range of answers (3–5, median 4), while the domain experts are confident about the results (4–5, median 5). The last two statements refer to the learnability of the system. The first (V109) states that a long time is needed to learn the software. Again, the answers were reverse-coded, so 5 indicates a short time, while 1 indicates a long time. The majority of the domain experts (4 out of 5) attested that the amount of time needed to learn the software was small (median 5), whereas the junior professionals, with a different technical background, somewhat agreed (median 4) but with a wider variety. The results are within expectations, as the junior professionals had to learn more about the SF, the process, and the employed technologies than the domain experts. Additionally, V110 requests a rating of the re-learnability of the software after a lengthy interruption. Both groups attested the Flourish dashboard good re-learnability (median 5), but the domain experts were more confident (4–5) than the junior professionals (3–5).
Nevertheless, in summary, the median of all answers to the statements is between 4 and 5, with a majority of 5 (11 out of 18), which is a clear hint toward the usefulness, capabilities, and task adequacy of the Flourish dashboard and the context-aware fault diagnosis.
The SUS evaluates usability in a quick fashion. It is chosen to rate the usability of the Flourish dashboard, but also to reveal possible usability issues the professionals might encounter during the evaluation. The answers are shown in Figure 11. Following the work of Bangor et al. [47], both mean scores (A: 75.31, E: 84.50) can be translated to the ratings good (A) and excellent (E); only best imaginable would surpass both ratings on the authors' seven-point scale. The SUS score is computed as follows:
Q_1 = \sum_{i \in M} (s_i - 1), \quad M = \{1, 3, 5, 7, 9\} \quad (10)
Q_2 = \sum_{i \in N} (5 - s_i), \quad N = \{2, 4, 6, 8, 10\} \quad (11)
\mathrm{SUS}_{score} = (Q_1 + Q_2) \cdot 2.5 \quad (12)
Q 1 denotes the statements that are positively phrased (Equation (10)), whereas Q 2 represents the statements where a low score is considered good (Equation (11)). The current score ( s i ) is either reduced by one or subtracted from five. Then, the sum of Q 1 and Q 2 is multiplied by 2.5 to retrieve the final score for one questionnaire (Equation (12)). In the end, we divide the sum of the scores by the number of fully filled questionnaires in the respective group (A: 8, E: 5) to obtain the average SUS score.
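The scoring procedure above can be sketched in a few lines of Python (an illustrative sketch; function and variable names are our own):

```python
def sus_score(answers):
    """Compute the SUS score for one questionnaire.

    `answers` holds the ten Likert ratings (1-5) in statement order.
    Odd-numbered statements (1, 3, 5, 7, 9) are positively phrased
    (score - 1), even-numbered ones negatively phrased (5 - score).
    """
    assert len(answers) == 10 and all(1 <= a <= 5 for a in answers)
    q1 = sum(answers[i] - 1 for i in range(0, 10, 2))  # statements 1, 3, 5, 7, 9
    q2 = sum(5 - answers[i] for i in range(1, 10, 2))  # statements 2, 4, 6, 8, 10
    return (q1 + q2) * 2.5  # scales the 0-40 raw sum to 0-100


def mean_sus(questionnaires):
    """Average the SUS scores over all fully filled questionnaires."""
    scores = [sus_score(q) for q in questionnaires]
    return sum(scores) / len(scores)
```

For example, the best possible answers (5 on every odd statement, 1 on every even statement) yield a score of 100, and all-neutral answers (3 throughout) yield 50.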
Upfront, we have to qualify our SUS evaluation for two reasons. First, the group sizes are small; for a more reliable result, the group sizes, especially for the SUS, would have to be larger. Moreover, our evaluation is only qualitative, which should be kept in mind. The reason is that in the evaluated domain of smart manufacturing and Industry 4.0, professionals with a technical background in the employed technology are scarce. Nevertheless, the group sizes for the completed evaluation are reasonable (Section 2) and, in the case of the domain experts, exceed the standards. The second limitation is that, through a technical issue, the sixth question was not present during some sessions; this was fixed once noticed. As a result, some answers had to be collected through a subsequent request. For the computation of the mean SUS score, only fully filled questionnaires were used (A: 8, E: 5). However, this limitation affects the evaluation in a minor way: for an evaluation of a qualitative nature, the success relies on hints toward possible issues rather than on the computed SUS score.
The first statement, V111, declared that the professionals would use the software frequently. Both groups have a median of 5 and agreed with the statement, but one domain expert would also use additional tooling, e.g., debugging, for fault diagnosis, which exceeds the capabilities of the Flourish dashboard. The vote on V112, which states that the system is unnecessarily complex, is clear (A: 2, E: 1). All domain experts were aware of the task complexity of an Industry 4.0 fault diagnosis, and the Flourish dashboard is not complex compared to other systems in the field. The next statement (V113) elaborates whether the system is easy to use. Here, the medians of both groups agree (A: 4, E: 5), whereas one domain expert gave a 2. During the evaluation, multiple participants stumbled over this statement, which may hint at misleading wording; therefore, the one vote may be seen as an outlier. Without this outlier, the majority thought the system was easy to use, while the domain experts, with their additional technical background knowledge, tended toward higher ratings (E: 4–5). The junior professionals had a wider range of votes (A: 3–5); the shape of the violin plot underpins this statement. Moreover, the violin plot of the next statement (V114) supports the former finding. Here, the participants judged whether they would need the support of a technical person to use the system. The junior professionals, with limited background knowledge of the SF, had issues classifying and utilizing the variable names in the context window, which explains the variance in answers from 1 to 5. The domain experts had no issues (E: 4–5). Furthermore, V115 states that the functions of the system are well integrated. The domain experts agreed with a median of 5, whereas some junior professionals were neutral and only somewhat agreed with a median of 4. However, the majority of group A was between 4 and 5, which also tends toward agreement.
Furthermore, the statement V48/V29 evaluates the perceived inconsistency in the system. Most professionals did not perceive much inconsistency (A: 1.5, E: 1); the domain experts especially felt that the Flourish dashboard is very streamlined. The next statement enforces a judgment about the learnability of the system (V116). Both groups agreed that the system can be learned quickly (A: 4, E: 4), with a variety toward 5. The following statement (V117) expresses that the system is cumbersome to use, with which the junior professionals strongly disagreed and the domain experts disagreed (A: 1, E: 2). Asked to rate their confidence in using the system, both groups agreed (A: 4, E: 4), tending toward strong agreement. Finally, the last statement asked about the amount that had to be learned upfront to get going with the system (V119). As the context-aware fault diagnosis and the Flourish dashboard are novelties for the audience, this rating weighs more for the evaluation and indicates how much the experienced system varies from traditional fault diagnosis methods. Here, the median of the junior professionals (A: 2.5) indicates disagreement, but the majority oscillates between 2 and 4, with answers spanning the whole scale (1–5). This is explainable by the technical background of the professionals and their various qualifications, which range from pre-graduate to post-graduate students. Nonetheless, the values indicate that onsite professionals, e.g., operators and maintenance teams, need proper training before the system can be employed efficiently. In contrast, the domain experts disagreed (E: 2), with a variety spanning from 1 to 3, i.e., from neutral to strong disagreement. For this reason, domain experts may be the preferable end-user group. Nevertheless, domain experts also need to learn the concepts of the context-aware diagnosis to use the Flourish dashboard to its full potential.

4.5.3. Future Research Directions and Improvements

Future research directions and improvements were part of the CAD section and target issues in or enhancements of the current implementation. Both groups were encouraged to suggest three alternatives or enhancements to the current prototype; in this section, we present the most commonly stated ones. Currently, the time slider of the Flourish dashboard can be imprecise when jumping to the exact position of interest, e.g., 100 ms away from the current point. Additionally, the prototype has no bidirectional navigation between the model performance and the time slider. A majority of the users suggested improvements for the time slider, such as an input field for the exact timestamp to navigate to. Some would navigate step-wise using the arrow keys on the keyboard, equivalent to plus and minus buttons above the time slider for step-wise jumps through the data. Many users also tried to click on the point of interest in the model performance during the evaluation to jump to this point in time and opted for a bidirectional relationship between both. One suggested a play button to move through time that can be paused on areas of interest, so the mind is free to observe the updates in the Flourish visualization. A junior professional wanted to visualize the production line error as a graph above the time slider to quickly find points of interest and possible fault areas in the production line. This would enable the professional to model the error baseline more quickly in their mind, and faults should be found more easily over time. Another suggestion in the same direction was to jump between the most outlying production line errors to evaluate those points first. A small quality-of-life change would be to allow the professionals to annotate the variables in the context window with custom, more meaningful names, so a variety of user groups can use the system.
Others suggested an advanced mode for the context window that shows a broader area than the normal context window size used during training, or shows all variables instead of only the changing ones. Another professional suggested a mode for remembering a sequence for comparison at a later point in time. This would allow comparing context windows that may be seen as abnormal to previously declared normal behavior. Additionally, a domain expert would be able to classify and store normal behavior over time, which the junior professionals may use in their classification. Furthermore, a visualization of the trace of a chosen context variable was suggested to identify contexts that may also be affected by this variable, so a direct comparison between the involved context models may hint toward a relationship between the variables in the data. Additionally, an automatic approach to classifying good, bad, and non-functional models over another set threshold, a time span, was suggested, so that a model is classified as non-functional if its error rises constantly over a specific period. A domain expert suggested developing algorithms based on the performance metrics in conjunction with the production line error. Here, an additional traffic light on top of the Flourish visualization and the radar of the performance metrics could simplify the dashboard even more: if the combined value rises over a certain threshold, the traffic light changes color. Multiple algorithms could be employed, first to compute a value from both and second to find a reasonable threshold. Besides the quality-of-life alterations, the professionals suggested a wide variety of changes and saw potential for further automation, which opens the door for further research based on contexts and context hierarchies, further development of the Flourish visualization, and context-aware diagnosis in general.

5. Conclusions

We presented the Flourish dashboard and the eponymous visualization Flourish. They facilitated finding faults in realistic scenarios by displaying contexts and context hierarchies. Additionally, we presented novel AI performance metrics that enable the evaluation of unsupervised ensembles on historical behavior. These metrics can be used with techniques other than neural networks. We also shared information about the employed techniques, e.g., the pre-processing steps and the neural network configuration. Moreover, we explained the Flourish dashboard and its capabilities in depth. Along with the Flourish dashboard, we provided an example of guidance during fault diagnosis from top to bottom. Likewise, we gave an example of an explainable integration of black-box AI in an Industry 4.0 fault diagnosis dashboard. Transparency allows the professional to judge the decisions taken by the neural network. Therefore, we gave an example of AI integration in which the human is still in charge of classifying the final situation. Notably, the professional can judge the employed OD model performance and distinguish between good, bad, and non-functional OD models over time. Therefore, we allow humans to use AI systems without understanding the underlying technology. Additionally, with the employed visualization, the professional is able not only to find and classify a fault, but also to determine the direction of the fault's origin, which speeds up the fault diagnosis process. As the context encompasses information from software and hardware modules, the professional may narrow down the fault's origin even further. In addition, we released a workflow model based on our former research, which can also serve other evaluations in Industry 4.0 research. Moreover, we gave an example of how to translate the workflow model into an evaluation structure, tasks, and, finally, a questionnaire.
The questionnaire is also an example of combining a custom questionnaire with standard questionnaires. While the custom questionnaire judges the performance of the evaluated software in the given task, the standard questionnaires assure comparability among research results. The questionnaires and interviews were bilingual; thus, we gave an example of a bilingual evaluation in the Industry 4.0 domain. With the evaluation, we demonstrated and validated that contextual faults can be found with Flourish and the Flourish dashboard. In the end, we provided an example of a context-aware fault diagnosis and of how contexts and the context hierarchy can be used during a fault diagnosis process.
The core finding of the evaluation is that context-aware fault diagnosis allows professionals with different technical backgrounds and knowledge levels to find faults in a complex Industry 4.0 environment using the Flourish dashboard. Both junior professionals and domain experts judged the Flourish dashboard equally high (A: 84%, E: 83%) across all quantifiable qualitative questions. In the evaluation, the domain experts tended to rate the Flourish dashboard higher (18/33 questions), which hints that the domain experts are more comfortable handling the software. Both chosen standard questionnaires support the former point. In the ISO 9241/10 questionnaire, 11 out of 18 medians are between 4 and 5, which is a clear hint toward the task adequacy of the Flourish dashboard and the context-aware fault diagnosis. Moreover, the Flourish dashboard received good usability ratings (A: 75.31, E: 84.5). According to Bangor et al. [47], these values translate to good and excellent on their seven-point scale; only best imaginable would surpass both ratings. In the standard questionnaire, the domain experts rated the Flourish dashboard higher. Therefore, tendencies exist that domain experts could be the more suitable end-user group. Nevertheless, the questionnaires unveil that both groups need training to use the context-aware fault diagnosis and the Flourish dashboard to their full potential. Junior professionals may need deeper and longer training on the software than domain experts. Nonetheless, the domain experts and the junior professionals provided high ratings throughout the evaluation, which hints toward the usefulness and capabilities of the Flourish dashboard and context-aware fault diagnosis.
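The reported usability values (A: 75.31, E: 84.5) are on the 0-100 scale of the System Usability Scale, whose ten items appear in the appendix. As a reminder, the standard SUS scoring scheme looks as follows; this is the generic published formula, not the authors' analysis code.

```python
def sus_score(responses):
    """Standard SUS scoring: ten items rated 1-5; odd items contribute
    (rating - 1), even (negatively worded) items contribute
    (5 - rating); the summed contributions are scaled by 2.5
    onto a 0-100 score."""
    assert len(responses) == 10
    total = 0
    for i, rating in enumerate(responses, start=1):
        total += (rating - 1) if i % 2 == 1 else (5 - rating)
    return total * 2.5

sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1])  # best possible answers -> 100.0
```

A neutral response of 3 on every item yields exactly 50.0, which is why SUS values in the 70s and 80s, as reported here, indicate above-average usability.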
The evaluation unveils two points of improvement. First, future work should target the explainability of the model performance (CQ6f, A: 2.8, E: 2.2) and tooling, e.g., advanced highlighting. Second, the integration of the performance metrics should be developed further to not only achieve the goal of comparability, but also increase trust in the displayed information (EQ2v, A: 3.7, E: 3.2). In contrast, the evaluation through the delta usage indicates good learnability and cogency of the Flourish dashboard. Significantly, the domain experts rated the ability to find faults with the system higher after the fault cases (AQ6t to BQ6d, E: 3.6 to 4.6). The prototype also exceeded fault detection and classification performance expectations, tested through the random fault case. Both groups had a high success rate (A: 88-90%, E: 100%). At the same time, all domain experts found not only the fault, but also the fault's origin. The junior professionals with limited background knowledge identified the right station (50%) and the origin of the fault (20%). This is reflected by the confidence in the potential to find the fault origin (FQ4CAD, A: 4.4, E: 4.2) through the Flourish dashboard and the context-aware fault diagnosis. Significantly, the Flourish visualization was highly rated (FQ2CAD, A: 4.7, E: 5.0), as was the usefulness of the contexts and the context hierarchy (FQ3CAD, A: 4.4, E: 4.6). Despite the good ratings, future work has to target techniques to increase the confidence in the classification: notwithstanding their success rate, both groups rated their confidence in their classification relatively low compared to their actual success (EQ6r, A: 3.1, E: 3.6). In the end, a technical limitation of the current prototypical end-to-end implementation should also be mentioned. The currently employed autoencoder neural networks can only learn and detect outliers present in the data. Faults that do not affect the inspected data remain undetected, and other OD techniques for context surveillance should also be considered in this case (examples in [1,2]).
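This limitation can be illustrated with a minimal reconstruction-error sketch. Instead of the article's autoencoders, a trivial mean-reconstruction stand-in is used here; the mechanism is the same: a fault is only flagged if it raises the reconstruction error of the monitored signals.

```python
import numpy as np

rng = np.random.default_rng(0)
# Normal behavior of three monitored signals.
train = rng.normal(0.0, 0.1, size=(500, 3))

# Stand-in "model": reconstruct every sample with the training mean
# (an autoencoder would learn a richer reconstruction).
mean = train.mean(axis=0)

def recon_error(x):
    """Reconstruction error = distance to the learned reconstruction."""
    return np.linalg.norm(x - mean, axis=-1)

# Threshold from the training error distribution (mean + 3 std).
errs = recon_error(train)
threshold = errs.mean() + 3 * errs.std()

# A fault that shifts a monitored signal is detected ...
visible_fault = np.array([1.0, 0.0, 0.0])
# ... but a fault affecting only unmonitored signals looks normal.
hidden_fault = np.array([0.0, 0.0, 0.0])
```

Here `recon_error(visible_fault)` exceeds the threshold while `recon_error(hidden_fault)` does not, mirroring why faults outside the inspected data remain undetected.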
Future work can be assigned to distinct parts of the Flourish dashboard and can be split into quality-of-life (QoL) enhancements and future research directions. QoL enhancements are not directly research-relevant, but improve the convenience for the professionals, for instance, advances to the time slider to navigate through time more precisely or the establishment of a bidirectional relationship with the model performance. Another approach highlights the traces of variables throughout the involved contexts. Another example is the ability to remember specific values of the context window and overlay them over another context window to compare the situations. In addition to convenience, some QoL enhancements have research potential, for instance, automatically finding a past context window to compare with the values of the current situation. The received feedback also contains research directions. For example, an algorithm could combine the production line context error percentage with the performance metrics to build a more advanced version of the situation aggregation. Another example is suggesting sections of the timeline that are relevant for analyzing faults. Another research question is how to automatically unveil how the context graphs in the context window contribute to the model performance; here, the graphs with the highest contribution probability could be highlighted. Developing novel algorithms to classify good, bad, and non-functional OD models is also an opportunity. Consequently, a future research direction is the development of algorithms based on context surveillance and outlier detection. Algorithms for automated context building should be further researched; likewise, algorithms based on context hierarchies have potential and should be further investigated. Another research direction is visualizing contexts and context hierarchies for fault origin tracing. Particularly in larger smart factories, advances in visualizations must be made to present even more information in dense space.
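The proposed variable trace through the context hierarchy could, for instance, be realized as a simple traversal that collects every context surveying a chosen variable. The hierarchy structure and context names below are hypothetical illustrations, not the prototype's data model.

```python
# Hypothetical context hierarchy: each context lists its surveyed
# variables and its child contexts (names are illustrative).
hierarchy = {
    "production_line": {"vars": ["line_error"], "children": ["press", "oven"]},
    "press":           {"vars": ["pressure", "temp"], "children": []},
    "oven":            {"vars": ["temp", "power"], "children": []},
}

def trace_variable(hierarchy, var, root="production_line"):
    """Collect all contexts below `root` that survey `var`; shared
    variables hint at relationships between the involved context
    models and thus at candidate fault-propagation paths."""
    hits, stack = [], [root]
    while stack:
        ctx = stack.pop()
        if var in hierarchy[ctx]["vars"]:
            hits.append(ctx)
        stack.extend(hierarchy[ctx]["children"])
    return sorted(hits)

trace_variable(hierarchy, "temp")  # -> ['oven', 'press']
```

Highlighting the returned contexts in the Flourish visualization would let a professional compare their context models directly, as suggested in the feedback.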
To conclude, the Flourish dashboard and the eponymous visualization Flourish hold great potential for future research. Significantly, context-aware fault diagnosis is only at the beginning of unveiling its potential. All distinct parts of context-aware diagnosis and the prototype in the form of the Flourish dashboard should be considered living and open to future research, welcoming contributions to the vital research field of context-aware diagnosis.

Author Contributions

Conceptualization, L.K.; methodology, L.K. and K.N.; software, L.K.; validation, L.K., K.N. and B.H.; formal analysis, L.K.; investigation, L.K.; resources, L.K.; data curation, L.K.; writing—original draft preparation, L.K.; writing—review and editing, L.K., K.N. and B.H.; visualization, L.K.; supervision, K.N. and B.H.; project administration, L.K.; funding acquisition, B.H. All authors have read and agreed to the published version of the manuscript.

Funding

We acknowledge support by the Open Access Publication Funds of the Hochschule Darmstadt.

Informed Consent Statement

The data collection was prepared within known ethical guidelines. The purpose of the interview/questionnaire was communicated beforehand and at the start of each session. Before any audio was recorded, the permission of the participants was obtained. The participants were assured that the audio files would not be uploaded to any server and would only be stored locally. Moreover, the audio transcripts would only be used to answer the research objectives anonymously and in a confidential manner.

Acknowledgments

We want to thank all the professionals who volunteered for the study and spent their precious time on the evaluation.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. Questionnaire

Table A1. Custom questionnaire with table codes (Table 5 and Table 6), the group assignment, the questions/statements, and the answer possibilities.
Table Code | Tool Code | Group Assignment | Question/Statement | Answer Possibility
AQ1t | V33 | A/E | Errors in an unsupervised system are always present. How easy was it to determine the error baseline?/Fehler sind in einem unüberwachten System immer präsent. Wie schwierig war es den Fehlerbasiswert zu ermitteln? | Rate from 1 = not easy/nicht einfach zu ermitteln to 5 = very easy/sehr einfach zu ermitteln
AQ2t | V34 | A/E | How easy was it to distinguish between good, bad, and non-functional context models?/Wie einfach war es zwischen guten, schlechten und nicht funktionierenden Modellen zu unterscheiden? | Rate from 1 = not easy/nicht einfach zu unterscheiden to 5 = very easy/sehr einfach zu unterscheiden
AQ3t | V1 | A/E | Rate the complexity of the dashboard./Bewerte die Komplexität des Dashboards. | Rate from 1 = very complex/sehr komplex to 5 = no complexity at all/keinerlei Komplexität
- | V3 | A/E | Why did you give this rating? (optional, 1–2 sentences)/Warum bewerten Sie so? (optional, 1–2 Sätze) | free form
- | V4 | A/E | What is hard to understand? (optional, 1–2 sentences)/Was ist schwer zu verstehen? (optional, 1–2 Sätze) | free form
- | V5 | A/E | What is good to understand? (optional, 1–2 sentences)/Was ist gut zu verstehen? (optional, 1–2 Sätze) | free form
AQ4t | V2 | A/E | Is the visual structure of the central visualization understandable?/Ist die visuelle Struktur der zentralen Visualisierung verständlich? | Rate from 1 = not understandable/nicht verständlich to 5 = fully understandable/vollkommen verständlich
- | V86 | A/E | Can a connection be seen between the smart factory and the visualization? How and why? (optional, 1–2 sentences)/Ist es möglich eine Verbindung zwischen der Visualisierung und dem Aufbau der smarten Fabrik zu erkennen? Wie und Warum? (optional, 1–2 Sätze) | free form
AQ5t | V6 | A/E | Is the navigation throughout the dashboard clear?/Ist die Navigation durch das Dashboard klar? | Rate from 1 = not clear/unklar to 5 = fully clear/vollständig klar
AQ6t | V7 | A/E | After the first training, how good would you estimate finding faults with the system?/Nach dem ersten Training. Wie gut würden Sie einschätzen mit dem System Fehler zu finden? | Rate from 1 = I would not find any faults/Ich würde keine Fehler finden to 5 = I definitely would find faults/Ich würde garantiert Fehler finden
CQ1f | V9 | A/E | How well are dependencies visible throughout a specific machinery context hierarchy?/Wie gut sind Abhängigkeiten durch die spezifische Maschinenkontext-Hierarchie sichtbar? | Rate from 1 = cannot see any dependencies/Ich erkenne keine Abhängigkeiten to 5 = the dependencies are very clear to detect/die Abhängigkeiten sind sehr klar zu erkennen
CQ2f | V8 | A/E | How well are inter-machinery dependencies visible throughout the context hierarchy?/Wie gut sind inter-Maschinen Abhängigkeiten durch die Kontext-Hierarchie sichtbar? | Rate from 1 = cannot see any dependencies/Ich erkenne keine Abhängigkeiten to 5 = the dependencies are very clear to detect/die Abhängigkeiten sind sehr klar zu erkennen
CQ3f | V10 | A/E | How well are dependencies in the whole production line visible throughout the context hierarchy?/Wie gut sind Abhängigkeiten in der Produktionslinie durch die Kontext-Hierarchie sichtbar? | Rate from 1 = cannot see any dependencies/Ich erkenne keine Abhängigkeiten to 5 = the dependencies are very clear to detect/die Abhängigkeiten sind sehr klar zu erkennen
CQ4f | V11 | A/E | How understandable is the centered information aggregation (traffic light) for the machinery contexts and the production line context?/Wie verständlich ist die zentrale Informationsaggregation (Ampel) für die Maschinenkontexte und den Produktionslinienkontext? | Rate from 1 = I do not understand the information aggregation/Ich verstehe die Informationsaggregation nicht to 5 = it is clear how the information aggregation works/Es ist vollkommen klar wie die Informationsaggregation funktioniert
CQ5f | V12 | A/E | How well is the model performance reflected by the central visualization?/Wie gut reflektiert die zentrale Visualisierung die Modelperformance? | Rate from 1 = I cannot see a relationship between visualization and model performance/Ich kann keine Beziehung zwischen Visualisierung und Model Performance erkennen to 5 = the relationship is clearly given and understandable/Die Beziehung ist klar gegeben und ist verständlich
CQ6f | V13 | A/E | How well is the model performance explainable by the surveyed window values?/Wie gut ist die Model Performance erklärbar durch die überwachten Fensterwerte? | Rate from 1 = model performance cannot be explained/Die Model Performance kann nicht erklärt werden to 5 = model performance is explainable by the surveyed values/Die Model Performance ist vollständig erklärbar anhand der beobachteten Werte
DQ1f | V14 | A/E | Is the central information aggregation (traffic light) a help in finding faults?/Ist die zentrale Informationsaggregation (Ampel) eine Hilfe im Finden von Fehlern? | Rate from 1 = no, not at all/nicht im geringsten to 5 = yes, it helps me to quickly find faults/ja es hat mir geholfen Fehler schnell zu finden
DQ2f | V15 | A/E | Is the visualization of contexts and the context hierarchy a help in finding faults?/Ist die Visualisierung der Kontexte und der Kontext-Hierarchie eine Hilfe im Finden von Fehlern? | Rate from 1 = the visualization does not help/Die Visualisierung hilft nicht to 5 = the visualization definitely helps to find faults/Die Visualisierung hilft Fehler zu finden
- | V16 | A/E | What parts of the dashboard helped to identify the fault? (optional, 1–2 sentences)/Welche Teile des Dashboards haben geholfen den Fehler zu identifizieren? (optional, 1–2 Sätze) | free form
DQ3v | V17 | A/E | How difficult was it to validate the fault using the model performance from one or more contexts?/Wie schwierig war es den Fehler anhand der Model Performance von einem oder mehreren Kontexten zu validieren? | Rate from 1 = very difficult/sehr schwer to 5 = very easy/sehr einfach
- | V25 | A/E | Multiple neural networks are used for detection. How did you find a valid functional model? (optional, 2–3 sentences)/Mehrere neuronale Netzwerke werden zur Detektion benutzt. Wie haben Sie ein valides funktionierendes Model gefunden? (optional, 2–3 Sätze) | free form
DQ4v | V18 | A/E | Is the visualization of the real log values, together with the surveyed values and the model performance, a help in validating the fault?/Ist die Visualisierung der realen Log-Daten, zusammen mit den überwachten Werten und der Model Performance eine Hilfe um den Fehler zu validieren? | Rate from 1 = no help/keine Hilfe to 5 = all three are equally helpful/alle drei sind gleichverteilt hilfreich
- | V26 | A/E | What helped the most and why? (optional, 2–3 sentences)/Was hat dabei am meisten geholfen und warum? (optional, 2–3 Sätze) | free form
DQ5v | V19 | A/E | Has the transparency in analysis (model performance, surveyed values, log values, metrics) increased the trust in the system results?/Hat die Transparenz in der Analyse (Model Performance, überwachte Werte, Log-Daten, Metriken) das Vertrauen in das System erhöht? | Rate from 1 = no help to trust the system/Es war keine Hilfe dem System zu vertrauen to 5 = very helpful for the trust level/Es hat sehr geholfen für das Vertrauen
DQ6v | V20 | A/E | Has the transparency in analysis (model performance, surveyed values, log values, metrics) helped to find the context fault?/Hat die Transparenz in der Analyse (Model Performance, überwachte Werte, Log-Daten, Metriken) geholfen den Kontext-Fehler zu finden? | Rate from 1 = transparency did not help/Transparenz hat nicht geholfen to 5 = transparency did help a lot/Transparenz hat sehr geholfen
EQ1v | V21 | A/E | Has the transparency in analysis (surveyed values, log values, metrics) helped to validate the model performance?/Hat die Transparenz in der Analyse (Model Performance, überwachte Werte, Log-Daten, Metriken) geholfen die Model Performance zu validieren? | Rate from 1 = transparency did not help/Transparenz hat nicht geholfen to 5 = transparency did help a lot/Transparenz hat sehr geholfen
EQ2v | V22 | A/E | Has the visualization of the performance metrics (history, day, shift) helped to increase the trust in the displayed information?/Hat die Visualisierung der Performancemetriken (Historisch, Tages- und Schicht-bezogen) geholfen das Vertrauen in die dargestellten Information zu erhöhen? | Rate from 1 = visualization did not help/Die Visualisierung hat nicht geholfen to 5 = visualization did help a lot/Die Visualisierung hat sehr geholfen
- | V28 | E | Why did you give this rating? (optional, 1–2 sentences)/Warum bewerten Sie so? (optional, 1–2 Sätze) | free form
EQ3p | V23 | A/E | Suppose you have to make a guess. Is the displayed information enough to prioritize the fault?/Wenn Sie raten müssten. Sind die dargestellten Informationen genug um einen Fehler zu priorisieren? | Rate from 1 = the displayed information is not enough/Die dargestellte Information ist nicht genug to 5 = the displayed information is enough/Die dargestellte Information ist ausreichend
- | V24 | A/E | What parts of the dashboard would help you to prioritize the fault and why? (optional, 2–3 sentences)/Welche Teile des Dashboards würden helfen den Fehler zu priorisieren und warum? (optional, 2–3 Sätze) | free form
EQ4c | V28 | A | After your first usage, would you be able to find a fault chain with enough time?/Nach der ersten Benutzung. Wäre es Ihnen möglich mit genug Zeit eine Fehlerkette zu finden? | Rate from 1 = no, not at all/nein nicht im geringsten to 5 = I would definitely find a fault chain/Ich würde definitiv eine Fehlerkette finden
- | V29 | A | What parts of the dashboard would be helpful for finding a fault chain if you have to make a guess, and why? (optional, 2–3 sentences)/Welche Teile des Dashboard wären hilfreich eine Fehlerkette zu finden, wenn Sie raten müssten und warum? (optional, 2–3 Sätze) | free form
EQ5c | V48 | E | If you could identify a chain of faults, how difficult was it to find the fault chain?/Wenn Sie eine Fehlerkette identifizieren konnten, wie schwer war es die Fehlerkette zu finden? | Rate from 1 = very difficult/sehr schwer to 5 = the identification was easy/die Identifikation war einfach
- | V49 | E | What fault chain was found? (optional, 2–3 sentences)/Welche Fehlerkette wurde gefunden? (optional, 2–3 Sätze) | free form
- | V50 | E | How did you find and validate the fault chain? (optional, 2–3 sentences)/Wie haben Sie die Fehlerkette gefunden und validiert? (optional, 2–3 Sätze) | free form
- | V51 | E | What parts of the dashboard helped to find fault chains and why? (optional, 2–3 sentences)/Welche Teile des Dashboards waren eine Hilfe um die Fehlerketten zu finden und warum? (optional, 2–3 Sätze) | free form
RQ1r | V44 | A/E | Is a fault present in the dataset?/Ist ein Fehler im Datensatz? | yes/ja, no/nein, I cannot say/Kann ich nicht sagen
RQ2r | V27 | A/E | What fault is present in the dataset?/Was für ein Fehler ist im Datensatz? | Missing parts/Fehlendes Bauteil, Must be something else/Muss etwas anderes sein, No fault present/Kein Fehler vorhanden
- | V31 | A/E | If you had to guess, what is the contextual fault and why? (1–2 sentences)/Wenn Sie raten müssten, was ist der vorhandene Kontext-Fehler? (1–2 Sätze) | free form
- | V30 | A/E | What helped to identify the dataset? (1–2 sentences)/Was half Ihnen den Datensatz zu identifizieren? (1–2 Sätze) | free form
EQ6r | V83 | A/E | How confident are you that you have identified the dataset right?/Wie sicher sind Sie, dass Sie den Datensatz richtig identifiziert haben? | Rate from 1 = low confidence/gar nicht sicher to 5 = high confidence/sehr sicher
- | V45 | A/E | What parts of the dashboard have helped to build the trust in your identification? (1–2 sentences)/Welche Teile des Dashboards haben Ihnen geholfen Vertrauen in Ihre Identifikation aufzubauen? (1–2 Sätze) | free form
FQ1r | V32 | A/E | What was the impact of the missing video information channel?/Was hatte der fehlende Video-Informationskanal für einen Einfluss? | Rate from 1 = had high impact/hatte einen sehr großen Einfluss to 5 = no impact, the dataset could be identified with the other information sources/Keinerlei Einfluss. Der Datensatz konnte mit den anderen Datenquellen identifiziert werden
BQ1d | V35 | A/E | Errors in an unsupervised system are always present. How easy was it to determine the error baseline?/Fehler sind in einem unüberwachten System immer präsent. Wie schwierig war es den Fehlerbasiswert zu ermitteln? | Rate from 1 = not easy/nicht einfach zu ermitteln to 5 = very easy/sehr einfach zu ermitteln
BQ2d | V36 | A/E | How easy was it to distinguish between good, bad, and non-functional context models?/Wie einfach war es zwischen guten, schlechten und nicht funktionierenden Modellen zu unterscheiden? | Rate from 1 = not easy/nicht einfach zu unterscheiden to 5 = very easy/sehr einfach zu unterscheiden
BQ3d | V87 | A/E | Rate the complexity of the dashboard./Bewerte die Komplexität des Dashboards. | Rate from 1 = very complex/sehr komplex to 5 = no complexity at all/keinerlei Komplexität
- | V88 | A/E | Why did you give this rating? (optional, 1–2 sentences)/Warum bewerten Sie so? (optional, 1–2 Sätze) | free form
- | V89 | A/E | What is hard to understand? (optional, 1–2 sentences)/Was ist schwer zu verstehen? (optional, 1–2 Sätze) | free form
- | V90 | A/E | What is good to understand? (optional, 1–2 sentences)/Was ist gut zu verstehen? (optional, 1–2 Sätze) | free form
BQ4d | V91 | A/E | Is the visual structure of the central visualization understandable?/Ist der visuelle Aufbau der zentralen Visualisierung verständlich? | Rate from 1 = not understandable/nicht verständlich to 5 = fully understandable/vollkommen verständlich
- | V92 | A/E | Can a connection be seen between the smart factory and the visualization? How and why? (optional, 1–2 sentences)/Ist es möglich eine Verbindung zwischen der Visualisierung und dem Aufbau der smarten Fabrik zu erkennen? Wie und Warum? (optional, 1–2 Sätze) | free form
BQ5d | V93 | A/E | Is the navigation throughout the dashboard clear?/Ist die Navigation durch das Dashboard klar? | Rate from 1 = not clear/unklar to 5 = fully clear/vollständig klar
BQ6d | V94 | A/E | After the fault cases, how good would you estimate finding faults with the system?/Nach den Fehlerfällen. Wie gut würden Sie einschätzen mit dem System Fehler zu finden? | Rate from 1 = I would not find any faults/Ich würde keine Fehler finden to 5 = I definitely would find faults/Ich würde garantiert Fehler finden
FQ2CAD | V95 | A/E | How good is the visualization in helping find faults?/Wie gut hilft die Visualisierung bei der Fehlerfindung? | Rate from 1 = very bad/sehr schlecht to 5 = very good/sehr gut
FQ3CAD | V96 | A/E | How good are the contexts in helping find faults and classify them?/Wie gut helfen Kontexte bei der Fehlerfindung und -klassifizierung? | Rate from 1 = very bad/sehr schlecht to 5 = very good/sehr gut
FQ4CAD | V97 | A/E | How would you estimate the potential to find the origin of the fault through context-aware fault diagnosis?/Wie würden Sie das Potenzial einschätzen die Fehlerursache durch die Kontext-bezogene Fehlerdiagnose zu finden? | Rate from 1 = no potential/Kein Potenzial to 5 = very good potential/sehr großes Potenzial
- | V98 | A/E | What brings you to this rating? What chances and problems do you see? (optional, 1–3 sentences)/Was veranlasst Sie zu dieser Bewertung? Welche Chancen und Probleme sehen Sie dabei? (optional, 2–3 Sätze) | free form
FQ5CAD | V99 | A/E | How would you rate the usefulness of the current prototypical implementation of the context-aware fault diagnosis?/Wie würden Sie die Nützlichkeit der aktuellen prototypischen Implementierung bewerten? | Rate from 1 = not useful/nicht nützlich to 5 = very useful/sehr nützlich
- | V46 | A/E | What do you like the most about the current implementation and why? (optional, 1–3 sentences)/Was mögen Sie am meisten an der aktuellen Implementierung und warum? (optional, 1–3 Sätze) | free form
- | V47 | A/E | If you were granted three wishes, what would you alter in the current implementation? (optional, 1–3 sentences)/Wenn Sie drei Wünsche frei hätten, was würden Sie ändern an der aktuellen Implementierung? (optional, 1–3 Sätze) | free form
- | V102 | A/E | The functions implemented in the software support me in performing my work./Die implementierten Funktionen in der Software unterstützen mich meiner Arbeit nachzugehen. | Rate from 1 = no, not at all/nein überhaupt nicht to 5 = yes, every feature is good to have/ja es ist gut jedes Feature zu haben
- | V103 | A/E | Too many different steps need to be performed to deal with a given task./Es sind zu viele verschiedene Schritte zu tun um eine bestimmte Aufgabe zu lösen. | Rate from 1 = yes, there are too many steps/ja es sind viel zu viele Schritte to 5 = no, the number of steps is right/nein, die Anzahl der Schritte ist genau richtig
- | V104 | A/E | The software is well suited to the requirements of my work./Die Software entspricht den Anforderungen an meine Arbeit. | Rate from 1 = no, not at all/nein überhaupt nicht to 5 = yes, the software suits the requirements/ja, die Software entspricht genau den Anforderungen
- | V105 | A/E | The important commands required to perform my work are easy to find./Die wichtigen Befehle um meine Arbeit zu erledigen sind einfach zu finden. | Rate from 1 = no, not at all/nein überhaupt nicht to 5 = yes, the commands are very easy to find/ja, die Kommandos sind sehr einfach zu finden
- | V106 | A/E | The presentation of the information on the screen supports me in performing my work./Die Präsentation der Informationen auf dem Bildschirm unterstützt mich meine Arbeit zu erledigen. | Rate from 1 = no, not at all/nein überhaupt nicht to 5 = yes, the presented information is supportive/Ja, die dargestellten Informationen sind eine klare Unterstützung
- | V107 | A/E | The possibilities for navigation within the software are adequate./Die Möglichkeiten der Navigation in der Software sind angemessen. | Rate from 1 = no, not at all/nein überhaupt nicht to 5 = yes, totally adequate/ja total angemessen
- | V108 | A/E | When executing functions, I have the feeling that the results are predictable./Wenn ich Funktionen ausführe habe ich das Gefühl, dass die Ergebnisse einschätzbar sind. | Rate from 1 = no, not at all/nein überhaupt nicht to 5 = yes, the results are highly predictable/Ja, die Ergebnisse sind stark vorhersehbar
- | V109 | A/E | I needed a long time to learn how to use the software./Ich habe eine lange Zeit gebraucht um zu lernen wie die Software verwendet wird. | Rate from 1 = long time needed/lange Einarbeitung nötig to 5 = was learnable fast/war schnell zu erlernen
- | V110 | A/E | It is easy for me to relearn how to use the software after a lengthy interruption./Es ist einfach für mich die Software neu zu erlernen wenn ich eine längere Pause einlegen müsste. | Rate from 1 = not an easy task/keine einfache Aufgabe to 5 = very easy to relearn the software/sehr einfach die Software erneut zu erlernen
- | V111 | A/E | I think I would like to use the system frequently./Ich denke ich würde das System häufig nutzen. | Rate from 1 = strongly disagree/stimme überhaupt nicht zu to 5 = strongly agree/stimme vollkommen zu
- | V112 | A/E | I found the system unnecessarily complex./Ich finde das System unnötig komplex. | Rate from 1 = strongly disagree/stimme überhaupt nicht zu to 5 = strongly agree/stimme vollkommen zu
- | V113 | A/E | I thought the system was easy to use./Ich dachte das System war einfach zu benutzen. | Rate from 1 = strongly disagree/stimme überhaupt nicht zu to 5 = strongly agree/stimme vollkommen zu
- | V114 | A/E | I think I would need the support of a technical person to be able to use this system./Ich denke ich brauche die Unterstützung einer technischen Person um das System nutzen zu können. | Rate from 1 = strongly disagree/stimme überhaupt nicht zu to 5 = strongly agree/stimme vollkommen zu
- | V115 | A/E | I found the various functions in this system were well integrated./Ich fand die verschiedenen Funktionen in das System gut integriert. | Rate from 1 = strongly disagree/stimme überhaupt nicht zu to 5 = strongly agree/stimme vollkommen zu
- | A: V48/E: V29 | A/E | I thought there was too much inconsistency in this system./Ich dachte, da waren zu viele Inkonsistenzen in diesem System. | Rate from 1 = strongly disagree/stimme überhaupt nicht zu to 5 = strongly agree/stimme vollkommen zu
- | V116 | A/E | I would imagine that most people would learn to use this system very quickly./Ich würde mir vorstellen die meisten Leute könnten das System sehr schnell erlernen. | Rate from 1 = strongly disagree/stimme überhaupt nicht zu to 5 = strongly agree/stimme vollkommen zu
- | V117 | A/E | I found the system very cumbersome to use./Ich fand es sehr mühselig das System zu benutzen. | Rate from 1 = strongly disagree/stimme überhaupt nicht zu to 5 = strongly agree/stimme vollkommen zu
- | V118 | A/E | I felt very confident using the system./Ich fühlte mich sehr sicher in der Benutzung des Systems. | Rate from 1 = strongly disagree/stimme überhaupt nicht zu to 5 = strongly agree/stimme vollkommen zu
- | V119 | A/E | I needed to learn a lot of things before I could get going with the system./Ich musste sehr viel lernen bevor ich das System erstmalig nutzen konnte. | Rate from 1 = strongly disagree/stimme überhaupt nicht zu to 5 = strongly agree/stimme vollkommen zu

References

  1. Kaupp, L.; Nazemi, K.; Humm, B. An Industry 4.0-Ready Visual Analytics Model for Context-Aware Diagnosis in Smart Manufacturing. In Information Visualisation—AI & Analytics, Biomedical Visualization, Builtviz, and Geometric Modelling & Imaging; Banissi, E., Ed.; IEEE: Piscataway, NJ, USA, 2020; pp. 350–359.
  2. Kaupp, L.; Nazemi, K.; Humm, B. Context-Aware Diagnosis in Smart Manufacturing: TAOISM, An Industry 4.0-Ready Visual Analytics Model. In Integrating Artificial Intelligence and Visualization for Visual Knowledge Discovery; Kovalerchuk, B., Nazemi, K., Andonie, R., Datia, N., Banissi, E., Eds.; Springer: Cham, Switzerland, 2022; Volume 1014, pp. 403–436.
  3. Kaupp, L.; Beez, U.; Humm, B.; Hülsmann, J. From Raw Data to Smart Documentation: Introducing a Semantic Fusion Process for Cyber-Physical Systems. In Proceedings of the 5th Collaborative European Research Conference, Darmstadt, Germany, 12–14 June 2017; pp. 83–97.
  4. Beez, U.; Kaupp, L.; Deuschel, T.; Humm, B.G.; Schumann, F.; Bock, J.; Hülsmann, J. Context-Aware Documentation in the Smart Factory. In Semantic Applications; Hoppe, T., Humm, B., Reibold, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2018; pp. 163–180.
  5. Kaupp, L.; Beez, U.; Hülsmann, J.; Humm, B.G. Outlier Detection in Temporal Spatial Log Data Using Autoencoder for Industry 4.0. In Engineering Applications of Neural Networks; Macintyre, J., Ed.; Communications in Computer and Information Science Series; Springer: Cham, Switzerland, 2019; Volume 1000, pp. 55–65.
  6. Kaupp, L.; Webert, H.; Nazemi, K.; Humm, B.; Simons, S. CONTEXT: An Industry 4.0 Dataset of Contextual Faults in a Smart Factory. Procedia Comput. Sci. 2021, 180, 492–501.
  7. Emmanouilidis, C.; Pistofidis, P.; Fournaris, A.; Bevilacqua, M.; Durazo-Cardenas, I.; Botsaris, P.N.; Katsouros, V.; Koulamas, C.; Starr, A.G. Context-based and human-centred information fusion in diagnostics. IFAC-PapersOnLine 2016, 49, 220–225.
  8. Zhou, F.; Lin, X.; Luo, X.; Zhao, Y.; Chen, Y.; Chen, N.; Gui, W. Visually enhanced situation awareness for complex manufacturing facility monitoring in smart factories. J. Vis. Lang. Comput. 2018, 44, 58–69.
  9. Filz, M.A.; Gellrich, S.; Herrmann, C.; Thiede, S. Data-driven Analysis of Product State Propagation in Manufacturing Systems Using Visual Analytics and Machine Learning. Procedia CIRP 2020, 93, 449–454.
  10. Zhou, F.; Lin, X.; Liu, C.; Zhao, Y.; Xu, P.; Ren, L.; Xue, T.; Ren, L. A survey of visualization for smart manufacturing. J. Vis. 2019, 22, 419–435.
  11. Xu, P.; Mei, H.; Ren, L.; Chen, W. ViDX: Visual Diagnostics of Assembly Line Performance in Smart Factories. IEEE Trans. Vis. Comput. Graph. 2017, 23, 291–300. [Google Scholar] [CrossRef]
  12. Jo, J.; Huh, J.; Park, J.; Kim, B.; Seo, J. LiveGantt: Interactively Visualizing a Large Manufacturing Schedule. IEEE Trans. Vis. Comput. Graph. 2014, 20, 2329–2338. [Google Scholar] [CrossRef] [Green Version]
  13. Post, T.; Ilsen, R.; Hamann, B.; Hagen, H.; Aurich, J.C. User-Guided Visual Analysis of Cyber-Physical Production Systems. J. Comput. Inf. Sci. Eng. 2017, 17, 9. [Google Scholar] [CrossRef]
  14. Arbesser, C.; Spechtenhauser, F.; Mühlbacher, T.; Piringer, H. Visplause: Visual Data Quality Assessment of Many Time Series Using Plausibility Checks. IEEE Trans. Vis. Comput. Graph. 2017, 23, 641–650. [Google Scholar] [CrossRef]
  15. Keim, D.A.; Schneidewind, J.; Sips, M. CircleView. In Proceedings of the Working Conference on Advanced Visual Interfaces, AVI ’04, Gallipoli, Italy, 25–28 May 2004; Costabile, M.F., Ed.; ACM: New York, NY, USA, 2004; p. 179. [Google Scholar] [CrossRef]
  16. Krzywinski, M.; Schein, J.; Birol, I.; Connors, J.; Gascoyne, R.; Horsman, D.; Jones, S.J.; Marra, M.A. Circos: An information aesthetic for comparative genomics. Genome Res. 2009, 19, 1639–1645. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  17. Bale, K.; Chapman, P.; Barraclough, N.; Purdy, J.; Aydin, N.; Dark, P. Kaleidomaps: A New Technique for the Visualization of Multivariate Time-Series Data. Inf. Vis. 2007, 6, 155–167. [Google Scholar] [CrossRef]
  18. Tominski, C.; Schumann, H. Enhanced Interactive Spiral Display. In Proceedings of the SIGRAD 2008—The Annual SIGRAD Conference Special Theme: Interaction, Stockholm, Sweden, 27–28 November 2008; pp. 53–56. [Google Scholar]
  19. Weber, M.; Alexa, M.; Müller, W. Visualizing Time-Series on Spirals. In Proceedings of the IEEE Symposium on Information Visualization 2001 (INFOVIS’01), San Diego, CA, USA, 22–23 October 2001; IEEE Computer Society: Washington, DC, USA, 2001; p. 7. [Google Scholar]
  20. Zhao, J.; Chevalier, F.; Balakrishnan, R. KronoMiner: Using multi-foci navigation for the visual exploration of time-series data. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’11, Vancouver, BC, Canada, 7–12 May 2011; Tan, D., Fitzpatrick, G., Gutwin, C., Begole, B., Kellogg, W.A., Eds.; ACM: New York, NY, USA, 2011; pp. 1737–1746. [Google Scholar] [CrossRef]
  21. Chuah, M.C. Dynamic Aggregation with Circular Visual Designs. In Proceedings of the 1998 IEEE Symposium on Information Visualization, INFOVIS ’98, Research Triangle Park, NC, USA, 19–20 October 1998; IEEE Computer Society: Washington, DC, USA, 1998; pp. 35–43. [Google Scholar]
  22. Nielsen, J.; Landauer, T.K. A mathematical model of the finding of usability problems. In Proceedings of the INTERACT ’93 and CHI ’93 Conference on Human Factors in Computing Systems, Amsterdam, The Netherlands, 24–29 April 1993; Arnold, B., Ed.; ACM: New York, NY, USA, 1993; pp. 206–213. [Google Scholar] [CrossRef]
  23. Merino, L.; Ghafari, M.; Anslow, C.; Nierstrasz, O. A systematic literature review of software visualization evaluation. J. Syst. Softw. 2018, 144, 165–180. [Google Scholar] [CrossRef]
  24. Rasmussen, M.; Laumann, K. Potential Use of HMI Evaluation Methods in HRA. Procedia Manuf. 2015, 3, 1358–1365. [Google Scholar] [CrossRef] [Green Version]
  25. Forsell, C.; Cooper, M. Questionnaires for evaluation in information visualization. In Proceedings of the 2012 BELIV Workshop: Beyond Time and Errors—Novel Evaluation Methods for Visualization, Seattle, WA, USA, 14–15 October 2012; ACM: New York, NY, USA, 2012; pp. 1–3. [Google Scholar] [CrossRef]
  26. Malhotra, V.; Raj, T.; Arora, A. Evaluation of Barriers Affecting Reconfigurable Manufacturing Systems with Graph Theory and Matrix Approach. Mater. Manuf. Process. 2012, 27, 88–94. [Google Scholar] [CrossRef]
  27. Novais, R.; Nunes, C.; Lima, C.; Cirilo, E.; Dantas, F.; Garcia, A.; Mendonca, M. On the proactive and interactive visualization for feature evolution comprehension: An industrial investigation. In Proceedings of the 34th International Conference on Software Engineering (ICSE), Zurich, Switzerland, 2–9 June 2012; Glinz, M., Ed.; IEEE: Piscataway, NJ, USA, 2012; pp. 1044–1053. [Google Scholar] [CrossRef]
  28. Aranburu, E.; Lasa, G.; Gerrikagoitia, J.K.; Mazmela, M. Case Study of the Experience Capturer Evaluation Tool in the Design Process of an Industrial HMI. Sustainability 2020, 12, 6228. [Google Scholar] [CrossRef]
  29. Bassil, S.; Keller, R.K. Software visualization tools: Survey and analysis. In Proceedings of the 9th International Workshop on Program Comprehension, Toronto, ON, Canada, 12–13 May 2001; IEEE Computer Society: Los Alamitos, CA, USA, 2001; pp. 7–17. [Google Scholar] [CrossRef] [Green Version]
  30. Bertoni, A.; Bertoni, M.; Isaksson, O. Value visualization in Product Service Systems preliminary design. J. Clean. Prod. 2013, 53, 103–117. [Google Scholar] [CrossRef] [Green Version]
  31. Ciolkowski, M.; Heidrich, J.; Munch, J.; Simon, F.; Radicke, M. Evaluating Software Project Control Centers in Industrial Environments. In Proceedings of the First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007), Madrid, Spain, 20–21 September 2007; pp. 314–323. [Google Scholar] [CrossRef]
  32. Fuertes, J.J.; Prada, M.A.; Rodriguez-Ossorio, J.R.; Gonzalez-Herbon, R.; Perez, D.; Dominguez, M. Environment for Education on Industry 4.0. IEEE Access 2021, 9, 144395–144405. [Google Scholar] [CrossRef]
  33. Lohfink, A.P.; Anton, S.D.D.; Schotten, H.D.; Leitte, H.; Garth, C. Security in Process: Visually Supported Triage Analysis in Industrial Process Data. IEEE Trans. Vis. Comput. Graph. 2020, 26, 1638–1649. [Google Scholar] [CrossRef] [PubMed]
  34. Reh, A.; Gusenbauer, C.; Kastner, J.; Gröller, M.E.; Heinzl, C. MObjects–a novel method for the visualization and interactive exploration of defects in industrial XCT data. IEEE Trans. Vis. Comput. Graph. 2013, 19, 2906–2915. [Google Scholar] [CrossRef] [PubMed]
  35. Richardson, N.T.; Lehmer, C.; Lienkamp, M.; Michel, B. Conceptual design and evaluation of a human machine interface for highly automated truck driving. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 2072–2077. [Google Scholar] [CrossRef]
  36. Shamim, A.; Balakrishnan, V.; Tahir, M. Evaluation of opinion visualization techniques. Inf. Vis. 2015, 14, 339–358. [Google Scholar] [CrossRef]
  37. Shin, D.H.; Dunston, P.S. Evaluation of Augmented Reality in steel column inspection. Autom. Constr. 2009, 18, 118–129. [Google Scholar] [CrossRef]
  38. Stelmaszewska, H.; Wong, B.W.; Sanderson, P.M. Methods for gathering and analyzing information seeking behaviour in electronic resource discovery systems. Proc. Hum. Factors Ergon. Soc. Annu. Meet. 2010, 54, 807–811. [Google Scholar] [CrossRef] [Green Version]
  39. Strandberg, P.E.; Afzal, W.; Sundmark, D. Decision making and visualizations based on test results. In Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM ’18, Oulu, Finland, 11–12 October 2018; Oivo, M., Ed.; ACM: New York, NY, USA, 2018; pp. 1–10. [Google Scholar] [CrossRef] [Green Version]
  40. Väätäjä, H.; Varsaluoma, J.; Heimonen, T.; Tiitinen, K.; Hakulinen, J.; Turunen, M.; Nieminen, H.; Ihantola, P. Information Visualization Heuristics in Practical Expert Evaluation. In Proceedings of the Sixth Workshop on Beyond Time and Errors on Novel Evaluation Methods for Visualization, Baltimore, MD, USA, 24 October 2016; Sedlmair, M., Ed.; ACM: New York, NY, USA, 2016; pp. 36–43. [Google Scholar] [CrossRef]
  41. Villani, V.; Sabattini, L.; Zanelli, G.; Callegati, E.; Bezzi, B.; Baranska, P.; Mockallo, Z.; Zolnierczyk-Zreda, D.; Czerniak, J.N.; Nitsch, V.; et al. A User Study for the Evaluation of Adaptive Interaction Systems for Inclusive Industrial Workplaces. IEEE Trans. Autom. Sci. Eng. 2021, 19, 1–11. [Google Scholar] [CrossRef]
  42. Zhao, Y.; Luo, F.; Chen, M.; Wang, Y.; Xia, J.; Zhou, F.; Wang, Y.; Chen, Y.; Chen, W. Evaluating Multi-Dimensional Visualizations for Understanding Fuzzy Clusters. IEEE Trans. Vis. Comput. Graph. 2018, 25, 12–21. [Google Scholar] [CrossRef] [PubMed]
  43. Nazemi, K.; Kaupp, L.; Burkhardt, D.; Below, N. 5.4 Datenvisualisierung. In Praxishandbuch Forschungsdatenmanagement; Putnings, M., Ed.; De Gruyter Praxishandbuch Ser, Walter de Gruyter GmbH: Berlin, Germany, 2021; pp. 477–502. [Google Scholar] [CrossRef]
  44. Webert, H.; Döß, T.; Kaupp, L.; Simons, S. Fault Handling in Industry 4.0: Definition, Process and Applications. Sensors 2022, 22, 2205. [Google Scholar] [CrossRef] [PubMed]
  45. Bruckner, D.; Stanica, M.P.; Blair, R.; Schriegel, S.; Kehrer, S.; Seewald, M.; Sauter, T. An Introduction to OPC UA TSN for Industrial Communication Systems. Proc. IEEE 2019, 107, 1121–1131. [Google Scholar] [CrossRef]
  46. Kaupp, L.; Humm, B.; Nazemi, K.; Simons, S. Autoencoder-Ensemble-Based Unsupervised Selection of Production-Relevant Variables for Context-Aware Fault Diagnosis. Sensors 2022, 22, 8259. [Google Scholar] [CrossRef] [PubMed]
  47. Bangor, A.; Kortum, P.; Miller, J. Determining What Individual SUS Scores Mean: Adding an Adjective Rating Scale. J. Usability Stud. 2009, 4, 114–123. [Google Scholar]
Figure 1. Overview of the Flourish dashboard. The Flourish dashboard is split into three main sections (the Flourish visualization, the context OD model view, and the log view) that expand on selection. The anchor of the dashboard is the Flourish visualization (5); Figure 6 shows the elements visible at the start. The dashboard provides a checkbox to enable the live video feed and a dropdown list to select a previous recording (1). The dropdown list is present for the prototype evaluation only, since in a live system the production line would be recorded continuously. The hierarchy-depth slider (2) refers to the context hierarchy shown in the Flourish visualization and adjusts its displayed depth. The spider diagram (3) displays the developed AI evaluation metrics for unsupervised-trained ensembles. The live video feed (4) of the current situation is synchronized with the rest of the dashboard with respect to the data: an adjustment of the time slider (6) also adjusts the information displayed in Flourish and the time slider of the video. The threshold slider (7) adjusts the thresholds of all employed context OD models by a percentage, which affects the Flourish visualization and the displayed model performance (9) once a context OD model is selected (8). Additionally, the model performance view displays both thresholds that distinguish between medium (yellow) and serious (red) errors; both are affected by the threshold slider (7). The context window (10) is displayed under the model performance and visualizes the numerically transformed data that are fed to the context OD model to evaluate the current point in time (vertical green line in the model performance view). Once an anomaly is spotted by the professional, selecting a graph with numerically transformed values (11) opens the log view (12), which displays the back-transformed original log values. For better readability, the node id and the datatype are hidden.
Figure 2. Neural network configuration of a context OD model. The context OD model is a convolutional autoencoder neural network; different layer types are colored differently. The first layer enlarges the input dimension d by a factor of 4 using filters, and the descending layers then reduce the size by a factor of 2 down to d/2. All 1D convolutional layers, including the transpose layers T, use a kernel size of three with a stride of one and keep the input and output dimensions the same through padding (not drawn in the figure). The pooling layers use a pool size of two with no strides. All convolutional layers use the hyperbolic tangent (tanh) activation function, except the last layer, which uses a linear activation. The kernel initializer is the Glorot uniform algorithm, and the bias is initialized with zeros in all convolutional layers.
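The filter sizing described in the Figure 2 caption (expand the input dimension d by a factor of 4, then halve down to d/2, mirrored by the transpose layers) can be sketched as follows. This is a minimal illustration of the sizing rule only, not the authors' implementation; the function names are hypothetical.

```python
def encoder_filters(d):
    """Filter counts per encoder Conv1D layer: expand to 4*d, then halve down to d/2."""
    sizes = [4 * d]
    while sizes[-1] > d // 2:
        sizes.append(sizes[-1] // 2)
    return sizes

def autoencoder_filters(d):
    """Mirror the encoder with transpose layers to form the full autoencoder."""
    enc = encoder_filters(d)
    return enc + enc[-2::-1]

print(autoencoder_filters(8))  # → [32, 16, 8, 4, 8, 16, 32]
```

For an input dimension of d = 8, the encoder would thus use 32, 16, 8, and 4 filters before the transpose layers mirror the sizes back up.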
Figure 3. Error propagation in the Flourish visualization. (1) No error is present in the system, as the production line context error rate is within the range of normal behavior. (2) A rise above the production line context's error baseline hints at a potential fault. (3) A short time later, the production line context error rises further, and the OPC UA context threshold is exceeded. On hover, the area enlarges, and a click on the suspicious model allows the current situation to be inspected. Here, the model correctly classified the contextual fault, and the rise in the production line context's error rate is the first indication.
Figure 4. Line charts with different reconstruction error indications. In unsupervised systems, insufficiently trained models are always present; consequently, at any point t there can be a model that does not work properly and reports false positives. The professional in charge therefore has to distinguish between good (1), insufficient (2), and non-functional (3) context OD models. A good model (1) stays under the set threshold most of the time, and the periods where the threshold is exceeded are limited. An insufficiently trained context OD model (2) stays above the set thresholds most of the time; spikes are noticeable but do not fall under the threshold. Here, the threshold slider can be used to determine whether the spike ranges are reasonable. A non-functional context OD model (3) reports high values no matter what is fed to the neural network, so no judgment can be made about any situation happening in the system.
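The triage of context OD models described in the Figure 4 caption can be approximated by a simple heuristic on how often the reconstruction error exceeds the thresholds. The dashboard itself leaves this judgment to the professional; the following is a hypothetical sketch, and the function name, parameters, and boundary values are assumptions.

```python
def classify_od_model(errors, threshold, nonfunctional_bound, max_over=0.5):
    """Heuristic triage of a context OD model from its reconstruction errors.

    'non-functional': the error never falls below the non-functional boundary;
    'insufficient':   the error exceeds the threshold most of the time;
    'good':           otherwise.
    """
    if min(errors) > nonfunctional_bound:
        return "non-functional"
    over = sum(e > threshold for e in errors) / len(errors)
    return "insufficient" if over > max_over else "good"
```

For example, a model whose error exceeds the threshold in only a quarter of the samples would be classified as "good", while one that stays above the non-functional boundary throughout would be flagged "non-functional".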
Figure 5. AI performance metrics. The spider diagram (1) shows all AI performance metrics (3–8). The metrics are explained by hovering over the information button (2). All metrics are computed over time. They can be computed over all data in the system, reflecting the overall system performance SP (blue); over just the selected dataset, as dataset performance DP (green); or only up to the navigated point in time T on the dataset, DP≤T (orange) (9).
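The three scopes shown in the Figure 5 spider diagram can be illustrated with a small sketch. The concrete metrics are not specified in the caption, so a mean over error values stands in as a placeholder metric below; the function name is hypothetical.

```python
def metric_over_scopes(system_values, dataset_values, t, metric):
    """Compute one AI metric over the three scopes of the spider diagram:
    SP (all system data), DP (selected dataset), DP<=T (dataset up to time t)."""
    return {
        "SP": metric(system_values),
        "DP": metric(dataset_values),
        "DP<=T": metric(dataset_values[: t + 1]),
    }

mean = lambda xs: sum(xs) / len(xs)
scores = metric_over_scopes([1, 2, 3, 4], [2, 4], t=0, metric=mean)
```

Restricting the dataset slice to the navigated point in time is what lets the orange DP≤T curve track the professional's current position on the time slider.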
Figure 6. The Flourish visualization in detail. The Flourish visualization is inspired by nature, namely Bellis perennis, also known as the daisy flower (top right (https://www.pexels.com/photo/white-daisy-flower-612331/, Pexels License, Hilary Halliwell, 2017, accessed on 2 August 2022) and bottom right (https://www.pexels.com/photo/white-and-yellow-daisy-flower-close-up-photography-45901/, CC0, Pixabay, 2016, accessed on 2 August 2022)). Technically, the Flourish visualization is a sunburst visualization with custom logic that ties the contexts, the context hierarchy (1), and the production line together. The blossom of the Flourish visualization (2) is a centralized, segregated information aggregation. In the outer ring of the blossom (3), the CPS contexts are placed, which aggregate the classified information of their sub-contexts. The center of the blossom (4) reflects the production line context, which aggregates the information of all context OD models and displays the current error percentage of the classified situation. The petals (5) are the CPS-related sections that comprise the CPS context (6) and the sub-contexts (7); the information always flows towards the center, as depicted. All sub-contexts backed by an OD model are aggregated (8) and form the CPS context. All CPS contexts are aggregated toward the center (9). The blossom acts as a traffic light that describes the status of the production line. Each context backed by an OD model is colored; otherwise, it is gray. For that reason, the contexts are visualized distinctively, and a classified error will not change the color of the superior levels but will affect the CPS information aggregation. Employed colors are yellow (exceeding the first threshold), red (also exceeding the second threshold), purple (exceeding the set non-functional boundary), and gray (no OD model present). While the contexts have binary coloring, the information aggregation uses a linearly scaled color scheme between green, yellow, and red.
Electronics 11 03942 g006
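The traffic-light coloring and the inward information flow described in the Figure 6 caption can be sketched as a simple mapping. The caption does not specify the exact aggregation scheme, so the mean used below is an assumption, as are the function names.

```python
def context_color(error, t1, t2, nonfunctional):
    """Map a context OD model's error rate to the colors described in Figure 6."""
    if error is None:            # no OD model backs this context
        return "gray"
    if error > nonfunctional:    # exceeds the set non-functional boundary
        return "purple"
    if error > t2:               # also exceeds the second threshold
        return "red"
    if error > t1:               # exceeds the first threshold
        return "yellow"
    return "green"

def aggregate_contexts(sub_errors):
    """Aggregate sub-context error rates toward the CPS context (mean as an assumed scheme)."""
    backed = [e for e in sub_errors if e is not None]
    return sum(backed) / len(backed) if backed else None
```

Note that, as the caption states, a classified error changes only the color of its own context; the superior levels are influenced solely through the aggregated value flowing toward the center.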
Figure 7. Smart factory at Darmstadt University of Applied Sciences (DUAS-SF). The DUAS-SF [6] consists of five stations: the high-bay storage, the six-axis assembly robot, the pneumatic press, the optical and weight inspection, and the electrical inspection. Everything is interconnected with a monorail shuttle system. The manual inspection bay is the area where faulty relays are verified and sorted out by a professional. The SF produces relays in full automation. Reprint with permission from [6], “CONTEXT: An Industry 4.0 Dataset of Contextual Faults in a Smart Factory”; published by Elsevier, 2021.
Figure 8. The workflow model that structures the evaluation. The workflow model is based on former publications [1,2,44]. In [1,2], we specified the four main diagnosis tasks of an analyst in the smart manufacturing domain: Exploration, Knowledge Acquisition, Analysis, and Reasoning. These tasks can be mapped to the main parts of the fault diagnosis process [44]: Fault Detection, Fault Classification, Fault Prioritization, and Fault Amendment. Our workflow model also covers Knowledge Acquisition before Fault Detection. Each step in the workflow model contains questions for the qualitative evaluation of the Flourish dashboard. The evaluation also covers additional questionnaires (shown as Additionals).
Figure 9. The evaluation is separated into four phases: Training, Known Fault Case, Random Fault Case, and Additionals. Each phase has different tasks, and each yellow line is associated with a questionnaire category (QC). The categories are: (t)raining, (f)ind contextual fault, (v)alidate contextual fault, (p)rioritize fault, find a fault (c)hain, (r)andom fault case, and (d)elta usage. At the end, the context-aware diagnosis (CAD) section asks questions about the concept and the current implementation of the CAD. Afterward, the user groups filled out the standard ISO and SUS questionnaires. Everything was structured around our workflow model; here, the different sections of the workflow model are associated with the various tasks.
Figure 10. The violin plot shows the distribution of the answers of both groups, junior professionals (A) and domain experts (E), to the ISO 9241/10 questionnaire. N denotes the number of participants that answered the question. The width of the violin plot describes the frequency of the answers. The white bullet highlights the median, while the thick line spans the lower bound (second quartile) and the upper bound (third quartile) of the data. The maximum and minimum are shown by the ends of the thin line. We chose the same ISO 9241/10 questions as Lohfink et al. [33] but inverted the answer scale for negative statements, so the positive answer to a negative statement is always five. Consequently, a five in this plot is always positively weighted for the evaluation, and a one is always negative.
Figure 11. The violin plot shows the distribution of the answers to the SUS questionnaire. N denotes the number of participants. The width of the violin plot describes the frequency of the answers. The median is the white point; the second and third quartiles are the lower and upper bounds of the thick line, while the ends of the thin line mark the minimum and maximum. The SUS has a defined algorithm to weigh the answers and compute the SUS score; for this reason, the answers to the statements were kept unchanged. We annotate each statement with a plus or a minus to indicate whether a high or a low value is considered good. Due to a technical problem, one question was missing during the survey. After this was noticed, the question was added, and some answers to the missing question were collected separately. Consequently, only complete questionnaires were used to compute the mean SUS score: eight of the junior professional questionnaires and all five of the domain expert questionnaires.
Figure 11. The violin plot shows the distribution of the answers to the SUS questionnaire. N denotes the number of participants. The width of the violin plot determines the frequency of the answers. The median is the white point. The second and third quantile is the lower and upper bound of the thick line, while the ends of the thin line determine the minimum and maximum. The SUS has a defined algorithm to weigh the answers and to compute the SUS score. For this reason, the answers to the statements were kept unchanged. We annotate a plus and a minus as an indication if a high or low value in the specific answer is considered good or bad. Due to a technical problem, a question was missing during questioning. After the missing question was noticed, the question was added. Therefore, some answers to the missing question were filled separately. Consequently, only complete questionnaires were used to compute the mean SUS score. Eight of the junior professional and all five of the domain expert questionnaires.
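The SUS scoring algorithm mentioned in the caption follows Brooke's standard scheme: odd-numbered statements (1-indexed) contribute (answer − 1), even-numbered statements contribute (5 − answer), and the sum is scaled by 2.5 to a 0–100 score. A minimal sketch; the example answers are hypothetical, not study data:

```python
def sus_score(answers):
    """Compute the SUS score (0-100) from ten 5-point answers.

    Standard SUS scoring: odd-numbered items (1-indexed) contribute
    (answer - 1), even-numbered items contribute (5 - answer); the
    sum is scaled by 2.5.
    """
    assert len(answers) == 10, "SUS requires a complete questionnaire"
    contributions = [(a - 1) if i % 2 == 0 else (5 - a)
                     for i, a in enumerate(answers)]  # i is 0-indexed
    return sum(contributions) * 2.5

# Hypothetical complete questionnaire (not actual study data):
print(sus_score([4, 2, 5, 1, 4, 2, 5, 1, 4, 2]))  # -> 85.0
```

The assertion reflects why only complete questionnaires entered the mean SUS score: the scheme is undefined for partially answered forms.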
Table 1. Group statistics. Stating a gender was optional and could be given as male (m), female (f), or diverse (d); '-' counts participants who did not state a gender. The age in years is given as the mean (M), the lowest (L), and the highest (H) age of the group. The work experience of both groups is described with the mean (M), the lowest (L), and the highest (H) value. Moreover, the domain experts were asked how long they have worked in the domain; again, the mean (M), the lowest (L), and the highest (H) values are given.
Group | A | E
Gender (m/f/d/-) | 8/2/0/0 | 5/0/0/0
Age in Years (M/L/H) | 25/23/27 | 36.4/23/61
Work Experience in Years (M/L/H) | 1.65/0/3.5 | 7.3/2/20
Domain Affiliation in Years (M/L/H) | -/-/- | 14.5/4.5/34
Table 2. Data types of all 33,089 available OPC UA variables with examples [6]. Reprinted with permission from [6], “CONTEXT: An Industry 4.0 Dataset of Contextual Faults in a Smart Factory”; published by Elsevier, 2021.
Data Type | Node Count | Example Values
Boolean | 17874 | True, “[False, ]”
Byte | 4918 | 11
ByteString | 406 | b’\xff...’
DateTime | 5 | 2020-07-09 16:05:11.795000
Double | 1 | 0.0
Float | 1248 | 3.1233999729156494
Int16 | 300 | 34
Int32 | 1822 | 16
Int64 | 6 | 2103635700381566
SByte | 34 | “[32, 32, ...]”
String | 93 | V3.0, “[’+40,,’, ”, ”, ”, ”]”
UInt16 | 2655 | 2
UInt32 | 1024 | 3
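A tally like Table 2 can be produced from any recorded snapshot of the OPC UA address space. The sketch below assumes a list of per-node records with a `data_type` field; the record layout and node IDs are illustrative, not the dataset's actual schema:

```python
from collections import Counter

# Tally recorded OPC UA variables by data type, as in Table 2.
# A real tally would iterate over all 33,089 recorded variables;
# the records below are a hypothetical excerpt.
records = [
    {"node_id": "ns=2;i=1001", "data_type": "Boolean", "value": True},
    {"node_id": "ns=2;i=1002", "data_type": "Int16",   "value": 34},
    {"node_id": "ns=2;i=1003", "data_type": "Boolean", "value": False},
    {"node_id": "ns=2;i=1004", "data_type": "Float",   "value": 3.12},
]

counts = Counter(r["data_type"] for r in records)
for data_type, node_count in counts.most_common():
    print(f"{data_type}: {node_count}")
# -> Boolean: 2
#    Int16: 1
#    Float: 1
```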
Table 3. Data types of sensing units with examples [6]. Reprinted with permission from [6], “CONTEXT: An Industry 4.0 Dataset of Contextual Faults in a Smart Factory”; published by Elsevier, 2021.
Columns | Data Type | Example Values
datetime | DateTime | 2020/07/01 16:15:40:647
TCAp, sGRP | Integer | 7, 0
TSL_IR, TSL_Full, TSL_Vis, TSL_LUX, BMP_TempC, BMP_Pa, BMP_AltM, MPU_AccelXMss, MPU_AccelYMss, MPU_AccelZMss, MPU_GyroXRads, MPU_GyroYRads, MPU_GyroZRads, MPU_MagXuT, MPU_MagYuT, MPU_MagZuT, MPU_TempC | Float | 41.00, 256.00, 215.00, 29.47, 27.57, 99359.20, 87.82, 5.72, −6.28, −5.15, −0.00, −0.01, 0.00, 21.58, 29.34, −28.94, 30.40
Table 4. Computed AI performance metrics for each dataset (dataset performance [DP]) and for the entire system (system performance [SP]).
        | Missing Parts     | Missing Pressure  | Shuttle Drop-Out  |
Metrics | D2 (DP) | D3 (DP) | D4 (DP) | D5 (DP) | D6 (DP) | D7 (DP) | SP
MoFoT   | 0.4%    | 8.3%    | 2.2%    | 2.2%    | 7.2%    | 12.2%   | 13.6%
CFEoT   | 5.4%    | 6.8%    | 5.7%    | 4.0%    | 5.9%    | 2.4%    | 4.9%
MEoT    | 0.6%    | 0.9%    | 1.2%    | 0.9%    | 0.6%    | 0.6%    | 0.8%
SEoT    | 4.8%    | 5.1%    | 4.1%    | 3%      | 5.2%    | 1.4%    | 3.9%
FERoT   | 0.1%    | 10.8%   | 7.8%    | 3.4%    | 3.3%    | 13.8%   | 6.1%
OFoT    | 0.0%    | 0.7%    | 0.4%    | 0.1%    | 0.2%    | 0.3%    | 0.3%
Table 5. The checkerboard-oriented ( ( A , , F ) , ( 1 , , 6 ) ) table provides an overview of the quantifiable results (5-point Likert scale) of the qualitative questionnaire. Every tile of the checkerboard is a question (Q) with the stats (S) and the mean (M) for one of the user groups. The order within the tiles is mirrored between groups (junior professionals [a], domain experts [e]), which enables the comparison of the mean values; the higher value is marked in bold. Each question has an assigned category (C), which refers to the various sections of the questionnaire (Figure 9). The stats encompass the number of participants (N), the standard deviation (S), the variance (V), and the highest (H) and lowest (L) answer to the question. The question order follows the questionnaire sections, which reflect the workflow model. Due to the compact format, the questions are inserted in line from left to right and ordered by occurrence in the questionnaire; categories can be mixed within lines. The questionnaire respects the different technical backgrounds of the junior professionals and the domain experts: for the task of finding a fault chain, two questions have no counterpart in the other group and are marked with (-). At the end of the table, each column of tiles is summarized with the achieved and total values of the means and of the highest (bold) values; again, the higher value is bold. A final box summarizes these column summaries per group, reflecting the achieved points, the total number of possible points, and their percentage, split into the sum of the means and the count of highest values. Only comparable values are used in the summarization. Additionally, equality in the mean is counted for both groups.
Q   | C   | aN | aM  | aS   | aV   | aH | aL | eN | eM  | eS   | eV   | eH | eL
AQ1 | t   | 10 | 4.2 | 0.87 | 0.76 | 5  | 2  | 5  | 4.4 | 0.49 | 0.24 | 5  | 3
AQ2 | t   | 10 | 4.3 | 0.64 | 0.41 | 5  | 3  | 5  | 3.2 | 0.75 | 0.56 | 4  | 2
AQ3 | t   | 10 | 3.7 | 0.9  | 0.81 | 5  | 2  | 5  | 4.2 | 0.4  | 0.16 | 5  | 4
AQ4 | t   | 10 | 4.5 | 0.5  | 0.25 | 5  | 4  | 5  | 4.6 | 0.8  | 0.64 | 5  | 3
AQ5 | t   | 10 | 4.7 | 0.46 | 0.21 | 5  | 4  | 5  | 4.6 | 0.49 | 0.24 | 5  | 4
AQ6 | t   | 10 | 4.2 | 0.75 | 0.56 | 5  | 3  | 5  | 3.8 | 0.75 | 0.56 | 5  | 3
BQ1 | d   | 10 | 4.2 | 0.6  | 0.36 | 5  | 3  | 5  | 4.6 | 0.49 | 0.24 | 5  | 4
BQ2 | d   | 10 | 4.3 | 0.64 | 0.41 | 5  | 3  | 5  | 4.6 | 0.49 | 0.24 | 5  | 4
BQ3 | d   | 10 | 4.1 | 0.54 | 0.29 | 5  | 3  | 5  | 4   | 0.63 | 0.4  | 5  | 3
BQ4 | d   | 10 | 4.8 | 0.4  | 0.16 | 5  | 4  | 5  | 4.8 | 0.4  | 0.16 | 5  | 4
BQ5 | d   | 10 | 5   | 0    | 0    | 5  | 5  | 5  | 4.8 | 0.4  | 0.16 | 5  | 4
BQ6 | d   | 10 | 4.2 | 0.6  | 0.36 | 5  | 3  | 5  | 4.6 | 0.49 | 0.24 | 5  | 4
CQ1 | f   | 10 | 4.3 | 0.46 | 0.21 | 5  | 4  | 5  | 4.6 | 0.49 | 0.24 | 5  | 4
CQ2 | f   | 10 | 4.1 | 1.04 | 1.09 | 5  | 2  | 5  | 3.2 | 1.17 | 1.36 | 4  | 1
CQ3 | f   | 10 | 4   | 1.18 | 1.4  | 5  | 2  | 5  | 3.2 | 1.17 | 1.36 | 4  | 1
CQ4 | f   | 10 | 4.5 | 0.67 | 0.45 | 5  | 3  | 5  | 5   | 0    | 0    | 5  | 5
CQ5 | f   | 10 | 4.5 | 0.5  | 0.25 | 5  | 4  | 5  | 4.4 | 0.49 | 0.24 | 5  | 4
CQ6 | f   | 10 | 2.8 | 1.08 | 1.16 | 4  | 1  | 5  | 2.2 | 0.75 | 0.24 | 3  | 1
DQ1 | f   | 10 | 4.8 | 0.6  | 0.36 | 5  | 3  | 5  | 4.8 | 0.4  | 0.16 | 5  | 4
DQ2 | f   | 10 | 4.9 | 0.3  | 0.09 | 5  | 4  | 5  | 4.4 | 0.8  | 0.64 | 5  | 3
DQ3 | v   | 10 | 3.9 | 0.83 | 0.69 | 5  | 2  | 5  | 4.4 | 0.49 | 0.24 | 5  | 4
DQ4 | v   | 10 | 4   | 0.89 | 0.8  | 5  | 3  | 5  | 3.8 | 0.98 | 0.96 | 5  | 2
DQ5 | v   | 10 | 4.4 | 0.49 | 0.24 | 5  | 4  | 5  | 4.6 | 0.49 | 0.26 | 5  | 4
DQ6 | v   | 10 | 4.3 | 0.64 | 0.41 | 5  | 3  | 5  | 3.8 | 1.47 | 2.16 | 5  | 1
EQ1 | v   | 10 | 4.2 | 0.75 | 0.56 | 5  | 3  | 5  | 4.4 | 0.49 | 0.24 | 5  | 4
EQ2 | v   | 10 | 3.7 | 1    | 1.01 | 5  | 2  | 5  | 3.2 | 0.98 | 0.96 | 5  | 2
EQ3 | p   | 10 | 4.2 | 0.98 | 0.96 | 5  | 2  | 5  | 3.8 | 0.40 | 0.16 | 4  | 3
EQ4 | c   | 10 | 3.9 | 0.94 | 0.89 | 5  | 2  | -  | -   | -    | -    | -  | -
EQ5 | c   | -  | -   | -    | -    | -  | -  | 5  | 4.2 | 0.4  | 0.16 | 5  | 4
EQ6 | r   | 10 | 3.1 | 1.45 | 2.09 | 5  | 1  | 5  | 3.6 | 1.02 | 1.04 | 5  | 2
FQ1 | r   | 10 | 2.7 | 1.42 | 2.01 | 5  | 1  | 5  | 3.2 | 0.75 | 0.56 | 4  | 2
FQ2 | CAD | 10 | 4.7 | 0.46 | 0.21 | 5  | 4  | 5  | 5   | 0    | 0    | 5  | 5
FQ3 | CAD | 10 | 4.6 | 0.49 | 0.24 | 5  | 4  | 5  | 5   | 0    | 0    | 5  | 5
FQ4 | CAD | 10 | 4.4 | 0.66 | 0.44 | 5  | 3  | 5  | 4.6 | 0.49 | 0.24 | 5  | 4
FQ5 | CAD | 10 | 4.4 | 0.66 | 0.44 | 5  | 3  | 5  | 4.2 | 0.75 | 0.56 | 5  | 3
Column | a Current (Total) | e Current (Total) | a Bold (Total) | e Bold (Total)
1 | 24.4 (30)   | 26 (30)     | 1 (6) † | 6 (6) †
2 | 26 (30)     | 23.6 (30)   | 4 (6)   | 2 (6)
3 | 24.5 (30)   | 24.6 (30)   | 3 (6)   | 3 (6)
4 | 22.2 (25) * | 22.8 (25) * | 2 (5) † | 4 (5) †
5 | 23 (25) *   | 22.6 (25) * | 4 (5)   | 1 (5)
6 | 18.6 (25)   | 18 (25)     | 3 (5)   | 2 (5)
Group | ∑ Current (Total), %
A | 138.7 (165), 84.1% | 17 (33), 51.5%
E | 137.6 (165), 83.4% | 18 (33), 54.5%
* Only comparable counted. † Equality is counted for both groups
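The per-question statistics in Table 5 (N, mean M, standard deviation S, variance V, highest H, and lowest L answer) can be reproduced with a few lines of NumPy; S² ≈ V in the table suggests the population form (ddof = 0). A minimal sketch with a hypothetical answer vector, not the recorded study data:

```python
import numpy as np

def question_stats(answers):
    """Per-question descriptive statistics as reported in Table 5:
    N, mean (M), population standard deviation (S), variance (V),
    and the highest (H) and lowest (L) answer."""
    a = np.asarray(answers, dtype=float)
    return {
        "N": int(a.size),
        "M": round(float(a.mean()), 2),
        "S": round(float(a.std(ddof=0)), 2),  # population form: V == S**2
        "V": round(float(a.var(ddof=0)), 2),
        "H": int(a.max()),
        "L": int(a.min()),
    }

# Hypothetical 5-point Likert answers for one question:
print(question_stats([5, 5, 4, 4, 5, 3, 4, 5, 4, 3]))
# -> {'N': 10, 'M': 4.2, 'S': 0.75, 'V': 0.56, 'H': 5, 'L': 3}
```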
Table 6. Success rate (SR) of the random fault (r) classification, split among user groups (junior professionals [a], domain experts [e]). Zero indicates an unsuccessful classification, and one indicates a successful classification. The table structure corresponds to Table 5, except that success rates are compared. R1 reflects the answers to whether a fault is present. R2 was only shown if R1 was previously answered with yes (1); consequently, N can differ for each group.
Q   | C | aN | aSR  | aS    | aV   | aH | aL | eN | eSR | eS | eV | eH | eL
RQ1 | r | 10 | 0.9  | 0.01  | 0.1  | 1  | 0  | 5  | 1   | 0  | 0  | 1  | 1
RQ2 | r | 9  | 0.88 | 0.012 | 0.11 | 1  | 0  | 5  | 1   | 0  | 0  | 1  | 1
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Kaupp, L.; Nazemi, K.; Humm, B. Evaluation of the Flourish Dashboard for Context-Aware Fault Diagnosis in Industry 4.0 Smart Factories. Electronics 2022, 11, 3942. https://doi.org/10.3390/electronics11233942
