Help Me Learn! Architecture and Strategies to Combine Recommendations and Active Learning in Manufacturing

This research work describes an architecture for building a system that guides a user from a forecast generated by a machine learning model through a sequence of decision-making steps. The system is demonstrated in a manufacturing demand forecasting use case and can be extended to other domains. In addition, the system provides the means for knowledge acquisition by gathering data from users. Finally, it implements an active learning component and compares multiple strategies to recommend media news to the user. We compare these strategies through a set of experiments to understand how they balance learning against providing accurate media news recommendations to the user. The media news aims to provide additional context for demand forecasts and to enhance judgment in decision-making.


Introduction
The decreased cost of sensors and connectivity [1], along with the development of the Internet of Things, Cloud Computing, Big Data Analytics and Blockchain technologies [2], have enabled an increasing digitalization of manufacturing and the introduction of new paradigms, such as Cyber-Physical Systems (CPS) [3,4] and Digital Twins (DTs) [5][6][7]. Moreover, they bring extensive added value to Industry 4.0 [8], enabling more effective operations, cost savings, and better product quality [9].
While an explosive growth of data available in the manufacturing industry has been observed [10], captured through sensors or made available from software, such as Enterprise Resource Planning (ERP) or Manufacturing Execution Systems (MES), much collective, semantic, and tacit knowledge that the employees are aware of is not digitalized. Furthermore, much of the digitalized data are not labeled, and thus no supervised learning algorithms can be applied to it. It is thus essential to identify how informative the newly collected data instances are to make good decisions regarding data management and machine learning models.
Much of the missing information can be introduced into the digital domain by asking users specific questions. Users can be queried regarding missing labels, asked for feedback on particular entries, or asked to provide missing domain knowledge. The collection of locally observed collective knowledge can be achieved through a specialized solution [11,12]. The particular case of querying a user for labels given a large pool of unlabeled data is addressed by a sub-field of machine learning known as Active Learning (AL) [13]. Active learning attempts to identify the most informative data instances, which are presented to the oracle (e.g., a human expert) with a request for a label, reducing the data annotation effort. Newly labeled data are incorporated into the existing dataset and can be fed to the machine learning models. Batch machine learning models require regular deployments to make the last trained version available to the manufacturing software.
Active learning reduces the labeling stress posed on the user and provides a solution to the users' reticence to provide information and feedback [14]. However, active learning alone does not solve the data labeling issue: a good user experience is key to the success of such a system [15], impacting conversion rates (amount of labeled samples) and user satisfaction (users will not abandon the feature or application) [16]. Therefore, we designed a user interface considering that users' feedback can be implicit [17] or explicit. Assuming that the quality of our entries is acceptable (implicit feedback, if no other feedback is provided), we provide means for the user to signal disagreement (explicit feedback) [18][19][20]. When providing recommendations to the users, candidate data instances identified by an active learning strategy do not guarantee their quality and the consequent good user experience [21]. A compromise is required to balance exploration and exploitation while delivering good results. Furthermore, we ranked the unlabeled data entries to ensure that entries most likely to be of high quality are displayed first, and those that do not meet a certain quality threshold are not shown at all. For particular cases, such as when collecting feedback on decision-making options suggested to the user, we allowed the user to provide their own input. This way, we gather additional domain knowledge when the options provided so far do not satisfy the user. New domain knowledge provided by the users can be later incorporated into the application, promoting continuous knowledge gathering and learning.
This paper evolves previous work done in [22]. The scientific contributions of this paper are twofold. First, it describes an architecture we developed to realize a system that combines semantic technologies, machine learning, and explainable artificial intelligence to provide forecasts, explanations, and contextual information while guiding users' decision-making. Second, it compares nine active learning scenarios to understand the learning versus recommendation trade-off. We evaluate them by implementing a prototype application and recommending four categories of media news that enhance planners' awareness in a demand forecasting setting. In addition, we describe the implementation of a knowledge-based decision-making options recommender system implemented to advise logisticians regarding transport scheduling based on demand forecasts.
The media news we recommend to the users relates to four aspects influencing the demand for automotive engine components produced by a European original equipment manufacturer selling its products worldwide. The demand forecasting models were trained using real-world data provided by manufacturing partners of the European Horizon 2020 project FACTLOG [23][24][25]. The data we used included three years of daily shipment information; a month of demand forecasts per material and client at a daily level; feature relevance for every prediction; forecast explanations created from those feature rankings; and decision-making options created based on demand forecasts and heuristics.
We evaluate the outcomes of the machine learning models across different active learning scenarios assessing two metrics: area under the receiver operating characteristic curve (ROC AUC) [26] and Mean Average Precision (MAP) [27]. ROC AUC is widely adopted as a classification metric due to its desirable properties, such as being threshold independent and invariant to a priori class probabilities. We measure ROC AUC considering prediction scores cut at a threshold of 0.5. On the other hand, MAP is a popular metric in the information retrieval domain, averaging the precision of the recommendation set computed at the rank of each relevant item. Both metrics are used to assess the performance of recommender systems [28].
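As a concrete illustration of the second metric, MAP can be computed as the mean of per-query average precision over ranked recommendation lists. The following minimal pure-Python sketch (function names are ours, not part of the system described in this paper) shows the computation:

```python
def average_precision(ranked, relevant):
    """Average precision for one ranked list of item ids."""
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """MAP over (ranked_list, relevant_set) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)
```

For a ranking [a, b, c, d] with relevant items {a, c}, the average precision is (1/1 + 2/3) / 2 ≈ 0.83.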
The rest of this paper is structured as follows: Section 2 presents related work, and Section 3 details the architecture we designed to satisfy the requirements described above. Section 4 describes the demand forecasting use case we considered to build and test the concept architecture and system. Section 5 presents the user interface, describing each of the components we built into it. Section 6 describes the decision-making recommender system implementations, while Section 7 details the experiments and results obtained when applying active learning for media news categorization and recommendation. Finally, Section 8 provides the conclusions and outlines future work.

Related Work
In this section, we first briefly introduce scientific literature describing demand forecasting models related to the automotive industry. We then describe related work regarding Explainable Artificial Intelligence (XAI), and conclude with an overview of scientific works related to the active learning field.

Demand Forecasting
Products' demand forecasting requires the application of different approaches conditioned by the demand characteristics. Widely adopted criteria to characterize the demand relate to the variance of the demand lead times [29], the average demand interval magnitude [30], or the coefficient of variation (see Equation (1)) [31].
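Equation (1) is not reproduced here, but the coefficient of variation is commonly defined as the ratio of the demand's standard deviation to its mean. A minimal sketch, assuming the population standard deviation:

```python
from statistics import mean, pstdev

def coefficient_of_variation(demand):
    """CV = sigma / mu: dispersion of a demand series relative to its mean."""
    mu = mean(demand)
    return pstdev(demand) / mu if mu else float("inf")
```

A perfectly smooth demand series yields a CV of 0; the higher the CV, the more erratic the demand and the harder it is to forecast.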
Demand is closely related to the product's characteristics and is influenced by the economic context, market type, and customer expectations. Among factors affecting the demand in the automotive industry we find personal income [32], fuel prices [33,34], gross domestic product [35], inflation and unemployment rates [36,37]. This information can be collected and encoded to datasets used to train machine learning models, which learn to predict future demand based on past data.
Statistical and machine learning models were successfully applied to provide accurate demand forecasts for cars and car components. Among the most frequent machine learning algorithms used to train the models we find the Support Vector Machine (SVM) [36], Multiple Linear Regressor (MLR) [38,39] and Artificial Neural Networks (ANN) [40][41][42]. Popular statistical forecasting methods include the autoregressive integrated moving average (ARIMA) [32,43], autoregressive moving average (ARMA) [33] and moving average models [44].
While the accuracy of the demand forecasting models is critical for their adoption, given their influence on decision-making, it is imperative to provide details on the rationale followed by the model. Such insights help the user understand the reasons behind the forecast and decide whether they can trust it or not [25]. Furthermore, it has been argued that including domain context can further help planners assess the forecasts' soundness, and eventually correct them before making a decision [45][46][47].

Explainable Artificial Intelligence
While the Industry 4.0 paradigm represents a great potential for the manufacturing industry [48], risks associated with its implementation, such as the complexity of integration or the perceived risks of novel technologies [49] must be mitigated. One such perceived risk is the difficulty of providing an intelligible explanation regarding the machine learning models' predictions. Usual reasons behind models' opaqueness are: (i) the complexity of the formal structure of the model, which can be beyond human comprehension [50], or alien to human reasoning; and (ii) intentional hiding of the inner workings of the model (e.g., to avoid exposing some trade secret, or sensitive information) [51]. Research on how to provide intelligibility on the reasons behind the forecast and transparency regarding the machine learning forecasting model is known as explainable artificial intelligence [46]. Such insights and explanations increase the trust in AI models and provide additional information to assist users' decision-making.
Best practices on how to convey the insights regarding the models' reasoning process require the explanation to resemble a logic explanation [52], and take into account relevant context. Among context elements, ref. [53] considers three related to the explainee: (i) the user profile to whom the explanation is given; (ii) the goal of the explanation; and (iii) if the explanation is either global (describes the average AI model forecast), or local (describes a specific forecast instance). Common explanation types include feature rankings, prototype (local) explanations, and counterfactual explanations. Multiple techniques were developed to compute feature rankings, which convey information on which features exercised most influence on a given forecast (local explanation) [54][55][56], or forecasts in general (global explanation). Prototype explanations are data instances obtained from the train set, which are similar to the feature vector used to issue the prediction [57]. Such samples help us to understand which instances most likely influenced the model learning to provide a particular forecast. Finally, counterfactual explanations provide perturbed data samples that produce a different forecasting outcome than the original data instance [58][59][60]. Such samples allow the user to understand what values need to be changed to change a forecast outcome. Ideally, the perturbed features correspond to actionable aspects, on which the user can be advised to take action to influence future outcomes [61].
In the context of manufacturing, XAI technologies have been tested in several scenarios such as predictive maintenance [62], real-time process management [63], and quality monitoring [64]. One of our research goals is to highlight the models' explainability in smart manufacturing processes, aligning XAI technologies with human interaction. We also aim to collect feedback on the quality of such explanations since there are few validated measurements for user evaluations on explanations' quality [65].

Active Learning
Active learning is a sub-field of machine learning that studies how to improve the learners' performance by asking questions to an oracle (e.g., a human annotator), under the assumption that unlabeled data are abundant, while the labels are expensive to obtain [13]. Since users are usually reluctant to provide information and feedback, AL can be used to identify a set of data instances on which the users' input conveys the most valuable information to the system [14]. While active learning in itself helps to reduce the labeling effort focusing on the data that provides new information, it has been demonstrated that explainable artificial intelligence can provide meaningful information to the user, increasing the accuracy of the labels provided [66]. Furthermore, feedback on the explanations can be used to enhance them in the future further. A framework of three components can be used to gather feedback, considering a forecasting engine, an explanation engine, and a feedback loop to learn from the users [67].
The scientific literature describes multiple approaches towards the realization of active learning [13,68,69]. Regarding how the unlabeled data instances are obtained, we distinguish three scenarios: (i) membership query synthesis; (ii) stream-based selective sampling; and (iii) pool-based active learning. Membership query synthesis requires some mechanism (e.g., adversarial generative sampling [70]) to synthesize new data instances for the specific label they were requested. Stream-based selective sampling assumes a stream of unlabeled data instances is available. A decision must be made for each data instance regarding whether it should be discarded or provided to the oracle for labeling. Such a decision can be made based on an informativeness measure or by determining a region of uncertainty, querying the data instances within it. Finally, pool-based active learning assumes a pool of unlabeled data from which data instances are selected greedily based on an informativeness measure, which enables ranking the entire pool before selecting the best candidate data instance.
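The pool-based scenario with an uncertainty-based informativeness measure can be sketched as follows; this is an illustrative fragment (the model's probability function and the pool are placeholders), not the system's actual implementation:

```python
def uncertainty(p_pos):
    """U(x) = 1 - max_y P(y|x) for a binary classifier's positive-class probability."""
    return 1.0 - max(p_pos, 1.0 - p_pos)

def query_most_uncertain(pool, predict_proba, k=1):
    """Pool-based sampling: score the whole pool by informativeness
    (here, classification uncertainty) and pick the top-k candidates
    to present to the oracle for labeling."""
    ranked = sorted(pool, key=lambda x: uncertainty(predict_proba(x)), reverse=True)
    return ranked[:k]
```

After the oracle labels the selected instances, they are appended to the labeled set and the model is retrained before the next query round.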
While we envision that active learning can be applied to enhance the explanations provided by XAI, and the decision-making options recommendations we provide to the users regarding manufacturing-related operations [14], in this work, we only compare different active learning strategies to classify and recommend media news to the users.
We extend the approach proposed by [67] to collect feedback from forecasts, forecast explanations, media news related to demand forecasting, and decision-making options we recommend to the users. When recommending media news to the users, we evaluate our approaches against baselines described in [21]. Those baselines allow us to understand the exploration and exploitation trade-off required to learn from promising unlabeled data instances while providing good recommendations to the users.

Active Learning for Text Classification
Text classification is the procedure of assigning predefined labels to text and is considered one of the most fundamental tasks in natural language processing [71]. Most classical machine learning approaches follow two steps: in the first step, hand-crafted features are extracted from the input texts, and in the second step, the features are fed to a classifier that makes predictions. The choices of features include the bag-of-words (BoW) approach with various extensions, such as BoW with TF-IDF weighting [72], while the choices of classifiers include logistic regression and support vector machines [73]. In some tasks, such approaches can still provide competitive baselines.
To address the limitations of hand-crafted features, neural approaches have been explored, where the model learns to map the input text to a low-dimensional continuous feature vector [73,74]. Feature extraction from text can be done using approaches such as word2vec [75], doc2vec [76], the universal sentence encoder [77], or transformer-based models, such as BERT [78,79] and RoBERTa [80]. In some approaches, there are multiple ways to obtain a single feature vector for the input text, e.g., by using only the vector of a specific token, such as the classification token, or by averaging the feature vectors of all the words. Different techniques might yield different performances on a given task [79,81]. A neural feature extractor can be used to produce fixed feature vectors that are fed to the classifier as in the classical two-step approach, or the neural model can be trained end-to-end on the given task.
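The first step of the classical two-step pipeline can be illustrated with a deliberately minimal TF-IDF extractor; in practice a library implementation (e.g., scikit-learn's TfidfVectorizer, with smoothing and normalization) would be used instead of this sketch:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Minimal bag-of-words TF-IDF: returns one dict per document
    mapping each term to its weight tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in tokenized for term in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors
```

Note that under this plain IDF, terms occurring in every document receive a weight of zero, which is why smoothed variants are preferred in practice.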
To achieve a satisfying performance, text classification models need a large number of annotated examples to learn from. As manual labeling is a resource-intensive task, active learning can alleviate some of the effort. Different feature extraction techniques, classification models, and query strategies might be used [74,81,82,83]. Prediction uncertainty-based query strategies are widely adopted and used with both single-model and committee-based [84,85] approaches. We are primarily interested in evaluating the strategies that tackle the trade-off between learning and recommendation, so we follow the conclusions from [81] to select the feature extraction method and classification model.

Proposed Architecture
To realize the system described in Section 1, we first drafted and iterated an architecture, which requires the following components (see Figure 1A):
• Database: stores operational data from the manufacturing plant. Data can be obtained from ERP, MES, or other manufacturing platforms;
• Knowledge Graph: stores data ingested from a database or external sources and connects it, providing semantic meaning. To map data from the database to the knowledge graph, virtual mapping procedures can be used, built considering ontology concepts and their relationships;
• Active Learning Module: aims to select data instances whose labels are expected to be most informative to a machine learning model and thus to contribute most to its performance increase when added to the existing dataset. Obtained labels are persisted to the knowledge graph and database;
• AI Model: aims to solve a specific task relevant to the use case, such as classification, regression, clustering, or ranking;
• XAI Library: provides insight into the AI model's rationale for the output produced for the input instance at the task at hand;
• User Interface: provides relevant information to the user through a suitable information medium. The interface must enable user interactions to create two-way communication between the human and the system.
The knowledge graph is a central component of the system. Instantiated from an ontology (see Figure 1B), it relates forecasts, forecast explanations, decision-making options, and feedback provided by the users. To ensure that the context regarding decision-making options and provided feedback is preserved, different relationships are established. The feedback entity directly relates to a forecast, forecast explanation, and decision-making option. While a chain of decisions can exist for a given forecast, there is a need to model the decision-making options available at each stage and the sequence in which they are displayed.
To that end, the decision-making snapshot entity aims to capture a list of decision-making options provided at a given point in time. A relationship between decision-making option snapshots (followedBy) provides information on such a sequence. For each decision-making snapshot, a selectedOption relationship is created to the user's selected decision-making option. A suggestsActionFor relationship is created between the forecast entity and entities that correspond to the first decision-making options displayed for that particular forecast. Since the decision-making options are linked to decision-making option snapshots and preserve a sequential relationship, all decision-making options can be traced back to the forecast that originated them.
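The traceability property described above can be sketched with a toy in-memory model of the ontology entities. The relationship names mirror the text (followedBy, selectedOption, suggestsActionFor); the Python representation itself is illustrative, not the actual knowledge-graph implementation:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Forecast:
    forecast_id: str

@dataclass
class DecisionOption:
    name: str

@dataclass
class Snapshot:
    """A decision-making snapshot: the options displayed at one point in time."""
    options: List[DecisionOption]
    forecast: Optional[Forecast] = None        # suggestsActionFor (first snapshot only)
    previous: Optional["Snapshot"] = None      # inverse of followedBy
    selected: Optional[DecisionOption] = None  # selectedOption

def originating_forecast(snapshot: Snapshot) -> Optional[Forecast]:
    """Walk the followedBy chain backwards to the forecast that started it."""
    while snapshot.previous is not None:
        snapshot = snapshot.previous
    return snapshot.forecast
```

Because every snapshot links to its predecessor, any option selected deep in the decision chain can be traced back to the forecast that originated it.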

Use Case
Demand forecasting is a key component of supply chain management since it directly affects production planning and order fulfillment. Accurate forecasts enable operational and strategic decisions regarding manufacturing and logistics for deliveries. We developed a model to forecast demand at a material and client level on a daily basis. The model was trained on three years of daily demand data, comprising 516 time series corresponding to 279 materials and 149 clients of a European automotive original equipment manufacturer. While the forecasts were created and evaluated for all of the clients and materials, we used a subset of them to evaluate the application (e.g., the forecast explanations, the media news we display, and the recommended decision-making options). We generated forecast explanations using the LIME library [54], but other approaches could be used too (e.g., LionForests [86] or Shapley values [87]). We implemented two strategies for decision-making options recommendations, which allowed us to schedule a new transport or choose among existing ones. The first one consisted of a set of heuristics that satisfy certain criteria (e.g., have enough capacity to satisfy the expected demand for a given client), while the second one was a knowledge-based recommender. To enhance the context understanding related to demand forecasting, we display media entries for predetermined topics (Automotive Industry, Global Economy, Unemployment, and Logistics) obtained from a media event retrieval system for that day. Media events are queried based on a set of keywords. We developed machine learning models to classify media entries as interesting to the users or not, and then gather labels from the users for new media entries.
Given there is no need to deliver such media news entries in real-time, we opted to follow a pool-based active learning strategy, persisting all media news event entries, and selecting those considered most informative from the pool of unlabeled data. To provide decision-making options to the users, we implemented two recommender systems: one based on heuristics, and a knowledge-based recommender system. We describe both in Section 6.

User Interface
To provide forecasts, forecast explanations, media news, and decision-making options to the user, we developed a user interface with five distinct parts (see Figure 2):
A. Media news panel: displays media news regarding the automotive industry, the global economy, unemployment, and logistics. The user can provide explicit feedback on them (whether they are suitable or not), acting as an oracle for the active learning classifier. Once feedback is provided, a new piece of news is displayed to the user.
B. Forecast panel: given the date and material, displays the forecasted demand for different clients. For each forecast, three options are available: edit the forecast (providing explicit feedback on the forecast value), display the forecast explanation, and display the decision-making options. Leaving a displayed forecast unedited is considered implicit feedback approving the forecasted demand quantities.
C. Forecast explanation panel: displays the forecast explanation for a given forecast. Our implementation displays the top three features identified by the LIME algorithm as relevant to the selected forecast. If users consider that a displayed feature does not explain the given forecast, they can provide feedback by removing it from the list.
D. Decision-making options panel: displays possible decision-making options for a given forecast or step in the decision-making process. In particular, the decision-making options relate to possible shipments. If no good option exists, the user can create their own.
E. Feedback panel: gathers feedback from the user to understand the reasons behind the chosen decision-making option. While some pre-defined reasons are shown to the user, we always include the possibility for users to add their own reasons and enrich the existing knowledge base. Furthermore, such data can be used to expand the feedback options displayed to users in the future.

Decision-Making Options Recommendation
Demand forecasts influence decision-making in a wide variety of scenarios: from raw material orders, to worker hiring and upskilling, to logistics arrangements to meet the required deadlines. Decision-making recommender systems can alleviate such decision-making by suggesting appropriate actions to the user based on the projected demand. In particular, we implemented a decision-making options recommender system considering the logistics use case. We consider two possible scenarios. The first scenario refers to the user who schedules a new transport for a given demand, material, client, and date. Here, the decision-making options are the possible transports, differing in transport type, delivery time, and price. The second possible scenario relates to the user who decides to use an existing transport. Here, each decision-making option selects one of the existing transports. In both steps, the recommendation module ranks the decision-making options from most to least relevant.
We developed two recommendation strategies: a heuristic-based and a knowledge-based approach. The heuristic-based recommender system follows simple rules, hand-crafted either by the domain expert or simply by the system's developer based on their incomplete knowledge of the problem. At each step, the user should have the possibility to select any of the possible options regardless of their ranking. Such a system has no learning capacity, and therefore has little potential to improve the users' experience. The recommendation quality directly depends on the quality of the designed rules. Examples of such heuristic rules are consistently ranking the transports according to price, or keeping only existing transports delivering in the client's proximity and ranking them according to their remaining capacity.
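A heuristic rule of this kind can be sketched as follows; the dictionary fields are hypothetical, chosen only to illustrate a capacity filter combined with a price-based ranking:

```python
def rank_transports_by_price(transports, demand):
    """Heuristic rule: keep only transports with enough remaining capacity
    for the forecasted demand, then rank them from cheapest to most expensive."""
    feasible = [t for t in transports if t["capacity"] >= demand]
    return sorted(feasible, key=lambda t: t["price"])
```

The rule encodes the developer's assumptions directly; improving the ranking requires editing the rule, since no learning takes place.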
The knowledge-based approach provides recommendations based on the feature vectors' similarity to a target vector describing users' requirements. To that end, each decision-making option at the given step is represented as a vector v. The representation captures all necessary information for the ranking, encoding the context up to the current step, the corresponding decision-making option, and its relation to all other possible decision-making options (the decision-making options snapshot). The model assigns the relevance score to each option based on v. The ranking is determined by sorting the scores from highest to lowest.
The representation and the underlying model should be expressive enough to cover the scenarios encountered in the use case. As with the heuristic-based strategy, domain knowledge heavily influences the design of features, but the content-based strategy provides greater flexibility. The features directly capture the context, which in our use case includes the forecasted demand, date, material, and client; the decision-making option, which in the case of scheduling a new transport includes the transport type, time of delivery, capacity, and price; and the relation of the decision-making option to the whole set of available options, to capture how this option differs from the others and why it should be preferred.
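The ranking step can be illustrated with cosine similarity between each option's feature vector v and the target vector describing the user's requirements; the similarity measure and the pure-Python representation are assumptions made for this sketch, not the system's actual scoring model:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equally sized feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_options(option_vectors, target):
    """Score each decision-making option vector against the target
    requirement vector and return indices from most to least relevant."""
    scores = [cosine(v, target) for v in option_vectors]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Any scoring model that maps v to a relevance score could be substituted here; sorting the scores from highest to lowest yields the displayed ranking.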
Among the constraints of our recommender system, we must mention that we had no data regarding the physical characteristics of each product we created demand forecasts for. In addition, while we had no information regarding the specific addresses of the clients ordering such products, we had information on the destination country. To mitigate these constraints, we collected pricing and delivery time information for air, land, and sea shipments, considering single standard forty-foot containers from Slovenia to fourteen countries. Such data was retrieved from two specialized web pages (we collected data regarding pricing and shipment time from World Freight Rates (https://worldfreightrates.com/freight) and SeaRates (https://www.searates.com/freight); we retrieved the data between 12 July and 16 July 2021). Finally, given that the application was not yet deployed to a production environment, we lack data regarding logisticians' interactions and choices, which would enable recommender systems' performance evaluation. We envision that more complex models can be developed in the future once data regarding users' interaction with decision-making options is obtained.

Active Learning for Media News Categorization and Recommendation
When providing a demand forecast and an explanation that conveys an intuition regarding the reasons behind it, the user can be interested in media news on events that can influence demand. In particular, when forecasting engine parts for the automotive industry, the user can be interested in news regarding the automotive industry, the global economy, unemployment, or logistics. While media news can be retrieved from news intelligence platforms, keyword-based queries can yield many false positives. It is thus imperative to develop a recommender system capable of discriminating and prioritizing good-quality news over those considered false positives. Furthermore, it is desired that such a model improves the quality of discrimination over time and requires as little manual labeling effort as possible. To realize this, we built a set of active learning binary classifiers, each one indicating whether a given piece of media news fits a specific category or not. We consider that the end user is at the same time the news consumer and the active learning oracle, providing feedback regarding unlabeled instances. In our design, we display the news and collect feedback on them in the same user interface. This poses an exploration-exploitation dilemma, since the same user interface space must be optimized to provide high-quality media news, balancing between entries for which high confidence in the category exists but which provide little additional information to the existing dataset, and entries for which the confidence is lower but which can provide a higher degree of novelty to the dataset [21]. In particular, each day, at most ten pieces of news per each of the four categories are displayed to the user. The user can then provide positive or negative feedback (a label) on each piece of news. The news should be informative for the system, as the goal is to achieve good classification performance as soon as possible.
On the other hand, the displayed news events should also be relevant so that the system is usable after the first few iterations. The set of news events displayed each day should therefore balance learning vs. recommendation (exploration vs. exploitation). In this research, we do not deal with the cold-start problem, since we consider it can be mitigated by pre-training the models with a set of manually annotated instances before starting the active learning dynamics. We have evaluated nine strategies (see Table 1) balancing learning and recommendation.

Random
Selects k random instances at each step.

Uncertain
Selects the k instances with the highest uncertainty scores at each step.

Certain
Selects the k instances with the lowest uncertainty scores, that is, the most certain examples.

Positive uncertain
Selects at most k instances that were labeled as positive by the classifier and have the highest uncertainty scores.

Positive certain
Selects at most k instances that were labeled as positive by the classifier and have the lowest uncertainty scores.

Positive certain and uncertain
Selects at most k/2 positive instances with the lowest uncertainty scores and at least k/2 instances with the highest uncertainty scores.
Alpha trade-off (α = 0.5, 0.75, 1.0)
We adapt the strategy proposed by [21], with the parameter α controlling the trade-off between learning and recommendation.

Different measures can be used to quantify the classification certainty of the model, which is needed in 5 out of the 9 strategies presented in Table 1. We use the uncertainty of classification, which is defined for a single sample x as U(x) = 1 − max_y P(y|x), where a higher value of U(x) means higher uncertainty. In the case of the SVM model, the distance to the separating hyperplane is an indicator of uncertainty, with the example having the smallest distance being the most uncertain [88].
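As a minimal sketch (with synthetic data standing in for the news classifiers, not the paper's dataset), the uncertainty score U(x) = 1 − max_y P(y|x) can be computed from any probabilistic classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def classification_uncertainty(model, X):
    """U(x) = 1 - max_y P(y|x): higher values mean less certain predictions."""
    proba = model.predict_proba(X)        # shape: (n_samples, n_classes)
    return 1.0 - proba.max(axis=1)

# Toy binary problem (illustrative only).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = (X_train[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X_train, y_train)

X_pool = rng.normal(size=(50, 5))
U = classification_uncertainty(model, X_pool)
most_uncertain = np.argsort(U)[::-1][:10]  # the k = 10 most uncertain instances
```

For a binary classifier, U(x) ranges between 0 (fully certain) and 0.5 (maximally uncertain), since the larger of the two class probabilities is always at least 0.5.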
The Uncertain strategy directly implements the uncertainty assumption that the labels of the instances with the highest classification uncertainty are the most informative. It solely focuses on learning, as such instances tend not to be the most relevant for recommendation. The Random strategy is included as a baseline, and so is the Certain strategy, which only selects the least uncertain instances, whose labels should provide the least value for the system according to the uncertainty assumption. To also address recommendation, the Positive uncertain strategy selects the instances labeled as positive by the model, as this already signals that an instance is likely to be relevant for recommendation, while the uncertainty might still provide some value for learning. The Positive certain strategy, on the other hand, also selects only positive instances but ranks them according to certainty, which should highly favor recommendation while providing little value for learning. The Positive certain and uncertain strategy tries to address both recommendation and learning by following the Positive certain strategy for the first k/2 instances (or fewer, if there are not enough positive instances) to provide relevant recommendations, and then following the uncertainty strategy to select at least k/2 instances relevant for learning.
The Alpha trade-off strategy is adapted from [21] and has a parameter α, used to control the balance between learning and recommendation. It selects the k instances whose predicted positive-class probability is closest to P_α, where P_α is the (100α)-th percentile of the distribution of positive-class probabilities induced on the pool of new examples. For example, P_0.5 equals the median and P_1.0 equals the maximum positive-class probability assigned to an instance from the pool. According to [21], α = 0.5 selects the instances with the highest uncertainty from the pool and thus favors learning, while α = 1.0 selects the most certainly positive instances and thus favors recommendation. Setting α = 0.75 can therefore provide a trade-off between learning and recommendation.
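Our reading of this selection rule can be sketched as follows (`p_pos` is a hypothetical array of the pool's predicted positive-class probabilities; the numbers are made up for illustration):

```python
import numpy as np

def alpha_tradeoff_select(p_pos, k=10, alpha=0.75):
    """Pick the k pool instances whose positive-class probability is closest
    to P_alpha, the (100*alpha)-th percentile of the pool's probabilities."""
    p_pos = np.asarray(p_pos)
    p_alpha = np.percentile(p_pos, 100 * alpha)
    return np.argsort(np.abs(p_pos - p_alpha))[:k]

p_pos = np.array([0.05, 0.2, 0.35, 0.5, 0.55, 0.7, 0.8, 0.9, 0.95, 0.99])
exploit = alpha_tradeoff_select(p_pos, k=3, alpha=1.0)  # most certainly positive
explore = alpha_tradeoff_select(p_pos, k=3, alpha=0.5)  # closest to the median
```

With α = 1.0 the selection gravitates towards the most confidently positive entries (recommendation), while α = 0.5 gravitates towards probabilities near the median, i.e., the most uncertain region for a binary classifier (learning).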
Positive certain and Alpha trade-off (α = 1.0) differ in that Positive certain is limited to instances that were labeled as positive and selects at most k of them with the lowest uncertainty scores (the highest probability of the positive class), while Alpha trade-off (α = 1.0) always selects k examples in order of decreasing positive-class probability. Similarly to Positive certain, the Positive uncertain strategy is also limited to examples that were labeled as positive and selects at most k of them with the highest uncertainty scores (the lowest probability of the positive class).
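The distinction can be made concrete with a small sketch (hypothetical probabilities; the 0.5 threshold stands in for the model's positive label):

```python
import numpy as np

def positive_certain(p_pos, k=10, threshold=0.5):
    """At most k instances the model labels positive, most certain first."""
    p = np.asarray(p_pos)
    pos = np.flatnonzero(p >= threshold)
    return pos[np.argsort(p[pos])[::-1]][:k]

def alpha_tradeoff_one(p_pos, k=10):
    """Always k instances, by decreasing positive-class probability."""
    return np.argsort(p_pos)[::-1][:k]

p_pos = np.array([0.1, 0.6, 0.3, 0.9, 0.2])
a = positive_certain(p_pos, k=3)     # only the 2 positives are returned
b = alpha_tradeoff_one(p_pos, k=3)   # always returns 3 indices
```

With only two instances above the threshold, Positive certain returns two indices, while Alpha trade-off (α = 1.0) fills the budget of three regardless of the predicted labels.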

Active Learning Experiments
The active learning experiments were performed on a dataset of media news events classified into four categories: Automotive Industry, Global Economy, Unemployment, and Logistics. The dataset was manually annotated by three human annotators, based on the specific keywords used to retrieve each category (see Table 2). The media news events were retrieved daily for a period of six months (from July 2019 to December 2019) from Event Registry [89], a well-established media events monitoring platform that has monitored mainstream media since 2014. The first month of the dataset was reserved for training the initial version of the models and for tuning the models' hyperparameters. The last month was reserved for testing the classification performance of the models at each active learning step. The remaining data were used to execute the active learning experiments and to evaluate the recommendation performance at each step. We provide an overview of the datasets in Table 3, reporting the number of instances per dataset split (initialization set, learning set used with AL, and test set). The datasets vary in size, with B being the smallest and D being almost two orders of magnitude larger than B. Further, we include the ratio of negative and positive instances, as labeled by the human annotators. The datasets are differently balanced, with D being the most imbalanced, as its ratio of positive instances is only 2.29%. We consider that the datasets' diversity strengthens the experiments, which are designed to evaluate diverse scenarios. In addition, Table 3 also reports the number of AL iterations per dataset (the number of days on which at least one news event is available) and the average maximum number of instances that could be queried for a category per day.
We conducted the experiments with a fixed budget of at most k data instances per day; note that the total number of queried instances is smaller than k times the number of days, since fewer than k instances were available on some days. Table 2. Active learning dataset categories, keywords used to query them, the number of instances per category, the number of days without entries for a given category, and the median of events per day for that category (MEPD). "# Instances" stands for "the number of instances". We executed the following procedure (see Figure 3). For each day, we retrieved all available events for that day and, for each media news entry, created the corresponding feature vector and assessed whether it should be displayed to the user to gather feedback (label the instance). This decision was made by a strategy (see Table 1) that considered how informative the news entries were with respect to the existing dataset and their quality towards the target category, given the requirement that the events should be both relevant for the user (recommendation quality) and informative for the model (improvement of classification). For each day, we selected at most k events, which were then shown to the user. We set k = 10, based on the median number of events per day and on the common practice of querying a fixed number of instances at each step [81]. Once a media entry was displayed to the user, it was incorporated into the existing dataset if the user provided an annotation. There are cases where the model can recommend fewer than k events, for example, because not enough events exist for a particular category on that day, or because only k′ < k of them are relevant or need a label. In such cases, fewer events of that category are displayed to the user.
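A minimal sketch of one day of this loop is shown below. All names are illustrative: `featurize`, `select`, and `oracle` stand in for the feature builder, one of the Table 1 strategies, and the user's feedback, respectively.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def daily_step(model, events, featurize, select, oracle, X_lab, y_lab, k=10):
    """Featurize today's events, let the strategy pick at most k to display,
    collect the user's labels, grow the labeled set, and retrain the model."""
    X_pool = featurize(events)
    shown = select(model, X_pool, k)          # indices of displayed events
    labels = np.array([oracle(events[i]) for i in shown])
    X_lab = np.vstack([X_lab, X_pool[shown]])
    y_lab = np.concatenate([y_lab, labels])
    model.fit(X_lab, y_lab)                   # batch models: full retrain
    return model, X_lab, y_lab, shown
```

A Random baseline strategy, for instance, would be `lambda model, X, k: np.random.default_rng(0).choice(len(X), size=min(k, len(X)), replace=False)`; online models would call `partial_fit` on the new labels instead of refitting on the whole set.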

To evaluate the models, we use a separate test set to measure the active learning models' classification performance, while the quality of the recommendations is measured at each step of active learning using the gold labels of the displayed news events. To measure the models' discrimination power, we adopted ROC AUC, a widely used metric due to its invariance to a priori class probabilities. To measure the quality of the recommendations, we adopted the MAP metric, which computes the precision of the recommendation set and is not affected by the number of entries considered in each particular case (a desired property when k′ < k media news events are shown to the user).
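A minimal sketch of the MAP computation over daily recommendation lists (the 0/1 relevance flags come from the gold labels; the example lists are made up):

```python
import numpy as np

def average_precision(relevance):
    """AP of one ranked list of 0/1 relevance flags. Lists shorter than k are
    handled naturally, which is why MAP suits the k' < k case."""
    rel = np.asarray(relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    prec_at = np.cumsum(rel) / (np.arange(len(rel)) + 1)  # precision@i
    return float((prec_at * rel).sum() / rel.sum())

def mean_average_precision(daily_lists):
    """MAP: the mean AP over the displayed lists of all AL steps."""
    return float(np.mean([average_precision(r) for r in daily_lists]))
```

For example, a day showing three items of which the first and third are relevant gets AP = (1 + 2/3)/2 ≈ 0.83, and a day showing a single relevant item gets AP = 1, regardless of the differing list lengths.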
Only the titles of the news events were considered for classification. We used two groups of classification models: those that are retrained on the labeled set at each iteration of AL, and those that are trained incrementally (online) on each newly presented labeled instance. In the first group, we used logistic regression (LR), a support vector machine (SVM), and a random forest (RF). Among the online algorithms, we trained an SGD-based logistic regression, a perceptron, and a passive-aggressive classifier (PA) [90], obtaining the best results with the latter. The selection of the batch models follows the related work [74,81], where the SVM model was identified as a frequent choice for active learning for text classification. The models that are retrained were also selected based on their fast training time.
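The two training regimes differ as sketched below (synthetic data for illustration): the PA classifier is updated one labeled instance at a time via `partial_fit`, while a batch model such as the SVM is refit on the full labeled set.

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] > 0).astype(int)

# Incremental (online) setting: one labeled instance at a time.
pa = PassiveAggressiveClassifier(random_state=0)
for xi, yi in zip(X, y):
    pa.partial_fit(xi.reshape(1, -1), [yi], classes=[0, 1])

# Batch setting: retrain from scratch on the whole labeled set.
svm = SVC(probability=True).fit(X, y)
```

The `classes` argument is required on the first `partial_fit` call so the online model knows the full label space before it has seen both classes.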
We experimented with three text representation techniques: a TF-IDF weighted BoW representation, which is a classical representation technique used for text classification and serves as a strong baseline in our experiments; an average of token embeddings from the RoBERTa model (we used the pre-trained "RoBERTa-base" model implemented in the Huggingface library [91]), which proved to be most effective for text classification based on the results obtained by [81]; and representations obtained from the Universal Sentence Encoder [77] (we used the model available at https://tfhub.dev/google/universal-sentence-encoder/4, last accessed in November 2021). The TF-IDF representation is not straightforward to adapt to the streaming setting (where one example at a time is available), so we used the simpler, hashing-based method (the hashing vectorizer from the scikit-learn library [92], available at https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html, accessed on 11 November 2021) for the streaming models instead. The hashing vectorizer creates features by transforming a collection of text documents into a matrix of token occurrences.
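The contrast between the batch TF-IDF pipeline and the stateless hashing trick used for the streaming models can be sketched as follows (the example titles are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer

titles = ["Car sales fall across Europe", "Unemployment rate hits record low"]

# Batch setting: TF-IDF must see the whole corpus first to compute IDF weights.
tfidf = TfidfVectorizer()
X_batch = tfidf.fit_transform(titles)

# Streaming setting: hashing is stateless, so each title can be vectorized
# on arrival into a fixed-size feature space with no fitting step.
hasher = HashingVectorizer(n_features=2**18)
X_stream = hasher.transform([titles[0]])
```

Because the hashing vectorizer never stores a vocabulary, the feature space stays fixed as new titles arrive, at the cost of occasional hash collisions and the loss of IDF weighting.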
Through our experiments, we focused on comparing a set of data selection strategies that retrieve unlabeled data from a pool of data instances. We simulated a realistic scenario, where the news events were presented at a daily level and the model received a mini-batch of labeled instances. We assume that the set of labeled instances always fits in the available memory, so the batch models can be retrained at each iteration to achieve the best performance. In addition to the batch models, we also tested several streaming models, of which the best performance was obtained with the Passive-Aggressive classifier, whose results are included in this work. Given that there is no need to display media news in real time and that providing them at a daily granularity is enough, having a pool of news provides greater flexibility when choosing unlabeled data instances and machine learning models. Nevertheless, our goal is to compare selection strategies; the models thus serve to indicate whether a strategy is consistently better across several aspects throughout the models of choice. Table 3. Overview of the active learning datasets with the total number of instances per split, the ratio of negative and positive instances as labeled by the human annotators, the number of iterations on the learning set when AL is used, and the maximum number of instances that could be queried with a budget of k = 10 instances per iteration. "# instances" denotes "the number of instances".

Results
In this section, we present the results obtained when conducting experiments with different AL strategies. Strategies and models were evaluated in the AL setting by following the procedure explained in Section 7.1. We report the classification performance as the ROC AUC score obtained in the last iteration of active learning, while the recommendation performance is reported with the MAP metric over all iterations. Further, we compare the different active learning strategies to determine which is most successful in tackling the learning versus recommendation trade-off. We provide additional results in Tables A1-A4 in Appendix A.

Evaluating the Classification Baselines
Before conducting the experiments, we established a baseline by training multiple supervised machine learning models on all available labeled data, excluding the test set. In the baseline, we also included a fine-tuned RoBERTa model. This set of models indicates the maximum performance we can expect with these datasets and their features. We report the baseline ROC AUC scores in Table 4. Table 4. Classification performance of the models trained on all labeled examples excluding the test set. The best score for each dataset is shown in bold. A-D correspond to the four datasets we used to conduct the experiments, which are described in Table 2. The TF-IDF representation is calculated on the whole training set, while the PA model is trained in the incremental learning setting.

From the baseline results shown in Table 4, we observe that the RoBERTa model achieves the best, or at least competitive, performance on all but a single dataset. This is expected, as fine-tuned language models are known to achieve state-of-the-art results on many text classification tasks. Still, the performance of the second-best model on each dataset is very close, providing a good alternative to the RoBERTa model in our case, since the other models usually require less time to train.
Fine-tuning the RoBERTa model is shown to be almost always better than using fixed RoBERTa representations with a classifier on our datasets. There is no clear winner in terms of representations, although the Universal Sentence Encoder (USE) appears to be a strong competitor (if not better) to the RoBERTa-based representations recommended by [81].
An unexpected finding was that models based on TF-IDF representations achieve very competitive performance; on text classification tasks, TF-IDF-based models usually lag behind neural approaches.
As mentioned in Section 7.1, the TF-IDF representation cannot be easily adapted to the streaming setting. Still, for a better comparison with the other models, we used a TF-IDF representation calculated on the whole training set before training the PA classifier. To also cover the real streaming setting, we used the hashing-based approach, referred to as Hashing in Table 4.

Evaluating the Classification Performance of AL Strategies
As an aggregation of the results from all experiments, we report the mean ROC AUC score and its standard deviation, aggregated over models and representations at the strategy and dataset level, in Table 5. This gives us insight into the actual classification performance of the strategies on each of the datasets. Table 5. Mean ROC AUC score with standard deviation, aggregated over used models and representations, for each strategy, reported on each of the four datasets. The best score for each dataset is shown in bold. A-D correspond to the four datasets we used to conduct the experiments, which are described in Table 2. We observe little difference in the final classification performance among the strategies in Table 5, although they follow very different instance selection policies. For example, the Uncertain and Certain strategies favor different (and, in a sense, complementary) subsets of instances, yet their performance does not appear to differ much.

As we aim to find strategies suitable for learning regardless of the model and representation used, we further compare the classification performance of the strategies across all datasets. First, we group the results by model, representation, and dataset. Then, inside each group, we sort and rank the strategies by their ROC AUC score. Finally, we report the mean rank for each strategy in Table 6. Additionally, for each active learning strategy, we compute the mean ROC AUC ratio towards the best strategy in the group (see Table 6). The mean rank gives us the ordering of the strategies. Finally, we determine the significance of the differences between them using the Wilcoxon signed-rank test [93] on the ROC AUC scores from all experiments, at a p-value = 0.005. According to the results in Table 6, there is little difference between the best seven strategies in terms of mean rank, and we observed no significant difference among them. We attribute this result to the large enough number of queried instances at each step (k = 10 in our experiments), which, for our datasets, allows us to cover a diverse set of instances regardless of the selection strategy. We did, however, observe a significant difference between the top seven strategies and the Positive certain and Positive uncertain strategies. We attribute this difference to the fact that these two strategies are limited to instances with a positive label assigned by the model, which can noticeably limit the labeled set obtained during active learning, whereas the other strategies always request labels for k instances at a given step. The Positive certain strategy selects the positive instances on which the model is most certain; however, despite the certainty, there is no guarantee that such instances are true positives. When the number of positive instances is less than or equal to k, the Positive certain and Positive uncertain strategies select the same instances.
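The ranking-and-testing procedure can be sketched with SciPy; the AUC numbers below are made up for illustration, with rows standing in for two strategies and columns for the (model, representation, dataset) groups:

```python
import numpy as np
from scipy.stats import wilcoxon, rankdata

# Rows: strategies; columns: (model, representation, dataset) groups.
scores = np.array([
    [0.91, 0.88, 0.84, 0.90, 0.87, 0.93, 0.85, 0.89],  # e.g., Uncertain
    [0.86, 0.85, 0.80, 0.88, 0.83, 0.90, 0.82, 0.84],  # e.g., Positive certain
])

# Rank strategies within each group (rank 1 = highest AUC), then average.
ranks = np.array([rankdata(-scores[:, j]) for j in range(scores.shape[1])]).T
mean_ranks = ranks.mean(axis=1)

# Paired non-parametric test on the per-group scores of two strategies.
stat, p = wilcoxon(scores[0], scores[1])
```

The Wilcoxon test is paired, i.e., it compares the two strategies on the same group at a time, which matches the grouping by model, representation, and dataset described above.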
We have observed that, on average, 69.28% of instances selected by the Positive certain strategy are positive while that percentage is 68.67% for the Positive uncertain strategy.
Furthermore, to evaluate whether active learning actually improves the performance of the models or whether training on the initialization set alone suffices, and to evaluate which strategies yield the highest performance increase, we report the mean change in ROC AUC score and its standard deviation, aggregated over models and representations at the strategy and dataset level, in Table 7. To determine the significance of the performance change (either improvement or degradation) when training with AL compared to training only on the initialization set, and to determine the significance of the differences in performance change among the strategies, we used the Wilcoxon signed-rank test at a p-value = 0.005. Table 7. Mean change in ROC AUC score with standard deviation, aggregated over used models and representations, for each strategy, reported on each of the four datasets. The best score for each dataset is shown in bold. A-D correspond to the four datasets we used to conduct the experiments, which are described in Table 2. It shows the change in performance when models trained on the initialization set are further trained with AL. For example, Alpha trade-off (α = 0.75) achieves 0.0360 ± 0.0416, 0.0492 ± 0.0340, 0.0250 ± 0.0202, and 0.0185 ± 0.0198 on datasets A-D, and Alpha trade-off (α = 1.0) achieves 0.0338 ± 0.0460, 0.0475 ± 0.0330, 0.0288 ± 0.0215, and 0.0264 ± 0.0293.

We observed that all strategies can either improve or degrade the performance of the model. However, the performance of models trained with AL is significantly better for all strategies except Positive uncertain and Positive certain, for which no significant change in performance was observed, i.e., neither a significant improvement nor degradation. When comparing the strategies, we did not observe any significant difference among the first seven strategies listed in Table 6, while we observed a significantly smaller improvement for the Positive certain and Positive uncertain strategies.

Evaluating the Recommendation Performance of AL Strategies
To evaluate one aspect of the strategies' recommendation performance, we aggregate the results from all experiments and report the mean MAP score and its standard deviation, aggregated over models and representations at the strategy and dataset level, in Table 8. MAP allows us to quantify, for each dataset, how accurate the recommended entries are while penalizing poor ordering within the top k entries.
We observe that the performance of strategies that focus on positively labeled instances, such as Positive certain or Alpha trade-off (α = 1.0), far exceeds the performance of uncertainty-focused strategies, such as Uncertain or Alpha trade-off (α = 0.5). This is especially evident on the strongly imbalanced datasets C and D, where there is a large number of negatives, that is, irrelevant news events. The large number of negatives also explains the poor performance of the Certain strategy, as the model tends to be most certain when classifying negatives. Table 8. Mean MAP score with standard deviation, aggregated over used models and representations, for each strategy, reported on each of the four datasets. The best score for each dataset is shown in bold. A-D correspond to the four datasets we used to conduct the experiments, which are described in Table 2. Further, to find the strategies that perform well in terms of MAP score regardless of the model and representation used, we compare them across all datasets following the same procedure as in Table 6 for classification performance, with the MAP score as the metric under consideration instead of ROC AUC. Results are reported in Table 9, where the mean rank is used to order the strategies. We determine the significance of the differences between the strategies using the Wilcoxon signed-rank test on the MAP scores from all experiments, at a p-value = 0.005. We can observe that the Positive certain strategy achieves significantly better performance than the others. Despite showing worse classification performance according to the results in Table 6, and thus yielding less capable classification models, it displays the most relevant instances to the user at each step. However, it has to be noted that this strategy displays far fewer instances than the others and thus might miss many relevant recommendations, achieving low recommendation recall.
The Alpha trade-off (α = 1.0), Positive uncertain, and Positive certain and uncertain strategies follow, with significantly worse performance. We can further observe a drop in performance after these first four strategies, which focus on positively labeled instances. The performance of Alpha trade-off (α = 0.75), which is meant to balance learning and recommendation, is not significantly different from that of the Uncertain strategy. The Random, Alpha trade-off (α = 0.5), and Certain strategies follow, again each with significantly worse performance, with the Certain strategy being significantly the worst-performing.

Another relevant dimension of recommender systems' performance is recall, which evaluates how many of the relevant instances were actually recommended and displayed to the user. While the MAP score measures how many of the k (or fewer) displayed instances are relevant and whether the relevant instances are shown first, the recall score measures the ratio of shown relevant instances to all relevant instances. We aggregate the results from all experiments and report the mean recall score and its standard deviation, aggregated over models and representations at the strategy and dataset level, in Table 10. Table 10. Mean recall score with standard deviation, aggregated over used models and representations, for each strategy, reported on each of the four datasets. The best score for each dataset is shown in bold. A-D correspond to the four datasets we used to conduct the experiments, which are described in Table 2. The Alpha trade-off (α = 1.0) strategy achieves the best mean recall score on all datasets and is closely followed by the Positive certain and uncertain strategy. Although the Positive certain strategy ranked first according to the MAP score (see Table 9), it evidently performs well in terms of precision by trading away recall.
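The precision-recall trade-off described above can be sketched as follows (the event ids are illustrative):

```python
def precision_recall(shown, relevant):
    """shown: ids of the events displayed to the user on a given day;
    relevant: ids of all relevant events available that day."""
    shown, relevant = set(shown), set(relevant)
    hits = len(shown & relevant)
    precision = hits / len(shown) if shown else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# A cautious strategy shows only 2 very certain positives out of 5 relevant:
p, r = precision_recall(shown=[3, 7], relevant=[1, 3, 5, 7, 9])
```

Here the cautious strategy scores perfect precision (and thus a high MAP) while missing 60% of the relevant events, which mirrors the behavior of the Positive certain strategy discussed above.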

To further compare the strategies regardless of the model and representation used, we follow the same procedure as for the MAP score (see Table 9). Results are reported in Table 11, where the mean rank is used to order the strategies. The significance of the differences between the strategies is determined using the Wilcoxon signed-rank test on the recall scores from all experiments, at a p-value = 0.005. We found that Alpha trade-off (α = 1.0) displayed the best performance, with a significant difference to the second best, the Positive certain and uncertain strategy. The Uncertain strategy follows, with significantly better results than the remaining strategies. It can be observed from Table 10 that the score of the Uncertain strategy is on par with the scores of the best two strategies on all datasets, and it does not decrease as much as the scores of the other, worse-performing strategies on dataset D. It might be that the uncertain instances are frequently from the positive class in our datasets. Next, we can observe a decrease in performance, with the differences between the following four strategies not being significant, and the Alpha trade-off (α = 0.5) and Certain strategies at the tail.
Through the classification and recommendation results, we have evaluated how well each strategy performs in terms of learning and recommendation and how its performance compares to the others. As just a single strategy is implemented in the active learner, it has to be the one that best balances learning and recommendation for the best user experience. Based on the results, we consider the Alpha trade-off (α = 1.0) strategy to be the best choice, followed by the Positive certain and uncertain strategy. The classification results (see Table 6) showed no statistically significant difference in performance between the best strategies. Although, based on the precision of recommendation (MAP score) results (see Table 9), Positive certain is the best-performing strategy, it only performs well in one aspect of recommendation and ignores recall. Both Alpha trade-off (α = 1.0) and Positive certain and uncertain are second tier in terms of MAP score, with the Alpha trade-off (α = 1.0) strategy being slightly better, while they rank first and second in terms of recommendation recall.

Conclusions and Future Work
The current work presents an architecture designed to acquire and encapsulate complex knowledge using semantic technologies and artificial intelligence. The system was instantiated for the demand forecasting use case in the manufacturing domain, using real-world data from partners in the EU H2020 projects STAR and FACTLOG. In particular, the system provides forecasts and explanations, enriches users' domain knowledge through a set of media news, recommends decision-making options, and collects users' feedback. Furthermore, the system uses active learning to reduce manual labeling effort and to better discriminate between good and bad media news reporting events related to the demand forecasting domain. A series of experiments was executed to understand which strategies best balance exploration and exploitation, which is required to learn from unlabeled media news entries while providing good recommendations to the users. We consider that the best performance was achieved by the Alpha trade-off (α = 1.0) and Positive certain and uncertain strategies, which displayed strong performance in terms of both MAP score and recall. While many improvements could be introduced to increase the classification performance on top of the existing datasets, our research mainly focused on evaluating the impact of each strategy on learning. Future work will explain the models' criteria for classifying the media news events and the uncertainty associated with each unlabeled entry. We expect that such explanations will enhance users' understanding of the underlying model and ease their labeling effort. Furthermore, users' feedback can be leveraged in an active learning schema to learn how they perceive the explanations and to enhance their quality over time [94]. Finally, we envision extending such explanations towards the decision-making recommendations to increase the transparency behind them.

Appendix A
We report the performance of the models trained only on the initialization set to assess whether active learning is really needed or whether the initialization set itself provides enough labeled examples for good performance. Table A1. Classification performance of the models trained on the initialization set. The best score for each dataset is shown in bold. A-D correspond to the four datasets we used to conduct the experiments, which are described in Table 2.
