## 2. Machine Learning

Machine Learning (ML) is a very practical field that offers solutions to many problems of daily life, which explains its enormous usefulness today [1]. This visible and convincing success is mostly due to three facts:

ML is grounded in Statistical Learning Theory (SLT), which provides a broad framework for studying fundamental questions of learning and inference, extracting knowledge, making predictions and decisions, and constructing formal models from data. Ultimately, SLT helps to design better learning algorithms [5,6,7].

**Uncertainty and Probabilistic Reasoning.** The basis for the great success of ML was laid more than 250 years ago by Thomas Bayes (1701–1761), whose work on decision making under uncertainty was communicated after his death by Richard Price (1723–1791) [8]. However, it was actually Pierre Simon de Laplace (1749–1827) who, some 20 years later [9], generalized these ideas and made the field of probabilistic reasoning accessible, usable and useful for today's computational approaches. A further success factor was the predictive power of Gaussian processes, which have been used successfully for dealing with stochastic processes in time [10]. A Gaussian process (GP) can be seen as a generalization of the normal probability distribution, which is named after Carl Friedrich Gauss (1777–1855), and it can be used as a prior probability distribution over functions [11]. This idea is surprisingly useful for dealing with high-dimensional data, because Bayesian inference can be applied easily; it thus unites a consistent view with computability. Moreover, it is fascinating that the probabilistic reasoning approach fits well with explanations of human learning and problem solving [12,13,14]. Furthermore, probabilistic programming is of great practical value. This programming concept differs from traditional programming in that parts of the program are not fixed in advance; instead, they take on values generated at runtime by random sampling procedures. A good example of this approach is the combination of probabilistic programming and Particle Markov Chain Monte Carlo (PMCMC), which allows automatic Bayesian inference on probabilistic models including stochastic recursion [15]; for an implementation in Python see [16]. Probabilistic programming adds two powerful constructs to functional or imperative programming [17]:

(1) the ability to draw values at random from probability distributions, and

(2) the ability to condition values of variables in a program via observations.

Many real-world problems of daily life can be addressed with probabilistic programs thanks to probabilistic inference, i.e., computing an explicit representation of the probability distribution that a probabilistic program specifies implicitly. Depending on the application, the desired output of the inference may vary: we may want to estimate the expected value of a function f with respect to the distribution, the mode of the distribution, or a set of samples drawn from this distribution [1].
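The two constructs and the three kinds of inference output can be made concrete with a minimal sketch. It is an illustration only: the coin model, the discrete prior `PRIOR_BIASES`, and the function names are assumptions made for this example, and plain rejection sampling stands in for the far more capable PMCMC inference mentioned above.

```python
import random
from collections import Counter

PRIOR_BIASES = [0.1, 0.3, 0.5, 0.7, 0.9]  # hypothetical discrete prior over a coin's bias

def coin_program(rng, n_flips=5):
    """A tiny generative program: draw a bias, then flip the coin n_flips times."""
    bias = rng.choice(PRIOR_BIASES)                        # construct (1): random draw
    heads = sum(rng.random() < bias for _ in range(n_flips))
    return bias, heads

def condition_by_rejection(observed_heads, n_runs=20000, seed=0):
    """Construct (2): keep only the runs whose simulated data match the observation."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_runs):
        bias, heads = coin_program(rng)
        if heads == observed_heads:
            kept.append(bias)
    return kept

samples = condition_by_rejection(observed_heads=5)         # a set of posterior samples
posterior_mean = sum(samples) / len(samples)               # expected value of f(bias) = bias
posterior_mode = Counter(samples).most_common(1)[0][0]     # mode of the posterior
print(len(samples), round(posterior_mean, 3), posterior_mode)
```

Rejection sampling is wasteful because it discards every run that disagrees with the observation, which is one reason why practical probabilistic programming systems rely on more elaborate inference engines.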

**Artificial Generation of Knowledge from Experience.** ML as a field of computer science started seven decades ago with ideas on developing algorithms that can automatically learn from data, gain knowledge from experience, and gradually improve their learning behaviour. The original definition was “the artificial generation of knowledge from experience”, and the first studies were performed with games [18]. While statistics aimed to give humans the tools to analyze data manually, the aim of ML was from the beginning to replace the human and, as we humans do, to learn automatically from data in order to make predictions and decisions. Consequently, ML has always been a field of overlapping interest between cognitive science and computer science [19]. The field has progressed enormously in the last two decades, with application successes in fields ranging from astronomy to zoology, mostly due to the availability of what is called “Big Data”, collected by satellites, telescopes, high-throughput machines, sensor networks, smart phones, etc. [20]. Best-practice examples today include autonomous vehicles, recommender systems, and natural language understanding [21]. Finally, the convincing successes of deep belief network approaches [4,22] made the field very prominent (see below).

Meanwhile, industry from Amazon to Zalando is investing heavily in research because it envisions enormous business potential in the near future, which also stimulates fruitful cooperation between academia and industry; even small companies have recognized the value of ML for solving a large variety of business-relevant problems [23]. Health informatics is among the greatest application challenges. This is not surprising, because medicine is a good example of a domain full of uncertainty, where we are constantly confronted with probabilistic, unknown, incomplete, heterogeneous, noisy, dirty, erroneous, inaccurate, and missing data sets in arbitrarily high-dimensional spaces, which poses grand challenges to ML [24,25].

**Inverse Probability Allows to Infer Unknowns and to Make Predictions.** ML builds mainly on three pillars of mathematics: linear algebra, optimization and probability theory, although many other mathematical areas are involved, see e.g., [26]. Probability theory [27] provides the mathematical language for representing and dealing with uncertainty, just as calculus is the language for representing and dealing with rates of change (refer to Zoubin Ghahramani (2013) [28]). The data are typically organized in arrays, where the rows represent the samples (data items) and the columns represent the attributes (features); each row can be seen as an n-dimensional vector of attributes and the array as a matrix. We can learn from data, even from high-dimensional data in ${\mathbb{R}}^{n}$, by transforming the prior probability distributions into posterior probability distributions. To illustrate this learning process, let us show a simple example here in ${\mathbb{R}}^{2}$.

Note: events are denoted by capital letters such as $A$; a random variable is also denoted by a capital letter, $X$, and takes values denoted by lower-case letters, $x$; the probability of an event is written with a capital $P$, as in $P\left(A\right)$. Values and events are connected through expressions such as “$X=x$”, i.e., the event that $X$ takes on the value $x$. A discrete random variable has a probability mass function, written with a lower-case $p\left(x\right)$, and the connection between $P$ and $p$ is $P(X=x)=p\left(x\right)$. A continuous random variable has a probability density function $f\left(x\right)$, and the connection between $P$ and $f$ is $P(a\leqq X\leqq b)={\int}_{a}^{b}f\left(x\right)dx$. In the following we use a lower-case ${h}_{n}$ to denote hypothesis $n$ and a lower-case $\theta $ to denote the hypothesized value of a model parameter; we use the capital letter $\mathcal{D}$ when talking about data as events and a lower-case $x$ when talking about data as values. The expression $p\left(x\right)$ with $0\leqq p\left(x\right)\leqq 1$ denotes the probability that $x$ is true. Following Bayes, instead of $x,y$ we now write $d$ for the data and $h$ for the hypothesis, and with the capital $\mathcal{H}=\{{h}_{1},{h}_{2},\dots ,{h}_{n}\}$ we define the hypothesis space; then

$$\forall (h,d):\quad p\left(h|d\right)=\frac{p\left(d|h\right)\,p\left(h\right)}{\sum_{{h}^{\prime}\in \mathcal{H}}p\left(d|{h}^{\prime}\right)\,p\left({h}^{\prime}\right)}$$

We can now switch to the ML notation by replacing the symbols: we replace $d$ by $\mathcal{D}$ to denote our observed data set, and we replace $h$ by $\theta $ to denote the (yet) unknown parameters of our model. $\overrightarrow{\theta}$ is called the parameter vector (the set of parameters that generated $(x,y)$), and the goal is to estimate $\theta $ from given $x$ and $y$. Consider $n$ data points contained in a set $\mathcal{D}={x}_{1:n}=\{{x}_{1},{x}_{2},\dots ,{x}_{n}\}$, let the likelihood be $p\left(\mathcal{D}|\theta \right)$, and specify a prior $p\left(\theta \right)$; we can then compute the posterior:

$$p\left(\theta |\mathcal{D}\right)=\frac{p\left(\mathcal{D}|\theta \right)\,p\left(\theta \right)}{p\left(\mathcal{D}\right)}$$

Figure 1 illustrates this learning process: we obtain the posterior probability function (green) by multiplying the prior probability (red) by the likelihood (blue) and dividing by the evidence (the normalization, which is computationally challenging in high-dimensional spaces). In short, the posterior is the likelihood times the prior, divided by the evidence, and this inverse probability allows us to learn from data, to infer unknowns and to make predictions [29].
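The rule "posterior equals likelihood times prior, divided by the evidence" can be traced numerically in one dimension. The following is a minimal sketch under stated assumptions: the coin-flip data, the flat prior, and the grid of 101 candidate values for the parameter are all invented for illustration and are not taken from Figure 1.

```python
def grid_posterior(thetas, prior, likelihood):
    """Return the normalized posterior over a grid of parameter values and the evidence."""
    unnormalized = [likelihood(t) * p for t, p in zip(thetas, prior)]
    evidence = sum(unnormalized)             # p(D): the normalizing constant
    return [u / evidence for u in unnormalized], evidence

data = [1, 1, 0, 1]                           # hypothetical coin flips (1 = heads)

def bernoulli_likelihood(theta):
    """p(D | theta) for independent flips with heads probability theta."""
    result = 1.0
    for x in data:
        result *= theta if x == 1 else (1.0 - theta)
    return result

thetas = [i / 100 for i in range(101)]        # grid over [0, 1]
prior = [1.0 / len(thetas)] * len(thetas)     # flat prior
posterior, evidence = grid_posterior(thetas, prior, bernoulli_likelihood)
theta_map = thetas[posterior.index(max(posterior))]   # most probable parameter value
print(theta_map)
```

On a grid the evidence is a simple sum; the number of grid points grows exponentially with the number of parameters, which illustrates why the normalization becomes a computational challenge in high-dimensional spaces.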

**Representation Learning and Context.** The performance of any ML algorithm depends on the choice of the data representations. Consequently, these data representations, also known as features, are key for learning and understanding (see also Section 3); hence, much effort in ML goes into the design of preprocessing pipelines and into data transformations and data mappings that yield a representation which supports effective ML. Current learning algorithms still have an enormous weakness: they are unable to extract the discriminative knowledge from the data. Consequently, it is of utmost importance to expand the universal applicability of learning algorithms and thus to make them less dependent on (hand-crafted) feature engineering. Bengio, Courville and Vincent (2013) [30] argue that this can only be achieved if the algorithms can learn to identify and to disentangle the underlying explanatory factors already present in the low-level data. This entails that a truly intelligent algorithm must understand the context and be able to discriminate between relevant and irrelevant features, much as we humans can. “What is interesting?” and “What is relevant?” are hard questions, and as long as we cannot achieve this grand goal with automatic approaches, we have to develop algorithms which can be applied by a domain expert. Such an expert is likely to be aware of what is interesting and relevant in his/her domain and can therefore design features more appropriately than a machine learning engineer, who is usually not a domain expert. This calls for a new kind of algorithm usability [31]. Switching back to our probabilistic perspective, learning features from data can be seen as recovering a parsimonious set of latent random variables (i.e., according to Occam’s razor, see [32] for a critical discussion) that represents a distribution over the observed data, expressed as a probabilistic model $p(x,h)$ over the joint space of the latent variables, $h$, and the observed data, $x$. This approach also fits well into the perspective of cognitive science [33].

**Automatic ML vs. Interactive ML.** The ultimate goal of the worldwide ML community is to develop algorithms/systems which can automatically learn from data without any human-in-the-loop [34]. This automatic machine learning (aML) works well when large amounts of training data are available [35]; consequently, “Big Data” is beneficial for automatic approaches. However, sometimes we do not have large amounts of data, and/or we are confronted with rare events and/or hard problems. The health domain is a representative example of a domain with many such complex data problems [24,36]. In such domains the application of fully automatic black-box approaches (“press the button and wait for the results”) seems elusive in the near future. Again, a good example is Gaussian processes, where aML approaches (e.g., kernel machines [37]) struggle on function extrapolation problems that are astonishingly trivial for human learners [38]. Consequently, interactive Machine Learning (iML), which integrates a human-in-the-loop (e.g., a human kernel [33]) or involves a human directly in the machine learning algorithm [39] and thereby makes use of human cognitive abilities, is a promising approach. iML approaches can be of particular interest for problems where big data sets are lacking, where we deal with complex data and/or rare events, and where traditional learning algorithms suffer from insufficient training samples. In the medical domain a “doctor-in-the-loop” can help with his/her expertise in solving problems which would otherwise remain NP-hard. Recent experimental work [40] demonstrates this usefulness on the Traveling Salesman Problem (TSP), which appears in a number of practical problems, e.g., finding the native folded three-dimensional conformation of a protein in its lowest free energy state; both 2D and 3D folding processes, viewed as free energy minimization problems, belong to a large set of computational problems assumed to be conditionally intractable [41]. As the TSP is about finding the shortest path through a set of points, it is an intransigent mathematical problem, for which many heuristics have been developed to find approximate solutions [42]. There is evidence that the inclusion of a human can be useful in numerous other problems in different application domains, see e.g., [43,44]. For clarification, however, iML means the integration of a human into the algorithmic loop, i.e., opening the black box to a glass box. Other definitions also speak of a human-in-the-loop, but they refer to what we would call classic supervised approaches [45] or, with a totally different meaning, to putting the human into physical feedback loops [46].
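To give a feeling for what an approximate TSP heuristic looks like, the following minimal sketch implements the classic nearest-neighbour rule: always travel to the closest city not yet visited. It is chosen here only for its brevity; it is not the method used in [40], and the four example points are invented.

```python
import math

def tour_length(points, order):
    """Length of the closed tour that visits the points in the given order."""
    return sum(
        math.dist(points[order[i]], points[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )

def nearest_neighbor_tour(points, start=0):
    """Greedy heuristic: from the current city, always go to the closest unvisited one."""
    unvisited = set(range(len(points))) - {start}
    tour = [start]
    while unvisited:
        here = points[tour[-1]]
        nxt = min(unvisited, key=lambda j: math.dist(here, points[j]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

cities = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]   # hypothetical coordinates
tour = nearest_neighbor_tour(cities)
print(tour, tour_length(cities, tour))
```

The greedy rule gives no optimality guarantee. In an interactive setting, a human could inspect such a tour and correct obviously poor segments, which is the kind of intervention in the algorithmic loop that iML aims at.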

**Deep Learning.** Last but not least, deep learning (DL) approaches should be briefly mentioned here, because they currently contribute heavily to the popularity of ML in the broader community in general and to the success of industrial applications in particular. Above we discussed the importance of learning representations. Deep learning approaches can be seen as representation learning methods with multiple levels of representation, consisting of a number of simple non-linear levels, where each level transforms the representation at that level into a representation at a higher, more abstract level. It is important to emphasize that the features are not hand-crafted but learned fully automatically from the data, layer by layer, using a general-purpose learning procedure [4]. The practical value has been proven in different applications, e.g., in computer vision [47], natural language understanding [48], connectomics (the study of brain circuits) [49], bioinformatics [50], health informatics [51,52,53], and physics [54], to point to only a few examples. DL also contributes to advances in implementing human-level intelligence [55,56] and hence to cognitive science. For an excellent overview and a good explanation of the history of deep learning, refer to Schmidhuber (2015) [3]. Finally, it should be mentioned that although deep learning achieves fantastic performance on particular tasks, it also has serious limitations: DL models are black-box approaches, for which it is currently difficult to explain how and why a result was achieved, and which consequently lack transparency and trust; they are prone to catastrophic forgetting, demand huge computational resources, and need enormous amounts of training data (often millions of training samples); above all, they are poor at representing uncertainties.

**Bayesian Deep Learning.** Neural network approaches have achieved surprising success in certain application areas (e.g., machine vision, machine reading, and machine hearing, to mention three); however, simply being able to see, read, and hear is far from being truly intelligent, i.e., from being able to understand the context. A good example is medical decision making: the medical professional looks at visible symptoms (e.g., on medical images), reads the corresponding report in the patient record, and hears the ailments of the patient. The medical doctor then has to look for relations among the different pieces of information, infer the etiology, make predictions and, finally, make decisions. Thanks to previous knowledge and experience, a human can deal with such uncertainties within a short time [57,58]. One of the pioneers in combining Bayesian networks with probabilistic approaches to model causality mathematically was Judea Pearl [2,59]. These insights call for merging probabilistic graphical models with deep learning approaches (see the survey by Wang and Yeung (2016) [60]). Neural network approaches (applied, e.g., for regression and classification) do not represent uncertainty well, whereas Bayesian models offer a mathematically grounded framework to reason about model uncertainty. Recently, Gal and Ghahramani (2016) [61] developed a new theoretical framework that casts dropout training in deep neural networks as approximate Bayesian inference in deep Gaussian processes; this provides new tools to model uncertainty with dropout neural networks and thus inspires future work. However, catastrophic forgetting remains a big problem of deep learning [62,63].
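The practical recipe that follows from the dropout view of [61] is often summarized as "keep dropout switched on at prediction time and repeat the forward pass many times". The sketch below illustrates only this recipe, not the theoretical derivation; the one-hidden-layer network, its hand-set weights, and all function names are assumptions made for the example.

```python
import random
import statistics

def stochastic_forward(x, hidden_units, out_weights, drop_prob, rng):
    """One forward pass through a small ReLU network with dropout on the hidden layer."""
    keep = 1.0 - drop_prob
    activations = []
    for weight, bias in hidden_units:
        h = max(0.0, weight * x + bias)
        # Inverted dropout: drop the unit with probability drop_prob, rescale otherwise.
        activations.append(h / keep if rng.random() < keep else 0.0)
    return sum(v * h for v, h in zip(out_weights, activations))

def mc_dropout_predict(x, hidden_units, out_weights, drop_prob=0.5, n_passes=200, seed=0):
    """Repeat the stochastic pass; the spread of the outputs serves as an uncertainty estimate."""
    rng = random.Random(seed)
    outputs = [stochastic_forward(x, hidden_units, out_weights, drop_prob, rng)
               for _ in range(n_passes)]
    return statistics.mean(outputs), statistics.stdev(outputs)

# Hypothetical, hand-set parameters standing in for a trained network.
hidden_units = [(1.0, 0.0), (-1.0, 1.0), (0.5, 0.5), (2.0, -1.0)]
out_weights = [0.5, -0.3, 0.8, 0.2]

mean_mc, std_mc = mc_dropout_predict(1.0, hidden_units, out_weights)
mean_det, std_det = mc_dropout_predict(1.0, hidden_units, out_weights, drop_prob=0.0)
print(mean_mc, std_mc, mean_det, std_det)
```

With `drop_prob=0.0` every pass is identical and the spread collapses to zero, which corresponds to an ordinary deterministic network that reports no uncertainty at all.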

**Deep Transfer Learning.** A very recent work by Lee, Kim, Lee and Yoon (2017) [64] advances deep learning for graph-structured data by incorporating another key concept: transfer learning (for more details see Section 4). Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) extract data-driven features from input data (e.g., image, video, and audio data) that are typically structured in low-dimensional regular grids. Grid structures are often assumed to have statistical characteristics (e.g., stationarity, locality, etc.) that facilitate the modeling process. Learning algorithms can take advantage of this assumption and boost performance by reducing the complexity of the parameters [65]. By overcoming the common assumption that training and test data should always be drawn from the same feature space and distribution, transfer learning between different task domains can alleviate the burden of collecting new data and training new models for a new task. Given the importance of structural characteristics in graph analysis, it is necessary to transfer the data-driven structural features learned by deep networks from a source domain to a target domain.

## 3. Knowledge Extraction (KE)

**Stochastic Ontologies.** The combination of ontologies with ML approaches is a hot topic that has not yet been extensively investigated and has great future potential, particularly in complex domains such as the health domain. This is because both ontologies and ML are indispensable technologies for domain-specific knowledge extraction and are actively used in knowledge-based systems. Little is yet known about how the two can be successfully integrated, because the two technologies are mainly used separately, without direct connection. Tsymbal et al. (2007) [66] emphasized that the knowledge extracted by the two techniques is complementary; consequently, significant benefits can be obtained by integrating both. A solution to this problem is of highest interest for health informatics, where relevant data sets are complex and high-dimensional with heterogeneous features [67], but where at the same time sophisticated bodies of knowledge have long been available, for example in the form of well-established classification systems including the Unified Medical Language System (UMLS), the International Classification of Diseases (ICD), or the Systematized Nomenclature of Medicine (SNOMED), as well as ontologies from the *omics data world including OMIM, GO, or FMA, just to mention a few.

Ontology learning is the trend towards the automatic, ML-based creation of ontologies, because hand-crafting ontologies is extremely labor-intensive and time-consuming. One example is the work of Balcan et al. (2013) [68], who present and analyze a theoretical model to understand and explain the effectiveness of ontologies for learning multiple related tasks from primarily unlabeled data. In this model they show that an ontology, which specifies the relationships between multiple outputs, is in some cases sufficient to completely learn a classification using a large unlabeled data source. Interestingly, the motivation for this work was the famous Never Ending Language Learning (NELL) project by the group of Tom Mitchell (2010) [69].

Features are key to learning and understanding. Andrew Y. Ng emphasizes in his courses that practical machine learning is feature engineering. Feature extraction and selection have become the focus of heavy research in areas for which data sets with hundreds of thousands of variables are available, e.g., natural language processing, gene expression arrays, or combinatorial chemistry [70]. The following sections present an incomplete, personally biased, but consistent overview of interesting topics relating to KE in natural language processing (NLP) and natural language understanding (NLU), with a focus on how to put it into a (personal) context.

**Data as Knowledge Triggers.** In his Stanford NLP lecture series, Christopher D. Manning (see also [71]) pointed out that human language in general is a symbolic/categorical signaling system; most of the information it conveys is not contained in the words or sentences themselves. Rather, it triggers within the brain of the recipient a whole slew of associations relating to that person’s specific experiences as well as something we might call world knowledge. Moreover, there is empirical evidence that, in some cases, a representation of the speakers’ intentions is helpful [72], and there is agreement that understanding language (not mere language processing) is more than the use of fixed conventions and/or the decoding of combinatorial structures, and that probabilistic modeling may be helpful here [73].

Consequently, language interpretation depends on uncertain real-world knowledge, common sense, and contextual knowledge, which explains the dominance of feature engineering tasks in the field of NLP and reduces the actual machine learning part to mere numerical optimization. Generally, the success of machine learning algorithms depends on feature learning, also known as representation learning, because different representations can entangle the explanatory factors of variation behind the data [30].

This contextual knowledge is significant even for the meaning of individual words; e.g., the word king triggers different associations depending on its usage within particular domains (history, chess, pop culture, pirate, etc.). Methods to automatically encode these conceptual peculiarities emerged only recently [74,75] and open up a multitude of new business application scenarios, especially pertaining to the analysis of small snippets of text which contain insufficient information for purely statistical analysis (bag-of-words methods, see e.g., [76], and compare with the feature hashing trick [77], which is analogous to the kernel trick [37,78]).
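For readers unfamiliar with the feature hashing trick, a minimal sketch may help. Instead of maintaining a vocabulary, each token is hashed directly to one of a fixed number of buckets, and a second hash bit decides the sign of the update so that colliding tokens tend to cancel rather than accumulate. The bucket count, the use of SHA-256, and the function name are assumptions of this illustration.

```python
import hashlib

def hashed_bag_of_words(text, n_buckets=16):
    """Map a text snippet to a fixed-length vector without building a vocabulary."""
    vector = [0] * n_buckets
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_buckets   # bucket for this token
        sign = 1 if digest[4] % 2 == 0 else -1                   # signed update
        vector[index] += sign
    return vector

snippet = "the king moves one square"
print(hashed_bag_of_words(snippet))
```

The vector length is fixed in advance and independent of the vocabulary size, which is what makes the trick attractive for very large feature spaces. Note that it still treats king as one and the same token in every domain, so it does not solve the context problem described above.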

However, aside from incorporating world knowledge into concept encodings, the main problem lies in the lack of sufficient personal context to extract not only knowledge but also meaning from texts and to provide individual recommendations. Such information can easily be found in social graphs, which embed individual data within neighborhoods whose structure encodes context. The problem with this approach, however, is incomplete knowledge about the graph structure, either because the data are unavailable or because they have been anonymized for security reasons (e.g., for the production of open data sets). This leads us to the question of minimal viable data sets and possible methods for reconstruction.

**Partial Context and Model (re-)construction.** In their work on Kronecker graphs (a generative model for networks), Leskovec et al. (2010) [79] asked themselves the interesting question "How can we generate synthetic, but realistic looking, time-evolving graphs?". Although graph generators had been around for a while at that time, they were mostly unable to produce graphs displaying real-world properties, such as heavy tails for degree distributions or densification and shrinking diameters over time. Viable social network generators could also help with supplementing partially known graphs and thereby enable ML approaches on much smaller or fragmented knowledge bases. In addition to generating realistic network structures, the task of re-populating anonymized feature vectors based on their structural embedding could prove crucial for practical ML. For instance, in the anonymized FB (Facebook) ego graphs of SNAP (Stanford Network Analysis Platform [80]), features such as university attended are represented only as anonymized feature xyz. Whereas this obviously tells us that all people who attended anonymized feature 223 attended the same university, we can only guess which school is represented by anonymized feature 224, which would result in an independent draw from whatever distribution we assume. Incorporating the social embedding of nodes into our model would make that draw depend on the values of connected nodes in the graph, allowing us to apply efficient sampling methods such as MCMC (Markov Chain Monte Carlo) [81] to the problem. As a result, ML performance on anonymized graphs could be boosted without any personal re-identification attempts, a crucial advantage as more and more countries adopt stringent data privacy and security laws [82].

A slightly different problem in model construction is that of finding suitable formal constraints from unstructured information formats. As an example we can take the construction of business process models from event-based data such as automatic log files or GitHub commit messages, which usually provide only positive examples of event paths but omit negative information, including state transitions that were prevented from taking place. The authors of [83] developed an algorithm that incorporates artificially generated negative events as additional constraints on the model, resulting in higher specificity, i.e., not allowing unintended, random behavior. Coupling this approach with the semantic embeddings described above could result in automatic sequence model extraction from unstructured and unprocessed data, with tremendous potential for the automatic exploration and sense-making of hitherto unspecified processes, e.g., disease stage development in the health sector or even research into underlying biological processes.

**Federated Learning and Client-side Learning.** As noted in [84], data are often not available in bulk but arrive sequentially over time, so it is necessary to update an already learned model in real time (also called sequential learning or online learning). This has the further advantage of computational simplicity, because the entire data structure need not be stored for model updates, especially when those adaptations can be performed in a decentralized manner. Taking the idea of knowledge extraction from partially known models to the extreme, one could propose learning schemes in which global models result partly or solely from a large number of clients possessing only fragmented views of the raw data. In a world permeated by smart devices with tremendous computing power and ubiquitous network access, such an approach could soon be poised to combine the above ideas into a powerful global knowledge extraction “organism”, which is the underlying idea of Google’s new federated learning approach [85]. In a recent work they trained a deep neural network (for an overview of deep learning in neural networks refer to [3]) in a federated learning model by applying distributed gradient descent across user-held training data on mobile devices [86], which is a current hot topic [87].
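The principle of distributed gradient descent over user-held data can be sketched in a few lines. The toy below is written in the spirit of federated averaging and is not a description of Google's system: the one-parameter model y = w * x, the three clients with their data, the learning rate, and the function names are all assumptions made for illustration.

```python
def local_update(w, data, lr=0.1, epochs=5):
    """Gradient descent on one client's private data for the model y = w * x."""
    for _ in range(epochs):
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(w_global, clients):
    """Clients train locally; the server averages the returned weights by data size."""
    total = sum(len(data) for data in clients)
    return sum(len(data) / total * local_update(w_global, data) for data in clients)

# Hypothetical user-held data sets; none of them ever leaves its client.
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(0.5, 1.5)],
    [(1.5, 4.5), (1.0, 3.0), (2.0, 6.0)],
]

w = 0.0
for _ in range(20):
    w = federated_round(w, clients)
print(w)
```

Only model parameters travel between the clients and the server. The raw data stay where they were collected, which is what makes such schemes attractive for privacy-sensitive domains such as health.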

Taking a step back from those futuristic perspectives, Leskovec et al. (2006) [88] conducted experiments on recommendation cascades, which are sequences of accepted and forwarded recommendations. Building on their insight that a vast majority of relevant recommendations within a social network originate from nodes within a radius of 1.2, and taking modern publish/subscribe architectures into account, we arrive at the idea of a local sphere of data permanently residing (and kept up to date) on clients such as smart phones or even Web browsers. Scalable recommender systems could thus be implemented at only a fraction of the cost and algorithmic complexity required today, paving the way for even greater “democratization” of Machine Learning related markets in the future.

## 4. Selected three Future Research Challenges

**Multi-Task Learning (MTL)** aims to improve prediction performance by learning a problem together with multiple different but related problems through shared parameters or a shared representation. The underlying principle is bias learning based on probably approximately correct learning (PAC learning) [89]. Finding such a bias is still the hardest problem in any ML task and is essential for the initial choice of an appropriate hypothesis space, which must be large enough to contain a solution and small enough to ensure good generalization from a small number of data sets. Existing methods of bias generally require the input of a human-expert-in-the-loop in the form of heuristics and domain knowledge to ensure the selection of an appropriate set of features, as such features are key to learning and understanding. However, such methods are limited by the accuracy and reliability of the expert’s knowledge (robustness of the human) and also by the extent to which that knowledge can be transferred to new tasks (see next subsection). Baxter (2000) [90] introduced a model of bias learning that builds on the PAC learning model and concludes that learning multiple related tasks reduces the sampling burden required for good generalization, and that a bias learnt on sufficiently many training tasks is likely to be good for learning novel tasks drawn from the same environment (the problem of transfer learning to new environments is discussed in the next subsection). A practical example is regularized MTL [91], which is based on the minimization of regularization functionals similar to those of Support Vector Machines (SVMs; a good introduction can be found in [92]), which have been used successfully in the past for single-task learning. The regularized MTL approach makes it possible to model the relation between tasks in terms of a novel kernel function that uses a task-coupling parameter, and it largely outperforms single-task learning using SVMs. However, multi-task SVMs are inherently restricted by the fact that SVMs require each class to be addressed explicitly with its own weight vector. In a multi-task setting this requires the different learning tasks to share the same set of classes. An alternative formulation for MTL is an extension of the large margin nearest neighbor algorithm (LMNN) [93]. Instead of relying on separating hyperplanes, its decision function is based on the nearest neighbor rule, which inherently extends to many classes and becomes a natural fit for MTL. This approach outperforms state-of-the-art MTL classifiers, and many research challenges remain open here [94].

**Transfer Learning** is the ability to learn tasks permanently, which is crucial to the development of any artificial intelligence. Humans can do this very well, even very young children. Neural networks (deep learning) are a good counterexample: in general they are not capable of it and are considerably hampered by catastrophic forgetting.

The synaptic consolidation in human brains enables continual learning by reducing the plasticity of synapses that are vital to previously learned tasks.

Kirkpatrick et al. (2017) [95] implemented an algorithm that performs a similar operation in artificial neural networks by constraining important parameters to stay close to their old values. As is well known, a deep neural network consists of multiple layers of linear projections followed by element-wise non-linearities. Learning a task basically consists of adjusting the set of weights and biases $\theta$ of the linear projections. Many configurations of $\theta$ will result in the same performance, which is relevant for the so-called elastic weight consolidation (EWC): over-parametrization makes it likely that there is a solution for task B, ${\theta}_{B}^{*}$, that is close to the previously found solution for task A, ${\theta}_{A}^{*}$. While learning task B, EWC therefore protects the performance in task A by constraining the parameters to stay in a region of low error for task A centered around ${\theta}_{A}^{*}$. This constraint has been implemented as a quadratic penalty and can therefore be imagined as a mechanical spring anchoring the parameters to the previous solution, hence the name elastic.
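The quadratic penalty described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the implementation from [95]: the names `ewc_penalty`, `train_task_b`, `importance` and `lam` are hypothetical, the task-B loss is a toy quadratic, and the per-parameter importance weights are hand-picked (in [95] they are derived from the diagonal of the Fisher information matrix).

```python
import numpy as np


def ewc_penalty(theta, theta_a_star, importance, lam=1.0):
    """Quadratic penalty: (lam / 2) * sum_i importance_i * (theta_i - theta_A*_i)^2."""
    diff = theta - theta_a_star
    return 0.5 * lam * np.sum(importance * diff ** 2)


def train_task_b(theta_a_star, theta_b_target, importance, lam=1.0,
                 learning_rate=0.05, steps=500):
    """Gradient descent on a toy task-B loss plus the EWC penalty.

    Toy assumption: the task-B loss is 0.5 * ||theta - theta_b_target||^2.
    """
    theta = theta_a_star.copy()  # start from the solution found for task A
    for _ in range(steps):
        grad_task_b = theta - theta_b_target
        # Gradient of the penalty: the "spring" pulling theta back to theta_A*.
        grad_penalty = lam * importance * (theta - theta_a_star)
        theta -= learning_rate * (grad_task_b + grad_penalty)
    return theta


theta_a_star = np.array([1.0, 1.0])    # solution previously found for task A
theta_b_target = np.array([3.0, 3.0])  # unconstrained optimum of the toy task B
importance = np.array([10.0, 0.0])     # hand-picked: only parameter 0 matters for task A

theta_b_star = train_task_b(theta_a_star, theta_b_target, importance)
print(theta_b_star)
```

In this toy setting, the parameter that is important for task A stays close to its old value, while the unimportant parameter moves freely to the task-B optimum.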

In order to justify this choice of constraint and to define which weights are most important for a task, it is useful to consider neural network training from a probabilistic perspective. From this point of view, optimizing the parameters is tantamount to finding their most probable values given some data $\mathcal{D}$. Interestingly, this can be computed as the conditional probability $p(\theta \mid \mathcal{D})$ from the prior probability of the parameters $p(\theta)$ and the probability of the data $p(\mathcal{D} \mid \theta)$ by Bayes' rule:

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$$

Here, the international research community is challenged to contribute to avoiding the problem of catastrophic forgetting, which is a hot topic with many open research avenues [63].

According to Pan and Yang (2010) [96], a major assumption in many ML algorithms is that the training data and future (unknown) data must be in the same feature space and have the same distribution. In many real-world applications, particularly in the health domain, this is not the case: sometimes we have a classification task in one domain of interest but sufficient training data only in another domain of interest, where the latter data may be in a completely different feature space or follow a different data distribution. In such cases, transfer learning would greatly improve the performance of learning by avoiding much expensive data-labeling effort; however, many open questions remain for future research [97].

**Multi-Agent Systems (MAS)** are collections of many agents interacting with each other. The agents can either share a common goal (for example, an ant colony, a bird flock, or a fish swarm) or pursue their own interests (for example, as in an open-market economy). MAS are traditionally characterized by the facts that (a) each agent has incomplete information and/or capabilities for solving a problem; (b) agents are autonomous, so there is no global system control; (c) data is decentralized; and (d) computation is asynchronous [98]. Of particular interest for the health domain is the consensus problem, which formed the foundation for distributed computing [99]. Its roots are in the study of (human) experts in group consensus problems: consider a group of humans who must act together as a team, where each individual has a subjective probability distribution for the unknown value of some parameter. A model that describes how the group reaches agreement by pooling their individual opinions was described by DeGroot (1974) [100] and was used decades later for the aggregation of information with uncertainty obtained from multiple sensors [101] and medical experts [102]. On this basis, Olfati-Saber et al. (2007) [103] presented a theoretical framework for the analysis of consensus algorithms for networked multi-agent systems with fixed or dynamic topology and directed information flow. In complex real-world problems, e.g., the epidemiological and ecological analysis of infectious diseases, standard models based on differential equations very rapidly become unmanageable because of too many parameters, and here MAS can also be very helpful [104]. Moreover, collaborative multi-agent reinforcement learning has a lot of research potential for machine learning [105] and is very suitable for collaborative interactive machine learning [106].
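The opinion-pooling idea behind the consensus model mentioned above can be illustrated with a short Python sketch. It assumes a simplification: each agent holds a single scalar estimate rather than a full probability distribution, and in every round each agent replaces its estimate with a weighted average of all estimates. The function name `degroot_consensus`, the trust matrix and the initial opinions are hypothetical values chosen for illustration.

```python
import numpy as np


def degroot_consensus(weights, opinions, tol=1e-10, max_iter=10_000):
    """Iterate x <- W x until the opinions stop changing.

    weights[i][j] is the weight agent i gives to the opinion of agent j;
    every row must sum to 1.
    """
    w = np.asarray(weights, dtype=float)
    x = np.asarray(opinions, dtype=float)
    if not np.allclose(w.sum(axis=1), 1.0):
        raise ValueError("each row of the weight matrix must sum to 1")
    for step in range(1, max_iter + 1):
        x_next = w @ x  # every agent pools the current opinions
        if np.max(np.abs(x_next - x)) < tol:
            return x_next, step
        x = x_next
    return x, max_iter


# Three experts; experts 0 and 2 only listen to themselves and to expert 1.
trust = [[0.50, 0.50, 0.00],
         [0.25, 0.50, 0.25],
         [0.00, 0.50, 0.50]]
initial_opinions = [0.2, 0.5, 0.9]  # e.g., subjective estimates of a probability

consensus, rounds = degroot_consensus(trust, initial_opinions)
print(consensus, rounds)
```

For this trust matrix all three agents converge to a common value, a weighted average of the initial opinions in which the well-connected expert 1 has the largest influence.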

## 5. Benefits of the New Journal MAKE

There are excellent and well-established top journals in the field, for example Machine Learning (MACH), the Journal of Machine Learning Research (JMLR), and Knowledge and Information Systems (KAIS), to mention just three.

Springer **Machine Learning** (MACH) has been in operation since 1986 and is an established international forum for research on computational approaches to learning. The journal publishes articles reporting substantive results on a wide range of learning methods applied to a variety of learning problems. In 2001, forty editors and members of the editorial board of Machine Learning resigned in order to support the **Journal of Machine Learning Research** (JMLR), which was at that time the pioneering journal in machine learning: available online, open access, and with the copyright remaining with the authors. The JMLR is now the top-end journal and the benchmark of the field.

Springer **Knowledge and Information Systems** (KAIS) has been in operation since 1999 and provides an international professional forum for advances on all topics related to knowledge systems and information systems. The journal focuses on systems, including their theoretical foundations, infrastructure and enabling technologies.

The journal for **MAchine Learning & Knowledge Extraction** (MAKE) is a peer-reviewed open access journal and the copyright remains with the authors. The publisher is the Multidisciplinary Digital Publishing Institute (MDPI), headquartered in Basel (Switzerland) with offices in Europe and China.

Unique features include:

Promotion of a cross-disciplinary, integrated machine learning approach addressing seven sections to concert international efforts without boundaries, supporting trans-disciplinary and cross-domain collaboration between experts from these seven disciplines (see next section for details);

Appraisal of these different fields to foster diverse perspectives and opinions, hence offering a platform for the exchange of novel ideas and a fresh look at methodologies, to put crazy ideas into business for the benefit of humans and, additionally, to foster education (see details below);

Stimulation of replications and further research by including data and/or software with the full details of experimental work as supplementary material, if it cannot be published in a standard way, or by providing links to repositories (e.g., GitHub), for the benefit of the international research community (see issues of availability, usability and acceptance, below).

**Machine Learning Education.** The advances of machine learning research and its practical success in many different domains call worldwide for a new kind of research-oriented graduate. Keeping students up to date with the most recent material in such an innovative field is not an easy task. In a recent talk, Nando de Freitas pointed out that deep learning research alone is like playing with a huge number of Lego blocks. Finding and putting together the right blocks is difficult. An integrative machine learning approach also calls for an integrated teaching approach and needs a concerted effort of the various disciplines.

In innovative and rapidly changing areas, the application of Research-Based Teaching (RBT) approaches can be of great help [107], where, e.g., the curriculum is designed around current research topics, always grounded in relevant and necessary fundamentals. A sample curriculum for a course on “machine learning in health informatics” is described in [108].

Consequently, the journal supports educational efforts, particularly in the form of valuable, concise, strictly peer-reviewed tutorial papers, similar to those of the IEEE Signal Processing Magazine, which is doing an excellent job for the benefit of its community; see three examples [109,110,111].

**Responsibility, Ethical/Social Issues, Law, Technology Assessment.** Both scientists and engineers are responsible for their developments. This is particularly true for the field of machine learning and its implications for our society. The enormous future potential of machine learning specifically, and artificial intelligence generally, requires not only taking on social responsibility but also maximising the social benefit of these technologies [112]. Here it is important not only to take care of ethics in the sense of how humans use computational approaches, but also to deal with machine learning ethics, i.e., the ethical dimension of ensuring that the behavior of machines toward human users is ethically acceptable [113]. This is of increasing importance in learning machines, autonomous systems and decision making [114,115,116].

Critical discussions of social implications are therefore of utmost importance, in combination with issues of regional, national, transnational and international laws, directives and regulations, with a strong focus on privacy, data protection, safety and security (which is a section of its own in the integrated approach, see next chapter).

**Availability, Usability, Acceptance.** The value of machine learning algorithms for the progress of the international research community depends to a large extent on three important issues: availability, usability, and acceptance.

The problem remains that much of the potential of sophisticated methods and tools cannot be used because of a lack of availability, interoperability, and reproducibility. Another huge obstacle is the lack of usability of the available machine learning methods and tools, which often makes it hard for a domain expert to apply them. This calls for adequate machine learning usability [118].

It is understandable that all the topics mentioned on the previous pages cannot be tackled within one single discipline; instead, this needs a combined effort of various sections, brought together in a concerted integrative approach. This leads us to the last open question: What is this "integrative machine learning" approach?

## 6. Integrative Machine Learning

The meaning of the words integrative or integrated stems from the Latin integratus, which means “make whole”, i.e., “to put together parts or elements and combine them into a harmonious, interrelated whole, so that constituent units function in a cooperative manner”.

Although machine learning has many awesome theoretical aspects and is deeply grounded in the field of artificial intelligence (AI) [29,119], it should always be emphasized that machine learning is a very practical field with many diverse application areas. Just three decades ago, the field was a small niche with few applications. Meanwhile, it has evolved into a dominant, constantly growing field with many facets of enormous breadth and depth.

Such a field needs an integrative approach.

Integrative/Integrated Machine Learning is based on the idea of combining the best of the two worlds dealing with understanding intelligence, which is manifested in the HCI–KDD approach [120,121,122]: Human–Computer Interaction (HCI), rooted in cognitive science and particularly dealing with human intelligence, and Knowledge Discovery/Data Mining (KDD), rooted in computer science and particularly dealing with computational intelligence [67]. This approach fosters a complete machine learning and knowledge extraction (MAKE) pipeline, ranging from the very physical issues of data pre-processing, mapping and fusion of arbitrarily high-dimensional data sets (see right side in Figure 2) up to the visualization of the results in a dimension accessible to a human end-user and making data interactively accessible and manipulable (left side in Figure 2).

**Cognitive Science** studies the principles of human learning from data to understand intelligence. The motto of Demis Hassabis of Google DeepMind is

“Solve intelligence—then solve everything else” (see also: [

55]). Our natural surroundings are in ${\mathbb{R}}^{3}$, and humans are excellent at perceiving patterns in data sets with dimensions of $\le 3$. In fact, it is amazing how humans extract so much knowledge from so little data [19], which is a perfect motivator for the concept of interactive Machine Learning (iML), i.e., using the experience and knowledge of humans to help solve problems that would otherwise remain computationally intractable. However, in most application domains, e.g., in the health informatics domain, we are challenged with data of arbitrarily high dimensions [25]. Within such data, relevant structural patterns and/or temporal patterns (“knowledge”) are hidden; this knowledge is difficult to extract and hence not accessible to a human. There is a need to bring the results from the high dimensions into a lower dimension, because humans work on 2D surfaces on different devices (from tablet computers to large wall displays), and hence the representation is limited to ${\mathbb{R}}^{2}$.
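As a small illustration of bringing high-dimensional data down to ${\mathbb{R}}^{2}$, the following Python sketch projects a data set onto its first two principal components. Principal component analysis (PCA) is only one of many possible mappings and is chosen here as an assumption for illustration; the text does not prescribe a particular method. The function name `project_to_2d` and the random 50-dimensional data are hypothetical.

```python
import numpy as np


def project_to_2d(data):
    """Project the rows of `data` onto the first two principal components."""
    x = np.asarray(data, dtype=float)
    centered = x - x.mean(axis=0)
    # The rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
points = rng.normal(size=(200, 50))  # 200 samples in a 50-dimensional space
embedding = project_to_2d(points)    # 200 points that can be drawn on a 2D surface
print(embedding.shape)
```

A linear projection such as this keeps the directions of largest variance but can hide non-linear structure, which is one reason why the mapping step remains a hard task.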

**Computer Science** studies the principles of computational learning from data to understand intelligence [21]. Computational learning has been of general interest for a very long time, but we are far away from solving intelligence: facts are not knowledge and descriptions are not insight. A good example is the famous book “Principles of Neural Science” by Nobel prize winner Eric Kandel [123], which doubled in volume every decade; effectively, the goal should be to make this book shorter.

At a high level, cognitive science and machine learning had little overlap in the past. Most computer engineers were interested in their machines and not in any human factors. At the same time, cognitive scientists rarely showed interest in computational approaches. Actually, it was the great practical success of machine learning in the last two decades that brought the two together. Many successful people in the community nowadays have a background in both cognitive science and computer science and are fostering a close collaboration between both fields.

Even at a low level, HCI and KDD did not harmonize in the past. HCI had its focus on specific experimental paradigms, embedded rather in psychological issues, aiming to be cognitively plausible and resulting in nagging about design issues. KDD had its focus on computational learning problems, embedded in engineering, thereby focusing on algorithm optimization at small scale and rather ignoring any design issues concerning a possible end user.

Consequently, a concerted effort of both worlds, along with a multi-disciplinary skill set encompassing various specializations, can be highly beneficial for tackling the challenges of the future: to help understand intelligence and to develop software that learns from experience, much as we humans do.

The MAKE-topics may be illustrated (see Figure 2) by seven sections with the aim of fertilizing cross-disciplinary thinking. It is well known that scientific progress often emerges at the overlapping areas of seemingly distinct sections. In the following, only a high-level description is given (a description of the challenges of each section is beyond the scope of this inaugural paper and could be on the agenda for future work).

**The MAKE-Topics may be illustrated by 7 sections (see Figure 2):**

**Section 1: Data: Data preprocessing, integration, mapping, fusion.** This starts with understanding the physical aspects of raw data and fostering a deep understanding of the data ecosystem, particularly within an application domain.

**Section 2: Learning: Algorithms.** The core section deals with all aspects of learning algorithms: the design, development, experimentation and evaluation of algorithms in general and their application to specific application domains in particular.

**Section 3: Visualization: Data visualization, visual analysis.** At the end of the pipeline there is a human, who is limited to perceiving information in dimensions $\le 3$. It is a hard task to map the results, gained in arbitrarily high-dimensional spaces, down to the lower dimensions, ultimately to ${\mathbb{R}}^{2}$.

**Section 4: Privacy: Data Protection, Safety & Security.** With worldwide increasing demands from data protection laws and regulations (e.g., the new European Union data protection rules), privacy-aware machine learning becomes a necessity, not an add-on. New approaches, e.g., federated learning and glass-box approaches, will be important in the future. However, all these topics need a strong focus on usability, acceptance and social issues.

**Section 5: Network Science: Graph-Based Data Mining.** Graph theory provides powerful tools to map data structures and to find novel connections between data objects, and the inferred graphs can be further analyzed using graph-theoretical, statistical and ML techniques.

**Section 6: Topology: Topology-Based Data Mining.** The most popular techniques of computational topology include homology and persistence, and their combination with ML approaches would have enormous potential for solving many practical problems.

**Section 7: Entropy: Entropy-Based Data Mining.** Entropy can be used as a measure of uncertainty in data and thus provides a bridge to theoretical and practical aspects of information science (e.g., the Kullback–Leibler divergence as a measure of the difference between probability distributions).
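The two quantities named in Section 7 can be computed directly for discrete distributions. The following Python sketch is a minimal illustration; the function names `shannon_entropy` and `kl_divergence`, the choice of base-2 logarithms (results in bits) and the example distributions are assumptions made here for illustration.

```python
import numpy as np


def shannon_entropy(p):
    """Entropy H(p) = -sum_i p_i * log2(p_i) in bits; zero entries contribute 0."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return float(-np.sum(p[nonzero] * np.log2(p[nonzero])))


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) = sum_i p_i * log2(p_i / q_i) in bits."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if np.any((q == 0) & (p > 0)):
        return float("inf")  # p puts mass where q has none
    nonzero = p > 0
    return float(np.sum(p[nonzero] * np.log2(p[nonzero] / q[nonzero])))


fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]

print(shannon_entropy(fair_coin))    # maximal uncertainty for two outcomes
print(shannon_entropy(biased_coin))  # lower uncertainty
print(kl_divergence(fair_coin, biased_coin))
print(kl_divergence(biased_coin, fair_coin))
```

The last two values differ, which shows that the Kullback–Leibler divergence is not symmetric and therefore not a distance in the strict metric sense.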