1. Introduction
Many biological circuits sense danger. Some respond to common molecular patterns associated with attack. Others perceive environmental threats for which fear or fighting may be helpful [1,2,3,4,5].
An unusual or surprising environment provides another clue of danger. For example, the absence of an expected event could signal an anomaly. The famous comment by Sherlock Holmes about the dog that did not bark illustrates an anomalous absence of an expected event [6].
A Scotland Yard detective asked Holmes: “Is there any other point to which you would wish to draw my attention?” Holmes answered: “To the curious incident of the dog in the night-time.” The detective replied: “The dog did nothing in the night-time.” Holmes countered: “That was the curious incident.”
Intuitively, humans have a sense of anomaly, when unexpected events trigger heightened alertness. The word “eerie” captures the notion of discomfort when “things don’t add up” in an unfamiliar situation.
For these reasons, anomaly detection focuses on deviations from what is typical. An anomaly detection circuit must learn an internal model of the typical pattern. Any departure from that model triggers a warning. This approach contrasts with detecting specific danger signals that directly indicate peril, instead emphasizing deviations from common observations.
In mammalian brains, hippocampal circuits detect anomalies [7,8,9]. Immune systems may have such circuits [10,11]. Self versus nonself recognition is not fully understood [12] and might, in some cases, depend on detecting anomalous patterns as nonself. Plants might use anomalous volatile organic compounds of neighbors as nonspecific danger signals [13,14]. However, few biological studies emphasize nonspecific anomaly detection.
This article introduces anomaly detection in machine learning [15,16,17,18]. Computational models use a wide variety of circuit types to detect anomalies. Those different types of computational circuits suggest the kinds of biological circuits that might detect anomalies. Because anomaly detection is a type of classification problem, aspects of this topic also provide insight into other biological classification challenges.
2. Contributions of This Work
2.1. Overview of the Series
This article continues the series on circuit design in biology and machine learning [19]. The series uses insights from machine learning to understand how evolutionary processes build biological circuits. The first article in the series introduced the motivation and challenges for linking biological and machine learning circuits, with examples [19]. This subsection adds further background.
Three facts suggest that machine learning may provide insight into the evolutionary aspects of biological design. First, machine learning and biological organisms often face similar challenges. How can environmental inputs be classified into categories? How can a system predict future inputs? What is the best response to a type of environment?
Second, natural selection is one type of learning algorithm. Machine learning deploys a broader range of algorithms. But those different algorithms tend to modify systems in broadly similar ways [20,21,22].
Third, machine learning and biology often solve challenges by using a computational network to build an input–output response circuit. Here, we think of a biochemical network as a kind of circuit that takes inputs and computes outputs. When machine learning computes solutions without an explicit network, usually the computation can be embedded within a network to achieve the same result.
The fact that machine learning and biology typically build responses by creating computational circuits means that we can study how machine learning solves particular kinds of problems and use those solutions to make predictions about how evolutionary processes design biological circuits to solve the same sorts of challenges.
This series emphasizes simple biochemical circuits, primarily in cells. The analogy between neurobiological and machine learning circuits is well known, although directly linking the architecture and function of biological and computational circuits remains an ongoing challenge [23,24,25,26,27].
By contrast, relatively little work has been conducted to match cellular or physiological circuits to common machine learning architectures. Two challenges arise.
First, although many biochemical circuits in cells have been identified and partially understood, it is not easy to describe complete circuits, understand their computational architecture, and evaluate the sorts of computations that are used to achieve their function.
Second, computational networks in machine learning tend to be much larger than could reasonably fit within a cell. Thus, we must develop new machine learning models that emphasize greatly reduced size.
Given those constraints, this series primarily aims to outline a new theory that links these two subjects. Some general predictions arise about the architecture of biological circuits. Overall, these articles show the broad conceptual links between particular external challenges and the types of biological circuits that may be favored by evolutionary processes.
2.2. Insights from Anomaly Detection
This article develops the following points, often with simple illustrative models and example quantitative analyses.
Machine learning provides new ideas for how cellular and physiological circuits may solve anomaly detection.
Some challenges require evaluating a single atemporal multivariate input for anomalies. Others require estimating deviations from recent temporal trends. Simple models illustrate different circuit designs for atemporal and temporal cases.
Detecting anomalies often requires evaluating multivariate patterns in inputs by integrating signals from ensembles of sensors or receptors. This article reviews basic measures of signal information.
Digital sensors reduce continuous analog inputs to discrete binary outputs, losing information but also reducing sensitivity to noise and measurement error. Digital sensors are easier to implement and easier to combine into broader circuits.
Machine learning uses large circuits. Cells require small circuits. This article shows that small circuits can achieve significant resolving power.
Some anomalies differ in mean input values. Summing the inferences by individual sensor outputs provides a good response.
Other anomalies differ in correlations between inputs. Decision trees work well, each sensor responding within a sequence based on the output of prior sensors.
Machine learning often deploys cascades of circuits, such as a cascade of separate decision trees.
Each small circuit passes its response to the next circuit, which corrects errors and boosts response quality.
Learning a sequence of boosted circuits matches the likely way that evolution works, sequentially improving an existing cascade of small modular subsolutions.
Dimensional reduction provides a potential alternative for anomaly detection. Typical multivariate inputs can be reduced to a lower dimension, similar to principal component analysis. An anomalous input tends to be relatively distant from typical inputs in the reduced space.
Small encoder circuits can reduce dimensionality, classifying differences in the correlational structure of typical and anomalous inputs. In general, dimensional reduction is likely to be a major feature of biological circuits.
As in all problems of biological design, evolutionary tuning with respect to tradeoffs inevitably plays a central role in shaping biological circuits.
3. Timescale
3.1. Instantaneous Versus Time-Dependent Inputs
Timescale broadly influences the kinds of circuits that can succeed in anomaly detection. Most anomaly detection methods consider multiple inputs at one point in time. If a single multivariate input is unusual compared with the set of typical multivariate points, then that unusual input is classified as an anomaly.
In some cases, an anomaly must be considered with respect to recent temporal trends [28,29]. For example, reactive oxygen species are often used as weapons in microbial warfare. A rapid increase in concentration of these dangerously reactive molecules may signal an attack.
For multivariate problems that use a single atemporal input, a machine learning method typically classifies by some sort of clustering, partitioning, or dimensional reduction [15,16,17,18]. The common inputs fall toward one cluster, or in a particular direction away from a partition, or in a particular location in a reduced space of constructed dimensions. The anomalous inputs are those that are not near the common set.
Temporal problems also require classification [28,29]. However, before classification, one must adjust for the temporal dependence of the input stream. For example, typical inputs may follow a rising trend. An anomaly must be measured against the expected input from the current trend, which requires a circuit to maintain an updated trend estimate.
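The idea of measuring an input against a continuously updated trend can be sketched as follows. This is a minimal illustration, assuming a simple exponentially smoothed level and slope; the function name, the smoothing rates `alpha` and `beta`, the threshold multiplier `k`, and the warm-up length are hypothetical choices, not values taken from this article.

```python
def detect_trend_anomalies(stream, alpha=0.3, beta=0.3, k=4.0, warmup=5):
    """Flag inputs that deviate strongly from a running linear-trend forecast.

    alpha, beta: smoothing rates for the level and slope (hypothetical values).
    k: threshold in units of the running mean absolute forecast error.
    """
    level, slope = float(stream[0]), 0.0
    err_scale = None                      # running mean absolute forecast error
    flags = [False]
    for t, x in enumerate(stream[1:], start=1):
        forecast = level + slope          # expected input from the current trend
        err = abs(x - forecast)
        is_anomaly = t >= warmup and err_scale is not None and err > k * err_scale
        flags.append(is_anomaly)
        if is_anomaly:
            level = forecast              # coast along the trend, ignore the outlier
            continue
        # Update the trend estimate only with inputs judged typical
        new_level = alpha * x + (1 - alpha) * forecast
        slope = beta * (new_level - level) + (1 - beta) * slope
        level = new_level
        err_scale = err if err_scale is None else 0.9 * err_scale + 0.1 * err
    return flags


# Rising trend with a small alternating wobble and one sudden spike at t = 30
stream = [2.0 * t + (0.5 if t % 2 == 0 else -0.5) for t in range(40)]
stream[30] += 40.0
flags = detect_trend_anomalies(stream)
```

The update lag of such a circuit is set by the smoothing rates: smaller rates smooth more noise but follow changes in the trend more slowly.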
3.2. Biological Response Times
Atemporal classification of anomalies demands a sufficiently fast circuit. The multivariate perception of input must be accomplished before the environment changes significantly. The calculations to classify must follow with sufficiently short lag to allow an appropriate response.
A neurobiological circuit would likely be quick enough to conduct atemporal classification. For cellular or physiological circuits, response speeds vary widely for different components, from slow biochemical reactions to fast receptors. If the environment changes significantly faster than a circuit’s classification inference, then such a circuit may not be able to classify the current environment as if it were an instantaneous isolated event.
Temporal classification over input sequences alters the timescale constraints. The circuit’s estimate of trends in inputs may update continuously, although with a time lag. The circuit can work well if its update lag is shorter than the timescale over which environmental trends change.
For temporal classification problems, neurobiological circuits would likely be quick enough for most challenges. Cellular and physiological circuits may sometimes be quick enough if intrinsic temporal smoothing of trend estimation provides sufficient information.
At present, we know little about the cellular and physiological response times of anomaly detection circuits. I limit the discussion to three brief comments.
First, cellular receptors can potentially respond on the timescale of their ligand on–off rates, which are often very fast. So, at the receptor level, sensory information may be able to keep up with environmental change.
Second, some cellular states depend on electric gradients, which change rapidly and can be transmitted at relatively high speed [30]. These fast components of cellular response might provide a sufficient basis for speedy circuits.
Third, slower downstream biochemical reactions might constrain circuit design. Different biochemical processes vary in their response times [31]. Altering the concentrations of reactants often triggers the fastest response. Covalent modifications of enzymes are slower than changes in reactant concentrations. Altering enzyme production or degradation rates is typically the slowest modification of biochemical circuits.
Many other factors could change biochemical response times. However, those factors are likely to be slow relative to the responses of receptors or electric gradients.
5. Multivariate Signals
The prior subsections analyzed deviations in a single dimension. Detecting anomalies often requires combining information from multiple dimensions. For example, identifying attacks on a computer network depends on the number of data bytes sent to the target computer that may be under attack, the number of data bytes returned to the potential attacker by the target computer, and the type of connection, such as email or web page.
Two widely used test datasets for computer network attack include those network measures along with several other dimensions of data [37,38]. The challenge is to classify whether a network connection to a target computer is a normal use or an attack. Is the connection pattern described by the multivariate measures of the connection typical or anomalous?
Many different machine learning methods have been applied to these benchmark datasets [39,40,41]. The next subsection begins by evaluating each data dimension independently to infer anomalies and then combining the information in the independent dimensions.
The following subsections analyze the information in the correlation between dimensions by combining the dimensions into a decision tree or by using hierarchical dimension reduction by encoders, two widely used machine learning methods that may map relatively easily to biological circuits.
I start with artificial data to illustrate the methods. I then turn to real data that contrasts typical computer network connections with anomalous connections from attack.
I use the computer data because we do not have large datasets with multivariate measurements of typical and anomalous biological inputs. The goal here is to illustrate the key principles of circuit design that may be important for understanding how natural processes shape biological responses. Anomaly detection has hardly been studied in cellular biology but seems likely to be important in some circumstances.
5.1. Independent Data Dimensions and Ensembles
Suppose an input generates n independent data dimensions. For a typical input, the value in each dimension is a random sample from a normal distribution with mean mt and standard deviation σ. Similarly, an anomalous input generates n independent values, each sampled from a normal distribution with mean ma and standard deviation σ. Assume typical inputs are usually smaller than anomalous inputs, mt < ma.
Suppose a biological circuit can average the n independent values associated with each input. Then, the standard deviation of the average value is σ/√n. The circuit classifies the average value as typical if it is less than a threshold value, τ, and anomalous if greater than the threshold.
Figure 3a illustrates how a change in threshold value alters the circuit’s success at classifying inputs. A smaller threshold causes a higher rate of classification as anomalous, which increases both the true rate of predicting anomalies and the false rate of predicting anomalies. As the threshold changes, the curve traces the tradeoffs between those different aspects of successful classification. The area under the curve (AUC) provides one way to measure the overall quality of the circuit’s ability to classify inputs.
Figure 3b shows how the circuit’s response characteristics improve for increasing levels of n, the number of data dimensions sampled by the circuit. More data dimensions provide more precise information about whether the input is typical or anomalous.
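For this averaging circuit, the tradeoff curve and its AUC can be written in closed form, because the average of n independent normal values is itself normal with standard deviation σ/√n. The sketch below is an illustration under that model; the function names and the numerical values in the usage lines are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def averaging_rates(tau, m_t, m_a, sigma, n):
    """True and false rates of predicting anomaly for threshold tau on the mean of n values."""
    sd = sigma / sqrt(n)                                # sd of the average
    true_rate = 1.0 - normal_cdf((tau - m_a) / sd)      # anomalous input called anomalous
    false_rate = 1.0 - normal_cdf((tau - m_t) / sd)     # typical input called anomalous
    return true_rate, false_rate


def averaging_auc(m_t, m_a, sigma, n):
    """AUC for two normal score distributions with equal standard deviation sigma/sqrt(n)."""
    return normal_cdf((m_a - m_t) * sqrt(n) / (sigma * sqrt(2.0)))


# Hypothetical values: means one standard deviation apart
for n in (1, 4, 16):
    print(n, round(averaging_auc(0.0, 1.0, 1.0, n), 3))
```

Sweeping `tau` in `averaging_rates` traces one tradeoff curve, and increasing `n` moves the whole curve toward the upper left, raising the AUC.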
5.2. Digital Circuits
Precise estimates for each of the n data values may be difficult for biological sensors, making the circuit sensitive to perturbations in measurement. Suppose instead that each sensor encoded its response in a binary way, which we can label as 0 or 1. In other words, each sensor converts its analog input to a digital output: a 0 response when the value is below some threshold and a 1 response above the threshold. Such analog to digital conversion can be approximated by the Hill function response described by Equation (1), which is widely observed in biology [32,33,34].
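To see how a Hill response approximates a binary sensor, the following sketch assumes the standard Hill form x^k/(K^k + x^k), which may differ in detail from the parameterization of Equation (1). The half-saturation point `K` plays the role of the sensor threshold, and the Hill coefficient `k` sets the steepness; both names are choices made for this illustration.

```python
def hill(x, K, k):
    """Standard Hill response: near 0 for x << K, 0.5 at x = K, near 1 for x >> K."""
    return x**k / (K**k + x**k)


def digital_sensor(x, tau):
    """Ideal binary sensor: 1 if the input exceeds the threshold tau, else 0."""
    return 1 if x > tau else 0


# With a larger Hill coefficient, the analog response approaches the 0/1 step
for k in (1, 4, 16):
    print(k, [round(hill(x, 2.0, k), 3) for x in (1.0, 2.0, 4.0)])
```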
With digital sensors, a circuit only has to combine the information into an overall frequency of 1 values, which are the anomaly signals. For example, if each sensor can trigger the activation of a transcription factor, then those transcription factors can bind to a gene promoter. By this process, the promoter can produce a response that grades with the overall frequency of anomaly signals from the sensors.
This digital circuit requires two threshold values. First, τ sets the point below which an individual sensor returns 0 for a typical input and above which the sensor returns 1 for an anomalous input. Second, a threshold ϕ sets the frequency of 1 responses among the individual sensors required for the circuit to return an overall classification of anomalous for a multivariate input.
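The interaction of the two thresholds can be computed exactly under the independent normal model of the prior subsection: each sensor fires with a fixed probability, so the number of 1 responses is binomial. The sketch below is illustrative. It assumes the circuit reports an anomaly when the frequency of 1 responses is at least ϕ (whether the comparison is strict is an assumption), and the parameter values are hypothetical.

```python
from math import ceil, comb, erf, sqrt


def normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def binomial_tail(n, p, k_min):
    """P(count >= k_min) for count ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


def digital_rates(tau, phi, m_t, m_a, sigma, n):
    """True and false rates of predicting anomaly for an ensemble of n binary sensors."""
    p_t = 1.0 - normal_cdf((tau - m_t) / sigma)   # sensor fires on a typical input
    p_a = 1.0 - normal_cdf((tau - m_a) / sigma)   # sensor fires on an anomalous input
    k_min = max(0, ceil(phi * n - 1e-9))          # fewest 1s giving frequency >= phi
    return binomial_tail(n, p_a, k_min), binomial_tail(n, p_t, k_min)


# Low sensor threshold with low ensemble threshold: false positives grow with n
for n in (1, 3, 9):
    true_rate, false_rate = digital_rates(-1.0, 1 / 3, 0.0, 1.0, 1.0, n)
    print(n, round(true_rate, 4), round(false_rate, 4))
```

With these hypothetical values, each sensor fires on most typical inputs, so the expected frequency of 1 responses already exceeds ϕ and adding sensors pushes the false positive rate toward one.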
Figure 4 shows how the two thresholds interact. Higher curves correspond to increasing numbers of sensors, n. In (a), with ϕ = 1/3, low thresholds for the individual sensors, τ, cause increasing n to provide relatively high false predicted anomalies (false positives). This pattern can be seen by starting with the lower curve for n = 1 and the smallest labeled threshold of 90 marked by the gold circle.
As n increases and the curves rise, the gold circle for 90 moves to the right because the rate of false predicted anomalies along the x-axis increases. The reason is that, with both a low individual sensor threshold and a low overall threshold, the expected outcome for a typical input is a false positive prediction of anomaly. As n increases, the variance declines and the expected outcome increasingly dominates.
In Figure 4b, with ϕ = 2/3, high thresholds for the individual sensors tend to cause an increase in the inputs being predicted as typical. This increase raises the rate of false typical predictions (false negatives), which corresponds to a reduced level along the y-axis for true predicted anomalies. Once again, as n increases, the variance declines and the expected outcome increasingly dominates, causing a drop in true predicted anomalies.
Figure 5 shows that analog to digital conversion by sensors decreases the maximum available information. The lower blue curve traces the smaller error rate for a fully analog circuit that averages the actual values coming into the sensors, as in Figure 3. The upper gold curve shows the rise in the error rate caused by the information lost to digital conversion, as in Figure 4.
Digital circuits reduce information but are simpler to construct and often are more robust. Small perturbations will usually not alter the 0/1 classification by a sensor. By contrast, many sources of noise will cause variability in a measured analog value.
5.3. Computer Network Anomaly Detection
In the NSL-KDD dataset of attacks on a central computer, a digital ensemble of sensors performs very well at detecting anomalous computer network characteristics associated with attacks. This dataset is widely used as a benchmark for machine learning studies of anomaly detection. The dataset contains measurements for many features of the computer network [38].
A freely available Python notebook (NSL-KDD-01-EDA-OneR: 0.929 ROC-AUC, version 3, 2022) calculated how well each of 36 features could independently classify an input as a typical network pattern or an anomalous attack [42]. The analysis used the area under the curve (AUC) to measure the resolving power of a feature, as in Figure 3. Features with high resolving power included the amount of data sent by the remote computer to the target computer, the amount of data returned to the remote computer, the kind of service request to the target, such as email or web page, and the number of recent connections by the same remote computer.
The AUC values for 22 of 36 features were greater than 0.5, which means those features had some resolving power. The top 15 AUC values ranged from 0.82 to 0.66. If each sensor’s response is encoded as 0 for typical and 1 for anomalous, then an ensemble digital analysis can be created by summing the values for the 22 resolving features. The ensemble circuit’s AUC score is 0.93, which is good.
F1 provides another measure of classification success, combining how often a positive prediction is correct and how often a positive input is correctly predicted [43]. The ensemble circuit’s F1 score is 0.9, which is also good.
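For reference, the F1 score is the harmonic mean of precision (how often a positive prediction is correct) and recall (how often a positive input is correctly predicted). A minimal sketch, with hypothetical counts that are not taken from the NSL-KDD analysis:

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall, computed from classification counts."""
    if true_pos == 0:
        return 0.0                                   # no correct positive predictions
    precision = true_pos / (true_pos + false_pos)    # correct among predicted anomalies
    recall = true_pos / (true_pos + false_neg)       # detected among actual anomalies
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 90 anomalies detected, 10 false alarms, 10 anomalies missed
print(f1_score(90, 10, 10))
```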
Reducing the number of sensors to the top 4 with individual AUC values above 0.75, the ensemble AUC score is 0.92, and the F1 score is 0.89. Thus, a small and simple ensemble of digital sensors performs well for this classic benchmark dataset.
5.4. Extra Information in Multivariate Pattern
In the ensemble digital model, each sensor passes a digital response. That response can easily be combined with the outputs of other digital sensors to create an overall circuit response. Simple biological circuits may often be built in this way.
The digital ensemble uses each dimension of the input independently. Each digital sensor takes one input value and responds as a one-step decision tree (Figure 6a). If the input is greater than some threshold, the decision tree responds one way. Otherwise, it responds the other way.
However, a multivariate pattern rarely occurs as a collection of independent dimensions. Most machine learning methods extract some of the extra multivariate information. The following sections consider two common machine learning circuits that may apply widely in biology.
6. Boosted Decision Trees
6.1. Deep Trees
A simple extension uses deeper decision trees. In Figure 6b, the input value for the first feature of the multivariate data is split at some threshold value. If the first feature is greater than its threshold, then a second split occurs based on another feature and a different threshold. If the first feature is less than its threshold, then the second split happens based on different criteria.
A deeper tree analyzes multiple inputs, allowing for decisions that include correlations between different feature dimensions of the data. A tree of depth n creates 2^n − 1 ≈ 2^n splits in the data. For example, if a system has the capacity to create 2^6 = 64 splits, then it can create 2^0 = 1 tree of depth 6, or 2^1 = 2 trees of depth 5, or 2^2 = 4 trees of depth 4, and so on.
Approximately, for 2^n splits, the system can create 2^m trees of depth n − m. Typically, machine learning applications perform better by using many trees of shallower depth rather than a small number of deep trees. Various methods exist for creating multiple trees and combining them into a single decision ensemble [44,45].
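The split budget can be enumerated directly. The sketch below lists, for a capacity of about 2^n splits, each combination of tree number and depth in the pattern of the example with 64 splits, together with the exact number of splits used; the function name is a choice made for this illustration.

```python
def tree_options(n):
    """For a budget of about 2**n splits, list (trees, depth, exact splits used)."""
    options = []
    for m in range(n):                        # m = 0 .. n-1, so depth n - m >= 1
        trees, depth = 2**m, n - m
        splits = trees * (2**depth - 1)       # each tree of depth d has 2**d - 1 splits
        options.append((trees, depth, splits))
    return options


for trees, depth, splits in tree_options(6):
    print(f"{trees:2d} trees of depth {depth}: {splits} splits")
```

The exact counts fall slightly below the budget because each tree uses 2^d − 1 rather than 2^d splits, which is the approximation made in the text.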
6.2. Boosting and Biological Design
The most widely successful method creates trees by a boosting process [46]. Boosting creates trees sequentially, starting with a single relatively small tree. Then, with an optimized first tree, the algorithm adds a second tree that corrects errors made by the first tree. The process continues adding trees in this way, each tree boosting the success achieved by the prior ensemble.
Boosting seems like a good description of how biological circuits may be designed by natural selection. Initially, a small circuit may provide some information. A second circuit may boost performance, followed by a third, and so on. Sequentially boosted improvement may be the essence of biological design.
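The sequential error-correcting logic of boosting can be sketched with the smallest possible trees. The example below is a simplified stand-in, not the xgboost algorithm: it assumes squared-error loss and trees of depth one, fits each new tree to the residual errors of the current ensemble, and adds it with a shrinkage rate. The toy data, labels, and all parameter values are hypothetical.

```python
import random


def fit_stump(X, residual):
    """Find the single split (feature, threshold) that best fits the residual."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residual) if row[j] <= thr]
            right = [r for row, r in zip(X, residual) if row[j] > thr]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    return best[1:]


def predict_stump(stump, row):
    j, thr, lm, rm = stump
    return lm if row[j] <= thr else rm


def boost(X, y, n_trees=5, rate=0.5):
    """Add stumps one at a time, each correcting the errors of the prior ensemble."""
    pred = [0.0] * len(y)
    stumps, losses = [], []
    for _ in range(n_trees):
        residual = [yi - pi for yi, pi in zip(y, pred)]   # current errors
        stump = fit_stump(X, residual)
        stumps.append(stump)
        pred = [pi + rate * predict_stump(stump, row) for pi, row in zip(pred, X)]
        losses.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))
    return stumps, losses


random.seed(0)
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(60)]
y = [1.0 if row[0] + row[1] > 1.0 else 0.0 for row in X]   # 1 = anomalous (toy label)
stumps, losses = boost(X, y)
print([round(v, 4) for v in losses])
```

Each added stump can only lower the training error in this setup, which mirrors the idea of sequentially improving an existing cascade of small modules.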
6.3. Typical vs. Anomalous Data as Self vs. Nonself
Figure 7 illustrates some of the tradeoffs in building an ensemble of boosted trees. In this case, I generated an artificial set of data with both typical and anomalous inputs by sampling from multivariate normal distributions. Each input has f feature dimensions.
For the typical data, each feature dimension has a mean value drawn randomly from a normal distribution with mean zero and standard deviation σ. I call that standard deviation the mean scale because σ determines the scale of the fluctuations among the means of the different dimensions.
The variance in each dimension is one, so the f-dimensional correlation matrix is also the covariance matrix. I generated that matrix by a random draw from a uniform distribution over all possible correlation matrices [47]. Once this distribution is set for typical data, all typical observations come from this single distribution.
For the anomalous data, I used the same process to create a new multivariate normal distribution for each observation. Each anomalous observation is a single random draw from a unique distribution. Thus, classification requires recognizing what a typical observation looks like when compared with a wide variety of anomalous data patterns rather than recognizing specific signatures of danger. This structure captures the essence of self versus nonself discrimination. Here, the typical pattern defines self, and the anomalous observations define nonself, the variety of patterns distinct from self [48,49,50,51,52,53].
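The structure of this data-generating process can be sketched as follows. The sketch is an approximation under a stated substitution: instead of a uniform draw over all correlation matrices [47], it builds a correlation matrix from random unit-length rows, which gives unit variance in each dimension but a different distribution over correlation patterns. All function names and parameter values are hypothetical.

```python
import random
from math import sqrt


def random_correlation_factor(f, rng):
    """Matrix A with unit-length rows, so A A^T has unit diagonal (a correlation matrix).

    Stand-in only: this is not a uniform draw over correlation matrices.
    """
    A = []
    for _ in range(f):
        row = [rng.gauss(0, 1) for _ in range(f)]
        norm = sqrt(sum(v * v for v in row))
        A.append([v / norm for v in row])
    return A


def make_distribution(f, mean_scale, rng):
    """Means drawn from N(0, mean_scale^2) plus a random correlation structure."""
    means = [rng.gauss(0, mean_scale) for _ in range(f)]
    return means, random_correlation_factor(f, rng)


def draw(dist, rng):
    """One observation x = mean + A z, which has covariance A A^T."""
    means, A = dist
    z = [rng.gauss(0, 1) for _ in range(len(means))]
    return [m + sum(a * zj for a, zj in zip(row, z)) for m, row in zip(means, A)]


def make_dataset(n_typical, n_anomalous, f, mean_scale, seed=0):
    rng = random.Random(seed)
    typical_dist = make_distribution(f, mean_scale, rng)     # one fixed "self" distribution
    data = [(draw(typical_dist, rng), 0) for _ in range(n_typical)]
    # Each anomalous observation comes from its own freshly drawn distribution
    data += [(draw(make_distribution(f, mean_scale, rng), rng), 1)
             for _ in range(n_anomalous)]
    return data


data = make_dataset(n_typical=90, n_anomalous=10, f=4, mean_scale=1.0)
```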
Figure 7.
Performance of boosted decision tree ensembles for classifying typical versus anomalous inputs. (a–i) Mean scale influences the amount of deviation in mean values between typical and anomalous inputs. F1 score measures the success of a circuit in classifying typical and anomalous data [43]. That score combines how often a prediction of anomaly is correct with how often an anomalous input is correctly identified. Features is the number of dimensions in the data. Trees is the number of trees in an ensemble circuit. Depth is the depth of each tree in a circuit. The text describes the methods and main conclusions for this figure. I generated one dataset with 32 features and used subsets of the feature data for the various plots so that the correlation structure of the data was consistent between the various comparisons. The boosted tree ensemble was calculated by the widely used xgboost algorithm [54]. For T trees each of depth n, the total number of splits is T(2^n − 1).
6.4. Performance
Figure 7a shows the success of a boosted decision tree circuit. For that panel, the circuit has 4 trees, each of depth 2. As the mean scale increases along the x-axis, the circuits improve at detecting anomalies. A greater mean scale implies that, for each feature, the average deviation between the mean values of the typical and anomalous observations rises. Decision trees can easily detect differences in mean values for a feature by splitting at a threshold that likely separates typical and anomalous inputs.
The different curves show the varying numbers of features available in the data. More features tend to increase the largest deviations in mean values between typical and anomalous observations. More features also tend to increase the difference in multivariate correlation structure between typical and anomalous observations because greater dimensionality increases the space of possible correlation patterns.
The other panels show the increase in classification success as the number of trees or the depth of trees increases. Deeper trees are particularly good at identifying differences in multivariate patterns caused by correlations between features. That benefit can be seen by comparing the success of the deeper trees at low values of mean scale, for which there is little information available from differences in mean values between typical and anomalous observations.
The structure of this particular problem provides a strong challenge for anomaly detection because no common pattern exists among the anomalous inputs. Additionally, the generating process for the observations creates wide scatter among both typical and anomalous inputs. Nonetheless, the boosted tree ensembles significantly discriminate between typical and anomalous inputs.
6.5. Boosted Trees and Biological Circuit Evolution
Each node of a tree is simply a binary split based on input. Thus, any biological circuit that expresses the commonly observed Hill response could implement a node of a decision tree [32,33,34]. Combining information from multiple trees is also likely to be something that simple biological systems could achieve.
As I mentioned earlier, the sequential process of building boosted trees likely matches the natural tendency for evolutionary processes to create solutions by adding improvements to an initial design. Thus, the simple way in which tree-like decision nodes can be implemented biologically and the sequential process of boosting make boosted trees an excellent model for cellular and neural circuits that solve challenges of classification and decision.
7. Encoders and Internal Models
7.1. Dimensional Reduction
Encoders reduce dimension by compressing inputs into informative components (for background, see Box 2 of Frank [19]). Dimensional reduction by encoding can be an effective way to identify anomalous environmental conditions. A common autoencoder method first compresses the f features of an input to a representation in a lower-dimensional space. It then expands that representation back to the original f dimensions, attempting to match the original input closely.
An autoencoder uses patterns in the data [41]. For example, suppose the second feature tends to follow a particular function of the third and fourth features. In that case, the compression method can discard the second feature and recreate that feature during decompression. When a good autoencoder compresses and then decompresses an input, the final decompressed value tends to be close to the original input.
If anomalous inputs lack some of the patterns in typical inputs, an autoencoder built for typical inputs will often distort an anomalous input during the encoding–decoding sequence. The output for an anomalous input will often be farther from the original input than usually happens for typical inputs. Thus, the distance between the input and the output of an autoencoder can be used to classify inputs as typical or anomalous.
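The use of reconstruction distance can be illustrated with the simplest linear case. The sketch below is a stand-in for an autoencoder, not a trained neural network: it assumes two correlated features, encodes each input as its coordinate along the principal axis of the typical data, decodes back to two dimensions, and measures the distance between input and output. The data and all names are hypothetical.

```python
import random
from math import atan2, cos, hypot, sin


def fit_principal_axis(points):
    """Center and unit direction of largest variance for 2-dimensional data."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    theta = 0.5 * atan2(2 * sxy, sxx - syy)        # angle of the major axis
    return (mx, my), (cos(theta), sin(theta))


def reconstruction_error(point, center, axis):
    """Distance between an input and its encode-then-decode reconstruction."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    code = dx * axis[0] + dy * axis[1]             # encode: 2 dimensions -> 1
    rx = center[0] + code * axis[0]                # decode: 1 dimension -> 2
    ry = center[1] + code * axis[1]
    return hypot(point[0] - rx, point[1] - ry)


rng = random.Random(0)
# Typical inputs: the second feature closely follows the first
typical = []
for _ in range(200):
    x = rng.uniform(-2, 2)
    typical.append((x, x + 0.1 * rng.gauss(0, 1)))
center, axis = fit_principal_axis(typical)

anomaly = (1.0, -1.0)                              # breaks the typical correlation
typical_max = max(reconstruction_error(p, center, axis) for p in typical)
anomaly_err = reconstruction_error(anomaly, center, axis)
```

An input that breaks the learned relation between features cannot be recreated from the compressed code, so its reconstruction distance stands out from the typical range.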
Using a sequence of compression steps often creates a more effective encoding. If, for example, the initial data have 2^n features, a first compression stage may reduce to 2^(n−1) dimensions, followed by compression to 2^(n−2) dimensions, and so on. Sequential compression helps to create an internal model of the data [18,55,56,57]. When sequentially compressing images of faces, early steps may focus on facets such as eyes, ears, nose, and mouth. Later steps may consider relations between those parts, providing an internal model of how a typical face tends to look [58,59,60].
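The size of such a sequentially halving encoder can be counted directly. The sketch below assumes fully connected layers with one weight per connection and one bias per output unit. That assumption reproduces both the 10 parameters quoted for the 4-to-2 encoder in the caption of Figure 8 and the closed form (2f + 5)(f − 1)/3 given in the caption of Figure 9 for compression to a single dimension. The function names are choices made for this illustration.

```python
def halving_encoder_layers(f, out_dim=1):
    """Layer sizes for an encoder that halves the dimension at each stage.

    Assumes f and out_dim are powers of two with out_dim <= f.
    """
    sizes = [f]
    while sizes[-1] > out_dim:
        sizes.append(sizes[-1] // 2)
    return sizes


def parameter_count(sizes):
    """One weight per connection plus one bias per output unit, summed over layers."""
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))


for f in (2, 4, 8, 16, 32):
    layers = halving_encoder_layers(f)
    print(f, layers, parameter_count(layers), (2 * f + 5) * (f - 1) // 3)
```

Even for 32 input features, the full halving encoder needs only a few hundred parameters, far fewer than typical machine learning networks.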
7.2. Anomaly Detection
For this article, we are particularly interested in the simplest effective circuits. A full autoencoder requires both encoding compression and decoding decompression. A simpler approach uses only the encoding step. We convert the f features in the input to a smaller number of compressed dimensions. If we design an encoder that tends to create a large distance between typical and anomalous inputs in the lower-dimensional representation, then we can use that distance to detect anomalies.
Figure 8 illustrates how an encoder separates typical and anomalous inputs. In this example, the four input dimensions were reduced to two output dimensions using a single-layer neural network. That small network separated typical and anomalous observations with high accuracy.
Figure 8.
Encoder model that reduces 4-dimensional inputs to 2 dimensions, separating typical and anomalous observations. I used the same methods to generate the data as for boosted trees, described previously. Of the initial 100,000 observations, 10% are anomalies, and the rest are typical. I randomly split the data into a training set composed of 70% of the observations and the remainder in the test set to evaluate the fitted model. This plot shows a random subset of the test data with approximately 2700 typical observations and 300 anomalous observations. Compared with Figure 9, the mean scale value here is 1.6, and the number of features is 4. I used a distinct dataset for this figure to provide a visualization that shows the separation between typical and anomalous points more clearly. In this case, the F1 score is 0.96, corresponding to relatively few misclassified points. The model encoded the 4 input dimensions to 2 output dimensions with a single layer of a neural network using 10 parameters.
Figure 9.
Encoder model to separate typical from anomalous inputs. The labels on each curve describe the number of features, f, in the data. I used the same methods to generate the data as for boosted trees, described previously. (a) I generated three separate input datasets and calculated F1 scores for each to compensate for peculiarities of any particular dataset. I then averaged the three values for each mean scale by feature combination. The overall pattern and magnitudes for each separate dataset were similar. For each dataset, I generated 32 features. I then used the first f features in the set for each curve. If we compress from inputs with f = 2^n feature dimensions to a single output dimension, then a full encoder model has (2f + 5)(f − 1)/3 parameters when n is an integer. (b) I began calculation for each point using all 32 features. I then iteratively deleted the single feature that provided the least information, measured by the smallest reduction in F1 when deleting that feature. I continued until the specified number of features for a particular curve remained. This iterative deletion method provided a better set of features than simply picking the first f features as in (a).
7.3. Factors Influencing Circuit Accuracy
Figure 9a compares an encoder’s classification efficacy under different conditions. The F1 score measures classification efficacy, combining how often a positive prediction is correct and how often a positive input is correctly predicted [
43]. The mean scale influences the amount of variation between typical and anomalous mean values in each dimension of the data.
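For concreteness, the F1 score can be computed from labels and predictions as in the short sketch below. Treating anomalies as the positive class (label 1) is an assumption made here, and the example labels are invented.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # anomalies correctly flagged
    fp = sum(1 for t, p in pairs if not t and p)    # typical inputs flagged
    fn = sum(1 for t, p in pairs if t and not p)    # anomalies missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # how often a positive prediction is correct
    recall = tp / (tp + fn)     # how often a positive input is predicted
    return 2 * precision * recall / (precision + recall)

# Invented example: 4 anomalies among 10 inputs; 3 are caught, 1 is missed,
# and 1 typical input is flagged by mistake.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(f1_score(y_true, y_pred))  # precision = recall = 0.75, so F1 = 0.75
```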
The full data initially contained f = 32 features. I then calculated F1 scores by using only the first f = 4, 8, 16 feature dimensions. Each line in the figure is labeled with the number of features used, f = 2^n. For this figure, an encoder reduces dimensionality from the f = 2^n inputs to 2^0 = 1 output, using n layers in the neural network encoder.
Three conclusions follow from this figure. First, between typical and anomalous inputs, larger differences in mean values for each dimension make it easier to detect anomalies, shown in the figure as the mean scale increases along the x-axis.
Second, sampling more features of the data improves classification. The improvement occurs primarily for small values of the mean scale, in which mean differences provide limited information. In those situations, a classifier can succeed when it is able to infer distinctions between typical and anomalous inputs in the correlation pattern among the dimensions. In this example, raising the number of features enhances the information about correlational pattern, providing increasingly accurate classification.
The third conclusion is that, given a sufficient number of features, an encoder can achieve nearly perfect classification for these input data. In this case, an encoder using all 32 features of the data made very few classification errors.
The encoder for f = 32 features achieved high success by optimizing the 713 parameters in its encoding network. For f = 4, 8, 16, the circuits required 13, 49, and 185 parameters.
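The parameter counts above can be checked with a short calculation. The sketch assumes that each layer halves the width and is fully connected with one bias per output unit; that assumption reproduces the stated counts and the closed form (2f + 5)(f − 1)/3, although the layer details are not spelled out in this section. Function names are illustrative.

```python
def encoder_param_count(f):
    """Count weights and biases for layers f -> f/2 -> ... -> 1."""
    if f < 1 or f & (f - 1):
        raise ValueError("f must be a power of 2")
    total = 0
    width = f
    while width > 1:
        out = width // 2
        total += width * out + out  # weight matrix plus one bias per output
        width = out
    return total

def closed_form(f):
    """Closed-form count (2f + 5)(f - 1)/3 for f a power of 2."""
    return (2 * f + 5) * (f - 1) // 3

for f in (4, 8, 16, 32):
    print(f, encoder_param_count(f), closed_form(f))
# The layer-by-layer sums are 13, 49, 185, and 713, matching the closed form.
```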
7.4. Simplifying Circuits
Figure 9b shows that, for a given performance level measured by F1, we can find simpler circuits with the same performance. In that plot, the calculation of each point began with all 32 features. I then iteratively removed one feature at a time, dropping the feature that provided the least amount of information, measured by the smallest decline in F1. I continued dropping features in this way, providing an F1 measure for each reduced feature count at each mean scale level. The plot shows curves for particular f values.
Choosing the best f features of the full 32 in the data provides a better F1 score than taking the first f features in the data, as expected. Put another way, for a given F1 level of performance, we can use a smaller circuit if we select the best features rather than using a fixed feature set. The amount by which a circuit can be reduced for a given F1 score depends on the particular data structure, as shown by the plots.
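The iterative deletion procedure amounts to greedy backward elimination, sketched below in generic form. The function name, the `score_fn` argument, and the toy additive score are hypothetical; in the analysis described here, the score would be the F1 of an encoder refitted on each candidate feature subset.

```python
def backward_eliminate(features, score_fn, n_keep):
    """Repeatedly drop the feature whose removal lowers the score the least."""
    kept = list(features)
    while len(kept) > n_keep:
        # Candidate subsets, each formed by deleting one remaining feature.
        candidates = [[g for g in kept if g != f] for f in kept]
        # Keeping the highest-scoring subset drops the least informative feature.
        kept = max(candidates, key=score_fn)
    return kept

# Toy stand-in for F1: each feature contributes a fixed, invented amount.
info = {0: 0.10, 1: 0.50, 2: 0.05, 3: 0.30}

def toy_score(subset):
    return sum(info[f] for f in subset)

print(backward_eliminate([0, 1, 2, 3], toy_score, n_keep=2))  # [1, 3]
```

With a real classifier, each step requires refitting the model once per remaining feature, so the cost grows roughly with the square of the initial number of features.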
We could further reduce the number of parameters in a circuit by imposing a cost on each parameter. Such a cost drives a parameter toward zero when that parameter adds relatively little improvement in performance. We then obtain a simplified circuit by pruning all parameters near zero.
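One common way to impose such a cost is an L1 penalty on the absolute value of each parameter. The sketch below applies it to a small linear model rather than an encoder, using proximal gradient descent. The choice of an L1 cost, the synthetic data, the penalty weight `lam`, and the pruning tolerance are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: only 2 of the 5 parameters actually matter.
n, p = 200, 5
X = rng.normal(size=(n, p))
true_w = np.array([2.0, 0.0, -3.0, 0.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=n)

lam = 0.1                              # cost per unit of absolute parameter value
step = n / np.linalg.norm(X, 2) ** 2   # step size from the largest singular value

w = np.zeros(p)
for _ in range(500):
    grad = X.T @ (X @ w - y) / n       # gradient of the mean squared error / 2
    w = w - step * grad
    # Soft threshold: the L1 cost pulls each parameter toward zero and sets
    # it exactly to zero when its contribution is too small to pay the cost.
    w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)

# Prune: keep only parameters that stayed away from zero.
kept = np.abs(w) > 1e-6
print(np.round(w, 2), kept)
```

The penalty also shrinks the retained parameters slightly, so a pruned circuit is often refitted without the cost afterward.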
Overall, relatively small encoder circuits can achieve good classification for some types of data.
8. Discussion
I have focused on anomalies as unusual observations, anything that differs from what is typical. Detection does not depend on specific anomalous patterns or danger signals. Instead, a system creates a model of a typical input and infers when an input differs from that internal model, something like “That’s an unusual smell” or “I’ve never seen that before”.
Sensory or neural adaptation provides a simple example. Many biological circuits adjust their baseline by averaging over recent inputs. That baseline allows the circuit to measure deviations from what has recently been typical [
61,
62,
63,
64,
65]. I presented simple circuits in Equations (4) and (5) that adapt to recent trends. An ensemble of such circuits could classify multivariate inputs.
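As a generic illustration of baseline adaptation, the sketch below tracks an exponentially weighted average of recent inputs and reports each input's deviation from it. This is a standard running-average form and is not claimed to match Equations (4) and (5); the function name, the `rate` parameter, and the example signal are assumptions.

```python
def adaptive_deviations(inputs, rate=0.1):
    """Deviation of each input from a baseline that averages recent inputs."""
    baseline = inputs[0]
    deviations = []
    for x in inputs:
        deviations.append(x - baseline)     # compare with what was recently typical
        baseline += rate * (x - baseline)   # then let the baseline adapt
    return deviations

# Invented signal: steady input of 1.0, a single spike to 5.0, then steady again.
signal = [1.0] * 50 + [5.0] + [1.0] * 10
devs = adaptive_deviations(signal)
print(devs[49], devs[50], devs[51])  # no deviation, large deviation, small rebound
```

A threshold on the absolute deviation would turn this into a univariate anomaly detector, and several such units could be combined for multivariate inputs.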
Self versus nonself recognition occurs widely throughout biology [
66,
67,
68,
69,
70,
71,
72,
73,
74,
75]. In some cases, nonself is recognized by direct pattern recognition, which does not require the more challenging kinds of circuits discussed in this article. In other cases, self recognition is more complex and not fully understood [
12]. It seems that systems sometimes recognize what is self and classify as anomalous those observations that do not fit the self pattern, potentially sharing properties with the machine-learning circuits discussed in this article.
The human hippocampus appears to recognize novelty in certain contexts [
7,
8,
9]. Further studies suggest that memory creates a model of what is common. The system classifies inputs as novel or unusual when they deviate significantly from expectations [
76,
77]. With regard to the analyses in this article, some sort of dimensional reduction likely encodes the internal model.
Cellular and physiological systems would likely gain from anomaly detection. The models in this article suggest the kinds of small circuits that could work within these constrained biological systems.